Neural Networks explained with spreadsheets, 2: Gradients for a single neuron

Gradients for a single neuron

Hi all, here’s the second in my series on neural networks / machine learning / AI from scratch. In the previous article (please read it first!), I explained how a single neuron works. In this article, I’ll explain how you can determine the ‘gradients’ of that neuron, in other words how much effect the weight and bias have on the final ‘loss’, using some high-school calculus. This is a prerequisite for training, which I’ll cover later.

Spreadsheet

I recommend opening this spreadsheet in a separate tab, and viewing it as you read this post which explains the maths: Single neuron gradients.

In case the linked spreadsheet is lost to posterity, here it is in slightly less well-formatted form (note: for brevity’s sake, I’ve shortened references such as B2 to simply ‘B’ when referring to a column in the same row):

	A	B	C	D	E	F	G	H	I	J
1		Input	Weight	Bias	Net	Output	Target	Error	Loss
2	Neuron maths:	0.4	0.5	0.6	0.8 (B*C+D)	0.664 (tanh(E))	0.7	-0.035963 (F-G)	0.0006467 (H^2 / 2)
3	Real local gradients:	0.5 (C2)	0.4 (B2)	1	0.5591 (1-F2^2)	-0.036 (H2)
4	Real global gradients:	-0.0101 (B3*E)	-0.0080 (C3*E)	-0.0201 (E)	-0.0201 (E3*F)	-0.036 (F3)
5										Faux gradient
6	Faux gradient of ‘output’:					0.66414 (F2+Tiny)	0.7	-0.035863 (F-G)	0.0006431 (H^2 / 2)	-0.0359 ((I - I2)/Tiny)
7	Faux gradient of ‘net’:				0.8001 (E2+Tiny)	0.66409 (tanh(E))	0.7	-0.035907 (F-G)	0.0006447 (H^2 / 2)	-0.0201 ((I - I2)/Tiny)
8	Faux gradient of ‘bias’:	0.4	0.5	0.6001 (D2+Tiny)	0.8001 (B*C+D)	0.66409 (tanh(E))	0.7	-0.035907 (F-G)	0.0006447 (H^2 / 2)	-0.0201 ((I - I2)/Tiny)
9	Faux gradient of ‘weight’:	0.4	0.5001 (C2+Tiny)	0.6	0.80004 (B*C+D)	0.66406 (tanh(E))	0.7	-0.035941 (F-G)	0.0006459 (H^2 / 2)	-0.0080 ((I - I2)/Tiny)
10	Faux gradient of ‘input’:	0.4001 (B2+Tiny)	0.5	0.6	0.80005 (B*C+D)	0.66406 (tanh(E))	0.7	-0.035935 (F-G)	0.0006457 (H^2 / 2)	-0.0100 ((I - I2)/Tiny)
Tiny	0.0001	Moved down here to help with readability

What is the gradient?

The gradient tells you how quickly the result of an equation changes as you change its input. It is also known as the slope, derivative, or (for things that move) velocity.

For a simple example, consider tides in a river mouth:

At high tide (maximum position), the water is still (0 velocity).
Then, half-way from high to low tide (0 position), the water is rushing out, so its height is falling fastest (maximum negative velocity). This is the time when the waves are biggest and my friend almost drowned the other day on his jet ski, but that’s a story for another day!
Then, at low tide (minimum position), the water is still again (0 velocity).
Then, half-way from low to high tide (0 position again), the water is rushing in, so its height is rising fastest (maximum positive velocity).

In this analogy, the height of the water is the position (like the values for the weights, bias, net, output, or loss), and the velocity of the water is the gradient (or derivative, or slope). Figuring out that gradient is what this article is all about.

For a more thorough explanation of gradients, check out Wikipedia.

Why do we want to know the gradients?

We want the gradients of a neuron’s weight(s) and bias because they tell us whether to nudge each value up a bit, down a bit, or leave it as-is, in order to get an output that’s closer to the target during training.

Faking a gradient

You can fake a gradient by comparing the result of an equation vs the result when adding a tiny amount to the input. These faux gradients are helpful for verifying our calculus later.

Here’s the general way to fake a gradient:

Faux gradient of f(x) = ( f(x + tiny) - f(x) ) / tiny
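
As a minimal sketch of that formula in Rust (the function name faux_gradient and the x^2 example are my own illustration, not something from the spreadsheet):

```rust
/// Faux (finite-difference) gradient of `f` at `x`.
/// `tiny` is the small nudge; the spreadsheet uses 0.0001.
fn faux_gradient(f: impl Fn(f64) -> f64, x: f64, tiny: f64) -> f64 {
    (f(x + tiny) - f(x)) / tiny
}

fn main() {
    // Hypothetical example: f(x) = x^2, whose real gradient is 2x.
    let square = |x: f64| x * x;
    let faux = faux_gradient(square, 3.0, 0.0001);
    // The result should land very close to the real gradient of 6.
    println!("Faux gradient of x^2 at x = 3: {:.4}", faux);
}
```

The smaller the tiny value, the closer the faux gradient gets to the real one, up to the point where floating-point rounding starts to get in the way.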

To make it more specific to our neuron:

Faux gradient of how weight affects output = (
    tanh(input * (weight + tiny) + bias) -
    tanh(input * weight + bias)
) / tiny

Or the full kahuna on the loss function:

Faux gradient of how bias affects loss = (
    (tanh(input * weight + (bias + tiny)) - target)^2 / 2 
    -
    (tanh(input * weight + bias) - target)^2 / 2
) / tiny
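
The same thing as a rough Rust sketch, using the values from row 2 of the spreadsheet (the function names are just ones I picked for illustration):

```rust
/// Loss of the single neuron: (tanh(input * weight + bias) - target)^2 / 2.
fn loss(input: f64, weight: f64, bias: f64, target: f64) -> f64 {
    let error = (input * weight + bias).tanh() - target;
    error * error / 2.0
}

/// Faux gradient of how bias affects loss: nudge only the bias by `tiny`.
fn faux_bias_gradient(input: f64, weight: f64, bias: f64, target: f64, tiny: f64) -> f64 {
    let nudged = loss(input, weight, bias + tiny, target);
    let original = loss(input, weight, bias, target);
    (nudged - original) / tiny
}

fn main() {
    // Input 0.4, weight 0.5, bias 0.6, target 0.7, tiny 0.0001.
    let gradient = faux_bias_gradient(0.4, 0.5, 0.6, 0.7, 0.0001);
    // Compare with the 'Faux gradient of bias' row in the spreadsheet.
    println!("Faux bias gradient: {:.4}", gradient);
}
```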

Please note that the loss function changed vs the previous article (it now has a / 2) - this is to make the calculus simpler.

You can look at rows 6 through 10 in the spreadsheet to see how these faux gradients are calculated. In columns B to I, various things have the tiny value added to them, to see how this affects the final ‘loss’. For instance, on row 6, you can see I’m adding the tiny value to the output, then feeding that through to the loss function, and doing the (loss with tiny - loss without tiny) / tiny to calculate the faux gradient. The rest of these faux gradients are similar.
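
Rows 6 and 7 nudge an intermediate value rather than one of the neuron’s own settings. Here’s a small sketch of those two rows, assuming the same numbers as the spreadsheet (the function names are mine):

```rust
const TINY: f64 = 0.0001;

/// The stages after 'output': error, then loss.
fn loss_from_output(output: f64, target: f64) -> f64 {
    let error = output - target;
    error * error / 2.0
}

/// The stages after 'net': output (tanh), error, then loss.
fn loss_from_net(net: f64, target: f64) -> f64 {
    loss_from_output(net.tanh(), target)
}

/// Row 6: add tiny to the output, feed it through to the loss.
fn faux_output_gradient(net: f64, target: f64) -> f64 {
    let output = net.tanh();
    (loss_from_output(output + TINY, target) - loss_from_output(output, target)) / TINY
}

/// Row 7: add tiny to the net, feed it through to the loss.
fn faux_net_gradient(net: f64, target: f64) -> f64 {
    (loss_from_net(net + TINY, target) - loss_from_net(net, target)) / TINY
}

fn main() {
    let net = 0.4 * 0.5 + 0.6; // input * weight + bias
    let target = 0.7;
    println!("Faux gradient of output: {:.4}", faux_output_gradient(net, target));
    println!("Faux gradient of net:    {:.4}", faux_net_gradient(net, target));
}
```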

Real gradients with calculus

Let’s use calculus to calculate the real gradients. Firstly we need to calculate the ‘local’ gradients. See row 3 in the spreadsheet as you follow along.

What is a local gradient? Since all our calculations are performed in stages (eg net > output > error > loss), a local gradient is how much impact changes in one stage have on the next stage.

A better maths teacher than I would be able to explain how we arrive at the following, but here are the formulas below:

Local gradient equations

(Note when I say ‘the gradient of Y with respect to X’ it means that X is the input/earlier stage, Y is the output/later stage, and it roughly means ‘if you nudge X, what impact will that have on Y?’.)

Input (gradient of Net with respect to Input) = Weight (see B3)
Weight (gradient of Net with respect to Weight) = Input (see C3)
Bias (gradient of Net with respect to Bias) = 1 (see D3)
Net (gradient of Output with respect to Net) = 1 - Output^2 (see E3)
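Output (gradient of Error with respect to Output) = 1 (because Error = Output - Target)
Error (gradient of Loss with respect to Error) = Error (this is where the / 2 in our loss helps)
Multiplying those last two together gives the gradient of Loss with respect to Output = 1 * Error = Error, which is the value in F3.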
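
For the curious, here is a sketch of where those local gradients come from, using only the stage definitions already given (Net = Input * Weight + Bias, Output = tanh(Net), Error = Output - Target, Loss = Error^2 / 2) and standard derivative rules:

```latex
\begin{align*}
\frac{\partial\,\text{Net}}{\partial\,\text{Input}}
  &= \frac{\partial}{\partial\,\text{Input}}
     \left(\text{Input} \cdot \text{Weight} + \text{Bias}\right) = \text{Weight} \\
\frac{\partial\,\text{Net}}{\partial\,\text{Weight}}
  &= \frac{\partial}{\partial\,\text{Weight}}
     \left(\text{Input} \cdot \text{Weight} + \text{Bias}\right) = \text{Input} \\
\frac{\partial\,\text{Net}}{\partial\,\text{Bias}}
  &= \frac{\partial}{\partial\,\text{Bias}}
     \left(\text{Input} \cdot \text{Weight} + \text{Bias}\right) = 1 \\
\frac{\partial\,\text{Output}}{\partial\,\text{Net}}
  &= \frac{d}{d\,\text{Net}} \tanh(\text{Net})
   = 1 - \tanh^2(\text{Net}) = 1 - \text{Output}^2 \\
\frac{\partial\,\text{Error}}{\partial\,\text{Output}}
  &= \frac{\partial}{\partial\,\text{Output}}
     \left(\text{Output} - \text{Target}\right) = 1 \\
\frac{\partial\,\text{Loss}}{\partial\,\text{Error}}
  &= \frac{d}{d\,\text{Error}} \left(\frac{\text{Error}^2}{2}\right) = \text{Error}
\end{align*}
```

The last two multiply together to give the gradient of Loss with respect to Output, which is simply Error (the value in F3).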

Global gradients

Next we need to combine the gradients using the calculus ‘chain rule’, so that we can get the impacts of each variable on the loss.

These are calculated in reverse order (this is why it is called backpropagation) because most of them rely on the gradient of the stage that follows.

Output (gradient of Loss with respect to Output) = Error (See F4)
Net (gradient of Loss with respect to Net) = (1 - Output^2) * Output global gradient (See E4)
Bias (gradient of Loss with respect to Bias) = Net global gradient (See D4)
Weight (gradient of Loss with respect to Weight) = Input * Net global gradient (See C4)
Input (gradient of Loss with respect to Input) = Weight * Net global gradient (See B4)
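
Plugging in the spreadsheet’s numbers (rounded to four decimal places, so the last digit can differ slightly from the cells), the chain rule works out roughly like this:

```latex
\begin{align*}
\frac{\partial\,\text{Loss}}{\partial\,\text{Output}}
  &= \text{Error} \approx -0.0360 \\
\frac{\partial\,\text{Loss}}{\partial\,\text{Net}}
  &= \left(1 - \text{Output}^2\right) \cdot
     \frac{\partial\,\text{Loss}}{\partial\,\text{Output}}
   \approx 0.5591 \times (-0.0360) \approx -0.0201 \\
\frac{\partial\,\text{Loss}}{\partial\,\text{Bias}}
  &= 1 \cdot \frac{\partial\,\text{Loss}}{\partial\,\text{Net}}
   \approx -0.0201 \\
\frac{\partial\,\text{Loss}}{\partial\,\text{Weight}}
  &= \text{Input} \cdot \frac{\partial\,\text{Loss}}{\partial\,\text{Net}}
   \approx 0.4 \times (-0.0201) \approx -0.0080 \\
\frac{\partial\,\text{Loss}}{\partial\,\text{Input}}
  &= \text{Weight} \cdot \frac{\partial\,\text{Loss}}{\partial\,\text{Net}}
   \approx 0.5 \times (-0.0201) \approx -0.0101
\end{align*}
```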

You may like to compare these with the respective faux gradients and see that they are (roughly) the same.
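
If you’d rather do that comparison in code, here’s a minimal sketch that computes both the real (chain rule) and faux gradients side by side. The function names and the tuple ordering (input, weight, bias) are my own choices for this illustration:

```rust
const TINY: f64 = 0.0001;

/// Loss of the single neuron: (tanh(input * weight + bias) - target)^2 / 2.
fn loss(input: f64, weight: f64, bias: f64, target: f64) -> f64 {
    let error = (input * weight + bias).tanh() - target;
    error * error / 2.0
}

/// Real gradients of the loss via the chain rule, as (input, weight, bias).
fn real_gradients(input: f64, weight: f64, bias: f64, target: f64) -> (f64, f64, f64) {
    let output = (input * weight + bias).tanh();
    let output_gradient = output - target; // Loss w.r.t. Output = Error
    let net_gradient = (1.0 - output * output) * output_gradient;
    (weight * net_gradient, input * net_gradient, net_gradient)
}

/// Faux gradients of the loss, nudging one value at a time by TINY.
fn faux_gradients(input: f64, weight: f64, bias: f64, target: f64) -> (f64, f64, f64) {
    let base = loss(input, weight, bias, target);
    (
        (loss(input + TINY, weight, bias, target) - base) / TINY,
        (loss(input, weight + TINY, bias, target) - base) / TINY,
        (loss(input, weight, bias + TINY, target) - base) / TINY,
    )
}

fn main() {
    let real = real_gradients(0.4, 0.5, 0.6, 0.7);
    let faux = faux_gradients(0.4, 0.5, 0.6, 0.7);
    println!("Input:  real {:.4}, faux {:.4}", real.0, faux.0);
    println!("Weight: real {:.4}, faux {:.4}", real.1, faux.1);
    println!("Bias:   real {:.4}, faux {:.4}", real.2, faux.2);
}
```

The faux values differ from the real ones by a small amount that shrinks as the tiny value shrinks, which is why they only match ‘roughly’.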

And there you have it, you have the gradients for a single neuron. Next I’ll explain how to use these gradients for training!

Unnecessary Rust implementation

Just for the hell of it, here’s an implementation in Rust:

struct Neuron {
    input: f32,
    weight: f32,
    bias: f32,
    target: f32,
}

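impl Neuron {
    // Forward pass, one stage at a time: net > output > error > loss.
    fn net(&self) -> f32 {
        self.input * self.weight + self.bias
    }
    fn output(&self) -> f32 {
        self.net().tanh()
    }
    fn error(&self) -> f32 {
        self.output() - self.target
    }
    // Not needed for the gradients themselves, but handy for checking faux gradients.
    #[allow(dead_code)]
    fn loss(&self) -> f32 {
        let e = self.error();
        e * e / 2.
    }
    // Global gradients (all with respect to the loss), in reverse order.
    // Loss w.r.t. output: the error itself, thanks to the / 2 in the loss.
    fn output_gradient(&self) -> f32 {
        self.error()
    }
    // Loss w.r.t. net: local tanh gradient (1 - output^2) times the output gradient.
    fn net_gradient(&self) -> f32 {
        let o = self.output();
        let net_local_derivative = 1. - o * o;
        net_local_derivative * self.output_gradient()
    }
    // Loss w.r.t. bias: the local gradient is 1, so it equals the net gradient.
    fn bias_gradient(&self) -> f32 {
        self.net_gradient()
    }
    // Loss w.r.t. weight: the local gradient is the input.
    fn weight_gradient(&self) -> f32 {
        self.input * self.net_gradient()
    }
}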

fn main() {
    let neuron = Neuron {
        input: 0.4,
        weight: 0.5,
        bias: 0.6,
        target: 0.7,
    };
    println!("Weight gradient: {:.4}", neuron.weight_gradient());
    println!("Bias gradient: {:.4}", neuron.bias_gradient());
}

Which outputs:

Weight gradient: -0.0080
Bias gradient: -0.0201

Which matches the spreadsheet nicely!

Thanks for reading, hope you found this helpful, at least a tiny bit, God bless!

Photo by Chinnu Indrakumar on Unsplash

Thanks for reading! And if you want to get in touch, I'd love to hear from you: chris.hulbert at gmail.

Chris Hulbert

(Comp Sci, Hons - UTS)

Software Developer (Freelancer / Contractor) in Australia.

I have worked at places such as Google, Cochlear, CommBank, Assembly Payments, News Corp, Fox Sports, NineMSN, FetchTV, Coles, Woolworths, Trust Bank, and Westpac, among others. If you're looking for help developing an iOS app, drop me a line!
