
Training layers of neurons

Hi all, here’s the fourth in my series on neural networks / machine learning / AI from scratch. In the previous articles (please read them first!), I explained how a single neuron works, then how to calculate the gradient of its weight and bias, and how you can use that gradient to train the neuron. In this article, I’ll explain how to determine the gradients when you have many layers of many neurons, and how to use those gradients to train the neural net.

In my previous articles in this series, I used spreadsheets to make the maths easier to follow. Unfortunately I don’t think I can demonstrate this topic in a spreadsheet without it getting out of hand, so I’ll keep it in code. I hope you can still follow along!

Data model

Pardon my pseudocode:

class Net {
    layers: [Layer]
}

class Layer {
    neurons: [Neuron]
}

class Neuron {
    value: float
    bias: float
    weights: [float]
    activation_gradient: float
}

Explanation:

  • Layers: The neural net is made up of multiple layers. The first one in the array is the input layer, the last one is the output layer.
  • Neurons: The neurons that make up a layer. Each layer will typically have different numbers of neurons.
  • Value: The output of each neuron.
  • Bias: The bias of each neuron.
  • Weights: Input weights for each neuron. This array’s size will be the number of inputs to this layer. For the first layer, this will be the number of inputs (aka features) to the neural net. For subsequent layers, this will be the count of neurons in the previous layer.
  • Activation Gradient: The gradient of each neuron’s activation, chained back from the later layers via the magic of calculus. This is also equal to the gradient of the bias. Maybe reading my second article in this series will help you understand what this gradient means :)
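
If pseudocode isn’t your thing, here’s the same data model as actual Rust (it matches the structs in the demo further down), plus a hypothetical Layer::new helper that the demo doesn’t use, just to show the random initialisation in one place:

use rand::Rng;

struct Net { layers: Vec<Layer> }
struct Layer { neurons: Vec<Neuron> }
struct Neuron {
    value: f64,
    bias: f64,
    weights: Vec<f64>,
    activation_gradient: f64,
}

impl Layer {
    // Build a layer of `neuron_count` neurons, each with `inputs_per_neuron`
    // randomly-initialised weights in the range -1 to +1.
    fn new(neuron_count: usize, inputs_per_neuron: usize) -> Layer {
        let mut rng = rand::thread_rng();
        let mut neurons = Vec::new();
        for _ in 0..neuron_count {
            let mut weights = Vec::new();
            for _ in 0..inputs_per_neuron {
                weights.push(rng.gen_range(-1. .. 1.));
            }
            neurons.push(Neuron {
                value: 0.,
                bias: rng.gen_range(-1. .. 1.),
                weights,
                activation_gradient: 0.,
            });
        }
        Layer { neurons }
    }
}

With a helper like that, the demo’s net further down could be built as:

let net = Net {
    layers: vec![
        Layer::new(3, 2), // First layer: 3 neurons, 2 weights each (width and height).
        Layer::new(3, 3), // Second layer: 3 neurons, one weight per previous-layer neuron.
        Layer::new(2, 3), // Last layer: 2 neurons (area and circumference).
    ],
};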

High(ish) level explanation

What we’re trying to achieve here is to use calculus to determine the ‘gradient’ of every bias and every weight in this neural net. In order to do this, we have to ‘back propagate’ these gradients from the back to the front of the ‘layers’ array.

Concretely: if we had 3 layers, we’d figure out the gradients of the activation functions of layers[2], then use those values to calculate the gradients of layers[1], and then layers[0].

Once we have the gradients of the activation functions for each neuron in each layer, it’s easy to figure out the gradient of the weights and bias for each neuron.

And, as demonstrated in my previous article, once we have the gradients, we can ‘nudge’ the weights and biases in the direction their gradients indicate, and thus train the neural net.

Steps

Training and determining the gradients go hand-in-hand: you need the inputs to calculate the values of each neuron in the net, and you need the targets (aka desired outputs) to determine the gradients. Thus it’s a three-step process:

  • Forward pass (calculate each neuron’s value)
  • Backpropagation (calculate each neuron’s activation_gradient)
  • Train the weights and biases (adjust each neuron’s bias and weights)

Forward pass

This pass fills in the ‘value’ fields.

  • The first layer’s neurons must have the same number of weights as the number of inputs.
  • Each neuron’s value is calculated as tanh(bias + sum(weights * inputs)).
  • Since tanh is used as the activation function, every neuron’s output, and therefore the net’s outputs and targets, must be in the range -1 to +1 (the inputs are scaled into that range here too).
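
As a concrete example with made-up numbers: a neuron with bias 0.1, weights [0.5, -0.2] and inputs [0.3, 0.8] would output tanh(0.1 + 0.5*0.3 + (-0.2)*0.8) = tanh(0.09) ≈ 0.09.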

Forward pass pseudocode:

for layer in layers, first to last {
    if this is the first layer {
        for neuron in layer.neurons {
            total = neuron.bias
            for weight in neuron.weights {
                total += weight * inputs[weight_index]
            }
            neuron.value = tanh(total)
        }
    } else {
        previous_layer = layers[layer_index - 1]
        for neuron in layer.neurons {
            total = neuron.bias
            for weight in neuron.weights {
                total += weight * previous_layer.neurons[weight_index].value
            }
            neuron.value = tanh(total)
        }
    }
}

Backward pass (aka backpropagation)

This fills in the ‘activation_gradient’ fields.

  • Note that when iterating the layers here, you must go last to first.
  • The ‘targets’ are the array of output value(s) from the training data.
  • The last layer must have the same number of neurons as the number of targets.
  • The (1 - value^2) * ... factors are where the calculus comes in; there’s a sketch of where they come from just after the pseudocode below.

Backward pass pseudocode:

for layer in reversed layers, last to first {
    if this is the last layer {
        for neuron in layer.neurons {
            neuron.activation_gradient =
                (1 - neuron.value^2) *
                (neuron.value - targets[neuron_index])
        }
    } else {
        next_layer = layers[layer_index + 1]
        for this_layer_neuron in layer.neurons {
            next_layer_gradient_sum = 0
            for next_layer_neuron in next_layer.neurons {
                next_layer_gradient_sum +=
                    next_layer_neuron.activation_gradient * 
                    next_layer_neuron.weights[this_layer_neuron_index]
            }
            this_layer_neuron.activation_gradient =
                (1 - this_layer_neuron.value^2) *
                next_layer_gradient_sum
        }
    }
}
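
For the curious, here’s a sketch of where those gradient formulas come from. Assuming a squared-error loss of 0.5 * (value - target)^2 on each output neuron (I haven’t spelled the loss out anywhere, but the formulas above are consistent with that), and writing z = bias + sum(weights * inputs) for a neuron’s pre-activation and a = tanh(z) for its value, the chain rule gives:

\frac{da}{dz} = 1 - \tanh^2(z) = 1 - a^2
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial a} \cdot \frac{da}{dz} = (a - t)(1 - a^2)

That’s the last-layer formula above, with a being the neuron’s value and t its target. A hidden neuron only affects the loss through the next layer’s neurons, so the chain rule sums over them:

\frac{\partial L}{\partial z_j} = (1 - a_j^2) \sum_k w_{kj} \frac{\partial L}{\partial z_k}

where w_kj is next-layer neuron k’s weight for this neuron’s value, which is exactly the next_layer_gradient_sum in the pseudocode.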

Training pass

Now that you have the gradients, you can adjust the biases and weights to train the net to perform better.

I’ll skim over this as it’s covered in my earlier articles in this series. The gist of it is that, for each neuron, the gradient is calculated for the bias and every weight, and the bias/weights are adjusted a little to ‘descend the gradient’. Perhaps the pseudocode will make it clearer:

learning_rate = 0.01 // Aka 1%
for layer in layers {
    if this is the first layer {
        for neuron in layer.neurons {
            neuron.bias -= neuron.activation_gradient * learning_rate
            for weight in neuron.weights {
                gradient_for_this_weight = inputs[weight_index] *
                    neuron.activation_gradient
                weight -= gradient_for_this_weight * learning_rate
            }
        }
    } else {
        previous_layer = layers[layer_index - 1]
        for neuron in layer.neurons {
            neuron.bias -= neuron.activation_gradient * learning_rate
            for weight in neuron.weights {
                gradient_for_this_weight =
                    previous_layer.neurons[weight_index].value *
                    neuron.activation_gradient
                weight -= gradient_for_this_weight * learning_rate
            }
        }
    }
}
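
In symbols: with learning rate η, and x_i being the i-th input to a neuron (one of the net’s inputs for the first layer, or the previous layer’s i-th value otherwise), each update above is:

b \leftarrow b - \eta \cdot \frac{\partial L}{\partial z}
w_i \leftarrow w_i - \eta \cdot x_i \cdot \frac{\partial L}{\partial z}

where ∂L/∂z is the neuron’s activation_gradient: because z = bias + sum(weights * inputs), the bias’s gradient is the activation gradient itself, and each weight’s gradient is the corresponding input times the activation gradient.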

Rust demo

Because I’m a Rust tragic, here’s a demo. It’s kinda long, sorry, not sorry. It was fun to write :)

This trains a neural network to calculate the area and circumference of a rectangle, given the width and height as inputs.

  • Width and height are generated in the range 0.1 to 1, so that the outputs stay within the range that the tanh activation function can produce.
  • The circumference target is scaled by 0.25 so that it also fits in that range.
  • Initial biases and weights are randomly assigned.
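
As a sanity check on those scalings: the biggest rectangle the demo can generate approaches width = height = 1, giving an area approaching 1 and a circumference approaching 4; the 0.25 scale factor maps that circumference down to 1, so both targets always stay within tanh’s output range of -1 to +1.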

🦀🦀🦀

use rand::Rng;

struct Net {
    layers: Vec<Layer>,
}

struct Layer {
    neurons: Vec<Neuron>,
}

struct Neuron {
    value: f64,
    bias: f64,
    weights: Vec<f64>,
    activation_gradient: f64
}

const LEARNING_RATE: f64 = 0.001;

fn main() {
    let mut rng = rand::thread_rng();

    // Make a 3,3,2 neural net that inputs the width and height of a rectangle,
    // and outputs the area and circumference.
    let mut net = Net {
        layers: vec![
            Layer { // First layer has 2 weights to suit the 2 inputs.
                neurons: vec![
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                ],
            },
            Layer { // Second layer neurons have the same number of weights as the previous layer has neurons.
                neurons: vec![
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                ],
            },
            Layer { // Last layer has 2 neurons to suit 2 outputs.
                neurons: vec![
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                    Neuron {
                        value: 0.,
                        bias: rng.gen_range(-1. .. 1.),
                        weights: vec![
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                            rng.gen_range(-1. .. 1.),
                        ],
                        activation_gradient: 0.,
                    },
                ],
            },
        ],
    };

    // Train.
    let mut cumulative_error_counter: i64 = 0; // These vars are for averaging the errors.
    let mut area_error_percent_sum: f64 = 0.;
    let mut circumference_error_percent_sum: f64 = 0.;
    for training_iteration in 0..100_000_000 {
        // Inputs:
        let width: f64 = rng.gen_range(0.1 .. 1.);
        let height: f64 = rng.gen_range(0.1 .. 1.);
        let inputs: Vec<f64> = vec![width, height];

        // Targets (eg desired outputs):
        let area = width * height;
        let circumference_scaled = (height * 2. + width * 2.) * 0.25; // Scaled by 0.25 so it'll always be in range 0..1.
        let targets: Vec<f64> = vec![area, circumference_scaled];

        // Forward pass!
        for layer_index in 0..net.layers.len() {
            if layer_index == 0 {
                let layer = &mut net.layers[layer_index];
                for neuron in &mut layer.neurons {
                    let mut total = neuron.bias;
                    for (weight_index, weight) in neuron.weights.iter().enumerate() {
                        total += weight * inputs[weight_index];
                    }
                    neuron.value = total.tanh();
                }
            } else {
                // Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
                let previous_layer: &Layer;
                unsafe { previous_layer = & *net.layers.as_ptr().add(layer_index - 1) }
                let layer = &mut net.layers[layer_index];
                for neuron in &mut layer.neurons {
                    let mut total = neuron.bias;
                    for (weight_index, weight) in neuron.weights.iter().enumerate() {
                        total += weight * previous_layer.neurons[weight_index].value;
                    }
                    neuron.value = total.tanh();
                }
            }
        }

        // Let's check the results!
        let outputs: Vec<f64> = net.layers.last().unwrap().neurons
            .iter().map(|n| n.value).collect();
        let area_error_percent = (targets[0] - outputs[0]).abs() / targets[0] * 100.;
        let circumference_error_percent = (targets[1] - outputs[1]).abs() / targets[1] * 100.;
        area_error_percent_sum += area_error_percent;
        circumference_error_percent_sum += circumference_error_percent;
        cumulative_error_counter += 1;
        if training_iteration % 10_000_000 == 0 {
            println!("Iteration {} errors: area {:.3}%, circumference: {:.3}% (smaller = better)",
                training_iteration,
                area_error_percent_sum / cumulative_error_counter as f64,
                circumference_error_percent_sum / cumulative_error_counter as f64);
            area_error_percent_sum = 0.;
            circumference_error_percent_sum = 0.;
            cumulative_error_counter = 0;
        }

        // Backward pass! (aka backpropagation)
        let layers_len = net.layers.len();
        for layer_index in (0..layers_len).rev() { // Reverse the order.
            if layer_index == layers_len - 1 { // Last layer.
                let layer = &mut net.layers[layer_index];
                for (neuron_index, neuron) in layer.neurons.iter_mut().enumerate() {
                    neuron.activation_gradient =
                        (1. - neuron.value * neuron.value) *
                        (neuron.value - targets[neuron_index]);
                }
            } else {
                // Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
                let next_layer: &Layer;
                unsafe { next_layer = & *net.layers.as_ptr().add(layer_index + 1) }
                let layer = &mut net.layers[layer_index];
                for (this_layer_neuron_index, this_layer_neuron) in layer.neurons.iter_mut().enumerate() {
                    let mut next_layer_gradient_sum: f64 = 0.;
                    for next_layer_neuron in &next_layer.neurons {
                        next_layer_gradient_sum +=
                            next_layer_neuron.activation_gradient * 
                            next_layer_neuron.weights[this_layer_neuron_index];
                    }
                    this_layer_neuron.activation_gradient =
                        (1. - this_layer_neuron.value * this_layer_neuron.value) *
                        next_layer_gradient_sum;
                }
            }
        }

        // Training pass!
        for layer_index in 0..net.layers.len() {
            if layer_index == 0 {
                let layer = &mut net.layers[layer_index];
                for neuron in &mut layer.neurons {
                    neuron.bias -= neuron.activation_gradient * LEARNING_RATE;
                    for (weight_index, weight) in neuron.weights.iter_mut().enumerate() {
                        let gradient_for_this_weight =
                            inputs[weight_index] *
                            neuron.activation_gradient;
                        *weight -= gradient_for_this_weight * LEARNING_RATE;
                    }
                }
            } else {
                // Workaround for Rust not allowing you to borrow two different vec elements simultaneously.
                let previous_layer: &Layer;
                unsafe { previous_layer = & *net.layers.as_ptr().add(layer_index - 1) }
                let layer = &mut net.layers[layer_index];
                for neuron in &mut layer.neurons {
                    neuron.bias -= neuron.activation_gradient * LEARNING_RATE;
                    for (weight_index, weight) in neuron.weights.iter_mut().enumerate() {
                        let gradient_for_this_weight =
                            previous_layer.neurons[weight_index].value *
                            neuron.activation_gradient;
                        *weight -= gradient_for_this_weight * LEARNING_RATE;
                    }
                }
            }
        }
    }
}

Which outputs:

Iteration 0 errors: area 223.106%, circumference: 13.175% (smaller = better)
Iteration 10000000 errors: area 17.861%, circumference: 1.123% (smaller = better)
Iteration 20000000 errors: area 14.656%, circumference: 0.790% (smaller = better)
Iteration 30000000 errors: area 14.516%, circumference: 0.698% (smaller = better)
Iteration 40000000 errors: area 6.359%, circumference: 0.882% (smaller = better)
Iteration 50000000 errors: area 2.966%, circumference: 0.875% (smaller = better)
Iteration 60000000 errors: area 2.769%, circumference: 0.807% (smaller = better)
Iteration 70000000 errors: area 2.600%, circumference: 0.698% (smaller = better)
Iteration 80000000 errors: area 2.401%, circumference: 0.573% (smaller = better)
Iteration 90000000 errors: area 2.166%, circumference: 0.468% (smaller = better)

You can see the error percentage drop as the net ‘learns’ to calculate the area and circumference of a rectangle. Magic!
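
One note on the code: the unsafe blocks exist only to work around Rust not letting you mutably borrow one element of net.layers while immutably borrowing another. A safe alternative, sketched below for the forward pass’s else branch (the helper function name is made up; the demo above doesn’t use it), is split_at_mut:

// split_at_mut gives two non-overlapping mutable slices, so the previous layer
// and the current layer can be borrowed at the same time without unsafe.
fn forward_one_hidden_layer(net: &mut Net, layer_index: usize) {
    let (earlier_layers, rest) = net.layers.split_at_mut(layer_index);
    let previous_layer = &earlier_layers[layer_index - 1]; // Last of the earlier layers.
    let layer = &mut rest[0];                              // The layer we're filling in.
    for neuron in &mut layer.neurons {
        let mut total = neuron.bias;
        for (weight_index, weight) in neuron.weights.iter().enumerate() {
            total += weight * previous_layer.neurons[weight_index].value;
        }
        neuron.value = total.tanh();
    }
}

The same trick works for the backward and training passes (splitting at layer_index + 1 when you need the next layer instead of the previous one).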

Thanks for reading, hope you found this helpful, at least a tiny bit, God bless!


