Tutorial: Build your first neural network from scratch

9 minute read


In this episode, we investigate a simple working neural network (NN) anatomically, without using any deep learning package. You might find it very interesting to design every single element needed for forward and backward propagation: every component can then be assembled like a LEGO set!

Getting started

We are going to implement an $L$-layer neural network (a multilayer perceptron, MLP) that can perform simple tasks such as digit recognition. $L$ is an arbitrary number of your choice; the network contains $L-1$ weight matrices $\boldsymbol{w}^{(i)}$ linking the $i$-th and the $(i+1)$-th layer, and the last layer is the output layer (e.g. a Softmax). We define a data matrix $\boldsymbol{X}$ ($N$ by $M$) that contains $N$ samples of dimension $M$. In the MNIST dataset, each image is 28 by 28 pixels, so the dimension is 784. We also have a target $\boldsymbol{Y}$ ($N$ by $K$), a matrix of $K$-category one-hot labels. We will go through this step by step, specifying every component and giving an implementation in MATLAB. Before proceeding, it is recommended to have a look at the lecture note.

Excerpts from the MNIST dataset.
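
To fix the shape conventions in mind, here is a toy sketch (only the 784 pixels and the 10 digit classes come from MNIST; the other numbers are arbitrary):

N = 5;                           %number of samples
M = 28*28;                       %input dimension: a flattened 28x28 image
K = 10;                          %number of classes (digits 0-9)
X = rand(N, M);                  %data matrix, one image per row
labels = randi([0 9], N, 1);     %integer labels 0..9
Y = double((0:K-1) == labels);   %N-by-K one-hot target matrix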

In this tutorial, our MLP is constructed in the form \begin{equation} \hat{\boldsymbol{Y}}=\operatorname{Softmax}(\boldsymbol{w}^{(L-1)} \sigma(\boldsymbol{w}^{(L-2)}\cdots\sigma(\boldsymbol{w}^{(2)}\sigma(\boldsymbol{X}\boldsymbol{w}^{(1)}+\boldsymbol{b}^{(1)})+\boldsymbol{b}^{(2)})\cdots+\boldsymbol{b}^{(L-2)})). \end{equation} Since we are dealing with data matrices rather than vectors, to make computation more efficient we write every MATLAB helper so that it operates element-wise on matrices (or tensors). Recall from your data science lectures these important functions that compose our neural network; a quick sanity check of the helpers follows the list.

  • Sigmoid \(\begin{equation} \sigma(t)=\frac{1}{1+\exp (-t)} \end{equation}\)
    function g = sigmoid(T)
      g = 1 ./ (1+exp(-T));
    end
    
  • Derivative of Sigmoid \(\begin{align}\frac{d \sigma(t)}{d t}=\sigma(t)(1-\sigma(t))\end{align}\)
    function g = dsigmoid(T)
      g = sigmoid(T).*(1-sigmoid(T));   %element-wise derivative
    end
    
    
  • Softmax \(\begin{equation} \operatorname{Softmax}\left(\boldsymbol{t}\right)=\left(\frac{\exp \left(t_1\right)}{\sum_{k=1}^K \exp \left(t_k\right)}, \cdots, \frac{\exp \left(t_K\right)}{\sum_{k=1}^K \exp \left(t_k\right)}\right) \end{equation}\)
    function y = softmax(T)
      y = exp(T) ./ sum(exp(T),2);        %denominator sum by row
    end
    
  • Cross Entropy (This will be our loss function) \(\begin{equation} \mathcal{L}(\boldsymbol{Y},\hat{\boldsymbol{Y}})=\frac{1}{N} \sum_i^N \sum_j^K -y_{i k} \log \left(\hat{y}_{i k}\right) \end{equation}\)
    function C = crossentropy(Y,Y_hat)
      C = sum(sum(-Y.*log(Y_hat))) / size(Y,1);   %average over the N samples
    end
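
Before wiring these helpers into a network, it is worth a quick check that they really work element-wise (a small diagnostic sketch, assuming sigmoid.m, dsigmoid.m and softmax.m are saved on the MATLAB path):

T = randn(3,4);                                   %an arbitrary small matrix
disp(sum(softmax(T),2));                          %every row should sum to 1
h = 1e-6;                                         %finite-difference step
fd = (sigmoid(T+h) - sigmoid(T-h)) / (2*h);       %numerical derivative of sigmoid
disp(max(max(abs(fd - dsigmoid(T)))));            %should be close to zero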
    

Initialise Weights

The initialization only needs the size of each layer (the number of neurons per layer), say, layers = [input_size, hidden_size, ..., hidden_size, output_size]. Note that we reserve one extra dimension in every weight matrix for the bias coefficients. All initial weight entries can be sampled from a small random distribution; here we draw them uniformly from $[-0.1,0.1]$ (a small Gaussian such as $\mathcal{N}(0,0.1)$ would also do).

function ini_weights = randWeights(layers)
    epsilon = 0.1;                          %half-width of the initial range
    for i = 1:(size(layers,2)-1)
        ini_weights{i} = rand(layers(i+1), 1+layers(i))*2*epsilon - epsilon;
    end
end
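
For instance, calling this on our four-layer configuration gives one weight matrix per pair of consecutive layers, each with an extra bias column:

weights = randWeights([784, 50, 50, 10]);         %example configuration
cellfun(@size, weights, 'UniformOutput', false)   %-> {[50 785]} {[50 51]} {[10 51]}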

Forward Propagation

Here are the definitions of the intermediate variables:
$\boldsymbol{z}^{(i)}=\boldsymbol{a}^{(i-1)}\boldsymbol{w}^{(i-1)^\top}$, the pre-activation (linear combination) at the $i$-th layer;
$\boldsymbol{a}^{(i)}=\sigma(\boldsymbol{z}^{(i)})$, the activation (intermediate outcome) at the $i$-th layer.
Firstly, specify the input layer $\boldsymbol{a}^{(1)}$. For convenience, we append one extra column of ones to the intermediate outcome of every layer to carry the bias term $\boldsymbol{b}$. \(\begin{equation}\boldsymbol{a}^{(1)}=\left[\begin{array}{cc} 1&\boldsymbol{X}_1\\ \cdots&\cdots\\ 1&\boldsymbol{X}_N \end{array}\right]=\left[\begin{array}{cc} \boldsymbol{1}&\boldsymbol{X}\\ \end{array}\right], \end{equation}\) where $\boldsymbol{1}$ represents a column vector of ones. Now we generate the sequence of hidden layers from $(\boldsymbol{z}^{(2)},\boldsymbol{a}^{(2)})$ to $(\boldsymbol{z}^{(L-1)},\boldsymbol{a}^{(L-1)})$: \(\begin{equation} \begin{aligned} \boldsymbol{z}^{(l)}&=\boldsymbol{a}^{(l-1)}\cdot\boldsymbol{w}^{(l-1)^\top}\\ \boldsymbol{a}^{(l)}&=\left[\begin{array}{cc} \boldsymbol{1}&\sigma(\boldsymbol{z}^{(l)})\\ \end{array}\right] \end{aligned} \end{equation}\) Then compute the output layer $\boldsymbol{z}^{(L)}$ and $\boldsymbol{a}^{(L)}$, \(\begin{equation} \begin{aligned} \boldsymbol{z}^{(L)}&=\boldsymbol{a}^{(L-1)}\cdot\boldsymbol{w}^{(L-1)^\top}\\ \boldsymbol{a}^{(L)}&=\operatorname{Softmax}(\boldsymbol{z}^{(L)}) \end{aligned} \end{equation}\) Finally, compute the total loss, \(\begin{equation} \mathcal{L}\left(\boldsymbol{Y},\boldsymbol{a}^{(L)}\right). \end{equation}\)

function [C,a,z] = forward(X,Y,layers,weights)
a{1} = [ones(size(X,1),1) X];               %design the input layer: bias column + X

for i = 2:size(layers,2)
    z{i} = a{i-1}*weights{i-1}';            %pre-activation of layer i
    if i ~= size(layers,2)                  %hidden layers (eq 1.6)
        a{i} = sigmoid(z{i});
        a{i} = [ones(size(z{i},1),1) a{i}]; %append bias column for the next layer
    else
        a{i} = softmax(z{i});               %output layer (eq 1.7)
    end
end

C = crossentropy(Y,a{end});                 %compute the cost (eq 1.8)
end
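
A quick way to convince yourself that the shapes line up is to push a few random samples through the untrained network (a toy check with fake data, not MNIST):

layers  = [784, 50, 50, 10];
weights = randWeights(layers);
Xs = rand(8, 784);                          %8 fake "images"
Ys = double((0:9) == randi([0 9], 8, 1));   %8 random one-hot labels
[C, a, z] = forward(Xs, Ys, layers, weights);
size(a{end})                                %-> [8 10], one probability row per sample
C                                           %roughly log(10) ~ 2.3 for small random weights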

Backward Propagation

First, look at the gradient of the cross-entropy w.r.t. the input of the Softmax layer, $\boldsymbol{z}^{(L)}=\boldsymbol{a}^{(L-1)}\boldsymbol{w}^{(L-1)^\top}$. Writing $z^{(L)}_{ik}=\mathbf{a}_i \boldsymbol{w}^{\top}_k$ for one sample $i$ (with $\mathbf{a}_i$ the $i$-th row of $\boldsymbol{a}^{(L-1)}$ and $\boldsymbol{w}_k$ the $k$-th row of $\boldsymbol{w}^{(L-1)}$), \(\begin{equation} \begin{aligned} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}^{(L)}_i} &=\frac{\partial}{\partial \boldsymbol{z}^{(L)}_i} \left(\frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K -y_{i k}\log \left(\frac{\exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_k\right)}{\sum_{l=1}^K \exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_l\right)}\right)\right) \\ &=\frac{1}{N}\left[\begin{array}{ccc} \frac{\exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_1\right)}{\sum_{l=1}^K \exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_l\right)} & \cdots & \frac{\exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_K\right)}{\sum_{l=1}^K \exp \left(\mathbf{a}_i \boldsymbol{w}^{\top}_l\right)} \end{array}\right]-\frac{1}{N}\boldsymbol{y}_i, \end{aligned} \end{equation}\) and stacking the $N$ rows gives \(\begin{equation} \frac{\partial \mathcal{L}}{\partial \boldsymbol{z}^{(L)}}=\frac{1}{N}\left(\boldsymbol{a}^{(L)}-\boldsymbol{Y}\right). \end{equation}\) This is simply the difference between the predicted outcome and the true label. Next, we consider the gradients w.r.t. the weights of all layers. Define $\boldsymbol{\delta}^{(L)}$ first, \(\begin{equation} \boldsymbol{\delta}^{(L)}=\frac{\partial \mathcal{L}}{\partial\boldsymbol{z}^{(L)}}=\frac{1}{N}\left(\boldsymbol{a}^{(L)}-\boldsymbol{Y}\right). \end{equation}\) Why and how do we calculate these $\boldsymbol{\delta}$ terms? The answer lies in the chain rule: $\boldsymbol{\delta}^{(l)}$ is the gradient of the loss w.r.t. the pre-activation $\boldsymbol{z}^{(l)}$, and it can be passed backwards from one layer to the previous one. Generate the whole sequence of deltas from $\boldsymbol{\delta}^{(L)}$ down to $\boldsymbol{\delta}^{(2)}$; note that the order is reversed compared with forward propagation. \(\begin{equation} \boldsymbol{\delta}^{(l)}=\left(\boldsymbol{\delta}^{(l+1)}\,\tilde{\boldsymbol{w}}^{(l)} \right) \odot \sigma^{\prime}\left(\boldsymbol{z}^{(l)}\right), \end{equation}\) where $\tilde{\boldsymbol{w}}^{(l)}$ denotes $\boldsymbol{w}^{(l)}$ with its bias column removed. Then compute the gradient at each layer, \(\begin{equation} \frac{\partial \mathcal{L}}{\partial \boldsymbol{w}^{(l)}}=\boldsymbol{\delta}^{(l+1)^{\top}} \mathbf{a}^{(l)}. \end{equation}\)

function [gradient] = backward(a,z,layers,Y,weights)

for i = size(layers,2):-1:2                 %in backward direction
    if i == size(layers,2)
        delta{i} = (a{i} - Y)./size(Y,1);   %delta of the output layer (eq 1.10)
    else                                    %pass delta backwards layer by layer (eq 1.11)
        delta{i} = (delta{i+1}*weights{i}(:,2:end)) .* dsigmoid(z{i});
    end
    gradient{i-1} = delta{i}'*a{i-1};       %gradient in terms of the weights (eq 1.12)
end
end
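
Back propagation is notoriously easy to get subtly wrong, so a finite-difference gradient check on a tiny network is a worthwhile diagnostic (an optional sketch, not part of the training code; the layer sizes here are arbitrary):

layers  = [4, 3, 2];                        %a tiny network
weights = randWeights(layers);
Xs = rand(5, 4);                            %5 toy samples
Ys = double((0:1) == randi([0 1], 5, 1));   %5 toy one-hot labels
[~, a, z] = forward(Xs, Ys, layers, weights);
grad = backward(a, z, layers, Ys, weights);
h = 1e-6;  wp = weights;  wm = weights;     %perturb one weight entry by +/- h
wp{1}(1,2) = wp{1}(1,2) + h;
wm{1}(1,2) = wm{1}(1,2) - h;
numgrad = (forward(Xs, Ys, layers, wp) - forward(Xs, Ys, layers, wm)) / (2*h);
disp([numgrad, grad{1}(1,2)]);              %the two numbers should agree closely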

Gradient Descent

This is one step of training: one forward & backward pass followed by one update of the weights using gradient descent, \(\begin{equation} \boldsymbol{w}_{t+1}=\boldsymbol{w}_t- \alpha\nabla_{\boldsymbol{w}} \mathcal{L}\left(\boldsymbol{w}_t\right), \end{equation}\) where $\alpha$ is a constant learning rate. To perform one step of gradient descent, we gather all the gradients we've just computed during back propagation. Remember, the full set of weights is the collection \(\begin{equation} \boldsymbol{w}=\left[\begin{array}{llll} [\boldsymbol{w}^{(1)}] & [\boldsymbol{w}^{(2)}] & \cdots &[\boldsymbol{w}^{(L-1)}] \end{array}\right], \end{equation}\) which we store in MATLAB as a cell array of matrices.

function [C, newweights] = train(weights, layer_sizes,X, Y, alpha)

    [C,a,z] = forward(X,Y,layer_sizes,weights);             %forward propagation
    [weight_grad] = backward(a,z,layer_sizes,Y,weights);    %backward propagation
 
    for i = 1:(size(layer_sizes,2)-1)                       %update weights (eq 1.13)
        newweights{i} = weights{i} - alpha*weight_grad{i};
    end
end
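
The loop above performs full-batch gradient descent. If you prefer mini-batches, a small wrapper around the same train function is enough (a sketch only; train_minibatch and the batch parameter are illustrative and not used in the experiments below):

function [C, weights] = train_minibatch(weights, layer_sizes, X, Y, alpha, batch)
    idx = randperm(size(X,1));                      %shuffle the samples once per epoch
    for s = 1:batch:size(X,1)
        b = idx(s:min(s+batch-1, size(X,1)));       %indices of one mini-batch
        [C, weights] = train(weights, layer_sizes, X(b,:), Y(b,:), alpha);
    end                                             %C is the cost of the last mini-batch
end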

Put Things Together

The only data source we use is the MNIST data set shipped with MATLAB's examples. To save time we extract only 5000 pictures from the training set and manually convert them into a data matrix with labels. The labels are subsequently converted into a matrix of one-hot vectors.

clear all;
oldpath = addpath(fullfile(matlabroot,'examples','nnet','main'));
filenameImagesTrain = 'train-images-idx3-ubyte.gz';
filenameLabelsTrain = 'train-labels-idx1-ubyte.gz';
XTrain = processImagesMNIST(filenameImagesTrain);
YTrain = processLabelsMNIST(filenameLabelsTrain);

data = [];                      %create the data matrix
for i = 1:5000
    img = XTrain(:,:,1,i);      %one 28x28 image
    vec = reshape(img,1,[]);    %flatten the image into a row vector
    data = [data;vec];          %stack the row vectors
end

X = double(extractdata(data));  %X is [5000, 28*28]
y = double(YTrain(1:5000))-1;   %extract labels, turn categorical into double
Y = (0:9) == y;                 %convert into one-hot rows
>> [size(X),size(Y)]

ans =

        5000         784        5000          10
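
As a small optional check, every row of the one-hot matrix should contain exactly one 1:

assert(all(sum(Y,2) == 1));     %each row encodes exactly one class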

Training

Two steps of training on MNIST are of course not enough, so we run 1000 iterations and record the cost every 100 iterations. To see whether our MLP really works, we also compute the precision of the classification (the fraction of correctly classified images) within the training set. In this sample MLP we use two hidden layers, each with 50 neurons, which is good enough for our task.

input_layer_size  = size(X, 2);                         
num_labels = size(Y, 2);                                
layers = [input_layer_size,50,50,num_labels];          %vector of layer sizes
alpha = 1;                                              %learning rate
num_iteration = 1000;                                   %number of iterations
tic; 

weights = randWeights(layers);                          %initialize weights and bias

for i = 1:num_iteration
    [C, weights] = train(weights, layers,...            %training loop
                                   X, Y, alpha);
    if mod(i,100) == 0                                  %report every 100 iterations
        [~,a,~] = forward(X,Y,layers,weights);          %make prediction
        [~, p] = max(a{end}, [], 2);                    %argmax of One Hot
        pred = p-1;  
        precision = mean(double(pred == y)) * 100;      %calculate precision
        
        disp(['Iteration: ' num2str(i) ' / ' num2str(num_iteration) ' ; Cost: ' num2str(C) ' ; Precision: ' num2str(precision)]);
    end
end
time = toc;
disp(['Time spent: ' num2str(time) ' seconds' ])
>> neural_network
Iteration: 100 / 1000 ; Cost: 1.5695 ; Precision: 50.9
Iteration: 200 / 1000 ; Cost: 0.63653 ; Precision: 82.54
Iteration: 300 / 1000 ; Cost: 0.42388 ; Precision: 88.88
Iteration: 400 / 1000 ; Cost: 0.32171 ; Precision: 91.58
Iteration: 500 / 1000 ; Cost: 0.25099 ; Precision: 93.06
Iteration: 600 / 1000 ; Cost: 0.20056 ; Precision: 94.8
Iteration: 700 / 1000 ; Cost: 0.16264 ; Precision: 95.94
Iteration: 800 / 1000 ; Cost: 0.1328 ; Precision: 96.92
Iteration: 900 / 1000 ; Cost: 0.1088 ; Precision: 97.76
Iteration: 1000 / 1000 ; Cost: 0.089292 ; Precision: 98.34
Time spent: 50.6449 seconds

This result definitely makes sense: our neural network is able to learn something from MNIST. So far, we have built this arbitrary-depth neural network with merely 86 lines of MATLAB code (excluding the data import), without any machine learning package.
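Note that the precision above is measured on the very images we trained on. A rough hold-out check on 1000 images the network has never seen could look like this (an optional sketch, assuming XTrain and YTrain from the loading step are still in memory):

test = [];
for i = 5001:6000
    test = [test; reshape(XTrain(:,:,1,i),1,[])];   %flatten unseen images
end
Xtest = double(extractdata(test));
ytest = double(YTrain(5001:6000))-1;
[~,a,~] = forward(Xtest, (0:9) == ytest, layers, weights);
[~, p] = max(a{end}, [], 2);
disp(mean(p-1 == ytest)*100);                       %held-out accuracy in percent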

Let’s print more results!

i = randi(length(y));                       %take one sample
[~,a,~] = forward(X,Y,layers,weights);
[~, p] = max(a{end}, [], 2);
pred=p(i)-1;

figure;
imshow(reshape(X(i,:),28,28));
fprintf('True class: %d  ;  Predicted class: %d\n',y(i),pred);
True class: 3  ;  Predicted class: 3

Exercise: Build NN with convolution layers

A convolution layer

A convolution layer should look something like this. Some more [animations](https://animatedai.github.io/)

To do this, you just follow the same steps as before; a minimal sketch of a convolution layer's forward pass is given after the list.

  • Specify input/output dimensions and model parameters such as filters, kernel size, strides, padding, etc.
  • Derive the forward & backward propagation mathematically, using the product rule and the chain rule.
  • Make your maths work in simulation.
  • Try some automatic differentiation software (TensorFlow, PyTorch… )
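
To get you started, here is a minimal sketch of the forward pass of a single-filter, single-channel convolution layer with stride 1 and no padding (convForward, K and b are illustrative names, not part of the code above):

function A = convForward(X, K, b)
    [n, m]   = size(X);                         %input image
    [kn, km] = size(K);                         %kernel
    Z = zeros(n-kn+1, m-km+1);                  %"valid" convolution output
    for r = 1:size(Z,1)
        for c = 1:size(Z,2)
            patch  = X(r:r+kn-1, c:c+km-1);     %receptive field of this output pixel
            Z(r,c) = sum(sum(patch .* K)) + b;  %cross-correlation plus bias
        end
    end
    A = sigmoid(Z);                             %reuse the sigmoid defined earlier
end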