
How They Work
All useful computer systems have an input and an output, with some kind of calculation in between. Neural networks are no different. When we don’t know how something works, we can try to estimate it with a model that has parameters we can adjust. If we didn’t know how to convert kilometers to miles, we might use a linear function as a model, with an adjustable gradient. A good way of refining these models is to adjust the parameters based on how wrong the model is compared to known true examples.
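The refinement idea above can be sketched in a few lines of code: a linear model with one adjustable gradient, nudged towards a known true example. The constant 0.6214, the moderating factor L = 0.5, and the 10-step loop are illustrative choices, not from the text.

```python
# A minimal sketch: estimating an unknown conversion (kilometers to
# miles) with the linear model y = A * x, refining the adjustable
# gradient A based on how wrong the model is.

def model(A, km):
    return A * km

A = 0.5                         # initial guess for the gradient
L = 0.5                         # moderating factor (learning rate)
km, true_miles = 100.0, 62.14   # one known true example

for _ in range(10):
    error = true_miles - model(A, km)   # how wrong is the model?
    A += L * (error / km)               # nudge A by a moderated amount

print(round(A, 4))
```

Each pass shrinks the error by half, so after ten passes A sits very close to the true conversion factor.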
We want to train our linear classifier to correctly classify bugs as ladybirds or caterpillars. This is simply about refining the slope of the dividing line that separates the two groups of points on a plot of bug width and height. Example 1: a ladybird with width 3.0 and length 1.0; a caterpillar with width 1.0 and length 3.0.
y = Ax. Let’s go for A = 0.25, so the dividing line is y = 0.25x. If we plot this line, it doesn’t divide the two types of bugs, so intuitively we need to move the line up a bit. We want to see if we can find a repeatable recipe to do this, a series of computer instructions, which scientists call an algorithm. If y were 1.0, the line would go right through the point where the ladybird sits at (x, y) = (3.0, 1.0). We want all ladybird points to be below the line, not on it. The line needs to be a dividing line between ladybirds and caterpillars, not a predictor of a bug’s length given its width. So let’s aim for a target of y = 1.1, 1.2 or 1.3.
error = (desired target – actual output)
E = 1.1 – 0.75 = 0.35
Let’s call the correct desired value t, for target value. To get that value t, we need to adjust A by a small amount.
t = (A + △A)x
E = t – y = (A + △A)x – Ax = (△A)x
△A = E/x
If we keep fully updating for each training example, the final update simply matches the last training example closely. We throw away any learning that previous training examples might give us and learn only from the last one.
An important idea in machine learning is to moderate the updates. This moderation has another very powerful and useful side effect: it dampens the impact of errors and noise in the training data, smoothing them out.
△A = L (E/x). The moderating factor L is often called a learning rate.
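The full recipe can be sketched as code: start from A = 0.25, and for each training bug apply the moderated update △A = L (E/x) with a learning rate of L = 0.5, using targets of 1.1 and 2.9 so the line sits just past each bug rather than on it.

```python
# A sketch of the training recipe: refine the slope A of the
# dividing line y = A*x using the moderated update dA = L * (E / x).
A = 0.25            # initial slope
L = 0.5             # learning rate, moderates each update

# (width x, target y) pairs: a target just above the ladybird's
# length, and just below the caterpillar's length
training_data = [(3.0, 1.1), (1.0, 2.9)]

for x, t in training_data:
    y = A * x               # where the current line sits at this width
    E = t - y               # error against the desired target
    A += L * (E / x)        # moderated parameter update

print(round(A, 4))          # → 1.6042
```

The first example moves A from 0.25 to about 0.3083; the second moves it to about 1.6042, a line that separates the two bugs.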
Boolean logic (true =1, false =0) functions are important in computer science. Is there more malaria when it rains AND it is hotter than 35 degrees? Is there malaria when either (Boolean OR) of these conditions is true? There is another Boolean function called XOR, short for eXclusive OR, which only has a true output if either one of the inputs A or B is true, but not both. That is, when the inputs are both false, or both true, the output is false.
Input A  Input B  Logical XOR 
0  0  0 
0  1  1 
1  0  1 
1  1  0 
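The truth table above can be checked directly with Python’s bitwise XOR operator on 0/1 inputs:

```python
# Verify the XOR truth table using Python's bitwise XOR operator (^).
for a in (0, 1):
    for b in (0, 1):
        print(a, b, a ^ b)   # prints the three columns of the table
```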
Traditional computers processed data very sequentially, and in pretty exact concrete terms. There is no fuzziness or ambiguity about their cold hard calculations. Animal brains on the other hand, seemed to process signals in parallel, and fuzziness was a feature of their computation.
Neurons transmit an electrical signal from one end to the other, from the dendrites along the axon to the terminals. These signals are then passed from one neuron to another. This is how your body senses light, sound, touch, pressure, heat, and so on. Signals from specialized sensory neurons are transmitted along your nervous system to your brain, which itself is mostly made of neurons. The very capable human brain has about 100 billion neurons! A fruit fly has about 100,000 neurons and is capable of flying, feeding, evading danger, finding food, and many more fairly complex tasks. The nematode worm has just 302 neurons, which is positively minuscule compared to today’s digital computer resources! But that worm is able to do some fairly useful tasks that traditional computer programs of much larger size would struggle to do.
A neuron takes an electrical input and pops out another electrical signal. Neurons don’t react readily, but instead suppress the input until it has grown so large that it triggers an output. It’s like water in a cup – the water doesn’t spill over until it has first filled the cup. A function that takes the input signal and generates an output signal, but takes into account some kind of threshold, is called an activation function. Mathematically, there are many such activation functions that could achieve this effect.
The S-shaped function shown below is called the sigmoid function. It is smoother than the cold hard step function, and this makes it more natural and realistic.
The sigmoid function, sometimes also called the logistic function, is y = 1 / (1 + e^{-x})
e is the mathematical constant 2.71828…. When x is zero, e^{-x} is 1 and y is 0.5 (half).
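The sigmoid is a one-liner in code:

```python
import math

# The sigmoid (logistic) activation function: smooth, S-shaped,
# squashing any input into the range (0, 1).
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # 0.5, since e^0 = 1
```

Large negative inputs are suppressed towards 0, large positive inputs saturate towards 1, which is the threshold-like behaviour we want.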
The first thing to realize is that real biological neurons take many inputs, not just one. We saw this with Boolean logic. We combine these inputs by adding them up, and the resultant sum is the input to the sigmoid function, which controls the output. This reflects how real neurons work. If the combined signal is not large enough, the effect of the sigmoid threshold function is to suppress the output signal; if the sum is large enough, the effect of the sigmoid is to fire the neuron.
The electrical signals are collected by the dendrites and these combine to form a stronger electrical signal. If the signal is strong enough to pass the threshold, the neuron fires a signal down the axon towards the terminals to pass onto the next neuron’s dendrites. The thing to notice is that each neuron takes input from many before it, and also provides signals to many more, if it happens to be firing. One way to replicate this from nature to an artificial model is to have layers of neurons (nodes), with each connected to every other one in the preceding and subsequent layer.
What part of this architecture does the learning? A weight is shown associated with each connection. A low weight will de-emphasize a signal, and a high weight will amplify it. So W_{1,2} is the weight that diminishes or amplifies the signal between node 1 and node 2 in the next layer. The network learns to improve its outputs by refining the link weights inside the network; some weights become zero or close to zero. Let’s imagine 2 inputs (1.0 and 0.5). Each node turns the sum of its inputs into an output using the sigmoid function y = 1 / (1 + e^{-x}), where x is the sum of incoming signals to the neuron, and y is the output of that neuron. Let’s go with some random weights: W_{1,1} = 0.9, W_{1,2} = 0.2, W_{2,1} = 0.3, W_{2,2} = 0.8
The first layer of a neural network is the input layer, and all that layer does is represent the inputs (1.0 and 0.5). Let’s focus on node 1 in layer 2. Both nodes in the first input layer are connected to it. Those input nodes have raw values of 1.0 and 0.5. The link from the first node has a weight of 0.9; the link from the second node has a weight of 0.3. So the combined moderated input is
x = (output from the first node * link weight) + (output from the second node * link weight)
x = (1.0 * 0.9) + (0.5 * 0.3) = 1.05, and y = 1/(1 + e^{-1.05}) = 0.7408. Similarly, for the second node, x = (1.0 * 0.2) + (0.5 * 0.8) = 0.6, and applying the sigmoid activation function gives y = 0.6457.
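The worked calculation for both layer-2 nodes looks like this in code:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and the random weights from the worked example:
# W11 = 0.9, W21 = 0.3 feed node 1; W12 = 0.2, W22 = 0.8 feed node 2.
i1, i2 = 1.0, 0.5

x1 = i1 * 0.9 + i2 * 0.3    # combined moderated input to node 1 = 1.05
x2 = i1 * 0.2 + i2 * 0.8    # combined moderated input to node 2 = 0.6

y1 = sigmoid(x1)            # ≈ 0.7408
y2 = sigmoid(x2)            # ≈ 0.6457
```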
X = W*I
W is the matrix of weights, I is the matrix of inputs.
The activation function simply applies a threshold and squishes the response to be more like that seen in biological neurons.
O = sigmoid (X)
That O, written in bold, is a matrix containing all the outputs from the final layer of the neural network. The expression X = W*I applies to the calculations between any one layer and the next. If we have 3 layers, we simply do the matrix multiplication again, using the outputs of the second layer as inputs to the third layer, combined and moderated using more weights. The first layer is the input layer, the final layer is the output layer, and the middle layer is called the hidden layer.
X_{hidden} = W_{input_hidden} * I and O_{hidden} = sigmoid(X_{hidden}). The sigmoid activation function is applied to each element of X_{hidden} to produce the matrix that holds the outputs of the middle hidden layer. No matter how many layers we have, we can treat each layer like any other – with incoming signals which we combine, link weights to moderate those incoming signals, and an activation function to produce the output from that layer.
X_{output} = W_{hidden_output}*O_{hidden}
O_{output} = sigmoid(X_{output})
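The whole layer-by-layer feedforward can be sketched with matrix multiplication. The input-to-hidden weights below match the worked example; the hidden-to-output weights are illustrative values of my own, not from the text.

```python
import numpy as np

# Feedforward through a 3-layer network, one matrix multiply and
# one activation per layer: X = W*I, O = sigmoid(X).
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

W_input_hidden = np.array([[0.9, 0.3],     # weights into hidden node 1
                           [0.2, 0.8]])    # weights into hidden node 2
W_hidden_output = np.array([[0.4, 0.6],    # illustrative values
                            [0.5, 0.1]])

I = np.array([[1.0], [0.5]])               # column vector of inputs

X_hidden = W_input_hidden @ I              # combine and moderate
O_hidden = sigmoid(X_hidden)               # activate

X_output = W_hidden_output @ O_hidden      # same recipe, next layer
O_output = sigmoid(X_output)
```

Note that `O_hidden` reproduces the hand-worked values 0.7408 and 0.6457; adding a fourth layer would just repeat the same two lines with another weight matrix.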
The next step is to take the output from the neural network and compare it with the training example to work out an error. We need to use that error to refine the neural network itself so that it improves its outputs. Here there are 2 nodes contributing a signal to the output node. The link weights are 3.0 and 1.0. If we split the error in proportion to these weights, 3/4 of the output error should be used to update the first, larger weight, and 1/4 of the error for the second, smaller weight. This method of feeding errors back through the network is called backpropagation.
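The proportional split is simple arithmetic; the error value 0.8 below is illustrative.

```python
# Splitting one output error across its two incoming links in
# proportion to their weights (3.0 and 1.0, as in the text).
w1, w2 = 3.0, 1.0
error = 0.8                      # illustrative output error

e1 = error * w1 / (w1 + w2)      # 3/4 of the error goes to the larger weight
e2 = error * w2 / (w1 + w2)      # 1/4 goes to the smaller weight
```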
The error e_{1} is the difference between the desired output provided by the training data, t_{1}, and the actual output, o_{1}. The error at the second output node is e_{2}.
e_{1} = (t_{1} – o_{1})
e_{hidden,1} = e_{output,1} * (W_{11}/(W_{11}+W_{21})) + e_{output,2} * (W_{12}/(W_{12}+W_{22}))
You can see the error 0.5 at the second output layer node being split proportionately into 0.1 and 0.4 across the two connected links which have weights 1.0 and 4.0. You can also see that recombined error at the second hidden layer node is the sum of the connected split errors, which here are 0.48 and 0.4, to give 0.88.
Backpropagation can be made more concise by using matrix multiplication (vectorize the process).
Swapping the rows and columns of a matrix is called transposing it, written W^{T}. This gives us a matrix approach to propagating the errors back.
error_{hidden} = W^{T}_{hidden_output} * error_{output}
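In code, backpropagating the errors is one transposed matrix multiply. Note this drops the normalising fractions (the W/(W+W) denominators), which only rescale the errors; the weight and error values below are illustrative.

```python
import numpy as np

# Propagating output errors back to the hidden layer with the
# transposed weight matrix: error_hidden = W^T * error_output.
# Rows of W are output nodes, columns are hidden nodes.
W_hidden_output = np.array([[2.0, 3.0],
                            [1.0, 4.0]])
error_output = np.array([[0.8], [0.5]])    # errors at the two output nodes

error_hidden = W_hidden_output.T @ error_output
```

Each hidden node receives the sum of the output errors, weighted by the links it contributed through, which is exactly the recombined-split idea without the normalising denominators.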
Gradient Descent: After you’ve taken a step, you look again at the surrounding area to see which direction takes you closer to your objective, and then you step again in that direction. The gradient refers to the slope of the ground. You step in the direction where the slope is the steepest downwards.
error function, y = (x – 1)^{2} + 1
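Gradient descent on that simple error function can be sketched directly, stepping opposite the slope each time. The starting point and step size are illustrative choices.

```python
# Gradient descent on the error function y = (x - 1)^2 + 1,
# whose minimum sits at x = 1.
def slope(x):
    return 2 * (x - 1)      # dy/dx of the error function

x = 4.0                     # arbitrary starting point
step = 0.3                  # step size (learning rate)

for _ in range(50):
    x -= step * slope(x)    # step in the downhill direction
```

Each step shrinks the distance to the minimum by a constant factor, so x homes in on 1; a step size that is too large would overshoot and bounce around instead.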
Error = (target – actual)^{2}
df/dx = df/dy * dy/dx. This is called the chain rule.
dE/dW_{jk}: how does the error E change as the weight W_{jk} changes? That’s the slope of the error function that we want to descend towards the minimum.
dE/dW_{jk} = d/dW_{jk} (t_{k} – o_{k})^{2}
dE/dW_{jk} = dE/do_{k} * do_{k}/dW_{jk}. The first bit is a simple derivative of a squared function: dE/dW_{jk} = –2(t_{k} – o_{k}) * do_{k}/dW_{jk}. o_{k} is the output of node k, which is the sigmoid function applied to the weighted sum of the connected incoming signals.
d/dX sigmoid (X) = sigmoid (X) ( 1 – sigmoid (X) )
dE/dW_{jk} = –2(t_{k} – o_{k}) * sigmoid ( ∑_{j} W_{jk} * o_{j} ) ( 1 – sigmoid ( ∑_{j} W_{jk} * o_{j} ) ) * o_{j} . Let’s get rid of the 2 at the front, because we are only interested in the direction of the slope of the error function so that we can descend it.
dE/dW_{jk} = – (t_{k} – o_{k}) * sigmoid ( ∑_{j} W_{jk }* o_{j} ) ( 1 – sigmoid ( ∑_{j} W_{jk} * o_{j} ) ) * o_{j}
The first part is the ( target – actual ) error. The sum expression inside the sigmoids is the signal into the final layer node. It’s the signal into a node before the activation squashing function is applied. The last part is the output from the previous hidden layer node j.
dE/dW_{ij}_{ }= – ( e_{j} ) * sigmoid ( ∑_{i} W_{ij} * o_{i} ) ( 1 – sigmoid ( ∑_{i} W_{ij} * o_{i} ) ) * o_{i}
The first part, which was the ( target – actual ) error, now becomes the recombined backpropagated error out of the hidden nodes. The sigmoid parts stay the same, but the sum expressions inside refer to the preceding layers, so the sum is over all the inputs moderated by the weights into a hidden node j. The last part is now the output of the first layer of nodes, o_{i}, which happens to be the input signals. Remember the weights are changed in a direction opposite to the gradient, and we moderate the change in weights by using a learning factor.
new W_{jk} = old W_{jk} – ɑ * dE/dW_{jk} . The updated weight Wjk is the old weight adjusted by the negative of the error slope we just worked out. It’s negative because we want to decrease the weight if we have a positive slope, and increase it if we have a negative slope. The symbol ɑ , is a factor which moderates the strength of these changes to make sure we don’t overshoot. It’s called a learning rate.
△ W_{jk} = ɑ * E_{k} * O_{k} ( 1 – O_{k} ) * O_{j}^{T}
O_{j}^{T} are the values from previous layer transposed. Sigmoids have disappeared because they were simply the node outputs O_{k}.
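The whole matrix weight update can be sketched as a single line of code. All the numbers below are illustrative stand-ins for one layer’s outputs and errors.

```python
import numpy as np

# The matrix weight update dW = alpha * E_k * O_k * (1 - O_k) . O_j^T,
# applied to one layer's weights (element-wise products, then an
# outer product with the previous layer's outputs).
alpha = 0.3                               # learning rate

O_j = np.array([[0.4], [0.5]])            # outputs of the previous layer
O_k = np.array([[0.7], [0.6]])            # outputs of this layer
E_k = np.array([[0.1], [-0.2]])           # errors at this layer

delta_W = alpha * (E_k * O_k * (1 - O_k)) @ O_j.T

W = np.array([[0.9, 0.3],                 # current weights into this layer
              [0.2, 0.8]])
W += delta_W                              # nudge the weights
```

Each row of `delta_W` is scaled by that node’s error and by the sigmoid slope O_k(1 – O_k), so weights into a node with a positive error grow and those into a node with a negative error shrink.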