
Neural networks and backpropagation explained in a simple way

Step 4: Differentiation

Obviously, we can use any optimisation technique that modifies the internal weights of the neural network in order to minimise the total loss function we previously defined. These techniques include genetic algorithms, greedy search, or even a simple brute-force search. In our simple numerical example, with only one weight parameter W to optimise, we can scan from -1000.0 to +1000.0 in steps of 0.001 and keep the W that gives the smallest sum of squared errors over the dataset.
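A minimal sketch of that brute-force scan in Python (the dataset is the article's toy y = 2x example; the scan range is narrowed to ±10 here so it runs quickly, but the idea is identical for ±1000, and the function name `total_squared_error` is just illustrative):

```python
# Toy training set for y = 2x, as used throughout this article.
inputs  = [0, 1, 2, 3, 4]
outputs = [0, 2, 4, 6, 8]

def total_squared_error(w):
    """Sum of squared errors of the one-weight model y_pred = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(inputs, outputs))

# Brute-force scan: try every W in steps of 0.001 and keep the best one.
# (Integer loop counter avoids floating-point drift from repeated += 0.001.)
best_w, best_err = None, float("inf")
for i in range(-10000, 10001):          # W from -10.000 to +10.000
    w = i / 1000
    err = total_squared_error(w)
    if err < best_err:
        best_w, best_err = w, err

print(best_w, best_err)  # the scan finds W = 2.0 with zero error
```

Even at this tiny scale the scan evaluates the loss 20,001 times for a single weight, which is why the next paragraph argues this approach cannot scale.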

This might work if the model has only very few parameters and we don't care much about precision. However, if we are training a NN over an array of 600x400 inputs (as in image processing), we can very easily reach models with millions of weights to optimise, and brute force is not even imaginable: it is a pure waste of computational resources!

Luckily for us, there is a powerful concept in mathematics that can guide us in optimising the weights: differentiation. Basically, it deals with the derivative of the loss function. In mathematics, the derivative of a function at a certain point gives the rate, or speed, at which that function is changing its value at that point.

In order to see the effect of the derivative, we can ask ourselves the following question: how much will the total error change if we change the internal weight of the neural network by a certain small value δW? For the sake of simplicity we will take δW = 0.0001 (in reality it should be much smaller!).

Let’s recalculate the sum of the squares of errors when the weight W changes very slightly:

+--------+--------+-------+-----------+------------+---------------+
| Input  | Output |  W=3  | SE (W=3)  |  W=3.0001  | SE (W=3.0001) |
+--------+--------+-------+-----------+------------+---------------+
| 0      |      0 |     0 |         0 |          0 |             0 |
| 1      |      2 |     3 |         1 |     3.0001 |        1.0002 |
| 2      |      4 |     6 |         4 |     6.0002 |        4.0008 |
| 3      |      6 |     9 |         9 |     9.0003 |        9.0018 |
| 4      |      8 |    12 |        16 |    12.0004 |       16.0032 |
| Total: |      - |     - |        30 |          - |        30.006 |
+--------+--------+-------+-----------+------------+---------------+

Now, as we can see from this table, if we increase W from 3 to 3.0001, the sum of squared errors increases from 30 to 30.006. Since we know that the best function fitting this model is y = 2x, increasing the weight from 3 to 3.0001 should obviously create a little more error (we are moving further away from the intuitively correct weight of 2: since 3.0001 > 3 > 2, the error is higher).

But what we really care about is the rate at which the error changes relative to the change in the weight. Here, that rate is an increase of 0.006 in the total error for each 0.0001 increase in the weight: a rate of 0.006/0.0001 = 60x!

It works in both directions: if we decrease the weight by 0.0001, we should be able to decrease the total error by 0.006 as well. Here is the proof: if you run the calculation again at W = 2.9999, you get an error of 29.994. We managed to decrease the total error!
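The table's numbers are easy to verify in code. This snippet nudges W by δW = 0.0001 in both directions and computes the resulting rate of change (a finite-difference sketch; `total_squared_error` is an illustrative helper name):

```python
# Toy training set for y = 2x.
inputs  = [0, 1, 2, 3, 4]
outputs = [0, 2, 4, 6, 8]

def total_squared_error(w):
    """Sum of squared errors of the one-weight model y_pred = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(inputs, outputs))

dw = 0.0001
e_base = total_squared_error(3)        # 30, matching the table
e_up   = total_squared_error(3 + dw)   # about 30.006
e_down = total_squared_error(3 - dw)   # about 29.994

# Rate of change of the error with respect to the weight, near W = 3.
rate = (e_up - e_base) / dw            # about 60
print(e_base, e_up, e_down, rate)
```

This is exactly the finite-difference approximation of the derivative, which the next paragraph replaces with the exact mathematical derivative.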

We could have obtained this rate directly by calculating the derivative of the loss function. The advantage of using the mathematical derivative is that it is much faster and more precise to compute (fewer floating-point precision problems).
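For this toy model the derivative can be worked out by hand. Since the loss is L(w) = Σ(w·x − y)² with y = 2x, it simplifies to L(w) = (w − 2)²·Σx², so dL/dw = 2·(w − 2)·Σx². A sketch of that closed form (the function name `dloss_dw` is illustrative):

```python
inputs = [0, 1, 2, 3, 4]

# Sum of x squared over the dataset: 0 + 1 + 4 + 9 + 16 = 30.
sum_x2 = sum(x * x for x in inputs)

def dloss_dw(w):
    """Exact derivative of L(w) = (w - 2)^2 * sum(x^2) with respect to w."""
    return 2 * (w - 2) * sum_x2

print(dloss_dw(3))  # 60, exactly the 60x rate measured numerically
```

At w = 3 the exact derivative is 60, confirming the finite-difference estimate, and at w = 2 it is 0, which is the minimum the bullets below describe.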

Here is what our loss function looks like:

  • If w=2, the loss is 0, since the neural network's output fits the training set perfectly.
  • If w<2, the loss is positive, but the derivative is negative, meaning that an increase in the weight will decrease the loss.
  • At w=2, the loss is 0 and the derivative is 0: we have reached a perfect model, and no update is needed.
  • If w>2, the loss becomes positive again, but the derivative is positive as well, meaning that any further increase in the weight will increase the loss even more!

If we initialise the network randomly, we land at some random point on this curve (let's say w=3). The learning process then goes like this:

- Check the derivative.
- If it is positive, meaning the error increases when we increase the weight, then we should decrease the weight.
- If it is negative, meaning the error decreases when we increase the weight, then we should increase the weight.
- If it is 0, we do nothing: we have reached our stable point.

In simple terms, we are designing a process that acts like gravity: no matter where we randomly initialise the ball on this error-function curve, a kind of force field drives it back down to the lowest energy level, at ground 0.