
Beginning deep learning with 500 lines of Julia

The two axes are w1 and w2, two parameters of our network, and the contour plot represents the loss with a minimum at x. If we start at x0, the Newton direction (in red) points almost towards the minimum, whereas the gradient (in green), perpendicular to the contours, points to the right.
Unfortunately, Newton's direction is expensive to compute. However, it is also probably unnecessary, for at least two reasons: (1) Newton gives us the ideal direction for second-degree objective functions, which our neural network loss almost certainly is not; (2) the loss function whose gradient backprop calculated is the loss for the last minibatch/instance only, which at best is a very noisy version of the real loss function, so we shouldn't spend too much effort getting it exactly right.
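To make the difference between the two directions concrete, here is a minimal sketch on a toy quadratic loss. The matrix `A`, the vector `b`, and the starting point `w0` are made up for illustration and have nothing to do with KUnet; the code assumes Julia 1.x.

```julia
using LinearAlgebra

# Hypothetical loss f(w) = 0.5*w'*A*w - b'*w: a bowl that is much steeper
# along w1 than along w2, so its contours are elongated ellipses.
A = [10.0 0.0; 0.0 1.0]      # Hessian (constant for a quadratic)
b = [10.0, 1.0]
grad(w) = A * w - b          # gradient of f at w

w0 = [0.0, 0.0]              # starting point
wmin = A \ b                 # exact minimizer, solves A*w = b

g = grad(w0)
d_grad = -g                  # steepest descent direction
d_newton = -(A \ g)          # Newton direction: solve H*d = -g

# Cosine of the angle between a direction and the vector to the minimum.
cosine(u, v) = dot(u, v) / (norm(u) * norm(v))

println("Newton:   ", cosine(d_newton, wmin - w0))
println("gradient: ", cosine(d_grad, wmin - w0))
```

Because this toy loss is exactly quadratic, `w0 + d_newton` lands on the minimum in one step, while the negative gradient is pulled towards the steep w1 axis. For a real network the Hessian has one row and column per weight, which is where the cost of computing the Newton direction comes from.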
So people have come up with various approximate methods to improve the step direction. Instead of multiplying each component of the gradient by the same learning rate, these methods scale the components separately using their running average (momentum, Nesterov) or RMS (Adagrad, Rmsprop). I realize this necessarily short summary barely covers what has been implemented in KUnet and doesn't do justice to the literature or cover most of the important ideas. The interested reader can start with a standard textbook on numerical optimization, and peruse the latest papers on optimization in deep learning.
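As a rough illustration of these update rules, here is a minimal sketch in plain Julia 1.x. The function names, argument order, and default hyperparameters are hypothetical and are not KUnet's actual API; Nesterov momentum is left out for brevity.

```julia
# Each function updates the weights w in place, given the gradient g,
# and returns w. The state arrays v and g2 must have the same shape as w.

# Plain SGD: the same learning rate for every component.
function sgd!(w, g; lr=0.01)
    w .-= lr .* g
    return w
end

# Momentum: step along a decaying running sum of past gradients, kept in v.
function momentum!(w, g, v; lr=0.01, mu=0.9)
    v .= mu .* v .+ g
    w .-= lr .* v
    return w
end

# Adagrad: divide each component by the root of its accumulated squared gradient.
function adagrad!(w, g, g2; lr=0.01, epsilon=1e-8)
    g2 .+= g .^ 2
    w .-= lr .* g ./ (sqrt.(g2) .+ epsilon)
    return w
end

# Rmsprop: like Adagrad, but g2 is an exponentially decaying average of g.^2.
function rmsprop!(w, g, g2; lr=0.01, rho=0.9, epsilon=1e-8)
    g2 .= rho .* g2 .+ (1 - rho) .* g .^ 2
    w .-= lr .* g ./ (sqrt.(g2) .+ epsilon)
    return w
end

# Usage: two components whose gradients differ by a factor of 100.
w = [1.0, 1.0]
g = [10.0, 0.1]
g2 = zeros(2)
adagrad!(w, g, g2; lr=0.1)
println(w)   # both components have moved by roughly the same amount
```

The usage lines show the point of per-component scaling: with plain SGD the first weight would move 100 times further than the second, whereas the first Adagrad step moves both by about `lr`.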
Minimize what? The final problem with gradient descent, other than not telling us the ideal step size or direction, is that it is not even minimizing the right objective! We want small loss on never-before-seen test data, not just on the training data. The truth is, a sufficiently large neural network with a good optimization algorithm can get arbitrarily low loss on any finite training data (e.g. by just memorizing the answers). And it can typically do so in many different ways (there are typically many different local minima of the training loss in weight space). Some of those ways will generalize well to unseen data, some won't. And unseen data is (by definition) not seen, so how will we ever know which weight settings will do well on it? There are at least three ways people deal with this problem: (1) Bayes tells us that we should use all possible networks and weigh their answers by how well they do on training data (see Radford Neal's fbm); (2) new methods like dropout, or adding distortions and noise to inputs and weights during training, seem to help generalization; (3) pressuring the optimization to stay in one corner of the weight space (e.g. L1, L2, maxnorm regularization) helps generalization.
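Options (2) and (3) are simple enough to sketch in a few lines of Julia 1.x. The names below are hypothetical and this is not how KUnet implements them; in particular, the "inverted" dropout scaling and the choice of applying maxnorm to each row of the weight matrix (one row per output unit) are assumptions made for this example.

```julia
using Random

# L2 regularization: add the gradient of (l2/2)*sum(w.^2) to the loss gradient g.
l2reg!(g, w; l2=1e-4) = (g .+= l2 .* w; g)

# L1 regularization: add a subgradient of l1*sum(abs.(w)) to the loss gradient g.
l1reg!(g, w; l1=1e-4) = (g .+= l1 .* sign.(w); g)

# Maxnorm: after an update, rescale any row of w whose L2 norm exceeds maxnorm.
function maxnorm!(w::AbstractMatrix; maxnorm=3.0)
    for i in axes(w, 1)
        n = sqrt(sum(abs2, view(w, i, :)))
        if n > maxnorm
            w[i, :] .*= maxnorm / n
        end
    end
    return w
end

# Dropout (inverted variant): zero each input with probability p during
# training and scale the survivors by 1/(1-p) to keep the expected value.
function dropout(x, p; rng=Random.default_rng())
    mask = rand(rng, size(x)...) .>= p
    return x .* mask ./ (1 - p)
end

# Usage: the first row has norm 5 and is shrunk to norm 1, the second is untouched.
w = [3.0 4.0; 0.3 0.4]
maxnorm!(w; maxnorm=1.0)
println(w)
```

The L1 and L2 terms change the gradient before the weight update, maxnorm is applied to the weights after it, and dropout acts on a layer's input during training only.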
KUnet views dropout (and other distortion methods) as a preprocessing step of each layer. The other techniques (learning rate, momentum, regularization etc.) are declared as UpdateParam's for l.w in l.pw and for l.b in l.pb for each layer: