
Understanding learning rate and weight decay in Caffe

The learning rate is a parameter that determines how much each update step changes the current value of the weights. Weight decay is an additional term in the weight update rule that causes the weights to decay exponentially toward zero if no other update is applied.

So let's say that we have a cost or error function $E(w)$ that we want to minimize. Gradient descent tells us to modify the weights $w$ in the direction of steepest descent in $E$:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i},$$

where $\eta$ is the learning rate. If it is large, the weights $w_i$ change by a correspondingly large amount (in general it shouldn't be too large, otherwise you'll overshoot the local minimum of your cost function).
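The effect of the learning rate can be seen in a minimal Python sketch. The toy cost $E(w) = w^2/2$, the helper names `grad_descent` and `toy_grad`, and the specific learning rates are assumptions chosen for illustration; they are not taken from Caffe.

```python
def grad_descent(grad, w0, lr, steps):
    """Plain gradient descent on a scalar weight: w <- w - lr * dE/dw."""
    w = w0
    history = [w]
    for _ in range(steps):
        w = w - lr * grad(w)
        history.append(w)
    return history


def toy_grad(w):
    """Gradient of the toy cost E(w) = w**2 / 2, whose minimum is at w = 0."""
    return w


# A small learning rate shrinks w steadily toward the minimum.
small = grad_descent(toy_grad, w0=1.0, lr=0.1, steps=50)

# A learning rate that is too large overshoots the minimum on every step,
# so |w| grows instead of shrinking.
large = grad_descent(toy_grad, w0=1.0, lr=2.5, steps=50)

print("lr=0.1 ->", small[-1])
print("lr=2.5 ->", large[-1])
```

For this particular cost each step multiplies $w$ by $(1 - \eta)$, so the iteration converges only when $|1 - \eta| < 1$. Real cost functions have different thresholds, but the overshooting behaviour is the same in spirit.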

In order to effectively limit the number of free parameters in your model so as to avoid over-fitting, it is possible to regularize the cost function. An easy way to do that is by introducing a zero-mean Gaussian prior over the weights, which is equivalent to changing the cost function to $\tilde{E}(w) = E(w) + \frac{\lambda}{2} w^2$. In practice this penalizes large weights and effectively limits the freedom in your model. The regularization parameter $\lambda$ determines how you trade off the original cost $E$ against the penalty on large weights.
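As a rough illustration of how the penalty discriminates between solutions, the sketch below evaluates the regularized cost for two weight vectors that fit a toy problem equally well. The cost `toy_cost`, the helper `regularized_cost`, and the value of `lam` are hypothetical, and $w^2$ is read here as the sum of squared weights.

```python
def toy_cost(weights):
    """Hypothetical data cost: zero whenever the weights sum to 1."""
    return (sum(weights) - 1.0) ** 2


def regularized_cost(cost, weights, lam):
    """Return E(w) + (lam / 2) * sum_i w_i**2."""
    penalty = 0.5 * lam * sum(w * w for w in weights)
    return cost(weights) + penalty


lam = 0.1
small_weights = [0.5, 0.5]    # fits the toy problem with small weights
large_weights = [10.0, -9.0]  # fits it equally well with large weights

print(regularized_cost(toy_cost, small_weights, lam))
print(regularized_cost(toy_cost, large_weights, lam))
```

Both vectors have zero unregularized cost, so only the penalty separates them: the large-weight solution pays far more, and a larger `lam` widens that gap.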

Applying gradient descent to this new cost function we obtain:

$$w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i} - \eta \lambda w_i.$$
The new term $-\eta \lambda w_i$ coming from the regularization causes the weight to decay in proportion to its size.
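The exponential decay can be checked with a short Python sketch of the regularized update. The function name `sgd_step` and the values of `lr` and `lam` are assumptions for illustration. In Caffe the solver settings `base_lr` and `weight_decay` roughly play the roles of $\eta$ and $\lambda$, although the real solver also applies momentum and per-layer multipliers that are left out here.

```python
def sgd_step(w, grad, lr, weight_decay):
    """One update: w <- w - lr * dE/dw - lr * weight_decay * w."""
    return w - lr * grad - lr * weight_decay * w


lr, lam = 0.1, 0.5
w = 1.0
trace = [w]

# With a zero data gradient, only the decay term acts on the weight,
# so each step multiplies w by (1 - lr * lam).
for _ in range(100):
    w = sgd_step(w, grad=0.0, lr=lr, weight_decay=lam)
    trace.append(w)

print(trace[100], (1 - lr * lam) ** 100)
```

After $t$ steps without a data gradient the weight equals $(1 - \eta\lambda)^t w_0$, which is the exponential decay toward zero described at the start of the article.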