
Learning Rate Decay


  When training a model, we often run into the following situation: after balancing training speed against the loss, we settle on a relatively suitable learning rate, but once the training loss falls to a certain level it stops decreasing — for example, the training loss keeps oscillating between 0.7 and 0.9 and cannot drop any further, as shown below:

[Figure: training loss curve oscillating between 0.7 and 0.9]

  This can usually be addressed by lowering the learning rate appropriately. But a lower learning rate also lengthens the time needed for training.

  Learning rate decay is a way to balance these two competing concerns. The basic idea: let the learning rate gradually decay as training progresses.

  There are two basic ways to implement learning rate decay:

  1. Step decay. For example: halve the learning rate every 5 epochs.
  2. Exponential decay. For example: multiply the learning rate by 0.1 every 5 epochs.
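Either schedule can be sketched in a few lines of plain Python (the helper names below are my own, not from any library):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=5):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

def exponential_decay(initial_lr, epoch, decay_rate=0.1, epochs_per_drop=5):
    """Multiply the learning rate by `decay_rate` every `epochs_per_drop` epochs."""
    return initial_lr * decay_rate ** (epoch // epochs_per_drop)

# e.g. with initial_lr = 0.1:
#   step_decay(0.1, 0)         -> 0.1
#   step_decay(0.1, 5)         -> 0.05
#   exponential_decay(0.1, 10) -> 0.001
```

Note that both are piecewise-constant schedules; they differ only in how aggressively the rate shrinks at each drop.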

  In TensorFlow, tf.train.AdamOptimizer is a convenient way to get a step size that adapts over the course of training. Let's take a rough look at how AdamOptimizer works.

  From the official documentation:

tf.train.AdamOptimizer

class tf.train.AdamOptimizer

Defined in tensorflow/python/training/adam.py.

See the guide: Training > Optimizers

Optimizer that implements the Adam algorithm.

See Kingma et al., 2014 (pdf).

Methods

__init__

__init__(
    learning_rate=0.001,
    beta1=0.9,
    beta2=0.999,
    epsilon=1e-08,
    use_locking=False,
    name='Adam'
)

Construct a new Adam optimizer.

Initialization:

m_0 <- 0 (Initialize initial 1st moment vector)
v_0 <- 0 (Initialize initial 2nd moment vector)
t <- 0 (Initialize timestep)

The update rule for variable with gradient g uses an optimization described at the end of section 2 of the paper:

t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)

The default value of 1e-8 for epsilon might not be a good default in general. For example, when training an Inception network on ImageNet a current good choice is 1.0 or 0.1. Note that since AdamOptimizer uses the formulation just before Section 2.1 of the Kingma and Ba paper rather than the formulation in Algorithm 1, the "epsilon" referred to here is "epsilon hat" in the paper.

  As the timestep t increases, the effective step size is updated as lr_t = learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t). Note that this factor tends to 1 as t grows, so lr_t approaches learning_rate; strictly speaking it is the paper's bias correction of the moment estimates rather than a decay toward zero — Adam's adaptivity comes from scaling each parameter's step by its moment estimates m_t and v_t.

  The variable is then updated as variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
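The quoted update rule can be sketched as a minimal single-variable Adam step in plain Python (a toy illustration of the formulas above, not TensorFlow's actual implementation):

```python
import math

def adam_step(variable, g, m, v, t,
              learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update for a scalar variable, following the documented rules."""
    t += 1
    # Bias-corrected step size: approaches learning_rate as t grows.
    lr_t = learning_rate * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    m = beta1 * m + (1 - beta1) * g        # 1st moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g    # 2nd moment (mean of squared gradients)
    variable = variable - lr_t * m / (math.sqrt(v) + epsilon)
    return variable, m, v, t

# Minimizing f(x) = x^2 (gradient is 2x) drives x toward 0:
x, m, v, t = 1.0, 0.0, 0.0, 0
for _ in range(200):
    x, m, v, t = adam_step(x, 2 * x, m, v, t, learning_rate=0.1)
```

Because m_t and v_t both start at zero, the sqrt(1 - beta2^t) / (1 - beta1^t) factor compensates for their bias toward zero in the first steps.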

  The default value of epsilon may not suit every model. For example, when training an Inception network on the ImageNet dataset, a currently good choice is 1.0 or 0.1.
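A quick back-of-the-envelope check of why epsilon matters (update_magnitude is a hypothetical helper that simply evaluates the update rule quoted above): when the second-moment estimate v_t is tiny, epsilon dominates the denominator and caps the step size.

```python
def update_magnitude(lr_t, m_t, v_t, epsilon):
    """Size of the Adam step for given moment estimates."""
    return lr_t * m_t / (v_t ** 0.5 + epsilon)

# With tiny moment estimates (m_t = 1e-4, v_t = 1e-8, so sqrt(v_t) = 1e-4):
small = update_magnitude(0.001, 1e-4, 1e-8, epsilon=1.0)   # ~1e-7: epsilon caps the step
large = update_magnitude(0.001, 1e-4, 1e-8, epsilon=1e-8)  # ~0.001: nearly the full lr_t
```

A larger epsilon therefore keeps updates small and stable for parameters whose gradients have been near zero, which is one reason 1.0 or 0.1 can work better than 1e-8 on some models.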
