
Batch Normalization: Theoretical Foundations and TensorFlow Implementation

Batch Normalization Theory

Batch Normalization essentially normalizes the output feature maps. It was first proposed in the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, and it helps alleviate the notorious vanishing and exploding gradient problems in deep learning. Batch normalization is typically placed after a convolutional layer and before the activation layer.
The theory is as follows:
[Figure: Batch Normalization flowchart]
The concrete steps carried out inside the Batch Normalization black box in the flowchart are as follows (a small numerical example follows the symbol list below):
$$
\begin{aligned}
\mu_B &= \frac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)} \\
\sigma_B^2 &= \frac{1}{m_B} \sum_{i=1}^{m_B} \left( \mathbf{x}^{(i)} - \mu_B \right)^2 \\
\widehat{\mathbf{x}}^{(i)} &= \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \\
\mathbf{z}^{(i)} &= \gamma\, \widehat{\mathbf{x}}^{(i)} + \beta
\end{aligned}
$$
• μB is the empirical mean, evaluated over the whole mini-batch B.
• σB is the empirical standard deviation, also evaluated over the whole mini-batch.
• mB is the number of instances in the mini-batch
• x̂(i) is the zero-centered and normalized input.
• γ is the scaling parameter for the layer.
• β is the shifting parameter (offset) for the layer.
• ϵ is a tiny number to avoid division by zero (typically 10⁻³). This is called a smoothing term.
• z(i) is the output of the BN operation: it is a scaled and shifted version of the inputs.[^1]
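
To make the four equations concrete, here is a minimal NumPy sketch (not the TensorFlow implementation) that applies them to a toy mini-batch; the batch values, γ, β, and ϵ are arbitrary illustrative choices:

import numpy as np

# toy mini-batch: m_B = 4 instances of a single feature
x = np.array([1.0, 2.0, 3.0, 4.0])
gamma, beta, eps = 1.0, 0.0, 1e-3    # scale, offset, smoothing term

mu_B = x.mean()                       # empirical mean over the mini-batch
var_B = ((x - mu_B) ** 2).mean()      # empirical variance over the mini-batch
x_hat = (x - mu_B) / np.sqrt(var_B + eps)   # zero-centered, normalized input
z = gamma * x_hat + beta              # scaled and shifted output of BN

print(mu_B, var_B)   # 2.5 1.25
print(z)             # approximately [-1.34, -0.45, 0.45, 1.34]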

Implementing Batch Normalization in TensorFlow

TensorFlow provides the following function for batch normalization (a usage sketch follows the parameter descriptions below):

tf.layers.batch_normalization(inputs, axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer=tf.zeros_initializer(), gamma_initializer=tf.ones_initializer(), moving_mean_initializer=tf.zeros_initializer(), moving_variance_initializer=tf.ones_initializer(), beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None, training=False, trainable=True, name=None, reuse=None, renorm=False, renorm_clipping=None, renorm_momentum=0.99, fused=None, virtual_batch_size=None, adjustment=None)

inputs: the input tensor; a required argument
training: whether the layer is in training mode; in practice you should always pass True or False explicitly
beta_initializer: initializer for β from the theory above; only takes effect when center=True
gamma_initializer: initializer for γ from the theory above; only takes effect when scale=True
moving_mean_initializer: initializer for the moving mean μ mentioned above
moving_variance_initializer: initializer for the moving variance σ² mentioned above
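
As a hedged illustration of how these arguments are used, the sketch below (TensorFlow 1.x style; the tensor `images` and the placeholder `is_training` are assumptions made for the example) places batch normalization after a convolutional layer and before the activation, as described earlier:

import tensorflow as tf

# assumed inputs: a batch of images and a flag telling BN whether to use
# batch statistics (training) or the moving averages (inference)
images = tf.placeholder(tf.float32, [None, 28, 28, 1])
is_training = tf.placeholder(tf.bool, [])

# convolution without activation, so that BN sits between conv and ReLU
conv = tf.layers.conv2d(images, filters=32, kernel_size=3, padding='same',
                        activation=None)
bn = tf.layers.batch_normalization(conv,
                                   beta_initializer=tf.zeros_initializer(),   # initial β (offset)
                                   gamma_initializer=tf.ones_initializer(),   # initial γ (scale)
                                   training=is_training)
out = tf.nn.relu(bn)   # activation applied after BN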

Batch normalization wrapper function

import tensorflow as tf

def batch_normalization(inputs, is_training):
    # initialize the moving variance to a small constant instead of the default 1.0
    moving_var = tf.constant_initializer(0.01)
    # training=True uses the batch statistics; training=False uses the moving averages
    output = tf.layers.batch_normalization(inputs, moving_variance_initializer=moving_var, training=is_training)
    return output
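
A possible way to call this wrapper in a TensorFlow 1.x graph (the tensor shapes and names below are illustrative assumptions):

inputs = tf.placeholder(tf.float32, [None, 64])   # illustrative feature tensor
is_training = tf.placeholder(tf.bool, [])         # fed True during training, False at inference

normalized = batch_normalization(inputs, is_training)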

Training considerations for batch normalization

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
update_weight = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

These two lines are commonly seen during training; see the previous post for a detailed comparison. In short, the first line retrieves the operations that must be run as part of each training step, such as the updates to the moving mean and variance of batch normalization; the second line retrieves the trainable variables, such as weights and biases. Therefore, whenever batch normalization is used, the training op must be created under the control dependencies obtained from the first line. The second line is generally used when only certain layers should be trained, that is, to freeze part of the layers; if it is not restricted, all variables are trained (see the sketch after the training snippet below).
The training code is as follows:

# simply run train_op1 for training
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op1 = optimizer.minimize(loss)
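
If only part of the network should be trained (the freezing use case mentioned above), one possible sketch, assuming the same `optimizer` and `loss` as above and an illustrative variable scope named 'head', is to restrict the variables passed to `minimize` via `var_list`:

# gather only the trainable variables under the (illustrative) scope 'head'
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='head')

with tf.control_dependencies(update_ops):
    # gradients are applied only to train_vars; every other layer stays frozen,
    # while the BN moving statistics are still updated through update_ops
    train_op2 = optimizer.minimize(loss, var_list=train_vars)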

References

[^1]: Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow. 2017.