
The principle of BN (batch normalization) in CNNs

I first came across batch normalization (BN) while reading the ladder network paper (https://arxiv.org/pdf/1507.02672v2.pdf). The paper mentions that BN speeds up convergence, among other benefits, but I did not really understand why, so I searched online for material about BN.

After reading a Zhihu discussion on why Batch Normalization works so well in deep learning and a study note on Batch Normalization on CSDN, I finally have a basic understanding of BN. This post only summarizes the concrete procedure of BN; a deeper understanding of it, why BN is needed and whether it really works, is something I am still studying and experimenting with.

BN adds a standardization step to the input of every layer during neural network training.

A traditional neural network only standardizes the input samples x; BN additionally standardizes the input of every hidden layer.


Because forcing every standardized input to zero mean and unit variance would limit what a layer can represent, BN also learns a per-feature scale γ and shift β. For a layer input s over a mini-batch B = {s_1, …, s_m}, the transform is:

  1. μ_B = (1/m) Σ_i s_i              (mini-batch mean)
  2. σ_B² = (1/m) Σ_i (s_i − μ_B)²     (mini-batch variance)
  3. ŝ_i = (s_i − μ_B) / √(σ_B² + ε)   (normalize; ε is a small constant for numerical stability)
  4. y_i = γ · ŝ_i + β                 (scale and shift)

Note that the computation above is used during training. At test time, the μ and σ² that are used are usually the moving averages accumulated during training.
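
To make the train/test distinction concrete, here is a minimal NumPy sketch of the transform described above (the function and variable names are my own, for illustration only; the momentum of 0.99 mirrors the decay used in the TensorFlow code later in this post):

import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var, momentum=0.99, eps=1e-5):
    # normalize with the statistics of this mini-batch and update the running averages
    mu = x.mean(axis=0)                     # per-feature mini-batch mean
    var = x.var(axis=0)                     # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # standardize
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var
    return gamma * x_hat + beta, running_mean, running_var

def batch_norm_test(x, gamma, beta, running_mean, running_var, eps=1e-5):
    # at test time, reuse the moving averages instead of batch statistics
    x_hat = (x - running_mean) / np.sqrt(running_var + eps)
    return gamma * x_hat + beta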

Before looking at the full code, let's first look at the usage of two averaging functions:

mean, variance = tf.nn.moments(x, axes, name=None, keep_dims=False)

The input x is the sample tensor, e.g. of shape [batchsize, height, width, kernels].
axes is a list specifying the dimensions along which the statistics are computed.
The function returns the mean and the variance.

import numpy as np
import tensorflow as tf

# Note: at first I wrote batch = np.array(np.random.randint(1, 100, [10, 5])) without a dtype,
# so batch.dtype was int64 and sess.run([mm, vv]) kept raising InvalidArgumentError,
# because the computation inside tf.nn.moments requires floating-point inputs.
batch = np.array(np.random.randint(1, 100, [10, 5]), dtype=np.float64)
mm, vv = tf.nn.moments(batch, axes=[0])        # mean and variance along dimension 0
# mm, vv = tf.nn.moments(batch, axes=[0, 1])   # mean and variance over all elements
sess = tf.Session()
print batch
print sess.run([mm, vv])   # mind the parameter dtype
sess.close()

Output:

[[ 53.   9.  67.  30.  69.]
 [ 79.  25.   7.  80.  16.]
 [ 77.  67.  60.  30.  85.]
 [ 45.  14.  92.  12.  67.]
 [ 32.  98.  70.  98.  48.]
 [ 45.  89.  73.  73.  80.]
 [ 35.  67.  21.  77.  63.]
 [ 24.  33.  56.  85.  17.]
 [ 88.  43.  58.  82.  59.]
 [ 53.  23.  34.   4.  33.]]
[array([ 53.1,  46.8,  53.8,  57.1,  53.7]), array([  421.09,   896.96,   598.36,  1056.69,   542.61])]
  

ema = tf.train.ExponentialMovingAverage(decay): computing a moving average requires a decay rate, which controls how fast the model updates. ExponentialMovingAverage maintains a shadow variable for every variable it is applied to. The shadow variable is initialized to the variable's initial value and is then updated as:

shadow_variable = decay × shadow_variable + (1 − decay) × variable

From the formula above, decay controls how fast the shadow value changes; the larger it is, the more stable the result. In practice, decay is usually set very close to 1 (0.99 or 0.999). To let the model update faster in the early stage of training, ExponentialMovingAverage also provides a num_updates argument that sets decay dynamically:

decay = min(decay, (1 + num_updates) / (10 + num_updates))
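
To get a feel for this formula, the small sketch below (mine, not TensorFlow source code) evaluates the effective decay for a few values of num_updates with a configured decay of 0.999; early in training the effective decay is small, so the shadow variable tracks the variable quickly, and it approaches 0.999 later on:

for num_updates in [0, 10, 100, 1000, 10000]:
    # effective decay according to the formula above
    effective = min(0.999, (1.0 + num_updates) / (10.0 + num_updates))
    print("num_updates=%5d  ->  effective decay=%.4f" % (num_updates, effective))
# num_updates=    0  ->  effective decay=0.1000
# num_updates=   10  ->  effective decay=0.5500
# num_updates=  100  ->  effective decay=0.9182
# num_updates= 1000  ->  effective decay=0.9911
# num_updates=10000  ->  effective decay=0.9990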

Here is how I understand the moving average (I am not sure this is right; please point out anything that is wrong):

Suppose we have a time series {a_1, a_2, a_3, ⋯, a_t, a_{t+1}, ⋯}. Applying the update above at every step gives s_t = decay × s_{t−1} + (1 − decay) × a_t, so each new value only nudges the running average, and the weight of older values shrinks by a factor of decay per step; see the sketch below.
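
A plain-Python sketch of that recursion (the function name and inputs are my own); starting from 1.0 and feeding in 2.0, 3.0, 4.0 reproduces the numbers in the output of the TensorFlow example that follows:

def moving_average(series, decay=0.9, init=0.0):
    # each new value gets weight (1 - decay); older values fade by a factor of decay per step
    s = init
    history = []
    for a in series:
        s = decay * s + (1 - decay) * a
        history.append(s)
    return history

print([round(v, 4) for v in moving_average([2.0, 3.0, 4.0], decay=0.9, init=1.0)])
# [1.1, 1.29, 1.561] -- the same values as in the TensorFlow example below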

import tensorflow as tf
graph=tf.Graph()
with graph.as_default():
    w = tf.Variable(dtype=tf.float32,initial_value=1.0)
    ema = tf.train.ExponentialMovingAverage(0.9)
    update = tf.assign_add(w, 1.0)

    with tf.control_dependencies([update]):
        ema_op = ema.apply([w])  # returns an op that updates the moving average; this line and the next must not be swapped

    ema_val = ema.average(w)  # returns the current moving average; the argument cannot be a list

with tf.Session(graph=graph) as sess:
    sess.run(tf.initialize_all_variables())
    for i in range(3):
        print i
        print 'w_old=',sess.run(w)
        print sess.run(ema_op)
        print 'w_new=', sess.run(w)
        print sess.run(ema_val)
        print '**************'
  

Output:

0
w_old= 1.0
None
w_new= 2.0  # running ema_op first executes the update to w
1.1  #0.9*1.0+0.1*2.0=1.1
**************
1
w_old= 2.0
None
w_new= 3.0
1.29  #0.9*1.1+0.1*3.0=1.29
**************
2
w_old= 3.0
None
w_new= 4.0
1.561  #0.9*1.29+0.1*4.0=1.561
  

Complete code for a fully connected network with batch normalization, classifying MNIST handwritten digits:

import tensorflow as tf
#import input_data
from tqdm import tqdm
import numpy as np
import math
from six.moves import cPickle as pickle
# data preprocessing
pickle_file = '/home/sxl/tensor學習/My Udacity/notM/notMNISTs.pickle'
# to speed up computation this is a small preprocessed sample of the MNIST handwritten digits; the data can be downloaded from http://download.csdn.net/detail/whitesilence/9908115
with open(pickle_file, 'rb') as f:
  save = pickle.load(f)
  train_dataset = save['train_dataset']
  train_labels = save['train_labels']
  valid_dataset = save['valid_dataset']
  valid_labels = save['valid_labels']
  test_dataset = save['test_dataset']
  test_labels = save['test_labels']
  del save  # hint to help gc free up memory
  print('Training set', train_dataset.shape, train_labels.shape)
  print('Validation set', valid_dataset.shape, valid_labels.shape)
  print('Test set', test_dataset.shape, test_labels.shape)

image_size = 28
num_labels = 10

def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 0 to [1.0, 0.0, 0.0 ...], 1 to [0.0, 1.0, 0.0 ...]
    labels = (np.arange(num_labels) == labels[:, None]).astype(np.float32)
    return dataset, labels

train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)

# build a 7-layer network
layer_sizes = [784, 1000, 500, 250, 250,250,10]
L = len(layer_sizes) - 1  # number of layers
num_examples = train_dataset.shape[0]
num_epochs = 100
starter_learning_rate = 0.02
decay_after = 15  # epoch after which to begin learning rate decay
batch_size = 120
num_iter = (num_examples/batch_size) * num_epochs  # number of loop iterations

x = tf.placeholder(tf.float32, shape=(None, layer_sizes[0]))
outputs = tf.placeholder(tf.float32)
testing=tf.placeholder(tf.bool)
learning_rate = tf.Variable(starter_learning_rate, trainable=False)

def bi(inits, size, name):
    return tf.Variable(inits * tf.ones([size]), name=name)

def wi(shape, name):
    return tf.Variable(tf.random_normal(shape, name=name)) / math.sqrt(shape[0])

shapes = zip(layer_sizes[:-1], layer_sizes[1:])  # shapes of linear layers

weights = {'W': [wi(s, "W") for s in shapes],  # feedforward weights
           # batch normalization parameter to shift the normalized value
           'beta': [bi(0.0, layer_sizes[l+1], "beta") for l in range(L)],
           # batch normalization parameter to scale the normalized value
           'gamma': [bi(1.0, layer_sizes[l+1], "gamma") for l in range(L)]}

ewma = tf.train.ExponentialMovingAverage(decay=0.99)  # to calculate the moving averages of mean and variance
bn_assigns = []  # this list stores the updates to be made to average mean and variance

def batch_normalization(batch, mean=None, var=None):
    if mean is None or var is None:
        mean, var = tf.nn.moments(batch, axes=[0])
    return (batch - mean) / tf.sqrt(var + tf.constant(1e-10))

# average mean and variance of all layers
running_mean = [tf.Variable(tf.constant(0.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]
running_var = [tf.Variable(tf.constant(1.0, shape=[l]), trainable=False) for l in layer_sizes[1:]]

def update_batch_normalization(batch, l):
    "batch normalize + update average mean and variance of layer l"
    mean, var = tf.nn.moments(batch, axes=[0])
    assign_mean = running_mean[l-1].assign(mean)
    assign_var = running_var[l-1].assign(var)
    bn_assigns.append(ewma.apply([running_mean[l-1], running_var[l-1]]))
    with tf.control_dependencies([assign_mean, assign_var]):
        return (batch - mean) / tf.sqrt(var + 1e-10)


def eval_batch_norm(batch,l):
    mean = ewma.average(running_mean[l - 1])
    var = ewma.average(running_var[l - 1])
    s = batch_normalization(batch, mean, var)
    return s

def net(x,weights,testing=False):
    d={'m': {}, 'v': {}, 'h': {}}
    h=x
    for l in range(1, L+1):
        print "Layer ", l, ": ", layer_sizes[l-1], " -> ", layer_sizes[l]
        d['h'][l-1]=h
        s= tf.matmul(d['h'][l-1], weights['W'][l-1])
        m, v = tf.nn.moments(s, axes=[0])
        if testing:
            s=eval_batch_norm(s,l)
        else:
            s=update_batch_normalization(s, l)
        s=weights['gamma'][l-1] * s + weights["beta"][l-1]
        if l == L:
            # use softmax activation in output layer
            h = tf.nn.softmax(s)
        else:
            h= tf.nn.relu(s)
        d['m'][l]=m
        d['v'][l]=v
    d['h'][l]=h
    return h,d

y,_=net(x,weights)

cost = -tf.reduce_mean(tf.reduce_sum(outputs*tf.log(y), 1))

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(outputs, 1))  # no of correct predictions

accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float")) * tf.constant(100.0)


train_step = tf.train.AdamOptimizer(learning_rate).minimize(cost)

# add the updates of batch normalization statistics to train_step
bn_updates = tf.group(*bn_assigns)
with tf.control_dependencies([train_step]):
    train_step = tf.group(bn_updates)

print "===  Starting Session ==="

sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)

i_iter = 0
print "=== Training ==="
#print "Initial Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"

for i in tqdm(range(i_iter, num_iter)):
    #images, labels = mnist.train.next_batch(batch_size)
    start = (i * batch_size) % num_examples
    images=train_dataset[start:start+batch_size,:]
    labels=train_labels[start:start+batch_size,:]
    sess.run(train_step, feed_dict={x: images, outputs: labels})
    if (i > 1) and ((i+1) % (num_iter/num_epochs) == 0):  # i > 1 and a full epoch has just finished, i.e. all training data has been seen once
        epoch_n = i/(num_examples/batch_size)  # which epoch we are in
        perm = np.arange(num_examples)
        np.random.shuffle(perm)
        train_dataset = train_dataset[perm]  # after every full pass over the training data, reshuffle it so the next epoch does not draw the same batches in the same order
        train_labels = train_labels[perm]
        if (epoch_n+1) >= decay_after:
            # decay learning rate
            # learning_rate = starter_learning_rate * ((num_epochs - epoch_n) / (num_epochs - decay_after))
            ratio = 1.0 * (num_epochs - (epoch_n+1))  # epoch_n + 1 because learning rate is set for next epoch
            ratio = max(0, ratio / (num_epochs - decay_after))
            sess.run(learning_rate.assign(starter_learning_rate * ratio))
        print "Train Accuracy: ",sess.run(accuracy,feed_dict={x: images, outputs: labels})

print "Final Accuracy: ", sess.run(accuracy, feed_dict={x: test_dataset, outputs: test_labels, testing: True}), "%"

sess.close()
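
For reference, TensorFlow also provides a primitive that performs the same normalize / scale / shift step that net() writes out by hand. A minimal sketch (TF 1.x, separate from the code above; the tensor shape here is just an example):

import tensorflow as tf

s = tf.placeholder(tf.float32, shape=(None, 250))   # pre-activation of one layer
gamma = tf.Variable(tf.ones([250]))                  # scale parameter
beta = tf.Variable(tf.zeros([250]))                  # shift parameter
mean, var = tf.nn.moments(s, axes=[0])               # batch statistics
# computes gamma * (s - mean) / sqrt(var + eps) + beta in one call
s_bn = tf.nn.batch_normalization(s, mean, var, offset=beta, scale=gamma,
                                 variance_epsilon=1e-10)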



  

Another reference on batch normalization: http://blog.csdn.net/intelligence1994/article/details/53888270
An introduction to commonly used TensorFlow functions: http://blog.csdn.net/wuqingshan2010/article/details/71056292
