
TensorFlow 2.0 + Keras Pitfall Guide

TensorFlow 2.0 is a major slim-down of the 1.x line: Eager Execution is on by default and Keras is now the default high-level API. These changes make TensorFlow considerably easier to use.

This post documents a rather tortuous encounter with BatchNormalization under Keras + TensorFlow 2.0. This one pitfall nearly cancelled out all of TF 2.0's new conveniences, so if you are working through the official TF 2.0 tutorials, it is worth a read.

 

How the Problem Arose

It all starts with the official tutorial [1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn (which covers transfer learning):

IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)
# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False, weights='imagenet')
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES)
])

With just these few lines we reuse the MobileNetV2 architecture to build a classifier; we can then train the model through the Keras API:

model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])

model.summary()

history = model.fit(train_batches.repeat(),
                    epochs=20,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_batches.repeat(),
                    validation_steps=validation_steps)

Judging from the output, everything looks perfect:

Model: "sequential"
________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
mobilenetv2_1.00_160 (Model) (None, 5, 5, 1280)        2257984
_________________________________________________________________
global_average_pooling2d (Gl (None, 1280)              0
_________________________________________________________________
dense (Dense)                (None, 2)                 1281
=================================================================
Total params: 2,259,265
Trainable params: 1,281
Non-trainable params: 2,257,984
_________________________________________________________________
Epoch 11/20
581/581 [==============================] - 134s 231ms/step - loss: 0.4208 - accuracy: 0.9484 - val_loss: 0.1907 - val_accuracy: 0.9812
Epoch 12/20
581/581 [==============================] - 114s 197ms/step - loss: 0.3359 - accuracy: 0.9570 - val_loss: 0.1835 - val_accuracy: 0.9844
Epoch 13/20
581/581 [==============================] - 116s 200ms/step - loss: 0.2930 - accuracy: 0.9650 - val_loss: 0.1505 - val_accuracy: 0.9844
Epoch 14/20
581/581 [==============================] - 114s 196ms/step - loss: 0.2561 - accuracy: 0.9701 - val_loss: 0.1575 - val_accuracy: 0.9859
Epoch 15/20
581/581 [==============================] - 119s 206ms/step - loss: 0.2302 - accuracy: 0.9715 - val_loss: 0.1600 - val_accuracy: 0.9812
Epoch 16/20
581/581 [==============================] - 115s 197ms/step - loss: 0.2134 - accuracy: 0.9747 - val_loss: 0.1407 - val_accuracy: 0.9828
Epoch 17/20
581/581 [==============================] - 115s 197ms/step - loss: 0.1546 - accuracy: 0.9813 - val_loss: 0.0944 - val_accuracy: 0.9828
Epoch 18/20
581/581 [==============================] - 116s 200ms/step - loss: 0.1636 - accuracy: 0.9794 - val_loss: 0.0947 - val_accuracy: 0.9844
Epoch 19/20
581/581 [==============================] - 115s 198ms/step - loss: 0.1356 - accuracy: 0.9823 - val_loss: 0.1169 - val_accuracy: 0.9828
Epoch 20/20
581/581 [==============================] - 116s 199ms/step - loss: 0.1243 - accuracy: 0.9849 - val_loss: 0.1121 - val_accuracy: 0.9875

However, this style is not convenient for debugging. We wanted fine-grained control over the iterations and access to intermediate results, so we changed the training loop to this:

optimizer = tf.keras.optimizers.RMSprop(lr=base_learning_rate)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')


@tf.function
def train_cls_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = tf.keras.losses.SparseCategoricalCrossentropy()(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy(label, predictions)


for images, labels in train_batches:
    train_cls_step(images, labels)

After retraining, the results were still perfect!

But then we wanted to compare fine-tuning against training from scratch, so we changed the model-building code to this:

base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False, weights=None)

so that the weights are randomly initialized. At this point training went haywire: the loss refused to drop and the accuracy hovered around 50%:

Step #10: loss=0.6937199831008911 acc=46.5625%
Step #20: loss=0.6932525634765625 acc=47.8125%
Step #30: loss=0.699873685836792 acc=49.16666793823242%
Step #40: loss=0.6910845041275024 acc=49.6875%
Step #50: loss=0.6935917139053345 acc=50.0625%
Step #60: loss=0.6965731382369995 acc=49.6875%
Step #70: loss=0.6949992179870605 acc=49.19642639160156%
Step #80: loss=0.6942993402481079 acc=49.84375%
Step #90: loss=0.6933775544166565 acc=49.65277862548828%
Step #100: loss=0.6928421258926392 acc=49.5%
Step #110: loss=0.6883170008659363 acc=49.54545593261719%
Step #120: loss=0.695658802986145 acc=49.453125%
Step #130: loss=0.6875559091567993 acc=49.61538314819336%
Step #140: loss=0.6851695775985718 acc=49.86606979370117%
Step #150: loss=0.6978713274002075 acc=49.875%
Step #160: loss=0.7165156602859497 acc=50.0%
Step #170: loss=0.6945627331733704 acc=49.797794342041016%
Step #180: loss=0.6936900615692139 acc=49.9305534362793%
Step #190: loss=0.6938323974609375 acc=49.83552551269531%
Step #200: loss=0.7030564546585083 acc=49.828125%
Step #210: loss=0.6926192045211792 acc=49.76190185546875%
Step #220: loss=0.6932414770126343 acc=49.786930084228516%
Step #230: loss=0.6924526691436768 acc=49.82337188720703%
Step #240: loss=0.6882281303405762 acc=49.869789123535156%
Step #250: loss=0.6877702474594116 acc=49.86249923706055%
Step #260: loss=0.6933954954147339 acc=49.77163314819336%
Step #270: loss=0.6944763660430908 acc=49.75694274902344%
Step #280: loss=0.6945018768310547 acc=49.49776840209961%

Printing the predictions showed that every output within a batch was exactly the same:

0 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
1 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
2 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
3 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
4 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
5 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
6 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
7 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
8 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
9 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)

All we changed was the initial weights; why would that produce this result?

Troubleshooting

Experiment 1

Was the training simply not long enough, or was the learning rate set badly?
After several rounds of tuning we found that no matter how long we trained, and whether the learning rate went up or down, nothing changed.

Experiment 2

Since the symptom appeared after changing the weights, perhaps the random initialization itself was broken. We pulled the initial weights out and checked their statistics: everything looked normal (a minimal sanity-check sketch follows).
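For reference, here is a minimal sketch of this kind of sanity check (the particular statistics printed are our own choice, not from the original experiment):

for var in model.trainable_variables:
    # Illustrative check: a healthy random init should not be all zeros
    # or contain exploded values.
    w = var.numpy()
    print(f"{var.name}: shape={w.shape}, mean={w.mean():.4f}, "
          f"std={w.std():.4f}, min={w.min():.4f}, max={w.max():.4f}")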

Experiment 3

In our past experience, this kind of "every output in the batch is identical" symptom appears when BatchNormalization is mishandled while exporting an inference model. But how would that explain it happening during training? And why does fine-tuning not hit the problem, when all we changed was the initial weights?
Googling in that direction turned up quite a few issues around Keras's BatchNormalization. One of them is that BatchNormalization's moving mean and moving variance are not saved when the model is saved [6] https://github.com/tensorflow/tensorflow/issues/16455, and two others are directly related to our problem:
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
In the end, that author tracked down the cause and summarized it here:
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/

Following this hint, we tried the following:

Experiment 3.1

We switched back to training with model.fit(). The good news is that within the first few epochs the training accuracy did start to climb slowly, but the validation accuracy still showed the original problem. Moreover, the intermediate results obtained with model.predict_on_batch() were still identical across the batch.

Epoch 1/20
581/581 [==============================] - 162s 279ms/step - loss: 0.6768 - sparse_categorical_accuracy: 0.6224 - val_loss: 0.6981 - val_sparse_categorical_accuracy: 0.4984
Epoch 2/20
581/581 [==============================] - 133s 228ms/step - loss: 0.4847 - sparse_categorical_accuracy: 0.7684 - val_loss: 0.6931 - val_sparse_categorical_accuracy: 0.5016
Epoch 3/20
581/581 [==============================] - 130s 223ms/step - loss: 0.3905 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.6996 - val_sparse_categorical_accuracy: 0.4984
Epoch 4/20
581/581 [==============================] - 131s 225ms/step - loss: 0.3113 - sparse_categorical_accuracy: 0.8660 - val_loss: 0.6935 - val_sparse_categorical_accuracy: 0.5016

However, as training went on, things turned around and became normal (with the tf.function loop, by contrast, no amount of training changed anything; good thing we did not give up). (Update: something is still wrong here, keep reading; even at the time it felt odd that convergence was this slow.)

Epoch 18/20
581/581 [==============================] - 131s 226ms/step - loss: 0.0731 - sparse_categorical_accuracy: 0.9725 - val_loss: 1.4896 - val_sparse_categorical_accuracy: 0.8703
Epoch 19/20
581/581 [==============================] - 130s 225ms/step - loss: 0.0664 - sparse_categorical_accuracy: 0.9748 - val_loss: 0.6890 - val_sparse_categorical_accuracy: 0.9016
Epoch 20/20
581/581 [==============================] - 126s 217ms/step - loss: 0.0631 - sparse_categorical_accuracy: 0.9768 - val_loss: 1.0290 - val_sparse_categorical_accuracy: 0.9031

The results obtained with model.predict_on_batch() were also consistent with this accuracy (a minimal inspection sketch follows).
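For reference, a minimal sketch of how per-sample outputs within a batch can be inspected via model.predict_on_batch() (which batch is used is arbitrary):

for images, labels in validation_batches.take(1):
    # Illustrative check: print the first few per-sample predictions and
    # verify that they actually differ from one another within the batch.
    preds = model.predict_on_batch(images)
    for i, p in enumerate(preds[:10]):
        print(i, '=', p)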

Experiment 3.2

The previous experiment confirmed that training purely through the Keras API does work. But what is the deeper cause? Could it be that BatchNormalization is not updating its moving mean and moving variance? The answer is yes.
We printed the moving mean and moving variance before and after training with each of the two methods:

def get_bn_vars(collection):
    moving_mean, moving_variance = None, None
    for var in collection:
        name = var.name.lower()
        if "variance" in name:
            moving_variance = var
        if "mean" in name:
            moving_mean = var

    if moving_mean is not None and moving_variance is not None:
        return moving_mean, moving_variance
    raise ValueError("Unable to find moving mean and variance")

mean, variance = get_bn_vars(model.variables)
print(mean)
print(variance)

We found that with model.fit() the mean and variance do get updated (although the update rate looks a bit odd), whereas with the tf.function loop these two values are never updated (a quick before/after check is sketched below).

That also explains why fine-tuning never showed the problem: the mean and variance learned on ImageNet are already good values, so the model works even if they are never updated.
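As a cross-check, a minimal before/after sketch reusing get_bn_vars() from above (the single-step comparison is our own illustration, not part of the original experiment):

import numpy as np

# Snapshot the BN moving statistics, run one custom training step, and
# compare: if the values are unchanged, this training path is not
# updating the moving statistics.
mean, variance = get_bn_vars(model.variables)
mean_before = mean.numpy().copy()
variance_before = variance.numpy().copy()

images, labels = next(iter(train_batches))
train_cls_step(images, labels)

print("moving_mean updated:    ", not np.allclose(mean_before, mean.numpy()))
print("moving_variance updated:", not np.allclose(variance_before, variance.numpy()))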

Experiment 3.3

Would building the model with a dynamic input shape, as described in [4], fix the problem?

from tensorflow.keras import Model
from tensorflow.keras.layers import Conv2D, BatchNormalization, Flatten, Dense

class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x):
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None, 28, 28, 1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(label, predictions)

The resulting model:

Model: "my_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #  
=================================================================
conv2d (Conv2D)              multiple                  320      
_________________________________________________________________
batch_normalization_v2 (Batc multiple                  128      
_________________________________________________________________
flatten (Flatten)            multiple                  0        
_________________________________________________________________
dense (Dense)                multiple                  2769024  
_________________________________________________________________
dense_1 (Dense)              multiple                  1290      
=================================================================
Total params: 2,770,762
Trainable params: 2,770,698
Non-trainable params: 64

Judging by the Output Shape column, the model is built correctly.
We ran it over MNIST once and the results were quite good!
Just in case, we also checked whether the mean and variance were being updated; to our surprise, they were not!
In other words, the scheme described in [4] does not work in our case.

Experiment 3.4

Having localized the problem to BatchNormalization, the next thought was that BatchNormalization behaves differently during training and testing: at test time the moving mean and variance must not be updated. Could it be that the tf.function style of training never switches this state automatically?
Looking at the source code, BatchNormalization's call() takes a training argument, and it defaults to False:

Call arguments:
  inputs: Input tensor (of any rank).
  training: Python boolean indicating whether the layer should behave in
    training mode or in inference mode.
    - `training=True`: The layer will normalize its inputs using the
      mean and variance of the current batch of inputs.
    - `training=False`: The layer will normalize its inputs using the
      mean and variance of its moving statistics, learned during training.

So we made the following change:

class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x, training=True):
        x = self.conv1(x)
        x = self.batch_norm1(x, training=training)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None, 28, 28, 1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image, training=True)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

    train_loss(loss)
    train_accuracy(label, predictions)


@tf.function
def test_step(image, label):
    predictions = model(image, training=False)
    t_loss = loss_object(label, predictions)

    test_loss(t_loss)
    test_accuracy(label, predictions)

The results show that the moving mean and variance now get updated, and the test accuracy matches expectations.
So the root cause is confirmed: BatchNormalization must be told whether it is in training or testing mode!

Experiment 3.5

The approach in 3.4 solves the problem, but it relies on building the model by subclassing Model. Our original MobileNetV2 model is built with the more flexible Keras Functional API, where we do not control the definition of call() and therefore cannot switch between the training and testing states; the same limitation applies to models built with Sequential.
[5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[7] https://github.com/keras-team/keras/issues/7085
[8] https://github.com/keras-team/keras/issues/6752
From [5] and [8] I learned two things:

    1. tf.keras.backend.set_learning_phase() can switch between the training and testing states;
    2. model.updates and layer.updates hold the assign ops that move the old values to the new values.

So the first thing I tried was:

 tf.keras.backend.set_learning_phase(True)

With that, the model built from MobileNetV2 also trains normally.
It even seemed to converge much faster than model.fit(). Recalling how puzzlingly slowly model.fit() had converged, we ran one more experiment and added the same line to the model.fit() version, and it converged much faster as well: a single epoch was enough for a decent result!
Which raises another question: does model.fit() set the learning phase at all? And if it does not, how does it manage to update the moving mean and variance?
As for the second approach, the tutorial describes how to do this in 1.x, and under eager execution there seems to be no way to run these assign operations; it is listed here for reference only.

update_ops = []
for assign_op in model.updates:
    update_ops.append(assign_op)
# But how do we process these update_ops under eager execution?

 

Conclusion

To sum up: [4] gave us the hint that eventually led to a fix, but the problem and solution described there turned out not to apply directly to our case. The real issue is how, with Keras + TensorFlow 2.0, to handle layers whose behavior differs between training and testing, and how the model.fit() and tf.function training paths differ; in the end, model.fit() seems to hide quite a bit of puzzling behavior.
Our recommendations:

  1. Even when training with Keras APIs such as model.fit() or model.train_on_batch(), it is worth setting tf.keras.backend.set_learning_phase(True) manually; it can speed up convergence.
  2. If you write your own eager-execution training loop, either:
  • 1) build the model by subclassing Model and pass a training flag through call(), so that layers such as BatchNormalization and Dropout can behave differently; or
  • 2) build the model with the Functional API or Sequential and call tf.keras.backend.set_learning_phase(True), remembering to switch the state back at test time (see the sketch after this list).
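For option 2), a minimal sketch under these assumptions (it reuses the model, train_cls_step and dataset names from above; val_accuracy is an illustrative metric, and the exact toggling points are a suggestion rather than an official recipe):

import tensorflow as tf

# Training: enable the training phase so BatchNormalization layers update
# their moving statistics inside the custom training loop.
tf.keras.backend.set_learning_phase(True)
for images, labels in train_batches:
    train_cls_step(images, labels)

# Testing: switch back to the inference phase so BatchNormalization layers
# normalize with their moving statistics instead of the batch statistics.
tf.keras.backend.set_learning_phase(False)
val_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='val_accuracy')
for images, labels in validation_batches:
    val_accuracy(labels, model(images))
print('validation accuracy:', float(val_accuracy.result()))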

Finally, why does the TF 2.0 tutorial not mention any of this? Does it assume you are already a Keras expert? [facepalm]

Acknowledgements

Thanks to 柏濤, 帆月 and 應知 for their help.

[1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/
[5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[6] https://github.com/tensorflow/tensorflow/issues/16455
[7] https://github.com/keras-team/keras/issues/7085
[8] https://github.com/keras-team/keras/issues/6752

Author: 爍凡
