Deep Learning Notes (4): VGG14

阿新 • Published: 2019-01-28

Very Deep Convolutional Networks for Large-Scale Image Recognition

1. Main contributions

This paper studies how the performance of a CNN changes as the number of layers grows while the total number of parameters stays roughly the same. (thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.)

2. Earlier improvements

Starting from the architecture of the original paper, ImageNet classification with deep convolutional neural networks [2], the main improvements so far are:

Reference [3]: utilised smaller receptive window size and smaller stride of the first convolutional layer.
Reference [4]: dealt with training and testing the networks densely over the whole image and over multiple scales.

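3. CNN architecture

To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, which come from [1].

The input to every CNN is a 224×224×3 image.
The only preprocessing before input is subtracting the mean RGB value (computed on the training set) from each pixel.
Almost all convolution kernels are 3×3 with stride 1.
The additional 1×1 kernels can be seen as a linear transformation of the input channels.
There are five max-pooling layers, each with a 2×2 pooling window and stride 2.
All hidden layers use the rectification non-linearity (ReLU) as the activation function.
Local Response Normalization (LRN) is not needed: it does not improve results and only adds memory consumption and computation time.
The final layer is the soft-max transform layer, which serves as the cost-function layer.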
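The effect of the five pooling layers on the spatial size can be traced with a few lines of Python. This is a minimal sketch, not the paper's code; it assumes the 3×3 convolutions use 1-pixel padding (as stated in the paper, though not in the list above), so that only the pooling layers change the size. The function names are made up for illustration.

```python
# Sketch: trace the feature-map side length through the five conv + pool stages.
# Assumption: 3x3 convs use stride 1 and 1-pixel padding (size-preserving).

def conv_out(size, kernel=3, stride=1, padding=1):
    """Output side length of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output side length of unpadded max-pooling on a square input."""
    return (size - window) // stride + 1


def trace_sizes(input_size=224, num_stages=5):
    """Return the side length at the input and after each conv + max-pool stage."""
    sizes = [input_size]
    size = input_size
    for _ in range(num_stages):
        size = conv_out(size)  # 3x3, stride 1, padding 1: size unchanged
        size = pool_out(size)  # 2x2, stride 2: size halved
        sizes.append(size)
    return sizes


print(trace_sizes())  # [224, 112, 56, 28, 14, 7]
```

Under these assumptions the last pooling layer outputs a 7×7 map, which is why the first fully-connected layer can later be rewritten as a 7×7 convolution.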

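4. CNN configurations

The ConvNet configurations are listed in Table 1 and named A through E. From 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers).
The number of channels (the width) of the conv. layers starts at 64 and doubles after each max-pooling layer until it reaches 512.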
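To make the layer counts and the "roughly unchanged" parameter total concrete, here is a rough sketch that counts weight layers and parameters for three of the configurations. The per-stage conv counts for A, D and E are recalled from Table 1 of the paper and are an assumption here, as are the 4096-4096-1000 fully-connected sizes and the 224×224×3 input; configurations B and C are left out because C also contains 1×1 convolutions.

```python
# Channel width per stage: starts at 64, doubles after each max-pool, capped at 512.
WIDTHS = [min(64 * 2 ** i, 512) for i in range(5)]  # [64, 128, 256, 512, 512]

# Number of 3x3 conv layers in each of the five stages (assumed from Table 1).
CONFIGS = {
    "A": [1, 1, 2, 2, 2],
    "D": [2, 2, 3, 3, 3],
    "E": [2, 2, 4, 4, 4],
}


def count_weight_layers(convs_per_stage, num_fc=3):
    """Conv layers plus fully-connected layers."""
    return sum(convs_per_stage) + num_fc


def count_parameters(convs_per_stage, num_classes=1000, fc_width=4096):
    """Weights plus biases, assuming a 7x7x512 map in front of the FC layers."""
    total = 0
    in_ch = 3
    for width, n_convs in zip(WIDTHS, convs_per_stage):
        for _ in range(n_convs):
            total += 3 * 3 * in_ch * width + width  # 3x3 kernel weights + biases
            in_ch = width
    fc_in = 7 * 7 * in_ch
    for fc_out in (fc_width, fc_width, num_classes):
        total += fc_in * fc_out + fc_out
        fc_in = fc_out
    return total


for name, cfg in CONFIGS.items():
    print(name, count_weight_layers(cfg), count_parameters(cfg))
```

With these assumptions the totals come out at about 133M, 138M and 144M parameters for A, D and E: going from 11 to 19 weight layers adds less than 10% to the parameter count, because most parameters sit in the fully-connected layers.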

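5. Discussion

Why does the paper use 3×3 filters throughout? Because two stacked 3×3 filters have the same effective receptive field as one 5×5 filter, and likewise three stacked 3×3 filters have the same effective receptive field as one 7×7 filter.
Then why not use a single 5×5 or 7×7 filter directly? Take 7×7 as the example:

First, three layers are more discriminative than one. (First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative.)

Second, assuming the same number of channels C for input and output, three 3×3 layers have 3×(3×3×C×C) = 27C² parameters, while one 7×7 layer has 7×7×C×C = 49C² parameters, so the stack needs far fewer parameters.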
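Both claims can be checked with a short calculation. The sketch below assumes stride-1 convolutions stacked without pooling in between and ignores biases; the helper names and the channel count C = 256 are only illustrative.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Effective receptive field of stride-1 conv layers stacked without pooling."""
    field = 1
    for _ in range(num_layers):
        field += kernel - 1  # each layer widens the field by kernel - 1 pixels
    return field


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (biases ignored) for conv layers with `channels` in and out."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # illustrative channel count
print(stacked_receptive_field(2), stacked_receptive_field(3))  # 5 7
print(conv_weights(3, C, num_layers=3), conv_weights(7, C))    # 27*C*C vs 49*C*C
```

The ratio 27/49 is independent of C, so the three-layer stack uses about 55% of the weights of the single 7×7 layer.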

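A 1×1 convolution kernel increases the non-linearity of the decision function without affecting the receptive field. This kernel has already been used in the Network in Network architecture [5].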
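A 1×1 convolution is the same small matrix applied to the channel vector of every pixel, so it mixes channels without looking at neighbouring pixels; the non-linearity comes from the ReLU that follows. The pure-Python sketch below illustrates this on a toy feature map; the function names and the nested-list layout are assumptions made for the example.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def conv1x1_relu(feature_map, weights, biases):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in nested lists
    biases:      list of length C_out
    """
    out = []
    for row in feature_map:
        out_row = []
        for pixel in row:  # pixel is the list of C_in channel values
            out_row.append([
                relu(sum(w * v for w, v in zip(w_row, pixel)) + b)
                for w_row, b in zip(weights, biases)
            ])
        out.append(out_row)
    return out


fmap = [[[1.0, 2.0], [0.0, -1.0]]]  # toy 1 x 2 feature map with 2 channels
w = [[1.0, -1.0], [0.5, 0.5]]
b = [0.0, 0.0]
print(conv1x1_relu(fmap, w, b))  # [[[0.0, 1.5], [1.0, 0.0]]]
```

Each output pixel depends only on the input pixel at the same position, which is why the receptive field is unchanged.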

6. Training

Apart from the multi-scale sampling of training crops, the experiments mostly follow the settings of [2]. The batch size is 256, momentum is 0.9, the weight decay (L2 penalty) coefficient is $5 \cdot 10^{-4}$, and the dropout ratio for the first two fully-connected layers is 0.5. The learning rate starts at $10^{-2}$ and is divided by 10 whenever the validation accuracy stops improving, three times in total. Training stops after 370K iterations (74 epochs). (The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to $5 \cdot 10^{-4}$) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to $10^{-2}$, and then decreased by a factor of 10 when the validation set accuracy stopped improving. In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs).)
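The learning-rate rule can be sketched as a small helper. This is only an illustration: the paper does not say how "stopped improving" was detected, so the sketch assumes the rate is dropped the first time a validation run fails to beat the best accuracy so far. The class name `StepOnPlateau` is hypothetical, and the training-set size of about 1.28M images used in the epoch check is also an assumption.

```python
class StepOnPlateau:
    """Divide the learning rate by `factor` when validation accuracy stops improving."""

    def __init__(self, lr=1e-2, factor=10.0, max_drops=3):
        self.lr = lr
        self.factor = factor
        self.max_drops = max_drops
        self.drops = 0
        self.best = float("-inf")

    def update(self, val_accuracy):
        """Call once per validation run; returns the learning rate to use next."""
        if val_accuracy > self.best:
            self.best = val_accuracy
        elif self.drops < self.max_drops:  # assumed rule: no patience window
            self.lr /= self.factor
            self.drops += 1
        return self.lr


sched = StepOnPlateau()
for acc in [0.50, 0.60, 0.60, 0.65, 0.64, 0.64, 0.64]:
    lr = sched.update(acc)
print(sched.drops, lr)  # three drops: 1e-2 -> 1e-3 -> 1e-4 -> 1e-5

# Sanity check of "370K iterations (74 epochs)" with batch size 256,
# assuming roughly 1.28M training images.
print(370_000 * 256 / 1.28e6)  # 74.0
```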

The networks in this paper need fewer epochs to converge than the original network [2]. The paper conjectures that this is due to:

　　　a) implicit regularisation imposed by greater depth and smaller conv. filter sizes

　　　b) pre-initialisation of certain layers.

Weight initialisation: first train a shallow network, such as network A (Table 1), to obtain its weights. Then, when training a deeper network, initialise its first four convolutional layers and its last three fully-connected layers with the weights from A; the intermediate layers are still initialised randomly. The learning rate of the pre-initialised layers is not decreased, so they can keep changing during learning. Randomly initialised weights are sampled from a normal distribution with zero mean and variance 0.01. The biases are set to 0. (To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and $10^{-2}$ variance. The biases were initialised with zero.)
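The initialisation scheme can be sketched as follows. This is a toy illustration, not the Caffe setup used in the paper: weights are flat Python lists, and the layer names and the `copy_map` argument (which deep layer takes its weights from which layer of net A) are hypothetical. Note that a variance of 0.01 corresponds to a standard deviation of 0.1. Biases are omitted; they would simply be lists of zeros.

```python
import random


def gaussian_weights(count, variance=1e-2, rng=random):
    """Sample weights from a zero-mean normal distribution with the given variance."""
    std = variance ** 0.5  # variance 0.01 -> standard deviation 0.1
    return [rng.gauss(0.0, std) for _ in range(count)]


def init_deeper_net(layer_sizes, pretrained, copy_map, rng=random):
    """Initialise a deeper net, copying some layers from a trained shallow net.

    layer_sizes: {layer_name: number_of_weights} for the deeper net
    pretrained:  {layer_name: weights} from the shallow net (net A)
    copy_map:    {deep_layer_name: shallow_layer_name} for pre-initialised layers
    """
    weights = {}
    for name, size in layer_sizes.items():
        if name in copy_map:
            weights[name] = list(pretrained[copy_map[name]])  # pre-initialised
        else:
            weights[name] = gaussian_weights(size, rng=rng)   # random init
    return weights


# Toy usage with made-up layer names and tiny sizes.
net_a = {"a_conv1": [0.1, -0.2, 0.3, 0.0]}
deep = init_deeper_net(
    layer_sizes={"conv1": 4, "conv_extra": 6},
    pretrained=net_a,
    copy_map={"conv1": "a_conv1"},
)
print(deep["conv1"], len(deep["conv_extra"]))
```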
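To obtain the 224×224 input, the original image is rescaled isotropically so that its shorter side S is at least 224, and a 224×224 window is then cropped at random. For further data augmentation, the crops also undergo random horizontal flipping and random RGB colour shift.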
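Multi-scale training: objects in images appear at different scales, so training at multiple scales helps recognise them. There are two ways to set the training scale:

　　　a). Train a separate classifier at each fixed scale S, where S is the length of the shorter side of the rescaled original image. The paper trains two classifiers, with S=256 and S=384; the S=384 classifier is initialised with the weights of the S=256 one and uses a smaller initial learning rate of $10^{-3}$.

　　　b). Alternatively, train a single classifier in which each image is rescaled individually every time it is fed in, with the shorter side S sampled at random from [S_min, S_max]; the paper uses the range [256, 512]. The network is initialised with the weights of the model trained at S=384.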
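The sampling of one training crop can be sketched in a few lines. The sketch only computes geometry (the rescaled size, the crop box and the flip decision) and leaves out actual image resampling and the RGB colour shift; the function names are made up, and the scale range is passed in by the caller (S_min = S_max gives fixed-scale training).

```python
import random


def rescale_shorter_side(width, height, s):
    """Isotropically rescale so the shorter side equals s; returns (new_w, new_h)."""
    scale = s / min(width, height)
    return max(s, round(width * scale)), max(s, round(height * scale))


def sample_training_crop(width, height, s_min, s_max, crop=224, rng=random):
    """Sample the scale S, a crop box in the rescaled image, and a flip flag."""
    s = rng.randint(s_min, s_max)  # scale jittering; s_min == s_max fixes S
    new_w, new_h = rescale_shorter_side(width, height, s)
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    flip = rng.random() < 0.5      # random horizontal flip
    return s, (left, top, left + crop, top + crop), flip


print(rescale_shorter_side(500, 375, 256))  # (341, 256)
s, box, flip = sample_training_crop(500, 375, s_min=256, s_max=512)
print(s, box, flip)
```

Because S is at least 224, the crop always fits inside the rescaled image; at larger S the same 224×224 window covers a smaller part of the image.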

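7. Testing

The test image is first rescaled isotropically to a shorter side Q of at least 224. Q plays the same role as S, except that S is the training scale and Q is the test scale. Q need not equal S; on the contrary, for a given S, testing with several values of Q and averaging the results improves performance.
The test data is then evaluated in the manner of reference [4]:

　　　a). The fully-connected layers are converted to convolutional layers: the first FC layer becomes a 7×7 convolution and the last two FC layers become 1×1 convolutions.

　　　b). Resulting net is applied to the whole image by convolving the filters in each layer with the full-size input. The resulting output feature map is a class score map with the number of channels equal to the number of classes, and the variable spatial resolution, dependent on the input image size.

　　　c). Finally, class score map is spatially averaged (sum-pooled) to obtain a fixed-size vector of class scores of the image.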
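Steps b) and c) can be illustrated with a short sketch. It assumes a square Q×Q input, size-preserving 3×3 convolutions, five 2×2/stride-2 pooling layers that round down, and the first FC layer applied as an unpadded 7×7 convolution; the function names are hypothetical.

```python
def score_map_side(q, num_pools=5, fc_kernel=7):
    """Side length of the class score map for a square Q x Q test image."""
    side = q
    for _ in range(num_pools):
        side //= 2               # each 2x2/stride-2 pooling halves the side
    return side - fc_kernel + 1  # unpadded 7x7 convolution (former first FC layer)


def spatial_average(score_map):
    """Average an H x W x num_classes score map into one vector of class scores."""
    cells = [cell for row in score_map for cell in row]
    num_classes = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(num_classes)]


print([score_map_side(q) for q in (224, 256, 384, 512)])  # [1, 2, 6, 10]
print(spatial_average([[[1.0, 0.0], [3.0, 2.0]]]))        # [2.0, 1.0]
```

At Q=224 the score map is a single cell, which matches the ordinary fully-connected network; larger inputs give a larger map that is then averaged down to one score per class.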

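8. Implementation

Implemented with the C++ Caffe toolbox.
Supports multiple GPUs in a single system.
With multiple GPUs, each batch is split into several GPU batches that are processed in parallel, one per GPU; once the GPU-batch gradients are computed, they are averaged to obtain the gradient of the full batch.
Reference [7] of the paper proposes more sophisticated methods of speeding up training. The paper's experiments show that on a 4-GPU system, training is sped up by a factor of 3.75.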
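The reason averaging the GPU-batch gradients gives the full-batch gradient can be seen on a toy problem. The sketch below uses a one-parameter least-squares model instead of a ConvNet and runs the "GPUs" sequentially; it assumes the batch splits into equally sized GPU batches, and all names are illustrative.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_gradient(w, batch, num_gpus):
    """Split the batch into equal GPU batches and average their gradients."""
    assert len(batch) % num_gpus == 0, "sketch assumes equally sized GPU batches"
    size = len(batch) // num_gpus
    shards = [batch[i * size:(i + 1) * size] for i in range(num_gpus)]
    grads = [mse_gradient(w, shard) for shard in shards]  # one per GPU
    return sum(grads) / num_gpus


batch = [(float(x), 3.0 * x) for x in range(8)]
full = mse_gradient(0.5, batch)
split = data_parallel_gradient(0.5, batch, num_gpus=4)
print(full, split)  # both -87.5
```

Since the result equals the single-GPU gradient, the split changes only where the work is done, not what is computed.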

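9. Experiments

9.1 Configuration Comparison

The experiments use the CNN configurations in Figure 1; configurations C/D/E are also trained with multi-scale training. Note that in this group of experiments the test set uses a single scale. The results are shown in the figure below:

9.2 Multi-Scale Comparison

Here the test set is evaluated at multiple scales. Because too large a gap between training and test scales degrades performance, a model trained at a fixed S is tested at scales Q within 32 of S, that is Q = {S−32, S, S+32}. A model trained over a scale range is tested at the minimum, the midpoint and the maximum of that range.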
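The choice of test scales can be written as a small helper. This is a sketch with a hypothetical function name; a fixed training scale is passed as an integer and a jittered training range as a (S_min, S_max) tuple.

```python
def evaluation_scales(train_scale):
    """Return the test scales Q for a model trained at `train_scale`.

    train_scale is an int for fixed-scale training, or a (s_min, s_max) tuple
    for scale-jittered training.
    """
    if isinstance(train_scale, tuple):
        s_min, s_max = train_scale
        return [s_min, (s_min + s_max) // 2, s_max]
    return [train_scale - 32, train_scale, train_scale + 32]


print(evaluation_scales(384))         # [352, 384, 416]
print(evaluation_scales((256, 512)))  # [256, 384, 512]
```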

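9.3 ConvNet Fusion

Models are fused by averaging their soft-max class posteriors. Fusing the two best models from Figure 3 and Figure 4 achieves a better result, whereas fusing seven models does worse.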
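Averaging posteriors is simple to write down. The sketch below fuses two made-up score vectors for a single image with three classes; the numbers are invented for illustration and are not results from the paper.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(posteriors):
    """Average the class posteriors of several models for one image."""
    num_models = len(posteriors)
    return [sum(p[k] for p in posteriors) / num_models
            for k in range(len(posteriors[0]))]


model_a = softmax([2.0, 1.0, 0.1])  # made-up scores from a first model
model_b = softmax([0.5, 2.5, 0.3])  # made-up scores from a second model
fused = fuse([model_a, model_b])
prediction = max(range(len(fused)), key=fused.__getitem__)
print(fused, prediction)
```

In this toy case the first model prefers class 0 and the second prefers class 1 with more confidence, so the fused prediction is class 1; the averaged vector still sums to 1.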

10. References

[1]. Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.

[2]. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

[3]. Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. CoRR, abs/1311.2901, 2013. Published in Proc. ECCV, 2014.

[4]. Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. In Proc. ICLR, 2014.

[5]. Lin, M., Chen, Q., and Yan, S. Network in network. In Proc. ICLR, 2014.