
Revisiting Deep Vision Classics, Part 1: The Wild Era of Convolutional Networks


Recently, while looking for a research direction for my next paper, I picked up some classic old papers that I had read before starting my program. I hadn't expected them to carry so much information; clearly my naive past self never grasped their essence. Rather than the fiddly technical details, what I care about most is how the authors came up with these things, or in other words, the ideas I personally find interesting. I am simply publishing my Wiz notes here to share and exchange ideas, so I cannot guarantee this series suits every reader. This first post starts from 2012 and covers four classic works: AlexNet, Maxout, NIN, and Overfeat. ZFNet, VGG, and GoogLeNet will be summarized in the next post, "The Industrial Revolution of Convolutional Networks".

2012: AlexNet (2018-03-01)
  1. Overview:
    1. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%.
    2. 60 million parameters and 650,000 neurons
    3. takes between five and six days to train on two GTX 580 3GB GPUs. (roughly 90 cycles)
  2. Architecture:
    1. Dropout. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.
    2. ReLU. Training with ReLUs is several times faster than with sigmoid or tanh units.
    3. LRN. The motivation behind LRN is "lateral inhibition", which is a nice way to put it: it "implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels." LRN reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. (A minimal sketch of the normalization appears after this list.)
    4. Overlapping pooling: This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.
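A minimal NumPy sketch of this local response normalization, as I read it: each channel is divided by a term built from the squared activations of n neighboring channels. The constants (n=5, k=2, alpha=1e-4, beta=0.75) are the ones AlexNet reports; the function name and the (C, H, W) layout are my own.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # a: activations of shape (C, H, W); normalize channel i by the sum of
    # squares over the n adjacent channels around it ("lateral inhibition").
    C = a.shape[0]
    out = np.empty_like(a, dtype=float)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out
```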
  3. Preprocessing:
    1. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image.
    2. Subtract the mean activity over the training set from each pixel.
    3. (Training time) Extract random 224 × 224 patches (and their horizontal reflections) from the 256×256 images. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks.
    4. (Training time) PCA-based color distortion.
    5. (Test time) The network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network's softmax layer on the ten patches. (A sketch of this ten-crop averaging follows this list.)
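A small NumPy sketch of that test-time ten-crop averaging, assuming the image has already been rescaled and center-cropped to 256×256 as in item 1; `model` and the (H, W, 3) array layout are hypothetical.

```python
import numpy as np

def ten_crop(img, crop=224):
    # img: (256, 256, 3) array -> (10, crop, crop, 3) batch of the four corner
    # crops, the center crop, and their horizontal reflections.
    H, W, _ = img.shape
    offsets = [(0, 0), (0, W - crop), (H - crop, 0), (H - crop, W - crop),
               ((H - crop) // 2, (W - crop) // 2)]
    crops = [img[y:y + crop, x:x + crop] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]  # horizontal reflections
    return np.stack(crops)

# probs = model(ten_crop(img)).mean(axis=0)  # average the ten softmax outputs
```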
  4. Training:
    1. Initialization: We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.
    2. A batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. Weight decay here is not merely a regularizer: it reduces the model’s training error.
    3. [figure omitted] (a sketch of the initialization and optimizer settings above follows this list)
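A PyTorch sketch of those settings, assuming an AlexNet-style model. Which layers get bias 1 depends on the concrete model definition, so that part is only indicated in a comment, and the initial learning rate of 0.01 is my recollection of the paper rather than something stated in these notes.

```python
import torch.nn as nn
import torch.optim as optim

def init_weights(m):
    # zero-mean Gaussian with std 0.01 for every conv / fully connected layer;
    # biases default to 0 here. Set them to 1 by hand for the 2nd/4th/5th conv
    # layers and the FC hidden layers, as described above.
    if isinstance(m, (nn.Conv2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0.0)

# model.apply(init_weights)  # 'model' is a hypothetical AlexNet-style nn.Module
# optimizer = optim.SGD(model.parameters(), lr=0.01,      # lr 0.01 is an assumption
#                       momentum=0.9, weight_decay=5e-4)  # values from the notes
```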
  5. Qualitative validation:
    1. They show top-5 results on a few example images, arguing that even the mistakes are understandable.
    2. They then retrieve the images the network considers similar, and these do look similar. Similarity is measured by a small Euclidean distance between the feature vectors produced by the last hidden layer: "If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar." This also suggests using the network for image retrieval (caveat: "Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an auto-encoder to compress these vectors to short binary codes.")
    3. [figure omitted]
  6. An open question, and possible follow-up work: "The kernels on GPU 1 are largely color-agnostic, while the kernels on GPU 2 are largely color-specific. This kind of specialization occurs during every run and is independent of any particular random weight initialization (modulo a renumbering of the GPUs)." [figure of the learned first-layer kernels omitted]

2013: Maxout (2018-03-03)

This paper examines the inner mechanism of dropout as a performance-boosting tool, and proposes an activation function, Maxout, whose properties let dropout do its job fully. "We argue that rather than using dropout as a slight performance enhancement applied to arbitrary models, the best performance may be obtained by directly designing a model that enhances dropout's abilities as a model averaging technique."
  1. Observations and caveats prompted by dropout:
    1. Training using dropout differs significantly from previous approaches such as ordinary stochastic gradient descent.
    2. Dropout is most effective when taking relatively large steps in parameter space.
    3. In this regime, each update can be seen as making a significant update to a different model on a different subset of the training set.
    4. The connection to (and inspiration from) bagging: The ideal operating regime for dropout is when the overall training procedure resembles training an ensemble with bagging under parameter sharing constraints. This differs radically from the ideal stochastic gradient operating regime in which a single model makes steady progress via small steps.
    5. Another consideration is that dropout model averaging is only an approximation when applied to deep models.
  2. Motivation: since the above says "only an approximation", how can we make that approximation better? "Explicitly designing models to minimize this approximation error may thus enhance dropout's performance."
  3. At inference time, what should a layer trained with dropout do?
    1. If we assume that the softmax outputs of the different sub-models should be combined by their geometric mean, then:
    2. For a network with a single layer followed by softmax, the weights W should in theory be halved at test time, i.e. the output is softmax(v^T W/2 + b).
    3. For deep networks this no longer holds exactly, but the same scaling is used as an approximation. (A small sketch of this weight scaling follows this list.)
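A tiny NumPy sketch of that inference-time rule, assuming a dropout probability of 0.5 on the inputs (so the retention probability p is 0.5 and W effectively becomes W/2); the function names are mine.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dropout_inference(v, W, b, p_keep=0.5):
    # Geometric-mean approximation of the dropout ensemble for a single
    # softmax layer: keep the full input v but scale the weights by p_keep.
    return softmax(v @ (p_keep * W) + b)
```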
  4. Maxout is essentially an activation function:
    1. For vector inputs (the activation after an FC layer): h_i(v) = max_{j in 1..k} z_ij, where z_ij = v^T W_ij + b_ij and k is a hyperparameter.
    2. For feature-map inputs (the activation after a conv layer): "a maxout feature map can be constructed by taking the maximum across k affine feature maps (i.e., pool across channels, in addition to spatial locations)."
    3. A single maxout unit can be interpreted as making a piecewise linear approximation to an arbitrary convex function. Maxout networks learn not just the relationship between hidden units, but also the activation function of each hidden unit.
    4. The activations Maxout produces are not sparse, but the gradient is. When combined with dropout, dropout makes the activations sparse.
    5. A property of Maxout: "maxout networks are universal approximators. Provided that each individual maxout unit may have arbitrarily many affine components, we show that a maxout model with just two hidden units can approximate, arbitrarily well, any continuous function of v ∈ R^n" (proof in the paper). (A code sketch of the maxout operation follows this list.)
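A minimal PyTorch sketch of the maxout operation in both cases; the function names and the convention that the k affine pieces are stored as consecutive channels/columns are my own.

```python
import torch

def maxout_fc(z, k):
    # z: (N, k*m) pre-activations from a linear layer -> (N, m)
    n, d = z.shape
    return z.view(n, d // k, k).max(dim=2).values

def maxout_conv(z, k):
    # z: (N, k*m, H, W) feature maps -> (N, m, H, W), i.e. the maximum across
    # k affine feature maps (cross-channel pooling)
    n, c, h, w = z.shape
    return z.view(n, c // k, k, h, w).max(dim=2).values
```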
  5. Experiments:
    1. MNIST. There is a trick for carving out a validation set and tuning hyperparameters on it. Since the test set may only be touched once when measuring performance (to avoid any suspicion of tuning on it), how do we know when the model has been trained to its best?
      1. Use the first 50,000 images as the training set and the last 10,000 as the validation set.
      2. Pick the hyperparameters with the lowest validation error, and record the log-likelihood on the 50,000-image training set at that point.
      3. Retrain on the full 60,000 images until the log-likelihood on the validation set (now a subset of the training data) matches the recorded value.
      4. Evaluate once on the test set and report that as the final performance.
    2. CIFAR-10: preprocessed with global contrast normalization and ZCA whitening. Performance is measured with yet another procedure, because when finally training on the full set the validation error stays too high relative to before (with a larger training set the model underfits a bit more) and never reaches the earlier value no matter how long it trains:
      1. After fixing the hyperparameters, discard the model and retrain from scratch on the full set.
      2. Train until the new (training-set) log-likelihood matches the previously recorded one.
      3. (With data augmentation added on top, even that criterion cannot be met because the model underfits even more, so they simply train for the same number of epochs as before.)
    3. CIFAR-100: directly reuse the hyperparameters tuned on CIFAR-10.
    4. SVHN
  6. Maxout vs. ReLU experiments:
    1. With the same architecture and only the activation swapped between Maxout and ReLU, Maxout is of course better.
    2. ReLU gains little from cross-channel pooling.
    3. To reach the same performance, ReLU needs more filters than Maxout (roughly k times as many?).
  7. Theoretical explanation: I did not read Section 7 carefully. It relates different activation functions (combined with dropout) to how well the model-averaging approximation holds, concluding that piecewise-linear functions beat curved ones, and Maxout beats ReLU.
  8. "ReLU + cross-channel pooling" is almost the same as Maxout; adding a max(0, ·) term to the latter makes them exactly equivalent. Experiments show that, when dropout is used, this max(0, ·) is very harmful.
  9. Dropout versus plain SGD:
    1. SGD usually works best with a small learning rate that results in a smoothly decreasing objective function, while dropout works best with a large learning rate, resulting in a constantly fluctuating objective function.
    2. Dropout rapidly explores many different directions and rejects the ones that worsen performance, while SGD moves slowly and steadily in the most promising direction.
    3. Dropout's effect on ReLU saturation: "When training with SGD, we find that the rectifier units saturate at 0 less than 5% of the time. When training with dropout, we initialize the units to saturate rarely but training gradually increases their saturation rate to 60%."
  10. A theoretical comparison of ReLU and Maxout when dropout is used:
    1. In the absence of gradient through the unit, it is difficult for training to change this unit to become active again. Maxout does not suffer from this problem because gradient always flows through every maxout unit–even when a maxout unit is 0, this 0 is a function of the parameters and may be adjusted. Units that take on negative activations may be steered to become positive again later.
    2. Experiments confirm that "active rectifier units become inactive at a greater rate than inactive units become active when training with dropout, but maxout units, which are always active, transition between positive and negative activations at about equal rates in each direction."
    3. Why does the previous point show that Maxout is better than ReLU? "We hypothesize that the high proportion of zeros and the difficulty of escaping them impairs the optimization performance of rectifiers relative to maxout."
  11. Adding a max(0, ·) term to Maxout, which makes it equivalent to ReLU + cross-channel pooling, as an experiment to test the hypothesis above:
    1. When we include a constant 0 in the max pooling, the resulting trained model fails to make use of 17.6% of the filters in the second layer and 39.2% of the filters in the second layer. A small minority of the filters usually took on the maximal value in the pool, and the rest of the time the maximal value was a constant 0.
    2. Maxout, on the other hand, used all but 2 of the 2400 filters in the network. Each filter in each maxout unit in the network was maximal for some training example. All filters had been utilised and tuned.
  12. Probing the gradients at the lower layers:
    1. First, the larger the variance of a layer's gradient with respect to the dropout mask, the more useful dropout is for that layer; otherwise training degenerates into plain SGD.
    2. We tested the hypothesis that rectifier networks suffer from diminished gradient flow to the lower layers of the network by monitoring the variance with respect to dropout masks for fixed data during training of two different MLPs on MNIST
    3. The variance of the gradient on the output weights was 1.4 times larger for maxout on an average training step, while the variance on the gradient of the first layer weights was 3.4 times larger for maxout than for rectifiers
    4. This greater variance suggests that maxout better propagates varying information downward to the lower layers and helps dropout training to better resemble bagging for the lower-layer parameters. Rectifier networks, with more of their gradient lost to saturation, presumably cause dropout training to resemble regular SGD toward the bottom of the network.
2013: Network In Network (2018-03-03)

The viewpoint on convolutional layers, abstraction, and linear separability is quite interesting.
  1. Insight:
    1. First, the paper points out that a convolutional layer's job is to "abstract" the features of the layer below, and that the abstraction performed by conventional filters is low-level: "We argue that the level of abstraction is low. By abstraction we mean that the feature is invariant to the variants of the same concept." This abstraction is linear, so it only works when the latent concepts are linearly separable, and it is unreasonable to assume that the features a conv layer needs to extract are linearly separable.
    2. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input.
    3. Another way to explain where redundancy comes from: "This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In conventional CNN, this might be compensated by utilizing an over-complete set of filters to cover all variations of the latent concepts. Namely, individual linear filters can be learned to detect different variations of a same concept."
  2. The NIN innovation:
    1. The difference from Maxout: "Mlpconv layer differs from maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts."
    2. "Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator." (A code sketch of an mlpconv block, together with GAP, follows the next list.)
  3. The GAP innovation (the insight is what matters; why didn't I think of it):
    1. The final stack of fully connected layers overfits easily. "With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers."
    2. Feeding the final feature maps directly into the softmax layer forces the network to "generate one feature map for each corresponding category of the classification task in the last conv layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps."
    3. Summing up information across spatial locations is consistent with the basic assumption of convolution (spatial invariance). "Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input."
    4. To emphasize once more: "We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories)."
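A short PyTorch sketch of the two ideas together: an mlpconv block (a conventional conv followed by two 1×1 convs, i.e. a small MLP slid over each receptive field) and a GAP head that emits one feature map per class. Channel counts and names are illustrative, not the paper's exact configuration.

```python
import torch.nn as nn

def mlpconv(in_ch, out_ch, ksize):
    # a normal conv followed by two 1x1 convs = a micro MLP applied per window
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, ksize, padding=ksize // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(inplace=True),
    )

class GapHead(nn.Module):
    # the last mlpconv outputs one map per class; GAP turns each map into a logit
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.to_classes = nn.Conv2d(in_ch, num_classes, 1)

    def forward(self, x):
        x = self.to_classes(x)       # (N, num_classes, H, W) confidence maps
        return x.mean(dim=(2, 3))    # (N, num_classes) logits for the softmax
```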
  4. Results: "state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets."
  5. [figure omitted]
  6. Experiments:
    1. Training procedure: batch size 128; the learning rate is divided by 10 once accuracy on the training set stops improving.
    2. Dropout helps.
    3. Ablation experiments show that GAP does reduce overfitting, but this requires the model's own filters to be sufficiently strong at abstraction.
2013: Overfeat (2018-03-03)
  1. The main point of this paper is to show that training a convolutional network to simultaneously classify, locate and detect objects in images can boost the classification accuracy and the detection and localization accuracy of all tasks.
  2. Contributions:
    1. The paper proposes a new integrated approach to object detection, recognition, and localization with a single ConvNet.
    2. We also introduce a novel method for localization and detection by accumulating predicted bounding boxes.
    3. We suggest that by combining many localization predictions, detection can be performed without training on background samples and that it is possible to avoid the time-consuming and complicated bootstrapping training passes. Not training on background also lets the network focus solely on positive classes for higher accuracy.
  3. Problem background:
    1. In ImageNet classification, the size and position of the target object vary a lot. Hence the following options:
    2. The first solution: feed the ConvNet over the image multiple times in a sliding-window fashion at several scales. The catch: suppose a sliding window covers a sufficiently identifiable part of the object, say a dog's head; the classification result ("a dog") is fine, but the localization and detection results are poor (only the dog's head gets boxed).
    3. The second solution: for each sliding window, output not only the classification result (a probability per class) but also the location and size of a bounding box relative to that window.
    4. The third solution: accumulate the evidence for each category at each location and size.
  4. In this paper, we explore three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection. Each task is a sub-task of the next.
  5. Network architecture:
    1. No LRN.
    2. non-overlapping pooling
    3. Smaller strides in the first two layers, hence larger feature maps, because the paper holds that "a larger stride is beneficial for speed but will hurt accuracy".
    4. Two models are designed, fast and accurate. The accurate one has one more layer and more neurons in its first FC layer. For the competition results an ensemble of 5 models was used. The table below is the fast version.
    5. [table of the fast model's layer configuration omitted]
    6. An FCN-like idea: "during training, we treat this architecture as non-spatial (output maps of size 1x1), as opposed to the inference step, which produces spatial outputs. The spatial size of the feature maps depends on the input image size, which varies during our inference step. Here we show training spatial sizes. The fully-connected layers can also be seen as 1x1 convolutions in a spatial setting." (A sketch of this view follows this list.)
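A sketch of that reading of the classifier, assuming (as a made-up example) a 5×5 layer-5 map with 256 channels at training resolution: the first FC layer becomes a 5×5 convolution and the rest become 1×1 convolutions, so a larger test image yields a spatial map of class scores rather than a single vector. The channel counts are illustrative, not OverFeat's exact numbers.

```python
import torch
import torch.nn as nn

C_IN, N_CLASSES = 256, 1000          # illustrative sizes
classifier_as_convs = nn.Sequential(
    nn.Conv2d(C_IN, 4096, kernel_size=5),       # "fc6" = a 5x5 conv over layer 5
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),       # "fc7" = a 1x1 conv
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, N_CLASSES, kernel_size=1),  # class scores = a 1x1 conv
)

print(classifier_as_convs(torch.zeros(1, C_IN, 5, 5)).shape)   # (1, 1000, 1, 1)
print(classifier_as_convs(torch.zeros(1, C_IN, 8, 11)).shape)  # (1, 1000, 4, 7)
```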
  6. Classification
    1. Training details:
      1. Each image is downsampled so that the smallest dimension is 256 pixels. We then extract 5 random crops (and their horizontal flips) of size 221x221 pixels and present these to the network in mini-batches of size 128.
      2. Momentum term of 0.6 and an L2 weight decay of 1 × 10^-5.
      3. The learning rate is initially 5 × 10^-2 and is successively decreased by a factor of 0.5 after (30, 50, 60, 70, 80) epochs.
      4. DropOut with a rate of 0.5 is employed on the fully connected layers (6th and 7th) in the classifier
    2. Trick: multi-scale dense evaluation
      1. The result of convolving a ConvNet on an image of arbitrary size is a spatial map of C-dimensional vectors at each scale
      2. Earlier work used 10 views (the four corners and the center, plus horizontal flips) at a single scale.
      3. we explore the entire image by densely running the network at each location and at multiple scales.
      4. So how is this made efficient? (It looks more and more like FCN.) "ConvNets are inherently efficient when applied in a sliding fashion because they naturally share computations common to overlapping regions", as the figure below illustrates.
      5. [figure omitted]
      6. "The total subsampling ratio in the network described above is 2x3x2x3, or 36. Hence when applied densely, this architecture can only produce a classification vector every 36 pixels in the input dimension along each axis." That is not good enough: 36 pixels is too coarse, and the actual object in the image may fail to align with any of these windows, so the subsampling ratio has to be reduced. Simply dropping the last max pooling is not great either, so the paper uses a more novel approach, the so-called resolution augmentation. It can be summarized as 3x3 pooling with offsets: the pooled feature maps produced at each offset (0, 1, 2 along x and 0, 1, 2 along y, 9 combinations in total) are all passed through the classifier (the 3 FC layers), and the results are assembled together, so the (final) subsampling ratio is effectively unchanged. (A small sketch of the offset pooling appears after this list.)
      7. [figure omitted]
      8. These operations can be viewed as shifting the classifier's viewing window by 1 pixel through pooling layers without subsampling and using skip-kernels in the following layer (where values in the neighborhood are non-adjacent). Or equivalently, as applying the final pooling layer and fully-connected stack at every possible offset, and assembling the results by interleaving the outputs.
      9. The procedure above is repeated for the horizontally flipped version of each image. We then produce the final classification by (i) taking the spatial max for each class, at each scale and flip; (ii) averaging the resulting C-dimensional vectors from different scales and flips and (iii) taking the top-1 or top-5 elements (depending on the evaluation criterion) from the mean class vector.
      10. Why do it this way: "The exhaustive pooling scheme ensures that we can obtain fine alignment between the classifier and the representation of the object in the feature map."
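A NumPy sketch of the offset ("resolution augmentation") pooling as I read it: for each of the 9 (dy, dx) offsets, shift the layer-5 map, apply non-overlapping 3×3 max pooling, and classify each pooled map separately; the resulting class maps are then interleaved so the effective output stride stays fine. The array layout and names are mine.

```python
import numpy as np

def offset_max_pools(fmap, pool=3):
    # fmap: (C, H, W) layer-5 feature map before the final pooling.
    # Returns {(dy, dx): pooled map} for the 9 pooling offsets; each pooled map
    # would be run through the 3 FC layers, and the class maps interleaved.
    pooled = {}
    for dy in range(pool):
        for dx in range(pool):
            sub = fmap[:, dy:, dx:]
            c, h, w = sub.shape
            h, w = (h // pool) * pool, (w // pool) * pool   # trim to multiples of 3
            blocks = sub[:, :h, :w].reshape(c, h // pool, pool, w // pool, pool)
            pooled[(dy, dx)] = blocks.max(axis=(2, 4))
    return pooled
```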
  7. Localization:
    1. Picking up from the classification model above: "we replace the classifier layers by a regression network and train it to predict object bounding boxes at each spatial location and scale. We then combine the regression predictions together, along with the classification results at each location."
    2. To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network. The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present (though not necessarily fully contained) in the corresponding field of view. Thus we can assign a confidence to each bounding box.
    3. Regressor: "The regression network takes as input the pooled feature maps from layer 5. It has 2 fully-connected hidden layers of size 4096 and 1024 channels, respectively. The final output layer has 4 units which specify the coordinates for the bounding box edges." (A minimal sketch of this head follows this list.)
    4. Bounding-box merging instead of NMS. The authors argue that "our approach is naturally more robust to false positives coming from the pure-classification model than traditional non-maximum suppression, by rewarding bounding box coherence."
    5. An important lesson and a possible future direction: "Using a different top layer for each class in the regressor network (Per-Class Regressor (PCR) in Fig. 9) surprisingly did not outperform using only a single network shared among all classes (44.1% vs. 31.3%). This may be because there are relatively few examples per class annotated with bounding boxes in the training set, while the network has 1000 times more top-layer parameters, resulting in insufficient training. It is possible this approach may be improved by sharing parameters only among similar classes (e.g. training one network for all classes of dogs, another for vehicles, etc.)."
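A minimal sketch of that shared, class-agnostic regression head, taking flattened pooled layer-5 features; the flattened input size is a placeholder assumption.

```python
import torch.nn as nn

POOLED_FEATS = 256 * 6 * 6            # placeholder for the pooled layer-5 size
bbox_regressor = nn.Sequential(
    nn.Flatten(),
    nn.Linear(POOLED_FEATS, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1024), nn.ReLU(inplace=True),
    nn.Linear(1024, 4),               # bounding-box edge coordinates for this window
)
```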
  8. Detection:
    1. The difference from localization: "The main difference with the localization task is the necessity to predict a background class when no object is present."
    2. The usual training method: "Traditionally, negative examples are initially taken at random for training. Then the most offending negative errors are added to the training set in bootstrapping passes."
    3. The drawbacks of that approach:
      1. Independent bootstrapping passes render training complicated and risk potential mismatches between the negative examples collection and training times.
      2. the size of bootstrapping passes needs to be tuned to make sure training does not overfit on a small set.
    4. Their innovation: "To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image such as random ones or most offending ones. This approach is more computationally expensive, but renders the procedure much simpler."
  9. Possible improvements for localization and detection: "We are using L2 loss, rather than directly optimizing the intersection-over-union (IOU) criterion on which performance is measured. Swapping the loss to this should be possible since IOU is still differentiable, provided there is some overlap. Alternate parameterizations of the bounding box may help to decorrelate the outputs, which will aid network training."
