
Step-by-Step Analysis of Deep Neural Network Fundamentals: Convolutional Neural Networks

History

[Figure omitted: a brief history of convolutional neural networks]

Convolutional Neural Networks (CNNs / ConvNets)

Convolutional Neural Networks are very similar to ordinary Neural Networks from the previous chapter: they are made up of neurons that have learnable weights and biases. Each neuron receives some inputs, performs a dot product and optionally follows it with a non-linearity. The whole network still expresses a single differentiable score function: from the raw image pixels on one end to class scores at the other. And they still have a loss function (e.g. SVM/Softmax) on the last (fully-connected) layer and all the tips/tricks we developed for learning regular Neural Networks still apply.
(A convolutional neural network is essentially a feed-forward neural network; like an ordinary network, it is trained with backpropagation to reduce the loss.)

So what does change? ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture. These then make the forward function more efficient to implement and vastly reduce the amount of parameters in the network.
(ConvNets assume the input is an image, which lets image-specific properties be encoded into the architecture, making the forward pass more efficient and greatly reducing the number of parameters.)

I. Architecture Overview

Recall: Regular Neural Nets. As we saw in the previous chapter, Neural Networks receive an input (a single vector), and transform it through a series of hidden layers. Each hidden layer is made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer, and where neurons in a single layer function completely independently and do not share any connections. The last fully-connected layer is called the “output layer” and in classification settings it represents the class scores.
Recall from the chapter on regular neural networks: neurons in adjacent layers are fully connected, the last layer is a fully-connected output layer, and it produces a score for each class.

Regular Neural Nets don’t scale well to full images. In CIFAR-10, images are only of size 32x32x3 (32 wide, 32 high, 3 color channels), so a single fully-connected neuron in a first hidden layer of a regular Neural Network would have 32*32*3 = 3072 weights. This amount still seems manageable, but clearly this fully-connected structure does not scale to larger images. For example, an image of more respectable size, e.g. 200x200x3, would lead to neurons that have 200*200*3 = 120,000 weights. Moreover, we would almost certainly want to have several such neurons, so the parameters would add up quickly! Clearly, this full connectivity is wasteful and the huge number of parameters would quickly lead to overfitting.
Regular neural networks do not scale well to full images. In CIFAR-10 an image is 32x32x3, so a single fully-connected neuron in the first hidden layer already has 32*32*3 = 3072 weights; for a more realistic 200x200x3 image each such neuron would need 200*200*3 = 120,000 weights, and with many such neurons the parameter count grows by orders of magnitude. Full connectivity is therefore wasteful and quickly leads to overfitting.
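To make the scaling concrete, here is a quick back-of-the-envelope calculation (a Python sketch using the image sizes from the text) of how many weights a single fully-connected neuron in the first hidden layer needs:

# Weights needed by ONE fully-connected neuron in the first hidden layer.
def fc_weights_per_neuron(width, height, channels):
    return width * height * channels

print(fc_weights_per_neuron(32, 32, 3))     # CIFAR-10 image: 3,072 weights
print(fc_weights_per_neuron(200, 200, 3))   # 200x200 RGB image: 120,000 weights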

3D volumes of neurons. Convolutional Neural Networks take advantage of the fact that the input consists of images and they constrain the architecture in a more sensible way. In particular, unlike a regular Neural Network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, depth. (Note that the word depth here refers to the third dimension of an activation volume, not to the depth of a full Neural Network, which can refer to the total number of layers in a network.) For example, the input images in CIFAR-10 are an input volume of activations, and the volume has dimensions 32x32x3 (width, height, depth respectively). As we will soon see, the neurons in a layer will only be connected to a small region of the layer before it, instead of all of the neurons in a fully-connected manner. Moreover, the final output layer would for CIFAR-10 have dimensions 1x1x10, because by the end of the ConvNet architecture we will reduce the full image into a single vector of class scores, arranged along the depth dimension. Here is a visualization:
3D volumes of neurons: unlike a regular network, a ConvNet arranges its neurons in three dimensions (width, height, depth, where depth means the third dimension of an activation volume, not the number of layers in the network). The CIFAR-10 input is a 32x32x3 volume of activations, each neuron in a layer connects only to a small region of the previous layer, and the final output layer is a 1x1x10 volume, i.e. a single vector of class scores arranged along the depth dimension. Here is a visualization:
[Figure: a regular 3-layer neural network (top) and a ConvNet arranging its neurons in 3D volumes (bottom); captions below]

Top: a regular 3-layer neural network. Bottom: a ConvNet arranges its neurons in three dimensions (width, height, depth), as visualized in one of the layers. Every layer of a ConvNet transforms the 3D input volume into a 3D output volume of neuron activations. In this example, the red input layer holds the image, so its width and height are the dimensions of the image, and its depth is 3 (the red, green and blue channels).

II. Layers used to build ConvNets
As we described above, a simple ConvNet is a sequence of layers, and every layer of a ConvNet transforms one volume of activations to another through a differentiable function. We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture.
(In short: a simple ConvNet is a sequence of layers, convolutional, pooling and fully-connected, each transforming one volume of activations into another through a differentiable function.)

Example Architecture: Overview. We will go into more details below, but a simple ConvNet for CIFAR-10 classification could have the architecture [INPUT - CONV - RELU - POOL - FC]. In more detail:
We walk through a ConvNet with the architecture [INPUT - CONV - RELU - POOL - FC]:

  1. INPUT [32x32x3] will hold the raw pixel values of the image, in this
    case an image of width 32, height 32, and with three color channels
    R,G,B.
    The image is 32 pixels wide and 32 pixels high, with three color channels R, G, B.
  2. CONV layer will compute the output of neurons that are connected to
    local regions in the input, each computing a dot product between
    their weights and a small region they are connected to in the input
    volume. This may result in volume such as [32x32x12] if we decided to
    use 12 filters.
    With 12 filters, the output volume would be [32x32x12].
  3. RELU layer will apply an elementwise activation function, such as
    the max(0,x) thresholding at zero. This leaves the size of the
    volume unchanged ([32x32x12]).
    RELU applies the elementwise activation max(0, x), thresholded at zero; the volume size is unchanged ([32x32x12]).
  4. POOL layer will perform a downsampling operation along the spatial
    dimensions (width, height), resulting in volume such as [16x16x12].
    POOL downsamples along the spatial dimensions, producing a [16x16x12] volume.
  5. FC (i.e. fully-connected) layer will compute the class scores,
    resulting in volume of size [1x1x10], where each of the 10 numbers
    correspond to a class score, such as among the 10 categories of
    CIFAR-10. As with ordinary Neural Networks and as the name implies,
    each neuron in this layer will be connected to all the numbers in
    the previous volume.

    The fully-connected layer produces a [1x1x10] volume, one score for each of the 10 CIFAR-10 classes; as in an ordinary neural network, each of its neurons connects to every number in the previous volume. (A minimal shape trace of this architecture is sketched just below.)
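Here is a minimal shape trace of the example architecture above, written as a Python sketch. The helper names and the concrete choices (3x3 conv filters with "same" padding, 2x2 max pooling with stride 2) are assumptions for illustration, not prescribed by the text:

# Trace the volume shapes through INPUT -> CONV -> RELU -> POOL -> FC for CIFAR-10,
# assuming 12 conv filters of size 3x3 with "same" padding and 2x2/stride-2 pooling.
def conv_shape(w, h, d, num_filters, f=3, stride=1, pad=1):
    return ((w - f + 2 * pad) // stride + 1,
            (h - f + 2 * pad) // stride + 1,
            num_filters)

def pool_shape(w, h, d, f=2, stride=2):
    return ((w - f) // stride + 1, (h - f) // stride + 1, d)

shape = (32, 32, 3)                          # INPUT
shape = conv_shape(*shape, num_filters=12)   # CONV -> (32, 32, 12)
                                             # RELU leaves the shape unchanged
shape = pool_shape(*shape)                   # POOL -> (16, 16, 12)
shape = (1, 1, 10)                           # FC   -> (1, 1, 10) class scores
print(shape)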

In this way, ConvNets transform the original image layer by layer from the original pixel values to the final class scores. Note that some layers contain parameters and other don’t. In particular, the CONV/FC layers perform transformations that are a function of not only the activations in the input volume, but also of the parameters (the weights and biases of the neurons). On the other hand, the RELU/POOL layers will implement a fixed function. The parameters in the CONV/FC layers will be trained with gradient descent so that the class scores that the ConvNet computes are consistent with the labels in the training set for each image.

  • A ConvNet architecture is, in the simplest case, a list of layers that transform the image volume into an output volume (e.g. holding the class scores)
  • There are a few distinct types of layers (e.g. CONV/FC/RELU/POOL are by far the most popular)
  • Each layer accepts an input 3D volume and transforms it to an output 3D volume through a differentiable function
  • Each layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't)
  • Each layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't)

We now describe the individual layers and the details of their hyperparameters and their connectivities.
Next we describe each layer, its hyperparameters, and its connectivity in detail.

1. Convolutional Layer

The Conv layer is the core building block of a Convolutional Network that does most of the computational heavy lifting.
Overview and intuition without brain stuff. Lets first discuss what the CONV layer computes without brain/neuron analogies. The CONV layer’s parameters consist of a set of learnable filters. Every filter is small spatially (along width and height), but extends through the full depth of the input volume. For example, a typical filter on a first layer of a ConvNet might have size 5x5x3 (i.e. 5 pixels width and height, and 3 because images have depth 3, the color channels). During the forward pass, we slide (more precisely, convolve) each filter across the width and height of the input volume and compute dot products between the entries of the filter and the input at any position. As we slide the filter over the width and height of the input volume we will produce a 2-dimensional activation map that gives the responses of that filter at every spatial position. Intuitively, the network will learn filters that activate when they see some type of visual feature such as an edge of some orientation or a blotch of some color on the first layer, or eventually entire honeycomb or wheel-like patterns on higher layers of the network. Now, we will have an entire set of filters in each CONV layer (e.g. 12 filters), and each of them will produce a separate 2-dimensional activation map. We will stack these activation maps along the depth dimension and produce the output volume.

The brain view. If you’re a fan of the brain/neuron analogies, every entry in the 3D output volume can also be interpreted as an output of a neuron that looks at only a small region in the input and shares parameters with all neurons to the left and right spatially (since these numbers all result from applying the same filter). We now discuss the details of the neuron connectivities, their arrangement in space, and their parameter sharing scheme.
From the brain's point of view: we now discuss how the neurons are connected, how they are arranged in space, and how they share parameters.
Local Connectivity. When dealing with high-dimensional inputs such as images, as we saw above it is impractical to connect neurons to all neurons in the previous volume. Instead, we will connect each neuron to only a local region of the input volume. The spatial extent of this connectivity is a hyperparameter called the receptive field of the neuron (equivalently this is the filter size). The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.
Local connectivity: each neuron is connected only to a local region of the input volume; the spatial extent of that region is the receptive field (equivalently, the filter size).
Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume.
For example, with a [32x32x3] input and a 5x5 filter, each neuron connects to a [5x5x3] local region, i.e. 5*5*3 = 75 weights (plus 1 bias).
Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in space (e.g. 3x3), but full along the input depth (20).
(Again: the connectivity is local in space, 3x3, but full along the input depth of 20, giving 3*3*20 = 180 connections per neuron; see the small check below.)
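As a small sanity check (a hypothetical helper that just restates the arithmetic above), each neuron connects to an F x F x D region of the input volume:

# Connections of one conv-layer neuron to its receptive field.
def weights_per_neuron(f, depth, bias=True):
    return f * f * depth + (1 if bias else 0)

print(weights_per_neuron(5, 3))    # Example 1: 5*5*3 weights + 1 bias = 76 parameters
print(weights_per_neuron(3, 20))   # Example 2: 3*3*20 = 180 connections, 181 parameters with the bias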
[Figure: an example input volume and the neurons of the first convolutional layer; caption below]

Top: an example input volume in red (e.g. a 32x32x3 CIFAR-10 image) and an example volume of neurons in the first convolutional layer. Each neuron in the convolutional layer is connected only to a local region of the input volume spatially, but to its full depth (i.e. all color channels). Note that there are multiple neurons along the depth (5 in this example), all looking at the same region of the input; see the discussion of depth columns in the text below. Bottom: the neurons from the Neural Network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be spatially local.

Spatial arrangement. We have explained the connectivity of each neuron in the Conv Layer to the input volume, but we haven’t yet discussed how many neurons there are in the output volume or how they are arranged. Three hyperparameters control the size of the output volume: the depth, stride and zero-padding. We discuss these next:
Spatial arrangement. Three hyperparameters control the size of the output volume: the depth, the stride, and the zero-padding. We discuss these next:

  1. First, the depth of the output volume is a hyperparameter: it
    corresponds to the number of filters we would like to use, each
    learning to look for something different in the input. For example,
    if the first Convolutional Layer takes as input the raw image, then
    different neurons along the depth dimension may activate in presence
    of various oriented edges, or blobs of color. We will refer to a set
    of neurons that are all looking at the same region of the input as a
    depth column (some people also prefer the term fibre).
    The depth of the output volume corresponds to the number of filters used; the neurons along one depth column look at the same input region but learn to respond to different features (e.g. differently oriented edges or blobs of color).
  2. Second, we must specify the stride with which we slide the filter. When the stride is 1 then we move the filters one pixel at a time. When the stride is 2 (or uncommonly 3 or more, though this is rare in practice) then the filters jump 2 pixels at a time as we slide them around. This will produce smaller output volumes spatially.
    The stride is the number of pixels the filter moves at each step; in practice it is usually 1 or 2 (a stride of 2 produces a spatially smaller output volume, and larger strides are rare).
  3. As we will soon see, sometimes it will be convenient to pad the
    input volume with zeros around the border. The size of this
    zero-padding is a hyperparameter. The nice feature of zero padding
    is that it will allow us to control the spatial size of the output
    volumes (most commonly as we’ll see soon we will use it to exactly
    preserve the spatial size of the input volume so the input and
    output width and height are the same).
    Zero-padding pads the border of the input volume with zeros; its size is a hyperparameter, and it is most commonly used to make the output volume the same spatial size as the input.

We can compute the spatial size of the output volume as a function of the input volume size (W), the receptive field size of the Conv Layer neurons (F), the stride with which they are applied (S), and the amount of zero padding used (P) on the border. You can convince yourself that the correct formula for calculating how many neurons “fit” is given by (W−F+2P)/S+1. For example for a 7x7 input and a 3x3 filter with stride 1 and pad 0 we would get a 5x5 output. With stride 2 we would get a 3x3 output. Lets also see one more graphical example:
For an input of spatial size W, receptive field F, stride S, and zero-padding P, the output spatial size is (W − F + 2P)/S + 1.
For a 7x7 input with a 3x3 filter, stride 1 and pad 0, we get (7 − 3 + 2*0)/1 + 1 = 5, i.e. a 5x5 output; with stride 2 we get (7 − 3)/2 + 1 = 3, a 3x3 output. A small helper implementing this formula is sketched below.
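A minimal helper (the function name is our own, for illustration) implementing this formula, including a check that the hyperparameters actually "fit":

def conv_output_size(w, f, s, p):
    size = (w - f + 2 * p) / s + 1
    if size != int(size):
        raise ValueError("hyperparameters do not fit: output size %.2f is not an integer" % size)
    return int(size)

print(conv_output_size(7, 3, 1, 0))   # -> 5
print(conv_output_size(7, 3, 2, 0))   # -> 3
print(conv_output_size(5, 3, 1, 1))   # -> 5; "same" padding P=(F-1)/2 preserves the size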

Use of zero-padding. In the example above on left, note that the input dimension was 5 and the output dimension was equal: also 5. This worked out so because our receptive fields were 3 and we used zero padding of 1. If there was no zero-padding used, then the output volume would have had spatial dimension of only 3, because that it is how many neurons would have “fit” across the original input. In general, setting zero padding to be P=(F−1)/2 when the stride is S=1 ensures that the input volume and output volume will have the same size spatially. It is very common to use zero-padding in this way and we will discuss the full reasons when we talk more about ConvNet architectures.
In general, with stride S = 1, setting the zero-padding to P = (F − 1)/2 ensures that the input and output volumes have the same spatial size.

Constraints on strides. Note again that the spatial arrangement hyperparameters have mutual constraints. For example, when the input has size W=10, no zero-padding is used (P=0), and the filter size is F=3, then it would be impossible to use stride S=2, since (W−F+2P)/S+1 = (10−3+0)/2+1 = 4.5, i.e. not an integer, indicating that the neurons don’t “fit” neatly and symmetrically across the input. Therefore, this setting of the hyperparameters is considered to be invalid, and a ConvNet library could throw an exception, or zero pad the rest to make it fit, or crop the input to make it fit, or something. As we will see in the ConvNet architectures section, sizing the ConvNets appropriately so that all the dimensions “work out” can be a real headache, which the use of zero-padding and some design guidelines will significantly alleviate. (The helper sketched above catches exactly this kind of invalid setting, as shown below.)
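Using the conv_output_size helper sketched above, the invalid setting from this example would be rejected:

try:
    conv_output_size(10, 3, 2, 0)     # (10 - 3 + 0)/2 + 1 = 4.5, not an integer
except ValueError as e:
    print(e)                          # hyperparameters do not fit: output size 4.50 is not an integer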

Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.
(In short: AlexNet's first conv layer used F=11, S=4, P=0 on [227x227x3] inputs with K=96 filters, giving a [55x55x96] output volume since (227 − 11)/4 + 1 = 55; the 224x224 size quoted in the paper does not divide evenly.)

Parameter Sharing. Parameter sharing scheme is used in Convolutional Layers to control the number of parameters. Using the real-world example above, we see that there are 55*55*96 = 290,400 neurons in the first Conv Layer, and each has 11*11*3 = 363 weights and 1 bias. Together, this adds up to 290400 * 364 = 105,705,600 parameters on the first layer of the ConvNet alone. Clearly, this number is very high.

(In short: without parameter sharing, the first conv layer alone would have 55*55*96 = 290,400 neurons, each with 11*11*3 = 363 weights and 1 bias, i.e. 290,400 * 364 = 105,705,600 parameters, which is clearly far too many.)

It turns out that we can dramatically reduce the number of parameters by making one reasonable assumption: That if one feature is useful to compute at some spatial position (x,y), then it should also be useful to compute at a different position (x2,y2). In other words, denoting a single 2-dimensional slice of depth as a depth slice (e.g. a volume of size [55x55x96] has 96 depth slices, each of size [55x55]), we are going to constrain the neurons in each depth slice to use the same weights and bias. With this parameter sharing scheme, the first Conv Layer in our example would now have only 96 unique set of weights (one for each depth slice), for a total of 96*11*11*3 = 34,848 unique weights, or 34,944 parameters (+96 biases). Alternatively, all 55*55 neurons in each depth slice will now be using the same parameters. In practice during backpropagation, every neuron in the volume will compute the gradient for its weights, but these gradients will be added up across each depth slice and only update a single set of weights per slice.

Notice that if all neurons in a single depth slice are using the same weight vector, then the forward pass of the CONV layer can in each depth slice be computed as a convolution of the neuron’s weights with the input volume (Hence the name: Convolutional Layer). This is why it is common to refer to the sets of weights as a filter (or a kernel), that is convolved with the input.

(In short: assuming that a feature useful at one position (x, y) is also useful at another position (x2, y2), all 55*55 neurons in a depth slice are constrained to use the same weights and bias. The first conv layer then has only 96 unique filters, i.e. 96*11*11*3 = 34,848 weights plus 96 biases = 34,944 parameters. During backpropagation the weight gradients are summed across each depth slice and a single set of weights is updated per slice. Because every depth slice applies the same weights, the forward pass of each slice is a convolution of those weights with the input volume, hence the name convolutional layer, and the shared weight sets are called filters or kernels. A quick numeric check follows below.)
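A quick numeric check of these counts (a hypothetical helper; the layer sizes are the ones quoted above):

# Parameters of a conv layer with and without parameter sharing.
def conv_params(out_w, out_h, k, f, depth, shared=True):
    weights_per_filter = f * f * depth
    if shared:
        return k * weights_per_filter + k                 # one filter + one bias per depth slice
    return out_w * out_h * k * (weights_per_filter + 1)   # every neuron has its own weights and bias

print(conv_params(55, 55, 96, 11, 3, shared=False))  # 105,705,600
print(conv_params(55, 55, 96, 11, 3, shared=True))   # 34,944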

[Figure: example filters learned by Krizhevsky et al.]
Example filters learned by Krizhevsky et al. Each of the 96 filters shown is of size [11x11x3], and each is shared by the 55*55 neurons in one depth slice. The parameter-sharing assumption is reasonable: if detecting, say, a horizontal edge is important at some location in the image, it should intuitively also be useful at other locations, thanks to the translationally-invariant structure of images, so there is no need to relearn the detector at every one of the 55*55 distinct positions in the conv layer's output volume.

Note that sometimes the parameter sharing assumption may not make sense. This is especially the case when the input images to a ConvNet have some specific centered structure, where we should expect, for example, that completely different features should be learned on one side of the image than another. One practical example is when the input are faces that have been centered in the image. You might expect that different eye-specific or hair-specific features could (and should) be learned in different spatial locations. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a Locally-Connected Layer.
Numpy examples. To make the discussion above more concrete, lets express the same ideas but in code and with a specific example. Suppose that the input volume is a numpy array X. Then:

A depth column (or a fibre) at position (x,y) would be the activations X[x,y,:].
A depth slice, or equivalently an activation map at depth d would be the activations X[:,:,d].

Conv Layer Example. Suppose that the input volume X has shape X.shape: (11,11,4). Suppose further that we use no zero padding (P=0), that the filter size is F=5, and that the stride is S=2. The output volume would therefore have spatial size (11-5)/2+1 = 4, giving a volume with width and height of 4. The activation map in the output volume (call it V) would then look as follows (only some of the elements are computed in this example):

import numpy as np
X = np.random.randn(11, 11, 4)            # hypothetical input volume, shape (11,11,4)
W0, b0 = np.random.randn(5, 5, 4), 0.0    # first filter's weights and bias (assumed values)
V = np.zeros((4, 4, 2))                   # output volume for this example: 4x4 spatially, 2 filters
V[0,0,0] = np.sum(X[:5,:5,:] * W0) + b0
V[1,0,0] = np.sum(X[2:7,:5,:] * W0) + b0
V[2,0,0] = np.sum(X[4:9,:5,:] * W0) + b0
V[3,0,0] = np.sum(X[6:11,:5,:] * W0) + b0

Remember that in numpy, the operation * above denotes elementwise multiplication between the arrays. Notice also that the weight vector W0 is the weight vector of that neuron and b0 is the bias. Here, W0 is assumed to be of shape W0.shape: (5,5,4), since the filter size is 5 and the depth of the input volume is 4. Notice that at each point, we are computing the dot product as seen before in ordinary neural networks. Also, we see that we are using the same weight and bias (due to parameter sharing), and where the dimensions along the width are increasing in steps of 2 (i.e. the stride). To construct a second activation map in the output volume, we would have:

W1, b1 = np.random.randn(5, 5, 4), 0.0    # second filter's weights and bias (assumed values)
V[0,0,1] = np.sum(X[:5,:5,:] * W1) + b1
V[1,0,1] = np.sum(X[2:7,:5,:] * W1) + b1
V[2,0,1] = np.sum(X[4:9,:5,:] * W1) + b1
V[3,0,1] = np.sum(X[6:11,:5,:] * W1) + b1
V[0,1,1] = np.sum(X[:5,2:7,:] * W1) + b1    # example of going along y
V[2,3,1] = np.sum(X[4:9,6:11,:] * W1) + b1  # or along both

where we see that we are indexing into the second depth dimension in V (at index 1) because we are computing the second activation map, and that a different set of parameters (W1) is now used. In the example above, we are for brevity leaving out some of the other operations the Conv Layer would perform to fill the other parts of the output array V. Additionally, recall that these activation maps are often followed elementwise through an activation function such as ReLU, but this is not shown here.
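For completeness, here is a compact sketch (not part of the original notes) of the loops implied above, filling the whole output volume V with the X, W0, b0, W1, b1 defined in the previous snippets and a stride of S = 2:

for d, (W, b) in enumerate([(W0, b0), (W1, b1)]):    # one activation map per filter
    for i in range(4):                               # output x index
        for j in range(4):                           # output y index
            x0, y0 = i * 2, j * 2                    # top-left corner of the 5x5x4 input region
            V[i, j, d] = np.sum(X[x0:x0+5, y0:y0+5, :] * W) + b
V = np.maximum(V, 0)                                 # an elementwise ReLU would typically follow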

Summary. To summarize, the Conv Layer:

  • Accepts a volume of size W1×H1×D1
  • Requires four hyperparameters:
    the number of filters K,
    their spatial extent F,
    the stride S,
    the amount of zero padding P
  • Produces a volume of size W2×H2×D2 where:
    W2 = (W1 − F + 2P)/S + 1
    H2 = (H1 − F + 2P)/S + 1 (i.e. width and height are computed equally by symmetry)
    D2 = K
  • With parameter sharing, it introduces F⋅F⋅D1 weights per filter, for a total of (F⋅F⋅D1)⋅K weights and K biases.
  • In the output volume, the d-th depth slice (of size W2×H2) is the result of performing a valid convolution of the d-th filter over the input volume with a stride of S, and then offsetting by the d-th bias.

A common setting of the hyperparameters is F=3, S=1, P=1. However, there are common conventions and rules of thumb that motivate these hyperparameters. See the ConvNet architectures section below.
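The summary can also be written as a small function (assumed names, for illustration) that returns the output volume size and the parameter count; the AlexNet-style numbers from earlier serve as a check:

def conv_layer_summary(w1, h1, d1, k, f, s, p):
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    n_params = (f * f * d1) * k + k        # shared weights plus one bias per filter
    return (w2, h2, k), n_params

# [227x227x3] input, K=96, F=11, S=4, P=0
print(conv_layer_summary(227, 227, 3, 96, 11, 4, 0))   # ((55, 55, 96), 34944)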

Convolution Demo. Below is a running demo of a CONV layer. Since 3D volumes are hard to visualize, all the volumes (the input volume (in blue), the weight volumes (in red), the output volume (in green)) are visualized with each depth slice stacked in rows. The input volume is of size W1=5, H1=5, D1=3, and the CONV layer parameters are K=2, F=3, S=2, P=1. That is, we have two filters of size 3×3, and they are applied with a stride of 2. Therefore, the output volume has spatial size (5 - 3 + 2)/2 + 1 = 3. Moreover, notice that a padding of P=1 is applied to the input volume, making the outer border of the input volume zero. The visualization below iterates over the output activations (green), and shows that each element is computed by elementwise multiplying the highlighted input (blue) with the filter (red), summing it up, and then offsetting the result by the bias.

[The animated convolution demo from the original notes is not reproduced here; it is worth studying carefully.]

Implementation as Matrix Multiplication. Note that the convolution operation essentially performs dot products between the filters and local regions of the input. A common implementation pattern of the CONV layer is to take advantage of this fact and formulate the forward pass of a convolutional layer as one big matrix multiply as follows:

The local regions in the input image are stretched out into columns in an operation commonly called im2col. For example, if the input is [227x227x3] and it is to be convolved with 11x11x3 filters at stride 4, then we would take [11x11x3] blocks of pixels in the input and stretch each block into a column vector of size 11*11*3 = 363. Iterating this process in the input at stride of 4 gives (227-11)/4+1 = 55 locations along both width and height, leading to an output matrix X_col of im2col of size [363 x 3025], where every column is a stretched out receptive field and there are 55*55 = 3025 of them in total. Note that since the receptive fields overlap, every number in the input volume may be duplicated in multiple distinct columns.
The weights of the CONV layer are similarly stretched out into rows. For example, if there are 96 filters of size [11x11x3] this would give a matrix W_row of size [96 x 363].
The result of a convolution is now equivalent to performing one large matrix multiply np.dot(W_row, X_col), which evaluates the dot product between every filter and every receptive field location. In our example, the output of this operation would be [96 x 3025], giving the output of the dot product of each filter at each location.
The result must finally be reshaped back to its proper output dimension [55x55x96].

This approach has the downside that it can use a lot of memory, since some values in the input volume are replicated multiple times in X_col. However, the benefit is that there are many very efficient implementations of Matrix Multiplication that we can take advantage of (for example, in the commonly used BLAS API). Moreover, the same im2col idea can be reused to perform the pooling operation, which we discuss next.
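Here is a minimal, unoptimized im2col-style forward pass in numpy (a sketch; the function name and the (H, W, D) / (K, F, F, D) array layouts are our own choices, not prescribed by the text):

import numpy as np

def conv_forward_im2col(X, W, b, stride, pad):
    # X: (H, W, D) input volume; W: (K, F, F, D) filters; b: (K,) biases
    K, F, _, D = W.shape
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    H_out = (X.shape[0] - F + 2 * pad) // stride + 1
    W_out = (X.shape[1] - F + 2 * pad) // stride + 1
    cols = []
    for i in range(H_out):
        for j in range(W_out):
            patch = Xp[i*stride:i*stride+F, j*stride:j*stride+F, :]
            cols.append(patch.reshape(-1))           # stretch each receptive field into a column
    X_col = np.stack(cols, axis=1)                   # (F*F*D, H_out*W_out)
    W_row = W.reshape(K, -1)                         # each filter stretched into a row
    out = W_row @ X_col + b[:, None]                 # one big matrix multiply
    return out.reshape(K, H_out, W_out).transpose(1, 2, 0)

# A [227x227x3] input with 96 filters of size 11x11x3 at stride 4 gives (55, 55, 96):
out = conv_forward_im2col(np.random.randn(227, 227, 3),
                          np.random.randn(96, 11, 11, 3), np.zeros(96), stride=4, pad=0)
print(out.shape)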

Backpropagation. The backward pass for a convolution operation (for both the data and the weights) is also a convolution (but with spatially-flipped filters). This is easy to derive in the 1-dimensional case with a toy example (not expanded on for now).

1x1 convolution. As an aside, several papers use 1x1 convolutions, as first investigated by Network in Network. Some people are at first confused to see 1x1 convolutions especially when they come from signal processing background. Normally signals are 2-dimensional so 1x1 convolutions do not make sense (it’s just pointwise scaling). However, in ConvNets this is not the case because one must remember that we operate over 3-dimensional volumes, and that the filters always extend through the full depth of the input volume. For example, if the input is [32x32x3] then doing 1x1 convolutions would effectively be doing 3-dimensional dot products (since the input depth is 3 channels).

Dilated convolutions. A recent development (e.g. see paper by Fisher Yu and Vladlen Koltun) is to introduce one more hyperparameter to the CONV layer called the dilation. So far we’ve only discussed CONV filters that are contiguous. However, it’s possible to have filters that have spaces between each cell, called dilation. As an example, in one dimension a filter w of size 3 would compute over input x the following: w[0]*x[0] + w[1]*x[1] + w[2]*x[2]. This is dilation of 0. For dilation 1 the filter would instead compute w[0]*x[0] + w[1]*x[2] + w[2]*x[4]; In other words there is a gap of 1 between the applications. This can be very useful in some settings to use in conjunction with 0-dilated filters because it allows you to merge spatial information across the inputs much more agressively with fewer layers. For example, if you stack two 3x3 CONV layers on top of each other then you can convince yourself that the neurons on the 2nd layer are a function of a 5x5 patch of the input (we would say that the effective receptive field of these neurons is 5x5). If we use dilated convolutions then this effective receptive field would grow much quicker.
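A tiny 1-D illustration of the dilation example above (hypothetical helper and values):

import numpy as np

def conv1d_dilated(x, w, dilation):
    # dilation = 0 means contiguous taps; dilation = 1 leaves a gap of 1 between taps
    step = dilation + 1
    span = (len(w) - 1) * step + 1
    return np.array([np.sum(x[i:i+span:step] * w) for i in range(len(x) - span + 1)])

x = np.arange(8, dtype=float)
w = np.array([1.0, 2.0, 3.0])
print(conv1d_dilated(x, w, dilation=0)[0])  # w[0]*x[0] + w[1]*x[1] + w[2]*x[2] = 8.0
print(conv1d_dilated(x, w, dilation=1)[0])  # w[0]*x[0] + w[1]*x[2] + w[2]*x[4] = 16.0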

2. Pooling Layer

It is common to periodically insert a Pooling layer in-between successive Conv layers in a ConvNet architecture. Its function is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the network, and hence to also control overfitting. The Pooling Layer operates independently on every depth slice of the input and resizes it spatially, using the MAX operation. The most common form is a pooling layer with filters of size 2x2 applied with a stride of 2 downsamples every depth slice in the input by 2 along both width and height, discarding 75% of the activations. Every MAX operation would in this case be taking a max over 4 numbers (little 2x2 region in some depth slice). The depth dimension remains unchanged. More generally, the pooling layer:
(In short: a pooling layer is periodically inserted between successive conv layers to progressively reduce the spatial size of the representation, which reduces the number of parameters and the amount of computation and helps control overfitting. It operates independently on every depth slice of the input using the MAX operation; the most common form uses 2x2 filters with a stride of 2, taking the max over each 2x2 region, which discards 75% of the activations while leaving the depth unchanged.)

  • Accepts a volume of size W1×H1×D1
  • Requires two hyperparameters:
    their spatial extent F and the stride S
  • Produces a volume of size W2×H2×D2 where:
    W2 = (W1 − F)/S + 1
    H2 = (H1 − F)/S + 1
    D2 = D1
  • Introduces zero parameters since it computes a fixed function of the input
  • Note that it is not common to use zero-padding for Pooling layers

It is worth noting that there are only two commonly seen variations of the max pooling layer found in practice: A pooling layer with F=3,S=2 (also called overlapping pooling), and more commonly F=2,S=2. Pooling sizes with larger receptive fields are too destructive.
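A small numpy sketch of the common F=2, S=2 max pooling (assumed function name; the loops are kept explicit for clarity):

import numpy as np

def max_pool(X, f=2, stride=2):
    H, W, D = X.shape
    H_out, W_out = (H - f) // stride + 1, (W - f) // stride + 1
    out = np.zeros((H_out, W_out, D))
    for i in range(H_out):
        for j in range(W_out):
            region = X[i*stride:i*stride+f, j*stride:j*stride+f, :]
            out[i, j, :] = region.max(axis=(0, 1))   # max over each 2x2 window, per depth slice
    return out

print(max_pool(np.random.randn(224, 224, 64)).shape)   # (112, 112, 64): depth is preserved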

General pooling. In addition to max pooling, the pooling units can also perform other functions, such as average pooling or even L2-norm pooling. Average pooling was often used historically but has recently fallen out of favor compared to the max pooling operation, which has been shown to work better in practice.
(In short: besides max pooling there are average pooling and L2-norm pooling; average pooling was often used historically, but max pooling has been shown to work better in practice and is now standard.)

[Figure: pooling downsamples each depth slice of the volume independently]

The pooling layer downsamples the volume spatially, independently in each depth slice of the input volume. Top: in this example, an input volume of size [224x224x64] is pooled with filter size 2 and stride 2 into an output volume of size [112x112x64]; note that the volume depth is preserved. Bottom: the most common downsampling operation is the max, giving max pooling, here shown with a stride of 2; that is, each max is taken over 4 numbers (a little 2x2 square).

Backpropagation. Recall from the backpropagation chapter that the backward pass for a max(x, y) operation has a simple interpretation as only routing the gradient to the input that had the highest value in the forward pass. Hence, during the forward pass of a pooling layer it is common to keep track of the index of the max activation (sometimes also called the switches) so that gradient routing is efficient during backpropagation.
(In short: the backward pass of max(x, y) simply routes the gradient to the input that had the highest value in the forward pass, so the forward pass of a pooling layer typically keeps track of the index of the max activation (sometimes called the switches) to make gradient routing efficient.)

Getting rid of pooling. Many people dislike the pooling operation and think that we can get away without it. For example, Striving for Simplicity: The All Convolutional Net proposes to discard the pooling layer in favor of architecture that only consists of repeated CONV layers. To reduce the size of the representation they suggest using larger stride in CONV layer once in a while. Discarding pooling layers has also been found to be important in training good generative models, such as variational autoencoders (VAEs) or generative adversarial networks (GANs). It seems likely that future architectures will feature very few to no pooling layers.
(In short: some argue that pooling can be dropped entirely; "Striving for Simplicity: The All Convolutional Net" replaces pooling layers with occasional larger strides in the CONV layers. Discarding pooling has also been found important for training good generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs), so future architectures may use very few or no pooling layers.)

3. Normalization Layer
Many types of normalization layers have been proposed for use in ConvNet architectures, sometimes with the intentions of implementing inhibition schemes observed in the biological brain. However, these layers have since fallen out of favor because in practice their contribution has been shown to be minimal, if any. For various types of normalizations, see the discussion in Alex Krizhevsky’s cuda-convnet library API.
(In short: many kinds of normalization layers have been proposed, sometimes to mimic inhibition schemes observed in the biological brain, but they have fallen out of favor because their contribution in practice has been shown to be minimal, if any; see Alex Krizhevsky's cuda-convnet library API for the various types.)

4. Fully-connected layer

Neurons in a fully connected layer have full connections to all activations in the previous layer, as seen in regular Neural Networks. Their activations can hence be computed with a matrix multiplication followed by a bias offset. See the Neural Network section of the notes for more information.
(In short: each neuron in a fully-connected layer is connected to all activations in the previous layer, so its activations can be computed as a matrix multiplication followed by a bias offset; see the Neural Network notes for details.)

5. Converting FC layers to CONV layers

It is worth noting that the only difference between FC and CONV layers is that the neurons in the CONV layer are connected only to a local region in the input, and that many of the neurons in a CONV volume share parameters. However, the neurons in both layers still compute dot products, so their functional form is identical. Therefore, it turns out that it’s possible to convert between FC and CONV layers:

(In short: the only difference between FC and CONV layers is that CONV neurons connect only to a local region of the input and share parameters; both compute dot products, so their functional form is identical and the two layer types can be converted into each other.)

  • For any CONV layer there is an FC layer that implements the same
    forward function. The weight matrix would be a large matrix that is
    mostly zero except for at certain blocks (due to local connectivity)
    where the weights in many of the blocks are equal (due to parameter
    sharing).

  • Conversely, any FC layer can be converted to a CONV layer. For
    example, an FC layer with K=4096 that is looking at some input
    volume of size 7×7×512 can be equivalently expressed as a CONV layer
    with F=7, P=0, S=1, K=4096. In other words, we are setting the filter
    size to be exactly the size of the input volume, and hence the output
    will simply be 1×1×4096, since only a single depth column "fits"
    across the input volume, giving an identical result to the initial FC
    layer.

FC->CONV conversion. Of these two conversions, the ability to convert an FC layer to a CONV layer is particularly useful in practice. Consider a ConvNet architecture that takes a 224x224x3 image, and then uses a series of CONV layers and POOL layers to reduce the image to an activations volume of size 7x7x512 (in an AlexNet architecture that we’ll see later, this is done by use of 5 pooling layers that downsample the input spatially by a factor of two each time, making the final spatial size 224/2/2/2/2/2 = 7). From there, an AlexNet uses two FC layers of size 4096 and finally the last FC layers with 1000 neurons that compute the class scores. We can convert each of these three FC layers to CONV layers as described above:
(In short: an AlexNet-style network reduces a 224x224x3 image to a 7x7x512 activation volume via 5 pooling layers that each downsample by 2, so 224/2/2/2/2/2 = 7; it then applies two FC layers of size 4096 and a final FC layer with 1000 neurons for the class scores. Each of these three FC layers can be converted to a CONV layer as described below.)

  • Replace the first FC layer that looks at [7x7x512] volume with a CONV
    layer that uses filter size F=7, giving output volume [1x1x4096].
  • Replace the second FC layer with a CONV layer that uses filter size
    F=1, giving output volume [1x1x4096]
  • Replace the last FC layer similarly, with F=1, giving final output
    [1x1x1000]

Each of these conversions could in practice involve manipulating (e.g. reshaping) the weight matrix W in each FC layer into CONV layer filters. It turns out that this conversion allows us to “slide” the original ConvNet very efficiently across many spatial positions in a larger image, in a single forward pass.

For example, if 224x224 image gives a volume of size [7x7x512] - i.e. a reduction by 32, then forwarding an image of size 384x384 through the converted architecture would give the equivalent volume in size [12x12x512], since 384/32 = 12. Following through with the next 3 CONV layers that we just converted from FC layers would now give the final volume of size [6x6x1000], since (12 - 7)/1 + 1 = 6. Note that instead of a single vector of class scores of size [1x1x1000], we’re now getting an entire 6x6 array of class scores across the 384x384 image.
(In short: if a 224x224 image gives a [7x7x512] volume, i.e. a reduction by 32, then a 384x384 image gives a [12x12x512] volume, since 384/32 = 12; the three converted CONV layers then produce a final volume of size [6x6x1000], since (12 − 7)/1 + 1 = 6. Instead of a single [1x1x1000] vector of class scores, we obtain a whole 6x6 array of class scores across the 384x384 image. A shape trace is sketched below.)
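A shape trace of this conversion (reusing the (W − F + 2P)/S + 1 rule; the helper name is our own):

def conv_out(w, f, s=1, p=0):
    return (w - f + 2 * p) // s + 1

w = 384 // 32                 # the CONV/POOL stack reduces the image by a factor of 32 -> 12
w = conv_out(w, f=7)          # first converted FC layer (F=7)  -> 6, volume [6x6x4096]
w = conv_out(w, f=1)          # second converted FC layer (F=1) -> 6, volume [6x6x4096]
w = conv_out(w, f=1)          # last converted FC layer (F=1)   -> 6, volume [6x6x1000]
print(w)                      # 6: a 6x6 grid of class-score vectors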

  • Evaluating the original ConvNet (with FC layers) independently across
    224x224 crops of the 384x384 image in strides of 32 pixels gives an
    identical result to forwarding the converted ConvNet one time.

Naturally, forwarding the converted ConvNet a single time is much more efficient than iterating the original ConvNet over all those 36 locations, since the 36 evaluations share computation. This trick is often used in practice to get better performance, where for example, it is common to resize an image to make it bigger, use a converted ConvNet to evaluate the class scores at many spatial positions and then average the class scores.
(In short: a single forward pass of the converted ConvNet shares computation across all 36 locations, which is far cheaper than evaluating the original ConvNet 36 times. A common practical trick is to resize an image to be larger, use the converted ConvNet to evaluate class scores at many spatial positions, and then average them.)

Lastly, what if we wanted to efficiently apply the original ConvNet over the image but at a stride smaller than 32 pixels? We could achieve this with multiple forward passes. For example, note that if we wanted to use a stride of 16 pixels we could do so by combining the volumes received by forwarding the converted ConvNet twice: First over the original image and second over the image but with the image shifted spatially by 16 pixels along both width and height.

(In short: to apply the original ConvNet with an effective stride smaller than 32 pixels, e.g. 16, run the converted ConvNet multiple times: once over the original image and once over the image shifted spatially by 16 pixels along both width and height, then combine the resulting volumes.)

An IPython Notebook on Net Surgery shows how to perform the conversion in practice, in code (using Caffe).

III. ConvNet Architectures

We have seen that Convolutional Networks are commonly made up of only three layer types: CONV, POOL (we assume Max pool unless stated otherwise) and FC (short for fully-connected). We will also explicitly write the RELU activation function as a layer, which applies elementwise non-linearity. In this section we discuss how these are commonly stacked together to form entire ConvNets.

(In short: a ConvNet is built from three layer types, CONV, POOL (max pooling unless stated otherwise) and FC, with RELU written explicitly as an elementwise activation layer; this section discusses how these are commonly stacked to form entire ConvNets.)

1. Layer Patterns

The most common form of a ConvNet architecture stacks a few CONV-RELU layers, follows them with POOL layers, and repeats this pattern until the image has been merged spatially to a small size. At some point, it is common to transition to fully-connected layers. The last fully-connected layer holds the output, such as the class scores. In other words, the most common ConvNet architecture follows the pattern:
(In short: the most common pattern stacks a few CONV-RELU layers, follows them with a POOL layer, and repeats until the image has been spatially reduced to a small size; it then transitions to fully-connected layers, the last of which holds the output, e.g. the class scores.)

INPUT -> [[CONV -> RELU]*N -> POOL?]*M -> [FC -> RELU]*K -> FC

where the * indicates repetition, and the POOL? indicates an optional pooling layer. Moreover, N >= 0 (and usually N <= 3), M >= 0, K >= 0 (and usually K < 3). For example, here are some common ConvNet architectures you may see that follow this pattern:

  • INPUT -> FC, implements a linear classifier. Here N = M = K = 0.

  • INPUT -> CONV -> RELU -> FC

  • INPUT -> [CONV -> RELU -> POOL]*2 -> FC -> RELU -> FC. Here we see
    that there is a single CONV layer between every POOL layer.

  • INPUT -> [CONV -> RELU -> CONV -> RELU -> POOL]*3 -> [FC -> RELU]*2 -> FC.
    Here we see two CONV layers stacked before every POOL layer. This is
    generally a good idea for larger and deeper networks, because
    multiple stacked CONV layers can develop more complex features of the
    input volume before the destructive pooling operation.

Prefer a stack of small filter CONV to one large receptive field CONV layer. Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume. Suppose that instead of these three layers of 3x3 CONV, we only wanted to use a single CONV layer with 7x7 receptive fields. These neurons would have a receptive field size of the input volume that is identical in spatial extent (7x7), but with several disadvantages. First, the neurons would be computing a linear function over the input, while the three stacks of CONV layers contain non-linearities that make their features more expressive. Second, if we suppose that all the volumes have C channels, then it can be seen that the single 7x7 CONV layer would contain C×(7×7×C) = 49C^2 parameters, while the three 3x3 CONV layers would only contain 3×(C×(3×3×C)) = 27C^2 parameters. Intuitively, stacking CONV layers with tiny filters as opposed to having one CONV layer with big filters allows us to express more powerful features of the input, and with fewer parameters. As a practical disadvantage, we might need more memory to hold all the intermediate CONV layer results if we plan to do backpropagation.

(In short: three stacked 3x3 CONV layers see the same 7x7 region of the input as a single 7x7 CONV layer, but with non-linearities in between and only 27C^2 parameters instead of 49C^2; the main practical cost is the extra memory needed for the intermediate activations during backpropagation.)
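A quick check of the parameter comparison above (hypothetical helpers; all volumes are assumed to have C channels and biases are ignored):

def stacked_3x3_params(c, num_layers=3):
    return num_layers * (c * (3 * 3 * c))   # each layer: C filters of size 3x3xC

def single_7x7_params(c):
    return c * (7 * 7 * c)

c = 64   # an arbitrary example channel count
print(stacked_3x3_params(c))   # 27*C^2 = 110,592
print(single_7x7_params(c))    # 49*C^2 = 200,704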

Recent departures. It should be noted that the conventional paradigm of a linear list of layers has recently been challenged, in Google’s Inception architectures and also in current (state of the art) Residual Networks from Microsoft Research Asia. Both of these (see details below in case studies section) feature more intricate and different connectivity structures.

(In short: the conventional paradigm of a linear list of layers has recently been challenged by Google's Inception architectures and by the state-of-the-art Residual Networks from Microsoft Research Asia, both of which use more intricate connectivity structures; see the case studies section for details.)

In practice: use whatever works best on ImageNet. If you’re feeling a bit of a fatigue in thinking about the architectural decisions, you’ll be pleased to know that in 90% or more of applications you should not have to worry about these. I like to summarize this point as “don’t be a hero”: Instead of rolling your own architecture for a problem, you should look at whatever architecture currently works best on ImageNet, download a pretrained model and finetune it on your data. You should rarely ever have to train a ConvNet from scratch or design one from scratch. I also made this point at the Deep Learning school.

(In short: "don't be a hero": in 90% or more of applications you should not design an architecture from scratch; instead, take whatever architecture currently works best on ImageNet, download a pretrained model, and finetune it on your own data.)

2. Layer Sizing Patterns

Until now we’ve omitted mentions of common hyperparameters used in each of the layers in a ConvNet. We will first state the common rules of thumb for sizing the architectures and then follow the rules with a discussion of the notation:
(In short: we now state common rules of thumb for sizing each layer's hyperparameters.)

The input layer (that contains the image) should be divisible by 2 many times. Common numbers include 32 (e.g. CIFAR-10), 64, 96 (e.g. STL-10), or 224 (e.g. common ImageNet ConvNets), 384, and 512.
(In short: the input size should be divisible by 2 many times; common choices are 32 (CIFAR-10), 64, 96 (STL-10), 224 (common ImageNet ConvNets), 384 and 512.)
The conv layers should be using small filters (e.g. 3x3 or at most 5x5), using a stride of S=1, and crucially, padding the input volume with zeros in such way that the conv layer does not alter the spatial dimensions of the input. That is, when F=3, then using P=1 will retain the original size of the input. When F=5, P=2. For a general F, it can be seen that P=(F−1)/2 preserves the input size. If you must use bigger filter sizes (such as 7x7 or so), it is only common to see this on the very first conv layer that is looking at the input image.

(In short: conv layers should use small filters (3x3, or at most 5x5) with stride S = 1 and zero-padding P = (F − 1)/2 so that the spatial size of the input is preserved; larger filters such as 7x7 are usually only seen on the very first conv layer that looks at the input image.)

The pool layers are in charge of downsampling the spatial dimensions of the input. The most common setting is to use max-pooling with 2x2 receptive fields (i.e. F=2), and with a stride of 2 (i.e. S=2). Note that this discards exactly 75% of the activations in an input volume (due to downsampling by 2 in both width and height). Another slightly less common setting is to use 3x3 receptive fields with a stride of 2, but this makes. It is very uncommon to see receptive field sizes for max pooling that are larger than 3 because the pooling is then too lossy and aggressive. This usually leads to worse performance.
(In short: the pooling layers handle spatial downsampling; the most common setting is max pooling with F = 2 and S = 2, which discards exactly 75% of the activations. 3x3 pooling with stride 2 is slightly less common, and receptive fields larger than 3 are very rare because such pooling is too lossy and aggressive and usually hurts performance.)
Reducing sizing headaches. The scheme presented above is pleasing because all the CONV layers preserve the spatial size of their input, while the POOL layers alone are in charge of down-sampling the volumes spatially. In an alternative scheme where we use strides greater than 1 or don’t zero-pad the input in CONV layers, we would have to very carefully keep track of the input volumes throughout the CNN architecture and make sure that all strides and filters “work out”, and that the ConvNet architecture is nicely and symmetrically wired.
(In short: this scheme reduces sizing headaches because all CONV layers preserve the spatial size of their input and only the POOL layers downsample; otherwise, strides and filter sizes would have to be tracked very carefully throughout the architecture to make sure everything "works out".)