
R-CNN, SPPnet, Fast R-CNN: Paper Study Notes

——R-CNN, SPPnet, and Fast R-CNN are a series of top-conference papers on object detection. It took me quite a while of reading before things started to click, so I am writing this down here. Reading the original papers is still the best option, but since they are in English and lean heavily on earlier work, they are not very beginner-friendly when read directly. A good approach is to first skim existing write-ups online to get the general ideas, and then read the originals with specific questions in mind; this can be far more efficient.

Ross Girshick (RBG) maintains a blog where essentially all of the papers, source code, and slides can be found:
[RBG's blog]
A site with very good explanations is also recommended:
Gluon.ai

——Object detection means locating objects precisely in a given image and labeling each object's class. It has to answer both where an object is and what it is. This is not an easy problem: objects vary enormously in scale, can appear at any orientation and pose, can be anywhere in the image, and can belong to many different classes. Existing CNNs already solve the "what is it" part well, i.e. classification; the papers below show roughly how object detection developed from there.

R-CNN

The basic idea: propose a large number of candidate regions in the image and run object classification on each one; the detection results follow from there.

  1. Select a large number of candidate boxes in the image (selective search)
  2. Resize each candidate box to fit the downstream convolutional network (so that the network's output size stays fixed)
  3. Extract features from each region with a CNN, then classify them with SVMs
  4. Train a linear regression model to refine the candidate box coordinates; the training proposals for this regressor are selected by their bounding-box IoU with the ground truth (a sketch of the full pipeline follows this list)
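
To make the four stages concrete, here is a minimal sketch of the inference pipeline. All helper names (selective_search, warp_region, cnn_features, svms, bbox_regressors) are hypothetical stand-ins for the components described above, not the original implementation:

```python
# Minimal R-CNN inference sketch; every helper below is a hypothetical
# stand-in for the corresponding component, not the original code.
def rcnn_detect(image):
    detections = []
    proposals = selective_search(image)               # step 1: ~2k region proposals
    for box in proposals:
        patch = warp_region(image, box, size=227)     # step 2: warp to fixed input size
        feat = cnn_features(patch)                    # step 3a: 4096-d CNN feature
        scores = {c: svm.score(feat) for c, svm in svms.items()}
        cls = max(scores, key=scores.get)             # step 3b: per-class linear SVMs
        refined = bbox_regressors[cls].refine(box, feat)  # step 4: box regression
        detections.append((cls, scores[cls], refined))
    return detections                                 # followed by per-class NMS
```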



R-CNN uses Selective Search (SS) to obtain roughly 2k candidate boxes per image.
Its basic procedure is as follows:

First, an over-segmentation method splits the image into many small regions. Then, among the existing regions, the two most similar ones are merged, and this step repeats until the whole image has been merged into a single region. Merging is prioritized for regions that are: similar in color (color histograms); similar in texture (gradient histograms); and small in combined area (so that small regions get merged first; the original method also favors merges whose union fills its bounding box well). Finally, every region produced along the way is output, yielding the candidate regions.
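
A simplified sketch of this greedy merging, assuming an initial over-segmentation and a combined similarity function are given; Region objects with merge() and bounding_box are hypothetical:

```python
# Greedy merging sketch of selective search (simplified). `initial_segments`
# comes from an over-segmentation; `similarity` combines the color, texture,
# size, and fill terms described above. Region objects are hypothetical.
def selective_search(initial_segments, similarity):
    regions = list(initial_segments)
    proposals = [r.bounding_box for r in regions]
    while len(regions) > 1:
        # find and merge the most similar pair of current regions
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda ab: similarity(regions[ab[0]], regions[ab[1]]),
        )
        merged = regions[i].merge(regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged.bounding_box)  # every intermediate region is a candidate
    return proposals
```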

The original paper's description of the system architecture:

—— Our object detection system consists of three modules. The first generates category-independent region proposals. These proposals define the set of candidate detections available to our detector. The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region. The third module is a set of class-specific linear SVMs.



——Region proposals. While R-CNN is agnostic to the particular region proposal method, we use selective search to enable a controlled comparison with prior detection work (e.g., [34, 36]).

——Feature extraction. We extract a 4096-dimensional feature vector from each region proposal using the Caffe [22] implementation of the CNN described by Krizhevsky et al. [23]. Features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers. We refer readers to [22, 23] for more network architecture details.

—— In order to compute features for a region proposal, we must first convert the image data in that region into a form that is compatible with the CNN (its architecture requires inputs of a fixed 227×227 pixel size). Of the many possible transformations of our arbitrary-shaped regions, we opt for the simplest. Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size. Prior to warping, we dilate the tight bounding box so that at the warped size there are exactly p pixels of warped image context around the original box (we use p = 16).
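
The warp-with-context step just quoted can be sketched as follows, a minimal version assuming OpenCV. The padding arithmetic dilates the tight box so that p output pixels of context remain after warping (the original implementation also handles boxes extending past the image border by padding with the image mean, which is omitted here):

```python
import cv2

# Sketch of R-CNN's "warp with context" (p = 16 pixels of context at the
# warped size). image is an HxWx3 numpy array; box is (x1, y1, x2, y2).
def warp_with_context(image, box, out_size=227, p=16):
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    # the inner box occupies out_size - 2p output pixels, so p output pixels
    # of context correspond to p * w / (out_size - 2p) input pixels
    pad_x = p * w / (out_size - 2 * p)
    pad_y = p * h / (out_size - 2 * p)
    x1, y1 = max(0, int(x1 - pad_x)), max(0, int(y1 - pad_y))
    x2 = min(image.shape[1], int(x2 + pad_x))
    y2 = min(image.shape[0], int(y2 + pad_y))
    crop = image[y1:y2, x1:x2]
    # anisotropic warp to the fixed CNN input size, ignoring aspect ratio
    return cv2.resize(crop, (out_size, out_size))
```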

The quoted passages above reveal several basic problems with R-CNN:

  1. Each image first goes through selective search, and then every proposal is sent through the CNN for feature extraction separately, which causes a large amount of repeated convolution computation;
  2. Because the CNN is followed by fully connected (FC) layers, the CNN's output size is strictly constrained, which in turn fixes the input image size of the whole network.

SPP-net

——SPP-net mainly solves R-CNN's fixed-size input problem described above (the CNN is followed by FC layers, and an FC layer needs a fixed-dimensional vector as input). The paper points out right at the start that images fed to a CNN must be cropped or warped to resize them to a fixed size, and both have unavoidable drawbacks: cropping cannot see the whole object, and warping distorts the image geometry.

—— In this paper, we introduce a spatial pyramid pooling (SPP) [14], [15] layer to remove the fixed-size constraint of the network. Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully connected layers (or other classifiers).




Compared with R-CNN, SPP-net mainly solves two problems:
1. Many candidate regions per image, each run through the CNN separately ==> run feature extraction once over the whole image, then apply the candidate boxes on the feature map (i.e. conv5) to obtain the per-proposal feature regions.
2. Image candidate boxes entering the CNN had to be resized so that the CNN output had a uniform size for the FC layers ==> use SPP (spatial pyramid pooling), a flexible pooling layer that divides its input into m×n bins no matter what the input resolution is. This is SPP-net's first distinctive feature: its input is the conv5 feature map together with the feature-map candidate boxes (obtained by mapping the image-space boxes through the network stride, as sketched below), and its output is a fixed-size (m×n) feature.
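
A minimal sketch of that mapping from an image-space box to conv5 coordinates; the SPP-net paper uses slightly different rounding with small offsets, so this only illustrates the idea:

```python
# Sketch: project an image-space box onto the conv5 feature map using the
# network's cumulative stride S (e.g., S = 16 for an AlexNet/ZF-style net).
def project_to_feature_map(box, stride=16):
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride,          # floor for the near corner
            -(-x2 // stride), -(-y2 // stride))  # ceil for the far corner
```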

The original paper's description of SPP:

——The Spatial Pyramid Pooling Layer

——The convolutional layers accept arbitrary input sizes, but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatial bins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.



—— To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g., pool5, after the last convolutional layer) with a spatial pyramid pooling layer. Figure 3 illustrates our method. In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling). The outputs of the spatial pyramid pooling are kM dimensional vectors with the number of bins denoted as M (k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.

——With spatial pyramid pooling, the input image can be of any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g., min(w,h)=180, 224, …) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales.
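
A minimal SPP layer sketch in PyTorch, using adaptive max pooling to realize the fixed number of bins per level (the paper instead computes a window size and stride per level, as worked through below; its 4-level pyramid {6×6, 3×3, 2×2, 1×1} gives M = 50 bins):

```python
import torch
import torch.nn.functional as F

# Minimal SPP sketch: for each pyramid level n, max-pool the feature map
# into an n x n grid, then concatenate. The output length is k*M for k
# conv5 filters and M = sum(n*n) bins, independent of the input size.
def spatial_pyramid_pool(feature_map, levels=(6, 3, 2, 1)):
    # feature_map: (batch, k, H, W), with H and W arbitrary
    outputs = []
    for n in levels:
        pooled = F.adaptive_max_pool2d(feature_map, output_size=(n, n))
        outputs.append(pooled.flatten(start_dim=1))   # (batch, k*n*n)
    return torch.cat(outputs, dim=1)                  # (batch, k*M)
```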


[Figure: pooling configuration of the SPP layer, listing sizeX and stride for each pyramid level]

——In the figure above, sizeX is the pooling kernel size and stride is the step; pool3x3 means the pooled output size is 3×3. So pool1x1 max-pools over the entire input feature map, while pool2x2 splits the input feature map into a 2×2 grid of 4 regions and max-pools each one. Given the sizeX and stride constraints, max pooling essentially just takes the maximum within each region (in my opinion this can select the same feature repeatedly, since a value that is the global maximum is also the maximum of every region containing it; perhaps mean pooling or Gaussian pooling would work better).
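
For single-size training, the paper precomputes each level's pooling window and stride from the conv5 map size a and the number of bins n per side: sizeX = ⌈a/n⌉ and stride = ⌊a/n⌋. A worked example for a = 13 (the conv5 output size for a 224×224 input):

```python
import math

# The paper's single-size training arithmetic: for a conv5 map of size
# a x a and an n x n pyramid level, sizeX = ceil(a/n), stride = floor(a/n).
a = 13
for n in (1, 2, 3, 6):
    sizeX, stride = math.ceil(a / n), math.floor(a / n)
    print(f"pool{n}x{n}: sizeX={sizeX}, stride={stride}")
# pool1x1: sizeX=13, stride=13
# pool2x2: sizeX=7, stride=6
# pool3x3: sizeX=5, stride=4
# pool6x6: sizeX=3, stride=2
```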

Comparison of the R-CNN and SPPnet architectures


[Figure: R-CNN architecture]

[Figure: SPPnet architecture]

Fast R-CNN

Fast R-CNN likewise improves on R-CNN's weaknesses:

  1. As with SPP-net, instead of extracting candidate regions on the image and sending each one through the CNN for feature extraction, the whole image goes through feature extraction once (a single CNN pass), and the candidate regions are then taken on the feature map.
  2. Also as with SPP-net, the extracted feature regions differ in size and cannot be fed into the FC layers directly. Unlike SPP-net, Fast R-CNN uses an RoI pooling layer for this mapping.
    That is: the convolutions produce a feature map, and each RoI selects its corresponding region on it (equivalently, the feature map can be mapped back to the original image); after the final convolution, an RoI pooling layer brings every region to the same fixed size (it can be viewed as a single-level SPP).
  3. End-to-end training: classification and localization are trained together, with the whole network's loss being the sum of the two losses (a sketch follows this list). This speeds things up further.
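
A minimal sketch of such a joint loss, following the paper's form L = L_cls + λ[u ≥ 1]·L_loc with a smooth L1 localization term; the reduction and normalization details here are simplifications:

```python
import torch.nn.functional as F

# Sketch of Fast R-CNN's multi-task loss: classification loss plus a
# localization loss that only applies to non-background RoIs.
def fast_rcnn_loss(cls_scores, bbox_deltas, labels, bbox_targets, lam=1.0):
    # cls_scores: (R, K+1); bbox_deltas: (R, K, 4); labels: (R,), 0 = background
    cls_loss = F.cross_entropy(cls_scores, labels)
    fg = labels > 0
    if fg.any():
        # pick each foreground RoI's deltas for its ground-truth class
        deltas = bbox_deltas[fg, labels[fg] - 1]        # (R_fg, 4)
        loc_loss = F.smooth_l1_loss(deltas, bbox_targets[fg])
        return cls_loss + lam * loc_loss
    return cls_loss
```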

The original paper's description of the network architecture:

Fast R-CNN architecture
——A Fast R-CNN network takes as input an entire image and a set of object proposals. The network first processes the whole image with several convolutional (conv) and max pooling layers to produce a conv feature map. Then, for each object proposal a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. Each feature vector is fed into a sequence of fully connected (fc) layers that finally branch into two sibling output layers: one that produces softmax probability estimates over K object classes plus a catch-all “background” class and another layer that outputs four real-valued numbers for each of the K object classes. Each set of 4 values encodes refined bounding-box positions for one of the K classes.
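
The two sibling output layers can be sketched as a small PyTorch head; fc_dim and num_classes here are illustrative (K = 20 for PASCAL VOC):

```python
import torch.nn as nn

# Sketch of the two sibling output layers described above: a (K+1)-way
# classifier and a 4K-dimensional box regressor sharing the fc features.
class FastRCNNHead(nn.Module):
    def __init__(self, fc_dim=4096, num_classes=20):
        super().__init__()
        self.cls_score = nn.Linear(fc_dim, num_classes + 1)  # K classes + background
        self.bbox_pred = nn.Linear(fc_dim, num_classes * 4)  # 4 values per class

    def forward(self, fc_features):
        return self.cls_score(fc_features), self.bbox_pred(fc_features)
```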



For RoI pooling, the authors note that it can be viewed as a special case of SPPnet: a pooling layer with only one SPP pyramid level. H and W are hyper-parameters chosen to match the fully connected layers that follow; h and w are the height and width of a candidate feature region. Because the region is divided into an H × W grid, each sub-window has approximate size h/H × w/W.

The RoI pooling layer
—— The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H ×W (e.g., 7×7), where H and W are layer hyper-parameters that are independent of any particular RoI. In this paper, an RoI is a rectangular window into a conv feature map. Each RoI is defined by a four-tuple (r,c,h,w) that specifies its top-left corner (r,c) and its height and width (h,w).
—— RoI max pooling works by dividing the h×w RoI window into an H × W grid of sub-windows of approximate size h/H ×w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling. The RoI layer is simply the special-case of the spatial pyramid pooling layer used in SPPnets [11] in which there is only one pyramid level. We use the pooling sub-window calculation given in [11].
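
A minimal RoI max pooling sketch; using adaptive max pooling approximates the h/H × w/W sub-window calculation from [11] rather than reproducing it exactly:

```python
import torch
import torch.nn.functional as F

# Sketch of RoI max pooling as a single-level SPP: each RoI window on the
# conv feature map is divided into an H x W grid and max-pooled per cell.
def roi_max_pool(feature_map, rois, out_hw=(7, 7)):
    # feature_map: (C, Hf, Wf); rois: list of (r, c, h, w) in feature-map coords
    pooled = []
    for (r, c, h, w) in rois:
        window = feature_map[:, r:r + h, c:c + w]        # (C, h, w)
        # adaptive pooling yields sub-windows of roughly h/H x w/W
        pooled.append(F.adaptive_max_pool2d(window, out_hw))
    return torch.stack(pooled)                           # (num_rois, C, H, W)
```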

One more point is the training optimization: Fast R-CNN's SGD mini-batches use N = 2 images and compute R/N = 128/2 = 64 RoIs per image, whereas the R-CNN and SPPnet strategy amounts to N = 128 images with one RoI each. The authors report that this scheme is roughly 64× faster, and the slow convergence one might expect from so few images per batch does not materialize.

Fine-tuning for detection
——Training all network weights with back-propagation is an important capability of Fast R-CNN. First, let’s elucidate why SPPnet is unable to update weights below the spatial pyramid pooling layer.
—— We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N = 2 and R = 128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy). One concern over this strategy is it may cause slow training convergence because RoIs from the same image are correlated. This concern does not appear to be a practical issue and we achieve good results with N = 2 and R = 128 using fewer SGD iterations than R-CNN.
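
A sketch of this hierarchical sampling; dataset and image.proposals are hypothetical containers:

```python
import random

# Hierarchical minibatch sampling sketch: draw N images, then R/N RoIs per
# image, so RoIs from one image share a single forward/backward pass.
def sample_minibatch(dataset, N=2, R=128):
    batch = []
    for image in random.sample(dataset, N):
        rois = random.sample(image.proposals, R // N)   # 64 RoIs per image
        batch.append((image, rois))
    return batch
```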
