
王權富貴論文篇:Faster R-CNN論文翻譯——中英文對照

文章作者:Tyan 

感謝Tyan作者大大,相見恨晚,大家可以看原汁原味的Tyan部落格哦。
部落格:noahsnail.com  |  CSDN  |  簡書

宣告:作者翻譯論文僅為學習,如有侵權請聯絡作者刪除博文,謝謝!

翻譯論文彙總:https://github.com/SnailTyan/deep-learning-papers-translation

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Abstract

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features——using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

摘要

最先進的目標檢測網路依靠區域提出演算法來假設目標的位置。SPPnet[1]和Fast R-CNN[2]等研究已經減少了這些檢測網路的執行時間,使得區域提出計算成為一個瓶頸。在這項工作中,我們引入了一個區域提出網路(RPN),該網路與檢測網路共享全影象的卷積特徵,從而使近乎零成本的區域提出成為可能。RPN是一個全卷積網路,可以同時在每個位置預測目標邊界和目標分數。RPN經過端到端的訓練,可以生成高質量的區域提出,由Fast R-CNN用於檢測。我們將RPN和Fast R-CNN通過共享卷積特徵進一步合併為一個單一的網路——使用最近流行的具有“注意力”機制的神經網路術語,RPN元件告訴統一網路在哪裡尋找。對於非常深的VGG-16模型[3],我們的檢測系統在GPU上的幀率為5fps(包括所有步驟),同時在PASCAL VOC 2007,2012和MS COCO資料集上實現了最新的目標檢測精度,每個影象只有300個提出。在ILSVRC和COCO 2015競賽中,Faster R-CNN和RPN是多個賽道中獲得第一名的參賽作品的基礎。程式碼已公開發布。

1. Introduction

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

1. 引言

目標檢測的最新進展是由區域提出方法(例如[4])和基於區域的卷積神經網路(R-CNN)[5]的成功驅動的。儘管在[5]中最初開發的基於區域的CNN計算成本很高,但是由於在各種提議中共享卷積,所以其成本已經大大降低了[1],[2]。忽略花費在區域提議上的時間,最新版本Fast R-CNN[2]利用非常深的網路[3]實現了接近實時的速率。現在,提議是最新的檢測系統中測試時間的計算瓶頸。

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

區域提議方法通常依賴廉價的特徵和簡練的推斷方案。選擇性搜尋[4]是最流行的方法之一,它基於人工設計的低階特徵貪婪地合併超畫素。然而,與高效的檢測網路[2]相比,選擇性搜尋速度慢了一個數量級,在CPU實現中每張影象的時間為2秒。EdgeBoxes[6]目前提供了提議質量和速度之間的最佳權衡,每張影象0.2秒。儘管如此,區域提議步驟仍然消耗與檢測網路同樣多的執行時間。

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

有人可能會注意到,基於區域的快速CNN利用GPU,而在研究中使用的區域提議方法在CPU上實現,使得執行時間比較不公平。加速區域提議計算的一個顯而易見的方法是將其在GPU上重新實現。這可能是一個有效的工程解決方案,但重新實現忽略了下游檢測網路,因此錯過了共享計算的重要機會。

In this paper, we show that an algorithmic change——computing proposals with a deep convolutional neural network——leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

在本文中,我們展示了演算法的變化——用深度卷積神經網路計算區域提議——導致了一個優雅和有效的解決方案,其中在給定檢測網路計算的情況下區域提議計算接近零成本。為此,我們引入了新的區域提議網路(RPN),它們共享最先進目標檢測網路的卷積層[1],[2]。通過在測試時共享卷積,計算區域提議的邊際成本很小(例如,每張影象10ms)。

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

我們的觀察是,基於區域的檢測器所使用的卷積特徵對映,如Fast R-CNN,也可以用於生成區域提議。在這些卷積特徵之上,我們通過新增一些額外的卷積層來構建RPN,這些卷積層同時在規則網格上的每個位置上回歸區域邊界和目標分數。因此RPN是一種全卷積網路(FCN)[7],可以針對生成檢測區域建議的任務進行端到端的訓練。

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

Figure 1

Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

RPN旨在有效預測具有廣泛尺度和長寬比的區域提議。與使用影象金字塔(圖1,a)或濾波器金字塔(圖1,b)的流行方法[8],[9],[1],[2]相比,我們引入新的“錨”框作為多種尺度和長寬比的參考。我們的方案可以被認為是迴歸參考金字塔(圖1,c),它避免了列舉多種尺度或長寬比的影象或濾波器。這個模型在使用單尺度影象進行訓練和測試時表現良好,從而有利於執行速度。

Figure 1

圖1:解決多尺度和尺寸的不同方案。(a)構建影象和特徵對映金字塔,分類器以各種尺度執行。(b)在特徵對映上執行具有多個比例/大小的濾波器的金字塔。(c)我們在迴歸函式中使用參考邊界框金字塔。

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.

為了將RPN與Fast R-CNN[2]目標檢測網路相結合,我們提出了一種訓練方案,在區域提議任務的微調和目標檢測的微調之間交替進行,同時保持提議固定。該方案快速收斂,併產生一個在兩個任務之間共享卷積特徵的統一網路。

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time——the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

我們在PASCAL VOC檢測基準資料集上[11]綜合評估了我們的方法,其中具有Fast R-CNN的RPN產生的檢測精度優於使用選擇性搜尋的Fast R-CNN的強基準。同時,我們的方法在測試時幾乎免除了選擇性搜尋的所有計算負擔——區域提議的有效執行時間僅為10毫秒。使用[3]的昂貴的非常深的模型,我們的檢測方法在GPU上仍然具有5fps的幀率(包括所有步驟),因此在速度和準確性方面是實用的目標檢測系統。我們還報告了在MS COCO資料集上[12]的結果,並使用COCO資料研究了在PASCAL VOC上的改進。程式碼可公開獲得https://github.com/shaoqingren/faster_rcnn(在MATLAB中)和https://github.com/rbgirshick/py-faster-rcnn(在Python中)。

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.

本文的初步版本已在之前發表[10]。從那時起,RPN和Faster R-CNN的框架已經被採用並推廣到其他方法,如3D目標檢測[13],基於部件的檢測[14],例項分割[15]和影象描述(image captioning)[16]。我們快速和有效的目標檢測系統也已被構建到Pinterest[17]等商業系統中,並報告了使用者參與度的提高。

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions. These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

在ILSVRC和COCO 2015競賽中,Faster R-CNN和RPN是ImageNet檢測,ImageNet定位,COCO檢測和COCO分割中幾個第一名參賽者[18]的基礎。RPN完全從資料中學習提議區域,因此可以從更深入和更具表達性的特徵(例如[18]中採用的101層殘差網路)中輕鬆獲益。Faster R-CNN和RPN也被這些比賽中的其他幾個主要參賽者所使用。這些結果表明,我們的方法不僅是一個實用合算的解決方案,而且是一個提高目標檢測精度的有效方法。

2. Related Work

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

2. 相關工作

目標提議。目標提議方法方面有大量的文獻。目標提議方法的綜合調查和比較可以在[19],[20],[21]中找到。廣泛使用的目標提議方法包括基於超畫素分組(例如,選擇性搜尋[4],CPMC[22],MCG[23])和那些基於滑動視窗的方法(例如視窗中的目標[24],EdgeBoxes[6])。目標提議方法被採用為獨立於檢測器(例如,選擇性搜尋[4]目標檢測器,R-CNN[5]和Fast R-CNN[2])的外部模組。

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the “single-box” fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

用於目標檢測的深度網路。R-CNN方法[5]端到端地訓練CNN,將提議區域分類為目標類別或背景。R-CNN主要作為分類器,並不能預測目標邊界(除了通過邊界框迴歸進行細化)。其準確度取決於區域提議模組的效能(參見[20]中的比較)。一些論文提出了使用深度網路來預測目標邊界框的方法[25],[9],[26],[27]。在OverFeat方法[9]中,訓練一個全連線層來預測假定單個目標的定位任務的邊界框座標。然後將全連線層變成卷積層,用於檢測多個特定類別的目標。MultiBox方法[26],[27]從一個網路生成區域提議,該網路的最後一個全連線層同時預測多個與類別無關的邊界框,推廣了OverFeat的“單邊界框”方式。這些與類別無關的邊界框被用作R-CNN[5]的提議區域。與我們的全卷積方案相比,MultiBox提議網路應用於單個影象裁剪塊或多個大的影象裁剪塊(例如224×224)。MultiBox在提議網路和檢測網路之間不共享特徵。稍後我們會結合我們的方法更深入地討論OverFeat和MultiBox。與我們的工作同時進行的,DeepMask方法[28]是為學習分割提議而開發的。

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

卷積[9],[1],[29],[7],[2]的共享計算已經越來越受到人們的關注,因為它可以有效而準確地進行視覺識別。OverFeat論文[9]計算影象金字塔的卷積特徵用於分類,定位和檢測。共享卷積特徵對映的自適應大小池化(SPP)[1]被開發用於有效的基於區域的目標檢測[1],[30]和語義分割[29]。Fast R-CNN[2]能夠對共享卷積特徵進行端到端的檢測器訓練,並顯示出令人信服的準確性和速度。

3. FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with attention [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

Figure 2

Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

3. FASTER R-CNN

我們的目標檢測系統,稱為Faster R-CNN,由兩個模組組成。第一個模組是提議區域的深度全卷積網路,第二個模組是使用提議區域的Fast R-CNN檢測器[2]。整個系統是一個單一的、統一的目標檢測網路(圖2)。使用最近流行的具有“注意力”[31]機制的神經網路術語,RPN模組告訴Fast R-CNN模組在哪裡尋找。在第3.1節中,我們介紹了用於區域提議的網路的設計和屬性。在第3.2節中,我們開發了用於訓練兩個共享特徵的模組的演算法。

Figure 2

圖2:Faster R-CNN是一個單一,統一的目標檢測網路。RPN模組作為這個統一網路的“注意力”。

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

3.1 區域提議網路

區域提議網路(RPN)以任意大小的影象作為輸入,輸出一組矩形的目標提議,每個提議都有一個目標得分。我們用全卷積網路[7]對這個過程進行建模,我們將在本節進行描述。因為我們的最終目標是與Fast R-CNN目標檢測網路[2]共享計算,所以我們假設兩個網路共享一組共同的卷積層。在我們的實驗中,我們研究了具有5個共享卷積層的Zeiler和Fergus模型[32](ZF)和具有13個共享卷積層的Simonyan和Zisserman模型[3](VGG-16)。

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers——a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n×n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

Figure 3

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.

為了生成區域提議,我們在最後的共享卷積層輸出的卷積特徵對映上滑動一個小網路。這個小網路將輸入卷積特徵對映的n×n空間視窗作為輸入。每個滑動視窗對映到一個低維特徵(ZF為256維,VGG為512維,後面是ReLU[33])。這個特徵被輸入到兩個並列的全連線層——一個邊界框迴歸層(reg)和一個邊界框分類層(cls)。在本文中,我們使用n=3,注意輸入影象上的有效感受野是大的(ZF和VGG分別為171和228個畫素)。圖3(左)在單個位置顯示了這個小型網路。請注意,因為小網路以滑動視窗方式執行,所有空間位置共享全連線層。這種架構自然地通過一個n×n卷積層後接兩個並列的1×1卷積層(分別用於reg和cls)來實現。

Figure 3

圖3:左:區域提議網路(RPN)。右:在PASCAL VOC 2007測試集上使用RPN提議的示例檢測。我們的方法可以檢測各種尺度和長寬比的目標。
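The n×n sliding window followed by two sibling 1×1 convolutional layers described above can be written compactly in a framework such as PyTorch. The sketch below is illustrative only (the class name RPNHead and the test tensor are ours; channel widths follow the VGG defaults quoted in the text, with k = 9 anchors), not the authors' released Caffe/MATLAB implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """Minimal sketch of the RPN head: an n x n conv (n = 3) followed by
    two sibling 1 x 1 convs for objectness scores (cls) and box deltas (reg)."""
    def __init__(self, in_channels=512, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, kernel_size=1)  # 2k scores
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, kernel_size=1)  # 4k box coordinates

    def forward(self, feature_map):
        x = F.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# Example: a 512-d feature map of spatial size 38 x 50 (stride 16 on a ~600 x 800 image).
scores, deltas = RPNHead()(torch.zeros(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```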

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal. The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.

3.1.1 錨點

在每個滑動視窗位置,我們同時預測多個區域提議,其中每個位置可能提議的最大數目表示為k。因此,reg層具有4k個輸出,編碼k個邊界框的座標,cls層輸出2k個分數,估計每個提議是目標或不是目標的概率。這k個提議相對於k個參考邊界框是引數化的,我們稱這些參考邊界框為錨點。錨點位於所討論的滑動視窗的中心,並與一個尺度和長寬比相關(圖3左)。預設情況下,我們使用3個尺度和3個長寬比,在每個滑動位置產生k=9個錨點。對於大小為W×H(通常約為2400)的卷積特徵對映,總共有WHk個錨點。

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method. As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

平移不變的錨點

我們的方法的一個重要特性是它是平移不變的,無論是在錨點還是計算相對於錨點的區域提議的函式。如果在影象中平移目標,提議應該平移,並且同樣的函式應該能夠在任一位置預測提議。平移不變特性是由我們的方法保證的。作為比較,MultiBox方法[27]使用k-means生成800個錨點,這不是平移不變的。所以如果平移目標,MultiBox不保證會生成相同的提議。

The translation-invariant property also reduces the model size. MultiBox has a (4+1)×800-dimensional fully-connected output layer, whereas our method has a (4+2)×k-dimensional convolutional output layer in the case of k = 9 anchors. As a result, our output layer has 2.8 × 10^4 parameters (512 × (4+2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox’s output layer that has 6.1 × 10^6 parameters (1536 × (4+1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox. We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.

平移不變特性也減小了模型的大小。MultiBox有(4+1)×800維的全連線輸出層,而我們的方法在k=9個錨點的情況下有(4+2)×k維的卷積輸出層。因此,我們的輸出層具有2.8×10^4個引數(對於VGG-16為512×(4+2)×9),比MultiBox輸出層的6.1×10^6個引數(對於MultiBox[27]中的GoogleNet[34]為1536×(4+1)×800)少了兩個數量級。如果考慮到特徵投影層,我們的提議層的引數仍然比MultiBox少一個數量級。我們期望我們的方法在PASCAL VOC等小資料集上有更小的過擬合風險。
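As a rough sanity check on the parameter counts quoted above, the arithmetic can be reproduced directly; the snippet below simply restates the numbers in the text (weights only, biases ignored).

```python
# Output-layer parameter counts quoted in the text (weights only).
k = 9                                       # anchors per location (3 scales x 3 aspect ratios)
ours_vgg16 = 512 * (4 + 2) * k              # 1x1 conv on a 512-d feature -> 4k reg + 2k cls outputs
multibox_googlenet = 1536 * (4 + 1) * 800   # fully-connected layer predicting 800 boxes

print(ours_vgg16)                           # 27648  (~2.8e4)
print(multibox_googlenet)                   # 6144000 (~6.1e6)
print(multibox_googlenet / ours_vgg16)      # roughly two orders of magnitude
```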

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a “pyramid of filters” (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

多尺度錨點作為迴歸參考

我們的錨點設計提出了一個新的方案來解決多尺度(和長寬比)。如圖1所示,多尺度預測有兩種流行的方法。第一種方法是基於影象/特徵金字塔,例如DPM[8]和基於CNN的方法[9],[1],[2]中。影象在多個尺度上進行縮放,並且針對每個尺度(圖1(a))計算特徵對映(HOG[8]或深卷積特徵[9],[1],[2])。這種方法通常是有用的,但是非常耗時。第二種方法是在特徵對映上使用多尺度(和/或長寬比)的滑動視窗。例如,在DPM[8]中,使用不同的濾波器大小(例如5×7和7×5)分別對不同長寬比的模型進行訓練。如果用這種方法來解決多尺度問題,可以把它看作是一個“濾波器金字塔”(圖1(b))。第二種方法通常與第一種方法聯合採用[8]。

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).

Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios is the same as that in Table 3.

Table 8

作為比較,我們的基於錨點方法建立在錨點金字塔上,這是更具成本效益的。我們的方法參照多尺度和長寬比的錨盒來分類和迴歸邊界框。它只依賴單一尺度的影象和特徵對映,並使用單一尺寸的濾波器(特徵對映上的滑動視窗)。我們通過實驗來展示這個方案解決多尺度和尺寸的效果(表8)。

表8:Faster R-CNN在PASCAL VOC 2007測試資料集上使用不同錨點設定的檢測結果。網路是VGG-16。訓練資料是VOC 2007 trainval。使用3個尺度和3個長寬比的預設設定與表3中的相同。

Table 8

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

由於這種基於錨點的多尺度設計,我們可以簡單地使用在單尺度影象上計算的卷積特徵,Fast R-CNN檢測器也是這樣做的[2]。多尺度錨點設計是共享特徵的關鍵元件,不需要額外的成本來處理尺度。

3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.

3.1.2 損失函式

為了訓練RPN,我們為每個錨點分配一個二值類別標籤(是目標或不是目標)。我們為兩種錨點分配正標籤:(i)與某個真實邊界框具有最高交併比(IoU)重疊的一個或多個錨點,或者(ii)與任意真實邊界框的IoU重疊超過0.7的錨點。注意,單個真實邊界框可以為多個錨點分配正標籤。通常第二個條件足以確定正樣本;但我們仍然採用第一個條件,因為在一些極少數情況下,第二個條件可能找不到正樣本。如果一個非正錨點與所有真實邊界框的IoU比率都低於0.3,我們給它分配一個負標籤。既非正也非負的錨點對訓練目標沒有貢獻。
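A minimal NumPy sketch of this labeling rule is given below, assuming anchors and ground-truth boxes are (x1, y1, x2, y2) arrays; the function names are ours, and taking a single argmax per ground-truth box (rather than all tied anchors) is a simplification.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N,4) and ground-truth boxes (M,4), boxes as (x1,y1,x2,y2)."""
    ax1, ay1, ax2, ay2 = np.split(anchors, 4, axis=1)               # each (N,1)
    gx1, gy1, gx2, gy2 = gt_boxes[:, 0], gt_boxes[:, 1], gt_boxes[:, 2], gt_boxes[:, 3]
    iw = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    ih = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = iw * ih                                                  # (N,M)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return per-anchor labels: 1 = positive, 0 = negative, -1 = ignored."""
    iou = iou_matrix(anchors, gt_boxes)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thresh] = 0          # IoU below 0.3 with every ground-truth box
    labels[max_iou >= pos_thresh] = 1         # (ii) IoU above 0.7 with some ground-truth box
    labels[iou.argmax(axis=0)] = 1            # (i) the best-overlapping anchor for each gt box
    return labels
```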

With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)    (1)

Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i* is 1 if the anchor is positive, and is 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the ground-truth box associated with a positive anchor. The classification loss L_cls is log loss over two classes (object vs not object). For the regression loss, we use L_reg(t_i, t_i*) = R(t_i − t_i*), where R is the robust loss function (smooth L1) defined in [2]. The term p_i* L_reg means the regression loss is activated only for positive anchors (p_i* = 1) and is disabled otherwise (p_i* = 0). The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.

 

根據這些定義,我們按照Fast R-CNN[2]中的多工損失對目標函式進行最小化。我們對一張影象的損失函式定義為:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ · (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)    (1)

其中,i是一個小批量資料中錨點的索引,p_i是錨點i作為目標的預測概率。如果錨點為正,真實標籤p_i*為1,如果錨點為負,則為0。t_i是表示預測邊界框4個引數化座標的向量,而t_i*是與正錨點相關的真實邊界框的向量。分類損失L_cls是兩個類別(目標或非目標)上的對數損失。對於迴歸損失,我們使用L_reg(t_i, t_i*) = R(t_i − t_i*),其中R是在[2]中定義的魯棒損失函式(smooth L1)。p_i* L_reg這一項表示迴歸損失僅對正錨點(p_i*=1)啟用,否則(p_i*=0)被禁用。cls和reg層的輸出分別由{p_i}和{t_i}組成。

 

The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn.(1) is normalized by the mini-batch size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ∼ 2400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of λ = 10 is the same as that in Table 3.

Table 9

這兩個項用N_cls和N_reg進行歸一化,並由一個平衡引數λ加權。在我們目前的實現中(如在釋出的程式碼中),方程(1)中的cls項通過小批量資料的大小(即N_cls=256)進行歸一化,reg項根據錨點位置的數量(即N_reg約為2400)進行歸一化。預設情況下,我們設定λ=10,因此cls和reg項的權重大致相等。我們通過實驗顯示,結果在很寬的範圍內對λ的取值不敏感(表9)。我們還注意到,上面的歸一化不是必需的,可以簡化。

表9:Faster R-CNN使用方程(1)中不同的λ值在PASCAL VOC 2007測試集上的檢測結果。網路是VGG-16。訓練資料是VOC 2007 trainval。λ=10的預設設定與表3中的相同。

Table 9
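Putting the pieces together, a NumPy sketch of the loss in Eqn. (1) might look as follows, with λ = 10, N_cls = 256 and N_reg ≈ 2400 as in the text. The function names and the binary cross-entropy form of L_cls are our own reading of the paper, not the released implementation.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (robust) loss from Fast R-CNN, applied elementwise."""
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=10.0, n_cls=256, n_reg=2400):
    """Multi-task RPN loss of Eqn. (1).
    p: (N,) predicted objectness probabilities; p_star: (N,) labels in {1, 0, -1 (ignored)};
    t, t_star: (N, 4) predicted / target box parameterizations."""
    use = p_star >= 0                                          # drop anchors labelled -1
    eps = 1e-10
    cls = -(p_star[use] * np.log(p[use] + eps) +
            (1 - p_star[use]) * np.log(1 - p[use] + eps))      # two-class log loss
    pos = p_star == 1
    reg = smooth_l1(t[pos] - t_star[pos]).sum(axis=1)          # only positive anchors contribute
    return cls.sum() / n_cls + lam * reg.sum() / n_reg
```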

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]: 

t_x = (x − x_a)/w_a,   t_y = (y − y_a)/h_a,   t_w = log(w/w_a),   t_h = log(h/h_a)
t_x* = (x* − x_a)/w_a,  t_y* = (y* − y_a)/h_a,  t_w* = log(w*/w_a),  t_h* = log(h*/h_a)

where x, y, w, and h denote the box’s center coordinates and its width and height. Variables x, x_a, and x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

 

對於邊界框迴歸,我們採用[5]中的4個座標引數化:

t_x = (x − x_a)/w_a,   t_y = (y − y_a)/h_a,   t_w = log(w/w_a),   t_h = log(h/h_a)
t_x* = (x* − x_a)/w_a,  t_y* = (y* − y_a)/h_a,  t_w* = log(w*/w_a),  t_h* = log(h*/h_a)

其中,x,y,w和h表示邊界框的中心座標及其寬和高。變數x,x_a和x*分別對應預測邊界框、錨框和真實邊界框(y,w,h同理)。這可以被認為是從錨框到鄰近的真實邊界框的邊界框迴歸。

 

Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

然而,我們的方法以與之前基於RoI(感興趣區域)的方法[1],[2]不同的方式實現邊界框迴歸。在[1],[2]中,對從任意大小的RoI池化得到的特徵執行邊界框迴歸,並且迴歸權重由所有區域大小共享。在我們的公式中,用於迴歸的特徵在特徵對映上具有相同的空間大小(3×3)。為了應對不同的大小,我們學習一組k個邊界框迴歸器。每個迴歸器負責一個尺度和一個長寬比,而這k個迴歸器不共享權重。因此,由於錨點的設計,即使特徵具有固定的尺寸/尺度,仍然可以預測各種尺寸的邊界框。
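A NumPy sketch of this parameterization and its inverse (used at test time to turn predicted deltas back into boxes) could look like the following; boxes are assumed to be (x1, y1, x2, y2) arrays and all names are ours.

```python
import numpy as np

def encode_boxes(boxes, anchors):
    """Box -> (tx, ty, tw, th) relative to anchors; boxes/anchors are (N,4) as (x1,y1,x2,y2)."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
    w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
    x, y = boxes[:, 0] + 0.5 * w, boxes[:, 1] + 0.5 * h
    return np.stack([(x - ax) / aw, (y - ay) / ah, np.log(w / aw), np.log(h / ah)], axis=1)

def decode_boxes(t, anchors):
    """Inverse transform: apply predicted (tx, ty, tw, th) to anchors to recover boxes."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
    x, y = t[:, 0] * aw + ax, t[:, 1] * ah + ay
    w, h = np.exp(t[:, 2]) * aw, np.exp(t[:, 3]) * ah
    return np.stack([x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h], axis=1)
```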

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the “image-centric” sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

3.1.3 訓練RPN

RPN可以通過反向傳播和隨機梯度下降(SGD)進行端對端訓練[35]。我們遵循[2]的“以影象為中心”的取樣策略來訓練這個網路。每個小批量資料都從包含許多正面和負面示例錨點的單張影象中產生。對所有錨點的損失函式進行優化是可能的,但是這樣會偏向於負樣本,因為它們是占主導地位的。取而代之的是,我們在影象中隨機取樣256個錨點,計算一個小批量資料的損失函式,其中取樣的正錨點和負錨點的比率可達1:1。如果影象中的正樣本少於128個,我們使用負樣本填充小批量資料。
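A sketch of this sampling step, under the same label convention as the earlier snippet (1 positive, 0 negative, -1 ignored), might be:

```python
import numpy as np

def sample_minibatch(labels, batch_size=256, rng=np.random):
    """Pick up to batch_size/2 positive anchors and fill the rest with negatives.
    labels: per-anchor labels (1 positive, 0 negative, -1 ignored); returns selected indices."""
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    num_pos = min(len(pos), batch_size // 2)                        # at most 128 positives
    pos = rng.choice(pos, num_pos, replace=False)
    neg = rng.choice(neg, batch_size - num_pos, replace=False)      # pad with negatives
    return np.concatenate([pos, neg])
```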

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].

我們通過從標準差為0.01的零均值高斯分佈中提取權重來隨機初始化所有新層。所有其他層(即共享卷積層)通過預訓練的ImageNet分類模型[36]來初始化,如同標準實踐[5]。我們微調ZF網路的所有層,以及VGG網路的conv3_1及其之上的層以節省記憶體[2]。在PASCAL VOC資料集上,前60k個小批量資料我們使用0.001的學習率,接下來的20k個小批量資料使用0.0001。我們使用0.9的動量和0.0005的權重衰減[37]。我們的實現使用Caffe[38]。
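Expressed with a modern framework such as PyTorch, the schedule described above would look roughly like the sketch below (the stand-in module and the empty loop body are placeholders of ours; the original implementation uses Caffe).

```python
import torch

rpn_head = torch.nn.Conv2d(512, 512, 3, padding=1)   # stand-in for the RPN layers being trained
optimizer = torch.optim.SGD(rpn_head.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.0005)

for it in range(80000):                               # 60k + 20k mini-batches on PASCAL VOC
    if it == 60000:
        for group in optimizer.param_groups:          # drop the learning rate to 0.0001
            group['lr'] = 1e-4
    # forward pass on one image, compute the RPN loss, backward, optimizer.step(), ...
```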

3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

3.2 RPN和Fast R-CNN共享特徵

到目前為止,我們已經描述瞭如何訓練用於區域提議生成的網路,沒有考慮將利用這些提議的基於區域的目標檢測CNN。對於檢測網路,我們採用Fast R-CNN[2]。接下來我們介紹一些演算法,學習由RPN和Fast R-CNN組成的具有共享卷積層的統一網路(圖2)。

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:

獨立訓練的RPN和Fast R-CNN將以不同的方式修改卷積層。因此,我們需要開發一種允許在兩個網路之間共享卷積層的技術,而不是學習兩個獨立的網路。我們討論三個方法來訓練具有共享特徵的網路:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

(一)交替訓練。在這個解決方案中,我們首先訓練RPN,並使用這些提議來訓練Fast R-CNN。由Fast R-CNN微調的網路然後被用於初始化RPN,並且重複這個過程。這是本文所有實驗中使用的解決方案。

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes’ coordinates that are also network responses, so is approximate. In our experiments, we have empirically found this solver produces close results, yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.

(二)近似聯合訓練。在這個解決方案中,RPN和Fast R-CNN網路在訓練期間合併成一個網路,如圖2所示。在每次SGD迭代中,前向傳遞生成區域提議,在訓練Fast R-CNN檢測器時將其看作是固定的、預計算的提議。反向傳播像往常一樣進行,其中對於共享層,來自RPN損失和Fast R-CNN損失的反向傳播訊號被組合在一起。這個解決方案很容易實現。但是這個解決方案忽略了關於提議邊界框座標(它們也是網路的響應)的導數,因此是近似的。在我們的實驗中,我們根據經驗發現這個求解器產生了接近的結果,與交替訓練相比,訓練時間減少了大約25-50%。這個求解器包含在我們釋出的Python程式碼中。

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in [15], which is beyond the scope of this paper.

(三)非近似的聯合訓練。如上所述,由RPN預測的邊界框也是輸入的函式。Fast R-CNN中的RoI池化層[2]接受卷積特徵以及預測的邊界框作為輸入,所以理論上有效的反向傳播求解器也應該包括關於邊界框座標的梯度。在上述近似聯合訓練中,這些梯度被忽略。在一個非近似的聯合訓練解決方案中,我們需要一個關於邊界框座標可微分的RoI池化層。這並不是一個簡單的問題,[15]中提出的“RoI扭曲”層可以給出一種解決方案,但這超出了本文的範圍。

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

四步交替訓練。在本文中,我們採用實用的四步訓練演算法,通過交替優化學習共享特徵。在第一步中,我們按照3.1.3節的描述訓練RPN。該網路使用ImageNet的預訓練模型進行初始化,並針對區域提議任務進行了端到端的微調。在第二步中,我們使用由第一步RPN生成的提議,由Fast R-CNN訓練單獨的檢測網路。該檢測網路也由ImageNet的預訓練模型進行初始化。此時兩個網路不共享卷積層。在第三步中,我們使用檢測器網路來初始化RPN訓練,但是我們修正共享的卷積層,並且只對RPN特有的層進行微調。現在這兩個網路共享卷積層。最後,保持共享卷積層的固定,我們對Fast R-CNN的獨有層進行微調。因此,兩個網路共享相同的卷積層並形成統一的網路。類似的交替訓練可以執行更多的迭代,但是我們只觀察到可以忽略的改進。
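The four steps can be summarized as a structural outline. In the sketch below, train_rpn, train_fast_rcnn, generate_proposals and imagenet_init are hypothetical helpers standing in for full training and inference routines, so this should be read as an aid to the text rather than executable training code.

```python
# Structural outline of the 4-step alternating training described above.

def alternating_training(dataset):
    # Step 1: ImageNet-initialized RPN, fine-tuned end-to-end for the proposal task.
    rpn = train_rpn(init=imagenet_init(), data=dataset, trainable='all')
    # Step 2: a separate ImageNet-initialized Fast R-CNN trained on step-1 proposals
    # (the two networks do not yet share conv layers).
    detector = train_fast_rcnn(init=imagenet_init(), data=dataset,
                               proposals=generate_proposals(rpn, dataset))
    # Step 3: re-initialize the RPN from the detector, freeze the shared conv layers and
    # fine-tune only the RPN-specific layers -> the conv layers are now shared.
    rpn = train_rpn(init=detector.conv_layers, data=dataset, trainable='rpn_only')
    # Step 4: keep the shared conv layers fixed and fine-tune the Fast R-CNN-specific layers.
    detector = train_fast_rcnn(init=detector, data=dataset,
                               proposals=generate_proposals(rpn, dataset),
                               trainable='fast_rcnn_only')
    return rpn, detector
```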

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

3.3 實現細節

我們在單尺度影象上訓練和測試區域提議和目標檢測網路[1],[2]。我們重新縮放影象,使得它們的短邊為s=600畫素[2]。多尺度特徵提取(使用影象金字塔)可能會提高精度,但不會表現出速度與精度的良好折衷[2]。在重新縮放的影象上,ZF和VGG網路在最後一個卷積層上的總步長為16個畫素,因此在一張典型的未縮放PASCAL影象(約500×375)上步長約為10個畫素。即使如此大的步長也能提供良好的效果,儘管使用更小的步長可能會進一步提高精度。

For anchors, we use 3 scales with box areas of 128^2, 256^2, and 512^2 pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.

Table 1: the learned average proposal size for each anchor using the ZF net (numbers for s = 600).

Table 1

對於錨點,我們使用3個尺度,框面積分別為128^2,256^2和512^2個畫素,以及1:1,1:2和2:1的長寬比。這些超引數不是針對特定資料集仔細選擇的,我們將在下一節中提供有關其作用的消融實驗。如上所述,我們的解決方案不需要影象金字塔或濾波器金字塔來預測多個尺度的區域,節省了大量的執行時間。圖3(右)顯示了我們的方法在廣泛的尺度和長寬比方面的能力。表1顯示了使用ZF網路的每個錨點學習到的平均提議大小。我們注意到,我們的演算法允許做出比底層感受野更大的預測。這樣的預測不是不可能的——如果只有目標的中間部分可見,仍然可以粗略地推斷出目標的範圍。

表1:使用ZF網路的每個錨點學習到的平均提議大小(s=600時的數值)。

Table 1
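A NumPy sketch of generating these anchors (3 scales × 3 aspect ratios, replicated over the feature map with stride 16) is given below; the exact centering and rounding conventions of the released code may differ.

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """The 9 reference boxes (x1, y1, x2, y2) centered at the origin:
    3 scales (box areas 128^2, 256^2, 512^2) x 3 aspect ratios (roughly 2:1, 1:1, 1:2)."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)            # keep the area ~ s^2 while varying the ratio h/w = r
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(base_anchors, feat_h, feat_w, stride=16):
    """Replicate the 9 base anchors at every feature-map position (total feat_h*feat_w*9)."""
    xs, ys = np.meshgrid(np.arange(feat_w) * stride, np.arange(feat_h) * stride)
    shifts = np.stack([xs, ys, xs, ys], axis=-1).reshape(-1, 1, 4)
    return (base_anchors[None, :, :] + shifts).reshape(-1, 4)

print(shift_anchors(make_anchors(), 40, 60).shape)   # (21600, 4), i.e. roughly 20000 anchors
```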

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.

跨越影象邊界的錨框需要小心處理。在訓練過程中,我們忽略所有跨越邊界的錨點,因此它們不會對損失有貢獻。對於一張典型的1000×600的影象,總共大約有20000(≈60×40×9)個錨點。忽略跨界錨點後,每張影象約有6000個錨點用於訓練。如果在訓練中不忽略跨界的異常值,它們會在目標函式中引入大的、難以糾正的誤差項,且訓練不會收斂。但在測試過程中,我們仍然將全卷積RPN應用於整張影象。這可能會產生跨越邊界的提議框,我們將其裁剪到影象邊界。
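The two boundary policies, ignoring cross-boundary anchors during training and clipping proposals to the image at test time, can be sketched as follows (function names are ours).

```python
import numpy as np

def inside_image(anchors, img_w, img_h):
    """Training: boolean mask keeping only anchors that lie entirely inside the image."""
    return ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
            (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))

def clip_boxes(boxes, img_w, img_h):
    """Testing: clip cross-boundary proposal boxes to the image boundary."""
    boxes = boxes.copy()
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w)   # x coordinates
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h)   # y coordinates
    return boxes
```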

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.

一些RPN提議互相之間高度重疊。為了減少冗餘,我們根據提議區域的cls分數對其採用非極大值抑制(NMS)。我們將NMS的IoU閾值固定為0.7,這樣每張影象大約留下2000個提議區域。正如我們將要展示的那樣,NMS不會損害最終的檢測準確性,但會大大減少提議的數量。在NMS之後,我們使用排名前N的提議區域進行檢測。接下來,我們使用2000個RPN提議訓練Fast R-CNN,但在測試時評估不同數量的提議。
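A plain NumPy version of this greedy NMS step might look as follows; for brevity the top-N cut is folded into the same function, whereas the text applies it after NMS.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy non-maximum suppression on proposals ranked by their cls scores."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        # IoU of the kept box with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]      # drop proposals overlapping the kept box
    return np.array(keep)
```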

4. EXPERIMENTS

4.1 Experiments on PASCAL VOC

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the “fast” version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

4. 實驗

4.1 PASCAL VOC上的實驗

我們在PASCAL VOC 2007檢測基準資料集[11]上全面評估了我們的方法。這個資料集包含大約5000張訓練評估影象和在20個目標類別上的5000張測試影象。我們還提供了一些模型在PASCAL VOC 2012基準資料集上的測試結果。對於ImageNet預訓練網路,我們使用具有5個卷積層和3個全連線層的ZF網路[32]的“快速”版本以及具有13個卷積層和3個全連線層的公開的VGG-16模型[3]。我們主要評估檢測的平均精度均值(mAP),因為這是檢測目標的實際指標(而不是關注目標提議代理度量)。

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the “fast” mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers’ cost (Table 5).

Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are Fast R-CNN with ZF, but using various proposal methods for training and testing.

Table 2

Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. “Region-wise” includes NMS, pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.

Table 5

表2(頂部)顯示了使用各種區域提議方法進行訓練和測試的Fast R-CNN結果。這些結果使用ZF網路。對於選擇性搜尋(SS)[4],我們通過“快速”模式生成約2000個提議。對於EdgeBoxes(EB)[6],我們通過調整0.7 IoU的預設EB設定生成提議。SS在Fast R-CNN框架下的mAP為58.7%,EB的mAP為58.6%。RPN與Fast R-CNN取得了有競爭力的結果,使用最多300個提議時mAP為59.9%。由於共享卷積計算,使用RPN比使用SS或EB產生了更快的檢測系統;較少的提議也降低了區域級全連線層的成本(表5)。

表2:PASCAL VOC 2007測試集上的檢測結果(在VOC 2007訓練評估集上進行了訓練)。檢測器是帶有ZF的Fast R-CNN,但使用各種提議方法進行訓練和測試。

Table 2

表5:K40 GPU上的時間(ms),除了SS提議是在CPU上評估。“區域級(region-wise)”包括NMS,池化,全連線和softmax層。檢視我們釋出的程式碼來分析執行時間。

Table 5

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

RPN上的消融實驗。為了研究RPN作為提議方法的效能,我們進行了幾項消融研究。首先,我們顯示了RPN和Fast R-CNN檢測網路共享卷積層的效果。為此,我們在四步訓練過程的第二步之後停止訓練。使用單獨的網路將結果略微降低到58.7%(RPN+ZF,非共享,表2)。我們觀察到,這是因為在第三步中,當使用檢測器調整的特徵來微調RPN時,提議質量得到了改善。

Next, we disentangle the RPN’s influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

接下來,我們分析RPN對訓練Fast R-CNN檢測網路的影響。為此,我們通過使用2000個SS提議和ZF網路來訓練Fast R-CNN模型。我們固定這個檢測器,並通過改變測試時使用的提議區域來評估檢測的mAP。在這些消融實驗中,RPN不與檢測器共享特徵。

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

在測試階段用300個RPN提議替換SS提議得到了56.8%的mAP。mAP的損失是因為訓練/測試提議不一致。這個結果作為以下比較的基準。

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.

有些令人驚訝的是,RPN在測試時使用排名最高的100個提議仍然會得到有競爭力的結果(55.1%),表明排名靠前的RPN提議是準確的。在另一個極端,使用排名前6000個RPN提議(不使用NMS)的mAP相當(55.2%),這表明NMS不會損害檢測的mAP,並可能減少誤報。