Paper Reading Notes (22): Feature Pyramid Networks for Object Detection (FPN)

Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection.

Recognizing objects at vastly different scales is a fundamental challenge in computer vision. Feature pyramids built upon image pyramids (for short we call these featurized image pyramids) form the basis of a standard solution [1] (Fig. 1(a)). These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.

Featurized image pyramids were heavily used in the era of hand-engineered features [5, 25]. They were so critical that object detectors like DPM [7] required dense scale sampling to achieve good results (e.g., 10 scales per octave). For recognition tasks, engineered features have largely been replaced with features computed by deep convolutional networks (ConvNets) [19, 20]. Aside from being capable of representing higher-level semantics, ConvNets are also more robust to variance in scale and thus facilitate recognition from features computed on a single input scale [15, 11, 29] (Fig. 1(b)). But even with this robustness, pyramids are still needed to get the most accurate results. All recent top entries in the ImageNet [33] and COCO [21] detection challenges use multi-scale testing on featurized image pyramids (e.g., [16, 35]). The principal advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels are semantically strong, including the high-resolution levels.

Nevertheless, featurizing each level of an image pyramid has obvious limitations. Inference time increases considerably (e.g., by four times [11]), making this approach impractical for real applications. Moreover, training deep networks end-to-end on an image pyramid is infeasible in terms of memory, and so, if exploited, image pyramids are used only at test time [15, 11, 16, 35], which creates an inconsistency between train/test-time inference. For these reasons, Fast and Faster R-CNN [11, 29] opt to not use featurized image pyramids under default settings.

However, image pyramids are not the only way to compute a multi-scale feature representation. A deep ConvNet computes a feature hierarchy layer by layer, and with subsampling layers the feature hierarchy has an inherent multi-scale, pyramidal shape. This in-network feature hierarchy produces feature maps of different spatial resolutions, but introduces large semantic gaps caused by different depths. The high-resolution maps have low-level features that harm their representational capacity for object recognition.

The Single Shot Detector (SSD) [22] is one of the first attempts at using a ConvNet’s pyramidal feature hierarchy as if it were a featurized image pyramid (Fig. 1(c)). Ideally, the SSD-style pyramid would reuse the multi-scale feature maps from different layers computed in the forward pass and thus come free of cost. But to avoid using low-level features SSD foregoes reusing already computed layers and instead builds the pyramid starting from high up in the network (e.g., conv4_3 of VGG nets [36]) and then by adding several new layers. Thus it misses the opportunity to reuse the higher-resolution maps of the feature hierarchy. We show that these are important for detecting small objects.

The goal of this paper is to naturally leverage the pyramidal shape of a ConvNet’s feature hierarchy while creating a feature pyramid that has strong semantics at all scales. To achieve this goal, we rely on an architecture that combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections (Fig. 1(d)). The result is a feature pyramid that has rich semantics at all levels and is built quickly from a single input image scale. In other words, we show how to create in-network feature pyramids that can be used to replace featurized image pyramids without sacrificing representational power, speed, or memory.

Similar architectures adopting top-down and skip connections are popular in recent research [28, 17, 8, 26]. Their goals are to produce a single high-level feature map of a fine resolution on which the predictions are to be made (Fig. 2 top). On the contrary, our method leverages the architecture as a feature pyramid where predictions (e.g., object detections) are independently made on each level (Fig. 2 bottom). Our model echoes a featurized image pyramid, which has not been explored in these works.

We evaluate our method, called a Feature Pyramid Network (FPN), in various systems for detection and segmentation [11, 29, 27]. Without bells and whistles, we report a state-of-the-art single-model result on the challenging COCO detection benchmark [21] simply based on FPN and a basic Faster R-CNN detector [29], surpassing all existing heavily-engineered single-model entries of competition winners. In ablation experiments, we find that for bounding box proposals, FPN significantly increases the Average Recall (AR) by 8.0 points; for object detection, it improves the COCO-style Average Precision (AP) by 2.3 points and PASCAL-style AP by 3.8 points, over a strong single-scale baseline of Faster R-CNN on ResNets [16]. Our method is also easily extended to mask proposals and improves both instance segmentation AR and speed over state-of-the-art methods that heavily depend on image pyramids.

In addition, our pyramid structure can be trained end-to-end with all scales and is used consistently at train/test time, which would be memory-infeasible using image pyramids. As a result, FPNs are able to achieve higher accuracy than all existing state-of-the-art methods. Moreover, this improvement is achieved without increasing testing time over the single-scale baseline. We believe these advances will facilitate future research and applications.

Hand-engineered features and early neural networks. SIFT features [25] were originally extracted at scale-space extrema and used for feature point matching. HOG features [5], and later SIFT features as well, were computed densely over entire image pyramids. These HOG and SIFT pyramids have been used in numerous works for image classification, object detection, human pose estimation, and more.

There has also been significant interest in computing featurized image pyramids quickly. Dollár et al. [6] demonstrated fast pyramid computation by first computing a sparsely sampled (in scale) pyramid and then interpolating missing levels. Before HOG and SIFT, early work on face detection with ConvNets [38, 32] computed shallow networks over image pyramids to detect faces across scales.

Deep ConvNet object detectors. With the development of modern deep ConvNets [19], object detectors like OverFeat [34] and R-CNN [12] showed dramatic improvements in accuracy. OverFeat adopted a strategy similar to early neural network face detectors by applying a ConvNet as a sliding window detector on an image pyramid. R-CNN adopted a region proposal-based strategy [37] in which each proposal was scale-normalized before classifying with a ConvNet.

SPPnet [15] demonstrated that such region-based detectors could be applied much more efficiently on feature maps extracted on a single image scale. Recent and more accurate detection methods like Fast R-CNN [11] and Faster R-CNN [29] advocate using features computed from a single scale, because it offers a good trade-off between accuracy and speed. Multi-scale detection, however, still performs better, especially for small objects.

Methods using multiple layers. A number of recent approaches improve detection and segmentation by using different layers in a ConvNet. FCN [24] sums partial scores for each category over multiple scales to compute semantic segmentations. Hypercolumns [13] uses a similar method for object instance segmentation. Several other approaches (HyperNet [18], ParseNet [23], and ION [2]) concatenate features of multiple layers before computing predictions, which is equivalent to summing transformed features. SSD [22] and MS-CNN [3] predict objects at multiple layers of the feature hierarchy without combining features or scores.

There are recent methods exploiting lateral/skip connections that associate low-level feature maps across resolutions and semantic levels, including U-Net [31] and SharpMask [28] for segmentation, Recombinator networks [17] for face detection, and Stacked Hourglass networks [26] for keypoint estimation. Ghiasi et al. [8] present a Laplacian pyramid presentation for FCNs to progressively refine segmentation. Although these methods adopt architectures with pyramidal shapes, they are unlike featurized image pyramids [5, 7, 34] where predictions are made independently at all levels, see Fig. 2. In fact, for the pyramidal architecture in Fig. 2 (top), image pyramids are still needed to recognize objects across multiple scales [28].

Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPNs to instance segmentation proposals in Sec. 6.

Our method takes a single-scale image of an arbitrary size as input, and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architectures (e.g., [19, 36, 16]), and in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced in the following.

Bottom-up pathway. The bottom-up pathway is the feedforward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size and we say these layers are in the same network stage. For our feature pyramid, we define one pyramid level for each stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create our pyramid. This choice is natural since the deepest layer of each stage should have the strongest features.

Specifically, for ResNets [16] we use the feature activations output by each stage’s last residual block. We denote the output of these last residual blocks as {C2, C3, C4, C5} for conv2, conv3, conv4, and conv5 outputs, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 into the pyramid due to its large memory footprint.

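As a concrete illustration, the bottom-up pathway can be sketched in a few lines of PyTorch. This is a minimal sketch rather than the authors' code, and it assumes a recent torchvision ResNet-50, whose layer1–layer4 modules correspond to the paper's conv2–conv5 stages:

```python
import torch
import torch.nn as nn
import torchvision


class BottomUp(nn.Module):
    """Feedforward backbone pass; returns the last feature map of each stage."""

    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # The stem (conv1 + pooling) is excluded from the pyramid
        # due to its large memory footprint, as in the paper.
        self.stem = nn.Sequential(
            resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool
        )
        self.layer1 = resnet.layer1  # C2: stride 4,  256 channels
        self.layer2 = resnet.layer2  # C3: stride 8,  512 channels
        self.layer3 = resnet.layer3  # C4: stride 16, 1024 channels
        self.layer4 = resnet.layer4  # C5: stride 32, 2048 channels

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return c2, c3, c4, c5
```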

Top-down pathway and lateral connections. The top-down pathway hallucinates higher resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels. These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized as it was subsampled fewer times.

Fig. 3 shows the building block that constructs our top-down feature maps. With a coarser-resolution feature map, we upsample the spatial resolution by a factor of 2 (using nearest neighbor upsampling for simplicity). The upsampled map is then merged with the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) by element-wise addition. This process is iterated until the finest resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5 to produce the coarsest resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which is to reduce the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} that are respectively of the same spatial sizes.

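The building block of Fig. 3 can likewise be sketched in PyTorch. Again a minimal sketch under the assumptions above (the input channel counts are ResNet-50's), folding in the paper's choices of 1×1 lateral convolutions, nearest-neighbor upsampling, element-wise addition, and 3×3 smoothing convolutions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopDown(nn.Module):
    """Builds {P2, P3, P4, P5} from {C2, C3, C4, C5} (Fig. 3)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 convs bring every C_k to the fixed dimension d (= 256 in the
        # paper); note there are no non-linearities in these extra layers.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, d, kernel_size=1) for c in in_channels
        )
        # 3x3 convs on each merged map reduce the aliasing of upsampling.
        self.smooth = nn.ModuleList(
            nn.Conv2d(d, d, kernel_size=3, padding=1) for _ in in_channels
        )

    def upsample_add(self, top, lateral):
        # Nearest-neighbor upsample the coarser map, then merge with the
        # same-sized lateral map by element-wise addition.
        return lateral + F.interpolate(top, size=lateral.shape[-2:], mode="nearest")

    def forward(self, c2, c3, c4, c5):
        m5 = self.lateral[3](c5)  # coarsest map starts the iteration
        m4 = self.upsample_add(m5, self.lateral[2](c4))
        m3 = self.upsample_add(m4, self.lateral[1](c3))
        m2 = self.upsample_add(m3, self.lateral[0](c2))
        return tuple(s(m) for s, m in zip(self.smooth, (m2, m3, m4, m5)))
```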

Because all levels of the pyramid use shared classifiers/regressors as in a traditional featurized image pyramid, we fix the feature dimension (numbers of channels, denoted as d) in all the feature maps. We set d = 256 in this paper and thus all extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impacts.

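Putting the two sketch modules above together gives a quick shape check: every pyramid level comes out with d = 256 channels, and the spatial sizes scale proportionally with the input (assuming side lengths divisible by 32 so the levels align exactly):

```python
import torch

backbone, fpn = BottomUp(), TopDown()
image = torch.randn(1, 3, 224, 224)
for name, p in zip(("P2", "P3", "P4", "P5"), fpn(*backbone(image))):
    print(name, tuple(p.shape))
# P2 (1, 256, 56, 56)   stride 4
# P3 (1, 256, 28, 28)   stride 8
# P4 (1, 256, 14, 14)   stride 16
# P5 (1, 256, 7, 7)     stride 32
```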

Simplicity is central to our design and we have found that our model is robust to many design choices. We have experimented with more sophisticated blocks (e.g., using multilayer residual blocks [16] as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.

We have presented a clean and simple framework for building feature pyramids inside ConvNets. Our method shows significant improvements over several strong baselines and competition winners. Thus, it provides a practical solution for research and applications of feature pyramids, without the need of computing image pyramids. Finally, our study suggests that despite the strong representational power of deep ConvNets and their implicit robustness to scale variation, it is still critical to explicitly address multiscale problems using pyramid representations.

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicated by blue outlines and thicker outlines denote semantically stronger features.

Figure 2. Top: a top-down architecture with skip connections, where predictions are made on the finest level (e.g., [28]). Bottom: our model that has a similar structure but leverages it as a feature pyramid, with predictions made independently at all levels.

Figure 3. A building block illustrating the lateral connection and the top-down pathway, merged by addition.

Figure 4. FPN for object segment proposals. The feature pyramid is constructed with identical structure as for object detection. We apply a small MLP on 5×5 windows to generate dense object segments with output dimension of 14×14. Shown in orange are the size of the image regions the mask corresponds to for each pyramid level (levels P3−P5 are shown here). Both the corresponding image region size (light orange) and canonical object size (dark orange) are shown. Half octaves are handled by an MLP on 7×7 windows (7 ≈ 5√2), not shown here. Details are in the appendix.

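For illustration only, the "MLP on 5×5 windows" can be expressed as a fully convolutional head sliding densely over one pyramid level; the hidden width (512) and activation below are guesses, while the 5×5 window and 14×14 output dimension come from the caption (the actual head is specified in the paper's appendix):

```python
import torch.nn as nn

# Hypothetical segment-proposal head on a 256-channel pyramid level.
mask_head = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=5),      # one "MLP" layer per 5x5 window
    nn.ReLU(inplace=True),
    nn.Conv2d(512, 14 * 14, kernel_size=1),  # a 14x14 mask per window
)
```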
