
Paper Translation: Scalable Object Detection using Deep Neural Networks

Scalable Object Detection using Deep Neural Networks

Authors: Dumitru Erhan, Christian Szegedy, Alexander Toshev, et al.

Published: 2013

Abstract

Deep convolutional neural networks have recently achieved state-of-the-art performance on a number of image recognition benchmarks, including the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC-2012). The winning model on the localization sub-task was a network that predicts a single bounding box and a confidence score for each object category in the image. Such a model captures the whole-image context around the objects but cannot handle multiple instances of the same object in the image without naively replicating the number of outputs for each instance. In this work, we propose a saliency-inspired neural network model for detection, which predicts a set of class-agnostic bounding boxes along with a single score for each box, corresponding to its likelihood of containing any object of interest. The model naturally handles a variable number of instances for each class and allows for cross-class generalization at the highest levels of the network. We are able to obtain competitive recognition performance on VOC2007 and ILSVRC2012, while using only the top few predicted locations in each image and a small number of neural network evaluations.


1. Introduction

Object detection is one of the fundamental tasks in computer vision. A common paradigm to address this problem is to train object detectors which operate on a sub-image and apply these detectors in an exhaustive manner across all locations and scales. This paradigm was successfully used within a discriminatively trained Deformable Part Model (DPM) to achieve state-of-the-art results on detection tasks [6].



The exhaustive search through all possible locations and scales poses a computational challenge. This challenge becomes even harder as the number of classes grows, since most of the approaches train a separate detector per class. In order to address this issue, a variety of methods have been proposed, ranging from detector cascades to using segmentation to suggest a small number of object hypotheses [14, 2, 4].



In this paper, we subscribe to the latter philosophy and propose to train a detector, called “DeepMultiBox”, which generates a few bounding boxes as object candidates. These boxes are generated by a single DNN in a class-agnostic manner. Our model has several contributions. First, we define object detection as a regression problem to the coordinates of several bounding boxes. In addition, for each predicted box the net outputs a confidence score of how likely this box contains an object. This is quite different from traditional approaches, which score features within predefined boxes, and has the advantage of expressing detection of objects in a very compact and efficient way.



The second major contribution is the loss, which trains the bounding box predictors as part of the network training. For each training example, we solve an assignment problem between the current predictions and the ground truth boxes and update the matched box coordinates, their confidences and the underlying features through backpropagation. In this way, we learn a deep net tailored towards our localization problem. We capitalize on the excellent representation learning abilities of DNNs, as recently exemplified in image classification [10] and object detection settings [13], and perform joint learning of representation and predictors.



Finally, we train our object box predictor in a class-agnostic manner. We consider this a scalable way to enable efficient detection of a large number of object classes. We show in our experiments that by post-classifying fewer than ten boxes, obtained by a single network application, we can achieve state-of-the-art detection results. Further, we show that our box predictor generalizes to unseen classes and as such is flexible enough to be re-used within other detection problems.


2. Previous work

The literature on object detection is vast, and in this section we will focus on approaches exploiting class-agnostic ideas and addressing scalability.



Many of the proposed detection approaches are based on part-based models [7], which have more recently achieved impressive performance thanks to discriminative learning and carefully crafted features [6]. These methods, however, rely on exhaustive application of part templates over multiple scales and as such are expensive. Moreover, they scale linearly in the number of classes, which becomes a challenge for modern datasets such as ImageNet.



To address the former issue, Lampert et al. [11] use a branch-and-bound strategy to avoid evaluating all potential object locations. To address the latter issue, Song et al. [12] use a low-dimensional part basis, shared across all object classes. A hashing-based approach for efficient part detection has shown good results as well [3].



A different line of work, closer to ours, is based on the idea that objects can be localized without having to know their class. Some of these approaches build on bottom-up classless segmentation [9]. The segments, obtained in this way, can be scored using top-down feedback [14, 2, 4]. Using the same motivation, Alexe et al. [1] use an inexpensive classifier to score object hypotheses for being an object or not, and in this way reduce the number of locations for the subsequent detection steps. These approaches can be thought of as multi-layered models, with segmentation as the first layer and segment classification as a subsequent layer. Despite the fact that they encode proven perceptual principles, we will show that having deeper models which are fully learned can lead to superior results.



Finally, we capitalize on the recent advances in deep learning, most notably the work by Krizhevsky et al. [10]. We extend their bounding box regression approach for detection to the case of handling multiple objects in a scalable manner. DNN-based regression to object masks, however, has been applied by Szegedy et al. [13]. This last approach achieves state-of-the-art detection performance but does not scale up to multiple classes due to the cost of a single mask regression.



3. Proposed approach

We aim at achieving class-agnostic scalable object detection by predicting a set of bounding boxes, which represent potential objects. More precisely, we use a Deep Neural Network (DNN), which outputs a fixed number of bounding boxes. In addition, it outputs a score for each box expressing the network's confidence of this box containing an object.



Model    To formalize the above idea, we encode the i-th object box and its associated confidence as node values of the last net layer:



Bounding box: we encode the upper-left and lower-right coordinates of each box as four node values, which can be written as a vector $l_i \in \mathbb{R}^4$. These coordinates are normalized w.r.t. the image dimensions to achieve invariance to absolute image size. Each normalized coordinate is produced by a linear transformation of the last hidden layer.



Confidence: the confidence score for the box containing an object is encoded as a single node value $c_i \in [0, 1]$. This value is produced through a linear transformation of the last hidden layer followed by a sigmoid.



We can combine the bounding box locations $l_i$, $i \in \{1,\dots,K\}$, as one linear layer. Similarly, we can treat the collection of all confidences $c_i$, $i \in \{1,\dots,K\}$, as the output of one sigmoid layer. Both of these output layers are connected to the last hidden layer.

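To make this output encoding concrete, here is a minimal sketch of such a head in PyTorch. The class name `MultiBoxHead` and the default layer sizes are illustrative assumptions, not from the paper; in the paper these layers sit on top of the architecture of [10].

```python
import torch
import torch.nn as nn

class MultiBoxHead(nn.Module):
    """Output head producing K class-agnostic boxes plus K confidences."""

    def __init__(self, hidden_dim: int = 4096, num_boxes: int = 100):
        super().__init__()
        self.num_boxes = num_boxes
        # One linear layer for all K sets of normalized (x1, y1, x2, y2) coordinates.
        self.loc = nn.Linear(hidden_dim, num_boxes * 4)
        # One linear layer followed by a sigmoid for the K confidences c_i in [0, 1].
        self.conf = nn.Linear(hidden_dim, num_boxes)

    def forward(self, h: torch.Tensor):
        boxes = self.loc(h).view(-1, self.num_boxes, 4)  # (batch, K, 4)
        confs = torch.sigmoid(self.conf(h))              # (batch, K)
        return boxes, confs
```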


At inference time, our algorithm produces K bounding boxes. In our experiments, we use K = 100 and K = 200. If desired, we can use the confidence scores and non-maximum suppression to obtain a smaller number of high-confidence boxes at inference time. These boxes are supposed to represent objects. As such, they can be classified with a subsequent classifier to achieve object detection. Since the number of boxes is very small, we can afford powerful classifiers. In our experiments, we use another DNN for classification [10].



Training Objective     We train a DNN to predict bounding boxes and their confidence scores for each training image such that the highest scoring boxes match well the ground truth object boxes for the image. Suppose that for a particular training example, M objects were labeled by bounding boxes $g_j$, $j \in \{1,\dots,M\}$. In practice, the number of predictions K is much larger than the number of ground truth boxes M. Therefore, we try to optimize only the subset of predicted boxes which best match the ground truth ones. We optimize their locations to improve their match and maximize their confidences. At the same time, we minimize the confidences of the remaining predictions, which are deemed not to localize the true objects well.



To achieve the above, we formulate an assignment problem for each training example. We let $x_{ij} \in \{0, 1\}$ denote the assignment: $x_{ij} = 1$ if the i-th prediction is assigned to the j-th true object. The objective of this assignment can be expressed as:

$$F_\text{match}(x, l) = \frac{1}{2}\sum_{i,j} x_{ij}\,\|l_i - g_j\|_2^2, \qquad x_{ij} \in \{0, 1\}, \quad \sum_i x_{ij} = 1, \tag{1}$$

where we use the L2 distance between the normalized bounding box coordinates to quantify the dissimilarity between bounding boxes.



Additionally, we want to optimize the confidences of the boxes according to the assignment x. Maximizing the confidences of assigned predictions can be expressed as:

$$F_\text{conf}(x, c) = -\sum_{i,j} x_{ij}\,\log(c_i) \;-\; \sum_i \Big(1 - \sum_j x_{ij}\Big)\log(1 - c_i) \tag{2}$$

In the above objective, $\sum_j x_{ij} = 1$ if prediction i has been matched to a ground truth. In that case $c_i$ is being maximized, while in the opposite case it is being minimized. A different interpretation of the above term is achieved if we view $\sum_j x_{ij}$ as a probability of prediction i containing an object of interest. Then, the above loss is the negative of the entropy and thus corresponds to a max-entropy loss.



The final loss objective combines the matching and confidence losses:

$$F(x, l, c) = \alpha\,F_\text{match}(x, l) + F_\text{conf}(x, c) \tag{3}$$

subject to the constraints in Eq. 1. α balances the contribution of the different loss terms.

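As a sketch of how this combined objective could be computed for one training example, given a binary assignment matrix `x` as defined above (NumPy; the array shapes and the helper name `multibox_loss` are assumptions, and α = 0.3 is the value reported in Section 4.1):

```python
import numpy as np

def multibox_loss(x, l, c, g, alpha=0.3):
    """F(x, l, c) = alpha * F_match(x, l) + F_conf(x, c) for one example.

    x: (K, M) binary assignment matrix, l: (K, 4) predicted boxes,
    c: (K,) confidences, g: (M, 4) ground truth boxes.
    """
    # Matching term: halved squared L2 distance between matched coordinates.
    diff = l[:, None, :] - g[None, :, :]                 # (K, M, 4)
    f_match = 0.5 * np.sum(x * np.sum(diff ** 2, axis=-1))
    # Confidence term: cross-entropy against the "is matched" indicator.
    matched = x.sum(axis=1)                              # (K,) in {0, 1}
    eps = 1e-8                                           # numerical safety
    f_conf = -np.sum(matched * np.log(c + eps)
                     + (1.0 - matched) * np.log(1.0 - c + eps))
    return alpha * f_match + f_conf
```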


Optimization        For each training example, we solve for an optimal assignment x* of predictions to true boxes by

$$x^* = \arg\min_x F(x, l, c), \quad \text{subject to the constraints in Eq. 1,} \tag{4}$$

where the constraints enforce an assignment solution. This is a variant of bipartite matching, which is polynomial in complexity. In our application the matching is very inexpensive – the number of labeled objects per image is less than a dozen and in most cases only very few objects are labeled.

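One way to solve this small bipartite matching problem is the Hungarian algorithm; below is a sketch using SciPy. The paper does not prescribe a particular solver, and the function name `match_boxes` is an assumption. The per-pair cost follows Eq. 3: the weighted location distance plus the change in the confidence loss caused by matching prediction i.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_boxes(l, c, g, alpha=0.3):
    """Return the optimal binary assignment x* of K predictions to M ground truths."""
    K, M = l.shape[0], g.shape[0]
    # alpha * 0.5 * ||l_i - g_j||^2, for all (i, j) pairs.
    loc_cost = 0.5 * ((l[:, None, :] - g[None, :, :]) ** 2).sum(-1)  # (K, M)
    eps = 1e-8
    # Matching prediction i changes F_conf by -(log c_i - log(1 - c_i)).
    conf_gain = np.log(c + eps) - np.log(1.0 - c + eps)              # (K,)
    cost = alpha * loc_cost - conf_gain[:, None]
    rows, cols = linear_sum_assignment(cost)  # each ground truth gets one prediction
    x = np.zeros((K, M))
    x[rows, cols] = 1.0
    return x
```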


Then, we optimize the network parameters via backpropagation. For example, the first derivatives of the backpropagation algorithm are computed w.r.t. l and c:

$$\frac{\partial F}{\partial l_i} = \alpha \sum_j x^*_{ij}\,(l_i - g_j), \qquad \frac{\partial F}{\partial c_i} = -\frac{\sum_j x^*_{ij}}{c_i} + \frac{1 - \sum_j x^*_{ij}}{1 - c_i}$$



Training Details         While the loss as defined above is in principle sufficient, three modifications make it possible to reach better accuracy significantly faster. The first such modification is to perform clustering of ground truth locations and find K such clusters/centroids that we can use as priors for each of the predicted locations. Thus, the learning algorithm is encouraged to learn a residual to a prior for each of the predicted locations.

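A sketch of how such priors might be computed with k-means over the normalized ground truth boxes (scikit-learn here; the paper specifies k-means on the training set but not an implementation, and the function name is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans

def compute_box_priors(gt_boxes: np.ndarray, num_priors: int = 100) -> np.ndarray:
    """Cluster normalized ground truth boxes (N, 4) into K centroids used as priors."""
    kmeans = KMeans(n_clusters=num_priors, n_init=10, random_state=0)
    kmeans.fit(gt_boxes)
    return kmeans.cluster_centers_  # (K, 4): one prior per predicted location
```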


A second modification pertains to using these priors in the matching process: instead of matching the N ground truth locations with the K predictions, we find the best match between the K priors and the ground truth. Once the matching is done, the target confidences are computed as before. Moreover, the location prediction loss is also unchanged: for any matched pair of (target, prediction) locations, the loss is defined by the difference between the ground truth and the coordinates that correspond to the matched prior. We call the usage of priors for matching prior matching and hypothesize that it enforces diversification among the predictions.

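Under prior matching, the same assignment machinery operates on the fixed priors instead of the current predictions; a sketch (function name and shapes are assumptions):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_priors(priors: np.ndarray, g: np.ndarray):
    """Match K fixed priors (K, 4) to M ground truth boxes (M, 4)."""
    cost = 0.5 * ((priors[:, None, :] - g[None, :, :]) ** 2).sum(-1)  # (K, M)
    rows, cols = linear_sum_assignment(cost)
    # The prediction at slot rows[k] is then trained toward g[cols[k]], e.g. as the
    # residual g[cols[k]] - priors[rows[k]] when locations are learned relative to priors.
    return rows, cols
```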


It should be noted that, although we defined our method in a class-agnostic way, we can apply it to predicting object boxes for a particular class. To do this, we simply need to train our models on bounding boxes for that class.



Further, we can predict K boxes per class. Unfortunately, this model will have a number of parameters growing linearly with the number of classes. Also, in a typical setting, where the number of objects for a given class is relatively small, most of these parameters will see very few training examples with a corresponding gradient contribution. We thus argue that our two-step process – first localize, then recognize – is a superior alternative in that it allows leveraging data from multiple object types in the same image using a small number of parameters.



4. Experimental results

4.1. Network Architecture and Experiment Details

The network architecture for the localization and classification models that we use is the same as the one used by [10]. We use Adagrad for controlling the learning rate decay, mini-batches of size 128, and parallel distributed training with multiple identical replicas of the network, which achieves faster convergence. As mentioned previously, we use priors in the localization loss – these are computed using k-means on the training set. We also use an α of 0.3 to balance the localization and confidence losses.

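A rough sketch of this training configuration in PyTorch (only the optimizer type, batch size, and α come from the text; the learning rate and the stand-in single-layer model are assumptions):

```python
import torch
import torch.nn as nn

K = 100                              # number of predicted boxes
model = nn.Linear(4096, K * 4 + K)   # stand-in for the architecture of [10]; sizes assumed
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.1)  # Adagrad per the text; lr assumed
BATCH_SIZE = 128                     # mini-batch size from the text
ALPHA = 0.3                          # balances localization vs. confidence losses
```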


The localizer might output coordinates outside the crop area used for the inference. At the end, the coordinates are mapped and truncated to the final image area. Boxes are additionally pruned using non-maximum suppression with a Jaccard similarity threshold of 0.5. Our second model then classifies each bounding box as an object of interest or “background”.

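A minimal sketch of this pruning step (standard greedy non-maximum suppression; the 0.5 Jaccard threshold is from the text, while the function names are assumptions):

```python
import numpy as np

def jaccard(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, threshold=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order, keep = np.argsort(-scores), []
    while order.size > 0:
        i, order = order[0], order[1:]
        keep.append(i)
        # Drop remaining boxes that overlap the kept box too much.
        order = order[~(jaccard(boxes[i], boxes[order]) > threshold)]
    return keep
```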


To train our localizer networks, we generated approximately 30 million images from the training set by applying the following procedure to each image in the training set. For each image, we generate the same number of square samples such that the total number of samples is about ten million. For each image, the samples are bucketed such that for each of the ratios in the ranges of 0–5%, 5–15%, 15–50%, 50–100%, there is an equal number of samples in which the ratio covered by the bounding boxes is in the given range.

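A sketch of the bucketing rule (the coverage ranges come from the text; the tie-breaking at bucket boundaries is an assumption):

```python
# Coverage buckets from the text: the fraction of a sampled crop covered by
# bounding boxes falls into one of these ranges, with equal sample counts per range.
BUCKETS = [(0.00, 0.05), (0.05, 0.15), (0.15, 0.50), (0.50, 1.00)]

def bucket_index(coverage: float) -> int:
    """Index of the coverage bucket for a sampled crop (first matching range wins)."""
    for idx, (lo, hi) in enumerate(BUCKETS):
        if lo <= coverage <= hi:
            return idx
    raise ValueError(f"coverage {coverage} outside [0, 1]")
```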


The selection of the training set and most of our hyperparameters were based on past experiences with non-public data sets. For the experiments below we have not explored any non-standard data generation or regularization options. 



In all experiments, all hyper-parameters were selected by evaluating on a held-out portion of the training set (10% random choice of examples).



4.2. VOC 2007

The Pascal Visual Object Classes (VOC) Challenge [5] is the most common benchmark for object detection algorithms. It consists mainly of complex scene images in which bounding boxes of 20 diverse object classes were labelled.

In our evaluation we focus on the 2007 edition of VOC, for which a test set was released. We present results obtained by training on VOC 2012, which contains approx. 11000 images. We trained a 100-box localizer as well as a deep net based classifier [10].



4.2.1 Training methodology 

We trained the classifier on a data set comprising:

• 10 million crops overlapping some object with at least 0.5 Jaccard overlap similarity. The crops are labeled with one of the 20 VOC object classes.

• 20 million negative crops that have at most 0.2 Jaccard similarity with any of the object boxes. These crops are labeled with the special “background” class label.

The architecture and the selection of hyperparameters followed that of [10].



4.2.2 Evaluation methodology 

In the first round, the localizer model is applied to the maximum center square crop in the image. The crop is resized to the network input size, which is 220 × 220. A single pass through this network gives us up to one hundred candidate boxes. After non-maximum suppression with overlap threshold 0.5, the top 10 highest scoring detections are kept and were classified by the 21-way classifier model in separate passes through the network. The final detection score is the product of the localizer score for the given box multiplied by the score of the classifier evaluated on the maximum square region around the crop. These scores are passed to the evaluation and were used for computing the precision recall curves.



4.3. Discussion

First, we analyze the performance of our localizer in isolation. We present the number of detected objects, as defined by the Pascal detection criterion, against the number of produced bounding boxes. In the plot in Fig. 1 we show results obtained by training on VOC2012. In addition, we present results obtained by using the max-center square crop of the image as input, as well as by using two scales: the max-center crop and a second scale where we select 3×3 windows of size 60% of the image size.



As we can see, when using a budget of 10 bounding boxes we can localize 45.3% of the objects with the first model, and 48% with the second model. This shows better performance than other reported results, such as the objectness algorithm achieving 42% [1]. Further, this plot shows the importance of looking at the image at several resolutions. Although our algorithm manages to get a large number of objects by using the max-center crop, we obtain an additional boost when using higher resolution image crops.



Further, we classify the produced bounding boxes by a 21-way classifier, as described above. The average precisions (APs) on VOC 2007 are presented in Table 1. The achieved mean AP is 0.29, which is on par with the state-of-the-art. Note that our running time complexity is very low – we simply use the top 10 boxes.



Example detections and full precision recall curves are shown in Fig. 2 and Fig. 3 respectively. It is important to note that the visualized detections were obtained by using only the max-centered square image crop, i.e. the full image was used. Nevertheless, we manage to obtain relatively small objects, such as the boats in row 2 and column 2, as well as the sheep in row 3 and column 3.



4.4. ILSVRC 2012 Detection Challenge

For this set of experiments, we used the ILSVRC 2012 detection challenge dataset. This dataset consists of 544,545 training images labeled with categories and locations of 1,000 object categories, relatively uniformly distributed among the classes. The validation set, on which the performance metrics are calculated, consists of 48,238 images.



4.4.1 Training methodology

In addition to a localization model that is identical (up to the dataset on which it is trained) to the VOC model, we also train a model on the ImageNet Classification challenge data, which will serve as the recognition model. This model is trained in a procedure that is substantially similar to that of [10] and is able to achieve the same results on the classification challenge validation set; note that we only train a single model, instead of 7 – the latter brings substantial benefits in terms of classification accuracy, but is 7× more expensive, which is not a negligible factor.



Inference is done as with the VOC setup: the number of predicted locations is K = 100, which are then reduced by non-maximum suppression (Jaccard overlap criterion of 0.4) and which are post-scored by the classifier: the score is the product of the localizer confidence for the given box multiplied by the score of the classifier evaluated on the minimum square region around the crop. The final scores (detection score times classification score) are then sorted in descending order and only the top scoring score/location pair is kept for a given class (as per the challenge evaluation criterion).

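The post-scoring step can be sketched as a simple product of localizer and classifier scores, keeping the single best window per class (array shapes and the helper name are assumptions):

```python
import numpy as np

def final_scores(loc_conf: np.ndarray, cls_scores: np.ndarray):
    """Detection score = localizer confidence x classifier score, per box and class.

    loc_conf: (B,) confidences of the B boxes kept after NMS.
    cls_scores: (B, C) classifier scores for C classes.
    Returns the per-class best score and the index of the box achieving it,
    since the challenge criterion keeps one score/location pair per class.
    """
    scores = loc_conf[:, None] * cls_scores  # (B, C)
    return scores.max(axis=0), scores.argmax(axis=0)
```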


In all experiments, the hyper-parameters were selected by evaluating on a held-out portion of the training set (10% random choice of examples).



4.4.2 Evaluation methodology

The official metric of the “Classification with localization” ILSVRC-2012 challenge is detection@5, where an algorithm is only allowed to produce one box per each of the 5 labels (in other words, a model is neither penalized nor rewarded for producing valid multiple detections of the same class), and where the detection criterion is 0.5 Jaccard overlap with any of the ground-truth boxes (in addition to the matching class label).



Table 4.4.2 contains a comparison of the proposed method, dubbed DeepMultiBox, with classifying the ground-truth boxes directly and with the approach of inferring one box per class directly. The metrics reported are detection@5 and classification@5, the official metrics for the ILSVRC-2012 challenge. In the table, we vary the number of windows at which we apply the classifier (this number represents the top windows chosen after non-maximum suppression, the ranking coming from the confidence scores). The one-box-per-class approach is a careful reimplementation of the winning entry of ILSVRC-2012 (the “classification with localization” challenge), with 1 network trained (instead of 7).



We can see that the DeepMultiBox approach is quite competitive: with 5-10 windows, it is able to perform about as well as the competing approach. While the one-box-per-class approach may come off as more appealing in this particular case in terms of the raw performance, it suffers from a number of drawbacks: first, its output scales linearly with the number of classes, for which there needs to be training data. The multibox approach can in principle use transfer learning to detect certain types of objects on which it has never been specifically trained, but which share similarities with objects that it has seen. Figure 5 explores this hypothesis by observing what happens when one takes a localization model trained on ImageNet and applies it on the VOC test set, and vice-versa. The figure shows a precision recall curve: in this case, we perform a class-agnostic detection: a true positive occurs if two windows (prediction and ground truth) overlap by more than 0.5, independently of their class. Interestingly, the ImageNet-trained model is able to capture more VOC windows than vice-versa: we hypothesize that this is due to the ImageNet class set being much richer than the VOC class set.



Secondly, the one-box-per-class approach does not generalize naturally to multiple instances of objects of the same type (except via the method presented in this work, for instance). Figure 5 shows this too, in the comparison between DeepMultiBox and the one-box-per-class approach. Generalizing to such a scenario is necessary for actual image understanding by algorithms, thus such limitations need to be overcome, and our method is a scalable way of doing so. Evidence supporting this statement is shown in Figure 5: the proposed method is able to generally capture more objects more accurately than a single-box method.



5. Discussion and Conclusion

In this work, we propose a novel method for localizing objects in an image, which predicts multiple bounding boxes at a time. The method uses a deep convolutional neural network as a base feature extraction and learning model. It formulates a multiple-box localization cost that is able to take advantage of a variable number of ground truth locations of interest in a given image and learn to predict such locations in unseen images.



We present results on two challenging benchmarks, VOC2007 and ILSVRC-2012, on which the proposed method is competitive. Moreover, the method is able to perform well by predicting only very few locations to be probed by a subsequent classifier. Our results show that the DeepMultiBox approach is scalable and can even generalize across the two datasets, in terms of being able to predict locations of interest, even for categories on which it was not trained. Additionally, it is able to capture multiple instances of objects of the same class, which is an important feature of algorithms that aim for better image understanding.



In the future, we hope to be able to fold the localization and recognition paths into a single network, such that we would be able to extract both location and class label information in a one-shot feed-forward pass through the network. Even in its current state, the two-pass procedure (localization network followed by categorization network) entails 5-10 network evaluations, each at roughly 1 CPU-sec (modern machine). Importantly, this number does not scale linearly with the number of classes to be recognized, which makes the proposed approach very competitive with DPM-like approaches.



References

[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In CVPR. IEEE, 2010.

[2] J. Carreira and C. Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In CVPR, 2010.

[3] T. Dean, M. A. Ruzon, M. Segal, J. Shlens, S. Vijayanarasimhan, and J. Yagnik. Fast, accurate detection of 100,000 object classes on a single machine. In CVPR, 2013.

[4] I. Endres and D. Hoiem. Category independent object proposals. In ECCV, 2010.

[5] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[6] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.

[7] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. Computers, IEEE Transactions on, 100(1):67–92, 1973.

[8] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester. Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/.

[9] C. Gu, J. J. Lim, P. Arbeláez, and J. Malik. Recognition using regions. In CVPR, 2009.

[10] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[11] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In CVPR, 2008.

[12] H. O. Song, S. Zickler, T. Althoff, R. Girshick, M. Fritz, C. Geyer, P. Felzenszwalb, and T. Darrell. Sparselet models for efficient multiclass object detection. In ECCV, 2012.

[13] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems (NIPS), 2013.

[14] K. E. van de Sande, J. R. Uijlings, T. Gevers, and A. W. Smeulders. Segmentation as selective search for object recognition. In ICCV, 2011.

[15] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1062–1069. IEEE, 2010.