
Paper Notes | SSD: Single Shot MultiBox Detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

Abstract

Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD completely eliminates proposal generation and the subsequent pixel or feature resampling stages, and encapsulates all computation in a single network.

1 Introduction

This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do.
Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales.
We summarize our contributions as follows:
1. SSD is faster than YOLO and more accurate; it is as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
2. The core of the SSD approach is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
3. We produce predictions of different scales from feature maps of different scales and explicitly separate predictions by aspect ratio.

2 SSD

2.1 Model

Feed-forward convolutional network -> produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes
NMS (non-maximum suppression) -> produces the final detections
SSD = base network + auxiliary structure
The auxiliary structure produces detections with the following key features:
1. Multi-scale feature maps for detection. We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer.
2. Convolutional predictors for detection. Each added feature layer can produce a fixed set of detection predictions using a set of convolutional filters. For a feature layer of size m x n with p channels, the basic element for predicting parameters of a potential detection is a small 3 x 3 x p kernel that produces either a score for a category or a shape offset relative to the default box coordinates.
3. Default boxes and aspect ratios. The default boxes are similar to the anchor boxes used in Faster R-CNN, but they are applied to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.
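As a rough check on the size of the prediction heads: the SSD paper notes that with k default boxes per location and c class scores, each location needs (c + 4)k filters, which gives (c + 4)kmn outputs for an m x n feature map. The sketch below just evaluates that count; the function name and the example numbers are illustrative assumptions, not values taken from these notes.

```python
def predictor_outputs(m, n, k, c):
    """Return (filters_per_location, total_outputs) for one feature layer.

    m, n: spatial size of the feature map
    k:    number of default boxes per location
    c:    number of class scores predicted per default box
    """
    # Each default box gets c class scores plus 4 box offsets.
    filters = (c + 4) * k
    return filters, filters * m * n


# Illustrative numbers only: a 38x38 map, 4 boxes per location, 21 classes.
filters, total = predictor_outputs(38, 38, 4, 21)
print(filters, total)
```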

2.2 Training

The ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Training also involves choosing the set of default boxes and scales for detection as well as hard negative mining and data augmentation strategies.
1. Matching strategy. We begin by matching each ground truth box to the default box with the best jaccard overlap, then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5). This allows the network to predict high confidence for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap. More than one default box can match the j-th ground truth box.
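The two matching steps can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the box format `(xmin, ymin, xmax, ymax)`, the function names, and the tie-breaking (a default box that clears the threshold for several ground truths takes the one with the highest overlap) are all assumptions made here.

```python
def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(defaults, truths, threshold=0.5):
    """For each default box, return the index of its ground truth, or None."""
    assigned = [None] * len(defaults)
    # Threshold matching: a default box is positive for the ground truth it
    # overlaps most, provided that overlap exceeds the threshold.
    for i, d in enumerate(defaults):
        best_j, best_iou = None, threshold
        for j, g in enumerate(truths):
            iou = jaccard(d, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        assigned[i] = best_j
    # Best-overlap matching: every ground truth claims its best default box,
    # even when that overlap is below the threshold. Applied last so that it
    # takes priority over the threshold matches.
    for j, g in enumerate(truths):
        best_i = max(range(len(defaults)), key=lambda i: jaccard(defaults[i], g))
        assigned[best_i] = j
    return assigned


defaults = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
truths = [(0, 0, 10, 10), (22, 22, 40, 40)]
print(match(defaults, truths))
```

In this example the first two default boxes both match the first ground truth (multiple positives per object), the third is matched only because it is the best available box for the second ground truth, and the fourth stays negative.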
2. Training objective
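For reference, the overall objective given in the SSD paper is a weighted sum of the confidence loss and the localization loss. The notation below follows the paper rather than these notes:

```latex
L(x, c, l, g) = \frac{1}{N} \left( L_{conf}(x, c) + \alpha \, L_{loc}(x, l, g) \right)
```

Here N is the number of matched default boxes (the loss is set to 0 when N = 0), L_loc is a Smooth L1 loss between the predicted box l and the ground truth box g, L_conf is a softmax loss over the class confidences c, and the weight α is set to 1 in the paper.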
3. Choosing scales and aspect ratios for default boxes
Reducing the size of the feature maps in deeper layers -> reduces computation and memory cost, and also provides some degree of translation and scale invariance.
Utilizing feature maps from several different layers in a single network for prediction -> handles different object scales while sharing parameters across all object scales. Lower layers capture finer details of the input; the top feature maps are smoother.
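The scale of the default boxes for each feature map is computed as:
$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$
where m is the number of feature maps used for prediction, $s_{min} = 0.2$, and $s_{max} = 0.9$.
We compute the width $w_k^a = s_k\sqrt{a_r}$ and height $h_k^a = s_k/\sqrt{a_r}$ of each default box for aspect ratios $a_r \in \{1, 2, 3, 1/2, 1/3\}$. For the aspect ratio of 1, we also add a default box whose scale is $s'_k = \sqrt{s_k s_{k+1}}$.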
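A small sketch of how these shapes could be enumerated, assuming the paper's linear spacing of scales between s_min = 0.2 and s_max = 0.9, widths s_k·sqrt(a_r), heights s_k/sqrt(a_r), and one extra square box of scale sqrt(s_k·s_{k+1}). The function name is hypothetical, and the value used for s_{m+1} on the last feature map (the same linear formula extended one step) is an assumption made here, since the notes do not define it.

```python
import math


def default_box_shapes(m, s_min=0.2, s_max=0.9,
                       aspect_ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """Return, for each of m feature maps, a list of (width, height) pairs
    expressed as fractions of the input image size. Assumes m >= 2."""

    def scale(k):
        # Linear spacing of scales; k = m + 1 extends it (assumption).
        return s_min + (s_max - s_min) * (k - 1) / (m - 1)

    shapes = []
    for k in range(1, m + 1):
        s_k = scale(k)
        # One box per aspect ratio; every box keeps area s_k ** 2.
        boxes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
        # Extra square box for aspect ratio 1 at an intermediate scale.
        s_extra = math.sqrt(s_k * scale(k + 1))
        boxes.append((s_extra, s_extra))
        shapes.append(boxes)
    return shapes


for k, boxes in enumerate(default_box_shapes(6), start=1):
    print(k, [(round(w, 3), round(h, 3)) for w, h in boxes])
```

This yields 6 default box shapes per location on every feature map; the box centers (one set per feature map cell) are left out to keep the sketch short.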
4. Hard negative mining
After the matching step, most of the default boxes are negatives. We sort the negative examples by the highest confidence loss for each default box and pick the top ones so that the ratio between negatives and positives is at most 3:1. (Open question: why 3:1 rather than 6:1?)
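The selection step can be sketched as below, assuming a per-box confidence loss and a positive/negative flag are already available from the matching step. The function and argument names are hypothetical.

```python
def hard_negatives(conf_loss, is_positive, ratio=3):
    """Return the indices of the negative default boxes kept for training.

    conf_loss:   per-default-box confidence loss
    is_positive: per-default-box flag, True if matched to a ground truth
    ratio:       maximum number of negatives kept per positive
    """
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: those with the highest confidence loss.
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    return negatives[: ratio * num_pos]


conf_loss = [0.1, 2.0, 0.5, 0.05, 1.2, 0.9, 0.3, 0.7]
is_positive = [True, False, False, False, False, False, False, False]
print(hard_negatives(conf_loss, is_positive))  # -> [1, 4, 5]
```

With one positive and a 3:1 cap, only the three negatives with the largest loss contribute to the confidence loss; the rest are ignored for that training step.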
5. Data augmentation
Each training image is randomly sampled by one of the following options:
- Use the entire original input image.
- Sample a patch so that the minimum jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.
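One way to sketch the three sampling options, under several assumptions that are not in these notes: boxes are in normalized `(xmin, ymin, xmax, ymax)` coordinates, each option is chosen with equal probability, the patch size is between 0.1 and 1 of the image with aspect ratio between 1/2 and 2 (the ranges given in the SSD paper), and "minimum jaccard overlap" is read literally as "every object overlaps the patch by at least the chosen value". Real implementations may interpret the constraint differently.

```python
import random

MIN_OVERLAPS = (0.1, 0.3, 0.5, 0.7, 0.9)


def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sample_patch(boxes, rng, max_tries=50):
    """Return a crop (xmin, ymin, xmax, ymax) in normalized image coordinates."""
    full = (0.0, 0.0, 1.0, 1.0)
    mode = rng.choice(("original", "overlap", "random"))
    if mode == "original":
        return full
    # Only the "overlap" option constrains the patch by jaccard overlap.
    min_iou = rng.choice(MIN_OVERLAPS) if mode == "overlap" else None
    for _ in range(max_tries):
        w = rng.uniform(0.1, 1.0)
        h = rng.uniform(0.1, 1.0)
        if not 0.5 <= w / h <= 2.0:
            continue  # reject patches outside the aspect ratio range
        x = rng.uniform(0.0, 1.0 - w)
        y = rng.uniform(0.0, 1.0 - h)
        patch = (x, y, x + w, y + h)
        if min_iou is None or all(jaccard(patch, b) >= min_iou for b in boxes):
            return patch
    return full  # assumption: fall back to the whole image if sampling fails


rng = random.Random(0)
print(sample_patch([(0.3, 0.3, 0.7, 0.7)], rng))
```

Cropping the image, clipping the ground truth boxes to the patch, and resizing to the network input size are left out of this sketch.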

3 Experimental Results

SSD is very sensitive to the bounding box size: it performs much worse on smaller objects than on larger ones.

4 Related Work

  1. SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging the RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate into other tasks.
  2. If we only use one default box per location from the topmost feature map, SSD would have an architecture similar to OverFeat.
  3. If we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO.

Conclusions
