Tensorflow object detection API source code reading notes: RPN

Update:
It is recommended to first read "Learning Faster R-CNN from an implementation perspective" (the Zhihu article referred to below), which is more intuitive. The notes here may look messy because the source code is heavily abstracted.

  • These two pieces in faster_rcnn_meta_arch.py correspond to the 3x3 and 1x1 convolutions that make up the RPN in the Zhihu article:
    rpn_box_predictor_features = slim.conv2d(rpn_features_to_crop
    self._first_stage_box_predictor=box_predictor.ConvolutionalBoxPredictor

  • The AnchorTargetCreator in the Zhihu article uses IoU to pick 256 anchors out of the 20000+ candidates for classification and box regression (i.e. for computing the RPN loss). This corresponds to:
    target_assigner.batch_assign_targets;
    self._first_stage_sampler=sampler.BalancedPositiveNegativeSampler, applied with first_stage_minibatch_size;
    the 20000 is determined by the size of the feature map fed into the RPN and the number of anchor types, while 256 corresponds to first_stage_minibatch_size (see protos/faster_rcnn.proto);
    in short, all of this happens in def _loss_rpn (see the sampling sketch after this list).

  • (proposal=2000) The ProposalCreator in the Zhihu article: inside the RPN, pick a certain number of anchors (e.g. 12000/6000) from the tens of thousands by score, adjust their sizes and positions, apply NMS, and keep the 2000/300 highest-scoring ones as RoIs. This corresponds to:
    def _postprocess_rpn
    first_stage_max_proposals=300

  • The ProposalTargetCreator in the Zhihu article selects a subset (e.g. 128) of the 2000/300 candidates to pool for training Fast R-CNN. This corresponds to:
    not using hard_example_miner
    _unpad_proposals_and_sample_box_classifier_batch
    second_stage_batch_size=64
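Both the 256-anchor RPN minibatch and the 64-proposal second-stage batch above are drawn by sampler.BalancedPositiveNegativeSampler. A minimal numpy sketch of the idea (the real class works on TF indicator tensors; the helper name sample_balanced is made up here for illustration):

    import numpy as np

    def sample_balanced(labels, batch_size=256, positive_fraction=0.5, seed=0):
        """Pick at most batch_size indices, aiming for the given positive fraction.

        labels: 1-D array, 1 for positive anchors/proposals, 0 for negative ones.
        """
        rng = np.random.default_rng(seed)
        pos = np.flatnonzero(labels == 1)
        neg = np.flatnonzero(labels == 0)
        num_pos = min(len(pos), int(batch_size * positive_fraction))
        num_neg = min(len(neg), batch_size - num_pos)  # fill the rest with negatives
        return np.concatenate([rng.choice(pos, num_pos, replace=False),
                               rng.choice(neg, num_neg, replace=False)])

    # ~20000 anchors, very few positives -> a 256-anchor minibatch, mostly negatives
    labels = (np.random.default_rng(1).random(21546) > 0.999).astype(int)
    idx = sample_balanced(labels, batch_size=256)
    print(len(idx), int(labels[idx].sum()))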

Old:

'''RPN overview. Note that much of the analysis uses terminology straight from the original paper, which differs from the names used in the code.
'''
FasterRCNNFeatureExtractor.extract_proposal_features actually calls
FasterRCNNResnetV1FeatureExtractor._extract_proposal_features
to produce the first stage RPN features that are fed into the RPN.

_extract_rpn_feature_map of class FasterRCNNMetaArch(model.DetectionModel):
calls the feature extractor's _extract_proposal_features above and returns
      rpn_box_predictor_features: A 4-D float32 tensor with shape
        [batch, height, width, depth] to be used for predicting proposal boxes
        and corresponding objectness scores.
        '''the intermediate layer obtained from the sliding window'''
      rpn_features_to_crop: A 4-D float32 tensor with shape
        [batch, height, width, depth] representing image features to crop using
        the proposal boxes.
        '''this is simply the feature map produced by the feature extractor above'''
      anchors: A BoxList representing anchors (for the RPN) in absolute coordinates.
        '''grid_anchor_generator.GridAnchorGenerator is used here to generate
        9 anchor boxes (3 different scales and 3 aspect ratios) per location;
        see the detailed analysis below.'''

    anchors = self._first_stage_anchor_generator.generate(
        [(feature_map_shape[1], feature_map_shape[2])])

'''The sliding window is applied to the conv feature map to obtain the intermediate layer.
first_stage_box_predictor_kernel_size: Kernel size to use for the convolution op
just prior to RPN box predictions.
'''
    with slim.arg_scope(self._first_stage_box_predictor_arg_scope):
      kernel_size = self._first_stage_box_predictor_kernel_size
      rpn_box_predictor_features = slim.conv2d(
          rpn_features_to_crop,
          self._first_stage_box_predictor_depth,
          kernel_size=[kernel_size, kernel_size],
          rate=self._first_stage_atrous_rate,
          activation_fn=tf.nn.relu6)

'''Following the paper, the intermediate layer should then go into the cls and reg layers.
'''
def _predict_rpn_proposals(self, rpn_box_predictor_features):
    goes into self._first_stage_box_predictor.predict
    self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
'''Box predictors are classes that take a high level image feature map as input
and produce two predictions, (1) a tensor encoding box locations, and
(2) a tensor encoding classes for each box.
class ConvolutionalBoxPredictor(BoxPredictor) is examined in detail below.
'''
'''One step further is the loss. This predict function can return a single
prediction_dict covering both stages, which then goes into loss.
'''
def predict(self, preprocessed_inputs)
def loss(self, prediction_dict, scope=None):
'''loss calls the first-stage loss computation; details below.
'''
def _loss_rpn
'''Anchor generation
object_detection/anchor_generators/grid_anchor_generator.py
'''
def _generate #called through the generate function of the parent class core/anchor_generator.py
    grid_height, grid_width = feature_map_shape_list[0]
    # Multidimensional analog of numpy.meshgrid
    scales_grid, aspect_ratios_grid = ops.meshgrid(self._scales,
                                                   self._aspect_ratios)
    scales_grid = tf.reshape(scales_grid, [-1])
    aspect_ratios_grid = tf.reshape(aspect_ratios_grid, [-1])
    return tile_anchors(grid_height,
                        grid_width,
                        scales_grid,
                        aspect_ratios_grid,
                        self._base_anchor_size,
                        self._anchor_stride,
                        self._anchor_offset)
'''Verify with a hand calculation against the test script.
    base_anchor_size = [10, 10]#default=[256, 256]
    anchor_stride = [19, 19]#default=[16, 16]
    anchor_offset = [0, 0]
    scales = [0.5, 1.0, 2.0]
    aspect_ratios = [1.0]

    exp_anchor_corners = [[-2.5, -2.5, 2.5, 2.5], [-5., -5., 5., 5.],
                          [-10., -10., 10., 10.], [-2.5, 16.5, 2.5, 21.5],
                          [-5., 14., 5, 24], [-10., 9., 10, 29],
                          [16.5, -2.5, 21.5, 2.5], [14., -5., 24, 5],
                          [9., -10., 29, 10], [16.5, 16.5, 21.5, 21.5],
                          [14., 14., 24, 24], [9., 9., 29, 29]]
    feature_map_shape_list=[(2, 2)] #asks for anchors that correspond
        to a 2x2 layer
grid_height, grid_width = 2,2
scales_grid, aspect_ratios_grid omitted; three combinations, so the whole feature map yields 2*2*3 = 12 anchors.
The height and width of an anchor are determined by scales, aspect_ratio and base_anchor_size, which is straightforward.
The center of an anchor is determined by range(grid), anchor_stride and anchor_offset, where grid means grid_height and grid_width; e.g. the first center is obviously at 0 and the second at 19. Once you see that the grid is simply the set of feature-map cells at which anchors are generated, it becomes easy. Question: how should parameters such as base_anchor_size and anchor_stride be configured?
How do the anchor centers coincide with the sliding-window centers? The answer is that anchors are generated at every cell of the input feature map:
    feature_map_shape = tf.shape(rpn_features_to_crop)
    anchors = self._first_stage_anchor_generator.generate(
        [(feature_map_shape[1], feature_map_shape[2])])
From this, the anchor stride should be 1*16 = 16 (each feature-map cell has a 16*16 receptive field in the original image), which matches expectations. The basic idea: mapped back to the original image, each cell's receptive field is large and has a single fixed shape, so k anchor boxes of different sizes and shapes are introduced at each cell to better fit objects in the original image. This idea is further refined and extended in papers such as YOLOv2 and SSD. A numpy sketch reproducing this hand calculation appears just below.
'''      
def tile_anchors
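The hand calculation above is essentially what tile_anchors computes. It can be reproduced in a few lines of numpy (a sketch for the test parameters with aspect_ratio = 1.0, not the actual implementation):

    import numpy as np

    base_anchor_size = np.array([10., 10.])
    anchor_stride = np.array([19., 19.])
    anchor_offset = np.array([0., 0.])
    scales = np.array([0.5, 1.0, 2.0])
    aspect_ratio = 1.0
    grid_height, grid_width = 2, 2

    # With aspect_ratio == 1.0, heights == widths == scale * base_anchor_size.
    heights = scales / np.sqrt(aspect_ratio) * base_anchor_size[0]
    widths = scales * np.sqrt(aspect_ratio) * base_anchor_size[1]

    # Anchor centers are laid out on the feature-map grid.
    y_centers = np.arange(grid_height) * anchor_stride[0] + anchor_offset[0]
    x_centers = np.arange(grid_width) * anchor_stride[1] + anchor_offset[1]

    corners = [[yc - h / 2., xc - w / 2., yc + h / 2., xc + w / 2.]
               for yc in y_centers for xc in x_centers
               for h, w in zip(heights, widths)]
    print(np.array(corners))  # 2*2*3 = 12 anchors, matching exp_anchor_corners above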
"""生成locations and classes
object_detection/core/box_predictor.py. 
作用於sliding window得到的intermediate layer,輸出是每個anchor的tensor encoding box locations和tensor encoding classes for each box.
可能會引入額外的卷積層。
另外位置學習也沒有什麼特別的,就是卷積:
        box_encodings = slim.conv2d(
            net, num_predictions_per_location * self._box_code_size,
            [self._kernel_size, self._kernel_size],
            scope='BoxEncodingPredictor')
num_predictions_per_location is the number of anchors. Learning the classes is similar:
        class_predictions_with_background = slim.conv2d(
            net, num_predictions_per_location * num_class_slots,
            [self._kernel_size, self._kernel_size], scope='ClassPredictor',
            biases_initializer=tf.constant_initializer(
                self._class_prediction_bias_init))
Question: what exactly is the box location learned here? It is the 'rpn_box_encodings' obtained when the predict function calls _predict_rpn_proposals.
    self._first_stage_box_predictor = box_predictor.ConvolutionalBoxPredictor
It is later used to compute the loss. What is its relation to the anchor's own location? It is actually the coordinate offsets between the predicted box and the anchor box in the paper.
"""
class ConvolutionalBoxPredictor(BoxPredictor)
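For shape bookkeeping, here is a rough numpy sketch of how a conv output of shape [batch, H, W, num_anchors_per_location * box_code_size] is flattened into per-anchor box encodings (the actual ConvolutionalBoxPredictor does the equivalent with TF ops; the numbers below are illustrative):

    import numpy as np

    batch, height, width = 1, 38, 63       # e.g. a ~600x1000 image downsampled by 16
    num_anchors_per_location = 9           # 3 scales x 3 aspect ratios
    box_code_size = 4                      # (ty, tx, th, tw) per anchor

    # Output of the BoxEncodingPredictor conv: one encoding per anchor per cell.
    box_encodings = np.zeros(
        (batch, height, width, num_anchors_per_location * box_code_size))

    # Flatten to [batch, total_num_anchors, box_code_size] before loss/decoding.
    flat = box_encodings.reshape(
        batch, height * width * num_anchors_per_location, box_code_size)
    print(flat.shape)  # (1, 21546, 4) -- the "20000+ anchors" mentioned earlier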
"""cls, reg loss
object_detection/meta_architectures/faster_rcnn_meta_arch.py
這裡直接拿rpn_box_encodings計算loss了。推測batch_reg_targets是anchor box與ground truth的差值。查target_assigner.batch_assign_targets程式碼:
  def assign(self, anchors, groundtruth_boxes, groundtruth_labels=None,
             **params):
      reg_targets = self._create_regression_targets(anchors,
                                                    groundtruth_boxes,
                                                    match)
_create_regression_targets
        matched_reg_targets = self._box_coder.encode(matched_gt_boxes,
                                                 matched_anchors)
box_coders/faster_rcnn_box_coder.py
    tx = (xcenter - xcenter_a) / wa
    ty = (ycenter - ycenter_a) / ha
    tw = tf.log(w / wa)
    th = tf.log(h / ha) 
Following the trail we find it; consistent with the paper.
"""
def _loss_rpn
      (batch_cls_targets, batch_cls_weights, batch_reg_targets,
       batch_reg_weights, _) = target_assigner.batch_assign_targets(
           self._proposal_target_assigner, box_list.BoxList(anchors),
           groundtruth_boxlists, len(groundtruth_boxlists)*[None])
      batch_cls_targets = tf.squeeze(batch_cls_targets, axis=2)

      localization_losses = self._first_stage_localization_loss(
          rpn_box_encodings, batch_reg_targets, weights=sampled_reg_indices)

Next, let's see how the loss formula is actually implemented.

"""The ground-truth label is 1 if the anchor is positive, and is 0 if the anchor is negative. 
An anchor is labeled as positive if:
(a) the anchor is the one with highest IoU overlap with a ground-truth box
(b) the anchor has an IoU overlap with a ground-truth box higher than 0.7
Negative labels are assigned to anchors with IoU lower than 0.3 for all ground-truth
boxes.
50%/50% ratio of positive/negative anchors in a minibatch.
"""
Based on the earlier analysis, the corresponding code should be
      (batch_cls_targets, batch_cls_weights, batch_reg_targets,
       batch_reg_weights, _) = target_assigner.batch_assign_targets(
           self._proposal_target_assigner, box_list.BoxList(anchors),
           groundtruth_boxlists, len(groundtruth_boxlists)*[None])
The target_assigner object called here is constructed as follows:
    self._proposal_target_assigner = target_assigner.create_target_assigner(
        'FasterRCNN', 'proposal')
Entering the create_target_assigner function in core/target_assigner.py:
  elif reference == 'FasterRCNN' and stage == 'proposal':
    similarity_calc = sim_calc.IouSimilarity()
    matcher = argmax_matcher.ArgMaxMatcher(matched_threshold=0.7,
                                           unmatched_threshold=0.3,
                                           force_match_for_each_row=True)
    box_coder = faster_rcnn_box_coder.FasterRcnnBoxCoder(
        scale_factors=[10.0, 10.0, 5.0, 5.0])
The concrete implementation lives in:
from object_detection.matchers import argmax_matcher
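To get a rough feel for what ArgMaxMatcher does with matched_threshold=0.7, unmatched_threshold=0.3 and force_match_for_each_row=True, here is a simplified numpy sketch; the real class operates on a TF similarity matrix and returns a Match object rather than integer labels:

    import numpy as np

    def label_anchors(iou, matched_threshold=0.7, unmatched_threshold=0.3):
        """iou: [num_gt, num_anchors] IoU matrix.

        Returns per-anchor labels: 1 = positive, 0 = negative, -1 = ignored.
        """
        max_iou = iou.max(axis=0)                   # best ground-truth overlap per anchor
        labels = np.full(iou.shape[1], -1)
        labels[max_iou < unmatched_threshold] = 0   # IoU < 0.3 for all gt boxes -> negative
        labels[max_iou >= matched_threshold] = 1    # IoU >= 0.7 with some gt box -> positive
        # force_match_for_each_row: the best anchor for each gt box is always positive,
        # even if its IoU is below matched_threshold.
        labels[iou.argmax(axis=1)] = 1
        return labels

    iou = np.array([[0.80, 0.45, 0.40, 0.05],
                    [0.20, 0.60, 0.25, 0.00]])
    print(label_anchors(iou))  # [1, 1, -1, 0]: anchor 1 is forced positive, anchor 2 ignored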