
A Detailed Code Walkthrough of TensorFlow-Based Object Detection (Faster RCNN)

This post walks through a Windows demo of Faster RCNN based on TensorFlow, in two parts: loading the training data and building the network. The Git address of the source is at the end of the article; download it and follow along.

1. Runtime Environment

First, a pile of pitfalls: if any single condition below is not met, the program will not run:

Windows 10 Home:

Python 3.5 + Windows + Visual Studio 2015 + CUDA 9.1

I stepped into several pits here myself; may those who use this version of the demo after me avoid them:

① Python 3.6 cannot build this program, because the author built it against 3.5.

② If your machine runs Windows 10 Home, do not install Python 3.5 through Anaconda; install Python 3.5 directly. Windows 10 Home cannot set up an Anaconda3 + Python 3.5 environment, since Anaconda3 defaults to Python 3.6 or 2.7.

③ No Visual Studio version other than 2015 provides the C++ toolchain required to compile the Python extensions. (Don't ask me why; I don't know either.)

Windows 10 Enterprise:

Anaconda3 + Python 3.5 + CUDA 9.1

① Matching Anaconda/Python versions can be downloaded from the Tsinghua Python mirror.

② If you use Anaconda with Python 3.5, you do not need Visual Studio at all. Conversely, if you install Python directly rather than through Anaconda, you must install Visual Studio 2015.

That's it for the pitfalls. With these in place, build the program per the README and it should run.

My IDE is JetBrains PyCharm.


2. Loading the Training Data

Let's start with the data-loading part:

Since object detection is both a regression and a classification task, the loaded data must include each object's location as well as its class. In this program, that information lives under:

...\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\Annotations

The image annotations are read from XML files.

Images and their XML annotation files correspond one-to-one; the training images live under:

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\JPEGImages

Now back to the code, train.py:

You can plainly see that the main function of the train file contains just two statements:

train = Train()
train.train()

The first statement covers loading the training dataset for the network; the second performs the actual training. Let's start with the first:

First, step into Train(), and then into the initialization in vgg16.py, which lands concretely in network.py:

self._feat_stride = [16, ]
self._feat_compress = [1. / 16., ]
self._batch_size = batch_size
self._predictions = {}
self._losses = {}
self._anchor_targets = {}
self._proposal_targets = {}
self._layers = {}
self._act_summaries = []
self._score_summaries = {}
self._train_summaries = []
self._event_summaries = {}
self._variables_to_fix = {}

It starts by assigning a few parameters, e.g. feat_stride, which relates the anchors discussed later to the corresponding regions of the original image.

Back in train.py, the next line is:

self.imdb, self.roidb = combined_roidb("voc_2007_trainval")

This line reads all the training image information into the variable roidb. Step into combined_roidb():

def get_roidb(imdb_name):
imdb = get_imdb(imdb_name)
print('Loaded dataset `{:s}` for training'.format(imdb.name))
imdb.set_proposal_method("gt")
print('Set proposal method: {:s}'.format("gt"))
roidb = get_training_roidb(imdb)
return roidb

The code above loads the roidb by dataset name and finally returns the roidb variable.

Note this line in the code:

roidbs = [get_roidb(s) for s in imdb_names.split('+')]

It means the data may come from several sources; if it really does, the dataset names are joined with plus signs, and split() separates them again when they are needed.

In fact, the program uses only one dataset, so the next line is:

roidb = roidbs[0]

Since the program is known to use exactly one dataset, taking the element at index 0 is enough; if you change this later, adapt it to your situation.
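If you ever do feed several datasets in with '+', the per-set roidbs need to be concatenated into one list. A minimal sketch, assuming each roidb is a plain Python list of per-image dicts (this helper is illustrative, not code from the repo):

import itertools

def merge_roidbs(roidbs):
    # Concatenate the per-dataset lists into one training roidb.
    return list(itertools.chain.from_iterable(roidbs))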

So how is the dataset actually handled? Step into get_imdb():

One more jump lands in factory.py:

# Set up voc_<year>_<split>
for year in ['2007', '2012']:
  for split in ['train', 'val', 'trainval', 'test']:
    name = 'voc_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: pascal_voc(split, year))

# Set up coco_2014_<split>
for year in ['2014']:
  for split in ['train', 'val', 'minival', 'valminusminival', 'trainval']:
    name = 'coco_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: coco(split, year))

# Set up coco_2015_<split>
for year in ['2015']:
  for split in ['test', 'test-dev']:
    name = 'coco_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: coco(split, year))

There are three loops, presumably because the COCO and Pascal VOC datasets use different internal formats in different years and therefore need separate handling.
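Each registered name maps to a zero-argument constructor, so a lookup simply calls the stored lambda. For reference, get_imdb() in factory.py is essentially the following (paraphrased from memory, so double-check against the repo):

def get_imdb(name):
    """Get an imdb (image database) by name."""
    if name not in __sets:
        raise KeyError('Unknown dataset: {}'.format(name))
    return __sets[name]()  # invoke the stored constructor lambda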

Starting with the Pascal VOC dataset: step into imdb's __init__; the code below is in imdb.py:

 def __init__(self, name, classes=None):
        self._name = name
        self._num_classes = 0
        if not classes:
            self._classes = []
        else:
            self._classes = classes
        self._image_index = []
        self._obj_proposer = 'gt'
        self._roidb = None
        self._roidb_handler = self.default_roidb
        # Use this dict for storing dataset specific config options
        self.config = {}

imdb.py mainly performs a series of operations on the loaded data:

The initializer records the dataset name, zeroes the class count, and initializes the class index labels. The proposal method is named 'gt'; roidb, the result we are after, starts as None; and a roidb handler is installed for operations we will detail shortly.

Back in pascal_voc.py, continuing after the initialization:

self._year = year
self._image_set = image_set

It first records the dataset year and then which image set's annotations to use. Here we only need Val and Train, i.e. the training data plus the ground truth:

The Pascal VOC split files live in:

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\ImageSets\Main

Open the trainval file and you will see:

000005
000007
000009
000012
000016
000017
000019
000020
000021
000023
000024
000026
000030

The file holds data in this form, about five thousand entries in total, one per example to be used. Each entry in trainval labels the data we are about to train on: the name of an image and, by the same name, its XML annotation.

Next, paths are set and the related information is read in.

self._devkit_path = self._get_default_path() if devkit_path is None \
            else devkit_path
self._data_path = os.path.join(self._devkit_path, 'VOC' + self._year)
After that come the classes we classify into, 21 in total: twenty foreground classes plus one background. Each class string is then assigned a fixed index, which makes the subsequent bookkeeping much easier:

self._class_to_ind = dict(list(zip(self.classes, list(range(self.num_classes)))))
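Concretely, the resulting mapping looks like this (a trimmed, self-contained illustration):

classes = ('__background__', 'aeroplane', 'bicycle')  # truncated for illustration
class_to_ind = dict(zip(classes, range(len(classes))))
print(class_to_ind)  # {'__background__': 0, 'aeroplane': 1, 'bicycle': 2}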

In fact, of Pascal VOC's many files, this program arguably uses only that one trainval txt file. Next our data is loaded: guided by the list in ImageSets, it is read from the _data_path directory line by line via x.strip(), and the result is returned as image_index:

    def _load_image_set_index(self):
        """
        Load the indexes listed in this dataset's image set file.
        """
        # Example path to image set file:
        # self._devkit_path + /VOCdevkit2007/VOC2007/ImageSets/Main/val.txt
        image_set_file = os.path.join(self._data_path, 'ImageSets', 'Main',
                                      self._image_set + '.txt')
        assert os.path.exists(image_set_file), \
            'Path does not exist: {}'.format(image_set_file)
        with open(image_set_file) as f:
            image_index = [x.strip() for x in f.readlines()]
        return image_index

That covers the Pascal VOC loading; the COCO dataset works the same way, so I will not repeat it. Back to train.py:

The call set_proposal_method("gt") declares that the loaded information is our ground truth.

The next line is rather interesting:

roidb = get_training_roidb(imdb)

Step into the method and have a look:

def get_training_roidb(imdb):
    """Returns a roidb (Region of Interest database) for use in training."""
    if True:
        print('Appending horizontally-flipped training examples...')
        imdb.append_flipped_images()
        print('done')

    print('Preparing training data...')
    rdl_roidb.prepare_roidb(imdb)
    print('done')

    return imdb.roidb

Here every image is flipped, i.e. mirrored horizontally. We start with 5,000 images; after flipping, the dataset holds ten thousand.

Let's look closely at the flipping, concretely in imdb.py:

    def append_flipped_images(self):
        num_images = self.num_images
        widths = self._get_widths()
        for i in range(num_images):
            boxes = self.roidb[i]['boxes'].copy()
            oldx1 = boxes[:, 0].copy()
            oldx2 = boxes[:, 2].copy()
            boxes[:, 0] = widths[i] - oldx2 - 1
            boxes[:, 2] = widths[i] - oldx1 - 1
            assert (boxes[:, 2] >= boxes[:, 0]).all()
            entry = {'boxes': boxes,
                     'gt_overlaps': self.roidb[i]['gt_overlaps'],
                     'gt_classes': self.roidb[i]['gt_classes'],
                     'flipped': True}
            self.roidb.append(entry)
        self._image_index = self._image_index * 2
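A quick sanity check of the flip arithmetic, with made-up numbers:

import numpy as np

width = 100
boxes = np.array([[10, 20, 30, 40]])     # one box: x1, y1, x2, y2
flipped = boxes.copy()
flipped[:, 0] = width - boxes[:, 2] - 1  # new x1 = 100 - 30 - 1 = 69
flipped[:, 2] = width - boxes[:, 0] - 1  # new x2 = 100 - 10 - 1 = 89
assert (flipped[:, 2] >= flipped[:, 0]).all()
print(flipped)                           # [[69 20 89 40]] -- y is untouched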

With that, the data is essentially loaded. A few other bits of processing deserve mention, for example this in pascal_voc.py:

    def gt_roidb(self):
        """
        Return the database of ground-truth regions of interest.

        This function loads/saves from/to a cache file to speed up future calls.
        """
        cache_file = os.path.join(self.cache_path, self.name + '_gt_roidb.pkl')
        if os.path.exists(cache_file):
            with open(cache_file, 'rb') as fid:
                try:
                    roidb = pickle.load(fid)
                except:
                    roidb = pickle.load(fid, encoding='bytes')
            print('{} gt roidb loaded from {}'.format(self.name, cache_file))
            return roidb

The purpose of this function is to cache the loaded data in a pickle file: on later runs, if the data has already been loaded, it is read straight from the pickle; otherwise the loading proceeds from scratch.
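The excerpt above only shows the cache hit. In the repo, the function goes on to build the roidb from the XML annotations and write the cache; reconstructed roughly (treat as a sketch):

        gt_roidb = [self._load_pascal_annotation(index)
                    for index in self.image_index]
        with open(cache_file, 'wb') as fid:
            pickle.dump(gt_roidb, fid, pickle.HIGHEST_PROTOCOL)
        print('wrote gt roidb to {}'.format(cache_file))
        return gt_roidb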

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\cache

That is the cache directory; feel free to delete it and see what happens.

In the code, the cache directory and file name are fixed; if the cache file exists, the previously built data is loaded from it, otherwise everything is built from scratch. So by now we know which dataset is selected, which data gets loaded, where the fixed data sits, and in what form it arrives. One important question remains: how exactly are the labels read in?

The XML files are parsed with an XML parser:
    def _load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.
        """
        filename = os.path.join(self._data_path, 'Annotations', index + '.xml')
        tree = ET.parse(filename)
        objs = tree.findall('object')
        if not self.config['use_diff']:
            # Exclude the samples labeled as difficult
            non_diff_objs = [
                obj for obj in objs if int(obj.find('difficult').text) == 0]
            # if len(non_diff_objs) != len(objs):
            #     print 'Removed {} difficult objects'.format(
            #         len(objs) - len(non_diff_objs))
            objs = non_diff_objs
        num_objs = len(objs)

        boxes = np.zeros((num_objs, 4), dtype=np.uint16)
        gt_classes = np.zeros((num_objs), dtype=np.int32)
        overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)
        # "Seg" area for pascal is just the box area
        seg_areas = np.zeros((num_objs), dtype=np.float32)

        # Load object bounding boxes into a data frame.
        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            x1 = float(bbox.find('xmin').text) - 1
            y1 = float(bbox.find('ymin').text) - 1
            x2 = float(bbox.find('xmax').text) - 1
            y2 = float(bbox.find('ymax').text) - 1
            cls = self._class_to_ind[obj.find('name').text.lower().strip()]
            boxes[ix, :] = [x1, y1, x2, y2]
            gt_classes[ix] = cls
            overlaps[ix, cls] = 1.0
            seg_areas[ix] = (x2 - x1 + 1) * (y2 - y1 + 1)

        overlaps = scipy.sparse.csr_matrix(overlaps)

        return {'boxes': boxes,
                'gt_classes': gt_classes,
                'gt_overlaps': overlaps,
                'flipped': False,
                'seg_areas': seg_areas}
In boxes = np.zeros((num_objs, 4), dtype=np.uint16), boxes holds one regression box per object: two corner points, i.e. four numbers, so n objects give an n×4 array.

gt_classes = np.zeros((num_objs), dtype=np.int32) holds one class index per loaded object.

overlaps is a one-hot encoding of each object's class. seg_areas computes the box areas and is not used yet.

Then comes the loop, which iterates over the n objects in a single image.
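As a tiny illustration of those buffers (sizes and class indexes are made up):

import numpy as np

num_objs, num_classes = 2, 21
boxes = np.zeros((num_objs, 4), dtype=np.uint16)  # one (x1, y1, x2, y2) per object
overlaps = np.zeros((num_objs, num_classes), dtype=np.float32)
overlaps[0, 12] = 1.0   # object 0 belongs to class 12: a one-hot row
overlaps[1, 15] = 1.0   # object 1 belongs to class 15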

Now the flips are done, the data volume is doubled, the relevant data is specified, and everything is extracted.

One more line follows:

 rdl_roidb.prepare_roidb(imdb)

Jump once more, into the prepare_roidb function in roidb.py:

def prepare_roidb(imdb):
  """Enrich the imdb's roidb by adding some derived quantities that
  are useful for training. This function precomputes the maximum
  overlap, taken over ground-truth boxes, between each ROI and
  each ground-truth box. The class with maximum overlap is also
  recorded.
  """
  roidb = imdb.roidb
  if not (imdb.name.startswith('coco')):
    sizes = [PIL.Image.open(imdb.image_path_at(i)).size
         for i in range(imdb.num_images)]
  for i in range(len(imdb.image_index)):
    roidb[i]['image'] = imdb.image_path_at(i)
    if not (imdb.name.startswith('coco')):
      roidb[i]['width'] = sizes[i][0]
      roidb[i]['height'] = sizes[i][1]
    # need gt_overlaps as a dense array for argmax
    gt_overlaps = roidb[i]['gt_overlaps'].toarray()
    # max overlap with gt over classes (columns)
    max_overlaps = gt_overlaps.max(axis=1)
    # gt class that had the max overlap
    max_classes = gt_overlaps.argmax(axis=1)
    roidb[i]['max_classes'] = max_classes
    roidb[i]['max_overlaps'] = max_overlaps
    # sanity checks
    # max overlap of 0 => class should be zero (background)
    zero_inds = np.where(max_overlaps == 0)[0]
    assert all(max_classes[zero_inds] == 0)
    # max overlap > 0 => class should not be zero (must be a fg class)
    nonzero_inds = np.where(max_overlaps > 0)[0]
    assert all(max_classes[nonzero_inds] != 0)

What work is done here?

It gathers all the data onto roidb and returns it, filling in the image path, width, height, the overlaps, the class with the maximum overlap, and so on.

self.data_layer = RoIDataLayer(self.roidb, self.imdb.num_classes)
self.output_dir = cfg.get_output_dir(self.imdb, 'default')

Finally, output_dir sets the default directory for training outputs such as pickled snapshots.

The data layer receives the processed roidb data plus the class count, and performs a shuffle of the examples.
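A minimal sketch of that shuffle, assuming the layer keeps a permutation plus a cursor (names are illustrative, not the repo's exact fields):

import numpy as np

roidb = [{'image': 'a.jpg'}, {'image': 'b.jpg'}, {'image': 'c.jpg'}]  # stand-in
perm = np.random.permutation(np.arange(len(roidb)))  # epoch-level shuffle
cur = 0  # cursor advanced as minibatches are drawn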

3. Building the Network Architecture

Right, let's first recap how the Faster RCNN network is put together:

[Figure 1: the overall Faster RCNN architecture diagram, not reproduced here]

① A set of conv layers is built, i.e. a fully convolutional network; in this TensorFlow code it is a VGG16.

② The feature map produced by the stack of convolution and pooling operations in ① is fed into the RPN (Region Proposal Network) layer.

③ A 3×3 sliding window moves (left to right) across the feature map from ②. Taking its center point as an anchor mapped back to the original image, three scales and three aspect ratios are combined, so each anchor position yields 9 candidate regions; over all anchor positions, let there be k regions in total.

④ The k regions from ③ each undergo two operations in parallel: classification and regression. The binary classification separates foreground from background, yielding 2k scores; regions classified as background need no further processing, while foreground regions are regressed, yielding 4k coordinates, where each region's four values are the center point (x, y) and the region's height and width (h, w).

⑤ After the classification and regression, the boxes are filtered; this is the proposal layer's main job. The filtering proceeds as follows: first, by IoU > 0.7, i.e. a box is compared against the image's ground truth and kept only if the overlap exceeds 0.7; second, NMS (non-maximum suppression) keeps the top n boxes ranked by the binary-classification score (the foreground probability); third, out-of-bounds boxes are filtered; fourth, after all the above, the top m boxes by score are kept once more.

⑥ The resulting boxes go through RoI pooling, then a fully connected network with one classification task and one regression task; the classification is 21-way, i.e. twenty foreground classes plus one background. That completes the pipeline.

 

Good; that ends the recap.

Now into the code proper:

① Most of the network-construction code lives in vgg16.py. In the main function, the first Train() covers the data loading and the second train() covers the training process. Let's walk through the training first. The core is line 85:

layers = self.net.create_architecture(sess,"TRAIN", self.imdb.num_classes, tag='default')

create_architecture() builds the entire network structure. Step into it.

② It first sets up a series of convolution and deconvolution parameters; the core is line 295:

rois, cls_prob, bbox_pred = self.build_network(sess,training)

rois are the boxes produced for the RoI pooling layer; cls_prob is the final fully connected layer's classification score; bbox_pred is the box regression output for the 21 classes. Step into build_network():

③ This lands at the same-named function at line 18 of vgg16.py. Let's study it carefully:
    def build_network(self, sess, is_training=True):
        with tf.variable_scope('vgg_16', 'vgg_16'):

            # select initializer
            if cfg.FLAGS.initializer == "truncated":
                initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.01)
                initializer_bbox = tf.truncated_normal_initializer(mean=0.0, stddev=0.001)
            else:
                initializer = tf.random_normal_initializer(mean=0.0, stddev=0.01)
                initializer_bbox = tf.random_normal_initializer(mean=0.0, stddev=0.001)

            # Build head
            net = self.build_head(is_training)

            # Build rpn
            rpn_cls_prob, rpn_bbox_pred, rpn_cls_score, rpn_cls_score_reshape = self.build_rpn(net, is_training, initializer)

            # Build proposals
            rois = self.build_proposals(is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score)

            # Build predictions
            cls_score, cls_prob, bbox_pred = self.build_predictions(net, rois, is_training, initializer, initializer_bbox)

            self._predictions["rpn_cls_score"] = rpn_cls_score
            self._predictions["rpn_cls_score_reshape"] = rpn_cls_score_reshape
            self._predictions["rpn_cls_prob"] = rpn_cls_prob
            self._predictions["rpn_bbox_pred"] = rpn_bbox_pred
            self._predictions["cls_score"] = cls_score
            self._predictions["cls_prob"] = cls_prob
            self._predictions["bbox_pred"] = bbox_pred
            self._predictions["rois"] = rois

            self._score_summaries.update(self._predictions)

            return rois, cls_prob, bbox_pred

④ The function splits into build_head, build_rpn, build_proposals, and build_predictions, which correspond exactly to the fully convolutional layers, the RPN layer, the proposal layer, and the final fully connected layers we just described. With the skeleton in place, let's analyze these functions one by one:

⑤ Building the fully convolutional head (build_head). In this demo it consists of five layers; each stacks a few convolutions followed by one pooling operation, except the last layer, which has only convolutions and no pooling.

 # Main network
        # Layer  1
        net = slim.repeat(self._image, 2, slim.conv2d, 64, [3, 3], trainable=False, scope='conv1')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool1')

        # Layer 2
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], trainable=False, scope='conv2')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool2')

        # Layer 3
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], trainable=is_training, scope='conv3')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool3')

        # Layer 4
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv4')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool4')

        # Layer 5
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv5')

As the code shows, the author uses slim.conv2d for the convolutions (the traditional choice would be conv2d from the nn module) and max_pool2d for pooling. Pooling uses 2×2 windows; since the convolutions do not shrink the image, each pooling layer halves it, so the four pooling layers reduce the image to 1/16 of its original size.
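A quick check of that arithmetic (the input size is just an example):

import math

# 'SAME' convolutions keep H and W; each of the four 2x2/stride-2 pools halves them.
h, w = 600, 800
print(math.ceil(h / 16), math.ceil(w / 16))  # 38 50 -> the conv5 feature-map size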

⑥ Building the RPN (build_rpn). _anchor_component() generates the nine base boxes. Stepping in, height and width are the feature-map dimensions, computed from the image size divided by feat_stride, and tf.py_func() invokes the anchor generation: inside generate_anchors_pre, generate_anchors() produces the nine base anchors, and a grid of shifts places them at every feature-map position. Once the positions exist they must map back to the original image, so feat_stride is the scale factor between the original image and this feature map; here it is 16.

The related code in network.py (reached by jumping from vgg16.py):

    def _anchor_component(self):
        with tf.variable_scope('ANCHOR_' + 'default'):
            # just to get the shape right
            height = tf.to_int32(tf.ceil(self._im_info[0, 0] / np.float32(self._feat_stride[0])))
            width = tf.to_int32(tf.ceil(self._im_info[0, 1] / np.float32(self._feat_stride[0])))
            anchors, anchor_length = tf.py_func(generate_anchors_pre,
                                                [height, width,
                                                 self._feat_stride, self._anchor_scales, self._anchor_ratios],
                                                [tf.float32, tf.int32], name="generate_anchors")
            anchors.set_shape([None, 4])
            anchor_length.set_shape([])
            self._anchors = anchors
            self._anchor_length = anchor_length

The related code in snippets.py:

def generate_anchors_pre(height, width, feat_stride, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2)):
    """ A wrapper function to generate anchors given different scales
      Also return the number of anchors in variable 'length'
    """
    anchors = generate_anchors(ratios=np.array(anchor_ratios), scales=np.array(anchor_scales))
    A = anchors.shape[0]
    shift_x = np.arange(0, width) * feat_stride
    shift_y = np.arange(0, height) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())).transpose()
    K = shifts.shape[0]
    # width changes faster, so here it is H, W, C
    anchors = anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    anchors = anchors.reshape((K * A, 4)).astype(np.float32, copy=False)
    length = np.int32(anchors.shape[0])

    return anchors, length
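A toy run of the shift grid above, assuming feat_stride = 16 and a 2×3 feature map, to make the shapes concrete:

import numpy as np

feat_stride, height, width = 16, 2, 3
shift_x, shift_y = np.meshgrid(np.arange(0, width) * feat_stride,
                               np.arange(0, height) * feat_stride)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()
print(shifts.shape)  # (6, 4): one (dx, dy, dx, dy) shift per feature-map cell
# With A = 9 base anchors, the total is K * A = 6 * 9 = 54 anchors.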

Now back to build_rpn() in vgg16.py, for what happens after the 9 base anchors are generated. The feature map first passes through a 3×3 convolution; a 1×1 convolution then regresses the foreground/background scores, giving rpn_cls_score_reshape. A softmax turns these into probabilities, rpn_cls_prob_reshape, and a final reshape back into the standard layout gives rpn_cls_prob.

The binary classification and the regression run in parallel: a second 1×1 convolution over the same feature map outputs 4×k values, i.e. a depth of _num_anchors × 4.

Finally, the parameters produced by the binary classification and by the regression task are returned, and the RPN layer is complete.
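As a minimal sketch of that pair of heads, assuming TF-slim as in the rest of the repo (shapes follow the text above; the names and exact arguments are illustrative, not copied from vgg16.py):

import tensorflow as tf
import tensorflow.contrib.slim as slim

def rpn_heads(net, num_anchors, initializer):
    # shared 3x3 convolution over the conv5 feature map
    rpn = slim.conv2d(net, 512, [3, 3], weights_initializer=initializer,
                      scope='rpn_conv/3x3')
    # 1x1 convolution -> 2 scores per anchor (foreground / background)
    rpn_cls_score = slim.conv2d(rpn, num_anchors * 2, [1, 1], padding='VALID',
                                activation_fn=None, scope='rpn_cls_score')
    # parallel 1x1 convolution -> 4 box deltas per anchor
    rpn_bbox_pred = slim.conv2d(rpn, num_anchors * 4, [1, 1], padding='VALID',
                                activation_fn=None, scope='rpn_bbox_pred')
    return rpn_cls_score, rpn_bbox_pred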

⑦ Building the proposal layer (build_proposals).

    def build_proposals(self, is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score):

        if is_training:
            rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            rpn_labels = self._anchor_target_layer(rpn_cls_score, "anchor")

            # Try to have a deterministic order for the computing graph, for reproducibility
            with tf.control_dependencies([rpn_labels]):
                rois, _ = self._proposal_target_layer(rois, roi_scores, "rpn_rois")
        else:
            if cfg.FLAGS.test_mode == 'nms':
                rois, _ = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            elif cfg.FLAGS.test_mode == 'top':
                rois, _ = self._proposal_top_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            else:
                raise NotImplementedError
        return rois

Still within build_proposals in vgg16.py, jump into the _proposal_layer function:

network.py:

    def _proposal_layer(self, rpn_cls_prob, rpn_bbox_pred, name):
        with tf.variable_scope(name):
            rois, rpn_scores = tf.py_func(proposal_layer,
                                          [rpn_cls_prob, rpn_bbox_pred, self._im_info, self._mode,
                                           self._feat_stride, self._anchors, self._num_anchors],
                                          [tf.float32, tf.float32])
            rois.set_shape([None, 5])
            rpn_scores.set_shape([None, 1])

        return rois, rpn_scores

The core is the proposal_layer wrapped in tf.py_func(); continue into proposal_layer.py:

def proposal_layer(rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key, _feat_stride, anchors, num_anchors):
    """A simplified version compared to fast/er RCNN
       For details please see the technical report
    """
    if type(cfg_key) == bytes:
        cfg_key = cfg_key.decode('utf-8')

    if cfg_key == "TRAIN":
        pre_nms_topN = cfg.FLAGS.rpn_train_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_train_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_train_nms_thresh
    else:
        pre_nms_topN = cfg.FLAGS.rpn_test_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_test_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_test_nms_thresh

    im_info = im_info[0]
    # Get the scores and bounding boxes
    scores = rpn_cls_prob[:, :, :, num_anchors:]
    rpn_bbox_pred = rpn_bbox_pred.reshape((-1, 4))
    scores = scores.reshape((-1, 1))
    proposals = bbox_transform_inv(anchors, rpn_bbox_pred)
    proposals = clip_boxes(proposals, im_info[:2])

    # Pick the top region proposals
    order = scores.ravel().argsort()[::-1]
    if pre_nms_topN > 0:
        order = order[:pre_nms_topN]
    proposals = proposals[order, :]
    scores = scores[order]

    # Non-maximal suppression
    keep = nms(np.hstack((proposals, scores)), nms_thresh)

    # Pick th top region proposals after NMS
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]

    # Only support single image as input
    batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
    blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))

    return blob, scores

To recap what proposal_layer does: its job is to filter the boxes down to a suitable set and narrow the detection range. As stated in step ⑤ of the recap: first, keep the candidate boxes whose overlap with the ground truth exceeds 70% and discard the rest; second, use NMS to keep the candidates with the top n binary-classification scores; third, after filtering out-of-bounds boxes, rank by score and select once more. The code follows these steps closely:

Parameters come first: since there are two top-N passes, there are two knobs, pre_nms_topN and post_nms_topN. bbox_transform_inv() is the operation that adjusts the boxes toward ground-truth-like size and position. Stepping into it:

The transform first translates the box as a whole and then scales it as a whole: from the predicted deltas it computes pred_ctr_x, pred_ctr_y, pred_w, and pred_h, and finally returns the two corners (x1, y1) and (x2, y2), moving each anchor to roughly ground-truth size. This matches the box-regression transform from the paper (the figure that stood here did not survive).
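Reconstructed from the code below (a restatement, not the author's figure), with anchor center/size $(x_a, y_a, w_a, h_a)$ and predicted deltas $(d_x, d_y, d_w, d_h)$:

$$\hat{x} = d_x w_a + x_a, \quad \hat{y} = d_y h_a + y_a, \quad \hat{w} = w_a e^{d_w}, \quad \hat{h} = h_a e^{d_h}$$

with corners $x_1 = \hat{x} - \hat{w}/2$, $x_2 = \hat{x} + \hat{w}/2$, and likewise for $y$. The code, bbox_transform.py: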
def bbox_transform_inv(boxes, deltas):
    if boxes.shape[0] == 0:
        return np.zeros((0, deltas.shape[1]), dtype=deltas.dtype)

    boxes = boxes.astype(deltas.dtype, copy=False)
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    # x1
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    # y1
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    # x2
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    # y2
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes

Afterwards, the code first clips the boxes to the image boundary, which is clip_boxes() in the code, and selects the top n boxes by score. The nms function then yields keep, and a final top-N pass gives the boxes that survive non-maximum suppression.

 # Non-maximal suppression
    keep = nms(np.hstack((proposals, scores)), nms_thresh)

    # Pick th top region proposals after NMS
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]

Finally, the surviving boxes are returned: these are the proposals left after the proposal layer.
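Here nms comes from the repo's helpers; as a reference, a minimal pure-NumPy sketch of greedy IoU suppression (not the repo's exact implementation):

import numpy as np

def nms_sketch(dets, thresh):
    """dets: (N, 5) array of [x1, y1, x2, y2, score]."""
    x1, y1, x2, y2, scores = dets.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        order = order[1:][iou <= thresh]  # drop boxes that overlap too much
    return keep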

Next comes labeling the boxes by IoU (those above 70% become positives), handled by the _anchor_target_layer() function:

    def _anchor_target_layer(self, rpn_cls_score, name):
        with tf.variable_scope(name):
            rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights = tf.py_func(
                anchor_target_layer,
                [rpn_cls_score, self._gt_boxes, self._im_info, self._feat_stride, self._anchors, self._num_anchors],
                [tf.float32, tf.float32, tf.float32, tf.float32])

Then look at the related code in anchor_target_layer.py:

def anchor_target_layer(rpn_cls_score, gt_boxes, im_info, _feat_stride, all_anchors, num_anchors):
    """Same as the anchor target layer in original Fast/er RCNN """
    A = num_anchors
    total_anchors = all_anchors.shape[0]
    K = total_anchors / num_anchors
    im_info = im_info[0]

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]

    # only keep anchors inside the image
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)  # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]

    # label: 1 is positive, 0 is negative, -1 is dont care
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)

    # overlaps between the anchors and the gt boxes
    # overlaps (ex, gt)
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float),
        np.ascontiguousarray(gt_boxes, dtype=np.float))
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels first so that positive labels can clobber them
        # first set the negatives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # fg label: for each gt, anchor with highest overlap
    labels[gt_argmax_overlaps] = 1

    # fg label: above threshold IOU
    labels[max_overlaps >= cfg.FLAGS.rpn_positive_overlap] = 1

    if cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # subsample positive labels if we have too many
    num_fg = int(cfg.FLAGS.rpn_fg_fraction * cfg.FLAGS.rpn_batchsize)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1

    # subsample negative labels if we have too many
    num_bg = cfg.FLAGS.rpn_batchsize - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1

    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # only the positive ones have regression targets
    bbox_inside_weights[labels == 1, :] = np.array(cfg.FLAGS2["bbox_inside_weights"])

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.FLAGS.rpn_positive_weight < 0:
        # uniform weighting of examples (given non-uniform sampling)
        num_examples = np.sum(labels >= 0)
        positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        negative_weights = np.ones((1, 4)) * 1.0 / num_examples
    else:
        assert ((cfg.FLAGS.rpn_positive_weight > 0) &
                (cfg.FLAGS.rpn_positive_weight < 1))
        positive_weights = (cfg.FLAGS.rpn_positive_weight /
                            np.sum(labels == 1))
        negative_weights = ((1.0 - cfg.FLAGS.rpn_positive_weight) /
                            np.sum(labels == 0))
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to original set of anchors
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)

    # labels
    labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
    labels = labels.reshape((1, 1, A * height, width))
    rpn_labels = labels

    # bbox_targets
    bbox_targets = bbox_targets \
        .reshape((1, height, width, A * 4))

    rpn_bbox_targets = bbox_targets
    # bbox_inside_weights
    bbox_inside_weights = bbox_inside_weights \
        .reshape((1, height, width, A * 4))

    rpn_bbox_inside_weights = bbox_inside_weights

    # bbox_outside_weights
    bbox_outside_weights = bbox_outside_weights \
        .reshape((1, height, width, A * 4))

    rpn_bbox_outside_weights = bbox_outside_weights
    return rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights
GitHub address of the source code for this post:
https://github.com/dBeker/Faster-RCNN-TensorFlow-Python3.5