
A Detailed Code Walkthrough of TensorFlow-Based Object Detection (Faster RCNN)

This post walks through a Windows demo of Faster RCNN based on TensorFlow, in two parts: loading the training data and building the network. The Git address of the source is at the end of the article; download it and follow along.

1. Runtime Environment

First, a pile of pitfalls: if any single condition below is not met, the program will not run:

Windows 10 Home:

Python 3.5 + Windows + Visual Studio 2015 + CUDA 9.1

I stepped into several pits here myself; may those who use this version of the demo after me avoid them:

① Python 3.6 cannot build this program, because the author built it against 3.5.

② If your machine runs Windows 10 Home, do not install Python 3.5 through Anaconda; install Python 3.5 directly. Windows 10 Home cannot set up an Anaconda3 + Python 3.5 environment, since Anaconda3 defaults to Python 3.6 or 2.7.

③ No Visual Studio version other than 2015 provides the C++ toolchain required to compile the Python extensions. (Don't ask me why; I don't know either.)

Windows 10 Enterprise:

Anaconda3 + Python 3.5 + CUDA 9.1

① Matching Anaconda/Python versions can be downloaded from the Tsinghua Python mirror.

② If you use Anaconda with Python 3.5, you do not need Visual Studio at all. Conversely, if you install Python directly rather than through Anaconda, you must install Visual Studio 2015.

That's it for the pitfalls. With these in place, build the program per the README and it should run.

My IDE is JetBrains PyCharm.


2. Loading the Training Data

Let's start with the data-loading part:

Since object detection is both a regression and a classification task, the loaded data must include each object's location as well as its class. In this program, that information lives under:

...\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\Annotations

The image annotations are read from XML files.

Images and their XML annotation files correspond one-to-one; the training images live under:

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\JPEGImages

Now back to the code, train.py:

You can plainly see that the main function of the train file contains just two statements:

train = Train()
train.train()

The first statement covers loading the training dataset for the network; the second performs the actual training. Let's start with the first:

First, step into Train(), and then into the initialization in vgg16.py, which lands concretely in network.py:

self._feat_stride = [16, ]
self._feat_compress = [1. / 16., ]
self._batch_size = batch_size
self._predictions = {}
self._losses = {}
self._anchor_targets = {}
self._proposal_targets = {}
self._layers = {}
self._act_summaries = []
self._score_summaries = {}
self._train_summaries = []
self._event_summaries = {}
self._variables_to_fix = {}

It starts by assigning a few parameters, e.g. feat_stride, which relates the anchors discussed later to the corresponding regions of the original image.

Back in train.py, the next line is:

self.imdb, self.roidb = combined_roidb("voc_2007_trainval")

This line reads all the training image information into the variable roidb. Step into combined_roidb():

def get_roidb(imdb_name):
imdb = get_imdb(imdb_name)
print('Loaded dataset `{:s}` for training'.format(imdb.name))
imdb.set_proposal_method("gt")
print('Set proposal method: {:s}'.format("gt"))
roidb = get_training_roidb(imdb)
return roidb

The code above loads the roidb by dataset name and finally returns the roidb variable.

Note this line in the code:

roidbs = [get_roidb(s) for s in imdb_names.split('+')]

It means the data may come from several sources; if it really does, the dataset names are joined with plus signs, and split() separates them again when they are needed.

In fact, the program uses only one dataset, so the next line is:

roidb = roidbs[0]

Since the program is known to use exactly one dataset, taking the element at index 0 is enough; if you change this later, adapt it to your situation.
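If you ever do feed several datasets in with '+', the per-set roidbs need to be concatenated into one list. A minimal sketch, assuming each roidb is a plain Python list of per-image dicts (this helper is illustrative, not code from the repo):

import itertools

def merge_roidbs(roidbs):
    # Concatenate the per-dataset lists into one training roidb.
    return list(itertools.chain.from_iterable(roidbs))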

So how is the dataset actually handled? Step into get_imdb():

One more jump lands in factory.py:

# Set up voc_<year>_<split>
for year in ['2007', '2012']:
  for split in ['train', 'val', 'trainval', 'test']:
    name = 'voc_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: pascal_voc(split, year))

# Set up coco_2014_<split>
for year in ['2014']:
  for split in ['train', 'val', 'minival', 'valminusminival', 'trainval']:
    name = 'coco_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: coco(split, year))

# Set up coco_2015_<split>
for year in ['2015']:
  for split in ['test', 'test-dev']:
    name = 'coco_{}_{}'.format(year, split)
    __sets[name] = (lambda split=split, year=year: coco(split, year))

There are three loops, presumably because the COCO and Pascal VOC datasets use different internal formats in different years and therefore need separate handling.
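Each registered name maps to a zero-argument constructor, so a lookup simply calls the stored lambda. For reference, get_imdb() in factory.py is essentially the following (paraphrased from memory, so double-check against the repo):

def get_imdb(name):
    """Get an imdb (image database) by name."""
    if name not in __sets:
        raise KeyError('Unknown dataset: {}'.format(name))
    return __sets[name]()  # invoke the stored constructor lambda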

Starting with the Pascal VOC dataset: step into imdb's __init__; the code below is in imdb.py:

 def __init__(self, name, classes=None):
        self._name = name
        self._num_classes = 0
        if not classes:
            self._classes = []
        else:
            self._classes = classes
        self._image_index = []
        self._obj_proposer = 'gt'
        self._roidb = None
        self._roidb_handler = self.default_roidb
        # Use this dict for storing dataset specific config options
        self.config = {}

imdb.py mainly performs a series of operations on the loaded data:

The initializer records the dataset name, zeroes the class count, and initializes the class index labels. The proposal method is named 'gt'; roidb, the result we are after, starts as None; and a roidb handler is installed for operations we will detail shortly.

Back in pascal_voc.py, continuing after the initialization:

self._year = year
self._image_set = image_set

It first records the dataset year and then which image set's annotations to use. Here we only need Val and Train, i.e. the training data plus the ground truth:

The Pascal VOC split files live in:

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\VOCDevkit2007\VOC2007\ImageSets\Main

Open the trainval file and you will see:

000005
000007
000009
000012
000016
000017
000019
000020
000021
000023
000024
000026
000030

The file holds data in this form, about five thousand entries in total, one per example to be used. Each entry in trainval labels the data we are about to train on: the name of an image and, by the same name, its XML annotation.

Next, paths are set and the related information is read in.

self._devkit_path = self._get_default_path() if devkit_path is None \
            else devkit_path
self._data_path = os.path.join(self._devkit_path, 'VOC' + self._year)
After that come the classes we classify into, 21 in total: twenty foreground classes plus one background. Each class string is then assigned a fixed index, which makes the subsequent bookkeeping much easier:

self._class_to_ind = dict(list(zip(self.classes, list(range(self.num_classes)))))
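Concretely, the resulting mapping looks like this (a trimmed, self-contained illustration):

classes = ('__background__', 'aeroplane', 'bicycle')  # truncated for illustration
class_to_ind = dict(zip(classes, range(len(classes))))
print(class_to_ind)  # {'__background__': 0, 'aeroplane': 1, 'bicycle': 2}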

In fact, of Pascal VOC's many files, this program arguably uses only that one trainval txt file. Next our data is loaded: guided by the list in ImageSets, it is read from the _data_path directory line by line via x.strip(), and the result is returned as image_index:

    def _load_image_set_index(self):
        """
        Load the indexes listed in this dataset's image set file.
        """
        # Example path to image set file:
        # self._devkit_path + /VOCdevkit2007/VOC2007/ImageSets/Main/val.txt
        image_set_file = os.path.join(self._data_path, 'ImageSets', 'Main',
                                      self._image_set + '.txt')
        assert os.path.exists(image_set_file), \
            'Path does not exist: {}'.format(image_set_file)
        with open(image_set_file) as f:
            image_index = [x.strip() for x in f.readlines()]
        return image_index

That covers the Pascal VOC loading; the COCO dataset works the same way, so I will not repeat it. Back to train.py:

The call set_proposal_method("gt") declares that the loaded information is our ground truth.

The next line is rather interesting:

roidb = get_training_roidb(imdb)

Step into the method and have a look:

def get_training_roidb(imdb):
    """Returns a roidb (Region of Interest database) for use in training."""
    if True:
        print('Appending horizontally-flipped training examples...')
        imdb.append_flipped_images()
        print('done')

    print('Preparing training data...')
    rdl_roidb.prepare_roidb(imdb)
    print('done')

    return imdb.roidb

Here every image is flipped, i.e. mirrored horizontally. We start with 5,000 images; after flipping, the dataset holds ten thousand.

Let's look closely at the flipping, concretely in imdb.py:

    def append_flipped_images(self):
        num_images = self.num_images
        widths = self._get_widths()
        for i in range(num_images):
            boxes = self.roidb[i]['boxes'].copy()
            oldx1 = boxes[:, 0].copy()
            oldx2 = boxes[:, 2].copy()
            boxes[:, 0] = widths[i] - oldx2 - 1
            boxes[:, 2] = widths[i] - oldx1 - 1
            assert (boxes[:, 2] >= boxes[:, 0]).all()
            entry = {'boxes': boxes,
                     'gt_overlaps': self.roidb[i]['gt_overlaps'],
                     'gt_classes': self.roidb[i]['gt_classes'],
                     'flipped': True}
            self.roidb.append(entry)
        self._image_index = self._image_index * 2
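A quick sanity check of the flip arithmetic, with made-up numbers:

import numpy as np

width = 100
boxes = np.array([[10, 20, 30, 40]])     # one box: x1, y1, x2, y2
flipped = boxes.copy()
flipped[:, 0] = width - boxes[:, 2] - 1  # new x1 = 100 - 30 - 1 = 69
flipped[:, 2] = width - boxes[:, 0] - 1  # new x2 = 100 - 10 - 1 = 89
assert (flipped[:, 2] >= flipped[:, 0]).all()
print(flipped)                           # [[69 20 89 40]] -- y is untouched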

With that, the data is essentially loaded. A few other bits of processing deserve mention, for example this in pascal_voc.py:

    def gt_roidb(self):
        """
        Return the database of ground-truth regions of interest.

        This function loads/saves from/to a cache file to speed up future calls.
        """
        cache_file = os.path.join(self.cache_path, self.name + '_gt_roidb.pkl')
        if os.path.exists(cache_file):
            with open(cache_file, 'rb') as fid:
                try:
                    roidb = pickle.load(fid)
                except:
                    roidb = pickle.load(fid, encoding='bytes')
            print('{} gt roidb loaded from {}'.format(self.name, cache_file))
            return roidb

The purpose of this function is to cache the loaded data in a pickle file: on later runs, if the data has already been loaded, it is read straight from the pickle; otherwise the loading proceeds from scratch.
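The excerpt above only shows the cache hit. In the repo, the function goes on to build the roidb from the XML annotations and write the cache; reconstructed roughly (treat as a sketch):

        gt_roidb = [self._load_pascal_annotation(index)
                    for index in self.image_index]
        with open(cache_file, 'wb') as fid:
            pickle.dump(gt_roidb, fid, pickle.HIGHEST_PROTOCOL)
        print('wrote gt roidb to {}'.format(cache_file))
        return gt_roidb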

...\Desktop\FasterRcnn\Faster-RCNN-TensorFlow-Python3.5-master\data\cache

That is the cache directory; feel free to delete it and see what happens.

In the code, the cache directory and file name are fixed; if the cache file exists, the previously built data is loaded from it, otherwise everything is built from scratch. So by now we know which dataset is selected, which data gets loaded, where the fixed data sits, and in what form it arrives. One important question remains: how exactly are the labels read in?

The XML files are parsed with an XML parser:
    def _load_pascal_annotation(self, index):
        """
        Load image and bounding boxes info from XML file in the PASCAL VOC
        format.
        """
        filename = os.path.join(self._data_path, 'Annotations', index + '.xml')
        tree = ET.parse(filename)
        objs = tree.findall('object')
        if not self.config['use_diff']:
            # Exclude the samples labeled as difficult
            non_diff_objs = [
                obj for obj in objs if int(obj.find('difficult').text) == 0]
            # if len(non_diff_objs) != len(objs):
            #     print 'Removed {} difficult objects'.format(
            #         len(objs) - len(non_diff_objs))
            objs = non_diff_objs
        num_objs = len(objs)

        boxes = np.zeros((num_objs, 4), dtype=np.uint16)
        gt_classes = np.zeros((num_objs), dtype=np.int32)
        overlaps = np.zeros((num_objs, self.num_classes), dtype=np.float32)
        # "Seg" area for pascal is just the box area
        seg_areas = np.zeros((num_objs), dtype=np.float32)

        # Load object bounding boxes into a data frame.
        for ix, obj in enumerate(objs):
            bbox = obj.find('bndbox')
            # Make pixel indexes 0-based
            x1 = float(bbox.find('xmin').text) - 1
            y1 = float(bbox.find('ymin').text) - 1
            x2 = float(bbox.find('xmax').text) - 1
            y2 = float(bbox.find('ymax').text) - 1
            cls = self._class_to_ind[obj.find('name').text.lower().strip()]
            boxes[ix, :] = [x1, y1, x2, y2]
            gt_classes[ix] = cls
            overlaps[ix, cls] = 1.0
            seg_areas[ix] = (x2 - x1 + 1) * (y2 - y1 + 1)

        overlaps = scipy.sparse.csr_matrix(overlaps)

        return {'boxes': boxes,
                'gt_classes': gt_classes,
                'gt_overlaps': overlaps,
                'flipped': False,
                'seg_areas': seg_areas}
In boxes = np.zeros((num_objs, 4), dtype=np.uint16), boxes holds one regression box per object: two corner points, i.e. four numbers, so n objects give an n×4 array.

gt_classes = np.zeros((num_objs), dtype=np.int32) holds one class index per loaded object.

overlaps is a one-hot encoding of each object's class. seg_areas computes the box areas and is not used yet.

Then comes the loop, which iterates over the n objects in a single image.
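As a tiny illustration of those buffers (sizes and class indexes are made up):

import numpy as np

num_objs, num_classes = 2, 21
boxes = np.zeros((num_objs, 4), dtype=np.uint16)  # one (x1, y1, x2, y2) per object
overlaps = np.zeros((num_objs, num_classes), dtype=np.float32)
overlaps[0, 12] = 1.0   # object 0 belongs to class 12: a one-hot row
overlaps[1, 15] = 1.0   # object 1 belongs to class 15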

Now the flips are done, the data volume is doubled, the relevant data is specified, and everything is extracted.

One more line follows:

 rdl_roidb.prepare_roidb(imdb)

Jump once more, into the prepare_roidb function in roidb.py:

def prepare_roidb(imdb):
  """Enrich the imdb's roidb by adding some derived quantities that
  are useful for training. This function precomputes the maximum
  overlap, taken over ground-truth boxes, between each ROI and
  each ground-truth box. The class with maximum overlap is also
  recorded.
  """
  roidb = imdb.roidb
  if not (imdb.name.startswith('coco')):
    sizes = [PIL.Image.open(imdb.image_path_at(i)).size
         for i in range(imdb.num_images)]
  for i in range(len(imdb.image_index)):
    roidb[i]['image'] = imdb.image_path_at(i)
    if not (imdb.name.startswith('coco')):
      roidb[i]['width'] = sizes[i][0]
      roidb[i]['height'] = sizes[i][1]
    # need gt_overlaps as a dense array for argmax
    gt_overlaps = roidb[i]['gt_overlaps'].toarray()
    # max overlap with gt over classes (columns)
    max_overlaps = gt_overlaps.max(axis=1)
    # gt class that had the max overlap
    max_classes = gt_overlaps.argmax(axis=1)
    roidb[i]['max_classes'] = max_classes
    roidb[i]['max_overlaps'] = max_overlaps
    # sanity checks
    # max overlap of 0 => class should be zero (background)
    zero_inds = np.where(max_overlaps == 0)[0]
    assert all(max_classes[zero_inds] == 0)
    # max overlap > 0 => class should not be zero (must be a fg class)
    nonzero_inds = np.where(max_overlaps > 0)[0]
    assert all(max_classes[nonzero_inds] != 0)

What work is done here?

It gathers all the data onto roidb and returns it, filling in the image path, width, height, the overlaps, the class with the maximum overlap, and so on.

self.data_layer = RoIDataLayer(self.roidb, self.imdb.num_classes)
self.output_dir = cfg.get_output_dir(self.imdb, 'default')

Finally, output_dir sets the default directory for training outputs such as pickled snapshots.

The data layer receives the processed roidb data plus the class count, and performs a shuffle of the examples.
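A minimal sketch of that shuffle, assuming the layer keeps a permutation plus a cursor (names are illustrative, not the repo's exact fields):

import numpy as np

roidb = [{'image': 'a.jpg'}, {'image': 'b.jpg'}, {'image': 'c.jpg'}]  # stand-in
perm = np.random.permutation(np.arange(len(roidb)))  # epoch-level shuffle
cur = 0  # cursor advanced as minibatches are drawn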

3. Building the Network Architecture

Right, let's first recap how the Faster RCNN network is put together:

[Figure 1: the overall Faster RCNN architecture diagram, not reproduced here]

① A set of conv layers is built, i.e. a fully convolutional network; in this TensorFlow code it is a VGG16.

② The feature map produced by the stack of convolution and pooling operations in ① is fed into the RPN (Region Proposal Network) layer.

③ A 3×3 sliding window moves (left to right) across the feature map from ②. Taking its center point as an anchor mapped back to the original image, three scales and three aspect ratios are combined, so each anchor position yields 9 candidate regions; over all anchor positions, let there be k regions in total.

④ The k regions from ③ each undergo two operations in parallel: classification and regression. The binary classification separates foreground from background, yielding 2k scores; regions classified as background need no further processing, while foreground regions are regressed, yielding 4k coordinates, where each region's four values are the center point (x, y) and the region's height and width (h, w).

⑤ After the classification and regression, the boxes are filtered; this is the proposal layer's main job. The filtering proceeds as follows: first, by IoU > 0.7, i.e. a box is compared against the image's ground truth and kept only if the overlap exceeds 0.7; second, NMS (non-maximum suppression) keeps the top n boxes ranked by the binary-classification score (the foreground probability); third, out-of-bounds boxes are filtered; fourth, after all the above, the top m boxes by score are kept once more.

⑥ The resulting boxes go through RoI pooling, then a fully connected network with one classification task and one regression task; the classification is 21-way, i.e. twenty foreground classes plus one background. That completes the pipeline.

 

Good; that ends the recap.

Now into the code proper:

① Most of the network-construction code lives in vgg16.py. In the main function, the first Train() covers the data loading and the second train() covers the training process. Let's walk through the training first. The core is line 85:

layers = self.net.create_architecture(sess,"TRAIN", self.imdb.num_classes, tag='default')

create_architecture() builds the entire network structure. Step into it.

② It first sets up a series of convolution and deconvolution parameters; the core is line 295:

rois, cls_prob, bbox_pred = self.build_network(sess,training)

rois are the boxes produced for the RoI pooling layer; cls_prob is the final fully connected layer's classification score; bbox_pred is the box regression output for the 21 classes. Step into build_network():

③ This lands at the same-named function at line 18 of vgg16.py. Let's study it carefully:
    def build_network(self, sess, is_training=True):
        with tf.variable_scope('vgg_16', 'vgg_16'):

            # select initializer
            if cfg.FLAGS.initializer == "truncated":
                initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.01)
                initializer_bbox = tf.truncated_normal_initializer(mean=0.0, stddev=0.001)
            else:
                initializer = tf.random_normal_initializer(mean=0.0, stddev=0.01)
                initializer_bbox = tf.random_normal_initializer(mean=0.0, stddev=0.001)

            # Build head
            net = self.build_head(is_training)

            # Build rpn
            rpn_cls_prob, rpn_bbox_pred, rpn_cls_score, rpn_cls_score_reshape = self.build_rpn(net, is_training, initializer)

            # Build proposals
            rois = self.build_proposals(is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score)

            # Build predictions
            cls_score, cls_prob, bbox_pred = self.build_predictions(net, rois, is_training, initializer, initializer_bbox)

            self._predictions["rpn_cls_score"] = rpn_cls_score
            self._predictions["rpn_cls_score_reshape"] = rpn_cls_score_reshape
            self._predictions["rpn_cls_prob"] = rpn_cls_prob
            self._predictions["rpn_bbox_pred"] = rpn_bbox_pred
            self._predictions["cls_score"] = cls_score
            self._predictions["cls_prob"] = cls_prob
            self._predictions["bbox_pred"] = bbox_pred
            self._predictions["rois"] = rois

            self._score_summaries.update(self._predictions)

            return rois, cls_prob, bbox_pred

④ The function splits into build_head, build_rpn, build_proposals, and build_predictions, which correspond exactly to the fully convolutional layers, the RPN layer, the proposal layer, and the final fully connected layers we just described. With the skeleton in place, let's analyze these functions one by one:

⑤ Building the fully convolutional head (build_head). In this demo it consists of five layers; each stacks a few convolutions followed by one pooling operation, except the last layer, which has only convolutions and no pooling.

 # Main network
        # Layer  1
        net = slim.repeat(self._image, 2, slim.conv2d, 64, [3, 3], trainable=False, scope='conv1')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool1')

        # Layer 2
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], trainable=False, scope='conv2')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool2')

        # Layer 3
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], trainable=is_training, scope='conv3')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool3')

        # Layer 4
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv4')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool4')

        # Layer 5
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv5')

As the code shows, the author uses slim.conv2d for the convolutions (the traditional choice would be conv2d from the nn module) and max_pool2d for pooling. Pooling uses 2×2 windows; since the convolutions do not shrink the image, each pooling layer halves it, so the four pooling layers reduce the image to 1/16 of its original size.
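A quick check of that arithmetic (the input size is just an example):

import math

# 'SAME' convolutions keep H and W; each of the four 2x2/stride-2 pools halves them.
h, w = 600, 800
print(math.ceil(h / 16), math.ceil(w / 16))  # 38 50 -> the conv5 feature-map size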

⑥ Building the RPN (build_rpn). _anchor_component() generates the nine base boxes. Stepping in, height and width are the feature-map dimensions, computed from the image size divided by feat_stride, and tf.py_func() invokes the anchor generation: inside generate_anchors_pre, generate_anchors() produces the nine base anchors, and a grid of shifts places them at every feature-map position. Once the positions exist they must map back to the original image, so feat_stride is the scale factor between the original image and this feature map; here it is 16.

The related code in network.py (reached by jumping from vgg16.py):

    def _anchor_component(self):
        with tf.variable_scope('ANCHOR_' + 'default'):
            # just to get the shape right
            height = tf.to_int32(tf.ceil(self._im_info[0, 0] / np.float32(self._feat_stride[0])))
            width = tf.to_int32(tf.ceil(self._im_info[0, 1] / np.float32(self._feat_stride[0])))
            anchors, anchor_length = tf.py_func(generate_anchors_pre,
                                                [height, width,
                                                 self._feat_stride, self._anchor_scales, self._anchor_ratios],
                                                [tf.float32, tf.int32], name="generate_anchors")
            anchors.set_shape([None, 4])
            anchor_length.set_shape([])
            self._anchors = anchors
            self._anchor_length = anchor_length

The related code in snippets.py:

def generate_anchors_pre(height, width, feat_stride, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2)):
    """ A wrapper function to generate anchors given different scales
      Also return the number of anchors in variable 'length'
    """
    anchors = generate_anchors(ratios=np.array(anchor_ratios), scales=np.array(anchor_scales))
    A = anchors.shape[0]
    shift_x = np.arange(0, width) * feat_stride
    shift_y = np.arange(0, height) * feat_stride
    shift_x, shift_y = np.meshgrid(shift_x, shift_y)
    shifts = np.vstack((shift_x.ravel(), shift_y.ravel(), shift_x.ravel(), shift_y.ravel())).transpose()
    K = shifts.shape[0]
    # width changes faster, so here it is H, W, C
    anchors = anchors.reshape((1, A, 4)) + shifts.reshape((1, K, 4)).transpose((1, 0, 2))
    anchors = anchors.reshape((K * A, 4)).astype(np.float32, copy=False)
    length = np.int32(anchors.shape[0])

    return anchors, length
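A toy run of the shift grid above, assuming feat_stride = 16 and a 2×3 feature map, to make the shapes concrete:

import numpy as np

feat_stride, height, width = 16, 2, 3
shift_x, shift_y = np.meshgrid(np.arange(0, width) * feat_stride,
                               np.arange(0, height) * feat_stride)
shifts = np.vstack((shift_x.ravel(), shift_y.ravel(),
                    shift_x.ravel(), shift_y.ravel())).transpose()
print(shifts.shape)  # (6, 4): one (dx, dy, dx, dy) shift per feature-map cell
# With A = 9 base anchors, the total is K * A = 6 * 9 = 54 anchors.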

Now back to build_rpn() in vgg16.py, for what happens after the 9 base anchors are generated. The feature map first passes through a 3×3 convolution; a 1×1 convolution then regresses the foreground/background scores, giving rpn_cls_score_reshape. A softmax turns these into probabilities, rpn_cls_prob_reshape, and a final reshape back into the standard layout gives rpn_cls_prob.

The binary classification and the regression run in parallel: a second 1×1 convolution over the same feature map outputs 4×k values, i.e. a depth of _num_anchors × 4.

Finally, the parameters produced by the binary classification and by the regression task are returned, and the RPN layer is complete.
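As a minimal sketch of that pair of heads, assuming TF-slim as in the rest of the repo (shapes follow the text above; the names and exact arguments are illustrative, not copied from vgg16.py):

import tensorflow as tf
import tensorflow.contrib.slim as slim

def rpn_heads(net, num_anchors, initializer):
    # shared 3x3 convolution over the conv5 feature map
    rpn = slim.conv2d(net, 512, [3, 3], weights_initializer=initializer,
                      scope='rpn_conv/3x3')
    # 1x1 convolution -> 2 scores per anchor (foreground / background)
    rpn_cls_score = slim.conv2d(rpn, num_anchors * 2, [1, 1], padding='VALID',
                                activation_fn=None, scope='rpn_cls_score')
    # parallel 1x1 convolution -> 4 box deltas per anchor
    rpn_bbox_pred = slim.conv2d(rpn, num_anchors * 4, [1, 1], padding='VALID',
                                activation_fn=None, scope='rpn_bbox_pred')
    return rpn_cls_score, rpn_bbox_pred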

⑦ Building the proposal layer (build_proposals).

    def build_proposals(self, is_training, rpn_cls_prob, rpn_bbox_pred, rpn_cls_score):

        if is_training:
            rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            rpn_labels = self._anchor_target_layer(rpn_cls_score, "anchor")

            # Try to have a deterministic order for the computing graph, for reproducibility
            with tf.control_dependencies([rpn_labels]):
                rois, _ = self._proposal_target_layer(rois, roi_scores, "rpn_rois")
        else:
            if cfg.FLAGS.test_mode == 'nms':
                rois, _ = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            elif cfg.FLAGS.test_mode == 'top':
                rois, _ = self._proposal_top_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
            else:
                raise NotImplementedError
        return rois

Still within build_proposals in vgg16.py, jump into the _proposal_layer function:

network.py:

    def _proposal_layer(self, rpn_cls_prob, rpn_bbox_pred, name):
        with tf.variable_scope(name):
            rois, rpn_scores = tf.py_func(proposal_layer,
                                          [rpn_cls_prob, rpn_bbox_pred, self._im_info, self._mode,
                                           self._feat_stride, self._anchors, self._num_anchors],
                                          [tf.float32, tf.float32])
            rois.set_shape([None, 5])
            rpn_scores.set_shape([None, 1])

        return rois, rpn_scores

The core is the proposal_layer wrapped in tf.py_func(); continue into proposal_layer.py:

def proposal_layer(rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key, _feat_stride, anchors, num_anchors):
    """A simplified version compared to fast/er RCNN
       For details please see the technical report
    """
    if type(cfg_key) == bytes:
        cfg_key = cfg_key.decode('utf-8')

    if cfg_key == "TRAIN":
        pre_nms_topN = cfg.FLAGS.rpn_train_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_train_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_train_nms_thresh
    else:
        pre_nms_topN = cfg.FLAGS.rpn_test_pre_nms_top_n
        post_nms_topN = cfg.FLAGS.rpn_test_post_nms_top_n
        nms_thresh = cfg.FLAGS.rpn_test_nms_thresh

    im_info = im_info[0]
    # Get the scores and bounding boxes
    scores = rpn_cls_prob[:, :, :, num_anchors:]
    rpn_bbox_pred = rpn_bbox_pred.reshape((-1, 4))
    scores = scores.reshape((-1, 1))
    proposals = bbox_transform_inv(anchors, rpn_bbox_pred)
    proposals = clip_boxes(proposals, im_info[:2])

    # Pick the top region proposals
    order = scores.ravel().argsort()[::-1]
    if pre_nms_topN > 0:
        order = order[:pre_nms_topN]
    proposals = proposals[order, :]
    scores = scores[order]

    # Non-maximal suppression
    keep = nms(np.hstack((proposals, scores)), nms_thresh)

    # Pick th top region proposals after NMS
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]

    # Only support single image as input
    batch_inds = np.zeros((proposals.shape[0], 1), dtype=np.float32)
    blob = np.hstack((batch_inds, proposals.astype(np.float32, copy=False)))

    return blob, scores

To recap what proposal_layer does: its job is to filter the boxes down to a suitable set and narrow the detection range. As stated in step ⑤ of the recap: first, keep the candidate boxes whose overlap with the ground truth exceeds 70% and discard the rest; second, use NMS to keep the candidates with the top n binary-classification scores; third, after filtering out-of-bounds boxes, rank by score and select once more. The code follows these steps closely:

Parameters come first: since there are two top-N passes, there are two knobs, pre_nms_topN and post_nms_topN. bbox_transform_inv() is the operation that adjusts the boxes toward ground-truth-like size and position. Stepping into it:

The transform first translates the box as a whole and then scales it as a whole: from the predicted deltas it computes pred_ctr_x, pred_ctr_y, pred_w, and pred_h, and finally returns the two corners (x1, y1) and (x2, y2), moving each anchor to roughly ground-truth size. This matches the box-regression transform from the paper (the figure that stood here did not survive).
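Reconstructed from the code below (a restatement, not the author's figure), with anchor center/size $(x_a, y_a, w_a, h_a)$ and predicted deltas $(d_x, d_y, d_w, d_h)$:

$$\hat{x} = d_x w_a + x_a, \quad \hat{y} = d_y h_a + y_a, \quad \hat{w} = w_a e^{d_w}, \quad \hat{h} = h_a e^{d_h}$$

with corners $x_1 = \hat{x} - \hat{w}/2$, $x_2 = \hat{x} + \hat{w}/2$, and likewise for $y$. The code, bbox_transform.py: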
def bbox_transform_inv(boxes, deltas):
    if boxes.shape[0] == 0:
        return np.zeros((0, deltas.shape[1]), dtype=deltas.dtype)

    boxes = boxes.astype(deltas.dtype, copy=False)
    widths = boxes[:, 2] - boxes[:, 0] + 1.0
    heights = boxes[:, 3] - boxes[:, 1] + 1.0
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx = deltas[:, 0::4]
    dy = deltas[:, 1::4]
    dw = deltas[:, 2::4]
    dh = deltas[:, 3::4]

    pred_ctr_x = dx * widths[:, np.newaxis] + ctr_x[:, np.newaxis]
    pred_ctr_y = dy * heights[:, np.newaxis] + ctr_y[:, np.newaxis]
    pred_w = np.exp(dw) * widths[:, np.newaxis]
    pred_h = np.exp(dh) * heights[:, np.newaxis]

    pred_boxes = np.zeros(deltas.shape, dtype=deltas.dtype)
    # x1
    pred_boxes[:, 0::4] = pred_ctr_x - 0.5 * pred_w
    # y1
    pred_boxes[:, 1::4] = pred_ctr_y - 0.5 * pred_h
    # x2
    pred_boxes[:, 2::4] = pred_ctr_x + 0.5 * pred_w
    # y2
    pred_boxes[:, 3::4] = pred_ctr_y + 0.5 * pred_h

    return pred_boxes

Afterwards, the code first clips the boxes to the image boundary, which is clip_boxes() in the code, and selects the top n boxes by score. The nms function then yields keep, and a final top-N pass gives the boxes that survive non-maximum suppression.

 # Non-maximal suppression
    keep = nms(np.hstack((proposals, scores)), nms_thresh)

    # Pick th top region proposals after NMS
    if post_nms_topN > 0:
        keep = keep[:post_nms_topN]
    proposals = proposals[keep, :]
    scores = scores[keep]

Finally, the surviving boxes are returned: these are the proposals left after the proposal layer.
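Here nms comes from the repo's helpers; as a reference, a minimal pure-NumPy sketch of greedy IoU suppression (not the repo's exact implementation):

import numpy as np

def nms_sketch(dets, thresh):
    """dets: (N, 5) array of [x1, y1, x2, y2, score]."""
    x1, y1, x2, y2, scores = dets.T
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the kept box with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        iou = w * h / (areas[i] + areas[order[1:]] - w * h)
        order = order[1:][iou <= thresh]  # drop boxes that overlap too much
    return keep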

Next comes labeling the boxes by IoU (those above 70% become positives), handled by the _anchor_target_layer() function:

    def _anchor_target_layer(self, rpn_cls_score, name):
        with tf.variable_scope(name):
            rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights = tf.py_func(
                anchor_target_layer,
                [rpn_cls_score, self._gt_boxes, self._im_info, self._feat_stride, self._anchors, self._num_anchors],
                [tf.float32, tf.float32, tf.float32, tf.float32])

Then look at the related code in anchor_target_layer.py:

def anchor_target_layer(rpn_cls_score, gt_boxes, im_info, _feat_stride, all_anchors, num_anchors):
    """Same as the anchor target layer in original Fast/er RCNN """
    A = num_anchors
    total_anchors = all_anchors.shape[0]
    K = total_anchors / num_anchors
    im_info = im_info[0]

    # allow boxes to sit over the edge by a small amount
    _allowed_border = 0

    # map of shape (..., H, W)
    height, width = rpn_cls_score.shape[1:3]

    # only keep anchors inside the image
    inds_inside = np.where(
        (all_anchors[:, 0] >= -_allowed_border) &
        (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)  # height
    )[0]

    # keep only inside anchors
    anchors = all_anchors[inds_inside, :]

    # label: 1 is positive, 0 is negative, -1 is dont care
    labels = np.empty((len(inds_inside),), dtype=np.float32)
    labels.fill(-1)

    # overlaps between the anchors and the gt boxes
    # overlaps (ex, gt)
    overlaps = bbox_overlaps(
        np.ascontiguousarray(anchors, dtype=np.float),
        np.ascontiguousarray(gt_boxes, dtype=np.float))
    argmax_overlaps = overlaps.argmax(axis=1)
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]
    gt_argmax_overlaps = overlaps.argmax(axis=0)
    gt_max_overlaps = overlaps[gt_argmax_overlaps,
                               np.arange(overlaps.shape[1])]
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]

    if not cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels first so that positive labels can clobber them
        # first set the negatives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # fg label: for each gt, anchor with highest overlap
    labels[gt_argmax_overlaps] = 1

    # fg label: above threshold IOU
    labels[max_overlaps >= cfg.FLAGS.rpn_positive_overlap] = 1

    if cfg.FLAGS.rpn_clobber_positives:
        # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.FLAGS.rpn_negative_overlap] = 0

    # subsample positive labels if we have too many
    num_fg = int(cfg.FLAGS.rpn_fg_fraction * cfg.FLAGS.rpn_batchsize)
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(
            fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1

    # subsample negative labels if we have too many
    num_bg = cfg.FLAGS.rpn_batchsize - np.sum(labels == 1)
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(
            bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1

    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])

    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    # only the positive ones have regression targets
    bbox_inside_weights[labels == 1, :] = np.array(cfg.FLAGS2["bbox_inside_weights"])

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.FLAGS.rpn_positive_weight < 0:
        # uniform weighting of examples (given non-uniform sampling)
        num_examples = np.sum(labels >= 0)
        positive_weights = np.ones((1, 4)) * 1.0 / num_examples
        negative_weights = np.ones((1, 4)) * 1.0 / num_examples
    else:
        assert ((cfg.FLAGS.rpn_positive_weight > 0) &
                (cfg.FLAGS.rpn_positive_weight < 1))
        positive_weights = (cfg.FLAGS.rpn_positive_weight /
                            np.sum(labels == 1))
        negative_weights = ((1.0 - cfg.FLAGS.rpn_positive_weight) /
                            np.sum(labels == 0))
    bbox_outside_weights[labels == 1, :] = positive_weights
    bbox_outside_weights[labels == 0, :] = negative_weights

    # map up to original set of anchors
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)

    # labels
    labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)
    labels = labels.reshape((1, 1, A * height, width))
    rpn_labels = labels

    # bbox_targets
    bbox_targets = bbox_targets \
        .reshape((1, height, width, A * 4))

    rpn_bbox_targets = bbox_targets
    # bbox_inside_weights
    bbox_inside_weights = bbox_inside_weights \
        .reshape((1, height, width, A * 4))

    rpn_bbox_inside_weights = bbox_inside_weights

    # bbox_outside_weights
    bbox_outside_weights = bbox_outside_weights \
        .reshape((1, height, width, A * 4))

    rpn_bbox_outside_weights = bbox_outside_weights
    return rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights
GitHub address of the source code for this post:
https://github.com/dBeker/Faster-RCNN-TensorFlow-Python3.5