
Deep Learning Tracking (1) — Learning to Track at 100 FPS with Deep Regression Networks (code walkthrough)

This is the first time I have read the full engineering code of a deep-learning network, and there is a lot of content and structure I don't understand. On top of that, running the project on Linux without an IDE means I cannot debug, and I didn't know how to jump to a function's definition while reading, so going through the whole project is quite cumbersome. I am therefore reading, looking things up, and learning as I go.

Below I start to dissect the project code. There may be some inaccuracies and mistakes, which I will keep learning about and correcting.

I. Project Structure

(Screenshot: project directory structure)

The project root directory contains 6 folders and 3 files.

CMakeLists.txt:
CMake is a cross-platform build tool: simple statements in a CMakeLists.txt file describe how the project is built on every platform. All CMake statements live in CMakeLists.txt. Once CMakeLists.txt is in place, the relevant variables can be configured with the ccmake command (which must be pointed at the directory containing CMakeLists.txt). After configuration, running cmake generates the corresponding Makefile (on UNIX-like systems) or project files (when targeting a Windows build tool).
For details, see this post: http://blog.csdn.net/u012150179/article/details/17852273

build folder:

This folder contains the files generated by compilation. I don't fully understand them yet; I will look at them later.

cmake folder:

This folder contains configuration files: they point to the corresponding Caffe model and configure the library files the project needs.
I will figure out what FindTinyXML.cmake does later...

The imgs folder holds a single screenshot from the paper showing the tracker's results.

nets folder:

tracker.prototxt defines the network structure. Visualizing it with Netscope, we can see the structure, which is not complicated.

(Netscope visualization of the network structure)

solver.prototxt is the solver configuration file, i.e. the parameters used when running/training the whole model.

The models folder contains the already trained network model.

tracker_output holds the project's output files, including the saved videos and so on.

scripts folder:

This folder contains the shell scripts used to run the program. They cover: downloading the initialization model (download_model_init.sh), downloading the trained model (download_trained_model.sh), evaluating results (evaluate_val.sh), saving the tracking results on the test set (save_videos_test.sh), showing the tracking process on the test set (show_tracker_test.sh), showing the tracking process on the validation set (show_tracker_val.sh), training (train.sh), and unzipping ImageNet (unzip_imagenet.sh). There is also a folder, Fscore_v1.0, which presumably contains the files used to compute the F-score.

src folder:

This folder contains the C++ source code.

Since I cannot do visualization on the server, I followed the instructions in the source documentation and saved the tracking results on the test data (by calling the save_videos_test.sh script), then inspected the performance afterwards. Below I explain parts of the code following the order in which they are executed.

II. Code Flow (the parts invoked for this run)
First, run the program by executing the script with the following command:

bash scripts/save_videos_test.sh  /*/dataPackage/vot2014

1. So, let's start with the save_videos_test.sh file:

#!/bin/bash

if [ -z "$1" ]
  then
    echo "No folder supplied!"
    echo "Usage: bash `basename "$0"` vot_videos_folder"
    exit
fi

# Choose which GPU the tracker runs on
GPU_ID=10

VIDEOS_FOLDER=$1

FOLDER=GOTURN1_test

DEPLOY_PROTO=nets/tracker.prototxt

CAFFE_MODEL=nets/models/pretrained_model/tracker.caffemodel

OUTPUT_FOLDER=nets/tracker_output/$FOLDER

echo "Saving output to " $OUTPUT_FILE

# Run tracker on test set and save videos
build/save_videos_vot $VIDEOS_FOLDER $DEPLOY_PROTO $CAFFE_MODEL $OUTPUT_FOLDER $GPU_ID 

GPU_ID: here you can change which GPU the tracker runs on. Because many jobs were running on the server, slowing the program down, I picked an idle GPU. (Open a new terminal and run nvidia-smi to check GPU usage.)

VIDEOS_FOLDER: $1 is the first argument, i.e. the first argument that follows the command; here it is the input video folder (in fact a folder of image sequences);

DEPLOY_PROTO: the network definition (deploy prototxt) used by this run, i.e. the network structure, whose parameters are still undetermined.

CAFFE_MODEL: the trained network model, whose parameters have all been fixed by training.

The last line of the script runs build/save_videos_vot, the binary built from save_videos_vot.cpp, so that file is next.

2. save_videos_vot.cpp

This file is in the test folder; its contents are as follows:

#include <string>

#include <boost/lexical_cast.hpp>
#include <boost/filesystem.hpp>

#include <opencv/cv.h>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>

#include "helper/high_res_timer.h"
#include "network/regressor.h"
#include "loader/loader_alov.h"
#include "loader/loader_vot.h"
#include "tracker/tracker.h"
#include "tracker/tracker_manager.h"

using std::string;

int main (int argc, char *argv[]) {
  if (argc < 6) {
    std::cerr << "Usage: " << argv[0]
              << " videos_folder deploy.prototxt network.caffemodel"
              << " output_folder gpu_id" << std::endl;
    return 1;
  }

  ::google::InitGoogleLogging(argv[0]);

  string videos_folder          = argv[1];
  string test_proto             = argv[2];
  string caffe_model           = argv[3];
  string output_folder          = argv[4];
  int gpu_id                    = atoi(argv[5]);

  boost::filesystem::create_directories(output_folder);

  const bool do_train = false;
  Regressor regressor(test_proto, caffe_model, gpu_id, do_train);

  // Time how long tracking takes.
  HighResTimer hrt_total("Total evaluation (including loading videos)");
  hrt_total.start();

  // Get videos.
  std::vector<Video> videos;
  LoaderVOT loader(videos_folder);
  videos = loader.get_videos();

  // Create a tracker object.
  const bool show_intermediate_output = false;
  Tracker tracker(show_intermediate_output);

  // Track all objects in all videos and save the output.
  const bool save_videos = true;
  TrackerTesterAlov tracker_tester(videos, save_videos, &regressor, &tracker, output_folder);
  tracker_tester.TrackAll();

  // Print the timing information.
  hrt_total.stop();
  hrt_total.print();

  return 0;
}

At the start of main, a Regressor object is created and initialized with the command-line arguments (the existing network model); do_train is set to false, i.e. test mode. Let's look inside the Regressor class:

- 2.1 regressor.cpp

Regressor::Regressor(const string& deploy_proto,
                     const string& caffe_model,
                     const int gpu_id,
                     const bool do_train)
  : num_inputs_(kNumInputs),
    caffe_model_(caffe_model),
    modified_params_(false)
{
  SetupNetwork(deploy_proto, caffe_model, gpu_id, do_train);
}

The above is the Regressor constructor; it calls SetupNetwork to set up the network. SetupNetwork looks like this:

void Regressor::SetupNetwork(const string& deploy_proto,
                             const string& caffe_model,
                             const int gpu_id,
                             const bool do_train) {
#ifdef CPU_ONLY
  printf("Setting up Caffe in CPU mode\n");
  caffe::Caffe::set_mode(caffe::Caffe::CPU);
#else
  printf("Setting up Caffe in GPU mode with ID: %d\n", gpu_id);
  caffe::Caffe::SetDevice(gpu_id);
  caffe::Caffe::set_mode(caffe::Caffe::GPU);
#endif

  if (do_train) {
    printf("Setting phase to train\n");
    net_.reset(new Net<float>(deploy_proto, caffe::TRAIN));
  } else {
    printf("Setting phase to test\n");
    net_.reset(new Net<float>(deploy_proto, caffe::TEST));
  }

  if (caffe_model != "NONE") {
    net_->CopyTrainedLayersFrom(caffe_model_);
  } else {
    printf("Not initializing network from pre-trained model\n");
  }

  //CHECK_EQ(net_->num_inputs(), num_inputs_) << "Network should have exactly " << num_inputs_ << " inputs.";
  CHECK_EQ(net_->num_outputs(), 1) << "Network should have exactly one output.";

  Blob<float>* input_layer = net_->input_blobs()[0];

  printf("Network image size: %d, %d\n", input_layer->width(), input_layer->height());

  num_channels_ = input_layer->channels();
  CHECK(num_channels_ == 3 || num_channels_ == 1)
    << "Input layer should have 1 or 3 channels.";
  input_geometry_ = cv::Size(input_layer->width(), input_layer->height());

  // Load the binaryproto mean file.
  SetMean();
}

SetupNetwork mainly does the following: it prints which GPU (or CPU mode) is being used and whether we are in the training or the test phase, creates the network in that phase from the deploy prototxt, and, if a trained model is available, copies its weights into the network (otherwise it reports that no pre-trained model is used).

Here a Blob<float> pointer, input_layer, is defined, pointing to element 0 of net_->input_blobs(). net_ should by now have been initialized through CopyTrainedLayersFrom(caffe_model_), i.e. with the trained model, so the "Network image size" printed next should be the image size the trained model expects. None of my test images have the printed size of 227x227, so I guess the network must reshape or resize my input images somehow; I'll check later.

The last function called here is SetMean(). It sets the mean image: a constant 3-channel image filled with the per-channel (BGR) mean values 104, 117, 123, which is later subtracted from the inputs during preprocessing. Its definition is:

void Regressor::SetMean() {
  // Set the mean image.
  mean_ = cv::Mat(input_geometry_, CV_32FC3, cv::Scalar(104, 117, 123));
}
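
The mean_ image is used later, when the inputs are fed to the network (the Preprocess calls inside Estimate, shown further below). The real implementation lives in regressor.cpp and is not reproduced in this post; the following is only a minimal sketch of the usual Caffe-style preprocessing I assume is happening here (resize to the network's input size, convert to float, subtract the mean):

#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>

// Minimal sketch only - the real steps are in Regressor::Preprocess (src/network/regressor.cpp).
// Assumed steps: resize to the network input size, convert to float, subtract the mean image.
cv::Mat PreprocessSketch(const cv::Mat& img, const cv::Size& input_geometry,
                         const cv::Mat& mean) {
  cv::Mat resized;
  cv::resize(img, resized, input_geometry);      // e.g. 227x227, from input_geometry_
  cv::Mat sample_float;
  resized.convertTo(sample_float, CV_32FC3);     // the network works on floats
  cv::Mat normalized;
  cv::subtract(sample_float, mean, normalized);  // remove the per-channel mean (104, 117, 123)
  return normalized;
}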

The operations above, i.e. the Regressor constructor, perform the initialization of the model used by this project.

After that, a timer object is defined so that the running time can be computed later. Then the image files to be tested are loaded (here referred to as videos).

The next step is to create a Tracker object.

- 2.2 tracker.cpp
First the flag for showing intermediate output is set to false, i.e. intermediate results are not displayed.
Then a Tracker object is created.

- 2.3 tracker_manager.cpp
Next, a TrackerTesterAlov object is defined to run the test on the input data.
TrackerTesterAlov is derived from the TrackerManager class.

TrackerTesterAlov::TrackerTesterAlov(const std::vector<Video>& videos,
                                     const bool save_videos,
                                     RegressorBase* regressor, Tracker* tracker,
                                     const std::string& output_folder) :
  TrackerManager(videos, regressor, tracker),
  output_folder_(output_folder),
  hrt_("Tracker"),
  total_ms_(0),
  num_frames_(0),
  save_videos_(save_videos)
{
}

The above is the TrackerTesterAlov constructor. When save_videos_vot.cpp calls it, videos holds the videos loaded from the input folder, save_videos is set to true, regressor is the regressor created earlier, tracker is the Tracker object created earlier, and the last argument is the output path.

Right after the parameter list comes the constructor's member initializer list, which assigns initial values to the new object's data members.
It also invokes the constructor of the parent class, TrackerManager:

TrackerManager::TrackerManager(const std::vector<Video>& videos,
                               RegressorBase* regressor, Tracker* tracker) :
  videos_(videos),
  regressor_(regressor),
  tracker_(tracker)
{
}

Once initialization is done, tracking starts by calling the TrackAll function.

void TrackerManager::TrackAll(const size_t start_video_num, const int pause_val) {
  // Iterate over all videos and track the target object in each.
  for (size_t video_num = start_video_num; video_num < videos_.size(); ++video_num) {
    // Get the video.
    const Video& video = videos_[video_num];

    // Perform any pre-processing steps on this video.
    VideoInit(video, video_num);

    // Get the first frame of this video with the initial ground-truth bounding box (to initialize the tracker).
    int first_frame;
    cv::Mat image_curr;
    BoundingBox bbox_gt;
    video.LoadFirstAnnotation(&first_frame, &image_curr, &bbox_gt);

    // Initialize the tracker.
    tracker_->Init(image_curr, bbox_gt, regressor_);//image_curr:the first frame; bbox_gt: the groundtruth box of the first frame 

    // Iterate over the remaining frames of the video.
    for (size_t frame_num = first_frame + 1; frame_num < video.all_frames.size(); ++frame_num) {

      // Get image for the current frame.
      // (The ground-truth bounding box is used only for visualization).
      const bool draw_bounding_box = false;
      const bool load_only_annotation = false;
      cv::Mat image_curr;
      BoundingBox bbox_gt;
      bool has_annotation = video.LoadFrame(frame_num,
                                            draw_bounding_box,
                                            load_only_annotation,
                                            &image_curr, &bbox_gt);

      // Get ready to track the object.
      SetupEstimate();

      // Track and estimate the target's bounding box location in the current image.
      // Important: this method cannot receive bbox_gt (the ground-truth bounding box) as an input.
      BoundingBox bbox_estimate_uncentered;
      tracker_->Track(image_curr, regressor_, &bbox_estimate_uncentered);

      // Process the output (e.g. visualize / save results).
      ProcessTrackOutput(frame_num, image_curr, has_annotation, bbox_gt,
                           bbox_estimate_uncentered, pause_val);
    }
    PostProcessVideo();
  }
  PostProcessAll();
}

When this function is called, start_video_num is set to 0 and pause_val to 1, i.e. it iterates over all the videos and tracks the target in each of them.

1. First, the for loop goes over all the videos in the folder;

2. Then VideoInit initializes each video: it derives the video's name from the folder, creates a folder for the output, determines the size of the output images, and creates a video_writer object so that the result videos can be saved (a sketch of this idea follows below);
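
VideoInit itself is not reproduced in this post. As a hedged illustration of the video-saving part, here is what opening a writer and appending annotated frames typically looks like with the OpenCV 2.x API this project uses (the function and variable names below are mine, not the ones in tracker_manager.cpp):

#include <string>
#include <opencv2/core/core.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>   // cv::VideoWriter in OpenCV 2.x

// Hypothetical sketch: open one output video and append frames with the estimated box drawn on.
void SaveTrackedFrameSketch(cv::VideoWriter* writer, const std::string& output_path,
                            cv::Mat frame, const cv::Rect& bbox_estimate) {
  if (!writer->isOpened()) {
    // Assumed codec/fps; the real values are chosen inside VideoInit.
    writer->open(output_path, CV_FOURCC('M', 'J', 'P', 'G'), 30.0, frame.size());
  }
  cv::rectangle(frame, bbox_estimate, cv::Scalar(0, 0, 255), 2);  // draw the estimate in red
  writer->write(frame);
}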

3. Get the first frame of the video and the corresponding initial target bounding box. LoadFirstAnnotation is called here to load the initial ground truth; the function is defined in video.cpp under the loader folder.

void Video::LoadFirstAnnotation(int* first_frame, cv::Mat* image,
                               BoundingBox* box) const {
  LoadAnnotation(0, first_frame, image, box);
}

void Video::LoadAnnotation(const int annotation_index,
                          int* frame_num,
                          cv::Mat* image,
                          BoundingBox* box) const {
  // Get the annotation corresponding to this index.
  const Frame& annotated_frame = annotations[annotation_index];

  // Get the frame number corresponding to this annotation.
  *frame_num = annotated_frame.frame_num;

  *box = annotated_frame.bbox;

  const string& video_path = path;
  const vector<string>& image_files = all_frames;

  if (image_files.empty()) {
    printf("Error - no image files for video at path: %s\n", path.c_str());
    return;
  } else if (*frame_num >= image_files.size()) {
    printf("Cannot find frame: %d; only %zu image files were found at %s\n", *frame_num, image_files.size(), path.c_str());
    return;
  }

  // Load the image corresponding to this annotation.
  const string& image_file = video_path + "/" + image_files[*frame_num];
  *image = cv::imread(image_file);

  if (!image->data) {
    printf("Could not find file: %s\n", image_file.c_str());
  }
}

4. Next, the tracker is initialized with the current frame image (i.e. the first frame), the current target box (i.e. the initial target bounding box), and the regressor defined earlier.

This takes us to tracker.cpp under the tracker folder. The corresponding function is defined as:

void Tracker::Init(const cv::Mat& image, const BoundingBox& bbox_gt,
                   RegressorBase* regressor) {
  image_prev_ = image;
  bbox_prev_tight_ = bbox_gt;

  // Predict in the current frame that the location will be approximately the same
  // as in the previous frame.
  // TODO - use a motion model?
  bbox_curr_prior_tight_ = bbox_gt;

  // Initialize the neural network.
  regressor->Init();
}

The assumption here is that the target's position in the current frame is very close to its position in the previous frame, i.e. the motion is small. The neural network is also initialized once more here, to guard against its parameters having been modified:

void Regressor::Init() {
  if (modified_params_ ) {
    printf("Reloading new params\n");
    net_->CopyTrainedLayersFrom(caffe_model_);
    modified_params_ = false;
  }
}

5. Iterate over the remaining frames of the video: load the current frame and the ground-truth bounding box (if there is one), start the timer, and get ready to track.
The Track function of the Tracker class is then called to do the actual tracking. It is defined as follows:

void Tracker::Track(const cv::Mat& image_curr, RegressorBase* regressor,
                    BoundingBox* bbox_estimate_uncentered) {
  // Get target from previous image.
  cv::Mat target_pad;
  CropPadImage(bbox_prev_tight_, image_prev_, &target_pad);

  // Crop the current image based on predicted prior location of target.
  cv::Mat curr_search_region;
  BoundingBox search_location;
  double edge_spacing_x, edge_spacing_y;
  CropPadImage(bbox_curr_prior_tight_, image_curr, &curr_search_region, &search_location, &edge_spacing_x, &edge_spacing_y);

  // Estimate the bounding box location of the target, centered and scaled relative to the cropped image.
  BoundingBox bbox_estimate;
  regressor->Regress(image_curr, curr_search_region, target_pad, &bbox_estimate);

  // Unscale the estimation to the real image size.
  BoundingBox bbox_estimate_unscaled;
  bbox_estimate.Unscale(curr_search_region, &bbox_estimate_unscaled);

  // Find the estimated bounding box location relative to the current crop.
  bbox_estimate_unscaled.Uncenter(image_curr, search_location, edge_spacing_x, edge_spacing_y, bbox_estimate_uncentered);

  if (show_tracking_) {
    ShowTracking(target_pad, curr_search_region, bbox_estimate);
  }

  // Save the image.
  image_prev_ = image_curr;

  // Save the current estimate as the location of the target.
  bbox_prev_tight_ = *bbox_estimate_uncentered;

  // Save the current estimate as the prior prediction for the next image.
  // TODO - replace with a motion model prediction?
  bbox_curr_prior_tight_ = *bbox_estimate_uncentered;
}

This function performs the following operations:

  • (1) Crop the target region out of the image, i.e. take the patch containing the target from the previous frame;
  • (2) Crop the current frame based on the target's location in the previous frame;
    Since this step involves a variable whose meaning is not obvious at first glance, let's paste the function (CropPadImage) here and look at its body.
void CropPadImage(const BoundingBox& bbox_tight, const cv::Mat& image, cv::Mat* pad_image,
                  BoundingBox* pad_image_location, double* edge_spacing_x, double* edge_spacing_y) {
  // Crop the image based on the bounding box location, adding some padding.

  // Get the location of the cropped and padded image.
  ComputeCropPadImageLocation(bbox_tight, image, pad_image_location);

  // Compute the ROI, ensuring that the crop stays within the boundaries of the image.
  const double roi_left = std::min(pad_image_location->x1_, static_cast<double>(image.cols - 1));
  const double roi_bottom = std::min(pad_image_location->y1_, static_cast<double>(image.rows - 1));
  const double roi_width = std::min(static_cast<double>(image.cols), std::max(1.0, ceil(pad_image_location->x2_ - pad_image_location->x1_)));
  const double roi_height = std::min(static_cast<double>(image.rows), std::max(1.0, ceil(pad_image_location->y2_ - pad_image_location->y1_)));

  // Crop the image based on the ROI.
  cv::Rect myROI(roi_left, roi_bottom, roi_width, roi_height);
  cv::Mat cropped_image = image(myROI);

  // Now we need to place the crop in a new image of the appropriate size,
  // adding a black border where necessary to account for edge effects.

  // Make a new image to store the output.
  // The new image should have size: get_output_width(), get_output_height(), but
  // to be safe we ensure that the output is not smaller than roi_width, roi_height.
  const double output_width = std::max(ceil(bbox_tight.compute_output_width()), roi_width);
  const double output_height = std::max(ceil(bbox_tight.compute_output_height()), roi_height);
  cv::Mat output_image = cv::Mat(output_height, output_width, image.type(), cv::Scalar(0, 0, 0));

  // Compute the location to place the crop so that it will be centered at the
  // center of the bounding box (accounting for edge effects).

  // Get the amount that the output "sticks out" beyond the left and bottom edges of the image.
  // This might be 0, but it might be > 0 if the output is near the edge of the image.
  *edge_spacing_x = std::min(bbox_tight.edge_spacing_x(), static_cast<double>(output_image.cols - 1));
  *edge_spacing_y = std::min(bbox_tight.edge_spacing_y(), static_cast<double>(output_image.rows - 1));

  // Get the location within the output to put the cropped image (accounting for edge effects).
  cv::Rect output_rect(*edge_spacing_x, *edge_spacing_y, roi_width, roi_height);
  cv::Mat output_image_roi = output_image(output_rect);

  // Copy the cropped image to the specified location within the output.
  // Without edge effects, this will fill the output.
  // With edge effects, this will comprise a subset of the output, with black
  // being placed around the crop to account for edge effects.
  cropped_image.copyTo(output_image_roi);

  // Set the output.
  *pad_image = output_image;
}

Here bbox_tight is the target bounding box from the previous frame and image is the current frame. pad_image stores the cropped region, pad_image_location is a bounding box describing where the cropped patch lies, and edge_spacing_x / edge_spacing_y are the coordinates of the patch's top-left corner. However, this is not the top-left corner of the crop in the original image; it is the top-left position at which the crop is placed inside the output image, which may be slightly larger than the crop itself. I didn't fully understand this part at first.

Next, ComputeCropPadImageLocation is called to compute the cropped region and its location.

Then the region of interest is computed, making sure the crop stays within the image boundaries.

The region obtained in the previous step is then cropped out to produce the cropped image.

The cropped image is then placed into a new image of the appropriate output size; wherever edge effects occur (i.e. the crop would extend past the original image), a black border is added.

Finally, the proper location for the cropped image is computed so that its center lies at the center of the bounding box.

double BoundingBox::edge_spacing_x() const {
  const double output_width = compute_output_width();
  const double bbox_center_x = get_center_x();

  // Compute the amount that the output "sticks out" beyond the edge of the image (edge effects).
  // If there are no edge effects, we would have output_width / 2 < bbox_center_x, but if the crop is near the left
  // edge of the image then we would have output_width / 2 > bbox_center_x, with the difference
  // being the amount that the output "sticks out" beyond the edge of the image.
  return std::max(0.0, output_width / 2 - bbox_center_x);
}

This function first computes the output (padded) width of the bounding box and gets the x-coordinate of the bounding box center. It then returns max(0, output_width / 2 - bbox_center_x): if the padded crop centered on the target would extend past the left edge of the image, this is the amount by which it sticks out, and hence the x-offset at which the cropped ROI has to be placed inside the output image; otherwise it returns 0. That value is the edge_spacing_x that gets returned; a small worked example follows.
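
A quick numerical illustration (assuming, as in the GOTURN paper, that compute_output_width() pads the box to twice its width; the exact factor is defined in bounding_box.cpp and not shown here):

#include <algorithm>
#include <cstdio>

int main() {
  // Hypothetical numbers: a 100-px-wide box whose center sits at x = 80, near the left edge.
  const double output_width = 2.0 * 100.0;   // assumed context factor of 2 -> padded width 200
  const double bbox_center_x = 80.0;
  const double edge_spacing_x = std::max(0.0, output_width / 2 - bbox_center_x);
  // Prints 20: the padded crop would stick out 20 px past the left image edge,
  // so the cropped ROI is pasted 20 px from the left inside the output image.
  std::printf("edge_spacing_x = %.1f\n", edge_spacing_x);
  return 0;
}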

6. Now to the main part: regressing (estimating) the location of the target's bounding box in the current frame.

Let's step into the functions involved.
The Regress function:

void Regressor::Regress(const cv::Mat& image_curr,
                        const cv::Mat& image, const cv::Mat& target,
                        BoundingBox* bbox) {
  assert(net_->phase() == caffe::TEST);

  // Estimate the bounding box location of the target object in the current image.
  std::vector<float> estimation;
  Estimate(image, target, &estimation);

  // Wrap the estimation in a bounding box object.
  *bbox = BoundingBox(estimation);
}

The Estimate function:

void Regressor::Estimate(const cv::Mat& image, const cv::Mat& target, std::vector<float>* output) {
  assert(net_->phase() == caffe::TEST);

  // Reshape the input blobs to be the appropriate size.
  Blob<float>* input_target = net_->input_blobs()[0];
  input_target->Reshape(1, num_channels_,
                       input_geometry_.height, input_geometry_.width);

  Blob<float>* input_image = net_->input_blobs()[1];
  input_image->Reshape(1, num_channels_,
                       input_geometry_.height, input_geometry_.width);

  Blob<float>* input_bbox = net_->input_blobs()[2];
  input_bbox->Reshape(1, 4, 1, 1);

  // Forward dimension change to all layers.
  net_->Reshape();

  // Process the inputs so we can set them.
  std::vector<cv::Mat> target_channels;
  std::vector<cv::Mat> image_channels;
  WrapInputLayer(&target_channels, &image_channels);

  // Set the inputs to the network.
  Preprocess(image, &image_channels);
  Preprocess(target, &target_channels);

  // Perform a forward-pass in the network.
  net_->ForwardPrefilled();

  // Get the network output.
  GetOutput(output);
}

The first step of this function reshapes the network's input blobs to the appropriate size; the input images themselves are presumably resized to match inside Preprocess, which is consistent with my earlier guess.

Here I ran into another gap: I don't yet understand the Blob data structure well enough. (A short sketch of its memory layout follows.)
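
For reference, a caffe::Blob is essentially a contiguous array stored in N x C x H x W order (plus shape information and a matching gradient array), and Blob::offset(n, c, h, w) computes the flat index into that array. The standalone sketch below (plain C++, not Caffe code) illustrates the same layout; as I understand it, this is also why WrapInputLayer can wrap each H x W channel plane of an input blob as a cv::Mat header, so that Preprocess writes the images straight into the network's input memory.

#include <cstdio>
#include <vector>

// Illustration of the N x C x H x W layout used by caffe::Blob.
// Blob::offset(n, c, h, w) computes exactly this flat index into the data array.
inline int BlobOffset(int n, int c, int h, int w, int channels, int height, int width) {
  return ((n * channels + c) * height + h) * width + w;
}

int main() {
  const int num = 1, channels = 3, height = 227, width = 227;  // shape of the GOTURN input blob
  std::vector<float> data(num * channels * height * width, 0.0f);

  // Write pixel (row 10, col 20) of channel 1 of the first (and only) image:
  data[BlobOffset(0, 1, 10, 20, channels, height, width)] = 123.0f;

  // Each channel is one contiguous height * width plane, which is why a cv::Mat header
  // can be wrapped around it without copying.
  std::printf("one channel plane = %d floats\n", height * width);
  return 0;
}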