Distributed Model Training with the TensorFlow Object Detection API

阿新 • Published: 2019-01-01

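Note: The official TensorFlow models project (https://github.com/tensorflow/models) already supports training and evaluating a wide range of models and comes with detailed tutorials. However, the tutorials under models/research/object_detection do not explain how to run distributed training. This article describes how to train Object Detection models in a distributed setup.

Environment:
- 3 machines running Ubuntu 16.04 (one ps, two workers)
- tensorflow-gpu 1.4.1 (I am not sure whether the ps machine needs a graphics card)

My two workers are desktop PCs with one 1070 graphics card each, and a laptop acts as the ps. The laptop has a 1050 Ti, so I installed tensorflow-gpu on it as well; I do not know whether the ps actually needs the GPU build.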

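I. Preparation

1. Download the models project

Note: https://github.com/tensorflow/models is updated very frequently. I installed tensorflow-gpu 1.4.1 in December 2017, and when I downloaded the models project from the official repository in March 2018 and compiled it, it failed with errors. If you use the latest version of TensorFlow, you can download models from the official repository.

If your version matches mine (tensorflow-gpu 1.4.1), you can download the models project here:
Link: https://pan.baidu.com/s/1Afesy1s_5XaQ98eg4mY1RQ Password: b3g8

Note: After downloading, extract the archive. The extracted folder should be named models-master; where the official documentation says models, it means this models-master folder. It is best to place the extracted folder under your home directory and to avoid Chinese characters anywhere in the path, otherwise you may get Cython errors. For example, the absolute path of my models-master is /home/hadoop/tensorflow/models-master/, where hadoop is my Ubuntu user name.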

2. Install the object_detection dependencies

The official documentation is at https://github.com/tensorflow/models/tree/master/research/object_detection and covers this in detail; what follows is only a brief summary.

Install the dependencies with:

sudo apt-get install protobuf-compiler python-pil python-lxml python-tk
sudo pip install jupyter
sudo pip install matplotlib

Alternatively, install the Python packages with pip:

sudo pip install pillow
sudo pip install lxml
sudo pip install jupyter
sudo pip install matplotlib

You may also need Cython, which can be installed with:

sudo pip install Cython

If you want to use the COCO evaluation metrics, download and set up the COCO API with the commands that follow. The default evaluation metric is Pascal VOC. Quoting the official documentation: To use the COCO object detection metrics add metrics_set: "coco_detection_metrics" to the eval_config message in the config file. To use the COCO instance segmentation metrics add metrics_set: "coco_mask_metrics" to the eval_config message in the config file.

git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI
make
cp -r pycocotools <path_to_tensorflow>/models/research/

Protobuf Compilation

# From the downloaded /models-master/research/ directory, run:
protoc object_detection/protos/*.proto --python_out=.

Add Libraries to PYTHONPATH

# Still in the /models-master/research/ directory
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

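Note: This step has to be repeated in every new terminal. To avoid that, add the line to ~/.bashrc. The `pwd` in the command expands to the current directory, i.e. the /models-master/research/ directory, so in ~/.bashrc use absolute paths instead, for example: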
export PYTHONPATH=$PYTHONPATH:/home/hadoop/tensorflow/models-master/research:/home/hadoop/tensorflow/models-master/research/slim
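
To confirm that a terminal really has both entries set, a small check can help. The following is a minimal sketch, not part of the official tooling: the helper names object_detection_paths and missing_from_pythonpath are made up for this illustration, and the directory is my own path from above.

```python
import os


def object_detection_paths(research_dir):
    """Return the two directories this tutorial adds to PYTHONPATH."""
    research_dir = os.path.abspath(research_dir)
    return [research_dir, os.path.join(research_dir, "slim")]


def missing_from_pythonpath(research_dir, pythonpath):
    """List the required directories that are absent from a PYTHONPATH string."""
    entries = [os.path.abspath(p) for p in pythonpath.split(os.pathsep) if p]
    return [p for p in object_detection_paths(research_dir) if p not in entries]


# Example with the path used in this article (adjust to your own layout).
research = "/home/hadoop/tensorflow/models-master/research"
missing = missing_from_pythonpath(research, os.environ.get("PYTHONPATH", ""))
print("missing PYTHONPATH entries: %s" % missing)
```

An empty list means the export took effect in the current shell.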

3. Test the installation

# Still in the /models-master/research/ directory
python object_detection/builders/model_builder_test.py

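II. Prepare the data

The official documentation describes data preparation in detail and covers downloading and converting two datasets, PASCAL VOC and Oxford-IIIT Pet:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/preparing_inputs.md

Note: In short, you download a dataset and then convert it into TFRecord files.

I also provide the converted data for download, covering both PASCAL VOC and Oxford-IIIT Pet:
Link: https://pan.baidu.com/s/1DbYTQo4TFd6xSJV8RDI-Ow Password: dun8

Note: The .record files hold the training and validation data, the .pbtxt files hold the name of each class, and the .config files are network configurations that I have tested. After downloading, extract the archive into the /models-master/ folder so that your directory layout matches mine.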

III. Prepare a pre-trained model

Official pre-trained models for each detection method can be downloaded from:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

I also provide pre-trained models for download:
Link: https://pan.baidu.com/s/1ldDYseIpE4qt6ooUWJoWsw Password: ruyx

Note: The archive contains the four pre-trained models I have tested: faster_rcnn_resnet50_coco_2018_01_28, faster_rcnn_resnet101_coco_11_06_2017, ssd_inception_v2_coco_2017_11_17 and ssd_mobilenet_v1_coco_2017_11_17. As above, extract it into the /models-master/ folder so that your directory layout matches mine.

IV. Single-machine training

The official single-machine training tutorial is at:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_locally.md

# Still in the /models-master/research/ directory
python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --train_dir=${PATH_TO_TRAIN_DIR}

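Note: pipeline_config_path points to the configuration file of the network being used, and train_dir is where the trained model is saved. Network configuration files can be found under models-master/research/object_detection/samples/configs/. If you downloaded the data from my Baidu archive above, the data folder already contains configurations for several models.

Next, edit the network configuration file. The following configuration is used as the example; the parts that mainly need changing carry added comments:
models-master/research/object_detection/samples/configs/ssd_mobilenet_v1_pets.config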

train_config: {
  batch_size: 24  # change the batch size here
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  # Point this at the pre-trained checkpoint. If you downloaded the models
  # from my Baidu archive, they are in the model_dir folder, for example:
  # fine_tune_checkpoint: "/home/hadoop/tensorflow/models-master/model_dir/ssd_mobilenet_v1_coco_2017_11_17/model.ckpt"
  from_detection_checkpoint: true
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000   # change the number of training steps here
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_train.record"  # path to the training data
    # e.g. input_path: "/home/hadoop/tensorflow/models-master/data/pet_train_with_masks.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"  # path to the label map
  # e.g. label_map_path: "/home/hadoop/tensorflow/models-master/data/pet_label_map.pbtxt"
}

eval_config: {
  num_examples: 2000
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  max_evals: 10
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pet_val.record"  # path to the validation data
    # e.g. input_path: "/home/hadoop/tensorflow/models-master/data/pet_val_with_masks.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pet_label_map.pbtxt"  # path to the label map
  # e.g. label_map_path: "/home/hadoop/tensorflow/models-master/data/pet_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
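
Editing every PATH_TO_BE_CONFIGURED entry by hand is error-prone, so the substitution can also be scripted. This is a minimal sketch under my own assumptions: fill_placeholders is a made-up helper, it treats the config as plain text rather than parsing the protobuf, and it keeps the file names from the sample config.

```python
import re


def fill_placeholders(config_text, dirs_by_file):
    """Replace "PATH_TO_BE_CONFIGURED/<file>" with "<dir>/<file>".

    dirs_by_file maps each file name to the directory that holds it.
    A file name missing from the mapping raises KeyError, so nothing
    is left unconfigured silently.
    """
    def repl(match):
        filename = match.group(1)
        return '"%s/%s"' % (dirs_by_file[filename].rstrip("/"), filename)

    return re.sub(r'"PATH_TO_BE_CONFIGURED/([^"]+)"', repl, config_text)


# Example with the directory layout used in this article.
base = "/home/hadoop/tensorflow/models-master"
sample = 'fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"\n'
print(fill_placeholders(sample, {
    "model.ckpt": base + "/model_dir/ssd_mobilenet_v1_coco_2017_11_17",
}))
```

If your files are named differently from the sample config (mine are called pet_train_with_masks.record, for instance), those entries still have to be changed by hand.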

Here is the full command I use to train ssd_mobilenet_v1_pets on a single machine:

python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

V. Distributed training

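The official documentation says little about distributed training, so this section explains how to do it.
Preparation: set up all three machines as described above. The ps machine does not need the training and test datasets, i.e. the .record files mentioned earlier.

Note: A distributed run describes the cluster through TF_CONFIG. The code in train.py and trainer.py automatically reads the TF_CONFIG environment variable and uses it to build the cluster.
The ps machine starts one ps process. The worker0 and worker1 machines each start two processes, one master and one worker.
I do not fully understand how this distributed mode works, so I would be grateful if someone could explain the relationship between master and worker. I also believe this setup runs as between-graph replication with synchronous updates; please let me know if that is wrong.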

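One more note: my worker0 and worker1 are desktop PCs on which I had set up Hadoop, and my laptop, which acts as the ps, also has Hadoop installed and is used as a Hadoop client. For that reason I mapped host names to IP addresses in /etc/hosts.
In the commands below, the master, slave1 and slave2 in "master:2220", "slave1:2222" and "slave2:2224" stand for the IP addresses of my three machines:
master: IP of worker0
slave1: IP of worker1
slave2: IP of ps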
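
Because the five launch commands differ only in the "task" part of TF_CONFIG, it is easy to mistype one. The sketch below generates the JSON instead of writing it by hand. It is only an illustration: the tf_config helper is a name I made up, and the cluster dictionary simply repeats my host names and ports.

```python
import json

# Same cluster layout as in the commands of this article.
CLUSTER = {
    "master": ["master:2220", "slave1:2222"],
    "ps": ["slave2:2224"],
    "worker": ["master:3001", "slave1:3003"],
}


def tf_config(task_type, index, cluster=CLUSTER):
    """Build the TF_CONFIG JSON string for one process of the cluster."""
    if task_type not in cluster:
        raise ValueError("unknown task type: %r" % task_type)
    if not 0 <= index < len(cluster[task_type]):
        raise ValueError("no %s task with index %d" % (task_type, index))
    return json.dumps(
        {"cluster": cluster, "task": {"type": task_type, "index": index}},
        sort_keys=True)


# One line per process; prefix the train.py command with it.
for task_type, index in [("ps", 0), ("master", 0), ("worker", 0),
                         ("master", 1), ("worker", 1)]:
    print("TF_CONFIG='%s'" % tf_config(task_type, index))
```

The cluster part is identical for every process; only type and index change, and the index selects the address within the list of that type.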

Note: The commands for each machine follow. Pay attention to how "index" and "type" differ between them.

Run ps on the ps machine:

# Still in the /models-master/research/ directory
TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 0, "type": "ps"}}' python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

Run master on the worker0 machine:

# Still in the /models-master/research/ directory
TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 0, "type": "master"}}' python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

Run worker on the worker0 machine:

# Still in the /models-master/research/ directory
TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 0, "type": "worker"}}' python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

Run master on the worker1 machine:

# Still in the /models-master/research/ directory
TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 1, "type": "master"}}' python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

Run worker on the worker1 machine:

# Still in the /models-master/research/ directory
TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 1, "type": "worker"}}' python object_detection/train.py --logtostderr --pipeline_config_path=/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets.config  --train_dir=/home/hadoop/tensorflow/models-master/train_dir_tmp

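At this point training should start without problems. You can monitor it with:

tensorboard --logdir=/home/hadoop/tensorflow/models-master/train_dir_tmp

The final trained model should end up in the corresponding folder on the worker0 machine.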
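
To see how far training has progressed without opening TensorBoard, you can look at the checkpoint files in train_dir. The following is a rough sketch that assumes checkpoints are written as model.ckpt-<step>.index files, which is what I would expect from TensorFlow 1.x savers; latest_checkpoint_step is a made-up helper, and the demo uses a temporary directory with empty files rather than a real training run.

```python
import os
import re
import tempfile


def latest_checkpoint_step(train_dir):
    """Return the highest step among model.ckpt-<step>.index files, or None."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.index$")
    steps = []
    for name in os.listdir(train_dir):
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# Demo on a throwaway directory with empty placeholder files.
demo_dir = tempfile.mkdtemp()
for name in ("model.ckpt-0.index", "model.ckpt-1500.index",
             "model.ckpt-1500.meta", "graph.pbtxt"):
    open(os.path.join(demo_dir, name), "w").close()
print("latest step: %s" % latest_checkpoint_step(demo_dir))
```

On the worker0 machine you would pass /home/hadoop/tensorflow/models-master/train_dir_tmp instead of the demo directory.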

VI. Optional: using HDFS

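Note: You can also keep the training data, the pre-trained model and even the saved training output in HDFS. On a cluster where Hadoop is configured correctly, this only requires small changes to the configuration file, the file paths and the launch command.

Here is the command run on the ps: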

CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob) TF_CONFIG='{"cluster": {"master": ["master:2220","slave1:2222"], "ps": ["slave2:2224"], "worker": ["master:3001","slave1:3003"]}, "task": {"index": 0, "type": "ps"}}' python object_detection/train.py --logtostderr --pipeline_config_path=hdfs://master:9000/home/hadoop/tensorflow/models-master/data/ssd_mobilenet_v1_pets_hdfs.config  --train_dir=hdfs://master:9000/home/hadoop/tensorflow/models-master/train_dir_tmp

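Note: The data must be uploaded to HDFS beforehand. The configuration file ssd_mobilenet_v1_pets_hdfs.config can also be stored on HDFS; just change the training data paths inside it to their HDFS locations.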
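
The path changes in the config file can be scripted as well. This is a minimal sketch under two assumptions of mine: the files sit in HDFS under the same absolute paths as on the local disk (as in my command above), and the namenode is master:9000. The helpers to_hdfs_uri and rewrite_config_for_hdfs are illustrative names, not part of the Object Detection API.

```python
import re

# Config keys whose values are file paths in the sample config.
PATH_KEYS = ("fine_tune_checkpoint", "input_path", "label_map_path")


def to_hdfs_uri(local_path, namenode="master:9000"):
    """Turn an absolute local path into an hdfs:// URI with the same path."""
    if not local_path.startswith("/"):
        raise ValueError("expected an absolute path: %r" % local_path)
    return "hdfs://%s%s" % (namenode, local_path)


def rewrite_config_for_hdfs(config_text, namenode="master:9000"):
    """Rewrite absolute paths of the known keys; commented lines are skipped."""
    pattern = re.compile(
        r'^([ \t]*(?:%s):[ \t]*)"(/[^"]*)"' % "|".join(PATH_KEYS),
        re.MULTILINE)
    return pattern.sub(
        lambda m: '%s"%s"' % (m.group(1), to_hdfs_uri(m.group(2), namenode)),
        config_text)


sample = 'input_path: "/home/hadoop/tensorflow/models-master/data/pet_train.record"\n'
print(rewrite_config_for_hdfs(sample))
```

Values that still contain PATH_TO_BE_CONFIGURED are left alone because they do not start with a slash, so fill in the local paths first.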

VII. Open issues

Setup:
The ps is a laptop and the two desktops are the workers.
Desktop configuration: 1070, 16 GB RAM.
LAN: 100 Mbit/s Ethernet

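1. By default, TensorFlow claims all GPU memory as soon as a process starts. Each worker machine has to run two processes, master and worker. If the worker process starts first, the master process is left with very little GPU memory and the program fails to run, sometimes printing a very long string full of paths. If the master process starts first and the worker process second, the program runs, but after a while the worker process hits a GPU memory error.

Solution: enable GPU memory growth for the TensorFlow processes so that they do not grab all memory at startup.
In models-master/research/object_detection/train.py, starting at about line 143, change the code as follows. The changed part is under the comment #maqy add: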


  if worker_replicas >= 1 and ps_tasks > 0:
    # Set up distributed training.

    #maqy add gpu
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    server = tf.train.Server(tf.train.ClusterSpec(cluster), protocol='grpc',
                             job_name=task_info.type,
                             task_index=task_info.index,
                             config=config)
    if task_info.type == 'ps':
      server.join()
      return

In models-master/research/object_detection/trainer.py, starting at about line 307, change the code as follows:

    # Merge all summaries together.
    summary_op = tf.summary.merge(list(summaries), name='summary_op')

    # Soft placement allows placing on CPU ops without GPU implementation.
    session_config = tf.ConfigProto(allow_soft_placement=True,
                                    log_device_placement=False)
    #maqy add
    session_config.gpu_options.allow_growth = True    

    # Save checkpoints regularly.
    keep_checkpoint_every_n_hours = train_config.keep_checkpoint_every_n_hours
    saver = tf.train.Saver(
        keep_checkpoint_every_n_hours=keep_checkpoint_every_n_hours)

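2. Following on from the previous issue: after these changes, with the small ssd_mobilenet_v1_coco_2017_11_17 network and the training batch size set to 12, the master process uses about 2.5 GB of GPU memory. After running for a while, the worker process also starts computing and printing global_step, and its GPU memory use grows from about 100 MB to 2.5 GB as well. I do not know how exactly the master and worker processes relate to each other. It would also be worth checking whether the time per step increases once the worker process starts computing.

3. Why does the model end up saved on the worker0 machine? Can this be configured?

4. Distributed training is markedly slower. With ssd_mobilenet_v1_coco_2017_11_17 and a batch size of 12, a step takes about 0.25 s on a single machine and about 4.8 s in the distributed setup. In my tests the speed also varies a lot between network architectures; faster_rcnn_resnet101, for example, takes about 30-50 s per step.

My initial guess is that transferring parameters takes most of the time. The system monitor shows that on the 100 Mbit/s Ethernet each machine sends and receives at roughly 12 MB/s, which is probably the main bottleneck. If anyone has tried a LAN with more bandwidth, please leave a comment with the speed you get.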
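
The numbers above fit a saturated link: 100 Mbit/s is 12.5 MB/s, close to the roughly 12 MB/s seen in the system monitor. The back-of-the-envelope sketch below assumes, as a simplification, that a distributed step is limited only by the network; the faster link speeds are hypothetical and I have not measured them.

```python
def megabytes_per_second(megabits_per_second):
    """Convert a link speed from Mbit/s to MB/s (8 bits per byte)."""
    return megabits_per_second / 8.0


def network_bound_step_seconds(megabytes_per_step, link_mbit_per_s):
    """Time for one step if it only waits on the network."""
    return megabytes_per_step / megabytes_per_second(link_mbit_per_s)


# Measured in this article: 100 Mbit/s Ethernet, about 4.8 s per step.
link_100 = megabytes_per_second(100)
# Upper bound on traffic per step, assuming the link is busy the whole step.
traffic_per_step = 4.8 * link_100

# Hypothetical faster links carrying the same traffic per step.
for mbit in (100, 1000, 10000):
    seconds = network_bound_step_seconds(traffic_per_step, mbit)
    print("%d Mbit/s -> %.3f s/step" % (mbit, seconds))
```

Under these assumptions a step moves at most about 60 MB. On gigabit Ethernet the transfer alone would take around half a second, which is the same order as the 0.25 s single-machine step, so computation would no longer be negligible in the estimate.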

5. All Faster R-CNN configuration files use a batch_size of 1. Why is that? Also, with both resnet50 and resnet101 my GPU memory (8 GB) still fills up completely. Is that normal?
