TensorFlow Study Notes (11): [Ubuntu] Running, Visualizing, Exporting, and Using the inception_v4 Model with the slim Framework

Model: the Inception_v4 model from the slim framework
Inception_v4 checkpoint: http://download.tensorflow.org/models/inception_v4_2016_09_09.tar.gz
Dataset: Google's flower dataset, http://download.tensorflow.org/example_images/flower_photos.tgz, containing 5 classes of flowers

Data preparation

After downloading the dataset, place it under /home/lwp/data/flower/my_flower_5. Each flower class has its own folder.

Each class folder contains the images for that class.
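A quick way to confirm the layout before converting is to count the images in each class folder. The sketch below assumes one subdirectory per class directly under the dataset folder. The function name `count_images_per_class` and the extension list are illustrative and not part of the slim scripts.

```python
import os

# Extensions treated as images here; this list is an assumption.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def count_images_per_class(root):
    """Return {class_name: image_count} for each subdirectory of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files such as a license or labels file
        counts[name] = sum(
            1 for f in os.listdir(class_dir)
            if f.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts


# Example (path taken from the text above):
# print(count_images_per_class("/home/lwp/data/flower/my_flower_5"))
```

A class with far fewer images than the others, or a count of zero, usually points to a misplaced folder.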

The model directory source/models/slim contains a script, convert_tfrecord.sh, with the following content:

source env_set.sh
python download_and_convert_data.py \
  --dataset_name=$DATASET_NAME \
  --dataset_dir=$DATASET_DIR

The variables are passed in through env_set.sh, which contains:

export DATASET_NAME=my_flower_5
export DATASET_DIR=/home/lwp/data/flower
export CHECKPOINT_PATH=/home/lwp/pre_trained/inception_v4.ckpt
export TRAIN_DIR=/tmp/my_train_20170725

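The file defines:

  • DATASET_NAME: name of the dataset
  • DATASET_DIR: path to the dataset
  • CHECKPOINT_PATH: path to the pre-trained inception_v4 model
  • TRAIN_DIR: directory where the checkpoints generated by training are saved

With the environment variables configured, change into the model directory: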

$ cd source/models/slim

Run the script:

$ ./convert_tfrecord.sh

Once it finishes, the data is ready.

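Preparing the pre-trained model

Extract the downloaded Inception_v4 checkpoint into /home/lwp/pre_trained, so that it matches the CHECKPOINT_PATH set in env_set.sh (/home/lwp/pre_trained/inception_v4.ckpt).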

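Running the training script

(This assumes the model-related parameters have already been set, for example in the training script run_train.sh, the evaluation script run_eval.sh, and the environment file env_set.sh.)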

$ ./run_train.sh

run_train.sh contains:

source env_set.sh

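# Fine-tune only the final layers: the Logits and AuxLogits scopes are not
# restored from the checkpoint and are the only trainable scopes.
# nohup and the trailing & keep the job running in the background;
# all output is redirected to output.log.
nohup python -u train_image_classifier.py \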
  --dataset_name=$DATASET_NAME \
  --dataset_dir=$DATASET_DIR \
  --checkpoint_path=$CHECKPOINT_PATH \
  --model_name=inception_v4 \
  --checkpoint_exclude_scopes=InceptionV4/Logits,InceptionV4/AuxLogits/Aux_logits \
  --trainable_scopes=InceptionV4/Logits,InceptionV4/AuxLogits/Aux_logits \
  --train_dir=$TRAIN_DIR \
  --learning_rate=0.001 \
  --learning_rate_decay_factor=0.76 \
  --num_epochs_per_decay=50 \
  --moving_average_decay=0.9999 \
  --optimizer=adam \
  --ignore_missing_vars=True \
  --batch_size=32 > output.log 2>&1 &
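To see what the decay flags imply, the sketch below reproduces a staircase exponential schedule in plain Python. It assumes the script's default decay type is exponential with staircase steps, which is my understanding of train_image_classifier.py but worth checking against your copy. The training-set size of 3320 images is a made-up placeholder, since the text does not state it.

```python
def decay_steps(num_train_samples, batch_size, num_epochs_per_decay):
    """Number of training steps between two learning-rate drops."""
    return int(num_train_samples / batch_size * num_epochs_per_decay)


def learning_rate_at(step, base_lr, decay_factor, steps_per_decay):
    """Staircase decay: multiply by decay_factor every steps_per_decay steps."""
    return base_lr * decay_factor ** (step // steps_per_decay)


# Values from run_train.sh, except num_train_samples, which is assumed.
steps = decay_steps(num_train_samples=3320, batch_size=32, num_epochs_per_decay=50)
for step in (0, 5000, 11028):
    print(step, learning_rate_at(step, 0.001, 0.76, steps))
```

Under these assumptions the rate drops only once every few thousand steps, so a short fine-tuning run sees a learning rate close to the initial 0.001.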
$ tail -f output.log # follow the log as it is written
# or
$ cat output.log # print the whole log file at once

The log looks like this:

INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
INFO:tensorflow:Fine-tuning from /home/lwp/pre_trained/inception_v4.ckpt
2017-07-27 08:32:08.547822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 08:32:08.547847: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 08:32:08.547868: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 08:32:08.547887: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 08:32:08.547892: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 08:32:08.861766: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-27 08:32:08.862322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.58GiB
2017-07-27 08:32:08.862342: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-07-27 08:32:08.862350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-07-27 08:32:08.862359: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from /home/lwp/pre_trained/inception_v4.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /tmp/my_train_20170725/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 10: loss = 2.9544 (0.277 sec/step)
INFO:tensorflow:global step 20: loss = 2.7159 (0.267 sec/step)
INFO:tensorflow:global step 30: loss = 3.0572 (0.261 sec/step)
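Because the loss lines follow a fixed pattern, they are easy to extract for plotting or a quick sanity check. The sketch below is one way to do it with a regular expression. `parse_training_log` is an illustrative helper, and the pattern is inferred only from the log lines shown above.

```python
import re

# Matches lines such as:
# INFO:tensorflow:global step 10: loss = 2.9544 (0.277 sec/step)
STEP_LINE = re.compile(
    r"global step (\d+): loss = ([0-9.]+) \(([0-9.]+) sec/step\)"
)


def parse_training_log(lines):
    """Return a list of (step, loss, sec_per_step) tuples found in lines."""
    records = []
    for line in lines:
        match = STEP_LINE.search(line)
        if match:
            records.append(
                (int(match.group(1)), float(match.group(2)), float(match.group(3)))
            )
    return records


sample = [
    "INFO:tensorflow:Starting Queues.",
    "INFO:tensorflow:global step 10: loss = 2.9544 (0.277 sec/step)",
    "INFO:tensorflow:global step 20: loss = 2.7159 (0.267 sec/step)",
]
print(parse_training_log(sample))
```

To process the real log, pass an open file instead of the sample list, for example `parse_training_log(open("output.log"))`.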

The checkpoints generated by training (meta, data, and index files) appear under /tmp/my_train_20170725. This path is defined as TRAIN_DIR in the environment script env_set.sh.

Running the evaluation script

$ ./run_eval.sh

run_eval.sh contains:

source env_set.sh
python -u eval_image_classifier.py \
  --dataset_name=$DATASET_NAME \
  --dataset_dir=$DATASET_DIR \
  --dataset_split_name=validation \
  --model_name=inception_v4 \
  --checkpoint_path=$TRAIN_DIR \
  --eval_dir=/tmp/eval/validation \
  --eval_interval_secs=60 \
  --batch_size=32 

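Here eval_interval_secs=60 sets the minimum interval between two evaluations to 60 s; it is defined in eval_image_classifier.py.

Training and evaluation are separate programs here. Early in training the model is necessarily poor and does not need to be evaluated, and training runs for a long time, so building evaluation into the training loop would waste a lot of time on pointless evaluations.
Keeping them separate also means another machine can access the checkpoints (at /tmp/my_train_20170725) and run the evaluation, which spreads the load across resources.

The output looks like this: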

.
.
.
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 2.24GiB
2017-07-27 09:27:33.151287: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-07-27 09:27:33.151292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-07-27 09:27:33.151299: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
INFO:tensorflow:Restoring parameters from /tmp/my_train_20170725/model.ckpt-11028
INFO:tensorflow:Starting evaluation at 2017-07-27-01:27:47
2017-07-27 09:27:49.207742: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.51GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
INFO:tensorflow:Evaluation [1/12]
INFO:tensorflow:Evaluation [2/12]
INFO:tensorflow:Evaluation [3/12]
INFO:tensorflow:Evaluation [4/12]
INFO:tensorflow:Evaluation [5/12]
INFO:tensorflow:Evaluation [6/12]
INFO:tensorflow:Evaluation [7/12]
INFO:tensorflow:Evaluation [8/12]
INFO:tensorflow:Evaluation [9/12]
INFO:tensorflow:Evaluation [10/12]
INFO:tensorflow:Evaluation [11/12]
INFO:tensorflow:Evaluation [12/12]
INFO:tensorflow:Finished evaluation at 2017-07-27-01:27:56
2017-07-27 09:27:57.363998: I tensorflow/core/kernels/logging_ops.cc:79] eval/Recall_5[1]
2017-07-27 09:27:57.364187: I tensorflow/core/kernels/logging_ops.cc:79] eval/Accuracy[0.87760419]
INFO:tensorflow:Waiting for new checkpoint at /tmp/my_train_20170725
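One detail in this output deserves a note: eval/Recall_5 is exactly 1. If Recall_5 measures whether the true label is among the five highest-scoring classes (my reading of the metric name; check eval_image_classifier.py to confirm), then with only 5 classes it is always 1 and carries no information, so Accuracy is the number to watch. The plain-Python sketch below illustrates this with made-up scores. `label_in_top_k` is a hypothetical helper, not a TensorFlow function.

```python
def label_in_top_k(scores, label, k):
    """True if index `label` is among the k highest entries of scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return label in ranked[:k]


# Made-up scores for one image over 5 classes; the true class is index 3.
scores = [0.05, 0.60, 0.20, 0.10, 0.05]
print(label_in_top_k(scores, label=3, k=1))  # top-1 misses: class 1 scores highest
print(label_in_top_k(scores, label=3, k=5))  # top-5 covers every class
```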

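Evaluation loop
The log shows the evaluation results. Note the last line, Waiting for new checkpoint at /tmp/my_train_20170725: eval_image_classifier.py defines a custom loop that watches /tmp/my_train_20170725 and runs an evaluation whenever a new checkpoint is generated.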
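The idea of such a loop can be sketched without TensorFlow. The code below is an illustration only, not the author's eval_image_classifier.py. It assumes checkpoints are named model.ckpt-<step>.index, as in the training directory above. `latest_step`, `watch_checkpoints`, and the `evaluate` callback are hypothetical names.

```python
import os
import re
import time

CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_step(train_dir):
    """Return the highest checkpoint step found in train_dir, or None."""
    steps = [
        int(m.group(1))
        for m in (CKPT_INDEX.match(name) for name in os.listdir(train_dir))
        if m
    ]
    return max(steps) if steps else None


def watch_checkpoints(train_dir, evaluate, interval_secs=60, max_polls=None):
    """Call evaluate(step) once for each new checkpoint step that appears.

    max_polls bounds the loop (useful for testing); None means poll forever.
    """
    last_seen = None
    polls = 0
    while max_polls is None or polls < max_polls:
        step = latest_step(train_dir)
        if step is not None and step != last_seen:
            evaluate(step)
            last_seen = step
        time.sleep(interval_secs)  # wait at least this long between polls
        polls += 1
```

A call such as `watch_checkpoints("/tmp/my_train_20170725", run_eval)` would then invoke a user-supplied `run_eval(step)` each time training writes a newer checkpoint.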

Visualizing training: TensorBoard

Run:

$ tensorboard --logdir /tmp/my_train_20170725

Output:

Starting TensorBoard 55 at http://lw:6006
(Press CTRL+C to quit)

Find this machine's IP address:

$ ifconfig -a

Enter the address in a browser:

http://192.168.0.102:6006

If TensorBoard opens but shows no content, try a different browser. In my case Firefox displayed nothing, and switching to Chrome fixed it.

Stopping training

List the python processes.
Run:

$ ps -ef |grep python

Output:

lwp       2780  2025 99 08:31 pts/0    03:38:22 python -u train_image_classifier.py --dataset_name=my_flower_5 --dataset_dir=/home/lwp/data/flower --checkpoint_path=/home/lwp/pre_trained/inception_v4.ckpt --model_name=inception_v4 --checkpoint_exclude_scopes=InceptionV4/Logits,InceptionV4/AuxLogits/Aux_logits --trainable_scopes=InceptionV4/Logits,InceptionV4/AuxLogits/Aux_logits --train_dir=/tmp/my_train_20170725 --learning_rate=0.001 --learning_rate_decay_factor=0.76 --num_epochs_per_decay=50 --moving_average_decay=0.9999 --optimizer=adam --ignore_missing_vars=True --batch_size=32
lwp      18830  3674  1 09:40 pts/2    00:00:15 /usr/bin/python /usr/local/bin/tensorboard --logdir /tmp/my_train_20170725
lwp      24837  2763  0 09:53 pts/0    00:00:00 grep --color=auto python

The training process has PID 2780.

Kill that process to stop training:

$ kill 2780

Exporting and using the model

Exporting the model
Run the script:

$ ./export_freeze.sh

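This produces 3 files, which store the model's labels (my_inception_v4_freeze.label), weights (my_inception_v4_freeze.pb, the frozen graph), and structure (my_inception_v4.pb).

export_freeze.sh contains: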

source env_set.sh
python -u export_inference_graph.py \
  --model_name=inception_v4 \
  --output_file=./my_inception_v4.pb \
  --dataset_name=$DATASET_NAME \
  --dataset_dir=$DATASET_DIR


# Pick the most recently modified checkpoint file, then strip its last
# extension (.meta, .index, or .data-*) to get the checkpoint prefix.
NEWEST_CHECKPOINT=$(ls -t1 $TRAIN_DIR/model.ckpt* | head -n1)
NEWEST_CHECKPOINT=${NEWEST_CHECKPOINT%.*}
python -u ~/tensorflow/tensorflow/python/tools/freeze_graph.py \
  --input_graph=my_inception_v4.pb \
  --input_checkpoint=$NEWEST_CHECKPOINT \
  --output_graph=./my_inception_v4_freeze.pb \
  --input_binary=True \
  --output_node_names=InceptionV4/Logits/Predictions

cp $DATASET_DIR/labels.txt ./my_inception_v4_freeze.label
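The copied label file maps class indices to names. Assuming it uses the `index:name` line format that slim's dataset utilities write (an assumption for this custom dataset; inspect your labels.txt to confirm), it can be read as sketched below. `parse_labels` is an illustrative helper.

```python
def parse_labels(text):
    """Parse lines of the form '0:daisy' into {0: 'daisy', ...}."""
    labels = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        index, name = line.split(":", 1)
        labels[int(index)] = name
    return labels


# Example content; the class names are those of the flower_photos dataset.
example = "0:daisy\n1:dandelion\n2:roses\n3:sunflowers\n4:tulips\n"
labels = parse_labels(example)
print(labels[3])
```

A server can use such a mapping to turn the index of the highest score into a readable class name.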

Using the model
A Python-based web server
Run the script:

$ ./server.sh

Output:

listening on port 5001
2017-07-27 10:04:54.279779: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 10:04:54.279800: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 10:04:54.279806: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 10:04:54.279810: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 10:04:54.279814: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-27 10:04:54.411389: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-27 10:04:54.411804: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:01:00.0
Total memory: 10.91GiB
Free memory: 10.50GiB
2017-07-27 10:04:54.411818: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-07-27 10:04:54.411822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-07-27 10:04:54.411828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0)
 * Running on http://0.0.0.0:5001/ (Press CTRL+C to quit)

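Enter the address in a browser:

http://<this machine's IP>:5001

Select an image and upload it; the recognition result is then displayed.
(Note: uploaded images are stored under /tmp/upload, as set in server.sh.)

server.sh contains: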

python -u server.py \
  --model_name=my_inception_v4_freeze.pb \
  --label_file=my_inception_v4_freeze.label \
  --upload_folder=/tmp/upload

The details are defined in server.py.

The result gives a score for each of the 5 classes; the image is recognized as sunflowers with a score of 0.79741.

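Some thoughts: what we built is a 5-way classifier over several kinds of flowers. Suppose we now have a picture of a cat, which falls outside the labels the model was trained on. What happens when the model predicts on an object it has no label for?
Let's try it.

The model again returns class scores, but the cat is of course not a dandelion. This is a common problem with current image recognition models: the model does not know what it does not know. A person sorting these 5 flower classes would, on seeing the cat, say it is not a flower, and on seeing an unfamiliar flower outside the 5 classes would say they do not recognize it or that it belongs to none of them. The model cannot do that at present; it will always end up assigning the cat to one of the flower classes.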
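One reason for this behaviour is that a softmax output always sums to 1 over the known classes, so some flower class must receive the largest share. A simple and imperfect mitigation, which is not part of the original setup, is to reject predictions whose top score falls below a threshold. The sketch below uses made-up logits and an arbitrary threshold of 0.5. A network can still be confidently wrong on unfamiliar inputs, so treat this as a baseline rather than a fix.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_reject(logits, class_names, threshold=0.5):
    """Return (name, prob) for the top class, or ('unknown', prob) below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return "unknown", probs[best]
    return class_names[best], probs[best]


names = ["daisy", "dandelion", "roses", "sunflowers", "tulips"]
print(classify_with_reject([0.1, 0.4, 0.2, 0.3, 0.1], names))  # nearly flat scores
print(classify_with_reject([0.0, 0.5, 0.1, 4.0, 0.2], names))  # one clear winner
```

The first call is rejected because no class stands out, while the second passes the threshold. Choosing the threshold requires held-out examples of both known and unknown inputs.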