TensorFlow Study Notes: Combining Training Data into Batches

阿新 • Published: 2019-02-10

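TensorFlow data preprocessing operations: http://blog.csdn.net/lovelyaiq/article/details/78716325
After TensorFlow reads records from a TFRecord file and preprocesses them, each record is still a single example. A network normally takes its input one batch at a time, so the single examples have to be combined into batches before they can be fed to the network.
TensorFlow provides four functions for combining training data: tf.train.batch() and tf.train.shuffle_batch(), plus tf.train.batch_join() and tf.train.shuffle_batch_join(). They come in two pairs because they target two situations. With tf.train.batch, all num_threads threads run the same reading and preprocessing ops, so it is generally chosen when several threads work on a single file (tf.train.shuffle_batch is the counterpart that shuffles the examples). tf.train.batch_join takes a list of tensor groups and enqueues each group from its own thread, so it is generally chosen when several threads read several files (tf.train.shuffle_batch_join is the shuffling counterpart). The examples below illustrate both cases. tf.train.batch() and tf.train.shuffle_batch() each create a queue; the enqueue operation is whatever produces a single example, that is, the preprocessed image and its label.
Let us first look at the definitions of these two functions: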

def batch(tensors, batch_size, num_threads=1, capacity=32,
          enqueue_many=False, shapes=None, dynamic_pad=False,
          allow_smaller_final_batch=False, shared_name=None, name=None):
def shuffle_batch(tensors, batch_size, capacity, min_after_dequeue,
                  num_threads=1, seed=None, enqueue_many=False, shapes=None,
                  allow_smaller_final_batch=False, shared_name=None, name=None):

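The main parameters of these two functions are:
1. tensors: the tensors to enqueue, that is, the preprocessed data and the corresponding label.
2. batch_size: the number of examples in each batch. A larger batch takes more memory.
3. capacity: the maximum number of elements in the queue. When the queue length equals capacity, TensorFlow pauses enqueueing and waits for elements to be dequeued; when the length drops below capacity, enqueueing resumes automatically. If capacity is too large, the queue takes a lot of memory; if it is too small, dequeue operations may block because no data is available, which slows down training.
4. num_threads: the number of threads that read files and run the preprocessing.
5. allow_smaller_final_batch: if True, the final batch is allowed to be smaller than batch_size when not enough input examples remain.
6. min_after_dequeue (shuffle_batch only): the minimum number of elements that must remain in the queue after a dequeue. If too few elements remain, random shuffling has little effect.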
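The blocking behaviour described for capacity can be tried out with Python's standard queue.Queue. This is only an analogy for a bounded queue: TensorFlow's queues are graph ops with their own implementation, and the variable names below are made up for the illustration. The non-blocking put_nowait/get_nowait calls are used so that "would block" shows up as an exception instead of a hang.

```python
import queue

capacity = 3
q = queue.Queue(maxsize=capacity)  # bounded queue, analogous to capacity=3

# Fill the queue up to its capacity.
for item in range(capacity):
    q.put_nowait(item)

# A further enqueue cannot proceed while the queue is full.
try:
    q.put_nowait(99)
    enqueue_refused = False
except queue.Full:
    enqueue_refused = True

# Dequeuing one element frees a slot, so enqueueing works again.
first = q.get_nowait()
q.put_nowait(99)

# Drain the queue; a dequeue on an empty queue cannot proceed either.
drained = [q.get_nowait() for _ in range(q.qsize())]
try:
    q.get_nowait()
    dequeue_refused = False
except queue.Empty:
    dequeue_refused = True

print(enqueue_refused, first, drained, dequeue_refused)
# prints: True 0 [1, 2, 99] True
```

In a real input pipeline the enqueue and dequeue sides run in different threads, so a full queue makes the reader threads wait and an empty queue makes the training step wait.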
The API documentation gives the following example for tf.train.shuffle_batch():

  # Creates batches of 32 images and 32 labels.
  image_batch, label_batch = tf.train.shuffle_batch(
        [single_image, single_label],
        batch_size=32,
        num_threads=4,
        capacity=50000,
        min_after_dequeue=10000)
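To get a feel for why a small min_after_dequeue weakens the shuffling, here is a toy, single-threaded model in plain Python. It is not TensorFlow's implementation: the function simulated_shuffle_batch and everything inside it are hypothetical and only mimic the rule "a dequeue is allowed only if at least min_after_dequeue elements would remain, unless the input is exhausted".

```python
import random

def simulated_shuffle_batch(examples, batch_size, capacity,
                            min_after_dequeue, seed=0):
    """Toy model of a shuffling queue (hypothetical, not TensorFlow code)."""
    if min_after_dequeue >= capacity:
        raise ValueError("this toy model needs capacity > min_after_dequeue")
    rng = random.Random(seed)
    source = iter(examples)
    pending, batch, batches = [], [], []
    exhausted = False
    while not exhausted or pending:
        # Enqueue side: add one element unless the queue is full.
        if not exhausted and len(pending) < capacity:
            try:
                pending.append(next(source))
            except StopIteration:
                exhausted = True
        # Dequeue side: pick a random element, but only if at least
        # min_after_dequeue elements stay behind or no more input is coming.
        if pending and (exhausted or len(pending) > min_after_dequeue):
            batch.append(pending.pop(rng.randrange(len(pending))))
            if len(batch) == batch_size:
                batches.append(batch)
                batch = []
    if batch:  # keep the short last batch, like allow_smaller_final_batch=True
        batches.append(batch)
    return batches

# With min_after_dequeue=0 every element is dequeued as soon as it arrives,
# so the original order is preserved.
in_order = simulated_shuffle_batch(range(6), batch_size=3, capacity=10,
                                   min_after_dequeue=0)
print(in_order)  # prints: [[0, 1, 2], [3, 4, 5]]

# With a larger value each draw is made from a bigger pool of candidates.
mixed = simulated_shuffle_batch(range(20), batch_size=3, capacity=15,
                                min_after_dequeue=10, seed=1)
print(mixed)
```

In this model the pool that each random draw comes from holds roughly min_after_dequeue + 1 elements, which is why a tiny value leaves the output almost in input order.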

# -*- coding: utf-8 -*-
import tensorflow as tf
import os

# Build an int64 feature from a single integer value.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

num_shards = 2
instance_per_shard = 10
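# Write num_shards TFRecord files with instance_per_shard examples each.
tf.gfile.MakeDirs('model')  # make sure the output directory exists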
for i in range(num_shards):
    filename = 'model/data.tfrecord-%.5d-of%.5d' %(i,num_shards)
    writer = tf.python_io.TFRecordWriter(filename)
    for j in range(instance_per_shard):
        example = tf.train.Example(features=tf.train.Features(feature={
            'i':_int64_feature(i),
            'j':_int64_feature(j)
        }))
        writer.write(example.SerializeToString())
    writer.close()


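# Collect all shard files and put their names into a filename queue.
# shuffle=False keeps the file names in the order they are given.
tf_record_pattern = os.path.join('model/', 'data.tfrecord-*')
data_files = tf.gfile.Glob(tf_record_pattern)
filename_quene = tf.train.string_input_producer(data_files, shuffle=False)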

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_quene)

features = tf.parse_single_example(serialized_example,features={
    'i': tf.FixedLenFeature([],tf.int64),
    'j': tf.FixedLenFeature( [], tf.int64),
})

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    # print(sess.run(filename))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess = sess, coord = coord)
    for i in range(15):
        print(sess.run([features['i'],features['j']]))

    coord.request_stop()
    coord.join(threads)
# Output:
[0, 0]
[0, 1]
[0, 2]
[0, 3]
[0, 4]
[0, 5]
[0, 6]
[0, 7]
[0, 8]
[0, 9]
[1, 0]
[1, 1]
[1, 2]
[1, 3]
[1, 4]

The output shows that the records are read sequentially, one file after the other.
Next, we use tf.train.batch to combine the examples into batches:

example,lable = features['i'],features['j']
batch_size = 3
capacity = 1000 + 3 * batch_size

example_batch, label_batch = tf.train.batch([example,lable],batch_size = batch_size, capacity=capacity)

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    # print(sess.run(filename))
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess = sess, coord = coord)
    for i in range(15):
        print(sess.run([example_batch,label_batch]))

    coord.request_stop()
    coord.join(threads)
# Output:
[array([0, 0, 0]), array([0, 1, 2])]
[array([0, 0, 0]), array([3, 4, 5])]
[array([0, 0, 0]), array([6, 7, 8])]
[array([0, 1, 1]), array([9, 0, 1])]
[array([1, 1, 1]), array([2, 3, 4])]
[array([1, 1, 1]), array([5, 6, 7])]
[array([1, 1, 0]), array([8, 9, 0])]
[array([0, 0, 0]), array([1, 2, 3])]
[array([0, 0, 0]), array([4, 5, 6])]
[array([0, 0, 0]), array([7, 8, 9])]
[array([1, 1, 1]), array([0, 1, 2])]
[array([1, 1, 1]), array([3, 4, 5])]
[array([1, 1, 1]), array([6, 7, 8])]
[array([1, 0, 0]), array([9, 0, 1])]
[array([0, 0, 0]), array([2, 3, 4])]
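The pattern in the output above, including batches such as [0, 1, 1] / [9, 0, 1] that straddle two files, can be reproduced without TensorFlow. The sketch below assumes that the records are consumed strictly in file order and that reading starts over from the first file once both files are exhausted; the helper next_batch is a made-up name for this illustration.

```python
from itertools import cycle, islice

num_shards, instance_per_shard, batch_size = 2, 10, 3

# Records in file order: (i, j) = (shard index, position within the shard).
records = [(i, j) for i in range(num_shards) for j in range(instance_per_shard)]
stream = cycle(records)  # assumption: reading loops over the files forever

def next_batch(stream, batch_size):
    """Take batch_size consecutive records and split them into two columns."""
    chunk = list(islice(stream, batch_size))
    return [i for i, _ in chunk], [j for _, j in chunk]

batches = [next_batch(stream, batch_size) for _ in range(15)]
for example_col, label_col in batches[:4]:
    print(example_col, label_col)
# prints:
# [0, 0, 0] [0, 1, 2]
# [0, 0, 0] [3, 4, 5]
# [0, 0, 0] [6, 7, 8]
# [0, 1, 1] [9, 0, 1]
```

Because 20 records do not divide evenly into batches of 3, the batch boundaries drift relative to the file boundaries, which is exactly what the TensorFlow output shows.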

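The output shows that tf.train.batch also reads the files in order: each batch holds three consecutive examples, and once both files have been read, reading starts over from the first file.
With tf.train.shuffle_batch, by contrast, the order of the output is shuffled: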

example_batch, label_batch = tf.train.shuffle_batch([example,lable],batch_size = batch_size, capacity=capacity,min_after_dequeue =10)
# Output:
[array([0, 0, 0]), array([7, 1, 8])]
[array([0, 1, 0]), array([6, 1, 0])]
[array([1, 1, 0]), array([4, 7, 5])]
[array([1, 0, 0]), array([5, 9, 1])]
[array([0, 0, 1]), array([3, 2, 3])]
[array([0, 0, 1]), array([0, 3, 8])]
[array([0, 0, 0]), array([2, 4, 7])]
[array([1, 1, 1]), array([1, 9, 0])]
[array([0, 0, 1]), array([5, 9, 0])]
[array([1, 1, 0]), array([4, 2, 4])]
[array([1, 1, 1]), array([3, 6, 6])]
[array([1, 1, 0]), array([5, 8, 2])]
[array([0, 0, 0]), array([8, 1, 8])]
[array([1, 0, 0]), array([9, 0, 5])]
[array([1, 0, 1]), array([1, 7, 2])]

When multiple threads read multiple files, use tf.train.batch_join or tf.train.shuffle_batch_join instead:

# Shuffle queue that holds the serialized examples produced by the reader.
examples_queue = tf.RandomShuffleQueue(
    capacity=1000 + 3 * batch_size,
    min_after_dequeue=10,
    dtypes=tf.string)

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_quene)

# Listing the enqueue op 4 times makes the QueueRunner start 4 threads
# that all fill examples_queue.
tf.train.queue_runner.add_queue_runner(
    tf.train.queue_runner.QueueRunner(
        examples_queue, [examples_queue.enqueue(serialized_example)] * 4))
example_serialized = examples_queue.dequeue()

images_and_labels = []
for thread_id in range(4):
    # Parse the Example proto dequeued from examples_queue. Parsing
    # serialized_example here would bypass the shuffle queue.
    features = tf.parse_single_example(example_serialized, features={
        'i': tf.FixedLenFeature([], tf.int64),
        'j': tf.FixedLenFeature([], tf.int64),
    })
    example, lable = features['i'], features['j']
    images_and_labels.append([example, lable])

# batch_join starts one enqueue thread per element of images_and_labels.
example_batch, label_batch = tf.train.batch_join(
    images_and_labels,
    batch_size=batch_size,
    capacity=capacity)
# Output:
[array([0, 0, 0]), array([6, 4, 8])]
[array([1, 1, 1]), array([0, 3, 4])]
[array([1, 1, 0]), array([5, 6, 0])]
[array([0, 0, 1]), array([6, 8, 1])]
[array([1, 1, 0]), array([3, 6, 1])]
[array([0, 0, 1]), array([2, 7, 0])]
[array([1, 1, 1]), array([2, 4, 6])]
[array([1, 0, 0]), array([7, 0, 1])]
[array([0, 0, 1]), array([2, 8, 3])]
[array([1, 1, 1]), array([4, 5, 9])]
[array([0, 0, 0]), array([0, 2, 4])]
[array([0, 0, 1]), array([6, 9, 4])]
[array([1, 1, 1]), array([7, 2, 9])]
[array([0, 0, 1]), array([3, 5, 0])]
[array([1, 1, 1]), array([1, 2, 5])]
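As a rough mental model of what batch_join does, the following plain-Python sketch starts one producer thread per shard, lets all of them push into one shared bounded queue, and has a single consumer cut the stream into batches. It uses only the standard threading and queue modules; reader_thread and join_batches are hypothetical names, and TensorFlow's actual queue runners work differently in detail.

```python
import queue
import threading

def reader_thread(records, out_queue):
    """Stand-in for one reader: push every record, then a None sentinel."""
    for record in records:
        out_queue.put(record)  # blocks while the shared queue is full
    out_queue.put(None)

def join_batches(shards, batch_size, capacity):
    """Merge several per-shard producers into batches of batch_size."""
    shared = queue.Queue(maxsize=capacity)
    threads = [threading.Thread(target=reader_thread, args=(shard, shared))
               for shard in shards]
    for t in threads:
        t.start()
    batches, batch, finished = [], [], 0
    while finished < len(threads):
        item = shared.get()
        if item is None:  # one producer has run out of records
            finished += 1
            continue
        batch.append(item)
        if len(batch) == batch_size:
            batches.append(batch)
            batch = []
    for t in threads:
        t.join()
    if batch:  # keep the short last batch
        batches.append(batch)
    return batches

# Two shards with the same (i, j) layout as the TFRecord files above.
shards = [[(i, j) for j in range(10)] for i in range(2)]
joined = join_batches(shards, batch_size=3, capacity=8)
for b in joined:
    print(b)
```

The exact interleaving of the two shards depends on thread scheduling, so it can differ from run to run. Every record still appears exactly once, and within a single shard the relative order is preserved because each producer pushes its records sequentially.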
