
A Simple Example of Generating Text with an RNN (Detailed Walkthrough)

Encoding the characters of the text

import time
from collections import namedtuple

import numpy as np
import tensorflow as tf

with open('anna.txt', 'r') as f:
    text=f.read()
vocab = sorted(set(text))  # set() collects the distinct characters of the text, sorted() orders them
vocab_to_int = {c: i for i, c in enumerate(vocab)}  # map each character in the sorted list to an index
int_to_vocab = dict(enumerate(vocab))  # the reverse mapping: index as key, character as value
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)  # encode every character of the text as an integer
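
As a quick, purely illustrative sanity check, we can print the vocabulary size and verify that encoding and decoding round-trip:

print('vocabulary size:', len(vocab))
print('first 100 encoded characters:', encoded[:100])
decoded = ''.join(int_to_vocab[i] for i in encoded[:100])
print(decoded == text[:100])   # True: the mapping is lossless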

Generating mini-batches from the data

Define a function that takes the encoded text and generates batches; the number of sequences per batch and the number of steps per sequence are hyperparameters.

def get_batches(arr, n_seqs, n_steps):

    # Use the number of sequences and steps to get the batch size, compute how many
    # full batches fit, and drop the trailing characters that do not fill a batch
    characters_per_batch = n_seqs * n_steps
    n_batches = len(arr)//characters_per_batch
    arr = arr[:n_batches * characters_per_batch]

    # Reshape to n_seqs rows; the number of columns is inferred automatically (-1)
    arr = arr.reshape((n_seqs, -1))

    # Generate feature batches and target batches (the target is the next character)
    for n in range(0, arr.shape[1], n_steps):
        x = arr[:, n:n+n_steps]
        y = np.zeros_like(x)
        # Targets are the features shifted by one character; the last column of the
        # target batch can be set to the first column of the feature batch without
        # noticeably affecting accuracy
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]
        # x, y are produced by a generator
        yield x, y
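
A quick look at the generator's output (illustrative batch sizes, reusing encoded from above):

batches = get_batches(encoded, 10, 50)
x, y = next(batches)
print(x.shape, y.shape)   # both (10, 50)
print(x[0, :10])          # first 10 encoded characters of the first sequence
print(y[0, :10])          # the same characters shifted one step to the left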

Building the input layer

Create placeholders for the inputs and targets, plus a keep_prob placeholder (used by the dropout layers).

def build_inputs(batch_size, num_steps):
    '''batch_size is the number of sequences per batch (the number of rows)
        num_steps is the number of columns (the steps per sequence)
    '''
    inputs = tf.placeholder(tf.int32, [batch_size, num_steps], name='inputs')
    targets = tf.placeholder(tf.int32, [batch_size, num_steps], name='targets')
    keep_prob = tf.placeholder(tf.float32, name='keep_prob')

    return inputs, targets, keep_prob

Building the LSTM cells

  1. Create the LSTM cells for the hidden layers with tf.contrib.rnn.BasicLSTMCell(num_units)
  2. Wrap each cell in dropout with tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    For why this helps, see Wojciech Zaremba's paper: Recurrent Neural Network Regularization

    Dropout is not applied to the recurrent part of the RNN, i.e. the memory carried from time step t-1 to time step t is left untouched; dropout is only applied when information is passed between stacked cells within the same time step t

  3. Stack multiple LSTM layers
  4. Initialize the cell state to zeros

def build_lstm(lstm_size, num_layers, batch_size, keep_prob):

    # Build a single LSTM cell wrapped in dropout

    def build_cell(lstm_size, keep_prob):
        # Use a basic LSTM cell
        lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)

        # Add dropout to the cell
        drop = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return drop


    # Stack up multiple LSTM layers, for deep learning
    cell = tf.contrib.rnn.MultiRNNCell([build_cell(lstm_size, keep_prob) for _ in range(num_layers)])
    initial_state = cell.zero_state(batch_size, tf.float32)

    return cell, initial_state

Building the output layer

Connecting the RNN cells to a fully connected layer with a softmax output gives a probability distribution over the next character.

If a batch has N sequences of M steps each and the hidden layer has L units, the LSTM output is a 3-D tensor of size N×M×L: each of the N sequences contributes M time steps, and each step produces an output of size L.

For the fully connected layer this is reshaped to (N×M)×L: each row corresponds to one (sequence, step) pair, and the columns correspond to the L LSTM units.
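
A toy numpy sketch of this reshape (sizes are illustrative only):

N, M, L = 2, 3, 4                             # 2 sequences, 3 steps, 4 hidden units
lstm_output = np.arange(N * M * L).reshape((N, M, L))
flat = lstm_output.reshape((-1, L))           # shape (6, 4): one row per (sequence, step) pair
print(flat.shape)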

The softmax weights and bias are wrapped in a variable scope so TensorFlow can keep them apart from the variables already created inside the LSTM cells; reusing those variable names without a scope raises an error.

def build_output(lstm_output, in_size, out_size):

    # reshape
    seq_output = tf.concat(lstm_output, axis=1)
    x = tf.reshape(seq_output, [-1, in_size])

    # Connect the RNN outputs to the softmax layer
    with tf.variable_scope('softmax'):
        softmax_w = tf.Variable(tf.truncated_normal((in_size, out_size), stddev=0.1))
        softmax_b = tf.Variable(tf.zeros(out_size))

    logits = tf.matmul(x, softmax_w) + softmax_b
    out = tf.nn.softmax(logits, name='predictions')

    return out, logits

Training loss

Compute the cross-entropy loss between the targets and the predictions:
1. One-hot encode the targets
2. Reshape the targets to match the shape of the logits
3. Pass the logits and targets to the softmax cross-entropy loss function

def build_loss(logits, targets, lstm_size, num_classes):

    y_one_hot = tf.one_hot(targets, num_classes)
    y_reshaped = tf.reshape(y_one_hot, logits.get_shape())

    loss = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y_reshaped)
    loss = tf.reduce_mean(loss)

    return loss

Optimization

LSTMs largely avoid the vanishing-gradient problem (ordinary RNNs suffer from both exploding and vanishing gradients), but their gradients can still grow without bound. Gradient clipping handles exploding gradients: a threshold is set as an upper bound, and when the gradient norm exceeds it, the gradients are rescaled back down to that threshold. An AdamOptimizer is used for learning.

def build_optimizer(loss, learning_rate, grad_clip):

    tvars = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), grad_clip)
    train_op = tf.train.AdamOptimizer(learning_rate)
    optimizer = train_op.apply_gradients(zip(grads, tvars))

    return optimizer

Building the network

tf.nn.dynamic_rnn allows sequence lengths to differ between batches; unlike tf.nn.rnn it does not fix the graph to one size and pad sequences to a common length, which wastes space. It also builds the graph dynamically, whereas tf.nn.rnn builds a static graph that is slower and consumes more resources.

class CharRNN:

    def __init__(self, num_classes, batch_size=64, num_steps=50, lstm_size=128,
                num_layers=2, learning_rate=0.001, grad_clip=5, sampling=False):

        if sampling == True:
            batch_size, num_steps = 1, 1
        else:
            batch_size, num_steps = batch_size, num_steps

        tf.reset_default_graph()

        self.inputs, self.targets, self.keep_prob = build_inputs(batch_size, num_steps)
        cell, self.initial_state = build_lstm(lstm_size, num_layers, batch_size, self.keep_prob)

        x_one_hot = tf.one_hot(self.inputs, num_classes)

        outputs, state = tf.nn.dynamic_rnn(cell, x_one_hot, initial_state=self.initial_state)
        self.final_state = state

        self.prediction, self.logits = build_output(outputs, lstm_size, num_classes)

        self.loss = build_loss(self.logits, self.targets, lstm_size, num_classes)
        self.optimizer = build_optimizer(self.loss, learning_rate, grad_clip)

Hyperparameters

The hyperparameters involved:
- batch_size: the number of sequences passed through the network at once, i.e. the number of rows of a batch
- num_steps: the step length, i.e. the number of columns of a batch, which is the number of characters per sequence. Generally larger is better, since more characters let the model learn longer-range dependencies, but training takes longer; 100 is a common choice
- lstm_size: the number of LSTM units in each hidden layer
- num_layers: the number of hidden layers
- learning_rate: the learning rate
- keep_prob: the dropout keep probability; if the model overfits, lower keep_prob (i.e. apply more dropout)

Andrej Karpathy's rules of thumb:
- set num_layers to 2 or 3
- set lstm_size according to the amount of data and the number of model parameters:
  - print the number of model parameters before training (see the sketch after this list)
  - dataset size: a 1 MB file holds roughly one million characters
  - then keep the parameter count and the data size on the same order of magnitude, for example:
    - a 100 MB dataset with 150k parameters: the data dwarfs the parameter count, so the model will very likely underfit; in this case make lstm_size larger
    - a 100 MB dataset with 10 million parameters: watch the validation loss, and if it is much larger than the training loss, try increasing dropout
- the above is excerpted from the "Approximate number of parameters" section of the original char-rnn README
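
A minimal sketch of that parameter-count check (the hyperparameter values below are placeholders, not recommendations; it builds a throwaway model and sums the sizes of every trainable variable):

tmp_model = CharRNN(len(vocab), batch_size=10, num_steps=50,
                    lstm_size=128, num_layers=2, learning_rate=0.001)
n_params = sum(int(np.prod(v.get_shape().as_list())) for v in tf.trainable_variables())
print('Trainable parameters:', n_params)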

Here is one possible setting:

batch_size = 100
num_steps = 100
lstm_size = 512
num_layers = 2
learning_rate = 0.0001
keep_prob = 0.5

Training

Feed the inputs and targets into the network and run the optimizer. The final LSTM state is carried over as the initial state for the next batch, and checkpoints are saved periodically during training.

epochs = 20

# Save a checkpoint every 200 steps
save_every_n = 200

model = CharRNN(len(vocab), batch_size=batch_size, num_steps=num_steps, 
               lstm_size=lstm_size, num_layers=num_layers, 
               learning_rate=learning_rate)

saver = tf.train.Saver(max_to_keep=100)  # the maximum number of recent checkpoint files to keep; as new files are created, older files are deleted
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    counter = 0
    for e in range(epochs):
        new_state = sess.run(model.initial_state)
        loss = 0
        for x, y in get_batches(encoded, batch_size, num_steps):
            counter += 1
            start = time.time()
            feed = {model.inputs: x,
                    model.targets: y,
                    model.keep_prob: keep_prob,
                    model.initial_state: new_state}
            batch_loss, new_state, _ = sess.run([model.loss,
                                                 model.final_state,
                                                 model.optimizer],
                                                 feed_dict=feed)
            end = time.time()
            print('Epoch:{}/{}...'.format(e+1, epochs),
                  'Training Step:{}...'.format(counter),
                  'Training loss:{:.4f}...'.format(batch_loss),
                  '{:.4f} sec/batch'.format((end-start)))

            if (counter % save_every_n == 0):
                saver.save(sess, 'checkpoints/i{}.ckpt'.format(counter))

    saver.save(sess, 'checkpoints/i{}.ckpt'.format(counter))

Saved checkpoints

The ckpt files are saved in the checkpoints folder:

tf.train.get_checkpoint_state('checkpoints')
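
The returned CheckpointState can be inspected directly, for example (illustrative; it assumes training has already written files into checkpoints/, and the call returns None otherwise):

ckpt_state = tf.train.get_checkpoint_state('checkpoints')
print(ckpt_state.model_checkpoint_path)        # the most recent checkpoint
print(ckpt_state.all_model_checkpoint_paths)   # all checkpoints still on disk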

Sampling

Once the model is trained, it can be used to generate new passages of text. Feed one character into the model, let it predict the next character, then feed that character back in to predict the one after it. To reduce noise and keep the output from being too random, we only sample from the N most probable characters.

def pick_top_n(preds, vocab_size, top_n=5):
    p = np.squeeze(preds)
    p[np.argsort(p)[:-top_n]] = 0      # zero out all but the top_n probabilities
    p = p / np.sum(p)                  # renormalize
    char = np.random.choice(vocab_size, 1, p=p)[0]
    return char

def sample(checkpoint, n_samples, lstm_size, vocab_size, prime='The '):
    samples = [c for c in prime]
    model = CharRNN(len(vocab), lstm_size=lstm_size, sampling=True)
    saver = tf.train.Saver()
    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        new_state = sess.run(model.initial_state)

        # Run the prime characters through the network to warm up the state
        for c in prime:
            x = np.zeros((1, 1))
            x[0, 0] = vocab_to_int[c]
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction,
                                         model.final_state],
                                         feed_dict=feed)

        c = pick_top_n(preds, len(vocab))
        samples.append(int_to_vocab[c])

        # Generate n_samples further characters, feeding each prediction back in
        for i in range(n_samples):
            x[0, 0] = c
            feed = {model.inputs: x,
                    model.keep_prob: 1.,
                    model.initial_state: new_state}
            preds, new_state = sess.run([model.prediction,
                                         model.final_state],
                                         feed_dict=feed)

            c = pick_top_n(preds, len(vocab))
            samples.append(int_to_vocab[c])

    return ''.join(samples)
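
A toy check of pick_top_n (illustrative values; with top_n=2 only the two most probable characters can ever be drawn):

toy_preds = np.array([[0.05, 0.1, 0.5, 0.3, 0.05]])   # fake prediction over a 5-character vocabulary
print(pick_top_n(toy_preds, vocab_size=5, top_n=2))   # only ever prints index 2 or 3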

Pass the latest ckpt file from the checkpoints folder to the model and generate 2000 characters of new text:

checkpoint = tf.train.latest_checkpoint('checkpoints')
samp = sample(checkpoint, 2000, lstm_size, len(vocab), prime='The')
print(samp)

Results

INFO:tensorflow:Restoring parameters from checkpoints\i3800.ckpt The outntine tore. “Anda was sely wark, whith the south as the simarest insorse, there wo here that wish to den to a selting theme though he drad the coustest an with him, bot the word and the hurs fold her befoul ther the some on the said, bet an wered to ather her and the wist and and a menter and was a ment warked to him hore. Her have thind as she dind to buther and the sainted as the sairs, as the sampat on the said wate and alating on the precissair of the partere aspace the hid her, and as all her thither wored and his said talk in and and a thaid andostant of to thithe alled and at a whangs at that she wit of the her..” “They’s all her that so coure that’s at it tele so do be and hus so did to anden tha marte and stear as is in whe the comprince, “she’s sele thith the mome ano she cusprian of hime fall,. Tho could as she cas a wand of thim. And tall he her hander of her, and athing think and hid bother he had buth horssing and his and her to deer alday, hid a cored her brtairiad hem, and souddy then troute her, her sond anore the caster. He could not him atrentse his befurtale and whate he chanded her and ta see at an with the cruct and the bristed and a to dit wo him. “I’s serther as the consere a dear and stice to the paster a sender on thought he was not in the compiout ta his andensted at as aldot and with his sore tare to the sore, and her and the pare al and and the which here was her a same, shich were than in thin whele and the pains in at hishing a shatted in thinge of the mest of had, sunding with a mant of trear the casions of tith his. He wis to sain he coull has had her had same the midly of the pand that she sand on tee it, and to been the sering anctreat har and same of the sarint to and the has tonk the her thote wile wat say hes woold has. Ande sald not antions.

As the result above shows, prepositions, conjunctions, and pronouns are mostly spelled correctly and used in grammatically plausible positions; only longer words such as adjectives and nouns still contain spelling errors. The model basically runs end to end; what remains is tuning.