Custom problems in tensor2tensor: training a model (BPE edition)

阿新 • Published: 2018-11-05


Previous post: https://blog.csdn.net/hpulfc/article/details/81172498

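The previous post briefly showed how to encode data with Google's SubwordTextEncoder and train a model on it. This post records how to train a model on your own data after segmenting it with BPE. The complete code is at the end; if you are short on time, skip straight to it, since it should be readable on its own!

The plan is to sort out the overall idea and basic structure first, then write the code, then summarize, with some thoughtful caveats attached $_$ / ^_^ .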

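First, an overview of the basic data-generation flow in tensor2tensor.

tensor2tensor is a fairly well-packaged tool in which data generation, training, and decoding are separate steps.

It ships a set of basic executables (the illustration from the original post is not preserved).

The main ones cover: averaging checkpoints, computing BLEU, generating data for a problem, training a model, translating, and so on.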

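The first step is generating the data, and generating data requires a problem definition. The original post showed a list of such definitions here (the illustration is not preserved).

That list mostly consisted of translation tasks that tensor2tensor already defines. These do not always fit your needs, so tensor2tensor lets you define custom tasks, which mainly means registering problems. The previous post already covered the details, so they are not repeated here.

So how do you define a problem? You first need to understand the basic structure of a problem (the original post illustrated it here; the illustration is not preserved).

Because defining a problem is mostly about data generation, start with t2t_datagen.py, where you will find the following code: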

def main(_):
  usr_dir.import_usr_dir(FLAGS.t2t_usr_dir)

  # Calculate the list of problems to generate.
  problems = sorted(
      list(_SUPPORTED_PROBLEM_GENERATORS) + registry.list_problems())
  
  # ... a lot omitted ...

  for problem in problems:
    set_random_seed()

    if problem in _SUPPORTED_PROBLEM_GENERATORS:
      generate_data_for_problem(problem)
    else:
      generate_data_for_registered_problem(problem)

Two names matter here: generate_data_for_problem and generate_data_for_registered_problem. The latter, together with its helper generate_data_in_process, looks like this:

def generate_data_in_process(arg):
  problem_name, data_dir, tmp_dir, task_id = arg
  problem = registry.problem(problem_name)
  problem.generate_data(data_dir, tmp_dir, task_id)


def generate_data_for_registered_problem(problem_name):
  """Generate data for a registered problem."""
  tf.logging.info("Generating data for %s.", problem_name)
  # ... again, a lot omitted ...
  if task_id is None and problem.multiprocess_generate:
    # ... yes, omitted ...
    pool.map(generate_data_in_process, args)
  else:
    problem.generate_data(data_dir, tmp_dir, task_id)

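As the code above shows, data generation comes down to the problem's generate_data method, that is, Problem.generate_data. Different kinds of problems implement it differently; for text-to-text problems it is implemented like this: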

class Text2TextProblem(problem.Problem):

  .....

  def generate_data(self, data_dir, tmp_dir, task_id=-1):

    ......

    if self.is_generate_per_split:
      for split, paths in split_paths:
        generator_utils.generate_files(
            self._maybe_pack_examples(
                self.generate_encoded_samples(data_dir, tmp_dir, split)), paths)
    else:
      generator_utils.generate_files(
          self._maybe_pack_examples(
              self.generate_encoded_samples(
                  data_dir, tmp_dir, problem.DatasetSplit.TRAIN)), all_paths)

    generator_utils.shuffle_dataset(all_paths)

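As the code above shows, generate_encoded_samples produces the encoded samples, which are then written to files. So what matters is how the encoded samples are produced.

For text-to-text problems, the default looks like this~: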

  def generate_encoded_samples(self, data_dir, tmp_dir, dataset_split):
    generator = self.generate_samples(data_dir, tmp_dir, dataset_split)
    encoder = self.get_or_create_vocab(data_dir, tmp_dir)
    return text2text_generate_encoded(generator, encoder,
                                      has_inputs=self.has_inputs)

That is, the function text2text_generate_encoded produces the sample data. Here is text2text_generate_encoded:

def text2text_generate_encoded(sample_generator,
                               vocab,
                               targets_vocab=None,
                               has_inputs=True):
  """Encode Text2Text samples from the generator with the vocab."""
  targets_vocab = targets_vocab or vocab
  for sample in sample_generator:
    if has_inputs:
      sample["inputs"] = vocab.encode(sample["inputs"])
      sample["inputs"].append(text_encoder.EOS_ID)
    sample["targets"] = targets_vocab.encode(sample["targets"])
    sample["targets"].append(text_encoder.EOS_ID)
    yield sample

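It simply pulls samples from the generator, encodes each one with the encoder, and yields the result.

In other words, the key point is that generate_encoded_samples must yield values of this shape: dicts whose "inputs" and "targets" entries are lists of token ids, each ending with EOS_ID.

So what matters is understanding what each function's arguments and return values mean; then, when defining your own data, you only need to construct the return values that are required.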
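To make the shape of those return values concrete, here is a minimal sketch that mirrors text2text_generate_encoded without depending on tensorflow. ToyEncoder and generate_encoded are hypothetical stand-ins written for this illustration, not tensor2tensor APIs; the only value taken from the library is EOS_ID = 1.

```python
EOS_ID = 1  # same value as text_encoder.EOS_ID in tensor2tensor


class ToyEncoder:
    """Hypothetical stand-in for a vocabulary encoder: token -> index."""

    def __init__(self, tokens):
        self._token_to_id = {tok: i for i, tok in enumerate(tokens)}

    def encode(self, sentence):
        # Whitespace-split the sentence and look up each token's index.
        return [self._token_to_id[tok] for tok in sentence.split()]


def generate_encoded(sample_generator, vocab, targets_vocab=None):
    """Encode each sample and append EOS, using separate vocabularies if given."""
    targets_vocab = targets_vocab or vocab
    for sample in sample_generator:
        yield {
            "inputs": vocab.encode(sample["inputs"]) + [EOS_ID],
            "targets": targets_vocab.encode(sample["targets"]) + [EOS_ID],
        }


source_vocab = ToyEncoder(["<pad>", "<EOS>", "hello", "world"])
target_vocab = ToyEncoder(["<pad>", "<EOS>", "你好", "世界"])
samples = [{"inputs": "hello world", "targets": "你好 世界"}]

encoded = list(generate_encoded(iter(samples), source_vocab, target_vocab))
print(encoded)  # [{'inputs': [2, 3, 1], 'targets': [2, 3, 1]}]
```

Each yielded dict carries id lists rather than strings, and both lists end with the EOS id; that is all the file-writing step needs.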

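An aside: if you do not have a good work/study environment, try to change it. Yep!

OK, moving on. From the code above, the default generate_encoded_samples calls generate_samples and get_or_create_vocab to obtain the samples and the encoder respectively, then calls text2text_generate_encoded to encode them into the required content and return it. The default implementation assumes a single vocabulary, meaning the source and target languages share one vocabulary (note that the custom version later changes this), so there is only one encoder.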

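First, the encoder. It is an instance of one of two encoder classes in /tensor2tensor/data_generators/text_encoder.py: SubwordTextEncoder or TokenTextEncoder. These classes encode and decode, where encoding and decoding mean converting text to vocabulary indices and back; this is not the same thing as the encoder and decoder inside the model. For how to use them, see the implementation code later in this post.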
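As a rough picture of what such an encoder does, the sketch below maps whitespace-separated tokens to indices and back, replacing out-of-vocabulary tokens with a designated token. ToyTokenEncoder is a hypothetical class written for this post; it only imitates the behaviour I understand TokenTextEncoder to have with replace_oov set, and is not the library's implementation.

```python
class ToyTokenEncoder:
    """Hypothetical, simplified imitation of a token-level text encoder."""

    def __init__(self, tokens, replace_oov=None):
        # The position of a token in the list is its id.
        self._id_to_token = dict(enumerate(tokens))
        self._token_to_id = {tok: i for i, tok in self._id_to_token.items()}
        self._replace_oov = replace_oov

    def encode(self, sentence):
        tokens = sentence.strip().split()
        if self._replace_oov is not None:
            # Map every unknown token to the replacement token.
            tokens = [t if t in self._token_to_id else self._replace_oov
                      for t in tokens]
        return [self._token_to_id[t] for t in tokens]

    def decode(self, ids):
        return " ".join(self._id_to_token[i] for i in ids)


enc = ToyTokenEncoder(["<pad>", "<EOS>", "low@@", "er", "UNK"],
                      replace_oov="UNK")
ids = enc.encode("low@@ er zebra")
print(ids)              # [2, 3, 4]
print(enc.decode(ids))  # low@@ er UNK
```

Note that the round trip is lossy for unknown tokens: "zebra" comes back as "UNK", which is why the vocabulary size matters.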

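Having covered how to obtain the encoder, next is how to obtain the samples. There is no default implementation for this, but there are existing examples to consult, so it is not too bad. Samples come from generate_samples; the implementation in TranslateEndeWmtBpe32k serves as a reference: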

@registry.register_problem
class TranslateEndeWmtBpe32k(translate.TranslateProblem):

  # ... obviously, some parts omitted ... (off topic: how do small and big companies differ? Experts, please share.)

  def generate_samples(self, data_dir, tmp_dir, dataset_split):
    """Instance of token generator for the WMT en->de task, training set."""
    train = dataset_split == problem.DatasetSplit.TRAIN
    dataset_path = ("train.tok.clean.bpe.32000"
                    if train else "newstest2013.tok.bpe.32000")
    train_path = _get_wmt_ende_bpe_dataset(tmp_dir, dataset_path)

    # Vocab
    token_path = os.path.join(data_dir, self.vocab_filename)
    if not tf.gfile.Exists(token_path):
      token_tmp_path = os.path.join(tmp_dir, self.vocab_filename)
      tf.gfile.Copy(token_tmp_path, token_path)
      with tf.gfile.GFile(token_path, mode="r") as f:
        vocab_data = "<pad>\n<EOS>\n" + f.read() + "UNK\n"
      with tf.gfile.GFile(token_path, mode="w") as f:
        f.write(vocab_data)

    return text_problems.text2text_txt_iterator(train_path + ".en",
                                                train_path + ".de")

The important part of the code above is the final statement, which calls the following:

def txt_line_iterator(txt_path):
  """Iterate through lines of file."""
  with tf.gfile.Open(txt_path) as f:
    for line in f:
      yield line.strip()


def text2text_txt_iterator(source_txt_path, target_txt_path):
  """Yield dicts for Text2TextProblem.generate_samples from lines of files."""
  for inputs, targets in zip(
      txt_line_iterator(source_txt_path), txt_line_iterator(target_txt_path)):
    yield {"inputs": inputs, "targets": targets}

That is, generate_samples supplies the paths of the parallel corpus, and text2text_txt_iterator then yields the corresponding samples.
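The same iteration pattern can be tried without tensorflow. The sketch below uses the built-in open instead of tf.gfile and writes two tiny files to a temporary directory; the file name toy-bpe.zh-en and the function name parallel_iterator are made up for the example.

```python
import os
import tempfile


def txt_line_iterator(txt_path):
    """Yield stripped lines from a UTF-8 text file."""
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            yield line.strip()


def parallel_iterator(source_txt_path, target_txt_path):
    """Yield {"inputs", "targets"} dicts from two line-aligned files."""
    for inputs, targets in zip(txt_line_iterator(source_txt_path),
                               txt_line_iterator(target_txt_path)):
        yield {"inputs": inputs, "targets": targets}


with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "toy-bpe.zh-en")  # hypothetical file name
    with open(prefix + ".en", "w", encoding="utf-8") as f:
        f.write("hello wor@@ ld\ngood morning\n")
    with open(prefix + ".zh", "w", encoding="utf-8") as f:
        f.write("你好 世界\n早上 好\n")
    # Consume the generator while the files still exist.
    pairs = list(parallel_iterator(prefix + ".en", prefix + ".zh"))

print(pairs[0])  # {'inputs': 'hello wor@@ ld', 'targets': '你好 世界'}
```

Because zip stops at the shorter input, a source and target file with different line counts would be truncated silently, so it is worth checking that both sides have the same number of lines before generating data.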

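With that, the idea is mostly sorted out, which leads to the problem code for training a model with BPE.

Complete code: at the end of the post; if anything is unclear, discuss it in the comments.

Explanation: the code defines a problem from a given parallel corpus and given vocabularies; in other words, it is what generates the data.

The main points are:

1. Two vocabularies. There is one vocabulary for English and one for Chinese; the English-German problem in tensor2tensor uses a single vocabulary, while this one uses two.
2. The text is segmented with BPE and then tokenized, whereas tensor2tensor by default tokenizes with subwords. Since we already have our own vocabularies, the encoders are plain TokenTextEncoder instances.
3. Because the default encoder has been replaced, feature_encoders must be redefined to state which encoder is actually used.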

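After that, you can start training on the data. If you want to see how your results compare with state-of-the-art systems, look here and here (the links were in the original post)!! You're welcome~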

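Note that the parallel-corpus and dev-set file names under the tmp directory must match the names used in the complete code at the end of this post; otherwise an exception is raised.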
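A quick pre-flight check can catch naming mistakes before running data generation. The file names below are read off the complete code at the end of this post (the dataset prefixes plus the .en/.zh suffixes, and the two vocabulary names with approx_vocab_size = 50000); missing_tmp_files is a hypothetical helper, not part of tensor2tensor. The vocabulary files only need to be in the tmp directory if they are not already in data_dir.

```python
import os
import tempfile

# Names derived from ENZH_BPE_DATASETS and the vocab name properties
# in the complete code at the end of this post.
EXPECTED_TMP_FILES = [
    "raw-train-bpe.zh-en.en",
    "raw-train-bpe.zh-en.zh",
    "raw-dev-bpe.zh-en.en",
    "raw-dev-bpe.zh-en.zh",
    "vocab.bpe.en.50000",
    "vocab.bpe.zh.50000",
]


def missing_tmp_files(tmp_dir):
    """Return the expected file names that are absent from tmp_dir."""
    return [name for name in EXPECTED_TMP_FILES
            if not os.path.exists(os.path.join(tmp_dir, name))]


with tempfile.TemporaryDirectory() as tmp_dir:
    # Simulate a tmp dir that has the corpora but not the vocabularies.
    for name in EXPECTED_TMP_FILES[:4]:
        open(os.path.join(tmp_dir, name), "w").close()
    missing = missing_tmp_files(tmp_dir)

print(missing)  # ['vocab.bpe.en.50000', 'vocab.bpe.zh.50000']
```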

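I recommend segmenting the English side with BPE. As for how, there are open-source tools on GitHub; search for "subword" (the original post kindly linked one to lend you a hand). After segmentation, take the 50000 most frequent tokens as the vocabulary. The NLTK toolkit also works well for this. For Chinese, plain word segmentation is enough; thulac is a good choice, with good accuracy and speed.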
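Picking the most frequent tokens does not strictly need NLTK; the standard library is enough. The sketch below counts whitespace-separated tokens in already-segmented lines and keeps the top max_size. build_vocab and the toy corpus are made up for the example; for the real corpus max_size would be 50000, with one token written per line to the vocabulary file.

```python
from collections import Counter


def build_vocab(lines, max_size):
    """Return the max_size most frequent tokens, most frequent first."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by descending count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [token for token, _ in ranked[:max_size]]


corpus = ["the low@@ er the bet@@ ter", "the new@@ er"]
vocab = build_vocab(corpus, max_size=3)
print(vocab)  # ['the', 'er', 'bet@@']
```

The reserved entries are not added here because the complete code in this post prepends <pad> and <EOS> and appends UNK when it copies the vocabulary file from the tmp directory.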

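OK, at this point you should be able to train a decent model. As for how to tune the model hyperparameters, come and discuss!!! WeChat: hpulfc

Also: to get a quick grasp of a project's structure, look at each module's name, inputs, and return values, and think about the whole; you should not go far wrong!!!

Below is the complete code; in principle it can be used as is~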

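import os

import tensorflow as tf

from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import text_encoder
from tensor2tensor.data_generators import text_problems
from tensor2tensor.data_generators import translate
from tensor2tensor.utils import registry


ENZH_BPE_DATASETS = {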
    "TRAIN": "raw-train-bpe.zh-en",
    "DEV": "raw-dev-bpe.zh-en"
}


def get_enzh_bpe_dataset(directory, filename):
    train_path = os.path.join(directory, filename)
    if not (tf.gfile.Exists(train_path + ".en") and
            tf.gfile.Exists(train_path + ".zh")):
        raise Exception("there should be some training/dev data in the tmp dir.")

    return train_path



@registry.register_problem
class TranslateEnzhBpe50k(translate.TranslateProblem):
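    """Adapted from the English-German and English-Chinese problems: the single English-German vocabulary becomes two vocabularies (English and Chinese) for data generation."""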

    @property
    def approx_vocab_size(self):
        return 50000

    @property
    def source_vocab_name(self):
        return "vocab.bpe.en.%d" % self.approx_vocab_size

    @property
    def target_vocab_name(self):
        return "vocab.bpe.zh.%d" % self.approx_vocab_size

    def get_vocab(self, data_dir, is_target=False):
        """Return the encoder for the source vocabulary, or for the target vocabulary if is_target is set."""
        vocab_filename = os.path.join(data_dir, self.target_vocab_name if is_target else self.source_vocab_name)
        if not tf.gfile.Exists(vocab_filename):
            raise ValueError("Vocab %s not found" % vocab_filename)
        return text_encoder.TokenTextEncoder(vocab_filename, replace_oov="UNK")

    def generate_samples(self, data_dir, tmp_dir, dataset_split):
        """Instance of token generator for the WMT en->zh task, training set."""
        train = dataset_split == problem.DatasetSplit.TRAIN
        dataset_path = (ENZH_BPE_DATASETS["TRAIN"] if train else ENZH_BPE_DATASETS["DEV"])
        train_path = get_enzh_bpe_dataset(tmp_dir, dataset_path)

        # Vocab
        src_token_path = (os.path.join(data_dir, self.source_vocab_name), self.source_vocab_name)
        tar_token_path = (os.path.join(data_dir, self.target_vocab_name), self.target_vocab_name)
        for token_path, vocab_name in [src_token_path, tar_token_path]:
            if not tf.gfile.Exists(token_path):
                token_tmp_path = os.path.join(tmp_dir, vocab_name)
                tf.gfile.Copy(token_tmp_path, token_path)
                with tf.gfile.GFile(token_path, mode="r") as f:
                    vocab_data = "<pad>\n<EOS>\n" + f.read() + "UNK\n"
                with tf.gfile.GFile(token_path, mode="w") as f:
                    f.write(vocab_data)

        return text_problems.text2text_txt_iterator(train_path + ".en",
                                                    train_path + ".zh")

    def generate_encoded_samples(self, data_dir, tmp_dir, dataset_split):
        """Produce the encoded samples; data generation relies on this method."""
        generator = self.generate_samples(data_dir, tmp_dir, dataset_split)
        encoder = self.get_vocab(data_dir)
        target_encoder = self.get_vocab(data_dir, is_target=True)
        return text_problems.text2text_generate_encoded(generator, encoder, target_encoder,
                                                        has_inputs=self.has_inputs)

    def feature_encoders(self, data_dir):
        source_token = self.get_vocab(data_dir)
        target_token = self.get_vocab(data_dir, is_target=True)
        return {
            "inputs": source_token,
            "targets": target_token,
        }

A link: http://www.statmt.org/

Custom hyperparameters:

Given the earlier posts, defining hyperparameters should be easy!

Straight to the code:

from tensor2tensor.models.transformer import transformer_base_single_gpu
from tensor2tensor.utils import registry
@registry.register_hparams
def transformer_bsg():
  """HParams for transformer base model for single GPU."""
  hparams = transformer_base_single_gpu()
  hparams.batch_size = 2048
  hparams.learning_rate_cosine_cycle_steps = 300000
  hparams.learning_rate = 0.2
  hparams.learning_rate_warmup_steps = 16000
  return hparams

Everything in there can be customized. Save the file, put it in usr_dir, and import it in the __init__.py file.
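Once the files are in usr_dir, the problem and the hyperparameter set are referred to by their registered names. As far as I know, the registry derives a problem's name by converting the class name from CamelCase to snake_case, while a hyperparameter set is registered under its function name (transformer_bsg here). The sketch below reimplements that naming convention purely for illustration; camel_to_snake is a hypothetical helper, not the library's API, so double-check the result against the list of registered problems in your installed version.

```python
import re

# Assumption: these two patterns approximate the CamelCase -> snake_case
# conversion used when a problem class is registered.
_FIRST_CAP_RE = re.compile(r"(.)([A-Z][a-z0-9]+)")
_ALL_CAP_RE = re.compile(r"([a-z0-9])([A-Z])")


def camel_to_snake(name):
    """Convert a CamelCase class name to a snake_case name."""
    partially_split = _FIRST_CAP_RE.sub(r"\1_\2", name)
    return _ALL_CAP_RE.sub(r"\1_\2", partially_split).lower()


print(camel_to_snake("TranslateEnzhBpe50k"))     # translate_enzh_bpe50k
print(camel_to_snake("TranslateEndeWmtBpe32k"))  # translate_ende_wmt_bpe32k
```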
