
Google BERT, the Strongest Natural Language Model of 2018: A Resource Roundup

This post introduces a new language representation model, BERT: Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by conditioning on both left and right context in all layers. BERT is the first fine-tuning-based representation model to achieve state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems built on task-specific architectures and setting new state-of-the-art results on 11 NLP tasks.

Key Points of the BERT Paper

Model Architecture

Its main building block, the Transformer, comes from Attention Is All You Need.
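
For readers who want a concrete reminder of what that module computes, here is a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, the core operation the Transformer stacks into multi-head self-attention. The shapes and values are toy choices for illustration, not taken from BERT's implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (seq_q, seq_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # weighted sum of value vectors

# Toy example: 4 query/key/value vectors of dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)    # (4, 8)
```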

Model Input

Pre-training Methods

Masked language modeling (a cloze-style task) and next-sentence prediction.
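
As described in the paper, the masked-LM objective corrupts roughly 15% of the input tokens: of the selected positions, 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. Next-sentence prediction pairs a sentence with its true successor half the time and with a random sentence otherwise. The Python sketch below illustrates only the masking step on a toy token list; the vocabulary, token list, and helper names are made up for illustration, and this is not the official data pipeline.

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["cat", "dog", "runs", "sleeps", "the", "a"]  # stand-in for the WordPiece vocab

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Toy version of BERT's masked-LM corruption: pick ~15% of positions,
    then 80% -> [MASK], 10% -> random token, 10% -> keep the original."""
    rng = random.Random(seed)
    corrupted, labels = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                       # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted[i] = MASK               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted[i] = rng.choice(TOY_VOCAB)  # 10%: replace with a random token
            # else: 10%: leave the token unchanged
    return corrupted, labels

tokens = "the cat sleeps while the dog runs".split()
print(mask_tokens(tokens))
```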

Experiments

Model Analysis

Effect of Pre-training Tasks

Effect of Model Size

Effect of Number of Training Steps

Feature-based Approach with BERT

Conclusion

Recent empirical improvements due to transfer learning with language models have demonstrated that rich, unsupervised pre-training is an integral part of many language understanding systems. In particular, these results enable even low-resource tasks to benefit from very deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures, allowing the same pre-trained model to successfully tackle a broad set of NLP tasks. While the empirical results are strong, in some cases surpassing human performance, important future work is to investigate the linguistic phenomena that may or may not be captured by BERT.


BERT Resources

Title | Description | Date
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | original paper | 2018-10-11
Reddit discussion | discussion with the authors
BERT-pytorch | Google AI 2018 BERT PyTorch implementation
Paper walkthrough: the BERT model and fine-tuning | commentary by 習翔宇
The strongest pre-trained NLP model: Google BERT sweeps the records on 11 NLP tasks | brief analysis of the paper
[NLP] Google BERT explained in detail | commentary by 李入魔
How should the BERT model be assessed? | discussion of the paper's key ideas
A breakthrough NLP result: a detailed reading of the BERT model | commentary by 章魚小丸子
Google's strongest NLP model, BERT, explained | AI科技評論
Pre-training BERT: how it was reproduced with TensorFlow before the official code release | notes on reproducing the paper | 2018-10-30
Google finally open-sources the BERT code: 300 million parameters | full review by 機器之心 | 2018-11-01

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova (Submitted on 11 Oct 2018)

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7 (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5% absolute improvement), outperforming human performance by 2.0%.

Subjects: Computation and Language (cs.CL). Cite as: arXiv:1810.04805 [cs.CL] (this version: arXiv:1810.04805v1). 13 pages. Submitted by Jacob Devlin, Thu, 11 Oct 2018 00:50:01 GMT (v1).
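
The "one additional output layer" mentioned in the abstract is essentially a single linear classifier applied to the final hidden state of the [CLS] token (token-level tasks instead attach it to every token position). Below is a minimal PyTorch sketch of such a head, with the pre-trained encoder stubbed out by a random tensor; the class name, the hidden size of 768 (BERT-base), and the label count are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class SentenceClassifierHead(nn.Module):
    """Illustrative fine-tuning head: one linear layer over the [CLS] representation
    produced by a pre-trained encoder."""
    def __init__(self, hidden_size=768, num_labels=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_hidden_state):            # (batch, hidden_size)
        return self.classifier(self.dropout(cls_hidden_state))

# Stand-in for the pooled [CLS] output of a pre-trained BERT-base encoder
cls_hidden = torch.randn(8, 768)
head = SentenceClassifierHead()
logits = head(cls_hidden)                            # (8, 2)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 2, (8,)))
loss.backward()                                      # in real fine-tuning the gradients also
                                                     # flow into the whole pre-trained encoder
print(logits.shape, float(loss))
```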

Reddit Discussion

Official implementation: google-research/bert

Google recently released this large-scale pre-trained language model based on bidirectional Transformers. The pre-trained model extracts textual information efficiently and can be applied to a wide range of NLP tasks, and with it the authors set new state-of-the-art records on 11 NLP tasks. If this pre-training approach holds up in practice, most NLP tasks will need only a small amount of data for fine-tuning to reach very good performance, and BERT will become a backbone network in the truest sense.

Introduction

BERT, or Bidirectional Encoder Representations from Transformers, is a new method of pre-training language representations which obtains state-of-the-art results on a wide array of Natural Language Processing (NLP) tasks.

Our academic paper which describes BERT in detail and provides full results on a number of tasks can be found here: arxiv.org/abs/1810.04805.

To give a few numbers, here are the results on the SQuAD v1.1 question answering task:

SQuAD v1.1 Leaderboard (Oct 8th 2018) | Test EM | Test F1
1st Place Ensemble - BERT | 87.4 | 93.2
2nd Place Ensemble - nlnet | 86.0 | 91.7
1st Place Single Model - BERT | 85.1 | 91.8
2nd Place Single Model - nlnet | 83.5 | 90.1

And several natural language inference tasks:

System | MultiNLI | Question NLI | SWAG
BERT | 86.7 | 91.1 | 86.3
OpenAI GPT (Prev. SOTA) | 82.2 | 88.1 | 75.0

Plus many other tasks.

Moreover, these results were all obtained with almost no task-specific neural network architecture design.

If you already know what BERT is and you just want to get started, you can download the pre-trained models and run a state-of-the-art fine-tuning in only a few minutes.

Reimplementation: bert_language_understanding

Pre-training of Deep Bidirectional Transformers for Language Understanding

Reimplementation: BERT-keras

Keras implementation of BERT (Bidirectional Encoder Representations from Transformers)

Reimplementation: pytorch-pretrained-BERT

PyTorch version of Google AI's BERT model with script to load Google's pre-trained models.
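
As an example of how such a port is typically used, here is a sketch of loading a pre-trained model and encoding a sentence pair with pytorch-pretrained-BERT. The class and method names follow that repository's documentation at the time (BertTokenizer.from_pretrained, BertModel, convert_tokens_to_ids); the sentences and variable names are made up, and the exact API may differ in later releases, so treat this as a hedged sketch rather than a definitive reference.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel  # pip install pytorch-pretrained-bert

# Load the WordPiece vocabulary and the pre-trained weights (downloaded on first use)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()  # disable dropout for deterministic encoding

# Build BERT's input format by hand: [CLS] sentence A [SEP] sentence B [SEP]
tokens_a = tokenizer.tokenize("the cat sat on the mat")
tokens_b = tokenizer.tokenize("it was very comfortable")
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)  # sentence A vs. B
token_ids = tokenizer.convert_tokens_to_ids(tokens)

with torch.no_grad():
    encoded_layers, pooled_output = model(
        torch.tensor([token_ids]), torch.tensor([segment_ids])
    )
# One hidden-state tensor per Transformer layer, plus the pooled [CLS] vector
print(len(encoded_layers), encoded_layers[-1].shape, pooled_output.shape)
```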

GLUE, the Benchmark Dataset Used by BERT

GLUE comes from the paper GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.

Abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.