谷歌AI論文BERT雙向編碼器表徵模型:機器閱讀理解NLP基準11種最優(公號回覆“谷歌BERT論文”下載彩標PDF論文)

原創: 秦隴紀 資料簡化DataSimp 今天

資料簡化DataSimp導讀:谷歌AI語言組論文《BERT:語言理解的深度雙向變換器預訓練》介紹一種新的語言表徵模型BERT——來自變換器的雙向編碼器表徵量。異於最新語言表徵模型,BERT基於所有層的左、右語境來預訓練深度雙向表徵量。BERT是首個在大批句子層面和詞塊層面任務中取得當前最優效能的表徵模型,效能超越許多使用任務特定架構的系統,重新整理11項NLP任務當前最優效能記錄,堪稱最強NLP預訓練模型!未來可能成為新行業基礎。本文翻譯BERT論文(原文中英文對照);BERT簡版原始碼10月30日已釋出,我們後期抽空分析,祝大家學習愉快~要推進人類文明,不可止步於敲門吶喊;設計空想太多,無法實現就虛度一生;工程能力至關重要,秦隴紀與君共勉之。

谷歌AI論文BERT雙向編碼器表徵模型:機器閱讀理解NLP基準11種最優(62264字)

目錄

 

A谷歌AI論文BERT雙向編碼器表徵模型(58914字)

一、介紹Introduction

二、相關工作Related Work

三、BERT變換器雙向編碼器表徵

四、實驗Experiments

五、消模實驗Ablation Studies

六、結論Conclusion

參考文獻References

B機器閱讀理解11種NLP任務BERT超人類(2978字)

一、BERT模型主要貢獻

二、BERT模型與其它兩個的不同

參考文獻(1214字)Appx(845字).資料簡化DataSimp社群簡介


A谷歌AI論文BERT雙向編碼器表徵模型(58914字)

BERT:語言理解的深度雙向變換器預訓練

文|谷歌AI語言組BERT作者,譯|秦隴紀,資料簡化DataSimp,20181013Sat-1103Sat

名稱:BERT:語言理解的深度雙向變換器預訓練

BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding

論文地址:https://arxiv.org/pdf/1810.04805.pdf

作者:Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

單位:Google AI Language {jacobdevlin,mingweichang,kentonl,kristout}@google.com

摘要:本文介紹一種稱之為BERT的新語言表徵模型,意為來自變換器的雙向編碼器表徵量(Bidirectional Encoder Representations from Transformers)。不同於最近的語言表徵模型(Peters等,2018; Radford等,2018),BERT旨在基於所有層的左、右語境來預訓練深度雙向表徵。因此,預訓練的BERT表徵可以僅用一個額外的輸出層進行微調,進而為很多任務(如問答和語言推理)建立當前最優模型,無需對任務特定架構做出大量修改。

BERT的概念很簡單,但實驗效果很強大。它重新整理了11個NLP任務的當前最優結果,包括將GLUE基準提升至80.4%(7.6%的絕對改進)、將MultiNLI的準確率提高到86.7%(5.6%的絕對改進),以及將SQuAD v1.1問答測試的F1得分提高至93.2分(1.5分絕對提高)——比人類效能還高出2.0分。

Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE benchmark to 80.4% (7.6% absolute improvement), MultiNLI accuracy to 86.7% (5.6% absolute improvement) and the SQuAD v1.1 question answering Test F1 to 93.2 (1.5 absolute improvement), outperforming human performance by 2.0.

除了上述摘要Abstract,論文有6節:介紹Introduction、相關工作Related Work、BERT、實驗Experiments、消模實驗Ablation Studies、結論Conclusion,末尾是42篇參考資料References。

一、介紹Introduction

語言模型預訓練已被證明可有效改進許多自然語言處理任務(Dai and Le, 2015; Peters等,2017, 2018; Radford等,2018; Howard and Ruder, 2018)。這些任務包括句子級任務,如自然語言推理inference(Bowman等,2015; Williams等,2018)和釋義paraphrasing(Dolan and Brockett, 2005),旨在通過整體分析來預測句子之間的關係;以及詞塊級任務,如命名實體識別(Tjong Kim Sang and De Meulder, 2003)和SQuAD問題回答(Rajpurkar等,2016),其中模型需要在詞塊級別生成細粒度輸出。

Language model pre-training has shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Peters et al., 2017, 2018; Radford et al., 2018; Howard and Ruder, 2018). These tasks include sentence-level tasks such as natural language inference (Bowman et al., 2015; Williams et al., 2018) and paraphrasing (Dolan and Brockett, 2005), which aim to predict the relationships between sentences by analyzing them holistically, as well as token-level tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003) and SQuAD question answering (Rajpurkar et al., 2016), where models are required to produce fine-grained output at the token-level. (譯註1:token義為象徵、標誌、紀念品、代幣、代價券,和sign意思相同但比sign莊重文雅,常用於嚴肅場合。token另有語言學詞義:[語言學]語言符號;計算機詞義:[計算機]詞塊。秦隴紀認為“符標”更合意,但常見NLP文獻裡token譯為“詞塊”,隨大流吧。)

預訓練語言表徵應用於下游任務有兩種現有策略:基於特徵feature-based和微調fine-tuning。基於特徵的方法,例如ELMo(Peters等,2018),使用特定於任務的架構,其包括將預訓練表徵作為附加特徵。微調方法,例如Generative Pre-trained Transformer(OpenAI GPT,生成型預訓練變換器)(Radford等,2018),引入了最小的任務特定引數,並通過簡單地微調預訓練引數在下游任務中進行訓練。在以前的工作中,兩種方法在預訓練期間共享相同的目標函式,它們使用單向語言模型來學習通用語言表徵。

There are two existing strategies for applying pre-trained language representations to downstream tasks: feature-based and fine-tuning. The feature-based approach, such as ELMo (Peters et al., 2018), uses task-specific architectures that include the pre-trained representations as additional features. The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters, and is trained on the downstream tasks by simply fine-tuning the pretrained parameters. In previous work, both approaches share the same objective function during pre-training, where they use unidirectional language models to learn general language representations.

我們認為,當前技術嚴重製約了預訓練表徵的能力,特別是對於微調方法。其主要侷限在於標準語言模型是單向的,這限制了可以在預訓練期間使用的架構型別。例如,在OpenAI GPT中,作者們使用從左到右的架構,其中每個詞塊只能注意變換器自注意層中在它之前的詞塊(Vaswani等,2017)。這種侷限對於句子層面任務而言是次優選擇,而在將基於微調的方法應用於SQuAD問答(Rajpurkar等,2016)這類詞塊級任務時則可能是毀滅性的,因為在這類任務中結合兩個方向的語境至關重要。

We argue that current techniques severely restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. For example, in OpenAI GPT, the authors use a left-to-right architecture, where every token can only attend to previous tokens in the self-attention layers of the Transformer (Vaswani et al., 2017). Such restrictions are sub-optimal for sentence-level tasks, and could be devastating when applying fine-tuning based approaches to token-level tasks such as SQuAD question answering (Rajpurkar et al., 2016), where it is crucial to incorporate context from both directions.

在本論文,我們通過提出BERT模型:來自變換器的雙向編碼器表徵量(Bidirectional Encoder Representations from Transformers),改進了基於微調的方法。BERT受Cloze完形任務(Taylor, 1953)的啟發,提出一個新的預訓練目標——“遮蔽語言模型”(masked language model,MLM)——來解決前面提到的單向侷限。該遮蔽語言模型隨機地從輸入中遮蔽一些詞塊,目標是僅基於被遮蔽詞的語境來預測其原始詞彙id。不像從左到右的語言模型預訓練,該MLM目標允許表徵融合左右兩側語境,這允許我們預訓練一個深度雙向變換器。除了該遮蔽語言模型,我們還引入了一個“下一句預測”(next sentence prediction)任務,該任務聯合預訓練文字對表徵量。

In this paper, we improve the fine-tuning based approaches by proposing BERT: Bidirectional Encoder Representations from Transformers. BERT addresses the previously mentioned unidirectional constraints by proposing a new pre-training objective: the “masked language model” (MLM), inspired by the Cloze task (Taylor, 1953). The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer. In addition to the masked language model, we also introduce a “next sentence prediction” task that jointly pre-trains text-pair representations.

我們的論文貢獻如下:

•我們證明了雙向預訓練對語言表徵量的重要性。與Radford等人(2018)不同,其使用單向語言模型進行預訓練,BERT使用遮蔽語言模型來實現預訓練的深度雙向表徵量。這也與Peters等人(2018)形成對比,其使用由獨立訓練的從左到右和從右到左LMs(語言模型)的淺層串聯。

•我們展示了預訓練表徵量能消除對許多重度工程化的任務特定架構的需求。BERT是第一個基於微調的表徵模型,它在大量的句子級和詞塊級任務上實現了最先進的效能,優於許多具有任務特定架構的系統。

•BERT推進了11項NLP任務的最高水平。我們還報告了廣泛的BERT消融實驗,證明我們模型的雙向性質是最重要的新貢獻。程式碼和預訓練模型將在goo.gl/language/bert上提供1。(注1 將於2018年10月底前公佈。)

The contributions of our paper are as follows:

•We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pretraining, BERT uses masked language models to enable pre-trained deep bidirectional representations. This is also in contrast to Peters et al. (2018), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.

•We show that pre-trained representations eliminate the needs of many heavily engineered task-specific architectures. BERT is the first fine-tuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many systems with task-specific architectures.

•BERT advances the state-of-the-art for eleven NLP tasks. We also report extensive ablations of BERT, demonstrating that the bidirectional nature of our model is the single most important new contribution. The code and pre-trained model will be available at goo.gl/language/bert.1

1 Will be released before the end of October 2018.

二、相關工作Related Work

預訓練通用語言表徵有很長的歷史,本節我們簡要回顧這些最常用的方法。

There is a long history of pre-training general language representations, and we briefly review the most popular approaches in this section.

2.1 基於特徵的方法Feature-based Approaches

學習可廣泛應用的單詞表徵已經是數十年的活躍研究領域,包括非神經(Brown等,1992; Ando and Zhang, 2005; Blitzer等,2006)和神經(Collobert and Weston, 2008; Mikolov等,2013; Pennington等,2014)方法。預訓練的單詞嵌入被認為是現代NLP系統不可或缺的組成部分,與從頭學習的嵌入相比提供了顯著的改進(Turian等,2010)。

這些方法已經被推廣到更粗的粒度,如句子嵌入(Kiros等,2015; Logeswaran and Lee, 2018)或段落嵌入(Le and Mikolov, 2014)。與傳統詞嵌入一樣,這些學習到的表徵通常用作下游模型中的特徵。

ELMo(Peters等,2017)沿著不同的維度推廣了傳統的詞嵌入研究。他們提出從語言模型中提取語境敏感型特徵。在把語境化詞嵌入與現有任務特定架構整合時,ELMo在一些主要的NLP基準(Peters et al., 2018)上推進了最先進水平,包括SQuAD問答(Rajpurkar等,2016)、情感分析(Socher等,2013),以及命名實體識別(Tjong Kim Sang和De Meulder,2003)。

Learning widely applicable representations of words has been an active area of research for decades, including non-neural (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) and neural (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014) methods. Pretrained word embeddings are considered to be an integral part of modern NLP systems, offering significant improvements over embeddings learned from scratch (Turian et al., 2010).

These approaches have been generalized to coarser granularities, such as sentence embeddings (Kiros et al., 2015; Logeswaran and Lee, 2018) or paragraph embeddings (Le and Mikolov, 2014). As with traditional word embeddings, these learned representations are also typically used as features in a downstream model.

ELMo (Peters et al., 2017) generalizes traditional word embedding research along a different dimension. They propose to extract context-sensitive features from a language model. When integrating contextual word embeddings with existing task-specific architectures, ELMo advances the state-of-the-art for several major NLP benchmarks (Peters et al., 2018) including question answering (Rajpurkar et al., 2016) on SQuAD, sentiment analysis (Socher et al., 2013), and named entity recognition (Tjong Kim Sang and De Meulder, 2003).

2.2 微調方法Fine-tuning Approaches

語言模型(LMs)遷移學習的一個新趨勢是:先在某個LM目標上預訓練模型架構,再針對有監督的下游任務微調同一個模型(Dai and Le, 2015; Howard and Ruder, 2018; Radford等,2018)。這些方法的優點是幾乎沒有引數需要從頭開始學習。至少部分得益於這一優勢,OpenAI GPT(Radford等,2018)在GLUE基準(Wang等,2018)的許多句子級任務上取得了此前最優結果。

A recent trend in transfer learning from language models (LMs) is to pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018). The advantage of these approaches is that few parameters need to be learned from scratch. At least partly due to this advantage, OpenAI GPT (Radford et al., 2018) achieved previously state-of-the-art results on many sentence-level tasks from the GLUE benchmark (Wang et al., 2018).

2.3 從監督資料遷移學習Transfer Learning from Supervised Data

雖然無監督預訓練的優勢在於可獲得的資料量幾乎無限,但也有工作表明從具有大型資料集的監督任務中可有效遷移,例如自然語言推理(Conneau等,2017)和機器翻譯(McCann等,2017)。在NLP之外,計算機視覺研究也證明了從大型預訓練模型遷移學習的重要性,其中一個有效的方法是微調在ImageNet上預訓練的模型(Deng等,2009; Yosinski等,2014)。

While the advantage of unsupervised pre-training is that there is a nearly unlimited amount of data available, there has also been work showing effective transfer from supervised tasks with large datasets, such as natural language inference (Conneau et al., 2017) and machine translation (McCann et al., 2017). Outside of NLP, computer vision research has also demonstrated the importance of transfer learning from large pre-trained models, where an effective recipe is to fine-tune models pre-trained on ImageNet (Deng et al., 2009; Yosinski et al., 2014).

三、BERT變換器雙向編碼器表徵

我們在本節介紹BERT及其詳細實現。我們先介紹BERT的模型架構和輸入表徵。然後,我們將在3.3節中介紹預訓練任務,即本文的核心創新。預訓練過程和微調過程分別在第3.4節和第3.5節中詳述。最後,第3.6節討論了BERT和OpenAI GPT之間的差異。

We introduce BERT and its detailed implementation in this section. We first cover the model architecture and the input representation for BERT. We then introduce the pre-training tasks, the core innovation in this paper, in Section 3.3. The pre-training procedures, and fine-tuning procedures are detailed in Section 3.4 and 3.5, respectively. Finally, the differences between BERT and OpenAI GPT are discussed in Section 3.6.

3.1 模型架構Model Architecture

BERT模型架構是一種多層雙向變換器編碼器,基於Vaswani等人(2017年)描述並在tensor2tensor庫中發行的原始實現2。(注2 https://github.com/tensorflow/tensor2tensor)由於變換器的使用最近已變得無處不在,且我們架構的實現實際上等同於原始實現,我們將省略模型架構詳盡的背景描述,並向讀者推薦Vaswani等人(2017)以及“帶註釋的變換器”(The Annotated Transformer)3等優秀指南。(注3 http://nlp.seas.harvard.edu/2018/04/03/attention.html)

在這項工作中,我們把層數(即Transformer blocks變換器塊數)記為L,隱藏層大小記為H,自注意力頭數記為A。在所有情況下,我們設定前饋/過濾器的尺寸為4H,如H=768時為3072,H=1024時為4096。我們主要報告在兩種模型尺寸上的結果:

BERTBASE:L=12,H=768,A=12,總引數=110M

BERTLARGE:L=24,H=1024,A=16,總引數=340M
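為便於理解,下面用一小段Python示意程式碼(譯者新增,非論文官方實現,類名與欄位名均為假設)整理上述兩種模型尺寸的超引數,其中前饋/過濾器尺寸按上文取4H:

# 示意:按上文描述整理BERT兩種模型尺寸的超引數(非官方配置)
from dataclasses import dataclass

@dataclass
class BertConfig:
    num_layers: int    # L:變換器塊層數
    hidden_size: int   # H:隱藏層大小
    num_heads: int     # A:自注意力頭數

    @property
    def ffn_size(self) -> int:
        # 前饋/過濾器尺寸為4H(H=768時為3072,H=1024時為4096)
        return 4 * self.hidden_size

BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)    # 總引數約110M
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)  # 總引數約340M

print(BERT_BASE.ffn_size, BERT_LARGE.ffn_size)  # 3072 4096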

選擇的BERTBASE模型尺寸與OpenAI GPT相同,以便比較。然而,重要的是,BERT變換器使用雙向自注意,而GPT變換器使用受限自注意,每個詞塊只能注意其左側語境。我們注意到,在文獻中,雙向變換器通常被稱為“變換器編碼器”,而其僅用左側語境的版本被稱為“變換器解碼器”,因為它可用於文字生成。BERT、OpenAI GPT和ELMo之間的比較如圖1所示。

圖1:預訓練模型架構間差異。BERT使用雙向變換器,OpenAI GPT使用從左到右的變換器,ELMo使用獨立訓練的從左到右和從右到左LSTM級聯來生成下游任務的特徵。三種模型中只有BERT表徵基於所有層左右兩側語境。

Figure 1: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTM to generate features for downstream tasks. Among three, only BERT representations are jointly conditioned on both left and right context in all layers.

3.2 輸入表徵Input Representation

我們的輸入表徵(input representation)能在一個詞塊序列中明確地表徵單個文字句子或一對文字句子(例如,[問題,答案][Question, Answer])。4(注4 在整個這項工作中,“句子”可以是連續文字的任意跨度,而不是實際的語言句子。“序列”指BERT的輸入詞塊序列,其可以是單個句子或兩個句子打包在一起。)對於給定詞塊,其輸入表徵通過對相應的詞塊嵌入、段嵌入和位置嵌入求和來構造。圖2給出了我們輸入表徵的直觀展示。

圖2:BERT輸入表徵。輸入嵌入是詞塊嵌入、段嵌入和位置嵌入的總和。

Figure 2: BERT input representation. The input embeddings is the sum of the token embeddings, the segmentation embeddings and the position embeddings.

具體是:

•我們使用WordPiece嵌入(Wu等,2016),詞表含30,000個詞塊。我們用##標示拆分出的詞片。

•我們使用學習的位置嵌入,支援的序列長度最多為512個詞塊

•每個序列的第一個詞塊始終是特殊分類嵌入([CLS])。對應該詞塊的最終隱藏狀態(即,變換器輸出)被用作分類任務的聚合序列表徵。對於非分類任務,將忽略此向量。

•句子對被打包成單個序列。我們以兩種方式區分句子。首先,我們用特殊詞塊([SEP])將它們分開。其次,我們把一個學習到的句子A嵌入加到第一個句子的每個詞塊中,把一個句子B嵌入加到第二個句子的每個詞塊中。

•對於單句輸入,我們只使用句子A嵌入。
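為直觀展示3.2節的輸入構造方式,下面給出一段示意性Python程式碼(譯者新增,非官方實現;嵌入矩陣用隨機數代替,詞表查詢用雜湊代替真實的WordPiece詞表,維度也縮小了,僅演示“詞塊嵌入+段嵌入+位置嵌入求和”的資料流):

# 示意:構造BERT輸入表徵(詞塊嵌入+段嵌入+位置嵌入求和)
import numpy as np

rng = np.random.default_rng(0)
H, VOCAB, MAX_LEN = 8, 100, 32   # 僅為演示;真實BERT中分別為768、30,000、512

token_emb = rng.normal(size=(VOCAB, H))       # 詞塊嵌入表
segment_emb = rng.normal(size=(2, H))         # 段嵌入:句子A=0,句子B=1
position_emb = rng.normal(size=(MAX_LEN, H))  # 學習到的位置嵌入

def build_input(tokens_a, tokens_b=None):
    """把單個句子或句子對打包成一個序列,返回各詞塊的輸入嵌入。"""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    token_ids = [hash(t) % VOCAB for t in tokens]   # 用雜湊代替真實詞表查詢,僅為示意
    positions = list(range(len(tokens)))
    embeddings = (token_emb[token_ids]
                  + segment_emb[segment_ids]
                  + position_emb[positions])
    return tokens, segment_ids, embeddings

tokens, segs, emb = build_input(["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"])
print(tokens)      # [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
print(segs)        # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(emb.shape)   # (11, 8)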

3.3 預訓練任務Pre-training Tasks

與Peters等人(2018)和Radford等人(2018)不同,我們不使用傳統的從左到右或從右到左的語言模型來預訓練BERT。相反,我們使用兩個新型無監督預測任務對BERT進行預訓練,如本節所述。

3.3.1 任務#1:遮蔽語言模型 Task#1: Masked LM

直觀地說,有理由相信深度雙向模型嚴格強於從左到右模型,也強於從左到右和從右到左模型的淺層串聯。遺憾的是,標準條件語言模型只能從左到右或從右到左進行訓練,因為雙向調節將允許每個單詞在多層語境中間接地“看到自己”。

為了訓練深度雙向表徵,我們採用一種直接方法,隨機遮蔽輸入詞塊的某些部分,然後僅預測那些被遮蔽的詞塊。我們將這個過程稱為“遮蔽LM”(MLM),儘管它在文獻中通常被稱為Cloze完形任務(Taylor, 1953)。在這種情況下,對應於遮蔽詞塊的最終隱藏向量被饋送到詞彙表上的輸出softmax函式中,如在標準LM中那樣預測所有詞彙的概率。在我們所有實驗中,我們隨機地遮蔽每個序列中所有WordPiece詞塊的15%。與去噪自動編碼器(Vincent等,2008)相反,我們只預測被遮蔽的單詞而不是重建整個輸入。

雖然這確實允許我們獲得雙向預訓練模型,但該方法有兩個缺點。首先,我們造成了預訓練和微調之間的不匹配,因為[MASK]詞塊在微調期間從未出現。為了緩解這個問題,我們並不總是用實際的[MASK]詞塊替換被“遮蔽”的單詞。相反,訓練資料生成器隨機選擇15%的詞塊,例如,在句子“我的狗是毛茸茸的”中,它選擇了“毛茸茸的”。然後執行以下過程:

•並非始終用[MASK]替換所選單詞,資料生成器將執行以下操作:

•80%的時間:用[MASK]詞塊替換單詞,例如,我的狗是毛茸茸的→我的狗是[MASK]

•10%的時間:用隨機詞替換遮蔽詞,例如,我的狗是毛茸茸的→我的狗是蘋果

•10%的時間:保持單詞不變,例如,我的狗是毛茸茸的→我的狗是毛茸茸的。這樣做的目的是將該表徵偏向於實際觀察到的單詞。

變換器編碼器不知道它將被要求預測哪些單詞,也不知道哪些單詞已被隨機單詞替換,因此它被迫保持每個輸入詞塊的分散式語境表徵。此外,因為隨機替換只發生在所有詞塊的1.5%(即15%的10%),這似乎不會損害模型的語言理解能力。

使用MLM的第二個缺點是每批中只預測了15%的詞塊,這表明模型可能需要更多的預訓練步驟才能收斂。在5.3節中,我們證明MLM的收斂速度略慢於從左到右的模型(預測每個詞塊),但MLM模型在實驗上的改進遠遠超過所增加的訓練成本。
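下面用一段示意性Python程式碼(譯者新增,非論文官方的資料生成器)演示上述15%選取與80%/10%/10%替換規則;詞表與例句均為假設:

# 示意:按3.3.1節的規則生成遮蔽LM訓練資料
import random

MASK_RATE = 0.15

def mask_tokens(tokens, vocab, rng=random):
    """隨機選取約15%的詞塊;其中80%替換為[MASK],10%替換為隨機詞,10%保持不變。
    返回(遮蔽後的序列, {位置: 原詞}),只對被選中的位置計算預測損失。"""
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        if rng.random() >= MASK_RATE:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"          # 80%:替換為[MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab) # 10%:替換為隨機詞
        # 其餘10%:保持原詞不變
    return masked, targets

vocab = ["蘋果", "毛茸茸", "貓", "跑"]
print(mask_tokens(["我的", "狗", "是", "毛茸茸", "的"], vocab, random.Random(1)))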

3.3.2 任務#2:下一句預測Task#2: Next Sentence Prediction

很多重要的下游任務,例如問答(QA)和自然語言推理(NLI),都是基於對兩個文字句子間關係的理解,而這種關係並非通過語言建模直接獲得。為了訓練一個理解句子關係的模型,我們預訓練了一個二值化下一句預測任務,該任務可以從任何單語語料庫中輕鬆生成。具體來說,選擇句子A和B作為預訓練樣本:B有50%的可能是A的下一句,也有50%的可能是來自語料庫的隨機句子。例如:

輸入=[CLS]男子去[MASK]商店[SEP]他買了一加侖[MASK]牛奶[SEP]

Label= IsNext

輸入=[CLS]男人[MASK]到商店[SEP]企鵝[MASK]是飛行##less鳥[SEP]

Label= NotNext

我們完全隨機選擇這些NotNext語句,最終預訓練模型在此任務中達到97%-98%的準確率。儘管它很簡單,但我們在5.1節中證明,面向該任務的預訓練對QA和NLI都非常有益。
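下面給出“下一句預測”訓練樣本構造的示意性Python程式碼(譯者新增,非官方實現;docs的內容為假設的玩具語料),展示50% IsNext / 50% NotNext的取樣方式:

# 示意:從單語語料構造"下一句預測"訓練樣本
import random

def make_nsp_example(docs, rng=random):
    """docs:每個文件是一個句子列表。返回(句子A, 句子B, 標籤)。"""
    doc = rng.choice(docs)
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i]
    if rng.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"      # 50%:真實的下一句
    else:
        other = rng.choice(docs)                  # 50%:來自語料庫的隨機句子
        sent_b, label = rng.choice(other), "NotNext"
    return sent_a, sent_b, label

docs = [["男子去商店。", "他買了一加侖牛奶。"],
        ["企鵝是不會飛的鳥。", "牠們生活在南半球。"]]
print(make_nsp_example(docs, random.Random(0)))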

3.4 預訓練過程Pre-training Procedure

BERT預訓練過程主要遵循現有的語言模型預訓練文獻。對於預訓練語料庫,我們使用BooksCorpus(800M單詞)(Zhu等,2015)和英語維基百科(2,500M單詞)的串聯。對於維基百科,我們只提取文字段落,忽略列表、表格和標題。至關重要的是,要使用文件級語料庫,而不是像Billion Word Benchmark(Chelba等,2013)那樣打亂了句子順序的句子級語料庫,以便提取長的連續序列。

為了生成每個訓練輸入序列,我們從語料庫中取樣兩個文字跨度,我們將其稱為“句子”,即使它們通常比單個句子長得多(但也可以更短)。第一個句子接收A嵌入,第二個句子接收B嵌入。B有50%可能剛好是A之後的下一個句子,亦有50%可能是個隨機句子,此乃為“下一句預測”任務而做。對它們取樣,使其組合長度≦512個詞塊。LM遮蔽在WordPiece詞塊化之後施加,統一遮蔽率為15%,且不對部分詞片做特殊考慮。

我們以256個序列的批量大小(256個序列*512個詞塊=128,000個詞塊/批次)訓練1,000,000步,這大約相當於在33億單詞語料庫上訓練40個週期。我們使用Adam優化器,學習率為1e-4,β1=0.9,β2=0.999,L2權重衰減為0.01,學習率在前10,000步線性預熱,之後線性衰減。我們在所有層上使用0.1的dropout(丟失)概率。與OpenAI GPT一樣,我們使用gelu啟用(Hendrycks和Gimpel, 2016)而不是標準relu。訓練損失是遮蔽LM平均似然與下一句預測平均似然之和。
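其中“前10,000步預熱、之後線性衰減”的學習率排程可以用如下示意性Python函式表達(譯者新增,僅為排程曲線的示意,Adam的β、權重衰減等不在此列;函式名與引數名均為假設):

# 示意:線性預熱+線性衰減的學習率排程(峰值1e-4,預熱10,000步,總步數1,000,000)
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                               # 線性預熱
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)   # 線性衰減到0

for s in (0, 5_000, 10_000, 500_000, 1_000_000):
    print(s, learning_rate(s))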

在Pod配置的4個雲TPU上進行了BERTBASE訓練(總共16個TPU晶片)。5(注5 https://cloudplatform.googleblog.com/2018/06/Cloud-TPU-now-offers-preemptible-pricing-and-globalavailability.html)在16個雲TPU(總共64個TPU晶片)進行了BERTLARGE訓練。每次預訓練需4天完成。

3.5 微調過程Fine-tuning Procedure

對於序列級分類任務,BERT微調很簡單。為了獲得輸入序列的固定維度池化表徵,我們取輸入中第一個詞塊的最終隱藏狀態(即變換器的輸出),該詞塊按構造對應於特殊的[CLS]詞嵌入。我們將該向量表示為C ∈ R^H。微調期間新增的唯一新引數是分類層引數W ∈ R^(K×H),其中K是分類器標籤的數量。標籤概率P ∈ R^K用標準softmax函式計算,即P = softmax(CW^T)。BERT和W的所有引數都被聯合微調,以最大化正確標籤的對數概率。對於跨度級和詞塊級預測任務,必須以任務特定方式稍微修改上述過程,詳情見第4節的相應小節。
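下面用一段示意性Python程式碼(譯者新增,非官方實現;C用隨機向量代替真實的BERT輸出,標籤數K與正確標籤均為假設)演示上述分類頭的計算P = softmax(CW^T)與對數概率損失:

# 示意:序列級分類微調頭——取[CLS]的最終隱藏狀態C,計算標籤概率
import numpy as np

H, K = 768, 3                        # 隱藏大小H、標籤數K(K為假設)
rng = np.random.default_rng(0)
C = rng.normal(size=(H,))            # [CLS]詞塊的最終隱藏狀態,C ∈ R^H(此處用隨機值代替)
W = rng.normal(size=(K, H)) * 0.02   # 分類層引數,W ∈ R^(K×H),微調期間唯一新增的引數

logits = W @ C                       # 等價於 C W^T
P = np.exp(logits - logits.max())
P /= P.sum()                         # 標準softmax,得到標籤概率 P ∈ R^K
loss = -np.log(P[1])                 # 最大化正確標籤(此處假設標籤為1)的對數概率
print(P, loss)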

對於微調,大多數模型超引數與預訓練相同,但批量大小、學習率和訓練週期數量除外。丟失概率始終保持在0.1。最佳超引數值是特定於任務的,但我們發現以下範圍的可能值可以在所有任務中很好地工作:

批量大小:16,32

學習率(Adam):5e-5,3e-5,2e-5

週期數量:3,4

我們還觀察到,大資料集(如10萬以上標註訓練樣例)對超引數選擇的敏感性遠小於小資料集。微調通常非常快,因此簡單地對上述引數做一次窮舉搜尋,並選擇在開發集上效能最佳的模型,是合理的做法。

3.6 BERT和OpenAI GPT比較Comparison of BERT and OpenAI GPT

與BERT最具可比性的現有預訓練方法是OpenAI GPT,它在大型文字語料庫上訓練從左到右的變換器LM。實際上,BERT的許多設計決策被有意地選擇為儘可能接近GPT,以便將兩種方法的差異降到最小來比較。這項工作的核心論點是:3.3節中提出的兩個新型預訓練任務貢獻了大部分經驗改進;但我們也注意到,BERT和GPT在訓練方式上還存在其他一些差異:

•GPT在BooksCorpus(800M單詞)訓練;BERT在BooksCorpus(800M單詞)和維基百科(2,500M單詞)訓練。

•GPT使用一種句子分隔符([SEP])和分類符詞塊([CLS]),它們僅在微調時引入;BERT在預訓練期間學習[SEP],[CLS]和句子A/B嵌入。

•GPT用一個批量32,000單詞訓練1M步;BERT用一個批量128,000單詞訓練1M步。

•GPT對所有微調實驗使用相同的5e-5學習率;BERT選擇特定於任務、在開發集上表現最佳的微調學習率。

為了分離這些差異的影響,我們在5.1節進行了消融實驗,證明大多數改進實際上來自新型預訓練任務

The most comparable existing pre-training method to BERT is OpenAI GPT, which trains a left-to-right Transformer LM on a large text corpus. In fact, many of the design decisions in BERT were intentionally chosen to be as close to GPT as possible so that the two methods could be minimally compared. The core argument of this work is that the two novel pre-training tasks presented in Section 3.3 account for the majority of the empirical improvements, but we do note that there are several other differences between how BERT and GPT were trained:

• GPT is trained on the BooksCorpus (800M words); BERT is trained on the BooksCorpus (800M words) and Wikipedia (2,500M words).

• GPT uses a sentence separator ([SEP]) and classifier token ([CLS]) which are only introduced at fine-tuning time; BERT learns [SEP], [CLS] and sentence A/B embeddings during pre-training.

• GPT was trained for 1M steps with a batch size of 32,000 words; BERT was trained for 1M steps with a batch size of 128,000 words.

• GPT used the same learning rate of 5e-5 for all fine-tuning experiments; BERT chooses a task-specific fine-tuning learning rate which performs the best on the development set.

To isolate the effect of these differences, we perform ablation experiments in Section 5.1 which demonstrate that the majority of the improvements are in fact coming from the new pre-training tasks.

四、實驗Experiments

在本節中,我們將介紹11個NLP任務的BERT微調結果。

In this section, we present BERT fine-tuning results on 11 NLP tasks.

4.1 GLUE資料集GLUE Datasets

通用語言理解評估(GLUE)基準(Wang等,2018)是多種自然語言理解任務的集合。大多數GLUE資料集已存在多年,但GLUE的目的是(1)以規範的Train、Dev和Test拆分發行這些資料集,以及(2)設立評估伺服器,以減輕評估不一致和測試集過擬合的問題。GLUE不發佈測試集的標籤,使用者必須將其預測上傳到GLUE伺服器進行評估,且提交數量受到限制。

GLUE基準包括以下資料集,其描述最初在Wang等人(2018)的文章中進行了總結:

MNLI多型別自然語言推理是一項大規模的眾包蘊涵分類任務(Williams等,2018)。給定一對句子,目標是預測第二句與第一句相比是蘊涵、矛盾還是中立。

QQP Quora問題對是一個二元分類任務,其目的是確定Quora上提出的兩個問題是否在語義上是等價的(Chen等,2018)。

QNLI問題自然語言推理是斯坦福問答資料集(Rajpurkar等,2016)的一個版本,已被轉換為二元分類任務(Wang等,2018)。正例是包含正確答案的(問題,句子)對,負例是來自同一段落但不包含答案的(問題,句子)對。

SST-2斯坦福情感樹庫2是一個二元單句分類任務,由從電影評論中提取的句子和人類註釋的情緒組成(Socher等,2013)。

CoLA語言可接受性語料庫是一個二元單句分類任務,其目標是預測英語句子在語言上是否“可接受”(Warstadt等,2018)。

STS-B語義文字相似性基準是從新聞標題和其他來源中提取的句子對的集合(Cer等,2017)。它們用1到5的分數進行註釋,表示兩個句子在語義上的相似程度。

MRPC微軟研究院釋義語料庫由從線上新聞源自動提取的句子對組成,由人類註釋句子對中的兩個句子是否在語義上等價(Dolan和Brockett,2005)。

RTE識別文字蘊涵是類似於MNLI的二進位制蘊涵任務,但訓練資料少得多(Bentivogli等,2009)。6(注6 請注意,本文僅報告單任務微調結果。多工微調方法可能會進一步推動結果。例如,我們確實觀察到MNLI多工培訓對RTE的實質性改進。)

WNLI威諾格拉德自然語言推理是一個源自(Levesque等,2011)的小型自然語言推理資料集。GLUE網頁指出,該資料集的構建存在問題7,並且每個提交給GLUE的訓練系統的效能都比預測多數類的65.1基線準確度差。(注7 https://gluebenchmark.com/faq)因此,出於對OpenAI GPT的公平考慮,我們排除了這一資料集。對於我們的GLUE提交,我們總是預測多數類。

The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018) is a collection of diverse natural language understanding tasks. Most of the GLUE datasets have already existed for a number of years, but the purpose of GLUE is to (1) distribute these datasets with canonical Train, Dev, and Test splits, and (2) set up an evaluation server to mitigate issues with evaluation inconsistencies and Test set overfitting. GLUE does not distribute labels for the Test set and users must upload their predictions to the GLUE server for evaluation, with limits on the number of submissions.

The GLUE benchmark includes the following datasets, the descriptions of which were originally summarized in Wang et al. (2018):

MNLI Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.

QQP Quora Question Pairs is a binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent (Chen et al., 2018).

QNLI Question Natural Language Inference is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016) which has been converted to a binary classification task (Wang et al., 2018). The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) from the same paragraph which do not contain the answer.

SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).

CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs drawn from news headlines and other sources (Cer et al., 2017). They were annotated with a score from 1 to 5 denoting how similar the two sentences are in terms of semantic meaning.

MRPC Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent (Dolan and Brockett, 2005).

RTE Recognizing Textual Entailment is a binary entailment task similar to MNLI, but with much less training data (Bentivogli et al., 2009).6 (6 Note that we only report single-task fine-tuning results in this paper. Multitask fine-tuning approach could potentially push the results even further. For example, we did observe substantial improvements on RTE from multi-task training with MNLI.)

WNLI Winograd NLI is a small natural language inference dataset deriving from (Levesque et al., 2011). The GLUE webpage notes that there are issues with the construction of this dataset, 7 (7 https://gluebenchmark.com/faq) and every trained system that’s been submitted to GLUE has performed worse than the 65.1 baseline accuracy of predicting the majority class. We therefore exclude this set out of fairness to OpenAI GPT. For our GLUE submission, we always predicted the majority class.

4.1.1 GLUE結果GLUEResults

圖3:我們的任務特定模型是通過在BERT上新增一個額外輸出層而形成的,因此只需從頭學習極少量的引數。在這些任務中,(a)和(b)是序列級任務,(c)和(d)是詞塊級任務。圖中E代表輸入嵌入,Ti代表詞塊i的語境表徵,[CLS]是分類輸出的特殊符號,[SEP]是分割非連續詞塊序列的特殊符號。

Figure 3: Our task specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. Among the tasks, (a) and (b) are sequence-level tasks while (c) and (d) are token-level tasks. In the figure, E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.

對GLUE微調,我們按第3節所述表徵輸入序列(或序列對),並使用對應於第一個輸入詞塊([CLS])的最終隱藏向量C ∈ R^H作為聚合表徵,如圖3(a)和(b)所示。在微調期間引入的唯一新引數是分類層W ∈ R^(K×H),其中K是標籤數量。我們用C和W計算標準分類損失,即log(softmax(CW^T))。

對所有GLUE任務,我們均使用32的批量大小,在資料上訓練3個週期。對於每項任務,我們分別用學習率5e-5、4e-5、3e-5和2e-5做了微調,並選擇在Dev集上效能最佳的那一個。此外,對於BERTLARGE,我們發現微調有時在小資料集上不穩定(即某些執行會產生退化結果),因此我們運行了幾次隨機重啟,並選擇在Dev集上效能最佳的模型。隨機重啟時,我們使用相同的預訓練檢查點,但執行不同的微調資料混洗和分類器層初始化。我們注意到GLUE資料集的發行版不包括測試標籤,我們對BERTBASE和BERTLARGE各只向GLUE評估伺服器提交了一次。

表1:GLUE測試結果,評分來自其GLUE評估伺服器。每個任務下面的數字代表該訓練樣本數量。“Average”列與GLUE官方分數略微不同,因為我們排除了有問題的WNLI集。OpenAI GPT = (L=12, H=768, A=12);BERTBASE = (L=12, H=768, A=12);BERTLARGE = (L=24, H=1024, A=16)。BERT和OpenAI GPT是單模型、單任務。所有結果來自於以下地址:https://gluebenchmark.com/leaderboard和https://blog.openai.com/language-unsupervised/。

Table 1: GLUE Test results, scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The “Average” column is slightly different than the official GLUE score, since we exclude the problematic WNLI set. OpenAI GPT = (L=12, H=768, A=12); BERTBASE = (L=12, H=768, A=12); BERTLARGE = (L=24, H=1024, A=16). BERT and OpenAI GPT are single-model, single task. All results obtained from https://gluebenchmark.com/leaderboard and https://blog.openai.com/language-unsupervised/.

結果如表1所示。BERTBASE和BERTLARGE在所有任務上的效能均大幅優於所有現有系統,相對於最先進水平,平均準確度分別提高了4.4%和6.7%。請注意,除注意力遮蔽外,BERTBASE和OpenAI GPT的模型架構幾乎相同。對於規模最大、報道最廣泛的GLUE任務MNLI,BERT獲得了4.7%的絕對準確率提升,超過了最先進水平。在官方GLUE排行榜8上,BERTLARGE得分為80.4,而在本文撰寫之日,位居排行榜首位的系統OpenAI GPT得分為72.8。(注8 https://gluebenchmark.com/leaderboard)

有趣的是,BERTLARGE在所有任務中都明顯優於BERTBASE,即使訓練資料非常少的那些也是如此。第5.2節更全面地探討了BERT模型尺寸的影響。

To fine-tune on GLUE, we represent the input sequence or sequence pair as described in Section 3, and use the final hidden vector C ∈ R^H corresponding to the first input token ([CLS]) as the aggregate representation. This is demonstrated visually in Figure 3 (a) and (b). The only new parameters introduced during fine-tuning is a classification layer W ∈ R^(K×H), where K is the number of labels. We compute a standard classification loss with C and W, i.e., log(softmax(CW^T)).

We use a batch size of 32 and 3 epochs over the data for all GLUE tasks. For each task, we ran fine-tunings with learning rates of 5e-5, 4e-5, 3e-5, and 2e-5 and selected the one that performed best on the Dev set. Additionally, for BERTLARGE we found that fine-tuning was sometimes unstable on small data sets (i.e., some runs would produce degenerate results), so we ran several random restarts and selected the model that performed best on the Dev set. With random restarts, we use the same pre-trained checkpoint but perform different fine-tuning data shuffling and classifier layer initialization. We note that the GLUE data set distribution does not include the Test labels, and we only made a single GLUE evaluation server submission for each BERTBASE and BERTLARGE.

Results are presented in Table 1. Both BERTBASE and BERTLARGE outperform all existing systems on all tasks by a substantial margin, obtaining 4.4% and 6.7% respective average accuracy improvement over the state-of-the-art. Note that BERTBASE and OpenAI GPT are nearly identical in terms of model architecture outside of the attention masking. For the largest and most widely reported GLUE task, MNLI, BERT obtains a 4.7% absolute accuracy improvement over the state-of-the-art. On the official GLUE leaderboard 8 (注8 https://gluebenchmark.com/leaderboard), BERTLARGE obtains a score of 80.4, compared to the top leaderboard system, OpenAI GPT, which obtains 72.8 as of the date of writing.

It is interesting to observe that BERTLARGE significantly outperforms BERTBASE across all tasks, even those with very little training data. The effect of BERT model size is explored more thoroughly in Section 5.2.

4.2 斯坦福問答資料集SQuAD v1.1

斯坦福問答資料集(SQuAD)是一個包含10萬個眾包問答對的集合(Rajpurkar等,2016)。給出一個問題和來自維基百科、包含答案的一個段落,任務是預測該段落中答案文字的跨度。例如:

•輸入問題:

水滴在哪裡與冰晶碰撞形成降水?

•輸入段落:

...當較小的液滴通過與雲中的其他雨滴或冰晶碰撞而聚結時,便形成降水。...

•輸出答案:

在雲中

這種型別的跨度預測任務與GLUE的序列分類任務完全不同,但我們能以簡單的方式調整BERT以在SQuAD上執行。與GLUE一樣,我們將輸入問題和段落表示為單個打包序列,其中問題使用A嵌入,段落使用B嵌入。在微調期間學習的唯一新引數是起始向量S ∈ R^H和結束向量E ∈ R^H。設來自BERT的第i個輸入詞塊的最終隱藏向量為Ti ∈ R^H,可參見圖3(c)的視覺化。然後,單詞i作為答案跨度起點的概率由Ti和S之間的點積(dot product)、再對段落中所有單詞做softmax計算得到:

Pi = e^(S·Ti) / Σj e^(S·Tj)

相同的公式用於答案跨度的終點,得分最高的跨度被用作預測結果。訓練目標是正確的起始和結束位置的對數似然(log-likelihood)。

我們以學習率5e-5、批量大小32訓練3個週期。推理時,由於結束位置的預測不以起始位置為條件,我們添加了結束必須在起始之後的約束,但沒有使用其他啟發式方法。評估時,將詞塊化的標註跨度對齊回原始的未詞塊化輸入。
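下面給出跨度打分與“結束不早於開始”約束的示意性Python程式碼(譯者新增,非官方實現;T、S、E用隨機值代替真實的模型引數與輸出,段落長度亦為假設):

# 示意:SQuAD跨度預測——用起始向量S、結束向量E對每個詞塊打分,取最高分的合法跨度
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H, n = 768, 20                      # 隱藏大小、段落詞塊數(n為假設)
T = rng.normal(size=(n, H))         # 每個詞塊的最終隱藏向量 Ti ∈ R^H(此處用隨機值代替)
S = rng.normal(size=(H,))           # 起始向量
E = rng.normal(size=(H,))           # 結束向量

P_start = softmax(T @ S)            # Pi = e^(S·Ti) / Σj e^(S·Tj)
P_end = softmax(T @ E)              # 結束位置用相同公式

best, best_score = None, -1.0
for i in range(n):                  # 約束:結束位置j不得早於起始位置i
    for j in range(i, n):
        score = P_start[i] * P_end[j]
        if score > best_score:
            best, best_score = (i, j), score
print(best, best_score)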

結果呈現在表2中。SQuAD使用非常嚴格的測試過程,提交者必須人工聯絡SQuAD組織者,以在一個隱藏測試集上執行他們的系統,因此我們只提交了我們最好的系統進行測試。該表顯示的結果是我們向SQuAD提交的第一個也是唯一的測試結果。我們注意到,SQuAD排行榜上的最佳結果沒有最新的公開系統描述,並且在訓練系統時可以使用任何公共資料。因此,我們在提交的系統中使用了非常適度的資料增強:在SQuAD和TriviaQA(Joshi等,2017)上聯合訓練。

表2:SQuAD結果。該BERT整合(ensemble)由7個使用不同預訓練檢查點和微調種子(fine-tuning seed)的系統組成。

Table 2: SQuAD results. The BERT ensemble is 7x systems which use different pre-training checkpoints and fine-tuning seeds.

我們效能最佳的系統在整合(ensemble)設定下以+1.5 F1優於排行榜頂級系統,在單一系統設定下以+1.3 F1優於之。事實上,我們的單一BERT模型在F1得分方面優於頂級整合系統。如果我們只在SQuAD上微調(不用TriviaQA),我們會損失0.1-0.4的F1得分,但仍然大幅超越所有現有系統。

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs (Rajpurkar et al., 2016). Given a question and a paragraph from Wikipedia containing the answer, the task is to predict the answer text span in the paragraph. For example:

• Input Question:

Where do water droplets collide with ice crystals to formprecipitation?

• Input Paragraph:

... Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. ...

• Output Answer:

within a cloud

This type of span prediction task is quite different from the sequence classification tasks of GLUE, but we are able to adapt BERT to run on SQuAD in a straightforward manner. Just as with GLUE, we represent the input question and paragraph as a single packed sequence, with the question using the A embedding and the paragraph using the B embedding. The only new parameters learned during fine-tuning are a start vector S ∈ R^H and an end vector E ∈ R^H. Let the final hidden vector from BERT for the ith input token be denoted as Ti ∈ R^H. See Figure 3 (c) for a visualization. Then, the probability of word i being the start of the answer span is computed as a dot product between Ti and S followed by a softmax over all of the words in the paragraph:

Pi = e^(S·Ti) / Σj e^(S·Tj)

The same formula is used for the end of the answer span, and the maximum scoring span is used as the prediction. The training objective is the log-likelihood of the correct start and end positions.

We train for 3 epochs with a learning rate of 5e-5 and a batch size of 32. At inference time, since the end prediction is not conditioned on the start, we add the constraint that the end must come after the start, but no other heuristics are used. The tokenized labeled span is aligned back to the original untokenized input for evaluation.

Results are presented in Table 2. SQuAD uses a highly rigorous testing procedure where the submitter must manually contact the SQuAD organizers to run their system on a hidden test set, so we only submitted our best system for testing. The result shown in the table is our first and only Test submission to SQuAD. We note that the top results from the SQuAD leaderboard do not have up-to-date public system descriptions available, and are allowed to use any public data when training their systems. We therefore use very modest data augmentation in our submitted system by jointly training on SQuAD and TriviaQA (Joshi et al., 2017).

Our best performing system outperforms the top leaderboard system by +1.5 F1 in ensembling and +1.3 F1 as a single system. In fact, our single BERT model outperforms the top ensemble system in terms of F1 score. If we fine-tune on only SQuAD (without TriviaQA) we lose 0.1-0.4 F1 and still outperform all existing systems by a wide margin.

4.3 命名實體識別Named Entity Recognition

為了評估詞塊標註任務的效能,我們在CoNLL 2003命名實體識別(NER)資料集上微調BERT。該資料集由20萬個訓練單詞組成,這些單詞已被標註為人物、組織、地點、雜項或其他(非命名實體)。

為做微調,我們將每個詞塊i的最終隱藏表徵Ti ∈ R^H送入NER標籤集上的分類層。預測不以周圍的預測為條件(即非自迴歸、無CRF)。為了與WordPiece詞塊化相容,我們將每個CoNLL詞塊化的輸入單詞送入我們的WordPiece詞塊化器,並使用與第一個子詞塊相對應的隱藏狀態作為分類器的輸入。例如:

Jim Hen ##son 是 一個 木偶 ##eer

I-PER I-PER X O O O X

其中不對X做預測。由於WordPiece詞塊化的邊界是輸入的已知部分,因此訓練和測試時都這樣處理。圖3(d)中還給出了視覺化呈現。NER使用區分大小寫(cased)的WordPiece模型,而所有其他任務使用不區分大小寫(uncased)的模型。
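下面用一段示意性Python程式碼(譯者新增,非官方實現;分詞器用一個假設的簡化詞表代替真實的WordPiece詞塊化器)演示把詞級NER標籤對齊到子詞塊的做法——每個詞的第一個子詞塊保留原標籤,其餘子詞塊記為X:

# 示意:把CoNLL詞級標籤對齊到WordPiece子詞塊
def to_wordpiece_labels(words, labels, tokenize):
    wp_tokens, wp_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)            # 例如 "Henson" -> ["Hen", "##son"]
        wp_tokens.extend(pieces)
        wp_labels.extend([label] + ["X"] * (len(pieces) - 1))  # 首個子詞塊保留標籤,其餘記為X
    return wp_tokens, wp_labels

toy_vocab = {"Henson": ["Hen", "##son"], "puppeteer": ["puppet", "##eer"]}  # 假設的簡化分詞規則
tokenize = lambda w: toy_vocab.get(w, [w])

words = ["Jim", "Henson", "was", "a", "puppeteer"]
labels = ["I-PER", "I-PER", "O", "O", "O"]
print(to_wordpiece_labels(words, labels, tokenize))
# (['Jim', 'Hen', '##son', 'was', 'a', 'puppet', '##eer'],
#  ['I-PER', 'I-PER', 'X', 'O', 'O', 'O', 'X'])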

結果呈現在表3中。BERTLARGE在CoNLL-2003 NER測試集上以+0.2優於現有SOTA——帶多工學習的跨檢視訓練(Cross-View Training)(Clark等,2018)。

表3:CoNLL-2003命名實體識別結果。超引數通過開發集來選擇,得出的開發和測試分數是使用這些超引數進行五次隨機重啟的平均值。

Table 3: CoNLL-2003 Named Entity Recognition results. The hyperparameters were selected using the Dev set, and the reported Dev and Test scores are averaged over 5 random restarts using those hyperparameters.

To evaluate performance on a token tagging task, we fine-tune BERT on the CoNLL 2003 Named Entity Recognition (NER) dataset. This dataset consists of 200k training words which have been annotated as Person, Organization, Location, Miscellaneous, or Other (non-named entity).

For fine-tuning, we feed the final hidden representation Ti ∈ R^H for each token i into a classification layer over the NER label set. The predictions are not conditioned on the surrounding predictions (i.e., non-autoregressive and no CRF). To make this compatible with WordPiece tokenization, we feed each CoNLL-tokenized input word into our WordPiece tokenizer and use the hidden state corresponding to the first sub-token as input to the classifier. For example:

Jim Hen ##son was a puppet ##eer

I-PER I-PER X O O O X

Where no prediction is made for X. Since the WordPiece tokenization boundaries are a known part of the input, this is done for both training and test. A visual representation is also given in Figure 3 (d). A cased WordPiece model is used for NER, whereas an uncased model is used for all other tasks.

Results are presented in Table 3. BERTLARGE outperforms the existing SOTA, Cross-View Training with multi-task learning (Clark et al., 2018), by +0.2 on CoNLL-2003 NER Test.

4.4 對抗生成情境資料集SWAG

此對抗生成情境(SWAG)資料集包含113k個句子對補全樣例,用於評估基礎常識推理(Zellers等,2018)。

給定一個視訊字幕資料集中的某一個句子,任務是在四個選項中決定最合理的後續。例如:

一個女孩正穿過一套猴架杆。她

(i)跳過猴架杆。

(ii)掙扎到架杆抓住她的頭。

(iii)走到盡頭,站在木板上。

(iv)跳起並做後空翻。

(譯註2:monkey bars n.猴架,供孩子們攀爬玩耍的架子)

將BERT調適到SWAG資料集的方式類似於其GLUE適配。對於每個樣本,我們構造四個輸入序列,每個輸入序列包含給定句子(句子A)和可能的後續(句子B)的串聯。我們引入的唯一任務特定引數是一個向量V ∈ R^H,它與最終聚合表徵Ci ∈ R^H的點積代表每個選項i的得分,概率分佈是四個選項上的softmax:

Pi = e^(V·Ci) / Σ(j=1..4) e^(V·Cj)

我們以學習率2e-5、批量大小16對此模型微調了3個週期。結果呈現在表4中。BERTLARGE比作者們的基線ESIM+ELMo系統高出+27.1%。
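下面給出SWAG四選一打分的示意性Python程式碼(譯者新增,非官方實現;四個序列的聚合表徵Ci用隨機值代替真實的BERT輸出):

# 示意:SWAG多項選擇——用同一向量V對四個選項的[CLS]聚合表徵打分,再做softmax
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
H, num_choices = 768, 4
C = rng.normal(size=(num_choices, H))   # 四個輸入序列各自的聚合表徵 Ci ∈ R^H(此處用隨機值代替)
V = rng.normal(size=(H,))               # 唯一新增的任務特定向量 V ∈ R^H

scores = C @ V                          # 每個選項i的得分 V·Ci
P = softmax(scores)                     # Pi = e^(V·Ci) / Σj e^(V·Cj)
print(P, int(P.argmax()))               # 概率分佈與預測選項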

表4:SWAG開發和測試精度。測試結果由SWAG作者們對其隱藏標籤進行評分。如SWAG論文所述,人類效能是用100個樣本測量的。

Table 4: SWAG Dev and Test accuracies. Test results were scored against the hidden labels by the SWAG authors. Human performance is measured with 100 samples, as reported in the SWAG paper.

 

五、消模實驗Ablation Studies

雖然我們已經演示了極其強大的實驗結果,但到目前為止所呈現的結果並未分離BERT框架各個方面的具體貢獻。在本節中,我們將對BERT多個方面進行消融實驗,以便更好地瞭解它們的相對重要性。(譯註3:Quora上對ablation study的解釋:An ablation study typicallyrefers to removing some “feature” of the model or algorithm, and seeing howthat affects performance. 消模實驗通常是指刪除模型或演算法的某些“特徵”,並檢視如何影響效能。ablation study是為研究模型中提出的一些結構是否有效而設計的實驗。比如你提出了某結構,但要想確定這個結構是否有利於最終效果,就要將去掉該結構的模型與加上該結構的模型所得到的結果進行對比。ablation study直譯為“消融研究”,意譯是“模型簡化測試”或“消模實驗”。)

Although we have demonstrated extremely strong empirical results, the results presented so far have not isolated the specific contributions from each aspect of the BERT framework. In this section, we perform ablation experiments over a number of facets of BERT in order to better understand their relative importance.

5.1 預訓練任務的影響Effect of Pre-training Tasks

我們的核心主張之一是BERT的深度雙向性,這是通過遮蔽LM預訓練實現的,是BERT與以前工作相比最重要的改進。為證明這一主張,我們評估了兩個使用完全相同預訓練資料、微調方案和變換器超引數的BERTBASE新模型:

1. 無NSP:使用“遮蔽LM”(MLM)訓練但沒有“下一句預測”(NSP)任務的模型。

2. LTR&無NSP:使用從左到右(LTR)LM而不是MLM訓練的模型。在這種情況下,我們預測每個輸入單詞,不應用任何遮蔽。左側約束也用於微調,因為我們發現僅用左側語境預訓練、再用雙向語境微調的效果總是更差。此外,該模型在沒有NSP任務的情況下進行預訓練。這可與OpenAI GPT直接比較,但使用我們更大的訓練資料集、我們的輸入表徵和我們的微調方案。

結果顯示在表5中。我們首先檢查NSP任務帶來的影響。我們可以看到,刪除NSP會嚴重損害QNLI,MNLI和SQuAD的效能。這些結果表明,我們的預訓練方法對於獲得先前提出的強有力的實證結果至關重要。

表5:用BERTBASE架構做的預訓練任務消融。“無NSP”表示不帶下一句預測任務的訓練。“LTR&無NSP”表示不帶下一句預測、作為從左到右LM訓練,如同OpenAI GPT的訓練方式。“+BiLSTM”表示在微調期間在“LTR&無NSP”模型之上新增一個隨機初始化的BiLSTM。

Table 5: Ablation over the pre-training tasks using the BERTBASE architecture. “No NSP” is trained without the next sentence prediction task. “LTR & No NSP” is trained as a left-to-right LM without the next sen