BLEU: An Evaluation Metric for Machine Translation

阿新 • Published: 2019-01-24

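1. Introduction

In applications that involve sentence generation, machine translation in particular, measuring how similar a generated sentence is to a reference sentence is an important problem. In 2002, Kishore Papineni et al. proposed BLEU, which has become a classic metric for this purpose. The paper has since been cited more than ten thousand times, which makes it required reading in NLP.

2. Examples used in the paper

The paper gives four examples to help explain the algorithm. Each example has one or more candidate sentences to be evaluated and a set of reference sentences.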

Example 1.

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.

Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Example 2.

Candidate: the the the the the the the.

Reference 1: The cat is on the mat.

Reference 2: There is a cat on the mat.

Example 3.

Candidate: of the

Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.

Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.

Reference 3: It is the practical guide for the army always to heed the directions of the party.

Example 4.

Candidate 1: I always invariably perpetually do.

Candidate 2: I always do.

Reference 1: I always do.

Reference 2: I invariably do.

Reference 3: I perpetually do.

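3. Basic metrics and concepts used by BLEU

3.1 n-gram

An n-gram is a run of n consecutive words in a sentence. A sentence of 18 words has 18 1-grams, since each word is a 1-gram, and 17 2-grams, one starting at every word except the last.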
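The counting rule can be checked with a few lines of Python. This is a minimal sketch: the helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, not something the paper prescribes.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Candidate 1 of Example 1, without the final period (assumed tokenization:
# lowercase, split on whitespace).
sentence = ("It is a guide to action which ensures that the military "
            "always obeys the commands of the party")
tokens = sentence.lower().split()

print(len(tokens))             # 18
print(len(ngrams(tokens, 1)))  # 18
print(len(ngrams(tokens, 2)))  # 17
print(ngrams(tokens, 2)[:2])   # [('it', 'is'), ('is', 'a')]
```

In general a sentence of L words has L - n + 1 n-grams, and none when n is larger than L.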

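3.2 Precision and modified n-gram precision

Precision is the fraction of the n-grams in the candidate sentence that appear in any of the reference sentences.

In Candidate 1 of Example 1, 17 of the 18 words appear in the references (only "obeys" does not), so the 1-gram precision is 17/18. Of its 17 2-grams, 10 appear in the references, so the 2-gram precision is 10/17. By the same reasoning, the 1-gram precision of the candidate in Example 2 is 7/7.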
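The figures above can be reproduced with a short script. It is only a sketch: the function names and the tokenizer (lowercase, alphanumeric runs only, so that "Party" matches "party") are assumptions chosen to make the counts come out as in the text.

```python
import re

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_precision(candidate, references, n):
    """Return (matched, total) for plain, unclipped n-gram precision."""
    cand = ngrams(tokenize(candidate), n)
    ref_grams = set()
    for ref in references:
        ref_grams.update(ngrams(tokenize(ref), n))
    matched = sum(1 for gram in cand if gram in ref_grams)
    return matched, len(cand)

refs1 = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = ("It is a guide to action which ensures that the military "
         "always obeys the commands of the party.")
refs2 = ["The cat is on the mat.", "There is a cat on the mat."]
cand2 = "the the the the the the the."

print(naive_precision(cand1, refs1, 1))  # (17, 18)
print(naive_precision(cand1, refs1, 2))  # (10, 17)
print(naive_precision(cand2, refs2, 1))  # (7, 7)
```

The last line shows the weakness of plain precision: a candidate made only of "the" scores a perfect 7/7.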

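This approach has a problem: a word in the references can be matched over and over, which is unreasonable. Hence the modified n-gram precision. The idea is that once an n-gram in a reference sentence has been matched it cannot be matched again, and the number of matches credited to an n-gram is capped at the largest number of times it occurs in any single reference sentence. For example, the candidate in Example 2 contains "the" 7 times, while "the" occurs 2 times in Reference 1 and 1 time in Reference 2, so the cap is 2 rather than the sum 3, and the modified 1-gram precision is 2/7.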
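A minimal sketch of the clipping rule follows. The function names and the tokenizer are assumptions for illustration; the clipping itself (cap each candidate count at the maximum count in any single reference) is the rule described above.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    # For each n-gram, the largest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

refs2 = ["The cat is on the mat.", "There is a cat on the mat."]
cand2 = "the the the the the the the."

print(modified_precision(cand2, refs2, 1))  # (2, 7)
print(modified_precision(cand2, refs2, 2))  # (0, 6)
```

Using a `Counter` keeps the cap per n-gram type, so the seven occurrences of "the" are credited at most twice no matter how they are ordered.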

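With this method every sentence gets a modified n-gram precision. A single sentence cannot represent the quality of a whole translation, so the results for a passage, or for all translated sentences, are combined into $p_n$:

$$p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')}$$

In short, add up the numerators of the modified n-gram precisions of all sentences and divide by the sum of their denominators.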
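The pooling can be sketched as below. The two-sentence "corpus" simply reuses Candidate 1 of Example 1 and the candidate of Example 2; treating them as one test set is a toy assumption, as are the function names and the tokenizer.

```python
import re
from collections import Counter
from fractions import Fraction

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(candidate, references, n):
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

def corpus_precision(pairs, n):
    """Pool numerators and denominators over all (candidate, references) pairs."""
    numerator = denominator = 0
    for candidate, references in pairs:
        clipped, total = clipped_counts(candidate, references, n)
        numerator += clipped
        denominator += total
    return Fraction(numerator, denominator)

corpus = [
    ("It is a guide to action which ensures that the military always obeys the commands of the party.",
     ["It is a guide to action that ensures that the military will forever heed Party commands.",
      "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
      "It is the practical guide for the army always to heed the directions of the party."]),
    ("the the the the the the the.",
     ["The cat is on the mat.", "There is a cat on the mat."]),
]

print(corpus_precision(corpus, 1))  # 19/25, i.e. (17 + 2) / (18 + 7)
print(corpus_precision(corpus, 2))  # 10/23, i.e. (10 + 0) / (17 + 6)
```

Note that this is not the same as averaging the per-sentence precisions: longer sentences contribute more n-grams and therefore carry more weight.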

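4. Brevity penalty (BP) and the BLEU formula

The modified n-gram precision above yields one $p_n$ for each n-gram length n, so the next question is how to combine the $p_n$ for different n into a final BLEU score. The authors also considered the case where the candidate translation is incomplete. Machine translation cannot ignore this case, and $p_n$ alone does not reflect it: in Example 3 the candidate "of the" is far too short, yet every one of its n-grams appears in the references.

Recall cannot solve this problem either. In Example 4, Candidate 1 clearly has higher recall than Candidate 2, yet it is clearly the worse translation.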
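To make the recall argument concrete, the sketch below scores both candidates of Example 4 with a naive recall over the pooled vocabulary of the references. This pooled definition is an assumption made purely for illustration; the paper does not define recall this way.

```python
import re

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

references = ["I always do.", "I invariably do.", "I perpetually do."]

# Pool every distinct word that appears in any reference.
ref_vocab = set()
for ref in references:
    ref_vocab.update(tokenize(ref))

def pooled_recall(candidate):
    """Return (reference words covered, size of the pooled reference vocabulary)."""
    covered = set(tokenize(candidate)) & ref_vocab
    return len(covered), len(ref_vocab)

print(pooled_recall("I always invariably perpetually do."))  # (5, 5)
print(pooled_recall("I always do."))                         # (3, 5)
```

Candidate 1 covers every reference word and still reads worse, because a good translation picks one of the alternative wordings instead of using all of them.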

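The fix starts with the brevity penalty BP. The authors set BP to 1 when the candidate's length equals that of any reference translation or exceeds the reference length, and compute BP from a formula when the candidate is shorter. Let c be the length of the candidate translation and r the length of the reference translation. Then

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$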
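The penalty is a one-line function of the two lengths. In the sketch below the function name and the handling of an empty candidate are assumptions; the two branches follow the formula above.

```python
import math

def brevity_penalty(c, r):
    """BP for candidate length c and reference length r."""
    if c <= 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

# Example 3: the candidate "of the" has c = 2; the nearest reference has 16 words.
print(brevity_penalty(2, 16))   # exp(-7), roughly 0.0009
print(brevity_penalty(18, 18))  # 1.0
print(brevity_penalty(20, 18))  # 1.0
```

At c = r the second branch also evaluates to 1, so the penalty is continuous and only bites when the candidate is shorter than the reference.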
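The BLEU score is then

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

In the log domain the computation becomes simpler:

$$\log \text{BLEU} = \min\left(1 - \frac{r}{c},\ 0\right) + \sum_{n=1}^{N} w_n \log p_n$$

Typically N = 4 and $w_n = 1/4$, which gives Bleu4, a standard metric in many papers.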
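Putting the pieces together, the log-domain formula can be sketched for a single sentence as follows. Several choices here are assumptions rather than anything stated above: the tokenizer, taking r as the reference length closest to c (ties going to the shorter reference), uniform weights 1/N, and returning 0 without smoothing when some $p_n$ is 0.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and keep only alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_counts(cand_tokens, ref_token_lists, n):
    cand_counts = Counter(ngrams(cand_tokens, n))
    max_ref = Counter()
    for ref in ref_token_lists:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped, sum(cand_counts.values())

def sentence_bleu(candidate, references, max_n=4):
    cand = tokenize(candidate)
    refs = [tokenize(ref) for ref in references]
    c = len(cand)
    if c == 0:
        return 0.0
    # Assumption: r is the reference length closest to c, ties to the shorter.
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    log_bleu = min(1 - r / c, 0.0)          # log of the brevity penalty
    for n in range(1, max_n + 1):
        clipped, total = clipped_counts(cand, refs, n)
        if clipped == 0:
            return 0.0                       # log(0) is undefined; no smoothing
        log_bleu += (1.0 / max_n) * math.log(clipped / total)
    return math.exp(log_bleu)

refs1 = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = ("It is a guide to action which ensures that the military "
         "always obeys the commands of the party.")
score = sentence_bleu(cand1, refs1)
print(round(score, 2))  # 0.5
```

For this candidate the four clipped precisions are 17/18, 10/17, 7/16 and 4/15 and BP is 1, so the score is the fourth root of their product. Real evaluations compute $p_n$ and the lengths over a whole test set rather than one sentence, and established toolkits differ in tokenization and smoothing, so scores from this sketch are not comparable with published numbers.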
