Character-Tagging Chinese Word Segmentation with a Maximum Entropy Model (Python Implementation)

阿新 • Published: 2019-01-12

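As in the earlier article (see: Chinese word segmentation with a maximum entropy model), this article segments Chinese text by character tagging, labeling the corpus with both a 4-tag and a 6-tag scheme and comparing the results. The earlier article used a sample bundled with the toolkit for 4-tag segmentation, but its features were designed for English part-of-speech tagging, so precision and recall were low (recall 83.7%, precision 84.1%). PS: Why can features meant for part-of-speech tagging be used for segmentation at all? Because the maximum entropy model is independent of the concrete task: every problem is cast as labeling sequence data. Any sequence labeling problem can therefore be handled by a maximum entropy model; what differs between problems is the choice of features, and the quality of the feature template affects the accuracy of the results.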

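Part 1: Installing the maximum entropy toolkit

The model comes from Dr. Zhang Le's open-source Maximum Entropy Modeling Toolkit for Python and C++. The Chinese corpus is the SIGHAN bakeoff 2005 data. The best closed-test result to date comes from 4-tag + CRF character tagging, which reaches over 92% precision, recall, and F-score on the PKU corpus and over 96% on the MSR corpus. We now turn to the subject of this article: character-tagging Chinese word segmentation based on a maximum entropy model.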

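Download, install, and use Dr. Zhang Le's maximum entropy toolkit. This article uses the code hosted on GitHub:

maxent. Installation notes:

1. From the top-level source directory maxent-master, the usual "./configure && make && (sudo) make install" sequence installs the C++ library.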

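Notes: (1) gcc should preferably be version 4.7 or later. The build failed for me on 4.4.3 and succeeded after upgrading to 4.7; for details, see: Upgrading g++ and gcc on Ubuntu.

(2) If the gcc version is fine but ./configure reports errors, see: ./configure errors on Linux explained.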

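2. Then go into the python subdirectory and install the Python package: python setup.py build && (sudo) python setup.py install. This Python library is generated with SWIG.

Note: if the build stops with a "Python.h: No such file or directory" error, see: Fixing the "Python.h: No such file or directory" error.

For details and background on the toolkit, the official manual is recommended; it is very thorough.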

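Part 2: 4-tag and 6-tag

1. Character tagging

What is character tagging? Consider the sentence 我是一名程序員。 ("I am a programmer."). Divide all characters into four classes: S for a single-character word, B for the first character of a word, M for a character in the middle of a word, and E for the last character of a word.

If we know the class of every character in the sentence, that is:

我/S 是/S 一/B 名/E 程/B 序/M 員/E 。/S

then we also know its segmentation: 我 是 一名 程序員 。

Segmentation has thus become a classification problem: assign a class to each character. Machine learning methods handle classification well, so the next step is to solve segmentation with machine learning.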
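The mapping from per-character tags back to words can be sketched in a few lines. This is only an illustration of the idea, not the script used later in the article; the function name tags_to_words is made up for this example.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character 4-tag labels (S, B, M, E)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        # S and E both close a word; B and M keep it open.
        if tag in ("S", "E"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without an E tag
        words.append(current)
    return words


chars = list("我是一名程序員。")
tags = ["S", "S", "B", "E", "B", "M", "E", "S"]
print(" ".join(tags_to_words(chars, tags)))
```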

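2. 4-tag and 6-tag

With four character classes (4-tag), the labels are: S (single-character word), B (first character), M (middle character), E (last character).

With six character classes (6-tag), the labels are: S (single-character word), B (first character), C (second character), D (third character), M (middle character), E (last character).

For example, if 【中華人民共和國】 is treated as one word, the 4-tag labeling is 中/B 華/M 人/M 民/M 共/M 和/M 國/E, and the 6-tag labeling is 中/B 華/C 人/D 民/M 共/M 和/M 國/E.

This article tags the training and test corpora with both the 4-tag and the 6-tag scheme, trains a model on each tagged training corpus, and uses it to segment the test corpus.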
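The two schemes differ only in how the leading characters of a long word are labeled, which a small sketch makes concrete. The helper tag_word and its scheme parameter are hypothetical names for this illustration; the rule for short words (a three-character word becomes B C E under 6-tag) follows the source code in Part 6.

```python
def tag_word(word, scheme=4):
    """Return (char, tag) pairs for one word under the 4-tag or 6-tag scheme."""
    n = len(word)
    if n == 0:
        return []
    if n == 1:
        return [(word, "S")]
    # Labels for the leading characters: B only (4-tag) or B, C, D (6-tag).
    head = ["B"] if scheme == 4 else ["B", "C", "D"]
    tags = []
    for i in range(n - 1):
        tags.append(head[i] if i < len(head) else "M")
    tags.append("E")  # the last character always closes the word
    return list(zip(word, tags))


for scheme in (4, 6):
    pairs = tag_word("中華人民共和國", scheme)
    print(" ".join(ch + "/" + tag for ch, tag in pairs))
```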

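Part 3: The machine learning approach

The basic idea of machine learning is to train on labeled data to obtain a model, then use the model to make predictions on new data. This breaks the problem into three questions:

(1) What is the labeled data?

(2) What is the model?

(3) How is the model used for prediction?

For character-based segmentation, the three questions can be answered briefly.

(1) The labeled data is: character x has class a in context A; character x has class b in context B; character x has class c in context C; and so on.

(2) The model is a set of formulas that describe the data. They contain the information in the labeled data plus some unknown parameters. Training the model means using a machine learning method to assign those parameters values that optimize the formulas.

(3) Prediction: given a character x and a context Z, the trained model yields the probability of each class for x in that context, from which the class of x can be predicted.

Some terms in these answers are still undefined, for example "context". In character-based segmentation, the context is a description of a character's surroundings. Take the sentence 我是程序員: if we define a character's context as the character before it and the character after it, then the context of 程 is 是 and 序. Labeled data is thus a collection of {class: context} pairs, as the following example shows.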

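Let C-1 denote the character before the current one, C0 the current character, and C1 the character after it, and suppose we have a segmented sentence: 我 是 一名 程序員 。

First, give each character a tag that represents its class: 我/S 是/S 一/B 名/E 程/B 序/M 員/E 。/S

This converts directly into labeled data. For the character 程, for example, we get the record: B C-1=名 C0=程 C1=序. It says that in the context C-1=名 C0=程 C1=序, the class is B.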
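Generating one such record per character can be sketched as follows. The function name make_events is hypothetical, and using "_" for positions outside the sentence is an assumption borrowed from the padding character in the source code of Part 6.

```python
def make_events(chars, tags, pad="_"):
    """Turn a tagged sentence into 'TAG C-1=.. C0=.. C1=..' records."""
    events = []
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        prev_ch = chars[i - 1] if i > 0 else pad          # C-1
        next_ch = chars[i + 1] if i + 1 < len(chars) else pad  # C1
        events.append("%s C-1=%s C0=%s C1=%s" % (tag, prev_ch, ch, next_ch))
    return events


chars = list("我是一名程序員。")
tags = ["S", "S", "B", "E", "B", "M", "E", "S"]
for event in make_events(chars, tags):
    print(event)
```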

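A collection of segmented text therefore yields many such records. With the labeled data in hand, a machine learning package can be trained on it to produce a model. At prediction time, load the model and supply a context (that is, C-1=? C0=? C1=?) to obtain the probability of each class in that context. The simplest rule is to pick the class with the highest probability, though other decision schemes are possible.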
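The "pick the most probable class" rule is a one-line argmax over (label, probability) pairs, the same shape that the source code in Part 6 receives from eval_all. The function name and the probabilities below are invented purely for illustration; they are not model output.

```python
def most_probable(label_probs):
    """Return the label whose probability is largest."""
    return max(label_probs, key=lambda pair: pair[1])[0]


# Hypothetical distribution for the context C-1=名 C0=程 C1=序.
label_probs = [("S", 0.05), ("B", 0.80), ("M", 0.10), ("E", 0.05)]
print(most_probable(label_probs))
```

Because each character is decided independently under this rule, nothing prevents an invalid tag sequence such as B followed by S; that is one reason other decision schemes may be worth considering.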

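Throughout this process the definition of context, the machine learning method, and the prediction method can all vary, but the general idea of character-based segmentation stays the same.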

Part 4: Segmentation in practice

1. Corpus sources

(1) Training data: icwb2-data/training/pku_training.utf8

(2) Test data: icwb2-data/testing/pku_test.utf8

(3) Gold segmentation: icwb2-data/gold/pku_test_gold.utf8

(4) Scoring tool: icwb2-data/scripts/score

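2. Sample data at each stage

Note: the items below are only samples, not the complete data.

(1) Training corpus

(2) Character-tagged training corpus

(3) Training corpus features

Here Ci denotes the character at offset i from the current character.

(4) Test corpus

(5) Test corpus features

(6) Character-tagged test corpus

(7) Segmented test corpus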
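To make the Ci notation concrete, here is a reduced sketch of a feature template over a five-character window: the unigrams C-2 to C2 plus the adjacent bigrams. It covers only part of the template in the Part 6 source code, and the names char_at and window_features are hypothetical.

```python
def char_at(line, i, pad="_"):
    """Character at position i, or the pad symbol outside the line."""
    return line[i] if 0 <= i < len(line) else pad


def window_features(line, i):
    """Unigram and adjacent-bigram features around position i."""
    offsets = (-2, -1, 0, 1, 2)
    c = {k: char_at(line, i + k) for k in offsets}
    # Unigrams: C-2, C-1, C0, C1, C2
    feats = ["C%d=%s" % (k, c[k]) for k in offsets]
    # Adjacent bigrams: C-2C-1, C-1C0, C0C1, C1C2
    for k in (-2, -1, 0, 1):
        feats.append("C%d=%sC%d=%s" % (k, c[k], k + 1, c[k + 1]))
    return feats


print(" ".join(window_features("我是一名程序員。", 4)))
```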

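3. Computing precision, recall, and F-score

With the character-tagged segmentation output in hand, the bakeoff 2005 scoring script measures how well the segmentation did: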
./icwb2-data/scripts/score ./icwb2-data/gold/pku_training_words.utf8 ./icwb2-data/gold/pku_test_gold.utf8 pku_result.utf8 > pku_maxent.score
Here pku_result.utf8 is your result file, and pku_maxent.score is the file you choose for storing the scoring output.
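For intuition about what is being measured, word-level precision and recall can be sketched by comparing character spans: a predicted word counts as correct only if a gold word covers exactly the same positions. This is a simplified illustration, not the scoring script's own implementation, and the function names are hypothetical.

```python
def to_spans(words):
    """Map a word list to the set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def prf(gold_words, pred_words):
    """Word-level precision, recall, and F-score for one sentence."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / float(len(pred))
    recall = correct / float(len(gold))
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


gold = ["我", "是", "一名", "程序員", "。"]
pred = ["我", "是", "一", "名", "程序員", "。"]  # 一名 wrongly split in two
print(prf(gold, pred))
```

In this example four of the six predicted words match gold words, and four of the five gold words are recovered.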

Part 5: Segmentation results

The test results:

Scheme	Iterations	Gold words	Output words	Recall	Precision	F-score
4-tag	100	104372	102810	89.40%	90.70%	0.9
4-tag	150	104372	102806	89.40%	90.80%	0.901
4-tag	200	104372	102808	89.50%	90.80%	0.901
6-tag	100	104372	103246	91.00%	92.00%	0.915
6-tag	150	104372	103240	91.00%	92.00%	0.915
6-tag	200	104372	103237	91.10%	92.10%	0.915
6-tag	300	104372	103212	91.10%	92.10%	0.916

The 6-tag scheme outperforms the 4-tag scheme. The best figures are recall 91.1%, precision 92.1%, and F-score 0.916.
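The F-score here is the harmonic mean of precision and recall, so the best row can be checked by hand. The sketch below plugs in the rounded percentages from the table; since the scoring script works from unrounded counts, recomputing other rows this way may differ in the last digit.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    return 2 * precision * recall / (precision + recall)


# 6-tag, 300 iterations: precision 92.1%, recall 91.1%
print(round(f_measure(0.921, 0.911), 3))
```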

Part 6: Source code

#! /usr/bin/env python
# -*- coding: utf-8 -*-

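# Word segmenter based on a maximum entropy model and character tagging (4-tag and 6-tag)
# Written for Python 2 (print statements, integer division)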

import codecs
import sys
from maxent import MaxentModel

# Tag the training set with the 6-tag scheme
def tag6_training_set(training_file, tag_training_set_file):
    fin = codecs.open(training_file, 'r', 'utf-8')
    contents = fin.read()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    
    words = contents.split(' ')
    print len(words)

    tag_words_list = []
    i = 0
    for word in words:
        i += 1
        if(i % 100 == 0): 
            tag_words_list.append(u'\r')
        if(len(word) == 0):
            continue
        if(len(word) == 1):
            tag_word = word + '/S'
        elif (len(word) == 2):
            tag_word = word[0] + '/B' + word[1] + '/E'
        elif (len(word) == 3):
            tag_word = word[0] + '/B' + word[1] + '/C' + word[2] + '/E'
        elif (len(word) == 4):
            tag_word = word[0] + '/B' + word[1] + '/C' + word[2] + '/D' + word[3] + '/E'
        else:
            tag_word = word[0] + '/B' + word[1] + '/C' + word[2] + '/D'
            mid_words = word[3:-1]
            for mid_word in mid_words:
                tag_word += (mid_word + '/M')
            tag_word += (word[-1] + '/E')

        tag_words_list.append(tag_word)

    tag_words = ''.join(tag_words_list)
    fout = codecs.open(tag_training_set_file, 'w', 'utf-8')
    fout.write(tag_words)
    fout.close()

    return (words, tag_words_list)

# Tag the training set with the 4-tag scheme
def tag4_training_set(training_file, tag_training_set_file):
    fin = codecs.open(training_file, 'r', 'utf-8')
    contents = fin.read()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    
    words = contents.split(' ')
    print len(words)

    tag_words_list = []
    i = 0
    for word in words:
        i += 1
        if(i % 100 == 0): 
            tag_words_list.append(u'\r')
        if(len(word) == 0):
            continue
        if(len(word) == 1):
            tag_word = word + '/S'
        elif (len(word) == 2):
            tag_word = word[0] + '/B' + word[1] + '/E'
        else:
            # 4-tag: the first character is B and every interior character is M
            tag_word = word[0] + '/B'
            mid_words = word[1:-1]
            for mid_word in mid_words:
                tag_word += (mid_word + '/M')
            tag_word += (word[-1] + '/E')

        tag_words_list.append(tag_word)

    tag_words = ''.join(tag_words_list)
    fout = codecs.open(tag_training_set_file, 'w', 'utf-8')
    fout.write(tag_words)
    fout.close()

    return (words, tag_words_list)

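# Return the character of the i-th unit, where each unit spans `times`
# characters (3 for 'x/T' tagged text, 1 for raw text); '_' pads out-of-range positions
def get_near_char(contents, i, times):
    words_len = len(contents) / times
    if (i < 0 or i > words_len - 1):
        return '_'
    else:
        return contents[i*times]

def get_near_tag(contents, i ,times):
    words_len = len(contents) / times
    if (i < 0 or i > words_len - 1):
        return '_'
    else:
        return contents[i*times*2]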

def isPu(char):
    punctuation = [u'，', u'。', u'？', u'！', u'；', u'－－', u'、', u'——', u'（', u'）', u'《', u'》', u'：', u'“', u'”', u'’', u'‘']
    if char in punctuation:
        return '1'
    else:
        return '0'

def get_class(char):
    zh_num = [u'零', u'○', u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九', u'十', u'百', u'千', u'萬']
    ar_num = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9', u'.', u'０',u'１',u'２',u'３',u'４',u'５',u'６',u'７',u'８',u'９']
    date = [u'日', u'年', u'月']
    letter = list(u'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
    if char in zh_num or char in ar_num:
        return '1'
    elif char in date:
        return '2'
    elif char in letter:
        return '3'
    else:
        return '4'

# Extract training events (tag followed by features) from the tagged training set
def get_event(tag_file_path, event_file_path):
    f = codecs.open(tag_file_path, 'r', 'utf-8')
    contents = f.read()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    words_len = len(contents)/3
    event_list = []

    index = range(0, words_len)
    for i in index:
        pre_char = get_near_char(contents, i-1, 3)
        pre_pre_char = get_near_char(contents, i-2, 3)
        cur_char = get_near_char(contents, i, 3)
        next_char = get_near_char(contents, i+1, 3)
        next_next_char = get_near_char(contents, i+2, 3)
        event_list.append(
            contents[i*3+2] + ' '
            + 'C-2='+pre_pre_char + ' ' + 'C-1='+pre_char + ' '
            + 'C0=' + cur_char + ' '
            + 'C1=' + next_char + ' ' + 'C2=' + next_next_char + ' '
            + 'C-2=' + pre_pre_char + 'C-1=' + pre_char + ' '
            + 'C-1=' + pre_char + 'C0=' + cur_char + ' '
            + 'C0=' + cur_char + 'C1=' + next_char + ' '
            + 'C1=' + next_char + 'C2=' + next_next_char + ' '
            + 'C-1=' + pre_char + 'C1=' + next_char + ' '
            + 'C-2=' + pre_pre_char + 'C-1=' + pre_char + 'C0=' + cur_char + ' '
            + 'C-1=' + pre_char + 'C0=' + cur_char + 'C1=' + next_char + ' '
            + 'C0=' + cur_char + 'C1=' + next_char + 'C2=' + next_next_char + ' '
            + 'Pu=' + isPu(cur_char) + ' '
            + 'Tc-2=' + get_class(pre_pre_char) + 'Tc-1=' + get_class(pre_char)
            + 'Tc0=' + get_class(cur_char) + 'Tc1=' + get_class(next_char)
            + 'Tc2=' + get_class(next_next_char) + ' '
            + '\r')


    # events = ''.join(event_list)
    fout = codecs.open(event_file_path, 'w', 'utf-8')
    for event in event_list:
        fout.write(event)
    fout.close()

    return event_list

# Generate features for the test set
def get_feature(test_file_path, feature_file_path):
    f = codecs.open(test_file_path, 'r', 'utf-8')
    contents = f.read()
    # split into lines, skipping empty ones
    contents_list = [line for line in contents.splitlines() if line]

    fout = codecs.open(feature_file_path, 'w', 'utf-8')
    for line in contents_list:
        words_len = len(line)
        feature_list = []

        index = range(0, words_len)
        for i in index:
            pre_char = get_near_char(line, i-1, 1)
            pre_pre_char = get_near_char(line, i-2, 1)
            cur_char = get_near_char(line, i, 1)
            next_char = get_near_char(line, i+1, 1)
            next_next_char = get_near_char(line, i+2, 1)
            feature_list.append(
                  'C-2=' + pre_pre_char + ' ' + 'C-1=' + pre_char + ' '
                + 'C0=' + cur_char + ' '
                + 'C1=' + next_char + ' ' + 'C2=' + next_next_char + ' '
                + 'C-2=' + pre_pre_char + 'C-1=' + pre_char + ' '
                + 'C-1=' + pre_char + 'C0=' + cur_char + ' '
                + 'C0=' + cur_char + 'C1=' + next_char + ' '
                + 'C1=' + next_char + 'C2=' + next_next_char + ' '
                + 'C-1=' + pre_char + 'C1=' + next_char + ' '
                + 'C-2=' + pre_pre_char + 'C-1=' + pre_char + 'C0=' + cur_char + ' '
                + 'C-1=' + pre_char + 'C0=' + cur_char + 'C1=' + next_char + ' '
                + 'C0=' + cur_char + 'C1=' + next_char + 'C2=' + next_next_char + ' '
                + 'Pu=' + isPu(cur_char) + ' '
                + 'Tc-2=' + get_class(pre_pre_char) + 'Tc-1=' + get_class(pre_char)
                + 'Tc0=' + get_class(cur_char) + 'Tc1=' + get_class(next_char)
                + 'Tc2=' + get_class(next_next_char) + ' '
                + '\r')

        for item in feature_list:
            fout.write(item)
        fout.write('split\r\n')

    fout.close()

    return feature_list

# Take every other character of a line (skips single-character separators)
def split_by_blank(line):
    line_list = []
    line_len = len(line)
    i = 0
    while i < line_len:
        line_list.append(line[i])
        i += 2

    return line_list

# Train the model
def training(feature_file_path, trained_model_file, times):
    m = MaxentModel()
    fin = codecs.open(feature_file_path, 'r', 'utf-8')
    all_list = []
    m.begin_add_event()
    for line in fin:
        line = line.rstrip()
        line_list = line.split(' ')
        str_list = []
        for item in line_list:
            str_list.append(item.encode('utf-8'))
        all_list.append(str_list)
        m.add_event(str_list[1:], str_list[0], 1)
    m.end_add_event()
    print 'begin training'
    m.train(times, "lbfgs")
    print 'end training'
    m.save(trained_model_file)
    return all_list

# Return the label with the highest probability
def max_prob(label_prob_list):
    max_prob = 0
    max_prob_label = ''
    for label_prob in label_prob_list:
        if label_prob[1] > max_prob:
            max_prob = label_prob[1]
            max_prob_label = label_prob[0]

    return max_prob_label

# Tag the test set
def tag_test(test_feature_file, trained_model_file, tag_test_set_file):
    fin = codecs.open(test_feature_file, 'r', 'utf-8')
    fout = codecs.open(tag_test_set_file, 'w', 'utf-8')
    m = MaxentModel()
    m.load(trained_model_file)
    contents = fin.read()
    feature_list = contents.split('\r')
    feature_list.remove('\n')
    for feature in feature_list:
        if (feature == 'split'):
            fout.write('\n\n\n')
            continue
        str_feature = []
        u_feature = feature.split(' ')
        for item in u_feature:
            str_feature.append(item.encode('utf-8'))
        label_prob_list = m.eval_all(str_feature)
        label = max_prob(label_prob_list)

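        try:
            # str_feature[2] is the 'C0=<char>' feature; recover the current character
            new_tag = str_feature[2].split('=')[1] + '/' + label
        except IndexError:
            # malformed feature line: report it and skip
            print str_feature
            continue
        fout.write(new_tag.decode('utf-8'))

    fout.close()
    return feature_list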

# Rebuild the final segmented text from 6-tag output
def tag6_to_words(tag_training_set_file, result_file):
    fin = codecs.open(tag_training_set_file, 'r', 'utf-8')
    fout = codecs.open(result_file, 'w', 'utf-8')

    contents = fin.read()
    words_len = len(contents) / 3
    result = []
    i = 0
    while (i < words_len):
        cur_word_label = contents[i*3+2]
        cur_word = contents[i*3]
        if (cur_word_label == 'S'):
            result.append(cur_word + ' ')
        elif (cur_word_label == 'B'):
            result.append(cur_word)
        elif (cur_word_label == 'C'):
            result.append(cur_word)
        elif (cur_word_label == 'D'):
            result.append(cur_word)
        elif (cur_word_label == 'M'):
            result.append(cur_word)
        elif (cur_word_label == 'E'):
            result.append(cur_word + ' ')
        else:
            result.append(cur_word)
        i += 1

    fout.write(''.join(result))
    fout.close()

# Rebuild the final segmented text from 4-tag output
def tag4_to_words(tag_training_set_file, result_file):
    fin = codecs.open(tag_training_set_file, 'r', 'utf-8')
    fout = codecs.open(result_file, 'w', 'utf-8')

    contents = fin.read()
    words_len = len(contents) / 3
    result = []
    i = 0
    while (i < words_len):
        cur_word_label = contents[i*3+2]
        cur_word = contents[i*3]
        if (cur_word_label == 'S'):
            result.append(cur_word + ' ')
        elif (cur_word_label == 'B'):
            result.append(cur_word)
        elif (cur_word_label == 'M'):
            result.append(cur_word)
        elif (cur_word_label == 'E'):
            result.append(cur_word + ' ')
        else:
            result.append(cur_word)
        i += 1

    fout.write(''.join(result))
    fout.close()

# Entry point: tag, extract features, train, label the test set, rebuild words
def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print 'Usage: python ' + sys.argv[0] + ' training_file test_file result_file'
        exit(-1)
    training_file = args[0]
    test_file = args[1]
    result_file = args[2]

    # Tag the training set
    tag_training_set_file = training_file + ".tag"
    #tag4_training_set(training_file, tag_training_set_file)
    tag6_training_set(training_file, tag_training_set_file)
    print 'tag train set succeed'

    # Extract training set features
    feature_file_path = training_file + ".feature"
    get_event(tag_training_set_file, feature_file_path)
    print 'get training set feature succeed'

    # Generate test set features
    test_feature_file = test_file + ".feature"
    get_feature(test_file, test_feature_file)
    print 'get test set features succeed'

    # Train the model
    times = [10]
    for time in times:
        trained_model_file = training_file + '.' + str(time) + ".model"
        training(feature_file_path, trained_model_file, time)
        print 'training model succeed: ' + str(time)

        # Tag the test set
        tag_test_set_file = test_file + ".tag"
        tag_test(test_feature_file, trained_model_file, tag_test_set_file)
        print 'tag test set succeed'

        # Produce the final result
        #tag4_to_words(tag_test_set_file, result_file + '.' + str(time))
        tag6_to_words(tag_test_set_file, result_file + '.' + str(time))
        print 'get final result succeed ' + result_file + '.' + str(time)

'''
def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print 'Usage: python ' + sys.argv[0] + ' trained_model_file test_file result_file'
        exit(-1)
    trained_model_file = args[0]
    test_file = args[1]
    result_file = args[2]

    # Generate test set features
    test_feature_file = test_file + ".feature"
    get_feature(test_file, test_feature_file)
    print 'get test set features succeed'

    # Train the model (skipped here: a trained model file is passed in)
    times = [10]
    for time in times:
        #trained_model_file = training_file + '.' + str(time) + ".model"
        #training(feature_file_path, trained_model_file, time)
        #print 'training model succeed: ' + str(time)

        # Tag the test set
        tag_test_set_file = test_file + ".tag"
        tag_test(test_feature_file, trained_model_file, tag_test_set_file)
        print 'tag test set succeed'

        # Produce the final result
        tag6_to_words(tag_test_set_file, result_file + '.' + str(time))
        print 'get final result succeed ' + result_file + '.' + str(time)
'''

if __name__ == "__main__":
    main()

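References:

The experiments in this article mainly follow the article below:

[1] 使用Python,字標註及最大熵法進行中文分詞 (Chinese word segmentation with Python, character tagging, and maximum entropy) http://blog.csdn.net/on_1y/article/details/9769919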
