Repost: Part-of-Speech Tagging with CRF++

阿新 • Published: 2017-08-29


Part-of-Speech Tagging with CRF++

2016-02-28 · Category: NLP

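Both the training and the test data come from the 1998 People's Daily annotated corpus, split roughly 10:1 between training and test. Tagging parts of speech directly with CRF++ gives an accuracy of 0.933882. There are more than ten million features, so training takes a long time: on a machine with a 48-core CPU, with the number of CRF++ threads set to 40 via -p, training took about seven hours.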

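The corpus, the Python script that generates the training data, the training log, the model, and the accuracy script have all been uploaded to a network drive and can be downloaded directly via the "Download CRF++ POS tagging" link. The programs were run successfully on CentOS 6.5 with Python 2.7. On Windows or Ubuntu you may hit errors, usually minor ones such as encoding or path conventions, which should be easy to find by stepping through the scripts line by line. Also make sure that CRF++ itself compiles cleanly on your machine. Each step is described below.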


Generating the training and test data

The script that generates the training and test data is get_post_train_test_data.py. It prints some debugging information while it runs.

    # coding=utf8
    # Python 2.7: convert the People's Daily corpus into CRF++ train/test files.

    #home_dir = "D:/source/NLP/people_daily//"
    home_dir = "./"


    def saveDataFile(trainobj, testobj, isTest, word, handle):
        # Route the row to the test file or to the training file.
        if isTest:
            saveTrainFile(testobj, word, handle)
        else:
            saveTrainFile(trainobj, word, handle)


    def saveTrainFile(fiobj, word, handle):
        # Write one "word<TAB>tag" row. An empty word, "。" or "，" ends the
        # current sequence with a blank line instead.
        if len(word) > 0 and word != "。" and word != "，":
            fiobj.write(word + '\t' + handle + '\n')
        else:
            fiobj.write('\n')


    def convertTag():
        fiobj = open(home_dir + 'people-daily.txt', 'r')
        trainobj = open(home_dir + 'train.data', 'w')
        testobj = open(home_dir + 'test.data', 'w')
        i = 0
        for a in fiobj:
            i += 1
            a = a.strip('\r\n\t ')
            if a == "":
                continue
            words = a.split(" ")
            # Every tenth line goes to the test set, the rest to training.
            test = (i % 10 == 0)
            # The first token of each line is skipped.
            for word in words[1:]:
                print "---->", word
                word = word.strip('\t ')
                if len(word) > 0:
                    # Drop compound brackets: "[word/n" and "word/n]nt" -> "word/n".
                    i1 = word.find('[')
                    if i1 >= 0:
                        word = word[i1+1:]
                    i2 = word.find(']')
                    if i2 > 0:
                        word = word[:i2]
                    print "----", word
                    # Split on the last "/" so a word containing "/" keeps its tag.
                    w, h = word.rsplit('/', 1)
                    if h == 'nr':
                        # Person name: write each part of a "·"-separated name as its own row.
                        if w.find('·') >= 0:
                            for tmp in w.split('·'):
                                saveDataFile(trainobj, testobj, test, tmp, h)
                            continue
                    saveDataFile(trainobj, testobj, test, w, h)
            # A blank line closes the sequence for this corpus line.
            saveDataFile(trainobj, testobj, test, "", "")
        fiobj.close()
        trainobj.close()
        testobj.close()


    if __name__ == '__main__':
        convertTag()
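To make the conversion concrete, here is a minimal Python 3 sketch of the per-line logic. It is not the author's script, which targets Python 2.7 and writes to files. The sample line is made up in the style of the corpus (a leading ID token, `word/tag` pairs, and a bracketed compound ending in `]nt`), and the helper names `parse_token` and `convert_line` are hypothetical.

```python
# Illustrative sketch only: the helper names and the sample line are
# assumptions, not part of the original get_post_train_test_data.py.

def parse_token(token):
    """Return (word, tag) for one 'word/tag' token, ignoring compound brackets."""
    start = token.find('[')
    if start >= 0:
        token = token[start + 1:]      # "[中央/n" -> "中央/n"
    end = token.find(']')
    if end > 0:
        token = token[:end]            # "电台/n]nt" -> "电台/n"
    word, tag = token.rsplit('/', 1)
    return word, tag


def convert_line(line):
    """Convert one corpus line into CRF++ rows; '' marks a sequence break."""
    rows = []
    for token in line.split()[1:]:     # skip the leading ID token
        word, tag = parse_token(token)
        if word in ('。', '，'):
            rows.append('')            # this punctuation ends the current sequence
        elif tag == 'nr' and '·' in word:
            # "·"-separated person name: one row per part
            rows.extend(part + '\t' + tag for part in word.split('·'))
        else:
            rows.append(word + '\t' + tag)
    rows.append('')                    # end of the corpus line
    return rows


# Made-up sample in the corpus format (assumption, for illustration).
sample = '19980101-01-001-002/m [中央/n 人民/n 广播/vn 电台/n]nt 播出/v 讲话/n 。/w'
rows = convert_line(sample)
print('\n'.join(rows))
```

Each non-empty row is one token with its tag, separated by a tab, which is the column layout the template and the accuracy script in this post rely on.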

Running training and testing

Set the feature template to:

    # Unigram
    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]
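As an illustration of what these template lines mean, the Python 3 sketch below expands the `%x[row,col]` macros for one position in a sequence. It assumes the usual CRF++ semantics: `row` is relative to the current token, `col` is an absolute column index, and positions outside the sequence become boundary markers such as `_B-1` and `_B+1`. The function name `expand` and the sample rows are assumptions, and this is not CRF++'s own implementation.

```python
import re

# Matches one macro such as %x[-2,0]; group 1 is the row offset, group 2 the column.
MACRO = re.compile(r'%x\[(-?\d+),(\d+)\]')

TEMPLATE = [
    'U00:%x[-2,0]',
    'U01:%x[-1,0]',
    'U02:%x[0,0]',
    'U03:%x[1,0]',
    'U04:%x[2,0]',
    'U05:%x[-1,0]/%x[0,0]',
    'U06:%x[0,0]/%x[1,0]',
]


def expand(template_line, rows, pos):
    """Expand every %x[row,col] macro in template_line for the token at index pos."""
    def lookup(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if row < 0:
            return '_B%d' % row                      # before the sequence, e.g. _B-1
        if row >= len(rows):
            return '_B+%d' % (row - len(rows) + 1)   # past the end, e.g. _B+1
        return rows[row][col]
    return MACRO.sub(lookup, template_line)


# Made-up three-token sequence: column 0 is the word, column 1 the tag.
rows = [['迈向', 'v'], ['充满', 'v'], ['希望', 'n']]
features = [expand(line, rows, 1) for line in TEMPLATE]
for feature in features:
    print(feature)
```

With column 0 holding the word, this template looks at the two words before and after the current one, plus two word bigrams around it.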

When training, set the -p parameter (the number of threads) to suit your own machine.

    crf_learn -f 3 -p 4 -c 4.0 template train.data model > train.rst
    crf_test -m model test.data > test.rst
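In these commands, `-f` is the minimum feature frequency, `-c` is the cost hyperparameter and `-p` is the number of threads for `crf_learn`, while `-m` names the model file for `crf_test`. As a hedged Python 3 sketch, the thread count could be derived from the machine instead of being hard-coded. The helper `build_learn_cmd` and the choice to leave one core free are assumptions for illustration, not part of the original post.

```python
import os
import shutil
import subprocess


def build_learn_cmd(threads=None, freq=3, cost=4.0,
                    template='template', train='train.data', model='model'):
    """Build the crf_learn argument list from the post, with a chosen -p value."""
    if threads is None:
        # Assumption: leave one core free. os.cpu_count() may return None.
        threads = max(1, (os.cpu_count() or 1) - 1)
    return ['crf_learn', '-f', str(freq), '-p', str(threads),
            '-c', str(cost), template, train, model]


cmd = build_learn_cmd(threads=4)
print(' '.join(cmd))

# Only start training if crf_learn is installed and the input files exist.
if (shutil.which('crf_learn')
        and os.path.exists('template')
        and os.path.exists('train.data')):
    with open('train.rst', 'w') as log:
        subprocess.run(cmd, stdout=log, check=True)
```

Calling `build_learn_cmd()` with no arguments picks the thread count from the local CPU count, which matches the advice to adapt `-p` to the machine.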

Computing accuracy

Run the Python script with the command python clc_f.py test.rst. The code in clc_f.py is:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import sys

    if __name__ == "__main__":
        try:
            file = open(sys.argv[1], "r")
        except (IndexError, IOError):
            print "result file is not specified, or open failed!"
            sys.exit(1)

        wc = 0             # number of tagged words in the result file
        wc_of_correct = 0  # words whose predicted tag equals the gold tag
        for l in file:
            l = l.strip()
            if l == '':
                continue   # blank lines separate sequences
            # Columns: word, gold tag, tag predicted by crf_test.
            _, g, r = l.split()
            wc += 1
            if r == g:
                wc_of_correct += 1
        file.close()

        print "WordCount from result:", wc
        print "WordCount of correct POS:", wc_of_correct

        # Accuracy
        P = wc_of_correct / float(wc)
        print "Accuracy: %f" % (P)
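A worked example of the same computation on a tiny in-memory sample may help. It assumes `crf_test` appends the predicted tag as the last column, so each row is word, gold tag, predicted tag. The sample rows, the function name `tag_accuracy`, and the extra tally of confused tag pairs are illustrative additions in Python 3, not part of `clc_f.py`.

```python
from collections import Counter


def tag_accuracy(lines):
    """Return (accuracy, Counter of (gold, predicted) errors) for crf_test rows."""
    total = 0
    correct = 0
    errors = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue                    # blank lines separate sequences
        _, gold, pred = line.split()
        total += 1
        if gold == pred:
            correct += 1
        else:
            errors[(gold, pred)] += 1
    return correct / total, errors


# Made-up rows in the assumed crf_test output layout: word, gold tag, predicted tag.
sample = [
    '迈向\tv\tv',
    '充满\tv\tv',
    '希望\tn\tv',   # a deliberately wrong prediction
    '',
    '新年\tt\tt',
]
accuracy, errors = tag_accuracy(sample)
print('accuracy: %f' % accuracy)        # 3 of the 4 tagged rows match
print(errors.most_common(1))
```

The accuracy here is per token: the number of rows whose predicted tag equals the gold tag, divided by the number of non-blank rows. The function assumes at least one non-blank row.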

Experimental results

With the setup above, the POS tagging accuracy on the test data is 0.933882.
