Three major word segmentation tools: basic usage of Stanford CoreNLP / CAS NLPIR / HIT LTP
Preface:
Looking back over the semester, I have written quite a lot of code but never organized it properly; I will tidy it up gradually. This is my first post, so it may be a bit scattered; apologies in advance. The main goal is that when I review this material later it will not feel so unfamiliar. If it also helps fellow NLPers, that would be my honor.
This post briefly walks through three segmentation systems (Stanford CoreNLP, the CAS NLPIR system, and HIT LTP), from download to minimal example code.
1. Stanford CoreNLP
Download: get the CoreNLP jar plus the language-model jar for your language (English is the default; for Chinese it is stanford-chinese-corenlp-2016-10-31-models.jar).
Programming language:
Stanford CoreNLP is written in Java; current releases require Java 1.8+.
You can use Stanford CoreNLP from the command line, via its Java programmatic API, via third-party APIs for most major modern programming languages, or via a service. It works on Linux, OS X, and Windows.
(This post demonstrates the command line, Java, and Python.)
1.1 Command line (the server listens on port 9000 by default)
# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer [port] [timeout]
If the corpus is Chinese, load the corresponding properties file. The command line is:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000 -props StanfordCoreNLP-chinese.properties
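Once the server is running, any HTTP client can talk to it: the text goes in the POST body, and the annotators and output format go in a JSON-encoded `properties` query parameter. A minimal Python 3 sketch (the function names `build_query`/`annotate` and the host/port are my own choices; it assumes a server started as above):

```python
import json
import urllib.parse
import urllib.request

def build_query(annotators, output_format="json"):
    # The CoreNLP server expects its options as a single JSON-encoded
    # "properties" query parameter
    props = {"annotators": annotators, "outputFormat": output_format}
    return urllib.parse.urlencode({"properties": json.dumps(props)})

def annotate(text, annotators="tokenize,ssplit,pos", url="http://127.0.0.1:9000"):
    # POST raw UTF-8 text to the running server and parse the JSON reply
    req = urllib.request.Request(url + "/?" + build_query(annotators),
                                 data=text.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the Chinese server from the command above, `annotate(u"李航老師的書很暢銷。")` should return the same nested JSON that pycorenlp yields in section 1.3.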
1.2 Calling from Java
Import the jars in the CoreNLP folder, together with stanford-chinese-corenlp-2016-10-31-models.jar, into your project.
package Test;
/**
* Created by Roy on 2016/11/13.
*/
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;
public class TestCoreNLP {
    public static void main(String[] args) {
        StanfordCoreNLP nlp = new StanfordCoreNLP("StanfordCoreNLP-chinese.properties");
        // read some text in the text variable
        String text = "李航老師的《統計方法》在市面上很暢銷。";
        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);
        nlp.annotate(document);
        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);
        System.out.println("word\tpos\tlemma\tner");
        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                String word = token.get(TextAnnotation.class);
                String pos = token.get(PartOfSpeechAnnotation.class);
                String ne = token.get(NamedEntityTagAnnotation.class);
                String lemma = token.get(LemmaAnnotation.class);
                System.out.println(word + "\t" + pos + "\t" + lemma + "\t" + ne);
            }
        }
    }
}
Named entity recognition (NER):
The stanford-ner-2012-11-11-chinese archive can be found and downloaded at the link above.
package Test;

/**
 * Created by Roy on 2016/11/14.
 */
import edu.stanford.nlp.ie.AbstractSequenceClassifier;
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;

public class Ner {
    private static AbstractSequenceClassifier<CoreLabel> ner;

    public Ner() {
        String serializedClassifier = "classifiers/chinese.misc.distsim.crf.ser.gz";
        if (ner == null) {
            ner = CRFClassifier.getClassifierNoExceptions(serializedClassifier);
        }
    }

    public String doNer(String sent) {
        return ner.classifyWithInlineXML(sent);
    }
}
package Test;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;
import java.util.Properties;

/**
 * Created by Roy on 2016/11/14.
 */
public class NerforAText {
    public static CRFClassifier<CoreLabel> segmenter;

    public NerforAText() {
        Properties props = new Properties();
        props.setProperty("sighanCorporaDict", "data");
        props.setProperty("serDictionary", "data/dict-chris6.ser.gz");
        props.setProperty("inputEncoding", "UTF-8");
        props.setProperty("sighanPostProcessing", "true");
        segmenter = new CRFClassifier<CoreLabel>(props);
        segmenter.loadClassifierNoExceptions("data/ctb.gz", props);
        segmenter.flags.setProperties(props);
    }

    public static String doSegment(String text) {
        // segmentString returns a List<String>; casting its toArray() result
        // to String[] throws ClassCastException, so iterate the list directly
        List<String> strs = segmenter.segmentString(text);
        StringBuilder result = new StringBuilder();
        for (String s : strs) {
            result.append(s).append(" ");
        }
        System.out.println(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String text = "習近平祝賀特朗普當選美國總統。習近平表示,中美建交37年來,兩國關係不斷向前發展,給兩國人民帶來了實實在在的利益,也促進了世界和地區和平、穩定、繁榮。";
        NerforAText nerforAText = new NerforAText();
        String seg = nerforAText.doSegment(text);
        Ner ner = new Ner();
        System.out.println(ner.doNer(seg));
    }
}
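classifyWithInlineXML marks each recognized entity with inline XML tags in the returned string. To pull the entities back out, here is a small Python 3 sketch; the sample string below is my own mock-up in the style of that output, not actual classifier output, and tag names such as PERSON/GPE are illustrative:

```python
import re

# Hypothetical inline-XML output in the style of classifyWithInlineXML
sample = "<PERSON>習近平</PERSON> 祝賀 <PERSON>特朗普</PERSON> 當選 <GPE>美國</GPE> 總統 。"

def parse_inline_xml(text):
    """Return (entity_text, entity_type) pairs from inline-XML tagged output."""
    return [(m.group(2), m.group(1))
            for m in re.finditer(r"<(\w+)>(.*?)</\1>", text)]

print(parse_inline_xml(sample))
# [('習近平', 'PERSON'), ('特朗普', 'PERSON'), ('美國', 'GPE')]
```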
1.3 Python: install pycorenlp
On Linux, simply run pip install pycorenlp.
Step 1: start the CoreNLP server from the command line first.
Step 2: run the Python code.
# coding: utf-8
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://127.0.0.1:9000')
line = "習近平 主席 指出 ,我們 要 深入 學習 兩學一做系列 活動"
print line
# besides 'text', the output format can also be json etc.; see the official site for details
output = nlp.annotate(line, properties={'annotators': 'tokenize,ssplit,pos,lemma,ner', 'outputFormat': 'text'})
print output.decode('utf-8')
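If 'outputFormat' is set to 'json' instead of 'text', annotate returns a nested structure: a dict with a "sentences" list, each sentence holding a "tokens" list of annotated word dicts. A Python 3 sketch of walking that structure; the sample_output here is hand-written to mirror the shape, not real server output:

```python
# Hand-written sample mirroring CoreNLP's JSON output shape
sample_output = {
    "sentences": [
        {"tokens": [
            {"word": "習近平", "pos": "NR", "ner": "PERSON"},
            {"word": "主席", "pos": "NN", "ner": "O"},
        ]}
    ]
}

def extract_entities(output):
    """Collect (word, ner) pairs for every token with a non-'O' NER label."""
    entities = []
    for sentence in output["sentences"]:
        for token in sentence["tokens"]:
            if token["ner"] != "O":
                entities.append((token["word"], token["ner"]))
    return entities

print(extract_entities(sample_output))  # [('習近平', 'PERSON')]
```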
2. NLPIR segmentation system
Download the segmentation package: http://ictclas.nlpir.org/downloads
Download the matching XX.user licence file and use it to replace the local XX.user.
import java.io.UnsupportedEncodingException;
import com.sun.jna.Library;
import com.sun.jna.Native;

public class NLP {
    // Define the CLibrary interface, extending com.sun.jna.Library
    public interface CLibrary extends Library {
        // Define and initialize the interface's static instance
        CLibrary Instance = (CLibrary) Native.loadLibrary(
                System.getProperty("user.dir") + "\\source\\NLPIR", CLibrary.class);

        // Native function declarations
        public int NLPIR_Init(byte[] sDataPath, int encoding, byte[] sLicenceCode);
        public String NLPIR_ParagraphProcess(String sSrc, int bPOSTagged);
        public String NLPIR_GetKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public double NLPIR_FileProcess(String sSourceFilename, String sResultFilename, int bPOStagged);
        public String NLPIR_GetFileKeyWords(String sLine, int nMaxKeyLimit, boolean bWeightOut);
        public String NLPIR_WordFreqStat(String sText);
        public String NLPIR_FileWordFreqStat(String sSourceFilename);
        public void NLPIR_Exit();
    }

    public static String transString(String aidString, String ori_encoding,
            String new_encoding) {
        try {
            return new String(aidString.getBytes(ori_encoding), new_encoding);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        String argu = "";
        String system_charset = "utf-8";
        int charset_type = 1;
        int init_flag = CLibrary.Instance.NLPIR_Init(argu.getBytes(system_charset),
                charset_type, "0".getBytes(system_charset));
        if (0 == init_flag) {
            System.err.println("Initialization failed!");
            return;
        }
        String sInput = "據悉,質檢總局已將最新有關情況再次通報美方,要求美方加強對輸華玉米的產地來源、運輸及倉儲等環節的管控措施,有效避免輸華玉米被未經我國農業部安全評估並批准的轉基因品系汙染。";
        String nativeBytes = null;
        try {
            nativeBytes = CLibrary.Instance.NLPIR_ParagraphProcess(sInput, 3);
            System.out.println("Segmentation result: " + nativeBytes);
            String nativeByte = CLibrary.Instance.NLPIR_GetKeyWords(sInput, 10, false);
            System.out.print("\nKeyword extraction result: " + nativeByte);
            String wordFreq = CLibrary.Instance.NLPIR_WordFreqStat(sInput);
            System.out.print("\nWord frequency result: " + wordFreq);
            CLibrary.Instance.NLPIR_Exit();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
3. LTP and pyltp (on Linux)
Installation guide: http://ltp.readthedocs.io/zh_CN/latest/install.html
Download an LTP release: https://github.com/HIT-SCIR/ltp/releases
Unpack it, enter the project root, and build (make sure CMake is installed):
./configure
make
Install pyltp: pip install pyltp
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')  # Python 2 default-encoding workaround
from pyltp import Segmentor

def read_txt(filename):
    # open the input file and the output file
    f = open('/mnt/e/code/run/' + filename, 'r')
    w = open('/mnt/e/code/Segmentation/test/seg_' + filename, 'w')
    # initialize the segmenter, loading an external user dictionary
    segmentor = Segmentor()
    segmentor.load_with_lexicon('/mnt/e/Pris/duozhuan/code/data/cws.model',
                                '/mnt/e/Pris/duozhuan/code/data/pro-noun.txt')
    n = 0
    for content in f:
        # each input line has the form "<index>\t<text>"
        index, line = content.split("\t", 1)
        words = segmentor.segment(line)
        # join the segmented words with spaces
        str_line = ' '.join(words)
        w.write(index + "\t" + str_line + "\n")
        n += 1
        if n % 100 == 0:
            print n
    segmentor.release()
    f.close()
    w.close()

if __name__ == '__main__':
    read_txt("q_with_id.txt")
Closing thoughts:
In my experience each of these three tools has its strengths and weaknesses. For segmentation specifically, loading an external dictionary matters a great deal. Even so, some terms that do exist in the external dictionary still get split at segmentation time, for example digit+Chinese or letter+Chinese compounds, so a post-processing step is needed.
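As a sketch of the kind of post-processing meant here (my own illustration, independent of the three toolkits; the lexicon entries and tokens are hypothetical): after segmentation, re-merge two adjacent tokens whenever their concatenation is a known dictionary entry, which recovers digit+Chinese or letter+Chinese terms the segmenter split.

```python
def merge_split_terms(tokens, lexicon):
    """Greedy left-to-right pass: if token i joined with token i+1 forms
    an entry in `lexicon`, emit the merged word instead of the two pieces."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in lexicon:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

lexicon = {u"4G網路", u"兩學一做"}          # hypothetical user-dictionary entries
tokens = [u"4G", u"網路", u"的", u"覆蓋"]  # hypothetical segmenter output
print(merge_split_terms(tokens, lexicon))
```

A single greedy pass is enough for two-piece splits; terms broken into three or more pieces would need a longest-match variant.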