Learning NLTK, Part 4: Information Extraction from Text

阿新 • Published: 2019-01-07

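1 Information Extraction

Extracting information from a database is easy, but extracting it from natural-language text is far less straightforward. A typical information extraction pipeline is shown in the figure below:
[Figure: information extraction pipeline]
It starts with sentence segmentation and tokenization. Next comes part-of-speech (POS) tagging, then named entity recognition, and finally relation recognition, which searches for likely relations between entities that appear close together.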
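The first three stages can be sketched in a few lines. This is only a minimal illustration: the regex sentence splitter and the tiny RegexpTagger rule list are toy stand-ins (assumptions of this sketch) for nltk.sent_tokenize and nltk.pos_tag, chosen so that the example runs without downloading any models. The function name preprocess is also hypothetical.

```python
import re
import nltk

def preprocess(document):
    """Sentence-split, tokenize, and POS-tag a raw string (toy stand-ins)."""
    # Stage 1: naive sentence segmentation after ., ! or ? plus whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    # Stage 2: regexp-based word tokenization; needs no downloaded models.
    tokenized = [nltk.tokenize.wordpunct_tokenize(s) for s in sentences]
    # Stage 3: a tiny hand-written regexp tagger standing in for nltk.pos_tag.
    toy_tagger = nltk.RegexpTagger([
        (r'^(the|a|an)$', 'DT'),
        (r'^(at|on|in)$', 'IN'),
        (r'.*ed$', 'VBD'),
        (r'^[.!?]$', '.'),
        (r'.*', 'NN'),  # default: treat everything else as a noun
    ])
    return [toy_tagger.tag(tokens) for tokens in tokenized]

tagged = preprocess("the dog barked at the cat. the cat jumped.")
for sentence in tagged:
    print(sentence)
```

The output of this step, a list of (word, tag) sentences, is exactly the input that the chunking and entity recognition stages described next expect.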

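2 Chunking

Chunking is the basic technique used for named entity recognition (NER), and POS tags are the main information a chunker relies on. This section uses noun phrases (NP) to show how chunking works; verb phrases, prepositional phrases, and so on can be chunked the same way. The figure below illustrates NP chunking.
[Figure: NP chunking illustration]
Chunking can be done simply from experience, using regular expressions to match tag sequences, or with a statistical classifier. This section starts with the regular-expression chunker that NLTK provides.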

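2.1 Regular-Expression Matching

NLTK provides RegexpParser, a regular-expression parser that works on POS tags, so chunks can be matched by writing patterns over tags. Each rule is a sequence of POS tags, where each tag is wrapped in angle brackets and matches one word with that tag. For example, <NN> matches a noun in the sentence. Since nouns are subdivided into NNP, NNS, and so on, <NN.*> can be used to match all of them. The code below chunks the determiner-adjective-noun phrases from the figure above.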

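import nltk

# A POS-tagged sentence: a list of (word, tag) pairs
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# NP chunk: any determiners, then any adjectives, then one or more nouns
grammar = 'NP: {<DT>*<JJ>*<NN>+}'

cp = nltk.RegexpParser(grammar)
tree = cp.parse(sentence)

print(tree)
tree.draw()  # opens a Tk window, so it needs a graphical display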

[Figure: chunk tree drawn over the POS-tagged sentence]
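To see the difference between <NN> and <NN.*> directly, here is a small sketch. The sentence, the NOUN label, and the chunk_words helper are made up for illustration; only RegexpParser and the Tree methods come from NLTK.

```python
import nltk

# Hypothetical tagged sentence containing three different noun tags.
tagged = [("Alice", "NNP"), ("feeds", "VBZ"), ("stray", "JJ"),
          ("cats", "NNS"), ("and", "CC"), ("a", "DT"), ("dog", "NN")]

def chunk_words(grammar, sentence, label):
    """Return the words of every chunk that carries the given label."""
    tree = nltk.RegexpParser(grammar).parse(sentence)
    return [[word for word, tag in subtree.leaves()]
            for subtree in tree.subtrees()
            if subtree.label() == label]

exact = chunk_words("NOUN: {<NN>}", tagged, "NOUN")    # only the tag NN
wild = chunk_words("NOUN: {<NN.*>}", tagged, "NOUN")   # NN, NNS, NNP, ...

print(exact)
print(wild)
```

The exact pattern picks up only "dog", while the wildcard pattern also picks up "Alice" and "cats".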

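2.2 Handling Recursion

To support recursion in language structure, chunk rules may refer to the chunk labels produced by other rules and so, indirectly, to themselves. In the code below the NP rule is defined first, and the VP and CLAUSE rules then refer to each other. The loop=2 argument makes the parser run through the whole rule list twice, so chunks built late in the first pass can feed rules on the second pass.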

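import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}           # sequences of determiners, adjectives, nouns
PP: {<IN><NP>}                # a preposition followed by an NP
VP: {<VB.*><NP|PP|CLAUSE>+$}  # a verb and its arguments, up to the end
CLAUSE: {<NP><VP>}            # an NP followed by a VP
"""

cp = nltk.RegexpParser(grammar, loop=2)
sentence = [("Mary", "NN"), ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

print(cp.parse(sentence))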
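The effect of the loop argument is easier to see on a sentence with one more level of embedding. The sketch below reuses the same grammar on "John thinks Mary saw the cat sit on the mat" (a sentence chosen here for illustration, not part of the code above) and compares a single pass with two passes. The count helper is hypothetical.

```python
import nltk

grammar = r"""
NP: {<DT|JJ|NN.*>+}
PP: {<IN><NP>}
VP: {<VB.*><NP|PP|CLAUSE>+$}
CLAUSE: {<NP><VP>}
"""

sentence = [("John", "NNP"), ("thinks", "VBZ"), ("Mary", "NN"),
            ("saw", "VBD"), ("the", "DT"), ("cat", "NN"),
            ("sit", "VB"), ("on", "IN"), ("the", "DT"), ("mat", "NN")]

# One pass over the rules (the default) versus two passes.
shallow = nltk.RegexpParser(grammar).parse(sentence)
deep = nltk.RegexpParser(grammar, loop=2).parse(sentence)

def count(tree, label):
    """Count the subtrees of `tree` that carry the given chunk label."""
    return sum(1 for subtree in tree.subtrees() if subtree.label() == label)

print("loop=1:", count(shallow, "CLAUSE"), "clause(s), height", shallow.height())
print("loop=2:", count(deep, "CLAUSE"), "clause(s), height", deep.height())
print(deep)
```

On the first pass the VP rule can only see the innermost clause; the second pass lets it wrap "saw" around the CLAUSE built earlier, which is why the two-pass tree is deeper.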

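3 Classifier-Based Chunker

This section trains a chunker on the conll2000 corpus from nltk.corpus. The CoNLL corpus annotates chunks in IOB format. IOB stands for Inside, Outside, Begin and describes how a word relates to a chunk: B marks the first word of a chunk, I marks a word inside one, and O marks a word outside any chunk. The figure below shows an example.
[Figure: IOB tags marking chunk boundaries]

The corpus contains two files, train.txt and test.txt, and annotates three chunk types: NP, VP, and PP. The table below explains the methods of the corpus reader: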

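Method	Purpose
tagged_sents(fileid)	Returns a list of POS-tagged sentences; each sentence is a list of (word, pos_tag) pairs
chunked_sents(fileid, chunk_types)	Returns a list of chunk trees, one per sentence; the leaves are (word, pos_tag) pairs grouped under chunk nodes, and tree2conlltags turns such a tree into (word, pos_tag, iob_tag) triples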

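The table below introduces the utility functions provided by the nltk.chunk package:

Function	Purpose
tree2conlltags(tree)	Converts a chunk tree into a list of (word, pos_tag, iob_tag) triples
conlltags2tree(sentence)	The inverse of the above: converts a list of triples back into a tree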
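A quick round trip shows what these two functions do. This sketch builds the chunk tree with a small RegexpParser grammar rather than reading conll2000, so it runs without the corpus installed; the grammar and sentence are illustrative choices.

```python
import nltk
from nltk.chunk import tree2conlltags, conlltags2tree

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

# Build a chunk tree with two NP chunks.
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tagged)

# Tree -> flat list of (word, pos_tag, iob_tag) triples.
iob = tree2conlltags(tree)
for triple in iob:
    print(triple)

# Triples -> tree again.
rebuilt = conlltags2tree(iob)
print(rebuilt == tree)
```

The first word of each NP gets B-NP, the remaining words of the chunk get I-NP, and "barked" and "at" get O.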

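The code below trains an IOB tag classifier with a maximum entropy classifier and then uses the predicted tags to build chunks. The training data has the form ((word, pos_tag), iob_tag). After training, the classifier can assign a suitable IOB tag to each new (word, pos_tag) pair it sees.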

import nltk
from nltk.corpus import conll2000

# Features for word i: its own POS tag and the previous word's POS tag
def npchunk_features(sentence, i, history):
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i-1]
    return {"pos": pos, "prevpos": prevpos}

# A tagger that predicts IOB tags with a classifier over POS features
class ContextNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(train_set)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        return list(zip(sentence, history))

# Wrap the tagger so that it turns a POS-tagged sentence into a chunk tree
class ContextNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        self.tagger = ContextNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

chunker = ContextNPChunker(train_sents)
# Recent NLTK releases deprecate evaluate() in favor of accuracy()
print(chunker.evaluate(test_sents))

''' output
ChunkParse score:
    IOB Accuracy:  93.6%%
    Precision:     82.0%%
    Recall:        87.2%%
    F-Measure:     84.6%%
'''
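To make the training format concrete, the sketch below builds the (featureset, iob_tag) pairs for a single hand-labelled sentence. The feature function follows the one in the tagger above (slightly condensed); the three-word sentence and its labels are made up for illustration and stand in for a conll2000 sentence.

```python
def npchunk_features(sentence, i, history):
    """Same features as the tagger above: current and previous POS tag."""
    word, pos = sentence[i]
    prevpos = "<START>" if i == 0 else sentence[i - 1][1]
    return {"pos": pos, "prevpos": prevpos}

# Hand-labelled ((word, pos_tag), iob_tag) pairs (illustrative data).
tagged_sent = [(("the", "DT"), "B-NP"),
               (("dog", "NN"), "I-NP"),
               (("barked", "VBD"), "O")]

# Strip the IOB labels to get the (word, pos_tag) sequence the features see.
untagged = [word_pos for word_pos, iob in tagged_sent]

# history is unused by this feature function, so an empty list is passed.
train_set = [(npchunk_features(untagged, i, []), iob)
             for i, (word_pos, iob) in enumerate(tagged_sent)]

for featureset, label in train_set:
    print(featureset, "->", label)
```

Each pair is what the classifier's train() method receives: a dictionary of features and the IOB label to predict from it. Note that the word itself is not a feature here, so the classifier relies purely on POS context.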

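4 Named Entity Recognition

The goal of a named entity recognition system is to identify the named entities mentioned in a text. The task splits into two subtasks: finding the boundaries of each named entity (NE) and determining its type.
Named entity recognition is also well suited to classifier-based approaches. Annotated corpora typically label the following entity types: ['LOCATION', 'ORGANIZATION', 'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE', 'FACILITY', 'GPE']
NLTK ships a pre-trained NER classifier. The source of nltk/chunk/named_entity.py shows how a named entity recognizer is trained on ace_data and is worth browsing.
The code below uses nltk.chunk.ne_chunk() to recognize named entities.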

import nltk

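# ne_chunk expects a POS-tagged sentence. This one comes from the Brown corpus,
# whose tagset differs from the Penn Treebank tags produced by nltk.pos_tag.
tagged = nltk.corpus.brown.tagged_sents()[0]
entity = nltk.chunk.ne_chunk(tagged)
print(entity)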
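The result of ne_chunk is an nltk.Tree in which each recognized entity is a subtree labelled with its type. A minimal sketch of pulling the entities out follows; the tree here is built by hand as a stand-in for real ne_chunk output, so that the example runs without the pre-trained model, and the named_entities helper is hypothetical.

```python
import nltk

# Hand-built tree in the same shape that ne_chunk returns (illustrative).
tree = nltk.Tree('S', [
    nltk.Tree('PERSON', [('Mary', 'NNP')]),
    ('visited', 'VBD'),
    nltk.Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('.', '.'),
])

def named_entities(tree):
    """Collect (entity_type, entity_text) pairs from the top-level chunks."""
    entities = []
    for child in tree:
        # Plain tokens are (word, tag) tuples; entities are Tree nodes.
        if isinstance(child, nltk.Tree):
            text = " ".join(word for word, tag in child.leaves())
            entities.append((child.label(), text))
    return entities

entities = named_entities(tree)
print(entities)
```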

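5 Relation Extraction

Once the named entities in a text have been identified, the relations between them can be extracted, usually by looking for all triples of the form (e1, relation, e2).

nltk/sem/relextract.py implements relation extraction for the ieer, ace, and conll2002 corpora. The code below therefore uses the regular expression r'.*\bpresident\b' to extract information about the presidents of organizations (PER president ORG).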

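import re
import nltk

def open_ie():
    # Match filler text that contains the whole word "president"
    PR = re.compile(r'.*\bpresident\b')
    for doc in nltk.corpus.ieer.parsed_docs():
        for rel in nltk.sem.extract_rels('PER', 'ORG', doc, corpus='ieer', pattern=PR):
            # Yield every match instead of returning after the first one
            yield nltk.sem.rtuple(rel)

for line in open_ie():
    print(line)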
'''output
[PER: u'Kiriyenko'] u'became president of the' [ORG: u'NORSI']
[PER: u'Bill Gross'] u', president of' [ORG: u'Idealab']
[PER: u'Abe Kleinfield'] u', a vice president at' [ORG: u'Open Text']
[PER: u'Kaufman'] u', president of the privately held' [ORG: u'TV Books LLC']
[PER: u'Lindsay Doran'] u', president of' [ORG: u'United Artists']
[PER: u'Laura Ziskin'] u', president of' [ORG: u'Fox 2000']
[PER: u'Tom Rothman'] u', president of production at' [ORG: u'20th Century Fox']
[PER: u'John Wren'] u', the president and chief executive at' [ORG: u'Omnicom']
[PER: u'Ken Kaess'] u', president of the' [ORG: u'DDB Needham']
[PER: u'Jack Ablin'] u', president of' [ORG: u'Barnett Capital Advisors Inc.']
[PER: u'Lloyd Kiva New'] u', president emeritus of the' [ORG: u'Institute of American Indian Art']
[PER: u'J. Jackson Walter'] u', who served as president of the' [ORG: u'National Trust for Historic Preservation']
[PER: u'Bill Gamba'] u', senior vice president and manager of bond trading at' [ORG: u'Cowen &AMP; Co.']
'''
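The idea behind this kind of extraction can be sketched without the corpus: look at each pair of adjacent entities and test the text between them against the pattern. The code below is a simplified stand-in written for illustration, not the actual implementation of nltk.sem.extract_rels; the extract_triples function, the segment representation, and the sample data are all assumptions of this sketch.

```python
import re

def extract_triples(segments, subj_type, obj_type, pattern):
    """Find (e1, filler, e2) triples in a list of (type, text) segments.

    Entity segments carry a type such as 'PER' or 'ORG'; the plain text
    between entities has type None.
    """
    triples = []
    for i in range(len(segments) - 2):
        (t1, e1), (tf, filler), (t2, e2) = segments[i:i + 3]
        if (t1 == subj_type and tf is None and t2 == obj_type
                and pattern.match(filler)):
            triples.append((e1, filler, e2))
    return triples

# Made-up input: entities interleaved with the text between them.
segments = [("PER", "Ann Lee"), (None, ", president of"), ("ORG", "Acme"),
            (None, ", met"), ("PER", "Bob Ray"), (None, "near"),
            ("ORG", "Initech")]

PR = re.compile(r'.*\bpresident\b')
triples = extract_triples(segments, "PER", "ORG", PR)
print(triples)
```

Only the first PER/ORG pair is kept, because the text between the second pair ("near") does not contain the word "president".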
