
Natural Language Processing Learning 6: POS Tagging with NLTK

1. Using a part-of-speech tagger

import nltk
text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")

tagged_text = nltk.pos_tag(text)
print(tagged_text)
# To keep the tag set simpler, set tagset to 'universal'
tagged_text = nltk.pos_tag(text,tagset='universal')    
print(tagged_text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('to', 'PRT'), ('obtain', 'VERB'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]
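If a tag is unfamiliar, NLTK ships documentation for the Penn Treebank tag set; a quick sketch (this assumes the 'tagsets' data package has been downloaded, e.g. via nltk.download('tagsets')):

# nltk.download('tagsets')        # uncomment on first use
nltk.help.upenn_tagset('NN')      # describe the NN tag
nltk.help.upenn_tagset('VB.*')    # a regular expression over tag names also works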

2. Creating a tagged token with str2tuple()

tagged_token = nltk.tag.str2tuple('fly/NN')
print(tagged_token)
('fly', 'NN')
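str2tuple() also works over a whole tagged string if you split it first; a small sketch using a Brown-style excerpt for illustration:

sent = "The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN topics/NNS ./."
print([nltk.tag.str2tuple(t) for t in sent.split()])
print(nltk.tag.tuple2str(('fly', 'NN')))   # the inverse operation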

3. Reading tagged corpora

print(nltk.corpus.brown.tagged_words())
print(nltk.corpus.nps_chat.tagged_words())
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
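The corpus readers accept the same tagset argument as pos_tag, and can also return whole tagged sentences; a brief sketch:

print(nltk.corpus.brown.tagged_words(tagset='universal'))     # simplified universal tags
print(nltk.corpus.brown.tagged_sents(categories='news')[0])   # one tagged sentence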

4. Exploring tagged corpora

brown_learned_text = nltk.corpus.brown.words(categories='learned')
sorted(set(b for (a,b) in nltk.bigrams(brown_learned_text) if a == 'often'))
[',',
 '.',
 'accomplished',
 'analytically',
 'appear',
 'apt',
 'associated',
 'assuming',
 ...]
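A related exploration is to ask which tags are most common overall; a minimal sketch over the news category with the universal tagset:

from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
print(tag_fd.most_common())   # nouns are the most frequent tag in this category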

5. Inspecting the tags of following words: the distribution of POS tags for the words that follow 'often'

tags = nltk.pos_tag([b for (a,b) in nltk.bigrams(brown_learned_text) if a == 'often'],tagset='universal')
tags = [item[1] for item in tags]
fd = nltk.FreqDist(tags)
fd.tabulate()   # tabulate() prints the table itself, so no print() is needed
VERB  ADV  ADJ  ADP NOUN    .  PRT 
  27   12    7    7    5    4    2 
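Note that pos_tag above re-tags the following words out of their original sentence context. A variant sketch that uses the corpus's own gold tags instead (the counts may differ slightly):

brown_lrnd_tagged = nltk.corpus.brown.tagged_words(categories='learned', tagset='universal')
tags = [b[1] for (a, b) in nltk.bigrams(brown_lrnd_tagged) if a[0] == 'often']
nltk.FreqDist(tags).tabulate()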

6. Finding three-word phrases using POS tags

def process(sentence):
    # print verb + 'to' + verb triples from a tagged sentence, e.g. "combined to achieve"
    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):
        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
            print(w1, w2, w3)

for tagged_sent in nltk.corpus.brown.tagged_sents():
    process(tagged_sent)

combined to achieve
continue to place
serve to protect
wanted to wait
allowed to place
expected to become
......
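The same approach works for other tag sequences; a hypothetical variant that looks for adjective-noun pairs in the news category (find_adj_noun is an illustrative name, not an NLTK function):

def find_adj_noun(tagged_sent):
    # print adjective + noun pairs, e.g. "grand jury"
    for (w1, t1), (w2, t2) in nltk.bigrams(tagged_sent):
        if t1.startswith('JJ') and t2.startswith('NN'):
            print(w1, w2)

for tagged_sent in nltk.corpus.brown.tagged_sents(categories='news')[:10]:
    find_adj_noun(tagged_sent)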

7. The default tagger: nltk.DefaultTagger()

from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags).max())
NN

'NN' is the most frequent tag, so we use 'NN' as the default tag for every token, but the accuracy is poor:

raw = 'I do not like green eggs and ham, I do not like them Sam I am'
tokens = nltk.word_tokenize(raw)
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(tokens)
print(default_tagger.evaluate(brown_tagged_sents)) 
0.13089484257215028

8. The regular-expression tagger
# Note: the patterns are tried in order; the first one that matches is used.

The accuracy is still not great, but it is better than the default tagger.

patterns = [(r'.*ing$','VBG'),                 # gerunds
            (r'.*ed$','VBD'),                  # simple past
            (r'.*es$','VBZ'),                  # 3rd person singular present
            (r'.*ould$','MD'),                 # modals
            (r'.*\'s$','NN$'),                 # possessive nouns
            (r'.*s$','NNS'),                   # plural nouns
            (r'^-?[0-9]+(\.[0-9]+)?$','CD'),   # cardinal numbers (note the escaped dot)
            (r'.*','NN')]                      # everything else defaults to a noun
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(brown_sents[3]))
print(regexp_tagger.evaluate(brown_tagged_sents))
[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'), ('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'), ("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'), ('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ('interest', 'NN'), ('in', 'NN'), ('the', 'NN'), ('election', 'NN'), (',', 'NN'), ('the', 'NN'), ('number', 'NN'), ('of', 'NN'), ('voters', 'NNS'), ('and', 'NN'), ('the', 'NN'), ('size', 'NN'), ('of', 'NN'), ('this', 'NNS'), ('city', 'NN'), ("''", 'NN'), ('.', 'NN')]
0.20326391789486245
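A quick way to see which pattern fires for different word shapes is to tag a few made-up tokens directly (the word list below is purely illustrative):

# Each token is matched against the patterns in order; the expected tags are
# VBG, VBD, VBZ, MD, NN$, NNS, CD, NN
print(regexp_tagger.tag(['running', 'walked', 'goes', 'would', "John's", 'books', '3.14', 'dog']))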

9. The lookup tagger

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
# pick 100 words and map each one to its most likely tag
# (fd.keys() is not sorted by frequency; fd.most_common(100) would give the true top 100 and a higher score)
most_freq_words = list(fd.keys())[:100]
likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)
baseline_tagger = nltk.UnigramTagger(model=likely_tags)
print(baseline_tagger.evaluate(brown_tagged_sents))

sent = brown.sents(categories='news')[5]
baseline_tagger.tag(sent)

0.3329355371243312
[('It', 'PPS'),
 ('recommended', 'VBD'),
 ('that', 'CS'),
 ('Fulton', 'NP-TL'),
 ('legislators', 'NNS'),
 ('act', 'NN'),
  ......]
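Words outside the 100-word model are tagged None. A common fix, sketched below, is to give the lookup tagger a default-tag backoff so unseen words at least receive 'NN':

baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.evaluate(brown_tagged_sents))   # higher than the lookup tagger alone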

10. N-gram tagging
(1) Unigram tagging: a simple statistical algorithm that assigns each token the tag that is most likely for that word, without considering context

#Train the unigram tagger by passing it tagged sentences as training data
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) 
unigram_tagger.tag(brown_sents[2009])
[('The', 'AT'),
 ('structures', 'NNS'),
 ('housing', 'VBG'),
 ('the', 'AT'),
 ('apartments', 'NNS'),
 ('are', 'BER'),
 ('of', 'IN'),
 ('masonry', 'NN'),
 ('and', 'CC'),
 ('frame', 'NN'),
 ('construction', 'NN'),
 ('.', '.')]

(2) Separating training and test data: use 80% of the sentences as the training set for the unigram tagger; the reported accuracy is 93.60%. Note, however, that the code below assigns the same slice to test_sents as to train_sents, so this score is really measured on the training data (see the held-out sketch after the output).

size = int(len(brown_tagged_sents)*0.8)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[:size]   # same slice as train_sents; a held-out split would be brown_tagged_sents[size:]
unigram_tagger = nltk.UnigramTagger(train_sents)
unigram_tagger.evaluate(test_sents)
0.9359608998057523
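As noted above, test_sents here is the same slice as train_sents, so 93.60% is effectively training accuracy. A minimal sketch of a genuine held-out evaluation (the score is not reproduced here, but it will be noticeably lower):

held_out_sents = brown_tagged_sents[size:]   # the 20% the tagger never saw
print(unigram_tagger.evaluate(held_out_sents))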

(3) General N-gram tagging: the context is the current word together with the POS tags of the preceding n-1 tokens

#Note: the bigram tagger can tag every word in a sentence it saw during training,
#but it does badly on an unseen sentence: as soon as it meets a word it has never seen
#in that context, it cannot assign a tag, and every tag after it becomes None
bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.evaluate(test_sents))
print(bigram_tagger.tag(brown_sents[2007]))
0.7912525847484179
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'CS'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]

print(bigram_tagger.tag(brown_sents[4203]))
[('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'), ('Congo', None), ('is', None), ('13.5', None), ('million', None), (',', None), ('divided', None), ('into', None), ('at', None), ('least', None), ('seven', None), ('major', None), ('``', None), ('culture', None), ('clusters', None), ("''", None), ('and', None), ('innumerable', None), ('tribes', None), ('speaking', None), ('400', None), ('separate', None), ('dialects', None), ('.', None)]

Note: an N-gram tagger should not consider context that crosses a sentence boundary. For this reason, NLTK's taggers work with a list of sentences, where each sentence is itself a list (of words, or of (word, tag) pairs for training data).

11. Combining taggers: backoff

"Example: try to tag each token with the bigram tagger; if it cannot find a tag, try the unigram tagger; if that also fails, use the default tagger"
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents,backoff=t0)
t2 = nltk.BigramTagger(train_sents,backoff=t1)
print(t0.evaluate(test_sents))
print(t1.evaluate(test_sents))
print(t2.evaluate(test_sents))
0.1334795413246444
0.9359608998057523
0.9740459928566952
"Exercise: define a TrigramTagger named t3 that extends the previous example"
t3 = nltk.TrigramTagger(train_sents,backoff=t2)
print(t3.evaluate(test_sents))
0.9835954633748982
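One further option worth knowing about: the n-gram taggers accept a cutoff argument that discards rarely seen contexts, trading a little accuracy for a much smaller model; a sketch using the same training data:

# discard bigram contexts whose most likely tag was seen only once or twice;
# the backoff chain handles whatever gets dropped
t2_small = nltk.BigramTagger(train_sents, cutoff=2, backoff=t1)
print(t2_small.evaluate(test_sents))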

12. Saving taggers: use pickle (Python 2 also offers cPickle, a faster C implementation of pickle; Python 3's pickle uses the C implementation automatically)

#Save the tagger
from pickle import dump
output = open('t2.pkl','wb')
dump(t2,output,-1)
output.close()
#Load the tagger
from pickle import load
infile = open('t2.pkl', 'rb')   # open the pickled tagger for reading
tagger = load(infile)
infile.close()
tagger.tag(brown_sents[22])
[('Regarding', 'IN'),
 ("Atlanta's", 'NP$'),
 ('new', 'JJ'),
 ('multi-million-dollar', 'JJ'),
 ('airport', 'NN'),
 (',', ','),
 ('the', 'AT'),
 ('jury', 'NN'),
 ('recommended', 'VBD'),
  ......]
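Once reloaded, the tagger can be applied to any newly tokenized text; a quick usage sketch with a made-up sentence:

raw = 'The jury praised the administration and operation of the airport'
print(tagger.tag(nltk.word_tokenize(raw)))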