
Python Natural Language Processing: Part-of-Speech Tagging

1. An Introduction to POS Tagging

import nltk
# may require nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') on first use
text1 = nltk.word_tokenize("It is a pleasant day today")
print(nltk.pos_tag(text1))

nltk.pos_tag uses the Penn Treebank tag set, listed below:

Number  Tag  Description

1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
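
NLTK can also print these descriptions on demand; a small sketch (the 'tagsets' data package must be downloaded once):

import nltk
# nltk.download('tagsets')       # needed once if the tag documentation is not installed
nltk.help.upenn_tagset('NN')     # describe a single tag
nltk.help.upenn_tagset('VB.*')   # a regular expression selects a whole family of tags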

Building (token, tag) tuples

import nltk
taggedword=nltk.tag.str2tuple('bear/NN')
print(taggedword)
print(taggedword[0])
print(taggedword[1])

import nltk
sentence='''The/DT sacred/VBN Ganga/NNP flows/VBZ in/IN this/DT region/NN ./. This/DT is/VBZ a/DT pilgrimage/NN ./. People/NNP from/IN all/DT over/IN the/DT country/NN visit/NN this/DT place/NN ./. '''
print([nltk.tag.str2tuple(t) for t in sentence.split()])

Converting a tagged tuple back to a string

import nltk
taggedtok = ('bear', 'NN')
from nltk.tag.util import tuple2str
print(tuple2str(taggedtok))

Counting how often each tag occurs

import nltk
from nltk.corpus import treebank
treebank_tagged = treebank.tagged_words(tagset='universal')
tag = nltk.FreqDist(tag for (word, tag) in treebank_tagged)
print(tag.most_common())
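
As a related sketch (my own extension, not in the original): the same tagged corpus also shows which tags a particular word receives, via a conditional frequency distribution ('report' is just an example word):

import nltk
from nltk.corpus import treebank

treebank_tagged = treebank.tagged_words(tagset='universal')
# condition on the lower-cased word and count its tags
cfd = nltk.ConditionalFreqDist((word.lower(), tag) for (word, tag) in treebank_tagged)
print(cfd['report'].most_common())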

Setting a default tag and removing tags

import nltk
from nltk.tag import DefaultTagger
tag = DefaultTagger('NN')
print(tag.tag(['Beautiful', 'morning']))

import nltk
from nltk.tag import untag
print(untag([('beautiful', 'NN'), ('morning', 'NN')]))

There are two main ways to carry out a tagging task with the NLTK library:

1. Use a pre-built tagger from NLTK (or another library) and apply it to the test data (good enough for English and for tasks that are not out of the ordinary).

2. Build or train a suitable tagger on your own data, which means you are dealing with a very specific use case.

A typical tagger needs a large amount of training data; its job is to label every word of a sentence. A great deal of effort has already gone into annotating such corpora, so if you need to train your own POS tagger you can consider yourself fairly advanced. Below we look at how well a few taggers perform:

  • Sequential taggers
  • First, a tagger that assigns the single tag 'NN' to every word

import nltk
from nltk.corpus import brown

brown_tagged_sents=brown.tagged_sents(categories='news')
default_tagger=nltk.DefaultTagger('NN')
print(default_tagger.evaluate(brown_tagged_sents))  # about 0.13 – accuracy this low means the tagger is essentially useless

  • Use the N-gram taggers described in the earlier chapters

We use the first 90% of the corpus as a training set to learn the tagger's statistics, and keep the remaining 10% as a test set to see how well such a tagger performs.

import nltk
from nltk.corpus import brown
from nltk.tag import UnigramTagger
from nltk.tag import DefaultTagger
from nltk.tag import BigramTagger
from nltk.tag import TrigramTagger
brown_tagged_sents=brown.tagged_sents(categories='news')
default_tagger=nltk.DefaultTagger('NN')
train_data=brown_tagged_sents[:int(len(brown_tagged_sents)*0.9)]
test_data=brown_tagged_sents[int(len(brown_tagged_sents)*0.9):]
unigram_tagger=UnigramTagger(train_data,backoff=default_tagger)
print( unigram_tagger.evaluate(test_data) )

bigram_tagger=BigramTagger(train_data,backoff=unigram_tagger)
print( bigram_tagger.evaluate(test_data) )

trigram_tagger=TrigramTagger(train_data,backoff=bigram_tagger)
print( trigram_tagger.evaluate(test_data) )
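
Once such a chain has been trained it can be applied to raw text just like nltk.pos_tag. A minimal sketch of my own, using only the unigram/default pair to keep it short (the sample sentence is made up for illustration):

import nltk
from nltk.corpus import brown
from nltk.tag import DefaultTagger, UnigramTagger

train_data = brown.tagged_sents(categories='news')
tagger = UnigramTagger(train_data, backoff=DefaultTagger('NN'))

# tag a new, untagged sentence
print(tagger.tag(nltk.word_tokenize("The jury praised the city administration")))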

To make the training and testing process clearer, consider the following code.

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
unitag = UnigramTagger(model={'Vinken': 'NN'})   # a model with a single entry: only 'Vinken' is tagged, every other word gets None
print(unitag.tag(treebank.sents()[0]))

import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
training = treebank.tagged_sents()[:7000]
unitagger = UnigramTagger(training)    # train the tagger on the tagged data set
# note: the NLTK treebank sample contains fewer than 4,000 tagged sentences,
# so this test slice overlaps the training data and the score is optimistic
testing = treebank.tagged_sents()[2000:]
print(unitagger.evaluate(testing))

A note on what the backoff mechanism does

This is one of the main features of sequential tagging: if a word's tag cannot be determined from the limited training data, the next tagger in the chain is used to tag that word.
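
A minimal sketch of that behaviour (not from the original post; the word 'pterodactyl' is chosen only because it is very unlikely to occur in the training data):

from nltk.corpus import treebank
from nltk.tag import DefaultTagger, UnigramTagger

training = treebank.tagged_sents()[:7000]
no_backoff = UnigramTagger(training)
with_backoff = UnigramTagger(training, backoff=DefaultTagger('NN'))

# an unseen word gets None without backoff and falls back to 'NN' with it
print(no_backoff.tag(['pterodactyl']))
print(with_backoff.tag(['pterodactyl']))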

There are of course many more taggers; reading the source code is a good way to discover them.
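
One example is RegexpTagger, which assigns tags from regular-expression patterns (the patterns below are only a rough illustration, not an exhaustive rule set):

import nltk
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past tense
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else: noun
]
regexp_tagger = RegexpTagger(patterns)
print(regexp_tagger.tag(nltk.word_tokenize("The running dogs crossed 3 bridges")))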

import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
prefixtag = AffixTagger(training, affix_length=4)   # use a 4-character prefix as the context
print(prefixtag.evaluate(testing))

import nltk
from nltk.tag import AffixTagger
from nltk.corpus import treebank
testing = treebank.tagged_sents()[2000:]
training= treebank.tagged_sents()[:7000]
suffixtag = AffixTagger(training, affix_length=-3)    # use a 3-character suffix as the context
print(suffixtag.evaluate(testing))
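
The two affix taggers can also be chained through backoff. A rough sketch of my own, reusing the same training/test split as the blocks above:

from nltk.corpus import treebank
from nltk.tag import AffixTagger

testing = treebank.tagged_sents()[2000:]
training = treebank.tagged_sents()[:7000]
prefixtag = AffixTagger(training, affix_length=4)
# try the 3-character suffix first, fall back to the 4-character prefix
suffixtag = AffixTagger(training, affix_length=-3, backoff=prefixtag)
print(suffixtag.evaluate(testing))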

Training models based on machine learning will be covered in a later post.