Natural Language Processing (NLP): Module Methods Useful for Tokenization and Frequency Statistics
阿新 • Published: 2018-12-11
1. itertools.chain(*[ ])
import itertools
a= itertools.chain(['a','aa','aaa'])
b= itertools.chain(*['a','aa','aaa'])
print(list(a))
print(list(b))
Output:
['a', 'aa', 'aaa']
['a', 'a', 'a', 'a', 'a', 'a']
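The two results differ because of argument unpacking. Without the star, chain receives a single iterable (the list) and yields its three strings. With the star, each string becomes its own argument, and chain iterates over the characters of each one. The sketch below applies the same idea to the more common case of flattening per-sentence token lists. The data in `token_lists` is made up for illustration and is not from the original post.

```python
import itertools

# Hypothetical per-sentence token lists (illustrative data only).
token_lists = [['I', 'like', 'tea'], ['tea', 'is', 'good']]

# chain(*token_lists) passes each inner list as a separate argument.
flat_unpacked = list(itertools.chain(*token_lists))

# chain.from_iterable gives the same result without unpacking the outer list.
flat_lazy = list(itertools.chain.from_iterable(token_lists))

print(flat_unpacked)
print(flat_unpacked == flat_lazy)
```

chain.from_iterable is usually the safer choice when the outer iterable is large or lazy, since it does not need to expand it into call arguments first.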
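2. NLTK tools: conditional frequency distributions, regular expressions, stemmers, and lemmatizers.
2.1 NLTK sentence splitting and word tokenization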
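- Text segmentation with NLTK:
nltk.sent_tokenize(text)
# splits the text into sentences
nltk.word_tokenize(sent)
# splits one sentence into words
- Part-of-speech tagging with NLTK:
nltk.pos_tag(tokens)
# tokens is the result of tokenizing one sentence; tagging is also done per sentence
- Named entity recognition (NER) with NLTK:
nltk.ne_chunk(tags)
# tags is the POS-tagged sentence; chunking is also done per sentence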
Sentence segmentation
import nltk
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Adjacent string literals inside parentheses are joined into one string.
paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
print(sent_tokenizer.tokenize(paragraph))
Output:
['The first time I heard that song was in Hawaii on radio.',
'I was just a kid, and loved it very much!',
'What a fantastic song!']
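For comparison, here is a naive regex baseline. It is not how the punkt tokenizer works; it is a swapped-in technique that only splits after ., ! or ? followed by whitespace. The function name `naive_sent_split` is hypothetical. It handles the paragraph above but breaks on abbreviations, which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re

def naive_sent_split(text):
    """Split after ., ! or ? when followed by whitespace (no abbreviation handling)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

paragraph = ("The first time I heard that song was in Hawaii on radio. "
             "I was just a kid, and loved it very much! What a fantastic song!")
sentences = naive_sent_split(paragraph)
print(sentences)

# An abbreviation is wrongly treated as a sentence boundary.
broken = naive_sent_split("Mr. Smith arrived. He sat down.")
print(broken)
```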
Tokenize sentences (word tokenization)
from nltk.tokenize import WordPunctTokenizer
# Adjacent string literals inside parentheses are joined into one string.
sentence = ("Are you old enough to remember Michael Jackson attending "
            "the Grammys with Brooke Shields and Webster sat on his lap during the show?")
print(WordPunctTokenizer().tokenize(sentence))
Output:
['Are', 'you', 'old', 'enough', 'to', 'remember', 'Michael', 'Jackson', 'attending',
'the', 'Grammys', 'with', 'Brooke', 'Shields', 'and', 'Webster', 'sat', 'on', 'his',
'lap', 'during', 'the', 'show', '?']
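To my understanding, WordPunctTokenizer is a regular-expression tokenizer built on the pattern \w+|[^\w\s]+, that is, runs of word characters or runs of punctuation. The sketch below applies that pattern with the standard re module to a made-up sentence. It assumes the pattern is as stated and is meant only to show how contractions and prices get split.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

sentence = "Can't wait: $3.50!"
tokens = WORD_PUNCT.findall(sentence)
print(tokens)
```

Every punctuation mark becomes its own token, so "Can't" and "$3.50" are broken into several pieces. The regexp_tokenize pattern that follows avoids this for hyphenated words and decimal numbers.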
----------------------------------------------------
import nltk
text = 'That U.S.A. poster-print costs $12.40...'
pattern = r"""(?x)          # set flag to allow verbose regexps
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \d+(?:\.\d+)?%?         # numbers, incl. decimals and percentages (a leading $ is not matched)
  | \w+(?:[-']\w+)*         # words with optional internal hyphens/apostrophes
  | \.\.\.                  # ellipsis
  | [.,;"'?():_`-]          # punctuation as separate tokens; the hyphen goes last so that :-_ is not read as a range
"""
print(nltk.regexp_tokenize(text, pattern))
Output:
['That', 'U.S.A.', 'poster-print', 'costs', '12.40', '...']
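With a pattern that has no capturing groups, this kind of tokenization amounts to collecting every match of the pattern, which re.findall also does. The sketch below is a variant, not the pattern above: it adds an optional \$ in front of numbers so that the currency symbol is kept. That change is an assumption about what one might want, not something the original post does.

```python
import re

text = 'That U.S.A. poster-print costs $12.40...'

# Variant pattern (assumption): numbers may carry a leading $.
pattern = r"""(?x)          # verbose mode: whitespace and comments are ignored
    (?:[A-Z]\.)+            # abbreviations, e.g. U.S.A.
  | \$?\d+(?:\.\d+)?%?      # numbers with optional currency sign and percent
  | \w+(?:[-']\w+)*         # words with optional internal hyphens/apostrophes
  | \.\.\.                  # ellipsis
  | [.,;"'?():_`-]          # single punctuation characters
"""

tokens = re.findall(pattern, text)
print(tokens)
```

All groups are written as (?:...) on purpose: with capturing groups, findall would return the group contents instead of the whole match.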
2.2 NLTK provides two commonly used interfaces: FreqDist and ConditionalFreqDist
Using FreqDist
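from nltk import FreqDist
import matplotlib.pyplot as plt  # plotting backend used by plot()
tem = ['hello','world','hello','dear']
fd = FreqDist(tem)
print(fd)        # summary line
print(repr(fd))  # the counts themselves
Output:
<FreqDist with 3 samples and 4 outcomes>
FreqDist({'hello': 2, 'world': 1, 'dear': 1})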
plot(TopK, cumulative=True) draws the corresponding line chart, and tabulate() prints the corresponding table.
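In NLTK 3, FreqDist is a subclass of collections.Counter, so the numbers behind a frequency plot can be sketched with the standard library alone. The example below shows the ranked samples and the running totals that a cumulative view would display. It is a sketch of the arithmetic, not NLTK's plotting code.

```python
from collections import Counter
from itertools import accumulate

tem = ['hello', 'world', 'hello', 'dear']
counts = Counter(tem)

# Samples from most to least frequent, like the x-axis of a frequency plot.
ranked = counts.most_common()
samples = [word for word, _ in ranked]
freqs = [count for _, count in ranked]

# Running totals: what a cumulative view adds up, sample by sample.
cumulative = list(accumulate(freqs))

print(samples)
print(freqs)
print(cumulative)
```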
Using ConditionalFreqDist
It takes a list of pairs as input. Each event must be associated with a condition, so the input is a sequence of (condition, event) tuples.
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
print("conditions are:", cfd.conditions())  # list the conditions
print(cfd['news'])
print(cfd['news']['could'])  # looked up like a dictionary
Output:
conditions are: ['adventure', 'belles_lettres', 'editorial', 'fiction',
'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery',
'news', 'religion', 'reviews', 'romance', 'science_fiction']
<FreqDist with 14394 samples and 100554 outcomes>
86
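"""
plot() and tabulate() in particular accept additional parameters:
conditions: the conditions to include
samples: an iterable that restricts which samples are shown
cumulative: set to True to show cumulative counts
"""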
cfd.tabulate(conditions=['news','romance'],samples=['could','can'])
cfd.tabulate(conditions=['news','romance'],samples=['could','can'],cumulative=True)
Output:
        could   can
   news    86    93
romance   193    74
        could   can
   news    86   179
romance   193   267
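The idea behind a conditional frequency distribution can be sketched as one counter per condition. The example below uses a defaultdict of Counter objects and a small made-up list of (condition, event) pairs. The helper `row` is hypothetical and only imitates what one line of a table holds, with and without cumulative=True.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, event) pairs; real input would come from a corpus.
pairs = [('news', 'could'), ('news', 'can'), ('news', 'can'),
         ('romance', 'could'), ('romance', 'could'), ('romance', 'can')]

# One Counter per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

def row(condition, samples, cumulative=False):
    """Counts for the given samples, optionally as running totals."""
    counts = [cfd[condition][s] for s in samples]
    if not cumulative:
        return counts
    total, running = 0, []
    for count in counts:
        total += count
        running.append(total)
    return running

print(sorted(cfd))
print(row('news', ['could', 'can']))
print(row('romance', ['could', 'can'], cumulative=True))
```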
2.3 Regular expressions and their applications
Predictive text for input methods (9-key keypad)
import re
from nltk.corpus import words
# Find words typed with the same key sequence as hole and golf (4653).
wordlist = [w for w in words.words('en-basic') if w.islower()]
same = [w for w in wordlist if re.search(r'^[ghi][mno][jlk][def]$',w)]
print(same)
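The character classes in that pattern follow the letters printed on each key of a phone keypad. As a sketch, the pattern can be generated from any digit sequence instead of being written by hand. The names `KEYPAD` and `t9_pattern` and the short word list are hypothetical and used only for illustration.

```python
import re

# Letters on a standard phone keypad for digits 2-9.
KEYPAD = {'2': 'abc', '3': 'def', '4': 'ghi', '5': 'jkl',
          '6': 'mno', '7': 'pqrs', '8': 'tuv', '9': 'wxyz'}

def t9_pattern(digits):
    """Build a regex such as ^[ghi][mno][jkl][def]$ from the string '4653'."""
    return '^' + ''.join('[' + KEYPAD[d] + ']' for d in digits) + '$'

# Hypothetical word list; the original uses words.words('en-basic').
wordlist = ['gold', 'golf', 'hole', 'home', 'hold', 'help']

pattern = t9_pattern('4653')
matches = [w for w in wordlist if re.search(pattern, w)]
print(pattern)
print(matches)
```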
Finding character chunks: find sequences of two or more vowels and determine their relative frequency.
import re
import nltk
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
print(fd.most_common(12))  # the most frequent vowel sequences with their counts
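The snippet above produces raw counts. To get relative frequencies, each count is divided by the total number of vowel sequences found. The sketch below does this with collections.Counter on a tiny made-up word list, so the arithmetic is easy to check by hand.

```python
import re
from collections import Counter

# Hypothetical word list standing in for the Treebank vocabulary.
words = ['stream', 'team', 'sea', 'tree', 'boat', 'sky']

# Count every run of two or more vowels.
fd = Counter(vs for word in words for vs in re.findall(r'[aeiou]{2,}', word))

# Relative frequency = count / total number of sequences found.
total = sum(fd.values())
relative = {vs: count / total for vs, count in fd.items()}

print(fd.most_common())
print(relative)
```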
Finding stems: comparing apples and apple, apple is the stem. A simple script to find stems:
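def stem(word):
    # Strip the first suffix in the list that the word ends with.
    for suffix in ['ing','ly','ed','ious','ies','ive','es','s','ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word  # no known suffix: return the word unchanged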
Or use a regular expression, which takes only one line:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$',word)
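That call returns a list holding one (stem, suffix) tuple, or an empty list when no suffix matches. The non-greedy (.*?) matters: it keeps the stem as short as possible so the suffix group can take the longest match available. Below is a minimal sketch that wraps the same expression in a function. The name `regex_stem` is hypothetical.

```python
import re

SUFFIX_RE = re.compile(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$')

def regex_stem(word):
    """Return the part before a known suffix, or the word itself if none matches."""
    match = SUFFIX_RE.match(word)
    if match is None:
        return word
    stem, _suffix = match.groups()
    return stem

for w in ['processing', 'processes', 'apples', 'language']:
    print(w, '->', regex_stem(w))
```

Note that apples comes out as appl, because es matches before s gets a chance. Plain suffix stripping is crude, which is why the stemmers in section 2.4 are normally used instead.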
2.4 Stemmers and lemmatizers
NLTK provides two stemmers, PorterStemmer and LancasterStemmer. The Porter stemmer is the better of the two and handles words such as lying correctly.
porter = nltk.PorterStemmer()
print(porter.stem('lying'))
---------------------------------------
Lemmatizer: WordNetLemmatizer
wnl = nltk.WordNetLemmatizer()
print(wnl.lemmatize('women'))
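Using a stemmer to build an indexed text (concordance)
This uses the nltk.Index class, which maps each key to the list of values paired with it: nltk.Index((word , i) for (i,word) in enumerate(['a','b','a']))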
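As far as I know, nltk.Index behaves like a defaultdict(list) that appends each value to the list stored under its key. The sketch below reproduces that behavior with the standard library. The function name `build_index` is hypothetical and is not part of NLTK.

```python
from collections import defaultdict

def build_index(pairs):
    """Group the values of (key, value) pairs into one list per key."""
    index = defaultdict(list)
    for key, value in pairs:
        index[key].append(value)
    return index

# Same input as the nltk.Index example: each word maps to its positions.
index = build_index((word, i) for (i, word) in enumerate(['a', 'b', 'a']))
print(dict(index))
```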
import nltk

class IndexText:
    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        # Map each lowercased stem to the positions where it occurs.
        self._index = nltk.Index((self._stem(word), i) for (i, word) in enumerate(text))

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4  # words of context on each side
        for i in self._index[key]:
            # max(0, ...) keeps the left slice valid near the start of the text
            lcontext = ' '.join(self._text[max(0, i - wc):i])
            rcontext = ' '.join(self._text[i:i + wc])
            ldisplay = '%*s' % (width, lcontext[-width:])
            rdisplay = '%-*s' % (width, rcontext[:width])
            print(ldisplay, rdisplay)

porter = nltk.PorterStemmer()  # stemmer
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexText(porter, grail)
text.concordance('lie')
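The alignment in concordance comes from two format specifiers: '%*s' right-justifies the left context to a given width, and '%-*s' left-justifies the right context. The sketch below isolates that step on a short made-up token list. The function name `format_context` is hypothetical.

```python
def format_context(tokens, i, width=40):
    """Return the padded left and right context strings around position i."""
    wc = width // 4  # words of context on each side
    left = ' '.join(tokens[max(0, i - wc):i])
    right = ' '.join(tokens[i:i + wc])
    ldisplay = '%*s' % (width, left[-width:])    # right-justified, cut from the left
    rdisplay = '%-*s' % (width, right[:width])   # left-justified, cut from the right
    return ldisplay, rdisplay

# Hypothetical token list; position 4 is the word 'lie'.
tokens = 'the knight would not lie to the king'.split()
ldisplay, rdisplay = format_context(tokens, 4, width=20)
print(repr(ldisplay))
print(repr(rdisplay))
```

Because both sides are padded to the same width, the matched word starts in the same column on every line of output.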