Python Natural Language Processing: Statistical Language Modeling (1/2)

I. Computing word frequencies

Example: generate 1-gram, 2-gram, and 4-gram token samples from the Alpino corpus


import nltk  # 1-grams (unigrams)
from nltk.util import ngrams
from nltk.corpus import alpino  # the corpus data must already be downloaded
print(alpino.words())
unigrams = ngrams(alpino.words(), 1)
for i in unigrams:
    print(i)


import nltk  # 2-grams (bigrams)
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
bigrams_tokens = ngrams(alpino.words(), 2)
for i in bigrams_tokens:
    print(i)


import nltk  # 4-grams (quadgrams)
from nltk.util import ngrams
from nltk.corpus import alpino
print(alpino.words())
quadgrams = ngrams(alpino.words(), 4)
for i in quadgrams:
    print(i)
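
To make clear what `ngrams` yields, here is a minimal pure-Python sketch of the same sliding-window idea. The helper name `simple_ngrams` and the sample tokens are illustrative assumptions, not part of NLTK, and the real function offers extra options such as padding.

```python
def simple_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple (hypothetical helper)."""
    tokens = list(tokens)
    # Slide a window of width n; the last window starts at len(tokens) - n.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


sample = ["Hello", "how", "are", "you", "doing"]
for n in (1, 2, 4):
    print(n, list(simple_ngrams(sample, n)))
```

A sequence shorter than `n` simply produces no n-grams, which is why the 4-gram list is always the shortest of the three.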

Generate the 2-grams of a piece of text together with their counts, and likewise the 4-grams together with their counts


import nltk  # 2-grams and their counts
from nltk.collocations import BigramCollocationFinder
text = "Hello how are you doing ? I hope you find the book interesting"
tokens = nltk.wordpunct_tokenize(text)
twograms = BigramCollocationFinder.from_words(tokens)
# ngram_fd is a FreqDist mapping each bigram to its count
for twogram, freq in twograms.ngram_fd.items():
    print(twogram, freq)


import nltk  # 4-grams and their counts
from nltk.collocations import QuadgramCollocationFinder
text = "Hello how are you doing ? I hope you find the book interesting"
tokens = nltk.wordpunct_tokenize(text)
fourgrams = QuadgramCollocationFinder.from_words(tokens)
# ngram_fd is a FreqDist mapping each 4-gram to its count
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram, freq)
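
The counting step itself can be sketched without the collocation finders. The version below is an assumption-laden stand-in: it splits on whitespace instead of calling `wordpunct_tokenize` (the two agree on this particular sentence), and `count_ngrams` is a hypothetical helper name.

```python
from collections import Counter

text = "Hello how are you doing ? I hope you find the book interesting"
tokens = text.split()  # assumption: whitespace split stands in for wordpunct_tokenize


def count_ngrams(tokens, n):
    """Count each n-gram in tokens (hypothetical helper, not an NLTK function)."""
    # zip over n staggered slices pairs every token with its n - 1 successors.
    return Counter(zip(*(tokens[i:] for i in range(n))))


bigram_counts = count_ngrams(tokens, 2)
fourgram_counts = count_ngrams(tokens, 4)
for gram, freq in bigram_counts.items():
    print(gram, freq)
```

Because no bigram repeats in this short sentence, every count is 1; repeated n-grams only start to appear in longer texts.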

II. Frequencies in NLTK

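import nltk
from nltk.probability import FreqDist
text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)  # needs NLTK's Punkt tokenizer data to be downloaded
fdist = FreqDist(ftext)

print(fdist.N())  # total number of tokens counted
print(fdist.max())  # the sample with the highest count (not its frequency)
print(fdist.freq("How"))  # relative frequency of "How"

for i in fdist:
    print(i, fdist.freq(i))  # relative frequency of every sample

words = fdist.keys()
print(words)  # the distinct samples, i.e. the keys of the mapping

fdist.tabulate()  # print the counts as a text table
fdist.plot()  # draw the frequency distribution plot (requires matplotlib)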
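
In NLTK 3, `FreqDist` builds on `collections.Counter`, so the three numbers printed above can be reproduced by hand. The sketch below assumes a plain whitespace split in place of `word_tokenize` (equivalent for this punctuation-free sentence) and uses made-up variable names.

```python
from collections import Counter

text = ("How tragic that most people had to get ill before they "
        "understood what a gift it was to be alive")
counts = Counter(text.split())  # assumption: whitespace split replaces word_tokenize

total = sum(counts.values())                    # plays the role of fdist.N()
most_common_word = counts.most_common(1)[0][0]  # plays the role of fdist.max()
rel_freq_how = counts["How"] / total            # plays the role of fdist.freq("How")

print(total, most_common_word, rel_freq_how)
```

The sentence has 20 tokens and "to" is the only word that occurs twice, so it is the sample returned as the maximum.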

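The relationship between frequency and probability

Given enough trials, a relative frequency can stand in for a probability. For example, if a coin is tossed 10,000 times and lands heads up 5,023 times, the probability of heads can be approximated as 5023/10000. To recover the distribution behind such frequencies (that is, the probability distribution), we usually rely on estimation.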
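
The coin example can be simulated to see the relative frequency settle near the true probability as the number of tosses grows. This is only a sketch: the function name `heads_frequency`, the fixed seed, and the assumption of a fair coin are all illustrative choices.

```python
import random


def heads_frequency(num_tosses, seed=0):
    """Relative frequency of heads over num_tosses simulated fair-coin tosses."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same result
    heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return heads / num_tosses


for n in (10, 1000, 100000):
    print(n, heads_frequency(n))
```

With few tosses the frequency can be far from 0.5; with many it is usually very close, which is what justifies using counts to estimate probabilities.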

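III. Probability distributions in NLTK (implemented in NLTK's probability.py, which is worth reading)

Once we know the frequencies, we have a rough idea of the probabilities. Probability theory covers estimation: using a sample to estimate quantities such as the variance and the expectation. Here we use observed frequencies to estimate a probability distribution.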

import nltk  # maximum likelihood estimation
from nltk.probability import FreqDist, MLEProbDist
text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)
fdist = FreqDist(ftext)
mle = MLEProbDist(fdist)  # build the distribution once and reuse it
print(mle.max())  # the most probable sample
print(mle.samples())
for i in mle.freqdist():
    print(i, mle.prob(i))
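
The maximum likelihood estimate is simply the count of a word divided by the total number of tokens. The hand-rolled sketch below (hypothetical helper `mle_prob`, whitespace split assumed in place of `word_tokenize`) also shows its main weakness: any word absent from the sample gets probability zero, which is the usual motivation for smoothed estimators.

```python
from collections import Counter

text = ("How tragic that most people had to get ill before they "
        "understood what a gift it was to be alive")
counts = Counter(text.split())  # assumption: whitespace split replaces word_tokenize
total = sum(counts.values())


def mle_prob(word):
    """Maximum likelihood estimate: count(word) / total tokens (hypothetical helper)."""
    return counts[word] / total  # Counter returns 0 for words it has never seen


print(mle_prob("to"))    # a word seen twice in the sample
print(mle_prob("dead"))  # a word never seen in the sample
print(sum(mle_prob(w) for w in counts))  # the estimates over seen words sum to 1
```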

import nltk  # Lidstone estimation
from nltk.probability import FreqDist, LidstoneProbDist
text = "How tragic that most people had to get ill before they understood what a gift it was to be alive"
ftext = nltk.word_tokenize(text)
fdist = FreqDist(ftext)
lidstone = LidstoneProbDist(fdist, 0.5)  # gamma = 0.5 is added to every count
print(lidstone.max())  # the most probable sample
print(lidstone.samples())
for i in lidstone.freqdist():
    print(i, lidstone.prob(i))
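
The Lidstone estimate adds a constant gamma to every count before normalizing: P(w) = (count(w) + gamma) / (N + B * gamma), where N is the number of tokens and B the number of bins. The sketch below computes it by hand under two assumptions: a whitespace split replaces `word_tokenize`, and B is taken to be the number of distinct observed words, which should match NLTK's default when `bins` is omitted. The helper name `lidstone_prob` is hypothetical.

```python
from collections import Counter

text = ("How tragic that most people had to get ill before they "
        "understood what a gift it was to be alive")
counts = Counter(text.split())  # assumption: whitespace split replaces word_tokenize

total = sum(counts.values())  # N: number of tokens
bins = len(counts)            # B: assumed equal to the number of distinct words seen
gamma = 0.5                   # same smoothing constant as in the NLTK example


def lidstone_prob(word):
    """Lidstone estimate (count + gamma) / (N + B * gamma) (hypothetical helper)."""
    return (counts[word] + gamma) / (total + bins * gamma)


print(lidstone_prob("to"))    # a word seen twice
print(lidstone_prob("How"))   # a word seen once
print(lidstone_prob("dead"))  # an unseen word still gets a small non-zero value
```

With B set to the observed vocabulary size, the estimates over seen words already sum to 1; to reserve probability mass for unseen words in a principled way, B would need to count those possible outcomes as well.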

Other estimators are available as well; see the documentation for the nltk.probability module.