
Python Natural Language Processing Study Notes, Part 1

Chapter 1: Language Processing and Python

1 Computing with Language: Texts and Words

Getting Started with NLTK

Download and install NLTK

Download the data

>>> import nltk

>>> nltk.download()

After the download completes, load the texts

>>> from nltk.book import *

*** Introductory Examples for the NLTK Book ***

Loading text1, ..., text9 and sent1, ..., sent9

Type the name of the text or sentence to view it.

Type: 'texts()' or 'sents()' to list the materials.

text1: Moby Dick by Herman Melville 1851

text2: Sense and Sensibility by Jane Austen 1811

text3: The Book of Genesis

text4: Inaugural Address Corpus

text5: Chat Corpus

text6: Monty Python and the Holy Grail

text7: Wall Street Journal

text8: Personals Corpus

text9: The Man Who Was Thursday by G . K . Chesterton 1908

Type a text's name to inspect it

>>> text1

<Text: Moby Dick by Herman Melville 1851>

>>> text2

<Text: Sense and Sensibility by Jane Austen 1811>

Searching Text

>>> text1.concordance('monstrous')

Displaying 11 of 11 matches:

ong the former , one was of a most monstrous size . ... This came towards us ,

ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r

ll over with a heathenish array of monstrous clubs and spears . Some were thick

d as you gazed , and wondered what monstrous cannibal and savage could ever hav

that has survived the flood ; most monstrous and most mountainous ! That Himmal

they might scout at Moby Dick as a monstrous fable , or still worse and more de

th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l

ing Scenes . In connexion with the monstrous pictures of whales , I am strongly

ere to enter upon those still more monstrous stories of them which are to be fo

ght have been rummaged out of this monstrous cabinet there is no telling . But

of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

# What other words appear in similar contexts?

>>> text1.similar('monstrous')

impalpable puzzled part mystifying gamesome horrible maddens

domineering curious exasperate untoward contemptible careful

trustworthy delightfully christian mean uncommon abundant wise

>>> text2.similar('monstrous')

very exceedingly so heartily as vast amazingly extremely great sweet a

remarkably good

>>> text2.common_contexts(['monstrous','very'])

be_glad a_lucky am_glad is_pretty a_pretty

# Dispersion plot of selected words across the U.S. presidential inaugural addresses

>>> text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])

# Generate random text (Text.generate was removed in NLTK 3.0, hence the error below; later releases restored it)

>>> text3.generate()

Traceback (most recent call last):

 File "<stdin>", line 1, in <module>

AttributeError: 'Text' object has no attribute 'generate'

# Counting tokens

>>> len(text3)  # Genesis has 44764 words and punctuation symbols (tokens)

44764

# Get the sorted vocabulary

>>> sorted(set(text3))

['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', ...]

# Vocabulary size (number of distinct tokens)

>>> len(set(text3))

2789

# On average each word is used about 16 times

>>> from __future__ import division  # only needed under Python 2

>>> len(text3)/len(set(text3))

16.050197203298673

# Count how many times a word occurs in the text

>>> text3.count('smote')

5

# Percentage of the text taken up by a specific word

>>> 100*text4.count('a')/len(text4)

1.4643016433938312

# Define functions for the operations above

>>> def lexical_diversity(text):

...    return len(text)/len(set(text))

...

>>> def percentage(count,total):

...    return 100*count/total

...

# Calling the functions

>>> lexical_diversity(text3)

16.050197203298673

>>> percentage(text4.count('a'),len(text4))

1.4643016433938312
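The two helpers above work on any sequence of tokens, not only NLTK Text objects. A minimal self-contained check on a plain list (the sample tokens are invented for illustration):

```python
def lexical_diversity(text):
    # tokens per distinct type; higher means more repetition
    return len(text) / len(set(text))

def percentage(count, total):
    return 100 * count / total

tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(lexical_diversity(tokens))  # 6 tokens / 5 types = 1.2
print(percentage(tokens.count('the'), len(tokens)))  # 'the' is 2 of 6 tokens
```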

2 Texts as Lists of Words

Defining a list

>>> sent1=['call','me','retacn','.']

>>> sent1

['call', 'me', 'retacn', '.']

>>> len(sent1)

4

>>> lexical_diversity(sent1)

1.0

>>> ['monty','python']+['and','the','holy','grail']

['monty', 'python', 'and', 'the', 'holy','grail']

>>> sent1.append('some')

>>> sent1

['call', 'me', 'retacn', '.', 'some']

Indexing Lists

>>> text4[173]

'awaken'

>>> text4.index('awaken')

173

# Slicing

>>> text5[16715:16735]

['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']

>>> text6[1600:1625]

['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']

# Indexing errors: index out of range

>>> len(text5)

45010

>>> text5[45010]

Traceback (most recent call last):

 File "<stdin>", line 1, in <module>

 File "D:\Python35\lib\site-packages\nltk\text.py", line 314, in __getitem__

   return self.tokens[i]

IndexError: list index out of range

# More indexing examples

>>> text8[:3]

['25', 'SEXY', 'MALE']

>>> text8[4860:]

['FREE', 'to', 'advertise', 'in','Perfect', 'Match', '!']

Variables

>>> sent1=['call','me','retacn','.']

Strings

>>> name='retacn'

>>> name[0]

'r'

>>> name[:5]

'retac'

>>> name[:]

'retacn'

# Multiplication / addition

>>> name*2

'retacnretacn'

>>> name+'!'

'retacn!'

# Join a list into a string, and split a string back into a list

>>> ' '.join(['retacn','coder'])

'retacn coder'

>>> 'retacn coder'.split()

['retacn', 'coder']
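For whitespace-free tokens, join and split are inverses of each other; a quick sketch:

```python
# ' '.join glues a list of tokens into a string; str.split undoes it,
# provided the tokens themselves contain no whitespace.
tokens = ['retacn', 'is', 'a', 'coder']
joined = ' '.join(tokens)
print(joined)          # retacn is a coder
print(joined.split())  # ['retacn', 'is', 'a', 'coder']
assert joined.split() == tokens
```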

3 Simple Statistics

>>> saying=['retacn','is','a','coder','!']

>>> saying

['retacn', 'is', 'a', 'coder', '!']

>>> tokens=set(saying)

>>> tokens

{'is', 'retacn', 'a', 'coder', '!'}

>>> tokens=sorted(tokens)

>>> tokens

['!', 'a', 'coder', 'is', 'retacn']

>>> tokens[-2:]

['is', 'retacn']

Frequency Distributions

# Find the 50 most frequent words of the book

>>> fdist1=FreqDist(text1)

>>> fdist1

FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

>>> vocabulary1=fdist1.keys()

# Number of distinct words

>>> len(fdist1)

19317

# Note: in NLTK 3 fdist1.keys() is no longer frequency-sorted; use fdist1.most_common(50) to get the 50 most frequent words

>>> vocabulary1=list(vocabulary1)

>>> vocabulary1[:50]

['sweetest', 'demeanor', 'cutlets', 'bathing', 'eddy', 'summits', 'eager', 'carcass', 'splintered', 'coppers', 'ruinous', 'ease', 'Gather', 'immovable', 'Verdes', 'breathing', 'colony', 'dreamed', 'deepeningly', 'artisan', 'placid', 'seven', 'manipulator', 'Cistern', 'favourites', 'GOLDSMITH', 'Nick', 'swooping', 'coupled', 'vocation', 'greatness', 'Tower', 'reelingly', 'ceti', 'Lavater', 'Zealand', 'unread', 'Spin', 'crape', 'screamed', '.\'"', 'charged', 'sublimer', 'phantoms', 'cheer', 'saved', 'timber', 'skins', 'yore', 'spot']

# 'whale' occurs 906 times

>>> fdist1['whale']

906

# Plot the cumulative frequency of the top 50 words

>>> fdist1.plot(50, cumulative=True)

# Words that occur only once (hapaxes)

>>> fdist1.hapaxes()

Fine-grained Selection of Words

# Words longer than 15 characters

>>> v=set(text1)

>>> long_word=[w for w in v if len(w)>15]

>>> long_word

['uninterpenetratingly', 'uncomfortableness', 'supernaturalness', 'physiognomically', 'characteristically', 'cannibalistically', 'CIRCUMNAVIGATION', 'circumnavigating', 'indiscriminately', 'apprehensiveness', 'superstitiousness', 'circumnavigation', 'simultaneousness', 'circumnavigations', 'preternaturalness', 'Physiognomically', 'indispensableness', 'comprehensiveness', 'hermaphroditical', 'irresistibleness', 'undiscriminating', 'subterraneousness', 'responsibilities', 'uncompromisedness']

>>> sorted(long_word)

['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

# Words longer than 7 characters that occur more than 7 times

>>> fdist5=FreqDist(text5)

>>> sorted([w for w in set(fdist5) if len(w)>7 and fdist5[w]>7])

['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question', 'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football', 'innocent', 'listening', 'remember', 'seriously', 'something', 'together', 'tomorrow', 'watching']
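The same filtering idiom works on any token list without NLTK; a sketch using collections.Counter in place of FreqDist (the sample tokens are invented for illustration):

```python
from collections import Counter

# Counter stands in for FreqDist here: it maps each sample to its count.
tokens = ['unbelievable', 'spam', 'spam', 'spam', 'extraordinary',
          'spam', 'ok', 'extraordinary', 'extraordinary']
fdist = Counter(tokens)

# words longer than 7 characters that occur more than twice
selected = sorted(w for w in fdist if len(w) > 7 and fdist[w] > 2)
print(selected)  # ['extraordinary']
```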

Collocations and Bigrams

>>> bigrams(['more','is','said','than','done'])

Traceback (most recent call last):

 File "<stdin>", line 1, in <module>

NameError: name 'bigrams' is not defined

# Import the module

>>> from nltk import *

>>> bigrams(['more','is','said','than','done'])

<generator object bigrams at 0x0F0AD330>

>>> v=bigrams(['more','is','said','than','done'])

>>> v

<generator object bigrams at 0x1071C7E0>

>>> list(v)

[('more', 'is'), ('is', 'said'), ('said','than'), ('than', 'done')]
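nltk.bigrams simply pairs each word with its successor; a plain-Python equivalent using zip:

```python
# A stdlib sketch of nltk.bigrams: offset the list against itself.
def bigrams(words):
    return list(zip(words, words[1:]))

print(bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```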

# Find common collocations

>>> text4.collocations()

United States; fellow citizens; four years; years ago; Federal

Government; General Government; American people; Vice President; Old

World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;

God bless; every citizen; Indian tribes; public debt; one another;

foreign nations; political parties

>>> text8.collocations()

would like; medium build; social drinker; quiet nights; non smoker;

long term; age open; Would like; easy going; financially secure; fun

times; similar interests; Age open; weekends away; poss rship; well

presented; never married; single mum; permanent relationship; slim

build

Counting Other Things

# Distribution of word lengths in the text

>>> [len(w) for w in text1][:10]

[1, 4, 4, 2, 6, 8, 4, 1, 9, 1]

# Count how many times each length occurs

>>> fdist=FreqDist([len(w) for w in text1])

>>> fdist

FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})

# Total number of samples

>>> fdist.N()

260819

# Number of distinct sample values (word lengths)

>>> len(fdist)

19

>>> fdist.keys()

dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,11, 12, 13, 14, 15, 16, 17, 18, 20])

# The longest word has 20 characters

>>> list(fdist.keys())

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,14, 15, 16, 17, 18, 20]

>>> fdist.items()

dict_items([(1, 47933), (2, 38513), (3, 50223), (4, 42345), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)])

# The most frequent word length is 3

>>> fdist.max()

3

# Number of words of length 3

>>> fdist[3]

50223

>>> fdist.freq(3)

0.19255882431878046

Common functions of the NLTK frequency distribution class

fdist=FreqDist(samples) #create a frequency distribution

fdist[sample] += 1 #increment the count for a sample (fdist.inc(sample) in NLTK 2)

fdist[sample] #count of a given sample

fdist.freq(sample) #frequency of a given sample

fdist.N() #total number of samples

fdist.keys() #the samples (sorted by decreasing frequency in NLTK 2 only)

for sample in fdist: #iterate over the samples

fdist.max() #sample with the greatest count

fdist.tabulate() #tabulate the frequency distribution

fdist.plot() #plot the frequency distribution

fdist.plot(cumulative=True) #plot a cumulative frequency distribution

fdist1 < fdist2 #test if the samples in fdist1 occur less frequently than in fdist2
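Most of the operations in this table map directly onto collections.Counter from the standard library. A minimal sketch, with the caveat that Counter is only a stand-in for NLTK's FreqDist and the sample words are made up:

```python
from collections import Counter

words = ['me', 'on', 'to', 'me', 'call']
fdist = Counter(len(w) for w in words)   # like FreqDist(samples)
fdist[6] += 1                            # add a sample (fdist.inc in NLTK 2)
print(fdist[2])                          # count of sample 2 -> 4
total = sum(fdist.values())              # like fdist.N() -> 6
top = fdist.most_common(1)[0][0]         # like fdist.max() -> 2
print(total, top, fdist[2] / total)      # like fdist.freq(2) = 4/6
```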

4 Making Decisions and Taking Control

Conditionals

Numerical comparison operators

>>> from nltk.book import *


>>> sent1

['Call', 'me', 'Ishmael', '.']

>>> sent7

['Pierre', 'Vinken', ',', '61', 'years','old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive','director', 'Nov.', '29', '.']

>>> [w for w in sent7 if len(w)<4]

[',', '61', 'old', ',', 'the', 'as', 'a','29', '.']

>>> [w for w in sent7 if len(w)<=4]

[',', '61', 'old', ',', 'will', 'join','the', 'as', 'a', 'Nov.', '29', '.']

>>> [w for w in sent7 if len(w)==4]

['will', 'join', 'Nov.']

>>> [w for w in sent7 if len(w)!=4]

['Pierre', 'Vinken', ',', '61', 'years','old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']

Word Comparison Operators

# Words starting with a given string

>>> sorted(w for w in set(text1) if w.startswith('ab'))

['aback', 'abaft', 'abandon', 'abandoned','abandonedly', 'abandonment', 'abased', 'abasement', 'abashed', 'abate','abated', 'abatement', 'abating',

# Words ending with a given string

>>> sorted(w for w in set(text1) if w.endswith('ableness'))

['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']

# Words containing a given substring

>>> sorted([term for term in set(text4) if 'gnt' in term])

['Sovereignty', 'sovereignties','sovereignty']

# Titlecase words

>>> sorted([term for term in set(text6) if term.istitle()])

['A', 'Aaaaaaaaah', 'Aaaaaaaah', 'Aaaaaah', 'Aaaah', 'Aaaaugh', 'Aaagh', 'Aaah', 'Aaauggh', 'Aaaugh', 'Aaauugh', 'Aagh', 'Aah', 'Aauuggghhh', 'Aauuugh', 'Aauuuuugh', 'Aauuuves', 'Action', 'Actually', 'African', 'Ages', 'Aggh', 'Agh', 'Ah', 'Ahh', 'Alice', 'All', 'Allo', 'Almighty', 'Alright', 'Am',

# Digit strings

>>> sorted([term for term in set(text6) if term.isdigit()])

['1', '10', '11', '12', '13', '14', '15','16', '17', '18', '19', '2', '20', '21', '22', '23', '24', '3', '4', '5', '6','7', '8', '9']

#all cased characters lowercase

islower()

#all cased characters uppercase

isupper()

#alphabetic

isalpha()

#alphanumeric

isalnum()
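A quick demonstration of these predicate methods on a few sample tokens (the tokens are invented for illustration):

```python
tokens = ['Ishmael', 'me', 'NLTK', '1851', 'Nov.', 'c3po']
print([t for t in tokens if t.islower()])  # ['me', 'c3po']
print([t for t in tokens if t.isupper()])  # ['NLTK']
print([t for t in tokens if t.isalpha()])  # ['Ishmael', 'me', 'NLTK']
print([t for t in tokens if t.isalnum()])  # ['Ishmael', 'me', 'NLTK', '1851', 'c3po']
```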

Operating on Every Element

>>> [len(w) for w in text1]

>>> [w.upper() for w in text1]

# Vocabulary counts

>>> len(text1)

260819

>>> len(set(text1))

19317

>>> len(set([word.lower() for word in text1]))

17231

>>> len(set([word.lower() for word in text1 if word.isalpha()]))

16948

Nested Code Blocks

>>> word='cat'

>>> if len(word)<5:

...    print('word length is less than 5')

...

word length is less than 5

# The loop body is indented four spaces

>>> for word in sent1:

...    print(word)

...

Call

me

Ishmael

# Loops with conditions

>>> for word in sent1:

...    if word.endswith('l'):

...             print(word)

...

Call

Ishmael

>>> for token in sent1:

...    if token.islower():

...             print(token,'is a lowercase word')

...    elif token.istitle():

...             print(token,'is a titlecase word')

...    else:

...             print(token,'is punctuation')

...

Call is a titlecase word

me is a lowercase word

Ishmael is a titlecase word

. is punctuation
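The if/elif/else chain above can be wrapped in a reusable function (classify_token is a name introduced here for illustration):

```python
def classify_token(token):
    # mirrors the branches of the loop above
    if token.islower():
        return 'lowercase'
    elif token.istitle():
        return 'titlecase'
    else:
        return 'punctuation'

for token in ['Call', 'me', 'Ishmael', '.']:
    print(token, 'is a', classify_token(token), 'word')
```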

5 Automatic Natural Language Understanding

Language understanding technologies:

Word sense disambiguation

Pronoun resolution

Generating language output

Machine translation

Spoken dialogue systems