
Natural Language Processing with Python: Learning Notes, Part 1

NLTK is an efficient platform built on Python for working with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources (such as WordNet), along with a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, plus wrappers for industrial-strength NLP libraries and an active discussion forum.

Searching Text

Let's look up the word monstrous in Moby Dick:

from nltk.book import *

text1.concordance("monstrous")

The output is as follows:

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

The common_contexts function lets us examine the contexts that two or more words share, such as monstrous and very. The words must be enclosed in square brackets inside the parentheses, separated by commas.

The code is as follows:

text2.common_contexts(["monstrous", "very"])

The result:

a_pretty am_glad a_lucky is_pretty be_glad

text.generate()  # randomly generate a passage in the style of the text. This method is missing from early NLTK 3 releases; to use it as shown here you need NLTK 2 (it was later reintroduced in newer NLTK 3 versions).

set(text3)  # get the vocabulary (the set of distinct tokens)

sorted(set(text))  # get the sorted vocabulary; capitalized words appear before lowercase ones in the sorted list.
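The same idea can be tried without loading an NLTK corpus; a minimal sketch on a small, made-up token list:

```python
# Build a sorted vocabulary from a small, invented list of tokens.
tokens = ["the", "Whale", "the", "sea", "Ahab", "whale", "sea"]

vocab = sorted(set(tokens))  # distinct tokens, sorted
print(vocab)  # → ['Ahab', 'Whale', 'sea', 'the', 'whale']
```

Capitalized words sort first because Python compares strings by character code, and uppercase letters come before lowercase letters in ASCII.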

We can use FreqDist to find the 50 most common words in Moby Dick.

The code is as follows:

fdist1 = FreqDist(text1)
fdist1.most_common(50)  # the 50 most frequent words with their counts

The collocations() function finds bigrams that occur more often than we would expect based on the frequencies of the individual words.
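NLTK's collocations() additionally scores candidate pairs with an association measure and filters common words; the simpler first step, counting raw bigram frequencies, can be sketched with the standard library (the sentence below is invented for illustration):

```python
from collections import Counter

# Count adjacent word pairs (bigrams) in a made-up token list.
tokens = "the sperm whale and the right whale swam past the sperm whale".split()
bigrams = Counter(zip(tokens, tokens[1:]))

# ('the', 'sperm') and ('sperm', 'whale') each occur twice.
print(bigrams.most_common(2))
```

A real collocation finder would then rank these pairs by how much more often they co-occur than chance predicts, not by raw count alone.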

fdist = FreqDist([len(w) for w in text1])
print(fdist.items())

The output is as follows:

(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), (5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), (15, 70), (17, 12), (18, 1), (20, 1)

Here (1, 47933) means there are 47,933 tokens of length 1.
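FreqDist behaves like a frequency counter, so the same length distribution can be sketched with collections.Counter on a small, invented token list:

```python
from collections import Counter

# Distribution of word lengths over a made-up token list.
tokens = ["I", "saw", "a", "most", "monstrous", "whale"]
length_dist = Counter(len(w) for w in tokens)

print(sorted(length_dist.items()))  # → [(1, 2), (3, 1), (4, 1), (5, 1), (9, 1)]
```

Each pair reads the same way as above: (1, 2) means two tokens of length 1.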

s.startswith(t)  Test whether s starts with t
s.endswith(t)    Test whether s ends with t
t in s           Test whether s contains t
s.islower()      Test whether all cased characters in s are lowercase
s.isupper()      Test whether all cased characters in s are uppercase
s.isalpha()      Test whether all characters in s are alphabetic
s.isalnum()      Test whether all characters in s are alphanumeric
s.isdigit()      Test whether all characters in s are digits
s.istitle()      Test whether s is titlecased (every word in s starts with an uppercase letter)
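Each of these predicates can be checked directly in the interpreter; a few quick examples:

```python
# Quick checks of the string predicate methods listed above.
assert "Moby".startswith("Mo")
assert "Moby".endswith("by")
assert "ob" in "Moby"
assert "whale".islower()
assert "WHALE".isupper()
assert "whale".isalpha()
assert "whale42".isalnum()      # letters and digits
assert "1851".isdigit()
assert "Moby Dick".istitle()
assert not "monstrous size".istitle()  # no capitalized words
print("all checks passed")
```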