20. [Advanced] Popular Library Models -- NLTK (Natural Language Toolkit)

#-*- coding:utf-8 -*-

# How can the following two sentences be turned into vectors?
sentence1 = 'The cat is walking in the bedroom.'
sentence2 = 'A dog was running across the kitchen.'

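#1. Vectorize with the bag-of-words method
# Bag of words, as the name suggests, collects every word that appears in any sample into a single column vector, also called the vocabulary,
# and then represents each training sample numerically by how many times it contains each word.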

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
sentences = [sentence1,sentence2]
sentences = vec.fit_transform(sentences)
# (outputs shown in the comments below were recorded under Python 2, hence the u'' prefixes)
print(sentences.toarray())
# [[0 1 1 0 1 1 0 0 2 1 0]
#  [1 0 0 1 0 0 1 1 1 0 1]]

# Print the word that each dimension of the vectors represents
# (on scikit-learn 1.0+, use vec.get_feature_names_out() instead)
print(vec.get_feature_names())
# [u'across', u'bedroom', u'cat', u'dog', u'in', u'is', u'kitchen', u'running', u'the', u'walking', u'was']

#*************************************************************************************
#2. Vectorize with NLTK
import nltk
# word_tokenize and pos_tag rely on NLTK data packages, which are installed via nltk.download()

#(1) Tokenize and normalize each sentence; some words are split further, e.g. aren't becomes are and n't, and I'm becomes I and 'm
tokens_1 = nltk.word_tokenize(sentence1)
print(tokens_1)
# ['The', 'cat', 'is', 'walking', 'in', 'the', 'bedroom', '.']
tokens_2 = nltk.word_tokenize(sentence2)
print(tokens_2)
# ['A', 'dog', 'was', 'running', 'across', 'the', 'kitchen', '.']

#(2) Build the vocabulary of each sentence, sorted in ASCII order
vocab_1 = sorted(set(tokens_1))
print(vocab_1)
# ['.', 'The', 'bedroom', 'cat', 'in', 'is', 'the', 'walking']
vocab_2 = sorted(set(tokens_2))
print(vocab_2)
# ['.', 'A', 'across', 'dog', 'kitchen', 'running', 'the', 'was']

#(3) Initialize a stemmer to reduce each word to its stem (e.g. walking -> walk, running -> run, ...)
stemmer = nltk.stem.PorterStemmer()
stem_1 = [stemmer.stem(t) for t in tokens_1]
print(stem_1)
# ['the', 'cat', 'is', u'walk', 'in', 'the', 'bedroom', '.']
stem_2 = [stemmer.stem(t) for t in tokens_2]
print(stem_2)
# ['A', 'dog', u'wa', u'run', u'across', 'the', 'kitchen', '.']

#(4) Initialize the part-of-speech tagger and tag every token (noun, verb, preposition, ...)
pos_tag_1 = nltk.tag.pos_tag(tokens_1)
print(pos_tag_1)
# [('The', 'DT'), ('cat', 'NN'), ('is', 'VBZ'), ('walking', 'VBG'), ('in', 'IN'), ('the', 'DT'), ('bedroom', 'NN'), ('.', '.')]
pos_tag_2 = nltk.tag.pos_tag(tokens_2)
print(pos_tag_2)
# [('A', 'DT'), ('dog', 'NN'), ('was', 'VBD'), ('running', 'VBG'), ('across', 'IN'), ('the', 'DT'), ('kitchen', 'NN'), ('.', '.')]

# Summary:
# 1. NLTK can not only tag the part of speech of each word, it can even analyze the structure of a sentence.
# 2. The drawback is that we can only analyze parts of speech; there is no way to measure whether individual words are similar in meaning.
# 3. The two sentences in this example describe very similar scenes semantically, so we need to represent each word as a vector.
#    Next, we will study the word2vec technique.
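
To make the limitation in the summary concrete, here is a minimal sketch that rebuilds the bag-of-words vectors by hand and compares them with cosine similarity. It is not part of the original walkthrough. The helper names (`tokenize`, `bag_of_words`, `cosine_similarity`) are hypothetical, and the regular expression is an assumption meant to mimic CountVectorizer's default tokenization (lowercase, tokens of at least two characters). Cosine similarity is only one possible way to compare the vectors.

```python
import math
import re
from collections import Counter

sentence1 = 'The cat is walking in the bedroom.'
sentence2 = 'A dog was running across the kitchen.'


def tokenize(text):
    # Assumption: lowercase the text and keep tokens of two or more word
    # characters, which is how CountVectorizer behaves by default
    # (so the single-letter word 'A' is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())


def bag_of_words(texts):
    # Count the words of every text, then build one shared, sorted vocabulary.
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    # Each text becomes a vector of word counts, one entry per vocabulary word.
    vectors = [[c[w] for w in vocab] for c in counts]
    return vocab, vectors


def cosine_similarity(u, v):
    # Cosine of the angle between two count vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab, (vec1, vec2) = bag_of_words([sentence1, sentence2])
similarity = cosine_similarity(vec1, vec2)
print(vocab)
print(vec1)
print(vec2)
print(round(similarity, 3))
```

The only word the two sentences share is "the", so the similarity stays low even though the scenes they describe are alike. Word counts alone cannot tell that "cat" is close to "dog" or "bedroom" to "kitchen", which is the gap that word vectors are meant to fill.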