Artificial Intelligence: Python Implementation, Chapter 10, NLP Day 4: A Bag of Words

阿新 • Published: 2018-12-31

Extracting frequent terms with the bag-of-words model

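One of the main goals of text analysis is to convert text into numerical form so that machine learning can be applied to it. Consider a collection of documents containing millions of words: to analyze them, we need to extract the text and convert it into a numerical representation.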

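Machine learning algorithms work on numerical data, which is what lets them analyze the data and extract useful information. The bag-of-words model builds a vocabulary from all the words in the documents and models each document with a document-term matrix over that vocabulary. Each document is thereby described as a bag of words: we only record word counts and ignore grammar and word order.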

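So what does a document-term matrix look like? It is a table that records how many times each word occurs in each document, so a document can be described as a weighted combination of words. We can set thresholds to keep only the more meaningful words. In effect, we build a histogram of the words in a document, and that histogram is the feature vector used for text classification.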

Consider the following sentences:

Sentence 1: the children are playing in the hall
Sentence 2: The hall has a lot of space
Sentence 3: Lots of children like playing in an open space

From these three sentences we get the following 14 unique words (ignoring case and counting "Lots" as "lot"):

the, children, are, playing, in, hall, has, a, lot, of, space, like, an, open

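Using the number of times each word occurs in a sentence, we can build a histogram for each sentence. Each feature vector has 14 dimensions, because there are 14 distinct words: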

Sentence 1: [2 1 1 1 1 1 0 0 0 0 0 0 0 0]

Sentence 2: [1 0 0 0 0 1 1 1 1 1 1 0 0 0]

Sentence 3: [0 1 0 1 1 0 0 0 1 1 1 1 1 1]
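
The vectors above can be reproduced in a few lines of plain Python. This is only a minimal sketch and not part of the program built later: the tokenize helper is hypothetical, and it assumes lowercasing plus mapping "lots" to "lot", which is what the 14-word vocabulary implies.

```python
from collections import Counter

sentences = [
    "the children are playing in the hall",
    "The hall has a lot of space",
    "Lots of children like playing in an open space",
]

def tokenize(sentence):
    # Assumption: ignore case and treat "lots" as "lot"
    return ["lot" if w == "lots" else w for w in sentence.lower().split()]

# Build the vocabulary in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in tokenize(sentence):
        if word not in vocabulary:
            vocabulary.append(word)

# One count vector per sentence, aligned with the vocabulary
vectors = []
for sentence in sentences:
    counts = Counter(tokenize(sentence))
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each printed list has one entry per vocabulary word, so the position of a count, not the order of words in the sentence, carries the meaning.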

Now that we have extracted these feature vectors, we can analyze the data with machine learning algorithms.

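How do we build a bag-of-words model in practice? Here we take the text from NLTK's Brown corpus and count words with scikit-learn's CountVectorizer. Create a Python program and import the following packages: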

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown

Read text from the Brown corpus. We read the first 5400 words here; you can use any number you like.

# Read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])

Define the number of words in each chunk:

# Number of words in each chunk 
chunk_size = 800

Define the chunking function:

# Split the input text into chunks of N words each
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []
    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []
    # Append the leftover words, if any, as a final shorter chunk
    if cur_chunk:
        output.append(' '.join(cur_chunk))
    return output

Split the input text into chunks:

text_chunks = chunker(input_data, chunk_size)
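
To get a feel for what the chunking step returns, here is a small self-contained sketch. It uses a slicing-based variant (named chunk_by_slicing here, a hypothetical name) instead of the loop above, and toy input instead of the Brown corpus.

```python
def chunk_by_slicing(input_data, N):
    # Split on single spaces and join every run of N words into one chunk
    words = input_data.split(' ')
    return [' '.join(words[i:i + N]) for i in range(0, len(words), N)]

sample_chunks = chunk_by_slicing('one two three four five six seven', 3)
print(sample_chunks)

# Same proportions as the tutorial: 5400 words in chunks of 800
toy_text = ' '.join(['w'] * 5400)
toy_chunks = chunk_by_slicing(toy_text, 800)
print(len(toy_chunks), len(toy_chunks[-1].split(' ')))
```

With 5400 words and a chunk size of 800, the text splits into six full chunks plus a seventh chunk holding the remaining 600 words.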

Convert the chunks into dictionary items:

# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)

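Next, extract the document-term matrix from the word counts. We use CountVectorizer for this and pass it two parameters: min_df, the minimum document frequency, and max_df, the maximum document frequency. Document frequency is the number of documents (here, chunks) that contain a word, not the total number of times the word occurs.

max_df: either a float in the range [0.0, 1.0] or an int with no upper bound; the default is 1.0. It acts as a threshold when the vocabulary is built: a word whose document frequency is greater than max_df is left out of the vocabulary. A float is read as a proportion of the documents in the corpus, an int as an absolute number of documents. The parameter is ignored if a vocabulary is passed in explicitly.

min_df: like max_df, except that a word whose document frequency is less than min_df is left out of the vocabulary.

With 5400 words and a chunk size of 800 there are 7 chunks, so min_df=7 keeps only the words that occur in every chunk, and max_df=20 never excludes anything.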
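
The filtering rule behind these two thresholds can be sketched in plain Python. This is an illustration of the idea only, not scikit-learn's implementation: document_frequency_filter is a hypothetical helper, and it tokenizes by splitting on whitespace, which differs from CountVectorizer's default tokenizer.

```python
from collections import Counter

def document_frequency_filter(documents, min_df=1, max_df=1.0):
    """Return the sorted words whose document frequency lies in [min_df, max_df]."""
    n_docs = len(documents)
    # Count each word at most once per document
    df = Counter()
    for doc in documents:
        df.update(set(doc.lower().split()))
    # Floats are proportions of the corpus, ints are absolute document counts
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    return sorted(word for word, count in df.items() if low <= count <= high)

docs = [
    "the children are playing in the hall",
    "the hall has a lot of space",
    "lots of children like playing in an open space",
]

# Words found in at least two documents
frequent = document_frequency_filter(docs, min_df=2)
# Words found in at most one document
rare = document_frequency_filter(docs, max_df=1)
print(frequent)
print(rare)
```

Note that "the" occurs twice in the first document but still contributes only 1 to its document frequency there.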

# Extract the document term matrix
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])

Extract the vocabulary and display it. The vocabulary is the list of distinct words kept in the previous step.

# Extract the vocabulary and display it
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print("\nVocabulary:\n", vocabulary)

Generate display names for the chunks:

# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i+1))

Print the document-term matrix:

# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
# Transpose to one row per word and convert to a dense array so zero counts are kept
for word, item in zip(vocabulary, document_term_matrix.T.toarray()):
    output = [word] + [str(freq) for freq in item]
    print(formatted_text.format(*output))
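
The format string used for printing repeats '{:>12}' once per column, which right-aligns every cell in a field 12 characters wide. A minimal sketch with hypothetical toy counts (not real Brown corpus output) shows the layout:

```python
# Hypothetical toy data standing in for the vocabulary and the dense counts
words = ['and', 'the']
counts = [[23, 9], [51, 43]]  # one row per word, one column per chunk
chunk_names = ['Chunk-1', 'Chunk-2']

# One right-aligned, 12-character field per column (word column + chunk columns)
row_format = '{:>12}' * (len(chunk_names) + 1)

lines = [row_format.format('Word', *chunk_names)]
for word, row in zip(words, counts):
    lines.append(row_format.format(word, *[str(c) for c in row]))
print('\n'.join(lines))
```

Every row must supply exactly one value per column, which is why zero counts have to be kept when a matrix row is turned into strings.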

The complete code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown

# Split the input text into chunks of N words each
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []
    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []
    # Append the leftover words, if any, as a final shorter chunk
    if cur_chunk:
        output.append(' '.join(cur_chunk))
    return output

# Read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])

# Number of words in each chunk 
chunk_size = 800

text_chunks = chunker(input_data, chunk_size)

# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)

# Extract the document term matrix
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform([chunk['text'] for chunk in chunks])

# Extract the vocabulary and display it
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print("\nVocabulary:\n", vocabulary)

# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i+1))

# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
# Transpose to one row per word and convert to a dense array so zero counts are kept
for word, item in zip(vocabulary, document_term_matrix.T.toarray()):
    output = [word] + [str(freq) for freq in item]
    print(formatted_text.format(*output))

The output shows the full document-term matrix, with the number of times each word occurs in each chunk.
