Udacity Intro to Machine Learning: Text Learning

阿新 • Published: 2019-01-03

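The basic problem in text learning concerns the input features. Every document, email, or book title we learn from has a different, non-standard length, so a single word cannot serve as the input feature. Machine learning on text therefore uses a technique called Bag of Words: take a piece of text and count how often each word occurs in it.

Nice Day and A Very Nice Day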

Mr Day Loves a Nice Day

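Properties of the bag of words: word order within a phrase does not affect the counts (Nice Day and Day Nice give the same result); long texts and short phrases produce very different input vectors (copy the same email ten times into one text and the result is completely different from a single copy); and it cannot handle compound phrases made up of several words.

(Chicago Bulls means something different as two separate words than as one phrase. The dictionary has since been changed to include the compound phrase Chicago Bulls.)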
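The three properties above can be checked with a few lines of plain Python. This is only a sketch: `bag_of_words` is a hypothetical helper built on `collections.Counter`, not the sklearn implementation, and it splits on whitespace only.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated word occurs."""
    return Counter(text.lower().split())

# Word order does not matter: both phrases give the same bag.
same_bag = bag_of_words("Nice Day") == bag_of_words("Day Nice")

# Length matters: repeating a message ten times multiplies every count by ten.
email = "Mr Day loves a nice day"
once = bag_of_words(email)
ten_times = bag_of_words(" ".join([email] * 10))

# Compound phrases are split: "Chicago Bulls" becomes two unrelated words.
compound = bag_of_words("Chicago Bulls")

print(same_bag)                        # True
print(once["day"], ten_times["day"])   # 2 20
print(sorted(compound))                # ['bulls', 'chicago']
```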

In sklearn, the bag of words is called CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
string1 = "hi Katie the self driving car will be late Best Sebastian"
string2 = "Hi Sebastian the machine learning class will be great great great Best Katie"
string3 = "Hi Katie the machine learning class will be most excellent"
email_list = [string1, string2, string3]
# fit() learns the vocabulary and returns the fitted vectorizer itself
vectorizer.fit(email_list)
# transform() returns the sparse matrix of word counts, one row per email
bag_of_words = vectorizer.transform(email_list)
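To make the two steps less of a black box, here is a rough standard-library imitation of what `fit` and `transform` do. It is a sketch under assumptions: the functions `tokenize`, `fit`, and `transform` are hypothetical stand-ins, the tokenizer lowercases and keeps words of two or more characters (which is, as far as I know, what CountVectorizer does by default), and plain lists replace the sparse matrix sklearn returns.

```python
import re

TOKEN = re.compile(r"\b\w\w+\b")  # words of two or more characters

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit(documents):
    """Learn the vocabulary: each distinct word gets a column index."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}

def transform(documents, vocabulary):
    """Turn each document into a row of word counts."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in tokenize(doc):
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows

email_list = [
    "hi Katie the self driving car will be late Best Sebastian",
    "Hi Sebastian the machine learning class will be great great great Best Katie",
    "Hi Katie the machine learning class will be most excellent",
]
vocabulary = fit(email_list)
counts = transform(email_list, vocabulary)

print(len(vocabulary))                  # 17 distinct words
print(counts[1][vocabulary["great"]])   # 3
```

Looking up a word in `vocabulary` gives its column, and each row of `counts` is the feature vector for one email.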

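In text analysis, not all words in the vocabulary are equal: some carry more information than others (the figure in the original post marked the low-information words).

Some words carry no information at all. They become noise in the dataset, so they should be removed from the corpus. This list of words is called the stop words; they are usually low-information words that occur very frequently. Removing stop words before processing the data is a common preprocessing step in text analysis.

NLTK is the Natural Language Toolkit, and a stop word list can be obtained from it. NLTK needs a corpus (that is, a set of documents) to provide the stop words, and this corpus has to be downloaded on first use.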

import nltk
# opens the NLTK downloader; nltk.download("stopwords") fetches only the stop word corpus
nltk.download()
from nltk.corpus import stopwords
sw = stopwords.words("english")
len(sw)

<<<153
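Once a stop word list is available, removing the words is a simple filter. The sketch below uses a tiny hand-picked list (`STOP_WORDS`, a hypothetical stand-in for the NLTK list) so that it runs without any download.

```python
# Hypothetical, hand-picked stop list for illustration only;
# the NLTK list obtained above is much longer.
STOP_WORDS = {"hi", "the", "will", "be"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop list (case-insensitive)."""
    return [word for word in text.lower().split() if word not in stop_words]

kept = remove_stop_words("Hi Katie the machine learning class will be most excellent")
print(kept)   # ['katie', 'machine', 'learning', 'class', 'most', 'excellent']
```

Swapping in `set(stopwords.words("english"))` for `STOP_WORDS` should give the real behavior, assuming the NLTK corpus has been downloaded.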

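Stemming extracts the root, or stem, of a word, which can then be used in a classifier or a regression.

It turns a five-dimensional input space (five variants of one word, shown in a figure in the original post) into a single dimension without losing any real information.

Stemming with NLTK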

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('english')
print(stemmer.stem('responsiveness'))
print(stemmer.stem('responsivity'))
print(stemmer.stem('unresponsive'))
<<<respons
<<<respons
<<<unrespons

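Overall, running a stemmer can clean up the vocabulary of your corpus considerably.

Order of operations in text processing: stem first (reason: if you build the bag of words first, the bag ends up with several entries that are really the same word).

For example:

Suppose we are working with the text "responsibility is responsive to responsible people" (the sentence is not grammatical).

If you put this text straight into a bag of words, you get:
[is: 1
people: 1
responsibility: 1
responsive: 1
responsible: 1]

If you then apply stemming, you get:
[is: 1
people: 1
respon: 1
respon: 1
respon: 1]
(If you could find a way to stem the count vectorizer object in sklearn, the most likely outcome of the attempt is that your code would crash...)

You would then need another post-processing step to arrive at the bag of words below, which you would have obtained directly had you stemmed first:
[is: 1
people: 1
respon: 3]

Clearly, the second bag of words is probably the one you want, so stemming first gives you the right answer here.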
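The difference between the two orders can be reproduced in a few lines. This is a sketch only: `toy_stem` is a hypothetical lookup table standing in for a real stemmer and uses the stem "respon" from the example above. Unlike the listings above, it also counts the word "to".

```python
from collections import Counter

# Hypothetical stand-in for a real stemmer: a lookup table covering only
# the words of this example, using the stem "respon" from the text above.
TOY_STEMS = {
    "responsibility": "respon",
    "responsive": "respon",
    "responsible": "respon",
}

def toy_stem(word):
    return TOY_STEMS.get(word, word)

text = "responsibility is responsive to responsible people"
words = text.split()

# Bag first: three separate entries that all mean the same thing.
bag_first = Counter(words)

# Stem first, then bag: the three variants share a single entry.
stem_first = Counter(toy_stem(word) for word in words)

print(len(bag_first), len(stem_first))   # 6 4
print(stem_first["respon"])              # 3
```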

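Tf-Idf representation: puts more weight on rare words, which helps tell documents apart.

Tf, term frequency: much like the bag of words. Each term or word is weighted by the number of times it occurs in a document (a word that occurs ten times gets ten times the weight of a word that occurs once).

Idf, inverse document frequency: each word is weighted by how often it occurs across the whole corpus, that is, across all documents. The rarer the word, the higher its weight.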
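A small worked computation may help. The sketch below assumes one common textbook definition, tf = raw count and idf = log(N / number of documents containing the word); the three-document corpus and the function names are made up for illustration. sklearn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up corpus for illustration.
documents = [
    "great great great class",
    "great car",
    "late car",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(word, tokens):
    """Tf: raw count of the word in one document."""
    return Counter(tokens)[word]

def inverse_document_frequency(word, corpus):
    """Idf: log(N / number of documents containing the word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)

def tf_idf(word, tokens, corpus):
    return term_frequency(word, tokens) * inverse_document_frequency(word, corpus)

# "great" occurs in 2 of 3 documents and "class" in only 1,
# so "class" gets the larger idf.
print(inverse_document_frequency("great", tokenized))
print(inverse_document_frequency("class", tokenized))
print(tf_idf("great", tokenized[0], tokenized))
```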

Text Learning Mini-Project

Exercise: Warming up with parseOutText()

Just run the code once.
answer: Hi Everyone If you can read this message youre properly using parseOutText Please proceed to the next part of the project

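Exercise: Deploying stemming

answer: hi everyon if you can read this messag your proper use parseouttext pleas proceed to the next part of the project
Note the use of filter here: the empty strings have to be filtered out.

        # split on single spaces and drop the empty strings left by repeated spaces
        text_list = filter(None,text_string.split(' '))
        stemmer = SnowballStemmer("english")
        # stem every word, then join the stems back into one string
        text_list = [stemmer.stem(item) for item in text_list]
        words = ' '.join(text_list)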
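Why the filter is needed can be seen on a short string with repeated spaces. This is a standalone sketch; the sample `text_string` is made up.

```python
text_string = "Hi  Everyone   read this"   # note the repeated spaces

# split(" ") keeps an empty string for each extra space.
raw_pieces = text_string.split(" ")

# filter(None, ...) drops every falsy item, which here means the empty strings.
clean_pieces = list(filter(None, raw_pieces))

print(raw_pieces)     # ['Hi', '', 'Everyone', '', '', 'read', 'this']
print(clean_pieces)   # ['Hi', 'Everyone', 'read', 'this']
```

Without the filter, each empty string would be passed to the stemmer and would end up as an extra space when the words are joined back together.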

Exercise: Cleaning away signature words

answer:tjonesnsf stephani and sam need nymex calendar

        ### use parseOutText to extract the text from the opened email
        words = parseOutText(email)
        ### use str.replace() to remove any instances of the words
        ### ["sara", "shackleton", "chris", "germani"]
        stopwords = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
        for word in stopwords:
            words = words.replace(word, "")
        words = ' '.join(words.split())
        ### append the text to word_data
        word_data.append(words)
        ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
        if name == "sara":
            from_data.append(0)
        elif name == "chris":
            from_data.append(1)
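One thing to be aware of: `str.replace()` removes a signature word wherever it occurs, including inside longer words. The sketch below shows this on a made-up sentence and contrasts it with a whole-word regular expression. The regex variant is only an illustration and is not what the exercise asks for.

```python
import re

signature_words = ["sara", "shackleton", "chris", "germani"]
words = "chris sent the christma card to sara"   # made-up, already stemmed text

# str.replace() removes every occurrence, including ones inside longer words.
by_replace = words
for word in signature_words:
    by_replace = by_replace.replace(word, "")
by_replace = " ".join(by_replace.split())

# Alternative (not what the exercise asks for): match whole words only.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, signature_words)) + r")\b")
by_regex = " ".join(pattern.sub("", words).split())

print(by_replace)   # sent the tma card to
print(by_regex)     # sent the christma card to
```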

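Exercise: Running TfIdf

Use the sklearn TfIdf transform to turn word_data into a tf-idf matrix. Remove English stop words.

Use get_feature_names() to access the mapping between words and feature numbers; it returns a list of all the words in the vocabulary. How many different words are there?

Exercise: Accessing TfIdf features

What is word number 34597 in your TfIdf?

My answers: 42282 and reqs, with stephaniethank at index 37645

Correct answers: 38757 and stephaniethank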

import os
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Excerpt: from_sara, from_chris, word_data, from_data, and parseOutText
# are defined earlier in the course starter code (not shown here).
for name, from_person in [("sara", from_sara), ("chris", from_chris)]:
    for path in from_person:
        ### only look at first 200 emails when developing
        ### once everything is working, remove this line to run over full dataset
        #temp_counter += 1
        #if temp_counter < 200:
        path = os.path.join('..', path[:-1])
        print(path)
        email = open(path, "r")

        ### use parseOutText to extract the text from the opened email
        words = parseOutText(email)
        ### use str.replace() to remove any instances of the words
        ### ["sara", "shackleton", "chris", "germani"]
        stopwords = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
        for word in stopwords:
            words = words.replace(word, "")
        words = ' '.join(words.split())
        ### append the text to word_data
        word_data.append(words)
        ### append a 0 to from_data if email is from Sara, and 1 if email is from Chris
        if name == "sara":
            from_data.append(0)
        elif name == "chris":
            from_data.append(1)

        email.close()

print(word_data[152])
print("emails processed")
from_sara.close()
from_chris.close()

# pickle needs binary mode ("wb") on Python 3
pickle.dump(word_data, open("your_word_data.pkl", "wb"))
pickle.dump(from_data, open("your_email_authors.pkl", "wb"))


### in Part 4, do TfIdf vectorization here
# Note: the exercise asks for English stop words to be removed, i.e.
# TfidfVectorizer(stop_words="english"); this call uses the defaults.
vectorizer = TfidfVectorizer()
vectorizer.fit_transform(word_data)
# newer scikit-learn versions replace get_feature_names() with get_feature_names_out()
feature_names = vectorizer.get_feature_names()
print(len(feature_names))
print(feature_names[34597])
