Udacity Intro to Machine Learning: Feature Selection

阿新 • Published: 2019-01-15

Exercise: A New Enron Feature

poi_flag_email.py

    if from_emails:
        ctr=0
        while not from_poi and ctr < len(from_emails):
            if from_emails[ctr] in poi_email_list:
                from_poi = True
            ctr += 1
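
The loop above stops at the first sender address that appears in the POI list. A minimal equivalent sketch follows, assuming `from_emails` is a list of address strings and `poi_email_list` is a list of known POI addresses. The function name `sent_by_poi` and the sample addresses are made up for illustration and are not part of the course code.

```python
def sent_by_poi(from_emails, poi_email_list):
    """Return True if any sender address belongs to a known POI."""
    poi_set = set(poi_email_list)  # set lookup instead of scanning a list
    # any() short-circuits on the first match, like the while loop above
    return any(address in poi_set for address in from_emails)


# Hypothetical addresses, for illustration only
poi_email_list = ["poi.one@example.com", "poi.two@example.com"]
print(sent_by_poi(["someone@example.com", "poi.two@example.com"], poi_email_list))
print(sent_by_poi([], poi_email_list))
```

An empty `from_emails` list yields False, which matches the `if from_emails:` guard in the original snippet.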

Exercise: Visualizing the New Feature

studentCode.py

    ### you fill in this code, so that it returns either
    ###     the fraction of all messages to this person that come from POIs
    ###     or
    ###     the fraction of all messages from this person that are sent to POIs
    ### the same code can be used to compute either quantity

    ### beware of "NaN" when there is no known email address (and so
    ### no filled email features), and integer division!
    ### in case of poi_messages or all_messages having "NaN" value, return 0.
    if poi_messages != 'NaN' and all_messages != 'NaN':
        fraction = float(poi_messages) / all_messages
    else:
        fraction = 0.

    return fraction
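
As a self-contained sketch of the same idea, the function below wraps the logic and adds one extra guard that the snippet above does not have: a zero message count, which would otherwise raise a division error. The name `compute_fraction` and the sample numbers are assumptions for illustration.

```python
def compute_fraction(poi_messages, all_messages):
    """Fraction of messages involving a POI; 0.0 when data is missing."""
    if poi_messages == "NaN" or all_messages == "NaN":
        return 0.0
    if all_messages == 0:  # extra guard, not in the original snippet
        return 0.0
    # float() avoids integer division under Python 2
    return float(poi_messages) / all_messages


print(compute_fraction(10, 40))
print(compute_fraction("NaN", 40))
print(compute_fraction(3, 0))
```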

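Beware of feature bugs:

Anyone can make mistakes, so be skeptical of the results you get! You should always be wary of 100% accuracy. Extraordinary claims require extraordinary evidence. If a feature tracks your labels too closely, it is very likely a bug! If you are sure it is not a bug, then you largely do not need machine learning at all: you can simply use that feature to assign labels.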

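Removing features:

When would you ignore a feature:

Features ≠ information. A feature is the actual number or characteristic of a particular data point that attempts to capture information.

For example: if you have a large number of features, you may have a lot of data, but it is the quality of those features that determines the information content. What we want is as much information as possible from as few features as possible. If you believe a feature gives you no information, you should remove it.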

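sklearn provides several helper methods for automatic feature selection. Most of them fall under univariate feature selection, which treats each feature independently and asks how much power it has for classification or regression.

sklearn has two main univariate feature selection tools: SelectPercentile and SelectKBest. The difference is evident from the names: SelectPercentile selects the X% of features that are most powerful (where X is a parameter), while SelectKBest selects the K most powerful features (where K is a parameter).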
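
A minimal sketch of both selectors on synthetic data, assuming a reasonably recent scikit-learn. The data, the choice of `f_classif` as the scoring function, and the values `k=2` and `percentile=20` are illustrative assumptions, not settings from the course.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

rng = np.random.RandomState(42)
n_samples = 200
labels = rng.randint(0, 2, size=n_samples)

# 10 synthetic features: columns 0 and 1 track the label, the rest are noise
features = rng.normal(size=(n_samples, 10))
features[:, 0] += 3.0 * labels
features[:, 1] -= 3.0 * labels

# Keep the 2 highest-scoring features
k_best = SelectKBest(score_func=f_classif, k=2).fit(features, labels)
# Keep the top 20% of features (2 out of 10 here)
top_percent = SelectPercentile(score_func=f_classif, percentile=20).fit(features, labels)

kbest_idx = k_best.get_support(indices=True)
pct_idx = top_percent.get_support(indices=True)
print("SelectKBest kept columns:", kbest_idx)
print("SelectPercentile kept columns:", pct_idx)
```

Both selectors score each column on its own, which is what makes them univariate: neither looks at how features interact.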

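Classic high-bias case: using too few features leads to high bias

Classic high-variance case: too many features, parameters tuned too carefully

Balance point: fit an algorithm with only a few features while, in regression terms, still getting a large R-squared or a low residual sum of squares

Too many features cause high variance and poor generalization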

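A regularized regression: Lasso regression

Ordinary linear regression minimizes the squared error of the fit (that is, the distance, or squared distance, between the fit and any given data point). Lasso regression also minimizes the squared error, but in addition it tries to minimize the number of features used. Its objective adds a penalty term to the squared error, where λ (lambda) is the penalty parameter and β is the vector of regression coefficients, which reflects how many features are being used. The idea behind the formula: using more features gives a smaller squared error and fits the points more precisely, but it incurs an extra penalty, so the gain from an additional feature has to outweigh the penalty it brings. The formula therefore sets the trade-off between a smaller error and a simpler fit that uses fewer features.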
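
For reference, the Lasso objective is usually written as below. This is a sketch in standard notation: the text names only λ and β, so the symbols y_i (label of point i), x_i (its feature vector), n (number of points) and p (number of features) are assumptions added here.

```latex
\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 \;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The first term is the squared error; the second is the penalty. A larger λ pushes more coefficients β_j to exactly zero, which is how the fit ends up using fewer features.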

Lasso regression exercise

.coef_ holds the coefficients (print it to see them)

.predict([[2,4]]) makes a prediction

.fit(features, labels) fits the model
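
The three calls fit together roughly as follows. This is a minimal sketch on made-up toy data, where the label depends only on the first feature; the data and `alpha=0.1` are assumptions for illustration.

```python
from sklearn.linear_model import Lasso

# Toy data: label = 2 * (first feature); the second feature is an
# unrelated distractor that carries no information about the label
features = [[0, 1], [1, 0], [2, 0], [3, 0], [4, 0], [5, 1]]
labels = [0, 2, 4, 6, 8, 10]

regression = Lasso(alpha=0.1)  # alpha plays the role of the penalty parameter
regression.fit(features, labels)

# The coefficient of the uninformative feature is driven to zero
print("coefficients:", regression.coef_)

prediction = regression.predict([[2, 4]])
print("prediction for [2, 4]:", prediction)
```

A coefficient of zero means Lasso has effectively discarded that feature, which is why it doubles as a feature selection method.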

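Feature selection mini-project:

Decision trees, as a traditional algorithm, overfit very easily. One of the simplest ways to get an overfit decision tree is to use a small training set and a large number of features.

1. If the decision tree is overfit, would you expect the accuracy on the test set to be very high or fairly low? Low

2. If the decision tree is overfit, would you expect the accuracy on the training set to be high or low? High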
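
The two answers above can be seen on synthetic data. The sketch below is not the Enron data: it assumes pure-noise features and random labels, so there is nothing real to learn and any training-set skill is memorization. The sizes (150 points, 2000 features) are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Few training points, many features, labels unrelated to the features
features_train = rng.normal(size=(150, 2000))
labels_train = rng.randint(0, 2, size=150)
features_test = rng.normal(size=(500, 2000))
labels_test = rng.randint(0, 2, size=500)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(features_train, labels_train)

# An unrestricted tree memorizes the training set but cannot generalize
train_acc = clf.score(features_train, labels_train)
test_acc = clf.score(features_test, labels_test)
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```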

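A classic way to overfit an algorithm is to use a large number of features and a small amount of training data. You can find the starter code in feature_selection/find_signature.py. Set up a decision tree, train it on the training data, and print the accuracy.

According to the starter code, how many training points are there? 150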

### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]

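What is the accuracy of the decision tree you just built? 0.950511945392

(Remember, we set this decision tree up to overfit. Ideally, we would like to see a relatively low test accuracy.)

Take the (overfit) decision tree and use its feature_importances_ attribute to get a list of the relative importance of all the features being used (the list will be long, since this is text data). We suggest iterating through this list and printing a feature's importance only if it is above some threshold (say, 0.2; remember, if all words were equally important, each one would have an importance far below 0.01).

What is the importance of the most important feature? 0.764705882353 What is the number of this feature? 36584

(My Enron dataset from the text learning mini-project may differ, and I did not get the expected answer there, so the feature number given here is only my own answer.)

To figure out which word is causing the problem, go back to the TfIdf and use the feature number you obtained in the previous part of the mini-project to look up the associated word. You can call get_feature_names() on the TfIdf to get a list of all the words; pull out the word that is causing most of the discrimination in the decision tree.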
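
Mapping a feature number back to a word can be sketched as follows. The corpus and the index `feature_number = 2` are made up for illustration. Newer scikit-learn releases provide `get_feature_names_out()` in place of `get_feature_names()`, so the sketch checks for both.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the email text
documents = [
    "please review the attached contract",
    "gas prices moved again today",
    "contract review scheduled for today",
]

vectorizer = TfidfVectorizer(stop_words="english")
vectorizer.fit(documents)

# Newer scikit-learn releases expose get_feature_names_out();
# older ones only have get_feature_names()
if hasattr(vectorizer, "get_feature_names_out"):
    words = list(vectorizer.get_feature_names_out())
else:
    words = vectorizer.get_feature_names()

feature_number = 2  # hypothetical index reported by the importance loop
print(feature_number, words[feature_number])
```

The position of a word in this list is the same as its column index in the TfIdf matrix, which is why the index from feature_importances_ can be used directly.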

What is this word? Does it make sense as a word that is uniquely tied to either Chris Germany or Sara Shackleton, like a signature?

sshacklensf

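This word looks like an outlier in a certain sense, so let's remove it and refit. Go back to text_learning/vectorize_text.py and remove this word from the emails, using the same method we used to remove "sara", "chris", and so on. Rerun vectorize_text.py, and as soon as it finishes, rerun find_signature.py.

Do any other outliers pop up? What word is it? Does it look like a signature-type word? (As before, define an outlier as a feature with importance greater than 0.2.) cgermannsf

Update vectorize_text.py once more and rerun it. Then run find_signature.py again.

Do any other important features (importance greater than 0.2) show up? How many? Do they look like "signature words", or more like "email content words" from the body of the email?

Yes, one more important word shows up.

What is the accuracy of the decision tree now? 0.811149032992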

find_signature.py

#!/usr/bin/python

import pickle
import numpy
numpy.random.seed(42)


### The words (features) and authors (labels), already largely processed.
### These files should have been created from the previous (Lesson 10)
### mini-project.
words_file = "../text_learning/your_word_data.pkl" 
authors_file = "../text_learning/your_email_authors.pkl"
word_data = pickle.load( open(words_file, "r"))
authors = pickle.load( open(authors_file, "r") )



### test_size is the percentage of events assigned to the test set (the
### remainder go into training)
### feature matrices changed to dense representations for compatibility with
### classifier functions in versions 0.15.2 and earlier
from sklearn import cross_validation
features_train, features_test, labels_train, labels_test = cross_validation.train_test_split(word_data, authors, test_size=0.1, random_state=42)

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
features_train = vectorizer.fit_transform(features_train)
features_test  = vectorizer.transform(features_test).toarray()

words = vectorizer.get_feature_names()
### a classic way to overfit is to use a small number
### of data points and a large number of features;
### train on only 150 events to put ourselves in this regime
features_train = features_train[:150].toarray()
labels_train   = labels_train[:150]



### your code goes here
from sklearn import tree
from sklearn.metrics import accuracy_score

clf = tree.DecisionTreeClassifier()
clf.fit(features_train,labels_train)
#accuracy_score method 1
acc = clf.score(features_test,labels_test)
print acc
#accuracy_score method 2
pred = clf.predict(features_test)
print "Accuracy:", accuracy_score(labels_test, pred)


importances = clf.feature_importances_

print "Important features:"
for index, importance in enumerate(importances):
    if importance > 0.2:
        print "feature no", index
        print "importance", importance
        print "word", words[index]

vectorize_text.py

        stopwords = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]
        for word in stopwords:
            words = words.replace(word, ' ')
        words = ' '.join(words.split())
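
Note that `str.replace` removes substrings, so "chris" would also be cut out of a word such as "christmas". A possible variant that removes whole words only is sketched below. It is not the course's method; the helper name `strip_signature_words` and the sample sentence are assumptions for illustration.

```python
import re

signature_words = ["sara", "shackleton", "chris", "germani", "sshacklensf", "cgermannsf"]

# \b anchors match whole words only, so "chris" is removed but "christmas" is kept
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, signature_words)) + r")\b")


def strip_signature_words(text):
    """Remove whole-word occurrences of the signature words and tidy whitespace."""
    return " ".join(pattern.sub(" ", text).split())


cleaned = strip_signature_words("chris sent the christmas schedul sshacklensf")
print(cleaned)
```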
