Machine Learning Notes: A First Look at sklearn (Part 1)

阿新 • Published: 2019-02-09

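The following is a mini-project from Udacity's Intro to Machine Learning course. Given a collection of emails written by Sara (label 0) and Chris (label 1), we split the data into training and test sets, then train and evaluate several of the models built into sklearn.

Preprocessing

Dependencies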

import sys
import time

import nltk
import numpy
import scipy
import sklearn

# the search path must be extended before importing modules that live in ../tools/
sys.path.append("../tools/")
from email_preprocess import preprocess

Data processing

email_preprocess.py:

import pickle
import _pickle as cPickle
import numpy
from sklearn import model_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

def preprocess(words_file = "../tools/word_data.pkl", authors_file="../tools/email_authors.pkl"):
    """ 
        this function takes a pre-made list of email texts and
        the corresponding authors and performs
        a number of preprocessing steps:
            -- splits into training/testing sets (10% testing)
            -- vectorizes into tfidf matrix
            -- selects/keeps most helpful features

        after this, the features and labels are put into numpy arrays
        which play nice with sklearn functions

        4 objects are returned:
            -- training/testing features
            -- training/testing labels
    """ 


    ### the words (features) and authors (labels), already largely preprocessed
    authors_file_handler = open(authors_file, "rb")
    authors = pickle.load(authors_file_handler)
    authors_file_handler.close()

    words_file_handler = open(words_file, "rb")
    word_data = cPickle.load(words_file_handler)
    words_file_handler.close()

    ### split the dataset into training and testing sets; test_size is the fraction of samples assigned to the test set

    features_train, features_test, labels_train, labels_test = \
        model_selection.train_test_split(word_data, authors, test_size=0.1, random_state=42)

    ### text vectorization--go from strings to lists of numbers
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
    features_train_transformed = vectorizer.fit_transform(features_train)
    features_test_transformed  = vectorizer.transform(features_test)

    ### feature selection: text produces a very large number of features, so keep only a fraction of them
    selector = SelectPercentile(f_classif, percentile=10)      # percentage of features to keep
    selector.fit(features_train_transformed, labels_train)
    features_train_transformed = selector.transform(features_train_transformed).toarray()
    features_test_transformed  = selector.transform(features_test_transformed).toarray()

    print("no. of Chris training emails:", sum(labels_train))
    print("no. of Sara training emails:", len(labels_train)-sum(labels_train))

    return features_train_transformed, features_test_transformed, labels_train, labels_test
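
To illustrate why the vectorizer is fitted on the training texts only, here is a minimal sketch on a hypothetical toy corpus (the texts below are made up for illustration and do not come from the email dataset). The vocabulary is learned from the training split, so the test matrix gets the same columns and any word never seen during training is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the emails (0 = Sara, 1 = Chris).
texts = [
    "legal review of the contract",
    "contract draft attached for review",
    "gas trading numbers for today",
    "update on gas prices and trading",
    "legal team meeting schedule",
    "trading desk meeting about prices",
]
labels = [0, 0, 1, 1, 0, 1]

# Hold out 2 of the 6 toy samples; the article itself uses test_size=0.1.
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=2, random_state=42
)

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
train_tfidf = vectorizer.fit_transform(texts_train)  # learns vocabulary and idf weights
test_tfidf = vectorizer.transform(texts_test)        # reuses them, learns nothing new

print("training matrix:", train_tfidf.shape)
print("testing matrix: ", test_tfidf.shape)
```

Both matrices have one column per word in the training vocabulary, which is what lets a classifier trained on one be scored on the other.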

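Naive Bayes

Introduction

Bayesian probability is assumed to be familiar and is not covered here.

The official sklearn documentation gives the following example code for the Gaussian model of naive Bayes: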

import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
clf.fit(X, Y)
print(clf.predict([[-0.8, -1]]))
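
As a small extension of the documentation example, the sketch below inspects what the fitted model has learned. It only uses attributes that GaussianNB exposes after fitting (classes_, class_prior_, theta_) plus predict_proba; the query point is the same one as above.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
clf = GaussianNB().fit(X, Y)

print(clf.classes_)       # class labels, in the order used by the arrays below
print(clf.class_prior_)   # fraction of training samples in each class
print(clf.theta_)         # per-class mean of every feature
print(clf.predict_proba([[-0.8, -1]]))  # posterior probability of each class
```

The query point lies close to the mean of class 1, which is why predict returns 1 for it.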

Dependencies

from sklearn.naive_bayes import GaussianNB

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf=GaussianNB()

tic=time.time()
clf.fit(features_train,labels_train)
toc=time.time()
print("training time:{}s.".format(round(toc-tic,3)))

accuracy=clf.score(features_test,labels_test)    # .score() returns the mean accuracy on the given test data
print("accuracy:{}".format(accuracy))

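Support Vector Machines

Udacity's treatment of SVMs turned out to be quite accessible: the mapping from a low-dimensional space to a high-dimensional one is achieved by adding combined features. I could not follow this in the book 《機器學習》 (Machine Learning), but it makes sense now.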
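
The idea of adding a combined feature can be sketched on synthetic data. The example below is an illustration under assumed data (two concentric circles, not the email dataset): a linear SVM cannot separate the circles in the original two dimensions, but it can once the combined feature x^2 + y^2 is appended as a third dimension.

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric circles: class 0 at radius 1, class 1 at radius 3.
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
inner = np.column_stack([np.cos(angles), np.sin(angles)])
outer = 3 * inner
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# A straight line cannot separate the circles in the original 2-D space.
acc_raw = SVC(kernel="linear").fit(X, y).score(X, y)

# Append the combined feature x^2 + y^2 (squared distance from the origin).
X_mapped = np.column_stack([X, (X ** 2).sum(axis=1)])
acc_mapped = SVC(kernel="linear").fit(X_mapped, y).score(X_mapped, y)

print("linear SVM, original features:", acc_raw)
print("linear SVM, with x^2 + y^2:   ", acc_mapped)
```

In the mapped space the two classes sit at heights 1 and 9 along the new axis, so a flat plane between them separates every sample.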

Introduction

The official sklearn documentation gives a simple code example for the SVM classifier:

from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]
clf = svm.SVC()
clf.fit(X, y)

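The SVC object has several important parameters, shown here with the defaults at the time of writing: C=1.0, gamma='auto', kernel='rbf'. (Since scikit-learn 0.22 the default is gamma='scale'.)

kernel specifies the kernel function. The kernels built into sklearn are 'linear', 'poly', 'rbf', 'sigmoid' and 'precomputed'; a user-defined callable is also accepted. The parameters C and gamma have a large effect on an SVM with the 'rbf' kernel.
C controls the trade-off between classifying every training sample correctly and keeping the decision boundary simple. A higher C makes the decision boundary more complex and can lead to overfitting, while a lower C gives a smoother boundary.
gamma determines how far the influence of a single sample reaches when the decision boundary is drawn: a high gamma means a short reach, a low gamma a long reach. It can be seen as the inverse of a sample's radius of influence. gamma has no effect on the linear kernel.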
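
The effect of gamma described above can be checked with a rough sketch on synthetic data (make_moons is used purely for illustration, and the three gamma values are arbitrary choices). One would expect the largest gamma to fit the training set almost perfectly while generalising less well, since each sample then influences only its immediate neighbourhood.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data, used here only for illustration.
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for gamma in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    # store (training accuracy, test accuracy) for this gamma
    scores[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print("gamma={:<7} train acc={:.3f}  test acc={:.3f}".format(gamma, *scores[gamma]))
```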

Dependencies

from sklearn import svm

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf=svm.SVC(kernel='linear')

# discard most of the training data (keep 1%) to speed up fitting
features_train=features_train[:int(len(features_train)/100)]
labels_train=labels_train[:int(len(labels_train)/100)]

tic=time.time()
clf.fit(features_train,labels_train)
toc=time.time()
print("train time:{}".format(round(toc-tic,3)))

acc=clf.score(features_test,labels_test)
print(acc)

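Note that only 1% of the training data is used for fitting, because on text data an SVM fits much more slowly than naive Bayes. In practice, the linear SVM took about 2 min to fit on the full training set and reached 98% accuracy, whereas on 1% of the training set it fit in under 0.1 s and reached 88% accuracy.

Tuning parameter C of the rbf kernel

Next, the kernel is switched to rbf to look at tuning the parameter C. The program sets four different values of C (1.0, 10.0, 1000.0 and 10000.0) and checks the accuracy of the model for each. To see the effect more quickly, only 1% of the training set is used: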

features_train, features_test, labels_train, labels_test = preprocess()

clf1=svm.SVC(kernel='rbf',C=1.)
clf2=svm.SVC(kernel='rbf',C=10.)
clf3=svm.SVC(kernel='rbf',C=1000.)
clf4=svm.SVC(kernel='rbf',C=10000.)

# discard most of the training data (keep 1%) to speed up fitting
features_train=features_train[:int(len(features_train)/100)]
labels_train=labels_train[:int(len(labels_train)/100)]

clf1.fit(features_train,labels_train)
clf2.fit(features_train,labels_train)
clf3.fit(features_train,labels_train)
clf4.fit(features_train,labels_train)

acc1=clf1.score(features_test,labels_test)
acc2=clf2.score(features_test,labels_test)
acc3=clf3.score(features_test,labels_test)
acc4=clf4.score(features_test,labels_test)
print(acc1,acc2,acc3,acc4)

Classifier no. 4 (C=10000) is selected and clf4 is retrained on the full training set. The final accuracy reached: [output not preserved]
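
The same comparison can be written more compactly as a loop. The sketch below runs on synthetic data from make_classification (an assumption for the sake of a self-contained example, not the email features), so the accuracies it prints say nothing about the email task; it only shows the pattern of trying each C and keeping the best one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the email features; not the article's dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

accuracies = {}
for C in (1.0, 10.0, 1000.0, 10000.0):
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    accuracies[C] = clf.score(X_test, y_test)
    print("C={:<8} accuracy={:.3f}".format(C, accuracies[C]))

# Pick the C with the highest accuracy on the held-out data.
best_C = max(accuracies, key=accuracies.get)
print("best C on this data:", best_C)
```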

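Decision Trees

Introduction

A decision tree is an algorithm that makes non-linear decisions out of linear splits: it partitions the data linearly with conditions at successive levels, and the resulting regions together form a non-linear decision.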

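Entropy

A key property in a decision tree is the entropy of a node, computed as:

$$\mathrm{Ent} = -\sum_i P_i \log_2 P_i$$
$P_i$ is the proportion of samples of class $i$ in the node. For example, if four samples A1, A2, B1, B2 fall into the same node, then $P_A$ and $P_B$ are both 0.5 and the node's entropy is 1, which means the node is as impure as it can be (an entropy of 0 means the node is pure).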
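
The entropy formula can be sketched as a small helper. The function name entropy and the label lists are illustrative choices, not part of the original project code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

print(entropy(["A", "A", "B", "B"]))  # evenly mixed node: 1.0
print(entropy(["A", "A", "A", "A"]))  # pure node: 0.0
```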

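Information gain

A decision tree has to split the data on different features, that is, split each node further. The best split is the one whose child nodes are as pure as possible, and information gain is the quantity that expresses this:

$$\mathrm{Gain}(\mathrm{parent}, \mathrm{feature}) = \mathrm{Ent}(\mathrm{parent}) - \sum_{\mathrm{child}} w_{\mathrm{child}}\,\mathrm{Ent}(\mathrm{child})$$
The second term is the weighted average entropy of all child nodes after the parent is split; the weight $w_{\mathrm{child}}$ is the ratio of the number of samples in a child node to the number of samples in the parent.

Suppose we have the following samples, with three binary features (grads, bumpiness and speed limit), a binary class speed, and 4 samples in total: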

grads	bumpiness	speed limit	speed
steep	bumpy	yes	slow
steep	smooth	yes	slow
flat	bumpy	no	fast
steep	smooth	no	fast

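Suppose the decision tree first splits the data on the grads feature. The left child (steep) then holds three samples (slow, slow, fast) and the right child (flat) holds one sample (fast).

The information gain of splitting on grads is computed as follows.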

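The entropy of the parent node is:

import math

parent_ent=-((2/4)*math.log(2/4,2)+(2/4)*math.log(2/4,2))

The right child clearly has an entropy of 0:

r_child_ent=0

The entropy of the left child is:

l_child_ent=-((2/3)*math.log(2/3,2)+(1/3)*math.log(1/3,2))

The weights of the child nodes are:

l_child_w=3/4
r_child_w=1/4

The information gain of the split is:

Gain=parent_ent-l_child_w*l_child_ent-r_child_w*r_child_ent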

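In this way, the information gain is computed for a split on each feature, and the feature with the largest gain is chosen. In this example the best feature to split on is speed limit, because both child nodes then have an entropy of 0 and the information gain is 1.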
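
The comparison across all three features can be reproduced with a short sketch. The helper names entropy and information_gain are illustrative; the samples are the four rows of the table above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, feature, target="speed"):
    """Entropy of the parent minus the weighted entropy of its children."""
    parent = [row[target] for row in rows]
    children = {}
    for row in rows:
        # group the target labels by the value of the splitting feature
        children.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(parent) - weighted

samples = [
    {"grads": "steep", "bumpiness": "bumpy",  "speed limit": "yes", "speed": "slow"},
    {"grads": "steep", "bumpiness": "smooth", "speed limit": "yes", "speed": "slow"},
    {"grads": "flat",  "bumpiness": "bumpy",  "speed limit": "no",  "speed": "fast"},
    {"grads": "steep", "bumpiness": "smooth", "speed limit": "no",  "speed": "fast"},
]

gains = {f: information_gain(samples, f) for f in ("grads", "bumpiness", "speed limit")}
for feature, gain in gains.items():
    print("{:<12} {:.4f}".format(feature, gain))
```

Splitting on grads gives a gain of about 0.3113, bumpiness gives 0, and speed limit gives 1, which matches the conclusion in the text.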

Example

The official sklearn documentation gives the following usage example for the decision tree classifier:

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

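The decision tree classifier has several important parameters:

criterion='gini': the criterion used to split a node; pass 'entropy' to split by information gain.
min_samples_split=2: the minimum number of samples a node must contain to be split; a node with at least this many samples may be split further. Too small a value leads to overfitting.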
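
A rough way to see the effect of min_samples_split is to compare tree sizes on the same data. The sketch below uses synthetic data from make_classification (an assumption for illustration) and reads the node count from the fitted tree's tree_.node_count attribute.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to compare tree sizes.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

node_counts = {}
for min_split in (2, 40):
    clf = DecisionTreeClassifier(
        criterion="entropy", min_samples_split=min_split, random_state=0
    )
    clf.fit(X, y)
    node_counts[min_split] = clf.tree_.node_count
    print("min_samples_split={:<3} nodes={:<4} train acc={:.3f}".format(
        min_split, clf.tree_.node_count, clf.score(X, y)))
```

With min_samples_split=2 the tree keeps splitting until its leaves are pure, so it grows larger and fits the training data more closely than the tree limited to nodes of at least 40 samples.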

Dependencies

from sklearn import tree

Usage

features_train, features_test, labels_train, labels_test = preprocess()

clf=tree.DecisionTreeClassifier(criterion='entropy',min_samples_split=40)

tic=time.time()
clf.fit(features_train,labels_train)
toc=time.time()
print("training time:{}".format(round(toc-tic,3)))

acc=clf.score(features_test,labels_test)
print(acc)

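The fitting time of a decision tree depends heavily on the number of features in the data, and the fitting time of the code above is clearly unacceptable, so the number of features has to be cut down. Modify the following line in email_preprocess.py to keep only the top 1% of features: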

selector = SelectPercentile(f_classif, percentile=1)

Then run the model again.
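
How much the percentile setting shrinks the feature matrix can be seen in a small sketch on random stand-in data (300 samples with 200 features, chosen arbitrarily; the real email features are far more numerous).

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_classif

# Random stand-in data: 300 samples, 200 features, binary labels.
rng = np.random.RandomState(0)
X = rng.rand(300, 200)
y = rng.randint(0, 2, size=300)

kept = {}
for percentile in (10, 1):
    selector = SelectPercentile(f_classif, percentile=percentile)
    # number of columns that survive the selection
    kept[percentile] = selector.fit_transform(X, y).shape[1]
    print("percentile={:<3} features kept: {}".format(percentile, kept[percentile]))
```

Going from percentile=10 to percentile=1 leaves roughly a tenth as many columns, which is what cuts the decision tree's fitting time.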
