
Support Vector Machines (SVM): Using sklearn + Python

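Code

This example mainly demonstrates the use of three different kernel functions: the linear kernel, the Gaussian kernel, and the polynomial kernel.

The data is generated automatically with make_blobs.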

from sklearn import svm
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

# Generate test data
X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)

# Construct the SVM classifier instances
clf_linear = svm.SVC(C=1.0, kernel='linear')
clf_poly = svm.SVC(C=1.0, kernel='poly', degree=3)
clf_rbf = svm.SVC(C=1.0, kernel='rbf', gamma=0.5)
clf_rbf2 = svm.SVC(C=1.0, kernel='rbf', gamma=0.1)

plt.figure(figsize=(10, 10), dpi=144)

clfs = [clf_linear, clf_poly, clf_rbf, clf_rbf2]
titles = ['Linear Kernel',
          'Polynomial Kernel with Degree=3',
          'Gaussian Kernel with gamma=0.5',
          'Gaussian Kernel with gamma=0.1']

# train and predict
for clf, i in zip(clfs, range(len(clfs))):
    clf.fit(X, y)
    print("{}'s score:{}".format(titles[i], clf.score(X, y)))
    out = clf.predict(X)
    print("out's shape:{}, out:{}".format(out.shape, out))
    # plt.subplot(2, 2, i+1)
    # plot_hyperplane(clf, X, y, title=titles[i])

# Reference pages: http://scikit-learn.org/stable/modules/model_persistence.html
# http://sofasofa.io/forum_main_post.php?postid=1001002
# save trained model to disk-file
for clf, i in zip(clfs, range(len(clfs))):
    joblib.dump(clf, str(i)+'.pkl')

# load model from file and test
for i in range(len(clfs)):
    clf = joblib.load(str(i)+'.pkl')
    print("{}'s score:{}".format(titles[i], clf.score(X, y)))

As a support vector classification algorithm, SVC has three commonly used methods:
- fit: train the model
- predict: predict labels
- score: evaluate accuracy
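To make the relationship between the three methods concrete, here is a minimal sketch. It reuses the make_blobs settings from the example above and assumes that score on a classifier is the mean accuracy, which it checks against accuracy_score from sklearn.metrics. The variable names y_hat and manual_accuracy are only illustrative.

```python
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Same synthetic data settings as in the example above
X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)

clf = svm.SVC(C=1.0, kernel='linear')
clf.fit(X, y)            # fit: train on (X, y)
y_hat = clf.predict(X)   # predict: one label per sample

# score: fraction of predictions that equal the true labels
manual_accuracy = accuracy_score(y, y_hat)
print(clf.score(X, y), manual_accuracy)
```

Both printed numbers should agree, since score calls predict internally and compares the result with y.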

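sklearn models can be saved and loaded with joblib, which works directly with files.
Another option is the pickle module, which can serialize a model into an in-memory variable. For loading and saving models, see the official documentation at http://scikit-learn.org/stable/modules/model_persistence.html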
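As a rough sketch of the in-memory route, pickle.dumps turns a fitted model into a bytes object and pickle.loads rebuilds it. The names blob and restored are illustrative, and the data settings are copied from the example above. Only unpickle data that comes from a source you trust.

```python
import pickle
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)
clf = svm.SVC(C=1.0, kernel='rbf', gamma=0.5).fit(X, y)

blob = pickle.dumps(clf)        # serialize to a bytes object held in memory
restored = pickle.loads(blob)   # rebuild the estimator from those bytes

# The restored model should predict exactly like the original one
same = np.array_equal(clf.predict(X), restored.predict(X))
print(type(blob).__name__, same)
```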

The output of the code above is:

Linear Kernel's score:0.98
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Polynomial Kernel with Degree=3's score:0.95
out's shape:(100,), out:[1 0 1 0 2 0 2 2 1 0 0 0 1 0 2 1 2 2 2 2 2 2 2 2 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Gaussian Kernel with gamma=0.5's score:0.98
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 0 0 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]
Gaussian Kernel with gamma=0.1's score:0.96
out's shape:(100,), out:[1 0 1 0 0 0 2 2 1 0 0 0 1 0 2 1 2 0 2 2 2 2 2 0 1 1 1 1 2 2 0 1 1 0 2 0 0
 1 1 2 2 1 1 0 0 0 1 1 2 2 0 1 0 1 2 2 1 1 0 1 1 2 2 2 2 1 0 2 1 0 2 1 2 1
 1 0 0 0 2 1 0 0 1 0 1 0 0 0 1 0 1 1 2 2 2 2 0 0 2 2]

Linear Kernel's score:0.98
Polynomial Kernel with Degree=3's score:0.95
Gaussian Kernel with gamma=0.5's score:0.98
Gaussian Kernel with gamma=0.1's score:0.96

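Model search

GridSearchCV searches a model's hyperparameter space and selects the best hyperparameters.

The search can cover several hyperparameters at once or just a single one.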
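For illustration only, a two-dimensional search might look like the sketch below. The grid values are hypothetical (log-spaced and strictly positive), and the dataset is a smaller make_blobs sample chosen to keep the run short, so treat this as an outline rather than a tuned setup.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV

X, y = make_blobs(n_samples=200, centers=2, random_state=0, cluster_std=0.8)

# Hypothetical grid: values chosen only for illustration
param_grid = {
    'C': np.logspace(-1, 1, 3),      # 0.1, 1, 10
    'gamma': np.logspace(-2, 0, 3),  # 0.01, 0.1, 1
}

search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

# Every (C, gamma) pair is one candidate: 3 * 3 = 9 in total
print(len(search.cv_results_['params']))
print(search.best_params_, search.best_score_)
```

A single-dimension search is the same call with only one key in param_grid.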

In my own tests, however, searching over C ran into problems, so I searched only gamma. The code below also draws on code by 曹永昌.

from sklearn import svm
from sklearn.datasets import make_blobs
from matplotlib import pyplot as plt
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

X, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.8)

X_train = X[:350]
y_train = y[:350]
X_test = X[350:]
y_test = y[350:]

thresholds = np.linspace(0, 0.001, 100)
C_nums = np.linspace(0.1, 0.02, 5)
#param_grid = {'gamma': thresholds, 'C':C_nums}
param_grid = {'gamma': thresholds}
#param_grid = {'C':C_nums}

clf = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
clf.fit(X_train, y_train)
print("best param: {0}\nbest score: {1}".format(clf.best_params_,
                                                clf.best_score_))
y_pred = clf.predict(X_test)
print("y_pred:{}".format(y_pred))
print("y_test:{}".format(y_test))

# The metric functions take the true labels first: (y_true, y_pred)
print("Precision:", metrics.precision_score(y_test, y_pred))
print("Recall:", metrics.recall_score(y_test, y_pred))
print("F1:", metrics.f1_score(y_test, y_pred))

The output is:

best param: {'gamma': 0.00047474747474747476}
best score: 0.9857142857142858
y_pred:[0 1 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0
 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0
 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 1
 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0
 1 0]
y_test:[0 1 1 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0
 0 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0
 1 0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1
 0 0 1 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 0 1 1 0 1 0
 1 0]
Precision: 1.0
Recall: 0.9642857142857143
F1: 0.9818181818181818

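As shown, the best gamma found is 0.00047474747474747476, with a corresponding best score of 0.9857142857142858.

The same work could be done with a hand-written loop, though that is somewhat more tedious.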
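A hand-written version of the search could look roughly like this. It is a sketch under some assumptions: the candidate gamma values are hypothetical, cross_val_score with cv=5 stands in for the cross-validation that GridSearchCV performs, and the names best_gamma and best_score are illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score

X, y = make_blobs(n_samples=500, centers=2, random_state=0, cluster_std=0.8)
X_train, y_train = X[:350], y[:350]

# Hypothetical candidate values, chosen only for illustration
gammas = np.logspace(-3, 0, 4)

best_gamma, best_score = None, -np.inf
for gamma in gammas:
    clf = svm.SVC(kernel='rbf', gamma=gamma)
    # Mean accuracy over 5 cross-validation folds
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    if score > best_score:
        best_gamma, best_score = gamma, score

print("best gamma: {0}\nbest score: {1}".format(best_gamma, best_score))
```

Unlike this loop, GridSearchCV also refits the best model on the whole training set by default, which is why it can be used for predict right away.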

API overview

A brief introduction to the interfaces used in the code above.

make_blobs

Generates Gaussian-distributed cluster samples. See the official documentation.

Parameters:

  • n_samples : int, optional (default=100). Number of sample points.

The total number of points equally divided among clusters.

  • n_features : int, optional (default=2). Dimensionality of each sample.

The number of features for each sample.

  • centers : int or array of shape [n_centers, n_features], optional (default=3). Number of clusters.

The number of centers to generate, or the fixed center locations.

  • cluster_std : float or sequence of floats, optional (default=1.0). Standard deviation of the samples.

The standard deviation of the clusters.

  • center_box : pair of floats (min, max), optional (default=(-10.0, 10.0))

The bounding box for each cluster center when centers are generated at random.

  • shuffle : boolean, optional (default=True)

Shuffle the samples.

  • random_state : int, RandomState instance or None, optional (default=None)

If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

Returns:

  • X : array of shape [n_samples, n_features]

The generated samples.

  • y : array of shape [n_samples]

The integer labels for cluster membership of each sample.
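The parameters above can be combined as in the following sketch, which passes fixed center locations and one standard deviation per cluster. The center coordinates and standard deviations are made-up values for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Fixed center locations: array of shape [n_centers, n_features]
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

X, y = make_blobs(
    n_samples=90,
    centers=centers,
    cluster_std=[0.5, 1.0, 1.5],  # one standard deviation per cluster
    random_state=0,
)

print(X.shape, y.shape)  # (90, 2) (90,)
print(np.bincount(y))    # [30 30 30], samples split equally among clusters
```

When centers is given as an array, the number of features is taken from its second dimension, so n_features does not need to be set.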

svm.SVC

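This interface creates an SVM instance.

The parameter descriptions below are excerpted from an external source.

C: the penalty parameter C of C-SVC. Default is 1.0.

A larger C penalizes the slack variables more heavily, pushing them toward 0. In other words, the penalty for misclassification grows and the model tends toward classifying every training sample correctly, so accuracy on the training set is high but generalization tends to be weaker. A smaller C reduces the penalty for misclassification and tolerates errors by treating them as noise points, so generalization tends to be stronger.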
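One way to observe the effect of C is to count support vectors, as in the sketch below. It assumes two overlapping make_blobs clusters and a linear kernel, and the two C values are arbitrary. With a small C the margin is typically wide and many points become support vectors; with a large C fewer points usually remain on or inside the margin.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Two overlapping clusters, so some points cannot be separated cleanly
X, y = make_blobs(n_samples=200, centers=2, random_state=0, cluster_std=1.0)

n_support = {}
for C in (0.01, 100.0):
    clf = svm.SVC(C=C, kernel='linear').fit(X, y)
    # n_support_ holds the number of support vectors for each class
    n_support[C] = int(clf.n_support_.sum())
    print("C={}: support vectors={}, training accuracy={}".format(
        C, n_support[C], clf.score(X, y)))
```

The exact numbers depend on the data, so this is a tendency to look for rather than a guarantee.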

kernel: the kernel function. Default is 'rbf'; options include 'linear', 'poly', and 'rbf'.

  • linear - linear kernel: u'v
  • poly - polynomial kernel: (gamma*u'*v + coef0)^degree
  • rbf - RBF (Gaussian) kernel: exp(-gamma*|u-v|^2)
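The three formulas can be checked numerically. The sketch below computes each kernel matrix by hand with NumPy and compares it with the helpers in sklearn.metrics.pairwise. The sample matrices U and V and the values of gamma, coef0, and degree are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import (
    linear_kernel, polynomial_kernel, rbf_kernel)

rng = np.random.RandomState(0)
U = rng.rand(4, 2)   # 4 samples u, 2 features each
V = rng.rand(3, 2)   # 3 samples v, 2 features each
gamma, coef0, degree = 0.5, 1.0, 3   # illustrative values

# linear: u'v
K_linear = U @ V.T
# poly: (gamma * u'v + coef0) ^ degree
K_poly = (gamma * (U @ V.T) + coef0) ** degree
# rbf: exp(-gamma * |u - v|^2), squared distances via broadcasting
sq_dist = ((U[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
K_rbf = np.exp(-gamma * sq_dist)

print(np.allclose(K_linear, linear_kernel(U, V)))
print(np.allclose(K_poly, polynomial_kernel(
    U, V, degree=degree, gamma=gamma, coef0=coef0)))
print(np.allclose(K_rbf, rbf_kernel(U, V, gamma=gamma)))
```

Each kernel matrix has one row per sample in U and one column per sample in V.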

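degree: the degree of the polynomial kernel 'poly'. Default is 3; ignored by the other kernels.

gamma: the kernel coefficient for 'rbf', 'poly', and 'sigmoid'. With 'auto', 1/n_features is used. 'auto' was the default in older releases; since scikit-learn 0.22 the default is 'scale', which uses 1 / (n_features * X.var()).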
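The two string settings of gamma can be compared against explicit numbers. This sketch assumes scikit-learn 0.22 or later, where 'scale' means 1 / (n_features * X.var()), and uses a small helper named decision that is only illustrative.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.8)
n_features = X.shape[1]

gamma_auto = 1.0 / n_features                # what 'auto' stands for
gamma_scale = 1.0 / (n_features * X.var())   # what 'scale' stands for

def decision(gamma):
    """Fit an RBF SVC with the given gamma and return its decision values."""
    return svm.SVC(kernel='rbf', gamma=gamma).fit(X, y).decision_function(X)

same_auto = np.allclose(decision('auto'), decision(gamma_auto))
same_scale = np.allclose(decision('scale'), decision(gamma_scale))
print(gamma_auto, gamma_scale, same_auto, same_scale)
```

If both comparisons print True, the string settings and the explicit values lead to the same fitted model on this data.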

coef0: the constant term of the kernel function. Only relevant for 'poly' and 'sigmoid'.

probability: whether to enable probability estimates. Default is False.
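As a small sketch of what probability=True enables, the fitted model gains predict_proba, which returns one probability per class for each sample. The kernel settings are copied from the earlier example, and random_state=0 is an arbitrary choice to make the run repeatable.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=3, random_state=0, cluster_std=0.8)

# probability=True makes training slower but enables predict_proba
clf = svm.SVC(kernel='rbf', gamma=0.5, probability=True, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])  # one row per sample, one column per class
print(proba.shape)                # (5, 3)
print(proba.sum(axis=1))          # each row sums to 1
```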

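shrinking: whether to use the shrinking heuristic. Default is True.

tol: the tolerance for the stopping criterion. Default is 1e-3.

cache_size: the size of the kernel cache in MB. Default is 200.

class_weight: class weights, passed as a dictionary. Sets the parameter C of class i to weight * C (the C of C-SVC).

verbose: whether to enable verbose output.

max_iter: the maximum number of iterations; -1 means no limit.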

decision_function_shape: 'ovo', 'ovr' or None; default=None in the version described here. Recent scikit-learn releases default to 'ovr'.
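The difference between 'ovo' and 'ovr' shows up in the shape of decision_function. The sketch below uses four classes, because with three classes both shapes happen to have three columns. The dataset settings are arbitrary.

```python
from sklearn import svm
from sklearn.datasets import make_blobs

# Four classes, so one-vs-one has 4 * 3 / 2 = 6 pairwise classifiers
X, y = make_blobs(n_samples=120, centers=4, random_state=0, cluster_std=0.8)

shapes = {}
for shape in ('ovo', 'ovr'):
    clf = svm.SVC(kernel='linear', decision_function_shape=shape).fit(X, y)
    shapes[shape] = clf.decision_function(X).shape
    print(shape, shapes[shape])   # ovo (120, 6), then ovr (120, 4)
```

Internally SVC always trains one-vs-one classifiers; 'ovr' only changes how the decision values are reported.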

random_state: the seed (an int) for the pseudo-random shuffling of the data, which is used for probability estimates.