[4] Machine Learning: Support Vector Machine (SVM)

阿新 • Published: 2018-11-11

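4.1 Experimental Data

The dataset used here is derived from the UCI Adult dataset through preprocessing. The original dataset can be downloaded from http://archive.ics.uci.edu/ml/datasets/Adult. This experiment uses the LIBSVM package to classify the data.

Each record in the original dataset has 14 features: age, workclass, fnlwgt (final weight), education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country. Six of them are continuous (age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week); the other eight are discrete. The preprocessing first discretizes each continuous feature, then converts every discrete feature with M categories into M binary features.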
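The two preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration, not the script that produced a9a: the helper names (`one_hot`, `encode_categorical`, `encode_continuous`) are hypothetical, the age thresholds are taken from the conversion rules listed in this section, and ages are assumed to be integers.

```python
from bisect import bisect_left

def one_hot(index, width):
    """Return a list of `width` zeros with a single 1 at position `index`."""
    vec = [0] * width
    vec[index] = 1
    return vec

def encode_categorical(value, categories):
    """Map a discrete value with M possible categories to M binary features."""
    return one_hot(categories.index(value), len(categories))

def encode_continuous(value, upper_bounds):
    """Discretize a continuous value into len(upper_bounds) + 1 bins, then one-hot it.

    upper_bounds holds the inclusive upper limit of every bin except the last.
    """
    return one_hot(bisect_left(upper_bounds, value), len(upper_bounds) + 1)

# Inclusive upper limits of the first four age bins; the fifth bin is age >= 50.
AGE_UPPER_BOUNDS = [25, 32, 40, 49]
SEX_CATEGORIES = ["Female", "Male"]

print(encode_continuous(38, AGE_UPPER_BOUNDS))   # age 38 falls in the 33-40 bin
print(encode_categorical("Male", SEX_CATEGORIES))
```

An age of 38 sets the third of the five age dimensions, and "Male" sets the second of the two sex dimensions. Concatenating the encoded vectors of all 14 features gives the final binary record.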

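The dataset contains 48,842 records in total. Each record is converted from the original 14 features into 123 binary features, and the data is split into a training set and a test set at a ratio of about 2:1. a9a is the training set, used to train the classifier; a9a.t is the test set, used to evaluate how well the model classifies. There are two classes, labeled -1 and 1, indicating whether a person's annual income exceeds 50K: 1 means it exceeds 50K, and -1 means it does not.

The converted data can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#a9a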

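Each feature is converted as follows:

(1) age: continuous, expanded to 5 dimensions (dimensions 1-5) using one-hot encoding, with the following bins:

1. age <= 25: dimension 1 is 1;

2. 26 <= age <= 32: dimension 2 is 1;

3. 33 <= age <= 40: dimension 3 is 1;

4. 41 <= age <= 49: dimension 4 is 1;

5. age >= 50: dimension 5 is 1.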

(2) workclass: discrete, with 8 possible values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. It is expanded to 8 dimensions (dimensions 6-13).

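(3) fnlwgt: continuous, expanded to 5 dimensions (dimensions 14-18), with the following bins:

1. fnlwgt <= 110000: dimension 14 is 1;

2. 110000 <= fnlwgt <= 159999: dimension 15 is 1;

3. 160000 <= fnlwgt <= 196335: dimension 16 is 1;

4. 196336 <= fnlwgt <= 259865: dimension 17 is 1;

5. fnlwgt >= 259866: dimension 18 is 1.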

(4) education: discrete, with 16 possible values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. It is expanded to 16 dimensions (dimensions 19-34).

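(5) education-num: continuous, expanded to 5 dimensions (dimensions 35-39), binned by the corresponding education level as follows:

1. 11th, 9th, 7th-8th, 12th, 1st-4th, 10th, 5th-6th, Preschool: dimension 35 is 1;

2. HS-grad: dimension 36 is 1;

3. Some-college: dimension 37 is 1;

4. Assoc-acdm, Assoc-voc: dimension 38 is 1;

5. Bachelors, Prof-school, Masters, Doctorate: dimension 39 is 1.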

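(6) marital-status: discrete, with 7 possible values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. It is expanded to 7 dimensions (dimensions 40-46).

(7) occupation: discrete, with 14 possible values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. It is expanded to 14 dimensions (dimensions 47-60).

(8) relationship: discrete, with 6 possible values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. It is expanded to 6 dimensions (dimensions 61-66).

(9) race: discrete, with 5 possible values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. It is expanded to 5 dimensions (dimensions 67-71).

(10) sex: discrete, with 2 possible values: Female, Male. It is expanded to 2 dimensions (dimensions 72-73).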

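(11) capital-gain: continuous, expanded to 2 dimensions (dimensions 74-75), with the following bins:

1. capital-gain = 0: dimension 74 is 1;

2. capital-gain ≠ 0: dimension 75 is 1.

(12) capital-loss: continuous, expanded to 2 dimensions (dimensions 76-77), with the following bins:

1. capital-loss = 0: dimension 76 is 1;

2. capital-loss ≠ 0: dimension 77 is 1.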

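(13) hours-per-week: continuous, expanded to 5 dimensions (dimensions 78-82), with the following bins:

1. hours-per-week <= 34: dimension 78 is 1;

2. 35 <= hours-per-week <= 39: dimension 79 is 1;

3. hours-per-week = 40: dimension 80 is 1;

4. 41 <= hours-per-week <= 47: dimension 81 is 1;

5. hours-per-week >= 48: dimension 82 is 1.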

(14) native-country: discrete, with 41 possible values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holand-Netherlands. It is expanded to 41 dimensions (dimensions 83-123).
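As a quick sanity check on the layout above, the 14 per-feature widths should add up to 123 and reproduce the stated dimension ranges. The sketch below recomputes them; the widths come from the list above, while the helper name `dimension_ranges` is made up for illustration.

```python
# Number of binary dimensions each original feature expands to, in order.
FEATURE_WIDTHS = [
    ("age", 5), ("workclass", 8), ("fnlwgt", 5), ("education", 16),
    ("education-num", 5), ("marital-status", 7), ("occupation", 14),
    ("relationship", 6), ("race", 5), ("sex", 2), ("capital-gain", 2),
    ("capital-loss", 2), ("hours-per-week", 5), ("native-country", 41),
]

def dimension_ranges(widths):
    """Return {name: (first_dim, last_dim)} with 1-based, inclusive dimensions."""
    ranges = {}
    start = 1
    for name, width in widths:
        ranges[name] = (start, start + width - 1)
        start += width
    return ranges

RANGES = dimension_ranges(FEATURE_WIDTHS)
TOTAL_DIMS = sum(width for _, width in FEATURE_WIDTHS)

for name, (first, last) in RANGES.items():
    print(f"{name}: dimensions {first}-{last}")
print("total dimensions:", TOTAL_DIMS)
```

Because every original feature sets exactly one of its dimensions, a complete record has at most 14 non-zero entries out of 123, which is why a sparse storage format suits this data.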

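4.2 Introduction to LIBSVM

LIBSVM is a simple, easy-to-use, fast and effective software package for SVM pattern recognition and regression, developed by Professor Chih-Jen Lin (林智仁) and colleagues at National Taiwan University. It is an open-source library that can train SVM models, make predictions, and test the predictions against a dataset. LIBSVM also supports the radial basis function kernel and several other kernel types.

LIBSVM download page: https://www.csie.ntu.edu.tw/~cjlin/libsvm/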

4.3 Calling LIBSVM

(1) Check the local Python version:

(2) Download the zip or tar.gz archive

(3) Extract the archive

(4) Call LIBSVM to train the classifier model

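import os
# Raw string, so the backslashes in the Windows path are not parsed as escapes
os.chdir(r'C:\Users\Administrator\Desktop\機器學習\libsvm-3.23\python')
from svmutil import *
# Read the training set: y holds the labels, x holds the sparse feature vectors
y, x = svm_read_problem('a9a')
# Train with the default settings (C-SVC, RBF kernel) and cost C = 5
m = svm_train(y, x, '-c 5')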
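The a9a file is stored in LIBSVM's sparse text format, where each line is a label followed by `index:value` pairs for the non-zero features. The following is a minimal pure-Python sketch of how one such line maps to a label and a feature dictionary. The function `parse_libsvm_line` is a hypothetical stand-in written for illustration, not LIBSVM's own reader, and the sample line is invented rather than copied from a9a.

```python
def parse_libsvm_line(line):
    """Parse one 'label index:value ...' line into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

# Made-up record for illustration only: label +1 with four active dimensions.
sample_line = "+1 2:1 6:1 73:1 83:1"
label, features = parse_libsvm_line(sample_line)

print(label)
print(features)
```

Only the dimensions that equal 1 are written out; every index that is absent from the line is implicitly 0.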

Parameter options:

options:
-s svm_type : set type of SVM (default 0)
	0 -- C-SVC
	1 -- nu-SVC
	2 -- one-class SVM
	3 -- epsilon-SVR
	4 -- nu-SVR
-t kernel_type : set type of kernel function (default 2)
	0 -- linear: u'*v
	1 -- polynomial: (gamma*u'*v + coef0)^degree
	2 -- radial basis function: exp(-gamma*|u-v|^2)
	3 -- sigmoid: tanh(gamma*u'*v + coef0)
-d degree : set degree in kernel function (default 3)
-g gamma : set gamma in kernel function (default 1/num_features)
-r coef0 : set coef0 in kernel function (default 0)
-c cost : set the parameter C of C-SVC, epsilon-SVR, and nu-SVR (default 1)
-n nu : set the parameter nu of nu-SVC, one-class SVM, and nu-SVR (default 0.5)
-p epsilon : set the epsilon in loss function of epsilon-SVR (default 0.1)
-m cachesize : set cache memory size in MB (default 100)
-e epsilon : set tolerance of termination criterion (default 0.001)
-h shrinking: whether to use the shrinking heuristics, 0 or 1 (default 1)
-b probability_estimates: whether to train a SVC or SVR model for probability estimates, 0 or 1 (default 0)
-wi weight: set the parameter C of class i to weight*C, for C-SVC (default 1)

In the -g option, num_features means the number of attributes in the input data.
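To make the four kernel formulas in the option list concrete, here is a small pure-Python sketch that evaluates them on dense vectors. It is not LIBSVM's implementation; the function names and the two example vectors are assumptions chosen for illustration, and gamma is set to 1/num_features to mirror the documented default.

```python
import math

def dot(u, v):
    """Inner product u'*v of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, gamma, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree

def rbf_kernel(u, v, gamma):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(u, v, gamma, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)

# Two made-up binary vectors with 4 features each.
u = [1.0, 0.0, 1.0, 0.0]
v = [1.0, 1.0, 0.0, 0.0]
gamma = 1.0 / len(u)  # mirrors the default 1/num_features

print("linear:    ", linear_kernel(u, v))
print("polynomial:", polynomial_kernel(u, v, gamma))
print("rbf:       ", rbf_kernel(u, v, gamma))
print("sigmoid:   ", sigmoid_kernel(u, v, gamma))
```

For binary features such as those in a9a, the squared distance in the RBF kernel is simply the number of positions where the two records differ.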

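Call LIBSVM to test how well the classifier model performs:

# Read the test set in the same format as the training set
test_y, test_x = svm_read_problem('a9a.t')
# Predict with model m; p_acc holds the accuracy, p_label the predicted labels
p_label, p_acc, p_val = svm_predict(test_y, test_x, m)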

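4.4 Analysis of Results

The prediction output shows that the classifier trained with LIBSVM on the a9a training set reaches a classification accuracy of about 84.97% on the a9a test set.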
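The accuracy figure is the percentage of test records whose predicted label equals the true label. The sketch below computes it by hand on a tiny made-up example; the helper `accuracy_percent` and the label lists are hypothetical and unrelated to the actual a9a predictions.

```python
def accuracy_percent(true_labels, predicted_labels):
    """Return the percentage of positions where the two label lists agree."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)

# Made-up labels for illustration: 6 of the 8 predictions match.
true_labels = [1, -1, -1, 1, -1, -1, 1, -1]
predicted_labels = [1, -1, -1, -1, -1, 1, 1, -1]

print(f"Accuracy = {accuracy_percent(true_labels, predicted_labels):.2f}%")
```

Applying the same function to `test_y` and `p_label` from the prediction step should agree with the accuracy that LIBSVM reports, since both hold one numeric label per test record.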
