Python Spark: Support Vector Machines (SVM)

阿新 • Published: 2018-12-09

Data Preparation

As with decision-tree classification, the experiment again uses the StumbleUpon Evergreen data set.

Start IPython Notebook in local mode:

cd ~/pythonwork/ipynotebook
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=local[*] pyspark

Import and transform the data:

## Define the data path
if sc.master[:5]=="local":
    Path="file:/home/yyf/pythonwork/PythonProject/"
else:
    Path="hdfs://master:9000/user/yyf/"
## Read train.tsv
print("Loading data...")
rawDataWithHeader = sc.textFile(Path+"data/train.tsv")
## The first line is the header
header = rawDataWithHeader.first()
## Drop the header line (field names) and keep the data lines
rawData = rawDataWithHeader.filter(lambda x:x!=header)
## Remove the double quotes
rData = rawData.map(lambda x:x.replace("\"",""))
## Split each line on tabs
lines = rData.map(lambda x: x.split("\t"))
print("Total: "+str(lines.count())+" records")
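
The two per-line transformations above (removing the double quotes, then splitting on tabs) can be tried without Spark. The following is a minimal sketch in plain Python; the helper name parse_line and the sample row are hypothetical and not taken from train.tsv.

```python
def parse_line(raw_line):
    """Strip every double quote, then split the line on tab characters."""
    return raw_line.replace('"', "").split("\t")


# Hypothetical row in the style of train.tsv (not real data)
sample = '"http://example.com/page"\t"4042"\t"business"\t"0.789"\t"?"\t"1"'
fields = parse_line(sample)
print(fields)
```

Note that this removes quotes everywhere in the line, including inside text fields, which is also what the Spark version does.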

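Data Preprocessing

1. Processing the features

In train.tsv and test.tsv, the field at index 3 (counting from 0) is alchemy_category, the web page category. It is a categorical feature, so it is converted to numerical features with one-hot encoding. The main steps are:

(1) Build the categoriesMap dictionary: each key is a page category name and each value is a number (the index of that category), so every category name maps to one index.
(2) Look up the index of each alchemy_category value in categoriesMap; for example, the index categoryIdx of business is 2.
(3) Convert categoryIdx=2 with one-hot encoding into a list, categoryFeatures List, whose length is 14 (the total number of page categories); for categoryIdx=2 the list is [0,0,1,0,0,0,0,0,0,0,0,0,0,0].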

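Building the categoriesMap page category dictionary

categoriesMap = lines.map(lambda fields: fields[3]).distinct().zipWithIndex().collectAsMap()

Here, lines.map() processes every row read earlier, .map(lambda fields: fields[3]) extracts the field at index 3, .distinct() keeps only the unique values, .zipWithIndex() numbers those unique values, and .collectAsMap() returns the result as a Python dict.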
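
What that chain produces can be illustrated without Spark. The sketch below is plain Python with made-up rows; build_category_map is a hypothetical helper, and the indices Spark assigns may differ because distinct() does not guarantee any particular order.

```python
def build_category_map(rows, column=3):
    """Map each distinct value found in `column` to an integer index."""
    distinct = []
    for row in rows:
        value = row[column]
        if value not in distinct:
            distinct.append(value)
    return {value: index for index, value in enumerate(distinct)}


# Made-up rows: only the field at index 3 matters here
rows = [
    ["url1", "1", "text", "business"],
    ["url2", "2", "text", "sports"],
    ["url3", "3", "text", "business"],
    ["url4", "4", "text", "?"],
]
category_map = build_category_map(rows)
print(category_map)
```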

Converting an alchemy_category value into the categoryFeatures List

## Convert one alchemy_category value into a one-hot list
## Look up the index of the category
import numpy as np
categoryIdx = categoriesMap[lines.first()[3]]
OneHot = np.zeros(len(categoriesMap))
OneHot[categoryIdx] = 1
print(OneHot)

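Fields 4 to 25 are numerical features. They are converted from strings to numbers with float(), and missing values, marked "?", are simply replaced with 0.

The whole feature-processing step can be wrapped in a function: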

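import numpy as np

def convert(v):
    """Convert one numerical field, mapping the missing-value marker "?" to 0."""
    return (0 if v=="?" else float(v))

def process_features(line, categoriesMap, featureEnd):
    """Build the feature vector. line is the list of fields, categoriesMap is the page category dictionary, featureEnd is the index where the features end."""
    ## One-hot encode the alchemy_category field
    categoryIdx = categoriesMap[line[3]]
    OneHot = np.zeros(len(categoriesMap))
    OneHot[categoryIdx] = 1
    ## Convert the numerical features
    numericalFeatures = [convert(value) for value in line[4:featureEnd]]
    # Return the concatenated feature vector
    return np.concatenate((OneHot, numericalFeatures))


## Process the features to produce featureRDD (the last field is the label, so stop at len(r)-1)
featureRDD = lines.map(lambda r: process_features(r, categoriesMap, len(r)-1))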

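2. Standardizing the data

Unlike decision trees, SVMWithSGD (like logistic regression) needs standardized numerical features. The reason is that it is trained with gradient descent: without standardization, features with large values dominate the gradient, which can make convergence difficult. Standardization puts every feature on the same scale (zero mean and unit standard deviation when both centering and scaling are enabled), which balances the influence of the different features on the gradient.

Feature values before standardization: (figure)

Before standardization, the features include values as large as 5424.0 and as small as 0.0235, so the data needs to be standardized.

Standardize the features: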

## Standardization
from pyspark.mllib.feature import StandardScaler   # import the standardization module

## Standardize featureRDD
stdScaler = StandardScaler(withMean=True, withStd=True).fit(featureRDD)  # fit a scaler on the features
ScalerFeatureRDD = stdScaler.transform(featureRDD)
ScalerFeatureRDD.first()

Feature values after standardization: (figure)
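
As a rough illustration of what centering and scaling do, here is a NumPy sketch that subtracts each column's mean and divides by its standard deviation. It assumes the sample standard deviation (ddof=1), which is my understanding of what MLlib's StandardScaler uses; fit_scaler and transform are hypothetical helper names. Only the values 5424.0 and 0.0235 come from the text above, the rest of the matrix is made up.

```python
import numpy as np


def fit_scaler(features):
    """Return the per-column mean and sample standard deviation."""
    mean = features.mean(axis=0)
    std = features.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns at 0 after centering
    return mean, std


def transform(features, mean, std):
    """Center and scale with statistics computed earlier."""
    return (features - mean) / std


# Made-up feature matrix with columns on very different scales
train = np.array([[5424.0, 0.0235],
                  [120.0, 0.5],
                  [980.0, 0.25]])
mean, std = fit_scaler(train)
scaled = transform(train, mean, std)
print(scaled)
```

Keeping fit and transform separate makes it possible to apply the statistics learned on the training data to new rows later.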

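3. Building LabeledPoint data from the labels

The label is the last column of train.tsv (test.tsv has no label column). It only needs to be converted from a string to a float: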

## Process the label
def process_label(line):
    return float(line[-1])  # the last field is the class label

labelRDD = lines.map(lambda r: process_label(r))

Next, build labelpointRDD. Spark MLlib classifiers take their input as LabeledPoint records, each consisting of a label and a feature vector. Build the LabeledPoint data:

## Build the LabeledPoint data
from pyspark.mllib.regression import LabeledPoint

## Pair each label with its feature vector
labelpoint = labelRDD.zip(ScalerFeatureRDD)
labelpointRDD = labelpoint.map(lambda r: LabeledPoint(r[0],r[1]))
labelpointRDD.first()

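4. Splitting into training, validation and test sets

## Split into training, validation and test sets with weights 7:1:2
(trainData, validationData, testData) = labelpointRDD.randomSplit([7,1,2])

# Keep the data in memory to speed up later computations
trainData.persist()
validationData.persist()
testData.persist()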
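
The weights [7,1,2] are relative: they are normalized to fractions 0.7, 0.1 and 0.2, and each record is assigned at random, so the actual set sizes are only approximately in that ratio. The plain-Python sketch below imitates this behavior; random_split is a hypothetical helper and is not how Spark implements randomSplit internally.

```python
import random


def random_split(items, weights, seed=0):
    """Assign each item to one part with probability proportional to its weight."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point rounding
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for item in items:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                parts[index].append(item)
                break
    return parts


train, validation, test = random_split(list(range(1000)), [7, 1, 2], seed=42)
print(len(train), len(validation), len(test))
```

In PySpark, randomSplit also accepts a seed argument, which makes the split reproducible between runs.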

Training the Model

Spark MLlib provides the SVMWithSGD support vector machine classifier. Its .train() method trains a model and has the following signature:

SVMWithSGD.train(data, iterations=100, step=1.0, regParam=0.01,
              miniBatchFraction=1.0, initialWeights=None, regType="l2",
              intercept=False, validateData=True, convergenceTol=0.001)

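The main parameters are:

data: the training data, in LabeledPoint format
iterations: the number of SGD iterations, default 100
step: the SGD step size, default 1
miniBatchFraction: the fraction of the samples used in each mini-batch SGD iteration, between 0 and 1, default 1
initialWeights: the initial weights, default None
regParam: the regularization coefficient, default 0.01
regType: the regularization type, "l1", "l2" or None, default "l2"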

## Train an SVM classification model
from pyspark.mllib.classification import SVMWithSGD
## Train with the default parameter values
model = SVMWithSGD.train(trainData, iterations=100, step=1.0, miniBatchFraction=1.0, regParam=0.01, regType="l2")
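
To give a feel for what iterations, step, regParam and miniBatchFraction control, here is a small NumPy sketch of a linear SVM trained by mini-batch subgradient descent on the hinge loss with an L2 penalty. It is an illustration under assumptions, not MLlib's implementation: the function names are hypothetical, there is no intercept, and the decaying step size step/sqrt(i) and the form of the L2 update are my understanding of how MLlib's gradient descent behaves.

```python
import numpy as np


def train_linear_svm(X, y, iterations=100, step=1.0, reg_param=0.01,
                     mini_batch_fraction=1.0, seed=0):
    """Hinge-loss subgradient descent with an L2 penalty; y holds labels in {0, 1}."""
    rng = np.random.default_rng(seed)
    signs = 2.0 * y - 1.0  # map labels {0, 1} to {-1, +1}
    w = np.zeros(X.shape[1])
    for i in range(1, iterations + 1):
        # Sample roughly mini_batch_fraction of the rows for this iteration
        mask = rng.random(X.shape[0]) < mini_batch_fraction
        if not mask.any():
            continue
        Xb, sb = X[mask], signs[mask]
        # Rows whose margin is below 1 contribute -sign * x to the subgradient
        violating = sb * (Xb @ w) < 1.0
        grad = -(sb[violating, None] * Xb[violating]).sum(axis=0) / mask.sum()
        step_i = step / np.sqrt(i)  # assumed decaying step size
        w = w * (1.0 - step_i * reg_param) - step_i * grad
    return w


def predict(w, X):
    """Predict 1.0 where the score is positive, else 0.0."""
    return (X @ w > 0).astype(float)


# Tiny made-up, linearly separable data set
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = train_linear_svm(X, y)
print(w, predict(w, X))
```

A larger reg_param shrinks the weights more strongly on every iteration, and a larger step makes each update more aggressive, which is why both are worth tuning.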

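Evaluating the Model

For simplicity, prediction accuracy is used as the evaluation metric, computed by a custom function. (To be honest, the classes in pyspark MLlib's evaluation module kept raising errors when I used them, and I could not work out why.)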

## Define the model evaluation function
def ModelAccuracy(model, validationData):
    ## Compute the model's accuracy
    predict = model.predict(validationData.map(lambda p:p.features))
    predict = predict.map(lambda p: float(p))
    ## Pair each prediction with the actual label
    predict_real = predict.zip(validationData.map(lambda p: p.label))
    matched = predict_real.filter(lambda p:p[0]==p[1])
    accuracy =  float(matched.count()) / float(predict_real.count())
    return accuracy

acc = ModelAccuracy(model, validationData)
## Print the accuracy
print("accuracy="+str(acc))

Result: accuracy=0.646563814867
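
The same calculation on plain Python lists makes the metric easy to check by hand. The sketch below also computes a majority-class baseline, which is a useful reference point when reading an accuracy figure; both helper names and the sample values are hypothetical.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions equal to the actual labels."""
    pairs = list(zip(predictions, labels))
    matched = sum(1 for predicted, actual in pairs
                  if float(predicted) == float(actual))
    return matched / float(len(pairs))


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / float(len(labels))


predictions = [1, 0, 1, 1]
labels = [1.0, 0.0, 0.0, 1.0]
print(accuracy(predictions, labels))
print(majority_baseline(labels))
```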

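Tuning the Model Parameters

The SVM's parameters affect both accuracy and training time: the number of iterations (iterations), the SGD step size (step), the mini-batch fraction (miniBatchFraction), the regularization coefficient (regParam, 0.01 so far) and the regularization type (regType, "l2" so far). Different values are tested and evaluated below.

The trainEvaluateModel function trains and evaluates a model and measures the time taken.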

## trainEvaluateModel trains and evaluates a model and measures the time taken
from time import time

def trainEvaluateModel(trainData, validationData, iterations, step, miniBatchFraction, regParam, regType):
    startTime = time()
    ## Create and train the model
    model = SVMWithSGD.train(trainData, iterations=iterations, step=step, 
                                            miniBatchFraction=miniBatchFraction, regParam=regParam, regType=regType)
    ## Compute the accuracy
    accuracy = ModelAccuracy(model, validationData)

    duration = time() - startTime   # elapsed time
    print("Train/evaluate: iterations="+str(iterations) + 
         ",  step="+str(step)+",  miniBatchFraction="+str(miniBatchFraction)+
          ", regParam="+str(regParam)+", regType=" + str(regType) +"\n"+
         "===> elapsed time="+str(duration)+",  accuracy="+str(accuracy))
    return accuracy, duration, iterations, step, miniBatchFraction, regParam, regType, model

1. Evaluating the iterations parameter

Test iterations values of [10, 100, 1000, 10000] and record the run time and the accuracy on the validation set.

## Evaluate the iterations parameter
iterationsList = [10, 100, 1000, 10000] 
stepList = [1]
miniBatchFractionList = [1]
regParamList = [0.01]
regTypeList = ["l2"]

## Store the results in metrics
metrics = [trainEvaluateModel(trainData, validationData,iterations, step, miniBatchFraction, regParam, regType)
          for iterations in iterationsList
          for step in stepList
          for miniBatchFraction in miniBatchFractionList
          for regParam in regParamList
          for regType in regTypeList]

Output:

Train/evaluate: iterations=10,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=0.995012998581,  accuracy=0.652173913043
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=2.74122095108,  accuracy=0.646563814867
Train/evaluate: iterations=1000,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=2.00927209854,  accuracy=0.646563814867
Train/evaluate: iterations=10000,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.96263098717,  accuracy=0.646563814867

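With a small iterations value the run time is short; beyond a certain point, a larger iterations value no longer increases the run time, most likely because training stops early once the convergenceTol threshold is reached. The accuracy is also nearly identical across the settings, so iterations does not appear to be a critical parameter here.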
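
One plausible reason the run time levels off is the convergenceTol argument of SVMWithSGD.train: once successive weight vectors barely change, training can stop before the iteration cap is reached. The sketch below shows the idea on a toy quadratic objective rather than an SVM; the stopping rule used here (change in weights relative to the weight norm) is an assumption about the general approach, and descend is a hypothetical helper.

```python
import numpy as np


def descend(target, max_iterations, step=0.5, convergence_tol=0.001):
    """Gradient descent on 0.5 * ||w - target||^2 with early stopping.

    Returns the final weights and the number of iterations actually run.
    """
    w = np.zeros_like(target, dtype=float)
    for i in range(1, max_iterations + 1):
        new_w = w - step * (w - target)  # the gradient is (w - target)
        change = np.linalg.norm(new_w - w)
        w = new_w
        # Stop when the update is tiny relative to the size of the weights
        if change < convergence_tol * max(np.linalg.norm(w), 1.0):
            return w, i
    return w, max_iterations


target = np.array([3.0, 4.0])
for cap in [10, 100, 1000, 10000]:
    _, used = descend(target, cap)
    print(cap, used)
```

Raising the cap beyond the point of convergence does not change the number of iterations actually run, which matches the flat run times reported above.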

2. Evaluating the step parameter

Test step values of [0.1, 1, 10, 100, 500, 1000] and record the run time and the accuracy on the validation set.

## Evaluate the step parameter
iterationsList = [100] 
stepList = [0.1, 1, 10, 100, 500, 1000]
miniBatchFractionList = [1]
regParamList = [0.01]
regTypeList = ["l2"]

## Store the results in metrics
metrics = [trainEvaluateModel(trainData, validationData,iterations, step, miniBatchFraction, regParam, regType)
          for iterations in iterationsList
          for step in stepList
          for miniBatchFraction in miniBatchFractionList
          for regParam in regParamList
          for regType in regTypeList]

Output:

Train/evaluate: iterations=100,  step=0.1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=2.24229192734,  accuracy=0.656381486676
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=2.11195993423,  accuracy=0.646563814867
Train/evaluate: iterations=100,  step=10,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.82550406456,  accuracy=0.642356241234
Train/evaluate: iterations=100,  step=100,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.71144795418,  accuracy=0.583450210379
Train/evaluate: iterations=100,  step=500,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.78000092506,  accuracy=0.50350631136
Train/evaluate: iterations=100,  step=1000,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.95169901848,  accuracy=0.468443197756

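The run time is slightly higher at the smallest and largest step sizes, although the differences are small (roughly 1.7 to 2.2 seconds). Accuracy, on the other hand, falls steadily as the step size grows, from 0.656 at step=0.1 to 0.468 at step=1000.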

3. Evaluating the mini-batch fraction miniBatchFraction

Test miniBatchFraction values of [0.01, 0.1, 0.5, 1] and record the run time and the accuracy on the validation set.

## Evaluate the miniBatchFraction parameter
iterationsList = [100] 
stepList = [1]
miniBatchFractionList = [0.01, 0.1, 0.5, 1]
regParamList = [0.01]
regTypeList = ["l2"]

## Store the results in metrics
metrics = [trainEvaluateModel(trainData, validationData,iterations, step, miniBatchFraction, regParam, regType)
          for iterations in iterationsList
          for step in stepList
          for miniBatchFraction in miniBatchFractionList
          for regParam in regParamList
          for regType in regTypeList]

Output:

Train/evaluate: iterations=100,  step=1,  miniBatchFraction=0.01, regParam=0.01, regType=l2
===> elapsed time=1.71210098267,  accuracy=0.647966339411
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=0.1, regParam=0.01, regType=l2
===> elapsed time=1.34743785858,  accuracy=0.646563814867
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=0.5, regParam=0.01, regType=l2
===> elapsed time=1.30253577232,  accuracy=0.643758765778
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.37389492989,  accuracy=0.646563814867

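The miniBatchFraction parameter has no significant effect.

4. Evaluating the regularization coefficient regParam and the regularization type regType

Testing regParam values of [0.01, 0.1, 1, 10, 100] for run time and accuracy on the validation set gives: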

Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=0.01, regType=l2
===> elapsed time=1.33343195915,  accuracy=0.646563814867
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=0.1, regType=l2
===> elapsed time=0.793405056,  accuracy=0.649368863955
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=1, regType=l2
===> elapsed time=0.505378007889,  accuracy=0.650771388499
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=10, regType=l2
===> elapsed time=0.927163124084,  accuracy=0.639551192146
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=100, regType=l2
===> elapsed time=1.36627006531,  accuracy=0.360448807854

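Note that when the regularization coefficient is too large, the model's feature coefficients are shrunk to very small values and the model loses its predictive power.

Testing the regularization type regType over ["l2", "l1", None] gives: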

Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=1, regType=l2
===> elapsed time=0.544511795044,  accuracy=0.650771388499
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=1, regType=l1
===> elapsed time=0.370025873184,  accuracy=0.465638148668
Train/evaluate: iterations=100,  step=1,  miniBatchFraction=1, regParam=1, regType=None
===> elapsed time=1.3747150898,  accuracy=0.643758765778

In this example, l1 regularization appears to be a poor choice.

Selecting the Best Parameter Combination

Search for it with a grid search:

## gridSearch finds the best parameter combination by grid search

def gridSearch(trainData, validationData, iterationsList, stepList, miniBatchFractionList, regParamList, regTypeList):
    metrics = [trainEvaluateModel(trainData, validationData,iterations, step, miniBatchFraction, regParam, regType)
          for iterations in iterationsList
          for step in stepList
          for miniBatchFraction in miniBatchFractionList
          for regParam in regParamList
          for regType in regTypeList]
    # Sort by accuracy in descending order and return the combination with the highest accuracy
    sorted_metrics = sorted(metrics, key=lambda k:k[0], reverse=True)
    best_parameters = sorted_metrics[0]
    print("Best parameters: iterations="+str(best_parameters[2]) + 
         ",  step="+str( best_parameters[3])+",  miniBatchFraction="+str( best_parameters[4])+
          ", regParam="+str( best_parameters[5])+", regType=" + str( best_parameters[6]) +"\n"+ "accuracy="+str( best_parameters[0]))
    return  best_parameters
## Parameter values to combine
iterationsList = [10, 100] 
stepList = [1, 10]
miniBatchFractionList = [0.1, 1]
regParamList = [0.001, 0.01, 0.1]
regTypeList = ["l2","l1"]

## Run the search and return the best parameter combination
best_parameters = gridSearch(trainData, validationData, iterationsList, stepList, miniBatchFractionList, regParamList, regTypeList)

The best parameter combination found is iterations=100, step=10, miniBatchFraction=0.1, regParam=0.001, regType=l2, with an accuracy on the validation set of accuracy=0.661991584853.
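
The nested comprehension used for the search can also be written with itertools.product, which scales more comfortably as parameters are added. The sketch below is independent of Spark: grid_search is a hypothetical helper, and toy_score is a made-up stand-in for "train a model and return its validation accuracy".

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best (score, params)."""
    names = list(grid)
    best_score, best_params = None, None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


def toy_score(step, reg_param):
    # Made-up scoring function: peaks at step=10 and prefers small reg_param
    return 1.0 - abs(step - 10) / 100.0 - reg_param


grid = {"step": [1, 10], "reg_param": [0.001, 0.01, 0.1]}
best_score, best_params = grid_search(toy_score, grid)
print(best_score, best_params)
```

Because the search selects parameters using the validation set, the reported validation accuracy is optimistic; a separate test set is needed for an unbiased estimate.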

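Checking for Overfitting and Making Predictions

1. Checking for overfitting

The best parameter combination found above is iterations=100, step=10, miniBatchFraction=0.1, regParam=0.001, regType=l2, together with its validation accuracy. Does a model trained with these parameters overfit when applied to the test data?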


## Train a model with the best combination: iterations=100, step=10, miniBatchFraction=0.1, regParam=0.001, regType=l2
best_model = SVMWithSGD.train(trainData, iterations=100, step=10, 
                                            miniBatchFraction=0.1, regParam=0.001, regType="l2")
trainACC = ModelAccuracy(best_model, trainData)
testACC =  ModelAccuracy(best_model, testData)
print("training: accuracy="+str(trainACC))
print("testing: accuracy="+str(testACC))

Result:

training: accuracy=0.679861644889
testing: accuracy=0.638700947226

The two values are close (about 4 percentage points apart), which suggests the model is not seriously overfitting.

2. Making predictions with the model

Use the model trained with the best parameter combination to predict on the data in test.tsv.

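## Make predictions with the model
def predictData(sc,model,categoriesMap):
    print("Loading data...")
    rawDataWithHeader = sc.textFile(Path+"data/test.tsv")
    ## The first line is the header
    header = rawDataWithHeader.first()
    ## Drop the header line (field names) and keep the data lines
    rawData = rawDataWithHeader.filter(lambda x:x!=header)
    ## Remove the double quotes
    rData = rawData.map(lambda x:x.replace("\"",""))
    ## Split each line on tabs
    lines = rData.map(lambda x: x.split("\t"))
    ## Preprocess the test set (it has no label column, so the features run to the end of the row)
    testDataRDD=lines.map(lambda r: process_features(r, categoriesMap, len(r)))
    ## Standardize. This fits a new scaler on the test data; reusing the scaler
    ## fitted on the training features is the more conventional choice.
    stdScaler = StandardScaler(withMean=True, withStd=True).fit(testDataRDD)
    ScalertestRDD = stdScaler.transform(testDataRDD)
    DescDict={0:"ephemeral page",
              1:"evergreen page"}
    ## Predict the first 5 records, fetching them once instead of on every loop pass
    first5Features = ScalertestRDD.take(5)
    first5Lines = lines.take(5)
    for features, fields in zip(first5Features, first5Lines):
        predictResult=model.predict(features)
        print("URL: "+str(fields[0])+"\n"+" ===> prediction: "+str(predictResult) + " description: "+DescDict[predictResult]+"\n")

predictData(sc,best_model,categoriesMap)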

Output:

Loading data...
URL: http://www.lynnskitchenadventures.com/2009/04/homemade-enchilada-sauce.html
 ===> prediction: 1 description: evergreen page

URL: http://lolpics.se/18552-stun-grenade-ar
 ===> prediction: 0 description: ephemeral page

URL: http://www.xcelerationfitness.com/treadmills.html
 ===> prediction: 0 description: ephemeral page

URL: http://www.bloomberg.com/news/2012-02-06/syria-s-assad-deploys-tactics-of-father-to-crush-revolt-threatening-reign.html
 ===> prediction: 0 description: ephemeral page

URL: http://www.wired.com/gadgetlab/2011/12/stem-turns-lemons-and-limes-into-juicy-atomizers/
 ===> prediction: 0 description: ephemeral page
