
Training Classification Models with Spark: Exercises (1)

This post contains my study notes for Chapter 5 of *Machine Learning with Spark*.
The dataset used can be downloaded here: the training set train.tsv

The columns have the following meanings:
"url" "urlid" "boilerplate" "alchemy_category" "alchemy_category_score" "avglinksize" "commonlinkratio_1" "commonlinkratio_2" "commonlinkratio_3" "commonlinkratio_4" "compression_ratio" "embed_ratio" "framebased" "frameTagRatio" "hasDomainLink" "html_ratio" "image_ratio" "is_news" "lengthyLinkDomain" "linkwordscore" "news_front_page" "non_markup_alphanum_characters" "numberOfLinks" "numwords_in_url" "parametrizedLinkRatio" "spelling_errors_ratio" "label"

The first four columns are: the link URL, the page ID, the raw page content, and the page's category.
The next 22 columns hold various numeric or categorical features.
The last column is the target value: 1 means evergreen (long-lasting), 0 means ephemeral.

On the Linux command line, strip the header row:

$ sed 1d train.tsv > train_noheader.tsv

Start spark-shell:

val rawData = sc.textFile("file:///home/hadoop/train_noheader.tsv")
val records = rawData.map(line => line.split("\t"))
records.first()

The output is:

Array("http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html", "4042", "{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees …

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Strip the stray quote characters, replace missing values ("?") with 0.0,
// and build the LabeledPoint training data
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
data.cache() // this RDD is reused many times below

// Clip negative feature values to 0.0, since naive Bayes requires
// non-negative features
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1)
    .map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
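The cleaning logic above can be exercised on its own with a hypothetical raw row (plain Scala, no Spark; the values below are made up, and a real record has 27 columns rather than 8):

```scala
// Hypothetical 8-column row standing in for a real 27-column record:
// columns 0-3 are metadata, columns 4..n-2 are features, the last is the label
val row = Array("\"http://example.com\"", "\"101\"", "\"{...}\"", "\"business\"",
                "\"0.78\"", "?", "\"-0.5\"", "\"1\"")
val trimmed = row.map(_.replaceAll("\"", ""))
val label = trimmed(row.size - 1).toInt
// "?" marks a missing value; substitute 0.0 so toDouble never throws
val features = trimmed.slice(4, row.size - 1)
  .map(d => if (d == "?") 0.0 else d.toDouble)
// naive Bayes needs non-negative features, so clip negatives to 0.0
val nbFeatures = features.map(d => if (d < 0) 0.0 else d)
```

Here `features` becomes `Array(0.78, 0.0, -0.5)` and `nbFeatures` becomes `Array(0.78, 0.0, 0.0)`, which is exactly the difference between `data` and `nbData` above.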

// Train the classification models:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD // logistic regression
import org.apache.spark.mllib.classification.SVMWithSGD                // SVM
import org.apache.spark.mllib.classification.NaiveBayes                // naive Bayes
import org.apache.spark.mllib.tree.DecisionTree                        // decision tree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy                    // entropy impurity
val numIterations = 10
val maxTreeDepth = 5

// Train each model
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
val svmModel = SVMWithSGD.train(data, numIterations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)

Using a trained model to predict unseen data is straightforward; taking logistic regression as an example:

val dataPoint = data.first
val prediction = lrModel.predict(dataPoint.features) // predict from the feature vector
// dataPoint.label holds the true label;
// dataPoint.features holds the input features

The output is: prediction: Double = 1.0

2 Evaluating classification performance

2.1 Accuracy and error rate

Accuracy: the number of correctly classified training samples divided by the total number of samples (positive + negative).
Error rate: the number of misclassified training samples divided by the total number of samples (positive + negative).
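These two definitions can be sketched in plain Scala on a toy set of labels and predictions (made-up values, no Spark):

```scala
val labels      = Seq(1, 0, 1, 1, 0) // true classes
val predictions = Seq(1, 0, 0, 1, 1) // model outputs
// accuracy = correctly classified / total samples
val correct   = labels.zip(predictions).count { case (l, p) => l == p }
val accuracy  = correct.toDouble / labels.size // 3 of 5 match => 0.6
val errorRate = 1.0 - accuracy
```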

// Average accuracy of each algorithm
val numData = data.count
// logistic regression
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
// SVM
val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
// naive Bayes
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
// decision tree: the model returns a raw score, so a 0.5 threshold is applied
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
// compute the accuracies:
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData

Output: (screenshot of the four accuracy values omitted)

2.2 Precision, recall, and the PR curve

Definitions:

In binary classification:
Precision: the number of true positives divided by the sum of true positives and false positives. (A true positive is a sample of class 1 correctly predicted as 1; a false positive is a sample incorrectly predicted as class 1.)
Meaning: the fraction of the returned results that are relevant (it measures the quality of the results).
Recall: the number of true positives divided by the sum of true positives and false negatives, where a false negative is a sample of class 1 predicted as 0.
Meaning: 100% recall means every positive sample is detected (it measures the completeness of the algorithm).
The PR curve plots recall on the horizontal axis against precision on the vertical axis.
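A plain-Scala sketch of the two formulas, on made-up labels and predictions (no Spark):

```scala
val labels      = Seq(1, 1, 1, 0, 0, 1) // true classes
val predictions = Seq(1, 0, 1, 1, 0, 1) // model outputs
val paired = labels.zip(predictions)
val tp = paired.count { case (l, p) => l == 1 && p == 1 } // true positives: 3
val fp = paired.count { case (l, p) => l == 0 && p == 1 } // false positives: 1
val fn = paired.count { case (l, p) => l == 1 && p == 0 } // false negatives: 1
val precision = tp.toDouble / (tp + fp) // 3/4 = 0.75: quality of the results
val recall    = tp.toDouble / (tp + fn) // 3/4 = 0.75: completeness
```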

2.3 The ROC curve and AUC

The ROC curve is a graphical representation of the true positive rate against the false positive rate.

True positive rate (TPR): the number of true positives divided by the sum of true positives and false negatives.
False positive rate (FPR): the number of false positives divided by the sum of false positives and true negatives.
Ideally the area under the ROC curve (AUC) is 1; the closer to 1, the better.
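AUC also equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counting 0.5). A plain-Scala sketch on made-up scores:

```scala
// (score, label) pairs with made-up scores
val scored = Seq((0.9, 1), (0.8, 0), (0.7, 1), (0.3, 0))
val posScores = scored.filter(_._2 == 1).map(_._1)
val negScores = scored.filter(_._2 == 0).map(_._1)
// compare every positive against every negative
val wins = for (p <- posScores; n <- negScores)
  yield if (p > n) 1.0 else if (p == n) 0.5 else 0.0
val auc = wins.sum / wins.size // 3 of 4 pairs ranked correctly => 0.75
```

This pairwise formulation is quadratic and only suited to toy data; MLlib's `BinaryClassificationMetrics`, used below, computes the same quantity efficiently.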

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// logistic regression and SVM
val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// naive Bayes
val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// decision tree
val dtMetrics = Seq(dtModel).map { model =>
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}

// Print all results together:
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
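The f-interpolator does the percentage formatting here; a standalone illustration with a made-up model name and metric value:

```scala
val m  = "LogisticRegressionModel" // hypothetical model name
val pr = 0.75                      // hypothetical area under PR
// %2.4f prints four decimal places; %% emits a literal percent sign
val line = f"$m, Area under PR: ${pr * 100.0}%2.4f%%"
```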

Output: (screenshot of the PR/ROC results omitted)

These results are not yet satisfactory; the next post explores parameter tuning.