Notes on "Machine Learning with Spark": Spark Classification Models (Logistic Regression, Naive Bayes, Decision Trees, Support Vector Machines)

1. Types of Classification Models

1.1 Linear Models

1.1.1 Logistic Regression

1.1.2 Linear Support Vector Machines

1.2 The Naive Bayes Model

1.3 The Decision Tree Model

2. Extracting Suitable Features from the Data

Classification models in MLlib operate on LabeledPoint(label: Double, features: Vector) objects, which wrap the target variable (the label) and the feature vector together.

Extracting features from the Kaggle/StumbleUpon evergreen classification dataset

The dataset concerns whether pages recommended on a web page are ephemeral (short-lived, quickly going out of fashion) or evergreen (popular for a long time).

The header row can be removed with sed 1d train.tsv > train_noheader.tsv
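As a quick sanity check of that sed invocation, here is a sketch on a tiny hypothetical two-line file (the file name and contents are made up):

```shell
# Build a tiny hypothetical TSV with a header row
printf 'url\tlabel\nhttp://example.com\t1\n' > sample.tsv
# '1d' deletes the first line (the header); all data rows pass through unchanged
sed 1d sample.tsv > sample_noheader.tsv
cat sample_noheader.tsv   # prints the single data row
```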

Now let's look at the code.

import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Evergreen {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local") // run in local mode
    val BASEDIR = "hdfs://pc1:9000/" // HDFS input
    //val BASEDIR = "file:///home/chenjie/" // local input
    //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar")) // run on a cluster
    val sc = new SparkContext(sparkConf) // initialize the SparkContext
    val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") // load the data
    println("rawData.first()=" + rawData.first()) // print the first record
    //"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"
    // "4042"
    // "{""title"":""IBM hic calies"",
    // ""body"":""A sign the tahe cwlett Packard Co t last."",
    // ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"
    // "business" "0.789131" "2.055555556" "0.676470588" "0.205882353" "0.047058824" "0.023529412"
    // "0.443783175" "0" "0" "0.09077381" "0" "0.245831182" "0.003883495" "1" "1" "24" "0" "5424"
    // "170" "8" "0.152941176" "0.079129575" "0"
The code above loads the dataset and inspects the first record. Each record contains the URL, the page ID, the raw text content, and the category assigned to the page. The next 22 columns contain assorted numeric or categorical features. The last column is the target: 1 means evergreen (long-lasting), 0 means ephemeral.
Because of formatting issues in the raw data we need to clean it, stripping the extra quotation marks and replacing the missing values marked with ? by 0.
val records = rawData.map(line => line.split("\t"))
println(records.first())

val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", "")) // strip the extra quotes
  val label = trimmed(r.size - 1).toInt // the last column holds the label
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble) // replace '?' with 0.0
  LabeledPoint(label, Vectors.dense(features))
}
data.cache()
val numData = data.count
println("numData=" + numData)
//numData=7395
val nbData = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(  d => if(d == "?") 0.0 else d.toDouble)
    .map( d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
//Before processing the dataset further, note that some numeric features take negative values. Naive Bayes
//requires non-negative feature values and throws an exception when it encounters one, so we build a separate
//input dataset for naive Bayes in which negative feature values are set to 0.
The code in the sections below is added to the main function step by step.
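The row-cleaning steps above can be sketched on a single hypothetical raw row without Spark (the column values here are made up for illustration):

```scala
// One hypothetical raw row: quoted fields, '?' for a missing value, the label in the last column
val raw = Array("\"url\"", "\"4042\"", "\"{}\"", "\"business\"",
                "\"0.5\"", "\"?\"", "\"-0.2\"", "\"1\"")
val trimmed = raw.map(_.replaceAll("\"", ""))             // strip the extra quotes
val label = trimmed(raw.length - 1).toInt                 // last column is the label
val features = trimmed.slice(4, raw.length - 1)           // numeric feature columns
  .map(d => if (d == "?") 0.0 else d.toDouble)            // replace '?' with 0.0
val nbFeatures = features.map(d => if (d < 0) 0.0 else d) // clamp negatives for naive Bayes
println(label)                      // 1
println(features.mkString(","))     // 0.5,0.0,-0.2
println(nbFeatures.mkString(","))   // 0.5,0.0,0.0
```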

3. Training Classification Models

//------------Training the classification models------------------------------------------------------------------
val numItetations = 10
val maxTreeDepth = 5
val lrModel = LogisticRegressionWithSGD.train(data, numItetations)
val svmModel = SVMWithSGD.train(data, numItetations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
//For the decision tree, the mode (Algo) is set to Classification and the Entropy impurity measure is used
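The Entropy impurity used here measures how mixed the labels in a tree node are. A minimal sketch of the formula itself (not MLlib's implementation, just the textbook definition on made-up label counts):

```scala
// Entropy impurity for a node: -sum over classes of p_i * log2(p_i)
def entropy(counts: Seq[Int]): Double = {
  val total = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / total
    -p * (math.log(p) / math.log(2)) // log base 2
  }.sum
}
println(entropy(Seq(5, 5)))  // maximally mixed node: 1.0
println(entropy(Seq(10, 0))) // pure node: 0.0
```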

4. Using the Classification Models

val dataPoint = data.first()
val trueLabel = dataPoint.label
println("true label: " + trueLabel)
val prediction1 = lrModel.predict(dataPoint.features)
val prediction2 = svmModel.predict(dataPoint.features)
val prediction3 = nbModel.predict(dataPoint.features)
val prediction4 = dtModel.predict(dataPoint.features)
println("lrModel prediction: " + prediction1)
println("svmModel prediction: " + prediction2)
println("nbModel prediction: " + prediction3)
println("dtModel prediction: " + prediction4)
/*
 * true label: 0.0
 * lrModel prediction: 1.0
 * svmModel prediction: 1.0
 * nbModel prediction: 1.0
 * dtModel prediction: 0.0
 */
//We can also predict for a whole RDD[Vector] at once
/* val predictions = lrModel.predict(data.map(lp => lp.features))
   predictions.take(5).foreach(println) */

5. Evaluating the Performance of the Classification Models

5.1 Prediction accuracy and error rate

//--------Evaluating performance: prediction accuracy and error rate--------------------------------
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum

val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0 // the tree outputs a score, so threshold it at 0.5
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData
println("lrModel accuracy: " + lrAccuracy)
println("svmModel accuracy: " + svmAccuracy)
println("nbModel accuracy: " + nbAccuracy)
println("dtModel accuracy: " + dtAccuracy)
/*
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 */

5.2 Precision and recall

//--------Evaluating performance: precision and recall--------------------------------
/** Precision measures the quality of the results; recall measures their completeness.
  *
  * For a binary classification problem:
  *
  *                      true positives (samples correctly predicted as class 1)
  *   precision = ------------------------------------------------------------------------
  *               true positives + false positives (samples wrongly predicted as class 1)
  *
  *                      true positives (samples correctly predicted as class 1)
  *   recall    = ------------------------------------------------------------------------
  *               true positives + false negatives (samples wrongly predicted as class 0)
  *
  * The area under the precision-recall (PR) curve is the average precision.
  */
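These two fractions can be computed directly from (prediction, label) pairs. A minimal sketch with made-up predictions:

```scala
// Hypothetical (prediction, label) pairs from a binary classifier
val predictionAndLabels = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
val tp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 1.0 } // true positives
val fp = predictionAndLabels.count { case (p, l) => p == 1.0 && l == 0.0 } // false positives
val fn = predictionAndLabels.count { case (p, l) => p == 0.0 && l == 1.0 } // false negatives
val precision = tp.toDouble / (tp + fp) // quality of the positive predictions
val recall = tp.toDouble / (tp + fn)    // completeness of the positive predictions
println(f"precision=$precision%.2f recall=$recall%.2f") // precision=0.67 recall=0.67
```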
val metrics = Seq(lrModel, svmModel).map{ model =>
  val scoreAndLabels = data.map{  point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics = Seq(nbModel).map{ model =>
  val scoreAndLabels = nbData.map{  point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics = Seq(dtModel).map{ model =>
  val scoreAndLabels = data.map { point => // evaluate the tree on the same dataset it was trained on
    val score  = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach{ case (model,pr,roc) =>
  println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
}

//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%

5.3 ROC curve and AUC

//--------Evaluating performance: ROC curve and AUC--------------------------------
/** The ROC curve is conceptually similar to the PR curve. It is a graphical depiction of a classifier's
  * true positive rate against its false positive rate.
  *
  *                             true positives (samples correctly predicted as class 1)
  *   true positive rate = ------------------------------------------------------------------------
  *                        true positives + false negatives (samples wrongly predicted as class 0)
  *
  * The TPR is similar to recall and is also called sensitivity.
  *
  * The ROC curve shows the trade-off a classifier makes between TPR and FPR at different decision
  * thresholds. The area under the ROC curve, known as the AUC, summarizes this as a single average value.
  */
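A sketch of TPR and FPR at one decision threshold (the scores below are made up); sweeping the threshold from high to low is what traces out the ROC curve:

```scala
// Hypothetical (score, label) pairs from a binary classifier
val scored = Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0), (0.2, 0.0))
val threshold = 0.5
val preds = scored.map { case (s, l) => (if (s >= threshold) 1.0 else 0.0, l) }
val tp = preds.count { case (p, l) => p == 1.0 && l == 1.0 }
val fn = preds.count { case (p, l) => p == 0.0 && l == 1.0 }
val fp = preds.count { case (p, l) => p == 1.0 && l == 0.0 }
val tn = preds.count { case (p, l) => p == 0.0 && l == 0.0 }
val tpr = tp.toDouble / (tp + fn) // sensitivity: fraction of class-1 samples caught
val fpr = fp.toDouble / (fp + tn) // fraction of class-0 samples falsely flagged
println(s"TPR=$tpr FPR=$fpr") // TPR=0.5 FPR=0.5
```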

6. Improving Model Performance and Tuning Parameters

6.1 Feature standardization

//------Improving model performance and tuning parameters-------------------------------------------
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
val matrixSummary = matrix.computeColumnSummaryStatistics()
println("column means:")
println(matrixSummary.mean)
//[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531]
println("column minima:")
println(matrixSummary.min)
//[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]
println("column maxima:")
println(matrixSummary.max)
//[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0]
println("column variances:")
println(matrixSummary.variance)
//[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
println("non-zero entries per column:")
println(matrixSummary.numNonzeros)
//[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]
//The second column's mean and variance are much higher than the others'. To make the data better match the
//assumptions of our models, we can standardize each feature so that it has zero mean and unit standard
//deviation: subtract the column mean from each feature value, then divide by the column's standard deviation.
//Spark's StandardScaler does this for us.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
println("before standardization: " + data.first().features)
println("after standardization: " + scaledData.first().features)
//before standardization:[0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
//after standardization:[1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]
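The transformation itself is simple. A sketch on one feature column (values made up; the variance here is the sample variance, which is what MLlib's column summary reports):

```scala
val column = Seq(2.0, 4.0, 6.0)     // one hypothetical feature column
val mean = column.sum / column.size // 4.0
val variance = column.map(x => (x - mean) * (x - mean)).sum / (column.size - 1) // sample variance
val std = math.sqrt(variance)       // 2.0
val scaled = column.map(x => (x - mean) / std) // zero mean, unit standard deviation
println(scaled.mkString(","))       // -1.0,0.0,1.0
```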
//Now we retrain on the standardized data. Only logistic regression is retrained here, because decision trees and naive Bayes are not affected by feature standardization.
val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
val lrTotalCorrectScaled = scaledData.map { point =>
  if (lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPredictionsVsTrue = scaledData.map { point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR()
val lrRoc = lrMetricsScaled.areaUnderROC()
println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaled * 100}%2.4f%%\n Area under PR : ${lrPr * 100.0}%2.4f%%,Area under ROC: ${lrRoc * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:62.0419%
//Area under PR : 72.7254%,Area under ROC: 61.9663%
//Compare with the earlier results:
//lrModel accuracy: 0.5146720757268425
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//Accuracy and AUC have improved considerably; this is the effect of feature standardization.

6.2 Using additional features

//-------------Additional features--------------------------------------------------
//So far we have used only part of the data's features.
val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val numCategories = categories.size
println(categories)
println("number of categories: " + numCategories)
//Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
//number of categories: 14
//Next we represent the category feature as a vector of length 14: for each sample, the dimension at the
//sample's category index is set to 1 and all others to 0. We then treat this new feature vector just like
//the other numeric feature vectors.
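In isolation, this 1-of-k (one-hot) encoding can be sketched as follows (the category names here are hypothetical):

```scala
// Map each category name to an index, then to a 1-of-k vector
val cats = Seq("business", "sports", "weather").zipWithIndex.toMap
def oneHot(category: String): Array[Double] = {
  val v = Array.ofDim[Double](cats.size) // all zeros
  v(cats(category)) = 1.0                // flip on the dimension for this category
  v
}
println(oneHot("sports").mkString(",")) // 0.0,1.0,0.0
```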
val dataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + dataCategories.first())
//first row:(0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
//The category feature has been expanded into a 14-dimensional vector.
val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCasts = dataCategories.map( lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features))
)
scaledDataCasts.cache()
println("after standardization: " + scaledDataCasts.first())
//after standardization:(0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])
val nbDataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + nbDataCategories.first())
//first row:(0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
nbDataCategories.cache()

val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression with category features, standardized
val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations) // SVM with category features, standardized
val nbModelScaledCats = NaiveBayes.train(nbDataCategories) // naive Bayes with category features (not standardized)
val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree with category features (not standardized)
//Note: decision trees and naive Bayes are unaffected by feature standardization. Moreover, standardization
//introduces negative values, which naive Bayes cannot accept, so both are trained on the unstandardized data.
val lrTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum

val svmTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (svmModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrectScaledCats = nbDataCategories.map { point =>
  if (nbModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrectScaledCats = dataCategories.map { point =>
  val score = dtModelScaledCats.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
val nbAccuracyScaledCats = nbTotalCorrectScaledCats / numData
val dtAccuracyScaledCats = dtTotalCorrectScaledCats / numData
println(" lrModel accuracy: " + lrAccuracyScaledCats)
println("svmModel accuracy: " + svmAccuracyScaledCats)
println(" nbModel accuracy: " + nbAccuracyScaledCats)
println(" dtModel accuracy: " + dtAccuracyScaledCats)
/*
 * previously:
 *  lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 *  nbModel accuracy: 0.5803921568627451
 *  dtModel accuracy: 0.6482758620689655
 */
/*
 * with the category features added:
 *  lrModel accuracy: 0.6657200811359026
 * svmModel accuracy: 0.6645030425963488
 *  nbModel accuracy: 0.5832319134550372
 *  dtModel accuracy: 0.6655848546315077
 */
val lrPredictionsVsTrueScaledCats = scaledDataCasts.map { point => // evaluate on the standardized data the model was trained on
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueScaledCats)
val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy:${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR : ${lrPrScaledCats * 100.0}%2.4f%%,Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:66.5720%
//Area under PR : 75.6015%,Area under ROC: 52.1977%
val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map { model =>
  val scoreAndLabels = scaledDataCasts.map { point => // use the standardized data these models were trained on
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics2 = Seq(nbModelScaledCats).map{ model =>
  val scoreAndLabels = nbDataCategories.map{  point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics2 = Seq(dtModelScaledCats).map{ model =>
  val scoreAndLabels = dataCategories.map { point =>
    val score  = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
allMetrics2.foreach{ case (model,pr,roc) =>
  println(f"$model, Area under PR : ${pr * 100.0}%2.4f%%,Area under ROC: ${roc * 100.0}%2.4f%%")
}

//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
//LogisticRegressionModel, Area under PR : 75.6015%,Area under ROC: 52.1977%
//SVMModel, Area under PR : 75.5180%,Area under ROC: 54.1606%
//NaiveBayesModel, Area under PR : 68.3386%,Area under ROC: 58.6397%
//DecisionTreeModel, Area under PR : 75.8784%,Area under ROC: 66.5005%

6.3 Using the correct form of data

//--------Using the correct form of data----------------------------------------------
//Now we use only the category feature, i.e. only the first 14 dimensions, because the 1-of-k encoded
//categorical features better match the assumptions of the naive Bayes model.
val nbDataOnlyCategories = records.map{ r =>
  val trimmed = r.map(_.replaceAll("\"",""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val