Notes on "Machine Learning with Spark": Spark classification models (logistic regression, naive Bayes, decision trees, support vector machines)
阿新 • Published: 2019-01-09
1. Types of classification models
1.1 Linear models
1.1.1 Logistic regression
1.1.2 Linear support vector machines
1.2 The naive Bayes model
1.3 The decision tree model
2. Extracting suitable features from the data
Classification models in MLlib operate on LabeledPoint(label: Double, features: Vector) objects, which wrap the target variable (the label) together with the feature vector.
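As a quick illustration of that shape, here is a minimal plain-Scala sketch. The Point case class is a hypothetical stand-in for MLlib's LabeledPoint, used only so the example runs without a Spark installation:

```scala
// Hypothetical stand-in for org.apache.spark.mllib.regression.LabeledPoint,
// illustrating the (label, features) pair that the classifiers consume.
case class Point(label: Double, features: Array[Double])

// One cleaned row: label 0.0 (ephemeral) plus a few of its numeric features.
val p = Point(0.0, Array(0.789131, 2.055555556, 0.676470588))
println(p.label)           // 0.0
println(p.features.length) // 3
```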
Extracting features from the Kaggle/StumbleUpon evergreen classification dataset
This dataset concerns whether a recommended web page is ephemeral (short-lived: it quickly stops trending) or evergreen (popular for a long time).
The header row can be removed with: sed 1d train.tsv > train_noheader.tsv
Now let's look at the code.
import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD, NaiveBayes, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.optimization.{SimpleUpdater, SquaredL2Updater, Updater}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, Impurity}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object Evergreen {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("Evergreen").setMaster("local") // run in local mode
    val BASEDIR = "hdfs://pc1:9000/" // HDFS files
    //val BASEDIR = "file:///home/chenjie/" // local files
    //val sparkConf = new SparkConf().setAppName("Evergreen-cluster").setMaster("spark://pc1:7077").setJars(List("untitled2.jar")) // run in cluster mode
    val sc = new SparkContext(sparkConf) // initialize the SparkContext
    val rawData = sc.textFile(BASEDIR + "train_noheader.tsv") // load the data
    println("rawData.first()=" + rawData.first()) // print the first record
    // "http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html" "4042"
    // "{""title"":""IBM ..."", ""body"":""A sign that ... Hewlett Packard Co ..."",
    //   ""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}"
    // "business" "0.789131" "2.055555556" "0.676470588" "0.205882353" "0.047058824" "0.023529412" "0.443783175" "0" "0" "0.09077381" "0" "0.245831182" "0.003883495" "1" "1" "24" "0" "5424" "170" "8" "0.152941176" "0.079129575" "0"
The code above loads the dataset and inspects the first record. Note that each record contains the URL, the page ID, the raw text content, and the category assigned to the page. The next 22 columns contain various numeric or categorical features. The last column is the target value: 1 for evergreen, 0 for ephemeral.
Because of formatting issues, the data needs cleaning: during processing we strip the extra quotation marks, and replace the missing values, encoded as ?, with 0.
The following code is added to the main function, step by step:

val records = rawData.map(line => line.split("\t"))
println(records.first())
val data = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", "")) // strip the extra quotes
  val label = trimmed(r.size - 1).toInt // the last column holds the class label
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble) // replace ? with 0.0
  LabeledPoint(label, Vectors.dense(features))
}
data.cache()
val numData = data.count
println("numData=" + numData) // numData=7395
val nbData = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val features = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  LabeledPoint(label, Vectors.dense(features))
}
// Before processing the dataset further, note that the numeric data contains negative feature values.
// The naive Bayes model requires non-negative features and throws an exception when it encounters a
// negative one, so we build a separate input dataset for naive Bayes with negative feature values set to 0.
3. Training the classification models
//------------ Training the classification models ------------
val numItetations = 10
val maxTreeDepth = 5
val lrModel = LogisticRegressionWithSGD.train(data, numItetations)
val svmModel = SVMWithSGD.train(data, numItetations)
val nbModel = NaiveBayes.train(nbData)
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
// For the decision tree, the mode (Algo) is set to Classification and the Entropy impurity measure is used.
4. Using the classification models
val dataPoint = data.first()
val trueLabel = dataPoint.label
println("true label: " + trueLabel)
val prediction1 = lrModel.predict(dataPoint.features)
val prediction2 = svmModel.predict(dataPoint.features)
val prediction3 = nbModel.predict(dataPoint.features)
val prediction4 = dtModel.predict(dataPoint.features)
println("lrModel prediction: " + prediction1)
println("svmModel prediction: " + prediction2)
println("nbModel prediction: " + prediction3)
println("dtModel prediction: " + prediction4)
/*
 * true label: 0.0
 * lrModel prediction: 1.0
 * svmModel prediction: 1.0
 * nbModel prediction: 1.0
 * dtModel prediction: 0.0
 */
// An RDD[Vector] can also be passed to predict as a whole:
/*
val predictions = lrModel.predict(data.map(lp => lp.features))
predictions.take(5).foreach(println)
*/
5. Evaluating classification model performance
5.1 Accuracy and prediction error
//-------- Evaluating model performance: accuracy and prediction error --------
val lrTotalCorrect = data.map { point =>
  if (lrModel.predict(point.features) == point.label) 1 else 0
}.sum
val svmTotalCorrect = data.map { point =>
  if (svmModel.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrect = nbData.map { point =>
  if (nbModel.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrect = data.map { point =>
  val score = dtModel.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracy = lrTotalCorrect / numData
val svmAccuracy = svmTotalCorrect / numData
val nbAccuracy = nbTotalCorrect / numData
val dtAccuracy = dtTotalCorrect / numData
println("lrModel accuracy: " + lrAccuracy)
println("svmModel accuracy: " + svmAccuracy)
println("nbModel accuracy: " + nbAccuracy)
println("dtModel accuracy: " + dtAccuracy)
/*
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 */
5.2 Precision and recall
//-------- Evaluating model performance: precision and recall --------
/** Precision measures the quality of the results; recall measures their completeness.
 *
 * In a binary classification problem:
 *   precision = TP / (TP + FP)
 *     where TP = true positives (samples of class 1 predicted correctly)
 *     and   FP = false positives (samples incorrectly predicted as class 1)
 *   recall    = TP / (TP + FN)
 *     where FN = false negatives (samples incorrectly predicted as class 0)
 *
 * The area under the precision-recall (PR) curve is the average precision.
 */
val metrics = Seq(lrModel, svmModel).map { model =>
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics = Seq(nbModel).map { model =>
  val scoreAndLabels = nbData.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics = Seq(dtModel).map { model =>
  val scoreAndLabels = data.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics = metrics ++ nbMetrics ++ dtMetrics
allMetrics.foreach { case (model, pr, roc) =>
  println(f"$model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
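The two formulas can be checked by hand on a handful of toy predictions. This is a standalone plain-Scala sketch (the prediction/label pairs are hypothetical), independent of Spark's BinaryClassificationMetrics:

```scala
// Toy (prediction, trueLabel) pairs, hypothetical values for illustration only.
val predVsTrue = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0))

val tp = predVsTrue.count { case (p, t) => p == 1.0 && t == 1.0 } // true positives: 2
val fp = predVsTrue.count { case (p, t) => p == 1.0 && t == 0.0 } // false positives: 1
val fn = predVsTrue.count { case (p, t) => p == 0.0 && t == 1.0 } // false negatives: 1

val precision = tp.toDouble / (tp + fp) // 2 / 3
val recall    = tp.toDouble / (tp + fn) // 2 / 3
println(f"precision = $precision%.4f, recall = $recall%.4f")
```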
5.3 ROC curve and AUC
//-------- Evaluating model performance: ROC curve and AUC --------
/** The ROC curve is conceptually similar to the PR curve: it is a graphical depiction of a
 * classifier's true positive rate against its false positive rate.
 *
 *   TPR = TP / (TP + FN)   (similar to recall; also called sensitivity)
 *   FPR = FP / (FP + TN)   (FP = false positives, TN = true negatives)
 *
 * The ROC curve shows the trade-off between TPR and FPR at different decision thresholds.
 * The area under the ROC curve, known as AUC, summarizes the curve in a single number.
 */
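To make "TPR against FPR at different thresholds" concrete, here is a standalone plain-Scala sketch that sweeps a decision threshold over a few hypothetical (score, label) pairs and computes one ROC point per threshold:

```scala
// Hypothetical (score, trueLabel) pairs, for illustration only.
val scoreAndLabel = Seq((0.9, 1.0), (0.7, 1.0), (0.6, 0.0), (0.4, 1.0), (0.2, 0.0))

// One point of the ROC curve: classify as positive when score >= threshold,
// then compute (FPR, TPR) from the resulting confusion counts.
def rocPoint(threshold: Double): (Double, Double) = {
  val tp = scoreAndLabel.count { case (s, l) => s >= threshold && l == 1.0 }
  val fn = scoreAndLabel.count { case (s, l) => s <  threshold && l == 1.0 }
  val fp = scoreAndLabel.count { case (s, l) => s >= threshold && l == 0.0 }
  val tn = scoreAndLabel.count { case (s, l) => s <  threshold && l == 0.0 }
  (fp.toDouble / (fp + tn), tp.toDouble / (tp + fn))
}

Seq(0.8, 0.5, 0.1).foreach { t =>
  val (fpr, tpr) = rocPoint(t)
  println(f"threshold $t%.1f: FPR = $fpr%.2f, TPR = $tpr%.2f")
}
```

Lowering the threshold moves the point toward (1, 1); raising it moves the point toward (0, 0). That trajectory is exactly what BinaryClassificationMetrics integrates to obtain the AUC.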
6. Improving model performance and tuning parameters
6.1 Feature standardization
//------ Improving model performance and tuning parameters ------
val vectors = data.map(lp => lp.features)
val matrix = new RowMatrix(vectors)
val matrixSummary = matrix.computeColumnSummaryStatistics()
println("column means:")
println(matrixSummary.mean)
//[0.41225805299526774,2.76182319198661,0.46823047328613876,0.21407992638350257,0.0920623607189991,0.04926216043908034,2.255103452212025,-0.10375042752143329,0.0,0.05642274498417848,0.02123056118999324,0.23377817665490225,0.2757090373659231,0.615551048005409,0.6603110209601082,30.077079107505178,0.03975659229208925,5716.598242055454,178.75456389452327,4.960649087221106,0.17286405047031753,0.10122079189276531]
println("column minima:")
println(matrixSummary.min)
//[0.0,0.0,0.0,0.0,0.0,0.0,0.0,-1.0,0.0,0.0,0.0,0.045564223,-1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0]
println("column maxima:")
println(matrixSummary.max)
//[0.999426,363.0,1.0,1.0,0.980392157,0.980392157,21.0,0.25,0.0,0.444444444,1.0,0.716883117,113.3333333,1.0,1.0,100.0,1.0,207952.0,4997.0,22.0,1.0,1.0]
println("column variances:")
println(matrixSummary.variance)
//[0.10974244167559023,74.30082476809655,0.04126316989120245,0.021533436332001124,0.009211817450882448,0.005274933469767929,32.53918714591818,0.09396988697611537,0.0,0.001717741034662896,0.020782634824610638,0.0027548394224293023,3.6837889196744116,0.2366799607085986,0.22433071201674218,415.87855895438463,0.03818116876739597,7.877330081138441E7,32208.11624742624,10.453009045764313,0.03359363403832387,0.0062775328842146995]
println("non-zero entries per column:")
println(matrixSummary.numNonzeros)
//[5053.0,7354.0,7172.0,6821.0,6160.0,5128.0,7350.0,1257.0,0.0,7362.0,157.0,7395.0,7355.0,4552.0,4883.0,7347.0,294.0,7378.0,7395.0,6782.0,6868.0,7235.0]
// The second column's variance and mean are much larger than the others'. To make the data better fit
// the model's assumptions, we can standardize each feature to zero mean and unit standard deviation.
// Concretely: subtract the column mean from each feature value, then divide by the column standard deviation.
// Spark's StandardScaler performs these operations conveniently.
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors)
val scaledData = data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
println("before standardization: " + data.first().features)
println("after standardization: " + scaledData.first().features)
//before standardization: [0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575]
//after standardization: [1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967]
// Retrain the model on the standardized data. Only logistic regression is retrained here,
// because decision trees and naive Bayes are not affected by feature standardization.
val lrModelScaled = LogisticRegressionWithSGD.train(scaledData, numItetations)
val lrTotalCorrectScaled = scaledData.map { point =>
  if (lrModelScaled.predict(point.features) == point.label) 1 else 0
}.sum
val lrAccuracyScaled = lrTotalCorrectScaled / numData
val lrPredictionsVsTrue = scaledData.map { point =>
  (lrModelScaled.predict(point.features), point.label)
}
val lrMetricsScaled = new BinaryClassificationMetrics(lrPredictionsVsTrue)
val lrPr = lrMetricsScaled.areaUnderPR()
val lrRoc = lrMetricsScaled.areaUnderROC()
println(f"${lrModelScaled.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaled * 100}%2.4f%%\n Area under PR: ${lrPr * 100.0}%2.4f%%, Area under ROC: ${lrRoc * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:62.0419%
// Area under PR : 72.7254%,Area under ROC: 61.9663%
// Compare with the earlier results:
//lrModel accuracy: 0.5146720757268425
// LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
// Accuracy and the area under ROC improved considerably; this is the effect of feature standardization.
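What StandardScaler does per column can be reproduced by hand: subtract the column mean and divide by the column standard deviation. A minimal standalone sketch on hypothetical toy values (note: MLlib's column statistics use the sample variance, i.e. dividing by n - 1, which this sketch assumes):

```scala
// Hypothetical column of feature values.
val col = Array(2.0, 4.0, 6.0, 8.0)

val mean = col.sum / col.length                                            // 5.0
val variance = col.map(x => math.pow(x - mean, 2)).sum / (col.length - 1)  // sample variance
val std = math.sqrt(variance)

// Standardized column: zero mean, unit standard deviation.
val scaled = col.map(x => (x - mean) / std)
println(scaled.mkString(", "))
```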
6.2 Using additional features
//------------- Additional features -------------
// So far we have used only some of the features in the dataset.
val categories = records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val numCategories = categories.size
println(categories)
println("number of categories: " + numCategories)
//Map("weather" -> 0, "sports" -> 1, "unknown" -> 10, "computer_internet" -> 11, "?" -> 8, "culture_politics" -> 9, "religion" -> 4, "recreation" -> 7, "arts_entertainment" -> 5, "health" -> 12, "law_crime" -> 6, "gaming" -> 13, "business" -> 2, "science_technology" -> 3)
//number of categories: 14
// We now represent the category feature as a vector of length 14: for each sample, the dimension
// matching its category index is set to 1 and all others to 0. We treat this new feature vector
// the same way as the other numeric feature vectors.
val dataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + dataCategories.first())
//first row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
// The category feature has now been turned into a 14-dimensional vector.
val scalerCats = new StandardScaler(withMean = true, withStd = true).fit(dataCategories.map(lp => lp.features))
val scaledDataCasts = dataCategories.map(lp =>
  LabeledPoint(lp.label, scalerCats.transform(lp.features))
)
scaledDataCasts.cache()
println("after standardization: " + scaledDataCasts.first())
//after standardization: (0.0,[-0.02326210589837061,-0.23272797709480803,2.7207366564548514,-0.2016540523193296,-0.09914991930875496,-0.38181322324318134,-0.06487757239262681,-0.4464212047941535,-0.6807527904251456,-0.22052688457880879,-0.028494000387023734,-0.20418221057887365,-0.2709990696925828,-0.10189469097220732,1.1376473364976751,-0.08193557169294784,1.0251398128933333,-0.05586356442541853,-0.4688932531289351,-0.35430532630793654,-0.3175352172363122,0.3384507982396541,0.0,0.8288221733153222,-0.14726894334628504,0.22963982357812907,-0.14162596909880876,0.7902380499177364,0.7171947294529865,-0.29799681649642484,-0.2034625779299476,-0.03296720969690467,-0.04878112975579767,0.9400699751165406,-0.10869848852526329,-0.27882078231369967])
val nbDataCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val categoryFeatures = Array.ofDim[Double](numCategories)
  categoryFeatures(categoryIdx) = 1.0
  val otherFeatures = trimmed.slice(4, r.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
    .map(d => if (d < 0) 0.0 else d)
  val features = categoryFeatures ++ otherFeatures
  LabeledPoint(label, Vectors.dense(features))
}
println("first row: " + nbDataCategories.first())
//first row: (0.0,[0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176,0.079129575])
nbDataCategories.cache()
val lrModelScaledCats = LogisticRegressionWithSGD.train(scaledDataCasts, numItetations) // logistic regression with category features, standardized
val svmModelScaledCats = SVMWithSGD.train(scaledDataCasts, numItetations) // SVM with category features, standardized
val nbModelScaledCats = NaiveBayes.train(nbDataCategories) // naive Bayes with category features
val dtModelScaledCats = DecisionTree.train(dataCategories, Algo.Classification, Entropy, maxTreeDepth) // decision tree with category features
// Note: decision trees and naive Bayes are unaffected by standardization, and standardization introduces
// negative values, which naive Bayes cannot handle, so those two models are trained on the unscaled data.
val lrTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (lrModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val svmTotalCorrectScaledCats = scaledDataCasts.map { point =>
  if (svmModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val nbTotalCorrectScaledCats = nbDataCategories.map { point =>
  if (nbModelScaledCats.predict(point.features) == point.label) 1 else 0
}.sum
val dtTotalCorrectScaledCats = dataCategories.map { point =>
  val score = dtModelScaledCats.predict(point.features)
  val predicted = if (score > 0.5) 1 else 0
  if (predicted == point.label) 1 else 0
}.sum
val lrAccuracyScaledCats = lrTotalCorrectScaledCats / numData
val svmAccuracyScaledCats = svmTotalCorrectScaledCats / numData
val nbAccuracyScaledCats = nbTotalCorrectScaledCats / numData
val dtAccuracyScaledCats = dtTotalCorrectScaledCats / numData
println("new lrModel accuracy: " + lrAccuracyScaledCats)
println("new svmModel accuracy: " + svmAccuracyScaledCats)
println("new nbModel accuracy: " + nbAccuracyScaledCats)
println("new dtModel accuracy: " + dtAccuracyScaledCats)
/* Previously:
 * lrModel accuracy: 0.5146720757268425
 * svmModel accuracy: 0.5146720757268425
 * nbModel accuracy: 0.5803921568627451
 * dtModel accuracy: 0.6482758620689655
 */
/* Now:
 * new lrModel accuracy: 0.6657200811359026
 * new svmModel accuracy: 0.6645030425963488
 * new nbModel accuracy: 0.5832319134550372
 * new dtModel accuracy: 0.6655848546315077
 */
val lrPredictionsVsTrueScaledCats = dataCategories.map { point =>
  (lrModelScaledCats.predict(point.features), point.label)
}
val lrMetricsScaledCats = new BinaryClassificationMetrics(lrPredictionsVsTrueScaledCats)
val lrPrScaledCats = lrMetricsScaledCats.areaUnderPR()
val lrRocScaledCats = lrMetricsScaledCats.areaUnderROC()
println(f"${lrModelScaledCats.getClass.getSimpleName}\n Accuracy: ${lrAccuracyScaledCats * 100}%2.4f%%\n Area under PR: ${lrPrScaledCats * 100.0}%2.4f%%, Area under ROC: ${lrRocScaledCats * 100.0}%2.4f%%")
//LogisticRegressionModel
//Accuracy:66.5720%
//Area under PR : 75.6015%,Area under ROC: 52.1977%
val metrics2 = Seq(lrModelScaledCats, svmModelScaledCats).map { model =>
  val scoreAndLabels = dataCategories.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val nbMetrics2 = Seq(nbModelScaledCats).map { model =>
  val scoreAndLabels = nbDataCategories.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val dtMetrics2 = Seq(dtModelScaledCats).map { model =>
  val scoreAndLabels = dataCategories.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR(), metrics.areaUnderROC())
}
val allMetrics2 = metrics2 ++ nbMetrics2 ++ dtMetrics2
allMetrics2.foreach { case (model, pr, roc) =>
  println(f"new $model, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
// Previously:
//LogisticRegressionModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//SVMModel, Area under PR : 75.6759%,Area under ROC: 50.1418%
//NaiveBayesModel, Area under PR : 68.0851%,Area under ROC: 58.3559%
//DecisionTreeModel, Area under PR : 74.3081%,Area under ROC: 64.8837%
// Now:
//new LogisticRegressionModel, Area under PR : 75.6015%,Area under ROC: 52.1977%
//new SVMModel, Area under PR : 75.5180%,Area under ROC: 54.1606%
//new NaiveBayesModel, Area under PR : 68.3386%,Area under ROC: 58.6397%
//new DecisionTreeModel, Area under PR : 75.8784%,Area under ROC: 66.5005%
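The 1-of-k encoding used for the category column above can be sketched without Spark. This is a standalone illustration; the category names and the index map below are hypothetical stand-ins for the map built from records.map(r => r(3)):

```scala
// Hypothetical category-to-index map, mirroring the shape of
// records.map(r => r(3)).distinct().collect().zipWithIndex.toMap
val catIndex = Map("business" -> 0, "sports" -> 1, "weather" -> 2)

// 1-of-k encoding: a zero vector with a single 1.0 at the category's index.
def oneHot(cat: String): Array[Double] = {
  val v = Array.ofDim[Double](catIndex.size) // initialized to all zeros
  v(catIndex(cat)) = 1.0
  v
}

println(oneHot("sports").mkString(",")) // 0.0,1.0,0.0
```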
6.3 Using the correct form of data
//-------- Using the correct form of data --------
// Now we use only the category features, i.e. only the first 14 dimensions of the vector,
// because the 1-of-k encoded categorical features better match the naive Bayes model.
val nbDataOnlyCategories = records.map { r =>
  val trimmed = r.map(_.replaceAll("\"", ""))
  val label = trimmed(r.size - 1).toInt
  val categoryIdx = categories(r(3))
  val