Spark MLlib Classification Algorithms: Decision Tree Learning
(1) Basic concepts of decision trees
A decision tree (Decision Tree) is a decision-analysis method that, given the probabilities of various outcomes, builds a tree of decisions to estimate the probability that the expected net present value is greater than or equal to zero, evaluate project risk, and judge feasibility; it is a graphical technique that applies probability analysis directly. Because the decision branches, when drawn, resemble the limbs of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Entropy measures the disorder of a system, and tree-growing algorithms such as ID3, C4.5, and C5.0 use it; this measure is based on the concept of entropy from information theory. Information gain is then used to rank attributes by priority when choosing splits.
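To make the entropy and information-gain ideas above concrete, here is a minimal plain-Scala sketch, independent of MLlib (the object and method names are illustrative only, not part of any library):

```scala
object EntropyDemo {
  // Shannon entropy (base 2) of a class-count distribution
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * (math.log(p) / math.log(2))
    }.sum
  }

  // Information gain of a split: parent entropy minus the weighted entropy of the children
  def infoGain(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
    val total = parent.sum.toDouble
    entropy(parent) - children.map(ch => (ch.sum / total) * entropy(ch)).sum
  }

  def main(args: Array[String]): Unit = {
    println(entropy(Seq(5, 5)))   // 1.0: a 50/50 two-class node is maximally impure
    println(entropy(Seq(10, 0)))  // 0.0: a pure node
    println(infoGain(Seq(5, 5), Seq(Seq(4, 1), Seq(1, 4)))) // positive: the split reduces impurity
  }
}
```

An attribute whose split produces purer children has higher information gain, which is exactly the ranking criterion ID3-style algorithms use.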
Drawbacks: see http://www.ppvke.com/Blog/archives/25042
2, Data preparation and standardization

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.regression.LabeledPoint

val orig_file = sc.textFile("train_nohead.tsv")
// println(orig_file.first())
val data_file = orig_file.map(_.split("\t")).map { r =>
  val trimmed = r.map(_.replace("\"", ""))
  val label = trimmed(r.length - 1).toDouble
  val features = trimmed.slice(4, r.length - 1).map(d => if (d == "?") 0.0 else d.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}

/* Feature standardization; it seems to make little difference for decision trees */
val vectors = data_file.map(x => x.features)
val rows = new RowMatrix(vectors)
println(rows.computeColumnSummaryStatistics().variance) // variance of each column
val scaler = new StandardScaler(withMean = true, withStd = true).fit(vectors) // standardization
val scaled_data = data_file
  .map(point => LabeledPoint(point.label, scaler.transform(point.features)))
  .randomSplit(Array(0.7, 0.3), 11L) // fix the seed at 11L so every run produces the same split
val data_train = scaled_data(0)
val data_test = scaled_data(1)
3, Model building and evaluation
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy

/* Train the decision tree model */
val maxTreeDepth = 5 // maximum tree depth
val model_DT = DecisionTree.train(data_train, Algo.Classification, Entropy, maxTreeDepth)

/* Accuracy of the decision tree */
val predictionAndLabelDT = data_test.map { point =>
  val predictedLabel = if (model_DT.predict(point.features) > 0.5) 1.0 else 0.0
  (predictedLabel, point.label)
}
val metricsDT = new MulticlassMetrics(predictionAndLabelDT)
println(metricsDT.accuracy) // 0.6273062730627307

/* PR and ROC curves for the decision tree */
val dtMetrics = Seq(model_DT).map { model =>
  val scoreAndLabels = data_test.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (model.getClass.getSimpleName, metrics.areaUnderPR, metrics.areaUnderROC)
}
val allMetrics = dtMetrics
allMetrics.foreach { case (m, pr, roc) =>
  println(f"$m, Area under PR: ${pr * 100.0}%2.4f%%, Area under ROC: ${roc * 100.0}%2.4f%%")
}
/*
DecisionTreeModel, Area under PR: 74.2335%, Area under ROC: 62.4326%
*/
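As a sanity check on what the metrics above report, the same quantities can be computed by hand from (prediction, label) pairs. This plain-Scala sketch (the names are illustrative, not the MLlib API) mirrors accuracy at the fixed 0.5 threshold, plus precision and recall:

```scala
object MetricsDemo {
  // Fraction of pairs where the thresholded prediction equals the label
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (p, l) => p == l }.toDouble / pairs.size

  // (precision, recall) for the positive class (label 1.0)
  def precisionRecall(pairs: Seq[(Double, Double)]): (Double, Double) = {
    val tp = pairs.count { case (p, l) => p == 1.0 && l == 1.0 }
    val fp = pairs.count { case (p, l) => p == 1.0 && l == 0.0 }
    val fn = pairs.count { case (p, l) => p == 0.0 && l == 1.0 }
    (tp.toDouble / (tp + fp), tp.toDouble / (tp + fn))
  }

  def main(args: Array[String]): Unit = {
    // toy (prediction, label) pairs: 3 of 5 correct
    val pairs = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.0, 1.0), (1.0, 1.0))
    println(accuracy(pairs))        // 0.6
    println(precisionRecall(pairs)) // precision 2/3, recall 2/3
  }
}
```

The PR and ROC areas reported by BinaryClassificationMetrics are obtained by sweeping the threshold rather than fixing it at 0.5, which is why the code above passes the raw (score, label) pairs to that class.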
4, Model parameter tuning (both tree depth and impurity measure can be adjusted)
4.1 Building a tuning helper function
import org.apache.spark.mllib.tree.impurity.Impurity
import org.apache.spark.rdd.RDD

/* Tuning helper */
def trainDTWithParams(input: RDD[LabeledPoint], maxDepth: Int, impurity: Impurity) = {
  DecisionTree.train(input, Algo.Classification, impurity, maxDepth)
}
4.2 Evaluating the effect of tree depth (increasing the depth yields a more accurate model, as expected, since the model becomes more complex at greater depth; however, the deeper the tree, the more severely the model overfits the training data)
/* Varying the depth */
val dtResultsEntropy = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = trainDTWithParams(data_train, param, Entropy)
  val scoreAndLabels = data_test.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth", metrics.areaUnderROC)
}
dtResultsEntropy.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
/*
1 tree depth, AUC = 58.57%
2 tree depth, AUC = 60.69%
3 tree depth, AUC = 61.40%
4 tree depth, AUC = 61.30%
5 tree depth, AUC = 62.43%
10 tree depth, AUC = 62.26%
20 tree depth, AUC = 60.59%
*/
4.3 Tuning the impurity measure (the difference is not very pronounced)
import org.apache.spark.mllib.tree.impurity.Gini

/* Varying the impurity measure */
val dtResultsImpurity = Seq(Gini, Entropy).map { param =>
  val model = trainDTWithParams(data_train, 5, param)
  val scoreAndLabels = data_test.map { point =>
    val score = model.predict(point.features)
    (if (score > 0.5) 1.0 else 0.0, point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (s"$param tree depth", metrics.areaUnderROC)
}
dtResultsImpurity.foreach { case (param, auc) =>
  println(f"$param, AUC = ${auc * 100}%2.2f%%")
}
/* The label is the impurity object's default toString:
Gini$@… tree depth, AUC = 62.37%
Entropy$@… tree depth, AUC = 62.43%
*/
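For intuition on the Gini measure compared against entropy above, a small plain-Scala sketch (illustrative only, not the MLlib implementation):

```scala
object GiniDemo {
  // Gini impurity of a class-count distribution: 1 - sum of p_i squared
  def gini(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    1.0 - counts.map { c => val p = c / total; p * p }.sum
  }

  def main(args: Array[String]): Unit = {
    println(gini(Seq(5, 5)))   // 0.5: a 50/50 two-class node is maximally impure
    println(gini(Seq(10, 0)))  // 0.0: a pure node
  }
}
```

Like entropy, Gini impurity is zero for a pure node and maximal for an even class mix, which is why swapping one for the other rarely changes the resulting tree much.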
(3) Cross-validation
1, Partitioning the data sets
Create three data sets: a training set;
an evaluation set (like the test set above, used for tuning model parameters such as lambda and step size);
a test set (not used for training or parameter tuning, only to estimate the model's performance on new data).
2, Common cross-validation methods
A popular approach is K-fold cross-validation, in which the data set is split into K non-overlapping parts. The model is trained on K-1 of the folds and tested on the remaining one; a plain train/test split can be viewed as 2-fold cross-validation.
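The K-fold scheme described above can be sketched in plain Scala (an illustrative helper, not a Spark API; MLlib also ships a helper for this, MLUtils.kFold):

```scala
object KFoldDemo {
  // Assign each of n example indices to one of k non-overlapping folds,
  // returning one (trainIndices, testIndices) pair per fold
  def kFoldIndices(n: Int, k: Int, seed: Long = 11L): Seq[(Seq[Int], Seq[Int])] = {
    val shuffled = new scala.util.Random(seed).shuffle((0 until n).toList)
    (0 until k).map { i =>
      val test = shuffled.zipWithIndex.collect { case (idx, j) if j % k == i => idx }
      val train = shuffled.filterNot(test.toSet)
      (train, test)
    }
  }

  def main(args: Array[String]): Unit = {
    // 10 examples, 5 folds: each fold tests on 2 examples and trains on the other 8
    kFoldIndices(10, 5).foreach { case (train, test) =>
      println(s"train=${train.size} test=${test.size}")
    }
  }
}
```

Each example lands in exactly one test fold, so averaging the K test scores uses every example for evaluation exactly once.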
There are also "leave-one-out cross-validation" and "random sampling". For more, see http://en.wikipedia.org/wiki/Cross-validation_(statistics) .