
Decision Trees for Regression in Spark MLlib



(1) Decision Tree Concepts

1. Comparison of decision tree algorithms (ID3, C4.5, CART):

  1. When choosing the split attribute at the root and internal nodes, ID3 uses information gain as its criterion. The drawback of information gain is that it is biased toward attributes with many distinct values, which in some cases may not carry much useful information.

  2. ID3 can only build decision trees over datasets whose attributes are discrete; the other two algorithms can handle both discrete and continuous attributes.
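To make the bias of information gain concrete, here is a minimal Spark-free Scala sketch (the tiny dataset and attribute values are invented for illustration). An ID-like attribute with a unique value per row ties a genuinely useful attribute on information gain, but C4.5's gain ratio divides by the split's own entropy and penalizes it:

```scala
object GainDemo {
  // Shannon entropy (base 2) of a label sequence
  def entropy(labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values
      .map { g => val p = g.size / n; -p * math.log(p) / math.log(2) }
      .sum
  }

  // Information gain of splitting `labels` by the parallel attribute column `attr`
  def infoGain(attr: Seq[String], labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    val conditional = attr.zip(labels).groupBy(_._1).values
      .map(g => (g.size / n) * entropy(g.map(_._2))).sum
    entropy(labels) - conditional
  }

  // C4.5 gain ratio: information gain normalized by the entropy of the split itself
  def gainRatio(attr: Seq[String], labels: Seq[String]): Double =
    infoGain(attr, labels) / entropy(attr)

  def main(args: Array[String]): Unit = {
    val labels = Seq("yes", "yes", "no", "no")
    val id     = Seq("1", "2", "3", "4") // unique per row: "informative" but useless
    val useful = Seq("a", "a", "b", "b") // actually separates the classes
    println(f"gain(id)=${infoGain(id, labels)}%.2f  gain(useful)=${infoGain(useful, labels)}%.2f")
    println(f"ratio(id)=${gainRatio(id, labels)}%.2f ratio(useful)=${gainRatio(useful, labels)}%.2f")
    // Both attributes reach the maximal gain of 1.0, but gain ratio halves the ID attribute's score
  }
}
```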

2. A worked C4.5 example (reference: http://m.blog.csdn.net/article/details?id=44726921)


C4.5 post-pruning strategy: pessimistic pruning is the main approach (reference: http://www.cnblogs.com/zhangchaoyang/articles/2842490.html)

(2) Applying Spark MLlib Decision Tree Regression

1. Dataset source and description: see http://www.cnblogs.com/ksWorld/p/6891664.html

2. Implementation:

  2.1 Building the input data format:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val file_bike = "hour_nohead.csv"
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  // Features: columns 2 .. length-4; label: the last column (total rental count)
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
}
println(file_tree.first())

// Empty map: treat every feature as continuous
val categoricalFeaturesInfo = Map[Int, Int]()
// trainRegressor(data, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
val model_DT = DecisionTree.trainRegressor(file_tree, categoricalFeaturesInfo, "variance", 5, 32)

  2.2 Model evaluation metrics (MSE, MAE, RMSLE)

val predict_vs_train = file_tree.map { point =>
  (model_DT.predict(point.features), point.label)
  /* point => (math.exp(model_DT.predict(point.features)), math.exp(point.label)) */
}
predict_vs_train.take(5).foreach(println(_))

/* Mean squared error (MSE) */
val mse = predict_vs_train.map(x => math.pow(x._1 - x._2, 2)).mean()
/* Mean absolute error (MAE) */
val mae = predict_vs_train.map(x => math.abs(x._1 - x._2)).mean()
/* Root mean squared log error (RMSLE) */
val rmsle = math.sqrt(predict_vs_train.map(x => math.pow(math.log(x._1 + 1) - math.log(x._2 + 1), 2)).mean())
println(s"mse is $mse and mae is $mae and rmsle is $rmsle")
/* mse is 11611.485999495755 and mae is 71.15018786490428 and rmsle is 0.6251152586960916 */
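The three metrics themselves need nothing from Spark. As a minimal sketch (the sample prediction/label pairs below are made up), they can be computed over plain `(prediction, label)` tuples; `log1p` is used in RMSLE, matching the `+ 1` above, so that zero counts do not blow up:

```scala
object MetricsDemo {
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (p, y) => math.pow(p - y, 2) }.sum / pairs.size

  def mae(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (p, y) => math.abs(p - y) }.sum / pairs.size

  // log1p(x) = log(x + 1): guards against log(0) for zero counts
  def rmsle(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, y) =>
      math.pow(math.log1p(p) - math.log1p(y), 2)
    }.sum / pairs.size)

  def main(args: Array[String]): Unit = {
    val pairs = Seq((120.0, 100.0), (30.0, 45.0), (0.0, 3.0))
    println(f"mse=${mse(pairs)}%.2f mae=${mae(pairs)}%.2f rmsle=${rmsle(pairs)}%.3f")
  }
}
```

MSE punishes large errors quadratically, MAE weighs all errors linearly, and RMSLE measures relative error, which suits skewed count targets like hourly bike rentals.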

(3) Improving Model Performance and Tuning Parameters

1. Transform the target variable (take the logarithm of the target value) by modifying the following lines:

LabeledPoint(math.log(label), Vectors.dense(feature))

and

val predict_vs_train = file_tree.map {
  /* point => (model_DT.predict(point.features), point.label) */
  point => (math.exp(model_DT.predict(point.features)), math.exp(point.label))
}
/* Result:
mse is 14781.575988339053 and mae is 76.41310991122032 and rmsle is 0.6405996100717035
*/

The decision tree's performance actually got worse after the transformation.
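The mechanics of the transform are worth spelling out: training targets pass through `math.log`, so predictions come back on the log scale and must be mapped through `math.exp` before computing metrics on the original count scale, exactly as the commented lines above do. A minimal sketch (the sample counts are made up):

```scala
object LogTargetDemo {
  def main(args: Array[String]): Unit = {
    val counts = Seq(1.0, 16.0, 250.0, 900.0) // skewed rental-like counts (made up)
    val logged = counts.map(math.log)         // the scale the model actually trains on
    println(f"raw spread:    ${counts.max - counts.min}%.1f")
    println(f"logged spread: ${logged.max - logged.min}%.3f") // far more compressed
    // exp undoes log exactly, so back-transformed values land on the count scale
    val back = logged.map(math.exp)
    back.zip(counts).foreach { case (b, c) => assert(math.abs(b - c) < 1e-9) }
  }
}
```

The transform compresses the skewed target distribution, which often helps linear models; as the results above show, a decision tree, which splits on raw thresholds anyway, need not benefit.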

2. Model parameter tuning

  1. Build the training and test sets

val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
  /* LabeledPoint(math.log(label), Vectors.dense(feature)) */
}
val tree_orgin = file_tree.randomSplit(Array(0.8, 0.2), 11L)
val tree_train = tree_orgin(0)
val tree_test = tree_orgin(1)
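`randomSplit` above partitions the RDD randomly in proportion to the weights, deterministically for a fixed seed. As an illustrative Spark-free analog on a plain collection (not MLlib's exact sampling algorithm; the weights and seed mirror the call above):

```scala
import scala.util.Random

object SplitDemo {
  // Assign each element to a bucket with probability proportional to the weights
  def randomSplit[T](xs: Seq[T], weights: Array[Double], seed: Long): Array[Seq[T]] = {
    val rng = new Random(seed)
    val total = weights.sum
    // Cumulative normalized thresholds, e.g. Array(0.8, 1.0) for weights (0.8, 0.2)
    val cum = weights.scanLeft(0.0)(_ + _).tail.map(_ / total)
    val buckets = Array.fill(weights.length)(Seq.newBuilder[T])
    xs.foreach { x =>
      val r = rng.nextDouble()
      buckets(cum.indexWhere(r <= _)) += x
    }
    buckets.map(_.result())
  }

  def main(args: Array[String]): Unit = {
    val parts = randomSplit(1 to 1000, Array(0.8, 0.2), 11L)
    println(s"train=${parts(0).size} test=${parts(1).size}") // roughly 800 / 200
  }
}
```

Fixing the seed (`11L`) makes every tuning run below score against the same held-out 20%, so the RMSLE numbers are comparable across parameter values.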

  2. Tuning the tree depth (maxDepth)

val categoricalFeaturesInfo = Map[Int, Int]()
/* Sweep the tree depth */
val Deep_Results = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", param, 32)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param lambda", rmsle)
}
/* Output for each depth */
Deep_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
1 lambda, rmsle = 1.0763369409492645
2 lambda, rmsle = 0.9735820606349874
3 lambda, rmsle = 0.8786984993014815
4 lambda, rmsle = 0.8052113493915528
5 lambda, rmsle = 0.7014036913077335
10 lambda, rmsle = 0.44747906135994925
20 lambda, rmsle = 0.4769214752638845
*/

  Deeper trees overfit. From these results, the best tree depth for this dataset is around 10.

  3. Tuning the number of bins (maxBins)

/* Sweep the number of bins */
val ClassNum_Results = Seq(2, 4, 8, 16, 32, 64, 100).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", 10, param)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param lambda", rmsle)
}
/* Output for each bin count */
ClassNum_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
2 lambda, rmsle = 1.2995002615220668
4 lambda, rmsle = 0.7682777577495858
8 lambda, rmsle = 0.6615110909041817
16 lambda, rmsle = 0.4981237727958235
32 lambda, rmsle = 0.44747906135994925
64 lambda, rmsle = 0.4487531073836407
100 lambda, rmsle = 0.4487531073836407
*/

  More bins make the model more complex and can help when features have many distinct values. Beyond a certain point, though, extra bins barely improve performance, and overfitting can even make test-set performance worse. Here roughly 32 bins appears to be optimal.
