Gradient-Boosted Trees (GBDT): Algorithm Principles and Spark MLlib Usage Examples (Scala/Java/Python)

阿新 • Published: 2019-01-15

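Gradient-Boosted Trees

Algorithm overview:

Gradient-boosted trees (GBTs) are an ensemble of decision trees. They train decision trees iteratively in order to minimize a loss function. Like decision trees, GBTs handle categorical features, can in principle be extended to multiclass classification (although spark.ml does not yet support this), and require no feature scaling. spark.ml implements GBTs on top of its existing decision tree implementation.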

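GBTs train a sequence of decision trees, one after another. In each iteration, the algorithm uses the current ensemble to predict the label of every training instance and compares each prediction with the true label. The dataset is then re-labeled so that instances with poor predictions receive more emphasis. As a result, the decision tree trained in the next iteration helps correct the earlier mistakes.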
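The loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not Spark MLlib's implementation: it assumes squared error, a single numeric feature, and depth-1 trees (stumps). The names `fit_stump`, `fit_gbt`, and `mse` are hypothetical helpers introduced only for this sketch.

```python
# A toy gradient-boosting loop for squared error on a single feature.
# Illustrative only: this is not how Spark MLlib implements GBTs.

def fit_stump(xs, targets):
    """Fit a depth-1 regression tree (stump): choose the threshold that
    minimizes the squared error of predicting the mean on each side.
    Assumes xs contains at least two distinct values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbt(xs, ys, max_iter=10, step_size=0.5):
    """Train max_iter stumps sequentially; each one fits the residuals
    (the "re-labeled" targets) left by the current ensemble."""
    base = sum(ys) / len(ys)  # initial constant prediction
    trees = []

    def predict(x):
        return base + step_size * sum(tree(x) for tree in trees)

    for _ in range(max_iter):
        # Instances that are predicted badly get large residuals,
        # so the next stump concentrates on them.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        trees.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]
for n in (0, 1, 5, 20):
    print(n, round(mse(fit_gbt(xs, ys, max_iter=n), xs, ys), 4))
```

The training error printed for each iteration count should shrink as more stumps are added, which is the behavior the paragraph above describes. The `step_size` argument plays the role of a learning rate: smaller values make each tree's correction more cautious.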

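The re-labeling mechanism is defined by the loss function. With each iteration, GBTs further reduce the value of the loss function on the training data. spark.ml provides one loss function for classification (log loss) and two for regression (squared error and absolute error).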
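To make the role of the loss function concrete, the sketch below writes out the three losses and their negative gradients, which serve as the re-labeled targets for the next tree. The formulas follow the per-instance definitions as I understand them from the Spark MLlib ensembles guide; treat them as an assumption and check them against the documentation for your Spark version. For log loss, the label `y` is assumed to be encoded as -1 or +1, and `f` denotes the current ensemble prediction. All function names are hypothetical.

```python
import math

# y: true label, f: current ensemble prediction for one instance.
# For log loss, y is assumed to be encoded as -1 or +1.

def log_loss(y, f):
    return 2.0 * math.log1p(math.exp(-2.0 * y * f))

def log_loss_neg_gradient(y, f):
    # -d/df of 2 * log(1 + exp(-2 * y * f))
    return 4.0 * y / (1.0 + math.exp(2.0 * y * f))

def squared_error(y, f):
    return (y - f) ** 2

def squared_error_neg_gradient(y, f):
    return 2.0 * (y - f)

def absolute_error(y, f):
    return abs(y - f)

def absolute_error_neg_gradient(y, f):
    # Sign of the residual; 0.0 when the prediction is exact.
    if y > f:
        return 1.0
    if y < f:
        return -1.0
    return 0.0

# A positive instance predicted confidently wrong (f = -2) gets a much
# larger re-labeled target than one predicted confidently right (f = +2).
print(log_loss_neg_gradient(1.0, -2.0))
print(log_loss_neg_gradient(1.0, 2.0))
```

Note how the log-loss gradient shrinks toward zero for well-classified instances, while the absolute-error gradient has constant magnitude, which makes absolute error less sensitive to outliers than squared error.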

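spark.ml supports GBTs for binary classification and for regression, using both continuous and categorical features.

* Note: GBTs do not currently support multiclass classification.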

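Parameters:

checkpointInterval:

Type: integer.

Meaning: checkpoint interval (>= 1), or -1 to disable checkpointing.

featuresCol:

Type: string.

Meaning: name of the features column.

impurity:

Type: string.

Meaning: criterion used to compute information gain (case-insensitive).

labelCol:

Type: string.

Meaning: name of the label column.

lossType:

Type: string.

Meaning: type of loss function to minimize.

maxBins:

Type: integer.

Meaning: maximum number of bins used to discretize continuous features and to choose how to split on features at each node.

maxDepth:

Type: integer.

Meaning: maximum depth of each tree (>= 0).

maxIter:

Type: integer.

Meaning: number of iterations (>= 0); one tree is trained per iteration.

minInfoGain:

Type: double.

Meaning: minimum information gain required for a node to be split.

minInstancesPerNode:

Type: integer.

Meaning: minimum number of instances each child node must contain after a split.

predictionCol:

Type: string.

Meaning: name of the prediction column.

rawPredictionCol:

Type: string.

Meaning: name of the raw prediction column.

seed:

Type: long.

Meaning: random seed.

subsamplingRate:

Type: double.

Meaning: fraction of the training data used to learn each decision tree, in the range (0, 1].

stepSize:

Type: double.

Meaning: step size (learning rate) applied at each iteration.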
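Of the parameters above, maxBins is probably the least intuitive. The sketch below shows one simple way a continuous feature could be discretized into at most `max_bins` bins using quantiles of the observed values. It is a hypothetical illustration of the idea only; Spark MLlib's actual binning logic differs in its details, and `candidate_splits` and `bin_index` are names invented for this example.

```python
def candidate_splits(values, max_bins):
    """Return at most max_bins - 1 thresholds that cut the sorted values
    into roughly equal-sized bins (a simple quantile scheme).
    Assumes values is non-empty."""
    ordered = sorted(values)
    n = len(ordered)
    thresholds = []
    for i in range(1, max_bins):
        cut = ordered[(i * n) // max_bins]
        if not thresholds or cut > thresholds[-1]:
            thresholds.append(cut)  # skip duplicate thresholds
    return thresholds


def bin_index(x, thresholds):
    """Map a continuous value to the index of the bin it falls into."""
    return sum(1 for t in thresholds if x >= t)


values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.5, 0.7, 0.2]
thresholds = candidate_splits(values, max_bins=4)
print(thresholds)
print([bin_index(v, thresholds) for v in values])
```

A larger `max_bins` yields more candidate thresholds, so each tree node can consider finer-grained splits at the cost of more computation per node.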

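Example:

The following example loads a dataset in LibSVM format and splits it into a training set and a test set. The model is trained on the first part and evaluated on the held-out remainder. Before training, two feature transformers (StringIndexer and VectorIndexer) preprocess the data and add metadata to the DataFrame.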

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
// Assumes an active SparkSession named `spark` (as provided by spark-shell).
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a GBT model.
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and GBT in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
println("Learned classification GBT model:\n" + gbtModel.toDebugString)

Java:

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.GBTClassificationModel;
import org.apache.spark.ml.classification.GBTClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.ml.feature.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Load and parse the data file, converting it to a DataFrame.
// Assumes an active SparkSession named `spark` created elsewhere in the application.
Dataset<Row> data = spark
  .read()
  .format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt");

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
StringIndexerModel labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data);
// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
VectorIndexerModel featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data);

// Split the data into training and test sets (30% held out for testing)
Dataset<Row>[] splits = data.randomSplit(new double[] {0.7, 0.3});
Dataset<Row> trainingData = splits[0];
Dataset<Row> testData = splits[1];

// Train a GBT model.
GBTClassifier gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10);

// Convert indexed labels back to original labels.
IndexToString labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels());

// Chain indexers and GBT in a Pipeline.
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {labelIndexer, featureIndexer, gbt, labelConverter});

// Train model. This also runs the indexers.
PipelineModel model = pipeline.fit(trainingData);

// Make predictions.
Dataset<Row> predictions = model.transform(testData);

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5);

// Select (prediction, true label) and compute test error.
MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy");
double accuracy = evaluator.evaluate(predictions);
System.out.println("Test Error = " + (1.0 - accuracy));

GBTClassificationModel gbtModel = (GBTClassificationModel)(model.stages()[2]);
System.out.println("Learned classification GBT model:\n" + gbtModel.toDebugString());

Python:

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Load and parse the data file, converting it to a DataFrame.
# Assumes an active SparkSession named `spark` (as provided by the pyspark shell).
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Index labels, adding metadata to the label column.
# Fit on whole dataset to include all labels in index.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Automatically identify categorical features, and index them.
# Set maxCategories so features with > 4 distinct values are treated as continuous.
featureIndexer =\
    VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GBT model.
gbt = GBTClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures", maxIter=10)

# Chain indexers and GBT in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, gbt])

# Train model.  This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(5)

# Select (prediction, true label) and compute test error
evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test Error = %g" % (1.0 - accuracy))

gbtModel = model.stages[2]
print(gbtModel)  # summary only
