Spark 2.0 Machine Learning Series, Part 7: MLPC (Multilayer Neural Network)

阿新 • Published: 2018-01-30


Spark 2.0 MLPC (Multilayer Perceptron Classifier): Algorithm Overview

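MultilayerPerceptronClassifier (MLPC) is a classifier based on a feed-forward neural network: a forward-propagating network model with one or more layers of hidden nodes between the input layer and the output layer.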

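The hidden nodes use the sigmoid (logistic) function, and the output nodes use the softmax function. The number of output nodes equals the number of classes. MLPC is trained with the backpropagation (BP) algorithm; the optimization problem is formulated with a logistic loss function and solved with L-BFGS.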

A Closer Look at the Algorithm

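The sigmoid function
The hidden layers use the sigmoid function (also called the logistic function). As the curve below shows, f(Zi) = 0.5 when Zi = 0, f(Zi) < 0.5 when Zi < 0, and f(Zi) > 0.5 when Zi > 0. f(Zi) can therefore be read as a probability: values above 0.5 go to one class and values below 0.5 to the other. This is why the sigmoid function is commonly used for binary classification.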

[Figure: curve of the sigmoid function]

　　The sigmoid function:

　　　　$f(x) = \dfrac{1}{1 + e^{-x}}$

　　Its derivative:

　　　　$f'(x) = f(x)\,\bigl(1 - f(x)\bigr)$
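
To make the two formulas concrete, here is a minimal sketch in plain Java (no Spark involved). The class and method names are made up for illustration. It evaluates the sigmoid and its derivative, and compares the derivative with a central finite difference as a sanity check.

```java
final class SigmoidDemo {

    // Logistic sigmoid: f(x) = 1 / (1 + e^(-x)).
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    // Derivative expressed through the function value: f'(x) = f(x) * (1 - f(x)).
    static double sigmoidDerivative(double x) {
        double f = sigmoid(x);
        return f * (1.0 - f);
    }

    public static void main(String[] args) {
        double h = 1e-6; // step for the central finite difference
        for (double x : new double[]{-2.0, 0.0, 2.0}) {
            double numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h);
            System.out.printf("x=%5.1f  f=%.4f  f'=%.6f  numeric=%.6f%n",
                    x, sigmoid(x), sigmoidDerivative(x), numeric);
        }
    }
}
```

Negative inputs give values below 0.5 and positive inputs give values above 0.5, which is the thresholding behaviour described above.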

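The softmax function
Reference: https://en.wikipedia.org/wiki/Softmax_function
Suppose the output layer has $K$ nodes and that, once training is complete, the fitted coefficient vectors of the nodes are (in the notation of the Wikipedia article cited above):

　　　　$\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$

The softmax function is then:

　　　　$P(y = j \mid \mathbf{x}) = \dfrac{e^{\mathbf{x}^{\mathsf{T}}\mathbf{w}_j}}{\sum_{k=1}^{K} e^{\mathbf{x}^{\mathsf{T}}\mathbf{w}_k}}$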
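
As an illustration of what the softmax does with the raw scores of the output nodes, here is a small sketch in plain Java. The class name and the sample scores are hypothetical. Subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result; it is my addition, not something the article prescribes.

```java
final class SoftmaxDemo {

    // Turns raw scores into probabilities that are positive and sum to 1.
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max); // shift by max to avoid overflow
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] probs = softmax(new double[]{1.0, 2.0, 3.0}); // hypothetical scores for 3 classes
        for (int k = 0; k < probs.length; k++) {
            System.out.printf("class %d: %.4f%n", k, probs[k]);
        }
    }
}
```

The predicted class is the index with the largest probability, which is why the output layer needs one node per class.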

Spark 2.0 MLPC Code (Workflow)

package my.spark.ml.practice.classification;

import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel;
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier;
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

 
public class myMLPC {

    public static void main(String[] args) {
        SparkSession spark=SparkSession
                .builder()
                .appName("MLPC")
                .master("local[4]")
                .config("spark.sql.warehouse.dir","file:///G:/Projects/Java/Spark/spark-warehouse" )
                .getOrCreate();
        String path="G:/Projects/CgyWin64/home/pengjy3/softwate/spark-2.0.0-bin-hadoop2.6/"
                + "data/mllib/sample_multiclass_classification_data.txt";       
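        // Suppress verbose logging
        Logger.getLogger("org.apache.spark").setLevel(Level.ERROR);
        Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF);
        // Load the data. randomSplit is given a fixed seed (100) so that the results
        // are reproducible, which helps when debugging; do not fix the seed like this in real work.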
        Dataset<Row>[] split=spark.read().format("libsvm").load(path).randomSplit(new double[]{0.6,0.4},100);
        Dataset<Row> training=split[0];
        Dataset<Row> test=split[1];
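        training.show(100,false); // inspect the data

        // The first layer is the number of features;
        // the last (output) layer is the number of labels (classes);
        // the hidden layers are chosen by the user.
        int[] layer=new int[]{4,6,4,3};     

        int[] maxIter=new int[]{5,10,20,50,100,200};
        double[] accuracy=new double[maxIter.length];
        // A loop like the one below makes it easy to tune various parameters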
        for(int i=0;i<maxIter.length;i++){
            MultilayerPerceptronClassifier multilayerPerceptronClassifier=
                    new MultilayerPerceptronClassifier()
                    .setLabelCol("label")
                    .setFeaturesCol("features")
                    .setLayers(layer)
                    .setMaxIter(maxIter[i])
                    .setBlockSize(128)
                    .setSeed(1000);
            MultilayerPerceptronClassificationModel model=
                    multilayerPerceptronClassifier.fit(training);

            Dataset<Row> predictions=model.transform(test);
            MulticlassClassificationEvaluator evaluator=
                    new MulticlassClassificationEvaluator()
                    .setLabelCol("label")               
                    .setPredictionCol("prediction")
                    .setMetricName("accuracy");
            accuracy[i]=evaluator.evaluate(predictions);            
        }           

        // Print all evaluation results at once
        for(int j=0;j<maxIter.length;j++){
            String str_accuracy=String.format(" accuracy =  %.2f", accuracy[j]);
            String str_maxIter=String.format(" maxIter =  %d", maxIter[j]);
            System.out.println(str_maxIter+str_accuracy);
        }           
    }
}
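
To get a feel for what a layer specification such as {4,6,4,3} implies, the sketch below counts the trainable parameters of such a network. It assumes every layer is fully connected to the next and that each neuron has one bias term; that assumption and the helper name countParameters are mine, not something stated in the listing above.

```java
final class LayerSizeDemo {

    // Counts weights plus biases of a fully connected feed-forward network.
    static int countParameters(int[] layers) {
        int total = 0;
        for (int i = 0; i + 1 < layers.length; i++) {
            // each neuron of layer i+1 has layers[i] weights and 1 bias (assumption)
            total += (layers[i] + 1) * layers[i + 1];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] layer = {4, 6, 4, 3}; // 4 features, hidden layers of 6 and 4, 3 classes
        System.out.println("parameters = " + countParameters(layer));
    }
}
```

For {4,6,4,3} this gives (4+1)*6 + (6+1)*4 + (4+1)*3 = 73 parameters, so adding or widening hidden layers quickly increases what the optimizer has to fit.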

Parameter Settings and Tuning

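　　Spark 2.0 still has little documentation for this machine learning algorithm, which was added in version 1.5, so we fall back on the following approach: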

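// The explainParams function in Spark lists the parameters a classifier supports:
System.out.println(multilayerPerceptronClassifier.explainParams());
// Parameters that do not affect the result (such as the input label column labelCol) are omitted here.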
(1) layers: Sizes of layers from input layer to output layer. E.g., Array(780, 100, 10) means 780 inputs, one hidden layer with 100 neurons and output layer of 10 neurons. (current: [I@158f492)

(2) maxIter: maximum number of iterations (>= 0) (default: 100, current: 20)
(3) tol: the convergence tolerance for iterative algorithms (default: 1.0E-4)

(4) solver: The solver algorithm for optimization. Supported options: l-bfgs, gd. (Default l-bfgs)

(5) stepSize: Step size to be used for each iteration of optimization (default: 0.03)

(6) blockSize: Block size for stacking input data in matrices. Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data. Recommended size is between 10 and 1000 (default: 128, current: 128)

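Some of the parameters you can set:

Network structure
(1) layers: ???
Stopping criteria
(2) maxIter: must be large enough for the computation to converge.
(3) tol: the allowed error, typically between 0.001 and 0.00001. Iteration stops and the result is returned once the error of an iteration falls below this value.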
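
The interplay between the two stopping criteria can be illustrated with a toy optimizer in plain Java. This is only a sketch: it minimizes the made-up loss (x - 3)^2 with plain gradient descent and stops when the loss changes by less than tol or when maxIter is reached. Spark's own convergence test may be defined differently.

```java
final class StoppingDemo {

    static double loss(double x) {
        return (x - 3.0) * (x - 3.0);
    }

    // Returns the number of iterations actually performed.
    static int minimize(int maxIter, double tol, double stepSize) {
        double x = 0.0;
        double prevLoss = loss(x);
        int iter = 0;
        while (iter < maxIter) {
            x -= stepSize * 2.0 * (x - 3.0); // gradient of (x - 3)^2 is 2 (x - 3)
            iter++;
            double curLoss = loss(x);
            if (Math.abs(prevLoss - curLoss) < tol) {
                break; // converged: the improvement is below the tolerance
            }
            prevLoss = curLoss;
        }
        return iter;
    }

    public static void main(String[] args) {
        System.out.println("maxIter=5,   tol=1e-4 -> " + minimize(5, 1e-4, 0.1) + " iterations");
        System.out.println("maxIter=100, tol=1e-4 -> " + minimize(100, 1e-4, 0.1) + " iterations");
        System.out.println("maxIter=100, tol=1e-6 -> " + minimize(100, 1e-6, 0.1) + " iterations");
    }
}
```

With a small maxIter the loop is cut off before it converges; with a generous maxIter the tolerance decides when to stop, and a tighter tol costs more iterations.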

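Optimization algorithm (solver)
　　For the optimization algorithms in Spark, see my other articles, including:
ML Optimization Algorithms, Part 1: Gradient Descent and Stochastic Gradient Descent (applied to linear regression, logistic regression, and so on)
http://blog.csdn.net/qq_34531825/article/details/52396165
(4) solver: two algorithms are available, l-bfgs and gd.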

With stepSize=0.03 and tol=0.0001:
l-bfgs: converges quickly, in about 20 iterations, and also trains faster.
maxIter = 5 accuracy = 0.35 training time = 267ms
maxIter = 10 accuracy = 0.74 training time = 429ms
maxIter = 20 accuracy = 0.91 training time = 657ms
maxIter = 50 accuracy = 0.92 training time = 941ms
maxIter = 100 accuracy = 0.92 training time = 914ms
maxIter = 500 accuracy = 0.92 training time = 1052ms

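gd: needs far more iterations. Even with a higher learning rate and a looser tolerance tol, it is still much slower, roughly 10 times slower or more.
With stepSize=0.2 and tol=0.001: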
maxIter = 100 accuracy = 0.55 training time = 4209ms
maxIter = 500 accuracy = 0.92 training time = 11216ms
maxIter = 1000 accuracy = 0.92 training time = 14540ms
maxIter = 2000 accuracy = 0.92 training time = 14708ms
maxIter = 5000 accuracy = 0.92 training time = 14669ms
As these results show, GD (gradient descent) takes much longer than L-BFGS to converge, so L-BFGS is the recommended first choice.
In general, when L-BFGS is available, we recommend using it instead of SGD since L-BFGS tends to converge faster (in fewer iterations). (Spark documentation)

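(5) stepSize, the learning rate, is a fairly critical parameter.
　　In general, the larger the learning rate, the larger the weight updates and the faster the convergence. A learning rate that is too high shortens training time, but it tends to make the network unstable, increases the training error, and causes oscillation.
　　A learning rate that is too low requires a long training time.
　　In practice, as long as the training time is acceptable, it is better to choose a smaller learning rate for the sake of model stability.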
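
The effect of the learning rate can be seen on a one-dimensional toy problem. The sketch below is plain Java with hypothetical names; it runs gradient descent on f(x) = x^2 starting from x = 1, so each update is x <- x - stepSize * 2x. It is meant only to illustrate slow convergence, oscillation, and divergence, not to reproduce Spark's gd solver.

```java
final class StepSizeDemo {

    // Runs plain gradient descent on f(x) = x^2 from x = 1 and returns the final x.
    static double runGradientDescent(double stepSize, int iterations) {
        double x = 1.0;
        for (int i = 0; i < iterations; i++) {
            x -= stepSize * 2.0 * x; // gradient of x^2 is 2x
        }
        return x;
    }

    public static void main(String[] args) {
        for (double stepSize : new double[]{0.01, 0.4, 0.9, 1.1}) {
            System.out.printf("stepSize=%.2f  x after 20 iterations = %.6f%n",
                    stepSize, runGradientDescent(stepSize, 20));
        }
    }
}
```

A very small step barely moves toward the minimum at 0, a moderate step gets there quickly, a step of 0.9 overshoots and flips sign on every update, and a step of 1.1 moves further away with each iteration.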
(6) blockSize: I am not sure what exactly this is. I hope someone with more expertise can tell me.
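
One possible reading of the parameter description printed by explainParams ("Data is stacked within partitions. If block size is more than remaining data in a partition then it is adjusted to the size of this data") is sketched below in plain Java. It only shows how the rows of one partition would be grouped into blocks of at most blockSize rows; the helper name is hypothetical and this is not a description of Spark's internals.

```java
final class BlockSizeDemo {

    // Splits the rows of one partition into blocks of at most blockSize rows.
    static int[] blockSizes(int rowsInPartition, int blockSize) {
        int numBlocks = (rowsInPartition + blockSize - 1) / blockSize; // ceiling division
        int[] sizes = new int[numBlocks];
        for (int i = 0; i < numBlocks; i++) {
            int remaining = rowsInPartition - i * blockSize;
            sizes[i] = Math.min(blockSize, remaining); // last block shrinks to what is left
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(blockSizes(300, 128)));
        System.out.println(java.util.Arrays.toString(blockSizes(50, 128)));
    }
}
```

Under this reading, blockSize trades memory per block against the number of matrix operations: each block of rows is processed together as one matrix.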
