My Spark Learning Path (3): Regression Analysis with Spark

阿新 • Published: 2019-01-18

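Spark's machine learning library (MLlib) includes some simple regression methods. This post covers only the simplest one, linear regression. Spark ships two packages that can do regression, spark.mllib and spark.ml. While learning I looked up a lot of material online and noticed something odd: almost everything written about regression in Spark uses mllib, and I found next to nothing about ml. Following the official documentation, I tested both packages myself and found that mllib did not give a correct result on my data: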

6,15,7,8,1,21,16,45,45,33,22

11,31,12,15,1,44,34,88,90,67,54

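The two rows above are the data I used for testing. With mllib I got a coefficient of a=-6.977555728270526E260, while ml gave 0.44543491975396066. I am not sure whether this is because I have so little data, but the mllib result is clearly wrong. (A coefficient of that magnitude is what a diverging iterative solver typically produces, for example SGD with a step size too large for unscaled features, so the small sample may not be the cause.) In addition, the Spark website gives learners the following advice: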

This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now the primary API for MLlib.
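As a rough sanity check on the two coefficients quoted above, the slope can also be computed in closed form with ordinary least squares, without Spark. The sketch below is only an illustration: it assumes the first data row is the dependent variable and the second row is a single feature (the post does not say which is which), and the names OlsCheck and fit are made up for this example.

```scala
// Closed-form simple linear regression (ordinary least squares), plain Scala.
object OlsCheck {

  /** Returns (slope, intercept) of the least-squares line y = slope * x + intercept. */
  def fit(x: Seq[Double], y: Seq[Double]): (Double, Double) = {
    require(x.nonEmpty && x.length == y.length, "x and y must be non-empty and of equal length")
    val n = x.length.toDouble
    val meanX = x.sum / n
    val meanY = y.sum / n
    // Sum of cross-deviations and sum of squared deviations of x
    val sxy = x.zip(y).map { case (xi, yi) => (xi - meanX) * (yi - meanY) }.sum
    val sxx = x.map(xi => (xi - meanX) * (xi - meanX)).sum
    require(sxx != 0.0, "x must not be constant")
    val slope = sxy / sxx
    (slope, meanY - slope * meanX)
  }

  // Assumption: first row of the test data = label, second row = feature.
  val y: Seq[Double] = Seq(6.0, 15.0, 7.0, 8.0, 1.0, 21.0, 16.0, 45.0, 45.0, 33.0, 22.0)
  val x: Seq[Double] = Seq(11.0, 31.0, 12.0, 15.0, 1.0, 44.0, 34.0, 88.0, 90.0, 67.0, 54.0)

  def main(args: Array[String]): Unit = {
    val (slope, intercept) = fit(x, y)
    println(f"slope = $slope%.4f, intercept = $intercept%.4f")
  }
}
```

Under that assumption the unregularized slope comes out at about 0.49. The ml model in this post is fitted with regParam 0.3 and elasticNetParam 0.8, so a somewhat smaller coefficient such as 0.445 is plausible, whereas a value on the order of 1E260 is not.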

Now to the main topic: regression with the ml package. The complete code follows.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext
// Note: pairing the spark.ml estimator with spark.mllib vectors is the Spark 1.6 style.
// From Spark 2.0 on, spark.ml expects org.apache.spark.ml.linalg.Vectors and
// org.apache.spark.ml.feature.LabeledPoint instead.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

import scala.io.Source
object Regression extends App{
  val conf=new SparkConf().setAppName("regression")
  val sc=new SparkContext(conf)

  val sqc=new SQLContext(sc)

  // Each line of data/data1.txt is expected to look like "label,f1 f2 ...":
  // the label before the comma, space-separated feature values after it.
  val date=Source.fromFile("data/data1.txt").getLines().map{line=>
    val parts=line.split(",")
    val a=Vectors.dense(parts(1).split(" ").map(_.toDouble))
    val b=parts(0).toDouble
    LabeledPoint(b,a)
  }.toList // materialize the iterator before handing it to Spark

  // Turn the local collection into an RDD[LabeledPoint], then into a DataFrame
  val df=sqc.createDataFrame(sc.parallelize(date))

  val lr=new LinearRegression()
    .setMaxIter(10)//set maximum number of iterations
    .setRegParam(0.3)//Set the regularization parameter.
    .setElasticNetParam(0.8)//Set the ElasticNet mixing parameter.
  // Fit the model
  val lrModel = lr.fit(df)


  // Print the coefficients and intercept for linear regression
  println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

  // Summarize the model over the training set and print out some metrics
  val trainingSummary = lrModel.summary
  println(s"numIterations: ${trainingSummary.totalIterations}")
  println(s"objectiveHistory: ${trainingSummary.objectiveHistory.toList}")
  trainingSummary.residuals.show()
  println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")
  println(s"r2: ${trainingSummary.r2}")
}

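There is actually very little code and the whole process is simple. The official documentation already explains it clearly. The only tricky part is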

val lrModel = lr.fit(df)

The df here is a DataFrame. The official documentation uses

val df=spark.read.format("libsvm")
    .load("data/mllib/sample_linear_regression_data.txt")

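to read the data straight from a file, and the df it returns is already a DataFrame. In practice, though, our data may be the result of some other computation. The real difficulty is therefore how to turn data in some other form (an array, say; converting any other result to an array is easy) into a DataFrame. That is at least true for a beginner like me; for someone familiar with Scala it is probably no problem at all.

After digging through the Scala documentation I finally found sqc.createDataFrame(rdd: RDD[A]): SQLContext has a createDataFrame method that turns an RDD into a DataFrame. The next question is how to turn an array into the RDD we need, and what element type that RDD must have. Stepping through with the debugger showed that it is LabeledPoint, which is defined as follows: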

case class LabeledPoint @Since("1.0.0") (
    @Since("0.8.0") label: Double,
    @Since("1.0.0") features: Vector) {
  override def toString: String = {
    s"($label,$features)"
  }
}

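Here label is the dependent variable, a Double, and features holds the independent variables as a Vector. Once the format is known, building it is simple. The code above constructs it like this: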

val date=Source.fromFile("data/data1.txt").getLines().map{line=>
    val parts=line.split(",")
    val a=Vectors.dense(parts(1).split(" ").map(_.toDouble))
    val b=parts(0).toDouble
    LabeledPoint(b,a)
  }

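The data here is still read from a file, but the essential difference from the official example is that each line is split into an array, and the RDD is then built from those arrays. What I want to show is how to turn an array into a DataFrame. If all your data lives in a file there is no need for any of this; just read it directly. My problem was that my data was computed by another function and sat in an array, which is why I wrote this post. If you know a better way, I would welcome the discussion.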
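For readers on Spark 2.0 or later, one possible way to go from in-memory arrays to a DataFrame without an intermediate RDD is sketched below. It is not the code used in this post: it assumes SparkSession and the org.apache.spark.ml.linalg vectors that spark.ml expects in those versions, runs with a local master for illustration, and the names ArrayToDataFrame and toTrainingFrame are hypothetical. It also assumes, as before, that the first row of the test data is the label.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.{DataFrame, SparkSession}

object ArrayToDataFrame {

  /** Builds a (label, features) DataFrame from arrays that are already in memory. */
  def toTrainingFrame(spark: SparkSession,
                      labels: Array[Double],
                      features: Array[Array[Double]]): DataFrame = {
    require(labels.length == features.length, "one feature row per label")
    // Pair each label with a dense ml vector; tuples are Products, so
    // createDataFrame can infer the schema from them.
    val rows = labels.zip(features).map { case (y, xs) => (y, Vectors.dense(xs)) }.toSeq
    spark.createDataFrame(rows).toDF("label", "features")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("array-to-dataframe")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()

    // Assumption: first row of the test data = label, second row = single feature.
    val labels = Array(6.0, 15.0, 7.0, 8.0, 1.0, 21.0, 16.0, 45.0, 45.0, 33.0, 22.0)
    val xs = Array(11.0, 31.0, 12.0, 15.0, 1.0, 44.0, 34.0, 88.0, 90.0, 67.0, 54.0)

    val df = toTrainingFrame(spark, labels, xs.map(x => Array(x)))

    // Same hyperparameters as the full listing in this post.
    val model = new LinearRegression()
      .setMaxIter(10)
      .setRegParam(0.3)
      .setElasticNetParam(0.8)
      .fit(df)
    println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")

    spark.stop()
  }
}
```

The column names label and features are the defaults that LinearRegression looks for, so no extra setLabelCol or setFeaturesCol call is needed in this sketch.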
