Learning Deep Learning with Andrew Ng: Implementing Neural Networks in Scala, Lesson 1
1. Introduction
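In August 2017, Andrew Ng, former chief scientist at Baidu, announced on Twitter his first move after leaving Baidu: a Deep Learning course on Coursera that builds neural networks from scratch. The announcement drew a great deal of attention.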
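As of today (Thursday, August 17, 2017), I have enrolled in the course and completed the first two weeks of lectures and assignments. In those two weeks, Andrew Ng uses Logistic Regression to explain how a neural network works in an accessible way, and shows in plain language that backpropagation is simply the chain rule of differentiation. The week-two programming assignment asks students to implement a Logistic Regression model from scratch in a Python Notebook and use it for binary classification on a given image set, deciding whether or not a picture shows a cat; the trained model reaches a Test Accuracy of 70%. He also says that later parts of the course will cover optimization methods for neural networks to push the classification Accuracy higher.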
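The point Andrew Ng stresses most in the programming exercise is vectorization. He demonstrates that vector multiplication with Python's Numpy package is more than 300 times faster than summing in a plain for loop, which is the difference between 1 minute and 5 hours. In real engineering work, however, Python is commonly seen as a tool for trying out ideas and getting results quickly, and to some extent it is less suited to developing and deploying production models. In this article I use Scala to implement the assignments from the Deep Learning course. Thank you for reading; I hope we can improve together.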
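In Python, the package that deep learning algorithms and vectorized operations rely on is Numpy, short for Numerical Python. It provides vector and matrix implementations along with functions for matrix operations and decompositions. The Scala counterpart used here is Breeze, which offers the DenseVector and DenseMatrix data structures, plus SparseVector and CSCMatrix (its sparse matrix type) for very sparse data, which to some extent makes it more capable than Numpy on its own. Most importantly, Scala is a statically typed, type-safe language, so we can use it not only to implement the algorithm but also for data preprocessing and data cleaning, that is, ETL.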
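This article has four parts. The first introduces the overall project structure; the second explains in detail the Scala code that implements Logistic Regression; the third covers the supporting code, such as data preprocessing, the plotting utility, and other helper classes; the fourth presents the demo results and where to download the dataset. All the code for this project is on GitHub at https://github.com/pan5431333/coursera-deeplearning-practice-in-scala and will be kept up to date as the course progresses. Feel free to follow.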
2. Project Structure
The project is organized into five subpackages: data, demo, helper, model, and utils.
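The data package contains one class, Cat, a Scala case class, which is well suited to representing a real-world entity. The demo package holds a runnable example for each lesson's assignment. The helper package currently contains two classes. CatDataHelper uses Java's ImageIO to read images from the local file system, converts each one into an RGB matrix, and then reshapes it into a vector. DlCollection is a generic collection class that provides three methods commonly needed in deep learning: split, which divides the data into a training set and a test set; getFeatureAsMatrix, which returns the feature matrix the algorithm needs; and getLabelAsVector, which returns the label vector. The model package currently contains only the Logistic Regression model. The utils package currently contains PlotUtils, whose plotCostHistory method plots how the cost changes with the number of iterations.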
The following section walks through the Scala implementation of Logistic Regression.
3. Logistic Regression in Scala
First, define the LogisticRegressionModel class:
class LogisticRegressionModel() {
  var learningRate: Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()
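The class has five instance variables. The first two are hyperparameters: learningRate is the learning rate and iterationTime is the maximum number of iterations. w and b are the model parameters, which are optimized as the iterations proceed. costHistory is a TreeMap that records how the cost changes during training, with the iteration number as the key and the cost as the value.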
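A small sketch of why a TreeMap suits costHistory: it keeps its keys sorted, so the history can be read back in iteration order no matter how entries were inserted. The object name CostHistorySketch and the method record are hypothetical names used only for this illustration.

```scala
import scala.collection.mutable

object CostHistorySketch {
  // Hypothetical helper: store (iteration, cost) pairs in a TreeMap.
  // The TreeMap orders entries by key, i.e. by iteration number.
  def record(entries: Seq[(Int, Double)]): mutable.TreeMap[Int, Double] = {
    val history = new mutable.TreeMap[Int, Double]()
    entries.foreach { case (iteration, cost) => history.put(iteration, cost) }
    history
  }
}
```

Because keys and values come back in matching key order, they can be passed directly to a plotting routine as the x and y series.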
Next are the two setters for the model hyperparameters:
def setLearningRate(learningRate: Double): this.type = {
  this.learningRate = learningRate
  this
}

def setIterationTime(iterationTime: Int): this.type = {
  this.iterationTime = iterationTime
  this
}
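Note that these setters differ from typical Java setters: each one returns this, which allows method chaining, so the caller can write val model = new LogisticRegressionModel().setLearningRate(0.0001).setIterationTime(3000) and the code reads more fluently. Method chaining is also widely used in Spark, where it is especially elegant for building data pipelines (Pipeline).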
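A minimal sketch of what the this.type return type adds over returning the class name. The classes BaseModel and TunedModel are hypothetical and exist only for this illustration: a setter declared on the parent still returns the subclass type, so the chain can continue with subclass-only setters.

```scala
object FluentSetterSketch {
  class BaseModel {
    var learningRate: Double = 0.0
    // Returning this.type (rather than BaseModel) preserves the caller's static type.
    def setLearningRate(value: Double): this.type = {
      this.learningRate = value
      this
    }
  }

  class TunedModel extends BaseModel {
    var iterationTime: Int = 0
    def setIterationTime(value: Int): this.type = {
      this.iterationTime = value
      this
    }
  }

  // setLearningRate is defined on BaseModel, yet the chain continues with a
  // TunedModel-only setter because the result is still typed as TunedModel.
  val model: TunedModel = new TunedModel().setLearningRate(0.005).setIterationTime(3000)
}
```

Had setLearningRate been declared to return BaseModel, the call to setIterationTime in the chain would not compile.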
Next is the training method:
def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {
  var (w, b) = initializeParams(feature.cols)

  (1 to this.iterationTime).foreach { i =>
    val (cost, dw, db) = propagate(feature, label, w, b)

    if (i % 100 == 0)
      println("INFO: Cost in " + i + "th time of iteration: " + cost)
    costHistory.put(i, cost)

    // Decay the learning rate as training progresses. Dividing by 1000.0
    // (not 1000) avoids integer division, which would leave the rate
    // unchanged for the first 999 iterations.
    val adjustedLearningRate = this.learningRate / (log(i / 1000.0 + 1) + 1)
    w :-= adjustedLearningRate * dw
    b -= adjustedLearningRate * db
  }

  this.w = w
  this.b = b
  this
}
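Note that this method uses two private methods, initializeParams() and propagate(), both explained in detail below. It also applies a simple adjustment to learningRate so that it gradually decreases as the iteration count grows, which reduces the chance of overshooting the optimum.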
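To make the decay schedule concrete, here is a sketch using only the standard library (math.log stands in for Breeze's log; the object and method names are hypothetical). It also shows one detail worth watching in a formula of this shape: when i is an Int, i / 1000 is integer division, so the rate would change in steps rather than smoothly.

```scala
object LearningRateDecaySketch {
  // base / (ln(i / 1000 + 1) + 1), computed with floating-point division:
  // the rate shrinks a little on every iteration.
  def adjusted(base: Double, iteration: Int): Double =
    base / (math.log(iteration / 1000.0 + 1.0) + 1.0)

  // The same formula with integer division: iteration / 1000 is 0 for
  // iterations 1 to 999, so ln(1) = 0 and the rate stays at `base`.
  def adjustedIntDiv(base: Double, iteration: Int): Double =
    base / (math.log((iteration / 1000 + 1).toDouble) + 1.0)
}
```

With a base rate of 0.005, adjusted gives roughly 0.0030 at iteration 1000 and roughly 0.0021 at iteration 3000.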
Next is the method that initializes the model parameters:
private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
  val w = DenseVector.rand[Double](featureSize)
  val b = DenseVector.rand[Double](1).data(0)
  (w, b)
}
Here w and b are initialized with random values between 0 and 1.
Next is the core of Logistic Regression, the forward and backward propagation:
private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double],
                      w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
  val numExamples = feature.rows
  val labelHat = sigmoid(feature * w + b)
  // println("DEBUG: feature * w + b is " + feature * w + b)
  // println("DEBUG: the feature's number of cols is " + feature.cols)
  // println("DEBUG: the feature's number of rows is " + feature.rows)
  // println("DEBUG: the labelHat is " + labelHat)

  val cost = -(label.t * log(labelHat) +
    (DenseVector.ones[Double](numExamples) - label).t *
      log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

  val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
  val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble
  // println("DEBUG: the (dw, db) is " + dw + ", " + db)

  (cost, dw, db)
}
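The commented-out lines are DEBUG code from development; since I did not add a logging library to this project, this was the only way to debug. feature.rows and feature.cols correspond to feature.shape[0] and feature.shape[1] in Python Numpy. sigmoid is a function provided by breeze.numerics._ that accepts a DenseVector or a DenseMatrix as its argument. For how cost, dw, and db are computed, see the theory of Logistic Regression; if anything is unclear, Andrew Ng's Deep Learning course covers it. One point to note: Python Numpy supports broadcasting, so 1 - np.array([1, 2, 3]) yields np.array([0, -1, -2]); that is, when a constant is combined with a vector or matrix, Numpy automatically applies the operation between the constant and every element. Breeze's support for this is limited, so when computing cost we have to write DenseVector.ones[Double](numExamples) - label rather than 1 - label.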
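For reference, the quantities computed in propagate can be written out as follows. The symbols are assumptions introduced here for illustration: X is the feature matrix with one example per row, m is numExamples, y is label, \hat{y} is labelHat, and \mathbf{1} is the all-ones vector of length m.

```latex
\begin{aligned}
\hat{y} &= \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(w, b) &= -\frac{1}{m}\left[\, y^{\top}\log\hat{y} + (\mathbf{1} - y)^{\top}\log(\mathbf{1} - \hat{y}) \,\right] \\
\frac{\partial J}{\partial w} &= \frac{1}{m}\, X^{\top}(\hat{y} - y) \\
\frac{\partial J}{\partial b} &= \frac{1}{m}\, \mathbf{1}^{\top}(\hat{y} - y)
\end{aligned}
```

Each line maps onto one statement of the code: labelHat, cost, dw, and db respectively.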
Next is the method that predicts with the trained model:
def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {
  // Include the bias b, as in training, and threshold the probability at 0.5.
  val yPredicted = sigmoid(feature * this.w + this.b).map { eachY =>
    if (eachY <= 0.5) 0.0 else 1.0
  }
  yPredicted
}
Here we use map, a staple of functional programming, and it keeps the code concise.
Next is the method that computes prediction accuracy:
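The choice of decision threshold has a large effect on the predicted labels. The sketch below uses only the standard library, with hypothetical names (ThresholdSketch, toLabels); 0.5 is the conventional threshold for a sigmoid output, used here as the default.

```scala
object ThresholdSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Map each raw score z to a probability, then to a hard 0.0/1.0 label.
  // Probabilities at or below the threshold become 0.0.
  def toLabels(scores: Seq[Double], threshold: Double = 0.5): Seq[Double] =
    scores.map(sigmoid).map(p => if (p <= threshold) 0.0 else 1.0)
}
```

For the scores -2.0, 0.0, and 3.0 the probabilities are about 0.12, 0.5, and 0.95. A threshold of 0.5 labels them 0, 0, 1, while a threshold of 0.05 labels all three as 1.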
def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
  // Count the positions where the prediction matches the true label.
  val numCorrect = (0 until label.length)
    .count(index => label(index) == labelPredicted(index))
  numCorrect.toDouble / label.length.toDouble
}
This again relies on functional programming features, and the code is very compact.
Finally, there are a few getter methods:
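The same computation can be sketched on plain Scala sequences, which makes the idea easy to try without Breeze. The names AccuracySketch and accuracy here are hypothetical stand-ins, not part of the project.

```scala
object AccuracySketch {
  // Fraction of positions where the predicted label equals the true label.
  def accuracy(labels: Seq[Double], predicted: Seq[Double]): Double = {
    require(labels.length == predicted.length, "label sequences must have equal length")
    val numCorrect = labels.zip(predicted).count { case (y, yHat) => y == yHat }
    numCorrect.toDouble / labels.length
  }
}
```

Comparing Doubles with == is safe here only because the labels are exactly 0.0 or 1.0.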
def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

def getLearningRate: Double = this.learningRate

def getIterationTime: Int = this.iterationTime
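That is Logistic Regression implemented in Scala; the complete code follows. As the old line goes, what is learned on paper always feels shallow, and real understanding comes only from doing it yourself. If anything is unclear, copy the code into your local environment, run it, and modify the parts in question to see how the program behaves; you will learn a lot.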
package org.mengpan.deeplearning.model

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.{log, sigmoid}

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/15.
  */
class LogisticRegressionModel() {
  var learningRate: Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

  def setLearningRate(learningRate: Double): this.type = {
    this.learningRate = learningRate
    this
  }

  def setIterationTime(iterationTime: Int): this.type = {
    this.iterationTime = iterationTime
    this
  }

  def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {
    var (w, b) = initializeParams(feature.cols)

    (1 to this.iterationTime).foreach { i =>
      val (cost, dw, db) = propagate(feature, label, w, b)

      if (i % 100 == 0)
        println("INFO: Cost in " + i + "th time of iteration: " + cost)
      costHistory.put(i, cost)

      // Floating-point division keeps the decay smooth; i / 1000 on an Int
      // would leave the rate unchanged for the first 999 iterations.
      val adjustedLearningRate = this.learningRate / (log(i / 1000.0 + 1) + 1)
      w :-= adjustedLearningRate * dw
      b -= adjustedLearningRate * db
    }

    this.w = w
    this.b = b
    this
  }

  def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {
    // Include the bias b, as in training, and threshold the probability at 0.5.
    val yPredicted = sigmoid(feature * this.w + this.b).map { eachY =>
      if (eachY <= 0.5) 0.0 else 1.0
    }
    yPredicted
  }

  def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
    val numCorrect = (0 until label.length)
      .count(index => label(index) == labelPredicted(index))
    numCorrect.toDouble / label.length.toDouble
  }

  def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory

  def getLearningRate: Double = this.learningRate

  def getIterationTime: Int = this.iterationTime

  private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
    val w = DenseVector.rand[Double](featureSize)
    val b = DenseVector.rand[Double](1).data(0)
    (w, b)
  }

  private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double],
                        w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val labelHat = sigmoid(feature * w + b)
    // println("DEBUG: feature * w + b is " + feature * w + b)
    // println("DEBUG: the feature's number of cols is " + feature.cols)
    // println("DEBUG: the feature's number of rows is " + feature.rows)
    // println("DEBUG: the labelHat is " + labelHat)

    val cost = -(label.t * log(labelHat) +
      (DenseVector.ones[Double](numExamples) - label).t *
        log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble
    // println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }
}
4. Other Supporting Code
First, the case class that represents a Cat:
package org.mengpan.deeplearning.data

import breeze.linalg.DenseVector

/**
  * Created by mengpan on 2017/8/15.
  */
case class Cat(feature: DenseVector[Double], label: Double)
Next is the CatDataHelper singleton (an object in Scala), which reads the image data from local disk:
package org.mengpan.deeplearning.helper

import java.io.File
import javax.imageio.ImageIO

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

import scala.io.Source

/**
  * Created by mengpan on 2017/8/15.
  */
object CatDataHelper {

  def getAllCatData: DlCollection[Cat] = {
    val labels = getLabels
    val catNonCatLabels = getBalancedBatNonCatLabels(labels)

    // Images that cannot be read yield None and are dropped by flatMap.
    val catList = catNonCatLabels.flatMap { case (fileNumber, label) =>
      val animalFileName: String = "/Users/mengpan/Downloads/train/" + fileNumber + ".png"
      getFeatureForOneAnimal(animalFileName).map(feature => Cat(feature, label))
    }.toList

    new DlCollection[Cat](catList)
  }

  private def getFeatureForOneAnimal(animalFileName: String): Option[DenseVector[Double]] = {
    println("Reading file: " + animalFileName)
    try {
      val image = ImageIO.read(new File(animalFileName))
      val imageData = image.getData
      val height = imageData.getHeight
      val width = imageData.getWidth

      val redVector = DenseVector.zeros[Double](height * width)
      val greenVector = DenseVector.zeros[Double](height * width)
      val blueVector = DenseVector.zeros[Double](height * width)

      (0 until height).foreach { row =>
        (0 until width).foreach { col =>
          val RGB = imageData.getPixel(col, row, Array(0, 0, 0))
          // Row-major index. The stride must be the image width; a fixed
          // stride of 10 would make different pixels overwrite each other.
          val index = row * width + col
          redVector(index) = RGB(0)
          greenVector(index) = RGB(1)
          blueVector(index) = RGB(2)
        }
      }

      val resVector = DenseMatrix(redVector, greenVector, blueVector)
        .reshape(height * width * 3, 1)
        .toDenseVector

      // Standardize the features to zero mean and unit standard deviation.
      Some((resVector - breeze.stats.mean(resVector)) /:/ breeze.stats.stddev(resVector))
    } catch {
      case _: Exception => None
    }
  }

  // Read (id, label) pairs from the CSV file, skipping the header row.
  private def getLabels: Vector[(Int, String)] = {
    Source
      .fromFile("/Users/mengpan/Downloads/trainLabels.csv")
      .getLines()
      .map { eachRow =>
        val split = eachRow.split(",")
        (split(0), split(1))
      }
      .filter { eachRow => eachRow._1 != "id" }
      .map { eachRow => (eachRow._1.toInt, eachRow._2) }
      .toVector
  }

  // Keep only two classes: "cat" becomes 1, "automobile" becomes 0.
  private def getBalancedBatNonCatLabels(labels: Vector[(Int, String)]): Vector[(Int, Int)] = {
    labels
      .map { label =>
        val numLabel = label._2 match {
          case "cat" => 1
          case "automobile" => 0
          case _ => 2
        }
        (label._1, numLabel)
      }
      .filter { label => label._2 != 2 }
  }
}
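The pixel loop in CatDataHelper flattens a two-dimensional image into a one-dimensional vector. A minimal sketch of row-major flattening on plain arrays (FlattenSketch and flatten are hypothetical names used only for illustration) shows the indexing rule: the pixel at (row, col) goes to row * width + col, so the multiplier has to be the actual image width, otherwise distinct pixels land on the same index.

```scala
object FlattenSketch {
  // Row-major flattening of a height x width image stored as rows of pixels.
  def flatten(image: Array[Array[Int]]): Array[Int] = {
    val height = image.length
    val width = image.head.length
    val flat = new Array[Int](height * width)
    for (row <- 0 until height; col <- 0 until width)
      flat(row * width + col) = image(row)(col)
    flat
  }
}
```

For a 2 x 3 image with rows (1, 2, 3) and (4, 5, 6), the flattened result is 1, 2, 3, 4, 5, 6.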
Next is DlCollection, the container this project uses to hold a dataset:
package org.mengpan.deeplearning.helper

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

/**
  * Created by mengpan on 2017/8/15.
  */
class DlCollection[E <: Cat](data: List[E]) {
  private val numRows: Int = this.data.size
  private val numCols: Int = this.data.head.feature.length

  // Split into (training, test); trainingSize is the fraction used for training.
  def split(trainingSize: Double): (DlCollection[E], DlCollection[E]) = {
    val splited = data.splitAt((data.length * trainingSize).toInt)
    (new DlCollection[E](splited._1), new DlCollection[E](splited._2))
  }

  // One example per row, one feature per column.
  def getFeatureAsMatrix: DenseMatrix[Double] = {
    val feature = DenseMatrix.zeros[Double](this.numRows, this.numCols)

    var i = 0
    this.data.foreach { eachRow =>
      feature(i, ::) := eachRow.feature.t
      i = i + 1
    }

    feature
  }

  def getLabelAsVector: DenseVector[Double] = {
    val label = DenseVector.zeros[Double](this.numRows)

    var i: Int = 0
    this.data.foreach { eachRow =>
      label(i) = eachRow.label
      i += 1
    }

    label
  }

  override def toString = s"DlCollection($numRows, $numCols, $getFeatureAsMatrix, $getLabelAsVector)"
}
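The split method above cuts the list at a fixed position, so the test set is whatever comes last in the input order. The sketch below reproduces that behavior on a plain List and adds a shuffled variant as one possible alternative; SplitSketch, split, and shuffledSplit are hypothetical names, and the shuffle is an assumption for illustration, not something the project does.

```scala
import scala.util.Random

object SplitSketch {
  // First `trainingSize` fraction becomes training data, the rest test data.
  def split[A](data: List[A], trainingSize: Double): (List[A], List[A]) =
    data.splitAt((data.length * trainingSize).toInt)

  // Hypothetical variant: shuffle with a fixed seed before splitting, so the
  // test set is a random sample and the split is still reproducible.
  def shuffledSplit[A](data: List[A], trainingSize: Double, seed: Long): (List[A], List[A]) =
    split(new Random(seed).shuffle(data), trainingSize)
}
```

With ten items and trainingSize = 0.8, both versions return eight training items and two test items; only the shuffled version mixes the order first.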
Last is the plotting utility, a thin wrapper around JFreeChart:
package org.mengpan.deeplearning.utils

import javax.swing.JFrame

import org.jfree.chart.plot.PlotOrientation
import org.jfree.chart.{ChartFactory, ChartPanel, JFreeChart}
import org.jfree.data.xy.DefaultXYDataset

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/17.
  */
object PlotUtils {

  def plotCostHistory(costHistory: mutable.TreeMap[Int, Double]): Unit = {
    // x values are iteration numbers, y values are the recorded costs.
    val x = costHistory.keys.toArray.map { _.toDouble }
    val y = costHistory.values.toArray[Double]
    val data = Array(x, y)

    val xyDataset: DefaultXYDataset = new DefaultXYDataset()
    xyDataset.addSeries("Iteration v.s. Cost", data)

    val jFreeChart: JFreeChart = ChartFactory.createScatterPlot(
      "Cost History", "Iteration", "Cost",
      xyDataset, PlotOrientation.VERTICAL, true, false, false
    )

    // Show the chart in a simple Swing window.
    val panel = new ChartPanel(jFreeChart, true)
    val frame = new JFrame()
    frame.add(panel)
    frame.setBounds(50, 50, 800, 600)
    frame.setVisible(true)
  }
}
5. Demo
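Since I could not find the cat image set Andrew Ng uses in the Deep Learning course, I test with CIFAR-10, a well-known image recognition dataset. In this example only two of its 10 classes, cat and automobile, are used for classification. CIFAR-10 can be found with a web search, or downloaded directly from Kaggle: https://www.kaggle.com/c/cifar-10
Here is the demo code used in this article: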
package org.mengpan.deeplearning.demo

import org.mengpan.deeplearning.data.Cat
import org.mengpan.deeplearning.helper.{CatDataHelper, DlCollection}
import org.mengpan.deeplearning.model.LogisticRegressionModel
import org.mengpan.deeplearning.utils.PlotUtils

/**
  * Created by mengpan on 2017/8/15.
  */
object ClassOneLogisticRegressionDemo extends App {
  // Load the cat image dataset
  val catData: DlCollection[Cat] = CatDataHelper.getAllCatData

  // Split into a training set and a test set
  val (training, test) = catData.split(0.8)

  // Get the features and labels of the training and test sets
  val trainingFeature = training.getFeatureAsMatrix
  val trainingLabel = training.getLabelAsVector
  val testFeature = test.getFeatureAsMatrix
  val testLabel = test.getLabelAsVector

  // Initialize the LR model
  val lrModel: LogisticRegressionModel = new LogisticRegressionModel()
    .setLearningRate(0.005)
    .setIterationTime(3000)

  // Train the model on the training set
  val trainedModel: LogisticRegressionModel = lrModel.train(trainingFeature, trainingLabel)

  // Evaluate the model on both sets
  val yPredicted = trainedModel.predict(testFeature)
  val trainYPredicted = trainedModel.predict(trainingFeature)

  val testAccuracy = trainedModel.accuracy(testLabel, yPredicted)
  val trainAccuracy = trainedModel.accuracy(trainingLabel, trainYPredicted)
  println("\n The train accuracy of this model is: " + trainAccuracy)
  println("\n The test accuracy of this model is: " +