Learning Deep Learning with Andrew Ng: Implementing Neural Networks in Scala, Lesson 1

1. Introduction

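In August 2017, Andrew Ng, former chief scientist at Baidu, announced on Twitter his first move after leaving Baidu: a Deep Learning course on Coursera that builds neural networks from scratch. The announcement drew wide attention.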

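As of today (Thursday, August 17, 2017), I have enrolled in the course and completed the first two weeks of lectures and assignments. In those two weeks, Andrew Ng uses Logistic Regression to explain how a neural network works, and shows in plain language that backpropagation is simply the chain rule of differentiation. In the week-two programming assignment, students implement a Logistic Regression model from scratch in a Python Notebook and use it for binary classification on a given image set, deciding whether each picture shows a cat; the trained model can reach a test accuracy of 70%. Andrew Ng also says that later lessons will cover optimization methods for neural networks, to further raise the accuracy of the cat classifier.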

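The programming assignment puts its emphasis on vectorized computation, and shows with an example that vector multiplication in Python's Numpy package runs more than 300 times faster than summing in a plain for loop, which is the difference between 1 minute and 5 hours. In real-world engineering, however, Python is mostly used to try out ideas and get results quickly, and is to some extent less suited to developing and deploying production models. This article uses Scala to implement the assignments from Andrew Ng's Deep Learning course. Thank you for reading; I hope we can improve together.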

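In Python, deep learning algorithms and vectorized operations rely on the Numpy package, short for Numerical Python. Numpy provides vector and matrix implementations, along with functions for matrix operations and decompositions. The Scala counterpart used here is Breeze, which provides the DenseVector and DenseMatrix data structures, plus SparseVector and CSCMatrix for very sparse data, structures that core Numpy does not offer. Most importantly, Scala is a statically typed, type-safe language, which makes it a reasonable choice not only for implementing the algorithm but also for data preprocessing and data cleaning, that is, ETL.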

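This article has four parts. Part one introduces the overall project structure. Part two explains in detail the Scala code that implements Logistic Regression. Part three explains the supporting code, such as data preprocessing, the plotting tool, and other helper classes. Part four gives the demo and the download location of the dataset. All the code for this project is on GitHub at https://github.com/pan5431333/coursera-deeplearning-practice-in-scala and will be kept up to date as the course progresses. Follows are welcome.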

2. Project Structure

The project is organized into five subpackages: data, demo, helper, model, and utils.

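The data package contains one class, Cat, a Scala case class, which is well suited to representing a real-world entity. The demo package holds the runnable example for each lesson's assignment. The helper package currently contains two classes. CatDataHelper uses Java's ImageIO to read images from the local file system, converts each image to an RGB matrix representation, and then reshapes it into a vector. DlCollection is a generic collection class that provides three methods commonly needed in deep learning: split, which divides the data into a training set and a test set; getFeatureAsMatrix, which returns the feature matrix the algorithm needs; and getLabelAsVector, which returns the label vector. The model package currently contains only the Logistic Regression model. The utils package currently contains PlotUtils, whose plotCostHistory method plots the cost against the number of iterations.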

The next section covers the Scala implementation of the Logistic Regression algorithm.

3. Logistic Regression in Scala

First, define the LogisticRegressionModel class:

class LogisticRegressionModel() {
  var learningRate: Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

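The class has five instance variables. The first two are hyperparameters: learningRate is the learning rate and iterationTime is the maximum number of iterations. w and b are the model parameters, which are optimized over the iterations. costHistory is a TreeMap that records how the cost changes during training; its key is the iteration number and its value is the cost.

Next come the two setters for the hyperparameters: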

def setLearningRate(learningRate: Double): this.type = {
  this.learningRate = learningRate
  this
}

def setIterationTime(iterationTime: Int): this.type = {
  this.iterationTime = iterationTime
  this
}

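Note that these setters differ from Java-style setters: each one returns this (typed as this.type), which allows method chaining, so a caller can write val model = new LogisticRegressionModel().setLearningRate(0.0001).setIterationTime(3000). This makes the code flow more naturally. Method chaining is also widely used in Spark, and is especially elegant when building a data Pipeline.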
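As a minimal sketch of why a chained setter might return this.type rather than the class name: with this.type, the chain keeps the static type of a subclass, so subclass setters can follow base-class setters. The Model and TunedModel classes and the momentum field below are hypothetical and exist only for this illustration; they are not part of the project.

```scala
// Hypothetical classes for illustration only; they are not part of the project.
class Model {
  var learningRate: Double = 0.0

  // Returning this.type preserves the caller's static type across the chain.
  def setLearningRate(lr: Double): this.type = {
    this.learningRate = lr
    this
  }
}

class TunedModel extends Model {
  var momentum: Double = 0.0

  def setMomentum(m: Double): this.type = {
    this.momentum = m
    this
  }
}

object FluentSetterSketch {
  // setLearningRate is defined on Model, yet the chain is still typed as
  // TunedModel, so setMomentum can be called right after it.
  val tuned: TunedModel = new TunedModel().setLearningRate(0.01).setMomentum(0.9)
}
```

Had setLearningRate been declared to return Model, the call to setMomentum in the chain would not compile.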

Next is the training method:

def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {

  var (w, b) = initializeParams(feature.cols)

  (1 to this.iterationTime)
    .foreach{i =>
      val (cost, dw, db) = propagate(feature, label, w, b)

      if (i % 100 == 0) println("INFO: Cost in " + i + "th time of iteration: " + cost)
      costHistory.put(i, cost)

      val adjustedLearningRate = this.learningRate / (log(i/1000 + 1) + 1)
      w :-= adjustedLearningRate * dw
      b -= adjustedLearningRate * db
    }

  this.w = w
  this.b = b
  this
}

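This method uses two private methods, initializeParams() and propagate(), both explained below. It also applies a simple adjustment to learningRate so that it shrinks as the iteration count grows, which lowers the chance of overshooting the optimum. Because i/1000 is integer division, the rate drops in steps, once every 1000 iterations, rather than continuously.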
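To see what the schedule in train() does in practice, the sketch below evaluates the same expression with plain scala.math at a few iteration counts. The object and method names are made up for this illustration, and the base rate 0.005 is simply the value used in the demo.

```scala
object LearningRateDecaySketch {
  // Same schedule as in train(): base / (ln(i / 1000 + 1) + 1).
  // i / 1000 is integer division, so the rate changes only every 1000 iterations.
  def adjustedLearningRate(base: Double, i: Int): Double =
    base / (math.log(i / 1000 + 1) + 1)

  def main(args: Array[String]): Unit = {
    val base = 0.005 // base rate taken from the demo
    Seq(1, 999, 1000, 2000, 3000).foreach { i =>
      println(s"iteration $i -> ${adjustedLearningRate(base, i)}")
    }
  }
}
```

For iterations 1 to 999 the divisor is ln(1) + 1 = 1, so the base rate is used unchanged; from iteration 1000 onward the divisor becomes ln(2) + 1, and so on.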

Next is the method that initializes the model parameters:

private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
  val w = DenseVector.rand[Double](featureSize)
  val b = DenseVector.rand[Double](1).data(0)
  (w, b)
}

Here w and b are given random initial values between 0 and 1.

Next is the core of Logistic Regression, the forward and backward propagation:

private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double], w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val labelHat = sigmoid(feature * w + b)

//    println("DEBUG: feature * w + b is " + feature * w + b)
//    println("DEBUG: the feature's number of cols is " + feature.cols)
//    println("DEBUG: the feature's number of rows is " + feature.rows)
//    println("DEBUG: the labelHat is " + labelHat)

    val cost = -(label.t * log(labelHat) + (DenseVector.ones[Double](numExamples) - label).t * log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble

//    println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }

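The commented-out lines are DEBUG code from development; because this project does not pull in a logging library, printing was the only way to debug. feature.rows and feature.cols correspond to feature.shape[0] and feature.shape[1] in Python Numpy. sigmoid is a function provided by breeze.numerics._ that accepts a DenseVector or a DenseMatrix as its argument. For the computation of cost, dw, and db, refer to the theory of Logistic Regression; Andrew Ng's Deep Learning course covers anything that is unclear. One point to note: Python Numpy supports broadcasting, so 1 - np.array([1, 2, 3]) yields np.array([0, -1, -2]); when a scalar is combined with a vector or matrix, Numpy automatically applies the operation to every element. Breeze's support for this is limited, so the cost computation uses DenseVector.ones[Double](numExamples) - label rather than 1 - label.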
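For reference, the quantities computed in propagate() correspond to the standard logistic regression formulas below. The notation is an assumption made for this summary: X is the m × n feature matrix with one example per row, y is the label vector, ŷ is labelHat, 1 is the all-ones vector, and m is numExamples.

```latex
\begin{aligned}
\hat{y} &= \sigma(Xw + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
J(w, b) &= -\frac{1}{m} \left[ y^{\top} \log \hat{y} + (\mathbf{1} - y)^{\top} \log(\mathbf{1} - \hat{y}) \right] \\
\frac{\partial J}{\partial w} &= \frac{1}{m} X^{\top} (\hat{y} - y) \\
\frac{\partial J}{\partial b} &= \frac{1}{m} \mathbf{1}^{\top} (\hat{y} - y)
\end{aligned}
```

The three lines for J, ∂J/∂w, and ∂J/∂b map one-to-one onto the cost, dw, and db values in the code.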

Next is the method that predicts with the trained model:

def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {

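  // Use the same linear model as in training (weights plus bias b), then threshold the probability at 0.5.
  val yPredicted = sigmoid(feature * this.w + this.b).map{eachY =>
    if (eachY <= 0.5) 0.0 else 1.0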
  }

  yPredicted
}

This uses map, a staple of functional programming, which keeps the code concise.
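The prediction step can also be sketched without Breeze, which makes each stage explicit: a weighted sum plus bias, the sigmoid, and a threshold. This is a stand-in on plain Scala collections, assuming the conventional 0.5 decision threshold; PredictSketch and its method signatures are hypothetical, not the project's API.

```scala
object PredictSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Plain-collections stand-in for predict(): one inner Seq of features per example.
  def predict(features: Seq[Seq[Double]], w: Seq[Double], b: Double): Seq[Double] =
    features.map { row =>
      // Linear score z = w . x + b for this example.
      val z = row.zip(w).map { case (x, wi) => x * wi }.sum + b
      // Label 1.0 when the predicted probability exceeds 0.5, else 0.0.
      if (sigmoid(z) > 0.5) 1.0 else 0.0
    }
}
```

For example, PredictSketch.predict(Seq(Seq(2.0, 1.0), Seq(-3.0, 0.5)), Seq(1.0, 1.0), 0.0) scores the two rows at 3.0 and -2.5, so the first is labelled 1.0 and the second 0.0.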

Next is the method that computes prediction accuracy:

def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
  val numCorrect = (0 until label.length)
    .map{index =>
      if (label(index) == labelPredicted(index)) 1 else 0
    }
    .count(_ == 1)
  numCorrect.toDouble / label.length.toDouble
}

This again relies on functional-programming features, and the code stays very concise.
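The same computation can be written a little more directly by pairing labels with predictions and counting the matches. The sketch below uses plain Scala sequences instead of Breeze vectors; AccuracySketch is a hypothetical name, and the two require checks are additions that the original method does not perform.

```scala
object AccuracySketch {
  // Fraction of positions where the predicted label equals the true label.
  def accuracy(label: Seq[Double], predicted: Seq[Double]): Double = {
    require(label.length == predicted.length, "label and prediction lengths differ")
    require(label.nonEmpty, "cannot compute accuracy of an empty set")
    val numCorrect = label.zip(predicted).count { case (y, yHat) => y == yHat }
    numCorrect.toDouble / label.length
  }
}
```

Comparing with == is safe here only because both sequences hold exactly 0.0 or 1.0.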

Finally, there are a few auxiliary getter methods:

def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory
def getLearningRate: Double = this.learningRate
def getIterationTime: Int = this.iterationTime

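That is Logistic Regression implemented in Scala; the complete code follows. As the saying goes, what is learned on paper always feels shallow, and real understanding comes from practice. If anything is unclear, copy the code into your local environment, run it, and modify the parts in question to see how the program behaves. You will learn a great deal.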

package org.mengpan.deeplearning.model

import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.{log, sigmoid}

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/15.
  */
class LogisticRegressionModel() {
  var learningRate:Double = _
  var iterationTime: Int = _
  var w: DenseVector[Double] = _
  var b: Double = _
  val costHistory: mutable.TreeMap[Int, Double] = new mutable.TreeMap[Int, Double]()

  def setLearningRate(learningRate: Double): this.type = {
    this.learningRate = learningRate
    this
  }

  def setIterationTime(iterationTime: Int): this.type = {
    this.iterationTime = iterationTime
    this
  }

  def train(feature: DenseMatrix[Double], label: DenseVector[Double]): this.type = {

    var (w, b) = initializeParams(feature.cols)

    (1 to this.iterationTime)
      .foreach{i =>
        val (cost, dw, db) = propagate(feature, label, w, b)

        if (i % 100 == 0) println("INFO: Cost in " + i + "th time of iteration: " + cost)
        costHistory.put(i, cost)

        val adjustedLearningRate = this.learningRate / (log(i/1000 + 1) + 1)
        w :-= adjustedLearningRate * dw
        b -= adjustedLearningRate * db
      }

    this.w = w
    this.b = b
    this
  }

  def predict(feature: DenseMatrix[Double]): DenseVector[Double] = {

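    // Use the same linear model as in training (weights plus bias b), then threshold the probability at 0.5.
    val yPredicted = sigmoid(feature * this.w + this.b).map{eachY =>
      if (eachY <= 0.5) 0.0 else 1.0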
    }

    yPredicted
  }

  def accuracy(label: DenseVector[Double], labelPredicted: DenseVector[Double]): Double = {
    val numCorrect = (0 until label.length)
      .map{index =>
        if (label(index) == labelPredicted(index)) 1 else 0
      }
      .count(_ == 1)
    numCorrect.toDouble / label.length.toDouble
  }

  def getCostHistory: mutable.TreeMap[Int, Double] = this.costHistory
  def getLearningRate: Double = this.learningRate
  def getIterationTime: Int = this.iterationTime

  private def initializeParams(featureSize: Int): (DenseVector[Double], Double) = {
    val w = DenseVector.rand[Double](featureSize)
    val b = DenseVector.rand[Double](1).data(0)
    (w, b)
  }

  private def propagate(feature: DenseMatrix[Double], label: DenseVector[Double], w: DenseVector[Double], b: Double): (Double, DenseVector[Double], Double) = {
    val numExamples = feature.rows
    val labelHat = sigmoid(feature * w + b)

//    println("DEBUG: feature * w + b is " + feature * w + b)
//    println("DEBUG: the feature's number of cols is " + feature.cols)
//    println("DEBUG: the feature's number of rows is " + feature.rows)
//    println("DEBUG: the labelHat is " + labelHat)

    val cost = -(label.t * log(labelHat) + (DenseVector.ones[Double](numExamples) - label).t * log(DenseVector.ones[Double](numExamples) - labelHat)) / numExamples

    val dw = feature.t * (labelHat - label) /:/ numExamples.toDouble
    val db = DenseVector.ones[Double](numExamples).t * (labelHat - label) / numExamples.toDouble

//    println("DEBUG: the (dw, db) is " + dw + ", " + db)

    (cost, dw, db)
  }
}

4. Supporting Code

First, the case class that represents a Cat:

package org.mengpan.deeplearning.data

import breeze.linalg.DenseVector

/**
  * Created by mengpan on 2017/8/15.
  */
case class Cat(feature: DenseVector[Double], label: Double)

Next is CatDataHelper, a singleton (a Scala object) that reads the image data from local disk:

package org.mengpan.deeplearning.helper

import java.io.File
import javax.imageio.ImageIO

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

import scala.io.Source

/**
  * Created by mengpan on 2017/8/15.
  */
object CatDataHelper {
  def getAllCatData: DlCollection[Cat] = {

    val labels = getLabels

    val catNonCatLabels = getBalancedBatNonCatLabels(labels)

    val catList = catNonCatLabels.map{indexedLabel =>

      val fileNumber = indexedLabel._1
      val label = indexedLabel._2
      val animalFileName: String = "/Users/mengpan/Downloads/train/" + fileNumber + ".png"
      val feature = getFeatureForOneAnimal(animalFileName)

      feature match {
        case Some(s) => Cat(s, label)
        case None => Cat(DenseVector.zeros[Double](10), label)
      }
    }
      .filter{cat =>
        cat.feature.length != 10
      }
      .toList

    new DlCollection[Cat](catList)
  }

  private def getFeatureForOneAnimal(animalFileName: String): Option[DenseVector[Double]] = {
    println("Reading file: " + animalFileName)

    try {
      val image = ImageIO.read(new File(animalFileName))
      val imageData = image.getData

      val redVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)
      val greenVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)
      val blueVector = DenseVector.zeros[Double](imageData.getHeight * imageData.getWidth)

      (0 until imageData.getHeight).foreach{height =>
        (0 until imageData.getWidth).foreach{width =>
          val RGB = imageData.getPixel(width, height, Array(0, 0, 0))
          // Row-major index: each image row is getWidth pixels long.
          val index = width + height * imageData.getWidth
          redVector(index) = RGB(0)
          greenVector(index) = RGB(1)
          blueVector(index) = RGB(2)
        }
      }

      val resVector = DenseMatrix(redVector, greenVector, blueVector).reshape(imageData.getHeight*imageData.getWidth*3, 1).toDenseVector
      Some((resVector - breeze.stats.mean(resVector)) /:/ breeze.stats.stddev(resVector))
    } catch {
      case _: Exception => None
    }
  }

  private def getLabels: Vector[(Int, String)] = {
    Source
      .fromFile("/Users/mengpan/Downloads/trainLabels.csv")
      .getLines()
      .map{eachRow =>
        val split = eachRow.split(",")
        (split(0), split(1))
      }
      .filter{eachRow =>
        eachRow._1 != "id"
      }
      .map{eachRow =>
        (eachRow._1.toInt, eachRow._2)
      }
      .toVector
  }

  private def getBalancedBatNonCatLabels(labels: Vector[(Int, String)]): Vector[(Int, Int)] = {
    labels
      .map{label =>
      val numLabel = label._2 match {
        case "cat" => 1
        case "automobile" => 0
        case _ => 2
      }
      (label._1, numLabel)
    }
      .filter{label =>
        label._2 != 2
      }
  }

}
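getFeatureForOneAnimal flattens each colour channel of an image into a vector. As a minimal sketch of the row-major layout such a flattening usually assumes, pixel (x, y) of an image that is width pixels wide goes to position x + y * width, so every pixel gets its own slot. PixelIndexSketch, flatIndex, and allIndices are hypothetical names used only for this illustration; the 32 × 32 size matches CIFAR-10 images.

```scala
object PixelIndexSketch {
  // Row-major flattening: pixel (x, y) lands at x + y * width.
  def flatIndex(x: Int, y: Int, width: Int): Int = x + y * width

  // Indices visited when scanning the image row by row, left to right.
  def allIndices(width: Int, height: Int): Seq[Int] =
    for (y <- 0 until height; x <- 0 until width) yield flatIndex(x, y, width)
}
```

For a 32 × 32 image, allIndices(32, 32) visits each position from 0 to 1023 exactly once, which is why the multiplier has to be the image width rather than some other constant.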

Next is DlCollection, the container this project uses to hold the dataset:

package org.mengpan.deeplearning.helper

import breeze.linalg.{DenseMatrix, DenseVector}
import org.mengpan.deeplearning.data.Cat

/**
  * Created by mengpan on 2017/8/15.
  */
class DlCollection[E <: Cat](data: List[E]) {
  private val numRows: Int = this.data.size
  private val numCols: Int = this.data.head.feature.length

  def split(trainingSize: Double): (DlCollection[E], DlCollection[E]) = {
    val splited = data.splitAt((data.length * trainingSize).toInt)
    (new DlCollection[E](splited._1), new DlCollection[E](splited._2))
  }

  def getFeatureAsMatrix: DenseMatrix[Double] = {
    val feature = DenseMatrix.zeros[Double](this.numRows, this.numCols)

    var i = 0
    this.data.foreach{eachRow =>
      feature(i, ::) := eachRow.feature.t
      i = i+1
    }

    feature
  }

  def getLabelAsVector: DenseVector[Double] = {
    val label = DenseVector.zeros[Double](this.numRows)

    var i: Int = 0
    this.data.foreach{eachRow =>
      label(i) = eachRow.label
      i += 1
    }

    label
  }


  override def toString = s"DlCollection($numRows, $numCols, $getFeatureAsMatrix, $getLabelAsVector)"
}
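The split method cuts the list at a fixed position, so the result depends on the order of the underlying data. If the input happens to be ordered, for example by label, shuffling before the cut may give a more representative test set. The sketch below shows one way to do that on a plain List; shuffledSplit and its seed parameter are hypothetical and not part of DlCollection.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed for reproducibility, then cut at trainingSize,
  // a fraction between 0 and 1 as in DlCollection.split.
  def shuffledSplit[A](data: List[A], trainingSize: Double, seed: Long): (List[A], List[A]) = {
    require(trainingSize >= 0.0 && trainingSize <= 1.0, "trainingSize must be a fraction")
    val shuffled = new Random(seed).shuffle(data)
    shuffled.splitAt((data.length * trainingSize).toInt)
  }
}
```

Calling SplitSketch.shuffledSplit((1 to 10).toList, 0.8, 42L) returns 8 training items and 2 test items that together contain every input exactly once.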

Finally, the plotting utility, a thin wrapper around JFreeChart:

package org.mengpan.deeplearning.utils

import javax.swing.JFrame

import org.jfree.chart.plot.PlotOrientation
import org.jfree.chart.{ChartFactory, ChartPanel, JFreeChart}
import org.jfree.data.xy.DefaultXYDataset

import scala.collection.mutable

/**
  * Created by mengpan on 2017/8/17.
  */
object PlotUtils {
  def plotCostHistory(costHistory: mutable.TreeMap[Int, Double]): Unit = {

    val x = costHistory.keys.toArray.map{_.toDouble}
    val y = costHistory.values.toArray[Double]

    val data = Array(x, y)

    val xyDataset: DefaultXYDataset = new DefaultXYDataset()
    xyDataset.addSeries("Iteration v.s. Cost", data)

    val jFreeChart: JFreeChart = ChartFactory.createScatterPlot("Cost History",
      "Iteration", "Cost", xyDataset, PlotOrientation.VERTICAL, true, false, false
    )

    val panel = new ChartPanel(jFreeChart, true)

    val frame = new JFrame()

    frame.add(panel)
    frame.setBounds(50, 50, 800, 600)
    frame.setVisible(true)
  }
}

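5. Demo

Since I could not find the image set that Andrew Ng uses in the Deep Learning course to identify cats, I test with CIFAR-10, a well-known dataset in image recognition. This example uses only two of its 10 classes, cat and automobile, for classification. CIFAR-10 can be found by searching online, or downloaded directly from Kaggle: https://www.kaggle.com/c/cifar-10

Here is the demo code used in this article: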

package org.mengpan.deeplearning.demo

import org.mengpan.deeplearning.data.Cat
import org.mengpan.deeplearning.helper.{CatDataHelper, DlCollection}
import org.mengpan.deeplearning.model.LogisticRegressionModel
import org.mengpan.deeplearning.utils.PlotUtils

/**
  * Created by mengpan on 2017/8/15.
  */
object ClassOneLogisticRegressionDemo extends App{
  // Load the cat image dataset
  val catData: DlCollection[Cat] = CatDataHelper.getAllCatData

  // Get the training set and the test set
  val (training, test) = catData.split(0.8)


  // Get the features and labels of the training set and the test set
  val trainingFeature = training.getFeatureAsMatrix
  val trainingLabel = training.getLabelAsVector
  val testFeature = test.getFeatureAsMatrix
  val testLabel = test.getLabelAsVector

  // Initialize the LR model
  val lrModel: LogisticRegressionModel = new LogisticRegressionModel()
    .setLearningRate(0.005)
    .setIterationTime(3000)

  // Train the model on the training set
  val trainedModel: LogisticRegressionModel = lrModel.train(trainingFeature, trainingLabel)

  // Evaluate the model to obtain its quality metrics
  val yPredicted = trainedModel.predict(testFeature)
  val trainYPredicted = trainedModel.predict(trainingFeature)

  val testAccuracy = trainedModel.accuracy(testLabel, yPredicted)
  val trainAccuracy = trainedModel.accuracy(trainingLabel, trainYPredicted)
  println("\n The train accuracy of this model is: " + trainAccuracy)
  println("\n The test accuracy of this model is: " +