Spark 2.0 Machine Learning (ML) Library: Feature Extraction, Transformation, and Selection (Scala)

I. Preface

II. Code

1. TF-IDF (Term Frequency-Inverse Document Frequency)

TF (Term Frequency): both HashingTF and CountVectorizer can be used to generate term-frequency (TF) vectors. IDF (Inverse Document Frequency) is an Estimator that is fit on the featurized data and rescales the TF vectors, down-weighting terms that appear in many documents.

HashingTF is a Transformer which takes sets of terms and converts them into fixed-length feature vectors. It uses the hashing trick: raw features are mapped to indices by applying a hash function, and term frequencies are then calculated from the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may end up as the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo is used to turn the hash value into a column index, it is advisable to use a power of two as the feature dimension, otherwise the features will not be mapped evenly onto the columns. The default feature dimension is 2^18 = 262,144. An optional binary toggle parameter controls the term frequency counts: when set to true, all nonzero frequency counts are set to 1, which is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

case class Love(id: Long, text: String, label: Double)

case class Test(id: Long, text: String)

/**
  * 1. TF-IDF (Term Frequency-Inverse Document Frequency)
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // local test; otherwise: "A master URL must be set in your configuration" at org.apache.spark.SparkContext
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory for reading/writing files; HDFS is preferred

    // 1. Training samples
    val sentenceData = spark.createDataFrame(
      Seq(
        Love(1L, "I love you", 1.0),
        Love(2L, "There is nothing to do", 0.0),
        Love(3L, "Work hard and you will success", 0.0),
        Love(4L, "We love each other", 1.0),
        Love(5L, "Where there is love, there are always wishes", 1.0),
        Love(6L, "I love you not because who you are,but because who I am when I am with you", 1.0),
        Love(7L, "Never frown,even when you are sad,because youn ever know who is falling in love with your smile", 1.0),
        Love(8L, "Whatever is worth doing is worth doing well", 0.0),
        Love(9L, "The hard part isn’t making the decision. It’s living with it", 0.0),
        Love(10L, "Your happy passer-by all knows, my distressed there is no place hides", 0.0),
        Love(11L, "When the whole world is about to rain, let’s make it clear in our heart together", 0.0)
      )
    ).toDF()
    sentenceData.show(false)

    /**
      * |id |text |label|
      * |1  |I love you |1.0 |
      * |2  |There is nothing to do |0.0 |
      * |3  |Work hard and you will success |0.0 |
      * |4  |We love each other |1.0 |
      * |5  |Where there is love, there are always wishes |1.0 |
      * |6  |I love you not because who you are,but because who I am when I am with you |1.0 |
      * |7  |Never frown,even when you are sad,because youn ever know who is falling in love with your smile |1.0 |
      * |8  |Whatever is worth doing is worth doing well |0.0 |
      * |9  |The hard part isn’t making the decision. It’s living with it |0.0 |
      * |10 |Your happy passer-by all knows, my distressed there is no place hides |0.0 |
      * |11 |When the whole world is about to rain, let’s make it clear in our heart together |0.0 |
      */

    // 2. Configure the stages: tokenizer, hashingTF, idf
    val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(20)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("rawFeatures")
    val idf = new IDF() // term-frequency vectors can also be obtained with CountVectorizer
      .setInputCol(hashingTF.getOutputCol)
      .setOutputCol("features")

    val wordsData = tokenizer.transform(sentenceData)
    val featurizedData = hashingTF.transform(wordsData)
    val idfModel = idf.fit(featurizedData)

    // 3. Vector representation of each document
    val rescaledData = idfModel.transform(featurizedData)
    rescaledData
      .select("label", "features")
      .show(false)

    /** Note: the longer the sentence (the more words it contains), the more nonzero entries its feature vector has.
      *
      * |label|features|
      * |1.0 |(20,[0,5,9],[0.28768207245178085,0.4054651081081644,0.8754687373538999])|
      * |0.0 |(20,[1,4,8,11,14],[0.4054651081081644,1.3862943611198906,1.0986122886681098,1.0986122886681098,0.8754687373538999])|
      * |0.0 |(20,[0,5,7,13],[0.28768207245178085,1.2163953243244932,1.3862943611198906,0.8754687373538999])|
      * |1.0 |(20,[0,5,13,14],[0.28768207245178085,0.4054651081081644,0.8754687373538999,0.8754687373538999])|
      * |1.0 |(20,[1,11,13,14,17,18,19],[0.4054651081081644,2.1972245773362196,0.8754687373538999,0.8754687373538999,0.6931471805599453,1.0986122886681098,1.0986122886681098])|
      * |1.0 |(20,[0,1,5,9,10,13,15,16,17,18],[0.28768207245178085,0.8109302162163288,1.2163953243244932,2.6264062120616996,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098])|
      * |1.0 |(20,[0,1,2,3,5,6,9,10,14,16,17,18,19],[0.28768207245178085,0.4054651081081644,1.3862943611198906,0.8754687373538999,0.8109302162163288,1.3862943611198906,0.8754687373538999,1.7509374747077997,0.8754687373538999,0.6931471805599453,1.3862943611198906,1.0986122886681098,2.1972245773362196])|
      * |0.0 |(20,[0,1,3,15,17],[0.5753641449035617,0.8109302162163288,1.7509374747077997,0.8754687373538999,0.6931471805599453])|
      * |0.0 |(20,[0,5,7,10,15,16,19],[0.5753641449035617,0.8109302162163288,1.3862943611198906,2.6264062120616996,0.8754687373538999,0.6931471805599453,1.0986122886681098])|
      * |0.0 |(20,[1,2,3,6,8,9,11,16],[1.2163953243244932,1.3862943611198906,0.8754687373538999,2.772588722239781,1.0986122886681098,0.8754687373538999,2.1972245773362196,0.6931471805599453])|
      * |0.0 |(20,[0,1,3,4,5,8,10,12,15,16,17],[0.28768207245178085,0.4054651081081644,0.8754687373538999,2.772588722239781,0.8109302162163288,1.0986122886681098,1.7509374747077997,1.791759469228055,0.8754687373538999,2.0794415416798357,0.6931471805599453])|
      */
  }

}
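
The example above deliberately keeps setNumFeatures(20) so that the printed vectors stay short, but note that 20 is not a power of two, so terms will not be spread evenly across the hash buckets. Below is a minimal, self-contained sketch of the power-of-two feature dimension and the binary toggle described at the start of this section; the object name and the toy sentences are made up for illustration, and it assumes HashingTF.setBinary is available (Spark 2.0+).

import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object HashingTFBinarySketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("hashingtf-binary").getOrCreate()

    // Toy input: two short sentences, one with repeated words.
    val sentences = spark.createDataFrame(Seq(
      (0L, "I love you I love you"),
      (1L, "Whatever is worth doing is worth doing well")
    )).toDF("id", "text")

    val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(sentences)

    // A power-of-two feature dimension (1 << 10 = 1024) helps spread terms evenly over the buckets,
    // and setBinary(true) caps every nonzero term count at 1.
    val hashingTF = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(1 << 10)
      .setBinary(true)

    val tf = hashingTF.transform(words)

    // IDF is fit on the binarized counts in exactly the same way as in the example above.
    val tfidf = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf).transform(tf)
    tfidf.select("id", "features").show(false)

    spark.stop()
  }
}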

2. Word2Vec (Document Similarity)

Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel then transforms each document into a vector by averaging the vectors of all the words in the document; this vector can be used as a feature for prediction, document-similarity calculations, and so on.

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
  * 2. Word2Vec
  */
object FeaturesTest {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // local test; otherwise: "A master URL must be set in your configuration" at org.apache.spark.SparkContext
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory for reading/writing files; HDFS is preferred

    // 1. Training samples
    val documentDF = spark.createDataFrame(
      Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
      ).map(Tuple1.apply)
    ).toDF("text") // requires Scala 2.11+; otherwise: "No TypeTag available"
    documentDF.show(false)
    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |text                                                                                                             |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love, , there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows, , my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain, , let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      **/

    // 2. word2Vec
    val word2VecModel = new Word2Vec()
      .setInputCol("text") // the input column must contain arrays of words
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(documentDF)

    // 3. Vector representation of each document
    val result = word2VecModel.transform(documentDF)
    result
      .select("result","text")
      .show(false)

    /**
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |result                                                              |text                                                                                                             |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      * |[-0.05712633579969406,0.01896375169356664,-0.021923241515954334]    |[I, love, you]                                                                                                   |
      * |[0.006795959174633027,-0.05859951674938202,-0.02231040205806494]    |[There, is, nothing, to, do]                                                                                     |
      * |[-0.01718233898282051,-0.044684726279228926,0.022707909112796187]   |[Work, hard, and, you, will, success]                                                                            |
      * |[0.014710488263517618,0.04914409201592207,-0.0535422433167696]      |[We, love, each, other]                                                                                          |
      * |[0.056647833436727524,-0.013540415093302727,-0.007903479505330324]  |[Where, there, is, love, , there, are, always, wishes]                                                            |
      * |[-0.012073692482183962,0.0068947237587588675,-0.007010678075911368] |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[-0.009022715939756702,0.007438146413358695,-0.00402127337806365]   |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[-0.007301235804334283,-0.025249323691241443,0.05116166779771447]   |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[0.055422113192352386,0.04088194024833766,-0.008757691322402521]    |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[0.0017315041817103822,0.026252828383197386,-0.004247877125938733]  |[Your, happy, passer-by, all, knows, , my, distressed, there, is, no, place, hides]                               |
      * |[-0.013085987884551287,-3.071942483074963E-4,-0.0029873197781853378]|[When, the, whole, world, is, about, to, rain, , let’s, make, it, clear, in, our, heart, together]                |
      * +--------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+
      **/

  }

}
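
The paragraph introducing Word2Vec notes that the averaged document vectors can be used for document-similarity calculations, which the example above stops short of showing. Here is a minimal, self-contained sketch of that idea; the object name, the toy corpus, and the hand-rolled cosine helper are illustrative assumptions, while Word2VecModel.findSynonyms is Spark's built-in word-level similarity lookup.

import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object Word2VecSimilaritySketch {

  // Plain cosine similarity between two ML vectors.
  def cosine(a: Vector, b: Vector): Double = {
    val (x, y) = (a.toArray, b.toArray)
    val dot = x.zip(y).map { case (u, v) => u * v }.sum
    val norms = math.sqrt(x.map(u => u * u).sum) * math.sqrt(y.map(v => v * v).sum)
    if (norms == 0.0) 0.0 else dot / norms
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("word2vec-similarity").getOrCreate()

    // Tiny toy corpus; the column name "text" mirrors the example above.
    val docs = spark.createDataFrame(Seq(
      "I love you".split(" "),
      "We love each other".split(" "),
      "Whatever is worth doing is worth doing well".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(3)
      .setMinCount(0)
      .fit(docs)

    // Collect the averaged document vectors and score them pairwise.
    val vecs = model.transform(docs).select("result").collect().map(_.getAs[Vector](0))
    println(s"similarity(doc0, doc1) = ${cosine(vecs(0), vecs(1))}")
    println(s"similarity(doc0, doc2) = ${cosine(vecs(0), vecs(2))}")

    // findSynonyms returns the words closest to the query word in the embedding space.
    model.findSynonyms("love", 2).show(false)

    spark.stop()
  }
}

With a corpus this small the similarity scores are essentially noise; the point is only the mechanics of turning the result column into pairwise scores.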

3. CountVectorizer

CountVectorizer and CountVectorizerModel convert a collection of text documents into vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and produce a CountVectorizerModel. The model generates sparse representations of the documents over the vocabulary, which can then be passed to other algorithms such as LDA.

During the fitting process, CountVectorizer selects the top vocabSize terms ordered by term frequency across the corpus. An optional parameter minDF also affects fitting by specifying the minimum number of documents (or fraction of documents, if a value less than 1.0 is given) in which a term must appear in order to be included in the vocabulary. Another optional binary toggle parameter controls the output vector: if set to true, all nonzero counts are set to 1, which is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

import org.apache.spark.ml.feature._
import org.apache.spark.sql.SparkSession

/**
  * 3. CountVectorizer
  * Obtain term-frequency counts
  */
object FeaturesTests {

  def main(args: Array[String]): Unit = {

    // 0. Build the SparkSession
    val spark = SparkSession
      .builder()
      .master("local") // local test; otherwise: "A master URL must be set in your configuration" at org.apache.spark.SparkContext
      .appName("test")
      .enableHiveSupport()
      .getOrCreate() // reuse an existing session or create a new one

    spark.sparkContext.setCheckpointDir("C:\\LLLLLLLLLLLLLLLLLLL\\BigData_AI\\sparkmlTest") // checkpoint directory for reading/writing files; HDFS is preferred

    // 1. Training samples
    val documentDF = spark.createDataFrame(
      Seq(
        "I love you".split(" "),
        "There is nothing to do".split(" "),
        "Work hard and you will success".split(" "),
        "We love each other".split(" "),
        "Where there is love, there are always wishes".split(" "),
        "I love you not because who you are,but because who I am when I am with you".split(" "),
        "Never frown,even when you are sad,because youn ever know who is falling in love with your smile".split(" "),
        "Whatever is worth doing is worth doing well".split(" "),
        "The hard part isn’t making the decision. It’s living with it".split(" "),
        "Your happy passer-by all knows, my distressed there is no place hides".split(" "),
        "When the whole world is about to rain, let’s make it clear in our heart together".split(" ")
      ).map(Tuple1.apply)
    ).toDF("words") // requires Scala 2.11+; otherwise: "No TypeTag available"
    documentDF.show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+
      * |words                                                                                                            |
      * +-----------------------------------------------------------------------------------------------------------------+
      * |[I, love, you]                                                                                                   |
      * |[There, is, nothing, to, do]                                                                                     |
      * |[Work, hard, and, you, will, success]                                                                            |
      * |[We, love, each, other]                                                                                          |
      * |[Where, there, is, love,, there, are, always, wishes]                                                            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |
      * |[Your, happy, passer-by, all, knows,, my, distressed, there, is, no, place, hides]                               |
      * |[When, the, whole, world, is, about, to, rain,, let’s, make, it, clear, in, our, heart, together]                |
      * +-----------------------------------------------------------------------------------------------------------------+
      **/

    // 2. CountVectorizer
    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .setVocabSize(3)
      .setMinDF(2)
      .fit(documentDF)

    // 3. Vector representation of each document
    cvModel.transform(documentDF).show(false)

    /**
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |words                                                                                                            |features                 |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      * |[I, love, you]                                                                                                   |(3,[1,2],[1.0,1.0])      |
      * |[There, is, nothing, to, do]                                                                                     |(3,[0],[1.0])            |
      * |[Work, hard, and, you, will, success]                                                                            |(3,[1],[1.0])            |
      * |[We, love, each, other]                                                                                          |(3,[2],[1.0])            |
      * |[Where, there, is, love, , there, are, always, wishes]                                                            |(3,[0],[1.0])            |
      * |[I, love, you, not, because, who, you, are,but, because, who, I, am, when, I, am, with, you]                     |(3,[1,2],[3.0,1.0])      |
      * |[Never, frown,even, when, you, are, sad,because, youn, ever, know, who, is, falling, in, love, with, your, smile]|(3,[0,1,2],[1.0,1.0,1.0])|
      * |[Whatever, is, worth, doing, is, worth, doing, well]                                                             |(3,[0],[2.0])            |
      * |[The, hard, part, isn’t, making, the, decision., It’s, living, with, it]                                         |(3,[],[])                |
      * |[Your, happy, passer-by, all, knows, , my, distressed, there, is, no, place, hides]                               |(3,[0],[1.0])            |
      * |[When, the, whole, world, is, about, to, rain, , let’s, make, it, clear, in, our, heart, together]                |(3,[0],[1.0])            |
      * +-----------------------------------------------------------------------------------------------------------------+-------------------------+
      **/

  }

}
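
The introduction to this section also mentions the case where an a-priori dictionary is available, as well as the binary toggle, neither of which the Estimator-based example above exercises. A minimal, self-contained sketch of that path follows; the object name and toy data are made up, and it assumes the CountVectorizerModel(vocabulary) constructor and the setBinary toggle.

import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.SparkSession

object CountVectorizerAPrioriSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("cv-a-priori").getOrCreate()

    // Toy documents are lower-cased because vocabulary matching is case-sensitive.
    val docs = spark.createDataFrame(Seq(
      "i love you love".split(" "),
      "we love each other".split(" ")
    ).map(Tuple1.apply)).toDF("words")

    // With an a-priori vocabulary there is nothing to fit: build the model directly
    // from the known term list. setBinary(true) caps every nonzero count at 1.
    val cvm = new CountVectorizerModel(Array("love", "you", "we"))
      .setInputCol("words")
      .setOutputCol("features")
      .setBinary(true)

    cvm.transform(docs).show(false)

    spark.stop()
  }
}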