Using StringIndexer and IndexToString in Spark SQL to Index and Reverse-Index String Data

阿新 • Published: 2018-12-15

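Introduction

This post uses the AdultBase dataset from Kaggle: Machine-Learning-Databases. Although the dataset is fairly old, its format is easy to process and it covers a broad range of attributes, which makes it a good starting point for learning data processing. Using Spark SQL and related APIs, this post covers the following: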

Indexing text data with StringIndexer
Reversing the index with IndexToString and the labels of a fitted StringIndexer
Reading and writing formatted data with StructType and a schema

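This article focuses on how StringIndexer and IndexToString work together and how to assemble them with a Pipeline. It also covers reading and writing the dataset in a structured way through the read and write operations of SparkSession.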
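As a rough illustration of schema-driven reading and writing, here is a minimal sketch. The object name `AdultSchemaSketch`, the three-column subset of the schema, and the header setting are assumptions made for this example and are not taken from the original project. Adjust them to match your copy of the data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object AdultSchemaSketch {
  // Hypothetical subset of the columns; the real file has more fields.
  val adultSchema: StructType = StructType(Seq(
    StructField("age", DoubleType, nullable = true),
    StructField("workclass", StringType, nullable = true),
    StructField("education", StringType, nullable = true)
  ))

  // Read a CSV file with an explicit schema instead of relying on inference.
  // Assumes the file carries a header row.
  def readAdult(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(adultSchema)
      .option("header", "true")
      .csv(path)

  // Write a DataFrame back out as CSV with a header row.
  def writeAdult(df: DataFrame, path: String): Unit =
    df.write.mode("overwrite").option("header", "true").csv(path)
}
```

Passing the schema explicitly means every column arrives with the intended type (for example `age` as a double), which avoids casting columns later.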

Configuration

Language: Scala 2.11
Spark version: Spark 2.3.1

Class Overview

StringIndexer

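Parameter	Purpose
handleInvalid	How to handle invalid values (NULL values, or labels that were not seen when fitting). Options: "skip" filters out the row; "error" throws an error; "keep" assigns the value an additional index of its own
inputCol	Name of the column to index
outputCol	Name of the column that stores the index
stringOrderType	How indices are assigned. Options: "frequencyDesc" orders by descending frequency, so the most frequent label gets index 0 (this is the default); "frequencyAsc" orders by ascending frequency, so the least frequent label gets index 0; "alphabetDesc" orders alphabetically, descending; "alphabetAsc" orders alphabetically, ascending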

Code example:

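  import org.apache.spark.ml.feature.StringIndexer
  // Returns a new, unfitted indexer for the "workclass" column.
  // "keep" gives invalid values their own index; "frequencyAsc" gives the rarest label index 0.
  def getWorkclassIndexer: StringIndexer = new StringIndexer()
    .setInputCol("workclass").setOutputCol("workclassIndex").setHandleInvalid("keep")
    .setStringOrderType("frequencyAsc")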

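This code configures the workclass indexer through a function. Its inputCol and outputCol are "workclass" and "workclassIndex" respectively. Invalid values are kept and given a new index of their own, and with "frequencyAsc" the least frequent value receives index 0. The result is as follows: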

+----------------+--------------+
|       workclass|workclassIndex|
+----------------+--------------+
|    Never-worked|           0.0|
|     Without-pay|           1.0|
|     Federal-gov|           2.0|
|    Self-emp-inc|           3.0|
|       State-gov|           4.0|
|               ?|           5.0|
|       Local-gov|           6.0|
|Self-emp-not-inc|           7.0|
|         Private|           8.0|
+----------------+--------------+
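To make the effect of handleInvalid = "keep" concrete, the following sketch fits an indexer on a tiny toy dataset and then transforms a value that was not present during fitting. The object name `HandleInvalidSketch` and the toy rows are assumptions for illustration only.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object HandleInvalidSketch {
  // Returns the index assigned to a label that was not seen during fit.
  def unseenIndex(spark: SparkSession): Double = {
    import spark.implicits._
    // Toy training data with two distinct labels.
    val train = Seq("Private", "Private", "State-gov").toDF("workclass")
    // A label that does not occur in the training data.
    val unseen = Seq("Never-worked").toDF("workclass")

    val model = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .setHandleInvalid("keep")
      .fit(train)

    // With "keep", unseen labels land in one extra bucket after the known labels.
    model.transform(unseen).select("workclassIndex").head.getDouble(0)
  }
}
```

Two labels were seen during fitting, so they occupy indices 0.0 and 1.0 and the unseen label is mapped to 2.0. With "error" the same transform would throw, and with "skip" the row would be dropped.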

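A very useful trick: fitting the indexer on a dataset returns a model whose labels field holds the string for each index (the string at position i is the one encoded as index i). We will use this trick later.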
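The labels trick can be sketched on toy data as follows. The object name `LabelsSketch` and the sample rows are made up for this example. The real code would fit on the training data instead.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object LabelsSketch {
  // Fits an indexer on toy data and returns the index-to-string mapping.
  def fittedLabels(spark: SparkSession): Array[String] = {
    import spark.implicits._
    // "Private" occurs 3 times, "State-gov" twice, "Local-gov" once.
    val df = Seq("Private", "State-gov", "Private", "Local-gov", "Private", "State-gov")
      .toDF("workclass")

    val model = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .setStringOrderType("frequencyAsc")
      .fit(df)

    // labels(i) is the string that was encoded as index i.
    model.labels
  }
}
```

With ascending frequency order the result is `Array("Local-gov", "State-gov", "Private")`, so index 0.0 stands for the rarest value.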

IndexToString

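Parameter	Purpose
inputCol	Name of the column to reverse-index
outputCol	Name of the column that stores the strings
labels	A StringArrayParam holding the strings to map back to; the string at position i corresponds to index i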

Example code:

    import org.apache.spark.ml.feature.IndexToString
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclass")
      .setLabels(getWorkclassIndexer.fit(getTrainingData).labels)

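This code creates an IndexToString instance whose inputCol is "workclassIndex" and whose outputCol is "workclass". Its labels are those of the fitted workclassIndexer, that is, the original strings of the column before indexing, in index order.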
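A full round trip on toy data might look like the sketch below. The object name `RoundTripSketch`, the sample rows, and the output column name `workclassRestored` are assumptions for illustration. A separate output column is used here because the toy DataFrame still contains the original "workclass" column.

```scala
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

object RoundTripSketch {
  // Indexes a string column, reverses the index, and returns (original, restored) pairs.
  def roundTrip(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._
    val df = Seq("Private", "State-gov", "Private").toDF("workclass")

    // Forward direction: string -> index.
    val model = new StringIndexer()
      .setInputCol("workclass").setOutputCol("workclassIndex")
      .fit(df)
    val indexed = model.transform(df)

    // Reverse direction: index -> string, using the labels of the fitted model.
    val restored = new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclassRestored")
      .setLabels(model.labels)
      .transform(indexed)

    restored.select("workclass", "workclassRestored").as[(String, String)].collect().toSeq
  }
}
```

Every pair returned by `roundTrip` should contain the same string twice, which confirms that the labels array reverses the indexing.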

Pipeline

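Pipeline is a very handy class in Spark (org.apache.spark.ml.Pipeline): it combines several stages into one, which can cut down the amount of code considerably. First, take a look at its documentation: Spark 2.3.2 ScalaDoc - IndexToString. It has a single parameter, stages, whose value is an Array of the stages to run. In the example code we use a Pipeline to combine a series of IndexToString transformers, and then define a function that returns this Pipeline: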

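  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.IndexToString
  // One IndexToString per indexed column; each takes its labels from the matching fitted indexer.
  private val converter_pipeline = new Pipeline().setStages(Array(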
    new IndexToString()
      .setInputCol("workclassIndex").setOutputCol("workclass")
      .setLabels(getWorkclassIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("educationIndex").setOutputCol("education")
      .setLabels(getEduIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("maritial_statusIndex").setOutputCol("maritial_status")
      .setLabels(getMaritalIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("occupationIndex").setOutputCol("occupation")
      .setLabels(getOccupationIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("relationshipIndex").setOutputCol("relationship")
      .setLabels(getRelationshipIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("raceIndex").setOutputCol("race")
      .setLabels(getRaceIndexer.fit(getTrainingData).labels),
    new IndexToString()
      .setInputCol("sexIndex").setOutputCol("sex")
      .setLabels(getSexIndexer.fit(getTrainingData).labels)
  ))
  
    def getConverterPipline: Pipeline = converter_pipeline

A Pipeline operates on a dataset through fit and transform, for example:

      val cluster_info_split_table = getConverterPipline
        .fit(split_table).transform(split_table)

The result looks like this:

+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
| age|workclassIndex|educationIndex|maritial_statusIndex|occupationIndex|relationshipIndex|raceIndex|sexIndex|native_countryIndex|       workclass|education|maritial_status|   occupation| relationship| race| sex|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
|37.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|57.0|           7.0|          12.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|  Masters|  Never-married|        Sales|Not-in-family|White|Male|
|46.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|21.0|           7.0|          13.0|                 5.0|            9.0|              3.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|Other-service|    Own-child|White|Male|
|49.0|           7.0|          11.0|                 5.0|           10.0|              4.0|      3.0|     1.0|               20.0|Self-emp-not-inc|Assoc-voc|  Never-married|        Sales|Not-in-family|Black|Male|
|70.0|           6.0|          11.0|                 5.0|            9.0|              4.0|      4.0|     1.0|               40.0|       Local-gov|Assoc-voc|  Never-married|Other-service|Not-in-family|White|Male|
|29.0|           7.0|          13.0|                 5.0|           10.0|              4.0|      4.0|     1.0|               41.0|Self-emp-not-inc|Bachelors|  Never-married|        Sales|Not-in-family|White|Male|
|29.0|           7.0|          12.0|                 5.0|           10.0|              3.0|      3.0|     1.0|               19.0|Self-emp-not-inc|  Masters|  Never-married|        Sales|    Own-child|Black|Male|
+----+--------------+--------------+--------------------+---------------+-----------------+---------+--------+-------------------+----------------+---------+---------------+-------------+-------------+-----+----+
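The same pattern can be written more compactly by generating the stages from a list of column names. The sketch below is one possible way to do that on toy data. The object name `ConverterPipelineSketch`, the two-column sample, and the helper structure are assumptions for illustration and are not the original project's code.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object ConverterPipelineSketch {
  // Hypothetical subset of the categorical columns.
  private val columns = Seq("workclass", "sex")

  def roundTrip(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val raw = Seq(("Private", "Male"), ("Local-gov", "Female"), ("Private", "Female"))
      .toDF("workclass", "sex")

    // Fit one StringIndexer per column.
    val indexerModels = columns.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "Index").fit(raw)
    }

    // Apply all indexers, then drop the original strings so only the indices remain.
    val indexedOnly = indexerModels
      .foldLeft(raw)((df, model) => model.transform(df))
      .drop(columns: _*)

    // Build one IndexToString per column, wired to the labels of its fitted indexer.
    val converters = columns.zip(indexerModels).map { case (c, model) =>
      new IndexToString().setInputCol(c + "Index").setOutputCol(c).setLabels(model.labels)
    }.toArray[PipelineStage]

    // Run all converters as a single Pipeline.
    new Pipeline().setStages(converters).fit(indexedOnly).transform(indexedOnly)
  }
}
```

The returned DataFrame contains the index columns alongside the restored "workclass" and "sex" strings, analogous to the table above. Fitting each indexer once and reusing its model also avoids refitting on the training data for every converter.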
