Spark SQL: Converting an RDD to a Dataset by Programmatically Specifying the Schema

I: Explanation

Official docs: https://spark.apache.org/docs/latest/sql-getting-started.html
This scenario comes up all the time in practice:
When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps.

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1;
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession.
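
The "structure encoded in a string" case mentioned above deserves a concrete illustration: the schema itself can be assembled at runtime from a plain string of column names. A minimal sketch following the official guide's pattern (the schemaString value here is a made-up example):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical: the field names arrive as a string that is only known at runtime
val schemaString = "ip time responseSize"
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)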

II: Implementation

package g5.learning

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object DFApp {
  def main(args: Array[String]): Unit = {
    val sparksession = SparkSession.builder().appName("DFApp")
      .master("local[2]")
      .getOrCreate()

     //inferReflection(sparksession)

    programmatically(sparksession)
    sparksession.stop()
  }



  // The second approach: specify the schema programmatically
  def programmatically(sparkSession: SparkSession): Unit = {
    val info = sparkSession.sparkContext.textFile("file:///E:\\data.txt")
    // Step 1: Create an RDD of Rows from the original RDD
    // (each tab-separated line becomes a Row of (String, String, Long))
    val rdd = info.map(_.split("\t")).map(x => Row(x(0), x(1), x(2).toLong))
    // Step 2: Create the schema as a StructType matching the structure of the Rows
    val struct = StructType(Array(
      StructField("ip", StringType, nullable = true),
      StructField("time", StringType, nullable = false),
      StructField("responseSize", LongType, nullable = false)
    ))
    // Step 3: Apply the schema to the RDD of Rows via the createDataFrame method provided by SparkSession
    val df = sparkSession.createDataFrame(rdd, struct)
    df.show()
  }
}
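
Once the DataFrame exists, it can be queried like any other Spark SQL table. A short follow-up sketch (the view name "logs" and the size threshold are made up for illustration):

// Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("logs")
sparkSession.sql("SELECT ip, responseSize FROM logs WHERE responseSize > 100").show()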

III: Notes

1. This programmatic approach is more flexible than the reflection-based one shown on the official site, and it covers most use cases.
2. Pay attention to which transformations you chain together; there are many operators, so take care to pick the ones that fit your needs (finding the right operators took some trial and error at first; see the sketch below).
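
As an example of point 2: the Row-building step above assumes every line has exactly three tab-separated fields and that the third one is numeric; x(2).toLong throws otherwise. A hedged sketch of a defensive variant (the filter condition is one possible cleanup choice, not part of the original code):

val rdd = info
  .map(_.split("\t"))
  .filter(fields => fields.length == 3 && fields(2).forall(_.isDigit))
  .map(x => Row(x(0), x(1), x(2).toLong))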

IV: Result

(Screenshot of the df.show() output.)
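
For illustration only, with a hypothetical data.txt containing a tab-separated line such as (this sample data is made up, not from the original post):

10.0.0.1	2019-01-01 10:00:00	200

df.show() would print a table along these lines:

+--------+-------------------+------------+
|      ip|               time|responseSize|
+--------+-------------------+------------+
|10.0.0.1|2019-01-01 10:00:00|         200|
+--------+-------------------+------------+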