
《深入理解Spark》: Converting Between RDDs and DataFrames

package com.lyzx.day18

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

/**
 * Spark SQL
 * Converting between RDDs and DataFrames
 */
class T4 {

  /*
    Convert an RDD[User] to a DataFrame via reflection
   */
  def f1(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)
//  read the text file and convert it into an RDD[User]
    val rdd = sc.textFile("User.txt")
    val userRdd = rdd.map(item=>item.split(","))
                     .map(item=>User(item(0).toInt,item(1),item(2).toInt,item(3).toInt))

//  bring the implicit conversion functions into scope
    import sqlCtx.implicits._

//  convert the RDD[User] to a DataFrame; the column names cannot be specified here because reflection is used, so they are taken from the field names of User
    val df = userRdd.toDF()
//  register the DataFrame as a temporary table, i.e. "put" the data in df into a temp table and give it a name
    df.registerTempTable("user")

//    write SQL through the SQLContext instance and get back a DataFrame containing the result set
    val result = sqlCtx.sql("select id,name,age,height from user where id >=2")

//    iterate over the result set
    result.foreach(println)
  }
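
  /*
    Sketch (not part of the original code; the method name f1Reverse is illustrative):
    the reverse direction, DataFrame -> RDD. DataFrame.rdd returns an RDD[Row], from
    which an RDD[User] can be rebuilt by reading the columns back in the order produced
    by the reflection-based schema used in f1.
   */
  def f1Reverse(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.implicits._

    val df = sc.textFile("User.txt")
               .map(_.split(","))
               .map(item=>User(item(0).toInt,item(1),item(2).toInt,item(3).toInt))
               .toDF()

    // df.rdd is an RDD[Row]; rebuild the case class instances by column position
    val userRdd = df.rdd.map(row=>User(row.getInt(0),row.getString(1),row.getInt(2),row.getInt(3)))
    userRdd.foreach(println)
  }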

  /*
    Convert an RDD to a DataFrame dynamically
    The schema can be specified at runtime ("schema" is Spark's term for it; it is really just column names + types + nullability)
   */
  def f2(sc:SparkContext): Unit ={
      val sqlCtx = new SQLContext(sc)

      val rdd = sc.textFile("./User.txt")
      val mapRdd = rdd.map(item=>item.split(","))
                      .map(item=>Row(item(0),item(1),item(2),item(3)))


    def getSchema2(columnName:String): StructType ={
      StructType(columnName.split(",").map(item=>StructField(item,StringType,true)))
    }

    //this is the schema: column names + types + nullability
    val schema = getSchema2("id_x,name_y,age_z,height_m")

    //create the DataFrame through the SQLContext instance
    val df = sqlCtx.createDataFrame(mapRdd,schema)
    df.registerTempTable("user")

    val result = sqlCtx.sql("select id_x,name_y,age_z,height_m from user where id_x >=3")
    result.foreach(println)
  }
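
  /*
    Sketch (not part of the original code; the method name f2Typed and the table name
    user_typed are illustrative): in f2 every column is typed as StringType, so the
    "id_x >= 3" filter relies on Spark SQL casting the values at query time. If the
    numeric columns should keep numeric types, the schema and the Row values must agree:
   */
  def f2Typed(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)

    val rowRdd = sc.textFile("./User.txt")
                   .map(_.split(","))
                   .map(item=>Row(item(0).toInt,item(1),item(2).toInt,item(3).toInt))

    // column types now match the values placed into each Row above
    val schema = StructType(Seq(
      StructField("id",IntegerType,true),
      StructField("name",StringType,true),
      StructField("age",IntegerType,true),
      StructField("height",IntegerType,true)))

    val df = sqlCtx.createDataFrame(rowRdd,schema)
    df.registerTempTable("user_typed")

    sqlCtx.sql("select id,name,age,height from user_typed where id >= 3").foreach(println)
  }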


  def f3(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)

    val userRdd = sc.textFile("./User.txt")
                    .map(x=>x.split(","))
                    .map(x=>Row(x(0),x(1),x(2),x(3)))

    def getSchema2(columnName:String): StructType ={
      StructType(columnName.split(",").map(item=>StructField(item,StringType,true)))
    }

    val userSchema = getSchema2("id,name,age,height")
    val userDf = sqlCtx.createDataFrame(userRdd,userSchema)
        userDf.registerTempTable("user")

    val goodsRdd = sc.textFile("./goods.txt")
                     .map(x=>x.split(","))
                     .map(x=>Row(x(0),x(1),x(2),x(3)))

    val goodsSchema = getSchema2("userId,goodsName,goodsPrice,goodsCount")

    val goodsDf = sqlCtx.createDataFrame(goodsRdd,goodsSchema)
        goodsDf.registerTempTable("goods")

    val result = sqlCtx.sql("select a.id as userId,a.name as userName,b.goodsName,b.goodsPrice from user a left join goods b on a.id=b.userId")
    result.foreach(println)
  }
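
  /*
    Sketch (not part of the original code; f3Api and schemaOf are illustrative names):
    the same left join written with the DataFrame API instead of SQL. File paths and
    column names follow the schemas used in f3; the input files are assumed to be
    comma-separated with one record per line.
   */
  def f3Api(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)

    def schemaOf(columnNames:String): StructType =
      StructType(columnNames.split(",").map(StructField(_,StringType,true)))

    val userDf  = sqlCtx.createDataFrame(
      sc.textFile("./User.txt").map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3))),
      schemaOf("id,name,age,height"))

    val goodsDf = sqlCtx.createDataFrame(
      sc.textFile("./goods.txt").map(_.split(",")).map(x=>Row(x(0),x(1),x(2),x(3))),
      schemaOf("userId,goodsName,goodsPrice,goodsCount"))

    // left outer join on user.id == goods.userId, keeping a few columns from each side
    val joined = userDf.join(goodsDf, userDf("id") === goodsDf("userId"), "left_outer")
                       .select(userDf("id"), userDf("name"), goodsDf("goodsName"), goodsDf("goodsPrice"))
    joined.foreach(println)
  }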

  /*
   JSON data source
   SQLContext can read JSON-formatted text files directly
  */
  def f4(sc:SparkContext): Unit ={
    val sqlCtx = new SQLContext(sc)
    val jsonDf = sqlCtx.read.json("./json.txt")
    jsonDf.printSchema()

    jsonDf.registerTempTable("person")

    val df = sqlCtx.sql("select * from person where age > 10")
    df.foreach(println)
  }
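
  /*
    Assumed input format (hypothetical sample, not part of the original): read.json
    expects one JSON object per line in a plain text file, e.g.
      {"name":"Tom","age":18}
      {"name":"Jack","age":25}
    The schema printed by printSchema() is inferred from these objects.
   */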
}

object T4{
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("day18").setMaster("local")
    val sc = new SparkContext(conf)

    val  t = new T4
//    t.f1(sc)
//    t.f2(sc)
//    t.f3(sc)
    t.f4(sc)

    sc.stop()
  }
}

case class User(id:Int,name:String,age:Int,height:Int){
  // case class parameters are already vals, so they can be used directly in toString
  override def toString(): String ={
    "[id="+id+" name="+name+" age="+age+" height="+height+"]"
  }
}