Spark SQL Study Notes (including native code written in IDEA)

阿新 • Published: 2018-11-11

Spark SQL and DataFrame

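1. Why use Spark SQL

Previously we used Hive, which translates Hive SQL into MapReduce jobs and submits them to the cluster. That greatly reduces the complexity of writing MapReduce programs by hand, but the MapReduce execution model is relatively slow. Spark SQL emerged in response: it translates SQL into RDD operations and submits them to the cluster, which generally runs much faster than the equivalent MapReduce job.

Advantages of Spark SQL: 1. easy integration 2. unified data access 3. Hive compatibility 4. standard data connectivity (JDBC/ODBC)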

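2. DataFrames

What are DataFrames?

Like an RDD, a DataFrame is a distributed data container. A DataFrame, however, is closer to a two-dimensional table in a traditional database: besides the data itself, it also records the structure of the data, that is, the schema. Like Hive, DataFrame also supports nested data types (struct, array, and map). In terms of API usability, the DataFrame API offers a set of high-level relational operations that are friendlier and have a lower barrier to entry than the functional RDD API. Because it resembles the DataFrames of R and Pandas, the Spark DataFrame carries over the development experience of traditional single-machine data analysis.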

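Creating DataFrames

In Spark SQL, SQLContext is the entry point for creating DataFrames and executing SQL. In spark-1.5.2 the spark shell already provides a built-in instance named sqlContext.

1. Create a local file with three columns (id, name, and age) separated by spaces, then upload it to HDFS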

hdfs dfs -put person.txt /

2. In the spark shell, run the following command to read the data and split each line on the column delimiter

val lineRDD = sc.textFile("hdfs://hadoop01:9000/person.txt").map(_.split(" "))

3. Define a case class (equivalent to the table's schema)

case class Person(id:Int, name:String, age:Int)

4. Associate the RDD with the case class

val personRDD = lineRDD.map(x => Person(x(0).toInt, x(1), x(2).toInt))

5. Convert the RDD to a DataFrame

val personDF = personRDD.toDF

6. Work with the DataFrame

personDF.show
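
Step 4 above calls toInt on the raw fields, so a single malformed line in person.txt would fail the whole job. Below is a minimal sketch, in plain Scala without Spark, of a more defensive parser. parsePerson is a hypothetical helper name that is not part of the original notes, and the sample lines are made up for illustration.

```scala
import scala.util.Try

// Same shape as the case class in step 3
case class Person(id: Int, name: String, age: Int)

// Hypothetical helper: returns None for lines that do not have exactly
// three whitespace-separated fields or whose id/age are not integers.
def parsePerson(line: String): Option[Person] =
  line.trim.split("\\s+") match {
    case Array(id, name, age) =>
      Try(Person(id.toInt, name, age.toInt)).toOption
    case _ => None
  }

// Made-up sample lines: two valid, two malformed
val lines = Seq("1 tom 25", "2 jerry 31", "3 bad", "x alice 20")
val people = lines.flatMap(line => parsePerson(line).toList)
people.foreach(println)
```

In the spark shell, the same helper could presumably replace the split-then-map steps with something like sc.textFile(...).flatMap(line => parsePerson(line).toList), so that bad lines are dropped instead of throwing.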

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlTest {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("SQL-1")
    val sc = new SparkContext(conf)
    fun1(sc)
  }

  // Define the case class; it plays the role of the table schema
  case class Person(id: Int, name: String, age: Int)

  def fun1(sc: SparkContext): Unit = {
    val sqlContext = new SQLContext(sc)
    // In practice this path is usually replaced with an HDFS file path
    val lineRdd = sc.textFile("D:\\data\\person.txt").map(_.split(" "))

    val personRdd = lineRdd.map(x => Person(x(0).toInt, x(1), x(2).toInt))

    // Brings toDF into scope for RDDs of case classes
    import sqlContext.implicits._
    val personDF = personRdd.toDF
    // Register the DataFrame as a temporary table
    personDF.registerTempTable("person_df")

    // Run a SQL query against the registered table
    val df = sqlContext.sql("select * from person_df order by age desc")

    // Save the result as JSON to the given location
    df.write.json("D:\\data\\personOut")
    sc.stop()
  }
}

Common DataFrame operations

DSL-style syntax (as I understand it, short and concise)

// Show the contents of selected DataFrame columns
df.select(df.col("name")).show()
df.select("age").show()
df.select("id").show()

// Print the DataFrame's schema
df.printSchema()

// Select id, name, and age + 2
df.select(df("id"), df("name"), df("age") + 2).show()

// Select everyone older than 20
df.filter(df("age") > 20).show()

// Group by age and count the people of each age
df.groupBy("age").count().show()
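
To sanity-check what the select, filter, and groupBy/count calls above should return, the same logic can be run on a small local collection. This is a minimal sketch in plain Scala, not Spark; the sample rows are made up, and the grouped result is sorted explicitly because a grouped result has no inherent order.

```scala
case class Person(id: Int, name: String, age: Int)

// Made-up sample rows standing in for person.txt
val people = Seq(
  Person(1, "tom", 25),
  Person(2, "jerry", 18),
  Person(3, "kitty", 25),
  Person(4, "bob", 31)
)

// Analogue of df.select(df("id"), df("name"), df("age") + 2)
val agePlusTwo = people.map(p => (p.id, p.name, p.age + 2))

// Analogue of df.filter(df("age") > 20)
val olderThan20 = people.filter(_.age > 20)

// Analogue of df.groupBy("age").count(), sorted by age for stable output
val countByAge = people
  .groupBy(_.age)
  .map { case (age, ps) => (age, ps.size) }
  .toSeq
  .sortBy(_._1)

println(agePlusTwo)
println(olderThan20.map(_.name))
println(countByAge)
```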

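SQL-style syntax (prerequisite: the DataFrame must be registered as a table)

// Register as a table
personDF.registerTempTable("person_df")

// Query the two oldest people and keep the result in a value
val persons = sqlContext.sql("select * from person_df order by age desc limit 2")
// collect() brings the rows to the driver so the output appears locally
persons.collect().foreach(x => println((x(0), x(1), x(2))))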

Specifying the schema directly with StructType

/* Specify the schema directly with StructType */
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def fun2(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val personRDD = sc.textFile("D:\\data\\person.txt").map(_.split(" "))
  // Specify the schema of each field directly with StructType
  val schema = StructType(List(
    StructField("id", IntegerType, true),
    StructField("name", StringType, true),
    StructField("age", IntegerType, true)
  ))
  // Map the RDD to an RDD of Row
  val rowRdd = personRDD.map(x => Row(x(0).toInt, x(1).trim, x(2).toInt))
  // Apply the schema to rowRdd
  val dataFrame = sqlContext.createDataFrame(rowRdd, schema)
  // Register as a temporary table
  dataFrame.registerTempTable("person_struct")

  sqlContext.sql("select * from person_struct").show()

  sc.stop()
}

Connecting to a data source

/* Connect to a MySQL data source */
def fun3(sc: SparkContext): Unit = {
  val sqlContext = new SQLContext(sc)
  val jdbcDF = sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:mysql://192.168.180.100:3306/bigdata",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> "person",
    "user" -> "root",
    "password" -> "123456"
  )).load()
  jdbcDF.show()
  sc.stop()
}
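
fun3 hard-codes the connection settings, including the password, in a single Map. Below is a minimal sketch of building the same options from parameters, with the password read from an environment variable. jdbcOptions and the MYSQL_PASSWORD variable name are hypothetical, chosen only for illustration; the option keys (url, driver, dbtable, user, password) are the ones fun3 already uses.

```scala
// Hypothetical helper: assembles the option map passed to
// sqlContext.read.format("jdbc").options(...)
def jdbcOptions(host: String, port: Int, database: String, table: String,
                user: String, password: String): Map[String, String] =
  Map(
    "url" -> s"jdbc:mysql://$host:$port/$database",
    "driver" -> "com.mysql.jdbc.Driver",
    "dbtable" -> table,
    "user" -> user,
    "password" -> password
  )

// MYSQL_PASSWORD is an assumed variable name; falls back to an empty string
val password = sys.env.getOrElse("MYSQL_PASSWORD", "")
val opts = jdbcOptions("192.168.180.100", 3306, "bigdata", "person", "root", password)

// Print every option except the password
(opts - "password").foreach { case (k, v) => println(s"$k = $v") }
```

The resulting map can then be passed in place of the literal one, as in sqlContext.read.format("jdbc").options(opts).load().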

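Writing back to the database

// Write to the database
// Assumes sc and sqlContext are in scope, as in fun3
import java.util.Properties
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val personTextRdd = sc.textFile("D:\\data\\person.txt").map(_.split(" ")).map(x => Row(x(0).toInt, x(1), x(2).toInt))

val schema = StructType(List(StructField("id", IntegerType), StructField("name", StringType), StructField("age", IntegerType)))

val personDataFrame = sqlContext.createDataFrame(personTextRdd, schema)

val prop = new Properties()
prop.put("user", "root")
prop.put("password", "123456")
// Append the rows to the bigdata.person table
personDataFrame.write.mode("append").jdbc("jdbc:mysql://192.168.180.100:3306/bigdata", "bigdata.person", prop)

sc.stop()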
