1. 程式人生 > >Spark修煉之道(進階篇)——Spark入門到精通:第十節 Spark SQL案例實戰(一)

Spark修煉之道(進階篇)——Spark入門到精通:第十節 Spark SQL案例實戰(一)

作者:周志湖

放假了,終於能抽出時間更新部落格了…….

1. 獲取資料

本文通過將github上的Spark專案git日誌作為資料,對SparkSQL的內容進行詳細介紹
資料獲取命令如下:

[[email protected] spark]# git log  --pretty=format:'{"commit":"%H","author":"%an","author_email":"%ae","date":"%ad","message":"%f"}' > sparktest.json

格式化日誌內容輸出如下:

[[email protected]
spark]# head -1 sparktest.json {"commit":"30b706b7b36482921ec04145a0121ca147984fa8","author":"Josh Rosen","author_email":"[email protected]","date":"Fri Nov 6 18:17:34 2015 -0800","message":"SPARK-11389-CORE-Add-support-for-off-heap-memory-to-MemoryManager"}

然後使用命令將sparktest.json檔案上傳到HDFS上


[root@master
spark]#hadoop dfs -put sparktest.json /data/

2. 建立DataFrame

使用資料建立DataFrame

scala> val df = sqlContext.read.json("/data/sparktest.json")
16/02/05 09:59:56 INFO json.JSONRelation: Listing hdfs://ns1/data/sparktest.json on driver

檢視其模式:

scala> df.printSchema()
root
 |-- author: string (nullable = true
) |-- author_email: string (nullable = true) |-- commit: string (nullable = true) |-- date: string (nullable = true) |-- message: string (nullable = true)

3. DataFrame方法實戰

(1)顯式前兩行資料

scala> df.show(2)

+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|[email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|[email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+

(2)計算總提交次數


scala> df.count
res4: Long = 13507
下圖給出的是我github上的commits次數,可以看到,其結束是一致的

這裡寫圖片描述

(3)按提交次數進行降序排序

scala>df.groupBy("author").count.sort($"count".desc).show

+--------------------+-----+
|              author|count|
+--------------------+-----+
|       Matei Zaharia| 1590|
|         Reynold Xin| 1071|
|     Patrick Wendell|  857|
|       Tathagata Das|  416|
|          Josh Rosen|  348|
|  Mosharaf Chowdhury|  290|
|           Andrew Or|  287|
|       Xiangrui Meng|  285|
|          Davies Liu|  281|
|          Ankur Dave|  265|
|          Cheng Lian|  251|
|    Michael Armbrust|  243|
|             zsxwing|  200|
|           Sean Owen|  197|
|     Prashant Sharma|  186|
|  Joseph E. Gonzalez|  185|
|            Yin Huai|  177|
|Shivaram Venkatar...|  173|
|      Aaron Davidson|  164|
|      Marcelo Vanzin|  142|
+--------------------+-----+
only showing top 20 rows

4. DataFrame註冊成臨時表使用實戰

使用下列語句將DataFrame註冊成表

scala> val commitLog=df.registerTempTable("commitlog")

(1)顯示前2行資料

scala> sqlContext.sql("SELECT * FROM commitlog").show(2)
+----------------+--------------------+--------------------+--------------------+--------------------+
|          author|        author_email|              commit|                date|             message|
+----------------+--------------------+--------------------+--------------------+--------------------+
|      Josh Rosen|[email protected]|30b706b7b36482921...|Fri Nov 6 18:17:3...|SPARK-11389-CORE-...|
|Michael Armbrust|[email protected]|105732dcc6b651b97...|Fri Nov 6 17:22:3...|HOTFIX-Fix-python...|
+----------------+--------------------+--------------------+--------------------+--------------------+

(2)計算總提交次數

scala> sqlContext.sql("SELECT count(*) as TotalCommitNumber  FROM commitlog").show
+-----------------+
|TotalCommitNumber|
+-----------------+
|            13507|
+-----------------+

(3)按提交次數進行降序排序

scala> sqlContext.sql("SELECT author,count(*) as CountNumber  FROM commitlog GROUP BY author ORDER BY CountNumber DESC").show

+--------------------+-----------+
|              author|CountNumber|
+--------------------+-----------+
|       Matei Zaharia|       1590|
|         Reynold Xin|       1071|
|     Patrick Wendell|        857|
|       Tathagata Das|        416|
|          Josh Rosen|        348|
|  Mosharaf Chowdhury|        290|
|           Andrew Or|        287|
|       Xiangrui Meng|        285|
|          Davies Liu|        281|
|          Ankur Dave|        265|
|          Cheng Lian|        251|
|    Michael Armbrust|        243|
|             zsxwing|        200|
|           Sean Owen|        197|
|     Prashant Sharma|        186|
|  Joseph E. Gonzalez|        185|
|            Yin Huai|        177|
|Shivaram Venkatar...|        173|
|      Aaron Davidson|        164|
|      Marcelo Vanzin|        142|
+--------------------+-----------+

更多複雜的玩法,大家可以自己去嘗試,這裡給出的只是DataFrame方法與臨時表SQL語句的用法差異,以便於有整體的認知。