Creating DataFrames: An Important Part of Spark

Creating DataFrames

Official documentation: https://spark.apache.org/docs/latest/sql-getting-started.html

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
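The first of these sources, an existing RDD, can be sketched as follows. This is a minimal illustration rather than code from the official guide: the object name `RddToDataFrame`, the helper `peopleFromRdd`, and the sample rows are assumptions made for the example, and `local[2]` is used only so that it runs without a cluster.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RddToDataFrame {
  // Hypothetical helper: builds a DataFrame from an in-memory RDD of (name, age) tuples.
  def peopleFromRdd(spark: SparkSession): DataFrame = {
    import spark.implicits._ // brings toDF into scope for RDDs of tuples

    // Sample rows are made up for illustration
    val rdd = spark.sparkContext.parallelize(Seq(("Andy", 30), ("Justin", 19)))
    rdd.toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddToDataFrame")
      .master("local[2]")
      .getOrCreate()

    val df = peopleFromRdd(spark)
    df.printSchema()
    df.show()
    spark.stop()
  }
}
```

Inside spark-shell the session already exists as `spark`, so only the two lines inside `peopleFromRdd` (plus the `import spark.implicits._`) would be needed.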

As an example, the following creates a DataFrame based on the content of a JSON file:

Start spark-shell from the bin directory of the Spark installation (spark-2.4.0-bin-2.6.0-cdh5.7.0):
$ cd bin
$ ./spark-shell
Locate the sample JSON file that ships with Spark:
$ pwd
/home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources
$ cat people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
scala> val df = spark.read.json("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

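Processing data with Spark SQL is very convenient. Under the hood, reads like this one are implemented through Spark's external data source API.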
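To make that convenience concrete, here is a small sketch that queries a DataFrame with plain SQL through a temporary view. The object name `PeopleSqlApp`, the helper `olderThan`, the view name `people`, and passing the JSON path as the first program argument are all assumptions for this example, not part of the original walkthrough.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object PeopleSqlApp {
  // Hypothetical helper: registers the DataFrame as a temp view and
  // returns the names of people whose age is greater than minAge.
  def olderThan(spark: SparkSession, people: DataFrame, minAge: Int): DataFrame = {
    people.createOrReplaceTempView("people")
    spark.sql(s"SELECT name FROM people WHERE age > $minAge")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PeopleSqlApp")
      .master("local[2]")
      .getOrCreate()

    // Assumption: args(0) is the path to a JSON file such as people.json
    val df = spark.read.json(args(0))
    olderThan(spark, df, 20).show()
    spark.stop()
  }
}
```

With the people.json shown above, only Andy is returned: Justin is 19, and Michael's age is null, so the comparison does not evaluate to true for his row.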

Going further

scala> spark.table("ruoze_emp").show  

This reads a table stored in Hive. HDFS must be up and running before you execute it.
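Before reading a table it can help to confirm that the session can see it at all. The sketch below uses `spark.catalog.tableExists` for that; the object name `TableCheck` and the helper `showIfExists` are hypothetical names introduced for this example. It only checks catalog visibility and does not verify that HDFS is running.

```scala
import org.apache.spark.sql.SparkSession

object TableCheck {
  // Hypothetical helper: shows the table only if the catalog knows it,
  // and returns whether it was found.
  def showIfExists(spark: SparkSession, tableName: String): Boolean = {
    val exists = spark.catalog.tableExists(tableName)
    if (exists) spark.table(tableName).show()
    else println(s"Table $tableName not found in the current database")
    exists
  }
}
```

In spark-shell this could be called as `TableCheck.showIfExists(spark, "ruoze_emp")`.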

How to do this in IntelliJ IDEA

Add the Spark Hive dependency to pom.xml:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>

Then write the application:

package g5.learning

import org.apache.spark.sql.SparkSession

object SparkSessionApp {
  def main(args: Array[String]): Unit = {
    // Build a local SparkSession that runs with two worker threads
    val sparkSession = SparkSession.builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .enableHiveSupport() // must be enabled whenever Hive is used
      .getOrCreate()

    // sparkSession.sparkContext.parallelize(Array(1, 2, 3, 4)).collect().foreach(println)

    // Read the Hive table and print its first rows
    sparkSession.table("ruoze_emp").show()
    sparkSession.stop()
  }
}

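Note that .enableHiveSupport() must be called whenever Hive is used.
Running Hive on Windows is still quite troublesome: it requires a number of extra steps, including obtaining additional files.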