Creating DataFrames: an important part of Spark
阿新 • Published: 2018-12-30
Creating DataFrames
Official docs: https://spark.apache.org/docs/latest/sql-getting-started.html
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
As an example, the following creates a DataFrame based on the content of a JSON file:
Start spark-shell:

[[email protected] spark-2.4.0-bin-2.6.0-cdh5.7.0]$ cd bin
[[email protected] bin]$ ./spark-shell

Locate the sample JSON file that ships with Spark:

[[email protected] resources]$ pwd
/home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources
[[email protected] resources]$ cat people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Read it as a DataFrame in the shell:

scala> val df = spark.read.json("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
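As the official docs quoted above note, a DataFrame can also be created from an existing RDD. A minimal sketch that could be pasted into the same spark-shell session (the `Person` case class and sample rows are made up for illustration):

```scala
// spark-shell provides a SparkSession named `spark`;
// toDF() needs its implicits in scope.
import spark.implicits._

// An RDD of case-class records carries a schema Spark can infer.
case class Person(name: String, age: Long)
val rdd = spark.sparkContext.parallelize(Seq(Person("Michael", 29), Person("Andy", 30)))

// Convert the RDD to a DataFrame and inspect it.
val peopleDf = rdd.toDF()
peopleDf.printSchema()
peopleDf.show()
```

Schema inference here comes from the case-class fields, so no JSON parsing is involved.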
Processing data with Spark SQL is very convenient; under the hood this is implemented through the external data source API.
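The data source API makes the format just a parameter of the reader and writer. A hedged sketch, reusing the people.json path from above (the Parquet output path is illustrative):

```scala
// Generic data source API: format() + load() instead of the json() shortcut.
val jsonDf = spark.read.format("json")
  .load("file:///home/hadoop/app/spark-2.4.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/people.json")

// The same DataFrame can be written back out in another format
// (output path below is illustrative).
jsonDf.write.format("parquet").save("file:///tmp/people_parquet")
```

`spark.read.json(...)` is simply a convenience wrapper over this generic form.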
Extension
scala> spark.table("ruoze_emp").show
This reads a table stored in Hive, so HDFS must be started before running it.
How to do this in IDEA
Add the Hive dependency to the pom:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>${spark.version}</version>
</dependency>
Then:
package g5.learning

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    val sparksession = SparkSession.builder()
      .appName("SparkSessionApp")
      .master("local[2]")
      .enableHiveSupport() // must be enabled whenever Hive is used
      .getOrCreate()

    // sparksession.sparkContext.parallelize(Array(1, 2, 3, 4)).collect().foreach(println)
    sparksession.table("ruoze_emp").show()

    sparksession.stop()
  }
}
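With Hive support enabled, the table can also be queried with SQL instead of the table() API. A small standalone sketch (the object name SparkSQLQueryApp is hypothetical; the table ruoze_emp comes from the example above):

```scala
import org.apache.spark.sql.SparkSession

object SparkSQLQueryApp {
  def main(args: Array[String]): Unit = {
    val sparksession = SparkSession.builder()
      .appName("SparkSQLQueryApp")
      .master("local[2]")
      .enableHiveSupport() // still required for Hive tables
      .getOrCreate()

    // Equivalent to sparksession.table("ruoze_emp").show()
    sparksession.sql("SELECT * FROM ruoze_emp").show()

    sparksession.stop()
  }
}
```

sql() returns a DataFrame, so filters, joins, and writes can be chained onto the result just like with table().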
Note: .enableHiveSupport() must be called whenever Hive is used.
Running Hive on Windows is quite troublesome; it requires many extra steps to fetch the needed files.