
Two ways to read a data file into a table in spark-shell (relative paths and hdfs dfs -ls)

Using Spark SQL
Once the Spark shell has started, you can use the Spark SQL API to run data-analysis queries.

In the first example, we load user data from a text file and create a DataFrame from the dataset. We then call DataFrame functions to run specific data-selection queries.

The text file customers.txt has the following contents:

100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727
The following code snippet shows the Spark SQL commands you can run in the Spark shell.

// First, create a SQLContext from the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Import statement that enables implicit conversion of RDDs to DataFrames
import sqlContext.implicits._

// Define a case class representing a customer
case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String)

// Create a DataFrame of Customer objects from the dataset text file
val dfCustomers = sc.textFile("data/customers.txt").map(_.split(",")).map(p => Customer(p(0).trim.toInt, p(1), p(2), p(3), p(4))).toDF();

// Register the DataFrame as a table
dfCustomers.registerTempTable("customers");

// Display the contents of the DataFrame
dfCustomers.show();

// Print the DataFrame schema
dfCustomers.printSchema();

// Select the customer name column
dfCustomers.select("name").show();

// Select the name and city columns
dfCustomers.select("name", "city").show()

// Select customers by id
dfCustomers.filter(dfCustomers("customer_id").equalTo(500)).show();

// Count customers by zip code
dfCustomers.groupBy("zip_code").count().show();
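
Because the DataFrame has been registered as the temporary table customers, the same queries can also be expressed in SQL through sqlContext.sql. A minimal sketch against the table registered above (the query itself is just an illustrative equivalent of the filter example):

// Query the registered temp table with SQL; the result is again a DataFrame
val cust500 = sqlContext.sql("SELECT name, city FROM customers WHERE customer_id = 500");
cust500.show();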



The schema can also be specified programmatically, using the data type classes StructType, StructField, and StringType.

//
// Programmatically specifying the schema
//

// Create a SQLContext from the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc);

// Create an RDD
val rddCustomers = sc.textFile("data/customers.txt");

// Encode the schema in a string
val schemaString = "customer_id name city state zip_code";

// Import Spark SQL data types and Row
import org.apache.spark.sql._

import org.apache.spark.sql.types._;

// Generate the schema object from the schema string
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)));

// Convert the records of the RDD (rddCustomers) to Rows.
val rowRDD = rddCustomers.map(_.split(",")).map(p => Row(p(0).trim,p(1),p(2),p(3),p(4)));

// Apply the schema to the RDD.
val dfCustomers = sqlContext.createDataFrame(rowRDD, schema);

// Register the DataFrame as a table
dfCustomers.registerTempTable("customers");

// Run an SQL statement using the sql method provided by the sqlContext object.
val custNames = sqlContext.sql("SELECT name FROM customers");

// The result of the SQL query is a DataFrame, which supports all the normal RDD operations.
// The columns of each result row can be accessed by ordinal.
custNames.map(t => "Name: " + t(0)).collect().foreach(println);

// Run another SQL statement using the sql method provided by the sqlContext object.
val customersByCity = sqlContext.sql("SELECT name,zip_code FROM customers ORDER BY zip_code");

// The result of the SQL query is a DataFrame, which supports all the normal RDD operations.
// The columns of each result row can be accessed by ordinal.
customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println);
Besides text files, data can also be loaded from other sources, such as JSON data files, Hive tables, and even relational database tables through JDBC data sources.
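
For example, with the DataFrameReader API that ships with Spark 1.4 (sqlContext.read), loading JSON data or a table over JDBC looks roughly like the sketch below. The file path, JDBC URL, and table name are placeholder assumptions, and the JDBC driver must be on the classpath:

// Load a DataFrame from a JSON file (one JSON object per line; path is a placeholder)
val dfJson = sqlContext.read.json("data/people.json");

// Load a DataFrame from a relational table through a JDBC data source
// (the URL and table name below are placeholders)
val dfJdbc = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "customers")
  .load();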

Last login: Tue Jul 19 23:21:05 2016 from 192.168.3.103
[root@cdh1 ~]# jps
23852 SparkSubmit
19801 NameNode
23932 Jps
20330 NodeManager
22342 Master
22509 Worker
20231 ResourceManager
20082 SecondaryNameNode
19898 DataNode
[root@cdh1 ~]# hdfs dfs -ls /user/
Found 2 items
drwxr-xr-x   - root supergroup          0 2016-07-19 21:16 /user/hive
drwxr-xr-x   - root supergroup          0 2016-07-19 23:07 /user/test
[root@cdh1 ~]# hdfs dfs -ls /user/test
[root@cdh1 ~]# cd /user/test
[root@cdh1 test]# ll
total 0
[root@cdh1 test]# vim customers.txt
[root@cdh1 test]# ll
total 4
-rw-r--r--. 1 root root 185 Jul 19 23:25 customers.txt
[root@cdh1 test]# cat customers.txt
100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727
[root@cdh1 test]# hdfs dfs -ls /user/test
[root@cdh1 test]# hdfs dfs -put /user/test/customers.txt /user/test
[root@cdh1 test]# hdfs dfs -ls /user/test
Found 1 items
-rw-r--r--   1 root supergroup        185 2016-07-19 23:26 /user/test/customers.txt
[root@cdh1 test]# hdfs dfs -ls /user/test
Found 1 items
-rw-r--r--   1 root supergroup        185 2016-07-19 23:26 /user/test/customers.txt
[root@cdh1 test]# hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - root supergroup          0 2016-07-19 21:31 /tmp
drwxr-xr-x   - root supergroup          0 2016-07-19 23:07 /user
[root@cdh1 test]#

Last login: Tue Jul 19 23:23:11 2016 from 192.168.3.103
[root@cdh1 ~]# cd /user/local/spark-1.4.0-bin-hadoop2.6/
[root@cdh1 spark-1.4.0-bin-hadoop2.6]# ll
total 684
drwxr-xr-x. 2 yc   yc     4096 Jun  3  2015 bin
-rw-r--r--. 1 yc   yc   561149 Jun  3  2015 CHANGES.txt
drwxr-xr-x. 2 yc   yc     4096 Jun 18 00:03 conf
drwxr-xr-x. 3 yc   yc     4096 Jun  3  2015 data
drwxr-xr-x. 3 yc   yc     4096 Jun  3  2015 ec2
drwxr-xr-x. 3 yc   yc     4096 Jun  3  2015 examples
drwxr-xr-x. 2 yc   yc     4096 Jun  3  2015 lib
-rw-r--r--. 1 yc   yc    50902 Jun  3  2015 LICENSE
drwxr-xr-x. 2 root root   4096 Jul 19 22:35 logs
-rw-r--r--. 1 yc   yc    22559 Jun  3  2015 NOTICE
drwxr-xr-x. 6 yc   yc     4096 Jun  3  2015 python
drwxr-xr-x. 3 yc   yc     4096 Jun  3  2015 R
-rw-r--r--. 1 yc   yc     3624 Jun  3  2015 README.md
-rw-r--r--. 1 yc   yc      134 Jun  3  2015 RELEASE
drwxr-xr-x. 2 yc   yc     4096 Jul 19 22:39 sbin
drwxr-xr-x. 2 root root   4096 Jun 18 00:03 work
[root@cdh1 spark-1.4.0-bin-hadoop2.6]# cd work/
[root@cdh1 work]# ll
total 0
[root@cdh1 work]# cd ..
[root@cdh1 spark-1.4.0-bin-hadoop2.6]# cd data/
[root@cdh1 data]# ll
total 4
drwxr-xr-x. 5 yc yc 4096 Jun  3  2015 mllib
[root@cdh1 data]# cd mllib/
[root@cdh1 mllib]# ll
total 828
drwxr-xr-x. 2 yc yc   4096 Jun  3  2015 als
-rw-r--r--. 1 yc yc  63973 Jun  3  2015 gmm_data.txt
-rw-r--r--. 1 yc yc     72 Jun  3  2015 kmeans_data.txt
drwxr-xr-x. 2 yc yc   4096 Jun  3  2015 lr-data
-rw-r--r--. 1 yc yc 197105 Jun  3  2015 lr_data.txt
-rw-r--r--. 1 yc yc     24 Jun  3  2015 pagerank_data.txt
drwxr-xr-x. 2 yc yc   4096 Jun  3  2015 ridge-data
-rw-r--r--. 1 yc yc 104736 Jun  3  2015 sample_binary_classification_data.txt
-rw-r--r--. 1 yc yc     68 Jun  3  2015 sample_fpgrowth.txt
-rw-r--r--. 1 yc yc   1598 Jun  3  2015 sample_isotonic_regression_data.txt
-rw-r--r--. 1 yc yc    264 Jun  3  2015 sample_lda_data.txt
-rw-r--r--. 1 yc yc 104736 Jun  3  2015 sample_libsvm_data.txt
-rwxr-xr-x. 1 yc yc 119069 Jun  3  2015 sample_linear_regression_data.txt
-rw-r--r--. 1 yc yc  14351 Jun  3  2015 sample_movielens_data.txt
-rw-r--r--. 1 yc yc   6953 Jun  3  2015 sample_multiclass_classification_data.txt
-rw-r--r--. 1 yc yc     48 Jun  3  2015 sample_naive_bayes_data.txt
-rw-r--r--. 1 yc yc  39474 Jun  3  2015 sample_svm_data.txt
-rw-r--r--. 1 yc yc 115476 Jun  3  2015 sample_tree_data.csv
[root@cdh1 mllib]# cd ..
[root@cdh1 data]# ll
total 4
drwxr-xr-x. 5 yc yc 4096 Jun  3  2015 mllib
[root@cdh1 data]# pwd
/user/local/spark-1.4.0-bin-hadoop2.6/data
[root@cdh1 data]# ls
mllib
[root@cdh1 data]# vim customers.txt

[1]+  Stopped                 vim customers.txt
[root@cdh1 data]# vim customers.txt
[root@cdh1 data]# ls
customers.txt  mllib
[root@cdh1 data]# ls -l
total 8
-rw-r--r--. 1 root root  185 Jul 19 23:44 customers.txt
drwxr-xr-x. 5 yc   yc   4096 Jun  3  2015 mllib
[root@cdh1 data]# hdfs dfs -l /user/root/data/
-l: Unknown command
[root@cdh1 data]# hdfs dfs -ls /user/root/data/
ls: `/user/root/data/': No such file or directory
[root@cdh1 data]# hdfs dfs -ls /user/
Found 2 items
drwxr-xr-x   - root supergroup          0 2016-07-19 21:16 /user/hive
drwxr-xr-x   - root supergroup          0 2016-07-19 23:26 /user/test
[root@cdh1 data]# hdfs dfs -mkdir /user/root/data/
mkdir: `/user/root/data/': No such file or directory
[root@cdh1 data]# hdfs dfs -mkdir -p /user/root/data/
[root@cdh1 data]# hdfs dfs -ls /user/root/data/
[root@cdh1 data]# hdfs dfs -ls /user/root
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-07-19 23:47 /user/root/data
[root@cdh1 data]# pwd
/user/local/spark-1.4.0-bin-hadoop2.6/data
[root@cdh1 data]# ll
total 8
-rw-r--r--. 1 root root  185 Jul 19 23:44 customers.txt
drwxr-xr-x. 5 yc   yc   4096 Jun  3  2015 mllib
[root@cdh1 data]# hdfs dfs -put customers.txt /user/root/data
[root@cdh1 data]# hdfs dfs -ls /user/root/data
Found 1 items
-rw-r--r--   1 root supergroup        185 2016-07-19 23:48 /user/root/data/customers.txt
[root@cdh1 data]# hdfs dfs -text /user/root/data/customers.txt
100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727
[root@cdh1 data]# hdfs dfs -cat /user/root/data/customers.txt
100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727
[root@cdh1 data]# hdfs dfs -mv /user/root/data/customers.txt /user/root/data/customer.tx
[root@cdh1 data]# hdfs dfs -ls /user/root/data
Found 1 items
-rw-r--r--   1 root supergroup        185 2016-07-19 23:48 /user/root/data/customer.tx
[root@cdh1 data]# 
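
With customers.txt now in HDFS, a quick sanity check from spark-shell confirms the file is readable before building a DataFrame. A minimal sketch against the /user/test copy uploaded above:

// Count the lines of the uploaded file; with the data above this should print 5
val lineCount = sc.textFile("/user/test/customers.txt").count();
println(lineCount);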

scala> val textFile = sc.textFile("customers.txt").collect();
16/07/19 23:33:01 INFO storage.MemoryStore: ensureFreeSpace(222752) called with curMem=896994, maxMem=278019440
16/07/19 23:33:01 INFO storage.MemoryStore: Block broadcast_19 stored as values in memory (estimated size 217.5 KB, free 264.1 MB)
16/07/19 23:33:01 INFO storage.MemoryStore: ensureFreeSpace(19999) called with curMem=1119746, maxMem=278019440
16/07/19 23:33:01 INFO storage.MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 19.5 KB, free 264.1 MB)
16/07/19 23:33:01 INFO storage.BlockManagerInfo: Added broadcast_19_piece0 in memory on localhost:56137 (size: 19.5 KB, free: 265.0 MB)
16/07/19 23:33:01 INFO spark.SparkContext: Created broadcast 19 from textFile at <console>:26
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://cdh1:9000/user/root/customers.txt
	at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
	at scala.Option.getOrElse(Option.scala:120)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:1779)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:885)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:884)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:26)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:35)
	at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:37)
	at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:39)
	at $iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
	at $iwC$$iwC$$iwC.<init>(<console>:43)
	at $iwC$$iwC.<init>(<console>:45)
	at $iwC.<init>(<console>:47)
	at <init>(<console>:49)
	at .<init>(<console>:53)
	at .<clinit>(<console>)
	at .<init>(<console>:7)
	at .<clinit>(<console>)
	at $print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1338)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
	at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
	at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
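
This failure is exactly the relative-path pitfall in the title: a relative path passed to sc.textFile is resolved against the user's HDFS home directory (here /user/root), not the local working directory, so "customers.txt" became hdfs://cdh1:9000/user/root/customers.txt, which did not exist. A sketch of the equivalent spellings, assuming the fs.defaultFS shown in the error (hdfs://cdh1:9000):

// All three resolve to the same HDFS file when running as user root:
sc.textFile("customers.txt")                             // relative to the HDFS home dir /user/root
sc.textFile("/user/root/customers.txt")                  // absolute HDFS path
sc.textFile("hdfs://cdh1:9000/user/root/customers.txt")  // fully qualified URI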


scala> val textFile = sc.textFile("/user/test/customers.txt").collect();
16/07/19 23:33:16 INFO storage.BlockManagerInfo: Removed broadcast_19_piece0 on localhost:56137 in memory (size: 19.5 KB, free: 265.0 MB)
16/07/19 23:33:16 INFO storage.MemoryStore: ensureFreeSpace(222752) called with curMem=896994, maxMem=278019440
16/07/19 23:33:16 INFO storage.MemoryStore: Block broadcast_20 stored as values in memory (estimated size 217.5 KB, free 264.1 MB)
16/07/19 23:33:16 INFO storage.MemoryStore: ensureFreeSpace(19999) called with curMem=1119746, maxMem=278019440
16/07/19 23:33:16 INFO storage.MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 19.5 KB, free 264.1 MB)
16/07/19 23:33:16 INFO storage.BlockManagerInfo: Added broadcast_20_piece0 in memory on localhost:56137 (size: 19.5 KB, free: 265.0 MB)
16/07/19 23:33:16 INFO spark.SparkContext: Created broadcast 20 from textFile at <console>:26
16/07/19 23:33:16 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/19 23:33:16 INFO spark.SparkContext: Starting job: collect at <console>:26
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Got job 13 (collect at <console>:26) with 2 output partitions (allowLocal=false)
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Final stage: ResultStage 17(collect at <console>:26)
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Submitting ResultStage 17 (MapPartitionsRDD[38] at textFile at <console>:26), which has no missing parents
16/07/19 23:33:16 INFO storage.MemoryStore: ensureFreeSpace(3128) called with curMem=1139745, maxMem=278019440
16/07/19 23:33:16 INFO storage.MemoryStore: Block broadcast_21 stored as values in memory (estimated size 3.1 KB, free 264.1 MB)
16/07/19 23:33:16 INFO storage.MemoryStore: ensureFreeSpace(1795) called with curMem=1142873, maxMem=278019440
16/07/19 23:33:16 INFO storage.MemoryStore: Block broadcast_21_piece0 stored as bytes in memory (estimated size 1795.0 B, free 264.0 MB)
16/07/19 23:33:16 INFO storage.BlockManagerInfo: Added broadcast_21_piece0 in memory on localhost:56137 (size: 1795.0 B, free: 265.0 MB)
16/07/19 23:33:16 INFO spark.SparkContext: Created broadcast 21 from broadcast at DAGScheduler.scala:874
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 17 (MapPartitionsRDD[38] at textFile at <console>:26)
16/07/19 23:33:16 INFO scheduler.TaskSchedulerImpl: Adding task set 17.0 with 2 tasks
16/07/19 23:33:16 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 17.0 (TID 414, localhost, ANY, 1413 bytes)
16/07/19 23:33:16 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 17.0 (TID 415, localhost, ANY, 1413 bytes)
16/07/19 23:33:16 INFO executor.Executor: Running task 0.0 in stage 17.0 (TID 414)
16/07/19 23:33:16 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:33:16 INFO executor.Executor: Running task 1.0 in stage 17.0 (TID 415)
16/07/19 23:33:16 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:33:16 INFO executor.Executor: Finished task 1.0 in stage 17.0 (TID 415). 1875 bytes result sent to driver
16/07/19 23:33:16 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 17.0 (TID 415) in 28 ms on localhost (1/2)
16/07/19 23:33:16 INFO executor.Executor: Finished task 0.0 in stage 17.0 (TID 414). 1904 bytes result sent to driver
16/07/19 23:33:16 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 17.0 (TID 414) in 30 ms on localhost (2/2)
16/07/19 23:33:16 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 17.0, whose tasks have all completed, from pool 
16/07/19 23:33:16 INFO scheduler.DAGScheduler: ResultStage 17 (collect at <console>:26) finished in 0.030 s
16/07/19 23:33:16 INFO scheduler.DAGScheduler: Job 13 finished: collect at <console>:26, took 0.041509 s
textFile: Array[String] = Array(100, John Smith, Austin, TX, 78727, 200, Joe Johnson, Dallas, TX, 75201, 300, Bob Jones, Houston, TX, 77028, 400, Andy Davis, San Antonio, TX, 78227, 500, James Williams, Austin, TX, 78727)

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc);
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@...

scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String)
defined class Customer

scala> val dfCustomers = sc.textFile("/user/test/customers.txt").map(_.split(",")).map(p => Customer(p(0).trim.toInt, p(1), p(2), p(3), p(4))).toDF();
16/07/19 23:34:33 INFO storage.BlockManagerInfo: Removed broadcast_21_piece0 on localhost:56137 in memory (size: 1795.0 B, free: 265.0 MB)
16/07/19 23:34:33 INFO storage.BlockManagerInfo: Removed broadcast_20_piece0 on localhost:56137 in memory (size: 19.5 KB, free: 265.0 MB)
16/07/19 23:34:33 INFO storage.MemoryStore: ensureFreeSpace(222752) called with curMem=896994, maxMem=278019440
16/07/19 23:34:33 INFO storage.MemoryStore: Block broadcast_22 stored as values in memory (estimated size 217.5 KB, free 264.1 MB)
16/07/19 23:34:33 INFO storage.MemoryStore: ensureFreeSpace(19999) called with curMem=1119746, maxMem=278019440
16/07/19 23:34:33 INFO storage.MemoryStore: Block broadcast_22_piece0 stored as bytes in memory (estimated size 19.5 KB, free 264.1 MB)
16/07/19 23:34:33 INFO storage.BlockManagerInfo: Added broadcast_22_piece0 in memory on localhost:56137 (size: 19.5 KB, free: 265.0 MB)
16/07/19 23:34:33 INFO spark.SparkContext: Created broadcast 22 from textFile at <console>:33
dfCustomers: org.apache.spark.sql.DataFrame = [customer_id: int, name: string, city: string, state: string, zip_code: string]

scala> dfCustomers.registerTempTable("customers");

scala> dfCustomers.show();
16/07/19 23:34:55 INFO mapred.FileInputFormat: Total input paths to process : 1
16/07/19 23:34:55 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Got job 14 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Final stage: ResultStage 18(show at <console>:36)
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Submitting ResultStage 18 (MapPartitionsRDD[44] at show at <console>:36), which has no missing parents
16/07/19 23:34:55 INFO storage.MemoryStore: ensureFreeSpace(4088) called with curMem=1139745, maxMem=278019440
16/07/19 23:34:55 INFO storage.MemoryStore: Block broadcast_23 stored as values in memory (estimated size 4.0 KB, free 264.0 MB)
16/07/19 23:34:55 INFO storage.MemoryStore: ensureFreeSpace(2227) called with curMem=1143833, maxMem=278019440
16/07/19 23:34:55 INFO storage.MemoryStore: Block broadcast_23_piece0 stored as bytes in memory (estimated size 2.2 KB, free 264.0 MB)
16/07/19 23:34:55 INFO storage.BlockManagerInfo: Added broadcast_23_piece0 in memory on localhost:56137 (size: 2.2 KB, free: 265.0 MB)
16/07/19 23:34:55 INFO spark.SparkContext: Created broadcast 23 from broadcast at DAGScheduler.scala:874
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 18 (MapPartitionsRDD[44] at show at <console>:36)
16/07/19 23:34:55 INFO scheduler.TaskSchedulerImpl: Adding task set 18.0 with 1 tasks
16/07/19 23:34:55 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 18.0 (TID 416, localhost, ANY, 1413 bytes)
16/07/19 23:34:55 INFO executor.Executor: Running task 0.0 in stage 18.0 (TID 416)
16/07/19 23:34:55 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:34:55 INFO executor.Executor: Finished task 0.0 in stage 18.0 (TID 416). 2420 bytes result sent to driver
16/07/19 23:34:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 18.0 (TID 416) in 28 ms on localhost (1/1)
16/07/19 23:34:55 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 18.0, whose tasks have all completed, from pool 
16/07/19 23:34:55 INFO scheduler.DAGScheduler: ResultStage 18 (show at <console>:36) finished in 0.027 s
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Job 14 finished: show at <console>:36, took 0.057418 s
16/07/19 23:34:55 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Got job 15 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Final stage: ResultStage 19(show at <console>:36)
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Submitting ResultStage 19 (MapPartitionsRDD[44] at show at <console>:36), which has no missing parents
16/07/19 23:34:55 INFO storage.MemoryStore: ensureFreeSpace(4088) called with curMem=1146060, maxMem=278019440
16/07/19 23:34:55 INFO storage.MemoryStore: Block broadcast_24 stored as values in memory (estimated size 4.0 KB, free 264.0 MB)
16/07/19 23:34:55 INFO storage.MemoryStore: ensureFreeSpace(2227) called with curMem=1150148, maxMem=278019440
16/07/19 23:34:55 INFO storage.MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 2.2 KB, free 264.0 MB)
16/07/19 23:34:55 INFO storage.BlockManagerInfo: Added broadcast_24_piece0 in memory on localhost:56137 (size: 2.2 KB, free: 265.0 MB)
16/07/19 23:34:55 INFO spark.SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:874
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 19 (MapPartitionsRDD[44] at show at <console>:36)
16/07/19 23:34:55 INFO scheduler.TaskSchedulerImpl: Adding task set 19.0 with 1 tasks
16/07/19 23:34:55 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 19.0 (TID 417, localhost, ANY, 1413 bytes)
16/07/19 23:34:55 INFO executor.Executor: Running task 0.0 in stage 19.0 (TID 417)
16/07/19 23:34:55 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:34:55 INFO executor.Executor: Finished task 0.0 in stage 19.0 (TID 417). 2311 bytes result sent to driver
16/07/19 23:34:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 19.0 (TID 417) in 14 ms on localhost (1/1)
16/07/19 23:34:55 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 19.0, whose tasks have all completed, from pool 
16/07/19 23:34:55 INFO scheduler.DAGScheduler: ResultStage 19 (show at <console>:36) finished in 0.013 s
16/07/19 23:34:55 INFO scheduler.DAGScheduler: Job 15 finished: show at <console>:36, took 0.026368 s
+-----------+---------------+------------+-----+--------+
|customer_id|           name|        city|state|zip_code|
+-----------+---------------+------------+-----+--------+
|        100|     John Smith|      Austin|   TX|   78727|
|        200|    Joe Johnson|      Dallas|   TX|   75201|
|        300|      Bob Jones|     Houston|   TX|   77028|
|        400|     Andy Davis| San Antonio|   TX|   78227|
|        500| James Williams|      Austin|   TX|   78727|
+-----------+---------------+------------+-----+--------+


scala> dfCustomers.printSchema();
root
 |-- customer_id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip_code: string (nullable = true)


scala> dfCustomers.select("name").show();
16/07/19 23:35:11 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Got job 16 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Final stage: ResultStage 20(show at <console>:36)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Submitting ResultStage 20 (MapPartitionsRDD[46] at show at <console>:36), which has no missing parents
16/07/19 23:35:11 INFO storage.MemoryStore: ensureFreeSpace(5472) called with curMem=1152375, maxMem=278019440
16/07/19 23:35:11 INFO storage.MemoryStore: Block broadcast_25 stored as values in memory (estimated size 5.3 KB, free 264.0 MB)
16/07/19 23:35:11 INFO storage.MemoryStore: ensureFreeSpace(2881) called with curMem=1157847, maxMem=278019440
16/07/19 23:35:11 INFO storage.MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 2.8 KB, free 264.0 MB)
16/07/19 23:35:11 INFO storage.BlockManagerInfo: Added broadcast_25_piece0 in memory on localhost:56137 (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:11 INFO spark.SparkContext: Created broadcast 25 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 20 (MapPartitionsRDD[46] at show at <console>:36)
16/07/19 23:35:11 INFO scheduler.TaskSchedulerImpl: Adding task set 20.0 with 1 tasks
16/07/19 23:35:11 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 20.0 (TID 418, localhost, ANY, 1413 bytes)
16/07/19 23:35:11 INFO executor.Executor: Running task 0.0 in stage 20.0 (TID 418)
16/07/19 23:35:11 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:35:11 INFO executor.Executor: Finished task 0.0 in stage 20.0 (TID 418). 2130 bytes result sent to driver
16/07/19 23:35:11 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 20.0 (TID 418) in 45 ms on localhost (1/1)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: ResultStage 20 (show at <console>:36) finished in 0.045 s
16/07/19 23:35:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 20.0, whose tasks have all completed, from pool 
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Job 16 finished: show at <console>:36, took 0.064127 s
16/07/19 23:35:11 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Got job 17 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Final stage: ResultStage 21(show at <console>:36)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Submitting ResultStage 21 (MapPartitionsRDD[46] at show at <console>:36), which has no missing parents
16/07/19 23:35:11 INFO storage.MemoryStore: ensureFreeSpace(5472) called with curMem=1160728, maxMem=278019440
16/07/19 23:35:11 INFO storage.MemoryStore: Block broadcast_26 stored as values in memory (estimated size 5.3 KB, free 264.0 MB)
16/07/19 23:35:11 INFO storage.MemoryStore: ensureFreeSpace(2881) called with curMem=1166200, maxMem=278019440
16/07/19 23:35:11 INFO storage.MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 2.8 KB, free 264.0 MB)
16/07/19 23:35:11 INFO storage.BlockManagerInfo: Added broadcast_26_piece0 in memory on localhost:56137 (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:11 INFO spark.SparkContext: Created broadcast 26 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:11 INFO storage.BlockManagerInfo: Removed broadcast_25_piece0 on localhost:56137 in memory (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 21 (MapPartitionsRDD[46] at show at <console>:36)
16/07/19 23:35:11 INFO scheduler.TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
16/07/19 23:35:11 INFO storage.BlockManagerInfo: Removed broadcast_24_piece0 on localhost:56137 in memory (size: 2.2 KB, free: 265.0 MB)
16/07/19 23:35:11 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 21.0 (TID 419, localhost, ANY, 1413 bytes)
16/07/19 23:35:11 INFO executor.Executor: Running task 0.0 in stage 21.0 (TID 419)
16/07/19 23:35:11 INFO storage.BlockManagerInfo: Removed broadcast_23_piece0 on localhost:56137 in memory (size: 2.2 KB, free: 265.0 MB)
16/07/19 23:35:11 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:35:11 INFO executor.Executor: Finished task 0.0 in stage 21.0 (TID 419). 2091 bytes result sent to driver
16/07/19 23:35:11 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 21.0 (TID 419) in 18 ms on localhost (1/1)
16/07/19 23:35:11 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool 
16/07/19 23:35:11 INFO scheduler.DAGScheduler: ResultStage 21 (show at <console>:36) finished in 0.019 s
16/07/19 23:35:11 INFO scheduler.DAGScheduler: Job 17 finished: show at <console>:36, took 0.060116 s
+---------------+
|           name|
+---------------+
|     John Smith|
|    Joe Johnson|
|      Bob Jones|
|     Andy Davis|
| James Williams|
+---------------+


scala> dfCustomers.select("name", "city").show()
16/07/19 23:35:19 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Got job 18 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 22(show at <console>:36)
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[48] at show at <console>:36), which has no missing parents
16/07/19 23:35:19 INFO storage.MemoryStore: ensureFreeSpace(5480) called with curMem=1148098, maxMem=278019440
16/07/19 23:35:19 INFO storage.MemoryStore: Block broadcast_27 stored as values in memory (estimated size 5.4 KB, free 264.0 MB)
16/07/19 23:35:19 INFO storage.MemoryStore: ensureFreeSpace(2883) called with curMem=1153578, maxMem=278019440
16/07/19 23:35:19 INFO storage.MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 2.8 KB, free 264.0 MB)
16/07/19 23:35:19 INFO storage.BlockManagerInfo: Added broadcast_27_piece0 in memory on localhost:56137 (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:19 INFO spark.SparkContext: Created broadcast 27 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 22 (MapPartitionsRDD[48] at show at <console>:36)
16/07/19 23:35:19 INFO scheduler.TaskSchedulerImpl: Adding task set 22.0 with 1 tasks
16/07/19 23:35:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 22.0 (TID 420, localhost, ANY, 1413 bytes)
16/07/19 23:35:19 INFO executor.Executor: Running task 0.0 in stage 22.0 (TID 420)
16/07/19 23:35:19 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:35:19 INFO executor.Executor: Finished task 0.0 in stage 22.0 (TID 420). 2200 bytes result sent to driver
16/07/19 23:35:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 22.0 (TID 420) in 17 ms on localhost (1/1)
16/07/19 23:35:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 22.0, whose tasks have all completed, from pool 
16/07/19 23:35:19 INFO scheduler.DAGScheduler: ResultStage 22 (show at <console>:36) finished in 0.013 s
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Job 18 finished: show at <console>:36, took 0.030147 s
16/07/19 23:35:19 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Got job 19 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Final stage: ResultStage 23(show at <console>:36)
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Submitting ResultStage 23 (MapPartitionsRDD[48] at show at <console>:36), which has no missing parents
16/07/19 23:35:19 INFO storage.MemoryStore: ensureFreeSpace(5480) called with curMem=1156461, maxMem=278019440
16/07/19 23:35:19 INFO storage.MemoryStore: Block broadcast_28 stored as values in memory (estimated size 5.4 KB, free 264.0 MB)
16/07/19 23:35:19 INFO storage.MemoryStore: ensureFreeSpace(2883) called with curMem=1161941, maxMem=278019440
16/07/19 23:35:19 INFO storage.MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 2.8 KB, free 264.0 MB)
16/07/19 23:35:19 INFO storage.BlockManagerInfo: Added broadcast_28_piece0 in memory on localhost:56137 (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:19 INFO spark.SparkContext: Created broadcast 28 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 23 (MapPartitionsRDD[48] at show at <console>:36)
16/07/19 23:35:19 INFO scheduler.TaskSchedulerImpl: Adding task set 23.0 with 1 tasks
16/07/19 23:35:19 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 23.0 (TID 421, localhost, ANY, 1413 bytes)
16/07/19 23:35:19 INFO executor.Executor: Running task 0.0 in stage 23.0 (TID 421)
16/07/19 23:35:19 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:35:19 INFO executor.Executor: Finished task 0.0 in stage 23.0 (TID 421). 2142 bytes result sent to driver
16/07/19 23:35:19 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 23.0 (TID 421) in 14 ms on localhost (1/1)
16/07/19 23:35:19 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all completed, from pool 
16/07/19 23:35:19 INFO scheduler.DAGScheduler: ResultStage 23 (show at <console>:36) finished in 0.014 s
16/07/19 23:35:19 INFO scheduler.DAGScheduler: Job 19 finished: show at <console>:36, took 0.027027 s
+---------------+------------+
|           name|        city|
+---------------+------------+
|     John Smith|      Austin|
|    Joe Johnson|      Dallas|
|      Bob Jones|     Houston|
|     Andy Davis| San Antonio|
| James Williams|      Austin|
+---------------+------------+


scala> dfCustomers.filter(dfCustomers("customer_id").equalTo(500)).show();
16/07/19 23:35:41 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Got job 20 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Final stage: ResultStage 24(show at <console>:36)
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[50] at show at <console>:36), which has no missing parents
16/07/19 23:35:41 INFO storage.MemoryStore: ensureFreeSpace(5632) called with curMem=1164824, maxMem=278019440
16/07/19 23:35:41 INFO storage.MemoryStore: Block broadcast_29 stored as values in memory (estimated size 5.5 KB, free 264.0 MB)
16/07/19 23:35:41 INFO storage.MemoryStore: ensureFreeSpace(2924) called with curMem=1170456, maxMem=278019440
16/07/19 23:35:41 INFO storage.MemoryStore: Block broadcast_29_piece0 stored as bytes in memory (estimated size 2.9 KB, free 264.0 MB)
16/07/19 23:35:41 INFO storage.BlockManagerInfo: Added broadcast_29_piece0 in memory on localhost:56137 (size: 2.9 KB, free: 265.0 MB)
16/07/19 23:35:41 INFO spark.SparkContext: Created broadcast 29 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[50] at show at <console>:36)
16/07/19 23:35:41 INFO scheduler.TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
16/07/19 23:35:41 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 24.0 (TID 422, localhost, ANY, 1413 bytes)
16/07/19 23:35:41 INFO executor.Executor: Running task 0.0 in stage 24.0 (TID 422)
16/07/19 23:35:41 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:35:41 INFO executor.Executor: Finished task 0.0 in stage 24.0 (TID 422). 1800 bytes result sent to driver
16/07/19 23:35:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 24.0 (TID 422) in 41 ms on localhost (1/1)
16/07/19 23:35:41 INFO scheduler.DAGScheduler: ResultStage 24 (show at <console>:36) finished in 0.040 s
16/07/19 23:35:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks have all completed, from pool 
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Job 20 finished: show at <console>:36, took 0.057741 s
16/07/19 23:35:41 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Got job 21 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Final stage: ResultStage 25(show at <console>:36)
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Submitting ResultStage 25 (MapPartitionsRDD[50] at show at <console>:36), which has no missing parents
16/07/19 23:35:41 INFO storage.MemoryStore: ensureFreeSpace(5632) called with curMem=1173380, maxMem=278019440
16/07/19 23:35:41 INFO storage.MemoryStore: Block broadcast_30 stored as values in memory (estimated size 5.5 KB, free 264.0 MB)
16/07/19 23:35:41 INFO storage.MemoryStore: ensureFreeSpace(2924) called with curMem=1179012, maxMem=278019440
16/07/19 23:35:41 INFO storage.MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 2.9 KB, free 264.0 MB)
16/07/19 23:35:41 INFO storage.BlockManagerInfo: Added broadcast_30_piece0 in memory on localhost:56137 (size: 2.9 KB, free: 265.0 MB)
16/07/19 23:35:41 INFO spark.SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 25 (MapPartitionsRDD[50] at show at <console>:36)
16/07/19 23:35:41 INFO scheduler.TaskSchedulerImpl: Adding task set 25.0 with 1 tasks
16/07/19 23:35:41 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 25.0 (TID 423, localhost, ANY, 1413 bytes)
16/07/19 23:35:41 INFO executor.Executor: Running task 0.0 in stage 25.0 (TID 423)
16/07/19 23:35:41 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:35:41 INFO executor.Executor: Finished task 0.0 in stage 25.0 (TID 423). 2189 bytes result sent to driver
16/07/19 23:35:41 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 25.0 (TID 423) in 15 ms on localhost (1/1)
16/07/19 23:35:41 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 25.0, whose tasks have all completed, from pool 
16/07/19 23:35:41 INFO scheduler.DAGScheduler: ResultStage 25 (show at <console>:36) finished in 0.015 s
16/07/19 23:35:41 INFO scheduler.DAGScheduler: Job 21 finished: show at <console>:36, took 0.038441 s
+-----------+---------------+-------+-----+--------+
|customer_id|           name|   city|state|zip_code|
+-----------+---------------+-------+-----+--------+
|        500| James Williams| Austin|   TX|   78727|
+-----------+---------------+-------+-----+--------+


scala> dfCustomers.groupBy("zip_code").count().show();
16/07/19 23:35:52 INFO execution.Exchange: Using SparkSqlSerializer2.
16/07/19 23:35:52 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Registering RDD 53 (show at <console>:36)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Got job 22 (show at <console>:36) with 1 output partitions (allowLocal=false)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Final stage: ResultStage 27(show at <console>:36)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 26)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 26)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 26 (MapPartitionsRDD[53] at show at <console>:36), which has no missing parents
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(9456) called with curMem=1181936, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_31 stored as values in memory (estimated size 9.2 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(4565) called with curMem=1191392, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 4.5 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Removed broadcast_29_piece0 on localhost:56137 in memory (size: 2.9 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Added broadcast_31_piece0 in memory on localhost:56137 (size: 4.5 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Removed broadcast_28_piece0 on localhost:56137 in memory (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO spark.SparkContext: Created broadcast 31 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Removed broadcast_27_piece0 on localhost:56137 in memory (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 26 (MapPartitionsRDD[53] at show at <console>:36)
16/07/19 23:35:52 INFO scheduler.TaskSchedulerImpl: Adding task set 26.0 with 2 tasks
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Removed broadcast_26_piece0 on localhost:56137 in memory (size: 2.8 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 26.0 (TID 424, localhost, ANY, 1402 bytes)
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 26.0 (TID 425, localhost, ANY, 1402 bytes)
16/07/19 23:35:52 INFO executor.Executor: Running task 0.0 in stage 26.0 (TID 424)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Removed broadcast_30_piece0 on localhost:56137 in memory (size: 2.9 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:0+92
16/07/19 23:35:52 INFO executor.Executor: Running task 1.0 in stage 26.0 (TID 425)
16/07/19 23:35:52 INFO rdd.HadoopRDD: Input split: hdfs://cdh1:9000/user/test/customers.txt:92+93
16/07/19 23:35:52 INFO executor.Executor: Finished task 0.0 in stage 26.0 (TID 424). 2203 bytes result sent to driver
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 26.0 (TID 424) in 175 ms on localhost (1/2)
16/07/19 23:35:52 INFO executor.Executor: Finished task 1.0 in stage 26.0 (TID 425). 2203 bytes result sent to driver
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 26.0 (TID 425) in 181 ms on localhost (2/2)
16/07/19 23:35:52 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 26.0, whose tasks have all completed, from pool 
16/07/19 23:35:52 INFO scheduler.DAGScheduler: ShuffleMapStage 26 (show at <console>:36) finished in 0.180 s
16/07/19 23:35:52 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/07/19 23:35:52 INFO scheduler.DAGScheduler: running: Set()
16/07/19 23:35:52 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 27)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: failed: Set()
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Missing parents for ResultStage 27: List()
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Submitting ResultStage 27 (MapPartitionsRDD[57] at show at <console>:36), which is now runnable
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(10280) called with curMem=1153766, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_32 stored as values in memory (estimated size 10.0 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(4981) called with curMem=1164046, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_32_piece0 stored as bytes in memory (estimated size 4.9 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Added broadcast_32_piece0 in memory on localhost:56137 (size: 4.9 KB, free: 265.0 MB)
16/07/19 23:35:52 INFO spark.SparkContext: Created broadcast 32 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 27 (MapPartitionsRDD[57] at show at <console>:36)
16/07/19 23:35:52 INFO scheduler.TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 27.0 (TID 426, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:52 INFO executor.Executor: Running task 0.0 in stage 27.0 (TID 426)
16/07/19 23:35:52 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:52 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/19 23:35:52 INFO executor.Executor: Finished task 0.0 in stage 27.0 (TID 426). 894 bytes result sent to driver
16/07/19 23:35:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 27.0 (TID 426) in 7 ms on localhost (1/1)
16/07/19 23:35:52 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool 
16/07/19 23:35:52 INFO scheduler.DAGScheduler: ResultStage 27 (show at <console>:36) finished in 0.007 s
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Job 22 finished: show at <console>:36, took 0.261327 s
16/07/19 23:35:52 INFO spark.SparkContext: Starting job: show at <console>:36
16/07/19 23:35:52 INFO spark.MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 169 bytes
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Got job 23 (show at <console>:36) with 199 output partitions (allowLocal=false)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Final stage: ResultStage 29(show at <console>:36)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 28)
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Missing parents: List()
16/07/19 23:35:52 INFO scheduler.DAGScheduler: Submitting ResultStage 29 (MapPartitionsRDD[57] at show at <console>:36), which has no missing parents
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(10280) called with curMem=1169027, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_33 stored as values in memory (estimated size 10.0 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.MemoryStore: ensureFreeSpace(4981) called with curMem=1179307, maxMem=278019440
16/07/19 23:35:52 INFO storage.MemoryStore: Block broadcast_33_piece0 stored as bytes in memory (estimated size 4.9 KB, free 264.0 MB)
16/07/19 23:35:52 INFO storage.BlockManagerInfo: Added broadcast_33_piece0 in memory on localhost:56137 (size: 4.9 KB, free: 265.0 MB)
16/07/19 23:35:53 INFO spark.SparkContext: Created broadcast 33 from broadcast at DAGScheduler.scala:874
16/07/19 23:35:53 INFO scheduler.DAGScheduler: Submitting 199 missing tasks from ResultStage 29 (MapPartitionsRDD[57] at show at <console>:36)
16/07/19 23:35:53 INFO scheduler.TaskSchedulerImpl: Adding task set 29.0 with 199 tasks
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 29.0 (TID 427, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 29.0 (TID 428, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO executor.Executor: Running task 0.0 in stage 29.0 (TID 427)
16/07/19 23:35:53 INFO executor.Executor: Running task 1.0 in stage 29.0 (TID 428)
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/19 23:35:53 INFO executor.Executor: Finished task 1.0 in stage 29.0 (TID 428). 894 bytes result sent to driver
16/07/19 23:35:53 INFO executor.Executor: Finished task 0.0 in stage 29.0 (TID 427). 894 bytes result sent to driver
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 2.0 in stage 29.0 (TID 429, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO executor.Executor: Running task 2.0 in stage 29.0 (TID 429)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 3.0 in stage 29.0 (TID 430, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 29.0 (TID 427) in 11 ms on localhost (1/199)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 29.0 (TID 428) in 10 ms on localhost (2/199)
16/07/19 23:35:53 INFO executor.Executor: Running task 3.0 in stage 29.0 (TID 430)
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/19 23:35:53 INFO executor.Executor: Finished task 3.0 in stage 29.0 (TID 430). 894 bytes result sent to driver
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 29.0 (TID 431, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO executor.Executor: Running task 4.0 in stage 29.0 (TID 431)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 29.0 (TID 430) in 8 ms on localhost (3/199)
16/07/19 23:35:53 INFO executor.Executor: Finished task 2.0 in stage 29.0 (TID 429). 894 bytes result sent to driver
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 29.0 (TID 432, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Finished task 2.0 in stage 29.0 (TID 429) in 12 ms on localhost (4/199)
16/07/19 23:35:53 INFO executor.Executor: Running task 5.0 in stage 29.0 (TID 432)
16/07/19 23:35:53 INFO executor.Executor: Finished task 4.0 in stage 29.0 (TID 431). 894 bytes result sent to driver
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 29.0 (TID 433, localhost, PROCESS_LOCAL, 1165 bytes)
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 29.0 (TID 431) in 6 ms on localhost (5/199)
16/07/19 23:35:53 INFO executor.Executor: Running task 6.0 in stage 29.0 (TID 433)
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
16/07/19 23:35:53 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
16/07/19 23:35:53 INFO executor.Executor: Finished task 6.0 in stage 29.0 (TID 433). 894 bytes result sent to driver
16/07/19 23:35:53 INFO scheduler.TaskSetManager: Starting task 7.0 in stage 29.0 (TID 434, localhost, PROCESS_LOCAL