Several ways to read files in Spark
Posted by 阿新 on 2018-11-28
The difference between spark.read.textFile and sc.textFile:
val rdd1 = sc.textFile("hdfs://han02:9000/words.txt") //returns an RDD[String]
val rdd2 = spark.read.textFile("hdfs://han02:9000/words.txt") //returns a Dataset[String]
Word count with each of them:
rdd1.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false)
rdd2.flatMap(x=>x.split(" ")).groupByKey(x=>x).count()
After collect, the former returns Array[(String, Int)] and the latter Array[(String, Long)].
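The RDD pipeline can be tried without a cluster: the sketch below mirrors it with plain Scala collections (the object name is made up for illustration; there is no Spark dependency), where groupBy plus a per-key sum plays the role of reduceByKey.

```scala
object WordCountSketch {
  // Plain-collections analogue of:
  //   rdd1.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, false)
  def wordCount(lines: Seq[String]): Seq[(String, Int)] =
    lines
      .flatMap(_.split(" "))                       // line -> words
      .map((_, 1))                                 // word -> (word, 1)
      .groupBy(_._1)                               // reduceByKey: group by word...
      .map { case (w, pairs) => (w, pairs.size) }  // ...and sum the 1s
      .toSeq
      .sortBy(-_._2)                               // descending by count

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("hello world", "hello spark")))
}
```

On a cluster the same chain of calls runs distributed; only the data structure changes from a local Seq to an RDD.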
textFile(url, num) //num sets the minimum number of partitions; a file is split further once it exceeds the HDFS block size (128 MB by default)
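As a rough sketch of how that hint interacts with the block size: Hadoop's FileInputFormat (which sc.textFile uses underneath) derives a split size from the total file size, the requested minimum partitions, and the block size. The helper below is a simplified illustration, not Spark's actual API:

```scala
object SplitSizeSketch {
  // Simplified form of Hadoop FileInputFormat's split-size calculation:
  //   goalSize  = totalSize / minPartitions
  //   splitSize = max(minSize, min(goalSize, blockSize))
  def splitSize(totalSize: Long, minPartitions: Int,
                blockSize: Long = 128L * 1024 * 1024, // default HDFS block: 128 MB
                minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(minPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024
    // 256 MB file, minPartitions = 2 -> 128 MB splits (2 partitions)
    println(splitSize(256 * mb, 2) / mb)
    // 256 MB file, minPartitions = 4 -> 64 MB splits (4 partitions)
    println(splitSize(256 * mb, 4) / mb)
  }
}
```

So asking for more partitions shrinks the split size, but a split never grows beyond one HDFS block.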
1. Read a single file from the current directory:
val path = "Current.txt" //file in the current directory
val rdd1 = sc.textFile(path, 2)
2. Read multiple files from the current directory:
val path = "Current1.txt,Current2.txt" //comma-separated list of files
val rdd1 = sc.textFile(path, 2)
3. Read a single file from the local filesystem:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md" //local file
val rdd1 = sc.textFile(path, 2)
4. Read the contents of a local directory:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/" //local directory
val rdd1 = sc.textFile(path, 2)
5. Read multiple local files:
val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-scala.txt,file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-spire.txt" //comma-separated local files
val rdd1 = sc.textFile(path, 2)
6. Read the contents of multiple local directories:
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*" //local files, matched by wildcard
val rdd1 = sc.textFile(path, 2)
val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt" //local files, restricted to a given suffix (.txt)
val rdd1 = sc.textFile(path, 2)
7. Use a wildcard to read similarly named files:
for (i <- 1 to 2) {
  val rdd1 = sc.textFile(s"/root/application/temp/people$i*", 2)
}
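Since sc.textFile accepts a comma-separated path list (as in example 2), the loop can also be collapsed into a single call by building one combined path string, yielding one RDD over all matching files instead of one RDD per iteration. A small sketch (the object name is made up for illustration):

```scala
object GlobPathsSketch {
  // Builds "/root/application/temp/people1*,/root/application/temp/people2*"
  def combinedPath(n: Int): String =
    (1 to n).map(i => s"/root/application/temp/people$i*").mkString(",")

  def main(args: Array[String]): Unit = {
    println(combinedPath(2))
    // On a cluster: val rdd1 = sc.textFile(combinedPath(2), 2)
  }
}
```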
(Note: files in Google storage cannot be read this way.)