
Several ways to read files in Spark

The difference between spark.read.textFile and sc.textFile

val rdd1 = spark.read.textFile("hdfs://han02:9000/words.txt")   // returns a Dataset[String]

val rdd2 = sc.textFile("hdfs://han02:9000/words.txt")  // returns an RDD[String]

Word count with each of them:

rdd2.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).sortBy(_._2,false)
rdd1.flatMap(x=>x.split(" ")).groupByKey(x=>x).count()

After collecting, the former (the RDD version) returns Array[(String, Int)], while the latter (the Dataset version) returns Array[(String, Long)].
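The two styles can also be mixed: a Dataset[String] can be dropped down to an RDD with .rdd, so the same reduceByKey word count works on either source. A minimal sketch reusing the path from above:

```scala
// Sketch: read via the Dataset API, then switch to the RDD API with .rdd
// so that reduceByKey and sortBy are available.
val counts = spark.read.textFile("hdfs://han02:9000/words.txt")
  .rdd
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)
```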

sc.textFile(path, num)  // num sets the minimum number of partitions; a file larger than one HDFS block (128 MB by default) is split into more
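Note that the second argument is a minimum, not an exact count: Spark may create more partitions if the input spans more blocks. A small sketch (the HDFS path is the one used above):

```scala
// Ask for at least 4 partitions; getNumPartitions reports how many Spark
// actually created, which can exceed 4 for a multi-block file.
val rdd = sc.textFile("hdfs://han02:9000/words.txt", 4)
println(rdd.getNumPartitions)
```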

1. Read a single file from the current directory:

val path = "Current.txt"  // file in the current directory
val rdd1 = sc.textFile(path,2)

2. Read multiple files from the current directory:

val path = "Current1.txt,Current2.txt"  // comma-separated files in the current directory
val rdd1 = sc.textFile(path,2)

 

3. Read a single file from the local filesystem:

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/README.md"  //local file
val rdd1 = sc.textFile(path,2)

4. Read the contents of a folder on the local filesystem:

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/"  //local file
val rdd1 = sc.textFile(path,2)
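When reading a whole folder it is sometimes useful to know which file each record came from; SparkContext also offers wholeTextFiles, which returns one (path, content) pair per file. A sketch reusing the licenses folder from above:

```scala
// wholeTextFiles yields an RDD[(String, String)] of (filePath, fileContent)
// pairs, one per file, instead of one record per line.
val files = sc.wholeTextFiles("file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/", 2)
files.keys.collect().foreach(println)  // the path of each file that was read
```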

5. Read multiple files from the local filesystem:

val path = "file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-scala.txt,file:///usr/local/spark/spark-1.6.0-bin-hadoop2.6/licenses/LICENSE-spire.txt"  //local files
val rdd1 = sc.textFile(path,2)

6. Read the contents of multiple folders on the local filesystem:

val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*"  //local file
val rdd1 = sc.textFile(path,2)

val path = "/usr/local/spark/spark-1.6.0-bin-hadoop2.6/data/*/*.txt" //local files, restricted to a given extension
val rdd1 = sc.textFile(path,2)

7. Use wildcards to read files with similar names:

for (i <- 1 to 2) {
  val rdd1 = sc.textFile(s"/root/application/temp/people$i*", 2)
}
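The loop above builds a separate RDD on each iteration and then discards it. If the goal is a single RDD over all matching files, the glob patterns can instead be joined into one comma-separated path (a sketch using the same hypothetical paths):

```scala
// Join the glob patterns into one comma-separated path, so all matching
// files land in a single RDD rather than one RDD per iteration.
val paths = (1 to 2).map(i => s"/root/application/temp/people$i*").mkString(",")
val rdd1 = sc.textFile(paths, 2)
```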

Note: files in Google cannot be read this way.