
Getting Started with Spark, Part 5 (Advanced RDD Operators, Part 2)

1. Review of List operations

List operations:

val ls1 = List(1)
val ls2 = List(2)
// Append a single element
ls1 :+ 2      // result: res15: List[Int] = List(1, 2)
// Concatenate another collection
ls1 ++ ls2    // result: res16: List[Int] = List(1, 2)
// Concatenate another List (::: is defined only on List)
ls1 ::: ls2   // result: res17: List[Int] = List(1, 2)

2. combineByKey

val rdd1 = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","wolf","bear","bee"),3)
val rdd2 = sc.parallelize(List(1,2,1,2,1,2,1,2),3)

scala> val rdd3 = rdd2.zip(rdd1).collect
rdd3: Array[(Int, String)] = Array((1,dog), (2,cat), (1,gnu), (2,salmon), (1,rabbit), (2,wolf), (1,bear), (2,bee))

scala> val rdd3 = rdd2.zip(rdd1)
rdd3: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[5] at zip at <console>:27

scala> rdd3.combineByKey(List(_),(a:List[String],b:String)=> a :+ b,(x:List[String],y:List[String])=> x ::: y)
res21: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[6] at combineByKey at <console>:26

scala> res21.collect
res22: Array[(Int, List[String])] = Array((1,List(gnu, rabbit, dog, bear)), (2,List(cat, wolf, bee, salmon)))

Note: what does (a:List[String],b:String)=> a :+ b do?
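
One way to see what each of the three arguments to combineByKey does is to replay the example in plain Scala, without Spark. The sketch below is not Spark's implementation. combinePartition and mergePartitions are hypothetical helpers written only for illustration, and the partition layout is an assumption chosen to be consistent with the grouping seen in the collected output above. Within one partition, List(_) turns the first value seen for a key into a one-element list, and (a, b) => a :+ b appends each later value for that key to the list. Across partitions, (x, y) => x ::: y concatenates the per-partition lists.

```scala
// Plain-Scala sketch of the three combineByKey stages; no Spark required.
object CombineByKeySketch {

  // Hypothetical helper: folds ONE partition using createCombiner and mergeValue.
  def combinePartition[K, V, C](
      part: Seq[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): Map[K, C] =
    part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc + (k -> createCombiner(v)) // first value for k in this partition
        case Some(c) => acc + (k -> mergeValue(c, v))  // later value for k in this partition
      }
    }

  // Hypothetical helper: merges the per-partition maps using mergeCombiners.
  def mergePartitions[K, C](
      parts: Seq[Map[K, C]],
      mergeCombiners: (C, C) => C): Map[K, C] =
    parts.foldLeft(Map.empty[K, C]) { (acc, m) =>
      m.foldLeft(acc) { case (a, (k, c)) =>
        a + (k -> a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
      }
    }

  // Assumed layout of the 8 zipped pairs over 3 partitions.
  val partitions: Seq[Seq[(Int, String)]] = Seq(
    Seq(1 -> "dog", 2 -> "cat"),
    Seq(1 -> "gnu", 2 -> "salmon", 1 -> "rabbit"),
    Seq(2 -> "wolf", 1 -> "bear", 2 -> "bee")
  )

  // Stage 1: the same createCombiner and mergeValue functions as in the example.
  val perPartition: Seq[Map[Int, List[String]]] =
    partitions.map { p =>
      combinePartition[Int, String, List[String]](p, List(_), (a, b) => a :+ b)
    }

  // Stage 2: the same mergeCombiners function as in the example.
  val result: Map[Int, List[String]] =
    mergePartitions[Int, List[String]](perPartition, (x, y) => x ::: y)

  def main(args: Array[String]): Unit =
    result.toSeq.sortBy(_._1).foreach(println)
}
```

Because this sketch merges the partitions strictly in order, key 1 comes out as List(dog, gnu, rabbit, bear) rather than the List(gnu, rabbit, dog, bear) shown in the Spark session. Spark does not guarantee the order in which per-partition results arrive after the shuffle, so only the grouping, not the order across partitions, should be relied on.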