
Spark RDD Operators: Transformation Operators

 

RDD-Transformation

Transformation operators are the interface functions that operate on RDDs; their role is to transform one or more RDDs into a new RDD.

When computing data with Spark, once the creation operators have produced the initial RDDs, the most critical part of designing the processing algorithm and writing the program is to transform those RDDs step by step with transformation operators until the desired result is obtained.

Transformation operators fall into two categories: (1) operators that transform Value-type RDDs, and (2) operators that transform Key/Value-type RDDs. Within each category, some operators transform a single RDD while others transform two RDDs.

Transformations on a single Value-type RDD

  • map
  • filter
  • distinct
  • flatMap
  • sample
  • union
  • intersection
  • groupByKey
    The transformation operators listed above were covered in earlier articles, so no examples are shown for them here; see those articles for details.

coalesce: repartition

Repartitions the current RDD, producing a new RDD whose number of partitions is given by the numPartitions parameter. When the shuffle parameter is true, a shuffle is performed during the transformation; otherwise no shuffle takes place.

def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T]

Note:
With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner. The optional partition coalescer passed in must be serializable.

scala> val rdd = sc.parallelize(List(1,2,3,4,5,6,7,8), 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:24

scala> rdd.partitions
res13: Array[org.apache.spark.Partition] = Array(org.apache.spark.rdd.ParallelCollectionPartition@..., org.apache.spark.rdd.ParallelCollectionPartition@..., org.apache.spark.rdd.ParallelCollectionPartition@..., org.apache.spark.rdd.ParallelCollectionPartition@...)

scala> rdd.partitions.length
res14: Int = 4

scala> rdd.collect
res15: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

scala> rdd.glom.collect
res16: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4), Array(5, 6), Array(7, 8))

scala> val newRDD = rdd.coalesce(2, false)
newRDD: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[9] at coalesce at <console>:26

scala> newRDD.partitions.length
res17: Int = 2

scala> newRDD.collect
res18: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8)

scala> newRDD.glom.collect
res19: Array[Array[Int]] = Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8))

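As the note above points out, with shuffle = true coalesce can also increase the number of partitions. A minimal sketch, assuming the rdd created above is still available in the same spark-shell session:

// With shuffle = true, coalesce may produce MORE partitions than the original RDD has.
val widerRDD = rdd.coalesce(8, true)
widerRDD.partitions.length   // 8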

pipe: invoke a shell command

# Return an RDD created by piping elements to a forked external process.
def pipe(command: String): RDD[String]

On a Linux system there are many shell commands for processing data; through the pipe transformation we can apply such shell commands inside Spark to produce a new RDD.

scala> val rdd = sc.parallelize(0 to 7, 4)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[11] at parallelize at <console>:24

scala> rdd.glom.collect
res20: Array[Array[Int]] = Array(Array(0, 1), Array(2, 3), Array(4, 5), Array(6, 7))

scala> rdd.pipe("head -n 1").collect // take the first element of each partition to form the new RDD
res21: Array[String] = Array(0, 2, 4, 6)


sortBy: sort

Sorts the elements of the original RDD by the rule given by the function f; ascending or descending order can be selected with the ascending parameter. The sorted result becomes a new RDD whose number of partitions can be specified with numPartitions and defaults to the same number as the original RDD.

# Return this RDD sorted by the given key function.
def sortBy[K](f: (T) ⇒ K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]


scala> val rdd = sc.parallelize(List(2,1,4,3),1)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[19] at parallelize at <console>:24

scala> rdd.glom.collect
res24: Array[Array[Int]] = Array(Array(2, 1, 4, 3))

scala> rdd.sortBy(x=>x, true).collect
res25: Array[Int] = Array(1, 2, 3, 4)

scala> rdd.sortBy(x=>x, false).collect
res26: Array[Int] = Array(4, 3, 2, 1)

Transformations on two Value-type RDDs

cartesian: Cartesian product

Takes another RDD as its argument and returns the Cartesian product of all elements of the two RDDs.

# Return the Cartesian product of this RDD and another one, 
# that is, the RDD of all pairs of elements (a, b) where a is in this and b is in other.
def cartesian[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]


scala> val rdd1 = sc.parallelize(List("a", "b", "c"),1)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[27] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(List(1,2,3), 1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[28] at parallelize at <console>:24

scala> rdd1.cartesian(rdd2).collect
res27: Array[(String, Int)] = Array((a,1), (a,2), (a,3), (b,1), (b,2), (b,3), (c,1), (c,2), (c,3))

subtract: set difference

Takes another RDD as its argument and returns the difference between the original RDD and the argument RDD, i.e. a new RDD made of the elements that are in the original RDD but not in the argument RDD. The numPartitions parameter specifies the number of partitions of the new RDD.

# Return an RDD with the elements from this that are not in other.
def subtract(other: RDD[T], numPartitions: Int): RDD[T]


scala> val rdd1 = sc.parallelize(0 to 5, 1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24

scala> val rdd2 = sc.parallelize(0 to 2,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24

scala> rdd1.subtract(rdd2).collect
res28: Array[Int] = Array(3, 4, 5)


union: set union

Returns the union of the original RDD and another RDD.

#  Return the union of this RDD and another one.
def union(other: RDD[T]): RDD[T]

# Return the union of this RDD and another one.
def ++(other: RDD[T]): RDD[T]

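A minimal sketch of union, assuming a spark-shell session where sc is the SparkContext:

val rdd1 = sc.parallelize(List(1, 2, 3), 1)
val rdd2 = sc.parallelize(List(3, 4, 5), 1)

// union simply concatenates the two RDDs; duplicates are NOT removed.
val u = rdd1.union(rdd2)   // equivalent to rdd1 ++ rdd2
u.collect                  // Array(1, 2, 3, 3, 4, 5)
u.distinct.collect         // elements 1 to 5 (order may vary) if a true set union is wanted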

zip: pair up two RDDs

Pairs the elements of the two RDDs in order, using the elements of the original RDD as keys and the elements of the other RDD as values, and returns a new RDD made of these Key/Value pairs.

 

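A minimal sketch of zip, assuming a spark-shell session where sc is the SparkContext. Note that both RDDs must have the same number of partitions and the same number of elements in each partition:

val keysRDD   = sc.parallelize(List("a", "b", "c"), 1)
val valuesRDD = sc.parallelize(List(1, 2, 3), 1)

// Elements are paired up by position: the i-th element of keysRDD
// becomes the key of the i-th element of valuesRDD.
keysRDD.zip(valuesRDD).collect   // Array((a,1), (b,2), (c,3))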

Transformations on Key/Value-type RDDs

Transformations on a single Key-Value RDD

combineByKey: aggregate by key

For each key, createCombiner turns the first value encountered into an initial accumulator of type C, mergeValue folds further values of that key into the accumulator within a partition, and mergeCombiners merges accumulators produced by different partitions.

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]


scala> val pair = sc.parallelize(List(("fruit", "Apple"), ("fruit", "Banana"), ("vegetable", "Cucumber"), ("fruit", "Cherry"), ("vegetable", "Bean"), ("vegetable", "Pepper")),2)
pair: org.apache.spark.rdd.RDD[(String, String)] = ParallelCollectionRDD[41] at parallelize at <console>:24

scala> val combinePair = pair.combineByKey(List(_), (x:List[String], y:String) => y::x, (x:List[String], y:List[String]) => x:::y)
combinePair: org.apache.spark.rdd.RDD[(String, List[String])] = ShuffledRDD[42] at combineByKey at <console>:26

scala> combinePair.collect
res31: Array[(String, List[String])] = Array((fruit,List(Banana, Apple, Cherry)), (vegetable,List(Cucumber, Pepper, Bean)))

flatMapValues: flatMap over the values

# Pass each value in the key-value pair RDD through a flatMap function without changing the keys;
# this also retains the original RDD's partitioning.
def flatMapValues[U](f: (V) => TraversableOnce[U]): RDD[(K, U)]


scala> val rdd = sc.parallelize(List("a", "boy"), 1).keyBy(_.length)
rdd: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[44] at keyBy at <console>:24

scala> rdd.collect
res32: Array[(Int, String)] = Array((1,a), (3,boy))

scala> rdd.flatMapValues(x=>"*" + x + "*").collect
res33: Array[(Int, Char)] = Array((1,*), (1,a), (1,*), (3,*), (3,b), (3,o), (3,y), (3,*))

keys: extract the keys

Extracts the key of each element of a Key/Value-type RDD; all the keys form a sequence that becomes the new RDD.

# Return an RDD with the keys of each tuple.
def keys: RDD[K]


scala> val pairs = sc.parallelize(List("wills", "aprilchang","kris"),1).keyBy(_.length) 
pairs: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[47] at keyBy at <console>:24

scala> pairs.collect
res34: Array[(Int, String)] = Array((5,wills), (10,aprilchang), (4,kris))

scala> pairs.keys.collect
res35: Array[Int] = Array(5, 10, 4)

mapValues: transform the values

Transforms the value of each element of a Key/Value-type RDD with the supplied function f, forming a new RDD; the keys are left unchanged.

# Pass each value in the key-value pair RDD through a map function without changing the keys; 
# this also retains the original RDD's partitioning.
def mapValues[U](f: (V) => U): RDD[(K, U)]

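A minimal sketch of mapValues, assuming a spark-shell session where sc is the SparkContext:

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)), 1)

// Only the values go through the function; the keys (and the partitioning) stay the same.
pairs.mapValues(v => v * 10).collect   // Array((a,10), (b,20), (c,30))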

partitionBy: repartition by key

# Return a copy of the RDD partitioned using the specified partitioner.
def partitionBy(partitioner: Partitioner): RDD[(K, V)]


scala> val pairs = sc.parallelize(0 to 9, 2).keyBy(x=>x)
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = MapPartitionsRDD[51] at keyBy at <console>:24

scala> pairs.collect
res37: Array[(Int, Int)] = Array((0,0), (1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9))

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

scala> val partitionPairs = pairs.partitionBy(new HashPartitioner(2)) // partitioned by the hash of each key
partitionPairs: org.apache.spark.rdd.RDD[(Int, Int)] = ShuffledRDD[52] at partitionBy at <console>:27

scala> partitionPairs.glom.collect
res38: Array[Array[(Int, Int)]] = Array(Array((0,0), (2,2), (4,4), (6,6), (8,8)), Array((1,1), (3,3), (5,5), (7,7), (9,9)))

reduceByKey: reduce the values for each key

# Merge the values for each key using an associative and commutative reduce function.
def reduceByKey(func: (V, V) => V): RDD[(K, V)]

# Merge the values for each key using an associative and commutative reduce function.
# This will also perform the merging locally on each mapper before sending results to a reducer,
# similarly to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]

# Merge the values for each key using an associative and commutative reduce function.
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
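
A minimal sketch of reduceByKey, assuming a spark-shell session where sc is the SparkContext:

val pairs = sc.parallelize(List(("fruit", 1), ("vegetable", 1), ("fruit", 1), ("fruit", 1)), 2)

// Values with the same key are merged with the given function;
// partial merging happens inside each partition before the shuffle.
pairs.reduceByKey(_ + _).collect   // e.g. Array((fruit,3), (vegetable,1)) (order may vary)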

sortByKey: sort by key
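
A minimal sketch of sortByKey, assuming a spark-shell session where sc is the SparkContext:

val pairs = sc.parallelize(List((3, "c"), (1, "a"), (2, "b")), 1)

pairs.sortByKey().collect        // Array((1,a), (2,b), (3,c))
pairs.sortByKey(false).collect   // Array((3,c), (2,b), (1,a))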

values: extract the values to form a new RDD
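
A minimal sketch of values, assuming a spark-shell session where sc is the SparkContext:

val pairs = sc.parallelize(List((1, "a"), (3, "boy")), 1)

// Drop the keys, keep only the values.
pairs.values.collect   // Array(a, boy)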

Transformations on two Key-Value RDDs

cogroup: group the values of both RDDs by key
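
A minimal sketch of cogroup, assuming a spark-shell session where sc is the SparkContext:

val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)), 1)
val rdd2 = sc.parallelize(List(("a", "x"), ("c", "y")), 1)

// For every key in either RDD, cogroup returns the pair of value collections
// (values from rdd1, values from rdd2).
rdd1.cogroup(rdd2).collect
// e.g. Array((a,(CompactBuffer(1, 3),CompactBuffer(x))),
//            (b,(CompactBuffer(2),CompactBuffer())),
//            (c,(CompactBuffer(),CompactBuffer(y))))   (order may vary)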

join: join by key
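
A minimal sketch of join, assuming a spark-shell session where sc is the SparkContext:

val rdd1 = sc.parallelize(List(("a", 1), ("b", 2)), 1)
val rdd2 = sc.parallelize(List(("a", "x"), ("a", "y"), ("c", "z")), 1)

// Inner join: only keys present in both RDDs are kept.
rdd1.join(rdd2).collect   // e.g. Array((a,(1,x)), (a,(1,y))) (order may vary)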

leftOuterJoin: left outer join by key (see the sketch under rightOuterJoin below)

rightOuterJoin: right outer join by key
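
A minimal sketch covering both leftOuterJoin and rightOuterJoin, assuming a spark-shell session where sc is the SparkContext:

val rdd1 = sc.parallelize(List(("a", 1), ("b", 2)), 1)
val rdd2 = sc.parallelize(List(("a", "x"), ("c", "y")), 1)

// Keys missing on the other side get None.
rdd1.leftOuterJoin(rdd2).collect    // e.g. Array((a,(1,Some(x))), (b,(2,None)))   (order may vary)
rdd1.rightOuterJoin(rdd2).collect   // e.g. Array((a,(Some(1),x)), (c,(None,y)))   (order may vary)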

subtractByKey: subtract by key
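
A minimal sketch of subtractByKey, assuming a spark-shell session where sc is the SparkContext:

val rdd1 = sc.parallelize(List(("a", 1), ("b", 2), ("c", 3)), 1)
val rdd2 = sc.parallelize(List(("b", 99)), 1)

// Keep only the pairs whose key does NOT appear in the other RDD.
rdd1.subtractByKey(rdd2).collect   // Array((a,1), (c,3))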