The difference between reduceByKey and groupByKey

阿新 • Published: 2019-03-01

First, let's look at the source code of reduceByKey and groupByKey in PairRDDFunctions.scala:

/**
 * Merge the values for each key using an associative reduce function. This will also perform
 * the merging locally on each mapper before sending results to a reducer, similarly to a
 * "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
 * parallelism level.
 */
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = {
  reduceByKey(defaultPartitioner(self), func)
}


/**
 * Group the values for each key in the RDD into a single sequence. Allows controlling the
 * partitioning of the resulting key-value pair RDD by passing a Partitioner.
 * The ordering of elements within each group is not guaranteed, and may even differ
 * each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 *
 * Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
 * key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
 */
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
  // groupByKey shouldn't use map side combine because map side combine does not
  // reduce the amount of data shuffled and requires all map side data be inserted
  // into a hash table, leading to more objects in the old gen.
  val createCombiner = (v: V) => CompactBuffer(v)
  val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
  val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
  val bufs = combineByKey[CompactBuffer[V]](
    createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine=false)
  bufs.asInstanceOf[RDD[(K, Iterable[V])]]
}

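reduceByKey: before results are sent to the reducer, reduceByKey merges the values for each key locally on every mapper, much like a combiner in MapReduce. The benefit is that after this map-side reduce, each mapper sends at most one record per key. When keys repeat often within a partition, the amount of shuffled data drops sharply, which cuts network transfer and lets the reduce side compute the final result faster.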

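groupByKey: groupByKey collects the values for each key in the RDD into a single sequence (an Iterable). The grouping happens on the reduce side with map-side combine disabled, so every key-value pair has to be sent over the network. If the goal is only an aggregation, that transfer is wasted. In addition, as the source comment notes, all the values for a key must fit in memory, so a key with too many values can cause an OutOfMemoryError.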
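The effect of the map-side combine on shuffle volume can be illustrated without Spark at all. The following is a minimal sketch using plain Scala collections: it is not how Spark implements the shuffle, and `ShuffleSketch`, `withMapSideCombine`, `reduceSide`, and the sample data are all hypothetical names made up for illustration.

```scala
// Plain-Scala simulation (no Spark): count the records that would cross the
// shuffle with and without a map-side combine. All names are hypothetical.
object ShuffleSketch {
  type Pair = (String, Int)

  // Two "mapper" partitions of (word, 1) pairs.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq("a" -> 1, "b" -> 1, "a" -> 1, "a" -> 1),
    Seq("b" -> 1, "b" -> 1, "a" -> 1, "c" -> 1)
  )

  // reduceByKey-style: merge the values per key inside each partition first,
  // so each partition emits at most one record per key.
  def withMapSideCombine(parts: Seq[Seq[Pair]], f: (Int, Int) => Int): Seq[Seq[Pair]] =
    parts.map { part =>
      part.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }.toSeq
    }

  // Reduce side: merge everything that arrived for each key.
  def reduceSide(shuffled: Seq[Seq[Pair]], f: (Int, Int) => Int): Map[String, Int] =
    shuffled.flatten.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  def main(args: Array[String]): Unit = {
    val sum = (x: Int, y: Int) => x + y
    val combined = withMapSideCombine(partitions, sum)
    println(s"records shuffled without combine: ${partitions.map(_.size).sum}")
    println(s"records shuffled with combine:    ${combined.map(_.size).sum}")
    println(s"same result either way: ${reduceSide(partitions, sum) == reduceSide(combined, sum)}")
  }
}
```

With this sample data, 8 records cross the simulated shuffle without the combine and 5 with it, while the per-key sums are identical. The saving grows with the number of repeated keys per partition; if every key were unique within its partition, the combine would save nothing.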

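From this comparison, reduceByKey is the recommended choice when running a reduce over a large amount of data. It is not only faster, it also avoids the memory problems that groupByKey can cause.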
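In application code, the recommendation might look like the following. This is a minimal sketch, assuming Spark is on the classpath and the job runs in local mode; the object name `SumPerKey`, the app name, and the sample data are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SumPerKey {
  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for a quick experiment.
    val conf = new SparkConf().setAppName("SumPerKey").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1, "c" -> 1, "b" -> 1))

      // Preferred: values are merged on the map side before the shuffle.
      val viaReduce = pairs.reduceByKey(_ + _)

      // Same sums, but every (key, value) pair is shuffled before summing.
      val viaGroup = pairs.groupByKey().mapValues(_.sum)

      println(viaReduce.collect().sorted.mkString(", "))
      println(viaGroup.collect().sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Both pipelines produce the same per-key sums; they differ only in how much data crosses the shuffle. groupByKey remains the right tool when the full list of values per key is actually needed rather than an aggregate.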
