
[Spark Source Code Study] reduceByKey and groupByKey implementations and their relationship to combineByKey

reduceByKey source code:

    def reduceByKey(self, func, numPartitions=None, partitionFunc=portable_hash):
        """
        Merge the values for each key using an associative and commutative reduce function.

        This will also perform the merging locally on each mapper before
        sending results to a reducer, similarly to a "combiner" in MapReduce.

        Output will be partitioned with C{numPartitions} partitions, or
        the default parallelism level if C{numPartitions} is not specified.
        Default partitioner is hash-partition.

        >>> from operator import add
        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.reduceByKey(add).collect())
        [('a', 2), ('b', 1)]
        """
        return self.combineByKey(lambda x: x, func, func, numPartitions, partitionFunc)
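
The last line is the whole implementation: reduceByKey is simply combineByKey with an identity createCombiner and the user-supplied func used as both mergeValue and mergeCombiners. The sketch below (my own illustration, assuming an existing SparkContext named sc) shows the two calls producing the same result:

    from operator import add

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    # reduceByKey(add) ...
    via_reduce = sorted(rdd.reduceByKey(add).collect())

    # ... delegates to combineByKey with an identity createCombiner
    # and add as both mergeValue and mergeCombiners
    via_combine = sorted(rdd.combineByKey(lambda x: x, add, add).collect())

    assert via_reduce == via_combine == [('a', 2), ('b', 1)]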

groupByKey source code:

    def groupByKey(self, numPartitions=None, partitionFunc=portable_hash):
        """
        Group the values for each key in the RDD into a single sequence.
        Hash-partitions the resulting RDD with numPartitions partitions.

        .. note:: If you are grouping in order to perform an aggregation (such as a
            sum or average) over each key, using reduceByKey or aggregateByKey will
            provide much better performance.

        >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
        >>> sorted(rdd.groupByKey().mapValues(len).collect())
        [('a', 2), ('b', 1)]
        >>> sorted(rdd.groupByKey().mapValues(list).collect())
        [('a', [1, 1]), ('b', [1])]
        """
        def createCombiner(x):
            return [x]

        def mergeValue(xs, x):
            xs.append(x)
            return xs

        def mergeCombiners(a, b):
            a.extend(b)
            return a

        memory = self._memory_limit()
        serializer = self._jrdd_deserializer
        agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

        def combine(iterator):
            merger = ExternalMerger(agg, memory * 0.9, serializer)
            merger.mergeValues(iterator)
            return merger.items()

        locally_combined = self.mapPartitions(combine, preservesPartitioning=True)
        shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

        def groupByKey(it):
            merger = ExternalGroupBy(agg, memory, serializer)
            merger.mergeCombiners(it)
            return merger.items()

        return shuffled.mapPartitions(groupByKey, True).mapValues(ResultIterable)
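
Conceptually, groupByKey runs the same local-combine / shuffle / final-merge pipeline as combineByKey, but it hard-codes the three aggregation functions to build per-key lists, and on the reduce side it swaps in ExternalGroupBy (which can spill very large groups to disk) and wraps each group in a ResultIterable. Ignoring that wrapping, its behaviour can be approximated directly with combineByKey; the following is only an illustrative sketch (assuming a SparkContext sc), not the internal implementation:

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

    grouped = rdd.combineByKey(
        lambda v: [v],           # createCombiner: start a list with the first value
        lambda xs, v: xs + [v],  # mergeValue: add a value (builds a new list here, for clarity)
        lambda a, b: a + b)      # mergeCombiners: concatenate lists from different partitions

    assert sorted(grouped.collect()) == [('a', [1, 1]), ('b', [1])]

Note that the real implementation mutates and reuses the first argument (xs.append(x), a.extend(b)) rather than allocating new lists, exactly as combineByKey's docstring permits.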

combineByKey() source code:

    def combineByKey(self, createCombiner, mergeValue, mergeCombiners,
                     numPartitions=None, partitionFunc=portable_hash):
        """
        Generic function to combine the elements for each key using a custom
        set of aggregation functions.

        Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined
        type" C.

        Users provide three functions:

            - C{createCombiner}, which turns a V into a C (e.g., creates
              a one-element list)
            - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
              a list)
            - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
              the lists)

        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
        modify and return their first argument instead of creating a new C.

        In addition, users can control the partitioning of the output RDD.

        .. note:: V and C can be different -- for example, one might group an RDD of type
            (Int, Int) into an RDD of type (Int, List[Int]).

        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
        >>> def to_list(a):
        ...     return [a]
        ...
        >>> def append(a, b):
        ...     a.append(b)
        ...     return a
        ...
        >>> def extend(a, b):
        ...     a.extend(b)
        ...     return a
        ...
        >>> sorted(x.combineByKey(to_list, append, extend).collect())
        [('a', [1, 2]), ('b', [1])]
        """
        if numPartitions is None:
            numPartitions = self._defaultReducePartitions()

        serializer = self.ctx.serializer
        memory = self._memory_limit()
        agg = Aggregator(createCombiner, mergeValue, mergeCombiners)

        def combineLocally(iterator):
            merger = ExternalMerger(agg, memory * 0.9, serializer)
            merger.mergeValues(iterator)
            return merger.items()

        locally_combined = self.mapPartitions(combineLocally, preservesPartitioning=True)
        shuffled = locally_combined.partitionBy(numPartitions, partitionFunc)

        def _mergeCombiners(iterator):
            merger = ExternalMerger(agg, memory, serializer)
            merger.mergeCombiners(iterator)
            return merger.items()

        return shuffled.mapPartitions(_mergeCombiners, preservesPartitioning=True)
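
reduceByKey and groupByKey are therefore just specialisations of this generic primitive. Its full power shows when the combined type C differs from the value type V; for example, a per-key mean can be computed with a (sum, count) pair as C. This is a minimal sketch of my own (assuming a SparkContext sc), not code from the Spark source:

    rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])

    sum_count = rdd.combineByKey(
        lambda v: (v, 1),                               # createCombiner: V -> C = (sum, count)
        lambda c, v: (c[0] + v, c[1] + 1),              # mergeValue: fold one more V into a C
        lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]))  # mergeCombiners: merge two partial C's

    means = sum_count.mapValues(lambda c: c[0] / c[1])
    assert sorted(means.collect()) == [('a', 1.5), ('b', 1.0)]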

A detailed analysis will follow in a later post.