
The Difference Between repartition and partitionBy in Spark


Both repartition and partitionBy repartition the data, and both default to the HashPartitioner. The difference is that partitionBy is only available on a PairRDD; yet when the two are applied to the same PairRDD, the results are not the same:

[Figure: partition contents of the same PairRDD after repartition vs. partitionBy]
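Since the original screenshot no longer renders, here is a minimal sketch that reproduces the comparison. The sample data, the local[2] master, and the printByPartition helper are my own illustrative choices, not from the original post:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RepartitionVsPartitionBy {
  // Tag every element with the index of the partition it ended up in.
  def printByPartition(rdd: RDD[(Int, String)]): Unit =
    rdd.mapPartitionsWithIndex((idx, it) => it.map(kv => (idx, kv)))
      .collect().foreach(println)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
    val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d")), 2)

    // repartition: records are scattered by a synthetic key, so the same
    // key can land in a different partition from run to run.
    printByPartition(pairs.repartition(4))

    // partitionBy: each record goes to the partition given by the key's
    // hash modulo 4, so a given key always maps to the same partition.
    printByPartition(pairs.partitionBy(new HashPartitioner(4)))

    sc.stop()
  }
}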

It is not hard to see that partitionBy's result is the one we actually expect. Let's open the source of repartition to see why:

/**
   * Return a new RDD that has exactly numPartitions partitions.
   *
   * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
   * a shuffle to redistribute data.
   *
   * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
   * which can avoid performing a shuffle.
   *
   * TODO Fix the Shuffle+Repartition data loss issue described in SPARK-23207.
   */
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

/**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   *
   * This results in a narrow dependency, e.g. if you go from 1000 partitions
   * to 100 partitions, there will not be a shuffle, instead each of the 100
   * new partitions will claim 10 of the current partitions. If a larger number
   * of partitions is requested, it will stay at the current number of partitions.
   *
   * However, if you're doing a drastic coalesce, e.g. to numPartitions = 1,
   * this may result in your computation taking place on fewer nodes than
   * you like (e.g. one node in the case of numPartitions = 1). To avoid this,
   * you can pass shuffle = true. This will add a shuffle step, but means the
   * current upstream partitions will be executed in parallel (per whatever
   * the current partitioning is).
   *
   * @note With shuffle = true, you can actually coalesce to a larger number
   * of partitions. This is useful if you have a small number of partitions,
   * say 100, potentially with a few partitions being abnormally large. Calling
   * coalesce(1000, shuffle = true) will result in 1000 partitions with the
   * data distributed using a hash partitioner. The optional partition coalescer
   * passed in must be serializable.
   */
def coalesce(numPartitions: Int, shuffle: Boolean = false,
             partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
            (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  require(numPartitions > 0, s"Number of partitions ($numPartitions) must be positive.")
  if (shuffle) {
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]

    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
        new HashPartitioner(numPartitions)),
      numPartitions,
      partitionCoalescer).values
  } else {
    new CoalescedRDD(this, numPartitions, partitionCoalescer)
  }
}

Even on a PairRDD, repartition does not use the elements' own keys: it uses a randomly seeded, incrementing number as the shuffle key instead of the original key!
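To make the key generation concrete, here is a standalone sketch of what distributePartition computes for one input partition; the sample items and names are illustrative only:

import scala.util.Random
import scala.util.hashing

val numPartitions = 4
val index = 0 // index of the input partition being processed
val items = Seq(("k1", 1), ("k1", 2), ("k2", 3))

// Start at a random position seeded by the partition index, then increment.
var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
val keyed = items.map { t =>
  position += 1
  (position, t) // the synthetic Int, not t._1, becomes the shuffle key
}
// HashPartitioner then sends each pair to partition (key modulo numPartitions),
// so the two ("k1", ...) records can end up in different partitions.
println(keyed)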
