Spark Performance Tuning: Using Broadcast Variables

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

 

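The text above comes from the official documentation. In short, it says:

         Broadcast variables let a program cache a read-only variable on each machine instead of shipping a copy of that variable with every task.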

 

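What are the benefits of using broadcast variables?

Take 50 executors, 1000 tasks, and a 10 MB map as an example to show the difference between using and not using a broadcast variable.

By default, 1000 tasks mean 1000 copies of the map: about 10 GB (10,000 MB) of data goes over the network and takes up about 10 GB of memory across the cluster. With a broadcast variable, 50 executors mean only 50 copies: 500 MB of data goes over the network and 500 MB of memory is consumed. The copies also do not all have to come from the driver. An executor can pull its copy from the BlockManager of a nearby executor instead, which makes the transfer much faster. Comparing 10,000 MB with 500 MB gives a factor of 20: network transfer cost drops by a factor of 20 or more, and memory consumption drops by a factor of 20.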
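The arithmetic behind the example above can be checked with a small back-of-the-envelope sketch. It only models the number of copies multiplied by the size of the map, and it does not touch Spark at all. The class name `BroadcastCost` and its methods are hypothetical, introduced here purely for illustration.

```java
// Hypothetical helper for the 50-executor / 1000-task / 10 MB example.
// It models "copies x size" only; real Spark overheads are not included.
final class BroadcastCost {

    // Total megabytes transferred (and held in memory) when `copies`
    // copies of a variable of `sizeMb` megabytes exist in the cluster.
    static long totalMb(long copies, long sizeMb) {
        return copies * sizeMb;
    }

    // Ratio between one copy per task and one copy per executor.
    static long reductionFactor(long tasks, long executors, long sizeMb) {
        return totalMb(tasks, sizeMb) / totalMb(executors, sizeMb);
    }

    public static void main(String[] args) {
        long tasks = 1000;
        long executors = 50;
        long mapSizeMb = 10;

        // Without a broadcast variable: one copy per task.
        System.out.println("Per-task copies:     " + totalMb(tasks, mapSizeMb) + " MB");
        // With a broadcast variable: one copy per executor.
        System.out.println("Per-executor copies: " + totalMb(executors, mapSizeMb) + " MB");
        System.out.println("Reduction factor:    "
                + reductionFactor(tasks, executors, mapSizeMb) + "x");
    }
}
```

With the numbers from the example, this gives 10000 MB for per-task copies, 500 MB for per-executor copies, and a reduction factor of 20. Changing `tasks` or `executors` shows how the saving scales with the ratio of tasks to executors.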

How do you use broadcast variables (excerpted from the official documentation)?

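import org.apache.spark.broadcast.Broadcast;

// Broadcast the data you want to share (sc is an existing JavaSparkContext)
Broadcast<int[]> broadcastVar = sc.broadcast(new int[] {1, 2, 3});


// Read the broadcast data through the broadcast variable's value() method
broadcastVar.value();
// returns [1, 2, 3]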