Introduction to Spark on YARN and Running wordcount (master, slave1, and slave2) (Author's Pick)

阿新 • Published: 2019-01-13


Spark on YARN Mode

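This is a very promising deployment mode. Limited by YARN's own development, however, it currently supports only coarse-grained mode. The reason is that the resources of a YARN Container cannot be resized dynamically: once a Container has started, the resources available to it can no longer change. Support for this is already on the YARN roadmap.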

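Spark on YARN supports two modes:
  1) yarn-cluster: suited to production environments;
  2) yarn-client: suited to interactive use and debugging, when you want to see the app's output immediately.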

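The difference between yarn-cluster and yarn-client lies in the YARN ApplicationMaster. Each YARN application instance has one ApplicationMaster process, which runs in the first container started for the application. It requests resources from the ResourceManager and, once they are granted, tells the NodeManagers to launch containers on its behalf. Internally, the two modes are implemented quite differently. For production, choose yarn-cluster; if you only want to debug a program, yarn-client is fine.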

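YARN Overview

What is YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is a new Hadoop resource manager. It is a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it. Its introduction brought major benefits to clusters in utilization, unified resource management, and data sharing.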

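YARN's Place in the Hadoop Ecosystem

Background of YARN

With the rapid growth of the Internet, data volumes have surged, and MapReduce, a disk-based offline computing framework, can no longer meet application requirements. New computing frameworks have emerged to handle different scenarios, including in-memory, streaming, and iterative computing frameworks, but MRv1 cannot support multiple computing frameworks side by side.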

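YARN Basic Architecture

ResourceManager(RM)

The ResourceManager is responsible for unified management and scheduling of cluster resources, taking over the resource-management side of the JobTracker's role. There is only "one" in the whole cluster. In summary, the RM does the following:

1. Handles client requests

2. Starts and monitors ApplicationMasters

3. Monitors NodeManagers

4. Allocates and schedules resources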

NodeManager(NM)

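The NodeManager manages each node in the YARN cluster. It provides per-node services, from overseeing the lifetime of a container to monitoring resources and tracking node health. MRv1 managed the execution of Map and Reduce tasks through slots, whereas the NodeManager manages abstract containers, which represent the per-node resources available to a particular application. The NM does the following:

1. Manages the resources on a single node

2. Handles commands from the ResourceManager

3. Handles commands from the ApplicationMaster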

ApplicationMaster(AM)

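There is one per application, and it is responsible for managing that application. The ApplicationMaster negotiates resources from the ResourceManager and, through the NodeManagers, monitors container execution and resource usage (allocation of CPU, memory, and so on). Note that although today's resources are the traditional ones (CPU cores, memory), new resource types (such as graphics processing units or specialized processing devices) are expected to be supported in the future. The AM does the following:

1. Splits the input data

2. Requests resources for the application and assigns them to its internal tasks

3. Monitors tasks and handles fault tolerance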

Container

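A Container is YARN's resource abstraction. It encapsulates multi-dimensional resources on a node, such as memory, CPU, disk, and network. When the AM requests resources from the RM, the resources the RM returns are expressed as Containers. YARN assigns a Container to each task, and the task can use only the resources described in that Container.

A Container does the following:

It abstracts the task's runtime environment, encapsulating multi-dimensional resources such as CPU and memory together with task-related information such as environment variables and the launch command.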
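To make the Container description concrete, here is a rough sketch of how much memory YARN might reserve for one Spark executor container when a job is submitted with --executor-memory 1g. It rests on two assumptions that are not stated in this post: that the executor memory overhead is the larger of 384 MB and 10% of executor memory (the documented default for this Spark version line), and that the scheduler rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb (1024 MB by default). Both can be configured differently on a real cluster, and the function name container_mb is made up for illustration.

```shell
# Estimate the container size (MB) requested for one executor.
# Assumed, not from the article: overhead = max(384, 10% of executor memory),
# and requests are rounded up to a multiple of the minimum allocation.
container_mb() {
  executor_mb=$1
  min_alloc_mb=${2:-1024}          # assumed yarn.scheduler.minimum-allocation-mb
  overhead_mb=$(( executor_mb / 10 ))
  if [ "$overhead_mb" -lt 384 ]; then overhead_mb=384; fi
  request_mb=$(( executor_mb + overhead_mb ))
  # Round up to the next multiple of min_alloc_mb.
  echo $(( (request_mb + min_alloc_mb - 1) / min_alloc_mb * min_alloc_mb ))
}

container_mb 1024        # 1g executor, 1024 MB step: prints 2048
container_mb 4096 512    # 4g executor, 512 MB step: prints 4608
```

Under these assumptions a "1g" executor occupies a 2 GB container, which is worth keeping in mind when sizing --num-executors against the memory of each NodeManager.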

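Spark on YARN Runtime Architecture

Review of the basic Spark workflow

The SparkContext is the main entry point of a program. During SparkContext initialization, Spark creates two levels of scheduling modules: the DAGScheduler for job scheduling and the TaskScheduler for task scheduling. The job scheduling module is the high-level, stage-oriented scheduler: for each Spark job it computes multiple stages with dependencies between them (usually split at shuffle boundaries), then builds a concrete set of tasks for each stage (usually taking data locality into account) and submits them as TaskSets to the task scheduling module for execution. The task scheduling module is responsible for launching tasks and for monitoring and reporting on their execution.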

YARN standalone/YARN cluster

YARN standalone is the name used in version 0.9 and earlier; from 1.0 on it is called YARN cluster.

In yarn-cluster (YarnClusterScheduler), the Driver and the AM run together, and the Client is separate.

./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] [app options]


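The Spark Driver is first launched as an ApplicationMaster in the YARN cluster. Every job the client submits to the ResourceManager is assigned its own ApplicationMaster on a worker node of the cluster, and that ApplicationMaster manages the application for its entire lifecycle. Because the Driver runs inside YARN, there is no need to start a Spark Master/Client beforehand, and the application's output cannot be displayed on the client (it can be viewed in the history server).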
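Since the driver output of a yarn-cluster application does not reach the submitting terminal, one common way to look at it besides the history server is the YARN command line. The sketch below only prints the commands rather than running them. It assumes log aggregation is enabled on the cluster (otherwise yarn logs returns nothing useful), and the application ID and the function name are hypothetical placeholders.

```shell
# Print the commands for inspecting a finished yarn-cluster application.
# The application ID passed in below is a made-up placeholder.
inspect_cmds() {
  app_id=$1
  echo "yarn application -status $app_id"      # final state and tracking URL
  echo "yarn logs -applicationId $app_id"      # aggregated container logs, driver included
}

inspect_cmds application_1490000000000_0001
```

The real application ID is reported by spark-submit while the job is being submitted.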


YARN client

　　yarn-client(YarnClientClusterScheduler)

The Client and the Driver run together (locally); the AM is used only to manage resources.

./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode client [options] [app options]


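In yarn-client mode, the Driver runs on the Client and obtains resources from the RM through the ApplicationMaster. The local Driver interacts with all executor containers and aggregates the final results. Closing the terminal is equivalent to killing the Spark application. Generally, use this mode when the results only need to be returned to the terminal.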

How to Choose

If you need data returned to the client, use YARN client mode.

If the data is saved to HDFS, YARN cluster mode is recommended.
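The rule of thumb above can be written down as a tiny helper. This is only an illustration: the function name and the labels "terminal" and "hdfs" are invented here, and the final line prints a command template instead of submitting anything.

```shell
# Map "where should the result end up" to a deploy mode.
# terminal -> client (driver runs locally), hdfs -> cluster (driver runs in YARN).
choose_deploy_mode() {
  case "$1" in
    terminal) echo client ;;
    hdfs)     echo cluster ;;
    *)        echo "unknown result target: $1" >&2; return 1 ;;
  esac
}

mode=$(choose_deploy_mode hdfs)
# Print the resulting command template instead of running it.
echo "spark-submit --class path.to.your.Class --master yarn --deploy-mode $mode [options] <app jar> [app options]"
```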

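Other Configuration and Notes

How to change the default configuration

spark_home/conf/spark-defaults.conf: every app uses the settings in this file when it is submitted

--conf PROP=VALUE: sets a custom parameter for an individual app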
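As a small illustration of the two levels, the sketch below writes a sample spark-defaults.conf into a scratch directory and then prints a submit command that overrides one value for a single app. The property names are real Spark settings, but the values, the scratch location, and the helper default_of are illustrative assumptions, not the configuration used in this post.

```shell
# Write a sample spark-defaults.conf into a scratch directory (illustrative values).
conf_dir=$(mktemp -d)
cat > "$conf_dir/spark-defaults.conf" <<'EOF'
spark.driver.memory      1g
spark.executor.memory    1g
spark.executor.cores     1
EOF

# Read the cluster-wide default for a key: second field of the matching line.
default_of() {
  awk -v key="$1" '$1 == key { print $2 }' "$conf_dir/spark-defaults.conf"
}

echo "default executor memory: $(default_of spark.executor.memory)"
# Per-app override: a --conf given on the command line wins over the file.
echo "spark-submit --conf spark.executor.memory=2g --class path.to.your.Class <app jar>"
```

Settings made in the application code through SparkConf take precedence over both.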

Environment variables

spark_home/conf/spark-defaults.conf: every app uses the settings in this file when it is submitted

spark.yarn.appMasterEnv.[EnvironmentVariableName]
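For example, the property pattern above can also be passed per app with --conf. The helper below turns NAME=VALUE pairs into such arguments. The variable names, the JDK path, and the function name are hypothetical and only show the shape of the setting.

```shell
# Turn NAME=VALUE pairs into --conf arguments that set environment
# variables for the YARN ApplicationMaster process.
am_env_conf() {
  for pair in "$@"; do
    name=${pair%%=*}     # text before the first '='
    value=${pair#*=}     # text after the first '='
    printf '%s spark.yarn.appMasterEnv.%s=%s\n' --conf "$name" "$value"
  done
}

# Hypothetical variables, for illustration only.
am_env_conf JAVA_HOME=/usr/lib/jvm/java-8 MY_FLAG=on
```

In yarn-cluster mode the driver runs inside the ApplicationMaster, so these variables are visible to the driver as well.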

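Related configuration

Important notes

In cluster mode, yarn.nodemanager.local-dirs applies to both the Spark executors and the Spark driver, and spark.local.dir is ignored.

In client mode, the Spark executors use yarn.nodemanager.local-dirs, and the Spark driver uses spark.local.dir.

--files and --archives accept a # suffix (for example localtest.txt#appSees.txt): the file is uploaded to HDFS and the application refers to it by the name after the #.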

　　--jars
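The # syntax for --files and --archives is easy to misread, so here is a small sketch that splits an entry into the path that gets uploaded and the name the application sees. The file paths and the function name are made up for illustration.

```shell
# Split a --files entry of the form path#alias into its two parts.
files_alias() {
  entry=$1
  path=${entry%%#*}          # part before the first '#'
  link_name=${entry#*#}      # part after the first '#'
  # Without a '#', the file keeps its own base name.
  if [ "$entry" = "$path" ]; then link_name=$(basename "$path"); fi
  echo "$path -> $link_name"
}

files_alias /home/spark/conf/app.properties#settings.properties
files_alias /home/spark/conf/app.properties
```

With the first entry, code running in the containers would open settings.properties from its working directory.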

Running spark-shell on YARN (this is Spark on YARN mode)

(Covers YARN client and YARN cluster) (supplementary)

Log in to the machine where Spark is installed

bin/spark-shell --master yarn-client

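or

bin/spark-shell --master yarn

You can add other options as well, for example to control memory. With --master yarn alone, the deploy mode defaults to client. This is straightforward, so it is not covered in detail here.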
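As one possible shape for "other options", spark-shell accepts the same resource flags as spark-submit. The helper below only prints a command line. The memory values and the function name are assumptions chosen for illustration.

```shell
# Print a spark-shell command line with explicit memory settings.
spark_shell_cmd() {
  driver_mem=${1:-1g}
  executor_mem=${2:-1g}
  echo "bin/spark-shell --master yarn --deploy-mode client" \
       "--driver-memory $driver_mem --executor-memory $executor_mem --executor-cores 1"
}

spark_shell_cmd 1g 2g
```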

$ bin/spark-shell --master yarn-client
17/03/29 22:40:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/29 22:40:04 INFO spark.SecurityManager: Changing view acls to: spark
17/03/29 22:40:04 INFO spark.SecurityManager: Changing modify acls to: spark
17/03/29 22:40:04 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)
17/03/29 22:40:05 INFO spark.HttpServer: Starting HTTP Server
17/03/29 22:40:06 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/03/29 22:40:06 INFO server.AbstractConnector: Started SocketConnector@0.0.0.0:35692
17/03/29 22:40:06 INFO util.Utils: Successfully started service 'HTTP class server' on port 35692.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)

Submitting Spark Jobs

1. Submitting a Spark job in yarn-client mode

Create a script file in the /usr/local/spark directory

vi spark_pi.sh

$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.JavaSparkPi \
--master yarn-client \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar

chmod 777 spark_pi.sh
./spark_pi.sh

or

$ $SPARK_HOME/bin/spark-submit  \
> --class org.apache.spark.examples.JavaSparkPi \
> --master yarn-client \
> --num-executors 1 \
> --driver-memory 1g \
> --executor-memory 1g \
> --executor-cores 1 \
>  $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar

2. Submitting a Spark job in yarn-cluster mode

Create a script file in the /usr/local/spark directory

vi spark_pi.sh

$SPARK_HOME/bin/spark-submit \
--class org.apache.spark.examples.JavaSparkPi \
--master yarn-cluster \
--num-executors 1 \
--driver-memory 1g \
--executor-memory 1g \
--executor-cores 1 \
$SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar

 chmod 777 spark_pi.sh
 ./spark_pi.sh

or

$ $SPARK_HOME/bin/spark-submit  \
> --class org.apache.spark.examples.JavaSparkPi \
> --master yarn-cluster \
> --num-executors 1 \
> --driver-memory 1g \
> --executor-memory 1g \
> --executor-cores 1 \
>  $SPARK_HOME/lib/spark-examples-1.6.1-hadoop2.6.0.jar

Running wordcount on Spark Standalone and on Spark on YARN (for comparison)

1. Running wordcount on Spark on YARN

● wordcount code

● Package the project with mvn and upload it to the Spark cluster.

● Submit the job to the Spark cluster
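Before submitting, it can help to know what counts to expect from a small input file. The pipeline below is a local stand-in, not the MyScalaWordCount or MyJavaWordCount code from this post (which is not reproduced here). It assumes words are separated by whitespace, and the sample file and function name are invented for the example.

```shell
# Count words in a local file: one word per line, drop blanks, count duplicates.
local_wordcount() {
  tr -s '[:space:]' '\n' < "$1" | grep -v '^$' | sort | uniq -c |
    awk '{ print $2 "\t" $1 }'     # reorder to word<TAB>count
}

sample=$(mktemp)
printf 'hello spark\nhello yarn\n' > "$sample"
local_wordcount "$sample"
# prints:
# hello   2
# spark   1
# yarn    1
```

If the cluster job tokenizes differently (for example on single spaces only), the counts may not match exactly.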

$ $HADOOP_HOME/bin/hadoop fs -mkdir -p hdfs://master:9000/testspark/inputData/wordcount

$ mkdir -p /home/spark/testspark/inputData/wordcount
$ $HADOOP_HOME/bin/hadoop fs -copyFromLocal /home/spark/testspark/inputData/wordcount/wc.txt  hdfs://master:9000/testspark/inputData/wordcount/

Uploading mySpark-1.0-SNAPSHOT.jar to /home/spark/testspark is omitted here.

$ $SPARK_HOME/bin/spark-submit \
 --master yarn-client \
 --name scalawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyScalaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyScalaWordCount


or

$ $SPARK_HOME/bin/spark-submit \
 --master yarn \
 --deploy-mode client \
 --name scalawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyScalaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyScalaWordCount

$ $SPARK_HOME/bin/spark-submit \
 --master yarn-cluster \
 --name scalawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyScalaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyScalaWordCount



or

$ $SPARK_HOME/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --name scalawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyScalaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyScalaWordCount

$ $SPARK_HOME/bin/spark-submit \
 --master yarn-client \
 --name javawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyJavaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyJavaWordCount




or

$ $SPARK_HOME/bin/spark-submit \
 --master yarn \
 --deploy-mode client \
 --name javawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyJavaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyJavaWordCount

$ $SPARK_HOME/bin/spark-submit \
 --master yarn-cluster \
 --name javawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyJavaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyJavaWordCount





or

$ $SPARK_HOME/bin/spark-submit \
 --master yarn \
 --deploy-mode cluster \
 --name javawordcount \
 --num-executors 1 \
 --driver-memory 1g \
 --executor-memory 1g \
 --executor-cores 1 \
 --class zhouls.bigdata.MyJavaWordCount \
 /home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
 hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
 hdfs://master:9000/testspark/outData/MyJavaWordCount
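After a run, the results sit in the output directory given as the last argument. The sketch below prints the HDFS commands one might use around a run. It assumes the job writes its result as part-* text files and refuses to overwrite an existing output directory, which is typical when a job saves with saveAsTextFile but is not confirmed by this post. The function name is illustrative.

```shell
# Print the HDFS commands for cleaning up before a rerun and reading the result.
hdfs_output_cmds() {
  out_dir=$1
  echo "hadoop fs -rm -r -f $out_dir"      # clear old output before resubmitting
  echo "hadoop fs -ls $out_dir"            # list the part-* files after the run
  echo "hadoop fs -cat $out_dir/part-*"    # show the word counts
}

hdfs_output_cmds hdfs://master:9000/testspark/outData/MyScalaWordCount
```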

2. Running wordcount on Spark Standalone

● wordcount code

● Package the project with mvn and upload it to the Spark cluster.

● Submit the job to the Spark cluster

$SPARK_HOME/bin/spark-submit \
--master spark://master:7077 \
--class zhouls.bigdata.MyScalaWordCount \
/home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
hdfs://master:9000/testspark/outData/MyScalaWordCount

or

$SPARK_HOME/bin/spark-submit \
--master spark://master:7077 \
--class zhouls.bigdata.MyJavaWordCount \
/home/spark/testspark/mySpark-1.0-SNAPSHOT.jar \
hdfs://master:9000/testspark/inputData/wordcount/wc.txt \
hdfs://master:9000/testspark/outData/MyJavaWordCount
