
Running Spark on YARN

Support for running on YARN (Hadoop NextGen) was added to Spark in version 0.6.0, and improved in subsequent releases.


Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. If the configuration references Java system properties or environment variables not managed by YARN, they should also be set in the Spark application’s configuration (driver, executors, and the AM when running in client mode).

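For example, a submission environment might look like the following; the configuration path here is only an assumption, so substitute the directory that actually holds your cluster's client-side Hadoop configs:
$ export HADOOP_CONF_DIR=/etc/hadoop/conf
$ ./bin/spark-submit --master yarn --deploy-mode cluster [options] <app jar> [app options]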

There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.


Unlike Spark standalone and Mesos modes, in which the master’s address is specified in the --master parameter, in YARN mode the ResourceManager’s address is picked up from the Hadoop configuration. Thus, the --master parameter is yarn.


To launch a Spark application in cluster mode:

$ ./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]

For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    lib/spark-examples*.jar \
    10

The above starts a YARN client program which starts the default Application Master. Then SparkPi will be run as a child thread of Application Master. The client will periodically poll the Application Master for status updates and display them in the console. The client will exit once your application has finished running. Refer to the “Debugging your Application” section below for how to see driver and executor logs.


To launch a Spark application in client mode, do the same, but replace cluster with client. The following shows how you can run spark-shell in client mode:

$ ./bin/spark-shell --master yarn --deploy-mode client

Adding Other JARs

In cluster mode, the driver runs on a different machine than the client, so SparkContext.addJar won’t work out of the box with files that are local to the client. To make files on the client available to SparkContext.addJar, include them with the --jars option in the launch command.
$ ./bin/spark-submit --class my.main.Class \
    --master yarn \
    --deploy-mode cluster \
    --jars my-other-jar.jar,my-other-other-jar.jar \
    my-main-jar.jar \
    app_arg1 app_arg2

Preparations

Running Spark on YARN requires a binary distribution of Spark which is built with YARN support. Binary distributions can be downloaded from the downloads page of the project website. To build Spark yourself, refer to Building Spark.


To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

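As a rough sketch, one way to pre-stage the runtime jars is to pack $SPARK_HOME/jars into an archive, upload it to HDFS, and point spark.yarn.archive at it; the HDFS path below is only an assumption:
$ jar cv0f spark-libs.jar -C $SPARK_HOME/jars/ .
$ hadoop fs -mkdir -p /spark/jars
$ hadoop fs -put spark-libs.jar /spark/jars/
$ ./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.archive=hdfs:///spark/jars/spark-libs.jar \
    --class path.to.your.Class <app jar> [app options]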

Configuration

Most of the configs are the same for Spark on YARN as for other deployment modes. See the configuration page for more information on those. These are configs that are specific to Spark on YARN.
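
For instance, YARN-specific properties such as spark.yarn.queue or spark.executor.instances can be passed with --conf at submit time or placed in conf/spark-defaults.conf; the values below are illustrative only:
$ ./bin/spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.queue=thequeue \
    --conf spark.executor.instances=4 \
    --class path.to.your.Class <app jar> [app options]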

Debugging your Application

In YARN terminology, executors and application masters run inside “containers”. YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the yarn.log-aggregation-enable config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the yarn logs command.
yarn logs -applicationId <app ID>

will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (yarn.nodemanager.remote-app-log-dir and yarn.nodemanager.remote-app-log-dir-suffix). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure yarn.log.server.url in yarn-site.xml properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.

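With aggregation enabled, the aggregated files can also be listed directly in HDFS; the path below assumes the Hadoop defaults (/tmp/logs for yarn.nodemanager.remote-app-log-dir and logs for the suffix), so adjust it to your own configuration:
$ hadoop fs -ls /tmp/logs/<user>/logs/<application id>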

When log aggregation isn't turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab and do not require running the MapReduce history server.

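Without aggregation, a quick way to inspect a container's logs is to log in to the node that ran it and list the per-application subdirectory; the root shown below is just one common layout, as described above:
$ ls $HADOOP_HOME/logs/userlogs/<application id>/<container id>/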

To review per-container launch environment, increase yarn.nodemanager.delete.debug-delay-sec to a large value (e.g. 36000), and then access the application cache through yarn.nodemanager.local-dirs on the nodes on which containers are launched. This directory contains the launch script, JARs, and all environment variables used for launching each container. This process is useful for debugging classpath problems in particular. (Note that enabling this requires admin privileges on cluster settings and a restart of all node managers. Thus, this is not applicable to hosted clusters).

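As a sketch, the retention delay is set in yarn-site.xml like any other YARN property; pick a value large enough for your debugging session:
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>36000</value>
</property>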

To use a custom log4j configuration for the application master or executors, here are the options:


  • upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application. 
  • add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
  • update the $SPARK_CONF_DIR/log4j.properties file and it will be automatically uploaded along with the other configurations. Note that the other 2 options have higher priority than this option if multiple options are specified.

Note that for the first option, both executors and the application master will share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).

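A rough sketch combining the first two options: the properties file is shipped with --files, so it is localized into each container's working directory and can then be referenced there through a relative file: URL (the file name is illustrative; alternatively point -Dlog4j.configuration at a file: URL that already exists on every node):
$ ./bin/spark-submit --master yarn --deploy-mode cluster \
    --files /path/to/custom-log4j.properties \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:custom-log4j.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:custom-log4j.properties" \
    --class path.to.your.Class <app jar> [app options]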

If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use spark.yarn.app.container.log.dir in your log4j.properties. For example, log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log. For streaming applications, configuring RollingFileAppender and setting file location to YARN’s log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN’s log utility.

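An illustrative log4j.properties fragment along these lines; the appender name, sizes, and pattern are arbitrary choices rather than required values:
log4j.rootCategory=INFO, file_appender
log4j.appender.file_appender=org.apache.log4j.RollingFileAppender
log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.file_appender.MaxFileSize=50MB
log4j.appender.file_appender.MaxBackupIndex=10
log4j.appender.file_appender.layout=org.apache.log4j.PatternLayout
log4j.appender.file_appender.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n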

To use a custom metrics.properties for the application master and executors, update the $SPARK_CONF_DIR/metrics.properties file. It will automatically be uploaded with other configurations, so you don’t need to specify it manually with --files.

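For example, a $SPARK_CONF_DIR/metrics.properties that simply enables the console sink (entries adapted from Spark's metrics.properties.template) could look like:
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds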

Important notes

Running in a Secure Cluster


Configuring the External Shuffle Service

Note: this is required when dynamic resource allocation is used.

Launching your application with Apache Oozie

Troubleshooting Kerberos