
Slow job submission with Spark on YARN

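When I submitted a job to the cluster in Spark on YARN mode, the submission ran very slowly and also logged a WARN message.
The job was submitted to the cluster like this: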

  ./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 1G \
  --num-executors 1 \
  /opt/spark-2.3.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.3.0.jar \
  10

But the following warning appeared:

 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

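The log shows that the client is uploading the jars the program depends on, which is what makes submission slow. The official documentation describes the fix: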

To make Spark runtime jars accessible from YARN side, you can specify spark.yarn.archive or spark.yarn.jars. For details please refer to Spark Properties. If neither spark.yarn.archive nor spark.yarn.jars is specified, Spark will create a zip file with all jars under $SPARK_HOME/jars and upload it to the distributed cache.

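Roughly: for the YARN nodes to access Spark's runtime jars, you need to set spark.yarn.archive or spark.yarn.jars. If neither is set, Spark zips up the jars under $SPARK_HOME/jars/ and uploads them to the distributed cache each time a job is submitted.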

Solution: upload the jars that Spark needs at runtime, found under $SPARK_HOME/jars/, to HDFS.

hadoop fs -mkdir /tmp/lib_jars
hadoop fs -put $SPARK_HOME/jars/* /tmp/lib_jars

Then add the following line to the configuration file $SPARK_HOME/conf/spark-defaults.conf:

spark.yarn.jars hdfs://master:9000/tmp/lib_jars/*
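If this setting is applied by a setup script rather than by hand, it helps to make the edit idempotent so that rerunning the script does not append the line twice. The following is only a sketch: `set_spark_conf` is a hypothetical helper name, not part of Spark, and it assumes the whitespace-separated `key value` form used above (a `key=value` line would not be detected).

```shell
# set_spark_conf FILE KEY VALUE
# Hypothetical helper: append "KEY VALUE" to FILE unless a line whose
# first whitespace-separated field equals KEY is already present.
set_spark_conf() {
  conf_file="$1"
  key="$2"
  value="$3"

  # Create the file if it does not exist yet.
  touch "$conf_file" || return 1

  # awk exits 0 when some line has KEY as its first field, 1 otherwise.
  if awk -v k="$key" '$1 == k { found = 1 } END { exit !found }' "$conf_file"; then
    echo "$key already set in $conf_file; leaving it unchanged"
  else
    printf '%s %s\n' "$key" "$value" >> "$conf_file"
  fi
}
```

With the paths from this post, the call would be `set_spark_conf "$SPARK_HOME/conf/spark-defaults.conf" spark.yarn.jars "hdfs://master:9000/tmp/lib_jars/*"`. The value is quoted so that the shell passes the `*` through to the config file instead of expanding it.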

Submit the job again. This time the log shows that the jars already on HDFS are no longer copied:

2018-03-21 22:35:13 INFO  Client:54 - Preparing resources for our AM container
2018-03-21 22:35:16 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://master:9000/tmp/lib_jars/JavaEWAH-0.3.2.jar
2018-03-21 22:35:16 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://master:9000/tmp/lib_jars/RoaringBitmap-0.5.11.jar
2018-03-21 22:35:16 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://master:9000/tmp/lib_jars/ST4-4.0.4.jar
2018-03-21 22:35:16 INFO  Client:54 - Source and destination file systems are the same. Not copying hdfs://master:9000/tmp/lib_jars/activation-1.1.1.jar
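To check the effect without reading the whole log, one option is to count how many resources the client uploaded versus how many it reused. This is a rough sketch: `summarize_staging` is a hypothetical helper, it assumes the spark-submit output was saved to a file first, and it relies on the `Not copying` text shown above plus the `Uploading resource` text that the YARN client logs when it does copy a file.

```shell
# summarize_staging LOG_FILE
# Hypothetical helper: count uploaded vs. reused resources in a saved
# spark-submit log by matching the client's INFO messages.
summarize_staging() {
  log_file="$1"

  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the function usable under "set -e".
  uploaded=$(grep -c 'Uploading resource' "$log_file" || true)
  reused=$(grep -c 'Not copying' "$log_file" || true)

  echo "uploaded=$uploaded reused=$reused"
}
```

After the change, the `reused` count should roughly match the number of jars in /tmp/lib_jars. A few uploads are still expected, since the application jar and the generated configuration archive are staged on every submission.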