Step by step: deploying a Spark that differs from the CDH version on an existing CDH cluster

The first step, naturally, is to get the Spark source. Find the source matching your cluster at http://archive.cloudera.com/cdh5/cdh/5/, then compile and package it yourself. For how to compile and package it, see my earlier post:

http://blog.csdn.net/xiao_jun_0820/article/details/44178169
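
As a rough sketch of the packaging step (the Maven profiles and Hadoop version here are assumptions; the linked post covers the details), Spark 1.6 ships a make-distribution.sh script that produces the tarball:

./make-distribution.sh --name custom-spark --tgz \
    -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.1 -Phive -Phive-thriftserver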

When the build finishes you should end up with a tarball named something like spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz (the exact name depends on the version you downloaded).

Then upload it to a node and extract it under /opt. The extracted directory will be spark-1.6.0-cdh5.7.1-bin-custom-spark; that name is too long, so give it a symlink:

ln -s spark-1.6.0-cdh5.7.1-bin-custom-spark spark

Next, cd into /opt/spark/conf and delete the template files there; they are of no use anymore. Now comes the key part: in the conf directory, create two symlinks, yarn-conf and log4j.properties, pointing at the corresponding entries under the CDH Spark configuration directory (by default /etc/spark/conf, unless you have changed it):

ln -s /etc/spark/conf/yarn-conf yarn-conf

ln -s /etc/spark/conf/log4j.properties log4j.properties

Then copy the three files classpath.txt, spark-defaults.conf, and spark-env.sh from /etc/spark/conf into your own Spark conf directory, /opt/spark/conf in this example. In the end /opt/spark/conf contains five entries:
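
For reference, the copy plus a quick sanity check look like this (the brace expansion is just shorthand for the three files):

cp /etc/spark/conf/{classpath.txt,spark-defaults.conf,spark-env.sh} /opt/spark/conf/
ls /opt/spark/conf
# classpath.txt  log4j.properties  spark-defaults.conf  spark-env.sh  yarn-conf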


Edit classpath.txt and find the Spark-related jars in it; there should be two:

/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar
/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/jars/spark-streaming-flume-sink_2.10-1.6.0-cdh5.7.1.jar

The first is the spark yarn shuffle jar (used if dynamic resource allocation is enabled). Your own build ships the same jar under SPARK_HOME/lib, so replace that line with your own path: /opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar. The second is the Spark Streaming Flume sink jar; I don't use it, so I simply delete that line. If you need it, point it at your own jar as well.
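
If you prefer to do the edit non-interactively, a hedged pair of one-liners along these lines should work (adjust the jar names and parcel path to your own versions, and skip the second line if you keep the Flume sink):

sed -i 's#^.*jars/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar$#/opt/spark/lib/spark-1.6.0-cdh5.7.1-yarn-shuffle.jar#' /opt/spark/conf/classpath.txt
sed -i '/spark-streaming-flume-sink/d' /opt/spark/conf/classpath.txt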

Next, edit spark-defaults.conf. The CDH-provided version should look like this:

spark.authenticate=false
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.executorIdleTimeout=60
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.schedulerBacklogTimeout=1
spark.eventLog.enabled=true
#spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7337
spark.eventLog.dir=hdfs://name84:8020/user/spark/applicationHistory
spark.yarn.historyServer.address=http://name84:18088
spark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark/lib/spark-assembly.jar
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/hadoop/lib/native
spark.yarn.config.gatewayPath=/opt/cloudera/parcels
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
spark.master=yarn-client

All you need to change here is the spark.yarn.jar path, pointing it at your own assembly jar:

spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar

Next, edit spark-env.sh and change the original export SPARK_HOME=/opt/cloudera/parcels/CDH-5.7.1-1.cdh5.7.1.p0.11/lib/spark to export SPARK_HOME=/opt/spark.
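
A minimal sketch of that edit as a single command (assuming the export is the only SPARK_HOME line in the file):

sed -i 's#^export SPARK_HOME=.*#export SPARK_HOME=/opt/spark#' /opt/spark/conf/spark-env.sh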

OK, that's all the modifications.

Install your own Spark binary distribution on every machine in the cluster in the same way.
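
If the cluster has more than a handful of nodes, a small loop saves some typing. A minimal sketch, assuming passwordless SSH and placeholder hostnames node01..node03 (replace them with your own); the conf steps above still have to be repeated on each node:

for host in node01 node02 node03; do
    scp spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz ${host}:/opt/
    ssh ${host} 'cd /opt && tar -xzf spark-1.6.0-cdh5.7.1-bin-custom-spark.tgz && ln -s spark-1.6.0-cdh5.7.1-bin-custom-spark spark'
done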

Two points worth noting about the above. Because log4j.properties and the yarn-conf directory are symlinks, configuration changes you make in Cloudera Manager are picked up by your newly installed Spark automatically, so nothing goes stale there. spark-defaults.conf and spark-env.sh, however, are not symlinks. spark-env.sh rarely changes, so that hardly matters, but spark-defaults.conf hard-codes some settings, and changes made in CM will not be synced to it; you have to update those by hand. For example, if you move the history server to another machine, you must update that setting in your copy yourself.

Making spark-defaults.conf a symlink as well should also work; the only setting changed in it is spark.yarn.jar, and that could presumably be set back at submit time with --conf spark.yarn.jar=xxxx on the spark-submit command line. I haven't tried this yet.
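
An untried sketch of what that override would look like at submit time (the jar path is taken from the spark.yarn.jar line above):

/opt/spark/bin/spark-submit \
    --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly-1.6.0-cdh5.7.1-hadoop2.6.0-cdh5.7.1.jar \
    --class com.kingnet.framework.StreamingRunnerPro \
    --master yarn-client \
    /opt/spark/lib/dm-streaming-pro.jar test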

Now try submitting a job:

/opt/spark/bin/spark-submit \
    --class com.kingnet.framework.StreamingRunnerPro \
    --master yarn-client \
    --num-executors 2 \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    /opt/spark/lib/dm-streaming-pro.jar test

Or in yarn-cluster mode:

/opt/spark/bin/spark-submit \
    --class com.kingnet.framework.StreamingRunnerPro \
    --master yarn-cluster \
    --num-executors 2 \
    --driver-memory 1g \
    --executor-memory 1g \
    --executor-cores 1 \
    hdfs://name84:8020/install/dm-streaming-pro.jar test
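
The yarn-cluster example above reads the application jar from HDFS, so it has to be uploaded there first; a minimal sketch, using the same path:

hdfs dfs -mkdir -p /install
hdfs dfs -put /opt/spark/lib/dm-streaming-pro.jar hdfs://name84:8020/install/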

Reference: http://spark.apache.org/docs/1.6.3/hadoop-provided.html

Using Spark's "Hadoop Free" Build

Spark uses Hadoop client libraries for HDFS and YARN. Starting in version Spark 1.4, the project packages “Hadoop free” builds that lets you more easily connect a single Spark binary to any Hadoop version. To use these builds, you need to modify SPARK_DIST_CLASSPATH to include Hadoop’s package jars. The most convenient place to do this is by adding an entry in conf/spark-env.sh.

This page describes how to connect Spark to Hadoop for different types of distributions.

Apache Hadoop

For Apache distributions, you can use Hadoop’s ‘classpath’ command. For instance:

### in conf/spark-env.sh ###

# If 'hadoop' binary is on your PATH
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

# With explicit path to 'hadoop' binary
export SPARK_DIST_CLASSPATH=$(/path/to/hadoop/bin/hadoop classpath)

# Passing a Hadoop configuration directory
export SPARK_DIST_CLASSPATH=$(hadoop --config /path/to/configs classpath)
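
For the CDH-based setup described in this post, the classpath.txt file edited earlier plays the same role. My understanding (an assumption, not verified against the CDH scripts) is that the CDH-provided spark-env.sh builds SPARK_DIST_CLASSPATH from that file, roughly like this:

### in /opt/spark/conf/spark-env.sh ###
# assumption: join the classpath.txt entries into one colon-separated classpath
export SPARK_DIST_CLASSPATH=$(paste -sd: /opt/spark/conf/classpath.txt)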