Building spark-1.4.1-bin-cdh5.3.2 with Maven
Preparing to build Spark
1. Download the Spark 1.4.1 source package and extract it
I extracted it to /home/hadoop/softwares/:
tar -zxvf spark-1.4.1.tgz -C /home/hadoop/softwares/
2. Install Maven
Installing it is just a matter of extracting the archive and setting the environment variables; nothing more to it.
3. Install Scala
export SCALA_HOME=/home/hadoop/softwares/scala-2.10.4
export PATH=${PATH}:$SCALA_HOME/bin
To verify the Scala environment on Linux, type scala -version; output like the following means it is set up:
4. Install Oracle JDK 7
Although I did build successfully with OpenJDK 1.7, for now I still recommend that readers use Oracle JDK 7. For downloading and installing JDK 1.7, see my earlier Java setup notes.
Note: unlike many guides online, I did not rip OpenJDK out of the system. I simply put the Oracle JDK ahead of the existing $PATH (path lookup goes front to back and stops at the first hit), like so:
export JAVA_HOME=/home/hadoop/softwares/jdk1.7.0_71
export CLASSPATH=.:$JAVA_HOME/jre/lib/rt.jar:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
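The PATH-ordering point can be demonstrated with two stub `java` scripts. The directories below are throwaway temp dirs standing in for the two JDK installs, not real paths:

```shell
#!/bin/sh
# The shell searches $PATH left to right and stops at the first match,
# which is why prepending Oracle's bin directory makes it win.
workdir=$(mktemp -d)
mkdir -p "$workdir/oracle-jdk/bin" "$workdir/open-jdk/bin"
printf '#!/bin/sh\necho oracle\n'  > "$workdir/oracle-jdk/bin/java"
printf '#!/bin/sh\necho openjdk\n' > "$workdir/open-jdk/bin/java"
chmod +x "$workdir"/*/bin/java

# open-jdk listed first: it is found first
first=$(env PATH="$workdir/open-jdk/bin:$workdir/oracle-jdk/bin" java)
# oracle-jdk prepended, as in the export above: now Oracle wins
second=$(env PATH="$workdir/oracle-jdk/bin:$workdir/open-jdk/bin" java)
echo "$first $second"   # openjdk oracle

rm -rf "$workdir"
```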
Building
Go into the spark-1.4.1 source directory. The directory layout before building:
Then build:
mvn -Dhadoop.version=2.5.0-cdh5.3.2 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
I wanted the output saved to a file as well as shown on screen, so the command becomes (this is the one I actually used):
mvn -Dhadoop.version=2.5.0-cdh5.3.2 -Pyarn -Phive -Phive-thriftserver -DskipTests clean package | tee building.txt
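One general shell caveat with `| tee` (an aside, not specific to this build): a pipeline's exit status is that of its last command, so a failed mvn can still look successful through tee. In bash, `set -o pipefail` preserves the failure; `false` below stands in for a failing build command:

```shell
#!/bin/bash
# Without pipefail, the pipeline reports tee's (successful) status
false | tee /dev/null
no_pipefail=$?
echo "without pipefail: $no_pipefail"   # 0, the failure is masked

# With pipefail, the failing command's status wins
set -o pipefail
false | tee /dev/null
with_pipefail=$?
echo "with pipefail: $with_pipefail"    # 1, the failure is visible
```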
An aside: apparently, for CDH builds the following command alone is enough (I have not verified this):
mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package
Just be careful to set the Hadoop version and the Scala version to the ones you actually run.
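Whichever mvn invocation you pick, the Spark 1.x build documentation recommends giving Maven extra memory first; without it the build can die with the OutOfMemoryError mentioned in the references at the end:

```shell
# Recommended in the Spark 1.x building docs for JDK 7 builds
# (MaxPermSize is a JDK 7 flag; it was removed in JDK 8+)
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
```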
Note:
- mvn does not produce a tar package by default. You get a pile of jar files instead; every module has its own jar under its target directory (e.g. the one marked in the figure above):
ls /home/hadoop/softwares/spark-1.4.1/network/yarn/target
- Under assembly/target/scala-2.10 there is a spark-assembly-1.4.1-hadoop2.5.0-cdh5.3.2.jar file:
ls /home/hadoop/softwares/spark-1.4.1/assembly/target/scala-2.10
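A side note on the per-module jars above: rather than ls-ing each target directory, one find lists them all. The sketch below builds a throwaway fake tree so it is runnable anywhere; in practice you would run the find from the spark-1.4.1 source root:

```shell
#!/bin/sh
# Miniature fake source tree standing in for spark-1.4.1/
root=$(mktemp -d)
mkdir -p "$root/network/yarn/target" "$root/core/target"
touch "$root/network/yarn/target/spark-network-yarn_2.10-1.4.1.jar" \
      "$root/core/target/spark-core_2.10-1.4.1.jar"

# One jar per module, each under its own target/ directory
jars=$(find "$root" -path '*/target/*.jar' | sort)
echo "$jars"
njars=$(printf '%s\n' "$jars" | wc -l)

rm -rf "$root"
```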
I copied it over to Windows and opened it with an archive tool to have a look.
Inside the org folder:
The files under that folder:
This confirms the build succeeded.
Making a binary tgz package (extract and run)
Next, make-distribution.sh in the source root can be used to build the binary distribution package:
Note: after running this command I realized the separate mvn step was unnecessary; ./make-distribution.sh on its own would have done, since the script drives Maven itself.
./make-distribution.sh --name custom-spark --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver
The --name custom-spark part of that command is debatable; it seems it should really be the Hadoop version string.
The command I actually used:
./make-distribution.sh --name cdh5.3.2 --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver | tee building_distribution.txt
At the end it prompted (Y/N); I gingerly chose Y, and the long build began...
After surviving assorted failures, it finally built successfully, as shown below:
In that directory, a spark-1.4.1-bin-cdh5.3.2.tgz file was generated, 322 MB in size (follow-up: initial testing shows it works fine). With that, my build was complete.
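Before copying the package to other machines, its contents and size can be checked without unpacking it. The demo below builds a tiny stand-in archive so the commands are runnable as-is; against the real build, point them at spark-1.4.1-bin-cdh5.3.2.tgz instead:

```shell
#!/bin/sh
# Build a small stand-in archive (the real one is ~322 MB)
work=$(mktemp -d)
mkdir -p "$work/spark-1.4.1-bin-cdh5.3.2/bin"
touch "$work/spark-1.4.1-bin-cdh5.3.2/bin/spark-submit"
tar -czf "$work/demo.tgz" -C "$work" spark-1.4.1-bin-cdh5.3.2

# -t lists entries without extracting; du -h shows the on-disk size
listing=$(tar -tzf "$work/demo.tgz")
echo "$listing"
du -h "$work/demo.tgz"

rm -rf "$work"
```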
Q & A
Q1: warning: [options] bootstrap class path not set in conjunction with -source 1.6
Cause:
This is not Ant but the JDK’s javac emitting the warning.
If you use Java 7’s javac and -source for anything smaller than 7 javac warns you you should also set the bootstrap classpath to point to an older rt.jar - because this is the only way to ensure the result is usable on an older VM.
This is only a warning, so you could ignore it and even suppress it with
<compilerarg value="-Xlint:-options"/>
Alternatively you really install an older JVM and adapt your bootclasspath accordingly (you need to include rt.jar, not the bin folder)
Fix: just ignore it.
Q2: the build fails partway through (CompileFailed)
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-sql_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) on project spark-sql_2.10: Execution scala-test-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:testCompile failed. CompileFailed -> [Help 1]
Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-core_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
[WARNING] The requested profile "hive-" could not be activated because it does not exist.
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-mllib_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed. CompileFailed -> [Help 1]
Possible causes:
- a slow or flaky network while fetching dependencies?
- the build ran too long and exceeded some compile timeout?
- the build host was under heavy load?
- a mistyped profile name: the "hive-" warning above is what Maven prints when -Phive-thriftserver gets split by a stray space
Fixes:
- delete the local Maven repository, then rebuild (possibly several times)
- or else:
mvn <goals> -rf :spark-sql_2.10
// resume the build from where it failed (e.g. from spark-sql_2.10)
./make-distribution.sh --name cdh5.3.2 --skip-java-test --tgz -Pyarn -Dhadoop.version=2.5.0-cdh5.3.2 -Dscala-2.10.4 -Phive -Phive-thriftserver -rf :spark-sql_2.10
- modify the pom.xml in the spark-1.4.1 source root:
<dependency>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
</dependency>
Q3: MissingRequirementError on spark-repl_2.10
[ERROR] error while loading , error in opening zip file
[ERROR] Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-repl_2.10: wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile (scala-compile-first) on project spark-repl_2.10: Execution scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.0:compile failed.
Possible causes turned up by Google:
Answer 1:
This error is actually an error from scalac, not a compile error from the code. It sort of sounds like it has not been able to download scala dependencies. Check or maybe recreate your environment.
Answer 2:
This error is very misleading, it actually has nothing to do with scala.runtime or the compiler mirror: this is the error you get when you have a faulty JAR file on your classpath.
Sadly, there is no way from the error (even with -Ydebug) to tell exactly which file. You can run scala with -Ylog-classpath, it will output a lot of classpath stuff, including the exact classpath used (look for “[init] [search path for class files:”). Then I guess you will have to go through them to check if they are valid or not.
I recently tried to improve that (SI-5463), at least to get a clear error message, but couldn't find a satisfyingly clean way to do this...
Answer 3:
I have checked to ensure that in my class path that ALL jars from SCALA_HOME/lib/ are included
As we figured out at #scala, the documentation was missing the fact that one needs to provide the -Dscala.usejavacp=true argument to the JVM command that invokes scalac. After that everything worked fine, and I updated the docs: http://docs.scala-lang.org/overviews/macros/overview.html#debugging_macros.
Q4: other potential pitfalls
Spark (1.4.1) and Hadoop (2.5.0) must agree on the Protocol Buffers version; a mismatch can prevent HDFS files from being read correctly, so pom.xml needs the corresponding change:
<!--<protobuf.version>2.4.1</protobuf.version>-->
<protobuf.version>2.5.0</protobuf.version>
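That one-line change can also be scripted with sed. The sketch below edits a minimal stand-in pom.xml so it is runnable as-is; for the real build, target spark-1.4.1/pom.xml instead (the .bak backup lets you revert):

```shell
#!/bin/sh
# Minimal stand-in for the <properties> block in spark-1.4.1/pom.xml
pom=$(mktemp)
cat > "$pom" <<'EOF'
<properties>
  <protobuf.version>2.4.1</protobuf.version>
</properties>
EOF

# Bump protobuf to 2.5.0 to match Hadoop 2.5.0-cdh5.3.2
sed -i.bak 's|<protobuf.version>2.4.1</protobuf.version>|<protobuf.version>2.5.0</protobuf.version>|' "$pom"

after=$(grep 'protobuf.version' "$pom")
echo "$after"

rm -f "$pom" "$pom.bak"
```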
Key references
《spark1.4.0基於yarn的安裝心得體會》: http://blog.csdn.net/xiao_jun_0820/article/details/46561097
That author's motivation: the Spark 1.2.0 bundled with CDH 5.3.2 still has plenty of bugs, especially some unbearable Spark SQL ones, while the new Spark 1.4.0 supports SparkR, which colleagues who use R were keen to try, hence the upgrade.
《CDH5.1.0編譯spark-assembly包來支援hive》: http://blog.csdn.net/aaa1117a8w5s6d/article/details/44307207
Adds a private-repository mirror to Maven's apache-maven-3.2.5/conf/settings.xml and provides test code. Also covers:
- Exception in thread "main" java.lang.OutOfMemoryError
- Cannot run program “javac”: java.io.IOException
- Please set the SCALA_HOME
- choosing the matching Hadoop and Yarn versions