
Mixed Java/Scala Spark development in a Maven environment

Java developers building Spark applications often run into Spark's incomplete Java API documentation, or features that simply have no Java interface. If a Java project can use Scala directly for the Spark parts while keeping Java for everything else, developing a Spark project becomes noticeably easier. Below we explore how to set up a Java + Scala + Spark + Maven development environment.

1. Download the Scala SDK

(Later, when you create a `.scala` source file in IntelliJ IDEA, the IDE will detect it and prompt you to set up a Scala SDK; just point it at the directory where you unpacked the SDK.)

You can also configure the Scala SDK manually: IDEA => File => Project Structure... => Libraries => +

2. Download the Scala plugin for IntelliJ IDEA

In Settings => Plugins, search for "Scala" and install it. If you have no Internet access, or the connection is too slow, you can also download the plugin's zip package manually from http://plugins.jetbrains.com/plugin/?idea&id=1347. When downloading manually, pay close attention to the version number: it must match your IntelliJ IDEA version, or the plugin will not install. After downloading, click "Install plugin from disk..." in the Plugins dialog and select the zip package.

3. Integrating with Maven

To package the project with Maven, you need to configure the scala-maven-plugin in the pom file. Since this is Spark development and the jar must be packaged as an executable Java jar, the pom also needs the maven-assembly-plugin and maven-shade-plugin, with mainClass set to your entry class (its fully qualified name must match the main class you actually write). After some experimentation, here is a working pom file; to reuse it, just add or remove dependencies as needed.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>my-project-groupid</groupId>
    <artifactId>sparkTest</artifactId>
    <packaging>jar</packaging>
    <version>1.0-SNAPSHOT</version>
    <name>sparkTest</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hbase.version>0.98.3</hbase.version>
        <spark.version>1.6.0</spark.version>
        <jdk.version>1.7</jdk.version>
        <scala.version>2.10.5</scala.version>
    </properties>

    <repositories>
        <repository>
            <id>repo1.maven.org</id>
            <url>http://repo1.maven.org/maven2</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>repository.jboss.org</id>
            <url>http://repository.jboss.org/nexus/content/groups/public/</url>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
        <repository>
            <id>cloudhopper</id>
            <name>Repository for Cloudhopper</name>
            <url>http://maven.cloudhopper.com/repos/third-party/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </repository>
    </repositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>javax.mail</groupId>
            <artifactId>javax.mail-api</artifactId>
            <version>1.4.7</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-graphx_2.10</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.30</version>
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>14.0.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.3</version>
        </dependency>
        <dependency>
            <groupId>p6spy</groupId>
            <artifactId>p6spy</artifactId>
            <version>1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-math3</artifactId>
            <version>3.3</version>
        </dependency>
        <dependency>
            <groupId>org.jdom</groupId>
            <artifactId>jdom</artifactId>
            <version>2.0.2</version>
        </dependency>
        <dependency>
            <groupId>redis.clients</groupId>
            <artifactId>jedis</artifactId>
            <version>2.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase</artifactId>
            <version>0.98.6-hadoop2</version>
            <type>pom</type>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>0.98.6-hadoop2</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>6.8.8</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.jaxrs</groupId>
            <artifactId>jackson-jaxrs-json-provider</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.4.4</version>
        </dependency>
        <dependency>
            <groupId>net.sf.json-lib</groupId>
            <artifactId>json-lib</artifactId>
            <version>2.4</version>
            <classifier>jdk15</classifier>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <!-- The packaged jar must be submitted to Spark with spark-submit;
                 do not run it with java -jar. -->
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <appendAssemblyId>false</appendAssemblyId>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                    <archive>
                        <manifest>
                            <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
                        </manifest>
                    </archive>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>assembly</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>${jdk.version}</source>
                    <target>${jdk.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.1</version>
                <configuration>
                    <createDependencyReducedPom>false</createDependencyReducedPom>
                </configuration>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <shadedClassifierName>allinone</shadedClassifierName>
                            <artifactSet>
                                <includes>
                                    <include>*:*</include>
                                </includes>
                            </artifactSet>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <!-- Compiles the Scala sources and lets Maven handle the (possibly
                 circular) dependencies between the Java and Scala code. -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <id>compile-scala</id>
                        <phase>compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>test-compile-scala</id>
                        <phase>test-compile</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

The build section is the part that matters; the rest needs little attention.

The project's directory layout follows Maven's default conventions, except that src contains an extra scala directory, mainly to keep the Java sources and Scala sources organized separately:
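Since the original screenshot is not reproduced here, the layout looks roughly like this (a typical convention for scala-maven-plugin projects; the file names are the ones used below):

```text
sparkTest/
├── pom.xml
└── src/
    ├── main/
    │   ├── java/        <- Java sources (e.g. test/HelloWorld.java)
    │   ├── scala/       <- Scala sources (e.g. test/Hello.scala)
    │   └── resources/
    └── test/
        ├── java/
        └── scala/
```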

Create a HelloWorld class (HelloWorld.java) under the java directory:

package test;

/**
 * Created by L on 2017/1/5.
 */
public class HelloWorld {
    public static void main(String[] args) {
        System.out.print("test");
        Hello.sayHello("scala");
        Hello.runSpark();
    }
}

Create a Hello object (Hello.scala) under the scala directory:

package test

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by L on 2017/1/5.
 */
object Hello {
  def sayHello(x: String): Unit = {
    println("hello," + x)
  }

  def runSpark() {
    val sparkConf = new SparkConf().setAppName("SparkKMeans").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
        (4L, ("peter", "student"))))
    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
        Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))
    // Define a default user in case there are relationships with missing users
    val defaultUser = ("John Doe", "Missing")
    // Build the initial Graph
    val graph = Graph(users, relationships, defaultUser)
    // Notice that there is a user 0 (for which we have no information) connected to users
    // 4 (peter) and 5 (franklin).
    graph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))
    // Remove missing vertices as well as the edges connected to them
    val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
    // The valid subgraph will disconnect users 4 and 5 by removing user 0
    validGraph.vertices.collect.foreach(println(_))
    validGraph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))
    sc.stop()
  }
}
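Running the code above requires Spark and GraphX on the classpath. To preview what the first `graph.triplets.map(...)` computes without launching Spark, the same join can be sketched with plain Scala collections (`TripletDemo` is a hypothetical helper, not part of the project sources):

```scala
// Plain-Scala preview of the GraphX triplet mapping above -- no Spark required.
object TripletDemo {
  private val users = Map(
    3L -> ("rxin", "student"), 7L -> ("jgonzal", "postdoc"),
    5L -> ("franklin", "prof"), 2L -> ("istoica", "prof"),
    4L -> ("peter", "student"))
  private val relationships = Seq(
    (3L, 7L, "collab"), (5L, 3L, "advisor"), (2L, 5L, "colleague"),
    (5L, 7L, "pi"), (4L, 0L, "student"), (5L, 0L, "colleague"))
  private val defaultUser = ("John Doe", "Missing")

  // Join each edge with its endpoints' attributes, like a GraphX triplet;
  // unknown vertex ids (user 0) fall back to defaultUser.
  def triplets: Seq[String] = relationships.map { case (src, dst, rel) =>
    users.getOrElse(src, defaultUser)._1 + " is the " + rel + " of " +
      users.getOrElse(dst, defaultUser)._1
  }

  def main(args: Array[String]): Unit = triplets.foreach(println)
}
```

The first line printed, for example, is "rxin is the collab of jgonzal", and the edges pointing at the missing user 0 pick up the "John Doe" default.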

With this setup, the Scala code calls Spark's API to run Spark jobs, and the Java code in turn calls the Scala code.
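The Java class can call Hello.sayHello directly because the Scala compiler emits static forwarder methods for an `object`, so from Java it looks like an ordinary class with static methods. A minimal, self-contained sketch (`Greeter` is a hypothetical name, not from the project above):

```scala
// A Scala object's methods appear to Java as static methods;
// Java code could write Greeter.greet("scala") with no extra glue.
object Greeter {
  def greet(name: String): String = "hello," + name
}

object GreeterDemo {
  def main(args: Array[String]): Unit =
    println(Greeter.greet("scala"))
}
```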

4. Compiling and packaging the Scala/Maven project

For a mixed Java/Scala project, to compile the Scala sources first and then the Java sources, use the following Maven command to compile and package:

mvn clean scala:compile assembly:assembly

5. Running the Spark project's jar

During development you will probably run the job in local mode inside the IDE, then package it with the command above. The packaged Spark project must be deployed to a Spark cluster and submitted with spark-submit; do not run it with java -jar. A submission would look something like (class and jar names follow the pom above; the master URL depends on your cluster): spark-submit --class rrkd.dt.sparkTest.HelloWorld --master <your-master> sparkTest-1.0-SNAPSHOT.jar

--------------------- This article is from the CSDN blog of 大愚若智_; the original is at: https://blog.csdn.net/zbc1090549839/article/details/54290233?utm_source=copy