
Spark on YARN with IntelliJ IDEA: Installation, Build, Packaging, and Running on the Cluster


Note: Hadoop 2.2.0 is already installed in fully distributed mode, and Scala and Spark are installed and configured. The hosts are hadoop-master and hadoop-slave.

1. Installing IntelliJ IDEA (CentOS 6.5)

Step 1.

1. Copy the two installation packages (the IntelliJ IDEA archive and the Scala plugin) to the master node (here they were placed on the desktop of the root user on hadoop-master).
2. Extract the IDE:
tar zxvf ideaIc-2017.1.tar.gz -C ~/
3. Place scala-intellij-bin-2017.1.14 in the plugins directory under idea-IC-171.3780.95.

Step 2.

1. Launch IntelliJ IDEA.
2. Create a new project, as shown below, and click Next.
[Screenshot]
3. Fill in the project name, the project location (where the project is stored), the JDK (the JDK 1.7 bundled with CentOS 6.5 did not work here and was replaced with Oracle JDK 1.8), and the Scala SDK (2.10.4, the version installed on this machine).
[Screenshot]
[Screenshot]
4. When developing applications in IDEA, source code is usually organized into directories such as a main source directory and a test source directory. The following shows how to create a main/scala source directory under src in IntelliJ IDEA.
Press F4, or right-click the project, and choose Open Module Settings.
[Screenshot]


5. Click Modules, select the src directory, right-click to create the main/scala folders, then mark the scala folder as Sources, as shown below.
[Screenshot]
6. Import the Spark dependency (here spark-assembly-1.0.0-hadoop2.2.0.jar): click Libraries, click +, choose Java, select spark-assembly-1.0.0-hadoop2.2.0.jar, attach it to the module, and click OK.
[Screenshot]

At this point, the Spark development environment is set up.
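As an alternative to attaching the assembly jar by hand, the dependency could also be managed with sbt. A minimal build.sbt sketch for this setup (an illustration only, not what this walkthrough uses; the coordinates assume Spark 1.0.0 built for Scala 2.10):

```scala
// build.sbt (sketch): pull spark-core instead of hand-attaching the assembly jar
name := "my2"

scalaVersion := "2.10.4"

// add % "provided" when packaging for the cluster so Spark is not bundled into the jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
```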

Step 3. Running locally

1. Create a new Scala object and enter the code below.
[Screenshot]
2. Build the code: Build -> Build Project.
3. Set the run parameters via Run -> Edit Configurations; for this local test the program argument is the input path, e.g. hdfs://10.6.3.200:8020/data/wordcount/1.txt (as used in the run below).
[Screenshot]

```scala
/**
 * Created by root on 3/28/17.
 */
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}

object spp {
  def main(args: Array[String]) {
    // The input file can be a local Linux file or come from another source, e.g. HDFS
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile>")
      System.exit(1)
    }
    // Run with a local master; the number of threads can be specified,
    // e.g. .setMaster("local[2]") for two threads. Here a single thread is used.
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("local")
    val sc = new SparkContext(conf)

    // WordCount: split each line into words and count the occurrences of each word
    val rdd2 = sc.textFile(args(0)).flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // val count1 = rdd2.countByValue()

    // Print the number of distinct words
    // rdd2.saveAsTextFile(path = args(1))
    // rdd2.count()
    println(rdd2.count())
    sc.stop()
  }
}
```

When done, run the program with Run -> Run or Alt+Shift+F10. The run output is shown below:
```
/usr/lib/jvm/jdk1.8.0_60/bin/java -javaagent:/root/Desktop/idea-IC-171.3780.95/lib/idea_rt.jar=42032:/root/Desktop/idea-IC-171.3780.95/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/jdk1.8.0_60/jre/lib/charsets.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/deploy.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/dnsns.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/jaccess.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/jfxrt.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/localedata.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/nashorn.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/sunec.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/ext/zipfs.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/javaws.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/jce.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/jfr.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/jfxswt.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/jsse.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/management-agent.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/plugin.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/resources.jar:/usr/lib/jvm/jdk1.8.0_60/jre/lib/rt.jar:/root/IdeaProjects/my2/out/production/my2:/root/.ivy2/cache/org.scala-lang/scala-reflect/jars/scala-reflect-2.10.4.jar:/root/.ivy2/cache/org.scala-lang/scala-library/jars/scala-library-2.10.4.jar:/root/.ivy2/cache/org.scala-lang/scala-reflect/srcs/scala-reflect-2.10.4-sources.jar:/root/.ivy2/cache/org.scala-lang/scala-library/srcs/scala-library-2.10.4-sources.jar:/root/spark-1.0.0-bin-2.2.0/lib/spark-assembly-1.0.0-hadoop2.2.0.jar spp hdfs://10.6.3.200:8020/data/wordcount/1.txt
17/03/28 13:26:06 INFO SecurityManager: Using Spark’s default log4j profile: org/apache/spark/log4j-defaults.properties
17/03/28 13:26:06 INFO SecurityManager: Changing view acls to: root
17/03/28 13:26:06 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root)
17/03/28 13:26:07 INFO Slf4jLogger: Slf4jLogger started
17/03/28 13:26:07 INFO Remoting: Starting remoting
17/03/28 13:26:08 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:59842]
17/03/28 13:26:08 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:59842]
17/03/28 13:26:08 INFO SparkEnv: Registering MapOutputTracker
17/03/28 13:26:08 INFO SparkEnv: Registering BlockManagerMaster
17/03/28 13:26:08 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20170328132608-36ca
17/03/28 13:26:08 INFO MemoryStore: MemoryStore started with capacity 528.0 MB.
17/03/28 13:26:08 INFO ConnectionManager: Bound socket to port 50620 with id = ConnectionManagerId(hadoop-master,50620)
17/03/28 13:26:08 INFO BlockManagerMaster: Trying to register BlockManager
17/03/28 13:26:08 INFO BlockManagerInfo: Registering block manager hadoop-master:50620 with 528.0 MB RAM
17/03/28 13:26:08 INFO BlockManagerMaster: Registered BlockManager
17/03/28 13:26:08 INFO HttpServer: Starting HTTP Server
17/03/28 13:26:08 INFO HttpBroadcast: Broadcast server started at http://10.6.3.200:50541
17/03/28 13:26:08 INFO HttpFileServer: HTTP File server directory is /tmp/spark-e484b842-2b2c-43e1-8c19-c1375c30dc92
17/03/28 13:26:08 INFO HttpServer: Starting HTTP Server
17/03/28 13:26:09 INFO SparkUI: Started SparkUI at http://hadoop-master:4040
17/03/28 13:26:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
17/03/28 13:26:11 INFO MemoryStore: ensureFreeSpace(133256) called with curMem=0, maxMem=553648128
17/03/28 13:26:11 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 130.1 KB, free 527.9 MB)
17/03/28 13:26:12 INFO FileInputFormat: Total input paths to process : 1
17/03/28 13:26:12 INFO SparkContext: Starting job: count at spp.scala:28
17/03/28 13:26:12 INFO DAGScheduler: Registering RDD 4 (reduceByKey at spp.scala:23)
17/03/28 13:26:12 INFO DAGScheduler: Got job 0 (count at spp.scala:28) with 1 output partitions (allowLocal=false)
17/03/28 13:26:12 INFO DAGScheduler: Final stage: Stage 0(count at spp.scala:28)
17/03/28 13:26:12 INFO DAGScheduler: Parents of final stage: List(Stage 1)
17/03/28 13:26:12 INFO DAGScheduler: Missing parents: List(Stage 1)
17/03/28 13:26:12 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[4] at reduceByKey at spp.scala:23), which has no missing parents
17/03/28 13:26:12 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[4] at reduceByKey at spp.scala:23)
17/03/28 13:26:12 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/03/28 13:26:12 INFO TaskSetManager: Starting task 1.0:0 as TID 0 on executor localhost: localhost (PROCESS_LOCAL)
17/03/28 13:26:12 INFO TaskSetManager: Serialized task 1.0:0 as 2076 bytes in 129 ms
17/03/28 13:26:12 INFO Executor: Running task ID 0
17/03/28 13:26:12 INFO BlockManager: Found block broadcast_0 locally
17/03/28 13:26:12 INFO HadoopRDD: Input split: hdfs://10.6.3.200:8020/data/wordcount/1.txt:0+15
17/03/28 13:26:12 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/03/28 13:26:12 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/03/28 13:26:12 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/03/28 13:26:12 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/03/28 13:26:12 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
17/03/28 13:26:13 INFO Executor: Serialized size of result for 0 is 783
17/03/28 13:26:13 INFO Executor: Sending result for 0 directly to driver
17/03/28 13:26:13 INFO Executor: Finished task ID 0
17/03/28 13:26:13 INFO DAGScheduler: Completed ShuffleMapTask(1, 0)
17/03/28 13:26:13 INFO TaskSetManager: Finished TID 0 in 667 ms on localhost (progress: 1/1)
17/03/28 13:26:13 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/03/28 13:26:13 INFO DAGScheduler: Stage 1 (reduceByKey at spp.scala:23) finished in 0.705 s
17/03/28 13:26:13 INFO DAGScheduler: looking for newly runnable stages
17/03/28 13:26:13 INFO DAGScheduler: running: Set()
17/03/28 13:26:13 INFO DAGScheduler: waiting: Set(Stage 0)
17/03/28 13:26:13 INFO DAGScheduler: failed: Set()
17/03/28 13:26:13 INFO DAGScheduler: Missing parents for Stage 0: List()
17/03/28 13:26:13 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[6] at reduceByKey at spp.scala:23), which is now runnable
17/03/28 13:26:13 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[6] at reduceByKey at spp.scala:23)
17/03/28 13:26:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/03/28 13:26:13 INFO TaskSetManager: Starting task 0.0:0 as TID 1 on executor localhost: localhost (PROCESS_LOCAL)
17/03/28 13:26:13 INFO TaskSetManager: Serialized task 0.0:0 as 1939 bytes in 1 ms
17/03/28 13:26:13 INFO Executor: Running task ID 1
17/03/28 13:26:13 INFO BlockManager: Found block broadcast_0 locally
17/03/28 13:26:13 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
17/03/28 13:26:13 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/03/28 13:26:13 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 13 ms
17/03/28 13:26:13 INFO Executor: Serialized size of result for 1 is 863
17/03/28 13:26:13 INFO Executor: Sending result for 1 directly to driver
17/03/28 13:26:13 INFO Executor: Finished task ID 1
17/03/28 13:26:13 INFO DAGScheduler: Completed ResultTask(0, 0)
17/03/28 13:26:13 INFO TaskSetManager: Finished TID 1 in 91 ms on localhost (progress: 1/1)
17/03/28 13:26:13 INFO DAGScheduler: Stage 0 (count at spp.scala:28) finished in 0.094 s
17/03/28 13:26:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/03/28 13:26:13 INFO SparkContext: Job finished: count at spp.scala:28, took 1.148558113 s
2
17/03/28 13:26:13 INFO SparkUI: Stopped Spark web UI at http://hadoop-master:4040
17/03/28 13:26:13 INFO DAGScheduler: Stopping DAGScheduler
17/03/28 13:26:14 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
17/03/28 13:26:14 INFO ConnectionManager: Selector thread was interrupted!
17/03/28 13:26:14 INFO ConnectionManager: ConnectionManager stopped
17/03/28 13:26:14 INFO MemoryStore: MemoryStore cleared
17/03/28 13:26:14 INFO BlockManager: BlockManager stopped
17/03/28 13:26:14 INFO BlockManagerMasterActor: Stopping BlockManagerMaster
17/03/28 13:26:14 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/28 13:26:14 INFO SparkContext: Successfully stopped SparkContext
17/03/28 13:26:14 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/03/28 13:26:14 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
```

OK
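Note that rdd2.count() returns the number of distinct words, which is why this run prints 2. If the goal were instead to count the lines that contain the word "Spark", a minimal sketch (assuming the same SparkContext sc and input path as above) would be:

```scala
// Hedged sketch: count only the lines that contain the word "Spark"
val sparkLineCount = sc.textFile(args(0))
  .filter(line => line.contains("Spark"))
  .count()
println(sparkLineCount)
```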

Packaging and running on the YARN cluster

1. Check what is in HDFS:
hadoop fs -ls -R /
[Screenshot]
2. As shown above, the file 1.txt is already there. If it is not, put a suitable file into HDFS, for example: hadoop fs -put /usr/local/cluster/hadoop/etc/hadoop/slaves /data/wordcount/
3. Modify the code as follows (a note on setMaster follows the listing):
```scala
package main.scala

import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf, SparkContext}
/**
 * Created by root on 3/23/17.
 */
object spp {
  def main(args: Array[String]) {
    // The input file can be a local Linux file or come from another source, e.g. HDFS
    if (args.length == 0) {
      System.err.println("Usage: SparkWordCount <inputfile> <outputdir>")
      System.exit(1)
    }
    // For a local test you could use e.g. .setMaster("local[2]");
    // here the master is set to the cluster master URL
    val conf = new SparkConf().setAppName("SparkWordCount").setMaster("spark://10.6.3.200:7077")
    val sc = new SparkContext(conf)

    // WordCount: split each line into words and count the occurrences of each word
    val rdd2 = sc.textFile(args(0)).flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    // val count1 = rdd2.countByValue()

    // Save the result to the output directory (here an HDFS path)
    rdd2.saveAsTextFile(path = args(1))
    // rdd2.count()
    // println(rdd2.count())
    sc.stop()
  }
}
```
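One caveat (my observation, not part of the original walkthrough): the code above hard-codes the standalone master spark://10.6.3.200:7077, while the spark-submit command in step 7 uses --master yarn-client, and a master set in code can conflict with the one passed on the command line. A minimal sketch that leaves the choice of master to spark-submit would be:

```scala
// Hedged sketch: omit setMaster so that spark-submit's --master flag
// (local[*], spark://..., or yarn-client) decides where the job runs
val conf = new SparkConf().setAppName("SparkWordCount")
val sc = new SparkContext(conf)
```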
4. Click the project my2, press F4 to open Project Structure, and select Artifacts, as shown below.
Choose JAR -> From modules with dependencies.
[Screenshot]

5. Because the job will later be submitted to the cluster, where the Spark jars are already available, remove spark-assembly-1.0.0-hadoop2.2.0.jar and the other dependency jars from the artifact to reduce its size, as shown below.
[Screenshot]

6. Click OK, then run Build -> Build Artifacts.
The generated artifact is shown below:
[Screenshot]

7. Run in yarn-client mode (yarn-cluster mode ran out of memory; this has not been resolved):
/root/spark-1.0.0-bin-2.2.0/bin/spark-submit --master yarn-client --class main.scala.spp my2.jar hdfs://hadoop-master:8020/data/wordcount/1.txt hdfs://hadoop-master:8020/data/wordcount/read9

8. Check whether read9 now exists in HDFS:
[Screenshot]
Check Spark:
[Screenshot]
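To confirm the contents of the result (a hedged sketch on my part, not a step from the original post), the output directory can be read back from the Spark shell or another driver, assuming an existing SparkContext sc:

```scala
// Hedged sketch: read the word-count output back from HDFS and print it
val result = sc.textFile("hdfs://hadoop-master:8020/data/wordcount/read9")
result.collect().foreach(println)
```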

Cluster web UIs: HDFS: 10.6.3.200:50070
Spark: 10.6.3.200:8088
Hadoop: 10.6.3.200:8080
