Installing Spark 1.2.1 on Linux (CentOS 7) with Hadoop 2.5.2

1. Install Hadoop (see the separate reference)

2. Install Scala (see the separate reference)

3. Install Spark

Download the latest Spark release, spark-1.2.1-bin-hadoop2.4.tgz.

Upload it to /opt on the Linux machine and extract it:

[root@master opt]# tar -zxf spark-1.2.1-bin-hadoop2.4.tgz

Change the owner and group (the same user as Hadoop):

[root@master opt]# chown -R hadoop:hadoop spark-1.2.1-bin-hadoop2.4

Check the permissions:

[root@master opt]# ls -ll

drwxrwxr-x  10 hadoop    hadoop          154 2月   3 11:45 spark-1.2.1-bin-hadoop2.4
-rw-r--r--   1 root      root      219309755 3月  12 13:41 spark-1.2.1-bin-hadoop2.4.tgz

Add the environment variables:

[root@master spark-1.2.1-bin-hadoop2.4]# vim /etc/profile

export SPARK_HOME=/opt/spark-1.2.1-bin-hadoop2.4
export PATH=$PATH:$SPARK_HOME/bin

:wq    # save and exit

Source the profile so the changes take effect:

[root@master spark-1.2.1-bin-hadoop2.4]# . /etc/profile
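
To confirm the variables took effect, you can print them back; this is just an optional sanity check, not part of the original steps:

[root@master spark-1.2.1-bin-hadoop2.4]# echo $SPARK_HOME      # should print /opt/spark-1.2.1-bin-hadoop2.4
[root@master spark-1.2.1-bin-hadoop2.4]# which spark-shell     # should resolve to a path under $SPARK_HOME/bin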

Switch to the hadoop user:

[root@master spark-1.2.1-bin-hadoop2.4]# su hadoop

Enter the conf directory:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ cd conf

Copy spark-env.sh.template to spark-env.sh:

[hadoop@master conf]$ cp spark-env.sh.template spark-env.sh

Edit it:

[hadoop@master conf]$ vim spark-env.sh

Add the following:

export JAVA_HOME=/usr/java/jdk1.7.0_71
export SCALA_HOME=/usr/scala/scala-2.11.6
export SPARK_MASTER_IP=192.168.189.136     # IP of the cluster master
export SPARK_WORKER_MEMORY=2g              # maximum memory a worker node gives to executors; each of the three machines has 2 GB
export HADOOP_CONF_DIR=/opt/hadoop-2.5.2/etc/hadoop     # directory containing the Hadoop cluster's configuration files
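
The JAVA_HOME, SCALA_HOME and HADOOP_CONF_DIR values above come from the earlier JDK, Scala and Hadoop installs; adjust them if your paths differ. A quick way to confirm they exist (an optional check, assuming the paths shown above):

[hadoop@master conf]$ ls -d /usr/java/jdk1.7.0_71 /usr/scala/scala-2.11.6 /opt/hadoop-2.5.2/etc/hadoop    # each path should be listed without an error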

Edit slaves:

[hadoop@master conf]$ cp slaves.template slaves
[hadoop@master conf]$ vim slaves

Change its contents to:

master
slave1
slave2
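
Spark's start-all.sh will SSH into every host listed in slaves, so passwordless SSH from master to slave1 and slave2 should already be working (it normally is if the Hadoop cluster was set up the same way). An optional quick check, assuming those hostnames resolve:

[hadoop@master conf]$ ssh slave1 hostname    # should print slave1 without prompting for a password
[hadoop@master conf]$ ssh slave2 hostname    # should print slave2 without prompting for a password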

4. Install on the other two nodes, slave1 and slave2. The procedure is the same as above; simply copy the files across:

[hadoop@master opt]$ scp -r spark-1.2.1-bin-hadoop2.4 root@slave1:/opt/

[hadoop@master opt]$ scp -r spark-1.2.1-bin-hadoop2.4 root@slave2:/opt/

Change the owner and group on slave1:

[root@slave1 opt]# chown -R hadoop:hadoop spark-1.2.1-bin-hadoop2.4/

Change the owner and group on slave2:

[root@slave2 opt]# chown -R hadoop:hadoop spark-1.2.1-bin-hadoop2.4/

Add the environment variables on slave1:

[root@slave1 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark-1.2.1-bin-hadoop2.4
export PATH=$PATH:$SPARK_HOME/bin

[root@slave1 opt]# . /etc/profile

Add the environment variables on slave2:

[root@slave2 opt]# vim /etc/profile

export SPARK_HOME=/opt/spark-1.2.1-bin-hadoop2.4
export PATH=$PATH:$SPARK_HOME/bin

[root@slave2 opt]# . /etc/profile
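
Because the files were copied over as root, it is worth confirming that the hadoop user now owns the Spark directory on each slave; an optional check, assuming the same paths:

[root@slave1 opt]# ls -ld /opt/spark-1.2.1-bin-hadoop2.4    # owner and group should be hadoop:hadoop
[root@slave2 opt]# ls -ld /opt/spark-1.2.1-bin-hadoop2.4    # owner and group should be hadoop:hadoop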

5. Start the cluster

Start Hadoop first:

[hadoop@master hadoop-2.5.2]$ ./sbin/start-dfs.sh

[hadoop@master hadoop-2.5.2]$ ./sbin/start-yarn.sh

[hadoop@master hadoop-2.5.2]$ jps
25229 NameNode
25436 SecondaryNameNode
25862 Jps
25605 ResourceManager
[hadoop@master hadoop-2.5.2]$
These processes indicate Hadoop started successfully.

Then start Spark:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ ./sbin/start-all.sh

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ jps
26070 Master
25229 NameNode
26219 Worker
25436 SecondaryNameNode
25605 ResourceManager
26314 Jps
[hadoop@master spark-1.2.1-bin-hadoop2.4]$
The additional Master and Worker processes show that Spark started successfully.
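
jps on master only lists the local JVMs; to confirm that a Worker also came up on each slave, you can run jps remotely (an optional check, assuming passwordless SSH and that jps is on the remote PATH):

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ ssh slave1 jps    # should include a Worker process
[hadoop@master spark-1.2.1-bin-hadoop2.4]$ ssh slave2 jps    # should include a Worker process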

Web UI:

http://master:8080/

Enter the bin directory and launch spark-shell:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ cd bin

[hadoop@master bin]$ spark-shell

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/12 14:53:48 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/12 14:53:48 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/03/12 14:53:48 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/12 14:53:48 INFO spark.HttpServer: Starting HTTP Server
15/03/12 14:53:48 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/12 14:53:48 INFO server.AbstractConnector: Started [email protected]:47965
15/03/12 14:53:48 INFO util.Utils: Successfully started service 'HTTP class server' on port 47965.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
15/03/12 14:54:44 INFO spark.SecurityManager: Changing view acls to: hadoop
15/03/12 14:54:44 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/03/12 14:54:44 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/12 14:54:47 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/03/12 14:54:47 INFO Remoting: Starting remoting
15/03/12 14:54:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@master:35608]
15/03/12 14:54:48 INFO util.Utils: Successfully started service 'sparkDriver' on port 35608.
15/03/12 14:54:48 INFO spark.SparkEnv: Registering MapOutputTracker
15/03/12 14:54:48 INFO spark.SparkEnv: Registering BlockManagerMaster
15/03/12 14:54:48 INFO storage.DiskBlockManager: Created local directory at /tmp/spark-f86b289e-f690-4e31-9f8c-55814655620b/spark-c6d44057-0149-4046-bddb-7609e9b78984
15/03/12 14:54:48 INFO storage.MemoryStore: MemoryStore started with capacity 267.3 MB
15/03/12 14:54:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/12 14:54:51 INFO spark.HttpFileServer: HTTP File server directory is /tmp/spark-0ffa51b3-bb0a-4689-8dd5-1d649503b21f/spark-04debaff-ac2c-403f-8c12-13f3e1f63812
15/03/12 14:54:51 INFO spark.HttpServer: Starting HTTP Server
15/03/12 14:54:51 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/12 14:54:51 INFO server.AbstractConnector: Started [email protected]:38245
15/03/12 14:54:51 INFO util.Utils: Successfully started service 'HTTP file server' on port 38245.
15/03/12 14:54:52 INFO server.Server: jetty-8.y.z-SNAPSHOT
15/03/12 14:54:52 INFO server.AbstractConnector: Started [email protected]:4040
15/03/12 14:54:52 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
15/03/12 14:54:52 INFO ui.SparkUI: Started SparkUI at http://master:4040
15/03/12 14:54:52 INFO executor.Executor: Starting executor ID <driver> on host localhost
15/03/12 14:54:52 INFO executor.Executor: Using REPL class URI: http://192.168.189.136:47965
15/03/12 14:54:52 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@master:35608/user/HeartbeatReceiver
15/03/12 14:54:53 INFO netty.NettyBlockTransferService: Server created on 37564
15/03/12 14:54:53 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/03/12 14:54:53 INFO storage.BlockManagerMasterActor: Registering block manager localhost:37564 with 267.3 MB RAM, BlockManagerId(<driver>, localhost, 37564)
15/03/12 14:54:53 INFO storage.BlockManagerMaster: Registered BlockManager
15/03/12 14:54:53 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.

scala>

Open the Spark UI in a browser:

http://master:4040

6. Test

Copy the README.md file to HDFS:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ hadoop dfs -copyFromLocal README.md ./
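
On Hadoop 2.x the hadoop dfs form still works but prints a deprecation warning; the equivalent preferred command is:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ hdfs dfs -copyFromLocal README.md ./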

Check the file:

[hadoop@master hadoop-2.5.2]$ hadoop fs -ls -R README.md
-rw-r--r--   2 hadoop supergroup       3629 2015-03-12 15:11 README.md

Read the file in spark-shell:

scala> val file=sc.textFile("hdfs://master:9000/user/hadoop/README.md")

Count the lines containing "Spark":

scala> val sparks = file.filter(line=>line.contains("Spark"))

scala> sparks.count

15/03/12 15:28:47 INFO mapred.FileInputFormat: Total input paths to process : 1
15/03/12 15:28:47 INFO spark.SparkContext: Starting job: count at <console>:17
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Got job 0 (count at <console>:17) with 2 output partitions (allowLocal=false)
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Final stage: Stage 0(count at <console>:17)
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Missing parents: List()
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Submitting Stage 0 (FilteredRDD[2] at filter at <console>:14), which has no missing parents
15/03/12 15:28:47 INFO storage.MemoryStore: ensureFreeSpace(2752) called with curMem=187602, maxMem=280248975
15/03/12 15:28:47 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 2.7 KB, free 267.1 MB)
15/03/12 15:28:47 INFO storage.MemoryStore: ensureFreeSpace(1975) called with curMem=190354, maxMem=280248975
15/03/12 15:28:47 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 1975.0 B, free 267.1 MB)
15/03/12 15:28:47 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:37564 (size: 1975.0 B, free: 267.2 MB)
15/03/12 15:28:47 INFO storage.BlockManagerMaster: Updated info of block broadcast_1_piece0
15/03/12 15:28:47 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/03/12 15:28:47 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 0 (FilteredRDD[2] at filter at <console>:14)
15/03/12 15:28:47 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/12 15:28:47 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, ANY, 1304 bytes)
15/03/12 15:28:47 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, ANY, 1304 bytes)
15/03/12 15:28:47 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/03/12 15:28:47 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/12 15:28:48 INFO rdd.HadoopRDD: Input split: hdfs://master:9000/user/hadoop/README.md:0+1814
15/03/12 15:28:48 INFO rdd.HadoopRDD: Input split: hdfs://master:9000/user/hadoop/README.md:1814+1815
15/03/12 15:28:48 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/03/12 15:28:48 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/03/12 15:28:48 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/03/12 15:28:48 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/03/12 15:28:48 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/03/12 15:28:48 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 1757 bytes result sent to driver
15/03/12 15:28:48 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1757 bytes result sent to driver
15/03/12 15:28:48 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 567 ms on localhost (1/2)
15/03/12 15:28:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 565 ms on localhost (2/2)
15/03/12 15:28:48 INFO scheduler.DAGScheduler: Stage 0 (count at <console>:17) finished in 0.593 s
15/03/12 15:28:48 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/12 15:28:48 INFO scheduler.DAGScheduler: Job 0 finished: count at <console>:17, took 1.208066 s
res2: Long = 19

Verify with Linux's built-in commands; the line count (19) matches the result from spark-shell:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ grep Spark README.md|wc
     19     156    1232
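
As a further smoke test (not part of the original steps), the example programs bundled with Spark can be run against the standalone master, for instance SparkPi. This is only a sketch: it assumes the master is listening on the default port 7077 and that the 1.x run-example script honours the MASTER environment variable:

[hadoop@master spark-1.2.1-bin-hadoop2.4]$ MASTER=spark://master:7077 ./bin/run-example SparkPi 10    # should print an estimate of Pi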