Building a Hadoop Cluster on Alibaba Cloud ECS — setting up a "Cluster mode" Hadoop environment with two ECS servers
Ingredient:
Previously, in:
four earlier articles, I described what to watch out for in the local "/etc/hosts" IP-to-hostname mapping when setting up a Hadoop environment, how to configure passwordless ssh login between servers, and how to resolve various problems when starting Hadoop. All of these are must-knows when building Hadoop on ECS servers and can save a great deal of effort during setup. With those points taken care of, we can build the complete Hadoop environment.
#1 Node environment
##1.1 Environment overview
- Servers: 2 Alibaba Cloud ECS servers: 1 Master (test7972), 1 Slave (test25572)
- Operating system: Ubuntu 16.04.4 LTS
##1.2 Install Java
/opt/java/jdk1.8.0_162
Set the Java environment variables in "/etc/profile":
# set the java environment
export JAVA_HOME=/opt/java/jdk1.8.0_162
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
##1.3 Install ssh and rsync
$sudo apt-get install ssh
$sudo apt-get install rsync
##1.4 Set up passwordless ssh login
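Section 1.4 only names the step, so here is a minimal sketch of the mechanics. The `/tmp/demo_*` paths are stand-ins for illustration; on the real nodes you would use `~/.ssh/id_rsa` and `~/.ssh/authorized_keys`:

```shell
# Passwordless ssh needs two things:
#  1. a key pair on the Master,
#  2. the Master's public key appended to ~/.ssh/authorized_keys on every
#     node (including the Master itself, since start-dfs.sh also ssh-es
#     into the local machine).
rm -f /tmp/demo_id_rsa /tmp/demo_id_rsa.pub /tmp/demo_authorized_keys
ssh-keygen -t rsa -N "" -f /tmp/demo_id_rsa           # real path: ~/.ssh/id_rsa
cat /tmp/demo_id_rsa.pub >> /tmp/demo_authorized_keys  # on each node: ~/.ssh/authorized_keys
chmod 600 /tmp/demo_authorized_keys                    # sshd rejects loosely-permissioned files
```

In practice `ssh-copy-id root@test25572` performs the append step for you; afterwards `ssh test25572` from the Master should log in without a password prompt.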
#2 Download Hadoop
First download hadoop-2.9.1.tar.gz and extract it in a suitable location. Here the extracted path is:
/opt/hadoop/hadoop-2.9.1
#3 Configure local hostnames
The earlier articles mentioned above are all worth reviewing for reference.
To summarize: when mapping hostnames in the "/etc/hosts" file, follow 2 principles:
- **1 New entries go first:** place the newly added Master/Slave hostname entries (e.g. "test7972") before the ECS server's original local hostname entry (e.g. "iZuf67wb***********"). Note that the original ECS hostname entry (e.g. "iZuf67wb***************") must not be deleted, because other parts of the operating system still use it.
- **2 Private IP for this machine, public IP for the others:** for the machine you are on, map its hostname to its private (intranet) IP; map every other machine's hostname to that machine's public (internet) IP.
Following these two principles, the hostname entries configured for the two servers here are:
- Master:test7972
- Slave:test25572
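Putting the two principles together, the Master's "/etc/hosts" might look like this (the IP addresses below are placeholders for illustration, not the real ones):

```
# /etc/hosts on the Master (test7972)
172.16.0.10   test7972    # this machine itself: private (intranet) IP
47.100.0.20   test25572   # the other machine: public (internet) IP
# ...the original ECS hostname entry (iZuf67wb...) stays below, untouched
```

On the Slave the roles flip: test25572 maps to the Slave's own private IP, and test7972 maps to the Master's public IP.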
#4 Add Hadoop environment variables
Add the following to "/etc/profile":
# hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-2.9.1
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
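The new variables only take effect in shells that have re-read the profile. A self-contained sketch of the mechanics, using a throwaway `/tmp/demo_profile` instead of the real `/etc/profile`:

```shell
# Write the exports to a file, load it, and confirm they are visible.
cat > /tmp/demo_profile <<'EOF'
export HADOOP_HOME=/opt/hadoop/hadoop-2.9.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
EOF
source /tmp/demo_profile
echo "$HADOOP_CONF_DIR"   # prints /opt/hadoop/hadoop-2.9.1/etc/hadoop
```

On the real servers, run `source /etc/profile` (or log in again) and then check with `hadoop version`, which should report Hadoop 2.9.1.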
#5 Add the Hadoop configuration
There are currently 2 servers:
- Master: test7972
- Slave: test25572
Some of the settings below are added on both servers, while others are set only on a specific server.
##5.1 Shared configuration on Master and Slave
Configuration added on both Master and Slave:
###5.1.1 "etc/hadoop/core-site.xml"
Edit the file:
vi etc/hadoop/core-site.xml
Add:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://test7972:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/hadoop-2.9.1/tmp</value>
</property>
</configuration>
###5.1.2 "etc/hadoop/hdfs-site.xml"
Edit the file:
vi etc/hadoop/hdfs-site.xml
Add:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hadoop-2.9.1/tmp/dfs/name</value>
</property>
</configuration>
###5.1.3 "etc/hadoop/mapred-site.xml"
Edit the file:
vi etc/hadoop/mapred-site.xml
Add:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
##5.2 Specify where each node stores HDFS files (the default is /tmp)
###5.2.1 Create the directory for the Master's namenode, set permissions, and edit the file
Master node: namenode
Create the directory and grant permissions (note the path must match the 2.9.1 install tree used everywhere else):
mkdir -p /opt/hadoop/hadoop-2.9.1/tmp/dfs/name
chmod -R 777 /opt/hadoop/hadoop-2.9.1/tmp
Edit the file:
vi etc/hadoop/hdfs-site.xml
Add:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hadoop-2.9.1/tmp/dfs/name</value>
</property>
###5.2.2 Create the directory for the Slave's datanode, set permissions, and edit the file
Slave node: datanode
Create the directory and grant permissions (again under the 2.9.1 install tree, matching the value configured below):
mkdir -p /opt/hadoop/hadoop-2.9.1/tmp/dfs/data
chmod -R 777 /opt/hadoop/hadoop-2.9.1/tmp
Edit the file:
vi etc/hadoop/hdfs-site.xml
Add:
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hadoop-2.9.1/tmp/dfs/data</value>
</property>
##5.3 YARN configuration
###5.3.1 Edit the resourcemanager settings on the Master node
Master node: resourcemanager
Edit the file:
vi etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>test7972</value>
</property>
</configuration>
###5.3.2 Edit the nodemanager settings on the Slave node
Slave node: nodemanager
Edit the file:
vi etc/hadoop/yarn-site.xml
Add:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>test7972</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
###5.3.3 To run the "job history server" on the Master, add configuration on the Slave node
Slave node: configure the address of the Master's "job history server"
Edit the file:
vi etc/hadoop/mapred-site.xml
Add:
<property>
<name>mapreduce.jobhistory.address</name>
<value>test7972:10020</value>
</property>
##5.4 Double-check the configuration files on Master and Slave
The edits above were spread across several steps, so let's confirm the final contents of the configuration files on the two servers. If you already understand what each Hadoop configuration file is for, you can simply edit the files directly according to this step.
###5.4.1 Configuration files on the Master node
Four files on the Master node (test7972) have content added:
- 1 etc/hadoop/core-site.xml
vi etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://test7972:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/hadoop-2.9.1/tmp</value>
</property>
</configuration>
- 2 etc/hadoop/hdfs-site.xml
vi etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///opt/hadoop/hadoop-2.9.1/tmp/dfs/name</value>
</property>
</configuration>
- 3 etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- 4 etc/hadoop/yarn-site.xml
vi etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>test7972</value>
</property>
</configuration>
###5.4.2 Configuration files on the Slave node
Four files on the Slave node (test25572) have content added:
- 1 etc/hadoop/core-site.xml
vi etc/hadoop/core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://test7972:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/hadoop-2.9.1/tmp</value>
</property>
</configuration>
- 2 etc/hadoop/hdfs-site.xml
vi etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///opt/hadoop/hadoop-2.9.1/tmp/dfs/data</value>
</property>
</configuration>
- 3 etc/hadoop/mapred-site.xml
vi etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>test7972:10020</value>
</property>
</configuration>
- 4 etc/hadoop/yarn-site.xml
vi etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>test7972</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
#6 Format HDFS (Master, Slave)
Run the format command on both Master and Slave (strictly speaking, only the NameNode on the Master needs formatting; note also that `hadoop namenode -format` is deprecated in 2.x in favor of `hdfs namenode -format`, though both still work):
hadoop namenode -format
#7 Start Hadoop
##7.1 Start the daemons on the Master; the services on the Slave are started along with them
Start HDFS:
sbin/start-dfs.sh
Start YARN:
sbin/start-yarn.sh
You can also run both startups with a single command:
sbin/start-all.sh
Output:
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1# ./sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [test7972]
test7972: starting namenode, logging to /opt/hadoop/hadoop-2.9.1/logs/hadoop-root-namenode-iZ2ze72w7p5za2ax3zh81cZ.out
test25572: starting datanode, logging to /opt/hadoop/hadoop-2.9.1/logs/hadoop-root-datanode-iZuf67wbvlyduq07idw3pyZ.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/hadoop-2.9.1/logs/hadoop-root-secondarynamenode-iZ2ze72w7p5za2ax3zh81cZ.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/hadoop-2.9.1/logs/yarn-root-resourcemanager-iZ2ze72w7p5za2ax3zh81cZ.out
test25572: starting nodemanager, logging to /opt/hadoop/hadoop-2.9.1/logs/yarn-root-nodemanager-iZuf67wbvlyduq07idw3pyZ.out
Either way, the effect is the same.
##7.2 Start the job history server
Run on the Master:
sbin/mr-jobhistory-daemon.sh start historyserver
Output:
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1# ./sbin/mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/hadoop/hadoop-2.9.1/logs/mapred-root-historyserver-iZ2ze72w7p5za2ax3zh81cZ.out
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1#
##7.3 Confirm the running processes
- Run on the Master node:
jps
Output:
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1# jps
867 JobHistoryServer
938 Jps
555 ResourceManager
379 SecondaryNameNode
32639 NameNode
Not counting jps itself, 4 processes are running.
- Run on the Slave node:
jps
Output:
root@iZuf67wbvlyduq07idw3pyZ:/opt/hadoop/hadoop-2.9.1# jps
26510 Jps
26222 DataNode
26350 NodeManager
Not counting jps itself, 2 processes are running.
#8 Create an HDFS directory on the Master node and copy test files into it
Current working directory:
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1# pwd
/opt/hadoop/hadoop-2.9.1
##8.1 Create the HDFS directory on the Master node
./bin/hdfs dfs -mkdir -p /user/root/input
##8.2 Copy test files into the HDFS directory
Copy the test files:
./bin/hdfs dfs -put etc/hadoop/* /user/root/input
List the files now in HDFS:
./bin/hdfs dfs -ls /user/root/input
#9 Run a "Hadoop job" as a test
##9.1 Tail the log output
The job also prints to the console, but tailing the log files usually gives a clearer picture of what is happening. Two files are typically worth watching:
tail -f logs/hadoop-root-namenode-iZ2ze72w7p5za2ax3zh81cZ.log
tail -f logs/yarn-root-resourcemanager-iZ2ze72w7p5za2ax3zh81cZ.log
##9.2 Run the Hadoop job
Use one of the example jobs that ships with Hadoop:
./bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.1.jar grep /user/root/input output 'dfs[a-z.]+'
The results are written to the "output" directory. Note that "output" here really means:
/user/{username}/output
Since the current login user is root, the actual directory is:
/user/root/output/
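What the example job computes is essentially an ordinary grep-and-count over the input files, just executed in parallel. A purely local sketch of the same logic (the sample file path is made up for illustration):

```shell
# Count the occurrences of every string matching 'dfs[a-z.]+' in the
# input -- this is what the hadoop-mapreduce-examples "grep" job does,
# distributed across map and reduce tasks.
printf 'dfs.replication\ndfs.replication\ndfs.log\n' > /tmp/grep_demo.txt
grep -oE 'dfs[a-z.]+' /tmp/grep_demo.txt | sort | uniq -c | sort -rn
# ->    2 dfs.replication
#       1 dfs.log
```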
If the job is running normally, you will see its progress in the console:
......
18/07/30 16:04:32 INFO mapreduce.Job: The url to track the job: http://test7972:8088/proxy/application_1532936445384_0006/
18/07/30 16:04:32 INFO mapreduce.Job: Running job: job_1532936445384_0006
18/07/30 16:04:57 INFO mapreduce.Job: Job job_1532936445384_0006 running in uber mode : false
18/07/30 16:04:57 INFO mapreduce.Job: map 0% reduce 0%
18/07/30 16:06:03 INFO mapreduce.Job: map 13% reduce 0%
18/07/30 16:06:04 INFO mapreduce.Job: map 20% reduce 0%
18/07/30 16:07:08 INFO mapreduce.Job: map 27% reduce 0%
18/07/30 16:07:09 INFO mapreduce.Job: map 40% reduce 0%
18/07/30 16:08:12 INFO mapreduce.Job: map 50% reduce 0%
18/07/30 16:08:13 INFO mapreduce.Job: map 57% reduce 0%
18/07/30 16:08:15 INFO mapreduce.Job: map 57% reduce 19%
18/07/30 16:09:11 INFO mapreduce.Job: map 60% reduce 19%
18/07/30 16:09:12 INFO mapreduce.Job: map 67% reduce 19%
18/07/30 16:09:13 INFO mapreduce.Job: map 73% reduce 19%
18/07/30 16:09:18 INFO mapreduce.Job: map 73% reduce 24%
18/07/30 16:10:04 INFO mapreduce.Job: map 77% reduce 24%
18/07/30 16:10:06 INFO mapreduce.Job: map 90% reduce 24%
18/07/30 16:10:07 INFO mapreduce.Job: map 90% reduce 26%
18/07/30 16:10:14 INFO mapreduce.Job: map 90% reduce 30%
18/07/30 16:10:40 INFO mapreduce.Job: map 93% reduce 30%
18/07/30 16:10:44 INFO mapreduce.Job: map 100% reduce 31%
18/07/30 16:10:45 INFO mapreduce.Job: map 100% reduce 100%
18/07/30 16:10:53 INFO mapreduce.Job: Job job_1532936445384_0006 completed successfully
......
18/07/30 16:10:56 INFO mapreduce.Job: The url to track the job: http://test7972:8088/proxy/application_1532936445384_0007/
18/07/30 16:10:56 INFO mapreduce.Job: Running job: job_1532936445384_0007
18/07/30 16:11:21 INFO mapreduce.Job: Job job_1532936445384_0007 running in uber mode : false
18/07/30 16:11:21 INFO mapreduce.Job: map 0% reduce 0%
18/07/30 16:11:35 INFO mapreduce.Job: map 100% reduce 0%
18/07/30 16:11:48 INFO mapreduce.Job: map 100% reduce 100%
18/07/30 16:11:53 INFO mapreduce.Job: Job job_1532936445384_0007 completed successfully
18/07/30 16:11:53 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=362
FILE: Number of bytes written=395249
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=655
HDFS: Number of bytes written=244
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=11800
Total time spent by all reduces in occupied slots (ms)=10025
Total time spent by all map tasks (ms)=11800
Total time spent by all reduce tasks (ms)=10025
Total vcore-milliseconds taken by all map tasks=11800
Total vcore-milliseconds taken by all reduce tasks=10025
Total megabyte-milliseconds taken by all map tasks=12083200
Total megabyte-milliseconds taken by all reduce tasks=10265600
Map-Reduce Framework
Map input records=14
Map output records=14
Map output bytes=328
Map output materialized bytes=362
Input split bytes=129
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=362
Reduce input records=14
Reduce output records=14
Spilled Records=28
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=443
CPU time spent (ms)=1480
Physical memory (bytes) snapshot=367788032
Virtual memory (bytes) snapshot=3771973632
Total committed heap usage (bytes)=170004480
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=526
File Output Format Counters
Bytes Written=244
The output contains the success message:
mapreduce.Job: Job job_1532936445384_0007 completed successfully
so our MapReduce job succeeded!
##9.3 View the results
Run the query command:
./bin/hdfs dfs -cat output/*
The computed results:
root@iZ2ze72w7p5za2ax3zh81cZ:/opt/hadoop/hadoop-2.9.1# ./bin/hdfs dfs -cat output/*
6 dfs.audit.logger
4 dfs.class
3 dfs.logger
3 dfs.server.namenode.
2 dfs.audit.log.maxbackupindex
2 dfs.period
2 dfs.audit.log.maxfilesize
1 dfs.replication
1 dfs.log
1 dfs.file
1 dfs.servers
1 dfsadmin
1 dfsmetrics.log
1 dfs.namenode.name.dir
#10 Setup complete
With that, a Hadoop cluster with 1 Master and 1 Slave is up and running on Alibaba Cloud ECS!
#11 References