Distributed Environment Setup: Environment Overview

In a previous article we covered how to set up a pseudo-distributed Hadoop environment on a single machine. Real-world deployments, however, are multi-machine, multi-node distributed clusters, so this article walks through setting up a distributed Hadoop environment across several machines.

I have prepared three machines here, with the following IP addresses:

192.168.77.128
192.168.77.130
192.168.77.134

First, edit the /etc/hosts file on all three machines: set the machine's hostname and add entries for the other machines' hostnames.

[root@hadoop000 ~]# vim /etc/hosts  # do this on all three machines
192.168.77.128 hadoop000
192.168.77.130 hadoop001
192.168.77.134 hadoop002
[root@hadoop000 ~]# reboot
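Since every machine needs the same three hosts entries, it can help to keep the node table in one place. A minimal sketch (the `nodes` table and the `ip_of` helper are illustrative, not part of Hadoop or any standard tool):

```shell
# The IP/hostname pairs used in this article, one node per line
nodes="192.168.77.128 hadoop000
192.168.77.130 hadoop001
192.168.77.134 hadoop002"

# Look up a node's IP address from the table above
ip_of() { echo "$nodes" | awk -v h="$1" '$2 == h { print $1 }'; }

ip_of hadoop001   # prints 192.168.77.130
```

The same table can then be appended verbatim to /etc/hosts on each machine, which avoids typos creeping in between hosts.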

The roles each machine plays in the cluster:

hadoop000: NameNode, DataNode, ResourceManager, NodeManager
hadoop001: DataNode, NodeManager
hadoop002: DataNode, NodeManager

Configuring passwordless SSH login

The machines in the cluster need to communicate with each other, so we first configure passwordless login. Run the following command on each of the three machines to generate a key pair:

[root@hadoop000 ~]# ssh-keygen -t rsa  # run on all three machines to generate a key pair
Generating public/private rsa key pair.
Enter file in which to save the key (/root/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.
The key fingerprint is:
0d:00:bd:a3:69:b7:03:d5:89:dc:a8:a2:ca:28:d6:06 root@hadoop000
The key's randomart image is:
+--[ RSA 2048]----+
|             .o. |
|             ... |
|            . .. |
|         B  +o   |
|        = .S .   |
|        E. * .   |
|       .oo o .   |
|      =. o o     |
|      ..  .      |
+-----------------+
[root@hadoop000 ~]# ls .ssh/
authorized_keys  id_rsa  id_rsa.pub  known_hosts
[root@hadoop000 ~]#
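The interactive prompts above can also be skipped entirely with `-N ''` (empty passphrase) and `-f` (key path), which is handy when scripting the setup for several machines. A sketch that writes to a scratch directory so it is safe to try; on the real machines you would use ~/.ssh/id_rsa instead:

```shell
# Generate a key pair non-interactively into a temporary directory
keydir=$(mktemp -d)
ssh-keygen -q -t rsa -N '' -f "$keydir/id_rsa"

# Both the private and public key should now exist
ls "$keydir"
```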

Starting with hadoop000, run the following commands to copy its public key to each of the machines:

[root@hadoop000 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop000
[root@hadoop000 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop001
[root@hadoop000 ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop002

Note: the other two machines need to run these same three commands as well.
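That is nine ssh-copy-id invocations in total (three per machine), which is easier to express as a loop. A dry-run sketch — the commands are echoed rather than executed so the loop can be inspected first; remove the `echo` to actually run it on each machine:

```shell
# Copy this machine's public key to every node in the cluster (dry run)
for host in hadoop000 hadoop001 hadoop002; do
    echo ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$host"
done
```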

Once the keys are copied, test that passwordless login works:

[root@hadoop000 ~]# ssh hadoop000
Last login: Mon Apr  2 17:20:02 2018 from localhost
[root@hadoop000 ~]# ssh hadoop001
Last login: Tue Apr  3 00:49:59 2018 from 192.168.77.1
[root@hadoop001 ~]# logout
Connection to hadoop001 closed.
[root@hadoop000 ~]# ssh hadoop002
Last login: Tue Apr  3 00:50:03 2018 from 192.168.77.1
[root@hadoop002 ~]# logout
Connection to hadoop002 closed.
[root@hadoop000 ~]# logout
Connection to hadoop000 closed.
[root@hadoop000 ~]#

As shown above, hadoop000 can now log in to the other two machines without a password, so the configuration succeeded.

Installing the JDK

Get the JDK download link from the Oracle website; I am using JDK 1.8 here.

Use wget to download the JDK into the /usr/local/src/ directory; I have already downloaded it here:

[root@hadoop000 ~]# cd /usr/local/src/
[root@hadoop000 /usr/local/src]# ls
jdk-8u151-linux-x64.tar.gz
[root@hadoop000 /usr/local/src]#

Extract the downloaded archive and move the extracted directory to /usr/local/:

[root@hadoop000 /usr/local/src]# tar -zxvf jdk-8u151-linux-x64.tar.gz
[root@hadoop000 /usr/local/src]# mv ./jdk1.8.0_151 /usr/local/jdk1.8

Edit /etc/profile to configure the environment variables:

[root@hadoop000 ~]# vim /etc/profile  # add the following
JAVA_HOME=/usr/local/jdk1.8/
JAVA_BIN=/usr/local/jdk1.8/bin
JRE_HOME=/usr/local/jdk1.8/jre
PATH=$PATH:/usr/local/jdk1.8/bin:/usr/local/jdk1.8/jre/bin
CLASSPATH=/usr/local/jdk1.8/jre/lib:/usr/local/jdk1.8/lib:/usr/local/jdk1.8/jre/lib/charsets.jar
export JAVA_HOME JAVA_BIN JRE_HOME PATH CLASSPATH  # export so child processes see the variables

Load the file with the source command so it takes effect; afterwards, running java -version shows the JDK version:

[root@hadoop000 ~]# source /etc/profile
[root@hadoop000 ~]# java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
[root@hadoop000 ~]#

With the JDK installed on hadoop000, use the scp command to copy the JDK directory and the profile to the other machines:

[root@hadoop000 ~]# scp -r /usr/local/jdk1.8 hadoop001:/usr/local
[root@hadoop000 ~]# scp -r /usr/local/jdk1.8 hadoop002:/usr/local
[root@hadoop000 ~]# scp /etc/profile hadoop001:/etc/profile
[root@hadoop000 ~]# scp /etc/profile hadoop002:/etc/profile

Once copied, source the profile on each of the two machines so the environment variables take effect, then run java -version to confirm the JDK installed successfully.
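The same pair of scp commands is repeated for every slave, so a loop keeps them consistent as the cluster grows. A dry-run sketch (commands echoed, not executed; drop the `echo` to run it for real):

```shell
# Distribute the JDK and profile to every slave node (dry run)
for host in hadoop001 hadoop002; do
    echo scp -r /usr/local/jdk1.8 "$host":/usr/local
    echo scp /etc/profile "$host":/etc/profile
done
```

Note that sourcing the profile must still happen in a shell on each target machine; environment changes do not propagate over a one-off scp or ssh command.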

Configuring and distributing Hadoop

Download the Hadoop 2.6.0-cdh5.7.0 tar.gz package and extract it:

[root@hadoop000 ~]# cd /usr/local/src/
[root@hadoop000 /usr/local/src]# wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.7.0.tar.gz
[root@hadoop000 /usr/local/src]# tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/

Note: if the download is very slow on Linux, you can download the same link on Windows with a download manager such as Thunder (Xunlei) and then upload the file to the Linux machine; that is usually faster.

After extracting, enter the extracted directory; Hadoop's directory layout looks like this:

[root@hadoop000 /usr/local/src]# cd /usr/local/hadoop-2.6.0-cdh5.7.0/
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0]# ls
bin             cloudera  examples             include  libexec      NOTICE.txt  sbin   src
bin-mapreduce1  etc       examples-mapreduce1  lib      LICENSE.txt  README.txt  share
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0]#

A quick note on what a few of these directories hold:

bin: executable files
etc: configuration files
sbin: scripts for starting and stopping services
share: jar files and documentation

With that, Hadoop is installed. Next we edit the configuration files, starting with JAVA_HOME:

[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0]# cd etc/hadoop
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim hadoop-env.sh
export JAVA_HOME=/usr/local/jdk1.8/  # adjust to match your environment
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]#

Then add Hadoop's install directory to the environment variables, to make its commands convenient to use later:

[root@hadoop000 ~]# vim ~/.bash_profile  # add the following
export HADOOP_HOME=/usr/local/hadoop-2.6.0-cdh5.7.0/
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
[root@hadoop000 ~]# source ~/.bash_profile
[root@hadoop000 ~]#

Next, edit the core-site.xml and hdfs-site.xml configuration files:

[root@hadoop000 ~]# cd $HADOOP_HOME
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0]# cd etc/hadoop
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim core-site.xml  # add the following
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop000:8020</value>  <!-- default filesystem URI: access address and port -->
    </property>
</configuration>
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim hdfs-site.xml  # add the following
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/hadoop/app/tmp/dfs/name</value>  <!-- directory where the NameNode stores its files -->
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/hadoop/app/tmp/dfs/data</value>  <!-- directory where the DataNode stores its files -->
    </property>
</configuration>
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# mkdir -p /data/hadoop/app/tmp/dfs/name
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# mkdir -p /data/hadoop/app/tmp/dfs/data
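These XML fragments can also be written non-interactively with a heredoc, which is convenient when repeating the setup on many machines. A sketch that writes core-site.xml into a scratch directory (the real file lives in $HADOOP_HOME/etc/hadoop) and sanity-checks the value:

```shell
# Write core-site.xml into a temporary directory and verify the HDFS URI
conf=$(mktemp -d)
cat > "$conf/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop000:8020</value>
    </property>
</configuration>
EOF

# Extract the configured filesystem URI back out of the file
grep -o 'hdfs://[^<]*' "$conf/core-site.xml"   # prints hdfs://hadoop000:8020
```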

The yarn-site.xml configuration file also needs editing:

[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim yarn-site.xml  # add the following
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop000</value>
    </property>
</configuration>
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]#

Copy and edit the MapReduce configuration file:

[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# cp mapred-site.xml.template mapred-site.xml
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim !$  # add the following
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]#

Finally, list the slave nodes' hostnames; if hostnames are not configured, use IP addresses instead:

[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]# vim slaves
hadoop000
hadoop001
hadoop002
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/etc/hadoop]#
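The slaves file is simply one hostname per line, so it can be generated from the same node list used earlier instead of typed by hand. A sketch writing to a scratch directory (the real file is $HADOOP_HOME/etc/hadoop/slaves):

```shell
# Generate a slaves file from the list of cluster hostnames
tmp=$(mktemp -d)
printf '%s\n' hadoop000 hadoop001 hadoop002 > "$tmp/slaves"

# It should contain exactly one line per node
wc -l < "$tmp/slaves"
```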

Package the configured installation into a tarball:

[root@hadoop000 /usr/local]# tar -zcvf /usr/local/etc/hadoop-2.6.0-cdh5.7.0.tar.gz hadoop-2.6.0-cdh5.7.0

At this point the master node (hadoop000) has a complete Hadoop cluster setup, but the two slave machines do not yet have a Hadoop environment. So next we distribute the Hadoop installation and the environment-variable file from hadoop000 to the other two machines, running the following commands:

[root@hadoop000 ~]# scp /usr/local/etc/hadoop-2.6.0-cdh5.7.0.tar.gz hadoop001:/usr/local/
[root@hadoop001 /usr/local]# tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/   # run on hadoop001
[root@hadoop000 ~]# scp /usr/local/etc/hadoop-2.6.0-cdh5.7.0.tar.gz hadoop002:/usr/local/src
[root@hadoop002 /usr/local/src]# tar -zxvf hadoop-2.6.0-cdh5.7.0.tar.gz -C /usr/local/   # run on hadoop002
[root@hadoop000 ~]# scp ~/.bash_profile hadoop001:~/.bash_profile
[root@hadoop000 ~]# scp ~/.bash_profile hadoop002:~/.bash_profile

After distribution, go to each of the two machines, run source, and create the data directories:

[root@hadoop001 ~]# source ~/.bash_profile
[root@hadoop001 ~]# mkdir -p /data/hadoop/app/tmp/dfs/name
[root@hadoop001 ~]# mkdir -p /data/hadoop/app/tmp/dfs/data
[root@hadoop002 ~]# source ~/.bash_profile
[root@hadoop002 ~]# mkdir -p /data/hadoop/app/tmp/dfs/name
[root@hadoop002 ~]# mkdir -p /data/hadoop/app/tmp/dfs/data
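With passwordless SSH in place, the directory creation on the slaves can also be driven from hadoop000 in one loop. A dry-run sketch (the ssh commands are echoed, not executed; drop the `echo` to run them):

```shell
# Create the NameNode/DataNode data directories on every slave (dry run)
for host in hadoop001 hadoop002; do
    echo ssh "$host" "mkdir -p /data/hadoop/app/tmp/dfs/name /data/hadoop/app/tmp/dfs/data"
done
```

The source step, however, must still be done in a login shell on each machine, since environment changes made by a one-off ssh command do not persist.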

Formatting Hadoop and starting/stopping the cluster

Format the NameNode; this only needs to be run on hadoop000:

[root@hadoop000 ~]# hdfs namenode -format

Once formatting is complete, the Hadoop cluster can be started:

[root@hadoop000 ~]# start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
18/04/02 20:10:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [hadoop000]
hadoop000: starting namenode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-root-namenode-hadoop000.out
hadoop000: starting datanode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-root-datanode-hadoop000.out
hadoop001: starting datanode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-root-datanode-hadoop001.out
hadoop002: starting datanode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-root-datanode-hadoop002.out
Starting secondary namenodes [0.0.0.0]
The authenticity of host '0.0.0.0 (0.0.0.0)' can't be established.
ECDSA key fingerprint is 4d:5a:9d:31:65:75:30:47:a3:9c:f5:56:63:c4:0f:6a.
Are you sure you want to continue connecting (yes/no)? yes  # type yes here
0.0.0.0: Warning: Permanently added '0.0.0.0' (ECDSA) to the list of known hosts.
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/hadoop-root-secondarynamenode-hadoop000.out
18/04/02 20:11:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-root-resourcemanager-hadoop000.out
hadoop001: starting nodemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-root-nodemanager-hadoop001.out
hadoop002: starting nodemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-root-nodemanager-hadoop002.out
hadoop000: starting nodemanager, logging to /usr/local/hadoop-2.6.0-cdh5.7.0/logs/yarn-root-nodemanager-hadoop000.out
[root@hadoop000 ~]# jps  # check that the following processes are present
6256 Jps
5538 DataNode
5843 ResourceManager
5413 NameNode
5702 SecondaryNameNode
5945 NodeManager
[root@hadoop000 ~]#

Check the processes on the other two machines:

hadoop001:

[root@hadoop001 ~]# jps
3425 DataNode
3538 NodeManager
3833 Jps
[root@hadoop001 ~]#

hadoop002:

[root@hadoop002 ~]# jps
3171 DataNode
3273 NodeManager
3405 Jps
[root@hadoop002 ~]#
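Checking jps output by eye gets tedious across several nodes. A small helper that verifies a jps listing contains the daemons a node should run; the `check_daemons` function is illustrative (not part of Hadoop), and the sample input mimics the hadoop001 output above:

```shell
# check_daemons <jps output> <expected process name>...
# Prints "ok" if every expected daemon appears; otherwise names the first
# missing one and returns non-zero.
check_daemons() {
    out=$1; shift
    for proc in "$@"; do
        echo "$out" | grep -qw "$proc" || { echo "missing: $proc"; return 1; }
    done
    echo ok
}

sample='3425 DataNode
3538 NodeManager
3833 Jps'

check_daemons "$sample" DataNode NodeManager   # prints ok
```

On a real node you would feed it live output, e.g. `check_daemons "$(jps)" DataNode NodeManager`.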

Once the processes on every machine check out, open port 50070 on the master node in a browser, e.g. 192.168.77.128:50070. You will see a page like the following:

Click "Live Nodes" to view the live nodes:

As above, being able to reach port 50070 means HDFS is working in the cluster.

Next we also visit port 8088 on the master node, which is YARN's web service port, e.g. 192.168.77.128:8088. It looks like this:

Click "Active Nodes" to view the live nodes:

With that, our distributed Hadoop cluster environment is complete; it is that simple. So, having started the cluster, how do we shut it down? Also simple: run the following command on the master node:

[root@hadoop000 ~]# stop-all.sh

Using HDFS and YARN in the distributed environment

In practice, using HDFS and YARN in a distributed environment is exactly the same as in the pseudo-distributed one; the HDFS shell commands, for instance, work just as they did before:

[root@hadoop000 ~]# hdfs dfs -ls /
[root@hadoop000 ~]# hdfs dfs -mkdir /data
[root@hadoop000 ~]# hdfs dfs -put ./test.sh /data
[root@hadoop000 ~]# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2018-04-02 20:29 /data
[root@hadoop000 ~]# hdfs dfs -ls /data
Found 1 items
-rw-r--r--   3 root supergroup         68 2018-04-02 20:29 /data/test.sh
[root@hadoop000 ~]#

The other nodes in the cluster can access HDFS too, and within the cluster HDFS is shared: every node sees the same data. For example, on the hadoop001 node I upload a directory:

[root@hadoop001 ~]# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2018-04-02 20:29 /data
[root@hadoop001 ~]# hdfs dfs -put ./logs /
[root@hadoop001 ~]# hdfs dfs -ls /
drwxr-xr-x   - root supergroup          0 2018-04-02 20:29 /data
drwxr-xr-x   - root supergroup          0 2018-04-02 20:31 /logs
[root@hadoop001 ~]#

Then check on hadoop002:

[root@hadoop002 ~]# hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - root supergroup          0 2018-04-02 20:29 /data
drwxr-xr-x   - root supergroup          0 2018-04-02 20:31 /logs
[root@hadoop002 ~]#

As you can see, different nodes access the same data. Since the operations are the same as in the pseudo-distributed setup, I won't demonstrate more of them here.

After this brief demonstration of HDFS operations, let's run one of Hadoop's bundled examples and see whether YARN picks up the job's execution information. On any node, run the following (pi 3 4 runs the Monte Carlo pi estimator with 3 map tasks and 4 samples per map):

[root@hadoop000 ~]# cd /usr/local/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce
[root@hadoop000 /usr/local/hadoop-2.6.0-cdh5.7.0/share/hadoop/mapreduce]# hadoop jar ./hadoop-mapreduce-examples-2.6.0-cdh5.7.0.jar pi 3 4

The job applying for resources:

The task running:

After it completes, run the earlier command again; this time the task finishes successfully: