This post summarizes the full process of building a distributed Hadoop cluster, which has been completely set up across four virtual machines, along with the problems encountered on the way. Following the steps below should let you build a highly available distributed Hadoop cluster fairly smoothly.

The installation of this family of distributed components breaks down roughly into the following steps:

Step 1. Configure mutual trust (passwordless SSH) between the machines

   How mutual trust works: mutual trust means that one machine can log in to another directly, without entering a password, using certificate-based (public-key) trust. The weakness of connection methods such as FTP and Telnet is that they transmit in plain text, so a man in the middle can impersonate the real server and intercept the traffic, which creates a security problem. SSH (Secure Shell) supports two login methods: password login, where you enter a username and password, and key-based authentication, where you generate a key pair for yourself and place the public key on the server. When connecting, the client first sends a request containing its public key and asks the server to verify it; the server compares it against all the public keys it has stored. If a match is found, the server encrypts a challenge and sends it to the client; the client decrypts it with its private key and returns the result to the server for verification. If the verification succeeds, communication can proceed.

In essence it is just a back-and-forth handshake: A sends its public key to B, B sends A an encrypted challenge, A decrypts it and sends the result back to B, and the connection can be established.

With the principle understood, configure mutual trust as follows. There are several steps:

Step 1: Create the hadoop user

sudo addgroup hadoop
sudo adduser --ingroup hadoop hadoop
      The first line creates the hadoop group; the second adds the hadoop user and places it in the hadoop group.

From the current user you can run  sudo su - hadoop  to switch to the hadoop user. Do not forget the dash (-): it loads the environment variables and related settings into the new user's session, which avoids a lot of unnecessary trouble. However, the hadoop user cannot use sudo yet; one more file has to be modified. Running sudo at this point produces the following error:

hadoop is not in the sudoers file.  This incident will be reported.

To fix this, edit the /etc/sudoers file. By default this file is not writable.

First switch to the root user:

sudo su -
      Then make the file writable:
chmod u+w /etc/sudoers
      This adds w (write) permission for u (the file's owner). As root, you can now edit the file.

Add the following line:

hadoop  ALL=(ALL:ALL) ALL

Save with :wq and then restore the file's permissions:

chmod u-w /etc/sudoers
      Switch back to the hadoop user and you will find that sudo now works.
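As an aside, toggling the write bit on /etc/sudoers works, but a syntax error saved this way can lock sudo out entirely. A safer sketch using the standard visudo tool, which validates the file before saving:

```shell
# visudo opens /etc/sudoers in an editor and refuses to save a file
# with syntax errors, so a typo cannot lock you out of sudo.
sudo visudo
# Inside the editor, add the same line as above:
#   hadoop  ALL=(ALL:ALL) ALL
```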

Step 2: Change the machine's hostname. For example, my Ubuntu VM logs in by default as [email protected]; you can either log in as the hadoop user directly or switch to it with sudo su - hadoop.

Edit the hostname:

sudo vim /etc/hostname
      Open the file, delete the existing hostname, replace it with hadoop01-namenode to mark this machine as the Hadoop NameNode, then save and exit. You also need to map hostnames to IP addresses. Every hostname used in the cluster must be mapped; only then can you log in with commands like ssh hadoop03-datanode, otherwise you can connect only by IP address. Here is how to set up the mapping:
sudo vim /etc/hosts
      Append the mappings you need at the bottom, in the form of an IP address followed by the hostname, for example
192.168.79.183 hadoop01-namenode
      Save and exit.
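For a four-node cluster like the one described here, the full /etc/hosts mapping would look like the fragment below. Only 192.168.79.183 appears in the original setup; the other three addresses are hypothetical placeholders for your own, and the same entries should exist on every node:

```
192.168.79.183 hadoop01-namenode
192.168.79.184 hadoop02-datanode
192.168.79.185 hadoop03-datanode
192.168.79.186 hadoop04-datanode
```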
      Step 3: Generate the trust keys. This is where mutual trust is actually configured.
ssh-keygen -t rsa
       Press Enter at every prompt and the key pair is generated.
cd ~/.ssh
cat id_rsa.pub
      You will see the public key contents. Create a file to hold authorized public keys:
touch authorized_keys
chmod 600 authorized_keys

      Copy the contents of id_rsa.pub into authorized_keys, then save and exit. Every machine in the trust relationship needs this same setup, and each machine's public key must be appended to the authorized_keys file of any machine it should connect to: if A and C want to connect to B, then A's and C's public keys must both be appended to B's authorized_keys. At this point the trust configuration is complete. Verify that SSH is installed on the machine; if you see the error

ssh: connect to host localhost port 22: Connection refused

then the SSH server is not installed; install it from the Ubuntu package repository as follows:

sudo apt-get install openssh-server
      Now run  ssh hadoop01-namenode  again: after answering yes to the connection prompt, you can log in to the machine itself without a password, which shows that mutual trust is working.

With that, mutual trust is configured.
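Instead of copying id_rsa.pub by hand into every authorized_keys file, the stock ssh-copy-id tool performs the append (and fixes the file permissions) for you. A sketch, assuming the four hostnames above are mapped and the hadoop user exists on every node; run it once on each machine:

```shell
# Append this machine's public key to hadoop@host:~/.ssh/authorized_keys
# on every node (including itself), prompting for the password one last time.
for host in hadoop01-namenode hadoop02-datanode hadoop03-datanode hadoop04-datanode; do
    ssh-copy-id "hadoop@$host"
done
```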

Step 2. Configure Hadoop in HA mode

The following sections show Hadoop's core configuration files.

Here is core-site.xml:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>

<name>hadoop.tmp.dir</name>

<value>file:/home/hadoop/tmp/</value>

<description>Abase for other temporary directories.</description>

</property>
<!-- ZooKeeper quorum address, used for high availability -->
 <property>
      <name>ha.zookeeper.quorum</name>
      <value>hadoop02-datanode:2181,hadoop03-datanode:2181,hadoop04-datanode:2181</value>
 </property>

<property>
<!-- set the HDFS nameservice to ns -->
<name>fs.defaultFS</name>

<value>hdfs://ns</value>

</property>

</configuration>
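Once this file is in place (and HADOOP_CONF_DIR points at it), a quick way to catch typos is to ask Hadoop which values it actually resolved. A sketch using the stock hdfs getconf tool:

```shell
hdfs getconf -confKey fs.defaultFS          # should print hdfs://ns
hdfs getconf -confKey ha.zookeeper.quorum   # should print the three ZooKeeper hosts
```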

Here is hdfs-site.xml:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
      <property>    
          <name>dfs.nameservices</name>    
          <value>ns</value>    
      </property>  
      <property>
          <!-- the two HA NameNodes -->
          <name>dfs.ha.namenodes.ns</name>
          <value>nn1,nn2</value>
      </property>
  <property>
          <name>dfs.namenode.rpc-address.ns.nn1</name>
          <value>hadoop02-datanode:9000</value>
      </property>
  <property>
          <name>dfs.namenode.http-address.ns.nn1</name>
          <value>hadoop02-datanode:50070</value>
      </property>
  <property>
          <name>dfs.namenode.rpc-address.ns.nn2</name>
          <value>hadoop03-datanode:9000</value>
      </property>
  <property>
          <name>dfs.namenode.http-address.ns.nn2</name>
          <value>hadoop03-datanode:50070</value>
      </property>
  <property>
          <name>dfs.namenode.shared.edits.dir</name>
          <value>qjournal://hadoop02-datanode:8485;hadoop03-datanode:8485;hadoop04-datanode:8485/ns</value>
      </property>
  <property>
            <name>dfs.journalnode.edits.dir</name>
            <value>/home/hadoop/journal</value>
      </property>
  <property>
            <name>dfs.ha.automatic-failover.enabled</name>
            <value>true</value>
      </property>
  <property>
              <name>dfs.client.failover.proxy.provider.ns</name>
              <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>
  <property>
              <name>dfs.ha.fencing.methods</name>
              <value>sshfence</value>
      </property>
  <property>
              <name>dfs.ha.fencing.ssh.private-key-files</name>
            <value>/home/hadoop/.ssh/id_rsa</value>
      </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/workspace/hdfs/name</value>
  </property>

  <property>
    <name>hadoop.tmp.dir</name>
        <value>/home/hadoop/tmp/</value>
  </property>

  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/workspace/hdfs/data</value>
  </property>

<!--
  <property>
          <name>dfs.namenode.secondary.http-address</name>
          <value>192.168.79.183:9001</value>
  </property>
-->
  <property>
        <name>dfs.webhdfs.enabled</name>
       <value>true</value>
  </property>

</configuration>
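Writing the files is only half of an HA setup; on first start the cluster also has to be initialized in the right order. The sequence below is the standard Hadoop 2.x one, with hostnames following this article's naming; exact script names can vary slightly between versions, so treat it as a sketch rather than a recipe:

```shell
# 1. Start ZooKeeper on the three quorum nodes (hadoop02/03/04-datanode).
zkServer.sh start

# 2. Start a JournalNode on each node listed in dfs.namenode.shared.edits.dir.
hadoop-daemon.sh start journalnode

# 3. On the first NameNode (nn1, hadoop02-datanode): format HDFS and start it.
hdfs namenode -format
hadoop-daemon.sh start namenode

# 4. On the second NameNode (nn2, hadoop03-datanode): copy nn1's metadata.
hdfs namenode -bootstrapStandby

# 5. On either NameNode: create the automatic-failover znode in ZooKeeper.
hdfs zkfc -formatZK

# 6. Start everything (NameNodes, DataNodes, ZKFCs) from one node.
start-dfs.sh
```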
Here is mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>

<name>mapred.job.tracker</name>

<value>hadoop02-datanode:9001</value>

</property>
<property>
        <name>mapreduce.map.memory.mb</name>
        <value>768</value>
        </property>
<property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx512m</value>
        </property>

<property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>1536</value>
</property>
<property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx1024m</value>
</property>


<property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
</property>
<property>
        <name>mapreduce.jobhistory.address</name>
        <value>192.168.79.183:10020</value>
</property>
<property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>192.168.79.183:19888</value>
</property>
<property>
        <name>mapreduce.application.classpath</name>
<value>/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>  


</property>

</configuration>
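The pairing of the *.memory.mb values with the *.java.opts values above follows a common rule of thumb: the JVM heap (-Xmx) is set to roughly two thirds of the container size, leaving headroom for non-heap memory (stacks, native buffers, metaspace). A quick check that the values in this file follow it:

```shell
# Container sizes taken from the mapred-site.xml above (in MB).
map_container=768
reduce_container=1536

# Two-thirds rule of thumb for the JVM heap.
echo $(( map_container * 2 / 3 ))     # 512  -> mapreduce.map.java.opts = -Xmx512m
echo $(( reduce_container * 2 / 3 ))  # 1024 -> mapreduce.reduce.java.opts = -Xmx1024m
```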

Here is yarn-site.xml:
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
  <property>
   <name>yarn.nodemanager.aux-services</name>
   <value>mapreduce_shuffle</value>
  </property>
<property>
                <description>Classpath for typical applications.</description>
                <name>yarn.application.classpath</name>
                <value>/usr/local/hadoop/lib/*,/usr/local/hadoop/share/hadoop/common/*,/usr/local/hadoop/share/hadoop/common/lib/*,/usr/local/hadoop/share/hadoop/hdfs/*,/usr/local/hadoop/share/hadoop/hdfs/lib/*,/usr/local/hadoop/share/hadoop/httpfs/*,/usr/local/hadoop/share/hadoop/httpfs/lib/*,/usr/local/hadoop/share/hadoop/kms/*,/usr/local/hadoop/share/hadoop/kms/lib/*,/usr/local/hadoop/share/hadoop/mapreduce/*,/usr/local/hadoop/share/hadoop/mapreduce/lib/*,/usr/local/hadoop/share/hadoop/tools/*,/usr/local/hadoop/share/hadoop/tools/lib/*,/usr/local/hadoop/share/hadoop/yarn/*,/usr/local/hadoop/share/hadoop/yarn/lib/*</value>
  </property>

  <property>
   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
   <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
   <name>yarn.resourcemanager.address</name>
   <value>hadoop02-datanode:8032</value>
  </property>
  <property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>hadoop02-datanode:8030</value>
  </property>
  <property>
   <name>yarn.resourcemanager.resource-tracker.address</name>
   <value>hadoop02-datanode:8035</value>
  </property>
  <property>
   <name>yarn.resourcemanager.admin.address</name>
   <value>hadoop02-datanode:8033</value>
  </property>
  <property>
   <name>yarn.resourcemanager.webapp.address</name>
   <value>hadoop02-datanode:8088</value>
  </property>
<property>
                <description>Amount of physical memory, in MB, that can be allocated
                  for containers.</description>
                <name>yarn.nodemanager.resource.memory-mb</name>
                <value>3096</value>
        </property>

        <property>
                <description>The class to use as the resource scheduler.</description>
                <name>yarn.resourcemanager.scheduler.class</name>
                <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
        </property>

 <property>
               <description>The minimum allocation for every container request at the RM,
                                                    in MBs. Memory requests lower than this won't take effect,
                   and the specified value will get allocated at minimum.</description>
               <name>yarn.scheduler.minimum-allocation-mb</name>
               <value>1024</value>
        </property>

        <property>
               <description>The maximum allocation for every container request at the RM,
                  in MBs. Memory requests higher than this won't take effect,
                   and will get capped to this value.</description>
               <name>yarn.scheduler.maximum-allocation-mb</name>
               <value>2048</value>
        </property>
        <property>
               <name>yarn.app.mapreduce.am.resource.mb</name>
               <value>768</value>
        </property>
        <property>
               <name>yarn.app.mapreduce.am.command-opts</name>
               <value>-Xmx512m</value>

        </property>
<property>
            <name>yarn.resourcemanager.hostname</name>
            <value>hadoop04-datanode</value>
      </property> 




</configuration>
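It is worth doing the arithmetic on the memory values above. The scheduler rounds every container request up to a multiple of yarn.scheduler.minimum-allocation-mb, so the 768 MB ApplicationMaster request actually occupies a 1024 MB container, and a NodeManager offering 3096 MB can hold at most three minimum-size containers:

```shell
node_mb=3096   # yarn.nodemanager.resource.memory-mb
min_mb=1024    # yarn.scheduler.minimum-allocation-mb
am_mb=768      # yarn.app.mapreduce.am.resource.mb

# Requests are rounded up to the next multiple of min_mb.
echo $(( (am_mb + min_mb - 1) / min_mb * min_mb ))   # 1024

# Maximum number of minimum-size containers per NodeManager.
echo $(( node_mb / min_mb ))                         # 3
```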

It has to be said that the essence of setting up Hadoop lies in writing these configuration files. For the meaning of each individual setting, readers can consult the official documentation.

Below are all the environment variables on the system path used during the setup. This part is also critical.

export JAVA_HOME=/usr/local/java
export JRE_HOME=${JAVA_HOME}/jre  
export CLASSPATH=.:${JAVA_HOME}/lib
export PATH=${JAVA_HOME}/bin:$PATH
export PATH=${PATH}:/usr/lib/jsoncpp/libs/linux-gcc-4.8
export M2_HOME=/home/verlink/Desktop/apache-maven-3.1.1
export TOMCAT_HOME=/home/verlink/tomcat/apache-tomcat-7.0.69
export PATH=$PATH:$M2_HOME/bin:$TOMCAT_HOME
export HADOOP_HOME=/usr/local/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop


export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native  
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"  

export ZOOKEEPER_HOME=/usr/local/hadoop/app/zookeeper
export PATH=$ZOOKEEPER_HOME/bin:$PATH
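These exports typically live in ~/.bashrc or /etc/profile on every node. A quick sanity check after reloading the file, assuming the install paths above:

```shell
source ~/.bashrc              # or: source /etc/profile
hadoop version                # should print the installed Hadoop version
which hdfs yarn zkServer.sh   # should all resolve under the paths exported above
```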


1. Hive's metadata is managed through MySQL. In hive-site.xml you need to configure the MySQL address, plus the username and password used to connect to MySQL, and you also need to add the MySQL connector/driver JAR to the system classpath.
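A minimal sketch of the hive-site.xml fragment just described. The property names are the standard Hive/JDO ones; the host, database name, user, and password are placeholders for your environment, and the mysql-connector-java JAR must be placed where Hive can load it:

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://192.168.79.183:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```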

2. Grant privileges in MySQL so that machines at other IP addresses can access it. Look up the grant command here and understand it once and for all.
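A sketch of that grant, using the MySQL 5.x syntax current for this kind of setup. The user, password, and database name are placeholders; '%' allows connections from any host, which you would narrow down in production:

```shell
mysql -u root -p <<'SQL'
-- let user "hive" reach the "hive" metastore database from any host
GRANT ALL PRIVILEGES ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hivepassword';
FLUSH PRIVILEGES;
SQL
```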