
[Original] Spark Hands-On Practice 1: Installing and Deploying Hadoop 2.7.3


Contents:

Part 1: Operating system preparation

  1. Install and deploy CentOS 7.3 (1611)

  2. Install CentOS 7 utilities (net-tools, wget, vim, etc.)

  3. Switch the CentOS 7 Yum repositories to a faster mirror

  4. Configure CentOS users and sudo privileges

Part 2: Java environment

  1. Install and configure JDK 1.8

Part 3: Hadoop configuration, startup, and verification

  1. Unpack Hadoop 2.7.3 and update the environment variables

  2. Update the Hadoop configuration files

  3. Start Hadoop

  4. Verify Hadoop

=============================================================================================

Part 1: Operating system preparation

  1. Install and deploy CentOS 7.3 (1611)

  2. Install CentOS 7 utilities (net-tools, wget, vim, etc.)

  3. Switch the CentOS 7 Yum repositories to a faster mirror

  4. Configure CentOS users and sudo privileges

1. Install and deploy CentOS 7.3 (1611)


2. Install CentOS 7 utilities (net-tools, wget, vim, etc.)

sudo yum install -y net-tools

sudo yum install -y wget

sudo yum install -y vim


3. Switch the CentOS 7 Yum repositories to the Aliyun mirror for faster package downloads

http://mirrors.aliyun.com/help/centos

1. Back up the existing repo file:

mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup

2. Download the new CentOS-Base.repo into /etc/yum.repos.d/:

CentOS 5

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-5.repo

CentOS 6

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-6.repo

CentOS 7

wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo

3. Then run yum makecache to rebuild the metadata cache.


4. Run sudo yum -y update to upgrade the system.


sudo vim /etc/hosts # update the hosts file so the hostname spark02 can stand in for this machine's IP
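The hosts entries for this setup might look like the following (the addresses are hypothetical; substitute your machines' real IPs):

```
# /etc/hosts (illustrative addresses only)
192.168.1.101   spark01
192.168.1.102   spark02
```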


Part 2: Java environment

  1. Install and configure JDK 1.8

Upload the files needed for the exercise (JDK, Hadoop, Spark) via FileZilla.


Unpack the JDK and Hadoop archives:

tar -zxvf jdk-8u121-linux-x64.tar.gz

tar -zxvf hadoop-2.7.3.tar.gz

Add the environment variables below to .bash_profile so the Java and Hadoop binaries are easier to invoke:

#Add JAVA_HOME and HADOOP_HOME
export JAVA_HOME=/home/spark/jdk1.8.0_121
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/home/spark/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

source .bash_profile # make the configuration take effect
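To confirm the exports compose as intended, the PATH assembly can be sketched on its own (the install locations are the ones used above):

```shell
# Rebuild the PATH additions from .bash_profile and list the
# directories that end up appended last.
JAVA_HOME=/home/spark/jdk1.8.0_121
HADOOP_HOME=/home/spark/hadoop-2.7.3
PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
echo "$PATH" | tr ':' '\n' | tail -n 3
```

After sourcing the real .bash_profile, java -version and hadoop version should both resolve without typing full paths.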


Part 3: Hadoop configuration, startup, and verification

  1. Unpack Hadoop 2.7.3 and update the environment variables

  2. Update the Hadoop configuration files

  3. Start Hadoop

  4. Verify Hadoop

Reference: the pseudo-distributed configuration follows the official Hadoop 2.7.3 documentation:

http://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-common/SingleCluster.html

Pseudo-Distributed Operation

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

Configuration

Use the following:

etc/hadoop/core-site.xml:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

etc/hadoop/hdfs-site.xml:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:

  $ ssh localhost

If you cannot ssh to localhost without a passphrase, execute the following commands:

  $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
  $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  $ chmod 0600 ~/.ssh/authorized_keys

Execution

The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.

  1. Format the filesystem:

      $ bin/hdfs namenode -format
    
  2. Start NameNode daemon and DataNode daemon:

      $ sbin/start-dfs.sh
    

    The hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).

  3. Browse the web interface for the NameNode; by default it is available at:

    • NameNode - http://localhost:50070/
  4. Make the HDFS directories required to execute MapReduce jobs:

      $ bin/hdfs dfs -mkdir /user
      $ bin/hdfs dfs -mkdir /user/<username>
    
  5. Copy the input files into the distributed filesystem:

      $ bin/hdfs dfs -put etc/hadoop input
    
  6. Run some of the examples provided:

      $ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep input output 'dfs[a-z.]+'
    
  7. Examine the output files: Copy the output files from the distributed filesystem to the local filesystem and examine them:

      $ bin/hdfs dfs -get output output
      $ cat output/*
    

    or

    View the output files on the distributed filesystem:

      $ bin/hdfs dfs -cat output/*
    
  8. When you’re done, stop the daemons with:

      $ sbin/stop-dfs.sh
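The grep example in step 6 is worth unpacking: it runs two chained MapReduce jobs, one that extracts and counts every match of the regular expression dfs[a-z.]+ in the input, and one that sorts the counts. The matching itself can be previewed locally with plain grep:

```shell
# Same extended regex the example job uses; -o prints each match
# on its own line, mirroring what the map phase extracts.
printf 'dfs.replication\ndfs.audit.logger\nno match here\n' \
  | grep -oE 'dfs[a-z.]+'
# prints:
#   dfs.replication
#   dfs.audit.logger
```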
    

YARN on a Single Node

You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and running ResourceManager daemon and NodeManager daemon in addition.

The following instructions assume that 1. ~ 4. steps of the above instructions are already executed.

  1. Configure parameters as follows: etc/hadoop/mapred-site.xml:

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>
    

    etc/hadoop/yarn-site.xml:

    <configuration>
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
    </configuration>
    
  2. Start ResourceManager daemon and NodeManager daemon:

      $ sbin/start-yarn.sh
    
  3. Browse the web interface for the ResourceManager; by default it is available at:

    • ResourceManager - http://localhost:8088/
  4. Run a MapReduce job.

  5. When you’re done, stop the daemons with:

      $ sbin/stop-yarn.sh

    Set up passwordless SSH first; otherwise the jobs will fail with errors at runtime.

    Setup passphraseless ssh

    Now check that you can ssh to the localhost without a passphrase:

      $ ssh localhost
    

    If you cannot ssh to localhost without a passphrase, execute the following commands:

      $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
      $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
      $ chmod 0600 ~/.ssh/authorized_keys

    The Hadoop configuration files used here, in full:

    1. vim etc/hadoop/hadoop-env.sh

    #export JAVA_HOME=${JAVA_HOME}
    export JAVA_HOME=/home/spark/jdk1.8.0_121

    2. vim etc/hadoop/core-site.xml

    <!-- Put site-specific property overrides in this file. -->

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://spark01:9000</value>
        </property>
        <property>
            <name>hadoop.tmp.dir</name>
            <value>/home/spark/hadoopdata</value>
        </property>
    </configuration>

    3. vim etc/hadoop/hdfs-site.xml

    <configuration>
        <property>
            <name>dfs.replication</name>
            <value>1</value>
        </property>
    </configuration>

    4. vim etc/hadoop/mapred-site.xml

    <configuration>
        <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
        </property>
    </configuration>

    5. vim etc/hadoop/yarn-site.xml

    <configuration>
        <!-- Site specific YARN configuration properties -->
        <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
        </property>
        <property>
            <name>yarn.log-aggregation-enable</name>
            <value>true</value>
        </property>
    </configuration>

    1. Format HDFS: hdfs namenode -format

    2. Start HDFS:

    start-dfs.sh

    3. Start YARN:

    start-yarn.sh

    Stop the firewall so that other machines can reach the web interfaces for verification:

    $ systemctl stop firewalld.service
    ==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
    Authentication is required to manage system services or units.
    Authenticating as: spark
    Password:
    ==== AUTHENTICATION COMPLETE ===

    http://spark02:8088


    http://spark02:50070


    Create directories, copy files, and list files on HDFS:

    hdfs dfs -mkdir hdfs:///user/jonson/input
    hdfs dfs -cp etc/hadoop hdfs:///user/jonson/input
    hdfs dfs -ls hdfs:///user/jonson/input
    hdfs dfs -mkdir hdfs:///user/jonson/output
    hdfs dfs -rmdir hdfs:///user/jonson/output
    hdfs dfs -ls hdfs:///user/jonson

    Try the MapReduce framework:

    $ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep hdfs:///user/jonson/input/hadoop hdfs:///user/jonson/output 'dfs[a-z.]+'
    17/05/07 23:36:17 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    17/05/07 23:36:18 INFO input.FileInputFormat: Total input paths to process : 30
    17/05/07 23:36:18 INFO mapreduce.JobSubmitter: number of splits:30
    17/05/07 23:36:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494169715431_0003
    17/05/07 23:36:19 INFO impl.YarnClientImpl: Submitted application application_1494169715431_0003
    17/05/07 23:36:19 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1494169715431_0003/
    17/05/07 23:36:19 INFO mapreduce.Job: Running job: job_1494169715431_0003
    17/05/07 23:36:28 INFO mapreduce.Job: Job job_1494169715431_0003 running in uber mode : false
    17/05/07 23:36:28 INFO mapreduce.Job: map 0% reduce 0%
    17/05/07 23:36:58 INFO mapreduce.Job: map 20% reduce 0%
    17/05/07 23:37:25 INFO mapreduce.Job: map 37% reduce 0%
    17/05/07 23:37:26 INFO mapreduce.Job: map 40% reduce 0%
    17/05/07 23:37:50 INFO mapreduce.Job: map 47% reduce 0%
    17/05/07 23:37:51 INFO mapreduce.Job: map 57% reduce 0%
    17/05/07 23:37:54 INFO mapreduce.Job: map 57% reduce 19%
    17/05/07 23:38:04 INFO mapreduce.Job: map 60% reduce 19%
    17/05/07 23:38:06 INFO mapreduce.Job: map 60% reduce 20%
    17/05/07 23:38:12 INFO mapreduce.Job: map 73% reduce 20%
    17/05/07 23:38:15 INFO mapreduce.Job: map 73% reduce 24%
    17/05/07 23:38:18 INFO mapreduce.Job: map 77% reduce 24%
    17/05/07 23:38:21 INFO mapreduce.Job: map 77% reduce 26%
    17/05/07 23:38:33 INFO mapreduce.Job: map 83% reduce 26%
    17/05/07 23:38:34 INFO mapreduce.Job: map 90% reduce 26%
    17/05/07 23:38:35 INFO mapreduce.Job: map 93% reduce 26%
    17/05/07 23:38:36 INFO mapreduce.Job: map 93% reduce 31%
    17/05/07 23:38:43 INFO mapreduce.Job: map 100% reduce 31%
    17/05/07 23:38:44 INFO mapreduce.Job: map 100% reduce 100%
    17/05/07 23:38:45 INFO mapreduce.Job: Job job_1494169715431_0003 completed successfully
    17/05/07 23:38:45 INFO mapreduce.Job: Counters: 49
    File System Counters
    FILE: Number of bytes read=345
    FILE: Number of bytes written=3690573
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=81841
    HDFS: Number of bytes written=437
    HDFS: Number of read operations=93
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
    Job Counters
    Launched map tasks=30
    Launched reduce tasks=1
    Data-local map tasks=30
    Total time spent by all maps in occupied slots (ms)=653035
    Total time spent by all reduces in occupied slots (ms)=77840
    Total time spent by all map tasks (ms)=653035
    Total time spent by all reduce tasks (ms)=77840
    Total vcore-milliseconds taken by all map tasks=653035
    Total vcore-milliseconds taken by all reduce tasks=77840
    Total megabyte-milliseconds taken by all map tasks=668707840
    Total megabyte-milliseconds taken by all reduce tasks=79708160
    Map-Reduce Framework
    Map input records=2103
    Map output records=24
    Map output bytes=590
    Map output materialized bytes=519
    Input split bytes=3804
    Combine input records=24
    Combine output records=13
    Reduce input groups=11
    Reduce shuffle bytes=519
    Reduce input records=13
    Reduce output records=11
    Spilled Records=26
    Shuffled Maps =30
    Failed Shuffles=0
    Merged Map outputs=30
    GC time elapsed (ms)=8250
    CPU time spent (ms)=13990
    Physical memory (bytes) snapshot=6025490432
    Virtual memory (bytes) snapshot=64352063488
    Total committed heap usage (bytes)=4090552320
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=78037
    File Output Format Counters
    Bytes Written=437
    17/05/07 23:38:45 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    17/05/07 23:38:46 INFO input.FileInputFormat: Total input paths to process : 1
    17/05/07 23:38:46 INFO mapreduce.JobSubmitter: number of splits:1
    17/05/07 23:38:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1494169715431_0004
    17/05/07 23:38:46 INFO impl.YarnClientImpl: Submitted application application_1494169715431_0004
    17/05/07 23:38:46 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1494169715431_0004/
    17/05/07 23:38:46 INFO mapreduce.Job: Running job: job_1494169715431_0004
    17/05/07 23:39:00 INFO mapreduce.Job: Job job_1494169715431_0004 running in uber mode : false
    17/05/07 23:39:00 INFO mapreduce.Job: map 0% reduce 0%
    17/05/07 23:39:06 INFO mapreduce.Job: map 100% reduce 0%
    17/05/07 23:39:13 INFO mapreduce.Job: map 100% reduce 100%
    17/05/07 23:39:14 INFO mapreduce.Job: Job job_1494169715431_0004 completed successfully
    17/05/07 23:39:14 INFO mapreduce.Job: Counters: 49
    File System Counters
    FILE: Number of bytes read=291
    FILE: Number of bytes written=237535
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=566
    HDFS: Number of bytes written=197
    HDFS: Number of read operations=7
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
    Job Counters
    Launched map tasks=1
    Launched reduce tasks=1
    Data-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=3838
    Total time spent by all reduces in occupied slots (ms)=3849
    Total time spent by all map tasks (ms)=3838
    Total time spent by all reduce tasks (ms)=3849
    Total vcore-milliseconds taken by all map tasks=3838
    Total vcore-milliseconds taken by all reduce tasks=3849
    Total megabyte-milliseconds taken by all map tasks=3930112
    Total megabyte-milliseconds taken by all reduce tasks=3941376
    Map-Reduce Framework
    Map input records=11
    Map output records=11
    Map output bytes=263
    Map output materialized bytes=291
    Input split bytes=129
    Combine input records=0
    Combine output records=0
    Reduce input groups=5
    Reduce shuffle bytes=291
    Reduce input records=11
    Reduce output records=11
    Spilled Records=22
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=143
    CPU time spent (ms)=980
    Physical memory (bytes) snapshot=306675712
    Virtual memory (bytes) snapshot=4157272064
    Total committed heap usage (bytes)=165810176
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=437
    File Output Format Counters
    Bytes Written=197


    Use the command line to inspect the results:

    $ hadoop fs -cat hdfs:///user/jonson/output/*
    6 dfs.audit.logger
    4 dfs.class
    3 dfs.server.namenode.
    2 dfs.period
    2 dfs.audit.log.maxfilesize
    2 dfs.audit.log.maxbackupindex
    1 dfsmetrics.log
    1 dfsadmin
    1 dfs.servers
    1 dfs.replication
    1 dfs.file
    $ hadoop fs -cat hdfs:///user/jonson/output/part-r-00000
    6 dfs.audit.logger
    4 dfs.class
    3 dfs.server.namenode.
    2 dfs.period
    2 dfs.audit.log.maxfilesize
    2 dfs.audit.log.maxbackupindex
    1 dfsmetrics.log
    1 dfsadmin
    1 dfs.servers
    1 dfs.replication
    1 dfs.file
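The listing above is produced by the example's second job, which sorts the (count, pattern) pairs by descending count before writing part-r-00000. That ordering can be reproduced locally with a reverse numeric sort on the first field:

```shell
# Reverse numeric sort on the count column, as in the final output.
printf '1\tdfsadmin\n6\tdfs.audit.logger\n4\tdfs.class\n' | sort -k1,1nr
```

This prints the three lines reordered as 6, 4, 1, matching the shape of the job output.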


    ====================================

    Passwordless login: principle and procedure

    Background: a Hadoop deployment needs passwordless login. "Passwordless" really means logging in through certificate authentication, using what ssh calls public/private key authentication.

      On Linux, ssh is the default tool for remote login. Because its protocol uses RSA/DSA encryption, it is a safe way to administer Linux machines remotely; telnet, being insecure, has fallen out of use on Linux systems.

      Public/private key authentication in brief: first create a key pair on the client (public key: ~/.ssh/id_rsa.pub; private key: ~/.ssh/id_rsa). Then place the public key on the server (in ~/.ssh/authorized_keys) and keep the private key on the client. When you log in with ssh, the client proves that it holds the private key matching one of the public keys in the server's authorized_keys (the private key itself is never transmitted); if the proof succeeds, you are logged in without a password.

    Tools/Prerequisites

    • a Linux system

    Procedure

    1. Confirm that SSH is installed.

      rpm -qa | grep openssh

      rpm -qa | grep rsync

      -->if these print package names, SSH and rsync are already installed

      If ssh or rsync is not installed, install it with the commands below.

      yum install openssh-server openssh-clients -->install the SSH service and client (on CentOS the packages are named openssh-*)

      yum install rsync -->rsync is a remote file synchronization tool that can quickly sync files between hosts over a LAN/WAN

      service sshd restart -->start the service
    2. Generate the key pair

      ssh-keygen -t rsa -P '' -->just press Enter; the generated key pair, id_rsa and id_rsa.pub, is stored under "/home/hadoop/.ssh" by default.
    3. Append id_rsa.pub to the authorized keys.

      cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    4. Fix the permissions on the authorized key file

      chmod 600 ~/.ssh/authorized_keys
    5. Edit the SSH configuration file

      su - -->switch to root to edit the configuration

      vim /etc/ssh/sshd_config -->uncomment the three lines that enable key authentication (RSAAuthentication, PubkeyAuthentication and AuthorizedKeysFile)
    6. Test the connection

      service sshd restart -->restart the ssh service

      exit -->leave root and return to the normal user

      ssh localhost -->test connecting as the normal user

      This only configures SSH on the local machine; to reach other servers without a password, continue with the steps below.
    7. The key pair is now generated and the client's SSH service is configured; next, deliver our key (the public key) to the server.

      scp ~/.ssh/id_rsa.pub <user>@<host>:~/ -->copy the public key into the remote server's home directory

      For example: scp ~/.ssh/id_rsa.pub <user>@192.168.1.134:~/ (substitute the actual remote user name)

      Notice that the copy prompts for the server's password; once SSH is fully set up, these steps will no longer require one.
    8. The previous step sent the public key to the 192.168.1.134 server; now, on the 134 machine, append it to the authorized keys. (Note: if SSH has never been run there, the .ssh directory must be created by hand, or generated automatically by running ssh-keygen -t rsa. Pay particular attention to the permissions on .ssh: remember to run chmod 700 .ssh.)

      On the 134 machine:

      cat ~/id_rsa.pub >> ~/.ssh/authorized_keys -->append the public key to the authorized keys

      rm ~/id_rsa.pub -->to be safe, delete the copied public key

      Then repeat steps 4 and 5 on the 134 machine, and

      service sshd restart -->restart the ssh service
    9. Back on the client, type:

      ssh 192.168.1.134 -->you should now connect to the server directly, with no password prompt.
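A frequent failure mode in step 8 is wrong permissions: sshd silently ignores keys when ~/.ssh is group- or world-accessible. The expected modes can be checked in isolation (a minimal sketch using a throwaway temporary directory rather than a real home):

```shell
# ssh requires ~/.ssh to be 700 and authorized_keys to be 600 (or
# stricter), otherwise public key authentication is refused.
d=$(mktemp -d)
mkdir -p "$d/.ssh"
chmod 700 "$d/.ssh"
touch "$d/.ssh/authorized_keys"
chmod 600 "$d/.ssh/authorized_keys"
stat -c '%a' "$d/.ssh" "$d/.ssh/authorized_keys"   # prints 700, then 600
```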
