
Installing a Hadoop pseudo-cluster, and basic concepts

Overview

A pseudo-cluster means we install Hadoop on several machines, but without high availability or fault tolerance. This is suitable for a development environment.

First we download the Hadoop installation package. I am using the CDH release 5.14.0, which you can find on Cloudera's download site.
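The download and unpacking steps might look like the sketch below; the archive URL and file name are placeholders that depend on the exact build you grab, so adjust them to whatever you actually downloaded:

# download and unpack the CDH tarball (URL and file name are placeholders)
wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.6.0-cdh5.14.0.tar.gz
tar -xzf hadoop-2.6.0-cdh5.14.0.tar.gz -C /home/hadoop
# point HADOOP_HOME at the unpacked directory and put its scripts on the PATH
export HADOOP_HOME=/home/hadoop/hadoop-2.6.0-cdh5.14.0
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin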

Before editing anything, let's look at how Hadoop's configuration files are organized:

Hadoop's configuration files fall into two types.

One type is the read-only defaults: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.

The other is the site-specific configuration we provide ourselves: conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml and conf/mapred-site.xml.

In addition, we can configure two environment scripts: conf/hadoop-env.sh and yarn-env.sh.

Detailed configuration

Let's start by configuring conf/hadoop-env.sh and yarn-env.sh.

Step 1: set JAVA_HOME.

Step 2 (optional): administrators can tune each daemon individually through the following environment variables:

Daemon Environment Variable
NameNode HADOOP_NAMENODE_OPTS
DataNode HADOOP_DATANODE_OPTS
Secondary NameNode HADOOP_SECONDARYNAMENODE_OPTS
ResourceManager YARN_RESOURCEMANAGER_OPTS
NodeManager YARN_NODEMANAGER_OPTS
WebAppProxy YARN_PROXYSERVER_OPTS
Map Reduce Job History Server HADOOP_JOB_HISTORYSERVER_OPTS

For example, we can enable the parallel garbage collector for the NameNode:

export HADOOP_NAMENODE_OPTS="-XX:+UseParallelGC ${HADOOP_NAMENODE_OPTS}"

Some other useful settings:

HADOOP_LOG_DIR / YARN_LOG_DIR - where each daemon's logs are stored; the directories are created automatically if they do not exist.
HADOOP_HEAPSIZE / YARN_HEAPSIZE - the maximum heap size, in MB.

Daemon Environment Variable
ResourceManager YARN_RESOURCEMANAGER_HEAPSIZE
NodeManager YARN_NODEMANAGER_HEAPSIZE
WebAppProxy YARN_PROXYSERVER_HEAPSIZE
Map Reduce Job History Server HADOOP_JOB_HISTORYSERVER_HEAPSIZE
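A minimal sketch of what these settings could look like, assuming roughly 1 GB heaps and dedicated log directories (the paths and sizes are illustrative, not recommendations):

# in hadoop-env.sh
export HADOOP_LOG_DIR=/var/log/hadoop       # HDFS daemon logs
export HADOOP_HEAPSIZE=1000                 # max daemon heap, in MB

# in yarn-env.sh
export YARN_LOG_DIR=/var/log/yarn           # YARN daemon logs
export YARN_HEAPSIZE=1000                   # default heap for YARN daemons, in MB
export YARN_RESOURCEMANAGER_HEAPSIZE=1024   # override just for the ResourceManager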

With that done, we can move on to the detailed configuration of the daemons themselves:

conf/core-site.xml

Parameter Value Notes
fs.defaultFS NameNode URI hdfs://host:port/
io.file.buffer.size 131072 Size of read/write buffer used in SequenceFiles.

fs.defaultFS: the URI clients use to access the distributed file system.
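Put together, a minimal core-site.xml could look like this (node-master is the assumed NameNode hostname, the same one used in the basic configuration later in this article):

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node-master:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>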

conf/hdfs-site.xml

  • Configurations for NameNode:
    Parameter Value Notes
    dfs.namenode.name.dir Path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.

 Configurations for DataNode:

Parameter Value Notes
dfs.datanode.data.dir Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.

If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.
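As a sketch of what a comma-delimited layout could look like (the /data1 and /data2 mount points are assumptions; in practice they would sit on different physical disks):

<property>
    <name>dfs.namenode.name.dir</name>
    <!-- the name table is replicated into every listed directory -->
    <value>/data1/hadoop/nameNode,/data2/hadoop/nameNode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <!-- blocks are spread across every listed directory -->
    <value>/data1/hadoop/dataNode,/data2/hadoop/dataNode</value>
</property>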

 conf/yarn-site.xml

  • Configurations for ResourceManager and NodeManager:
    Parameter Value Notes
    yarn.acl.enable true / false Enable ACLs? Defaults to false.
    yarn.admin.acl Admin ACL ACL to set admins on the cluster. ACLs are of for comma-separated-usersspacecomma-separated-groups. Defaults to special value of * which means anyone. Special value of just space means no one has access.
    yarn.log-aggregation-enable false Configuration to enable or disable log aggregation.

    • Configurations for ResourceManager:
      Parameter Value Notes
      yarn.resourcemanager.address host:port ResourceManager host:port for clients to submit jobs. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.scheduler.address host:port ResourceManager host:port for ApplicationMasters to talk to the Scheduler to obtain resources. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.resource-tracker.address host:port ResourceManager host:port for NodeManagers. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.admin.address host:port ResourceManager host:port for administrative commands. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.webapp.address host:port ResourceManager web-ui host:port. If set, overrides the hostname set in yarn.resourcemanager.hostname.
      yarn.resourcemanager.hostname host Single hostname that can be set in place of setting all yarn.resourcemanager*address resources. Results in default ports for ResourceManager components.
      yarn.resourcemanager.scheduler.class ResourceManager Scheduler class. CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler
      yarn.scheduler.minimum-allocation-mb Minimum limit of memory to allocate to each container request at the Resource Manager. In MBs
      yarn.scheduler.maximum-allocation-mb Maximum limit of memory to allocate to each container request at the Resource Manager. In MBs
      yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path List of permitted/excluded NodeManagers. If necessary, use these files to control the list of allowable NodeManagers.
    • Configurations for NodeManager:
      Parameter Value Notes
      yarn.nodemanager.resource.memory-mb Resource i.e. available physical memory, in MB, for given NodeManager Defines total available resources on the NodeManager to be made available to running containers
      yarn.nodemanager.vmem-pmem-ratio Maximum ratio by which virtual memory usage of tasks may exceed physical memory The virtual memory usage of each task may exceed its physical memory limit by this ratio. The total amount of virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
      yarn.nodemanager.local-dirs Comma-separated list of paths on the local filesystem where intermediate data is written. Multiple paths help spread disk i/o.
      yarn.nodemanager.log-dirs Comma-separated list of paths on the local filesystem where logs are written. Multiple paths help spread disk i/o.
      yarn.nodemanager.log.retain-seconds 10800 Default time (in seconds) to retain log files on the NodeManager Only applicable if log-aggregation is disabled.
      yarn.nodemanager.remote-app-log-dir /logs HDFS directory where the application logs are moved on application completion. Need to set appropriate permissions. Only applicable if log-aggregation is enabled.
      yarn.nodemanager.remote-app-log-dir-suffix logs Suffix appended to the remote log dir. Logs will be aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam} Only applicable if log-aggregation is enabled.
      yarn.nodemanager.aux-services mapreduce_shuffle Shuffle service that needs to be set for Map Reduce applications.
    • Configurations for History Server (Needs to be moved elsewhere):
      Parameter Value Notes
      yarn.log-aggregation.retain-seconds -1 How long to keep aggregation logs before deleting them. -1 disables. Be careful, set this too small and you will spam the name node.
      yarn.log-aggregation.retain-check-interval-seconds -1 Time between checks for aggregated log retention. If set to 0 or a negative value then the value is computed as one-tenth of the aggregated log retention time. Be careful, set this too small and you will spam the name node.
  • conf/mapred-site.xml
    • Configurations for MapReduce Applications:
      Parameter Value Notes
      mapreduce.framework.name yarn Execution framework set to Hadoop YARN.
      mapreduce.map.memory.mb 1536 Larger resource limit for maps.
      mapreduce.map.java.opts -Xmx1024M Larger heap-size for child jvms of maps.
      mapreduce.reduce.memory.mb 3072 Larger resource limit for reduces.
      mapreduce.reduce.java.opts -Xmx2560M Larger heap-size for child jvms of reduces.
      mapreduce.task.io.sort.mb 512 Higher memory-limit while sorting data for efficiency.
      mapreduce.task.io.sort.factor 100 More streams merged at once while sorting files.
      mapreduce.reduce.shuffle.parallelcopies 50 Higher number of parallel copies run by reduces to fetch outputs from very large number of maps.
    • Configurations for MapReduce JobHistory Server:
      Parameter Value Notes
      mapreduce.jobhistory.address MapReduce JobHistory Server host:port Default port is 10020.
      mapreduce.jobhistory.webapp.address MapReduce JobHistory Server Web UI host:port Default port is 19888.
      mapreduce.jobhistory.intermediate-done-dir /mr-history/tmp Directory where history files are written by MapReduce jobs.
      mapreduce.jobhistory.done-dir /mr-history/done Directory where history files are managed by the MR JobHistory Server.

Basic configuration

The tables above describe the available options; how you set them is up to you. Below is a basic configuration for a working environment:

# hadoop-env.sh: configure JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre

 core-site.xml

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://node-master:9000</value>
        </property>
    </configuration>

hdfs-site.xml

<configuration>
    <property>
            <name>dfs.namenode.name.dir</name>
            <value>/home/hadoop/data/nameNode</value>
    </property>

    <property>
            <name>dfs.datanode.data.dir</name>
            <value>/home/hadoop/data/dataNode</value>
    </property>

    <property>
            <name>dfs.replication</name>
            <value>1</value>
    </property>
</configuration>
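One easy-to-miss step: the directories referenced above must exist and be writable by the user that runs HDFS. The daemons can usually create them on first start if the parent directory is writable, but creating them up front avoids permission surprises (assuming a hadoop user, as the paths suggest):

mkdir -p /home/hadoop/data/nameNode /home/hadoop/data/dataNode
chown -R hadoop:hadoop /home/hadoop/data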

 mapred-site.xml

<configuration>
    <property>
            <name>mapreduce.framework.name</name>
            <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>512</value>
    </property>

    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>

    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>
</configuration>
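Note that the stock Hadoop 2.x layout usually ships only a template for this file, so if mapred-site.xml does not exist yet you may need to copy it first:

cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml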

yarn-site.xml

<configuration>
    <property>
            <name>yarn.acl.enable</name>
            <value>0</value>
    </property>

    
    <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
    </property>

    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
    </property>

    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
    </property>

    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
    </property>

    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
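One thing to be aware of: this basic yarn-site.xml does not set yarn.resourcemanager.hostname, so NodeManagers fall back to the default (0.0.0.0) and will only find the ResourceManager if it runs on the same host. On a multi-node setup you would normally also add something like the following, where node-master is the host running the ResourceManager:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>node-master</value>
</property>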

slaves 

node01
node02
node03

Explanation

slaves

A few words about what the slaves configuration file does:

Typically, you choose one machine in the cluster to act as the NameNode and one machine to act as the ResourceManager. The remaining machines act as both DataNode and NodeManager and are referred to as slaves. The slaves file lists the hostnames of the nodes that will run the DataNode and NodeManager daemons.
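In a tarball installation this file normally lives next to the other configuration files edited above, e.g. $HADOOP_HOME/etc/hadoop/slaves (older layouts use conf/slaves). The start-dfs.sh and start-yarn.sh scripts read it and SSH into each listed host to start the DataNode and NodeManager daemons there.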

Memory

Next, the memory-related settings in yarn-site.xml and in mapred-site.xml:

Two kinds of processes run on a YARN cluster:

The ApplicationMaster (AM), which is responsible for monitoring the application and coordinating the distributed executors in the cluster.

The executors, which are created by the AM and actually run the job. For MapReduce jobs they perform the map and reduce operations in parallel. Note that YARN can run much more than MapReduce; ultimately, a MapReduce program is just one kind of application that submits work to the YARN cluster.

Both run inside containers on the slave nodes. Each slave node runs a NodeManager daemon that is responsible for creating containers on that node. The whole cluster is managed by the ResourceManager, which schedules container allocation across all slave nodes according to capacity requirements and the current load.

All of this is easier to grasp from a picture:

(Figure: Hadoop YARN memory allocation)

The official explanation is as follows:

  1. How much memory can be allocated for YARN containers on a single node. This limit should be higher than all the others; otherwise, container allocation will be rejected and applications will fail. However, it should not be the entire amount of RAM on the node.

    This value is configured in yarn-site.xml with yarn.nodemanager.resource.memory-mb.

  2. How much memory a single container can consume and the minimum memory allocation allowed. A container will never be bigger than the maximum (otherwise allocation fails), and memory is always allocated as a multiple of the minimum amount of RAM.

    Those values are configured in yarn-site.xml with yarn.scheduler.maximum-allocation-mb and yarn.scheduler.minimum-allocation-mb.

  3. How much memory will be allocated to the ApplicationMaster. This is a constant value that should fit in the container maximum size.

    This is configured in mapred-site.xml with yarn.app.mapreduce.am.resource.mb.

  4. How much memory will be allocated to each map or reduce operation. This should be less than the maximum size.

    This is configured in mapred-site.xml with properties mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
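To make this concrete with the sample values used in this article: yarn.nodemanager.resource.memory-mb = 1536 caps the containers on a node at 1536 MB in total; yarn.scheduler.maximum-allocation-mb = 1536 and yarn.scheduler.minimum-allocation-mb = 128 mean a single container is between 128 MB and 1536 MB, allocated in multiples of 128 MB; yarn.app.mapreduce.am.resource.mb = 512 reserves one 512 MB container for the ApplicationMaster; and mapreduce.map.memory.mb = mapreduce.reduce.memory.mb = 256 gives each map or reduce task a 256 MB container, so a 1536 MB node can run the AM plus up to four task containers at once ((1536 - 512) / 256 = 4).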

Running the cluster

Now we can start the cluster. Just like a new disk in an operating system, HDFS must be formatted before its first use:

hdfs namenode -format

Start HDFS:

start-dfs.sh

Start YARN:

start-yarn.sh
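To check that everything came up, a few commands are handy (jps ships with the JDK; which daemons you see depends on the node you run it on):

jps                     # master: NameNode, SecondaryNameNode, ResourceManager; slaves: DataNode, NodeManager
hdfs dfsadmin -report   # lists the DataNodes that registered with the NameNode
yarn node -list         # lists the NodeManagers that registered with the ResourceManager

If you also configured the MapReduce JobHistory Server above, it is started separately:

mr-jobhistory-daemon.sh start historyserver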

Common problems