Hadoop: The Definitive Guide (4th ed.), Key-Point Translation (4): Chapter 3. The HDFS (1-4)



Filesystems that manage the storage across a network of machines are called distributed filesystems. Since they are network based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.


1) The Design of HDFS
a) HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
b) HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.


c) Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
2) HDFS Concepts
a) A disk has a block size, which is the minimum amount of data that it can read or write. Filesystems for a single disk build on this by dealing with data in blocks, which are an integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes in size, whereas disk blocks are normally 512 bytes.


b) Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.


c) HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks. If the block is large enough, the time it takes to transfer the data from the disk can be significantly longer than the time to seek to the start of the block. Thus, transferring a large file made of multiple blocks operates at the disk transfer rate.
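To make the seek-versus-transfer trade-off concrete, here is a rough back-of-the-envelope model in Python. The 10 ms seek time and 100 MiB/s transfer rate are illustrative assumptions, not measured values:

```python
# Rough model: time to read one block = seek time + transfer time.
# Assumed (illustrative) hardware figures:
SEEK_S = 0.010                     # 10 ms average seek time
TRANSFER_BPS = 100 * 1024 * 1024   # 100 MiB/s sustained transfer rate

def seek_overhead(block_bytes):
    """Fraction of a block read spent seeking rather than transferring."""
    transfer_s = block_bytes / TRANSFER_BPS
    return SEEK_S / (SEEK_S + transfer_s)

# A 128 MiB HDFS block keeps seek overhead under 1%:
print(round(seek_overhead(128 * 1024 * 1024), 4))  # 0.0078
# A 4 KiB filesystem-sized block spends almost all its time seeking:
print(round(seek_overhead(4 * 1024), 4))           # 0.9961
```

This is why large files made of multiple large blocks can be read at close to the raw disk transfer rate.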
d) Having a block abstraction for a distributed filesystem brings several benefits. The first benefit is the most obvious: a file can be larger than any single disk in the network. Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. Furthermore, blocks fit well with replication for providing fault tolerance and availability.
e) An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers). The namenode manages the filesystem namespace.
f) Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.
g) For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this. The first way is to back up the files that make up the persistent state of the filesystem metadata.
h) It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.


i) However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode’s metadata files that are on NFS to the secondary and run it as the new primary.
j) Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache.
k) HDFS federation, introduced in the 2.x release series, allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace.
l) Hadoop 2 remedied this situation by adding support for HDFS high availability (HA). In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption. A few architectural changes are needed to allow this to happen:
m) The namenodes must use highly available shared storage to share the edit log. Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk. Clients must be configured to handle namenode failover, using a mechanism that is transparent to users. The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace.
n) There are two choices for the highly available shared storage: an NFS filer, or a quorum journal manager (QJM). The QJM is a dedicated HDFS implementation, designed for the sole purpose of providing a highly available edit log, and is the recommended choice for most HDFS installations.
o) If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries and an up-to-date block mapping.
p) The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller. There are various failover controllers, but the default implementation uses ZooKeeper to ensure that only one namenode is active. Each namenode runs a lightweight failover controller process whose job it is to monitor its namenode for failures and trigger a failover should a namenode fail.


q) The HA implementation goes to great lengths to ensure that the previously active namenode is prevented from doing any damage and causing corruption — a method known as fencing.
r) The QJM only allows one namenode to write to the edit log at one time; however, it is still possible for the previously active namenode to serve stale read requests to clients, so setting up an SSH fencing command that will kill the namenode’s process is a good idea.
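As a sketch of how the shared edit log and fencing described above are wired up in practice, an HA pair backed by a QJM is typically configured in hdfs-site.xml roughly as follows. The nameservice ID `mycluster` and the journal-node hostnames are placeholders; see the Hadoop HA documentation for the full set of required properties:

```xml
<!-- Hypothetical hdfs-site.xml fragment: one nameservice, two namenodes -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- Edit log shared through a quorum of journal nodes (the QJM) -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <!-- SSH fencing: kill the previously active namenode's process on failover -->
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
```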
3) The Command-Line Interface
a) You can type hadoop fs -help to get detailed help on every command.


b) Let’s copy the file back to the local filesystem and check whether it’s the same:

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = e7891a2627cf263a079fb0f18256ffb2
MD5 (quangle.copy.txt) = e7891a2627cf263a079fb0f18256ffb2

The MD5 digests are the same, showing that the file survived its trip to HDFS and is back intact.
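The same round-trip integrity check can be scripted outside the shell; here is a minimal Python sketch using only the standard library (the file names in the comment are placeholders for whatever was copied back):

```python
import hashlib

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that arbitrarily large files don't have to fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. md5sum("input/docs/quangle.txt") == md5sum("quangle.copy.txt")
```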
4) Hadoop Filesystems
a) Hadoop has an abstract notion of filesystems, of which HDFS is just one implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client interface to a filesystem in Hadoop, and there are several concrete implementations.


b) Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API.
c) By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST API exposed by the WebHDFS protocol makes it easier for other languages to interact with HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided for very large data transfers if possible.
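As an illustration of how little is needed to talk to WebHDFS from a non-Java language, the sketch below builds the REST request URL for a read. The hostname and port are placeholders, and `op=OPEN` follows the WebHDFS REST API's query-parameter convention:

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, **params):
    """Build a WebHDFS REST URL, e.g. for op=OPEN (read a file)
    or op=LISTSTATUS (list a directory)."""
    query = urlencode(dict(op=op, **params))
    return "http://%s:%d/webhdfs/v1%s?%s" % (host, port, path, query)

# Reading quangle.txt through WebHDFS (namenode host and port are placeholders):
print(webhdfs_url("namenode.example.com", 9870, "/user/tom/quangle.txt", "OPEN"))
# -> http://namenode.example.com:9870/webhdfs/v1/user/tom/quangle.txt?op=OPEN
```

An HTTP GET on such a URL is all another language's client needs to issue, which is exactly why the HTTP interface trades convenience for lower throughput than the native Java client.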
