
Hadoop: The Definitive Guide (4th edition) key points in translation (5) - Chapter 3. The HDFS (5)


5) The Java Interface
a) Reading Data from a Hadoop URL.
b) Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems.
c) One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. The general idiom is:

InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}

There's a little bit more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block.
d) Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler.

public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt

e) We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.
f) Reading Data Using the FileSystem API.
g) FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS, in this case). There are several static factory methods for getting a FileSystem instance:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

h) A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the default local filesystem if not specified there). The second uses the given URI's scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security.
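To make the three variants concrete, here is a minimal sketch (not from the book; the URI and the user name "tom" are hypothetical, and it assumes a surrounding method that declares the checked exceptions):

Configuration conf = new Configuration();
FileSystem defaultFs = FileSystem.get(conf);                                   // default filesystem from core-site.xml
FileSystem byUri = FileSystem.get(URI.create("hdfs://localhost/"), conf);      // chosen by the URI's scheme and authority
FileSystem asUser = FileSystem.get(URI.create("hdfs://localhost/"), conf, "tom"); // accessed as the given user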
i) Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly.

public class FileSystemCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
run:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt

j) The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream:

package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable {
// implementation elided
}

k) The Seekable interface permits seeking to a position in the file and provides a query method for the current offset from the start of the file (getPos()):

public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
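
FSDataInputStream also implements the PositionedReadable interface, which allows part of a file to be read at a given offset without changing the current position in the stream. Sketched from the Hadoop API, it looks roughly like this:

public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}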

l) Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek():

public class FileSystemDoubleCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
run:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt

m) Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.


n) Writing Data
o) The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to:


public FSDataOutputStream create(Path f) throws IOException
p) There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes:

package org.apache.hadoop.util;
public interface Progressable {
    public void progress();
}

q) As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
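For example, a minimal sketch of appending a line to an existing file (the path is hypothetical; note that append() is optional and not supported by every FileSystem implementation):

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
FSDataOutputStream out = fs.append(new Path("/user/tom/log.txt")); // hypothetical existing file
out.writeBytes("another record\n"); // new bytes are added at the end of the file
out.close();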
r) Example 3-4. Copying a local file to a Hadoop filesystem.

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print(".");
            }
        });

        IOUtils.copyBytes(in, out, 4096, true);
    }
}

s) The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
    public long getPos() throws IOException {
// implementation elided
    }
// implementation elided
}

However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.
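As a small illustration (a sketch with a hypothetical path, assuming an fs instance obtained as in Example 3-4), getPos() only ever moves forward as bytes are written; there is no corresponding seek() on the output stream:

FSDataOutputStream out = fs.create(new Path("/user/tom/positions.txt")); // hypothetical path
out.writeBytes("first line\n");
System.out.println(out.getPos());   // offset after the first write
out.writeBytes("second line\n");
System.out.println(out.getPos());   // a larger offset; the stream cannot be repositioned
out.close();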


t) FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
Often, you don't need to explicitly create a directory, because writing a file by calling create() will automatically create any parent directories.
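A one-line sketch with a hypothetical path; mkdirs() creates any missing parent directories in one call and returns true on success:

boolean created = fs.mkdirs(new Path("/user/tom/archive/2024")); // hypothetical path; parents are created as needed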
u) Querying the Filesystem
v) An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
w) The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory.
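A minimal sketch (reusing the quangle.txt file from the run examples above, and an fs instance obtained as before) of pulling the individual metadata fields out of a FileStatus:

FileStatus stat = fs.getFileStatus(new Path("/user/tom/quangle.txt"));
System.out.println(stat.getLen());              // file length in bytes
System.out.println(stat.getBlockSize());        // block size used for the file
System.out.println(stat.getReplication());      // replication factor
System.out.println(stat.getModificationTime()); // milliseconds since the epoch
System.out.println(stat.getOwner() + ":" + stat.getGroup());
System.out.println(stat.getPermission());       // e.g. rw-r--r--
System.out.println(stat.isDirectory());         // false for a regular file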


x) Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That's what FileSystem's listStatus() methods are for:

public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
y) Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem.

public class ListStatus {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        Path[] paths = new Path[args.length];
        for (int i = 0; i < paths.length; i++) {
            paths[i] = new Path(args[i]);
        }

        FileStatus[] status = fs.listStatus(paths);
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}

z) Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

Hadoop supports the same set of glob characters as the Unix bash shell.
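A short sketch (the glob pattern and the date-based directory layout are hypothetical) of using globStatus() with a wildcard, reusing FileUtil.stat2Paths() from Example 3-6:

FileStatus[] matches = fs.globStatus(new Path("/2007/*/*")); // hypothetical layout such as /2007/12/30
for (Path p : FileUtil.stat2Paths(matches)) {
    System.out.println(p);
}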
