
Hadoop Source Code Explained: The Mapper Class

1. Class description

Maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.

The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job. Mapper implementations can access the Configuration for the job via JobContext.getConfiguration().

The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, org.apache.hadoop.mapreduce.Mapper.Context) for each key/value pair in the InputSplit. Finally cleanup(org.apache.hadoop.mapreduce.Mapper.Context) is called.
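
A hedged sketch of that setup(), map(), cleanup() sequence, with setup() reading a job parameter through the Configuration (the class name, the example.case.sensitive property, and the pass-through logic are illustrative assumptions, not part of the Hadoop source):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: illustrates where setup(), map() and cleanup() fit into the call sequence.
public class LifecycleMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  private boolean caseSensitive;

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // called once before any map() call; the job Configuration is available here
    caseSensitive = context.getConfiguration().getBoolean("example.case.sensitive", false);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // called once for every key/value pair in this task's InputSplit
    String text = caseSensitive ? line.toString() : line.toString().toLowerCase();
    context.write(new Text(text), NullWritable.get());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // called exactly once, after the last record of the split has been processed
  }
}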

All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to a Reducer to determine the final output. Users can control the sorting and grouping by specifying two key RawComparator classes. [Which two RawComparator classes? In the new API these are the sort comparator and the grouping comparator, registered via Job.setSortComparatorClass() and Job.setGroupingComparatorClass(); see the sketch below.]
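
A hedged sketch of how those two comparators are registered on the job (MySortComparator and MyGroupingComparator are hypothetical user classes, assumed to extend WritableComparator; they are not part of the Hadoop API):

// Driver fragment: the sort comparator orders keys inside each partition,
// the grouping comparator decides which keys are fed into a single reduce() call.
Job job = Job.getInstance(new Configuration(), "comparator-example");
job.setSortComparatorClass(MySortComparator.class);
job.setGroupingComparatorClass(MyGroupingComparator.class);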

The Mapper outputs are partitioned per Reducer. Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
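
A minimal sketch of such a custom Partitioner (the first-letter routing rule is purely illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: every key starting with the same (lower-cased) character goes to the same Reducer.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String k = key.toString();
    int bucket = k.isEmpty() ? 0 : Character.toLowerCase(k.charAt(0));
    return bucket % numPartitions;
  }
}
// Registered in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);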

Users can optionally specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps to cut down the amount of data transferred from the Mapper to the Reducer.
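
In the word-count case the Reducer itself can usually serve as the combiner, because summing partial counts is associative. A hedged driver fragment (IntSumReducer is an assumed reducer that sums the IntWritable values of each key):

// Driver fragment: pre-aggregate map output on each node before it is shuffled to the reducers.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);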

Applications can specify if and how the intermediate outputs are to be compressed and which CompressionCodecs are to be used via the Configuration.
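
A hedged sketch of turning on map-output compression through the Configuration (the property names are the Hadoop 2.x keys; everything else here is an illustrative skeleton):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // compress the intermediate (map-side) output to cut shuffle traffic
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "compressed-map-output");
    // ... set mapper, reducer, input and output paths here, then job.waitForCompletion(true);
  }
}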

2. Class source code

package org.apache.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.mapreduce.task.MapContextImpl;

 /* 
 * @see InputFormat
 * @see JobContext
 * @see Partitioner  
 * @see Reducer
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  /**
   * The <code>Context</code> passed on to the {@link Mapper} implementations.
   */
  public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }
  
  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

  /**
   * Called once for each key/value pair in the input split. Most applications
   * should override this, but the default is the identity function.
   */
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, 
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }
  
  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }
}
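
The run() method above is the template that drives a map task: setup() once, map() for every record the context returns, and cleanup() in a finally block. A hedged sketch of the kind of override the javadoc has in mind (the progress-status detail is an assumption, not something the source prescribes):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: same contract as Mapper.run(), plus a status update every 100,000 records.
public class StatusReportingMapper<KI, VI, KO, VO> extends Mapper<KI, VI, KO, VO> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    long processed = 0;
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
        if (++processed % 100_000 == 0) {
          context.setStatus("processed " + processed + " records");
        }
      }
    } finally {
      cleanup(context);  // always runs, even if map() throws
    }
  }
}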

3. Class methods

3.1 The inner class Context
  • Class description

The Context passed on to the Mapper implementations.

  • Class code
public abstract class Context
    implements MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
  }

As the code shows, the abstract class Context implements the MapContext interface with the type parameters <KEYIN,VALUEIN,KEYOUT,VALUEOUT>.
MapContext in turn extends TaskInputOutputContext, which declares the write() method. In the WordCountMapper class it is exactly this write() method that emits the intermediate key/value pairs, as in the fragment below (a complete sketch of the class follows it):

// emit <word, 1> for each word in the line
for (String word : words) {
  // write(): generate an output key/value pair => (KEYOUT: Text, VALUEOUT: IntWritable)
  context.write(new Text(word), new IntWritable(1));
}
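
For reference, a complete WordCountMapper along the lines of that fragment might look like the sketch below (the whitespace tokenization is an assumption; only the context.write() call is taken from the snippet):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of the WordCountMapper referenced above: emits <word, 1> for every word in a line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    for (String w : line.toString().split("\\s+")) {
      if (!w.isEmpty()) {
        word.set(w);
        context.write(word, ONE);  // the write() call discussed in this section
      }
    }
  }
}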

??? One remaining question: where is this write() method actually implemented? ??? In Hadoop 2.x, the Context handed to the Mapper is a WrappedMapper.Context wrapping a MapContextImpl; MapContextImpl extends TaskInputOutputContextImpl, whose write() delegates to the task's RecordWriter, i.e. the map task's sort/spill buffer when the job has reducers, or the OutputFormat's RecordWriter for a map-only job.