MapReduce Types and Formats (Input and Output)

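I. Input Formats

(1) Input splits and records

① The JobClient generates the input splits (InputSplit) according to the input format specified for the job;

② A split is not the data itself, but a reference to a splittable portion of the data;

③ InputFormat (an interface in the old API, an abstract class in the new API) is responsible for generating the splits;

Source location: packages org.apache.hadoop.mapreduce and org.apache.hadoop.mapreduce.lib.input (new API)

          packages org.apache.hadoop.mapred and org.apache.hadoop.mapred.lib (old API)

See the getSplits() method of the FileInputFormat class;

The computeSplitSize() function determines the split size;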
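To make the role of computeSplitSize() more concrete: in the new-API FileInputFormat, the split size is the block size clamped between a configured minimum and maximum split size. The sketch below is plain Java that mirrors that formula without depending on Hadoop. The class name SplitSizeSketch and the sample sizes are made up for illustration, and the real getSplits() does more work (for example, it tolerates a slightly oversized last split), which is left out here.

```java
// Sketch only: mirrors the split-size formula of the new-API FileInputFormat.
// SplitSizeSketch is a hypothetical name, not a Hadoop class.
class SplitSizeSketch {

    // The block size, clamped to the range [minSize, maxSize].
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Default-like limits (min = 1, max = Long.MAX_VALUE): split size equals block size.
        System.out.println(computeSplitSize(128 * mb, 1L, Long.MAX_VALUE) / mb);       // 128
        // A maximum below the block size yields smaller splits.
        System.out.println(computeSplitSize(128 * mb, 1L, 64 * mb) / mb);              // 64
        // A minimum above the block size yields larger splits.
        System.out.println(computeSplitSize(128 * mb, 256 * mb, Long.MAX_VALUE) / mb); // 256
    }
}
```

With the default limits the block size wins, which is why the split size normally follows the HDFS block size.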

Class hierarchy of the input format classes (diagram):



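(2) File input

Abstract class: FileInputFormat

① FileInputFormat is the base class of every InputFormat implementation that uses files as its data source;

② By default, the split size used by FileInputFormat is determined by the HDFS block size;

Abstract class: CombineFileInputFormat

① CombineFileInputFormat can be used to pack many small files into fewer splits;

② Because CombineFileInputFormat is abstract, you must create a concrete subclass and implement its getRecordReader() method (createRecordReader() in the new API);

③ Ways to keep a file from being split:

A. Make the block size as large as possible, so that the file is smaller than one block and never needs to be split;

B. Subclass FileInputFormat and override isSplitable() to return false;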

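(3) Text input

Class: TextInputFormat

① TextInputFormat is the default InputFormat; each line of input is one record;

② The key is a LongWritable holding the byte offset of the line within the whole file, and the value is the content of the line, as a Text;

③ Relationship between input splits and HDFS blocks: each TextInputFormat record is one line, so a line may well be stored across two blocks; the record reader then reads past the split boundary to finish that line;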
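The offset keys and the handling of lines that cross a split boundary can be sketched in plain Java. The code below is only modelled on the approach of Hadoop's LineRecordReader (the exact boundary rules differ between Hadoop versions); the class name LineSplitSketch, the method readSplit and the 10-byte split size are made up for illustration. A split that does not start at offset 0 discards everything up to the first newline, and every split keeps reading as long as a line starts at or before its end, so each line is read exactly once.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Sketch only: a simplified model of line records over byte-range splits.
// LineSplitSketch is a hypothetical name, not a Hadoop class.
class LineSplitSketch {

    // Returns "offset<TAB>line" for every record owned by the split [start, end).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip to just past the next newline: that line is read by the previous split.
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++;
        }
        // Keep reading while a line STARTS at or before 'end', even if it finishes later.
        while (pos <= end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') {
                lineEnd++;
            }
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(pos + "\t" + line); // key = byte offset, value = line content
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa bbb\nccc aaa\nddd eee\n".getBytes(StandardCharsets.UTF_8);
        int splitSize = 10; // deliberately smaller than a "block" so that lines straddle splits
        for (int start = 0; start < data.length; start += splitSize) {
            int end = Math.min(start + splitSize, data.length);
            System.out.println("split [" + start + ", " + end + "): " + readSplit(data, start, end));
        }
    }
}
```

In this run the second line starts at offset 8 and ends after offset 10, so the first split reads it in full and the second split skips its tail.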

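Class: KeyValueTextInputFormat

Each line is split into a key and a value at the first separator (a tab by default), so if every line begins with its line number, the key tells you the record's line number; the separator between key and value can be set with the key.value.separator.in.input.line property;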
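As a small illustration of the key/value split described above, here is a plain-Java sketch of cutting a line at the first separator. It imitates the behaviour as I understand it (everything before the first separator is the key, the rest is the value, and a line without a separator becomes a key with an empty value); KeyValueLineSketch and splitLine are hypothetical names, not Hadoop APIs.

```java
// Sketch only: splits a text line into key and value at the first separator.
// KeyValueLineSketch is a hypothetical name, not a Hadoop class.
class KeyValueLineSketch {

    // Returns a two-element array: {key, value}.
    static String[] splitLine(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos == -1) {
            // No separator: the whole line is the key and the value is empty.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        // A line that carries its own line number as the key, separated by a tab.
        String[] kv = splitLine("2\tccc aaa", '\t');
        System.out.println("key=" + kv[0] + ", value=" + kv[1]); // key=2, value=ccc aaa
    }
}
```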

Class: NLineInputFormat

Lets you fix the number of lines each mapper processes, via the mapred.line.input.format.linespermap property;
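The effect of the lines-per-map setting can be pictured as cutting the input into groups of N consecutive lines, one group per mapper. The following plain-Java sketch shows only that grouping idea; NLineSketch and groupLines are made-up names, and the real NLineInputFormat works on file offsets rather than on an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: groups lines into chunks of N, one chunk per (imagined) mapper.
// NLineSketch is a hypothetical name, not a Hadoop class.
class NLineSketch {

    static List<List<String>> groupLines(List<String> lines, int linesPerMap) {
        if (linesPerMap <= 0) {
            throw new IllegalArgumentException("linesPerMap must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerMap) {
            int to = Math.min(i + linesPerMap, lines.size());
            splits.add(new ArrayList<>(lines.subList(i, to)));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l1", "l2", "l3", "l4", "l5");
        // With 2 lines per mapper, 5 lines need 3 mappers; the last one gets a single line.
        System.out.println(groupLines(lines, 2)); // [[l1, l2], [l3, l4], [l5]]
    }
}
```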

(4) Binary input

Classes: SequenceFileInputFormat

SequenceFileAsTextInputFormat

SequenceFileAsBinaryInputFormat

Because SequenceFile is splittable, it works well as a MapReduce input file format, and the splits it yields already consist of <key, value> records;

(5) Multiple inputs

Class: MultipleInputs

① MultipleInputs lets one job read several inputs with different data formats;

② Call addInputPath() once per path to register the inputs;

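(6) Database input

Class: DBInputFormat

① DBInputFormat is an input format that reads data from a relational database over JDBC;

② Avoid opening too many database connections;

③ HBase's TableInputFormat lets a MapReduce program read the data in an HBase table;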

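Experiments:

Create a project named TestMRInputFormat, create the package com.mr, and import the required dependency jars.

Experiment ①: use a SequenceFile as input, so first run SequenceFileWriter.java to produce a file b.seq;

Create the class TestInputFormat.java (adapted from WordCount.java):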

package com.mr;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TestInputFormat {

  // The input key/value types must match those stored in b.seq
  // (IntWritable keys and Text values here).
  public static class TokenizerMapper
       extends Mapper<IntWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(IntWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the value.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      // Sum the counts for one word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(TestInputFormat.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class); // set the input format
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it in Eclipse, with the program arguments configured as in the screenshot below:



The resulting word counts are as follows:



Experiment ②: input from multiple sources:

TestInputFormat2.java:

package com.mr;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TestInputFormat2 {

  // First mapper: reads the text file (key = byte offset, value = line).
  public static class Mapper1
       extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Second mapper: reads the SequenceFile (IntWritable keys, Text values).
  public static class Mapper2
       extends Mapper<IntWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(IntWritable key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(TestInputFormat2.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    Path path1 = new Path("/a.txt");
    Path path2 = new Path("/b.seq");

    // Multiple inputs: each path gets its own input format and mapper.
    MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
    MultipleInputs.addInputPath(job, path2, SequenceFileInputFormat.class, Mapper2.class);

    FileOutputFormat.setOutputPath(job, new Path("/output2"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Create the input text file a.txt:

aaa bbb

ccc aaa

ddd eee

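Package the project as a jar (for some reason it would not run from within Eclipse, and I have not found the cause yet; running it with the jar command works):

File -> Export -> Runnable JAR file, and name the jar file testMR.jar.

Run it from the command line:

$ hadoop jar testMR.jar com.mr.TestInputFormat2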



The resulting word counts are as follows:



II. Output Formats

Class hierarchy of the output format classes (diagram):



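(1) Text output

Class: TextOutputFormat

① The default output format; keys and values can be of any type, because TextOutputFormat writes them as text by calling toString();

② Each record is written as one line of the form "key \t value" (key and value separated by a tab);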
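The "key \t value" line format can be sketched in a few lines of plain Java. This is only an approximation of what a text record writer does, under the assumption that a missing key or value (null here) is simply left out together with the separator; TextOutputLineSketch and formatRecord are hypothetical names, not Hadoop APIs.

```java
// Sketch only: formats one output record as a text line, "key<separator>value".
// TextOutputLineSketch is a hypothetical name, not a Hadoop class.
class TextOutputLineSketch {

    static String formatRecord(Object key, Object value, String separator) {
        boolean hasKey = key != null;
        boolean hasValue = value != null;
        if (!hasKey && !hasValue) {
            return ""; // nothing to write
        }
        StringBuilder sb = new StringBuilder();
        if (hasKey) {
            sb.append(key); // uses key.toString()
        }
        if (hasKey && hasValue) {
            sb.append(separator);
        }
        if (hasValue) {
            sb.append(value); // uses value.toString()
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // A word-count style record: word as key, count as value, tab-separated.
        System.out.print(formatRecord("aaa", 2, "\t")); // prints: aaa<TAB>2
    }
}
```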

(2) Binary output

Classes: SequenceFileOutputFormat

SequenceFileAsBinaryOutputFormat

MapFileOutputFormat

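(3) Multiple outputs

Classes: MultipleOutputFormat

      MultipleOutputs

Difference: MultipleOutputs can produce outputs of different types (each named output can have its own key and value classes);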

(4) Database output

Class: DBOutputFormat

 http://blog.sina.com.cn/s/blog_4438ac090101qfuh.html