
MapReduce: Secondary Sort and Multi-level Sort

Big Data · Hadoop · MapReduce · Java


Requirements

Given the following data:

cookieId    time    url
2   12:12:34    2_hao123
3   09:10:34    3_baidu
1   15:02:41    1_google
3   22:11:34    3_sougou
1   19:10:34    1_baidu
2   15:02:41    2_google
1   12:12:34    1_hao123
3   23:10:34    3_soso
2   05:02:41    2_google

Suppose we now need to sort first by cookieId and then by time, so that the log can later be split into sessions. The sorted result should look like this:

---------------------------------
1      12:12:34        1_hao123
1      15:02:41        1_google
1      19:10:34        1_baidu
---------------------------------
2      05:02:41        2_google
2      12:12:34        2_hao123
2      15:02:41        2_google
---------------------------------
3      09:10:34        3_baidu
3      22:11:34        3_sougou
3      23:10:34        3_soso

The task is to implement this with a MapReduce program.

Design Analysis

Map function:
/**
 * The map function parses each line into an AccessLogWritable. The Map output is then
 * sorted on the two fields of the AccessLogWritable object, which gives us the
 * secondary sort we need. In other words, we still rely on the sorting that happens
 * when Map emits its output; only the ordering rules are now the ones we define in
 * AccessLogWritable.
 */

Reduce function:
/**
 * By the time the data reaches the Reducer after the shuffle it is already sorted,
 * so we simply write it out.
 */

So, to compare on multiple fields, we define a custom key type to use as the Map output key.

MapReduce Program

The sorting approach is explained in the code comments. Note that the driver is built with the Job utility class (MapReduceJobUtil) developed in an earlier post.

SecondSortJob.java

package com.uplooking.bigdata.mr.secondsort;

import com.uplooking.bigdata.common.utils.MapReduceJobUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

/**
 * Secondary sort with MapReduce
 */
public class SecondSortJob {

    /**
     * Driver: builds the Job with the MapReduceJobUtil utility class
     * @param args input path and output path
     */
    public static void main(String[] args) throws Exception {
        if (args == null || args.length < 2) {
            System.err.println("Parameter error! Usage: <inputpath> <outputpath>");
            System.exit(-1);
        }

        Job job = MapReduceJobUtil.buildJob(new Configuration(),
                SecondSortJob.class,
                args[0],
                TextInputFormat.class,
                SecondSortMapper.class,
                AccessLogWritable.class,
                NullWritable.class,
                new Path(args[1]),
                TextOutputFormat.class,
                SecondSortReducer.class,
                AccessLogWritable.class,
                NullWritable.class);

        // The number of reduce tasks must be set to 1: with several reducers, each
        // output file would be sorted internally, but there would be no total order
        // across files (that would require a custom partitioner)
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }

    /**
     * The map function parses each line into an AccessLogWritable. The Map output is
     * then sorted on the two fields of the AccessLogWritable object, which gives us
     * the secondary sort we need: the sorting still happens when Map emits its output,
     * but the ordering rules are the ones defined in AccessLogWritable.
     */
    public static class SecondSortMapper extends Mapper<LongWritable, Text, AccessLogWritable, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line into its three tab-separated fields
            String[] fields = value.toString().split("\t");
            // split() never returns null, so we only skip lines with fewer than 3 fields
            if (fields.length < 3) {
                return;
            }
            String cookieId = fields[0];
            String time = fields[1];
            String url = fields[2];
            // Build the AccessLogWritable object that carries the sort fields
            AccessLogWritable logLine = new AccessLogWritable(cookieId, time, url);
            // Emit the record; the key itself carries all of the output data
            context.write(logLine, NullWritable.get());
        }
    }

    /**
     * By the time the data reaches the Reducer after the shuffle it is already
     * sorted, so we simply write it out.
     */
    public static class SecondSortReducer extends Reducer<AccessLogWritable, NullWritable, AccessLogWritable, NullWritable> {
        @Override
        protected void reduce(AccessLogWritable key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
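
The driver above depends on MapReduceJobUtil.buildJob, which is not shown in this post. Below is a minimal sketch of what that utility method might look like, reconstructed from the call site above; the parameter order is an assumption, and the real utility class from the earlier post may differ.

package com.uplooking.bigdata.common.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Hypothetical reconstruction of the Job utility class used by the driver;
 * written to match the buildJob call in SecondSortJob.
 */
public class MapReduceJobUtil {

    public static Job buildJob(Configuration conf,
                               Class<?> jobClass,
                               String inputPath,
                               Class<? extends InputFormat> inputFormatClass,
                               Class<? extends Mapper> mapperClass,
                               Class<?> mapOutputKeyClass,
                               Class<?> mapOutputValueClass,
                               Path outputPath,
                               Class<? extends OutputFormat> outputFormatClass,
                               Class<? extends Reducer> reducerClass,
                               Class<?> outputKeyClass,
                               Class<?> outputValueClass) throws Exception {
        Job job = Job.getInstance(conf, jobClass.getSimpleName());
        job.setJarByClass(jobClass);

        // Input side: format, path, mapper, and map output types
        job.setInputFormatClass(inputFormatClass);
        FileInputFormat.setInputPaths(job, inputPath);
        job.setMapperClass(mapperClass);
        job.setMapOutputKeyClass(mapOutputKeyClass);
        job.setMapOutputValueClass(mapOutputValueClass);

        // Output side: format, path, reducer, and final output types
        job.setOutputFormatClass(outputFormatClass);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setReducerClass(reducerClass);
        job.setOutputKeyClass(outputKeyClass);
        job.setOutputValueClass(outputValueClass);
        return job;
    }
}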

AccessLogWritable.java

package com.uplooking.bigdata.mr.secondsort;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * A custom Hadoop data type. To be usable as a key it must implement the
 * WritableComparable interface; map-side sorting compares AccessLogWritable
 * objects, so the type parameter is AccessLogWritable.
 */
public class AccessLogWritable implements WritableComparable<AccessLogWritable> {

    private String cookieId;
    private String time;
    private String url;

    /**
     * A no-arg constructor is required; without it, Hadoop's reflection-based
     * instantiation fails with the following exception:
     Caused by: java.lang.NoSuchMethodException: com.uplooking.bigdata.mr.secondsort.AccessLogWritable.<init>()
     at java.lang.Class.getConstructor0(Class.java:3082)
     at java.lang.Class.getDeclaredConstructor(Class.java:2178)
     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:125)
     ... 16 more
     */
    public AccessLogWritable() {

    }

    public AccessLogWritable(String cookieId, String time, String url) {
        this.cookieId = cookieId;
        this.time = time;
        this.url = url;
    }

    /**
     * Comparison method. The ordering rule is:
     * sort by cookieId first, then by time.
     * @param o the record to compare against
     * @return a negative, zero, or positive value, following the compareTo convention
     */
    public int compareTo(AccessLogWritable o) {
        int ret = this.cookieId.compareTo(o.cookieId);
        // If the cookieIds are equal, fall back to comparing time
        if(ret == 0) {
            ret = this.time.compareTo(o.time);
        }
        return ret;
    }

    // Serialization: write the fields in a fixed order
    public void write(DataOutput out) throws IOException {
        out.writeUTF(cookieId);
        out.writeUTF(time);
        out.writeUTF(url);
    }

    // Deserialization: read the fields back in the order they were written
    public void readFields(DataInput in) throws IOException {
        this.cookieId = in.readUTF();
        this.time = in.readUTF();
        this.url = in.readUTF();
    }

    @Override
    public String toString() {
        return cookieId + "\t" + time + "\t" + url;
    }
}
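
To sanity-check the ordering rules, a few records can be compared directly (an ad-hoc snippet, not part of the original post):

AccessLogWritable a = new AccessLogWritable("1", "15:02:41", "1_google");
AccessLogWritable b = new AccessLogWritable("2", "05:02:41", "2_google");
AccessLogWritable c = new AccessLogWritable("1", "12:12:34", "1_hao123");
System.out.println(a.compareTo(b) < 0);  // true: cookieId "1" sorts before "2"
System.out.println(a.compareTo(c) > 0);  // true: same cookieId, "15:02:41" comes after "12:12:34"

Two caveats are worth noting. Comparing time as a String works here because the fixed-width HH:mm:ss format sorts lexicographically in chronological order. cookieId, however, is also compared as a String, so ids sort lexicographically ("10" before "2"); if numeric ordering is needed, parse the field to an int before comparing.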

Testing

Here the MapReduce program is run in the local environment, with the following input parameters:

/Users/yeyonghao/data/input/secondsort /Users/yeyonghao/data/output/mr/secondsort

You can also package the program into a jar and run it on a Hadoop cluster.
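
For example (the jar name here is hypothetical):

hadoop jar secondsort.jar com.uplooking.bigdata.mr.secondsort.SecondSortJob \
    /input/secondsort /output/mr/secondsort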

After running the program, the output is as follows:

yeyonghao@yeyonghaodeMacBook-Pro:~/data/output/mr/secondsort$ cat part-r-00000
1   12:12:34    1_hao123
1   15:02:41    1_google
1   19:10:34    1_baidu
2   05:02:41    2_google
2   12:12:34    2_hao123
2   15:02:41    2_google
3   09:10:34    3_baidu
3   22:11:34    3_sougou
3   23:10:34    3_soso

As the output shows, with the custom key our MapReduce program performs the required secondary sort.

Extension: How to Implement Multi-level Sorting

If the program above is clear, the idea behind multi-level sorting follows naturally: the comparison rules live entirely in the key, and Map sorts its output by that key. So to sort on more than two fields, simply extend the compareTo method of the custom key with additional comparison levels, as shown in the sketch below.
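
For example, a three-level ordering on cookieId, then time, then url (the third level is added here purely for illustration) could look like this:

    public int compareTo(AccessLogWritable o) {
        // Level 1: cookieId
        int ret = this.cookieId.compareTo(o.cookieId);
        if (ret != 0) {
            return ret;
        }
        // Level 2: time
        ret = this.time.compareTo(o.time);
        if (ret != 0) {
            return ret;
        }
        // Level 3: url; further levels follow the same pattern
        return this.url.compareTo(o.url);
    }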
