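Big Data (14): Chaining Multiple Jobs and How ReduceTask Works

阿新 • Published: 2018-11-10

I. Chaining multiple jobs: an example (inverted index)

1. Requirement

Count how many times each word appears in each file.

Expected output of the first job (the number of times each word appears in each file):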

apple--a.txt 3

apple--b.txt 1

apple--c.txt 1

grape--a.txt 4

grape--b.txt 3

grape--c.txt 1

pear--a.txt 1

pear--b.txt 2

pear--c.txt 2

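Expected output of the second job (one line per word, listing every file and the count in that file). Note that the second Reducer shown later replaces the tab between file name and count with "-->", so the lines it actually writes look like "apple a.txt-->3 b.txt-->1 c.txt-->1", and the order of the files within a line is not guaranteed: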

apple a.txt 3 b.txt 1 c.txt 1

grape a.txt 4 b.txt 3 c.txt 1

pear a.txt 1 b.txt 2 c.txt 2

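2. Write the first Mapper

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OneIndexMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String name;
    private final Text k = new Text();
    // Every occurrence counts as 1; the reducer sums these values.
    // (A default-constructed IntWritable holds 0, which would make every sum 0.)
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Remember the name of the file this split belongs to
        FileSplit split = (FileSplit) context.getInputSplit();
        name = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read one line
        String line = value.toString();
        // Split it into words
        String[] fields = line.split(" ");
        for (String field : fields) {
            // Skip empty tokens produced by blank lines or repeated spaces
            if (field.isEmpty()) {
                continue;
            }
            // Build the key "word--filename"
            k.set(field + "--" + name);
            // Emit (word--filename, 1)
            context.write(k, v);
        }
    }
}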

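3. Write the first Reducer

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OneIndexReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts for this "word--filename" key
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit (word--filename, total)
        context.write(key, new IntWritable(sum));
    }
}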

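4. Write the first Driver

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneIndexDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Locate the jar through the driver class
        job.setJarByClass(OneIndexDriver.class);

        // 3 Register the mapper and reducer classes
        job.setMapperClass(OneIndexMapper.class);
        job.setReducerClass(OneIndexReducer.class);

        // 4 Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}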

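5. Write the second Mapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TwoIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text k = new Text();
    private final Text v = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Read one line of the first job's output, e.g. "apple--a.txt<TAB>3"
        String line = value.toString();
        // Split into the word and the "filename<TAB>count" part
        String[] fields = line.split("--");
        if (fields.length < 2) {
            // Skip lines that do not have the expected "word--rest" shape
            return;
        }
        k.set(fields[0]);
        v.set(fields[1]);
        // Emit (word, "filename<TAB>count")
        context.write(k, v);
    }
}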

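6. Write the second Reducer

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TwoIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Concatenate all "filename<TAB>count" values for this word,
        // turning each into "filename-->count" followed by a tab
        StringBuilder sb = new StringBuilder();
        for (Text value : values) {
            sb.append(value.toString().replace("\t", "-->")).append("\t");
        }
        // Emit (word, "file1-->n1<TAB>file2-->n2<TAB>...")
        context.write(key, new Text(sb.toString()));
    }
}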

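7. Write the second Driver

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoIndexDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Locate the jar through the driver class
        job.setJarByClass(TwoIndexDriver.class);

        // 3 Register the mapper and reducer classes
        job.setMapperClass(TwoIndexMapper.class);
        job.setReducerClass(TwoIndexReducer.class);

        // 4 Set the key/value types of the mapper output
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // 5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit and wait for completion
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}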

8. Use the output path of the first job as the input path of the second job
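
The hand-off between the two jobs can be sketched without a cluster. The following is a plain-Java simulation of the data flow only, not Hadoop code: the class TwoStageIndexSketch, the methods stageOne and stageTwo, and the sample file contents are all hypothetical names and data made up for illustration. Stage one mirrors what OneIndexMapper and OneIndexReducer compute, and stage two consumes stage one's output lines exactly as TwoIndexMapper and TwoIndexReducer would.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the two chained jobs (no Hadoop involved).
final class TwoStageIndexSketch {

    // Stage 1: count occurrences per "word--file" key.
    // Input maps a file name to its lines; output lines are "key<TAB>count".
    static List<String> stageOne(Map<String, List<String>> files) {
        // TreeMap keeps keys sorted, similar to what the shuffle does
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.split(" ")) {
                    if (word.isEmpty()) {
                        continue;
                    }
                    counts.merge(word + "--" + file.getKey(), 1, Integer::sum);
                }
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    // Stage 2: read stage one's output lines and regroup them by word.
    static List<String> stageTwo(List<String> stageOneOutput) {
        Map<String, StringBuilder> grouped = new TreeMap<>();
        for (String line : stageOneOutput) {
            String[] fields = line.split("--");
            if (fields.length < 2) {
                continue; // not a "word--rest" line
            }
            grouped.computeIfAbsent(fields[0], w -> new StringBuilder())
                   .append(fields[1].replace("\t", "-->"))
                   .append("\t");
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, StringBuilder> e : grouped.entrySet()) {
            out.add(e.getKey() + "\t" + e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("a.txt", Arrays.asList("apple grape", "grape apple"));
        files.put("b.txt", Arrays.asList("grape pear"));

        List<String> first = stageOne(files);
        first.forEach(System.out::println);
        stageTwo(first).forEach(System.out::println);
    }
}
```

On a real cluster the same hand-off is done through paths: run OneIndexDriver to completion first, then pass the directory it was given as args[1] to TwoIndexDriver as args[0].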

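II. How ReduceTask works

1. Setting ReduceTask parallelism

The parallelism of ReduceTasks also affects the concurrency and efficiency of the whole job. Unlike the number of MapTasks, which is determined by the number of input splits, the number of ReduceTasks can be set directly by hand.

// The default is 1; set it to 4 manually
job.setNumReduceTasks(4);

2. Notes

Setting the number of ReduceTasks to 0 means there is no Reduce phase; the number of output files then equals the number of map tasks.
The default number of ReduceTasks is 1, so by default there is one output file.
If the data is unevenly distributed, data skew may appear in the reduce phase.
The number of ReduceTasks cannot be chosen arbitrarily; the business logic has to be considered as well. When a global aggregate has to be computed, only one ReduceTask can be used.
The exact number of ReduceTasks should be chosen according to the performance of the cluster.
If more than one partition is defined but there is only one ReduceTask, the partitioning step is not executed: the source code checks the number of ReduceTasks before the partitioning step.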
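
How the number of ReduceTasks interacts with key distribution can be sketched in plain Java. The formula below is the one used by Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks; everything else is an assumption for illustration: the class PartitionSketch and its methods are hypothetical, String.hashCode() stands in for the hash of a Hadoop key, and the key list is made up to be deliberately skewed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrates how keys map to ReduceTasks and how skew shows up (no Hadoop involved).
final class PartitionSketch {

    // Same formula as the default hash partitioner; String.hashCode() is a stand-in
    // for the hashCode() of the real key type.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Number of records each ReduceTask would receive for the given keys.
    static int[] recordsPerReduceTask(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up, skewed input: "grape" dominates
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            keys.add("grape");
        }
        keys.add("apple");
        keys.add("pear");

        // With 4 ReduceTasks, all "grape" records still land on the same task
        System.out.println(Arrays.toString(recordsPerReduceTask(keys, 4)));
        // With 1 ReduceTask, every record goes to partition 0
        System.out.println(Arrays.toString(recordsPerReduceTask(keys, 1)));
    }
}
```

Because all records with the same key go to the same partition, raising the number of ReduceTasks does not by itself relieve a single hot key.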

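3. Stages of a ReduceTask

Copy phase: the ReduceTask remotely copies a piece of data from each MapTask. If a piece is larger than a size threshold it is written to disk; otherwise it is kept in memory.
Merge phase: while the remote copy is still running, the ReduceTask starts two background threads that merge the files in memory and on disk, so that neither memory use nor the number of files on disk grows too large.
Sort phase: by MapReduce semantics, the input to the user's reduce() function is a group of records gathered by key. To bring records with the same key together, Hadoop uses a sort-based strategy. Since every MapTask has already sorted its own output locally, the ReduceTask only needs to run one merge sort over all the data.
Reduce phase: the reduce() function writes its results to HDFS.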
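
The Sort phase relies on the fact that merging already-sorted runs is cheap. The following is a minimal plain-Java sketch of that idea, not Hadoop's actual merger: the class ReduceSideMergeSketch and its methods are hypothetical, each inner list stands for the locally sorted keys copied from one MapTask, and the sample keys are made up. A min-heap repeatedly picks the smallest head among the runs, after which equal keys sit next to each other and can be handed to reduce() group by group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// K-way merge of sorted runs followed by grouping of equal keys (no Hadoop involved).
final class ReduceSideMergeSketch {

    // Merge several individually sorted runs into one sorted list.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Heap entries are {runIndex, positionInRun}, ordered by the key they point at
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int r = 0; r < runs.size(); r++) {
            if (!runs.get(r).isEmpty()) {
                heap.add(new int[] {r, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            // Advance within the run the smallest key came from
            if (head[1] + 1 < run.size()) {
                heap.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    // Count the size of each key group. Because the input is sorted, equal keys are
    // adjacent, so the insertion order of the map equals the sorted key order.
    static Map<String, Integer> groupCounts(List<String> sortedKeys) {
        Map<String, Integer> groups = new LinkedHashMap<>();
        for (String key : sortedKeys) {
            groups.merge(key, 1, Integer::sum);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up sorted runs, one per MapTask
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("apple", "grape", "pear"),
            Arrays.asList("apple", "grape"),
            Arrays.asList("grape", "pear"));
        List<String> merged = mergeSortedRuns(runs);
        System.out.println(merged);
        System.out.println(groupCounts(merged));
    }
}
```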
