
The Functions MapReduce Uses to Partition Data, and Their Counterparts in Streaming

MapReduce has three steps that divide a large dataset and feed the data to the mappers and reducers.

InputSplit

The first is the InputSplit, which cuts the data into chunks and hands them to the mappers.

By default, splitting follows the blocks of the data file: one block corresponds to one mapper, and the mapper is preferentially started on the machine that holds the block.

To change how the input is split, override the getSplits method of the InputFormat class (a sketch follows the link below).

https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/mapred/InputFormat.html
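As a rough sketch (it uses the newer org.apache.hadoop.mapreduce API, and the class name is invented for this example), the simplest way to change splitting is to turn it off, so that each file becomes exactly one split and therefore one mapper; overriding getSplits() directly works the same way when finer control is needed:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical example: a TextInputFormat that never splits a file,
// so the inherited getSplits() emits one split (and hence one mapper) per file.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}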

In streaming:

-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default

These two options specify the InputFormat and OutputFormat classes to use.

Partition

The partitioner decides which reducer each map output record is sent to; a custom partitioner normally extends the org.apache.hadoop.mapreduce.Partitioner class, as in the sketch below.
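A minimal sketch of such a subclass (the class name and the key format are invented here): route each record to a reducer chosen by the part of its key before the first '-'.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: partition by the first '-'-separated field of the key.
public class FirstFieldPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String firstField = key.toString().split("-", 2)[0];
        // Clear the sign bit so the modulo always yields a valid partition index.
        return (firstField.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

In a plain Java job it would be registered with job.setPartitionerClass(FirstFieldPartitioner.class).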

Grouping

This concept is harder to grasp: just before the data reaches a reducer, it is grouped one more time. Each group is handed to a single reduce() call, and the key the reducer sees is the key of the first record in that group.

https://stackoverflow.com/questions/14728480/what-is-the-use-of-grouping-comparator-in-hadoop-map-reduce

In the accepted answer, a-1 and a-2 are merged by the grouping comparator into one group keyed by a-1, which the reducer then processes in a single call; a comparator in that spirit is sketched below.
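A grouping comparator along those lines might look like the following sketch (the class name is invented, and keys are assumed to be Text values such as "a-1"): two keys compare as equal whenever the part before the '-' matches, so the values of a-1 and a-2 reach the same reduce() call.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical example: group map output by the part of the key before the first '-'.
public class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true); // true: deserialize keys into Text objects before compare()
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String left = a.toString().split("-", 2)[0];
        String right = b.toString().split("-", 2)[0];
        return left.compareTo(right);
    }
}

In a plain Java job it would be set with job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).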

So how are partitioning and grouping handled in streaming?

In streaming, the partitioner class can be specified with a command-line option:

-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to

It can also be specified with another set of options whose syntax mirrors the sort command:

-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D mapred.text.key.partitioner.options=-k1,2 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \

This sets the field separator to '.', makes the first four fields of each line the key, and partitions on fields 1 and 2 of that key; note that mapred.text.key.partitioner.options only takes effect when the job's partitioner is KeyFieldBasedPartitioner, as on the last line above.

The sort order of keys within each reducer's input can be controlled with the same sort-style syntax, which is interpreted by org.apache.hadoop.mapred.lib.KeyFieldBasedComparator (enabled via -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator):

-D mapreduce.partition.keycomparator.options='-k1,2 -k3,3nr -k4,4nr'

For comparison, the sort command in Linux:

sort -k1 -k2n -k3nr  # sort by the first column, then by the second column as a number, then by the third column as a number in reverse order

Grouping has no corresponding implementation in streaming mode, but partitioning can be used in its place.

Appendix:

Parameter | Optional/Required | Description
-input directoryname or filename | Required | Input location for mapper
-output directoryname | Required | Output location for reducer
-mapper executable or JavaClassName | Required | Mapper executable
-reducer executable or JavaClassName | Required | Reducer executable
-file filename | Optional | Make the mapper, reducer, or combiner executable available locally on the compute nodes
-inputformat JavaClassName | Optional | Class you supply should return key/value pairs of Text class. If not specified, TextInputFormat is used as the default
-outputformat JavaClassName | Optional | Class you supply should take key/value pairs of Text class. If not specified, TextOutputFormat is used as the default
-partitioner JavaClassName | Optional | Class that determines which reduce a key is sent to
-combiner streamingCommand or JavaClassName | Optional | Combiner executable for map output
-cmdenv name=value | Optional | Pass environment variable to streaming commands
-inputreader | Optional | For backwards-compatibility: specifies a record reader class (instead of an input format class)
-verbose | Optional | Verbose output
-lazyOutput | Optional | Create output lazily. For example, if the output format is based on FileOutputFormat, the output file is created only on the first call to output.collect (or Context.write)
-numReduceTasks | Optional | Specify the number of reducers
-mapdebug | Optional | Script to call when map task fails
-reducedebug | Optional | Script to call when reduce task fails