
Hive's MR plans and basic MapReduce design patterns


(Original article; please do not repost.)

In Hive, prefixing a select query with explain or explain extended prints a brief description of how the query will run as MapReduce. The explain output looks like this:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:    --map tree
          TableScan
            alias: testtb
            Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
            Select Operator
              expressions: zd1 (type: string), zd2 (type: string), zd3 (type: string)
              outputColumnNames: zd1, zd2, zd3
              Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
              Group By Operator
                aggregations: sum(zd3)
                keys: zd1 (type: string), zd2 (type: string)
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: string), _col1 (type: string)
                  sort order: ++
                  Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                  Statistics: Num rows: 0 Data size: 86 Basic stats: PARTIAL Column stats: NONE
                  value expressions: _col2 (type: double)
      Reduce Operator Tree:    --reduce tree
        Group By Operator
          aggregations: sum(VALUE._col0)
          keys: KEY._col0 (type: string), KEY._col1 (type: string)
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
          Select Operator
            expressions: _col0 (type: string), _col1 (type: string), _col2 (type: double)
            outputColumnNames: _col0, _col1, _col2
            Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
            File Output Operator
              compressed: false
              Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1

This output can be used to analyze the MapReduce process. The plan above is the MR process for grouping by zd1, zd2 and computing sum(zd3):

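The group by columns are used directly as the key. By default Hive first aggregates on the map side (set hive.map.aggr=true), and the map-side Group By Operator has mode hash. The reduce side then aggregates again, this time with mode mergepartial. If map-side aggregation is turned off with set hive.map.aggr=false, the reduce-side mode is complete.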
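The difference between the two settings can be sketched in a few lines of Python. This is only a toy simulation, not Hive's implementation: the sample rows, the split into two mappers, and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows (zd1, zd2, zd3), split across two mappers.
splits = [
    [("a", "x", 1.0), ("a", "x", 2.0), ("b", "y", 5.0)],
    [("a", "x", 4.0), ("b", "y", 0.5)],
]

def map_hash_aggregate(rows):
    """Map side with hive.map.aggr=true: pre-aggregate in a hash table (mode: hash)."""
    partial = defaultdict(float)
    for zd1, zd2, zd3 in rows:
        partial[(zd1, zd2)] += zd3
    return list(partial.items())

def map_passthrough(rows):
    """Map side with hive.map.aggr=false: emit one (key, value) pair per row."""
    return [((zd1, zd2), zd3) for zd1, zd2, zd3 in rows]

def reduce_sum(pairs):
    """Reduce side: sum whatever arrives for each key.

    Fed partial sums, this plays the role of mode mergepartial;
    fed raw row values, it plays the role of mode complete.
    """
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run(map_fn):
    """Run every mapper, concatenate their output, and reduce it."""
    shuffled = [pair for split in splits for pair in map_fn(split)]
    return len(shuffled), reduce_sum(shuffled)

pairs_with_aggr, result_with_aggr = run(map_hash_aggregate)
pairs_without_aggr, result_without_aggr = run(map_passthrough)

print("pairs shuffled:", pairs_with_aggr, "vs", pairs_without_aggr)
print(result_with_aggr)
```

Both settings give the same sums; what changes is how many pairs cross the shuffle, which is the reason map-side aggregation is on by default.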

Basic MapReduce design patterns (reference: MapReduce Design Patterns, by Donald Miner and Adam Shook):

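1. Grouped numerical aggregation. The map side uses the columns to group by directly as the keys, and the values carry the data that is needed. The reduce side computes f(values) for each group of keys to get the result.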

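2. Join. The map side uses the join columns as the keys and emits every record as output, tagging records from different tables with a flag. On the reduce side, the records for each group of keys are placed into one list per flag, and the lists are then joined with each other to produce the output. For an inner join every list must be non-empty; for outer joins such as left or right join, rows are still output when the other table's list is empty.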
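Both patterns can be sketched with a small in-memory map, shuffle, reduce driver. This is a minimal illustration under assumptions, not code from the referenced book: the helper names (shuffle, aggregate, reduce_side_join), the "L"/"R" flags, and the sample tables are all hypothetical.

```python
from collections import defaultdict
from itertools import product

def shuffle(pairs):
    """Group mapper output by key, as the shuffle phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Pattern 1: grouped numerical aggregation.
def aggregate(rows, key_fn, value_fn, f):
    pairs = [(key_fn(row), value_fn(row)) for row in rows]          # map
    return {key: f(values) for key, values in shuffle(pairs).items()}  # reduce

# Pattern 2: reduce-side join.
def reduce_side_join(left, right, left_key, right_key, how="inner"):
    # Map: key each record by its join column and tag it with a source flag.
    pairs = [(left_key(rec), ("L", rec)) for rec in left]
    pairs += [(right_key(rec), ("R", rec)) for rec in right]
    joined = []
    for _key, tagged in shuffle(pairs).items():
        # Reduce: split the group into one list per flag.
        by_flag = {"L": [], "R": []}
        for flag, rec in tagged:
            by_flag[flag].append(rec)
        lefts, rights = by_flag["L"], by_flag["R"]
        if how == "left" and not rights:
            rights = [None]   # keep left rows that have no match
        elif how == "right" and not lefts:
            lefts = [None]    # keep right rows that have no match
        # An empty list yields no pairs, which is the inner-join rule.
        joined.extend(product(lefts, rights))
    return joined

rows = [("a", "x", 1.0), ("a", "x", 2.0), ("b", "y", 5.0)]
sums = aggregate(rows, lambda r: (r[0], r[1]), lambda r: r[2], sum)
maxima = aggregate(rows, lambda r: (r[0], r[1]), lambda r: r[2], max)

users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
inner = reduce_side_join(users, orders, lambda u: u[0], lambda o: o[0])
left = reduce_side_join(users, orders, lambda u: u[0], lambda o: o[0], how="left")
print(sums, maxima)
print(inner)
print(left)
```

Swapping sum for max, min, or len changes f(values) without touching the map side, which is what makes the first pattern reusable. In the join, the only difference between join types is how an empty list is treated.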

(To be continued...)
