1. 程式人生 > >hive top n (order by與sort by區別)

hive top n (order by與sort by區別)

我想說的SELECT TOP N是取最大前N條或者最小前N條。
Hive
提供了limit關鍵字,再配合order by可以很容易地實現SELECT TOP N但是在Hiveorder by只能使用1reduce,如果表的資料量很大,那麼order by就會力不從心。例如我們執行SQLselect a from ljntest01 order by a limit 10;
控制檯會打印出:Number of reduce tasks determined at compile time: 1
說明啟動的reduce數量是編譯時確定的。檢視該SQL的執行計劃,該SQL只啟動1JOB

假設資料表有

1億條資料,而我們只想取TOP 10,那對1億條資料在1reduce中做全排序是非常不合理的。幸好有sort by,使用sort by替換order by就可以解決這個問題:
select a from ljntest01 sort by a limit 10;
首先執行該SQL控制檯打印出:Number of reduce tasks not specified. Estimated from input data size: 1
說明reduce數不是編譯時確定的,而是根據輸入檔案大小動態確定的。此外檢視該SQL的執行計劃:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 is a root stage 

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        ljntest01
          TableScan
            alias: ljntest01
            Select Operator
              expressions:
                    expr: a
                    type: int
              outputColumnNames: _col0
              Reduce Output Operator
                key expressions:
                      expr: _col0
                      type: int
                sort order: +
                tag: -1
                value expressions:
                      expr: _col0
                      type: int
      Reduce Operator Tree:
        Extract
          Limit
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
 

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        hdfs://hdpnn:9000/group/alidw-cbu/tmp/hive-admin/hive_2012-12-16_01-19-42_893_2878471909568139281/-mr-10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: int
              sort order: +
              tag: -1
              value expressions:
                    expr: _col0
                    type: int
      Reduce Operator Tree:
        Extract
          Limit
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
 

  Stage: Stage-0
    Fetch Operator
      limit: 10 

sort by可以啟動多個reduce,每個reduce做區域性排序,但是這對於sort by limit N已經夠用了。從執行計劃中可以看出sort by limit N啟動了兩個JOB。第一個JOB是在每個reduce中做區域性排序,然後分別取TOP N。假設啟動了Mreduce,第二個JOB再對Mreduce分別區域性排好序的總計M * N條資料做全域性排序,取TOP N,從而得到想要的結果。這樣就可以大大提高SELECT TOP N的效率。