Hive: How Group By, Join, Distinct, etc. Are Implemented
1. Hive's distribute by
ORDER BY produces a totally ordered result, but it does so by using a single reducer, which makes it very inefficient on large datasets. In many cases a global sort is not needed, and you can use Hive's non-standard extension SORT BY instead: SORT BY produces one sorted file per reducer. Sometimes you also need to control which reducer a particular row goes to, usually so that a subsequent aggregation can be performed; Hive's DISTRIBUTE BY clause does exactly this.
-- Sort weather records by year and temperature, ensuring that all rows
-- with the same year end up in the same reducer partition
from record2
select year, temperature
distribute by year
sort by year asc, temperature desc;
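The routing that DISTRIBUTE BY performs can be sketched with a simplified Python model (hash partitioning on the distribute-by key; this is an illustration of the idea, not Hive's actual code):

```python
# Simplified model (not Hive's actual code) of DISTRIBUTE BY + SORT BY:
# the reducer for a row is chosen by hashing the distribute-by key, so all
# rows sharing a year land in the same partition; SORT BY then orders rows
# only within each partition, not globally.

def partition_for(key, num_reducers):
    # Hive similarly uses the key's hash code modulo the reducer count
    return hash(key) % num_reducers

def distribute_and_sort(rows, num_reducers):
    partitions = [[] for _ in range(num_reducers)]
    for year, temperature in rows:
        partitions[partition_for(year, num_reducers)].append((year, temperature))
    for p in partitions:
        # per-partition ordering: year ascending, temperature descending
        p.sort(key=lambda r: (r[0], -r[1]))
    return partitions

parts = distribute_and_sort(
    [(1950, 0), (1950, 22), (1949, 111), (1949, 78), (1950, -11)], 2)
```

Note that the result is only sorted within each partition; a consumer that needs a total order would still have to merge the partition files.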
2. How Distinct Is Implemented
Sample data:

hive> SELECT * FROM logs;
OK
a 蘋果 3
a 橙子 3
a 燒雞 1
b 燒雞 3

The query:

hive> SELECT COUNT, COUNT(DISTINCT uid) FROM logs GROUP BY COUNT;
It groups by count and computes the number of distinct users per group.
Computation process
1. First, the mapper computes partial results, using (count, uid) as the key; if a row's distinct combination has already been seen, it is skipped. Phase one keys on the (count, uid) combination; phase two keys on count alone.
2. ReduceSink is executed at mapper.close(), and the results are emitted in GroupByOperator.close(). Note that although the key here is (count, uid), the reduce-side partitioning is by count only!
3. The distinct value computed in phase one is not actually used; only the value computed in the reduce phase is accurate. Phase one merely eliminates rows whose key combination is identical. For a plain COUNT, by contrast, the partial results are merged later.
4. distinct decides whether to add 1 by comparing against lastInvoke (because the reduce input is already sorted, it checks whether the distinct column has changed; if it has not, nothing is added).
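The four steps above can be sketched as a small Python simulation (assumed logic based on the description, not Hive's source code):

```python
# Simulation of the two-phase COUNT(DISTINCT uid) ... GROUP BY count plan
# described above (assumed logic, not Hive's actual implementation).

def map_phase(rows):
    # Map-side hash aggregation keyed on (count, uid): rows whose
    # (count, uid) combination was already seen are skipped, which only
    # shrinks shuffle volume -- the real distinct count happens in reduce.
    seen, out = set(), []
    for uid, _item, cnt in rows:
        if (cnt, uid) not in seen:
            seen.add((cnt, uid))
            out.append((cnt, uid))
    return out

def reduce_phase(pairs):
    # The shuffle partitions by `count` alone but sorts by (count, uid),
    # so the reducer bumps the tally only when the (count, uid) pair
    # differs from the previous one (the lastInvoke comparison).
    result, last = {}, None
    for cnt, uid in sorted(pairs):
        if (cnt, uid) != last:
            result[cnt] = result.get(cnt, 0) + 1
        last = (cnt, uid)
    return result

logs = [("a", "apple", 3), ("a", "orange", 3), ("a", "chicken", 1), ("b", "chicken", 3)]
counts = reduce_phase(map_phase(logs))
```

On the sample data this yields 2 distinct users for count=3 (a and b) and 1 for count=1, matching what the real query would return.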
Operator
Explain
hive> explain select count, count(distinct uid) from logs group by count;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME logs))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL count)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL uid)))) (TOK_GROUPBY (TOK_TABLE_OR_COL count))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        logs
          TableScan                      // table scan
            alias: logs
            Select Operator              // column pruning: only uid and count are needed
              expressions:
                    expr: count
                    type: int
                    expr: uid
                    type: string
              outputColumnNames: count, uid
              Group By Operator          // map-side aggregation first
                aggregations:
                      expr: count(DISTINCT uid)   // aggregation expression
                bucketGroup: false
                keys:
                      expr: count
                      type: int
                      expr: uid
                      type: string
                mode: hash               // hash mode
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:       // output keys
                        expr: _col0      // count
                        type: int
                        expr: _col1      // uid
                        type: string
                  sort order: ++
                  Map-reduce partition columns:   // partitioned by the group-by column only
                        expr: _col0      // i.e. count
                        type: int
                  tag: -1
                  value expressions:
                        expr: _col2
                        type: bigint
      Reduce Operator Tree:
        Group By Operator                // second-stage (reduce-side) aggregation