A colleague wrote a Hive SQL statement that ran extremely slowly: after more than an hour, the job had only just finished the map phase, and the reduce phase was stuck at 20%.
The Hive statement was as follows:
select count(distinct ip)
from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d

Analysis: the subquery on comprehensive.f_client_boot_daily returns roughly 1 billion rows, the subquery on f_app_boot_daily roughly another 1 billion rows, and the subquery on format_log.format_pv1 (filtered by url_first_id=1) roughly another 1 billion rows, for a total of about 3 billion rows. With count(distinct) over a data set this large, Hive shuffles every row to a single reducer, so that one reducer becomes a severe bottleneck (extreme data skew).
Solution:
First, rewrite the query to deduplicate with group by on ip instead of distinct. The rewritten SQL statement is as follows:
select count(*)
from (
    select ip
    from (
        select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
        union all
        select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
        union all
        select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
    ) d
    group by ip
) b
Then, set a reasonable number of reducers so the grouped data is spread across many machines: set mapred.reduce.tasks=50;
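The difference between the two execution plans can be sketched with a small simulation. This is hypothetical Python, not Hive code: `shuffle_distinct` and `shuffle_group_by` are illustrative names, and the fake ip list stands in for the 3 billion real rows. The point is that count(distinct) funnels every row into one reducer's set, while group by ip partitions rows by hash(ip) so each reducer deduplicates only its own share, and a cheap final stage sums the per-reducer counts:

```python
from collections import defaultdict

def shuffle_distinct(rows):
    """Plan 1: count(distinct ip) -- every row lands on a single reducer."""
    reducer = set()               # one reducer must hold every distinct ip
    for ip in rows:
        reducer.add(ip)
    return len(reducer), 1        # (distinct count, reducers doing real work)

def shuffle_group_by(rows, num_reducers):
    """Plan 2: group by ip, then count(*) -- rows spread by hash(ip)."""
    partitions = defaultdict(set)
    for ip in rows:
        # each reducer deduplicates only the ips hashed to it
        partitions[hash(ip) % num_reducers].add(ip)
    # second stage: sum the distinct counts reported by each reducer
    return sum(len(s) for s in partitions.values()), len(partitions)

# a small fake ip log standing in for the 3 billion real rows
rows = [f"10.0.{i % 7}.{i % 50}" for i in range(1000)]
print(shuffle_distinct(rows))       # same answer, one overloaded reducer
print(shuffle_group_by(rows, 50))   # same answer, work spread across reducers
```

Both plans return the same distinct count; the second simply divides the deduplication work, which is why raising mapred.reduce.tasks helps the rewritten query but cannot help the original one.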
After this optimization the speedup was dramatic: the whole job finished in roughly 20 minutes.