[Hive] Hive tuning: running tasks in parallel

Business background

The Hive SQL for extract_trfc_page_kpi is as follows:

set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set mapred.job.name=extract_trfc_page_kpi;

insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct 
    page_type_id,
    pv,
    uv,
    '$yesterday' update_time
from
(
    -- for PC and H5
    select
        page_type_id,
        sum(pv) as pv,
        sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1
    group by page_type_id

union all

    -- special handling for the PC search page
    select
        5 as page_type_id,
        sum(pv) as pv,
        sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)

union all

    -- for APP
    select
        a.page_type_id,
        sum(pv) as pv,
        sum(uv) as uv
    from dw.rpt_trfc_page_kpi a
    left outer join (
        select distinct
            page_type_id,
            old_page_type_id
        from tandem.mobile_backend_page_url_rule
        where is_delete = 0
    ) b on (a.page_type_id = b.old_page_type_id)
    where a.ds = '$yesterday' and stat_type = 1
    group by a.page_type_id
) t;

The SQL above contains two union all operations. When the queries run one after another, the whole statement takes 20 minutes.

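Optimization strategy

Looking at the SQL above, the three queries joined by union all do not depend on one another, so there is no need to run them sequentially. The idea is to run these three queries in parallel. Hive provides the following parameters for running the jobs of a statement in parallel: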

-- enable parallel execution of jobs
set hive.exec.parallel=true;
-- maximum number of threads used to run the jobs of a single SQL statement in parallel
set hive.exec.parallel.thread.number=8;
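
To see which values a session is actually using, `set` can be given a property name without a value, in which case Hive prints the current setting instead of assigning one. A minimal sketch, assuming it is run from a Hive script or the hive CLI; the values printed depend on the installation:

```sql
-- Show the current values (no "=value" part means "print", not "assign").
set hive.exec.parallel;
set hive.exec.parallel.thread.number;

-- Override them for the current session only.
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
```

Settings made this way last only for the session, so they do not affect other scripts that share the same Hive installation.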

Option 1

Add the two Hive parameters above when running the SQL, for example:

set mapred.job.queue.name=pms;
set hive.exec.reducers.max=8;
set mapred.reduce.tasks=8;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;
set mapred.job.name=extract_trfc_page_kpi;

insert overwrite table pms.extract_trfc_page_kpi partition(ds='$yesterday')
select distinct 
    page_type_id,
    pv,
    uv,
    '$yesterday' update_time 
from
(
    -- for PC and H5
    select 
        page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 
    group by page_type_id 

union all

    -- special handling for the PC search page
    select 
        5 as page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi 
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)

union all

    -- for APP
    select 
        a.page_type_id,
        sum(pv) as pv,
        sum(uv) as uv 
    from dw.rpt_trfc_page_kpi a 
    left outer join (
        select distinct 
            page_type_id, 
            old_page_type_id 
        from tandem.mobile_backend_page_url_rule 
        where is_delete = 0
    ) b on (a.page_type_id = b.old_page_type_id)
    where a.ds = '$yesterday' and stat_type = 1 
    group by a.page_type_id 
) t;
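
One way to check that the union all branches really are independent is to look at the query plan. The sketch below runs `explain` on a shortened, two-branch version of the query (an illustration, not the production statement). The STAGE DEPENDENCIES section of the output lists which stages depend on which; stages that do not depend on each other are the candidates for running at the same time when hive.exec.parallel is on. The exact stage layout varies with the Hive version and execution engine, so no output is shown here.

```sql
-- Illustrative only: a shortened two-branch version of the query above.
-- '$yesterday' is the same placeholder that the calling script substitutes.
explain
select page_type_id, pv, uv
from
(
    select
        page_type_id,
        sum(pv) as pv,
        sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1
    group by page_type_id

    union all

    select
        5 as page_type_id,
        sum(pv) as pv,
        sum(uv) as uv
    from dw.rpt_trfc_page_kpi
    where ds = '$yesterday' and stat_type = 1 and page_type_id in (51, 52)
) t;
```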

Option 2

Set the parameters in hive-site.xml. First, check the configuration parameters of the current Hive version:

hive> set -v;
...
hive.exec.orc.zerocopy=false
hive.exec.parallel=false
hive.exec.parallel.thread.number=8
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
hive.exec.script.maxerrsize=100000
hive.exec.script.trust=false
hive.exec.show.job.failure.debug.info=true
...

These parameters are configured in $HIVE_HOME/conf/hive-site.xml. Now add the following to that file:

<property>
    <name>hive.exec.parallel</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.parallel.thread.number</name>
    <value>16</value>
</property>

Restart Hive, and the newly configured parameters have taken effect:

hive> set -v;
...
hive.exec.orc.skip.corrupt.data=false
hive.exec.orc.zerocopy=false
hive.exec.parallel=true
hive.exec.parallel.thread.number=16
hive.exec.perf.logger=org.apache.hadoop.hive.ql.log.PerfLogger
hive.exec.rcfile.use.explicit.header=true
hive.exec.rcfile.use.sync.cache=true
hive.exec.reducers.bytes.per.reducer=1000000000
hive.exec.reducers.max=999
hive.exec.rowoffset=false
hive.exec.scratchdir=/tmp/hive-pms
hive.exec.script.allow.partial.consumption=false
...

Conclusion

In testing, adding these two parameters cut the running time of the extract_trfc_page_kpi script from 20 minutes to 3 minutes.