[Hive] 13 - Hands-on Case 1: Data ETL

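Requirements:

  • Run ETL on the base table of web clickstream logs (following the warehouse model design)
  • Compute the top 10 referring domains for each time dimension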

Existing table "t_orgin_weblog":

+------------------+------------+----------+
| col_name         | data_type  | comment  |
+------------------+------------+----------+
| valid            | string     |          |
| remote_addr      | string     |          |
| remote_user      | string     |          |
| time_local       | string     |          |
| request          | string     |          |
| status           | string     |          |
| body_bytes_sent  | string     |          |
| http_referer     | string     |          |
| http_user_agent  | string     |          |
+------------------+------------+----------+

Sample data:

| true|1.162.203.134| - | 18/Sep/2013:13:47:35| /images/my.jpg  | 200| 19939 |"http://www.angularjs.cn/A0d9" | "Mozilla/5.0 (Windows   |
| true|1.202.186.37 | - | 18/Sep/2013:15:39:11| /wp-content/uploads/2013/08/windjs.png| 200| 34613 | "http://cnodejs.org/topic/521a30d4bee8d3cb1272ac0f" | "Mozilla/5.0(Macintosh;|

Implementation steps:

1. Split the referring URL into host, path, query, and query id

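drop table if exists t_etl_referurl;
-- Strip the surrounding double quotes from http_referer, then let
-- parse_url_tuple emit host, path, query string, and the "id" query parameter.
create table t_etl_referurl as
select a.*, b.*
from t_orgin_weblog a
lateral view parse_url_tuple(regexp_replace(http_referer, "\"", ""),
  'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id;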
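
To see what parse_url_tuple returns before running it over the whole table, it can be called on a single literal. This is a minimal sketch: the URL is a made-up example rather than a row from the log, and a select without a FROM clause assumes Hive 0.13 or later.

```sql
-- Hypothetical referer value; the outer double quotes mimic the raw log field.
select parse_url_tuple(
         regexp_replace('"http://example.com/post?id=42&page=2"', '"', ''),
         'HOST', 'PATH', 'QUERY', 'QUERY:id'
       ) as (host, path, query, query_id);
-- host = example.com, path = /post, query = id=42&page=2, query_id = 42
```

When a part is missing from the URL (for example, a referer with no query string), the corresponding column comes back as NULL.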

2. From the result of the previous step, further split out the date and time to build the ETL detail table "t_etl_detail"

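drop table if exists t_etl_detail;
-- time_local looks like 18/Sep/2013:13:47:35; Hive's substring is 1-based
-- (a start of 0 is treated as 1), so the offsets below start at 1.
create table t_etl_detail as
select b.*,
substring(time_local,1,11) as daystr,  -- 18/Sep/2013
substring(time_local,13) as tmstr,     -- 13:47:35
substring(time_local,4,3) as month,    -- Sep
substring(time_local,1,2) as day,      -- 18
substring(time_local,13,2) as hour     -- 13
from t_etl_referurl b;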
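
The offsets can be checked on a literal taken from the sample data before building the table. The last column is an optional alternative, not part of the original steps: it parses the timestamp into an ISO-style date, assuming the session's locale resolves English month abbreviations such as "Sep".

```sql
-- Sanity check of the substring offsets against one sample timestamp.
select substring('18/Sep/2013:13:47:35', 1, 11) as daystr,  -- 18/Sep/2013
       substring('18/Sep/2013:13:47:35', 13)    as tmstr,   -- 13:47:35
       substring('18/Sep/2013:13:47:35', 4, 3)  as month,   -- Sep
       substring('18/Sep/2013:13:47:35', 1, 2)  as day,     -- 18
       substring('18/Sep/2013:13:47:35', 13, 2) as hour,    -- 13
       -- Alternative (assumption): parse the string instead of slicing it.
       from_unixtime(
         unix_timestamp('18/Sep/2013:13:47:35', 'dd/MMM/yyyy:HH:mm:ss'),
         'yyyy-MM-dd'
       ) as iso_day;
```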

3. Partition the ETL data (the partitioned table holds all of the structured fields)

drop table if exists t_etl_detail_prt;
create table t_etl_detail_prt(
valid                  string,
remote_addr            string,
remote_user            string,
time_local             string,
request                string,
status                 string,
body_bytes_sent        string,
http_referer           string,
http_user_agent        string,
host                   string,
path                   string,
query                  string,
query_id               string,
daystr                 string,
tmstr                  string,
month                  string,
day                    string,
hour                   string)
partitioned by (mm string,dd string);

Load the data:

insert into table t_etl_detail_prt partition(mm='Sep',dd='18')
select * from t_etl_detail where daystr='18/Sep/2013';

insert into table t_etl_detail_prt partition(mm='Sep',dd='19')
select * from t_etl_detail where daystr='19/Sep/2013';
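
Writing one insert per day does not scale past a handful of dates. A possible alternative, sketched here under the assumption that dynamic partitioning is permitted on the cluster, lets Hive derive mm and dd from the month and day columns already present in t_etl_detail.

```sql
-- Allow Hive to create partitions from the query output.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The last two select columns feed the partition columns (mm, dd) in order.
insert into table t_etl_detail_prt partition(mm, dd)
select t.*, t.month as mm, t.day as dd
from t_etl_detail t;
```

This loads every day found in t_etl_detail in one statement, so it should not be combined with the static inserts above for the same dates, or rows will be duplicated.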

Count the visits for each referer_host by time dimension and sort the result

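-- t_display_referer_counts is not defined in this article (assumed columns: referer_host, mm, dd, hh)
create table t_refer_host_visit_top_tmp as
select referer_host,count(*) as counts,mm,dd,hh
from t_display_referer_counts
group by hh,dd,mm,referer_host
order by hh asc,dd asc,mm asc,counts desc;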
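
The query above reads from t_display_referer_counts, which this article never creates. One plausible way to build it is sketched below. It assumes the table is a plain projection of t_etl_detail_prt, with host renamed to referer_host and hour renamed to hh; both mappings are guesses, not something the original states.

```sql
-- Hypothetical definition: the original article does not show this table.
drop table if exists t_display_referer_counts;
create table t_display_referer_counts as
select host as referer_host,  -- referring domain extracted in step 1
       mm,                    -- partition column (month)
       dd,                    -- partition column (day)
       hour as hh             -- hour extracted in step 2
from t_etl_detail_prt
where host is not null;       -- skip rows whose referer had no parsable host
```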

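4. Top N referrers by visit count for each time dimension

Take the top N referer_host values by visit count within each time dimension (here N = 3, per hour and day)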

select * from (select referer_host,counts,concat(hh,dd) as hhdd,row_number() over
(partition by concat(hh,dd) order by counts desc) as od from t_refer_host_visit_top_tmp) t where od<=3;
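
The same pattern extends to other time dimensions. As a sketch of the per-day top 10 named in the requirements, the hourly counts can be rolled up by day before ranking. The aliases day_counts and od are illustrative, and the query assumes t_refer_host_visit_top_tmp was built as shown above.

```sql
-- Per-day top 10 referring hosts, rolled up from the hourly counts.
select referer_host, mm, dd, day_counts, od
from (
  select referer_host, mm, dd, day_counts,
         -- Rank hosts within each day, highest visit count first.
         row_number() over (partition by mm, dd order by day_counts desc) as od
  from (
    select referer_host, mm, dd, sum(counts) as day_counts
    from t_refer_host_visit_top_tmp
    group by referer_host, mm, dd
  ) d
) t
where od <= 10;
```

row_number() breaks ties arbitrarily; rank() or dense_rank() can be swapped in if hosts with equal counts should share a position.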