
Module Development: Statistical Analysis


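Once the data warehouse has been built, users can write Hive SQL statements to query it and analyze the data it holds.
In production, the required metrics are usually specified by the departments that consume the data, and new requirements keep arriving. The following are some typical metrics used in website traffic analysis.
Note: every metric can be drilled down along any of the dimension tables.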
1. Traffic analysis
1.1. Total PV by multiple dimensions
By time dimension
-- Compute PVs per hour; note the GROUP BY syntax
select count(*) as pvs,month,day,hour from ods_weblog_detail group by month,day,hour;
Option 1: query the single table ods_weblog_detail directly
-- Compute PVs per hour for the current processing batch (one day)
drop table dw_pvs_everyhour_oneday;
create table dw_pvs_everyhour_oneday(month string,day string,hour string,pvs bigint) partitioned by(datestr string);
insert into table dw_pvs_everyhour_oneday partition(datestr='20130918')
select a.month as month,a.day as day,a.hour as hour,count(*) as pvs from ods_weblog_detail a
where a.datestr='20130918' group by a.month,a.day,a.hour;
-- Compute PVs per day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);
insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from ods_weblog_detail a
group by a.month,a.day;
Option 2: join with the time dimension table
-- Dimension: day
drop table dw_pvs_everyday;
create table dw_pvs_everyday(pvs bigint,month string,day string);
insert into table dw_pvs_everyday
select count(*) as pvs,a.month as month,a.day as day from (select distinct month, day from t_dim_time) a
join ods_weblog_detail b
on a.month=b.month and a.day=b.day
group by a.month,a.day;
-- Dimension: month
drop table dw_pvs_everymonth;
create table dw_pvs_everymonth (pvs bigint,month string);
insert into table dw_pvs_everymonth
select count(*) as pvs,a.month from (select distinct month from t_dim_time) a
join ods_weblog_detail b on a.month=b.month group by a.month;
-- Alternatively, reuse earlier results, e.g. derive the daily total from the hourly results computed above
insert into table dw_pvs_everyday
select sum(pvs) as pvs,month,day from dw_pvs_everyhour_oneday group by month,day having day='18';
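By terminal dimension
The field that reflects the user's terminal is http_user_agent.
User Agent, UA for short, is a special header string that tells the visited website the client's browser type and version, operating system and version, browser engine, and similar details. For example:
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.276 Safari/537.36
The following information can be extracted from this UA string:
browser: chrome, browser version: 58.0, platform: windows,
browser engine: webkit
UA parsing is not covered further here; interested readers can consult reference material on how to parse a UA string.
The statement below gives a rough exploratory look at the data, although its accuracy is limited.
select distinct(http_user_agent) from ods_weblog_detail where http_user_agent like '%Chrome%' limit 200;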
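To make the extraction described above concrete, here is a minimal Python sketch that pulls the browser, version, platform, and engine out of the sample UA string. The function name parse_user_agent and its matching rules are assumptions made for illustration and only cover this Chrome-on-Windows example; real logs would call for a dedicated UA parsing library.

```python
import re

def parse_user_agent(ua):
    """Rough UA parsing (hypothetical helper, covers only the sample case)."""
    info = {"browser": None, "version": None, "platform": None, "engine": None}
    # "Chrome/58.0.3029.276" -> browser "chrome", version "58.0" (major.minor)
    match = re.search(r"Chrome/(\d+\.\d+)", ua)
    if match:
        info["browser"] = "chrome"
        info["version"] = match.group(1)
    # Assumption: a plain substring test is enough for this illustration.
    if "Windows" in ua:
        info["platform"] = "windows"
    if "AppleWebKit" in ua:
        info["engine"] = "webkit"
    return info

ua = ("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/58.0.3029.276 Safari/537.36")
print(parse_user_agent(ua))
```

In Hive, logic like this would typically live in a UDF so that the parsed fields can be used directly in group by.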
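By section dimension
A website section can be understood as a collection of content on a related topic. In the URL, each section has its own directory under the domain. For example, if a site's address is www.xxxx.cn, its sections can be reached as follows:
Section dimension: ../job
Section dimension: ../news
Section dimension: ../sports
Section dimension: ../technology
The section can therefore be parsed from the requested URL, and statistics can then be computed per section.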
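As a rough illustration of parsing the section from the requested URL, the sketch below takes the first path segment as the section and counts PVs per section. The helper name section_of and the sample request paths are made up for this example; they are not taken from the original data.

```python
from collections import Counter
from urllib.parse import urlsplit

def section_of(request_url):
    """Return the first path segment of a URL, or None for the site root."""
    path = urlsplit(request_url).path
    segments = [s for s in path.split("/") if s]
    return segments[0] if segments else None

# Hypothetical request values, for illustration only.
requests = [
    "/job/list?page=2",
    "/news/2013/09/18/a.html",
    "/job/detail/42",
    "/",
]

# PV per section, skipping requests that do not belong to any section.
section_pv = Counter(s for s in map(section_of, requests) if s is not None)
print(section_pv)
```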
By referer dimension
-- Count PVs per hour for each referring URL
drop table dw_pvs_referer_everyhour;
create table dw_pvs_referer_everyhour(referer_url string,referer_host string,month string,day string,hour string,pv_referer_cnt bigint) partitioned by(datestr string);
insert into table dw_pvs_referer_everyhour partition(datestr='20130918')
select http_referer,ref_host,month,day,hour,count(1) as pv_referer_cnt
from ods_weblog_detail
group by http_referer,ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,pv_referer_cnt desc;
-- Count PVs per hour for each referring host, sorted
drop table dw_pvs_refererhost_everyhour;
create table dw_pvs_refererhost_everyhour(ref_host string,month string,day string,hour string,ref_host_cnts bigint) partitioned by(datestr string);
insert into table dw_pvs_refererhost_everyhour partition(datestr='20130918')
select ref_host,month,day,hour,count(1) as ref_host_cnts
from ods_weblog_detail
group by ref_host,month,day,hour
having ref_host is not null
order by hour asc,day asc,month asc,ref_host_cnts desc;

Note: the same statistics can also be computed by visitor region, visitor terminal, and other dimensions.
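1.2. Average page views per visitor
Requirement: compute the average number of pages requested by all of today's visitors.
Average page views per visitor, also called pages per visitor, indicates how sticky the site is for its users.
It is the average number of pages a user views within a given period.
Formula: total page requests / number of distinct visitors
remote_addr identifies a user. First compute the PV count for each distinct remote_addr, then sum all the PVs to get the total page requests, and count the remote_addr values to get the number of distinct visitors.
-- total page requests / number of distinct visitors
drop table dw_avgpv_user_everyday;
create table dw_avgpv_user_everyday(
day string,
avgpv string);
insert into table dw_avgpv_user_everyday
select '20130918',sum(b.pvs)/count(b.remote_addr) from
(select remote_addr,count(1) as pvs from ods_weblog_detail where datestr='20130918' group by remote_addr) b;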
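The same two-step calculation (PV per remote_addr, then sum divided by count) can be sketched in a few lines of Python. The function name avg_pv_per_visitor and the sample addresses are assumptions used only to show the arithmetic.

```python
from collections import Counter

def avg_pv_per_visitor(remote_addrs):
    """Total page requests divided by the number of distinct visitors."""
    pv_by_addr = Counter(remote_addrs)      # like: group by remote_addr
    if not pv_by_addr:
        return 0.0                          # no traffic, avoid dividing by zero
    total_pv = sum(pv_by_addr.values())     # like: sum(b.pvs)
    visitors = len(pv_by_addr)              # like: count(b.remote_addr)
    return total_pv / visitors

# Hypothetical log: one entry per page request.
log = ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3", "1.1.1.1", "2.2.2.2"]
print(avg_pv_per_visitor(log))  # 6 requests / 3 visitors
```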

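1.3. Top-N referrers by total PV (grouped top-N)
Requirement: for each hour, find the N referring hosts that produced the most PVs (top N).
The row_number() function
• Syntax: row_number() over (partition by xxx order by xxx) rank, where rank is the alias of the generated column; in effect this adds a new field named rank.
• partition by defines the groups, for example grouping by the sex field
• order by sorts within each group, for example grouping by sex and sorting by age inside each group
• After sorting, each row in a group is assigned a number starting from 1
• To pick particular rows from each group, filter with something like where table_name.rank>x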
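To show what the numbering amounts to, the sketch below emulates row_number() over (partition by ... order by ... desc) in plain Python. The helper row_number, the dictionary keys, and the sample hosts are assumptions for illustration; this is not how Hive implements the function.

```python
from itertools import groupby
from operator import itemgetter

def row_number(rows, partition_key, order_key, descending=True):
    """Add a 1-based position "od" to each row within its partition."""
    # Sort by the ordering column first, then stably by the partition column,
    # so rows inside each partition keep the requested order.
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    ordered.sort(key=itemgetter(partition_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        for n, row in enumerate(group, start=1):
            numbered.append({**row, "od": n})
    return numbered

# Hypothetical hourly referrer counts.
rows = [
    {"hour": "091800", "ref_host": "a.example", "cnt": 5},
    {"hour": "091800", "ref_host": "b.example", "cnt": 9},
    {"hour": "091801", "ref_host": "a.example", "cnt": 2},
    {"hour": "091800", "ref_host": "c.example", "cnt": 1},
    {"hour": "091801", "ref_host": "c.example", "cnt": 7},
]
ranked = row_number(rows, "hour", "cnt")
# Keep the top 2 hosts of every hour, like: where od <= 2
top2 = [(r["hour"], r["od"], r["ref_host"]) for r in ranked if r["od"] <= 2]
print(top2)
```

As with row_number() itself, rows that tie on the ordering column still receive distinct numbers, so which of them comes first is not something to rely on.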
The following statement numbers the referring hosts within each hour in descending order of visit count:
select ref_host,ref_host_cnts,concat(month,day,hour),
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od from dw_pvs_refererhost_everyhour;
The result looks like this:
[Figure: sample output of the row_number() query]
Building on row_number as described above, the following HQL selects the top N ref_host values by visit count for each hour:
drop table dw_pvs_refhost_topn_everyhour;
create table dw_pvs_refhost_topn_everyhour(
hour string,
toporder string,
ref_host string,
ref_host_cnts string
)partitioned by(datestr string);
insert into table dw_pvs_refhost_topn_everyhour partition(datestr='20130918')
select t.hour,t.od,t.ref_host,t.ref_host_cnts from
(select ref_host,ref_host_cnts,concat(month,day,hour) as hour,
row_number() over (partition by concat(month,day,hour) order by ref_host_cnts desc) as od
from dw_pvs_refererhost_everyhour) t where od<=3;
The result is as follows:
[Figure: contents of dw_pvs_refhost_topn_everyhour]

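2. Page analysis (from the page's perspective)
2.1. Per-page visit statistics
This mainly analyzes the request field in the data, for example PV per page and UV per page.
These metrics simply come down to a group by on the page field. For example:
-- Count PVs per page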
select request as request,count(request) as request_counts from
ods_weblog_detail group by request having request is not null order by request_counts desc limit 20;
2.2. Popular pages
-- Top 10 most popular pages for each day
drop table dw_hotpages_everyday;
create table dw_hotpages_everyday(day string,url string,pvs string);
insert into table dw_hotpages_everyday
select '20130918',a.request,a.request_counts from
(select request as request,count(request) as request_counts from ods_weblog_detail where datestr='20130918' group by request having request is not null) a
order by a.request_counts desc limit 10;

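3. Visitor analysis
3.1. Unique visitors
Requirement: count unique visitors and the PVs they produce along a time dimension such as the hour.
If the raw logs carry a user identifier, unique visitors are easy to identify from it. Here the raw logs have no user identifier, so the visitor IP stands in for it; the technique is the same, only less accurate.
-- Time dimension: hour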
drop table dw_user_dstc_ip_h;
create table dw_user_dstc_ip_h(
remote_addr string,
pvs bigint,
hour string);
insert into table dw_user_dstc_ip_h
select remote_addr,count(1) as pvs,concat(month,day,hour) as hour
from ods_weblog_detail
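where datestr='20130918'
group by concat(month,day,hour),remote_addr;
Further statistics can be built on top of this result table, such as the total number of unique visitors per hour:
select count(1) as dstc_ip_cnts,hour from dw_user_dstc_ip_h group by hour;
-- Time dimension: day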
select remote_addr,count(1) as counts,concat(month,day) as day
from ods_weblog_detail
where datestr='20130918'
group by concat(month,day),remote_addr;
-- Time dimension: month
select remote_addr,count(1) as counts,month
from ods_weblog_detail
group by month,remote_addr;
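3.2. Daily new visitors
Requirement: identify the new visitors for each day.
Approach: maintain a cumulative table of distinct visitors, then compare each day's visitors against it.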
-- Cumulative table of distinct visitors across all days
drop table dw_user_dsct_history;
create table dw_user_dsct_history(
day string,
ip string
)
partitioned by(datestr string);
-- Daily new-visitor table
drop table dw_user_new_d;
create table dw_user_new_d (
day string,
ip string
)
partitioned by(datestr string);
-- Insert today's new visitors into the new-visitor table
insert into table dw_user_new_d partition(datestr='20130918')
select tmp.day as day,tmp.today_addr as new_ip from
(
select today.day as day,today.remote_addr as today_addr,old.ip as old_addr
from
(select distinct remote_addr as remote_addr,"20130918" as day from ods_weblog_detail where datestr="20130918") today
left outer join
dw_user_dsct_history old
on today.remote_addr=old.ip
) tmp
where tmp.old_addr is null;
-- Append today's new visitors to the cumulative table
insert into table dw_user_dsct_history partition(datestr='20130918')
select day,ip from dw_user_new_d where datestr='20130918';
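Conceptually, the left outer join with the old.ip is null filter is a set difference, and the append to the cumulative table is a set union. A minimal Python sketch of that idea follows; the function name update_history and the sample addresses are assumptions for illustration.

```python
def update_history(history, todays_visitors):
    """Return (new_visitors, updated_history) without mutating the inputs."""
    today = set(todays_visitors)            # like: select distinct remote_addr
    new_visitors = today - history          # like: left join ... where old.ip is null
    updated_history = history | new_visitors  # like: append to the cumulative table
    return new_visitors, updated_history

# Hypothetical data: visitors seen on earlier days and today's raw log.
history = {"1.1.1.1", "2.2.2.2"}
todays_log = ["2.2.2.2", "3.3.3.3", "3.3.3.3", "4.4.4.4"]

new_visitors, history_after = update_history(history, todays_log)
print(sorted(new_visitors))
print(sorted(history_after))
```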
Verify:
select count(distinct remote_addr) from ods_weblog_detail;
select count(1) from dw_user_dsct_history where datestr='20130918';
select count(1) from dw_user_new_d where datestr='20130918';
Note: the same statistics can also be computed by visitor region, visitor terminal, and other dimensions.

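4. Visit analysis (clickstream model)
4.1. Returning and one-time visitors
Requirement: list all of today's returning visitors and their visit counts.
Approach: visitors that appear more than once in the ods_click_stream_visit table are returning visitors; the rest are one-time visitors.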
drop table dw_user_returning;
create table dw_user_returning(
day string,
remote_addr string,
acc_cnt string)
partitioned by (datestr string);
insert overwrite table dw_user_returning partition(datestr='20130918')
select tmp.day,tmp.remote_addr,tmp.acc_cnt
from
(select '20130918' as day,remote_addr,count(session) as acc_cnt from ods_click_stream_visit group by remote_addr) tmp
where tmp.acc_cnt>1;
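The split between returning and one-time visitors can be sketched as a simple count per remote_addr. The function name split_visitors and the (remote_addr, session) pairs below are hypothetical and stand in for rows of the visit table.

```python
from collections import Counter

def split_visitors(visits):
    """visits: iterable of (remote_addr, session) pairs, one pair per visit."""
    acc_cnt = Counter(addr for addr, _session in visits)  # visits per address
    returning = {addr: n for addr, n in acc_cnt.items() if n > 1}
    one_time = {addr for addr, n in acc_cnt.items() if n == 1}
    return returning, one_time

# Hypothetical visit rows.
visits = [
    ("1.1.1.1", "s1"),
    ("2.2.2.2", "s2"),
    ("1.1.1.1", "s3"),
    ("3.3.3.3", "s4"),
    ("1.1.1.1", "s5"),
]
returning, one_time = split_visitors(visits)
print(returning, one_time)
```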
4.2. Average visit frequency
Requirement: compute the average number of visits per user for each day.
Formula: total visits / number of distinct users
select sum(pagevisits)/count(distinct remote_addr) from ods_click_stream_visit where datestr='20130918';

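5. Key-path conversion analysis (funnel model)
5.1. Requirement analysis
Conversion: within a specified business flow, the number of people who complete each step and that number as a percentage of the previous step.
5.2. Model design
Define the page identifiers for the business flow. In the example below the steps are:
Step1: /item
Step2: /category
Step3: /index
Step4: /order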
5.3. Implementation
• Query the total number of visitors at each step
-- Store the visitor count for each step in dw_oute_numbs
create table dw_oute_numbs as
select 'step1' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/item%'
union
select 'step2' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/category%'
union
select 'step3' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/index%'
union
select 'step4' as step,count(distinct remote_addr) as numbs from ods_click_pageviews where datestr='20130920' and request like '/order%';
Note: UNION merges the result sets of multiple SELECT statements into a single result set.
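• Query each step's visitor count as a ratio of the count at the start of the path
Approach: a cascading query using a self-join
-- Join dw_oute_numbs with itself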
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr;
-- visitors at each step / visitors at step 1 = each step's ratio relative to the starting point
select tmp.rnstep,tmp.rnnumbs/tmp.rrnumbs as ratio
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1';
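• Query each step's visitor count as a ratio of the previous step (stored as leakage_rate in the queries below)
-- Filter the self-join down to pairs of each step and the step before it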
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr
where cast(substr(rn.step,5,1) as int)=cast(substr(rr.step,5,1) as int)-1;
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as leakage_rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1;
• Combine the two metrics above
select abs.step,abs.numbs,abs.rate as abs_ratio,rel.rate as leakage_rate
from
(
select tmp.rnstep as step,tmp.rnnumbs as numbs,tmp.rnnumbs/tmp.rrnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where tmp.rrstep='step1'
) abs
left outer join
(
select tmp.rrstep as step,tmp.rrnumbs/tmp.rnnumbs as rate
from
(
select rn.step as rnstep,rn.numbs as rnnumbs,rr.step as rrstep,rr.numbs as rrnumbs from dw_oute_numbs rn
inner join
dw_oute_numbs rr) tmp
where cast(substr(tmp.rnstep,5,1) as int)=cast(substr(tmp.rrstep,5,1) as int)-1
) rel
on abs.step=rel.step;
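For readers who find the nested self-joins hard to follow, the combined report can be sketched in Python as one pass over the per-step counts. The function name funnel_report and the sample counts are made up for illustration; the first step has no previous step, so its relative ratio is None, which mirrors the null produced by the left outer join.

```python
def funnel_report(step_counts):
    """step_counts: list of (step, numbs) in funnel order.

    Returns (step, numbs, abs_ratio, rel_ratio) tuples, where abs_ratio is
    relative to the first step and rel_ratio is relative to the previous step.
    """
    if not step_counts:
        return []
    first = step_counts[0][1]
    report = []
    prev = None
    for step, numbs in step_counts:
        abs_ratio = numbs / first if first else None
        rel_ratio = numbs / prev if prev else None  # None for the first step
        report.append((step, numbs, abs_ratio, rel_ratio))
        prev = numbs
    return report

# Hypothetical per-step visitor counts.
counts = [("step1", 1000), ("step2", 500), ("step3", 250), ("step4", 50)]
for row in funnel_report(counts):
    print(row)
```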
