Impala Analytic Functions
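Temporary tables (WITH clause)
Purpose: a result set that is needed more than once can be defined once and reused instead of repeating the subquery, which simplifies the statement and makes it easier to read.

    WITH table_name AS (
        SELECT 1 id, 2 num
        UNION
        SELECT 2, 2
    )
    SELECT * FROM table_name;

    -- Use each account's first role-creation record as a temporary table
    WITH role_unique AS (
        SELECT *
        FROM (
            SELECT
                role_id,
                create_time,
                row_number() OVER (PARTITION BY account_name ORDER BY create_time ASC) AS row_num
            FROM t_log_role_create
        ) role_create
        WHERE row_num = 1
    )
    SELECT * FROM role_unique;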
Analytic functions

    function(args) OVER([partition_by_clause] [order_by_clause [window_clause]])

- FUNCTION clause
- PARTITION clause
- ORDER BY clause
- WINDOWING clause
PARTITION
Similar to GROUP BY: splits the rows into groups, but every input row is kept in the result.
ORDER BY
Sorts the rows within each partition.
Window clause

    ROWS BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW ] [ AND [ CURRENT ROW | { UNBOUNDED | n } FOLLOWING ] ]
    RANGE BETWEEN [ { m | UNBOUNDED } PRECEDING | CURRENT ROW ] [ AND [ CURRENT ROW | { UNBOUNDED | n } FOLLOWING ] ]
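Difference between ROWS and RANGE
- ROWS: every row counts as its own computation row, so the frame boundary moves one physical row at a time and each row gets a new window.
- RANGE: all rows with the same ORDER BY value count as one computation row, so rows with equal values share the same window.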
    -- Query with a ROWS frame
    WITH num_table AS (
        SELECT 1 id, 1 num UNION SELECT 2, 2 UNION SELECT 3, 3 UNION SELECT 4, 6
        UNION SELECT 5, 4 UNION SELECT 6, 5 UNION SELECT 7, 5 UNION SELECT 8, 4
    )
    SELECT
        num,
        sum(num) OVER (ORDER BY num ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
    FROM num_table;

    +-----+-------+
    | num | total |
    +-----+-------+
    | 1   | 1     |
    | 2   | 3     |
    | 3   | 6     |
    | 4   | 10    |
    | 4   | 14    |
    | 5   | 19    |
    | 5   | 24    |
    | 6   | 30    |
    +-----+-------+

    -- Query with a RANGE frame
    WITH num_table AS (
        SELECT 1 id, 1 num UNION SELECT 2, 2 UNION SELECT 3, 3 UNION SELECT 4, 6
        UNION SELECT 5, 4 UNION SELECT 6, 5 UNION SELECT 7, 5 UNION SELECT 8, 4
    )
    SELECT
        num,
        sum(num) OVER (ORDER BY num ASC RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS total
    FROM num_table;

    +-----+-------+
    | num | total |
    +-----+-------+
    | 1   | 1     |
    | 2   | 3     |
    | 3   | 6     |
    | 4   | 14    |
    | 4   | 14    |
    | 5   | 24    |
    | 5   | 24    |
    | 6   | 30    |
    +-----+-------+
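To make the ROWS/RANGE difference concrete, the two running totals above can be reproduced outside the database. The following is only a plain-Python sketch of the frame semantics, not how Impala evaluates the query; the function names `running_sum_rows` and `running_sum_range` are made up for this illustration.

```python
from itertools import groupby

def running_sum_rows(values):
    """Mimics SUM(v) OVER (ORDER BY v ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)."""
    totals, acc = [], 0
    for v in sorted(values):
        acc += v              # the frame grows by exactly one physical row
        totals.append(acc)
    return totals

def running_sum_range(values):
    """Mimics SUM(v) OVER (ORDER BY v RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)."""
    totals, acc = [], 0
    for _, peers in groupby(sorted(values)):
        peers = list(peers)
        acc += sum(peers)                   # all rows with the same value enter the frame together
        totals.extend([acc] * len(peers))   # so every one of them sees the same total
    return totals

nums = [1, 2, 3, 6, 4, 5, 5, 4]   # the num column of num_table
rows_totals = running_sum_rows(nums)
range_totals = running_sum_range(nums)
print(rows_totals)    # [1, 3, 6, 10, 14, 19, 24, 30]
print(range_totals)   # [1, 3, 6, 14, 14, 24, 24, 30]
```

The two lists differ only at the duplicated values 4 and 5, which is exactly where the two result tables differ.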
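ROW_NUMBER, RANK, DENSE_RANK
- ROW_NUMBER: an ascending sequence of integers that starts at 1 and increases by 1 for every row.
- RANK: an ascending sequence of integers that starts at 1; duplicate values get the same number, and the sequence then jumps ahead by the number of duplicated rows, leaving gaps.
- DENSE_RANK: an ascending sequence of integers that starts at 1; duplicate values get the same number, and the next distinct value gets the next integer, leaving no gaps.
Typical uses:
- Getting the latest record
- Getting the top N records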
    WITH create_table AS (
        SELECT '2018-10-01' create_date, 'account1' account_name
        UNION ALL SELECT '2018-10-02', 'account2'
        UNION ALL SELECT '2018-10-03', 'account3'
        UNION ALL SELECT '2018-10-04', 'account3'
        UNION ALL SELECT '2018-10-05', 'account2'
        UNION ALL SELECT '2018-10-06', 'account4'
        UNION ALL SELECT '2018-10-07', 'account5'
    )
    SELECT
        create_date,
        account_name,
        row_number() OVER (ORDER BY account_name) row_num,
        rank() OVER (ORDER BY account_name) rank_id,
        dense_rank() OVER (ORDER BY account_name) dense_id
    FROM create_table
    ORDER BY account_name;

    +-------------+--------------+---------+---------+----------+
    | create_date | account_name | row_num | rank_id | dense_id |
    +-------------+--------------+---------+---------+----------+
    | 2018-10-01  | account1     | 1       | 1       | 1        |
    | 2018-10-02  | account2     | 2       | 2       | 2        |
    | 2018-10-05  | account2     | 3       | 2       | 2        |
    | 2018-10-03  | account3     | 4       | 4       | 3        |
    | 2018-10-04  | account3     | 5       | 4       | 3        |
    | 2018-10-06  | account4     | 6       | 6       | 4        |
    | 2018-10-07  | account5     | 7       | 7       | 5        |
    +-------------+--------------+---------+---------+----------+
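The three numbering rules can also be written out by hand. The sketch below is plain Python and only illustrates the definitions; the helper name `rankings` is hypothetical, and it assumes the keys are already in ORDER BY order.

```python
def rankings(sorted_keys):
    """Return (row_number, rank, dense_rank) lists for keys already in ORDER BY order."""
    row_number, rank, dense_rank = [], [], []
    for i, key in enumerate(sorted_keys, start=1):
        row_number.append(i)                       # always increases by 1
        if i > 1 and key == sorted_keys[i - 2]:    # same value as the previous row
            rank.append(rank[-1])                  # ties share the rank
            dense_rank.append(dense_rank[-1])
        else:
            rank.append(i)                         # jumps past the tied rows, leaving gaps
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)   # no gaps
    return row_number, rank, dense_rank

accounts = sorted(["account1", "account2", "account3", "account3",
                   "account2", "account4", "account5"])
row_num, rank_id, dense_id = rankings(accounts)
print(row_num)    # [1, 2, 3, 4, 5, 6, 7]
print(rank_id)    # [1, 2, 2, 4, 4, 6, 7]
print(dense_id)   # [1, 2, 2, 3, 3, 4, 5]
```

These lists match the row_num, rank_id and dense_id columns of the result table above.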
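LAG, LEAD
- LAG(col, n, DEFAULT) returns the value of col from the row n rows before the current row in the window; DEFAULT is returned when no such row exists.
- LEAD(col, n, DEFAULT) returns the value of col from the row n rows after the current row in the window, the opposite of LAG.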
    WITH pay_date AS (
        SELECT '2018-10-01' dt, 5000 pay
        UNION ALL SELECT '2018-10-02', 6000
        UNION ALL SELECT '2018-10-03', 7000
        UNION ALL SELECT '2018-10-04', 8000
        UNION ALL SELECT '2018-10-05', 9000
        UNION ALL SELECT '2018-10-06', 10000
        UNION ALL SELECT '2018-10-07', 11000
    )
    SELECT
        dt,
        lag(pay, 1) OVER (ORDER BY dt) AS pre_day,
        pay,
        lead(pay, 1) OVER (ORDER BY dt) AS next_day,
        avg(pay) OVER (ORDER BY dt ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS pay_average
    FROM pay_date
    ORDER BY dt;

Previous day, current day, next day, and the three-day average (adjust the window as needed):

    +------------+---------+-------+----------+-------------+
    | dt         | pre_day | pay   | next_day | pay_average |
    +------------+---------+-------+----------+-------------+
    | 2018-10-01 | NULL    | 5000  | 6000     | 5500        |
    | 2018-10-02 | 5000    | 6000  | 7000     | 6000        |
    | 2018-10-03 | 6000    | 7000  | 8000     | 7000        |
    | 2018-10-04 | 7000    | 8000  | 9000     | 8000        |
    | 2018-10-05 | 8000    | 9000  | 10000    | 9000        |
    | 2018-10-06 | 9000    | 10000 | 11000    | 10000       |
    | 2018-10-07 | 10000   | 11000 | NULL     | 10500       |
    +------------+---------+-------+----------+-------------+
The same query against the t_log_pay table:

    WITH pay_date AS (
        SELECT
            FROM_UNIXTIME(pay_time, 'yyyy-MM-dd') AS dt,
            sum(pay_money) AS pay
        FROM t_log_pay
        WHERE pay_time BETWEEN 1538323200 AND 1538927999
        GROUP BY dt
    )
    SELECT
        dt,
        lag(pay, 1) OVER (ORDER BY dt) AS pre_day,
        pay,
        lead(pay, 1) OVER (ORDER BY dt) AS next_day,
        avg(pay) OVER (ORDER BY dt ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS pay_average
    FROM pay_date
    ORDER BY dt;
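For readers who want to check the pre_day, next_day and pay_average columns by hand, here is a small plain-Python sketch of what LAG, LEAD and a `ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING` average compute on the seven sample values (5000 to 11000). The helper names `lag`, `lead` and `moving_avg` are illustrative only, and `None` stands in for SQL NULL.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row, or default when there is no such row."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]

def lead(values, n=1, default=None):
    """Value n rows after each row, or default when there is no such row."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]

def moving_avg(values, preceding=1, following=1):
    """Average over the frame ROWS BETWEEN preceding PRECEDING AND following FOLLOWING."""
    averages = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + following + 1]   # frame is cut off at the edges
        averages.append(sum(frame) / len(frame))
    return averages

pay = [5000, 6000, 7000, 8000, 9000, 10000, 11000]   # already ordered by dt
pre_day = lag(pay)
next_day = lead(pay)
pay_average = moving_avg(pay)
print(pre_day)       # [None, 5000, 6000, 7000, 8000, 9000, 10000]
print(next_day)      # [6000, 7000, 8000, 9000, 10000, 11000, None]
print(pay_average)   # [5500.0, 6000.0, 7000.0, 8000.0, 9000.0, 10000.0, 10500.0]
```

The first and last averages cover only two rows because the frame is truncated at the partition edges, which is why they are 5500 and 10500 rather than three-day averages.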
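FIRST_VALUE, LAST_VALUE
- FIRST_VALUE: after sorting within the partition, the first value in the frame that ends at the current row.
- LAST_VALUE: after sorting within the partition, the last value in the frame that ends at the current row.
- FIRST_VALUE with ORDER BY ... DESC: returns the last value of the whole partition.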
    WITH test_table AS (
        SELECT 1 id, 'test1' test
        UNION ALL SELECT 2, 'test1'
        UNION ALL SELECT 3, 'test1'
        UNION ALL SELECT 4, 'test2'
        UNION ALL SELECT 5, 'test2'
        UNION ALL SELECT 6, 'test2'
    )
    SELECT
        id,
        test,
        first_value(id) OVER (PARTITION BY test ORDER BY id RANGE UNBOUNDED PRECEDING) AS first_val,
        last_value(id) OVER (PARTITION BY test ORDER BY id DESC RANGE UNBOUNDED PRECEDING) AS last_val,
        first_value(id) OVER (PARTITION BY test ORDER BY id DESC RANGE UNBOUNDED PRECEDING) AS first_desc
    FROM test_table
    ORDER BY id;

    +----+-------+-----------+----------+------------+
    | id | test  | first_val | last_val | first_desc |
    +----+-------+-----------+----------+------------+
    | 1  | test1 | 1         | 1        | 3          |
    | 2  | test1 | 1         | 2        | 3          |
    | 3  | test1 | 1         | 3        | 3          |
    | 4  | test2 | 4         | 4        | 6          |
    | 5  | test2 | 4         | 5        | 6          |
    | 6  | test2 | 4         | 6        | 6          |
    +----+-------+-----------+----------+------------+
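The reason last_val simply equals id is that the frame stops at the current row. A rough plain-Python sketch of that frame is shown below; `first_last` is a hypothetical helper, and it assumes ids are unique within a partition so that ROWS and RANGE frames coincide.

```python
from itertools import groupby

def first_last(rows, descending=False):
    """rows: list of (id, partition_key).
    Returns {id: (first_value, last_value)} for the frame 'partition start .. current row'."""
    result = {}
    by_partition = sorted(rows, key=lambda r: r[1])            # bring each partition together
    for _, part in groupby(by_partition, key=lambda r: r[1]):
        ids = sorted((r[0] for r in part), reverse=descending)  # ORDER BY id [DESC]
        for i, current in enumerate(ids):
            frame = ids[: i + 1]                                # ends at the current row
            result[current] = (frame[0], frame[-1])
    return result

rows = [(1, "test1"), (2, "test1"), (3, "test1"),
        (4, "test2"), (5, "test2"), (6, "test2")]
asc = first_last(rows)
desc = first_last(rows, descending=True)
# Columns: id, first_val (ASC), last_val (DESC), first_desc (DESC)
table = [(i, asc[i][0], desc[i][1], desc[i][0]) for i, _ in rows]
print(table)
# [(1, 1, 1, 3), (2, 1, 2, 3), (3, 1, 3, 3), (4, 4, 4, 6), (5, 4, 5, 6), (6, 4, 6, 6)]
```

Because the last row of a frame that ends at the current row is the current row itself, LAST_VALUE needs a frame that extends to the end of the partition to return a partition-wide value; sorting in descending order and taking FIRST_VALUE, as in the first_desc column, is the workaround used here.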
Usage examples

    CREATE TABLE t_log_role_create (
        account_name STRING COMMENT 'account',
        create_time INT COMMENT 'registration time'
    )
    COMMENT 'player registration table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    CREATE TABLE t_log_login (
        account_name STRING COMMENT 'account',
        login_time INT COMMENT 'login time'
    )
    COMMENT 'player login table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    CREATE TABLE t_log_pay (
        account_name STRING COMMENT 'account',
        pay_time INT COMMENT 'payment time',
        pay_money FLOAT COMMENT 'payment amount'
    )
    COMMENT 'player payment table'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;
Retention. Users who start playing the game within some period and are still playing after a given interval count as retained users; their share of that day's new users is the retention rate.

    WITH role_unique_a AS (
        SELECT *
        FROM (
            SELECT
                account_name,
                create_time,
                row_number() OVER (PARTITION BY account_name ORDER BY create_time ASC) AS row_num
            FROM t_log_role_create
            WHERE create_time BETWEEN 1533657600 AND 1533743999
        ) role_create
        WHERE row_num = 1
    )
    SELECT
        e.create_date,
        f.day,
        e.account_num,
        f.total,
        (f.total / e.account_num) AS rate
    FROM (
        -- number of new accounts per creation day
        SELECT
            COUNT(DISTINCT account_name) AS account_num,
            FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
        FROM role_unique_a
        GROUP BY create_date
    ) e
    JOIN (
        -- distinct accounts from each creation day that logged in on each day
        SELECT
            COUNT(DISTINCT b.account_name) AS total,
            FROM_UNIXTIME(login_time, 'yyyy-MM-dd') AS day,
            FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
        FROM t_log_login a
        JOIN role_unique_a b ON a.account_name = b.account_name
        WHERE a.login_time > 1533657600
        GROUP BY day, create_date
    ) f ON e.create_date = f.create_date
    ORDER BY e.create_date, f.day;
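The retention query follows four steps: keep each account's first creation record, count the cohort per creation day, count distinct returning accounts per login day, and divide. The plain-Python sketch below walks through the same steps on made-up data. The accounts `a` to `d`, their timestamps, and the helpers `ts`, `to_day` and `retention` are all hypothetical, and day boundaries are taken in UTC, which is an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone

def ts(year, month, day, hour=0):
    """Unix timestamp in seconds for a UTC date and hour (helper for the sample data)."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp())

def to_day(unix_ts):
    """Rough counterpart of FROM_UNIXTIME(ts, 'yyyy-MM-dd'), with days taken in UTC."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y-%m-%d")

def retention(creates, logins):
    """creates and logins are lists of (account_name, unix_ts)."""
    # Step 1: earliest creation per account (the row_num = 1 filter).
    first_create = {}
    for account, created in creates:
        if account not in first_create or created < first_create[account]:
            first_create[account] = created
    # Step 2: cohort of new accounts per creation day (subquery e).
    cohort = defaultdict(set)
    for account, created in first_create.items():
        cohort[to_day(created)].add(account)
    # Step 3: distinct accounts per (creation day, login day) (subquery f).
    active = defaultdict(set)
    for account, logged_in in logins:
        if account in first_create:
            active[(to_day(first_create[account]), to_day(logged_in))].add(account)
    # Step 4: rate = returning accounts / cohort size.
    return {key: len(accounts) / len(cohort[key[0]])
            for key, accounts in sorted(active.items())}

creates = [("a", ts(2018, 8, 8, 9)), ("a", ts(2018, 8, 8, 12)),   # "a" created two roles
           ("b", ts(2018, 8, 8, 10)), ("c", ts(2018, 8, 8, 11)), ("d", ts(2018, 8, 8, 13))]
logins = [("a", ts(2018, 8, 8, 9)), ("b", ts(2018, 8, 8, 10)), ("c", ts(2018, 8, 8, 11)),
          ("d", ts(2018, 8, 8, 13)), ("a", ts(2018, 8, 9, 8)), ("a", ts(2018, 8, 9, 20)),
          ("b", ts(2018, 8, 9, 9)), ("c", ts(2018, 8, 10, 7))]
rates = retention(creates, logins)
for (create_date, day), rate in rates.items():
    print(create_date, day, rate)
# 2018-08-08 2018-08-08 1.0
# 2018-08-08 2018-08-09 0.5
# 2018-08-08 2018-08-10 0.25
```

Using sets for the per-day accounts plays the role of COUNT(DISTINCT ...): account `a` logs in twice on 2018-08-09 but is counted once.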
LTV (Lifetime Value). The total revenue that an average user generates for the game between their first and last login. The query below reports it cumulatively: for each creation day, the revenue accumulated up to each payment day divided by the number of new accounts.

    WITH role_unique_a AS (
        SELECT *
        FROM (
            SELECT
                account_name,
                create_time,
                row_number() OVER (PARTITION BY account_name ORDER BY create_time ASC) AS row_num
            FROM t_log_role_create
            WHERE create_time BETWEEN 1533657600 AND 1533743999
        ) role_create
        WHERE row_num = 1
    )
    SELECT
        e.create_date,
        f.day,
        e.account_num,
        f.total,
        f.total / e.account_num AS ltv
    FROM (
        -- number of new accounts per creation day
        SELECT
            COUNT(DISTINCT account_name) AS account_num,
            FROM_UNIXTIME(create_time, 'yyyy-MM-dd') AS create_date
        FROM role_unique_a
        GROUP BY create_date
    ) e
    JOIN (
        -- running total of revenue per creation day, ordered by payment day
        SELECT
            create_date,
            day,
            sum(pay_money) OVER (
                PARTITION BY create_date
                ORDER BY day
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS total
        FROM (
            SELECT
                SUM(pay_money) AS pay_money,
                FROM_UNIXTIME(pay_time, 'yyyy-MM-dd') AS day,
                FROM_UNIXTIME(b.create_time, 'yyyy-MM-dd') AS create_date
            FROM t_log_pay a
            JOIN role_unique_a b ON a.account_name = b.account_name
            WHERE pay_time > 1533657600
            GROUP BY day, create_date
        ) c
    ) f ON e.create_date = f.create_date
    ORDER BY e.create_date, f.day;
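The core of the LTV query is the running total per creation day divided by the cohort size. As a minimal plain-Python sketch of that last step, assuming the daily revenue of one cohort has already been aggregated (the figures and the name `cumulative_ltv` are invented for illustration):

```python
from itertools import accumulate

def cumulative_ltv(cohort_size, daily_pay):
    """daily_pay: {day: revenue paid by the cohort on that day}.
    Returns {day: revenue accumulated up to that day / cohort_size}."""
    days = sorted(daily_pay)                              # ORDER BY day
    running = accumulate(daily_pay[d] for d in days)      # unbounded preceding .. current row
    return {d: total / cohort_size for d, total in zip(days, running)}

# Hypothetical cohort of 4 accounts created on the same day.
ltv = cumulative_ltv(4, {"2018-08-08": 100.0, "2018-08-09": 60.0, "2018-08-10": 40.0})
print(ltv)
# {'2018-08-08': 25.0, '2018-08-09': 40.0, '2018-08-10': 50.0}
```

As in the SQL, a day on which the cohort paid nothing produces no row at all, so the LTV for such a day has to be read from the most recent earlier day.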