
Hive Data Analysis in Practice

 


Source: 企鵝號 - 程式猿的修身養性

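1. Preparation

Hive is built on MapReduce for distributed computation and HDFS for distributed storage, so Hadoop must be running before you use Hive to work with data. If a pseudo-distributed Hadoop environment is already set up, run start-all.sh and wait for Hadoop to finish starting.

Data analysis with Hive naturally requires the Hive data warehouse tool to be installed and configured. Installation and configuration are not covered here; see the earlier related articles. This article uses Hive's local mode (metadata stored in a third-party MySQL database). Run the command hive and wait for Hive to finish starting. As shown in the figure below, you can then run data analysis operations in the Hive shell.

Before starting the analysis itself, here are a few basic Hive commands.

Create a database

create database mytest;

Switch to the specified database

use mytest;

View information about the specified database

describe database mytest;

View detailed information about the specified table

desc formatted special1;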

2. SUM, AVG, MIN, and MAX functions

A. Data preparation

Create a file named special1 and enter the test data into it, as shown in the figure below:

 

Then copy the special1 file to the target directory (/root/temp is used here) and run the following command to create the corresponding external table:

CREATE EXTERNAL TABLE special1 (
  cookieid STRING,
  createtime STRING,
  pv INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/root/temp/special1/';

Finally, run the following command to load the data from the local file special1 into the special1 table:

LOAD DATA LOCAL INPATH '/root/temp/special1' INTO TABLE special1;

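B. Using the SUM function

Purpose: computes totals and running (cumulative) totals within a partition. Note that the result depends on ORDER BY, which sorts in ascending order by default. The command is as follows:

SELECT cookieid, createtime, pv,
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
  SUM(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition; the final output may then come back in descending order
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
  SUM(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM special1;

The result is shown in the figure below: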

 

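Explanation:

pv1: the running total of pv from the start of the partition to the current row. For example, pv1 for the 11th is the pv of the 10th plus the pv of the 11th, and pv1 for the 12th is the pv of the 10th plus the 11th plus the 12th;

pv2: computed the same way as pv1;

pv3: the sum of all pv values in the partition (cookie1);

pv4: the current row plus the 3 preceding rows in the partition. For example, 11th = 10th + 11th, 12th = 10th + 11th + 12th, 13th = 10th + 11th + 12th + 13th, 14th = 11th + 12th + 13th + 14th;

pv5: the current row plus the 3 preceding rows plus the 1 following row. For example, 14th = 11th + 12th + 13th + 14th + 15th = 5 + 7 + 3 + 2 + 4 = 21;

pv6: the current row plus all following rows. For example, 13th = 13th + 14th + 15th + 16th = 3 + 2 + 4 + 4 = 13, and 14th = 14th + 15th + 16th = 2 + 4 + 4 = 10;

If ROWS BETWEEN is not specified, the frame defaults to the start of the partition through the current row. Strictly speaking, the default with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which also includes rows that tie with the current row on the ORDER BY value, so pv1 and pv2 agree as long as createtime values are distinct within a partition;

If ORDER BY is not specified, all values in the partition are summed;

The key is to understand ROWS BETWEEN, also called the WINDOW clause:

PRECEDING: rows before; FOLLOWING: rows after; CURRENT ROW: the current row

UNBOUNDED: no bound. UNBOUNDED PRECEDING means starting from the first row of the partition, and UNBOUNDED FOLLOWING means running through the last row of the partition

The other functions, AVG, MIN, and MAX, are used in the same way as SUM.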
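To make the frame arithmetic above concrete, here is a minimal Python sketch (not Hive code) that recomputes pv1 and pv4 to pv6 for a single partition. The pv values for the 11th to the 16th are taken from the worked examples above; the value 1 for the 10th and the helper name frame_sum are assumptions made purely for illustration.

```python
# Emulates SUM(pv) OVER (... ROWS BETWEEN <preceding> AND <following>) for one partition.
days = [10, 11, 12, 13, 14, 15, 16]
pv = [1, 5, 7, 3, 2, 4, 4]  # pv for day 10 is an assumed value; the rest come from the text


def frame_sum(values, i, preceding, following):
    """Sum the rows in the frame around index i.

    preceding / following are row counts relative to the current row;
    None stands for UNBOUNDED.
    """
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(values) - 1 if following is None else min(len(values) - 1, i + following)
    return sum(values[start:end + 1])


for i, day in enumerate(days):
    pv1 = frame_sum(pv, i, None, 0)   # UNBOUNDED PRECEDING .. CURRENT ROW
    pv4 = frame_sum(pv, i, 3, 0)      # 3 PRECEDING .. CURRENT ROW
    pv5 = frame_sum(pv, i, 3, 1)      # 3 PRECEDING .. 1 FOLLOWING
    pv6 = frame_sum(pv, i, 0, None)   # CURRENT ROW .. UNBOUNDED FOLLOWING
    print(day, pv1, pv4, pv5, pv6)
```

For the 14th, this gives pv5 = 21 and pv6 = 10, which matches the worked values in the explanation.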

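C. Using the AVG function

Purpose: computes the average over the specified frame of rows within a partition. The command is as follows:

SELECT cookieid, createtime, pv,
  AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
  AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
  AVG(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition; the final output may then come back in descending order
  AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
  AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
  AVG(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM special1;

The result is shown in the figure below: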

 

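D. Using the MIN function

Purpose: computes the minimum over the specified frame of rows within a partition. The command is as follows:

SELECT cookieid, createtime, pv,
  MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
  MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
  MIN(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition; the final output may then come back in descending order
  MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
  MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
  MIN(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM special1;

The result is shown in the figure below: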

 

E. Using the MAX function

Purpose: computes the maximum over the specified frame of rows within a partition. The command is as follows:

SELECT cookieid, createtime, pv,
  MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime) AS pv1, -- default frame: from the start of the partition to the current row
  MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pv2, -- from the start to the current row; same result as pv1
  MAX(pv) OVER(PARTITION BY cookieid) AS pv3, -- all rows in the partition
  MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pv4, -- current row + 3 preceding rows
  MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pv5, -- current row + 3 preceding rows + 1 following row
  MAX(pv) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pv6 -- current row + all following rows
FROM special1;

The result is shown in the figure below:
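Because AVG, MIN, and MAX use exactly the same frames as SUM, only the function applied to the frame changes. The following Python sketch (not Hive code) illustrates this for one row; the helper frame_agg and the sample pv values are assumptions for illustration, not data from the article's table.

```python
from statistics import mean

pv = [1, 5, 7, 3, 2, 4, 4]  # hypothetical pv values for one cookie, ordered by createtime


def frame_agg(values, i, preceding, following, func):
    """Apply func to the rows in the frame around index i (None = UNBOUNDED)."""
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(values) - 1 if following is None else min(len(values) - 1, i + following)
    return func(values[start:end + 1])


i = 4  # fifth row of the partition
# The frame "3 PRECEDING .. CURRENT ROW" covers the values [5, 7, 3, 2].
print(frame_agg(pv, i, 3, 0, mean))  # AVG over the frame
print(frame_agg(pv, i, 3, 0, min))   # MIN over the frame
print(frame_agg(pv, i, 3, 0, max))   # MAX over the frame
```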

 

3. NTILE, ROW_NUMBER, RANK, and DENSE_RANK functions

A. Data preparation

Create a file named special2 and enter the test data into it, as shown in the figure below:

 

Then copy the special2 file to the target directory (/root/temp is used here) and run the following command to create the corresponding external table:

CREATE EXTERNAL TABLE special2 (
  cookieid STRING,
  createtime STRING,
  pv INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/root/temp/special2/';

Finally, run the following command to load the data from the local file special2 into the special2 table:

LOAD DATA LOCAL INPATH '/root/temp/special2' INTO TABLE special2;

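B. Using the NTILE function

Purpose: NTILE(n) splits the ordered rows of a partition into n buckets and returns the bucket number of the current row.

NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed. If the rows cannot be divided evenly, the extra rows are assigned to the first buckets by default. The command is as follows:

SELECT cookieid, createtime, pv,
  NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn1, -- split each partition into 2 buckets
  NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2, -- split each partition into 3 buckets
  NTILE(4) OVER(ORDER BY createtime) AS rn3 -- split all rows into 4 buckets
FROM special2
ORDER BY cookieid, createtime;

The result is shown in the figure below: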

 

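As another example, to find the days that make up the top third by pv for each cookie, bucket the rows by pv in descending order; the rows with rn = 1 are the ones wanted. The command is as follows:

SELECT cookieid, createtime, pv,
  NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM special2;

The result is shown in the figure below: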
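The bucketing rule can be sketched in a few lines of Python (not Hive code). The helper ntile, the partition size of 7 rows, and the pv values are assumptions for illustration. Giving the leftover rows to the earliest buckets is the usual NTILE behavior and is assumed here to match Hive's.

```python
def ntile(num_rows, n):
    """Return the bucket number (1..n) for each of num_rows already-ordered rows."""
    base, extra = divmod(num_rows, n)
    buckets = []
    for bucket in range(1, n + 1):
        size = base + (1 if bucket <= extra else 0)  # first `extra` buckets get one more row
        buckets.extend([bucket] * size)
    return buckets


print(ntile(7, 2))
print(ntile(7, 3))

# Top third by pv: order by pv descending, then keep the rows that fall in bucket 1.
pv_by_day = {10: 1, 11: 5, 12: 7, 13: 3, 14: 2, 15: 4, 16: 6}  # hypothetical data
ordered = sorted(pv_by_day.items(), key=lambda kv: kv[1], reverse=True)
top_third = [day for (day, _), b in zip(ordered, ntile(len(ordered), 3)) if b == 1]
print(top_third)
```

With 7 rows and 3 buckets, the first bucket holds 3 rows and the other two hold 2 rows each.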

 

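C. Using the ROW_NUMBER function

Purpose: generates a sequence number for the rows within a partition, starting from 1 and following the specified order. For example, ordering by pv in descending order gives each day's pv rank within the partition. ROW_NUMBER() has many other uses, such as fetching the top-ranked row of each partition or the first referrer of a session. The command is as follows:

SELECT cookieid, createtime, pv,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn
FROM special2;

The result is shown in the figure below: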

 

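D. Using the RANK and DENSE_RANK functions

Purpose: RANK() returns the rank of a row within its partition and leaves gaps in the ranking after ties; DENSE_RANK() returns the rank of a row within its partition and leaves no gaps after ties. The command is as follows:

SELECT cookieid, createtime, pv,
  RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn1,
  DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn2,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3
FROM special2
WHERE cookieid = 'cookie1';

The result is shown in the figure below: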
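The difference between the three numbering schemes only shows up when values tie. The Python sketch below (not Hive code) computes all three for one partition; the function name rankings and the pv values, which contain one tie, are hypothetical.

```python
def rankings(values_desc):
    """Given values already sorted in descending order, return a list of
    (value, rank, dense_rank, row_number) tuples."""
    out = []
    rank = dense = 0
    prev = object()  # sentinel that never equals a real value
    for row_number, value in enumerate(values_desc, start=1):
        if value != prev:
            rank = row_number  # RANK jumps to the row position, leaving gaps after ties
            dense += 1         # DENSE_RANK advances by exactly one per distinct value
            prev = value
        out.append((value, rank, dense, row_number))
    return out


pv = sorted([1, 5, 7, 3, 2, 4, 4], reverse=True)  # hypothetical pv values with a tie on 4
for row in rankings(pv):
    print(row)
```

The two rows with pv = 4 share RANK 3 and DENSE_RANK 3; the next row gets RANK 5 but DENSE_RANK 4, while ROW_NUMBER simply counts 1 to 7.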

 

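4. CUME_DIST and PERCENT_RANK functions

A. Data preparation

Create a file named special3 and enter the test data into it, as shown in the figure below:

Then copy the special3 file to the target directory (/root/temp is used here) and run the following command to create the corresponding external table:

CREATE EXTERNAL TABLE special3 (
  dept STRING,
  userid STRING,
  sal INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/root/temp/special3/';

Finally, run the following command to load the data from the local file special3 into the special3 table:

LOAD DATA LOCAL INPATH '/root/temp/special3' INTO TABLE special3;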

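B. Using the CUME_DIST function

Purpose: returns (number of rows whose value is less than or equal to the current value) / (total number of rows in the partition). For example, it gives the proportion of all employees whose salary is less than or equal to the current salary. The command is as follows:

SELECT dept, userid, sal,
  CUME_DIST() OVER(ORDER BY sal) AS rn1,
  CUME_DIST() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM special3;

The result is shown in the figure below: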

 

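C. Using the PERCENT_RANK function

Purpose: returns (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1). This function is rather specialized, and the author is not familiar with its typical use cases. The command is as follows:

SELECT dept, userid, sal,
  PERCENT_RANK() OVER(ORDER BY sal) AS rn1, -- within the partition (here, the whole table)
  RANK() OVER(ORDER BY sal) AS rn11, -- RANK value within the partition
  SUM(1) OVER(PARTITION BY NULL) AS rn12, -- total number of rows in the partition
  PERCENT_RANK() OVER(PARTITION BY dept ORDER BY sal) AS rn2
FROM special3;

The result is shown in the figure below: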
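Both formulas are easy to check by hand. The Python sketch below (not Hive code) applies them to a single partition ordered by salary in ascending order; the function names and the salary values are hypothetical and are not taken from the special3 file.

```python
def cume_dist(values):
    """(rows with value <= current value) / (total rows in the partition)."""
    n = len(values)
    return [sum(1 for w in values if w <= v) / n for v in values]


def percent_rank(values):
    """(RANK - 1) / (total rows - 1), where RANK - 1 is the number of strictly smaller values."""
    n = len(values)
    if n == 1:
        return [0.0]  # a single-row partition has nothing to compare against
    return [sum(1 for w in values if w < v) / (n - 1) for v in values]


sal = [1000, 2000, 3000, 3000, 5000]  # hypothetical salaries with one tie
print(cume_dist(sal))
print(percent_rank(sal))
```

The two tied salaries get the same CUME_DIST (4 of 5 rows are less than or equal to 3000) and the same PERCENT_RANK (both have RANK 3, so (3 - 1) / (5 - 1)).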

 

5. LAG, LEAD, FIRST_VALUE, and LAST_VALUE functions

A. Data preparation

Create a file named special4 and enter the test data into it, as shown in the figure below:

 

Then copy the special4 file to the target directory (/root/temp is used here) and run the following command to create the corresponding external table:

CREATE EXTERNAL TABLE special4 (
  cookieid STRING,
  createtime STRING,
  url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/root/temp/special4/';

Finally, run the following command to load the data from the local file special4 into the special4 table:

LOAD DATA LOCAL INPATH '/root/temp/special4' INTO TABLE special4;

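B. Using the LAG function

Purpose: LAG(col, n, DEFAULT) returns the value from the nth row before the current row within the window. The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value, which is returned when there is no nth row before the current row (if it is omitted, NULL is returned). The command is as follows:

SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAG(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time,
  LAG(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_time
FROM special4;

The result is shown in the figure below: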

 

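C. Using the LEAD function

Purpose: the opposite of LAG. LEAD(col, n, DEFAULT) returns the value from the nth row after the current row within the window. The first argument is the column name, the second is the offset n (optional, default 1), and the third is the default value, which is returned when there is no nth row after the current row (if it is omitted, NULL is returned). The command is as follows:

SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LEAD(createtime, 1, '1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time,
  LEAD(createtime, 2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_time
FROM special4;

The result is shown in the figure below: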
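LAG and LEAD are simple offset lookups within the ordered partition. The Python sketch below (not Hive code) mimics both, with None standing in for NULL; the function names and the timestamps are hypothetical, and the offset n is assumed to be non-negative.

```python
def lag(values, n=1, default=None):
    """Value n rows before each row; `default` where no such row exists."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each row; `default` where no such row exists."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]


# Hypothetical createtime values for one cookie, already ordered.
times = ["10:00:00", "10:00:02", "10:03:04", "10:10:00"]
print(lag(times, 1, "1970-01-01 00:00:00"))   # like last_1_time
print(lag(times, 2))                          # like last_2_time
print(lead(times, 1, "1970-01-01 00:00:00"))  # like next_1_time
print(lead(times, 2))                         # like next_2_time
```

A typical use of the pair is to subtract the current createtime from the LEAD value to estimate how long a visitor stayed on each page.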

 

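D. Using the FIRST_VALUE function

Purpose: returns the first value in the partition, after sorting, up to the current row. The command is as follows:

SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1
FROM special4;

The result is shown in the figure below: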

 

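E. Using the LAST_VALUE function

Purpose: returns the last value in the partition, after sorting, up to the current row. The command is as follows:

SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1
FROM special4;

The result is shown in the figure below: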

 

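If ORDER BY is not specified, the rows are by default taken in the order of their offset in the file, although this order is not something to rely on. The command is as follows:

SELECT cookieid, createtime, url,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid) AS first2
FROM special4;

The result is shown in the figure below: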

 

To get the last value of the whole partition after sorting, a workaround is needed: apply FIRST_VALUE with the sort order reversed. The command is as follows:

SELECT cookieid, createtime, url,
  ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn,
  LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1,
  FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime DESC) AS last2
FROM special4
ORDER BY cookieid, createtime;

The result is shown in the figure below:
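Why LAST_VALUE with the default frame just echoes the current row, and why the reversed FIRST_VALUE fixes that, can be seen in a short Python sketch (not Hive code). The function names and URLs are hypothetical, and createtime is assumed to be unique within the partition, so the default frame ends exactly at the current row.

```python
def first_last_running(urls):
    """For each row, return (FIRST_VALUE, LAST_VALUE) over the frame
    'start of partition .. current row'."""
    out = []
    for i in range(len(urls)):
        frame = urls[:i + 1]
        out.append((frame[0], frame[-1]))  # the last element is the current row itself
    return out


def last_in_partition(urls):
    """Workaround: FIRST_VALUE over the reversed order yields the partition's last value."""
    reversed_order = urls[::-1]  # corresponds to ORDER BY createtime DESC
    return [reversed_order[0] for _ in urls]


urls = ["url1", "url2", "url3"]  # hypothetical urls for one cookie, ordered by createtime
print(first_last_running(urls))
print(last_in_partition(urls))
```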

 

6. GROUPING SETS, GROUPING__ID, CUBE, and ROLLUP functions

A. Data preparation

Create a file named special5 and enter the test data into it, as shown in the figure below:

 

Then copy the special5 file to the target directory (/root/temp is used here) and run the following command to create the corresponding external table:

CREATE EXTERNAL TABLE special5 (
  month STRING,
  day STRING,
  cookieid STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/root/temp/special5/';

Finally, run the following command to load the data from the local file special5 into the special5 table:

LOAD DATA LOCAL INPATH '/root/temp/special5' INTO TABLE special5;

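B. Using the GROUPING SETS function

Purpose: aggregates by several different combinations of dimensions within a single GROUP BY query; it is equivalent to a UNION ALL of the GROUP BY result sets for each combination. The command is as follows:

SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM special5
GROUP BY month, day
GROUPING SETS (month, day)
ORDER BY GROUPING__ID;

The result is shown in the figure below: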

 

The statement above is equivalent to:

SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM special5 GROUP BY day

As another example, consider the following command:

SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM special5
GROUP BY month, day
GROUPING SETS (month, day, (month, day))
ORDER BY GROUPING__ID;

The result is shown in the figure below:

 

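The command above is equivalent to:

SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM special5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM special5 GROUP BY month, day

Here GROUPING__ID indicates which grouping set a result row belongs to. The values shown (1, 2, 3) follow the numbering of older Hive releases; the computation of GROUPING__ID changed in Hive 2.3.0, so newer versions may report different values for the same grouping sets.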
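The equivalence with UNION ALL can be checked on a small in-memory data set. The Python sketch below (not Hive code) computes COUNT(DISTINCT cookieid) once per grouping set and concatenates the results. The function name grouping_sets_uv and the sample rows are hypothetical, and GROUPING__ID is deliberately left out because its value depends on the Hive version.

```python
from collections import defaultdict

COLS = ("month", "day")


def grouping_sets_uv(rows, grouping_sets):
    """rows: (month, day, cookieid) tuples; grouping_sets: tuples of column names.

    Returns (month, day, uv) tuples, one batch per grouping set, with None
    in place of the columns that are not part of the current grouping set.
    """
    result = []
    for gset in grouping_sets:
        seen = defaultdict(set)
        for month, day, cookie in rows:
            record = {"month": month, "day": day}
            key = tuple(record[c] if c in gset else None for c in COLS)
            seen[key].add(cookie)  # a set keeps only distinct cookies
        for key in sorted(seen, key=lambda k: tuple(str(x) for x in k)):
            result.append((key[0], key[1], len(seen[key])))
    return result


rows = [  # hypothetical sample data
    ("2015-03", "2015-03-10", "cookie1"),
    ("2015-03", "2015-03-10", "cookie5"),
    ("2015-03", "2015-03-12", "cookie7"),
    ("2015-04", "2015-04-12", "cookie3"),
    ("2015-04", "2015-04-13", "cookie2"),
    ("2015-04", "2015-04-13", "cookie3"),
]
for line in grouping_sets_uv(rows, [("month",), ("day",)]):
    print(line)
```

Note that cookie3 appears on two days in 2015-04 but is counted once in the monthly uv, which is why a distinct count per grouping set cannot be obtained by simply adding up the daily rows.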

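C. Using the CUBE function

Purpose: aggregates over every combination of the GROUP BY dimensions. The command is as follows:

SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM special5
GROUP BY month, day
WITH CUBE
ORDER BY GROUPING__ID;

The result is shown in the figure below: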

 

The command above is equivalent to:

SELECT NULL, NULL, COUNT(DISTINCT cookieid) AS uv, 0 AS GROUPING__ID FROM special5
UNION ALL
SELECT month, NULL, COUNT(DISTINCT cookieid) AS uv, 1 AS GROUPING__ID FROM special5 GROUP BY month
UNION ALL
SELECT NULL, day, COUNT(DISTINCT cookieid) AS uv, 2 AS GROUPING__ID FROM special5 GROUP BY day
UNION ALL
SELECT month, day, COUNT(DISTINCT cookieid) AS uv, 3 AS GROUPING__ID FROM special5 GROUP BY month, day

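D. Using the ROLLUP function

Purpose: a subset of CUBE; it aggregates hierarchically, starting from the leftmost dimension. The command is as follows:

SELECT month, day,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM special5
GROUP BY month, day
WITH ROLLUP
ORDER BY GROUPING__ID;

The result is shown in the figure below: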

 

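This implements a roll-up path such as: uv per month and day -> uv per month -> total uv. If month and day are swapped, the hierarchical aggregation is driven by the day dimension instead. The command is as follows:

SELECT day, month,
  COUNT(DISTINCT cookieid) AS uv,
  GROUPING__ID
FROM special5
GROUP BY day, month
WITH ROLLUP
ORDER BY GROUPING__ID;

The result is shown in the figure below: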
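One way to keep CUBE and ROLLUP apart is to list the grouping sets each one expands to. The Python sketch below (not Hive code) does exactly that; the function names cube_sets and rollup_sets are hypothetical helpers for illustration.

```python
from itertools import combinations


def cube_sets(dims):
    """CUBE: every subset of the dimensions, from the grand total to the full combination."""
    return [c for r in range(len(dims) + 1) for c in combinations(dims, r)]


def rollup_sets(dims):
    """ROLLUP: only the prefixes of the dimension list, from most to least detailed."""
    return [tuple(dims[:k]) for k in range(len(dims), -1, -1)]


print(cube_sets(("month", "day")))    # 4 grouping sets
print(rollup_sets(("month", "day")))  # 3 grouping sets: (month, day) -> (month) -> ()
print(rollup_sets(("day", "month")))  # swapping the order changes the hierarchy
```

ROLLUP over (month, day) never produces the day-only grouping set, which is why it is a subset of CUBE and why the order of the dimensions matters.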

 
