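Hive function categories, CLI commands, simple functions, aggregate functions, collection functions, special functions (analytic functions, window functions, miscellaneous functions, UDTFs), and common function demos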
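1.1 Hive function categories
1.2 Hive CLI commands
Show the functions available in the current session:
show functions;
Show a function's description:
DESC FUNCTION concat;
Show a function's extended description:
DESC FUNCTION EXTENDED concat;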
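1.3 Simple functions
These functions operate on a single record at a time:
Relational operators
Arithmetic operators
Logical operators
Numeric functions
Date functions
Type conversion
Conditional functions
String functions
Statistical functions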
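1.4 Aggregate functions
These functions operate on multiple records:
sum(): sum
count(): number of rows
avg(): average
distinct: number of distinct values when used inside count, as in count(distinct col)
min(): minimum
max(): maximum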
1.5 Collection functions
Complex type construction
Complex type access
Complex type size
1.6 Special functions
Window functions
Use cases
Partitioned sorting, dynamic Group By, Top N, cumulative calculations, hierarchical queries
Windowing functions: lead, lag, FIRST_VALUE, LAST_VALUE
Analytic functions
Analytics functions: RANK, ROW_NUMBER, DENSE_RANK, CUME_DIST, PERCENT_RANK, NTILE
Miscellaneous functions
java_method(class, method [, arg1 [, arg2]])
reflect(class, method [, arg1 [, arg2..]])
hash(a1 [, a2...])
UDTF
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
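lateral view is used together with UDTFs such as explode (typically applied to the array returned by split). It can turn one row into multiple rows,
and the expanded rows can then be aggregated. lateral view first calls the UDTF for every row of the base table; the UDTF expands that row into
one or more rows; lateral view then joins the results to the input rows, producing a virtual table that can be given a table alias.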
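The row expansion described above can be imitated outside Hive. The following Python sketch only illustrates the semantics and is not how Hive implements it; the helper name `lateral_view_explode` and the sample rows are hypothetical, and a literal separator is assumed (Hive's split takes a regular expression).

```python
from collections import Counter

def lateral_view_explode(rows, column, sep):
    """Mimic `LATERAL VIEW explode(split(column, sep)) tt AS adid`:
    every input row is joined with each element produced from it."""
    out = []
    for row in rows:
        for part in row[column].split(sep):
            # Keep the original columns and add the exploded value as `adid`.
            out.append({**row, "adid": part})
    return out

# Hypothetical sample rows (not taken from the original article).
rows = [
    {"id": "1", "str": "a,b,c"},
    {"id": "2", "str": "b,c"},
]

exploded = lateral_view_explode(rows, "str", ",")

# The expanded rows can then be aggregated, e.g. rows per adid.
counts = Counter(r["adid"] for r in exploded)
```

Here two input rows become five output rows, and grouping on `adid` afterwards corresponds to a GROUP BY over the virtual table.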
1.7 Common function demos:
Data preparation: create employee.txt under /home/tuzq/software/hivedata, with one tab-separated record (id, money, type) per line, then create the table and load the file:
create table employee(id string, money double,type string) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile;
load data local inpath '/home/tuzq/software/hivedata/employee.txt' into table employee;
Query with conditions (operator precedence, highest first: NOT, AND, OR):
select id,money from employee where (id = '2' or id = '3' or id = '4' or id = '5') AND (money > 120 AND money < 250);
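Why the parentheses in the conditional query matter can be checked with a small Python sketch. Python's `not`, `and`, `or` happen to follow the same precedence order as NOT, AND, OR, so the two predicates below behave like the WHERE clause with and without parentheses. The function names and test values are hypothetical.

```python
def matches_unparenthesized(id_, money):
    # AND binds tighter than OR, so the money range is combined
    # only with the last test (id_ == "5").
    return id_ == "2" or id_ == "3" or id_ == "4" or id_ == "5" and money > 120 and money < 250

def matches_parenthesized(id_, money):
    # The form used in the query: the money range applies to every id.
    return (id_ == "2" or id_ == "3" or id_ == "4" or id_ == "5") and (money > 120 and money < 250)

# An employee with id '2' and money 50 slips through without parentheses.
leaks = matches_unparenthesized("2", 50.0)
filtered = matches_parenthesized("2", 50.0)
```

Without the parentheses, rows with id 2, 3 or 4 are returned regardless of their money value.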
Type conversion with cast:
select cast (1.5 as int);
URL parsing function
parse_url(string urlString, string partToExtract [, string keyToExtract])
select parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from employee limit 1;
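To see what the individual parts of that URL are, here is a rough Python analogue built on the standard `urllib.parse` module. It is a sketch covering only a few part names (HOST, PATH, QUERY, REF, PROTOCOL) and is not Hive's implementation; the Python function `parse_url` is a hypothetical stand-in.

```python
from urllib.parse import urlparse, parse_qs

def parse_url(url, part, key=None):
    """Rough analogue of Hive's parse_url for a handful of parts."""
    u = urlparse(url)
    if part == "HOST":
        return u.hostname
    if part == "PATH":
        return u.path
    if part == "QUERY":
        if key is None:
            return u.query
        # With a key, return the value of that query parameter.
        values = parse_qs(u.query).get(key)
        return values[0] if values else None
    if part == "REF":
        return u.fragment
    if part == "PROTOCOL":
        return u.scheme
    return None

url = "http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1"
host = parse_url(url, "HOST")         # 'facebook.com'
k1 = parse_url(url, "QUERY", "k1")    # 'v1'
```

The optional third argument corresponds to keyToExtract and only makes sense together with QUERY.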
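String concatenation function: concat
Syntax: concat(string A, string B…)
Return type: string
Description: returns the concatenation of the input strings; any number of input strings is supported
Example:
hive> select concat('abc','def','gh') from lxw_dual;
abcdefgh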
String concatenation with a separator: concat_ws
Syntax: concat_ws(string SEP, string A, string B…)
Return type: string
Description: returns the concatenation of the input strings, with SEP inserted between them
concat_ws(string SEP, array<string>)
Example:
hive> select concat_ws(',','abc','def','gh') from lxw_dual;
abc,def,gh
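Further examples:
List all distinct values of the column (i.e. deduplicate):
collect_set(id) // returns an array
List all values of the column, keeping duplicates:
collect_list(id) // returns an array
select collect_set(id) from taborder;
Sum:
sum(money)
Count rows:
count(*)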
select sum(num),count(*) from taborder;
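The difference between collect_set and collect_list, and how they combine with sum, count and concat_ws, can be seen on a few rows. This Python sketch only mirrors the semantics; the rows and the column layout (id, ch, num) for taborder are hypothetical, and the set is sorted here purely to make the result deterministic (Hive does not guarantee an order).

```python
# Hypothetical taborder rows: (id, ch, num).
rows = [
    ("1", "a", 10),
    ("2", "a", 20),
    ("2", "b", 30),
    ("3", "b", 40),
]

ids = [r[0] for r in rows]

collected_list = list(ids)        # like collect_list(id): duplicates kept
collected_set = sorted(set(ids))  # like collect_set(id): duplicates removed
total = sum(r[2] for r in rows)   # like sum(num)
row_count = len(rows)             # like count(*)

# An array<string> can be passed straight to concat_ws, as in
# concat_ws(',', collect_set(id)).
joined = ",".join(collected_set)
```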
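Window functions
first_value (value of the first row in the window)
first_value(money) over (partition by id order by money)
select ch,num,first_value(num) over (partition by ch order by num) from taborder;
A frame such as rows between 1 preceding and 1 following covers the current row plus the row before it and the row after it; the query below uses the two preceding rows and the current row:
hive (liguodong)> select ch,num,first_value(num) over (partition by ch order by num ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) from taborder;
last_value (value of the last row in the window; with no order by and no frame clause, the window is the whole partition)
hive (liguodong)> select ch,num,last_value(num) over (partition by ch) from taborder;
lead
Value from the second row after the current row:
lead(money,2) over (order by money)
lag
Value from the second row before the current row:
lag(money,2) over (order by money)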
select ch, num, lead(num,2) over (order by num) from taborder;
select ch, num, lag(num,2) over (order by num) from taborder;
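What lead, lag and a framed first_value return for each row can be traced on a short ordered list. The Python sketch below mimics the semantics only; the helper names and the sample values are hypothetical, and `None` stands in for NULL.

```python
def lead(values, i, offset=1, default=None):
    """Value `offset` rows after row i, or default past the end."""
    j = i + offset
    return values[j] if j < len(values) else default

def lag(values, i, offset=1, default=None):
    """Value `offset` rows before row i, or default before the start."""
    j = i - offset
    return values[j] if j >= 0 else default

def first_value_rows(values, i, preceding):
    """first_value over ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW."""
    return values[max(0, i - preceding)]

# Hypothetical num column, already sorted as by `order by num`.
nums = [10, 20, 30, 40, 50]

lead2 = [lead(nums, i, 2) for i in range(len(nums))]
lag2 = [lag(nums, i, 2) for i in range(len(nums))]
first2 = [first_value_rows(nums, i, 2) for i in range(len(nums))]
# lead2  -> [30, 40, 50, None, None]
# lag2   -> [None, None, 10, 20, 30]
# first2 -> [10, 10, 10, 20, 30]
```

The last two rows have no row two positions ahead, so lead yields NULL there; lag behaves symmetrically for the first two rows.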
rank: ranking
rank() over(partition by id order by money)
select ch, num, rank() over(partition by ch order by num) as rank from taborder;
select ch, num, dense_rank() over(partition by ch order by num) as dense_rank from taborder;
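cume_dist
cume_dist = (largest row number among rows with the same value) / (number of rows in the partition)
cume_dist() over (partition by id order by money)
percent_rank = (smallest row number among rows with the same value - 1) / (number of rows in the partition - 1)
The first row's value is always 0.
percent_rank() over (partition by id order by money)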
select ch,num,cume_dist() over (partition by ch order by num) as cume_dist,
percent_rank() over (partition by ch order by num) as percent_rank
from taborder;
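The ranking formulas can be worked through on a partition that contains a tie. This Python sketch computes rank, dense_rank, cume_dist and percent_rank from the definitions given above; it is an illustration rather than Hive's implementation, and the function name `ranking` and the sample values are hypothetical.

```python
def ranking(values):
    """values: the ordered column of one partition (ascending)."""
    n = len(values)
    out = []
    for v in values:
        first = values.index(v) + 1        # smallest row number with this value
        last = n - values[::-1].index(v)   # largest row number with this value
        dense = len(set(values[:first]))   # distinct values up to and including v
        out.append({
            "num": v,
            "rank": first,
            "dense_rank": dense,
            "cume_dist": last / n,
            "percent_rank": (first - 1) / (n - 1) if n > 1 else 0.0,
        })
    return out

# Hypothetical partition with one tie.
result = ranking([10, 20, 20, 30])
ranks = [r["rank"] for r in result]              # [1, 2, 2, 4]
dense_ranks = [r["dense_rank"] for r in result]  # [1, 2, 2, 3]
cume = [r["cume_dist"] for r in result]          # [0.25, 0.75, 0.75, 1.0]
```

The tie shows the difference between rank (which leaves a gap after the tied rows) and dense_rank (which does not).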
ntile: bucketing
ntile(2) over (order by money desc) splits the rows into two buckets
select ch,num,ntile(2) over (order by num desc) from taborder;
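How ntile assigns bucket numbers when the row count does not divide evenly can be sketched as follows. The sketch assumes the usual SQL convention that the earlier buckets receive the extra rows; that this matches Hive exactly is an assumption, and the Python function `ntile` is a hypothetical stand-in.

```python
def ntile(n_rows, buckets):
    """Return the 1-based bucket number for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, buckets)
    result = []
    for b in range(1, buckets + 1):
        # Assumed convention: the first `extra` buckets get one more row.
        size = base + (1 if b <= extra else 0)
        result.extend([b] * size)
    return result

five_rows = ntile(5, 2)   # [1, 1, 1, 2, 2]
four_rows = ntile(4, 2)   # [1, 1, 2, 2]
```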
Miscellaneous functions
select id,java_method("java.lang.Math","sqrt",cast(id as double)) as sqrt from hiveTest;
UDTF
select id,adid
from employee
lateral view explode(split(type,'B')) tt as adid;
explode turns one array-valued column into multiple rows:
hive (liguodong)> select id,adid
> from hiveDemo
> lateral view explode(split(str,',')) tt as adid;
Regular expressions
Functions that use regular expressions:
regexp_replace(string subject, string pattern, string replacement)
regexp_extract(string subject, string pattern, int index)
hive> select regexp_replace('foobar', 'oo|ar', '') from lxw_dual;
fb
hive> select regexp_extract('979|7.10.80|8684', '.*\\|(.*)', 1) from hiveDemo limit 1;
hive> select regexp_extract('979|7.10.80|8684', '(.*?)\\|(.*)', 1) from hiveDemo limit 1;
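The effect of greedy versus lazy matching in these patterns can be checked with Python's `re` module, which treats these particular patterns the same way as Java regular expressions. The two helper functions are hypothetical stand-ins for the Hive functions: the no-match behaviour of regexp_extract is not modelled (the sketch returns `None`), and replacement-string escapes differ between the two engines.

```python
import re

def regexp_replace(subject, pattern, replacement):
    """Replace every match of pattern in subject."""
    return re.sub(pattern, replacement, subject)

def regexp_extract(subject, pattern, index):
    """Return capture group `index` of the first match, or None."""
    m = re.search(pattern, subject)
    return m.group(index) if m else None

s = "979|7.10.80|8684"

replaced = regexp_replace("foobar", "oo|ar", "")   # 'fb'
# Greedy .* runs to the last '|', so group 1 is the final field.
last_field = regexp_extract(s, r".*\|(.*)", 1)     # '8684'
# Lazy .*? stops at the first '|', so group 1 is the first field.
first_field = regexp_extract(s, r"(.*?)\|(.*)", 1)  # '979'
```

In the Hive string literals the backslash is doubled ('\\|'), whereas the Python raw strings need only one.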
Reposted from:
Original: https://blog.csdn.net/tototuzuoquan/article/details/73028361