Bucketed Tables in Hive

【Bucketing Overview】

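Partitioning a Hive table means splitting it into directories: the data of a very large table is divided by a given criterion into separate directories, and the partition column is not one of the columns stored in the table's data files. Bucketing means splitting into files: the data of a very large file is divided by a given criterion into separate bucket files, and the bucketing column must be an existing column of the table.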

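Bucketing is useful for two reasons:

1. It can speed up joins between tables, because bucketing has already split the large data set into smaller pieces. If the source data is divided into 4 buckets and a join between two tables only needs to read the one bucket that matches the condition, efficiency can in theory improve by up to 4 times.
2. It speeds up data sampling, for the same reason: only the buckets selected by the sampling rule are read, so no full table scan is needed.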
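
The join benefit can be illustrated with a small simulation. This is only a sketch of the idea, not how Hive implements bucketed joins: it assumes both tables are bucketed on the join key with the same number of buckets, uses a naive nested-loop join, and the tables `users` and `orders` are hypothetical.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # same bucket count on both sides (assumption)


def to_buckets(rows, num_buckets):
    """Group (key, value) rows by key % num_buckets, like bucket files."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def nested_loop_join(left, right):
    """Naive join that also counts how many row pairs were compared."""
    matched, comparisons = [], 0
    for lk, lv in left:
        for rk, rv in right:
            comparisons += 1
            if lk == rk:
                matched.append((lk, lv, rv))
    return matched, comparisons


def bucket_join(left, right, num_buckets=NUM_BUCKETS):
    """Join bucket i of the left table only with bucket i of the right."""
    lb = to_buckets(left, num_buckets)
    rb = to_buckets(right, num_buckets)
    matched, comparisons = [], 0
    for b in range(num_buckets):
        rows, n = nested_loop_join(lb[b], rb[b])
        matched.extend(rows)
        comparisons += n
    return matched, comparisons


# Hypothetical tables with join keys 0..7
users = [(k, f"user{k}") for k in range(8)]
orders = [(k, f"order{k}") for k in range(8)]

full_rows, full_cmp = nested_loop_join(users, orders)
bucket_rows, bucket_cmp = bucket_join(users, orders)
print(full_cmp, bucket_cmp)
```

With evenly distributed keys, the bucketed variant compares a quarter as many row pairs while producing the same join result, which matches the "4 times in theory" estimate above.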

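When a bucketed table is populated, the number of reduce tasks equals the number of buckets, which is also the number of bucket files finally produced, because the bucketed table is computed by a MapReduce job. In this sense a bucket is essentially the same concept as a partition in MapReduce.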

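The core of the sampling syntax for bucketed tables:

select * from tableName tablesample(bucket x out of y on column), where:

x: sample data from the x-th bucket (buckets are numbered from 1)

y: one bucket is taken out of every y buckets (y must be a multiple or a factor of the number of buckets)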
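
To make the roles of x and y concrete, here is a minimal sketch that works out which table buckets a given sample touches. It assumes the sampling predicate is "hash mod y equals x - 1" and that hash values are evenly distributed; the function name `sampled_buckets` is made up for this illustration, and whether Hive actually skips the unread files depends on the version and settings.

```python
def sampled_buckets(x, y, num_buckets):
    """Return {bucket_number: expected fraction of that bucket read}
    for TABLESAMPLE(BUCKET x OUT OF y) on a table with num_buckets buckets.
    Bucket numbers start at 1."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y == 0:
        # y is a factor of the bucket count:
        # whole buckets x, x + y, x + 2y, ... are selected
        return {b: 1.0 for b in range(x, num_buckets + 1, y)}
    if y % num_buckets == 0:
        # y is a multiple of the bucket count:
        # only a slice of a single bucket is selected
        return {(x - 1) % num_buckets + 1: num_buckets / y}
    raise ValueError("y must be a multiple or a factor of num_buckets")


print(sampled_buckets(1, 4, 4))  # one whole bucket
print(sampled_buckets(1, 2, 4))  # every second bucket
print(sampled_buckets(3, 8, 4))  # half of one bucket
```

For a 4-bucket table, y = 4 reads exactly one bucket, y = 2 reads two whole buckets (half the table), and y = 8 reads roughly half of a single bucket (one eighth of the table).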

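【Basic Usage】

1. Enable bucketing support

set hive.enforce.bucketing=true;    -- default: false

With this set to true, the MapReduce job automatically sets the number of reduce tasks to the number of buckets.

You can also set the number of reduce tasks yourself through mapred.reduce.tasks, but this is not recommended when bucketing. (Note: the number of buckets, that is, files, produced by one job equals the number of reduce tasks.)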

2. Load data into a bucketed table

/* Inserting data into a bucketed table looks like this */

insert into table bucket_table select columns from tbl;    -- append to existing data

insert overwrite table bucket_table select columns from tbl;    -- overwrite existing data

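3. Sample data from a bucketed table

/*

Sampling syntax: TABLESAMPLE(BUCKET x OUT OF y), where:

x: sample data from the x-th bucket

y: one bucket is taken out of every y buckets (y must be a multiple or a factor of the number of buckets)

*/

select * from bucket_table tablesample(bucket 1 out of 4 on columns);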

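【Worked Example】

1. Suppose the local file /root/hivedata/ft contains the following lines (columns are tab-separated: id, name, age):

1    zhang    12
2    lisi    34
3    wange    23
4    zhouyu    15
5    guoji    45
6    xiafen    48
7    yanggu    78
8    liuwu    41
9    zhuto    66
10    madan    71
11    sichua    89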

2. Create a regular Hive table and load the local file:

hive> CREATE TABLE ft(id INT, name STRING, age INT)
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
OK
Time taken: 0.216 seconds

hive> load data local inpath '/root/hivedata/ft' into table ft;
Loading data to table hehe.ft
Table hehe.ft stats: [numFiles=1, totalSize=127]
OK
Time taken: 1.105 seconds

hive> select * from ft;
OK
1    zhang    12
2    lisi    34
3    wange    23
4    zhouyu    15
5    guoji    45
6    xiafen    48
7    yanggu    78
8    liuwu    41
9    zhuto    66
10    madan    71
11    sichua    89
NULL    NULL    NULL
Time taken: 0.229 seconds, Fetched: 12 row(s)

The final all-NULL row most likely comes from a trailing blank line in the source file.

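3. Create the bucketed table:

hive> create table fentong(
    > id int,
    > name string,
    > age int)
    > clustered by(age) into 4 buckets        -- split into 4 buckets by the age column
    > row format delimited fields terminated by ',';

The bucket each row falls into is determined as follows:

1. Hash the value of the bucketing column (age here; for an INT column, classic Hive bucketing uses the value itself as the hash) and divide it by the number of buckets, 4.
2. Take the remainder, that is, the modulus: remainder 0 goes to bucket 1, remainder 1 to bucket 2, remainder 2 to bucket 3, and remainder 3 to bucket 4.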
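
The rule can be checked by hand against the example data. The sketch below assumes the classic scheme in which an INT value hashes to itself (newer Hive versions may use a different hash function, so actual file placement can differ); `bucket_number` is a helper written only for this illustration.

```python
NUM_BUCKETS = 4

# (id, name, age) rows from the ft table in this example
rows = [
    (1, "zhang", 12), (2, "lisi", 34), (3, "wange", 23), (4, "zhouyu", 15),
    (5, "guoji", 45), (6, "xiafen", 48), (7, "yanggu", 78), (8, "liuwu", 41),
    (9, "zhuto", 66), (10, "madan", 71), (11, "sichua", 89),
]


def bucket_number(age, num_buckets=NUM_BUCKETS):
    """1-based bucket for an INT column, assuming hash(age) == age."""
    return (age & 0x7FFFFFFF) % num_buckets + 1


# Collect the names that land in each bucket, in input order
buckets = {}
for _id, name, age in rows:
    buckets.setdefault(bucket_number(age), []).append(name)

for number in sorted(buckets):
    print(number, buckets[number])
```

Under these assumptions, ages 12 and 48 land in bucket 1; 45, 41 and 89 in bucket 2; 34, 78 and 66 in bucket 3; and 23, 15 and 71 in bucket 4.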

4. Load data into the bucketed table:

hive> insert into table fentong select id, name, age from ft;

5. Query the bucketed table to confirm the data was loaded correctly:

hive> select * from fentong;

6. Now let's see how the bucketed data can be used:

hive> select id, name, age from fentong tablesample(bucket 1 out of 4 on age);
OK
NULL    NULL    NULL
6    xiafen    48
1    zhang    12

hive> select id, name, age from fentong tablesample(bucket 2 out of 4 on age);
OK
11    sichua    89
8    liuwu    41
5    guoji    45

hive> select id, name, age from fentong tablesample(bucket 3 out of 4 on age);
OK
9    zhuto    66
7    yanggu    78
2    lisi    34