hive分割槽分桶

阿新 • • 發佈：2021-02-26

[TOC] ## 1、分割槽如果一個表中資料很多，我們查詢時就很慢，耗費大量時間，如果要查詢其中部分資料該怎麼辦呢，這時我們引入分割槽的概念。 Hive 中的分割槽表分為兩種：靜態分割槽和動態分割槽。 ### 1.1、靜態分割槽可以根據 PARTITIONED BY 建立分割槽表。一個表可以擁有一個或者多個分割槽，每個分割槽以資料夾的形式單獨存在表文件夾的目錄下。 - 一個分割槽：表目錄下只有一級目錄。 - 多個分割槽：表目錄下是多級目錄。分割槽列是虛擬列，它們不是資料本身的一部分，而是在載入時派生的。 #### 1.1.1、一個分割槽 ```sql -- 建表 create table order_table_s ( order_id int, -- 訂單id product_name string, -- 產品名稱 price int -- 產品價格 ) partitioned by (deal_day string) -- 交易日期YYYYMM row format delimited fields terminated by "\t"; -- 查看錶結構 hive> desc order_table_s; OK order_id int product_name string price int deal_day string # Partition Information # col_name data_type comment deal_day string -- 源資料 -- order-201901.txt 1 cellphone 2000 2 tv 3000 3 sofa 8000 4 cabinet 5000 5 bicycle 1000 6 truck 20000 -- order-201902.txt 1 apple 10 2 banana 8 3 milk 70 4 liquor 150 -- 匯入資料 load data local inpath '/root/data/order-201901.txt' overwrite into table order_table_s partition(deal_day='201901'); load data local inpath '/root/data/order-201902.txt' overwrite into table order_table_s partition(deal_day='201902'); -- 查看錶 hive> select * from order_table_s; 1 cellphone 2000 201901 2 tv 3000 201901 3 sofa 8000 201901 4 cabinet 5000 201901 5 bicycle 1000 201901 6 truck 20000 201901 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 -- 檢視201902的資料 hive> select * from order_table_s where deal_day='201902'; 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 -- 檢視hdfs目錄 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s Found 2 items drwxr-xr-x - root supergroup 0 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901 drwxr-xr-x - root supergroup 0 2021-01-08 16:27 /user/hive/warehouse/order_table_s/deal_day=201902 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s/deal_day=201901 Found 1 items -rw-r--r-- 1 root supergroup 56 2021-01-08 18:23 /user/hive/warehouse/order_table_s/deal_day=201901/000000_0 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s/deal_day=201901 Found 1 items -rw-r--r-- 1 root supergroup 83 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901/order-201901.txt -- 檢視mysql中的元資料 mysql> select * from PARTITION_KEY_VALS; +---------+--------------+-------------+ | PART_ID | PART_KEY_VAL | INTEGER_IDX | +---------+--------------+-------------+ | 16 | 201901 | 0 | | 21 | 201902 | 0 | +---------+--------------+-------------+ -- 刪除分割槽201902 -- ALTER TABLE table_name DROP partition_spec, partition_spec,... hive> alter table order_table_s drop partition(deal_day='201902'); -- 刪除後，查看錶 hive> select * from order_table_s; 1 cellphone 2000 201901 2 tv 3000 201901 3 sofa 8000 201901 4 cabinet 5000 201901 5 bicycle 1000 201901 6 truck 20000 201901 -- 刪除後，檢視hdfs目錄 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_s Found 1 items drwxr-xr-x - root supergroup 0 2021-01-08 16:05 /user/hive/warehouse/order_table_s/deal_day=201901 -- 刪除後，檢視mysql中的元資料 mysql> select * from PARTITION_KEY_VALS; +---------+--------------+-------------+ | PART_ID | PART_KEY_VAL | INTEGER_IDX | +---------+--------------+-------------+ | 16 | 201901 | 0 | +---------+--------------+-------------+ -- 刪除後，資料和元資料都被刪除了 -- --------------------------------------------------- -- 重新新增分割槽201902 -- 這裡的location是這個分割槽的資料存放的位置，預設是`/user/hive/warehouse/order_table_s` alter table order_table_s add partition(deal_day='201902') location '/in/order'; -- 把order-201902.txt移動到'/in/order'目錄下 -- 所以匯入資料，也可以直接複製到對應的目錄下。 [root@zgg data]# hadoop fs -mv /in/order-201902.txt /in/order -- 檢視201902的資料 hive> select * from order_table_s where deal_day='201902'; 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 -- hive中查詢分割槽 hive> show partitions order_table_s; deal_day=201901 deal_day=201902 -- 如果直接將資料複製到分割槽目錄下，select沒有資料的話，可以執行`msck repair table order_table_s` 重新同步hdfs上的分割槽資訊。 ``` #### 1.1.2、多個分割槽 ```sql -- 建表:兩個分割槽 create table order_table_d ( order_id int, -- 訂單id product_name string, -- 產品名稱 price int -- 產品價格 ) partitioned by (deal_day string,category string) -- 交易日期YYYYMM，產品類別 row format delimited fields terminated by "\t"; -- 查看錶結構 hive> desc order_table_d; OK order_id int product_name string price int deal_day string category string # Partition Information # col_name data_type comment deal_day string category string -- 源資料 -- order-201901-electronicproducts.txt 1 cellphone 2000 2 tv 3000 -- order-201901-car.txt 1 bicycle 1000 2 truck 20000 -- order-201902-fruit.txt 1 apple 10 2 banana 8 -- order-201902-drinks.txt 1 milk 70 2 liquor 150 -- 匯入資料 load data local inpath '/root/data/order-201901-electronicproducts.txt' overwrite into table order_table_d partition(deal_day='201901',category='electronicproducts'); load data local inpath '/root/data/order-201901-car.txt' overwrite into table order_table_d partition(deal_day='201901',category='car'); load data local inpath '/root/data/order-201902-fruit.txt' overwrite into table order_table_d partition(deal_day='201902',category='fruit'); load data local inpath '/root/data/order-201902-drinks.txt' overwrite into table order_table_d partition(deal_day='201902',category='drinks'); -- 檢視 hive> select * from order_table_d; OK 1 bicycle 1000 201901 car 2 truck 20000 201901 car 1 cellphone 2000 201901 electronicproducts 2 tv 3000 201901 electronicproducts 1 milk 70 201902 drinks 2 liquor 150 201902 drinks 1 apple 10 201902 fruit 2 banana 8 201902 fruit hive> show partitions order_table_d; OK deal_day=201901/category=car deal_day=201901/category=electronicproducts deal_day=201902/category=drinks deal_day=201902/category=fruit -- 檢視hdfs目錄 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d Found 2 items drwxr-xr-x - root supergroup 0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901 drwxr-xr-x - root supergroup 0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201902 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d/deal_day=201901 Found 2 items drwxr-xr-x - root supergroup 0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901/category=car drwxr-xr-x - root supergroup 0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201901/category=electronicproducts -- 檢視mysql中的元資料 mysql> select * from PARTITION_KEY_VALS; +---------+--------------------+-------------+ | PART_ID | PART_KEY_VAL | INTEGER_IDX | +---------+--------------------+-------------+ | 23 | 201901 | 0 | | 23 | electronicproducts | 1 | | 24 | 201901 | 0 | | 24 | car | 1 | | 25 | 201902 | 0 | | 25 | fruit | 1 | | 26 | 201902 | 0 | | 26 | drinks | 1 | +---------+--------------------+-------------+ -- 刪除分割槽deal_day=201902/category=fruit hive> alter table order_table_d drop partition(deal_day='201902',category='fruit'); -- 刪除後，查看錶分割槽 hive> show partitions order_table_d; OK deal_day=201901/category=car deal_day=201901/category=electronicproducts deal_day=201902/category=drinks -- 刪除後，檢視hdfs目錄 [root@zgg data]# hadoop fs -ls /user/hive/warehouse/order_table_d/deal_day=201902 Found 1 items drwxr-xr-x - root supergroup 0 2021-01-08 17:00 /user/hive/warehouse/order_table_d/deal_day=201902/category=drinks -- 刪除後，檢視mysql中的元資料 mysql> select * from PARTITION_KEY_VALS; +---------+--------------------+-------------+ | PART_ID | PART_KEY_VAL | INTEGER_IDX | +---------+--------------------+-------------+ | 23 | 201901 | 0 | | 23 | electronicproducts | 1 | | 24 | 201901 | 0 | | 24 | car | 1 | | 26 | 201902 | 0 | | 26 | drinks | 1 | +---------+--------------------+-------------+ -- 重新新增分割槽deal_day=201902/category=fruit hive> alter table order_table_d add partition(deal_day='201902',category='fruit'); -- 重新匯入 hive> load data local inpath '/root/data/order-201902-drinks.txt' overwrite into table order_table_d partition(deal_day='201902',category='drinks'); -- 檢視 hive> select * from order_table_d; 1 bicycle 1000 201901 car 2 truck 20000 201901 car 1 cellphone 2000 201901 electronicproducts 2 tv 3000 201901 electronicproducts 1 milk 70 201902 drinks 2 liquor 150 201902 drinks hive> show partitions order_table_d; deal_day=201901/category=car deal_day=201901/category=electronicproducts deal_day=201902/category=drinks deal_day=201902/category=fruit -- 匯入資料到分割槽表，還可以使用 insert -- 將order_table_d的201901分割槽中的資料插入到order_table_s的201901分割槽中 -- 檢視order_table_s資料 hive> select * from order_table_s; 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 hive> insert into table order_table_s partition(deal_day='201901') select order_id,product_name,price from order_table_d where deal_day='201901'; -- 檢視結果 hive> select * from order_table_s where deal_day='201901'; 1 bicycle 1000 201901 2 truck 20000 201901 1 cellphone 2000 201901 2 tv 3000 201901 ``` ### 1.2、動態分割槽上面展示瞭如何使用 insert 插入到分割槽表，如果再插入分割槽 '201902' 資料，需要再寫一條 insert 語句。而動態分割槽可以直接使用一條 insert 語句完成。下面利用動態分割槽進行演示。演示前先進行設定： hive 中預設是靜態分割槽，想要使用動態分割槽，需要設定如下引數，可以使用臨時設定，也可以寫在配置檔案（hive-site.xml）裡，永久生效。臨時配置如下: //開啟動態分割槽預設為false，不開啟 set hive.exec.dynamic.partition=true; //指定動態分割槽模式，預設為strict，即必須指定至少一個分割槽為靜態分割槽， //nonstrict模式表示允許所有的分割槽欄位都可以使用動態分割槽 set hive.exec.dynamic.partition.mode=nonstrict; ```sql -- 建立動態分割槽表 create table order_table_dy ( order_id int, -- 訂單id product_name string, -- 產品名稱 price int -- 產品價格 ) partitioned by (deal_day string) -- 交易日期YYYYMM，產品類別 row format delimited fields terminated by "\t"; hive> select * from order_table_s; 1 bicycle 1000 201901 2 truck 20000 201901 1 cellphone 2000 201901 2 tv 3000 201901 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 -- 將order_table_s表裡的兩個分割槽的資料插入到order_table_dy表 hive> insert into table order_table_dy(deal_day) select order_id,product_name,price,deal_day from order_table_s; -- 檢視結果 hive> select * from order_table_dy where deal_day=201901; OK 1 bicycle 1000 201901 2 truck 20000 201901 1 cellphone 2000 201901 2 tv 3000 201901 hive> select * from order_table_dy where deal_day=201902; OK 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 hive> show partitions order_table_dy; OK deal_day=201901 deal_day=201902 ``` 靜態分割槽(SP)列:在涉及多個分割槽列的 DML/DDL 中，這些列的值在編譯時已知(由使用者給出)。動態分割槽(DP)列:在執行時才知道其值的列。 DP 列的指定方式與 SP 列的指定方式相同：在 partition 子句中。唯一的區別是 DP 列沒有值，而 SP 列有。在 partition 子句中，我們需要指定所有分割槽列，即使它們都是 DP 列。在 `INSERT ... SELECT ...` 查詢時，動態分割槽列必須在 SELECT 語句中的最後一個列中指定，且順序與它們在 PARTITION() 子句中出現的順序相同。 ```sql -- 正確 INSERT OVERWRITE TABLE T PARTITION (ds='2010-03-03', hr) SELECT key, value, /*ds,*/ hr FROM srcpart WHERE ds is not null and hr>10; -- 錯誤 INSERT OVERWRITE TABLE T PARTITION (ds, hr = 11) SELECT key, value, ds/*, hr*/ FROM srcpart WHERE ds is not null and hr=11; ``` ## 2、分桶通過計算表的某些列的雜湊值，分割槽中的資料再被劃分到桶中。例如，page_views 表根據 userid 分桶，userid 是 page_view 表的列之一，而不是分割槽列。將表或分割槽組織成桶有以下幾個目的： (1)抽樣更高效，因為在處理大規模的資料集時，在開發、測試階段將所有的資料全部處理一遍可能不太現實，這時抽樣就必不可少。抽樣： SELECT * FROM table_name TABLESAMPLE(n PERCENT)；抽樣出n%的資料，會全表掃描。有了桶之後呢？如果在 TABLESAMPLE 子句中指定的列與 CLUSTERED BY 子句中的列相匹配，則 TABLESAMPLE 只掃描表中要求的雜湊分割槽【就是具體的桶】 SELECT * FROM film TABLESAMPLE(BUCKET x OUTOF y) 如果表是用 `CLUSTERED BY id INTO 32 BUCKETS` 建立的。 TABLESAMPLE(BUCKET 3 OUT OF 16 ON id)：將挑選出第 3 和第 19 個聚類，因為每個桶將由 (32/16)=2 個聚類組成。【每個桶有2個聚類，一共64個聚類，取出第 3 和第 19 個聚類】 TABLESAMPLE(BUCKET 3 OUT OF 64 ON id)：將挑選出第 3 個聚類的一半，因為每個桶將由 (32/64)=1/2 個聚類組成 (2)更好的查詢處理效率。大表在JOIN的時候，效率低下。如果對兩個表先分別按id分桶，那麼相同id都會歸入一個桶。那麼此時再進行JOIN的時候是按照桶來JOIN的，那麼大大減少了JOIN的數量。在建立桶之前，需要設定 `set hive.enforce.bucketing=true;`，使得 hive 能識別桶。【僅版本 0.x 和 1.x，對於2.x版本不再需要】 ```sql hive> select * from order_table_s; OK 1 bicycle 1000 201901 2 truck 20000 201901 1 cellphone 2000 201901 2 tv 3000 201901 1 apple 10 201902 2 banana 8 201902 3 milk 70 201902 4 liquor 150 201902 -- 建立分桶 create table order_table_buckets ( order_id int, -- 訂單id product_name string, -- 產品名稱 price int -- 產品價格 ) partitioned by (deal_day string) -- 交易日期YYYYMM，產品類別 clustered by (order_id) into 4 buckets row format delimited fields terminated by "\t"; -- 匯入資料： hive> insert into table order_table_buckets partition(deal_day) select order_id,product_name,price,deal_day from order_table_s where order_id=1; -- 查看錶 hive> select * from order_table_buckets; OK 1 cellphone 2000 201901 1 bicycle 1000 201901 1 apple 10 201902 -- 檢視hdfs中的目錄 [root@zgg ~]# hadoop fs -ls /user/hive/warehouse/order_table_buckets Found 2 items drwxr-xr-x - root supergroup 0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901 drwxr-xr-x - root supergroup 0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201902 [root@zgg ~]# hadoop fs -ls /user/hive/warehouse/order_table_buckets/deal_day=201901 Found 4 items -rw-r--r-- 1 root supergroup 0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000000_0 -rw-r--r-- 1 root supergroup 32 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000001_0 -rw-r--r-- 1 root supergroup 0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000002_0 -rw-r--r-- 1 root supergroup 0 2021-01-08 18:49 /user/hive/warehouse/order_table_buckets/deal_day=201901/000003_0 ``` 參考： [https://blog.51cto.com/10814168/2135046](https://blog.51cto.com/10814168/

hive分割槽分桶

深入理解 Hive 分割槽分桶（Inceptor）

hive分割槽分桶操作及載入資料

Hive 分割槽分桶使用

hive分割槽分桶

Hive的分桶管理和分割槽

Hive（二）--分割槽分桶，內部表外部表

Hive總結2（分割槽+分桶+查詢）

HIVE的四種排序，以及分割槽分桶

Hive的分桶

## Hive分割槽、桶、與傾斜

Hive的分桶表和抽樣查詢（附帶案例，分桶和分表的區別）

大資料012-Hive的分桶詳解

能說一下你們專案中hive的庫表設計嗎---庫表設計三板斧---內外部表|分割槽和分桶|序列化和反序列化

hive的三板斧：內部表和外部表、分割槽和分桶以及序列化/反序列化(SerDe)

Hive面試題:Hive分割槽表和分桶表的區別

hive的分割槽和分桶

一起學Hive——建立內部表、外部表、分割槽表和分桶表及匯入資料

HIVE分割槽以及分桶的簡單概念

HIVE 分割槽表分桶表

HIVE—索引、分割槽和分桶的區別

hive分割槽分桶

相關推薦