Hive Managed (Internal) Tables vs. External Tables

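A table created without the external keyword is a managed table (also called an internal table); a table created with external is an external table.
Differences:
Managed table data is managed by Hive itself; external table data stays in HDFS and can be managed by processes outside Hive.
Managed table data is stored under hive.metastore.warehouse.dir (default: /user/hive/warehouse); the storage location of an external table is specified by the user.
Dropping a managed table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left untouched.
Changes to a managed table are synchronized to the metadata directly; after the structure or partitions of an external table are changed, the metadata has to be repaired (MSCK REPAIR TABLE table_name;).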
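
To illustrate the last point, here is a minimal sketch of the repair step on a partitioned external table. The table name logs, its columns, the path /user/logs and the partition value are hypothetical and only serve the illustration.

```sql
-- Hypothetical partitioned external table; name, columns and path are assumptions.
create external table logs(
    msg string
)
partitioned by (dt string)
location '/user/logs';

-- Suppose some process outside Hive has written a directory
-- /user/logs/dt=2020-01-01 to HDFS. The metastore has no record of it yet,
-- so that partition is not listed here:
show partitions logs;

-- Scan the table location and register partition directories
-- that exist on HDFS but are missing from the metastore:
msck repair table logs;

-- The new partition should now be listed.
show partitions logs;
```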

The experiment below makes these differences concrete.
Hands-on experiment
Create the managed table t1

create table t1(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<String,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;


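View the table description: desc t1;
Load data (t1)

Note: a plain insert (as opposed to insert overwrite) is rarely used, because inserting even a single row launches a MapReduce job. Here we use LOAD DATA instead.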

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
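
As a rough illustration of the optional keywords in this syntax, the two statements below are alternatives, not steps to run in sequence. The HDFS path /tmp/data is a hypothetical example.

```sql
-- With LOCAL: the path refers to the local file system,
-- and the file is copied into the table's storage location.
load data local inpath '/home/hadoop/Desktop/data' into table t1;

-- Without LOCAL: the path refers to HDFS ('/tmp/data' is a hypothetical path),
-- and the file is moved, not copied, into the table's storage location.
-- OVERWRITE replaces whatever the table already contains;
-- without it, the new file is added alongside the existing ones.
load data inpath '/tmp/data' overwrite into table t1;
```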


Then load the file:

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1;

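Don't forget the file name /data at the end of the path. The first time, the author left it out and loaded the entire Desktop directory, and the query returned nothing but NULLs and garbage.
View the table contents: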

select * from t1;
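
The row format clauses in the table definition decide how each line of the file is split into columns. As a sketch, assume the data file contains the hypothetical line shown in the comment below (the article does not show the real file). Individual array elements and map values can then be queried like this:

```sql
-- Hypothetical input line (an assumption, not taken from the real data file):
--   1,xiaoming,book-TV-code,beijing:chaoyang-shanghai:pudong
-- ','  separates the four columns
-- '-'  separates array elements and map entries
-- ':'  separates a map key from its value
-- so hobby is parsed as ["book","TV","code"]
-- and add as {"beijing":"chaoyang","shanghai":"pudong"}.
select
    id,
    name,
    hobby[0]         as first_hobby,   -- array indexes start at 0
    size(hobby)      as hobby_count,   -- number of elements in the array
    `add`['beijing'] as beijing_addr   -- map lookup by key; add is a keyword, so it is quoted
from t1;
```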


Create an external table t2

create external table t2(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<String,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2'
;

Load data (t2)

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t2;


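Check the file locations
In the NameNode web UI at NameNode:50070/explorer.html#/user/ we can see the t2 directory.

Where is t1? Under the default warehouse path configured earlier.

The location of both tables can also be obtained from the command line: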

desc formatted table_name;
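
Applied to the two tables of this experiment, the check might look like the sketch below. It assumes both tables live in the default database, that hive.metastore.warehouse.dir is still /user/hive/warehouse, and that the statements are run from the Hive CLI, which accepts dfs commands.

```sql
-- In the output, look at the Location and Table Type fields.
desc formatted t1;
desc formatted t2;

-- List the data files directly from within the Hive session.
-- The first path assumes the default warehouse directory.
dfs -ls /user/hive/warehouse/t1;
dfs -ls /user/t2;
```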

Note: in the output, a Table Type of MANAGED_TABLE means a managed (internal) table, and EXTERNAL_TABLE means an external table.
Drop the managed table and the external table

Next, drop both tables and compare what happens.
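
The article does not show the statements used for this step. A minimal sketch would be the following, again assuming the default warehouse directory and a Hive CLI session:

```sql
-- Drop both tables; afterwards neither appears in the metastore.
drop table t1;
drop table t2;
show tables;

-- Compare what is left on HDFS.
-- The t1 directory should be gone from the warehouse (assumed default path):
dfs -ls /user/hive/warehouse;
-- The files of the external table should still be there:
dfs -ls /user/t2;
```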
Inspect the files on HDFS

t1 no longer exists.

t2, however, is still there.
So dropping an external table removes only its metadata.
Recreate the external table t2

create external table t2(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<String,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2'
;

Without loading any data into it, run select * and look at the result.
The data is still there!
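
One consequence is that getting rid of an external table's data takes two steps. The sketch below assumes a Hive CLI session and that nothing else still needs the files under /user/t2.

```sql
-- Step 1: remove the table definition from the metastore.
drop table t2;

-- Step 2: delete the files themselves, since Hive leaves them in place.
-- Check the path carefully before running this; the deletion is recursive.
dfs -rm -r /user/t2;
```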
Official documentation

The official documentation describes external tables as follows:

A table created without the EXTERNAL clause is called a managed table because Hive manages its data.
Managed and External Tables
By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type.
Statistics can be managed on internal and external tables and partitions for query optimization.

Hive official documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-DescribeTable/View/Column