
Hive: CREATE TABLE Statement Explained


Recently I was writing a script that creates Hive partitions on a daily schedule, which required creating Hive tables.

At first I assumed the clause order of a Hive CREATE TABLE statement was fairly flexible, but testing showed otherwise:

Hive requires the clauses of a CREATE TABLE statement in a fairly fixed order.

I have not yet found out which document describes this order; if you know, please leave a comment below (red envelopes for good answers).

Below is a walkthrough of the format rules for Hive CREATE TABLE statements.

Create Table

Official documentation

There are three ways to create a table in Hive:

  • direct CREATE TABLE
  • CREATE TABLE ... AS SELECT (query-based)
  • CREATE TABLE ... LIKE

First, the official grammar.
'[]' marks an optional part; '|' marks a choice between alternatives.

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]]
  [
   [ROW FORMAT row_format]
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];

data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)

primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type
  : ARRAY < data_type >

map_type
  : MAP < primitive_type, data_type >

struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type
  : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE

As you can see, there are three ways to create a table; we will walk through each in turn.

1. Direct CREATE TABLE:

create table table_name(col_name data_type);

A more complex example.

The key point is to write the clauses in the order defined by the grammar above.

CREATE EXTERNAL TABLE IF NOT EXISTS `dmp_clearlog` (
  `date_log` string COMMENT 'date in file', 
  `hour` int COMMENT 'hour', 
  `device_id` string COMMENT '(android) md5 imei / (ios) origin  mac', 
  `imei_orgin` string COMMENT 'origin value of imei', 
  `mac_orgin` string COMMENT 'origin value of mac', 
  `mac_md5` string COMMENT 'mac after md5 encrypt', 
  `android_id` string COMMENT 'androidid', 
  `os` string  COMMENT 'operating system', 
  `ip` string COMMENT 'remote real ip', 
  `app` string COMMENT 'appname' )
COMMENT 'cleared log of origin log'
PARTITIONED BY (
  `date` date COMMENT 'date used by partition'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
TBLPROPERTIES ('creator'='szh', 'create_time'='2018-06-07')
;
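The original motivation was a daily partition-creation script. For a partitioned external table such as dmp_clearlog above, each day's partition can be registered explicitly; a hedged sketch (the partition path layout here is an assumption, not part of the original table definition):

```sql
-- Register one day's partition on the table defined above.
-- The LOCATION clause points the partition at its own directory;
-- the path layout shown here is illustrative.
ALTER TABLE dmp_clearlog ADD IF NOT EXISTS
PARTITION (`date` = '2018-06-07')
LOCATION '/data/dmp_clearlog/2018-06-07';
```

A daily script then only needs to substitute the current date into the partition value and the path.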

Here we explain the parts that differ from a relational database.

row format

row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

Hive maps files on HDFS onto a table structure and distinguishes columns by delimiters (such as ',', ';', or '^'); row format specifies the serialization and deserialization rules.
For example, for the following records:

1,xiaoming,book-TV-code,beijing:chaoyang-shagnhai:pudong
2,lilei,book-code,nanjing:jiangning-taiwan:taibei
3,lihua,music-book,heilongjiang:haerbin

The comma separates the columns (FIELDS TERMINATED BY char), which correspond to id, name, hobby (an array, split by COLLECTION ITEMS TERMINATED BY char) and address (a key-value map, split by MAP KEYS TERMINATED BY char); LINES TERMINATED BY char separates records, and defaults to the newline character.

file format (the storage format of the files on HDFS)

The default is TEXTFILE, i.e. plain text, which can be opened and read directly.
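If the default does not fit, the storage format is chosen with STORED AS at creation time. A minimal sketch (the table name t1_orc is illustrative):

```sql
-- Same columns as t1, but stored as ORC instead of the default TEXTFILE.
-- Note: ORC files are binary, so they cannot be filled from a plain-text
-- file with LOAD DATA; such tables are usually populated via INSERT ... SELECT.
create table t1_orc(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
stored as orc;
```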

Based on the file contents above, create a table t1 as follows:

create table t1(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
;

Then check the table's description:

desc t1;

Next, insert data.
Note: plain INSERT (as opposed to INSERT OVERWRITE) is rarely used, because even a single-row insert launches a MapReduce job; here we use LOAD DATA instead.

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
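For a partitioned table, the optional PARTITION clause names the target partition. A hedged sketch against the dmp_clearlog table from earlier (the file path is illustrative):

```sql
-- Load one day's file directly into the matching partition.
LOAD DATA LOCAL INPATH '/home/hadoop/Desktop/clearlog-2018-06-07.log'
OVERWRITE INTO TABLE dmp_clearlog
PARTITION (`date` = '2018-06-07');
```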

Create a file, paste in the records above, and load it:

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t1;

Don't forget the file name (/data): the first time I forgot it, uploaded the whole Desktop directory, and every query came back NULL and mojibake.
Check the table contents:

select * from t1;


external

A table without the EXTERNAL modifier is a managed (internal) table; with EXTERNAL it is an external table.
The differences:
A managed table's data is managed by Hive itself; an external table's data is managed on HDFS, outside Hive.
A managed table stores its data under hive.metastore.warehouse.dir (default: /user/hive/warehouse); an external table's storage location is specified by the user.
Dropping a managed table deletes both the metadata and the stored data; dropping an external table deletes only the metadata, and the files on HDFS are left in place.
Changes to a managed table are synchronized to the metadata directly, while after changing the structure or partitioning of an external table the metadata must be repaired (MSCK REPAIR TABLE table_name;).
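For example, if a process outside Hive writes new partition directories straight to HDFS, the metastore can be brought back in sync (shown here with the partitioned dmp_clearlog table from earlier):

```sql
-- Scan the table's location on HDFS and register any partition
-- directories that are missing from the metastore.
MSCK REPAIR TABLE dmp_clearlog;
```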

Create an external table t2:

create external table t2(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2'
;


Load the data:
load data local inpath '/home/hadoop/Desktop/data' overwrite into table t2;


Checking the file locations

As shown below, under NameNode:50070/explorer.html#/user/ we can see the t2 directory.

Where is t1? It is in the default path we configured earlier.

We can also get both locations from the command line:

desc formatted table_name;

Note: in the output, MANAGED_TABLE denotes a managed (internal) table and EXTERNAL_TABLE denotes an external table.

Dropping the managed and external tables

Now drop the internal table and the external table and compare the results.

Looking at the files on HDFS:

t1 is gone,

but t2 still exists.
So dropping an external table removes only the metadata.

Recreate the external table t2:
create external table t2(
    id      int
   ,name    string
   ,hobby   array<string>
   ,add     map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
location '/user/t2'
;


Without inserting any data, run select * and look at the result:
the data is still there!

Official documentation:
A table created without the EXTERNAL clause is called a managed table because Hive manages its data. 
Managed and External Tables
By default Hive creates managed tables, where files, metadata and statistics are managed by internal Hive processes. A managed table is stored under the hive.metastore.warehouse.dir path property, by default in a folder path similar to /apps/hive/warehouse/databasename.db/tablename/. The default location can be overridden by the location property during table creation. If a managed table or partition is dropped, the data and metadata associated with that table or partition are deleted. If the PURGE option is not specified, the data is moved to a trash folder for a defined duration.
Use managed tables when Hive should manage the lifecycle of the table, or when generating temporary tables.
An external table describes the metadata / schema on external files. External table files can be accessed and managed by processes outside of Hive. External tables can access data stored in sources such as Azure Storage Volumes (ASV) or remote HDFS locations. If the structure or partitioning of an external table is changed, an MSCK REPAIR TABLE table_name statement can be used to refresh metadata information.
Use external tables when files are already present or in remote locations, and the files should remain even if the table is dropped.
Managed or external tables can be identified using the DESCRIBE FORMATTED table_name command, which will display either MANAGED_TABLE or EXTERNAL_TABLE depending on table type.
Statistics can be managed on internal and external tables and partitions for query optimization. 

2. CREATE TABLE ... AS SELECT (query-based)

The table is created from an AS SELECT query: the subquery's result is stored in the new table, so it comes with data.
Typically used for intermediate tables.

CREATE TABLE new_key_value_store
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile
   AS
SELECT (key % 1024) new_key, concat(key, value) key_value_pair
FROM key_value_store
SORT BY new_key, key_value_pair;

Following that example, let's create a table t3:

create table t3 as
select
    id
   ,name
from t2
;

This runs a MapReduce job.
Checking the table's structure and contents shows it has data; and since neither EXTERNAL nor LOCATION was specified, the table lives in the default location, i.e. it is a managed table.

3. CREATE TABLE ... LIKE

This creates a table with exactly the same structure, but no data.
Commonly used for intermediate tables.

CREATE TABLE empty_key_value_store
LIKE key_value_store;

Example:

create table t4 like t2;

Notice that no MapReduce job runs; the table structure is identical to t2's, but the table contains no data.