1. 程式人生 > >HIVE表數據的導入與導出(load data&insert overwrite)

HIVE表數據的導入與導出(load data&insert overwrite)

als map 基本類型 ext emp part splay starting style

1. 準備測試數據

首先創建普通表:

create table test(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,‘ STORED AS TEXTFILE;

創建分區表:

CREATE EXTERNAL TABLE test_p(
id int,
name string 
)
partitioned by (date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY \,LINES TERMINATED BY \nSTORED AS TEXTFILE;

準備數據文件:

[/tmp]# cat test.txt 
1,a
2,b
3,c
4,d

2.加載數據

語法如下:

LOAD DATA [LOCAL] INPATH ‘filepath‘ [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

說明:

  • filepath 可能是:
  • 一個相對路徑
  • 一個絕對路徑,例如:/root/project/data1
  • 一個url地址,可選的可以帶上授權信息,例如:hdfs://namenode:9000/user/hive/project/data1
  • 目標可能是一個表或者分區,如果該表是分區,則必須制定分區列。
  • filepath 可以是一個文件也可以是目錄
  • 如果指定了 LOCAL,則:
  • load 命令會在本地查找 filepath。如果 filepath 是相對路徑,則相對於當前路徑,也可以指定一個 url 或者本地文件,例如:file:///user/hive/project/data1
  • 如果沒有指定 LOCAL ,則hive會使用全路徑的url,url 中如果沒有制定 schema,則默認使用 fs.default.name的值;如果該路徑不是絕對路徑,則會相對於 /user/<username>
  • 如果使用 OVERWRITE ,則會刪除原來的數據,然後導入新的數據,否則,就是追加數據。

需要註意的:

  • filepath 中不能包括子目錄
  • 如果沒有指定 LOCAL,則 filepath 指向目標表或者分區所在的文件系統。
  • 如果需要壓縮,則參考 CompressedStorage

2.1 測試

2.1.1 加載本地文件

a) 加載到普通表中

hive> load data local inpath ‘/tmp/test.txt‘ into table test;                              
Copying data from file:/tmp/test.txt
Copying file: file:/tmp/test.txt
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 0.572 seconds

查看hdfs上的數據:

$ hadoop fs -ls /user/hive/warehouse/test
Found 1 items
-rwxrwxrwt   3 hive hadoop         16 2014-06-09 18:36 /user/hive/warehouse/test/test.txt

查看表中數據:

hive> select * from test;
OK
1	a
2	b
3	c
4	d
Time taken: 0.562 seconds, Fetched: 4 row(s)

b) 加載文件到分區表

通常是直接使用 load 命令加載:

LOAD DATA LOCAL INPATH "/tmp/test.txt" INTO TABLE test_p PARTITION (date=20140722)

註意:如果沒有加上 overwrite 關鍵字,則加載相同文件最後會存在多個文件

還有一種方法是:創建分區目錄,手動上傳文件,最後再添加新的分區,代碼如下:

hadoop fs -mkdir  /user/hive/warehouse/test/date=20140320
ALTER TABLE test_p ADD IF NOT EXISTS PARTITION (date=20140320);

hive hadoop fs -rm /user/hive/warehouse/test/date=20140320/test.txt
hadoop fs -put /tmp/test.txt  /user/hive/warehouse/test/date=20140320

同樣,你也可以查看 hdfs 和表中的數據。

2.1.2 加載hdfs上的文件

拷貝 test.txt 為test_1.txt 並將其上傳到 /user/hive/warehouse:

$ cp test.txt test_1.txt
$ sudo -u hive hadoop fs -put test_1.txt /user/hive/warehouse

然後將 /user/hive/warehouse/test_1.txt 導入到test表中:

hive> load data inpath ‘/user/hive/warehouse/test_1.txt‘ into table test; 
Loading data to table default.test
Table default.test stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 16, raw_data_size: 0]
OK
Time taken: 2.941 seconds

查看hdfs上的數據:

$ hadoop fs -ls /user/hive/warehouse/test
Found 2 items
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:48 /user/hive/warehouse/test/test.txt
-rwxr-xr-x   3 hive hadoop         16 2014-06-09 18:45 /user/hive/warehouse/test/test_1.txt

查看表中數據:

hive> select * from test;                                                 
OK
1	a
2	b
3	c
4	d
1	a
2	b
3	c
4	d
Time taken: 0.302 seconds, Fetched: 8 row(s)

3. 插入數據

標準語法:

INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;

INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

擴展語法(多個insert):

FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;

FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

擴展語法(動態分區insert):

INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

說明:

  • INSERT OVERWRITE 會覆蓋存在的數據
  • 輸出的格式和序列化類取決於表的元數據
  • hive 0.13.0之後,select語句可以使用 CTEs 表達式,語法請參考 SELECT syntax,示例見 Common Table Expression

Dynamic Partition Inserts

dynamic partition inserts在hive 0.6.0中引入。相關的配置參數有:

hive.exec.dynamic.partition
hive.exec.dynamic.partition.mode
hive.exec.max.dynamic.partitions.pernode
hive.exec.max.dynamic.partitions
hive.exec.max.created.files
hive.error.on.empty.partition

一個示例:

FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt=‘2008-06-08‘, country)
       SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.cnt

4. 導出數據

標準語法:

INSERT OVERWRITE [LOCAL] DIRECTORY directory1
  [ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
  SELECT ... FROM ...

擴展語法(多個insert):

FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...

row_format相關語法:

DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char](Note: Only available starting with Hive 0.13)

說明:

  • Directory 可以是一個全路徑的 url。
  • 如果指定 LOCAL,則會將數據寫到本地文件系統。
  • 輸出的數據序列化為 text 格式,分隔符為 ^A,行於行之間通過換行符連接。如果存在不是基本類型的列,則這些列將被序列化為 JSON 格式。
  • 在 Hive 0.11.0 可以輸出字段的分隔符,之前版本的默認為 ^A

4.1 測試;

4.1.1 導出到本地文件系統

hive> insert overwrite local directory ‘/tmp/test‘ select * from test;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there‘s no reduce operator
Starting Job = job_1402248601715_0016, Tracking URL = http://cdh1:8088/proxy/application_1402248601715_0016/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1402248601715_0016
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-06-09 19:25:12,896 Stage-1 map = 0%,  reduce = 0%
2014-06-09 19:25:20,380 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
2014-06-09 19:25:21,433 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 0.99 sec
MapReduce Total cumulative CPU time: 990 msec
Ended Job = job_1402248601715_0016
Copying data to local directory /tmp/test
Copying data to local directory /tmp/test
MapReduce Jobs Launched: 
Job 0: Map: 1   Cumulative CPU: 0.99 sec   HDFS Read: 305 HDFS Write: 32 SUCCESS
Total MapReduce CPU Time Spent: 990 msec
OK
Time taken: 18.438 seconds

導出後的數據預覽如下:

[/tmp]# vim test/000000_0 
1^Aa
2^Ab
3^Ac
4^Ad
1^Aa
2^Ab
3^Ac
4^Ad

可以看到數據中的列與列之間的分隔符是^A(ascii碼是\00001),如果想修改分隔符,可以做如下修改:

hive> insert overwrite local directory ‘/tmp/test‘ row format delimited fields terminated by ‘,‘ select * from test;

再來查看數據:

vim test/000000_3 
1,a
2,b
3,c
4,d
1,a
2,b
3,c
4,d

4.1.2 導出到 HDFS 中

hive> insert overwrite  directory ‘/user/hive/tmp‘ select * from test;

註意:

和導出文件到本地文件系統的HQL少一個local,數據的存放路徑不一樣了。

4.1.3 導出到Hive的另一個表中

在實際情況中,表的輸出結果可能太多,不適於顯示在控制臺上,這時候,將Hive的查詢輸出結果直接存在一個新的表中是非常方便的,我們稱這種情況為CTAS( create table .. as select)如下:

hive> create table test2 as select * from test;

HIVE表數據的導入與導出(load data&insert overwrite)