Notes and a detailed guide on exporting data with INSERT ... DIRECTORY
阿新 • Published: 2019-01-05
A reader asked me about a very simple query export: after exporting data with INSERT ... DIRECTORY, the file content appeared garbled whether it was viewed on HDFS or locally.
insert overwrite directory '/user/finance/hive/warehouse/fdm_sor.db/t_tmp/' select * from t_tmp;

# Viewing the exported data remotely:
[robot~]$ hadoop fs -cat /user/finance/hive/warehouse/fdm_sor.db/t_tmp/000000_0.deflate
x1ӕߠ~H~_ᚩ¹ªªūęoRJm컑©ҋi쵤)̽²Y¼ ...
Problem analysis: clearly, the exported file is in .deflate format. Suffixes such as .deflate, .gz, .zip, .bz2 and .lzo all indicate compressed files; a file stored without compression normally has no such suffix. So the cause here is that Hive has output compression enabled and compresses every file it writes. Check with the commands below: compression is indeed enabled, and the codec in use is the deflate implementation.
SET hive.exec.compress.output;   # is output compression enabled?
SET mapreduce.output.fileoutputformat.compress.codec;   # which codec is used
hive (zala.a)> SET hive.exec.compress.output;
hive.exec.compress.output=true
hive (zala.a)> SET mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.DefaultCodec
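As an aside, DefaultCodec writes zlib-format data, so a .deflate file copied off HDFS can be inspected locally without Hadoop at all (`hadoop fs -text` also decompresses files written with the known codecs automatically). A minimal sketch, assuming the export has been fetched to the local disk; the file path is hypothetical:

```python
import zlib

def read_deflate(path):
    """Decompress a Hive .deflate export (DefaultCodec = zlib format) into text."""
    with open(path, "rb") as f:
        return zlib.decompress(f.read()).decode("utf-8")

# Example (hypothetical local copy of the exported file):
# print(read_deflate("000000_0.deflate"))
```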
Common compression formats in Hive and their codec classes are listed below; for details on actually using compression, see a later post.
Compression format | Codec class
DEFLATE | org.apache.hadoop.io.compress.DefaultCodec
gzip | org.apache.hadoop.io.compress.GzipCodec
bzip2 | org.apache.hadoop.io.compress.BZip2Codec
Snappy | org.apache.hadoop.io.compress.SnappyCodec
So if you want a readable export, compression has to be turned off first. Hive's own default for this setting is actually off; the following command disables it for the current session so a single export is written uncompressed:
SET hive.exec.compress.output=false;
insert overwrite directory '/user/finance/hive/warehouse/fdm_sor.db/t_tmp/'
select * from t_tmp;
Check the result: the exported file is now plain text:
[robot 3333]$ hadoop fs -cat /user/finance/hive/warehouse/fdm_sor.db/t_tmp/000000_0
1111fdfsdfrerfwef\N\N
234343dfdsfdsaaaa\N\N
33333dfsdfdsabnhh\N\N
4444fdsfsdfaaaaaa\N\N
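The export above uses Hive's text defaults: fields are separated by the invisible \x01 (Ctrl-A) character, and NULL is written as the literal \N, which is why each row above ends in \N\N. Reading such a file back in a downstream script just means splitting on \x01 and mapping \N to None. A small sketch (the sample line mirrors the output above):

```python
HIVE_NULL = r"\N"    # Hive's default textual NULL marker
FIELD_SEP = "\x01"   # Hive's default field delimiter (Ctrl-A)

def parse_hive_row(line):
    """Split one line of a default-format Hive text export into Python values."""
    return [None if f == HIVE_NULL else f
            for f in line.rstrip("\n").split(FIELD_SEP)]

print(parse_hive_row("1111\x01fdfsdfrerfwef\x01\\N\x01\\N"))
# -> ['1111', 'fdfsdfrerfwef', None, None]
```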
2. INSERT ... DIRECTORY query export in detail
Note that an INSERT query export lets you customize the storage format of the exported files, the row/column delimiters, and other storage properties.
Standard syntax for an INSERT query export:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1
[ROW FORMAT row_format] [STORED AS file_format] (Note: Only available starting with Hive 0.11.0)
SELECT ... FROM ...
Multi-insert export syntax (several exports from one query):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
Additional row_format options:
: DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)
Usage demos:
1. Export a table stored as RCFile to a local directory, with '@' as the field delimiter and textfile as the storage format.
SET hive.exec.compress.output=false;
insert overwrite local directory '/home/finance/mytest/3333'
row format delimited fields terminated by '@'
stored as textfile
select * from t_tmp_rc;
The exported result:
[robot 3333]$ cat 000000_0
…@…@456
…@…@456
…@…@456
…@…@456
…@…@456
…@…@456
2. Demo: export data from one table to multiple destinations with a single scan.
SET hive.exec.compress.output=false;
from t_tmp_rc --good for exporting several extracts of one wide table: the table is scanned only once, which is efficient.
insert overwrite local directory '/home/finance/mytest/3333'
row format delimited fields terminated by '@' --field delimiter '@' here
stored as textfile
select a, b+'123', c
insert overwrite local directory '/home/finance/mytest/2222'
row format delimited fields terminated by '*' --field delimiter '*' here
stored as textfile
select a, b+'123', '標註', c;
The results:
[robot mytest]$ cat 2222/000000_0
…@…@標註@456
…@…@標註@456
…@…@標註@456
…@…@標註@456
…@…@標註@456
…@…@標註@456
[robot mytest]$ cat 3333/000000_0
…@…@456
…@…@456
…@…@456
…@…@456
…@…@456
…@…@456
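The single-scan, multi-output pattern above can be mimicked in plain Python to see why it is cheap: the source is iterated once, while each "insert" merely formats the same row differently. Everything below (the row values, output file names) is made up for illustration only:

```python
# Hypothetical rows standing in for t_tmp_rc (columns a, b, c)
rows = [("1", "123", "456"), ("2", "123", "456")]

# One pass over the data, two differently formatted outputs --
# the same idea as FROM t ... INSERT ... INSERT ... in Hive.
with open("out_3333.txt", "w") as f1, open("out_2222.txt", "w") as f2:
    for a, b, c in rows:
        f1.write("@".join((a, b, c)) + "\n")           # '@'-delimited export
        f2.write("*".join((a, b, "標註", c)) + "\n")   # '*'-delimited, with an extra literal column
```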