HAWQ Technical Analysis (9): External Data

阿新 • Published: 2019-01-24

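HAWQ can read and write tables in its own system, and it can also access data in external systems such as HDFS, Hive, and HBase. This is done through an extension framework called PXF. Most external data is accessed as HAWQ external tables, but for Hive, in addition to external tables, PXF can integrate with HCatalog to query Hive tables directly. PXF ships with several built-in connectors, and users can write their own connectors against the PXF API to access other parallel data stores or processing engines.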

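I. Installing and Configuring PXF
If you use Ambari to install and manage the HAWQ cluster, no manual command-line installation steps are needed; all required PXF plug-ins can be installed from the Ambari web interface. For detailed installation steps, see "Easily Replacing a Traditional Data Warehouse with HAWQ (2): Installation and Deployment". To install PXF from the command line, see "Installing PXF from the Command Line". Table 1 lists the default PXF installation directories and files.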

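Directory	Description
/usr/lib/pxf	PXF library directory.
/etc/pxf/conf	PXF configuration directory. Contains pxf-public.classpath, pxf-private.classpath, and other configuration files.
/var/pxf/pxf-service	Directory of the PXF service instance.
/var/log/pxf	Contains pxf-service.log and all Tomcat-related log files. (PXF requires Tomcat to run on the host; installing PXF with Ambari installs Tomcat automatically.) These files are owned by pxf:pxf and are read-only for other users.
/var/run/pxf/catalina.pid	PID file of the PXF Tomcat container, which stores the process ID.

Table 1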

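As with installation, PXF can be configured interactively through the Ambari UI; restart the PXF service afterwards for the changes to take effect. For manual configuration steps, see "Configuring PXF". Note that manual configuration requires editing the relevant configuration files on every host in the cluster and then restarting the PXF service on all nodes.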

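II. PXF Profiles

A PXF profile is a collection of common metadata attributes that simplifies reading and writing external data. PXF ships with several built-in profiles. Each profile groups a set of metadata attributes together, which makes it easier to access the following data storage systems:

HDFS file data (read and write)
Hive (read-only)
HBase (read-only)
JSON (read-only)

Table 2 describes the built-in PXF profiles and their associated Java classes. These profiles are defined in the /etc/pxf/conf/pxf-profiles.xml file.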

Profile	Description	Associated Java classes
HdfsTextSimple	Reads and writes flat text files on HDFS in which each record is a single line with a fixed delimiter.	org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter org.apache.hawq.pxf.plugins.hdfs.LineBreakAccessor org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
HdfsTextMulti	Reads delimited records from flat files on HDFS in which each record spans one or more lines (records contain newlines). This profile is not splittable (non-parallel) and reads more slowly than HdfsTextSimple.	org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter org.apache.hawq.pxf.plugins.hdfs.QuotedLineBreakAccessor org.apache.hawq.pxf.plugins.hdfs.StringPassResolver
Hive	Reads Hive tables stored in text, RC, ORC, Sequence, or Parquet format.	org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter org.apache.hawq.pxf.plugins.hive.HiveAccessor org.apache.hawq.pxf.plugins.hive.HiveResolver org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher org.apache.hawq.pxf.service.io.GPDBWritable
HiveRC	Optimized reading of Hive tables stored as RCFile; the DELIMITER parameter is required.	org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter org.apache.hawq.pxf.plugins.hive.HiveRCFileAccessor org.apache.hawq.pxf.plugins.hive.HiveColumnarSerdeResolver org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher org.apache.hawq.pxf.service.io.Text
HiveORC	Optimized reading of Hive tables stored as ORCFile.	org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter org.apache.hawq.pxf.plugins.hive.HiveORCAccessor org.apache.hawq.pxf.plugins.hive.HiveORCSerdeResolver org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher org.apache.hawq.pxf.service.io.GPDBWritable
HiveText	Optimized reading of Hive tables stored as TextFile; the DELIMITER parameter is required.	org.apache.hawq.pxf.plugins.hive.HiveInputFormatFragmenter org.apache.hawq.pxf.plugins.hive.HiveLineBreakAccessor org.apache.hawq.pxf.plugins.hive.HiveStringPassResolver org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher org.apache.hawq.pxf.service.io.Text
HBase	Reads from the HBase data store engine.	org.apache.hawq.pxf.plugins.hbase.HBaseDataFragmenter org.apache.hawq.pxf.plugins.hbase.HBaseAccessor org.apache.hawq.pxf.plugins.hbase.HBaseResolver
Avro	Reads Avro files.	org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter org.apache.hawq.pxf.plugins.hdfs.AvroFileAccessor org.apache.hawq.pxf.plugins.hdfs.AvroResolver
JSON	Reads JSON files on HDFS.	org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter org.apache.hawq.pxf.plugins.json.JsonAccessor org.apache.hawq.pxf.plugins.json.JsonResolver

Table 2
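
To make the idea of a profile more concrete, the following Python sketch models a profile as a named bundle of plug-in classes, using three rows of Table 2. The `Profile` type and the `lookup` helper are hypothetical illustrations and not part of PXF; the assignment of each class to a fragmenter, accessor, or resolver role is inferred from the class names.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    """Hypothetical model of one profile: a name plus its plug-in classes."""
    name: str
    fragmenter: str
    accessor: str
    resolver: str


_HDFS = "org.apache.hawq.pxf.plugins.hdfs."

# Three rows copied from Table 2; the roles are inferred from the class names.
PROFILES = {
    p.name.lower(): p
    for p in (
        Profile("HdfsTextSimple", _HDFS + "HdfsDataFragmenter",
                _HDFS + "LineBreakAccessor", _HDFS + "StringPassResolver"),
        Profile("HdfsTextMulti", _HDFS + "HdfsDataFragmenter",
                _HDFS + "QuotedLineBreakAccessor", _HDFS + "StringPassResolver"),
        Profile("Avro", _HDFS + "HdfsDataFragmenter",
                _HDFS + "AvroFileAccessor", _HDFS + "AvroResolver"),
    )
}


def lookup(name: str) -> Profile:
    """Resolve a profile name case-insensitively (the examples in this
    article pass lowercase names such as profile=hdfstextsimple)."""
    return PROFILES[name.lower()]


print(lookup("hdfstextsimple").accessor)
```

The point of the sketch is only that naming a profile stands in for naming each plug-in class individually.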

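II. Accessing HDFS Files
HDFS is the primary distributed storage mechanism for Hadoop applications. The PXF HDFS plug-in reads data stored in HDFS files and supports two file formats: delimited text and Avro. Before using PXF to access HDFS files, make sure the PXF HDFS plug-in is installed on all cluster nodes (Ambari installs it automatically) and that the HAWQ user (typically gpadmin) has been granted the appropriate read and write permissions on the HDFS files.

1. HDFS File Formats Supported by PXF
The PXF HDFS plug-in supports reading the following two file formats:

Comma-separated value (.csv) or other delimited flat text files.
Avro files, a schema-based format whose schema is defined in JSON.

The PXF HDFS plug-in includes the following profiles to support these two kinds of files:

HdfsTextSimple - single-line text files
HdfsTextMulti - multi-line text files with embedded newlines
Avro - Avro files

2. Querying External HDFS Data
HAWQ accesses HDFS files through external tables. The syntax for creating an HDFS external table is as follows.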

CREATE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
    ?PROFILE=HdfsTextSimple|HdfsTextMulti|Avro[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

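Table 3 describes the keywords and values used in the CREATE EXTERNAL TABLE statement.

Keyword	Value
<host>[:<port>]	HDFS NameNode host name and port.
<path-to-hdfs-file>	Path to the HDFS file.
PROFILE	Must be one of HdfsTextSimple, HdfsTextMulti, or Avro.
<custom-option>	Custom options for the specific PROFILE.
FORMAT 'TEXT'	Use when <path-to-hdfs-file> refers to a delimited flat file with one record per line.
FORMAT 'CSV'	Use when <path-to-hdfs-file> refers to a comma-separated value (CSV) flat file with single-line or multi-line records.
FORMAT 'CUSTOM'	Use for Avro files. The Avro 'CUSTOM' format supports only the built-in (formatter='pxfwritable_import') format property.
<formatting-properties>	Format properties for the specific PROFILE.

Table 3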
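
As a rough illustration of how the pieces of this syntax fit together, the Python sketch below assembles the LOCATION string and the full statement as text. The helper names `pxf_location` and `create_external_table` are hypothetical, and the sketch performs no quoting or escaping, so it is only suitable for trusted, simple values.

```python
def pxf_location(host, path, profile, port=None, **options):
    """Assemble a pxf:// LOCATION string following the syntax above."""
    authority = f"{host}:{port}" if port is not None else host
    query = "&".join([f"PROFILE={profile}"] +
                     [f"{key}={value}" for key, value in options.items()])
    return f"pxf://{authority}/{path.lstrip('/')}?{query}"


def create_external_table(table, columns, location, fmt, properties=""):
    """Render a CREATE EXTERNAL TABLE statement as text.

    columns is a list of (name, type) pairs; properties is the raw text
    placed inside the parentheses after FORMAT, if any.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = (f"CREATE EXTERNAL TABLE {table} ({cols})\n"
           f"LOCATION ('{location}')\n"
           f"FORMAT '{fmt}'")
    if properties:
        ddl += f" ({properties})"
    return ddl + ";"


# Host hdp1 and port 51200 are the values used in this article's examples.
location = pxf_location("hdp1", "/data/pxf_examples/pxf_hdfs_simple.txt",
                        "HdfsTextSimple", port=51200)
print(create_external_table(
    "pxf_hdfs_textsimple",
    [("location", "text"), ("month", "text"),
     ("num_orders", "int"), ("total_sales", "float8")],
    location, "TEXT", "delimiter=E','"))
```

Leaving out `port` yields the `pxf://<name>/...` form, which is also the shape used for an HA nameservice.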

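The following are several examples of HAWQ accessing HDFS files.
(1) Using the HdfsTextSimple Profile
The HdfsTextSimple profile reads flat text or CSV files in which each line is one record. The supported <formatting-properties> is delimiter, which specifies the field delimiter of each record in the file.
Create an HDFS directory for PXF.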

su - hdfs
hdfs dfs -mkdir -p /data/pxf_examples
hdfs dfs -chown -R gpadmin:gpadmin /data/pxf_examples

Create a flat text file named pxf_hdfs_simple.txt containing four records, using a comma as the field delimiter.

echo 'Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67' > /tmp/pxf_hdfs_simple.txt

Upload the file to HDFS.

hdfs dfs -put /tmp/pxf_hdfs_simple.txt /data/pxf_examples/

Display the contents of pxf_hdfs_simple.txt on HDFS.

hdfs dfs -cat /data/pxf_examples/pxf_hdfs_simple.txt

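Use the HdfsTextSimple profile to create a HAWQ external table that queries data from pxf_hdfs_simple.txt. The e prefix in delimiter=e',' marks a PostgreSQL-style escape string constant, so backslash sequences inside the literal (for example e'\t' for a tab) are interpreted. Separately, in TEXT format a comma that appears inside a field value must be escaped with a backslash (\) in the data file.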

su - gpadmin 
psql -d db1
db1=# create external table pxf_hdfs_textsimple(location text, month text, num_orders int, total_sales float8)
db1-#             location ('pxf://hdp1:51200/data/pxf_examples/pxf_hdfs_simple.txt?profile=hdfstextsimple')
db1-#           format 'text' (delimiter=e',');
CREATE EXTERNAL TABLE
db1=# select * from pxf_hdfs_textsimple;
 location  | month | num_orders | total_sales 
-----------+-------+------------+-------------
 Prague    | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(4 rows)

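Create a second external table using the CSV format. When the format is 'CSV', the comma is the default delimiter, so the delimiter option is no longer needed.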

db1=# create external table pxf_hdfs_textsimple_csv(location text, month text, num_orders int, total_sales float8)
db1-#             location ('pxf://hdp1:51200/data/pxf_examples/pxf_hdfs_simple.txt?profile=hdfstextsimple')
db1-#           format 'csv';
CREATE EXTERNAL TABLE
db1=# select * from pxf_hdfs_textsimple_csv; 
 location  | month | num_orders | total_sales 
-----------+-------+------------+-------------
 Prague    | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(4 rows)

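(2) Using the HdfsTextMulti Profile
The HdfsTextMulti profile reads flat text files in which a record can contain newlines. Because PXF treats the newline as the record separator, data with embedded newlines needs the special handling that HdfsTextMulti provides. The <formatting-properties> supported by the HdfsTextMulti profile is delimiter, which specifies the field delimiter of each record in the file.
Create a flat text file.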

vi /tmp/pxf_hdfs_multi.txt

Enter the following records, using a colon as the field delimiter. The first field contains a newline.

"4627 Star Rd.
San Francisco, CA  94107":Sept:2017
"113 Moon St.
San Diego, CA  92093":Jan:2018
"51 Belt Ct.
Denver, CO  90123":Dec:2016
"93114 Radial Rd.
Chicago, IL  60605":Jul:2017
"7301 Brookview Ave.
Columbus, OH  43213":Dec:2018

Upload the file to HDFS.

su - hdfs
hdfs dfs -put /tmp/pxf_hdfs_multi.txt /data/pxf_examples/

Use the HdfsTextMulti profile to create an external table that queries data from pxf_hdfs_multi.txt, specifying a colon as the delimiter.

db1=# create external table pxf_hdfs_textmulti(address text, month text, year int)
db1-#             location ('pxf://hdp1:51200/data/pxf_examples/pxf_hdfs_multi.txt?profile=hdfstextmulti')
db1-#           format 'csv' (delimiter=e':');
CREATE EXTERNAL TABLE
db1=# select * from pxf_hdfs_textmulti;
         address          | month | year 
--------------------------+-------+------
 4627 Star Rd.            | Sept  | 2017
 San Francisco, CA  94107           
 113 Moon St.             | Jan   | 2018
 San Diego, CA  92093               
 51 Belt Ct.              | Dec   | 2016
 Denver, CO  90123                  
 93114 Radial Rd.         | Jul   | 2017
 Chicago, IL  60605                 
 7301 Brookview Ave.      | Dec   | 2018
 Columbus, OH  43213                
(5 rows)
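
The reason quoted newlines need special handling can be reproduced outside HAWQ. The sketch below uses Python's standard csv module as a stand-in for a quote-aware reader (it is not how PXF's accessor is implemented) and compares it with naive line splitting on two of the records above.

```python
import csv
import io

# Two of the records from pxf_hdfs_multi.txt above.
DATA = ('"4627 Star Rd.\nSan Francisco, CA  94107":Sept:2017\n'
        '"113 Moon St.\nSan Diego, CA  92093":Jan:2018\n')

# Naive line splitting cuts every record in two at the embedded newline.
naive_lines = DATA.splitlines()

# A quote-aware reader keeps the embedded newline inside the first field.
records = list(csv.reader(io.StringIO(DATA, newline=""), delimiter=":"))

print(len(naive_lines), "lines,", len(records), "records")
print(records[0])
```

A reader that splits a file at arbitrary line boundaries cannot tell whether a newline is inside quotes, which is consistent with Table 2 describing HdfsTextMulti as not splittable.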

(3) Avro Profile
See "Avro Profile".

(4) Accessing Files in an HDFS HA Cluster
To access external data in an HDFS HA cluster, change <host>[:<port>] in the LOCATION clause of CREATE EXTERNAL TABLE to <HA-nameservice>.

gpadmin=# create external table pxf_hdfs_textmulti_ha (address text, month text, year int)
            location ('pxf://mycluster/data/pxf_examples/pxf_hdfs_multi.txt?profile=hdfstextmulti')
          format 'csv' (delimiter=e':');
gpadmin=# select * from pxf_hdfs_textmulti_ha;

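The query result is shown in Figure 1.

Figure 1

III. Accessing Hive Data
Hive is a distributed data warehouse framework for Hadoop that supports several file formats, such as CSV, RC, ORC, and Parquet. The PXF Hive plug-in reads data stored in Hive tables. PXF offers two ways to query Hive tables:

Direct queries through the integration of PXF with HCatalog.
Queries through external tables.

Before using PXF to access Hive, make sure the following prerequisites are met:

The PXF HDFS plug-in is installed on all nodes of the HAWQ and HDFS clusters (master, segment, NameNode, DataNode).
The PXF Hive plug-in is installed on all nodes of the HAWQ and HDFS clusters.
If Hadoop HA is configured, PXF is also installed on all HDFS nodes that run the NameNode service.
The Hive client is installed on all PXF nodes.
The Hive JAR file directory and conf directory are installed on all cluster nodes.
PXF access to HDFS has been tested.
The Hive Metastore service is running on one host in the cluster.
The hive.metastore.uris property is set in the hive-site.xml file on the NameNode.

That looks like a long list, but if you use Ambari to install and manage the HAWQ cluster and have installed the Hadoop-related services, all of these prerequisites are already configured automatically and no manual configuration is needed.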

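2. Hive File Formats Supported by PXF
Table 4 lists the Hive file formats supported by the PXF Hive plug-in and the profiles used to access them.

File format	Description	Profile
TextFile	Flat file format delimited by commas, tabs, or spaces, or JSON notation.	Hive, HiveText
SequenceFile	Flat file consisting of binary key/value pairs.	Hive
RCFile	Record columnar data consisting of key/value pairs, with a high row compression rate.	Hive, HiveRC
ORCFile	Optimized columnar storage that reduces data size.	Hive
Parquet	Compressed columnar storage.	Hive
Avro	Schema-based serialization format defined in JSON.	Hive

Table 4

3. Data Type Mapping
To represent Hive data in HAWQ, data values that use Hive primitive data types must be mapped to values of equivalent HAWQ types. Table 5 summarizes the mapping rules for Hive primitive data types.

Hive data type	HAWQ data type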
boolean	bool
int	int4
smallint	int2
tinyint	int2
bigint	int8
float	float4
double	float8
string	text
binary	bytea
timestamp	timestamp

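Table 5
In addition to primitive types, Hive supports complex data types such as array, struct, and map. Because HAWQ does not natively support these types, PXF maps all of them to the text type. You can create HAWQ functions or use application code to extract the sub-elements of complex data types.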
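
The mapping rules above are simple enough to express as a lookup. The following Python sketch encodes Table 5 plus the rule for complex types; the function name `hawq_type` is a hypothetical helper, and Hive types that Table 5 does not list are deliberately rejected rather than guessed.

```python
# Primitive mappings copied from Table 5.
HIVE_TO_HAWQ = {
    "boolean": "bool", "int": "int4", "smallint": "int2", "tinyint": "int2",
    "bigint": "int8", "float": "float4", "double": "float8",
    "string": "text", "binary": "bytea", "timestamp": "timestamp",
}

# Complex Hive types, all of which are mapped to text.
COMPLEX_PREFIXES = ("array", "struct", "map")


def hawq_type(hive_type: str) -> str:
    """Return the HAWQ type for a Hive type name.

    Raises KeyError for Hive types that Table 5 does not cover.
    """
    normalized = hive_type.strip().lower()
    if normalized.startswith(COMPLEX_PREFIXES):
        return "text"
    return HIVE_TO_HAWQ[normalized]


for t in ("int", "double", "array<int>", "map<string, string>"):
    print(t, "->", hawq_type(t))
```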
The following are some examples of HAWQ accessing Hive tables.

4. Preparing Sample Data
(1) Prepare a data file and add the following records, using commas to separate fields.

vi /tmp/pxf_hive_datafile.txt
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
San Francisco,Sept,156,6846.34
Paris,Nov,159,7134.56
San Francisco,Jan,113,5397.89
Prague,Dec,333,9894.77
Bangalore,Jul,271,8320.55
Beijing,Dec,100,4248.41

(2) Create a text-format Hive table named sales_info.

create database test;
use test;
create table sales_info (location string, month string,
        number_of_orders int, total_sales double)
        row format delimited fields terminated by ','
        stored as textfile;

(3) Load data into the sales_info table.

load data local inpath '/tmp/pxf_hive_datafile.txt' into table sales_info;

(4) Query the sales_info table to verify that the data was loaded successfully.

select * from sales_info;

(5) Determine the location of the sales_info table on HDFS. This information is needed when creating the HAWQ external table.

describe extended sales_info;
...
location:hdfs://mycluster/apps/hive/warehouse/test.db/sales_info
...

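5. Querying Hive with PXF and HCatalog
HAWQ can retrieve the metadata stored in HCatalog and access Hive tables directly through HCatalog, without having to know the underlying file storage format of the Hive table. HCatalog is built on top of the Hive metastore and incorporates Hive's DDL. This approach has the following benefits:

You do not need to know the Hive table schema.
You do not need to enter the Hive table's location and format information by hand.
If the table's metadata changes, HCatalog automatically provides the updated metadata. This is not possible with static PXF external tables.

Figure 2 shows how HAWQ uses HCatalog to query Hive tables.

Figure 2

HAWQ uses PXF to query the table's metadata from HCatalog.
HAWQ creates an in-memory catalog table from the retrieved metadata. If a query references the same table several times, the in-memory catalog table reduces the number of calls to the external HCatalog.
PXF queries the Hive table using the metadata in the in-memory catalog table. When the query finishes, the in-memory catalog table is dropped.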
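
The effect of the in-memory catalog table described in these steps can be pictured as a per-query cache. The Python sketch below is a toy model only: `InMemoryCatalog` and `fake_hcatalog` are hypothetical names, and nothing here reflects how HAWQ actually implements the mechanism.

```python
class InMemoryCatalog:
    """Toy per-query cache of table metadata (hypothetical, not HAWQ code)."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable standing in for the HCatalog lookup
        self._tables = {}
        self.external_calls = 0

    def get(self, table):
        """Return metadata for a table, calling out only on first reference."""
        if table not in self._tables:
            self.external_calls += 1
            self._tables[table] = self._fetch(table)
        return self._tables[table]

    def drop(self):
        """Discard all cached metadata once the query has finished."""
        self._tables.clear()


def fake_hcatalog(table):
    # Stand-in metadata; the columns match the sales_info example.
    return [("location", "string"), ("month", "string"),
            ("number_of_orders", "int"), ("total_sales", "double")]


catalog = InMemoryCatalog(fake_hcatalog)
catalog.get("test.sales_info")
catalog.get("test.sales_info")   # second reference is served from memory
print(catalog.external_calls)
catalog.drop()                   # end of query
```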

If you use Ambari to install and manage HAWQ and the Hive service is already running, Hive tables can be queried without any additional configuration.

db1=# select * from hcatalog.test.sales_info;
   location    | month | number_of_orders | total_sales 
---------------+-------+------------------+-------------
 Prague        | Jan   |              101 |     4875.33
 Rome          | Mar   |               87 |     1557.39
 Bangalore     | May   |              317 |     8936.99
 Beijing       | Jul   |              411 |    11600.67
 San Francisco | Sept  |              156 |     6846.34
 Paris         | Nov   |              159 |     7134.56
 San Francisco | Jan   |              113 |     5397.89
 Prague        | Dec   |              333 |     9894.77
 Bangalore     | Jul   |              271 |     8320.55
 Beijing       | Dec   |              100 |     4248.41
(10 rows)

Get the columns of a Hive table and their data type mapping.

db1=# \d+ hcatalog.test.sales_info;
    PXF Hive Table "test.sales_info"
      Column      |  Type  | Source type 
------------------+--------+-------------
 location         | text   | string
 month            | text   | string
 number_of_orders | int4   | int
 total_sales      | float8 | double

Wildcards can be used to get information about all Hive databases and tables.

\d+ hcatalog.test.*;
\d+ hcatalog.*.*;

You can also use the pxf_get_item_fields function to get the description of a Hive table. This function currently supports only the Hive profile.

db1=# select * from pxf_get_item_fields('hive','test.sales_info');
 path |  itemname  |    fieldname     | fieldtype | sourcefieldtype 
------+------------+------------------+-----------+-----------------
 test | sales_info | location         | text      | string
 test | sales_info | month            | text      | string
 test | sales_info | number_of_orders | int4      | int
 test | sales_info | total_sales      | float8    | double
(4 rows)

The pxf_get_item_fields function also supports wildcards.

select * from pxf_get_item_fields('hive','test.*');
select * from pxf_get_item_fields('hive','*.*');

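6. Querying Hive External Tables
Using an external table requires identifying the appropriate profile. The PXF Hive plug-in supports three Hive-related profiles: Hive, HiveText, and HiveRC. HiveText and HiveRC are specifically optimized for the TEXT and RC file formats respectively, while the Hive profile works with all Hive file storage types that PXF supports. When the underlying Hive table consists of multiple partitions that use different file formats, use the Hive profile.
The following syntax creates a HAWQ external table over a Hive table: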

CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<hive-db-name>.<hive-table-name>
    ?PROFILE=Hive|HiveText|HiveRC[&DELIMITER=<delim>]')
FORMAT 'CUSTOM|TEXT' (formatter='pxfwritable_import' | delimiter='<delim>')

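Table 6 describes the keywords and values that the Hive plug-in uses in the CREATE EXTERNAL TABLE statement.

Keyword	Value
<host>[:<port>]	HDFS NameNode host name and port number.
<hive-db-name>	Hive database name. If omitted, it defaults to the Hive database named default.
<hive-table-name>	Hive table name.
PROFILE	Must be one of Hive, HiveText, or HiveRC.
DELIMITER	The field delimiter; must be a single ASCII character or its hexadecimal representation.
FORMAT (Hive profile)	Must be CUSTOM; only the built-in pxfwritable_import format property is supported.
FORMAT (HiveText and HiveRC profiles)	Must be TEXT, with the field delimiter specified again.

Table 6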
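
The rules in Table 6 about which FORMAT clause goes with which profile can be summarized in a few lines of Python. This is a sketch under simplifying assumptions: the helper name is hypothetical, only single-character delimiters are handled (not the hexadecimal form), and no escaping is applied.

```python
def hive_external_table_clauses(host, port, db_table, profile, delimiter=None):
    """Return (LOCATION string, FORMAT clause) for a Hive external table."""
    key = profile.lower()
    location = f"pxf://{host}:{port}/{db_table}?PROFILE={profile}"
    if key == "hive":
        # The Hive profile always uses the CUSTOM format.
        return location, "FORMAT 'CUSTOM' (formatter='pxfwritable_import')"
    if key in ("hivetext", "hiverc"):
        if delimiter is None or len(delimiter) != 1:
            raise ValueError(f"{profile} needs a single-character delimiter")
        # The delimiter is given twice: once in LOCATION, once in FORMAT.
        return (f"{location}&DELIMITER={delimiter}",
                f"FORMAT 'TEXT' (delimiter=E'{delimiter}')")
    raise ValueError(f"unsupported profile: {profile}")


loc, fmt = hive_external_table_clauses("hdp1", 51200, "test.sales_info",
                                       "HiveText", ",")
print(loc)
print(fmt)
```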
(1) Hive Profile
The Hive profile works with any Hive file storage format that PXF supports. In effect, it selects the optimal Hive* profile for the underlying file storage type.

db1=# create external table salesinfo_hiveprofile(location text, month text, num_orders int, total_sales float8)
db1-#             location ('pxf://hdp1:51200/test.sales_info?profile=hive')
db1-#           format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# 
db1=# select * from salesinfo_hiveprofile;
   location    | month | num_orders | total_sales 
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 Beijing       | Jul   |        411 |    11600.67
 San Francisco | Sept  |        156 |     6846.34
 Paris         | Nov   |        159 |     7134.56
 San Francisco | Jan   |        113 |     5397.89
 Prague        | Dec   |        333 |     9894.77
 Bangalore     | Jul   |        271 |     8320.55
 Beijing       | Dec   |        100 |     4248.41
(10 rows)

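Note the difference between the query plans of the external table and the HCatalog query, shown in Figure 3.

Figure 3

The external table query used all 24 virtual segments, whereas the HCatalog query used only 1, so the external table clearly made more effective use of resources.

(2) HiveText Profile
When using the HiveText profile, the delimiter option must be specified in both the LOCATION and FORMAT clauses.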

db1=# create external table salesinfo_hivetextprofile(location text, month text, num_orders int, total_sales float8)
db1-# location ('pxf://hdp1:51200/test.sales_info?profile=hivetext&delimiter=,')
db1-# format 'text' (delimiter=e',');
CREATE EXTERNAL TABLE
db1=# select * from salesinfo_hivetextprofile where location='Beijing';
 location | month | num_orders | total_sales 
----------+-------+------------+-------------
 Beijing  | Jul   |        411 |    11600.67
 Beijing  | Dec   |        100 |     4248.41
(2 rows)

(3) HiveRC Profile
Create a Hive table in RCFile format and insert data.

create table sales_info_rcfile (location string, month string,
        number_of_orders int, total_sales double)
      row format delimited fields terminated by ','
      stored as rcfile;
	  
insert into table sales_info_rcfile select * from sales_info;

Query the Hive table.

db1=# create external table salesinfo_hivercprofile(location text, month text, num_orders int, total_sales float8)
db1-#              location ('pxf://hdp1:51200/test.sales_info_rcfile?profile=hiverc&delimiter=,')
db1-#            format 'text' (delimiter=e',');
CREATE EXTERNAL TABLE
db1=# 
db1=# select location, total_sales from salesinfo_hivercprofile;
   location    | total_sales 
---------------+-------------
 Prague        |     4875.33
 Rome          |     1557.39
 Bangalore     |     8936.99
 Beijing       |    11600.67
 San Francisco |     6846.34
 Paris         |     7134.56
 San Francisco |     5397.89
 Prague        |     9894.77
 Bangalore     |     8320.55
 Beijing       |     4248.41
(10 rows)

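(4) Accessing Hive Tables in Parquet Format
The PXF Hive profile supports the Parquet storage format for both partitioned and non-partitioned tables. Create a Hive table in Parquet format and insert data.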

create table sales_info_parquet (location string, month string,
        number_of_orders int, total_sales double)
        stored as parquet;

insert into sales_info_parquet select * from sales_info;

Query the Hive table.

db1=# create external table salesinfo_parquet (location text, month text, num_orders int, total_sales float8)
db1-#     location ('pxf://hdp1:51200/test.sales_info_parquet?profile=hive')
db1-#     format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# 
db1=# select * from salesinfo_parquet;
   location    | month | num_orders | total_sales 
---------------+-------+------------+-------------
 Prague        | Jan   |        101 |     4875.33
 Rome          | Mar   |         87 |     1557.39
 Bangalore     | May   |        317 |     8936.99
 Beijing       | Jul   |        411 |    11600.67
 San Francisco | Sept  |        156 |     6846.34
 Paris         | Nov   |        159 |     7134.56
 San Francisco | Jan   |        113 |     5397.89
 Prague        | Dec   |        333 |     9894.77
 Bangalore     | Jul   |        271 |     8320.55
 Beijing       | Dec   |        100 |     4248.41
(10 rows)

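7. Complex Data Types
(1) Prepare a data file and add the following records, using commas to separate fields. The third field is an array and the fourth field is a map.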

vi /tmp/pxf_hive_complex.txt
3,Prague,1%2%3,zone:euro%status:up
89,Rome,4%5%6,zone:euro
400,Bangalore,7%8%9,zone:apac%status:pending
183,Beijing,0%1%2,zone:apac
94,Sacramento,3%4%5,zone:noam%status:down
101,Paris,6%7%8,zone:euro%status:up
56,Frankfurt,9%0%1,zone:euro
202,Jakarta,2%3%4,zone:apac%status:up
313,Sydney,5%6%7,zone:apac%status:pending
76,Atlanta,8%9%0,zone:noam%status:down

(2) Create the Hive table.

create table table_complextypes( index int, name string, intarray array<int>, propmap map<string, string>)
         row format delimited fields terminated by ','
         collection items terminated by '%'
         map keys terminated by ':'
         stored as textfile;

(3) Load data into the Hive table.

load data local inpath '/tmp/pxf_hive_complex.txt' into table table_complextypes;

(4) Query the Hive table to verify that the data was imported correctly.

select * from table_complextypes;

(5) Create a Hive external table and query the data.

db1=# create external table complextypes_hiveprofile(index int, name text, intarray text, propmap text)
db1-#              location ('pxf://hdp1:51200/test.table_complextypes?profile=hive')
db1-#            format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# select * from complextypes_hiveprofile;
 index |    name    | intarray |              propmap               
-------+------------+----------+------------------------------------
     3 | Prague     | [1,2,3]  | {"zone":"euro","status":"up"}
    89 | Rome       | [4,5,6]  | {"zone":"euro"}
   400 | Bangalore  | [7,8,9]  | {"zone":"apac","status":"pending"}
   183 | Beijing    | [0,1,2]  | {"zone":"apac"}
    94 | Sacramento | [3,4,5]  | {"zone":"noam","status":"down"}
   101 | Paris      | [6,7,8]  | {"zone":"euro","status":"up"}
    56 | Frankfurt  | [9,0,1]  | {"zone":"euro"}
   202 | Jakarta    | [2,3,4]  | {"zone":"apac","status":"up"}
   313 | Sydney     | [5,6,7]  | {"zone":"apac","status":"pending"}
    76 | Atlanta    | [8,9,0]  | {"zone":"noam","status":"down"}
(10 rows)

As shown, the complex data types are simply converted to the HAWQ TEXT type.
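
On the application side, the text values in the output above can be turned back into structured data. The Python sketch below relies on the observation that the intarray and propmap values shown happen to be valid JSON; whether that holds for every complex type and value is an assumption that has not been verified here, and `parse_complex` is a hypothetical helper name.

```python
import json


def parse_complex(text_value):
    """Parse the text rendering of a Hive array or map column.

    Assumes the text is valid JSON, as it is for the rows shown above.
    """
    return json.loads(text_value)


# The first row of the query output above.
row = {"index": 3, "name": "Prague",
       "intarray": "[1,2,3]",
       "propmap": '{"zone":"euro","status":"up"}'}

intarray = parse_complex(row["intarray"])
propmap = parse_complex(row["propmap"])

print(intarray[0])            # first array element
print(propmap.get("status"))  # map lookup by key
```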

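8. Accessing Hive Partitioned Tables
The PXF Hive plug-in supports Hive's partitioning feature and directory structure, and it provides partition filter pushdown, which takes advantage of Hive partition elimination to reduce network traffic and I/O load. PXF partition filter pushdown is conceptually similar to MySQL's Index Condition Pushdown (ICP): both push filter conditions down to a lower storage layer to improve performance.
To benefit from PXF partition filter pushdown, the WHERE clause of a query should filter on partition columns. Conditions on non-partition columns are not pushed down and are evaluated on the HAWQ side, which affects query performance. The PXF Hive plug-in performs filter pushdown only on partition keys.
Partition filter pushdown is enabled by default: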

db1=# show pxf_enable_filter_pushdown;
 pxf_enable_filter_pushdown 
----------------------------
 on
(1 row)
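
Which conditions can be pushed down depends only on whether the column is a partition key. The Python sketch below illustrates that split for ANDed equality conditions, using the partition columns of the sales_part example that follows; `split_predicates` is a hypothetical helper and not PXF code.

```python
# Partition columns of the sales_part table used in the next example.
PARTITION_COLUMNS = {"delivery_state", "delivery_city"}


def split_predicates(predicates, partition_columns=PARTITION_COLUMNS):
    """Split ANDed (column, value) equality conditions into those that can
    be pushed down (partition columns) and those left to the HAWQ side."""
    pushed, local = [], []
    for column, value in predicates:
        target = pushed if column in partition_columns else local
        target.append((column, value))
    return pushed, local


pushed, local = split_predicates(
    [("delivery_city", "Sacramento"), ("item_name", "cube")])
print("pushed down:", pushed)
print("filtered in HAWQ:", local)
```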

(1) Using the Hive Profile to Access Homogeneous Partitioned Data
Create a Hive table and load data.

create table sales_part (name string, type string, supplier_key int, price double)
        partitioned by (delivery_state string, delivery_city string)
        row format delimited fields terminated by ',';

insert into table sales_part partition(delivery_state = 'CALIFORNIA', delivery_city = 'Fresno') 
values ('block', 'widget', 33, 15.17);
insert into table sales_part partition(delivery_state = 'CALIFORNIA', delivery_city = 'Sacramento') 
values ('cube', 'widget', 11, 1.17);
insert into table sales_part partition(delivery_state = 'NEVADA', delivery_city = 'Reno') 
values ('dowel', 'widget', 51, 31.82);
insert into table sales_part partition(delivery_state = 'NEVADA', delivery_city = 'Las Vegas') 
values ('px49', 'pipe', 52, 99.82);

Query the sales_part table.

select * from sales_part;

Examine the directory structure of the sales_part table on HDFS.

sudo -u hdfs hdfs dfs -ls -R /apps/hive/warehouse/test.db/sales_part

Create a PXF external table and query the data.

db1=# create external table pxf_sales_part(
db1(#              item_name text, item_type text, 
db1(#              supplier_key integer, item_price double precision, 
db1(#              delivery_state text, delivery_city text)
db1-#            location ('pxf://hdp1:51200/test.sales_part?profile=hive')
db1-#            format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# select * from pxf_sales_part;
 item_name | item_type | supplier_key | item_price | delivery_state | delivery_city 
-----------+-----------+--------------+------------+----------------+---------------
 block     | widget    |           33 |      15.17 | CALIFORNIA     | Fresno
 dowel     | widget    |           51 |      31.82 | NEVADA         | Reno
 cube      | widget    |           11 |       1.17 | CALIFORNIA     | Sacramento
 px49      | pipe      |           52 |      99.82 | NEVADA         | Las Vegas
(4 rows)

Run a query whose filter is not fully pushed down.

db1=# select * from pxf_sales_part where delivery_city = 'Sacramento' and item_name = 'cube';
 item_name | item_type | supplier_key | item_price | delivery_state | delivery_city 
-----------+-----------+--------------+------------+----------------+---------------
 cube      | widget    |           11 |       1.17 | CALIFORNIA     | Sacramento
(1 row)

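This query uses Hive to filter on the delivery_city='Sacramento' partition, but the filter on item_name is not pushed down to Hive because item_name is not a partition column. After all data in the Sacramento partition has been transferred to HAWQ, the item_name filter is applied on the HAWQ side.
Run a query whose filter is pushed down.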

db1=# select * from pxf_sales_part where delivery_state = 'CALIFORNIA';
 item_name | item_type | supplier_key | item_price | delivery_state | delivery_city 
-----------+-----------+--------------+------------+----------------+---------------
 cube      | widget    |           11 |       1.17 | CALIFORNIA     | Sacramento
 block     | widget    |           33 |      15.17 | CALIFORNIA     | Fresno
(2 rows)

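(2) Using the Hive Profile to Access Heterogeneous Partitioned Data
Different partitions of one Hive table may use different storage formats. The PXF Hive profile supports this case as well.
Create the Hive table.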

$ HADOOP_USER_NAME=hdfs hive
create external table hive_multiformpart( location string, month string, number_of_orders int, total_sales double)
        partitioned by( year string )
        row format delimited fields terminated by ',';

Note the HDFS locations of the sales_info and sales_info_rcfile tables.

describe extended sales_info;
describe extended sales_info_rcfile;

In my environment, the directories of the two tables are:

location:hdfs://mycluster/apps/hive/warehouse/test.db/sales_info
location:hdfs://mycluster/apps/hive/warehouse/test.db/sales_info_rcfile

Add two partitions to the hive_multiformpart table, pointing to the locations of sales_info and sales_info_rcfile respectively.

alter table hive_multiformpart add partition (year = '2013') location 'hdfs://mycluster/apps/hive/warehouse/test.db/sales_info';
alter table hive_multiformpart add partition (year = '2016') location 'hdfs://mycluster/apps/hive/warehouse/test.db/sales_info_rcfile';

Explicitly set the file format of the partition that corresponds to the sales_info_rcfile table.

alter table hive_multiformpart partition (year='2016') set fileformat rcfile;

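Inspecting the storage formats of the two partitions now shows that the partition for sales_info uses the default TEXTFILE format, while the partition for sales_info_rcfile uses the RCFILE format, as shown in Figure 4 and Figure 5 respectively.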

show partitions hive_multiformpart;
desc formatted hive_multiformpart partition(year=2013);  
desc formatted hive_multiformpart partition(year=2016);

Figure 4

Figure 5

Query the hive_multiformpart table through HCatalog.

db1=# select * from hcatalog.test.hive_multiformpart;
   location    | month | number_of_orders | total_sales | year 
---------------+-------+------------------+-------------+------
...
 Prague        | Dec   |              333 |     9894.77 | 2013
 Bangalore     | Jul   |              271 |     8320.55 | 2013
 Beijing       | Dec   |              100 |     4248.41 | 2013
 Prague        | Jan   |              101 |     4875.33 | 2016
 Rome          | Mar   |               87 |     1557.39 | 2016
 Bangalore     | May   |              317 |     8936.99 | 2016
 ...
(20 rows)

Query the hive_multiformpart table through an external table.

db1=# create external table pxf_multiformpart(location text, month text, num_orders int, total_sales float8, year text)
db1-#              location ('pxf://hdp1:51200/test.hive_multiformpart?profile=hive')
db1-#            format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# select * from pxf_multiformpart;
   location    | month | num_orders | total_sales | year 
---------------+-------+------------+-------------+------
... 
 Prague        | Dec   |        333 |     9894.77 | 2013
 Bangalore     | Jul   |        271 |     8320.55 | 2013
 Beijing       | Dec   |        100 |     4248.41 | 2013
 Prague        | Jan   |        101 |     4875.33 | 2016
 Rome          | Mar   |         87 |     1557.39 | 2016
 Bangalore     | May   |        317 |     8936.99 | 2016
...
(20 rows)

db1=# select sum(num_orders) from pxf_multiformpart where month='Dec' and year='2013';
 sum 
-----
 433
(1 row)

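IV. Accessing JSON Data
The PXF JSON plug-in reads JSON files stored on HDFS and supports N levels of nesting. To access JSON data from HAWQ, the JSON files must be stored on HDFS and an external table must be created over that HDFS data store. Before using PXF to access JSON files, make sure the following prerequisites are met:

The HDFS plug-in is installed on all cluster nodes (Ambari installs it automatically).
The JSON plug-in is installed on all cluster nodes (Ambari installs it automatically).
PXF access to HDFS has been tested.

1. Working with JSON Files in PXF
JSON is a text-based data interchange format, and its data is usually stored in a file with a .json suffix. A .json file contains a collection of objects. A JSON object is an unordered set of name/value pairs, where a value can be a string, a number, true, false, null, or an object or array. Objects and arrays can be nested. For example, the following is the content of a JSON data file: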

{
  "created_at":"MonSep3004:04:53+00002013",
  "id_str":"384529256681725952",
  "user": {
    "id":31424214,
     "location":"COLUMBUS"
  },
  "coordinates":null
}

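(1) JSON to HAWQ Data Type Mapping
To represent JSON data in HAWQ, JSON values that use primitive data types must be mapped to values of equivalent HAWQ data types. Table 7 summarizes the JSON data mapping rules.

JSON data type	HAWQ data type
integer, float, string, boolean	Use the corresponding HAWQ built-in data type (integer, real, double precision, char, varchar, text, boolean).
Array	Use [] to identify the index of a specific array member that has a primitive data type.
Object	Use dot (.) notation to specify each level of nesting down to a member that has a primitive data type.

Table 7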
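
The dot and bracket notation in Table 7 can be read as a path into the JSON record. The Python sketch below resolves such a column name against a parsed record; it illustrates the naming convention only, is not the plug-in's implementation, and `resolve` is a hypothetical helper. The sample record is abbreviated from the single-line example in this article.

```python
import json
import re

# A path step is either a member name or a bracketed array index.
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve(record, column):
    """Follow a column name such as 'coordinates.values[1]' into a parsed
    JSON record; return None when any step is missing or null."""
    current = record
    for key, index in _TOKEN.findall(column):
        if key:
            if not isinstance(current, dict):
                return None
            current = current.get(key)
        else:
            position = int(index)
            if not isinstance(current, list) or position >= len(current):
                return None
            current = current[position]
    return current


record = json.loads(
    '{"id_str":"343136551322136576",'
    '"user":{"id":395504494,"location":"NearCornwall"},'
    '"coordinates":{"type":"Point","values":[6,50]}}')

for column in ("id_str", "user.id", "coordinates.values[1]", "text"):
    print(column, "=", resolve(record, column))
```

A column with no matching member, or one whose parent is null, resolves to None in this sketch, which lines up with the empty cells in the query output shown later in this article.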
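(2) JSON File Read Modes
The PXF JSON plug-in reads data in one of two modes. The default mode expects one complete JSON record per line; reading JSON records that span multiple lines is also supported. An example of each read mode follows. The example schema has the following column names and data types:

"created_at" - text
"id_str" - text
"user" - object ("id" - integer, "location" - text)
"coordinates" - object ("type" - text, "values" - array (integer))

Example 1 - read mode with one JSON record per line: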

{"created_at":"FriJun0722:45:03+00002013","id_str":"343136551322136576","user":{"id":395504494,"location":"NearCornwall"},"coordinates":{"type":"Point","values": [ 6, 50 ]}},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547115253761","user":{"id":26643566,"location":"Austin,Texas"}, "coordinates": null},
{"created_at":"FriJun0722:45:02+00002013","id_str":"343136547136233472","user":{"id":287819058,"location":""}, "coordinates": null}

Example 2 - read mode with multi-line JSON records:

{
  "root":[
    {
      "record_obj":{
        "created_at":"MonSep3004:04:53+00002013",
        "id_str":"384529256681725952",
        "user":{
          "id":31424214,
          "location":"COLUMBUS"
        },
        "coordinates":null
      },
      "record_obj":{
        "created_at":"MonSep3004:04:54+00002013",
        "id_str":"384529260872228864",
        "user":{
          "id":67600981,
          "location":"KryberWorld"
        },
        "coordinates":{
          "type":"Point",
          "values":[
             8,
             52
          ]
        }
      }
    }
  ]
}

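The sections below query the sample data above through PXF JSON external tables.

3. Loading JSON Data into HDFS
The PXF JSON plug-in reads JSON files stored in HDFS, so the JSON files must be uploaded to HDFS before HAWQ can query them. Save the single-line and multi-line JSON records above to singleline.json and multiline.json respectively, make sure the files contain no blank lines, and then upload the files to HDFS.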

su - hdfs
hdfs dfs -mkdir /user/data
hdfs dfs -chown -R gpadmin:gpadmin /user/data
hdfs dfs -put singleline.json /user/data
hdfs dfs -put multiline.json /user/data

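Once the files are on HDFS, the JSON data can be queried through HAWQ.

4. Querying External JSON Data
Use the following syntax to create a HAWQ external table that represents JSON data.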

CREATE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ( 'pxf://<host>[:<port>]/<path-to-data>?PROFILE=Json[&IDENTIFIER=<value>]' )
      FORMAT 'CUSTOM' ( FORMATTER='pxfwritable_import' );

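Table 8 describes the keywords and values used in the CREATE EXTERNAL TABLE statement.

Keyword	Value
<host>[:<port>]	HDFS NameNode host name and port.
PROFILE	Must be set to Json.
IDENTIFIER	The LOCATION string includes the IDENTIFIER keyword and its value only when the JSON file uses the multi-line record format. <value> should name the member that identifies a returned JSON object; for Example 2 above, specify &IDENTIFIER=created_at.
FORMAT	The FORMAT clause must be set to CUSTOM.
FORMATTER	The JSON 'CUSTOM' format supports only the built-in 'pxfwritable_import' format property.

Table 8
Create a JSON external table over the single-line records.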

create external table sample_json_singleline_tbl(
  created_at text,
  id_str text,
  text text,
  "user.id" integer,
  "user.location" text,
  "coordinates.values[0]" integer,
  "coordinates.values[1]" integer
)
location('pxf://hdp1:51200/user/data/singleline.json?profile=json')
format 'custom' (formatter='pxfwritable_import');
select * from sample_json_singleline_tbl;

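The query result is shown in Figure 6.

Figure 6

Note that the nested data in the original JSON has been flattened. In the query result, . is used to access members of the nested user object (user.id and user.location), and [] is used to access elements of the coordinates.values array (coordinates.values[0] and coordinates.values[1]).
A JSON external table over multi-line records is similar to the single-line case, except that identifier must be specified to name the key that identifies a record.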

db1=# create external table sample_json_multiline_tbl(
db1(#   created_at text,
db1(#   id_str text,
db1(#   text text,
db1(#   "user.id" integer,
db1(#   "user.location" text,
db1(#   "coordinates.values[0]" integer,
db1(#   "coordinates.values[1]" integer
db1(# )
db1-# location('pxf://hdp1:51200/user/data/multiline.json?profile=json&identifier=created_at')
db1-# format 'custom' (formatter='pxfwritable_import');
CREATE EXTERNAL TABLE
db1=# select * from sample_json_multiline_tbl;
        created_at         |       id_str       | text | user.id  | user.location | coordinates.values[0] | coordinates.v
alues[1] 
---------------------------+--------------------+------+----------+---------------+-----------------------+--------------
---------
 MonSep3004:04:53+00002013 | 384529256681725952 |      | 31424214 | COLUMBUS      |                       |              
        
 MonSep3004:04:54+00002013 | 384529260872228864 |      | 67600981 | KryberWorld   |                     8 |              
      52
(2 rows)

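V. Writing Data to HDFS
PXF can write data only to HDFS files; external data in Hive, HBase, and JSON is read-only. Before using PXF to write data to HDFS files, make sure the PXF HDFS plug-in is installed on all cluster nodes (Ambari installs it automatically) and that the HAWQ user (typically gpadmin) has been granted the appropriate read and write permissions on the HDFS files.

1. Writing to PXF External Tables
The PXF HDFS plug-in supports two writable profiles: HdfsTextSimple and SequenceWritable. The syntax for creating a HAWQ writable external table is as follows: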

CREATE WRITABLE EXTERNAL TABLE <table_name> 
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<host>[:<port>]/<path-to-hdfs-file>
    ?PROFILE=HdfsTextSimple|SequenceWritable[&<custom-option>=<value>[...]]')
FORMAT '[TEXT|CSV|CUSTOM]' (<formatting-properties>);

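Table 9 describes the keywords and values used in the CREATE WRITABLE EXTERNAL TABLE statement.

Keyword	Value
<host>[:<port>]	HDFS NameNode host name and port.
<path-to-hdfs-file>	Path to the HDFS file.
PROFILE	Must be set to HdfsTextSimple or SequenceWritable.
<custom-option>	Custom options for the specific PROFILE.
FORMAT 'TEXT'	Use when <path-to-hdfs-file> refers to a delimited flat file with one record per line.
FORMAT 'CSV'	Use when <path-to-hdfs-file> refers to a comma-separated value (CSV) flat file with single-line or multi-line records.
FORMAT 'CUSTOM'	Used by the SequenceWritable profile. The SequenceWritable 'CUSTOM' format supports only the built-in formatter='pxfwritable_export' (write) and formatter='pxfwritable_import' (read) format properties.

Table 9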

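3. Custom Options
The HdfsTextSimple and SequenceWritable profiles support the custom options shown in Table 10:

Option	Description	Profile
COMPRESSION_CODEC	Java class name of the compression codec. If omitted, no compression is performed. Supported codecs include org.apache.hadoop.io.compress.DefaultCodec and org.apache.hadoop.io.compress.BZip2Codec.	HdfsTextSimple, SequenceWritable
COMPRESSION_CODEC	Additionally supported for this profile: org.apache.hadoop.io.compress.GzipCodec.	HdfsTextSimple
COMPRESSION_TYPE	Compression type to use. Supported values are RECORD (the default) and BLOCK.	HdfsTextSimple, SequenceWritable
DATA-SCHEMA	Name of the writer's serialization/deserialization class. The jar file containing the class must be on the PXF classpath. This option is used by the SequenceWritable profile and has no default value.	SequenceWritable
THREAD-SAFE	Boolean that determines whether table queries run in multi-threaded mode. The default is TRUE.	HdfsTextSimple, SequenceWritable

Table 10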
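
Custom options are appended to the LOCATION URI after PROFILE and joined with &. The sketch below combines two options from Table 10; the table name and HDFS path are hypothetical. Setting THREAD-SAFE=FALSE here is a conservative assumption for cases where the codec may not be safe to use from multiple threads, not a documented requirement.

```sql
-- Sketch only: table name and HDFS path are hypothetical.
-- Options after PROFILE are chained with '&'; codec class names are case-sensitive.
CREATE WRITABLE EXTERNAL TABLE pxf_hdfs_writabletbl_bz2
    (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://hdp1:51200/data/pxf_examples/pxfwritable_hdfs_textsimple_bz2?PROFILE=HdfsTextSimple&COMPRESSION_CODEC=org.apache.hadoop.io.compress.BZip2Codec&THREAD-SAFE=FALSE')
FORMAT 'TEXT' (delimiter=E',');
```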

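4. Writing Data with the HdfsTextSimple Profile
The HdfsTextSimple profile writes delimited flat files with one record per line (no embedded line breaks). When you create a writable table with the HdfsTextSimple profile, you can optionally choose record or block compression. The following compression codecs are supported.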

org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.GzipCodec
org.apache.hadoop.io.compress.BZip2Codec

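The HdfsTextSimple profile supports the formatting property 'delimiter', which specifies the field delimiter. The default is a comma (,).
(1) Create a writable external table that writes data to the HDFS directory /data/pxf_examples/pxfwritable_hdfs_textsimple1, using a comma as the field delimiter.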

create writable external table pxf_hdfs_writabletbl_1(location text, month text, num_orders int, total_sales float8)
            location ('pxf://hdp1:51200/data/pxf_examples/pxfwritable_hdfs_textsimple1?profile=hdfstextsimple')
          format 'text' (delimiter=e',');

(2) Insert data into the pxf_hdfs_writabletbl_1 table.

insert into pxf_hdfs_writabletbl_1 values ( 'Frankfurt', 'Mar', 777, 3956.98 );
insert into pxf_hdfs_writabletbl_1 values ( 'Cleveland', 'Oct', 3812, 96645.37 );
insert into pxf_hdfs_writabletbl_1 select * from pxf_hdfs_textsimple;

(3) View the contents of the HDFS files.

$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/*
Frankfurt,Mar,777,3956.98
Cleveland,Oct,3812,96645.37
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67
$ hdfs dfs -ls /data/pxf_examples/pxfwritable_hdfs_textsimple1
Found 3 items
-rw-r--r--   3 pxf gpadmin         26 2017-03-22 10:45 /data/pxf_examples/pxfwritable_hdfs_textsimple1/236002_0
-rw-r--r--   3 pxf gpadmin         28 2017-03-22 10:45 /data/pxf_examples/pxfwritable_hdfs_textsimple1/236003_0
-rw-r--r--   3 pxf gpadmin         94 2017-03-22 10:46 /data/pxf_examples/pxfwritable_hdfs_textsimple1/236004_15
$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/236002_0
Frankfurt,Mar,777,3956.98
$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/236003_0
Cleveland,Oct,3812,96645.37
$ hdfs dfs -cat /data/pxf_examples/pxfwritable_hdfs_textsimple1/236004_15
Prague,Jan,101,4875.33
Rome,Mar,87,1557.39
Bangalore,May,317,8936.99
Beijing,Jul,411,11600.67

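The output shows that six records were written across three files. Two files hold one record each, one for each single-row INSERT, and the third holds the four records written by the INSERT ... SELECT. Fields are separated by commas.
(4) Query the writable external table
HAWQ does not support queries against writable external tables. To query the data written through one, create a readable external table that points to the corresponding HDFS files.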

db1=# select * from pxf_hdfs_writabletbl_1;
ERROR:  External scan error: It is not possible to read from a WRITABLE external table. Create the table as READABLE instead. (CTranslatorDXLToPlStmt.cpp:1041)
db1=# create external table pxf_hdfs_textsimple_r1(location text, month text, num_orders int, total_sales float8)
db1-#             location ('pxf://hdp1:51200/data/pxf_examples/pxfwritable_hdfs_textsimple1?profile=hdfstextsimple')
db1-#             format 'csv';
CREATE EXTERNAL TABLE
db1=# select * from pxf_hdfs_textsimple_r1;
 location  | month | num_orders | total_sales 
-----------+-------+------------+-------------
 Cleveland | Oct   |       3812 |    96645.37
 Frankfurt | Mar   |        777 |     3956.98
 Prague    | Jan   |        101 |     4875.33
 Rome      | Mar   |         87 |     1557.39
 Bangalore | May   |        317 |     8936.99
 Beijing   | Jul   |        411 |    11600.67
(6 rows)

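(5) Create a writable external table that uses Gzip compression and a colon (:) as the field delimiter. Note that the codec class name is case-sensitive.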

create writable external table pxf_hdfs_writabletbl_2 (location text, month text, num_orders int, total_sales float8) 
location ('pxf://hdp1:51200/data/pxf_examples/pxfwritable_hdfs_textsimple2?profile=hdfstextsimple&compression_codec=org.apache.hadoop.io.compress.GzipCodec') 
format 'text' (delimiter=e':');

(6) Insert data

insert into pxf_hdfs_writabletbl_2 values ( 'Frankfurt', 'Mar', 777, 3956.98 );
insert into pxf_hdfs_writabletbl_2 values ( 'Cleveland', 'Oct', 3812, 96645.37 );

(7) Use the -text option to view the compressed data

$ hdfs dfs -text /data/pxf_examples/pxfwritable_hdfs_textsimple2/*
Frankfurt:Mar:777:3956.98
Cleveland:Oct:3812:96645.37

The output shows the two newly inserted records, with a colon as the field delimiter.
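
As in step (4), the compressed data cannot be queried through the writable table itself. A minimal sketch of a matching readable table follows; the table name pxf_hdfs_textsimple_r2 is hypothetical, and the sketch assumes that PXF decompresses the Gzip files transparently on read, which has not been verified here.

```sql
-- Sketch only: the table name is hypothetical, and transparent Gzip
-- decompression on read is an assumption.
-- The delimiter must match the one used by the writable table (a colon).
CREATE EXTERNAL TABLE pxf_hdfs_textsimple_r2
    (location text, month text, num_orders int, total_sales float8)
LOCATION ('pxf://hdp1:51200/data/pxf_examples/pxfwritable_hdfs_textsimple2?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (delimiter=E':');

SELECT * FROM pxf_hdfs_textsimple_r2;
```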

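VII. Dropping External Tables
Use the DROP EXTERNAL TABLE <table_name> statement to drop an external table. The statement does not delete the external data, because that data is not managed by HAWQ.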
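
As a short illustration, the statements below drop the writable and readable tables created in the HdfsTextSimple walkthrough. This is a sketch that assumes both tables still exist in the current database.

```sql
-- Removes only the table definitions from the HAWQ catalog.
DROP EXTERNAL TABLE pxf_hdfs_writabletbl_1;
DROP EXTERNAL TABLE pxf_hdfs_textsimple_r1;
```

The files under /data/pxf_examples/pxfwritable_hdfs_textsimple1 remain in HDFS afterwards and have to be removed separately with HDFS tools if they are no longer needed.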
