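Tutorial: A Complete Guide to Processing OSS Data File Formats with Data Lake Analytics

0. Preface

Data Lake Analytics is a serverless, cloud-based interactive query and analysis service. Using standard SQL statements, you can query and analyze data stored in OSS and TableStore directly, without moving it.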

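The product is now officially available on Alibaba Cloud. You are welcome to apply for a trial and experience a more convenient data analysis service.
To apply for service activation, see https://help.aliyun.com/document_detail/70386.html.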

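The previous tutorial showed how to analyze the TPC-H dataset in CSV format. Besides plain text files (for example CSV and TSV), data files in other formats stored on OSS can also be queried and analyzed with Data Lake Analytics. These include ORC, PARQUET, JSON, RCFILE, AVRO, and even geographic JSON data that follows the Esri specification, as well as files whose lines can be matched by a regular expression.

This article describes in detail how to analyze files with Data Lake Analytics (DLA for short below) according to the format in which they are stored on OSS. DLA has built-in implementations of various SerDes (short for Serialize/Deserialize, used for serialization and deserialization) for processing file data. You do not need to write any programs yourself; in most cases you can pick one or more of the SerDes in DLA to match the format of your data files on OSS. If your special file format is still not covered, please contact us and we will implement support for it as soon as possible.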

1. Storage formats and SerDe

You can create a table based on the data files stored on OSS and use STORED AS to specify the format of the data files.
For example:

CREATE EXTERNAL TABLE nation (
    N_NATIONKEY INT, 
    N_NAME STRING, 
    N_REGIONKEY INT, 
    N_COMMENT STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE 
LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';

After the table is created, you can use the SHOW CREATE TABLE statement to view the original CREATE TABLE statement.

mysql> show create table nation;
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CREATE EXTERNAL TABLE `nation`(
  `n_nationkey` int,
  `n_name` string,
  `n_regionkey` int,
  `n_comment` string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '|'
STORED AS `TEXTFILE`
LOCATION
  'oss://test-bucket-julian-1/tpch_100m/nation'|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (1.81 sec)

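The table below lists the file formats DLA currently supports. When creating a table for files in any of these formats, you can use STORED AS directly, and DLA selects the appropriate SERDE/INPUTFORMAT/OUTPUTFORMAT.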

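Storage format        Description
STORED AS TEXTFILE    The data files are stored as plain text. This is the default file type. Each line in the file corresponds to one record in the table.
STORED AS ORC         The data files are stored in ORC format.
STORED AS PARQUET     The data files are stored in PARQUET format.
STORED AS RCFILE      The data files are stored in RCFILE format.
STORED AS AVRO        The data files are stored in AVRO format.
STORED AS JSON        The data files are stored in JSON format (except Esri ArcGIS geographic JSON data files).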

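In addition to STORED AS, you can also specify a SerDe (used to parse the data files and map them to the DLA table), special column delimiters, and so on, depending on the characteristics of the files.
The following sections explain this further.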

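2. Examples

2.1 CSV files

A CSV file is essentially still a plain text file, so you can use STORED AS TEXTFILE.
Columns are separated by commas, which is expressed as ROW FORMAT DELIMITED FIELDS TERMINATED BY ','.

Plain CSV files

For example, the data file oss://bucket-for-testing/oss/text/cities/city.csv contains: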

Beijing,China,010
ShangHai,China,021
Tianjin,China,022

The CREATE TABLE statement can be:

CREATE EXTERNAL TABLE city (
    city STRING, 
    country STRING, 
    code INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE 
LOCATION 'oss://bucket-for-testing/oss/text/cities';

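Using OpenCSVSerde to handle quoted fields

Keep the following points in mind when using OpenCSVSerde:

  1. You can specify the field separator, the quote character, and the escape character for the fields of a row, for example: WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\");
  2. Line separators embedded inside a field are not supported;
  3. All fields must be defined as type STRING;
  4. Other data types can be handled by converting with functions in SQL.

For example: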
CREATE EXTERNAL TABLE test_csv_opencsvserde (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties(
"separatorChar"=",",
"quoteChar"="\"",
"escapeChar"="\\"
)
STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1';
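
To see what the three SerDe properties mean for a concrete line, here is a minimal local sketch in Python. It uses the standard csv module as a stand-in for OpenCSVSerde, so edge-case behavior may differ from DLA. The sample row and the helper name parse_line are made up for illustration.

```python
import csv
import io

def parse_line(line, separator=",", quote='"', escape="\\"):
    """Split one CSV line using the same three settings as the SerDe properties."""
    reader = csv.reader(
        io.StringIO(line),
        delimiter=separator,   # plays the role of separatorChar
        quotechar=quote,       # plays the role of quoteChar
        escapechar=escape,     # plays the role of escapeChar
    )
    return next(reader)

# Hypothetical row: the quoted field contains the separator itself.
row = parse_line('1,"Hangzhou, Zhejiang",2018-04-24,120.15')
print(row)

# Every field comes back as a string, so numeric use needs an explicit
# conversion, much like converting STRING columns with functions in SQL.
longitude = float(row[3])
```

The quoted field stays in one piece even though it contains a comma, which is the main reason to choose this SerDe over plain FIELDS TERMINATED BY ','.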

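Custom delimiters

If you need a custom column delimiter (FIELDS TERMINATED BY), escape character (ESCAPED BY), or line terminator (LINES TERMINATED BY),
specify them in the CREATE TABLE statement: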

ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'

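Skipping the header in CSV files

CSV files sometimes carry header information that must be ignored when the data is read. In that case, define skip.header.line.count in the CREATE TABLE statement.

For example, the data file oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl contains the following: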

N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT
0|ALGERIA|0| haggle. carefully final deposits detect slyly agai|
1|ARGENTINA|1|al foxes promise slyly according to the regular accounts. bold requests alon|
2|BRAZIL|1|y alongside of the pending deposits. carefully special packages are about the ironic forges. slyly special |
3|CANADA|1|eas hang ironic, silent packages. slyly regular packages are furiously over the tithes. fluffily bold|
4|EGYPT|4|y above the carefully unusual theodolites. final dugouts are quickly across the furiously regular d|
5|ETHIOPIA|0|ven packages wake quickly. regu|

The corresponding CREATE TABLE statement is:

CREATE EXTERNAL TABLE nation_header (
    N_NATIONKEY INT, 
    N_NAME STRING, 
    N_REGIONKEY INT, 
    N_COMMENT STRING
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' 
STORED AS TEXTFILE 
LOCATION 'oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'
TBLPROPERTIES ("skip.header.line.count"="1");

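The value x of skip.header.line.count and the actual number of lines n in the data file are related as follows:

  • When x <= 0, DLA filters nothing out when reading the file, that is, it reads everything;
  • When 0 < x < n, DLA skips the first x lines when reading the file;
  • When x >= n, DLA filters out the entire file content.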
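
The three cases can be emulated locally with a short Python sketch. It assumes that the middle case (0 < x < n) means the first x lines are dropped; the function name read_data_lines and the sample lines are illustrative and not part of DLA.

```python
def read_data_lines(lines, skip_count):
    """Return the lines that remain after applying skip.header.line.count."""
    if skip_count <= 0:
        # x <= 0: nothing is skipped, every line is read as data.
        return list(lines)
    # 0 < x < n: the first x lines are dropped (assumed behavior).
    # x >= n: the slice is empty, so the whole file is filtered out.
    return list(lines[skip_count:])

sample = [
    "N_NATIONKEY|N_NAME|N_REGIONKEY|N_COMMENT",
    "0|ALGERIA|0|...",
    "1|ARGENTINA|1|...",
]

print(read_data_lines(sample, 1))
```

With x = 1, as in the nation_header table, only the header line is dropped and the two data lines remain.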

2.2 TSV files

Like CSV files, TSV files are plain text files; the columns are separated by tabs.

For example, the data file oss://bucket-for-testing/oss/text/cities/city.tsv contains:

Beijing    China    010
ShangHai    China    021
Tianjin    China    022

The CREATE TABLE statement can be:

CREATE EXTERNAL TABLE city (
    city STRING, 
    country STRING, 
    code INT
) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
STORED AS TEXTFILE 
LOCATION 'oss://bucket-for-testing/oss/text/cities';

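2.3 Files with multi-character field delimiters

If the field delimiter in your data consists of multiple characters, you can use a CREATE TABLE statement like the example below. Here the field delimiter of each line is "||"; replace it with your own delimiter string.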

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)

Example:

CREATE EXTERNAL TABLE test_csv_multidelimit (
  id STRING,
  name STRING,
  location STRING,
  create_date STRING,
  create_timestamp STRING,
  longitude STRING,
  latitude STRING
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties(
"field.delim"="||"
)
STORED AS TEXTFILE LOCATION 'oss://bucket-for-testing/oss/text/cities/';
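
The difference between a multi-character delimiter and a single-character one is easy to see locally. The Python sketch below uses plain string splitting as a stand-in for the SerDe; the sample line and the helper name split_fields are assumptions made for illustration.

```python
def split_fields(line, delimiter="||"):
    """Split a line on the whole delimiter string, not on each character."""
    return line.split(delimiter)

line = "1||Hangzhou||2018-04-24"  # hypothetical data line

fields = split_fields(line)            # treat "||" as one delimiter
naive = split_fields(line, "|")        # treat "|" as the delimiter instead

print(fields)
print(naive)
```

Splitting on the single character "|" produces spurious empty fields between the real values, which is why a delimiter such as "||" needs to be declared as one unit through field.delim.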

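2.4 JSON files

JSON files that DLA can process are usually stored as plain text. When creating the table, besides specifying STORED AS TEXTFILE you also define a JSON SerDe; the examples in this section use STORED AS JSON, listed in the table in section 1, and let DLA choose the SerDe.
In a JSON file, each line must be one complete JSON object.
For example, the following file layout is not accepted: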

{"id": 123, "name": "jack", 
"c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}

It must be rewritten as:

{"id": 123, "name": "jack", "c3": "2001-02-03 12:34:56"}
{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}
{"id": 789, "name": "tom", "c3": "2001-02-03 12:34:56"}
{"id": 234, "name": "alice", "c3": "1906-04-18 05:12:00"}
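
If existing files contain records that span several lines, they can be rewritten before uploading to OSS. The following Python sketch is one possible way to do this with the standard json module; the function name to_json_lines is hypothetical, and the sketch assumes the input is a sequence of valid JSON objects separated only by whitespace.

```python
import json

def to_json_lines(text):
    """Re-emit concatenated JSON objects so each one occupies exactly one line."""
    decoder = json.JSONDecoder()
    pos, out = 0, []
    while pos < len(text):
        # Skip whitespace (including newlines) between and inside record gaps.
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode parses one object starting at pos and returns where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        out.append(json.dumps(obj, ensure_ascii=False))
    return "\n".join(out)

broken = (
    '{"id": 123, "name": "jack", \n'
    '"c3": "2001-02-03 12:34:56"}\n'
    '{"id": 456, "name": "rose", "c3": "1906-04-18 05:12:00"}\n'
)
fixed = to_json_lines(broken)
print(fixed)
```

Each line of the result parses on its own as a complete JSON object, which is the layout the table definition expects.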

JSON data without nesting

The CREATE TABLE statement can be written as:

CREATE EXTERNAL TABLE t1 (id int, name string, c3 timestamp)
STORED AS JSON
LOCATION 'oss://path/to/t1/directory';

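JSON files with nesting

Use struct and array to define nested JSON data.
For example, given the following raw data (note: whether nested or not, a complete JSON record must sit on a single line to be processed by Data Lake Analytics):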

{       "DocId": "Alibaba",         "User_1": {             "Id": 1234,             "Username": "bob1234",          "Name": "Bob",          "ShippingAddress": {                    "Address1": "969 Wenyi West St.",                     "Address2": null,                       "City": "Hangzhou",                      "Province": "Zhejiang"           },              "Orders": [{                            "ItemId": 6789,                                 "OrderDate": "11/11/2017"                       },                      {                               "ItemId": 4352,                                 "OrderDate": "12/12/2017"                       }               ]       } }

After formatting with an online JSON formatter, the data looks like this:

{
    "DocId": "Alibaba", 
    "User_1": {
        "Id": 1234, 
        "Username": "bob1234", 
        "Name": "Bob", 
        "ShippingAddress": {
            "Address1": "969 Wenyi West St.", 
            "Address2": null, 
            "City": "Hangzhou", 
            "Province": "Zhejiang"
        }, 
        "Orders": [
            {
                "ItemId": 6789, 
                "OrderDate": "11/11/2017"
            }, 
            {
                "ItemId": 4352, 
                "OrderDate": "12/12/2017"
            }
        ]
    }
}

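The CREATE TABLE statement can then be written as follows (note: the path specified in LOCATION must be the directory that contains the JSON data files; all JSON files in that directory are recognized as data of this table):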

CREATE EXTERNAL TABLE json_table_1 (
    docid string,
    user_1 struct<
            id:INT,
            username:string,
            name:string,
            shippingaddress:struct<
                            address1:string,
                            address2:string,
                            city:string,
                            province:string
                            >,
            orders:array<
                    struct<
                        itemid:INT,
                        orderdate:string
                    >
            >
    >
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/';

Query the table:

select * from json_table_1;

+---------+----------------------------------------------------------------------------------------------------------------+
| docid   | user_1                                                                                                         |
+---------+----------------------------------------------------------------------------------------------------------------+
| Alibaba | [1234, bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352, 12/12/2017]]] |
+---------+----------------------------------------------------------------------------------------------------------------+

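For nested structures defined with struct, use "." to reference objects at each level. For array structures defined with array, use "[index]" to reference elements (note: array indexes start at 1).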

select DocId,
       User_1.Id,
       User_1.ShippingAddress.Address1,
       User_1.Orders[1].ItemId
from json_table_1
where User_1.Username = 'bob1234'
  and User_1.Orders[2].OrderDate = '12/12/2017';

+---------+------+--------------------+-------+
| DocId   | id   | address1           | _col3 |
+---------+------+--------------------+-------+
| Alibaba | 1234 | 969 Wenyi West St. |  6789 |
+---------+------+--------------------+-------+
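
The 1-based array index is easy to get wrong when you are used to 0-based languages. The Python sketch below walks the same record locally; the record is a trimmed copy of the sample data, and the helper sql_index is a hypothetical name that only mirrors the indexing convention of the query above.

```python
import json

# Trimmed copy of the sample record, kept on a single line as DLA requires.
record = json.loads(
    '{"DocId": "Alibaba", "User_1": {"Id": 1234, "Username": "bob1234", '
    '"ShippingAddress": {"Address1": "969 Wenyi West St."}, '
    '"Orders": [{"ItemId": 6789, "OrderDate": "11/11/2017"}, '
    '{"ItemId": 4352, "OrderDate": "12/12/2017"}]}}'
)

def sql_index(array, i):
    """Look up element i using the 1-based convention of the SQL query."""
    return array[i - 1]

user = record["User_1"]
address1 = user["ShippingAddress"]["Address1"]            # User_1.ShippingAddress.Address1
first_item = sql_index(user["Orders"], 1)["ItemId"]       # User_1.Orders[1].ItemId
second_date = sql_index(user["Orders"], 2)["OrderDate"]   # User_1.Orders[2].OrderDate

print(address1, first_item, second_date)
```

Orders[1] in the SQL query is the first order (ItemId 6789), which corresponds to index 0 in Python.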

Processing data with JSON functions

For example, the nested JSON value of "value_string" is stored as a string:

{"data_key":"com.taobao.vipserver.domains.meta.biz.alibaba.com","ts":1524550275112,"value_string":"{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"}

After formatting with an online JSON formatter, the data looks like this:

{
    "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", 
    "ts": 1524550275112, 
    "value_string": "{\"appName\":\"\",\"apps\":[],\"checksum\":\"50fa0540b430904ee78dff07c7350e1c\",\"clusterMap\":{\"DEFAULT\":{\"defCkport\":80,\"defIPPort\":80,\"healthCheckTask\":null,\"healthChecker\":{\"checkCode\":200,\"curlHost\":\"\",\"curlPath\":\"/status.taobao\",\"type\":\"HTTP\"},\"name\":\"DEFAULT\",\"nodegroup\":\"\",\"sitegroup\":\"\",\"submask\":\"0.0.0.0/0\",\"syncConfig\":{\"appName\":\"trade-ma\",\"nodegroup\":\"tradema\",\"pubLevel\":\"publish\",\"role\":\"\",\"site\":\"\"},\"useIPPort4Check\":true}},\"disabledSites\":[],\"enableArmoryUnit\":false,\"enableClientBeat\":false,\"enableHealthCheck\":true,\"enabled\":true,\"envAndSites\":\"\",\"invalidThreshold\":0.6,\"ipDeleteTimeout\":1800000,\"lastModifiedMillis\":1524550275107,\"localSiteCall\":true,\"localSiteThreshold\":0.8,\"name\":\"biz.alibaba.com\",\"nodegroup\":\"\",\"owners\":[\"junlan.zx\",\"張三\",\"李四\",\"cui.yuanc\"],\"protectThreshold\":0,\"requireSameEnv\":false,\"resetWeight\":false,\"symmetricCallType\":null,\"symmetricType\":\"warehouse\",\"tagName\":\"ipGroup\",\"tenantId\":\"\",\"tenants\":[],\"token\":\"1cf0ec0c771321bb4177182757a67fb0\",\"useSpecifiedURL\":false}"
}

The CREATE TABLE statement is:

CREATE external TABLE json_table_2 (
   data_key string,
   ts bigint,
   value_string string
)
STORED AS JSON
LOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/';

Once the table is created, you can query it:

select * from json_table_2;

+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data_key                                          | ts            | value_string                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName":"","apps":[],"checksum":"50fa0540b430904ee78dff07c7350e1c","clusterMap":{"DEFAULT":{"defCkport":80,"defIPPort":80,"healthCheckTask":null,"healthChecker":{"checkCode":200,"curlHost":"","curlPath":"/status.taobao","type":"HTTP"},"name":"DEFAULT","nodegroup":"","sitegroup":"","submask":"0.0.0.0/0","syncConfig":{"appName":"trade-ma","nodegroup":"tradema","pubLevel":"publish","role":"","site":""},"useIPPort4Check":true}},"disabledSites":[],"enableArmoryUnit":false,"enableClientBeat":false,"enableHealthCheck":true,"enabled":true,"envAndSites":"","invalidThreshold":0.6,"ipDeleteTimeout":1800000,"lastModifiedMillis":1524550275107,"localSiteCall":true,"localSiteThreshold":0.8,"name":"biz.alibaba.com","nodegroup":"","owners":["junlan.zx","張三","李四","cui.yuanc"],"protectThreshold":0,"requireSameEnv":false,"resetWeight":false,"symmetricCallType":null,"symmetricType":"warehouse","tagName":"ipGroup","tenantId":"","tenants":[],"token":"1cf0ec0c771321bb4177182757a67fb0","useSpecifiedURL":false}       |
+---------------------------------------------------+---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

The following SQL statements show how to use common JSON functions such as json_parse, json_extract_scalar, and json_extract:

mysql> select json_extract_scalar(json_parse(value_string), '$.owners[1]') from json_table_2;

+--------+
| _col0  |
+--------+
| 張三    |
+--------+

mysql> select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask') 
from (
  select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2
) json_obj
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';

+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+

mysql> with json_obj as (select json_extract(json_parse(value_string), '$.clusterMap') as json_col from json_table_2)
select json_extract_scalar(json_obj.json_col, '$.DEFAULT.submask')
from json_obj 
where json_extract_scalar(json_obj.json_col, '$.DEFAULT.healthChecker.curlPath') = '/status.taobao';

+-----------+
| _col0     |
+-----------+
| 0.0.0.0/0 |
+-----------+
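
To make the roles of the three functions concrete, here is a rough Python analogue using the standard json module. The row is a shortened, hand-written version of the sample record, and the mapping between each Python step and a JSON function is an illustration rather than a description of how DLA evaluates the query.

```python
import json

# Shortened version of the sample row: value_string holds JSON as plain text.
row = {
    "data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com",
    "ts": 1524550275112,
    "value_string": json.dumps(
        {
            "owners": ["junlan.zx", "張三", "李四", "cui.yuanc"],
            "clusterMap": {
                "DEFAULT": {
                    "submask": "0.0.0.0/0",
                    "healthChecker": {"curlPath": "/status.taobao"},
                }
            },
        },
        ensure_ascii=False,
    ),
}

parsed = json.loads(row["value_string"])       # like json_parse(...): text to JSON
owner = parsed["owners"][1]                    # like '$.owners[1]': JSON path indexes are 0-based
cluster_map = parsed["clusterMap"]             # like json_extract(..., '$.clusterMap'): still JSON
submask = cluster_map["DEFAULT"]["submask"]    # like json_extract_scalar(..., '$.DEFAULT.submask')

print(owner, submask)
```

Note that the index in the JSON path '$.owners[1]' is 0-based and therefore returns the second owner, unlike the 1-based index used for array columns.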

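2.5 ORC files

Optimized Row Columnar (ORC) is an optimized columnar storage file format supported by the Apache open source project Hive. Compared with CSV files, it not only saves storage space but also delivers better query performance.

For ORC files, you only need to specify STORED AS ORC when creating the table.
For example: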

CREATE EXTERNAL TABLE orders_orc_date (
    O_ORDERKEY INT, 
    O_CUSTKEY INT, 
    O_ORDERSTATUS STRING, 
    O_TOTALPRICE DOUBLE, 
    O_ORDERDATE DATE, 
    O_ORDERPRIORITY STRING, 
    O_CLERK STRING, 
    O_SHIPPRIORITY INT, 
    O_COMMENT STRING
) 
STORED AS ORC 
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';

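2.6 PARQUET files

Parquet is a columnar storage file format supported by the Apache open source project Hadoop.
When creating the table in DLA, you only need to specify STORED AS PARQUET.
For example: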

CREATE EXTERNAL TABLE orders_parquet_date (
    O_ORDERKEY INT, 
    O_CUSTKEY INT, 
    O_ORDERSTATUS STRING, 
    O_TOTALPRICE DOUBLE, 
    O_ORDERDATE DATE, 
    O_ORDERPRIORITY STRING, 
    O_CLERK STRING, 
    O_SHIPPRIORITY INT, 
    O_COMMENT STRING
) 
STORED AS PARQUET 
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';

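2.7 RCFILE files

Record Columnar File (RCFile) is a columnar storage file format that stores relational table structures efficiently in distributed systems and can be read and processed efficiently.
When creating the table in DLA, specify STORED AS RCFILE.
For example: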

CREATE EXTERNAL TABLE lineitem_rcfile_date (
    L_ORDERKEY INT, 
    L_PARTKEY INT, 
    L_SUPPKEY INT, 
    L_LINENUMBER INT, 
    L_QUANTITY DOUBLE, 
    L_EXTENDEDPRICE DOUBLE, 
    L_DISCOUNT DOUBLE, 
    L_TAX DOUBLE, 
    L_RETURNFLAG STRING, 
    L_LINESTATUS STRING, 
    L_SHIPDATE DATE, 
    L_COMMITDATE DATE, 
    L_RECEIPTDATE DATE, 
    L_SHIPINSTRUCT STRING, 
    L_SHIPMODE STRING, 
    L_COMMENT STRING
) 
STORED AS RCFILE
LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile';

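2.8 AVRO files

When creating a table for AVRO files in DLA, specify STORED AS AVRO and make sure the defined columns conform to the schema of the AVRO files.

If you are unsure of the schema, you can obtain it with the tools provided by Avro and create the table based on it.
Download the avro-tools jar (avro-tools-1.8.2.jar in the command below) from the Apache Avro website to your local machine, then run the following command to obtain the schema of the Avro file: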

java -jar avro-tools-1.8.2.jar getschema /path/to/your/doctors.avro
{
  "type" : "record",
  "name" : "doctors",
  "namespace" : "testing.hive.avro.serde",
  "fields" : [ {
    "name" : "number",
    "type" : "int",
    "doc" : "Order of playing the role"
  }, {
    "name" : "first_name",
    "type" : "string",
    "doc" : "first name of actor playing role"
  }, {
    "name" : "last_name",
    "type" : "string",
    "doc" : "last name of actor playing role"
  } ]
}

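The CREATE TABLE statement is as follows. Each name in fields corresponds to a column name in the table, and each type must be converted to a Hive-supported type according to the table in this document.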

CREATE EXTERNAL TABLE doctors(
number int,
first_name string,
last_name string)
STORED AS AVRO
LOCATION 'oss://mybucket-for-testing/directory/to/doctors';

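In most cases, Avro types convert directly to the corresponding Hive types. If a type is not supported in Hive, it is converted to the closest type. See the table below for details: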

Avro type    Corresponding Hive type
null         void
boolean      boolean
int          int
long         bigint
float        float
double       double
bytes        binary
string       string
record       struct
map          map
list         array
union        union
enum         string
fixed        binary
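
For schemas with many fields, the column list can be derived mechanically from the getschema output. The Python sketch below covers only primitive Avro types from the mapping table; the function name columns_from_schema is hypothetical, and complex types (record, map, array, union, enum, fixed) are deliberately rejected so they can be mapped by hand.

```python
import json

# Primitive Avro types and their Hive counterparts, taken from the mapping table.
AVRO_TO_HIVE = {
    "null": "void", "boolean": "boolean", "int": "int", "long": "bigint",
    "float": "float", "double": "double", "bytes": "binary", "string": "string",
}

def columns_from_schema(schema_json):
    """Build the column list of a CREATE TABLE statement from primitive Avro fields."""
    schema = json.loads(schema_json)
    columns = []
    for field in schema["fields"]:
        avro_type = field["type"]
        # Complex types appear as dicts or lists in the schema; leave those to a human.
        if not isinstance(avro_type, str) or avro_type not in AVRO_TO_HIVE:
            raise ValueError(f"type of field {field['name']!r} needs manual mapping")
        columns.append(f"{field['name']} {AVRO_TO_HIVE[avro_type]}")
    return ",\n".join(columns)

doctors_schema = """
{"type": "record", "name": "doctors", "namespace": "testing.hive.avro.serde",
 "fields": [{"name": "number", "type": "int"},
            {"name": "first_name", "type": "string"},
            {"name": "last_name", "type": "string"}]}
"""
print(columns_from_schema(doctors_schema))
```

For the doctors schema this yields the same three columns as the CREATE TABLE statement shown earlier in this section.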

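2.9 Files that can be matched by a regular expression

Files of this type are usually stored on OSS as plain text. Each line represents one record in the table and can be matched by a regular expression.
Apache web server log files are an example of this type of file.

Suppose a log file contains: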

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
127.0.0.1 - - [26/May/2009:00:00:00 +0000] "GET /someurl/?track=Blabla(Main) HTTP/1.1" 200 5864 - "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19"

Each line of the file can be described by the following regular expression, with columns separated by spaces:

([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?

For this file format, the CREATE TABLE statement can be written as:

CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  identity STRING,
  userName STRING,
  time STRING,
  request STRING,
  status STRING,
  size INT,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-for-testing/datasets/serde/regex';

Query result:

mysql> select * from serde_regex;
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| host      | identity | userName | time                         | request                                     | status | size | referer | agent                                                                                                                    |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
| 127.0.0.1 | -        | frank | [10/Oct/2000:13:55:36 -0700] | "GET /apache_pb.gif HTTP/1.0"               | 200    | 2326 | NULL    | NULL                                                                                                                     |
| 127.0.0.1 | -        | -     | [26/May/2009:00:00:00 +0000] | "GET /someurl/?track=Blabla(Main) HTTP/1.1" | 200    | 5864 | -       | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19" |
+-----------+----------+-------+------------------------------+---------------------------------------------+--------+------+---------+--------------------------------------------------------------------------------------------------------------------------+
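
Before creating the table, it can help to test the pattern against a few sample lines locally. The Python sketch below uses the same regular expression with the SQL string escaping removed, and assumes that a line must match the pattern as a whole to produce a row. The names LOG_PATTERN, COLUMNS, and parse_log_line are illustrative.

```python
import re

# The pattern from the table definition with the string escaping removed
# (\\[ becomes \[ and \" becomes ").
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)
COLUMNS = ["host", "identity", "userName", "time", "request",
           "status", "size", "referer", "agent"]

def parse_log_line(line):
    """Map each capture group to a column name; return None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))

first = parse_log_line(
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326'
)
print(first)
```

The optional last group is absent in this line, so referer and agent come back empty here, matching the NULL values in the query result.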

2.10 Esri ArcGIS geographic JSON data files

DLA provides SerDe handling for Esri ArcGIS geographic JSON data files. For a description of this geographic JSON format, see https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

Example:

CREATE EXTERNAL TABLE IF NOT EXISTS california_counties
(
    Name string,
    BoundaryShape binary
)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'oss://test_bucket/datasets/geospatial/california-counties/';

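3. Summary

The examples above show that DLA supports files in most open source storage formats. For the same data, the storage format makes a considerable difference both to the size of the files stored on OSS and to the speed of DLA queries. We recommend the ORC format for storing and querying files.

DLA is continuously being optimized for faster queries and will support more data sources in the future, to give users a better big data analysis experience.

(This article was written jointly by @金絡 and @君瀾. Feel free to contact us at any time with questions.)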