# How to Configure Each Partition Type in Apache Hudi

## 1. Introduction

Apache Hudi supports datasets with several partitioning schemes, such as multi-level partitions, single-field partitions, date-based partitions, and non-partitioned datasets. Users can pick the scheme that fits their needs. This article walks through how to configure each partition type in Hudi.

## 2. Partition handling

To show how Hudi handles each partition type, assume the following schema is written to Hudi:

```json
{
  "type" : "record",
  "name" : "HudiSchemaDemo",
  "namespace" : "hoodie.HudiSchemaDemo",
  "fields" : [
    { "name" : "age", "type" : [ "long", "null" ] },
    { "name" : "location", "type" : [ "string", "null" ] },
    { "name" : "name", "type" : [ "string", "null" ] },
    { "name" : "sex", "type" : [ "string", "null" ] },
    { "name" : "ts", "type" : [ "long", "null" ] },
    { "name" : "date", "type" : [ "string", "null" ] }
  ]
}
```

A sample record looks like this:

```json
{
  "name": "zhangsan",
  "ts": 1574297893837,
  "age": 16,
  "location": "beijing",
  "sex": "male",
  "date": "2020/08/16"
}
```

### 2.1 Single partition

A single partition uses one field as the partition field. There are two cases: a field that is not in date format (such as `location`) and a field that is in date format (such as `date`).

### 2.1.1 Partitioning on a non-date field

To use the `location` field above as the partition field, configure the write to Hudi and the sync to Hive as follows:

```java
import org.apache.hudi.DataSourceWriteOptions;
import static org.apache.hudi.QuickstartUtils.getQuickstartWriteConfigs;
import static org.apache.hudi.config.HoodieWriteConfig.TABLE_NAME;

// partitionFields, keyGenerator, hivePartitionFields and
// hivePartitionExtractorClass vary per scenario (see the lists below).
df.write().format("org.apache.hudi").
    options(getQuickstartWriteConfigs()).
    option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), "COPY_ON_WRITE").
    option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "ts").
    option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "name").
    // Partition field(s) and key generator used when writing to Hudi
    option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), partitionFields).
    option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY(), keyGenerator).
    option(TABLE_NAME, tableName).
    // Hive sync settings
    option("hoodie.datasource.hive_sync.enable", true).
    option("hoodie.datasource.hive_sync.table", tableName).
    option("hoodie.datasource.hive_sync.username", "root").
    option("hoodie.datasource.hive_sync.password", "123456").
    option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://localhost:10000").
    option("hoodie.datasource.hive_sync.partition_fields", hivePartitionFields).
    option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
    option("hoodie.embed.timeline.server", false).
    option("hoodie.datasource.hive_sync.partition_extractor_class", hivePartitionExtractorClass).
    mode(saveMode).
    save(basePath);
```

Note the following options:

* `DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()` is set to `location`;
* `hoodie.datasource.hive_sync.partition_fields` is set to `location`, the same as the partition field written to Hudi;
* **`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()` is set to `org.apache.hudi.keygen.SimpleKeyGenerator`; it can also be left unset, since the default is `org.apache.hudi.keygen.SimpleKeyGenerator`**;
* **`hoodie.datasource.hive_sync.partition_extractor_class` is set to `org.apache.hudi.hive.MultiPartKeysValueExtractor`**;

The table that Hudi creates in Hive on sync is as follows:

```sql
CREATE EXTERNAL TABLE `notdateformatsinglepartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `name` string,
  `sex` string,
  `ts` bigint)
PARTITIONED BY (
  `location` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/notDateFormatSinglePartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816154250',
  'transient_lastDdlTime'='1597563780')
```

Query the table `notdateformatsinglepartitiondemo`:

**Tip: before querying, place the hudi-hive-sync-bundle-xxx.jar package under $HIVE_HOME/lib.**

![](https://img2020.cnblogs.com/blog/616953/202008/616953-20200818094437316-1494932632.png)

### 2.1.2 Partitioning on a date field

To use the `date` field above as the partition field, the key options are:

* `DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()` is set to `date`;
* `hoodie.datasource.hive_sync.partition_fields` is set to `date`, the same as the partition field written to Hudi;
* **`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()` is set to `org.apache.hudi.keygen.SimpleKeyGenerator`; it can also be left unset, since the default is `org.apache.hudi.keygen.SimpleKeyGenerator`**;
* **`hoodie.datasource.hive_sync.partition_extractor_class` is set to `org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor`**;

The table that Hudi creates in Hive on sync is as follows:

```sql
CREATE EXTERNAL TABLE `dateformatsinglepartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `location` string,
  `name` string,
  `sex` string,
  `ts` bigint)
PARTITIONED BY (
  `date` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/dateFormatSinglePartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816155107',
  'transient_lastDdlTime'='1597564276')
```

Query the table `dateformatsinglepartitiondemo`:

![](https://img2020.cnblogs.com/blog/616953/202008/616953-20200818094454796-257500503.png)

### 2.2 Multiple partitions

Multiple partitions use several fields together as the partition fields, for example the `location` and `sex` fields above. The key options are:

* `DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()` is set to `location,sex`;
* `hoodie.datasource.hive_sync.partition_fields` is set to `location,sex`, the same as the partition fields written to Hudi;
* **`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()` is set to `org.apache.hudi.keygen.ComplexKeyGenerator`**;
* **`hoodie.datasource.hive_sync.partition_extractor_class` is set to `org.apache.hudi.hive.MultiPartKeysValueExtractor`**;

The table that Hudi creates in Hive on sync is as follows:

```sql
CREATE EXTERNAL TABLE `multipartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `name` string,
  `ts` bigint)
PARTITIONED BY (
  `location` string,
  `sex` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/multiPartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816160557',
  'transient_lastDdlTime'='1597565166')
```

Query the table `multipartitiondemo`:
![](https://img2020.cnblogs.com/blog/616953/202008/616953-20200818094510362-838782721.png)

### 2.3 No partition

The non-partitioned case has no partition field, so the dataset written to Hudi is not partitioned. The key options are:

* `DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()` is set to an empty string;
* `hoodie.datasource.hive_sync.partition_fields` is set to an empty string, the same as the partition field written to Hudi;
* **`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()` is set to `org.apache.hudi.keygen.NonpartitionedKeyGenerator`**;
* **`hoodie.datasource.hive_sync.partition_extractor_class` is set to `org.apache.hudi.hive.NonPartitionedExtractor`**;

The table that Hudi creates in Hive on sync is as follows:

```sql
CREATE EXTERNAL TABLE `nonpartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `location` string,
  `name` string,
  `sex` string,
  `ts` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/nonPartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816161558',
  'transient_lastDdlTime'='1597565767')
```

Query the table `nonpartitiondemo`:

![](https://img2020.cnblogs.com/blog/616953/202008/616953-20200818094523713-456230446.png)

### 2.4 Hive-style partitions

Besides the common schemes above, there is also a Hive-style partition format, such as `location=beijing/sex=male`. With `location,sex` as the partition fields, the key options are:

* `DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()` is set to `location,sex`;
* `hoodie.datasource.hive_sync.partition_fields` is set to `location,sex`, the same as the partition fields written to Hudi;
* **`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()` is set to `org.apache.hudi.keygen.ComplexKeyGenerator`**;
* **`hoodie.datasource.hive_sync.partition_extractor_class` is set to `org.apache.hudi.hive.MultiPartKeysValueExtractor`** (`org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor` expects a three-part `yyyy/mm/dd` path, so it does not fit a `location=.../sex=...` path);
* **`DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY()` is set to `true`**;

The resulting Hudi dataset directory structure has the following format:

```shell
/location=beijing/sex=male
```

The table that Hudi creates in Hive on sync is as follows:

```sql
CREATE EXTERNAL TABLE
`hivestylepartitiondemo`(
  `_hoodie_commit_time` string,
  `_hoodie_commit_seqno` string,
  `_hoodie_record_key` string,
  `_hoodie_partition_path` string,
  `_hoodie_file_name` string,
  `age` bigint,
  `date` string,
  `name` string,
  `ts` bigint)
PARTITIONED BY (
  `location` string,
  `sex` string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  'file:/tmp/hudi-partitions/hiveStylePartitionDemo'
TBLPROPERTIES (
  'last_commit_time_sync'='20200816172710',
  'transient_lastDdlTime'='1597570039')
```

Query the table `hivestylepartitiondemo`:

![](https://img2020.cnblogs.com/blog/616953/202008/616953-20200818094608135-672307716.png)

## 3. Summary

This article described how Hudi handles different partitioning scenarios. The partition-related settings shown above cover the vast majority of cases. Hudi is also flexible enough to support custom partition extractors; see the `KeyGenerator` and `PartitionValueExtractor` classes. Every generator of the partition fields written to Hudi is a subclass of `KeyGenerator`, and every extractor of the partition values synced to Hive is an implementation of `PartitionValueExtractor`. The sample code has been uploaded to [https://github.com/leesf/hudi-demos](https://github.com/leesf/hudi-demos). The repository will keep adding demos of Hudi usage to help developers get familiar with Hudi quickly and build enterprise-grade data lakes. Stars are welcome.
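
The scenario-to-class pairings from sections 2.1 through 2.4 can be collected into a single lookup. The following is a minimal sketch, not code from the demo repository: the class `HudiPartitionOptions`, the enum `Scenario`, and the method `optionsFor` are hypothetical names introduced for illustration. The string keys are the plain-text forms of the `DataSourceWriteOptions` constants used earlier. For the Hive-style case the sketch assumes `MultiPartKeysValueExtractor`, because the path consists of two `key=value` segments rather than a `yyyy/mm/dd` layout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: maps each partition scenario in this article
// to the Hudi write and Hive sync options it needs.
final class HudiPartitionOptions {

    enum Scenario { SINGLE_FIELD, SINGLE_DATE, MULTI_FIELD, NON_PARTITIONED, HIVE_STYLE }

    static Map<String, String> optionsFor(Scenario scenario, String partitionFields) {
        String keyGenerator;
        String extractor;
        boolean hiveStyle = false;
        switch (scenario) {
            case SINGLE_FIELD:
                keyGenerator = "org.apache.hudi.keygen.SimpleKeyGenerator";
                extractor = "org.apache.hudi.hive.MultiPartKeysValueExtractor";
                break;
            case SINGLE_DATE:
                keyGenerator = "org.apache.hudi.keygen.SimpleKeyGenerator";
                extractor = "org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor";
                break;
            case MULTI_FIELD:
                keyGenerator = "org.apache.hudi.keygen.ComplexKeyGenerator";
                extractor = "org.apache.hudi.hive.MultiPartKeysValueExtractor";
                break;
            case NON_PARTITIONED:
                keyGenerator = "org.apache.hudi.keygen.NonpartitionedKeyGenerator";
                extractor = "org.apache.hudi.hive.NonPartitionedExtractor";
                break;
            case HIVE_STYLE:
                keyGenerator = "org.apache.hudi.keygen.ComplexKeyGenerator";
                // Assumption: key=value path segments are parsed by the multi-part extractor.
                extractor = "org.apache.hudi.hive.MultiPartKeysValueExtractor";
                hiveStyle = true;
                break;
            default:
                throw new IllegalArgumentException("Unknown scenario: " + scenario);
        }

        Map<String, String> opts = new LinkedHashMap<>();
        // Write-side partition settings
        opts.put("hoodie.datasource.write.partitionpath.field", partitionFields);
        opts.put("hoodie.datasource.write.keygenerator.class", keyGenerator);
        opts.put("hoodie.datasource.write.hive_style_partitioning", String.valueOf(hiveStyle));
        // Hive sync partition settings (same fields as the write side)
        opts.put("hoodie.datasource.hive_sync.partition_fields", partitionFields);
        opts.put("hoodie.datasource.hive_sync.partition_extractor_class", extractor);
        return opts;
    }

    public static void main(String[] args) {
        // Print the options for the multi-partition scenario from section 2.2.
        optionsFor(Scenario.MULTI_FIELD, "location,sex")
            .forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The returned map could be passed to the Spark writer through `options(...)` in place of the individual `option(...)` calls for these keys; for the non-partitioned case, pass an empty string as `partitionFields`.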