[3] Hive3.x Materialized view

Objectives

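In general, the most effective ways to accelerate queries are:

  • Pre-computation of relevant summaries
  • Materialized views

Hive 3.0 introduces materialized views together with automatic query rewriting against them, implemented on top of Apache Calcite. Notably, 3.0 lets you choose where a materialized view is stored: natively in Hive, or in another system (such as Druid) through user-defined storage handlers. Hive 3.0 also provides control over the materialized view lifecycle, such as refreshing its data.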

Not a view, not a table: meet the materialized view

According to Wikipedia, a SQL view is the result set of a stored query on the data. Let’s say you have many different tables that you query constantly, always with the same joins, filters and aggregations. With a view, you can simplify access to those datasets while providing more meaning to the end user. It avoids repeating the same complex queries and eases schema evolution.

For example, an application needs access to a products dataset with the product owner and the total number of orders for each product. Such queries would need to join the User and Order tables with the Product table. A view would mask the complexity of the schema from end users by exposing a single table with custom, dedicated ACLs.
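As a rough sketch of the view described above, assuming hypothetical tables product, users and orders (none of these table or column names come from the original text):

```sql
-- Hypothetical schema, for illustration only:
--   product(id, name, owner_id), users(id, name), orders(id, product_id)
CREATE VIEW product_summary AS
SELECT
  p.id        AS product_id,
  p.name      AS product_name,
  u.name      AS owner_name,
  COUNT(o.id) AS order_count   -- 0 for products that have no orders
FROM product p
JOIN users u ON u.id = p.owner_id
LEFT JOIN orders o ON o.product_id = p.id
GROUP BY p.id, p.name, u.name;

-- End users query one simple relation instead of repeating the joins.
SELECT product_name, owner_name, order_count
FROM product_summary
WHERE order_count > 100;
```

Because this is a plain (virtual) view, every query against product_summary still runs the joins and the aggregation underneath.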

However, such views in Hive used to be purely virtual, so every query against them re-ran the underlying, often huge and slow, query. Instead, you could create an intermediate table to store the results of your query, but that requires changing your access patterns and leaves you with the challenge of keeping the data in the table fresh.
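The intermediate-table workaround might look like the following sketch. The table names product, orders and product_summary_tbl are hypothetical, chosen only for illustration:

```sql
-- Persist the result of the expensive query in an ordinary table (CTAS).
CREATE TABLE product_summary_tbl STORED AS ORC AS
SELECT p.id AS product_id, COUNT(o.id) AS order_count
FROM product p
LEFT JOIN orders o ON o.product_id = p.id
GROUP BY p.id;

-- Drawback 1: queries must be changed to target product_summary_tbl explicitly.
SELECT product_id, order_count FROM product_summary_tbl;

-- Drawback 2: the table goes stale; it has to be refreshed by hand
-- (or by a scheduler) whenever product or orders change.
INSERT OVERWRITE TABLE product_summary_tbl
SELECT p.id AS product_id, COUNT(o.id) AS order_count
FROM product p
LEFT JOIN orders o ON o.product_id = p.id
GROUP BY p.id;
```

Materialized views aim to remove both drawbacks: queries keep targeting the source tables, and the system tracks freshness.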

We can identify four main types of optimization:

  • Change data’s physical properties (distribute, sort).
  • Filter or partition rows.
  • Denormalization: grouping two or more tables into one bigger table, which removes the need for a heavy JOIN operation.
  • Preaggregation.

The goal of Materialized views (MV) is to improve the speed of queries while requiring zero maintenance operations.

The main features are:

  • Storing the result of a query just like a table (the storage can be in Hive or Druid).
  • The definition of the MV is used to rewrite queries, so no change to your existing query patterns is required.
  • The freshness of the data is ensured by the system.
  • A simple insert in the table is very efficient since it does not require rebuilding the view.

Management of materialized views in Hive

Materialized views creation
Basic supported features:

  • partition columns
  • custom storage handler
  • passing table properties

CREATE MATERIALIZED VIEW [IF NOT EXISTS] [db_name.]materialized_view_name
  [DISABLE REWRITE]
  [COMMENT materialized_view_comment]
  [PARTITIONED ON (col_name, ...)]
  [
    [ROW FORMAT row_format]
    [STORED AS file_format]
      | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]
AS
<query>;

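Notes

(1) When a materialized view is created, it is automatically populated with the results of its query. Creation is atomic: the materialized view is not visible to any user until the query has finished populating it.
(2) By default, the materialized view can be used by the optimizer for query rewriting; specify DISABLE REWRITE at creation time to prevent this.
(3) The SerDe and storage format are optional and user-configurable; when omitted, the defaults come from hive.materializedview.serde and hive.materializedview.fileformat.
(4) Materialized views can be stored in external systems (such as Druid) through custom storage handlers, for example: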

CREATE MATERIALIZED VIEW druid_wiki_mv
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
AS
SELECT __time, page, user, c_added, c_removed
FROM src;
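
For a view stored natively in Hive, a creation statement that exercises several of the optional clauses could look like the sketch below. The source table sales and its columns are hypothetical, and the clause order follows the grammar shown earlier:

```sql
-- Hypothetical source table: sales(sale_id, store_id, amount, sale_date).
CREATE MATERIALIZED VIEW IF NOT EXISTS sales_by_store_mv
  DISABLE REWRITE                          -- populated, but not yet used by the optimizer
  COMMENT 'Total sales amount per store'
  STORED AS ORC                            -- overrides hive.materializedview.fileformat
  TBLPROPERTIES ('orc.compress'='SNAPPY')  -- table property passed to the ORC writer
AS
SELECT store_id, SUM(amount) AS total_amount, COUNT(*) AS sale_cnt
FROM sales
GROUP BY store_id;
```

Creating the view with DISABLE REWRITE lets you inspect its contents before the optimizer is allowed to use it.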

Other operations for materialized view management

Currently, materialized views can be dropped, listed and described; more operations will be added later.

-- Drops a materialized view
DROP MATERIALIZED VIEW [db_name.]materialized_view_name;
-- Shows materialized views (with optional filters)
SHOW MATERIALIZED VIEWS [IN database_name] ['identifier_with_wildcards'];
-- Shows information about a specific materialized view
DESCRIBE [EXTENDED | FORMATTED] [db_name.]materialized_view_name;
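
A concrete session using these statements might look as follows. It borrows the hive3_test database and the depts_agg view from the Example section of this article:

```sql
-- List the materialized views in the hive3_test database.
SHOW MATERIALIZED VIEWS IN hive3_test;

-- Inspect storage details, table properties and the defining query.
DESCRIBE FORMATTED hive3_test.depts_agg;

-- Remove the view; the source table depts is not affected.
DROP MATERIALIZED VIEW hive3_test.depts_agg;
```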

Materialized view-based query rewriting

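  • Once created, a materialized view can be used to accelerate relevant queries: if a submitted query can be rewritten to use an existing view, the optimizer rewrites it to read from that view.
  • Whether queries are rewritten to use materialized views is controlled by the global parameter hive.materializedview.rewriting, which defaults to true:
    SET hive.materializedview.rewriting=true;
  • Rewriting can also be disabled for individual views. By default, materialized views are enabled for rewriting at creation time; to alter that behavior, use the following statement:
    ALTER MATERIALIZED VIEW [db_name.]materialized_view_name ENABLE|DISABLE REWRITE;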
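
One way to observe the two switches is to compare EXPLAIN output before and after toggling them. The sketch below borrows the depts table and depts_agg view from the Example section; the comments describe the plans one would expect, not captured output:

```sql
-- Global switch off: the optimizer should ignore all materialized views.
SET hive.materializedview.rewriting=false;
EXPLAIN SELECT deptno, count(1) AS deptno_cnt FROM depts GROUP BY deptno;
-- Expected: a plan that scans depts and aggregates.

-- Global switch back on, but this one view is excluded from rewriting.
SET hive.materializedview.rewriting=true;
ALTER MATERIALIZED VIEW depts_agg DISABLE REWRITE;
EXPLAIN SELECT deptno, count(1) AS deptno_cnt FROM depts GROUP BY deptno;
-- Expected: still a scan of depts.

-- Re-enable the view; the same query can now be answered from depts_agg.
ALTER MATERIALIZED VIEW depts_agg ENABLE REWRITE;
EXPLAIN SELECT deptno, count(1) AS deptno_cnt FROM depts GROUP BY deptno;
```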

Query rewriting is built on Apache Calcite; for examples of the supported rewrites, see:
Materialized Views

Materialized view maintenance

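When the source data changes (rows are inserted or modified), the materialized view must be refreshed to stay consistent. Currently the user has to trigger the rebuild explicitly: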

ALTER MATERIALIZED VIEW [db_name.]materialized_view_name REBUILD;
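
As an illustration, using the depts table and depts_agg view from the Example section, a change to the source followed by a rebuild could look like this. The inserted row is made-up sample data:

```sql
-- Change the source table; depts_agg is now stale.
INSERT INTO depts VALUES (1003, 'wangwu', 2);   -- made-up sample row

-- Bring the materialized view up to date.
ALTER MATERIALIZED VIEW depts_agg REBUILD;

-- The view should now contain a row for deptno 1003 as well.
SELECT deptno, deptno_cnt FROM depts_agg;
```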

Incremental rebuild
Hive supports incremental view maintenance, i.e., only refresh data that was affected by the changes in the original source tables. Incremental view maintenance will decrease the rebuild step execution time. In addition, it will preserve LLAP cache for existing data in the materialized view.

By default, Hive will attempt to rebuild a materialized view incrementally, falling back to full rebuild if it is not possible. Current implementation only supports incremental rebuild when there were INSERT operations over the source tables, while UPDATE and DELETE operations will force a full rebuild of the materialized view.

For incremental maintenance to be possible, the following conditions must be met:

  • The materialized view should only use transactional tables, either micromanaged (insert-only) or ACID.
  • If the materialized view definition contains a Group By clause, the materialized view should be stored in an ACID table, since it needs to support the MERGE operation. For materialized view definitions consisting of Scan-Project-Filter-Join, this restriction does not exist.

A rebuild operation acquires an exclusive write lock over the materialized view, i.e., for a given materialized view, only one rebuild operation can be executed at a given time.
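
A minimal sketch of a Group By view that is itself stored as an ACID table, so that it can qualify for incremental rebuild, is shown below. It assumes the transactional depts table from the Example section; the view name depts_by_location_mv is hypothetical, and marking the view transactional through TBLPROPERTIES is the assumed way of storing it in an ACID table:

```sql
-- Source table depts is transactional (see the Example section).
-- The view contains GROUP BY, so it is stored as a full ACID ORC table.
CREATE MATERIALIZED VIEW depts_by_location_mv
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true')
AS
SELECT locationid, count(1) AS dept_cnt
FROM depts
GROUP BY locationid;

-- After INSERT-only changes to depts, Hive attempts an incremental rebuild;
-- after UPDATE or DELETE it falls back to a full rebuild.
ALTER MATERIALIZED VIEW depts_by_location_mv REBUILD;
```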

Materialized view lifecycle

By default, once a materialized view's contents are stale, the materialized view will not be used for automatic query rewriting.

However, on some occasions it may be fine to accept stale data, e.g., if the materialized view uses non-transactional tables, so that we cannot verify whether its contents are outdated, but we still want to use automatic rewriting. For those occasions, we can combine a rebuild operation run periodically, e.g., every 5 minutes, with a required freshness for the materialized view data, defined through the hive.materializedview.rewriting.time.window configuration parameter, for instance:

SET hive.materializedview.rewriting.time.window=10min;

The parameter value can also be overridden for a specific materialized view by setting it as a table property when the materialization is created.
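
Putting the pieces together, a setup that tolerates up to 10 minutes of staleness might be sketched as follows. The view name depts_agg_10min is hypothetical, and the table property key 'rewriting.time.window' is an assumption that should be checked against your Hive version:

```sql
-- Session-wide: accept views that were rebuilt within the last 10 minutes.
SET hive.materializedview.rewriting.time.window=10min;

-- Per-view override at creation time.
-- ASSUMPTION: the table property key is 'rewriting.time.window'.
CREATE MATERIALIZED VIEW depts_agg_10min
  TBLPROPERTIES ('rewriting.time.window'='10min')
AS
SELECT deptno, count(1) AS deptno_cnt
FROM depts
GROUP BY deptno;

-- Run periodically (for example every 5 minutes) from an external scheduler.
ALTER MATERIALIZED VIEW depts_agg_10min REBUILD;
```

Rebuilding more often than the window length keeps the view continuously eligible for rewriting.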

Materialized view related setting parameters

  <property>
    <name>hive.materializedview.rewriting</name>
    <value>true</value>
    <description>Whether to try to rewrite queries using the materialized views enabled for rewriting</description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.strategy</name>
    <value>heuristic</value>
    <description>
      Expects one of [heuristic, costbased].
      The strategy that should be used to cost and select the materialized view rewriting. 
        heuristic: Always try to select the plan using the materialized view if rewriting produced one, choosing the plan with lower cost among possible plans containing a materialized view
        costbased: Fully cost-based strategy, always use plan with lower cost, independently on whether it uses a materialized view or not
    </description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.time.window</name>
    <value>0min</value>
    <description>
      Expects a time value with unit (d/day, h/hour, m/min, s/sec, ms/msec, us/usec, ns/nsec), which is min if not specified.
      Time window, specified in seconds, after which outdated materialized views become invalid for automatic query rewriting.
      For instance, if more time than the value assigned to the property has passed since the materialized view was created or rebuilt, and one of its source tables has changed since, the materialized view will not be considered for rewriting. Default value 0 means that the materialized view cannot be outdated to be used automatically in query rewriting. Value -1 means to skip this check.
    </description>
  </property>
  <property>
    <name>hive.materializedview.rewriting.incremental</name>
    <value>false</value>
    <description>
      Whether to try to execute incremental rewritings based on outdated materializations and
      current content of tables. Default value of true effectively amounts to enabling incremental
      rebuild for the materializations too.
    </description>
  </property>
  <property>
    <name>hive.materializedview.rebuild.incremental</name>
    <value>true</value>
    <description>
      Whether to try to execute incremental rebuild for the materialized views. Incremental rebuild
      tries to modify the original materialization contents to reflect the latest changes to the
      materialized view source tables, instead of rebuilding the contents fully. Incremental rebuild
      is based on the materialized view algebraic incremental rewriting.
    </description>
  </property>
  <property>
    <name>hive.materializedview.fileformat</name>
    <value>ORC</value>
    <description>
      Expects one of [none, textfile, sequencefile, rcfile, orc].
      Default file format for CREATE MATERIALIZED VIEW statement
    </description>
  </property>
  <property>
    <name>hive.materializedview.serde</name>
    <value>org.apache.hadoop.hive.ql.io.orc.OrcSerde</value>
    <description>Default SerDe used for materialized views</description>
  </property>
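
Most of these parameters can presumably also be changed per session rather than in hive-site.xml. A short sketch, using only the parameter names listed above:

```sql
-- Print the current value of a parameter (SET with a key and no value).
SET hive.materializedview.rewriting.strategy;

-- Let the optimizer pick the cheapest plan, whether or not it uses a view.
SET hive.materializedview.rewriting.strategy=costbased;

-- Always rebuild materialized views from scratch in this session.
SET hive.materializedview.rebuild.incremental=false;
```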

Example

(1) Create a transactional table, depts

SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=2;
CREATE TABLE depts (
  deptno INT,
  deptname VARCHAR(256),
  locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
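
The table above is a full ACID table. For comparison, an insert-only (micromanaged) transactional table, the other kind of source table a rewriting-enabled materialized view can use, could be declared as in this sketch; the table name depts_insert_only is hypothetical:

```sql
-- Insert-only transactional table: supports INSERT, but not UPDATE or DELETE.
CREATE TABLE depts_insert_only (
  deptno INT,
  deptname VARCHAR(256),
  locationid INT)
STORED AS ORC
TBLPROPERTIES (
  'transactional'='true',
  'transactional_properties'='insert_only');
```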

(2) Load data

hive> INSERT OVERWRITE TABLE depts
    > select
    > id,name,1 as loc from student;
Query ID = didi_20181128204405_c06c8983-a363-458b-b1f8-443deeb514c2
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
  ....
  
hive> select * from depts;
OK
1001	zhangsan	1
1002	lisi	1
Time taken: 0.195 seconds, Fetched: 2 row(s)
...

(3) Create an aggregate materialized view on depts

hive> CREATE MATERIALIZED VIEW depts_agg
    > AS
    > SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
    
Query ID = didi_20181128204706_be53ca94-f594-49a2-beda-7cec2b2f2c71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1543385586294_0004, Tracking URL = http://localhost:8088/proxy/application_1543385586294_0004/
Kill Command = /..../software/hadoop/hadoop-2.7.4/bin/mapred job  -kill job_1543385586294_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
.....

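Note:
The log shows that, unlike an ordinary CREATE TABLE, CREATE MATERIALIZED VIEW launches an MR job to build the materialized view (no other execution engine, such as Spark, was configured here, so the default, MR, is used).

The build job can also be seen in the resource manager UI.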

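(4) Query the base table depts
Because the query matches the materialized view, it is rewritten to read from the view, which makes it faster (no MR job is launched, only a plain table scan).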

hive> SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
OK
1001	1
1002	1
Time taken: 0.414 seconds, Fetched: 2 row(s)

The execution plan shows the details: the query is automatically rewritten into a TableScan with alias hive3_test.depts_agg.

hive> explain SELECT  deptno, count(1) as deptno_cnt from depts group by deptno;
OK
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: hive3_test.depts_agg
          Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
          Select Operator
            expressions: deptno (type: int), deptno_cnt (type: bigint)
            outputColumnNames: _col0, _col1
            Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.275 seconds, Fetched: 17 row(s)
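
Rewriting is not limited to queries that match the view definition exactly. The sketch below lists two variations that one would expect Calcite to be able to answer from depts_agg. Whether the optimizer actually chooses the view depends on the Hive version and on cost, so the comments state expectations rather than observed plans:

```sql
-- Extra filter on a grouping column: could be answered by filtering depts_agg.
EXPLAIN
SELECT deptno, count(1) AS deptno_cnt
FROM depts
WHERE deptno = 1001
GROUP BY deptno;

-- Coarser aggregation (roll-up): could be answered as SUM(deptno_cnt) over depts_agg.
EXPLAIN
SELECT count(1) AS total_cnt
FROM depts;
```

If the plan still shows a TableScan on depts, the view was not used, for example because it has become stale and needs a rebuild.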

Roadmap

Many improvements are planned:

  • Improving the rewriting algorithm inside Apache Calcite
  • Control distribution of data inside the view (SORT BY, CLUSTER BY, DISTRIBUTE BY)
  • Support UPDATE/DELETE in incremental rebuild of the view

References