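PostgreSQL 11 New Features Explained: Support for Parallel Hash Joins

PostgreSQL 11 improves parallelism, for example with support for parallel index builds (Parallel Index Build), parallel hash joins (Parallel Hash Join), and parallel CREATE TABLE .. AS. The previous post covered parallel index builds; this post covers Parallel Hash Join.

Preparing the test environment

Create the large table t_big and insert 50 million rows.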

CREATE TABLE t_big(
id int4,
name text,
create_time timestamp without time zone );

INSERT INTO t_big(id,name,create_time)
SELECT n, n|| '_test',clock_timestamp() FROM generate_series(1,50000000) n ;

Create the small table t_small and insert 8 million rows.

CREATE TABLE t_small(id int4, name text);

INSERT INTO t_small(id,name)
SELECT n, n|| '_small' FROM generate_series(1,8000000) n ;
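The original session does not show it, but after a bulk load like this it is common to refresh planner statistics before looking at execution plans, so that the row estimates do not depend on whether autovacuum has already analyzed the tables. A minimal sketch, assuming the two tables above were created in the current schema:

```sql
-- Assumption: not part of the original test; run once after the bulk inserts.
ANALYZE t_big;
ANALYZE t_small;

-- Check the row estimates the planner will now work from.
SELECT relname, reltuples::bigint AS estimated_rows
  FROM pg_class
 WHERE relname IN ('t_big', 't_small');
```

The estimates should be close to 50 million and 8 million rows; they are sampled, so they need not match exactly.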

Verifying parallel hash join

On PostgreSQL 10, the execution plan for the following SQL is:

des=> EXPLAIN SELECT t_small.name
  FROM t_big JOIN t_small ON (t_big.id = t_small.id)
       AND t_small.id < 100;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Gather  (cost=151870.58..661385.28 rows=4143 width=13)
   Workers Planned: 4
   ->  Hash Join  (cost=150870.58..659970.98 rows=1036 width=13)
         Hash Cond: (t_big.id = t_small.id)
         ->  Parallel Seq Scan on t_big  (cost=0.00..470246.58 rows=10358258 width=4)
         ->  Hash  (cost=150860.58..150860.58 rows=800 width=17)
               ->  Seq Scan on t_small  (cost=0.00..150860.58 rows=800 width=17)
                     Filter: (id < 100)
(8 rows)

On PostgreSQL 11, the execution plan for the same SQL is:

francs=> EXPLAIN SELECT t_small.name
  FROM t_big JOIN t_small ON (t_big.id = t_small.id)
       AND t_small.id < 100;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Gather  (cost=76862.42..615477.60 rows=800 width=13)
   Workers Planned: 4
   ->  Parallel Hash Join  (cost=75862.42..614397.60 rows=200 width=13)
         Hash Cond: (t_big.id = t_small.id)
         ->  Parallel Seq Scan on t_big  (cost=0.00..491660.86 rows=12499686 width=4)
         ->  Parallel Hash  (cost=75859.92..75859.92 rows=200 width=17)
               ->  Parallel Seq Scan on t_small  (cost=0.00..75859.92 rows=200 width=17)
                     Filter: (id < 100)
(8 rows)

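Compared with the PostgreSQL 10 plan, the difference is that version 11 uses a Parallel Hash Join, while version 10 uses a plain Hash Join. In the version 10 plan every worker runs its own Seq Scan on t_small and builds a private hash table; in the version 11 plan the workers split a Parallel Seq Scan on t_small and build one shared hash table. Parallel Hash Join is a new feature in version 11.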
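Both plans show "Workers Planned: 4", while the default for max_parallel_workers_per_gather is 2, so the test instances were presumably configured with a higher value; the original post does not show its settings. A small sketch for checking the relevant parameters on a PostgreSQL 11 instance (enable_parallel_hash does not exist in version 10):

```sql
-- Parallel-query settings that influence the plans above.
SELECT name, setting
  FROM pg_settings
 WHERE name IN ('max_worker_processes',
                'max_parallel_workers',
                'max_parallel_workers_per_gather',
                'enable_parallel_hash')
 ORDER BY name;
```

If max_parallel_workers_per_gather is lower than 4, the planner will plan fewer workers and the timings below will differ.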

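Parallel hash join performance test

How does performance change with parallel hash join enabled versus disabled? The tests below compare the two.

Parallel hash join enabled

Run the following SQL on PostgreSQL 11: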

francs=> EXPLAIN ANALYZE SELECT t_small.name
  FROM t_big JOIN t_small ON (t_big.id = t_small.id)
       AND t_small.id < 100;
                                                                QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=76862.42..615477.60 rows=800 width=13) (actual time=197.399..2738.010 rows=99 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Parallel Hash Join  (cost=75862.42..614397.60 rows=200 width=13) (actual time=2222.347..2729.943 rows=20 loops=5)
         Hash Cond: (t_big.id = t_small.id)
         ->  Parallel Seq Scan on t_big  (cost=0.00..491660.86 rows=12499686 width=4) (actual time=0.038..1330.836 rows=10000000 loops=5)
         ->  Parallel Hash  (cost=75859.92..75859.92 rows=200 width=17) (actual time=191.484..191.484 rows=20 loops=5)
               Buckets: 1024  Batches: 1  Memory Usage: 40kB
               ->  Parallel Seq Scan on t_small  (cost=0.00..75859.92 rows=200 width=17) (actual time=152.436..191.385 rows=20 loops=5)
                     Filter: (id < 100)
                     Rows Removed by Filter: 1599980
 Planning Time: 0.183 ms
 Execution Time: 2738.068 ms
(13 rows)

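The SQL above was run several times and the fastest run was taken; the execution time is 2738.068 ms.

Parallel hash join disabled

Setting the enable_parallel_hash parameter to off at the session level disables parallel hash join. Test how performance changes, as follows.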

francs=> set enable_parallel_hash = off;
SET

francs=> EXPLAIN ANALYZE SELECT t_small.name
  FROM t_big JOIN t_small ON (t_big.id = t_small.id)
       AND t_small.id < 100;
                                                                QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=151869.66..690486.34 rows=800 width=13) (actual time=996.137..3496.940 rows=99 loops=1)
   Workers Planned: 4
   Workers Launched: 4
   ->  Hash Join  (cost=150869.66..689406.34 rows=200 width=13) (actual time=2990.847..3490.557 rows=20 loops=5)
         Hash Cond: (t_big.id = t_small.id)
         ->  Parallel Seq Scan on t_big  (cost=0.00..491660.86 rows=12499686 width=4) (actual time=0.240..1392.062 rows=10000000 loops=5)
         ->  Hash  (cost=150859.66..150859.66 rows=800 width=17) (actual time=890.943..890.943 rows=99 loops=5)
               Buckets: 1024  Batches: 1  Memory Usage: 13kB
               ->  Seq Scan on t_small  (cost=0.00..150859.66 rows=800 width=17) (actual time=884.288..890.906 rows=99 loops=5)
                     Filter: (id < 100)
                     Rows Removed by Filter: 7999901
 Planning Time: 0.154 ms
 Execution Time: 3496.982 ms
(13 rows)

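The SQL above was again run several times and the fastest run was taken. With parallel hash join disabled, the execution time is 3496.982 ms, about 28% longer than with parallel hash join enabled. Most of the difference is in building the hash table: each of the 5 processes scans all of t_small itself (about 891 ms) instead of sharing one parallel scan (about 191 ms).

For this query, enabling parallel hash join therefore gives a substantial performance improvement.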
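After the comparison, the session-level override can be undone so that later queries in the same session use parallel hash join again. The sketch below also recomputes the relative slowdown from the two execution times reported above; the two numbers are copied from the EXPLAIN ANALYZE output, and nothing here is measured anew.

```sql
-- Restore enable_parallel_hash to its configured default for this session.
RESET enable_parallel_hash;
SHOW enable_parallel_hash;

-- Relative slowdown with parallel hash join disabled, in percent.
SELECT round((3496.982 - 2738.068) / 2738.068 * 100, 1) AS pct_longer;  -- 27.7
```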

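Book recommendation

Finally, I recommend 《PostgreSQL實戰》, which I co-wrote with 張文升. The book is based on PostgreSQL 10 and has 18 chapters, focusing on advanced SQL features, parallel query, partitioned tables, physical replication, logical replication, backup and recovery, high availability, performance tuning, PostGIS, and more, with a large number of hands-on examples.

Purchase link: https://item.jd.com/12405774.html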