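PostgreSQL: An Example of Optimizing a Compound Query (Multiple EXISTS, Range Search, IN Search, and Fuzzy Search Combined)
Tags
PostgreSQL , multiple EXISTS , range search , IN search , fuzzy search , combination , gin , recheck , filter , subplan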
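Background
When a single SQL statement combines several EXISTS clauses, a range search, an IN search, and a fuzzy (pattern) search, poorly chosen indexes can make the query slow.
Poorly chosen indexes typically cause three problems:
1. The index scans themselves take too long.
2. The bitmap scan introduces too many rechecks.
3. The subplans introduce too much filtering.
Here is a real-world example. The time is concentrated in the recheck and filter steps: each index scan returns a large number of rows, yet once the conditions are combined, zero rows qualify.
The root cause is that the indexes do not fit the query.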
->  Subquery Scan on "*SELECT* 2"  (cost=273453.65..432483146.70 rows=223 width=349) (actual time=25932.371..25932.371 rows=0 loops=1)
      Output: ...................................
      Buffers: shared hit=920071 read=269255
      I/O Timings: read=1552.767
      ->  Bitmap Heap Scan on zjxftypt.tab1010201 t_1  (cost=273453.65..432483144.47 rows=223 width=349) (actual time=25932.370..25932.370 rows=0 loops=1)
            Output: t_1.storeid, t_1.xfjbh, t_1.wtsd, t_1.rs, t_1.digoal123x, t_1.dz, t_1.blfsjd, t_1.qx, t_1.gk, t_1.xfrq, t_1.djsj, t_1.djdw, t_1.xfjclzt, t_1.digoal123, t_1.xfxs
            -- Recheck of the bitmap scan conditions: far too many rows are removed here
            Recheck Cond: ((t_1.xfrq < (to_date('2018-06-11'::character varying, 'yyyy-mm-dd'::character varying) + 1)) AND (t_1.xfrq >= to_date('2014-02-12'::character varying, 'yyyy-mm-dd'::character varying)) AND (t_1.digoal123 = 1::numeric))
            Rows Removed by Index Recheck: 1214155
            -- Filter that checks whether the EXISTS join conditions hold: far too many rows are removed here
            Filter: (((t_1.digoal123x)::text ~~ '%阿里巴巴%'::text) AND ((alternatives: SubPlan 4 or hashed SubPlan 5) OR (alternatives: SubPlan 6 or hashed SubPlan 7)))
            Rows Removed by Filter: 5215804
            Buffers: shared hit=920071 read=269255
            I/O Timings: read=1552.767
            -- Bitmap scan combining conditions 1 and 2
            ->  BitmapAnd  (cost=273453.65..273453.65 rows=4909643 width=0) (actual time=2510.718..2510.718 rows=0 loops=1)
                  Buffers: shared hit=27036 read=16539
                  I/O Timings: read=101.425
                  -- Own condition 1: too many rows match
                  ->  Bitmap Index Scan on index_tab1010201_xfrq  (cost=0.00..126565.99 rows=4943755 width=0) (actual time=1085.429..1085.429 rows=5268071 loops=1)
                        Index Cond: ((t_1.xfrq < (to_date('2018-06-11'::character varying, 'yyyy-mm-dd'::character varying) + 1)) AND (t_1.xfrq >= to_date('2014-02-12'::character varying, 'yyyy-mm-dd'::character varying)))
                        Buffers: shared hit=3288 read=16539
                        I/O Timings: read=101.425
                  -- Own condition 2: too many rows match
                  ->  Bitmap Index Scan on index_tab1010201_digoal123  (cost=0.00..146887.30 rows=6599316 width=0) (actual time=1355.825..1355.825 rows=6845646 loops=1)
                        Index Cond: (t_1.digoal123 = 1::numeric)
                        Buffers: shared hit=23748
            ..............sub plans
Optimization example
1. Reproduce the problem: create a test table
create table test(id int, c1 text, c2 date, c3 text);
The query is as follows:
select * from test
where
  (
    exists (select 1 from pg_class where oid::int = test.id)
    or
    exists (select 1 from pg_attribute where attrelid::int = test.id)
  )
  and c1 in ('1','2','3')
  and c2 between current_date-1 and current_date
  and c3 ~ 'abcdef';
2. Insert 10 million rows of test data
insert into test select id, (random()*10)::int::text, current_date, md5(random()::text) from generate_series(1,10000000) t(id);
3. Create an index that allows all of the rows to be filtered out at the index level
create extension pg_trgm;
create extension btree_gin;
create index idx_test_1 on test using gin (c1, c2, c3 gin_trgm_ops);
To reproduce the original problem instead, the two indexes should be these:
create index idx1 on test (c1);
create index idx2 on test (c2);
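The reproduction with the two b-tree indexes could be checked along the following lines. This is only a sketch: it assumes the GIN index idx_test_1 has not been created yet (or has been dropped), and it uses `create index if not exists`, which requires PostgreSQL 9.5 or later. Which plan the planner actually picks (BitmapAnd, a single bitmap scan, or a sequential scan) depends on the data distribution and settings, so no particular output is claimed here.

```sql
-- Sketch: look at the plan produced when only the two b-tree indexes exist.
-- Assumption: idx_test_1 (the GIN index) is absent; drop it first if needed.
create index if not exists idx1 on test (c1);
create index if not exists idx2 on test (c2);

-- Refresh statistics so the planner sees the 10 million inserted rows.
analyze test;

explain (analyze, verbose, timing, costs, buffers)
select * from test
where
  (
    exists (select 1 from pg_class where oid::int = test.id)
    or
    exists (select 1 from pg_attribute where attrelid::int = test.id)
  )
  and c1 in ('1','2','3')
  and c2 between current_date-1 and current_date
  and c3 ~ 'abcdef';
```

In the output, the lines worth comparing against the production plan are "Rows Removed by Index Recheck" and "Rows Removed by Filter".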
4. View the execution plan
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from test
where
  (
    exists (select 1 from pg_class where oid::int = test.id)
    or
    exists (select 1 from pg_attribute where attrelid::int=test.id)
  )
  and c1 in ('1','2','3')
  and c2 between current_date-1 and current_date
  and c3 ~ 'abcdef';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=156.43..8593.79 rows=228 width=43) (actual time=837.151..837.151 rows=0 loops=1)
   Output: test.id, test.c1, test.c2, test.c3
   -- The bitmap scan rechecks the conditions against the heap
   Recheck Cond: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 >= (CURRENT_DATE - 1)) AND (test.c2 <= CURRENT_DATE) AND (test.c3 ~ 'abcdef'::text))
   Rows Removed by Index Recheck: 1
   -- Condition check and filtering for the EXISTS clauses
   Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4))
   Rows Removed by Filter: 7
   Heap Blocks: exact=8
   Buffers: shared hit=11658 read=23
   -- All conditions are pushed down into the composite GIN index
   -- With multiple conditions, GIN automatically performs an internal bitmap scan
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..156.37 rows=304 width=0) (actual time=834.418..834.418 rows=8 loops=1)
         Index Cond: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 >= (CURRENT_DATE - 1)) AND (test.c2 <= CURRENT_DATE) AND (test.c3 ~ 'abcdef'::text))
         Buffers: shared hit=11582 read=23
   SubPlan 1
     ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..15.84 rows=1 width=0) (never executed)
           Filter: ((pg_class.oid)::integer = test.id)
   SubPlan 2
     ->  Seq Scan on pg_catalog.pg_class pg_class_1  (cost=0.00..14.87 rows=387 width=4) (actual time=0.014..0.155 rows=388 loops=1)
           Output: (pg_class_1.oid)::integer
           Buffers: shared hit=11
   SubPlan 3
     ->  Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute  (cost=0.28..84.39 rows=8 width=0) (never executed)
           Filter: ((pg_attribute.attrelid)::integer = test.id)
           Heap Fetches: 0
   SubPlan 4
     ->  Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute pg_attribute_1  (cost=0.28..77.13 rows=2904 width=4) (actual time=0.029..1.081 rows=2941 loops=1)
           Output: (pg_attribute_1.attrelid)::integer
           Heap Fetches: 459
           Buffers: shared hit=57
 Planning time: 1.070 ms
 Execution time: 839.834 ms
(29 rows)
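This looks reasonable, but a closer look shows that not much has actually been optimized, and there is room to do considerably better.
5. To optimize further, you need to understand how a composite GIN index executes internally (it performs a bitmap scan).
Very few rows satisfy the c3 condition on its own, so the internal GIN bitmap scan across c1, c2, and c3 is not needed at all.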
postgres=# select count(*) from test where c3 ~ 'abcdef';
 count
-------
    23
(1 row)
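To compare the selectivity of all three conditions side by side, one possible approach is a single pass with aggregate FILTER clauses (available since PostgreSQL 9.4). This is a sketch against the test table above; the column aliases are made up for illustration, and the query scans the whole table, so on 10 million rows it takes a while.

```sql
-- Sketch: count how many rows each condition matches on its own.
-- The aliases (total_rows, c1_match, ...) are illustrative names only.
select
  count(*)                                                          as total_rows,
  count(*) filter (where c1 in ('1','2','3'))                       as c1_match,
  count(*) filter (where c2 between current_date-1 and current_date) as c2_match,
  count(*) filter (where c3 ~ 'abcdef')                             as c3_match
from test;
```

A condition whose match count is a tiny fraction of total_rows is a candidate for its own index; conditions that match a large share of the table contribute little at the index level and mostly add merge cost.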
Change the index to the following:
postgres=# drop index idx_test_1 ;
DROP INDEX
postgres=# create index idx_test_1 on test using gin (c3 gin_trgm_ops) ;
CREATE INDEX
6. The execution time drops to 24 ms
postgres=# explain (analyze,verbose,timing,costs,buffers)
select * from test
where
  (
    exists (select 1 from pg_class where oid::int = test.id)
    or
    exists (select 1 from pg_attribute where attrelid::int=test.id)
  )
  and c1 in ('1','2','3')
  and c2 between current_date-1 and current_date
  and c3 ~ 'abcdef';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=53.76..27798.16 rows=228 width=43) (actual time=24.287..24.287 rows=0 loops=1)
   Output: test.id, test.c1, test.c2, test.c3
   Recheck Cond: (test.c3 ~ 'abcdef'::text)
   Rows Removed by Index Recheck: 6
   Filter: ((test.c1 = ANY ('{1,2,3}'::text[])) AND (test.c2 <= CURRENT_DATE) AND (test.c2 >= (CURRENT_DATE - 1)) AND ((alternatives: SubPlan 1 or hashed SubPlan 2) OR (alternatives: SubPlan 3 or hashed SubPlan 4)))
   Rows Removed by Filter: 23
   Heap Blocks: exact=29
   Buffers: shared hit=226
   ->  Bitmap Index Scan on idx_test_1  (cost=0.00..53.70 rows=1000 width=0) (actual time=21.517..21.517 rows=29 loops=1)
         Index Cond: (test.c3 ~ 'abcdef'::text)
         Buffers: shared hit=128
   SubPlan 1
     ->  Seq Scan on pg_catalog.pg_class  (cost=0.00..15.84 rows=1 width=0) (never executed)
           Filter: ((pg_class.oid)::integer = test.id)
   SubPlan 2
     ->  Seq Scan on pg_catalog.pg_class pg_class_1  (cost=0.00..14.87 rows=387 width=4) (actual time=0.011..0.156 rows=387 loops=1)
           Output: (pg_class_1.oid)::integer
           Buffers: shared hit=11
   SubPlan 3
     ->  Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute  (cost=0.28..84.39 rows=8 width=0) (never executed)
           Filter: ((pg_attribute.attrelid)::integer = test.id)
           Heap Fetches: 0
   SubPlan 4
     ->  Index Only Scan using pg_attribute_relid_attnum_index on pg_catalog.pg_attribute pg_attribute_1  (cost=0.28..77.13 rows=2904 width=4) (actual time=0.028..1.099 rows=2938 loops=1)
           Output: (pg_attribute_1.attrelid)::integer
           Heap Fetches: 456
           Buffers: shared hit=58
 Planning time: 0.801 ms
 Execution time: 24.403 ms
(29 rows)

Time: 26.052 ms
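Summary
The SQL in this article is fairly complex, but the optimization approach is much the same as for any other query. What this example adds is a look at the overhead that a bitmap scan, and the bitmap scan inside a GIN index, can introduce when it has to merge large sets of rows.
The starting point is still explain: find where the time goes, find the underlying cause, and fix it.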
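Alongside explain, the planner's own column statistics can hint at which conditions are selective before any index is built. The following is a sketch that reads the standard pg_stats view for the test table used in this article; it assumes the table lives in the public schema and that analyze has been run on it.

```sql
-- Sketch: inspect the planner's statistics for the three filtered columns.
-- Assumption: the table is public.test and has been analyzed.
analyze test;

select attname,
       n_distinct,          -- few distinct values usually means poor selectivity for equality/IN
       most_common_vals,
       most_common_freqs    -- frequency of each most-common value, as a fraction of rows
from pg_stats
where schemaname = 'public'
  and tablename  = 'test'
  and attname in ('c1', 'c2', 'c3');
```

These statistics do not describe pattern matches such as c3 ~ 'abcdef' directly, so for the fuzzy condition an actual count is still the more reliable check.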
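1. When should a composite GIN index be used?
When every condition has poor selectivity on its own, so that only their combination narrows the result, use a composite index.
2. When should a non-composite (single-column) GIN index be used?
When one condition has very good selectivity, pull it out into an independent index. In this example the fuzzy-search column c3 filters very well, so it should be indexed on its own.
Put another way, columns with poor selectivity should not be put into the index. If they really must be included, it would be better to wait until PG provides partitioned indexes and use such a column as the partition key of the index (multiple trees), or to use a partial index.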
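The partial-index option mentioned above could look roughly like this for the test table. It is a sketch, not a recommendation for this particular query: the index name idx_test_c3_part is hypothetical, and it assumes the query always filters on the same fixed c1 list. The c2 condition cannot be moved into the predicate, because current_date is not immutable and index predicates may only use immutable functions.

```sql
-- Sketch: keep the poorly selective c1 condition out of the index keys
-- by turning it into the predicate of a partial GIN index on c3.
-- idx_test_c3_part is a hypothetical name used only for illustration.
create index idx_test_c3_part on test
  using gin (c3 gin_trgm_ops)
  where c1 in ('1','2','3');

-- The query must repeat the predicate so the planner can match the partial index.
explain
select * from test
where c1 in ('1','2','3')
  and c3 ~ 'abcdef';
```

The index then only contains rows that already satisfy the c1 condition, so it is smaller than a full index on c3, at the cost of being usable only by queries whose WHERE clause implies that predicate.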