PostgreSQL: Second-Level Responses for Regular-Expression and Fuzzy Queries over Tens of Billions of Rows

阿新 • Published: 2019-02-07

Original article: https://yq.aliyun.com/articles/7444?spm=5176.blog7549.yqblogcon1.6.2wcXO2

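Regular-expression matching and fuzzy matching are usually the home turf of search engines, but a PostgreSQL database can do both as well, with respectable performance. Combined with a distributed setup (for example plproxy, pg_shard, fdw shard, pg-xc, pg-xl, or greenplum), it handles regular-expression and fuzzy matching over more than ten billion rows very well, without giving up any of the database's native features.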

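The Internet of Things produces huge amounts of data. Besides numeric data there is also string data such as barcodes, license plates, phone numbers, email addresses, and names.
Suppose users need to run fuzzy searches, or even regular-expression matches, over a large volume of sensor data. Is there an efficient way to do it?
Such scenarios are common. For example, a batch of drugs on the market turns out to be possibly defective, and a regular-expression query over the drug barcodes is needed to trace where the matching drugs went.
Another example is searching leads during an investigation, such as a partial phone number, email address, license plate, IP address, QQ number, or WeChat ID provided by a user.
Combining this information with time, then fuzzy matching and correlating it, eventually identifies the offender.
Fuzzy matching and regular-expression matching are thus a bit like assembling a facial composite, and the demand for them is pressing.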

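First, let's classify the application scenarios and the optimizations available with current technology.
1. Fuzzy query with a prefix, for example like 'ABC%', which in PG can also be written as ~ '^ABC'
This can use a btree index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).

2. Fuzzy query with a suffix, for example like '%ABC', which in PG can also be written as ~ 'ABC$'
This can use a btree index on the reverse function, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).

3. Fuzzy query with neither prefix nor suffix, for example like '%AB_C%', which in PG can also be written as ~ 'AB.C'
This can use a pg_trgm gin index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).

4. Regular-expression query, for example ~ '[\d]+def1.?[a|b|0|8]{1,3}'
This can use a pg_trgm gin index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).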
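
The classification above can be sketched as a small helper. This is only an illustration: the function name suggest_index and the returned labels are hypothetical, and escape characters in the LIKE pattern are ignored.

```python
def suggest_index(like_pattern):
    """Map a LIKE pattern to the index strategy from the list above.

    Hypothetical helper for illustration; it ignores LIKE escape characters.
    """
    wildcards = {"%", "_"}
    if not any(ch in wildcards for ch in like_pattern):
        return "btree on the column (plain equality)"
    starts_with_wildcard = like_pattern[:1] in wildcards
    ends_with_wildcard = like_pattern[-1:] in wildcards
    if not starts_with_wildcard:
        # A literal prefix exists, so a range scan on the column is possible.
        return "btree on the column"
    if not ends_with_wildcard:
        # A literal suffix exists, so index the reversed string instead.
        return "btree on reverse(column)"
    # Wildcards on both sides: only a trigram index can narrow the search.
    return "gin index with pg_trgm"


for pattern in ("ABC%", "%ABC", "%AB_C%"):
    print(pattern, "->", suggest_index(pattern))
```

Note that in PostgreSQL a btree index serves a prefix LIKE only under the C locale or with an operator class such as text_pattern_ops, so the first two answers assume one of those is in place.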

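Since 9.1 the PostgreSQL pg_trgm extension has supported index use for fuzzy queries, and since 9.3 it has supported index use for regular-expression queries, which greatly strengthens PostgreSQL for criminal-investigation workloads.
For the code, see
https://github.com/postgrespro/pg_trgm_pro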

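How pg_trgm works: it prepends two spaces and appends one space to the string, then splits the resulting string into tokens made of every three adjacent characters.
When a regular expression or fuzzy query is matched, the index first retrieves candidates according to how closely their tokens match, and the candidates are then filtered.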
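
The tokenization rule can be reproduced in a few lines of Python. This is a minimal sketch that assumes the value is a single alphanumeric word (the real extension also splits on non-word characters); the function names trigrams and similarity are illustrative, and the similarity formula used here is shared trigrams divided by the union of both trigram sets.

```python
def trigrams(word):
    """Pad with two leading spaces and one trailing space, then cut into 3-char tokens."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Share of trigrams the two strings have in common (illustrative)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("376821ab")))
print(similarity("376821ab", "376821ac"))
```

An 8-character value such as 376821ab therefore yields 9 tokens, which is the number of index entries one row touches.
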
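[Figure: GIN index structure]

When a matching token is found in the btree, it points to the corresponding list, and the ctids stored in that list lead to the matching records.
Because one string is split into many tokens, every inserted record updates multiple index entries, which is why a GIN index needs fastupdate.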
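
The token-to-list layout can be pictured as an inverted index. The sketch below is a toy model, not the GIN implementation: the class name TinyTrigramIndex is hypothetical, a dict stands in for the btree of tokens, and integer row ids stand in for ctids.

```python
from collections import defaultdict


def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TinyTrigramIndex:
    """Toy inverted index: token -> set of row ids (stand-in for a ctid list)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def insert(self, row_id, value):
        """Add one row and return how many index entries were touched."""
        tokens = trigrams(value)
        for token in tokens:
            self.postings[token].add(row_id)
        return len(tokens)


index = TinyTrigramIndex()
print(index.insert(1, "376821ab"))   # one row, many entries updated
print(index.insert(2, "e65821ab"))
print(sorted(index.postings["821"]))  # rows sharing the token "821"
```
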
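How does regular-expression matching work?
See https://raw.githubusercontent.com/postgrespro/pg_trgm_pro/master/trgm_regexp.c
In essence it converts the regular expression into an NFA, then scans multiple tokens and combines them with bit and / or.
If the regular expression converts into many bit and / or combinations, a lot of rechecking is needed and performance will not be great.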
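
The AND-then-recheck idea can be sketched for the simplest possible pattern, an unanchored literal such as '5821a'. This deliberately skips the NFA step in trgm_regexp.c and handles only literals; search_literal and the sample rows are hypothetical.

```python
import re


def trigrams(word):
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def search_literal(rows, literal):
    """Return (candidates, matches) for the pattern ~ 'literal'."""
    postings = {}
    for row_id, value in rows.items():
        for token in trigrams(value):
            postings.setdefault(token, set()).add(row_id)

    # An unanchored literal contributes only its inner trigrams (no padding).
    # Literals shorter than 3 characters yield no trigrams, so every row is a candidate.
    required = {literal[i:i + 3] for i in range(len(literal) - 2)}

    candidates = set(rows)
    for token in required:
        candidates &= postings.get(token, set())  # AND of the token lists

    pattern = re.compile(re.escape(literal))
    matches = {r for r in candidates if pattern.search(rows[r])}  # recheck
    return candidates, matches


rows = {1: "e65821ab", 2: "821a5820", 3: "376821ab"}
candidates, matches = search_literal(rows, "5821a")
print(sorted(candidates), sorted(matches))
```

Row 2 contains all three required tokens (582, 821, 21a) but not in sequence, so it survives the index step and is only discarded by the recheck. The more such false candidates a pattern produces, the more time the recheck costs.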

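The following examples show how to optimize each of the four scenarios above.

1. Fuzzy query with a prefix, for example like 'ABC%', which in PG can also be written as ~ '^ABC'
This can use a btree index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).
Example: the first 8 characters of 10 million randomly generated MD5 values.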

postgres=# create table tb(info text);  
CREATE TABLE  
postgres=# insert into tb select substring(md5(random()::text),1,8) from generate_series(1,10000000);  
INSERT 0 10000000  
postgres=# create index idx_tb on tb(info);  
CREATE INDEX  
postgres=# select * from tb limit 1;  
   info     
----------  
 376821ab  
(1 row)  
postgres=# explain select * from tb where info ~ '^376821' limit 10;  
                                  QUERY PLAN                                     
-------------------------------------------------------------------------------  
 Limit  (cost=0.43..0.52 rows=10 width=9)  
   ->  Index Only Scan using idx_tb on tb  (cost=0.43..8.46 rows=1000 width=9)  
         Index Cond: ((info >= '376821'::text) AND (info < '376822'::text))  
         Filter: (info ~ '^376821'::text)  
(4 rows)  
postgres=# select * from tb where info ~ '^376821' limit 10;  
   info     
----------  
 376821ab  
(1 row)  
Time: 0.536 ms  
postgres=# set enable_indexscan=off;  
SET  
Time: 1.344 ms  
postgres=# set enable_bitmapscan=off;  
SET  
Time: 0.158 ms  
postgres=# explain select * from tb where info ~ '^376821' limit 10;  
                           QUERY PLAN                             
----------------------------------------------------------------  
 Limit  (cost=0.00..1790.55 rows=10 width=9)  
   ->  Seq Scan on tb  (cost=0.00..179055.00 rows=1000 width=9)  
         Filter: (info ~ '^376821'::text)  
(3 rows)  
Time: 0.505 ms

With the index, the prefix fuzzy query takes only about 0.5 ms.
Without an index (index scan and bitmap scan disabled), it takes 5483 ms:

postgres=# select * from tb where info ~ '^376821' limit 10;  
   info     
----------  
 376821ab  
(1 row)  
Time: 5483.655 ms
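
The Index Cond in the plan, info >= '376821' AND info < '376822', shows that a prefix match becomes a range scan over sorted values. A minimal sketch of the same idea with a sorted Python list follows; it assumes plain code-point ordering (comparable to the C locale) and a last prefix character that can simply be incremented. The function names are illustrative.

```python
from bisect import bisect_left


def prefix_upper_bound(prefix):
    """Smallest string greater than every string starting with prefix.

    Assumes code-point ordering and a last character that can be incremented.
    """
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)


def prefix_scan(sorted_values, prefix):
    """Return all values in [prefix, upper bound) using two binary searches."""
    lo = bisect_left(sorted_values, prefix)
    hi = bisect_left(sorted_values, prefix_upper_bound(prefix))
    return sorted_values[lo:hi]


values = sorted(["376821ab", "376822ff", "37681fff", "a0000000", "376821cd"])
print(prefix_upper_bound("376821"))
print(prefix_scan(values, "376821"))
```

Two binary searches replace a scan of every value, which is why the indexed query finishes in well under a millisecond.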

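2. Fuzzy query with a suffix, for example like '%ABC', which in PG can also be written as ~ 'ABC$'
This can use a btree index on the reverse function, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).

postgres=# create index idx_tb1 on tb(reverse(info));  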
CREATE INDEX  
postgres=# explain select * from tb where reverse(info) ~ '^ba128' limit 10;  
                                         QUERY PLAN                                           
--------------------------------------------------------------------------------------------  
 Limit  (cost=0.43..28.19 rows=10 width=9)  
   ->  Index Scan using idx_tb1 on tb  (cost=0.43..138778.43 rows=50000 width=9)  
         Index Cond: ((reverse(info) >= 'ba128'::text) AND (reverse(info) < 'ba129'::text))  
         Filter: (reverse(info) ~ '^ba128'::text)  
(4 rows)  

postgres=# select * from tb where reverse(info) ~ '^ba128' limit 10;  
   info     
----------  
 220821ab  
 671821ab  
 305821ab  
 e65821ab  
 536821ab  
 376821ab  
 668821ab  
 4d8821ab  
 26c821ab  
(9 rows)  
Time: 0.506 ms

With the index, the suffix fuzzy query takes only about 0.5 ms.
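
The reverse trick turns a suffix match into a prefix match on the reversed string, which can then be served by the same kind of range scan. A small sketch under the same code-point ordering assumption; suffix_scan and the sample values are hypothetical.

```python
from bisect import bisect_left


def suffix_scan(values, suffix):
    """Find values ending with suffix via a sorted list of reversed strings."""
    # Stand-in for the btree on reverse(info): (reversed value, original value).
    index = sorted((v[::-1], v) for v in values)
    keys = [key for key, _ in index]

    start = suffix[::-1]                               # 'ab128' style prefix
    stop = start[:-1] + chr(ord(start[-1]) + 1)        # exclusive upper bound
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    return [original for _, original in index[lo:hi]]


values = ["376821ab", "e65821ab", "12345678", "0082100b"]
print(suffix_scan(values, "821ab"))
```
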

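3. Fuzzy query with neither prefix nor suffix, for example like '%AB_C%', which in PG can also be written as ~ 'AB.C'
This can use a pg_trgm gin index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).

postgres=# create extension pg_trgm;  
postgres=# create index idx_tb_2 on tb using gin (info gin_trgm_ops);  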
postgres=# explain select * from tb where info ~ '5821a';  
                                 QUERY PLAN                                   
----------------------------------------------------------------------------  
 Bitmap Heap Scan on tb  (cost=103.75..3677.71 rows=1000 width=9)  
   Recheck Cond: (info ~ '5821a'::text)  
   ->  Bitmap Index Scan on idx_tb_2  (cost=0.00..103.50 rows=1000 width=0)  
         Index Cond: (info ~ '5821a'::text)  
(4 rows)  
Time: 0.647 ms  

postgres=# select * from tb where info ~ '5821a';  
   info     
----------  
 5821a8a3  
 945821af  
 45821a74  
 9fe5821a  
 5821a7e0  
 5821af2a  
 1075821a  
 e5821ac9  
 d265821a  
 45f5821a  
 df5821a4  
 de5821af  
 71c5821a  
 375821a3  
 fc5821af  
 5c5821ad  
 e65821ab  
 5821adde  
 c35821a6  
 5821a642  
 305821ab  
 5821a1c8  
 75821a5c  
 ce95821a  
 a65821ad  
(25 rows)  
Time: 3.808 ms

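With the index, the fuzzy query with wildcards on both sides takes only 3.8 ms.

4. Regular-expression query, for example ~ '[\d]+def1.?[a|b|0|8]{1,3}'
This can use a pg_trgm gin index, or the string can be split into columns and the per-column indexes combined with bit and / bit or (only suitable for short fixed-length strings, for example char(8)).
With the index, the regular-expression query takes about 108 ms.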

postgres=# select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;  
   info     
----------  
 4e6567ab  
 1e6530ab  
 e6500ab8  
 ae6583ab  
 e6564ab7  
 5e6532ab  
 e6526abf  
 e6560ab6  
(8 rows)  
Time: 108.577 ms

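Most of the time goes into eliminating rows that do not match.
The index scan returned 14794 rows and the recheck removed 14793 of them. A lot of time is wasted on useless work, but this is still much better than a full table scan.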

postgres=# explain (verbose,analyze,buffers,costs,timing) select * from tb where info ~ 'e65[\d]{2}a[b]{1,2}8' limit 10;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=511.75..547.49 rows=10 width=9) (actual time=89.934..120.567 rows=1 loops=1)  
   Output: info  
   Buffers: shared hit=13054  
   ->  Bitmap Heap Scan on public.tb  (cost=511.75..4085.71 rows=1000 width=9) (actual time=89.930..120.562 rows=1 loops=1)  
         Output: info  
         Recheck Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)  
         Rows Removed by Index Recheck: 14793  
         Heap Blocks: exact=12929  
         Buffers: shared hit=13054  
         ->  Bitmap Index Scan on idx_tb_2  (cost=0.00..511.50 rows=1000 width=0) (actual time=67.589..67.589 rows=14794 loops=1)  
               Index Cond: (tb.info ~ 'e65[\d]{2}a[b]{1,2}8'::text)  
               Buffers: shared hit=125  
 Planning time: 0.493 ms  
 Execution time: 120.618 ms  
(14 rows)  
Time: 124.693 ms

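Optimization:
With a gin index, write performance has to be considered. The info column is broken into many 3-character tokens, so each row touches a large number of index entries. Under highly concurrent inserts it is best to raise gin_pending_list_limit, which improves insert throughput and reduces the response-time (RT) increase caused by merging into the index in real time.
With fastupdate enabled, the pending entries are merged into the GIN index automatically every time the table is vacuumed.
One more point: queries do not trigger a merge. GIN entries that have not been merged yet are searched by scanning through them.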
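
The trade-off behind fastupdate can be shown with a toy model. This is a sketch of the idea only: the class PendingListIndex is hypothetical, and the limit is counted in entries here, whereas gin_pending_list_limit is a size limit.

```python
class PendingListIndex:
    """Toy model of a GIN index with a pending list (fastupdate)."""

    def __init__(self, pending_limit):
        self.main = {}          # token -> set of row ids, already merged
        self.pending = []       # unmerged (row_id, tokens) entries
        self.pending_limit = pending_limit  # assumption: counted in entries

    def insert(self, row_id, tokens):
        """Cheap append; merge only when the pending list grows too long."""
        self.pending.append((row_id, set(tokens)))
        if len(self.pending) > self.pending_limit:
            self.merge()

    def merge(self):
        """What a vacuum (or an overflow) does: fold pending entries into main."""
        for row_id, tokens in self.pending:
            for token in tokens:
                self.main.setdefault(token, set()).add(row_id)
        self.pending.clear()

    def lookup(self, token):
        """Main structure lookup plus a linear pass over unmerged entries."""
        found = set(self.main.get(token, ()))
        found |= {row_id for row_id, tokens in self.pending if token in tokens}
        return found


index = PendingListIndex(pending_limit=2)
index.insert(1, ["abc", "bcd"])
index.insert(2, ["bcd"])
print(sorted(index.lookup("bcd")), len(index.pending))
index.insert(3, ["abc"])            # exceeds the limit and triggers a merge
print(sorted(index.lookup("abc")), len(index.pending))
```

A larger limit makes inserts cheaper but leaves more unmerged entries for every query to walk through, which is why the test setup below pairs a large pending list with very frequent autovacuum runs.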

Stress-testing performance under high concurrency:

create table tbl(id serial8, crt_time timestamp, sensorid int, sensorloc point, info text) with (autovacuum_enabled=on, autovacuum_vacuum_threshold=0.000001,autovacuum_vacuum_cost_delay=0);  
CREATE INDEX trgm_idx ON tbl USING GIN (info gin_trgm_ops) with (fastupdate='on', gin_pending_list_limit='6553600');  
alter sequence tbl_id_seq cache 10000;

Change the configuration so that autovacuum iterates quickly and merges the gin pending entries.

vi $PGDATA/postgresql.conf  
autovacuum_naptime=1s  
maintenance_work_mem=1GB  
autovacuum_work_mem=1GB  
autovacuum = on  
autovacuum_max_workers = 3  
log_autovacuum_min_duration = 0  
autovacuum_vacuum_cost_delay=0  

$ pg_ctl reload

Create a test function that generates random test data.

postgres=# create or replace function f() returns void as $$  
  insert into tbl (crt_time,sensorid,info) values ( clock_timestamp(),trunc(random()*500000),substring(md5(random()::text),1,8) );  
$$ language sql strict;  

vi test.sql  
select f();  

pgbench -M prepared -n -r -P 1 -f ./test.sql -c 48 -j 48 -T 10000  

progress: 50.0 s, 52800.9 tps, lat 0.453 ms stddev 0.390  
progress: 51.0 s, 52775.8 tps, lat 0.453 ms stddev 0.398  
progress: 52.0 s, 53173.2 tps, lat 0.449 ms stddev 0.371  
progress: 53.0 s, 53010.0 tps, lat 0.451 ms stddev 0.390  
progress: 54.0 s, 53360.9 tps, lat 0.448 ms stddev 0.365  
progress: 55.0 s, 53285.0 tps, lat 0.449 ms stddev 0.362  
progress: 56.0 s, 53662.1 tps, lat 0.445 ms stddev 0.368  
progress: 57.0 s, 53283.8 tps, lat 0.448 ms stddev 0.385  
progress: 58.0 s, 53703.4 tps, lat 0.445 ms stddev 0.355  
progress: 59.0 s, 53818.7 tps, lat 0.444 ms stddev 0.344  
progress: 60.0 s, 53889.2 tps, lat 0.443 ms stddev 0.361  
progress: 61.0 s, 53613.8 tps, lat 0.446 ms stddev 0.355  
progress: 62.0 s, 53339.9 tps, lat 0.448 ms stddev 0.392  
progress: 63.0 s, 54014.9 tps, lat 0.442 ms stddev 0.346  
progress: 64.0 s, 53112.1 tps, lat 0.450 ms stddev 0.374  
progress: 65.0 s, 53706.1 tps, lat 0.445 ms stddev 0.367  
progress: 66.0 s, 53720.9 tps, lat 0.445 ms stddev 0.353  
progress: 67.0 s, 52858.1 tps, lat 0.452 ms stddev 0.415  
progress: 68.0 s, 53218.9 tps, lat 0.449 ms stddev 0.387  
progress: 69.0 s, 53403.0 tps, lat 0.447 ms stddev 0.377  
progress: 70.0 s, 53179.9 tps, lat 0.449 ms stddev 0.377  
progress: 71.0 s, 53232.4 tps, lat 0.449 ms stddev 0.373  
progress: 72.0 s, 53011.7 tps, lat 0.451 ms stddev 0.386  
progress: 73.0 s, 52685.1 tps, lat 0.454 ms stddev 0.384  
progress: 74.0 s, 52937.8 tps, lat 0.452 ms stddev 0.377

At this rate, more than 4 billion rows can be loaded per day.
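
The daily figure follows directly from the pgbench output. The calculation below takes roughly 53,000 tps as a representative value from the progress lines and assumes the rate is sustained for 24 hours.

```python
tps = 53_000                      # approximate rate from the pgbench progress lines
seconds_per_day = 24 * 60 * 60    # 86,400
rows_per_day = tps * seconds_per_day

print(f"{rows_per_day:,} rows per day")
```
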

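Next, compare this with the string-splitting approach. It applies only when the string length is fixed and small; if the length is not fixed, the method does not work.
The test results with the split method are unsatisfactory, so pg_trgm remains the more dependable choice.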
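
The split-column idea amounts to one index per character position, combined with AND across positions and OR within a position. The sketch below is a toy model with hypothetical function names; sets of row ids stand in for the bitmaps.

```python
def build_position_indexes(rows, width=8):
    """One 'index' per character position: character -> set of row ids."""
    indexes = [dict() for _ in range(width)]
    for row_id, value in rows.items():
        if len(value) != width:
            raise ValueError("the split approach needs fixed-length strings")
        for pos, ch in enumerate(value):
            indexes[pos].setdefault(ch, set()).add(row_id)
    return indexes


def match_positions(indexes, conditions):
    """conditions maps a 1-based position to the characters allowed there."""
    result = None
    for pos, allowed in conditions.items():
        ids = set()
        for ch in allowed:
            ids |= indexes[pos - 1].get(ch, set())              # OR within a position
        result = ids if result is None else result & ids        # AND across positions
    return result if result is not None else set()


rows = {1: "33eed779", 2: "3fed9479", 3: "3aeed7f9"}
indexes = build_position_indexes(rows)
digits = "0123456789"
conditions = {1: "3", 2: digits, 3: "e", 4: "e", 5: "d", 6: digits, 7: "7", 8: "9"}
print(sorted(match_positions(indexes, conditions)))
```

Each single-position condition matches a large share of the table (about one sixteenth for hexadecimal data), so the individual sets are large before they are intersected. This is consistent with the slower timings of the split queries in the transcript below.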

postgres=# create table t_split(id int, crt_time timestamp, sensorid int, sensorloc point, info text, c1 char(1), c2 char(1), c3 char(1), c4 char(1), c5 char(1), c6 char(1), c7 char(1), c8 char(1));  
CREATE TABLE  
Time: 2.123 ms  

postgres=# insert into t_split(id,crt_time,sensorid,info,c1,c2,c3,c4,c5,c6,c7,c8) select id,ct,sen,info,substring(info,1,1),substring(info,2,1),substring(info,3,1),substring(info,4,1),substring(info,5,1),substring(info,6,1),substring(info,7,1),substring(info,8,1) from (select id, clock_timestamp() ct, trunc(random()*500000) sen, substring(md5(random()::text), 1, 8) info from generate_series(1,10000000) t(id)) t;  
INSERT 0 10000000  
Time: 81829.274 ms  

postgres=# create index idx1 on t_split (c1);  
postgres=# create index idx2 on t_split (c2);  
postgres=# create index idx3 on t_split (c3);  
postgres=# create index idx4 on t_split (c4);  
postgres=# create index idx5 on t_split (c5);  
postgres=# create index idx6 on t_split (c6);  
postgres=# create index idx7 on t_split (c7);  
postgres=# create index idx8 on t_split (c8);  
postgres=# create index idx9 on t_split using gin (info gin_trgm_ops);  

postgres=# select * from t_split limit 1;  
 id |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
  1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
(1 row)  

postgres=# select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;  
 id |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
  1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
(1 row)  
Time: 133.041 ms  
postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^3[\d]?eed[\d]?79$' limit 10;  
                                                            QUERY PLAN                                                              
----------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=575.75..612.78 rows=10 width=57) (actual time=92.406..129.838 rows=1 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
   Buffers: shared hit=13798  
   ->  Bitmap Heap Scan on public.t_split  (cost=575.75..4278.56 rows=1000 width=57) (actual time=92.403..129.833 rows=1 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
         Recheck Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)  
         Rows Removed by Index Recheck: 14690  
         Heap Blocks: exact=13669  
         Buffers: shared hit=13798  
         ->  Bitmap Index Scan on idx9  (cost=0.00..575.50 rows=1000 width=0) (actual time=89.576..89.576 rows=14691 loops=1)  
               Index Cond: (t_split.info ~ '^3[\d]?eed[\d]?79$'::text)  
               Buffers: shared hit=129  
 Planning time: 0.385 ms  
 Execution time: 129.883 ms  
(14 rows)  

Time: 130.678 ms  


postgres=# select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;  
 id |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
----+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
  1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
(1 row)  

Time: 337.367 ms  

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c3='e' and c4='e' and c5='d' and c7='7' and c8='9' and c2 between '0' and '9' and c6 between '0' and '9' limit 10;  
                                                                                                                 QUERY PLAN                                                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.230..344.675 rows=1 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
   Buffers: shared hit=7581  
   ->  Bitmap Heap Scan on public.t_split  (cost=33582.31..41499.35 rows=1 width=57) (actual time=339.228..344.673 rows=1 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
         Recheck Cond: ((t_split.c3 = 'e'::bpchar) AND (t_split.c8 = '9'::bpchar) AND (t_split.c5 = 'd'::bpchar))  
         Filter: ((t_split.c2 >= '0'::bpchar) AND (t_split.c2 <= '9'::bpchar) AND (t_split.c6 >= '0'::bpchar) AND (t_split.c6 <= '9'::bpchar) AND (t_split.c1 = '3'::bpchar) AND (t_split.c4 = 'e'::bpchar) AND (t_split.c7 = '7'::bpchar))  
         Rows Removed by Filter: 2480  
         Heap Blocks: exact=2450  
         Buffers: shared hit=7581  
         ->  BitmapAnd  (cost=33582.31..33582.31 rows=2224 width=0) (actual time=338.512..338.512 rows=0 loops=1)  
               Buffers: shared hit=5131  
               ->  Bitmap Index Scan on idx3  (cost=0.00..11016.93 rows=596333 width=0) (actual time=104.418..104.418 rows=624930 loops=1)  
                     Index Cond: (t_split.c3 = 'e'::bpchar)  
                     Buffers: shared hit=1711  
               ->  Bitmap Index Scan on idx8  (cost=0.00..11245.44 rows=608667 width=0) (actual time=100.185..100.185 rows=625739 loops=1)  
                     Index Cond: (t_split.c8 = '9'::bpchar)  
                     Buffers: shared hit=1712  
               ->  Bitmap Index Scan on idx5  (cost=0.00..11319.44 rows=612667 width=0) (actual time=99.480..99.480 rows=624269 loops=1)  
                     Index Cond: (t_split.c5 = 'd'::bpchar)  
                     Buffers: shared hit=1708  
 Planning time: 0.262 ms  
 Execution time: 344.731 ms  
(23 rows)  

Time: 346.424 ms  

postgres=# select * from t_split where info ~ '^33.+7.+9$' limit 10;  
   id   |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
      1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
  24412 | 2016-03-02 09:58:04.186359 |   251599 |           | 33f07429 | 3  | 3  | f  | 0  | 7  | 4  | 2  | 9  
  24989 | 2016-03-02 09:58:04.191112 |   214569 |           | 334587d9 | 3  | 3  | 4  | 5  | 8  | 7  | d  | 9  
  50100 | 2016-03-02 09:58:04.398499 |   409819 |           | 33beb7b9 | 3  | 3  | b  | e  | b  | 7  | b  | 9  
  92623 | 2016-03-02 09:58:04.745372 |   280100 |           | 3373e719 | 3  | 3  | 7  | 3  | e  | 7  | 1  | 9  
 106054 | 2016-03-02 09:58:04.855627 |   155192 |           | 33c575c9 | 3  | 3  | c  | 5  | 7  | 5  | c  | 9  
 107070 | 2016-03-02 09:58:04.863827 |   464325 |           | 337dd729 | 3  | 3  | 7  | d  | d  | 7  | 2  | 9  
 135152 | 2016-03-02 09:58:05.088217 |   240500 |           | 336271d9 | 3  | 3  | 6  | 2  | 7  | 1  | d  | 9  
 156425 | 2016-03-02 09:58:05.25805  |   218202 |           | 333e7289 | 3  | 3  | 3  | e  | 7  | 2  | 8  | 9  
 170210 | 2016-03-02 09:58:05.368371 |   132530 |           | 33a8d789 | 3  | 3  | a  | 8  | d  | 7  | 8  | 9  
(10 rows)  

Time: 20.431 ms  

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where info ~ '^33.+7.+9$' limit 10;  
                                                           QUERY PLAN                                                              
---------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=43.75..80.78 rows=10 width=57) (actual time=19.573..21.212 rows=10 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
   Buffers: shared hit=566  
   ->  Bitmap Heap Scan on public.t_split  (cost=43.75..3746.56 rows=1000 width=57) (actual time=19.571..21.206 rows=10 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
         Recheck Cond: (t_split.info ~ '^33.+7.+9$'::text)  
         Rows Removed by Index Recheck: 647  
         Heap Blocks: exact=552  
         Buffers: shared hit=566  
         ->  Bitmap Index Scan on idx9  (cost=0.00..43.50 rows=1000 width=0) (actual time=11.712..11.712 rows=39436 loops=1)  
               Index Cond: (t_split.info ~ '^33.+7.+9$'::text)  
               Buffers: shared hit=14  
 Planning time: 0.301 ms  
 Execution time: 21.255 ms  
(14 rows)  

Time: 21.995 ms  


postgres=# select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;  
   id   |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
--------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
      1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
  24412 | 2016-03-02 09:58:04.186359 |   251599 |           | 33f07429 | 3  | 3  | f  | 0  | 7  | 4  | 2  | 9  
  24989 | 2016-03-02 09:58:04.191112 |   214569 |           | 334587d9 | 3  | 3  | 4  | 5  | 8  | 7  | d  | 9  
  50100 | 2016-03-02 09:58:04.398499 |   409819 |           | 33beb7b9 | 3  | 3  | b  | e  | b  | 7  | b  | 9  
  92623 | 2016-03-02 09:58:04.745372 |   280100 |           | 3373e719 | 3  | 3  | 7  | 3  | e  | 7  | 1  | 9  
 106054 | 2016-03-02 09:58:04.855627 |   155192 |           | 33c575c9 | 3  | 3  | c  | 5  | 7  | 5  | c  | 9  
 107070 | 2016-03-02 09:58:04.863827 |   464325 |           | 337dd729 | 3  | 3  | 7  | d  | d  | 7  | 2  | 9  
 135152 | 2016-03-02 09:58:05.088217 |   240500 |           | 336271d9 | 3  | 3  | 6  | 2  | 7  | 1  | d  | 9  
 156425 | 2016-03-02 09:58:05.25805  |   218202 |           | 333e7289 | 3  | 3  | 3  | e  | 7  | 2  | 8  | 9  
 170210 | 2016-03-02 09:58:05.368371 |   132530 |           | 33a8d789 | 3  | 3  | a  | 8  | d  | 7  | 8  | 9  
(10 rows)  

Time: 37.739 ms  

postgres=# explain (analyze,verbose,timing,costs,buffers) select * from t_split where c1='3' and c2='3' and c8='9' and (c4='7' or c5='7' or c6='7') limit 10;  
                                                                                               QUERY PLAN                                                                                                  
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=0.00..8135.78 rows=10 width=57) (actual time=0.017..35.532 rows=10 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
   Buffers: shared hit=1755  
   ->  Seq Scan on public.t_split  (cost=0.00..353093.00 rows=434 width=57) (actual time=0.015..35.526 rows=10 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8  
         Filter: ((t_split.c1 = '3'::bpchar) AND (t_split.c2 = '3'::bpchar) AND (t_split.c8 = '9'::bpchar) AND ((t_split.c4 = '7'::bpchar) OR (t_split.c5 = '7'::bpchar) OR (t_split.c6 = '7'::bpchar)))  
         Rows Removed by Filter: 170200  
         Buffers: shared hit=1755  
 Planning time: 0.210 ms  
 Execution time: 35.572 ms  
(10 rows)  

Time: 36.260 ms  

postgres=# select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;  
   id    |          crt_time          | sensorid | sensorloc |   info   | c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8   
---------+----------------------------+----------+-----------+----------+----+----+----+----+----+----+----+----  
       1 | 2016-03-02 09:58:03.990639 |   161958 |           | 33eed779 | 3  | 3  | e  | e  | d  | 7  | 7  | 9  
 1308724 | 2016-03-02 09:58:14.590901 |   458822 |           | 3fed9479 | 3  | f  | e  | d  | 9  | 4  | 7  | 9  
 2866024 | 2016-03-02 09:58:27.20105  |   106467 |           | 3fed2279 | 3  | f  | e  | d  | 2  | 2  | 7  | 9  
 4826729 | 2016-03-02 09:58:42.907431 |   228023 |           | 3ded9879 | 3  | d  | e  | d  | 9  | 8  | 7  | 9  
 6113373 | 2016-03-02 09:58:53.211146 |   499702 |           | 36fed479 | 3  | 6  | f  | e  | d  | 4  | 7  | 9  
 1768237 | 2016-03-02 09:58:18.310069 |   345027 |           | 30fed079 | 3  | 0  | f  | e  | d  | 0  | 7  | 9  
 1472324 | 2016-03-02 09:58:15.913629 |   413283 |           | 3eed5798 | 3  | e  | e  | d  | 5  | 7  | 9  | 8  
 8319056 | 2016-03-02 09:59:10.902137 |   336740 |           | 3ded7790 | 3  | d  | e  | d  | 7  | 7  | 9  | 0  
 8576573 | 2016-03-02 09:59:12.962923 |   130223 |           | 3eed5793 | 3  | e  | e  | d  | 5  | 7  | 9  | 3  
(9 rows)  

Time: 268.661 ms  

postgres=# explain (analyze,verbose,timing,buffers,costs) select * from t_split where info ~ '^3.?[b-g]+ed[\d]+79' order by info <-> '^3.?[b-g]+ed[\d]+79' limit 10;  
                                                               QUERY PLAN                                                                  
-----------------------------------------------------------------------------------------------------------------------------------------  
 Limit  (cost=4302.66..4302.69 rows=10 width=57) (actual time=269.214..269.217 rows=9 loops=1)  
   Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))  
   Buffers: shared hit=52606  
   ->  Sort  (cost=4302.66..4305.16 rows=1000 width=57) (actual time=269.212..269.212 rows=9 loops=1)  
         Output: id, crt_time, sensorid, sensorloc, info, c1, c2, c3, c4, c5, c6, c7, c8, ((info <-> '^3.?[b-g]+ed[\d]+79'::text))  
         Sort Key: ((t_split.info <-> '^3.?[b-g]+ed[\d]+79'::text))  
         Sort Method: quicksort  Memory: 26kB  
         Buffers: shared hit=52606  
         ->  Bitmap Heap Scan on public.t_split  (cost=575.75..4281.06 rows=1000 width=57) (actual time=100.771..269.180 rows=9 loo

 
 
              
           
              
              
            