Index on regexes

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Index on regexes
Дата
Msg-id CAPpHfduMvS=sGPNh_88bUo3TEg_KhxV7s81Fa+ZV96yadgff2g@mail.gmail.com
обсуждение исходный текст
Ответы Re: Index on regexes  ("Erik Rijkers" <er@xs4all.nl>)
Re: Index on regexes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Re: Index on regexes  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Список pgsql-hackers
Hackers,

Attached patch contains opclass which demonstrates advantages of GIN additional information storing itself without other GIN improvements. It implements inversed task of regex indexing. It works so: you create index on regexes and search for regexes matched query string. It introduce two additional operators |~ and *~ for case-sensetive and case-insensetive regex to string matching, and gin_regexp_trgm_ops opclass.

Let's consider some example.

At first, generate some regexes.

CREATE OR REPLACE FUNCTION generate_string(int, int) RETURNS text AS $$
SELECT array_to_string(ARRAY(SELECT chr((97 + random() * 10) :: integer) FROM generate_series(1,($1 + random()*$2)::int)), '');
$$
LANGUAGE sql;

CREATE TABLE test AS select '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' || generate_string(1,2) || '(' || generate_string(1,4) || '|' || generate_string(1,4) || '|' || generate_string(1,4) || ')' AS s FROM generate_series(1,1000000);

I use only 10 characters on alphabet in order to have better chance of matching. It generate some regexes like so:

postgres=# SELECT * FROM test LIMIT 10;
                 s
------------------------------------
 (g|cij|ah)jg(iei|hfc|eef)
 (gbfdb|ehbg|akf)ge(bc|jgee|jidd)
 (jedc|kgc|c)bc(ii|bji|iebc)
 (aa|eie|bgdb)f(fc|he|f)
 (b|ijc|ae)ae(eccb|ie|kjf)
 (bib|igf|kdibf)fij(gcbh|efi|fidj)
 (bkejf|jfdhg|gbfe)bhb(bedj|hh|ggg)
 (kfb|egccd|iefce)jf(kj|jbef|kbc)
 (bhh|c|cd)cb(h|ed|jg)
 (id|j|geg)gc(djif|ai|cjjjc)
(10 rows)

Without index search takes about 10 seconds.

postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..19929.00 rows=5000 width=28) (actual time=172.990..97357.312 rows=438 loops=1)
   Filter: (s |~ 'abcdefghijkl'::text)
   Rows Removed by Filter: 999562
 Total runtime: 97357.490 ms
(4 rows)

And with index it takes only 110 milliseconds.

postgres=# explain analyze select * from test where s |~ 'abcdefghijkl';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=182.75..7245.94 rows=5000 width=28) (actual time=68.143..110.663 rows=438 loops=1)
   Recheck Cond: (s |~ 'abcdefghijkl'::text)
   ->  Bitmap Index Scan on test_idx  (cost=0.00..181.50 rows=5000 width=0) (actual time=67.929..67.929 rows=438 loops=1)
         Index Cond: (s |~ 'abcdefghijkl'::text)
 Total runtime: 110.870 ms
(5 rows)

------
With best regards,
Alexander Korotkov.
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alexander Korotkov
Дата:
Сообщение: GIN improvements part 1: additional information
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Adjusting elog behavior in bootstrap/standalone mode