Re: WIP: index support for regexp search

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: WIP: index support for regexp search
Дата
Msg-id 50B76242.8070907@vmware.com
обсуждение исходный текст
Ответ на Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Ответы Re: WIP: index support for regexp search  (Alexander Korotkov <aekorotkov@gmail.com>)
Список pgsql-hackers
One thing that bothers me with this algoritm is that the overflow 
mechanism is all-or-nothing. In many cases, even when there is a huge 
number of states in the diagram, you could still extract at least a few 
trigrams that must be present in any matching string, with little 
effort. At least, it seems like that to a human :-).

For example, consider this:

explain analyze select count(*) from azjunk4 where txt ~ 

('^aabaacaadaaeaafaagaahaaiaajaakaalaamaanaaoaapaaqaaraasaataauaavaawaaxaayaazabaabbabcabdabeabfabgabhabiabjabkablabmabnaboabpabqabrabsabtabuabvabwabxabyabzacaacbaccacdaceacfacgachaciacjackaclacmacnacoacpacqacracsactacuacvacwacxacyaczadaadbadcaddadeadfadgadhadiadjadkadladmadnadoadpadqadradsadtaduadvadwadxadyadzaeaaebaecaedaeeaefaegaehaeiaejaekaelaemaenaeoaepaeqaeraesaetaeuaevaewaexaeyaezafaafbafcafdafeaffafgafhafiafjafkaflafmafnafoafpafqafrafsaftafuafvafwafxafyafzagaagbagcagdageagfaggaghagiagjagkaglagmagnagoagpagqagragsagtaguagvagwagxagyagzahaahbahcahdaheahfahgahhahiahjahkahlahmahnahoahpahqahrahs$');

you get a query plan like this (the long regexp string edited out):
 Aggregate  (cost=228148.02..228148.03 rows=1 width=0) (actual 
time=131.100..131
.101 rows=1 loops=1)   ->  Bitmap Heap Scan on azjunk4  (cost=228144.01..228148.02 rows=1 
width=0) (
actual time=131.096..131.096 rows=0 loops=1)         Recheck Cond: (txt ~ <ridiculously long regexp>)         Rows
Removedby Index Recheck: 10000         ->  Bitmap Index Scan on azjunk4_trgmrgx_txt_01_idx 
 
(cost=0.00..228144
.01 rows=1 width=0) (actual time=82.914..82.914 rows=10000 loops=1)               Index Cond: (txt ~ <ridiculously long
regexp>)Total runtime: 131.230 ms
 
(7 rows)

That ridiculously long string exceeds the number of states (I think, 
could be number of paths or arcs too), and the algorithm gives up, 
resorting to scanning the whole index as can be seen by the "Rows 
Removed by Index Recheck" line. However, it's easy to see that any 
matching string must contain *any* of the possible trigrams the 
algorithm extracts. If it could safely return just a few of them, say 
"aab" and "abz", and discard the rest, that would already be much better 
than a full index scan.

Would it be safe to simply stop short the depth-first search on 
overflow, and proceed with the graph that was constructed up to that point?

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Muhammad Usama
Дата:
Сообщение: Patch to fix ecpg core dump with nested structure pointer variable
Следующее
От: Andres Freund
Дата:
Сообщение: [PATCH] make -jN check fails / unset MAKEFLAGS in pg_regress.c