Re: WIP: index support for regexp search

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: WIP: index support for regexp search
Дата
Msg-id CAPpHfdshP6ituomYi0VUvEiN6pWsMcm+XztTjo7TeP1wUCXuVg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WIP: index support for regexp search  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: WIP: index support for regexp search
Список pgsql-hackers
On Mon, Dec 3, 2012 at 2:05 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 02.12.2012 20:19, Tom Lane wrote:
Alexander Korotkov<aekorotkov@gmail.com>  writes:
Nice idea to delay expanding colors to characters! Obviously, we should
delay expanding inly alphanumerical characters. Because non-alphanumberical
characters influence graph structure. Trying to implement...

Uh, why would that be?  Colors are colors.  The regexp machinery doesn't
care whether they represent alphanumerics or not.  (Or to be more
precise, if there is a situation where it makes a difference, separate
colors will have been created for each set of characters that need to be
distinguished.)

The regexp machinery doesn't care, but the trigrams that pg_trgm extracts only contain alphanumeric characters. So if by looking at the CNFA graph produced by the regexp machinery you conclude that any matching strings must contain three-letter sequences "%oo" and "#oo", you can just club them together into " oo" trigram.

I think you can run a pre-processing step to the colors, and merge colors that are equivalent as far as trigrams are considered. For example, if you have a color that contains only character '%', and another that contains character '#', you can treat them as the same hcolor. You might then be able to simplify the CNFA. Actually, it would be even better if you could apply the pre-processing to the regexp before the regexp machinery turns it into a CNFA. Not sure how easy it would be to do such pre-processing.
 
Treating colors as same should be possible only for colors which has no alphanumeric characters, because colors are non-overlapping. However, this optimization could be significant in some cases.

BTW, why create the path matrix? You could check the "check" array of trigrams in the consistent function directly against the graph. Consistent should return true, iff there is a path through the graph following only arcs that contain trigrams present in the check array. Finding a path through a complex graph could be expensive, O(|E|), but if the path is complex, the path matrix would be large as well, and checking against a large matrix isn't exactly free either. It would allow you to avoid "overflows" caused by having too many paths through the graph.

Actually, I generally dislike path matrix for same reasons. But:
1) Output graphs could contain trigrams which are completely useless for search. For example, for regex /(abcdefgh)*ijk/ we need only "ijk" trigram while graph would contain much more.Path matrix is a method to get rid of all of them.
2) If we use color trigrams then we need some criteria for which color trigrams to expand into trigrams. Simultaneously, we shouldn't allow path from initial state to the final by unexpanded trigrams. It seems much harder to do with graph than with matrix.

------
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: [PATCH 11/14] Introduce wal decoding via catalog timetravel
Следующее
От: Kohei KaiGai
Дата:
Сообщение: Re: [v9.3] writable foreign tables