Re: WIP: index support for regexp search

Поиск
Список
Период
Сортировка
От Alexander Korotkov
Тема Re: WIP: index support for regexp search
Дата
Msg-id CAPpHfdsTEGmx-fmBzaqFED8ghiDX+UA0duph=UL383w80A=oDw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: WIP: index support for regexp search  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Ответы Re: WIP: index support for regexp search
Список pgsql-hackers
On Fri, Nov 30, 2012 at 6:23 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
On 30.11.2012 13:20, Alexander Korotkov wrote:
On Thu, Nov 29, 2012 at 5:25 PM, Heikki Linnakangas<hlinnakangas@vmware.com
wrote:

Would it be safe to simply stop short the depth-first search on overflow,
and proceed with the graph that was constructed up to that point?

For depth-first it's not. But your proposal naturally makes sense. I've
changed it to breadth-first search. And then it's safe to mark all
unprocessed states as final when overflow. It means that we assume every
unprocessed branch to immediately finish with matching (this can give us
more false positives but no false negatives).
For overflow of matrix collection, it's safe to do just OR between all the
trigrams.
New version of patch is attached.

Thanks, sounds good.

I've spent quite a long time trying to understand the transformation the getState/addKeys/addAcrs functions do to the original CNFA graph. I think that still needs more comments to explain the steps involved in it.

One thing that occurs to me is that it might be better to delay expanding colors to characters. You could build and transform the graph, and even create the paths, all while operating on colors. You would end up with lists of "color trigrams", consisting of sequences of three colors that must appear in the source string. Only at the last step you would expand the color trigrams to real character trigrams. I think that would save a lot of processing while building the graph, if you have colors that contain many characters. It would allow us to do better than the fixed small MAX_COLOR_CHARS limit. For example with MAX_COLOR_CHARS = 4 in the current patch, it's a shame that we can't do anything with a fairly simple regexp like "^a[b-g]h$"

Nice idea to delay expanding colors to characters! Obviously, we should delay expanding inly alphanumerical characters. Because non-alphanumberical characters influence graph structure. Trying to implement...

------
With best regards,
Alexander Korotkov.

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: proposal: separate databases for contrib module testing
Следующее
От: Alexander Korotkov
Дата:
Сообщение: Re: WIP: index support for regexp search