[PATCH] tsearch parser inefficiency if text includes urls or emails

From: Andres Freund
Subject: [PATCH] tsearch parser inefficiency if text includes urls or emails
Msg-id: 200911011619.44683.andres@anarazel.de
Replies: Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
List: pgsql-hackers

Hi,

While playing around with and evaluating tsearch I noticed that to_tsvector is
obscenely slow for some files. After some profiling I found that this is due
to using a separate TParser in p_ishost/p_isURLPath in wparser_def.c.
If a multibyte encoding is in use, TParserInit copies the whole remaining input
and converts it to wchar_t or pg_wchar for every email or protocol-prefixed
URL in the document, which obviously is bad.

I solved the issue by adding a separate TParserCopyInit/TParserCopyClose pair
which reuses the already converted strings of the original TParser, only at
different offsets.
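
To illustrate the idea (this is not the patch itself), here is a minimal,
self-contained sketch. All names in it (SketchParser, sketch_parser_init,
sketch_parser_copy_init, sketch_parser_close, sketch_chars_converted) are
hypothetical, and the byte-by-byte widening is only a stand-in for the real
multibyte conversion, so byte and character offsets coincide here, which they
would not in a real multibyte encoding.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Counts converted characters so the cost of each init is visible. */
size_t sketch_chars_converted = 0;

/* Hypothetical, heavily simplified stand-in for a parser state. */
typedef struct SketchParser
{
    const char *str;        /* input bytes, starting at this parser's origin */
    size_t      lenstr;     /* number of bytes in str */
    wchar_t    *wstr;       /* converted characters, same origin as str */
    int         owns_wstr;  /* nonzero only for the parser that allocated wstr */
    size_t      pos;        /* current position (bytes == chars in this sketch) */
} SketchParser;

/* Full init: allocates and converts the whole input, O(len). */
SketchParser *
sketch_parser_init(const char *str, size_t len)
{
    SketchParser *prs = calloc(1, sizeof(SketchParser));

    if (prs == NULL)
        abort();
    prs->str = str;
    prs->lenstr = len;
    prs->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (prs->wstr == NULL)
        abort();
    /* Stand-in for the real multibyte conversion: widen byte by byte. */
    for (size_t i = 0; i < len; i++)
        prs->wstr[i] = (wchar_t) (unsigned char) str[i];
    prs->wstr[len] = L'\0';
    sketch_chars_converted += len;
    prs->owns_wstr = 1;
    return prs;
}

/* Copy init: O(1), points into the original's buffers at its position. */
SketchParser *
sketch_parser_copy_init(const SketchParser *orig)
{
    SketchParser *prs = calloc(1, sizeof(SketchParser));

    if (prs == NULL)
        abort();
    assert(orig->pos <= orig->lenstr);
    prs->str = orig->str + orig->pos;
    prs->lenstr = orig->lenstr - orig->pos;
    prs->wstr = orig->wstr + orig->pos;   /* shared, not re-converted */
    prs->owns_wstr = 0;
    return prs;
}

/* Close: only the owning parser frees the converted buffer. */
void
sketch_parser_close(SketchParser *prs)
{
    if (prs->owns_wstr)
        free(prs->wstr);
    free(prs);
}
```

A sub-parser created with sketch_parser_copy_init must be closed before the
parser it was copied from, since it only borrows the converted buffer.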

Another approach would be to get rid of the separate parser invocations -
requiring a bunch of additional states. This seemed more complex to me, so I
wanted to get some feedback first.

Without patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';

─────────────────────────────────────────────────────────────────────────────────────────────────────
...(1 row)
Time: 5835.676 ms

With patch:
andres=# SELECT to_tsvector('english', document) FROM document WHERE filename =
'/usr/share/doc/libdrm-nouveau1/changelog';

─────────────────────────────────────────────────────────────────────────────────────────────────────
...(1 row)
Time: 395.341 ms

I'll clean up the patch if it seems like a sensible solution...

Is this backpatch-worthy?

Andres


PS: I left the additional define in for the moment so that it's easier to see the
performance differences.
