Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

From Andres Freund
Subject Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
Date
Msg-id 200911081700.53726.andres@anarazel.de
In reply to [PATCH] tsearch parser inefficiency if text includes urls or emails  (Andres Freund <andres@anarazel.de>)
Responses Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
Re: tsearch parser inefficiency if text includes urls or emails - new version
List pgsql-hackers
On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> While playing around/evaluating tsearch I noticed that to_tsvector is
> obscenely slow for some files. After some profiling I found that this is
>  due to using a separate TParser in p_ishost/p_isURLPath in wparser_def.c. If
>  a multibyte encoding is in use TParserInit copies the whole remaining
>  input and converts it to wchar_t or pg_wchar - for every email or
>  protocol-prefixed URL in the document. Which obviously is bad.
>
> I solved the issue by having a separate TParserCopyInit/TParserCopyClose
>  which reuses the already converted strings of the original TParser -
>  only at different offsets.
>
> Another approach would be to get rid of the separate parser invocations -
> requiring a bunch of additional states. This seemed more complex to me, so
>  I wanted to get some feedback first.
>
> Without patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
>  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
>  ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
>  (1 row)
>
> Time: 5835.676 ms
>
> With patch:
> andres=# SELECT to_tsvector('english', document) FROM document WHERE
>  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
>
>  ──────────────────────────────────────────────────────────────────────────
> ─────────────────────────── ...
>  (1 row)
>
> Time: 395.341 ms
>
> I'll clean up the patch if it seems like a sensible solution...
As nobody commented, here is a corrected (stupid thinko) and cleaned-up
version. Does anyone care to comment on whether I am the only one who thinks
this is an issue?

Andres
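
For readers without the patch at hand, here is a minimal sketch of the idea
described in the quoted text: a sub-parser borrows the parent's already
converted buffer at an offset instead of converting the remaining input
again. This is not the actual patch. SketchParser, sketch_parser_init,
sketch_parser_copy_init, sketch_parser_close and chars_converted are
hypothetical names, and the "conversion" simply widens ASCII bytes, so byte
offsets and character offsets coincide (an assumption that does not hold for
real multibyte input).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for the parser state; not PostgreSQL's TParser. */
typedef struct SketchParser
{
    const char *str;       /* original bytes, starting at this parser's offset */
    wchar_t    *wstr;      /* converted text, starting at this parser's offset */
    size_t      len;       /* remaining length (ASCII assumed: bytes == chars) */
    bool        owns_wstr; /* false when wstr is borrowed from a parent */
} SketchParser;

/* Counts converted characters so the cost of each strategy is visible. */
size_t chars_converted = 0;

/* Full init: converts the whole remaining input, O(len) on every call. */
SketchParser *
sketch_parser_init(const char *str, size_t len)
{
    SketchParser *p = malloc(sizeof(SketchParser));

    if (p == NULL)
        return NULL;
    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
    {
        free(p);
        return NULL;
    }
    /* Assumption: ASCII input, so widening each byte is a valid conversion. */
    for (size_t i = 0; i < len; i++)
        p->wstr[i] = (wchar_t) (unsigned char) str[i];
    p->wstr[len] = L'\0';
    chars_converted += len;

    p->str = str;
    p->len = len;
    p->owns_wstr = true;
    return p;
}

/* Copy init: reuses the parent's converted buffer at an offset, O(1). */
SketchParser *
sketch_parser_copy_init(const SketchParser *parent, size_t offset)
{
    SketchParser *p;

    assert(parent != NULL);
    if (offset > parent->len)
        return NULL;
    p = malloc(sizeof(SketchParser));
    if (p == NULL)
        return NULL;
    p->str = parent->str + offset;
    p->wstr = parent->wstr + offset;   /* borrowed, must not be freed here */
    p->len = parent->len - offset;
    p->owns_wstr = false;
    return p;
}

/* Close: frees the converted buffer only if this parser allocated it. */
void
sketch_parser_close(SketchParser *p)
{
    if (p == NULL)
        return;
    if (p->owns_wstr)
        free(p->wstr);
    free(p);
}
```

In this sketch a borrowed parser must be closed before its parent, since it
points into the parent's buffer. Calling sketch_parser_init for every URL or
email converts the rest of the document each time, while
sketch_parser_copy_init leaves chars_converted unchanged.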
