Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version

From	Kenneth Marshall
Subject	Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
Date	8 November 2009, 15:41:35
Msg-id	20091108164115.GA27729@it.is.rice.edu
In reply to	Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version (Andres Freund <andres@anarazel.de>)
Responses	Re: [PATCH] tsearch parser inefficiency if text includes urls or emails - new version
List	pgsql-hackers

On Sun, Nov 08, 2009 at 05:00:53PM +0100, Andres Freund wrote:
> On Sunday 01 November 2009 16:19:43 Andres Freund wrote:
> > While playing around/evaluating tsearch I noticed that to_tsvector is
> > obscenely slow for some files. After some profiling I found that this is
> >  due to using a separate TParser in p_ishost/p_isURLPath in wparser_def.c. If
> >  a multibyte encoding is in use TParserInit copies the whole remaining
> >  input and converts it to wchar_t or pg_wchar - for every email or protocol
> >  prefixed url in the document. Which obviously is bad.
> > 
> > I solved the issue by having a separate TParserCopyInit/TParserCopyClose
> >  which reuses the already converted strings of the original TParser -
> >  only at different offsets.
> > 
> > Another approach would be to get rid of the separate parser invocations -
> > requiring a bunch of additional states. This seemed more complex to me, so
> >  I wanted to get some feedback first.
> > 
> > Without patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> >  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> > 
> >  [to_tsvector output garbled in the archive; omitted] ...
> >  (1 row)
> > 
> > Time: 5835.676 ms
> > 
> > With patch:
> > andres=# SELECT to_tsvector('english', document) FROM document WHERE
> >  filename = '/usr/share/doc/libdrm-nouveau1/changelog';
> > 
> >  [to_tsvector output garbled in the archive; omitted] ...
> >  (1 row)
> > 
> > Time: 395.341 ms
> > 
> > I'll clean up the patch if it seems like a sensible solution...
> As nobody commented, here is a corrected (stupid thinko) and cleaned-up
> version. Does anyone care to comment on whether I am the only one who thinks
> this is an issue?
> 
> Andres

+1

As a user of tsearch, I can certainly appreciate the speed-up in parsing --
more CPU for everyone else.

Regards,
Ken
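
For readers following the thread, the idea quoted above (a sub-parser that
reuses the already converted string of the original parser at a different
offset, instead of converting the remaining input again) can be sketched
roughly as below. This is only an illustration under stated assumptions, not
the patch itself: DemoParser, demo_parser_init, demo_parser_copy_init and
demo_parser_close are hypothetical names, the "conversion" is a plain
byte-to-wchar_t widening standing in for a real multibyte conversion, and
because of that the byte offset and the character offset coincide here, which
would not hold in general for a multibyte encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for a parser state; NOT the real TParser struct. */
typedef struct DemoParser
{
    const char *str;        /* original input bytes, starting at this parser's offset */
    wchar_t    *wstr;       /* converted input, starting at this parser's offset */
    size_t      len;        /* characters remaining from this parser's start */
    int         owns_wstr;  /* 1 if this parser allocated wstr and must free it */
} DemoParser;

/* Running total of characters converted, to make the cost visible. */
static size_t demo_chars_converted = 0;

/* Full init: allocates and converts the whole input, O(len) on every call. */
static DemoParser *
demo_parser_init(const char *str, size_t len)
{
    DemoParser *p = malloc(sizeof(DemoParser));

    if (p == NULL)
        return NULL;
    p->wstr = malloc((len + 1) * sizeof(wchar_t));
    if (p->wstr == NULL)
    {
        free(p);
        return NULL;
    }
    /* Assumption: simple widening in place of a real multibyte conversion. */
    for (size_t i = 0; i < len; i++)
        p->wstr[i] = (wchar_t) (unsigned char) str[i];
    p->wstr[len] = L'\0';
    demo_chars_converted += len;

    p->str = str;
    p->len = len;
    p->owns_wstr = 1;
    return p;
}

/* Copy init: shares the original's converted buffer at an offset, O(1). */
static DemoParser *
demo_parser_copy_init(const DemoParser *orig, size_t offset)
{
    DemoParser *p;

    assert(offset <= orig->len);
    p = malloc(sizeof(DemoParser));
    if (p == NULL)
        return NULL;
    p->str = orig->str + offset;
    p->wstr = orig->wstr + offset;  /* no new allocation, no conversion */
    p->len = orig->len - offset;
    p->owns_wstr = 0;
    return p;
}

/* Close: frees the converted buffer only if this parser owns it. */
static void
demo_parser_close(DemoParser *p)
{
    if (p == NULL)
        return;
    if (p->owns_wstr)
        free(p->wstr);
    free(p);
}
```

In this sketch, calling demo_parser_init once per URL or email candidate would
add the length of the remaining input to demo_chars_converted each time,
whereas demo_parser_copy_init leaves the counter unchanged. A copied parser
must be closed before the parser it was copied from, since it points into the
original's buffer.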
