Re: text search: restricting the number of parsed words in headline generation

From: Tom Lane
Subject: Re: text search: restricting the number of parsed words in headline generation
Date:
Msg-id: 15125.1314153102@sss.pgh.pa.us
In reply to: Re: text search: restricting the number of parsed words in headline generation  (Sushant Sinha <sushant354@gmail.com>)
Responses: Re: text search: restricting the number of parsed words in headline generation  (Sushant Sinha <sushant354@gmail.com>)
Re: text search: restricting the number of parsed words in headline generation  (Bruce Momjian <bruce@momjian.us>)
Re: text search: restricting the number of parsed words in headline generation  (Bruce Momjian <bruce@momjian.us>)
List: pgsql-hackers
Sushant Sinha <sushant354@gmail.com> writes:
>> Doesn't this force the headline to be taken from the first N words of
>> the document, independent of where the match was?  That seems rather
>> unworkable, or at least unhelpful.

> In headline generation function, we don't have any index or knowledge of
> where the match is. We discover the matches by first tokenizing and then
> comparing the matches with the query tokens. So it is hard to do
> anything better than first N words.

After looking at the code in wparser_def.c a bit more, I wonder whether
this patch is doing what you think it is.  Did you do any profiling to
confirm that tokenization is where the cost is?  Because it looks to me
like the match searching in hlCover() is at least O(N^2) in the number
of tokens in the document, which means it's probably the dominant cost
for any long document.  I suspect that your patch helps not so much
because it saves tokenization costs as because it bounds the amount of
effort spent in hlCover().

I haven't tried to do anything about this, but I wonder whether it
wouldn't be possible to eliminate the quadratic blowup by saving more
state across the repeated calls to hlCover().  At the very least, it
shouldn't be necessary to find the last query-token occurrence in the
document from scratch on each and every call.
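A minimal sketch of the saved-state idea (hypothetical names, not the actual wparser_def.c code): compute the last query-token occurrence once up front, then let each cover search reuse it instead of rescanning the document tail:

```c
#include <stdbool.h>

/* Hypothetical cover-search state, carried across repeated calls
 * instead of being recomputed from scratch each time. */
typedef struct
{
    const bool *match;   /* match[i]: token i matches a query token */
    int         ntok;    /* number of tokens in the document */
    int         last;    /* cached index of the last matching token */
} CoverState;

/* One backward scan at setup time replaces the per-call forward
 * scan to the end of the document. */
static void
cover_state_init(CoverState *st, const bool *match, int ntok)
{
    int         i;

    st->match = match;
    st->ntok = ntok;
    st->last = -1;
    for (i = ntok - 1; i >= 0; i--)
    {
        if (match[i])
        {
            st->last = i;
            break;
        }
    }
}

/* Find the next match at or after position p; fetching the end bound
 * st->last is now O(1) per call rather than O(N). */
static bool
next_cover(const CoverState *st, int p, int *begin, int *end)
{
    int         i;

    for (i = p; i < st->ntok; i++)
    {
        if (st->match[i])
        {
            *begin = i;
            *end = st->last;
            return true;
        }
    }
    return false;
}
```

This only removes the per-call tail scan; a full fix would also have to track per-query-item positions so that each call can return a genuinely local cover.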

Actually, this code seems to be flat-out wrong: won't every
successful call of hlCover() on a given document return exactly the same
q value (end position), namely the last token occurrence in the
document?  How is that helpful?
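The suspect pattern can be sketched like this (a hypothetical reduction, not the actual hlCover() source): each call scans forward from p for the first match, then rescans the rest of the document for the last match, so repeated calls are quadratic and *q is the same for every starting position:

```c
#include <stdbool.h>

/* Sketch of the pattern described above.  The tail rescan makes
 * repeated calls O(N^2) in the token count, and *q always lands on
 * the last match in the whole document, regardless of p. */
static bool
find_cover(const bool *match, int ntok, int p, int *begin, int *q)
{
    int         i;
    int         first = -1;

    /* forward scan from p for the first query-token match */
    for (i = p; i < ntok; i++)
    {
        if (match[i])
        {
            first = i;
            break;
        }
    }
    if (first < 0)
        return false;

    /* rescans all the way to the end of the document on every call */
    *q = -1;
    for (i = first; i < ntok; i++)
    {
        if (match[i])
            *q = i;
    }
    *begin = first;
    return true;
}
```

With matches at token positions 1, 4, and 6, calls starting at p = 0, 2, and 5 all report the end position 6, which is the behavior questioned above.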
        regards, tom lane

