Re: text search: restricting the number of parsed words in headline generation

From: Tom Lane
Subject: Re: text search: restricting the number of parsed words in headline generation
Date:
Msg-id: 7857.1345049154@sss.pgh.pa.us
In response to: Re: text search: restricting the number of parsed words in headline generation  (Bruce Momjian <bruce@momjian.us>)
Responses: Re: text search: restricting the number of parsed words in headline generation
List: pgsql-hackers
Bruce Momjian <bruce@momjian.us> writes:
> Is this a TODO?

AFAIR nothing's been done about the speed issue, so yes.  I didn't
like the idea of creating a user-visible knob when the speed issue
might be fixable with internal algorithm improvements, but we never
followed up on this in either fashion.
        regards, tom lane

> ---------------------------------------------------------------------------

> On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
>> Sushant Sinha <sushant354@gmail.com> writes:
>>> Doesn't this force the headline to be taken from the first N words of
>>> the document, independent of where the match was?  That seems rather
>>> unworkable, or at least unhelpful.
>> 
>>> In headline generation function, we don't have any index or knowledge of
>>> where the match is. We discover the matches by first tokenizing and then
>>> comparing the matches with the query tokens. So it is hard to do
>>> anything better than first N words.
>> 
>> After looking at the code in wparser_def.c a bit more, I wonder whether
>> this patch is doing what you think it is.  Did you do any profiling to
>> confirm that tokenization is where the cost is?  Because it looks to me
>> like the match searching in hlCover() is at least O(N^2) in the number
>> of tokens in the document, which means it's probably the dominant cost
>> for any long document.  I suspect that your patch helps not so much
>> because it saves tokenization costs as because it bounds the amount of
>> effort spent in hlCover().
>> 
>> I haven't tried to do anything about this, but I wonder whether it
>> wouldn't be possible to eliminate the quadratic blowup by saving more
>> state across the repeated calls to hlCover().  At the very least, it
>> shouldn't be necessary to find the last query-token occurrence in the
>> document from scratch on each and every call.
>> 
>> Actually, this code seems probably flat-out wrong: won't every
>> successful call of hlCover() on a given document return exactly the same
>> q value (end position), namely the last token occurrence in the
>> document?  How is that helpful?
>> 
>> regards, tom lane
>> 

> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com

>   + It's impossible for everything to be true. +


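
For readers without the source at hand, here is a minimal standalone sketch of the behaviour discussed in the quoted message. It is plain C with invented names (doc, query, naive_next_cover, cached_last_occurrence, NTOKENS, NQUERY), not the actual HeadlineParsedText machinery in wparser_def.c: the document is reduced to an array of token ids, matches are discovered by comparing each token against the query tokens (the tokenize-and-compare approach Sushant describes), the naive cover search rescans the remaining tokens from scratch on every call (which is what makes repeated searches roughly O(N^2) over a long document and returns the same end position q every time), and a one-pass helper shows the kind of state that could be computed once and reused across calls.

/*
 * Standalone sketch, not the PostgreSQL code.  The document is an array of
 * token ids and the query is a small set of token ids.  naive_next_cover()
 * rescans the whole remaining token array on every call; that rescan is the
 * quadratic part.  cached_last_occurrence() computes, in a single pass, the
 * value the naive loop keeps recomputing: the position of the last
 * query-token occurrence, which never changes between calls.
 */
#include <stdbool.h>
#include <stdio.h>

#define NTOKENS 12
#define NQUERY  2

static const int doc[NTOKENS]  = {5, 1, 7, 2, 9, 1, 3, 2, 8, 1, 6, 2};
static const int query[NQUERY] = {1, 2};

static bool
is_query_token(int tok)
{
    for (int i = 0; i < NQUERY; i++)
        if (query[i] == tok)
            return true;
    return false;
}

/* Naive version: every call walks the remaining tokens from scratch. */
static bool
naive_next_cover(int start, int *p, int *q)
{
    int last = -1;

    /* find the first query token at or after start */
    for (*p = start; *p < NTOKENS; (*p)++)
        if (is_query_token(doc[*p]))
            break;
    if (*p >= NTOKENS)
        return false;

    /* rescan the rest of the document for the last query token */
    for (int i = *p; i < NTOKENS; i++)
        if (is_query_token(doc[i]))
            last = i;
    *q = last;
    return true;
}

/* Precomputed version: one O(N) pass caches what the naive loop recomputes. */
static int
cached_last_occurrence(void)
{
    int last = -1;

    for (int i = 0; i < NTOKENS; i++)
        if (is_query_token(doc[i]))
            last = i;
    return last;
}

int
main(void)
{
    int p, q;
    int last = cached_last_occurrence();    /* computed once, reused below */

    for (int start = 0; naive_next_cover(start, &p, &q); start = p + 1)
        printf("cover [%d, %d]  (cached last occurrence = %d)\n", p, q, last);
    return 0;
}

Running the sketch prints the same q on every iteration, which mirrors the observation above that every successful hlCover() call on a given document returns the same end position. Reusing the cached last-occurrence position (or, for example, precomputed per-token occurrence lists) across the repeated calls is the sort of internal algorithm improvement the TODO item refers to.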


