Re: text search: restricting the number of parsed words in headline generation

From Bruce Momjian
Subject Re: text search: restricting the number of parsed words in headline generation
Date
Msg-id 20130124205452.GB21914@momjian.us
In response to Re: text search: restricting the number of parsed words in headline generation  (Sushant Sinha <sushant354@gmail.com>)
List pgsql-hackers
On Wed, Aug 15, 2012 at 11:09:18PM +0530, Sushant Sinha wrote:
> I will do the profiling and present the results.

Sushant, do you have any profiling results on this issue from August?

---------------------------------------------------------------------------


> 
> On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > Is this a TODO?
> > 
> > AFAIR nothing's been done about the speed issue, so yes.  I didn't
> > like the idea of creating a user-visible knob when the speed issue
> > might be fixable with internal algorithm improvements, but we never
> > followed up on this in either fashion.
> > 
> >             regards, tom lane
> > 
> > > ---------------------------------------------------------------------------
> > 
> > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
> > >> Sushant Sinha <sushant354@gmail.com> writes:
> > >>> Doesn't this force the headline to be taken from the first N words of
> > >>> the document, independent of where the match was?  That seems rather
> > >>> unworkable, or at least unhelpful.
> > >> 
> > >>> In headline generation function, we don't have any index or knowledge of
> > >>> where the match is. We discover the matches by first tokenizing and then
> > >>> comparing the matches with the query tokens. So it is hard to do
> > >>> anything better than first N words.
> > >> 
> > >> After looking at the code in wparser_def.c a bit more, I wonder whether
> > >> this patch is doing what you think it is.  Did you do any profiling to
> > >> confirm that tokenization is where the cost is?  Because it looks to me
> > >> like the match searching in hlCover() is at least O(N^2) in the number
> > >> of tokens in the document, which means it's probably the dominant cost
> > >> for any long document.  I suspect that your patch helps not so much
> > >> because it saves tokenization costs as because it bounds the amount of
> > >> effort spent in hlCover().
> > >> 
> > >> I haven't tried to do anything about this, but I wonder whether it
> > >> wouldn't be possible to eliminate the quadratic blowup by saving more
> > >> state across the repeated calls to hlCover().  At the very least, it
> > >> shouldn't be necessary to find the last query-token occurrence in the
> > >> document from scratch on each and every call.
> > >> 
> > >> Actually, this code seems probably flat-out wrong: won't every
> > >> successful call of hlCover() on a given document return exactly the same
> > >> q value (end position), namely the last token occurrence in the
> > >> document?  How is that helpful?
> > >> 
> > >> regards, tom lane
> > >> 
> > >> -- 
> > >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> > >> To make changes to your subscription:
> > >> http://www.postgresql.org/mailpref/pgsql-hackers
> > 
> > > -- 
> > >   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
> > >   EnterpriseDB                             http://enterprisedb.com
> > 
> > >   + It's impossible for everything to be true. +
> > 
> > 
> 
> 

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
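
For readers following the quoted discussion, here is a minimal sketch of the cost argument Tom Lane raises about repeating a full scan on every call versus saving state across calls. It is a toy model in Python, not the hlCover() code from wparser_def.c: the names CoverState and simulate are hypothetical, the "cost" is simply the number of tokens inspected, and the assumption that each call advances the start position by one token is made up for illustration.

```python
class CoverState:
    """Toy stand-in for per-document search state (hypothetical, not PostgreSQL code)."""

    def __init__(self, tokens, query_terms):
        self.tokens = tokens
        self.query_terms = query_terms
        self.examined = 0      # tokens inspected so far, our cost measure
        self._last = None      # cached position of the last query-term token

    def last_rescan(self):
        """Scan the whole document for the last query-term token: O(N) per call."""
        last = -1
        for i, tok in enumerate(self.tokens):
            self.examined += 1
            if tok in self.query_terms:
                last = i
        return last

    def last_cached(self):
        """Same answer, but the scan happens only on the first call."""
        if self._last is None:
            self._last = self.last_rescan()
        return self._last


def simulate(tokens, query_terms, cached):
    """Run a toy cover loop; return (loop iterations, tokens examined)."""
    state = CoverState(tokens, query_terms)
    find_last = state.last_cached if cached else state.last_rescan
    start = 0
    calls = 0
    # Assumption: each call advances the start position by one token.
    while start <= find_last():
        calls += 1
        start += 1
    return calls, state.examined


if __name__ == "__main__":
    doc = ["foo", "bar"] * 50          # 100 tokens, last "bar" at index 99
    print(simulate(doc, {"bar"}, cached=False))
    print(simulate(doc, {"bar"}, cached=True))
```

On this 100-token toy document, the rescanning variant inspects 10100 tokens while the cached variant inspects 100, for the same 100 loop iterations. Whether the real code behaves this way is exactly what the requested profiling would have to show.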


