Обсуждение: text search: restricting the number of parsed words in headline generation
Given a document and a query, the goal of headline generation is to produce text excerpts in which the query appears. Currently the headline generation in postgres follows the following steps: 1. Tokenize the documents and obtain the lexemes 2. Decide on lexemes that should be the part of the headline 3. Generate the headline So the time taken by the headline generation is directly dependent on the size of the document. The longer the document, the more time taken to tokenize and more lexemes to operate on. Most of the time is taken during the tokenization phase and for very big documents, the headline generation is very expensive. Here is a simple patch that limits the number of words during the tokenization phase and puts an upper-bound on the headline generation. The headline function takes a parameter MaxParsedWords. If this parameter is negative or not supplied, then the entire document is tokenized and operated on (the current behavior). However, if the supplied MaxParsedWords is a positive number, then the tokenization stops after MaxParsedWords is obtained. The remaining headline generation happens on the tokens obtained till that point. The current patch can be applied to 9.1rc1. It lacks changes to the documentation and test cases. I will add them if you folks agree on the functionality. -Sushant.
Вложения
Sushant Sinha <sushant354@gmail.com> writes:
> Given a document and a query, the goal of headline generation is to
> produce text excerpts in which the query appears.
... right ...
> Here is a simple patch that limits the number of words during the
> tokenization phase and puts an upper-bound on the headline generation.
Doesn't this force the headline to be taken from the first N words of
the document, independent of where the match was?  That seems rather
unworkable, or at least unhelpful.
        regards, tom lane
			
		Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Alvaro Herrera
		    Дата:
		        Excerpts from Tom Lane's message of mar ago 23 15:59:18 -0300 2011: > Sushant Sinha <sushant354@gmail.com> writes: > > Given a document and a query, the goal of headline generation is to > > produce text excerpts in which the query appears. > > ... right ... > > > Here is a simple patch that limits the number of words during the > > tokenization phase and puts an upper-bound on the headline generation. > > Doesn't this force the headline to be taken from the first N words of > the document, independent of where the match was? That seems rather > unworkable, or at least unhelpful. Yeah ... Doesn't a search result include the position on which the tokens were found within the document? Wouldn't it make more sense to improve the system somehow so that it can restrict searching for headlines in the general area where the tokens were found? -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Sushant Sinha
		    Дата:
		        > > Here is a simple patch that limits the number of words during the > > tokenization phase and puts an upper-bound on the headline generation. > > Doesn't this force the headline to be taken from the first N words of > the document, independent of where the match was? That seems rather > unworkable, or at least unhelpful. > > regards, tom lane In headline generation function, we don't have any index or knowledge of where the match is. We discover the matches by first tokenizing and then comparing the matches with the query tokens. So it is hard to do anything better than first N words. One option could be that we start looking for "good match" while tokenizing and then stop if we have found good match. Currently the algorithms that decide a good match operates independently of the tokenization and there are two of them. So integrating them would not be easy. The patch is very helpful if you believe in the common case assumption that most of the time a good match is at the top of the document. Typically a search application generates headline for the top matches of a query i.e., those in which the query terms appears frequently. So there should be atleast one or two good text excerpt matches at the top of the document. -Sushant.
Sushant Sinha <sushant354@gmail.com> writes:
>> Doesn't this force the headline to be taken from the first N words of
>> the document, independent of where the match was?  That seems rather
>> unworkable, or at least unhelpful.
> In headline generation function, we don't have any index or knowledge of
> where the match is. We discover the matches by first tokenizing and then
> comparing the matches with the query tokens. So it is hard to do
> anything better than first N words.
After looking at the code in wparser_def.c a bit more, I wonder whether
this patch is doing what you think it is.  Did you do any profiling to
confirm that tokenization is where the cost is?  Because it looks to me
like the match searching in hlCover() is at least O(N^2) in the number
of tokens in the document, which means it's probably the dominant cost
for any long document.  I suspect that your patch helps not so much
because it saves tokenization costs as because it bounds the amount of
effort spent in hlCover().
I haven't tried to do anything about this, but I wonder whether it
wouldn't be possible to eliminate the quadratic blowup by saving more
state across the repeated calls to hlCover().  At the very least, it
shouldn't be necessary to find the last query-token occurrence in the
document from scratch on each and every call.
Actually, this code seems probably flat-out wrong: won't every
successful call of hlCover() on a given document return exactly the same
q value (end position), namely the last token occurrence in the
document?  How is that helpful?
        regards, tom lane
			
		Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Sushant Sinha
		    Дата:
		        <div class="im"><br /><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #cccsolid;padding-left:1ex"><br /> Actually, this code seems probably flat-out wrong: won't every<br /> successful call ofhlCover() on a given document return exactly the same<br /> q value (end position), namely the last token occurrence inthe<br /> document? How is that helpful?<br /><br /> regards, tom lane<br /></blockquote></div><br/></div>There is a line that saves the computation state from the previous call and search only startsfrom there:<br /><pre><span>int</span> pos = *p;</pre><br /><br />
Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Bruce Momjian
		    Дата:
		        Is this a TODO? --------------------------------------------------------------------------- On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote: > Sushant Sinha <sushant354@gmail.com> writes: > >> Doesn't this force the headline to be taken from the first N words of > >> the document, independent of where the match was? That seems rather > >> unworkable, or at least unhelpful. > > > In headline generation function, we don't have any index or knowledge of > > where the match is. We discover the matches by first tokenizing and then > > comparing the matches with the query tokens. So it is hard to do > > anything better than first N words. > > After looking at the code in wparser_def.c a bit more, I wonder whether > this patch is doing what you think it is. Did you do any profiling to > confirm that tokenization is where the cost is? Because it looks to me > like the match searching in hlCover() is at least O(N^2) in the number > of tokens in the document, which means it's probably the dominant cost > for any long document. I suspect that your patch helps not so much > because it saves tokenization costs as because it bounds the amount of > effort spent in hlCover(). > > I haven't tried to do anything about this, but I wonder whether it > wouldn't be possible to eliminate the quadratic blowup by saving more > state across the repeated calls to hlCover(). At the very least, it > shouldn't be necessary to find the last query-token occurrence in the > document from scratch on each and every call. > > Actually, this code seems probably flat-out wrong: won't every > successful call of hlCover() on a given document return exactly the same > q value (end position), namely the last token occurrence in the > document? How is that helpful? > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Bruce Momjian
		    Дата:
		        This might indicate that the hlCover() item is resolved. --------------------------------------------------------------------------- On Wed, Aug 24, 2011 at 10:08:11AM +0530, Sushant Sinha wrote: > > > Actually, this code seems probably flat-out wrong: won't every > successful call of hlCover() on a given document return exactly the same > q value (end position), namely the last token occurrence in the > document? How is that helpful? > > regards, tom lane > > > There is a line that saves the computation state from the previous call and > search only starts from there: > > int pos = *p; > > > -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> writes:
> Is this a TODO?
AFAIR nothing's been done about the speed issue, so yes.  I didn't
like the idea of creating a user-visible knob when the speed issue
might be fixable with internal algorithm improvements, but we never
followed up on this in either fashion.
        regards, tom lane
> ---------------------------------------------------------------------------
> On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote:
>> Sushant Sinha <sushant354@gmail.com> writes:
>>> Doesn't this force the headline to be taken from the first N words of
>>> the document, independent of where the match was?  That seems rather
>>> unworkable, or at least unhelpful.
>> 
>>> In headline generation function, we don't have any index or knowledge of
>>> where the match is. We discover the matches by first tokenizing and then
>>> comparing the matches with the query tokens. So it is hard to do
>>> anything better than first N words.
>> 
>> After looking at the code in wparser_def.c a bit more, I wonder whether
>> this patch is doing what you think it is.  Did you do any profiling to
>> confirm that tokenization is where the cost is?  Because it looks to me
>> like the match searching in hlCover() is at least O(N^2) in the number
>> of tokens in the document, which means it's probably the dominant cost
>> for any long document.  I suspect that your patch helps not so much
>> because it saves tokenization costs as because it bounds the amount of
>> effort spent in hlCover().
>> 
>> I haven't tried to do anything about this, but I wonder whether it
>> wouldn't be possible to eliminate the quadratic blowup by saving more
>> state across the repeated calls to hlCover().  At the very least, it
>> shouldn't be necessary to find the last query-token occurrence in the
>> document from scratch on each and every call.
>> 
>> Actually, this code seems probably flat-out wrong: won't every
>> successful call of hlCover() on a given document return exactly the same
>> q value (end position), namely the last token occurrence in the
>> document?  How is that helpful?
>> 
>> regards, tom lane
>> 
>> -- 
>> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-hackers
> -- 
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>   + It's impossible for everything to be true. +
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
			
		Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Sushant Sinha
		    Дата:
		        I will do the profiling and present the results. On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > Is this a TODO? > > AFAIR nothing's been done about the speed issue, so yes. I didn't > like the idea of creating a user-visible knob when the speed issue > might be fixable with internal algorithm improvements, but we never > followed up on this in either fashion. > > regards, tom lane > > > --------------------------------------------------------------------------- > > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote: > >> Sushant Sinha <sushant354@gmail.com> writes: > >>> Doesn't this force the headline to be taken from the first N words of > >>> the document, independent of where the match was? That seems rather > >>> unworkable, or at least unhelpful. > >> > >>> In headline generation function, we don't have any index or knowledge of > >>> where the match is. We discover the matches by first tokenizing and then > >>> comparing the matches with the query tokens. So it is hard to do > >>> anything better than first N words. > >> > >> After looking at the code in wparser_def.c a bit more, I wonder whether > >> this patch is doing what you think it is. Did you do any profiling to > >> confirm that tokenization is where the cost is? Because it looks to me > >> like the match searching in hlCover() is at least O(N^2) in the number > >> of tokens in the document, which means it's probably the dominant cost > >> for any long document. I suspect that your patch helps not so much > >> because it saves tokenization costs as because it bounds the amount of > >> effort spent in hlCover(). > >> > >> I haven't tried to do anything about this, but I wonder whether it > >> wouldn't be possible to eliminate the quadratic blowup by saving more > >> state across the repeated calls to hlCover(). At the very least, it > >> shouldn't be necessary to find the last query-token occurrence in the > >> document from scratch on each and every call. > >> > >> Actually, this code seems probably flat-out wrong: won't every > >> successful call of hlCover() on a given document return exactly the same > >> q value (end position), namely the last token occurrence in the > >> document? How is that helpful? > >> > >> regards, tom lane > >> > >> -- > >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > >> To make changes to your subscription: > >> http://www.postgresql.org/mailpref/pgsql-hackers > > > -- > > Bruce Momjian <bruce@momjian.us> http://momjian.us > > EnterpriseDB http://enterprisedb.com > > > + It's impossible for everything to be true. + > > > > -- > > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-hackers
Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Bruce Momjian
		    Дата:
		        On Wed, Aug 15, 2012 at 11:09:18PM +0530, Sushant Sinha wrote: > I will do the profiling and present the results. Sushant, do you have any profiling results on this issue from August? --------------------------------------------------------------------------- > > On Wed, 2012-08-15 at 12:45 -0400, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > Is this a TODO? > > > > AFAIR nothing's been done about the speed issue, so yes. I didn't > > like the idea of creating a user-visible knob when the speed issue > > might be fixable with internal algorithm improvements, but we never > > followed up on this in either fashion. > > > > regards, tom lane > > > > > --------------------------------------------------------------------------- > > > > > On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote: > > >> Sushant Sinha <sushant354@gmail.com> writes: > > >>> Doesn't this force the headline to be taken from the first N words of > > >>> the document, independent of where the match was? That seems rather > > >>> unworkable, or at least unhelpful. > > >> > > >>> In headline generation function, we don't have any index or knowledge of > > >>> where the match is. We discover the matches by first tokenizing and then > > >>> comparing the matches with the query tokens. So it is hard to do > > >>> anything better than first N words. > > >> > > >> After looking at the code in wparser_def.c a bit more, I wonder whether > > >> this patch is doing what you think it is. Did you do any profiling to > > >> confirm that tokenization is where the cost is? Because it looks to me > > >> like the match searching in hlCover() is at least O(N^2) in the number > > >> of tokens in the document, which means it's probably the dominant cost > > >> for any long document. I suspect that your patch helps not so much > > >> because it saves tokenization costs as because it bounds the amount of > > >> effort spent in hlCover(). > > >> > > >> I haven't tried to do anything about this, but I wonder whether it > > >> wouldn't be possible to eliminate the quadratic blowup by saving more > > >> state across the repeated calls to hlCover(). At the very least, it > > >> shouldn't be necessary to find the last query-token occurrence in the > > >> document from scratch on each and every call. > > >> > > >> Actually, this code seems probably flat-out wrong: won't every > > >> successful call of hlCover() on a given document return exactly the same > > >> q value (end position), namely the last token occurrence in the > > >> document? How is that helpful? > > >> > > >> regards, tom lane > > >> > > >> -- > > >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > > >> To make changes to your subscription: > > >> http://www.postgresql.org/mailpref/pgsql-hackers > > > > > -- > > > Bruce Momjian <bruce@momjian.us> http://momjian.us > > > EnterpriseDB http://enterprisedb.com > > > > > + It's impossible for everything to be true. + > > > > > > > -- > > > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > > > To make changes to your subscription: > > > http://www.postgresql.org/mailpref/pgsql-hackers > > -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
Re: text search: restricting the number of parsed words in headline generation
От
 
		    	Bruce Momjian
		    Дата:
		        FYI, I have kept this email from 2011 about poor performance of parsed words in headline generation. If someone wants to research it, please do so: http://www.postgresql.org/message-id/1314117620.3700.12.camel@dragflick --------------------------------------------------------------------------- On Tue, Aug 23, 2011 at 10:31:42PM -0400, Tom Lane wrote: > Sushant Sinha <sushant354@gmail.com> writes: > >> Doesn't this force the headline to be taken from the first N words of > >> the document, independent of where the match was? That seems rather > >> unworkable, or at least unhelpful. > > > In headline generation function, we don't have any index or knowledge of > > where the match is. We discover the matches by first tokenizing and then > > comparing the matches with the query tokens. So it is hard to do > > anything better than first N words. > > After looking at the code in wparser_def.c a bit more, I wonder whether > this patch is doing what you think it is. Did you do any profiling to > confirm that tokenization is where the cost is? Because it looks to me > like the match searching in hlCover() is at least O(N^2) in the number > of tokens in the document, which means it's probably the dominant cost > for any long document. I suspect that your patch helps not so much > because it saves tokenization costs as because it bounds the amount of > effort spent in hlCover(). > > I haven't tried to do anything about this, but I wonder whether it > wouldn't be possible to eliminate the quadratic blowup by saving more > state across the repeated calls to hlCover(). At the very least, it > shouldn't be necessary to find the last query-token occurrence in the > document from scratch on each and every call. > > Actually, this code seems probably flat-out wrong: won't every > successful call of hlCover() on a given document return exactly the same > q value (end position), namely the last token occurrence in the > document? How is that helpful? > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +