Обсуждение: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

Поиск
Список
Период
Сортировка

BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
alex@hill.net.au
Дата:
The following bug has been logged on the website:

Bug reference:      8354
Logged by:          Alex Hill
Email address:      alex@hill.net.au
PostgreSQL version: 9.2.4
Operating system:   OS X 10.8.4 Mountain Lion
Description:

Hi all,


The docs for ts_rank_cd state:


"This function requires positional information in its input. Therefore it
will not work on "stripped" tsvector values — it will always return zero."


However if a tsvector contains some stripped lexemes and some non-stripped,
ts_rank_cd will rank extents including the non-stripped values.


For example, this evaluates to zero as expected:


    SELECT ts_rank_cd(strip(to_tsvector('text search')),
plainto_tsquery('text search'))




But this doesn't:


    SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')),
plainto_tsquery('text search'))




I think this is a bug, if not in the code then in the documentation, which
isn't clear on what happens when stripped and positioned lexemes are mixed
in one tsvector.


I would prefer that stripped lexemes were completely ignored by ts_rank_cd:
my use case is using this as a fifth pseudo-weight, which matches a @@ query
but doesn't add to a ts_rank_cd ranking.


What do you think?


Cheers,
Alex

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
Would someone please comment on this text search bug report?  Thanks.

---------------------------------------------------------------------------

On Fri, Aug  2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote:
> The following bug has been logged on the website:
>
> Bug reference:      8354
> Logged by:          Alex Hill
> Email address:      alex@hill.net.au
> PostgreSQL version: 9.2.4
> Operating system:   OS X 10.8.4 Mountain Lion
> Description:
>
> Hi all,
>
>
> The docs for ts_rank_cd state:
>
>
> "This function requires positional information in its input. Therefore it
> will not work on "stripped" tsvector values — it will always return zero."
>
>
> However if a tsvector contains some stripped lexemes and some non-stripped,
> ts_rank_cd will rank extents including the non-stripped values.
>
>
> For example, this evaluates to zero as expected:
>
>
>     SELECT ts_rank_cd(strip(to_tsvector('text search')),
> plainto_tsquery('text search'))
>
>
>
>
> But this doesn't:
>
>
>     SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')),
> plainto_tsquery('text search'))
>
>
>
>
> I think this is a bug, if not in the code then in the documentation, which
> isn't clear on what happens when stripped and positioned lexemes are mixed
> in one tsvector.
>
>
> I would prefer that stripped lexemes were completely ignored by ts_rank_cd:
> my use case is using this as a fifth pseudo-weight, which matches a @@ query
> but doesn't add to a ts_rank_cd ranking.
>
>
> What do you think?
>
>
> Cheers,
> Alex
>
>
>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Alexander Hill
Дата:
Hi Bruce, all,

I think this can be solved (if it's agreed that it's a bug) in a pretty
straightforward way: when creating the document representation used in
calculating cover density rank, we can just skip lexemes with no position
entirely.

Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354

As a patch file here:
https://github.com/AlexHill/postgres/commit/cd522b254d166d569b86803115f0f499864e949b.patch

Cheers,
Alex





On Sat, Feb 1, 2014 at 5:22 AM, Bruce Momjian <bruce@momjian.us> wrote:

>
> Would someone please comment on this text search bug report?  Thanks.
>
> ---------------------------------------------------------------------------
>
> On Fri, Aug  2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote:
> > The following bug has been logged on the website:
> >
> > Bug reference:      8354
> > Logged by:          Alex Hill
> > Email address:      alex@hill.net.au
> > PostgreSQL version: 9.2.4
> > Operating system:   OS X 10.8.4 Mountain Lion
> > Description:
> >
> > Hi all,
> >
> >
> > The docs for ts_rank_cd state:
> >
> >
> > "This function requires positional information in its input. Therefore it
> > will not work on "stripped" tsvector values -- it will always return
> zero."
> >
> >
> > However if a tsvector contains some stripped lexemes and some
> non-stripped,
> > ts_rank_cd will rank extents including the non-stripped values.
> >
> >
> > For example, this evaluates to zero as expected:
> >
> >
> >     SELECT ts_rank_cd(strip(to_tsvector('text search')),
> > plainto_tsquery('text search'))
> >
> >
> >
> >
> > But this doesn't:
> >
> >
> >     SELECT ts_rank_cd(to_tsvector('text') ||
> strip(to_tsvector('search')),
> > plainto_tsquery('text search'))
> >
> >
> >
> >
> > I think this is a bug, if not in the code then in the documentation,
> which
> > isn't clear on what happens when stripped and positioned lexemes are
> mixed
> > in one tsvector.
> >
> >
> > I would prefer that stripped lexemes were completely ignored by
> ts_rank_cd:
> > my use case is using this as a fifth pseudo-weight, which matches a @@
> query
> > but doesn't add to a ts_rank_cd ranking.
> >
> >
> > What do you think?
> >
> >
> > Cheers,
> > Alex
> >
> >
> >
> > --
> > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-bugs
>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + Everyone has their own god. +
>

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Fri, Feb  7, 2014 at 02:07:59AM +0800, Alexander Hill wrote:
> Hi Bruce, all,
>
> I think this can be solved (if it's agreed that it's a bug) in a pretty
> straightforward way: when creating the document representation used in
> calculating cover density rank, we can just skip lexemes with no position
> entirely.
>
> Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354
>
> As a patch file here: https://github.com/AlexHill/postgres/commit/
> cd522b254d166d569b86803115f0f499864e949b.patch

OK, great.  I am attaching this patch.  Would this change the contents
of any indexes?  That would be a problem for pg_upgrade.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Вложения

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote:
> Hi Bruce,
>
> In normal use this won't affect any data that ends up in an index.
>
> The only exception would be where somebody has created an index over the result
> of ts_rank_cd - indexing search rank for the most common search terms, maybe?

OK, good.

> But that's true of any function whose result is indexed. What's the upgrade
> policy in that case?

Well, we document that the function output changed, and if anyone is
using the function for an index, they need to rebuild the index, so I
think we are fine fixing this for 9.4.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Alexander Hill
Дата:
Hi Bruce,

In normal use this won't affect any data that ends up in an index.

The only exception would be where somebody has created an index over the
result of ts_rank_cd - indexing search rank for the most common search
terms, maybe?

But that's true of any function whose result is indexed. What's the upgrade
policy in that case?

Cheers,
Alex


On Tue, Feb 11, 2014 at 11:45 PM, Bruce Momjian <bruce@momjian.us> wrote:

> On Fri, Feb  7, 2014 at 02:07:59AM +0800, Alexander Hill wrote:
> > Hi Bruce, all,
> >
> > I think this can be solved (if it's agreed that it's a bug) in a pretty
> > straightforward way: when creating the document representation used in
> > calculating cover density rank, we can just skip lexemes with no position
> > entirely.
> >
> > Fix and tests here:
> https://github.com/AlexHill/postgres/compare/bug_8354
> >
> > As a patch file here: https://github.com/AlexHill/postgres/commit/
> > cd522b254d166d569b86803115f0f499864e949b.patch
>
> OK, great.  I am attaching this patch.  Would this change the contents
> of any indexes?  That would be a problem for pg_upgrade.
>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + Everyone has their own god. +
>

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Alexander Hill
Дата:
Hi Bruce,

I've just updated the docs for ts_rank_cd in my fork to reflect this
change. Would be great if that could make it in too.

Patch here:
https://github.com/AlexHill/postgres/commit/3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patch

Is there anything else I need to do to make sure this patch ends up in 9.4?

Cheers,

Alex
On Feb 12, 2014 10:31 PM, "Bruce Momjian" <bruce@momjian.us> wrote:

> On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote:
> > Hi Bruce,
> >
> > In normal use this won't affect any data that ends up in an index.
> >
> > The only exception would be where somebody has created an index over the
> result
> > of ts_rank_cd - indexing search rank for the most common search terms,
> maybe?
>
> OK, good.
>
> > But that's true of any function whose result is indexed. What's the
> upgrade
> > policy in that case?
>
> Well, we document that the function output changed, and if anyone is
> using the function for an index, they need to rebuild the index, so I
> think we are fine fixing this for 9.4.
>
> --
>   Bruce Momjian  <bruce@momjian.us>        http://momjian.us
>   EnterpriseDB                             http://enterprisedb.com
>
>   + Everyone has their own god. +
>

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Fri, Mar 21, 2014 at 02:29:26PM +0800, Alexander Hill wrote:
> Hi Bruce,
>
> I've just updated the docs for ts_rank_cd in my fork to reflect this change.
> Would be great if that could make it in too.
>
> Patch here: https://github.com/AlexHill/postgres/commit/
> 3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patch
>
> Is there anything else I need to do to make sure this patch ends up in 9.4?

OK, combined patch attached.  Is this correct?  I keep hoping someone
else will chime in with an opinion on this, but at this point I think
what you have done is clear enough for me to commit for 9.4.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Вложения

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> OK, combined patch attached.  Is this correct?  I keep hoping someone
> else will chime in with an opinion on this, but at this point I think
> what you have done is clear enough for me to commit for 9.4.

That whole function is seriously undercommented; I suppose it's not this
patch's charter to fix that, but we could at least write

!                           /* ignore words without positions */
!                 entry++;
!                 continue;

The proposed new documentation text seems pretty badly written.  How about

This function requires lexeme positional information to perform its
calculation.  Therefore it ignores any <quote>stripped</> lexemes in the
<type>tsvector</>.  If there are no unstripped lexemes in the input, the
result will be zero.

The parenthetical "See" text is ok.

            regards, tom lane

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > OK, combined patch attached.  Is this correct?  I keep hoping someone
> > else will chime in with an opinion on this, but at this point I think
> > what you have done is clear enough for me to commit for 9.4.
>
> That whole function is seriously undercommented; I suppose it's not this
> patch's charter to fix that, but we could at least write
>
> !                           /* ignore words without positions */
> !                 entry++;
> !                 continue;
>
> The proposed new documentation text seems pretty badly written.  How about
>
> This function requires lexeme positional information to perform its
> calculation.  Therefore it ignores any <quote>stripped</> lexemes in the
> <type>tsvector</>.  If there are no unstripped lexemes in the input, the
> result will be zero.
>
> The parenthetical "See" text is ok.

OK, thanks.  Updated patch attached.  If Alexander Hill would like to
add more C comments to any of the ts*.c files, or docs explaining what
"cover desity" is, I would be glad to include those in the patch as
well.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Вложения

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Fri, Mar 21, 2014 at 06:10:15PM -0400, Bruce Momjian wrote:
> On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > OK, combined patch attached.  Is this correct?  I keep hoping someone
> > > else will chime in with an opinion on this, but at this point I think
> > > what you have done is clear enough for me to commit for 9.4.
> >
> > That whole function is seriously undercommented; I suppose it's not this
> > patch's charter to fix that, but we could at least write
> >
> > !                           /* ignore words without positions */
> > !                 entry++;
> > !                 continue;
> >
> > The proposed new documentation text seems pretty badly written.  How about
> >
> > This function requires lexeme positional information to perform its
> > calculation.  Therefore it ignores any <quote>stripped</> lexemes in the
> > <type>tsvector</>.  If there are no unstripped lexemes in the input, the
> > result will be zero.
> >
> > The parenthetical "See" text is ok.
>
> OK, thanks.  Updated patch attached.  If Alexander Hill would like to
> add more C comments to any of the ts*.c files, or docs explaining what
> "cover desity" is, I would be glad to include those in the patch as
> well.

Patch applied with this commit message:

    Fix ts_rank_cd() to ignore stripped lexemes

    Previously, stripped lexemes got a default location and could be
    considered if mixed with non-stripped lexemes.

    BACKWARD INCOMPATIBILITY CHANGE

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +

Re: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd

От
Bruce Momjian
Дата:
On Mon, Mar 24, 2014 at 02:37:30PM -0400, Bruce Momjian wrote:
> > OK, thanks.  Updated patch attached.  If Alexander Hill would like to
> > add more C comments to any of the ts*.c files, or docs explaining what
> > "cover density" is, I would be glad to include those in the patch as
> > well.
>
> Patch applied with this commit message:
>
>     Fix ts_rank_cd() to ignore stripped lexemes
>
>     Previously, stripped lexemes got a default location and could be
>     considered if mixed with non-stripped lexemes.
>
>     BACKWARD INCOMPATIBILITY CHANGE

FYI, I forgot to mention Alex Hill's name in this commit, so I have
added a little docs around cover density and mentioned his name as
author of the previous patch so we can reflect that in the 9.4 release
notes.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +