Обсуждение: BUG #8354: stripped positions can generate nonzero rank in ts_rank_cd
The following bug has been logged on the website: Bug reference: 8354 Logged by: Alex Hill Email address: alex@hill.net.au PostgreSQL version: 9.2.4 Operating system: OS X 10.8.4 Mountain Lion Description: Hi all, The docs for ts_rank_cd state: "This function requires positional information in its input. Therefore it will not work on "stripped" tsvector values â it will always return zero." However if a tsvector contains some stripped lexemes and some non-stripped, ts_rank_cd will rank extents including the non-stripped values. For example, this evaluates to zero as expected: SELECT ts_rank_cd(strip(to_tsvector('text search')), plainto_tsquery('text search')) But this doesn't: SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')), plainto_tsquery('text search')) I think this is a bug, if not in the code then in the documentation, which isn't clear on what happens when stripped and positioned lexemes are mixed in one tsvector. I would prefer that stripped lexemes were completely ignored by ts_rank_cd: my use case is using this as a fifth pseudo-weight, which matches a @@ query but doesn't add to a ts_rank_cd ranking. What do you think? Cheers, Alex
Would someone please comment on this text search bug report? Thanks. --------------------------------------------------------------------------- On Fri, Aug 2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote: > The following bug has been logged on the website: > > Bug reference: 8354 > Logged by: Alex Hill > Email address: alex@hill.net.au > PostgreSQL version: 9.2.4 > Operating system: OS X 10.8.4 Mountain Lion > Description: > > Hi all, > > > The docs for ts_rank_cd state: > > > "This function requires positional information in its input. Therefore it > will not work on "stripped" tsvector values â it will always return zero." > > > However if a tsvector contains some stripped lexemes and some non-stripped, > ts_rank_cd will rank extents including the non-stripped values. > > > For example, this evaluates to zero as expected: > > > SELECT ts_rank_cd(strip(to_tsvector('text search')), > plainto_tsquery('text search')) > > > > > But this doesn't: > > > SELECT ts_rank_cd(to_tsvector('text') || strip(to_tsvector('search')), > plainto_tsquery('text search')) > > > > > I think this is a bug, if not in the code then in the documentation, which > isn't clear on what happens when stripped and positioned lexemes are mixed > in one tsvector. > > > I would prefer that stripped lexemes were completely ignored by ts_rank_cd: > my use case is using this as a fifth pseudo-weight, which matches a @@ query > but doesn't add to a ts_rank_cd ranking. > > > What do you think? > > > Cheers, > Alex > > > > -- > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Hi Bruce, all, I think this can be solved (if it's agreed that it's a bug) in a pretty straightforward way: when creating the document representation used in calculating cover density rank, we can just skip lexemes with no position entirely. Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354 As a patch file here: https://github.com/AlexHill/postgres/commit/cd522b254d166d569b86803115f0f499864e949b.patch Cheers, Alex On Sat, Feb 1, 2014 at 5:22 AM, Bruce Momjian <bruce@momjian.us> wrote: > > Would someone please comment on this text search bug report? Thanks. > > --------------------------------------------------------------------------- > > On Fri, Aug 2, 2013 at 07:03:42AM +0000, alex@hill.net.au wrote: > > The following bug has been logged on the website: > > > > Bug reference: 8354 > > Logged by: Alex Hill > > Email address: alex@hill.net.au > > PostgreSQL version: 9.2.4 > > Operating system: OS X 10.8.4 Mountain Lion > > Description: > > > > Hi all, > > > > > > The docs for ts_rank_cd state: > > > > > > "This function requires positional information in its input. Therefore it > > will not work on "stripped" tsvector values -- it will always return > zero." > > > > > > However if a tsvector contains some stripped lexemes and some > non-stripped, > > ts_rank_cd will rank extents including the non-stripped values. > > > > > > For example, this evaluates to zero as expected: > > > > > > SELECT ts_rank_cd(strip(to_tsvector('text search')), > > plainto_tsquery('text search')) > > > > > > > > > > But this doesn't: > > > > > > SELECT ts_rank_cd(to_tsvector('text') || > strip(to_tsvector('search')), > > plainto_tsquery('text search')) > > > > > > > > > > I think this is a bug, if not in the code then in the documentation, > which > > isn't clear on what happens when stripped and positioned lexemes are > mixed > > in one tsvector. > > > > > > I would prefer that stripped lexemes were completely ignored by > ts_rank_cd: > > my use case is using this as a fifth pseudo-weight, which matches a @@ > query > > but doesn't add to a ts_rank_cd ranking. > > > > > > What do you think? > > > > > > Cheers, > > Alex > > > > > > > > -- > > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > > To make changes to your subscription: > > http://www.postgresql.org/mailpref/pgsql-bugs > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + Everyone has their own god. + >
On Fri, Feb 7, 2014 at 02:07:59AM +0800, Alexander Hill wrote: > Hi Bruce, all, > > I think this can be solved (if it's agreed that it's a bug) in a pretty > straightforward way: when creating the document representation used in > calculating cover density rank, we can just skip lexemes with no position > entirely. > > Fix and tests here: https://github.com/AlexHill/postgres/compare/bug_8354 > > As a patch file here: https://github.com/AlexHill/postgres/commit/ > cd522b254d166d569b86803115f0f499864e949b.patch OK, great. I am attaching this patch. Would this change the contents of any indexes? That would be a problem for pg_upgrade. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Вложения
On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote: > Hi Bruce, > > In normal use this won't affect any data that ends up in an index. > > The only exception would be where somebody has created an index over the result > of ts_rank_cd - indexing search rank for the most common search terms, maybe? OK, good. > But that's true of any function whose result is indexed. What's the upgrade > policy in that case? Well, we document that the function output changed, and if anyone is using the function for an index, they need to rebuild the index, so I think we are fine fixing this for 9.4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Hi Bruce, In normal use this won't affect any data that ends up in an index. The only exception would be where somebody has created an index over the result of ts_rank_cd - indexing search rank for the most common search terms, maybe? But that's true of any function whose result is indexed. What's the upgrade policy in that case? Cheers, Alex On Tue, Feb 11, 2014 at 11:45 PM, Bruce Momjian <bruce@momjian.us> wrote: > On Fri, Feb 7, 2014 at 02:07:59AM +0800, Alexander Hill wrote: > > Hi Bruce, all, > > > > I think this can be solved (if it's agreed that it's a bug) in a pretty > > straightforward way: when creating the document representation used in > > calculating cover density rank, we can just skip lexemes with no position > > entirely. > > > > Fix and tests here: > https://github.com/AlexHill/postgres/compare/bug_8354 > > > > As a patch file here: https://github.com/AlexHill/postgres/commit/ > > cd522b254d166d569b86803115f0f499864e949b.patch > > OK, great. I am attaching this patch. Would this change the contents > of any indexes? That would be a problem for pg_upgrade. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + Everyone has their own god. + >
Hi Bruce, I've just updated the docs for ts_rank_cd in my fork to reflect this change. Would be great if that could make it in too. Patch here: https://github.com/AlexHill/postgres/commit/3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patch Is there anything else I need to do to make sure this patch ends up in 9.4? Cheers, Alex On Feb 12, 2014 10:31 PM, "Bruce Momjian" <bruce@momjian.us> wrote: > On Wed, Feb 12, 2014 at 09:49:32AM +0800, Alexander Hill wrote: > > Hi Bruce, > > > > In normal use this won't affect any data that ends up in an index. > > > > The only exception would be where somebody has created an index over the > result > > of ts_rank_cd - indexing search rank for the most common search terms, > maybe? > > OK, good. > > > But that's true of any function whose result is indexed. What's the > upgrade > > policy in that case? > > Well, we document that the function output changed, and if anyone is > using the function for an index, they need to rebuild the index, so I > think we are fine fixing this for 9.4. > > -- > Bruce Momjian <bruce@momjian.us> http://momjian.us > EnterpriseDB http://enterprisedb.com > > + Everyone has their own god. + >
On Fri, Mar 21, 2014 at 02:29:26PM +0800, Alexander Hill wrote: > Hi Bruce, > > I've just updated the docs for ts_rank_cd in my fork to reflect this change. > Would be great if that could make it in too. > > Patch here: https://github.com/AlexHill/postgres/commit/ > 3e5d86ea23d82b07188192411cdc5c5a5194a5ae.patch > > Is there anything else I need to do to make sure this patch ends up in 9.4? OK, combined patch attached. Is this correct? I keep hoping someone else will chime in with an opinion on this, but at this point I think what you have done is clear enough for me to commit for 9.4. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Вложения
Bruce Momjian <bruce@momjian.us> writes: > OK, combined patch attached. Is this correct? I keep hoping someone > else will chime in with an opinion on this, but at this point I think > what you have done is clear enough for me to commit for 9.4. That whole function is seriously undercommented; I suppose it's not this patch's charter to fix that, but we could at least write ! /* ignore words without positions */ ! entry++; ! continue; The proposed new documentation text seems pretty badly written. How about This function requires lexeme positional information to perform its calculation. Therefore it ignores any <quote>stripped</> lexemes in the <type>tsvector</>. If there are no unstripped lexemes in the input, the result will be zero. The parenthetical "See" text is ok. regards, tom lane
On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > OK, combined patch attached. Is this correct? I keep hoping someone > > else will chime in with an opinion on this, but at this point I think > > what you have done is clear enough for me to commit for 9.4. > > That whole function is seriously undercommented; I suppose it's not this > patch's charter to fix that, but we could at least write > > ! /* ignore words without positions */ > ! entry++; > ! continue; > > The proposed new documentation text seems pretty badly written. How about > > This function requires lexeme positional information to perform its > calculation. Therefore it ignores any <quote>stripped</> lexemes in the > <type>tsvector</>. If there are no unstripped lexemes in the input, the > result will be zero. > > The parenthetical "See" text is ok. OK, thanks. Updated patch attached. If Alexander Hill would like to add more C comments to any of the ts*.c files, or docs explaining what "cover desity" is, I would be glad to include those in the patch as well. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Вложения
On Fri, Mar 21, 2014 at 06:10:15PM -0400, Bruce Momjian wrote: > On Fri, Mar 21, 2014 at 10:45:18AM -0400, Tom Lane wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > OK, combined patch attached. Is this correct? I keep hoping someone > > > else will chime in with an opinion on this, but at this point I think > > > what you have done is clear enough for me to commit for 9.4. > > > > That whole function is seriously undercommented; I suppose it's not this > > patch's charter to fix that, but we could at least write > > > > ! /* ignore words without positions */ > > ! entry++; > > ! continue; > > > > The proposed new documentation text seems pretty badly written. How about > > > > This function requires lexeme positional information to perform its > > calculation. Therefore it ignores any <quote>stripped</> lexemes in the > > <type>tsvector</>. If there are no unstripped lexemes in the input, the > > result will be zero. > > > > The parenthetical "See" text is ok. > > OK, thanks. Updated patch attached. If Alexander Hill would like to > add more C comments to any of the ts*.c files, or docs explaining what > "cover desity" is, I would be glad to include those in the patch as > well. Patch applied with this commit message: Fix ts_rank_cd() to ignore stripped lexemes Previously, stripped lexemes got a default location and could be considered if mixed with non-stripped lexemes. BACKWARD INCOMPATIBILITY CHANGE -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Mon, Mar 24, 2014 at 02:37:30PM -0400, Bruce Momjian wrote: > > OK, thanks. Updated patch attached. If Alexander Hill would like to > > add more C comments to any of the ts*.c files, or docs explaining what > > "cover density" is, I would be glad to include those in the patch as > > well. > > Patch applied with this commit message: > > Fix ts_rank_cd() to ignore stripped lexemes > > Previously, stripped lexemes got a default location and could be > considered if mixed with non-stripped lexemes. > > BACKWARD INCOMPATIBILITY CHANGE FYI, I forgot to mention Alex Hill's name in this commit, so I have added a little docs around cover density and mentioned his name as author of the previous patch so we can reflect that in the 9.4 release notes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +