Discussion: BUG #16235: ts_rank ignores match and only considers lower weighted vector
BUG #16235: ts_rank ignores match and only considers lower weighted vector
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16235
Logged by:          Dominik Giger
Email address:      dominik.giger@gmail.com
PostgreSQL version: 12.1
Operating system:   Linux
Description:

The following query shows the problem:

select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:* & something') as query) as subquery;

Some more explanation:
doc1 looks like this: 'foo':1A 'foobar':3C 'something':2A
doc2 looks like this: 'foo':1A 'something':2A

Calling ts_rank on both vectors with the same query 'foo':* & 'something'

Expected result:
ts_rank on doc1 is the same or higher than ts_rank on doc2.

Actual result:
ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
consider the 'foobar' term with lower weight when calculating the rank. The
foo:1A is only considered in doc2.
PG Bug reporting form <noreply@postgresql.org> writes:
> The following query shows the problem:

> select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> rank_correct
> from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> setweight(to_tsvector('simple', 'foo something'), 'A') as
> doc2,
> to_tsquery('simple', 'foo:* & something') as
> query) as subquery;

> ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> consider the 'foobar' term with lower weight when calculating the rank. The
> foo:1A is only considered in doc2.

No, that's not correct.  What it actually is doing is taking some sort of
average of the weights of the occurrences, as you can see if you play
around with a few more examples besides these two.  That could be better
documented, perhaps, but I don't think it's obviously broken.

I can see that there might be a use for taking the max or even the sum
of the weights rather than an average --- in many situations it wouldn't
be desirable to rank doc1 of your example lower than doc2.  But really
that'd be a different ranking algorithm, not a bug fix for this one.

The manual claims you can write your own ranking algorithm ... but
AFAICS you'd have to code it in C, because we aren't exposing anything
at SQL level that would let you get at the raw match data :-(.
So there's room for improvement there.

Also, you might try using ts_rank_cd() instead, as that uses a different
algorithm for combining the weights.  At least on this example, doc1
gets a higher score than doc2.

			regards, tom lane
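To make the difference between the combining rules mentioned above concrete (average versus max versus sum), here is a toy Python sketch. It is not PostgreSQL's ts_rank code and does not try to reproduce its output. The names WEIGHTS, combine, doc1_matches and doc2_matches are hypothetical; the only borrowed fact is the documented default weight array {0.1, 0.2, 0.4, 1.0} for labels D, C, B, A. It looks only at the lexemes that match foo:* in each document.

```python
# Toy model only: this is NOT PostgreSQL's ts_rank implementation.
# Default ts_rank weights for the labels D, C, B, A (per the PostgreSQL docs).
WEIGHTS = {"D": 0.1, "C": 0.2, "B": 0.4, "A": 1.0}

# Weight labels of the lexemes matching the prefix term foo:* in each document
# (assumption: we ignore positions and the second query term entirely).
doc1_matches = ["A", "C"]  # 'foo':1A and 'foobar':3C
doc2_matches = ["A"]       # 'foo':1A


def combine(labels, how):
    """Combine the weights of all matching occurrences using one rule."""
    values = [WEIGHTS[label] for label in labels]
    if how == "average":
        return sum(values) / len(values)
    if how == "max":
        return max(values)
    if how == "sum":
        return sum(values)
    raise ValueError(f"unknown rule: {how}")


# Print doc1 and doc2 side by side for each hypothetical combining rule.
for rule in ("average", "max", "sum"):
    print(rule, combine(doc1_matches, rule), combine(doc2_matches, rule))
```

Under these assumptions only the average rule puts doc1 below doc2; max ties them and sum puts doc1 ahead, which is the ordering the reporter expected.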
Re: BUG #16235: ts_rank ignores match and only considers lower weighted vector
From
Dominik Giger
Date:
On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> > rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> > setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> > setweight(to_tsvector('simple', 'foo something'), 'A') as
> > doc2,
> > to_tsquery('simple', 'foo:* & something') as
> > query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct.  What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two.  That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2.  But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights.  At least on this example, doc1
> gets a higher score than doc2.
>
> regards, tom lane

I see, thank you for the explanation.

Maybe I can add another reason why I think it might be a bug.
Consider this query:

select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as query) as subquery;

Here I only removed the '& something' part of the query.
Now the query behaves as one would expect: The first rank is higher
than the second.

I am unsure why adding a second search term (which is contained in
both documents) would lead to a change in the ranking order.

What do you think?

Regards,
Dominik Giger