On Mon, Jan 27, 2020 at 11:35 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> PG Bug reporting form <noreply@postgresql.org> writes:
> > The following query shows the problem:
>
> > select ts_rank(doc1, query) as rank_wrong, ts_rank(doc2, query) as
> > rank_correct
> > from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
> > setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
> > setweight(to_tsvector('simple', 'foo something'), 'A') as
> > doc2,
> > to_tsquery('simple', 'foo:* & something') as
> > query) as subquery;
>
> > ts_rank on doc1 is only half of the rank of doc2. ts_rank seems to only
> > consider the 'foobar' term with lower weight when calculating the rank. The
> > foo:1A is only considered in doc2.
>
> No, that's not correct. What it actually is doing is taking some sort of
> average of the weights of the occurrences, as you can see if you play
> around with a few more examples besides these two. That could be better
> documented, perhaps, but I don't think it's obviously broken.
>
> I can see that there might be a use for taking the max or even the sum
> of the weights rather than an average --- in many situations it wouldn't
> be desirable to rank doc1 of your example lower than doc2. But really
> that'd be a different ranking algorithm, not a bug fix for this one.
>
> The manual claims you can write your own ranking algorithm ... but
> AFAICS you'd have to code it in C, because we aren't exposing anything
> at SQL level that would let you get at the raw match data :-(.
> So there's room for improvement there.
>
> Also, you might try using ts_rank_cd() instead, as that uses a different
> algorithm for combining the weights. At least on this example, doc1
> gets a higher score than doc2.
>
> regards, tom lane

I see, thank you for the explanation.

Maybe I can add another reason why I think it might be a bug. Consider
this query:

select ts_rank(doc1, query) as rank_wrong,
       ts_rank(doc2, query) as rank_correct
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as query) as subquery;

Here I only removed the '& something' part of the query. Now the query
behaves as one would expect: the rank of doc1 is higher than the rank
of doc2. I don't see why adding a second search term, which is
contained in both documents, would change the ranking order.
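
To make the comparison easier to reproduce, both queries can be run side
by side in one statement. This is only a sketch: the aliases q_prefix and
q_both and the output column names are made up for this example, and the
ts_rank_cd columns are included only because ts_rank_cd() was suggested
above as an alternative.

```sql
-- Rank the same two documents against both queries in a single row.
-- q_prefix and q_both are illustrative names, not part of the original report.
select ts_rank(doc1, q_prefix)  as doc1_prefix_only,
       ts_rank(doc2, q_prefix)  as doc2_prefix_only,
       ts_rank(doc1, q_both)    as doc1_both_terms,
       ts_rank(doc2, q_both)    as doc2_both_terms,
       -- same comparison with the cover-density ranking function
       ts_rank_cd(doc1, q_both) as doc1_both_terms_cd,
       ts_rank_cd(doc2, q_both) as doc2_both_terms_cd
from (select setweight(to_tsvector('simple', 'foo something'), 'A') ||
             setweight(to_tsvector('simple', 'foobar'), 'C') as doc1,
             setweight(to_tsvector('simple', 'foo something'), 'A') as doc2,
             to_tsquery('simple', 'foo:*') as q_prefix,
             to_tsquery('simple', 'foo:* & something') as q_both) as subquery;
```

Comparing the doc1 and doc2 columns for each query shows whether the
ordering flips once '& something' is added.
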
What do you think?

Regards,
Dominik Giger