Thread: BUG #17468: Ranking of search results: ts_rank_cd with normalization variant 4

BUG #17468: Ranking of search results: ts_rank_cd with normalization variant 4

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17468
Logged by:          vicreal
Email address:      vicreal@yandex.ru
PostgreSQL version: 13.1
Operating system:   Debian 10
Description:

Test 1
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function | calculates'), 4); -- 0.1
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function'), 4);              -- 0.1

Test 2
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function | calculates'), 1); -- 0.124
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function'), 1);              -- 0.062

Test 3
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function | calculates'), 4|1); -- 0.062 (X)
SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function'), 4|1);              -- 0.062 (Y)

Expected behavior:
1) In test 3, rank Y should be smaller than rank X (as in test 2).

2) How can I get ranks that differ (Y < X) when using normalization variant 4?


Re: BUG #17468: Ranking of search results: ts_rank_cd with normalization variant 4

From
"David G. Johnston"
Date:
On Mon, Apr 25, 2022 at 2:03 PM PG Bug reporting form <noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17468
> Logged by:          vicreal
> Email address:      vicreal@yandex.ru
> PostgreSQL version: 13.1
> Operating system:   Debian 10
> Description:
>
> Test 3
> SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function | calculates'), 4|1); -- 0.062 (X)
> SELECT ts_rank_cd(to_tsvector('This function calculates the coverage density'), to_tsquery('function'), 4|1);              -- 0.062 (Y)
>
> Expected behavior:
> 1) In test 3, rank Y should be smaller than rank X (as in test 2).
>
> 2) How can I get ranks that differ (Y < X) when using normalization variant 4?

Why are you writing the number 5 as "4|1" (4 "bitwise or" 1) in Test 3?

David J.

Re: BUG #17468: Ranking of search results: ts_rank_cd with normalization variant 4

From
"David G. Johnston"
Date:
On Mon, Apr 25, 2022 at 2:34 PM Артем Александров <vicreal@yandex.ru> wrote:
>> Why are you writing the number 5 as "4|1" (4 "bitwise or" 1) in Test 3?
>
> According to the reference: "The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4)".

I was so surprised by the use of a bit mask here that I didn't read that far, my bad.

> The task is as follows:
> 1) use normalization option 4 (the rank is divided by the mean harmonic distance between extents)
> 2) in test 3, get the result Y < X


You are first applying normalization option 1 (divide the rank by 1 + log(length)).
Then, for option 4, that first result is divided by the "mean harmonic distance between extents"; I have no idea how to compute that off the top of my head.

"If more than one flag bit is specified, the transformations are applied in the order listed."

The listed order is the documented order, not the order you specify. The function has no way to know whether the 5 it received was written as 5, 1|4, or 4|1.
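
To illustrate the point about operand order, here is a minimal sketch (the column aliases are made up for this example). The | operator is evaluated on the integer literals before ts_rank_cd is ever called, so every spelling produces the same argument:

```sql
-- Bitwise OR is commutative, and 4|1 sets the same bits as the literal 5,
-- so ts_rank_cd cannot distinguish between these spellings.
SELECT 4|1       AS four_or_one,   -- 5
       1|4       AS one_or_four,   -- 5
       (4|1) = 5 AS same_as_five;  -- true
```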

PostgreSQL is open source, so you are welcome to demonstrate specifically, using numbers, where the error in the calculation is and, ideally, where it happens in the code. With the information given, I don't know enough to say whether your assertion of a bug is correct or a misunderstanding on your part.

David J.

Re: BUG #17468: Ranking of search results: ts_rank_cd with normalization variant 4

From
"David G. Johnston"
Date:
On Mon, Apr 25, 2022 at 2:47 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
> PostgreSQL is open source, so you are welcome to demonstrate specifically, using numbers, where the error in the calculation is and, ideally, where it happens in the code. With the information given, I don't know enough to say whether your assertion of a bug is correct or a misunderstanding on your part.

If I don't normalize at all, the queries have ranks of 0.2 and 0.1 (two-word and single-word, respectively).
From your first test, this means the option 4 divisors are 2 and 1, respectively, since the results are 0.1 and 0.1.
From the second test, the option 1 normalized ranks are 0.124 and 0.062, respectively.
Dividing those by the option 4 divisors of 2 and 1 yields 0.062 and 0.062, which is what you show in the third test.
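
The comparison above can be run as a single query that puts all four normalization settings side by side. This is only a sketch: the explicit 'english' configuration, the CTE names, and the column aliases are assumptions that do not appear in the original report, and the values in the comments are the ones quoted in this thread, not independently verified output.

```sql
-- Compare unnormalized and normalized ranks for both queries at once.
WITH doc AS (
    SELECT to_tsvector('english',
                       'This function calculates the coverage density') AS v
),
q(label, query) AS (
    VALUES ('two-word', to_tsquery('english', 'function | calculates')),
           ('one-word', to_tsquery('english', 'function'))
)
SELECT q.label,
       ts_rank_cd(doc.v, q.query, 0) AS norm_0,  -- no normalization
       ts_rank_cd(doc.v, q.query, 1) AS norm_1,  -- option 1 only
       ts_rank_cd(doc.v, q.query, 4) AS norm_4,  -- option 4 only
       ts_rank_cd(doc.v, q.query, 5) AS norm_5   -- options 1 and 4 (same as 4|1)
FROM doc CROSS JOIN q;

-- Values reported in this thread (PostgreSQL 13.1):
--   two-word: norm_0 = 0.2, norm_1 = 0.124, norm_4 = 0.1, norm_5 = 0.062
--   one-word: norm_0 = 0.1, norm_1 = 0.062, norm_4 = 0.1, norm_5 = 0.062
```

Reading across a row, norm_5 is norm_1 divided by the same factor that turns norm_0 into norm_4, which is why the two queries end up with the same rank.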

Thus, this is not a bug. You did not check the unnormalized values, which are the required starting point; more generally, you did not prove your claim or state precisely what the correct answer should have been (or at least why the relative values should be what you claimed). Even if there is a bug, it is not possible for only test 3 to be wrong here.

David J.