Re: BUG #17691: Unexpected behaviour using ts_headline()

Поиск

Список

Период

Сортировка

От	sebastian.patino-lang@posteo.net
Тема	Re: BUG #17691: Unexpected behaviour using ts_headline()
Дата	22 ноября 2022 г. 06:34:21
Msg-id	6702b4c1-457f-4e34-a844-bb9cfe4c7e6a@Spark обсуждение исходный текст
Ответ на	Re: BUG #17691: Unexpected behaviour using ts_headline() (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: BUG #17691: Unexpected behaviour using ts_headline()
Список	pgsql-bugs

Дерево обсуждения

Hi Tom,

thanks for the in-depth explanation. Makes a bit more sense now. Not sure what to do with the attached patch though. Should I use that to compile my own version of PG? Since I use a managed version thats not an option currently.

Regards

Sebastian

On 20. Nov 2022, 20:30 +0100, Tom Lane <tgl@sss.pgh.pa.us>, wrote:

PG Bug reporting form <noreply@postgresql.org> writes:
I experience unexpected behaviour when using ts_headline() in general, but
especially when changing MaxFragments. Given the data in
ts_headline_report.sql [1]

Thanks for sending the sample data. After poking at this a bit, it
seems the question boils down to what ts_headline should do when there
simply aren't any short matches to the query. Part of the issue there
is what qualifies as "short". The definition I installed in 78e73e875
was

+ /*
+ * We might eventually make max_cover a user-settable parameter, but for
+ * now, just compute a reasonable value based on max_words and
+ * max_fragments.
+ */
+ max_cover = Max(max_words * 10, 100);
+ if (max_fragments > 0)
+ max_cover *= max_fragments;

Looking at this again, I think the dependency on max_fragments is just
wrong. That is what is causing your different results for different
settings of MaxFragments: larger values allow recognition of longer
candidate covers ("cover" being jargon in this code for "a substring
satisfying the query"). But there's no good reason for that. I'd misread
the logic in mark_hl_fragments, which "breaks the cover into smaller
fragments such that each fragment has at most max_words", to think that
it needed to see longer covers that it could divide into fragments.
But actually max_fragments limits the number of fragments it will return
across all covers, and there doesn't seem to be a reason to connect that
parameter to the acceptable length of any particular cover. So dropping
those two lines is enough to make the strange behavior across differing
MaxFragments go away.

The other issue is what to do after finding there's no match of max_cover
or fewer tokens. As the code now stands, it'll just give up and return
the initial bit of the text, likely with no query words. That's not
great. I propose that we look for any match and return (highlight) just
its first word.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Highlight word is the first one in the result. Expectation: highlight
word is somewhere in the middle.

Meh --- I don't agree with that expectation. The fragment-based code does
act that way, but the non-fragment path never has, so I think we'd make
more people unhappy than happy if we change its behavior.

id=2: No highlight word at all.

Right, because there are no short matches to "amazon & world" anywhere.
With the attached proposed patch, we'll highlight whichever one of those
words appears first.

id=3: Highlight words are the first and last one in the result. Not ideal
but ok-ish.

As I said, that's how non-fragment highlighting has always worked. It
will extend the quoted text, but not beyond min_words, which is a bit
different from fragment highlighting.

SELECT id,
ts_headline('english', "texts"."fulltext", to_tsquery('english', 'amazon &
world'), 'MaxFragments=3, StartSel=<<, StopSel=>>') AS "preview"
FROM texts;

id=1: Wrong number of fragments (2) with highlight words are returned.

I don't agree with this expectation either. There's only one reasonably
sane match to 'amazon & world' in that text, and it's near the beginning.
As the code stands, increasing max_fragments eventually allows it to find
some less-sane matches, but that doesn't seem like an improvement.

In short, I suggest something about like the attached. For your id=2
example, which has only very far-apart occurrences of 'amazon' and
'world', this produces

<<world>> where fast runtimes make or break the popularity of research
fields, this oversight has effectively

without fragmenting, and

researchers make the effort of understanding the inner workings of
these convenient black-boxes. In a <<world>> where fast runtimes make
or break the popularity of research fields, this oversight has
effectively surrendered most

with fragmenting, which is okay by me given their historically different
behaviors. Otherwise, it gets rid of the strange changes in matching
behavior for different MaxFragment values.

regards, tom lane

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: BUG #17691: Unexpected behaviour using ts_headline()