Thread: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly


BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15172
Logged by:          Ngigi Waithaka
Email address:      ngigi@at.co.ke
PostgreSQL version: 10.3
Operating system:   Linux
Description:

I have noticed a likely bug when using ts_headline with the <-> operator.

Assuming the following query:

SELECT ts_headline('English',
                   'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does',
                   phraseto_tsquery('English','European Commercial Bank')::tsquery);

The returned result is:
This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
<b>European</b> <b>Commercial</b> <b>Bank</b> does

This highlights the words Commercial & Bank separately in addition to
European Commercial Bank.

However, the expected output is:
This Commercial Bank does not have any Equity in Europe but <b>European</b>
<b>Commercial</b> <b>Bank</b> does

Which only highlights *European Commercial Bank* due to the <-> operator in
phraseto_tsquery.

SELECT phraseto_tsquery('English','European Commercial Bank');
returns 'european' <-> 'commerci' <-> 'bank' as expected, indicating that the
problem is with the ts_headline function.

Regards
NgigiW


Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Alex Malek
Date:

I can confirm this is still an issue in PostgreSQL 14.4

Best,
Alex



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Wed, Aug  3, 2022 at 02:02:51PM -0400, Alex Malek wrote:
> On Wed, Aug 3, 2022 at 1:58 PM PG Bug reporting form <noreply@postgresql.org>
> wrote:
>     I have a noticed a likely bug when using ts_headline with the <-> operator
> 
>     Assuming the following query:
> 
>     SELECT ts_headline('English','This Commercial Bank does not have any Equity
>     in Europe but European Commercial Bank does',
>                         phraseto_tsquery('English','European Commercial
>     Bank')::tsquery);
> 
>     The returned result is:
>     This <b>Commercial</b> <b>Bank</b> does not have any Equity in Europe but
>     <b>European</b> <b>Commercial</b> <b>Bank</b> does
> 
>     This highlights the words Commercial & Bank separately in addition to
>     European Commercial Bank.
> 
>     However, the correct output expected should be:
>     This Commercial Bank does not have any Equity in Europe but <b>European</b>
>     <b>Commercial</b> <b>Bank</b> does
> 
>     Which only highlights *European Commercial Bank* due to the <-> operator in
>     phraseto_tsquery.
> 
>     SELECT phraseto_tsquery('English','European Commercial Bank');
>     returns 'european' <-> 'commerci' <-> 'bank' as expected indicating the
>     problem is with ts_headline function.

I tested this against Postgres 11 and master (and you tested on PG 10
and 14) and found the same behavior, plus I found something even
worse:

    SELECT ts_headline('English',
    'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does',
    ('''equiti'' <-> ''bank''')::tsquery);
                                                      ts_headline
    ----------------------------------------------------------------------------------------------------------------
     This Commercial <b>Bank</b> does not have any <b>Equity</b> in Europe but European Commercial <b>Bank</b> does

Notice that "Bank" and "Equity" are not next to each other, but they
still highlight.  In fact, the words appear to be independently checked:

    SELECT ts_headline('English',
    'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does',
    ('''XXX'' <-> ''bank''')::tsquery);
                                                   ts_headline
    ---------------------------------------------------------------------------------------------------------
     This Commercial <b>Bank</b> does not have any Equity in Europe but European Commercial <b>Bank</b> does

Is this documented somewhere?

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Only you can decide what is important to you.



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Is this documented somewhere?

The docs [1] only say that ts_headline "returns an excerpt from the
document in which terms from the query are highlighted".  This
behavior does not violate that admittedly-weak contract.

IIRC, ts_headline does attempt to find a text fragment or fragments
that fully satisfy the query (e.g., include an exact phrase match)
but it will then highlight all the matching words in the fragment,
not only the location of the phrase match.  I do not agree with the
OP's opinion that that's wrong.  The highlight-em-all approach has its
own value, and in any case it may not be possible to find a full match
that satisfies the function's other constraints such as MaxWords.
Refusing to highlight anything in that event would be unhelpful.

            regards, tom lane

[1] https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-HEADLINE
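
For illustration, the highlight-em-all behaviour described above can be modelled in a few lines of Python. This is only a sketch, not PostgreSQL's implementation: it skips fragment selection entirely, the function names are hypothetical, and a crude prefix test stands in for the real English stemmer (an assumption that happens to work for 'european', 'commerci' and 'bank', but not for every lexeme).

```python
import re

def matches(word, lexeme):
    # Assumption: a lowercase prefix test stands in for real stemming.
    return word.lower().startswith(lexeme.lower())

def highlight_all(text, lexemes, start="<b>", stop="</b>"):
    """Wrap every word that matches any query lexeme, ignoring positions."""
    def wrap(m):
        word = m.group(0)
        if any(matches(word, lex) for lex in lexemes):
            return f"{start}{word}{stop}"
        return word
    return re.sub(r"\w+", wrap, text)

doc = ("This Commercial Bank does not have any Equity in Europe "
       "but European Commercial Bank does")

# Every matching word is marked, whether or not it is part of the phrase.
print(highlight_all(doc, ["european", "commerci", "bank"]))

# A lexeme that never occurs ('xxx') does not stop 'bank' from being marked.
print(highlight_all(doc, ["xxx", "bank"]))
```

Under these assumptions the first call marks both occurrences of "Commercial Bank", and the second still marks "Bank" although 'xxx' is absent, which mirrors the two results reported earlier in the thread.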



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Pavel Borisov
Date:
Hi, Bruce and Tom!

On Sun, 29 Oct 2023 at 00:46, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this documented somewhere?
>
> The docs [1] only say that ts_headline "returns an excerpt from the
> document in which terms from the query are highlighted".  This
> behavior does not violate that admittedly-weak contract.
>
> IIRC, ts_headline does attempt to find a text fragment or fragments
> that fully satisfy the query (e.g., include an exact phrase match)
> but it will then highlight all the matching words in the fragment,
> not only the location of the phrase match.  I do not agree with the
> OP's opinion that that's wrong.  The highlight-em-all approach has its
> own value, and in any case it may not be possible to find a full match
> that satisfies the function's other constraints such as MaxWords.
> Refusing to highlight anything in that event would be unhelpful.
>
>                         regards, tom lane

I think that the main purpose of ts_headline is to make Postgres
friendlier to a search-engine-like approach, which I feel is too
niche a usage scenario to support as part of the core code. If I
remember right, bug reports from users who assume it has stricter
semantics than it really does are regular. I also remember being
puzzled by unusual output myself in the past.

If we fiddle with other parameters of ts_headline we can easily have
other kinds of output that seem counterintuitive e.g.:
SELECT ts_headline('English',
                   'This Commercial Bank does not have any Equity in Europe but European Commercial Bank does',
                   ('''equiti'' <-> ''bank''')::tsquery, 'MaxWords=30, MinWords=2');
   ts_headline
-----------------
 This Commercial
(1 row)

What do you think about clearly deprecating this feature in the docs,
while still leaving it working as it is?

Kind regards,
Pavel Borisov,
Supabase.



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Sat, Oct 28, 2023 at 04:46:40PM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Is this documented somewhere?
> 
> The docs [1] only say that ts_headline "returns an excerpt from the
> document in which terms from the query are highlighted".  This
> behavior does not violate that admittedly-weak contract.
> 
> IIRC, ts_headline does attempt to find a text fragment or fragments
> that fully satisfy the query (e.g., include an exact phrase match)
> but it will then highlight all the matching words in the fragment,
> not only the location of the phrase match.  I do not agree with the

I see what you mean in this query output:

    SELECT ts_headline('English',
    'kj asdlkjf alds jflkasjd flkaj dsflkja sdlfk jaslfd kjasdlfkj salfdkj This Commercial Bank does not have any Equity in Europe but European Commercial Bank does lkj sadlkjf asldkjf alskjd flsakj fdlkaj dfaslkfd jlakds jaslkfdj',
    ('''european'' <-> ''commerci'' <-> ''bank''')::tsquery);
                                                          ts_headline
    ---------------------------------------------------------------------------------------------------------------------------------
     Europe but <b>European</b> <b>Commercial</b> <b>Bank</b> does lkj sadlkjf asldkjf alskjd flsakj fdlkaj dfaslkfd jlakds jaslkfdj

The query controls the fragment chosen.

> OP's opinion that that's wrong.  The highlight-em-all approach has its
> own value, and in any case it may not be possible to find a full match
> that satisfies the function's other constraints such as MaxWords.
> Refusing to highlight anything in that event would be unhelpful.

Attached is a proposed doc patch.

I hope people don't mind me addressing these old emails, but I think
they address important issues, and while I wasn't able to deal with them
when they were posted, I have time for the next month to do so.


Attachments

Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Sun, Oct 29, 2023 at 01:20:11AM +0400, Pavel Borisov wrote:
> Hi, Bruce and Tom!
> I think that the ts_headline main functionality is to make Postgres
> more friendly to search-engine-like approach, which I feel is too
> niche usage scenario for supporting it as a part of core code. If
> remember right, bug reports coming from the users supposing it has
> more strict semantics than it has in reality are regular. And I also
> remember myself being puzzled by unusual output in the past.
> 
> If we fiddle with other parameters of ts_headline we can easily have
> other kinds of output that seem counterintuitive e.g.:
> SELECT ts_headline('English',

I just posted a proposed doc patch which should help reduce the number
of people surprised by the highlighting.  Let's see if that helps.

FYI, here is a Stack Overflow post from 2021 linking to the original
email that started this thread from 2018:


https://stackoverflow.com/questions/69512416/is-ts-headline-intended-to-highlight-non-matching-parts-of-the-query-which-it




Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Attached is a proposed doc patch.

As I pointed out before, the fragments *don't* necessarily satisfy
the query, so this is still promising too much.

An important edge case to keep in mind is that the given text
itself might not satisfy the query; ts_headline has no control
over what you hand it.  But even if the text as a whole does,
there may not be small fragments that do.

            regards, tom lane



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Sun, Oct 29, 2023 at 11:53:35AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Attached is a proposed doc patch.
> 
> As I pointed out before, the fragments *don't* necessarily satisfy
> the query, so this is still promising too much.
> 
> An important edge case to keep in mind is that the given text
> itself might not satisfy the query; ts_headline has no control
> over what you hand it.  But even if the text as a whole does,
> there may not be small fragments that do.

How is this weasel-wording, attached.  :-)


Attachments

Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> How is this weasel-wording, attached.  :-)

Getting there.  What do you think of

+    Specifically, the function will use the query to select relevant
+    text fragments, and then highlight all words that appear in the query,
+    even if those word positions do not match the query's restrictions.

            regards, tom lane



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Mon, Oct 30, 2023 at 11:32:26AM -0400, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > How is this weasel-wording, attached.  :-)
> 
> Getting there.  What do you think of
> 
> +    Specifically, the function will use the query to select relevant
> +    text fragments, and then highlight all words that appear in the query,
> +    even if those word positions do not match the query's restrictions.

Sold!  :-)  Attached.


Attachments

Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Tom Lane
Date:
Bruce Momjian <bruce@momjian.us> writes:
> Sold!  :-)  Attached.

LGTM.

BTW, just for the OP's context: ts_headline was designed before
we had phrase search operators.  With only AND/OR/NOT, there
aren't any location restrictions on individual words.  (I recall
that we occasionally got complaints about how it shouldn't
highlight words that are supposed to NOT be there, but that was
an uncommon situation because normally you wouldn't be selecting
such a document to highlight.)  So both the function's basic
algorithm and its control parameters were designed without thought
for what to do if the query restricted match locations.  Maybe there's
a case for rethinking what it should do more than we already have;
but it's not clear that you can do much better without throwing out
the current set of control parameters as well as the algorithm.
See [1] for some context and discussion.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/840.1669405935%40sss.pgh.pa.us
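
To make the contrast concrete, here is a rough Python sketch of what position-aware highlighting, as the original report expected, might look like. It is not how ts_headline works and not a proposed patch: the function names are hypothetical, a prefix test stands in for stemming, only a plain chain of adjacent lexemes is handled (no stop words, no distance operators such as <2>), and it ignores the control parameters such as MaxWords and MinWords, which is exactly where the difficulty discussed above lies.

```python
import re

def matches(word, lexeme):
    # Assumption: a lowercase prefix test stands in for real stemming.
    return word.lower().startswith(lexeme.lower())

def highlight_phrase(text, lexemes, start="<b>", stop="</b>"):
    """Wrap only words inside a run of adjacent words matching the lexemes in order."""
    tokens = list(re.finditer(r"\w+", text))
    n = len(lexemes)
    hit = set()
    # Slide a window of n words over the text and record full phrase matches.
    for i in range(len(tokens) - n + 1):
        if all(matches(tokens[i + k].group(0), lexemes[k]) for k in range(n)):
            hit.update(range(i, i + n))
    # Rebuild the text, wrapping only the recorded word positions.
    out, last = [], 0
    for idx, tok in enumerate(tokens):
        out.append(text[last:tok.start()])
        word = tok.group(0)
        out.append(f"{start}{word}{stop}" if idx in hit else word)
        last = tok.end()
    out.append(text[last:])
    return "".join(out)

doc = ("This Commercial Bank does not have any Equity in Europe "
       "but European Commercial Bank does")

# Only the final "European Commercial Bank" is marked.
print(highlight_phrase(doc, ["european", "commerci", "bank"]))

# No phrase match at all: the text comes back unmarked.
print(highlight_phrase(doc, ["xxx", "bank"]))
```

Note the second call: when nothing satisfies the phrase, this approach highlights nothing, which is the unhelpful outcome raised earlier in the thread as an argument for the current behaviour.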



Re: BUG #15172: Postgresql ts_headline with <-> operator does not highlight text properly

From
Bruce Momjian
Date:
On Mon, Oct 30, 2023 at 12:00:38PM -0400, Bruce Momjian wrote:
> On Mon, Oct 30, 2023 at 11:32:26AM -0400, Tom Lane wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > How is this weasel-wording, attached.  :-)
> > 
> > Getting there.  What do you think of
> > 
> > +    Specifically, the function will use the query to select relevant
> > +    text fragments, and then highlight all words that appear in the query,
> > +    even if those word positions do not match the query's restrictions.
> 
> Sold!  :-)  Attached.

Patch applied back to PG 16.
