Обсуждение: Mailing list search engine: surprising missing results?

Поиск

Список

Период

Сортировка

Mailing list search engine: surprising missing results?

От

James Addison

Дата:

23 января 2022 г., 15:49:07

Hello,

I noticed that the mailing list search engine[1] seems to unexpectedly
miss results for some queries.

For example:

A search for "boyer"[2] returns five results, including result
snippets that contain the text "Boyer-More-Horspool" [sic] and
"Boyer-Moore-Horspool".

However, a more specific search for "boyer-moore"[3] does not return
any results -- that seems surprising.

Specializing the query further and searching for
"boyer-moore-horspool"[4] *does* again return results -- two documents
-- with the terms "boyer" and "horspool" highlighted.

Although it's not a significant problem, I do have a theory that could
explain the behaviour (offered in case it may save time on
investigation):

It seems possible that the term "more" -- and nearby misspellings,
like "moore" -- may be filtered out as stopwords (meaning: they're not
present in the search index), and that the search engine is configured
to require a minimum percentage match rate for query terms.

Under those conditions: searches for "boyer" would produce an 100%
match rate, "boyer-moore" would produce 50% (since "moore" would not
be found in the term index), and "boyer-moore-horspool" would match at
66-point-6-repeating percent.

Given a required match rate of around two thirds, that could explain
the behaviour (it might not be the true reason, but it seems like one
possibility).

Thanks,
James

[1] https://www.postgresql.org/search/
[2] https://www.postgresql.org/search/?m=1&q=boyer&l=1&d=365&s=r
[3] https://www.postgresql.org/search/?m=1&q=boyer-moore&l=1&d=365&s=r
[4] https://www.postgresql.org/search/?m=1&q=boyer-moore-horspool&l=1&d=365&s=r

Re: Mailing list search engine: surprising missing results?

От

Laurenz Albe

Дата:

24 января 2022 г., 10:27:41

On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> Hello,
> 
> I noticed that the mailing list search engine[1] seems to unexpectedly
> miss results for some queries.
> 
> For example:
> 
> A search for "boyer"[2] returns five results, including result
> snippets that contain the text "Boyer-More-Horspool" [sic] and
> "Boyer-Moore-Horspool".
> 
> However, a more specific search for "boyer-moore"[3] does not return
> any results -- that seems surprising.
> 
> Specializing the query further and searching for
> "boyer-moore-horspool"[4] *does* again return results -- two documents
> -- with the terms "boyer" and "horspool" highlighted.

This is caused by the peculiarities of PostgreSQL full text search:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
       @@ websearch_to_tsquery('english', 'boyer-moore');

 ?column?
══════════
 f
(1 row)

The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
isn't:

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');

                       to_tsvector
══════════════════════════════════════════════════════════
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

SELECT websearch_to_tsquery('english', 'boyer-moore');

         websearch_to_tsquery
═════════════════════════════════════
 'boyer-moor' <-> 'boyer' <-> 'moor'
(1 row)

'boyer-moor' is not present in the first result.

As a workaround, I suggest that you search for 'boyer moore'
or (even better) '"boyer moore"' (with the double quotes):

SELECT websearch_to_tsquery('english', 'boyer moore');

 websearch_to_tsquery
══════════════════════
 'boyer' & 'moor'
(1 row)

SELECT websearch_to_tsquery('english', '"boyer moore"');

 websearch_to_tsquery
══════════════════════
 'boyer' <-> 'moor'
(1 row)

Yours,
Laurenz Albe

Re: Mailing list search engine: surprising missing results?

От

Bruce Momjian

Дата:

24 января 2022 г., 22:28:00

On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote:
> > Specializing the query further and searching for
> > "boyer-moore-horspool"[4] *does* again return results -- two documents
> > -- with the terms "boyer" and "horspool" highlighted.
> 
> This is caused by the peculiarities of PostgreSQL full text search:
> 
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool')
>        @@ websearch_to_tsquery('english', 'boyer-moore');
> 
>  ?column?
> ══════════
>  f
> (1 row)
> 
> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> isn't:

Wow, he showed me this problem earlier but I never suspected it was
stemming issue because I never considered proper nowns could be
stem-adjusted, but it is obvious they can.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

Re: Mailing list search engine: surprising missing results?

От

Tom Lane

Дата:

24 января 2022 г., 23:47:29

Bruce Momjian <bruce@momjian.us> writes:
> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>> isn't:

> Wow, he showed me this problem earlier but I never suspected it was
> stemming issue because I never considered proper nowns could be
> stem-adjusted, but it is obvious they can.

I wonder if we should change that so that components of a compound
word are consistently stemmed the same way.

            regards, tom lane

Re: Mailing list search engine: surprising missing results?

От

Bruce Momjian

Дата:

25 января 2022 г., 01:15:32

On Mon, Jan 24, 2022 at 03:47:29PM -0500, Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> >> isn't:
> 
> > Wow, he showed me this problem earlier but I never suspected it was
> > stemming issue because I never considered proper nowns could be
> > stem-adjusted, but it is obvious they can.
> 
> I wonder if we should change that so that components of a compound
> word are consistently stemmed the same way.

I don't see the value in a change --- it might break the same number of
cases it fixes.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  If only the physical world exists, free will is an illusion.

Re: Mailing list search engine: surprising missing results?

От

Oleg Bartunov

Дата:

25 января 2022 г., 14:04:09

On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:
> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>> isn't:

> Wow, he showed me this problem earlier but I never suspected it was
> stemming issue because I never considered proper nowns could be
> stem-adjusted, but it is obvious they can.

I wonder if we should change that so that components of a compound
word are consistently stemmed the same way.

Something like this

SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
to_tsvector

----------------------------------------------------------
'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)

regards, tom lane

Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: Mailing list search engine: surprising missing results?

От

Laurenz Albe

Дата:

25 января 2022 г., 15:43:48

On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Bruce Momjian <bruce@momjian.us> writes:
> > > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
> > > > The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
> > > > is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
> > > > isn't:
> > 
> > > Wow, he showed me this problem earlier but I never suspected it was
> > > stemming issue because I never considered proper nowns could be
> > > stem-adjusted, but it is obvious they can.
> > 
> > I wonder if we should change that so that components of a compound
> > word are consistently stemmed the same way.
>
> Something like this
> 
> SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
>                        to_tsvector
> ----------------------------------------------------------
>  'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1  'moore-horspool':1  'horspool':4 'moor':3
> (1 row)

Not quite.  The problem is question is the "'boyer-moore':1".
If that were "'boyer-moor':1" instead, the problem would disappear.

Yours,
Laurenz Albe

Re: Mailing list search engine: surprising missing results?

От

Tom Lane

Дата:

25 января 2022 г., 19:22:33

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:
>> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Bruce Momjian <bruce@momjian.us> writes:
>>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>>>>> isn't:

> Not quite.  The problem is question is the "'boyer-moore':1".
> If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
                       to_tsvector                        
----------------------------------------------------------
 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');
            to_tsvector            
-----------------------------------
 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');
                        to_tsvector                        
-----------------------------------------------------------
 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.

            regards, tom lane

Re: Mailing list search engine: surprising missing results?

От

Ivan Panchenko

Дата:

25 января 2022 г., 20:02:36

On 25.01.2022 19:22, Tom Lane wrote:

Laurenz Albe <laurenz.albe@cybertec.at> writes:

On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:

On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Bruce Momjian <bruce@momjian.us> writes:

On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:

The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
isn't:

Not quite.  The problem is question is the "'boyer-moore':1".
If that were "'boyer-moor':1" instead, the problem would disappear.

Actually, when I try this here, it seems like the stemming *is*
consistent:

regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool');                       to_tsvector                        
---------------------------------------------------------- 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3
(1 row)

regression=# SELECT to_tsvector('english', 'Boyer-Moore');            to_tsvector            
----------------------------------- 'boyer':2 'boyer-moor':1 'moor':3
(1 row)

If you try variants of that where the first or third term is stemmable,
say

regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool');                        to_tsvector                        
----------------------------------------------------------- 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3
(1 row)

it sure appears that each component word is stemmed independently
already.  So I think the original explanation here is wrong and
we need to probe more closely.

The actual explanation can be seen from comparing a tsvector with a tsquery.
To avoid stemming effects, we use the simple configuration below.

# select plainto_tsquery('simple','boyers-moore');

           plainto_tsquery           
-------------------------------------
 'boyers-moore' & 'boyers' & 'moore'

# select to_tsvector('simple','boyers-moore-horspool');

                         to_tsvector                        
-------------------------------------------------------------
 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be

 'boyers-moore' | ('boyers' & 'moore')

May be, it is worth changing to_tsquery() behavior for such cases.

			regards, tom lane

Regards,
Ivan

Re: Mailing list search engine: surprising missing results?

От

Tom Lane

Дата:

25 января 2022 г., 20:54:28

Ivan Panchenko <i.panchenko@postgrespro.ru> writes:
> The actual explanation can be seen from comparing a tsvector with a tsquery.
> To avoid stemming effects, we use the simple configuration below.

> # select plainto_tsquery('simple','boyers-moore');

>             plainto_tsquery
> -------------------------------------
>   'boyers-moore' & 'boyers' & 'moore'

> # select to_tsvector('simple','boyers-moore-horspool');

>                           to_tsvector
> -------------------------------------------------------------
>   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3

> Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be

>   'boyers-moore' | ('boyers' & 'moore')

> May be, it is worth changing to_tsquery() behavior for such cases.

Changing the behavior of to_tsquery is certainly a lot less scary
than changing to_tsvector --- it wouldn't call the validity of
existing tsvector indexes into question.

I see that to_tsquery is even sillier than plainto_tsquery:

regression=# select to_tsquery('simple','boyers-moore');
               to_tsquery
-----------------------------------------
 'boyers-moore' <-> 'boyers' <-> 'moore'
(1 row)

which is absolutely not a sane translation.

It seems to me that in both cases we'd be better off generating
"'boyers' <-> 'moore'", without the compound token at all.
Maybe there's a case for the weaker 'boyers' & 'moore' translation,
but I think if people wanted that they'd just enter separate words.

            regards, tom lane

Re: Mailing list search engine: surprising missing results?

От

James Addison

Дата:

25 января 2022 г., 23:48:34

I'm uncertain why parsing hyphenated query text produces compound tokens?

There are a couple of references[1][2] in the documentation about the
dash character being converted to a boolean not (!) operator by
websearch_to_tsquery, but that seems unrelated.

postgres=# select plainto_tsquery('simple', 'a-b');
  plainto_tsquery
-------------------
 'a-b' & 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a_b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

postgres=# select plainto_tsquery('simple', 'a+b');
 plainto_tsquery
-----------------
 'a' & 'b'
(1 row)

[1] - https://www.postgresql.org/docs/14/functions-textsearch.html
[2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES

On Tue, 25 Jan 2022 at 17:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ivan Panchenko <i.panchenko@postgrespro.ru> writes:
> > The actual explanation can be seen from comparing a tsvector with a tsquery.
> > To avoid stemming effects, we use the simple configuration below.
>
> > # select plainto_tsquery('simple','boyers-moore');
>
> >             plainto_tsquery
> > -------------------------------------
> >   'boyers-moore' & 'boyers' & 'moore'
>
> > # select to_tsvector('simple','boyers-moore-horspool');
>
> >                           to_tsvector
> > -------------------------------------------------------------
> >   'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>
> > Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be
>
> >   'boyers-moore' | ('boyers' & 'moore')
>
> > May be, it is worth changing to_tsquery() behavior for such cases.
>
> Changing the behavior of to_tsquery is certainly a lot less scary
> than changing to_tsvector --- it wouldn't call the validity of
> existing tsvector indexes into question.
>
> I see that to_tsquery is even sillier than plainto_tsquery:
>
> regression=# select to_tsquery('simple','boyers-moore');
>                to_tsquery
> -----------------------------------------
>  'boyers-moore' <-> 'boyers' <-> 'moore'
> (1 row)
>
> which is absolutely not a sane translation.
>
> It seems to me that in both cases we'd be better off generating
> "'boyers' <-> 'moore'", without the compound token at all.
> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> but I think if people wanted that they'd just enter separate words.
>
>                         regards, tom lane
>
>

Re: Mailing list search engine: surprising missing results?

От

Ivan Panchenko

Дата:

26 января 2022 г., 00:23:35

On 25.01.2022 23:48, James Addison wrote:
> I'm uncertain why parsing hyphenated query text produces compound tokens?

Because in some cases user wants to search the full hyphenated words, 
not parts of them.

But the parser is pluggable, it is possible to develop another one, such 
as  pg_tsparser [1] which does the same for underscores.

*to_tsquery functions are also changeable. There can exist plenty of 
them according to different user requirements.
Such function just translates the query from the user query language 
with its semantics into the tsquery language.
So you may write your own and contribute it to community or not. Another 
option is to make a wrapper function which will modify the result of 
existing *to_tsquery function to fit your task.

> There are a couple of references[1][2] in the documentation about the
> dash character being converted to a boolean not (!) operator by
> websearch_to_tsquery, but that seems unrelated.
>
> postgres=# select plainto_tsquery('simple', 'a-b');
>    plainto_tsquery
> -------------------
>   'a-b' & 'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a_b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)
>
> postgres=# select plainto_tsquery('simple', 'a+b');
>   plainto_tsquery
> -----------------
>   'a' & 'b'
> (1 row)
In these examples, some characters are removed by the parser. Try 
ts_debug('simple', 'a+b').
>
> [1] - https://www.postgresql.org/docs/14/functions-textsearch.html
> [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES
> On Tue, 25 Jan 2022 at 17:54, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ivan Panchenko <i.panchenko@postgrespro.ru> writes:
>>> The actual explanation can be seen from comparing a tsvector with a tsquery.
>>> To avoid stemming effects, we use the simple configuration below.
>>> # select plainto_tsquery('simple','boyers-moore');
>>>              plainto_tsquery
>>> -------------------------------------
>>>    'boyers-moore' & 'boyers' & 'moore'
>>> # select to_tsvector('simple','boyers-moore-horspool');
>>>                            to_tsvector
>>> -------------------------------------------------------------
>>>    'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
>>> Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be
>>>    'boyers-moore' | ('boyers' & 'moore')
>>> May be, it is worth changing to_tsquery() behavior for such cases.
>> Changing the behavior of to_tsquery is certainly a lot less scary
>> than changing to_tsvector --- it wouldn't call the validity of
>> existing tsvector indexes into question.
>>
>> I see that to_tsquery is even sillier than plainto_tsquery:
>>
>> regression=# select to_tsquery('simple','boyers-moore');
>>                 to_tsquery
>> -----------------------------------------
>>   'boyers-moore' <-> 'boyers' <-> 'moore'
>> (1 row)
>>
>> which is absolutely not a sane translation.
>>
>> It seems to me that in both cases we'd be better off generating
>> "'boyers' <-> 'moore'", without the compound token at all.
>> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
>> but I think if people wanted that they'd just enter separate words.

Matching the compond token might be significant for ranking. (?)

Probably, there is no universal *to_tsquery function and no universal 
parser to fit all users.

[1] https://github.com/postgrespro/pg_tsparser

>>
>>                          regards, tom lane
>>
>>
regards, Ivan

Re: Mailing list search engine: surprising missing results?

От

James Addison

Дата:

26 января 2022 г., 11:28:43

On Tue, 25 Jan 2022 at 21:23, Ivan Panchenko <i.panchenko@postgrespro.ru> wrote:
>
> On 25.01.2022 23:48, James Addison wrote:
> > I'm uncertain why parsing hyphenated query text produces compound tokens?
>
> Because in some cases user wants to search the full hyphenated words,
> not parts of them.

That makes sense, although to refer back to a previous suggestion of
yours, we could allow matching on the full hyphenated words by
emitting an 'OR' condition from the parsed query, instead of 'AND'
(perhaps using an argument?).

In other words:

# expected query to achieve a match (from your previous post in this thread)
'boyers-moore' | ('boyers' & 'moore')

# actual query that does not result in a match today (plainto_tsquery
for 'boyer-moore')
'boyer-moore' & 'boyer' & 'moore'

> >> It seems to me that in both cases we'd be better off generating
> >> "'boyers' <-> 'moore'", without the compound token at all.
> >> Maybe there's a case for the weaker 'boyers' & 'moore' translation,
> >> but I think if people wanted that they'd just enter separate words.
>
> Matching the compond token might be significant for ranking. (?)

Yes that does seem likely.  The knowledge that there is an exact-match
token in the results could be important for various use cases
(including relevance scoring).

> Probably, there is no universal *to_tsquery function and no universal
> parser to fit all users.

That seems possible too, yep.

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Mailing list search engine: surprising missing results?