Обсуждение: Mailing list search engine: surprising missing results?
Hello, I noticed that the mailing list search engine[1] seems to unexpectedly miss results for some queries. For example: A search for "boyer"[2] returns five results, including result snippets that contain the text "Boyer-More-Horspool" [sic] and "Boyer-Moore-Horspool". However, a more specific search for "boyer-moore"[3] does not return any results -- that seems surprising. Specializing the query further and searching for "boyer-moore-horspool"[4] *does* again return results -- two documents -- with the terms "boyer" and "horspool" highlighted. Although it's not a significant problem, I do have a theory that could explain the behaviour (offered in case it may save time on investigation): It seems possible that the term "more" -- and nearby misspellings, like "moore" -- may be filtered out as stopwords (meaning: they're not present in the search index), and that the search engine is configured to require a minimum percentage match rate for query terms. Under those conditions: searches for "boyer" would produce an 100% match rate, "boyer-moore" would produce 50% (since "moore" would not be found in the term index), and "boyer-moore-horspool" would match at 66-point-6-repeating percent. Given a required match rate of around two thirds, that could explain the behaviour (it might not be the true reason, but it seems like one possibility). Thanks, James [1] https://www.postgresql.org/search/ [2] https://www.postgresql.org/search/?m=1&q=boyer&l=1&d=365&s=r [3] https://www.postgresql.org/search/?m=1&q=boyer-moore&l=1&d=365&s=r [4] https://www.postgresql.org/search/?m=1&q=boyer-moore-horspool&l=1&d=365&s=r
On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote: > Hello, > > I noticed that the mailing list search engine[1] seems to unexpectedly > miss results for some queries. > > For example: > > A search for "boyer"[2] returns five results, including result > snippets that contain the text "Boyer-More-Horspool" [sic] and > "Boyer-Moore-Horspool". > > However, a more specific search for "boyer-moore"[3] does not return > any results -- that seems surprising. > > Specializing the query further and searching for > "boyer-moore-horspool"[4] *does* again return results -- two documents > -- with the terms "boyer" and "horspool" highlighted. This is caused by the peculiarities of PostgreSQL full text search: SELECT to_tsvector('english', 'Boyer-Moore-Horspool') @@ websearch_to_tsquery('english', 'boyer-moore'); ?column? ══════════ f (1 row) The reason is that the 'moore' in 'boyer-moore' is stemmed, since it is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' isn't: SELECT to_tsvector('english', 'Boyer-Moore-Horspool'); to_tsvector ══════════════════════════════════════════════════════════ 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3 (1 row) SELECT websearch_to_tsquery('english', 'boyer-moore'); websearch_to_tsquery ═════════════════════════════════════ 'boyer-moor' <-> 'boyer' <-> 'moor' (1 row) 'boyer-moor' is not present in the first result. As a workaround, I suggest that you search for 'boyer moore' or (even better) '"boyer moore"' (with the double quotes): SELECT websearch_to_tsquery('english', 'boyer moore'); websearch_to_tsquery ══════════════════════ 'boyer' & 'moor' (1 row) SELECT websearch_to_tsquery('english', '"boyer moore"'); websearch_to_tsquery ══════════════════════ 'boyer' <-> 'moor' (1 row) Yours, Laurenz Albe
On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote: > On Sun, 2022-01-23 at 12:49 +0000, James Addison wrote: > > Specializing the query further and searching for > > "boyer-moore-horspool"[4] *does* again return results -- two documents > > -- with the terms "boyer" and "horspool" highlighted. > > This is caused by the peculiarities of PostgreSQL full text search: > > SELECT to_tsvector('english', 'Boyer-Moore-Horspool') > @@ websearch_to_tsquery('english', 'boyer-moore'); > > ?column? > ══════════ > f > (1 row) > > The reason is that the 'moore' in 'boyer-moore' is stemmed, since it > is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' > isn't: Wow, he showed me this problem earlier but I never suspected it was stemming issue because I never considered proper nowns could be stem-adjusted, but it is obvious they can. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Bruce Momjian <bruce@momjian.us> writes: > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote: >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' >> isn't: > Wow, he showed me this problem earlier but I never suspected it was > stemming issue because I never considered proper nowns could be > stem-adjusted, but it is obvious they can. I wonder if we should change that so that components of a compound word are consistently stemmed the same way. regards, tom lane
On Mon, Jan 24, 2022 at 03:47:29PM -0500, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote: > >> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it > >> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' > >> isn't: > > > Wow, he showed me this problem earlier but I never suspected it was > > stemming issue because I never considered proper nowns could be > > stem-adjusted, but it is obvious they can. > > I wonder if we should change that so that components of a compound > word are consistently stemmed the same way. I don't see the value in a change --- it might break the same number of cases it fixes. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Bruce Momjian <bruce@momjian.us> writes:
> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:
>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it
>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool'
>> isn't:
> Wow, he showed me this problem earlier but I never suspected it was
> stemming issue because I never considered proper nowns could be
> stem-adjusted, but it is obvious they can.
I wonder if we should change that so that components of a compound
word are consistently stemmed the same way.
Something like this
SELECT to_tsvector('english', 'Boyer-Moore-Horspool');
to_tsvector
to_tsvector
----------------------------------------------------------
'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)
'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3
(1 row)
regards, tom lane
--
On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote: > On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Bruce Momjian <bruce@momjian.us> writes: > > > On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote: > > > > The reason is that the 'moore' in 'boyer-moore' is stemmed, since it > > > > is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' > > > > isn't: > > > > > Wow, he showed me this problem earlier but I never suspected it was > > > stemming issue because I never considered proper nowns could be > > > stem-adjusted, but it is obvious they can. > > > > I wonder if we should change that so that components of a compound > > word are consistently stemmed the same way. > > Something like this > > SELECT to_tsvector('english', 'Boyer-Moore-Horspool'); > to_tsvector > ---------------------------------------------------------- > 'boyer':2 'boyer-moore-horspool':1 'boyer-moore':1 'moore-horspool':1 'horspool':4 'moor':3 > (1 row) Not quite. The problem is question is the "'boyer-moore':1". If that were "'boyer-moor':1" instead, the problem would disappear. Yours, Laurenz Albe
Laurenz Albe <laurenz.albe@cybertec.at> writes: > On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote: >> On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Bruce Momjian <bruce@momjian.us> writes: >>>> On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote: >>>>> The reason is that the 'moore' in 'boyer-moore' is stemmed, since it >>>>> is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' >>>>> isn't: > Not quite. The problem is question is the "'boyer-moore':1". > If that were "'boyer-moor':1" instead, the problem would disappear. Actually, when I try this here, it seems like the stemming *is* consistent: regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool'); to_tsvector ---------------------------------------------------------- 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3 (1 row) regression=# SELECT to_tsvector('english', 'Boyer-Moore'); to_tsvector ----------------------------------- 'boyer':2 'boyer-moor':1 'moor':3 (1 row) If you try variants of that where the first or third term is stemmable, say regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool'); to_tsvector ----------------------------------------------------------- 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3 (1 row) it sure appears that each component word is stemmed independently already. So I think the original explanation here is wrong and we need to probe more closely. regards, tom lane
On 25.01.2022 19:22, Tom Lane wrote:
Laurenz Albe <laurenz.albe@cybertec.at> writes:On Tue, 2022-01-25 at 14:04 +0300, Oleg Bartunov wrote:On Mon, Jan 24, 2022 at 11:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:Bruce Momjian <bruce@momjian.us> writes:On Mon, Jan 24, 2022 at 08:27:41AM +0100, Laurenz Albe wrote:The reason is that the 'moore' in 'boyer-moore' is stemmed, since it is at the end of the word, while the 'moore' in 'Boyer-Moore-Horspool' isn't:Not quite. The problem is question is the "'boyer-moore':1". If that were "'boyer-moor':1" instead, the problem would disappear.Actually, when I try this here, it seems like the stemming *is* consistent: regression=# SELECT to_tsvector('english', 'Boyer-Moore-Horspool'); to_tsvector ---------------------------------------------------------- 'boyer':2 'boyer-moore-horspool':1 'horspool':4 'moor':3 (1 row) regression=# SELECT to_tsvector('english', 'Boyer-Moore'); to_tsvector ----------------------------------- 'boyer':2 'boyer-moor':1 'moor':3 (1 row) If you try variants of that where the first or third term is stemmable, say regression=# SELECT to_tsvector('english', 'Boyers-Moore-Horspool'); to_tsvector ----------------------------------------------------------- 'boyer':2 'boyers-moore-horspool':1 'horspool':4 'moor':3 (1 row) it sure appears that each component word is stemmed independently already. So I think the original explanation here is wrong and we need to probe more closely.
The actual explanation can be seen from comparing a tsvector with a tsquery. To avoid stemming effects, we use the simple configuration below.
# select plainto_tsquery('simple','boyers-moore'); plainto_tsquery ------------------------------------- 'boyers-moore' & 'boyers' & 'moore'
# select to_tsvector('simple','boyers-moore-horspool');
to_tsvector ------------------------------------------------------------- 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3
Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be
'boyers-moore' | ('boyers' & 'moore')
May be, it is worth changing to_tsquery() behavior for such cases.
regards, tom lane
Regards, Ivan
Ivan Panchenko <i.panchenko@postgrespro.ru> writes: > The actual explanation can be seen from comparing a tsvector with a tsquery. > To avoid stemming effects, we use the simple configuration below. > # select plainto_tsquery('simple','boyers-moore'); > plainto_tsquery > ------------------------------------- > 'boyers-moore' & 'boyers' & 'moore' > # select to_tsvector('simple','boyers-moore-horspool'); > to_tsvector > ------------------------------------------------------------- > 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3 > Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be > 'boyers-moore' | ('boyers' & 'moore') > May be, it is worth changing to_tsquery() behavior for such cases. Changing the behavior of to_tsquery is certainly a lot less scary than changing to_tsvector --- it wouldn't call the validity of existing tsvector indexes into question. I see that to_tsquery is even sillier than plainto_tsquery: regression=# select to_tsquery('simple','boyers-moore'); to_tsquery ----------------------------------------- 'boyers-moore' <-> 'boyers' <-> 'moore' (1 row) which is absolutely not a sane translation. It seems to me that in both cases we'd be better off generating "'boyers' <-> 'moore'", without the compound token at all. Maybe there's a case for the weaker 'boyers' & 'moore' translation, but I think if people wanted that they'd just enter separate words. regards, tom lane
I'm uncertain why parsing hyphenated query text produces compound tokens? There are a couple of references[1][2] in the documentation about the dash character being converted to a boolean not (!) operator by websearch_to_tsquery, but that seems unrelated. postgres=# select plainto_tsquery('simple', 'a-b'); plainto_tsquery ------------------- 'a-b' & 'a' & 'b' (1 row) postgres=# select plainto_tsquery('simple', 'a_b'); plainto_tsquery ----------------- 'a' & 'b' (1 row) postgres=# select plainto_tsquery('simple', 'a+b'); plainto_tsquery ----------------- 'a' & 'b' (1 row) [1] - https://www.postgresql.org/docs/14/functions-textsearch.html [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES On Tue, 25 Jan 2022 at 17:54, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Ivan Panchenko <i.panchenko@postgrespro.ru> writes: > > The actual explanation can be seen from comparing a tsvector with a tsquery. > > To avoid stemming effects, we use the simple configuration below. > > > # select plainto_tsquery('simple','boyers-moore'); > > > plainto_tsquery > > ------------------------------------- > > 'boyers-moore' & 'boyers' & 'moore' > > > # select to_tsvector('simple','boyers-moore-horspool'); > > > to_tsvector > > ------------------------------------------------------------- > > 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3 > > > Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be > > > 'boyers-moore' | ('boyers' & 'moore') > > > May be, it is worth changing to_tsquery() behavior for such cases. > > Changing the behavior of to_tsquery is certainly a lot less scary > than changing to_tsvector --- it wouldn't call the validity of > existing tsvector indexes into question. > > I see that to_tsquery is even sillier than plainto_tsquery: > > regression=# select to_tsquery('simple','boyers-moore'); > to_tsquery > ----------------------------------------- > 'boyers-moore' <-> 'boyers' <-> 'moore' > (1 row) > > which is absolutely not a sane translation. > > It seems to me that in both cases we'd be better off generating > "'boyers' <-> 'moore'", without the compound token at all. > Maybe there's a case for the weaker 'boyers' & 'moore' translation, > but I think if people wanted that they'd just enter separate words. > > regards, tom lane > >
On 25.01.2022 23:48, James Addison wrote: > I'm uncertain why parsing hyphenated query text produces compound tokens? Because in some cases user wants to search the full hyphenated words, not parts of them. But the parser is pluggable, it is possible to develop another one, such as pg_tsparser [1] which does the same for underscores. *to_tsquery functions are also changeable. There can exist plenty of them according to different user requirements. Such function just translates the query from the user query language with its semantics into the tsquery language. So you may write your own and contribute it to community or not. Another option is to make a wrapper function which will modify the result of existing *to_tsquery function to fit your task. > There are a couple of references[1][2] in the documentation about the > dash character being converted to a boolean not (!) operator by > websearch_to_tsquery, but that seems unrelated. > > postgres=# select plainto_tsquery('simple', 'a-b'); > plainto_tsquery > ------------------- > 'a-b' & 'a' & 'b' > (1 row) > > postgres=# select plainto_tsquery('simple', 'a_b'); > plainto_tsquery > ----------------- > 'a' & 'b' > (1 row) > > postgres=# select plainto_tsquery('simple', 'a+b'); > plainto_tsquery > ----------------- > 'a' & 'b' > (1 row) In these examples, some characters are removed by the parser. Try ts_debug('simple', 'a+b'). > > [1] - https://www.postgresql.org/docs/14/functions-textsearch.html > [2] - https://www.postgresql.org/docs/14/textsearch-controls.html#TEXTSEARCH-PARSING-QUERIES > On Tue, 25 Jan 2022 at 17:54, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Ivan Panchenko <i.panchenko@postgrespro.ru> writes: >>> The actual explanation can be seen from comparing a tsvector with a tsquery. >>> To avoid stemming effects, we use the simple configuration below. >>> # select plainto_tsquery('simple','boyers-moore'); >>> plainto_tsquery >>> ------------------------------------- >>> 'boyers-moore' & 'boyers' & 'moore' >>> # select to_tsvector('simple','boyers-moore-horspool'); >>> to_tsvector >>> ------------------------------------------------------------- >>> 'boyers':2 'boyers-moore-horspool':1 'horspool':4 'moore':3 >>> Obviously, such tsvector does not match the above tsquery. I think,a better tsquery for this query would be >>> 'boyers-moore' | ('boyers' & 'moore') >>> May be, it is worth changing to_tsquery() behavior for such cases. >> Changing the behavior of to_tsquery is certainly a lot less scary >> than changing to_tsvector --- it wouldn't call the validity of >> existing tsvector indexes into question. >> >> I see that to_tsquery is even sillier than plainto_tsquery: >> >> regression=# select to_tsquery('simple','boyers-moore'); >> to_tsquery >> ----------------------------------------- >> 'boyers-moore' <-> 'boyers' <-> 'moore' >> (1 row) >> >> which is absolutely not a sane translation. >> >> It seems to me that in both cases we'd be better off generating >> "'boyers' <-> 'moore'", without the compound token at all. >> Maybe there's a case for the weaker 'boyers' & 'moore' translation, >> but I think if people wanted that they'd just enter separate words. Matching the compond token might be significant for ranking. (?) Probably, there is no universal *to_tsquery function and no universal parser to fit all users. [1] https://github.com/postgrespro/pg_tsparser >> >> regards, tom lane >> >> regards, Ivan
On Tue, 25 Jan 2022 at 21:23, Ivan Panchenko <i.panchenko@postgrespro.ru> wrote: > > On 25.01.2022 23:48, James Addison wrote: > > I'm uncertain why parsing hyphenated query text produces compound tokens? > > Because in some cases user wants to search the full hyphenated words, > not parts of them. That makes sense, although to refer back to a previous suggestion of yours, we could allow matching on the full hyphenated words by emitting an 'OR' condition from the parsed query, instead of 'AND' (perhaps using an argument?). In other words: # expected query to achieve a match (from your previous post in this thread) 'boyers-moore' | ('boyers' & 'moore') # actual query that does not result in a match today (plainto_tsquery for 'boyer-moore') 'boyer-moore' & 'boyer' & 'moore' > >> It seems to me that in both cases we'd be better off generating > >> "'boyers' <-> 'moore'", without the compound token at all. > >> Maybe there's a case for the weaker 'boyers' & 'moore' translation, > >> but I think if people wanted that they'd just enter separate words. > > Matching the compond token might be significant for ranking. (?) Yes that does seem likely. The knowledge that there is an exact-match token in the results could be important for various use cases (including relevance scoring). > Probably, there is no universal *to_tsquery function and no universal > parser to fit all users. That seems possible too, yep.