Re: Boyer-More-Horspool searching LIKE queries

Поиск
Список
Период
Сортировка
От Atsushi Ogawa
Тема Re: Boyer-More-Horspool searching LIKE queries
Дата
Msg-id CAO2GH9EgykEvn_JQ+dPXJE1Aq4bo7KUQ=AGQVhLpxKhDqgkNuQ@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Boyer-More-Horspool searching LIKE queries  (Heikki Linnakangas <hlinnaka@iki.fi>)
Список pgsql-hackers
Thanks for the comments.

> > The conditions under which B-M-H can be used are as follows.
> >
> > (1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
> > Multibyte character sets does not support, because it may contain another
> > characters in the byte sequence. For UTF-8, it works fine, because in
> > UTF-8 the byte sequence of one character cannot contain another character.
>
> You can make it work with any encoding, if you check for that case after
> you find a match. See how text_position() does it.

I saw the text_position(). I would like to do the same.

> > (3) The pattern string should be at least 4 characters.
> > For example, '%AB%' can use B-M-H.
>
> To be precise, the patch checks that the pattern string is at least 4
> *bytes* long. A pattern like E'%\U0001F418%' would benefit too.

*bytes* is precise. I will revise the comment of code.

> If I'm reading the code correctly, it doesn't account for escapes
> correctly. It will use B-M-H for a pattern like '%\\%', even though
> that's just searching for a single backslash and won't benefit from B-M-H.

You are correct. I will fix it.

> > (4) The first and last character of the pattern string should be '%'.
>
> I wonder if we can do better than that. If you have a pattern like
> '%foo%bar', its pretty obvious (to a human) that you can quickly check
> if the string ends in 'bar', and then check if it also contains the
> substring 'foo'. Is there some way to generalize that?

I think the following optimizations are possible.

(1)%foo%bar
 Check if the string ends with 'bar' and search for 'foo' by B-M-H.

(2)foo%bar%
 Check if the string starts with 'foo' and search for 'bar' by B-M-H.

(3)foo%bar%baz
 Check if the string starts with 'foo' and string ends with 'baz' and
 search for 'bar' by B-M-H.

> Looking at MatchText() in like.c, there is this piece of code:
>
> >               else if (*p == '%')
> >               {
> >                       char            firstpat;
> >
> >                       /*
> >                        * % processing is essentially a search for a text position at
> >                        * which the remainder of the text matches the remainder of the
> >                        * pattern, using a recursive call to check each potential match.
> >                        *
> >                        * If there are wildcards immediately following the %, we can skip
> >                        * over them first, using the idea that any sequence of N _'s and
> >                        * one or more %'s is equivalent to N _'s and one % (ie, it will
> >                        * match any sequence of at least N text characters).  In this way
> >                        * we will always run the recursive search loop using a pattern
> >                        * fragment that begins with a literal character-to-match, thereby
> >                        * not recursing more than we have to.
> >                        */
> >                       NextByte(p, plen);
> >
> >                       while (plen > 0)
> >                       {
> >                               if (*p == '%')
> >                                       NextByte(p, plen);
> >                               else if (*p == '_')
> >                               {
> >                                       /* If not enough text left to match the pattern, ABORT */
> >                                       if (tlen <= 0)
> >                                               return LIKE_ABORT;
> >                                       NextChar(t, tlen);
> >                                       NextByte(p, plen);
> >                               }
> >                               else
> >                                       break;          /* Reached a non-wildcard pattern char */
> >                       }
>
> Could we use B-M-H to replace that piece of code?

For example, in a pattern such as %foo%bar%, it is possible to first search
for 'foo' by B-M-H, and then search for 'bar' by B-M-H. It would be nice if such a 
process could be generalized to handle various LIKE search patterns.

> How does the performance compare with regular expressions? Would it be
> possible to use this for trivial regular expressions too? Or could we
> speed up the regexp engine to take advantage of B-M-H, and use it for
> LIKE? Or something like that?

I think regular expressions in postgresql is slower than LIKE.
I compared it with the following two SQLs.

(1)LIKE: execution time is about 0.8msec
select count(*) from pg_catalog.pg_description d where d.description like '%greater%';

(2)regular expression: execution time is about 3.1 msec
select count(*) from pg_catalog.pg_description d where d.description ~ 'greater';

For trivial regular expressions, it may be better to use LIKE.

> > I have measured the performance with the following query.
>
> Setting up the B-M-H table adds some initialization overhead, so this
> would be a loss for cases where the LIKE is executed only once, and/or
> the haystack strings are very small. That's probably OK, the overhead is
> probably small, and those cases are probably not performance-critical.
> But would be nice to measure that too.

I tried to measure the case where LIKE is executed only once and
the haystack string are very small.

---------------------------------------------------------------
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
  cnt integer := 0;
  total integer := 0;
BEGIN
  FOR i IN 1..10000 LOOP
    select count(*) into cnt from pg_class where oid = 2662 and relname like '%cl%';
    total := total + cnt;
  END LOOP;

  RAISE NOTICE 'TOTAL: %', total;
END
$$
;
---------------------------------------------------------------

without patch: 74.499msec
with patch 77.321msec

In this case, the patched version will be a few percent slower, but I think
the overhead is small.

Regards,
Atsushi Ogawa



2022年1月12日(水) 5:17 Heikki Linnakangas <hlinnaka@iki.fi>:
On 11/01/2022 15:55, Atsushi Ogawa wrote:
> I have created a patch to enable the Boyer-More-Horspool search
> algorithm (B-M-H) for LIKE queries.

Cool!

> The conditions under which B-M-H can be used are as follows.
>
> (1) B-M-H in LIKE search supports only single-byte character sets and UTF8.
> Multibyte character sets does not support, because it may contain another
> characters in the byte sequence. For UTF-8, it works fine, because in
> UTF-8 the byte sequence of one character cannot contain another character.

You can make it work with any encoding, if you check for that case after
you find a match. See how text_position() does it.

> (3) The pattern string should be at least 4 characters.
> For example, '%AB%' can use B-M-H.

To be precise, the patch checks that the pattern string is at least 4
*bytes* long. A pattern like E'%\U0001F418%' would benefit too.

If I'm reading the code correctly, it doesn't account for escapes
correctly. It will use B-M-H for a pattern like '%\\%', even though
that's just searching for a single backslash and won't benefit from B-M-H.

> (4) The first and last character of the pattern string should be '%'.

I wonder if we can do better than that. If you have a pattern like
'%foo%bar', its pretty obvious (to a human) that you can quickly check
if the string ends in 'bar', and then check if it also contains the
substring 'foo'. Is there some way to generalize that?

Looking at MatchText() in like.c, there is this piece of code:

>               else if (*p == '%')
>               {
>                       char            firstpat;
>
>                       /*
>                        * % processing is essentially a search for a text position at
>                        * which the remainder of the text matches the remainder of the
>                        * pattern, using a recursive call to check each potential match.
>                        *
>                        * If there are wildcards immediately following the %, we can skip
>                        * over them first, using the idea that any sequence of N _'s and
>                        * one or more %'s is equivalent to N _'s and one % (ie, it will
>                        * match any sequence of at least N text characters).  In this way
>                        * we will always run the recursive search loop using a pattern
>                        * fragment that begins with a literal character-to-match, thereby
>                        * not recursing more than we have to.
>                        */
>                       NextByte(p, plen);
>
>                       while (plen > 0)
>                       {
>                               if (*p == '%')
>                                       NextByte(p, plen);
>                               else if (*p == '_')
>                               {
>                                       /* If not enough text left to match the pattern, ABORT */
>                                       if (tlen <= 0)
>                                               return LIKE_ABORT;
>                                       NextChar(t, tlen);
>                                       NextByte(p, plen);
>                               }
>                               else
>                                       break;          /* Reached a non-wildcard pattern char */
>                       }

Could we use B-M-H to replace that piece of code?

How does the performance compare with regular expressions? Would it be
possible to use this for trivial regular expressions too? Or could we
speed up the regexp engine to take advantage of B-M-H, and use it for
LIKE? Or something like that?

> I have measured the performance with the following query.

Setting up the B-M-H table adds some initialization overhead, so this
would be a loss for cases where the LIKE is executed only once, and/or
the haystack strings are very small. That's probably OK, the overhead is
probably small, and those cases are probably not performance-critical.
But would be nice to measure that too.

- Heikki

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Amit Kapila
Дата:
Сообщение: Re: Consistently use the function name CreateCheckPoint instead of CreateCheckpoint in code comments
Следующее
От: Denis Hirn
Дата:
Сообщение: Re: [PATCH] Allow multiple recursive self-references