Обсуждение: Multibyte LIKE optimization

Поиск

Список

Период

Сортировка

Multibyte LIKE optimization

От

ITAGAKI Takahiro

Дата:

30 марта 2007 г., 05:40:16

Andrew - Supernews <andrew+nonews@supernews.com> wrote:

> Actually, I think your proposal is fundamentally correct, merely incomplete.

Yeah, I fixed the patch to handle '_' correctly.

> Doing octet-based rather than character-based matching of strings is a
> _design goal_ of UTF8.

I think all "safe ASCII-supersets" encodings are comparable by bytes,
not only UTF-8. Their all multibyte characters consist of bytes larger
than 127. I updated the patch on this presupposition. It uses octet-based
matching usually and character-based matching at '_'.

There was 30%+ of performance win in selection using multibytes LIKE '%foo%'.

 encoding  |  HEAD   | patched
-----------+---------+---------
 SQL_ASCII |  7094ms |  7062ms
 LATIN1    |  7083ms |  7078ms
 UTF8      | 17974ms | 11635ms (64.7%)
 EUC_JP    | 17032ms | 12109ms (71.1%)

If this patch is acceptable, please drop JOHAB encoding from server encodings
before it is applied. Trailing bytes of JOHAB can be less than 128.
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01475.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Вложения

mbtextmatch.patch

UTF8MatchText

От

ITAGAKI Takahiro

Дата:

02 апреля 2007 г., 01:56:10

"Andrew - Supernews" <andrew@supernews.net> wrote:

>  ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
>  ITAGAKI> by bytes, not only UTF-8.
>
> This is false, particularly for EUC.

Umm, I see. I updated the optimization to be used only for UTF8 case.
I also added some inlining hints that are useful on my machine (Pentium 4).


x1000 of LIKE '%foo% on 10000 rows tables [ms]
 encoding  | HEAD  |  P1   |  P2   |  P3
-----------+-------+-------+-------+-------
 SQL_ASCII |  7094 |  7120 |  7063 |  7031
 LATIN1    |  7083 |  7130 |  7057 |  7031
 UTF8      | 17974 | 10859 | 10839 |  9682
 EUC_JP    | 17032 | 17557 | 17599 | 15240

- P1: UTF8MatchText()
- P2: P1 + __inline__ GenericMatchText()
- P3: P2 + __inline__ wchareq()
      (The attached patch is P3.)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Вложения

utf8matchtext.patch

Re: UTF8MatchText

От

Bruce Momjian

Дата:

02 апреля 2007 г., 19:54:55

I assume this replaces all your earlier multi-byte LIKE patches.

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew@supernews.net> wrote:
>
> >  ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> >  ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
>  encoding  | HEAD  |  P1   |  P2   |  P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
>       (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: UTF8MatchText

От

Bruce Momjian

Дата:

07 апреля 2007 г., 01:19:44

I do not understand this patch.  You have defined two functions,
UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
is that one calls CHAREQ and the other calls ICHAREQ, but just above
those two functions you define the macros identically:

    #define CHAREQ(p1, p2)    wchareq(p1, p2)
    #define ICHAREQ(p1, p2)   wchareq(p1, p2)

Why are there two functions?  Also, can't you use one function and just
pass a boolean to indicate whether case it be ignored?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew@supernews.net> wrote:
>
> >  ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> >  ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
>  encoding  | HEAD  |  P1   |  P2   |  P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
>       (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: UTF8MatchText

От

Bruce Momjian

Дата:

07 апреля 2007 г., 01:24:10

Bruce Momjian wrote:
>
> I do not understand this patch.  You have defined two functions,
> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> is that one calls CHAREQ and the other calls ICHAREQ, but just above
> those two functions you define the macros identically:
>
>     #define CHAREQ(p1, p2)    wchareq(p1, p2)
>     #define ICHAREQ(p1, p2)   wchareq(p1, p2)
>
> Why are there two functions?  Also, can't you use one function and just
> pass a boolean to indicate whether case it be ignored?

Sorry, typo:

Why are there two functions?  Also, can't you use one function and just
pass a boolean to indicate whether case should be ignored?
                                        ------

>
> ---------------------------------------------------------------------------
>
> ITAGAKI Takahiro wrote:
> > "Andrew - Supernews" <andrew@supernews.net> wrote:
> >
> > >  ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> > >  ITAGAKI> by bytes, not only UTF-8.
> > >
> > > This is false, particularly for EUC.
> >
> > Umm, I see. I updated the optimization to be used only for UTF8 case.
> > I also added some inlining hints that are useful on my machine (Pentium 4).
> >
> >
> > x1000 of LIKE '%foo% on 10000 rows tables [ms]
> >  encoding  | HEAD  |  P1   |  P2   |  P3
> > -----------+-------+-------+-------+-------
> >  SQL_ASCII |  7094 |  7120 |  7063 |  7031
> >  LATIN1    |  7083 |  7130 |  7057 |  7031
> >  UTF8      | 17974 | 10859 | 10839 |  9682
> >  EUC_JP    | 17032 | 17557 | 17599 | 15240
> >
> > - P1: UTF8MatchText()
> > - P2: P1 + __inline__ GenericMatchText()
> > - P3: P2 + __inline__ wchareq()
> >       (The attached patch is P3.)
> >
> > Regards,
> > ---
> > ITAGAKI Takahiro
> > NTT Open Source Software Center
> >
>
> [ Attachment, skipping... ]
>
> >
> > ---------------------------(end of broadcast)---------------------------
> > TIP 7: You can help support the PostgreSQL project by donating at
> >
> >                 http://www.postgresql.org/about/donate
>
> --
>   Bruce Momjian  <bruce@momjian.us>          http://momjian.us
>   EnterpriseDB                               http://www.enterprisedb.com
>
>   + If your life is a hard drive, Christ can be your backup. +
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: UTF8MatchText

От

ITAGAKI Takahiro

Дата:

08 апреля 2007 г., 23:48:55

Bruce Momjian <bruce@momjian.us> wrote:

> > I do not understand this patch.  You have defined two functions,
> > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> > is that one calls CHAREQ and the other calls ICHAREQ, but just above
> > those two functions you define the macros identically:
>
> Why are there two functions?  Also, can't you use one function and just
> pass a boolean to indicate whether case should be ignored?

The same is true of MBMatchText() and MBMatchTextIC().
Now, I'll split the patch into two changes.

1. DropMBMatchTextIC.patch
        Drop MBMatchTextIC() and use MBMatchText() instead.

2. UTF8MatchText.patch
        Add UTF8MatchText() as a specialized version of MBMatchText().

As a future work, it might be good to research the performance of rewriting
"col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
we can avoid to call lower() for constant pattern in the right-hand side and
can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
in the future unless we move to wide-character server encoding :)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Вложения

Re: UTF8MatchText

От

Bruce Momjian

Дата:

09 апреля 2007 г., 17:16:26

Patch removed, updated version submitted.

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> "Andrew - Supernews" <andrew@supernews.net> wrote:
>
> >  ITAGAKI> I think all "safe ASCII-supersets" encodings are comparable
> >  ITAGAKI> by bytes, not only UTF-8.
> >
> > This is false, particularly for EUC.
>
> Umm, I see. I updated the optimization to be used only for UTF8 case.
> I also added some inlining hints that are useful on my machine (Pentium 4).
>
>
> x1000 of LIKE '%foo% on 10000 rows tables [ms]
>  encoding  | HEAD  |  P1   |  P2   |  P3
> -----------+-------+-------+-------+-------
>  SQL_ASCII |  7094 |  7120 |  7063 |  7031
>  LATIN1    |  7083 |  7130 |  7057 |  7031
>  UTF8      | 17974 | 10859 | 10839 |  9682
>  EUC_JP    | 17032 | 17557 | 17599 | 15240
>
> - P1: UTF8MatchText()
> - P2: P1 + __inline__ GenericMatchText()
> - P3: P2 + __inline__ wchareq()
>       (The attached patch is P3.)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                 http://www.postgresql.org/about/donate

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: UTF8MatchText

От

Bruce Momjian

Дата:

09 апреля 2007 г., 22:07:55

Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > > I do not understand this patch.  You have defined two functions,
> > > UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
> > > is that one calls CHAREQ and the other calls ICHAREQ, but just above
> > > those two functions you define the macros identically:
> >
> > Why are there two functions?  Also, can't you use one function and just
> > pass a boolean to indicate whether case should be ignored?
>
> The same is true of MBMatchText() and MBMatchTextIC().
> Now, I'll split the patch into two changes.
>
> 1. DropMBMatchTextIC.patch
>         Drop MBMatchTextIC() and use MBMatchText() instead.
>
> 2. UTF8MatchText.patch
>         Add UTF8MatchText() as a specialized version of MBMatchText().
>
>
> As a future work, it might be good to research the performance of rewriting
> "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
> we can avoid to call lower() for constant pattern in the right-hand side and
> can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
> in the future unless we move to wide-character server encoding :)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>

[ Attachment, skipping... ]

[ Attachment, skipping... ]

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 11:20:58

Itagaki,

I find this still fairly unclean. It certainly took me some time to get
me head around what's going on.

ISTM we should generate all these match functions from one body of code
plus some #define magic.

As I understand it, we have three possible encoding switches:  Single
Byte, UTF8 and other Multi Byte Charsets, and two possible case
settings: case Sensitive and Case Insensitive. That would make for a
total of six functions, but in the case of both UTF8 and other MBCS we
don't need a special Case Insensitive function - instead we downcase
both the string and the pattern and then use the Case Sensitive
function. That leaves a total of four functions.

What is not clear to me is why the UTF8 optimisation work, and why it
doesn't apply to other MBCS. At the very least we need a comment on that.

I also find the existing function naming convention somewhat annoying -
having foo() and MB_foo() is less than clear. I'd rather have SB_foo()
and MB_foo(). That's not your fault, of course.

If you supply me with some explanation on the UTF8 optimisation issue,
I'll prepare a revised patch along these lines.

cheers

andrew

ITAGAKI Takahiro wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
>
>>> I do not understand this patch.  You have defined two functions,
>>> UTF8MatchText() and UTF8MatchTextIC(), and the difference between them
>>> is that one calls CHAREQ and the other calls ICHAREQ, but just above
>>> those two functions you define the macros identically:
>>>
>> Why are there two functions?  Also, can't you use one function and just
>> pass a boolean to indicate whether case should be ignored?
>>
>
> The same is true of MBMatchText() and MBMatchTextIC().
> Now, I'll split the patch into two changes.
>
> 1. DropMBMatchTextIC.patch
>         Drop MBMatchTextIC() and use MBMatchText() instead.
>
> 2. UTF8MatchText.patch
>         Add UTF8MatchText() as a specialized version of MBMatchText().
>
>
> As a future work, it might be good to research the performance of rewriting
> "col ILIKE 'pattern'" to "lower(col) LIKE lower('pattern')" in planner so that
> we can avoid to call lower() for constant pattern in the right-hand side and
> can use functional indexes (lower(col)). I think we never need MBMatchTextIC()
> in the future unless we move to wide-character server encoding :)
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
>
> ------------------------------------------------------------------------
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 13:57:10

I wrote:
>
>
> ISTM we should generate all these match functions from one body of
> code plus some #define magic.
>
> As I understand it, we have three possible encoding switches:  Single
> Byte, UTF8 and other Multi Byte Charsets, and two possible case
> settings: case Sensitive and Case Insensitive. That would make for a
> total of six functions, but in the case of both UTF8 and other MBCS we
> don't need a special Case Insensitive function - instead we downcase
> both the string and the pattern and then use the Case Sensitive
> function. That leaves a total of four functions.
>
> What is not clear to me is why the UTF8 optimisation work, and why it
> doesn't apply to other MBCS. At the very least we need a comment on that.
>
> I also find the existing function naming convention somewhat annoying
> - having foo() and MB_foo() is less than clear. I'd rather have
> SB_foo() and MB_foo(). That's not your fault, of course.
>
> If you supply me with some explanation on the UTF8 optimisation issue,
> I'll prepare a revised patch along these lines.
>
>

Ok, I have studied some more and I think I understand what's going on.
AIUI, we are switching from some expensive char-wise comparisons to
cheap byte-wise comparisons in the UTF8 case because we know that in
UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
character sequence. Is that putting it too mildly? Do we need stronger
conditions than that? If it's correct, are there other MBCS for which
this is true?

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 14:33:54

Andrew Dunstan <andrew@dunslane.net> writes:
> Ok, I have studied some more and I think I understand what's going on.
> AIUI, we are switching from some expensive char-wise comparisons to
> cheap byte-wise comparisons in the UTF8 case because we know that in
> UTF8 the magic characters ('_', '%' and '\') aren't a part of any other
> character sequence. Is that putting it too mildly? Do we need stronger
> conditions than that? If it's correct, are there other MBCS for which
> this is true?

I don't think this is a correct analysis.  If it were correct then we
could use the optimization for all backend charsets because none of them
allow MB characters to contain non-high-bit-set bytes.  But it was
stated somewhere upthread that that doesn't actually work.  Clearly
it's a necessary property that we not falsely detect the magic pattern
characters, but that's not sufficient.

I think the real issue is that UTF8 has disjoint representations for
first-bytes and not-first-bytes of MB characters, and thus it is
impossible to make a false match in which an MB pattern character is
matched to the end of one data character plus the start of another.
In character sets without that property, we have to use the slow way to
ensure we don't make out-of-sync matches.

            regards, tom lane

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 14:40:22

Wait a second ... I just thought of a counterexample that destroys the
entire concept.  Consider the pattern 'A__B', which clearly is supposed
to match strings of four *characters*.  With the proposed patch in
place, it would match strings of four *bytes*.  Which is not the correct
behavior.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 14:48:17


Tom Lane wrote:
> UTF8 has disjoint representations for
> first-bytes and not-first-bytes of MB characters, and thus it is
> impossible to make a false match in which an MB pattern character is
> matched to the end of one data character plus the start of another.
> In character sets without that property, we have to use the slow way to
> ensure we don't make out-of-sync matches.
>
>
>

Thanks. I will include this info in the comments.

cheers

andrew

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 15:06:19

Tom Lane wrote:
> Wait a second ... I just thought of a counterexample that destroys the
> entire concept.  Consider the pattern 'A__B', which clearly is supposed
> to match strings of four *characters*.  With the proposed patch in
> place, it would match strings of four *bytes*.  Which is not the correct
> behavior.
>
>

 From what I can see the code is quite careful about when it calls
NextByte vs NextChar, and after _ it calls NextChar.

So I'll test for this, but I think it's safe.

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 15:17:34

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Wait a second ... I just thought of a counterexample that destroys the
>> entire concept.  Consider the pattern 'A__B', which clearly is supposed
>> to match strings of four *characters*.  With the proposed patch in
>> place, it would match strings of four *bytes*.  Which is not the correct
>> behavior.

>  From what I can see the code is quite careful about when it calls
> NextByte vs NextChar, and after _ it calls NextChar.

Except that the entire point of this patch is to dumb down NextChar to
be the same as NextByte for UTF8 strings.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 15:36:59


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> Wait a second ... I just thought of a counterexample that destroys the
>>> entire concept.  Consider the pattern 'A__B', which clearly is supposed
>>> to match strings of four *characters*.  With the proposed patch in
>>> place, it would match strings of four *bytes*.  Which is not the correct
>>> behavior.
>>>
>
>
>>  From what I can see the code is quite careful about when it calls
>> NextByte vs NextChar, and after _ it calls NextChar.
>>
>
> Except that the entire point of this patch is to dumb down NextChar to
> be the same as NextByte for UTF8 strings.
>
>
>

That's not what I see in (what I think is) the latest submission, which
includes this snippet:

+ /* Set up for utf8 characters */
+ #define CHAREQ(p1, p2)    wchareq(p1, p2)
+ #define NextChar(p, plen) \
+   do { int __l = pg_utf_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+
+ /*
+  * UTF8MatchText -- specialized version of MBMatchText for UTF8
+  */
+ static int
+ UTF8MatchText(char *t, int tlen, char *p, int plen)

Am I looking at the wrong thing? This is from around April 9th I think.


cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 16:49:37

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Except that the entire point of this patch is to dumb down NextChar to
>> be the same as NextByte for UTF8 strings.

> That's not what I see in (what I think is) the latest submission, which
> includes this snippet:

[ scratches head... ]  OK, then I think I totally missed what this patch
is trying to accomplish; because this code looks just the same as the
existing multibyte-character operations.  Where does the performance
improvement come from?

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 16:57:52


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> Except that the entire point of this patch is to dumb down NextChar to
>>> be the same as NextByte for UTF8 strings.
>>>
>
>
>> That's not what I see in (what I think is) the latest submission, which
>> includes this snippet:
>>
>
> [ scratches head... ]  OK, then I think I totally missed what this patch
> is trying to accomplish; because this code looks just the same as the
> existing multibyte-character operations.  Where does the performance
> improvement come from?
>
>
>

That's what bothered me. The trouble is that we have so much code that
looks *almost* identical.

 From my WIP patch, here's where the difference appears to be - note
that UTF8 branch has two NextByte calls at the bottom, unlike the other
branch:


#ifdef UTF8_OPT
        /*
         * UTF8 is optimised to do byte at a time matching in most cases,
         * thus saving expensive calls to NextChar.
         *
         * UTF8 has disjoint representations for first-bytes and
         * not-first-bytes of MB characters, and thus it is
         * impossible to make a false match in which an MB pattern
         * character is matched to the end of one data character
         * plus the start of another.
         * In character sets without that property, we have to use the
         * slow way to ensure we don't make out-of-sync matches.
         */
        else if (*p == '_')
        {
            NextChar(t, tlen);
            NextByte(p, plen);
            continue;
        }
        else if (!BYTEEQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextByte(t, tlen);
        NextByte(p, plen);
#else
        /*
         * Branch for non-utf8 multi-byte charsets and also for single-byte
         * charsets which don't gain any benefit from the above
optimisation.
         */

        else if ((*p != '_') && !CHAREQ(t, p))
        {
            /*
             * Not the single-character wildcard and no explicit match? Then
             * time to quit...
             */
            return LIKE_FALSE;
        }

        NextChar(t, tlen);
        NextChar(p, plen);

#endif /* UTF8_OPT */


cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 18:19:21

Andrew Dunstan <andrew@dunslane.net> writes:
> From my WIP patch, here's where the difference appears to be - note
> that UTF8 branch has two NextByte calls at the bottom, unlike the other
> branch:

Oh, I see: NextChar is still "real" but the patch is willing to have t
and p pointing into the middle of an MB character.  That's a bit
risky.  I think it works but it's making at least the following
undocumented assumptions:

* At a pattern backslash, it applies CHAREQ() but then advances
byte-by-byte over the matched characters (implicitly assuming that none
of these bytes will look like the magic characters).  While that works
for backend-safe encodings, it seems a bit strange; you've already paid
the price of determining the character length once, not to mention
matching the bytes of the characters once, and then throw that knowledge
away.  I think BYTEEQ would make more sense in the backslash path.

* At pattern % or _, it's critical that we do NextChar not NextByte
on the data side.  Else t is pointing into the middle of an MB sequence
when p isn't, and we have various out-of-sync conditions to worry about,
notably possibly calling NextChar when t is not pointing at the first
byte of the character, which will result in a wrong answer about the
character length.

* We *must* do NextChar not NextByte for _ since we have to match it to
exactly one logical character, not byte.  You could imagine trying to do
% a byte at a time (and indeed that's what I'd been thinking it did)
but that gets you out of sync which breaks the _ case.

So the actual optimization here is that we do bytewise comparison and
advancing, but only when we are either at the start of a character
(on both sides, and the pattern char is not wildcard) or we are in the
middle of a character (on both sides) and we've already proven that both
sides matched for the previous byte(s) of the character.

On the strength of this closer reading, I would say that the patch isn't
relying on UTF8's first-byte-vs-not-first-byte property after all.
All that it's relying on is that no MB character is a prefix of another
one, which seems like a necessary property for any sane encoding; plus
that characters are considered equal only if they're bytewise equal.
So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
earlier conclusion was based on a misunderstanding of what the patch
really does.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

17 мая 2007 г., 19:25:08

Tom Lane wrote:
>
> * At a pattern backslash, it applies CHAREQ() but then advances
> byte-by-byte over the matched characters (implicitly assuming that none
> of these bytes will look like the magic characters).  While that works
> for backend-safe encodings, it seems a bit strange; you've already paid
> the price of determining the character length once, not to mention
> matching the bytes of the characters once, and then throw that knowledge
> away.  I think BYTEEQ would make more sense in the backslash path.
>

Probably, although the use of CHAREQ is in the present code.

Is it legal to follow escape by anything other than _ % or escape?

>
> So the actual optimization here is that we do bytewise comparison and
> advancing, but only when we are either at the start of a character
> (on both sides, and the pattern char is not wildcard) or we are in the
> middle of a character (on both sides) and we've already proven that both
> sides matched for the previous byte(s) of the character.
>

I think that's correct.

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.
>
>
>

Indeed.

One more thing - I'm thinking of rolling up the bytea matching routines
as well as the text routines to eliminate all the duplication of logic.
I can do it by a little type casting from bytea* to text* and back
again, or if that's not acceptable by some preprocessor magic. I think
the casting is likely to be safe enough in this case - I don't think a
null byte will hurt us anywhere in this code - and presumably the
varlena stuff is all the same. Does that sound reasonable?

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 20:46:34

Andrew Dunstan <andrew@dunslane.net> writes:
> Is it legal to follow escape by anything other than _ % or escape?

Certainly, but once you've compared the first byte you can handle any
remaining bytes via the main loop.  And in fact the code is already
depending on being able to do that --- the use of CHAREQ rather than
BYTEEQ is just wasted cycles.

> One more thing - I'm thinking of rolling up the bytea matching routines
> as well as the text routines to eliminate all the duplication of logic.

+1, I think, but I wonder why we had the duplication in the first
place.  Is there any likelihood that bytea's semantics should diverge
from the text case?

            regards, tom lane

Re: UTF8MatchText

От

ITAGAKI Takahiro

Дата:

17 мая 2007 г., 23:07:31

Tom Lane <tgl@sss.pgh.pa.us> wrote:

> On the strength of this closer reading, I would say that the patch isn't
> relying on UTF8's first-byte-vs-not-first-byte property after all.
> All that it's relying on is that no MB character is a prefix of another
> one, which seems like a necessary property for any sane encoding; plus
> that characters are considered equal only if they're bytewise equal.
> So are we sure it doesn't work for non-UTF8 encodings?  Maybe that
> earlier conclusion was based on a misunderstanding of what the patch
> really does.

Yes, I only used the 'disjoint representations for first-bytes and
not-first-bytes of MB characters' feature in UTF8. Other encodings
allows both [AB] and [BA] for MB character patterns. UTF8Match() does
not cope with those encodings; If we have '[AB][AB]' in a table and
search it with LIKE '%[BA]%', we judge that they are matched by mistake.

I've also misunderstood it before, and Dennis Bjorklund pointed out.
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01377.php

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: UTF8MatchText

От

Tom Lane

Дата:

17 мая 2007 г., 23:35:49

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> Yes, I only used the 'disjoint representations for first-bytes and
> not-first-bytes of MB characters' feature in UTF8. Other encodings
> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
> not cope with those encodings; If we have '[AB][AB]' in a table and
> search it with LIKE '%[BA]%', we judge that they are matched by mistake.

AFAICS, the patch does *not* make that mistake because % will not
advance over a fractional character.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

18 мая 2007 г., 00:06:30


Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
>
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; If we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by mistake.
>>
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.
>
>


Yeah, I think that's right.

Attached is my current WIP patch. If we decide that this optimisation
can in fact be applied to all backend encodings, that will be easily
incorporated. It will simplify the code further. Note that all the
common code in the MatchText and do_like_escape functions has been
factored - and the bytea functions just call the single-byte text
versions - AFAICS the effect will be identical to having the specialised
versions. (I'm always happy when code volume can be reduced.)

cheers

andrew

Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c    27 Feb 2007 23:48:08 -0000    1.68
--- src/backend/utils/adt/like.c    18 May 2007 02:47:41 -0000
***************
*** 28,48 ****
  #define LIKE_ABORT                        (-1)


! static int    MatchText(char *t, int tlen, char *p, int plen);
! static int    MatchTextIC(char *t, int tlen, char *p, int plen);
! static int    MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);

! static int    MBMatchText(char *t, int tlen, char *p, int plen);
! static int    MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
--- 28,50 ----
  #define LIKE_ABORT                        (-1)


! static int    SB_MatchText(char *t, int tlen, char *p, int plen);
! static int    SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);

! static int    MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

+ static int    UTF8_MatchText(char *t, int tlen, char *p, int plen);
+ static int    GenericMatchText(char *s, int slen, char* p, int plen);
+ static int    mbtexticlike(text *str, text *pat);
+
  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static __inline__ int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 74,86 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)    ((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)        (*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 90,122 ****
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

- /* And some support for BYTEA */
- #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
- #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
- #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)


  /*
   *    interface routines called by the function manager
--- 90,170 ----
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MB_MatchText
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) BYTEEQ(p1, p2)
! #define NextChar(p, plen) NextByte(p, plen)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

+ #define MatchText    SB_MatchText
+ #define do_like_escape    SB_do_like_escape
+
+ #include "like_match.c"
+
+ /* set up to compile like_match.c for single byte case insensitive matching */
+
+ #define CHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
+ #define NextChar(p, plen) NextByte(p, plen)
+ #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
+
+ #define MatchText    SB_MatchTextIC
+
+ #include "like_match.c"
+
+ /* set up for UTF8 match optimisation */
+
+ #define CHAREQ(p1, p2) wchareq(p1, p2)
+ #define NextChar(p, plen) \
+     do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
+ #define CopyAdvChar(dst, src, srclen) \
+     do { int __l = pg_mblen(src); \
+          (srclen) -= __l; \
+          while (__l-- > 0) \
+              *(dst)++ = *(src)++; \
+        } while (0)
+
+ #define MatchText    UTF8_MatchText
+ #define UTF8_OPT
+
  #include "like_match.c"


+ static __inline__ int
+ GenericMatchText(char *s, int slen, char* p, int plen)
+ {
+     if (pg_database_encoding_max_length() == 1)
+         return SB_MatchText(s, slen, p, plen);
+     else if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }
+
+ static __inline__ int
+ mbtexticlike(text *str, text *pat)
+ {
+     char       *s,
+                *p;
+     int            slen,
+                 plen;
+
+     /* Force inputs to lower case to achieve case insensitivity */
+     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
+     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
+     s = VARDATA(str);
+     slen = (VARSIZE(str) - VARHDRSZ);
+     p = VARDATA(pat);
+     plen = (VARSIZE(pat) - VARHDRSZ);
+
+     if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }

  /*
   *    interface routines called by the function manager
***************
*** 138,147 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 186,192 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 162,171 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 207,213 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 186,195 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 228,234 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 210,219 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 249,255 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 234,240 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 270,276 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 255,261 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 291,297 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 281,305 ****
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
      {
-         /* Force inputs to lower case to achieve case insensitivity */
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
--- 317,331 ----
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
      {
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         result = (mbtexticlike(strtext, pat) == LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
***************
*** 322,346 ****
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
      {
-         /* Force inputs to lower case to achieve case insensitivity */
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
--- 348,362 ----
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
      {
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         result = (mbtexticlike(strtext, pat) != LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
***************
*** 363,383 ****
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 379,388 ----
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
!         result = (mbtexticlike(str, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 399,419 ****
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 404,413 ----
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
!         result = (mbtexticlike(str, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 430,436 ****
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

--- 424,430 ----
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = SB_do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

***************
*** 446,624 ****
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result;
!     char       *p,
!                *e,
!                *r;
!     int            plen,
!                 elen;
!     bool        afterescape;
!
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);
!     e = VARDATA(esc);
!     elen = (VARSIZE(esc) - VARHDRSZ);
!
!     /*
!      * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth
!      * trying to calculate the size more accurately than that.
!      */
!     result = (text *) palloc(plen * 2 + VARHDRSZ);
!     r = VARDATA(result);
!
!     if (elen == 0)
!     {
!         /*
!          * No escape character is wanted.  Double any backslashes in the
!          * pattern to make them act like ordinary characters.
!          */
!         while (plen > 0)
!         {
!             if (*p == '\\')
!                 *r++ = '\\';
!             BYTEA_CopyAdvChar(r, p, plen);
!         }
!     }
!     else
!     {
!         /*
!          * The specified escape must be only a single character.
!          */
!         BYTEA_NextChar(e, elen);
!         if (elen != 0)
!             ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
!                      errmsg("invalid escape string"),
!                   errhint("Escape string must be empty or one character.")));
!
!         e = VARDATA(esc);
!
!         /*
!          * If specified escape is '\', just copy the pattern as-is.
!          */
!         if (*e == '\\')
!         {
!             memcpy(result, pat, VARSIZE(pat));
!             PG_RETURN_BYTEA_P(result);
!         }
!
!         /*
!          * Otherwise, convert occurrences of the specified escape character to
!          * '\', and double occurrences of '\' --- unless they immediately
!          * follow an escape character!
!          */
!         afterescape = false;
!         while (plen > 0)
!         {
!             if (BYTEA_CHAREQ(p, e) && !afterescape)
!             {
!                 *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = true;
!             }
!             else if (*p == '\\')
!             {
!                 *r++ = '\\';
!                 if (!afterescape)
!                     *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = false;
!             }
!             else
!             {
!                 BYTEA_CopyAdvChar(r, p, plen);
!                 afterescape = false;
!             }
!         }
!     }
!
!     SET_VARSIZE(result, r - ((char *) result));

!     PG_RETURN_BYTEA_P(result);
  }

- /*
-  * Same as above, but specifically for bytea (binary) datatype
-  */
- static int
- MatchBytea(char *t, int tlen, char *p, int plen)
- {
-     /* Fast path for match-everything pattern */
-     if ((plen == 1) && (*p == '%'))
-         return LIKE_TRUE;
-
-     while ((tlen > 0) && (plen > 0))
-     {
-         if (*p == '\\')
-         {
-             /* Next pattern char must match literally, whatever it is */
-             BYTEA_NextChar(p, plen);
-             if ((plen <= 0) || !BYTEA_CHAREQ(t, p))
-                 return LIKE_FALSE;
-         }
-         else if (*p == '%')
-         {
-             /* %% is the same as % according to the SQL standard */
-             /* Advance past all %'s */
-             while ((plen > 0) && (*p == '%'))
-                 BYTEA_NextChar(p, plen);
-             /* Trailing percent matches everything. */
-             if (plen <= 0)
-                 return LIKE_TRUE;
-
-             /*
-              * Otherwise, scan for a text position at which we can match the
-              * rest of the pattern.
-              */
-             while (tlen > 0)
-             {
-                 /*
-                  * Optimization to prevent most recursion: don't recurse
-                  * unless first pattern char might match this text char.
-                  */
-                 if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
-                 {
-                     int            matched = MatchBytea(t, tlen, p, plen);
-
-                     if (matched != LIKE_FALSE)
-                         return matched; /* TRUE or ABORT */
-                 }
-
-                 BYTEA_NextChar(t, tlen);
-             }
-
-             /*
-              * End of text with no match, so no point in trying later places
-              * to start matching this pattern.
-              */
-             return LIKE_ABORT;
-         }
-         else if ((*p != '_') && !BYTEA_CHAREQ(t, p))
-         {
-             /*
-              * Not the single-character wildcard and no explicit match? Then
-              * time to quit...
-              */
-             return LIKE_FALSE;
-         }
-
-         BYTEA_NextChar(t, tlen);
-         BYTEA_NextChar(p, plen);
-     }
-
-     if (tlen > 0)
-         return LIKE_FALSE;        /* end of pattern, but not of text */
-
-     /* End of input string.  Do we have matching pattern remaining? */
-     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
-                                          * pattern */
-         BYTEA_NextChar(p, plen);
-     if (plen <= 0)
-         return LIKE_TRUE;
-
-     /*
-      * End of text with no match, so no point in trying later places to start
-      * matching this pattern.
-      */
-     return LIKE_ABORT;
- }    /* MatchBytea() */
--- 440,447 ----
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result = SB_do_like_escape((text *)pat, (text *)esc);

!     PG_RETURN_BYTEA_P((bytea *)result);
  }

Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.15
diff -c -r1.15 like_match.c
*** src/backend/utils/adt/like_match.c    27 Feb 2007 23:48:08 -0000    1.15
--- src/backend/utils/adt/like_match.c    18 May 2007 02:47:41 -0000
***************
*** 9,19 ****
   * Before the inclusion, we need to define following macros:
   *
   * CHAREQ
-  * ICHAREQ
   * NextChar
   * CopyAdvChar
   * MatchText (MBMatchText)
-  * MatchTextIC (MBMatchTextIC)
   * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
--- 9,17 ----
***************
*** 82,88 ****
          if (*p == '\\')
          {
              /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
              if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
--- 80,86 ----
          if (*p == '\\')
          {
              /* Next pattern char must match literally, whatever it is */
!             NextByte(p, plen);
              if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
***************
*** 91,97 ****
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
--- 89,95 ----
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextByte(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
***************
*** 123,129 ****
               */
              return LIKE_ABORT;
          }
!         else if ((*p != '_') && !CHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 121,146 ----
               */
              return LIKE_ABORT;
          }
! #ifdef UTF8_OPT
!         /*
!          * UTF8 is optimised to do byte at a time matching in most cases,
!          * thus saving expensive calls to NextChar.
!          *
!          * UTF8 has disjoint representations for first-bytes and
!          * not-first-bytes of MB characters, and thus it is
!          * impossible to make a false match in which an MB pattern
!          * character is matched to the end of one data character
!          * plus the start of another.
!          * In character sets without that property, we have to use the
!          * slow way to ensure we don't make out-of-sync matches.
!          */
!         else if (*p == '_')
!         {
!             NextChar(t, tlen);
!             NextByte(p, plen);
!             continue;
!         }
!         else if (!BYTEEQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 132,215 ****
              return LIKE_FALSE;
          }

!         NextChar(t, tlen);
!         NextChar(p, plen);
!     }
!
!     if (tlen > 0)
!         return LIKE_FALSE;        /* end of pattern, but not of text */
!
!     /* End of input string.  Do we have matching pattern remaining? */
!     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
!                                          * pattern */
!         NextChar(p, plen);
!     if (plen <= 0)
!         return LIKE_TRUE;
!
!     /*
!      * End of text with no match, so no point in trying later places to start
!      * matching this pattern.
!      */
!     return LIKE_ABORT;
! }    /* MatchText() */
!
! /*
!  * Same as above, but ignore case
!  */
! static int
! MatchTextIC(char *t, int tlen, char *p, int plen)
! {
!     /* Fast path for match-everything pattern */
!     if ((plen == 1) && (*p == '%'))
!         return LIKE_TRUE;
!
!     while ((tlen > 0) && (plen > 0))
!     {
!         if (*p == '\\')
!         {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !ICHAREQ(t, p))
!                 return LIKE_FALSE;
!         }
!         else if (*p == '%')
!         {
!             /* %% is the same as % according to the SQL standard */
!             /* Advance past all %'s */
!             while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
!             /* Trailing percent matches everything. */
!             if (plen <= 0)
!                 return LIKE_TRUE;
!
!             /*
!              * Otherwise, scan for a text position at which we can match the
!              * rest of the pattern.
!              */
!             while (tlen > 0)
!             {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
!                 {
!                     int            matched = MatchTextIC(t, tlen, p, plen);
!
!                     if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
!                 }
!
!                 NextChar(t, tlen);
!             }
!
!             /*
!              * End of text with no match, so no point in trying later places
!              * to start matching this pattern.
!              */
!             return LIKE_ABORT;
!         }
!         else if ((*p != '_') && !ICHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 149,163 ----
              return LIKE_FALSE;
          }

!         NextByte(t, tlen);
!         NextByte(p, plen);
! #else
!         /*
!          * Branch for non-utf8 multi-byte charsets and also for single-byte
!          * charsets which don't gain any benefir from the above optimisation.
!          */
!
!         else if ((*p != '_') && !CHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 220,225 ****
--- 168,175 ----

          NextChar(t, tlen);
          NextChar(p, plen);
+
+ #endif /* UTF8_OPT */
      }

      if (tlen > 0)
***************
*** 228,234 ****
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextChar(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

--- 178,184 ----
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextByte(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

***************
*** 237,248 ****
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchTextIC() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
  static text *
  do_like_escape(text *pat, text *esc)
  {
--- 187,200 ----
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchText() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
+ #ifdef do_like_escape
+
  static text *
  do_like_escape(text *pat, text *esc)
  {
***************
*** 336,338 ****
--- 288,304 ----

      return result;
  }
+ #endif /* do_like_escape */
+
+ #undef CHAREQ
+ #undef NextChar
+ #undef CopyAdvChar
+ #undef MatchText
+
+ #ifdef do_like_escape
+ #undef do_like_escape
+ #endif
+
+ #ifdef UTF8_OPT
+ #undef UTF8_OPT
+ #endif

Re: UTF8MatchText

От

Tom Lane

Дата:

18 мая 2007 г., 00:32:34

Andrew Dunstan <andrew@dunslane.net> writes:
> Attached is my current WIP patch.

A few quick eyeball comments:

> ! static __inline__ int

Under *no* circumstances use __inline__, as it will certainly break
every non-gcc compiler.  Use "inline", which we #define appropriately
at need.

> !          * UTF8 has disjoint representations for first-bytes and
> !          * not-first-bytes of MB characters, and thus it is
> !          * impossible to make a false match in which an MB pattern
> !          * character is matched to the end of one data character
> !          * plus the start of another.
> !          * In character sets without that property, we have to use the
> !          * slow way to ensure we don't make out-of-sync matches.

I thought we'd concluded that this explanation is pseudo-science?

> !          * Branch for non-utf8 multi-byte charsets and also for single-byte
> !          * charsets which don't gain any benefir from the above optimisation.

spellcheck...

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

18 мая 2007 г., 00:42:53


Tom Lane wrote:
> Under *no* circumstances use __inline__, as it will certainly break
> every non-gcc compiler.  Use "inline", which we #define appropriately
> at need.
>

OK. (this was from upstream patch.)
>
> I thought we'd concluded that this explanation is pseudo-science?
>
>
[...]
>
> spellcheck...
>
>
>

Right. I'm waiting on a consensus about the UTF8-ness of the solution
before I adjust comments.

cheers

andrew

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

18 мая 2007 г., 12:44:03


Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
>
>> Yes, I only used the 'disjoint representations for first-bytes and
>> not-first-bytes of MB characters' feature in UTF8. Other encodings
>> allows both [AB] and [BA] for MB character patterns. UTF8Match() does
>> not cope with those encodings; If we have '[AB][AB]' in a table and
>> search it with LIKE '%[BA]%', we judge that they are matched by mistake.
>>
>
> AFAICS, the patch does *not* make that mistake because % will not
> advance over a fractional character.
>
>
>

Unless I hear differently, my present intention is to apply the
suggested improvement universally. I'll wait a day or two before
completing the patch.

cheers

andrew

Re: UTF8MatchText

От

Dennis Bjorklund

Дата:

20 мая 2007 г., 07:07:36

Tom Lane skrev:
> You could imagine trying to do
> % a byte at a time (and indeed that's what I'd been thinking it did)
> but that gets you out of sync which breaks the _ case.

It is only when you have a pattern like '%_' when this is a problem and
we could detect this and do byte by byte when it's not. Now we check (*p
== '\\') || (*p == '_') in each iteration when we scan over characters
for '%', and we could do it once and have different loops for the two cases.

Other than this part that I think can be optimized I don't see anything
wrong with the idea behind the patch. To make the '%' case fast might be
an important optimization for a lot of use cases. It's not uncommon that
'%' matches a bigger part of the string than the rest of the pattern.

It's easy to make a misstake when one is used to think about the simple
fixed size characters like ascii. Strange that this simple topic can be
so difficult to think about... :-)

/Dennis

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 10:28:48

Dennis Bjorklund wrote:
> Tom Lane skrev:
>> You could imagine trying to do
>> % a byte at a time (and indeed that's what I'd been thinking it did)
>> but that gets you out of sync which breaks the _ case.
>
> It is only when you have a pattern like '%_' when this is a problem
> and we could detect this and do byte by byte when it's not. Now we
> check (*p == '\\') || (*p == '_') in each iteration when we scan over
> characters for '%', and we could do it once and have different loops
> for the two cases.
>
> Other than this part that I think can be optimized I don't see
> anything wrong with the idea behind the patch. To make the '%' case
> fast might be an important optimization for a lot of use cases. It's
> not uncommon that '%' matches a bigger part of the string than the
> rest of the pattern.
>

Are you sure? The big remaining char-matching bottleneck will surely be
in the code that scans for a place to start matching a %. But that's
exactly where we can't use byte matching for cases where the charset
might include AB and BA as characters - the pattern might contain %BA
and the string AB. However, this isn't a danger for UTF8, which leads me
to think that we do indeed need a special case for UTF8, but for a
different improvement from that proposed in the original patch. I'll
post an updated patch shortly.

cheers

andrew

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 11:11:34


I wrote:
>
>
>>
>> It is only when you have a pattern like '%_' when this is a problem
>> and we could detect this and do byte by byte when it's not. Now we
>> check (*p == '\\') || (*p == '_') in each iteration when we scan over
>> characters for '%', and we could do it once and have different loops
>> for the two cases.
>>
>> Other than this part that I think can be optimized I don't see
>> anything wrong with the idea behind the patch. To make the '%' case
>> fast might be an important optimization for a lot of use cases. It's
>> not uncommon that '%' matches a bigger part of the string than the
>> rest of the pattern.
>>
>
>
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.
>

Here is a patch that implements this. Please analyse for possible breakage.

cheers

andrew

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 11:22:26

oops. patch attached this time


Andrew Dunstan wrote:
>
>
> I wrote:
>>
>>
>>>
>>> It is only when you have a pattern like '%_' when this is a problem
>>> and we could detect this and do byte by byte when it's not. Now we
>>> check (*p == '\\') || (*p == '_') in each iteration when we scan
>>> over characters for '%', and we could do it once and have different
>>> loops for the two cases.
>>>
>>> Other than this part that I think can be optimized I don't see
>>> anything wrong with the idea behind the patch. To make the '%' case
>>> fast might be an important optimization for a lot of use cases. It's
>>> not uncommon that '%' matches a bigger part of the string than the
>>> rest of the pattern.
>>>
>>
>>
>> Are you sure? The big remaining char-matching bottleneck will surely
>> be in the code that scans for a place to start matching a %. But
>> that's exactly where we can't use byte matching for cases where the
>> charset might include AB and BA as characters - the pattern might
>> contain %BA and the string AB. However, this isn't a danger for UTF8,
>> which leads me to think that we do indeed need a special case for
>> UTF8, but for a different improvement from that proposed in the
>> original patch. I'll post an updated patch shortly.
>>
>
> Here is a patch that implements this. Please analyse for possible
> breakage.
>
> cheers
>
> andrew
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>       choose an index scan if your joining column's datatypes do not
>       match
>
Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c    27 Feb 2007 23:48:08 -0000    1.68
--- src/backend/utils/adt/like.c    20 May 2007 14:16:22 -0000
***************
*** 28,48 ****
  #define LIKE_ABORT                        (-1)


! static int    MatchText(char *t, int tlen, char *p, int plen);
! static int    MatchTextIC(char *t, int tlen, char *p, int plen);
! static int    MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);

! static int    MBMatchText(char *t, int tlen, char *p, int plen);
! static int    MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
--- 28,51 ----
  #define LIKE_ABORT                        (-1)


! static int    SB_MatchText(char *t, int tlen, char *p, int plen);
! static int    SB_MatchTextIC(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);

! static int    MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

+ static int    UTF8_MatchText(char *t, int tlen, char *p, int plen);
+
+ static int    GenericMatchText(char *s, int slen, char* p, int plen);
+ static int    mbtexticlike(text *str, text *pat);
+
  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 75,87 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)    ((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)        (*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 89,122 ****
           while (__l-- > 0) \
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

! #undef CHAREQ
! #undef ICHAREQ
! #undef NextChar
! #undef CopyAdvChar
! #undef MatchText
! #undef MatchTextIC
! #undef do_like_escape

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

! /* And some support for BYTEA */
! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)


  /*
   *    interface routines called by the function manager
--- 90,173 ----
           while (__l-- > 0) \
               *(dst)++ = *(src)++; \
         } while (0)
+ #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2)

! #define MatchText    MB_MatchText
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

! /* Set up to compile like_match.c for UTF8 characters */
! #define CHAREQ(p1, p2) wchareq(p1, p2)
! #define NextChar(p, plen) \
!     do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
! #define CopyAdvChar(dst, src, srclen) \
!     do { int __l = pg_mblen(src); \
!          (srclen) -= __l; \
!          while (__l-- > 0) \
!              *(dst)++ = *(src)++; \
!        } while (0)
! #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2)
! #define UTF8_OPT
!
! #define MatchText    UTF8_MatchText
!
! #include "like_match.c"

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) BYTEEQ(p1, p2)
! #define NextChar(p, plen) NextByte(p, plen)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
+ #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2)
+
+ #define MatchText    SB_MatchText
+ #define do_like_escape    SB_do_like_escape

  #include "like_match.c"

! /* set up to compile like_match.c for single byte case insensitive matching */

+ #define CHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
+ #define NextChar(p, plen) NextByte(p, plen)
+ #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
+ #define BYTEEQIC(b1, b2) CHAREQ(b1, b2)
+
+ #define MatchText    SB_MatchTextIC
+
+ #include "like_match.c"
+
+ static inline int
+ GenericMatchText(char *s, int slen, char* p, int plen)
+ {
+     if (pg_database_encoding_max_length() == 1)
+         return SB_MatchText(s, slen, p, plen);
+     else if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }
+
+ static inline int
+ mbtexticlike(text *str, text *pat)
+ {
+     char       *s,
+                *p;
+     int            slen,
+                 plen;
+
+     /* Force inputs to lower case to achieve case insensitivity */
+     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
+     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
+     s = VARDATA(str);
+     slen = (VARSIZE(str) - VARHDRSZ);
+     p = VARDATA(pat);
+     plen = (VARSIZE(pat) - VARHDRSZ);
+
+     if (GetDatabaseEncoding() == PG_UTF8)
+         return UTF8_MatchText(s, slen, p, plen);
+     else
+         return MB_MatchText(s, slen, p, plen);
+ }

  /*
   *    interface routines called by the function manager
***************
*** 138,147 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 189,195 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 162,171 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 210,216 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 186,195 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 231,237 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 210,219 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 252,258 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 234,240 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 273,279 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 255,261 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 294,300 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 281,305 ****
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
      {
-         /* Force inputs to lower case to achieve case insensitivity */
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
--- 320,334 ----
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
      {
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         result = (mbtexticlike(strtext, pat) == LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
***************
*** 322,346 ****
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
      {
-         /* Force inputs to lower case to achieve case insensitivity */
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
--- 351,365 ----
          slen = strlen(s);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
      {
          text       *strtext;

          strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         result = (mbtexticlike(strtext, pat) != LIKE_TRUE);
      }

      PG_RETURN_BOOL(result);
***************
*** 363,383 ****
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 382,391 ----
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
      }
      else
!         result = (mbtexticlike(str, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 399,419 ****
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 407,416 ----
          slen = (VARSIZE(str) - VARHDRSZ);
          p = VARDATA(pat);
          plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (SB_MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
      }
      else
!         result = (mbtexticlike(str, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 430,436 ****
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

--- 427,433 ----
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = SB_do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

***************
*** 446,624 ****
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result;
!     char       *p,
!                *e,
!                *r;
!     int            plen,
!                 elen;
!     bool        afterescape;
!
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);
!     e = VARDATA(esc);
!     elen = (VARSIZE(esc) - VARHDRSZ);
!
!     /*
!      * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth
!      * trying to calculate the size more accurately than that.
!      */
!     result = (text *) palloc(plen * 2 + VARHDRSZ);
!     r = VARDATA(result);
!
!     if (elen == 0)
!     {
!         /*
!          * No escape character is wanted.  Double any backslashes in the
!          * pattern to make them act like ordinary characters.
!          */
!         while (plen > 0)
!         {
!             if (*p == '\\')
!                 *r++ = '\\';
!             BYTEA_CopyAdvChar(r, p, plen);
!         }
!     }
!     else
!     {
!         /*
!          * The specified escape must be only a single character.
!          */
!         BYTEA_NextChar(e, elen);
!         if (elen != 0)
!             ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
!                      errmsg("invalid escape string"),
!                   errhint("Escape string must be empty or one character.")));
!
!         e = VARDATA(esc);
!
!         /*
!          * If specified escape is '\', just copy the pattern as-is.
!          */
!         if (*e == '\\')
!         {
!             memcpy(result, pat, VARSIZE(pat));
!             PG_RETURN_BYTEA_P(result);
!         }
!
!         /*
!          * Otherwise, convert occurrences of the specified escape character to
!          * '\', and double occurrences of '\' --- unless they immediately
!          * follow an escape character!
!          */
!         afterescape = false;
!         while (plen > 0)
!         {
!             if (BYTEA_CHAREQ(p, e) && !afterescape)
!             {
!                 *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = true;
!             }
!             else if (*p == '\\')
!             {
!                 *r++ = '\\';
!                 if (!afterescape)
!                     *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = false;
!             }
!             else
!             {
!                 BYTEA_CopyAdvChar(r, p, plen);
!                 afterescape = false;
!             }
!         }
!     }

!     SET_VARSIZE(result, r - ((char *) result));
!
!     PG_RETURN_BYTEA_P(result);
  }

- /*
-  * Same as above, but specifically for bytea (binary) datatype
-  */
- static int
- MatchBytea(char *t, int tlen, char *p, int plen)
- {
-     /* Fast path for match-everything pattern */
-     if ((plen == 1) && (*p == '%'))
-         return LIKE_TRUE;
-
-     while ((tlen > 0) && (plen > 0))
-     {
-         if (*p == '\\')
-         {
-             /* Next pattern char must match literally, whatever it is */
-             BYTEA_NextChar(p, plen);
-             if ((plen <= 0) || !BYTEA_CHAREQ(t, p))
-                 return LIKE_FALSE;
-         }
-         else if (*p == '%')
-         {
-             /* %% is the same as % according to the SQL standard */
-             /* Advance past all %'s */
-             while ((plen > 0) && (*p == '%'))
-                 BYTEA_NextChar(p, plen);
-             /* Trailing percent matches everything. */
-             if (plen <= 0)
-                 return LIKE_TRUE;
-
-             /*
-              * Otherwise, scan for a text position at which we can match the
-              * rest of the pattern.
-              */
-             while (tlen > 0)
-             {
-                 /*
-                  * Optimization to prevent most recursion: don't recurse
-                  * unless first pattern char might match this text char.
-                  */
-                 if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
-                 {
-                     int            matched = MatchBytea(t, tlen, p, plen);
-
-                     if (matched != LIKE_FALSE)
-                         return matched; /* TRUE or ABORT */
-                 }
-
-                 BYTEA_NextChar(t, tlen);
-             }
-
-             /*
-              * End of text with no match, so no point in trying later places
-              * to start matching this pattern.
-              */
-             return LIKE_ABORT;
-         }
-         else if ((*p != '_') && !BYTEA_CHAREQ(t, p))
-         {
-             /*
-              * Not the single-character wildcard and no explicit match? Then
-              * time to quit...
-              */
-             return LIKE_FALSE;
-         }
-
-         BYTEA_NextChar(t, tlen);
-         BYTEA_NextChar(p, plen);
-     }
-
-     if (tlen > 0)
-         return LIKE_FALSE;        /* end of pattern, but not of text */
-
-     /* End of input string.  Do we have matching pattern remaining? */
-     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
-                                          * pattern */
-         BYTEA_NextChar(p, plen);
-     if (plen <= 0)
-         return LIKE_TRUE;
-
-     /*
-      * End of text with no match, so no point in trying later places to start
-      * matching this pattern.
-      */
-     return LIKE_ABORT;
- }    /* MatchBytea() */
--- 443,450 ----
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result = SB_do_like_escape((text *)pat, (text *)esc);

!     PG_RETURN_BYTEA_P((bytea *)result);
  }

Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.15
diff -c -r1.15 like_match.c
*** src/backend/utils/adt/like_match.c    27 Feb 2007 23:48:08 -0000    1.15
--- src/backend/utils/adt/like_match.c    20 May 2007 14:16:24 -0000
***************
*** 9,19 ****
   * Before the inclusion, we need to define following macros:
   *
   * CHAREQ
!  * ICHAREQ
   * NextChar
   * CopyAdvChar
   * MatchText (MBMatchText)
-  * MatchTextIC (MBMatchTextIC)
   * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
--- 9,18 ----
   * Before the inclusion, we need to define following macros:
   *
   * CHAREQ
!  * BYTEEQIC
   * NextChar
   * CopyAdvChar
   * MatchText (MBMatchText)
   * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
***************
*** 70,75 ****
--- 69,82 ----
   *--------------------
   */

+ #ifdef UTF8_OPT
+ #define PCT_CHAREQ(c1, c2) BYTEEQ(c1, c2)
+ #define PCT_NEXT(a, b) NextByte(a, b)
+ #else
+ #define PCT_CHAREQ(c1, c2) CHAREQ(c1, c2)
+ #define PCT_NEXT(a, b) NextChar(a, b)
+ #endif
+
  static int
  MatchText(char *t, int tlen, char *p, int plen)
  {
***************
*** 81,89 ****
      {
          if (*p == '\\')
          {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
--- 88,96 ----
      {
          if (*p == '\\')
          {
!             /* Next byte must match literally, whatever it is */
!             NextByte(p, plen);
!             if ((plen <= 0) || !BYTEEQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
***************
*** 91,97 ****
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
--- 98,104 ----
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextByte(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
***************
*** 106,112 ****
                   * Optimization to prevent most recursion: don't recurse
                   * unless first pattern char might match this text char.
                   */
!                 if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
                      int            matched = MatchText(t, tlen, p, plen);

--- 113,119 ----
                   * Optimization to prevent most recursion: don't recurse
                   * unless first pattern char might match this text char.
                   */
!                 if (PCT_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
                      int            matched = MatchText(t, tlen, p, plen);

***************
*** 114,120 ****
                          return matched; /* TRUE or ABORT */
                  }

!                 NextChar(t, tlen);
              }

              /*
--- 121,127 ----
                          return matched; /* TRUE or ABORT */
                  }

!                 PCT_NEXT(t, tlen);
              }

              /*
***************
*** 123,215 ****
               */
              return LIKE_ABORT;
          }
!         else if ((*p != '_') && !CHAREQ(t, p))
!         {
!             /*
!              * Not the single-character wildcard and no explicit match? Then
!              * time to quit...
!              */
!             return LIKE_FALSE;
!         }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
!     }
!
!     if (tlen > 0)
!         return LIKE_FALSE;        /* end of pattern, but not of text */
!
!     /* End of input string.  Do we have matching pattern remaining? */
!     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
!                                          * pattern */
!         NextChar(p, plen);
!     if (plen <= 0)
!         return LIKE_TRUE;
!
!     /*
!      * End of text with no match, so no point in trying later places to start
!      * matching this pattern.
!      */
!     return LIKE_ABORT;
! }    /* MatchText() */
!
! /*
!  * Same as above, but ignore case
!  */
! static int
! MatchTextIC(char *t, int tlen, char *p, int plen)
! {
!     /* Fast path for match-everything pattern */
!     if ((plen == 1) && (*p == '%'))
!         return LIKE_TRUE;
!
!     while ((tlen > 0) && (plen > 0))
!     {
!         if (*p == '\\')
!         {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !ICHAREQ(t, p))
!                 return LIKE_FALSE;
!         }
!         else if (*p == '%')
          {
!             /* %% is the same as % according to the SQL standard */
!             /* Advance past all %'s */
!             while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
!             /* Trailing percent matches everything. */
!             if (plen <= 0)
!                 return LIKE_TRUE;
!
!             /*
!              * Otherwise, scan for a text position at which we can match the
!              * rest of the pattern.
!              */
!             while (tlen > 0)
!             {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
!                 {
!                     int            matched = MatchTextIC(t, tlen, p, plen);
!
!                     if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
!                 }
!
!                 NextChar(t, tlen);
!             }
!
!             /*
!              * End of text with no match, so no point in trying later places
!              * to start matching this pattern.
!              */
!             return LIKE_ABORT;
          }
!         else if ((*p != '_') && !ICHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 130,142 ----
               */
              return LIKE_ABORT;
          }
!         else if (*p == '_')
          {
!             NextChar(t, tlen);
!             NextByte(p, plen);
!             continue;
          }
!         else if (!BYTEEQIC(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 217,225 ****
               */
              return LIKE_FALSE;
          }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
      }

      if (tlen > 0)
--- 144,156 ----
               */
              return LIKE_FALSE;
          }
!         /*
!          * It is safe to use NextByte instead of NextChar here, even for
!          * multi-byte character sets, because we are not following
!          * immediately after a wildcard character.
!          */
!         NextByte(t, tlen);
!         NextByte(p, plen);
      }

      if (tlen > 0)
***************
*** 228,234 ****
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextChar(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

--- 159,165 ----
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextByte(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

***************
*** 237,248 ****
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchTextIC() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
  static text *
  do_like_escape(text *pat, text *esc)
  {
--- 168,181 ----
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchText() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
+ #ifdef do_like_escape
+
  static text *
  do_like_escape(text *pat, text *esc)
  {
***************
*** 336,338 ****
--- 269,288 ----

      return result;
  }
+ #endif /* do_like_escape */
+
+ #undef CHAREQ
+ #undef BYTEEQIC
+ #undef PCT_CHAREQ
+ #undef PCT_NEXT
+ #undef NextChar
+ #undef CopyAdvChar
+ #undef MatchText
+
+ #ifdef do_like_escape
+ #undef do_like_escape
+ #endif
+
+ #ifdef UTF8_OPT
+ #undef UTF8_OPT
+ #endif

Re: UTF8MatchText

От

Tom Lane

Дата:

20 мая 2007 г., 13:58:53

Andrew Dunstan <andrew@dunslane.net> writes:
> Are you sure? The big remaining char-matching bottleneck will surely
> be in the code that scans for a place to start matching a %. But
> that's exactly where we can't use byte matching for cases where the
> charset might include AB and BA as characters - the pattern might
> contain %BA and the string AB. However, this isn't a danger for UTF8,
> which leads me to think that we do indeed need a special case for
> UTF8, but for a different improvement from that proposed in the
> original patch. I'll post an updated patch shortly.

> Here is a patch that implements this. Please analyse for possible
> breakage.

On the strength of this analysis, shouldn't we drop the separate
UTF8 match function and just use SB_MatchText for UTF8?

It strikes me that we may be overcomplicating matters in another way
too.  If you believe that the %-scan code is now the bottleneck, that
is, the key loop is where we have pattern '%foo' and we are trying to
match 'f' to each successive data position, then you should be bothered
that SB_MatchTextIC is applying tolower() to 'f' again for each data
character.  Worst-case we could have O(N^2) applications of tolower()
during a match.  I think there's a fair case to be made that we should
get rid of SB_MatchTextIC and implement *all* the case-insensitive
variants by means of an initial lower() call.  This would leave us with
just two match functions and allow considerable unification of the setup
logic.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 16:46:02


Tom Lane wrote:
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?
>

Possibly - IIRC I looked at that and there was some reason I didn't, but
I'll look again.

> It strikes me that we may be overcomplicating matters in another way
> too.  If you believe that the %-scan code is now the bottleneck, that
> is, the key loop is where we have pattern '%foo' and we are trying to
> match 'f' to each successive data position, then you should be bothered
> that SB_MatchTextIC is applying tolower() to 'f' again for each data
> character.  Worst-case we could have O(N^2) applications of tolower()
> during a match.  I think there's a fair case to be made that we should
> get rid of SB_MatchTextIC and implement *all* the case-insensitive
> variants by means of an initial lower() call.  This would leave us with
> just two match functions and allow considerable unification of the setup
> logic.
>
>
>


Yeah, quite possibly. I'm also wondering if we are wasting effort
downcasing what will in most cases be the same pattern over and over
again. Maybe we need to look at memoizing that somehow, or at least test
to see if that would be a gain.

We're getting quite a long way from the original patch :-)

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

20 мая 2007 г., 18:51:31

Andrew Dunstan <andrew@dunslane.net> writes:
> Yeah, quite possibly. I'm also wondering if we are wasting effort
> downcasing what will in most cases be the same pattern over and over
> again. Maybe we need to look at memoizing that somehow, or at least test
> to see if that would be a gain.

Someone (Itagaki-san IIRC) suggested that we ought to convert
"x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
stage, definitely before constant-folding in the planner.  That
would take care of that issue without any run-time mechanism,
and would open opportunities for making use of an index on lower(x).

I recall thinking at the time that there were some potential downsides,
but right at the moment I'm darned if I can see any --- especially
if we're going to make ILIKE do this uniformly at runtime anyway.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 18:58:42


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>> downcasing what will in most cases be the same pattern over and over
>> again. Maybe we need to look at memoizing that somehow, or at least test
>> to see if that would be a gain.
>>
>
> Someone (Itagaki-san IIRC) suggested that we ought to convert
> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
> stage, definitely before constant-folding in the planner.  That
> would take care of that issue without any run-time mechanism,
> and would open opportunities for making use of an index on lower(x).
>
> I recall thinking at the time that there were some potential downsides,
> but right at the moment I'm darned if I can see any --- especially
> if we're going to make ILIKE do this uniformly at runtime anyway.
>
>
>
Sounds like a TODO item. I'm already concerned a bit about scope creep
for this item.

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

20 мая 2007 г., 19:02:55

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> Yeah, quite possibly. I'm also wondering if we are wasting effort
>>> downcasing what will in most cases be the same pattern over and over
>>> again. Maybe we need to look at memoizing that somehow, or at least test
>>> to see if that would be a gain.
>>
>> Someone (Itagaki-san IIRC) suggested that we ought to convert
>> "x ILIKE y" into "lower(x) LIKE lower(y)" at some fairly early
>> stage, definitely before constant-folding in the planner.
>>
> Sounds like a TODO item. I'm already concerned a bit about scope creep
> for this item.

Agreed, I don't want to tackle this right now --- I'm just suggesting
it's probably a better answer than memoizing at runtime.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 19:11:20


Tom Lane wrote:
>
> On the strength of this analysis, shouldn't we drop the separate
> UTF8 match function and just use SB_MatchText for UTF8?
>

We still call NextChar() after "_", and I think we probably need to,
don't we? If so we can't just marry the cases.


cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

20 мая 2007 г., 19:24:37

Andrew Dunstan <andrew@dunslane.net> writes:
> Tom Lane wrote:
>> On the strength of this analysis, shouldn't we drop the separate
>> UTF8 match function and just use SB_MatchText for UTF8?

> We still call NextChar() after "_", and I think we probably need to,
> don't we? If so we can't just marry the cases.

Doh, you're right ... but on third thought, what happens with a pattern
containing "%_"?  If % tries to advance bytewise then we'll be trying to
apply NextChar in the middle of a data character, and bad things ensue.

I think we need to go back to the scheme with SB_ and MB_ variants and
no special case for UTF8.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

20 мая 2007 г., 19:43:51


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> Tom Lane wrote:
>>
>>> On the strength of this analysis, shouldn't we drop the separate
>>> UTF8 match function and just use SB_MatchText for UTF8?
>>>
>
>
>> We still call NextChar() after "_", and I think we probably need to,
>> don't we? If so we can't just marry the cases.
>>
>
> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"?  If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.
>
> I think we need to go back to the scheme with SB_ and MB_ variants and
> no special case for UTF8.
>
>
>

My head is spinning with all these variants. I'll look at ti tomorrow.

cheers

andrew

Re: UTF8MatchText

От

db@zigo.dhs.org

Дата:

21 мая 2007 г., 02:58:14

> Doh, you're right ... but on third thought, what happens with a pattern
> containing "%_"?  If % tries to advance bytewise then we'll be trying to
> apply NextChar in the middle of a data character, and bad things ensue.

Right, when you have '_' after a '%' you need to make sure the '%'
advances full characters. In my suggestion the test if '_' (or '\') come
after the '%' is done once and it select which of the two loops to use,
the one that do byte stepping or the one with NextChar.

It's difficult to know for sure that we have thought about all the corner
cases. I hope the gain is worth the effort.. :-)

/Dennis

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

21 мая 2007 г., 10:34:32


db@zigo.dhs.org wrote:
>> Doh, you're right ... but on third thought, what happens with a pattern
>> containing "%_"?  If % tries to advance bytewise then we'll be trying to
>> apply NextChar in the middle of a data character, and bad things ensue.
>>
>
> Right, when you have '_' after a '%' you need to make sure the '%'
> advances full characters. In my suggestion the test if '_' (or '\') come
> after the '%' is done once and it select which of the two loops to use,
> the one that do byte stepping or the one with NextChar.
>
> It's difficult to know for sure that we have thought about all the corner
> cases. I hope the gain is worth the effort.. :-)
>
>
>

Yes, I came to the same conclusion about how to restructure the code.

The current code contains this:

            while (tlen > 0)
            {
                /*
                 * Optimization to prevent most recursion: don't recurse
                 * unless first pattern char might match this text char.
                 */
                if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                {
                    int         matched = MatchText(t, tlen, p, plen);

                    if (matched != LIKE_FALSE)
                        return matched; /* TRUE or ABORT */
                }

                NextChar(t, tlen);
            }


The code appears to date from v 1.23 of like.c way back in 2001. I'm not
sure I agree with the comment, though. In the first place, the invariant
tests should not be in the loop, I think, and I'll hoist them out as
Dennis suggests. But why are we doing that CHAREQ? If it succeeds we'll
just do it again when we recurse, I think.

cheers

andrew

Re: UTF8MatchText

От

Tom Lane

Дата:

21 мая 2007 г., 10:44:30

Andrew Dunstan <andrew@dunslane.net> writes:
> But why are we doing that CHAREQ?

To avoid the cost of the recursive call, just like it says.

> If it succeeds we'll
> just do it again when we recurse, I think.

If you move the other two cases then you could advance t and p before
entering the recursion.

            regards, tom lane

Re: UTF8MatchText

От

Andrew Dunstan

Дата:

21 мая 2007 г., 13:59:47


Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>
>> But why are we doing that CHAREQ?
>>
>
> To avoid the cost of the recursive call, just like it says.
>
>
>> If it succeeds we'll
>> just do it again when we recurse, I think.
>>
>
> If you move the other two cases then you could advance t and p before
> entering the recursion.
>
>
>

Yeah. Since I have removed the "_" case I believe it's now safe there to
use BYTEEQ/NextByte, and since they are sufficiently cheap it's not
worth worrying about.

Attached is a patch version that I think draws together all the threads
of discussion so far. It's in fact quite a lot simpler than the existing
code, with no special UTF8 case - this should improve LIKE/ILIKE
processing for all charsets.

More eyeballs please for nasty corner cases.

cheers

andrew
? src/backend/utils/adt/.deps
Index: src/backend/utils/adt/like.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like.c,v
retrieving revision 1.68
diff -c -r1.68 like.c
*** src/backend/utils/adt/like.c    27 Feb 2007 23:48:08 -0000    1.68
--- src/backend/utils/adt/like.c    21 May 2007 16:50:48 -0000
***************
*** 28,48 ****
  #define LIKE_ABORT                        (-1)


! static int    MatchText(char *t, int tlen, char *p, int plen);
! static int    MatchTextIC(char *t, int tlen, char *p, int plen);
! static int    MatchBytea(char *t, int tlen, char *p, int plen);
! static text *do_like_escape(text *, text *);

! static int    MBMatchText(char *t, int tlen, char *p, int plen);
! static int    MBMatchTextIC(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
--- 28,48 ----
  #define LIKE_ABORT                        (-1)


! static int    SB_MatchText(char *t, int tlen, char *p, int plen);
! static text *SB_do_like_escape(text *, text *);

! static int    MB_MatchText(char *t, int tlen, char *p, int plen);
  static text *MB_do_like_escape(text *, text *);

+ static int    GenericMatchText(char *s, int slen, char* p, int plen);
+ static int    Generic_Text_IC_like(text *str, text *pat);
+
  /*--------------------
   * Support routine for MatchText. Compares given multibyte streams
   * as wide characters. If they match, returns 1 otherwise returns 0.
   *--------------------
   */
! static inline int
  wchareq(char *p1, char *p2)
  {
      int            p1_len;
***************
*** 72,86 ****
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
-  *
-  * Note that MBMatchText and MBMatchTextIC do exactly the same thing now.
-  * Is it worth refactoring to avoid duplicated code?  They might become
-  * different again in the future.
   */

  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
- #define ICHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
--- 72,84 ----
   * of getting a single character transformed to the system's wchar_t format.
   * So now, we just downcase the strings using lower() and apply regular LIKE
   * comparison.    This should be revisited when we install better locale support.
   */

+ #define NextByte(p, plen)    ((p)++, (plen)--)
+ #define BYTEEQ(p1, p2)        (*(p1) == *(p2))
+
  /* Set up to compile like_match.c for multibyte characters */
  #define CHAREQ(p1, p2) wchareq(p1, p2)
  #define NextChar(p, plen) \
      do { int __l = pg_mblen(p); (p) +=__l; (plen) -=__l; } while (0)
  #define CopyAdvChar(dst, src, srclen) \
***************
*** 89,122 ****
           while (__l-- > 0) \
               *(dst)++ = *(src)++; \
         } while (0)

! #define MatchText    MBMatchText
! #define MatchTextIC MBMatchTextIC
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

- #undef CHAREQ
- #undef ICHAREQ
- #undef NextChar
- #undef CopyAdvChar
- #undef MatchText
- #undef MatchTextIC
- #undef do_like_escape
-
  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) (*(p1) == *(p2))
! #define ICHAREQ(p1, p2) (tolower((unsigned char) *(p1)) == tolower((unsigned char) *(p2)))
! #define NextChar(p, plen) ((p)++, (plen)--)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)

  #include "like_match.c"

! /* And some support for BYTEA */
! #define BYTEA_CHAREQ(p1, p2) (*(p1) == *(p2))
! #define BYTEA_NextChar(p, plen) ((p)++, (plen)--)
! #define BYTEA_CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)


  /*
   *    interface routines called by the function manager
--- 87,137 ----
           while (__l-- > 0) \
               *(dst)++ = *(src)++; \
         } while (0)
+ #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2)

! #define MatchText    MB_MatchText
  #define do_like_escape    MB_do_like_escape

  #include "like_match.c"

  /* Set up to compile like_match.c for single-byte characters */
! #define CHAREQ(p1, p2) BYTEEQ(p1, p2)
! #define NextChar(p, plen) NextByte(p, plen)
  #define CopyAdvChar(dst, src, srclen) (*(dst)++ = *(src)++, (srclen)--)
+ #define BYTEEQIC(b1, b2) BYTEEQ(b1, b2)
+
+ #define MatchText    SB_MatchText
+ #define do_like_escape    SB_do_like_escape

  #include "like_match.c"

! static inline int
! GenericMatchText(char *s, int slen, char* p, int plen)
! {
!     if (pg_database_encoding_max_length() == 1)
!         return SB_MatchText(s, slen, p, plen);
!     else
!         return MB_MatchText(s, slen, p, plen);
! }

+ static inline int
+ Generic_Text_IC_like(text *str, text *pat)
+ {
+     char       *s,
+                *p;
+     int            slen,
+                 plen;
+
+     /* Force inputs to lower case to achieve case insensitivity */
+     str = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(str)));
+     pat = DatumGetTextP(DirectFunctionCall1(lower, PointerGetDatum(pat)));
+     s = VARDATA(str);
+     slen = (VARSIZE(str) - VARHDRSZ);
+     p = VARDATA(pat);
+     plen = (VARSIZE(pat) - VARHDRSZ);
+
+     return GenericMatchText(s, slen, p, plen);
+ }

  /*
   *    interface routines called by the function manager
***************
*** 138,147 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 153,159 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 162,171 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 174,180 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 186,195 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) == LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 195,201 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 210,219 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     if (pg_database_encoding_max_length() == 1)
!         result = (MatchText(s, slen, p, plen) != LIKE_TRUE);
!     else
!         result = (MBMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 216,222 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (GenericMatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 234,240 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 237,243 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 255,261 ****
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (MatchBytea(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
--- 258,264 ----
      p = VARDATA(pat);
      plen = (VARSIZE(pat) - VARHDRSZ);

!     result = (SB_MatchText(s, slen, p, plen) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 270,306 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;
!
!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;

!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 273,283 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 311,347 ****
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     char       *s,
!                *p;
!     int            slen,
!                 plen;
!
!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = NameStr(*str);
!         slen = strlen(s);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         text       *strtext;

!         strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!         strtext = DatumGetTextP(DirectFunctionCall1(lower,
!                                                   PointerGetDatum(strtext)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!
!         s = VARDATA(strtext);
!         slen = (VARSIZE(strtext) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 288,298 ----
      Name        str = PG_GETARG_NAME(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
!     text       *strtext;

!     strtext = DatumGetTextP(DirectFunctionCall1(name_text,
                                                      NameGetDatum(str)));
!     result = (Generic_Text_IC_like(strtext, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 352,383 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) == LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 303,310 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) == LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 388,419 ****
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;
-     char       *s,
-                *p;
-     int            slen,
-                 plen;

!     if (pg_database_encoding_max_length() == 1)
!     {
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }
!     else
!     {
!         /* Force inputs to lower case to achieve case insensitivity */
!         str = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(str)));
!         pat = DatumGetTextP(DirectFunctionCall1(lower,
!                                                 PointerGetDatum(pat)));
!         s = VARDATA(str);
!         slen = (VARSIZE(str) - VARHDRSZ);
!         p = VARDATA(pat);
!         plen = (VARSIZE(pat) - VARHDRSZ);
!         result = (MBMatchTextIC(s, slen, p, plen) != LIKE_TRUE);
!     }

      PG_RETURN_BOOL(result);
  }
--- 315,322 ----
      text       *str = PG_GETARG_TEXT_P(0);
      text       *pat = PG_GETARG_TEXT_P(1);
      bool        result;

!     result = (Generic_Text_IC_like(str, pat) != LIKE_TRUE);

      PG_RETURN_BOOL(result);
  }
***************
*** 430,436 ****
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

--- 333,339 ----
      text       *result;

      if (pg_database_encoding_max_length() == 1)
!         result = SB_do_like_escape(pat, esc);
      else
          result = MB_do_like_escape(pat, esc);

***************
*** 446,624 ****
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result;
!     char       *p,
!                *e,
!                *r;
!     int            plen,
!                 elen;
!     bool        afterescape;
!
!     p = VARDATA(pat);
!     plen = (VARSIZE(pat) - VARHDRSZ);
!     e = VARDATA(esc);
!     elen = (VARSIZE(esc) - VARHDRSZ);
!
!     /*
!      * Worst-case pattern growth is 2x --- unlikely, but it's hardly worth
!      * trying to calculate the size more accurately than that.
!      */
!     result = (text *) palloc(plen * 2 + VARHDRSZ);
!     r = VARDATA(result);
!
!     if (elen == 0)
!     {
!         /*
!          * No escape character is wanted.  Double any backslashes in the
!          * pattern to make them act like ordinary characters.
!          */
!         while (plen > 0)
!         {
!             if (*p == '\\')
!                 *r++ = '\\';
!             BYTEA_CopyAdvChar(r, p, plen);
!         }
!     }
!     else
!     {
!         /*
!          * The specified escape must be only a single character.
!          */
!         BYTEA_NextChar(e, elen);
!         if (elen != 0)
!             ereport(ERROR,
!                     (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
!                      errmsg("invalid escape string"),
!                   errhint("Escape string must be empty or one character.")));
!
!         e = VARDATA(esc);
!
!         /*
!          * If specified escape is '\', just copy the pattern as-is.
!          */
!         if (*e == '\\')
!         {
!             memcpy(result, pat, VARSIZE(pat));
!             PG_RETURN_BYTEA_P(result);
!         }
!
!         /*
!          * Otherwise, convert occurrences of the specified escape character to
!          * '\', and double occurrences of '\' --- unless they immediately
!          * follow an escape character!
!          */
!         afterescape = false;
!         while (plen > 0)
!         {
!             if (BYTEA_CHAREQ(p, e) && !afterescape)
!             {
!                 *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = true;
!             }
!             else if (*p == '\\')
!             {
!                 *r++ = '\\';
!                 if (!afterescape)
!                     *r++ = '\\';
!                 BYTEA_NextChar(p, plen);
!                 afterescape = false;
!             }
!             else
!             {
!                 BYTEA_CopyAdvChar(r, p, plen);
!                 afterescape = false;
!             }
!         }
!     }
!
!     SET_VARSIZE(result, r - ((char *) result));

!     PG_RETURN_BYTEA_P(result);
  }

- /*
-  * Same as above, but specifically for bytea (binary) datatype
-  */
- static int
- MatchBytea(char *t, int tlen, char *p, int plen)
- {
-     /* Fast path for match-everything pattern */
-     if ((plen == 1) && (*p == '%'))
-         return LIKE_TRUE;
-
-     while ((tlen > 0) && (plen > 0))
-     {
-         if (*p == '\\')
-         {
-             /* Next pattern char must match literally, whatever it is */
-             BYTEA_NextChar(p, plen);
-             if ((plen <= 0) || !BYTEA_CHAREQ(t, p))
-                 return LIKE_FALSE;
-         }
-         else if (*p == '%')
-         {
-             /* %% is the same as % according to the SQL standard */
-             /* Advance past all %'s */
-             while ((plen > 0) && (*p == '%'))
-                 BYTEA_NextChar(p, plen);
-             /* Trailing percent matches everything. */
-             if (plen <= 0)
-                 return LIKE_TRUE;
-
-             /*
-              * Otherwise, scan for a text position at which we can match the
-              * rest of the pattern.
-              */
-             while (tlen > 0)
-             {
-                 /*
-                  * Optimization to prevent most recursion: don't recurse
-                  * unless first pattern char might match this text char.
-                  */
-                 if (BYTEA_CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
-                 {
-                     int            matched = MatchBytea(t, tlen, p, plen);
-
-                     if (matched != LIKE_FALSE)
-                         return matched; /* TRUE or ABORT */
-                 }
-
-                 BYTEA_NextChar(t, tlen);
-             }
-
-             /*
-              * End of text with no match, so no point in trying later places
-              * to start matching this pattern.
-              */
-             return LIKE_ABORT;
-         }
-         else if ((*p != '_') && !BYTEA_CHAREQ(t, p))
-         {
-             /*
-              * Not the single-character wildcard and no explicit match? Then
-              * time to quit...
-              */
-             return LIKE_FALSE;
-         }
-
-         BYTEA_NextChar(t, tlen);
-         BYTEA_NextChar(p, plen);
-     }
-
-     if (tlen > 0)
-         return LIKE_FALSE;        /* end of pattern, but not of text */
-
-     /* End of input string.  Do we have matching pattern remaining? */
-     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
-                                          * pattern */
-         BYTEA_NextChar(p, plen);
-     if (plen <= 0)
-         return LIKE_TRUE;
-
-     /*
-      * End of text with no match, so no point in trying later places to start
-      * matching this pattern.
-      */
-     return LIKE_ABORT;
- }    /* MatchBytea() */
--- 349,356 ----
  {
      bytea       *pat = PG_GETARG_BYTEA_P(0);
      bytea       *esc = PG_GETARG_BYTEA_P(1);
!     bytea       *result = SB_do_like_escape((text *)pat, (text *)esc);

!     PG_RETURN_BYTEA_P((bytea *)result);
  }

Index: src/backend/utils/adt/like_match.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/utils/adt/like_match.c,v
retrieving revision 1.15
diff -c -r1.15 like_match.c
*** src/backend/utils/adt/like_match.c    27 Feb 2007 23:48:08 -0000    1.15
--- src/backend/utils/adt/like_match.c    21 May 2007 16:50:48 -0000
***************
*** 9,19 ****
   * Before the inclusion, we need to define following macros:
   *
   * CHAREQ
!  * ICHAREQ
   * NextChar
   * CopyAdvChar
   * MatchText (MBMatchText)
-  * MatchTextIC (MBMatchTextIC)
   * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
--- 9,18 ----
   * Before the inclusion, we need to define following macros:
   *
   * CHAREQ
!  * BYTEEQIC
   * NextChar
   * CopyAdvChar
   * MatchText (MBMatchText)
   * do_like_escape (MB_do_like_escape)
   *
   * Copyright (c) 1996-2007, PostgreSQL Global Development Group
***************
*** 81,89 ****
      {
          if (*p == '\\')
          {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !CHAREQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
--- 80,88 ----
      {
          if (*p == '\\')
          {
!             /* Next byte must match literally, whatever it is */
!             NextByte(p, plen);
!             if ((plen <= 0) || !BYTEEQ(t, p))
                  return LIKE_FALSE;
          }
          else if (*p == '%')
***************
*** 91,97 ****
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
--- 90,96 ----
              /* %% is the same as % according to the SQL standard */
              /* Advance past all %'s */
              while ((plen > 0) && (*p == '%'))
!                 NextByte(p, plen);
              /* Trailing percent matches everything. */
              if (plen <= 0)
                  return LIKE_TRUE;
***************
*** 100,206 ****
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (CHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
!                 }

!                 NextChar(t, tlen);
              }
!
!             /*
!              * End of text with no match, so no point in trying later places
!              * to start matching this pattern.
!              */
!             return LIKE_ABORT;
!         }
!         else if ((*p != '_') && !CHAREQ(t, p))
!         {
!             /*
!              * Not the single-character wildcard and no explicit match? Then
!              * time to quit...
!              */
!             return LIKE_FALSE;
!         }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
!     }
!
!     if (tlen > 0)
!         return LIKE_FALSE;        /* end of pattern, but not of text */
!
!     /* End of input string.  Do we have matching pattern remaining? */
!     while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
!                                          * pattern */
!         NextChar(p, plen);
!     if (plen <= 0)
!         return LIKE_TRUE;
!
!     /*
!      * End of text with no match, so no point in trying later places to start
!      * matching this pattern.
!      */
!     return LIKE_ABORT;
! }    /* MatchText() */
!
! /*
!  * Same as above, but ignore case
!  */
! static int
! MatchTextIC(char *t, int tlen, char *p, int plen)
! {
!     /* Fast path for match-everything pattern */
!     if ((plen == 1) && (*p == '%'))
!         return LIKE_TRUE;
!
!     while ((tlen > 0) && (plen > 0))
!     {
!         if (*p == '\\')
!         {
!             /* Next pattern char must match literally, whatever it is */
!             NextChar(p, plen);
!             if ((plen <= 0) || !ICHAREQ(t, p))
!                 return LIKE_FALSE;
!         }
!         else if (*p == '%')
!         {
!             /* %% is the same as % according to the SQL standard */
!             /* Advance past all %'s */
!             while ((plen > 0) && (*p == '%'))
!                 NextChar(p, plen);
!             /* Trailing percent matches everything. */
!             if (plen <= 0)
!                 return LIKE_TRUE;
!
!             /*
!              * Otherwise, scan for a text position at which we can match the
!              * rest of the pattern.
!              */
!             while (tlen > 0)
              {
!                 /*
!                  * Optimization to prevent most recursion: don't recurse
!                  * unless first pattern char might match this text char.
!                  */
!                 if (ICHAREQ(t, p) || (*p == '\\') || (*p == '_'))
                  {
!                     int            matched = MatchTextIC(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                         return matched; /* TRUE or ABORT */
                  }

!                 NextChar(t, tlen);
              }

              /*
--- 99,147 ----
               * Otherwise, scan for a text position at which we can match the
               * rest of the pattern.
               */
!             if (*p == '_')
              {
!                 while (tlen > 0)
                  {
                      int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */

!                     NextChar(t, tlen);
!                 }
              }
!             else if (*p == '\\')
              {
!                 while (tlen > 0)
                  {
!                     int            matched = MatchText(t, tlen, p, plen);
!
                      if (matched != LIKE_FALSE)
!                             return matched; /* TRUE or ABORT */
!
!                     NextByte(t, tlen);
                  }
+             }
+             else
+             {
+
+                 while (tlen > 0)
+                 {
+                     /*
+                      * Optimization to prevent most recursion: don't recurse
+                      * unless first pattern char matches this text char.
+                      */
+                     if (BYTEEQ(t, p))
+                     {
+                         int            matched = MatchText(t, tlen, p, plen);
+
+                         if (matched != LIKE_FALSE)
+                             return matched; /* TRUE or ABORT */
+                     }

!                     NextByte(t, tlen);
!                 }
              }

              /*
***************
*** 209,215 ****
               */
              return LIKE_ABORT;
          }
!         else if ((*p != '_') && !ICHAREQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
--- 150,162 ----
               */
              return LIKE_ABORT;
          }
!         else if (*p == '_')
!         {
!             NextChar(t, tlen);
!             NextByte(p, plen);
!             continue;
!         }
!         else if (!BYTEEQ(t, p))
          {
              /*
               * Not the single-character wildcard and no explicit match? Then
***************
*** 217,225 ****
               */
              return LIKE_FALSE;
          }
!
!         NextChar(t, tlen);
!         NextChar(p, plen);
      }

      if (tlen > 0)
--- 164,176 ----
               */
              return LIKE_FALSE;
          }
!         /*
!          * It is safe to use NextByte instead of NextChar here, even for
!          * multi-byte character sets, because we are not following
!          * immediately after a wildcard character.
!          */
!         NextByte(t, tlen);
!         NextByte(p, plen);
      }

      if (tlen > 0)
***************
*** 228,234 ****
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextChar(p, plen);
      if (plen <= 0)
          return LIKE_TRUE;

--- 179,186 ----
      /* End of input string.  Do we have matching pattern remaining? */
      while ((plen > 0) && (*p == '%'))    /* allow multiple %'s at end of
                                           * pattern */
!         NextByte(p, plen);
!
      if (plen <= 0)
          return LIKE_TRUE;

***************
*** 237,248 ****
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchTextIC() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
  static text *
  do_like_escape(text *pat, text *esc)
  {
--- 189,202 ----
       * matching this pattern.
       */
      return LIKE_ABORT;
! }    /* MatchText() */

  /*
   * like_escape() --- given a pattern and an ESCAPE string,
   * convert the pattern to use Postgres' standard backslash escape convention.
   */
+ #ifdef do_like_escape
+
  static text *
  do_like_escape(text *pat, text *esc)
  {
***************
*** 336,338 ****
--- 290,302 ----

      return result;
  }
+ #endif /* do_like_escape */
+
+ #undef CHAREQ
+ #undef NextChar
+ #undef CopyAdvChar
+ #undef MatchText
+
+ #ifdef do_like_escape
+ #undef do_like_escape
+ #endif

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Multibyte LIKE optimization

Вложения

Вложения

Вложения