Discussion: Notes about fixing regexes and UTF-8 (yet again)

Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
In bug #6457 it's pointed out that we *still* don't have full
functionality for locale-dependent regexp behavior with UTF8 encoding.
The reason is that there's old crufty code in regc_locale.c that only
considers character codes up to 255 when searching for characters that
should be considered "letters", "digits", etc.  We could fix that, for
some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
UTF8 encoding, but the time that would take is unappealing.  Especially
so considering that this code is executed afresh anytime we compile a
regex that requires locale knowledge.

I looked into the upstream Tcl code and observed that they deal with
this by having hard-wired tables of which Unicode code points are to be
considered letters etc.  The tables are directly traceable to the
Unicode standard (they provide a script to regenerate them from files
available from unicode.org).  Nonetheless, I do not find that approach
appealing, mainly because we'd be risking deviating from the libc locale
code's behavior within regexes when we follow it everywhere else.
It seems entirely likely to me that a particular locale setting might
consider only some of what Unicode says are letters to be letters.

However, we could possibly compromise by using Unicode-derived tables
as a guide to which code points are worth probing libc for.  That is,
assume that a utf8-based locale will never claim that some code is a
letter that unicode.org doesn't think is a letter.  That would cut the
number of required probes by a pretty large factor.
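
(A rough sketch of what that pruning could look like, reusing the regex
code's existing cvec/addchr helpers; the unicode_letter_ranges table is
hypothetical and would be script-generated from the unicode.org data
files, much as Tcl does for its tables.)

/* Code-point ranges that Unicode itself classifies as letters. */
typedef struct
{
    pg_wchar    first;
    pg_wchar    last;
} letter_range;

static const letter_range unicode_letter_ranges[] = {
    {0x0041, 0x005A}, {0x0061, 0x007A}, {0x00AA, 0x00AA},
    /* ... many more ranges, generated by script ... */
};

/* Probe libc only within the Unicode letter ranges, assuming a UTF-8
 * locale never calls something a letter that Unicode does not. */
static void
probe_letters(struct cvec *cv)
{
    int         i;
    pg_wchar    c;

    for (i = 0; i < lengthof(unicode_letter_ranges); i++)
    {
        for (c = unicode_letter_ranges[i].first;
             c <= unicode_letter_ranges[i].last;
             c++)
        {
            if (pg_wc_isalpha(c))   /* still ask libc; the table only prunes */
                addchr(cv, (chr) c);
        }
    }
}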

The other thing that seems worth doing is to install some caching.
We could presumably assume that the behavior of iswupper() et al are
fixed for the duration of a database session, so that we only need to
run the probe loop once when first asked to create a cvec for a
particular category.
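
(Roughly the shape such a cache might take; the names are invented,
nothing of this exists yet.)

/* Per-session cache of "all chrs satisfying pg_wc_isalpha" and friends,
 * keyed by the probe function.  The probe loop would run at most once
 * per category per session. */
typedef struct ctype_cache_entry
{
    int         (*probefunc) (pg_wchar c);     /* pg_wc_isalpha or a sibling */
    struct cvec cv;                             /* precomputed contents */
    struct ctype_cache_entry *next;
} ctype_cache_entry;

static ctype_cache_entry *ctype_cache = NULL;

static const struct cvec *
cached_cvec(int (*probefunc) (pg_wchar c))
{
    ctype_cache_entry *e;

    for (e = ctype_cache; e != NULL; e = e->next)
    {
        if (e->probefunc == probefunc)
            return &e->cv;      /* probe loop already run this session */
    }
    /* ... otherwise run the probe loop once, link the result in ... */
    return NULL;
}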

Thoughts, better ideas?
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Heikki Linnakangas
Date:
On 16.02.2012 01:06, Tom Lane wrote:
> In bug #6457 it's pointed out that we *still* don't have full
> functionality for locale-dependent regexp behavior with UTF8 encoding.
> The reason is that there's old crufty code in regc_locale.c that only
> considers character codes up to 255 when searching for characters that
> should be considered "letters", "digits", etc.  We could fix that, for
> some value of "fix", by iterating up to perhaps 0xFFFF when dealing with
> UTF8 encoding, but the time that would take is unappealing.  Especially
> so considering that this code is executed afresh anytime we compile a
> regex that requires locale knowledge.
>
> I looked into the upstream Tcl code and observed that they deal with
> this by having hard-wired tables of which Unicode code points are to be
> considered letters etc.  The tables are directly traceable to the
> Unicode standard (they provide a script to regenerate them from files
> available from unicode.org).  Nonetheless, I do not find that approach
> appealing, mainly because we'd be risking deviating from the libc locale
> code's behavior within regexes when we follow it everywhere else.
> It seems entirely likely to me that a particular locale setting might
> consider only some of what Unicode says are letters to be letters.
>
> However, we could possibly compromise by using Unicode-derived tables
> as a guide to which code points are worth probing libc for.  That is,
> assume that a utf8-based locale will never claim that some code is a
> letter that unicode.org doesn't think is a letter.  That would cut the
> number of required probes by a pretty large factor.
>
> The other thing that seems worth doing is to install some caching.
> We could presumably assume that the behavior of iswupper() et al are
> fixed for the duration of a database session, so that we only need to
> run the probe loop once when first asked to create a cvec for a
> particular category.
>
> Thoughts, better ideas?

Here's a wild idea: keep the class of each codepoint in a hash table. 
Initialize it with all codepoints up to 0xFFFF. After that, whenever a 
string contains a character that's not in the hash table yet, query the 
class of that character, and add it to the hash table. Then recompile 
the whole regex and restart the matching engine.

Recompiling is expensive, but if you cache the results for the session, 
it would probably be acceptable.
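
(Sketched with PostgreSQL's dynahash just to show the shape; the names
and the class bitmask are made up, and creating the hash table is
omitted.)

/* One entry per code point seen so far; a lookup miss means we have to
 * recompile the regex with the new information and restart matching. */
typedef struct
{
    pg_wchar    codepoint;      /* hash key */
    int         classes;        /* bitmask: 1 << CC_ALPHA, 1 << CC_DIGIT, ... */
} codepoint_class_entry;

static HTAB *class_cache = NULL;    /* created elsewhere with hash_create */

static int
lookup_classes(pg_wchar c, bool *missed)
{
    codepoint_class_entry *entry;
    bool        found;

    entry = hash_search(class_cache, &c, HASH_ENTER, &found);
    if (!found)
    {
        /* first time we see this code point: ask libc once */
        entry->classes = 0;
        if (pg_wc_isalpha(c))
            entry->classes |= 1 << CC_ALPHA;
        if (pg_wc_isdigit(c))
            entry->classes |= 1 << CC_DIGIT;
        /* ... and so on for the other classes ... */
    }
    *missed = !found;
    return entry->classes;
}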

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Here's a wild idea: keep the class of each codepoint in a hash table. 
> Initialize it with all codepoints up to 0xFFFF. After that, whenever a 
> string contains a character that's not in the hash table yet, query the 
> class of that character, and add it to the hash table. Then recompile 
> the whole regex and restart the matching engine.

> Recompiling is expensive, but if you cache the results for the session, 
> it would probably be acceptable.

Dunno ... recompiling is so expensive that I can't see this being a win;
not to mention that it would require fundamental surgery on the regex
code.

In the Tcl implementation, no codepoints above U+FFFF have any locale
properties (alpha/digit/punct/etc), period.  Personally I'd not have a
problem imposing the same limitation, so that dealing with stuff above
that range isn't really a consideration anyway.
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Andrew Dunstan
Date:

On 02/17/2012 09:39 AM, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> Here's a wild idea: keep the class of each codepoint in a hash table.
>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>> string contains a character that's not in the hash table yet, query the
>> class of that character, and add it to the hash table. Then recompile
>> the whole regex and restart the matching engine.
>> Recompiling is expensive, but if you cache the results for the session,
>> it would probably be acceptable.
> Dunno ... recompiling is so expensive that I can't see this being a win;
> not to mention that it would require fundamental surgery on the regex
> code.
>
> In the Tcl implementation, no codepoints above U+FFFF have any locale
> properties (alpha/digit/punct/etc), period.  Personally I'd not have a
> problem imposing the same limitation, so that dealing with stuff above
> that range isn't really a consideration anyway.


Up to U+FFFF is the BMP, which is described as containing "characters for
almost all modern languages, and a large number of special characters."
It seems very likely to be acceptable not to bother about the locale of
code points in the supplementary planes.

See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions 
of which sets of characters are involved.


cheers

andrew




Re: Notes about fixing regexes and UTF-8 (yet again)

From
Robert Haas
Date:
On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Here's a wild idea: keep the class of each codepoint in a hash table.
> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
> string contains a character that's not in the hash table yet, query the
> class of that character, and add it to the hash table. Then recompile the
> whole regex and restart the matching engine.
>
> Recompiling is expensive, but if you cache the results for the session, it
> would probably be acceptable.

What if you did this ONCE and wrote the results to a file someplace?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2012 at 3:48 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Recompiling is expensive, but if you cache the results for the session, it
>> would probably be acceptable.

> What if you did this ONCE and wrote the results to a file someplace?

That's still a cache, you've just defaulted on your obligation to think
about what conditions require the cache to be flushed.  (In the case at
hand, the trigger for a cache rebuild would probably need to be a glibc
package update, which we have no way of knowing about.)

Before going much further with this, we should probably do some timings
of 64K calls of iswupper and friends, just to see how bad a dumb
implementation will be.
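
(Something like this throwaway harness would do; standalone C, not
PostgreSQL code, and the locale name is only an example.)

#include <locale.h>
#include <stdio.h>
#include <sys/time.h>
#include <wctype.h>

int
main(void)
{
    struct timeval tv0, tv1;
    long        usec;
    long        matches = 0;
    wint_t      c;

    if (setlocale(LC_CTYPE, "en_US.utf8") == NULL)
    {
        fprintf(stderr, "locale not available\n");
        return 1;
    }

    gettimeofday(&tv0, NULL);
    for (c = 0; c <= 0xFFFF; c++)
    {
        if (iswalpha(c))        /* glibc's wchar_t is the Unicode code point */
            matches++;
    }
    gettimeofday(&tv1, NULL);

    usec = (tv1.tv_sec - tv0.tv_sec) * 1000000L + (tv1.tv_usec - tv0.tv_usec);
    printf("%ld alphas in %ld usec\n", matches, usec);
    return 0;
}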
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Robert Haas
Date:
On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What if you did this ONCE and wrote the results to a file someplace?
>
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed.

Yep.  Unfortunately, I don't have a good idea how to handle that; I
was hoping someone else did.

> Before going much further with this, we should probably do some timings
> of 64K calls of iswupper and friends, just to see how bad a dumb
> implementation will be.

Can't hurt.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 17, 2012 at 10:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Before going much further with this, we should probably do some timings
>> of 64K calls of iswupper and friends, just to see how bad a dumb
>> implementation will be.

> Can't hurt.

The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
pg_wc_isalpha or sibling functions requires a shade under 2ms.
So this definitely justifies caching the values to avoid computing
them more than once per session, but I'm not convinced there are
grounds for trying harder than that.

BTW, I am also a bit surprised to find out that this locale considers
48342 of those characters to satisfy isalpha().  Seems like a heck of
a lot.  But anyway we can forget my idea of trying to save work by
incorporating a-priori assumptions about which Unicode codepoints are
which --- it'll be faster to just iterate through them all, at least
for that case.  Maybe we should hard-wire some cases like digits, not
sure.
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
I wrote:
> The answer, on a reasonably new desktop machine (2.0GHz Xeon E5503)
> running Fedora 16 in en_US.utf8 locale, is that 64K iterations of
> pg_wc_isalpha or sibling functions requires a shade under 2ms.
> So this definitely justifies caching the values to avoid computing
> them more than once per session, but I'm not convinced there are
> grounds for trying harder than that.

And here's a poorly-tested draft patch for that.

            regards, tom lane


diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c
index fb6f06b5243f50bfad2cefa5c016d4e842791a3d..98f3c597678b492dd59afcd956e5cdfecdba4f86 100644
*** a/src/backend/regex/regc_cvec.c
--- b/src/backend/regex/regc_cvec.c
*************** static void
*** 77,82 ****
--- 77,83 ----
  addchr(struct cvec * cv,        /* character vector */
         chr c)                    /* character to add */
  {
+     assert(cv->nchrs < cv->chrspace);
      cv->chrs[cv->nchrs++] = (chr) c;
  }

diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c
index 6cf27958b1545a61fba01e76dc4d37aca32789dc..44ce582bdad1a7d830d4122cada45a39c188981c 100644
*** a/src/backend/regex/regc_locale.c
--- b/src/backend/regex/regc_locale.c
*************** static const struct cname
*** 351,356 ****
--- 351,366 ----


  /*
+  * We do not use the hard-wired Unicode classification tables that Tcl does.
+  * This is because (a) we need to deal with other encodings besides Unicode,
+  * and (b) we want to track the behavior of the libc locale routines as
+  * closely as possible.  For example, it wouldn't be unreasonable for a
+  * locale to not consider every Unicode letter as a letter.  So we build
+  * character classification cvecs by asking libc, even for Unicode.
+  */
+
+
+ /*
   * element - map collating-element name to celt
   */
  static celt
*************** cclass(struct vars * v,            /* context */
*** 498,503 ****
--- 508,514 ----
         int cases)                /* case-independent? */
  {
      size_t        len;
+     const struct cvec *ccv = NULL;
      struct cvec *cv = NULL;
      const char * const *namePtr;
      int            i,
*************** cclass(struct vars * v,            /* context */
*** 549,626 ****

      /*
       * Now compute the character class contents.
-      *
-      * For the moment, assume that only char codes < 256 can be in these
-      * classes.
       */

      switch ((enum classes) index)
      {
          case CC_PRINT:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isprint((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_ALNUM:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isalnum((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_ALPHA:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isalpha((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_ASCII:
              cv = getcvec(v, 0, 1);
              if (cv)
                  addrange(cv, 0, 0x7f);
              break;
          case CC_BLANK:
              cv = getcvec(v, 2, 0);
              addchr(cv, '\t');
              addchr(cv, ' ');
              break;
          case CC_CNTRL:
              cv = getcvec(v, 0, 2);
              addrange(cv, 0x0, 0x1f);
              addrange(cv, 0x7f, 0x9f);
              break;
          case CC_DIGIT:
!             cv = getcvec(v, 0, 1);
!             if (cv)
!                 addrange(cv, (chr) '0', (chr) '9');
              break;
          case CC_PUNCT:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_ispunct((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_XDIGIT:
              cv = getcvec(v, 0, 3);
              if (cv)
              {
--- 560,608 ----

      /*
       * Now compute the character class contents.
       */

      switch ((enum classes) index)
      {
          case CC_PRINT:
!             ccv = pg_ctype_get_cache(pg_wc_isprint);
              break;
          case CC_ALNUM:
!             ccv = pg_ctype_get_cache(pg_wc_isalnum);
              break;
          case CC_ALPHA:
!             ccv = pg_ctype_get_cache(pg_wc_isalpha);
              break;
          case CC_ASCII:
+             /* hard-wired meaning */
              cv = getcvec(v, 0, 1);
              if (cv)
                  addrange(cv, 0, 0x7f);
              break;
          case CC_BLANK:
+             /* hard-wired meaning */
              cv = getcvec(v, 2, 0);
              addchr(cv, '\t');
              addchr(cv, ' ');
              break;
          case CC_CNTRL:
+             /* hard-wired meaning */
              cv = getcvec(v, 0, 2);
              addrange(cv, 0x0, 0x1f);
              addrange(cv, 0x7f, 0x9f);
              break;
          case CC_DIGIT:
!             ccv = pg_ctype_get_cache(pg_wc_isdigit);
              break;
          case CC_PUNCT:
!             ccv = pg_ctype_get_cache(pg_wc_ispunct);
              break;
          case CC_XDIGIT:
+             /*
+              * It's not clear how to define this in non-western locales, and
+              * even less clear that there's any particular use in trying.
+              * So just hard-wire the meaning.
+              */
              cv = getcvec(v, 0, 3);
              if (cv)
              {
*************** cclass(struct vars * v,            /* context */
*** 630,679 ****
              }
              break;
          case CC_SPACE:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isspace((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_LOWER:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_islower((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_UPPER:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isupper((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
          case CC_GRAPH:
!             cv = getcvec(v, UCHAR_MAX, 0);
!             if (cv)
!             {
!                 for (i = 0; i <= UCHAR_MAX; i++)
!                 {
!                     if (pg_wc_isgraph((chr) i))
!                         addchr(cv, (chr) i);
!                 }
!             }
              break;
      }
      if (cv == NULL)
          ERR(REG_ESPACE);
      return cv;
--- 612,643 ----
              }
              break;
          case CC_SPACE:
!             ccv = pg_ctype_get_cache(pg_wc_isspace);
              break;
          case CC_LOWER:
!             ccv = pg_ctype_get_cache(pg_wc_islower);
              break;
          case CC_UPPER:
!             ccv = pg_ctype_get_cache(pg_wc_isupper);
              break;
          case CC_GRAPH:
!             ccv = pg_ctype_get_cache(pg_wc_isgraph);
              break;
      }
+
+     /* If we got a cached cvec, copy it to a freeable one. */
+     if (ccv != NULL)
+     {
+         cv = getcvec(v, ccv->nchrs, ccv->nranges);
+         if (cv)
+         {
+             cv->nchrs = ccv->nchrs;
+             memcpy(cv->chrs, ccv->chrs, cv->nchrs * sizeof(chr));
+             cv->nranges = ccv->nranges;
+             memcpy(cv->ranges, ccv->ranges, cv->nranges * sizeof(chr) * 2);
+         }
+     }
+
      if (cv == NULL)
          ERR(REG_ESPACE);
      return cv;
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 7c010e372858065bedbc47382a84cb3650e03efa..9ea7a81b176de59ed989238b1849672108374d97 100644
*** a/src/backend/regex/regc_pg_locale.c
--- b/src/backend/regex/regc_pg_locale.c
***************
*** 1,7 ****
  /*-------------------------------------------------------------------------
   *
   * regc_pg_locale.c
!  *      ctype functions adapted to work on pg_wchar (a/k/a chr)
   *
   * This file is #included by regcomp.c; it's not meant to compile standalone.
   *
--- 1,8 ----
  /*-------------------------------------------------------------------------
   *
   * regc_pg_locale.c
!  *      ctype functions adapted to work on pg_wchar (a/k/a chr),
!  *      and functions to cache the results of wholesale ctype probing.
   *
   * This file is #included by regcomp.c; it's not meant to compile standalone.
   *
*************** typedef enum
*** 72,77 ****
--- 73,79 ----

  static PG_Locale_Strategy pg_regex_strategy;
  static pg_locale_t pg_regex_locale;
+ static Oid    pg_regex_collation;

  /*
   * Hard-wired character properties for C locale
*************** pg_set_regex_collation(Oid collation)
*** 233,238 ****
--- 235,241 ----
          /* C/POSIX collations use this path regardless of database encoding */
          pg_regex_strategy = PG_REGEX_LOCALE_C;
          pg_regex_locale = 0;
+         pg_regex_collation = C_COLLATION_OID;
      }
      else
      {
*************** pg_set_regex_collation(Oid collation)
*** 275,280 ****
--- 278,285 ----
              else
                  pg_regex_strategy = PG_REGEX_LOCALE_1BYTE;
          }
+
+         pg_regex_collation = collation;
      }
  }

*************** pg_wc_tolower(pg_wchar c)
*** 656,658 ****
--- 661,873 ----
      }
      return 0;                    /* can't get here, but keep compiler quiet */
  }
+
+
+ /*
+  * These functions cache the results of probing libc's ctype behavior for
+  * all character codes of interest in a given encoding/collation.  The
+  * result is provided as a "struct cvec", but notice that the representation
+  * is a touch different from a cvec created by regc_cvec.c: we allocate the
+  * chrs[] and ranges[] arrays separately from the struct so that we can
+  * realloc them larger at need.  This is okay since the cvecs made here
+  * should never be freed by freecvec().
+  *
+  * We use malloc not palloc since we mustn't lose control on out-of-memory;
+  * the main regex code expects us to return a failure indication instead.
+  */
+
+ typedef int (*pg_wc_probefunc) (pg_wchar c);
+
+ typedef struct pg_ctype_cache
+ {
+     pg_wc_probefunc probefunc;        /* pg_wc_isalpha or a sibling */
+     Oid            collation;            /* collation this entry is for */
+     struct cvec cv;                    /* cache entry contents */
+     struct pg_ctype_cache *next;    /* chain link */
+ } pg_ctype_cache;
+
+ static pg_ctype_cache *pg_ctype_cache_list = NULL;
+
+ /*
+  * Add a chr or range to pcc->cv; return false if run out of memory
+  */
+ static bool
+ store_match(pg_ctype_cache *pcc, pg_wchar chr1, int nchrs)
+ {
+     chr           *newchrs;
+
+     if (nchrs > 1)
+     {
+         if (pcc->cv.nranges >= pcc->cv.rangespace)
+         {
+             pcc->cv.rangespace *= 2;
+             newchrs = (chr *) realloc(pcc->cv.ranges,
+                                       pcc->cv.rangespace * sizeof(chr) * 2);
+             if (newchrs == NULL)
+                 return false;
+             pcc->cv.ranges = newchrs;
+         }
+         pcc->cv.ranges[pcc->cv.nranges * 2] = chr1;
+         pcc->cv.ranges[pcc->cv.nranges * 2 + 1] = chr1 + nchrs - 1;
+         pcc->cv.nranges++;
+     }
+     else
+     {
+         assert(nchrs == 1);
+         if (pcc->cv.nchrs >= pcc->cv.chrspace)
+         {
+             pcc->cv.chrspace *= 2;
+             newchrs = (chr *) realloc(pcc->cv.chrs,
+                                       pcc->cv.chrspace * sizeof(chr));
+             if (newchrs == NULL)
+                 return false;
+             pcc->cv.chrs = newchrs;
+         }
+         pcc->cv.chrs[pcc->cv.nchrs++] = chr1;
+     }
+     return true;
+ }
+
+ /*
+  * Given a probe function (e.g., pg_wc_isalpha) get a struct cvec for all
+  * chrs satisfying the probe function.  The active collation is the one
+  * previously set by pg_set_regex_collation.  Returns NULL if out of memory.
+  *
+  * Note that the result must NOT be freed or modified by caller.
+  */
+ static const struct cvec *
+ pg_ctype_get_cache(pg_wc_probefunc probefunc)
+ {
+     pg_ctype_cache *pcc;
+     pg_wchar    max_chr;
+     pg_wchar    cur_chr;
+     int            seq;
+     chr           *newchrs;
+
+     /*
+      * Do we already have the answer cached?
+      */
+     for (pcc = pg_ctype_cache_list; pcc != NULL; pcc = pcc->next)
+     {
+         if (pcc->probefunc == probefunc &&
+             pcc->collation == pg_regex_collation)
+             return &pcc->cv;
+     }
+
+     /*
+      * Nope, so initialize some workspace ...
+      */
+     pcc = (pg_ctype_cache *) malloc(sizeof(pg_ctype_cache));
+     if (pcc == NULL)
+         return NULL;
+     pcc->probefunc = probefunc;
+     pcc->collation = pg_regex_collation;
+     pcc->cv.nchrs = 0;
+     pcc->cv.chrspace = 256;
+     pcc->cv.chrs = (chr *) malloc(pcc->cv.chrspace * sizeof(chr));
+     pcc->cv.nranges = 0;
+     pcc->cv.rangespace = 256;
+     pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
+     if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
+         goto out_of_memory;
+
+     /*
+      * Decide how many character codes we ought to look through.  For C locale
+      * there's no need to go further than 127.  Otherwise, if the encoding is
+      * UTF8 go up to 0xFFFF (the end of the Basic Multilingual Plane).
+      * Otherwise, go up to 255.  These limits are interrelated with
+      * restrictions discussed at the head of this file.
+      */
+     switch (pg_regex_strategy)
+     {
+         case PG_REGEX_LOCALE_C:
+             max_chr = (pg_wchar) 127;
+             break;
+         case PG_REGEX_LOCALE_WIDE:
+         case PG_REGEX_LOCALE_WIDE_L:
+             max_chr = (pg_wchar) 0xFFFF;
+             break;
+         case PG_REGEX_LOCALE_1BYTE:
+         case PG_REGEX_LOCALE_1BYTE_L:
+             max_chr = (pg_wchar) UCHAR_MAX;
+             break;
+         default:
+             max_chr = 0;        /* can't get here, but keep compiler quiet */
+             break;
+     }
+
+     /*
+      * And scan 'em ...
+      */
+     seq = 0;                    /* number of consecutive matches */
+
+     for (cur_chr = 0; cur_chr <= max_chr; cur_chr++)
+     {
+         if ((*probefunc) (cur_chr))
+             seq++;
+         else if (seq > 0)
+         {
+             if (!store_match(pcc, cur_chr - seq, seq))
+                 goto out_of_memory;
+             seq = 0;
+         }
+     }
+
+     if (seq > 0)
+         if (!store_match(pcc, cur_chr - seq, seq))
+             goto out_of_memory;
+
+     /*
+      * We might have allocated more memory than needed, if so free it
+      */
+     if (pcc->cv.nchrs == 0)
+     {
+         free(pcc->cv.chrs);
+         pcc->cv.chrs = NULL;
+         pcc->cv.chrspace = 0;
+     }
+     else if (pcc->cv.nchrs < pcc->cv.chrspace)
+     {
+         newchrs = (chr *) realloc(pcc->cv.chrs,
+                                   pcc->cv.nchrs * sizeof(chr));
+         if (newchrs == NULL)
+             goto out_of_memory;
+         pcc->cv.chrs = newchrs;
+         pcc->cv.chrspace = pcc->cv.nchrs;
+     }
+     if (pcc->cv.nranges == 0)
+     {
+         free(pcc->cv.ranges);
+         pcc->cv.ranges = NULL;
+         pcc->cv.rangespace = 0;
+     }
+     else if (pcc->cv.nranges < pcc->cv.rangespace)
+     {
+         newchrs = (chr *) realloc(pcc->cv.ranges,
+                                   pcc->cv.nranges * sizeof(chr) * 2);
+         if (newchrs == NULL)
+             goto out_of_memory;
+         pcc->cv.ranges = newchrs;
+         pcc->cv.rangespace = pcc->cv.nranges;
+     }
+
+     /*
+      * Success, link it into cache chain
+      */
+     pcc->next = pg_ctype_cache_list;
+     pg_ctype_cache_list = pcc;
+
+     return &pcc->cv;
+
+     /*
+      * Failure, clean up
+      */
+ out_of_memory:
+     if (pcc->cv.chrs)
+         free(pcc->cv.chrs);
+     if (pcc->cv.ranges)
+         free(pcc->cv.ranges);
+     free(pcc);
+
+     return NULL;
+ }

Re: Notes about fixing regexes and UTF-8 (yet again)

From
NISHIYAMA Tomoaki
Date:
I don't believe it is valid to ignore CJK characters above U+20000.
If they are used in names, they will be stored in the database.
If the behaviour differs from that of characters below U+FFFF, you will
get a bug report sooner or later.

see
CJK Extension B, C, and D
from
http://www.unicode.org/charts/

Also, there are some code points that could be regarded as letters and numbers:
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols

On the other hand, it is ok if processing of characters above U+10000 is very slow,
as long as they are processed properly, because such characters are rare.


On 2012/02/17, at 23:56, Andrew Dunstan wrote:

>
>
> On 02/17/2012 09:39 AM, Tom Lane wrote:
>> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>>> Here's a wild idea: keep the class of each codepoint in a hash table.
>>> Initialize it with all codepoints up to 0xFFFF. After that, whenever a
>>> string contains a character that's not in the hash table yet, query the
>>> class of that character, and add it to the hash table. Then recompile
>>> the whole regex and restart the matching engine.
>>> Recompiling is expensive, but if you cache the results for the session,
>>> it would probably be acceptable.
>> Dunno ... recompiling is so expensive that I can't see this being a win;
>> not to mention that it would require fundamental surgery on the regex
>> code.
>>
>> In the Tcl implementation, no codepoints above U+FFFF have any locale
>> properties (alpha/digit/punct/etc), period.  Personally I'd not have a
>> problem imposing the same limitation, so that dealing with stuff above
>> that range isn't really a consideration anyway.
>
>
> Up to U+FFFF is the BMP, which is described as containing "characters for
> almost all modern languages, and a large number of special characters."
> It seems very likely to be acceptable not to bother about the locale of
> code points in the supplementary planes.
>
> See <http://en.wikipedia.org/wiki/Plane_%28Unicode%29> for descriptions of which sets of characters are involved.
>
>
> cheers
>
> andrew
>



Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
NISHIYAMA Tomoaki <tomoakin@staff.kanazawa-u.ac.jp> writes:
> I don't believe it is valid to ignore CJK characters above U+20000.
> If they are used in names, they will be stored in the database.
> If the behaviour differs from that of characters below U+FFFF, you will
> get a bug report sooner or later.

I am skeptical that there is enough usage of such things to justify
slowing regexp operations down for everybody.  Note that it's not only
the initial probe of libc behavior that's at stake here --- the more
character codes are treated as letters, the larger the DFA transition
maps get and the more time it takes to build them.  So I'm unexcited
about just cranking up the loop limit in pg_ctype_get_cache.

> On the other hand, it is ok if processing of characters above U+10000
> is very slow, as long as they are processed properly, because such
> characters are rare.

Yeah, it's conceivable that we could implement something whereby
characters with codes above some cutoff point are handled via runtime
calls to iswalpha() and friends, rather than being included in the
statically-constructed DFA maps.  The cutoff point could likely be a lot
less than U+FFFF, too, thereby saving storage and map build time all
round.
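
(A sketch of the shape that could take; MAX_SIMPLE_CHR and the hook into
the match engine are hypothetical here.)

/* Character classes are expanded into DFA arcs only up to this point;
 * larger code points would fall through to a runtime probe. */
#define MAX_SIMPLE_CHR  0x7FF

/* Called by the match engine for a chr above MAX_SIMPLE_CHR that has no
 * explicit arc; "cls" identifies the class the arc was built from. */
static bool
high_chr_in_class(pg_wchar c, enum classes cls)
{
    switch (cls)
    {
        case CC_ALPHA:
            return pg_wc_isalpha(c) != 0;
        case CC_DIGIT:
            return pg_wc_isdigit(c) != 0;
        case CC_SPACE:
            return pg_wc_isspace(c) != 0;
        default:
            return false;       /* extend to the other classes as needed */
    }
}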

However, that "we" above is the editorial "we".  *I* am not going to
do this.  Somebody who actually has a need for it should step up.
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Dimitri Fontaine
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

It's been proposed to build a “regexp” type in PostgreSQL which would
store the DFA directly and provide some way to run that DFA out of its
“storage” without recompiling.

Would such a mechanism be useful here?  Would it be useful only when
storing the regexp in a column somewhere then applying it in the query
from there (so most probably adding a join or subquery somewhere)?

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Dimitri Fontaine <dimitri@2ndQuadrant.fr> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.

> It's been proposed to build a “regexp” type in PostgreSQL which would
> store the DFA directly and provides some way to run that DFA out of its
> “storage” without recompiling.

> Would such a mechanism be useful here?

No, this is about what goes into the DFA representation in the first
place, not about how we store it and reuse it.
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
I wrote:
> And here's a poorly-tested draft patch for that.

I've done some more testing now, and am satisfied that this works as
intended.  However, some crude performance testing suggests that people
might be annoyed with it.  As an example, in 9.1 with pl_PL.utf8 locale,
I see this:

select 'aaaaaaaaaa' ~ '\w\w\w\w\w\w\w\w\w\w\w';

taking perhaps 0.75 ms on first execution and 0.4 ms on subsequent
executions, the difference being the time needed to compile and cache
the DFA representation of the regexp.  With the patch, the numbers are
more like 5 ms and 0.4 ms, meaning the compilation time has gone up by
something near a factor of 10, though AFAICT execution time hasn't
moved.  It's hard to tell how significant that would be to real-world
queries, but in the worst case where our caching of regexps doesn't help
much, it could be disastrous.

All of the extra time is in manipulation of the much larger number of
DFA arcs required to represent all the additional character codes that
are being considered to be letters.

Perhaps I'm being overly ASCII-centric, but I'm afraid to commit this
as-is; I think the number of people who are hurt by the performance
degradation will be greatly larger than the number who are glad because
characters in $random_alphabet are now seen to be letters.  I think an
actually workable solution will require something like what I speculated
about earlier:

> Yeah, it's conceivable that we could implement something whereby
> characters with codes above some cutoff point are handled via runtime
> calls to iswalpha() and friends, rather than being included in the
> statically-constructed DFA maps.  The cutoff point could likely be a lot
> less than U+FFFF, too, thereby saving storage and map build time all
> round.

In the meantime, I still think the caching logic is worth having, and
we could at least make some people happy if we selected a cutoff point
somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
what a good compromise cutoff would be.  One possibility is U+7FF, which
corresponds to the limit of what fits in 2-byte UTF8; but I don't know
if that corresponds to any significant dropoff in frequency of usage.
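
(For reference, the UTF-8 length boundaries a cutoff could line up with;
a trivial standalone sketch, nothing PostgreSQL-specific.)

/* Encoded length in bytes of a Unicode code point in UTF-8.  The natural
 * cutoff candidates are the points where this steps up. */
static int
utf8_len(unsigned int c)
{
    if (c <= 0x7F)
        return 1;               /* ASCII */
    if (c <= 0x7FF)
        return 2;               /* Latin supplements, Greek, Cyrillic, ... */
    if (c <= 0xFFFF)
        return 3;               /* rest of the Basic Multilingual Plane */
    return 4;                   /* supplementary planes */
}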
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Robert Haas
Date:
On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, it's conceivable that we could implement something whereby
>> characters with codes above some cutoff point are handled via runtime
>> calls to iswalpha() and friends, rather than being included in the
>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>> less than U+FFFF, too, thereby saving storage and map build time all
>> round.
>
> In the meantime, I still think the caching logic is worth having, and
> we could at least make some people happy if we selected a cutoff point
> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
> what a good compromise cutoff would be.  One possibility is U+7FF, which
> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
> if that corresponds to any significant dropoff in frequency of usage.

The problem, of course, is that this probably depends quite a bit on
what language you happen to be using.  For some languages, it won't
matter whether you cut it off at U+FF or U+7FF; while for others even
U+FFFF might not be enough.  So I think this is one of those cases
where it's somewhat meaningless to talk about frequency of usage.

In theory you can imagine a regular expression engine where these
decisions can be postponed until we see the string we're matching
against.  IOW, your DFA ends up with state transitions for characters
specifically named, plus a state transition for "anything else that's
a letter", plus a state transition for "anything else not otherwise
specified".  Then you only need to test the letters that actually
appear in the target string, rather than all of the ones that might
appear there.

But implementing that could be quite a lot of work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Vik Reykja
Date:
On Sun, Feb 19, 2012 at 04:33, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 7:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Yeah, it's conceivable that we could implement something whereby
>>> characters with codes above some cutoff point are handled via runtime
>>> calls to iswalpha() and friends, rather than being included in the
>>> statically-constructed DFA maps.  The cutoff point could likely be a lot
>>> less than U+FFFF, too, thereby saving storage and map build time all
>>> round.
>>
>> In the meantime, I still think the caching logic is worth having, and
>> we could at least make some people happy if we selected a cutoff point
>> somewhere between U+FF and U+FFFF.  I don't have any strong ideas about
>> what a good compromise cutoff would be.  One possibility is U+7FF, which
>> corresponds to the limit of what fits in 2-byte UTF8; but I don't know
>> if that corresponds to any significant dropoff in frequency of usage.
>
> The problem, of course, is that this probably depends quite a bit on
> what language you happen to be using.  For some languages, it won't
> matter whether you cut it off at U+FF or U+7FF; while for others even
> U+FFFF might not be enough.  So I think this is one of those cases
> where it's somewhat meaningless to talk about frequency of usage.

Does it make sense for regexps to have collations?

Re: Notes about fixing regexes and UTF-8 (yet again)

From
Robert Haas
Date:
On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote:
> Does it make sense for regexps to have collations?

As I understand it, collations determine the sort-ordering of strings.
Regular expressions don't care about that.  Why do you ask?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Vik Reykja
Date:
On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote:
>> Does it make sense for regexps to have collations?
>
> As I understand it, collations determine the sort-ordering of strings.
> Regular expressions don't care about that.  Why do you ask?

Perhaps I used the wrong term, but I was thinking the locale could tell
us what alphabet we're dealing with. So a regexp using en_US would give
different word-boundary results from one using zh_CN.

Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> In theory you can imagine a regular expression engine where these
> decisions can be postponed until we see the string we're matching
> against.  IOW, your DFA ends up with state transitions for characters
> specifically named, plus a state transition for "anything else that's
> a letter", plus a state transition for "anything else not otherwise
> specified".  Then you only need to test the letters that actually
> appear in the target string, rather than all of the ones that might
> appear there.

> But implementing that could be quite a lot of work.

Yeah, not to mention slow.  The difficulty is overlapping sets of
characters.  As a simple example, if your regex refers to 3, 7,
[[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
"colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
that aren't any of the preceding.  And state transitions for the digit
and alnum cases had better mention all and only the correct colors.
I've been tracing through the logic this evening, and it works pretty
simply given that all named character classes are immediately expanded
out to their component characters.  If we are going to try to keep
the classes in some kind of symbolic form, it's a lot messier.  In
particular, I think your sketch above would lead to having to test
every character against iswdigit and iswalnum at runtime, which would
be disastrous performancewise.  I'd like to at least avoid that for the
shorter (and presumably more common) UTF8 codes.
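
(Spelling out the partition for that example; only a sketch, since the
real engine computes the colors incrementally while parsing the regex.)

/* Colors for a regex mentioning 3, 7, [[:digit:]], X and [[:alnum:]].
 * Every chr maps to exactly one color, and the arcs built for the
 * [[:digit:]] and [[:alnum:]] atoms must list every color their class
 * contains. */
typedef enum
{
    COLOR_3,                    /* just '3' */
    COLOR_7,                    /* just '7' */
    COLOR_X,                    /* just 'X' */
    COLOR_OTHER_DIGIT,          /* digits other than 3 and 7 */
    COLOR_OTHER_ALNUM,          /* alphanumerics not covered above */
    COLOR_REST                  /* everything else */
} example_color;

static example_color
color_of(pg_wchar c)
{
    if (c == '3')
        return COLOR_3;
    if (c == '7')
        return COLOR_7;
    if (c == 'X')
        return COLOR_X;
    if (pg_wc_isdigit(c))
        return COLOR_OTHER_DIGIT;
    if (pg_wc_isalnum(c))
        return COLOR_OTHER_ALNUM;
    return COLOR_REST;
}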
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Vik Reykja <vikreykja@gmail.com> writes:
> On Sun, Feb 19, 2012 at 05:03, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sat, Feb 18, 2012 at 10:38 PM, Vik Reykja <vikreykja@gmail.com> wrote:
>>> Does it make sense for regexps to have collations?

>> As I understand it, collations determine the sort-ordering of strings.
>> Regular expressions don't care about that.  Why do you ask?

> Perhaps I used the wrong term, but I was thinking the locale could tell us
> what alphabet we're dealing with. So a regexp using en_US would give
> different word-boundary results from one using zh_CN.

Our interpretation of a "collation" is that it sets both LC_COLLATE and
LC_CTYPE.  Regexps may not care about the first but they definitely care
about the second.  This is why the stuff in regc_pg_locale.c pays
attention to collation.
        regards, tom lane


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Robert Haas
Date:
On Sat, Feb 18, 2012 at 11:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> In theory you can imagine a regular expression engine where these
>> decisions can be postponed until we see the string we're matching
>> against.  IOW, your DFA ends up with state transitions for characters
>> specifically named, plus a state transition for "anything else that's
>> a letter", plus a state transition for "anything else not otherwise
>> specified".  Then you only need to test the letters that actually
>> appear in the target string, rather than all of the ones that might
>> appear there.
>
>> But implementing that could be quite a lot of work.
>
> Yeah, not to mention slow.  The difficulty is overlapping sets of
> characters.  As a simple example, if your regex refers to 3, 7,
> [[:digit:]], X, and [[:alnum:]], then you end up needing five distinct
> "colors": 3, 7, X, all digits that aren't 3 or 7, all alphanumerics
> that aren't any of the preceding.  And state transitions for the digit
> and alnum cases had better mention all and only the correct colors.

Yeah, that's unfortunate.  On the other hand, if you don't use colors
for this case, aren't you going to need, for each DFA state, a
gigantic lookup table that includes every character in the server
encoding?  Even if you've got plenty of memory, initializing such a
beast seems awfully expensive, and it might not do very good things
for cache locality, either.

> I've been tracing through the logic this evening, and it works pretty
> simply given that all named character classes are immediately expanded
> out to their component characters.  If we are going to try to keep
> the classes in some kind of symbolic form, it's a lot messier.  In
> particular, I think your sketch above would lead to having to test
> every character against iswdigit and iswalnum at runtime, which would
> be disastrous performancewise.  I'd like to at least avoid that for the
> shorter (and presumably more common) UTF8 codes.

Hmm, but you could cache that information.  Instead of building a
cache that covers every possible character that might appear in the
target string, you can just cache the results for the code points that
you actually see.

Yet another option would be to dictate that the cache can't have holes - it
will always include information for every code point from 0 up to some
value X.  If we see a code point in the target string which is greater
than X, then we extend the cache out as far as that code point.  That
way, people who are using only code points out to U+FF (or even U+7F)
don't pay the cost of building a large cache, but people who need it
can get correct behavior.
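
(A sketch of that no-holes cache; the names and the bitmask layout are
made up.)

/* Classification cache with no holes: class_bits[0..cache_limit] is
 * always fully populated.  Seeing a larger code point in the target
 * string grows the cache up to that point. */
static unsigned char *class_bits = NULL;    /* one bitmask per code point */
static pg_wchar cache_limit = 0;            /* highest code point cached */
static bool cache_valid = false;

static bool
extend_class_cache(pg_wchar upto)
{
    pg_wchar    start = cache_valid ? cache_limit + 1 : 0;
    pg_wchar    c;
    unsigned char *newbits;

    if (cache_valid && upto <= cache_limit)
        return true;            /* already covered */

    newbits = realloc(class_bits, (size_t) upto + 1);
    if (newbits == NULL)
        return false;           /* let the caller report out-of-memory */
    class_bits = newbits;

    /* probe libc only for the code points not seen before */
    for (c = start; c <= upto; c++)
    {
        class_bits[c] = 0;
        if (pg_wc_isalpha(c))
            class_bits[c] |= 1;
        if (pg_wc_isdigit(c))
            class_bits[c] |= 2;
        /* ... more class bits ... */
    }

    cache_limit = upto;
    cache_valid = true;
    return true;
}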

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Notes about fixing regexes and UTF-8 (yet again)

From
Peter Eisentraut
Date:
On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
> > What if you did this ONCE and wrote the results to a file someplace?
> 
> That's still a cache, you've just defaulted on your obligation to think
> about what conditions require the cache to be flushed.  (In the case at
> hand, the trigger for a cache rebuild would probably need to be a glibc
> package update, which we have no way of knowing about.) 

We basically hardwire locale behavior at initdb time, so computing this
then and storing it somewhere for eternity would be consistent.



Re: Notes about fixing regexes and UTF-8 (yet again)

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On fre, 2012-02-17 at 10:19 -0500, Tom Lane wrote:
>> That's still a cache, you've just defaulted on your obligation to think
>> about what conditions require the cache to be flushed.  (In the case at
>> hand, the trigger for a cache rebuild would probably need to be a glibc
>> package update, which we have no way of knowing about.) 

> We basically hardwire locale behavior at initdb time, so computing this
> then and storing it somewhere for eternity would be consistent.

Well, only if we could cache every locale-related libc inquiry we ever
make.  Locking down only part of the behavior does not sound like a
plan.
        regards, tom lane