Обсуждение: Better locale-specific-character-class handling for regexps

Поиск

Список

Период

Сортировка

Better locale-specific-character-class handling for regexps

От

Tom Lane

Дата:

23 августа 2016 г., 00:55:06

I got tired of hearing complaints about the issue described in
this thread:
https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us

Here's a proposed fix.  I've not done extensive performance testing,
but it seems to be as fast or faster than the old code in cases where
there are not too many "large" characters in the input.  And, more
to the point, it gets the right answer for such large characters.

I'll add this to the upcoming commitfest.

            regards, tom lane

diff --git a/src/backend/regex/README b/src/backend/regex/README
index 6c9f483..b4a7ad7 100644
*** a/src/backend/regex/README
--- b/src/backend/regex/README
*************** and similarly additional source files re
*** 27,39 ****
  regexec.  This was done to avoid exposing internal symbols globally;
  all functions not meant to be part of the library API are static.

! (Actually the above is a lie in one respect: there is one more global
! symbol, pg_set_regex_collation in regcomp.  It is not meant to be part of
! the API, but it has to be global because both regcomp and regexec call it.
! It'd be better to get rid of that, as well as the static variables it
! sets, in favor of keeping the needed locale state in the regex structs.
! We have not done this yet for lack of a design for how to add
! application-specific state to the structs.)

  What's where in src/backend/regex/:

--- 27,40 ----
  regexec.  This was done to avoid exposing internal symbols globally;
  all functions not meant to be part of the library API are static.

! (Actually the above is a lie in one respect: there are two more global
! symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp.  These are
! not meant to be part of the API, but they have to be global because both
! regcomp and regexec call them.  It'd be better to get rid of
! pg_set_regex_collation, as well as the static variables it sets, in favor of
! keeping the needed locale state in the regex structs.  We have not done this
! yet for lack of a design for how to add application-specific state to the
! structs.)

  What's where in src/backend/regex/:

*************** colors:
*** 274,301 ****
    an existing color has to be subdivided.

  The last two of these are handled with the "struct colordesc" array and
! the "colorchain" links in NFA arc structs.  The color map proper (that
! is, the per-character lookup array) is handled as a multi-level tree,
! with each tree level indexed by one byte of a character's value.  The
! code arranges to not have more than one copy of bottom-level tree pages
! that are all-the-same-color.

! Unfortunately, this design does not seem terribly efficient for common
! cases such as a tree in which all Unicode letters are colored the same,
! because there aren't that many places where we get a whole page all the
! same color, except at the end of the map.  (It also strikes me that given
! PG's current restrictions on the range of Unicode values, we could use a
! 3-level rather than 4-level tree; but there's not provision for that in
! regguts.h at the moment.)

! A bigger problem is that it just doesn't seem very reasonable to have to
! consider each Unicode letter separately at regex parse time for a regex
! such as "\w"; more than likely, a huge percentage of those codes will
! never be seen at runtime.  We need to fix things so that locale-based
! character classes are somehow processed "symbolically" without making a
! full expansion of their contents at parse time.  This would mean that we'd
! have to be ready to call iswalpha() at runtime, but if that only happens
! for high-code-value characters, it shouldn't be a big performance hit.


  Detailed semantics of an NFA
--- 275,330 ----
    an existing color has to be subdivided.

  The last two of these are handled with the "struct colordesc" array and
! the "colorchain" links in NFA arc structs.

! Ideally, we'd do the first two operations using a simple linear array
! storing the current color assignment for each character code.
! Unfortunately, that's not terribly workable for large charsets such as
! Unicode.  Our solution is to divide the color map into two parts.  A simple
! linear array is used for character codes up to MAX_SIMPLE_CHR, which can be
! chosen large enough to include all popular characters (so that the
! significantly-slower code paths about to be described are seldom invoked).
! Characters above that need be considered at compile time only if they
! appear explicitly in the regex pattern.  We store each such mentioned
! character or character range as an entry in the "colormaprange" array in
! the colormap.  (Overlapping ranges are split into unique subranges, so that
! each range in the finished list needs only a single color that describes
! all its characters.)  When mapping a character above MAX_SIMPLE_CHR to a
! color at runtime, we search this list of ranges explicitly.

! That's still not quite enough, though, because of locale-dependent
! character classes such as [[:alpha:]].  In Unicode locales these classes
! may have thousands of entries that are above MAX_SIMPLE_CHR, and we
! certainly don't want to be searching large colormaprange arrays at runtime.
! Nor do we even want to spend the time to initialize cvec structures that
! exhaustively describe all of those characters.  Our solution is to compute
! exact per-character colors at compile time only up to MAX_SIMPLE_CHR.
! For characters above that, we apply the <ctype.h> or <wctype.h> lookup
! functions at runtime for each locale-dependent character class used in the
! regex pattern, constructing a bitmap that describes which classes the
! runtime character belongs to.  The per-character-range data structure
! mentioned above actually holds, for each range, a separate color entry
! for each possible combination of character class properties.  That is,
! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
! whose rows correspond to character ranges that are explicitly mentioned
! in the input, and whose columns correspond to sets of relevant locale
! character classes.
!
! We build this color map in parallel with scanning the regex.  Each time
! we detect a new explicit high character (or range) or a locale-dependent
! character class, we split existing entry(s) in the high color map so that
! characters we need to be able to distinguish will have distinct entries
! that can be given separate colors.  Often, though, single entries in the
! high color map will represent very large sets of characters.
!
! If there are both explicit high characters/ranges and locale-dependent
! character classes, we may have entries in the high color map array that
! have non-WHITE colors but don't actually represent any real characters.
! (For example, in a row representing a singleton range, only one of the
! columns could possibly be a live entry; it's the one matching the actual
! locale properties for that single character.)  We don't currently make
! any effort to reclaim such colors.  In principle it could be done, but
! it's not clear that it's worth the trouble.


  Detailed semantics of an NFA
diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c
index 8ffc8fb..7cb2d67 100644
*** a/src/backend/regex/regc_color.c
--- b/src/backend/regex/regc_color.c
*************** static void
*** 49,58 ****
  initcm(struct vars * v,
         struct colormap * cm)
  {
-     int            i;
-     int            j;
-     union tree *t;
-     union tree *nextt;
      struct colordesc *cd;

      cm->magic = CMMAGIC;
--- 49,54 ----
*************** initcm(struct vars * v,
*** 64,87 ****
      cm->free = 0;

      cd = cm->cd;                /* cm->cd[WHITE] */
      cd->sub = NOSUB;
      cd->arcs = NULL;
      cd->firstchr = CHR_MIN;
-     cd->nchrs = CHR_MAX - CHR_MIN + 1;
      cd->flags = 0;

!     /* upper levels of tree */
!     for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--)
      {
!         nextt = t + 1;
!         for (i = BYTTAB - 1; i >= 0; i--)
!             t->tptr[i] = nextt;
      }
!     /* bottom level is solid white */
!     t = &cm->tree[NBYTS - 1];
!     for (i = BYTTAB - 1; i >= 0; i--)
!         t->tcolor[i] = WHITE;
!     cd->block = t;
  }

  /*
--- 60,99 ----
      cm->free = 0;

      cd = cm->cd;                /* cm->cd[WHITE] */
+     cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1;
+     cd->nuchrs = 1;
      cd->sub = NOSUB;
      cd->arcs = NULL;
      cd->firstchr = CHR_MIN;
      cd->flags = 0;

!     cm->locolormap = (color *)
!         MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
!     if (cm->locolormap == NULL)
      {
!         CERR(REG_ESPACE);
!         cm->cmranges = NULL;    /* prevent failure during freecm */
!         cm->hicolormap = NULL;
!         return;
      }
!     /* this relies on WHITE being zero: */
!     memset(cm->locolormap, WHITE,
!            (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
!
!     memset(cm->classbits, 0, sizeof(cm->classbits));
!     cm->numcmranges = 0;
!     cm->cmranges = NULL;
!     cm->maxarrayrows = 4;        /* arbitrary initial allocation */
!     cm->hiarrayrows = 1;
!     cm->hiarraycols = 1;
!     cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
!     if (cm->hicolormap == NULL)
!     {
!         CERR(REG_ESPACE);
!         return;
!     }
!     /* initialize the "all other characters" row to WHITE */
!     cm->hicolormap[0] = WHITE;
  }

  /*
*************** initcm(struct vars * v,
*** 90,206 ****
  static void
  freecm(struct colormap * cm)
  {
-     size_t        i;
-     union tree *cb;
-
      cm->magic = 0;
-     if (NBYTS > 1)
-         cmtreefree(cm, cm->tree, 0);
-     for (i = 1; i <= cm->max; i++)        /* skip WHITE */
-         if (!UNUSEDCOLOR(&cm->cd[i]))
-         {
-             cb = cm->cd[i].block;
-             if (cb != NULL)
-                 FREE(cb);
-         }
      if (cm->cd != cm->cdspace)
          FREE(cm->cd);
  }

  /*
!  * cmtreefree - free a non-terminal part of a colormap tree
   */
! static void
! cmtreefree(struct colormap * cm,
!            union tree * tree,
!            int level)            /* level number (top == 0) of this block */
  {
!     int            i;
!     union tree *t;
!     union tree *fillt = &cm->tree[level + 1];
!     union tree *cb;

!     assert(level < NBYTS - 1);    /* this level has pointers */
!     for (i = BYTTAB - 1; i >= 0; i--)
      {
!         t = tree->tptr[i];
!         assert(t != NULL);
!         if (t != fillt)
          {
!             if (level < NBYTS - 2)
!             {                    /* more pointer blocks below */
!                 cmtreefree(cm, t, level + 1);
!                 FREE(t);
!             }
!             else
!             {                    /* color block below */
!                 cb = cm->cd[t->tcolor[0]].block;
!                 if (t != cb)    /* not a solid block */
!                     FREE(t);
!             }
          }
      }
- }
-
- /*
-  * setcolor - set the color of a character in a colormap
-  */
- static color                    /* previous color */
- setcolor(struct colormap * cm,
-          chr c,
-          color co)
- {
-     uchr        uc = c;
-     int            shift;
-     int            level;
-     int            b;
-     int            bottom;
-     union tree *t;
-     union tree *newt;
-     union tree *fillt;
-     union tree *lastt;
-     union tree *cb;
-     color        prev;
-
-     assert(cm->magic == CMMAGIC);
-     if (CISERR() || co == COLORLESS)
-         return COLORLESS;

!     t = cm->tree;
!     for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
!          level++, shift -= BYTBITS)
!     {
!         b = (uc >> shift) & BYTMASK;
!         lastt = t;
!         t = lastt->tptr[b];
!         assert(t != NULL);
!         fillt = &cm->tree[level + 1];
!         bottom = (shift <= BYTBITS) ? 1 : 0;
!         cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt;
!         if (t == fillt || t == cb)
!         {                        /* must allocate a new block */
!             newt = (union tree *) MALLOC((bottom) ?
!                                 sizeof(struct colors) : sizeof(struct ptrs));
!             if (newt == NULL)
!             {
!                 CERR(REG_ESPACE);
!                 return COLORLESS;
!             }
!             if (bottom)
!                 memcpy(VS(newt->tcolor), VS(t->tcolor),
!                        BYTTAB * sizeof(color));
!             else
!                 memcpy(VS(newt->tptr), VS(t->tptr),
!                        BYTTAB * sizeof(union tree *));
!             t = newt;
!             lastt->tptr[b] = t;
!         }
!     }

!     b = uc & BYTMASK;
!     prev = t->tcolor[b];
!     t->tcolor[b] = co;
!     return prev;
  }

  /*
--- 102,164 ----
  static void
  freecm(struct colormap * cm)
  {
      cm->magic = 0;
      if (cm->cd != cm->cdspace)
          FREE(cm->cd);
+     if (cm->locolormap != NULL)
+         FREE(cm->locolormap);
+     if (cm->cmranges != NULL)
+         FREE(cm->cmranges);
+     if (cm->hicolormap != NULL)
+         FREE(cm->hicolormap);
  }

  /*
!  * pg_reg_getcolor - slow case of GETCOLOR()
   */
! color
! pg_reg_getcolor(struct colormap * cm, chr c)
  {
!     int            rownum,
!                 colnum,
!                 low,
!                 high;

!     /* Should not be used for chrs in the locolormap */
!     assert(c > MAX_SIMPLE_CHR);
!
!     /*
!      * Find which row it's in.  The colormapranges are in order, so we can use
!      * binary search.
!      */
!     rownum = 0;                    /* if no match, use array row zero */
!     low = 0;
!     high = cm->numcmranges;
!     while (low < high)
      {
!         int            middle = low + (high - low) / 2;
!         const colormaprange *cmr = &cm->cmranges[middle];
!
!         if (c < cmr->cmin)
!             high = middle;
!         else if (c > cmr->cmax)
!             low = middle + 1;
!         else
          {
!             rownum = cmr->rownum;        /* found a match */
!             break;
          }
      }

!     /*
!      * Find which column it's in --- this is all locale-dependent.
!      */
!     if (cm->hiarraycols > 1)
!         colnum = cclass_column_index(cm, c);
!     else
!         colnum = 0;                /* fast path if no relevant cclasses */

!     return cm->hicolormap[rownum * cm->hiarraycols + colnum];
  }

  /*
*************** maxcolor(struct colormap * cm)
*** 216,222 ****
  }

  /*
!  * newcolor - find a new color (must be subject of setcolor at once)
   * Beware:    may relocate the colordescs.
   */
  static color                    /* COLORLESS for error */
--- 174,180 ----
  }

  /*
!  * newcolor - find a new color (must be assigned at once)
   * Beware:    may relocate the colordescs.
   */
  static color                    /* COLORLESS for error */
*************** newcolor(struct colormap * cm)
*** 278,289 ****
          cd = &cm->cd[cm->max];
      }

!     cd->nchrs = 0;
      cd->sub = NOSUB;
      cd->arcs = NULL;
      cd->firstchr = CHR_MIN;        /* in case never set otherwise */
      cd->flags = 0;
-     cd->block = NULL;

      return (color) (cd - cm->cd);
  }
--- 236,247 ----
          cd = &cm->cd[cm->max];
      }

!     cd->nschrs = 0;
!     cd->nuchrs = 0;
      cd->sub = NOSUB;
      cd->arcs = NULL;
      cd->firstchr = CHR_MIN;        /* in case never set otherwise */
      cd->flags = 0;

      return (color) (cd - cm->cd);
  }
*************** freecolor(struct colormap * cm,
*** 305,317 ****

      assert(cd->arcs == NULL);
      assert(cd->sub == NOSUB);
!     assert(cd->nchrs == 0);
      cd->flags = FREECOL;
-     if (cd->block != NULL)
-     {
-         FREE(cd->block);
-         cd->block = NULL;        /* just paranoia */
-     }

      if ((size_t) co == cm->max)
      {
--- 263,271 ----

      assert(cd->arcs == NULL);
      assert(cd->sub == NOSUB);
!     assert(cd->nschrs == 0);
!     assert(cd->nuchrs == 0);
      cd->flags = FREECOL;

      if ((size_t) co == cm->max)
      {
*************** static color
*** 354,370 ****
  pseudocolor(struct colormap * cm)
  {
      color        co;

      co = newcolor(cm);
      if (CISERR())
          return COLORLESS;
!     cm->cd[co].nchrs = 1;
!     cm->cd[co].flags = PSEUDO;
      return co;
  }

  /*
   * subcolor - allocate a new subcolor (if necessary) to this chr
   */
  static color
  subcolor(struct colormap * cm, chr c)
--- 308,332 ----
  pseudocolor(struct colormap * cm)
  {
      color        co;
+     struct colordesc *cd;

      co = newcolor(cm);
      if (CISERR())
          return COLORLESS;
!     cd = &cm->cd[co];
!     cd->nschrs = 0;
!     cd->nuchrs = 1;                /* pretend it is in the upper map */
!     cd->sub = NOSUB;
!     cd->arcs = NULL;
!     cd->firstchr = CHR_MIN;
!     cd->flags = PSEUDO;
      return co;
  }

  /*
   * subcolor - allocate a new subcolor (if necessary) to this chr
+  *
+  * This works only for chrs that map into the low color map.
   */
  static color
  subcolor(struct colormap * cm, chr c)
*************** subcolor(struct colormap * cm, chr c)
*** 372,378 ****
      color        co;                /* current color of c */
      color        sco;            /* new subcolor */

!     co = GETCOLOR(cm, c);
      sco = newsub(cm, co);
      if (CISERR())
          return COLORLESS;
--- 334,342 ----
      color        co;                /* current color of c */
      color        sco;            /* new subcolor */

!     assert(c <= MAX_SIMPLE_CHR);
!
!     co = cm->locolormap[c - CHR_MIN];
      sco = newsub(cm, co);
      if (CISERR())
          return COLORLESS;
*************** subcolor(struct colormap * cm, chr c)
*** 380,390 ****

      if (co == sco)                /* already in an open subcolor */
          return co;                /* rest is redundant */
!     cm->cd[co].nchrs--;
!     if (cm->cd[sco].nchrs == 0)
          cm->cd[sco].firstchr = c;
!     cm->cd[sco].nchrs++;
!     setcolor(cm, c, sco);
      return sco;
  }

--- 344,380 ----

      if (co == sco)                /* already in an open subcolor */
          return co;                /* rest is redundant */
!     cm->cd[co].nschrs--;
!     if (cm->cd[sco].nschrs == 0)
          cm->cd[sco].firstchr = c;
!     cm->cd[sco].nschrs++;
!     cm->locolormap[c - CHR_MIN] = sco;
!     return sco;
! }
!
! /*
!  * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry
!  *
!  * This is the same processing as subcolor(), but for entries in the high
!  * colormap, which do not necessarily correspond to exactly one chr code.
!  */
! static color
! subcolorhi(struct colormap * cm, color *pco)
! {
!     color        co;                /* current color of entry */
!     color        sco;            /* new subcolor */
!
!     co = *pco;
!     sco = newsub(cm, co);
!     if (CISERR())
!         return COLORLESS;
!     assert(sco != COLORLESS);
!
!     if (co == sco)                /* already in an open subcolor */
!         return co;                /* rest is redundant */
!     cm->cd[co].nuchrs--;
!     cm->cd[sco].nuchrs++;
!     *pco = sco;
      return sco;
  }

*************** newsub(struct colormap * cm,
*** 400,406 ****
      sco = cm->cd[co].sub;
      if (sco == NOSUB)
      {                            /* color has no open subcolor */
!         if (cm->cd[co].nchrs == 1)        /* optimization */
              return co;
          sco = newcolor(cm);        /* must create subcolor */
          if (sco == COLORLESS)
--- 390,397 ----
      sco = cm->cd[co].sub;
      if (sco == NOSUB)
      {                            /* color has no open subcolor */
!         /* optimization: singly-referenced color need not be subcolored */
!         if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1)
              return co;
          sco = newcolor(cm);        /* must create subcolor */
          if (sco == COLORLESS)
*************** newsub(struct colormap * cm,
*** 417,552 ****
  }

  /*
!  * subrange - allocate new subcolors to this range of chrs, fill in arcs
   */
! static void
! subrange(struct vars * v,
!          chr from,
!          chr to,
!          struct state * lp,
!          struct state * rp)
  {
!     uchr        uf;
      int            i;

!     assert(from <= to);

!     /* first, align "from" on a tree-block boundary */
!     uf = (uchr) from;
!     i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf);
!     for (; from <= to && i > 0; i--, from++)
!         newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
!     if (from > to)                /* didn't reach a boundary */
          return;

!     /* deal with whole blocks */
!     for (; to - from >= BYTTAB; from += BYTTAB)
!         subblock(v, from, lp, rp);

!     /* clean up any remaining partial table */
!     for (; from <= to; from++)
!         newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp);
  }

  /*
!  * subblock - allocate new subcolors for one tree block of chrs, fill in arcs
   *
!  * Note: subcolors that are created during execution of this function
!  * will not be given a useful value of firstchr; it'll be left as CHR_MIN.
!  * For the current usage of firstchr in pg_regprefix, this does not matter
!  * because such subcolors won't occur in the common prefix of a regex.
   */
  static void
! subblock(struct vars * v,
!          chr start,                /* first of BYTTAB chrs */
!          struct state * lp,
!          struct state * rp)
  {
-     uchr        uc = start;
      struct colormap *cm = v->cm;
!     int            shift;
!     int            level;
      int            i;
-     int            b;
-     union tree *t;
-     union tree *cb;
-     union tree *fillt;
-     union tree *lastt;
-     int            previ;
-     int            ndone;
-     color        co;
-     color        sco;

!     assert((uc % BYTTAB) == 0);

!     /* find its color block, making new pointer blocks as needed */
!     t = cm->tree;
!     fillt = NULL;
!     for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0;
!          level++, shift -= BYTBITS)
      {
!         b = (uc >> shift) & BYTMASK;
!         lastt = t;
!         t = lastt->tptr[b];
!         assert(t != NULL);
!         fillt = &cm->tree[level + 1];
!         if (t == fillt && shift > BYTBITS)
!         {                        /* need new ptr block */
!             t = (union tree *) MALLOC(sizeof(struct ptrs));
!             if (t == NULL)
              {
!                 CERR(REG_ESPACE);
!                 return;
              }
-             memcpy(VS(t->tptr), VS(fillt->tptr),
-                    BYTTAB * sizeof(union tree *));
-             lastt->tptr[b] = t;
          }
      }

!     /* special cases:  fill block or solid block */
!     co = t->tcolor[0];
!     cb = cm->cd[co].block;
!     if (t == fillt || t == cb)
      {
!         /* either way, we want a subcolor solid block */
!         sco = newsub(cm, co);
!         t = cm->cd[sco].block;
!         if (t == NULL)
!         {                        /* must set it up */
!             t = (union tree *) MALLOC(sizeof(struct colors));
!             if (t == NULL)
              {
!                 CERR(REG_ESPACE);
!                 return;
              }
-             for (i = 0; i < BYTTAB; i++)
-                 t->tcolor[i] = sco;
-             cm->cd[sco].block = t;
          }
!         /* find loop must have run at least once */
!         lastt->tptr[b] = t;
!         newarc(v->nfa, PLAIN, sco, lp, rp);
!         cm->cd[co].nchrs -= BYTTAB;
!         cm->cd[sco].nchrs += BYTTAB;
          return;
      }

!     /* general case, a mixed block to be altered */
!     i = 0;
!     while (i < BYTTAB)
      {
!         co = t->tcolor[i];
!         sco = newsub(cm, co);
!         newarc(v->nfa, PLAIN, sco, lp, rp);
!         previ = i;
!         do
          {
!             t->tcolor[i++] = sco;
!         } while (i < BYTTAB && t->tcolor[i] == co);
!         ndone = i - previ;
!         cm->cd[co].nchrs -= ndone;
!         cm->cd[sco].nchrs += ndone;
      }
  }

--- 408,907 ----
  }

  /*
!  * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow
!  *
!  * Returns array index of new row.  Note the array might move.
   */
! static int
! newhicolorrow(struct colormap * cm,
!               int oldrow)
  {
!     int            newrow = cm->hiarrayrows;
!     color       *newrowptr;
      int            i;

!     /* Assign a fresh array row index, enlarging storage if needed */
!     if (newrow >= cm->maxarrayrows)
!     {
!         color       *newarray;

!         if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2))
!         {
!             CERR(REG_ESPACE);
!             return 0;
!         }
!         newarray = (color *) REALLOC(cm->hicolormap,
!                                      cm->maxarrayrows * 2 *
!                                      cm->hiarraycols * sizeof(color));
!         if (newarray == NULL)
!         {
!             CERR(REG_ESPACE);
!             return 0;
!         }
!         cm->hicolormap = newarray;
!         cm->maxarrayrows *= 2;
!     }
!     cm->hiarrayrows++;
!
!     /* Copy old row data */
!     newrowptr = &cm->hicolormap[newrow * cm->hiarraycols];
!     memcpy(newrowptr,
!            &cm->hicolormap[oldrow * cm->hiarraycols],
!            cm->hiarraycols * sizeof(color));
!
!     /* Increase color reference counts to reflect new colormap entries */
!     for (i = 0; i < cm->hiarraycols; i++)
!         cm->cd[newrowptr[i]].nuchrs++;
!
!     return newrow;
! }
!
! /*
!  * newhicolorcols - create a new set of columns in the high colormap
!  *
!  * Essentially, extends the 2-D array to the right with a copy of itself.
!  */
! static void
! newhicolorcols(struct colormap * cm)
! {
!     color       *newarray;
!     int            r,
!                 c;
!
!     if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2))
!     {
!         CERR(REG_ESPACE);
          return;
+     }
+     newarray = (color *) REALLOC(cm->hicolormap,
+                                  cm->maxarrayrows *
+                                  cm->hiarraycols * 2 * sizeof(color));
+     if (newarray == NULL)
+     {
+         CERR(REG_ESPACE);
+         return;
+     }
+     cm->hicolormap = newarray;

!     /* Duplicate existing columns to the right, and increase ref counts */
!     /* Must work downwards in the array because we realloc'd in place */
!     for (r = cm->hiarrayrows - 1; r >= 0; r--)
!     {
!         color       *oldrowptr = &newarray[r * cm->hiarraycols];
!         color       *newrowptr = &newarray[r * cm->hiarraycols * 2];
!         color       *newrowptr2 = newrowptr + cm->hiarraycols;

!         for (c = 0; c < cm->hiarraycols; c++)
!         {
!             color        co = oldrowptr[c];
!
!             newrowptr[c] = newrowptr2[c] = co;
!             cm->cd[co].nuchrs++;
!         }
!     }
!
!     cm->hiarraycols *= 2;
  }

  /*
!  * subcolorcvec - allocate new subcolors to cvec members, fill in arcs
   *
!  * For each chr "c" represented by the cvec, do the equivalent of
!  * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
!  *
!  * Note that in typical cases, many of the subcolors are the same.
!  * While newarc() would discard duplicate arc requests, we can save
!  * some cycles by not calling it repetitively to begin with.  This is
!  * mechanized with the "lastsubcolor" state variable.
   */
  static void
! subcolorcvec(struct vars * v,
!              struct cvec * cv,
!              struct state * lp,
!              struct state * rp)
  {
      struct colormap *cm = v->cm;
!     color        lastsubcolor = COLORLESS;
!     chr            ch,
!                 from,
!                 to;
!     const chr  *p;
      int            i;

!     /* ordinary characters */
!     for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
!     {
!         ch = *p;
!         subcoloronechr(v, ch, lp, rp, &lastsubcolor);
!         NOERR();
!     }

!     /* and the ranges */
!     for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
      {
!         from = *p;
!         to = *(p + 1);
!         if (from <= MAX_SIMPLE_CHR)
!         {
!             /* deal with simple chars one at a time */
!             chr            lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR;
!
!             while (from <= lim)
              {
!                 color        sco = subcolor(cm, from);
!
!                 NOERR();
!                 if (sco != lastsubcolor)
!                 {
!                     newarc(v->nfa, PLAIN, sco, lp, rp);
!                     NOERR();
!                     lastsubcolor = sco;
!                 }
!                 from++;
              }
          }
+         /* deal with any part of the range that's above MAX_SIMPLE_CHR */
+         if (from < to)
+             subcoloronerange(v, from, to, lp, rp, &lastsubcolor);
+         else if (from == to)
+             subcoloronechr(v, from, lp, rp, &lastsubcolor);
+         NOERR();
      }

!     /* and deal with cclass if any */
!     if (cv->cclasscode >= 0)
      {
!         int            classbit;
!         color       *pco;
!         int            r,
!                     c;
!
!         /* Enlarge array if we don't have a column bit assignment for cclass */
!         if (cm->classbits[cv->cclasscode] == 0)
!         {
!             cm->classbits[cv->cclasscode] = cm->hiarraycols;
!             newhicolorcols(cm);
!             NOERR();
!         }
!         /* Apply subcolorhi() and make arc for each entry in relevant cols */
!         classbit = cm->classbits[cv->cclasscode];
!         pco = cm->hicolormap;
!         for (r = 0; r < cm->hiarrayrows; r++)
!         {
!             for (c = 0; c < cm->hiarraycols; c++)
              {
!                 if (c & classbit)
!                 {
!                     color        sco = subcolorhi(cm, pco);
!
!                     NOERR();
!                     /* add the arc if needed */
!                     if (sco != lastsubcolor)
!                     {
!                         newarc(v->nfa, PLAIN, sco, lp, rp);
!                         NOERR();
!                         lastsubcolor = sco;
!                     }
!                 }
!                 pco++;
              }
          }
!     }
! }
!
! /*
!  * subcoloronechr - do subcolorcvec's work for a singleton chr
!  *
!  * We could just let subcoloronerange do this, but it's a bit more efficient
!  * if we exploit the single-chr case.  Also, callers find it useful for this
!  * to be able to handle both low and high chr codes.
!  */
! static void
! subcoloronechr(struct vars * v,
!                chr ch,
!                struct state * lp,
!                struct state * rp,
!                color *lastsubcolor)
! {
!     struct colormap *cm = v->cm;
!     colormaprange *newranges;
!     int            numnewranges;
!     colormaprange *oldrange;
!     int            oldrangen;
!     int            newrow;
!
!     /* Easy case for low chr codes */
!     if (ch <= MAX_SIMPLE_CHR)
!     {
!         color        sco = subcolor(cm, ch);
!
!         NOERR();
!         if (sco != *lastsubcolor)
!         {
!             newarc(v->nfa, PLAIN, sco, lp, rp);
!             *lastsubcolor = sco;
!         }
          return;
      }

!     /*
!      * Potentially, we could need two more colormapranges than we have now, if
!      * the given chr is in the middle of some existing range.
!      */
!     newranges = (colormaprange *)
!         MALLOC((cm->numcmranges + 2) * sizeof(colormaprange));
!     if (newranges == NULL)
      {
!         CERR(REG_ESPACE);
!         return;
!     }
!     numnewranges = 0;
!
!     /* Ranges before target are unchanged */
!     for (oldrange = cm->cmranges, oldrangen = 0;
!          oldrangen < cm->numcmranges;
!          oldrange++, oldrangen++)
!     {
!         if (oldrange->cmax >= ch)
!             break;
!         newranges[numnewranges++] = *oldrange;
!     }
!
!     /* Match target chr against current range */
!     if (oldrangen >= cm->numcmranges || oldrange->cmin > ch)
!     {
!         /* chr does not belong to any existing range, make a new one */
!         newranges[numnewranges].cmin = ch;
!         newranges[numnewranges].cmax = ch;
!         /* row state should be cloned from the "all others" row */
!         newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
!         numnewranges++;
!     }
!     else if (oldrange->cmin == oldrange->cmax)
!     {
!         /* we have an existing singleton range matching the chr */
!         newranges[numnewranges++] = *oldrange;
!         newrow = oldrange->rownum;
!         /* we've now fully processed this old range */
!         oldrange++, oldrangen++;
!     }
!     else
!     {
!         /* chr is a subset of this existing range, must split it */
!         if (ch > oldrange->cmin)
          {
!             /* emit portion of old range before chr */
!             newranges[numnewranges].cmin = oldrange->cmin;
!             newranges[numnewranges].cmax = ch - 1;
!             newranges[numnewranges].rownum = oldrange->rownum;
!             numnewranges++;
!         }
!         /* emit chr as singleton range, initially cloning from range */
!         newranges[numnewranges].cmin = ch;
!         newranges[numnewranges].cmax = ch;
!         newranges[numnewranges].rownum = newrow =
!             newhicolorrow(cm, oldrange->rownum);
!         numnewranges++;
!         if (ch < oldrange->cmax)
!         {
!             /* emit portion of old range after chr */
!             newranges[numnewranges].cmin = ch + 1;
!             newranges[numnewranges].cmax = oldrange->cmax;
!             /* must clone the row if we are making two new ranges from old */
!             newranges[numnewranges].rownum =
!                 (ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
!                 oldrange->rownum;
!             numnewranges++;
!         }
!         /* we've now fully processed this old range */
!         oldrange++, oldrangen++;
!     }
!
!     /* Update colors in newrow and create arcs as needed */
!     subcoloronerow(v, newrow, lp, rp, lastsubcolor);
!
!     /* Ranges after target are unchanged */
!     for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
!     {
!         newranges[numnewranges++] = *oldrange;
!     }
!
!     /* Assert our original space estimate was adequate */
!     assert(numnewranges <= (cm->numcmranges + 2));
!
!     /* And finally, store back the updated list of ranges */
!     if (cm->cmranges != NULL)
!         FREE(cm->cmranges);
!     cm->cmranges = newranges;
!     cm->numcmranges = numnewranges;
! }
!
! /*
!  * subcoloronerange - do subcolorcvec's work for a high range
!  */
! static void
! subcoloronerange(struct vars * v,
!                  chr from,
!                  chr to,
!                  struct state * lp,
!                  struct state * rp,
!                  color *lastsubcolor)
! {
!     struct colormap *cm = v->cm;
!     colormaprange *newranges;
!     int            numnewranges;
!     colormaprange *oldrange;
!     int            oldrangen;
!     int            newrow;
!
!     /* Caller should take care of non-high-range cases */
!     assert(from > MAX_SIMPLE_CHR);
!     assert(from < to);
!
!     /*
!      * Potentially, if we have N non-adjacent ranges, we could need as many as
!      * 2N+1 result ranges (consider case where new range spans 'em all).
!      */
!     newranges = (colormaprange *)
!         MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange));
!     if (newranges == NULL)
!     {
!         CERR(REG_ESPACE);
!         return;
!     }
!     numnewranges = 0;
!
!     /* Ranges before target are unchanged */
!     for (oldrange = cm->cmranges, oldrangen = 0;
!          oldrangen < cm->numcmranges;
!          oldrange++, oldrangen++)
!     {
!         if (oldrange->cmax >= from)
!             break;
!         newranges[numnewranges++] = *oldrange;
!     }
!
!     /*
!      * Deal with ranges that (partially) overlap the target.  As we process
!      * each such range, increase "from" to remove the dealt-with characters
!      * from the target range.
!      */
!     while (oldrangen < cm->numcmranges && oldrange->cmin <= to)
!     {
!         if (from < oldrange->cmin)
!         {
!             /* Handle portion of new range that corresponds to no old range */
!             newranges[numnewranges].cmin = from;
!             newranges[numnewranges].cmax = oldrange->cmin - 1;
!             /* row state should be cloned from the "all others" row */
!             newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
!             numnewranges++;
!             /* Update colors in newrow and create arcs as needed */
!             subcoloronerow(v, newrow, lp, rp, lastsubcolor);
!             /* We've now fully processed the part of new range before old */
!             from = oldrange->cmin;
!         }
!
!         if (from <= oldrange->cmin && to >= oldrange->cmax)
!         {
!             /* old range is fully contained in new, process it in-place */
!             newranges[numnewranges++] = *oldrange;
!             newrow = oldrange->rownum;
!             from = oldrange->cmax + 1;
!         }
!         else
!         {
!             /* some part of old range does not overlap new range */
!             if (from > oldrange->cmin)
!             {
!                 /* emit portion of old range before new range */
!                 newranges[numnewranges].cmin = oldrange->cmin;
!                 newranges[numnewranges].cmax = from - 1;
!                 newranges[numnewranges].rownum = oldrange->rownum;
!                 numnewranges++;
!             }
!             /* emit common subrange, initially cloning from old range */
!             newranges[numnewranges].cmin = from;
!             newranges[numnewranges].cmax =
!                 (to < oldrange->cmax) ? to : oldrange->cmax;
!             newranges[numnewranges].rownum = newrow =
!                 newhicolorrow(cm, oldrange->rownum);
!             numnewranges++;
!             if (to < oldrange->cmax)
!             {
!                 /* emit portion of old range after new range */
!                 newranges[numnewranges].cmin = to + 1;
!                 newranges[numnewranges].cmax = oldrange->cmax;
!                 /* must clone the row if we are making two new ranges from old */
!                 newranges[numnewranges].rownum =
!                     (from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) :
!                     oldrange->rownum;
!                 numnewranges++;
!             }
!             from = oldrange->cmax + 1;
!         }
!         /* Update colors in newrow and create arcs as needed */
!         subcoloronerow(v, newrow, lp, rp, lastsubcolor);
!         /* we've now fully processed this old range */
!         oldrange++, oldrangen++;
!     }
!
!     if (from <= to)
!     {
!         /* Handle portion of new range that corresponds to no old range */
!         newranges[numnewranges].cmin = from;
!         newranges[numnewranges].cmax = to;
!         /* row state should be cloned from the "all others" row */
!         newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0);
!         numnewranges++;
!         /* Update colors in newrow and create arcs as needed */
!         subcoloronerow(v, newrow, lp, rp, lastsubcolor);
!     }
!
!     /* Ranges after target are unchanged */
!     for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++)
!     {
!         newranges[numnewranges++] = *oldrange;
!     }
!
!     /* Assert our original space estimate was adequate */
!     assert(numnewranges <= (cm->numcmranges * 2 + 1));
!
!     /* And finally, store back the updated list of ranges */
!     if (cm->cmranges != NULL)
!         FREE(cm->cmranges);
!     cm->cmranges = newranges;
!     cm->numcmranges = numnewranges;
! }
!
! /*
!  * subcoloronerow - do subcolorcvec's work for one new row in the high colormap
!  */
! static void
! subcoloronerow(struct vars * v,
!                int rownum,
!                struct state * lp,
!                struct state * rp,
!                color *lastsubcolor)
! {
!     struct colormap *cm = v->cm;
!     color       *pco;
!     int            i;
!
!     /* Apply subcolorhi() and make arc for each entry in row */
!     pco = &cm->hicolormap[rownum * cm->hiarraycols];
!     for (i = 0; i < cm->hiarraycols; pco++, i++)
!     {
!         color        sco = subcolorhi(cm, pco);
!
!         NOERR();
!         /* make the arc if needed */
!         if (sco != *lastsubcolor)
!         {
!             newarc(v->nfa, PLAIN, sco, lp, rp);
!             NOERR();
!             *lastsubcolor = sco;
!         }
      }
  }

*************** okcolors(struct nfa * nfa,
*** 575,586 ****
          {
              /* is subcolor, let parent deal with it */
          }
!         else if (cd->nchrs == 0)
          {
              /* parent empty, its arcs change color to subcolor */
              cd->sub = NOSUB;
              scd = &cm->cd[sco];
!             assert(scd->nchrs > 0);
              assert(scd->sub == sco);
              scd->sub = NOSUB;
              while ((a = cd->arcs) != NULL)
--- 930,941 ----
          {
              /* is subcolor, let parent deal with it */
          }
!         else if (cd->nschrs == 0 && cd->nuchrs == 0)
          {
              /* parent empty, its arcs change color to subcolor */
              cd->sub = NOSUB;
              scd = &cm->cd[sco];
!             assert(scd->nschrs > 0 || scd->nuchrs > 0);
              assert(scd->sub == sco);
              scd->sub = NOSUB;
              while ((a = cd->arcs) != NULL)
*************** okcolors(struct nfa * nfa,
*** 597,603 ****
              /* parent's arcs must gain parallel subcolor arcs */
              cd->sub = NOSUB;
              scd = &cm->cd[sco];
!             assert(scd->nchrs > 0);
              assert(scd->sub == sco);
              scd->sub = NOSUB;
              for (a = cd->arcs; a != NULL; a = a->colorchain)
--- 952,958 ----
              /* parent's arcs must gain parallel subcolor arcs */
              cd->sub = NOSUB;
              scd = &cm->cd[sco];
!             assert(scd->nschrs > 0 || scd->nuchrs > 0);
              assert(scd->sub == sco);
              scd->sub = NOSUB;
              for (a = cd->arcs; a != NULL; a = a->colorchain)
*************** dumpcolors(struct colormap * cm,
*** 711,732 ****
      struct colordesc *end;
      color        co;
      chr            c;
-     char       *has;

      fprintf(f, "max %ld\n", (long) cm->max);
-     if (NBYTS > 1)
-         fillcheck(cm, cm->tree, 0, f);
      end = CDEND(cm);
      for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
          if (!UNUSEDCOLOR(cd))
          {
!             assert(cd->nchrs > 0);
!             has = (cd->block != NULL) ? "#" : "";
              if (cd->flags & PSEUDO)
!                 fprintf(f, "#%2ld%s(ps): ", (long) co, has);
              else
!                 fprintf(f, "#%2ld%s(%2d): ", (long) co,
!                         has, cd->nchrs);

              /*
               * Unfortunately, it's hard to do this next bit more efficiently.
--- 1066,1083 ----
      struct colordesc *end;
      color        co;
      chr            c;

      fprintf(f, "max %ld\n", (long) cm->max);
      end = CDEND(cm);
      for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */
+     {
          if (!UNUSEDCOLOR(cd))
          {
!             assert(cd->nschrs > 0 || cd->nuchrs > 0);
              if (cd->flags & PSEUDO)
!                 fprintf(f, "#%2ld(ps): ", (long) co);
              else
!                 fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs);

              /*
               * Unfortunately, it's hard to do this next bit more efficiently.
*************** dumpcolors(struct colormap * cm,
*** 741,772 ****
                      dumpchr(c, f);
              fprintf(f, "\n");
          }
! }
!
! /*
!  * fillcheck - check proper filling of a tree
!  */
! static void
! fillcheck(struct colormap * cm,
!           union tree * tree,
!           int level,            /* level number (top == 0) of this block */
!           FILE *f)
! {
!     int            i;
!     union tree *t;
!     union tree *fillt = &cm->tree[level + 1];
!
!     assert(level < NBYTS - 1);    /* this level has pointers */
!     for (i = BYTTAB - 1; i >= 0; i--)
      {
!         t = tree->tptr[i];
!         if (t == NULL)
!             fprintf(f, "NULL found in filled tree!\n");
!         else if (t == fillt)
          {
          }
-         else if (level < NBYTS - 2)        /* more pointer blocks below */
-             fillcheck(cm, t, level + 1, f);
      }
  }

--- 1092,1124 ----
                      dumpchr(c, f);
              fprintf(f, "\n");
          }
!     }
!     /* dump the high colormap if it contains anything interesting */
!     if (cm->hiarrayrows > 1 || cm->hiarraycols > 1)
      {
!         int            r,
!                     c;
!         const color *rowptr;
!
!         fprintf(f, "other:\t");
!         for (c = 0; c < cm->hiarraycols; c++)
          {
+             fprintf(f, "\t%ld", (long) cm->hicolormap[c]);
+         }
+         fprintf(f, "\n");
+         for (r = 0; r < cm->numcmranges; r++)
+         {
+             dumpchr(cm->cmranges[r].cmin, f);
+             fprintf(f, "..");
+             dumpchr(cm->cmranges[r].cmax, f);
+             fprintf(f, ":");
+             rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols];
+             for (c = 0; c < cm->hiarraycols; c++)
+             {
+                 fprintf(f, "\t%ld", (long) rowptr[c]);
+             }
+             fprintf(f, "\n");
          }
      }
  }

diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c
index 3a9e8cf..50b7a45 100644
*** a/src/backend/regex/regc_cvec.c
--- b/src/backend/regex/regc_cvec.c
***************
*** 34,40 ****

  /*
   * Notes:
!  * Only (selected) functions in _this_ file should treat chr* as non-constant.
   */

  /*
--- 34,41 ----

  /*
   * Notes:
!  * Only (selected) functions in _this_ file should treat the chr arrays
!  * of a cvec as non-constant.
   */

  /*
*************** clearcvec(struct cvec * cv)
*** 67,72 ****
--- 68,74 ----
      assert(cv != NULL);
      cv->nchrs = 0;
      cv->nranges = 0;
+     cv->cclasscode = -1;
      return cv;
  }

diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c
index 399de02..7cb3a40 100644
*** a/src/backend/regex/regc_locale.c
--- b/src/backend/regex/regc_locale.c
*************** static const struct cname
*** 349,354 ****
--- 349,367 ----
      }
  };

+ /*
+  * The following arrays define the valid character class names.
+  */
+ static const char *const classNames[NUM_CCLASSES + 1] = {
+     "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
+     "lower", "print", "punct", "space", "upper", "xdigit", NULL
+ };
+
+ enum classes
+ {
+     CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
+     CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
+ };

  /*
   * We do not use the hard-wired Unicode classification tables that Tcl does.
*************** cclass(struct vars * v,            /* context */
*** 544,564 ****
                  index;

      /*
-      * The following arrays define the valid character class names.
-      */
-
-     static const char *const classNames[] = {
-         "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph",
-         "lower", "print", "punct", "space", "upper", "xdigit", NULL
-     };
-
-     enum classes
-     {
-         CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH,
-         CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT
-     };
-
-     /*
       * Map the name to the corresponding enumerated value.
       */
      len = endp - startp;
--- 557,562 ----
*************** cclass(struct vars * v,            /* context */
*** 593,610 ****
       * pg_ctype_get_cache so that we can cache the results.  Other classes
       * have definitions that are hard-wired here, and for those we just
       * construct a transient cvec on the fly.
       */

      switch ((enum classes) index)
      {
          case CC_PRINT:
!             cv = pg_ctype_get_cache(pg_wc_isprint);
              break;
          case CC_ALNUM:
!             cv = pg_ctype_get_cache(pg_wc_isalnum);
              break;
          case CC_ALPHA:
!             cv = pg_ctype_get_cache(pg_wc_isalpha);
              break;
          case CC_ASCII:
              /* hard-wired meaning */
--- 591,610 ----
       * pg_ctype_get_cache so that we can cache the results.  Other classes
       * have definitions that are hard-wired here, and for those we just
       * construct a transient cvec on the fly.
+      *
+      * NB: keep this code in sync with cclass_column_index(), below.
       */

      switch ((enum classes) index)
      {
          case CC_PRINT:
!             cv = pg_ctype_get_cache(pg_wc_isprint, index);
              break;
          case CC_ALNUM:
!             cv = pg_ctype_get_cache(pg_wc_isalnum, index);
              break;
          case CC_ALPHA:
!             cv = pg_ctype_get_cache(pg_wc_isalpha, index);
              break;
          case CC_ASCII:
              /* hard-wired meaning */
*************** cclass(struct vars * v,            /* context */
*** 625,634 ****
              addrange(cv, 0x7f, 0x9f);
              break;
          case CC_DIGIT:
!             cv = pg_ctype_get_cache(pg_wc_isdigit);
              break;
          case CC_PUNCT:
!             cv = pg_ctype_get_cache(pg_wc_ispunct);
              break;
          case CC_XDIGIT:

--- 625,634 ----
              addrange(cv, 0x7f, 0x9f);
              break;
          case CC_DIGIT:
!             cv = pg_ctype_get_cache(pg_wc_isdigit, index);
              break;
          case CC_PUNCT:
!             cv = pg_ctype_get_cache(pg_wc_ispunct, index);
              break;
          case CC_XDIGIT:

*************** cclass(struct vars * v,            /* context */
*** 646,661 ****
              }
              break;
          case CC_SPACE:
!             cv = pg_ctype_get_cache(pg_wc_isspace);
              break;
          case CC_LOWER:
!             cv = pg_ctype_get_cache(pg_wc_islower);
              break;
          case CC_UPPER:
!             cv = pg_ctype_get_cache(pg_wc_isupper);
              break;
          case CC_GRAPH:
!             cv = pg_ctype_get_cache(pg_wc_isgraph);
              break;
      }

--- 646,661 ----
              }
              break;
          case CC_SPACE:
!             cv = pg_ctype_get_cache(pg_wc_isspace, index);
              break;
          case CC_LOWER:
!             cv = pg_ctype_get_cache(pg_wc_islower, index);
              break;
          case CC_UPPER:
!             cv = pg_ctype_get_cache(pg_wc_isupper, index);
              break;
          case CC_GRAPH:
!             cv = pg_ctype_get_cache(pg_wc_isgraph, index);
              break;
      }

*************** cclass(struct vars * v,            /* context */
*** 666,671 ****
--- 666,712 ----
  }

  /*
+  * cclass_column_index - get appropriate high colormap column index for chr
+  */
+ static int
+ cclass_column_index(struct colormap * cm, chr c)
+ {
+     int            colnum = 0;
+
+     /* Shouldn't go through all these pushups for simple chrs */
+     assert(c > MAX_SIMPLE_CHR);
+
+     /*
+      * Note: we should not see requests to consider cclasses that are not
+      * treated as locale-specific by cclass(), above.
+      */
+     if (cm->classbits[CC_PRINT] && pg_wc_isprint(c))
+         colnum |= cm->classbits[CC_PRINT];
+     if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c))
+         colnum |= cm->classbits[CC_ALNUM];
+     if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c))
+         colnum |= cm->classbits[CC_ALPHA];
+     assert(cm->classbits[CC_ASCII] == 0);
+     assert(cm->classbits[CC_BLANK] == 0);
+     assert(cm->classbits[CC_CNTRL] == 0);
+     if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c))
+         colnum |= cm->classbits[CC_DIGIT];
+     if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c))
+         colnum |= cm->classbits[CC_PUNCT];
+     assert(cm->classbits[CC_XDIGIT] == 0);
+     if (cm->classbits[CC_SPACE] && pg_wc_isspace(c))
+         colnum |= cm->classbits[CC_SPACE];
+     if (cm->classbits[CC_LOWER] && pg_wc_islower(c))
+         colnum |= cm->classbits[CC_LOWER];
+     if (cm->classbits[CC_UPPER] && pg_wc_isupper(c))
+         colnum |= cm->classbits[CC_UPPER];
+     if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c))
+         colnum |= cm->classbits[CC_GRAPH];
+
+     return colnum;
+ }
+
+ /*
   * allcases - supply cvec for all case counterparts of a chr (including itself)
   *
   * This is a shortcut, preferably an efficient one, for simple characters;
diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c
index 551ae7d..ad9d4b1 100644
*** a/src/backend/regex/regc_pg_locale.c
--- b/src/backend/regex/regc_pg_locale.c
*************** store_match(pg_ctype_cache *pcc, pg_wcha
*** 736,742 ****
   * Note that the result must not be freed or modified by caller.
   */
  static struct cvec *
! pg_ctype_get_cache(pg_wc_probefunc probefunc)
  {
      pg_ctype_cache *pcc;
      pg_wchar    max_chr;
--- 736,742 ----
   * Note that the result must not be freed or modified by caller.
   */
  static struct cvec *
! pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode)
  {
      pg_ctype_cache *pcc;
      pg_wchar    max_chr;
*************** pg_ctype_get_cache(pg_wc_probefunc probe
*** 770,800 ****
      pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
      if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
          goto out_of_memory;

      /*
!      * Decide how many character codes we ought to look through.  For C locale
!      * there's no need to go further than 127.  Otherwise, if the encoding is
!      * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot
!      * extend it as far as we'd like (say, 0xFFFF, the end of the Basic
!      * Multilingual Plane) without creating significant performance issues due
!      * to too many characters being fed through the colormap code.  This will
!      * need redesign to fix reasonably, but at least for the moment we have
!      * all common European languages covered.  Otherwise (not C, not UTF8) go
!      * up to 255.  These limits are interrelated with restrictions discussed
!      * at the head of this file.
       */
      switch (pg_regex_strategy)
      {
          case PG_REGEX_LOCALE_C:
              max_chr = (pg_wchar) 127;
              break;
          case PG_REGEX_LOCALE_WIDE:
          case PG_REGEX_LOCALE_WIDE_L:
!             max_chr = (pg_wchar) 0x7FF;
              break;
          case PG_REGEX_LOCALE_1BYTE:
          case PG_REGEX_LOCALE_1BYTE_L:
              max_chr = (pg_wchar) UCHAR_MAX;
              break;
          default:
              max_chr = 0;        /* can't get here, but keep compiler quiet */
--- 770,812 ----
      pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2);
      if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL)
          goto out_of_memory;
+     pcc->cv.cclasscode = cclasscode;

      /*
!      * Decide how many character codes we ought to look through.  In general
!      * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at
!      * runtime using the "high colormap" mechanism.  However, in C locale
!      * there's no need to go further than 127, and if we only have a 1-byte
!      * <ctype.h> API there's no need to go further than that can handle.
!      *
!      * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the
!      * output cvec as not having any locale-dependent behavior, since there
!      * will be no need to do any run-time locale checks.  (The #if's here
!      * would always be true for production values of MAX_SIMPLE_CHR, but it's
!      * useful to allow it to be small for testing purposes.)
       */
      switch (pg_regex_strategy)
      {
          case PG_REGEX_LOCALE_C:
+ #if MAX_SIMPLE_CHR >= 127
              max_chr = (pg_wchar) 127;
+             pcc->cv.cclasscode = -1;
+ #else
+             max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+ #endif
              break;
          case PG_REGEX_LOCALE_WIDE:
          case PG_REGEX_LOCALE_WIDE_L:
!             max_chr = (pg_wchar) MAX_SIMPLE_CHR;
              break;
          case PG_REGEX_LOCALE_1BYTE:
          case PG_REGEX_LOCALE_1BYTE_L:
+ #if MAX_SIMPLE_CHR >= UCHAR_MAX
              max_chr = (pg_wchar) UCHAR_MAX;
+             pcc->cv.cclasscode = -1;
+ #else
+             max_chr = (pg_wchar) MAX_SIMPLE_CHR;
+ #endif
              break;
          default:
              max_chr = 0;        /* can't get here, but keep compiler quiet */
diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index b211cc0..ed95474 100644
*** a/src/backend/regex/regcomp.c
--- b/src/backend/regex/regcomp.c
*************** static void cbracket(struct vars *, stru
*** 55,61 ****
  static void brackpart(struct vars *, struct state *, struct state *);
  static const chr *scanplain(struct vars *);
  static void onechr(struct vars *, chr, struct state *, struct state *);
- static void dovec(struct vars *, struct cvec *, struct state *, struct state *);
  static void wordchrs(struct vars *);
  static void processlacon(struct vars *, struct state *, struct state *, int,
               struct state *, struct state *);
--- 55,60 ----
*************** static chr    chrnamed(struct vars *, const
*** 96,111 ****
  /* === regc_color.c === */
  static void initcm(struct vars *, struct colormap *);
  static void freecm(struct colormap *);
- static void cmtreefree(struct colormap *, union tree *, int);
- static color setcolor(struct colormap *, chr, color);
  static color maxcolor(struct colormap *);
  static color newcolor(struct colormap *);
  static void freecolor(struct colormap *, color);
  static color pseudocolor(struct colormap *);
! static color subcolor(struct colormap *, chr c);
  static color newsub(struct colormap *, color);
! static void subrange(struct vars *, chr, chr, struct state *, struct state *);
! static void subblock(struct vars *, chr, struct state *, struct state *);
  static void okcolors(struct nfa *, struct colormap *);
  static void colorchain(struct colormap *, struct arc *);
  static void uncolorchain(struct colormap *, struct arc *);
--- 95,113 ----
  /* === regc_color.c === */
  static void initcm(struct vars *, struct colormap *);
  static void freecm(struct colormap *);
  static color maxcolor(struct colormap *);
  static color newcolor(struct colormap *);
  static void freecolor(struct colormap *, color);
  static color pseudocolor(struct colormap *);
! static color subcolor(struct colormap *, chr);
! static color subcolorhi(struct colormap *, color *);
  static color newsub(struct colormap *, color);
! static int    newhicolorrow(struct colormap *, int);
! static void newhicolorcols(struct colormap *);
! static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *);
! static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *);
! static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *);
! static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *);
  static void okcolors(struct nfa *, struct colormap *);
  static void colorchain(struct colormap *, struct arc *);
  static void uncolorchain(struct colormap *, struct arc *);
*************** static void colorcomplement(struct nfa *
*** 114,120 ****

  #ifdef REG_DEBUG
  static void dumpcolors(struct colormap *, FILE *);
- static void fillcheck(struct colormap *, union tree *, int, FILE *);
  static void dumpchr(chr, FILE *);
  #endif
  /* === regc_nfa.c === */
--- 116,121 ----
*************** static struct cvec *range(struct vars *,
*** 215,220 ****
--- 216,222 ----
  static int    before(chr, chr);
  static struct cvec *eclass(struct vars *, chr, int);
  static struct cvec *cclass(struct vars *, const chr *, const chr *, int);
+ static int    cclass_column_index(struct colormap *, chr);
  static struct cvec *allcases(struct vars *, chr);
  static int    cmp(const chr *, const chr *, size_t);
  static int    casecmp(const chr *, const chr *, size_t);
*************** brackpart(struct vars * v,
*** 1467,1473 ****
              NOERR();
              cv = eclass(v, startc, (v->cflags & REG_ICASE));
              NOERR();
!             dovec(v, cv, lp, rp);
              return;
              break;
          case CCLASS:
--- 1469,1475 ----
              NOERR();
              cv = eclass(v, startc, (v->cflags & REG_ICASE));
              NOERR();
!             subcolorcvec(v, cv, lp, rp);
              return;
              break;
          case CCLASS:
*************** brackpart(struct vars * v,
*** 1477,1483 ****
              NOERR();
              cv = cclass(v, startp, endp, (v->cflags & REG_ICASE));
              NOERR();
!             dovec(v, cv, lp, rp);
              return;
              break;
          default:
--- 1479,1485 ----
              NOERR();
              cv = cclass(v, startp, endp, (v->cflags & REG_ICASE));
              NOERR();
!             subcolorcvec(v, cv, lp, rp);
              return;
              break;
          default:
*************** brackpart(struct vars * v,
*** 1523,1529 ****
          NOTE(REG_UUNPORT);
      cv = range(v, startc, endc, (v->cflags & REG_ICASE));
      NOERR();
!     dovec(v, cv, lp, rp);
  }

  /*
--- 1525,1531 ----
          NOTE(REG_UUNPORT);
      cv = range(v, startc, endc, (v->cflags & REG_ICASE));
      NOERR();
!     subcolorcvec(v, cv, lp, rp);
  }

  /*
*************** onechr(struct vars * v,
*** 1565,1610 ****
  {
      if (!(v->cflags & REG_ICASE))
      {
!         newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp);
          return;
      }

      /* rats, need general case anyway... */
!     dovec(v, allcases(v, c), lp, rp);
! }
!
! /*
!  * dovec - fill in arcs for each element of a cvec
!  */
! static void
! dovec(struct vars * v,
!       struct cvec * cv,
!       struct state * lp,
!       struct state * rp)
! {
!     chr            ch,
!                 from,
!                 to;
!     const chr  *p;
!     int            i;
!
!     /* ordinary characters */
!     for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--)
!     {
!         ch = *p;
!         newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp);
!         NOERR();
!     }
!
!     /* and the ranges */
!     for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--)
!     {
!         from = *p;
!         to = *(p + 1);
!         if (from <= to)
!             subrange(v, from, to, lp, rp);
!         NOERR();
!     }
  }

  /*
--- 1567,1580 ----
  {
      if (!(v->cflags & REG_ICASE))
      {
!         color        lastsubcolor = COLORLESS;
!
!         subcoloronechr(v, c, lp, rp, &lastsubcolor);
          return;
      }

      /* rats, need general case anyway... */
!     subcolorcvec(v, allcases(v, c), lp, rp);
  }

  /*
diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c
index 93da822..a515949 100644
*** a/src/backend/regex/regexport.c
--- b/src/backend/regex/regexport.c
***************
*** 28,37 ****

  #include "regex/regexport.h"

- static void scancolormap(struct colormap * cm, int co,
-              union tree * t, int level, chr partial,
-              pg_wchar **chars, int *chars_len);
-

  /*
   * Get total number of NFA states.
--- 28,33 ----
*************** pg_reg_colorisend(const regex_t *regex,
*** 187,196 ****
   *
   * Note: we return -1 if the color number is invalid, or if it is a special
   * color (WHITE or a pseudocolor), or if the number of members is uncertain.
!  * The latter case cannot arise right now but is specified to allow for future
!  * improvements (see musings about run-time handling of higher character codes
!  * in regex/README).  Callers should not try to extract the members if -1 is
!  * returned.
   */
  int
  pg_reg_getnumcharacters(const regex_t *regex, int co)
--- 183,189 ----
   *
   * Note: we return -1 if the color number is invalid, or if it is a special
   * color (WHITE or a pseudocolor), or if the number of members is uncertain.
!  * Callers should not try to extract the members if -1 is returned.
   */
  int
  pg_reg_getnumcharacters(const regex_t *regex, int co)
*************** pg_reg_getnumcharacters(const regex_t *r
*** 205,211 ****
      if (cm->cd[co].flags & PSEUDO)        /* also pseudocolors (BOS etc) */
          return -1;

!     return cm->cd[co].nchrs;
  }

  /*
--- 198,215 ----
      if (cm->cd[co].flags & PSEUDO)        /* also pseudocolors (BOS etc) */
          return -1;

!     /*
!      * If the color appears anywhere in the high colormap, treat its number of
!      * members as uncertain.  In principle we could determine all the specific
!      * chrs corresponding to each such entry, but it would be expensive
!      * (particularly if character class tests are required) and it doesn't
!      * seem worth it.
!      */
!     if (cm->cd[co].nuchrs != 0)
!         return -1;
!
!     /* OK, return the known number of member chrs */
!     return cm->cd[co].nschrs;
  }

  /*
*************** pg_reg_getcharacters(const regex_t *rege
*** 222,227 ****
--- 226,232 ----
                       pg_wchar *chars, int chars_len)
  {
      struct colormap *cm;
+     chr            c;

      assert(regex != NULL && regex->re_magic == REMAGIC);
      cm = &((struct guts *) regex->re_guts)->cmap;
*************** pg_reg_getcharacters(const regex_t *rege
*** 231,292 ****
      if (cm->cd[co].flags & PSEUDO)
          return;

!     /* Recursively search the colormap tree */
!     scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len);
! }
!
! /*
!  * Recursively scan the colormap tree to find chrs belonging to color "co".
!  * See regex/README for info about the tree structure.
!  *
!  * t: tree block to scan
!  * level: level (from 0) of t
!  * partial: partial chr code for chrs within t
!  * chars, chars_len: output area
!  */
! static void
! scancolormap(struct colormap * cm, int co,
!              union tree * t, int level, chr partial,
!              pg_wchar **chars, int *chars_len)
! {
!     int            i;
!
!     if (level < NBYTS - 1)
!     {
!         /* non-leaf node */
!         for (i = 0; i < BYTTAB; i++)
!         {
!             /*
!              * We do not support search for chrs of color 0 (WHITE), so
!              * all-white subtrees need not be searched.  These can be
!              * recognized because they are represented by the fill blocks in
!              * the colormap struct.  This typically allows us to avoid
!              * scanning large regions of higher-numbered chrs.
!              */
!             if (t->tptr[i] == &cm->tree[level + 1])
!                 continue;
!
!             /* Recursively scan next level down */
!             scancolormap(cm, co,
!                          t->tptr[i], level + 1,
!                          (partial | (chr) i) << BYTBITS,
!                          chars, chars_len);
!         }
!     }
!     else
      {
!         /* leaf node */
!         for (i = 0; i < BYTTAB; i++)
          {
!             if (t->tcolor[i] == co)
!             {
!                 if (*chars_len > 0)
!                 {
!                     **chars = partial | (chr) i;
!                     (*chars)++;
!                     (*chars_len)--;
!                 }
!             }
          }
      }
  }
--- 236,252 ----
      if (cm->cd[co].flags & PSEUDO)
          return;

!     /*
!      * We need only examine the low character map; there should not be any
!      * matching entries in the high map.
!      */
!     for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++)
      {
!         if (cm->locolormap[c - CHR_MIN] == co)
          {
!             *chars++ = c;
!             if (--chars_len == 0)
!                 break;
          }
      }
  }
diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c
index 04b6449..f338715 100644
*** a/src/backend/regex/regprefix.c
--- b/src/backend/regex/regprefix.c
*************** findprefix(struct cnfa * cnfa,
*** 194,200 ****
          if (thiscolor == COLORLESS)
              break;
          /* The color must be a singleton */
!         if (cm->cd[thiscolor].nchrs != 1)
              break;

          /*
--- 194,203 ----
          if (thiscolor == COLORLESS)
              break;
          /* The color must be a singleton */
!         if (cm->cd[thiscolor].nschrs != 1)
!             break;
!         /* Must not have any high-color-map entries */
!         if (cm->cd[thiscolor].nuchrs != 0)
              break;

          /*
diff --git a/src/include/regex/regcustom.h b/src/include/regex/regcustom.h
index 459851a..04a1893 100644
*** a/src/include/regex/regcustom.h
--- b/src/include/regex/regcustom.h
*************** typedef unsigned uchr;            /* unsigned typ
*** 77,82 ****
--- 77,92 ----
   */
  #define CHR_IS_IN_RANGE(c)    ((c) <= CHR_MAX)

+ /*
+  * MAX_SIMPLE_CHR is the cutoff between "simple" and "complicated" processing
+  * in the color map logic.  It should usually be chosen high enough to ensure
+  * that all common characters are <= MAX_SIMPLE_CHR.  However, very large
+  * values will be counterproductive since they cause more regex setup time.
+  * Also, small values can be helpful for testing the high-color-map logic
+  * with plain old ASCII input.
+  */
+ #define MAX_SIMPLE_CHR    0x7FF    /* suitable value for Unicode */
+
  /* functions operating on chr */
  #define iscalnum(x) pg_wc_isalnum(x)
  #define iscalpha(x) pg_wc_isalpha(x)
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 2f89dc9..cc73db2 100644
*** a/src/include/regex/regex.h
--- b/src/include/regex/regex.h
*************** extern int    pg_regexec(regex_t *, const p
*** 172,177 ****
  extern int    pg_regprefix(regex_t *, pg_wchar **, size_t *);
  extern void pg_regfree(regex_t *);
  extern size_t pg_regerror(int, const regex_t *, char *, size_t);
- extern void pg_set_regex_collation(Oid collation);

  #endif   /* _REGEX_H_ */
--- 172,176 ----
diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h
index b0aa641..69816f1 100644
*** a/src/include/regex/regguts.h
--- b/src/include/regex/regguts.h
***************
*** 127,149 ****
  #define ISBSET(uv, sn)    ((uv)[(sn)/UBITS] & ((unsigned)1 << ((sn)%UBITS)))


-
- /*
-  * We dissect a chr into byts for colormap table indexing.  Here we define
-  * a byt, which will be the same as a byte on most machines...  The exact
-  * size of a byt is not critical, but about 8 bits is good, and extraction
-  * of 8-bit chunks is sometimes especially fast.
-  */
- #ifndef BYTBITS
- #define BYTBITS 8                /* bits in a byt */
- #endif
- #define BYTTAB    (1<<BYTBITS)    /* size of table with one entry per byt value */
- #define BYTMASK (BYTTAB-1)        /* bit mask for byt */
- #define NBYTS    ((CHRBITS+BYTBITS-1)/BYTBITS)
- /* the definition of GETCOLOR(), below, assumes NBYTS <= 4 */
-
-
-
  /*
   * As soon as possible, we map chrs into equivalence classes -- "colors" --
   * which are of much more manageable number.
--- 127,132 ----
*************** typedef short color;            /* colors of char
*** 153,194 ****
  #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
  #define COLORLESS    (-1)        /* impossible color */
  #define WHITE        0            /* default color, parent of all others */


-
- /*
-  * A colormap is a tree -- more precisely, a DAG -- indexed at each level
-  * by a byt of the chr, to map the chr to a color efficiently.  Because
-  * lower sections of the tree can be shared, it can exploit the usual
-  * sparseness of such a mapping table.  The tree is always NBYTS levels
-  * deep (in the past it was shallower during construction but was "filled"
-  * to full depth at the end of that); areas that are unaltered as yet point
-  * to "fill blocks" which are entirely WHITE in color.
-  *
-  * Leaf-level tree blocks are of type "struct colors", while upper-level
-  * blocks are of type "struct ptrs".  Pointers into the tree are generally
-  * declared as "union tree *" to be agnostic about what level they point to.
-  */
-
- /* the tree itself */
- struct colors
- {
-     color        ccolor[BYTTAB];
- };
- struct ptrs
- {
-     union tree *pptr[BYTTAB];
- };
- union tree
- {
-     struct colors colors;
-     struct ptrs ptrs;
- };
-
- /* use these pseudo-field names when dereferencing a "union tree" pointer */
- #define tcolor    colors.ccolor
- #define tptr    ptrs.pptr
-
  /*
   * Per-color data structure for the compile-time color machinery
   *
--- 136,144 ----
  #define MAX_COLOR    32767        /* max color (must fit in 'color' datatype) */
  #define COLORLESS    (-1)        /* impossible color */
  #define WHITE        0            /* default color, parent of all others */
+ /* Note: various places in the code know that WHITE is zero */


  /*
   * Per-color data structure for the compile-time color machinery
   *
*************** union tree
*** 203,228 ****
   */
  struct colordesc
  {
!     uchr        nchrs;            /* number of chars of this color */
      color        sub;            /* open subcolor, if any; or free-chain ptr */
  #define  NOSUB     COLORLESS        /* value of "sub" when no open subcolor */
      struct arc *arcs;            /* chain of all arcs of this color */
!     chr            firstchr;        /* char first assigned to this color */
      int            flags;            /* bit values defined next */
  #define  FREECOL 01                /* currently free */
  #define  PSEUDO  02                /* pseudocolor, no real chars */
  #define  UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
-     union tree *block;            /* block of solid color, if any */
  };

  /*
   * The color map itself
   *
!  * Much of the data in the colormap struct is only used at compile time.
!  * However, the bulk of the space usage is in the "tree" structure, so it's
!  * not clear that there's much point in converting the rest to a more compact
!  * form when compilation is finished.
   */
  struct colormap
  {
      int            magic;
--- 153,208 ----
   */
  struct colordesc
  {
!     int            nschrs;            /* number of simple chars of this color */
!     int            nuchrs;            /* number of upper map entries of this color */
      color        sub;            /* open subcolor, if any; or free-chain ptr */
  #define  NOSUB     COLORLESS        /* value of "sub" when no open subcolor */
      struct arc *arcs;            /* chain of all arcs of this color */
!     chr            firstchr;        /* simple char first assigned to this color */
      int            flags;            /* bit values defined next */
  #define  FREECOL 01                /* currently free */
  #define  PSEUDO  02                /* pseudocolor, no real chars */
  #define  UNUSEDCOLOR(cd) ((cd)->flags & FREECOL)
  };

  /*
   * The color map itself
   *
!  * This struct holds both data used only at compile time, and the chr to
!  * color mapping information, used at both compile and run time.  The latter
!  * is the bulk of the space, so it's not really worth separating out the
!  * compile-only portion.
!  *
!  * Ideally, the mapping data would just be an array of colors indexed by
!  * chr codes; but for large character sets that's impractical.  Fortunately,
!  * common characters have smaller codes, so we can use a simple array for chr
!  * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above
!  * that, without much loss of performance.  The "something more complex" is a
!  * 2-D array of color entries, where row indexes correspond to individual chrs
!  * or chr ranges that have been mentioned in the regex (with row zero
!  * representing all other chrs), and column indexes correspond to different
!  * sets of locale-dependent character classes such as "isalpha".  The
!  * classbits[k] entry is zero if we do not care about the k'th character class
!  * in this regex, and otherwise it is the bit to be OR'd into the column index
!  * if the character in question is a member of that class.  We find the color
!  * of a high-valued chr by identifying which colormaprange it is in to get
!  * the row index (use row zero if it's in none of them), identifying which of
!  * the interesting cclasses it's in to get the column index, and then indexing
!  * into the 2-D hicolormap array.
!  *
!  * The colormapranges are required to be nonempty, nonoverlapping, and to
!  * appear in increasing chr-value order.
   */
+
+ #define NUM_CCLASSES 13            /* must match data in regc_locale.c */
+
+ typedef struct colormaprange
+ {
+     chr            cmin;            /* range represents cmin..cmax inclusive */
+     chr            cmax;
+     int            rownum;            /* row index in hicolormap array (>= 1) */
+ } colormaprange;
+
  struct colormap
  {
      int            magic;
*************** struct colormap
*** 233,259 ****
      color        free;            /* beginning of free chain (if non-0) */
      struct colordesc *cd;        /* pointer to array of colordescs */
  #define  CDEND(cm)     (&(cm)->cd[(cm)->max + 1])
      /* If we need up to NINLINECDS, we store them here to save a malloc */
! #define  NINLINECDS  ((size_t)10)
      struct colordesc cdspace[NINLINECDS];
-     union tree    tree[NBYTS];    /* tree top, plus lower-level fill blocks */
  };

! /* optimization magic to do fast chr->color mapping */
! #define B0(c)    ((c) & BYTMASK)
! #define B1(c)    (((c)>>BYTBITS) & BYTMASK)
! #define B2(c)    (((c)>>(2*BYTBITS)) & BYTMASK)
! #define B3(c)    (((c)>>(3*BYTBITS)) & BYTMASK)
! #if NBYTS == 1
! #define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)])
! #endif
! /* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */
! #if NBYTS == 2
! #define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)])
! #endif
! #if NBYTS == 4
! #define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)])
! #endif


  /*
--- 213,239 ----
      color        free;            /* beginning of free chain (if non-0) */
      struct colordesc *cd;        /* pointer to array of colordescs */
  #define  CDEND(cm)     (&(cm)->cd[(cm)->max + 1])
+
+     /* mapping data for chrs <= MAX_SIMPLE_CHR: */
+     color       *locolormap;        /* simple array indexed by chr code */
+
+     /* mapping data for chrs > MAX_SIMPLE_CHR: */
+     int            classbits[NUM_CCLASSES];        /* see comment above */
+     int            numcmranges;    /* number of colormapranges */
+     colormaprange *cmranges;    /* ranges of high chrs */
+     color       *hicolormap;        /* 2-D array of color entries */
+     int            maxarrayrows;    /* number of array rows allocated */
+     int            hiarrayrows;    /* number of array rows in use */
+     int            hiarraycols;    /* number of array columns (2^N) */
+
      /* If we need up to NINLINECDS, we store them here to save a malloc */
! #define  NINLINECDS  ((size_t) 10)
      struct colordesc cdspace[NINLINECDS];
  };

! /* fetch color for chr; beware of multiple evaluation of c argument */
! #define GETCOLOR(cm, c) \
!     ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c))


  /*
*************** struct colormap
*** 264,269 ****
--- 244,254 ----
   * Representation of a set of characters.  chrs[] represents individual
   * code points, ranges[] represents ranges in the form min..max inclusive.
   *
+  * If the cvec represents a locale-specific character class, eg [[:alpha:]],
+  * then the chrs[] and ranges[] arrays contain only members of that class
+  * up to MAX_SIMPLE_CHR (inclusive).  cclasscode is set to regc_locale.c's
+  * code for the class, rather than being -1 as it is in an ordinary cvec.
+  *
   * Note that in cvecs gotten from newcvec() and intended to be freed by
   * freecvec(), both arrays of chrs are after the end of the struct, not
   * separately malloc'd; so chrspace and rangespace are effectively immutable.
*************** struct cvec
*** 276,281 ****
--- 261,267 ----
      int            nranges;        /* number of ranges (chr pairs) */
      int            rangespace;        /* number of ranges allocated in ranges[] */
      chr           *ranges;            /* pointer to vector of chr pairs */
+     int            cclasscode;        /* value of "enum classes", or -1 */
  };


*************** struct guts
*** 489,491 ****
--- 475,482 ----
      int            nlacons;        /* size of lacons[]; note that only slots
                                   * numbered 1 .. nlacons-1 are used */
  };
+
+
+ /* prototypes for functions that are exported from regcomp.c to regexec.c */
+ extern void pg_set_regex_collation(Oid collation);
+ extern color pg_reg_getcolor(struct colormap * cm, chr c);

Re: Better locale-specific-character-class handling for regexps

От

Heikki Linnakangas

Дата:

04 сентября 2016 г., 13:56:39

On 08/23/2016 03:54 AM, Tom Lane wrote:
> ! That's still not quite enough, though, because of locale-dependent
> ! character classes such as [[:alpha:]].  In Unicode locales these classes
> ! may have thousands of entries that are above MAX_SIMPLE_CHR, and we
> ! certainly don't want to be searching large colormaprange arrays at runtime.
> ! Nor do we even want to spend the time to initialize cvec structures that
> ! exhaustively describe all of those characters.  Our solution is to compute
> ! exact per-character colors at compile time only up to MAX_SIMPLE_CHR.
> ! For characters above that, we apply the <ctype.h> or <wctype.h> lookup
> ! functions at runtime for each locale-dependent character class used in the
> ! regex pattern, constructing a bitmap that describes which classes the
> ! runtime character belongs to.  The per-character-range data structure
> ! mentioned above actually holds, for each range, a separate color entry
> ! for each possible combination of character class properties.  That is,
> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
> ! whose rows correspond to character ranges that are explicitly mentioned
> ! in the input, and whose columns correspond to sets of relevant locale
> ! character classes.

I think that last sentence should say "whose rows correspond to 
character ranges that are explicitly mentioned in the *regex pattern*", 
rather than "in the input".

An example would be very helpful here. I had to read through this many 
times, until I understood it. I can easily come up with examples for 
character classes, but not for "high" character-ranges. The best I could 
come up with is to check if a characters belongs to some special group 
of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical 
symbol characters. In practice, I guess you will only see single 
characters in the colormaprange array, although we must of course cope 
with ranges too.

> +       /* this relies on WHITE being zero: */
> +       memset(cm->locolormap, WHITE,
> +                  (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));
> +
> +       memset(cm->classbits, 0, sizeof(cm->classbits));
> +       cm->numcmranges = 0;
> +       cm->cmranges = NULL;
> +       cm->maxarrayrows = 4;           /* arbitrary initial allocation */
> +       cm->hiarrayrows = 1;
> +       cm->hiarraycols = 1;
> +       cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color));
> +       if (cm->hicolormap == NULL)
> +       {
> +               CERR(REG_ESPACE);
> +               return;
> +       }
> +       /* initialize the "all other characters" row to WHITE */
> +       cm->hicolormap[0] = WHITE;

Is the comment correct? I don't see why this wouldn't work with "WHITE 
!= 0".

> !     /* Duplicate existing columns to the right, and increase ref counts */
> !     /* Must work downwards in the array because we realloc'd in place */
> !     for (r = cm->hiarrayrows - 1; r >= 0; r--)
> !     {
> !         color       *oldrowptr = &newarray[r * cm->hiarraycols];
> !         color       *newrowptr = &newarray[r * cm->hiarraycols * 2];
> !         color       *newrowptr2 = newrowptr + cm->hiarraycols;
>
> !         for (c = 0; c < cm->hiarraycols; c++)
> !         {
> !             color        co = oldrowptr[c];
> !
> !             newrowptr[c] = newrowptr2[c] = co;
> !             cm->cd[co].nuchrs++;
> !         }
> !     }

Perhaps "backwards" would be clearer than "downwards"? At least in my 
mental model, index 0 is the top row of an array, so "downwards" means 
0, 1, 2. I guess you meant downwards numerically, rather than visually, 
but it took me a moment to process that.

+1 for this patch in general. Some regression test cases would be nice.

- Heikki

Re: Better locale-specific-character-class handling for regexps

От

Tom Lane

Дата:

04 сентября 2016 г., 17:44:28

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array,
>> ! whose rows correspond to character ranges that are explicitly mentioned
>> ! in the input, and whose columns correspond to sets of relevant locale
>> ! character classes.

> I think that last sentence should say "whose rows correspond to 
> character ranges that are explicitly mentioned in the *regex pattern*", 
> rather than "in the input".

OK.  The pattern is the input under consideration by regcomp.c, but I see
your point that someone might be confused.

> An example would be very helpful here.

I'll see what I can do.

> I can easily come up with examples for 
> character classes, but not for "high" character-ranges. The best I could 
> come up with is to check if a characters belongs to some special group 
> of unicode characters, like U&'[\+01D100-\+01D1FF]' to check for musical 
> symbol characters.

Right, or someone might write the equivalent of [A-Z] in the Klingon
alphabet, or whatever --- it wouldn't necessarily have to use backslash
notation, it could be literal characters with large codes.

> In practice, I guess you will only see single 
> characters in the colormaprange array, although we must of course cope 
> with ranges too.

I suspect that's true; I'm not sure that there are many places in the
upper reaches of Unicode space where ranges of consecutive character
codes form natural units to search for.

>> +       /* this relies on WHITE being zero: */
>> +       memset(cm->locolormap, WHITE,
>> +                  (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color));

> Is the comment correct? I don't see why this wouldn't work with "WHITE 
> != 0".

The array entries are shorts not chars, so if WHITE were say 2, memset
would fill the array with 0x0202 rather than 2.  I suppose if WHITE were
say 0x0303 then it would accidentally work, but I don't think the comment
needs to get into that.

> Perhaps "backwards" would be clearer than "downwards"?

OK.

> +1 for this patch in general. Some regression test cases would be nice.

I'm not sure how to write such tests without introducing insurmountable
platform dependencies --- particularly on platforms with weak support for
UTF8 locales, such as OS X.  All the interesting cases require knowing
what iswalpha() etc will return for some high character codes.

What I did to test it during development was to set MAX_SIMPLE_CHR to
something in the ASCII range, so that the high-character-code paths could
be tested without making any assumptions about locale classifications for
non-ASCII characters.  I'm not sure that's a helpful idea for regression
testing purposes, though.

I guess I could follow the lead of collate.linux.utf8.sql and produce
a test that's only promised to pass on one platform with one encoding,
but I'm not terribly excited by that.  AFAIK that test file does not
get run at all in the buildfarm or in the wild.

Thanks for reviewing!
        regards, tom lane

Re: Better locale-specific-character-class handling for regexps

От

Tom Lane

Дата:

04 сентября 2016 г., 21:17:57

I wrote:
> I got tired of hearing complaints about the issue described in
> this thread:
> https://www.postgresql.org/message-id/flat/24241.1329347196%40sss.pgh.pa.us
> Here's a proposed fix.  I've not done extensive performance testing,
> but it seems to be as fast or faster than the old code in cases where
> there are not too many "large" characters in the input.  And, more
> to the point, it gets the right answer for such large characters.

I've now done some performance testing on this patch.  I ran the
attached scripts in cassert-off builds, using en_US.utf8 locale on
a RHEL6 box.  Each number is the best of 3 runs; all times in ms.

The first two test cases are meant to exercise regex compile speed; they
provide relatively short input strings, and each call provides a slightly
different pattern so as to defeat the compiled-pattern cache in
adt/regexp.c.

            HEAD        patched

compilespeed.sql    61959.793    32936.773
compilespeed09.sql    6411.957    5903.211

compilespeed.sql exercises compiling '\w' which of course is locale
dependent, while compilespeed09.sql compiles '[0-9]' which is not.
Since the patch sets MAX_SIMPLE_CHR to 0x7FF which is the same as
regc_pg_locale.c's old cutoff for how many characters to examine,
we must be doing about the same number of compile-time iswalnum()
probes in both old and new code.  The compile speedup must come entirely
from the fact that the new code uses a simple array for the low-char-codes
colormap, whereas the old code has a complicated multilevel colormap that
is more expensive to update.  Then the bigger win for compilespeed.sql
must be due to the fact that \w will require changing the colors of more
characters in the colormap.  I was not expecting to see such a big win,
but I'll take it ;-)

The remaining cases hold the pattern fixed (so it's not recompiled
each time) and are meant to measure the per-character scan speed
while executing a compiled regexp.

            HEAD        patched        ratio of per-char times

runspeedlo.sql        12951.588    12053.533    0.91
runspeedhi.sql        10410.140 *    36678.876    2.72 (estimated)
runspeednotlo.sql    12999.629    12130.439    0.91
runspeednothi.sql    15678.125    20053.861    1.36
dummy.sql        3444.696    3437.984    (to measure overhead)

* gives wrong answer

In the runspeedhi.sql case, HEAD thinks the first character doesn't match
the pattern, so it's not scanning the whole input, which makes that
measurement not useful.  We can assume though that the scan rate would
have been about the same as runspeednothi.sql, and the estimated per-char
ratio shown in the table is computed on that basis.

So the patched code's scan speed is about 9% better than HEAD for data
characters below MAX_SIMPLE_CHR, presumably because of the simpler color
map structure.  But it's noticeably worse for data chars above that, and
much worse if the pattern forces runtime iswalnum() tests.  That's what
I was expecting.

It's worth noticing that comparing runspeednotlo.sql and runspeednothi.sql
shows that HEAD is 28% worse per-character on the latter.  Those cases are
exactly the same so far as HEAD's regex engine is concerned, so I think
that that differential must be blamable entirely on the fact that we're
converting 2-byte UTF8 sequences into pg_wchars in the first case and
3-byte UTF8 sequences in the second case.  The patched code is also paying
that cost of course, so you should take it into account when comparing
numbers in different rows of this table.  And it's also an indication
that really, this code is pretty frickin fast --- the total time per
character is only circa 4x the extra logic to pick up and shift in
one more byte!

Overall I'm pretty happy with these results.  I think that most cases
will be as fast or faster than before, assuming that most data characters
are below MAX_SIMPLE_CHR for most use-cases.  In the worst cases, the
per-character cost can be several times what it was before, but those
are exactly the same cases where *we gave the wrong answer* before, so
it's hard to complain too much.

            regards, tom lane

\timing on
do $$
begin
  for i in 1..100000 loop
    perform 'aqaqaqaqaqaqaqaqaqaqaqaqaq' ~ ('\w\w\w\w\w\w\w\w\w\w\w\w\w\w' || i);
  end loop;
end$$;
\timing on
do $$
begin
  for i in 1..100000 loop
    perform 'a' ~ ('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]' || i);
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2037), 1000);
  pat text := '^\w*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^\w*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2037), 1000);
  pat text := '^[^x]*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^[^x]*$';
begin
  raise notice 'result of txt ~ pat = %', txt ~ pat;
  for i in 1..1000000 loop
    perform txt ~ pat;
  end loop;
end$$;
\timing on
do $$
declare
  txt text := repeat(chr(2310), 1000);
  pat text := '^\w*$';
begin
  for i in 1..1000000 loop
    perform txt;
  end loop;
end$$;

Re: Better locale-specific-character-class handling for regexps

От

Heikki Linnakangas

Дата:

05 сентября 2016 г., 07:05:37

On 09/04/2016 08:44 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 08/23/2016 03:54 AM, Tom Lane wrote:
>> +1 for this patch in general. Some regression test cases would be nice.
>
> I'm not sure how to write such tests without introducing insurmountable
> platform dependencies --- particularly on platforms with weak support for
> UTF8 locales, such as OS X.  All the interesting cases require knowing
> what iswalpha() etc will return for some high character codes.
>
> What I did to test it during development was to set MAX_SIMPLE_CHR to
> something in the ASCII range, so that the high-character-code paths could
> be tested without making any assumptions about locale classifications for
> non-ASCII characters.  I'm not sure that's a helpful idea for regression
> testing purposes, though.
>
> I guess I could follow the lead of collate.linux.utf8.sql and produce
> a test that's only promised to pass on one platform with one encoding,
> but I'm not terribly excited by that.  AFAIK that test file does not
> get run at all in the buildfarm or in the wild.

I'm not too worried if the tests don't get run regularly, but I don't 
like the idea that only works on one platform. This code is unlikely to 
be accidentally broken by unrelated changes in the backend, as the 
regexp code is very well isolated. But for someone hacks on the regexp 
library in the future, having a test suite to tickle all these 
corner-cases would be very handy.

Another class of regressions would be that something changes in the way 
a locale treats some characters, and that breaks an application. That 
would be very difficult to test for in a platform-independent way. That 
wouldn't really our bug, though, but the locale's.

Since we're now de facto maintainers of this regexp library, and our 
version could be used somewhere else than PostgreSQL too, it would 
actually be nice to have a regression suite that's independent from the 
pg_regress infrastructure, and wouldn't need a server to run. Perhaps a 
stand-alone C program that compiles the regexp code with mock versions 
of pg_wc_is* functions. Or perhaps a magic collation OID that makes 
pg_wc_is* functions to return hard-coded values for particular inputs.

- Heikki

Re: Better locale-specific-character-class handling for regexps

От

Tom Lane

Дата:

05 сентября 2016 г., 16:10:44

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 09/04/2016 08:44 PM, Tom Lane wrote:
>> I guess I could follow the lead of collate.linux.utf8.sql and produce
>> a test that's only promised to pass on one platform with one encoding,
>> but I'm not terribly excited by that.  AFAIK that test file does not
>> get run at all in the buildfarm or in the wild.

> I'm not too worried if the tests don't get run regularly, but I don't 
> like the idea that only works on one platform.

Well, it would work on any platform that reports high Unicode letters
as letters.  The problem for putting this into the regular regression
tests is that the generic tests don't even assume UTF8 encoding, let
alone a Unicode-ish locale.

> Since we're now de facto maintainers of this regexp library, and our 
> version could be used somewhere else than PostgreSQL too, it would 
> actually be nice to have a regression suite that's independent from the 
> pg_regress infrastructure, and wouldn't need a server to run.

If anyone ever really picks up the challenge of making the regexp library
a standalone project, I think one of the first orders of business would be
to pull out the Tcl project's regexp-related regression tests.  There's a
pretty extensive set of tests written by Henry Spencer himself, and more
that they added over the years; it's far more comprehensive than our
tests.  (I've looked at stealing that test set in toto, but it requires
some debug APIs that we don't expose in SQL, and probably don't want to.)

In any case, this is getting very far afield from the current patch.
I'm willing to add a regexp.linux.ut8.sql test file if you think it's
important to have some canned tests that exercise this new code, but
otherwise I don't see any near-term solution.
        regards, tom lane

Re: Better locale-specific-character-class handling for regexps

От

Heikki Linnakangas

Дата:

05 сентября 2016 г., 17:21:07

On 09/05/2016 07:10 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 09/04/2016 08:44 PM, Tom Lane wrote:
>>> I guess I could follow the lead of collate.linux.utf8.sql and produce
>>> a test that's only promised to pass on one platform with one encoding,
>>> but I'm not terribly excited by that.  AFAIK that test file does not
>>> get run at all in the buildfarm or in the wild.
>
>> I'm not too worried if the tests don't get run regularly, but I don't
>> like the idea that only works on one platform.
>
> Well, it would work on any platform that reports high Unicode letters
> as letters.  The problem for putting this into the regular regression
> tests is that the generic tests don't even assume UTF8 encoding, let
> alone a Unicode-ish locale.

Ah, ok. I thought there were some more special requirements.

>> Since we're now de facto maintainers of this regexp library, and our
>> version could be used somewhere else than PostgreSQL too, it would
>> actually be nice to have a regression suite that's independent from the
>> pg_regress infrastructure, and wouldn't need a server to run.
>
> If anyone ever really picks up the challenge of making the regexp library
> a standalone project, I think one of the first orders of business would be
> to pull out the Tcl project's regexp-related regression tests.  There's a
> pretty extensive set of tests written by Henry Spencer himself, and more
> that they added over the years; it's far more comprehensive than our
> tests.  (I've looked at stealing that test set in toto, but it requires
> some debug APIs that we don't expose in SQL, and probably don't want to.)

Oh, interesting.

> In any case, this is getting very far afield from the current patch.
> I'm willing to add a regexp.linux.ut8.sql test file if you think it's
> important to have some canned tests that exercise this new code, but
> otherwise I don't see any near-term solution.

Ok, that'll do.

- Heikki

Re: Better locale-specific-character-class handling for regexps

От

Tom Lane

Дата:

05 сентября 2016 г., 17:43:13

Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 09/05/2016 07:10 PM, Tom Lane wrote:
>> In any case, this is getting very far afield from the current patch.
>> I'm willing to add a regexp.linux.ut8.sql test file if you think it's
>> important to have some canned tests that exercise this new code, but
>> otherwise I don't see any near-term solution.

> Ok, that'll do.

OK, I'll put something together.  Thanks again for reviewing.
        regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Better locale-specific-character-class handling for regexps