Обсуждение: Assert triggered during RE_compile_and_cache

Поиск
Список
Период
Сортировка

Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:
Tom,

while testing your patch in [1], I stumbled upon the following assert in master, without your patch applied:

+select 'vyrlvyrlwvqko' ~ '(?:(?:((.))))((\2)\1.){0,0}?';
+server closed the connection unexpectedly
+   This probably means the server terminated abnormally
+   before or while processing the request.
+connection to server was lost


The assert itself dates back to a commit of yours from 2003.  I don't think you did anything wrong here, but I wonder
ifyou might understand the code better than anyone else? 

The relevant bit of the stack trace is:

    frame #3: 0x00000001020eff7d postgres`ExceptionalCondition(conditionName=<unavailable>, errorType=<unavailable>,
fileName=<unavailable>,lineNumber=<unavailable>) at assert.c:69:2 [opt] 
    frame #4: 0x0000000101e2c682 postgres`delsub(nfa=0x00007fc7e8407360, lp=0x00007fc7e8831010, rp=0x00007fc7e8827328)
atregc_nfa.c:1265:2 [opt] 
    frame #5: 0x0000000101e28dd8 postgres`parsebranch [inlined] parseqatom(v=<unavailable>, stopper=101, type=112,
lp=<unavailable>,rp=0x00007fc7e8827328, top=<unavailable>) at regcomp.c:1084:3 [opt] 
    frame #6: 0x0000000101e27de9 postgres`parsebranch(v=0x00007ffeee0a80b8, stopper=101, type=<unavailable>,
left=<unavailable>,right=<unavailable>, partial=<unavailable>) at regcomp.c:759 [opt] 
    frame #7: 0x0000000101e29719 postgres`parsebranch [inlined] parseqatom(v=<unavailable>, stopper=101, type=112,
lp=<unavailable>,rp=0x00007fc7e8827328, top=<unavailable>) at regcomp.c:1282:23 [opt] 
    frame #8: 0x0000000101e29639 postgres`parsebranch(v=0x00007ffeee0a80b8, stopper=101, type=<unavailable>,
left=<unavailable>,right=<unavailable>, partial=<unavailable>) at regcomp.c:759 [opt] 
    frame #9: 0x0000000101e2457b postgres`parse(v=0x00007ffeee0a80b8, stopper=101, type=112, init=0x00007fc7e8827280,
final=0x00007fc7e88272b8)at regcomp.c:691:12 [opt] 
    frame #10: 0x0000000101e234d2 postgres`pg_regcomp(re=0x00007ffeee0a8208, string=<unavailable>, len=<unavailable>,
flags=<unavailable>,collation=<unavailable>) at regcomp.c:418:12 [opt] 
    frame #11: 0x0000000101f75790 postgres`RE_compile_and_cache(text_re=<unavailable>, cflags=<unavailable>,
collation=100)at regexp.c:186:19 [opt] 
    frame #12: 0x0000000101f75b90 postgres`textregexeq [inlined] RE_compile_and_execute(text_re=<unavailable>,
dat=<unavailable>,dat_len=13, cflags=3, collation=<unavailable>, nmatch=0, pmatch=<unavailable>) at regexp.c:351:7
[opt]

[1] https://www.postgresql.org/message-id/2219936.1628115334%40sss.pgh.pa.us

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 5, 2021, at 1:38 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> +select 'vyrlvyrlwvqko' ~ '(?:(?:((.))))((\2)\1.){0,0}?';

I've boiled it down a bit more:

+select '' ~ '()\1{0}';
+ ?column?
+----------
+ t
+(1 row)
+
+select '' ~ '()(\1){0}';
+ ?column?
+----------
+ t
+(1 row)
+
+select '' ~ '(())\2{0}';
+ ?column?
+----------
+ t
+(1 row)
+
+select '' ~ '(())(\2){0}';
+server closed the connection unexpectedly
+   This probably means the server terminated abnormally
+   before or while processing the request.
+connection to server was lost

Any ideas?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> +select '' ~ '(())(\2){0}';
> +server closed the connection unexpectedly

> Any ideas?

Huh.  This seems like some deficiency in the part of parseqatom
starting with

    /* annoying special case:  {0} or {0,0} cancels everything */
    if (m == 0 && n == 0)

but I don't immediately see what's different about your failing case
versus the not-failing ones.

            regards, tom lane



Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 5, 2021, at 3:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I don't immediately see what's different about your failing case
> versus the not-failing ones.

I have now found lots of cases of this failure.  I *believe* the backreference is always greater than 1, and it is
alwaysin a capture group which then has the {0} or {0,0} applied to it. 

You can find lots of cases using the attached regex generating script I whipped up for testing your work.  (Note this
isjust a quick and dirty tool for hacking, not anything refined.) 



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Вложения

Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I have now found lots of cases of this failure.  I *believe* the backreference is always greater than 1, and it is
alwaysin a capture group which then has the {0} or {0,0} applied to it. 

Hm.  I thought that this might be an old issue, but I'm not seeing the
crash in pre-v14 branches.  That means it's some bug introduced in
the performance improvements I did a few months ago.  Open item added.

            regards, tom lane



Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
I wrote:
> Hm.  I thought that this might be an old issue, but I'm not seeing the
> crash in pre-v14 branches.  That means it's some bug introduced in
> the performance improvements I did a few months ago.  Open item added.

git bisect says the trouble started with

ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4 is the first bad commit
commit ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Sat Feb 20 19:26:41 2021 -0500

    Avoid generating extra subre tree nodes for capturing parentheses.

I probably won't be able to dig deeper for a little while.

            regards, tom lane



Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
I wrote:
>> Hm.  I thought that this might be an old issue, but I'm not seeing the
>> crash in pre-v14 branches.  That means it's some bug introduced in
>> the performance improvements I did a few months ago.  Open item added.

> git bisect says the trouble started with
> ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4 is the first bad commit

Here's a draft fix for this.  The issue seems to be that parseqatom
records where a previous capture group is by saving a pointer to
the "subre" that parse() returned for it.  That was okay before
said commit because we didn't do anything further to that subre:
we immediately wrapped it into a parent subre, and then any further
hacking that parseqatom did was done on the parent subre.  But after
ea1268f63, in most cases we'd hack right on that same subre, thus
potentially modifying the portion of the NFA that would be copied
by a subsequent back-reference.

The particular thing that's asserting when you wrap {0} around such
a back-ref is just accidental road kill: it happens to notice that
the NFA is no longer consistent, because there's no path leading
from the start to the end of the copied sub-NFA, thanks to the copying
having been done between a pair of states that weren't actually
appropriate to reference.  What surprises me about this is not that
crash, but that we didn't see fundamental breakage of backref-using
patterns all over the place.  It evidently breaks only in corner
cases, but I'm not quite sure why the effects are so limited.

Anyway, the fix seems pretty straightforward.  We don't really need
anything about the subre as such except for its starting and ending
NFA states, so the attached modifies things to record only those
state pointers.  I'm not done testing this by any means, but it
does fix the two cases you showed, and I've been running that perl
script for some time with no further crashes.

I suspect the assertion you hit with the REG_NOSUB patch was just
further fallout from this same basic bug, but I've not yet rebased
that patch over this to check.

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 9f71177d31..afa6dd44c6 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -233,6 +233,13 @@ static int    cmp(const chr *, const chr *, size_t);
 static int    casecmp(const chr *, const chr *, size_t);


+/* info we need during compilation about a known capturing subexpression */
+struct subinfo
+{
+    struct state *left;            /* left end of its sub-NFA */
+    struct state *right;        /* right end of its sub-NFA */
+};
+
 /* internal variables, bundled for easy passing around */
 struct vars
 {
@@ -245,10 +252,10 @@ struct vars
     int            nexttype;        /* type of next token */
     chr            nextvalue;        /* value (if any) of next token */
     int            lexcon;            /* lexical context type (see regc_lex.c) */
-    int            nsubexp;        /* subexpression count */
-    struct subre **subs;        /* subRE pointer vector */
-    size_t        nsubs;            /* length of vector */
-    struct subre *sub10[10];    /* initial vector, enough for most */
+    int            nsubexp;        /* number of known capturing subexpressions */
+    struct subinfo *subs;        /* info about known capturing subexpressions */
+    size_t        nsubs;            /* allocated length of subs[] vector */
+    struct subinfo sub10[10];    /* initial vector, enough for most */
     struct nfa *nfa;            /* the NFA */
     struct colormap *cm;        /* character color map */
     color        nlcolor;        /* color of newline */
@@ -368,7 +375,7 @@ pg_regcomp(regex_t *re,
     v->subs = v->sub10;
     v->nsubs = 10;
     for (j = 0; j < v->nsubs; j++)
-        v->subs[j] = NULL;
+        v->subs[j].left = v->subs[j].right = NULL;
     v->nfa = NULL;
     v->cm = NULL;
     v->nlcolor = COLORLESS;
@@ -504,13 +511,13 @@ pg_regcomp(regex_t *re,
 }

 /*
- * moresubs - enlarge subRE vector
+ * moresubs - enlarge capturing-subexpressions vector
  */
 static void
 moresubs(struct vars *v,
          int wanted)            /* want enough room for this one */
 {
-    struct subre **p;
+    struct subinfo *p;
     size_t        n;

     assert(wanted > 0 && (size_t) wanted >= v->nsubs);
@@ -518,13 +525,13 @@ moresubs(struct vars *v,

     if (v->subs == v->sub10)
     {
-        p = (struct subre **) MALLOC(n * sizeof(struct subre *));
+        p = (struct subinfo *) MALLOC(n * sizeof(struct subinfo));
         if (p != NULL)
             memcpy(VS(p), VS(v->subs),
-                   v->nsubs * sizeof(struct subre *));
+                   v->nsubs * sizeof(struct subinfo));
     }
     else
-        p = (struct subre **) REALLOC(v->subs, n * sizeof(struct subre *));
+        p = (struct subinfo *) REALLOC(v->subs, n * sizeof(struct subinfo));
     if (p == NULL)
     {
         ERR(REG_ESPACE);
@@ -532,7 +539,7 @@ moresubs(struct vars *v,
     }
     v->subs = p;
     for (p = &v->subs[v->nsubs]; v->nsubs < n; p++, v->nsubs++)
-        *p = NULL;
+        p->left = p->right = NULL;
     assert(v->nsubs == n);
     assert((size_t) wanted < v->nsubs);
 }
@@ -982,8 +989,10 @@ parseqatom(struct vars *v,
             NOERR();
             if (cap)
             {
-                assert(v->subs[subno] == NULL);
-                v->subs[subno] = atom;
+                /* save the sub-NFA's endpoints for future backrefs to use */
+                assert(v->subs[subno].left == NULL);
+                v->subs[subno].left = s;
+                v->subs[subno].right = s2;
                 if (atom->capno == 0)
                 {
                     /* normal case: just mark the atom as capturing */
@@ -1005,7 +1014,7 @@ parseqatom(struct vars *v,
         case BACKREF:            /* the Feature From The Black Lagoon */
             INSIST(type != LACON, REG_ESUBREG);
             INSIST(v->nextvalue < v->nsubs, REG_ESUBREG);
-            INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG);
+            INSIST(v->subs[v->nextvalue].left != NULL, REG_ESUBREG);
             NOERR();
             assert(v->nextvalue > 0);
             atom = subre(v, 'b', BACKR, lp, rp);
@@ -1080,7 +1089,7 @@ parseqatom(struct vars *v,
         if (atom != NULL)
             freesubre(v, atom);
         if (atomtype == '(')
-            v->subs[subno] = NULL;
+            v->subs[subno].left = v->subs[subno].right = NULL;
         delsub(v->nfa, lp, rp);
         EMPTYARC(lp, rp);
         return;
@@ -1173,14 +1182,14 @@ parseqatom(struct vars *v,
     {
         assert(atom->begin->nouts == 1);    /* just the EMPTY */
         delsub(v->nfa, atom->begin, atom->end);
-        assert(v->subs[subno] != NULL);
+        assert(v->subs[subno].left != NULL);

         /*
          * And here's why the recursion got postponed: it must wait until the
          * skeleton is filled in, because it may hit a backref that wants to
          * copy the filled-in skeleton.
          */
-        dupnfa(v->nfa, v->subs[subno]->begin, v->subs[subno]->end,
+        dupnfa(v->nfa, v->subs[subno].left, v->subs[subno].right,
                atom->begin, atom->end);
         NOERR();

diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 01d50ec1e3..44da7d2019 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3468,6 +3468,14 @@ select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M'
  {" TO foo",foo,o,NULL}
 (2 rows)

+-- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
+select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
+                 test_regex
+--------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX}
+ {x,x,x,NULL}
+(2 rows)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 7f5bc6e418..9224fdfdd3 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1009,6 +1009,8 @@ select * from test_regex('(.*).*', 'abc', 'N');
 select * from test_regex('(a*)*', 'bc', 'N');
 -- expectMatch    21.35 M        { TO (([a-z0-9._]+|"([^"]+|"")+")+)}    {asd TO foo}    { TO foo} foo o {}
 select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M');
+-- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
+select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
I wrote:
> ... What surprises me about this is not that
> crash, but that we didn't see fundamental breakage of backref-using
> patterns all over the place.  It evidently breaks only in corner
> cases, but I'm not quite sure why the effects are so limited.

Ah-hah, I've sussed it.  I'd supposed that the issue was that parseqatom
could replace the subre atom's endpoints in the bit where it starts
to "prepare a general-purpose state skeleton".  However, it seems that
that's not what makes it go off the rails.  Rather, in the specific
combination we've got in the test case, the next outer recursion level
of parseqatom() decides that it can free the subre that was returned
by the inner level, *which is the same subre that v->subs[n] is pointing
to*.  So this is really a use-after-free problem.  freesrnode() won't
actually free() the node, just stick it back onto the chain for possible
recycling --- and it doesn't clear the state pointer fields when it
does that.  So there's no use-after-free that ordinary tools could detect.
We only see a problem manifest if the subre node is reused later, which
explains why '(())(\2){0}' fails while '(())\2{0}' does not.

So the problem boils down to parseqatom deciding it can free child
subre nodes that it knows little about the provenance of.  It would
be safer to make that optimization by instead freeing the "top"
node passed in from parsebranch, which we know has got no other
interesting linkages.  That requires tweaking the API of parseqatom,
which why I'd not done it like that to begin with --- but that's not
a hard or complicated change, just a mite tedious.

Hence, the attached patch.

While this is sufficient to make the problem go away, I'm
inclined to apply both changesets.  Even if it accidentally
works right now to have later backrefs consult the outer s/s2
state pair rather than the original pair, that seems far too
convoluted and bug-prone.  The outer states should be strictly
the concern of the iteration setup logic in the outer invocation
of parseqatom.

(I'm sort of wondering now whether the outer s/s2 states are
even really necessary anymore ... maybe Spencer put those in
as a way of preventing some prehistoric version of this bug.
But I'm not excited about messing with that right now.)

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 9f71177d31..3d7f11af8c 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -43,7 +43,7 @@ static int    freev(struct vars *, int);
 static void makesearch(struct vars *, struct nfa *);
 static struct subre *parse(struct vars *, int, int, struct state *, struct state *);
 static struct subre *parsebranch(struct vars *, int, int, struct state *, struct state *, int);
-static void parseqatom(struct vars *, int, int, struct state *, struct state *, struct subre *);
+static struct subre *parseqatom(struct vars *, int, int, struct state *, struct state *, struct subre *);
 static void nonword(struct vars *, int, struct state *, struct state *);
 static void word(struct vars *, int, struct state *, struct state *);
 static void charclass(struct vars *, enum char_classes,
@@ -756,7 +756,7 @@ parsebranch(struct vars *v,
         seencontent = 1;

         /* NB, recursion in parseqatom() may swallow rest of branch */
-        parseqatom(v, stopper, type, lp, right, t);
+        t = parseqatom(v, stopper, type, lp, right, t);
         NOERRN();
     }

@@ -777,8 +777,12 @@ parsebranch(struct vars *v,
  * The bookkeeping near the end cooperates very closely with parsebranch();
  * in particular, it contains a recursion that can involve parsing the rest
  * of the branch, making this function's name somewhat inaccurate.
+ *
+ * Usually, the return value is just "top", but in some cases where we
+ * have parsed the rest of the branch, we may deem "top" redundant and
+ * free it, returning some child subre instead.
  */
-static void
+static struct subre *
 parseqatom(struct vars *v,
            int stopper,            /* EOS or ')' */
            int type,            /* LACON (lookaround subRE) or PLAIN */
@@ -818,84 +822,84 @@ parseqatom(struct vars *v,
             if (v->cflags & REG_NLANCH)
                 ARCV(BEHIND, v->nlcolor);
             NEXT();
-            return;
+            return top;
             break;
         case '$':
             ARCV('$', 1);
             if (v->cflags & REG_NLANCH)
                 ARCV(AHEAD, v->nlcolor);
             NEXT();
-            return;
+            return top;
             break;
         case SBEGIN:
             ARCV('^', 1);        /* BOL */
             ARCV('^', 0);        /* or BOS */
             NEXT();
-            return;
+            return top;
             break;
         case SEND:
             ARCV('$', 1);        /* EOL */
             ARCV('$', 0);        /* or EOS */
             NEXT();
-            return;
+            return top;
             break;
         case '<':
             wordchrs(v);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             nonword(v, BEHIND, lp, s);
             word(v, AHEAD, s, rp);
             NEXT();
-            return;
+            return top;
             break;
         case '>':
             wordchrs(v);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             word(v, BEHIND, lp, s);
             nonword(v, AHEAD, s, rp);
             NEXT();
-            return;
+            return top;
             break;
         case WBDRY:
             wordchrs(v);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             nonword(v, BEHIND, lp, s);
             word(v, AHEAD, s, rp);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             word(v, BEHIND, lp, s);
             nonword(v, AHEAD, s, rp);
             NEXT();
-            return;
+            return top;
             break;
         case NWBDRY:
             wordchrs(v);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             word(v, BEHIND, lp, s);
             word(v, AHEAD, s, rp);
             s = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             nonword(v, BEHIND, lp, s);
             nonword(v, AHEAD, s, rp);
             NEXT();
-            return;
+            return top;
             break;
         case LACON:                /* lookaround constraint */
             latype = v->nextvalue;
             NEXT();
             s = newstate(v->nfa);
             s2 = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             t = parse(v, ')', LACON, s, s2);
             freesubre(v, t);    /* internal structure irrelevant */
-            NOERR();
+            NOERRN();
             assert(SEE(')'));
             NEXT();
             processlacon(v, s, s2, latype, lp, rp);
-            return;
+            return top;
             break;
             /* then errors, to get them out of the way */
         case '*':
@@ -903,18 +907,18 @@ parseqatom(struct vars *v,
         case '?':
         case '{':
             ERR(REG_BADRPT);
-            return;
+            return top;
             break;
         default:
             ERR(REG_ASSERT);
-            return;
+            return top;
             break;
             /* then plain characters, and minor variants on that theme */
         case ')':                /* unbalanced paren */
             if ((v->cflags & REG_ADVANCED) != REG_EXTENDED)
             {
                 ERR(REG_EPAREN);
-                return;
+                return top;
             }
             /* legal in EREs due to specification botch */
             NOTE(REG_UPBOTCH);
@@ -923,7 +927,7 @@ parseqatom(struct vars *v,
         case PLAIN:
             onechr(v, v->nextvalue, lp, rp);
             okcolors(v->nfa, v->cm);
-            NOERR();
+            NOERRN();
             NEXT();
             break;
         case '[':
@@ -972,14 +976,14 @@ parseqatom(struct vars *v,
              */
             s = newstate(v->nfa);
             s2 = newstate(v->nfa);
-            NOERR();
+            NOERRN();
             EMPTYARC(lp, s);
             EMPTYARC(s2, rp);
-            NOERR();
+            NOERRN();
             atom = parse(v, ')', type, s, s2);
             assert(SEE(')') || ISERR());
             NEXT();
-            NOERR();
+            NOERRN();
             if (cap)
             {
                 assert(v->subs[subno] == NULL);
@@ -994,7 +998,7 @@ parseqatom(struct vars *v,
                 {
                     /* generate no-op wrapper node to handle "((x))" */
                     t = subre(v, '(', atom->flags | CAP, lp, rp);
-                    NOERR();
+                    NOERRN();
                     t->capno = subno;
                     t->child = atom;
                     atom = t;
@@ -1006,10 +1010,10 @@ parseqatom(struct vars *v,
             INSIST(type != LACON, REG_ESUBREG);
             INSIST(v->nextvalue < v->nsubs, REG_ESUBREG);
             INSIST(v->subs[v->nextvalue] != NULL, REG_ESUBREG);
-            NOERR();
+            NOERRN();
             assert(v->nextvalue > 0);
             atom = subre(v, 'b', BACKR, lp, rp);
-            NOERR();
+            NOERRN();
             subno = v->nextvalue;
             atom->backno = subno;
             EMPTYARC(lp, rp);    /* temporarily, so there's something */
@@ -1050,7 +1054,7 @@ parseqatom(struct vars *v,
                 if (m > n)
                 {
                     ERR(REG_BADBR);
-                    return;
+                    return top;
                 }
                 /* {m,n} exercises preference, even if it's {m,m} */
                 qprefer = (v->nextvalue) ? LONGER : SHORTER;
@@ -1064,7 +1068,7 @@ parseqatom(struct vars *v,
             if (!SEE('}'))
             {                    /* catches errors too */
                 ERR(REG_BADBR);
-                return;
+                return top;
             }
             NEXT();
             break;
@@ -1083,7 +1087,7 @@ parseqatom(struct vars *v,
             v->subs[subno] = NULL;
         delsub(v->nfa, lp, rp);
         EMPTYARC(lp, rp);
-        return;
+        return top;
     }

     /* if not a messy case, avoid hard part */
@@ -1096,7 +1100,7 @@ parseqatom(struct vars *v,
         if (atom != NULL)
             freesubre(v, atom);
         top->flags = f;
-        return;
+        return top;
     }

     /*
@@ -1110,7 +1114,7 @@ parseqatom(struct vars *v,
     if (atom == NULL)
     {
         atom = subre(v, '=', 0, lp, rp);
-        NOERR();
+        NOERRN();
     }

     /*----------
@@ -1131,20 +1135,20 @@ parseqatom(struct vars *v,
      */
     s = newstate(v->nfa);        /* first, new endpoints for the atom */
     s2 = newstate(v->nfa);
-    NOERR();
+    NOERRN();
     moveouts(v->nfa, lp, s);
     moveins(v->nfa, rp, s2);
-    NOERR();
+    NOERRN();
     atom->begin = s;
     atom->end = s2;
     s = newstate(v->nfa);        /* set up starting state */
-    NOERR();
+    NOERRN();
     EMPTYARC(lp, s);
-    NOERR();
+    NOERRN();

     /* break remaining subRE into x{...} and what follows */
     t = subre(v, '.', COMBINE(qprefer, atom->flags), lp, rp);
-    NOERR();
+    NOERRN();
     t->child = atom;
     atomp = &t->child;

@@ -1163,7 +1167,7 @@ parseqatom(struct vars *v,
      */
     assert(top->op == '=' && top->child == NULL);
     top->child = subre(v, '=', top->flags, top->begin, lp);
-    NOERR();
+    NOERRN();
     top->op = '.';
     top->child->sibling = t;
     /* top->flags will get updated later */
@@ -1182,11 +1186,11 @@ parseqatom(struct vars *v,
          */
         dupnfa(v->nfa, v->subs[subno]->begin, v->subs[subno]->end,
                atom->begin, atom->end);
-        NOERR();
+        NOERRN();

         /* The backref node's NFA should not enforce any constraints */
         removeconstraints(v->nfa, atom->begin, atom->end);
-        NOERR();
+        NOERRN();
     }

     /*
@@ -1226,7 +1230,7 @@ parseqatom(struct vars *v,
         repeat(v, atom->begin, atom->end, m, n);
         f = COMBINE(qprefer, atom->flags);
         t = subre(v, '=', f, atom->begin, atom->end);
-        NOERR();
+        NOERRN();
         freesubre(v, atom);
         *atomp = t;
         /* rest of branch can be strung starting from t->end */
@@ -1247,9 +1251,9 @@ parseqatom(struct vars *v,
         repeat(v, s, atom->begin, m - 1, (n == DUPINF) ? n : n - 1);
         f = COMBINE(qprefer, atom->flags);
         t = subre(v, '.', f, s, atom->end); /* prefix and atom */
-        NOERR();
+        NOERRN();
         t->child = subre(v, '=', PREF(f), s, atom->begin);
-        NOERR();
+        NOERRN();
         t->child->sibling = atom;
         *atomp = t;
         /* rest of branch can be strung starting from atom->end */
@@ -1259,14 +1263,14 @@ parseqatom(struct vars *v,
     {
         /* general case: need an iteration node */
         s2 = newstate(v->nfa);
-        NOERR();
+        NOERRN();
         moveouts(v->nfa, atom->end, s2);
-        NOERR();
+        NOERRN();
         dupnfa(v->nfa, atom->begin, atom->end, s, s2);
         repeat(v, s, s2, m, n);
         f = COMBINE(qprefer, atom->flags);
         t = subre(v, '*', f, s, s2);
-        NOERR();
+        NOERRN();
         t->min = (short) m;
         t->max = (short) n;
         t->child = atom;
@@ -1280,7 +1284,7 @@ parseqatom(struct vars *v,
     {
         /* parse all the rest of the branch, and insert in t->child->sibling */
         t->child->sibling = parsebranch(v, stopper, type, s2, rp, 1);
-        NOERR();
+        NOERRN();
         assert(SEE('|') || SEE(stopper) || SEE(EOS));

         /* here's the promised update of the flags */
@@ -1299,9 +1303,7 @@ parseqatom(struct vars *v,
          *
          * If the messy atom was the first thing in the branch, then
          * top->child is vacuous and we can get rid of one level of
-         * concatenation.  Since the caller is holding a pointer to the top
-         * node, we can't remove that node; but we're allowed to change its
-         * properties.
+         * concatenation.
          */
         assert(top->child->op == '=');
         if (top->child->begin == top->child->end)
@@ -1351,21 +1353,13 @@ parseqatom(struct vars *v,
         {
             assert(!MESSY(top->child->flags));
             t = top->child->sibling;
-            freesubre(v, top->child);
-            top->op = t->op;
-            top->flags = t->flags;
-            top->latype = t->latype;
-            top->id = t->id;
-            top->capno = t->capno;
-            top->backno = t->backno;
-            top->min = t->min;
-            top->max = t->max;
-            top->child = t->child;
-            top->begin = t->begin;
-            top->end = t->end;
-            freesrnode(v, t);
+            top->child->sibling = NULL;
+            freesubre(v, top);
+            top = t;
         }
     }
+
+    return top;
 }

 /*
@@ -2109,7 +2103,9 @@ freesrnode(struct vars *v,        /* might be NULL */

     if (!NULLCNFA(sr->cnfa))
         freecnfa(&sr->cnfa);
-    sr->flags = 0;
+    sr->flags = 0;                /* in particular, not INUSE */
+    sr->child = sr->sibling = NULL;
+    sr->begin = sr->end = NULL;

     if (v != NULL && v->treechain != NULL)
     {
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 01d50ec1e3..44da7d2019 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3468,6 +3468,14 @@ select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M'
  {" TO foo",foo,o,NULL}
 (2 rows)

+-- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
+select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
+                 test_regex
+--------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX}
+ {x,x,x,NULL}
+(2 rows)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 7f5bc6e418..9224fdfdd3 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1009,6 +1009,8 @@ select * from test_regex('(.*).*', 'abc', 'N');
 select * from test_regex('(a*)*', 'bc', 'N');
 -- expectMatch    21.35 M        { TO (([a-z0-9._]+|"([^"]+|"")+")+)}    {asd TO foo}    { TO foo} foo o {}
 select * from test_regex(' TO (([a-z0-9._]+|"([^"]+|"")+")+)', 'asd TO foo', 'M');
+-- expectMatch    21.36 RPQ    ((.))(\2){0}    xy    x    x    x    {}
+select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
I wrote:
> While this is sufficient to make the problem go away, I'm
> inclined to apply both changesets.  Even if it accidentally
> works right now to have later backrefs consult the outer s/s2
> state pair rather than the original pair, that seems far too
> convoluted and bug-prone.  The outer states should be strictly
> the concern of the iteration setup logic in the outer invocation
> of parseqatom.

> (I'm sort of wondering now whether the outer s/s2 states are
> even really necessary anymore ... maybe Spencer put those in
> as a way of preventing some prehistoric version of this bug.
> But I'm not excited about messing with that right now.)

I realized that the earlier patch is actually a bad idea, because
it breaks the possibility of updating the subre to mark it as
being referenced by a later backref, as the REG_NOSUB patch needs
to do.  However, on closer study, the outer s/s2 states being
added by the "prepare a general-purpose state skeleton" stanza
really are duplicative of the ones we already made for a
parenthesized atom, so we can just get rid of them.  Done that
way.

            regards, tom lane



Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 8, 2021, at 9:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I realized that the earlier patch is actually a bad idea

Applying only your <fix-backref-corner-case-1.patch> to master, the following which did not crash begins to crash:

 select regexp_split_to_array('rjgykkkk', '(?:(.))((\1\1.))');

It also changes the results for another one:

 select regexp_split_to_array('jdpveoarcnsarcnsarcnszieqbqbqbqbiufdlywphbnrxtdoboouuzcqiqmenj', '()((.)){1}?\3?',
'itx');
-                                                                                     regexp_split_to_array
                                                                         

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-
{"","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""}
+                      regexp_split_to_array
+------------------------------------------------------------------
+ {jdpveoarcnsarcnsarcnszieqbqbqbqbiufdlywphbnrxtdoboouuzcqiqmenj}

I'm not sure what *should* be returned here, only that it is a behavioral change.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 7, 2021, at 6:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> That requires tweaking the API of parseqatom,
> which why I'd not done it like that to begin with --- but that's not
> a hard or complicated change, just a mite tedious.
>
> Hence, the attached patch.

Applying your <alternate-backref-corner-case-fix-1.patch> to master changes the outcome of some regular expression
queries,but I *think* it changes them for the better. 

These three look to me exactly correct after the patch, and wrong before:

 select regexp_matches('vnrungnajjjgkaaeaper', '((.))(((\1)))((?:\5..))', 'mx');
- regexp_matches
-----------------
-(0 rows)
+ regexp_matches
+-----------------
+ {j,j,j,j,j,jgk}
+(1 row)

 select regexp_match('kgkgeganlifykxhfspjtgluwluwluwdfdfbbdjvrxjvrxedndrkaxxvbtqdj', '((.))\2');
  regexp_match
 --------------
-
+ {b,b}
 (1 row)

 select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
  regexp_split_to_array
 -----------------------
- {uuuzkodphfbfbfb}
+ {"",zkodphfbfbfb}
 (1 row)

But these next two look to me correct before the patch and wrong after:

 select regexp_matches('ircecpbgyiggvtruqgxzigxzigxzisdbkuhbkuhrvl', '(((.)))(?:(\3))[^\f]');
  regexp_matches
 ----------------
-(0 rows)
+ {g,g,g,g}
+(1 row)

 select regexp_matches('fhgxnvbvjaej', '(((.)).)((\3)((.)))', 'csx');
- regexp_matches
-----------------
-(0 rows)
+  regexp_matches
+-------------------
+ {vb,v,v,vj,v,j,j}
+(1 row)

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 8, 2021, at 10:15 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> But these next two look to me correct before the patch and wrong after:
>
> select regexp_matches('ircecpbgyiggvtruqgxzigxzigxzisdbkuhbkuhrvl', '(((.)))(?:(\3))[^\f]');
>  regexp_matches
> ----------------
> -(0 rows)
> + {g,g,g,g}
> +(1 row)
>
> select regexp_matches('fhgxnvbvjaej', '(((.)).)((\3)((.)))', 'csx');
> - regexp_matches
> -----------------
> -(0 rows)
> +  regexp_matches
> +-------------------
> + {vb,v,v,vj,v,j,j}
> +(1 row)

Scratch that.  On further thought, these also look correct.  I wasn't doing the capturing in my head correctly.  So I
thinkthe patch is working!  I'll test a bit longer. 

Is this patch (alternate-backref-corner-case-fix-1.patch) the current state of your patch set?  I'd like to be sure I'm
testingyour latest work.  Thanks. 

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Applying your <alternate-backref-corner-case-fix-1.patch> to master changes the outcome of some regular expression
queries,but I *think* it changes them for the better. 

[ squint... ] You sure you applied the patch properly?  All these examples
give me the same results in HEAD (with said patches as-committed) and v13.

            regards, tom lane



Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 8, 2021, at 10:29 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> Applying your <alternate-backref-corner-case-fix-1.patch> to master changes the outcome of some regular expression
queries,but I *think* it changes them for the better. 
>
> [ squint... ] You sure you applied the patch properly?  All these examples
> give me the same results in HEAD (with said patches as-committed) and v13.
>
>             regards, tom lane

I'm not sure what you mean by "as-committed".  Did you commit one of these recently?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 8, 2021, at 10:32 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I'm not sure what you mean by "as-committed".  Did you commit one of these recently?

Nevermind, I see the commit now.

I'll rerun my tests with the new master.  I was still using the code that I pulled yesterday.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Mark Dilger
Дата:

> On Aug 8, 2021, at 10:34 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I'll rerun my tests with the new master.  I was still using the code that I pulled yesterday.

I am now testing REL_13_STABLE (ba9f665a4413f855bbf4b544481db326f7b9bd73) vs master
(c1132aae336c41cf9d316222e525d8d593c2b5d2). The regular expressions we discussed upthread show no differences. 

There are a few changes which appear correct to me, but I don't know if you expected them:

 select regexp_match('hko', '((((?<!.))))(.(\3)(\1.))');
- regexp_match
---------------
-
+    regexp_match
+--------------------
+ {"","","",hk,"",k}
 (1 row)

 select 'nigqvigqvee' !~ '(())((\2)\2)(?:([^\W]))';
-ERROR:  invalid regular expression: invalid escape \ sequence
+ ?column?
+----------
+ f
+(1 row)
+

 select regexp_replace('tl', '.{0}?(((?<![^][^]{1}?)))(.)\2{6}?', 'jx', 'mt');
  regexp_replace
 ----------------
- tl
+ jxl
 (1 row)


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: Assert triggered during RE_compile_and_cache

От
Tom Lane
Дата:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> There are a few changes which appear correct to me, but I don't know if you expected them:

>  select regexp_match('hko', '((((?<!.))))(.(\3)(\1.))');
> - regexp_match
> ---------------
> -
> +    regexp_match
> +--------------------
> + {"","","",hk,"",k}
>  (1 row)

Yes, this one is a consequence of 4aea704a5 (Fix semantics of regular
expression back-references).  The lookbehind constraint should not be
applied within the back-ref.

>  select 'nigqvigqvee' !~ '(())((\2)\2)(?:([^\W]))';
> -ERROR:  invalid regular expression: invalid escape \ sequence
> + ?column?
> +----------
> + f
> +(1 row)
> +

and this one of 2a0af7fe4 (Allow complemented character class escapes
within regex brackets).

>  select regexp_replace('tl', '.{0}?(((?<![^][^]{1}?)))(.)\2{6}?', 'jx', 'mt');
>   regexp_replace
>  ----------------
> - tl
> + jxl
>  (1 row)

This looks like also a consequence of 4aea704a5.

            regards, tom lane