Re: Another regexp performance improvement: skip useless paren-captures

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: Another regexp performance improvement: skip useless paren-captures
Дата	10 августа 2021 г. 05:20:33
Msg-id	3756810.1628562033@sss.pgh.pa.us обсуждение исходный текст
Ответ на	Re: Another regexp performance improvement: skip useless paren-captures (Mark Dilger <mark.dilger@enterprisedb.com>)
Ответы	Re: Another regexp performance improvement: skip useless paren-captures (Mark Dilger <mark.dilger@enterprisedb.com>) Re: Another regexp performance improvement: skip useless paren-captures (Mark Dilger <mark.dilger@enterprisedb.com>)
Список	pgsql-hackers

Дерево обсуждения

Mark Dilger <mark.dilger@enterprisedb.com> writes:
> I ran a lot of tests with the patch, and the asserts have all cleared up, but I don't know how to think about the
userfacing differences.  If we had a good reason for raising an error for these back-references, maybe that'd be fine,
butit seems to just be an implementation detail. 

I thought about this some more, and I'm coming around to the idea that
throwing an error is the wrong thing.  As a contrary example, consider

    (.)|(\1\1)

We don't throw an error for this, and neither does Perl, even though
the capturing parens can never be defined in the branch where the
backrefs are.  So it seems hard to argue that this is okay but the
other thing isn't.  Another interesting example is

    (.){0}(\1){0}

I think that the correct interpretation is that this is a valid
regexp matching an empty string (i.e., zero repetitions of each
part), even though neither capture group will be defined.
That's different from

    (.){0}(\1)

which can never match.

So I took another look at the code, and it doesn't seem that hard
to make it act this way.  The attached passes regression, but
I've not beat on it with random strings.

            regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index ae3a7b6a38..d9840171a3 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -1089,11 +1089,23 @@ parseqatom(struct vars *v,
     /* annoying special case:  {0} or {0,0} cancels everything */
     if (m == 0 && n == 0)
     {
-        if (atom != NULL)
-            freesubre(v, atom);
-        if (atomtype == '(')
-            v->subs[subno] = NULL;
-        delsub(v->nfa, lp, rp);
+        /*
+         * If we had capturing subexpression(s) within the atom, we don't want
+         * to destroy them, because it's legal (if useless) to back-ref them
+         * later.  Hence, just unlink the atom from lp/rp and then ignore it.
+         */
+        if (atom != NULL && (atom->flags & CAP))
+        {
+            delsub(v->nfa, lp, atom->begin);
+            delsub(v->nfa, atom->end, rp);
+        }
+        else
+        {
+            /* Otherwise, we can clean up any subre infrastructure we made */
+            if (atom != NULL)
+                freesubre(v, atom);
+            delsub(v->nfa, lp, rp);
+        }
         EMPTYARC(lp, rp);
         return top;
     }
diff --git a/src/test/modules/test_regex/expected/test_regex.out b/src/test/modules/test_regex/expected/test_regex.out
index 5a6cdf47c2..96c0e6fa59 100644
--- a/src/test/modules/test_regex/expected/test_regex.out
+++ b/src/test/modules/test_regex/expected/test_regex.out
@@ -3506,6 +3506,21 @@ select * from test_regex('((.))(\2)', 'xyy', 'oRP');
  {yy,NULL,NULL,NULL}
 (2 rows)

+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+                         test_regex
+------------------------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX,REG_UEMPTYMATCH}
+ {"",NULL,NULL,NULL}
+(2 rows)
+
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');
+                 test_regex
+--------------------------------------------
+ {3,REG_UBACKREF,REG_UBOUNDS,REG_UNONPOSIX}
+(1 row)
+
 -- doing 22 "multicharacter collating elements"
 -- # again ugh
 -- MCCEs are not implemented in Postgres, so we skip all these tests
diff --git a/src/test/modules/test_regex/sql/test_regex.sql b/src/test/modules/test_regex/sql/test_regex.sql
index 3419564203..29806e9417 100644
--- a/src/test/modules/test_regex/sql/test_regex.sql
+++ b/src/test/modules/test_regex/sql/test_regex.sql
@@ -1020,6 +1020,10 @@ select * from test_regex('((.))(\2){0}', 'xy', 'RPQ');
 select * from test_regex('((.))(\2)', 'xyy', 'RP');
 -- expectMatch    21.38 oRP    ((.))(\2)    xyy    yy    {}    {}    {}
 select * from test_regex('((.))(\2)', 'xyy', 'oRP');
+-- expectMatch    21.39 NPQR    {((.)){0}(\2){0}}    xyz    {}    {}    {}    {}
+select * from test_regex('((.)){0}(\2){0}', 'xyz', 'NPQR');
+-- expectNomatch    21.40 PQR    {((.)){0}(\2)}    xxx
+select * from test_regex('((.)){0}(\2)', 'xxx', 'PQR');

 -- doing 22 "multicharacter collating elements"
 -- # again ugh

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Peter Smith
Дата: 10 августа 2021 г., 05:17:07
Сообщение: Re: Small documentation improvement for ALTER SUBSCRIPTION

Следующее

От: Masahiko Sawada
Дата: 10 августа 2021 г., 05:35:17
Сообщение: Re: [BUG] wrong refresh when ALTER SUBSCRIPTION ADD/DROP PUBLICATION

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Another regexp performance improvement: skip useless paren-captures

Предыдущее

Следующее