Re: Pathological regexp match

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Pathological regexp match
Дата
Msg-id 25719.1264835266@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Pathological regexp match  (Michael Glaesemann <michael.glaesemann@myyearbook.com>)
Ответы Re: Pathological regexp match  (Magnus Hagander <magnus@hagander.net>)
Re: Pathological regexp match  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Michael Glaesemann <michael.glaesemann@myyearbook.com> writes:
> We came across a regexp that takes very much longer than expected.

I poked into this a little bit.  What seems to be happening is that the
use of non-greedy quantifiers plus backreferences turns off most of the
optimization that the regexp engine usually does, leaving the RE as a
tree of possibilities that is explored in a fairly dumb fashion.  In
particular, there is a loop in crevdissect() that tries to locate a
feasible division point between two concatenated sub-patterns, and
for each try, a recursive call to crevdissect() tries to locate a
feasible division point between *its* two sub-patterns, resulting in
O(N^2) runtime.  With a not very pleasant constant factor, too, because
of repeated startup/shutdown costs for the DFA matching engine.

I found a possible patch that seems to improve matters: AFAICS the DFA
matching is independent of the checking that cdissect() and friends do,
so we can apply that check first instead of second.  This brings the
runtime down from minutes to sub-second on my machine.  However I don't
have a whole lot of faith either in the correctness of this change, or
that it might not make some other cases slower instead of faster.
Has anyone got a reasonably messy collection of regexps they'd like
to try this patch on?

BTW, I filed the problem upstream with the Tcl project
https://sourceforge.net/tracker/?func=detail&aid=2942697&group_id=10894&atid=110894
but I'm not going to hold my breath waiting for useful advice from
them.  Since Henry Spencer dropped off the net, there doesn't seem
to be anybody there who understands that code much better than we do.

Also, we likely still ought to toss some CHECK_FOR_INTERRUPTS calls
in there ...

            regards, tom lane

Index: src/backend/regex/regexec.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/regex/regexec.c,v
retrieving revision 1.27
diff -c -r1.27 regexec.c
*** src/backend/regex/regexec.c    15 Oct 2005 02:49:24 -0000    1.27
--- src/backend/regex/regexec.c    30 Jan 2010 06:53:13 -0000
***************
*** 804,820 ****
      for (;;)
      {
          /* try this midpoint on for size */
!         er = cdissect(v, t->left, begin, mid);
!         if (er == REG_OKAY &&
!             longest(v, d2, mid, end, (int *) NULL) == end &&
!             (er = cdissect(v, t->right, mid, end)) ==
!             REG_OKAY)
!             break;                /* NOTE BREAK OUT */
!         if (er != REG_OKAY && er != REG_NOMATCH)
          {
!             freedfa(d);
!             freedfa(d2);
!             return er;
          }

          /* that midpoint didn't work, find a new one */
--- 804,821 ----
      for (;;)
      {
          /* try this midpoint on for size */
!         if (longest(v, d2, mid, end, (int *) NULL) == end)
          {
!             er = cdissect(v, t->left, begin, mid);
!             if (er == REG_OKAY &&
!                 (er = cdissect(v, t->right, mid, end)) == REG_OKAY)
!                 break;                /* NOTE BREAK OUT */
!             if (er != REG_OKAY && er != REG_NOMATCH)
!             {
!                 freedfa(d);
!                 freedfa(d2);
!                 return er;
!             }
          }

          /* that midpoint didn't work, find a new one */
***************
*** 904,920 ****
      for (;;)
      {
          /* try this midpoint on for size */
!         er = cdissect(v, t->left, begin, mid);
!         if (er == REG_OKAY &&
!             longest(v, d2, mid, end, (int *) NULL) == end &&
!             (er = cdissect(v, t->right, mid, end)) ==
!             REG_OKAY)
!             break;                /* NOTE BREAK OUT */
!         if (er != REG_OKAY && er != REG_NOMATCH)
          {
!             freedfa(d);
!             freedfa(d2);
!             return er;
          }

          /* that midpoint didn't work, find a new one */
--- 905,922 ----
      for (;;)
      {
          /* try this midpoint on for size */
!         if (longest(v, d2, mid, end, (int *) NULL) == end)
          {
!             er = cdissect(v, t->left, begin, mid);
!             if (er == REG_OKAY &&
!                 (er = cdissect(v, t->right, mid, end)) == REG_OKAY)
!                 break;                /* NOTE BREAK OUT */
!             if (er != REG_OKAY && er != REG_NOMATCH)
!             {
!                 freedfa(d);
!                 freedfa(d2);
!                 return er;
!             }
          }

          /* that midpoint didn't work, find a new one */

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Mark Mielke
Дата:
Сообщение: Re: PG 9.0 and standard_conforming_strings
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: PG 9.0 and standard_conforming_strings