Thread: regexp_matches() quantified-capturing-parentheses oddity

regexp_matches() quantified-capturing-parentheses oddity

From: Julian Mehnle
Date:
Hi all,

  wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)', 'g');
   {quux}
   {@}
   {foo}
   {@}
   {bar}
   {.}
   {zip}

So far, so good.  However, can someone please explain the following to me?

  wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
   {p}

  wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
   {@}
   {@}
   {.}
   {p}

  wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');
   {foo}
   {.}
   {p}

What's going on here??

Regards,

-Julian Mehnle

Re: regexp_matches() quantified-capturing-parentheses oddity

From: Tom Lane
Date:
Julian Mehnle <julian@mehnle.net> writes:
> So far, so good.  However, can someone please explain the following to me?
>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');

These might be a bug, but the behavior doesn't seem to me that it'd be
terribly well defined in any case.  The function should be pulling the
match to the parenthesized subexpression, but here that subexpression
has got multiple matches --- which one would you expect to get?

Instead of (foo)+ I'd try
    ((foo)+)    if you want all the matches
    (foo)(foo)*    if you want the first one
    (?:foo)*(foo)    if you want the last one

            regards, tom lane
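
As a rough illustration of the three rewrites suggested above (capture the whole run, the first iteration, or the last iteration), here is a sketch on a deliberately unambiguous input. The string 'ab;cd;ef;' and the token pattern [a-z]+; are assumptions made up for this example and do not come from the thread. Every token must end in ';', so the iteration boundaries cannot be regrouped.

```sql
-- Hypothetical input and pattern, chosen so that each iteration
-- has only one possible boundary.

-- Whole run: capture 1 spans every iteration, capture 2 holds a single one.
SELECT regexp_matches('ab;cd;ef;', '(([a-z]+;)+)');

-- First iteration: capture it once, then consume the rest without capturing.
SELECT regexp_matches('ab;cd;ef;', '([a-z]+;)(?:[a-z]+;)*');

-- Last iteration: consume the leading iterations without capturing,
-- then capture the final one.
SELECT regexp_matches('ab;cd;ef;', '(?:[a-z]+;)*([a-z]+;)');
```

The same rewrites can be applied to the pattern from the original question, but there the results depend on how the engine splits runs of word characters, which is exactly the ambiguity under discussion.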

Re: regexp_matches() quantified-capturing-parentheses oddity

From: Julian Mehnle
Date:
Tom, thanks for your reply.

I wrote:

>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
>    {p}
>
>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
>    {@}
>    {@}
>    {.}
>    {p}
>
>   wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');
>    {foo}
>    {.}
>    {p}
>
> What's going on here??

FWIW, I have a vague idea of what these do:

They match greedily, i.e., with exactly as many instances of the
subexpression as the quantifier maximally allows (and no fewer!),
backtracking and regrouping word characters if necessary to get that many
instances.  Each match returns only the last of its tuple of subexpression
instances, and matching repeats until the string is exhausted.

E.g., I think:

'([@.]|[^@.]+){1,2}' matches (quux @ foo @ bar . zi p) and returns every
second of those: @ @ . p

'([@.]|[^@.]+){1,3}' matches (quux @ foo @ bar . z i p) and returns every
third of those: foo . p

'([@.]|[^@.]+)+' matches (q u u x @ f o o @ b a r . z i p) and returns
every 16th of those: p

I see that Perl behaves similarly, except that it does not insist on
matching exactly as many instances of the subexpression as *maximally*
allowed by the quantifier, backtracking if necessary to make that work.
That part of PostgreSQL's behavior is very, very weird.

Tom Lane wrote:

> These might be a bug, but the behavior doesn't seem to me that it'd be
> terribly well defined in any case. The function should be pulling the
> match to the parenthesized subexpression, but here that subexpression
> has got multiple matches --- which one would you expect to get?

I had *hoped* regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+') (without
'g') would return all the subexpression matches as a *single* array in a
*single* row.

However, now that I've checked the Perl regexp engine's behavior, I would
at least expect it to work just like Perl, i.e., allow fewer matches at
the end, without backtracking and regrouping:

  $ perl -le 'print(join(" ", "quux\@foo\@bar.zip" =~ m/([@.]|[^@.]+){1,2}/g))'
  @ @ . zip

  $ perl -le 'print(join(" ", "quux\@foo\@bar.zip" =~ m/([@.]|[^@.]+){1,3}/g))'
  foo . zip

> Instead of (foo)+ I'd try
>     ((foo)+)    if you want all the matches
>     (foo)(foo)*    if you want the first one
>     (?:foo)*(foo)    if you want the last one

I would use the ((foo)+) form, but of course it doesn't return all of the
subexpression matches as separate elements, which was the point of my
exercise.

For what it's worth, I'm now using a "FOR ... IN SELECT regexp_matches(...)
LOOP" construct in a custom plpgsql function.

-Julian
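
A minimal sketch of what such a loop might look like follows. The function name split_tokens is hypothetical, and this is not the function actually used by the poster. It relies only on the unquantified pattern from the first message, which returns one row per token when called with 'g'.

```sql
-- Sketch only: split_tokens is a hypothetical name, not a built-in function.
CREATE OR REPLACE FUNCTION split_tokens(input text)
RETURNS text[]
LANGUAGE plpgsql
IMMUTABLE
AS $$
DECLARE
    m      text[];          -- one row from regexp_matches(): a one-element array
    result text[] := '{}';  -- accumulated tokens, in match order
BEGIN
    -- The unquantified pattern yields one token per row, so no iteration
    -- of a quantified group has to be picked.
    FOR m IN SELECT regexp_matches(input, '([@.]|[^@.]+)', 'g') LOOP
        result := result || m[1];
    END LOOP;
    RETURN result;
END;
$$;

SELECT split_tokens('quux@foo@bar.zip');
```

Given the row-per-token output shown in the first message, the final SELECT should return the single array {quux,@,foo,@,bar,.,zip}.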

Re: regexp_matches() quantified-capturing-parentheses oddity

From: Harald Fuchs
Date:
In article <13289.1260290974@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Julian Mehnle <julian@mehnle.net> writes:
>> So far, so good.  However, can someone please explain the following to me?
>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');

> These might be a bug, but the behavior doesn't seem to me that it'd be
> terribly well defined in any case.  The function should be pulling the
> match to the parenthesized subexpression, but here that subexpression
> has got multiple matches --- which one would you expect to get?

Perl always seems to return the last one, but the last one is never just
'p' - so I also think that Julian has spotted a bug.

Re: regexp_matches() quantified-capturing-parentheses oddity

From: Tom Lane
Date:
Harald Fuchs <hari.fuchs@gmail.com> writes:
> Tom Lane <tgl@sss.pgh.pa.us> writes:
>> Julian Mehnle <julian@mehnle.net> writes:
>>> So far, so good.  However, can someone please explain the following to me?
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+)+', 'g');
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,2}', 'g');
>>> wisu-dev=# SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+){1,3}', 'g');

>> These might be a bug, but the behavior doesn't seem to me that it'd be
>> terribly well defined in any case.  The function should be pulling the
>> match to the parenthesized subexpression, but here that subexpression
>> has got multiple matches --- which one would you expect to get?

> Perl always seems to return the last one, but the last one is never just
> 'p' - so I also think that Julian has spotted a bug.

Well, Perl is not the definition of correct regexp behavior ;-).  It's
got a completely different regexp engine in it, and so you shouldn't
be surprised if a poorly-specified regexp gives different results.
(The regexp engine we use was borrowed from Tcl, not Perl.  It has
some strengths and some weaknesses compared to Perl's.)

It does appear that our engine agrees with Perl's that the thing to do
with something like this is to return the last substring matching the
quantified expression.  However, it appears to define that as the last
possible match, not what would be left over after removing the first N-1
matches left-to-right.  It's possible to match the parenthesized
subexpression to just the trailing 'p', which is what it tries first,
and so that's what you get.

The right way to deal with this, I think, is to add constraints so that
the boundaries for the sub-matches are not ambiguous.  Try adding
(?![^@.]) after the [^@.]+.

            regards, tom lane
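
Applied to the queries from the original question, the suggested constraint might look like the sketch below. The negative lookahead (?![^@.]) requires that a run of word characters is not followed by another word character, so a word can only be matched as a whole and cannot be regrouped into pieces such as 'zi' and 'p'. The results are not shown here because they were not verified against a server.

```sql
-- Word alternative constrained so that it cannot stop in the middle of a word:
-- after [^@.]+ the next character must not be another word character.
SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+(?![^@.])){1,2}', 'g');
SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+(?![^@.])){1,3}', 'g');
SELECT regexp_matches('quux@foo@bar.zip', '([@.]|[^@.]+(?![^@.]))+', 'g');
```

With the sub-match boundaries pinned down this way, the only remaining freedom is how many iterations each overall match consumes.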