Обсуждение: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

Поиск
Список
Период
Сортировка

BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
christian_maechler@hotmail.com
Дата:
The following bug has been logged on the website:

Bug reference:      13538
Logged by:          Chris Mächler
Email address:      christian_maechler@hotmail.com
PostgreSQL version: 9.3.0
Operating system:   ?
Description:

Here is an example to verify and reproduce the error (extract a number and
the things before and after it with 3 groups):


'(.*)([+-]?[0-9]*\.[0-9]+)(.*)'

Using regexü_matches this will produce an undesirable result (only one digit
in group 2), but everything behaves correctly, the third group matches until
the end.

'(.*?)([+-]?[0-9]*\.[0-9]+)(.*)'

If we change the first group to non-greedy to fix this, then the bug
appears: the third group becomes non-greedy too (it shouldn't!) and
therefore it is always empty instead of matching until the end of the line.
Also the first group is empty (should match from start!), it should find a
match at start position, whether it is non-greedy or not and not look ahead
if the non-greedy group can be reduced if starting to match at the next
index. Both are wrong behaviors.

(the workaround is anchoring, but the behavior of the regex is still wrong)

link: http://sqlfiddle.com/#!15/f0f14/14

Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
"David G. Johnston"
Дата:
On Monday, August 3, 2015, <christian_maechler@hotmail.com> wrote:

> The following bug has been logged on the website:
>
> Bug reference:      13538
> Logged by:          Chris M=C3=A4chler
> Email address:      christian_maechler@hotmail.com <javascript:;>
> PostgreSQL version: 9.3.0
> Operating system:   ?
> Description:
>
> Here is an example to verify and reproduce the error (extract a number an=
d
> the things before and after it with 3 groups):
>
>
> '(.*)([+-]?[0-9]*\.[0-9]+)(.*)'
>
> Using regex=C3=BC_matches this will produce an undesirable result (only o=
ne
> digit
> in group 2), but everything behaves correctly, the third group matches
> until
> the end.
>
> '(.*?)([+-]?[0-9]*\.[0-9]+)(.*)'
>
> If we change the first group to non-greedy to fix this, then the bug
> appears: the third group becomes non-greedy too (it shouldn't!) and
> therefore it is always empty instead of matching until the end of the lin=
e.
> Also the first group is empty (should match from start!), it should find =
a
> match at start position, whether it is non-greedy or not and not look ahe=
ad
> if the non-greedy group can be reduced if starting to match at the next
> index. Both are wrong behaviors.
>
> (the workaround is anchoring, but the behavior of the regex is still wron=
g)
>
> link: http://sqlfiddle.com/#!15/f0f14/14
>
>
>
Reading the documentation this seems to be working as intended.

http://www.postgresql.org/docs/9.3/static/functions-matching.html#POSIX-MAT=
CHING-RULES

On what are you basing your concept of correctness?  Specifically, what
language implementation do you consider "right"?

The TCL implementation used by PostgreSQL has some differences compared to
Java and Perl, the two I am most familiar with.

David J.

Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
Christian Mächler
Дата:
<div dir="ltr">Sorry probably sent my mail to the wrong address before.<br /><br />Here some more detailed examples to
showwhy the behavior of the 3rd group is clearly wrong also according to the specification:<br /><br />abc0.123def
applyingthe first regex (with optional dot)--> abc0.12 , 3, def<br /><br />abc0.123def applying the 2nd regex (with
optionaldot) --> , 3 ,<br /><br />Although the 3rd group wasn't touched it changed it's behavior from geedy to
non-greedy.abc, 0.123, def would be correct.<br /><br />The behavior of the non-greedy group isn't correct either,
becausea match should return the FIRST match found, greedyness is only about the FOLLOWING characters, but because it
willalready find a match at start position it should return that and not keep looking to find another match with
reducedgroup size.<br /><br />I was just trying to help, because I think this is a pretty big issue, although not using
regexatm.<br /><br />Chris<br /></div> 
Christian Mächler <christian_maechler@hotmail.com> writes:
> Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the
specification:

What specification are you reading?  The POSIX standard for regular
expressions doesn't mention non-greedy quantifiers at all.

As David says, these examples appear to be following what's stated in
http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
The Spencer regex engine we use has a notion of greediness or
non-greediness of the entire regex, and further that that takes precedence
for determining the overall match length over greediness of individual
subexpressions.  That behavior might be inconvenient for this particular
use-case, but that doesn't make it a bug.
        regards, tom lane



I wrote:
> As David says, these examples appear to be following what's stated in
> http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
> The Spencer regex engine we use has a notion of greediness or
> non-greediness of the entire regex, and further that that takes precedence
> for determining the overall match length over greediness of individual
> subexpressions.  That behavior might be inconvenient for this particular
> use-case, but that doesn't make it a bug.

BTW, perhaps it would be worth adding an example to that section that
shows how to control this behavior.  The trick is obvious once you've seen
it, but not so much otherwise: you add something to the start of the regex
that establishes the overall greediness you want, but can never actually
match any characters.  "\0*" or "\0*?" will work fine in Postgres
use-cases since there can never be a NUL character in the data.

regression=# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
 regexp_matches
-----------------
 {abc0123,4,xyz}
(1 row)

regression=# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');
 regexp_matches
----------------
 {abc,0,""}
(1 row)

regression=# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)');
 regexp_matches
-----------------
 {abc,01234,xyz}
(1 row)


            regards, tom lane

Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
"David G. Johnston"
Дата:
On Tue, Aug 4, 2015 at 8:39 AM, Christian M=C3=A4chler <
christian_maechler@hotmail.com> wrote:

> You say it is okay that a greedy group suddenly becomes non-greedy if
> ANOTHER group is made non-greedy?
>
> I've chosen a simple example, but I'm pretty sure I could construct
> several use-cases which can be solved easily if the regex behaves like in
> java, javaScript, perl etc.  but not with how it is done here. It's clear=
ly
> not a feature. Already simple things like ending a match with any amount =
of
> numbers will become difficult if non-greedy groups are present, e.g.
> instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9])  makes
> things easier...
>
> Seriously I didn't want to start a debate whether this is right or wrong,
> because I honestly can't understand how anyone could defend the behavior
> mentioned in the first sentence of this message. As I said, I just wanted
> to point out that there is a bug to help improve, but if you prefer it li=
ke
> this it is fine with me, I just think then you probably haven't used rege=
x
> that much.
>
>
=E2=80=8BI use RegEx quite a bit and while I will agree with the sentiment =
it would
be nearly impossible to replace the existing implementation.  Fortunately,
PostgreSQL is quite extensible and so if you need a more flexible RegEx
implementation you can write a function in pl/perl or pl/v8 and use their
implementations.

About the only thing that would make sense would be to get the TCL
implementation to accept the "possessive" modifier (+ in Java:
https://docs.oracle.com/javase/tutorial/essential/regex/quant.html) and let
it override the "overall greediness" aspect of the matching region while
leaving the unadorned case to use the overall aspect.

There is probable a bit more consideration than the brief amount I've done
here - though I think the point has been made.  Right or wrong it is a
design choice that has been made and in use for many years.  It works well
enough, and options/hacks exist, that finding someone who wants to dedicate
resources to improving the situation is likely to be difficult.

David J.
=E2=80=8B

Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
"David G. Johnston"
Дата:
On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I wrote:
> > As David says, these examples appear to be following what's stated in
> >
> http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-M=
ATCHING-RULES
> > The Spencer regex engine we use has a notion of greediness or
> > non-greediness of the entire regex, and further that that takes
> precedence
> > for determining the overall match length over greediness of individual
> > subexpressions.  That behavior might be inconvenient for this particula=
r
> > use-case, but that doesn't make it a bug.
>
> BTW, perhaps it would be worth adding an example to that section that
> shows how to control this behavior.  The trick is obvious once you've see=
n
> it, but not so much otherwise: you add something to the start of the rege=
x
> that establishes the overall greediness you want, but can never actually
> match any characters.  "\0*" or "\0*?" will work fine in Postgres
> use-cases since there can never be a NUL character in the data.
>
> regression=3D# select regexp_matches('abc01234xyz', '(.*)(\d+)(.*)');
>  regexp_matches
> -----------------
>  {abc0123,4,xyz}
> (1 row)
>
> regression=3D# select regexp_matches('abc01234xyz', '(.*?)(\d+)(.*)');
>  regexp_matches
> ----------------
>  {abc,0,""}
> (1 row)
>
> regression=3D# select regexp_matches('abc01234xyz', '\0*(.*?)(\d+)(.*)');
>  regexp_matches
> -----------------
>  {abc,01234,xyz}
> (1 row)
>
>
=E2=80=8B+1

David J.=E2=80=8B

Re: BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)

От
Christian Mächler
Дата:
You say it is okay that a greedy group suddenly becomes non-greedy if ANOTHER group is made non-greedy?

I've chosen a simple example, but I'm pretty sure I could construct several use-cases which can be solved easily if the regex behaves like in java, javaScript, perl etc.  but not with how it is done here. It's clearly not a feature. Already simple things like ending a match with any amount of numbers will become difficult if non-greedy groups are present, e.g. instead of ...([0-9]+) you will have to write ...([0-9]+)(?![0-9])  makes things easier...

Seriously I didn't want to start a debate whether this is right or wrong, because I honestly can't understand how anyone could defend the behavior mentioned in the first sentence of this message. As I said, I just wanted to point out that there is a bug to help improve, but if you prefer it like this it is fine with me, I just think then you probably haven't used regex that much.

Chris

> From: tgl@sss.pgh.pa.us
> To: christian_maechler@hotmail.com
> CC: david.g.johnston@gmail.com; pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] BUG #13538: REGEX non-greedy is working incorrectly (and also greedy matches fail if non-greedy is present)
> Date: Tue, 4 Aug 2015 10:58:57 -0400
>
> Christian Mächler <christian_maechler@hotmail.com> writes:
> > Here some more detailed examples to show why the behavior of the 3rd group is clearly wrong also according to the specification:
>
> What specification are you reading? The POSIX standard for regular
> expressions doesn't mention non-greedy quantifiers at all.
>
> As David says, these examples appear to be following what's stated in
> http://www.postgresql.org/docs/9.4/static/functions-matching.html#POSIX-MATCHING-RULES
> The Spencer regex engine we use has a notion of greediness or
> non-greediness of the entire regex, and further that that takes precedence
> for determining the overall match length over greediness of individual
> subexpressions. That behavior might be inconvenient for this particular
> use-case, but that doesn't make it a bug.
>
> regards, tom lane
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Tue, Aug 4, 2015 at 8:39 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, perhaps it would be worth adding an example to that section that
>> shows how to control this behavior.

> +1

When I looked closer, I noticed that the docs already had a recommendation
about how to force overall greediness or non-greediness, and it was
cleaner than the \0* hack I'd come up with on the spur of the moment.
So I extended that with an example.

For-the-archives: it strikes me that if we ever did want to break
backwards compatibility here in order to make it act a bit more like
Perl's regexps, we could try making the concatenation rule be that the
overall RE inherits the greediness of its last quantified atom rather than
its first one.  But I'm not sure how close an approximation that would
produce to Perl's engine's behavior; there would probably still be some
discrepancies.  I doubt it's worth breaking backwards compatibility for,
if we'd still get complaints from Perl users that our regex engine is
broken because it's not bug-compatible with Perl's.

            regards, tom lane
On 8/4/2015 6:20 PM, Tom Lane wrote:
> if we'd still get complaints from Perl users that our regex engine is
> broken because it's not bug-compatible with Perl's.

Some wag once said...   "Some folks, when faced with a problem, go, 'I
know, I'll use Regular Expressions'.  Now they have TWO problems!"



--
john r pierce, recycling bits in santa cruz