Обсуждение: 9.5.3: substring: regex greedy operator not picking up chars as expected

Поиск
Список
Период
Сортировка

9.5.3: substring: regex greedy operator not picking up chars as expected

От
"Foster, Russell"
Дата:
Hello,

For the following query:

select substring('>772' from '.*?[0-9]+')

I would expect the output to be '>772', but it is '>7'.  You can also see t=
he expected result on https://regex101.com/, although I am aware not all re=
gex processors work the same.

The following queries:

select substring('>772' from '^.*?[0-9]+$')

and:

select substring('>772' from '[0-9]+')

both return '>772', which is expected.  Could the less greedy operator on t=
he left (.*?) be affecting the more greedy right one (+)?

Thanks,
Russell Foster

Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

От
"David G. Johnston"
Дата:
=E2=80=8BWorking as documented.=E2=80=8B

https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MA=
TCHING-RULES

Specifically, this implementation considers greediness at a level higher
than just the atom/expression - and in a mixed "branch" if there is a
non-greedy quantifier in a branch the entire branch is non-greedy and can
in many situations cause greedy atoms to behave non-greedily.

In might help to consider that there aren't really any explicit "greedy"
operators like other engines have (i.e., ??, ?, ?+) but rather non-greedy
(lazy) and default.  The default inherits the non-greedy trait from its
parent if applicable otherwise is behaves greedily.

On Mon, Aug 15, 2016 at 7:53 AM, Foster, Russell <Russell.Foster@crl.com>
wrote:

> Hello,
>
>
>
> For the following query:
>
>
>
> select substring('>772' from '.*?[0-9]+')
>

=E2=80=8BThe pattern itself is non-greedy=E2=80=8B due to their only being =
a single branch
and it having a non-greedy quantifier within it.

.*? matches ">" and [0-9]+ only needs a single character to generate a
non-greedy match conforming match


>
> I would expect the output to be =E2=80=98>772=E2=80=99, but it is =E2=80=
=98>7=E2=80=99.  You can also see
> the expected result on https://regex101.com/, although I am aware not all
> regex processors work the same.
>
>
>
> The following queries:
>
>
>
> select substring('>772' from '^.*?[0-9]+$')
>

=E2=80=8BThis is treated exactly the same as the above but because of the ^=
$ the
shortest possible output string is the entire string=E2=80=8B


>
> and:
>
>
>
> select substring('>772' from '[0-9]+')
>
>
>
> both return =E2=80=98>772=E2=80=99, which is expected.  Could the less gr=
eedy operator on
> the left (.*?) be affecting the more greedy right one (+)?
>
>
>

Typo here? I'm not fluent with substring(regex).

Anyway, the entire RE (single branch) is now greedy so the greedy [0-9]+
atom matches as many numbers as possible.

David J.

Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

От
"Foster, Russell"
Дата:
SGkgRGF2aWQsDQoNCk11c3QgaGF2ZSBtaXNzZWQgdGhhdCBpbiB0aGUgbWFudWFsLCBidXQgbWFr
ZXMgc2Vuc2Ugbm93LiAgU29tZXdoYXQgc3RyYW5nZSBiZWhhdmlvciB0aGF0IGEgbm9uLWdyZWVk
eSBxdWFudGlmaWVyIGJhc2ljYWxseSBydWlucyB0aGUgcmVzdCBvZiB0aGUgZXhwcmVzc2lvbiBm
b3IgdGhlIGdyZWVkeSBvbmVzLCBidXQgYXQgbGVhc3QgaXTigJlzIHdvcmtpbmcgYXMgZGVzaWdu
ZWQuICBUaGFua3MhDQoNClJ1c3NlbGwNCg0KRnJvbTogRGF2aWQgRy4gSm9obnN0b24gW21haWx0
bzpkYXZpZC5nLmpvaG5zdG9uQGdtYWlsLmNvbV0NClNlbnQ6IDE1IEF1Z3VzdCAyMDE2IDg6NDUg
QU0NClRvOiBGb3N0ZXIsIFJ1c3NlbGwgPFJ1c3NlbGwuRm9zdGVyQGNybC5jb20+DQpDYzogcGdz
cWwtYnVnc0Bwb3N0Z3Jlc3FsLm9yZw0KU3ViamVjdDogUmU6IFtCVUdTXSA5LjUuMzogc3Vic3Ry
aW5nOiByZWdleCBncmVlZHkgb3BlcmF0b3Igbm90IHBpY2tpbmcgdXAgY2hhcnMgYXMgZXhwZWN0
ZWQNCg0K4oCLV29ya2luZyBhcyBkb2N1bWVudGVkLuKAiw0KDQpodHRwczovL3d3dy5wb3N0Z3Jl
c3FsLm9yZy9kb2NzLzkuNS9zdGF0aWMvZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwjUE9TSVgtTUFU
Q0hJTkctUlVMRVM8aHR0cHM6Ly9uYTAxLnNhZmVsaW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29t
Lz91cmw9aHR0cHMlM2ElMmYlMmZ3d3cucG9zdGdyZXNxbC5vcmclMmZkb2NzJTJmOS41JTJmc3Rh
dGljJTJmZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwlMjNQT1NJWC1NQVRDSElORy1SVUxFUyZkYXRh
PTAxJTdjMDElN2NSdXNzZWxsLkZvc3RlciU0MGNybC5jb20lN2NjMWU3MTM1OWVhOGM0YWEwYTQw
MDA4ZDNjNTA5ZjdjZiU3YzM3NGY4OTMwZTE1MDQwMzFiYjM1NDgzMjE1ZmU1OTAwJTdjMCZzZGF0
YT1uNEZtV1ppMCUyZiUyYmRnWjVLclkzQmZrMU8wbnBiVkdLJTJiUkNIV25OTU1tWFZvJTNkPg0K
DQpTcGVjaWZpY2FsbHksIHRoaXMgaW1wbGVtZW50YXRpb24gY29uc2lkZXJzIGdyZWVkaW5lc3Mg
YXQgYSBsZXZlbCBoaWdoZXIgdGhhbiBqdXN0IHRoZSBhdG9tL2V4cHJlc3Npb24gLSBhbmQgaW4g
YSBtaXhlZCAiYnJhbmNoIiBpZiB0aGVyZSBpcyBhIG5vbi1ncmVlZHkgcXVhbnRpZmllciBpbiBh
IGJyYW5jaCB0aGUgZW50aXJlIGJyYW5jaCBpcyBub24tZ3JlZWR5IGFuZCBjYW4gaW4gbWFueSBz
aXR1YXRpb25zIGNhdXNlIGdyZWVkeSBhdG9tcyB0byBiZWhhdmUgbm9uLWdyZWVkaWx5Lg0KDQpJ
biBtaWdodCBoZWxwIHRvIGNvbnNpZGVyIHRoYXQgdGhlcmUgYXJlbid0IHJlYWxseSBhbnkgZXhw
bGljaXQgImdyZWVkeSIgb3BlcmF0b3JzIGxpa2Ugb3RoZXIgZW5naW5lcyBoYXZlIChpLmUuLCA/
PywgPywgPyspIGJ1dCByYXRoZXIgbm9uLWdyZWVkeSAobGF6eSkgYW5kIGRlZmF1bHQuICBUaGUg
ZGVmYXVsdCBpbmhlcml0cyB0aGUgbm9uLWdyZWVkeSB0cmFpdCBmcm9tIGl0cyBwYXJlbnQgaWYg
YXBwbGljYWJsZSBvdGhlcndpc2UgaXMgYmVoYXZlcyBncmVlZGlseS4NCg0KT24gTW9uLCBBdWcg
MTUsIDIwMTYgYXQgNzo1MyBBTSwgRm9zdGVyLCBSdXNzZWxsIDxSdXNzZWxsLkZvc3RlckBjcmwu
Y29tPG1haWx0bzpSdXNzZWxsLkZvc3RlckBjcmwuY29tPj4gd3JvdGU6DQpIZWxsbywNCg0KRm9y
IHRoZSBmb2xsb3dpbmcgcXVlcnk6DQoNCnNlbGVjdCBzdWJzdHJpbmcoJz43NzInIGZyb20gJy4q
P1swLTldKycpDQoNCuKAi1RoZSBwYXR0ZXJuIGl0c2VsZiBpcyBub24tZ3JlZWR54oCLIGR1ZSB0
byB0aGVpciBvbmx5IGJlaW5nIGEgc2luZ2xlIGJyYW5jaCBhbmQgaXQgaGF2aW5nIGEgbm9uLWdy
ZWVkeSBxdWFudGlmaWVyIHdpdGhpbiBpdC4NCg0KLio/IG1hdGNoZXMgIj4iIGFuZCBbMC05XSsg
b25seSBuZWVkcyBhIHNpbmdsZSBjaGFyYWN0ZXIgdG8gZ2VuZXJhdGUgYSBub24tZ3JlZWR5IG1h
dGNoIGNvbmZvcm1pbmcgbWF0Y2gNCg0KDQpJIHdvdWxkIGV4cGVjdCB0aGUgb3V0cHV0IHRvIGJl
IOKAmD43NzLigJksIGJ1dCBpdCBpcyDigJg+N+KAmS4gIFlvdSBjYW4gYWxzbyBzZWUgdGhlIGV4
cGVjdGVkIHJlc3VsdCBvbiBodHRwczovL3JlZ2V4MTAxLmNvbS88aHR0cHM6Ly9uYTAxLnNhZmVs
aW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29tLz91cmw9aHR0cHMlM2ElMmYlMmZyZWdleDEwMS5j
b20lMmYmZGF0YT0wMSU3YzAxJTdjUnVzc2VsbC5Gb3N0ZXIlNDBjcmwuY29tJTdjYzFlNzEzNTll
YThjNGFhMGE0MDAwOGQzYzUwOWY3Y2YlN2MzNzRmODkzMGUxNTA0MDMxYmIzNTQ4MzIxNWZlNTkw
MCU3YzAmc2RhdGE9eWU1NVRkUHhHT0I2TlVvRG44NWwlMmZFZzhvOU1nWVBrYk92JTJiZzRtR2FY
dzQlM2Q+LCBhbHRob3VnaCBJIGFtIGF3YXJlIG5vdCBhbGwgcmVnZXggcHJvY2Vzc29ycyB3b3Jr
IHRoZSBzYW1lLg0KDQpUaGUgZm9sbG93aW5nIHF1ZXJpZXM6DQoNCnNlbGVjdCBzdWJzdHJpbmco
Jz43NzInIGZyb20gJ14uKj9bMC05XSskJykNCg0K4oCLVGhpcyBpcyB0cmVhdGVkIGV4YWN0bHkg
dGhlIHNhbWUgYXMgdGhlIGFib3ZlIGJ1dCBiZWNhdXNlIG9mIHRoZSBeJCB0aGUgc2hvcnRlc3Qg
cG9zc2libGUgb3V0cHV0IHN0cmluZyBpcyB0aGUgZW50aXJlIHN0cmluZ+KAiw0KDQoNCmFuZDoN
Cg0Kc2VsZWN0IHN1YnN0cmluZygnPjc3MicgZnJvbSAnWzAtOV0rJykNCg0KYm90aCByZXR1cm4g
4oCYPjc3MuKAmSwgd2hpY2ggaXMgZXhwZWN0ZWQuICBDb3VsZCB0aGUgbGVzcyBncmVlZHkgb3Bl
cmF0b3Igb24gdGhlIGxlZnQgKC4qPykgYmUgYWZmZWN0aW5nIHRoZSBtb3JlIGdyZWVkeSByaWdo
dCBvbmUgKCspPw0KDQoNClR5cG8gaGVyZT8gSSdtIG5vdCBmbHVlbnQgd2l0aCBzdWJzdHJpbmco
cmVnZXgpLg0KDQpBbnl3YXksIHRoZSBlbnRpcmUgUkUgKHNpbmdsZSBicmFuY2gpIGlzIG5vdyBn
cmVlZHkgc28gdGhlIGdyZWVkeSBbMC05XSsgYXRvbSBtYXRjaGVzIGFzIG1hbnkgbnVtYmVycyBh
cyBwb3NzaWJsZS4NCg0KRGF2aWQgSi4NCg0K

Re: 9.5.3: substring: regex greedy operator not picking up chars as expected

От
Tom Lane
Дата:
"Foster, Russell" <Russell.Foster@crl.com> writes:
> For the following query:
> select substring('>772' from '.*?[0-9]+')
> I would expect the output to be '>772', but it is '>7'.

As David pointed out, that's what you get because the RE as a whole is
considered to be non-greedy, ie you get the shortest overall match.
However, you can adjust that by decorating the RE:

# select substring('>772' from '(.*?[0-9]+){1,1}');
 substring
-----------
 >772
(1 row)

Now it's longest-overall, but the .*? part is still shortest-match,
so it doesn't consume any digits.  However, I suspect that still is
not quite what you want, because it consumes too much in cases like:

# select substring('>772foo444' from '(.*?[0-9]+){1,1}');
 substring
------------
 >772foo444
(1 row)

There's probably really no way out of that except to be less lazy about
writing the pattern:

# select substring('>772foo444' from '([^0-9]*?[0-9]+){1,1}');
 substring
-----------
 >772
(1 row)

and in that formulation, of course, greediness doesn't really matter
because there is only one way to match.

# select substring('>772foo444' from '[^0-9]*[0-9]+');
 substring
-----------
 >772
(1 row)

See
https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES

            regards, tom lane