Обсуждение: 9.5.3: substring: regex greedy operator not picking up chars as expected
Hello,
For the following query:
select substring('>772' from '.*?[0-9]+')
I would expect the output to be '>772', but it is '>7'. You can also see t=
he expected result on https://regex101.com/, although I am aware not all re=
gex processors work the same.
The following queries:
select substring('>772' from '^.*?[0-9]+$')
and:
select substring('>772' from '[0-9]+')
both return '>772', which is expected. Could the less greedy operator on t=
he left (.*?) be affecting the more greedy right one (+)?
Thanks,
Russell Foster
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected
От
"David G. Johnston"
Дата:
=E2=80=8BWorking as documented.=E2=80=8B https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MA= TCHING-RULES Specifically, this implementation considers greediness at a level higher than just the atom/expression - and in a mixed "branch" if there is a non-greedy quantifier in a branch the entire branch is non-greedy and can in many situations cause greedy atoms to behave non-greedily. In might help to consider that there aren't really any explicit "greedy" operators like other engines have (i.e., ??, ?, ?+) but rather non-greedy (lazy) and default. The default inherits the non-greedy trait from its parent if applicable otherwise is behaves greedily. On Mon, Aug 15, 2016 at 7:53 AM, Foster, Russell <Russell.Foster@crl.com> wrote: > Hello, > > > > For the following query: > > > > select substring('>772' from '.*?[0-9]+') > =E2=80=8BThe pattern itself is non-greedy=E2=80=8B due to their only being = a single branch and it having a non-greedy quantifier within it. .*? matches ">" and [0-9]+ only needs a single character to generate a non-greedy match conforming match > > I would expect the output to be =E2=80=98>772=E2=80=99, but it is =E2=80= =98>7=E2=80=99. You can also see > the expected result on https://regex101.com/, although I am aware not all > regex processors work the same. > > > > The following queries: > > > > select substring('>772' from '^.*?[0-9]+$') > =E2=80=8BThis is treated exactly the same as the above but because of the ^= $ the shortest possible output string is the entire string=E2=80=8B > > and: > > > > select substring('>772' from '[0-9]+') > > > > both return =E2=80=98>772=E2=80=99, which is expected. Could the less gr= eedy operator on > the left (.*?) be affecting the more greedy right one (+)? > > > Typo here? I'm not fluent with substring(regex). Anyway, the entire RE (single branch) is now greedy so the greedy [0-9]+ atom matches as many numbers as possible. David J.
Re: 9.5.3: substring: regex greedy operator not picking up chars as expected
От
"Foster, Russell"
Дата:
SGkgRGF2aWQsDQoNCk11c3QgaGF2ZSBtaXNzZWQgdGhhdCBpbiB0aGUgbWFudWFsLCBidXQgbWFr ZXMgc2Vuc2Ugbm93LiAgU29tZXdoYXQgc3RyYW5nZSBiZWhhdmlvciB0aGF0IGEgbm9uLWdyZWVk eSBxdWFudGlmaWVyIGJhc2ljYWxseSBydWlucyB0aGUgcmVzdCBvZiB0aGUgZXhwcmVzc2lvbiBm b3IgdGhlIGdyZWVkeSBvbmVzLCBidXQgYXQgbGVhc3QgaXTigJlzIHdvcmtpbmcgYXMgZGVzaWdu ZWQuICBUaGFua3MhDQoNClJ1c3NlbGwNCg0KRnJvbTogRGF2aWQgRy4gSm9obnN0b24gW21haWx0 bzpkYXZpZC5nLmpvaG5zdG9uQGdtYWlsLmNvbV0NClNlbnQ6IDE1IEF1Z3VzdCAyMDE2IDg6NDUg QU0NClRvOiBGb3N0ZXIsIFJ1c3NlbGwgPFJ1c3NlbGwuRm9zdGVyQGNybC5jb20+DQpDYzogcGdz cWwtYnVnc0Bwb3N0Z3Jlc3FsLm9yZw0KU3ViamVjdDogUmU6IFtCVUdTXSA5LjUuMzogc3Vic3Ry aW5nOiByZWdleCBncmVlZHkgb3BlcmF0b3Igbm90IHBpY2tpbmcgdXAgY2hhcnMgYXMgZXhwZWN0 ZWQNCg0K4oCLV29ya2luZyBhcyBkb2N1bWVudGVkLuKAiw0KDQpodHRwczovL3d3dy5wb3N0Z3Jl c3FsLm9yZy9kb2NzLzkuNS9zdGF0aWMvZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwjUE9TSVgtTUFU Q0hJTkctUlVMRVM8aHR0cHM6Ly9uYTAxLnNhZmVsaW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29t Lz91cmw9aHR0cHMlM2ElMmYlMmZ3d3cucG9zdGdyZXNxbC5vcmclMmZkb2NzJTJmOS41JTJmc3Rh dGljJTJmZnVuY3Rpb25zLW1hdGNoaW5nLmh0bWwlMjNQT1NJWC1NQVRDSElORy1SVUxFUyZkYXRh PTAxJTdjMDElN2NSdXNzZWxsLkZvc3RlciU0MGNybC5jb20lN2NjMWU3MTM1OWVhOGM0YWEwYTQw MDA4ZDNjNTA5ZjdjZiU3YzM3NGY4OTMwZTE1MDQwMzFiYjM1NDgzMjE1ZmU1OTAwJTdjMCZzZGF0 YT1uNEZtV1ppMCUyZiUyYmRnWjVLclkzQmZrMU8wbnBiVkdLJTJiUkNIV25OTU1tWFZvJTNkPg0K DQpTcGVjaWZpY2FsbHksIHRoaXMgaW1wbGVtZW50YXRpb24gY29uc2lkZXJzIGdyZWVkaW5lc3Mg YXQgYSBsZXZlbCBoaWdoZXIgdGhhbiBqdXN0IHRoZSBhdG9tL2V4cHJlc3Npb24gLSBhbmQgaW4g YSBtaXhlZCAiYnJhbmNoIiBpZiB0aGVyZSBpcyBhIG5vbi1ncmVlZHkgcXVhbnRpZmllciBpbiBh IGJyYW5jaCB0aGUgZW50aXJlIGJyYW5jaCBpcyBub24tZ3JlZWR5IGFuZCBjYW4gaW4gbWFueSBz aXR1YXRpb25zIGNhdXNlIGdyZWVkeSBhdG9tcyB0byBiZWhhdmUgbm9uLWdyZWVkaWx5Lg0KDQpJ biBtaWdodCBoZWxwIHRvIGNvbnNpZGVyIHRoYXQgdGhlcmUgYXJlbid0IHJlYWxseSBhbnkgZXhw bGljaXQgImdyZWVkeSIgb3BlcmF0b3JzIGxpa2Ugb3RoZXIgZW5naW5lcyBoYXZlIChpLmUuLCA/ PywgPywgPyspIGJ1dCByYXRoZXIgbm9uLWdyZWVkeSAobGF6eSkgYW5kIGRlZmF1bHQuICBUaGUg ZGVmYXVsdCBpbmhlcml0cyB0aGUgbm9uLWdyZWVkeSB0cmFpdCBmcm9tIGl0cyBwYXJlbnQgaWYg YXBwbGljYWJsZSBvdGhlcndpc2UgaXMgYmVoYXZlcyBncmVlZGlseS4NCg0KT24gTW9uLCBBdWcg MTUsIDIwMTYgYXQgNzo1MyBBTSwgRm9zdGVyLCBSdXNzZWxsIDxSdXNzZWxsLkZvc3RlckBjcmwu Y29tPG1haWx0bzpSdXNzZWxsLkZvc3RlckBjcmwuY29tPj4gd3JvdGU6DQpIZWxsbywNCg0KRm9y IHRoZSBmb2xsb3dpbmcgcXVlcnk6DQoNCnNlbGVjdCBzdWJzdHJpbmcoJz43NzInIGZyb20gJy4q P1swLTldKycpDQoNCuKAi1RoZSBwYXR0ZXJuIGl0c2VsZiBpcyBub24tZ3JlZWR54oCLIGR1ZSB0 byB0aGVpciBvbmx5IGJlaW5nIGEgc2luZ2xlIGJyYW5jaCBhbmQgaXQgaGF2aW5nIGEgbm9uLWdy ZWVkeSBxdWFudGlmaWVyIHdpdGhpbiBpdC4NCg0KLio/IG1hdGNoZXMgIj4iIGFuZCBbMC05XSsg b25seSBuZWVkcyBhIHNpbmdsZSBjaGFyYWN0ZXIgdG8gZ2VuZXJhdGUgYSBub24tZ3JlZWR5IG1h dGNoIGNvbmZvcm1pbmcgbWF0Y2gNCg0KDQpJIHdvdWxkIGV4cGVjdCB0aGUgb3V0cHV0IHRvIGJl IOKAmD43NzLigJksIGJ1dCBpdCBpcyDigJg+N+KAmS4gIFlvdSBjYW4gYWxzbyBzZWUgdGhlIGV4 cGVjdGVkIHJlc3VsdCBvbiBodHRwczovL3JlZ2V4MTAxLmNvbS88aHR0cHM6Ly9uYTAxLnNhZmVs aW5rcy5wcm90ZWN0aW9uLm91dGxvb2suY29tLz91cmw9aHR0cHMlM2ElMmYlMmZyZWdleDEwMS5j b20lMmYmZGF0YT0wMSU3YzAxJTdjUnVzc2VsbC5Gb3N0ZXIlNDBjcmwuY29tJTdjYzFlNzEzNTll YThjNGFhMGE0MDAwOGQzYzUwOWY3Y2YlN2MzNzRmODkzMGUxNTA0MDMxYmIzNTQ4MzIxNWZlNTkw MCU3YzAmc2RhdGE9eWU1NVRkUHhHT0I2TlVvRG44NWwlMmZFZzhvOU1nWVBrYk92JTJiZzRtR2FY dzQlM2Q+LCBhbHRob3VnaCBJIGFtIGF3YXJlIG5vdCBhbGwgcmVnZXggcHJvY2Vzc29ycyB3b3Jr IHRoZSBzYW1lLg0KDQpUaGUgZm9sbG93aW5nIHF1ZXJpZXM6DQoNCnNlbGVjdCBzdWJzdHJpbmco Jz43NzInIGZyb20gJ14uKj9bMC05XSskJykNCg0K4oCLVGhpcyBpcyB0cmVhdGVkIGV4YWN0bHkg dGhlIHNhbWUgYXMgdGhlIGFib3ZlIGJ1dCBiZWNhdXNlIG9mIHRoZSBeJCB0aGUgc2hvcnRlc3Qg cG9zc2libGUgb3V0cHV0IHN0cmluZyBpcyB0aGUgZW50aXJlIHN0cmluZ+KAiw0KDQoNCmFuZDoN Cg0Kc2VsZWN0IHN1YnN0cmluZygnPjc3MicgZnJvbSAnWzAtOV0rJykNCg0KYm90aCByZXR1cm4g 4oCYPjc3MuKAmSwgd2hpY2ggaXMgZXhwZWN0ZWQuICBDb3VsZCB0aGUgbGVzcyBncmVlZHkgb3Bl cmF0b3Igb24gdGhlIGxlZnQgKC4qPykgYmUgYWZmZWN0aW5nIHRoZSBtb3JlIGdyZWVkeSByaWdo dCBvbmUgKCspPw0KDQoNClR5cG8gaGVyZT8gSSdtIG5vdCBmbHVlbnQgd2l0aCBzdWJzdHJpbmco cmVnZXgpLg0KDQpBbnl3YXksIHRoZSBlbnRpcmUgUkUgKHNpbmdsZSBicmFuY2gpIGlzIG5vdyBn cmVlZHkgc28gdGhlIGdyZWVkeSBbMC05XSsgYXRvbSBtYXRjaGVzIGFzIG1hbnkgbnVtYmVycyBh cyBwb3NzaWJsZS4NCg0KRGF2aWQgSi4NCg0K
"Foster, Russell" <Russell.Foster@crl.com> writes:
> For the following query:
> select substring('>772' from '.*?[0-9]+')
> I would expect the output to be '>772', but it is '>7'.
As David pointed out, that's what you get because the RE as a whole is
considered to be non-greedy, ie you get the shortest overall match.
However, you can adjust that by decorating the RE:
# select substring('>772' from '(.*?[0-9]+){1,1}');
substring
-----------
>772
(1 row)
Now it's longest-overall, but the .*? part is still shortest-match,
so it doesn't consume any digits. However, I suspect that still is
not quite what you want, because it consumes too much in cases like:
# select substring('>772foo444' from '(.*?[0-9]+){1,1}');
substring
------------
>772foo444
(1 row)
There's probably really no way out of that except to be less lazy about
writing the pattern:
# select substring('>772foo444' from '([^0-9]*?[0-9]+){1,1}');
substring
-----------
>772
(1 row)
and in that formulation, of course, greediness doesn't really matter
because there is only one way to match.
# select substring('>772foo444' from '[^0-9]*[0-9]+');
substring
-----------
>772
(1 row)
See
https://www.postgresql.org/docs/9.5/static/functions-matching.html#POSIX-MATCHING-RULES
regards, tom lane