Обсуждение: Our regex vs. POSIX on "longest match"
Hello folks, I am in the process of accelerating down the rabbit hole of regex internals. Something that came up during my reading, is that a POSIX compliant regex engine ought to always prefer the longest possible match, when multiple matches are possible beginning from the same location in the string. [1] I wasn't sure that that was how our regex engine worked, and indeed, on checking the manual [2] I found that our regex engine uses a strange sort of "inductive greediness" to determine whether the longest or the shortest possible match ought to be preferred. The greediness of individual particles in the regex are taken into account, and at the top level the entire expression is concluded to be either greedy, or non-greedy. I'll admit that this is a pretty obscure point, but we do appear to be in direct violation of POSIX here. I am getting the impression that our engine works this way to route around some of the performance issues that can arise in trying out every possible match with an NFA-style engine. I find it a bit unfortunate that POSIX is so unambiguous about this, while our engine's treatment is, well ... quite arcane by comparison. At minimum, I think we should be more explicit in the manual that this behaviour flips POSIX the proverbial bird. Several paragraphs south, there is a sentence reading "Incompatibilities of note include ... the longest/shortest-match (rather than first-match) matching semantics", but in the context it seems as though the author is talking about an incompatibility with Perl, not with POSIX. Thoughts? Cheers, BJ [1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html [2] http://www.postgresql.org/docs/9.0/static/functions-matching.html#FUNCTIONS-POSIX-REGEXP
Brendan Jurd <direvus@gmail.com> writes: > I am in the process of accelerating down the rabbit hole of regex > internals. Something that came up during my reading, is that a POSIX > compliant regex engine ought to always prefer the longest possible > match, when multiple matches are possible beginning from the same > location in the string. [1] > I wasn't sure that that was how our regex engine worked, and indeed, > on checking the manual [2] I found that our regex engine uses a > strange sort of "inductive greediness" to determine whether the > longest or the shortest possible match ought to be preferred. The > greediness of individual particles in the regex are taken into > account, and at the top level the entire expression is concluded to be > either greedy, or non-greedy. > I'll admit that this is a pretty obscure point, but we do appear to be > in direct violation of POSIX here. How so? POSIX doesn't contain any non-greedy constructs. If you use only the POSIX-compatible greedy constructs, the behavior is compliant. The issue that is obscure is, once you define some non-greedy constructs, how to define how they should act in combination with greedy ones. I'm not sure to what extent the engine's behavior is driven by implementation restrictions and to what extent it's really the sanest behavior Henry could think of. I found a comment from him about it: http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50 but it's short on details as to what alternatives he considered. regards, tom lane
On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brendan Jurd <direvus@gmail.com> writes: >> I'll admit that this is a pretty obscure point, but we do appear to be >> in direct violation of POSIX here. > > How so? POSIX doesn't contain any non-greedy constructs. If you use > only the POSIX-compatible greedy constructs, the behavior is compliant. While it's true that POSIX doesn't contemplate non-greed, after reading the spec I would have expected an expression *as a whole* to still prefer the longest match, insofar as that is possible while respecting non-greedy particles. I just ran some quick experiments in Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned 'y1234', which to my mind is the most obvious answer, and one which still makes sense in light of what POSIX has to say. Whereas Postgres (and presumably Tcl) return 'y1'. What I found surprising here is that our implementation allows an earlier quantifier to clobber the greediness of a later quantifier. There's a certain disregard for the intentions of the author of the regex. If I had wanted the whole expression to be non-greedy, I could have written 'y*?\d+?'. But since I wrote 'y*?\d+', it is clear that I meant for the first atom to be non-greedy, and for the second to be greedy. > The issue that is obscure is, once you define some non-greedy > constructs, how to define how they should act in combination with greedy > ones. I'm not sure to what extent the engine's behavior is driven by > implementation restrictions and to what extent it's really the sanest > behavior Henry could think of. I found a comment from him about it: > http://groups.google.com/group/comp.lang.tcl/msg/c493317cc0d10d50 > but it's short on details as to what alternatives he considered. Thanks for the link; it is always good to get more insight into Henry's approach. I'm beginning to think I should just start reading everything he ever posted to comp.lang.tcl ... Cheers, BJ
Brendan Jurd <direvus@gmail.com> writes: > On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Brendan Jurd <direvus@gmail.com> writes: >>> I'll admit that this is a pretty obscure point, but we do appear to be >>> in direct violation of POSIX here. >> How so? POSIX doesn't contain any non-greedy constructs. If you use >> only the POSIX-compatible greedy constructs, the behavior is compliant. > While it's true that POSIX doesn't contemplate non-greed, after > reading the spec I would have expected an expression *as a whole* to > still prefer the longest match, insofar as that is possible while > respecting non-greedy particles. It's not apparent to me that a construct that is outside the spec shouldn't be expected to change the overall behavior. In particular, what if the RE contains only non-greedy quantifiers --- would you still expect longest match overall? Henry certainly didn't. > I just ran some quick experiments in > Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'. All returned > 'y1234', which to my mind is the most obvious answer, and one which > still makes sense in light of what POSIX has to say. Whereas Postgres > (and presumably Tcl) return 'y1'. Well, that's just an arbitrary example. The cases I remember people complaining about in practice were the other way round: greedy quantifier followed by non-greedy, and they were unhappy that the non-greediness was effectively not respected (because the overall RE was taken as greedy). So you can't fix the issue by pointing to POSIX and saying "overall greedy is always the right thing". Another thing I've seen people complain about is that an alternation (anything with top-level "|") is always taken as greedy overall. This seems a bit surprising to me --- I'd have expected a rule like "inherits its greediness from the leftmost branch that has one", though I'm not sure whether that would really be much of an improvement in practice. (It would at least fix the problem that a leading "(a|b)" determines greediness of the containing RE whereas the logically equivalent "[ab]" does not.) Again, pointing to POSIX and saying "overall greedy is the right thing" doesn't help. regards, tom lane
On Sun, Mar 04, 2012 at 12:34:22PM -0500, Tom Lane wrote: > Well, that's just an arbitrary example. The cases I remember people > complaining about in practice were the other way round: greedy > quantifier followed by non-greedy, and they were unhappy that the > non-greediness was effectively not respected (because the overall RE was > taken as greedy). So you can't fix the issue by pointing to POSIX and > saying "overall greedy is always the right thing". I was one of the complaining, and my point was that deciding for whole regexp whether it's greedy or non-greedy is a bug (well, it might be documented, but it's still *very* unexpected). I stand on position that mixing greedy and non-greedy operators should be possible, and that it should work according to POLA - i.e. greedines of given atom shouldn't be influenced by other atoms. Best regards, depesz -- The best thing about modern society is how easy it is to avoid contact with it. http://depesz.com/
hubert depesz lubaczewski <depesz@depesz.com> writes: > I stand on position that mixing greedy and non-greedy operators should > be possible, and that it should work according to POLA - i.e. greedines > of given atom shouldn't be influenced by other atoms. [ shrug... ] That sounds good, but it's pretty much vacuous as far as defining a principled alternative behavior goes. It's easy to demonstrate cases where atoms *must* be influenced by other ones. A trivial example is(.*)(.*) It doesn't matter whether the second atom is greedy or not: it's not going to get to eat anything because the first one does instead. IOW this is just the same as(.*)(.*?) --- they are both overall-greedy. regards, tom lane
On 5 March 2012 04:34, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Brendan Jurd <direvus@gmail.com> writes: >> On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Brendan Jurd <direvus@gmail.com> writes: >> While it's true that POSIX doesn't contemplate non-greed, after >> reading the spec I would have expected an expression *as a whole* to >> still prefer the longest match, insofar as that is possible while >> respecting non-greedy particles. > > It's not apparent to me that a construct that is outside the spec > shouldn't be expected to change the overall behavior. In particular, > what if the RE contains only non-greedy quantifiers --- would you still > expect longest match overall? Henry certainly didn't. Well, no, but then that wouldn't be "prefer the longest match, insofar as that is possible whilerespecting non-greedy particles". If all the quantifiers in an expression are non-greedy, then it is trivially true that the only way to respect the author's intention is to return the shortest match. > Well, that's just an arbitrary example. The cases I remember people > complaining about in practice were the other way round: greedy > quantifier followed by non-greedy, and they were unhappy that the > non-greediness was effectively not respected (because the overall RE was > taken as greedy). I am unhappy about the reverse example too, and for the same reasons. If we look at 'xy1234' ~ 'y*\d+?', Postgres gives us 'y1234' (disregarding the non-greediness of the latter quantifier), and Python gives us 'y1' (respecting both quantifiers). So in Postgres, no matter how you mix the greediness up, some of your quantifiers will not be respected. > So you can't fix the issue by pointing to POSIX and > saying "overall greedy is always the right thing". "... insofar as that is possible while respecting non-greedy particles". I will take Henry's word for it that this problem is harder than it looks, but in any case, I think we may not presume to know better than the author of the regex about the greediness of his quantifiers. > Another thing I've seen people complain about is that an alternation > (anything with top-level "|") is always taken as greedy overall. Yeah, that is quirky. Cheers, BJ
On Sun, Mar 4, 2012 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > hubert depesz lubaczewski <depesz@depesz.com> writes: >> I stand on position that mixing greedy and non-greedy operators should >> be possible, and that it should work according to POLA - i.e. greedines >> of given atom shouldn't be influenced by other atoms. > > [ shrug... ] That sounds good, but it's pretty much vacuous as far as > defining a principled alternative behavior goes. It's easy to > demonstrate cases where atoms *must* be influenced by other ones. > A trivial example is > (.*)(.*) > It doesn't matter whether the second atom is greedy or not: it's not > going to get to eat anything because the first one does instead. > IOW this is just the same as > (.*)(.*?) > --- they are both overall-greedy. But I don't think this makes Brendan's example not a bug. In this case, we can't eat anything because there's nothing to eat. In his example, there's something to eat, and it's sitting in a place where it logically seems to line up with a greedy quantifier, and yet the quantifier doesn't eat it. Now in fairness, I've seen my fair share of apparently-buggy behavior from Perl's regex engine over the years as well. I think the right way to imagine this is as though the regular expression were being matched to the source text in left-to-right fashion. For example, suppose the RE is [ab]+?[bc]+ and the source text is aabbcc. Clearly there's a match at position 1, but the match could be anywhere between 3 and 6 characters in length, depending on how greedy we think the RE should be, and the exact source text that gets matched to each atom is up for grabs. Enumerating all the possibilities where each atom matches a string that is consistent with its definition, leaving greediness aside, we get this: [aa,b] [aa,bb] [aa,bbc] [aa,bbcc] [aab,b] [aab,bc] [aab,bcc] [aabb,c] [aabb,cc] To distinguish among these, we look first at [ab]+? and decide that, since it is non-greedy, the right overall answer must be one of the ones that assigns to [ab]+? a match of minimal length within the space of possible matches. That narrows the field to just the first four choices listed above. Then we look at [bc]+ and, since it is greedy, give it a match of maximal length within the space of possible matches, leading to [ab]+? = aa and [bc]+ = bbcc. Had the RE been [ab]+?[bc]+?, the same algorithm would assign [ab]+? = aa and [bc]+? = b; had it been [ab]+[bc]+? the same algorithm would assign [ab]+ = aabb and [bc]+? = c. Testing, I find that this matches what Perl does in all of those cases. Things get a bit more complicated when you throw alternation into the picture, but I think it's basically right to view it as a greedy operation. Anything else seems rather unprincipled. It may seem surprising that a+?|b+? matched against aaa will match the first branch to aaa rather than just a, but there's no real reason to suppose that the ? which applies to the plus should somehow bleed up and affect the interpretation of the alternation. The RE might be something like a+b+?|b+?a+ where the sub-REs have different greediness interpretations and there's no obvious principled way to decide which one should apply to the parent; or even something like cat|catfish where the sub-REs don't contain any greedy/non-greedy operators at all and yet we still have to assign some greediness to the alternation. I think it's right to view every RE construct as greedy unless it's got an explicit "not-greedy" flag attached to it; after all, that's the traditional behavior of REs from time immemorial. Someone could invent a non-greedy form of alternation if they were so inclined. This is different from what Perl does, but I think Perl's behavior here is batty: given a+|a+b+ and the string aaabbb, it picks the first branch and matches only aaa. My whole being recoils: that surely looks like a non-greedy interpretation of a regex with only greedy quantifiers. It turns out that if you write a+b+|a+ then it matches the whole string; apparently, it selects the syntactically first branch that can match, regardless of the length of the match, which strikes me as nearly pure evil. PostgreSQL - correctly, IMHO - matches the longest substring available regardless of the syntactic ordering; AIUI, the original charter for REs was to always match the longest possible substring; non-greedy quantifiers were added later, and should not be taken as a license to change the semantics of REs that don't use them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 5 March 2012 17:23, Robert Haas <robertmhaas@gmail.com> wrote: > This is different from what Perl does, but I think Perl's behavior > here is batty: given a+|a+b+ and the string aaabbb, it picks the first > branch and matches only aaa. Yeah, this is sometimes referred to as "ordered alternation", basically that the branches of the alternation are prioritised in the same order in which they are described. It is fairly commonplace among regex implementations. > apparently, it selects the syntactically first > branch that can match, regardless of the length of the match, which > strikes me as nearly pure evil. As long as it's documented that alternation prioritises in this way, I don't feel upset about it. At least it still provides you with a sensible way to get whatever you want from your RE; if you want a shorter alternative to be preferred, put it up the front. Ordered alternation also gives you a way to specify which of several same-length alternatives you would prefer to be matched, which can come in handy. It also means you can specify less-complex alternatives before more-complex ones, which can have performance advantages. I do agree with you that if you *don't* do ordered alternation, then it is right to treat alternation as greedy by default. Cheers, BJ
Robert Haas <robertmhaas@gmail.com> writes: > I think the right way to imagine this is as though the regular > expression were being matched to the source text in left-to-right > fashion. No, it isn't. You are headed down the garden path that leads to a Perl-style definition-by-implementation, and in particular you are going to end up with an implementation that fails to satisfy the POSIX standard. POSIX requires an *overall longest* match (at least for cases where all quantifiers are greedy), and that sometimes means that the quantifiers can't be processed strictly left-to-right greedy. An example of this is regression=# select substring('aaaaaabab' from '(a*(ab)*)');substring -----------aaaaaabab (1 row) If the a* is allowed to match as much as it wants, the (ab)* will not be able to match at all, and then you fail to find the longest possible overall match. I suspect that it is possible to construct similar cases where, for an all-non-greedy pattern, finding the overall shortest match sometimes requires that individual quantifiers eat more than the local minimum. I've not absorbed enough caffeine yet this morning to produce an example though. I probably shouldn't guess too much at Henry Spencer's thought processes, but I think that he was looking for an extension of this POSIX concept to mixed-greediness cases, ie you first define what the overall RE matches and then let the individual quantifiers fight it out as to which one gets how much of that. The particular way he did that is obviously leaving a lot of people unsatisfied, but I think we need to keep looking for rules of that sort, and not revert to defining the behavior by a search algorithm. > I think it's right to view every RE construct as greedy unless it's got > an explicit "not-greedy" flag attached to it; after all, that's the > traditional behavior of REs from time immemorial. Someone could > invent a non-greedy form of alternation if they were so inclined. I think you can do that already: "(foo|bar){1,1}?" (if this doesn't result in a non-greedy alternation then it's a bug). The notation is a bit ugly admittedly. regards, tom lane
On Mon, Mar 5, 2012 at 11:28 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I think the right way to imagine this is as though the regular >> expression were being matched to the source text in left-to-right >> fashion. > > No, it isn't. You are headed down the garden path that leads to a > Perl-style definition-by-implementation, and in particular you are going > to end up with an implementation that fails to satisfy the POSIX > standard. POSIX requires an *overall longest* match (at least for cases > where all quantifiers are greedy), and that sometimes means that the > quantifiers can't be processed strictly left-to-right greedy. An > example of this is > > regression=# select substring('aaaaaabab' from '(a*(ab)*)'); > substring > ----------- > aaaaaabab > (1 row) > > If the a* is allowed to match as much as it wants, the (ab)* will not be > able to match at all, and then you fail to find the longest possible > overall match. Oh. Right. > I suspect that it is possible to construct similar cases where, for an > all-non-greedy pattern, finding the overall shortest match sometimes > requires that individual quantifiers eat more than the local minimum. > I've not absorbed enough caffeine yet this morning to produce an example > though. Probably true. I guess, then, that the issue here is that there isn't really any principled way to decide whether the RE overall should be greedy or non-greedy. And similarly with every sub-RE. The problem with the "non-greedy" quantifiers is that they apply only to the quantified bit specifically, which leaves us guessing as to the user's intent with regards to everything else. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Mar 05, 2012 at 11:28:24AM -0500, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > I think the right way to imagine this is as though the regular > > expression were being matched to the source text in left-to-right > > fashion. > > No, it isn't. You are headed down the garden path that leads to a > Perl-style definition-by-implementation, and in particular you are going > to end up with an implementation that fails to satisfy the POSIX > standard. POSIX requires an *overall longest* match (at least for cases > where all quantifiers are greedy), and that sometimes means that the > quantifiers can't be processed strictly left-to-right greedy. An > example of this is On the otherhand, I think requiring an "overall longest match" makes your implementation non-polynomial complexity. The simplest example I can think of is the knapsack problem, where given weights x_n and a total W, can be converted to a regex problem as matching a string with W a's against the regex: a{x_1}?a{x_2}?a{x_3}? etc... Yes, Perl (and others) don't guarentee an overall longest match. I think they want you to consider regular expressions as a specialised parsing language where you can configure a state machine to process your strings. Not ideal, but predicatable. The question is, what are users expecting of the PostgreSQL regex implementation? Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > He who writes carelessly confesses thereby at the very outset that he does > not attach much importance to his own thoughts. -- Arthur Schopenhauer
Martijn van Oosterhout <kleptog@svana.org> writes: > On the otherhand, I think requiring an "overall longest match" makes > your implementation non-polynomial complexity. Only if you don't know how to implement it -- a DFA-based implementation doesn't have much trouble with this. > [ equivalence of knapsack problem to regexes with bounded repetition ] Interesting, but note that neither the POSIX spec nor our implementation permit arbitrarily large repetition counts, so the theoretical NP-completeness is only theoretical. > The question is, what are users expecting of the PostgreSQL regex > implementation? I think a minimum expectation is that we adhere to the POSIX specification. regards, tom lane