Re: Our regex vs. POSIX on "longest match"

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Our regex vs. POSIX on "longest match"
Дата
Msg-id 16069.1330882462@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: Our regex vs. POSIX on "longest match"  (Brendan Jurd <direvus@gmail.com>)
Ответы Re: Our regex vs. POSIX on "longest match"
Re: Our regex vs. POSIX on "longest match"
Список pgsql-hackers
Brendan Jurd <direvus@gmail.com> writes:
> On 4 March 2012 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Brendan Jurd <direvus@gmail.com> writes:
>>> I'll admit that this is a pretty obscure point, but we do appear to be
>>> in direct violation of POSIX here.

>> How so?  POSIX doesn't contain any non-greedy constructs.  If you use
>> only the POSIX-compatible greedy constructs, the behavior is compliant.

> While it's true that POSIX doesn't contemplate non-greed, after
> reading the spec I would have expected an expression *as a whole* to
> still prefer the longest match, insofar as that is possible while
> respecting non-greedy particles.

It's not apparent to me that a construct that is outside the spec
shouldn't be expected to change the overall behavior.  In particular,
what if the RE contains only non-greedy quantifiers --- would you still
expect longest match overall?  Henry certainly didn't.

> I just ran some quick experiments in
> Perl, Python and PHP, using 'xy1234' ~ 'y*?\d+'.  All returned
> 'y1234', which to my mind is the most obvious answer, and one which
> still makes sense in light of what POSIX has to say.  Whereas Postgres
> (and presumably Tcl) return 'y1'.

Well, that's just an arbitrary example.  The cases I remember people
complaining about in practice were the other way round: greedy
quantifier followed by non-greedy, and they were unhappy that the
non-greediness was effectively not respected (because the overall RE was
taken as greedy).  So you can't fix the issue by pointing to POSIX and
saying "overall greedy is always the right thing".

Another thing I've seen people complain about is that an alternation
(anything with top-level "|") is always taken as greedy overall.
This seems a bit surprising to me --- I'd have expected a rule like
"inherits its greediness from the leftmost branch that has one",
though I'm not sure whether that would really be much of an improvement
in practice.  (It would at least fix the problem that a leading "(a|b)"
determines greediness of the containing RE whereas the logically
equivalent "[ab]" does not.)  Again, pointing to POSIX and saying
"overall greedy is the right thing" doesn't help.
        regards, tom lane


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Simon Riggs
Дата:
Сообщение: Re: RFC: Making TRUNCATE more "MVCC-safe"
Следующее
От: Tom Lane
Дата:
Сообщение: Re: Patch: improve selectivity estimation for IN/NOT IN