Thread: regex help wanted

regex help wanted

From
Karsten Hilbert
Date:
Hi,

I am in the process of converting some TEXT data which I try
to identify by regular expression.

What I don't understand is: Why does the following return a
substring ?

    select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

I would have thought the '::[^:]+?>' part should have meant

    after two ":"s
    match at least one character
    except any further ":"s
    until the next ">"

I don't find the flaw in my thinking. Can anyone help ?

(Sure, it is not PostgreSQL-specific yet I need to run this
in PostgreSQL on data migration.)

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: regex help wanted

From
Tom Lane
Date:
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> What I don't understand is: Why does the following return a
> substring ?

>     select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');

There's a perfectly valid match in which [^<]+? matches allergy::test
and [^:]+? matches 99.
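
To make that split visible, here is a minimal sketch that adds capture groups around the two lazy parts of the pattern. It uses Python's `re` module as a stand-in so it can run without a database; that choice, the added parentheses, and the variable names are assumptions for illustration and not part of the original query. PostgreSQL's regex engine differs from Python's in some details, but the split it reports for this input is the one described above.

```python
import re

text = "junk $<allergy::test::99>$ junk"

# Same pattern as in the query, with capture groups added (illustrative only)
# around the two lazy sub-patterns so their matched text can be inspected.
pattern = re.compile(r"\$<([^<]+?)::([^:]+?)>\$")

m = pattern.search(text)
whole = m.group(0)
first, second = m.groups()

print(whole)   # the full matched substring
print(first)   # what [^<]+? ended up matching
print(second)  # what [^:]+? ended up matching
```

The first group has to grow past the first "::" because nothing shorter lets the rest of the pattern succeed.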

            regards, tom lane


Re: regex help wanted

From
Thom Brown
Date:
On 25 April 2013 15:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
>> What I don't understand is: Why does the following return a
>> substring ?
>
>>       select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');
>
> There's a perfectly valid match in which [^<]+? matches allergy::test
> and [^:]+? matches 99.

Yeah, I think there may be an assumption that a lazy quantifier will
stop short and cause the remainder to fail to match permanently, but
it will backtrack, forcing the lazy quantifier to expand until it can
match the expression.

--
Thom


Re: regex help wanted

From
Karsten Hilbert
Date:
On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:

> Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> > What I don't understand is: Why does the following return a
> > substring ?
>
> >     select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');
>
> There's a perfectly valid match in which [^<]+? matches allergy::test
> and [^:]+? matches 99.

Tom, thanks for helping !

I would have thought "<[^<]+?:" should mean:

    match a "<"
    followed by 1-n characters as long as they are not "<"
    until the VERY NEXT ":"

The "?" should make the "+" after "[^<]" non-greedy and thus
stop at the first occurrence of ":", right ?  Or am I
misunderstanding that part ?

At any rate,

    select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<:]+?::[^:]+?>\$');

(which follows from your hint) appears to do what I need.
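
As a quick check of that revised pattern, the following sketch runs it against the original input and against a second, hypothetical input that contains only one "::" separator. Python's `re` module is used here as a stand-in for the PostgreSQL call (an assumption made so the example is self-contained); the second input string and the variable names are invented for illustration.

```python
import re

# Revised pattern: the first character class now excludes ":" as well as "<".
fixed = re.compile(r"\$<[^<:]+?::[^:]+?>\$")

doubled = "junk $<allergy::test::99>$ junk"  # input from the original question
single = "junk $<allergy::99>$ junk"         # hypothetical input with one "::"

doubled_match = fixed.search(doubled)
single_match = fixed.search(single)

print(doubled_match)          # no match object for the doubled input
print(single_match.group(0))  # the delimited token from the second input
```

With ":" excluded from the first class, the pattern can no longer stretch across "allergy::test", so the doubled input is rejected.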

Thanks,
Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: regex help wanted

From
Karsten Hilbert
Date:
On Thu, Apr 25, 2013 at 03:40:51PM +0100, Thom Brown wrote:

> On 25 April 2013 15:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> >> What I don't understand is: Why does the following return a
> >> substring ?
> >
> >>       select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');
> >
> > There's a perfectly valid match in which [^<]+? matches allergy::test
> > and [^:]+? matches 99.
>
> Yeah, I think there may be an assumption that a lazy quantifier will
> stop short and cause the remainder to fail to match permanently, but
> it will backtrack, forcing the lazy quantifier to expand until it can
> match the expression.

Yup, therein lies the rub :-)

Thanks,
Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346


Re: regex help wanted

From
Tom Lane
Date:
Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
> I would have thought "<[^<]+?:" should mean:

>     match a "<"
>     followed by 1-n characters as long as they are not "<"
>     until the VERY NEXT ":"

> The "?" should make the "+" after "[^<]" non-greedy and thus
> stop at the first occurrence of ":", right ?  Or am I
> misunderstanding that part ?

No, non-greedy just means that if there are multiple ways to make the
pattern match the string, prefer the way that makes this sub-match the
shortest (whereas the default makes leftmost sub-matches longest).
If you don't want the char class to match : then you need to say that
explicitly.

BTW, I'm fairly sure that unless you are doing something that extracts
or replaces sub-matches, there is no value whatever in marking
quantifiers non-greedy; that just complicates life for the regex
compiler.  A match is a match, if you're not paying attention to
where the subpattern boundaries are.
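
A small sketch of that last point, using Python's `re` module as a stand-in (an assumption; PostgreSQL has its own rules for assigning sub-matches, described in the manual section on regular expression matching rules). The simplified patterns and names below are illustrative only. Both variants match the same overall text; only the sub-match boundaries differ.

```python
import re

text = "<allergy::test::99>"

# All-lazy and all-greedy variants of the same simplified pattern.
lazy = re.search(r"<(.+?)::(.+?)>", text)
greedy = re.search(r"<(.+)::(.+)>", text)

same_overall = lazy.group(0) == greedy.group(0)
lazy_groups = lazy.groups()
greedy_groups = greedy.groups()

print(same_overall)   # whether the overall matched text is identical
print(lazy_groups)    # sub-matches when the quantifiers are lazy
print(greedy_groups)  # sub-matches when the quantifiers are greedy
```

If only the existence of a match (or the whole matched substring) matters, the two forms are interchangeable for this input.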

            regards, tom lane


Re: regex help wanted

From
Jasen Betts
Date:
On 2013-04-25, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
> On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:
>
>> Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
>> > What I don't understand is: Why does the following return a
>> > substring ?
>>
>> >     select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');
>>
>> There's a perfectly valid match in which [^<]+? matches allergy::test
>> and [^:]+? matches 99.
>
> Tom, thanks for helping !
>
> I would have thought "<[^<]+?:" should mean:
>
>     match a "<"
>     followed by 1-n characters as long as they are not "<"
>     until the VERY NEXT ":"


If you want that, say:  "<[^<:]+:"

> The "?" should make the "+" after "[^<]" non-greedy and thus
> stop at the first occurrence of ":", right ?  Or am I
> misunderstanding that part ?

From "the fine manual"

 Non-greedy quantifiers (available in AREs only) match the same
 possibilities as their corresponding normal (greedy) counterparts, but
 prefer the smallest number rather than the largest number of matches.
 See Section 9.7.3.5 for more detail.



--
⚂⚃ 100% natural

Re: regex help wanted

From
matt@byrney.com
Date:
> On 2013-04-25, Karsten Hilbert <Karsten.Hilbert@gmx.net> wrote:
>> On Thu, Apr 25, 2013 at 10:32:26AM -0400, Tom Lane wrote:
>>
>>> Karsten Hilbert <Karsten.Hilbert@gmx.net> writes:
>>> > What I don't understand is: Why does the following return a
>>> > substring ?
>>>
>>> >     select substring ('junk $<allergy::test::99>$ junk' from '\$<[^<]+?::[^:]+?>\$');
>>>
>>> There's a perfectly valid match in which [^<]+? matches allergy::test
>>> and [^:]+? matches 99.
>>
>> Tom, thanks for helping !
>>
>> I would have thought "<[^<]+?:" should mean:
>>
>>     match a "<"
>>     followed by 1-n characters as long as they are not "<"
>>     until the VERY NEXT ":"
>
>
> if you want that say:  "<[^<:]+:"
>
>> The "?" should make the "+" after "[^<]" non-greedy and thus
>> stop at the first occurrence of ":", right ?  Or am I
>> misunderstanding that part ?

Greediness and non-greediness of operators are like hints - they are only
honoured if there is a choice in the matter.  In your case, if the
<[^<]+?: stopped at the first ":", it would be impossible to match the
rest of the pattern.
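
The "hint" behaviour can be sketched with a shortened, made-up input. Python's `re` module serves as a stand-in for the PostgreSQL engine here (an assumption for the sake of a runnable example), and the input string and variable names are hypothetical.

```python
import re

text = "<a:b:c"

# With nothing after the ":", there is a choice, and laziness is honoured.
lazy = re.search(r"<[^<]+?:", text).group(0)

# The greedy form takes the longest stretch that still ends in ":".
greedy = re.search(r"<[^<]+:", text).group(0)

# Once the rest of the pattern demands ":c", the lazy part has to expand
# past the first ":" or the overall match would fail.
forced = re.search(r"<[^<]+?:c", text).group(0)

print(lazy)
print(greedy)
print(forced)
```

The third case mirrors the original query: the lazy quantifier prefers the short option but gives it up when that is the only way to match.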