Обсуждение: BUG #4044: Incorrect RegExp substring Output

Поиск
Список
Период
Сортировка

BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
The following bug has been logged online:

Bug reference:      4044
Logged by:          Rui Martins
Email address:      Rui.Martins@pdmfc.com
PostgreSQL version: 8.3
Operating system:   Win2000
Description:        Incorrect  RegExp substring Output
Details:

To setup for test, run these SQL commands

CREATE TABLE TEST_REGEXP
(
    BedNo VARCHAR(20)
);

INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '123' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '4325:1' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '2464M' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '5678M:2' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '453L:1' );

From the following link
http://www.postgresql.org/docs/8.3/static/functions-matching.html

We can read the following text:

The substring function with two parameters, substring(string from pattern),
provides extraction of a substring that matches a POSIX regular expression
pattern. It returns null if there is no match, otherwise the portion of the
text that matched the pattern. But if the pattern contains any parentheses,
the portion of the text that matched the first parenthesized subexpression
(the one whose left parenthesis comes first) is returned. You can put
parentheses around the whole expression if you want to use parentheses
within it without triggering this exception. If you need parentheses in the
pattern before the subexpression you want to extract, see the non-capturing
parentheses described below.


-- -- -- -- --
No suppose we want to split up the "BedNo" column, into its parts
(DoorNumber, RoomSize and BedNumber)


SELECT BedNo,
       SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
       SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
       SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

Or in it's other form

SELECT BedNo,
       SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
       SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
       SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

Both will return the same result:

BedNo    DoorNumber    RoomSize    BedNumber
123    123        123
4325:1    4325        :1
2464M    2464    M    2464M
5678M:2    5678    M    :2
453L:1    453    L    :1

NOTE: Tabs may note be seen correctly on a web interface.

This is clearly, not the expected result, in particular for BedNo IN (
'123', '2464M' )

The BedNumber returned in these cases is NOT logical!

However, the manual is NOT EXPLICIT in what happens, if the returned MATCHED
parentheses part is the equivalent of an empty string!

Although it states:
"But if the pattern contains any parentheses, the portion of the text that
matched the first parenthesized subexpression (the one whose left
parenthesis comes first) is returned."

Apparently, the function is returning the entire MATCHED string, instead of
just the parenthesized subexpression.

I would expect the result for BedNumber to be either NULL or the EMPTY
String, and the later seems more logical. But the documentation doesn't
state which should be returned!

Not withstanding, the expected result should be:

BedNo    DoorNumber    RoomSize    BedNumber
123    123
4325:1    4325        :1
2464M    2464    M
5678M:2    5678    M    :2
453L:1    453    L    :1

NOTE: Tabs may note be seen correctly on a web interface.


-- Hack Note for this specific case !

For this specific case, we can do a hack, and change the RegExp for
BedNumber, like in the next SELECT.

SELECT BedNo,
       SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
       SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
       SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

or alternatively

SELECT BedNo,
       SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
       SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
       SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP

This will return NULL when there is no BedNumber, by forcing the regExp to
fail the match.

But this only works, because the format uses ":" in that specific location,
if there was no ":" anywhere, we would NOT get away so easily.

Hope to have bee of help, in finding this documentation issue and
implementation bug.

Re: BUG #4044: Incorrect RegExp substring Output

От
Tom Lane
Дата:
"Rui Martins" <Rui.Martins@pdmfc.com> writes:
> Description:        Incorrect  RegExp substring Output

>        SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' )

Interesting.  It had never occurred to me that it's possible for the
whole pattern to have a match when some parenthesized subexpression
has no match.  On investigation, Tcl's regex library seems to get
this right, but textregexsubstr() doesn't.  Will fix.

> I would expect the result for BedNumber to be either NULL or the EMPTY
> String, and the later seems more logical.

It's going to be null.  Your example has no match to the parenthesized
substring --- a match would have to include a colon and some digits, no?

            regards, tom lane

Re: BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
> "Rui Martins" <Rui.Martins@pdmfc.com> writes:
>> Description:        Incorrect  RegExp substring Output
>
>>        SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' )
>
> Interesting.  It had never occurred to me that it's possible for the
> whole pattern to have a match when some parenthesized subexpression
> has no match.  On investigation, Tcl's regex library seems to get
> this right, but textregexsubstr() doesn't.  Will fix.
>
>> I would expect the result for BedNumber to be either NULL or the EMPTY
>> String, and the later seems more logical.
>
> It's going to be null.  Your example has no match to the parenthesized
> substring --- a match would have to include a colon and some digits, no?

Yes, the subexpression will not match, but the entire expression will.
Taking this into account I agree that it should be NULL then, but this
should be CLEARLY stated in the MANUAL, so that the user will not have to
guess.

I believe that there should be a more detailed explanation of Substring
function in the manual, because I haven't found a specific section about
it. It is kind of scattered around the string functions page.

>             regards, tom lane

Thank you for your feedback.

P.S.
  Will the fix be available as a patch or just in 8.3.1 ?

See ya
   Rui Martins

Re: BUG #4044: Incorrect RegExp substring Output

От
Tom Lane
Дата:
"Rui Martins" <Rui.Martins@PDMFC.com> writes:
> Here the context of the word "match" may be misleading us, in this
> conversation.
> I say this, because in my report, the second substring expression, the one
> for RoomSize:

> SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize,

> Actually returns an EMPTY String, and not a NULL, for the first 2 test
> cases, which I believe is the correct answer.

What's your point?  A zero-length substring is a legal match to that
subexpression ... unlike the third one.

            regards, tom lane

Re: BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
Hi Tom

Just a side note, See comments below.

> "Rui Martins" <Rui.Martins@pdmfc.com> writes:
>> Description:        Incorrect  RegExp substring Output
>
>>        SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' )
>
> Interesting.  It had never occurred to me that it's possible for the
> whole pattern to have a match when some parenthesized subexpression
> has no match.  On investigation, Tcl's regex library seems to get
> this right, but textregexsubstr() doesn't.  Will fix.
>
>> I would expect the result for BedNumber to be either NULL or the EMPTY
>> String, and the later seems more logical.
>
> It's going to be null.  Your example has no match to the parenthesized
> substring --- a match would have to include a colon and some digits, no?


You mention that it will return NULL, when the subexpression does not match!

Here the context of the word "match" may be misleading us, in this
conversation.
I say this, because in my report, the second substring expression, the one
for RoomSize:

SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize,

Actually returns an EMPTY String, and not a NULL, for the first 2 test
cases, which I believe is the correct answer.

From what I can infer, from your definition, of "match", from your last
sentence, this should NOT be a MATCH for the Subexpression, since it would
be an EMPTY match.
However, it's returning EMPTY String instead of NULL. i.e. returns what I
expect and not what you have said it should return in case of NO MATCH.

I usually think that a "match" is something that validates as correct, and
hence returns something. But I have to admit that I usually think about a
global match, i.e. the entire expression match, and not about
sub-expression match.

Even though this can me though as argumentative, think about this expression:

(something)?

Will "match" with an empty string in the context of a full expression, and
will return an EMPTY String. So by analogy, I would expect it, to return
the same as a sub-expression when it actually has a "match" even if with
an empty sub-string.

My expectations and assumptions might be wrong, but I believe they are
correct. Please check this too.

Once again, thank you for your quick feedback.

Best regards
   Rui Martins

Re: BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
Hi Tom

What I meant is:

  SELECT '' ~ '^(something)?$'

This will match, i.e. the empty string will match with the expression,
and substring would return and empty string.

or in other words:
   it returns true

  SELECT SUBSTRING( '', '^(something)?$' )

This will also match, has a global expression, and the return is, has
expected, an EMPTY string !

Now, If all the above is correct (which I believe it is), than:

  SELECT 'TEST' ~ '^TEST(something)?$'

Will also return TRUE, which it correctly does.

So, what would you expect to be the result, from the following query ?

  SELECT SUBSTRING( 'TEST', '^TEST(something)?$' )

The current returned value is 'TEST' which is wrong!
This we knew already, from my bug Report.

But by analogy, with the previous queries, and to maintain overall
coherence the expected correct result should be and EMPTY string too!

But If I understood you correctly, you said that it will return a result
of NULL, which I think is incorrect.


Did I make my self clearer know ?

Thanks again for feedback.

See ya
   Rui Martins

> "Rui Martins" <Rui.Martins@PDMFC.com> writes:
>> Here the context of the word "match" may be misleading us, in this
>> conversation.
>> I say this, because in my report, the second substring expression, the
>> one
>> for RoomSize:
>
>> SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
>> RoomSize,
>
>> Actually returns an EMPTY String, and not a NULL, for the first 2 test
>> cases, which I believe is the correct answer.
>
> What's your point?  A zero-length substring is a legal match to that
> subexpression ... unlike the third one.
>
>             regards, tom lane
>

Re: BUG #4044: Incorrect RegExp substring Output

От
Tom Lane
Дата:
"Rui Martins" <Rui.Martins@PDMFC.com> writes:
> Even though this can me though as argumentative, think about this expression:

> (something)?

> Will "match" with an empty string in the context of a full expression, and
> will return an EMPTY String. So by analogy, I would expect it, to return
> the same as a sub-expression when it actually has a "match" even if with
> an empty sub-string.

Uh, no, it *won't* match if there is not "something" in the string.

The behavior you are looking for is properly obtained this way:

    ((something)?)

This will return either "something" or an empty string (assuming
we have a globally successful match).  The point is there's a difference
between what X matches (or doesn't) and what X? matches.

            regards, tom lane

Re: BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
> "Rui Martins" <Rui.Martins@PDMFC.com> writes:
>> Even though this can me though as argumentative, think about this
>> expression:
>
>> (something)?
>
>> Will "match" with an empty string in the context of a full expression,
>> and
>> will return an EMPTY String. So by analogy, I would expect it, to return
>> the same as a sub-expression when it actually has a "match" even if with
>> an empty sub-string.
>
> Uh, no, it *won't* match if there is not "something" in the string.
>
> The behavior you are looking for is properly obtained this way:
>
>     ((something)?)
>
> This will return either "something" or an empty string (assuming
> we have a globally successful match).  The point is there's a difference
> between what X matches (or doesn't) and what X? matches.
>
>             regards, tom lane

OK, what I'm trying to explain is maybe a difference of assumptions:

  SELECT '' ~ '^(something)?$'

NOW: True
 ME: True
YOU: True


  SELECT SUBSTRING( '', '^(something)?$' )

NOW: '' => EMPTY String (OK, by pure luck. It's returning FULL match)
 ME: '' => EMPTY String
YOU: NULL


  SELECT 'TEST' ~ '^TEST(something)?$'

NOW: True
 ME: True
YOU: True


  SELECT SUBSTRING( 'TEST', '^TEST(something)?$' )

NOW: 'TEST' => WRONG !!! (It's returning FULL Match)
 ME: ''     => EMPTY String
YOU: NULL


My reasoning is:
Why would the exact same sub-expression, return different results when
either preceded or followed by something.
In my opinion, the same sub-expression should always return the same
result if the same conditions are met (in the sub-expression context).

Hope this time it's clear.


Best Regards
   Rui Martins

Re: BUG #4044: Incorrect RegExp substring Output

От
Tom Lane
Дата:
"Rui Martins" <Rui.Martins@PDMFC.com> writes:
> My reasoning is:
> Why would the exact same sub-expression, return different results when
> either preceded or followed by something.

It *isn't* returning different results; you are testing for different
things in these two cases, namely whether there is a match to the whole
pattern or just a parenthesized subpattern.  In none of these examples
was there any match to '(something)' --- there couldn't possibly be,
because "something" isn't in the data string.

            regards, tom lane

Re: BUG #4044: Incorrect RegExp substring Output

От
"Rui Martins"
Дата:
Hi Tom

> "Rui Martins" <Rui.Martins@PDMFC.com> writes:
>> My reasoning is:
>> Why would the exact same sub-expression, return different results when
>> either preceded or followed by something.
>
> It *isn't* returning different results; you are testing for different
> things in these two cases, namely whether there is a match to the whole
> pattern or just a parenthesized subpattern.  In none of these examples
> was there any match to '(something)' --- there couldn't possibly be,
> because "something" isn't in the data string.
>
>             regards, tom lane

That's one way to look at it. That's why I mentioned the possibility of
different assumptions regarding the context of the word "match".

In fact, you are saying that the sub-expression did not "match" because
there wasn't "something" in the string to be matched!
I agree with you on this last part,
  "there wasn't "something" in the string to be matched"
But the sub-expression did "match" !

I say this, because, since the empty string is a valid "match" for
"(something)?" because the "?" question mark operator, is defined as "a
sequence of 0 or 1 matches of the atom".

So we are probably just discussing semantics here!

My concern is that many will make the same refutable "valid" assumptions
that I do.

And If they will get NULL instead of an EMPTY String, it will be awkward,
besides not being able to distinguish between an EMPTY "match" and NO
"match" at all, since both will return NULL, according to your definition.

But what I find odd, is that you say that I'm testing different things. So
what would you say for the following cases ?

   '(something)?'

NOTE: I removed the anchors only.

Now is this a full string match or a sub-expression match ?

We can't give a concrete answer, unless we know the concrete string to be
matched

  SELECT '' ~ '(something)?'

This will be a FULL match


   SELECT 'TEST' ~ '(something)?'

But this one won't! It will be a sub-expression match by your definition.
So using the EXACT same REG_EXP, we will have two different context,
depending on the input !

The regexp context, MUST NOT depend on the String to be matched.
Because if it depends, then this is VERY BAD for consistency.

Do you get my point now ?


Now try this:

  SELECT SUBSTRING( '', '(something)?' )

  SELECT SUBSTRING( 'TEST', '(something)?' )

Odd enough, this, currently, returns the correct answer for both queries!
And by correct I mean EMPTY String !

According to your assumption, the first, would return an Empty String, but
the second, would return NULL !

You should try this with other reg_exp implementations, and see what comes
up on the the sub-expression result.


If after this exposition I haven't been able to correctly transmit the
problem to you, then it's probably my inability to explain it better, or
my   not so good English, since it's not my native language.

Hope you understand this now, since I don't know how to explain it better.

Thank you for your feedback.

Best Regards
   Rui Martins

Re: BUG #4044: Incorrect RegExp substring Output

От
Tom Lane
Дата:
"Rui Martins" <Rui.Martins@PDMFC.com> writes:
> But the sub-expression did "match" !

No, the sub-expression "(something)" did not match.  What did match
is the larger expression "(something)?".  You seem to be failing
to recognize that these are two different things.  If you put the
capturing parentheses around the larger expression then you will
get the result you want.

            regards, tom lane