BUG #4044: Incorrect RegExp substring Output

Поиск
Список
Период
Сортировка
От Rui Martins
Тема BUG #4044: Incorrect RegExp substring Output
Дата
Msg-id 200803181846.m2IIkKwE078848@wwwmaster.postgresql.org
обсуждение исходный текст
Ответы Re: BUG #4044: Incorrect RegExp substring Output  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-bugs
The following bug has been logged online:

Bug reference:      4044
Logged by:          Rui Martins
Email address:      Rui.Martins@pdmfc.com
PostgreSQL version: 8.3
Operating system:   Win2000
Description:        Incorrect  RegExp substring Output
Details:

To setup for test, run these SQL commands

CREATE TABLE TEST_REGEXP
(
    BedNo VARCHAR(20)
);

INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '123' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '4325:1' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '2464M' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '5678M:2' );
INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '453L:1' );

From the following link
http://www.postgresql.org/docs/8.3/static/functions-matching.html

We can read the following text:

The substring function with two parameters, substring(string from pattern),
provides extraction of a substring that matches a POSIX regular expression
pattern. It returns null if there is no match, otherwise the portion of the
text that matched the pattern. But if the pattern contains any parentheses,
the portion of the text that matched the first parenthesized subexpression
(the one whose left parenthesis comes first) is returned. You can put
parentheses around the whole expression if you want to use parentheses
within it without triggering this exception. If you need parentheses in the
pattern before the subexpression you want to extract, see the non-capturing
parentheses described below.


-- -- -- -- --
No suppose we want to split up the "BedNo" column, into its parts
(DoorNumber, RoomSize and BedNumber)


SELECT BedNo,
       SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
       SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
       SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

Or in it's other form

SELECT BedNo,
       SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
       SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
       SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

Both will return the same result:

BedNo    DoorNumber    RoomSize    BedNumber
123    123        123
4325:1    4325        :1
2464M    2464    M    2464M
5678M:2    5678    M    :2
453L:1    453    L    :1

NOTE: Tabs may note be seen correctly on a web interface.

This is clearly, not the expected result, in particular for BedNo IN (
'123', '2464M' )

The BedNumber returned in these cases is NOT logical!

However, the manual is NOT EXPLICIT in what happens, if the returned MATCHED
parentheses part is the equivalent of an empty string!

Although it states:
"But if the pattern contains any parentheses, the portion of the text that
matched the first parenthesized subexpression (the one whose left
parenthesis comes first) is returned."

Apparently, the function is returning the entire MATCHED string, instead of
just the parenthesized subexpression.

I would expect the result for BedNumber to be either NULL or the EMPTY
String, and the later seems more logical. But the documentation doesn't
state which should be returned!

Not withstanding, the expected result should be:

BedNo    DoorNumber    RoomSize    BedNumber
123    123
4325:1    4325        :1
2464M    2464    M
5678M:2    5678    M    :2
453L:1    453    L    :1

NOTE: Tabs may note be seen correctly on a web interface.


-- Hack Note for this specific case !

For this specific case, we can do a hack, and change the RegExp for
BedNumber, like in the next SELECT.

SELECT BedNo,
       SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' )
AS DoorNumber,
       SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' )
AS RoomSize,
       SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP;

or alternatively

SELECT BedNo,
       SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS
DoorNumber,
       SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS
RoomSize,
       SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS
BedNumber
  FROM TEST_REGEXP

This will return NULL when there is no BedNumber, by forcing the regExp to
fail the match.

But this only works, because the format uses ":" in that specific location,
if there was no ":" anywhere, we would NOT get away so easily.

Hope to have bee of help, in finding this documentation issue and
implementation bug.

В списке pgsql-bugs по дате отправления:

Предыдущее
От: "Dave Page"
Дата:
Сообщение: Re: BUG #4041: error in Application Stack Builder
Следующее
От: "Chan, Keith \(LESA IT\)"
Дата:
Сообщение: Re: BUG #4041: error in Application Stack Builder