Обсуждение: BUG #4044: Incorrect RegExp substring Output
The following bug has been logged online: Bug reference: 4044 Logged by: Rui Martins Email address: Rui.Martins@pdmfc.com PostgreSQL version: 8.3 Operating system: Win2000 Description: Incorrect RegExp substring Output Details: To setup for test, run these SQL commands CREATE TABLE TEST_REGEXP ( BedNo VARCHAR(20) ); INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '123' ); INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '4325:1' ); INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '2464M' ); INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '5678M:2' ); INSERT INTO TEST_REGEXP ( BedNo ) VALUES ( '453L:1' ); From the following link http://www.postgresql.org/docs/8.3/static/functions-matching.html We can read the following text: The substring function with two parameters, substring(string from pattern), provides extraction of a substring that matches a POSIX regular expression pattern. It returns null if there is no match, otherwise the portion of the text that matched the pattern. But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned. You can put parentheses around the whole expression if you want to use parentheses within it without triggering this exception. If you need parentheses in the pattern before the subexpression you want to extract, see the non-capturing parentheses described below. -- -- -- -- -- No suppose we want to split up the "BedNo" column, into its parts (DoorNumber, RoomSize and BedNumber) SELECT BedNo, SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS DoorNumber, SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS BedNumber FROM TEST_REGEXP; Or in it's other form SELECT BedNo, SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS DoorNumber, SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) AS BedNumber FROM TEST_REGEXP; Both will return the same result: BedNo DoorNumber RoomSize BedNumber 123 123 123 4325:1 4325 :1 2464M 2464 M 2464M 5678M:2 5678 M :2 453L:1 453 L :1 NOTE: Tabs may note be seen correctly on a web interface. This is clearly, not the expected result, in particular for BedNo IN ( '123', '2464M' ) The BedNumber returned in these cases is NOT logical! However, the manual is NOT EXPLICIT in what happens, if the returned MATCHED parentheses part is the equivalent of an empty string! Although it states: "But if the pattern contains any parentheses, the portion of the text that matched the first parenthesized subexpression (the one whose left parenthesis comes first) is returned." Apparently, the function is returning the entire MATCHED string, instead of just the parenthesized subexpression. I would expect the result for BedNumber to be either NULL or the EMPTY String, and the later seems more logical. But the documentation doesn't state which should be returned! Not withstanding, the expected result should be: BedNo DoorNumber RoomSize BedNumber 123 123 4325:1 4325 :1 2464M 2464 M 5678M:2 5678 M :2 453L:1 453 L :1 NOTE: Tabs may note be seen correctly on a web interface. -- Hack Note for this specific case ! For this specific case, we can do a hack, and change the RegExp for BedNumber, like in the next SELECT. SELECT BedNo, SUBSTRING( BedNo FROM '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS DoorNumber, SUBSTRING( BedNo FROM '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS BedNumber FROM TEST_REGEXP; or alternatively SELECT BedNo, SUBSTRING( BedNo, '^([[:digit:]]+)[a-zA-Z]*(:[[:digit:]]+)?$' ) AS DoorNumber, SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, SUBSTRING( BedNo, '^[[:digit:]]+[a-zA-Z]*:([[:digit:]]+)?$' ) AS BedNumber FROM TEST_REGEXP This will return NULL when there is no BedNumber, by forcing the regExp to fail the match. But this only works, because the format uses ":" in that specific location, if there was no ":" anywhere, we would NOT get away so easily. Hope to have bee of help, in finding this documentation issue and implementation bug.
"Rui Martins" <Rui.Martins@pdmfc.com> writes: > Description: Incorrect RegExp substring Output > SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) Interesting. It had never occurred to me that it's possible for the whole pattern to have a match when some parenthesized subexpression has no match. On investigation, Tcl's regex library seems to get this right, but textregexsubstr() doesn't. Will fix. > I would expect the result for BedNumber to be either NULL or the EMPTY > String, and the later seems more logical. It's going to be null. Your example has no match to the parenthesized substring --- a match would have to include a colon and some digits, no? regards, tom lane
> "Rui Martins" <Rui.Martins@pdmfc.com> writes: >> Description: Incorrect RegExp substring Output > >> SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) > > Interesting. It had never occurred to me that it's possible for the > whole pattern to have a match when some parenthesized subexpression > has no match. On investigation, Tcl's regex library seems to get > this right, but textregexsubstr() doesn't. Will fix. > >> I would expect the result for BedNumber to be either NULL or the EMPTY >> String, and the later seems more logical. > > It's going to be null. Your example has no match to the parenthesized > substring --- a match would have to include a colon and some digits, no? Yes, the subexpression will not match, but the entire expression will. Taking this into account I agree that it should be NULL then, but this should be CLEARLY stated in the MANUAL, so that the user will not have to guess. I believe that there should be a more detailed explanation of Substring function in the manual, because I haven't found a specific section about it. It is kind of scattered around the string functions page. > regards, tom lane Thank you for your feedback. P.S. Will the fix be available as a patch or just in 8.3.1 ? See ya Rui Martins
"Rui Martins" <Rui.Martins@PDMFC.com> writes: > Here the context of the word "match" may be misleading us, in this > conversation. > I say this, because in my report, the second substring expression, the one > for RoomSize: > SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, > Actually returns an EMPTY String, and not a NULL, for the first 2 test > cases, which I believe is the correct answer. What's your point? A zero-length substring is a legal match to that subexpression ... unlike the third one. regards, tom lane
Hi Tom Just a side note, See comments below. > "Rui Martins" <Rui.Martins@pdmfc.com> writes: >> Description: Incorrect RegExp substring Output > >> SUBSTRING( BedNo FROM '^[[:digit:]]+[a-zA-Z]*(:[[:digit:]]+)?$' ) > > Interesting. It had never occurred to me that it's possible for the > whole pattern to have a match when some parenthesized subexpression > has no match. On investigation, Tcl's regex library seems to get > this right, but textregexsubstr() doesn't. Will fix. > >> I would expect the result for BedNumber to be either NULL or the EMPTY >> String, and the later seems more logical. > > It's going to be null. Your example has no match to the parenthesized > substring --- a match would have to include a colon and some digits, no? You mention that it will return NULL, when the subexpression does not match! Here the context of the word "match" may be misleading us, in this conversation. I say this, because in my report, the second substring expression, the one for RoomSize: SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS RoomSize, Actually returns an EMPTY String, and not a NULL, for the first 2 test cases, which I believe is the correct answer. From what I can infer, from your definition, of "match", from your last sentence, this should NOT be a MATCH for the Subexpression, since it would be an EMPTY match. However, it's returning EMPTY String instead of NULL. i.e. returns what I expect and not what you have said it should return in case of NO MATCH. I usually think that a "match" is something that validates as correct, and hence returns something. But I have to admit that I usually think about a global match, i.e. the entire expression match, and not about sub-expression match. Even though this can me though as argumentative, think about this expression: (something)? Will "match" with an empty string in the context of a full expression, and will return an EMPTY String. So by analogy, I would expect it, to return the same as a sub-expression when it actually has a "match" even if with an empty sub-string. My expectations and assumptions might be wrong, but I believe they are correct. Please check this too. Once again, thank you for your quick feedback. Best regards Rui Martins
Hi Tom What I meant is: SELECT '' ~ '^(something)?$' This will match, i.e. the empty string will match with the expression, and substring would return and empty string. or in other words: it returns true SELECT SUBSTRING( '', '^(something)?$' ) This will also match, has a global expression, and the return is, has expected, an EMPTY string ! Now, If all the above is correct (which I believe it is), than: SELECT 'TEST' ~ '^TEST(something)?$' Will also return TRUE, which it correctly does. So, what would you expect to be the result, from the following query ? SELECT SUBSTRING( 'TEST', '^TEST(something)?$' ) The current returned value is 'TEST' which is wrong! This we knew already, from my bug Report. But by analogy, with the previous queries, and to maintain overall coherence the expected correct result should be and EMPTY string too! But If I understood you correctly, you said that it will return a result of NULL, which I think is incorrect. Did I make my self clearer know ? Thanks again for feedback. See ya Rui Martins > "Rui Martins" <Rui.Martins@PDMFC.com> writes: >> Here the context of the word "match" may be misleading us, in this >> conversation. >> I say this, because in my report, the second substring expression, the >> one >> for RoomSize: > >> SUBSTRING( BedNo, '^[[:digit:]]+([a-zA-Z]*)(:[[:digit:]]+)?$' ) AS >> RoomSize, > >> Actually returns an EMPTY String, and not a NULL, for the first 2 test >> cases, which I believe is the correct answer. > > What's your point? A zero-length substring is a legal match to that > subexpression ... unlike the third one. > > regards, tom lane >
"Rui Martins" <Rui.Martins@PDMFC.com> writes: > Even though this can me though as argumentative, think about this expression: > (something)? > Will "match" with an empty string in the context of a full expression, and > will return an EMPTY String. So by analogy, I would expect it, to return > the same as a sub-expression when it actually has a "match" even if with > an empty sub-string. Uh, no, it *won't* match if there is not "something" in the string. The behavior you are looking for is properly obtained this way: ((something)?) This will return either "something" or an empty string (assuming we have a globally successful match). The point is there's a difference between what X matches (or doesn't) and what X? matches. regards, tom lane
> "Rui Martins" <Rui.Martins@PDMFC.com> writes: >> Even though this can me though as argumentative, think about this >> expression: > >> (something)? > >> Will "match" with an empty string in the context of a full expression, >> and >> will return an EMPTY String. So by analogy, I would expect it, to return >> the same as a sub-expression when it actually has a "match" even if with >> an empty sub-string. > > Uh, no, it *won't* match if there is not "something" in the string. > > The behavior you are looking for is properly obtained this way: > > ((something)?) > > This will return either "something" or an empty string (assuming > we have a globally successful match). The point is there's a difference > between what X matches (or doesn't) and what X? matches. > > regards, tom lane OK, what I'm trying to explain is maybe a difference of assumptions: SELECT '' ~ '^(something)?$' NOW: True ME: True YOU: True SELECT SUBSTRING( '', '^(something)?$' ) NOW: '' => EMPTY String (OK, by pure luck. It's returning FULL match) ME: '' => EMPTY String YOU: NULL SELECT 'TEST' ~ '^TEST(something)?$' NOW: True ME: True YOU: True SELECT SUBSTRING( 'TEST', '^TEST(something)?$' ) NOW: 'TEST' => WRONG !!! (It's returning FULL Match) ME: '' => EMPTY String YOU: NULL My reasoning is: Why would the exact same sub-expression, return different results when either preceded or followed by something. In my opinion, the same sub-expression should always return the same result if the same conditions are met (in the sub-expression context). Hope this time it's clear. Best Regards Rui Martins
"Rui Martins" <Rui.Martins@PDMFC.com> writes: > My reasoning is: > Why would the exact same sub-expression, return different results when > either preceded or followed by something. It *isn't* returning different results; you are testing for different things in these two cases, namely whether there is a match to the whole pattern or just a parenthesized subpattern. In none of these examples was there any match to '(something)' --- there couldn't possibly be, because "something" isn't in the data string. regards, tom lane
Hi Tom > "Rui Martins" <Rui.Martins@PDMFC.com> writes: >> My reasoning is: >> Why would the exact same sub-expression, return different results when >> either preceded or followed by something. > > It *isn't* returning different results; you are testing for different > things in these two cases, namely whether there is a match to the whole > pattern or just a parenthesized subpattern. In none of these examples > was there any match to '(something)' --- there couldn't possibly be, > because "something" isn't in the data string. > > regards, tom lane That's one way to look at it. That's why I mentioned the possibility of different assumptions regarding the context of the word "match". In fact, you are saying that the sub-expression did not "match" because there wasn't "something" in the string to be matched! I agree with you on this last part, "there wasn't "something" in the string to be matched" But the sub-expression did "match" ! I say this, because, since the empty string is a valid "match" for "(something)?" because the "?" question mark operator, is defined as "a sequence of 0 or 1 matches of the atom". So we are probably just discussing semantics here! My concern is that many will make the same refutable "valid" assumptions that I do. And If they will get NULL instead of an EMPTY String, it will be awkward, besides not being able to distinguish between an EMPTY "match" and NO "match" at all, since both will return NULL, according to your definition. But what I find odd, is that you say that I'm testing different things. So what would you say for the following cases ? '(something)?' NOTE: I removed the anchors only. Now is this a full string match or a sub-expression match ? We can't give a concrete answer, unless we know the concrete string to be matched SELECT '' ~ '(something)?' This will be a FULL match SELECT 'TEST' ~ '(something)?' But this one won't! It will be a sub-expression match by your definition. So using the EXACT same REG_EXP, we will have two different context, depending on the input ! The regexp context, MUST NOT depend on the String to be matched. Because if it depends, then this is VERY BAD for consistency. Do you get my point now ? Now try this: SELECT SUBSTRING( '', '(something)?' ) SELECT SUBSTRING( 'TEST', '(something)?' ) Odd enough, this, currently, returns the correct answer for both queries! And by correct I mean EMPTY String ! According to your assumption, the first, would return an Empty String, but the second, would return NULL ! You should try this with other reg_exp implementations, and see what comes up on the the sub-expression result. If after this exposition I haven't been able to correctly transmit the problem to you, then it's probably my inability to explain it better, or my not so good English, since it's not my native language. Hope you understand this now, since I don't know how to explain it better. Thank you for your feedback. Best Regards Rui Martins
"Rui Martins" <Rui.Martins@PDMFC.com> writes: > But the sub-expression did "match" ! No, the sub-expression "(something)" did not match. What did match is the larger expression "(something)?". You seem to be failing to recognize that these are two different things. If you put the capturing parentheses around the larger expression then you will get the result you want. regards, tom lane