Обсуждение: Expected behaviour of \d in regexp with exponent numbers ?
Hi all, I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 : SELECT substring('²' FROM E'\\d'); 8.4 : NULL 9.3 : "²" Am I correct to expect NULL in this case ? Thanks ! -- Arnaud
Arnaud Lesauvage wrote: > I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 : > > SELECT substring('²' FROM E'\\d'); > > 8.4 : NULL > 9.3 : "²" > > Am I correct to expect NULL in this case ? I get a different result on Linux: test=> SHOW server_encoding; server_encoding ----------------- UTF8 (1 row) test=> SHOW client_encoding; client_encoding ----------------- UTF8 (1 row) test=> SHOW lc_collate; lc_collate ------------ de_DE.UTF8 (1 row) test=> SHOW server_version; server_version ---------------- 9.3.5 (1 row) test=> SELECT substring('²' FROM E'\\d') IS NULL; ?column? ---------- t (1 row) Maybe it is a collation problem. Which operating system and collation are you using on each system? Maybe there is an encoding problem involved. What is your server and client encoding on each system? What do you get for "SELECT '²'::bytea" on each system? Yours, Laurenz Albe
Le 1/09/2014 15:42, Albe Laurenz a écrit : > Arnaud Lesauvage wrote: >> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 : >> SELECT substring('²' FROM E'\\d'); >> 8.4 : NULL >> 9.3 : "²" >> Am I correct to expect NULL in this case ? > > I get a different result on Linux: OK, first of all the servers are running on Windows 2003 R2. They are on the same computer, the only difference is they use a different port. > test=> SHOW server_encoding; UTF8 on both servers > test=> SHOW client_encoding; UNICODE on both servers > test=> SHOW lc_collate; I have a small difference here. On 8.4 I have "French, Belgium" On 9.3 I have "French_France.1252" Quite confusing... I left the installer chose the collation for me, French_France sounded close enough ! > Maybe it is a collation problem. > Which operating system and collation are you using on each system? > What do you get for "SELECT '²'::bytea" on each system? "\302\262" on each system. -- Arnaud
Arnaud Lesauvage <arnaud.listes@codata.eu> writes: > I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 : > SELECT substring('�' FROM E'\\d'); > 8.4 : NULL > 9.3 : "�" > Am I correct to expect NULL in this case ? Not necessarily. \d will match any character that iswdigit() returns true for. It looks like your new server is using a locale that considers "�" to be a digit. regards, tom lane
Le 1/09/2014 17:39, Tom Lane a écrit : > Arnaud Lesauvage <arnaud.listes@codata.eu> writes: >> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 : > >> SELECT substring('²' FROM E'\\d'); > >> 8.4 : NULL >> 9.3 : "²" > >> Am I correct to expect NULL in this case ? > > Not necessarily. \d will match any character that iswdigit() returns true > for. It looks like your new server is using a locale that considers "²" > to be a digit. Since both PostgreSQL servers run on the same computer, can I assume that this is a collation problem ? I tried to create a test table with different collations, but locale names are a headache on windows, so I gave up for today. I will give it another try tomorrow. -- Arnaud
Arnaud Lesauvage <arnaud.listes@codata.eu> writes: > Le 1/09/2014 17:39, Tom Lane a �crit : >> Not necessarily. \d will match any character that iswdigit() returns true >> for. It looks like your new server is using a locale that considers "�" >> to be a digit. > Since both PostgreSQL servers run on the same computer, can I assume > that this is a collation problem ? Ah: after consulting the commit history I realized that the regex operators only base \d on iswdigit() in 9.2 and later. Before that it was hardwired as [0-9]. So there might not be any difference in the locale environment after all. I wonder whether this was a bad idea. I think it's unsurprising for the definition of "alphanumeric" to depend on locale, but I bet most people are not expecting \d to vary that way. regards, tom lane
Le 1/09/2014 18:11, Tom Lane a écrit : > Arnaud Lesauvage <arnaud.listes@codata.eu> writes: >> Le 1/09/2014 17:39, Tom Lane a écrit : >>> Not necessarily. \d will match any character that iswdigit() returns true >>> for. It looks like your new server is using a locale that considers "²" >>> to be a digit. > >> Since both PostgreSQL servers run on the same computer, can I assume >> that this is a collation problem ? > > Ah: after consulting the commit history I realized that the regex > operators only base \d on iswdigit() in 9.2 and later. Before that > it was hardwired as [0-9]. So there might not be any difference > in the locale environment after all. > > I wonder whether this was a bad idea. I think it's unsurprising for the > definition of "alphanumeric" to depend on locale, but I bet most people > are not expecting \d to vary that way. I guess the change in the way \d is behaving is OK as long as it is documented in a changelog. I saw something about syncing the regexp code with TCL somewhere, but I think there was a mention that this should not change the regexp behaviour and that it was only to keep the code base in sync. The problem in my case is just that I expected any character in the "digit" class to be a valid integer, which is wrong since the doc states that numbers should be written with "decimal digits", not "locale digits". So I guess I should just rewrite this regexp as [0-9]+ instead. Thanks for the feedback ! Regards -- Arnaud
On Mon, Sep 1, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I wonder whether this was a bad idea. I think it's unsurprising for the > definition of "alphanumeric" to depend on locale, but I bet most people > are not expecting \d to vary that way. FWIW, tha Perl man page on unicode (perldoc perlunicode) says: <quote> It is worth stressing that there are several different sets of digits in Unicode that are equivalent to 0-9 and are matchable by "\d" in a regular expression. If they are used in a single language only, they are in that language's "Script" and "Script_Extension". ... </quote> When working with Unicode/UTF8, I do not think it is safe to assume \d matches only ASCII [0-9].