Обсуждение: Expected behaviour of \d in regexp with exponent numbers ?

Поиск
Список
Период
Сортировка

Expected behaviour of \d in regexp with exponent numbers ?

От
Arnaud Lesauvage
Дата:
Hi all,

I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :

SELECT substring('²' FROM E'\\d');

8.4 : NULL
9.3 : "²"

Am I correct to expect NULL in this case ?
Thanks !

--
Arnaud


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Albe Laurenz
Дата:
Arnaud Lesauvage wrote:
> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :
> 
> SELECT substring('²' FROM E'\\d');
> 
> 8.4 : NULL
> 9.3 : "²"
> 
> Am I correct to expect NULL in this case ?

I get a different result on Linux:

test=> SHOW server_encoding;
 server_encoding
-----------------
 UTF8
(1 row)

test=> SHOW client_encoding;
 client_encoding
-----------------
 UTF8
(1 row)

test=> SHOW lc_collate;
 lc_collate
------------
 de_DE.UTF8
(1 row)

test=> SHOW server_version;
 server_version
----------------
 9.3.5
(1 row)

test=> SELECT substring('²' FROM E'\\d') IS NULL;
 ?column?
----------
 t
(1 row)

Maybe it is a collation problem.
Which operating system and collation are you using on each system?

Maybe there is an encoding problem involved.
What is your server and client encoding on each system?
What do you get for "SELECT '²'::bytea" on each system?

Yours,
Laurenz Albe

Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Arnaud Lesauvage
Дата:
Le 1/09/2014 15:42, Albe Laurenz a écrit :
> Arnaud Lesauvage wrote:
>> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :
>> SELECT substring('²' FROM E'\\d');
>> 8.4 : NULL
>> 9.3 : "²"
>> Am I correct to expect NULL in this case ?
>
> I get a different result on Linux:

OK, first of all the servers are running on Windows 2003 R2. They are on
the same computer, the only difference is they use a different port.

> test=> SHOW server_encoding;

UTF8 on both servers

> test=> SHOW client_encoding;

UNICODE on both servers

> test=> SHOW lc_collate;

I have a small difference here.
On 8.4 I have "French, Belgium"
On 9.3 I have "French_France.1252"
Quite confusing... I left the installer chose the collation for me,
French_France sounded close enough !


> Maybe it is a collation problem.
> Which operating system and collation are you using on each system?
> What do you get for "SELECT '²'::bytea" on each system?

"\302\262" on each system.

--
Arnaud


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Tom Lane
Дата:
Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :

> SELECT substring('�' FROM E'\\d');

> 8.4 : NULL
> 9.3 : "�"

> Am I correct to expect NULL in this case ?

Not necessarily.  \d will match any character that iswdigit() returns true
for.  It looks like your new server is using a locale that considers "�"
to be a digit.

            regards, tom lane


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Arnaud Lesauvage
Дата:
Le 1/09/2014 17:39, Tom Lane a écrit :
> Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
>> I just came accross this trying to upgrade my server from 8.4.8 to 9.3.4 :
>
>> SELECT substring('²' FROM E'\\d');
>
>> 8.4 : NULL
>> 9.3 : "²"
>
>> Am I correct to expect NULL in this case ?
>
> Not necessarily.  \d will match any character that iswdigit() returns true
> for.  It looks like your new server is using a locale that considers "²"
> to be a digit.

Since both PostgreSQL servers run on the same computer, can I assume
that this is a collation problem ?

I tried to create a test table with different collations, but locale
names are a headache on windows, so I gave up for today. I will give it
another try tomorrow.


--
Arnaud


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Tom Lane
Дата:
Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
> Le 1/09/2014 17:39, Tom Lane a �crit :
>> Not necessarily.  \d will match any character that iswdigit() returns true
>> for.  It looks like your new server is using a locale that considers "�"
>> to be a digit.

> Since both PostgreSQL servers run on the same computer, can I assume
> that this is a collation problem ?

Ah: after consulting the commit history I realized that the regex
operators only base \d on iswdigit() in 9.2 and later.  Before that
it was hardwired as [0-9].  So there might not be any difference
in the locale environment after all.

I wonder whether this was a bad idea.  I think it's unsurprising for the
definition of "alphanumeric" to depend on locale, but I bet most people
are not expecting \d to vary that way.

            regards, tom lane


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Arnaud Lesauvage
Дата:
Le 1/09/2014 18:11, Tom Lane a écrit :
> Arnaud Lesauvage <arnaud.listes@codata.eu> writes:
>> Le 1/09/2014 17:39, Tom Lane a écrit :
>>> Not necessarily.  \d will match any character that iswdigit() returns true
>>> for.  It looks like your new server is using a locale that considers "²"
>>> to be a digit.
>
>> Since both PostgreSQL servers run on the same computer, can I assume
>> that this is a collation problem ?
>
> Ah: after consulting the commit history I realized that the regex
> operators only base \d on iswdigit() in 9.2 and later.  Before that
> it was hardwired as [0-9].  So there might not be any difference
> in the locale environment after all.
>
> I wonder whether this was a bad idea.  I think it's unsurprising for the
> definition of "alphanumeric" to depend on locale, but I bet most people
> are not expecting \d to vary that way.

I guess the change in the way \d is behaving is OK as long as it is
documented in a changelog. I saw something about syncing the regexp code
with TCL somewhere, but I think there was a mention that this should not
change the regexp behaviour and that it was only to keep the code base
in sync.

The problem in my case is just that I expected any character in the
"digit" class to be a valid integer, which is wrong since the doc states
that numbers should be written with "decimal digits", not "locale digits".
So I guess I should just rewrite this regexp as [0-9]+ instead.

Thanks for the feedback !
Regards

--
Arnaud


Re: Expected behaviour of \d in regexp with exponent numbers ?

От
Vick Khera
Дата:
On Mon, Sep 1, 2014 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wonder whether this was a bad idea.  I think it's unsurprising for the
> definition of "alphanumeric" to depend on locale, but I bet most people
> are not expecting \d to vary that way.

FWIW, tha Perl man page on unicode (perldoc perlunicode) says:

<quote>
It is worth stressing that there are several different sets of digits
in Unicode that are equivalent to 0-9 and are matchable by "\d" in a
regular expression.  If they are used in a single language only, they
are in that language's "Script" and "Script_Extension". ...
</quote>

When working with Unicode/UTF8, I do not think it is safe to assume \d
matches only ASCII [0-9].