Discussion: regular expressions stranges
Regexp works differently with non-ASCII characters depending on the server encoding
(bug.sql contains non-ASCII characters):
% initdb -E KOI8-R --locale ru_RU.KOI8-R
% psql postgres < bug.sql
true
------
t
(1 row)
true | true
------+------
t | t
(1 row)
% initdb -E UTF8 --locale ru_RU.UTF-8
% psql postgres < bug.sql
true
------
f
(1 row)
true | true
------+------
f | t
(1 row)
As far as I can see, this is because isalpha (and the other is* functions),
tolower and toupper are used instead of the isw* and tow* functions. Is there
any reason to use them? If not, I can modify regc_locale.c similarly to the
locale part of tsearch2.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
set client_encoding='KOI8';
SELECT '�' ~* '[[:alpha:]]' as "true";
SELECT
'������' ~* '������' as "true",
'������' ~* '������' as "true";
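A minimal sketch of why the two clusters above can disagree, assuming the classifier only consults isalpha() for values that fit in one byte. The name byte_only_isalpha and its unsigned int parameter are hypothetical and are not taken from regc_locale.c. In a single-byte encoding such as KOI8-R every letter is one byte, so the locale's isalpha() gets a chance to answer; in UTF8 a Cyrillic letter becomes a code point above UCHAR_MAX and falls through to 0.

```c
#include <ctype.h>
#include <limits.h>

/*
 * Hypothetical stand-in for a classifier that only trusts isalpha()
 * for values that fit in one byte. Not PostgreSQL's actual code.
 */
static int
byte_only_isalpha(unsigned int c)
{
    /* Single-byte values: the answer depends on the current locale. */
    if (c <= UCHAR_MAX)
        return isalpha((int) (unsigned char) c) != 0;

    /* Anything wider is never reported as a letter, in any locale. */
    return 0;
}
```

With this sketch, byte_only_isalpha(0x0436) (the code point of a Cyrillic small letter) is 0 regardless of locale, while the result for a single byte such as 0xD6 depends on the locale in effect.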
Teodor Sigaev <teodor@sigaev.ru> writes:
> As far as I can see, this is because isalpha (and the other is* functions),
> tolower and toupper are used instead of the isw* and tow* functions. Is there
> any reason to use them? If not, I can modify regc_locale.c similarly to the
> locale part of tsearch2.
The regex code is working with pg_wchar strings, which aren't
necessarily the same representation that the OS' wide-char functions
expect. If we could guarantee compatibility then the above plan
would make sense ...
regards, tom lane
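The compatibility concern above can be probed at compile time to some extent. The following is only a sketch: the helper names are hypothetical. It relies on the standard C99 macro __STDC_ISO_10646__, which an implementation defines when wchar_t values are ISO 10646 code points, and on WCHAR_MAX to check whether a given code point fits in a single wchar_t.

```c
#include <stdint.h>
#include <wchar.h>

/* Returns 1 if the implementation promises wchar_t holds ISO 10646 values. */
static int
wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the code point can be stored in one wchar_t on this platform. */
static int
code_point_fits_wchar(uint32_t cp)
{
    return (uintmax_t) cp <= (uintmax_t) WCHAR_MAX;
}
```

If wchar_is_iso10646() returns 0, nothing in the C standard says how wide-character values relate to Unicode, so passing a code point to the <wctype.h> functions would be a guess. Where wchar_t is 16 bits wide, code_point_fits_wchar() rejects code points above 0xFFFF.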
> The regex code is working with pg_wchar strings, which aren't
> necessarily the same representation that the OS' wide-char functions
> expect. If we could guarantee compatibility then the above plan
> would make sense ...
It seems to me that this is possible for the UTF8 encoding. So the isalpha()
function could be defined as:
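static int
pg_wc_isalpha(pg_wchar c)
{
    /* single-byte values: use the locale's byte-oriented classifier */
    if (c >= 0 && c <= UCHAR_MAX)
        return isalpha((unsigned char) c);
#ifdef HAVE_WCSTOMBS
    /* wider values: only trust the wide-char classifier for UTF8 */
    else if (GetDatabaseEncoding() == PG_UTF8)
        return iswalpha((wint_t) c);
#endif
    return 0;
}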
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Teodor Sigaev <teodor@sigaev.ru> writes:
>> The regex code is working with pg_wchar strings, which aren't
>> necessarily the same representation that the OS' wide-char functions
>> expect. If we could guarantee compatibility then the above plan
>> would make sense ...
> It seems to me that this is possible for the UTF8 encoding.
Why? The one thing that a wchar certainly is not is UTF8.
It might be that the <wctype.h> functions are expecting UTF16 or UTF32,
but we don't know which, and really we can hardly even be sure they're
expecting Unicode at all.
regards, tom lane
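To illustrate the distinction drawn above between UTF8 bytes and a wide-character value, here is a simplified sketch of decoding one UTF-8 sequence into a code point. The function name is hypothetical, and the decoder deliberately skips checks for overlong forms and surrogates. It says nothing about what a given platform's <wctype.h> functions expect; it only shows that the decoded value differs from the bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode one UTF-8 sequence at s (at most len bytes) into *out.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Simplified: overlong forms and surrogates are not rejected.
 */
static size_t
utf8_to_code_point(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t      need;
    size_t      i;
    uint32_t    cp;

    assert(s != NULL && out != NULL);
    if (len == 0)
        return 0;

    /* The lead byte gives the sequence length and the top payload bits. */
    if (s[0] < 0x80)
    {
        cp = s[0];
        need = 1;
    }
    else if ((s[0] & 0xE0) == 0xC0)
    {
        cp = s[0] & 0x1F;
        need = 2;
    }
    else if ((s[0] & 0xF0) == 0xE0)
    {
        cp = s[0] & 0x0F;
        need = 3;
    }
    else if ((s[0] & 0xF8) == 0xF0)
    {
        cp = s[0] & 0x07;
        need = 4;
    }
    else
        return 0;

    if (len < need)
        return 0;

    /* Each continuation byte must be 10xxxxxx and adds six bits. */
    for (i = 1; i < need; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    *out = cp;
    return need;
}
```

For example, the two bytes 0xD0 0xB6 decode to the single value 0x0436. Whether iswalpha() interprets 0x0436 as that Unicode character is exactly the platform question raised above.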