Discussion: \w doesn't match non-ASCII letters

\w doesn't match non-ASCII letters

From

Markus Bertheau

Date:

June 14, 2004, 11:56:31

oocms=# select 'ф' ~ '^\\w$';
 ?column?
----------
 f
(1 запись)

or

oocms=# select 'ä' ~ '^\\w$';
 ?column?
----------
 f
(1 запись)

both should return true, as does

oocms=# select 'n' ~ '^\\w$';
 ?column?
----------
 t
(1 запись)

Thanks.

--
Markus Bertheau <twanger@bluetwanger.de>

Re: \w doesn't match non-ASCII letters

From

Peter Eisentraut

Date:

June 14, 2004, 12:16:59

Markus Bertheau wrote:
> oocms=# select 'ф' ~ '^\\w$';
>  ?column?
> ----------
>  f
> (1 запись)

What locale are you using for LC_COLLATE?  If it's C or POSIX, you need
to change it and re-initdb.

Re: \w doesn't match non-ASCII letters

From

Tom Lane

Date:

June 14, 2004, 12:25:58

Peter Eisentraut <peter_e@gmx.net> writes:
> Markus Bertheau wrote:
>> oocms=# select 'ф' ~ '^\\w$';
>> ?column?
>> ----------
>> f
>> (1 запись)

> What locale are you using for LC_COLLATE?  If it's C or POSIX, you need
> to change it and re-initdb.

Another likely cause of trouble is that the regexp character
classification stuff is presently based on <ctype.h> functions and thus
cannot work in multibyte encodings.

            regards, tom lane
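
The effect described above can be reproduced outside PostgreSQL. The following is only a Python analogy, not the server's regex code: Python's `re` module classifies `bytes` patterns with ASCII-only rules, which behaves much like a byte-at-a-time `<ctype.h>` test, while `str` patterns classify whole code points. The helper names `byte_level_word_match` and `char_level_word_match` are hypothetical, and UTF-8 is assumed as the encoding.

```python
import re

# The three strings from the original report, written as escapes so the
# example does not depend on the encoding of the source file.
SAMPLES = [
    "\u0444",  # Cyrillic small letter ef (UTF-8: d1 84)
    "\u00e4",  # Latin small letter a with diaeresis (UTF-8: c3 a4)
    "n",       # plain ASCII letter (UTF-8: 6e)
]


def byte_level_word_match(text: str) -> bool:
    # Match a single word character against the raw UTF-8 bytes.
    # For bytes patterns, Python treats only [a-zA-Z0-9_] as word
    # characters, so no byte of a multibyte sequence can qualify.
    return re.fullmatch(rb"\w", text.encode("utf-8")) is not None


def char_level_word_match(text: str) -> bool:
    # Match a single word character against decoded code points.
    # For str patterns, Python uses Unicode character properties.
    return re.fullmatch(r"\w", text) is not None


for sample in SAMPLES:
    raw = sample.encode("utf-8")
    print(
        f"bytes={raw.hex()} length={len(raw)} "
        f"byte-level={byte_level_word_match(sample)} "
        f"char-level={char_level_word_match(sample)}"
    )
```

Only the ASCII letter passes the byte-level test, which mirrors the `f`, `f`, `t` results in the report; all three pass once classification works on whole characters.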

Re: \w doesn't match non-ASCII letters

From

Markus Bertheau

Date:

June 14, 2004, 13:37:05

On Mon, 14.06.2004, at 17:25, Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
> > Markus Bertheau wrote:
> >> oocms=# select 'ф' ~ '^\\w$';
> >> ?column?
> >> ----------
> >> f
> >> (1 запись)
>
> > > What locale are you using for LC_COLLATE?  If it's C or POSIX, you need
> > > to change it and re-initdb.
>
> Another likely cause of trouble is that the regexp character
> classification stuff is presently based on <ctype.h> functions and thus
> cannot work in multibyte encodings.

This is in a UTF-8 database, so yes, these are multibyte characters. Is
there something planned to support UTF-8 in regexps?

--
Markus Bertheau <twanger@bluetwanger.de>

Re: \w doesn't match non-ASCII letters

From

Tom Lane

Date:

June 14, 2004, 13:48:51

Markus Bertheau <twanger@bluetwanger.de> writes:
> Is there something planned to support UTF-8 in regexps?

It'd be relatively easy to use the <wctype.h> functions here if we
were convinced that pg_mb2wchar() generated exactly the same
wide-character encoding as the C library is expecting for the current
LC_CTYPE setting.  In the absence of such a guarantee I think we'd
have to convert the pg_wchar back to multibyte form and then apply
mbstowcs(), which is rather painful, not least because our wide
character support doesn't seem to have any function for converting
back to multibyte form ...

Tatsuo, any thoughts here?

            regards, tom lane
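
The round trip outlined above (wide character back to multibyte form, then into the library's own wide form, then classification) can be sketched as follows. This is a Python illustration under stated assumptions, not PostgreSQL code: the encoding is assumed to be UTF-8, the internal wide value is assumed to be the Unicode code point, Python's `str` stands in for the C library's `wchar_t`, and `wchar_to_utf8` and `is_word_char` are hypothetical names for the missing conversion and the classifier.

```python
def wchar_to_utf8(code_point: int) -> bytes:
    # Hypothetical "wide character back to multibyte" helper for UTF-8,
    # written out by hand to show what such a function has to do.
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def is_word_char(code_point: int) -> bool:
    # Step 1: internal wide value -> multibyte sequence.
    multibyte = wchar_to_utf8(code_point)
    # Step 2: multibyte sequence -> the library's own wide form
    # (the role mbstowcs() would play in C).
    wide = multibyte.decode("utf-8")
    # Step 3: classify the wide character (the role of the
    # <wctype.h> functions); "_" counts as a word character too.
    return wide.isalnum() or wide == "_"


for cp in (0x0444, 0x00E4, ord("n"), ord(" ")):
    print(f"U+{cp:04X} bytes={wchar_to_utf8(cp).hex()} word={is_word_char(cp)}")
```

The extra encode and decode on every classified character is the cost being called painful in the message; it would only be unnecessary if both wide-character representations were known to agree.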
