encoding affects ICU regex character classification

Поиск

Список

Период

Сортировка

От	Jeff Davis
Тема	encoding affects ICU regex character classification
Дата	29 ноября 2023 г. 23:46:26
Msg-id	e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com обсуждение исходный текст
Ответы	Re: encoding affects ICU regex character classification
Список	pgsql-hackers

Дерево обсуждения

The following query:

    SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu";

returns true if the server encoding is UTF8, and false if the server
encoding is LATIN9. That's a bug -- any behavior involving ICU should
be encoding-independent.

The problem seems to be confusion between pg_wchar and a unicode code
point in pg_wc_isalpha() and related functions.

It might be good to introduce some infrastructure here that can convert
a pg_wchar into a Unicode code point, or decode a string of bytes into
a string of 32-bit code points. Right now, that's possible, but it
involves pg_wchar2mb() followed by encoding conversion to UTF8,
followed by decoding the UTF8 to a code point. (Is there an easier path
that I missed?)

One wrinkle is MULE_INTERNAL, which doesn't have any conversion path to
UTF8. That's not important for ICU (because ICU is not allowed for that
encoding), but I'd like it if we could make this infrastructure
independent of ICU, because I have some follow-up proposals to simplify
character classification here and in ts_locale.c.

Thoughts?

Regards,
    Jeff Davis

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

encoding affects ICU regex character classification