encoding affects ICU regex character classification

From: Jeff Davis
Subject: encoding affects ICU regex character classification
Date:
Msg-id: e7b67d24288f811aebada7c33f9ae629dde0def5.camel@j-davis.com
Replies: Re: encoding affects ICU regex character classification  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

The following query:

    SELECT U&'\017D' ~ '[[:alpha:]]' collate "en-US-x-icu";

returns true if the server encoding is UTF8, and false if the server
encoding is LATIN9. That's a bug -- any behavior involving ICU should
be encoding-independent.

The problem seems to be confusion between a pg_wchar and a Unicode code
point in pg_wc_isalpha() and related functions.
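
To illustrate the suspected confusion outside the server, here is a small
Python sketch (not PostgreSQL code). It assumes that, for a single-byte
encoding such as LATIN9, the pg_wchar value is simply the byte value, and
it uses Python's str.isalpha() as a stand-in for ICU's classification. In
LATIN9 (ISO 8859-15), U+017D is stored as the byte 0xB4; treated as a code
point, 0xB4 is U+00B4 (ACUTE ACCENT), which is not alphabetic.

```python
import unicodedata

ch = "\u017d"  # LATIN CAPITAL LETTER Z WITH CARON

# What the server would store under LATIN9: a single byte.
latin9_bytes = ch.encode("iso8859-15")

# Assumption: the wide-character value for a single-byte encoding
# is just the byte value.
wchar_like = latin9_bytes[0]

# Wrong: interpret the encoding-specific value as a Unicode code point.
misread = chr(wchar_like)
misclassified = misread.isalpha()

# Right: decode from the server encoding first, then classify.
correct = latin9_bytes.decode("iso8859-15").isalpha()

print(hex(wchar_like), unicodedata.name(misread), misclassified, correct)
```

Under UTF8 the decoded value already is the code point, which would
explain why the query only misbehaves for the other encoding.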

It might be good to introduce some infrastructure here that can convert
a pg_wchar into a Unicode code point, or decode a string of bytes into
a string of 32-bit code points. Right now, that's possible, but it
involves pg_wchar2mb() followed by encoding conversion to UTF8,
followed by decoding the UTF8 to a code point. (Is there an easier path
that I missed?)
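
The three-step path described above can be sketched in Python as follows.
This is only an analogy under stated assumptions: the function names are
hypothetical, the first step handles single-byte encodings only (standing
in for pg_wchar2mb()), and Python's codecs stand in for the server's
encoding conversion.

```python
def decode_utf8_char(buf):
    """Decode the first UTF-8 sequence in buf into a code point."""
    b0 = buf[0]
    if b0 < 0x80:
        return b0
    if b0 >> 5 == 0b110:
        n, cp = 1, b0 & 0x1F
    elif b0 >> 4 == 0b1110:
        n, cp = 2, b0 & 0x0F
    elif b0 >> 3 == 0b11110:
        n, cp = 3, b0 & 0x07
    else:
        raise ValueError("invalid UTF-8 lead byte")
    if len(buf) < 1 + n:
        raise ValueError("truncated UTF-8 sequence")
    for b in buf[1:1 + n]:
        if b >> 6 != 0b10:
            raise ValueError("invalid UTF-8 continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


def wchar_to_code_point(wchar, server_encoding):
    """Hypothetical helper: wide char -> bytes -> UTF-8 -> code point."""
    # Step 1: wide char back to bytes in the server encoding
    # (assumption: single-byte encoding, so one byte per character).
    raw = bytes([wchar])
    # Step 2: convert from the server encoding to UTF-8.
    utf8 = raw.decode(server_encoding).encode("utf-8")
    # Step 3: decode the UTF-8 sequence to a 32-bit code point.
    return decode_utf8_char(utf8)


print(hex(wchar_to_code_point(0xB4, "iso8859-15")))
```

A direct pg_wchar-to-code-point routine would collapse these three steps
into one lookup per encoding, which is the kind of infrastructure being
proposed.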

One wrinkle is MULE_INTERNAL, which doesn't have any conversion path to
UTF8. That's not important for ICU (because ICU is not allowed for that
encoding), but I'd like it if we could make this infrastructure
independent of ICU, because I have some follow-up proposals to simplify
character classification here and in ts_locale.c.

Thoughts?

Regards,
    Jeff Davis



