Re: Built-in CTYPE provider

Поиск
Список
Период
Сортировка
От Jeff Davis
Тема Re: Built-in CTYPE provider
Дата
Msg-id 2f404017690b43e6951cd4a60798c3f9626bbe56.camel@j-davis.com
обсуждение исходный текст
Ответ на Re: Built-in CTYPE provider  ("Daniel Verite" <daniel@manitou-mail.org>)
Список pgsql-hackers
On Wed, 2024-03-27 at 16:53 +0100, Daniel Verite wrote:
>  provider | isalpha | isdigit
> ----------+---------+---------
>  ICU      | f       | t
>  glibc    | t       | f
>  builtin  | f       | f

The "ICU" above is really the behvior of the Postgres ICU provider as
we implemented it, it's not something forced on us by ICU.

For the ICU provider, pg_wc_isalpha() is defined as u_isalpha()[1] and
pg_wc_isdigit() is defined as u_isdigit()[2]. Those, in turn, are
defined by ICU to be equivalent to java.lang.Character.isLetter() and
java.lang.Character.isDigit().

ICU documents[3] how regex character classes should be implemented
using the ICU APIs, and cites Unicode TR#18 [4] as the source. Despite
being under the heading "...for C/POSIX character classes...", [3] says
it's based on the "Standard" variant of [4], rather than "POSIX
Compatible".

(Aside: the Postgres ICU provider doesn't match what [3] suggests for
the "alpha" class. For the character U+FF11 it doesn't matter, but I
suspect there are differences for other characters. This should be
fixed.)

The differences between PG_C_UTF8 and what ICU suggests are just
because the former uses the "POSIX Compatible" definitions and the
latter uses "Standard".

I implemented both the "Standard" and "POSIX Compatible" compatibility
properties in ad49994538, so it would be easy to change what PG_C_UTF8
uses.

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#aecff8611dfb1814d1770350378b3b283
[2]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#a42b37828d86daa0fed18b381130ce1e6
[3]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uchar_8h.html#details
[4]
http://www.unicode.org/reports/tr18/#Compatibility_Properties

> Are we fine with pg_c_utf8 differing from both ICU's point of view
> (U+ff11 is digit and not alpha) and glibc point of view (U+ff11 is
> not
> digit, but it's alpha)?

Yes, some differences are to be expected.

But I'm fine making a change to PG_C_UTF8 if it makes sense, as long as
we can point to something other than "glibc version 2.35 does it this
way".

Regards,
    Jeff Davis




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Nathan Bossart
Дата:
Сообщение: Re: Slow GRANT ROLE on PostgreSQL 16 with thousands of ROLEs
Следующее
От: Bharath Rupireddy
Дата:
Сообщение: Re: Add new error_action COPY ON_ERROR "log"