Re: ICU locale validation / canonicalization

From	Jeff Davis
Subject	Re: ICU locale validation / canonicalization
Date	February 10, 2023, 01:09:39
Msg-id	fccd3064fa38285d2b71c2cc46a3eae8e5c6d4fb.camel@j-davis.com
In reply to	Re: ICU locale validation / canonicalization (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: ICU locale validation / canonicalization Re: ICU locale validation / canonicalization Re: ICU locale validation / canonicalization
List	pgsql-hackers


On Thu, 2023-02-09 at 10:53 -0500, Robert Haas wrote:
> Unfortunately, I have no idea whether your specific ideas about how
> to make that happen are any good or not. But I hope they are,
> because the current situation is pessimal.

It feels like BCP 47 is the right catalog representation. We are
already using it for the import of initial collations, and it's a
standard, and there seems to be good support in ICU.

There are a couple cases where canonicalization will succeed but
conversion to a BCP 47 language tag will fail. One is for unsupported
attributes, like "en_US@foo=bar". Another is a bug I found and reported
here:

https://unicode-org.atlassian.net/browse/ICU-22268

In both cases, we know that conversion has failed, and we have a choice
about how to proceed. We can fail; warn and continue with the
user-entered representation; or turn off the strictness checking, come
up with some BCP 47 tag, and see if it resolves to the same collator.
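
As a rough illustration of those three choices (this is not taken from
the attached icutool.c), the decision could be sketched as below. All
names here (resolve_locale, fail_policy, the stub callbacks) are
hypothetical. The converter callback stands in for a wrapper around
ICU's uloc_toLanguageTag(), which takes a "strict" argument; the stubs
only imitate a strict conversion failure and do not reproduce what ICU
actually returns.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical names for the three options discussed above. */
typedef enum { ON_FAIL_ERROR, ON_FAIL_WARN_KEEP, ON_FAIL_LENIENT } fail_policy;
typedef enum { RES_TAG, RES_KEPT_INPUT, RES_ERROR } resolve_result;

/* Writes a BCP 47 tag into out; returns false if conversion fails. */
typedef bool (*to_tag_fn)(const char *locale, bool strict,
                          char *out, size_t outlen);

/* Returns true if both strings resolve to an equivalent collator. */
typedef bool (*same_collator_fn)(const char *a, const char *b);

static bool
copy_str(char *out, size_t outlen, const char *src)
{
    size_t n = strlen(src);

    if (n + 1 > outlen)
        return false;
    memcpy(out, src, n + 1);
    return true;
}

resolve_result
resolve_locale(const char *input, fail_policy policy,
               to_tag_fn to_tag, same_collator_fn same_collator,
               char *out, size_t outlen)
{
    /* Always try the strict conversion first. */
    if (to_tag(input, true, out, outlen))
        return RES_TAG;

    switch (policy)
    {
        case ON_FAIL_ERROR:
            return RES_ERROR;
        case ON_FAIL_WARN_KEEP:
            fprintf(stderr, "warning: no language tag for \"%s\"\n", input);
            return copy_str(out, outlen, input) ? RES_KEPT_INPUT : RES_ERROR;
        case ON_FAIL_LENIENT:
            /* Accept a non-strict tag only if the collator is unchanged. */
            if (to_tag(input, false, out, outlen) && same_collator(input, out))
                return RES_TAG;
            return RES_ERROR;
    }
    return RES_ERROR;
}

/* Stand-in converter: an assumption for the demo, not ICU's behavior. */
static bool
stub_to_tag(const char *locale, bool strict, char *out, size_t outlen)
{
    if (strcmp(locale, "en_US") == 0)
        return copy_str(out, outlen, "en-US");
    if (strcmp(locale, "en_US@foo=bar") == 0 && !strict)
        return copy_str(out, outlen, "en-US");
    return false;
}

/* Stand-in comparison: pretends every pair opens the same collator. */
static bool
stub_same_collator(const char *a, const char *b)
{
    (void) a;
    (void) b;
    return true;
}
```

With these stubs, "en_US@foo=bar" is rejected under ON_FAIL_ERROR, kept
verbatim under ON_FAIL_WARN_KEEP, and mapped to the stub's non-strict
tag under ON_FAIL_LENIENT.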

I do like the ICU format locale IDs from a readability standpoint.
"en_US@colstrength=primary" is more meaningful to me than
"en-US-u-ks-level1" (the equivalent language tag). And the format is
specified[1], even though it's not an independent standard. But I think
the benefits of better validation, an independent standard, and the
fact that we're already favoring BCP 47 outweigh my subjective opinion.

I also attached a simple test program that I've been using to
experiment (not intended for code review).

It's hard for me to say that I'm sure I'm right. I really just got
involved in this a few months back, and had a few off-list
conversations with Peter Eisentraut to try to learn more (I believe he
is aligned with my proposal but I will let him speak for himself).

I should also say that I'm not exactly an expert in languages or
scripts. I assume that ICU and IETF are doing sensible things to
accommodate the diversity of human language as well as they can (or at
least much better than the Postgres project could do on its own).

I'm happy to hear more input or other proposals.

[1] https://unicode-org.github.io/icu/userguide/locale/#canonicalization

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Attachments

icutool.c
