Re: ICU locale validation / canonicalization

From: Jeff Davis
Subject: Re: ICU locale validation / canonicalization
Msg-id: 060cb1b5d32c8693587b41f8f534ef79d3caecb1.camel@j-davis.com
In reply to: Re: ICU locale validation / canonicalization  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

On Thu, 2023-02-09 at 14:09 -0800, Jeff Davis wrote:
> It feels like BCP 47 is the right catalog representation. We are
> already using it for the import of initial collations, and it's a
> standard, and there seems to be good support in ICU.

Patch attached.

We should have been canonicalizing all along -- either with
uloc_toLanguageTag(), as this patch does, or at least with
uloc_canonicalize() -- before passing to ucol_open().

ucol_open() is documented[1] to work on either language tags or ICU
format locale IDs. Anything else is invalid and ends up going through
some fallback logic, probably after being mis-parsed. For instance, in
ICU 72, "fr_CA.UTF-8" is not a valid ICU format locale ID or a valid
language tag, and is resolved by ucol_open() to the actual locale
"root"; but if you canonicalize it first (to the ICU format locale ID
"fr_CA" or the language tag "fr-CA"), it correctly resolves to the
actual locale "fr_CA".

The correct thing to do is canonicalize first and then pass to
ucol_open().

But because we didn't canonicalize in the past, there could be raw
locale strings stored in the catalog that resolve to the wrong actual
collator, and there could be indexes depending on the wrong collator,
so we have to be careful during pg_upgrade.

Say someone created two ICU collations, one with locale "en_US.UTF-8"
and one with locale "fr_CA.UTF-8" in PG15. When they upgrade to PG16,
this patch will check the language tag "en-US" and see that it resolves
to the same locale as "en_US.UTF-8", and change to the language tag
during upgrade (so "en-US" will be in the new catalog). But when it
checks the language tag "fr-CA", it will notice that it resolves to a
different locale than "fr_CA.UTF-8", and keep the latter string even
though it's wrong, because some indexes might be dependent on that
wrong collator.
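
The upgrade-time decision in that example could be sketched roughly as follows. This is only an illustration, not the patch's code: choose_catalog_locale, resolve_fn and stub_resolve are hypothetical names, and the stub hard-codes the ICU 72 results described in this message (with "root" for both English strings being an assumption) so the logic can be read without linking ICU. A real resolver would call ucol_open() and inspect the actual locale.

```c
#include <stddef.h>
#include <string.h>

/* Maps a locale string to the actual locale a collator would resolve to. */
typedef const char *(*resolve_fn)(const char *locale);

/*
 * Decide which string goes into the new catalog: the language tag if it
 * opens the same collator as the stored raw string, otherwise the raw
 * string unchanged, so that existing indexes keep their ordering.
 */
static const char *
choose_catalog_locale(const char *raw, const char *langtag, resolve_fn resolve)
{
    const char *raw_actual = resolve(raw);
    const char *tag_actual = resolve(langtag);

    if (raw_actual != NULL && tag_actual != NULL &&
        strcmp(raw_actual, tag_actual) == 0)
        return langtag;     /* safe to canonicalize */

    return raw;             /* keep the old string, even if it is "wrong" */
}

/*
 * Stub resolver standing in for ICU. The fr_CA values follow the ICU 72
 * behavior described above; "root" for the two English inputs is an
 * assumption (the message only says they resolve to the same locale).
 */
static const char *
stub_resolve(const char *locale)
{
    if (strcmp(locale, "en_US.UTF-8") == 0) return "root";
    if (strcmp(locale, "en-US") == 0)       return "root";
    if (strcmp(locale, "fr_CA.UTF-8") == 0) return "root";
    if (strcmp(locale, "fr-CA") == 0)       return "fr_CA";
    return NULL;
}
```

With this stub, ("en_US.UTF-8", "en-US") selects "en-US", while ("fr_CA.UTF-8", "fr-CA") keeps "fr_CA.UTF-8".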


[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/ucol_8h.html#a3b0bf34733dc208040e4157b0fe5fcd6

--
Jeff Davis
PostgreSQL Contributor Team - AWS


