Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan
Subject: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Msg-id: CAH2-WzkhLa-Q9s6qSrjVT+Rm=_jaZi7cwSSm1wW3L_k9PHNtyw@mail.gmail.com
In reply to: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Andreas Karlsson <andreas@proxel.se>)
Responses: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Andreas Karlsson <andreas@proxel.se>)
  Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
List: pgsql-hackers
On Thu, Sep 21, 2017 at 2:49 AM, Andreas Karlsson <andreas@proxel.se> wrote:
> If we are fine with supporting only ICU 4.2 and later (which I think we are
> given that ICU 4.2 was released in 2009) then using uloc_forLanguageTag()[1]
> to validate and canonize seems like the right solution. I had missed that
> this function even existed when I last read the documentation. Does it
> return a BCP 47 tag in modern versions of ICU?

The decision to support ICU >= 4.2 was already made (see commit
eccead9). I have no reason to think that that's a problem for us.

As I've said, we currently use uloc_toLanguageTag() on all supported
ICU versions, to at least create a collation name at initdb time, but
also to get our new collation's colcollate when ICU >= 54. If we now
put a uloc_forLanguageTag() in place before each ucol_open() call,
then we can safely store a BCP 47 format tag as colcollate on *every*
ICU version. We can reconstruct a traditional format locale string
*on-the-fly* within pg_newlocale_from_collation() and
get_collation_actual_version(), which would be what we'd pass to
ucol_open() (we'd never pass "raw colcollate").

To keep things simple, we could always convert colcollate to the
traditional format using uloc_forLanguageTag(), just in case we're on
a version of ICU where ucol_open() doesn't like locales that are
spelled using the BCP 47 format. It doesn't seem worth it to take
advantage of the leniency on ICU >= 54, since that would be a special
case (though we could if we wanted to).

> I strongly prefer if there, as much as possible, is only one format for
> inputting ICU locales.

I agree, but the bigger issue is that we're *halfway* between
supporting only one format and supporting two formats. AFAICT, there
is no reason that we can't simply support one format on all ICU
versions, and keep what ends up within pg_collation at initdb time
essentially the same across ICU versions (except for differences that
are due to cultural/political developments).

-- 
Peter Geoghegan

