Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

From: Peter Geoghegan
Subject: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Msg-id CAH2-WzkyJCJNzarYYj0HTt0NTUWpKBuEmnUoX8QDA6+XRFE71Q@mail.gmail.com
In reply to: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Responses: Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Peter Geoghegan <pg@bowt.ie>)
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Re: [HACKERS] CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Peter Geoghegan <pg@bowt.ie>)
[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> On 9/18/17 18:46, Peter Geoghegan wrote:
>> As I pointed out a couple of times already [1], we don't currently
>> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
>
> There is no requirement that the locale strings for ICU need to be BCP
> 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.

Right, but we only document that BCP 47 is supported by Postgres.
Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
that we end up with BCP 47, even when the user happens to specify the
legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
as BCP 47, too?

> The reason they are not validated is that, as you know, ICU accepts any
> locale string as valid.  You appear to have found a way to do some
> validation, but I would like to see that code.

As I mentioned, I'm simply calling
get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
The code to get the extra sanitization is completely trivial.

I didn't post the patch that generates the errors in my opening e-mail
because I'm not confident it's correct just yet. And, I think that I
see a bigger problem: we pass a string that is almost certainly a BCP
47 string to ucol_open() from within pg_newlocale_from_collation(). We
do so despite the fact that ucol_open() apparently doesn't accept BCP
47 syntax locale strings until ICU 54 [1]. Seems entirely possible
that this accounts for the problems you saw on ICU 4.2, back when we
were still creating keyword variants (I guess that the keyword
variants seem very "BCP 47-ish" to me).
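
To make the suspected mismatch concrete, here is a minimal sketch of the
check I have in mind. It is only a syntactic heuristic and does not call
ICU at all; the helper names are hypothetical, and the version cutoff is
simply the "ICU 54" figure from the ucol_open() documentation cited above.
Real canonicalization would still go through uloc_toLanguageTag().

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumption taken from the message above: ucol_open() only understands
 * BCP 47 language tags starting with ICU 54. */
#define FIRST_ICU_MAJOR_WITH_BCP47_UCOL_OPEN 54

/*
 * Hypothetical helper: guess whether a locale string uses legacy ICU
 * syntax (e.g. "de@collation=phonebook", "de_DE") rather than BCP 47
 * (e.g. "de-u-co-phonebk").  Purely syntactic; a bare "de" is valid in
 * both syntaxes and is reported as not legacy.
 */
static bool
looks_like_legacy_icu_locale(const char *locale)
{
    assert(locale != NULL);
    return strchr(locale, '@') != NULL || strchr(locale, '_') != NULL;
}

/*
 * Hypothetical helper: would passing this string straight to ucol_open()
 * risk it being misread by the given ICU major version?  Conservative:
 * on older ICU, any hyphenated, non-legacy string is flagged.
 */
static bool
needs_conversion_before_ucol_open(const char *locale, int icu_major)
{
    if (icu_major >= FIRST_ICU_MAJOR_WITH_BCP47_UCOL_OPEN)
        return false;           /* newer ICU takes either syntax */

    return !looks_like_legacy_icu_locale(locale) &&
        strchr(locale, '-') != NULL;
}
```

With this, needs_conversion_before_ucol_open("de-u-co-phonebk", 52) is
true, while the same string on ICU 54, or 'de@collation=phonebook' on any
version, is reported as safe to pass through unchanged.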

I really think we need to add some kind of debug mode that makes ICU
optionally spit out a locale display name at key points. We need this
to gain confidence that the behavior that ICU provides actually
matches what is expected across different ICU versions for different
locale formats. We talked about this as a user-facing feature before,
which can wait till v11; I just want this to debug problems like this
one.

[1] https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
-- 
Peter Geoghegan