[HACKERS] What users can do with custom ICU collations in Postgres 10

Поиск
Список
Период
Сортировка
От Peter Geoghegan
Тема [HACKERS] What users can do with custom ICU collations in Postgres 10
Дата
Msg-id CAH2-Wz=bcgmk97YaZ3rs4OoCdn1nOco1HCfRGBCOOty0ztnCnA@mail.gmail.com
обсуждение исходный текст
Ответы Re: [HACKERS] What users can do with custom ICU collations inPostgres 10  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Re: [HACKERS] What users can do with custom ICU collations inPostgres 10  (Craig Ringer <craig@2ndquadrant.com>)
Re: [HACKERS] What users can do with custom ICU collations inPostgres 10  (Peter Eisentraut <peter.eisentraut@2ndquadrant.com>)
Список pgsql-hackers
There are actually very many customizations to collations that are
possible beyond what the "stock" ICU collations provide (whatever
"stock" means). Some of these are really cool, and I can imagine use
cases where they are very compelling that have nothing to do with
internationalization (such customizations are how we should eventually
implement case-insensitive collations, once the infrastructure for
doing that without breaking hashing is in place).

I'd like to give a demo on what is already possible, but not currently
documented. I didn't see anyone else comment on this, including Peter
E (maybe I missed that?). We should improve the documentation in this
area, to get this into the hands of users.

Say we're unhappy that numbers come first, which we see here:

postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "en-x-icu";i
────1a1b1ca1b2c3
(6 rows)

We may do this to get our desired sort order:

postgres=# create collation digitlast (provider=icu,
locale='en-u-kr-latn-digit');
CREATE COLLATION
postgres=# select * from (select '1a' i union select '1b' union select
'1c' union select 'a1' union select 'b2' union select 'c3') j order by
i collate "digitlast";i
────a1b2c31a1b1c
(6 rows)

Note that 'kr' is a specific BCP47 Key [1]. Many different options can
be set in this manner.

Let's say we are unhappy with the fact that capital letters sort
higher than lowercase:

postgres=# select * from (select 'B' i union select 'b' union select
'A' union select 'a') j order by i collate "en-x-icu";i
───aAbB
(4 rows)

ICU provides a solution here, too:

postgres=# create collation capitalfirst (provider=icu, locale='en-u-kf-upper');
CREATE COLLATION
postgres=#
select * from (select 'B' i union select 'b' union select 'A' union
select 'a') j order by i collate "capitalfirst";i
───AaBb
(4 rows)

And, yes -- you can even *combine* these two options by creating a
third custom collation. That can be spelled
'en-u-kf-upper-kr-latn-digit', in case you were wondering.

Users have certainly complained about not liking this or that aspect
of how glibc sorts text many times over the years, particularly around
things like how whitespace and punctuation are handled, which they can
now do something about [2]. Users can also have numbers sort like
numbers should when compared against other numbers, by using the
numericOrdering option (not shown). numericOrdering would be great for
things like alphanumeric invoice numbers, or the alphanumeric car
registration plate numbers that are used in certain countries [3],
with fixed number/letter fields. These options are very powerful.

[1] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options
[2] http://unicode.org/reports/tr35/tr35-collation.html#Common_Settings
[3] https://en.wikipedia.org/wiki/Vehicle_registration_plates_of_the_Republic_of_Ireland
--
Peter Geoghegan



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Remi Colinet
Дата:
Сообщение: Re: [HACKERS] [PATCH v3] pg_progress() SQL function to monitorprogression of long running SQL queries/utilities
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: [HACKERS] Log LDAP "diagnostic messages"?