Обсуждение: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

Поиск

Список

Период

Сортировка

Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

От

Debraj Manna

Дата:

14 сентября 2018 г., 11:46:26

Can someone let me know what is the difference we are expected to observe if we set Collate & Ctype to C as against en_US.UTF-8 with encoding set to UTF8 in both the cases ?

-----------+----------+----------+---------+-------+-----------------------

postgres | postgres | UTF8 | C | C |

-----------+----------+----------+-------------+-------------+-----------------------

Re: Difference between C and en_US.UTF-8 Collate & CType inPostgres 10.x

От

Laurenz Albe

Дата:

14 сентября 2018 г., 14:57:38

Debraj Manna wrote:
> Can someone let me know what is the difference we are expected to observe if we set
> Collate & Ctype to C as against  en_US.UTF-8 with encoding set to UTF8 in both the cases ?

For one, the ordering will be substantially different.

Compare the result of these two queries:

SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C";
SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8";

Yours,
Laurenz Albe

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

От

Debraj Manna

Дата:

15 сентября 2018 г., 10:02:59

Thanks for replying.

Will there be any other difference like in terms of index size, etc?

Is there any link that I can refer that provides more details about the differences?

On Fri 14 Sep, 2018, 5:27 PM Laurenz Albe, <laurenz.albe@cybertec.at> wrote:

Debraj Manna wrote:
> Can someone let me know what is the difference we are expected to observe if we set
> Collate & Ctype to C as against en_US.UTF-8 with encoding set to UTF8 in both the cases ?

For one, the ordering will be substantially different.

Compare the result of these two queries:

SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C";
SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8";

Yours,
Laurenz Albe

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

От

Imre Samu

Дата:

15 сентября 2018 г., 14:21:16

> Is there any link that I can refer that provides more details about the differences?

According to the documentation ( https://www.postgresql.org/docs/10/static/locale.html )

""
The locale settings influence the following SQL features:

Sort order in queries using ORDER BY or the standard comparison operators on textual data
The upper, lower, and initcap functions
Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
The to_char family of functions
The ability to use indexes with LIKE clauses

The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.

As a workaround to allow PostgreSQL to use indexes with LIKE clauses under a non-C locale, several custom operator classes exist. These allow the creation of an index that performs a strict character-by-character comparison, ignoring locale comparison rules. Refer to Section 11.9 for more information. Another approach is to create indexes using the C collation, as discussed in Section 23.2.
""

the performance impact sometimes is huge.

Sorting Geohash with "C" locale is sometimes 40% faster: " ORDER BY ST_GeoHash(ST_Transform(ST_Envelope(geom),4326),10) COLLATE "C";"

http://www.paulnorman.ca/blog/2016/05/improve-your-st-geohash-sorting-with-these-three-simple-tricks/

Some links:

* https://www.postgresql.org/docs/current/static/collation.html

* https://blog.2ndquadrant.com/icu-support-postgresql-10/ ( More robust collations with ICU support in PostgreSQL 10 )
* https://blog.anayrat.info/en/2017/11/19/postgresql-10--icu--abbreviated-keys/ (PostgreSQL 10 : ICU & Abbreviated Keys )

Best,
Imre

2018-09-15 9:02 GMT+02:00 Debraj Manna <subharaj.manna@gmail.com>:

Thanks for replying.

Will there be any other difference like in terms of index size, etc?

Is there any link that I can refer that provides more details about the differences?

On Fri 14 Sep, 2018, 5:27 PM Laurenz Albe, <laurenz.albe@cybertec.at> wrote:
Debraj Manna wrote:
> Can someone let me know what is the difference we are expected to observe if we set
> Collate & Ctype to C as against en_US.UTF-8 with encoding set to UTF8 in both the cases ?

For one, the ordering will be substantially different.

Compare the result of these two queries:

SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "C";
SELECT chr(i) FROM generate_series(1, 50000) i ORDER BY chr(i) COLLATE "en_US.utf8";

Yours,
Laurenz Albe

Re: Difference between C and en_US.UTF-8 Collate & CType inPostgres 10.x

От

Laurenz Albe

Дата:

16 сентября 2018 г., 18:59:19

Debraj Manna wrote:
> Will there be any other difference like in terms of index size, etc?

The size will be the same, just the ordering will be different.

> Is there any link that I can refer that provides more details about the differences?

Maybe unicode.org has some material...

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

От

Tom Lane

Дата:

16 сентября 2018 г., 19:25:17

Laurenz Albe <laurenz.albe@cybertec.at> writes:
> Debraj Manna wrote:
>> Will there be any other difference like in terms of index size, etc?

> The size will be the same, just the ordering will be different.

Just to clarify, index sizes might not be *exactly* the same.  Different
ordering rules might lead to different upper-page-split choices as the
index grows, resulting in different amounts of unused space in some index
pages.  It should average out to about the same thing though.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

Re: Difference between C and en_US.UTF-8 Collate & CType inPostgres 10.x

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x

Re: Difference between C and en_US.UTF-8 Collate & CType inPostgres 10.x

Re: Difference between C and en_US.UTF-8 Collate & CType in Postgres 10.x