Обсуждение: ICU, locale and collation question

Поиск
Список
Период
Сортировка

ICU, locale and collation question

От
Oscar Carlberg
Дата:
Hello,

We have a bunch of existing Postgres 10 clusters running on CentOS 7,
which have been initialized (initdb) with these collation options;

-E 'UTF-8'
--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8
--lc_monetary=sv_SE.UTF-8
--lc-numeric=sv_SE.UTF-8
--lc-time=sv_SE.UTF-8
--lc-messages=en_US.UTF-8

And createdb were provided with these locale options when setting up
databases:
--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8

\l in psql gives:

          Name         |          Owner           | Encoding | Collate  
|    Ctype    |
----------------------+--------------------------+----------+-------------+-------------+
  test-db              | test-user                | UTF8     |
sv_SE.UTF-8 | sv_SE.UTF-8 |


We're upgrading the servers using logical replication, and would like to
take the opportunity to switch to ICU rather than relying on glibc, to
avoid future problems with index corruption if using physical
replication between servers with different versions of glibc.

We're trying to figure out the most correct way to configure postgres to
achieve this. Currently we have:

-E 'UTF-8'
--locale-provider=icu
--icu-locale=sv-SE-x-icu

And createdb are provided with locale options:
--lc-collate=C
--lc-ctype=C

\l in psql now gives:

         Name         |        Owner        | Encoding | Collate   |   
Ctype    | ICU Locale  | Locale Provider |
---------------------+---------------------+----------+-------------+-------------+-------------+-----------------+
  test-db             | test-user           | UTF8     | C           |
C           | sv-SE-x-icu | icu             |

Is this a safe configuration to avoid index corruption, and other
problems, while still being compatible with the previous locale
settings? We have done some testing and it appears ORDER BY does sort
rows according to Swedish localization in the ICU configured test-db.

We're uncertain since this blogpost ->
https://peter.eisentraut.org/blog/2022/09/26/icu-features-in-postgresql-15
mentions that there are still some postgres code relying on libc locale
facilities. Should we set lc-collate and lc-ctype to sv_SE.UTF-8 when
creating databases in addition to the ICU options provided to initdb due
to this? Will we still be safe from glibc related corruption as long as
--locale-provider=icu --icu-locale=sv-SE-x-icu is set?

Best Regards,

Oscar


--
Innehållet i detta e-postmeddelande är konfidentiellt och avsett endast för
adressaten.Varje spridning, kopiering eller utnyttjande av innehållet är
förbjuden utan tillåtelse av avsändaren. Om detta meddelande av misstag
gått till fel adressat vänligen radera det ursprungliga meddelandet och
underrätta avsändaren via e-post



Re: ICU, locale and collation question

От
Laurenz Albe
Дата:
On Mon, 2023-05-08 at 16:35 +0200, Oscar Carlberg wrote:
> We have a bunch of existing Postgres 10 clusters running on CentOS 7,
> which have been initialized (initdb) with these collation options;
>
> -E 'UTF-8'
> --lc-collate=sv_SE.UTF-8
> --lc-ctype=sv_SE.UTF-8
> --lc_monetary=sv_SE.UTF-8
> --lc-numeric=sv_SE.UTF-8
> --lc-time=sv_SE.UTF-8
> --lc-messages=en_US.UTF-8
>
> And createdb were provided with these locale options when setting up
> databases:
> --lc-collate=sv_SE.UTF-8
> --lc-ctype=sv_SE.UTF-8
>
> We're upgrading the servers using logical replication, and would like to
> take the opportunity to switch to ICU rather than relying on glibc, to
> avoid future problems with index corruption if using physical
> replication between servers with different versions of glibc.
>
> We're trying to figure out the most correct way to configure postgres to
> achieve this. Currently we have:
>
> -E 'UTF-8'
> --locale-provider=icu
> --icu-locale=sv-SE-x-icu
>
> And createdb are provided with locale options:
> --lc-collate=C
> --lc-ctype=C
>
> Is this a safe configuration to avoid index corruption, and other
> problems, while still being compatible with the previous locale
> settings?

Yes, that is safe.

But it is not compatible with the old setup when it comes to lc_time,
lc_messages and the others.  You should use sv_SE.UTF-8 for these
locale categories.

Note that that won't cause problems with upgrading the C library.

Yours,
Laurenz Albe



Re: ICU, locale and collation question

От
Oscar Carlberg
Дата:
On 5/8/23 22:07, Laurenz Albe wrote:

> Yes, that is safe.
>
> But it is not compatible with the old setup when it comes to lc_time,
> lc_messages and the others.  You should use sv_SE.UTF-8 for these
> locale categories.
>
> Note that that won't cause problems with upgrading the C library.
>
> Yours,
> Laurenz Albe
>
Thank you very much for clarifying!

Our initdb setup would then look like this for compatibility;
-E 'UTF-8'
--locale-provider=icu
--icu-locale=sv-SE-x-icu
--lc_monetary=sv_SE.UTF-8
--lc-numeric=sv_SE.UTF-8
--lc-time=sv_SE.UTF-8
--lc-messages=en_US.UTF-8

Should we still provide createdb with --lc-collate=C and --lc-ctype=C,
or should we set those to sv_SE.UTF-8 as well?

Best regards,
Oscar


--
Innehållet i detta e-postmeddelande är konfidentiellt och avsett endast för
adressaten.Varje spridning, kopiering eller utnyttjande av innehållet är
förbjuden utan tillåtelse av avsändaren. Om detta meddelande av misstag
gått till fel adressat vänligen radera det ursprungliga meddelandet och
underrätta avsändaren via e-post



Re: ICU, locale and collation question

От
Laurenz Albe
Дата:
On Tue, 2023-05-09 at 08:54 +0200, Oscar Carlberg wrote:
> Should we still provide createdb with --lc-collate=C and --lc-ctype=C,
> or should we set those to sv_SE.UTF-8 as well?

That depends on whether you want Swedish or binary sort order
(is "a" < "Z" or not?).  The C collation is not good for natural
language sorting, but it is fast and not subject to collation changes.

Yours,
Laurenz Albe



Re: ICU, locale and collation question

От
Peter Eisentraut
Дата:
On 09.05.23 08:54, Oscar Carlberg wrote:
> Our initdb setup would then look like this for compatibility;
> -E 'UTF-8'
> --locale-provider=icu
> --icu-locale=sv-SE-x-icu
> --lc_monetary=sv_SE.UTF-8
> --lc-numeric=sv_SE.UTF-8
> --lc-time=sv_SE.UTF-8
> --lc-messages=en_US.UTF-8
> 
> Should we still provide createdb with --lc-collate=C and --lc-ctype=C, 
> or should we set those to sv_SE.UTF-8 as well?

You should set those to something other than C.  It doesn't matter much 
what exactly, so what you have there is fine.

Setting it to C would for example affect the ability of the text search 
functionality to detect words containing non-ASCII characters.




Re: ICU, locale and collation question

От
Kirk Wolak
Дата:
On Tue, May 9, 2023 at 11:24 AM Peter Eisentraut <peter.eisentraut@enterprisedb.com> wrote:
On 09.05.23 08:54, Oscar Carlberg wrote:
> Our initdb setup would then look like this for compatibility;
> -E 'UTF-8'
> --locale-provider=icu
> --icu-locale=sv-SE-x-icu
> --lc_monetary=sv_SE.UTF-8
> --lc-numeric=sv_SE.UTF-8
> --lc-time=sv_SE.UTF-8
> --lc-messages=en_US.UTF-8
>
> Should we still provide createdb with --lc-collate=C and --lc-ctype=C,
> or should we set those to sv_SE.UTF-8 as well?

You should set those to something other than C.  It doesn't matter much
what exactly, so what you have there is fine.

Setting it to C would for example affect the ability of the text search
functionality to detect words containing non-ASCII characters.

Doesn't searching LIKE 'abc%'  behave much better for C than others.  This was the driving force for choosing C for us.
[EXPLAIN made it clear that it was range bound until 'abd']
 

Re: ICU, locale and collation question

От
Peter Eisentraut
Дата:
On 10.05.23 07:02, Kirk Wolak wrote:
> On Tue, May 9, 2023 at 11:24 AM Peter Eisentraut 
> <peter.eisentraut@enterprisedb.com 
> <mailto:peter.eisentraut@enterprisedb.com>> wrote:
> 
>     On 09.05.23 08:54, Oscar Carlberg wrote:
>      > Our initdb setup would then look like this for compatibility;
>      > -E 'UTF-8'
>      > --locale-provider=icu
>      > --icu-locale=sv-SE-x-icu
>      > --lc_monetary=sv_SE.UTF-8
>      > --lc-numeric=sv_SE.UTF-8
>      > --lc-time=sv_SE.UTF-8
>      > --lc-messages=en_US.UTF-8
>      >
>      > Should we still provide createdb with --lc-collate=C and
>     --lc-ctype=C,
>      > or should we set those to sv_SE.UTF-8 as well?
> 
>     You should set those to something other than C.  It doesn't matter much
>     what exactly, so what you have there is fine.
> 
>     Setting it to C would for example affect the ability of the text search
>     functionality to detect words containing non-ASCII characters.
> 
> Doesn't searching LIKE 'abc%'  behave much better for C than others.  
> This was the driving force for choosing C for us.
> [EXPLAIN made it clear that it was range bound until 'abd']

For that use, I would recommend making an index specifically on the 
tables you need, instead of switching the whole database.

Also, if you are using the ICU provider for the database, then setting 
lc_collation=C wouldn't even affect LIKE optimization, because the ICU 
locale would be used.