Обсуждение: Order changes in PG16 since ICU introduction

Поиск

Список

Период

Сортировка

Order changes in PG16 since ICU introduction

От

"Regina Obe"

Дата:

21 апреля 2023 г., 15:27:19

A couple of days ago, our PostGIS PG16 bots started failing with order
changes in text.
We have our tests set to locale=c

It seems since April 20th, our tests that rely on sorting characters
changed.
As noted in this ticket:

https://trac.osgeo.org/postgis/ticket/5375

I'm assuming it's result of icu change:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3ac
dcb9a60313325618fd7080aa36f1626

I suspect all our bots are compiling with icu enabled. But I haven't
confirmed.

I'm assuming this is an expected change in behavior, but just want to
confirm.

Thanks,
Regina

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 15:48:51

"Regina Obe" <lr@pcorp.us> writes:
> A couple of days ago, our PostGIS PG16 bots started failing with order
> changes in text.
> We have our tests set to locale=c

> It seems since April 20th, our tests that rely on sorting characters
> changed.
> As noted in this ticket:

> https://trac.osgeo.org/postgis/ticket/5375

> I'm assuming it's result of icu change:
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=fcb21b3ac
> dcb9a60313325618fd7080aa36f1626

> I suspect all our bots are compiling with icu enabled. But I haven't
> confirmed.

If they actually are using locale C, I would say this is a bug.
That should designate memcmp sorting and nothing else.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 16:15:39

On Fri, 2023-04-21 at 11:27 -0400, Regina Obe wrote:
> A couple of days ago, our PostGIS PG16 bots started failing with
> order
> changes in text.
> We have our tests set to locale=c

Are you sure it's still using the C locale? The results seem to be
explainable if the locale switched from "C" to "en-US-x-icu":

The results of the following are the same in v15 and v16:

select 'RM(25)/nodes|+|21|1' collate "C" < 'RM(25)/nodes|-|21|' collate
"C"; -- true

select 'RM(25)/nodes|+|21|1' collate "en-US-x-icu" < 'RM(25)/nodes|-
|21|' collate "en-US-x-icu"; -- false

I suspect when the initdb and configure defaults changed from libc to
ICU, then your locale changed.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Sandro Santilli

Дата:

21 апреля 2023 г., 17:09:14

On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
> "Regina Obe" <lr@pcorp.us> writes:
> 
> > https://trac.osgeo.org/postgis/ticket/5375
> 
> If they actually are using locale C, I would say this is a bug.
> That should designate memcmp sorting and nothing else.

Sounds like a bug to me. This is happening with a PostgreSQL cluster
created and served by a build of commit c04c6c5d6f :

  =# select version();
  PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
  =# show lc_collate;
  C
  =# select '+' < '-';
  f
  =# select '+' < '-' collate "C";
  t

I don't know if it should matter but also:

  =# show lc_messages;
  C

--strk;

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

21 апреля 2023 г., 17:14:13

On 21.04.23 19:09, Sandro Santilli wrote:
> On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
>> "Regina Obe" <lr@pcorp.us> writes:
>>
>>> https://trac.osgeo.org/postgis/ticket/5375
>>
>> If they actually are using locale C, I would say this is a bug.
>> That should designate memcmp sorting and nothing else.
> 
> Sounds like a bug to me. This is happening with a PostgreSQL cluster
> created and served by a build of commit c04c6c5d6f :
> 
>    =# select version();
>    PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
>    =# show lc_collate;
>    C
>    =# select '+' < '-';
>    f

If the database is created with locale provider ICU, then lc_collate 
does not apply here, so the result might be correct (depending on what 
locale you have set).

>    =# select '+' < '-' collate "C";
>    t

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 17:27:49

On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
>   =# select version();
>   PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
>   =# show lc_collate;
>   C
>   =# select '+' < '-';
>   f

What is the result of:

  select datlocprovider, datcollate, daticulocale
    from pg_database where datname=current_database();

>   =# select '+' < '-' collate "C";
>   t

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 17:28:11

Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> If the database is created with locale provider ICU, then lc_collate 
> does not apply here, so the result might be correct (depending on what 
> locale you have set).

FWIW, an installation created under LANG=C defaults to ICU locale
en-US-u-va-posix for me (see psql \l), and that still sorts as
expected on my RHEL8 box.  We've not seen buildfarm problems either.

I am wondering however whether this doesn't mean that all our carefully
coded fast paths for C locale just went down the drain.  Does the ICU
code have any of that?  Has any performance testing been done to see
what impact this change had on C-locale installations?  (The current
code coverage report for pg_locale.c is not encouraging.)

            regards, tom lane

RE: Order changes in PG16 since ICU introduction

От

"Regina Obe"

Дата:

21 апреля 2023 г., 17:37:25

> Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
> > If the database is created with locale provider ICU, then lc_collate
> > does not apply here, so the result might be correct (depending on what
> > locale you have set).
> 
> FWIW, an installation created under LANG=C defaults to ICU locale en-US-u-
> va-posix for me (see psql \l), and that still sorts as expected on my
RHEL8 box.
> We've not seen buildfarm problems either.
> 
> I am wondering however whether this doesn't mean that all our carefully
> coded fast paths for C locale just went down the drain.  Does the ICU code
> have any of that?  Has any performance testing been done to see what
impact
> this change had on C-locale installations?  (The current code coverage
report
> for pg_locale.c is not encouraging.)
> 
>             regards, tom lane

Just  another metric.

On my mingw64 setup, I built a test database on PG16 (built with icu
support) and PG15 (no icu support)

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C';

I think the above is the similar setup we have when testing.

On PG15 

SELECT '+' <  '-' ;  returns true

On PG 16 returns false

For PG 16, to strk's point, you have to do: to get a true
SELECT '+' COLLATE "C" <  '-'  COLLATE "C";


I would expect since I'm initializing my db in collate C they would both
behave the same

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 17:46:25

"Regina Obe" <lr@pcorp.us> writes:
> On my mingw64 setup, I built a test database on PG16 (built with icu
> support) and PG15 (no icu support)

> CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
> LC_CTYPE = 'C';

As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
meaningless when the locale provider is ICU.  You need to look
at what ICU locale is being chosen, or force it with LOCALE = 'C'.

            regards, tom lane

RE: Order changes in PG16 since ICU introduction

От

"Regina Obe"

Дата:

21 апреля 2023 г., 17:56:14

> > CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8'
> LC_COLLATE = 'C'
> > LC_CTYPE = 'C';
> 
> As has been pointed out already, setting LC_COLLATE/LC_CTYPE is
> meaningless when the locale provider is ICU.  You need to look at what ICU
> locale is being chosen, or force it with LOCALE = 'C'.
> 
>             regards, tom lane

Okay got it was on IRC with RhodiumToad and he suggested:

CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
LC_CTYPE = 'C' ICU_LOCALE='C';

Which gives expected result:
SELECT '+'  <  '-'  ;  -- true

 but gives me a notice:
NOTICE:  using standard form "en-US-u-va-posix" for locale "C"

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 17:59:42

"Regina Obe" <lr@pcorp.us> writes:
> Okay got it was on IRC with RhodiumToad and he suggested:
> CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' LC_COLLATE = 'C'
> LC_CTYPE = 'C' ICU_LOCALE='C';

> Which gives expected result:
> SELECT '+'  <  '-'  ;  -- true

>  but gives me a notice:
> NOTICE:  using standard form "en-US-u-va-posix" for locale "C"

Yeah.  My recommendation is just LOCALE:

regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
regression=# CREATE DATABASE test2 TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
CREATE DATABASE

I think it's probably intentional that ICU_LOCALE is stricter
about being given a real ICU locale name, but I didn't write
any of that code.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Andrew Gierth

Дата:

21 апреля 2023 г., 18:00:35

>>>>> "Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:

 Peter> If the database is created with locale provider ICU, then
 Peter> lc_collate does not apply here,

Having lc_collate return a value which is silently being ignored seems
to me rather hugely confusing.

Also, somewhere along the line someone broke initdb --no-locale, which
should result in C locale being the default everywhere, but when I just
tested it it picked 'en' for an ICU locale, which is not the right
thing.

-- 
Andrew (irc:RhodiumToad)

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 18:06:57

Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> "Peter" == Peter Eisentraut <peter.eisentraut@enterprisedb.com> writes:
>  Peter> If the database is created with locale provider ICU, then
>  Peter> lc_collate does not apply here,

> Having lc_collate return a value which is silently being ignored seems
> to me rather hugely confusing.

It's not *completely* ignored --- there are bits of code that are not
yet ICU-ified and will still use the libc facilities.  So we can't
get rid of those options yet, even in an ICU-based database.

> Also, somewhere along the line someone broke initdb --no-locale, which
> should result in C locale being the default everywhere, but when I just
> tested it it picked 'en' for an ICU locale, which is not the right
> thing.

Confirmed:

$ LANG=en_US.utf8 initdb --no-locale
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
  provider:    icu
  ICU locale:  en-US
  LC_COLLATE:  C
  LC_CTYPE:    C
  ...

That needs to be fixed: --no-locale should prevent any consideration
of initdb's LANG/LC_foo environment.

            regards, tom lane

RE: Order changes in PG16 since ICU introduction

От

"Regina Obe"

Дата:

21 апреля 2023 г., 18:13:49

> Yeah.  My recommendation is just LOCALE:
> 
> regression=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING =
> 'UTF8' LOCALE = 'C'; CREATE DATABASE regression=# CREATE DATABASE test2
> TEMPLATE=template0 ENCODING = 'UTF8' ICU_LOCALE = 'C';
> NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
> CREATE DATABASE
> 
> I think it's probably intentional that ICU_LOCALE is stricter about being
given
> a real ICU locale name, but I didn't write any of that code.
> 
>             regards, tom lane

CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';

Doesn't seem to work at least not under mingw64 anyway.

SELECT '+'  <  '-'  ;

Returns false

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 18:23:45

"Regina Obe" <lr@pcorp.us> writes:
> CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
> Doesn't seem to work at least not under mingw64 anyway.

Hmm, doesn't work for me either:

$ LANG=en_US.utf8 initdb
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

Using default ICU locale "en_US".
Using language tag "en-US" for ICU locale "en_US".
The database cluster will be initialized with this locale configuration:
  provider:    icu
  ICU locale:  en-US
  LC_COLLATE:  en_US.utf8
  LC_CTYPE:    en_US.utf8
  LC_MESSAGES: en_US.utf8
  LC_MONETARY: en_US.utf8
  LC_NUMERIC:  en_US.utf8
  LC_TIME:     en_US.utf8
  ...
$ psql postgres
psql (16devel)
Type "help" for help.

postgres=# SELECT '+'  <  '-'  ;
 ?column?
----------
 f
(1 row)

(as expected, so far)

postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';
CREATE DATABASE
postgres=# \c test1
You are now connected to database "test1" as user "postgres".
test1=# SELECT '+'  <  '-'  ;
 ?column?
----------
 f
(1 row)

(wrong!)

test1=# \l
                                                      List of databases
   Name    |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | ICU Locale | ICU Rules |   Access
privileges   

-----------+----------+----------+-----------------+------------+------------+------------+-----------+-----------------------
 postgres  | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           |
 template0 | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           | =c/postgres
    + 
           |          |          |                 |            |            |            |           |
postgres=CTc/postgres
 template1 | postgres | UTF8     | icu             | en_US.utf8 | en_US.utf8 | en-US      |           | =c/postgres
    + 
           |          |          |                 |            |            |            |           |
postgres=CTc/postgres
 test1     | postgres | UTF8     | icu             | C          | C          | en-US      |           |
(4 rows)

Looks like the "pick en-US even when told not to" problem exists here too.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Sandro Santilli

Дата:

21 апреля 2023 г., 19:14:13

On Fri, Apr 21, 2023 at 07:14:13PM +0200, Peter Eisentraut wrote:
> On 21.04.23 19:09, Sandro Santilli wrote:
> > On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
> > > "Regina Obe" <lr@pcorp.us> writes:
> > > 
> > > > https://trac.osgeo.org/postgis/ticket/5375
> > > 
> > > If they actually are using locale C, I would say this is a bug.
> > > That should designate memcmp sorting and nothing else.
> > 
> > Sounds like a bug to me. This is happening with a PostgreSQL cluster
> > created and served by a build of commit c04c6c5d6f :
> > 
> >    =# select version();
> >    PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
> >    =# show lc_collate;
> >    C
> >    =# select '+' < '-';
> >    f
> 
> If the database is created with locale provider ICU, then lc_collate does
> not apply here, so the result might be correct (depending on what locale you
> have set).

The database is created by a perl script which starts like this:

  $ENV{"LC_ALL"} = "C";
  $ENV{"LANG"} = "C";

And then runs:

  createdb --encoding=UTF-8 --template=template0 --lc-collate=C 

Should we tweak anything else to make the results predictable ?

--strk;

Re: Order changes in PG16 since ICU introduction

От

Andrew Gierth

Дата:

21 апреля 2023 г., 19:14:20

>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

 >> Also, somewhere along the line someone broke initdb --no-locale,
 >> which should result in C locale being the default everywhere, but
 >> when I just tested it it picked 'en' for an ICU locale, which is not
 >> the right thing.

 Tom> Confirmed:

 Tom> $ LANG=en_US.utf8 initdb --no-locale
 Tom> The files belonging to this database system will be owned by user "postgres".
 Tom> This user must also own the server process.

 Tom> Using default ICU locale "en_US".
 Tom> Using language tag "en-US" for ICU locale "en_US".
 Tom> The database cluster will be initialized with this locale configuration:
 Tom>   provider:    icu
 Tom>   ICU locale:  en-US
 Tom>   LC_COLLATE:  C
 Tom>   LC_CTYPE:    C
 Tom>   ...

 Tom> That needs to be fixed: --no-locale should prevent any
 Tom> consideration of initdb's LANG/LC_foo environment.

Would it also not make sense to also take into account any --locale and
--lc-* options before choosing an ICU default locale? Right now if you
do, say, initdb --locale=fr_FR you get an ICU locale based on the
environment but lc_* settings based on the option, which seems maximally
confusing.

Also, what happens now to lc_collate_is_c() when the provider is ICU? Am
I missing something, or is it never true now, even if you specified C /
POSIX / en-US-u-va-posix as the ICU locale? This seems like it could be
an important pessimization.

Also also, we now have the problem that it is much harder to create a
'C' collation database within an existing cluster (e.g. for testing)
without knowing whether the default provider is ICU. In the past one
would have done:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C';

but now that creates a database that uses the same ICU locale as
template0 by default. If instead one tries:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' ICU_LOCALE='C';

then one gets an error if the default locale provider is _not_ ICU. The
only option now seems to be:

CREATE DATABASE test TEMPLATE=template0 ENCODING = 'UTF8' LOCALE = 'C' LOCALE_PROVIDER = 'libc';

which of course doesn't work in older pg versions.

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

От

Sandro Santilli

Дата:

21 апреля 2023 г., 19:17:25

On Fri, Apr 21, 2023 at 10:27:49AM -0700, Jeff Davis wrote:
> On Fri, 2023-04-21 at 19:09 +0200, Sandro Santilli wrote:
> >   =# select version();
> >   PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu
> > 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
> >   =# show lc_collate;
> >   C
> >   =# select '+' < '-';
> >   f
> 
> What is the result of:
> 
>   select datlocprovider, datcollate, daticulocale
>     from pg_database where datname=current_database();

datlocprovider | i
datcollate     | C
daticulocale   | en-US

--strk;

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 19:25:00

On Fri, 2023-04-21 at 14:23 -0400, Tom Lane wrote:
> postgres=# CREATE DATABASE test1 TEMPLATE=template0 ENCODING = 'UTF8'
> LOCALE = 'C';

...

>  test1     | postgres | UTF8     | icu             | C          |
> C          | en-US      |           |
> (4 rows)
>
> Looks like the "pick en-US even when told not to" problem exists here
> too.

Both provider (ICU) and the icu locale (en-US) are inherited from
template0. The LOCALE parameter to CREATE DATABASE doesn't affect
either of those things, because there's a separate parameter
ICU_LOCALE.

This happens the same way in v15, and although it matches the
documentation technically, it is not a great user experience.

I have a couple ideas:

1. Introduce a "none" provider to separate the concept of C/POSIX
locales from the libc provider. It's not really using a provider
anyway, it's just using memcmp(), and I think it causes confusion to
combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
"LOCALE_PROVIDER=libc LOCALE='C'".

2. Change the CREATE DATABASE syntax to catch these errors better at
the possible expense of backwards compatibility.

I am also having second thoughts about accepting "C" or "POSIX" as an
ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
terribly useful (why not just use memcmp()?), it's not fast in my
measurements (en-US is faster), so maybe it's better to just throw an
error and tell the user to use C (or provider=none as I suggest
above)? 

Obviously the user could manually type "en-US-u-va-posix" if that's the
locale they want. Throwing an error would be a backwards-compatibility
issue, but in v15 an ICU locale of "C" just gives the root locale
anyway, which is probably not what they want.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

21 апреля 2023 г., 20:00:09

Jeff Davis <pgsql@j-davis.com> writes:
> I have a couple ideas:

> 1. Introduce a "none" provider to separate the concept of C/POSIX
> locales from the libc provider. It's not really using a provider
> anyway, it's just using memcmp(), and I think it causes confusion to
> combine them. Saying "LOCALE_PROVIDER=none" is less error-prone than
> "LOCALE_PROVIDER=libc LOCALE='C'".

I think I might like this idea, except for one thing: you're imagining
that the locale doesn't control anything except string comparisons.
What about to_upper/to_lower, character classifications in regexes, etc?
(I'm not sure whether those operations can get redirected to ICU today
or whether they still always go to libc, but we'll surely want to fix
it eventually if the latter is still true.)

In any case, that seems somewhat orthogonal to what we're on about here,
which is making the behavior of CREATE DATABASE less surprising and more
backwards-compatible.  I'm not sure that provider=none can help with that.
Aside from the user-surprise issues discussed up to now, pg_dump scripts
emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
clauses in CREATE DATABASE, and people are going to be very unhappy
if that means they suddenly get totally different locale semantics
after restoring into a new DB.  I think we need some plan for mapping
libc-style locale specs into ICU locales so that we can make that
more nearly transparent.

> 2. Change the CREATE DATABASE syntax to catch these errors better at
> the possible expense of backwards compatibility.

That is the exact opposite of what I think we need.  Backwards
compatibility isn't optional.

Maybe this means we are not ready to do ICU-by-default in v16.
It certainly feels like there might be more here than we want to
start designing post-feature-freeze.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 20:03:19

On Fri, 2023-04-21 at 21:14 +0200, Sandro Santilli wrote:
> And then runs:
>
>   createdb --encoding=UTF-8 --template=template0 --lc-collate=C
>
> Should we tweak anything else to make the results predictable ?

You can specify --locale-provider=libc

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 20:13:27

On Fri, 2023-04-21 at 13:28 -0400, Tom Lane wrote:
> I am wondering however whether this doesn't mean that all our
> carefully
> coded fast paths for C locale just went down the drain.

The code still exists. You can test it by using the built-in collation
"C" which is correctly specified with collprovider=libc and
collcollate=C.

For my test dataset, ICU 72, glibc 2.35:

  -- ~07s
  explain analyze select t from a order by t collate "C";

  -- ~15s
  explain analyze select t from a order by t collate "en-US-x-icu";

  -- ~21s
  explain analyze select t from a order by t collate "en-US-u-va-posix-
x-icu";

  -- ~34s
  explain analyze select t from a order by t collate "en_US";

I believe the confusion in this thread comes from:

* The syntax of CREATE DATABASE (the same as v15 but still confusing)
* The fact that you need provider=libc to get memcmp() behavior (same
as v15 but still confusing)

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Robert Haas

Дата:

21 апреля 2023 г., 20:33:48

On Fri, Apr 21, 2023 at 3:25 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I am also having second thoughts about accepting "C" or "POSIX" as an
> ICU locale and transforming it to "en-US-u-va-posix" in v16. It's not
> terribly useful (why not just use memcmp()?), it's not fast in my
> measurements (en-US is faster), so maybe it's better to just throw an
> error and tell the user to use C (or provider=none as I suggest
> above)?

I mean, to renew a complaint I've made previously, how the heck is
anyone supposed to understand what's going on here?

We have no meaningful documentation of how to select an ICU locale
that works for you. We have a couple of examples and a suggestion that
you should use BCP 47. But when I asked before for documentation
references, the ones you provided were not clear, basically
incomprehensible. In follow-up discussion, you admitted you'd had to
consult the source code to figure certain things out.

And the fact that "C" or "POSIX" gets transformed into
"en-US-u-va-posix" is also completely documented. That string appears
twice in the code, but zero times in the documentation. There's code
to do it, but users shouldn't have to read code, and it wouldn't help
much if they did, because the code comments don't really explain the
rationale behind this choice either.

I find the fact that people are having trouble here completely
predictable. Of course if people ask for "C" and the system tells them
that it's using "en-US-u-va-posix" instead they're going to be
confused and ask questions, exactly as is happening here. glibc
collations aren't particularly well-documented either, but people have
some experience with, and they can get a list of values that have a
chance of working from /usr/share/locale, and they know what "C"
means. Nobody knows what "en-US-u-va-posix" is. It's not even
Googleable, really, whereas "C locale" is.

My opinion is that the switch to using ICU by default is ill-advised
and should be reverted. The compatibility break isn't worth whatever
advantages ICU may have, the documentation to allow people to
transition to ICU with reasonable effort doesn't exist, and the fact
that within weeks of feature freeze people who know a lot about
PostgreSQL are struggling to get the behavior they want is a really
bad sign.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 20:45:50

On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
> > > > >
> Also, somewhere along the line someone broke initdb --no-locale,
> which
> should result in C locale being the default everywhere, but when I
> just
> tested it it picked 'en' for an ICU locale, which is not the right
> thing.

Fixed, thank you.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 20:50:41

On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
> Maybe this means we are not ready to do ICU-by-default in v16.
> It certainly feels like there might be more here than we want to
> start designing post-feature-freeze.

I don't see how punting to the next release helps. If the CREATE
DATABASE syntax (and similar issues for createdb and initdb) in v15 is
just too confusing, and we can't find a remedy for v16, then we
probably won't find a remedy for v17 either.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Andrew Gierth

Дата:

21 апреля 2023 г., 21:08:59

>>>>> "Jeff" == Jeff Davis <pgsql@j-davis.com> writes:

 >> Also, somewhere along the line someone broke initdb --no-locale,
 >> which should result in C locale being the default everywhere, but
 >> when I just tested it it picked 'en' for an ICU locale, which is not
 >> the right thing.

 Jeff> Fixed, thank you.

Is that the right fix, though? (It forces --locale-provider=libc for the
cluster default, which might not be desirable?)

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 21:25:26

On Fri, 2023-04-21 at 22:08 +0100, Andrew Gierth wrote:
> > > > >
> Is that the right fix, though? (It forces --locale-provider=libc for
> the
> cluster default, which might not be desirable?)

For the "no locale" behavior (memcmp()-based) the provider needs to be
libc. Do you see an alternative?

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Andrew Gierth

Дата:

21 апреля 2023 г., 21:35:09

>>>>> "Jeff" == Jeff Davis <pgsql@j-davis.com> writes:

 >> Is that the right fix, though? (It forces --locale-provider=libc for
 >> the cluster default, which might not be desirable?)

 Jeff> For the "no locale" behavior (memcmp()-based) the provider needs
 Jeff> to be libc. Do you see an alternative?

Can lc_collate_is_c() be taught to check whether an ICU locale is using
POSIX collation?

There's now another bug in that --no-locale no longer does the same
thing as --locale=C (which is its long-established documented behavior).
How should these various options interact? This all seems not well
thought out from a usability perspective, and I think a proper fix
should involve a bit more serious consideration.

-- 
Andrew.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 21:56:19

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:

> My opinion is that the switch to using ICU by default is ill-advised
> and should be reverted.

Most of the complaints seem to be complaints about v15 as well, and
while those complaints may be a reason to not make ICU the default,
they are also an argument that we should continue to learn and try to
fix those issues because they exist in an already-released version.
Leaving it the default for now will help us fix those issues rather
than hide them.

It's still early, so we have plenty of time to revert the initdb
default if we need to.

Regards,
    Jeff Davis

RE: Order changes in PG16 since ICU introduction

От

"Regina Obe"

Дата:

21 апреля 2023 г., 22:39:26

> > My opinion is that the switch to using ICU by default is ill-advised
> > and should be reverted.
>
> Most of the complaints seem to be complaints about v15 as well, and while
> those complaints may be a reason to not make ICU the default, they are also
> an argument that we should continue to learn and try to fix those issues
> because they exist in an already-released version.
> Leaving it the default for now will help us fix those issues rather than hide
> them.
>
> It's still early, so we have plenty of time to revert the initdb default if we need
> to.
>
> Regards,
>     Jeff Davis

I'm fine with that.  Sounds like it wouldn't be too hard to just pull it out at the end.

Before this, I didn't even know ICU existed in PG15. My first realization that ICU was even a thing was when my PG16
refusedto compile without adding my ICU path to my pkg-config or putting in --without-icu. 

So yah I suspect leaving it in a little bit longer will uncover some more issues and won't harm too much.

Thanks,
Regina

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 апреля 2023 г., 23:00:30

On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
> And the fact that "C" or "POSIX" gets transformed into
> "en-US-u-va-posix"

I already expressed, on reflection, that we should probably just not do
that. So I think we're in agreement on this point; patch attached.

Regards,
    Jeff Davis

Вложения

0001-ICU-do-not-convert-locale-C-to-en-US-u-va-posix.patch

Re: Order changes in PG16 since ICU introduction

От

Robert Haas

Дата:

22 апреля 2023 г., 00:12:59

On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
> Most of the complaints seem to be complaints about v15 as well, and
> while those complaints may be a reason to not make ICU the default,
> they are also an argument that we should continue to learn and try to
> fix those issues because they exist in an already-released version.
> Leaving it the default for now will help us fix those issues rather
> than hide them.
>
> It's still early, so we have plenty of time to revert the initdb
> default if we need to.

That's fair enough, but I really think it's important that some energy
get invested in providing adequate documentation for this stuff. Just
patching the code is not enough.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

24 апреля 2023 г., 15:10:04

On 21.04.23 19:14, Peter Eisentraut wrote:
> On 21.04.23 19:09, Sandro Santilli wrote:
>> On Fri, Apr 21, 2023 at 11:48:51AM -0400, Tom Lane wrote:
>>> "Regina Obe" <lr@pcorp.us> writes:
>>>
>>>> https://trac.osgeo.org/postgis/ticket/5375
>>>
>>> If they actually are using locale C, I would say this is a bug.
>>> That should designate memcmp sorting and nothing else.
>>
>> Sounds like a bug to me. This is happening with a PostgreSQL cluster
>> created and served by a build of commit c04c6c5d6f :
>>
>>    =# select version();
>>    PostgreSQL 16devel on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 
>> 11.3.0-1ubuntu1~22.04) 11.3.0, 64-bit
>>    =# show lc_collate;
>>    C
>>    =# select '+' < '-';
>>    f
> 
> If the database is created with locale provider ICU, then lc_collate 
> does not apply here, so the result might be correct (depending on what 
> locale you have set).

The GUC settings lc_collate and lc_ctype are from a time when those 
locale settings were cluster-global.  When we made those locale settings 
per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can 
use ICU as the per-database locale provider, so what is being attempted 
in the above example is already meaningless before PG 16, since you need 
to look into pg_database to find out what is really happening.

I think we should just remove the GUC parameters lc_collate and lc_ctype.

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

24 апреля 2023 г., 15:26:42

On 22.04.23 01:00, Jeff Davis wrote:
> On Fri, 2023-04-21 at 16:33 -0400, Robert Haas wrote:
>> And the fact that "C" or "POSIX" gets transformed into
>> "en-US-u-va-posix"
> 
> I already expressed, on reflection, that we should probably just not do
> that. So I think we're in agreement on this point; patch attached.

This makes sense to me.  This way, if someone specifies 'C' locale 
together with ICU provider they get an error.  They can then choose to 
use the libc provider, to get the performance path, or stick with ICU by 
using the native spelling of the locale.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

25 апреля 2023 г., 04:31:40

On Fri, 2023-04-21 at 16:00 -0400, Tom Lane wrote:
> I think I might like this idea, except for one thing: you're
> imagining
> that the locale doesn't control anything except string comparisons.
> What about to_upper/to_lower, character classifications in regexes,
> etc?

If provider='libc' and LC_CTYPE='C', str_toupper/str_tolower are
handled with asc_tolower/asc_toupper. The regex character
classification is done with pg_char_properties. In these cases neither
ICU nor libc is used; it's just code in postgres.

libc is special in that you can set LC_COLLATE and LC_CTYPE separately,
so that different locales are used for sorting and character
classification. That's potentially useful to set LC_COLLATE to C for
performance reasons, while setting LC_CTYPE to a useful locale. We
don't allow ICU to set collation and ctype separately (it would be
possible to allow it, but I don't think there's a huge demand and it's
arguably inconsistent to set them differently).

> (I'm not sure whether those operations can get redirected to ICU
> today
> or whether they still always go to libc, but we'll surely want to fix
> it eventually if the latter is still true.)

Those operations do get redirected to ICU today. There are extensions
that call locale-sensitive libc functions directly, and obviously those
won't use ICU.

> Aside from the user-surprise issues discussed up to now, pg_dump
> scripts
> emitted by pre-v15 pg_dump are not going to contain LOCALE_PROVIDER
> clauses in CREATE DATABASE, and people are going to be very unhappy
> if that means they suddenly get totally different locale semantics
> after restoring into a new DB.

Agreed.

>   I think we need some plan for mapping
> libc-style locale specs into ICU locales so that we can make that
> more nearly transparent.

ICU does a reasonable job mapping libc-like locale names to ICU
locales, e.g. en_US to en-US, etc. The ordering semantics aren't
guaranteed to be the same, of course (because the libc-locales are
platform-dependent), but it's at least conceptually the same locale.

>
> Maybe this means we are not ready to do ICU-by-default in v16.
> It certainly feels like there might be more here than we want to
> start designing post-feature-freeze.

This thread is already on the Open Items list. As long as it's not too
disruptive to others I'll leave it as-is for now to see how this sorts
out. Right now it's not clear to me how much of this is a v15 issue vs
a v16 issue.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

25 апреля 2023 г., 15:00:19

    Jeff Davis wrote:

> > (I'm not sure whether those operations can get redirected to ICU
> > today
> > or whether they still always go to libc, but we'll surely want to fix
> > it eventually if the latter is still true.)
>
> Those operations do get redirected to ICU today.

FTR the full text search parser still uses the libc functions
is[w]space/alpha/digit...  that depend on lc_ctype, whether the db
collation provider is ICU or not.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

25 апреля 2023 г., 15:20:38

"Daniel Verite" <daniel@manitou-mail.org> writes:
> FTR the full text search parser still uses the libc functions
> is[w]space/alpha/digit...  that depend on lc_ctype, whether the db
> collation provider is ICU or not.

Yeah, those aren't even connected up to the collation-selection
mechanisms; lots of work to do there.  I wonder if they could be
made to use regc_pg_locale.c instead of duplicating logic.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

26 апреля 2023 г., 01:15:21

On Fri, 2023-04-21 at 22:35 +0100, Andrew Gierth wrote:
> > > > >
> Can lc_collate_is_c() be taught to check whether an ICU locale is
> using
> POSIX collation?

Attached are a few small patches:

  0001: don't convert C to en-US-u-va-posix
  0002: handle locale C the same regardless of the provider, as you
suggest above
  0003: make LOCALE (or --locale) apply to everything including ICU

As far as I can tell, any libc locale has a reasonable match in ICU, so
setting LOCALE to either C or a libc locale name should be fine. Some
locales are only valid in ICU, e.g. '@colStrength=primary', or a
language tag representation, so if you do something like:

  create database foo locale 'en_US@colStrenghth=primary'
    template template0;

You'll get a decent error like:

  ERROR:  invalid LC_COLLATE locale name: "en_US@colStrenghth=primary"
  HINT:  If the locale name is specific to ICU, use ICU_LOCALE.

Overall, I think it works out nicely. Let me know if there are still
some confusing cases. I tried a few variations and this one seemed the
best, but I may have missed something.

Regards,
    Jeff Davis

Вложения

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

27 апреля 2023 г., 12:23:24

    Jeff Davis wrote:

> Attached are a few small patches:
>
>   0001: don't convert C to en-US-u-va-posix
>   0002: handle locale C the same regardless of the provider, as you
> suggest above
>   0003: make LOCALE (or --locale) apply to everything including ICU

Testing this briefly I noticed two regressions

1) all pg_collation.collversion are empty due to a trivial bug in 0002:

@ -1650,6 +1686,10 @@ get_collation_actual_version(char collprovider, const
char *collcollate)
 {
    char       *collversion = NULL;

+    if (pg_strcasecmp("C", collcollate) ||
+        pg_strcasecmp("POSIX", collcollate))
+        return NULL;
+

This should be pg_strcasecmp(...) == 0

2) The following works with HEAD (default provider=icu) but errors out with
the patches:

postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9 template
'template0';
ERROR:    could not convert locale name "fr_FR@euro" to language tag:
U_ILLEGAL_ARGUMENT_ERROR

fr_FR@euro is a libc locale name

$ locale -a|grep fr_FR
fr_FR
fr_FR@euro
fr_FR.iso88591
fr_FR.iso885915@euro
fr_FR.utf8

I understand that fr_FR@euro is taken as an ICU locale name, with the idea
that the locale
syntax being more or less compatible between both providers, this should work
smoothly.  0003 seems to go further in the interpretation and fail on it.
TBH the assumption that it's OK to feed libc locale names to ICU feels quite
uncomfortable.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

28 апреля 2023 г., 21:35:25

On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
> This should be pg_strcasecmp(...) == 0

Good catch, thank you! Fixed in updated patches.

> postgres=# create database lat9 locale 'fr_FR@euro' encoding LATIN9
> template
> 'template0';
> ERROR:  could not convert locale name "fr_FR@euro" to language tag:
> U_ILLEGAL_ARGUMENT_ERROR

ICU 63 and earlier convert it without error to the language tag 'fr-FR-
u-cu-eur', which is correct. ICU 64 removed support for transforming
some locale variants, because apparently they think those variants are
obsolete:

https://unicode-org.atlassian.net/browse/ICU-22268
https://unicode-org.atlassian.net/browse/ICU-20187

(Aside: how obsolete are those variants?)

It's frustrating that they'd remove such transformations from the
canonicalization process.

Fortunately, it looks like it's easy enough to do the transformation
ourselves. The only problematic format is '...@VARIANT'. The other
format 'fr_FR_EURO' doesn't seem to be a valid glibc locale name[1] and
windows seems to use BCP 47[2].

And there don't seem to be a lot of variants to handle. ICU 63 only
handles 3 variants, so that's what my patch does. Any unknown variant
between 5 and 8 characters won't throw an error. There could be more
problem cases, but I'm not sure how much of a practical problem they
are.

If we try to keep the meaning of LOCALE to only LC_COLLATE and
LC_CTYPE, that will continue to be confusing for anyone that uses
provider=icu.

Regards,
    Jeff Davis

[1]
https://www.gnu.org/software/libc/manual/html_node/Locale-Names.html
[2]
https://learn.microsoft.com/en-us/windows/win32/intl/locale-names

Вложения

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

04 мая 2023 г., 05:35:41

On Fri, 2023-04-28 at 14:35 -0700, Jeff Davis wrote:
> On Thu, 2023-04-27 at 14:23 +0200, Daniel Verite wrote:
> > This should be pg_strcasecmp(...) == 0
>
> Good catch, thank you! Fixed in updated patches.

Rebased patches.

=== 0001: do not convert C to en-US-u-va-posix

I plan to commit this soon. If someone specifies "C", they are probably
expecting memcmp()-like behavior, or some kind of error/warning that it
can't be provided.

Removing this transformation means that if you specify iculocale=C,
you'll get an error or warning (depending on icu_validation_level),
because C is not a recognized icu locale. Depending on how some of the
other issues in this thread are sorted out, we may want to relax the
validation.

=== 0002: fix @euro, etc. in ICU >= 64

I'd like to commit this soon too, but I'll wait for someone to take a
look. It makes it more reliable to map libc names to icu locale names
regardless of the ICU version.

It doesn't solve the problem for locales like "de__PHONEBOOK", but
those don't seem to be a libc format (I think just an old ICU format),
so I don't see a big reason to carry it forward. It might be another
reason to turn down the validation level to WARNING, though.

=== 0003: support C memcmp() behavior with ICU provider

The current patch 0003 has a problem, because in previous postgres
versions (going all the way back), we allowed "C" as a valid ICU
locale, that would actually be passed to ICU as a locale name. But ICU
didn't recognize it, and it would end up opening the root locale. So we
can't simply redefine "C" to mean "memcmp", because that would
potentially break indexes.

I see the following potential solutions:

   1. Represent the memcmp behavior with iculocale=NULL, or some other
catalog hack, so that we can distinguish between a locale "C" upgraded
from a previous version (which should pass "C" to ICU and get the root
locale), and a new collation defined with locale "C" (which should have
memcmp behavior). The catalog representation for locale information is
already complex, so I'm not excited about this option, but it will
work.

   2. When provider=icu and locale=C, magically transform that into
provider=libc to get memcmp-like behavior for new collations but
preserve the existing behavior for upgraded collations. Not especially
clean, but if we issue a NOTICE perhaps that would avoid confusion.

   3. Like #2, except create a new provider type "none" which may be
slightly less confusing.

=== 0004: make LOCALE apply to ICU for CREATE DATABASE

To understand this patch it helps to understand the confusing situation
with CREATE DATABASE in version 15:

The keywords LC_CTYPE and LC_COLLATE set the server environment
LC_CTYPE/LC_COLLATE for that database and can be specified regardless
of the provider. LOCALE can be specified along with (or instead of)
LC_CTYPE and LC_COLLATE, in which case whichever of LC_CTYPE or
LC_COLLATE is unspecified defaults to the setting of LOCALE. Iff the
provider is libc, LC_CTYPE and LC_COLLATE also act as the database
default collation's locale. If the provider is icu, then none of
LOCALE, LC_CTYPE, or LC_COLLATE affect the database default collation's
locale at all; that's controlled by ICU_LOCALE (which may be omitted if
the template's daticulocale is non-NULL).

The idea of patch 0004 is to address the last part, which is probably
the most confusing aspect. But for that to work smoothly, we need
something like 0003 so that LOCALE=C gives the same semantics
regardless of the provider.

Regards,
    Jeff Davis

Вложения

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

06 мая 2023 г., 00:25:18

On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > Most of the complaints seem to be complaints about v15 as well, and
> > while those complaints may be a reason to not make ICU the default,
> > they are also an argument that we should continue to learn and try
> > to
> > fix those issues because they exist in an already-released version.
> > Leaving it the default for now will help us fix those issues rather
> > than hide them.
> >
> > It's still early, so we have plenty of time to revert the initdb
> > default if we need to.
>
> That's fair enough, but I really think it's important that some
> energy
> get invested in providing adequate documentation for this stuff. Just
> patching the code is not enough.

Attached a significant documentation patch.

I tried to make it comprehensive without trying to be exhaustive, and I
separated the explanation of language tags from what collation settings
you can include in a language tag, so hopefully that's more clear.

I added quite a few examples spread throughout the various sections,
and I preserved the existing examples at the end. I also left all of
the external links at the bottom for those interested enough to go
beyond what's there.

I didn't add additional documentation for ICU rules. There are so many
options for collations that it's hard for me to think of realistic
examples to specify the rules directly, unless someone wants to invent
a new language. Perhaps useful if working with an interesting text file
format with special treatment for delimiters?

I asked the question about rules here:

https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at

and got some limited response about addressing sort complaints. That
sounds reasonable, but a lot of that can also be handled just by
specifying the right collation settings. Someone who understands the
use case better could add some more documentation.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

v1-0001-Doc-improvements-for-language-tags-and-custom-ICU.patch

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

08 мая 2023 г., 21:47:55

Jeff Davis <pgsql@j-davis.com> writes:
> === 0001: do not convert C to en-US-u-va-posix

> I plan to commit this soon.

Several buildfarm animals have failed since this went in.  The
only one showing enough info to diagnose is siskin [1]:

@@ -1043,16 +1043,15 @@
 ERROR:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
 CREATE COLLATION testx (provider = icu, locale = 'C'); -- fails
-ERROR:  could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); -- fails
 ERROR:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
 SET icu_validation_level = WARNING;
 CREATE COLLATION testx (provider = icu, locale = '@colStrength=primary;nonsense=yes'); DROP COLLATION testx;
 WARNING:  could not convert locale name "@colStrength=primary;nonsense=yes" to language tag: U_ILLEGAL_ARGUMENT_ERROR
+ERROR:  collation "testx" already exists
 CREATE COLLATION testx (provider = icu, locale = 'C'); DROP COLLATION testx;
-WARNING:  could not convert locale name "C" to language tag: U_ILLEGAL_ARGUMENT_ERROR
-WARNING:  ICU locale "C" has unknown language "c"
-HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.
+NOTICE:  using standard form "en-US-u-va-posix" for locale "C"
 CREATE COLLATION testx (provider = icu, locale = 'nonsense-nowhere'); DROP COLLATION testx;
 WARNING:  ICU locale "nonsense-nowhere" has unknown language "nonsense"
 HINT:  To disable ICU locale validation, set parameter icu_validation_level to DISABLED.

I suppose this is environment-dependent.  Sadly, the buildfarm
client does not show the prevailing LANG or LC_XXX settings.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=siskin&dt=2023-05-08%2020%3A09%3A26

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

08 мая 2023 г., 21:59:41

On Mon, 2023-05-08 at 17:47 -0400, Tom Lane wrote:
> -ERROR:  could not convert locale name "C" to language tag:
> U_ILLEGAL_ARGUMENT_ERROR
> +NOTICE:  using standard form "en-US-u-va-posix" for locale "C"

...

> I suppose this is environment-dependent.  Sadly, the buildfarm
> client does not show the prevailing LANG or LC_XXX settings.

Looks like it's failing-to-fail on some versions of ICU which
automatically perform that conversion.

The easiest thing to do is revert it for now, and after we sort out the
memcmp() path for the ICU provider, then I can commit it again (after
that point it would just be code cleanup and should have no functional
impact).

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Alvaro Herrera

Дата:

09 мая 2023 г., 08:25:53

On 2023-Apr-24, Peter Eisentraut wrote:

> The GUC settings lc_collate and lc_ctype are from a time when those locale
> settings were cluster-global.  When we made those locale settings
> per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can use
> ICU as the per-database locale provider, so what is being attempted in the
> above example is already meaningless before PG 16, since you need to look
> into pg_database to find out what is really happening.
> 
> I think we should just remove the GUC parameters lc_collate and lc_ctype.

I agree with removing these in v16, since they are going to become more
meaningless and confusing.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

09 мая 2023 г., 15:09:29

On Tue, 2023-05-09 at 10:25 +0200, Alvaro Herrera wrote:
> I agree with removing these in v16, since they are going to become
> more
> meaningless and confusing.

Agreed, but it would be nice to have an alternative that does the right
thing.

It's awkward for a user to read pg_database.datlocprovider, then
depending on that, either look in datcollate or daticulocale. (It's
awkward in the code, too.)

Maybe some built-in function that returns a tuple of the default
provider, the locale, and the version? Or should we also output the
ctype somehow (which affects the results of upper()/lower())?

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

11 мая 2023 г., 11:07:53

On 09.05.23 10:25, Alvaro Herrera wrote:
> On 2023-Apr-24, Peter Eisentraut wrote:
> 
>> The GUC settings lc_collate and lc_ctype are from a time when those locale
>> settings were cluster-global.  When we made those locale settings
>> per-database (PG 8.4), we kept them as read-only.  As of PG 15, you can use
>> ICU as the per-database locale provider, so what is being attempted in the
>> above example is already meaningless before PG 16, since you need to look
>> into pg_database to find out what is really happening.
>>
>> I think we should just remove the GUC parameters lc_collate and lc_ctype.
> 
> I agree with removing these in v16, since they are going to become more
> meaningless and confusing.

Here is my proposed patch for this.

Вложения

0001-Remove-read-only-server-settings-lc_collate-and-lc_c.patch

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

11 мая 2023 г., 11:09:26

On 09.05.23 17:09, Jeff Davis wrote:
> It's awkward for a user to read pg_database.datlocprovider, then
> depending on that, either look in datcollate or daticulocale. (It's
> awkward in the code, too.)
> 
> Maybe some built-in function that returns a tuple of the default
> provider, the locale, and the version? Or should we also output the
> ctype somehow (which affects the results of upper()/lower())?

There is also the deterministic flag and the icurules setting. 
Depending on what level of detail you imagine the user needs, you really 
do need to look at the whole picture, not some subset of it.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

11 мая 2023 г., 21:29:18

New patch series attached.

=== 0001: fix bug that allows creating hidden collations

Bug:
https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.camel@j-davis.com

=== 0002: handle some kinds of libc-stlye locale strings

ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
in later versions. Handle them in postgres for consistency.

=== 0003: reduce icu_validation_level to WARNING

Given that we've seen some inconsistency in which locale names are
accepted in different ICU versions, it seems best not to be too strict.
Peter Eisentraut suggested that it be set to ERROR originally, but a
WARNING should be sufficient to see problems without introducing risks
migrating to version 16.

I don't expect objections to 0003, so I may commit this soon, but I'll
give it a little time in case someone has an opinion.

=== 0004-0006:

To solve the issues that have come up in this thread, we need CREATE
DATABASE (and createdb and initdb) to use LOCALE to mean the collation
locale regardless of which provider is in use (which is what 0006
does).

0006 depends on ICU handling libc locale names. It already does a good
job for most libc locale names (though patch 0002 fixes a few cases
where it doesn't). There may be more cases, but for the most part libc
names are interpreted in a reasonable way. But one important case is
missing: ICU does not handle the "C" locale as we expect (that is,
using memcmp()).

We've already allowed users to create ICU collations with the C locale
in the past, which uses the root collation (not memcmp()), and we need
to keep supporting that for upgraded clusters. So that leaves us with a
catalog representation problem. I mentioned upthread that we can solve
that by:

  1. Using iculocale=NULL to mean "C-as-in-memcmp", or having some
other catalog hack (like another field). That's not desirable because
the catalog representation is already complex and it may be hard for
users to tell what's happening.

  2. When provider=icu and locale=C, switch to provider=libc locale=C.
This is very messy, because currently the syntax allows specifying a
database with LOCALE_PROVIDER='icu' ICU_LOCALE='C' LC_COLLATE='en_US' -
- if the provider gets changed to libc, what would we set datcollate
to? I don't think this is workable without some breakage. We can't
simply override datcollate to be C in that case, because there are some
things other than the default collation that might need it set to en_US
as the user specified.

  3. Introduce collation provider "none", which is always memcmp-based
(patch 0004). It's equivalent to the libc locale=C, but it allows
specifying the LC_COLLATE and LC_CTYPE independently. A command like
CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
(patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
the database default collation, that would always use memcmp(), but the
server environment LC_COLLATE would be set to en_US as the user
specified.

For this patch series, I chose approach #3. I think it works out nicely
-- it provides a better place to document the "no locale" behavior
(including a warning that it depends on the database encoding), and I
think it's more clear to the user that locale=C is not actually using a
provider at all. It's more invasive, but feels like a better solution.
If others don't like it I can implement approach #1 instead.

=== 0007: Add a GUC to control the default collation provider

Having a GUC would make it easier to migrate to ICU without surprises.
This only affects the default for CREATE COLLATION, not CREATE DATABASE
(and obviously not initdb).


--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

Re: Order changes in PG16 since ICU introduction

От

Alexander Lakhin

Дата:

13 мая 2023 г., 10:00:00

Hello Jeff,

09.05.2023 00:59, Jeff Davis wrote:
> The easiest thing to do is revert it for now, and after we sort out the
> memcmp() path for the ICU provider, then I can commit it again (after
> that point it would just be code cleanup and should have no functional
> impact).

On the current master (after 455f948b0, and before f7faa9976, of course)
I get an ASAN-detected failure with the following query:
CREATE COLLATION col (provider = icu, locale = '123456789012');

==2929883==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc491be09c at pc 0x556e8571a260 bp 0x7
ffc491be020 sp 0x7ffc491bd7c8
READ of size 15 at 0x7ffc491be09c thread T0
     #0 0x556e8571a25f in __interceptor_strcmp.part.0 (.../usr/local/pgsql/bin/postgres+0x2aa025f)
     #1 0x556e86d77ee6 in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2802
...
Address 0x7ffc491be09c is located in stack of thread T0 at offset 76 in frame
     #0 0x556e86d77cfe in icu_language_tag .../src/backend/utils/adt/pg_locale.c:2782

   This frame has 2 object(s):
     [48, 52) 'status' (line 2784)
     [64, 76) 'lang' (line 2785) <== Memory access at offset 76 overflows this variable
...

Here, uloc_getLanguage(loc_str, lang, ULOC_LANG_CAPACITY, &status) returns
status = -124, i.e.,
     U_STRING_NOT_TERMINATED_WARNING = -124,/**< An output string could not be NUL-terminated because output 
length==destCapacity. */
(ULOC_LANG_CAPACITY = 12)
this value is not covered by U_FAILURE(status), and strcmp(), that follows,
goes out of the lang variable bounds.

Best regards,
Alexander

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

15 мая 2023 г., 10:06:36

On 11.05.23 23:29, Jeff Davis wrote:
> New patch series attached.
> 
> === 0001: fix bug that allows creating hidden collations
> 
> Bug:
> https://www.postgresql.org/message-id/051c9395cf880307865ee8b17acdbf7f838c1e39.camel@j-davis.com

This is still being debated in the other thread.  Not really related to 
this thread, so I suggest dropping it from this patch series.

> === 0002: handle some kinds of libc-stlye locale strings
> 
> ICU used to handle libc locale strings like 'fr_FR@euro', but doesn't
> in later versions. Handle them in postgres for consistency.

I tend to agree with ICU that these variants are obsolete, and we don't 
need to support them anymore.  If this were a tiny patch, then maybe ok, 
but the way it's presented here the whole code is duplicated between 
pg_locale.c and initdb.c, which is not great.

> === 0003: reduce icu_validation_level to WARNING
> 
> Given that we've seen some inconsistency in which locale names are
> accepted in different ICU versions, it seems best not to be too strict.
> Peter Eisentraut suggested that it be set to ERROR originally, but a
> WARNING should be sufficient to see problems without introducing risks
> migrating to version 16.

I'm not sure why this is the conclusion.  Presumably, the detection 
capabilities of ICU improve over time, so we want to take advantage of 
that?  What are some example scenarios where this change would help?

> === 0004-0006:
> 
> To solve the issues that have come up in this thread, we need CREATE
> DATABASE (and createdb and initdb) to use LOCALE to mean the collation
> locale regardless of which provider is in use (which is what 0006
> does).
> 
> 0006 depends on ICU handling libc locale names. It already does a good
> job for most libc locale names (though patch 0002 fixes a few cases
> where it doesn't). There may be more cases, but for the most part libc
> names are interpreted in a reasonable way. But one important case is
> missing: ICU does not handle the "C" locale as we expect (that is,
> using memcmp()).
> 
> We've already allowed users to create ICU collations with the C locale
> in the past, which uses the root collation (not memcmp()), and we need
> to keep supporting that for upgraded clusters.

I'm not sure I agree that we need to keep supporting that.  The only way 
you could get that in past releases is if you specify explicitly, "give 
me provider ICU and locale C", and then it wouldn't actually even work 
correctly.  So nobody should be using that in practice, and nobody 
should have stumbled into that combination of settings by accident.

>    3. Introduce collation provider "none", which is always memcmp-based
> (patch 0004). It's equivalent to the libc locale=C, but it allows
> specifying the LC_COLLATE and LC_CTYPE independently. A command like
> CREATE DATABASE ... LOCALE_PROVIDER='icu' ICU_LOCALE='C'
> LC_COLLATE='en_US' would get changed (with a NOTICE) to provider "none"
> (patch 0005), so you'd have datlocprovider=none, datcollate=en_US. For
> the database default collation, that would always use memcmp(), but the
> server environment LC_COLLATE would be set to en_US as the user
> specified.

This seems most attractive, but I think it's quite invasive at this 
point, especially given the dubious premise (see above).

> === 0007: Add a GUC to control the default collation provider
> 
> Having a GUC would make it easier to migrate to ICU without surprises.
> This only affects the default for CREATE COLLATION, not CREATE DATABASE
> (and obviously not initdb).

It's not clear to me why we would want that.  Also not clear why it 
should only affect CREATE COLLATION.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

15 мая 2023 г., 21:03:23

On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:
> On the current master (after 455f948b0, and before f7faa9976, of
> course)
> I get an ASAN-detected failure with the following query:
> CREATE COLLATION col (provider = icu, locale = '123456789012');
>

Thank you for the report!

ICU source specifically says:

  /**
   * Useful constant for the maximum size of the language
     part of a locale ID.
   * (including the terminating NULL).
   * @stable ICU 2.0
   */
  #define ULOC_LANG_CAPACITY 12

So the fact that it returning success without nul-terminating the
result is an ICU bug in my opinion, and I reported it here:

https://unicode-org.atlassian.net/browse/ICU-22394

Unfortunately that means we need to be a bit more paranoid and always
check for that warning, even if we have a preallocated buffer of the
"correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
scary), unless we check for those errors each time and report specific
errors for them.

Another approach is to always check the length and loop using dynamic
allocation so that we never overflow (and forget about any static
buffers). That seems like overkill given that the problem case is
obviously invalid input; I think as long as we catch it and throw an
ERROR it's fine. But I can do this if others think it's worthwhile.

Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
in a few places and turns it into an ERROR. It also cleans up the loop
for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
rather than (U_SUCCESS(status) && len >= buflen).

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

0001-ICU-check-for-U_STRING_NOT_TERMINATED_WARNING.patch

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

15 мая 2023 г., 21:16:31

On Mon, 2023-05-08 at 14:59 -0700, Jeff Davis wrote:
> The easiest thing to do is revert it for now, and after we sort out
> the
> memcmp() path for the ICU provider, then I can commit it again (after
> that point it would just be code cleanup and should have no
> functional
> impact).

The conversion won't be entirely dead code even after we handle the "C"
locale with memcmp(): for a locale like "C.UTF-8", it will still be
passed to the collation provider (same as with libc), and in that case,
we should still convert that to a language tag consistently across ICU
versions.

For it to be entirely dead code, we would need to convert any locale
with language "C" (e.g. "C.UTF-8") to use the memcmp() path. I'm fine
with that, but that's not what the libc provider does today, and
perhaps we should be consistent between the two. If we do leave the
code in place, we can document that specific "en-US-u-va-posix" locale
so that it's not too surprising for users.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Alexander Lakhin

Дата:

16 мая 2023 г., 16:00:00

Hi Jeff,

16.05.2023 00:03, Jeff Davis wrote:
> On Sat, 2023-05-13 at 13:00 +0300, Alexander Lakhin wrote:
>> On the current master (after 455f948b0, and before f7faa9976, of
>> course)
>> I get an ASAN-detected failure with the following query:
>> CREATE COLLATION col (provider = icu, locale = '123456789012');
>>
> Thank you for the report!
>
> ICU source specifically says:
>
>    /**
>     * Useful constant for the maximum size of the language
>       part of a locale ID.
>     * (including the terminating NULL).
>     * @stable ICU 2.0
>     */
>    #define ULOC_LANG_CAPACITY 12
>
> So the fact that it returning success without nul-terminating the
> result is an ICU bug in my opinion, and I reported it here:
>
> https://unicode-org.atlassian.net/browse/ICU-22394
>
> Unfortunately that means we need to be a bit more paranoid and always
> check for that warning, even if we have a preallocated buffer of the
> "correct" size. It also means that both U_STRING_NOT_TERMINATED_WARNING
> and U_BUFFER_OVERFLOW_ERROR will be user-facing errors (potentially
> scary), unless we check for those errors each time and report specific
> errors for them.
>
> Another approach is to always check the length and loop using dynamic
> allocation so that we never overflow (and forget about any static
> buffers). That seems like overkill given that the problem case is
> obviously invalid input; I think as long as we catch it and throw an
> ERROR it's fine. But I can do this if others think it's worthwhile.
>
> Patch attached. It just checks for the U_STRING_NOT_TERMINATED_WARNING
> in a few places and turns it into an ERROR. It also cleans up the loop
> for uloc_getLanguageTag() to check for U_STRING_NOT_TERMINATED_WARNING
> rather than (U_SUCCESS(status) && len >= buflen).

I'm not sure about the proposed change in icu_from_uchar(). It seems that
len_result + 1 bytes should always be enough for the result string terminated
with NUL. If that's not true (we want to protect from some ICU bug here),
then the change should be backpatched?

Best regards,
Alexander

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

16 мая 2023 г., 16:47:51

On Tue, 2023-05-16 at 19:00 +0300, Alexander Lakhin wrote:
> I'm not sure about the proposed change in icu_from_uchar(). It seems
> that
> len_result + 1 bytes should always be enough for the result string
> terminated
> with NUL. If that's not true (we want to protect from some ICU bug
> here),
> then the change should be backpatched?

I believe it's enough and I'm not aware of any bug there so no backport
is required.

I added checks in places that were (a) checking for U_FAILURE; and (b)
expecting the result to be NUL-terminated. That's mostly callers of
uloc_getLanguage(), where I was not quite paranoid enough.

There were a couple other places though, and I went ahead and added
checks there out of paranoia, too. One was ucnv_fromUChars(), and the
other was uloc_canonicalize().

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

"Jonathan S. Katz"

Дата:

16 мая 2023 г., 19:35:28

On 5/5/23 8:25 PM, Jeff Davis wrote:
> On Fri, 2023-04-21 at 20:12 -0400, Robert Haas wrote:
>> On Fri, Apr 21, 2023 at 5:56 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>> Most of the complaints seem to be complaints about v15 as well, and
>>> while those complaints may be a reason to not make ICU the default,
>>> they are also an argument that we should continue to learn and try
>>> to
>>> fix those issues because they exist in an already-released version.
>>> Leaving it the default for now will help us fix those issues rather
>>> than hide them.
>>>
>>> It's still early, so we have plenty of time to revert the initdb
>>> default if we need to.
>>
>> That's fair enough, but I really think it's important that some
>> energy
>> get invested in providing adequate documentation for this stuff. Just
>> patching the code is not enough.
> 
> Attached a significant documentation patch.


> I tried to make it comprehensive without trying to be exhaustive, and I
> separated the explanation of language tags from what collation settings
> you can include in a language tag, so hopefully that's more clear.
> 
> I added quite a few examples spread throughout the various sections,
> and I preserved the existing examples at the end. I also left all of
> the external links at the bottom for those interested enough to go
> beyond what's there.

[Personal hat, not RMT]

Thanks -- this is super helpful. A bunch of these examples I had 
previously had to figure out by randomly searching blog posts / 
trial-and-error, so I think this will help developers get started more 
quickly.

Comments (and a lot are just little nits to tighten the language)

Commit message -- typo: "documentaiton"


+     If you see such a message, ensure that the 
<symbol>PROVIDER</symbol> and
+     <symbol>LOCALE</symbol> are as you expect, and consider specifying
+     directly as the canonical language tag instead of relying on the
+     transformation.
+    </para>

I'd recommend make this more prescriptive:

"If you see this notice, ensure that the <symbol>PROVIDER</symbol> and 
<symbol>LOCALE</symbol> are the expected result. For consistent results 
when using the ICU provider, specify the canonical <link 
linkend="icu-language-tag">language tag</link> instead of relying on the 
transformation."

+     If there is some problem interpreting the locale name, or if it 
represents
+     a language or region that ICU does not recognize, a message will 
be reported:

This is passive voice, consider:

"If there is a problem interpreting the locale name, or if the locale 
name represents a language or region that ICU does not recognize, you'll 
see the following error:"


+   <sect3 id="icu-language-tag">
+    <title>Language Tag</title>
+    <para>

Before jumping in, I'd recommend a quick definition of what a language 
tag is, e.g.:

"A language tag, defined in BCP 47, is a standardized identifier used to 
identify languages in computer systems" or something similar.

(I did find a database that made it simpler to search for these, which 
is one issue I've previously add, but I don't think we'd want to link to i)

+     To include this additional collation information in a language tag,
+     append <literal>-u</literal>, followed by one or more

My first question was "what's special about '-u'", so maybe we say:

"To include this additional collation information in a language tag, 
append <literal>-u</literal>, which indicates there are additional 
collation settings, followed by one or more..."

+     ICU locales are specified as a <link 
linkend="icu-language-tag">Language
+     Tag</link>, but can also accept most libc-style locale names 
(which will
+     be transformed into language tags if possible).
+    </para>

I'd recommend removing the parantheticals:

ICU locales are specified as a BCP 47 <link 
linkend="icu-language-tag">Language
  Tag</link>, but can also accept most libc-style locale names. If 
possible, libc-style locale names are transformed into language tags.

+      <title>ICU Collation Levels</title>

Nothing to add here other than to say I'm extremely appreciative of this 
section. Once upon a time I sunk a lot of time trying to figure out how 
all of these levels worked.

+          Sensitivity when determining equality, with
+          <literal>level1</literal> the least sensitive and
+          <literal>identic</literal> the most sensitive. See <xref
+          linkend="icu-collation-levels"/> for details.

This discusses equality sensitivity, but I'm not sure if I understand 
that term here. The ICU docs seem to call these "strengths"[1], maybe we 
use that term to be consistent with upstream?

+          If set to <literal>upper</literal>, upper case sorts before lower
+          case. If set to <literal>lower</literal>, lower case sorts before
+          upper case. If set to <literal>false</literal>, it depends on the
+          locale.

Suggestion to tighten this up:

"If set to <literal>false</literal>, the sort depends on the rules of 
the locale."

+      Defaults may depend on locale. The above table is not meant to be
+      complete. See <xref linkend="icu-external-references"/> for additinal
+      options and details.

Typo: additinal => "additional"

> I didn't add additional documentation for ICU rules. There are so many
> options for collations that it's hard for me to think of realistic
> examples to specify the rules directly, unless someone wants to invent
> a new language. Perhaps useful if working with an interesting text file
> format with special treatment for delimiters?
> 
> I asked the question about rules here:
> 
> https://www.postgresql.org/message-id/e861ac4fdae9f9f5ce2a938a37bcb5e083f0f489.camel%40cybertec.at
> 
> and got some limited response about addressing sort complaints. That
> sounds reasonable, but a lot of that can also be handled just by
> specifying the right collation settings. Someone who understands the
> use case better could add some more documentation.

I'm not too sure about this one -- from my experience, users want 
predictability in sorts, but there are a variety of ways to get that 
experience.

Thanks,

Jonathan

[1] https://unicode-org.github.io/icu/userguide/collation/concepts.html

Вложения

OpenPGP_signature

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

17 мая 2023 г., 03:23:16

On Tue, 2023-05-16 at 15:35 -0400, Jonathan S. Katz wrote:
> +          Sensitivity when determining equality, with
> +          <literal>level1</literal> the least sensitive and
> +          <literal>identic</literal> the most sensitive. See <xref
> +          linkend="icu-collation-levels"/> for details.
>
> This discusses equality sensitivity, but I'm not sure if I understand
> that term here. The ICU docs seem to call these "strengths"[1], maybe
> we
> use that term to be consistent with upstream?

"Sensitivity" comes from "case sensitivity" which is more clear to me
than "strength". I added the term "strength" to correspond to the
unicode terminology, but I kept sensitivity and I tried to make it
slightly more clear.

Other than that, and I took your suggestions almost verbatim. Patch
attached. Thank you!

I also made a few other changes:

  * added paragraph transformation of '' or 'root' to the 'und'
language (root collation)
  * added paragraph that the "identic" level still performs some basic
normalization
  * added example for when full normalization matters

I should also say that I don't really understand the case when "kc" is
set to true and "ks" is level 2 or higher. If someone has an example of
where that matters, let me know.

Regards,
    Jeff Davis

Вложения

v2-0001-Doc-improvements-for-language-tags-and-custom-ICU.patch

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

17 мая 2023 г., 22:59:06

On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
> Other than that, and I took your suggestions almost verbatim. Patch
> attached. Thank you!

Attached new patch with a typo fix and a few other edits. I plan to
commit soon.

Regards,
    Jeff Davis

Вложения

0001-Doc-improvements-for-language-tags-and-custom-ICU-co.patch

Re: Order changes in PG16 since ICU introduction

От

"Jonathan S. Katz"

Дата:

17 мая 2023 г., 23:59:48

On 5/17/23 6:59 PM, Jeff Davis wrote:
> On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
>> Other than that, and I took your suggestions almost verbatim. Patch
>> attached. Thank you!
> 
> Attached new patch with a typo fix and a few other edits. I plan to
> commit soon.

I did a quicker read through this time. LGTM overall. I like what you 
did with the explanations around sensitivity (now it makes sense).

Thanks,

Jonathan

Вложения

OpenPGP_signature

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

18 мая 2023 г., 17:55:49

On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
> I did a quicker read through this time. LGTM overall. I like what you
> did with the explanations around sensitivity (now it makes sense).

Committed, thank you.

There are a few things I don't understand that would be good to
document better:

* Rules. I still don't quite understand the use case: are these for
people inventing new languages? What is a plausible use case that isn't
covered by the existing locales and collation settings? Do rules make
sense for a database default collation? Are they for language experts
only or might an ordinary developer benefit from using them?

* The collation types "phonebk", "emoji", etc.: are these variants of
particular locales, or do they make sense in multiple locales? I don't
know where they fit in or how to document them.

* I don't understand what "kc" means if "ks" is not set to "level1".

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

"Jonathan S. Katz"

Дата:

18 мая 2023 г., 17:58:48

On 5/18/23 1:55 PM, Jeff Davis wrote:
> On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
>> I did a quicker read through this time. LGTM overall. I like what you
>> did with the explanations around sensitivity (now it makes sense).
> 
> Committed, thank you.

\o/

> There are a few things I don't understand that would be good to
> document better:
> 
> * Rules. I still don't quite understand the use case: are these for
> people inventing new languages? What is a plausible use case that isn't
> covered by the existing locales and collation settings? Do rules make
> sense for a database default collation? Are they for language experts
> only or might an ordinary developer benefit from using them?

 From my read of them, as an app developer I'd be very unlikely to use 
this. Maybe there is something with building out some collation rules 
vis-a-vis an extension, but I have trouble imagining the use-case. I may 
also not be the target audience for this feature.

> * The collation types "phonebk", "emoji", etc.: are these variants of
> particular locales, or do they make sense in multiple locales? I don't
> know where they fit in or how to document them.

I remember I had a exploratory use case for "phonebk" but I couldn't 
figure out how to get it to work. AIUI from random searching, the idea 
is that it provides the "phonebook" rules for ordering "names" in a 
particular locale, but I couldn't get it to work.

> * I don't understand what "kc" means if "ks" is not set to "level1".

Me neither, but I haven't stared at this as hard as others.

Thanks,

Jonathan

Вложения

OpenPGP_signature

Re: Order changes in PG16 since ICU introduction

От

Matthias van de Meent

Дата:

18 мая 2023 г., 18:11:36

On Fri, 21 Apr 2023 at 22:46, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Fri, 2023-04-21 at 19:00 +0100, Andrew Gierth wrote:
> > > > > >
> > Also, somewhere along the line someone broke initdb --no-locale,
> > which
> > should result in C locale being the default everywhere, but when I
> > just
> > tested it it picked 'en' for an ICU locale, which is not the right
> > thing.
>
> Fixed, thank you.

As I complain about in [0], since 5cd1a5af --no-locale has been broken
/ bahiving outside it's description: Instead of being equivalent to
`--locale=C` it now also overrides `--locale-provider=libc`, resulting
in undocumented behaviour.

Kind regards,

Matthias van de Meent
Neon, Inc.

[0] https://www.postgresql.org/message-id/CAEze2WiZFQyyb-DcKwayUmE4rY42Bo6kuK9nBjvqRHYxUYJ-DA%40mail.gmail.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

18 мая 2023 г., 18:26:14

On Thu, 2023-05-18 at 13:58 -0400, Jonathan S. Katz wrote:
>  From my read of them, as an app developer I'd be very unlikely to
> use
> this. Maybe there is something with building out some collation rules
> vis-a-vis an extension, but I have trouble imagining the use-case. I
> may
> also not be the target audience for this feature.

That's a problem for the ICU rules feature. I understand some features
may be for domain experts only, but we at least need to call that out
so that ordinary developers don't get confused. And we should hear from
some of those domain experts that they actually want it and it solves a
real problem.

For the features that can be described with collation
settings/attributes right in the locale name, the use cases are more
plausible and we've supported them since v10, so it's good to document
them as best we can. It's hard to expose only the particular ICU
collation settings we understand best (e.g. the "ks" setting that
allows case insensitive collation), so it's inevitable that there will
be some settings that are more obscure and harder to document.

But in the case of ICU rules, they are newly-supported in 16, so there
should be a clear reason we're adding them. Otherwise we're just
setting up users for confusion or problems, and creating backwards-
compatibility headaches for ourselves (and the last thing we want is to
fret over backwards compatibility for a feature with no users).

Beyond that, there seems to be some danger: if the syntax for rules is
not perfectly compatible between ICU versions, the user might run into
big problems.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

18 мая 2023 г., 22:46:52

On Thu, 2023-05-18 at 20:11 +0200, Matthias van de Meent wrote:
> As I complain about in [0], since 5cd1a5af --no-locale has been
> broken
> / bahiving outside it's description: Instead of being equivalent to
> `--locale=C` it now also overrides `--locale-provider=libc`,
> resulting
> in undocumented behaviour.

I agree that 5cd1a5af is incomplete.

Posting updated patches. Feedback on the approaches below would be
appreciated.

For context, in version 15:

  $ initdb -D data --locale-provider=icu --icu-locale=en
  => create database clocale template template0 locale='C';
  => select datname, datlocprovider, daticulocale
     from pg_database where datname='clocale';
   datname | datlocprovider | daticulocale
  ---------+----------------+--------------
   clocale | i              | en
  (1 row)

That behavior is confusing, and when I made ICU the default provider in
v16, the confusion was extended into more cases.

If we leave the CREATE DATABASE (and createdb and initdb) syntax in
place, such that LOCALE (and --locale) do not apply to ICU at all, then
I don't see a path to a good ICU user experience.

Therefore I conclude that we need LOCALE (and --locale) to apply to ICU
somehow. (The LOCALE option already applies to ICU during CREATE
COLLATION, just not CREATE DATABASE or initdb.)

Patch 0003 does this. It's fairly straightforward and I believe we need
this patch.

But to actually fix your complaint we also need --no-locale to be
equivalent to --locale=C and for those options to both use memcmp()
semantics. There are several approaches to accomplish this, and I think
this is the part where I most need some feedback. There are only so
many approaches, and each one has some potential downsides, but I
believe we need to select one:

(1) Give up and leave the existing CREATE DATABASE (and createdb, and
initdb) semantics in place, along with the confusing behavior in v15.

This is a last resort, in my opinion. It gives us no path toward a good
user experience with ICU, and leaves us with all of the problems of the
OS as a collation provider.

(2) Automatically change the provider to libc when locale=C.

Almost works, but it's not clear how we handle the case "provider=icu
lc_collate='fr_FR.utf8' locale=C".

If we change it to "provider=libc lc_collate=C", we've overridden the
specified lc_collate. If we ignore the locale=C, that would be
surprising to users. If we throw an error, that would be a backwards
compatibility issue.

One possible solution would be to change the catalog representation to
allow setting the default collation locale separately from datcollate
even for the libc provider. For instance, rename daticulocale to
datdeflocale, and store the default collation locale there for both
libc and ICU. Then, "provider=icu lc_collate='fr_FR.utf8' locale=C"
could be changed into "provider=libc lc_collate='fr_FR.utf8'
deflocale=C". It may be confusing that datcollate is a different
concept from datdeflocale; but then again they are different concepts
and it's confusing that they are currently combined into one.

(3) Support iculocale=C in the ICU provider using the memcmp() path.

In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
lc_ctpye_is_c() would both return true.

There's a potential problem for users who've misused ICU in the past
(15 or earlier) by using provider=icu and iculocale=C. ICU would accept
such a locale name, but not recognize it and fall back to the root
locale, so it never worked as the user intended it. But if we redefine
C to be memcmp(), then such users will have broken indexes if they
upgrade.

We could add a check at pg_upgrade time for iculocale=C in versions 15
and earlier, and cause the check (and therefore the upgrade) to fail.
That may be reasonable considering that it never really worked in the
past, and perhaps very few users actually ever created such a
collation. But if some user runs into that problem, we'd have to resort
to a hack like telling them to "update pg_collation set iculocale='und'
where iculocale='C'" and then try the upgrade again, which is not a
great answer (as far as I can tell it would be a correct answer and
should not break their indexes, but it feels pretty dangerous).

There may be some other resolutions to this problem, such as catalog
hacks that allow for different representations of iculocale=C pre-16
and post-16. That doesn't sound great though, and we'd have to figure
out what to do with pg_dump.

(4) Create a new "none" provider (which has no locale and always memcmp
semantics), and automatically change the provider to "none" if
provider=icu and iculocale=C.

This solves the problem case in #2 and the potential upgrade problem in
#3. It also makes the documentation a bit more natural, in my opinion,
even if we retain the special case for provider=libc collate=C.

#4 is the approach I chose (patches 0001 and 0002), but I'd like to
hear what others think.

For historical reasons, users may assume that LC_COLLATE controls the
default collation order because that's true in libc. And if their
provider is ICU, they may be surprised that it doesn't. I believe we
could extend each of the above approaches to use LC_COLLATE as the
default for ICU_LOCALE if the former is specified and the latter is
not, and that may make things smoother.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

19 мая 2023 г., 19:13:47

    Jeff Davis wrote:

> 2) Automatically change the provider to libc when locale=C.
>
> Almost works, but it's not clear how we handle the case "provider=icu
> lc_collate='fr_FR.utf8' locale=C".
>
> If we change it to "provider=libc lc_collate=C", we've overridden the
> specified lc_collate. If we ignore the locale=C, that would be
> surprising to users. If we throw an error, that would be a backwards
> compatibility issue.

This thread started with a report illustrating that when users mention
the locale "C", they implicitly mean "C" from the libc provider, as
when libc was the default. The problem is that as soon as ICU is the
default, any reference to a libc collation should mention explicitly
that the provider is libc.

It seems what we're set on the idea to create an exception for "C"
(and I assume also "POSIX") to avoid too much confusion, and because
"C" is quite special anyway, and has no equivalent in ICU (the switch
in v16 to ICU as the default provider is based on the premise that the
locales with the same name will behave pretty much the same with ICU
as they did with libc, but it's absolutely not the case with "C").

ISTM that if we want to go that route, we need the make the minimum
changes at the user interface level and not any deeper, so that when
(locale="C" OR locale="POSIX") AND the provider has not been specified,
then the command (initdb and create database) act as if the user had
specified provider=libc.

> (3) Support iculocale=C in the ICU provider using the memcmp() path.

> In other words, if provider=icu and iculocale=C, lc_collate_is_c() and
> lc_ctpye_is_c() would both return true.

ICU does not provide a locale that behaves like that, and it doesn't
feel right to pretend it does. It feels like attacking the problem
at the wrong level.

> (4) Create a new "none" provider (which has no locale and always memcmp
> semantics), and automatically change the provider to "none" if
> provider=icu and iculocale=C.

It still uses libc/C for character classification and case changing,
so "no locale" is technically not true. Personally I don't see
the benefit of adding a "none" provider. C is a libc locale
and libc is not disappearing. I also think  that when users explicitly
indicate provider=icu, they should get icu.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

19 мая 2023 г., 22:57:17

Jeff Davis <pgsql@j-davis.com> writes:
> Committed, thank you.

This commit has given the PDF docs build some indigestion:

Making portrait pages on A4 paper (210mmx297mm)
/home/postgres/bin/fop -fo postgres-A4.fo -pdf postgres-A4.pdf
[WARN] FOUserAgent - Font "Symbol,normal,700" not found. Substituting with "Symbol,normal,400".
[WARN] FOUserAgent - Font "ZapfDingbats,normal,700" not found. Substituting with "ZapfDingbats,normal,400".
[WARN] FOUserAgent - Hyphenation pattern not found. URI: en.
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by
3531millipoints. (See position 55117:2388) 
[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by
1871millipoints. (See position 55117:12998) 
[WARN] FOUserAgent - Glyph "?" (0x323, dotbelowcmb) not available in font "Courier".
[WARN] FOUserAgent - Glyph "?" (0x302, circumflexcmb) not available in font "Courier".
[WARN] FOUserAgent - The contents of fo:block line 12 exceed the available area in the inline-progression direction by
20182millipoints. (See position 55172:188) 
[WARN] FOUserAgent - The contents of fo:block line 10 exceed the available area in the inline-progression direction by
17682millipoints. (See position 55172:188) 
[WARN] FOUserAgent - Glyph "?" (0x142, lslash) not available in font "Times-Roman".
[WARN] PropertyMaker - span="inherit" on fo:block, but no explicit value found on the parent FO.

(The first three and last one warnings are things we've been living
with, but the ones between are new.)

The first two "exceed the available area" complaints are in the "ICU
Collation Levels" table.  We can silence them by providing some column
width hints to make the "Description" column a tad wider than the rest,
as in the proposed patch attached.  The other two, as well as the first
two glyph-not-available complaints, are caused by this bit:

           Full normalization is important in some cases, such as when
           multiple accents are applied to a single character. For instance,
           <literal>'ệ'</literal> can be composed of code points
           <literal>U&'\0065\0323\0302'</literal> or
           <literal>U&'\0065\0302\0323'</literal>. With full normalization
           on, these code point sequences are treated as equal; otherwise they
           are unequal.

which renders just abysmally (see attached screen shot), and even in HTML
where it's rendering about as intended, it really is an unintelligible
example.  Few people are going to understand that the circumflex and the
dot-below are separately applied accents.  How about we drop the explicit
example and write something like

           Full normalization allows code point sequences such as
           characters with multiple accent marks applied in different
           orders to be seen as equal.

?

(The last missing-glyph complaint is evidently from the release notes;
I'll bug Bruce about that separately.)

            regards, tom lane

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 9db14649aa..96a23bf530 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1140,6 +1140,14 @@ SELECT 'w;x*y-z' = 'wxyz' COLLATE num_ignore_punct; -- true
      <table id="icu-collation-levels">
       <title>ICU Collation Levels</title>
       <tgroup cols="8">
+       <colspec colname="col1" colwidth="1*"/>
+       <colspec colname="col2" colwidth="1.25*"/>
+       <colspec colname="col3" colwidth="1*"/>
+       <colspec colname="col4" colwidth="1*"/>
+       <colspec colname="col5" colwidth="1*"/>
+       <colspec colname="col6" colwidth="1*"/>
+       <colspec colname="col7" colwidth="1*"/>
+       <colspec colname="col8" colwidth="1*"/>
        <thead>
         <row>
          <entry>Level</entry>

Вложения

ICU-documentation-excerpt.png

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

21 мая 2023 г., 01:42:55

On Fri, 2023-05-19 at 21:13 +0200, Daniel Verite wrote:
> ISTM that if we want to go that route, we need the make the minimum
> changes at the user interface level and not any deeper, so that when
> (locale="C" OR locale="POSIX") AND the provider has not been
> specified,
> then the command (initdb and create database) act as if the user had
> specified provider=libc.

If we special case locale=C, but do nothing for locale=fr_FR, then I'm
not sure we've solved the problem. Andrew Gierth raised the issue here,
which he called "maximally confusing":

https://postgr.es/m/874jp9f5jo.fsf@news-spur.riddles.org.uk

That's why I feel that we need to make locale apply to whatever the
provider is, not just when it happens to be C.

> > (3) Support iculocale=C in the ICU provider using the memcmp()
> > path.
>
> > In other words, if provider=icu and iculocale=C, lc_collate_is_c()
> > and
> > lc_ctpye_is_c() would both return true.
>
> ICU does not provide a locale that behaves like that, and it doesn't
> feel right to pretend it does. It feels like attacking the problem
> at the wrong level.

I agree that #3 feels slightly wrong, but I think it's still a viable
option until we have consensus on something better.

> > (4) Create a new "none" provider (which has no locale and always
> > memcmp
> > semantics), and automatically change the provider to "none" if
> > provider=icu and iculocale=C.
>
> It still uses libc/C for character classification and case changing,
> so "no locale" is technically not true.

The provider affects callers that have a pg_locale_t, such as the SQL-
callable lower() function. For those callers, the "none" provider uses
pg_ascii_tolower(), etc., not libc. That's why I called it "none" --
it's using simple internal postgres implementations instead of a
provider.

For callers that don't have a pg_locale_t, they may call libc functions
directly and rely on the server environment. But in those cases,
there's no way to set a provider at all, it's just relying on the
server environment. There aren't many of these cases, and hopefully we
can eliminate the reliance on the server environment over time.

If I'm missing something, let me know what cases you have in mind.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

22 мая 2023 г., 12:27:10

On 18.05.23 19:55, Jeff Davis wrote:
> On Wed, 2023-05-17 at 19:59 -0400, Jonathan S. Katz wrote:
>> I did a quicker read through this time. LGTM overall. I like what you
>> did with the explanations around sensitivity (now it makes sense).
> 
> Committed, thank you.
> 
> There are a few things I don't understand that would be good to
> document better:
> 
> * Rules. I still don't quite understand the use case: are these for
> people inventing new languages? What is a plausible use case that isn't
> covered by the existing locales and collation settings? Do rules make
> sense for a database default collation? Are they for language experts
> only or might an ordinary developer benefit from using them?

The rules are for setting whatever sort order you like.  Maybe you want 
to sort + before - or whatever.  It's like, if you don't like it, build 
your own.

> * The collation types "phonebk", "emoji", etc.: are these variants of
> particular locales, or do they make sense in multiple locales? I don't
> know where they fit in or how to document them.

The k* settings are parametric settings, in that they transform the sort 
key in some algorithmic way.  The co settings are just everything else. 
They are not parametric, they are just some other sort order that 
someone spelled out explicitly.

> * I don't understand what "kc" means if "ks" is not set to "level1".

There is an example here: 
https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

22 мая 2023 г., 12:34:31

On 18.05.23 00:59, Jeff Davis wrote:
> On Tue, 2023-05-16 at 20:23 -0700, Jeff Davis wrote:
>> Other than that, and I took your suggestions almost verbatim. Patch
>> attached. Thank you!
> 
> Attached new patch with a typo fix and a few other edits. I plan to
> commit soon.

Some small follow-up on this patch:

Please put blank lines between

</sect3>
<sect3 ...>

etc., matching existing style.

We usually don't capitalize the collation parameters like

CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);

elsewhere in the documentation.

Table 24.2. ICU Collation Settings should probably be sorted by key, or 
at least by something.

All tables should referenced in the text, like "Table x.y shows this and 
that."  (Note that a table could float to a different page in some 
output formats, so just putting it into a section without some 
introductory text isn't sound.)

Table 24.1. ICU Collation Levels shows punctuation as level 4, which is 
only true in shifted mode, which isn't the default.  The whole business 
of treating variable collation elements is getting a bit lost in this 
description.  The kv option is described as "Classes of characters 
ignored during comparison at level 3.", which is effectively true but 
not the whole picture.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

22 мая 2023 г., 17:08:17

On Thu, 2023-05-11 at 13:09 +0200, Peter Eisentraut wrote:
> There is also the deterministic flag and the icurules setting.
> Depending on what level of detail you imagine the user needs, you
> really
> do need to look at the whole picture, not some subset of it.

(Nit: all database default collations are deterministic.)

I agree, but I think there should be a way to see the whole picture in
one command. If nothing else, for repro cases sent to the list, it
would be nice to have a single line like:

   SELECT show_default_collation_whole_picture();

Right now it involves some back and forth, like checking
datlocprovider, then looking in the right fields and ignoring the wrong
ones.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

22 мая 2023 г., 17:35:25

On Thu, 2023-05-11 at 13:07 +0200, Peter Eisentraut wrote:
> Here is my proposed patch for this.

The commit message makes it sound like lc_collate/ctype are completely
obsolete, and I don't think that's quite right: they still represent
the server environment, which does still matter in some cases.

I'd just say that they are too confusing (likely to be misused), and
becoming obsolete (or less relevant), or something along those lines.

Otherwise, this is fine with me. I didn't do a detailed review because
it's just mechanical.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

22 мая 2023 г., 19:03:36

On Mon, 2023-05-22 at 14:27 +0200, Peter Eisentraut wrote:
> The rules are for setting whatever sort order you like.  Maybe you
> want
> to sort + before - or whatever.  It's like, if you don't like it,
> build
> your own.

A build-your-own feature is fine, but it's not completely zero cost.

There some risk that rules specified for ICU version X fail to load for
ICU version Y. If that happens to your database default collation, you
are in big trouble. The risk of failing to load a language tag in a
later version, especially one returned by uloc_toLanguageTag() in
strict mode, is much lower. We can reduce the risk by allowing rules
only for CREATE COLLATION (not  CREATE DATABASE), and see what users do
with it first, and consider adding it to CREATE DATABASE later.

We can also try to explain in the docs that it's a build-it-yourself
kind of feature (use it if you see a purpose, otherwise ignore it),
though I'm not sure quite how we should word it.

And I'm skeptical that we don't have a single plausible end-to-end user
story. I just can't think of any reason someone would need something
like this, given how flexible the collation settings in the language
tags are. The best case I can think of is if someone is trying to make
an ICU collation that matches some non-ICU collation in another system,
which sounds hard; but perhaps it's reasonable to do in cases where it
just needs to work well-enough in some limited case.

Also, do we have an answer as to why specifying the rules as '' is not
the same as not specifying any rules[1]?

[1]
https://www.postgresql.org/message-id/36a6e89689716c2ca1fae8adc8e84601a041121c.camel@j-davis.com

> The co settings are just everything else.
> They are not parametric, they are just some other sort order that
> someone spelled out explicitly.

This sounds like another case where we can't really tell the user why
they would want to use a specific "co" setting; they should only use it
if they already know they want it. Is there some way we can word that
in the documentation so that people don't misuse them?

For instance, one of them is called "emoji". I'm sure a lot of
applications use emoji (or at least might encounter them), should they
always use co-emoji, or would some people who are using emoji not want
it? Can it be combined with "ks" or other "k*" settings?

What I'm trying to avoid is users seeing something in the documentation
and using it without it really being a good fit for their problem. Then
they see something unexpected, and need to rebuild all of their indexes
or something.

> > * I don't understand what "kc" means if "ks" is not set to
> > "level1".
>
> There is an example here:
> https://peter.eisentraut.org/blog/2023/05/16/overview-of-icu-collation-settings#colcaselevel

Interesting, thank you.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

22 мая 2023 г., 20:09:00

    Jeff Davis wrote:

> If we special case locale=C, but do nothing for locale=fr_FR, then I'm
> not sure we've solved the problem. Andrew Gierth raised the issue here,
> which he called "maximally confusing":
>
> https://postgr.es/m/874jp9f5jo.fsf@news-spur.riddles.org.uk
>
> That's why I feel that we need to make locale apply to whatever the
> provider is, not just when it happens to be C.

While I agree that the LOCALE option in CREATE DATABASE is
counter-intuitive, I find it questionable that blending ICU
and libc locales into it helps that much with the user experience.

Trying the lastest v6-* patches applied on top of 722541ead1
(before the pgindent run), here are a few examples when I
don't think it goes well.

The OS is Ubuntu 22.04 (glibc 2.35, ICU 70.1)

initdb:

  Using default ICU locale "fr".
  Using language tag "fr" for ICU locale "fr".
  The database cluster will be initialized with this locale configuration:
    provider:     icu
    ICU locale:  fr
    LC_COLLATE:  fr_FR.UTF-8
    LC_CTYPE:     fr_FR.UTF-8
    LC_MESSAGES: fr_FR.UTF-8
    LC_MONETARY: fr_FR.UTF-8
    LC_NUMERIC:  fr_FR.UTF-8
    LC_TIME:     fr_FR.UTF-8
  The default database encoding has accordingly been set to "UTF8".

#1

postgres=# create database test1 locale='fr_FR.UTF-8';
NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
ERROR:    new ICU locale (fr-FR) is incompatible with the ICU locale of the
template database (fr)
HINT:  Use the same ICU locale as in the template database, or use template0
as template.

That looks like a fairly generic case that doesn't work seamlessly.

#2

postgres=# create database test2 locale='C.UTF-8' template='template0';
NOTICE:  using standard form "en-US-u-va-posix" for ICU locale "C.UTF-8"
CREATE DATABASE

en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
this interpretation is arguably not what a user would expect.

I would expect the ICU warning or error (icu_validation_level) to kick
in instead of that transliteration.

#3

$ grep french /etc/locale.alias
french        fr_FR.ISO-8859-1

postgres=# create database test3 locale='french' template='template0'
encoding='LATIN1';
WARNING:  ICU locale "french" has unknown language "french"
HINT:  To disable ICU locale validation, set parameter icu_validation_level
to DISABLED.
CREATE DATABASE

In practice we're probably getting the "und" ICU locale whereas "fr" would
be appropriate.

I assume that we would find more cases like that if testing on many
operating systems.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

24 мая 2023 г., 15:39:07

On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote:
> While I agree that the LOCALE option in CREATE DATABASE is
> counter-intuitive,

I think it's more than that. As Andreww Gierth pointed out:

   $ initdb --locale=fr_FR
   ...
     ICU locale:  en-US
   ...

Is more than just counter-intuitive. I don't think we can ship 16 that
way.

>  I find it questionable that blending ICU
> and libc locales into it helps that much with the user experience.

Thank you for going through some examples here. I agree that it's not
perfect, but we need some path to a reasonable ICU user experience, and
I think we'll have to accept some rough edges to avoid the worst cases,
like above.

> initdb:
>
>   Using default ICU locale "fr".
>   Using language tag "fr" for ICU locale "fr".
>

...

> #1
>
> postgres=# create database test1 locale='fr_FR.UTF-8';
> NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of

I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
you should either be using "TEMPLATE template0", or you should be
expecting an error if the LOCALE doesn't match exactly.

What would you like to see happen here?

> #2
>
> postgres=# create database test2 locale='C.UTF-8'
> template='template0';
> NOTICE:  using standard form "en-US-u-va-posix" for ICU locale
> "C.UTF-8"
> CREATE DATABASE
>
>
> en-US-u-va-posix does not sort like C.UTF-8 in glibc 2.35, so
> this interpretation is arguably not what a user would expect.

As you pointed out, this is not settled in libc either:

https://www.postgresql.org/message-id/8a3dc06f-9b9d-4ed7-9a12-2070d8b0165f%40manitou-mail.org

We really can't expect a particular order for a particular locale name,
unless we handle it specially like "C" or "POSIX". If we pass it to the
provider, we have to trust the provider to match our conceptual
expectations for that locale (and ideally version it properly).

> I would expect the ICU warning or error (icu_validation_level) to
> kick
> in instead of that transliteration.

Earlier versions of ICU (<= 63) do this transformation automatically,
and I don't see a reason to throw an error if ICU considers it valid.
The language tag en-US-u-va-posix will be stored in the catalog, and
that will be considered valid in later versions of ICU.

Later versions of ICU (>= 64) consider locales with a language name of
"C" to be obsolete and no longer valid. I added code to do the
transformation without error in these later versions, but I think we
have agreement to remove it.

If a user specifies the locale as "C.UTF-8", we can either pass it to
ICU and see whether that version accepts it or not (and if not, throw a
warning/error); or if we decide that "C.UTF-8" really means "C", we can
handle it in the memcmp() path like C and never send it to ICU.

> #3
>
> $ grep french /etc/locale.alias
> french          fr_FR.ISO-8859-1
>
> postgres=# create database test3 locale='french' template='template0'
> encoding='LATIN1';
> WARNING:  ICU locale "french" has unknown language "french"
> HINT:  To disable ICU locale validation, set parameter
> icu_validation_level
> to DISABLED.
> CREATE DATABASE
>
>
> In practice we're probably getting the "und" ICU locale whereas "fr"
> would
> be appropriate.

This is a good point and illustrates that ICU is not a drop-in
replacement for libc in all cases.

I don't see a solution here that doesn't involve some rough edges,
though. "Locale" is a generic term, and if we continue to insist that
it really means a libc locale, then ICU will never be on an equal
footing with libc, let alone the preferred provider.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

24 мая 2023 г., 16:31:42

On 5/24/23 11:39, Jeff Davis wrote:
> On Mon, 2023-05-22 at 22:09 +0200, Daniel Verite wrote:
>> In practice we're probably getting the "und" ICU locale whereas "fr"
>> would be appropriate.
> 
> This is a good point and illustrates that ICU is not a drop-in
> replacement for libc in all cases.
> 
> I don't see a solution here that doesn't involve some rough edges,
> though. "Locale" is a generic term, and if we continue to insist that
> it really means a libc locale, then ICU will never be on an equal
> footing with libc, let alone the preferred provider.

Huge +1

IMHO the experience should be unified to the degree possible.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

25 мая 2023 г., 16:24:27

On Mon, 2023-05-15 at 12:06 +0200, Peter Eisentraut wrote:
> > === 0002: handle some kinds of libc-stlye locale strings
> >
> > ICU used to handle libc locale strings like 'fr_FR@euro', but
> > doesn't
> > in later versions. Handle them in postgres for consistency.
>
> I tend to agree with ICU that these variants are obsolete, and we
> don't
> need to support them anymore.  If this were a tiny patch, then maybe
> ok,
> but the way it's presented here the whole code is duplicated between
> pg_locale.c and initdb.c, which is not great.

I dropped this patch from the series.

> > === 0003: reduce icu_validation_level to WARNING
> >
> > Given that we've seen some inconsistency in which locale names are
> > accepted in different ICU versions, it seems best not to be too
> > strict.
> > Peter Eisentraut suggested that it be set to ERROR originally, but
> > a
> > WARNING should be sufficient to see problems without introducing
> > risks
> > migrating to version 16.
>
> I'm not sure why this is the conclusion.  Presumably, the detection
> capabilities of ICU improve over time, so we want to take advantage
> of
> that?  What are some example scenarios where this change would help?

First of all, I missed this message earlier and I apologize for
proceeding with a commit that contradicted you -- that was not
intentional. The change is small and we can go back if needed.

To restate my reasoning: if we error by default, then changes in ICU
versions can result in errors, which seems too strong to me. I was
hoping to escalate the default for this setting to be "error" down the
road, but it feels like a risk to do so immmediately.

Another thing to consider is that initdb also does validation, and
that's not affected by this GUC. Right now, initdb errors if validation
fails.

> > We've already allowed users to create ICU collations with the C
> > locale
> > in the past, which uses the root collation (not memcmp()), and we
> > need
> > to keep supporting that for upgraded clusters.
>
> I'm not sure I agree that we need to keep supporting that.  The only
> way
> you could get that in past releases is if you specify explicitly,
> "give
> me provider ICU and locale C", and then it wouldn't actually even
> work
> correctly.  So nobody should be using that in practice, and nobody
> should have stumbled into that combination of settings by accident.

OK, then I'm inclined toward the approach to treat iculocale=C with the
memcmp() semantics. Patch added.

I also added a patch with a pg_upgrade check for previous versions with
iculocale=C, to make sure we don't corrupt indexes in case some user
did make that mistake.

> >    3. Introduce collation provider "none", which is always
>
> This seems most attractive, but I think it's quite invasive at this
> point, especially given the dubious premise (see above).

I removed this from the current patch series, and perhaps we should
reconsider it in v17.

> > === 0007: Add a GUC to control the default collation provider
> >
> > Having a GUC would make it easier to migrate to ICU without
> > surprises.
> > This only affects the default for CREATE COLLATION, not CREATE
> > DATABASE
> > (and obviously not initdb).
>
> It's not clear to me why we would want that.  Also not clear why it
> should only affect CREATE COLLATION.

Right now the default for CREATE COLLATION is always libc. For CREATE
DATABASE, it defaults to the template.

I included a patch with a different approach that uses the database
default collation's provider as the default for CREATE COLLATION,
unless LC_COLLATE or LC_CTYPE is specified.

Regards,
    Jeff Davis

Вложения

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

26 мая 2023 г., 00:23:53

On Mon, 2023-05-22 at 14:34 +0200, Peter Eisentraut wrote:
> Please put blank lines between
>
> </sect3>
> <sect3 ...>
>
> etc., matching existing style.
>
> We usually don't capitalize the collation parameters like
>
> CREATE COLLATION mycollation1 (PROVIDER = icu, LOCALE = 'ja-JP);
>
> elsewhere in the documentation.
>
> Table 24.2. ICU Collation Settings should probably be sorted by key,
> or
> at least by something.
>
> All tables should referenced in the text, like "Table x.y shows this
> and
> that."  (Note that a table could float to a different page in some
> output formats, so just putting it into a section without some
> introductory text isn't sound.)

Thank you, done.

> Table 24.1. ICU Collation Levels shows punctuation as level 4, which
> is
> only true in shifted mode, which isn't the default.  The whole
> business
> of treating variable collation elements is getting a bit lost in this
> description.  The kv option is described as "Classes of characters
> ignored during comparison at level 3.", which is effectively true but
> not the whole picture.

I organized the documentation around practical examples and available
options, and less around the conceptual model. I think that's a good
start, but you're right that it over-simplifies in a few areas.

Discussing the model would work better along with an explanation of ICU
rules, where you can make better use of those concepts. I feel like
there are some interesting things that can be done with rules, but I
haven't had a chance to really dig in yet.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

26 мая 2023 г., 16:24:35

    Jeff Davis wrote:

> > #1
> >
> > postgres=# create database test1 locale='fr_FR.UTF-8';
> > NOTICE:  using standard form "fr-FR" for ICU locale "fr_FR.UTF-8"
> > ERROR:  new ICU locale (fr-FR) is incompatible with the ICU locale of
>
> I don't see a problem here. If you specify LOCALE to CREATE DATABASE,
> you should either be using "TEMPLATE template0", or you should be
> expecting an error if the LOCALE doesn't match exactly.
>
> What would you like to see happen here?

What's odd is that initdb starting in an fr_FR.UTF-8 environment
found that "fr" was the default ICU locale to use, whereas
"create database" reports that "fr" and "fr_FR.UTF-8" refer to
incompatible locales.

To me initdb is wrong when coming up with the less precise "fr"
instead of "fr-FR".

I suggest the attached patch to call uloc_getDefault() instead of the
current code that somehow leaves out the country/region component.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Вложения

initdb-uloc_getdefault.diff

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

26 мая 2023 г., 18:47:02

On Fri, 2023-05-26 at 18:24 +0200, Daniel Verite wrote:
> To me initdb is wrong when coming up with the less precise "fr"
> instead of "fr-FR".
>
> I suggest the attached patch to call uloc_getDefault() instead of the
> current code that somehow leaves out the country/region component.

Thank you. I experimented with several ICU versions and different
environmental settings, and it does seem better at preserving the name
that comes from the environment.

There is a warning in the docs: "Do not use unless you know what you
are doing."[1] I don't see a reason for the warning or any major risk
from us using it. Perhaps it's because the result is affected by either
the environment or the last uloc_setDefault() call. We don't use
uloc_setDefault(), and we only call uloc_getDefault() once, so I don't
see a risk here.

The fix seems simple enough. Committed.

Regards,
    Jeff Davis

[1]
https://unicode-org.github.io/icu-docs/apidoc/dev/icu4c/uloc_8h.html#a4efa16db7351e62293f8ef0c37aac8d2

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

05 июня 2023 г., 17:54:58

New patch series attached. I plan to commit 0001 and 0002 soon, unless
there are objections.

0001 causes the "C" and "POSIX" locales to be treated with
memcmp/pg_ascii semantics in ICU, just like in libc. We also considered
a new "none" provider, but it's more invasive, and we can always
reconsider that in the v17 cycle.

0002 introduces an upgrade check for users who have explicitly
requested provider=icu and iculocale=C on older versions, and rejects
upgrading from v15 in that case to avoid index corruption. Having such
a collation is almost certainly a mistake by the user, because the
collator would not give the expected memcmp semantics.


--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

06 июня 2023 г., 13:09:59

    Jeff Davis wrote:

> New patch series attached. I plan to commit 0001 and 0002 soon, unless
> there are objections.
>
> 0001 causes the "C" and "POSIX" locales to be treated with
> memcmp/pg_ascii semantics in ICU, just like in libc. We also
> considered a new "none" provider, but it's more invasive, and we can
> always reconsider that in the v17 cycle.

FWIW I don't quite see how 0001 improve things or what problem it's
trying to solve.

0001 creates exceptions throughout the code so that when an ICU
collation has a locale name "C" or "POSIX" then it does not behave
like an ICU collation, even though pg_collation.collprovider='i'
To me it's neither desirable nor necessary that a collation that
has collprovider='i' is diverted to non-ICU semantics.

Also in the current state, this diversion does not apply to initdb.

"initdb --icu-locale=C" with 0001 applied reports this:

   Using language tag "en-US-u-va-posix" for ICU locale "C".
   The database cluster will be initialized with this locale configuration:
     provider:      icu
     ICU locale:  en-US-u-va-posix
     LC_COLLATE:  fr_FR.UTF-8
     [...]

and "initdb --locale=C" reports this:

   Using default ICU locale "fr_FR".
   Using language tag "fr-FR" for ICU locale "fr_FR".
   The database cluster will be initialized with this locale configuration:
     provider:      icu
     ICU locale:  fr-FR
     LC_COLLATE:  C
     [...]

Could you elaborate a bit more on what 0001 is meant to achieve, from
the point of view of the user?

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

06 июня 2023 г., 18:11:18

On 6/6/23 09:09, Daniel Verite wrote:
>     Jeff Davis wrote:
>> New patch series attached. I plan to commit 0001 and 0002 soon, unless
>> there are objections.
>> 
>> 0001 causes the "C" and "POSIX" locales to be treated with
>> memcmp/pg_ascii semantics in ICU, just like in libc. We also
>> considered a new "none" provider, but it's more invasive, and we can
>> always reconsider that in the v17 cycle.

> 0001 creates exceptions throughout the code so that when an ICU
> collation has a locale name "C" or "POSIX" then it does not behave
> like an ICU collation, even though pg_collation.collprovider='i'
> To me it's neither desirable nor necessary that a collation that
> has collprovider='i' is diverted to non-ICU semantics.

This discussion makes me wonder (though probably too late for the v16 
cycle) if we shouldn't treat "C" and "POSIX" locales to be a third 
provider, something like "internal".

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

06 июня 2023 г., 19:15:39

On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote:
> This discussion makes me wonder (though probably too late for the v16
> cycle) if we shouldn't treat "C" and "POSIX" locales to be a third
> provider, something like "internal".

That's exactly what I did in v6 of this series: I created a "none"
provider, and when someone specified provider=icu iculocale=C, it would
change the provider to "none":

https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com

I'm fine with either approach.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

06 июня 2023 г., 19:18:02

On 6/6/23 15:15, Jeff Davis wrote:
> On Tue, 2023-06-06 at 14:11 -0400, Joe Conway wrote:
>> This discussion makes me wonder (though probably too late for the v16
>> cycle) if we shouldn't treat "C" and "POSIX" locales to be a third 
>> provider, something like "internal".
> 
> That's exactly what I did in v6 of this series: I created a "none"
> provider, and when someone specified provider=icu iculocale=C, it would
> change the provider to "none":
> 
> https://www.postgresql.org/message-id/5f9bf4a0b040428c5db2dc1f23cc3ad96acb5672.camel%40j-davis.com
> 
> I'm fine with either approach.

Ha!

Well it seems like I am +1 on that then ;-)

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

06 июня 2023 г., 19:18:20

On Tue, 2023-06-06 at 15:09 +0200, Daniel Verite wrote:
> FWIW I don't quite see how 0001 improve things or what problem it's
> trying to solve.

The word "locale" is generic, so we need to make LOCALE/--locale apply
to whatever provider is being used. If "locale" only applies to libc,
using ICU will always be confusing and never be on the same level as
libc, let alone the preferred provider.

The locale "C" is a special case, documented as a non-locale. So, if
LOCALE/--locale apply to ICU, then either ICU needs to handle locale
"C" in the expected way (v8 patch series); or when we see locale "C" we
need to somehow change the provider into something that can handle it
(v6 patch series changes it to the "none" provider).

Please let me know if you disagree with the goal or the reasoning here.
If so, please explain where you think we should end up, because the
status quo does not seem great to me.

> 0001 creates exceptions throughout the code so that when an ICU
> collation has a locale name "C" or "POSIX" then it does not behave
> like an ICU collation, even though pg_collation.collprovider='i'
> To me it's neither desirable nor necessary that a collation that
> has collprovider='i' is diverted to non-ICU semantics.

It's not very principled, but it matches what libc does.

> Also in the current state, this diversion does not apply to initdb.
>
> "initdb --icu-locale=C" with 0001 applied reports this:
>
>    Using language tag "en-US-u-va-posix" for ICU locale "C".

Thank you. I fixed it by skipping the canonicalization for C/POSIX
locales in initdb.

> Could you elaborate a bit more on what 0001 is meant to achieve, from
> the point of view of the user?

It makes it so the user consistently (regardless of the provider) gets
the "no locale" behavior (as documented and historically expected) when
they specify the C or POSIX locales.

Then that enables us to change LOCALE/--locale to apply to ICU, which
means that a simple command like "initdb --locale=en_US" does a
sensible thing regardless of the default provider.

I understand you are skeptical of trying to apply an arbitrary locale
name to ICU, but if they don't specify the provider, what do you expect
to happen?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

06 июня 2023 г., 19:21:25

On 6/6/23 15:18, Jeff Davis wrote:
> On Tue, 2023-06-06 at 15:09 +0200, Daniel Verite wrote:
>> FWIW I don't quite see how 0001 improve things or what problem it's
>> trying to solve.
> 
> The word "locale" is generic, so we need to make LOCALE/--locale apply
> to whatever provider is being used. If "locale" only applies to libc,
> using ICU will always be confusing and never be on the same level as
> libc, let alone the preferred provider.


Agree 100%

> The locale "C" is a special case, documented as a non-locale. So, if
> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
> "C" in the expected way (v8 patch series); or when we see locale "C" we
> need to somehow change the provider into something that can handle it
> (v6 patch series changes it to the "none" provider).

+1 to the latter approach

> Please let me know if you disagree with the goal or the reasoning here.
> If so, please explain where you think we should end up, because the
> status quo does not seem great to me.

also +1

>> 0001 creates exceptions throughout the code so that when an ICU
>> collation has a locale name "C" or "POSIX" then it does not behave
>> like an ICU collation, even though pg_collation.collprovider='i'
>> To me it's neither desirable nor necessary that a collation that
>> has collprovider='i' is diverted to non-ICU semantics.
> 
> It's not very principled, but it matches what libc does.

Makes sense to me

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

06 июня 2023 г., 19:25:26

Joe Conway <mail@joeconway.com> writes:
> On 6/6/23 15:18, Jeff Davis wrote:
>> The locale "C" is a special case, documented as a non-locale. So, if
>> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
>> "C" in the expected way (v8 patch series); or when we see locale "C" we
>> need to somehow change the provider into something that can handle it
>> (v6 patch series changes it to the "none" provider).

> +1 to the latter approach

Also +1, except that I find "none" a rather confusing choice of name.
There *is* a provider, it's just PG itself not either libc or ICU.
I thought Joe's suggestion of "internal" made more sense.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Robert Haas

Дата:

06 июня 2023 г., 19:37:17

On Tue, Jun 6, 2023 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Joe Conway <mail@joeconway.com> writes:
> > On 6/6/23 15:18, Jeff Davis wrote:
> >> The locale "C" is a special case, documented as a non-locale. So, if
> >> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
> >> "C" in the expected way (v8 patch series); or when we see locale "C" we
> >> need to somehow change the provider into something that can handle it
> >> (v6 patch series changes it to the "none" provider).
>
> > +1 to the latter approach
>
> Also +1, except that I find "none" a rather confusing choice of name.
> There *is* a provider, it's just PG itself not either libc or ICU.
> I thought Joe's suggestion of "internal" made more sense.

Or perhaps "builtin" or "postgresql".

I'm just thinking that "internal" as a type name kind of means "you
shouldn't be touching this from SQL" and we don't want to give people
the idea that the "C" locale isn't something you should use.

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

06 июня 2023 г., 19:55:45

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Also +1, except that I find "none" a rather confusing choice of name.
>> There *is* a provider, it's just PG itself not either libc or ICU.
>> I thought Joe's suggestion of "internal" made more sense.

> Or perhaps "builtin" or "postgresql".

Either OK by me

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

06 июня 2023 г., 19:56:32

On 6/6/23 15:55, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Also +1, except that I find "none" a rather confusing choice of name.
>>> There *is* a provider, it's just PG itself not either libc or ICU.
>>> I thought Joe's suggestion of "internal" made more sense.
> 
>> Or perhaps "builtin" or "postgresql".
> 
> Either OK by me

Same here

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

"Jonathan S. Katz"

Дата:

06 июня 2023 г., 20:00:55

On 6/6/23 3:56 PM, Joe Conway wrote:
> On 6/6/23 15:55, Tom Lane wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Also +1, except that I find "none" a rather confusing choice of name.
>>>> There *is* a provider, it's just PG itself not either libc or ICU.
>>>> I thought Joe's suggestion of "internal" made more sense.
>>
>>> Or perhaps "builtin" or "postgresql".
>>
>> Either OK by me
> 
> Same here

Since we're bikeshedding, "postgresql" or "builtin" could make it seem 
to a (app) developer that these may be recommended options, as we're 
trusting PostgreSQL to make the best choices for us. Granted, v16 is 
(theoretically) defaulting to ICU, so that choice is made, but the 
unsuspecting developer could make a switch based on that naming.

However, I don't have a strong alternative -- I understand the concern 
about "internal", so I'd be OK with "postgresql" unless a better name 
appears.

Jonathan

Вложения

OpenPGP_signature

Re: Order changes in PG16 since ICU introduction

От

Tom Lane

Дата:

06 июня 2023 г., 20:20:40

"Jonathan S. Katz" <jkatz@postgresql.org> writes:
> Since we're bikeshedding, "postgresql" or "builtin" could make it seem 
> to a (app) developer that these may be recommended options, as we're 
> trusting PostgreSQL to make the best choices for us. Granted, v16 is 
> (theoretically) defaulting to ICU, so that choice is made, but the 
> unsuspecting developer could make a switch based on that naming.

I don't think this is a problem, as long as any locale name other
than C/POSIX fails when combined with that provider name.

            regards, tom lane

Re: Order changes in PG16 since ICU introduction

От

Andrew Gierth

Дата:

06 июня 2023 г., 20:37:26

>>>>> "Joe" == Joe Conway <mail@joeconway.com> writes:

 > On 6/6/23 15:55, Tom Lane wrote:
 >> Robert Haas <robertmhaas@gmail.com> writes:
 >>> On Tue, Jun 6, 2023 at 3:25 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
 >>>> Also +1, except that I find "none" a rather confusing choice of name.
 >>>> There *is* a provider, it's just PG itself not either libc or ICU.
 >>>> I thought Joe's suggestion of "internal" made more sense.
 >>
 >>> Or perhaps "builtin" or "postgresql".
 >> Either OK by me

 Joe> Same here

I like either "internal" or "builtin" because they correctly identify
that no external resources are used. I'm not keen on "postgresql".

--
Andrew.

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

07 июня 2023 г., 15:03:26

On 22.05.23 19:35, Jeff Davis wrote:
> On Thu, 2023-05-11 at 13:07 +0200, Peter Eisentraut wrote:
>> Here is my proposed patch for this.
> 
> The commit message makes it sound like lc_collate/ctype are completely
> obsolete, and I don't think that's quite right: they still represent
> the server environment, which does still matter in some cases.
> 
> I'd just say that they are too confusing (likely to be misused), and
> becoming obsolete (or less relevant), or something along those lines.
> 
> Otherwise, this is fine with me. I didn't do a detailed review because
> it's just mechanical.

I have committed this with some tuning of the commit message.

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

07 июня 2023 г., 21:50:57

    Jeff Davis wrote:

> The locale "C" is a special case, documented as a non-locale. So, if
> LOCALE/--locale apply to ICU, then either ICU needs to handle locale
> "C" in the expected way (v8 patch series); or when we see locale "C" we
> need to somehow change the provider into something that can handle it
> (v6 patch series changes it to the "none" provider).

Yes it's a special case but when doing initdb --locale=C, a user does
not need or want an ICU locale. They want the same thing than what v15
does with the same arguments: a template0 database with
datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL.

The simplest way to obtain that in v16 is to teach initdb that
--locale=C without the --locale-provider option implies that
--locale-provider=libc ([1])

OTOH what you're proposing with the 0001 patch is much more involved
in terms of tweaking the ICU code so that dateiculocale='C' and
datlocprovider='i' becomes a combination that provides the C semantics
that ICU doesn't have natively.

I don't agree with the reasoning that to make progress with
the other uses of --locale, we need to start by either making ICU
support C/POSIX (0001/0002), or add a new "none/builtin" provider
(previous set of patches).
v16 should not need it any more than v15 did, if v16 does the same as
v15 on locale='C', that is not involve ICU at all.

> Then that enables us to change LOCALE/--locale to apply to ICU, which
> means that a simple command like "initdb --locale=en_US" does a
> sensible thing regardless of the default provider.
>
> I understand you are skeptical of trying to apply an arbitrary locale
> name to ICU, but if they don't specify the provider, what do you expect
> to happen?

It's a hard question because it depends on what people have in their
locale environment combined with what they try to do.
I think that initdb without any locale option should work well in
the majority of environments, but specifying a locale alone will not work
well in a number of cases, so users might end up concluding that they
need to specify not only the provider but lc_collate/lc_ctype.

[1]
https://www.postgresql.org/message-id/360c90b9-7c20-4cec-aade-38e6e3351c05@manitou-mail.org

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

07 июня 2023 г., 21:59:59

On 05.06.23 19:54, Jeff Davis wrote:
> 
> New patch series attached. I plan to commit 0001 and 0002 soon, unless
> there are objections.
> 
> 0001 causes the "C" and "POSIX" locales to be treated with
> memcmp/pg_ascii semantics in ICU, just like in libc. We also considered
> a new "none" provider, but it's more invasive, and we can always
> reconsider that in the v17 cycle.
> 
> 0002 introduces an upgrade check for users who have explicitly
> requested provider=icu and iculocale=C on older versions, and rejects
> upgrading from v15 in that case to avoid index corruption. Having such
> a collation is almost certainly a mistake by the user, because the
> collator would not give the expected memcmp semantics.

I'm dubious about these two.

0003 seems like the correct direction.  In createdb.c, the change you 
add makes sense, but you should also remove the existing use of the 
locale variable:

-   if (locale)
-   {
-       if (!lc_ctype)
-           lc_ctype = locale;
-       if (!lc_collate)
-           lc_collate = locale;
-   }
-

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

07 июня 2023 г., 22:11:24

On 05.06.23 19:54, Jeff Davis wrote:
> New patch series attached.

Could you clarify what here is intended for 16 and what is for later? 
This patch set keeps expanding and changing in each iteration.

There is a PG16 open item linked to this thread

* The rules for choosing default ICU locale seem pretty unfriendly

which I think would be addressed by an appropriately fixed up variant of 
your patch 0003.

(Or if not, what is the actual issue?)

Everything else appears to be either new feature work or fixing 
pre-existing prehavior, so is not in scope for PG16 and should be dealt 
with elsewhere, so we can focus here on closing out this release.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

07 июня 2023 г., 23:11:44

On Wed, 2023-06-07 at 23:50 +0200, Daniel Verite wrote:
> The simplest way to obtain that in v16 is to teach initdb that
> --locale=C without the --locale-provider option implies that
> --locale-provider=libc ([1])

As I replied in that subthread, that creates a worse problem: if you
only change the provider when the locale is C, then what about when the
locale is *not* C?

  export LANG=en_US.UTF-8
  initdb -D data --locale=fr_FR.UTF-8
  ...
    provider:    icu
    ICU locale:  en-US

I believe that case is an order of magnitude worse than the other cases
you brought up in that subthread.

It also leaves the fundamental problem in place that LOCALE only
applies to the libc provider, which multiple people have agreed is not
acceptable.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

07 июня 2023 г., 23:26:12

On Thu, 2023-06-08 at 00:11 +0200, Peter Eisentraut wrote:
> On 05.06.23 19:54, Jeff Davis wrote:
> > New patch series attached.
>
> Could you clarify what here is intended for 16 and what is for later?

I apologize about the patch churn here. I implemented several
approaches to see what feedback I get, and now it looks like we're
returning to a previous idea (the "builtin" provider).

In v16:

1. We need LOCALE to apply to all providers.

2. We need LOCALE=C to give the memcmp/pg_ascii behavior in all cases
(unless overridden by a separate LC_COLLATE or LC_CTYPE parameter).

Those are the biggest problems raised in this thread, and the patches
to accomplish those things are in scope for v16.

After we sort those out, there are two loose ends:

* What do we do in the case where the environment has LANG=C.UTF-8 (as
some buildfarm members do)? Is an error acceptable in that case?

* Do we move icu_validation_level back to ERROR?

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Tatsuo Ishii

Дата:

08 июня 2023 г., 00:30:33

Hi,

> On Wed, 2023-06-07 at 23:50 +0200, Daniel Verite wrote:
>> The simplest way to obtain that in v16 is to teach initdb that
>> --locale=C without the --locale-provider option implies that
>> --locale-provider=libc ([1])
> 
> As I replied in that subthread, that creates a worse problem: if you
> only change the provider when the locale is C, then what about when the
> locale is *not* C?
> 
>   export LANG=en_US.UTF-8
>   initdb -D data --locale=fr_FR.UTF-8
>   ...
>     provider:    icu
>     ICU locale:  en-US
> 
> I believe that case is an order of magnitude worse than the other cases
> you brought up in that subthread.
> 
> It also leaves the fundamental problem in place that LOCALE only
> applies to the libc provider, which multiple people have agreed is not
> acceptable.

Daniels comment:
>> Yes it's a special case but when doing initdb --locale=C, a user does
>> not need or want an ICU locale. They want the same thing than what v15
>> does with the same arguments: a template0 database with
>> datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL.

So in this case the only way to keep the same behavior in v16 with "initdb
--locale=C" (--no-locale) in v15 is, bulding PostgreSQL --without-icu?

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

08 июня 2023 г., 00:52:46

On 6/7/23 19:26, Jeff Davis wrote:
> * What do we do in the case where the environment has LANG=C.UTF-8 (as
> some buildfarm members do)? Is an error acceptable in that case?

If I understand the discussion so far correctly, I think that case 
should fall to the provider.

If it supports "C.UTF-8" explicitly as some distributions do, then use it.

If the provider has no such thing, throw an error.

Somewhere we should document that "C.UTF-8" from the provider might not 
be as stable or working as they expect.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Tatsuo Ishii

Дата:

08 июня 2023 г., 01:45:35

>> As I replied in that subthread, that creates a worse problem: if you
>> only change the provider when the locale is C, then what about when the
>> locale is *not* C?
>> 
>>   export LANG=en_US.UTF-8
>>   initdb -D data --locale=fr_FR.UTF-8
>>   ...
>>     provider:    icu
>>     ICU locale:  en-US
>> 
>> I believe that case is an order of magnitude worse than the other cases
>> you brought up in that subthread.
>> 
>> It also leaves the fundamental problem in place that LOCALE only
>> applies to the libc provider, which multiple people have agreed is not
>> acceptable.

Note that most of PostgreSQL users in Japan do initdb
--no-locale. Almost never use other than C locale because the users do
not rely on system collation. Most database have an extra column which
represents the pronunciation in Hiragana or Katakana.

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

08 июня 2023 г., 09:03:25

    Tatsuo Ishii wrote:

> >> Yes it's a special case but when doing initdb --locale=C, a user does
> >> not need or want an ICU locale. They want the same thing than what v15
> >> does with the same arguments: a template0 database with
> >> datlocprovider='c', datcollate='C', datctype='C', dateiculocale=NULL.
>
> So in this case the only way to keep the same behavior in v16 with "initdb
> --locale=C" (--no-locale) in v15 is, bulding PostgreSQL --without-icu?

AFAIK the --no-locale case in v16 is fixed since:

commit 5cd1a5af4d17496a58678c8eb7ab792119c2d723
Author: Jeff Davis <jdavis@postgresql.org>
Date:    Fri Apr 21 13:11:18 2023 -0700

    Fix initdb --no-locale.

    Discussion: https://postgr.es/m/878relf7cb.fsf@news-spur.riddles.org.uk
    Reported-by: Andrew Gierth


The --locale=C case is still being discussed. To me it should
produce the same result than --no-locale and --locale=C in v15, that is,
"ICU is the default" does not apply to that case, but currently
it initializes the cluster with an ICU locale.


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

08 июня 2023 г., 10:08:12

    Jeff Davis wrote:

> As I replied in that subthread, that creates a worse problem: if you
> only change the provider when the locale is C, then what about when the
> locale is *not* C?
>
>  export LANG=en_US.UTF-8
>  initdb -D data --locale=fr_FR.UTF-8
>  ...
>    provider:    icu
>    ICU locale:  en-US

What you're proposing with the 0003 patch still applies.

In the above case I think we would end up with:

provider=icu
ICU locale=fr-FR
lc_collate=fr_FR.UTF-8
lc_lctype=fr_FR.UTF-8

which is reasonable.


In the following cases we would initialize a libc cluster instead of an
ICU cluster:

- initdb --locale=C
- initdb --locale=POSIX
- LANG=C initdb
- LANG=C.UTF-8 initdb
- LANG=POSIX initdb
- ... possibly other locales that we find are unsuitable for ICU

That is, the rule "ICU by default" really means "ICU unless the locale
that we're being passed or getting from the environment
has semantics that ICU does not provide but we know libc provides,
in which case we fall back to libc".

The user who wants ICU imperatively should invoke
--icu-locale=something or --locale=something --locale-provider=icu
in which case we should not fallback to libc.
We still have to determine lc_collate and lc_ctype either from the
environment or from the locale argument (I think we should
favor the environment), except if the user specifies
--lc-collate=... lc-ctype=...


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

08 июня 2023 г., 21:15:39

On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote:
> If the provider has no such thing, throw an error.

Just to be clear, that implies that users (and buildfarm members) with
LANG=C.UTF-8 in their environment would not be able to run a plain
"initdb -D data"; they'd get an error. It's hard for me to estimate how
many users might be inconvenienced by that, but it sounds like a risk.

Perhaps for this specific case, and only in initdb, we change
C.anything and POSIX.anything to the builtin provider? CREATE DATABASE
and CREATE COLLATION could still reject such locales.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Joe Conway

Дата:

08 июня 2023 г., 21:59:43

On 6/8/23 17:15, Jeff Davis wrote:
> On Wed, 2023-06-07 at 20:52 -0400, Joe Conway wrote:
>> If the provider has no such thing, throw an error.
> 
> Just to be clear, that implies that users (and buildfarm members) with
> LANG=C.UTF-8 in their environment would not be able to run a plain
> "initdb -D data"; they'd get an error. It's hard for me to estimate how
> many users might be inconvenienced by that, but it sounds like a risk.

Well, but only if their libc provider does not have C.UTF-8, correct?

I see
----------------
Linux Mint 21.1:    /usr/lib/locale/C.utf8
RHEL 8:            /usr/lib/locale/C.utf8
RHEL 9:            /usr/lib/locale/C.utf8
AL2:            /usr/lib/locale/C.utf8

However I do not see it on RHEL 7 :-(

> Perhaps for this specific case, and only in initdb, we change
> C.anything and POSIX.anything to the builtin provider?

Might be best, with some kind of warning perhaps?

> CREATE DATABASE and CREATE COLLATION could still reject such
> locales.

Seems to make sense.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

09 июня 2023 г., 00:36:50

On Tue, 2023-06-06 at 21:37 +0100, Andrew Gierth wrote:
> > > > >
> I like either "internal" or "builtin" because they correctly identify
> that no external resources are used. I'm not keen on "postgresql".

"builtin" seems to be the winner. New patch series attached with doc
and test updates.

This has been a long discussion (it's a messy problem), but I think
I've addressed the most important concerns raised. If you disagree with
something, please indicate whether it's an objection, or a more minor
difference of opinion that I can weigh against other opinions. Also
please indicate if you think something is out of scope for 16.

Patches 0001, 0002:

These patches implement the built-in provider and automatically change
provider=icu to provider=builtin when the locale is C. Other approaches
were considered:
 * Pretend that ICU can support the C locale, and use similar checks
throughout the code like the libc provider does: This was somewhat of a
hack, and had potential issues with upgraded clusters, and several
people seemed to reject it.
 * Switch to the libc provider for the C locale: would make the libc
provider even more complicated and had some potential for confusion,
and also has catalog representation problems when --locale is specified
along with --lc-ctype.

Ultimately we need to choose one approach, and the built-in provider
seems the nicest (though most invasive). It reflects the reality that
we don't actually use libc or icu for the C locale, and it's nicer to
document. The builtin provider seemed to get the most support.

Patch 0003:

Makes LOCALE apply to all providers. The overall feel after this patch
is that "locale" now means the collation locale, and
LC_COLLATE/LC_CTYPE are for the server environment. When using libc,
LC_COLLATE and LC_CTYPE still work as they did before, but their
relationship to database collation feels more like a special case of
the libc provider. I believe most people favor this patch and I haven't
seen recent objections.

I didn't find any surprising behaviors, but there are a few that I'd
like to draw attention to:

0. If you initdb with --locale-provider=libc, and don't specify ICU at
any later point, then none of these changes should affect you and
you'll remain on libc. If someone notices otherwise, please let me
know.

1. If you specify --locale-provider=builtin at initdb time, you *must*
specify --locale=C/POSIX, otherwise you get an error.

2. Patch 0004 is possibly out of scope for 16, but it felt consistent
with the other UI changes and low risk. Please try with/without before
objecting.

3. Daniel Verite felt that we should only change the provider from ICU
to "builtin" for the C locale if the provider is defaulting to ICU; not
if it's specified as ICU. I did not differentiate between specifying
ICU and defaulting to ICU because:
  a. "libc" unconditionally uses the built-in memcmp() logic for C, it
never actually uses libc
  b. If a user really wants the root locale or the en-US-u-va-posix
locale, they can specify those directly
  c. I don't see any plausible case where it helps a user to keep
provider=icu when locale=C.

4. Joe Conway and Peter Eisentraut both felt that C.UTF-8 with
provider=icu should not be changed to use the builtin provider, and
instead passed on to ICU. I implemented a compromise where initdb will
change C.UTF-8 to the built-in provider; but CREATE DATABASE/COLLATION
will pass it along to ICU (which may support it as en-US-u-va-posix in
some versions, or may throw an error in other versions). My reasoning
is that initdb is pulling from the environment, and we should try
harder to succeed on any reasonable environmental settings (otherwise
initdb with default settings could fail); whereas we can be more strict
with CREATE DATABASE/COLLATION.

5. For the built-in provider, initdb defaults to UTF-8 rather than
SQL_ASCII. Otherwise, you would be unable to use ICU at all later,
because ICU doesn't support SQL_ASCII.

--
Jeff Davis
PostgreSQL Contributor Team - AWS

Вложения

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

09 июня 2023 г., 12:12:36

    Jeff Davis wrote:

>  I implemented a compromise where initdb will
>  change C.UTF-8 to the built-in provider

This handling of C.UTF-8 would be felt by users as simply broken.

With the v10 patches:

$ initdb --locale=C.UTF-8

initdb: using locale provider "builtin" for ICU locale "C.UTF-8"
The database cluster will be initialized with this locale configuration:
  default collation provider:  builtin
  default collation locale:    C
  LC_COLLATE:  C.UTF-8
  LC_CTYPE:    C.UTF-8

This setup is not what the user has asked for and leads to that kind of
wrong results:

$ psql -c "select upper('é')"
 ?column?
----------
 é

whereas in v15 we would get the correct result 'É'.

Then once inside that cluster, trying to create a database:

  postgres=# create database test locale='C.UTF-8';
  ERROR:  locale provider "builtin" does not support locale "C.UTF-8"
  HINT:  The built-in locale provider only supports the "C" and "POSIX"
locales.

That hardly makes sense considering that initdb stated the opposite,
that the "built-in provider" was adequate for C.UTF-8

In general about the evolution of the patchset, your interpretation
of "defaulting to ICU" seems to be "avoid libc at any cost", which IMV
is unreasonably user-hostile.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

09 июня 2023 г., 15:55:40

On Fri, 2023-06-09 at 14:12 +0200, Daniel Verite wrote:
> >  I implemented a compromise where initdb will
> >  change C.UTF-8 to the built-in provider
>
> $ initdb --locale=C.UTF-8

...

> This setup is not what the user has asked for and leads to that kind
> of
> wrong results:
>
> $ psql -c "select upper('é')"
>  ?column?
> ----------
>  é
>
> whereas in v15 we would get the correct result 'É'.

I guess where I'm confused is: why would a user actually want their
database collation to be C.UTF-8? It's slower than C, our
implementation doesn't properly version it (as you pointed out), and
the semantics don't seem great ('Z' < 'a').

If the user specifies provider=libc, then of course we should honor
that and C.UTF-8 is a valid locale for libc.

But if they don't specify the provider, isn't it much more likely they
just don't care much about the locale, and would be happier with C? 

Perhaps there's some better compromise here than the one I picked, but
I see this as a fairly small problem in comparison to the big problems
that we're solving.

> In general about the evolution of the patchset, your interpretation
> of "defaulting to ICU" seems to be "avoid libc at any cost", which
> IMV
> is unreasonably user-hostile.

The user can easily get libc behavior by specifying --locale-
provider=libc, so I don't see how you reached this conclusion.

Let me try to understand and address the points you raised here[1] in
more detail:

It looks like you are fine with 0003 applying LOCALE to whatever
provider is chosen, but you'd like to be smarter about choosing the
provider and to choose libc in at least some cases.

That is actually very much like option #2 in the list I presented
here[2], and has the same problems. How should the following behave?

  initdb --locale=C --lc-collate=fr_FR.utf8
  initdb --locale=en --lc-collate=fr_FR.utf8

If we switch to libc in the first case, then --locale will be ignored
and the collation will be fr_FR.utf8. But we will leave the second case
as ICU and the collation will be "en". I'm sure we can come up with
something there, but it feels like there's more room for confusion
along this path, and the builtin provider seems cleaner.

You also suggested that we consider switching the provider to libc any
time ICU doesn't support something. I'm not sure whether you meant a
static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test. I'm
skeptical of being too smart here, but I'd like to hear what you mean.
I'm also not clear whether you think we should abandon the built-in
provider, or still select it for C/POSIX.

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/7de2dc15-5211-45b3-afcb-71dcaf7a08bb@manitou-mail.org

[2]
https://www.postgresql.org/message-id/daa9f060aa2349ebc84444515efece49e7b32c5d.camel@j-davis.com

Re: Order changes in PG16 since ICU introduction

От

"Daniel Verite"

Дата:

12 июня 2023 г., 09:37:58

    Jeff Davis wrote:

> I guess where I'm confused is: why would a user actually want their
> database collation to be C.UTF-8? It's slower than C, our
> implementation doesn't properly version it (as you pointed out), and
> the semantics don't seem great ('Z' < 'a').

Because when LC_CTYPE=C, characters outside of US ASCII are not
categorized properly. upper/lower/regexp matching/... produce
incorrect results.

> But if they don't specify the provider, isn't it much more likely they
> just don't care much about the locale, and would be happier with C?

Consider a pre-existing script doing initdb --locale=C.UTF-8
Surely it does care about the locale, otherwise it would not specify
it.
Assuming that it would be better off with C is assuming that a
non-Unicode aware locale is better than the Unicode-aware locale
they're asking. I don't think it's reasonable.

> The user can easily get libc behavior by specifying --locale-
> provider=libc, so I don't see how you reached this conclusion.

What would be user hostile is forcing users that don't need an ICU
locale to change their invocations of initdb/createdb to avoid
regressions with v16. Most people would discover this after
it breaks their apps.

> It looks like you are fine with 0003 applying LOCALE to whatever
> provider is chosen, but you'd like to be smarter about choosing the
> provider and to choose libc in at least some cases.
>
> That is actually very much like option #2 in the list I presented
> here[2], and has the same problems. How should the following behave?
>
>   initdb --locale=C --lc-collate=fr_FR.utf8
>   initdb --locale=en --lc-collate=fr_FR.utf8

The same as v15.

> If we switch to libc in the first case, then --locale will be ignored
> and the collation will be fr_FR.utf8.

$ initdb --locale=C --lc-collate=fr_FR.utf8
v15 does that:

  The database cluster will be initialized with this locale configuration:
    provider:     libc
    LC_COLLATE:  fr_FR.utf8
    LC_CTYPE:     C
    LC_MESSAGES: C
    LC_MONETARY: C
    LC_NUMERIC:  C
    LC_TIME:     C
  The default database encoding has accordingly been set to "SQL_ASCII".

--locale is not ignored, it's overriden for LC_COLLATE only.

> But we will leave the second case as ICU and the collation will be
> "en".

Yes. To me the rule for "ICU is the default" in v16 should be: if the
--locale argument points to a locale that we know ICU does not provide,
we fall back to the v15 behavior down to every detail, otherwise we let
ICU be the provider.

> You also suggested that we consider switching the provider to libc any
> time ICU doesn't support something. I'm not sure whether you meant a
> static list (C, C.UTF-8, POSIX, ...?) or some kind of dynamic test.

C, C.*, POSIX. I'm not sure if there are other cases.

> I'm also not clear whether you think we should abandon the built-in
> provider, or still select it for C/POSIX.

I see it as going in v17, because it came after feature freeze and
is not strictly necessary in v16.



Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

12 июня 2023 г., 21:04:04

On 09.06.23 02:36, Jeff Davis wrote:
> Patches 0001, 0002:
> 
> These patches implement the built-in provider and automatically change
> provider=icu to provider=builtin when the locale is C.

I object to adding a new provider for PG16 (patch 0001).  This is 
clearly a new feature, which wasn't even contemplated a few weeks ago.

>   * Switch to the libc provider for the C locale: would make the libc
> provider even more complicated and had some potential for confusion,
> and also has catalog representation problems when --locale is specified
> along with --lc-ctype.

I don't follow this concern.  This could be done entirely within initdb. 
  I mean, just change the default for --locale-provider if --locale=C is 
given.  That's like 10 lines of code in initdb.c.

I don't think I want CREATE DATABASE or CREATE COLLATION to have that 
logic, nor do they really need it.

> Patch 0003:
> 
> Makes LOCALE apply to all providers. The overall feel after this patch
> is that "locale" now means the collation locale, and
> LC_COLLATE/LC_CTYPE are for the server environment. When using libc,
> LC_COLLATE and LC_CTYPE still work as they did before, but their
> relationship to database collation feels more like a special case of
> the libc provider. I believe most people favor this patch and I haven't
> seen recent objections.

This seems reasonable.

> 1. If you specify --locale-provider=builtin at initdb time, you *must*
> specify --locale=C/POSIX, otherwise you get an error.

Shouldn't the --locale option be ignored (or rejected) in that case. 
Why insist on it being specified?

> 2. Patch 0004 is possibly out of scope for 16, but it felt consistent
> with the other UI changes and low risk. Please try with/without before
> objecting.

Also clearly a new feature.  Also the implications of various upgrade, 
dump/restore scenarios are not fully explored.

I think it's an interesting idea, to make CREATE DATABASE and CREATE 
COLLATION also default to icu of the respective higher scope has been 
set to icu.  In fact, this makes me wonder now whether changing the 
default to icu in *only* initdb is sensible.  But again, we'd need to 
see the full matrix of upgrade scenarios laid out here.

> 3. Daniel Verite felt that we should only change the provider from ICU
> to "builtin" for the C locale if the provider is defaulting to ICU; not
> if it's specified as ICU.

Correct, we shouldn't override what was explicitly specified.

> I did not differentiate between specifying
> ICU and defaulting to ICU because:
>    a. "libc" unconditionally uses the built-in memcmp() logic for C, it
> never actually uses libc
>    b. If a user really wants the root locale or the en-US-u-va-posix
> locale, they can specify those directly
>    c. I don't see any plausible case where it helps a user to keep
> provider=icu when locale=C.

If the user specifies that, that's up to them to deal with the outcomes. 
  Just changing it to something different seems wrong.

> 4. Joe Conway and Peter Eisentraut both felt that C.UTF-8 with
> provider=icu should not be changed to use the builtin provider, and
> instead passed on to ICU. I implemented a compromise where initdb will
> change C.UTF-8 to the built-in provider; but CREATE DATABASE/COLLATION
> will pass it along to ICU (which may support it as en-US-u-va-posix in
> some versions, or may throw an error in other versions). My reasoning
> is that initdb is pulling from the environment, and we should try
> harder to succeed on any reasonable environmental settings (otherwise
> initdb with default settings could fail); whereas we can be more strict
> with CREATE DATABASE/COLLATION.

I'm not objecting to changing anything about C.UTF-8, but I am objecting 
to changing anything substantial in PG16.

> 5. For the built-in provider, initdb defaults to UTF-8 rather than
> SQL_ASCII. Otherwise, you would be unable to use ICU at all later,
> because ICU doesn't support SQL_ASCII.

Also a behavior change that is not appropriate for PG16 at this stage.

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

14 июня 2023 г., 21:24:40

On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote:
> > Patch 0003:
> >
> > Makes LOCALE apply to all providers. The overall feel after this
> > patch
> > is that "locale" now means the collation locale, and
> > LC_COLLATE/LC_CTYPE are for the server environment. When using
> > libc,
> > LC_COLLATE and LC_CTYPE still work as they did before, but their
> > relationship to database collation feels more like a special case
> > of
> > the libc provider. I believe most people favor this patch and I
> > haven't
> > seen recent objections.
>
> This seems reasonable.

Attached a clean patch for this.

It seems to have widespread agreement so I plan to commit to v16 soon.

To clarify, this affects both initdb and CREATE DATABASE.

Regards,
    Jeff Davis

Вложения

v11-0001-CREATE-DATABASE-make-LOCALE-apply-to-all-collati.patch

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

15 июня 2023 г., 05:07:13

On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote:
> I object to adding a new provider for PG16 (patch 0001).

Added to July CF for 17.

> > 2. Patch 0004 is possibly out of scope for 16

> Also clearly a new feature.

Added to July CF for 17.

Regards,
    Jeff Davis

Re: Order changes in PG16 since ICU introduction

От

Peter Eisentraut

Дата:

16 июня 2023 г., 14:50:34

On 14.06.23 23:24, Jeff Davis wrote:
> On Mon, 2023-06-12 at 23:04 +0200, Peter Eisentraut wrote:
>>> Patch 0003:
>>>
>>> Makes LOCALE apply to all providers. The overall feel after this
>>> patch
>>> is that "locale" now means the collation locale, and
>>> LC_COLLATE/LC_CTYPE are for the server environment. When using
>>> libc,
>>> LC_COLLATE and LC_CTYPE still work as they did before, but their
>>> relationship to database collation feels more like a special case
>>> of
>>> the libc provider. I believe most people favor this patch and I
>>> haven't
>>> seen recent objections.
>>
>> This seems reasonable.
> 
> Attached a clean patch for this.
> 
> It seems to have widespread agreement so I plan to commit to v16 soon.
> 
> To clarify, this affects both initdb and CREATE DATABASE.

This looks good to me.

Attached is small fixup patch with some documentation tweaks and 
simplifying some test code (also includes pgperltidy).

Вложения

0001-fixup-CREATE-DATABASE-make-LOCALE-apply-to-all-colla.patch

Re: Order changes in PG16 since ICU introduction

От

Jeff Davis

Дата:

16 июня 2023 г., 17:42:10

On Fri, 2023-06-16 at 16:50 +0200, Peter Eisentraut wrote:
> This looks good to me.
>
> Attached is small fixup patch with some documentation tweaks and
> simplifying some test code (also includes pgperltidy).

Thank you. Committed with your fixups.

Regards,
    Jeff Davis

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Order changes in PG16 since ICU introduction

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения