Discussion: Windows default locale vs initdb

Windows default locale vs initdb

From
Thomas Munro
Date:
Hi,

Moving this topic into its own thread from the one about collation
versions, because it concerns pre-existing problems, and that thread
is long.

Currently initdb sets up template databases with old-style Windows
locale names reported by the OS, and they seem to have caused us quite
a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example,
"German_German.1252" vs "German_Switzerland.1252" which seem to get
confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but
the entire contents of win32setlocale.c supports the theory that
Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

I suppose that was the only form available at the time the code was
written, so there was no choice.  The question we asked ourselves
multiple times in the other thread was how we're supposed to get to
the modern BCP 47 form when creating the template databases.  It looks
like one possibility, since Vista, is to call
GetUserDefaultLocaleName()[2], which doesn't appear to have been
discussed before on this list.  That doesn't allow you to ask for the
default for each individual category, but I don't know if that is even
a concept for Windows user settings.  It may be that some of the other
nearby functions give a better answer for some reason.  But one thing
is clear from a test that someone kindly ran for me: it reports
standardised strings like "en-NZ", not strings like "English_New
Zealand.1252".
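
For illustration, here is a minimal C sketch of the call discussed above. It assumes a Windows build (Vista or later); the helper name default_locale_name() is hypothetical and not part of PostgreSQL. On other platforms the sketch simply reports failure, so it shows only the shape of the call, not what initdb does.

```c
#include <stdbool.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper (an assumption, not PostgreSQL code): fetch the user's
 * default locale name in BCP 47 form, e.g. "en-NZ".  Returns true on success.
 * On failure, or on non-Windows platforms, buf is left as an empty string.
 */
bool
default_locale_name(char *buf, size_t buflen)
{
    if (buflen == 0)
        return false;
    buf[0] = '\0';
#ifdef _WIN32
    {
        wchar_t     wname[LOCALE_NAME_MAX_LENGTH];

        /* Returns 0 on failure; details are available via GetLastError(). */
        if (GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH) == 0)
            return false;

        /* BCP 47 names are plain ASCII, so any ANSI code page can hold them. */
        if (WideCharToMultiByte(CP_ACP, 0, wname, -1,
                                buf, (int) buflen, NULL, NULL) == 0)
        {
            buf[0] = '\0';
            return false;
        }
        return true;
    }
#else
    return false;
#endif
}
```

A caller would pass a buffer of at least LOCALE_NAME_MAX_LENGTH bytes and fall back to another method if the function returns false.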

No patch, but I wondered if any Windows hackers have any feedback on
relative sanity of trying to fix all these problems this way.

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename



Re: Windows default locale vs initdb

From
Pavel Stehule
Date:


Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and it worked.

Regards

Pavel



Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:


On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


> Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and it worked.



My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

cheers

andrew

Re: Windows default locale vs initdb

From
Pavel Stehule
Date:


On Mon, Apr 19, 2021 at 12:52 PM Andrew Dunstan <andrew@dunslane.net> wrote:


> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

I had different information, but something was still wrong, because there were no Czech locales in pg_collation.

 


Re: Windows default locale vs initdb

From
Dave Page
Date:


On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:

> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

This is from a regular Azure Database for PostgreSQL single server:

postgres=> select version();
                          version                           
------------------------------------------------------------
 PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row) 

And this is from the new Flexible Server preview:

postgres=> select version();
                                                     version                                                     
-----------------------------------------------------------------------------------------------------------------
 PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)

So I guess it's a case of "it depends".


Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:
On 4/19/21 10:26 AM, Dave Page wrote:
>
>
> On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net
> <mailto:andrew@dunslane.net>> wrote:
>
>
>     My understanding from Microsoft staff at conferences is that
>     Azure's PostgreSQL SAS runs on Linux, not Windows.
>
>
> This is from a regular Azure Database for PostgreSQL single server:
>
> postgres=> select version();
>                           version                           
> ------------------------------------------------------------
>  PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
> (1 row) 
>
> And this is from the new Flexible Server preview:
>
> postgres=> select version();
>                                                      version
> -----------------------------------------------------------------------------------------------------------------
>  PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
> (1 row)
>
> So I guess it's a case of "it depends".
>

Good to know. A year or two back, at more than one conference, I tried to enlist some of these folks in helping us with
Windows PostgreSQL, and their reply was that they knew nothing about it because they were on Linux :-) I guess things
change over time.
 


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows default locale vs initdb

From
Peter Eisentraut
Date:
On 19.04.21 07:42, Thomas Munro wrote:
> It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list.  That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.

pg_newlocale_from_collation() doesn't support collcollate != collctype 
on Windows anyway, so that wouldn't be an issue.



Re: Windows default locale vs initdb

From
Noah Misch
Date:
On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
> 
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

> I suppose that was the only form available at the time the code was
> written, so there was no choice.

Right.

> The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases.  It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2]

> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
2003 R2, this is a good time to let that support end.



Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>
>> The question we asked ourselves
>> multiple times in the other thread was how we're supposed to get to
>> the modern BCP 47 form when creating the template databases.  It looks
>> like one possibility, since Vista, is to call
>> GetUserDefaultLocaleName()[2]
>
>> No patch, but I wondered if any Windows hackers have any feedback on
>> relative sanity of trying to fix all these problems this way.
>
> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
> 2003 R2, this is a good time to let that support end.

The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb but not for a backend in most cases.

You can get the POSIX-ish locale name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. It would still work for legacy locales, though. Please find attached a POC patch showing this approach.


Regards,

Juan José Santamaría Flecha
Attachments

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
>> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>> > The question we asked ourselves
>> > multiple times in the other thread was how we're supposed to get to
>> > the modern BCP 47 form when creating the template databases.  It looks
>> > like one possibility, since Vista, is to call
>> > GetUserDefaultLocaleName()[2]
>>
>> > No patch, but I wondered if any Windows hackers have any feedback on
>> > relative sanity of trying to fix all these problems this way.
>>
>> Sounds reasonable.  If PostgreSQL v15 would otherwise run on Windows Server
>> 2003 R2, this is a good time to let that support end.
>>
> The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with
> setlocale(). It might be reasonable for initdb but not for a backend in most cases.

Agreed.  Only for initdb, and only if you didn't specify a locale name
on the command line.

> You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs
> is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing
> this approach.

Now that museum-grade Windows has been defenestrated, we are free to
call GetUserDefaultLocaleName().  Here's a patch.

One thing you did in your patch that I disagree with, I think, was to
convert a BCP 47 name to a POSIX name early, that is, s/-/_/.  I think
we should use the locale name exactly as Windows (really, under the
covers, ICU) spells it.  There is only one place in the tree today
that really wants a POSIX locale name, and that's LC_MESSAGES,
accessed by GNU gettext, not Windows.  We already had code to cope
with that.

I think we should also convert to POSIX format when making the
collname in your pg_import_system_collations() proposal, so that
COLLATE "en_US" works (= a SQL identifier), but that's another
thread[1].  I don't think we should do it in collcollate or
datcollate, which is a string for the OS to interpret.
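
As a small illustration of the s/-/_/ conversion mentioned above, the sketch below derives a POSIX-style collation name from a BCP 47 name while leaving the original string untouched for the OS. The function name bcp47_to_collname() is hypothetical, an assumption for this example rather than anything in the PostgreSQL tree.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not PostgreSQL code): copy a BCP 47 locale name into
 * dst, replacing '-' with '_' so that the result can serve as a collation
 * name such as en_US.  Returns 0 on success, -1 if dst is too small.
 */
int
bcp47_to_collname(const char *src, char *dst, size_t dstlen)
{
    size_t      i;

    for (i = 0; src[i] != '\0'; i++)
    {
        /* Always leave room for the terminating NUL. */
        if (i + 1 >= dstlen)
            return -1;
        dst[i] = (src[i] == '-') ? '_' : src[i];
    }
    if (i >= dstlen)
        return -1;
    dst[i] = '\0';
    return 0;
}
```

Under this scheme only the SQL-visible name is rewritten; the value stored in collcollate or datcollate would stay in the form Windows reported.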

With my garbage collector hat on, I would like to rip out all of the
support for traditional locale names, eventually.  Deleting kludgy
code is easy and fun -- 0002 is a first swing at that -- but there
remains an important unanswered question.  How should someone
pg_upgrade a "English_Canada.1252" cluster if we now reject that name?
 We'd need to do a conversion to "en-CA", or somehow tell the user to.
Hmmmm.
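
One way to picture the conversion step is a lookup table, sketched below in C. This is only an assumption about how such a mapping could look: the table, its three entries and the function bcp47_for_old_locale_name() are hypothetical, and a real solution would need a complete and maintained source of names, which is exactly the open question.

```c
#include <stddef.h>
#include <string.h>

/* One old-style Windows locale name and its BCP 47 counterpart. */
typedef struct
{
    const char *old_name;
    const char *bcp47_name;
} LocaleNameMap;

/* Illustrative entries only; a real table would be far larger. */
static const LocaleNameMap locale_name_map[] = {
    {"English_Canada.1252", "en-CA"},
    {"English_New Zealand.1252", "en-NZ"},
    {"English_United States.1252", "en-US"},
};

/*
 * Hypothetical helper: return the BCP 47 name for an old-style name, or
 * NULL if the name is unknown and the user has to be asked.
 */
const char *
bcp47_for_old_locale_name(const char *old_name)
{
    size_t      i;

    for (i = 0; i < sizeof(locale_name_map) / sizeof(locale_name_map[0]); i++)
    {
        if (strcmp(old_name, locale_name_map[i].old_name) == 0)
            return locale_name_map[i].bcp47_name;
    }
    return NULL;
}
```

The NULL return covers the "somehow tell the user to" case: any name without an entry would have to be reported rather than guessed.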

[1] https://www.postgresql.org/message-id/flat/CAC%2BAXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg%40mail.gmail.com

Attachments

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about
some warnings I needed to fix.  That'll teach me to post a patch
tested with "ci-os-only: windows".  Looking more closely at some error
messages that report GetLastError() where I'd mixed up %d and %lu, I
see also that I didn't quite follow existing conventions for wording
when reporting Windows error numbers, so I fixed that too.
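
To make the %d versus %lu point concrete: GetLastError() returns a DWORD, which is an unsigned long on Windows, so it should be printed with %lu. The sketch below is portable C with a hypothetical helper name, and the "error code %lu" wording is modeled on existing PostgreSQL messages as an assumption about the convention meant here.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not PostgreSQL code): format a message that reports
 * a Windows error number.  The code is taken as unsigned long, the type a
 * DWORD from GetLastError() would be cast to, and printed with %lu.
 * Returns the value returned by snprintf().
 */
int
format_win32_error(char *buf, size_t buflen,
                   const char *what, unsigned long code)
{
    return snprintf(buf, buflen, "%s: error code %lu", what, code);
}
```

On Windows the call site would pass (unsigned long) GetLastError() as the last argument.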

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of
win32_langinfo()...  Windows still considers WIN1252 to be the default
ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
sure what to make of that.  The goal here was to give Windows users
good defaults, but WIN1252 is probably not what most people actually
want.  Hmph.

Attachments

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Now that museum-grade Windows has been defenestrated, we are free to
> call GetUserDefaultLocaleName().  Here's a patch.

This LGTM. 

> I think we should also convert to POSIX format when making the
> collname in your pg_import_system_collations() proposal, so that
> COLLATE "en_US" works (= a SQL identifier), but that's another
> thread[1].  I don't think we should do it in collcollate or
> datcollate, which is a string for the OS to interpret.

That thread has been split [1], but that is how the current version behaves.

> With my garbage collector hat on, I would like to rip out all of the
> support for traditional locale names, eventually.  Deleting kludgy
> code is easy and fun -- 0002 is a first swing at that -- but there
> remains an important unanswered question.  How should someone
> pg_upgrade a "English_Canada.1252" cluster if we now reject that name?
>  We'd need to do a conversion to "en-CA", or somehow tell the user to.
> Hmmmm.
 
Is there a safe way to do that in pg_upgrade or would we be forcing users to pg_dump into the new cluster?
 

Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for whether "accordingly" still applies, by the logic of
> win32_langinfo()...  Windows still considers WIN1252 to be the default
> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
> sure what to make of that.  The goal here was to give Windows users
> good defaults, but WIN1252 is probably not what most people actually
> want.  Hmph.

Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.


Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>> As for whether "accordingly" still applies, by the logic of
>> win32_langinfo()...  Windows still considers WIN1252 to be the default
>> ANSI code page for "en-US", though it'd work with UTF-8 too.  I'm not
>> sure what to make of that.  The goal here was to give Windows users
>> good defaults, but WIN1252 is probably not what most people actually
>> want.  Hmph.
>
>
> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will
> use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means.  Suppose we decided to
insist by adding a ".UTF-8" suffix to the name, as that page says we
can now that we're on Windows 10+, when building the default locale
name (see experimental 0002 patch, attached).  It initially seemed to
have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
 t
 ---
- t
+ f
 (1 row)

About the pg_upgrade problem, maybe it's OK ... existing old format
names should continue to work, but we can still remove the weird code
that does locale name tweaking, right?  pg_upgraded databases should
contain fixed names (ie that were fixed by old initdb so should
continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending
patches to CI here...

[1] https://cirrus-ci.com/task/6423238052937728

Attachments

Re: Windows default locale vs initdb

From
Juan José Santamaría Flecha
Date:

On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
>> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
>
> I'm still confused about what that means.  Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached).  It initially seemed to
> have the right effect:
>
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".

Let me try to explain this using the "Beta: Use Unicode UTF-8 for worldwide language support" option [1]. 

- Currently in a system with the language settings of "English_United States" and that option disabled, when executing initdb you get:

The database cluster will be initialized with locale "English_United States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR:  character with byte sequence 0xc5 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8; it is caused by the tr_tr locales being encoded in WIN1254. We can discuss this in another thread, and I can propose a patch.

- If we enable the UTF-8 support option, then the same test goes as:

The database cluster will be initialized with locale "English_United States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
 to_char
---------
 şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.
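
The claim above could be checked with a few lines of C, sketched here under the assumption of a Windows build. GetACP() is the Win32 call that returns the current ANSI code page identifier; the helper name acp_survives_setlocale() is hypothetical, and on other platforms the sketch just returns -1.

```c
#include <locale.h>
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical check (not PostgreSQL code): returns 1 if the ANSI code page
 * is the same before and after setlocale(LC_ALL, locale), 0 if it changed,
 * and -1 if setlocale() failed or this is not a Windows build.
 */
int
acp_survives_setlocale(const char *locale)
{
#ifdef _WIN32
    UINT        before = GetACP();

    if (setlocale(LC_ALL, locale) == NULL)
        return -1;
    return (GetACP() == before) ? 1 : 0;
#else
    (void) locale;              /* unused outside Windows */
    return -1;
#endif
}
```

If the statement in the message holds, a call such as acp_survives_setlocale("tr-TR") would return 1 on Windows whatever the registry setting is.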
 
> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:
>
> SELECT 'i'::citext = 'İ'::citext AS t;
>  t
>  ---
> - t
> + f
>  (1 row)

This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ

- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i

Latin_capital_dotted doesn't have the same lower value.
 

Regards,

Juan José Santamaría Flecha

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
<juanjo.santamaria@gmail.com> wrote:
> TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be
> done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the
v3-0001 and v3-0003 patches might have a future.  And it sounds like
we might need to investigate maybe defending ourselves against the ACP
being different than what we expect (ie not matching the database
encoding)?  Did I understand correctly that you're looking into that?



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
> > TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be
> > done through the Windows registry and only in recent releases.
>
> Thanks, that was helpful, and so was that SO link.
>
> So it sounds like I should forget about the v3-0002 patch, but the
> v3-0001 and v3-0003 patches might have a future.  And it sounds like
> we might need to investigate maybe defending ourselves against the ACP
> being different than what we expect (ie not matching the database
> encoding)?  Did I understand correctly that you're looking into that?

I'm going to withdraw this entry.  The sooner we get something like
0001 into a release, the sooner the world will be rid of PostgreSQL
clusters initialised with the bad old locale names that the manual
very clearly tells you not to use for databases.... but I don't
understand this ACP/registry vs database encoding stuff and how it
relates to the use of BCP47 locale names, which puts me off changing
anything until we do.



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
Another country has changed its name, and a Windows OS update has
again broken every PostgreSQL cluster in that whole country[1] (or at
least those that had accepted initdb's default choice of locale,
probably most).  Let's get to the bottom of this, because otherwise it
is simply going to keep happening, causing administrative pain for a
lot of people.

Here is a rebase of the basic patch I proposed last time, and a
re-statement of what we know:

1.  initdb chooses a default locale using a technique that gives you
an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"),
non-ASCII ("Norwegian (Bokmål)") string that we are warned we should
not store anywhere.  We store it, and then later it is not recognised.
Instead we should select an IETF BCP 47 locale name, based on stable
ISO country and language codes, like "en-US", "tr-TR" etc.  Here is
the patch to teach initdb to use that, unchanged from v3 except that I
tweaked the docs a bit.

2.  In Windows 10+ it is now also possible to put ".UTF-8" on the end
of locale names.  I couldn't figure out whether we should do that, and
what effect it has on ctypes -- apparently not the effect I expected
(see upthread).  Was our UTF-8 support on Windows already broken, and
this new ".UTF-8" thing is just a new way to reach that brokenness?
Is it OK to continue to choose the "legacy" single byte encodings by
default on that OS, and consider that a separate topic for separate
research?

3.  It is not clear to me how we should deal with pg_upgrade.
Eventually we want all of the old-school names to fade away, and
pg_upgrade would need to be part of that.  Perhaps there is some API
that can be used to translate to the new canonical forms without us
having to maintain translation tables and other messiness in our tree.

4.  Eventually we should probably ban non-ASCII characters from
entering the relevant catalogues (they are shared, so their encoding
is undefined except that it must be a superset of ASCII), and delete
all the old win32setlocale.c kludges, after we reach a point where
everyone should be using exclusively BCP 47.
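
For illustration, here is roughly what asking Windows for the BCP 47
name from point 1 could look like.  This is a sketch under assumptions,
not the patch itself: the helper name default_bcp47_locale() is made up,
and on non-Windows builds it simply reports failure.
GetUserDefaultLocaleName() and LOCALE_NAME_MAX_LENGTH are the real
Windows API names.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper, not PostgreSQL code: copy the user's default locale
 * name in BCP 47 form (e.g. "en-US") into buf.  Returns 1 on success, and
 * 0 on failure or when built for a non-Windows platform.
 */
int
default_bcp47_locale(char *buf, size_t buflen)
{
#ifdef _WIN32
	wchar_t		wname[LOCALE_NAME_MAX_LENGTH];
	int			n = GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH);

	/* n counts the terminating null; 0 means the call failed */
	if (n <= 0 || (size_t) n > buflen)
		return 0;
	for (int i = 0; i < n; i++)
	{
		/* the name is expected to be plain ASCII; reject anything else */
		if (wname[i] > 127)
			return 0;
		buf[i] = (char) wname[i];
	}
	return 1;
#else
	(void) buf;
	(void) buflen;
	return 0;
#endif
}
```

A caller would pass a buffer of at least LOCALE_NAME_MAX_LENGTH bytes
and fall back to some other default if the function returns 0.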

[1] https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org

Attachments

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
I clicked "Trigger" to get a Mingw test run of this, and it failed[1].
I see why: our function win32_langinfo() believes that it shouldn't
call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb:
error: could not find suitable encoding for locale "en-US"'.  I think
it has fallback code that parses the ".1252" or whatever on the end of
the name, but "en-US" hasn't got one.  I don't know the first thing
about Mingw but it looks like a declaration for that function arrived
6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the
problem and the tests pass[3].  As far as I know, we don't support any
Mingw but the very latest: it's not a target with real users who have
version requirements, it's just a developer [in]convenience, so if it
passes on CI and whatever MSYS version "fairywren" runs in the build
farm right now, that should be enough.
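
To make the failure mode concrete, here is a rough sketch of the kind
of suffix parsing described above.  It is not the real
win32_langinfo(); the function name codepage_from_locale_name() is
invented for this example.  It shows why an old-style name yields a
code page while "en-US" yields nothing to parse.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative only: return the numeric code page at the end of an
 * old-style locale name such as "English_United States.1252", or -1 if
 * the name has no all-digit suffix after its last dot.
 */
int
codepage_from_locale_name(const char *name)
{
	const char *dot = strrchr(name, '.');

	if (dot == NULL || dot[1] == '\0')
		return -1;
	for (const char *p = dot + 1; *p; p++)
	{
		if (!isdigit((unsigned char) *p))
			return -1;
	}
	return atoi(dot + 1);
}
```

With a fallback like this alone, "en-US" gives -1, so the encoding has
to come from the OS instead, which is what GetLocaleInfoEx() is for.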

I could just do that in this patch, but I suppose that also means that
someone needs to go through pg_locale.c and other places that test
_MSC_VER not because they actually care about the compiler but because
they want to detect some crusty old Mingw version, and see what else
can be deleted as a result, possibly including a lot of fallback code.
It feels like a separate cleanup for a separate patch.

[1] https://cirrus-ci.com/task/5301814774464512
[2]
https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3] https://cirrus-ci.com/task/6558569718349824



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
Here is a thought that occurs to me, as I follow along with Jeff
Davis's evolving proposals for built-in collations and ctypes:  What
would stop us from dropping support for the libc (sic) provider on
Windows?  That may sound radical and likely to cause extra work for
people on upgrade, but how does that compare to the pain of keeping
this barely maintained code in the tree?  Suppose the idea in this
thread goes ahead and we get people to transition to the modern locale
names: there is non-zero transitional/upgrade pain there too.  How
delicious it would be to just nuke the whole thing from orbit, and
keep only cross-platform code that is maintained with enthusiasm by
active hackers.

That's probably a little extreme, but it's the direction my thoughts
start to go in when confronting the realisation that it's up to us
[Unix hackers making drive-by changes], no one is coming to help us
[from the Windows user community].

I've even heard others talk about dropping Windows completely, due to
the maintenance imbalance.  This would be somewhat more fine grained.
(One could use a similar argument to drop non-NTFS filesystems and
turn on POSIX-mode file links, to end that other locus of struggle.)



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier
in this thread[1] was just always broken on Windows, we just didn't
ever test with UTF-8 before Meson took over; it's skipped now, see
commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end
(.UTF-8 must be a special case); I don't know if we should look into a
better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without
explicitly specifying a locale; probably a good number of real systems
in the wild actually use EDB's graphical installer which initialises a
cluster and has its own way of choosing the locale, as discussed in
Ertan's thread[3]

[1]
https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com

Attachments

Re: Windows default locale vs initdb

From
Ertan Küçükoglu
Date:
Hi,

I am a complete noob about PostgreSQL development.
I don't know about the PostgreSQL CI system.
I will be needing some help as to how to do the tests.
I have access to different Windows OSes (v10, Server 2022 mainly).
These systems can be set to English or Turkish locales if needed.
I can also add new Windows versions if needed.
I do not know how to use patch files. I am also not sure what tests I should do.
Do I need to set up a Windows build system for PostgreSQL CI?
Will I download some files (EXE, etc) ready for testing? Copy them over an existing installation for testing?

Thanks for your help.

Regards,
Ertan

On Mon, 22 Jul 2024 at 05:52, Thomas Munro <thomas.munro@gmail.com> wrote:
> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com

Re: Windows default locale vs initdb

From
Zaid Shabbir
Date:
Hello Thomas,

Can you please list some of the use cases for the patch? Other than Turkish, does this patch have an impact on other locales too?


Regards,
Zaid


On Mon, Jul 22, 2024 at 7:52 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Mon, Jul 22, 2024 at 8:38 PM Zaid Shabbir <zaidshabbir@gmail.com> wrote:
> Can you please list some of the use cases for the patch? Other than Turkish, does this patch have an impact on
> other locales too?

Hi Zaid,

Yes, initdb.exe would use BCP47 codes by default for all languages.
Who knows which country will change its name next?

From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc.
The Windows manual says:

"We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

It's pretty bad for our users when it happens and the Windows locale
name changes: a database cluster that suddenly can't start, and even
after you've figured out why and adjusted the references in
postgresql.conf, you still can't connect.  There is also the problem
that some of the old full names have non-ASCII characters (Türkiye,
São Tomé and Príncipe, Curaçao, Côte d'Ivoire, Åland) which is bad at
least in theory because we use the string at times and in places where
it is not clear what encoding the name itself has.
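
As a small illustration of the non-ASCII concern, a check along these
lines would separate the safe names from the risky ones.  This is a
sketch only; locale_name_is_ascii() is a name made up for this example
and not something in the tree.

```c
#include <stdbool.h>

/*
 * Illustrative only: report whether every byte of a locale name is
 * 7-bit ASCII, i.e. whether it reads the same in any ASCII-superset
 * encoding.
 */
bool
locale_name_is_ascii(const char *name)
{
	for (const unsigned char *p = (const unsigned char *) name; *p; p++)
	{
		if (*p > 127)
			return false;
	}
	return true;
}
```

BCP 47 names like "tr-TR" pass trivially, while a name containing "å"
or "ü" fails no matter which single-byte or UTF-8 encoding it was
written in.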

I don't use Windows myself, I've just been watching this train wreck
replaying in a loop for long enough.  Clearly it's going to take some
time to wean the user community off the unstable names, and it struck
me that the default is probably the main source of them in new
clusters, hence this patch.



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Mon, Jul 22, 2024 at 8:04 PM Ertan Küçükoglu
<ertan.kucukoglu@gmail.com> wrote:
> I am a complete noob about PostgreSQL development.
> I don't know about the PostgreSQL CI system.
> I will be needing some help as to how to do the tests.
> I have access to different Windows OSes (v10, Server 2022 mainly).
> These systems can be set to English or Turkish locales if needed.
> I can also add new Windows versions if needed.
> I do not know how to use patch files. I am also not sure what tests I should do.
> Do I need to set up a Windows build system for PostgreSQL CI?
> Will I download some files (EXE, etc) ready for testing? Copy them over an existing installation for testing?

Sorry, I didn't mean to put you on the spot :-)  Yeah you'd need to
install a compiler, various libraries and tools to be able to build
from source with a patch.  Unfortunately I'm not the best person to
explain how to do that on Windows as I don't use it.  Honestly it
might be a bit too much new stuff to figure out at once just to test
this small patch.  What I'd be hoping for is confirmation that there
are no weird unintended consequences or problems I'm not seeing since
I'm writing blind patches based on documentation only, but it's
probably too much to ask to figure out the whole development
environment and then go on an open ended expedition looking for
unknown problems.



Re: Windows default locale vs initdb

From
Ertan Küçükoglu
Date:
On Mon, 22 Jul 2024 at 14:00, Thomas Munro <thomas.munro@gmail.com> wrote:
> Sorry, I didn't mean to put you on the spot :-)  Yeah you'd need to
> install a compiler, various libraries and tools to be able to build
> from source with a patch.  Unfortunately I'm not the best person to
> explain how to do that on Windows as I don't use it.  Honestly it
> might be a bit too much new stuff to figure out at once just to test
> this small patch.  What I'd be hoping for is confirmation that there
> are no weird unintended consequences or problems I'm not seeing since
> I'm writing blind patches based on documentation only, but it's
> probably too much to ask to figure out the whole development
> environment and then go on an open ended expedition looking for
> unknown problems.

I have already installed Visual Studio 2022 with C++ support, as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
I cloned the source code onto the system.
But I cannot find any "src/tools/msvc" directory. It is missing.
The documentation states that I need everything in there:
"The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
It seems I will need help setting up the build environment.

Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:
On 2024-07-21 Su 10:51 PM, Thomas Munro wrote:
> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com


I have an environment I can use for testing. But what exactly am I 
testing? :-) Install a few "problem" language/region settings, switch 
the system and ensure initdb runs ok?

Other than Turkish, which locales should I install?


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: Windows default locale vs initdb

From
Ertan Küçükoglu
Date:
On Mon, 22 Jul 2024 at 16:44, Andrew Dunstan <andrew@dunslane.net> wrote:
> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?
>
> Other than Turkish, which locales should I install?

Thomas earlier listed a few:
"From a quick search of other recent cases: Czech Republic -> Czechia,
Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc."

I am not sure whether all of them need testing, though.

Thanks & Regards,
Ertan

Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?

I just want to know about any weird unexpected consequences of using
BCP47 locale names, before we change the default in v18.  The only
concrete thing I found so far was that MinGW didn't like it, but I
provided a fix for that.  It'd still be possible to initialise a new
cluster with the old style names if you really want to, but you'd have
to pass it in explicitly; I was wondering if that could be necessary
in some pg_upgrade scenario but I guess not, it just clobbers
template0's pg_database row with values from the source database, and
recreates everything else so I think it should be fine (?).  I am a
little uneasy about the new names not having .encoding but there
doesn't seem to be an issue with that (such locales exist on Unix
too), and the OS still knows which encoding they use in that case.



Re: Windows default locale vs initdb

From
Thomas Munro
Date:
On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > I have an environment I can use for testing. But what exactly am I
> > testing? :-) Install a few "problem" language/region settings, switch
> > the system and ensure initdb runs ok?

I thought a bit more about what to do with the messy .UTF-8 situation
on Windows, and I think I might see a way forward that harmonises the
code and behaviour with Unix, and deletes a lot of special case code.
But it's only theories + CI so far.

0001, 0002:  As before, teach initdb.exe to choose eg "en-US" by default.

0003:  Force people to choose locales that match the database
encoding, as we do on Unix.  That is, forbid contradictory
combinations like --locale="English_United States.1252"
--encoding=UTF8, which are currently allowed (and the world is full of
such database clusters because that is how the EDB installer GUI makes
them).  The only allowed combinations for American English should now
be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8"
--encoding="UTF8".  You can still use the old names if you like, by
explicitly writing --locale="English_United States.1252", but the
encoding then has to be WIN1252.  It's crazy to mix them up, let's ban
that.
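
To spell out the rule in 0003, here is a rough sketch of such a check.
It is not the patch's code: the function name
locale_suffix_matches_encoding() is hypothetical, and it only knows the
two suffix spellings used in this message.  Names without a suffix,
like "en-US", are left for the OS to answer.

```c
#include <string.h>

/*
 * Hypothetical check: returns 1 if the locale name's suffix agrees with
 * the requested encoding, 0 if it contradicts it, and -1 if the name has
 * no suffix this sketch understands (the caller must then ask the OS).
 */
int
locale_suffix_matches_encoding(const char *locale, const char *encoding)
{
	const char *dot = strrchr(locale, '.');

	if (dot == NULL)
		return -1;
	if (strcmp(dot, ".UTF-8") == 0)
		return strcmp(encoding, "UTF8") == 0;
	if (strcmp(dot, ".1252") == 0)
		return strcmp(encoding, "WIN1252") == 0;
	return -1;					/* suffix not covered by this sketch */
}
```

Under this rule "English_United States.1252" with UTF8 is rejected,
while the same name with WIN1252, or "en-US.UTF-8" with UTF8, is
accepted.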

Obviously there is a pg_upgrade case to worry about there.  We'd have
to "fix" the now illegal combinations, and I don't know exactly how
yet.

0004:  Rip out the code that does extra wchar_t conversions for
collations.  If I've understood correctly, we don't need them: if you
have a .UTF-8 locale then your encoding is UTF-8 and should be able to
use strcoll_l() directly.  Right?

0005:  Something similar was being done for strftime().  And we might
as well use strftime_l() instead while we're here (part of general
movement to use _l functions and stop splattering setlocale() all over
the place, for the multithreaded future).
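
For readers less familiar with the _l functions mentioned in 0004 and
0005, here is a minimal sketch of the calling pattern, written against
the POSIX.1-2008 spellings (strcoll_l, strftime_l, locale_t).  The
wrapper names are invented for this example; on Windows the
corresponding CRT functions are _strcoll_l and _strftime_l, taking a
_locale_t from _create_locale().

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <string.h>
#include <time.h>

/* Compare two strings under an explicit locale, without setlocale(). */
int
compare_in_locale(const char *a, const char *b, locale_t loc)
{
	return strcoll_l(a, b, loc);
}

/* Format the weekday name of *tm under an explicit locale. */
size_t
weekday_in_locale(char *buf, size_t buflen, const struct tm *tm, locale_t loc)
{
	return strftime_l(buf, buflen, "%A", tm, loc);
}
```

The locale object would come from newlocale() and be released with
freelocale(); no global state is touched, which is the point for the
multithreaded future.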

These patches pass on CI.  Do they give the expected results when used
on a real Windows system?

There are a few more places where we do wchar_t conversions that could
probably be stripped out too, if my assumptions are correct, and we
could dig further if the basic idea can be validated and people think
this is going in a good direction.

Attachments

Re: Windows default locale vs initdb

From
Ertan Küçükoglu
Date:
> I have already installed Visual Studio 2022 with C++ support, as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
> I cloned the source code onto the system.
> But I cannot find any "src/tools/msvc" directory. It is missing.
> The documentation states that I need everything in there:
> "The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
> It seems I will need help setting up the build environment.

I am willing to test on Windows if I can get help setting up the build environment.
It also seems the documentation needs an update, as I could not find the necessary files.

Thanks & Regards,
Ertan

Re: Windows default locale vs initdb

From
Andrew Dunstan
Date:


On 2024-08-08 Th 4:08 AM, Ertan Küçükoglu wrote:
> I have already installed Visual Studio 2022 with C++ support, as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
> I cloned the source code onto the system.
> But I cannot find any "src/tools/msvc" directory. It is missing.
> The documentation states that I need everything in there:
> "The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
> It seems I will need help setting up the build environment.
>
> I am willing to test on Windows if I can get help setting up the build environment.
> It also seems the documentation needs an update, as I could not find the necessary files.


If you're trying to build the master branch those documents no longer apply. You will need to build using meson, as documented here: <https://www.postgresql.org/docs/17/install-meson.html>


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com