Discussion: new environment variable INITDB_LOCALE_PROVIDER
$SUBJECT makes it easier to test other providers, especially the regression tests.

For this to be useful, it should avoid throwing an error for plain "initdb" (without locale flags specified), which means we need defaults for the builtin locale or the ICU locale. I chose "C.UTF-8" and "und" (we could also have environment variables for those too, but that would create some questions when --locale is also specified).

Another benefit is that this would make it easier to change the initdb default, which is being discussed here:

https://www.postgresql.org/message-id/9b259f4c532943e428e9665122f37c099bab250e.camel@j-davis.com

One annoyance is that the tests don't pass when INITDB_LOCALE_PROVIDER=icu. That's because a lot of tests use either --locale=C or --no-locale, and ICU doesn't have a way to interpret that. We could force the provider to be builtin in that case, I suppose.

Another annoyance is that, if INITDB_LOCALE_PROVIDER=builtin, and LC_CTYPE is not UTF-8-compatible, then we need to force LC_CTYPE=C. That affects fewer things than it would with the libc provider, but it still affects some things.

Regards,
Jeff Davis
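To make the proposed defaults concrete, here is a minimal Python sketch of the fallback described above. It is not the patch's code: the function names and the dictionary are hypothetical, and the only facts taken from the message are the variable name INITDB_LOCALE_PROVIDER and the default locales "C.UTF-8" (builtin) and "und" (ICU). Treating an unset variable as libc is an assumption.

```python
import os

# Defaults named in the message above: "C.UTF-8" for the builtin
# provider and "und" for ICU. The table itself is illustrative only.
DEFAULT_LOCALE = {"builtin": "C.UTF-8", "icu": "und"}


def default_locale(provider, explicit_locale=None):
    """Return the locale to use for a provider (hypothetical helper).

    An explicitly supplied locale always wins. libc has no entry in the
    table because it keeps taking its locale from the environment, so
    None is returned for it.
    """
    if explicit_locale is not None:
        return explicit_locale
    return DEFAULT_LOCALE.get(provider)


def provider_from_env(environ=os.environ):
    # Assumption: an unset variable leaves the existing libc default.
    return environ.get("INITDB_LOCALE_PROVIDER", "libc")
```

With defaults like these, a plain "initdb" has a usable locale for every provider, which is the property the message asks for.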
On Tue, 2025-07-29 at 16:55 -0700, Jeff Davis wrote:
> $SUBJECT makes it easier to test other providers, especially the
> regression tests.

Rebased. Changes:

* Use environment variable name PG_LOCALE_PROVIDER, which seems more consistent.
* Updated doc.
* If the provider is builtin and the LC_CTYPE or LC_COLLATE environment variables aren't compatible with UTF-8, it can override those to "C". But if --locale, --lc-ctype, or --lc-collate are specified and incompatible, they will throw an error instead.

Note: when the provider is builtin, the overriding of LC_CTYPE and LC_COLLATE doesn't matter a lot. LC_CTYPE affects the translation of messages from the OS (but not Postgres messages), as well as a few other places that are likely to be fixed soon (e.g. [1]). LC_COLLATE has no effect when the provider is builtin.

In any case, it only happens when those environment variables aren't compatible with UTF-8, and the user hasn't specified any locale settings on the command line. I see this as more of a detail about how the defaults work together that can easily be corrected if the user specifies something different.

Also note: if PG_LOCALE_PROVIDER=libc (or is unset), there should be no behavior change with this patch.

I am planning to commit this soon.

Regards,
Jeff Davis

[1] https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
On Oct 9, 2025, at 12:27, Jeff Davis <pgsql@j-davis.com> wrote:
> * If the provider is builtin and the LC_CTYPE or LC_COLLATE environment
> variables aren't compatible with UTF-8, it can override those to "C".
> But if --locale, --lc-ctype, or --lc-collate are specified and
> incompatible, they will throw an error instead.
Are we assuming that
* if the settings come from command line options, then the user is intentionally doing that, so we throw an error
* if the settings come from env, then the user might not be aware of them, so we only issue a warning?
If that’s the case, I’m not fully convinced by this design. Since initdb is a one-time operation, I think it would be better to require everything to be explicit.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 09.10.2025 at 06:27, Jeff Davis <pgsql@j-davis.com> wrote:
> On Tue, 2025-07-29 at 16:55 -0700, Jeff Davis wrote:
>> $SUBJECT makes it easier to test other providers, especially the
>> regression tests.
>
> Rebased.
>
> Changes:
>
> * Use environment variable name PG_LOCALE_PROVIDER, which seems more
> consistent.

Is this not something that could already be done using PG_TEST_INITDB_EXTRA_OPTS?
On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:
> > * Use environment variable name PG_LOCALE_PROVIDER, which seems
> > more consistent.
>
> Is this not something that could already be done using
> PG_TEST_INITDB_EXTRA_OPTS ?

1. PG_LOCALE_PROVIDER is a documented user-facing option, which will make it easier for users to set their preferred provider in scripts, etc.

2. This change also creates default locales for the builtin and ICU providers, so that initdb without any other locale options will succeed regardless of the provider.

I broke these up into two patches as v3 to make it easier to understand.

These patches are independently useful, but also important if we ever want to change the initdb default to builtin or ICU.

Regards,
Jeff Davis
On Fri, 2025-10-10 at 12:13 +0800, Chao Li wrote:
> Are we assuming that
>
> * if the settings come from command line options, then the user is
> intentionally doing that, so we throw an error
> * if the settings come from env, then the user might not be aware of
> them, so we only issue a warning?
>
> If that’s the case, I’m not fully convinced by this design. Since
> initdb is a one-time operation, I think it would be better to require
> everything to be explicit.

That would have been ideal a long time ago, but plain "initdb" without locale options has succeeded for a long time, using information from the environment. If we make that fail and require the user to specify the options explicitly, I fear that would be too disruptive to the many scripts around.

So we need to do something reasonable when the provider is builtin and LC_CTYPE/LC_COLLATE from the environment are incompatible with UTF-8.

Forcing LC_CTYPE=C and/or LC_COLLATE=C:

* Only happens if:
  - the provider is builtin;
  - LC_CTYPE/LC_COLLATE come from the environment (i.e. --locale/--lc-ctype/--lc-collate are unspecified); and
  - LC_CTYPE/LC_COLLATE are incompatible with UTF-8.
* Has little practical effect because those settings aren't used many places when the provider is builtin or ICU.

so I think a warning is acceptable there.

Regards,
Jeff Davis
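The three conditions above, together with the earlier statement that incompatible command-line values raise an error, can be sketched as follows. This is a simplified Python illustration, not the patch itself: both function names are hypothetical, and the UTF-8 compatibility check is a crude stand-in that only inspects the codeset suffix of the locale name (a name without a codeset is treated as incompatible here, which is an assumption).

```python
def is_utf8_compatible(locale_name):
    """Rough stand-in for the real compatibility check (assumption)."""
    # "C" and "POSIX" are accepted with any encoding.
    if locale_name in ("C", "POSIX"):
        return True
    # Take the codeset between "." and an optional "@modifier".
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


def resolve_category(provider, cli_value, env_value):
    """Return (value, warning) for one of LC_CTYPE / LC_COLLATE.

    cli_value is the value from --locale/--lc-ctype/--lc-collate, or
    None when unspecified; env_value is the value from the environment.
    """
    if env_value is None:
        # An unset environment behaves like the "C" locale.
        env_value = "C"
    if cli_value is not None:
        # Explicit settings are never silently changed.
        if provider == "builtin" and not is_utf8_compatible(cli_value):
            raise ValueError(f"locale {cli_value!r} is incompatible with UTF-8")
        return cli_value, None
    if provider == "builtin" and not is_utf8_compatible(env_value):
        # Environment-derived settings are forced to "C" with a warning.
        return "C", f"overriding {env_value!r} with 'C'"
    return env_value, None
```

For example, `resolve_category("builtin", None, "de_DE.ISO-8859-1")` yields `"C"` plus a warning string, while passing the same locale as `cli_value` raises an error.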
On Oct 11, 2025, at 02:28, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2025-10-10 at 12:13 +0800, Chao Li wrote:
>> Are we assuming that
>> * if the settings come from command line options, then the user is
>> intentionally doing that, so we throw an error
>> * if the settings come from env, then the user might not be aware of
>> them, so we only issue a warning?
>> If that’s the case, I’m not fully convinced by this design. Since
>> initdb is a one-time operation, I think it would be better to require
>> everything to be explicit.
>
> That would have been ideal a long time ago, but plain "initdb" without
> locale options has succeeded for a long time, using information from
> the environment. If we make that fail and require the user to specify
> the options explicitly, I fear that would be too disruptive to the many
> scripts around.
>
> So we need to do something reasonable when the provider is builtin and
> LC_CTYPE/LC_COLLATE from the environment are incompatible with UTF-8.
>
> Forcing LC_CTYPE=C and/or LC_COLLATE=C:
> * Only happens if:
> - the provider is builtin;
> - LC_CTYPE/LC_COLLATE come from the environment (i.e.
> --locale/--lc-ctype/--lc-collate are unspecified); and
> - LC_CTYPE/LC_COLLATE are incompatible with UTF-8.
> * Has little practical effect because those settings aren't
> used many places when the provider is builtin or ICU.
>
> so I think a warning is acceptable there.

Thanks for the explanation; that sounds reasonable. My remaining arguments are:
* If we make that fail, I don’t think it would break existing scripts. The default provider is libc, and you are introducing a new environment variable to set the locale provider, so a plain initdb will not use the builtin provider. The provider could also come from PG_TEST_INITDB_EXTRA_OPTS; I'm OK with a test environment only issuing warnings.
* I am thinking out loud. The builtin provider is more performant but has certain limitations. Some production users may want to try the builtin provider for better performance without being aware of those limitations. Their environment contains the actual LC_CTYPE/LC_COLLATE they want to use, and they set the new environment variable to “builtin”. In this case, a failing “initdb” would make the user clearly aware of the builtin provider's limitations. Otherwise, if the user also ignores the warning messages, the database would be created with an unexpected ctype, which could lead to losses (time, data, etc.).
If neither of those cases applies, then I am fine with the design.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Sat, 2025-10-11 at 08:30 +0800, Chao Li wrote:
> * If we make that fail, I don’t think that would break existing
> scripts. Because the default provider is libc and you are introducing
> a new environment variable to set locale provider, thus a plain
> initdb will not use builtin provider. Maybe provider can come from
> PG_TEST_INITDB_EXTRA_OPTS, I'm ok for test environment to only only
> issue warnings.

I would like it to be possible to change the initdb default in the future to "builtin". See:

https://www.postgresql.org/message-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com

in that case, initdb should be able to succeed without other options.

> * I am thinking loudly. Builtin provider is more performant but with
> certain limitations. Some production users may want to try builtin
> provider for better performance but not being aware of the
> limitation. Their environment contains the actual LC_CTYPE/LC_COLLATE
> they want to use, and they set the new environment variable with
> “builtin” for provider. In this case, failing “initdb” would make the
> user clearly realize the limitation of builtin provider. Otherwise,
> if the user also ignores the warning messages, then the database
> would be created with unexpected ctype, which would lead to loss
> (time, data, etc.)

What limitation and/or loss are you concerned about?

Unless I'm mistaken, LC_CTYPE has very little practical effect when the provider is builtin and the encoding is UTF-8.

The main effect that I'm aware of is that system errors from the OS rely on LC_CTYPE for translation. Ordinary Postgres messages don't need LC_CTYPE, so most of NLS still works even with LC_CTYPE=C; it's just strerror() that depends on LC_CTYPE for the encoding.

LC_CTYPE also affects full text search parsing, but I'm fixing that as part of another patch to use the database locale instead.

I think contrib/fuzzystrmatch may be affected.

Callers of pg_strcasecmp() could be affected, but it's mostly used to compare with ascii anyway.

If you are aware of other areas, please let me know.

Regards,
Jeff Davis
On Oct 11, 2025, at 10:06, Jeff Davis <pgsql@j-davis.com> wrote:
> On Sat, 2025-10-11 at 08:30 +0800, Chao Li wrote:
>> * If we make that fail, I don’t think that would break existing
>> scripts. Because the default provider is libc and you are introducing
>> a new environment variable to set locale provider, thus a plain
>> initdb will not use builtin provider. Maybe provider can come from
>> PG_TEST_INITDB_EXTRA_OPTS, I'm ok for test environment to only only
>> issue warnings.
>
> I would like it to be possible to change the initdb default in the
> future to "builtin". See:
> https://www.postgresql.org/message-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com
> in that case, initdb should be able to succeed without other options.

Yes, if we decide to go down that path, then what I said would no longer be valid.

>> * I am thinking loudly. Builtin provider is more performant but with
>> certain limitations. Some production users may want to try builtin
>> provider for better performance but not being aware of the
>> limitation. Their environment contains the actual LC_CTYPE/LC_COLLATE
>> they want to use, and they set the new environment variable with
>> “builtin” for provider. In this case, failing “initdb” would make the
>> user clearly realize the limitation of builtin provider. Otherwise,
>> if the user also ignores the warning messages, then the database
>> would be created with unexpected ctype, which would lead to loss
>> (time, data, etc.)
>
> What limitation and/or loss are you concerned about?

By the limitation of the builtin provider, I just meant that it supports fewer LC_CTYPE/LC_COLLATE values than the other two providers.
I wasn’t concerned about anything in particular; I was just imagining whether anything could be negatively affected.

> Unless I'm mistaken, LC_CTYPE has very little practical effect when the
> provider is builtin and the encoding is UTF-8.
>
> The main effect that I'm aware of is that system errors from the OS
> rely on LC_CTYPE for translation. Ordinary Postgres messages don't need
> LC_CTYPE, so most of NLS still works even with LC_CTYPE=C; it's just
> strerror() that depends on LC_CTYPE for the encoding.
>
> LC_CTYPE also affects full text search parsing, but I'm fixing that as
> part of another patch to use the database locale instead.
>
> I think contrib/fuzzystrmatch may be affected.
>
> Callers of pg_strcasecmp() could be affected, but it's mostly used to
> compare with ascii anyway.
>
> If you are aware of other areas, please let me know.

Thanks for the explanation. I think I am good now. The latest v3 patch looks good to me.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 10.10.25 20:09, Jeff Davis wrote:
> On Fri, 2025-10-10 at 11:32 +0200, Peter Eisentraut wrote:
>>> * Use environment variable name PG_LOCALE_PROVIDER, which seems
>>> more consistent.
>>
>> Is this not something that could already be done using
>> PG_TEST_INITDB_EXTRA_OPTS ?
>
> 1. PG_LOCALE_PROVIDER is a documented user-facing option, which will
> make it easier for users to set their preferred provider in scripts,
> etc.
>
> 2. This change also creates default locales for the builtin and ICU
> providers, so that initdb without any other locale options will succeed
> regardless of the provider.
>
> I broke these up into two patches as v3 to make it easier to
> understand.
>
> These patches are independently useful, but also important if we ever
> want to change the initdb default to builtin or ICU.

I'm skeptical that we want user-facing environment variables to provide initdb defaults. The use for that hasn't really been explained. For example, I don't recall anyone asking for an environment variable to determine the checksum default. If we did that, then we might end up with an environment variable per option, which would be a lot.

The locale options are already complicated enough; adding more ways to set them, along with new ways for them to interact with other options, adds a lot more complication.

I think in practice initdb is mostly run through packager-provided infrastructure, so this facility would probably have very little impact in practice.
On Tue, 2025-10-14 at 21:51 +0200, Peter Eisentraut wrote:
> I'm skeptical that we want user-facing environment variables to
> provide initdb defaults. The use for that hasn't really been explained.

One motivation was to make it smoother to change the initdb default provider:

https://www.postgresql.org/message-id/9b259f4c532943e428e9665122f37c099bab250e.camel@j-davis.com
https://www.postgresql.org/message-id/e4ac16908dad3eddd3ed73c4862591375a3f0539.camel@j-davis.com

if we were to make that change, then users might have existing scripts and want to use the environment variable to switch it back to libc without modifying the scripts.

If you think we can change the initdb default without introducing an environment variable, then perhaps we don't need v3-0002.

What do you think about v3-0001?

Regards,
Jeff Davis
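The "switch it back to libc without modifying the scripts" scenario can be illustrated with a small Python sketch of how the provider might be resolved. The function and its parameters are hypothetical, and the precedence shown (an explicit --locale-provider option first, then PG_LOCALE_PROVIDER, then the compiled-in default) is an assumption; the thread does not spell out the ordering.

```python
def effective_provider(cli_provider=None, environ=None, compiled_default="libc"):
    """Pick the locale provider (hypothetical resolution order).

    cli_provider stands for the value of --locale-provider, environ for
    the process environment, and compiled_default for whatever default
    initdb ships with.
    """
    environ = {} if environ is None else environ
    if cli_provider is not None:
        # Assumption: an explicit option beats the environment.
        return cli_provider
    # An unset or empty variable falls through to the shipped default.
    return environ.get("PG_LOCALE_PROVIDER") or compiled_default
```

Under these assumptions, if a future release shipped with `compiled_default="builtin"`, an unmodified script run with `PG_LOCALE_PROVIDER=libc` in its environment would still resolve to libc, while scripts that already pass the option explicitly would be unaffected.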