Обсуждение: Change initdb default to the builtin collation provider
-------
Summary
-------
The libc collation provider is a bad default[1]. The builtin collation
provider is a good default, so let's use that.
----------
Motivation
----------
The initdb default is what we choose for new postgres instances when
we have little information about the user's collation needs. The
default has no effect on upgrades, which always use the previous
instance's locale settings.
There are a number of factors to consider when choosing a default:
* Risk of primary key inconsistencies due to libc or ICU updates
* Performance
* Quality of query semantics for a variety of scripts, languages and
regions (excluding the final result order)
* Final result ordering / display
In the absence of specific user requirements, these factors weigh
heavily in favor of the builtin collation provider, and heavily
against libc.
With the builtin provider, there's no risk of primary key or plain
index inconsistencies, the performance is great (ordering with
memcmp()), and the query semantics are based on Unicode.
-------------------
Why does it matter?
-------------------
Arguably, we could just not worry and let various service providers,
tools, scripts, packages, and wrappers make the choice independently.
But that just moves the problem -- someone still needs to make that
choice. Collectively, we have built up some knowledge about collation
here on -hackers, and it would be good to offer it as guidance. And an
initdb default is a good way to offer that guidance.
Unifying around one default also creates a more consistent, tested,
and documented Postgres experience that benefits hackers and users
alike.
--------------------------------
What's the catch? Display order.
--------------------------------
The builtin provider uses code point order, i.e. memcmp(), so the
final result display order is less human-friendly. For instance, 'Z'
comes before 'a'.
That problem is annoying, but *much* easier to fix than the other
factors. The user might add a COLLATE clause to the final ORDER BY, or
perform the sort in the application layer or presentation layer.
Other providers offer a better final display order, but it comes at a
heavy price: index inconsistencies and poor performance. Those
problems are hard to address in an existing system. Some users may be
willing to pay that price, but it should be opt-in.
Furthermore, in the default case, we don't even really know which
language and region to use. We infer it from the environment variable
LC_COLLATE at initdb time, but that's a weak signal: there's little
reason to think that the OS admin, DBA, and end user are all in the
same locale.
In general, there's little reason to think that a single locale for
display order is enough for a whole database. Often, databases are
used (directly or indirectly) by people from dozens of locales. When
per-locale display order becomes an issue, it will be necessary to add
COLLATE clauses or application logic to tailor to the end user
regardless, so the database default locale won't be useful.
For all of these reasons, display order is the wrong thing to optimize
for when the user doesn't specify anything. We should prioritize the
other factors, and for those other factors, the builtin provider is
the best.
------------
Why not ICU?
------------
ICU is better than libc in a lot of ways:
* Better performance
* Platform-independent
* Easier to manage it as a separate library
But fundamentally, I don't think it's a great default, because it
favors final result display order at the risk of primary key
inconsistencies.
------------------
Other Alternatives
------------------
In a previous thread[1], I laid out some alternatives. If someone
disagrees with this proposal, please choose one of those or suggest a
new one.
The most interesting alternative, in my opinion, is #4, but that was
soundly rejected.
---------------------
Which builtin locale?
---------------------
All builtin locales use the exact same ordering: they sort by code
point. Code point order is stable, so primary keys and plain indexes
remain consistent across upgrades forever.
The difference is in case conversion and character classification
semantics:
1. C: Only basic ASCII semantics which never change.
2. PG_C_UTF8: Provides "simple" Unicode semantics. In spirit, this
is similar to the libc "C.UTF-8" locale available on Linux. It's
also similar to the default semantics of at least one big
commercial database, making migrations easier.
3. PG_UNICODE_FAST: Provides "full" Unicode semantics. It's more
aligned with the SQL standard, which specifies in an example the
uppercase of 'ß' is 'SS'.
For the latter two locales, expression and partial indexes depending
on these semantics may be subject to inconsistencies after a Unicode
update.
I propose changing the default to PG_C_UTF8 because it seems simple
and practical. However, I'm also fine with PG_UNICODE_FAST if those
affected by the "full" case mapping find it helpful. "C" is also a
possibility, but the query semantics suffer. All are better than libc.
-------
Details
-------
The mechanics of the default itself are being worked out here[2]. The
concrete proposal here is to commit those patches, and then
change DEFAULT_LOCALE_PROVIDER to be COLLPROVIDER_BUILTIN and
DEFAULT_BUILTIN_LOCALE to whatever we choose here.
Note: the builtin provider requires UTF-8, which can potentially
conflict with the LC_CTYPE. Fortunately, when the builtin provider is
being used, LC_CTPE has little effect. To further reduce the
consequences of LC_CTYPE when using the builtin provider, another
patch[3] fixes tsearch to parse based on the database default locale
rather than depending on LC_CTYPE.
Comments welcome.
Regards,
Jeff Davis
[1]
https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
[2]
https://www.postgresql.org/message-id/7d424dc0b032b30a22220634d12377bf59524bdb.camel@j-davis.com
[3]
https://www.postgresql.org/message-id/0151ad01239e2cc7b3139644358cf8f7b9622ff7.camel@j-davis.com
On 11.10.25 02:48, Jeff Davis wrote: > The builtin provider uses code point order, i.e. memcmp(), so the > final result display order is less human-friendly. For instance, 'Z' > comes before 'a'. > > That problem is annoying, but*much* easier to fix than the other > factors. The user might add a COLLATE clause to the final ORDER BY, or > perform the sort in the application layer or presentation layer. I remain violently opposed to this idea. I don't understand how it could be acceptable to just not provide a good display order by default and have everyone rewrite their queries. > ICU is better than libc in a lot of ways: > > * Better performance > * Platform-independent > * Easier to manage it as a separate library > > But fundamentally, I don't think it's a great default, because it > favors final result display order at the risk of primary key > inconsistencies. I don't understand. We have a versioning system for ICU collations? Does it not work?
On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:
> I remain violently opposed to this idea. I don't understand how it
> could be acceptable to just not provide a good display order by
> default
> and have everyone rewrite their queries.
I assume that you favor alternative 3 listed here[1], which is to use
ICU "und" as the default. Is that correct? Or do you prefer to get the
locale from the environment at initdb time?
One thing you may not have considered is that if the provider is
builtin, a lot more users are likely to learn about and use ICU,
because they will see an unfriendly display order and try to figure out
why. Then they'll be more prepared for upgrades and more likely to see
and respond to a version mismatch.
> I don't understand. We have a versioning system for ICU collations?
> Does it not work?
I have 27 versions of ICU installed by compiling them from source, and
I compile Postgres in my sleep, so it's fine for me.
But for the default user, who's never really considered collation until
after they are already in trouble, having inconsistent primary keys all
over the place is not a great experience. ICU is certainly better than
libc, but I still think people should approach it with non-zero
knowledge.
Regards,
Jeff Davis
[1]
https://www.postgresql.org/message-id/3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
On Fri, 2025-10-17 at 15:02 -0700, Jeff Davis wrote:
> On Fri, 2025-10-17 at 17:23 +0200, Peter Eisentraut wrote:
> > I remain violently opposed to this idea. I don't understand how it
> > could be acceptable to just not provide a good display order by
> > default
> > and have everyone rewrite their queries.
>
> I assume that you favor alternative 3 listed here[1], which is to use
> ICU "und" as the default. Is that correct? Or do you prefer to get
> the
> locale from the environment at initdb time?
Right now we're still stuck with the worst possible default: libc. Can
you make a more concrete counter-proposal here that sorts through some
of the details?
* Should we base the ICU locale on the environment, or just default
everyone to the "und" locale?
* If ICU support is disabled, how does that affect the defaults?
* If using the environment, what happens if the locale is not supported
by ICU (in particular "C" or "C.UTF-8")?
* What would be the default encoding, or should that come from the
environment?
* The ICU provider has some weaknesses around non-UTF8 encodings
because of casts from wchar_t and the use of tolower() in
downcase_identifier(). Are those potential blockers, and if so, are
they fixable?
* Can we try harder to find an acceptable way to use memcmp() for the
indexes by default, at least primary keys, even if the database
collation is ICU? I know that I've argued for this in the past and it's
been soundly rejected[1], but some variation on this idea could be
worthy of consideration.
Regards,
Jeff Davis
[1]
https://www.postgresql.org/message-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote:
> -------
> Summary
> -------
>
> The libc collation provider is a bad default[1]. The builtin
> collation
> provider is a good default, so let's use that.
The attached patches implement a more modest proposal which does not
conflict with Peter's objection about the display order:
0001: If the encoding is unspecified, and cannot be determined from the
locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII.
0002: If the provider is unspecified, and the locale is C or C.UTF-8,
then use the builtin provider.
Motivation:
* UTF-8 seems safer than SQL_ASCII when the locale is compatible with
either.
* Whether the "C" locale uses the builtin provider or the libc provider
is mostly about the catalog representation, because the implementation
is the same. I don't have a strong motivation for this change, it just
clarifies that libc is not actually being used when the locale is "C".
* I think most users of the "C.UTF-8" locale would be better off with
the builtin provider, which benefits from important optimizations.
Note:
This would mean that "initdb --no-locale" would select UTF-8 and the
builtin provider with locale "C", whereas previously it would have
selected SQL_ASCII and the libc provider (though it didn't ever really
use libc internally). I'm not sure if others want this behavior or if
it would be surprising.
Regards,
Jeff Davis
Вложения
> The libc collation provider is a bad default[1]. The builtin collation > provider is a good default, so let's use that. Agreed! I've been in so many situations where a libc collation being the default has caused problems down the line, but never in a situation where it being default has been helpful. > In the absence of specific user requirements, these factors weigh > heavily in favor of the builtin collation provider, and heavily > against libc. Even worse is that the current default uses whatever was set in the environment of the session that invokes initdb. This is very unlikely to be the default anyone wants, especially since these environment variables follows through ssh on debian bases systems by default. Me having sv_SE set on my local computer doesn't make it likely to be a reasonable default locale if I ssh to a server to run initdb. > The builtin provider uses code point order, i.e. memcmp(), so the > final result display order is less human-friendly. For instance, 'Z' > comes before 'a'. > That problem is annoying, but *much* easier to fix than the other > factors. The user might add a COLLATE clause to the final ORDER BY, or > perform the sort in the application layer or presentation layer. I'd say that this would be a _good_ feature of choosing a generic unicode collation by default. It's immediately obvious that you need to do something if you want ordering according to some specific language's rules. > Furthermore, in the default case, we don't even really know which > language and region to use. We infer it from the environment variable > LC_COLLATE at initdb time, but that's a weak signal: there's little > reason to think that the OS admin, DBA, and end user are all in the > same locale. If I'm a Turkish person working for a German company and my environment variables happens to specify tr_TR when I run initdb I have not made a conscious choice and it may take years before someone reports an issue with Ö being sorted after O instead of at the end of the alphabet, at which point recifying the situation can be unnecessarily tricky. > I propose changing the default to PG_C_UTF8 because it seems simple > and practical. However, I'm also fine with PG_UNICODE_FAST if those > affected by the "full" case mapping find it helpful. "C" is also a > possibility, but the query semantics suffer. All are better than libc. These are great options for a default for initdb, since we don't have any knowledge of which language specific collation might be appropriate. Maybe we should also document that it's recommended to set locale when running CREATE DATABASE unless the builtin semantics are fine? -- Anders Åstrand Percona
On Tue, 2026-03-10 at 09:37 +0100, Anders Åstrand wrote:
>
> I'd say that this would be a _good_ feature of choosing a generic
> unicode
> collation by default. It's immediately obvious that you need to do
> something
> if you want ordering according to some specific language's rules.
That's a good point: if the default is not subtle, then the user is
likely to notice quickly, and consider it more detail what they
actually want. A default that is subtly wrong might go unnoticed until
it's too late to change.
>
Regards,
Jeff Davis
On Fri, Oct 31, 2025 at 5:30 PM Jeff Davis <pgsql@j-davis.com> wrote: > The attached patches implement a more modest proposal which does not > conflict with Peter's objection about the display order: > > 0001: If the encoding is unspecified, and cannot be determined from the > locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII. I don't know if this is exactly the right proposal, but I think it's probably appropriate to start gently pushing people towards UTF-8 rather than anything else. Unicode has largely won, AFAICT, and the use cases for anything else are increasingly narrow. I don't think we should try to be coercive, but there's a reasonable presumption that people who haven't said what they want probably want UTF8. > 0002: If the provider is unspecified, and the locale is C or C.UTF-8, > then use the builtin provider. I'm much less convinced about this idea. I think the number of people who will be unhappy about the less-user-friendly sort order changes is probably quite high. It's reasonable to want something more stable and better version-controlled than libc, but switching to a simple code-point sort seems like a high price to pay for that. -- Robert Haas EDB: http://www.enterprisedb.com
> On 10 Mar 2026, at 16:12, Robert Haas <robertmhaas@gmail.com> wrote: > ..there's a reasonable presumption that people who haven't said what they want > probably want UTF8. +1 -- Daniel Gustafsson
On Fri, 2025-10-31 at 14:30 -0700, Jeff Davis wrote: > On Fri, 2025-10-10 at 17:48 -0700, Jeff Davis wrote: > > ------- > > Summary > > ------- > > > > The libc collation provider is a bad default[1]. The builtin > > collation > > provider is a good default, so let's use that. > > The attached patches implement a more modest proposal which does not > conflict with Peter's objection about the display order: > > 0001: If the encoding is unspecified, and cannot be determined from the > locale (i.e. the locale is C), then use UTF-8 rather than SQL_ASCII. > > 0002: If the provider is unspecified, and the locale is C or C.UTF-8, > then use the builtin provider. I think that would be an improvement, but I am still much more in favor of your original proposal to use the C collation by default. Peter objected: > I don't understand how it could be acceptable to just not provide > a good display order by default and have everyone rewrite their queries. I consider it acceptable. Oracle does it like that by default. Yes, Oracle's behavior is not necessarily what we want to emulate, but I don't remember hearing Oracle users complain about that (and I have heard them complain about other things). He also said: > I don't understand. We have a versioning system for ICU collations? > Does it not work? Well, it works in that it alerts you that you may have index corruption. Good - but a default behavior that excludes the possibility of index corruption after an OS upgrade would work much better for most users. Yours, Laurenz Albe
On Tue, 2026-03-10 at 11:12 -0400, Robert Haas wrote: > I don't know if this is exactly the right proposal, but I think it's > probably appropriate to start gently pushing people towards UTF-8 > rather than anything else. Unicode has largely won, AFAICT, and the > use cases for anything else are increasingly narrow. I don't think we > should try to be coercive, but there's a reasonable presumption that > people who haven't said what they want probably want UTF8. If their environment's LC_CTYPE is UTF8-based, they already get UTF-8. If it isn't, we can either: (a) Fall back to LC_CTYPE=C, which is the only UTF8-compatible locale available everywhere. C is actually not a terrible fallback: it doesn't actually affect many things, because I have moved almost everything to use the database default locale. (b) Warn or error unless they explicitly specify the encoding with -E. But the former is likely to be ignored and the latter is not what I'd call "gentle". Which of these do you think is the right approach? There's narrower question about what we do with LC_CTYPE=C. Currently we use SQL_ASCII encoding, which doesn't seem like a great default, and we could change that to default to UTF8. And another question about whether we change the meaning of --no-locale. > > I'm much less convinced about this idea. I think the number of people > who will be unhappy about the less-user-friendly sort order changes > is > probably quite high. It's reasonable to want something more stable > and > better version-controlled than libc, but switching to a simple > code-point sort seems like a high price to pay for that. Surely inconsistent indexes and poor performance are also a high price, so how do you weigh the prices against each other? We sweat over single-digit performance regressions in fairly specific cases all the time, but here we're 3X slower for index builds: https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/ and 2-5X slower for Sort: https://www.postgresql.org/message-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com and others don't seem very concerned, so I feel like I'm missing something. Regards, Jeff Davis
On Tue, Mar 10, 2026 at 3:04 PM Jeff Davis <pgsql@j-davis.com> wrote: > If their environment's LC_CTYPE is UTF8-based, they already get UTF-8. > If it isn't, we can either: > > (a) Fall back to LC_CTYPE=C, which is the only UTF8-compatible locale > available everywhere. C is actually not a terrible fallback: it doesn't > actually affect many things, because I have moved almost everything to > use the database default locale. > > (b) Warn or error unless they explicitly specify the encoding with -E. > But the former is likely to be ignored and the latter is not what I'd > call "gentle". > > Which of these do you think is the right approach? I'm a little confused as to how this relates to what you were asking before. I thought you were proposing to pick UTF-8 rather than SQL_ASCII when LC_CTYPE=C, but that's not on this list of options. To be honest, I'd probably be ready to support making the default encoding UTF8 regardless of the environment, and you have to use -E if you want anything else. I think there are still people using other encodings, but I believe it to be a small minority at this point. > There's narrower question about what we do with LC_CTYPE=C. Currently > we use SQL_ASCII encoding, which doesn't seem like a great default, and > we could change that to default to UTF8. And another question about > whether we change the meaning of --no-locale. I think SQL_ASCII is a terrible default. Nobody actually wants that unless they're trying to get out of a sticky situation. Making it opt-in must be right. I do not know what the question about --no-locale is. > We sweat over single-digit performance regressions in fairly specific > cases all the time, but here we're 3X slower for index builds: > > https://www.depesz.com/2024/06/11/how-much-speed-youre-leaving-at-the-table-if-you-use-default-locale/ > > and 2-5X slower for Sort: > > https://www.postgresql.org/message-id/64039a2dbcba6f42ed2f32bb5f0371870a70afda.camel@j-davis.com > > and others don't seem very concerned, so I feel like I'm missing > something. <insert shrug emoji here> At the end of the day, we're all just guessing. My experience working for EDB is that we have a number of customers who care about sort order quite a lot, and we've had to sweat blood to make them happy. And, on a personal level, I have a hard time understanding why anyone would be OK with a sort order that puts Álvaro after Zebra instead of between Alvaro and Beatriz, because that seems extremely frustrating. However, these are just personal biases. I'm much more likely to hear from the customers who care a lot about the details of how something works than I am to hear from the customers who are perfectly happy to take the defaults, because people who are happy don't contact support at all and people who are unhappy about relatively normal things get handled by support; I get the weird cases. And everybody is going to have different experiences. Presumably, your experience is that the indexing and sorting performance is a big concern for the users you support, and that's why you favor prioritizing that part of the experience. That's perfectly legitimate, but it's different from my experience. My experience is that when I tell people they can use collate "C" to speed up sorting, they tell me that's a stupid workaround that doesn't give them the answers that they want, which obviously colors my viewpoint on this question in the same way that your experiences color yours. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas wrote: > To be honest, I'd probably be ready to support making the default > encoding UTF8 regardless of the environment, and you have to use -E > if you want anything else. I think there are still people using > other encodings, but I believe it to be a small minority at this > point. It would be interesting to have the point of view of Asian users about this. Recently, the suggestion to retire GB18030 in favor of UTF-8 was met with the objection that GB18030 was likely preferred by users from China [1]. Another example against UTF-8 that I found notable, is Tatsuo Ishii mentioning that Japanese users tend use --no-locale rather than UTF-8 locales [2]. Also, it's not obvious how initdb could choose an UTF-8 locale regardless of the environment. For instance, let's say it finds LC_ALL="fr_FR.iso885915@euro", what would it do? Maybe look at the UTF-8 locales on the system. Here's a subset of what it would find on my system: C.utf8 en_AG en_AG.utf8 en_AU.utf8 en_BW.utf8 en_CA.utf8 en_DK.utf8 en_GB.utf8 en_HK.utf8 en_IE.utf8 ... tr_TR.utf8 From that kind of list, which locale should it pick and why? Personally I think that ignoring the environment's LC_* for the collations would be fine if we went for builtin/C.UTF-8 by default, as $subject suggests. But the level of enthusiasm for that from the community seems much lower than it would need to be for that kind of change to be acceptable. [1] https://www.postgresql.org/message-id/45b4b689-0e78-4d30-a5f9-1a39d01ab2b7%40ww-it.cn [2] https://www.postgresql.org/message-id/20230608.104535.2171011311090815110.t-ishii%40sranhm.sra.co.jp Best regards, -- Daniel Vérité https://postgresql.verite.pro/
On Tue, 2026-03-10 at 18:42 +0100, Laurenz Albe wrote:
>
> I consider it acceptable. Oracle does it like that by default.
> Yes, Oracle's behavior is not necessarily what we want to emulate,
> but I
> don't remember hearing Oracle users complain about that (and I have
> heard them complain about other things).
Thank you. That's useful indirect evidence of what a significant number
of users want, or at least what they find acceptable.
Regards,
Jeff Davis
On Wed, 2026-03-11 at 17:28 +0100, Daniel Verite wrote:
> Also, it's not obvious how initdb could choose an UTF-8 locale
> regardless of the environment.
Right.
> Personally I think that ignoring the environment's LC_* for the
> collations would be fine if we went for builtin/C.UTF-8 by default,
> as
> $subject suggests. But the level of enthusiasm for that from the
> community seems much lower than it would need to be for that kind of
> change to be acceptable.
At this point in the release cycle, it seems too late for $subject to
happen in 19. But I do find this discussion useful, so I hope we can
continue while it has some attention.
Regards,
Jeff Davis
On Wed, 2026-03-11 at 08:47 -0400, Robert Haas wrote:
> At the end of the day, we're all just guessing.
Part of the reason for that is that changing collation is so difficult
that we have very few examples of users moving real workloads from one
collation to another.
> My experience working
> for EDB is that we have a number of customers who care about sort
> order quite a lot, and we've had to sweat blood to make them happy.
Thank you. I have one burning question: for these users who care deeply
about sort order, which scenario best describes their needs?
(a) they mostly work in a single locale (if so, does it match their
UNIX environment?); or
(b) one locale (which one?) is good enough for a variety of locales
because even if it's not perfect, it's still better than ASCII; or
(c) they somehow partition their data by locale and use multiple
locales; or
(d) they have a variety of indexes on the same column using different
collations to satisfy queries from users in different locales
I have found it very difficult to get an answer to that question. When
I press users for details (in the sample of users I've been able to
reach), usually they back off on the need for sort order, and instead
focus on case insensitivity (in which case I suggest the builtin C.UTF-
8).
> And, on a personal level, I have a hard time understanding why anyone
> would be OK with a sort order that puts Álvaro after Zebra instead of
> between Alvaro and Beatriz, because that seems extremely frustrating.
I tend to agree, and I wish we had a way to handle this at a
"presentation" layer rather than pushing the whole thing down into
indexes (storage layer).
In theory, pushing collation down to indexes could offer performance
advantages, but in practice humans don't read a lot of data, so a post-
processing step would be efficient in most cases.
> That's perfectly legitimate, but it's different from my
> experience. My experience is that when I tell people they can use
> collate "C" to speed up sorting, they tell me that's a stupid
> workaround that doesn't give them the answers that they want, which
> obviously colors my viewpoint on this question in the same way that
> your experiences color yours.
"C" is especially unappealing because it doesn't even get basic case
transformations right outside of ASCII.
Regards,
Jeff Davis
On Wed, 2026-03-11 at 08:47 -0400, Robert Haas wrote: > My experience is that when I tell people they can use > collate "C" to speed up sorting, they tell me that's a stupid > workaround that doesn't give them the answers that they want, which > obviously colors my viewpoint on this question in the same way that > your experiences color yours. That makes sense - I would be surprised if everybody were happy with the C collation's sort order. On the other hand, I have had lots of reports about corrupted indexes that need rebuilding (only today one person in my course mentioned it), and I find that people don't exactly appreciate the prospect of having to rebuild dozens of indexes after an upgrade, when they want to keep the down time short. My vision of a better future is like this: PostgreSQL defaults to the C collation. People will express unhappiness about the way names get sorted. "Easy", we tell them, "change that column's collation to a natural language collation". They do it and are happy. The big advantage: if you have only two or three indexes in your database that are sorted in a collation other than C, the likelihood for index corruption will be way lower. For example, the unique constraint on your part number column that contains values like 'XY-1-13*' or '*P1-12_A' (which are pretty likely to be affected by the subtle changes in libc collations) will be sorted in the C collation, which is just fine for everybody. This approach to collations seems to work well for Oracle users, so why not for us? Yours, Laurenz Albe
On Wed, 2026-03-11 at 21:05 +0100, Laurenz Albe wrote:
> The big advantage: if you have only two or three indexes in your
> database that are sorted in a collation other than C, the likelihood
> for index corruption will be way lower. For example, the unique
> constraint on your part number column that contains values like
> 'XY-1-13*' or '*P1-12_A' (which are pretty likely to be affected by
> the subtle changes in libc collations) will be sorted in the C
> collation, which is just fine for everybody.
Agreed. The collation problems are not because it's used in the handful
of indexes where it's useful; the problems happen when it's used
everywhere.
If a collation version change is detected, and a few indexes need to be
REINDEXed, I think users understand that. But it's pretty difficult to
explain to a user that all text indexes (including primary keys) need
to be reindexed, and that there's no way to keep track of what still
needs to be done.
Regards,
Jeff Davis
> And, on a personal level, I have a hard time understanding why anyone > would be OK with a sort order that puts Álvaro after Zebra instead of > between Alvaro and Beatriz, because that seems extremely frustrating. And I would find it extremly frustrating if Åstrand was sorted between Alvaro and Beatriz instead of after Zebra where it belongs. This is kind of the point. There is no generally correct way to sort alphabetically, which is why we should force anyone wanting to do so to make an active choice about which rules should be used. I think a default that's obviously wrong is better than a default that's subtly wrong in this case. -- Anders Åstrand Percona
On Wed, Mar 11, 2026 at 2:20 PM Jeff Davis <pgsql@j-davis.com> wrote: > Part of the reason for that is that changing collation is so difficult > that we have very few examples of users moving real workloads from one > collation to another. Yes. I think actually one of the big challenges right now is making sure that when you initdb to do a pg_upgrade, you get the right settings to make the upgrade work. The extent to which any given proposal makes that better or worse is definitely a point to consider. > Thank you. I have one burning question: for these users who care deeply > about sort order, which scenario best describes their needs? > > (a) they mostly work in a single locale (if so, does it match their > UNIX environment?); or > > (b) one locale (which one?) is good enough for a variety of locales > because even if it's not perfect, it's still better than ASCII; or > > (c) they somehow partition their data by locale and use multiple > locales; or > > (d) they have a variety of indexes on the same column using different > collations to satisfy queries from users in different locales I don't have total information, but I think they mostly use a single locale. If they have extremely specific needs, they are likely to end up with ICU, else they pick a glibc locale. I have no idea how likely that glibc locale is to match their environment. I wouldn't bet on it being the norm, but I wouldn't bet against whatever they have in the environment being more usable than "C". > > And, on a personal level, I have a hard time understanding why anyone > > would be OK with a sort order that puts Álvaro after Zebra instead of > > between Alvaro and Beatriz, because that seems extremely frustrating. > > I tend to agree, and I wish we had a way to handle this at a > "presentation" layer rather than pushing the whole thing down into > indexes (storage layer). > > In theory, pushing collation down to indexes could offer performance > advantages, but in practice humans don't read a lot of data, so a post- > processing step would be efficient in most cases. It's tough if people have range scans. Not everybody does, but they also don't know whether or not they will want them when they're making setup choices. Picking a locale that matches their desired sort order *in case* they end up using range scans in some queries feels like the "safe" coice. > > That's perfectly legitimate, but it's different from my > > experience. My experience is that when I tell people they can use > > collate "C" to speed up sorting, they tell me that's a stupid > > workaround that doesn't give them the answers that they want, which > > obviously colors my viewpoint on this question in the same way that > > your experiences color yours. > > "C" is especially unappealing because it doesn't even get basic case > transformations right outside of ASCII. I completely agree. I dislike it when providers change collation behavior because I don't really believe the narrative that people from a particular political unit have a unified view of how sorting should be done. For example, my native language is English, and I have a view that Á should go between A and B even though Á is not an English character. Google says that's not the normal English sort order and that accented characters should be pushed to the end, but as a native Englsh speaker I find that idea ridiculous and I can't really imagine anyone wanting it. Perhaps my imagination is too limited. Where it gets really subjective is with strings like alvaro, Álvaro, and .-Alvaro. There's no rule that anyone in the United States learns in elementary school that answers that question. We can ask users what they want, but there's not a single right answer that everyone knows, the way everyone knows a < b. So I feel like it would be pretty defensible to have the default be something that is either case-sensitive or not, that either skips nonalphabetic characters or sorts them in code-point order or handles them in literally any other basically sane and understandable way that somebody wants to implement, but something that treats an accented a in the same way it treats a smiley face emoji -- i.e. like a character that nobody knows how to sort -- sits really poorly with me. Am I injecting too much of my personal view into what PostgreSQL's behavior ought to be? Possibly! So maybe the people who are saying that defaulting to C is fine and that making people make an explicit choice if they want something else are right. but my personal guess is that we will make a bunch of people unhappy. I'm not actually super-worried about the people who have very particular ideas about what they want, because there is a decent chance that they're already asking for exactly that thing. We might mess them up a little bit, but they'll figure it out. What I'm most worried about is the population of users -- which I guess to be large -- who do not have a strong preference but won't be happy with something as dumb as "C". If even a small fraction of users create a database using "C" unintentionally and load a terabyte of data into it before realizing that all their text indexes are sorting "wrong", I suspect that's not going to be much fun. Said differently, I don't have an enormous amount of confidence in the environment being a good source of information about what people want, but my experience thus far suggests to that "C" probably isn't what they want, which makes me skeptical of making it the default. Obviously, I could be wildly incorrect. Maybe people will just be super-happy about faster sorting and life will be great. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 2026-03-12 at 10:04 -0400, Robert Haas wrote:
> Yes. I think actually one of the big challenges right now is making
> sure that when you initdb to do a pg_upgrade, you get the right
> settings to make the upgrade work.
pg_upgrade should copy the locale settings to the new cluster as of
9637badd9f. If there are still some rough edges here, let me know.
>
> I don't have total information, but I think they mostly use a single
> locale. If they have extremely specific needs, they are likely to end
> up with ICU, else they pick a glibc locale. I have no idea how likely
> that glibc locale is to match their environment. I wouldn't bet on it
> being the norm, but I wouldn't bet against whatever they have in the
> environment being more usable than "C".
That's interesting. In other words, (in your sample) users aren't
worried about the precise sort order in their native language; it's
just that ASCII is particularly bad, and almost any "real" locale is
more appealing.
If the concern is mostly that ASCII is particularly bad, how much of
that is because case is a high-order bit (i.e. 'A' < 'Z' < 'a' < 'z')?
> >
> It's tough if people have range scans.
Range scans using a natural language collation are dubious. It can't be
for a prefix search; LIKE 'myprefix%' needs the index to be defined
with text_pattern_ops (which is code point order), so the default isn't
going to work for them anyway.
(A prefix search can't be implemented with a range scan in natural
language collation because, e.g. in the cs_CZ locale, 'cha' does not
fall between 'ch' and 'ci'.)
So how often is a range scan using a natural language collation
actually useful? I'm sure there are some real cases, but I'd say it's
usually a mistake and they are quite possibly getting wrong results.
> Not everybody does, but they
> also don't know whether or not they will want them when they're
> making
> setup choices. Picking a locale that matches their desired sort order
> *in case* they end up using range scans in some queries feels like
> the
> "safe" coice.
I have trouble understanding this perspective: slow all indexes down
(and sorts, too), and risk index inconsistencies just in case someone
ends up doing a range scan on one of the indexes? How is that safer?
> >
> What I'm
> most worried about is the population of users -- which I guess to be
> large -- who do not have a strong preference but won't be happy with
> something as dumb as "C". If even a small fraction of users create a
> database using "C" unintentionally and load a terabyte of data into
> it
> before realizing that all their text indexes are sorting "wrong", I
> suspect that's not going to be much fun.
This is where we differ: even in that case, I believe all (or nearly
all) of that user's indexes would be better.
When you look at the conditions that must be true for an index with a
natural language collation to be useful, it's certainly not the normal
case, and I'd bet it's closer to "rare":
* the use case must be real (not relying on faulty assumptions about
lexicographical ordering)
* the input data must be large enough to benefit from an index scan
* one of the following must be true:
- the index needs to be correlated with the heap order (seems
unlikely; correlation usually happens with sequences, timestamps, etc.,
not natural language text values); or
- needs to be eligible for an index only scan (plausible); or
- the amount of data read must be small enough that correlation
with the heap doesn't matter
* the result data needs to be small enough for a human to consume it
(otherwise why bother with natural language?)
* the performance improvement must be enough to offset the penalty
for equality searches and index maintenance
While each of those is plausible, when combined, I think it's far from
the typical case.
It's perfectly reasonable to say the user may be upset about the way
the final result order looks, but making all the index orderings worse
is not a good way to fix that.
Regards,
Jeff Davis
On Thu, Mar 12, 2026 at 3:20 PM Jeff Davis <pgsql@j-davis.com> wrote: > pg_upgrade should copy the locale settings to the new cluster as of > 9637badd9f. If there are still some rough edges here, let me know. Oh, interesting! I obviously missed that. > > I don't have total information, but I think they mostly use a single > > locale. If they have extremely specific needs, they are likely to end > > up with ICU, else they pick a glibc locale. I have no idea how likely > > that glibc locale is to match their environment. I wouldn't bet on it > > being the norm, but I wouldn't bet against whatever they have in the > > environment being more usable than "C". > > That's interesting. In other words, (in your sample) users aren't > worried about the precise sort order in their native language; it's > just that ASCII is particularly bad, and almost any "real" locale is > more appealing. Some users are definitely worried about the specific ordering. I didn't mean to imply the contrary. The most demanding customers want an ordering that is bug-compatible with a previous system they've used; many can settle for something that is reasonably similar to some other system they've used, or simply want something appropriate to their country and language. Some are less demanding and, yeah, just anything that is basically reasonable is OK. > Range scans using a natural language collation are dubious. It can't be > for a prefix search; LIKE 'myprefix%' needs the index to be defined > with text_pattern_ops (which is code point order), so the default isn't > going to work for them anyway. > > (A prefix search can't be implemented with a range scan in natural > language collation because, e.g. in the cs_CZ locale, 'cha' does not > fall between 'ch' and 'ci'.) > > So how often is a range scan using a natural language collation > actually useful? I'm sure there are some real cases, but I'd say it's > usually a mistake and they are quite possibly getting wrong results. That seems really pessimistic to me. Many people don't test their code as thoroughly as they should, but a lot of people test it at least somewhat. > I have trouble understanding this perspective: slow all indexes down > (and sorts, too), and risk index inconsistencies just in case someone > ends up doing a range scan on one of the indexes? How is that safer? I mean ... you seem to be imagining that people care about sort and index lookup speed drastically more than what I have experienced. People mostly build indexes if they dump and restore, which most people do very rarely, bordering on never. Or if they reindex occasionally, which is more common, but still not an everyday occurrence for the overwhelming majority of users. Index lookups are common, but the difference between a single-entry lookup on a collation-C index and a single-entry lookup on a glibc-collation index isn't enough to worry about in any case I've ever encountered. I'm sure it's possible to construct cases, but are they really that common? Most queries do a lot more other work than the time they spend doing btree comparisons, at least IME. On the other hand, the possibility of having to use a sequential scan to find entries between X and Y is a potentially huge blow. I would never choose to use collate "C" if I thought there was any chance I would want a range scan. Surely it isn't worth the risk. If I know the data is something where collation is never going to matter (like PostgreSQL log lines ingested into a table) then yes, collate "C" makes a lot of sense. > * the use case must be real (not relying on faulty assumptions about > lexicographical ordering) > * the input data must be large enough to benefit from an index scan > * one of the following must be true: > - the index needs to be correlated with the heap order (seems > unlikely; correlation usually happens with sequences, timestamps, etc., > not natural language text values); or > - needs to be eligible for an index only scan (plausible); or > - the amount of data read must be small enough that correlation > with the heap doesn't matter > * the result data needs to be small enough for a human to consume it > (otherwise why bother with natural language?) > * the performance improvement must be enough to offset the penalty > for equality searches and index maintenance > > While each of those is plausible, when combined, I think it's far from > the typical case. I don't. I mean, let's suppose I've got a web page that displays a report with peoples names, street addresses, cities, states, countries, and telephone numbers. I can click on a column header to sort by that column. For which of those columns do I want a natural language sort? I would argue probably most of them. I might know that my state and country name are all ASCII, and then it doesn't matter. But surely people's names, for example, or their cities, could contain non-ASCII characters? And my report might easily be paginated, to avoid loading too much data in the web browser at once. If displaying the first page I want something like LIMIT 100, and I don't want to have to compute the entire result set to get those first 100 rows. Back when I wrote web applications, before starting at EDB, this is the kind of thing that I did all the time, for like ten years straight. I had plenty of text fields that could have used collate "C", because they contained things like part numbers or account numbers or whatever. But anything that contained a person's name or a company name or any other kind of name that is assigned by humans rather than generated by a computer could contain any of the characters that humans use, and should be sorted the way humans like. And isn't this a totally normal kind of application for somebody to write? It sure was for me. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, 2026-03-12 at 10:04 -0400, Robert Haas wrote: > So maybe the people who are saying that defaulting to C is fine and > that making people make an explicit choice if they want something else > are right. but my personal guess is that we will make a bunch of > people unhappy. Defaulting to C will make a bunch of people unhappy, I agree. However, that is a kind of problem that they will discover early on during development or testing, and it is easily remedied for those columns where alphabetical sorting order matters. But a good number of people are already unhappy because they have index corruption (today I dealt with another support case). And the people who are diligent enough to know that they should reindex after an OS update are unhappy because of the additional down time. I maintain that the second kind of unhappiness weighs heavier, and I find this need to reindex to be one of the most embarrassing flaws in an otherwise great piece of software. This is a value judgement, and I may be wrong. > If even a small fraction of users create a > database using "C" unintentionally and load a terabyte of data into it > before realizing that all their text indexes are sorting "wrong", I > suspect that's not going to be much fun. Hm. So reindexing a large table after an OS upgrade is acceptable, but reindexing a large table after a bulk load is not? The problem you describe could have been avoided with a bit of testing, but the need to reindex after an OS upgrade sometimes cannot be avoided. > Obviously, I could be wildly incorrect. Maybe people will just be > super-happy about faster sorting and life will be great. I have not experienced sort speed as a frequent problem, and the life with the C collation won't be all that great. But I believe that it would be better than a life with index corruption. Yours, Laurenz Albe
On Thu, 2026-03-12 at 15:58 -0400, Robert Haas wrote:
>
>
> Back when I wrote web applications, before starting at EDB, this is
> the kind of thing that I did all the time, for like ten years
> straight. I had plenty of text fields that could have used collate
> "C", because they contained things like part numbers or account
> numbers or whatever. But anything that contained a person's name or a
> company name or any other kind of name that is assigned by humans
> rather than generated by a computer could contain any of the
> characters that humans use, and should be sorted the way humans like.
> And isn't this a totally normal kind of application for somebody to
> write? It sure was for me.
Yes, I agree that's perfectly normal application. I'm just not sure how
useful it is that the index order matches the expected display order by
default. While it's plausible that it could benefit from a few indexes
with a natural language collation, there are many practical reasons why
it might not.
And if it's a mix of fields, some of which are ASCII and some natural
language, then that's not a particularly strong argument that the
indexes should default to natural language. That leaves you unable to
use the indexes for prefix search on any field, which is a pretty
normal thing to want to do in that kind of application also.
I guess what I'm saying is that I agree that users want an appealing
final result order. But even assuming that's a requirement, pushing
that down into all text indexes by default is a bad trade-off: the cost
side is too high, and to see a net performance benefit, there are too
many "ifs".
Regards,
Jeff Davis