Re: Collation and primary keys

Поиск
Список
Период
Сортировка
От Laurenz Albe
Тема Re: Collation and primary keys
Дата
Msg-id 6613f821f8d1cc9366a2fbf48d96f70f7ccf29cd.camel@cybertec.at
обсуждение исходный текст
Ответ на Collation and primary keys  (Jeff Davis <pgsql@j-davis.com>)
Ответы Re: Collation and primary keys
Список pgsql-hackers
On Tue, 2025-07-15 at 17:34 -0700, Jeff Davis wrote:
> Currently, users who don't make any explicit choice about collation end
> up with primary key indexes that use a libc natural language collation.
> This default is exactly wrong: [...]
>
> So I think we need to do something. That could be better guidance,
> initdb default changes, technical changes, or some combination thereof.
>
> Possibilities include:
>
> 1. Status quo.
>
> 2. Guide users towards the builtin collation provider, which uses
> memcmp() for all text PKs, avoiding all of the problems. The downside
> is that results coming from the index will not be sorted in a natural
> language order, so 'Z' will sort before 'a'. If that's a problem, the
> user can specify a COLLATE clause to their ORDER BY, or sort in their
> application. (Possibly change the initdb default.)
>
> 3. Guide users towards the ICU "und" locale, which mitigates some of
> the problems and provides a compromise natural language sort order that
> may be better than memcmp() in a lot of locales. (Possibly change the
> initdb default.)
>
> 4. Interrupt the chain of defaults somewhere such that PKs end up
> defaulting to memcmp() even if the database collation is something
> else. This requires some technical changes so that equality searches
> still work with the indexes even if the collations don't match (so long
> as they are both deterministic). It could cause performance regressions
> for range scans on PK indexes. I brought something like this up in the
> past[1] and there was a strong consensus that this was a bad idea, but
> I'm listing it here anyway.
>
> 5. Transform PK values with strxfrm() or u_getSortKey() before indexing
> them. While the behavior of the transformation function may change with
> a new release of the provider, it seems less likely to cause a problem
> for equality searches, and therefore carries a lower risk for PKs. The
> downside is that the keys will be larger and there are still some
> risks, including bugs in the implementation (which is not just a
> theoretical concern).
>
> Other ideas? Thoughts?

#4 and #5 are appealing in that they attempt to magically do the right
thing without making any hard choices, but I have my doubts.
If somebody uses ALTER TABLE to add a primary key later (perhaps after
having made do with a unique index before) and the query semantics
change, that would not be great.

I have a radical proposal: Rather than having "initdb" default to
whatever locale is in the environment, make it default the the builtin
provider and the C collation.  Wherever people need a natural language
collation, they can say so explicitly.

That would do nothing for existing installations, but it would get rid
of the problem for new clusters.

Not that I want to present Oracle as an example to follow in general,
but that's how they are doing it, and while I do hear complaints from
Oracle users, I have yet to hear a complaint about the default binary
collation.

Yours,
Laurenz Albe



В списке pgsql-hackers по дате отправления: