Re: Collation and primary keys
От | Laurenz Albe |
---|---|
Тема | Re: Collation and primary keys |
Дата | |
Msg-id | 6613f821f8d1cc9366a2fbf48d96f70f7ccf29cd.camel@cybertec.at обсуждение исходный текст |
Ответ на | Collation and primary keys (Jeff Davis <pgsql@j-davis.com>) |
Ответы |
Re: Collation and primary keys
|
Список | pgsql-hackers |
On Tue, 2025-07-15 at 17:34 -0700, Jeff Davis wrote: > Currently, users who don't make any explicit choice about collation end > up with primary key indexes that use a libc natural language collation. > This default is exactly wrong: [...] > > So I think we need to do something. That could be better guidance, > initdb default changes, technical changes, or some combination thereof. > > Possibilities include: > > 1. Status quo. > > 2. Guide users towards the builtin collation provider, which uses > memcmp() for all text PKs, avoiding all of the problems. The downside > is that results coming from the index will not be sorted in a natural > language order, so 'Z' will sort before 'a'. If that's a problem, the > user can specify a COLLATE clause to their ORDER BY, or sort in their > application. (Possibly change the initdb default.) > > 3. Guide users towards the ICU "und" locale, which mitigates some of > the problems and provides a compromise natural language sort order that > may be better than memcmp() in a lot of locales. (Possibly change the > initdb default.) > > 4. Interrupt the chain of defaults somewhere such that PKs end up > defaulting to memcmp() even if the database collation is something > else. This requires some technical changes so that equality searches > still work with the indexes even if the collations don't match (so long > as they are both deterministic). It could cause performance regressions > for range scans on PK indexes. I brought something like this up in the > past[1] and there was a strong consensus that this was a bad idea, but > I'm listing it here anyway. > > 5. Transform PK values with strxfrm() or u_getSortKey() before indexing > them. While the behavior of the transformation function may change with > a new release of the provider, it seems less likely to cause a problem > for equality searches, and therefore carries a lower risk for PKs. The > downside is that the keys will be larger and there are still some > risks, including bugs in the implementation (which is not just a > theoretical concern). > > Other ideas? Thoughts? #4 and #5 are appealing in that they attempt to magically do the right thing without making any hard choices, but I have my doubts. If somebody uses ALTER TABLE to add a primary key later (perhaps after having made do with a unique index before) and the query semantics change, that would not be great. I have a radical proposal: Rather than having "initdb" default to whatever locale is in the environment, make it default the the builtin provider and the C collation. Wherever people need a natural language collation, they can say so explicitly. That would do nothing for existing installations, but it would get rid of the problem for new clusters. Not that I want to present Oracle as an example to follow in general, but that's how they are doing it, and while I do hear complaints from Oracle users, I have yet to hear a complaint about the default binary collation. Yours, Laurenz Albe
В списке pgsql-hackers по дате отправления: