Collation and primary keys
От | Jeff Davis |
---|---|
Тема | Collation and primary keys |
Дата | |
Msg-id | 3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com обсуждение исходный текст |
Ответы |
Re: Collation and primary keys
|
Список | pgsql-hackers |
(The problem and possible solutions are not specific to primary keys, but I'm focusing on PKs for the purposes of this email.) Currently, users who don't make any explicit choice about collation end up with primary key indexes that use a libc natural language collation. This default is exactly wrong: many users run into PK index inconsistencies, and few understand the risks. A recent conversation I had reinforced this point: they didn't think about it much at the time they created the databases, and then ended up with inconsistent PK indexes. I have made some attempt to find users who benefit from the default. I've found a couple possible examples: * The libc C.UTF-8 locale was a reasonable default (though not a natural language collation). But now that we have C.UTF-8 available from the builtin provider, then we should encourage that instead of relying on the slower, platform-specific libc implementation. * Our ICU implementation has some gaps around non-UTF8 encodings, which might push users of those encodings to libc. I'm still not convinced that such users would want libc collators for all of their primary keys, though. Aside from those examples, most users who understand their collation needs well enough to make an informed choice don't want libc, and probably want to be selective about which indexes use natural language vs memcmp. The problems with the status quo have been discussed at length: * inconsistent PK indexes * behavior depends on the specific version of the specific platform (and can change with OS updates) * index builds are much slower Meanwhile, the benefits of the status quo are unclear: * why would we expect that the user values text ordering enough to accept the problems listed above? * why would we expect to want natural-language sort for a PK? * in a multi-lingual world, why would we expect any one locale to be so special as to control the default comparator for all text PKs? * why would we expect the initdb-time environment to represent the right locale? Perhaps the initdb default specifically doesn't matter so much -- that might be treated as a low-level tool called by packaged scripts. But that's just moves the problem: someone needs to make that choice on behalf of users who don't explicitly choose for themselves. So I think we need to do something. That could be better guidance, initdb default changes, technical changes, or some combination thereof. Possibilities include: 1. Status quo. 2. Guide users towards the builtin collation provider, which uses memcmp() for all text PKs, avoiding all of the problems. The downside is that results coming from the index will not be sorted in a natural language order, so 'Z' will sort before 'a'. If that's a problem, the user can specify a COLLATE clause to their ORDER BY, or sort in their application. (Possibly change the initdb default.) 3. Guide users towards the ICU "und" locale, which mitigates some of the problems and provides a compromise natural language sort order that may be better than memcmp() in a lot of locales. (Possibly change the initdb default.) 4. Interrupt the chain of defaults somewhere such that PKs end up defaulting to memcmp() even if the database collation is something else. This requires some technical changes so that equality searches still work with the indexes even if the collations don't match (so long as they are both deterministic). It could cause performance regressions for range scans on PK indexes. I brought something like this up in the past[1] and there was a strong consensus that this was a bad idea, but I'm listing it here anyway. 5. Transform PK values with strxfrm() or u_getSortKey() before indexing them. While the behavior of the transformation function may change with a new release of the provider, it seems less likely to cause a problem for equality searches, and therefore carries a lower risk for PKs. The downside is that the keys will be larger and there are still some risks, including bugs in the implementation (which is not just a theoretical concern). Other ideas? Thoughts? Regards, Jeff Davis [1] https://www.postgresql.org/message-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com
В списке pgsql-hackers по дате отправления: