Collation and primary keys

Поиск
Список
Период
Сортировка
От Jeff Davis
Тема Collation and primary keys
Дата
Msg-id 3e84e861362e971cf8c7d5e4770207d0235947e1.camel@j-davis.com
обсуждение исходный текст
Ответы Re: Collation and primary keys
Список pgsql-hackers
(The problem and possible solutions are not specific to primary keys,
but I'm focusing on PKs for the purposes of this email.)

Currently, users who don't make any explicit choice about collation end
up with primary key indexes that use a libc natural language collation.
This default is exactly wrong: many users run into PK index
inconsistencies, and few understand the risks. A recent conversation I
had reinforced this point: they didn't think about it much at the time
they created the databases, and then ended up with inconsistent PK
indexes.

I have made some attempt to find users who benefit from the default.
I've found a couple possible examples:

  * The libc C.UTF-8 locale was a reasonable default (though not a
natural language collation). But now that we have C.UTF-8 available
from the builtin provider, then we should encourage that instead of
relying on the slower, platform-specific libc implementation.

  * Our ICU implementation has some gaps around non-UTF8 encodings,
which might push users of those encodings to libc. I'm still not
convinced that such users would want libc collators for all of their
primary keys, though.

Aside from those examples, most users who understand their collation
needs well enough to make an informed choice don't want libc, and
probably want to be selective about which indexes use natural language
vs memcmp.

The problems with the status quo have been discussed at length:

  * inconsistent PK indexes
  * behavior depends on the specific version of the specific platform
    (and can change with OS updates)
  * index builds are much slower

Meanwhile, the benefits of the status quo are unclear:

  * why would we expect that the user values text ordering enough
    to accept the problems listed above?
  * why would we expect to want natural-language sort for a PK?
  * in a multi-lingual world, why would we expect any one locale
    to be so special as to control the default comparator for all
    text PKs?
  * why would we expect the initdb-time environment to represent
    the right locale?

Perhaps the initdb default specifically doesn't matter so much -- that
might be treated as a low-level tool called by packaged scripts. But
that's just moves the problem: someone needs to make that choice on
behalf of users who don't explicitly choose for themselves.

So I think we need to do something. That could be better guidance,
initdb default changes, technical changes, or some combination thereof.

Possibilities include:

1. Status quo.

2. Guide users towards the builtin collation provider, which uses
memcmp() for all text PKs, avoiding all of the problems. The downside
is that results coming from the index will not be sorted in a natural
language order, so 'Z' will sort before 'a'. If that's a problem, the
user can specify a COLLATE clause to their ORDER BY, or sort in their
application. (Possibly change the initdb default.)

3. Guide users towards the ICU "und" locale, which mitigates some of
the problems and provides a compromise natural language sort order that
may be better than memcmp() in a lot of locales. (Possibly change the
initdb default.)

4. Interrupt the chain of defaults somewhere such that PKs end up
defaulting to memcmp() even if the database collation is something
else. This requires some technical changes so that equality searches
still work with the indexes even if the collations don't match (so long
as they are both deterministic). It could cause performance regressions
for range scans on PK indexes. I brought something like this up in the
past[1] and there was a strong consensus that this was a bad idea, but
I'm listing it here anyway.

5. Transform PK values with strxfrm() or u_getSortKey() before indexing
them. While the behavior of the transformation function may change with
a new release of the provider, it seems less likely to cause a problem
for equality searches, and therefore carries a lower risk for PKs. The
downside is that the keys will be larger and there are still some
risks, including bugs in the implementation (which is not just a
theoretical concern).

Other ideas? Thoughts?

Regards,
    Jeff Davis

[1]
https://www.postgresql.org/message-id/b7a9f32eee8d24518f791168bc6fb653d1f95f4d.camel@j-davis.com




В списке pgsql-hackers по дате отправления: