Re: [17] collation provider "builtin"
From | Jeff Davis |
---|---|
Subject | Re: [17] collation provider "builtin" |
Date | |
Msg-id | d5ddd74b03d31939119a96ce79e824246ec39e8c.camel@j-davis.com |
In response to | [17] collation provider "builtin" (Jeff Davis <pgsql@j-davis.com>) |
List | pgsql-hackers |
On Wed, 2023-06-14 at 15:55 -0700, Jeff Davis wrote:
> The locale "C" (and equivalently, "POSIX") is not really a libc locale;
> it's implemented internally with memcmp for collation and
> pg_ascii_tolower, etc., for ctype.
>
> The attached patch implements a new collation provider, "builtin",
> which only supports "C" and "POSIX".

Rebased patch attached. I got some generally positive comments, but it needs some more feedback on the specifics to be committable.

This might be a good time to summarize my thoughts on collation after my work in v16:

* Picking a database default collation other than UCS_BASIC (a.k.a. "C", a.k.a. memcmp(), a.k.a. provider=builtin) is something that should be done intentionally. It's an impactful choice that affects semantics, performance, and upgrades/deployment. Beyond that, our implementation still lacks a good way to manage versions of collation provider libraries and track object dependencies in a safe way to prevent index corruption, so the safest choice is really just to use stable memcmp() semantics.

* The defaults for initdb seem bad in a number of ways, but it's too hard to change that default now (I tried in v16 and reverted it). So the job of making reasonable choices is left to higher-level tools and documentation.

* We can handle the collation and character classification independently. The main use case is to set the collation to memcmp() semantics (for stability and performance) and set the character classification to something interesting (on the grounds that it's more likely to be stable and less likely to be used in an index than a collation). Right now the only way to do that is to use the libc provider and set the collation to C and the ctype to a libc locale. But there is also a use case for having ICU as the provider for character classification. One option is to have separate datcolprovider=b (builtin provider) and datctypeprovider=i, so that the collation would be handled with memcmp and the character classification with daticulocale. It feels like we're growing the fields in pg_database a little too much, but the use case seems valid, and perhaps we can reorganize the catalog representation a bit.

-- 
Jeff Davis
PostgreSQL Contributor Team - AWS
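
For readers unfamiliar with what "memcmp for collation and pg_ascii_tolower, etc., for ctype" means in practice, here is a minimal standalone sketch of those semantics. It is not code from the patch or from the PostgreSQL source; the function names c_collate and ascii_tolower_sketch are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from PostgreSQL): order two byte strings the
 * way a memcmp-based "C" collation would. Bytes are compared as unsigned
 * values; if one string is a prefix of the other, the shorter sorts first.
 * Returns -1, 0, or 1.
 */
int
c_collate(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t minlen = (alen < blen) ? alen : blen;
    int    cmp = memcmp(a, b, minlen);

    if (cmp != 0)
        return (cmp < 0) ? -1 : 1;
    if (alen == blen)
        return 0;
    return (alen < blen) ? -1 : 1;
}

/*
 * Hypothetical helper (not from PostgreSQL): ASCII-only lowercasing.
 * Only 'A'..'Z' are folded; every other byte, including bytes >= 0x80,
 * is returned unchanged, so the result never depends on a locale library.
 */
unsigned char
ascii_tolower_sketch(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch += 'a' - 'A';
    return ch;
}
```

Because both functions depend only on byte values, their results cannot change when libc or ICU is upgraded, which is the stability property the message argues for. The visible cost is that uppercase ASCII letters sort before lowercase ones and non-ASCII bytes are never case-folded.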