Re: Pre-proposal: unicode normalized text

From Matthias van de Meent
Subject Re: Pre-proposal: unicode normalized text
Msg-id CAEze2WipFK6Xrg6Kz0ndt6MSk3GF3LnarNHMJrm=A7dBmYWjnA@mail.gmail.com
In response to Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers


On Fri, 6 Oct 2023, 21:08 Jeff Davis, <pgsql@j-davis.com> wrote:
> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.

One reason someone might want this is that a database cluster may have been initialized with something like --no-locale (thus defaulting to LC_COLLATE=C, which is the desired behaviour and gives fast strcmp operations for indexing, and to the SQL_ASCII encoding, which is not exactly expected but can be sufficient for some workloads), but now that the data has grown they want to use en_US.UTF-8 collations in some fields of their new and modern tables.
Or a user may want to maintain literal translation tables, where different encodings would be needed for different languages to cover the full script in cases where Unicode does not yet cover the full character set.
Additionally, I'd imagine specialized encodings like Shift_JIS could be more space-efficient than UTF-8 for e.g. Japanese text, which might be useful for someone who wants to be a bit more frugal with storage when they know the text is guaranteed to be in the language an encoding was designed for: compression can do the same work, but also adds significant overhead.
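
To make the space argument concrete, here is a minimal sketch that compares the encoded size of a short Japanese string in UTF-8 and in Shift_JIS using Python's standard codecs. The sample string and the helper name encoded_sizes are assumptions chosen for illustration; real savings depend on the mix of characters, since ASCII takes one byte in both encodings.

```python
# Illustrative only: the sample text and helper name are assumptions,
# not something taken from the thread.
sample = "日本語のテキスト"  # 8 characters: kanji, hiragana, full-width katakana


def encoded_sizes(text):
    """Return the byte length of `text` in UTF-8 and in Shift_JIS."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "shift_jis": len(text.encode("shift_jis")),
    }


sizes = encoded_sizes(sample)
print(sizes)  # {'utf-8': 24, 'shift_jis': 16}
```

Each of these characters takes three bytes in UTF-8 but two in Shift_JIS, so the Shift_JIS form is about a third smaller for this sample.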

I've certainly experienced situations where I forgot to explicitly specify the encoding in initdb --no-locale and only much later noticed that my big data load was useless because I could not create UTF-8 collated indexes.
I often use --no-locale to make string indexing fast (locales and collations are rarely important to my workload) and to keep environment variables from being carried over into the installation. An ability to set or update the encoding of columns would help reduce the pain: I would no longer have to re-initialize the database or cluster from scratch.

Kind regards,

Matthias van de Meent
