Re: Pre-proposal: unicode normalized text

From	Robert Haas
Subject	Re: Pre-proposal: unicode normalized text
Date	5 October 2023, 11:31:54
Msg-id	CA+TgmobAxizsgjxvZdEQxjEs6RA3qu7JLti_LdXtaXODJoWzNw@mail.gmail.com
In reply to	Re: Pre-proposal: unicode normalized text (Isaac Morland <isaac.morland@gmail.com>)
Replies	Re: Pre-proposal: unicode normalized text (3 replies)
List	pgsql-hackers

On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac.morland@gmail.com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about
> characters that can’t be represented in UTF-8?” to my suggestion to
> always use UTF-8 and I’m saying we shouldn’t care about them. I believe
> the answers to your questions in this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding
that had characters which weren't part of Unicode. I don't know
whether this is still a live issue.

But I do think that sometimes users are reluctant to perform encoding
conversions on the data that they have. Sometimes they're not
completely certain what encoding their data is in, and sometimes
they're worried that the encoding conversion might fail or produce
wrong answers. In theory, if your existing data is validly encoded and
you know what encoding it's in and it's easily mapped onto UTF-8,
there's no problem. You can just transcode it and be done. But a lot
of times the reality is a lot messier than that.
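
As a minimal illustrative sketch of the "fail or produce wrong answers"
cases (Python, with a made-up sample string and made-up wrong guesses
about its encoding; none of this comes from a real system), consider
bytes that are really Latin-1 but whose encoding is not known for sure:

```python
# Hypothetical sample: "café" stored as Latin-1 (ISO 8859-1) bytes.
raw = "café".encode("latin-1")  # b'caf\xe9'

# Case 1: the conversion fails outright.
# Assuming the bytes are already UTF-8 raises an error, because 0xE9
# starts a multi-byte sequence that is never completed.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# Case 2: the conversion "succeeds" but gives a wrong answer.
# Assuming Windows-1251 maps 0xE9 to a Cyrillic letter, and the result
# transcodes to perfectly valid UTF-8 that no longer says "café".
wrong = raw.decode("cp1251")
wrong_utf8 = wrong.encode("utf-8")

# Case 3: the easy case, where the encoding is known and correct.
right_utf8 = raw.decode("latin-1").encode("utf-8")
```

The second case is arguably the worrying one: nothing reports an error,
so the damage is only noticed when someone reads the data.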

Which gives me some sympathy with the idea of wanting multiple
character sets within a database. Such a feature exists in some other
database systems and is, presumably, useful to some people. On the
other hand, to do that in PostgreSQL, we'd need to propagate the
character set/encoding information into all of the places that
currently get the typmod and collation, and that is not a small number
of places. It's a lot of infrastructure for the project to carry
around for a feature that's probably only going to continue to become
less relevant.

I suppose you never know, though. Maybe the Unicode consortium will
explode in a tornado of fiery rage and there will be dueling standards
making war over the proper way of representing an emoji of a dog
eating broccoli for decades to come. In that case, our hypothetical
multi-character-set feature might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com
