Re: Pre-proposal: unicode normalized text

From: Robert Haas
Subject: Re: Pre-proposal: unicode normalized text
Msg-id: CA+TgmobAxizsgjxvZdEQxjEs6RA3qu7JLti_LdXtaXODJoWzNw@mail.gmail.com
In reply to: Re: Pre-proposal: unicode normalized text  (Isaac Morland <isaac.morland@gmail.com>)
Responses: Re: Pre-proposal: unicode normalized text  (Isaac Morland <isaac.morland@gmail.com>)
Re: Pre-proposal: unicode normalized text  (Jeff Davis <pgsql@j-davis.com>)
Re: Pre-proposal: unicode normalized text  (Nico Williams <nico@cryptonector.com>)
List: pgsql-hackers
On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland <isaac.morland@gmail.com> wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent characters that
>> can't be automatically transcoded (without failure) to unicode?
>
> Here I’m just anticipating a hypothetical objection, “what about
> characters that can’t be represented in UTF-8?” to my suggestion to
> always use UTF-8 and I’m saying we shouldn’t care about them. I believe
> the answers to your questions in this paragraph are “yes”, and “none”.

Years ago, I remember SJIS being cited as an example of an encoding
that had characters which weren't part of Unicode. I don't know
whether this is still a live issue.

But I do think that sometimes users are reluctant to perform encoding
conversions on the data that they have. Sometimes they're not
completely certain what encoding their data is in, and sometimes
they're worried that the encoding conversion might fail or produce
wrong answers. In theory, if your existing data is validly encoded and
you know what encoding it's in and it's easily mapped onto UTF-8,
there's no problem. You can just transcode it and be done. But a lot
of times the reality is a lot messier than that.
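
To make the two worries above concrete, here is a minimal Python sketch (not PostgreSQL code, and not anything proposed in this thread) of a strict transcode to UTF-8. The helper names transcode_to_utf8 and try_transcode are hypothetical. It shows the easy case, a conversion that fails outright, and a conversion that "succeeds" under the wrong assumed encoding and quietly produces the wrong answer.

```python
from typing import Optional, Tuple


def transcode_to_utf8(raw: bytes, claimed_encoding: str) -> bytes:
    """Decode strictly under the claimed encoding, then re-encode as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    text = raw.decode(claimed_encoding, errors="strict")
    return text.encode("utf-8")


def try_transcode(raw: bytes, claimed_encoding: str) -> Tuple[Optional[bytes], Optional[str]]:
    """Hypothetical wrapper: return (utf8_bytes, None) or (None, error message)."""
    try:
        return transcode_to_utf8(raw, claimed_encoding), None
    except UnicodeDecodeError as exc:
        return None, f"cannot decode as {claimed_encoding}: {exc.reason}"


# Easy case: the data really is Latin-1, so it maps cleanly onto UTF-8.
latin1_bytes = "\u00e9".encode("latin-1")          # e-acute as a single byte, 0xE9
good, err = try_transcode(latin1_bytes, "latin-1")

# Failure case: the same byte is not a valid UTF-8 sequence on its own.
bad, bad_err = try_transcode(latin1_bytes, "utf-8")

# Wrong-answer case: UTF-8 bytes mislabeled as Latin-1 decode without any
# error, but the result is mojibake rather than the original character.
utf8_bytes = "\u00e9".encode("utf-8")              # two bytes, 0xC3 0xA9
mangled = transcode_to_utf8(utf8_bytes, "latin-1").decode("utf-8")
```

The third case is the uncomfortable one: a strict decoder catches invalid input, but it cannot tell you that the encoding you assumed was the wrong one.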

Which gives me some sympathy with the idea of wanting multiple
character sets within a database. Such a feature exists in some other
database systems and is, presumably, useful to some people. On the
other hand, to do that in PostgreSQL, we'd need to propagate the
character set/encoding information into all of the places that
currently get the typmod and collation, and that is not a small number
of places. It's a lot of infrastructure for the project to carry
around for a feature that's probably only going to continue to become
less relevant.

I suppose you never know, though. Maybe the Unicode consortium will
explode in a tornado of fiery rage and there will be dueling standards
making war over the proper way of representing an emoji of a dog
eating broccoli for decades to come. In that case, our hypothetical
multi-character-set feature might seem prescient.

--
Robert Haas
EDB: http://www.enterprisedb.com


