Re: The "char" type versus non-ASCII characters

From Andrew Dunstan
Subject Re: The "char" type versus non-ASCII characters
Date
Msg-id c44b31d4-044a-0e45-1a98-995517b47df7@dunslane.net
In response to The "char" type versus non-ASCII characters  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: The "char" type versus non-ASCII characters  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 12/3/21 14:12, Tom Lane wrote:
> [ breaking off a different new thread ]
>
> Chapman Flack <chap@anastigmatix.net> writes:
>> Then there's "char". It's category S, but does not apply the server
>> encoding. You could call it an 8-bit int type, but it's typically used
>> as a character, making it well-defined for ASCII values and not so
>> for others, just like SQL_ASCII encoding. You could as well say that
>> the "char" type has a defined encoding of SQL_ASCII at all times,
>> regardless of the database encoding.
> This reminds me of something I've been intending to bring up, which
> is that the "char" type is not very encoding-safe.  charout() for
> example just regurgitates the single byte as-is.  I think we deemed
> that okay the last time anyone thought about it, but that was when
> single-byte encodings were the mainstream usage for non-ASCII data.
> If you're using UTF8 or another multi-byte server encoding, it's
> quite easy to get an invalidly-encoded string this way, which at
> minimum is going to break dump/restore scenarios.
>
> I can think of at least three ways we might address this:
>
> * Forbid all non-ASCII values for type "char".  This results in
> simple and portable semantics, but it might break usages that
> work okay today.
>
> * Allow such values only in single-byte server encodings.  This
> is a bit messy, but it wouldn't break any cases that are not
> problematic already.
>
> * Continue to allow non-ASCII values, but change charin/charout,
> char_text, etc so that the external representation is encoding-safe
> (perhaps make it an octal or decimal number).
>
> Either of the first two ways would have to contemplate what to do
> with disallowed values that snuck into the DB via pg_upgrade.
> That leads me to think that the third way might be the most
> preferable, even though it's not terribly backward-compatible.
>
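
To make the problem described above concrete, here is a minimal Python sketch, not PostgreSQL code. The names charout_raw and is_valid_utf8 are hypothetical; charout_raw only models the "regurgitate the single byte as-is" behavior attributed to charout() above, and the check assumes a UTF8 server encoding.

```python
def charout_raw(value: int) -> bytes:
    """Hypothetical model of the output behavior described above:
    emit the stored byte unchanged, with no regard for encoding."""
    if not 0 <= value <= 255:
        raise ValueError("a one-byte value must be in 0..255")
    return bytes([value])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# An ASCII byte is valid in every ASCII-compatible encoding.
print(is_valid_utf8(charout_raw(ord("a"))))

# 0xE9 is a complete character in LATIN1, but in UTF-8 it is the lead
# byte of a multi-byte sequence, so on its own it is invalid.
print(is_valid_utf8(charout_raw(0xE9)))
```

A dump that contains such a lone byte would be rejected by any consumer that validates the text as UTF-8, which is the dump/restore hazard mentioned above.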


I don't like #2. Is #3 going to change the external representation only
for non-ASCII values? If so, that seems OK.  Changing it for ASCII
values seems ugly. #1 is the simplest to implement and to understand,
and I suspect it would break very little in practice, but others might
disagree with that assessment.
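
For illustration only, here is roughly what I have in mind for #3, as a Python sketch rather than the real C functions. The names char_out and char_in are hypothetical, and the backslash plus three octal digits format is just an assumption picked from the "octal or decimal number" idea; ASCII values keep their current one-character representation.

```python
def char_out(value: int) -> str:
    """Hypothetical output function: ASCII bytes unchanged,
    high-bit bytes as a backslash followed by three octal digits."""
    if not 0 <= value <= 255:
        raise ValueError("a one-byte value must be in 0..255")
    if value < 0x80:
        return chr(value)
    return "\\%03o" % value


def char_in(text: str) -> int:
    """Hypothetical input function accepting both representations."""
    is_escape = (
        len(text) == 4
        and text[0] == "\\"
        and all(c in "01234567" for c in text[1:])
    )
    if is_escape:
        value = int(text[1:], 8)
        if value > 255:
            raise ValueError("octal escape out of range for one byte")
        return value
    if len(text) == 1 and ord(text) < 0x80:
        return ord(text)
    raise ValueError("expected one ASCII character or a \\ooo escape")


print(char_out(ord("a")))  # ASCII value, printed as-is
print(char_out(0xE9))      # non-ASCII value, printed as an octal escape
```

In this sketch the output is always pure ASCII, so it is safe in any ASCII-compatible server encoding, and every value from 0 to 255 round-trips through char_out and char_in.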


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com



