Re: The "char" type versus non-ASCII characters

From: Chapman Flack
Subject: Re: The "char" type versus non-ASCII characters
Date:
Msg-id: 61ABAE76.3070306@anastigmatix.net
In reply to: Re: The "char" type versus non-ASCII characters  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: The "char" type versus non-ASCII characters  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 12/04/21 11:34, Tom Lane wrote:
> Chapman Flack <chap@anastigmatix.net> writes:
>> "I am one byte of SQL_ASCII regardless of server setting".
> 
> But it's not quite that.  If we treated it as SQL_ASCII, we'd refuse
> to convert it to some other encoding unless the value passes encoding
> verification, which is exactly what charout() is not doing.

Ah, good point. I remembered noticing pg_do_encoding_conversion returning
the src pointer unchanged when SQL_ASCII is involved, but I see now that
it does verify the dest_encoding when SQL_ASCII is the source.
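That behavior can be modeled in a few lines of Python (a toy sketch only;
the real logic lives in PostgreSQL's C code, and the encoding name here
is a Python codec name, not a PostgreSQL encoding identifier):

```python
# Toy model of the SQL_ASCII-source conversion behavior described above:
# no transcoding is performed (the input bytes come back unchanged, like
# the src pointer), but the bytes are still verified against the
# destination encoding, so invalidly-encoded data is rejected rather
# than silently passed through.

def convert_from_sql_ascii(src: bytes, dest_encoding: str) -> bytes:
    src.decode(dest_encoding)  # raises UnicodeDecodeError if invalid
    return src                 # otherwise returned unchanged
```

So convert_from_sql_ascii(b"\xfb", "utf-8") is rejected even though no
actual conversion would take place.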

> encoding-dependent would be if you have ambitions to store a non-ASCII
> character in a "char".  But I think that's something we want to
> strongly discourage, even if we don't prohibit it altogether. ...
> So I'm visualizing it as a uint8 that we happen to like to store
> ASCII codes in, and that's what prompts the thought of using a
> numeric representation for non-ASCII values.

I'm in substantial agreement, though I also see that it is nearly always
set from a quoted literal, tested against a quoted literal, and calls
itself "char", so for consistency's sake it might be better not to invent
an all-new convention for its text representation, but to adopt something
that's already familiar, like bytea escape format. It would then always
look and act like a one-octet bytea. Maybe charin could accept either
the bytea-escape or the bytea-hex form too. (Or, never mind: when
restricted to one octet, the bytea-hex and \xhh bytea-escape forms are
indistinguishable anyway.)
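To make the idea concrete, here is a rough Python sketch of that
representation; the names char_out/char_in are illustrative only, not
PostgreSQL's actual charout/charin, and the exact escaping rules would
of course follow bytea's:

```python
# Sketch of the proposed text form for "char": one octet rendered like a
# one-octet bytea.  Printable ASCII bytes print as themselves; anything
# else (including backslash, to keep the form unambiguous) prints as a
# \xhh escape, e.g. 0xFB ('ű' in LATIN2) -> '\xfb'.

def char_out(octet: int) -> str:
    if 0x20 <= octet < 0x7F and octet != ord("\\"):
        return chr(octet)          # plain printable ASCII
    return "\\x%02x" % octet       # escaped/hex form

def char_in(text: str) -> int:
    if text.startswith("\\x"):
        return int(text[2:], 16)   # accept the escaped/hex form
    if len(text) == 1 and ord(text) < 0x80:
        return ord(text)           # accept a bare ASCII character
    raise ValueError("not a valid single-octet representation")
```

Both the bare and the escaped spellings round-trip, which is what lets
existing literals keep working while dumps switch to the \xhh form.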

Then for free we get the property that if somebody today uses 'ű' as
an enum value, it might start appearing as '\xfb' in dumps and the like,
but their existing CASE WHEN thing = 'ű' code doesn't stop working
(as long as they haven't done something silly like changing the encoding),
and they have the flexibility to update it to WHEN thing = '\xfb' as
time permits, if they choose. If they don't, they accept the risk that,
by switching to another encoding in the future, they may see their
existing tests stop matching or their existing literals fail to parse,
but no invalidly-encoded strings will be created.

> Yup, cstring is definitely presumed to be in the server's encoding.

Without proposing to change it, I observe that by defining both cstring
and unknown in this way (with the latter being expressly the type of
any literal from the client destined for a type we don't know yet), we're
a bit painted into a corner as far as supporting types like NCHAR goes.
(I suppose clients could be banned from sending such values as literals,
and required to use the extended protocol and bind them with a binary
message.) It's analogous to the way format 0 and format 1 both act as
filters through which no encoding-dependent data can squish without
surviving both the client and the server encoding, even if its type is
defined to be independent of either.

Regards,
-Chap


