Re: Questionable description about character sets

From	Tatsuo Ishii
Subject	Re: Questionable description about character sets
Date	April 17, 04:28:24
Msg-id	20260417.102824.927096962510122248.ishii@postgresql.org
In response to	Re: Questionable description about character sets (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: Questionable description about character sets
List	pgsql-hackers

> If we wanted to follow the SQL standard's terminology, I think we'd
> call this the "character repertoire".

Calling it "character repertoire" works for me. Fortunately, the
meaning of "character repertoire" in the SQL standard appears to be
the same as in other standards (ISO/IEC 2022 or 10646).

> In the standard, a "character
> set" is the database object representing a repertoire and an encoding
> of it, or its identifier.

Yes. Unlike ISO/IEC 2022 or 10646, the SQL standard makes no clear
distinction between a character set (in the sense of ISO/IEC 10646)
and an encoding. (To me this is quite confusing.)

> But if we put it in the description column,
> we wouldn't have to name it.

Why?

> Researching the standard led me to
> src/backend/catalog/information_schema.sql[1].  It currently reports
> the encoding name as the character set and the repertoire, except
> s/UTF8/UCS/ for the repertoire.  That's the same information as you
> want to document here.  For the character set (in the SQL standard
> sense), the current view definition seems reasonable given that we
> don't support CREATE CHARACTER SET or CHARACTER SET generally,

Why? For example, shouldn't EUC_JP have JIS X 0201, JIS X 0208 and JIS
X 0212 as its character repertoire?
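
To make the EUC_JP question concrete, here is a minimal Python sketch that reports which component character set a character is encoded from. It relies on Python's own euc_jp codec, not on PostgreSQL's conversion tables, so the two may differ in detail. The helper name euc_jp_code_set is hypothetical, and classifying by lead byte (0x8E for JIS X 0201 katakana, 0x8F for JIS X 0212) is an assumption based on the usual EUC-JP layout.

```python
def euc_jp_code_set(ch: str) -> str:
    """Name the component set Python's euc_jp codec uses for one character."""
    # Raises UnicodeEncodeError if the codec cannot represent ch at all.
    encoded = ch.encode("euc_jp")
    lead = encoded[0]
    if lead < 0x80:
        return "ASCII"                # code set 0, one byte
    if lead == 0x8E:
        return "JIS X 0201 katakana"  # code set 2, SS2 prefix + one byte
    if lead == 0x8F:
        return "JIS X 0212"           # code set 3, SS3 prefix + two bytes
    return "JIS X 0208"               # code set 1, two bytes


# Take a JIS X 0212 sample by decoding a three-byte sequence, so the
# example does not depend on remembering a particular code point.
jisx0212_sample = b"\x8f\xb0\xa1".decode("euc_jp")

for ch in ("A", "\uff71", "\u3042", jisx0212_sample):
    print(f"U+{ord(ch):04X}", ch.encode("euc_jp").hex(), euc_jp_code_set(ch))
```

If all three JIS sets show up this way, that supports listing them together as the repertoire of EUC_JP, at least as far as this codec is concerned.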

> and for
> the character repertoire, the s/UTF8/UCS/ translation makes sense, but
> you chose to call it "Unicode".  Shouldn't those agree?

I think "UCS" is not a repertoire, but a coded character set.
"Unicode" or "Unicode repertoire" [1] is more appropriate, I think.

[1] https://www.unicode.org/reports/tr17/tr17-3.html

> If GB18030 were a valid server encoding, it would surely have to
> report UCS, like UTF8, since it is also a "Unicode transformation
> format"[2] (its purpose is to be backwards compatible with legacy
> 2-byte-per-common-Chinese-character formats while also covering all of
> Unicode 100% systematically, ie booting stuff they don't often encode
> into the 3- and 4-byte zone to make room for efficient encoding of
> stuff they do often encode).  So I think that means your new
> documentation should say UCS (or UNICODE) for that one too.

Not sure. I heard that the latest GB18030 (GB18030-2022, at this
point) does not contain some newer Unicode characters.
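
One rough way to probe coverage is to round-trip sample code points through a GB18030 codec. The sketch below uses Python's built-in gb18030 codec, which is an assumption-laden stand-in: it may implement an older revision of the mapping than GB18030-2022, and it says nothing about PostgreSQL's own conversion routines. The helper name gb18030_roundtrip is hypothetical.

```python
def gb18030_roundtrip(ch: str) -> tuple[int, bool]:
    """Return (encoded length in bytes, whether decoding restores ch)."""
    encoded = ch.encode("gb18030")
    return len(encoded), encoded.decode("gb18030") == ch


# ASCII, a common Chinese character, and a supplementary-plane character.
for ch in ("A", "\u4e2d", "\U0001F600"):
    length, ok = gb18030_roundtrip(ch)
    print(f"U+{ord(ch):04X}: {length} byte(s), round trip ok: {ok}")
```

A successful round trip only shows that this codec has a mapping for the code point. It does not show that a given edition of the GB18030 standard formally includes the character, so it does not settle the question above.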

> I don't
> know how other encodings should spell their repertoire though...

I need to research that too.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
