Re: Questionable description about character sets
| From | Tatsuo Ishii |
|---|---|
| Subject | Re: Questionable description about character sets |
| Date | |
| Msg-id | 20260417.102824.927096962510122248.ishii@postgresql.org |
| In response to | Re: Questionable description about character sets (Thomas Munro <thomas.munro@gmail.com>) |
| Responses | Re: Questionable description about character sets |
| List | pgsql-hackers |
> If we wanted to follow the SQL standard's terminology, I think we'd
> call this the "character repertoire".

Calling it "character repertoire" works for me. Fortunately, the meaning of "character repertoire" in the SQL standard and in other standards (ISO/IEC 2022 or 10646) looks the same.

> In the standard, a "character
> set" is the database object representing a repertoire and an encoding
> of it, or its identifier.

Yes. Unlike ISO/IEC 2022 or 10646, the SQL standard makes no clear distinction between character set (in the sense of ISO/IEC 10646) and encoding. (To me this is quite confusing.)

> But if we put it in the description column,
> we wouldn't have to name it.

Why?

> Researching the standard led me to
> src/backend/catalog/information_schema.sql[1]. It currently reports
> the encoding name as the character set and the repertoire, except
> s/UTF8/UCS/ for the repertoire. That's the same information as you
> want to document here. For the character set (in the SQL standard
> sense), the current view definition seems reasonable given that we
> don't support CREATE CHARACTER SET or CHARACTER SET generally,

Why? For example, shouldn't EUC_JP have JIS X 0201, JIS X 0208 and JIS X 0212 as its character repertoire?

> and for
> the character repertoire, the s/UTF8/UCS/ translation makes sense, but
> you chose to call it "Unicode". Shouldn't those agree?

I think "UCS" is not a repertoire, but a coded character set. "Unicode" or "Unicode repertoire" [1] is more appropriate, I think.
[1] https://www.unicode.org/reports/tr17/tr17-3.html

> If GB18030 were a valid server encoding, it would surely have to
> report UCS, like UTF8, since it is also a "Unicode transformation
> format"[2] (its purpose is to be backwards compatible with legacy
> 2-byte-per-common-Chinese-character formats while also covering all of
> Unicode 100% systematically, ie booting stuff they don't often encode
> into the 3- and 4-byte zone to make room for efficient encoding of
> stuff they do often encode). So I think that means your new
> documentation should say UCS (or UNICODE) for that one too.

Not sure. I heard that the latest GB18030 (GB18030-2022, at this point) does not contain some newer Unicode characters.

> I don't
> know how other encodings should spell their repertoire though...

I need to research that too.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp