Thread: Questionable description about character sets

Questionable description about character sets

From
Tatsuo Ishii
Date:
"23.3.1. Supported Character Sets
Table 23.3 shows the character sets available for use in PostgreSQL."

https://www.postgresql.org/docs/current/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED

But the table actually shows encodings (more precisely, "character
encoding scheme") (BIG5...EUC_JP... UTF8). I think we need one more
column for "character sets" (more precisely, "coded character sets").

Encoding   Character set        ...
BIG5       Big5-2003
:
EUC_JP     ASCII, JIS X 0208, JIS X 0212, JIS X 0201
:
UTF8       Unicode      

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Questionable description about character sets

From
Andreas Karlsson
Date:
On 2/11/26 10:58 AM, Tatsuo Ishii wrote:
> "23.3.1. Supported Character Sets
> Table 23.3 shows the character sets available for use in PostgreSQL."
> 
> https://www.postgresql.org/docs/current/multibyte.html#MULTIBYTE-CHARSET-SUPPORTED
> 
> But the table actually shows encodings (more precisely, "character
> encoding scheme") (BIG5...EUC_JP... UTF8). I think we need one more
> column for "character sets" (more precisely, "coded character sets").
> 
> Encoding   Character set        ...
> BIG5       Big5-2003
> :
> EUC_JP     ASCII, JIS X 0208, JIS X 0212, JIS X 0201
> :
> UTF8       Unicode    

Wouldn't that make the table very wide? And for e.g. European character
encodings I am not sure it is that useful, since most or maybe even all
of them are subsets of Unicode. It mostly gets interesting for encodings
that support characters not in Unicode, right?

Andreas




Re: Questionable description about character sets

From
Tatsuo Ishii
Date:
> Wouldn't that make the table very wide?

I don't think it would make the table very wide, just a little
wider. Still, I think adding the character set information to the
"Description" column is better. Some of the encodings already have
this info. See the attached patch.

> And for e.g. European
> character encodings I am not sure it is that useful, since most or
> maybe even all of them are subsets of Unicode. It mostly gets
> interesting for encodings that support characters not in Unicode,
> right?

Choosing UTF8 or not is just one of the use cases.

I am thinking about the use case in which a user wants to continue to
use another encoding (e.g. wants to avoid conversion to UTF8).
Example: suppose the user has a legacy system in which EUC_JP is
used. The data in the system includes JIS X 0201, JIS X 0208 and JIS X
0212, and he wants to make sure that PostgreSQL supports all those
character sets in EUC_JP, because some tools do not support JIS X
0212; they support only JIS X 0201 and JIS X 0208. Currently the info
(whether JIS X 0212 is supported or not) does not exist anywhere in
our docs. It's only in the source code. I think it's better to have
the info in our docs so that users do not need to look into the
source code.
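
To make the EUC_JP example concrete, here is a minimal Python sketch that
reports which coded character set a character lands in, judging only by the
lead byte of its EUC-JP form. It uses Python's built-in euc_jp codec as a
stand-in, so it illustrates the layout of the encoding and says nothing
definitive about PostgreSQL's own conversion tables. The helper names are
made up for illustration.

```python
def euc_jp_charset(ch: str) -> str:
    """Name the coded character set Python's euc_jp codec uses for ch."""
    encoded = ch.encode("euc_jp")  # raises UnicodeEncodeError if unmappable
    lead = encoded[0]
    if lead < 0x80:
        return "ASCII / JIS X 0201 Roman"  # code set 0, one byte
    if lead == 0x8E:
        return "JIS X 0201 katakana"       # single shift 2 + one byte
    if lead == 0x8F:
        return "JIS X 0212"                # single shift 3 + two bytes
    return "JIS X 0208"                    # two bytes, both with high bit set


def first_jis_x_0212_kanji() -> str:
    """Scan the CJK Unified Ideographs block for a JIS X 0212 character."""
    for cp in range(0x4E00, 0xA000):
        try:
            if chr(cp).encode("euc_jp")[0] == 0x8F:
                return chr(cp)
        except UnicodeEncodeError:
            continue  # not representable in this codec at all
    raise LookupError("codec does not map any JIS X 0212 kanji")


for sample in ["A", "ｱ", "あ", first_jis_x_0212_kanji()]:
    size = len(sample.encode("euc_jp"))
    print(f"U+{ord(sample):04X}: {euc_jp_charset(sample)}, {size} byte(s)")
```

A tool that lacks JIS X 0212 support would fail on the last sample while
handling the first three, which is the situation described above.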

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

From 98c97f670ce647003ce467a84f81cec0cb463c18 Mon Sep 17 00:00:00 2001
From: Tatsuo Ishii <ishii@postgresql.org>
Date: Sat, 14 Feb 2026 16:26:01 +0900
Subject: [PATCH v1] doc: Enhance "PostgreSQL Character Sets" table.

Previously some of the encodings lacked a description of the coded
character sets used in the encoding. For most European encodings this
is obvious because there is only one or a few character sets per
encoding, but that is not true for some Asian encodings. For example,
the EUC_JP encoding corresponds to multiple character sets: namely,
JIS X 0201, JIS X 0208 and JIS X 0212. This commit adds the
information to the "Description" column.

Discussion: https://postgr.es/m/20260211.185847.1679085676298121526.ishii%40postgresql.org
---
 doc/src/sgml/charset.sgml | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 3aabc798012..32c6280489b 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -1831,7 +1831,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_CN</literal></entry>
-         <entry>Extended UNIX Code-CN</entry>
+         <entry>Extended UNIX Code-CN, GB 2312</entry>
          <entry>Simplified Chinese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1840,7 +1840,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_JP</literal></entry>
-         <entry>Extended UNIX Code-JP</entry>
+         <entry>Extended UNIX Code-JP, JIS X 0201, JIS X 0208, JIS X 0212</entry>
          <entry>Japanese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1849,7 +1849,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_JIS_2004</literal></entry>
-         <entry>Extended UNIX Code-JP, JIS X 0213</entry>
+         <entry>Extended UNIX Code-JP, JIS X 0201, JIS X 0213</entry>
          <entry>Japanese</entry>
          <entry>Yes</entry>
          <entry>No</entry>
@@ -1858,7 +1858,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_KR</literal></entry>
-         <entry>Extended UNIX Code-KR</entry>
+         <entry>Extended UNIX Code-KR, KS X 1001</entry>
          <entry>Korean</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -1867,7 +1867,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>EUC_TW</literal></entry>
-         <entry>Extended UNIX Code-TW</entry>
+         <entry>Extended UNIX Code-TW, CNS 11643</entry>
          <entry>Traditional Chinese, Taiwanese</entry>
          <entry>Yes</entry>
          <entry>Yes</entry>
@@ -2056,7 +2056,7 @@ ORDER BY c COLLATE ebcdic;
         </row>
         <row>
          <entry><literal>SJIS</literal></entry>
-         <entry>Shift JIS</entry>
+         <entry>Shift JIS, JIS X 0201, JIS X 0208</entry>
          <entry>Japanese</entry>
          <entry>No</entry>
          <entry>No</entry>
-- 
2.43.0


Re: Questionable description about character sets

From
Thomas Munro
Date:
On Sat, Feb 14, 2026 at 11:20 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> > Wouldn't that make the table very wide?
>
> I don't think it would make the table very wide, just a little
> wider. Still, I think adding the character set information to the
> "Description" column is better. Some of the encodings already have
> this info. See the attached patch.

When I point my browser at
file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
I see these longer descriptions flowing onto multiple lines making the
table cells higher, while the published documentation[1] does only a
small amount of that, and then the font instead becomes smaller as I
make the window narrower.  Is there an easy way to see the final
website form in a local build?

We'd have more free space in the affected rows if we did s/Extended
UNIX Code-JP/EUC-JP/.  Why is that acronym expanded, while ISO, ECMA,
JIS and CP are not?

It might be confusing that the style "ISO 8859-1, ECMA 94" is used to
list alternative encoding standards that are aligned or equivalent,
while here you're listing the encoding and then the underlying
character sets in the same way.  Would it be better to put them in
parentheses?

With those two changes we'd have:

EUC_JP       | EUC-JP (JIS X 0201, JIS X 0208, JIS X 0212)
EUC_JIS_2004 | EUC-JP (JIS X 0201, JIS X 0213)

If we really wanted to save horizontal space, I suppose we could drop
the Alias column and either list aliases in a new table, or give them
their own rows with a description "Alias for ...", but that seems a
bit over the top.

While wondering if some other rows could be more specific, I noticed
that for GBK we have "Extended National Standard".  I don't understand
these things, but from a quick look at Wikipedia[2], I got the idea
that if convert_to('€', 'GBK') = '\x80'::bytea (yes) then what we have
might actually be the yet-further-extended standard known as "GBK
1.0".  Do I have that right?

As for BIG5, it seems to be an underspecified mess defying description
other than "good luck" :-)  Thankfully we won't have to list all the
standards that MULE_INTERNAL indirectly covers, as it looks like we've
agreed to drop it.  And IIRC there was a thread somewhere proposing to
drop JOHAB...

> > And for e.g. European
> > character encodings I am not sure it is that useful, since most or
> > maybe even all of them are subsets of Unicode. It mostly gets
> > interesting for encodings that support characters not in Unicode,
> > right?
>
> Choosing UTF8 or not is just one of the use cases.
>
> I am thinking about the use case in which a user wants to continue to
> use another encoding (e.g. wants to avoid conversion to UTF8).
> Example: suppose the user has a legacy system in which EUC_JP is
> used. The data in the system includes JIS X 0201, JIS X 0208 and JIS X
> 0212, and he wants to make sure that PostgreSQL supports all those
> character sets in EUC_JP, because some tools do not support JIS X
> 0212; they support only JIS X 0201 and JIS X 0208. Currently the info
> (whether JIS X 0212 is supported or not) does not exist anywhere in
> our docs. It's only in the source code. I think it's better to have
> the info in our docs so that users do not need to look into the
> source code.

Makes sense to me.  The underlying character sets must be very
important to understand, especially if implementations vary on these
points.  We should give the information.

. o O ( I wonder if anyone has ever tried to make an "XTF-8-JA"
encoding just like UTF-8 but with ~1900 high-frequency Japanese
codepoints swapped into the 2-byte range U+0080-07ff where Greek,
Hebrew, Arabic and others won the encoding lottery.  UTF-16 is
apparently sometimes preferred to save space in other RDBMSs that can
do it, but I suppose you could achieve the same size most of the time
with a scheme like that.  The other encodings have the desired size,
but non-universal character sets.  A similar thought for the languages
of India, but with the frequency fuzziness factor removed: you could
surely map a dozen tiny non-ideographic scripts into that range to
save a byte per character... Hindi, Tamil etc didn't get a very good
deal with UTF-8.  Don't worry, I'm not suggesting that PostgreSQL has
any business inventing its own harebrained encodings, I'm just
wondering out loud if that is a kind of thing that exists somewhere
out there... )

[1] https://www.postgresql.org/docs/current/multibyte.html
[2] https://en.wikipedia.org/wiki/GBK_(character_encoding)



Re: Questionable description about character sets

From
Nico Williams
Date:
On Mon, Feb 16, 2026 at 05:35:41PM +1300, Thomas Munro wrote:
>                                              [...].  UTF-16 is
> apparently sometimes preferred to save space in other RDBMSs that can
> do it, but I suppose you could achieve the same size most of the time
> with a scheme like that.  [...]

[Off-topic] I think UTF-16 yielding smaller encodings is a truism.  It
really depends on what language the text is mostly written in, but
mostly it's a truism that's not true.  Anyways, UTF-16 has to go away,
and the sooner the better.

Nico
-- 



Re: Questionable description about character sets

From
Tatsuo Ishii
Date:
> When I point my browser at
> file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
> I see these longer descriptions flowing onto multiple lines making the
> table cells higher, while the published documentation[1] does only a
> small amount of that, and then the font instead becomes smaller as I
> make the window narrower.  Is there an easy way to see the final
> website form in a local build?

Same here. It would be nice to be able to see the website form in a local build.

> We'd have more free space in the affected rows if we did s/Extended
> UNIX Code-JP/EUC-JP/.  Why is that acronym expanded, while ISO, ECMA,
> JIS and CP are not?

Fair point.

> It might be confusing that the style "ISO 8859-1, ECMA 94" is used to
> list alternative encoding standards that are aligned or equivalent,
> while here you're listing the encoding and then the underlying
> character sets in the same way.  Would it be better to put them in
> parentheses?
>
> With those two changes we'd have:
>
> EUC_JP       | EUC-JP (JIS X 0201, JIS X 0208, JIS X 0212)
> EUC_JIS_2004 | EUC-JP (JIS X 0201, JIS X 0213)

Looks good to me.

> While wondering if some other rows could be more specific, I noticed
> that for GBK we have "Extended National Standard".  I don't understand
> these things,

Me neither. Probably "Extended National Standard" comes from the fact
that GB means "national standard" and "K" means "extension".  However,
GBK is actually not an "official standard" that is mandatory for
Chinese industries to follow [3]. It is more of a strongly recommended
standard. Probably we can just write "De facto standard (CP936)".

> but from a quick look at Wikipedia[2], I got the idea
> that if convert_to('€', 'GBK') = '\x80'::bytea (yes) then what we have
> might actually be the yet-further-extended standard known as "GBK
> 1.0".  Do I have that right?

I don't think so. [2] states that "Microsoft later added the euro sign
to Code page 936 and assigned the code 0x80 to it. This is not a valid
code point in GBK 1.0." So what we have seems to be CP936. Indeed,
UCS_to_most.pl, which is used to generate gbk_to_utf8.map, has the
line:
    'GBK' => 'CP936.TXT');
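
One way to check a claim like this against a mapping file is to parse it and
look up byte 0x80 directly. The sketch below is a rough Python illustration,
not the logic of UCS_to_most.pl: it assumes the common "0xCODE 0xUCS #comment"
line layout, and SAMPLE is a short hand-written excerpt in that style rather
than the real CP936.TXT. The function names are hypothetical.

```python
# Hand-written sample in the style of a vendor mapping file (an assumption,
# not a copy of the real file). A lead-byte line has no Unicode column.
SAMPLE = """\
0x7F\t0x007F\t#DELETE
0x80\t0x20AC\t#EURO SIGN
0x81\t\t#DBCS LEAD BYTE
0x8140\t0x4E02\t#CJK UNIFIED IDEOGRAPH
"""


def parse_mapping(text: str) -> dict[int, int]:
    """Return {encoded value: Unicode code point} for fully mapped lines."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop the trailing comment
        if len(fields) != 2:
            continue  # blank line, comment, or lead byte without a mapping
        code, ucs = (int(field, 16) for field in fields)
        table[code] = ucs
    return table


def maps_euro_at_0x80(table: dict[int, int]) -> bool:
    """True if single byte 0x80 is mapped to U+20AC EURO SIGN."""
    return table.get(0x80) == 0x20AC


table = parse_mapping(SAMPLE)
print("0x80 is the euro sign:", maps_euro_at_0x80(table))
```

Running the same check over the file actually used to build the conversion
map would show whether the table follows CP936 or plain GBK 1.0 on this point.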

> As for BIG5, it seems to be an underspecified mess defying description
> other than "good luck" :-)

Yeah, ours is BIG5 (Unicode 1.1) + CP950.

> Thankfully we won't have to list all the
> standards that MULE_INTERNAL indirectly covers, as it looks like we've
> agreed to drop it.  And IIRC there was a thread somewhere proposing to
> drop JOHAB...

Apparently JOHAB has not been well tested...

> Makes sense to me.  The underlying character sets must be very
> important to understand, especially if implementations vary on these
> points.  We should give the information.

Yes.

> . o O ( I wonder if anyone has ever tried to make an "XTF-8-JA"
> encoding just like UTF-8 but with ~1900 high-frequency Japanese
> codepoints swapped into the 2-byte range U+0080-07ff where Greek,
> Hebrew, Arabic and others won the encoding lottery.  UTF-16 is
> apparently sometimes preferred to save space in other RDBMSs that can
> do it, but I suppose you could achieve the same size most of the time
> with a scheme like that.  The other encodings have the desired size,
> but non-universal character sets.  A similar thought for the languages
> of India, but with the frequency fuzziness factor removed: you could
> surely map a dozen tiny non-ideographic scripts into that range to
> save a byte per character... Hindi, Tamil etc didn't get a very good
> deal with UTF-8.  Don't worry, I'm not suggesting that PostgreSQL has
> any business inventing its own harebrained encodings, I'm just
> wondering out loud if that is a kind of thing that exists somewhere
> out there... )

Well, I think inventing an internal-use-only encoding is not a bad
thing in general.  We already have a number of internal-only data
structures. Internal encodings are just one of them. (I am not saying
I want to implement "XTF-8-JA", though.)

> [1] https://www.postgresql.org/docs/current/multibyte.html
> [2] https://en.wikipedia.org/wiki/GBK_(character_encoding)
>

[3] https://ja.wikipedia.org/wiki/GBK

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Questionable description about character sets

From
Robert Treat
Date:
On Mon, Feb 16, 2026 at 4:48 AM Tatsuo Ishii <ishii@postgresql.org> wrote:
>
> > When I point my browser at
> > file:///home/tmunro/projects/postgresql/build/doc/src/sgml/html/multibyte.html
> > I see these longer descriptions flowing onto multiple lines making the
> > table cells higher, while the published documentation[1] does only a
> > small amount of that, and then the font instead becomes smaller as I
> > make the window narrower.  Is there an easy way to see the final
> > website form in a local build?
>
> Same here. It would be nice to be able to see the website form in a local build.
>

Are you folks building with "make STYLE=website html"?  That usually
gives me a pretty good representation of the web (although beware if
you use any browser-specific settings to display websites in different
fonts; for example, on my desktop at home I run with postgresql.org at
133% size, which doesn't carry over when looking at locally built HTML
pages).

In any case, there is some additional info at
https://www.postgresql.org/docs/devel/docguide-build.html#DOCGUIDE-BUILD-HTML


Robert Treat
https://xzilla.net



Re: Questionable description about character sets

From
Tatsuo Ishii
Date:
>> Same here. It would be nice to be able to see the website form in a local build.
>>
> 
> Are you folks building with "make STYLE=website html"?  That usually
> gives me a pretty good representation of the web (although beware if
> you use any browser-specific settings to display websites in different
> fonts; for example, on my desktop at home I run with postgresql.org at
> 133% size, which doesn't carry over when looking at locally built HTML
> pages).
> 
> In any case, there is some additional info at
> https://www.postgresql.org/docs/devel/docguide-build.html#DOCGUIDE-BUILD-HTML

Thanks for letting me know. I had not noticed it.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp



Re: Questionable description about character sets

From
Thomas Munro
Date:
On Mon, Feb 16, 2026 at 6:07 PM Nico Williams <nico@cryptonector.com> wrote:
> On Mon, Feb 16, 2026 at 05:35:41PM +1300, Thomas Munro wrote:
> >                                              [...].  UTF-16 is
> > apparently sometimes preferred to save space in other RDBMSs that can
> > do it, but I suppose you could achieve the same size most of the time
> > with a scheme like that.  [...]
>
> [Off-topic] I think UTF-16 yielding smaller encodings is a truism.  It
> really depends on what language the text is mostly written in, but
> mostly it's a truism that's not true.  Anyways, UTF-16 has to go away,
> and the sooner the better.

But when it's true for your language and that's what your database
holds, then it's true all the time, and it's not just outliers, we're
talking about nearly all of Asia's languages.  That's ... a lot of
NAND gates being wasted due to arbitrary choices made probably before
UTF-8 even existed.
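
The size argument is easy to check for a given text. The following Python
sketch compares the UTF-8 and UTF-16 byte counts of a few sample words; the
words are arbitrary picks for illustration, and real results depend on how
much ASCII (markup, digits, spaces) is mixed into the data.

```python
# Arbitrary sample words (an assumption for illustration only).
SAMPLES = {
    "English": "database",
    "Greek": "δεδομένα",
    "Hindi": "डेटाबेस",
    "Japanese": "データベース",
}


def encoded_sizes(text: str) -> dict[str, int]:
    """Byte length of text in UTF-8 and in UTF-16 without a byte order mark."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


for language, text in SAMPLES.items():
    sizes = encoded_sizes(text)
    print(
        f"{language:9} {len(text):2} code points  "
        f"utf-8={sizes['utf-8']:2}  utf-16={sizes['utf-16']:2}"
    )
```

Code points in U+0080-07FF (Greek here) cost two bytes either way, while
code points from U+0800 upward in the BMP (Devanagari, kana, kanji) cost three
bytes in UTF-8 and two in UTF-16, and ASCII costs one versus two.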

I do agree with you that UTF-16 has turned out to be an odd beast,
though, not big enough but also too big.  Maybe it's only just right
for CJK (or CJ?).  I don't see much chance at all of anyone
retro-fitting UTF-16 into PostgreSQL anyway, so I wouldn't worry about
that.  I could more easily see us figuring out how to drop the
requirement for high bits in multi-byte sequence tails so that GB18030
could be used to store two-byte Chinese (while also retaining full
access to all of Unicode as it does), and I was basically wondering
out loud if Japan might be hiding something like that somewhere and
imagining what it might look like.
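
For readers unfamiliar with the "high bits in multi-byte sequence tails"
point, here is a small Python sketch that counts how many common CJK
ideographs have a trailing byte below 0x80 in a given encoding. It relies on
Python's built-in codecs, which are assumed to follow the same byte layout as
the encodings discussed here; the function names and the scanned range are
choices made for this illustration.

```python
def has_ascii_range_tail(ch: str, encoding: str) -> bool:
    """True if any byte after the first in ch's encoding is below 0x80."""
    encoded = ch.encode(encoding)
    return len(encoded) > 1 and any(byte < 0x80 for byte in encoded[1:])


def count_ascii_range_tails(encoding: str, first=0x4E00, last=0x9FA5) -> int:
    """Count code points in [first, last] whose tail bytes dip below 0x80."""
    count = 0
    for cp in range(first, last + 1):
        try:
            if has_ascii_range_tail(chr(cp), encoding):
                count += 1
        except UnicodeEncodeError:
            pass  # character not representable in this encoding
    return count


for encoding in ("utf-8", "euc_jp", "gb18030"):
    print(f"{encoding:8} {count_ascii_range_tails(encoding)}")
```

UTF-8 and EUC-JP never produce such bytes, whereas GB18030 does for many
two-byte characters, so code that scans for ASCII bytes such as a backslash
or quote without decoding could misread the second half of a character.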