Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows

From Kyotaro Horiguchi
Subject Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
Date
Msg-id 20220909.114216.2263659117945873025.horikyota.ntt@gmail.com
In response to BUG #17611: SJIS conversion rule about duplicated characters differ from Windows  (PG Bug reporting form <noreply@postgresql.org>)
Responses Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-bugs
This is not a bug but the designed behavior. Still, we could change the
conversion table if a plausible rationale is presented.

At Thu, 08 Sep 2022 11:33:17 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in 
> SJIS(Windows-31J) has several defined characters that has the 
> same glyph but a different code point for it. The SJIS conversion 
> rules in PostgreSQL's client_encoding seem to be slightly different 
> from the rules in the Windows OS.

PostgreSQL follows CP932, and no public standard defines a precedence
among duplicate characters. According to [2], the rule is published only
as Microsoft's recommended convention.

> In some cases, it causes a bad thing for Windows users. 
> For example, some text editors can't display these characters, and
> .NET applications raise exceptions when converting SJIS byte 
> sequences to UTF16 (String type). This can happen when using Npgsql[1].
> 
> .NET code:
> ----
> Encoding e = Encoding.GetEncoding("shift_jis",

AFAIK Shift_JIS and CP932 generally have different character sets.  I
don't know .NET well, but wouldn't CP932 work in that case?
Specifically, "Encoding.GetEncoding(932)".  There must be a way to deal
with those characters, since they are in CP932.
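
To illustrate the difference between the two character sets, here is a
minimal sketch that uses Python's built-in "shift_jis" and "cp932" codecs
as stand-ins. It says nothing about how .NET behaves; the helper name
decodes_with is made up for this example.

```python
# Illustration only: Python's codecs stand in for the encodings discussed;
# this does not demonstrate .NET behavior.
NEC_ROMAN_ONE = b"\x87\x54"  # roman numeral one, NEC special character
IBM_ROMAN_ONE = b"\xfa\x4a"  # the same character, IBM extension duplicate


def decodes_with(codec: str, raw: bytes) -> bool:
    """Return True if `raw` is a valid byte sequence for `codec`."""
    try:
        raw.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Under CP932 both duplicates decode to the same Unicode character, U+2160.
same_char = NEC_ROMAN_ONE.decode("cp932") == IBM_ROMAN_ONE.decode("cp932")

# Python's strict Shift_JIS codec covers only JIS X 0208 (plus JIS X 0201),
# so neither vendor-extension code is valid there.
for raw in (NEC_ROMAN_ONE, IBM_ROMAN_ONE):
    print(raw.hex(), "shift_jis:", decodes_with("shift_jis", raw),
          "cp932:", decodes_with("cp932", raw))
```

A client that decodes with a strict Shift_JIS table would fail on either
byte sequence, while a CP932 table accepts both.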

> My customers have difficulty dealing with SJIS code in Windows 
> applications because of this difference in conversion rules. 
> They are migrating from Oracle and many of the applications are 
> written for the SJIS environment.
> 
> The rules for converting from Unicode to characters that are 
> duplicated in SJIS seem to be as follows in Windows[2]: 
> 
> 1. If the character is in both JIS X 0208 and NEC special characters, 
>    use the code point of JIS X 0208.
> 2. If the character is in both NEC special characters and IBM selected 
>    characters, use the code point of NEC special characters.
> 3. If the character is in both IBM selected characters and 
>    NEC selected-IBM extended characters, use the code point of 
>    IBM selected characters.
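
The three quoted rules amount to a fixed priority order among the groups.
A rough sketch of that ordering follows. The SJIS ranges used to classify a
code are assumptions made for this example (they are not taken from the
report), every code outside them is simply treated as JIS X 0208, and the
names group_priority and pick_preferred are hypothetical.

```python
# Assumed SJIS code ranges for the vendor-extension groups (illustrative).
NEC_SPECIAL = range(0x8740, 0x879D)
IBM_SELECTED = range(0xFA40, 0xFC4C)
NEC_SELECTED_IBM = range(0xED40, 0xEEFD)


def group_priority(code: int) -> int:
    """Lower value means preferred, following the quoted rules 1 to 3."""
    if code in NEC_SPECIAL:
        return 1
    if code in IBM_SELECTED:
        return 2
    if code in NEC_SELECTED_IBM:
        return 3
    return 0  # simplification: anything else counts as JIS X 0208


def pick_preferred(candidates: list[int]) -> int:
    """Choose the SJIS code the quoted rules would emit for one character."""
    return min(candidates, key=group_priority)


# Circled/roman-numeral case from the report: NEC special vs IBM selected.
print(hex(pick_preferred([0xFA4A, 0x8754])))
```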

Mmm. I cannot reach the original Microsoft document that [2] points
to. Could you tell me an alternative URL?  (Google didn't return any
usable information for kb170559 or the like.)

> However, the rules for converting from Unicode to SJIS in PostgreSQL 
> seem to differ from the above second rule.
> SJIS codepoints corresponding to the second rule are listed below:
> - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
> - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
>
> In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array 
> defines the not used code points when converting Unicode to SJIS.
> According to the second rule above, the @reject_sjis array must contain 
> "IBM selected characters", but it currently contains "NEC special
> characters".

Anyway, it is not part of a public standard, and that "rule" is at most
a recommendation. So it does not follow that we "must" change the
conversion table to match it.

FYI, the following ranges of SJIS character codes are *excluded* during
Unicode->SJIS conversion. They are not limited to NEC/IBM extension
characters.

ed40 - eefc : so-called "NEC extension"
              uses  fa40 - fc40 (IBM extension) instead.
8754 - 875d : numbers with circle, and upper roman numbers
              uses fa4a - fa53 instead.
878a, 8782, 8784, fa5b, fa54: some Japanese combined characters "No." "(株)"...
              uses fa58, fa59, fa5a, 81e6, 879a, 81ca
8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
8795 - 8797 : ditto, 81e3, 81db, 81da
879a - 879c : ditto, 879a, 81bf, 81be
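
The list above can be read as a filter applied to the candidate SJIS codes
of a Unicode character. The sketch below only encodes the ranges exactly as
listed in this mail; it is not the code in UCS_to_SJIS.pl, and the names
is_rejected and filter_candidates are made up for illustration.

```python
# Excluded SJIS codes, transcribed from the list in this mail (inclusive).
REJECTED_RANGES = [
    (0xED40, 0xEEFC),  # so-called "NEC extension"
    (0x8754, 0x875D),  # circled numbers and upper roman numerals
    (0x8790, 0x8792),  # math symbols
    (0x8795, 0x8797),  # math symbols
    (0x879A, 0x879C),  # math symbols
]
REJECTED_SINGLES = {0x878A, 0x8782, 0x8784, 0xFA5B, 0xFA54}


def is_rejected(code: int) -> bool:
    """True if `code` is excluded as a Unicode->SJIS conversion target."""
    if code in REJECTED_SINGLES:
        return True
    return any(lo <= code <= hi for lo, hi in REJECTED_RANGES)


def filter_candidates(candidates: list[int]) -> list[int]:
    """Drop excluded codes, keeping the remaining candidates in order."""
    return [c for c in candidates if not is_rejected(c)]


# With both duplicates as candidates, only the IBM extension code remains.
print([hex(c) for c in filter_candidates([0x8754, 0xFA4A])])
```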

> The current PostgreSQL rules for converting duplicate definition characters
> 
> seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, 
> back in 2001, but I could not be found reason for it in past mailing list
> logs. 
> I think this conversion difference is a bug, 
> but is it a rule with some clear reason?

I don't know of a clear reason for the current conversion, but the fact
that we have had no complaints about it for more than ten years is
itself a reason *not* to change the conversion table, because changing
those tables could cause problems elsewhere.

> [1] https://www.npgsql.org/
> [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


