Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
From | Kyotaro Horiguchi |
---|---|
Subject | Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows |
Date | |
Msg-id | 20220909.114216.2263659117945873025.horikyota.ntt@gmail.com |
In reply to | BUG #17611: SJIS conversion rule about duplicated characters differ from Windows (PG Bug reporting form <noreply@postgresql.org>) |
Responses |
Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
|
List | pgsql-bugs |
This is not a bug but the designed behavior. However, we could change
the conversion table if plausible reasoning is raised.

At Thu, 08 Sep 2022 11:33:17 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in
> SJIS(Windows-31J) has several defined characters that has the
> same glyph but a different code point for it. The SJIS conversion
> rules in PostgreSQL's client_encoding seem to be slightly different
> from the rules in the Windows OS.

PostgreSQL follows CP932. No rule on the precedence between duplicate
characters is published as a public standard. According to [2], it is
published as Microsoft's recommended convention.

> In some cases, it causes a bad thing for Windows users.
> For example, some text editors can't display these characters, and
> .NET applications raise exceptions when converting SJIS byte
> sequences to UTF16 (String type). This can happen when using Npgsql[1].
>
> .NET code:
> ----
> Encoding e = Encoding.GetEncoding("shift_jis",

AFAIK, Shift_JIS and CP932 generally have different character sets. I
don't know about .NET, but doesn't CP932 work in that case?
Specifically, "Encoding.GetEncoding(932)". There must be a way to deal
with those characters, since they are in CP932.

> My customers have difficulty dealing with SJIS code in Windows
> applications because of this difference in conversion rules.
> They are migrating from Oracle and many of the applications are
> written for the SJIS environment.
>
> The rules for converting from Unicode to characters that are
> duplicated in SJIS seem to be as follows in Windows[2]:
>
> 1. If the character is in both JIS X 0208 and NEC special characters,
> use the code point of JIS X 0208.
> 2. If the character is in both NEC special characters and IBM selected
> characters, use the code point of NEC special characters.
> 3. If the character is in both IBM selected characters and
> NEC selected-IBM extended characters, use the code point of
> IBM selected characters.

Mmm.
I can't reach the original Microsoft document that [2] points to. Could
you tell me an alternative URL? (Google didn't offer usable information
for kb170559 or something like that.)

> However, the rules for converting from Unicode to SJIS in PostgreSQL
> seem to differ from the above second rule.
> SJIS codepoints corresponding to the second rule are listed below:
> - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
> - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58
>
> In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array
> defines the not used code points when converting Unicode to SJIS.
> According to the second rule above, the @reject_sjis array must contain
> "IBM selected characters", but it currently contains "NEC special
> characters".

Anyway, it is not in a public standard, and at most that "rule" is a
recommendation. So it is not the case that we "must" change the
conversion table to follow the "rule".

FYI, the following ranges of SJIS character codes are *excluded* during
unicode->sjis conversion. They are not only NEC/IBM extension
characters.

ed40 - eefc : so-called "NEC extension"
              uses fa40 - fc40 (IBM extension) instead.
8754 - 875d : numbers with circle, and upper roman numbers
              uses fa4a - fa53 instead.
878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"...
              uses fa58, fa59, fa5a, 81e6, 879a, 81ca
8790 - 8792 : math symbols, uses 81e0, 81df, 81e7
8795 - 8797 : ditto, 81e3, 81db, 81da
879a - 879c : ditto, 879a, 81bf, 81be

> The current PostgreSQL rules for converting duplicate definition characters
> seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75,
> back in 2001, but I could not be found reason for it in past mailing list
> logs.
> I think this conversion difference is a bug,
> but is it a rule with some clear reason?
I don't know of a clear reason for the current conversion, but the fact
that we have had no complaints about it for more than ten years is a
reason for *not* changing the conversion table, because changing the
table could cause problems elsewhere.

> [1] https://www.npgsql.org/
> [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
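
For readers who want to see the duplicate-character issue concretely, here is a minimal sketch in Python. It assumes CPython's built-in "cp932" and "shift_jis" codecs, which are neither PostgreSQL's conversion tables nor .NET's encoders, so it only illustrates the general problem discussed in this thread. The two byte sequences are taken from the ranges quoted above (0x8754 from "NEC special characters", 0xFA4A from "IBM selected characters"); the helper names decode_both and round_trip are made up for this example.

```python
# Two CP932 byte sequences for the same character (ROMAN NUMERAL ONE),
# taken from the ranges quoted in this thread.
NEC_SPECIAL = bytes.fromhex("8754")   # "NEC special characters" code point
IBM_SELECTED = bytes.fromhex("fa4a")  # "IBM selected characters" code point


def decode_both(codec: str):
    """Decode both duplicates with the given codec; None marks a failure."""
    results = []
    for raw in (NEC_SPECIAL, IBM_SELECTED):
        try:
            results.append(raw.decode(codec))
        except UnicodeDecodeError:
            results.append(None)
    return results


def round_trip(raw: bytes, codec: str = "cp932") -> bytes:
    """Decode and re-encode; two duplicates cannot both come back unchanged."""
    return raw.decode(codec).encode(codec)


if __name__ == "__main__":
    # CP932 knows both code points; a stricter Shift_JIS codec may know neither.
    print("cp932    :", decode_both("cp932"))
    print("shift_jis:", decode_both("shift_jis"))
    # Whichever code point the encoder prefers, both inputs map to it.
    for raw in (NEC_SPECIAL, IBM_SELECTED):
        print(raw.hex(), "->", round_trip(raw).hex())
```

Both byte sequences decode to the same Unicode character, so any Unicode-to-SJIS converter has to pick one of them on the way back. Which one it picks is exactly the precedence question in this report, and the sketch deliberately does not assume a particular answer.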