BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
From | PG Bug reporting form |
---|---|
Subject | BUG #17611: SJIS conversion rule about duplicated characters differ from Windows |
Date | |
Msg-id | 17611-472d27cf395135b7@postgresql.org |
Responses | Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows |
List | pgsql-bugs |
The following bug has been logged on the website:

Bug reference:      17611
Logged by:          yusuke egashira
Email address:      egashira.yusuke@fujitsu.com
PostgreSQL version: 12.11
Operating system:   RHEL7(Server) and Windows10(Client)
Description:

SJIS (Windows-31J) defines several characters more than once: the same glyph is assigned to different code points. The SJIS conversion rules used by PostgreSQL's client_encoding seem to differ slightly from the rules used by the Windows OS.

In some cases this causes problems for Windows users. For example, some text editors cannot display these characters, and .NET applications raise exceptions when converting SJIS byte sequences to UTF-16 (the String type). This can happen when using Npgsql[1].

.NET code:
----
Encoding e = Encoding.GetEncoding("shift_jis",
    EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
var utfString = e.GetString(sjis_byte_sequence);
----

Exception:
----
Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll
An unhandled exception of type 'System.Text.DecoderFallbackException' occurred in mscorlib.dll
Unable to translate bytes [FA][4A] at index 1632 from specified code page to Unicode.
----

My customers have difficulty handling SJIS data in Windows applications because of this difference in conversion rules. They are migrating from Oracle, and many of their applications are written for an SJIS environment.

On Windows, the rules for converting from Unicode to characters that are duplicated in SJIS seem to be as follows[2]:

1. If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208.
2. If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters.
3. If the character is in both IBM selected characters and NEC selected-IBM extended characters, use the code point of IBM selected characters.
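The three rules above amount to a fixed priority order between the regions of the code page. The following is only a rough sketch of that ordering, not code from PostgreSQL or Windows. The function names are hypothetical, and the byte ranges used to classify a code point are assumptions that do not come from this report; they are approximate and include unassigned cells.

```python
# Hypothetical sketch of the Windows-style priority between duplicated
# SJIS code points. The region byte ranges below are assumptions made
# for illustration; they are approximate and not taken from the report.

# Regions listed from highest to lowest priority (rules 1, 2 and 3).
REGIONS = [
    ("JIS X 0208", [(0x8140, 0x84BE), (0x889F, 0x9FFC), (0xE040, 0xEAA4)]),
    ("NEC special characters", [(0x8740, 0x879C)]),
    ("IBM selected characters", [(0xFA40, 0xFC4B)]),
    ("NEC selected-IBM extended characters", [(0xED40, 0xEEFC)]),
]


def region_priority(code: int) -> int:
    """Return the index of the region containing `code` (0 = preferred)."""
    for priority, (_name, ranges) in enumerate(REGIONS):
        if any(low <= code <= high for low, high in ranges):
            return priority
    raise ValueError(f"0x{code:04X} is not in any known region")


def preferred_code(candidates):
    """Pick the code point Windows would use among duplicate definitions."""
    return min(candidates, key=region_priority)


# Rule 2 applied to the first pair named in this report:
# NEC special 0x8754 wins over IBM selected 0xFA4A.
print(hex(preferred_code([0xFA4A, 0x8754])))
```

Under this ordering, the byte pair [FA][4A] from the exception above would never be produced for a character that also has an NEC special code point.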
However, the rules for converting from Unicode to SJIS in PostgreSQL seem to differ from the second rule above. The SJIS code points affected by the second rule are:

- "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A
- "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58

In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, the @reject_sjis array defines the code points that are not used when converting Unicode to SJIS. According to the second rule above, the @reject_sjis array should contain the "IBM selected characters", but it currently contains the "NEC special characters".

The current PostgreSQL rules for converting characters with duplicate definitions seem to have been introduced by commit 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, back in 2001, but I could not find the reason for them in past mailing list logs.

I think this conversion difference is a bug, or is there a clear reason for the current rule?

[1] https://www.npgsql.org/
[2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html
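To make the affected code points concrete, the two lists above can be written out as a table of thirteen pairs. The sketch below is only an illustration, not something this report proposes as a fix. It assumes that the two lists are given in matching order, and it adds a hypothetical client-side helper, remap_ibm_to_nec, that rewrites the IBM selected code points in an SJIS byte string to their NEC special counterparts. Trail bytes are not validated, so this is a simplified walk over the byte stream.

```python
# Illustration only: the 13 duplicated pairs covered by the second rule.
# Assumption: the two lists in the report are in matching order, so the
# n-th NEC special code point pairs with the n-th IBM selected one.
NEC_SPECIAL = list(range(0x8754, 0x875D + 1)) + [0x8782, 0x8784, 0x878A]
IBM_SELECTED = list(range(0xFA4A, 0xFA53 + 1)) + [0xFA59, 0xFA5A, 0xFA58]
IBM_TO_NEC = dict(zip(IBM_SELECTED, NEC_SPECIAL))


def is_lead_byte(byte: int) -> bool:
    """True if `byte` starts a two-byte SJIS character."""
    return 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC


def remap_ibm_to_nec(data: bytes) -> bytes:
    """Hypothetical helper: replace IBM selected code points with the
    paired NEC special code points, leaving all other bytes unchanged."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if is_lead_byte(byte) and i + 1 < len(data):
            # Two-byte character: look it up as a 16-bit code point.
            code = (byte << 8) | data[i + 1]
            code = IBM_TO_NEC.get(code, code)
            out += code.to_bytes(2, "big")
            i += 2
        else:
            # Single-byte character, or a dangling lead byte at the end.
            out.append(byte)
            i += 1
    return bytes(out)


# The byte pair from the exception in this report, between two ASCII letters.
print(remap_ibm_to_nec(b"A\xfa\x4aB").hex())
```

The character-by-character walk matters because a trail byte can have the same value as a lead byte, so a plain byte-level search and replace could rewrite the wrong bytes.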