Обсуждение: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 17611 Logged by: yusuke egashira Email address: egashira.yusuke@fujitsu.com PostgreSQL version: 12.11 Operating system: RHEL7(Server) and Windows10(Client) Description: SJIS(Windows-31J) has several defined characters that has the same glyph but a different code point for it. The SJIS conversion rules in PostgreSQL's client_encoding seem to be slightly different from the rules in the Windows OS. In some cases, it causes a bad thing for Windows users. For example, some text editors can't display these characters, and .NET applications raise exceptions when converting SJIS byte sequences to UTF16 (String type). This can happen when using Npgsql[1]. .NET code: ---- Encoding e = Encoding.GetEncoding("shift_jis", EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback); var utfString = e.GetString(sjis_byte_sequence); ---- Exception: ---- Exception thrown: 'System.Text.DecoderFallbackException' in mscorlib.dll An unhandled exception of type 'System.Text.DecoderFallbackException' occurred in mscorlib.dll Unable to translate bytes [FA][4A] at index 1632 from specified code page to Unicode. ---- My customers have difficulty dealing with SJIS code in Windows applications because of this difference in conversion rules. They are migrating from Oracle and many of the applications are written for the SJIS environment. The rules for converting from Unicode to characters that are duplicated in SJIS seem to be as follows in Windows[2]: 1. If the character is in both JIS X 0208 and NEC special characters, use the code point of JIS X 0208. 2. If the character is in both NEC special characters and IBM selected characters, use the code point of NEC special characters. 3. If the character is in both IBM selected characters and NEC selected-IBM extended characters, use the code point of IBM selected characters. However, the rules for converting from Unicode to SJIS in PostgreSQL seem to differ from the above second rule. SJIS codepoints corresponding to the second rule are listed below: - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58 In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array defines the not used code points when converting Unicode to SJIS. According to the second rule above, the @reject_sjis array must contain "IBM selected characters", but it currently contains "NEC special characters". The current PostgreSQL rules for converting duplicate definition characters seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, back in 2001, but I could not be found reason for it in past mailing list logs. I think this conversion difference is a bug, but is it a rule with some clear reason? [1] https://www.npgsql.org/ [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html
Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
От
Kyotaro Horiguchi
Дата:
This is not a bug, but the designed behavior. But we could change that conversion table if a plausible reasoning is raised. At Thu, 08 Sep 2022 11:33:17 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in > SJIS(Windows-31J) has several defined characters that has the > same glyph but a different code point for it. The SJIS conversion > rules in PostgreSQL's client_encoding seem to be slightly different > from the rules in the Windows OS. PostgreSQL follows CP932. And no rule on the precedence between duplicate characters is published as a public standard. According to [2], it is published as Microsoft's recommended convention. > In some cases, it causes a bad thing for Windows users. > For example, some text editors can't display these characters, and > .NET applications raise exceptions when converting SJIS byte > sequences to UTF16 (String type). This can happen when using Npgsql[1]. > > .NET code: > ---- > Encoding e = Encoding.GetEncoding("shift_jis", AFAIK generally Shift_jis and CP932 have different character sets. I don't know about .Net but doesn't CP932 work in that case? Specifically, "Encoding.GetEncoding(932)". There must a way to deal with that characters since they are in CP932. > My customers have difficulty dealing with SJIS code in Windows > applications because of this difference in conversion rules. > They are migrating from Oracle and many of the applications are > written for the SJIS environment. > > The rules for converting from Unicode to characters that are > duplicated in SJIS seem to be as follows in Windows[2]: > > 1. If the character is in both JIS X 0208 and NEC special characters, > use the code point of JIS X 0208. > 2. If the character is in both NEC special characters and IBM selected > characters, use the code point of NEC special characters. > 3. If the character is in both IBM selected characters and > NEC selected-IBM extended characters, use the code point of > IBM selected characters. Mmm. I don't reach the original document by Microsoft pointed from [2]. Could you tell me an alternative URL? (Goole didn't offer usable info by kb170559 or somethig like) > However, the rules for converting from Unicode to SJIS in PostgreSQL > seem to differ from the above second rule. > SJIS codepoints corresponding to the second rule are listed below: > - "NEC special characters" : 0x8754 - 0x875D, 0x8782, 0x8784, 0x878A > - "IBM selected characters": 0xFA4A - 0xFA53, 0xFA59, 0xFA5A, 0xFA58 > > In src/backend/utils/mb/Unicode/UCS_to_SJIS.pl, @reject_sjis array > defines the not used code points when converting Unicode to SJIS. > According to the second rule above, the @reject_sjis array must contain > "IBM selected characters", but it currently contains "NEC special > characters". Anyway it is not in the public standard and at most that "rule" is a recommendation. So it's not the case we "must" change the conversion table following the "rule". FYI, the following range of SJIS character codes are *excluded* while unicode->sjis conversion. They are not only NEC/IBM extension characters. ed40 - eefc : so-called "NEC extension" uses fa40 - fc40 (IBM extension) instead. 8754 - 875d : numbers with circle, and upper roman numbers uses fa4a - fa53 instead. 878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"... uses fa58, fa59, fa5a, 81e6, 879a, 81ca 8790 - 8792 : math symbols, uses 81e0, 81df, 81e7 8795 - 8797 : ditto, 81e3, 81db, 81da 879a - 879c : ditto, 879a, 81bf, 81be > The current PostgreSQL rules for converting duplicate definition characters > > seems to be introduced by 5735c4cf3d059914e2b9d294203aa06fb2c4ac75, > back in 2001, but I could not be found reason for it in past mailing list > logs. > I think this conversion difference is a bug, > but is it a rule with some clear reason? I don't know about a clear rason for the current conversion, but it is a reason for *not* changing the conversion table that we had no complaint about the conversion for more than ten years. Because changing that tables could cause problems elsewhere. > [1] https://www.npgsql.org/ > [2] https://dev.mysql.com/doc/mysql-g11n-excerpt/8.0/en/charset-cp932.html regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > This is not a bug, but the designed behavior. But we could change that > conversion table if a plausible reasoning is raised. Given how long our current behavior has stood, I think it'd have to be a pretty convincing argument. As you say, there'd be some serious compatibility costs to changing that behavior. IIUC, SJIS<->Unicode conversions have always been a squishy thing because of inconsistencies between the various versions of "SJIS". I'm not seeing a good reason we should regard Windows' behavior as authoritative here. I'm not saying I can't be convinced, but "Microsoft does it that way" isn't enough to convince me. regards, tom lane
Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
От
Kyotaro Horiguchi
Дата:
At Thu, 08 Sep 2022 22:58:31 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > This is not a bug, but the designed behavior. But we could change that > > conversion table if a plausible reasoning is raised. > > Given how long our current behavior has stood, I think it'd have to be > a pretty convincing argument. As you say, there'd be some serious > compatibility costs to changing that behavior. > > IIUC, SJIS<->Unicode conversions have always been a squishy thing > because of inconsistencies between the various versions of "SJIS". > I'm not seeing a good reason we should regard Windows' behavior as > authoritative here. > > I'm not saying I can't be convinced, but "Microsoft does it that > way" isn't enough to convince me. Yeah, it is more or less I meant. And I suspect that the problem that his customers are complaining is not caused by our specific conversion table. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
RE: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
От
"egashira.yusuke@fujitsu.com"
Дата:
Hi, Thank you replying. I understand that the difference between these conversion rules is not bug. Unfortunately, given the many variations of SJIS, we also realized that matching a translation table to Microsoft's "recommended" rule would likely cause other problem to happen. I have been looking for original documents such as kb170559 that Microsoft changed the URL or stopped publishing, but I also couldn't find them... Therefore, I agree with you that PostgreSQL should not suddenly change its conversion rules. However, I still think it is problem that PostgreSQL returns some characters which not able to be used in some Windows environment. Would it be a reasonable solution to this problem to have the user create a map file with the conversion rules changed and add the conversion by CREATE CONVERSION ? > AFAIK generally Shift_jis and CP932 have different character sets. I > don't know about .Net but doesn't CP932 work in that case? > Specifically, "Encoding.GetEncoding(932)". There must a way to deal > with that characters since they are in CP932. Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same exception occurs for "Encoding.GetEncoding(932)". > FYI, the following range of SJIS character codes are *excluded* while > unicode->sjis conversion. They are not only NEC/IBM extension > characters. > > ed40 - eefc : so-called "NEC extension" > uses fa40 - fc40 (IBM extension) instead. > 8754 - 875d : numbers with circle, and upper roman numbers > uses fa4a - fa53 instead. > 878a, 8782, 8784, fa5b, fa54: some japanese combined characters "No." "(株)"... > uses fa58, fa59, fa5a, 81e6, 879a, 81ca > 8790 - 8792 : math symbols, uses 81e0, 81df, 81e7 > 8795 - 8797 : ditto, 81e3, 81db, 81da > 879a - 879c : ditto, 879a, 81bf, 81be Yes, I understand this exclude rules describes the conversion rule for SJIS duplicated characters in PostgreSQL. In my understanding, characters related to duplicate characters included in SJIS are as follows[2]. - NEC special characters(Row 13) : 8740 - 879c - NEC selected-IBM extended characters(Row 89 - 92) : ed40 - eefc - IBM selected characters(Row 115 - 119) : fa40 - fc4b The excluding rule of PostgreSQL seems to be match the Microsoft's recommended rule except for "NEC special characters (Row 13) and IBM selected characters(Row 115 - 119)" rule. > At Thu, 08 Sep 2022 22:58:31 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in > > Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes: > > > This is not a bug, but the designed behavior. But we could change that > > > conversion table if a plausible reasoning is raised. > > > > Given how long our current behavior has stood, I think it'd have to be > > a pretty convincing argument. As you say, there'd be some serious > > compatibility costs to changing that behavior. > > > > IIUC, SJIS<->Unicode conversions have always been a squishy thing > > because of inconsistencies between the various versions of "SJIS". > > I'm not seeing a good reason we should regard Windows' behavior as > > authoritative here. > > > > I'm not saying I can't be convinced, but "Microsoft does it that > > way" isn't enough to convince me. > > Yeah, it is more or less I meant. And I suspect that the problem that > his customers are complaining is not caused by our specific conversion > table. I think the behavior of Windows obscures the problem of different conversion rules. I attached the text file extracted 2-byte characters from CP932.TXT[3]. When this is displayed using notepad.exe or type command on cmd.exe, all characters are displayed in readable form. However, when we save the duplicate definition characters displayed on notepad to a file, they are implicitly converted to "Microsoft Recommended Code Points". So, my problem is probably a corner case. My customer used a third-party text editor instead of notepad and claimed that some duplicate definition characters could not be displayed. Npgsql works with the client_encoding=utf8 setting by default, however, there was a customer who wanted to use client_encoding=sjis, and the encoding problem came to light. Of course, both cases can be treat as third-party editors or .NET issues. However, I thought that this might be a bug because those problems would not have occurred if PostgreSQL convert the characters via the Microsoft's recommended conversion rules, and the reason of the current PostgreSQL conversion rules was not clear. At least if the reason of the PostgreSQL's current conversion rule is clear, it will help us to explain to the users. [1] https://docs.microsoft.com/en-us/dotnet/api/system.text.encoding?view=net-6.0#list-of-encodings [2] https://en.wikipedia.org/wiki/Code_page_932_(Microsoft_Windows)#Double-byte_character_differences [3] http://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT Regards. Yusuke, Egashira
Вложения
Re: BUG #17611: SJIS conversion rule about duplicated characters differ from Windows
От
Kyotaro Horiguchi
Дата:
At Fri, 9 Sep 2022 12:22:33 +0000, "egashira.yusuke@fujitsu.com" <egashira.yusuke@fujitsu.com> wrote in > However, I still think it is problem that PostgreSQL returns some characters > which not able to be used in some Windows environment. > Would it be a reasonable solution to this problem to have the user create > a map file with the conversion rules changed and add the conversion by > CREATE CONVERSION ? The best way nowadays would be to move the entire system to unicode. Or doesn't it work to let the .Net application to convert UTF-8 into SJIS locally? > > AFAIK generally Shift_jis and CP932 have different character sets. I > > don't know about .Net but doesn't CP932 work in that case? > > Specifically, "Encoding.GetEncoding(932)". There must a way to deal > > with that characters since they are in CP932. > > Unfortunately, "shift_jis" is the name of "CP932" in .NET[1], so the same > exception occurs for "Encoding.GetEncoding(932)". Wow.. MS uses shift_jis as mere an alias of its variant of CP932 [1] (MS932?).. It's not Shift_JIS nor even CP932 (at least in the decoding direction).. > I think the behavior of Windows obscures the problem of different > conversion rules. I attached the text file extracted 2-byte characters from > CP932.TXT[3]. When this is displayed using notepad.exe or > type command on cmd.exe, all characters are displayed in readable form. > However, when we save the duplicate definition characters displayed > on notepad to a file, they are implicitly converted to > "Microsoft Recommended Code Points". So, my problem is probably > a corner case. .Net seems less robust than notepad.exe. I didn't find a way to create a custom encoding on .Net framework. (But I don't think that is the way to go.) > My customer used a third-party text editor instead of notepad and claimed > that some duplicate definition characters could not be displayed. > Npgsql works with the client_encoding=utf8 setting by default, however, > there was a customer who wanted to use client_encoding=sjis, and > the encoding problem came to light. Ah.. It's nowadays seldom seen, especially about use of level-3 or more rarely-used characters. Anyway I don't see a reason for utf8 not being usable as wire-encoding. If the combination of .Net unicode decoder and sjis encoder works, wouldn't that problem be gone? I believe .Net sjis encoder must yield the desired result. Goole showed me some complaints about .Npgsql and SJIS, but many of them came from the default encodig being not utf8 at that time and some of them are about broken error messages in an unexpected encoding.. > Of course, both cases can be treat as third-party editors or .NET issues. > However, I thought that this might be a bug because those problems would > not have occurred if PostgreSQL convert the characters via the Microsoft's > recommended conversion rules, and the reason of the current PostgreSQL > conversion rules was not clear. > At least if the reason of the PostgreSQL's current conversion rule is clear, > it will help us to explain to the users. The cause of the trouble is that the .Net's specific implement of CP932 decoder is not actually following CP932; it doesn't accept some valid characters. The editor does the same, too. So the correct measure for this situation seems like to convert the texts (in SJIS or UTF-8, as mentioned above) into the special encoding following the MS's recommendation no longer available for some reason. As mentioned before, one of the reasons for the current PostgreSQL's SJIS mapping is that the precedence between duplicate characters is not defined in the standard, in other words, it is implementation dependent. Thus it is valid to arbitrarily define the mapping as far as it covers all characters. It was more than a decade ago so I don't know the principle for the mapping, though. But it seems like putting precedence to characters in the IBM extension area. The reason we don't change it is it's now sufficiently legacy and the lack of complaint until it became legacy despite of (I believe) a certain amount of use cases. It's being leagcy suggests there may be use cases where that conversion is expected. [1] (Japanese doc) https://docs.microsoft.com/ja-jp/dotnet/api/system.text.encoding?view=net-6.0 regards. -- Kyotaro Horiguchi NTT Open Source Software Center