Discussion: BUG #4714: Unicode Big5 Conversion
The following bug has been logged online:

Bug reference: 4714
Logged by: Roger Chang
Email address: rchang111@gmail.com
PostgreSQL version: 8.3
Operating system: All
Description: Unicode Big5 Conversion
Details:

This is NOT a bug, but it causes some problems. It has been this way for a long time, up to 8.3.7, and no one has responded.

The Chinese Big5/UTF8 conversion maps
big5_to_utf8.map
utf8_to_big5.map
are missing some characters, and I don't know whom to talk to. I have to do a source build on every version upgrade.

Could somebody please help and add the following character mappings to the above maps? (+7 chars.)

{0xf9d6, 0xe7a281},
{0xf9d7, 0xe98ab9},
{0xf9d8, 0xe8a38f},
{0xf9d9, 0xe5a2bb},
{0xf9da, 0xe68192},
{0xf9db, 0xe7b2a7},
{0xf9dc, 0xe5abba}

Thanks in advance.

I would like to help with this kind of work in the future; I feel I should give something back to PostgreSQL after using it for so many years.
Roger Chang wrote:
> [...]
> {0xf9d6, 0xe7a281},
> {0xf9d7, 0xe98ab9},
> {0xf9d8, 0xe8a38f},
> {0xf9d9, 0xe5a2bb},
> {0xf9da, 0xe68192},
> {0xf9db, 0xe7b2a7},
> {0xf9dc, 0xe5abba}

Thanks! Looking up those Unicode characters in the Unihan database at http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=92B9&useutf8=true doesn't give any mapping to the Big5 encoding. At the bottom, however, there is a mapping to "kHKSCS", which matches the mapping you listed.

Looking at the Wikipedia page for Big5, it seems that those characters belong to Microsoft's ETEN extension. The page also claims that "The ETen extension became part of the current Big5 standard through popularity." Is that true? Do we support all the other characters in the ETEN extension? Is there some authoritative source for the Big5 encoding, to look these things up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Roger Chang wrote:
> There is authoritative source for the Big5 encoding, but don't believe
> that do help
>
> http://www.cns11643.gov.tw/AIDB/encodings_en.do
>
> Skip the historical mess already done. we should focus on reality?
>
> brief events according time-line,
>
> * BIG5 created, mostly by ETEN company, some others but not important now.
> * CNS Standard like 11643, Taiwan Government authority building in
>   mean time ...
> * Windows 3 showup, need Chinese ... pick not CNS but BIG5 ???
>   Code Page 950 born.
> * ETEN company add "ETen-extension 0xF9D6-0xF9FE" to work with IBM5550
> * Since Windows ME, CP950 add above mentioned 7 char. 0xF9D6-0xF9FE ONLY ???
> * Later Hong Kong add above 7 Char. plus some more symbol in
>   HKSCS-2004, and what you found is right.
> * WHAT A MESS !
>
> Focus on reality,
> only mentioned 7 Char. I need to build into pgsql sources to compliant
> with CP950, since few years ago.

Ok, so Windows codepage 950 has those 7 characters, but not the other ETEN extended chars. I think that's a good enough reason to add those 7 chars; we have 'win950' as an alias for big5 anyway.

I'll go add those characters.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
> Ok, so Windows codepage 950 has those 7 characters, but not the other
> ETEN extended chars. I think that's a good enough reason to add those 7
> chars; we have 'win950' as an alias for big5 anyway.

I downloaded the latest CP950 - Unicode conversion table from ftp://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT, and ran it through the UCS_to_most.pl script in src/backend/utils/mb/Unicode. Comparing the result to our current big5 conversion table, there are a couple of minor differences in the mapping of punctuation characters, e.g. 0xa145 is mapped to Unicode character 2022 BULLET in big5, and to 2027 HYPHENATION POINT in CP950. And we're missing all the "box drawing" characters in CP950 in the ranges 0xc6a1-0xc6fe, 0xc470-0xc7fc and 0xf9dd-0xf9fe. And then there are the 7 characters you mentioned in the range 0xf9d6-0xf9dc. So although we use win950 as an alias for big5, it's not the same thing.

I guess we don't care about the box drawing characters; they're not very useful for a database, and we shouldn't change the mapping of existing characters, for backwards-compatibility reasons. I wondered whether we should make win950 a separate encoding, but they seem to be close enough in practice that it's better to keep them the same.

So again, I'll just go add those 7 characters.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
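As an illustration of the kind of comparison described above, here is a minimal Python sketch. It is not the UCS_to_most.pl script; it only parses two mapping tables in the unicode.org text format (encoded bytes, Unicode code point, optional # comment) and reports entries that differ or that exist only in the second table. The names parse_mapping and diff_tables and the inline sample rows are assumptions made for illustration. The samples hold only the two entries mentioned in this thread plus one ordinary Big5 row, not the real BIG5.TXT or CP950.TXT files.

```python
def parse_mapping(text):
    """Parse 'encoded<TAB>codepoint<TAB>#comment' lines into a dict."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # Entry without a Unicode mapping (undefined); skip it.
            continue
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def diff_tables(old, new):
    """Return (entries mapped differently, entries present only in new)."""
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys()
               if old[k] != new[k]}
    only_new = {k: new[k] for k in new.keys() - old.keys()}
    return changed, only_new


# Illustrative excerpts only (assumed layout, not the full files).
BIG5_SAMPLE = (
    "0xA145\t0x2022\t# BULLET\n"
    "0xA440\t0x4E00\t# <CJK>\n"
)
CP950_SAMPLE = (
    "0xA145\t0x2027\t#HYPHENATION POINT\n"
    "0xA440\t0x4E00\t#CJK UNIFIED IDEOGRAPH\n"
    "0xF9D6\t0x7881\t#CJK UNIFIED IDEOGRAPH\n"
)

big5 = parse_mapping(BIG5_SAMPLE)
cp950 = parse_mapping(CP950_SAMPLE)
changed, only_cp950 = diff_tables(big5, cp950)

for code, (a, b) in sorted(changed.items()):
    print(f"0x{code:04x}: big5 -> U+{a:04X}, cp950 -> U+{b:04X}")
for code, cp in sorted(only_cp950.items()):
    print(f"0x{code:04x}: only in cp950 -> U+{cp:04X}")
```

Run against the full files, the same diff would be expected to list the punctuation differences, the box drawing ranges and the seven characters discussed here, but that has not been verified with this sketch.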
> Ok, so Windows codepage 950 has those 7 characters, but not the other
> ETEN extended chars. I think that's a good enough reason to add those 7
> chars; we have 'win950' as an alias for big5 anyway.
>
> I'll go add those characters.

Be very careful not to break the standard defined by Unicode.

For example, the glyph for 0xe7a281 == U+7881 is defined on page 43 of http://unicode.org/charts/PDF/U4E00.pdf. So we need to make sure that the particular kanji character defined in Big5 0xf9d6 has the same glyph as the one defined in Unicode (U+7881). The same can be said for the rest of the proposed mapping.

> {0xf9d6, 0xe7a281},
> {0xf9d7, 0xe98ab9},
> {0xf9d8, 0xe8a38f},
> {0xf9d9, 0xe5a2bb},
> {0xf9da, 0xe68192},
> {0xf9db, 0xe7b2a7},
> {0xf9dc, 0xe5abba},

--
Tatsuo Ishii
SRA OSS, Inc. Japan
Tatsuo Ishii wrote:
> Be very careful not to break the standard defined by Unicode.
> For example, the glyph for 0xe7a281 == U+7881 is defined on page 43 of
> http://unicode.org/charts/PDF/U4E00.pdf. So we need to make sure that the
> particular kanji character defined in Big5 0xf9d6 has the same glyph
> as the one defined in Unicode (U+7881). The same can be said for the rest
> of the proposed mapping.

The mappings look correct to me; they are the same mappings as defined in the CP950.TXT file from unicode.org.

I've added a Perl script, UCS_to_BIG5.pl, to generate the mapping tables from BIG5.TXT as before, with the addition of those seven extra characters from CP950.TXT.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
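For readers who want to repeat the check, the second value in each proposed pair is the character's UTF-8 byte sequence written as one integer, which is why 0xe7a281 corresponds to U+7881. The following Python sketch decodes the seven proposed entries into code points so they can be compared by hand against CP950.TXT. The helper name utf8_int_to_codepoint is an assumption made for illustration and is not part of PostgreSQL or of UCS_to_BIG5.pl.

```python
# The seven pairs proposed in the bug report: (Big5 code, UTF-8 bytes as int).
PROPOSED = [
    (0xF9D6, 0xE7A281),
    (0xF9D7, 0xE98AB9),
    (0xF9D8, 0xE8A38F),
    (0xF9D9, 0xE5A2BB),
    (0xF9DA, 0xE68192),
    (0xF9DB, 0xE7B2A7),
    (0xF9DC, 0xE5ABBA),
]


def utf8_int_to_codepoint(value):
    """Turn a UTF-8 sequence packed into an integer into a code point."""
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if malformed
    if len(text) != 1:
        raise ValueError(f"0x{value:x} does not encode exactly one character")
    return ord(text)


for big5_code, utf8_value in PROPOSED:
    cp = utf8_int_to_codepoint(utf8_value)
    print(f"0x{big5_code:04x} -> 0x{utf8_value:06x} -> U+{cp:04X}")
```

This only confirms that the byte sequences are well-formed and shows which code points they name; whether the glyphs match, as raised above, still has to be checked against the Unicode charts.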