Thread: BUG #4714: Unicode Big5 Conversion

BUG #4714: Unicode Big5 Conversion

From
"Roger Chang"
Date:
The following bug has been logged online:

Bug reference:      4714
Logged by:          Roger Chang
Email address:      rchang111@gmail.com
PostgreSQL version: 8.3
Operating system:   All
Description:        Unicode Big5 Conversion
Details:

This is NOT a bug, but it causes some problems. It has been this way for a
long time, up to 8.3.7, and still no one has responded.

The Chinese Big5/UTF8 conversion maps
big5_to_utf8.map
utf8_to_big5.map
are missing some characters. I don't know whom to talk to about this.
I suffer through a source build at every version upgrade.

Could somebody please help and add the following character mappings to the
maps above? (+7 characters.)

 {0xf9d6, 0xe7a281},
 {0xf9d7, 0xe98ab9},
 {0xf9d8, 0xe8a38f},
 {0xf9d9, 0xe5a2bb},
 {0xf9da, 0xe68192},
 {0xf9db, 0xe7b2a7},
 {0xf9dc, 0xe5abba}

Thanks in advance.

I would like to help with this kind of work in the future; I feel I need to
give some help to PostgreSQL after using it for so many years.

Re: BUG #4714: Unicode Big5 Conversion

From
Heikki Linnakangas
Date:
Roger Chang wrote:
> The following bug has been logged online:
>
> Bug reference:      4714
> Logged by:          Roger Chang
> Email address:      rchang111@gmail.com
> PostgreSQL version: 8.3
> Operating system:   All
> Description:        Unicode Big5 Conversion
> Details:
>
> This is NOT a bug, but it causes some problems. It has been this way for a
> long time, up to 8.3.7, and still no one has responded.
>
> The Chinese Big5/UTF8 conversion maps
> big5_to_utf8.map
> utf8_to_big5.map
> are missing some characters. I don't know whom to talk to about this.
> I suffer through a source build at every version upgrade.
>
> Could somebody please help and add the following character mappings to the
> maps above? (+7 characters.)
>
>  {0xf9d6, 0xe7a281},
>  {0xf9d7, 0xe98ab9},
>  {0xf9d8, 0xe8a38f},
>  {0xf9d9, 0xe5a2bb},
>  {0xf9da, 0xe68192},
>  {0xf9db, 0xe7b2a7},
>  {0xf9dc, 0xe5abba}
>
> Thanks in advance.
>
> I would like to help with this kind of work in the future; I feel I need to
> give some help to PostgreSQL after using it for so many years.

Thanks!

Looking up those Unicode characters in the Unihan database at
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=92B9&useutf8=true

doesn't give any mapping to the Big5 encoding. At the bottom, however,
there is a mapping to "kHKSCS", which matches the mapping you listed.

Looking at the Wikipedia page for Big5, it seems that those characters
belong to Microsoft's ETEN extension. The page also claims that "The
ETen extension became part of the current Big5 standard through
popularity." Is that true? Do we support all the other characters in the
ETEN extension?

Is there some authoritative source for the Big5 encoding, to look these
things up?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
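
The map entries in the report store each target character as its UTF-8 byte sequence packed into one integer, while Unihan is queried by code point (the URL above uses 92B9). The following is a minimal sketch of that conversion in Python; the helper name `utf8_int_to_codepoint` is made up for illustration and is not part of PostgreSQL.

```python
# Proposed map entries from the report:
# (Big5 code, UTF-8 byte sequence packed into an integer).
PROPOSED = [
    (0xF9D6, 0xE7A281),
    (0xF9D7, 0xE98AB9),
    (0xF9D8, 0xE8A38F),
    (0xF9D9, 0xE5A2BB),
    (0xF9DA, 0xE68192),
    (0xF9DB, 0xE7B2A7),
    (0xF9DC, 0xE5ABBA),
]


def utf8_int_to_codepoint(packed: int) -> int:
    """Decode a UTF-8 sequence stored as a big-endian integer (hypothetical helper)."""
    raw = packed.to_bytes((packed.bit_length() + 7) // 8, "big")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on malformed input
    if len(text) != 1:
        raise ValueError(f"0x{packed:x} does not encode exactly one character")
    return ord(text)


# Print each entry with the code point to use for a Unihan lookup.
for big5, utf8 in PROPOSED:
    print(f"0x{big5:04x} -> U+{utf8_int_to_codepoint(utf8):04X}")
```

The second entry decodes to U+92B9, which is the code point used in the Unihan query above.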

Re: BUG #4714: Unicode Big5 Conversion

From
Heikki Linnakangas
Date:
張桂賢 Roger Chang wrote:
> There is authoritative source for the Big5 encoding, but don't believe
> that do help
>
>     http://www.cns11643.gov.tw/AIDB/encodings_en.do
>
> Skip the historical mess already done. we should focus on reality?
>
> brief events according time-line,
>
>     * BIG5 created, mostly by ETEN company, some others but not important now.
>     * CNS Standard like 11643, Taiwan Government authority building in
> mean time ...
>     * Windows 3 showup, need Chinese ... pick not CNS but BIG5 ???
> Code Page 950 born.
>     * ETEN company add "ETen-extension 0xF9D6-0xF9FE" to work with IBM5550
>     * Since Windows ME, CP950 add above mentioned 7 char. 0xF9D6-0xF9FE ONLY ???
>     * Later Hong Kong add above 7 Char. plus some more symbol in
> HKSCS-2004, and what you found is right.
>     * WHAT A MESS !
>
> Focus on reality,
> only mentioned 7 Char. I need to build into pgsql sources to compliant
> with CP950, since few years ago.

Ok, so Windows codepage 950 has those 7 characters, but not the other
ETEN extended chars. I think that's a good enough reason to add those 7
chars; we have 'win950' as an alias for big5 anyway.

I'll go add those characters.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: BUG #4714: Unicode Big5 Conversion

From
Heikki Linnakangas
Date:
Heikki Linnakangas wrote:
> 張桂賢 Roger Chang wrote:
>> There is authoritative source for the Big5 encoding, but don't believe
>> that do help
>>
>>     http://www.cns11643.gov.tw/AIDB/encodings_en.do
>>
>> Skip the historical mess already done. we should focus on reality?
>>
>> brief events according time-line,
>>
>>     * BIG5 created, mostly by ETEN company, some others but not
>> important now.
>>     * CNS Standard like 11643, Taiwan Government authority building in
>> mean time ...
>>     * Windows 3 showup, need Chinese ... pick not CNS but BIG5 ???
>> Code Page 950 born.
>>     * ETEN company add "ETen-extension 0xF9D6-0xF9FE" to work with
>> IBM5550
>>     * Since Windows ME, CP950 add above mentioned 7 char.
>> 0xF9D6-0xF9FE ONLY ???
>>     * Later Hong Kong add above 7 Char. plus some more symbol in
>> HKSCS-2004, and what you found is right.
>>     * WHAT A MESS !
>>
>> Focus on reality,
>> only mentioned 7 Char. I need to build into pgsql sources to compliant
>> with CP950, since few years ago.
>
> Ok, so Windows codepage 950 has those 7 characters, but not the other
> ETEN extended chars. I think that's a good enough reason to add those 7
> chars; we have 'win950' as an alias for big5 anyway.

I downloaded the latest CP950 - Unicode conversion table from
ftp://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT,
and ran it through the UCS_to_most.pl script in
src/backend/utils/mb/Unicode. Comparing the result to our current big5
conversion table, there are a couple of minor differences in the mapping
of punctuation characters, e.g. 0xa145 is mapped to Unicode character
2022 BULLET in big5, and to 2027 HYPHENATION POINT in CP950. And we're
missing all the "box drawing" characters in CP950 in the ranges
0xc6a1-0xc6fe, 0xc470-0xc7fc and 0xf9dd-0xf9fe. And then there are the 7
characters you mentioned in the range 0xf9d6-0xf9dc.

So although we use win950 as an alias for big5, it's not the same thing.
I guess we don't care about the box drawing characters, they're not very
useful for a database, and we shouldn't change the mapping of existing
characters, for backwards-compatibility reasons. I wondered whether we
should make win950 a separate encoding, but they seem to be close enough
in practice that it's better to keep them the same.

So again, I'll just go add those 7 characters.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
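
The comparison described above can be sketched in a few lines of Python. This is not the UCS_to_most.pl script; it only assumes the usual layout of the unicode.org mapping files (hex source code, hex Unicode code point, then a `#` comment). The two sample tables are tiny illustrative excerpts rather than the real BIG5.TXT and CP950.TXT, and the function names are hypothetical.

```python
# Illustrative excerpts only; the real files contain thousands of lines.
BIG5_SAMPLE = """\
# Big5 code, Unicode code point, comment
0xA140\t0x3000\t# IDEOGRAPHIC SPACE
0xA145\t0x2022\t# BULLET
"""

CP950_SAMPLE = """\
0xA140\t0x3000\t# IDEOGRAPHIC SPACE
0xA145\t0x2027\t# HYPHENATION POINT
0xF9D6\t0x7881\t# one of the seven reported characters
"""


def parse_table(text: str) -> dict[int, int]:
    """Parse 'code<TAB>unicode<TAB># comment' lines into {code: code point}."""
    table = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if len(fields) >= 2:
            table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def compare(ours: dict[int, int], theirs: dict[int, int]):
    """Return (codes mapped differently, codes present only in `theirs`)."""
    changed = {
        code: (ours[code], theirs[code])
        for code in ours.keys() & theirs.keys()
        if ours[code] != theirs[code]
    }
    missing = sorted(theirs.keys() - ours.keys())
    return changed, missing


changed, missing = compare(parse_table(BIG5_SAMPLE), parse_table(CP950_SAMPLE))
for code, (old, new) in sorted(changed.items()):
    print(f"0x{code:04x}: U+{old:04X} in big5, U+{new:04X} in CP950")
for code in missing:
    print(f"0x{code:04x}: only in CP950")
```

Keeping "changed" and "missing" separate mirrors the decision above: existing mappings stay as they are, and only selected missing entries are added.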

Re: BUG #4714: Unicode Big5 Conversion

From
Tatsuo Ishii
Date:
> > There is authoritative source for the Big5 encoding, but don't believe
> > that do help
> >
> >     http://www.cns11643.gov.tw/AIDB/encodings_en.do
> >
> > Skip the historical mess already done. we should focus on reality?
> >
> > brief events according time-line,
> >
> >     * BIG5 created, mostly by ETEN company, some others but not important now.
> >     * CNS Standard like 11643, Taiwan Government authority building in
> > mean time ...
> >     * Windows 3 showup, need Chinese ... pick not CNS but BIG5 ???
> > Code Page 950 born.
> >     * ETEN company add "ETen-extension 0xF9D6-0xF9FE" to work with IBM5550
> >     * Since Windows ME, CP950 add above mentioned 7 char. 0xF9D6-0xF9FE ONLY ???
> >     * Later Hong Kong add above 7 Char. plus some more symbol in
> > HKSCS-2004, and what you found is right.
> >     * WHAT A MESS !
> >
> > Focus on reality,
> > only mentioned 7 Char. I need to build into pgsql sources to compliant
> > with CP950, since few years ago.
>
> Ok, so Windows codepage 950 has those 7 characters, but not the other
> ETEN extended chars. I think that's a good enough reason to add those 7
> chars; we have 'win950' as an alias for big5 anyway.
>
> I'll go add those characters.

Be very careful not to break the standard defined by Unicode.
For example, the glyph for 0xe7a281 == U+7881 is defined on page 43 of
http://unicode.org/charts/PDF/U4E00.pdf. So we need to make sure that the
particular kanji character defined in Big5 0xf9d6 has the same glyph
as the one defined in Unicode (U+7881). The same can be said for the rest
of the proposed mappings.

>  {0xf9d6, 0xe7a281},
>  {0xf9d7, 0xe98ab9},
>  {0xf9d8, 0xe8a38f},
>  {0xf9d9, 0xe5a2bb},
>  {0xf9da, 0xe68192},
>  {0xf9db, 0xe7b2a7},
>  {0xf9dc, 0xe5abba},
--
Tatsuo Ishii
SRA OSS, Inc. Japan
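
One inexpensive cross-check for a concern like this is to compare the proposed pairs against an independent CP950 implementation. The sketch below uses Python's built-in `cp950` codec for that purpose. The code points are the ones obtained by decoding the UTF-8 values in the proposed entries. Agreement with another codec is only supporting evidence; it does not replace comparing glyphs against the Unicode charts.

```python
# Proposed Big5 code -> Unicode code point, derived from the UTF-8 values
# in the entries quoted above.
PROPOSED = {
    0xF9D6: 0x7881,
    0xF9D7: 0x92B9,
    0xF9D8: 0x88CF,
    0xF9D9: 0x58BB,
    0xF9DA: 0x6052,
    0xF9DB: 0x7CA7,
    0xF9DC: 0x5AFA,
}


def decode_with(codec: str, code: int):
    """Decode a two-byte code with the given codec; None if it is unmapped."""
    raw = code.to_bytes(2, "big")
    try:
        return ord(raw.decode(codec))
    except UnicodeDecodeError:
        return None


for code, expected in PROPOSED.items():
    got = decode_with("cp950", code)
    status = "agrees" if got == expected else "DIFFERS"
    # Also show what a plain Big5 codec does with the same bytes.
    plain = decode_with("big5", code)
    print(f"0x{code:04x}: proposed U+{expected:04X}, cp950 {status}, big5 -> {plain}")
```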

Re: BUG #4714: Unicode Big5 Conversion

From
Heikki Linnakangas
Date:
Tatsuo Ishii wrote:
> Be very careful not to break the standard defined by Unicode.
> For example, the glyph for 0xe7a281 == U+7881 is defined on page 43 of
> http://unicode.org/charts/PDF/U4E00.pdf. So we need to make sure that the
> particular kanji character defined in Big5 0xf9d6 has the same glyph
> as the one defined in Unicode (U+7881). The same can be said for the rest
> of the proposed mappings.

The mappings look correct to me; it's the same mappings as defined in
the CP950.TXT file from unicode.org.

I've added a Perl script UCS_to_BIG5.pl to generate the mapping tables
from BIG5.TXT as before, with the addition of those seven extra
characters from CP950.TXT.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
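
The approach described in the last message, a base table plus seven extra entries, can be illustrated with a short Python sketch. This is not the UCS_to_BIG5.pl script. The base table here is a two-entry stand-in for BIG5.TXT, the function names are hypothetical, and sorting each table by its source code is an assumption made for the example.

```python
# Stand-in for the mappings read from BIG5.TXT (illustrative excerpt only).
BASE = {0xA140: 0x3000, 0xA145: 0x2022}

# The seven extra mappings taken from CP950.TXT (Big5 code -> code point).
EXTRA = {
    0xF9D6: 0x7881, 0xF9D7: 0x92B9, 0xF9D8: 0x88CF, 0xF9D9: 0x58BB,
    0xF9DA: 0x6052, 0xF9DB: 0x7CA7, 0xF9DC: 0x5AFA,
}


def utf8_as_int(codepoint: int) -> int:
    """Pack the UTF-8 encoding of a code point into one big-endian integer."""
    return int.from_bytes(chr(codepoint).encode("utf-8"), "big")


def big5_to_utf8_map(table: dict[int, int]) -> list[tuple[int, int]]:
    """(big5, utf8) pairs sorted by Big5 code (sort order is an assumption)."""
    return sorted((code, utf8_as_int(ucs)) for code, ucs in table.items())


def utf8_to_big5_map(table: dict[int, int]) -> list[tuple[int, int]]:
    """(utf8, big5) pairs sorted by UTF-8 value (sort order is an assumption)."""
    return sorted((utf8_as_int(ucs), code) for code, ucs in table.items())


def format_entry(src: int, dst: int) -> str:
    """Render one pair in the style of the entries quoted in this thread."""
    return f"{{0x{src:04x}, 0x{dst:04x}}}"


merged = {**BASE, **EXTRA}  # base mappings win nothing here; the codes do not overlap
for src, dst in big5_to_utf8_map(merged):
    print(" " + format_entry(src, dst) + ",")
```

Building both directions from one merged table keeps big5_to_utf8 and utf8_to_big5 consistent with each other, which is harder to guarantee when the two maps are edited by hand.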