Discussion: A Patch for MIC to EUC_TW code converting in mb support
============================================================================
POSTGRESQL BUG REPORT: MIC to EUC_TW code converting in mb support
============================================================================
System Configuration
---------------------
Architecture (example: Intel Pentium) :x86
Operating System (example: Linux 2.0.26 ELF) :Linux 2.2.x and FreeBSD 3.5R
PostgreSQL version (example: PostgreSQL-7.0) :PostgreSQL-7.0.2
Compiler used (example: gcc 2.8.0) :egcs-2.91.66, gcc 2.7.3
A FULL description of the problem:
------------------------------------------------
In PostgreSQL's mb (multi-byte) support, there is a bug in the code
conversion from MIC to EUC_TW. The original mic2euc_tw() in conv.c converts
CNS 11643-1992 Plane 2 characters into the 2-byte EUC_TW encoding, but
characters in CNS 11643-1992 Plane 2 should be converted into the 4-byte
EUC_TW encoding instead.
A way to repeat the problem:
----------------------------------------------------------------------
When you run initdb with -E EUC_TW and set PGCLIENTENCODING to BIG5,
you will find that all characters in CNS 11643-1992 Plane 2 are
incorrectly stored or output.
This problem might be fixed by the patch in the attachment.
*** conv.c Wed Nov 8 22:44:21 2000
--- conv.c.orig Sat May 20 21:12:26 2000
***************
*** 906,920 ****
{
len -= pg_mic_mblen(mic++);
! if (c1 == LC_CNS11643_1)
{
- *p++ = *mic++;
- *p++ = *mic++;
- }
- else if (c1 == LC_CNS11643_2)
- {
- *p++ = SS2;
- *p++ = 0xa2;
*p++ = *mic++;
*p++ = *mic++;
}
--- 906,913 ----
{
len -= pg_mic_mblen(mic++);
! if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2)
{
*p++ = *mic++;
*p++ = *mic++;
}
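For readers following the diff: it appears to be written from the patched conv.c to conv.c.orig, so the lines marked `-` are the ones the fix adds. The intended behaviour can be sketched as below. This is only an illustration, not the actual conv.c code: the helper name mic_char_to_euc_tw is hypothetical, and the byte values for SS2, LC_CNS11643_1 and LC_CNS11643_2 are assumptions recalled from PostgreSQL's pg_wchar.h that should be checked against your source tree.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values (as recalled from pg_wchar.h); verify against your tree. */
#define SS2            0x8e  /* EUC single-shift 2 */
#define LC_CNS11643_1  0x95  /* MIC leading byte for CNS 11643-1992 Plane 1 */
#define LC_CNS11643_2  0x96  /* MIC leading byte for CNS 11643-1992 Plane 2 */

/*
 * Hypothetical helper: convert one 3-byte MIC character (leading byte
 * plus two code bytes) to EUC_TW. Returns the number of bytes written
 * to out, or 0 if the leading byte is neither Plane 1 nor Plane 2.
 */
static size_t
mic_char_to_euc_tw(const unsigned char *mic, unsigned char *out)
{
    unsigned char *p = out;

    assert(mic != NULL && out != NULL);

    if (mic[0] == LC_CNS11643_1)
    {
        /* Plane 1: the two code bytes are copied unchanged (2 bytes). */
        *p++ = mic[1];
        *p++ = mic[2];
    }
    else if (mic[0] == LC_CNS11643_2)
    {
        /* Plane 2: SS2, plane byte 0xa2, then the code bytes (4 bytes). */
        *p++ = SS2;
        *p++ = 0xa2;
        *p++ = mic[1];
        *p++ = mic[2];
    }
    return (size_t) (p - out);
}
```

With these assumptions, a Plane 1 character yields 2 output bytes and a Plane 2 character yields 4, which is the distinction the original mic2euc_tw() missed.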
Thanks for pointing it out. Your fix seems correct.
BTW, I have found another bug in the EUC_TW support. Line 917 in conv.c:
*p++ = c1 - LC_CNS11643_3 + 0xa3;
should be:
*p++ = *mic++ - LC_CNS11643_3 + 0xa3;
Otherwise, CNS 11643-1992 Plane 3 and higher won't work. Could you test
it with characters from CNS 11643-1992 Plane 3 or higher?
If they are ok, I will fix the current source and make a patch for
7.0.3 (I guess it's too late to back-patch the 7.0 tree).
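To illustrate why the one-line change matters, here is a small sketch. It assumes that at line 917 c1 holds the MIC private leading byte (LCPRV2_B) and that the following byte identifies the plane. The values 0x9d and 0xf6 are assumptions recalled from pg_wchar.h, and both function names are hypothetical.

```c
#include <assert.h>

/* Assumed values (as recalled from pg_wchar.h); verify against your tree. */
#define LCPRV2_B       0x9d  /* MIC private leading byte for 2-byte charsets */
#define LC_CNS11643_3  0xf6  /* MIC charset byte for CNS 11643-1992 Plane 3 */

/* Old expression: derives the plane byte from the leading byte c1. */
static unsigned char
plane_byte_old(const unsigned char *mic)
{
    unsigned char c1 = mic[0];

    return (unsigned char) (c1 - LC_CNS11643_3 + 0xa3);
}

/* Proposed expression: derives the plane byte from the charset byte. */
static unsigned char
plane_byte_new(const unsigned char *mic)
{
    assert(mic[0] == LCPRV2_B);
    return (unsigned char) (mic[1] - LC_CNS11643_3 + 0xa3);
}
```

Under these assumptions, a Plane 3 character starting with the bytes 0x9d 0xf6 gets plane byte 0xa3 from the proposed expression, while the old expression produces 0x4a, which is not a valid EUC_TW plane byte.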
--
Tatsuo Ishii
Tatsuo, I assume these are all done in 7.1, right?
--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,   |  830 Blythe Avenue
  +  Christ can be your backup.      |  Drexel Hill, Pennsylvania 19026
> Tatsuo, I assume these are all done in 7.1, right?
Yes.
--
Tatsuo Ishii