Thread: Multi-byte character bug

Multi-byte character bug

From
"Richard So"
Date:

Two bugs have been found in the SQL parser and multibyte character support:

1. "Problem connecting to database: java.sql.SQLException: ERROR:
Invalid EUC_TW character sequence found (0xb27a)" was reported when using
the JDBC driver to insert a record. A similar error is reported when using
the ODBC driver and psql. Since auto-conversion from client to server should
convert the character to a valid EUC_TW character, this is a bug.

2. When inserting a record containing the Chinese character 0xc05c, the SQL
parser reports something like 'Problem connecting to database:
java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
(similar in JDBC and ODBC), and the error "unterminated string" is
reported when using psql.

I've found that the problem exists from 7.1.x through 7.2.*.

Re: Multi-byte character bug

From
Tatsuo Ishii
Date:
> Two bugs have been found in the SQL parser and Multibyte char support:
>
> 1.       "Problem connecting to database: java.sql.SQLException: ERROR:
> Invalid EUC_TW character sequence found (0xb27a)" was reported in using
> JDBC driver to insert record, similar error reported when using ODBC
> driver and psql, since auto-conversion from client to server should
> convert the character to a valid EUC_TW char, therefore this is a bug

How did you set the auto-conversion settings for psql? I suspect you
did something wrong with it.

> 2.       inserting record with "0xc05c" chinese char, the SQL parser
> report something like 'Problem connecting to database:
> java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
> (similar in jdbc and odbc), and the error "unterminated string" has
> been reported when using psql.
>
> I've found the problem exists since 7.1.x till 7.2.*.

What is the encoding for "chinese char"? You need to give us more
info.
--
Tatsuo Ishii

P.S.   Please don't post with non-ascii chars. If you need to show
non-ascii chars, you should give them in a hex form.

Re: Multi-byte character bug

From
"Richard So"
Date:
>> Two bugs have been found in the SQL parser and Multibyte char support:
>>

>What is the encoding for "chinese char"? You need to give us more
>info.

By Chinese here, I mean BIG5 encoding character which is a widely used
encoding in HK and Taiwan.
My setup:
    Db encoding: EUC_TW
    Client (JDBC / ODBC) Encoding: BIG5
        JDBC: I supplied the parameter 'charSet=Big5' to the connection string
        ODBC: my locale (Chinese Win2000 machine) is Chinese Taiwan
    Client application: Tomcat4 jsp page (see the attached)
    App / Db Server: Redhat 7.3 Linux + postgresql (set) 7.2.1-2PGDG (download binary rpm) + Tomcat4
    App / DB Server locale: zh_TW.Big5
    JDBC driver: pgjdbc2.jar
    Client Machine: Win2000 Chinese (Taiwan) Version with SP2 + I.E. (jsp) + Delphi SQL Explorer (ODBC)
    Client Machine locale: Chinese (Taiwan)

>> 1.       'Problem connecting to database: java.sql.SQLException: ERROR:
>> Invalid EUC_TW character sequence found (0xb27a)' was reported in using
>> JDBC driver to insert record, similar error reported when using ODBC
>> driver and psql, since auto-conversion from client to server should
>> convert the character to a valid EUC_TW char, therefore this is a bug

>How did you set the auto-conversion settings for psql? I suspect you
>did something wrong with it.

I've done a new check on it: the JDBC and ODBC drivers still report
the error message but psql does not (maybe, as you said, I followed a wrong
procedure). However, the problem is still there: why do JDBC and ODBC still
report the error?
I have only tried some Chinese words, but there may be other
characters that also cause the problem.
I know Tomcat4 by default returns the request parameters in ISO-8859,
and therefore I've added the code
<%@ page contentType="text/html; charset=Big5"%>
<%
    request.setCharacterEncoding("BIG5");
%>
to the JSP page and dumped the actual SQL posted to the postgresql server to
make sure the SQL is correct; it is attached (pls see attached file:
offence1.zip).
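
The effect of the container assuming the wrong request charset can be sketched in a few lines of Python. This is only an illustration, not the JSP code from the message: the helper name `decode_request_parameter` is hypothetical, and the sample bytes are the 0xc05c character reported in this thread.

```python
# Big5 bytes for the character discussed in this thread (0xc05c).
BIG5_BYTES = bytes.fromhex("c05c")


def decode_request_parameter(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode raw request bytes using the charset
    the container believes the request was sent in."""
    return raw.decode(charset)


# Wrong assumption: ISO-8859-1 maps every byte to one character, so the
# single Big5 character becomes two unrelated Latin-1 characters.
as_latin1 = decode_request_parameter(BIG5_BYTES, "iso-8859-1")

# Right assumption: the two bytes together form one Big5 character.
as_big5 = decode_request_parameter(BIG5_BYTES, "big5")

# ISO-8859-1 decoding is lossless, so the original bytes can be recovered
# by encoding the mis-decoded string back to ISO-8859-1.
recovered = as_latin1.encode("iso-8859-1")
```

Decoded as ISO-8859-1 the parameter is two characters long and its second character is a backslash; decoded as Big5 it is a single character, which is what declaring the request encoding as Big5 is meant to achieve.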

>> 2.       inserting record with "0xc05c" chinese char, the SQL parser
>> report something like 'Problem connecting to database:
>> java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
>> (similar in jdbc and odbc), and the error 'unterminated string' has
>> been reported when using psql.
>>

The character code is 0xc05c, in which the second byte is actually a "\"
(back-slash)
(pls see the attached file: offence2.zip)
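
The byte-level observation above is easy to check. The following is a small Python sketch; the function name `ascii_backslash_offsets` is hypothetical, and the quoted literal is only an assumed example of how the character might appear inside an SQL string.

```python
BACKSLASH = 0x5C  # ASCII "\"


def ascii_backslash_offsets(data: bytes) -> list:
    """Return the offsets of every byte equal to the ASCII backslash."""
    return [i for i, b in enumerate(data) if b == BACKSLASH]


char = bytes.fromhex("c05c")   # the Big5 character from the report
literal = b"'" + char + b"'"   # assumed form inside a quoted SQL string

# Iterating over bytes yields integers, so this splits lead and trail byte.
lead, trail = char
```

Here `trail` equals `BACKSLASH`, and the only backslash offset in `literal` is the byte directly before the closing quote, which is where a byte-wise reader would see an escape sequence.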

>> I've found the problem exists since 7.1.x till 7.2.*.

Re: Multi-byte character bug

From
Tatsuo Ishii
Date:
> By Chinese here, I mean BIG5 encoding character which is a widely used
> encoding in HK and Taiwan.

Ok. PostgreSQL does support BIG5 in the *frontend* side.

> I've done a new check on it, I found JDBC and ODBC driver still report
> the error message but psql do not (may be as you said, I've done a wrong
> procedure).  However, the problem still there: why JDBC and ODBC still
> report the error ?

psql works but JDBC and ODBC do not? The fact that psql is working
tells us that at least BIG5<-->EUC_TW works fine. It seems something is
wrong with the JDBC and ODBC settings. Unfortunately I'm not a Java or
ODBC expert at all. Sorry...
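
One way to look at the reported sequence 0xb27a is to compare it against the byte ranges of the two encodings. The Python sketch below is a simplified check under stated assumptions: it uses the commonly documented ranges (two-byte EUC_TW characters have both bytes in 0xA1..0xFE; Big5 has a lead byte in 0x81..0xFE and a trail byte in 0x40..0x7E or 0xA1..0xFE), it ignores the four-byte EUC_TW form, and both function names are hypothetical.

```python
def is_euc_tw_two_byte(seq: bytes) -> bool:
    """True if seq has the shape of a two-byte EUC_TW character:
    both bytes in 0xA1..0xFE. Four-byte forms (0x8E prefix) are
    deliberately ignored in this sketch."""
    return len(seq) == 2 and all(0xA1 <= b <= 0xFE for b in seq)


def is_big5_shaped(seq: bytes) -> bool:
    """True if seq has the shape of a Big5 character. This checks byte
    ranges only, not whether the code point is actually assigned."""
    if len(seq) != 2:
        return False
    lead, trail = seq
    return 0x81 <= lead <= 0xFE and (
        0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE
    )


reported = bytes.fromhex("b27a")  # sequence from the error message
```

The reported bytes fail the EUC_TW shape check because 0x7a is below 0xA1, while they do pass the Big5 shape check. That is consistent with, though it does not prove, Big5 bytes reaching the server without conversion when JDBC or ODBC is used.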

> The character code is 0xc05c, in which the second byte is actually a "\"
> (back-slash)
> (pls see the attached file: offence2.zip)

There's no character code in EUC_TW (CNS 11643-1992) corresponding to
Big5 0xc05c. That's why PostgreSQL complains.
--
Tatsuo Ishii

Re: Multi-byte character bug

From
"Richard So"
Date:
> -----Original Message-----
> From: pgsql-bugs-owner@postgresql.org
> [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Tatsuo Ishii
> Sent: Wednesday, July 31, 2002 1:18 PM
> To: richso@i-cable.com
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] Multi-byte character bug
>
>
> > By Chinese here, I mean BIG5 encoding character which is a widely used
> > encoding in HK and Taiwan.
>
> Ok. PostgreSQL does support BIG5 in the *frontend* side.
>
> > I've done a new check on it, I found JDBC and ODBC driver still report
> > the error message but psql do not (may be as you said, I've done a
> > wrong procedure).  However, the problem still there: why JDBC and ODBC
> > still report the error ?
>
> psql works but JDBC and ODBC does not? The fact that psql is
> working tell us that at least BIG5<-->EUC_TW works fine. It
> seems something wrong with JDBC and ODBC settings.
> Unfortunately I'm not a Java or ODBC expert at all. Sorry...
>

OK! I will post to the jdbc and odbc thread for help!


> > The character code is 0xc05c, in which the second byte is actually a "\"
> > (back-slash)
> > (pls see the attached file: offence2.zip)
>
> There's no character code in EUC_TW (CNS 11643-1992)
> corresponding to Big5 0xc05c. That's why PostgreSQL complains.


But I've created another db using MULE_INTERNAL encoding, and the same error
is reported. Why?
Why doesn't Postgres directly support BIG5 on the server side, as BIG5 is the
main encoding used by Traditional Chinese communities, i.e. HK &
Taiwan? As EUC_TW does not have a corresponding character for every BIG5
character, this will seriously prevent the Traditional Chinese communities
from using Postgresql!



Re: Multi-byte character bug

From
Tatsuo Ishii
Date:
> > There's no character code in EUC_TW (CNS 11643-1992)
> > corresponding to Big5 0xc05c. That's why PostgreSQL complains.
>
>
> But I've created another db using MULE_INTERNAL encoding, the same error
> reported, why ?

Since Big5 representation of MULE_INTERNAL is actually "leading
character"+EUC_TW. i.e.

> Why don't Postgres directly support BIG5 in server side

It's for a purely technical reason. Handling encodings that
contain bytes < 0x80 in the second (or third) byte of a character confuses
our SQL parser. I think it's not impossible for the parser to handle
Big5, but if we made such a change, the parser would not be able to
handle other encodings. If you have a good idea to overcome these problems,
we would welcome it.
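
The parser problem described above can be reproduced with a toy byte-wise scanner. This is a minimal Python sketch, not PostgreSQL's actual lexer: both function names and the table name `t` are hypothetical, and the Big5 lead-byte test (any byte >= 0x81) is a simplifying assumption.

```python
QUOTE = 0x27      # ASCII "'"
BACKSLASH = 0x5C  # ASCII "\"


def naive_literal_end(sql: bytes, start: int) -> int:
    """Byte-wise scan from the opening quote at offset start.
    Returns the offset of the closing quote, or -1 if none is found.
    Every 0x5C byte is treated as an escape character."""
    i = start + 1
    while i < len(sql):
        b = sql[i]
        if b == BACKSLASH:   # skip the backslash and the escaped byte
            i += 2
        elif b == QUOTE:
            return i
        else:
            i += 1
    return -1


def big5_aware_literal_end(sql: bytes, start: int) -> int:
    """Same scan, but a Big5 lead byte consumes its trail byte, so a
    trail byte of 0x5C is never mistaken for an escape character."""
    i = start + 1
    while i < len(sql):
        b = sql[i]
        if b >= 0x81:        # assumed Big5 lead byte: skip both bytes
            i += 2
        elif b == BACKSLASH:
            i += 2
        elif b == QUOTE:
            return i
        else:
            i += 1
    return -1


sql = b"INSERT INTO t VALUES ('" + bytes.fromhex("c05c") + b"')"
start = sql.index(b"'")
```

With this sample statement the naive scanner returns -1, which corresponds to the "unterminated string" symptom reported from psql, while the Big5-aware scanner finds the closing quote three bytes after the opening one. The lead-byte rule is specific to Big5, which illustrates the trade-off mentioned above: a rule hard-coded for one encoding does not carry over to the others.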

>  as BIG5 is the
> main encoding using for Traditional Chinese communities, i.e. HK &
> Taiwan ?  As EUC_TW do not have complete correspondings char in BIG5,
> this will seriously prevent the Traditional Chinese communities for
> using Postgresql !

Just curious. Why do people living in those areas prefer Big5 over
EUC_TW? I thought EUC_TW (or CNS 11643-1992) was defined by the
government in Taiwan. Is there any technical superiority in Big5?
Or maybe "don't know why but just many people use Big5":-)
--
Tatsuo Ishii