Discussion: Multi-byte character bug
Two bugs have been found in the SQL parser and multibyte character support:

1. 'Problem connecting to database: java.sql.SQLException: ERROR: Invalid EUC_TW character sequence found (0xb27a)' was reported when using the JDBC driver to insert a record. A similar error is reported when using the ODBC driver and psql. Since auto-conversion from client to server should convert the character to a valid EUC_TW character, this is a bug.

2. When inserting a record containing the Chinese character 0xC0 0x5C (sent as raw bytes in the original mail), the SQL parser reports something like 'Problem connecting to database: java.sql.SQLException: ERROR: parser: parse error at or near "4567891"' (similar in JDBC and ODBC), and the error 'unterminated string' is reported when using psql.

I've found the problem exists from 7.1.x through 7.2.*.
> Two bugs has been found in the SQL parser and Multibyte char support:
>
> 1. 'Problem connecting to database: java.sql.SQLException: ERROR:
> Invalid EUC_TW character sequence found (0xb27a)' was reported in using
> JDBC driver to insert record, similar error reported when using ODBC
> driver and psql, since auto-conversion from client to server should
> convert the character to a valid EUC_TW char, therefore this is a bug

How did you set the auto-conversion settings for psql? I suspect you
did something wrong with it.

> 2. inserting record with [0xC0 0x5C] chinese char, the SQL parser
> report something like 'Problem connecting to database:
> java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
> (similar in jdbc and odbc), and the error 'unterminated string' has
> been reported when using psql.
>
> I've found the problem exists since 7.1.x till 7.2.*.

What is the encoding for "chinese char"? You need to give us more
info.
--
Tatsuo Ishii

P.S. Please don't post with non-ascii chars. If you need to show
non-ascii chars, you should give them in a hex form.
>> Two bugs has been found in the SQL parser and Multibyte char support:

> What is the encoding for "chinese char"? You need to give us more
> info.

By Chinese here, I mean BIG5-encoded characters. BIG5 is a widely used encoding in HK and Taiwan.

My setup:

Db encoding: EUC_TW
Client (JDBC / ODBC) encoding: BIG5
JDBC: I supplied the parameter 'charSet=Big5' in the connection string
ODBC: my locale (Chinese Win2000 machine) is Chinese Taiwan
Client application: Tomcat4 JSP page (see the attached)
App / Db server: Redhat 7.3 Linux + postgresql (set) 7.2.1-2PGDG (downloaded binary rpm) + Tomcat4
App / DB server locale: zh_TW.Big5
JDBC driver: pgjdbc2.jar
Client machine: Win2000 Chinese (Taiwan) version with SP2 + I.E. (jsp) + Delphi SQL Explorer (ODBC)
Client machine locale: Chinese (Taiwan)

>> 1. 'Problem connecting to database: java.sql.SQLException: ERROR:
>> Invalid EUC_TW character sequence found (0xb27a)' was reported in using
>> JDBC driver to insert record, similar error reported when using ODBC
>> driver and psql, since auto-conversion from client to server should
>> convert the character to a valid EUC_TW char, therefore this is a bug

> How did you set the auto-conversion settings for psql? I suspect you
> did something wrong with it.

I ran a new check and found that the JDBC and ODBC drivers still report the error message, but psql does not (maybe, as you said, I followed the wrong procedure). However, the problem is still there: why do JDBC and ODBC still report the error? I only tried a few Chinese words, so other characters may also cause the problem.

I know that Tomcat4 by default returns the request parameters in ISO-8859-1, so I added the code

    <%@ page contentType="text/html; charset=Big5"%>
    <% request.setCharacterEncoding("BIG5"); %>

to the JSP page and dumped the actual SQL posted to the postgresql server to make sure the SQL is correct. The dump is attached (please see the attached file: offence1.zip).

>> 2. inserting record with [0xc05c] chinese char, the SQL parser
>> report something like 'Problem connecting to database:
>> java.sql.SQLException: ERROR: parser: parse error at or near "4567891"'
>> (similar in jdbc and odbc), and the error 'unterminated string' has
>> been reported when using psql.

The character code is 0xc05c, in which the second byte is actually a "\" (backslash) (please see the attached file: offence2.zip).

>> I've found the problem exists since 7.1.x till 7.2.*.
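For readers following along, here is a minimal Python sketch of the byte-level point made in this message. It splits the two character codes quoted in the thread (0xc05c and 0xb27a) into lead and trail bytes and checks whether the trail byte falls in the ASCII range. The helper name `describe` is made up for illustration and is not part of PostgreSQL or any driver.

```python
def describe(code: int) -> dict:
    """Split a two-byte character code into its lead and trail bytes."""
    raw = code.to_bytes(2, "big")
    lead, trail = raw[0], raw[1]
    return {
        "hex": raw.hex(),                  # hex form, safe to paste into plain-ASCII mail
        "lead": lead,
        "trail": trail,
        "trail_is_ascii": trail < 0x80,    # True when the trail byte collides with ASCII
        "trail_as_ascii": chr(trail) if trail < 0x80 else None,
    }


# The two codes mentioned in the report.
for code in (0xC05C, 0xB27A):
    info = describe(code)
    print(info["hex"], hex(info["lead"]), hex(info["trail"]), repr(info["trail_as_ascii"]))
```

Both codes have a trail byte below 0x80. For 0xc05c that byte is the backslash, which matters to anything that scans a quoted SQL string one byte at a time.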
> By Chinese here, I mean BIG5 encoding character which is a widely used
> encoding in HK and Taiwan.

Ok. PostgreSQL does support BIG5 on the *frontend* side.

> I've done a new check on it, I found JDBC and ODBC driver still report
> the error message but psql do not (may be as you said, I've done a wrong
> procedure). However, the problem still there: why JDBC and ODBC still
> report the error ?

psql works but JDBC and ODBC do not? The fact that psql is working tells us that at least the BIG5<-->EUC_TW conversion works fine. It seems something is wrong with the JDBC and ODBC settings. Unfortunately I'm not a Java or ODBC expert at all. Sorry...

> The character code is 0xc05c, in which the second byte is actually a "\"
> (back-slash)
> (pls see the attached file: offence2.zip)

There's no character code in EUC_TW (CNS 11643-1992) corresponding to Big5 0xc05c. That's why PostgreSQL complains.
--
Tatsuo Ishii
> -----Original Message-----
> From: pgsql-bugs-owner@postgresql.org
> [mailto:pgsql-bugs-owner@postgresql.org] On Behalf Of Tatsuo Ishii
> Sent: Wednesday, July 31, 2002 1:18 PM
> To: richso@i-cable.com
> Cc: pgsql-bugs@postgresql.org
> Subject: Re: [BUGS] Multi-byte character bug
>
> > By Chinese here, I mean BIG5 encoding character which is a widely used
> > encoding in HK and Taiwan.
>
> Ok. PostgreSQL does support BIG5 in the *frontend* side.
>
> > I've done a new check on it, I found JDBC and ODBC driver still report
> > the error message but psql do not (may be as you said, I've done a
> > wrong procedure). However, the problem still there: why JDBC and ODBC
> > still report the error ?
>
> psql works but JDBC and ODBC does not? The fact that psql is
> working tell us that at least BIG5<-->EUC_TW works fine. It
> seems something wrong with JDBC and ODBC settings.
> Unfortunately I'm not a Java or ODBC expert at all. Sorry...

OK! I will post to the jdbc and odbc thread for help!

> > The character code is 0xc05c, in which the second byte is actually a
> > "\" (back-slash)
> > (pls see the attached file: offence2.zip)
>
> There's no character code in EUC_TW (CNS 11643-1992)
> corresponding to Big5 0xc05c. That's why PostgreSQL complains.

But I've created another db using the MULE_INTERNAL encoding, and the same error is reported. Why?

Why doesn't Postgres directly support BIG5 on the server side, given that BIG5 is the main encoding used by the Traditional Chinese communities, i.e. HK & Taiwan? As EUC_TW does not have a corresponding character for every BIG5 character, this will seriously prevent the Traditional Chinese communities from using Postgresql!

> --
> Tatsuo Ishii
> > There's no character code in EUC_TW (CNS 11643-1992)
> > corresponding to Big5 0xc05c. That's why PostgreSQL complains.
>
> But I've created another db using MULE_INTERNAL encoding, the same error
> reported, why ?

Since the Big5 representation in MULE_INTERNAL is actually "leading character"+EUC_TW. i.e.

> Why don't Postgres directly support BIG5 in server side

It's for a purely technical reason. Encodings that contain bytes < 0x80 in the second (or third) byte of a character confuse our SQL parser. I think it's not impossible for the parser to handle Big5, but if we made such a change, the parser would not be able to handle other encodings. If you have a good idea to overcome these problems, it would be welcome.

> as BIG5 is the
> main encoding using for Traditional Chinese communities, i.e. HK &
> Taiwan ? As EUC_TW do not have complete correspondings char in BIG5,
> this will seriously prevent the Traditional Chinese communities for
> using Postgresql !

Just curious. Why do people living in those areas prefer Big5 over EUC_TW? I thought EUC_TW (or CNS 11643-1992) was defined by the government in Taiwan. Is there any technical superiority in Big5? Or maybe "don't know why but just many people use Big5" :-)
--
Tatsuo Ishii
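The parser issue described above can be illustrated with a small Python sketch. This is not PostgreSQL's lexer. It is a toy byte-wise scanner, written under stated assumptions, that only decides whether a quoted string literal is closed. The function names, the table name `t`, and the lead-byte range 0x81..0xFE are assumptions made for illustration (the exact lead-byte range differs between Big5 variants).

```python
QUOTE = 0x27      # '
BACKSLASH = 0x5C  # \


def is_big5_lead(byte: int) -> bool:
    # Assumption: lead bytes fall in 0x81..0xFE; real Big5 variants differ in detail.
    return 0x81 <= byte <= 0xFE


def string_is_terminated(sql: bytes, multibyte_aware: bool) -> bool:
    """Return True if every quoted string in sql is closed."""
    in_string = False
    i = 0
    while i < len(sql):
        byte = sql[i]
        if multibyte_aware and is_big5_lead(byte):
            i += 2          # skip the lead byte and its trail byte as one character
            continue
        if in_string and byte == BACKSLASH:
            i += 2          # backslash escapes the next byte, possibly the closing quote
            continue
        if byte == QUOTE:
            in_string = not in_string
        i += 1
    return not in_string


# A literal that ends with the reported character 0xC05C (hypothetical table t).
statement = b"INSERT INTO t VALUES ('" + bytes([0xC0, 0x5C]) + b"')"

print(string_is_terminated(statement, multibyte_aware=False))
print(string_is_terminated(statement, multibyte_aware=True))
```

With `multibyte_aware=False` the trail byte 0x5C is taken as an escape and swallows the closing quote, which matches the 'unterminated string' symptom reported for psql. The aware variant avoids that, but only because it hard-codes one encoding's lead-byte rule, which is the trade-off raised in the reply above.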