Discussion: Problems with writing EUC-JP/Unicode to console or file
I'm having trouble writing a particular character to a file or even to the console. The character in question is the Japanese double-width "-" symbol (in Japanese font 'ー').

When I read the data from the pg database, encoded in EUC-JP, and display it in a GUI text field, it is displayed properly. However, if I try to write it to the console using System.out.println() or to a file using FileWriter.write(), the character comes out as a '?'.

I am using the newest version of postgres, 7.3.3, but a slightly old driver, pgjdbc2.jar.

Is this a problem with the driver, Java, or the way I am trying to print/write out the data?

Any help or advice is greatly appreciated!

Thanks,

Jean-Christian Imbeault
This sounds like your terminal can't display that character. Depending on the OS/terminal you use, you may or may not be able to change the character encoding the terminal uses so that it can correctly display that character. For example, on my Linux box I can switch to UTF-8 by setting "LANG=en_US.UTF-8", and then even vi will use UTF-8 as its character encoding. OTOH, EUC-JP is not supported, so I can't see your character...

Cheers,
Csaba.

On Mon, 2003-06-23 at 08:20, Jean-Christian Imbeault wrote:
> I'm having trouble writing a particular character to a file or even to
> the console. The character in question is the japanese double-width "-"
> symbol (in japanese font 'ー') [...]
Csaba Nagy wrote:
> This sounds like your terminal can't display that character.

Hum ... good idea. But I don't think this is the problem in my case, as I am using the Japanese version of Windows 2000.

Also, if I hard-code the string to be written, then it is displayed fine on the console or in the file, like this:

String string = "ー";
System.out.println(string);

The above code prints out fine. But if 'string' is gotten from the database, it does not display properly when printed out or written to file. But strangely enough, it does display properly in GUI components.

Maybe the GUI and the OS use different, and incompatible, Unicode fonts?

I'm thinking that the Unicode byte representation of "ー" in the database does not map to a valid character in the font set that my OS uses? That, or the code point (is that the right word?) that pg uses for this character is not the same as the one Java uses? (I remember some talk about the Unicode translation being moved from the driver to the back end, and some changes in the translation tables or something like that?)

I'm really at a loss here, and any advice on what I can do to find the root cause and hopefully a fix for this is very much appreciated, as my application depends on being able to write this character to file.

Thanks!

Jean-Christian Imbeault
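[Editor's note: one quick way to narrow down whether the hard-coded string and the database string really contain the same character is to dump each char as a Unicode code point. This is an illustrative sketch, not part of the original thread; the literal below stands in for both the hard-coded value and the one returned by ResultSet.getString().]

```java
// Diagnostic sketch: print each char of a string as a hex code point,
// so two strings that "look" the same can be compared exactly.
public class CodePointDump {
    static void dump(String label, String s) {
        StringBuffer sb = new StringBuffer(label + ":");
        for (int i = 0; i < s.length(); i++) {
            // Zero-padded hex, e.g. U+30FC for the katakana long-vowel mark
            String hex = Integer.toHexString(s.charAt(i)).toUpperCase();
            while (hex.length() < 4) {
                hex = "0" + hex;
            }
            sb.append(" U+" + hex);
        }
        System.out.println(sb);
    }

    public static void main(String[] args) {
        dump("hard-coded", "\u30FC"); // prints: hard-coded: U+30FC
        // In the real program, also call dump() on the string fetched
        // from the database and compare the two lines character by character.
    }
}
```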
I suspect that your machine's default encoding and the encoding used by your Java program don't match. When you write on the web, the browser knows the correct encoding to use from the HTTP headers (which, BTW, you could consult to see what encoding you are writing with). But when you write a file and read it in a console, the encoding is known only if you tell it to the programs you use... i.e., explicitly tell your Java writer code what encoding to use, and explicitly tell the editor what encoding to use when opening the file. Otherwise they'll use their default encodings, which might not match.

Cheers,
Csaba.

On Mon, 2003-06-23 at 11:36, Jean-Christian Imbeault wrote:
> Hum ... good idea. But I don't think this is the problem in my case as I
> am using the japanese version of Windows 2000. [...]
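[Editor's note: in Java, the explicit-encoding approach Csaba describes looks roughly like the sketch below. EUC-JP is an assumption here; use whatever encoding your editor/terminal actually expects. FileWriter and System.out always use the platform default encoding, whereas OutputStreamWriter and a wrapped PrintStream let you name one.]

```java
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.Writer;

public class ExplicitEncoding {
    public static void main(String[] args) throws Exception {
        String s = "\u30FC"; // the double-width "-" from the thread

        // FileWriter offers no way to choose an encoding; wrap a
        // FileOutputStream in an OutputStreamWriter instead.
        Writer out = new OutputStreamWriter(
                new FileOutputStream("out.txt"), "EUC-JP");
        out.write(s);
        out.close();

        // The console can be handled the same way: a PrintStream with an
        // explicit encoding (this constructor exists since JDK 1.4).
        PrintStream console = new PrintStream(System.out, true, "EUC-JP");
        console.println(s);
    }
}
```

Opening out.txt in an editor that is itself set to EUC-JP should then show the character correctly.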
Csaba Nagy wrote:
> I suspect that your machine's default encoding and the encoding used by
> your Java program don't match.

[snip]

> i.e., explicitly tell your Java writer code what encoding to use, and
> explicitly tell the editor what encoding to use when opening the file.
> Otherwise they'll use their default encodings, which might not match.

Very true. I'll look up how to specify the encoding when writing to file. I don't know that it is possible when writing to the console, though.

*But* I must point out that I am writing quite a bit of data, in Japanese, to file and the console, and *all* of it comes out correctly *except* for that one character ...

I *will* check into how to specify the encoding, but I don't think that is the problem, as everything but the one character comes out right. And as I said, if I hard-code the string to be printed, it comes out right ... only when the string is retrieved from the database does it come out wrong ...

Thanks,

Jean-Christian Imbeault
Thomas O'Dowd wrote:
> What encoding did you use to put the character into the database?

EUC-JP.

> There are some mapping problems still in postgres for some Japanese
> characters. It depends on which version of Java you are using and where
> the data is coming from etc.

I thought so. I had posted this same 'bug' quite a few months back, but nothing came of it. I am using postgres 7.3.3 and the NetBeans 3.5 IDE. The Java version, I think, is 1.4.1_03-b02.

> I'm attaching an email I wrote to hackers about this before. Looks like
> the same problem. Anyway, nothing to do with the driver itself.

Thanks. I'll subscribe to hackers to see what happens with this. I hope it's something easily fixed?

Thanks!

Jean-Christian Imbeault
What encoding did you use to put the character into the database? There are some mapping problems still in postgres for some Japanese characters. It depends on which version of Java you are using and where the data is coming from, etc. I'm attaching an email I wrote to hackers about this before. Looks like the same problem. Anyway, nothing to do with the driver itself.

Cheers,

Tom.

On Mon, 2003-06-23 at 18:55, Jean-Christian Imbeault wrote:
> I *will* check into how to specify the encoding but I don't think that
> is the problem as everything but the one character comes out right. [...]

-- 
Thomas O'Dowd - Got a keitai? Get Nooped! tom@nooper.com - http://nooper.com

[Attached email follows.]

Hi all,

One Japanese character has been causing my head to swim lately. I've finally tracked down the problem to both Java 1.3 and Postgresql.
The problem character is namely:

  utf-16: 0x301C
  utf-8:  0xE3809C
  SJIS:   0x8160
  EUC_JP: 0xA1C1

Otherwise known as the WAVE DASH character. The confusion stems from a very similar character, 0xFF5E (utf-16) or 0xEFBD9E (utf-8), the FULLWIDTH TILDE.

Java has just lately (1.4.1) finally fixed their mappings so that 0x301C maps correctly to both the correct SJIS and EUC-JP character. Previously (at least in 1.3.1) they mapped SJIS to 0xFF5E and EUC to 0x301C, causing all sorts of trouble.

Postgresql at least picked one of the two characters, namely 0xFF5E, so conversions in and out of the database to/from sjis/euc seemed to be working. The problem is when you try to view utf-8 from the database, or if you read the data into java (utf-16) and try converting to euc or sjis from there.

Anyway, I think postgresql needs to be fixed for this character. In my opinion, what needs to be done is to change the mappings...

  euc-jp  ->  utf-8     ->  euc-jp
  ======      ========      ======
  0xA1C1  ->  0xE3809C  ->  0xA1C1

  sjis    ->  utf-8     ->  sjis
  ======      ========      ======
  0x8160  ->  0xE3809C  ->  0x8160

As to what to do with the current mapping of 0xEFBD9E (utf-8)? It probably should be removed. Maybe you could keep the mapping back to the sjis/euc characters to help backward compatibility, though. I'm not sure what the correct approach is there.

If anyone can tell me how to edit the mappings under src/backend/utils/mb/Unicode/ and rebuild postgres to use them, then I can test this out locally.

Looking forward to your replies.

Tom.
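[Editor's note: the mapping split Tom describes can be reproduced directly from Java. This is an illustrative sketch; the byte values shown assume a JDK with the 1.4.1 charset-table fix, where the JIS-standard encodings carry WAVE DASH and Microsoft's code page carries FULLWIDTH TILDE.]

```java
import java.nio.charset.Charset;

// Show how WAVE DASH (U+301C) and FULLWIDTH TILDE (U+FF5E) land on the
// same legacy byte sequences depending on which charset table is used.
public class WaveDashCheck {
    static String hex(byte[] b) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < b.length; i++) {
            String h = Integer.toHexString(b[i] & 0xFF).toUpperCase();
            sb.append(h.length() < 2 ? "0" + h : h);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String waveDash = "\u301C";       // WAVE DASH
        String fullwidthTilde = "\uFF5E"; // FULLWIDTH TILDE

        // JIS-standard mappings: WAVE DASH round-trips through both
        // legacy encodings.
        System.out.println("U+301C EUC-JP:      0x"
                + hex(waveDash.getBytes("EUC-JP")));    // 0xA1C1
        System.out.println("U+301C Shift_JIS:   0x"
                + hex(waveDash.getBytes("Shift_JIS"))); // 0x8160

        // Microsoft's code page maps the *same* legacy bytes to
        // FULLWIDTH TILDE instead -- the root of the confusion.
        System.out.println("U+FF5E windows-31j: 0x"
                + hex(fullwidthTilde.getBytes("windows-31j"))); // 0x8160
    }
}
```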