Thread: Problems with writing EUC-JP/Unicode to console or file

Problems with writing EUC-JP/Unicode to console or file

From: Jean-Christian Imbeault
I'm having trouble writing a particular character to a file or even to
the console. The character in question is the Japanese double-width "-"
symbol (in a Japanese font, 'ー').

When I read the data from the pg database, encoded in EUC-JP, and
display it in a GUI text field it is displayed properly. However if I
try and write it to the console using System.out.println() or to a file
using FileWriter.write() the character comes out as a '?'.

I am using the newest version of postgres, 7.3.3, but a slightly old
driver, pgjdbc2.jar.

Is this a problem with the driver, Java, or the way I am trying to
print/write out the data?

Any help or advice is greatly appreciated!

Thanks,

Jean-Christian Imbeault


Re: Problems with writing EUC-JP/Unicode to console or file

From: Csaba Nagy
This sounds like your terminal can't display that character. Depending
on the OS/terminal you use, you might or might not be able to change
the character encoding used by the terminal so it can correctly display
that character. For example, on my Linux box I can switch to UTF-8
by setting "LANG=en_US.UTF-8", and then even vi will use UTF-8 as its
character encoding. OTOH, EUC-JP is not supported, so I can't see your
character...

Cheers,
Csaba.

On Mon, 2003-06-23 at 08:20, Jean-Christian Imbeault wrote:
> I'm having trouble writing a particular character to a file or even to
> the console. The character in question is the japanese double-width "-"
> symbol (in japanese font 'ー')
>
> When I read the data from the pg database, encoded in EUC-JP, and
> display it in a GUI text field it is displayed properly. However if I
> try and write it to the console using System.out.println() or to a file
> using FileWriter.write() the character comes out as a '?'.
>
> I am using a the newest version of postgres, 7.3.3, but a slightly old
> driver, pgidbc2.jar.
>
> Is this a problem with the driver, Java, or the way I am trying to
> print/write out the data?
>
> Any help or advice is greatly appreciated!
>
> Thanks,
>
> Jean-Christian Imbeault



Re: Problems with writing EUC-JP/Unicode to console or file

From: Jean-Christian Imbeault
Csaba Nagy wrote:
> This sounds like your terminal can't display that character.

Hum ... good idea. But I don't think this is the problem in my case, as I
am using the Japanese version of Windows 2000.

Also, if I hard-code the string to be written, it is displayed fine
on the console or in the file, like this:

String string = "ー";
System.out.println(string);

The above code prints out fine.

But if 'string' is retrieved from the database, it does not display properly
when printed out or written to file. Strangely enough, it does
display properly in GUI components.

Maybe the GUI and the OS use different, and incompatible, Unicode fonts?

I'm thinking that the Unicode byte representation of "ー" in the
database does not map to a valid character in the font set that my
OS uses? That, or the code point (is that the right word?) that pg uses
for this character is not the same as the one Java uses? (I remember
some talk about the Unicode translation being moved from the driver to
the back end and some changes in the translation tables or something
like that?)

I'm really at a loss here, and any advice on what I can do to find the
root cause, and hopefully a fix for it, is very much appreciated, as my
application depends on being able to write this character to file.
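
One way to narrow down the root cause is to compare the actual code points of the string returned by the database with those of the hard-coded literal, and to ask the JVM's default charset whether it can encode them. The following is only a sketch: the class and method names are made up for illustration, and the two sample strings in main() stand in for a real value fetched over JDBC.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of the JDBC driver or of PostgreSQL.
final class CodePointDump {

    // Returns the code points of s as "U+XXXX U+XXXX ...".
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // True if every character of s can be encoded in the given charset.
    // A character that cannot be encoded is what ends up as '?' on output.
    static boolean encodable(String s, Charset cs) {
        return cs.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        // Assumption: in the real program, fromDb would come from ResultSet.getString().
        String hardCoded = "\u30FC";
        String fromDb = "\uFF5E";
        Charset def = Charset.defaultCharset();
        System.out.println("default charset: " + def);
        System.out.println("literal: " + describe(hardCoded) + " encodable=" + encodable(hardCoded, def));
        System.out.println("from db: " + describe(fromDb) + " encodable=" + encodable(fromDb, def));
    }
}
```

If the two dumps differ, the strings only look alike on screen and the question becomes which code point the database conversion produced.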

Thanks!

Jean-Christian Imbeault


Re: Problems with writing EUC-JP/Unicode to console or file

From: Csaba Nagy
I suspect that your machine's default encoding and the encoding used by
your Java program don't match. When you write to the web, the browser
knows the correct encoding to use from the HTTP headers (which, BTW, you
could consult to see what encoding you are writing with). But when you
write a file and read it in a console, the encoding is known only if you
tell it to the programs you use, i.e. explicitly tell your Java
writer code what encoding to use, and explicitly tell the editor what
encoding to use when opening the file. Otherwise they'll use their
default encodings, which might not match.
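
As a rough illustration of "explicitly tell your Java writer code what encoding to use", the sketch below wraps the output stream in an OutputStreamWriter with a named charset instead of relying on FileWriter's default, and wraps System.out in a PrintStream with a named encoding. The class name, the file name out.txt and the choice of UTF-8 are assumptions for the example; the right charset is whatever the editor or console will actually read with.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintStream;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo class; names are illustrative only.
final class ExplicitEncodingDemo {

    // Writes text to out using the given charset rather than the platform default.
    static void write(OutputStream out, String text, Charset cs) throws IOException {
        Writer w = new OutputStreamWriter(out, cs);
        w.write(text);
        w.flush(); // push the encoded bytes through; the caller owns and closes out
    }

    public static void main(String[] args) throws IOException {
        String text = "\u30FC"; // the character from the original post

        // File output with an explicit encoding (out.txt is a placeholder name).
        Path file = Paths.get("out.txt");
        try (OutputStream out = Files.newOutputStream(file)) {
            write(out, text, StandardCharsets.UTF_8);
        }

        // Console output with an explicit encoding; the terminal must be
        // configured to interpret the same encoding for this to display well.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(text);

        // For comparison: what FileWriter and System.out use when nothing is specified.
        System.out.println("default charset: " + Charset.defaultCharset());
    }
}
```

The file must then be opened in an editor set to the same encoding, otherwise the mismatch simply moves from the writer to the reader.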

Cheers,
Csaba.


On Mon, 2003-06-23 at 11:36, Jean-Christian Imbeault wrote:
> Csaba Nagy wrote:
> > This sounds like your terminal can't display that character.
>
> Hum ... good idea. But I don't think this is the problem in my case as I
> am using the japanese version of Windows 2000.
>
> Also if I hard-coded the string to be written then it is displayed fine
> on the console or in the file, like this:
>
> String string = "ー";
> System.out.println(string);
>
> The above code prints out fine.
>
> But if 'string' is gotten from the database it does not display properly
> when printed out or written to file. But strangely enough it does
> display properly in GUI components.
>
> Maybe the GUI and the OS use different, and incompatible, unicode fonts?
>
> I'm thinking that the unicode byte representation of "ー" in the
> database and does not map to a valid character in the font set that my
> OS uses? That or the code point (is that the right word?) that pg uses
> for this character and is the not the same as Java uses? (I remember
> some talk about the unicode translation being moved from the driver to
> the back end and some changes in the translation tables or something
> like that?)
>
> I'm really at a loss here and any advice on what I can do to find the
> root cause and hopefully a fix for this are very much appreciated as my
> application depends on being able to write this character to file.
>
> Thanks!
>
> Jean-Christian Imbeault
>



Re: Problems with writing EUC-JP/Unicode to console or file

From: Jean-Christian Imbeault
Csaba Nagy wrote:
> I suspect that your machine's default encoding and the encoding used by
> your Java program doesn't match.

[snip]

>i.e. explicitly tell to your Java
> writer code what encoding to use, and explicitly tell to the editor what
> encoding to use when opening the file. Otherwise they'll use their
> default encodings, which might not match.

Very true. I'll look up how to specify the encoding when writing to
file. I don't know that it is possible when writing to the console though.

*But* I must point out that I am writing quite a bit of data, in
Japanese, to file and the console and *all* of it comes out correctly
*except* for that one character ...

I *will* check into how to specify the encoding, but I don't think that
is the problem, as everything but the one character comes out right.
And as I said, if I hard-code the string to be printed it comes out
right ... only when the string is retrieved from the database does it
come out wrong ...

Thanks,

Jean-Christian Imbeault


Re: Problems with writing EUC-JP/Unicode to console or file

From: Jean-Christian Imbeault
Thomas O'Dowd wrote:
> What encoding did you use to put the character into the database?

EUC-JP.

> There
> are some mapping problems still in postgres for some Japanese
> characters. It depends on which version of Java you are using and where
> the data is coming from etc.

I thought so. I had posted this same 'bug' quite a few months back but
nothing came of it.

I am using postgres 7.3.3 and the NetBeans 3.5 IDE. The Java version, I
think, is 1.4.1_03-b02.

> I'm attaching an email I wrote to hackers
> about this before. Looks like the same problem. Anyway, nothing to do
> with the driver itself.

Thanks. I'll subscribe to hackers to see what happens with this.

I hope it's something easily fixed?

Thanks!

Jean-Christian Imbeault


Re: Problems with writing EUC-JP/Unicode to console or file

From: Thomas O'Dowd
What encoding did you use to put the character into the database? There
are some mapping problems still in postgres for some Japanese
characters. It depends on which version of Java you are using and where
the data is coming from etc. I'm attaching an email I wrote to hackers
about this before. Looks like the same problem. Anyway, nothing to do
with the driver itself.

Cheers,

Tom.

On Mon, 2003-06-23 at 18:55, Jean-Christian Imbeault wrote:
> Csaba Nagy wrote:
> > I suspect that your machine's default encoding and the encoding used by
> > your Java program doesn't match.
>
> [snip]
>
> >i.e. explicitly tell to your Java
> > writer code what encoding to use, and explicitly tell to the editor what
> > encoding to use when opening the file. Otherwise they'll use their
> > default encodings, which might not match.
>
> Very true. I'll look up how to specify the encoding when writing to
> file. I don't know that it is possible when writing to the console though.
>
> *But* I must point out that I am writing quite a bit of data, in
> japanese, to file and the console and *all* of it come out correctly
> *except* for that one character ...
>
> I *will* check into how to specify the encoding but I don't think that
> is the problem as everything but the one character comes out out right.
> And as I had said, if I hard-code the string to be printed it comes out
> right ... only when the string is retrieved from the database does it
> come out wrong ...
>
> Thanks,
>
> Jean-Christian Imbeault
--
Thomas O'Dowd  - Got a keitai? Get Nooped!
tom@nooper.com - http://nooper.com

[Attached message to hackers:]

Hi all,

One Japanese character has been causing my head to swim lately. I've
finally tracked down the problem to both Java 1.3 and Postgresql.

The problem character is namely:
utf-16: 0x301C
utf-8: 0xE3809C
SJIS: 0x8160
EUC_JP: 0xA1C1
Otherwise known as the WAVE DASH character.

The confusion stems from a very similar character 0xFF5E (utf-16) or
0xEFBD9E (utf-8) the FULLWIDTH TILDE.

Java has only recently (1.4.1) fixed its mappings so that 0x301C maps
to the correct character in both SJIS and EUC-JP. Previously
(at least in 1.3.1) it mapped the SJIS character to 0xFF5E and the EUC
character to 0x301C, causing all sorts of trouble.

PostgreSQL at least picked one of the two characters, namely 0xFF5E, so
conversions in and out of the database to/from SJIS/EUC seemed to be
working. The problem appears when you try to view UTF-8 from the
database, or when you read the data into Java (UTF-16) and try
converting to EUC or SJIS from there.
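
To see how a particular JVM treats the two look-alike characters, a small probe like the one below can help. It is a sketch, not a statement of what any given Java version does: it prints the UTF-8 bytes of U+301C and U+FF5E and then reports whether each of a few Japanese charsets can encode them. The class name and the list of charset names are assumptions; charsets missing from the runtime are skipped.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class; names are illustrative only.
final class WaveDashBytes {

    // Formats bytes as upper-case hex without separators, e.g. "E3809C".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String[] charsetNames = {"EUC-JP", "Shift_JIS", "windows-31j"};
        String[] chars = {"\u301C", "\uFF5E"}; // WAVE DASH, FULLWIDTH TILDE
        for (String c : chars) {
            System.out.println(String.format("U+%04X utf-8=%s", (int) c.charAt(0), utf8Hex(c)));
            for (String name : charsetNames) {
                if (!Charset.isSupported(name)) {
                    System.out.println("  " + name + ": not available in this runtime");
                    continue;
                }
                Charset cs = Charset.forName(name);
                boolean ok = cs.newEncoder().canEncode(c);
                System.out.println("  " + name + ": " + (ok ? hex(c.getBytes(cs)) : "not encodable"));
            }
        }
    }
}
```

A "not encodable" line for the charset your console or file uses would explain a '?' in the output for that character.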

Anyway, I think postgresql needs to be fixed for this character. In my
opinion what needs to be done is to change the mappings...

euc-jp -> utf-8    -> euc-jp
======    ========    ======
0xA1C1 -> 0xE3809C    0xA1C1

sjis   -> utf-8    -> sjis
======    ========    ======
0x8160 -> 0xE3809C    0x8160

As to what to do with the current mapping of 0xEFBD9E (utf-8)? It
probably should be removed. Maybe you could keep the mapping back to the
sjis/euc characters to help backward compatibility though. I'm not sure
what is the correct approach there.
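
Until the backend mappings change, an application could paper over the difference on its own side by swapping one character for the other before encoding. This is purely a hypothetical client-side workaround, not something proposed in the message above, and the class and method names are invented. Which direction to use depends on which of the two characters the target charset can encode.

```java
// Hypothetical client-side workaround; not part of PostgreSQL or the JDBC driver.
final class WaveDashNormalizer {

    static final char WAVE_DASH = '\u301C';
    static final char FULLWIDTH_TILDE = '\uFF5E';

    // Replaces every FULLWIDTH TILDE with WAVE DASH.
    static String toWaveDash(String s) {
        return s.replace(FULLWIDTH_TILDE, WAVE_DASH);
    }

    // Replaces every WAVE DASH with FULLWIDTH TILDE.
    static String toFullwidthTilde(String s) {
        return s.replace(WAVE_DASH, FULLWIDTH_TILDE);
    }

    public static void main(String[] args) {
        // Assumption: fromDb stands in for a value read from the database.
        String fromDb = "10\uFF5E20";
        String fixed = toWaveDash(fromDb);
        System.out.println(String.format("before: U+%04X", (int) fromDb.charAt(2)));
        System.out.println(String.format("after:  U+%04X", (int) fixed.charAt(2)));
    }
}
```

The swap loses the distinction between the two characters, so it only makes sense where the data never legitimately contains both.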

If anyone can tell me how to edit the mappings under:
    src/backend/utils/mb/Unicode/

and rebuild postgres to use them, then I can test this out locally.

Looking forward to your replies.

Tom.


