Thread: Small problem with special characters

Small problem with special characters

From
Thomas Kellerer
Date:
Hello,

I'm having a small problem with inserting German umlauts. The PG database was
created with the default code page (by accident) and thus ended up with 'WIN1251'.

From within a Java program I'm trying to insert a value into the database
that contains the character 'ö'. When doing that, the JDBC driver throws an
error:

character 0xc3b6 of encoding "UTF8" has no equivalent in "WIN1251" [SQL
State=22P05]

What I don't understand is why this fails, since this character is contained in
the Windows code page. The value itself is retrieved from a JDBC connection to an
HSQLDB database, so the string handling is Java internal only.

I can convert the database to UTF8, so that is not the problem. The main reason
I'm writing is that I'd like to understand what is going on here.

Does this mean that I can only access UTF8-enabled databases with the JDBC driver?

Thanks
Thomas
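
The value 0xc3b6 in the error message is simply the UTF-8 byte sequence for 'ö' (U+00F6). The following is a minimal Java sketch of that, not code from the thread; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class UmlautBytes {
    /** Returns the UTF-8 bytes of s as lowercase hex digits. */
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes do not print as ffffffXX.
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // 'ö' written as an escape so the source-file encoding does not matter.
        String umlaut = "\u00f6";
        System.out.println("UTF-16 code units: " + umlaut.length());
        System.out.println("UTF-8 bytes: 0x" + utf8Hex(umlaut));
    }
}
```

Inside Java the character is a single UTF-16 code unit; it only becomes the two bytes c3 b6 when it is encoded as UTF-8 for transmission to the server.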


Re: Small problem with special characters

From
Tom Lane
Date:
Thomas Kellerer <spam_eater@gmx.net> writes:
> character 0xc3b6 of encoding "UTF8" has no equivalent in "WIN1251" [SQL
> State=22P05]

Indeed, that's what the conversion table embedded in the backend
thinks.  Some cursory poking about in src/backend/utils/mb/Unicode/
says that we derive these tables mechanically from authoritative data
at ftp://www.unicode.org/Public/MAPPINGS/ ... so maybe you need to
take this up with them.  What's your basis for asserting that this
character exists in code page 1251?

            regards, tom lane

Re: Small problem with special characters

From
Thomas Kellerer
Date:
Tom Lane wrote on 20.08.2006 17:46:
> Thomas Kellerer <spam_eater@gmx.net> writes:
>> character 0xc3b6 of encoding "UTF8" has no equivalent in "WIN1251" [SQL
>> State=22P05]
>
> Indeed, that's what the conversion table embedded in the backend
> thinks.  Some cursory poking about in src/backend/utils/mb/Unicode/
> says that we derive these tables mechanically from authoritative data
> at ftp://www.unicode.org/Public/MAPPINGS/ ... so maybe you need to
> take this up with them.  What's your basis for asserting that this
> character exists in code page 1251?

Well the 0xc3b6 does obviously _not_ exist in the Windows codepage (as it is a
2byte character). But this is a "regular" umlaut (ö) that (from a "visual" point
of view) does exist in the codepage.

Obviously from within Java I do not have the chance to work with anything else
than UTF (I think Java internally uses some kind of UTF-16 flavor). So how would
I get this character into the table from within a Java program?

Don't get me wrong: I do understand the technical background, and I can happily
use an UTF8 database. So I don't really have an issue with this.

I'm just wondering: how would someone who is forced to use a win1251 database
together with Java succeed in inserting umlauts into the database?

Cheers
Thomas



Re: Small problem with special characters

From
Kris Jurka
Date:

On Sun, 20 Aug 2006, Thomas Kellerer wrote:

> Well the 0xc3b6 does obviously _not_ exist in the Windows codepage (as it is
> a 2byte character). But this is a "regular" umlaut (ö) that (from a "visual"
> point of view) does exist in the codepage.

Where do you see it here?

http://www.microsoft.com/globaldev/reference/sbcs/1251.mspx

I can't find it, but if you do it will show the unicode equivalent that
you need to use to insert it.

> I'm just wondering: how would someone who is forced to use a win1251 database
> together with Java succeed in inserting umlauts into the database?
>

You can't fit every unicode character into a single byte encoding.

Kris Jurka
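
Whether a given code page contains a character can also be checked from Java itself. This is a rough sketch using the standard `CharsetEncoder.canEncode` method. It assumes the JDK in use ships the `windows-1250`, `windows-1251` and `windows-1252` charsets (typical, but not guaranteed on every runtime), and the class name is hypothetical.

```java
import java.nio.charset.Charset;

final class CodePageCheck {
    /** Reports whether the named charset has a mapping for the given character. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char umlaut = '\u00f6'; // 'ö'
        String[] codePages = {"windows-1250", "windows-1251", "windows-1252"};
        for (String name : codePages) {
            System.out.println(name + " can encode U+00F6: " + canEncode(name, umlaut));
        }
    }
}
```

Code page 1251 is the Cyrillic one, so the check should come back negative there, which matches the server's conversion error.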

Re: Small problem with special characters

From
Thomas Kellerer
Date:
Kris Jurka wrote on 20.08.2006 18:42:
>
>> Well the 0xc3b6 does obviously _not_ exist in the Windows codepage (as
>> it is a 2byte character). But this is a "regular" umlaut (ö) that
>> (from a "visual" point of view) does exist in the codepage.
>
> Where do you see it here?
>
> http://www.microsoft.com/globaldev/reference/sbcs/1251.mspx
>
> I can't find it, but if you do it will show the unicode equivalent that
> you need to use to insert it.
>

*Blush*, I stand corrected.

I thought 1251 was the "standard" code page. But it seems that the
"standard", at least in Western Europe, is 1250.

Sorry for the confusion, win1250 does work as expected.

Cheers
Thomas, now wondering how a database with win1251 ever got created on my
machine :)



Re: Small problem with special characters

From
Tom Lane
Date:
Thomas Kellerer <spam_eater@gmx.net> writes:
> Thomas, now wondering how a database with win1251 ever got created on my
> machine :)

It might be worth poking into that --- we could have, say, a mistake
in initdb's idea of how to convert system names for locales into PG
encoding names.  Please see if you can duplicate it.

            regards, tom lane

Re: Small problem with special characters

From
Kris Jurka
Date:

On Sun, 20 Aug 2006, Thomas Kellerer wrote:

> Thomas, now wondering how a database with win1251 ever got created on my
> machine :)
>

For historical reasons the alias "win" refers to win1251; perhaps someone
picked "win" thinking it was for all Windows users?

Kris Jurka


Re: Small problem with special characters

From
Thomas Kellerer
Date:
Tom Lane wrote on 20.08.2006 20:15:
> Thomas Kellerer <spam_eater@gmx.net> writes:
>> Thomas, now wondering how a database with win1251 ever got created on my
>> machine :)
>
> It might be worth poking into that --- we could have, say, a mistake
> in initdb's idea of how to convert system names for locales into PG
> encoding names.  Please see if you can duplicate it.

Hmm. Is there a way I can find out which encoding was specified with initdb?

I assume I cannot create a database that would not match (or be a true subset
of) the encoding specified with initdb, right?

Regards
Thomas

Re: Small problem with special characters

From
Tom Lane
Date:
Thomas Kellerer <spam_eater@gmx.net> writes:
> Hmm. Is there a way I can find out which encoding was specified with initdb?

pg_controldata will show you the locale settings that initdb saw.  The
encoding assigned to template0 is what initdb deduced it should use
(unless you overrode it with the -E switch).

> I assume I cannot create a database that would not match (or be a true
> subset of) the encoding specified with initdb, right?

Uh, no, we don't enforce that ... there is a school of thought that
says we should, but the Japanese complain every time it comes up,
because they have to deal with multiple encodings and they don't
care all that much about locale settings.  We probably won't be able
to fix this properly until we can support per-database (or preferably
even finer grain) locale settings.

            regards, tom lane
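
The encoding assigned to each database, template0 included, can also be read over JDBC from the `pg_database` catalog. This is only a sketch: the class name is made up, the connection URL, user and password are placeholders supplied on the command line, and a PostgreSQL JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.LinkedHashMap;
import java.util.Map;

final class DatabaseEncodings {
    // pg_encoding_to_char() turns the numeric encoding id into a name such as UTF8.
    static final String ENCODING_QUERY =
        "SELECT datname, pg_encoding_to_char(encoding) FROM pg_database ORDER BY datname";

    /** Returns database name -> encoding name for every database in the cluster. */
    static Map<String, String> list(Connection conn) throws SQLException {
        Map<String, String> result = new LinkedHashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(ENCODING_QUERY)) {
            while (rs.next()) {
                result.put(rs.getString(1), rs.getString(2));
            }
        }
        return result;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.out.println("usage: DatabaseEncodings <jdbc-url> <user> <password>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0], args[1], args[2])) {
            list(conn).forEach((db, enc) -> System.out.println(db + " -> " + enc));
        }
    }
}
```

The row for template0 shows the encoding that initdb settled on; other databases may differ because the encoding is chosen per database.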

Re: Small problem with special characters

From
Thomas Kellerer
Date:
Tom Lane wrote on 24.08.2006 19:28:
> Thomas Kellerer <spam_eater@gmx.net> writes:
>> Hmm. Is there a way I can find out which encoding was specified with initdb?
>
> pg_controldata will show you the locale settings that initdb saw.  The
> encoding assigned to template0 is what initdb deduced it should use
> (unless you overrode it with the -E switch).

I only found pg_settings, which contains the following:

server_encoding = UTF8
client_encoding = UNICODE

When I run createdb dbname, the database is created with UTF8, so it must have
been something stupid on my side. I don't think there is a problem with the installer.

Thanks for your help
Thomas