Обсуждение: Unicode database on non-unicode operating system

Поиск
Список
Период
Сортировка

Unicode database on non-unicode operating system

От
"Morten Barklund"
Дата:

Hi,

 

I have a PostgreSQL 8.0.3 running on an older debian server and have some problems with unicode databases and character conversions.

 

 

First up, some backgrund info about my server and installation:

 

test=# \set

VERSION = 'PostgreSQL 8.0.3 on i386-pc-linux-gnu, compiled by GCC cc (GCC) 3.3.5 (Debian 1:3.3.5-13)'

AUTOCOMMIT = 'on'

VERBOSITY = 'default'

PROMPT1 = '%/%R%# '

PROMPT2 = '%/%R%# '

PROMPT3 = '>> '

HISTSIZE = '500'

LASTOID = '0'

DBNAME = 'test'

USER = 'postgres'

HOST = '/var/run/postgresql'

PORT = '5432'

ENCODING = 'UNICODE'

test=# \! uname -a

Linux xx 2.4.27-2-686-smp #1 SMP XX i686 GNU/Linux

test=# \! locale

LANG=POSIX

LC_CTYPE="POSIX"

LC_NUMERIC="POSIX"

LC_TIME="POSIX"

LC_COLLATE="POSIX"

LC_MONETARY="POSIX"

LC_MESSAGES="POSIX"

LC_PAPER="POSIX"

LC_NAME="POSIX"

LC_ADDRESS="POSIX"

LC_TELEPHONE="POSIX"

LC_MEASUREMENT="POSIX"

LC_IDENTIFICATION="POSIX"

LC_ALL=

 

 

My problem is, that the lowercase versions of non-ascii characters are broken. Specifically I found, that when lower() is invoked on a text with non-ascii characters, the operating system's locale is used for converting each octet in the string to lowercase in stead of using the locale of the database to convert each character in the string to lowercase. This caused the danish lower case o with slash "ø", which in unicode is represented as the latin1-readable octets "ø", to be converted to the latin1-readable octets "ã¸", which then in turn was (tried) to be interpreted as a unicode character - but the octects "ã¸" does not represent a unicode character in utf8. The lower case version of "ø" is of course just itself.

 

To get around this problem, I had to create a function ulower:

 

create or replace function ulower(text) returns text as 'begin

 return convert(lower(convert($1,''utf8'',''latin1'')),''latin1'',''utf8'');

end;' language plpgsql immutable;

 

Not a very nice solution and it of course only works for latin1-compatible utf8-encoded strings.

 

First up, I would like to avoid this whole issue. How could this be circumvented, any settings I can flick around?

 

 

Then I tried to apply this immutable function as a functional index on a varchar:

 

test# create index mytable_mycolumn_lower_idx on mytable(ulower(mycolumn));

ERROR:  could not convert UTF-8 character 0x00e2 to ISO8859-1

 

I had a lot of data in the table before creating this index, and apparently one of the rows contained unicode character U+00E2 - which is latin small letter a with circumflex = â. This is a perfectly legal latin1-character (as any Unicode character below U+0100 is).

 

My second question is then, why did it fail to convert this character to latin1?

 

 

Thanks in advance for any help

 

 

Regards,

Morten Barklund

Head of Development

TBWA\

Re: Unicode database on non-unicode operating system

От
Peter Eisentraut
Дата:
Am Dienstag, 15. Juli 2008 schrieb Morten Barklund:
> My problem is, that the lowercase versions of non-ascii characters are
> broken. Specifically I found, that when lower() is invoked on a text with
> non-ascii characters, the operating system's locale is used for converting
> each octet in the string to lowercase in stead of using the locale of the
> database to convert each character in the string to lowercase. This caused
> the danish lower case o with slash "ø", which in unicode is represented as
> the latin1-readable octets "ø", to be converted to the latin1-readable
> octets "ã¸", which then in turn was (tried) to be interpreted as a unicode
> character - but the octects "ã¸" does not represent a unicode character in
> utf8. The lower case version of "ø" is of course just itself.

This means you have mismatching server encodings and locales configured.
Check SHOW lc_collate and SHOW server_encoding, and then pick a combination
that is compatible.  This will probably mean you have to reinitdb.

Re: Unicode database on non-unicode operating system

От
"Morten Barklund"
Дата:
Hi Peter,

Thanks for the hint.

I can see that lc_collate (sorting) and lc_ctype (lower-upper conversion) is set to en_DK and I guess that default
encodingfor en_DK is iso88591 or maybe windows1252. Thus my server should have been initialized with en_DK.utf8 or? How
doI find out what the default encoding for the locale en_DK is? I can see, that normally one would sub-specify this by
eitheradding .iso88591 or .utf8, but is windows1252 then default?
 

Because it is clear, that en_DK includes the proper rules for upper-lower conversion of Danish special characters as I
whenconverting from UTF-8 to ISO 8859-1 can use upper() and lower() as expected. And Danish special characters have the
samecode points in latin1 and windows1252.
 

I am not able to reinitdb, as many other databases are running, which might be affected negatively. This means, that
eventhough my database is created WITH ENCODING 'unicode', it is in fact "broken" as the locale does not fully support
unicodestring handling?
 

I wanted to use Unicode, as I expected non-latin1 characters, but this actually means, that if I had any such, some
stringfunctions would not work at all.
 


Regards,
Morten Barklund

-----Original Message-----
From: Peter Eisentraut [mailto:peter_e@gmx.net] 
Sent: Tuesday, July 15, 2008 2:33 PM
To: pgsql-general@postgresql.org
Cc: Morten Barklund
Subject: Re: [GENERAL] Unicode database on non-unicode operating system

Am Dienstag, 15. Juli 2008 schrieb Morten Barklund:
> My problem is, that the lowercase versions of non-ascii characters are
> broken. Specifically I found, that when lower() is invoked on a text with
> non-ascii characters, the operating system's locale is used for converting
> each octet in the string to lowercase in stead of using the locale of the
> database to convert each character in the string to lowercase. This caused
> the danish lower case o with slash "ø", which in unicode is represented as
> the latin1-readable octets "ø", to be converted to the latin1-readable
> octets "ã¸", which then in turn was (tried) to be interpreted as a unicode
> character - but the octects "ã¸" does not represent a unicode character in
> utf8. The lower case version of "ø" is of course just itself.

This means you have mismatching server encodings and locales configured.  
Check SHOW lc_collate and SHOW server_encoding, and then pick a combination 
that is compatible.  This will probably mean you have to reinitdb.



Re: Unicode database on non-unicode operating system

От
Peter Eisentraut
Дата:
Am Dienstag, 15. Juli 2008 schrieb Morten Barklund:
> I can see that lc_collate (sorting) and lc_ctype (lower-upper conversion)
> is set to en_DK and I guess that default encoding for en_DK is iso88591 or
> maybe windows1252.

It is ISO-8859-1.  There is no support for Windows charmaps on Linux.

> Thus my server should have been initialized with
> en_DK.utf8 or?

Yes, or you should have chosen a different encoding (LATIN1 in your case) when
creating the database.

> How do I find out what the default encoding for the locale en_DK is?

$ LC_ALL=en_DK locale charmap
ISO-8859-1

Note that this is not the "default" encoding, it is the *only* encoding
supported by that locale.

> I can see, that normally one would sub-specify this by either
> adding .iso88591 or .utf8, but is windows1252 then default?

It might be reasonable to use the .iso88591 or .utf8 suffixes if you want to
be explicit, but the unsuffixed locale name is usually just an alias for one
of these.

> I am not able to reinitdb, as many other databases are running, which might
> be affected negatively. This means, that even though my database is created
> WITH ENCODING 'unicode', it is in fact "broken" as the locale does not
> fully support unicode string handling?

Yes.  If you can't reinitdb, then you should recreate the database with
encoding LATIN1.  This won't allow all Unicode characters, obviously, but at
least you get proper behavior for the Danish characters that you need.

Re: Unicode database on non-unicode operating system

От
"Morten Barklund"
Дата:
Hi Peter,

Thank you once again. That cleared up a lot of confusion for me and my 
co-workers and the next server set up will be with unicode and en_DK.utf8 
to ensure consistency.


Regards,
Morten Barklund

-----Original Message-----
From: Peter Eisentraut [mailto:peter_e@gmx.net] 
Sent: Tuesday, July 15, 2008 3:50 PM
To: pgsql-general@postgresql.org
Cc: Morten Barklund
Subject: Re: [GENERAL] Unicode database on non-unicode operating system

Am Dienstag, 15. Juli 2008 schrieb Morten Barklund:
> I can see that lc_collate (sorting) and lc_ctype (lower-upper conversion)
> is set to en_DK and I guess that default encoding for en_DK is iso88591 or
> maybe windows1252.

It is ISO-8859-1.  There is no support for Windows charmaps on Linux.

> Thus my server should have been initialized with 
> en_DK.utf8 or?

Yes, or you should have chosen a different encoding (LATIN1 in your case) when 
creating the database.

> How do I find out what the default encoding for the locale en_DK is?

$ LC_ALL=en_DK locale charmap
ISO-8859-1

Note that this is not the "default" encoding, it is the *only* encoding 
supported by that locale.

> I can see, that normally one would sub-specify this by either 
> adding .iso88591 or .utf8, but is windows1252 then default?

It might be reasonable to use the .iso88591 or .utf8 suffixes if you want to 
be explicit, but the unsuffixed locale name is usually just an alias for one 
of these.

> I am not able to reinitdb, as many other databases are running, which might
> be affected negatively. This means, that even though my database is created
> WITH ENCODING 'unicode', it is in fact "broken" as the locale does not
> fully support unicode string handling?

Yes.  If you can't reinitdb, then you should recreate the database with 
encoding LATIN1.  This won't allow all Unicode characters, obviously, but at 
least you get proper behavior for the Danish characters that you need.