Unicode database on non-unicode operating system

Поиск
Список
Период
Сортировка
От Morten Barklund
Тема Unicode database on non-unicode operating system
Дата
Msg-id AB6A9C75F1620048B14C9E7D9526F5B136CB53@TBWAMAIL.tbwa.dk
обсуждение исходный текст
Ответы Re: Unicode database on non-unicode operating system  (Peter Eisentraut <peter_e@gmx.net>)
Список pgsql-general

Hi,

 

I have a PostgreSQL 8.0.3 running on an older debian server and have some problems with unicode databases and character conversions.

 

 

First up, some backgrund info about my server and installation:

 

test=# \set

VERSION = 'PostgreSQL 8.0.3 on i386-pc-linux-gnu, compiled by GCC cc (GCC) 3.3.5 (Debian 1:3.3.5-13)'

AUTOCOMMIT = 'on'

VERBOSITY = 'default'

PROMPT1 = '%/%R%# '

PROMPT2 = '%/%R%# '

PROMPT3 = '>> '

HISTSIZE = '500'

LASTOID = '0'

DBNAME = 'test'

USER = 'postgres'

HOST = '/var/run/postgresql'

PORT = '5432'

ENCODING = 'UNICODE'

test=# \! uname -a

Linux xx 2.4.27-2-686-smp #1 SMP XX i686 GNU/Linux

test=# \! locale

LANG=POSIX

LC_CTYPE="POSIX"

LC_NUMERIC="POSIX"

LC_TIME="POSIX"

LC_COLLATE="POSIX"

LC_MONETARY="POSIX"

LC_MESSAGES="POSIX"

LC_PAPER="POSIX"

LC_NAME="POSIX"

LC_ADDRESS="POSIX"

LC_TELEPHONE="POSIX"

LC_MEASUREMENT="POSIX"

LC_IDENTIFICATION="POSIX"

LC_ALL=

 

 

My problem is, that the lowercase versions of non-ascii characters are broken. Specifically I found, that when lower() is invoked on a text with non-ascii characters, the operating system's locale is used for converting each octet in the string to lowercase in stead of using the locale of the database to convert each character in the string to lowercase. This caused the danish lower case o with slash "ø", which in unicode is represented as the latin1-readable octets "ø", to be converted to the latin1-readable octets "ã¸", which then in turn was (tried) to be interpreted as a unicode character - but the octects "ã¸" does not represent a unicode character in utf8. The lower case version of "ø" is of course just itself.

 

To get around this problem, I had to create a function ulower:

 

create or replace function ulower(text) returns text as 'begin

 return convert(lower(convert($1,''utf8'',''latin1'')),''latin1'',''utf8'');

end;' language plpgsql immutable;

 

Not a very nice solution and it of course only works for latin1-compatible utf8-encoded strings.

 

First up, I would like to avoid this whole issue. How could this be circumvented, any settings I can flick around?

 

 

Then I tried to apply this immutable function as a functional index on a varchar:

 

test# create index mytable_mycolumn_lower_idx on mytable(ulower(mycolumn));

ERROR:  could not convert UTF-8 character 0x00e2 to ISO8859-1

 

I had a lot of data in the table before creating this index, and apparently one of the rows contained unicode character U+00E2 - which is latin small letter a with circumflex = â. This is a perfectly legal latin1-character (as any Unicode character below U+0100 is).

 

My second question is then, why did it fail to convert this character to latin1?

 

 

Thanks in advance for any help

 

 

Regards,

Morten Barklund

Head of Development

TBWA\

В списке pgsql-general по дате отправления:

Предыдущее
От: Bill Moran
Дата:
Сообщение: Re: Backing up and deleting a database.
Следующее
От: "Sergey Konoplev"
Дата:
Сообщение: Re: Referential integrity vulnerability in 8.3.3