Обсуждение: Japanese words not distinguished

Поиск
Список
Период
Сортировка

Japanese words not distinguished

От
Harry Mantheakis
Дата:
Hello

I run PostgreSQL 7.4.6 on Linux with a JDBC client.

I initialised my database cluster with the following initdb command:

initdb --locale=en_GB.UTF-8 --encoding UNICODE

I have now discovered that my database cannot distinguish Japanese names or
words - it throws unique constraint errors on a composite primary key that
includes a VARCHAR field which stores the names or words.

My tests indicate that the database treats all Japanese names/words as
equal.

Having searched the forum archives, it seems to me that I should have
specified "--locale=C" as the locale setting when I initialised my database
cluster.

I am planning to re-initialise my database cluster using the following
command:

initdb --locale=C --encoding UNICODE

Then, after defining the relevant groups and users, I intend to call
pg_restore with reference to a "tar.gz" dump file of my data.

I wonder if someone might be kind enough to confirm that this is the right
approach to solving the problem.

Many thanks!

Harry Mantheakis
London, UK


Re: Japanese words not distinguished

От
Tom Lane
Дата:
Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes:
> I run PostgreSQL 7.4.6 on Linux with a JDBC client.

> I initialised my database cluster with the following initdb command:

> initdb --locale=en_GB.UTF-8 --encoding UNICODE

> I have now discovered that my database cannot distinguish Japanese names or
> words - it throws unique constraint errors on a composite primary key that
> includes a VARCHAR field which stores the names or words.

> My tests indicate that the database treats all Japanese names/words as
> equal.

Hmm, is that actually the correct spelling of the locale?  On my Linux
box, locale -a says it's "en_GB.utf8".  I'm not sure how well initdb can
verify the validity of a locale parameter, especially back in the 7.4
branch.  It could be that you are actually using a locale that doesn't
use UTF8 encoding, in which case this behavior is not unheard of
(still pretty broken, IMHO, but I've seen plenty of locale definitions
that just fail on data outside their supported character set).

If you did correctly specify a UTF8-using locale, you probably ought to
report this behavior to your Linux supplier as a bug in that locale
definition.  It doesn't have to sort or case-fold random UTF8 data very
nicely, but it certainly shouldn't report distinct strings as equal.

            regards, tom lane

Re: Japanese words not distinguished

От
Harry Mantheakis
Дата:
> Hmm, is that actually the correct spelling of the locale?  On my Linux
> box, locale -a says it's "en_GB.utf8".  I'm not sure how well initdb can
> verify the validity of a locale parameter, especially back in the 7.4
> branch.  It could be that you are actually using a locale that doesn't
> use UTF8 encoding, in which case this behavior is not unheard of
> (still pretty broken, IMHO, but I've seen plenty of locale definitions
> that just fail on data outside their supported character set).

Calling "locale -a" on my Linux server also lists "en_GB.utf8".

It also lists "en_US.utf8" and yet all the related environment variables
(LC_COLLATE, etc.) indicate their locale settings is "en_US.UTF-8".

I do not know what to make of that.

> If you did correctly specify a UTF8-using locale, you probably ought to
> report this behavior to your Linux supplier as a bug in that locale
> definition.  It doesn't have to sort or case-fold random UTF8 data very
> nicely, but it certainly shouldn't report distinct strings as equal.

I'll look into that - I'm running Fedora Core 3.

Meanwhile, am I correct in assuming that re-initialising my database cluster
with "--locale=C" will solve the problem?

What is more, am I correct in assuming that I can then restore my data with
pg_restore, as prescribed in the documentation?

Kind regards

Harry Mantheakis
London, UK


Re: Japanese words not distinguished

От
Tom Lane
Дата:
Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes:
> Meanwhile, am I correct in assuming that re-initialising my database cluster
> with "--locale=C" will solve the problem?

AFAIK it should --- of course you won't get any very intelligent sorting
or case folding, but at least it can tell the difference between
different characters ;-).  Be sure to still use encoding = unicode.

            regards, tom lane

Re: Japanese words not distinguished

От
Harry Mantheakis
Дата:
>> Meanwhile, am I correct in assuming that re-initialising my database cluster
>> with "--locale=C" will solve the problem?
>
> AFAIK it should --- of course you won't get any very intelligent sorting
> or case folding, but at least it can tell the difference between
> different characters ;-).  Be sure to still use encoding = unicode.

Okay, thanks for confirming that (!)

But... will setting the C locale affect how PostgreSQL sorts Latin-1 words?

If it does, I'm in trouble! I need to be able to sort (using ORDER BY
queries) English and other European names/words. (I've tried searching for
this topic, but failed to find an answer.)

Would specifying a locale for LC_COLLATE take care of this? Perhaps
something like this:

initdb --locale=C --lc-collate=en_GB.UTF-8 --encoding UNICODE

I'm sorry to drag this out. Perhaps I cannot have it both ways.

Kind regards

Harry Mantheakis
London, UK


Re: Japanese words not distinguished

От
Tatsuo Ishii
Дата:
> Hello
>
> I run PostgreSQL 7.4.6 on Linux with a JDBC client.
>
> I initialised my database cluster with the following initdb command:
>
> initdb --locale=en_GB.UTF-8 --encoding UNICODE
>
> I have now discovered that my database cannot distinguish Japanese names or
> words - it throws unique constraint errors on a composite primary key that
> includes a VARCHAR field which stores the names or words.
>
> My tests indicate that the database treats all Japanese names/words as
> equal.

That's a famous problem among Japaneses PostgreSQL users since the
locale support was born.

> Having searched the forum archives, it seems to me that I should have
> specified "--locale=C" as the locale setting when I initialised my database
> cluster.
>
> I am planning to re-initialise my database cluster using the following
> command:
>
> initdb --locale=C --encoding UNICODE
>
> Then, after defining the relevant groups and users, I intend to call
> pg_restore with reference to a "tar.gz" dump file of my data.
>
> I wonder if someone might be kind enough to confirm that this is the right
> approach to solving the problem.

Correct. The lesson is, never use locale support for Asian languages
and multibyte encodings including UTF-8.
--
Tatsuo Ishii

Re: Japanese words not distinguished

От
Harry Mantheakis
Дата:
> Correct. The lesson is, never use locale support for Asian languages
> and multibyte encodings including UTF-8.

Thank you for your reply - much appreciated.

I'm now concerned if and how this will affect ORDER BY query results (and
other functions) with respect to Latin-1 names and words.

I think I'll have to suck it and see, and then post my results - but that
won't be until after this next week-end.

Kind regards

Harry Mantheakis
London, UK


Re: Japanese words not distinguished

От
Stuart Bishop
Дата:
Harry Mantheakis wrote:
>>Correct. The lesson is, never use locale support for Asian languages
>>and multibyte encodings including UTF-8.
>
>
> Thank you for your reply - much appreciated.
>
> I'm now concerned if and how this will affect ORDER BY query results (and
> other functions) with respect to Latin-1 names and words.
>
> I think I'll have to suck it and see, and then post my results - but that
> won't be until after this next week-end.

C locale and en_* locales give different ordering (at least under Linux).
The en_* ordering is case insensitive, and the C locale ordering is case
sensitive because it is simply comparing the ASCII codes.

--
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/

Вложения

Re: Japanese words not distinguished

От
Tatsuo Ishii
Дата:
> Harry Mantheakis wrote:
> >>Correct. The lesson is, never use locale support for Asian languages
> >>and multibyte encodings including UTF-8.
> >
> >
> > Thank you for your reply - much appreciated.
> >
> > I'm now concerned if and how this will affect ORDER BY query results (and
> > other functions) with respect to Latin-1 names and words.
> >
> > I think I'll have to suck it and see, and then post my results - but that
> > won't be until after this next week-end.
>
> C locale and en_* locales give different ordering (at least under Linux).
> The en_* ordering is case insensitive, and the C locale ordering is case
> sensitive because it is simply comparing the ASCII codes.

You could use lower/upper to get case insensitive ordering with C locale.
--
Tatsuo Ishii

Re: Japanese words not distinguished

От
Harry Mantheakis
Дата:
>> C locale and en_* locales give different ordering (at least under Linux).
>> The en_* ordering is case insensitive, and the C locale ordering is case
>> sensitive because it is simply comparing the ASCII codes.
>
> You could use lower/upper to get case insensitive ordering with C locale.

Okay, that's great, so I'm going to re-initialise my database cluster with
the C locale this week-end.

I'll report the results, for the record.

Thanks to everyone who replied!

Kind regards

Harry Mantheakis
London, UK


Re: Japanese words not distinguished - Solution

От
Harry Mantheakis
Дата:
> I'll report the results, for the record.

Okay, for the record, all went well. I re-initialise my PostgreSQL 7.4
database cluster using the following command:

initdb --locale=C --encoding UNICODE

Then, after defining the relevant groups and users, I used pg_restore to
restore my data from a dump file.

And now everything works as expected with both European (Latin-1) and
Japanese names and words.

Sorting with ORDER BY on European names and words is adequate for my
requirements, even though the C locale ordering is case sensitive.

As Tatsuo Ishii pointed out, I can use lower/upper to get case insensitive
ordering if I ever need it.

The bottom line seems to be: if you're planning to store data with languages
from across the world, initialise your database cluster with the C locale
and Unicode encoding.

Thanks again to everyone who replied!

Kind regards

Harry Mantheakis
London, UK