Обсуждение: Japanese words not distinguished
Hello I run PostgreSQL 7.4.6 on Linux with a JDBC client. I initialised my database cluster with the following initdb command: initdb --locale=en_GB.UTF-8 --encoding UNICODE I have now discovered that my database cannot distinguish Japanese names or words - it throws unique constraint errors on a composite primary key that includes a VARCHAR field which stores the names or words. My tests indicate that the database treats all Japanese names/words as equal. Having searched the forum archives, it seems to me that I should have specified "--locale=C" as the locale setting when I initialised my database cluster. I am planning to re-initialise my database cluster using the following command: initdb --locale=C --encoding UNICODE Then, after defining the relevant groups and users, I intend to call pg_restore with reference to a "tar.gz" dump file of my data. I wonder if someone might be kind enough to confirm that this is the right approach to solving the problem. Many thanks! Harry Mantheakis London, UK
Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes: > I run PostgreSQL 7.4.6 on Linux with a JDBC client. > I initialised my database cluster with the following initdb command: > initdb --locale=en_GB.UTF-8 --encoding UNICODE > I have now discovered that my database cannot distinguish Japanese names or > words - it throws unique constraint errors on a composite primary key that > includes a VARCHAR field which stores the names or words. > My tests indicate that the database treats all Japanese names/words as > equal. Hmm, is that actually the correct spelling of the locale? On my Linux box, locale -a says it's "en_GB.utf8". I'm not sure how well initdb can verify the validity of a locale parameter, especially back in the 7.4 branch. It could be that you are actually using a locale that doesn't use UTF8 encoding, in which case this behavior is not unheard of (still pretty broken, IMHO, but I've seen plenty of locale definitions that just fail on data outside their supported character set). If you did correctly specify a UTF8-using locale, you probably ought to report this behavior to your Linux supplier as a bug in that locale definition. It doesn't have to sort or case-fold random UTF8 data very nicely, but it certainly shouldn't report distinct strings as equal. regards, tom lane
> Hmm, is that actually the correct spelling of the locale? On my Linux > box, locale -a says it's "en_GB.utf8". I'm not sure how well initdb can > verify the validity of a locale parameter, especially back in the 7.4 > branch. It could be that you are actually using a locale that doesn't > use UTF8 encoding, in which case this behavior is not unheard of > (still pretty broken, IMHO, but I've seen plenty of locale definitions > that just fail on data outside their supported character set). Calling "locale -a" on my Linux server also lists "en_GB.utf8". It also lists "en_US.utf8" and yet all the related environment variables (LC_COLLATE, etc.) indicate their locale settings is "en_US.UTF-8". I do not know what to make of that. > If you did correctly specify a UTF8-using locale, you probably ought to > report this behavior to your Linux supplier as a bug in that locale > definition. It doesn't have to sort or case-fold random UTF8 data very > nicely, but it certainly shouldn't report distinct strings as equal. I'll look into that - I'm running Fedora Core 3. Meanwhile, am I correct in assuming that re-initialising my database cluster with "--locale=C" will solve the problem? What is more, am I correct in assuming that I can then restore my data with pg_restore, as prescribed in the documentation? Kind regards Harry Mantheakis London, UK
Harry Mantheakis <harry@mantheakis.freeserve.co.uk> writes: > Meanwhile, am I correct in assuming that re-initialising my database cluster > with "--locale=C" will solve the problem? AFAIK it should --- of course you won't get any very intelligent sorting or case folding, but at least it can tell the difference between different characters ;-). Be sure to still use encoding = unicode. regards, tom lane
>> Meanwhile, am I correct in assuming that re-initialising my database cluster >> with "--locale=C" will solve the problem? > > AFAIK it should --- of course you won't get any very intelligent sorting > or case folding, but at least it can tell the difference between > different characters ;-). Be sure to still use encoding = unicode. Okay, thanks for confirming that (!) But... will setting the C locale affect how PostgreSQL sorts Latin-1 words? If it does, I'm in trouble! I need to be able to sort (using ORDER BY queries) English and other European names/words. (I've tried searching for this topic, but failed to find an answer.) Would specifying a locale for LC_COLLATE take care of this? Perhaps something like this: initdb --locale=C --lc-collate=en_GB.UTF-8 --encoding UNICODE I'm sorry to drag this out. Perhaps I cannot have it both ways. Kind regards Harry Mantheakis London, UK
> Hello > > I run PostgreSQL 7.4.6 on Linux with a JDBC client. > > I initialised my database cluster with the following initdb command: > > initdb --locale=en_GB.UTF-8 --encoding UNICODE > > I have now discovered that my database cannot distinguish Japanese names or > words - it throws unique constraint errors on a composite primary key that > includes a VARCHAR field which stores the names or words. > > My tests indicate that the database treats all Japanese names/words as > equal. That's a famous problem among Japaneses PostgreSQL users since the locale support was born. > Having searched the forum archives, it seems to me that I should have > specified "--locale=C" as the locale setting when I initialised my database > cluster. > > I am planning to re-initialise my database cluster using the following > command: > > initdb --locale=C --encoding UNICODE > > Then, after defining the relevant groups and users, I intend to call > pg_restore with reference to a "tar.gz" dump file of my data. > > I wonder if someone might be kind enough to confirm that this is the right > approach to solving the problem. Correct. The lesson is, never use locale support for Asian languages and multibyte encodings including UTF-8. -- Tatsuo Ishii
> Correct. The lesson is, never use locale support for Asian languages > and multibyte encodings including UTF-8. Thank you for your reply - much appreciated. I'm now concerned if and how this will affect ORDER BY query results (and other functions) with respect to Latin-1 names and words. I think I'll have to suck it and see, and then post my results - but that won't be until after this next week-end. Kind regards Harry Mantheakis London, UK
Harry Mantheakis wrote: >>Correct. The lesson is, never use locale support for Asian languages >>and multibyte encodings including UTF-8. > > > Thank you for your reply - much appreciated. > > I'm now concerned if and how this will affect ORDER BY query results (and > other functions) with respect to Latin-1 names and words. > > I think I'll have to suck it and see, and then post my results - but that > won't be until after this next week-end. C locale and en_* locales give different ordering (at least under Linux). The en_* ordering is case insensitive, and the C locale ordering is case sensitive because it is simply comparing the ASCII codes. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Вложения
> Harry Mantheakis wrote: > >>Correct. The lesson is, never use locale support for Asian languages > >>and multibyte encodings including UTF-8. > > > > > > Thank you for your reply - much appreciated. > > > > I'm now concerned if and how this will affect ORDER BY query results (and > > other functions) with respect to Latin-1 names and words. > > > > I think I'll have to suck it and see, and then post my results - but that > > won't be until after this next week-end. > > C locale and en_* locales give different ordering (at least under Linux). > The en_* ordering is case insensitive, and the C locale ordering is case > sensitive because it is simply comparing the ASCII codes. You could use lower/upper to get case insensitive ordering with C locale. -- Tatsuo Ishii
>> C locale and en_* locales give different ordering (at least under Linux). >> The en_* ordering is case insensitive, and the C locale ordering is case >> sensitive because it is simply comparing the ASCII codes. > > You could use lower/upper to get case insensitive ordering with C locale. Okay, that's great, so I'm going to re-initialise my database cluster with the C locale this week-end. I'll report the results, for the record. Thanks to everyone who replied! Kind regards Harry Mantheakis London, UK
> I'll report the results, for the record. Okay, for the record, all went well. I re-initialise my PostgreSQL 7.4 database cluster using the following command: initdb --locale=C --encoding UNICODE Then, after defining the relevant groups and users, I used pg_restore to restore my data from a dump file. And now everything works as expected with both European (Latin-1) and Japanese names and words. Sorting with ORDER BY on European names and words is adequate for my requirements, even though the C locale ordering is case sensitive. As Tatsuo Ishii pointed out, I can use lower/upper to get case insensitive ordering if I ever need it. The bottom line seems to be: if you're planning to store data with languages from across the world, initialise your database cluster with the C locale and Unicode encoding. Thanks again to everyone who replied! Kind regards Harry Mantheakis London, UK