Обсуждение: help understanding collation order
hi, postgresl-8.4.6 (database encoding is utf8) the following sql behaves strangely on one of my servers: create table ttt(id serial not null primary key, name text); insert into ttt (name) values ('CLARKE, DAVID'); insert into ttt (name) values ('CLARK, PETER'); insert into ttt (name) values ('CLARKE'); insert into ttt (name) values ('CLARK'); select * from ttt order by name; drop table ttt; the behaviour i expect (and see on macosx-10.6.6) is: id | name ----+--------------- 4 | CLARK 2 | CLARK, PETER 3 | CLARKE 1 | CLARKE, DAVID the behaviour i don't expect but see anyway (on debian-5.0) is: id | name ----+--------------- 4 | CLARK 3 | CLARKE 1 | CLARKE, DAVID 2 | CLARK, PETER the "good" server has lc_messages='en_AU' and the "bad" server has lc_messages="en_AU.utf8" which may be relevant but i can't successfully alter these values (i.e. postgres doesn't restart after i try) so i'm hoping it's not relevant. can anyone explain what i've done wrong here and suggest what i can do to fix it? cheers, raf p.s. the postgres installation was PostgresPlus/8.4SS that was subsequently upgraded via postgres one click installers.
raf <raf@raf.org> writes: > the behaviour i expect (and see on macosx-10.6.6) is: > id | name > ----+--------------- > 4 | CLARK > 2 | CLARK, PETER > 3 | CLARKE > 1 | CLARKE, DAVID > the behaviour i don't expect but see anyway (on debian-5.0) is: > id | name > ----+--------------- > 4 | CLARK > 3 | CLARKE > 1 | CLARKE, DAVID > 2 | CLARK, PETER > the "good" server has lc_messages='en_AU' and the > "bad" server has lc_messages="en_AU.utf8" which may > be relevant No, not particularly. Sort order is determined by lc_collate not lc_messages. Unfortunately it's entirely possible that OSX will give you a different sort order than Linux even for similarly named lc_collate settings. About the only lc_collate setting that really behaves the same everywhere, guaranteed, is "C" ... and that might or might not do what you want. (C locale does satisfy the above example but it's hard to be sure what you want in general; and if you are using any non-ASCII characters, C locale will more than likely not be very satisfactory.) regards, tom lane
On Mon, Jan 17, 2011 at 02:19:14PM -0500, Tom Lane wrote: > No, not particularly. Sort order is determined by lc_collate > not lc_messages. Unfortunately it's entirely possible that OSX > will give you a different sort order than Linux even for similarly > named lc_collate settings. About the only lc_collate setting that > really behaves the same everywhere, guaranteed, is "C" This is partly because of the generous room allowed for linguistic tailoring in the Unicode standard on collation. In case one really wants to have a bad day, I can suggest reading http://unicode.org/reports/tr10/ carefully. But they make an important point there, which is that collation rules work by language, not by script. As I understand things, in Postgres's case it's partly a matter of how strongly your OS cleaves to the locale conventions that determines how this will work. (Note that not every database system relies on the underlying OS's facilities the way Postgres does; some have an independent collation mechanism.) Unicode does maintain a locale data repository: http://cldr.unicode.org/. You might be able to figure out which of your systems is not playing nice and complain to the OS vendor. A -- Andrew Sullivan ajs@crankycanuck.ca
Tom Lane wrote: > raf <raf@raf.org> writes: > > the behaviour i expect (and see on macosx-10.6.6) is: > > > id | name > > ----+--------------- > > 4 | CLARK > > 2 | CLARK, PETER > > 3 | CLARKE > > 1 | CLARKE, DAVID > > > the behaviour i don't expect but see anyway (on debian-5.0) is: > > > id | name > > ----+--------------- > > 4 | CLARK > > 3 | CLARKE > > 1 | CLARKE, DAVID > > 2 | CLARK, PETER > > > the "good" server has lc_messages='en_AU' and the > > "bad" server has lc_messages="en_AU.utf8" which may > > be relevant > > No, not particularly. Sort order is determined by lc_collate > not lc_messages. Unfortunately it's entirely possible that OSX > will give you a different sort order than Linux even for similarly > named lc_collate settings. About the only lc_collate setting that > really behaves the same everywhere, guaranteed, is "C" ... and that > might or might not do what you want. (C locale does satisfy the > above example but it's hard to be sure what you want in general; > and if you are using any non-ASCII characters, C locale will more > than likely not be very satisfactory.) > > regards, tom lane thanks. "C" will have to do, i suppose. that and/or re-sort in the client. cheers, raf p.s. if anyone in debian locale land is listening, 'E' does not sort before ','. what were you thinking? :-)
On tis, 2011-01-18 at 10:33 +1100, raf wrote: > p.s. if anyone in debian locale land is listening, > 'E' does not sort before ','. what were you thinking? :-) What is actually happening is that the punctuation is sorted in a second pass after the letters. Which is both correct according to the relevant standards and also practical in many situations. It's usually actually the Mac OS X locales that are broken.
Peter Eisentraut wrote: > On tis, 2011-01-18 at 10:33 +1100, raf wrote: > > p.s. if anyone in debian locale land is listening, > > 'E' does not sort before ','. what were you thinking? :-) > > What is actually happening is that the punctuation is sorted in a second > pass after the letters. Which is both correct according to the relevant > standards and also practical in many situations. i have no doubt that what you say is true just as i have no doubt that it is also incorrect and impractical in every situation i'll ever encounter. i can't imagine ever wanting CLARK to sort both before and after CLARKE. i'll just re-sort all name-ordered reports in the client. pity. > It's usually actually the Mac OS X locales that are broken. only when ignoring the principle of least astonishment. :-) but it's not the same locale on the two hosts. the macosx locale is en_AU. the debian locale is en_AU.utf8 so i don't think they can be compared for brokenness purposes in this case. they may both be correct. cheers, raf
raf wrote: > Peter Eisentraut wrote: > > > On tis, 2011-01-18 at 10:33 +1100, raf wrote: > > > p.s. if anyone in debian locale land is listening, > > > 'E' does not sort before ','. what were you thinking? :-) > > > > What is actually happening is that the punctuation is sorted in a second > > pass after the letters. Which is both correct according to the relevant > > standards and also practical in many situations. > > i have no doubt that what you say is true just as i have no doubt that > it is also incorrect and impractical in every situation i'll ever encounter. > i can't imagine ever wanting CLARK to sort both before and after CLARKE. > i'll just re-sort all name-ordered reports in the client. pity. better idea: i'll separate the family name from the personal names using regexp_replace() and use them separately in the order by clause so the sorting can remain in postgres where it belongs without any commas getting in the way. cheers, raf