Re: a strange order by behavior

Поиск
Список
Период
Сортировка
От Samuel Gendler
Тема Re: a strange order by behavior
Дата
Msg-id BANLkTi=PXV+FrAJHHX56D41Xo4DzG7vADA@mail.gmail.com
обсуждение исходный текст
Ответ на a strange order by behavior  (Eyal Wilde <eyal@impactsoft.co.il>)
Ответы Re: a strange order by behavior  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: a strange order by behavior  (Peter Eisentraut <peter_e@gmx.net>)
Список pgsql-sql


On Tue, Jun 21, 2011 at 3:37 AM, Eyal Wilde <eyal@impactsoft.co.il> wrote:
the database collation is: en_US.UTF-8

drop table t1;
create table t1 (recid int ,f1 varchar(20));
insert into t1 values (1,'a');
insert into t1 values (2,' ');
insert into t1 values (3,'aa');
insert into t1 values (4,' a');
select * from t1 order by f1

result:

recid  f1
2      " "        
1      "a"        -- 2 comes before 1 because space is smaller then 'a'. fine.
4      " a"       -- now is see that 1 comes before 4 because space is greater then 'a' !?
3      "aa"       -- now again, 4 comes before 3 because space is smaller the 'a' !?!


I seem to recall a thread here about it ignoring spaces entirely in that collation (and maybe ignoring capitalization, too?).  I went to go test that assertion by initializing a database with C collation and got some complaint about it being incompatible with my template1 template database.  I initialized a db off of template0 and then got side tracked and you've only just reminded me of it.  I was planning to test whether it is safe to use UTF-8 for encoding but use C collation, and then maybe investigate other collations.

This worked:

createdb  -E UTF-8 --lc-collate=C some_db

so it should be easy enough to play around with it some.  I'm not sure how to get a list of valid collations for any given charset, and it seems like C collation would generate somewhat strange results with non-ascii characters (french accented characters are supposed to alphabetize in some unexpected manner, I believe), so there must be something better - closer to UTF-8 collation but without ignoring whitespace and such.  A quick google search reveals that there is some kind of standard for unicode collation (http://www.unicode.org/reports/tr10/ ) and I have no idea if that is what is represented by the en_US.UTF-8 collation or not.  I've got no real experience with this stuff.

It appears that there are differences regarding collation in recent versions - the beta docs for 9.1 show that you can set collation on individual operations or differently for different columns (http://www.postgresql.org/docs/9.1/static/collation.html ).  I see nothing like that in 8.4 docs.

It definitely looks like we both need to have a read of the localization chapter of the docs for our database version as there is a bunch of stuff in there that I was surprised to read when I just did a quick scan - like using anything but C or posix is much slower and can produce incorrect results in a 'like' query

It looks like the docs prior to 9.1beta have no real reference to collation at all, so it's down to trial and error unless someone in the know speaks up.

--sam


В списке pgsql-sql по дате отправления:

Предыдущее
От: Eyal Wilde
Дата:
Сообщение: a strange order by behavior
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: a strange order by behavior