Re: a strange order by behavior

Поиск
Список
Период
Сортировка
От Samuel Gendler
Тема Re: a strange order by behavior
Дата
Msg-id BANLkTimVfKet6Aixyz5g6xr6oZ1zJhSs5g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: a strange order by behavior  (Samuel Gendler <sgendler@ideasculptor.com>)
Ответы Re: a strange order by behavior  (Pavel Stehule <pavel.stehule@gmail.com>)
Список pgsql-sql


On Wed, Jun 22, 2011 at 3:39 AM, Samuel Gendler <sgendler@ideasculptor.com> wrote:


On Wed, Jun 22, 2011 at 3:01 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hello Peter


> Pavel suggested using a collation of ucs_basic, but I get an error when I
> try that on linux:
> $ createdb -U u1 --lc-collate=ucs_basic -E UTF-8 test
> createdb: database creation failed: ERROR:  invalid locale name ucs_basic

isn't this a bug in collations?

The more I read about this, the more this would appear to be the case. It looks like the SQL standard has some baseline collations that are required and it isn't at all clear how one would access those in postgres if the host in question doesn't have those locale's defined on the host. UCS_BASIC is a SQL collation, but doesn't appear to have an explicit definition on a 'standard' linux host (CentOS 5, in my case).  There is another SQL collation called 'UNICODE' which is supposed to obey the Unicode Collation Algorithm with the Default Unicode Collation Element Table defined in Unicode10.  It looks like that collation is relatively sensitive to language-specific sort orders, though it isn't a required collation in the sql standard.  I suspect that it is the UNICODE collation which actually would appear to be the most 'sensible' within the context of this discussion - characters in expected order, spaces honoured, case sensitive. I have so little experience with localization that I'm not sure if I'm reading this all correctly, though.

 

My final bit of research before going to bed - 

If you are stuck with UTF-8 collation, you can do C-style collation in order by clauses like this:

select * from t1 order by f1 using ~<~
select * from t1 order by f1 using ~>~

If you want it to be able to use an index for that (or you want to use LIKE in your where clause with that column), then you need to have an index which uses C-style collation.  You can create an index on the column using text_pattern_ops or varchar_pattern_ops and then LIKE, <, and > comparisons will use that index and do comparisons like C collation, as will order by clauses with ~>~ and ~<~

create index t1_f1_pattern ON t1 (f1 varchar_pattern_ops);
create index t1_f1 ON t1 (f1);

If you have both of those indexes, postgresql will, apparently, do the correct thing based on which operations you ask of it in your query.  The pattern_ops thing is mentioned in the docs (but on a page about indexes rather than on a page about varchar and text types) and I found the ~>~ operator in an ORDER BY clause in an old mailing list email.  It may be in the docs, I'm not sure.  Is there a documentation page or wiki page consisting of 'stuff that may surprise you in postgresql?'  Such a page would be a handy place to point this kind of thing out to new users.  Most users, even relatively experienced ones, probably haven't read the documentation cover to cover.  I certainly hadn't.

9.1 allows collation to be specified on a per-column basis, but since all of the utf-8 collations ignore punctuation and space characters, you'll still have to do C-style collation to get alphabetical order that honors whitespace, punctuation, and case, but then non-ascii characters will be in the wrong order. So you can either have language sensitive sorting or punctuation- and case- sensitive sorting, but never both - at least not without defining a custom locale. Punting this defect to the OS and saying it has to be fixed there seems like a somewhat weak answer - at least without offering up a source for alternative locale definitions that do sorting according to UNICODE collation.  

I'm actually surprised that european users aren't complaining about this all the time, but maybe european users are used to seeing things ordered in a manner which doesn't honour the 'correct' ordering of accented characters.  Actually, I wonder if the probable explanation of the lack of complaint is the fact that the kinds of fields most apps would tend to do alphabetical sorts on probably don't tend to have lots of punctuation other than spaces, so perhaps the language sensitive sorts are deemed sufficient because most people don't notice the funky behaviour with punctuation and whitespace while case-insensitive sort is probably desired most of the time.

В списке pgsql-sql по дате отправления:

Предыдущее
От: Samuel Gendler
Дата:
Сообщение: Re: a strange order by behavior
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: a strange order by behavior