Re: a strange order by behavior

Поиск
Список
Период
Сортировка
От Samuel Gendler
Тема Re: a strange order by behavior
Дата
Msg-id BANLkTi=YhCJ_0y83QS3e3buUNXL+7SwU6A@mail.gmail.com
обсуждение исходный текст
Ответ на Re: a strange order by behavior  (Samuel Gendler <sgendler@ideasculptor.com>)
Список pgsql-sql


On Wed, Jun 22, 2011 at 2:39 AM, Samuel Gendler <sgendler@ideasculptor.com> wrote:

I was able to create the db with --lc_collate=C and get case-sensitive sorting that treats spaces 'correctly,' but I have no idea how reliable that is with multibyte characters and it almost certainly doesn't handle accented characters correctly in languages that have a mix of ascii and non-ascii characters, since the non-ascii chars will all sort as greater than the ascii chars, which is probably not how things are alphabetized in those languages.


'locale -a' at a command prompt will list all locale's available on a host. I guess there is no universal utf-8 collation, which isn't that surprising, given how utf-8 works, though I had hoped that there might be one that at least attempts to handle all languages that share character codes in some kind of sensible manner.  Perhaps the details of UTF-8 encoding make that effectively impossible.  But it looks as though if you want to handle accented characters, you would need to set collation to something like fr_FR.UTF-8, though it looks like all of the UTF-8 collations ignore spaces and are case-insensitive.  I don't know of a way to determine what the sorting behaviour of an arbitrary collation might be other than experimentation.  You can set LC_ALL to the locale in question and pipe a test file through the sort command to do a quick test:

LC_ALL=en_US.UTF-8 sort < test.txt

I checked variants like .UTF-8, .iso88591, .iso885915@euro and they all ignore spaces in all of the languages I checked.  It appears that the only way to get a space-sensitive sort order is to use C collation.  Same goes for capitalization, I think.  Which is pretty ugly, if you ask me.
It looks like it is theoretically possible to modify a locale's collation behaviour.  On my linux system, all locale's with the space-ignoring, capital-ignoring collation have this entry in the locale definitions in /usr/share/i18n/locales:
LC_COLLATE 
 % Copy the template from ISO/IEC 14651 
copy "iso14651_t1" 
 END LC_COLLATE
Looking at the iso14651_t1 file, I see this:
# Any character not precisely specified will be considered as a special
# character and considered only at the last level.
# <U0000>......<U7FFFFFFF> IGNORE;IGNORE;IGNORE;<U0000>......<U7FFFFFFF>
#
# SYMB.                                N° GLY
#
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
<U005F> IGNORE;IGNORE;IGNORE;<U005F> # 33 _
<U0332> IGNORE;IGNORE;IGNORE;<U0332> # 34 <"_>
<U00AF> IGNORE;IGNORE;IGNORE;<U00AF> # 35 - (MACRON)
<U00AD> IGNORE;IGNORE;IGNORE;<U00AD> # 36 <SHY>
<U002D> IGNORE;IGNORE;IGNORE;<U002D> # 37 -
<U002C> IGNORE;IGNORE;IGNORE;<U002C> # 38 ,
<U003B> IGNORE;IGNORE;IGNORE;<U003B> # 39 ;
<U003A> IGNORE;IGNORE;IGNORE;<U003A> # 40 :
 

There doesn't seem to be any obvious discussion of how to structure the LC_COLLATE section of a locale definition, but it is clear that telling it to ignore spaces and punctuation is the problem we are seeing.  That seems like a very strange choice to make as the default for all UTF-8 locales.  I can't find the locale definitions for OS X, which might reveal a more sensible LC_COLLATE or may just be defaulting back to C collation for everything.



В списке pgsql-sql по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: a strange order by behavior
Следующее
От: Samuel Gendler
Дата:
Сообщение: Re: a strange order by behavior