Unicode FFFF Special Codepoint should always collate high.

Поиск

Список

Период

Сортировка

От	Telford Tendys
Тема	Unicode FFFF Special Codepoint should always collate high.
Дата	22 июня 2021 г. 08:39:18
Msg-id	20210622083918.GA12063@mail обсуждение исходный текст
Ответы	Re: Unicode FFFF Special Codepoint should always collate high.
Список	pgsql-bugs

Дерево обсуждения

Here is an example.

    SELECT ENCODE(x::bytea, 'hex')
    FROM (
        SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF',
                             U&'x\+01FFFF', 'x '])::text AS x
    ) z
    ORDER BY x COLLATE "en_US.utf8";


Which gives the following:

   encode
------------
 78
 78efbfbf
 78f09fbfbf
 78f0afbfbf
 7820
 782f
(6 rows)


The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting
after the non-character codepoint 0xffff, which is supposed to always have
the highest possible primary weight in all locales, and it is the only
codepoint available to serve this purpose. The other 4-byte non-character
codepoints also incorrectly sort lower than real characters.

Tested on Centos 8 with the following package.

    postgresql12-server-12.7-2PGDG.rhel8.x86_64


The "locale -a" command shows that "en_US.utf8" is in the system,
although, for this purpose, all UTF-8 locales should do the same thing.

Database default locale is also en_US.utf8 and system /etc/locale.conf has
the same LANG="en_US.UTF-8".

Testing the same thing with a simple file and using the "sort" command
from command line prompt gives similar results.


$ od -tx1c /tmp/test.utf8
0000000  78  0a  78  f0  af  bf  bf  0a  78  f0  9f  bf  bf  0a  78  2f
          x  \n   x 360 257 277 277  \n   x 360 237 277 277  \n   x   /
0000020  0a  78  ef  bf  bf  0a  78  20  0a
         \n   x 357 277 277  \n   x      \n
0000031


$ LC_ALL=C sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  20  0a  78  2f  0a  78  ef  bf  bf  0a  78  f0  9f
          x  \n   x      \n   x   /  \n   x 357 277 277  \n   x 360 237
0000020  bf  bf  0a  78  f0  af  bf  bf  0a
        277 277  \n   x 360 257 277 277  \n
0000031


$ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c
0000000  78  0a  78  ef  bf  bf  0a  78  f0  af  bf  bf  0a  78  f0  9f
          x  \n   x 357 277 277  \n   x 360 257 277 277  \n   x 360 237
0000020  bf  bf  0a  78  20  0a  78  2f  0a
        277 277  \n   x      \n   x   /  \n
0000031


You can see that it once again sorts the non-character codepoint lower
than real characters. For reference, these are the Centos 8 glibc packages.

    glibc-2.28-151.el8.x86_64
    glibc-langpack-en-2.28-151.el8.x86_64


Tested the same thing on sqlfiddle.

    http://sqlfiddle.com/#!15/1841e2/2/0


That is using PostgreSQL 9.3 and the system underneath is not made visible.
It gives the sort order that I would expect, special codepoints are sorting
on the high side of real characters.

I then went back to an old Centos 7 machine and tested the same SQL query.
The result is like this:

   encode
------------
 78
 7820
 782f
 78efbfbf
 78f09fbfbf
 78f0afbfbf
(6 rows)

That's what I expect the correct answer to be, also a correct match to the
sqlfiddle test. For reference these are the centos packages where I got it
working properly.

    postgresql12-server-12.7-1PGDG.rhel7.x86_64

    glibc-common-2.17-260.el7_6.6.x86_64
    glibc-2.17-260.el7_6.6.x86_64

Same postgres version, but running on an older system. Perhaps this
would implicate a bug introduced in the system itself during the
transition from Centos 7 to Centos 8. The old glibc did not have separate
langpacks and all locales were installed by default.

I'm willing to take this up with Redhat if that's useful.

В списке pgsql-bugs по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Unicode FFFF Special Codepoint should always collate high.