Unicode FFFF Special Codepoint should always collate high.
От | Telford Tendys |
---|---|
Тема | Unicode FFFF Special Codepoint should always collate high. |
Дата | |
Msg-id | 20210622083918.GA12063@mail обсуждение исходный текст |
Ответы |
Re: Unicode FFFF Special Codepoint should always collate high.
|
Список | pgsql-bugs |
Here is an example. SELECT ENCODE(x::bytea, 'hex') FROM ( SELECT UNNEST(ARRAY['x', 'x/', U&'x\+02FFFF', U&'x\FFFF', U&'x\+01FFFF', 'x '])::text AS x ) z ORDER BY x COLLATE "en_US.utf8"; Which gives the following: encode ------------ 78 78efbfbf 78f09fbfbf 78f0afbfbf 7820 782f (6 rows) The real character codepoints (e.g. 0x20 space, or 0x2f slash) are sorting after the non-character codepoint 0xffff, which is supposed to always have the highest possible primary weight in all locales, and it is the only codepoint available to serve this purpose. The other 4-byte non-character codepoints also incorrectly sort lower than real characters. Tested on Centos 8 with the following package. postgresql12-server-12.7-2PGDG.rhel8.x86_64 The "locale -a" command shows that "en_US.utf8" is in the system, although, for this purpose, all UTF-8 locales should do the same thing. Database default locale is also en_US.utf8 and system /etc/locale.conf has the same LANG="en_US.UTF-8". Testing the same thing with a simple file and using the "sort" command from command line prompt gives similar results. $ od -tx1c /tmp/test.utf8 0000000 78 0a 78 f0 af bf bf 0a 78 f0 9f bf bf 0a 78 2f x \n x 360 257 277 277 \n x 360 237 277 277 \n x / 0000020 0a 78 ef bf bf 0a 78 20 0a \n x 357 277 277 \n x \n 0000031 $ LC_ALL=C sort /tmp/test.utf8 | od -tx1c 0000000 78 0a 78 20 0a 78 2f 0a 78 ef bf bf 0a 78 f0 9f x \n x \n x / \n x 357 277 277 \n x 360 237 0000020 bf bf 0a 78 f0 af bf bf 0a 277 277 \n x 360 257 277 277 \n 0000031 $ LC_ALL=en_US.UTF-8 sort /tmp/test.utf8 | od -tx1c 0000000 78 0a 78 ef bf bf 0a 78 f0 af bf bf 0a 78 f0 9f x \n x 357 277 277 \n x 360 257 277 277 \n x 360 237 0000020 bf bf 0a 78 20 0a 78 2f 0a 277 277 \n x \n x / \n 0000031 You can see that it once again sorts the non-character codepoint lower than real characters. For reference, these are the Centos 8 glibc packages. glibc-2.28-151.el8.x86_64 glibc-langpack-en-2.28-151.el8.x86_64 Tested the same thing on sqlfiddle. http://sqlfiddle.com/#!15/1841e2/2/0 That is using PostgreSQL 9.3 and the system underneath is not made visible. It gives the sort order that I would expect, special codepoints are sorting on the high side of real characters. I then went back to an old Centos 7 machine and tested the same SQL query. The result is like this: encode ------------ 78 7820 782f 78efbfbf 78f09fbfbf 78f0afbfbf (6 rows) That's what I expect the correct answer to be, also a correct match to the sqlfiddle test. For reference these are the centos packages where I got it working properly. postgresql12-server-12.7-1PGDG.rhel7.x86_64 glibc-common-2.17-260.el7_6.6.x86_64 glibc-2.17-260.el7_6.6.x86_64 Same postgres version, but running on an older system. Perhaps this would implicate a bug introduced in the system itself during the transition from Centos 7 to Centos 8. The old glibc did not have separate langpacks and all locales were installed by default. I'm willing to take this up with Redhat if that's useful.
В списке pgsql-bugs по дате отправления: