Обсуждение: BUG #15651: Collation setting en_US.utf8 breaking sort order
The following bug has been logged on the website:
Bug reference: 15651
Logged by: Kaleb Akalework
Email address: kaleb.akalework@asg.com
PostgreSQL version: 11.0
Operating system: Linux
Description:
I have PostgresSQL database on Windows. I created database with Collation of
en_US.utf8. Then I created table (The steps to reproduce are below). I
inserted a few rows into this table one of which was row with special
characters "~!@#$^&(". The insert worked fine but then when I do a select on
the column for values >=' ' (Space), I get back all the rows except for the
row that contains
"~!@#$^&(" . In the UTF8 table
https://www.utf8-chartable.de/unicode-utf8-table.pl, I can see that space is
the lowest printable character so technically every printable character
showed be greater than space but it isn't in this case. I create another
database and set collation to 'C'. then the same select query returns
"~!@#$^&(". The problem is only apparent in a Linux environment where
postgreSQL is running and the database has a collation of en_US.utf8. Can
someone help please?
create table test (
name_c varchar(14)
)
insert into test (name_c) values ('AAA')
insert into test (name_c) values ('BAA')
insert into test (name_c) values ('CAA')
insert into test (name_c) values ('DAA')
insert into test (name_c) values ('~!@#$^&(')
select * from test where name_c >= ' '
PG Bug reporting form <noreply@postgresql.org> writes:
> I have PostgresSQL database on Windows. I created database with Collation of
> en_US.utf8.
Really? AFAIK, Windows doesn't support collation names that look like
that.
> Then I created table (The steps to reproduce are below). I
> inserted a few rows into this table one of which was row with special
> characters "~!@#$^&(". The insert worked fine but then when I do a select on
> the column for values >=' ' (Space), I get back all the rows except for the
> row that contains
> "~!@#$^&(" .
This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:
$ LANG=C sort stuff.txt
AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
DAA
(The first line in my test file contains one space.)
regards, tom lane
--> Really? AFAIK, Windows doesn't support collation names that look like that.
I meant to say Linux. Not Windows.
-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Friday, February 22, 2019 1:03 PM
To: Kaleb Akalework <kaleb.akalework@asg.com>
Cc: pgsql-bugs@lists.postgresql.org
Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order
*** External email: Verify sender before opening attachments or links ***
PG Bug reporting form <noreply@postgresql.org> writes:
> I have PostgresSQL database on Windows. I created database with
> Collation of en_US.utf8.
Really? AFAIK, Windows doesn't support collation names that look like that.
> Then I created table (The steps to reproduce are below). I inserted a
> few rows into this table one of which was row with special characters
> "~!@#$^&(". The insert worked fine but then when I do a select on the
> column for values >=' ' (Space), I get back all the rows except for
> the row that contains "~!@#$^&(" .
This appears to be the intended behavior of en_US sorting.
On a Linux machine I can reproduce it outside Postgres:
$ LANG=C sort stuff.txt
AAA
BAA
CAA
DAA
~!@#$^&(
$ LANG=en_US sort stuff.txt
~!@#$^&(
AAA
BAA
CAA
DAA
(The first line in my test file contains one space.)
regards, tom lane
On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This appears to be the intended behavior of en_US sorting. Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memory serves. -- Peter Geoghegan
Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation settingof C, does it mean I won't be able to support multiple languages? -----Original Message----- From: Peter Geoghegan <pg@bowt.ie> Sent: Friday, February 22, 2019 1:41 PM To: Tom Lane <tgl@sss.pgh.pa.us> Cc: Kaleb Akalework <kaleb.akalework@asg.com>; PostgreSQL mailing lists <pgsql-bugs@lists.postgresql.org> Subject: Re: BUG #15651: Collation setting en_US.utf8 breaking sort order *** External email: Verify sender before opening attachments or links *** On Fri, Feb 22, 2019 at 10:03 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > This appears to be the intended behavior of en_US sorting. Right. UCA style algorithms tend to give the least weight of all to whitespace characters. They're tertiary weight, if memoryserves. -- Peter Geoghegan
Kaleb Akalework <kaleb.akalework@asg.com> writes:
> Ok so if this is intended behavior of UTF8 then I understand. My last question then would be if I use a collation
settingof C, does it mean I won't be able to support multiple languages?
Collation only determines sort order, I believe.
regards, tom lane
On 2019-02-22 19:03, Tom Lane wrote: > $ LANG=en_US sort stuff.txt > ~!@#$^&( > > AAA > BAA > CAA > DAA With ICU (COLLATE "und-x-icu"), I get the line with the space first. I took a bit of a look around the various Unicode documents and I don't find anything that would defend the glibc behavior. <obscure detail> However, since some of those special characters are variable collating elements and some are not, there might well be an explanation. </obscure detail> So, maybe try ICU. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Eisentraut wrote: > With ICU (COLLATE "und-x-icu"), I get the line with the space first. I > took a bit of a look around the various Unicode documents and I don't > find anything that would defend the glibc behavior. The glibc result is now version-dependent. With glibc-2.28, on Debian buster: buster$ LC_COLLATE=en_US.utf8 sort stuff.txt ~!@#$^&( AAA BAA CAA buster$ apt-cache show libc-bin | grep Version Version: 2.28-7 As opposed to the current Debian stable, with glibc-2.24: stretch$ LC_COLLATE=en_US.utf8 sort stuff.txt ~!@#$^&( AAA BAA CAA DAA stretch$ apt-cache show libc-bin|grep Version Version: 2.24-11+deb9u3 Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
Kaleb Akalework wrote:
> Ok so if this is intended behavior of UTF8 then I understand. My last
> question then would be if I use a collation setting of C, does it mean I
> won't be able to support multiple languages?
You seem to want to the sort order of C, but be aware that you might
have to decide whether you want this:
=> select upper('é' collate "C");
upper
-------
é
(1 row)
or that:
=> select upper('é' collate "en_US");
upper
-------
É
(1 row)
To get the sort order of C but the interpretation of characters closer
to what you'd expect from Unicode, it's possible for the database
to have LC_COLLATE to "C", and LC_CTYPE to, say, en_US.UTF-8.
See CREATE DATABASE.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite