Обсуждение: [BUGS] BUG #14885: mistake in sorting win1251 chars
The following bug has been logged on the website: Bug reference: 14885 Logged by: Kalin Daskalov Email address: k.daskalov.911@gmail.com PostgreSQL version: 9.4.4 Operating system: windows 7 64 bit Description: I'm using ORDER BY <character varying field> containing win-1251 symbols. No matter whether the encoding is WIN1251 or UTF8. I try it on different versions of PostgreSQL - the result is the same. There is a mistake in sorted values for symbols "и" (и) and "й" (й) - capital letters too. Here is a small example: "АЙГЕР" "АИКО" "АЙКОН" -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On 11/02/2017 10:57 AM, k.daskalov.911@gmail.com wrote: > The following bug has been logged on the website: > > Bug reference: 14885 > Logged by: Kalin Daskalov > Email address: k.daskalov.911@gmail.com > PostgreSQL version: 9.4.4 > Operating system: windows 7 64 bit > Description: > > I'm using ORDER BY <character varying field> containing win-1251 symbols. > No matter whether the encoding is WIN1251 or UTF8. > I try it on different versions of PostgreSQL - the result is the same. > There is a mistake in sorted values for symbols "и" (и) and "й" > (й) - capital letters too. > Here is a small example: > "АЙГЕР" > "АИКО" > "АЙКОН" > Encoding only defines how different characters are encoded, it has nothing to do with ordering - that's what collations are for. What lc_collate value are you using? See https://www.postgresql.org/docs/9.4/static/collation.html cheers -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
On Thu, Nov 2, 2017 at 10:57 AM, <k.daskalov.911@gmail.com> wrote: > Here is a small example: > "АЙГЕР" > "АИКО" > "АЙКОН" Are you sure it's sorting badly? In some collations some letters sort equally, you should try: "АИКО" "АЙКОН" and "АИКОH" "АЙКО" as "И" may be sorting in the same place as "Й" and it is deciding by the tails. It happens in spanish with the tildes ( a and à sort in the same place, so unless they are the only difference order depends on aht is around: test=> select * from ( values ('a'),('à'),('ay'),('àx'),('za'),('zà') ) x order by 1;column1 ---------aààxayzazà (6 rows) ) Doing something like this with your alphabet may shed some light on the issue. (note, this was done with locale en_US.utf-8 ) Francisco Olarte. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Hi,
I understand you well and this exactly is the situation.
"иa"
"йa"
"иb"
"йb"
"иc"
I have to admit that this is not PostgreSQL problem.
In fact my previous compares are based on ASCII comparison - based on the order of the chars.
Now I test with ANSI comparison realized with MS Windows system functions and the result the same as in PostgreSQL.
But this is not appropriate. In fact if Cyrillic alphabet these are different letters and in Bulgarian language no one does expect this behavior. It's almost like to decide that Latin letters "i" and "y" should have such behavior.
Just the problem is elsewhere... And I have no decision.
Regards,
Kalin Daskalov
I understand you well and this exactly is the situation.
"иa"
"йa"
"иb"
"йb"
"иc"
I have to admit that this is not PostgreSQL problem.
In fact my previous compares are based on ASCII comparison - based on the order of the chars.
Now I test with ANSI comparison realized with MS Windows system functions and the result the same as in PostgreSQL.
But this is not appropriate. In fact if Cyrillic alphabet these are different letters and in Bulgarian language no one does expect this behavior. It's almost like to decide that Latin letters "i" and "y" should have such behavior.
Just the problem is elsewhere... And I have no decision.
Regards,
Kalin Daskalov
On Thu, Nov 2, 2017 at 5:38 PM, Francisco Olarte <folarte@peoplecall.com> wrote:
On Thu, Nov 2, 2017 at 10:57 AM, <k.daskalov.911@gmail.com> wrote:
> Here is a small example:
> "АЙГЕР"
> "АИКО"
> "АЙКОН"
Are you sure it's sorting badly? In some collations some letters sort
equally, you should try:
"АИКО"
"АЙКОН"
and
"АИКОH"
"АЙКО"
as "И" may be sorting in the same place as "Й" and it is deciding by the tails.
It happens in spanish with the tildes ( a and à sort in the same
place, so unless they are the only difference order depends on aht is
around:
test=> select * from ( values ('a'),('à'),('ay'),('àx'),('za'),('zà')
) x order by 1;
column1
---------
a
à
àx
ay
za
zà
(6 rows)
)
Doing something like this with your alphabet may shed some light on the issue.
(note, this was done with locale en_US.utf-8 )
Francisco Olarte.
--
Kalin Daskalov,
k.daskalov.911@gmail.com
k.daskalov.911@gmail.com
2017-11-02 18:27 GMT+01:00 Kalin Daskalov <k.daskalov.911@gmail.com>:
Hi,
I understand you well and this exactly is the situation.
"иa"
"йa"
"иb"
"йb"
"иc"
I have to admit that this is not PostgreSQL problem.
In fact my previous compares are based on ASCII comparison - based on the order of the chars.
Now I test with ANSI comparison realized with MS Windows system functions and the result the same as in PostgreSQL.
But this is not appropriate. In fact if Cyrillic alphabet these are different letters and in Bulgarian language no one does expect this behavior. It's almost like to decide that Latin letters "i" and "y" should have such behavior.
Just the problem is elsewhere... And I have no decision.
what locales do you use? It should to work everywhere. For Czech language we should to use Czech locale
regards
Pavel
Regards,
Kalin DaskalovOn Thu, Nov 2, 2017 at 5:38 PM, Francisco Olarte <folarte@peoplecall.com> wrote:On Thu, Nov 2, 2017 at 10:57 AM, <k.daskalov.911@gmail.com> wrote:
> Here is a small example:
> "АЙГЕР"
> "АИКО"
> "АЙКОН"
Are you sure it's sorting badly? In some collations some letters sort
equally, you should try:
"АИКО"
"АЙКОН"
and
"АИКОH"
"АЙКО"
as "И" may be sorting in the same place as "Й" and it is deciding by the tails.
It happens in spanish with the tildes ( a and à sort in the same
place, so unless they are the only difference order depends on aht is
around:
test=> select * from ( values ('a'),('à'),('ay'),('àx'),('za'),('zà')
) x order by 1;
column1
---------
a
à
àx
ay
za
zà
(6 rows)
)
Doing something like this with your alphabet may shed some light on the issue.
(note, this was done with locale en_US.utf-8 )
Francisco Olarte.
--Kalin Daskalov,
k.daskalov.911@gmail.com
Kalin: On Thu, Nov 2, 2017 at 6:27 PM, Kalin Daskalov <k.daskalov.911@gmail.com> wrote: > I understand you well and this exactly is the situation. ... > I have to admit that this is not PostgreSQL problem. Ok then. > In fact my previous compares are based on ASCII comparison - based on the > order of the chars. I doubt it was ASCII. ASCII is a 7 byte code. You were probably using an 8 bit code partially based on ascii ( Like the ISO-8859-1 typically used in spain, or its superset win-1252 ). What you were doing was probably a lexicographic compare using the unsigned 8 bit value. This is good enough to keep a table for a bsearch or build a btree, but is not what modern collations do ( among other things they collate uper and lower case together, like paper dictionaries normally do ) > Now I test with ANSI comparison realized with MS Windows system functions > and the result the same as in PostgreSQL. Also remember what you refer as Windows is probably Win NT, which has been internally unicode since the beginning. Besides, it's been 15 years since I used it but even then the windows API had lots of ways to do things. > But this is not appropriate. In fact if Cyrillic alphabet these are > different letters and in Bulgarian language no one does expect this > behavior. It's almost like to decide that Latin letters "i" and "y" should > have such behavior. I'm not a Bulgarian speaker, but you should raise it to then. And the i/y letter behaviour depends on the language, "i" is a vowel, but in spanish it can be or not, depending on the word. It sorts between x and z, but that has always been that way. Not knowing Bulgarian I do not know if the two letters you used are different, like n and ñ in spanish, or not, like a and à. If you consider it is right you could try to document it further and try to get the collation changed, but I would consult some references first. Also, which is your locale? Remember collation order depends on it. Francisco Olarte -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Hi,
It's the proper locale. It's set to Bulgarian everywhere.
- Windows settings:
format: Bulgarian (Bulgaria)
Location: Bulgaria
Current language for non-Unicode programs: Bulgarian (Bulgaria)
- PostgreSQL Database settings:
ENCODING = 'UTF8'
LC_COLLATE = 'Bulgarian_Bulgaria.1251'
LC_CTYPE = 'Bulgarian_Bulgaria.1251'
Probably as you say if I consider it is right I can try to document it further and try to get the collation changed.
Just I'm not sure and have to check on whether it depends on Russian language and whether they consider that 'И' and 'Й' are different letters - probably it's like in Bulgarian.
For now I noticed that following settings works good, but I'm still not sure whether there are no other side effects:
ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'Bulgarian_Bulgaria.1251'
Thank a lot
Kalin Daskalov
It's the proper locale. It's set to Bulgarian everywhere.
- Windows settings:
format: Bulgarian (Bulgaria)
Location: Bulgaria
Current language for non-Unicode programs: Bulgarian (Bulgaria)
- PostgreSQL Database settings:
ENCODING = 'UTF8'
LC_COLLATE = 'Bulgarian_Bulgaria.1251'
LC_CTYPE = 'Bulgarian_Bulgaria.1251'
Probably as you say if I consider it is right I can try to document it further and try to get the collation changed.
Just I'm not sure and have to check on whether it depends on Russian language and whether they consider that 'И' and 'Й' are different letters - probably it's like in Bulgarian.
For now I noticed that following settings works good, but I'm still not sure whether there are no other side effects:
ENCODING = 'UTF8'
LC_COLLATE = 'C'
LC_CTYPE = 'Bulgarian_Bulgaria.1251'
Thank a lot
Kalin Daskalov
On Fri, Nov 3, 2017 at 9:22 AM, Francisco Olarte <folarte@peoplecall.com> wrote:
Kalin:
On Thu, Nov 2, 2017 at 6:27 PM, Kalin Daskalov <k.daskalov.911@gmail.com> wrote:
> I understand you well and this exactly is the situation.
...
> I have to admit that this is not PostgreSQL problem.
Ok then.
> In fact my previous compares are based on ASCII comparison - based on the
> order of the chars.
I doubt it was ASCII. ASCII is a 7 byte code. You were probably using
an 8 bit code partially based on ascii ( Like the ISO-8859-1 typically
used in spain, or its superset win-1252 ). What you were doing was
probably a lexicographic compare using the unsigned 8 bit value. This
is good enough to keep a table for a bsearch or build a btree, but is
not what modern collations do ( among other things they collate uper
and lower case together, like paper dictionaries normally do )
> Now I test with ANSI comparison realized with MS Windows system functions
> and the result the same as in PostgreSQL.
Also remember what you refer as Windows is probably Win NT, which has
been internally unicode since the beginning. Besides, it's been 15
years since I used it but even then the windows API had lots of ways
to do things.
> But this is not appropriate. In fact if Cyrillic alphabet these are
> different letters and in Bulgarian language no one does expect this
> behavior. It's almost like to decide that Latin letters "i" and "y" should
> have such behavior.
I'm not a Bulgarian speaker, but you should raise it to then. And the
i/y letter behaviour depends on the language, "i" is a vowel, but in
spanish it can be or not, depending on the word. It sorts between x
and z, but that has always been that way. Not knowing Bulgarian I do
not know if the two letters you used are different, like n and ñ in
spanish, or not, like a and à. If you consider it is right you could
try to document it further and try to get the collation changed, but I
would consult some references first.
Also, which is your locale? Remember collation order depends on it.
Francisco Olarte
--
Kalin Daskalov,
k.daskalov.911@gmail.com
k.daskalov.911@gmail.com
Kalin: 1st, if you follow on top-posting the flow is going to become dificult to understand. 2nd: On Fri, Nov 3, 2017 at 12:18 PM, Kalin Daskalov <k.daskalov.911@gmail.com> wrote: > It's the proper locale. It's set to Bulgarian everywhere. > - Windows settings: > format: Bulgarian (Bulgaria) > Location: Bulgaria > Current language for non-Unicode programs: Bulgarian (Bulgaria) I assume this is correct, I don't do windows. > - PostgreSQL Database settings: > ENCODING = 'UTF8' > LC_COLLATE = 'Bulgarian_Bulgaria.1251' > LC_CTYPE = 'Bulgarian_Bulgaria.1251' This sounds fishy. Specially the .1251, which hints at a collation for win1251 encoded bulgarian. In my computers ( linux, debian and redhat derived most of them ) I have LC_COLLATE set to en_US.UTF-8, and they seem to sort utf-8 properly. Maybe your .1251 does some strange character mapping utf-8 => 1251 before sorting. The collations are, IIRC, derived from the c library collaiton routines anyway. Francisco Olarte. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs