Обсуждение: different sort order in windows and linux version

Поиск
Список
Период
Сортировка

different sort order in windows and linux version

От
Dragan Matic
Дата:
I have two postgres servers, one on linux (fedora core 5), one on
windows, both are version 8.1.4.

Both databases are initialized with locale Croatian and win1250 encoding.

running pg_controldata on windows returns this

LC_COLLATE:          Croatian_Croatia.1250
LC_CTYPE:            Croatian_Croatia.1250

the same command on linux returns this

LC_COLLATE:                    hr_HR
LC_CTYPE:                      hr_HR

which is the same, I suppose.

the sample databases are both initialized the same way

CREATE DATABASE sample
  WITH OWNER = postgres
       ENCODING = 'WIN1250'
       TABLESPACE = pg_default;


both databases have the same sample table

CREATE TABLE sample
(
  some_text char(13) NOT NULL
);


when I execute a query 'SELECT SOME_TEXT FROM SAMPLE ORDER BY SOME_TEXT'
I get different sort order on these two servers.

On the left side is windows server sort order, and on the right side is
linux server sort order. all values are left padded with spaces.

Postgres windows  Postgres linux

               0               0
               1             000
               2   0000000000000
               3               1
               4              11
               5              12
               6             123
               7             125
               8              13
               9              14
               a              15
               A               2
               b         2343255
               b          234455
               B             243
               c              25
               C               3
               d         31TA001
               e         32NU280
               f              35
               g               4
               z              45
              11               5
              12              55
              13           56455
              14               6
              15              65
              25               7
              35              75
              45               8
              55              85
              65               9
              75               a
              85               A
              aa              aa
              ab             aaa
              aB             aab
              Ab              ab
              AB              aB
              ba              Ab
              bb              AB
              cc             aba
             000             abA
             123             aBa
             125             AbA
             243             ABA
             aaa             abb
             aab             abb
             aba             abc
             abA             abc
             aBa             abC
             AbA             Abc
             ABA             ABC
             abb           abcde
             abb             acc
             abc               b
             abc               b
             abC               B
             Abc              ba
             ABC             baa
             acc             bab
             baa              bb
             bab             bba
             bba             bbb
             bbb             bca
             bca               c
             zzz               C
             zzz              cc
           56455               d
           abcde               e
          234455               f
         2343255               g
         31TA001               z
         32NU280             zzz
   0000000000000             zzz
   zzzzzzzzzzzzz   zzzzzzzzzzzzz

The only thing I can think of is that somehow these databases weren't
initialized in the same way, if so what is the difference?

Tnx in advance

Dragan

Re: different sort order in windows and linux version

От
Martijn van Oosterhout
Дата:
On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote:
> I have two postgres servers, one on linux (fedora core 5), one on
> windows, both are version 8.1.4.
>
> Both databases are initialized with locale Croatian and win1250 encoding.
>
> running pg_controldata on windows returns this
>
> LC_COLLATE:          Croatian_Croatia.1250
> LC_CTYPE:            Croatian_Croatia.1250
>
> the same command on linux returns this
>
> LC_COLLATE:                    hr_HR
> LC_CTYPE:                      hr_HR
>
> which is the same, I suppose.

Well, apparently not. Postgres makes no attempt to understand
collations nor try to determine whether they make sense. If you want to
have the same collation on Windows and Linux, I think you're going to
have trouble.

In the past there have existed patches to allow postgres to use ICU for
locale support. It's supposedly not quite as fast, but you will be able
get consistant results across platforms.

Hope this helps,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: different sort order in windows and linux version

От
"Tomi NA"
Дата:
On 6/30/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote:
> > I have two postgres servers, one on linux (fedora core 5), one on
> > windows, both are version 8.1.4.
> >
> > Both databases are initialized with locale Croatian and win1250 encoding.
> >
> > running pg_controldata on windows returns this
> >
> > LC_COLLATE:          Croatian_Croatia.1250
> > LC_CTYPE:            Croatian_Croatia.1250
> >
> > the same command on linux returns this
> >
> > LC_COLLATE:                    hr_HR
> > LC_CTYPE:                      hr_HR
> >
> > which is the same, I suppose.
>
> Well, apparently not. Postgres makes no attempt to understand
> collations nor try to determine whether they make sense. If you want to
> have the same collation on Windows and Linux, I think you're going to
> have trouble.

Croatian_Croatia and hr_HR are, in fact, the same in that there is no
other collation for the Croatian language. Whatsmore, Dragan ran the
test using characters which are encoded exactly the same in cp1250,
utf8, iso8859-2, hell, probably even us-ascii. The fact remains that
different OSes collate differently, even for the same locale.

In C++, people use things like GTK, wxWidgets and GCL so that they
could think about "C++ code instead of the platform they're coding on.
In Java, people use things like File.separator instead of "\" or "/"
so that they could think about "Java code".
There are dozens of examples like these and most of the exceptions
stem from the influence of the at the time monopoly-holder.
When you code in the RDBMS environment, you want to code in terms of
pgsql or Oracle or MySQL or whatever: you don't want to program for
Oracle on Solaris vs. Oracle on Linux vs. Oracle on Plan9 or...well,
you get the idea.
Not beeing able to depend on the engine to consistently collate
strings as simple as the ones Dragan listed is closer to a serious bug
(non-deterministic behaviour in otherwise deterministic functions)
than a RFE, but is certainly nowhere near "it's not our problem" as it
regularly seems made up to be. The OS(es) simply and obviously
do(es)n't do a good enough job of it.

> In the past there have existed patches to allow postgres to use ICU for
> locale support. It's supposedly not quite as fast, but you will be able
> get consistant results across platforms.

Personally, I'd be perfectly happy with pgsql if I could choose to
make text operations up to 2-3x slower without the fuss of how it's
going to work on a certain platform, in each pgsql version.
Furthermore, compiling the server myself is not an option for live
usage: on my current project, I'm not even the one installing the
database servers...sending administrators a binary I configured and
compiled (on Windows, in this case!) and noone but me
tested...brrrr...I get the shivers just thinking about it.

If I sound harsh, please excuse me, but I feel like I'm the only one
who thinks these encoding problems (collation, upper/lowercase,
multiple languages in a single database) are serious...nobody seems to
share the sentiment. Ah well...

t.n.a.

Re: different sort order in windows and linux version

От
Martijn van Oosterhout
Дата:
On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote:
> If I sound harsh, please excuse me, but I feel like I'm the only one
> who thinks these encoding problems (collation, upper/lowercase,
> multiple languages in a single database) are serious...nobody seems to
> share the sentiment. Ah well...

I agree with you, however the resistance (AFAICS) comes mostly from the
fact that we would be depending on an external library to do it. I
don't think postgres should try doing it itself, given that the unicode
character databases are quite large by themselves.

Alternativly, the postgres group could produce a customised version of
ICU that's smaller (the website has details about how). But any case,
this problem will need to be addressed at some point.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: different sort order in windows and linux version

От
"Tomi NA"
Дата:
On 7/1/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Fri, Jun 30, 2006 at 07:29:12PM +0200, Tomi NA wrote:
> > If I sound harsh, please excuse me, but I feel like I'm the only one
> > who thinks these encoding problems (collation, upper/lowercase,
> > multiple languages in a single database) are serious...nobody seems to
> > share the sentiment. Ah well...
>
> I agree with you, however the resistance (AFAICS) comes mostly from the
> fact that we would be depending on an external library to do it. I
> don't think postgres should try doing it itself, given that the unicode
> character databases are quite large by themselves.
>
> Alternativly, the postgres group could produce a customised version of
> ICU that's smaller (the website has details about how). But any case,
> this problem will need to be addressed at some point.

Basically, it comes down to three possibilities, doesn't it:
1.) use an existing library
2.) write a pgsql specific implementation
3.) forget about it and tend to other issues

Personally, I don't really care if it's 1) or 2): I'm just afraid it's
going to be 3).
Is this a licencing issue (with regard to ICU beeing under the IBM
public licence)? A plugin architecture (to get rid of licencing
headaches) issue?
Are there any other libraries that might do the job?

To be perfectly honest, I've had to tackle so many problems with
encodings during the years I'd make it punishable by law to use
anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod
is. (-:

t.n.a.

Re: different sort order in windows and linux version

От
Tom Lane
Дата:
"Tomi NA" <hefest@gmail.com> writes:
> Basically, it comes down to three possibilities, doesn't it:
> 1.) use an existing library
> 2.) write a pgsql specific implementation
> 3.) forget about it and tend to other issues

> Personally, I don't really care if it's 1) or 2): I'm just afraid it's
> going to be 3).
> Is this a licencing issue (with regard to ICU beeing under the IBM
> public licence)?

Licensing is a concern --- IBM's appears to be not quite BSD enough.
Size and portability of the library are concerns.  Performance is a
concern.  Whether the patch makes the library required or optional is
a concern (if required, the portability issue becomes a whole lot more
urgent).  Loss of existing functionality is a concern --- for instance,
if the patch is such that UTF8 becomes the only supported server
encoding, it'll probably be rejected forthwith.

> A plugin architecture (to get rid of licencing headaches) issue?

AFAIK making it a "plugin" won't alleviate anyone's licensing worries.
Certainly that's not going to answer if the library is GPL.

> To be perfectly honest, I've had to tackle so many problems with
> encodings during the years I'd make it punishable by law to use
> anything *but* UTF...but I'm not president of the Galaxy yet, Zaphod
> is. (-:

Well, the Japanese think that UTF8 is not the solution to all their
worries, so they won't be happy with a UTF8-only solution.  Likewise,
those of us who only need single-byte character sets won't be very happy
with being forced to accept multi-byte processing overhead.

            regards, tom lane

Re: different sort order in windows and linux version

От
Martijn van Oosterhout
Дата:
On Sat, Jul 01, 2006 at 06:23:07PM -0400, Tom Lane wrote:
> "Tomi NA" <hefest@gmail.com> writes:
> > Basically, it comes down to three possibilities, doesn't it:
> > 1.) use an existing library
> > 2.) write a pgsql specific implementation
> > 3.) forget about it and tend to other issues
>
> > Personally, I don't really care if it's 1) or 2): I'm just afraid it's
> > going to be 3).
> > Is this a licencing issue (with regard to ICU beeing under the IBM
> > public licence)?
>
> Licensing is a concern --- IBM's appears to be not quite BSD enough.
> Size and portability of the library are concerns.  Performance is a
> concern.  Whether the patch makes the library required or optional is
> a concern (if required, the portability issue becomes a whole lot more
> urgent).  Loss of existing functionality is a concern --- for instance,
> if the patch is such that UTF8 becomes the only supported server
> encoding, it'll probably be rejected forthwith.

Licence - It's the X/MIT licence, which is almost identical to the BSD
licence.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/*checkout*/icu/license.html
http://en.wikipedia.org/wiki/MIT_License

But I don't think anyone is actually considering importing ICU into the
postgres source tree, are they?

Size - I'm not sure this is relevent since I don't think we want to
incorporate it into postgres itself, just let people use it if they
have it. In any case though, the default dataset is 8MB. This includes
support for every locale and charset it knows about.

If you drop the conversion stuff (because postgres already has that)
you're down to about 4MB.

Since ICU supports userdefined tables, we could provide a single
cross-platform dataset and get the user's ICU library implementation to
use that.

Portability - ICU runs on all the platforms postgres does, AFAICS.

http://dev.icu-project.org/cgi-bin/viewcvs.cgi/icu/readme.html?rev=release-3-4#HowToBuildSupported

Performance - ICU is approximatly four times faster than glibc for
collation. Even once you include keygen time (including conversion) it
comes out about 40% faster.

http://icu.sourceforge.net/charts/collation_icu4c_glibc.html

ICU is not slow.

> Well, the Japanese think that UTF8 is not the solution to all their
> worries, so they won't be happy with a UTF8-only solution.  Likewise,
> those of us who only need single-byte character sets won't be very happy
> with being forced to accept multi-byte processing overhead.

I've not quite understood the japenese problem with Unicode. My
understanding is that it was primarily due to widespread use of broken
converters.

In any case, ICU appears to beat glibc with single byte encodings, even
including the multi-byte conversion.

However, the most important point is that people have said they'll take
the speed hit if they could get consistant collation. For speed you can
always throw more hardware. But no amount of hardware will fix your
collation issues.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: different sort order in windows and linux version

От
Dragan Matic
Дата:
Tomi NA wrote:
> On 6/30/06, Martijn van Oosterhout <kleptog@svana.org> wrote:
>> On Fri, Jun 30, 2006 at 11:56:19AM +0200, Dragan Matic wrote:
>> > I have two postgres servers, one on linux (fedora core 5), one on
>> > windows, both are version 8.1.4.
>> >
>
> Not beeing able to depend on the engine to consistently collate
> strings as simple as the ones Dragan listed is closer to a serious bug
> (non-deterministic behaviour in otherwise deterministic functions)
> than a RFE, but is certainly nowhere near "it's not our problem" as it
> regularly seems made up to be. The OS(es) simply and obviously
> do(es)n't do a good enough job of it.
>
I was about to say the same thing. I think that the whole point in
having a portable database system is that the data inside the database
should behave the _same way_ no matter what operating system database is
running on - client shouldn't be aware of the server OS. This is clearly
not the case here. Furthermore, the same thing happens even with en_US
(on Linux) and English_United States (on windows) collations selected,
so it is definitely a serious issue with US collation also and not with
some exotic collation orders only. I think that the only case where it
doesn't happen is when "C" collation is selected. It might be
interesting to see how this issue behaves on other operating systems.

>> In the past there have existed patches to allow postgres to use ICU for
>> locale support. It's supposedly not quite as fast, but you will be able
>> get consistant results across platforms.
>
> Personally, I'd be perfectly happy with pgsql if I could choose to
> make text operations up to 2-3x slower without the fuss of how it's
> going to work on a certain platform, in each pgsql version.
> Furthermore, compiling the server myself is not an option for live
> usage: on my current project, I'm not even the one installing the
> database servers...sending administrators a binary I configured and
> compiled (on Windows, in this case!) and noone but me
> tested...brrrr...I get the shivers just thinking about it.
    Recompiling is not an option for me also, I mean I could do it for
an in-house servers where I am in charge, but our application runs on
many places and on many servers where recompiling postgres with some
third-party patches is out of the question. I think the solution where
postgres would be slower but behaved the same way on all supported
operating systems would also be acceptable for most people.

Dragan


Re: different sort order in windows and linux version

От
Karsten Hilbert
Дата:
On Sun, Jul 02, 2006 at 12:13:02PM +0200, Martijn van Oosterhout wrote:

> However, the most important point is that people have said they'll take
> the speed hit if they could get consistant collation.
I can second that.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD  4537 78B9 A9F9 E407 1346

Re: different sort order in windows and linux version

От
Agent M
Дата:
On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote:
> But I don't think anyone is actually considering importing ICU into the
> postgres source tree, are they?
Why not?

> Size - I'm not sure this is relevent since I don't think we want to
> incorporate it into postgres itself, just let people use it if they
> have it. In any case though, the default dataset is 8MB. This includes
> support for every locale and charset it knows about.
>
> If you drop the conversion stuff (because postgres already has that)
> you're down to about 4MB.
Why would you drop the ICU transcoding support instead of the existing
postgres functions? Why the duplicated effort?


>> Well, the Japanese think that UTF8 is not the solution to all their
>> worries, so they won't be happy with a UTF8-only solution.  Likewise,
>> those of us who only need single-byte character sets won't be very
>> happy
>> with being forced to accept multi-byte processing overhead.
>
> I've not quite understood the japenese problem with Unicode. My
> understanding is that it was primarily due to widespread use of broken
> converters.

Certain Japanese characters cannot make a reliable round-trip through
Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be
happy with an ICU-only solution. However, it would still be of great
benefit to allow ICU to handle as much as possible, leaving the string
encodings to the encoding experts.

At the very least, it would be great to have ICU to handle encoding on
a per-column basis (perhaps extending the text datatype with encoding
info). Perhaps this would be a decent stopgap solution? The backend
protocol would also need a version bump- currently, it converts all
strings to a single encoding.

¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬
AgentM
agentm@themactionfaction.com
¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬ ¬


Re: different sort order in windows and linux version

От
"Tomi NA"
Дата:
On 7/2/06, Agent M <agentm@themactionfaction.com> wrote:

> Certain Japanese characters cannot make a reliable round-trip through
> Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be
> happy with an ICU-only solution. However, it would still be of great

Could you explain what you mean and what's special with those characters?

> benefit to allow ICU to handle as much as possible, leaving the string
> encodings to the encoding experts.
>
> At the very least, it would be great to have ICU to handle encoding on
> a per-column basis (perhaps extending the text datatype with encoding
> info). Perhaps this would be a decent stopgap solution? The backend
> protocol would also need a version bump- currently, it converts all
> strings to a single encoding.

Could you give an example of what that would look like in your opinion?
I was thinking more along the lines of a setting in pg_hba.conf where
the server uses or does not use something like ICU...at least as an
intermediate solution.
Adding a "LOCALE" clause to a column definition (similar to the
"ENCODING" clause of the "CREATE DATABASE" statement) would solve most
(not all) problems with a default locale.
There still might be some non-deterministic behaviour with operations
between strings in different locales but it's far from a showstopper.

t.n.a.

Re: different sort order in windows and linux version

От
Martijn van Oosterhout
Дата:
On Sun, Jul 02, 2006 at 12:25:43PM -0400, Agent M wrote:
> On Jul 2, 2006, at 6:13 AM, Martijn van Oosterhout wrote:
> >But I don't think anyone is actually considering importing ICU into the
> >postgres source tree, are they?
> Why not?

Because it's a project of similar size to postgres and probably nearly
as old and I don't think anyone here actually wants to maintain it.

I mean, we could incorporate the source for readline, openssl,
kerberos, the C library but why. That project has maintainers already
and we only wan to use it, not fork it.

> >If you drop the conversion stuff (because postgres already has that)
> >you're down to about 4MB.
> Why would you drop the ICU transcoding support instead of the existing
> postgres functions? Why the duplicated effort?

Because we would want to be bug-for-bug compatable to previous
releases. I suppose it would be possible if someone checked that the
end result is the same.

> Certain Japanese characters cannot make a reliable round-trip through
> Unicode. ICU uses UTF-16 as its store, so the Japanese folks won't be
> happy with an ICU-only solution. However, it would still be of great
> benefit to allow ICU to handle as much as possible, leaving the string
> encodings to the encoding experts.

We don't need round-trip through unicode, since we're only doing one
way conversions for the purpose of collation.

BTW, this site seems to have a good discussion of Japanese characters
and Unicode.

http://www.jbrowse.com/text/unij.html

> At the very least, it would be great to have ICU to handle encoding on
> a per-column basis (perhaps extending the text datatype with encoding
> info). Perhaps this would be a decent stopgap solution? The backend
> protocol would also need a version bump- currently, it converts all
> strings to a single encoding.

That's called SQL COLLATE support and that's an order of magnitude
harder than adding support for ICU. See previous dicussion on -hackers.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения

Re: different sort order in windows and linux version

От
Tom Lane
Дата:
Dragan Matic <mlists@panforma.co.yu> writes:
> I was about to say the same thing. I think that the whole point in
> having a portable database system is that the data inside the database
> should behave the _same way_ no matter what operating system database is
> running on - client shouldn't be aware of the server OS.

So on that argument, we need to eliminate datatypes float8 and float4
forthwith, because they don't behave quite the same on every machine.
And int8 too, because it's not supported on every machine.  And
--enable-integer-datetimes has got to go; in fact configure should
not have any options at all.

>> Personally, I'd be perfectly happy with pgsql if I could choose to
>> make text operations up to 2-3x slower without the fuss of how it's
>> going to work on a certain platform, in each pgsql version.

Fine for you, not so fine for other people with different concerns.

I'm not unsympathetic to your general point, but black-and-white
arguments won't get far in this discussion.  It's all about tradeoffs
... it's most definitely not about one-size-fits-all.

            regards, tom lane

Re: different sort order in windows and linux version

От
Dragan Matic
Дата:
Tom Lane wrote:
> Dragan Matic <mlists@panforma.co.yu> writes:
>
>> I was about to say the same thing. I think that the whole point in
>> having a portable database system is that the data inside the database
>> should behave the _same way_ no matter what operating system database is
>> running on - client shouldn't be aware of the server OS.
>>
>
> So on that argument, we need to eliminate datatypes float8 and float4
> forthwith, because they don't behave quite the same on every machine.
> And int8 too, because it's not supported on every machine.  And
> --enable-integer-datetimes has got to go; in fact configure should
> not have any options at all.
>
>
Ok, I agree on that, I have completely forgotten these datatypes, when I
wrote the above I was having in mind strings only.
>>> Personally, I'd be perfectly happy with pgsql if I could choose to
>>> make text operations up to 2-3x slower without the fuss of how it's
>>> going to work on a certain platform, in each pgsql version.
>>>
>
> Fine for you, not so fine for other people with different concerns.
>
> I'm not unsympathetic to your general point, but black-and-white
> arguments won't get far in this discussion.  It's all about tradeoffs
> ... it's most definitely not about one-size-fits-all.
>
>             regards, tom lane
>
>
>
Sorry if I was being rude, didn't mean to sound that way, I was just
surprised to see this kind of postgresql behavior. I wouldn't like to be
misunderstood, we are using postgresql for 6th year now and we are
extremely satisfied with it. Furthermore we are using Linux servers only
so the mentioned behavior doesn't affect us at the moment. My original
question arose when I wanted to do some tests on my notebook and after
installing postgresql for windows, I was surprised to see that I was not
getting the same results that I was getting from the other servers, so I
thought I could ask if I misconfigured the installation, or was that a
known bug. My concern was that for someone developing application with
postgresql for windows and then deploying it on postgresql on Linux this
might be a big surprise.

regards

Dragan Matic




Re: different sort order in windows and linux version

От
Martijn van Oosterhout
Дата:
On Mon, Jul 03, 2006 at 12:55:18AM -0400, Tom Lane wrote:
> Fine for you, not so fine for other people with different concerns.
>
> I'm not unsympathetic to your general point, but black-and-white
> arguments won't get far in this discussion.  It's all about tradeoffs
> ... it's most definitely not about one-size-fits-all.

What I find most confusing is that the patch for adding ICU to postgres
(using a configure switch) has been around for a while, at one point it
was even in the queue for 8.2:

http://archives.postgresql.org/pgsql-patches/2005-08/msg00309.php

It's vanished from there, not sure why. Even so, we still get regular
requests for it. Presumably it was removed here:

http://archives.postgresql.org/pgsql-patches/2006-03/msg00233.php

Given the patch does nothing if you don't enable it, I'm not quite sure
what the downsides would be...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Вложения