Thread: Windows and locales and UTF-8 (oh my)

Windows and locales and UTF-8 (oh my)

From: Tom Lane
Date: October 6, 2007, 17:55:41

I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date:    Fri, 05 Oct 2007 20:54:04 +0100
From:    Dave Page <dpage@postgresql.org>
To:      Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.
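
To check a single category in output like the above, the composite string can be split on ';' and '='. The following is a minimal sketch, assuming the "NAME=value;NAME=value" layout shown in the test output (this layout is what the Windows runtime printed here, not something portable). The helper name locale_category_value is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the value of one category (e.g. "LC_CTYPE")
 * out of a composite "NAME=value;NAME=value" locale string into buf.
 * Returns 1 on success, 0 if the category is absent or buf is too small.
 */
int
locale_category_value(const char *composite, const char *category,
                      char *buf, size_t buflen)
{
    size_t      catlen = strlen(category);
    const char *p = composite;

    while (*p != '\0')
    {
        const char *end = strchr(p, ';');
        size_t      fieldlen = end ? (size_t) (end - p) : strlen(p);

        /* A field matches when it starts with "category=". */
        if (fieldlen > catlen &&
            strncmp(p, category, catlen) == 0 && p[catlen] == '=')
        {
            size_t      vallen = fieldlen - catlen - 1;

            if (vallen >= buflen)
                return 0;
            memcpy(buf, p + catlen + 1, vallen);
            buf[vallen] = '\0';
            return 1;
        }
        if (end == NULL)
            break;
        p = end + 1;
    }
    return 0;
}
```

Applied to the output above, asking for "LC_CTYPE" yields "C" while every other category yields "English_United Kingdom.65001".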

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.

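#include <locale.h>
#include <stdio.h>

/* Set the locale given as argv[1] (if any), then print the effective LC_ALL value. */
int main(int argc, char *argv[])
{
       char *lc;

       if (argc > 1)
               setlocale(LC_ALL, argv[1]);
       lc = setlocale(LC_ALL, NULL);   /* NULL means query, not set */
       printf("%s\n", lc ? lc : "(null)");
       return 0;
}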

------- Message 2

Date:    Fri, 05 Oct 2007 23:32:36 +0100
From:    Dave Page <dpage@postgresql.org>
To:      Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> So, my test prog (below) returns the following:
> 
>> Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United
>> Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
>> Kingdom.65001;LC_NUMERIC=English_United
>> Kingdom.65001;LC_TIME=English_United Kingdom.65001
> 
> That's just frickin' weird ... and a bit scary.  There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior.  (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??)  Are you sure all variants of Windows act that way?

All the ones we support afaict.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> 
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-(  But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there's no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.

/D

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places.  ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
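
The proposed hack could look roughly like the sketch below. It is not the actual PostgreSQL code: the real function consults setlocale() itself, and the name lc_ctype_is_c_sketch and its three parameters are illustrative stand-ins so the decision can be read in isolation.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Sketch of the proposed check. All inputs are passed in explicitly
 * (an assumption made for illustration); the real routine would obtain
 * them from setlocale() and the database encoding.
 */
bool
lc_ctype_is_c_sketch(const char *lc_ctype, bool on_windows,
                     bool db_encoding_is_utf8)
{
    /*
     * On Windows a UTF-8 database can report LC_CTYPE=C even though the
     * requested locale was not C, so never take the C-locale fast paths.
     */
    if (on_windows && db_encoding_is_utf8)
        return false;

    return strcmp(lc_ctype, "C") == 0 || strcmp(lc_ctype, "POSIX") == 0;
}
```

Note that this version also returns false for a database that genuinely uses locale C with UTF-8 on Windows, which costs the fast paths in that case but should not produce wrong results.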

That still leaves me with a boatload of questions, though.  If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens?  If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name.  That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
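
A minimal sketch of the "strip the codepage and plaster on .65001" idea follows. The helper name utf8_locale_name is hypothetical, and it assumes Windows-style names of the form "Language_Country.codepage"; whether the resulting name is accepted for every locale category is a separate question that this code does not answer.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: rewrite a Windows-style locale name such as
 * "Japanese_Japan.932" as "Japanese_Japan.65001". A name with no
 * codepage suffix just gets ".65001" appended.
 * Returns 1 on success, 0 if the result does not fit in buf.
 */
int
utf8_locale_name(const char *locale, char *buf, size_t buflen)
{
    const char *dot = strrchr(locale, '.');
    size_t      baselen = dot ? (size_t) (dot - locale) : strlen(locale);
    int         n;

    /* Copy everything before the last '.', then add the UTF-8 codepage. */
    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale);
    return n >= 0 && (size_t) n < buflen;
}
```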

Comments?  I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.
        regards, tom lane

Locales and Encodings

From: Gregory Stark
Date: October 12, 2007, 11:25:06

It seems like the root of the problems we're butting our heads against with
encoding and locale is all the same issue: it's nonsensical to take the locale
at initdb time per-cluster and then allow user-specified encoding
per-database. If anything it would make more sense to go the other way around.

But actually it seems to me we could allow changing both on a per-database
basis with certain restrictions:

. template0 is always SQL_ASCII with locale C

. when creating a new database you can specify the encoding and locale and we check that they're compatible. 

. when creating a new database from a template the new locale and encoding must be identical to the template database's
encoding and locale. Unless the template is template0 in which case we rebuild all indexes after copying.
 

We could liberalize this last restriction if we created a new encoding like
SQL_ASCII but which enforces 7-bit ascii. But then the index rebuild step
could take a long time.
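
The template rule proposed above can be sketched as a small decision function. This is only an illustration of the proposal, not existing PostgreSQL behavior: DbSettings, CreateAction and plan_create_database are hypothetical names, and the separate check that the requested encoding and locale are compatible with each other is left out.

```c
#include <stdbool.h>
#include <string.h>

/* Illustrative stand-in for a database's name, encoding and locale. */
typedef struct
{
    const char *name;
    const char *encoding;
    const char *locale;
} DbSettings;

typedef enum
{
    CREATE_REJECT,          /* settings differ from a non-template0 template */
    CREATE_COPY,            /* plain copy of the template */
    CREATE_COPY_REINDEX     /* copy, then rebuild all indexes */
} CreateAction;

CreateAction
plan_create_database(const DbSettings *tmpl, const char *encoding,
                     const char *locale)
{
    bool        same = strcmp(tmpl->encoding, encoding) == 0 &&
                       strcmp(tmpl->locale, locale) == 0;

    if (same)
        return CREATE_COPY;

    /* template0 is assumed to hold only 7-bit ASCII, so any target is allowed. */
    if (strcmp(tmpl->name, "template0") == 0)
        return CREATE_COPY_REINDEX;

    return CREATE_REJECT;
}
```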

This would make the whole locale/encoding issue much more transparent. In
database listings you would see both listed alongside, you wouldn't be bound
by any initdb environment choices, and errors when running create database
would be able to tell you exactly what you're doing wrong and what you have to
do to avoid the problem.


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Locales and Encodings

From: Peter Eisentraut
Date: October 12, 2007, 12:00:46

Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:
> . when creating a new database from a template the new locale and encoding
>   must be identical to the template database's encoding and locale. Unless
> the template is template0 in which case we rebuild all indexes after
> copying.

Why would you restrict the index rebuilding only to this particular case?  It 
could be done for any database.

The other issue is shared catalogs.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locales and Encodings

From: Gregory Stark
Date: October 12, 2007, 13:03:56

"Peter Eisentraut" <peter_e@gmx.net> writes:

> Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:
>> . when creating a new database from a template the new locale and encoding
>>   must be identical to the template database's encoding and locale. Unless
>> the template is template0 in which case we rebuild all indexes after
>> copying.
>
> Why would you restrict the index rebuilding only to this particular case?  It
> could be done for any database.

Well there's no guarantee there isn't 8-bit data in other databases which
would be invalid in the new encoding. I think it's reasonable to assume
there's only 7-bit ascii in template0 however.

An alternative would be introducing an ASCII7 encoding which template0 would
use and any other database in that encoding could be used as a template for
any encoding. However that would still require index rebuilds which would
potentially take a long time. Another alternative would be recoding all the
data from the template database encoding to the new encoding and throwing an
error if a non-encodable character is found.

I think it's a lot simpler to just declare it a non-problem by saying there
won't be any non-ascii text in template0.

> The other issue is shared catalogs.

This approach doesn't address that but I don't think it makes the problems
there any worse either. That is, I think we already have these problems around
shared tables.

. If you have two databases with locales that don't agree then the indexes on those tables won't function properly.

. What happens if you create a user while connected to a latin1 database with an é in his username and then connect to
a UTF8 database? That username is now an invalidly encoded UTF8 string.
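
To make the second point concrete: in Latin-1 the letter é is the single byte 0xE9, and in UTF-8 that byte is the lead byte of a three-byte sequence, so a name ending in it is malformed. The check below is a minimal sketch written for illustration (the name looks_like_utf8 is hypothetical); it verifies only the lead-byte and continuation-byte structure and does not reject overlong forms or surrogates.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Minimal structural UTF-8 check: each lead byte must be followed by the
 * right number of continuation bytes (10xxxxxx).
 */
bool
looks_like_utf8(const unsigned char *s, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        size_t      need;

        if (s[i] < 0x80)
            need = 0;               /* plain ASCII */
        else if ((s[i] & 0xE0) == 0xC0)
            need = 1;
        else if ((s[i] & 0xF0) == 0xE0)
            need = 2;
        else if ((s[i] & 0xF8) == 0xF0)
            need = 3;
        else
            return false;           /* stray continuation or invalid lead byte */

        if (need > len - i - 1)
            return false;           /* sequence is cut off */
        for (size_t k = 1; k <= need; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += need + 1;
    }
    return true;
}
```

With this check, the Latin-1 bytes of "René" (52 65 6E E9) are rejected, while the UTF-8 bytes (52 65 6E C3 A9) pass.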

Perhaps we should be using pattern_ops for the indexes on the shared tables?
Or using bytea with UTF8 encoded strings instead of name and text? That
actually sounds reasonable now that we have convert() functions which take and
generate bytea, at least for the text fields like in pltemplate -- less so for
the name columns.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Locales and Encodings

From: Martijn van Oosterhout
Date: October 12, 2007, 13:23:28

On Fri, Oct 12, 2007 at 02:03:47PM +0100, Gregory Stark wrote:
> This approach doesn't address that but I don't think it makes the problems
> there any worse either. That is, I think we already have these problems around
> shared tables.

Or we could just set up encodings/locales per column and the problem
goes away entirely. Most of the code's already been written, it's not
even terribly difficult. Where we're stuck is that we can't agree on a
source of locale data. People don't want the ICU or glibc data and
there's no other source as readily available.

Perhaps we should fix that problem, rather than making more
workarounds.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Locales and Encodings

From: Gregory Stark
Date: October 12, 2007, 14:28:40

"Martijn van Oosterhout" <kleptog@svana.org> writes:

> People don't want the ICU or glibc data and there's no other source as
> readily available.
>
> Perhaps we should fix that problem, rather than making more
> workarounds.

Fix the problem by making ICU a smaller less complex dependency? 
Or fix the problem that glibc isn't everyone's libc?

I think realistically we're basically waiting for strcoll_l to become
standardized by POSIX so we can depend on it.

Personally I think we should just implement our own strcoll_l as a wrapper
around setlocale-strcoll-setlocale and use strcoll_l if it's available and
our, possibly slow, wrapper if not. If we ban direct use of strcoll and other
lc_collate sensitive functions in Postgres we could also remember the last
locale used and not do unnecessary setlocales so existing use cases aren't
slowed down at all.
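
The wrapper idea above could be sketched as follows. This is an illustration only: strcoll_cached is a hypothetical name, the code is not thread-safe, and the cache is valid only under the stated assumption that nothing else in the process calls strcoll() or setlocale(LC_COLLATE, ...) directly.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Name of the LC_COLLATE locale this wrapper last installed, if any. */
static char *last_collate_locale = NULL;

/*
 * Hypothetical fallback for platforms without strcoll_l(): switch
 * LC_COLLATE only when the requested locale differs from the last one.
 * Returns 0 and stores the comparison in *result, or returns -1 if the
 * locale cannot be set or memory runs out.
 */
int
strcoll_cached(const char *a, const char *b, const char *locale, int *result)
{
    if (last_collate_locale == NULL ||
        strcmp(last_collate_locale, locale) != 0)
    {
        /* Allocate first, so a failure cannot leave the cache stale. */
        char       *copy = malloc(strlen(locale) + 1);

        if (copy == NULL)
            return -1;
        strcpy(copy, locale);
        if (setlocale(LC_COLLATE, locale) == NULL)
        {
            free(copy);
            return -1;
        }
        free(last_collate_locale);
        last_collate_locale = copy;
    }
    *result = strcoll(a, b);
    return 0;
}
```

Repeated comparisons in the same locale then cost one strcmp() on the locale name plus the strcoll() call itself. Workloads that alternate between locales still pay for a setlocale() on every switch.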

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Locales and Encodings

From: Peter Eisentraut
Date: October 12, 2007, 14:32:29

Am Freitag, 12. Oktober 2007 schrieb Martijn van Oosterhout:
> Where we're stuck is that we can't agree on a
> source of locale data. People don't want the ICU or glibc data and
> there's no other source as readily available.

What were the objections to ICU?

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locales and Encodings

From: Gregory Stark
Date: October 12, 2007, 15:19:51

"Peter Eisentraut" <peter_e@gmx.net> writes:

> Am Freitag, 12. Oktober 2007 schrieb Martijn van Oosterhout:
>> Where we're stuck is that we can't agree on a
>> source of locale data. People don't want the ICU or glibc data and
>> there's no other source as readily available.
>
> What were the objections to ICU?

It's introducing a new dependency to do something fundamental to Postgres, one
that's larger than all of Postgres.

It would make Postgres inconsistent and less integrated with the rest of the
OS. How do you explain that Postgres doesn't follow the system's
configurations and the collations don't agree with the system collations?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Locales and Encodings

From: Michael Glaesemann
Date: October 12, 2007, 15:31:01

On Oct 12, 2007, at 10:19 , Gregory Stark wrote:

> It would make Postgres inconsistent and less integrated with the rest of
> the OS. How do you explain that Postgres doesn't follow the system's
> configurations and the collations don't agree with the system collations?

How is this fundamentally different from PostgreSQL using a separate  
users/roles system than the OS?

Michael Glaesemann
grzm seespotcode net

Re: Locales and Encodings

From: Peter Eisentraut
Date: October 12, 2007, 15:56:12

Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:
> It would make Postgres inconsistent and less integrated with the rest of
> the OS. How do you explain that Postgres doesn't follow the system's
> configurations and the collations don't agree with the system collations?

We already have our own encoding support (for better or worse), and I don't 
think having one's own locale support would be that much different.

-- 
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Locales and Encodings

From: Andreas Pflug
Date: October 12, 2007, 16:02:15

Michael Glaesemann wrote:
>
> On Oct 12, 2007, at 10:19 , Gregory Stark wrote:
>
>> It would make Postgres inconsistent and less integrated with the rest of
>> the OS. How do you explain that Postgres doesn't follow the system's
>> configurations and the collations don't agree with the system collations?
>
> How is this fundamentally different from PostgreSQL using a separate
> users/roles system than the OS?
Even more, eliminating dependencies on an OS's correct implementation of
locale stuff appears A Good Thing to me. I wonder if a compile time
option to use ICU in 8.4 should be considered, regarding all those
lengthy threads about encoding/locale/collation problems.

Regards,
Andreas

Re: Locales and Encodings

From: Martijn van Oosterhout
Date: October 12, 2007, 16:16:12

On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
> Fix the problem by making ICU a smaller less complex dependency?

How? It's 95% data, you can't reduce that. glibc also has 10MB of locale
data. The actual code is much smaller than postgres and doesn't depend
on any other non-system libraries.

> I think realistically we're basically waiting for strcoll_l to become
> standardized by POSIX so we can depend on it.

I think we could be waiting forever then. It's supported by Win32,
MacOSX and glibc. The systems that don't support it tend not to support
multibyte collation anyway. Patches have been created to use this and
rejected because not enough platforms support it...

> Personally I think we should just implement our own strcoll_l as a wrapper
> around setlocale-strcoll-setlocale and use strcoll_l if it's available and
> our, possibly slow, wrapper if not. If we ban direct use of strcoll and other
> lc_collate sensitive functions in Postgres we could also remember the last
> locale used and not do unnecessary setlocales so existing use cases aren't
> slowed down at all.

Been done also. As I recall it was *really* slow, not just a little
bit.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Locales and Encodings

From: Tom Lane
Date: October 12, 2007, 16:32:08

Peter Eisentraut <peter_e@gmx.net> writes:
> Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:
>> It would make Postgres inconsistent and less integrated with the rest of
>> the OS. How do you explain that Postgres doesn't follow the system's
>> configurations and the collations don't agree with the system collations?

> We already have our own encoding support (for better or worse), and I don't 
> think having one's own locale support would be that much different.

Well, yes it would be, because encodings are pretty well standardized;
there is not likely to be any user-visible difference between one
platform's idea of UTF8 and another's.  This is very very far from being
the case for locales.  See for instance the recent thread in which we
found out that "en_US" locale has utterly different sort orders on
Linux and OS X.
        regards, tom lane

Re: Locales and Encodings

From: Tom Lane
Date: October 12, 2007, 16:41:10

Martijn van Oosterhout <kleptog@svana.org> writes:
> On Fri, Oct 12, 2007 at 03:28:26PM +0100, Gregory Stark wrote:
>> I think realistically we're basically waiting for strcoll_l to become
>> standardized by POSIX so we can depend on it.

> I think we could be waiting forever then.

strcoll is only a small fraction of the problem anyway.  The <ctype.h>
and <wctype.h> functions are another chunk of it, and then there's the
issues of system message spellings, LC_MONETARY info, etc etc.
        regards, tom lane

Re: Locales and Encodings

From: "Florian G. Pflug"
Date: October 12, 2007, 17:42:06

Tom Lane wrote:
> Peter Eisentraut <peter_e@gmx.net> writes:
>> Am Freitag, 12. Oktober 2007 schrieb Gregory Stark:
>>> It would make Postgres inconsistent and less integrated with the rest of
>>> the OS. How do you explain that Postgres doesn't follow the system's
>>> configurations and the collations don't agree with the system collations?
> 
>> We already have our own encoding support (for better or worse), and I don't 
>> think having one's own locale support would be that much different.
> 
> Well, yes it would be, because encodings are pretty well standardized;
> there is not likely to be any user-visible difference between one
> platform's idea of UTF8 and another's.  This is very very far from being
> the case for locales.  See for instance the recent thread in which we
> found out that "en_US" locale has utterly different sort orders on
> Linux and OS X.

For me, this paragraph is more of an argument *in favour* of having our own 
locale support. At least for me, consistency between PG running on different 
platforms would bring more benefits than consistency between PG and the platform 
it runs on.

At the company I used to work for, we had all our databases running with 
encoding=utf-8 and locale=C, because I didn't want our applications to depend on 
platform-specific locale issues. Plus, some of the applications supported 
multiple languages, making a cluster-global locale unworkable anyway - a 
restriction which would go away if we went with ICU.

regards, Florian Pflug

Re: Windows and locales and UTF-8 (oh my)

From: Magnus Hagander
Date: October 15, 2007, 09:10:39

On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> I've been learning much more than I wanted to know about $SUBJECT
> since putting in the src/port/chklocale.c code to try to enforce
> that our database encoding matches the system locale settings.
> There's an ongoing thread in -patches that's been focused on
> getting reasonable behavior from the point of view of the Far
> Eastern contingent:
> http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
> (Some of that's been applied, but not the very latest proposals.)
> Here's some more info from an off-list discussion with Dave Page:

Sorry for the late response to this. I missed the beginning and then got
mixed up in the different threads going around :-)

> Tom Lane wrote:
> > Dave Page <dpage@postgresql.org> writes:
> >> So, my test prog (below) returns the following:
> > 
> >> Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
> >> LC_COLLATE=English_United
> >> Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United
> >> Kingdom.65001;LC_NUMERIC=English_United
> >> Kingdom.65001;LC_TIME=English_United Kingdom.65001
> > 
> > That's just frickin' weird ... and a bit scary.  There is a fair amount
> > of code in PG that checks for lc_ctype_is_c and does things differently;
> > one wonders if that isn't going to get misled by this behavior.  (Hmm,
> > maybe this explains some of the "upper/lower doesn't work" reports we've
> > been getting??)  Are you sure all variants of Windows act that way?
> 
> All the ones we support afaict.

AFAICT, this has been standard behaviour in Windows since forever. Certainly
since Windows 2000 which is what we care about.

Windows 9x had different ways of dealing with it since they weren't native
UTF16 internally, but that doesn't matter to us here.

> >> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> > 
> > Is there something in Windows that constrains them to be all the same?
> > If not this proposal seems just plain wrong :-(  But in any case I'd
> > feel more comfortable having it look at LC_COLLATE.
> 
> They can all be set independently - it's just that there's no UTF-7
> (65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
> defining them fully so Windows doesn't know any more than the characters
> that are in both 'pseudo codepages'.
> 
> As a result, you can't set LC_CTYPE to .65001 because Windows knows it
> can't handle ToUpper() or ToLower() etc. but you can use it to encode
> messages and other text.

Yes. And also important, you can set LC_COLLATE to it, which will make all
the UTF16 versions of the functions behave properly.

Remember - all the Windows NT+ operations are UTF16 internally. So when you
set LC_TIME to it, for example, the API functions will generate the
resulting string in UTF16 and then convert it to whatever encoding you
chose - be it UTF8 or LATIN1 or whatever.

> I am thinking that Dave's discovery explains some previously unsolved
> bug reports, such as
> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> If Windows returns LC_CTYPE=C in a situation like this, then
> the various single-byte-charset optimization paths that are enabled by
> lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> upper()/lower() and other places.  ISTM we had better hack
> lc_ctype_is_c() so that on Windows (only), if the database encoding
> is UTF-8 then it returns FALSE regardless of what setlocale says.

Yes, I think we need a change to that routine.

But. What about the case when we actually *have* locale=C and
encoding=UTF8. We need to care for that one somehow. Perhaps we should look
at LC_COLLATE instead (again, on Windows only. Possibly even only in the
windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?

> That still leaves me with a boatload of questions, though.  If we can't
> trust LC_CTYPE as an indicator of the system charset, what can we trust?
> In particular this seems to say that looking at LC_CTYPE for chklocale's
> purposes is completely useless; what do we look at instead?

GetACP() returns the "ANSI Codepage", which I *think* is what we're looking
for here. 
http://msdn2.microsoft.com/en-us/library/ms776259.aspx

We should be able to compare that to something?
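
One possible shape for such a comparison is a lookup from the code page number to an encoding name, which can then be checked against the database encoding. The sketch below is an assumption-laden illustration: the table is a small hand-picked subset, and the function names are hypothetical. Only GetACP() itself is a real Win32 call, and it is compiled in only on Windows.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Illustrative subset of Windows code pages and matching encoding names. */
static const struct
{
    unsigned    codepage;
    const char *encoding;
} codepage_map[] = {
    {1250, "WIN1250"},
    {1251, "WIN1251"},
    {1252, "WIN1252"},
    {65001, "UTF8"}
};

/* Returns the encoding name for a code page, or NULL if it is not in the table. */
const char *
encoding_for_codepage(unsigned codepage)
{
    size_t      i;

    for (i = 0; i < sizeof(codepage_map) / sizeof(codepage_map[0]); i++)
        if (codepage_map[i].codepage == codepage)
            return codepage_map[i].encoding;
    return NULL;
}

#ifdef _WIN32
/* Look up the encoding for the system's ANSI code page. */
const char *
encoding_for_system_codepage(void)
{
    return encoding_for_codepage(GetACP());
}
#endif
```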

> Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
> different codepages and if so what happens?  If that does enable
> different bits of infrastructure to return incompatibly encoded strings,
> seems we need a defense against that --- what should it be?

AFAIK, yes, and then you get it back in the wrong encoding.
But as long as we set them to the same, we should be safe. And AFAIK, only
UTF8 (and UTF7, but we don't support that) is the special one we need to
care about.

> One bright spot is that this does seem to suggest a way to implement the
> recommendation I made in the -patches thread: if we can't support the
> encoding (codepage) used by the locale seen by initdb, we could try
> stripping the codepage indicator (if any) and plastering on .65001
> to get a UTF8-compatible locale name.  That'd only work on Windows
> but that seems the platform where we're most likely to see unsupportable
> default encodings.

Um, yes, that should work - assuming encoding is set to UTF8. We can't do
that for any other encoding, of course.

> Comments?  I don't have a Windows development environment so I'm not
> in a position to take the lead on testing/fixing this sort of stuff.

I have the Windows dev environment, but I feel like I'm in deep water
whenever I talk locale/encoding stuff really, I don't know it as well as
I'd like to. But I'm happy to do coding and testing if I can get enough
pointers on what I need to test :)

//Magnus

Re: Windows and locales and UTF-8 (oh my)

From: Magnus Hagander
Date: October 15, 2007, 11:26:13

On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > I am thinking that Dave's discovery explains some previously unsolved
> > bug reports, such as
> > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > If Windows returns LC_CTYPE=C in a situation like this, then
> > the various single-byte-charset optimization paths that are enabled by
> > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > upper()/lower() and other places.  ISTM we had better hack
> > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > is UTF-8 then it returns FALSE regardless of what setlocale says.
>
> Yes, I think we need a change to that routine.
>
> But. What about the case when we actually *have* locale=C and
> encoding=UTF8. We need to care for that one somehow. Perhaps we should look
> at LC_COLLATE instead (again, on Windows only. Possibly even only in the
> windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?

Hmm. Looking more at that, may there be another problem? Looking at
WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
will then be "C" - even if the database isn't in C.

But I don't really know when that code is called, or if I'm just looking at
things wrong. Just starting up and shutting down the database leaves it at
Swedish_Sweden.1252, not C.
(1252 is still the wrong encoding specifier, but it'll work anyway since we
convert to UTF16)

Now, I came across this trying to find a way for lc_ctype_is_c() to
determine if the database is in C locale or not, *without* resorting to
setlocale(). Any pointers on how to do that properly?

Also, any pointers on a way to check for the kind of failure that's to be
expected from this one returning the wrong thing?

> > One bright spot is that this does seem to suggest a way to implement the
> > recommendation I made in the -patches thread: if we can't support the
> > encoding (codepage) used by the locale seen by initdb, we could try
> > stripping the codepage indicator (if any) and plastering on .65001
> > to get a UTF8-compatible locale name.  That'd only work on Windows
> > but that seems the platform where we're most likely to see unsupportable
> > default encodings.
>
> Um, yes, that should work - assuming encoding is set to UTF8. We can't do
> that for any other encoding, of course.

Looking at that, we don't actually need to put that at the end of the
locale-name - all locale names will work with UTF8, even one specifying
1252.

Attached patch seems to work for me for that part. Still doesn't touch
lc_ctype_is_c().

//Magnus

Attachments

win32_utf8.patch

Re: Windows and locales and UTF-8 (oh my)

From: Magnus Hagander
Date: October 15, 2007, 11:40:25

On Mon, Oct 15, 2007 at 01:26:00PM +0200, Magnus Hagander wrote:
> On Mon, Oct 15, 2007 at 11:09:54AM +0200, Magnus Hagander wrote:
> > On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
> > > I am thinking that Dave's discovery explains some previously unsolved
> > > bug reports, such as
> > > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> > > If Windows returns LC_CTYPE=C in a situation like this, then
> > > the various single-byte-charset optimization paths that are enabled by
> > > lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
> > > upper()/lower() and other places.  ISTM we had better hack
> > > lc_ctype_is_c() so that on Windows (only), if the database encoding
> > > is UTF-8 then it returns FALSE regardless of what setlocale says.
> > 
> > Yes, I think we need a change to that routine.
> > 
> > But. What about the case when we actually *have* locale=C and
> > encoding=UTF8. We need to care for that one somehow. Perhaps we should look
> > at LC_COLLATE instead (again, on Windows only. Possibly even only in the
> > windows+locale_returns_c+encoding=utf8 case, to distinguish these two)?
> 
> Hmm. Looking more at that, may there be another problem? Looking at
> WriteControlFile(), it writes out what setlocale(LC_CTYPE) returns, which
> will then be "C" - even if the database isn't in C.
> 
> But I don't really know when that code is called, or if I'm just looking at
> things wrong. Just starting up and shutting down the database leaves it at
> Swedish_Sweden.1252, not C.
> (1252 is still the wrong encoding specifier, but it'll work anyway since we
> convert to UTF16)

Gah, got that backwards. Of course it does, because it only returns "C" if
we set to Swedish_Sweden.65001, and we don't *do* that with the patch I
sent in earlier. We set it to Swedish_Sweden, which is a perfectly valid
LC_CTYPE.

And given that, do we even need to special-case lc_ctype_is_c() at all? If
we never pass in a .65001 locale (which we don't, because it fails)?

//Magnus

Re: Windows and locales and UTF-8 (oh my)

From: Tom Lane
Date: October 15, 2007, 16:18:50

Magnus Hagander <magnus@hagander.net> writes:
>>>> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
>>>> I am thinking that Dave's discovery explains some previously unsolved
>>>> bug reports, such as
>>>> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
> ...
> And given that, do we even need to special-case lc_ctype_is_c() at all? If
> we never pass in a .65001 locale (which we don't, because it fails)?

Hmm.  If it doesn't need a special case, then we still lack an
explanation for the aforementioned bug report.
        regards, tom lane

Re: Windows and locales and UTF-8 (oh my)

From: Magnus Hagander
Date: October 15, 2007, 16:55:45

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>>>>> On Sat, Oct 06, 2007 at 01:53:31PM -0400, Tom Lane wrote:
>>>>> I am thinking that Dave's discovery explains some previously unsolved
>>>>> bug reports, such as
>>>>> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
>> ...
>> And given that, do we even need to special-case lc_ctype_is_c() at all? If
>> we never pass in a .65001 locale (which we don't, because it fails)?
> 
> Hmm.  If it doesn't need a special case, then we still lack an
> explanation for the aforementioned bug report.

From what I can tell that report doesn't tell us very much - we don't
know server encoding, we don't know server locale, we don't even know
client encoding. So I don't think we know anywhere *near* enough to say
it's related to this.

//Magnus

Re: Windows and locales and UTF-8 (oh my)

From: Tom Lane
Date: October 15, 2007, 17:18:11

Magnus Hagander <magnus@hagander.net> writes:
> Tom Lane wrote:
>> Hmm.  If it doesn't need a special case, then we still lack an
>> explanation for the aforementioned bug report.

> From what I can tell that report doesn't tell us very much - we don't
> know server encoding, we don't know server locale, we don't even know
> client encoding. So I don't think we know anywhere *near* enough to say
> it's related to this.

In the followup we found out that he was using UTF-8 encoding:
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
So while that report certainly left a great deal to be desired in terms
of precision, my gut tells me it's related.  Has anyone tried to
reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
Windows locale?
        regards, tom lane

Re: Windows and locales and UTF-8 (oh my)

From

Magnus Hagander

Date:

15 October 2007, 17:44:17

Tom Lane wrote:
> Magnus Hagander <magnus@hagander.net> writes:
>> Tom Lane wrote:
>>> Hmm.  If it doesn't need a special case, then we still lack an
>>> explanation for the aforementioned bug report.
> 
>> From what I can tell that report doesn't tell us very much - we don't
>> know server encoding, we don't know server locale, we don't even know
>> client encoding. So I don't think we know anywhere *near* enough to say
>> it's related to this.
> 
> In the followup we found out that he was using UTF-8 encoding:
> http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
> So while that report certainly left a great deal to be desired in terms
> of precision, my gut tells me it's related.  Has anyone tried to
> reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
> Windows locale?

It doesn't tell us if it's the client or the server that's in UTF8, and
it doesn't tell us about the locale.

Euler Taveira de Oliveira's response says he can't reproduce it. I
haven't tried myself, and that webpage really doesn't tell us what
the character is. If someone can comment on that, I can try to repro it
on my systems.

//Magnus

Re: Windows and locales and UTF-8 (oh my)

From

Magnus Hagander

Date:

16 October 2007, 11:13:22

On Mon, Oct 15, 2007 at 07:44:00PM +0200, Magnus Hagander wrote:
> Tom Lane wrote:
> > Magnus Hagander <magnus@hagander.net> writes:
> >> Tom Lane wrote:
> >>> Hmm.  If it doesn't need a special case, then we still lack an
> >>> explanation for the aforementioned bug report.
> > 
> >> From what I can tell that report doesn't tell us very much - we don't
> >> know server encoding, we don't know server locale, we don't even know
> >> client encoding. So I don't think we know anywhere *near* enough to say
> >> it's related to this.
> > 
> > In the followup we found out that he was using UTF-8 encoding:
> > http://archives.postgresql.org/pgsql-bugs/2007-05/msg00264.php
> > So while that report certainly left a great deal to be desired in terms
> > of precision, my gut tells me it's related.  Has anyone tried to
> > reproduce that behavior by initdb'ing 8.2 in a suitable UTF-8-using
> > Windows locale?
> 
> It doesn't tell us if it's the client or the server that's in UTF8, and
> it doesn't tell us about the locale.
> 
> Euler Taveira de Oliveira's response says he can't reproduce it. I
> haven't tried myself, and that webpage really doesn't tell us what
> the character is. If someone can comment on that, I can try to repro it
> on my systems.

Got some help on IRC to identify the characters as ç and Ç.

I can confirm that both work perfectly fine with UTF-8 and locale
Swedish_Sweden.1252. They sort correctly, and they work with both upper()
and lower() correctly. 

This test is with 8.3-HEAD and the patch to allow UTF-8.

This leads me to believe that something is wrong with the OP's system. Most
likely it's just the client that's in UTF8 mode, and the server is
SQL_ASCII.

//Magnus

Re: Windows and locales and UTF-8 (oh my)

From

Euler Taveira de Oliveira

Date:

17 October 2007, 21:33:08

Magnus Hagander wrote:

> Got some help on IRC to identify the characters as ç and Ç.
> 
Exactly.

> I can confirm that both work perfectly fine with UTF-8 and locale
> Swedish_Sweden.1252. They sort correctly, and they work with both upper()
> and lower() correctly. 
> 
I don't remember what the locale was. I'll check it.

> This test is with 8.3-HEAD and the patch to allow UTF-8.
> 
I tested with 8.2.4 and my encoding is LATIN1 IIRC. Didn't try UTF-8.

I'll give it a try when I have my dev environment available.


--  Euler Taveira de Oliveira http://www.timbira.com/
