Windows and locales and UTF-8 (oh my)

From: Tom Lane
Subject: Windows and locales and UTF-8 (oh my)
Msg-id: 26692.1191693211@sss.pgh.pa.us
Responses: Locales and Encodings  (Gregory Stark <stark@enterprisedb.com>)
           Re: Windows and locales and UTF-8 (oh my)  (Magnus Hagander <magnus@hagander.net>)
List: pgsql-hackers

I've been learning much more than I wanted to know about $SUBJECT
since putting in the src/port/chklocale.c code to try to enforce
that our database encoding matches the system locale settings.
There's an ongoing thread in -patches that's been focused on
getting reasonable behavior from the point of view of the Far
Eastern contingent:
http://archives.postgresql.org/pgsql-patches/2007-10/msg00031.php
(Some of that's been applied, but not the very latest proposals.)
Here's some more info from an off-list discussion with Dave Page:

------- Forwarded Messages

Date:    Fri, 05 Oct 2007 20:54:04 +0100
From:    Dave Page <dpage@postgresql.org>
To:      Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Dave Page wrote:
> Some further info on that - utf-8 on Windows is actually a
> pseudo-codepage (65001) which doesn't have NLS files, hence why we have
> to convert to utf-16 before sorting. Perhaps the utf-8/65001 name
> difference is the problem here. I'll knock up a quick test program when
> the kids have gone to bed.

So, my test prog (below) returns the following:

Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001

So everything other than LC_CTYPE is acceptable in UTF-8 on Windows -
and we already handle LC_CTYPE for UTF-8 on Windows through our UTF-8 ->
UTF-16 conversions internally.

Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?

Regards, Dave.


#include <locale.h>
#include <stdio.h>

int
main(int argc, char *argv[])
{
    char *lc;

    if (argc > 1)
        setlocale(LC_ALL, argv[1]);   /* try to set every category */
    lc = setlocale(LC_ALL, NULL);     /* query the resulting settings */
    printf("%s\n", lc);
    return 0;
}
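
A minimal sketch of how the composite string that setlocale(LC_ALL, NULL) returns could be picked apart per category, assuming the NAME=value;NAME=value layout seen in the output quoted in this message. The helper name composite_locale_value is hypothetical, and other C runtimes may format the string differently.

```c
#include <string.h>

/*
 * Hypothetical helper: copy the value of one category (e.g. "LC_CTYPE")
 * out of a composite "NAME=value;NAME=value" locale string.
 * Returns 1 on success, 0 if the category is absent or buf is too small.
 */
static int
composite_locale_value(const char *composite, const char *category,
                       char *buf, size_t buflen)
{
    size_t      catlen = strlen(category);
    const char *p = composite;

    while (*p)
    {
        /* Each segment runs up to the next ';' or the end of the string */
        const char *end = strchr(p, ';');
        size_t      len = end ? (size_t) (end - p) : strlen(p);

        /* Match "category=" at the start of this segment */
        if (len > catlen && strncmp(p, category, catlen) == 0 &&
            p[catlen] == '=')
        {
            size_t      vallen = len - catlen - 1;

            if (vallen >= buflen)
                return 0;
            memcpy(buf, p + catlen + 1, vallen);
            buf[vallen] = '\0';
            return 1;
        }

        /* Advance past this segment and its separator, if any */
        p += len;
        if (*p == ';')
            p++;
    }
    return 0;
}
```

Asking this helper for "LC_CTYPE" would make the odd category easy to spot without reading the whole string by eye.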

------- Message 2

Date:    Fri, 05 Oct 2007 23:32:36 +0100
From:    Dave Page <dpage@postgresql.org>
To:      Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [CORE] 8.3beta1 Available ...

Tom Lane wrote:
> Dave Page <dpage@postgresql.org> writes:
>> So, my test prog (below) returns the following:
> 
>> Dave@SNAKE:~$ ./setlc "English_United Kingdom.65001"
>> LC_COLLATE=English_United Kingdom.65001;LC_CTYPE=C;LC_MONETARY=English_United Kingdom.65001;LC_NUMERIC=English_United Kingdom.65001;LC_TIME=English_United Kingdom.65001
> 
> That's just frickin' weird ... and a bit scary.  There is a fair amount
> of code in PG that checks for lc_ctype_is_c and does things differently;
> one wonders if that isn't going to get misled by this behavior.  (Hmm,
> maybe this explains some of the "upper/lower doesn't work" reports we've
> been getting??)  Are you sure all variants of Windows act that way?

All the ones we support afaict.

>> Can we change initdb to test against LC_TIME instead of LC_CTYPE perhaps?
> 
> Is there something in Windows that constrains them to be all the same?
> If not this proposal seems just plain wrong :-(  But in any case I'd
> feel more comfortable having it look at LC_COLLATE.

They can all be set independently - it's just that there are no UTF-7
(65000) or UTF-8 (65001) NLS files (http://shlimazl.nm.ru/eng/nls.htm)
defining them fully, so Windows doesn't know any more than the characters
that are in both 'pseudo codepages'.

As a result, you can't set LC_CTYPE to .65001 because Windows knows it
can't handle ToUpper() or ToLower() etc. but you can use it to encode
messages and other text.

/D

------- End of Forwarded Messages

I am thinking that Dave's discovery explains some previously unsolved
bug reports, such as
http://archives.postgresql.org/pgsql-bugs/2007-05/msg00260.php
If Windows returns LC_CTYPE=C in a situation like this, then
the various single-byte-charset optimization paths that are enabled by
lc_ctype_is_c() would be mistakenly used, leading to misbehavior in
upper()/lower() and other places.  ISTM we had better hack
lc_ctype_is_c() so that on Windows (only), if the database encoding
is UTF-8 then it returns FALSE regardless of what setlocale says.
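
A rough sketch of that rule, for discussion only. This is not the actual lc_ctype_is_c() code: the sketch_ names and the encoding enum are hypothetical stand-ins, and the Windows test is a runtime parameter here (real code would presumably use #ifdef WIN32) so the logic can be exercised on any platform.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the database encoding; not PostgreSQL's own enum */
typedef enum
{
    SKETCH_ENC_SQL_ASCII,
    SKETCH_ENC_LATIN1,
    SKETCH_ENC_UTF8
} sketch_encoding;

/*
 * ctype_locale is what setlocale(LC_CTYPE, NULL) reported.
 * Returns true only when the single-byte "C locale" shortcuts look safe.
 */
static bool
sketch_lc_ctype_is_c(const char *ctype_locale, sketch_encoding db_encoding,
                     bool on_windows)
{
    /* Windows reports LC_CTYPE=C for .65001 locales, so don't believe it */
    if (on_windows && db_encoding == SKETCH_ENC_UTF8)
        return false;

    return strcmp(ctype_locale, "C") == 0 ||
           strcmp(ctype_locale, "POSIX") == 0;
}
```

The override is keyed on the database encoding rather than on the locale name, since the locale name is exactly the thing that cannot be trusted in this case.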

That still leaves me with a boatload of questions, though.  If we can't
trust LC_CTYPE as an indicator of the system charset, what can we trust?
In particular this seems to say that looking at LC_CTYPE for chklocale's
purposes is completely useless; what do we look at instead?

Another issue: is it possible to set, say, LC_MESSAGES and LC_TIME to
different codepages and if so what happens?  If that does enable
different bits of infrastructure to return incompatibly encoded strings,
seems we need a defense against that --- what should it be?
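
One conceivable defense, sketched only to make the question concrete: compare the codepage suffix of each category's locale name and refuse mixed settings. The function names are hypothetical, and treating the text after the last '.' as the codepage is an assumption about Windows-style locale names.

```c
#include <stdbool.h>
#include <string.h>

/* Return the text after the last '.' in a locale name, or NULL if none */
static const char *
locale_codepage(const char *locale_name)
{
    const char *dot = strrchr(locale_name, '.');

    return (dot && dot[1] != '\0') ? dot + 1 : NULL;
}

/*
 * True if every name that carries a codepage suffix carries the same one.
 * Names with no suffix (such as "C") are skipped rather than counted as a
 * mismatch; that is a policy choice of this sketch, not a settled answer.
 */
static bool
locale_codepages_agree(const char *const *names, int n)
{
    const char *first = NULL;
    int         i;

    for (i = 0; i < n; i++)
    {
        const char *cp = locale_codepage(names[i]);

        if (cp == NULL)
            continue;
        if (first == NULL)
            first = cp;
        else if (strcmp(first, cp) != 0)
            return false;
    }
    return true;
}
```

Whether suffix-less names should be skipped or rejected depends on the answer to the question above, which this sketch does not settle.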

One bright spot is that this does seem to suggest a way to implement the
recommendation I made in the -patches thread: if we can't support the
encoding (codepage) used by the locale seen by initdb, we could try
stripping the codepage indicator (if any) and plastering on .65001
to get a UTF8-compatible locale name.  That'd only work on Windows
but that seems the platform where we're most likely to see unsupportable
default encodings.
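
The name rewriting itself is simple string surgery; a minimal sketch follows, with utf8_locale_name as a hypothetical helper name. It assumes the codepage indicator is whatever follows the last '.' in the locale name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build a UTF-8 variant of a Windows locale name: drop any ".codepage"
 * suffix and append ".65001".
 * Returns 1 on success, 0 if buf is too small for the result.
 */
static int
utf8_locale_name(const char *locale_name, char *buf, size_t buflen)
{
    const char *dot = strrchr(locale_name, '.');
    size_t      baselen = dot ? (size_t) (dot - locale_name)
                              : strlen(locale_name);
    int         n;

    /* "%.*s" copies only the part of the name before the suffix */
    n = snprintf(buf, buflen, "%.*s.65001", (int) baselen, locale_name);
    return n >= 0 && (size_t) n < buflen;
}
```

The result would still have to be handed to setlocale() to find out whether Windows accepts it, and names such as "C" would need to be excluded before calling this.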

Comments?  I don't have a Windows development environment so I'm not
in a position to take the lead on testing/fixing this sort of stuff.
        regards, tom lane

