Thread: Mixing different LC_COLLATE and database encodings
I've been going through the docs and list archives trying to get clear
on encoding issues, but I still have a few questions. Do I have these
statements correct?

- LC_COLLATE is set on the cluster at initdb time. From that point on,
  all database text is sorted based on that *regardless* of the
  encoding set on an individual database.

- So for lc_collate="C" sorting is just based on the byte values, and
  if lc_collate="en_US" then sorting is based on the en_US order, and
  the bytes are assumed to be 8859-1 (if that matters).

- To clarify the first point, if the database is encoded utf-8 and
  lc_collate is en_US, then PostgreSQL does NOT try to convert utf-8
  to 8859-1 before sorting.

- If the "client encoding" and the database encoding differ, then
  PostgreSQL will convert between the two encodings during I/O.

- The exception is that if *either* the client's or the server's
  encoding is "SQL_ASCII", no client<=>server conversion is done.

Sound about right?

1) What else is the database's encoding used for, besides determining
how to convert text on input and output based on the client encoding?

2) What client encoding is used if the client does not specify one?
For example, I'm looking through Perl's DBD::Pg and I don't see any
place where it calls PQsetClientEncoding(). (One way to work around
that is sketched after this message.)

http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/
http://www.postgresql.org/docs/7.4/interactive/multibyte.html#MULTIBYTE-TRANSLATION-TABLE

3) The vast majority of my utf-8 encoded text that I need to display
sorted probably maps to 8859-1 characters. I think I already answered
this above, but: am I correct that PostgreSQL is *not* converting text
from the database encoding to the encoding the cluster's locale
assumes before sorting? That is, with "C" it's just sorting in byte
order, and with en_US it's just assuming that the bytes are 8859-1 and
ignoring that they're really utf-8? So if I have text that's in utf-8
but includes characters that would map to 8859-1 (say accented chars),
the sorting will not be correct, because the text is not converted to
8859-1 when sorting?

4) If the above is true, then if I wanted my utf-8 encoded text to be
sorted correctly, I'd need to re-initdb using --locale=en_US.UTF-8,
correct?

5) I suppose there's no way to answer this short of running
benchmarks, but any idea what using an lc_collate with utf-8 would do
to performance? Is it a big hit?

Not related to PostgreSQL, but testing some of this is confusing due
to my environment. How do I get my xterm to work with utf8? Does ssh
do something with encoding? If I have a utf8 xterm window open on my
machine and ssh to the server running PostgreSQL, where the default
locale is "POSIX", then running:

    LANG=en_US.utf8 psql utf8test

    utf8test=> \encoding
    UNICODE
    utf8test=> select first_name from person where last_name = 'Anderson';

I see:

    Zo<C3><AB>

But if on that same remote machine I run a unicode xterm (uxterm in
Debian), then in that xterm window:

    utf8test=> \encoding
    UNICODE
    utf8test=> select first_name from person where last_name = 'Anderson';

    Zoë     (correct)

It's much slower running xterm remotely than using my local xterm and
ssh, so it would be nice to be able to display the utf8 locally.

--
Bill Moseley
moseley@hank.org
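[A note on question 2: since DBD::Pg never calls PQsetClientEncoding(),
the usual way to influence the client encoding from outside the program
is libpq's PGCLIENTENCODING environment variable, which libpq reads at
connection startup. A minimal sketch, reusing the utf8test database
from the message above:

    $ PGCLIENTENCODING=LATIN1 psql utf8test
    utf8test=> \encoding
    LATIN1

Any libpq-based client, DBD::Pg included, picks this up the same way.]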
Bill Moseley wrote:
> Do I have these statements correct?

Yes.

> 1) What else is the database's encoding used for, besides determining
> how to convert text on input and output based on the client encoding?

Nothing.

> 2) What client encoding is used if the client does not specify one?

The server encoding.

> 3) The vast majority of my utf-8 encoded text that I need to display
> sorted probably maps to 8859-1 characters.

Probably not. :)

> So if I have text that's in utf-8 but includes characters that would
> map to 8859-1 (say accented chars), the sorting will not be correct,
> because the text is not converted to 8859-1 when sorting?

Right.

> 4) If the above is true, then if I wanted my utf-8 encoded text to be
> sorted correctly, I'd need to re-initdb using --locale=en_US.UTF-8,
> correct?

Right.

> 5) I suppose there's no way to answer this short of running
> benchmarks, but any idea what using an lc_collate with utf-8 would do
> to performance? Is it a big hit?

I don't know why that would be a problem.

> Not related to PostgreSQL, but testing some of this is confusing due
> to my environment. How do I get my xterm to work with utf8? Does ssh
> do something with encoding?

I don't use xterm, so I'll skip the rest.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
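[To see the per-session default Peter describes, and to override it for
one session, a minimal sketch (output abbreviated):

    utf8test=> SHOW client_encoding;    -- defaults to the server encoding
     client_encoding
    -----------------
     UNICODE

    utf8test=> SET client_encoding TO 'LATIN1';
    SET
    utf8test=> SHOW client_encoding;
     client_encoding
    -----------------
     LATIN1

From this point on, the server converts to and from LATIN1 on I/O for
this session only.]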
On Sat, Feb 18, 2006 at 05:20:19PM +0100, Peter Eisentraut wrote:
> > 2) What client encoding is used if the client does not specify one?
>
> The server encoding.

What's the server encoding? The environment when the cluster is
started? How do you find out what it's running as?

Does that mean that if the encoding is anything other than "C",
PostgreSQL will convert? That is, if my database is utf8 and the
server is en_US, text will be sent to the client as 8859-1? No, that
can't be right, as that's not what I'm seeing. So I guess I'm not
clear on that point.

> > 5) I suppose there's no way to answer this short of running
> > benchmarks, but any idea what using an lc_collate with utf-8 would
> > do to performance? Is it a big hit?
>
> I don't know why that would be a problem.

Just that sorting utf8 is a bit more work than sorting raw bytes.

Thanks for the help,

--
Bill Moseley
moseley@hank.org
Bill Moseley wrote:
> What's the server encoding?

When you say "my database is in utf8", then "utf8" is the server
encoding.

> Does that mean that if the encoding is anything other than "C",
> PostgreSQL will convert?

C is a locale, not an encoding.

> Just that sorting utf8 is a bit more work than sorting raw bytes.

Sorting in the C locale is certainly faster, but for anything else
there won't be any noticeable difference, I would think.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
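[Two quick ways to find out what a cluster and its databases are
actually using, a sketch; lc_collate is exposed as a read-only
parameter on recent server versions:

    $ psql -l     -- lists every database together with its encoding

    utf8test=> SELECT getdatabaseencoding();
     getdatabaseencoding
    ---------------------
     UNICODE

    utf8test=> SHOW lc_collate;    -- the cluster-wide collation locale
     lc_collate
    ------------
     POSIX
]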
Bill Moseley <moseley@hank.org> writes:
> - To clarify the first point, if the database is encoded utf-8 and
> lc_collate is en_US, then PostgreSQL does NOT try to convert utf-8
> to 8859-1 before sorting.

Basically, this is a horribly bad idea and you should never do it.
The database encoding should always match what the locale assumes for
its character set (unless the locale is "C", which doesn't care).
We'd enforce that you never do it if we knew a portable way to
determine the character set assumed by an LC_COLLATE setting.

            regards, tom lane
On Sat, Feb 18, 2006 at 01:40:09PM -0500, Tom Lane wrote:
> Basically, this is a horribly bad idea and you should never do it.
> The database encoding should always match what the locale assumes
> for its character set (unless the locale is "C", which doesn't care).

What's a bad idea? Having an lc_collate on the cluster that doesn't
support the encodings in the databases?

> We'd enforce that you never do it if we knew a portable way to
> determine the character set assumed by an LC_COLLATE setting.

Again, not sure what "it" is, but I do find it confusing that the
cluster can have only one lc_collate while the databases on that
cluster can have more than one encoding. That's why I was asking how
PostgreSQL handles (possibly) different encodings.

Are you saying that if a database is encoded as utf8, then the cluster
should be initialized with something like en_US.utf8? And that all
databases on that cluster should be encoded the same?

I suspect I don't understand how LC_COLLATE works that well. I thought
the locale defines the order of the characters but not the encoding of
those characters, so I assumed the same locale would sort the same
characters the same way regardless of how they are encoded. Maybe
that's not the case:

    $ LC_ALL=en_US.UTF-8 locale charmap
    UTF-8

    $ LC_ALL=en_US locale charmap
    ISO-8859-1

    $ LC_ALL=C locale charmap
    ANSI_X3.4-1968

--
Bill Moseley
moseley@hank.org
Bill Moseley wrote:
> What's a bad idea? Having an lc_collate on the cluster that doesn't
> support the encodings in the databases?

Exactly.

> Again, not sure what "it" is, but I do find it confusing that the
> cluster can have only one lc_collate while the databases on that
> cluster can have more than one encoding.

It is confusing, so don't do it.

> That's why I was asking how PostgreSQL handles (possibly) different
> encodings.

It doesn't.

> Are you saying that if a database is encoded as utf8, then the
> cluster should be initialized with something like en_US.utf8? And
> that all databases on that cluster should be encoded the same?

Yes.

> I thought the locale defines the order of the characters but not the
> encoding of those characters.

In theory, they are independent concepts. But in practice, the C
library gets a bunch of bytes from the application (in this case, the
PostgreSQL server) and is asked to sort them, so it needs to know what
those bytes are supposed to mean. By design of the POSIX locale
facilities, the C library is told that by way of the locale. It would
be much simpler for everyone if there were a function
strcmp(string1, string2, collation, encoding), but there isn't.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
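[So the safe recipe, per Tom's and Peter's advice, is to pick the
locale and a matching encoding together at initdb time. A sketch; the
data directory path is just an example:

    $ initdb -D /usr/local/pgsql/data --locale=en_US.UTF-8 --encoding=UNICODE

Databases created afterwards default to the template database's
encoding, so everything on the cluster stays consistent with the
character set the locale assumes.]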
Bill Moseley <moseley@hank.org> writes:
> $ LC_ALL=en_US.UTF-8 locale charmap
> UTF-8
>
> $ LC_ALL=en_US locale charmap
> ISO-8859-1
>
> $ LC_ALL=C locale charmap
> ANSI_X3.4-1968

Unfortunately Postgres only supports a single collation cluster-wide,
so depending on which of the collations above you use, you really have
to select the matching encoding: UTF-8, ISO-8859-1, or SQL_ASCII
(i.e. ANSI_X3.4-1968). Anything else and the collation just won't work
properly: it will be expecting UTF-8 and be fed ISO-8859-1 strings,
resulting in weird and sometimes inconsistent sort orders.

There's a certain amount of feeling that using any locale other than C
is probably not ever the right thing given the current functionality.
Just about any database has some strings in it that are really just
ascii strings, like char(1) primary keys and other internal database
strings. You may not want them being subject to the locale's collation
for comparison purposes, and you may not want the overhead of
variable-width character encodings.

Those of us in this camp are defining all our databases using the C
locale and then using the pg_strxfrm() function that's been floating
around the list for a while to handle sorting strings that need to be
sorted in various locales. This performs acceptably (but not
spectacularly) under glibc, but it's not clear which other libc
implementations it works well under.

It also doesn't solve the whole problem, since functions like substr()
or LIKE are locale sensitive too. If you need an encoding like UTF-8,
you're stuck either pushing all your string manipulations into the
client or going ahead with a non-C locale and UTF-8 even for the
strings that are really just ascii strings.

--
greg
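[Greg's caveat about string functions is easy to see: in a C-locale,
SQL_ASCII database the string functions count bytes, not characters,
so multibyte UTF-8 data gets miscounted. A sketch; the ascii_test
database is hypothetical, the input is UTF-8:

    ascii_test=> SELECT length('Zoë');      -- 'ë' is two bytes in UTF-8
     length
    --------
          4

    ascii_test=> SELECT substr('Zoë', 3, 1);  -- counts bytes: returns only
                                              -- the first byte of 'ë'
]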
On Sat, Feb 18, 2006 at 09:31:27PM -0500, Greg Stark wrote:
> Anything else and the collation just won't work properly: it will be
> expecting UTF-8 and be fed ISO-8859-1 strings, resulting in weird
> and sometimes inconsistent sort orders.

So if I have utf8-encoded text and the lc_collate is anything but a
utf8 locale, then sorting will be all wrong for any chars that don't
map to ASCII (>127). Kind of a mess. (A demonstration follows this
message.)

> There's a certain amount of feeling that using any locale other than
> C is probably not ever the right thing given the current
> functionality. Just about any database has some strings in it that
> are really just ascii strings, like char(1) primary keys and other
> internal database strings. You may not want them being subject to
> the locale's collation for comparison purposes, and you may not want
> the overhead of variable-width character encodings.

Is the Holy Grail encoding and lc_collate settings per column?

Changing topics, but I'm going to play with different cluster settings
for collate. If I create a cluster in a given directory, are there any
problems with moving that cluster later (renaming the directory)?

Thanks for your comments, Greg.

--
Bill Moseley
moseley@hank.org
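[What "all wrong" looks like in practice: under a C (byte-order)
locale, every multibyte UTF-8 character sorts after all of ASCII,
because its first byte is >= 0x80. A sketch with made-up rows in the
person table from earlier in the thread:

    utf8test=> SELECT first_name FROM person ORDER BY first_name;
     first_name
    ------------
     Zoa
     Zoz
     Zoë

'ë' is the bytes C3 AB, so 'Zoë' lands after 'Zoz' instead of between
'Zoa' and 'Zoz', which is where en_US rules would put it.]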
On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> Is the Holy Grail encoding and lc_collate settings per column?

Well, yes. I've been trying to create a system where you can handle
multiple collations in the same database. I posted the details to
-hackers and got part of the way, but it's a lot of work.

As for encodings, to be honest I'm not sure whether it's a great idea
to support multiple encodings simultaneously. Things become a lot
easier if you know everything is in the same encoding. If you set the
client_encoding automatically on startup, it has pretty much the same
effect as having the server always use that encoding. It's just a bit
of time wasted in conversion, but the client doesn't need to care.

By way of example, see ICU, which is an internationalisation library
we're considering for getting consistent locale support across all
platforms. It supports one encoding, namely UTF-16. It has various
functions to convert other encodings to or from that, but internally
it's all UTF-16. So if we do use it, then all encodings (except native
UTF-16) will need conversion all the time, so you don't buy anything
by having the server in some random encoding.

The problem of course being that the SQL standard requires some
encoding support. No one has really come up with a proposal for that
yet. IMHO, that's a parser issue more than anything else.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
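[One way to get the "set the client_encoding automatically on startup"
behaviour Martijn mentions without touching every client, a sketch
that assumes client_encoding is accepted as an ordinary per-database
setting:

    utf8test=> ALTER DATABASE utf8test SET client_encoding TO 'LATIN1';
    ALTER DATABASE

Every new session on utf8test then starts out with client_encoding
LATIN1 unless the client explicitly overrides it.]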
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> > Is the Holy Grail encoding and lc_collate settings per column?
>
> By way of example, see ICU, which is an internationalisation library
> we're considering for getting consistent locale support across all
> platforms. It supports one encoding, namely UTF-16. It has various
> functions to convert other encodings to or from that, but internally
> it's all UTF-16. So if we do use it, then all encodings (except
> native UTF-16) will need conversion all the time, so you don't buy
> anything by having the server in some random encoding.

Ugh. At least from my perspective that makes it a non-starter. As I'm
sure you realize, storage density is a major factor, often the
dominant factor, in database performance. Anything that would double
the storage size of ascii foreign keys is going to be a terrible hit.

And having to do an ascii-to-UTF-16 conversion for every foreign key
constraint check would be nearly as bad. I know it's a simple
conversion, but compared to a simple strcmp() in a critical code path
it's going to increase cpu usage significantly.

I'm still unclear what advantage adding yet another external library
dependency gains Postgres in this area. The bulk of the difficulties
seem to be on the user interface side, where it's unclear how to let
users control this functionality. It seems like the actual mechanics
of sorting in various locales can be handled using standard libc i18n
functions.

The one issue people have raised is that traditional libc functions
require switching a global state between locales, and not all
implementations support that well. But depending on a single
non-standard extension seems better than depending on a huge external
library, especially when the consequence of that non-standard
extension being missing is only that performance will suffer in a case
Postgres currently doesn't handle at all.

--
greg
On Mon, Feb 20, 2006 at 05:30:06PM -0500, Greg Stark wrote:
> Ugh. At least from my perspective that makes it a non-starter. As I'm
> sure you realize, storage density is a major factor, often the
> dominant factor, in database performance. Anything that would double
> the storage size of ascii foreign keys is going to be a terrible hit.
>
> And having to do an ascii-to-UTF-16 conversion for every foreign key
> constraint check would be nearly as bad. I know it's a simple
> conversion, but compared to a simple strcmp() in a critical code path
> it's going to increase cpu usage significantly.

I'm not sure why you're singling out foreign keys here, but one of the
motivations for the COLLATE stuff I'm working on is that you could
declare all the system catalogs as COLLATE 'C' and thus always use
strcmp(), and *only* in the case where the user explicitly says "I
want this column sorted using French rules" do we incur the overhead.
So your example would be fine. If we switched to ICU now the overhead
could be nasty; we need COLLATE first.

> I'm still unclear what advantage adding yet another external library
> dependency gains Postgres in this area. The bulk of the difficulties
> seem to be on the user interface side, where it's unclear how to let
> users control this functionality. It seems like the actual mechanics
> of sorting in various locales can be handled using standard libc
> i18n functions.

How about consistency across platforms? Isn't that the reason we went
for an external timezone library rather than using the system one?
How about not knowing what encoding libc actually expects for
strcoll()? How about supporting multiple collations within a single
database (say French and Russian)? For example, none of the BSDs or
Mac OS X support collations for UTF-8 locales. No one is complaining
now, but this seems untenable for the future.

> The one issue people have raised is that traditional libc functions
> require switching a global state between locales, and not all
> implementations support that well. But depending on a single
> non-standard extension seems better than depending on a huge
> external library.

The way I'm going at the moment is that ICU would be optional. Without
it, *BSD would be limited to what we do now: one locale per DB, no
changes. Linux, Mac OS X and Win32 would be able to support multiple
locales, whatever their systems support. With ICU, all platforms would
support the entire range ICU supports. If you don't like ICU, don't
use it.

I'm not going to play games with calling setlocale() to keep changing
state. You saw how Perl reacted to us playing with it. Better we stop
using setlocale() altogether and go with newlocale() wherever
possible.

The chance that ICU will be installed on your system grows by the day.
The facilities provided by ICU are so far ahead of what libc provides
that I'm not sure it's sensible to compare them.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
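[For concreteness, this is the kind of per-column declaration the
COLLATE work is aiming at. It is SQL-standard syntax but purely
hypothetical for PostgreSQL at this point; nothing below runs on a
released server:

    -- hypothetical: per-column collations as proposed
    CREATE TABLE phrases (
        fr  text COLLATE "fr_FR",
        ru  text COLLATE "ru_RU"
    );

    SELECT fr FROM phrases ORDER BY fr;   -- would sort by French rules

Meanwhile the system catalogs would stay on COLLATE 'C' and keep plain
strcmp() speed.]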
> On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> > Is the Holy Grail encoding and lc_collate settings per column?
>
> Well, yes. I've been trying to create a system where you can handle
> multiple collations in the same database. I posted the details to
> -hackers and got part of the way, but it's a lot of work.
>
> As for encodings, to be honest I'm not sure whether it's a great
> idea to support multiple encodings simultaneously. Things become a
> lot easier if you know everything is in the same encoding. If you
> set the client_encoding automatically on startup, it has pretty much
> the same effect as having the server always use that encoding. It's
> just a bit of time wasted in conversion, but the client doesn't need
> to care.
>
> By way of example, see ICU, which is an internationalisation library
> we're considering for getting consistent locale support across all
> platforms. It supports one encoding, namely UTF-16. It has various
> functions to convert other encodings to or from that, but internally
> it's all UTF-16. So if we do use it, then all encodings (except
> native UTF-16) will need conversion all the time, so you don't buy
> anything by having the server in some random encoding.
>
> The problem of course being that the SQL standard requires some
> encoding support. No one has really come up with a proposal for that
> yet. IMHO, that's a parser issue more than anything else.

If you are considering allowing only UTF-16, or whatever single
encoding, in the backend, I am strongly against the idea. We Japanese
need native support for our encodings. Converting between those
encodings and Unicode every time the backend and frontend talk to each
other would be a serious performance hit. Moreover, the conversion is
known not to be round-trip safe, which means some information can be
lost in the conversion.

Another point is the on-disk format: UTF-16 will require more storage
than the local encodings, and UTF-8 probably will too.

I have a feeling that ICU is good for applications, but not for DBMSs.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
On Tue, Feb 21, 2006 at 10:27:15AM +0900, Tatsuo Ishii wrote:
> If you are considering allowing only UTF-16, or whatever single
> encoding, in the backend, I am strongly against the idea. We
> Japanese need native support for our encodings. Converting between
> those encodings and Unicode every time the backend and frontend talk
> to each other would be a serious performance hit. Moreover, the
> conversion is known not to be round-trip safe, which means some
> information can be lost in the conversion. Another point is the
> on-disk format: UTF-16 will require more storage than the local
> encodings, and UTF-8 probably will too.

I didn't say that we would support only UTF-16 in the backend. I said
that when doing comparisons in a non-C locale, you have to convert to
UTF-16 to use ICU. If you don't want to use it, don't; it's not going
to be required at any point. It's just like the current situation on
Win32: if you use UTF-8, it has to be converted to UTF-16 prior to
string comparison.

The only time any of this is required is for *sorting*, and if you
have an index defined, it acts as a cache for the sorted values. Of
course there's a tradeoff, but unless you're sorting large datasets
all day I doubt it'll be noticeable. If you're not sorting, none of
this is relevant to you.

> I have a feeling that ICU is good for applications, but not for
> DBMSs.

I think providing a system where users are able to select from a large
range of possible collation orders, and if necessary specify their
own, is a worthy goal. Look at the complaints we get now and then from
people who choose en_US as their locale and are surprised when it
gives them a dictionary sort. ICU allows users to take an existing
collation and tweak it if it doesn't quite match their expectations.
You think this is not useful for a DBMS?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
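[As an illustration of the tailoring Martijn describes: ICU collations
can be customized with short rule strings layered on top of a base
locale. The classic example is traditional Spanish; the rule syntax
below follows ICU's collation customization format and is shown purely
as a sketch:

    & n < ñ     -- ñ sorts as its own letter after n
    & c < ch    -- "ch" sorts as a separate letter after c

A user unhappy with en_US's dictionary sort could, in principle,
adjust just the rules they disagree with instead of writing a whole
new locale.]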