Thread: Mixing different LC_COLLATE and database encodings
I've been going through the docs and list archives trying to get clear
on encoding issues, but I still have a few questions. Do I have these
statements correct?

- LC_COLLATE is set on the cluster at initdb time. From that point on,
  all database text is sorted based on that *regardless* of the
  encoding set on an individual database.

- So for lc_collate="C" sorting is just based on the byte values, and
  if lc_collate="en_US" then sorting is based on the en_US order, and
  the bytes are assumed to be 8859-1 (if that matters).

- To clarify the first point, if the database is encoded utf-8 and
  lc_collate is en_US, then PostgreSQL does NOT try to convert utf-8
  to 8859-1 before sorting.

- If the "client encoding" and the database encoding differ, then
  PostgreSQL will convert between the two encodings during I/O.

- The exception is that if *either* the client's or the server's
  encoding is "SQL_ASCII", no client<=>server conversion is done.

Sound about right?

1) What else is the database's encoding used for, besides determining
how to convert text on input and output based on the client encoding?

2) What client encoding is used if the client does not specify one?
For example, I'm looking through Perl's DBD::Pg and I don't see any
place where it calls PQsetClientEncoding(). (One way to work around
that is sketched after this message.)

http://search.cpan.org/src/DBDPG/DBD-Pg-1.41/
http://www.postgresql.org/docs/7.4/interactive/multibyte.html#MULTIBYTE-TRANSLATION-TABLE

3) The vast majority of my utf-8 encoded text that I need to display
sorted probably maps to 8859-1 characters. I think I already answered
this above, but: am I correct that PostgreSQL is *not* converting text
from the database encoding to the encoding the cluster's locale
assumes before sorting? That is, with "C" it's just sorting in byte
order, and with en_US it's just assuming that the bytes are 8859-1 and
ignoring that they're really utf-8? So if I have text that's in utf-8
but includes characters that would map to 8859-1 (say accented chars),
the sorting will not be correct, because the text is not converted to
8859-1 when sorting?

4) If the above is true, then if I wanted my utf-8 encoded text to be
sorted correctly, I'd need to re-initdb using --locale=en_US.UTF-8,
correct?

5) I suppose there's no way to answer this short of running
benchmarks, but any idea what using an lc_collate with utf-8 would do
to performance? Is it a big hit?

Not related to PostgreSQL, but testing some of this is confusing due
to my environment. How do I get my xterm to work with utf8? Does ssh
do something with encoding? If I have a utf8 xterm window open on my
machine and ssh to the server running PostgreSQL, where the default
locale is "POSIX", then running:

    LANG=en_US.utf8 psql utf8test

    utf8test=> \encoding
    UNICODE
    utf8test=> select first_name from person where last_name = 'Anderson';

I see:

    Zo<C3><AB>

But if on that same remote machine I run a unicode xterm (uxterm in
Debian), then in that xterm window:

    utf8test=> \encoding
    UNICODE
    utf8test=> select first_name from person where last_name = 'Anderson';

    Zoë     (correct)

It's much slower running xterm remotely than using my local xterm and
ssh, so it would be nice to be able to display the utf8 locally.

--
Bill Moseley
moseley@hank.org
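[A note on question 2: since DBD::Pg never calls PQsetClientEncoding(),
the usual way to influence the client encoding from outside the program
is libpq's PGCLIENTENCODING environment variable, which libpq reads at
connection startup. A minimal sketch, reusing the utf8test database
from the message above:

    $ PGCLIENTENCODING=LATIN1 psql utf8test
    utf8test=> \encoding
    LATIN1

Any libpq-based client, DBD::Pg included, picks this up the same way.]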
Bill Moseley wrote:
> Do I have these statements correct?

Yes.

> 1) What else is the database's encoding used for, besides determining
> how to convert text on input and output based on the client encoding?

Nothing.

> 2) What client encoding is used if the client does not specify one?

The server encoding.

> 3) The vast majority of my utf-8 encoded text that I need to display
> sorted probably maps to 8859-1 characters.

Probably not. :)

> So if I have text that's in utf-8 but includes characters that would
> map to 8859-1 (say accented chars), the sorting will not be correct,
> because the text is not converted to 8859-1 when sorting?

Right.

> 4) If the above is true, then if I wanted my utf-8 encoded text to be
> sorted correctly, I'd need to re-initdb using --locale=en_US.UTF-8,
> correct?

Right.

> 5) I suppose there's no way to answer this short of running
> benchmarks, but any idea what using an lc_collate with utf-8 would do
> to performance? Is it a big hit?

I don't know why that would be a problem.

> Not related to PostgreSQL, but testing some of this is confusing due
> to my environment. How do I get my xterm to work with utf8? Does ssh
> do something with encoding?

I don't use xterm, so I'll skip the rest.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
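[To see the per-session default Peter describes, and to override it for
one session, a minimal sketch (output abbreviated):

    utf8test=> SHOW client_encoding;    -- defaults to the server encoding
     client_encoding
    -----------------
     UNICODE

    utf8test=> SET client_encoding TO 'LATIN1';
    SET
    utf8test=> SHOW client_encoding;
     client_encoding
    -----------------
     LATIN1

From this point on, the server converts to and from LATIN1 on I/O for
this session only.]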
On Sat, Feb 18, 2006 at 05:20:19PM +0100, Peter Eisentraut wrote:
> > 2) What client encoding is used if the client does not specify one?
>
> The server encoding.

What's the server encoding? The environment when the cluster is
started? How do you find out what it's running as?

Does that mean that if the encoding is anything other than "C",
PostgreSQL will convert? That is, if my database is utf8 and the
server is en_US, text will be sent to the client as 8859-1? No, that
can't be right, as that's not what I'm seeing. So I guess I'm not
clear on that point.

> > 5) I suppose there's no way to answer this short of running
> > benchmarks, but any idea what using an lc_collate with utf-8 would
> > do to performance? Is it a big hit?
>
> I don't know why that would be a problem.

Just that sorting utf8 is a bit more work than sorting raw bytes.

Thanks for the help,

--
Bill Moseley
moseley@hank.org
Bill Moseley wrote:
> What's the server encoding?

When you say "my database is in utf8", then "utf8" is the server
encoding.

> Does that mean that if the encoding is anything other than "C",
> PostgreSQL will convert?

C is a locale, not an encoding.

> Just that sorting utf8 is a bit more work than sorting raw bytes.

Sorting in the C locale is certainly faster, but for anything else
there won't be any noticeable difference, I would think.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
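[Two quick ways to find out what a cluster and its databases are
actually using, a sketch; lc_collate is exposed as a read-only
parameter on recent server versions:

    $ psql -l     -- lists every database together with its encoding

    utf8test=> SELECT getdatabaseencoding();
     getdatabaseencoding
    ---------------------
     UNICODE

    utf8test=> SHOW lc_collate;    -- the cluster-wide collation locale
     lc_collate
    ------------
     POSIX
]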
Bill Moseley <moseley@hank.org> writes:
> - To clarify the first point, if the database is encoded utf-8 and
> lc_collate is en_US, then PostgreSQL does NOT try to convert utf-8
> to 8859-1 before sorting.

Basically, this is a horribly bad idea and you should never do it.
The database encoding should always match what the locale assumes for
its character set (unless the locale is "C", which doesn't care).
We'd enforce that you never do it if we knew a portable way to
determine the character set assumed by an LC_COLLATE setting.

            regards, tom lane
On Sat, Feb 18, 2006 at 01:40:09PM -0500, Tom Lane wrote:
> Basically, this is a horribly bad idea and you should never do it.
> The database encoding should always match what the locale assumes
> for its character set (unless the locale is "C", which doesn't care).

What's a bad idea? Having an lc_collate on the cluster that doesn't
support the encodings in the databases?

> We'd enforce that you never do it if we knew a portable way to
> determine the character set assumed by an LC_COLLATE setting.

Again, not sure what "it" is, but I do find it confusing that the
cluster can have only one lc_collate while the databases on that
cluster can have more than one encoding. That's why I was asking how
PostgreSQL handles (possibly) different encodings.

Are you saying that if a database is encoded as utf8, then the cluster
should be initialized with something like en_US.utf8? And that all
databases on that cluster should be encoded the same?

I suspect I don't understand how LC_COLLATE works that well. I thought
the locale defines the order of the characters but not the encoding of
those characters, so I assumed the same locale would sort the same
characters the same way regardless of how they are encoded. Maybe
that's not the case:

    $ LC_ALL=en_US.UTF-8 locale charmap
    UTF-8

    $ LC_ALL=en_US locale charmap
    ISO-8859-1

    $ LC_ALL=C locale charmap
    ANSI_X3.4-1968

--
Bill Moseley
moseley@hank.org
Bill Moseley wrote:
> What's a bad idea? Having an lc_collate on the cluster that doesn't
> support the encodings in the databases?

Exactly.

> Again, not sure what "it" is, but I do find it confusing that the
> cluster can have only one lc_collate while the databases on that
> cluster can have more than one encoding.

It is confusing, so don't do it.

> That's why I was asking how PostgreSQL handles (possibly) different
> encodings.

It doesn't.

> Are you saying that if a database is encoded as utf8, then the
> cluster should be initialized with something like en_US.utf8? And
> that all databases on that cluster should be encoded the same?

Yes.

> I thought the locale defines the order of the characters but not the
> encoding of those characters.

In theory, they are independent concepts. But in practice, the C
library gets a bunch of bytes from the application (in this case, the
PostgreSQL server) and is asked to sort them, so it needs to know what
those bytes are supposed to mean. By design of the POSIX locale
facilities, the C library is told that by way of the locale. It would
be much simpler for everyone if there were a function
strcmp(string1, string2, collation, encoding), but there isn't.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/
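[So the safe recipe, per Tom's and Peter's advice, is to pick the
locale and a matching encoding together at initdb time. A sketch; the
data directory path is just an example:

    $ initdb -D /usr/local/pgsql/data --locale=en_US.UTF-8 --encoding=UNICODE

Databases created afterwards default to the template database's
encoding, so everything on the cluster stays consistent with the
character set the locale assumes.]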
Bill Moseley <moseley@hank.org> writes:
> $ LC_ALL=en_US.UTF-8 locale charmap
> UTF-8
>
> $ LC_ALL=en_US locale charmap
> ISO-8859-1
>
> $ LC_ALL=C locale charmap
> ANSI_X3.4-1968

Unfortunately Postgres only supports a single collation cluster-wide,
so depending on which of the collations above you use, you really have
to select the matching encoding: UTF-8, ISO-8859-1, or SQL_ASCII
(i.e. ANSI_X3.4-1968). Anything else and the collation just won't work
properly: it will be expecting UTF-8 and be fed ISO-8859-1 strings,
resulting in weird and sometimes inconsistent sort orders.

There's a certain amount of feeling that using any locale other than C
is probably not ever the right thing given the current functionality.
Just about any database has some strings in it that are really just
ascii strings, like char(1) primary keys and other internal database
strings. You may not want them being subject to the locale's collation
for comparison purposes, and you may not want the overhead of
variable-width character encodings.

Those of us in this camp are defining all our databases using the C
locale and then using the pg_strxfrm() function that's been floating
around the list for a while to handle sorting strings that need to be
sorted in various locales. This performs acceptably (but not
spectacularly) under glibc, but it's not clear which other libc
implementations it works well under.

It also doesn't solve the whole problem, since functions like substr()
or LIKE are locale sensitive too. If you need an encoding like UTF-8,
you're stuck either pushing all your string manipulations into the
client or going ahead with a non-C locale and UTF-8 even for the
strings that are really just ascii strings.

--
greg
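[Greg's caveat about string functions is easy to see: in a C-locale,
SQL_ASCII database the string functions count bytes, not characters,
so multibyte UTF-8 data gets miscounted. A sketch; the ascii_test
database is hypothetical, the input is UTF-8:

    ascii_test=> SELECT length('Zoë');      -- 'ë' is two bytes in UTF-8
     length
    --------
          4

    ascii_test=> SELECT substr('Zoë', 3, 1);  -- counts bytes: returns only
                                              -- the first byte of 'ë'
]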
On Sat, Feb 18, 2006 at 09:31:27PM -0500, Greg Stark wrote:
> Anything else and the collation just won't work properly: it will be
> expecting UTF-8 and be fed ISO-8859-1 strings, resulting in weird
> and sometimes inconsistent sort orders.

So if I have utf8-encoded text and the lc_collate is anything but a
utf8 locale, then sorting will be all wrong for any chars that don't
map to ASCII (>127). Kind of a mess. (A demonstration follows this
message.)

> There's a certain amount of feeling that using any locale other than
> C is probably not ever the right thing given the current
> functionality. Just about any database has some strings in it that
> are really just ascii strings, like char(1) primary keys and other
> internal database strings. You may not want them being subject to
> the locale's collation for comparison purposes, and you may not want
> the overhead of variable-width character encodings.

Is the Holy Grail encoding and lc_collate settings per column?

Changing topics, but I'm going to play with different cluster settings
for collate. If I create a cluster in a given directory, are there any
problems with moving that cluster later (renaming the directory)?

Thanks for your comments, Greg.

--
Bill Moseley
moseley@hank.org
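[What "all wrong" looks like in practice: under a C (byte-order)
locale, every multibyte UTF-8 character sorts after all of ASCII,
because its first byte is >= 0x80. A sketch with made-up rows in the
person table from earlier in the thread:

    utf8test=> SELECT first_name FROM person ORDER BY first_name;
     first_name
    ------------
     Zoa
     Zoz
     Zoë

'ë' is the bytes C3 AB, so 'Zoë' lands after 'Zoz' instead of between
'Zoa' and 'Zoz', which is where en_US rules would put it.]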
On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> Is the Holy Grail encoding and lc_collate settings per column?

Well, yes. I've been trying to create a system where you can handle
multiple collations in the same database. I posted the details to
-hackers and got part of the way, but it's a lot of work.

As for encodings, to be honest I'm not sure whether it's a great idea
to support multiple encodings simultaneously. Things become a lot
easier if you know everything is in the same encoding. If you set the
client_encoding automatically on startup, it has pretty much the same
effect as having the server always use that encoding. It's just a bit
of time wasted in conversion, but the client doesn't need to care.

By way of example, see ICU, which is an internationalisation library
we're considering for getting consistent locale support across all
platforms. It supports one encoding, namely UTF-16. It has various
functions to convert other encodings to or from that, but internally
it's all UTF-16. So if we do use it, then all encodings (except native
UTF-16) will need conversion all the time, so you don't buy anything
by having the server in some random encoding.

The problem of course being that the SQL standard requires some
encoding support. No one has really come up with a proposal for that
yet. IMHO, that's a parser issue more than anything else.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
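[One way to get the "set the client_encoding automatically on startup"
behaviour Martijn mentions without touching every client, a sketch
that assumes client_encoding is accepted as an ordinary per-database
setting:

    utf8test=> ALTER DATABASE utf8test SET client_encoding TO 'LATIN1';
    ALTER DATABASE

Every new session on utf8test then starts out with client_encoding
LATIN1 unless the client explicitly overrides it.]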
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> > Is the Holy Grail encoding and lc_collate settings per column?
>
> By way of example, see ICU, which is an internationalisation library
> we're considering for getting consistent locale support across all
> platforms. It supports one encoding, namely UTF-16. It has various
> functions to convert other encodings to or from that, but internally
> it's all UTF-16. So if we do use it, then all encodings (except
> native UTF-16) will need conversion all the time, so you don't buy
> anything by having the server in some random encoding.

Ugh. At least from my perspective that makes it a non-starter. As I'm
sure you realize, storage density is a major factor, often the
dominant factor, in database performance. Anything that would double
the storage size of ascii foreign keys is going to be a terrible hit.

And having to do an ascii-to-UTF-16 conversion for every foreign key
constraint check would be nearly as bad. I know it's a simple
conversion, but compared to a simple strcmp() in a critical code path
it's going to increase cpu usage significantly.

I'm still unclear what advantage adding yet another external library
dependency gains Postgres in this area. The bulk of the difficulties
seem to be on the user interface side, where it's unclear how to let
users control this functionality. It seems like the actual mechanics
of sorting in various locales can be handled using standard libc i18n
functions.

The one issue people have raised is that traditional libc functions
require switching a global state between locales, and not all
implementations support that well. But depending on a single
non-standard extension seems better than depending on a huge external
library, especially when the consequence of that non-standard
extension being missing is only that performance will suffer in a case
Postgres currently doesn't handle at all.

--
greg
On Mon, Feb 20, 2006 at 05:30:06PM -0500, Greg Stark wrote:
> Ugh. At least from my perspective that makes it a non-starter. As I'm
> sure you realize, storage density is a major factor, often the
> dominant factor, in database performance. Anything that would double
> the storage size of ascii foreign keys is going to be a terrible hit.
>
> And having to do an ascii-to-UTF-16 conversion for every foreign key
> constraint check would be nearly as bad. I know it's a simple
> conversion, but compared to a simple strcmp() in a critical code path
> it's going to increase cpu usage significantly.

I'm not sure why you're singling out foreign keys here, but one of the
motivations for the COLLATE stuff I'm working on is that you could
declare all the system catalogs as COLLATE 'C' and thus always use
strcmp(), and *only* in the case where the user explicitly says "I
want this column sorted using French rules" do we incur the overhead.
So your example would be fine. If we switched to ICU now the overhead
could be nasty; we need COLLATE first.

> I'm still unclear what advantage adding yet another external library
> dependency gains Postgres in this area. The bulk of the difficulties
> seem to be on the user interface side, where it's unclear how to let
> users control this functionality. It seems like the actual mechanics
> of sorting in various locales can be handled using standard libc
> i18n functions.

How about consistency across platforms? Isn't that the reason we went
for an external timezone library rather than using the system one?
How about not knowing what encoding libc actually expects for
strcoll()? How about supporting multiple collations within a single
database (say French and Russian)? For example, none of the BSDs or
Mac OS X support collations for UTF-8 locales. No one is complaining
now, but this seems untenable for the future.

> The one issue people have raised is that traditional libc functions
> require switching a global state between locales, and not all
> implementations support that well. But depending on a single
> non-standard extension seems better than depending on a huge
> external library.

The way I'm going at the moment is that ICU would be optional. Without
it, *BSD would be limited to what we do now: one locale per DB, no
changes. Linux, Mac OS X and Win32 would be able to support multiple
locales, whatever their systems support. With ICU, all platforms would
support the entire range ICU supports. If you don't like ICU, don't
use it.

I'm not going to play games with calling setlocale() to keep changing
state. You saw how Perl reacted to us playing with it. Better we stop
using setlocale() altogether and go with newlocale() wherever
possible.

The chance that ICU will be installed on your system grows by the day.
The facilities provided by ICU are so far ahead of what libc provides
that I'm not sure it's sensible to compare them.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
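[For concreteness, this is the kind of per-column declaration the
COLLATE work is aiming at. It is SQL-standard syntax but purely
hypothetical for PostgreSQL at this point; nothing below runs on a
released server:

    -- hypothetical: per-column collations as proposed
    CREATE TABLE phrases (
        fr  text COLLATE "fr_FR",
        ru  text COLLATE "ru_RU"
    );

    SELECT fr FROM phrases ORDER BY fr;   -- would sort by French rules

Meanwhile the system catalogs would stay on COLLATE 'C' and keep plain
strcmp() speed.]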
> On Sat, Feb 18, 2006 at 08:16:07PM -0800, Bill Moseley wrote:
> > Is the Holy Grail encoding and lc_collate settings per column?
>
> Well, yes. I've been trying to create a system where you can handle
> multiple collations in the same database. I posted the details to
> -hackers and got part of the way, but it's a lot of work.
>
> As for encodings, to be honest I'm not sure whether it's a great
> idea to support multiple encodings simultaneously. Things become a
> lot easier if you know everything is in the same encoding. If you
> set the client_encoding automatically on startup, it has pretty much
> the same effect as having the server always use that encoding. It's
> just a bit of time wasted in conversion, but the client doesn't need
> to care.
>
> By way of example, see ICU, which is an internationalisation library
> we're considering for getting consistent locale support across all
> platforms. It supports one encoding, namely UTF-16. It has various
> functions to convert other encodings to or from that, but internally
> it's all UTF-16. So if we do use it, then all encodings (except
> native UTF-16) will need conversion all the time, so you don't buy
> anything by having the server in some random encoding.
>
> The problem of course being that the SQL standard requires some
> encoding support. No one has really come up with a proposal for that
> yet. IMHO, that's a parser issue more than anything else.

If you are considering allowing only UTF-16, or whatever single
encoding, in the backend, I am strongly against the idea. We Japanese
need native support for our encodings. Converting between those
encodings and Unicode every time the backend and frontend talk to each
other would be a serious performance hit. Moreover, the conversion is
known not to be round-trip safe, which means some information can be
lost in the conversion.

Another point is the on-disk format: UTF-16 will require more storage
than the local encodings, and UTF-8 probably will too.

I have a feeling that ICU is good for applications, but not for DBMSs.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
On Tue, Feb 21, 2006 at 10:27:15AM +0900, Tatsuo Ishii wrote:
> If you are considering allowing only UTF-16, or whatever single
> encoding, in the backend, I am strongly against the idea. We
> Japanese need native support for our encodings. Converting between
> those encodings and Unicode every time the backend and frontend talk
> to each other would be a serious performance hit. Moreover, the
> conversion is known not to be round-trip safe, which means some
> information can be lost in the conversion. Another point is the
> on-disk format: UTF-16 will require more storage than the local
> encodings, and UTF-8 probably will too.

I didn't say that we would support only UTF-16 in the backend. I said
that when doing comparisons in a non-C locale, you have to convert to
UTF-16 to use ICU. If you don't want to use it, don't; it's not going
to be required at any point. It's just like the current situation on
Win32: if you use UTF-8, it has to be converted to UTF-16 prior to
string comparison.

The only time any of this is required is for *sorting*, and if you
have an index defined, it acts as a cache for the sorted values. Of
course there's a tradeoff, but unless you're sorting large datasets
all day I doubt it'll be noticeable. If you're not sorting, none of
this is relevant to you.

> I have a feeling that ICU is good for applications, but not for
> DBMSs.

I think providing a system where users are able to select from a large
range of possible collation orders, and if necessary specify their
own, is a worthy goal. Look at the complaints we get now and then from
people who choose en_US as their locale and are surprised when it
gives them a dictionary sort. ICU allows users to take an existing
collation and tweak it if it doesn't quite match their expectations.
You think this is not useful for a DBMS?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
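[As an illustration of the tailoring Martijn describes: ICU collations
can be customized with short rule strings layered on top of a base
locale. The classic example is traditional Spanish; the rule syntax
below follows ICU's collation customization format and is shown purely
as a sketch:

    & n < ñ     -- ñ sorts as its own letter after n
    & c < ch    -- "ch" sorts as a separate letter after c

A user unhappy with en_US's dictionary sort could, in principle,
adjust just the rules they disagree with instead of writing a whole
new locale.]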