Discussion: Internationalization
I'm thinking that this should be approached as a series of changes at progressively finer granularity:

1/ Make it possible for individual databases within a single instance to have different encodings AND locale/sorting rules, along with all other aspects of using the encoding/language rules.
2/ Then tables.
3/ Then columns.

-------------------------

So, for the first one: is there any way for a single statement to access more than one database? Could a query, regexes, etc. be facing indexes in different encodings/sorting collations if different databases in a cluster had different encodings/collations?
On Wed, Jun 30, 2004 at 02:26:10PM -0700, Dennis Gearon wrote:

> 1/ Make individual databases possible with a single instance that can be
> different encoding AND locale/sorting, and all other aspects of using
> the encoding/language rules.

> Is there any way for a single statement to access more than one database?
> Could a query, regexes, etc be facing indexes in different
> encodings/sorting collations if different databases in a cluster had
> different encodings/collations?

No, but there are at least two problems:

1. Shared tables. All databases in a cluster share at least pg_database, pg_shadow, pg_group and (new) pg_tablespace. And, of course, all their indexes. What would you do about them?

2. When creating a new database, the current method is to copy from template1. How would you change the encoding of the new database?

-- 
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"The problem with the future is that it keeps turning into the present" (Hobbes)
Dennis Gearon <gearond@fireserve.net> writes:
> Is there any way for a single statement to access more than one database?
> Could a query, regexes, etc be facing indexes in different
> encodings/sorting collations if different databases in a cluster had
> different encodings/collations?

The indexes on the shared system tables (eg, pg_database) are the only issue here. One possible solution is to require that no locale-aware datatypes ever be used in these indexes. I think right now this is true because "name" doesn't use locale-aware sorting; but we'd have to be careful not to break the restriction in future.

			regards, tom lane
Dennis Gearon <gearond@fireserve.net> writes:
> Tom Lane wrote:
>> The indexes on the shared system tables (eg, pg_database) are the only
>> issue here. One possible solution is to require that no locale-aware
>> datatypes ever be used in these indexes. I think right now this is true
>> because "name" doesn't use locale-aware sorting; but we'd have to be
>> careful not to break the restriction in future.
>>
> Tom what about table names? Isn't it part of the SQL spec to be able
> to set table names to other languages other than English?

[shrug...] So which language/encoding would you like to force everyone to use?

The issue is not really whether you can create a database name that looks like however you want. The issues are (a) what it will look like to someone else using a different encoding; and (b) how it will sort if you ask for "select * from pg_database order by datname", relative to someone else's database name that he thinks is in a different locale and encoding than you think yours is.

AFAICT the Postgres user community is not ready to accept a "thou shalt use Unicode" decree, so I don't think that mandating a one-size-fits-all answer is going to fly.

			regards, tom lane
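Points (a) and (b) can be made concrete with a small Python sketch. It is not from the thread: the names `café`, `été` and `über` are made up, Latin-1, KOI8-R and CP437 merely stand in for "two clients that disagree about the encoding", and plain byte-wise comparison is assumed as the sort rule (roughly what a non-locale-aware type would give you).

```python
# Hypothetical database name, used only for illustration.
name = "café"

# The bytes a Latin-1 client would send to the server.
stored = name.encode("latin-1")

# (a) The same stored bytes, as seen by two clients with different encodings.
as_latin1 = stored.decode("latin-1")
as_koi8r = stored.decode("koi8_r")
print("Latin-1 client sees:", as_latin1)
print("KOI8-R client sees: ", as_koi8r)

# (b) The same two names, ordered by their raw bytes under two encodings.
names = ["été", "über"]
by_latin1 = sorted(names, key=lambda s: s.encode("latin-1"))
by_cp437 = sorted(names, key=lambda s: s.encode("cp437"))
print("byte order if stored as Latin-1:", by_latin1)
print("byte order if stored as CP437:  ", by_cp437)
```

The two sort results differ because the accented letters map to different byte values in each encoding, so a shared index has no single "right" order unless everyone agrees on an encoding or the comparison ignores locale entirely.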
Tom Lane wrote:
> Dennis Gearon <gearond@fireserve.net> writes:
>
>> Tom Lane wrote:
>>
>>> The indexes on the shared system tables (eg, pg_database) are the only
>>> issue here. One possible solution is to require that no locale-aware
>>> datatypes ever be used in these indexes. I think right now this is true
>>> because "name" doesn't use locale-aware sorting; but we'd have to be
>>> careful not to break the restriction in future.
>>>
>>
>> Tom what about table names? Isn't it part of the SQL spec to be able
>> to set table names to other languages other than English?
>
> [shrug...] So which language/encoding would you like to force everyone
> to use?
>
> The issue is not really whether you can create a database name that
> looks like however you want. The issues are (a) what it will look like
> to someone else using a different encoding; and (b) how it will sort if
> you ask for "select * from pg_database order by datname", relative to
> someone else's database name that he thinks is in a different locale and
> encoding than you think yours is.
>
> AFAICT the Postgres user community is not ready to accept a "thou shalt
> use Unicode" decree, so I don't think that mandating a one-size-fits-all
> answer is going to fly.
>
> regards, tom lane
>

So for now, my database is set up as:

show all shows
------------------
server encoding SQL_ASCII

I didn't see anything that said what the LC_COLLATE and LC_CTYPE settings were when initdb was done. How can I find that out?

in postgresql.conf
------------------
LC_MESSAGES = 'C'
LC_MONETARY = 'C'
LC_NUMERIC = 'C'
LC_TIME = 'C'

So what do I have: an 8-bit encoding with standard ASCII? Which languages can I put in it? Will it sort in standard ASCII order, with all non-English characters sorting last?
Tom Lane wrote:
> Dennis Gearon <gearond@fireserve.net> writes:
>
>> Is there any way for a single statement to access more than one database?
>> Could a query, regexes, etc be facing indexes in different
>> encodings/sorting collations if different databases in a cluster had
>> different encodings/collations?
>
> The indexes on the shared system tables (eg, pg_database) are the only
> issue here. One possible solution is to require that no locale-aware
> datatypes ever be used in these indexes. I think right now this is true
> because "name" doesn't use locale-aware sorting; but we'd have to be
> careful not to break the restriction in future.
>
> regards, tom lane
>

Tom, what about table names? Isn't it part of the SQL spec to be able to set table names to languages other than English?

----------------------

I've researched most of the databases out there that will tell you anything about how they have internationalized them. By a vast majority, I have found them using UTF-16 for ALL internals, in memory or CPU. This does double the memory image of most applications in non-Oriental languages. But memory is cheap, and the desktop/Intel server market is just about to go to 64 bit and use much more memory.

Based on my research, all characters for most human languages can be represented in one 2-byte, 16-bit char via UTF-16. I am going to do some more research on that.

PROBABLY, most of them use UTF-16 on the disk as well. Since most slow processes are IO bound, using an 8-bit text datatype, WHEN possible, and converting on the fly might be a good way to keep some speed while truly making an ANSI-spec, international database.

I'm probably all wet though.
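The size trade-off described above is easy to check with a short Python sketch. It is not from the thread: the sample strings and the helper `sizes` are made up for illustration, and it only measures encoded length, not how any particular database stores text.

```python
def sizes(text):
    """Return (bytes in UTF-8, bytes in UTF-16) for the given text."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample strings chosen for illustration.
samples = {
    "ASCII": "hello",
    "Cyrillic": "привет",
    "CJK (basic plane)": "日本語",
    "CJK (supplementary plane)": "\U00020000",
}

for label, text in samples.items():
    utf8, utf16 = sizes(text)
    print(f"{label}: {len(text)} chars, UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")
```

ASCII text does double in size under UTF-16, while basic-plane CJK text gets smaller than in UTF-8. The last sample shows that one 16-bit unit per character does not hold everywhere: characters outside the Basic Multilingual Plane need two 16-bit units (a surrogate pair).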
Dennis Gearon <gearond@fireserve.net> writes:
> I didn't see anything that said what the LC_COLLATE and LC_CTYPE settings were when initdb was done.
> How can I find that out?

In 7.4 you can just SHOW 'em, but before that you have to use pg_controldata to find it out.

> in postgresql.conf
> ------------------
> LC_MESSAGES = 'C'
> LC_MONETARY = 'C'
> LC_NUMERIC = 'C'
> LC_TIME = 'C'

Given that, I'd bet you have collate/ctype as C too, but it's not certain.

			regards, tom lane
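If collate does turn out to be C, strings are compared byte by byte. A minimal Python sketch of what that ordering looks like follows. It is not from the thread: the word list is made up, and it assumes the client stored the text as single-byte Latin-1 (SQL_ASCII itself does not interpret bytes above 127 at all).

```python
# Hypothetical values, assumed to be stored as Latin-1 bytes.
words = ["apple", "Zebra", "éclair", "banana"]

# Emulate C-locale ordering: compare the raw stored bytes.
c_order = sorted(words, key=lambda s: s.encode("latin-1"))
print(c_order)
```

Under these assumptions uppercase ASCII letters sort before lowercase ones, and any byte above 127 (here the accented letter) sorts after all of ASCII.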