Re: Re: Big 7.1 open items
From | Randall Parker
Subject | Re: Re: Big 7.1 open items
Date |
Msg-id | MPG.13b4559da89d333c989813@news.west.net
In reply to | Re: Re: Big 7.1 open items (Thomas Lockhart <lockhart@alumni.caltech.edu>)
List | pgsql-hackers
Thomas,

A few (hopefully relevant) comments regarding character sets, code pages, I18N, and all that:

1) I've seen databases (DB2, if memory serves) that allowed the client side to declare itself to the database back-end engine as being in a particular code page. For instance, one could have a CP850 Latin-1 client and an ISO 8859-1 database. The database engine did the appropriate translations in both directions.

2) Mixing code pages in a single column and then having the database engine support it is not trivial. Each CHAR/VARCHAR would have to have a code page settable per row (e.g. either as a separate column or as something like mycolumnname.encoding). Even if you could handle all that, you'd still be faced with the issue of collating sequence. Each individual code page has a collating sequence, but how do you collate across code pages? There'd be letters that exist only in a single code page. Plus, it gets messy with, for instance, a simple umlauted a, which occurs in CP850, CP1252, and ISO 8859-1 (and likely in other code pages as well). That letter is really the same letter in all those code pages and should be treated as such when sorting.

3) I think it is more important for a database to support lots of languages in the stored data than in the field names and table names. If a programmer has to deal with A-Za-z for naming identifiers and that person is Korean or Japanese, then that is certainly an imposition on them. But it's a far, far bigger imposition if that programmer can't build a database that will store the letters of his national language and sort and index and search them in convenient ways.

4) The real solution to the multiple code page dilemma is Unicode. Yes, it's more space. But the can of worms of dealing with multiple code pages in a column is really no fun, and the result is not great. BTDTHTTS.
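The umlauted-a problem in point 2 is easy to demonstrate. This is a minimal sketch (not from the original post; variable names are mine) showing that the same character has different byte values in CP850 and ISO 8859-1, so a naive byte-wise sort across code pages treats identical letters as different, while decoding to a common representation (Unicode) makes them comparable:

```python
# The same umlauted 'a' (U+00E4) has different byte values in two
# code pages, so raw bytes from mixed-code-page rows cannot be
# compared or collated directly.
a_cp850 = "ä".encode("cp850")      # single byte 0x84 in CP850
a_latin1 = "ä".encode("latin-1")   # single byte 0xE4 in ISO 8859-1

assert a_cp850 != a_latin1         # raw bytes disagree...

# ...but decoded into a common representation they are the same
# letter, which is what a cross-code-page collation would need:
assert a_cp850.decode("cp850") == a_latin1.decode("latin-1")
```

This is exactly why point 4 lands on Unicode: once everything is in one universal character repertoire, a single collation can apply to the whole column.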
5) The problem with enforcing code page legality: I've built a database in DB2 where particular columns contained data from many different code pages (each row had a code page field as well as a text field). For some applications that is okay, as long as that field is not going to be part of an index. However, if a database is defined as being in a particular code page, and if the database engine is going to reject characters that are not recognized as part of that code page, then you can't play the sort of game I just described _unless_ there is a different datatype, similar to CHAR/VARCHAR, for which the RDBMS does not enforce code page legality on each character. Otherwise you choose some code page for a column, you go merrily stuffing in all sorts of rows in all sorts of code pages, and then along comes some character whose value is not a legal value in the code page the RDBMS thinks the column uses.

Anyway, I've done lots of I18N database stuff, and hopefully a few of my comments will be useful to the assembled brethren <g>.

In news:<3948E4D7.A3B722E9@alumni.caltech.edu>, lockhart@alumni.caltech.edu says... > One issue: I can see (or imagine ;) how we can use the Postgres type > system to manage multiple character sets. But allowing arbitrary > character sets in, say, table names forces us to cope with allowing a > mix of character sets in a single column of a system table. afaik this > general capability is not mandated by SQL9x (the SQL_TEXT character set > is used for all system resources??). Would it be acceptable to have a > "default database character set" which is allowed to creep into the > pg_xxx tables? Even that seems to be a difficult thing to accomplish at > the moment (we'd need to get some of the text manipulation functions > from the catalogs, not from hardcoded references as we do now). >
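The per-row code-page scheme described in point 5 can be sketched in a few lines. This is my own illustrative schema and column names, not the DB2 design from the post, using SQLite only as a stand-in store; the point is that the engine holds opaque bytes plus a per-row tag and never enforces code page legality, with decoding done on the way out:

```python
import sqlite3

# Hypothetical schema: a code page tag per row, and the text stored
# as a raw BLOB the engine does not validate against any code page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE msgs (codepage TEXT, body BLOB)")

rows = [("cp850", "ä".encode("cp850")),      # byte 0x84
        ("latin-1", "ä".encode("latin-1"))]  # byte 0xE4
conn.executemany("INSERT INTO msgs VALUES (?, ?)", rows)

# Decoding happens in the application, per row, using the tag:
decoded = [body.decode(cp) for cp, body in
           conn.execute("SELECT codepage, body FROM msgs ORDER BY rowid")]

# Both rows decode to the same character despite different stored bytes.
assert decoded[0] == decoded[1] == "ä"
```

The trade-off is the one the post names: because the stored bytes are not comparable across rows, such a column cannot usefully participate in an index or a collation without decoding first.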