Thoughts on multiple simultaneous code page support

From: Randall Parker
Subject: Thoughts on multiple simultaneous code page support
Date:
Msg-id: 21232575719657@mail.nls.net
Replies: Re: Thoughts on multiple simultaneous code page support  (Giles Lean <giles@nemeton.com.au>)
List: pgsql-hackers
I've seen a number of RDBMSs that require an entire database (whatever they call a database) to be in a single code page. There are disadvantages to this. With those disadvantages in mind, and with the idea that it is worth examining options other than Unicode, I typed up a list of some of the ways that databases could handle many different national languages and code pages.

Suppose you have to develop a database that will store text strings for a large number of languages 
cutting across several code pages. Well, you have a few choices:

1) Make the entire database Unicode (provided that your RDBMS supports Unicode). At least this way you know that you can store any text that you have. Whether it is Swedish, Russian, Japanese, German, or English is not a problem. The problem is that all CHAR fields in all tables are now Unicode. That makes all text storage take up more room. It also makes sorting and indexing take more time. It also requires translation between Unicode and the code page that any particular client PC is using.

Aside: DB2 has a feature whereby the client PC can declare itself to the RDBMS as being in a particular code page, and then DB2 does all the back-end to client-end code page translation in both directions.
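That kind of back-end to client-end translation can be sketched in a few lines. This is a hypothetical illustration of the idea, not DB2's actual API: the server keeps text as Unicode and re-encodes it per the code page each client declared.

```python
# Sketch (hypothetical): server stores Unicode, re-encodes per client's
# declared code page in both directions.

def to_client(server_text: str, client_codepage: str) -> bytes:
    """Encode server-side Unicode text into the client's code page."""
    return server_text.encode(client_codepage)

def from_client(client_bytes: bytes, client_codepage: str) -> str:
    """Decode bytes sent by the client back into server-side Unicode."""
    return client_bytes.decode(client_codepage)

# A Latin-1 client and a Shift-JIS client each see their own encoding.
assert to_client("café", "latin-1") == b"caf\xe9"
assert from_client(b"caf\xe9", "latin-1") == "café"
assert from_client(to_client("日本語", "shift_jis"), "shift_jis") == "日本語"
```

The cost the post mentions is visible here: every round trip pays an encode/decode step, on top of the wider Unicode storage on the back end.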

2) Allow each table to be in a particular code page. This is not too satisfying. How can you do foreign keys between tables when the two tables are in two different code pages? You'd have to ignore the differences and assume that only the first 127 characters are used and that those characters are the same in both code pages.
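The "first 127 characters" assumption can at least be checked mechanically. A minimal sketch (the function name is mine): a key value is only safe to compare across code pages if every byte falls in the 7-bit ASCII range that most code pages share.

```python
def safe_cross_codepage_key(key: bytes) -> bool:
    """A key can be compared across two code pages only if every byte is
    in the 7-bit ASCII range (0-127), which most code pages agree on."""
    return all(b < 0x80 for b in key)

assert safe_cross_codepage_key(b"ORDER_42")
# 0xE9 ('é' in Latin-1) maps to different characters in other code pages.
assert not safe_cross_codepage_key("caf\xe9".encode("latin-1"))
```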

3) Allow individual columns to have code pages declared for them. This is okay, but it has downsides:

A) For every national language, or at least for every code page you want to support, you have to create a new column.

Picture a translation table. You might have 30 languages you translate into. So what do you do? Make 30 columns? You'd have a column that is a TokenID describing what phrase or word the row represents, then all the translation columns.

Having a column per language results in more columns than having a column per code page. But a column per code page results in more rows. The reason is that you might put French, German, English, and several other languages in ISO 8859-1; they each need to go on a different row. But if only Japanese goes into the Shift-JIS (e.g. CP932 or CP942) column, then only one of those rows has Japanese in it. Do you put the Japanese on the same row as the English translation, or on its own row?

You end up with a sparse-matrix appearance if you do one column per code page, but you end up with a lot more columns if you do one column per language. Then you run into limits on how many columns and how many bytes a particular RDBMS can support.
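The sparse-matrix effect is easy to see with a toy version of such a translation table (column names are mine, for illustration): the Latin-1 languages each take a row with the Shift-JIS cell empty, and the Japanese row leaves the Latin-1 cell empty.

```python
# Toy translation table, one column per code page (hypothetical layout).
# French/German/English share the ISO 8859-1 column but need separate
# rows; Japanese alone uses the Shift-JIS column -- a sparse matrix.
rows = [
    {"token_id": 1, "lang": "en", "iso8859_1": "File",    "shift_jis": None},
    {"token_id": 1, "lang": "fr", "iso8859_1": "Fichier", "shift_jis": None},
    {"token_id": 1, "lang": "de", "iso8859_1": "Datei",   "shift_jis": None},
    {"token_id": 1, "lang": "ja", "iso8859_1": None,      "shift_jis": "ファイル"},
]

empty_cells = sum(1 for r in rows
                  for col in ("iso8859_1", "shift_jis") if r[col] is None)
assert empty_cells == 4  # half the cells are empty even in this tiny example
```

With 30 languages the alternative is 30 text columns per row instead, which is where the per-RDBMS column and row-size limits start to bite.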
4) Mix code pages in a single column, so different rows may have different code pages.

I've done this in DB2. One gives up the ability to index that column. After all, whose code page collation rules do you use?

That limitation is okay if the table is a translation table for token IDs that are used in the development of software. Basically, the programmers writing some app use Token ID constants in their code to specify what text gets displayed. Then the database contains the mappings from those token IDs to the various national languages. In the case where I did this, the database is populated from an outside source of data that comes from translators who don't even have direct access to the database. The table then is just a storehouse that is indexed on the token ID and national language fields.

Of course, one could come up with some scheme whereby the RDBMS would somehow know, for each row and field, what its code page is. One would need a way to declare a field as having a variable code page, and then a way to set a code page value along with the text for a given field. I'm not arguing for this approach; just pointing it out for completeness.

Note that in order to support mixing code pages in a column you have to have one of two conditions:

A) The column is declared in some code page, but the RDBMS cannot enforce the requirement that all text going into that column be legal in that code page. Some of the text put into rows in that column may use encodings that are not legal in the declared code page.

B) Have a binary character type that is not in any code page, e.g. DB2's CLOB, DBCLOB, VARGRAPHIC, and other similar fields. I forget which one of those I used, but I have used one of them.
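Condition B, combined with the per-row code page scheme from option 4, can be sketched as follows (table and column names are mine): the text lives in a binary column, a companion column records each row's code page, and lookups go through the token ID and language, never through the binary text itself.

```python
import sqlite3

# Sketch of a translation table mixing code pages in one column:
# text is stored as raw bytes (a BLOB) with a companion column naming
# each row's code page. Collation/indexing on the text is given up;
# lookups are keyed on (token_id, language) instead.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE translations (
    token_id INTEGER, language TEXT, codepage TEXT, text_raw BLOB,
    PRIMARY KEY (token_id, language))""")

def put(token_id, language, codepage, text):
    conn.execute("INSERT INTO translations VALUES (?, ?, ?, ?)",
                 (token_id, language, codepage, text.encode(codepage)))

def get(token_id, language):
    cp, raw = conn.execute(
        "SELECT codepage, text_raw FROM translations"
        " WHERE token_id = ? AND language = ?",
        (token_id, language)).fetchone()
    return raw.decode(cp)

put(1, "de", "latin-1", "Datei öffnen")    # Latin-1 row
put(1, "ja", "shift_jis", "ファイルを開く")  # Shift-JIS row, same column
assert get(1, "de") == "Datei öffnen"
assert get(1, "ja") == "ファイルを開く"
```

SQLite stands in here for any RDBMS with a binary column type; the point is only the shape of the scheme, not a particular product's types.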





