Re: UTF8 national character data type support WIP patch and list of open issues.

From: Heikki Linnakangas
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date:
Msg-id: 522594E8.2050106@vmware.com
Whole thread / Raw message
In response to: UTF8 national character data type support WIP patch and list of open issues.  ("Boguk, Maksym" <maksymb@fast.au.fujitsu.com>)
Responses: Re: UTF8 national character data type support WIP patch and list of open issues.  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: UTF8 national character data type support WIP patch and list of open issues.  ("Boguk, Maksym" <maksymb@fast.au.fujitsu.com>)
List: pgsql-hackers
On 03.09.2013 05:28, Boguk, Maksym wrote:
> Target usage: the ability to store UTF8 national characters in selected
> fields inside a single-byte encoded database.
> For example, if I have a ru-RU.koi8r encoded database with mostly Russian
> text inside, it would be nice to be able to store Japanese text in one
> field without converting the whole database to UTF8 (converting such a
> database to UTF8 could almost double the database size even if only one
> field in the whole database uses any symbols outside of the ru-RU.koi8r
> encoding).

Ok.
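To put a number on the size concern in the quoted mail, here is a small Python sketch (illustrative only; the exact ratio depends on the mix of ASCII and Cyrillic in the data):

```python
# Cyrillic letters occupy one byte in KOI8-R but two bytes in UTF-8, so
# converting a mostly Russian database to UTF8 roughly doubles the size of
# its text columns.
text = "Привет, мир"  # "Hello, world" in Russian: 9 Cyrillic letters + 2 ASCII

koi8 = text.encode("koi8_r")
utf8 = text.encode("utf-8")

print(len(koi8))  # 11 bytes: one byte per character
print(len(utf8))  # 20 bytes: two bytes per Cyrillic letter, one per ASCII char
```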

> What has been done:
>
> 1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL
> CHARACTER VARYING.
> These types differ from the char/varchar data types in one important
> respect: NATIONAL string types always have UTF8 encoding, independent of
> the database encoding.

I don't like the approach of adding a new data type for this. The 
encoding used for a text field should be an implementation detail, not 
something that's exposed to users at the schema-level. A separate data 
type makes an nvarchar field behave slightly differently from text, for 
example when it's passed to and from functions. It will also require 
drivers and client applications to know about it.

> What need to be done:
>
> 1) A full set of string functions and operators for the NATIONAL types
> (we cannot use the generic text functions because they assume the strings
> are in the database encoding).
> Currently only a basic set is implemented.
> 2) Some way to define a default collation for the NATIONAL types.
> 3) Some way to input UTF8 characters into NATIONAL types via SQL (there
> is a serious open problem here; it is described later in the text).

Yeah, all of these issues stem from the fact that the NATIONAL types are 
separate from text.

I think we should take a completely different approach to this. Two 
alternatives spring to mind:

1. Implement a new encoding.  The new encoding would be some variant of 
UTF-8 that encodes languages like Russian more efficiently. Then just 
use that in the whole database. Something like SCSU 
(http://www.unicode.org/reports/tr6/) should do the trick, although I'm 
not sure if SCSU can be used as a server-encoding. A lot of code relies 
on the fact that a server encoding must have the high bit set in all 
bytes that are part of a multi-byte character. That's why SJIS for 
example can only be used as a client-encoding. But surely you could come 
up with some subset or variant of SCSU which satisfies that requirement.
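The high-bit requirement mentioned above can be demonstrated from Python's standard codecs. This sketch shows why UTF-8 satisfies the server-encoding rule while Shift-JIS does not:

```python
# In UTF-8, every byte of a multi-byte sequence has the high bit (0x80) set,
# so an ASCII byte in the stream is always a complete ASCII character.
# Shift-JIS violates this: the second byte of a two-byte character may fall
# in the ASCII range, here 0x5C, which is the ASCII backslash.
utf8_bytes = "ソ".encode("utf-8")       # katakana "so"
sjis_bytes = "ソ".encode("shift_jis")

assert all(b & 0x80 for b in utf8_bytes)  # every byte has the high bit set
assert any(b < 0x80 for b in sjis_bytes)  # second byte 0x5C is plain ASCII
print(sjis_bytes.hex())  # 835c
```

This is exactly why naive byte-oriented code (looking for quotes, backslashes, path separators) is safe on UTF-8 but can misparse Shift-JIS, and why the latter is restricted to client-side use in Postgres.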

2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET 
STORAGE MAIN". That will make Postgres compress that field. Today that 
might not be very efficient for short Cyrillic text encoded in UTF-8, 
but it could be improved. There has been discussion of supporting more 
compression algorithms in the past, and one such algorithm could again 
be something like SCSU.

- Heikki



In the pgsql-hackers list, by message date:

Previous
From: wangshuo@highgo.com.cn
Date:
Message: Re: ENABLE/DISABLE CONSTRAINT NAME
Next
From: Craig Ringer
Date:
Message: Re: INSERT...ON DUPLICATE KEY IGNORE