Re: UTF8 national character data type support WIP patch and list of open issues.

From: Heikki Linnakangas
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Msg-id: 5239EBC4.5030103@vmware.com
In reply to: Re: UTF8 national character data type support WIP patch and list of open issues.  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 18.09.2013 16:16, Robert Haas wrote:
> On Mon, Sep 16, 2013 at 8:49 AM, MauMau <maumau307@gmail.com> wrote:
>> 2. NCHAR/NVARCHAR columns can be used in non-UTF-8 databases and always
>> contain Unicode data.
> ...
>> 3. Store strings in UTF-16 encoding in NCHAR/NVARCHAR columns.
>> Fixed-width encoding may allow faster string manipulation as described in
>> Oracle's manual.  But I'm not sure about this, because UTF-16 is not a real
>> fixed-width encoding due to supplementary characters.
>
> It seems to me that these two points here are the real core of your
> proposal.  The rest is just syntactic sugar.
>
> Let me start with the second one: I don't think there's likely to be
> any benefit in using UTF-16 as the internal encoding.  In fact, I
> think it's likely to make things quite a bit more complicated, because
> we have a lot of code that assumes that server encodings have certain
> properties that UTF-16 doesn't - specifically, that any byte with the
> high-bit clear represents the corresponding ASCII character.
>
> As to the first one, if we're going to go to the (substantial) trouble
> of building infrastructure to allow a database to store data in
> multiple encodings, why limit it to storing UTF-8 in non-UTF-8
> databases?  What about storing SHIFT-JIS in UTF-8 databases, or
> Windows-yourfavoriteM$codepagehere in UTF-8 databases, or any other
> combination you might care to name?
>
> Whether we go that way or not, I think storing data in one encoding in
> a database with a different encoding is going to be pretty tricky and
> require far-reaching changes.  You haven't mentioned any of those
> issues or discussed how you would solve them.
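
To make the encoding property quoted above concrete, here is a minimal Python sketch. It is not PostgreSQL code; the helper name high_bit_clear_bytes and the sample characters are illustrative assumptions. It shows that UTF-8 never produces a byte below 0x80 for a non-ASCII character, while UTF-16 can, and that UTF-16 is not truly fixed-width.

```python
def high_bit_clear_bytes(text: str, encoding: str) -> list[int]:
    """Return the encoded bytes whose high bit is clear (value < 0x80).

    Hypothetical helper for illustration only.
    """
    return [b for b in text.encode(encoding) if b < 0x80]


# U+4E2F is a CJK ideograph, i.e. not an ASCII character.
cjk = "\u4e2f"

# UTF-8: every byte of a non-ASCII character has the high bit set,
# so a byte-wise scan for '/' or 'N' can never match inside it.
utf8_hits = high_bit_clear_bytes(cjk, "utf-8")

# UTF-16 (little-endian): the two bytes are 0x2F and 0x4E, which are
# also the ASCII codes for '/' and 'N'.
utf16_hits = high_bit_clear_bytes(cjk, "utf-16-le")

# UTF-16 is also variable-width: a character outside the Basic
# Multilingual Plane needs a surrogate pair (4 bytes instead of 2).
bmp_len = len("a".encode("utf-16-le"))
supplementary_len = len("\U0001F600".encode("utf-16-le"))

print(utf8_hits, bytes(utf16_hits), bmp_len, supplementary_len)
```

Any code that scans a string byte by byte for an ASCII delimiter would misfire on the UTF-16 form, which is the kind of assumption Robert refers to.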

I'm not too thrilled about complicating the system for that, either. If 
you really need to deal with many different languages, you can do that 
today by using UTF-8 everywhere. Sure, it might not be the most 
efficient encoding for some characters, but it works.
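
As a rough illustration of the efficiency point, the following Python sketch compares the encoded size of a short Japanese string in UTF-8 and in two legacy Japanese encodings. The sample string is an arbitrary choice, and real-world ratios depend on how much ASCII the data contains.

```python
# Three kanji ("Japanese language"), written with escapes to keep the
# source ASCII-only.
sample = "\u65e5\u672c\u8a9e"

# Encoded size in bytes for each encoding; the codec names are those of
# Python's standard codecs module.
sizes = {enc: len(sample.encode(enc)) for enc in ("utf-8", "euc_jp", "shift_jis")}

for enc, size in sizes.items():
    print(f"{enc}: {size} bytes")
```

Each of these characters takes three bytes in UTF-8 and two in EUC-JP or Shift-JIS, so the overhead is real but bounded.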

There is one reason, however, that makes it a lot more compelling: we 
already support having databases with different encodings in the same 
cluster, but the encoding used in the shared catalogs, for usernames and 
database names for example, is not well-defined. If we dealt with 
different encodings in the same database, that inconsistency would go away.
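
The shared-catalog problem can be sketched in Python as follows. This is only a model of the situation, not how PostgreSQL reads its catalogs: the role name and the read_name helper are hypothetical, and the point is simply that the same stored bytes mean different things depending on which database encoding is used to interpret them.

```python
def read_name(raw: bytes, database_encoding: str) -> str:
    """Interpret raw catalog bytes using the connecting database's encoding.

    Hypothetical helper; invalid sequences become U+FFFD so the
    mismatch is visible instead of raising an exception.
    """
    return raw.decode(database_encoding, errors="replace")


# A role name created while connected to a LATIN1 database: the shared
# catalog stores the bytes without recording their encoding.
stored_name = "Jos\u00e9".encode("latin-1")  # b'Jos\xe9'

# Read back from a LATIN1 database, the name is intact.
latin1_view = read_name(stored_name, "latin-1")

# Read back from a UTF-8 database, 0xE9 is not a valid sequence.
utf8_view = read_name(stored_name, "utf-8")

print(repr(latin1_view), repr(utf8_view))
```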

- Heikki


