Re: Bug in UTF8-Validation Code?

From	Tatsuo Ishii
Subject	Re: Bug in UTF8-Validation Code?
Date	5 April 2007, 00:33:38
Msg-id	20070405.093425.67003822.t-ishii@sraoss.co.jp
In reply to	Re: Bug in UTF8-Validation Code? (Alvaro Herrera <alvherre@commandprompt.com>)
Responses	Re: Bug in UTF8-Validation Code?
List	pgsql-hackers

> Tatsuo Ishii wrote:
> 
> > BTW, every encoding has its own charset. However, the relationship
> > between encoding and charset is not as simple as it is in Unicode. For
> > example, encoding EUC_JP corresponds to multiple charsets, namely
> > ASCII, JIS X 0201, JIS X 0208 and JIS X 0212. So a function which
> > returns a "code point" is not very useful, since it lacks the charset
> > info. I think we need to continue the design discussion, probably
> > targeting 8.4, not 8.3.
> 
> Is Unicode complete as far as Japanese chars go?  I mean, is there a
> character in EUC_JP that is not representable in Unicode?

I don't think Unicode is "complete" in this case. The problems are: EUC_JP
allows user-defined characters, which are not mapped to Unicode. Also,
some characters in EUC_JP correspond to multiple Unicode code points.

> Because if Unicode is complete, ISTM it makes perfect sense to have a
> unicode_char() (or whatever we end up calling it) that takes a Unicode
> code point and returns a character in whatever JIS set you want
> (specified by setting client_encoding to that).  Because then you solved
> the problem nicely.

I'm not sure what kind of use case you have in mind for unicode_char().
Anyway, if you want a "code point" from a character, we could
easily add such a function for all the backend encodings we currently
support. It would probably look like:

to_code_point(str TEXT) returns TEXT

Example outputs:

ASCII - 41
ISO 10646 - U+0041
ISO 10646 - U+29E3D
ISO 8859-1 - a5
JIS X 0208 - 4141

It's a little bit too late for 8.2 though.
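
To make the "charset plus code" idea concrete, here is a minimal sketch in
Python (not backend C, and not an existing PostgreSQL function). The names
euc_jp_code_point and unicode_code_point are hypothetical, and the output
format simply mimics the examples above. It assumes the usual EUC_JP byte
layout: a lead byte below 0x80 is ASCII, 0x8E introduces a JIS X 0201 kana
byte, 0x8F introduces a two-byte JIS X 0212 character, and a lead byte in
0xA1-0xFE starts a two-byte JIS X 0208 character.

```python
def euc_jp_code_point(data: bytes) -> str:
    """Describe the first EUC_JP character in `data` as 'charset - code'.

    Hypothetical helper for illustration only. It dispatches on the lead
    byte and does not validate the range of the trailing bytes.
    """
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if lead < 0x80:
        # Single byte: plain ASCII.
        return f"ASCII - {lead:02x}"
    if lead == 0x8E and len(data) >= 2:
        # Single shift 2: one byte of JIS X 0201 kana follows.
        return f"JIS X 0201 - {data[1]:02x}"
    if lead == 0x8F and len(data) >= 3:
        # Single shift 3: two bytes of JIS X 0212 follow (high bits cleared).
        return f"JIS X 0212 - {data[1] & 0x7F:02x}{data[2] & 0x7F:02x}"
    if 0xA1 <= lead <= 0xFE and len(data) >= 2:
        # Two bytes of JIS X 0208 (high bits cleared).
        return f"JIS X 0208 - {lead & 0x7F:02x}{data[1] & 0x7F:02x}"
    raise ValueError("not a valid EUC_JP sequence")


def unicode_code_point(ch: str) -> str:
    """Describe a single Unicode character in U+XXXX notation."""
    return f"ISO 10646 - U+{ord(ch):04X}"


print(euc_jp_code_point(b"A"))             # ASCII - 41
print(euc_jp_code_point(b"\xc1\xc1"))      # JIS X 0208 - 4141
print(unicode_code_point("A"))             # ISO 10646 - U+0041
print(unicode_code_point(chr(0x29E3D)))    # ISO 10646 - U+29E3D
```

The point of the sketch is that for EUC_JP the same function has to report
which charset the code belongs to, whereas for Unicode the code point alone
is enough.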

> One thing that I find confusing in your text above is whether EUC_JP is
> an encoding or a charset?  I would think that the various JIS X are
> encodings, and EUC_JP is the charset; or is it the other way around?

No, EUC_JP is an encoding. JIS X are the charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

