Re: invalidly encoded strings

From	Tatsuo Ishii
Subject	Re: invalidly encoded strings
Date	September 10, 2007, 15:31:44
Msg-id	20070911.003051.41631033.t-ishii@sraoss.co.jp
In reply to	Re: invalidly encoded strings (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: invalidly encoded strings Re: invalidly encoded strings
List	pgsql-hackers

> Andrew Dunstan <andrew@dunslane.net> writes:
> > The reason we are prepared to make an exception for Unicode is precisely 
> > because the code point maps to an encoding pattern independently of 
> > architecture, ISTM.
> 
> Right --- there is a well-defined standard for the numerical value of
> each character in Unicode.  And it's also clear what to do in
> single-byte encodings.  It's not at all clear what the representation
> ought to be for other multibyte encodings.  A direct transliteration
> of the byte sequence not only has endianness issues, but will have
> a weird non-dense set of valid values because of the restrictions on
> valid multibyte characters.
> 
> Given that chr() has never before behaved sanely for multibyte values at
> all, extending it to Unicode code points is a reasonable extension,
> and throwing error for other encodings is reasonable too.  If we ever do
> come across code-point standards for other encodings we can adopt 'em at
> that time.

I don't understand this whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
the Unicode code point as UCS-4? Then you have to specify the endianness
anyway. (See the UCS-4 standard for more details.)
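
As a rough illustration of the UCS-4 point, here is a small Python sketch
(Python is used only for demonstration; it is not PostgreSQL code, and the
variable names are made up for this example). The same code point, U+0259,
serializes to two different byte sequences depending on the byte order chosen:

```python
# U+0259 LATIN SMALL LETTER SCHWA, used here purely as a sample code point.
code_point = 0x0259

# UCS-4 / UTF-32 stores each code point in four bytes; the byte order
# must be chosen explicitly when the value is written out as bytes.
big_endian = chr(code_point).encode("utf-32-be")
little_endian = chr(code_point).encode("utf-32-le")

print(big_endian.hex())     # 00000259
print(little_endian.hex())  # 59020000
```

The integer 0x0259 itself is the same in both cases; the difference only
appears once a four-byte representation is written down.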

Or are you going to represent the Unicode code point as a character
string such as 'U+0259'? Then representing a code point of any encoding
as a string would avoid endianness issues just as well, and I don't see
how a Unicode code point is any better than the others.

Also, I'd like to point out that every encoding has its own code point
system, as far as I know. For example, EUC-JP has corresponding code
point systems: ASCII, JIS X 0208 and JIS X 0212. So I don't see why
we can't use a "code point" as chr()'s argument for other encodings (of
course, we would need an optional parameter specifying which character
set is intended).
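
A minimal sketch of what such a "code point plus character set" argument
could look like for EUC-JP, again in Python for demonstration only. The
function name euc_jp_bytes and the character set labels "ascii", "jisx0208"
and "jisx0212" are hypothetical and are not an existing PostgreSQL interface.
It assumes the JIS code point is given as a 16-bit value whose two bytes
each lie in 0x21..0x7E:

```python
def euc_jp_bytes(code: int, charset: str) -> bytes:
    """Hypothetical helper: map a (code point, character set) pair to EUC-JP bytes."""
    if charset == "ascii":
        if not 0x00 <= code <= 0x7F:
            raise ValueError("ASCII code point out of range")
        return bytes([code])

    # JIS X 0208 / JIS X 0212 code points are two 7-bit bytes (0x21..0x7E each).
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("JIS code point out of range")

    # EUC-JP sets the high bit of both bytes.
    pair = bytes([hi | 0x80, lo | 0x80])
    if charset == "jisx0208":
        return pair
    if charset == "jisx0212":
        # JIS X 0212 characters are prefixed with the single-shift byte 0x8F.
        return b"\x8f" + pair
    raise ValueError(f"unknown character set: {charset}")


# JIS X 0208 code point 0x2422 is HIRAGANA LETTER A.
hiragana_a = euc_jp_bytes(0x2422, "jisx0208")
print(hiragana_a.hex())             # a4a2
print(hiragana_a.decode("euc_jp"))  # あ
```

The argument is a plain number plus a character set name, so no byte order
has to be specified by the caller.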
--
Tatsuo Ishii
SRA OSS, Inc. Japan
