Re: invalidly encoded strings

Поиск
Список
Период
Сортировка
От Tatsuo Ishii
Тема Re: invalidly encoded strings
Дата
Msg-id 20070911.003051.41631033.t-ishii@sraoss.co.jp
обсуждение исходный текст
Ответ на Re: invalidly encoded strings  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: invalidly encoded strings  (Martijn van Oosterhout <kleptog@svana.org>)
Re: invalidly encoded strings  (Andrew Dunstan <andrew@dunslane.net>)
Список pgsql-hackers
> Andrew Dunstan <andrew@dunslane.net> writes:
> > The reason we are prepared to make an exception for Unicode is precisely 
> > because the code point maps to an encoding pattern independently of 
> > architecture, ISTM.
> 
> Right --- there is a well-defined standard for the numerical value of
> each character in Unicode.  And it's also clear what to do in
> single-byte encodings.  It's not at all clear what the representation
> ought to be for other multibyte encodings.  A direct transliteration
> of the byte sequence not only has endianness issues, but will have
> a weird non-dense set of valid values because of the restrictions on
> valid multibyte characters.
> 
> Given that chr() has never before behaved sanely for multibyte values at
> all, extending it to Unicode code points is a reasonable extension,
> and throwing error for other encodings is reasonable too.  If we ever do
> come across code-point standards for other encodings we can adopt 'em at
> that time.

I don't understand whole discussion.

Why do you think that employing the Unicode code point as the chr()
argument could avoid endianness issues? Are you going to represent
Unicode code point as UCS-4? Then you have to specify the endianness
anyway.  (see the UCS-4 standard for more details)

Or are you going to represent Unicode point as a character string such
as 'U+0259'? Then representing any encoding as a string could avoid
endianness issues anyway, and I don't see Unicode code point is any
better than others.

Also I'd like to point out all encodings has its own code point
systems as far as I know. For example, EUC-JP has its corresponding
code point systems, ASCII, JIS X 0208 and JIS X 0212. So I don't see
we can't use "code point" as chr()'s argument for othe encodings(of
course we need optional parameter specifying which character set is
supposed).
--
Tatsuo Ishii
SRA OSS, Inc. Japan


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Martijn van Oosterhout
Дата:
Сообщение: Re: Hash index todo list item
Следующее
От: Simon Riggs
Дата:
Сообщение: Re: Include Lists for Text Search