Re: Unicode support

From Greg Stark
Subject Re: Unicode support
Date
Msg-id 4136ffa0904140849h36bdb5adl8b4e765b1906c4ed@mail.gmail.com
In reply to Re: Unicode support  (Peter Eisentraut <peter_e@gmx.net>)
Responses Re: Unicode support  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Unicode support  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Re: Unicode support  (Peter Eisentraut <peter_e@gmx.net>)
List pgsql-hackers
On Tue, Apr 14, 2009 at 1:32 PM, Peter Eisentraut <peter_e@gmx.net> wrote:
> On Monday 13 April 2009 22:39:58 Andrew Dunstan wrote:
>> Umm, but isn't that because your encoding is using one code point?
>>
>> See the OP's explanation w.r.t. canonical equivalence.
>>
>> This isn't about the number of bytes, but about whether or not we should
>> count characters encoded as two or more combined code points as a single
>> char or not.
>
> Here is a test case that shows the problem (if your terminal can display
> combining characters (xterm appears to work)):
>
> SELECT U&'\00E9', char_length(U&'\00E9');
>  ?column? | char_length
> ----------+-------------
>  é        |           1
> (1 row)
>
> SELECT U&'\0065\0301', char_length(U&'\0065\0301');
>  ?column? | char_length
> ----------+-------------
>  é        |           2
> (1 row)

What's really at issue is "what is a string?". That is, is it a sequence
of characters or a sequence of code points? If it's the former then we
would also have to prohibit certain strings, such as U&'\0301' (a
combining mark with no base character), entirely. And we would have to
make substr() pick out the right number of code points for the number
of characters requested, etc.
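
To make the distinction concrete, here is a rough sketch in Python rather
than in the backend. It is only an illustration: the helper names
(combining_sequences, is_defective, char_substr) are made up for this
example, and grouping a base code point with the combining marks that
follow it is a simplification of full Unicode grapheme cluster rules.

```python
import unicodedata


def combining_sequences(s):
    """Group each code point with the combining marks that follow it.

    Simplified stand-in for "characters"; not full grapheme segmentation.
    """
    groups = []
    for cp in s:
        if groups and unicodedata.combining(cp) != 0:
            groups[-1] += cp      # attach the mark to the preceding group
        else:
            groups.append(cp)     # start a new group
    return groups


def is_defective(s):
    """True if the string starts with a combining mark that has no base."""
    return bool(s) and unicodedata.combining(s[0]) != 0


def char_substr(s, count):
    """Return the first `count` characters, however many code points that takes."""
    return "".join(combining_sequences(s)[:count])


precomposed = "\u00e9"     # U+00E9, one code point
decomposed = "e\u0301"     # U+0065 U+0301, two code points
lone_mark = "\u0301"       # the problematic string from above

code_point_counts = (len(precomposed), len(decomposed))
character_counts = (len(combining_sequences(precomposed)),
                    len(combining_sequences(decomposed)))
```

Under the "sequence of code points" view the two spellings have lengths 1
and 2; under the "sequence of characters" view both have length 1, a
string like lone_mark has to be rejected or special-cased, and
char_substr(decomposed + "x", 1) has to return two code points.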



--
greg

