Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?

Поиск

Список

Период

Сортировка

От	James Zhou
Тема	Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
Дата	22 декабря 2016 г. 05:24:09
Msg-id	CAGuREpMnw99Xu52jKwfBSfjMWxby3x=CYZab=1F+vjOkY+q3Mw@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode? (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
Список	pgsql-general

Дерево обсуждения

I figured out that I need to use the function CHR to enter supplementary unicode characters (code points > FFFF, i.e. planes 1 - F), e.g.

insert into unicode(id, string) values(100, CHR(128120)); -- a mojo character, https://unicodelookup.com/#0x1f478/1

insert into unicode(id, string) values(101, CHR(128121)); -- another mojo

insert into unicode(id, string) values(102, CHR(119071)); -- musical symbol g clef ottava alta

insert into unicode(id, string) values(103, CHR(155648)); -- a very infrequently used Chinese character

insert into unicode(id, string) values(104, CHR(155649)); -- another very infrequently used Chinese character

the parameters are decimal representation of the code point values, e.g. 128120 is the decimal value of 1f478

The format U&'\03B1' only works for chars between 0000 - FFFF

When entered with CHR(), PostgreSQL gets their char_length() correctly, so does substring() function.

Thank you all for help.

James

On Wed, Dec 21, 2016 at 8:31 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Vick Khera <vivek@khera.org> writes:
> On Wed, Dec 21, 2016 at 2:56 AM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> A PostgreSQL database with encoding=UTF8 just accepts the whole
>> range of Unicode, regardless that a character is defined for the
>> code or not.

> Interesting... when I converted my application and database to utf8
> encoding, I discovered that Postgres is picky about UTF-8. Specifically the
> UTF-8 code point 0xed 0xa0 0x8d which maps to UNICODE code point 0xd80d.
> This looks like a proper character but in fact is not a defined character
> code point.

Well, we're picky to the extent that RFC 3629 tells us to be picky:
http://www.faqs.org/rfcs/rfc3629.html

The case you mention is rejected because it would be half of a UTF16
"surrogate pair", which should not be used in any Unicode representation
other than UTF16; if we allowed it then there would be more than one way
to represent the same Unicode code point, which is undesirable for a lot
of reasons.

> So I think when you present an actual string of UTF8 encoded characters,
> Postgres does refuse characters unknown. However, as you observe, inserting
> the unicode code point directly does not produce an error:

> insert into unicode(id, string) values(1, U&'\d80d');
> INSERT 0 1

Hm. I think that's a bug. The lexer does know that \d80d is half of a
surrogate pair, and it expects the second half to come next. If you
supply something that isn't the second half of a surrogate pair, you
get an error as expected:

u8=# insert into unicode(id, string) values(1, U&'\d80dfoo');
ERROR: invalid Unicode surrogate pair at or near "foo'"
LINE 1: insert into unicode(id, string) values(1, U&'\d80dfoo');
^

But it looks like if you just end the string after the first half of a
surrogate, it just drops the character without complaint. Notice that
what got inserted was a zero-length string, not U+D08D:

u8=# select *, length(string) from unicode;
id | string | length
----+--------+--------
1 | | 0
(1 row)

I'd have expected a syntax error along the line of "incomplete Unicode
surrogate pair". Peter, I think this was your code to begin with ---
was it intentional to not raise error here, or is that an oversight?

regards, tom lane

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

В списке pgsql-general по дате отправления:

Предыдущее

От: Adrian Klaver
Дата: 22 декабря 2016 г., 05:20:58
Сообщение: Re: [GENERAL] Too long startup time after each crash.

Следующее

От: Rui Pacheco
Дата: 22 декабря 2016 г., 05:25:40
Сообщение: [GENERAL] UTF-8 on Postgres wire protocol

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?

Предыдущее

Следующее