Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?

From	Kyotaro HORIGUCHI
Subject	Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
Date	21 December 2016, 13:56:37
Msg-id	20161221.165637.246733544.horiguchi.kyotaro@lab.ntt.co.jp
In reply to	[GENERAL] How well does PostgreSQL 9.6.1 support unicode? (James Zhou <james@360data.ca>)
Responses	Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
List	pgsql-general

Hello,

At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james@360data.ca> wrote in
<CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg@mail.gmail.com>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041');  -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters

These shouldn't be a problem.

> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo character, https://unicodelookup.com/#0x1f478/1

https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html

> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.

So this is parsed as U+1F47 followed by the character '8', as you
have seen. It should be written as follows; the '+' is required
immediately after the backslash.

insert into unicode(id, string) values(100, U&'\+01F478');

The six-digit form accepts code points up to U+10FFFF, so the
entire Unicode code space is usable.
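
To make the parsing rule concrete, here is a minimal Python sketch of
how the two escape forms quoted above are read. It is only an
illustration of the documented rule, not PostgreSQL's actual lexer:
the function name decode_unicode_escapes is hypothetical, the escape
character is assumed to be the default backslash, and surrogate pairs
and the UESCAPE clause are ignored.

```python
import re

# Hypothetical helper that mimics the documented U&'...' escape rules only.
# Alternatives, in order: \+XXXXXX (six hex digits), \XXXX (four hex digits),
# and a doubled backslash standing for one literal backslash.
_ESCAPE = re.compile(r"\\\+([0-9A-Fa-f]{6})|\\([0-9A-Fa-f]{4})|\\\\")


def decode_unicode_escapes(body):
    """Decode the body of a U&'...' literal (text between the quotes)."""

    def replace(match):
        six, four = match.group(1), match.group(2)
        if six is not None:
            return chr(int(six, 16))   # six-digit form, up to U+10FFFF
        if four is not None:
            return chr(int(four, 16))  # four-digit form, BMP only
        return "\\"                    # doubled escape character

    return _ESCAPE.sub(replace, body)


# '\1F478' consumes only four digits: U+1F47 followed by a literal '8'.
print([hex(ord(c)) for c in decode_unicode_escapes(r"\1F478")])
# '\+01F478' yields the single intended code point U+1F478.
print([hex(ord(c)) for c in decode_unicode_escapes(r"\+01F478")])
```

Under these assumptions the first literal decodes to two characters
and the second to one, which matches the incorrect lengths and
substrings reported for the rows with id >= 100.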

> Observations
>
>    - BMP characters (id <= 10)
>       -  they are stored and fetched correctly.
>       - their lengths in char are correct, although some of them take 3
>       bytes (id = 4, 6, 7)
>       - *But their sorting order seems to be undefined. Can anyone comment
>       the sorting rules?*
>    - Non-BMP characters (id >= 100)
>       - they take 2 - 4 bytes.
>       - Their lengths in character are not correct
>       - they are not retrieved correctly, judged by the their fetched ascii
>       value (column 5 in the table above)
>       - substring is not correct

>
> Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
> causing a problem in my application.

Writing the code points in the six-digit form with '+' (for example
U&'\+01F478') would resolve the problem.

> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP.  Is
> there any documents defining which subset is fully supported?

A PostgreSQL database with encoding=UTF8 simply accepts the whole
range of Unicode code points, regardless of whether a character is
defined for a given code point.
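
As a rough illustration of what to expect once the literals are
written correctly, the Python sketch below reports the character count
and UTF-8 byte count for a few code points from this thread. The
helper name utf8_profile is hypothetical. The numbers follow from the
UTF-8 encoding itself, so char_length() and octet_length() in a UTF8
database would be expected to report the same values, though that is
an assumption not checked here against a server.

```python
# Sample code points: 'A', a BMP Chinese character, the emoji from the
# thread, and the highest code point the six-digit form accepts.
samples = [0x0041, 0x6211, 0x1F478, 0x10FFFF]


def utf8_profile(code_point):
    """Return (length in characters, length in UTF-8 bytes) for one code point."""
    ch = chr(code_point)
    return len(ch), len(ch.encode("utf-8"))


for cp in samples:
    chars, octets = utf8_profile(cp)
    print(f"U+{cp:06X}: {chars} character, {octets} byte(s) in UTF-8")
```

Each sample is one character long; only the byte count varies (1, 3,
4 and 4 bytes respectively).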

> Are any configuration I can change so that more unicode characters are
> supported?

As for sorting, that is discussed in Tom's mail.

--
Kyotaro Horiguchi
NTT Open Source Software Center
