Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?

From Kyotaro HORIGUCHI
Subject Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
Msg-id 20161221.165637.246733544.horiguchi.kyotaro@lab.ntt.co.jp
In reply to [GENERAL] How well does PostgreSQL 9.6.1 support unicode?  (James Zhou <james@360data.ca>)
Responses Re: [GENERAL] How well does PostgreSQL 9.6.1 support unicode?
List pgsql-general

Hello,

At Tue, 20 Dec 2016 16:41:51 -0800, James Zhou <james@360data.ca> wrote in
<CAGuREpPHJmoHe_5+P25UCosRvqQpbhPF_0LGFbJ+xYgUKndydg@mail.gmail.com>
> Unicode has evolved from version 1.0 with 7,161 characters released in 1991
> to version 9.0 with 128,172 characters released in June 2016. My questions
> are
> - which version of Unicode is supported by PostgreSQL 9.6.1?
> - what does "supported" exactly mean? simply store it? comparison? sorting?
> substring? etc.
...
> /* characters from BMP, 0000 - FFFF */
> insert into unicode(id, string) values(1, U&'\0041');  -- 'A'
...
> insert into unicode(id, string) values(5, U&'\6211\4EEC'); -- a string of two Chinese characters

These shouldn't be a problem.

> /* Below are unicode characters with code points beyond FFFF, aka planes 1 - F */
> insert into unicode(id, string) values(100, U&'\1F478'); -- a mojo character, https://unicodelookup.com/#0x1f478/1

https://www.postgresql.org/docs/9.6/static/sql-syntax-lexical.html

> Unicode characters can be specified in escaped form by writing a
> backslash followed by the four-digit hexadecimal code point
> number or alternatively a backslash followed by a plus sign
> followed by a six-digit hexadecimal code point number.

So this is parsed as U+1F47 followed by the character '8', as you
have seen. It should be written as follows; a '+' is needed just
after the backslash.

insert into unicode(id, string) values(100, U&'\+01F478');

The six-digit form accepts code points up to U+10FFFF, so the whole
Unicode code space is usable.
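
To make the parsing rule concrete, here is a rough model of it in
Python. It is only a sketch of the documented behaviour, not
PostgreSQL's actual lexer: decode_uescape is a hypothetical helper, it
ignores surrogate pairs and the UESCAPE clause, and it leaves malformed
escapes untouched where the server would raise an error.

```python
import re

# Rough model of the documented U&'...' escape forms (an assumption-laden
# sketch, not PostgreSQL's lexer):
#   \+XXXXXX  backslash, plus sign, six hex digits
#   \XXXX     backslash, four hex digits
#   \\        a literal backslash
_ESCAPE = re.compile(r"\\\+([0-9A-Fa-f]{6})|\\([0-9A-Fa-f]{4})|\\\\")


def decode_uescape(body):
    """Decode the text between the quotes of a U&'...' literal (hypothetical helper)."""
    def repl(match):
        six, four = match.group(1), match.group(2)
        if six is not None:
            return chr(int(six, 16))   # ValueError above U+10FFFF
        if four is not None:
            return chr(int(four, 16))
        return "\\"                    # doubled backslash
    return _ESCAPE.sub(repl, body)


# Four-digit form: only 1F47 is consumed, the trailing 8 is a plain character.
wrong = decode_uescape(r"\1F478")
# Six-digit form with '+': the whole code point U+1F478 is consumed.
right = decode_uescape(r"\+01F478")

print([hex(ord(c)) for c in wrong])
print([hex(ord(c)) for c in right])
```

The first literal yields two characters (U+1F47 and '8'), the second
yields the single character U+1F478, which matches what was observed
for id = 100.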

> Observations
>
>    - BMP characters (id <= 10)
>       -  they are stored and fetched correctly.
>       - their lengths in char are correct, although some of them take 3
>       bytes (id = 4, 6, 7)
>       - *But their sorting order seems to be undefined. Can anyone comment
>       the sorting rules?*
>    - Non-BMP characters (id >= 100)
>       - they take 2 - 4 bytes.
>       - Their lengths in character are not correct
>       - they are not retrieved correctly, judged by the their fetched ascii
>       value (column 5 in the table above)
>       - substring is not correct

>
> Specifically, the lack of support for emojo characters 0x1F478, 0x1F479 is
> causing a problem in my application.

Using the '+' form shown above would resolve the problem.

> My conclusion:
> - PostgreSQL 9.6.1 only supports a subset of unicode characters in BMP.  Is
> there any documents defining which subset is fully supported?

A PostgreSQL database with encoding=UTF8 just accepts the whole
range of Unicode code points, regardless of whether a character is
assigned to the code point or not.
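
As an illustration of why the character length and the byte length
differ, the following Python sketch prints both for a few code points.
It is not a query against the server; it assumes only that Python's
len() on a string counts code points and that encoding to UTF-8 gives
the stored byte sequence, which should correspond to what length() and
octet_length() report in a UTF8 database. utf8_size is a hypothetical
helper name.

```python
def utf8_size(code_point):
    """Return (length in characters, length in UTF-8 bytes) for one code point."""
    ch = chr(code_point)
    return len(ch), len(ch.encode("utf-8"))


# 'A', a CJK character from the BMP, an unassigned BMP code point (U+0378),
# an emoji outside the BMP, and the last code point of the Unicode space.
for cp in (0x0041, 0x6211, 0x0378, 0x1F478, 0x10FFFF):
    chars, octets = utf8_size(cp)
    print("U+%06X: %d character, %d byte(s)" % (cp, chars, octets))
```

Each code point counts as one character; the UTF-8 size grows from
1 byte for 'A' to 3 bytes for U+6211 and 4 bytes for anything beyond
U+FFFF.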

> Are any configuration I can change so that more unicode characters are
> supported?

As for sorting and character categorization, those are discussed in
Tom's mail.

--
Kyotaro Horiguchi
NTT Open Source Software Center



