Re: JSON and unicode surrogate pairs

From Andrew Dunstan
Subject Re: JSON and unicode surrogate pairs
Date
Msg-id 51B60669.3080107@dunslane.net
In reply to Re: JSON and unicode surrogate pairs  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: JSON and unicode surrogate pairs  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.
>
>


We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

  * strip the outside quotes
  * de-escape the various escaped characters (i.e. everything preceded
    by a backslash in the railroad diagram for string at
    <http://www.json.org/>)


Here's an example of the difference:

    andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
      ?column?
    ----------
      "\u00a9"
    (1 row)

    andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
      ?column?
    ----------
      ©
    (1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.
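
To make that concrete, here is a sketch of the failing case, assuming a
hypothetical LATIN1-encoded database (the statement is illustrative, and
the exact error wording is omitted rather than guessed):

    -- Assumed: a database created with ENCODING 'LATIN1'.
    -- The ->> variant must de-escape \u20ac to the Euro sign in
    -- order to return text, but U+20AC has no LATIN1 equivalent,
    -- so the conversion cannot succeed.
    select '{ "a": "\u20ac"}'::json ->> 'a';
    -- fails: the de-escaped character cannot be represented
    -- in the server encoding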

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error-prone. The non-orthogonality I suggested as a solution
upthread is, by contrast, very small and easy to manage, and not
terribly hard to explain - see attached.
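
For illustration, here is a sketch of the field-name case just
described, again assuming a hypothetical LATIN1 database:

    -- Assumed: a LATIN1-encoded database.
    -- Even fetching the unrelated key "b" forces the field name
    -- "\u20ac" to be de-escaped so it can be compared against 'b',
    -- and the Euro sign cannot be represented in LATIN1.
    select '{ "\u20ac": 1, "b": 2 }'::json -> 'b';
    -- blows up during field-name comparison, not on output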

cheers

andrew


Attachments
