Re: JSON and unicode surrogate pairs

From Andrew Dunstan
Subject Re: JSON and unicode surrogate pairs
Date
Msg-id 51B60669.3080107@dunslane.net
In reply to Re: JSON and unicode surrogate pairs  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: JSON and unicode surrogate pairs  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
On 06/10/2013 11:43 AM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Or we could abandon the conversion altogether, but that doesn't seem
>> very friendly either. I suspect the biggest case for people to use these
>> sequences is where the database is UTF8 but the client encoding is not.
> Well, if that's actually the biggest use-case, then maybe we should just
> say we're *not* in the business of converting those escapes.  That would
> make things nice and consistent regardless of the DB encoding, and it
> would avoid the problem of being able to input a value and then not
> being able to output it again.
>
> It's legal, is it not, to just write the equivalent Unicode character in
> the JSON string and not use the escapes?  If so I would think that that
> would be the most common usage.  If someone's writing an escape, they
> probably had a reason for doing it that way, and might not appreciate
> our overriding their decision.
>
>


We never store the converted values in the JSON object, nor do we return
them from functions that return JSON. But many of the functions and
operators that process the JSON have variants that return text instead
of JSON, and in those cases, when the value returned is a JSON string,
we do the following to it:

  * strip the outside quotes
  * de-escape the various escaped characters (i.e. everything preceded
    by a backslash in the railroad diagram for string at
    <http://www.json.org/>)


Here's an example of the difference:

    andrew=# select '{ "a": "\u00a9"}'::json -> 'a';
      ?column?
    ----------
      "\u00a9"
    (1 row)

    andrew=# select '{ "a": "\u00a9"}'::json ->>'a';
      ?column?
    ----------
      ©
    (1 row)

It's the process of producing the latter that is giving us a headache in
non-UTF8 databases.
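
To make that concrete, here is a sketch of the failing case, assuming a
hypothetical LATIN1-encoded database (the statement is illustrative, and
the exact error wording is omitted rather than guessed):

    -- Assumed: a database created with ENCODING 'LATIN1'.
    -- The ->> variant must de-escape \u20ac to the Euro sign in
    -- order to return text, but U+20AC has no LATIN1 equivalent,
    -- so the conversion cannot succeed.
    select '{ "a": "\u20ac"}'::json ->> 'a';
    -- fails: the de-escaped character cannot be represented
    -- in the server encoding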

... [ more caffeine is consumed ] ...

I have just realized that the problem is actually quite a lot bigger
than that. We also use this value for field name comparison. So, let us
suppose that we have a LATIN1 database and a piece of JSON with a field
name containing the Euro sign ("\u20ac"), a character that is not in
LATIN1. Making that processable so it doesn't blow up would be mighty
tricky and error-prone. The non-orthogonality I suggested as a solution
upthread is, by contrast, very small and easy to manage, and not
terribly hard to explain - see attached.
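
For illustration, here is a sketch of the field-name case just
described, again assuming a hypothetical LATIN1 database:

    -- Assumed: a LATIN1-encoded database.
    -- Even fetching the unrelated key "b" forces the field name
    -- "\u20ac" to be de-escaped so it can be compared against 'b',
    -- and the Euro sign cannot be represented in LATIN1.
    select '{ "\u20ac": 1, "b": 2 }'::json -> 'b';
    -- blows up during field-name comparison, not on output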

cheers

andrew


Attachments
