Re: JSON and unicode surrogate pairs

Поиск
Список
Период
Сортировка
От Stefan Drees
Тема Re: JSON and unicode surrogate pairs
Дата
Msg-id 51B72167.1080201@drees.name
обсуждение исходный текст
Ответ на Re: JSON and unicode surrogate pairs  (Hannu Krosing <hannu@2ndQuadrant.com>)
Ответы Re: JSON and unicode surrogate pairs  (Hannu Krosing <hannu@2ndQuadrant.com>)
Список pgsql-hackers
On 2013-06-11 12:53 CEST, Hannu Krosing wrote:
> On 06/11/2013 10:47 AM, Andres Freund wrote:
>> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
>>>> It's legal, is it not, to just write the equivalent Unicode character in
>>>> the JSON string and not use the escapes?  If so I would think that that
>>>> would be the most common usage.  If someone's writing an escape, they
>>>> probably had a reason for doing it that way, and might not appreciate
>>>> our overriding their decision.
>>> We never store the converted values in the JSON object, nor do we return
>>> them from functions that return JSON. But many of the functions and
>>> operators that process the JSON have variants that return text instead of
>>> JSON, and in those cases, when the value returned is a JSON string, we do
>>> the following to it:
>>>
>>> I have just realized that the problem is actually quite a lot bigger than
>>> that. We also use this value for field name comparison. So, let us suppose
>>> that we have a LATIN1 database and a piece of JSON with a field name
>>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
>>> Making that processable so it doesn't blow up would be mighty tricky and
>>> error prone. The non-orthogonality I suggested as a solution upthread is, by
>>> contrast, very small and easy to manage, and not terribly hard to explain -
>>> see attached.
>> I think this all shows pretty clearly that it was a mistake allowing
>> json data in the database that we cannot entirely display with the
>> database's encoding. All the proposed ugly workarounds are only
>> necessary because we don't throw an error when originally validating the
>> json.
>> Even in an utf-8 database you can get errors due to \u unescaping (at
>> attribute access time, *NOT* at json_in() time) due to invalidate
>> surrogate pairs.
>>
>> I think this goes countrary to normal postgres approach of validating
>> data as strict as necessary. And I think we are going to regret not
>> fixing this while there are still relatively few users out there.
> Exactly -
>
>   * allow in only valid JSON.
>   * Validate all utf8 strings for valid unicode.
>   * have one canonic way of outputting unicode - utf8 for utf8 databases,
> escaped for all other encodings
>   * If you need to store anything else, use text.
>
> Requiring preserving "original text" in json data field is Not Good!
>
> I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
> as '{"a":b"}'

ahem, do you mean instead to give (none -> null and missing '"' inserted 
in "answer"):
=# SELECT '{"a":1, "a":null, "a":true, "a":"b"}'::json;                 json
-------------------------------------- {"a":"b"}

or only when "stored" in database and subsequently retrieved? The 
"original text" in this case was perfectly valid JSON text.

> (I know that currently this is noty true and will happen only once I
> read in the json value in client)

Isn't this a good situation and doesn't this also depend on the storage 
representation in the client?

What about this:
=# SELECT '{"measure":"seconds", "measure":42}'::json;                 json
-------------------------------------- {"measure":42}

I presume people being used to store metadata in "preceding" json object 
members with duplicate names, would want to decide in the client 
requesting the data what to do with the metadata information and at what 
point to "drop", wouldn't they :-?)

> For anything else - don't use json, use any text type
>
> If you really need a simple text-validated-as-valid-json-input datatype
> then add this.
>
> Call it jsontext or somesuch :)
>

All the best,
Stefan.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andrew Dunstan
Дата:
Сообщение: Re: JSON and unicode surrogate pairs
Следующее
От: Hannu Krosing
Дата:
Сообщение: Re: JSON and unicode surrogate pairs