Re: JSON and unicode surrogate pairs

Поиск
Список
Период
Сортировка
От Hannu Krosing
Тема Re: JSON and unicode surrogate pairs
Дата
Msg-id 51B701C4.1060302@2ndQuadrant.com
обсуждение исходный текст
Ответ на Re: JSON and unicode surrogate pairs  (Andres Freund <andres@2ndquadrant.com>)
Ответы Re: JSON and unicode surrogate pairs  (Andrew Dunstan <andrew@dunslane.net>)
Re: JSON and unicode surrogate pairs  (Stefan Drees <stefan@drees.name>)
Список pgsql-hackers
On 06/11/2013 10:47 AM, Andres Freund wrote:
> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
>>> It's legal, is it not, to just write the equivalent Unicode character in
>>> the JSON string and not use the escapes?  If so I would think that that
>>> would be the most common usage.  If someone's writing an escape, they
>>> probably had a reason for doing it that way, and might not appreciate
>>> our overriding their decision.
>> We never store the converted values in the JSON object, nor do we return
>> them from functions that return JSON. But many of the functions and
>> operators that process the JSON have variants that return text instead of
>> JSON, and in those cases, when the value returned is a JSON string, we do
>> the following to it:
>>
>> I have just realized that the problem is actually quite a lot bigger than
>> that. We also use this value for field name comparison. So, let us suppose
>> that we have a LATIN1 database and a piece of JSON with a field name
>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
>> Making that processable so it doesn't blow up would be mighty tricky and
>> error prone. The non-orthogonality I suggested as a solution upthread is, by
>> contrast, very small and easy to manage, and not terribly hard to explain -
>> see attached.
> I think this all shows pretty clearly that it was a mistake allowing
> json data in the database that we cannot entirely display with the
> database's encoding. All the proposed ugly workarounds are only
> necessary because we don't throw an error when originally validating the
> json.
> Even in an utf-8 database you can get errors due to \u unescaping (at
> attribute access time, *NOT* at json_in() time) due to invalidate
> surrogate pairs.
>
> I think this goes countrary to normal postgres approach of validating
> data as strict as necessary. And I think we are going to regret not
> fixing this while there are still relatively few users out there.
Exactly -
* allow in only valid JSON.* Validate all utf8 strings for valid unicode.* have one canonic way of outputting unicode -
utf8for utf8 databases,
 
escaped for all other encodings* If you need to store anything else, use text.

Requiring preserving "original text" in json data field is Not Good!

I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
as '{"a":b"}'

(I know that currently this is noty true and will happen only once I
read in the json value in client)

For anything else - don't use json, use any text type

If you really need a simple text-validated-as-valid-json-input datatype
then add this.

Call it jsontext or somesuch :)

-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ




В списке pgsql-hackers по дате отправления:

Предыдущее
От: Hannu Krosing
Дата:
Сообщение: Re: DO ... RETURNING
Следующее
От: Pavel Stehule
Дата:
Сообщение: Re: DO ... RETURNING