Re: JSON and unicode surrogate pairs

From: Hannu Krosing
Subject: Re: JSON and unicode surrogate pairs
Date:
Msg-id: 51B72312.1030304@2ndQuadrant.com
In reply to: Re: JSON and unicode surrogate pairs  (Andrew Dunstan <andrew@dunslane.net>)
Responses: Re: JSON and unicode surrogate pairs  (Andrew Dunstan <andrew@dunslane.net>)
List: pgsql-hackers
On 06/11/2013 02:41 PM, Andrew Dunstan wrote:
>
> On 06/11/2013 06:53 AM, Hannu Krosing wrote:
>> On 06/11/2013 10:47 AM, Andres Freund wrote:
>>> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
>>>>> It's legal, is it not, to just write the equivalent Unicode
>>>>> character in the JSON string and not use the escapes?  If so I
>>>>> would think that that would be the most common usage.  If
>>>>> someone's writing an escape, they probably had a reason for doing
>>>>> it that way, and might not appreciate our overriding their
>>>>> decision.
>>>> We never store the converted values in the JSON object, nor do we
>>>> return them from functions that return JSON. But many of the
>>>> functions and operators that process the JSON have variants that
>>>> return text instead of JSON, and in those cases, when the value
>>>> returned is a JSON string, we do the following to it:
>>>>
>>>> I have just realized that the problem is actually quite a lot
>>>> bigger than that. We also use this value for field name comparison.
>>>> So, let us suppose that we have a LATIN1 database and a piece of
>>>> JSON with a field name containing the Euro sign ("\u20ac"), a
>>>> character that is not in LATIN1. Making that processable so it
>>>> doesn't blow up would be mighty tricky and error prone. The
>>>> non-orthogonality I suggested as a solution upthread is, by
>>>> contrast, very small and easy to manage, and not terribly hard to
>>>> explain - see attached.
>>> I think this all shows pretty clearly that it was a mistake allowing
>>> json data in the database that we cannot entirely display with the
>>> database's encoding. All the proposed ugly workarounds are only
>>> necessary because we don't throw an error when originally validating
>>> the json.
>>> Even in an utf-8 database you can get errors due to \u unescaping
>>> (at attribute access time, *NOT* at json_in() time) due to invalid
>>> surrogate pairs.
>>>
>>> I think this goes contrary to the normal postgres approach of
>>> validating data as strictly as necessary. And I think we are going
>>> to regret not fixing this while there are still relatively few users
>>> out there.
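
To make that concrete, a minimal sketch against a 9.3-era development
build (behaviour as I understand the current code; the exact error
wording may differ):

    -- json_in() only checks syntax, so a lone high surrogate gets in:
    select '{"emoji": "\ud83d"}'::json;
    -- the text-returning accessor has to resolve the escape, and only
    -- then notices the missing low surrogate and raises an error:
    select '{"emoji": "\ud83d"}'::json ->> 'emoji';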
>> Exactly -
>>
>>   * Allow in only valid JSON.
>>   * Validate all utf8 strings for valid unicode.
>>   * Have one canonical way of outputting unicode - utf8 for utf8
>> databases, escaped for all other encodings.
>>   * If you need to store anything else, use text.
>>
>> Requiring preserving "original text" in json data field is Not Good!
>>
>> I fully expect '{"a":1, "a":null, "a":true, "a":"b"}'::json to come
>> out as '{"a":"b"}'
>>
>> (I know that currently this is not true and will happen only once I
>> read the json value back in the client)
>>
>> For anything else - don't use json, use any text type
>>
>> If you really need a simple text-validated-as-valid-json-input datatype
>> then add this.
>>
>> Call it jsontext or somesuch :)
>>
>
>
> Most of this is irrelevant to solving the current problem. We don't
> have a greenfield development of JSON - we have an existing type with
> which we have to work.
We do have the current (PostgreSQL 9.2) "json" data type, which is
loosely defined as "a string which can be fed to a JSON deserializer
without errors".

But then we also have a standard.

Sadly the standard is really not about a type, but about serialising a
subset of JavaScript structures to a utf8 string.

This kind of implies an underlying type - a structure consisting of a
few basic types plus lists and dictionaries/classes built from them -
but it does not really specify any type.
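
A concrete illustration of how loose that definition is, assuming a
LATIN1 database (hypothetical session, behaviour as of 9.2):

    -- every byte here is plain ASCII, so a LATIN1 database stores it,
    -- even though the escape denotes U+20AC (the Euro sign), which
    -- LATIN1 cannot represent:
    select '{"price\u20ac": 100}'::json;
    -- 9.2 validates syntax only and keeps the original text verbatim,
    -- so the conflict stays latent until something tries to unescape.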

> Note too that your "utf8 for utf8 databases, escaped for all other
> encodings"
> is exactly the opposite of what Noah and Andres and Tom are arguing,
> namely that the database encoding should not affect the behaviour,
It does not affect behaviour. It makes it possible to encode valid JSON
(utf8) in a client encoding which may not have the full set of unicode
characters.

We can do it here, as we know that by definition JSON _is_ unicode, so
clients have to be able to process any unicode character. We cannot do
this for plain text, which has no defined charset outside what
client_encoding postulates.
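
A sketch of the output rule I have in mind (proposed behaviour, not
what current builds do):

    -- utf8 database: escapes may be canonicalised to the real character
    select '{"cur": "\u20ac"}'::json;   -- => {"cur": "€"}
    -- latin1 database: the same value stays escaped on output, since
    -- LATIN1 has no representation for U+20AC
    select '{"cur": "\u20ac"}'::json;   -- => {"cur": "\u20ac"}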
> and is very close to what I proposed yesterday.
>
> It's a pity that we don't have a non-error-producing conversion
> function (or if we do, I haven't found it). Then we might adopt a rule
> for processing unicode escapes that said "convert unicode escapes to
> the database encoding
Only when extracting JSON keys or values to text does it make sense to
unescape to the database encoding.

Strings inside JSON itself are by definition utf8.
> if possible, and if not then emit them unchanged." which might be a
> reasonable compromise.
I'd opt for "... and if not then emit them quoted". The default should
be not losing any data.
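
Roughly, for the Euro example upthread in a LATIN1 database
(hypothetical behaviour under that rule; json_extract_path_text() is
one of the 9.3 accessors under discussion):

    -- instead of raising an error because LATIN1 lacks U+20AC, the
    -- text extraction would fall back to emitting the escape itself:
    select json_extract_path_text('{"sym": "\u20ac"}', 'sym');
    -- => \u20ac   (six characters, no data lost)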



-- 
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ



