Re: JSON and unicode surrogate pairs

Поиск
Список
Период
Сортировка
От Andrew Dunstan
Тема Re: JSON and unicode surrogate pairs
Дата
Msg-id 51B71AE6.8000603@dunslane.net
обсуждение исходный текст
Ответ на Re: JSON and unicode surrogate pairs  (Hannu Krosing <hannu@2ndQuadrant.com>)
Ответы Re: JSON and unicode surrogate pairs  (Hannu Krosing <hannu@2ndQuadrant.com>)
Список pgsql-hackers
On 06/11/2013 06:53 AM, Hannu Krosing wrote:
> On 06/11/2013 10:47 AM, Andres Freund wrote:
>> On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote:
>>>> It's legal, is it not, to just write the equivalent Unicode character in
>>>> the JSON string and not use the escapes?  If so I would think that that
>>>> would be the most common usage.  If someone's writing an escape, they
>>>> probably had a reason for doing it that way, and might not appreciate
>>>> our overriding their decision.
>>> We never store the converted values in the JSON object, nor do we return
>>> them from functions that return JSON. But many of the functions and
>>> operators that process the JSON have variants that return text instead of
>>> JSON, and in those cases, when the value returned is a JSON string, we do
>>> the following to it:
>>>
>>> I have just realized that the problem is actually quite a lot bigger than
>>> that. We also use this value for field name comparison. So, let us suppose
>>> that we have a LATIN1 database and a piece of JSON with a field name
>>> containing the Euro sign ("\u20ac"), a character that is not in LATIN1.
>>> Making that processable so it doesn't blow up would be mighty tricky and
>>> error prone. The non-orthogonality I suggested as a solution upthread is, by
>>> contrast, very small and easy to manage, and not terribly hard to explain -
>>> see attached.
>> I think this all shows pretty clearly that it was a mistake allowing
>> json data in the database that we cannot entirely display with the
>> database's encoding. All the proposed ugly workarounds are only
>> necessary because we don't throw an error when originally validating the
>> json.
>> Even in an utf-8 database you can get errors due to \u unescaping (at
>> attribute access time, *NOT* at json_in() time) due to invalidate
>> surrogate pairs.
>>
>> I think this goes countrary to normal postgres approach of validating
>> data as strict as necessary. And I think we are going to regret not
>> fixing this while there are still relatively few users out there.
> Exactly -
>
>   * allow in only valid JSON.
>   * Validate all utf8 strings for valid unicode.
>   * have one canonic way of outputting unicode - utf8 for utf8 databases,
> escaped for all other encodings
>   * If you need to store anything else, use text.
>
> Requiring preserving "original text" in json data field is Not Good!
>
> I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out
> as '{"a":b"}'
>
> (I know that currently this is noty true and will happen only once I
> read in the json value in client)
>
> For anything else - don't use json, use any text type
>
> If you really need a simple text-validated-as-valid-json-input datatype
> then add this.
>
> Call it jsontext or somesuch :)
>


Most of this is irrelevant to solving the current problem. We don't have 
a greenfields development of JSON - we have an existing type with which 
we have to work. Note too that your "utf8 for utf8 databases, escaped 
for all other encodings" is exactly the opposite of what Noah and Andres 
and Tom are arguing, namely that the database encoding should not affect 
the behaviour, and is very close to what I proposed yesterday.

It's a pity that we don't have a non-error producing conversion function 
(or if we do that I haven't found it). Then we might adopt a rule for 
processing unicode escapes that said "convert unicode escapes to the 
database encoding if possible, and if not then emit them unchanged." 
which might be a reasonable compromise.

cheers

andrew








В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: Server side lo-funcs name
Следующее
От: Stefan Drees
Дата:
Сообщение: Re: JSON and unicode surrogate pairs