Re: JSON and unicode surrogate pairs
От | Hannu Krosing |
---|---|
Тема | Re: JSON and unicode surrogate pairs |
Дата | |
Msg-id | 51B701C4.1060302@2ndQuadrant.com обсуждение исходный текст |
Ответ на | Re: JSON and unicode surrogate pairs (Andres Freund <andres@2ndquadrant.com>) |
Ответы |
Re: JSON and unicode surrogate pairs
(Andrew Dunstan <andrew@dunslane.net>)
Re: JSON and unicode surrogate pairs (Stefan Drees <stefan@drees.name>) |
Список | pgsql-hackers |
On 06/11/2013 10:47 AM, Andres Freund wrote: > On 2013-06-10 13:01:29 -0400, Andrew Dunstan wrote: >>> It's legal, is it not, to just write the equivalent Unicode character in >>> the JSON string and not use the escapes? If so I would think that that >>> would be the most common usage. If someone's writing an escape, they >>> probably had a reason for doing it that way, and might not appreciate >>> our overriding their decision. >> We never store the converted values in the JSON object, nor do we return >> them from functions that return JSON. But many of the functions and >> operators that process the JSON have variants that return text instead of >> JSON, and in those cases, when the value returned is a JSON string, we do >> the following to it: >> >> I have just realized that the problem is actually quite a lot bigger than >> that. We also use this value for field name comparison. So, let us suppose >> that we have a LATIN1 database and a piece of JSON with a field name >> containing the Euro sign ("\u20ac"), a character that is not in LATIN1. >> Making that processable so it doesn't blow up would be mighty tricky and >> error prone. The non-orthogonality I suggested as a solution upthread is, by >> contrast, very small and easy to manage, and not terribly hard to explain - >> see attached. > I think this all shows pretty clearly that it was a mistake allowing > json data in the database that we cannot entirely display with the > database's encoding. All the proposed ugly workarounds are only > necessary because we don't throw an error when originally validating the > json. > Even in an utf-8 database you can get errors due to \u unescaping (at > attribute access time, *NOT* at json_in() time) due to invalidate > surrogate pairs. > > I think this goes countrary to normal postgres approach of validating > data as strict as necessary. And I think we are going to regret not > fixing this while there are still relatively few users out there. Exactly - * allow in only valid JSON.* Validate all utf8 strings for valid unicode.* have one canonic way of outputting unicode - utf8for utf8 databases, escaped for all other encodings* If you need to store anything else, use text. Requiring preserving "original text" in json data field is Not Good! I fully expect '{"a":1, "a":none, "a":true, "a":"b"}'::json to come out as '{"a":b"}' (I know that currently this is noty true and will happen only once I read in the json value in client) For anything else - don't use json, use any text type If you really need a simple text-validated-as-valid-json-input datatype then add this. Call it jsontext or somesuch :) -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ
В списке pgsql-hackers по дате отправления: