Re: JSON and unicode surrogate pairs

From	Andrew Dunstan
Subject	Re: JSON and unicode surrogate pairs
Date	June 9, 2013, 23:05:20
Msg-id	51B50662.5030209@dunslane.net
In reply to	Re: JSON and unicode surrogate pairs (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: JSON and unicode surrogate pairs
List	pgsql-hackers

On 06/06/2013 12:53 PM, Robert Haas wrote:
> On Wed, Jun 5, 2013 at 10:46 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> In 9.2, the JSON parser didn't check the validity of the use of unicode
>> escapes other than that it required 4 hex digits to follow '\u'. In 9.3,
>> that is still the case. However, the JSON accessor functions and operators
>> also try to turn JSON strings into text in the server encoding, and this
>> includes de-escaping \u sequences. This works fine except when there is a
>> pair of sequences representing a UTF-16 type surrogate pair, something that
>> is explicitly permitted in the JSON spec.
>>
>> The attached patch is an attempt to remedy that, and a surrogate pair is
>> turned into the correct code point before converting it to whatever the
>> server encoding is.
>>
>> Note that this would mean we can still put JSON with incorrect use of
>> surrogates into the database, as now (9.2 and later), and they will cause
>> almost all the accessor functions to raise an error, as now (9.3). All this
>> does is allow JSON that uses surrogates correctly not to fail when applying
>> the accessor functions and operators. That's a possible violation of POLA,
>> and at least worthy of a note in the docs, but I'm not sure what else we can
>> do now - adding this check to the input lexer would possibly cause restores
>> to fail, which users might not thank us for.
> I think the approach you've proposed here is a good one.
>

I did that, but it's evident from the buildfarm that there's more work
to do. The problem is that we do the de-escaping as we lex the JSON to
construct the lookahead token, and at that stage we don't know whether
or not the de-escaped value is really going to be needed. That means we
can raise errors in far too many places. It's failing on this line:
   converted = pg_any_to_server(utf8str, utf8len, PG_UTF8);

even though the operator in use ("->") doesn't even use the de-escaped 
value.
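
For context, the string handed to that conversion call is the UTF-8
form of the de-escaped text. A minimal sketch of the surrogate-pair
arithmetic involved might look like the following. The function names
here are hypothetical and are not taken from the PostgreSQL source;
the sketch only shows how a high/low pair maps to one code point and
how that code point is written out as UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* High (leading) surrogates occupy U+D800..U+DBFF. */
static int
is_high_surrogate(uint32_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

/* Low (trailing) surrogates occupy U+DC00..U+DFFF. */
static int
is_low_surrogate(uint32_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

/* Combine a valid high/low pair into a code point in U+10000..U+10FFFF. */
static uint32_t
combine_surrogates(uint32_t hi, uint32_t lo)
{
    return 0x10000 + ((hi & 0x3FF) << 10) + (lo & 0x3FF);
}

/*
 * Encode one code point as UTF-8 into buf (at least 4 bytes).
 * Returns the number of bytes written, or 0 if cp is a lone surrogate
 * or lies beyond U+10FFFF.
 */
static size_t
encode_utf8(uint32_t cp, unsigned char *buf)
{
    if (cp < 0x80)
    {
        buf[0] = (unsigned char) cp;
        return 1;
    }
    if (cp < 0x800)
    {
        buf[0] = (unsigned char) (0xC0 | (cp >> 6));
        buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* surrogates are not encodable alone */
        buf[0] = (unsigned char) (0xE0 | (cp >> 12));
        buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        buf[0] = (unsigned char) (0xF0 | (cp >> 18));
        buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With these helpers, the escape pair \ud83d\ude00 combines to U+1F600,
which encodes as the four bytes F0 9F 98 80. Converting that UTF-8
string to the server encoding is a separate step and is the one that
raises the error.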

The real solution is going to be to delay the de-escaping of the string 
until it is known to be wanted. That's unfortunately going to be a bit 
invasive, but I can't see a better solution. I'll work on it ASAP. 
Getting it to work well without a small API change might be pretty hard, 
though.
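
To illustrate the idea of delaying the work, here is a rough sketch
rather than the planned patch. It assumes a hypothetical token type,
RawStringToken, that records only where the string literal sits; none
of these names come from the PostgreSQL source. Lexing never inspects
the \u escapes, and a separate check, standing in for the full
de-escaping and encoding conversion, runs only when a caller actually
wants the text value.

```c
#include <stddef.h>

/* Hypothetical token: the lexer records only the raw span. */
typedef struct
{
    const char *start;      /* first byte after the opening quote */
    size_t      len;        /* bytes before the closing quote */
} RawStringToken;

/*
 * Lexing step: locate the closing quote, stepping over escapes without
 * decoding them.  Returns 0 on success, -1 if the literal is malformed
 * or unterminated.
 */
static int
lex_string(const char *input, RawStringToken *tok)
{
    const char *p;

    if (input[0] != '"')
        return -1;
    for (p = input + 1; *p != '\0'; p++)
    {
        if (*p == '\\')
        {
            if (p[1] == '\0')
                return -1;
            p++;                /* skip the escaped character */
        }
        else if (*p == '"')
        {
            tok->start = input + 1;
            tok->len = (size_t) (p - tok->start);
            return 0;
        }
    }
    return -1;
}

/* Parse four hex digits at s; returns the value, or -1 if any is not hex. */
static long
parse_hex4(const char *s)
{
    long        v = 0;
    int         i;

    for (i = 0; i < 4; i++)
    {
        char        c = s[i];

        v <<= 4;
        if (c >= '0' && c <= '9')
            v |= c - '0';
        else if (c >= 'a' && c <= 'f')
            v |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F')
            v |= c - 'A' + 10;
        else
            return -1;
    }
    return v;
}

/*
 * On-demand step, called only by consumers that need the text value.
 * Returns 0 if every \uXXXX escape is usable, -1 for a bad escape or a
 * surrogate that is not part of a proper high/low pair.
 */
static int
check_escapes(const RawStringToken *tok)
{
    size_t      i = 0;

    while (i < tok->len)
    {
        const char *p = tok->start + i;
        long        cp;

        if (p[0] != '\\')
        {
            i++;
            continue;
        }
        if (i + 1 >= tok->len)
            return -1;
        if (p[1] != 'u')
        {
            i += 2;             /* other escapes are not examined here */
            continue;
        }
        if (i + 6 > tok->len || (cp = parse_hex4(p + 2)) < 0)
            return -1;
        i += 6;
        if (cp >= 0xDC00 && cp <= 0xDFFF)
            return -1;          /* low surrogate with no high before it */
        if (cp >= 0xD800 && cp <= 0xDBFF)
        {
            long        lo;

            if (i + 6 > tok->len || p[6] != '\\' || p[7] != 'u')
                return -1;
            lo = parse_hex4(p + 8);
            if (lo < 0xDC00 || lo > 0xDFFF)
                return -1;
            i += 6;
        }
    }
    return 0;
}
```

In this arrangement a string containing a lone \ud83d still lexes
successfully, so an operator that only navigates the structure never
sees an error; only a caller that invokes the on-demand step does.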

cheers

andrew
