Re: jsonb, unicode escapes and escaped backslashes

From: Andrew Dunstan
Subject: Re: jsonb, unicode escapes and escaped backslashes
Date:
Msg-id: 54C64D46.4040206@dunslane.net
In response to: Re: jsonb, unicode escapes and escaped backslashes  (Noah Misch <noah@leadboat.com>)
Responses: Re: jsonb, unicode escapes and escaped backslashes  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
On 01/23/2015 02:18 AM, Noah Misch wrote:
> On Wed, Jan 21, 2015 at 06:51:34PM -0500, Andrew Dunstan wrote:
>> The following case has just been brought to my attention (look at the
>> differing number of backslashes):
>>
>>     andrew=# select jsonb '"\\u0000"';
>>        jsonb
>>     ----------
>>       "\u0000"
>>     (1 row)
>>
>>     andrew=# select jsonb '"\u0000"';
>>        jsonb
>>     ----------
>>       "\u0000"
>>     (1 row)
> A mess indeed.  The input is unambiguous, but the jsonb representation can't
> distinguish "\u0000" from "\\u0000".  Some operations on the original json
> type have similar problems, since they use an in-memory binary representation
> with the same shortcoming:
>
> [local] test=# select json_array_element_text($$["\u0000"]$$, 0) =
> test-# json_array_element_text($$["\\u0000"]$$, 0);
>   ?column?
> ----------
>   t
> (1 row)
>
>> Things get worse, though. On output, '\uabcd' for any four hex digits is
>> recognized as a unicode escape, and thus the backslash is not escaped, so
>> that we get:
>>
>>     andrew=# select jsonb '"\\uabcd"';
>>        jsonb
>>     ----------
>>       "\uabcd"
>>     (1 row)
>>
>>
>> We could probably fix this fairly easily for non- U+0000 cases by having
>> jsonb_to_cstring use a different escape_json routine.
> Sounds reasonable.  For 9.4.1, before more people upgrade?
>
>> But it's a mess, sadly, and I'm not sure what a good fix for the U+0000 case
>> would look like.
> Agreed.  When a string unescape algorithm removes some kinds of backslash
> escapes and not others, it's nigh inevitable that two semantically-distinct
> inputs can yield the same output.  json_lex_string() fell into that trap by
> making an exception for \u0000.  To fix this, the result needs to be fully
> unescaped (\u0000 converted to the NUL byte) or retain all backslash escapes.
> (Changing that either way is no fun now that an on-disk format is at stake.)
>
>> Maybe we should detect such input and emit a warning of
>> ambiguity? It's likely to be rare enough, but clearly not as rare as we'd
>> like, since this is a report from the field.
> Perhaps.  Something like "WARNING:  jsonb cannot represent \\u0000; reading as
> \u0000"?  Alas, but I do prefer that to silent data corruption.
>



Maybe something like this patch. I have two concerns about it, though.
The first is the possible performance impact of looking for the
offending string in every jsonb input. The second is that the test
isn't quite right, since input of \\\u0000 doesn't raise this issue;
the problem arises only when u0000 is preceded by an even number of
backslashes.
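The parity point above can be sketched as a standalone check (this is a hypothetical illustration, not the attached patch; the function name is invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical sketch: report whether raw JSON input contains "u0000"
 * preceded by an even, nonzero number of backslashes -- the case where
 * the stored jsonb value becomes ambiguous.  An odd count means a
 * genuine \u0000 unicode escape (possibly preceded by escaped
 * backslashes), which is not this bug.
 */
static bool
has_ambiguous_u0000(const char *json)
{
    const char *p = json;

    while ((p = strstr(p, "u0000")) != NULL)
    {
        int         nbackslashes = 0;
        const char *q = p;

        while (q > json && q[-1] == '\\')
        {
            nbackslashes++;
            q--;
        }
        if (nbackslashes > 0 && nbackslashes % 2 == 0)
            return true;
        p++;
    }
    return false;
}
```

On this logic, input "\\u0000" (two backslashes) would trigger the warning, while "\u0000" and "\\\u0000" (one and three backslashes) would not.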

For the moment, maybe I could commit the fix for the non-U+0000 case in
escape_json, and we could think some more about detecting and warning
about the problem strings.
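The output-side idea would be roughly the following (a simplified sketch, not PostgreSQL's actual escape_json(); the function name is invented, and only backslash and double quote are handled):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified sketch of the output-side fix: instead of leaving a
 * backslash alone when it looks like the start of a \uXXXX escape,
 * this variant doubles every backslash, so a stored literal \uabcd
 * prints as "\\uabcd" and round-trips correctly.  Real code must also
 * escape control characters; that is omitted here for brevity.
 */
static void
escape_json_always(char *out, size_t outsz, const char *in)
{
    size_t      o = 0;
    const char *p;

    out[o++] = '"';
    for (p = in; *p != '\0' && o + 3 < outsz; p++)
    {
        if (*p == '\\' || *p == '"')
            out[o++] = '\\';    /* unconditionally escape */
        out[o++] = *p;
    }
    out[o++] = '"';
    out[o] = '\0';
}
```

With such an escaper, a jsonb value holding the six literal characters \uabcd would print as "\\uabcd" rather than being mistaken for a unicode escape on output.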

cheers

andrew

Attachments
