Re: [PATCH] json_lex_string: don't overread on bad UTF8

From: Michael Paquier
Subject: Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date:
Msg-id: ZjmjPyA29dIJjmjI@paquier.xyz
In reply to: Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Jacob Champion <jacob.champion@enterprisedb.com>)
Responses: Re: [PATCH] json_lex_string: don't overread on bad UTF8
List: pgsql-hackers
On Fri, May 03, 2024 at 07:05:38AM -0700, Jacob Champion wrote:
> On Fri, May 3, 2024 at 4:54 AM Peter Eisentraut <peter@eisentraut.org> wrote:
>> but for the general encoding conversion we have what
>> would appear to be the same behavior in report_invalid_encoding(), and
>> we go out of our way there to produce a verbose error message including
>> the invalid data.

I was looking for that a couple of days ago in the backend but could
not put my finger on it.  Thanks.

> We could port something like that to src/common. IMO that'd be more
> suited for an actual conversion routine, though, as opposed to a
> parser that for the most part assumes you didn't lie about the input
> encoding and is just trying not to crash if you're wrong. Most of the
> time, the parser just copies bytes between delimiters around and it's
> up to the caller to handle encodings... the exceptions to that are the
> \uXXXX escapes and the error handling.

Hmm.  That would still leave the backpatch issue at hand, which is
confusing to leave as it is.  Would it be complicated to truncate the
incomplete byte sequence in the error message and just give up, since
we cannot do better when the input byte sequence is incomplete?  We
would still report some information depending on the input string,
which should hopefully be enough.  With the location pointing to the
beginning of the sequence, even better.
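The truncation idea could be sketched roughly like this; a minimal, standalone illustration (the helper name `utf8_complete_prefix_len` is hypothetical and not part of the patch or of PostgreSQL) that trims a byte buffer back to the last complete UTF-8 sequence so an error message never emits a partial multibyte character:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length of the longest prefix of "s"
 * (of total length "len") that ends on a complete UTF-8 sequence.
 * Incomplete trailing sequences are dropped so they are not printed.
 */
static size_t
utf8_complete_prefix_len(const unsigned char *s, size_t len)
{
    size_t i = len;
    unsigned char lead;
    size_t seqlen;

    /* Back up over trailing continuation bytes (10xxxxxx). */
    while (i > 0 && (s[i - 1] & 0xC0) == 0x80)
        i--;

    if (i == 0)
        return 0;

    /* Classify the byte we stopped at. */
    lead = s[i - 1];
    if (lead < 0x80)
        seqlen = 1;
    else if ((lead & 0xE0) == 0xC0)
        seqlen = 2;
    else if ((lead & 0xF0) == 0xE0)
        seqlen = 3;
    else if ((lead & 0xF8) == 0xF0)
        seqlen = 4;
    else
        return i - 1;           /* invalid lead byte: drop it too */

    /* If the sequence starting at i-1 runs past "len", drop it. */
    if (i - 1 + seqlen > len)
        return i - 1;
    return len;
}
```

With this, a truncated input like "h" followed by a lone 0xC3 lead byte would be reported as just "h", keeping the error message well-formed in the client encoding.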

> Offhand, are all of our supported frontend encodings
> self-synchronizing? By that I mean, is it safe to print a partial byte
> sequence if the locale isn't UTF-8? (As I type this I'm staring at
> Shift-JIS, and thinking "probably not.")
>
> Actually -- hopefully this is not too much of a tangent -- that
> further crystallizes a vague unease about the API that I have. The
> JsonLexContext is initialized with something called the
> "input_encoding", but that encoding is necessarily also the output
> encoding for parsed string literals and error messages. For the server
> side that's fine, but frontend clients have the input_encoding locked
> to UTF-8, which seems like it might cause problems? Maybe I'm missing
> code somewhere, but I don't see a conversion routine from
> json_errdetail() to the actual client/locale encoding. (And the parser
> does not support multibyte input_encodings that contain ASCII in trail
> bytes.)

Referring to json_lex_string() that does UTF-8 -> ASCII -> give-up in
its conversion for FRONTEND, I guess?  Yep.  This limitation looks
like a problem, especially if plugging that to libpq.
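Jacob's point upthread about self-synchronization can be made concrete with Shift-JIS. The fragment below (an illustration, not code from the patch; the byte values are from the common CP932/Shift-JIS mapping) shows why a backward byte scan cannot find a character boundary there, while in UTF-8 it can:

```c
/*
 * Shift-JIS is not self-synchronizing: the trail byte of a two-byte
 * character may fall in the ASCII range.  Both strings below end in
 * the byte 0x5C, which is '\' in ASCII but is also the trail byte of
 * the two-byte Shift-JIS character 0x81 0x5C (a full-width dash).
 * Scanning backward from the last byte, the two cases are
 * indistinguishable.  UTF-8 avoids this: ASCII, lead bytes, and
 * continuation bytes occupy disjoint ranges.
 */
static const unsigned char ascii_backslash[] = { 'a', 0x5C };
static const unsigned char sjis_dash[]       = { 0x81, 0x5C };
```

So printing a truncated byte sequence is only safe in encodings whose byte ranges make boundaries recoverable, which is worth keeping in mind if the error-reporting behavior is generalized beyond UTF-8.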
--
Michael

