Re: [PATCH] json_lex_string: don't overread on bad UTF8

From Jacob Champion
Subject Re: [PATCH] json_lex_string: don't overread on bad UTF8
Date
Msg-id CAOYmi+k_U2zuXhCswwBbo8fH743n4Gzr=hsYH+eyBa+mEOs7Rg@mail.gmail.com
In response to Re: [PATCH] json_lex_string: don't overread on bad UTF8  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Wed, May 1, 2024 at 8:40 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Thu, May 02, 2024 at 11:23:13AM +0900, Michael Paquier wrote:
> > About the fact that we may finish by printing unfinished UTF-8
> > sequences, I'd be curious to hear your thoughts.  Now, the information
> > provided about the partial byte sequences can be also useful for
> > debugging on top of having the error code, no?

Yes, but which information do you want? Do you want to know the bad
byte sequence, or see the glyph that corresponds to it (which is
probably �)? The glyph is better as long as it's complete; if it's a
bad sequence, then maybe you'd prefer to know the particular byte, but
that assumes a lot of technical knowledge on the part of whoever's
reading the message.

> By the way, as long as I have that in mind..  I am not sure that it is
> worth spending cycles in detecting the unfinished sequences and make
> these printable.  Wouldn't it be enough for more cases to adjust
> token_error() to truncate the byte sequences we cannot print?

Maybe. I'm beginning to wonder if I'm overthinking this particular
problem, and if we should just go ahead and print the bad sequence. At
least for the case of UTF-8 console encoding, replacement glyphs will
show up as needed.
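
To make the truncation alternative concrete, here is a minimal sketch of
what "truncate the byte sequences we cannot print" could look like. It is
not the existing token_error() code: utf8_seq_len() and
printable_prefix_len() are hypothetical helpers, and the sketch assumes
the input is meant to be UTF-8 and that only a cut-off trailing sequence
needs handling (other kinds of invalid bytes are left alone).

```c
#include <stddef.h>

/*
 * Total length of the UTF-8 sequence introduced by this lead byte,
 * or 0 if the byte cannot start a sequence.  (Hypothetical helper.)
 */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    if ((lead & 0xE0) == 0xC0)
        return 2;
    if ((lead & 0xF0) == 0xE0)
        return 3;
    if ((lead & 0xF8) == 0xF0)
        return 4;
    return 0;
}

/*
 * Return how many leading bytes of s[0..len) can be printed without
 * ending in the middle of a multibyte sequence.  Only the last
 * sequence is inspected; anything else is passed through unchanged.
 * (Hypothetical helper.)
 */
static size_t
printable_prefix_len(const char *s, size_t len)
{
    size_t      p = len;
    int         steps = 0;

    /* Walk back at most 4 bytes looking for the last lead byte. */
    while (p > 0 && steps < 4)
    {
        unsigned char c = (unsigned char) s[p - 1];

        p--;
        steps++;

        if ((c & 0xC0) != 0x80)
        {
            /* Not a continuation byte: this starts the last sequence. */
            int         need = utf8_seq_len(c);
            size_t      have = len - p;

            if (need > 0 && (size_t) need > have)
                return p;       /* sequence is cut short, drop it */
            return len;
        }
    }

    /* No lead byte nearby; not a truncation problem, leave as-is. */
    return len;
}
```

With this, a token ending in the first two bytes of a three-byte
character would be reported without those two bytes, while a complete
character at the end is kept whole.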

There is the matter of a client that's not using UTF-8, though. Do we
deal with that correctly today? (I understand why it was done the way
it was, at least on the server side, but it's still really weird to
have code that parses "JSON" that isn't actually Unicode.)

> Another thing that I think would be nice would be to calculate the
> location of what we're parsing on a given line, and provide that in
> the error context.  That would not be backpatchable as it requires a
> change in JsonLexContext, unfortunately, but it would help in making
> more sense with an error if the incomplete byte sequence is at the
> beginning of a token or after an expected character.

+1, at least that way you can skip directly to the broken spot during
a postmortem.
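
For illustration, the position calculation could be as small as the
sketch below. ErrorPosition and locate_offset() are hypothetical names
and do not reflect the current JsonLexContext layout; the column here is
counted in bytes, not characters, so it would be off for lines that
contain multibyte sequences before the error.

```c
#include <stddef.h>

/* Hypothetical result type: 1-based line and byte column. */
typedef struct ErrorPosition
{
    size_t      line;
    size_t      column;
} ErrorPosition;

/*
 * Compute the line and byte column of input[offset] by scanning the
 * input from the start and counting newlines.  (Hypothetical helper.)
 */
static ErrorPosition
locate_offset(const char *input, size_t offset)
{
    ErrorPosition pos = {1, 1};

    for (size_t i = 0; i < offset; i++)
    {
        if (input[i] == '\n')
        {
            pos.line++;
            pos.column = 1;
        }
        else
            pos.column++;
    }

    return pos;
}
```

A lexer that already tracks the current line number and the start of the
current line would not need the rescan; the scan just keeps the sketch
self-contained.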

Thanks,
--Jacob


