Re: WIP Incremental JSON Parser

From: Andrew Dunstan
Subject: Re: WIP Incremental JSON Parser
Date:
Msg-id: 7ead3d4d-ddc6-2499-0a04-6b29c693ea4a@dunslane.net
In reply to: Re: WIP Incremental JSON Parser (Jacob Champion <jacob.champion@enterprisedb.com>)
Responses: Re: WIP Incremental JSON Parser
List: pgsql-hackers
On 2024-03-29 Fr 17:21, Jacob Champion wrote:
> On Fri, Mar 29, 2024 at 9:42 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Here's a new set of patches, with I think everything except the error case Asserts attended to. There's a change to
>> add semantic processing to the test suite in patch 4, but I'd fold that into patch 1 when committing.
 
> Thanks! 0004 did highlight one last bug for me -- the return value
> from semantic callbacks is ignored, so applications can't interrupt
> the parsing early:
>
>> +                       if (ostart != NULL)
>> +                           (*ostart) (sem->semstate);
>> +                       inc_lex_level(lex);
> Anything not JSON_SUCCESS should be returned immediately, I think, for
> this and the other calls.


Fixed
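To be precise about the shape of the fix (a sketch of the pattern, not
the exact committed hunk; `result` here is a local JsonParseErrorType):

    if (ostart != NULL)
    {
        result = (*ostart) (sem->semstate);
        if (result != JSON_SUCCESS)
            return result;
    }
    inc_lex_level(lex);

Every semantic callback invocation now gets the same treatment.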


>
>> +           /* make sure we've used all the input */
>> +           if (lex->token_terminator - lex->token_start != ptok->len)
>> +               return JSON_INVALID_TOKEN;
> nit: Debugging this, if anyone ever hits it, is going to be confusing.
> The error message is going to claim that a valid token is invalid, as
> opposed to saying that the parser has a bug. Short of introducing
> something like JSON_INTERNAL_ERROR, maybe an Assert() should be added
> at least?


Done
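Specifically, I kept the defensive return for production builds and put
the Assert in front of it, roughly:

    /* make sure we've used all the input */
    Assert(lex->token_terminator - lex->token_start == ptok->len);
    if (lex->token_terminator - lex->token_start != ptok->len)
        return JSON_INVALID_TOKEN;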


>
>> +       case JSON_NESTING_TOO_DEEP:
>> +           return(_("Json nested too deep, maximum permitted depth is 6400"));
> nit: other error messages use all-caps, "JSON"


Fixed


>
>> +       char        chunk_start[100],
>> +                   chunk_end[100];
>> +
>> +       snprintf(chunk_start, 100, "%s", ib->buf.data);
>> +       snprintf(chunk_end, 100, "%s", ib->buf.data + (ib->buf.len - (MIN_CHUNK + 99)));
>> +       elog(NOTICE, "incremental manifest:\nchunk_start='%s',\nchunk_end='%s'", chunk_start, chunk_end);
> Is this NOTICE intended to be part of the final commit?


No, removed.


>
>> +       inc_state = json_parse_manifest_incremental_init(&context);
>> +
>> +       buffer = pg_malloc(chunk_size + 64);
> These `+ 64`s (there are two of them) seem to be left over from
> debugging. In v7 they were just `+ 1`.


Fixed
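i.e. back to the v7 shape (I'm assuming, as in v7, the single extra
byte is for the terminating NUL):

    buffer = pg_malloc(chunk_size + 1);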


I've also added the Asserts to the error handling code you suggested. 
This tests fine with the regression suite under FORCE_JSON_PSTACK.


>
> --
>
> From this point onward, I think all of my feedback is around
> maintenance characteristics, which is firmly in YMMV territory, and I
> don't intend to hold up a v1 of the feature for it since I'm not the
> maintainer. :D
>
> The complexity around the checksum handling and "final chunk"-ing is
> unfortunate, but not a fault of this patch -- just a weird consequence
> of the layering violation in the format, where the checksum is inside
> the object being checksummed. I don't like it, but I don't think
> there's much to be done about it.


Yeah. I agree it's ugly, but I don't think we should let it hold us up 
at all. Working around it doesn't involve too much extra code.


>
> By far the trickiest part of the implementation is the partial token
> logic, because of all the new ways there are to handle different types
> of failures. I think any changes to the incremental handling in
> json_lex() are going to need intense scrutiny from someone who already
> has the mental model loaded up. I'm going snowblind on the patch and
> I'm no longer the right person to review how hard it is to get up to
> speed with the architecture, but I'd say it's not easy.


json_lex() is not really a very hot piece of code. I have added a 
comment at the top giving a brief description of the partial token 
handling. Even without the partial token logic it can be a bit scary, 
but I don't think the partial token piece is so scary that we 
shouldn't proceed.
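For anyone coming to the code fresh, the new comment describes roughly
this scheme (simplified; the comment in the patch is the authoritative
version):

    /*
     * Partial token handling: if a chunk of input ends part way through
     * a token, stash the incomplete bytes in the lexer's partial-token
     * buffer and return JSON_INCOMPLETE.  When the next chunk arrives,
     * append input to that buffer until the stashed token can be
     * completed, lex it from there, and then resume normal lexing in
     * the new chunk.  Running out of input mid-token is only an error
     * on the final chunk.
     */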


>
> For something as security-critical as a JSON parser, I'd usually want
> to counteract that by sinking the observable behaviors in concrete and
> getting both the effective test coverage *and* the code coverage as
> close to 100% as we can stand. (By that, I mean that purposely
> introducing observable bugs into the parser should result in test
> failures. We're not there yet when it comes to the semantic callbacks,
> at least.) But I don't think the current JSON parser is held to that
> standard currently, so it seems strange for me to ask for that here.


Yeah.


>
> I think it'd be good for a v1.x of this feature to focus on
> simplification of the code, and hopefully consolidate and/or eliminate
> some of the duplicated parsing work so that the mental model isn't
> quite so big.
>

I'm not sure how you think that can be done. Which parts of the 
processing are you finding difficult to come to grips with?

Anyway, here are new patches. I've rolled the new semantic test into the 
first patch.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com
