Re: WIP Incremental JSON Parser

From: Andrew Dunstan
Subject: Re: WIP Incremental JSON Parser
Date:
Msg-id: 253cef9e-1360-a1a6-5091-ad4368b1b8cc@dunslane.net
In reply to: Re: WIP Incremental JSON Parser  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: WIP Incremental JSON Parser  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2024-01-02 Tu 10:14, Robert Haas wrote:
> On Tue, Dec 26, 2023 at 11:49 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> Quite a long time ago Robert asked me about the possibility of an
>> incremental JSON parser. I wrote one, and I've tweaked it a bit, but the
>> performance is significantly worse than that of the current Recursive
>> Descent parser. Nevertheless, I'm attaching my current WIP state for it,
>> and I'll add it to the next CF to keep the conversation going.
> Thanks for doing this. I think it's useful even if it's slower than
> the current parser, although that probably necessitates keeping both,
> which isn't great, but I don't see a better alternative.
>
>> One possible use would be in parsing large manifest files for
>> incremental backup. However, it struck me a few days ago that this might
>> not work all that well. The current parser and the new parser both
>> palloc() space for each field name and scalar token in the JSON (unless
>> they aren't used, which is normally not the case), and they don't free
>> it, so that, particularly if done in frontend code, this amounts to a
>> possible memory leak, unless the semantic routines do the freeing
>> themselves. So while we can save some memory by not having to slurp in
>> the whole JSON in one hit, we aren't saving any of that other allocation
>> of memory, which amounts to almost as much space as the raw JSON.
> It seems like a pretty significant savings no matter what. Suppose the
> backup_manifest file is 2GB, and instead of creating a 2GB buffer, you
> create a 1MB buffer and feed the data to the parser in 1MB chunks.
> Well, that saves 2GB less 1MB, full stop. Now if we address the issue
> you raise here in some way, we can potentially save even more memory,
> which is great, but even if we don't, we still saved a bunch of memory
> that could not have been saved in any other way.
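
Right, and to make the shape of that concrete, the caller side would
look roughly like this (an illustrative sketch only; the function names
and signatures here are how I'd spell it, not necessarily what the
patch ends up with, and fp is assumed to be an open FILE * on the
manifest):

    JsonLexContext lex;
    JsonSemAction sem = {0};        /* semantic callbacks elided */
    char        buf[1024 * 1024];   /* 1MB chunk buffer */
    size_t      nread;
    JsonParseErrorType res;

    makeJsonLexContextIncremental(&lex, PG_UTF8, true);

    do
    {
        nread = fread(buf, 1, sizeof(buf), fp);
        res = pg_parse_json_incremental(&lex, &sem, buf, nread,
                                        feof(fp));
        /* non-final chunks would report "incomplete" on success */
        if (res != JSON_SUCCESS && res != JSON_INCOMPLETE)
            break;          /* report the parse error and bail */
    } while (!feof(fp));

Peak parser-side memory is then bounded by the chunk buffer plus
whatever the semantic routines retain, which is the point.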
>
> As far as addressing that other issue, we could address the issue
> either by having the semantic routines free the memory if they don't
> need it, or alternatively by having the parser itself free the memory
> after invoking any callbacks to which it might be passed. The latter
> approach feels more conceptually pure, but the former might be the
> more practical approach. I think what really matters here is that we
> document who must or may do which things. When a callback gets passed
> a pointer, we can document either that (1) it's a palloc'd chunk that
> the callback can free if it wants or (2) that it's a palloc'd chunk
> that the caller must not free or (3) that it's not a palloc'd chunk.
> We can further document the memory context in which the chunk will be
> allocated, if applicable, and when/if the parser will free it.


Yeah. One idea I had yesterday was to stash the field names, which in 
large JSON docs tend to be pretty repetitive, in a hash table instead of 
pstrdup'ing each instance. The name would be valid until the end of the 
parse, and would only need to be duplicated by the callback function if 
it were needed beyond that. That's not the case currently with the 
parse_manifest code. I'll work on using a hash table.
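
Something along these lines (an untested sketch; this assumes backend
dynahash, so frontend callers would need simplehash or similar, and the
fixed key length is just for illustration):

    #include "utils/hsearch.h"

    #define FIELD_NAME_KEYLEN 64    /* arbitrary cap, sketch only */

    typedef struct JsonFieldNameEntry
    {
        char    name[FIELD_NAME_KEYLEN];    /* hash key */
    } JsonFieldNameEntry;

    static HTAB *field_names = NULL;

    /*
     * Return a stable copy of fname, valid until the table is
     * destroyed at end of parse.  Callbacks that need the name
     * beyond that must pstrdup() it themselves.
     */
    static const char *
    intern_field_name(const char *fname)
    {
        JsonFieldNameEntry *entry;
        bool    found;

        if (field_names == NULL)
        {
            HASHCTL ctl;

            ctl.keysize = FIELD_NAME_KEYLEN;
            ctl.entrysize = sizeof(JsonFieldNameEntry);
            field_names = hash_create("json field names", 256, &ctl,
                                      HASH_ELEM | HASH_STRINGS);
        }

        /* HASH_ENTER copies the key into the new entry for us */
        entry = hash_search(field_names, fname, HASH_ENTER, &found);
        return entry->name;
    }

Repeated field names then cost one hash lookup apiece instead of one
pstrdup() apiece.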

The parse_manifest code does seem to pfree the scalar values it no 
longer needs fairly well, so maybe we don't need to do anything there.
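
As for the documentation side of your suggestion, with the hash table
approach I imagine the contract would read something like this against
the existing jsonapi.h callback typedef (the comment wording is just a
sketch, nothing committed):

    /*
     * json_ofield_action: called at the start and end of an object
     * field.
     *
     * fname points into a parser-owned table and remains valid until
     * the end of the parse.  Callbacks must not pfree() it, and must
     * pstrdup() it if they need it beyond the parse.
     */
    typedef JsonParseErrorType (*json_ofield_action) (void *state,
                                                      char *fname,
                                                      bool isnull);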


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com



