Re: Reducing output size of nodeToString

From Matthias van de Meent
Subject Re: Reducing output size of nodeToString
Date
Msg-id CAEze2Wigkd1+J4s=7wUqW8Y4g9mDWSC28119ukbKkf799WBpzg@mail.gmail.com
In reply to Re: Reducing output size of nodeToString  (Peter Eisentraut <peter@eisentraut.org>)
Responses Re: Reducing output size of nodeToString  (Peter Eisentraut <peter@eisentraut.org>)
List pgsql-hackers
On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size significantly.
> >
> > The basic techniques used are
> >   - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> >   - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> >   - Reset more fields to their default values before storing the values.
> >   - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.

I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.

In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth considering on their own, without
taking into account the as-yet-unimplemented binary format.
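
To illustrate the basic idea behind the textual-format savings (skip
fields that hold their default value on write, restore them on read),
here is a minimal sketch in Python. It is not the patch's C code: the
field names, the default values and the token syntax are all
assumptions made up for the example.

```python
# Hypothetical per-field defaults for an illustrative node type.
# These names and values are assumptions, not taken from the patch.
DEFAULTS = {"location": -1, "typmod": -1, "collation": 0, "varlevelsup": 0}


def write_node(node):
    """Serialize only the fields whose value differs from its default."""
    parts = [
        f":{name} {value}"
        for name, value in node.items()
        # Fields without a registered default are always written.
        if DEFAULTS.get(name, object()) != value
    ]
    return "{" + " ".join(parts) + "}"


def read_node(text):
    """Parse write_node() output, filling in every omitted field."""
    node = dict(DEFAULTS)  # start from the defaults
    tokens = text.strip("{}").split()
    for name, value in zip(tokens[0::2], tokens[1::2]):
        node[name.lstrip(":")] = int(value)
    return node


var = {"varno": 1, "varattno": 3, "location": -1, "typmod": -1,
       "collation": 0, "varlevelsup": 0}
print(write_node(var))  # only varno and varattno are written
```

The reader has to know the same defaults as the writer, which is why
the reading code needs to be made aware of the omission.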

> My suggestion is to make incremental patches along these lines:
> [...]

Something like the attached? It is split up as follows:
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapsets and the various
integer list types. This saves a surprising number of bytes.
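
As a rough illustration of what "run-length + gap coding" means for a
set of integers, here is a sketch in Python. The (gap, run) pair
layout is an assumption chosen for clarity and is not the output
format of 0007; it also only handles sorted, duplicate-free input,
which fits a bitmapset but not an arbitrary integer list.

```python
def encode(members):
    """Encode a sorted list of distinct non-negative ints as (gap, run) pairs.

    gap: distance from the end of the previous run (initially 0) to the
         first member of this run.
    run: number of consecutive integers in this run.
    """
    pairs = []
    prev_end = 0  # one past the last member of the previous run
    i = 0
    while i < len(members):
        start = members[i]
        run = 1
        # Extend the run while the next member is the next integer.
        while i + run < len(members) and members[i + run] == start + run:
            run += 1
        pairs.append((start - prev_end, run))
        prev_end = start + run
        i += run
    return pairs


def decode(pairs):
    """Inverse of encode(): expand (gap, run) pairs back into members."""
    members = []
    pos = 0
    for gap, run in pairs:
        start = pos + gap
        members.extend(range(start, start + run))
        pos = start + run
    return members


print(encode([1, 2, 3, 4, 5, 9, 10, 20]))
```

Long runs of consecutive members collapse into a single pair, and
gaps are usually small numbers, so the text form gets shorter.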

> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me.  By splitting it up we can consider these
> incrementally.

That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.

select pg_database_size('template0') as "template0"
     , pg_total_relation_size('pg_rewrite') as "pg_rewrite"
     , sum(pg_column_size(ev_action)) as "compressed"
     , sum(octet_length(ev_action)) as "raw"
from pg_rewrite;

 version | template0 | pg_rewrite | compressed |   raw
---------+-----------+------------+------------+---------
 master  |   7545359 |     761856 |     573307 | 2998712
 0001    |   7365135 |     622592 |     438224 | 1943772
 0002    |   7258639 |     573440 |     401660 | 1835803
 0003    |   7258639 |     565248 |     386211 | 1672539
 0004    |   7176719 |     483328 |     317099 | 1316552
 0005    |   7176719 |     483328 |     315556 | 1300420
 0006    |   7160335 |     466944 |     302806 | 1208621
 0007    |   7143951 |     450560 |     287659 | 1187237
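
For anyone who wants to re-derive the percentages quoted above from
the table, here is a small Python sketch. The byte counts are copied
from the "master" and "0007" rows; the helper name is made up for
this example.

```python
# Sizes in bytes, copied from the "master" and "0007" rows of the table.
master = {"template0": 7545359, "pg_rewrite": 761856,
          "compressed": 573307, "raw": 2998712}
patched = {"template0": 7143951, "pg_rewrite": 450560,
           "compressed": 287659, "raw": 1187237}


def reduction_pct(before, after):
    """Relative size reduction, in percent of the original size."""
    return 100.0 * (before - after) / before


reductions = {name: reduction_pct(master[name], patched[name])
              for name in master}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% smaller")
```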

While looking through the data, I noticed that a significant portion
of the larger views now consists of range table entries, specifically
the Alias and Var nodes (which are mostly repeated and/or repetitive
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.
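
To make the column-major remark concrete, here is a toy Python sketch.
It is purely illustrative: the field names are assumptions loosely
modelled on Var, and nothing like this exists in the patchset.
Storing each field as one column puts the repeated values next to
each other, where a simple run-length pass can collapse them.

```python
def to_columns(rows):
    """Turn a list of same-shaped dicts into a dict of field -> values."""
    return {field: [row[field] for row in rows] for field in rows[0]}


def from_columns(columns):
    """Inverse of to_columns(): rebuild the list of per-node dicts."""
    fields = list(columns)
    return [dict(zip(fields, values))
            for values in zip(*(columns[f] for f in fields))]


def run_length(values):
    """Collapse adjacent equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Four hypothetical Var-like nodes that differ only in varattno.
rows = [{"varno": 1, "varattno": n, "vartype": 23} for n in range(1, 5)]
columns = to_columns(rows)
print(run_length(columns["varno"]))    # one run covering all four nodes
print(run_length(columns["vartype"]))  # likewise
```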

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
