On Tue, 2 Jan 2024 at 11:30, Peter Eisentraut <peter@eisentraut.org> wrote:
>
> On 06.12.23 22:08, Matthias van de Meent wrote:
> > PFA a patch that reduces the output size of nodeToString by 50%+ in
> > most cases (measured on pg_rewrite), which on my system reduces the
> > total size of pg_rewrite by 33% to 472KiB. This does keep the textual
> > pg_node_tree format alive, but reduces its size signficantly.
> >
> > The basic techniques used are
> > - Don't emit scalar fields when they contain a default value, and
> > make the reading code aware of this.
> > - Reasonable defaults are set for most datatypes, and overrides can
> > be added with new pg_node_attr() attributes. No introspection into
> > non-null Node/Array/etc. is being done though.
> > - Reset more fields to their default values before storing the values.
> > - Don't write trailing 0s in outDatum calls for by-ref types. This
> > saves many bytes for Name fields, but also some other pre-existing
> > entry points.
>
> Based on our discussions, my understanding is that you wanted to produce
> an updated patch set that is split up a bit.
I mentioned that I've been working on implementing (but have not yet
completed) a binary serialization format, with an implementation based
on Andres' generated metadata idea. However, that requires more
elaborate infrastructure than is currently available, so while I said
I'd expected it to be complete before the Christmas weekend, it'll
take some more time - I'm not sure it'll be ready for PG17.
In the meantime here's an updated version of the v0 patch, formally
keeping the textual format alive, while reducing the size
significantly (nearing 2/3 reduction), taking your comments into
account. I think the gains are worth the consideration without taking
into account the as-of-yet unimplemented binary format.
> My suggestion is to make incremental patches along these lines:
> [...]
Something like the attached? It splits out into the following
0001: basic 'omit default values'
0002: reset location and other querystring-related node fields for all
catalogs of type pg_node_tree.
0003: add default marking on typmod fields.
0004 & 0006: various node fields marked with default() based on
observed common or initial values of those fields
0005: truncate trailing 0s from outDatum
0007 (new): do run-length + gap coding for bitmapset and the various
integer list types. This saves a surprising amount of bytes.
> The last one I have some doubts about, as previously expressed, but the
> first few seem sensible to me. By splitting it up we can consider these
> incrementally.
That makes a lot of sense. The numbers for the full patchset do seem
quite positive though: The metrics of the query below show a 40%
decrease in size of a fresh pg_rewrite (standard toast compression)
and a 5% decrease in size of the template0 database. The uncompressed
data of pg_rewrite.ev_action is also 60% smaller.
select pg_database_size('template0') as "template0"
, pg_total_relation_size('pg_rewrite') as "pg_rewrite"
, sum(pg_column_size(ev_action)) as "compressed"
, sum(octet_length(ev_action)) as "raw"
from pg_rewrite;
version | template0 | pg_rewrite | compressed | raw
---------|-----------+------------+------------+---------
master | 7545359 | 761856 | 573307 | 2998712
0001 | 7365135 | 622592 | 438224 | 1943772
0002 | 7258639 | 573440 | 401660 | 1835803
0003 | 7258639 | 565248 | 386211 | 1672539
0004 | 7176719 | 483328 | 317099 | 1316552
0005 | 7176719 | 483328 | 315556 | 1300420
0006 | 7160335 | 466944 | 302806 | 1208621
0007 | 7143951 | 450560 | 287659 | 1187237
While looking through the data, I noticed the larger views now consist
for a significant portion out of range table entries, specifically the
Alias and Var nodes (which are mostly repeated and/or repetative
values, but split across Nodes). I think column-major storage would be
more efficient to write, but I'm not sure it's worth the effort in
planner code.
Kind regards,
Matthias van de Meent
Neon (https://neon.tech)