Thread: Compress prune/freeze records with Delta Frame of Reference algorithm


Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Hi hackers,

A prune/freeze record contains four sequences of integers representing
frozen, redirected, unused, and dead tuples. Currently, these
sequences are uncompressed, which adds overhead. Eliminating empty
leading bits in these offsets can reduce the size of the record, and
applying bit-packing algorithms can reduce it even further. Reducing
WAL record size also lessens the load on disk and network, which are
heavily used for storing WAL and transmitting it to replicas. For
example, with BLCKSZ equal to 8192, the maximum tuple offset is around
580. Despite this, all offsets are stored as 16-bit integers with no
compression applied.

The proposed patch compresses the unused and dead sequences using a
customized Delta Frame of Reference (DFoR) algorithm. The frozen and
redirected sequences cannot be compressed with DFoR because the order
of their elements is significant, and DFoR does not yet support
unsorted sequences. The theoretical compression ratio for dfor_u16 can
reach up to 16.

The new GUC wal_prune_dfor_compression controls (enables or disables)
compression for prune/freeze records.

An end-to-end TAP test, 052_prune_dfor_compression.pl, has been
implemented. It demonstrates an average compression ratio of at least
5 when analyzing prune/freeze records in practice.

The patch series includes:
1. 0001-Implement-vect-and-uniqsortvect-containers-and-bitpack.patch
     - Introduces the vect and uniqsortvect containers and the bitpack
       unit used by DFoR.

2. 0002-Implement-Delta-Frame-of-Reference-compression.patch
     - Implements DFoR itself.

3. 0003-Use-Delta-Frame-of-Reference-DFoR-to-compress-prune.patch
     - Applies DFoR to prune/freeze records.

Unit tests are included and integrated into the PostgreSQL build
system.

Using vect, bitpack, and DFoR for prune/freeze records is just one
example of their application. Independently of prune/freeze, these
algorithms can be effectively used by developers in other areas of
PostgreSQL.

Looking forward to discussion!

Best regards,
Evgeny Voropaev,
Tantor Labs, LLC.

Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Andrey Borodin
Date:

> On 7 Mar 2026, at 18:56, Evgeny Voropaev <evgeny.voropaev@tantorlabs.com> wrote:
>
> Looking forward to discussion!

Great idea and nice library for DFoR!

Prune/freeze records can be large, and reducing their size would help streaming
replication a lot. The algorithm itself (Delta Frame of Reference bit-packing) is
sound for sorted integer sequences.

Eventually we will have wholesale WAL compression, but the type of
compression proposed here is more promising for this particular case,
because it exploits knowledge about the nature of the compressed data.

The patchset is, obviously, a prototype that is worth working on further.
Perhaps the first thing to start with is fixing the CI failures [0].

I do not know if we can bundle an LGPL library. Luckily it's only used
for tests; once all design questions are resolved we can just
re-implement the tests with something else.

As a minor nit: do not use stdlib assert(), use capital Assert() :)

Are we 100% sure qsort() won't allocate something anywhere? sort_template.h seems
to be allocation-free, but just in case...


Best regards, Andrey Borodin.


[0] https://github.com/x4m/postgres_g/runs/67140025428


Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Hello Andrey,

 > Great idea and nice library for DFoR!

Thank you for your attention. I hope the patch will prove useful.

All your proposals and recommendations have been implemented in v02.
The meson settings have also been updated to support the new code.

 > Perhaps, first thing to start is to fix CI failures

With the meson settings fixed, the tests should now pass. Looking
forward to the results.

 > As a minor nit: do not use stdlib assert(), use capital Assert()

Done.

 > Are we 100% sure qsort() won't allocate something anywhere?
 > sort_template.h seems to be allocation-free, but just in case...

Neither library implementations nor their documentation guarantee
anything about qsort's behaviour regarding dynamic memory allocation.
So, as you proposed, qsort has been replaced with sort_template.h,
which we trust.

Once the tests pass, I hope we can move further.

Best regards, Evgeny Voropaev.

Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Rebased the patch onto 516310ed4db, and added checks in vect_reserve
that the destination pointer passed to memcpy is not NULL.

Looking forward to CFBOT results.


Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Andres Freund
Date:
Hi,

On 2026-03-07 21:56:18 +0800, Evgeny Voropaev wrote:
> A prune/freeze record contains four sequences of integers representing
> frozen, redirected, unused, and dead tuples. Currently, these
> sequences are uncompressed, which adds overhead. Eliminating empty
> leading bits in these offsets can reduce the size of the record, and
> applying bit-packing algorithms can reduce it even further. Reducing
> WAL record size also lessens the load on disk and network, which are
> heavily used for storing WAL and transmitting it to replicas. For
> example, with BLCKSZ equal to 8192, the maximum tuple offset is around
> 580. Despite this, all offsets are stored as 16-bit integers with no
> compression applied.

I'm unconvinced that this is a serious problem - typically the overhead of WAL
volume due to pruning / freezing is due to the full page images emitted, not
the raw size of the records. Once an FPI is emitted, this doesn't matter.

What gains have you measured in somewhat realistic workloads?

Greetings,

Andres Freund



Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Hello Andres,

> I'm unconvinced that this is a serious problem - typically the overhead of WAL
> volume due to pruning / freezing is due to the full page images emitted, not
> the raw size of the records. Once an FPI is emitted, this doesn't matter.
> 
> What gains have you measured in somewhat realistic workloads?

So far, we have had no tests in any real production environment.
Moreover, the load in the new test
(recovery/t/052_prune_dfor_compression.pl) is quite contrived. However,
it demonstrates a compression ratio of more than 5, measured over the
total size of all prune/freeze records with no filtering.

The next development step is compression of unsorted sequences. This
would allow PostgreSQL to also compress the 'frozen' and 'redirected'
offset sequences, which should result in a greater compression ratio.

But I agree with you, Andres: we need practical results to estimate
the benefit. I hope we can test it on a real load soon.

I also hope that, independently of its use in prune/freeze records,
DFoR itself could be used to compress sequences in other places in
PostgreSQL.

Best regards,
Evgeny Voropaev.




Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Rebased onto 2102ebb1953.
Fixed warnings about gnu_printf in tap.c.
Fixed a wrong destination pointer in memcpy in vect_init.
Fixed an intl library linking problem for the dfor tests under FreeBSD.

Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Tomas Vondra
Date:
On 3/24/26 15:28, Evgeny Voropaev wrote:
> Hello Andres,
> 
>> I'm unconvinced that this is a serious problem - typically the
>> overhead of WAL
>> volume due to pruning / freezing is due to the full page images
>> emitted, not
>> the raw size of the records. Once an FPI is emitted, this doesn't matter.
>>
>> What gains have you measured in somewhat realistic workloads?
> 
> So far, we have had no tests in any real production environment.
> Moreover, the load in the new test (recovery/
> t/052_prune_dfor_compression.pl) is quite contrived. However, it
> demonstrates a compression ratio of more than 5, and it is measured for
> an overall size of all prune/freeze records with no filtering.
> 
> Further development is the implementation of compression of unsorted
> sequences. This is going to allow PostgreSQL to compress also the
> 'frozen' and the 'redirected' offset sequences, which should result in a
> greater compression ratio.
> 
> But I agree with you, Andres, we need practical results to estimate a
> profit. I wish we would test it on some real load soon.
> 
> Also I hope, independently of its usage in prune/freeze records, the
> DFoR itself might be used for compression sequences in other places of PG.
> 

IMHO Andres is right. A ~170kB patch really should present some numbers
quantifying the expected benefit. It doesn't need to be a real workload
from production, but something plausible enough. Even some basic
back-of-the-envelope calculations might be enough to show the promise.

Without this, the cost/benefit is so unclear most senior contributors
will probably review something else. You need to make the case why this
is worth it.

I only quickly skimmed the patches, for exactly this reason. I'm a bit
confused why this needs to add the whole libtap thing in 0001, instead
of just testing this through the SQL interface (same as test_aio etc.).

Also, I find it somewhat unlikely we'd import a GPLv3 library like this,
even if it's just a testing framework. Even ignoring the question of
having a different license for some of the code, it'd mean maintenance
burden (maybe libtap is stable/mature, no idea). I don't see why this
would be better than "write a SQL callable test module".


regards

-- 
Tomas Vondra




Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Tomas, Andreus, Andrey, hello!

 > A ~170kB patch really should present some numbers
 > quantifying the expected benefit. It doesn't need to be a real workload
 > from production, but something plausible enough. Even some basic
 > back-of-the-envelope calculations might be enough to show the promise.

The patch results in a reduction of the total WAL size by:
     82% during vacuuming a table having no index,
     and by 56% during vacuuming a table having an index.

The numbers are as follows:

=== VACUUM (table with no index) ===
-------------------- ----------------- ----------------- -----------
                        DFOR off, bytes    DFOR on, bytes   Reduction
-------------------- ----------------- ----------------- -----------
WAL total size                 6743149           1184446         82%
Prune records size             6710185           1159723        5.8x
-------------------- ----------------- ----------------- -----------

=== VACUUM (table with index) ===
-------------------- ----------------- ----------------- -----------
                        DFOR off, bytes    DFOR on, bytes   Reduction
-------------------- ----------------- ----------------- -----------
WAL total size                20394208           8907090         56%
Prune records size             6812850           1225944        5.6x
-------------------- ----------------- ----------------- -----------

The logic of the tests is based on the technique from [1] and is as
follows:

    -- SQL
    CREATE TABLE t_prune ( id int, val text )
        WITH (fillfactor = 100, autovacuum_enabled = false);

    INSERT INTO t_prune
        SELECT g, 'x' FROM generate_series(1,3000000) g;

    CREATE INDEX ON t_prune(id); -- for the test using an indexed table

    DELETE FROM t_prune WHERE id % 500 <> 0;
    SELECT pg_current_wal_flush_lsn(); -- get start_lsn here
    VACUUM FREEZE t_prune; -- 3 times
    SELECT pg_current_wal_flush_lsn(); -- get end_lsn here

    # BASH
    # stop cluster
    pg_waldump -p $wal_dir -s $start_lsn -e $end_lsn 2>/dev/null;

The test is implemented in 052_prune_dfor_compression.pl, so the
presented results can be reproduced by re-running this test script.

 > Also, I find it somewhat unlikely we'd import a GPLv3 library like
 > this, even if it's just a testing framework. Even ignoring the
 > question of having a different license for some of the code, it'd mean
 > maintenance burden (maybe libtap is stable/mature, no idea). I don't
 > see why this would be better than "write a SQL callable test module".

I am ready to rework it once there is consensus on the core of the
patch.

Best regards,
Evgeny.

P.s. rebased onto a1643d40b30.

[1] 
https://www.postgresql.org/message-id/flat/CAAKRu_ZMw6Npd_qm2KM%2BFwQ3cMOMx1Dh3VMhp8-V7SOLxdK9-g%40mail.gmail.com
Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Hello hackers!

Dear Andres, please forgive me for the typo in your name in my previous
message. It was not intentional. My blushes!

v09 is the same as v08, except for a difference in test/dfor/Makefile.
I am still trying to fix the build errors in CI/CD.

P.s. rebased onto e0fa5bd1465.

Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
The problem with GCC dependencies appearing in Debian Trixie in
test/dfor has been reproduced locally and fixed.

P.S. Rebased onto 11d6042337f.
Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Andres Freund
Date:
On 2026-04-08 20:34:42 +0800, Evgeny Voropaev wrote:
> Tomas, Andreus, Andrey, hello!
>
> > A ~170kB patch really should present some numbers
> > quantifying the expected benefit. It doesn't need to be a real workload
> > from production, but something plausible enough. Even some basic
> > back-of-the-envelope calculations might be enough to show the promise.
>
> The patch results in reduction of WAL total size by:
>     81% during vacuuming a table having no index,
>     and by 55% during vacuuming a table having an index.
>
> The numbers are the next:
>
> === VACUUM (table with no index) ===
> -------------------- ----------------- ----------------- -----------
>                        DFOR off, bytes    DFOR on, bytes   Reduction
> -------------------- ----------------- ----------------- -----------
> WAL total size                 6743149           1184446         82%
> Prune records size             6710185           1159723        5.8x
> -------------------- ----------------- ----------------- -----------
>
> === VACUUM (table with index) ===
> -------------------- ----------------- ----------------- -----------
>                        DFOR off, bytes    DFOR on, bytes   Reduction
> -------------------- ----------------- ----------------- -----------
> WAL total size                20394208           8907090         56%
> Prune records size             6812850           1225944        5.6x
> -------------------- ----------------- ----------------- -----------

These numbers make the impact sound bigger than I think it really is:

- They neglect that the insert generates ~183MB of WAL, the delete ~161MB
  without indexes and ~243MB / 161MB with.  In contrast to that, 6.7MB isn't
  particularly significant.

- Workloads deleting almost all records in the table but leaving some in to
  prevent truncation aren't particularly common.

- The narrowness of the rows (~30 bytes, with row header) makes the wins much
  bigger than they'd be in realistic cases

- The workload doesn't involve any FPIs. It's much more common to have
  vacuums occur later and trigger FPIs.

  Heh. In this case FPIs actually would *reduce* the overhead of the current
  code, because the page is so empty after all the deletes that the FPI uses
  less space than the update. It's 4.1MB when not using indexes and not using
  wal compression and 1MB with wal compression.

  Seems we could get a fair bit of benefit by just using a heuristic to switch
  to an FPI when there are enough changes.

  I think that'd just be a few lines.

Greetings,

Andres Freund



Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Andrey Borodin
Date:

> On 9 Apr 2026, at 21:50, Andres Freund <andres@anarazel.de> wrote:
>
> These numbers make the impact sound bigger than I think it really is:
>
> - They neglect that the insert generates ~183MB of WAL, the delete ~161MB
>  without indexes and ~243MB / 161MB with.  In contrast to that 6.7Mb isn't
>  particularly significant.

Well, vacuuming of bloated tables does happen. The amount of WAL needed to
accumulate bloat does not invalidate the benefit of reducing the WAL needed
to vacuum that bloat.

> - Workloads deleting almost all records in the table but leaving some in to
>  prevent truncation aren't particularly common.

Queue tables may be something of an antipattern, yet users do use them. They
sometimes have ~90% bloat, and they cause 99% of the problems.

> - The narrowness of the rows (~30 bytes, with row header) makes the wins much
>  bigger than they'd be in realistic cases

It’s a benchmark crafted to demonstrate the bright side. It’s also super easy
to demonstrate a case where the proposed patch gives no benefit at all.

Evgeny, do you know of any cases when the patch has negative effect?

I think if it’s strictly non-negative, then we can just weigh the complexity of
maintaining the proposed approach against the benefits.

> - The workload doesn't involve any FPIs. It's much more common to have
>  vacuum's occur later and trigger FPIs.

AFAIU, without special handling, FPIs will not substitute for xl_heap_prune, will they?

>  Heh. In this case FPIs actually would *reduce* the overhead of the current
>  code, because the page is so empty after all the deletes that the FPI uses
>  less space than the update . It's 4.1MB when not using indexes and not using
>  wal compression and 1MB with wal compression.
>
>  Seems we could get a fair bit of benefit by just using a heuristic to switch
>  to an FPI when there are enough changes.

I think we could even do it in a generic way: if the record body is heavier than
its FPI, just emit the FPI.


Best regards, Andrey Borodin.


Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
> Evgeny, do you know of any cases when the patch has negative effect?
>
> I think if it’s strictly non-negative - then we can just weight complexity of maintaining
> the proposed approach against benefits.

Andrey, during my research and testing I haven't encountered any
negative effects from this patch on the WAL size.




Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Continuing to fix CI issues in this patch. Resolved a frz_offsets
misalignment caused by odd byte counts in dead or unused tuple packs.

The patch set has been split: unit tests have been moved into separate
patch files. This shows that the core of the patch is only 90 KB.

P.S. Rebased onto fce3f7d2677.
Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Heikki Linnakangas
Date:
On 24/03/2026 16:28, Evgeny Voropaev wrote:
> Also I hope, independently of its usage in prune/freeze records, the 
> DFoR itself might be used for compression sequences in other places of PG.

Yeah, that would make this huge amount of new code much more palatable.

I had a similar thought when I added src/backend/lib/integerset.c, I 
planned to also use it for holding the dead TID list in vacuum for 
starters, and possibly for more things in the future. That plan was 
foiled because we got parallel VACUUM instead, which moved the TID list 
to shared memory, and I didn't account for that in integerset.c. So now 
integerset.c is only used for GiST vacuum, which is a pretty narrow use 
case.

Can this DFoR code replace integerset.c easily? Can we use it for the 
vacuum dead TID list? For GIN posting lists? Where else?

- Heikki




Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Andrey Borodin
Date:

> On 14 Apr 2026, at 14:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Can this DFoR code replace integerset.c easily?

Or, possibly, teach integerset to be serializable to WAL record or anywhere else.


Best regards, Andrey Borodin.


Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date:
Hello hackers,

> Can this DFoR code replace integerset.c easily? Can we use it for
> the vacuum dead TID list? For GIN posting lists? Where else?

Heikki, thank you for your attention and proposals. I have been
studying the areas you proposed. This took time, since I am not well
versed in them. Last week I was also extending the DFoR patch to
support unsorted sequences, hence the delay in answering.

About GIN.
Since GIN stores TID sequences on disk, it may be the most appropriate
candidate for development with DFoR.

About the dead TID list.
If I'm not mistaken, the dead TID list exists only in RAM and never on
disk or on the network. So what advantage would compressing the dead
TID list provide?

About GiST vacuuming and its use of integerset.
integerset implements a tree in addition to compression, while DFoR
currently performs only compression. Moreover, the size of a pack is
flexible (varying), which would be an issue for its use in the tree.
This needs more thorough elaboration.

So what do you think about improving GIN by means of DFoR? Should I try?

Best regards,
Evgeny Voropaev,
Tantor Labs, LLC.

P.S.
In the v12 version of the patch:
     - implemented DFoR compression for unsorted sequences;
     - implemented compression of the frozen and redirected tuple offsets
        in the prune/freeze WAL record;
     - excluded the DFoR header templates from header checking;
     - rebased onto 9b43e6793b0.

Attachments

Re: Compress prune/freeze records with Delta Frame of Reference algorithm

От
Heikki Linnakangas
Дата:
On 21/04/2026 08:41, Evgeny Voropaev wrote:
> Hello hackers,
> 
>> Can this DFoR code replace integerset.c easily? Can we use it for
>> the vacuum dead TID list? For GIN posting lists? Where else?
> 
> Heikki, thank you for your attention and proposals. I'm learning areas
> you proposed to be developed. This took time, since I am not adept at
> them. Last week I also have been developing the DFoR patch to support
> unsorted sequences. That's why there was the delay in answering.
> 
> About GIN.
> Since GIN exploits TIDs sequences and saves it on the disk, it can be
> the most appropriate candidate to be developed with DFoR.

+1. And maybe the TID lists in B-tree tuples too, while we're at it.

For GIN posting lists, one important property of the current compression 
scheme is that removing TIDs never makes the list larger than the 
original. That's important for VACUUM, see 

https://github.com/postgres/postgres/blob/d3bba041543593eb5341683107d899734dc8e73e/src/backend/access/gin/ginpostinglist.c#L55

> About the dead TID list.
> If I'm not mistaken, the dead TID list exists only in RAM and never on
> the disk or in the network. So, what is the advantage supposed to be
> achieved due to using compression in the dead TID list?

Reduces memory usage. And if it's faster to look up than the current data 
structure, that too. I don't know if that works out.

> About the GiST vacuuming and the use of integerset in it.
> The integerset implements a tree in addition to compression.
> DFoR now performs only compression. Moreover the size of a pack is
> flexible (varying), which must become an issue for its usage in the
> tree. It needs more thorough further elaboration to be developed.

Hmm. The integerset is a sparse list of integers, just like Frame of 
Reference. The tree inside it is just an implementation detail. I was 
thinking that you could replace the whole tree with DFoR, but I suppose 
you cannot do random lookups in a DFoR compressed list, so you'd still 
need the tree.

- Heikki




Re: Compress prune/freeze records with Delta Frame of Reference algorithm

From: Evgeny Voropaev
Date: