Thread: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

[PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou
Hello list,

I'm submitting a patch that improves an almost hour-long pause at the start
of a parallel pg_restore of a big archive. Related discussion has taken
place on the pgsql-performance mailing list at:

https://www.postgresql.org/message-id/flat/6bd16bdb-aa5e-0512-739d-b84100596035%40gmx.net

I think I explain it rather well in the commit message, so I paste it
inline:


Improve the performance of parallel pg_restore (-j) from a custom format
pg_dump archive that does not include data offsets - which typically happens
when pg_dump has generated it by writing to stdout instead of a file.

In this case the pg_restore workers exhibit a constant loop of small reads
(4KB) followed by short forward seeks (around 10KB for a compressed
archive):

read(4, "..."..., 4096) = 4096
lseek(4, 55544369152, SEEK_SET)         = 55544369152
read(4, "..."..., 4096) = 4096
lseek(4, 55544381440, SEEK_SET)         = 55544381440
read(4, "..."..., 4096) = 4096
lseek(4, 55544397824, SEEK_SET)         = 55544397824
read(4, "..."..., 4096) = 4096
lseek(4, 55544414208, SEEK_SET)         = 55544414208
read(4, "..."..., 4096) = 4096
lseek(4, 55544426496, SEEK_SET)         = 55544426496

This happens because each worker scans the whole file until it finds the
entry it wants, skipping forward one block at a time. Combined with the
small block size of the custom format dump, this causes many seeks and low
performance.

Fix by avoiding seeks for forward jumps of less than 1MB.
Do sequential reads instead.

Performance gain can be significant, depending on the size of the dump
and the I/O subsystem. On my local NVMe drive, read speeds for that
phase of pg_restore increased from 150MB/s to 3GB/s.
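
In code terms, the change boils down to a threshold test in _skipData() of
pg_backup_custom.c; in rough outline:

    /*
     * Sequential access is usually faster, so avoid seeking if the
     * jump forward is shorter than 1MB.
     */
    if (ctx->hasSeek && blkLen > 1024 * 1024)
    {
        if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
            pg_fatal("error during file seek: %m");
    }
    else
    {
        /* short jump, or input not seekable: fread() the block instead */
        ...
    }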


This is my first patch submission, all help is much appreciated.
Regards,
Dimitris


P.S.  What is the recommended way to test a change, besides a generic make
check? And how do I run selectively only the pg_dump/restore tests, in
order to speed up my development routine?


Attachments

Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou
On Sat, 29 Mar 2025, Dimitrios Apostolou wrote:
>
> P.S.  What is the recommended way to test a change, besides a generic make
> check? And how do I run selectively only the pg_dump/restore tests, in order
> to speed up my development routine?

I have tested it with:

   make  -C src/bin/pg_dump  check

It didn't break any test, but I also don't see any difference, the
performance boost is noticeable only when restoring a huge archive that is
missing offsets.

Any volunteer to review this one-line patch?

Thanks,
Dimitris




On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
> It didn't break any test, but I also don't see any difference, the
> performance boost is noticeable only when restoring a huge archive that is
> missing offsets.

This seems generally reasonable to me, but how did you decide on 1MB as the
threshold?  Have you tested other values?  Could the best threshold vary
based on the workload and hardware?

-- 
nathan



Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou
Thanks. This is the first value I tried and it works well. In the archive I have, all blocks seem to be between 8 and
20KB, so the jump forward before the change never even got close to 1MB. Could it be bigger in an uncompressed archive?
Or in a future pg_dump that raises the block size? I don't really know, so it is difficult to test such a scenario, but
it made sense to guard against these cases too.

I chose 1MB by basically doing a very crude calculation in my head: when would it be worth seeking forward instead of
reading? On very slow drives, 60MB/s sequential and 60 IOPS for random reads is a plausible speed. In that worst case
it would be better to seek() forward only for lengths of over 1MB.
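
Spelled out, with the numbers assumed above (an estimate, not a measurement):

    one random seek:          1 / 60 IOPS        ~= 17 ms
    reading 1MB sequentially: 1MB / (60 MB/s)    ~= 17 ms

So a forward jump only starts paying for itself once it skips roughly 1MB;
for anything shorter, reading through the data is at least as fast.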

On 1 April 2025 22:04:00 CEST, Nathan Bossart <nathandbossart@gmail.com> wrote:
>On Tue, Apr 01, 2025 at 09:33:32PM +0200, Dimitrios Apostolou wrote:
>> It didn't break any test, but I also don't see any difference, the
>> performance boost is noticeable only when restoring a huge archive that is
>> missing offsets.
>
>This seems generally reasonable to me, but how did you decide on 1MB as the
>threshold?  Have you tested other values?  Could the best threshold vary
>based on the workload and hardware?
>



Re: [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou
I just managed to run pgindent, here is v2 with the comment style fixed.
Attachments

[PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From: Dimitrios Apostolou
On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:

> I just managed to run pgindent, here is v2 with the comment style fixed.

Any feedback on this one-liner? Or is the lack of feedback a clue that I
have been missing something important in my patch submission? :-)

Should I CC people that are frequent committers to the file?


Thanks,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Any feedback on this one-liner? Or is the lack of feedback a clue that I 
> have been missing something important in my patch submission? :-)

The calendar ;-).  At this point we're in feature freeze for v18,
so things that aren't bugs aren't likely to get much attention
until v19 development opens up (in July, unless things are really
going badly with v18).

You should add your patch to the July commitfest [1] to make sure
we don't lose track of it.

            regards, tom lane

[1] https://commitfest.postgresql.org/53/




On Mon, 14 Apr 2025, Tom Lane wrote:

>
> You should add your patch to the July commitfest [1] to make sure
> we don't lose track of it.

I rebased the patch (attached) and created an entry in the commitfest:

https://commitfest.postgresql.org/patch/5809/


Thanks!
Dimitris




Attached now...

On Mon, 9 Jun 2025, Dimitrios Apostolou wrote:

>
>
> On Mon, 14 Apr 2025, Tom Lane wrote:
>
>>
>>  You should add your patch to the July commitfest [1] to make sure
>>  we don't lose track of it.
>
> I rebased the patch (attached) and created an entry in the commitfest:
>
> https://commitfest.postgresql.org/patch/5809/
>
>
> Thanks!
> Dimitris
>
>
>
Attachments
On Mon, Jun 09, 2025 at 10:09:57PM +0200, Dimitrios Apostolou wrote:
> Fix by avoiding seeks for forward jumps of less than 1MB.
> Do sequential reads instead.
> 
> Performance gain can be significant, depending on the size of the dump
> and the I/O subsystem. On my local NVMe drive, read speeds for that
> phase of pg_restore increased from 150MB/s to 3GB/s.

I was curious about what exactly was leading to the performance gains you
are seeing.  This page has an explanation:

    https://www.mjr19.org.uk/IT/fseek.html

I also wrote a couple of test programs to show the difference between
fseeko-ing and fread-ing through a file with various sizes.  On a Linux
machine, I see this:

     log2(n) | fseeko  | fread
    ---------+---------+-------
           1 | 109.288 | 5.528
           2 |  54.881 | 2.848
           3 |   27.65 | 1.504
           4 |  13.953 | 0.834
           5 |     7.1 |  0.49
           6 |   3.665 | 0.322
           7 |   1.944 | 0.244
           8 |   1.085 | 0.201
           9 |   0.658 | 0.185
          10 |   0.443 | 0.175
          11 |   0.253 | 0.171
          12 |   0.102 | 0.162
          13 |   0.075 |  0.13
          14 |   0.061 | 0.114
          15 |   0.054 |   0.1

So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
aren't quite as dramatic, but 4096 bytes is the break-even point there,
too.  I imagine there's a buffer around that size somewhere...

This doesn't fully explain the results you are seeing, but it does seem to
validate the idea.  I'm curious if you see further improvement with even
lower thresholds (e.g., 8KB, 16KB, 32KB). 
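
(Roughly, the fseeko test boils down to a loop like the sketch below; this is
a reconstruction for illustration, not the exact program. The fread variant
replaces the fseeko() call with an fread() of 'step' bytes into a scratch
buffer, and the whole pass is timed, e.g. with time(1).)

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Walk forward through a file in jumps of 'step' bytes using fseeko(). */
int
main(int argc, char **argv)
{
    FILE       *fp;
    size_t      step;
    long        nsteps;
    long        i;

    if (argc != 4)
        return 1;

    fp = fopen(argv[1], "rb");
    if (fp == NULL)
        return 1;
    step = (size_t) atol(argv[2]);  /* 2, 4, 8, ... per log2(n) above */
    nsteps = atol(argv[3]);

    for (i = 0; i < nsteps; i++)
    {
        if (fseeko(fp, (off_t) step, SEEK_CUR) != 0)
            break;
    }

    fclose(fp);
    return 0;
}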

-- 
nathan



On Tue, 10 Jun 2025, Nathan Bossart wrote:

> I also wrote a couple of test programs to show the difference between
> fseeko-ing and fread-ing through a file with various sizes.  On a Linux
> machine, I see this:
>
>     log2(n) | fseeko  | fread
>    ---------+---------+-------
>           1 | 109.288 | 5.528
>           2 |  54.881 | 2.848
>           3 |   27.65 | 1.504
>           4 |  13.953 | 0.834
>           5 |     7.1 |  0.49
>           6 |   3.665 | 0.322
>           7 |   1.944 | 0.244
>           8 |   1.085 | 0.201
>           9 |   0.658 | 0.185
>          10 |   0.443 | 0.175
>          11 |   0.253 | 0.171
>          12 |   0.102 | 0.162
>          13 |   0.075 |  0.13
>          14 |   0.061 | 0.114
>          15 |   0.054 |   0.1
>
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

Thank you for benchmarking! Before answering in more depth, I'm curious,
what read-seek pattern do you see on the system call level (as
shown by strace)? In pg_restore it was a constant loop of
read(4K)-lseek(8-16K).

Dimitris



On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
> Thank you for benchmarking! Before answering in more depth, I'm curious,
> what read-seek pattern do you see on the system call level (as shown by
> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).

For fseeko(), sizes less than 4096 produce a repeating pattern of read()
calls followed by approximately (4096 / size) lseek() calls.  For greater
sizes, it's just a stream of lseek().  For fread(), sizes less than 4096
produce a stream of read(fd, "...", 4096), and for greater sizes, the only
difference is that the last argument is the size.

-- 
nathan



On Wed, 11 Jun 2025, Nathan Bossart wrote:

> On Wed, Jun 11, 2025 at 12:32:58AM +0200, Dimitrios Apostolou wrote:
>> what read-seek pattern do you see on the system call level (as shown by
>> strace)? In pg_restore it was a constant loop of read(4K)-lseek(8-16K).
>
> For fseeko(), sizes less than 4096 produce a repeating pattern of read()
> calls followed by approximately (4096 / size) lseek() calls.  For greater
> sizes, it's just a stream of lseek().

This is the opposite of what the link you shared before describes, so
maybe glibc has changed its behaviour to improve performance.

Anyway, the fact that fseek(>4096) produces a stream of lseek()s means that
most likely no I/O is happening. You need to issue a getc() after each
fseeko(), like pg_restore is doing.
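
Something like this in the inner loop (sketch) would force the stdio layer
to actually read at the new position:

    if (fseeko(fp, (off_t) step, SEEK_CUR) != 0)
        break;
    if (fgetc(fp) == EOF)   /* touch one byte, like pg_restore's ReadInt() */
        break;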


Dimitris




On Tue, 10 Jun 2025, Nathan Bossart wrote:

> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...
>
> This doesn't fully explain the results you are seeing, but it does seem to
> validate the idea.  I'm curious if you see further improvement with even
> lower thresholds (e.g., 8KB, 16KB, 32KB).

By the way, I might have set the threshold to 1MB in my program, but
lowering it won't show a difference in my test case, since the lseek()s I
was noticing before the patch were mostly 8-16KB forward. Not sure what is
the defining factor for that. Maybe the compression algorithm, or how wide
the table is?


Dimitris




On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
> By the way, I might have set the threshold to 1MB in my program, but
> lowering it won't show a difference in my test case, since the lseek()s I
> was noticing before the patch were mostly 8-16KB forward. Not sure what is
> the defining factor for that. Maybe the compression algorithm, or how wide
> the table is?

I may have missed it, but could you share what the strace looks like with
the patch applied?

-- 
nathan



On Wed, Jun 11, 2025 at 9:48 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> So, fseeko() starts winning around 4096 bytes.  On macOS, the differences
> aren't quite as dramatic, but 4096 bytes is the break-even point there,
> too.  I imagine there's a buffer around that size somewhere...

BTW you can call setvbuf(f, my_buffer, _IOFBF, my_buffer_size) to
control FILE buffering.  I suspect that glibc ignores the size if you
pass NULL for my_buffer, so you'd need to allocate it yourself and it
should probably be aligned on PG_IO_ALIGN_SIZE for best results
(minimising the number of VM pages that must be held/pinned).  Then
you might be able to get better and less OS-dependent results.  I
haven't studied this seek business so I have no opinion on that and
what a good size would be, but interesting sizes might be
rounded to both PG_IO_ALIGN_SIZE and filesystem block size according
to fstat(fileno(stream)).  IDK, just a thought...
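
A sketch of what that might look like (illustrative only: the helper and its
name are made up, posix_memalign() stands in for whatever allocator would
actually be used, and 4096 stands in for PG_IO_ALIGN_SIZE):

#include <stdio.h>
#include <stdlib.h>

/*
 * Give a freshly opened archive FILE a large, aligned stdio buffer.
 * Must be called before any read/seek on the stream, and the buffer must
 * stay allocated until fclose().
 */
static void
set_large_stream_buffer(FILE *fp, size_t bufsize)
{
    void       *buf;

    /* 4096-byte alignment, standing in for PG_IO_ALIGN_SIZE */
    if (posix_memalign(&buf, 4096, bufsize) != 0)
        return;                 /* keep the default stdio buffer */

    if (setvbuf(fp, buf, _IOFBF, bufsize) != 0)
        free(buf);              /* setvbuf failed; default buffering stays */
}

For instance, it could be called with a 1MB bufsize right after the archive
file is opened.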



On Fri, 13 Jun 2025, Nathan Bossart wrote:

> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096
read(4, "..."..., 12288) = 12288
read(4, "..."..., 4096) = 4096
read(4, "..."..., 8192) = 8192
read(4, "..."..., 4096) = 4096





On Sat, 14 Jun 2025, Dimitrios Apostolou wrote:

> On Fri, 13 Jun 2025, Nathan Bossart wrote:
>
>>  On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>>>  By the way, I might have set the threshold to 1MB in my program, but
>>>  lowering it won't show a difference in my test case, since the lseek()s I
>>>  was noticing before the patch were mostly 8-16KB forward. Not sure what
>>>  is
>>>  the defining factor for that. Maybe the compression algorithm, or how
>>>  wide
>>>  the table is?
>>
>>  I may have missed it, but could you share what the strace looks like with
>>  the patch applied?
>
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 12288) = 12288
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 8192) = 8192
> read(4, "..."..., 4096) = 4096


This was from pg_restoring a zstd-compressed custom format dump.

Out of curiosity I've tried the same with an uncompressed dump
(--compress=none). Surprisingly it seems the blocksize is even smaller.

With my patched pg_restore I only get 4K reads and nothing else on
the strace output.

read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096
read(4, "..."..., 4096) = 4096

The unpatched pg_restore gives me the weirdest output ever:

read(4, "..."..., 4096) = 4096
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
lseek(4, 98527916032, SEEK_SET)         = 98527916032
[ ... repeats about 80 times ...]
read(4, "..."..., 4096) = 4096
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
lseek(4, 98527920128, SEEK_SET)         = 98527920128
[ ... repeats ... ]



Seeing this, I think we should really consider raising the pg_dump block
size like Tom suggested on a previous thread.


Dimitris




Hi Nathan, I've noticed you've set yourself as a reviewer of this patch
in the commitfest. I appreciate it, but you might want to combine it
with another simple patch [1] that speeds up the same part of
pg_restore: the initial full scan on TOC-less archives.

[1] https://commitfest.postgresql.org/patch/5817/


On Saturday 2025-06-14 00:04, Nathan Bossart wrote:
>
> On Fri, Jun 13, 2025 at 01:00:26AM +0200, Dimitrios Apostolou wrote:
>> By the way, I might have set the threshold to 1MB in my program, but
>> lowering it won't show a difference in my test case, since the lseek()s I
>> was noticing before the patch were mostly 8-16KB forward. Not sure what is
>> the defining factor for that. Maybe the compression algorithm, or how wide
>> the table is?
>
> I may have missed it, but could you share what the strace looks like with
> the patch applied?

I hope you've seen my response here, with special focus on the small
block size that both compressed and uncompressed custom format archives
have.

I have been needing to pg_restore 10TB TOC-less dumps recently, and it's
a pain to do the full scan, even with both of my patches applied.

Maybe the block size could be a command line option of pg_dump, so that one
could set it to sizes like 100MB, which sounds like a reasonable block size
from the perspective of a gigantic 10TB dump.

Regards,
Dimitris




On Saturday 2025-06-14 18:17, Dimitrios Apostolou wrote:

> Out of curiosity I've tried the same with an uncompressed dump
> (--compress=none). Surprisingly it seems the blocksize is even smaller.
>
> With my patched pg_restore I only get 4K reads and nothing else on the strace
> output.
>
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096
> read(4, "..."..., 4096) = 4096


To clarify this output again, I have a huge uncompressed custom format
dump without TOC (because pg_dump was writing to stdout), and at this
point pg_restore is going through the whole archive to find the items it
needs. Allow me to explain what goes on at this point since I have
better insight now.

The code in question, in pg_backup_custom.c:


/*
 * Skip data from current file position.
 * Data blocks are formatted as an integer length, followed by data.
 * A zero length indicates the end of the block.
 */
static void
_skipData(ArchiveHandle *AH)
{
     lclContext *ctx = (lclContext *) AH->formatData;
     size_t        blkLen;
     char       *buf = NULL;
     int            buflen = 0;

     blkLen = ReadInt(AH);
     while (blkLen != 0)
     {
         /*
          * Sequential access is usually faster, so avoid seeking if the
          * jump forward is shorter than 1MB.
          */
         if (ctx->hasSeek && blkLen > 1024 * 1024)
         {
             if (fseeko(AH->FH, blkLen, SEEK_CUR) != 0)
                 pg_fatal("error during file seek: %m");
         }
         else
         {
             if (blkLen > buflen)
             {
                 free(buf);
                 buf = (char *) pg_malloc(blkLen);
                 buflen = blkLen;
             }
             if (fread(buf, 1, blkLen, AH->FH) != blkLen)
             {
                 if (feof(AH->FH))
                     pg_fatal("could not read from input file: end of file");
                 else
                     pg_fatal("could not read from input file: %m");
             }
         }

         blkLen = ReadInt(AH);
     }

     free(buf);
}


blkLen is almost always a number around 35 to 38, so fread() is called all
the time doing reads of about 35 bytes. Then ReadInt() does a few getc()
calls, and the loop repeats.

Libc is doing 4K buffering, and that's how we end up seeing so many 4K
reads. This also explains the ~80 lseek() calls between each 4K read() on
the unpatched version, mentioned in the previous email.

I've tried setvbuf() like Thomas Munro suggested and I saw a significant
improvement by allocating and using a 1MB buffer for libc stream
buffering.

Question that remains: where is pg_dump setting this ~35B length block?
Is that easy to change without breaking old versions?


Thanks in advance,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Question that remains: where is pg_dump setting this ~35B length block? 

I poked into that question, and found that the answer is some
exceedingly brain-dead buffering logic in compress_zstd.c.
It will dump its buffer during every loop iteration within
_ZstdWriteCommon, no matter how much buffer space it has left;
and each call to cs->writeF() produces a new "data block"
in the output file.  The amount of data fed to _ZstdWriteCommon
per call is whatever the backend sends per "copy data" message,
which is generally one table row.  So if the table rows aren't
too wide, or if they're highly compressible, you get these
tiny data blocks.

compress_lz4.c is equally broken, again buffering no bytes across
calls; although liblz4 seems to do some buffering internally.
I got blocks of around 300 bytes on the test case I was using.
That's still ridiculous.

compress_gzip.c is actually sanely implemented, and consistently
produces blocks of 4096 bytes, which traces to DEFAULT_IO_BUFFER_SIZE
in compress_io.h.

If you choose --compress=none, you get data blocks that correspond
exactly to table rows.  We could imagine doing some internal
buffering to amalgamate short rows into larger blocks, but I'm
not entirely convinced it's worth messing with that case.

The attached patch fixes the buffering logic in compress_zstd.c
and compress_lz4.c.  For zstd, most blocks are now 131591 bytes,
which seems to be determined by ZSTD_CStreamOutSize() not by
our code.  For lz4, I see a range of block sizes but they're
almost all around 64K.  That's apparently emergent from the
behavior of LZ4F_compressBound(): when told we want to supply
it up to 4K at a time, it says it needs a buffer around 64K.

I'm tempted to increase DEFAULT_IO_BUFFER_SIZE so that gzip
also produces blocks in the vicinity of 64K, but we'd have
to decouple the behavior of compress_lz4.c somehow, or it
would want to produce blocks around a megabyte which might
be excessive.  (Or if it's not, we'd still want all these
compression methods to choose similar block sizes, I'd think.)

Anyway, these fixes should remove the need for pg_restore to look
at quite so many places in the archive file.  There may still be
a need for altering the seek-versus-read behavior as you suggest,
but I think we need to re-measure that tradeoff after fixing the
pg_dump side.

            regards, tom lane

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index e2f7c468293..47ee2e4bbac 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -60,13 +60,11 @@ typedef struct LZ4State
     bool        compressing;

     /*
-     * Used by the Compressor API to mark if the compression headers have been
-     * written after initialization.
+     * I/O buffer area.
      */
-    bool        needs_header_flush;
-
-    size_t        buflen;
-    char       *buffer;
+    char       *buffer;            /* buffer for compressed data */
+    size_t        buflen;            /* allocated size of buffer */
+    size_t        bufdata;        /* amount of valid data currently in buffer */

     /*
      * Used by the Stream API to store already uncompressed data that the
@@ -76,12 +74,6 @@ typedef struct LZ4State
     size_t        overflowlen;
     char       *overflowbuf;

-    /*
-     * Used by both APIs to keep track of the compressed data length stored in
-     * the buffer.
-     */
-    size_t        compressedlen;
-
     /*
      * Used by both APIs to keep track of error codes.
      */
@@ -103,8 +95,17 @@ LZ4State_compression_init(LZ4State *state)
 {
     size_t        status;

+    /*
+     * Compute size needed for buffer, assuming we will present at most
+     * DEFAULT_IO_BUFFER_SIZE input bytes at a time.
+     */
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

+    /*
+     * Then double it, to ensure we're not forced to flush every time.
+     */
+    state->buflen *= 2;
+
     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
      * LZ4F_HEADER_SIZE_MAX. Verify that the requirement is met.
@@ -120,6 +121,10 @@ LZ4State_compression_init(LZ4State *state)
     }

     state->buffer = pg_malloc(state->buflen);
+
+    /*
+     * Insert LZ4 header into buffer.
+     */
     status = LZ4F_compressBegin(state->ctx,
                                 state->buffer, state->buflen,
                                 &state->prefs);
@@ -129,7 +134,7 @@ LZ4State_compression_init(LZ4State *state)
         return false;
     }

-    state->compressedlen = status;
+    state->bufdata = status;

     return true;
 }
@@ -201,36 +206,37 @@ WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs,
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
     size_t        remaining = dLen;
-    size_t        status;
-    size_t        chunk;
-
-    /* Write the header if not yet written. */
-    if (state->needs_header_flush)
-    {
-        cs->writeF(AH, state->buffer, state->compressedlen);
-        state->needs_header_flush = false;
-    }

     while (remaining > 0)
     {
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        if (remaining > DEFAULT_IO_BUFFER_SIZE)
-            chunk = DEFAULT_IO_BUFFER_SIZE;
-        else
-            chunk = remaining;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            cs->writeF(AH, state->buffer, state->bufdata);
+            state->bufdata = 0;
+        }

-        remaining -= chunk;
         status = LZ4F_compressUpdate(state->ctx,
-                                     state->buffer, state->buflen,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      data, chunk, NULL);

         if (LZ4F_isError(status))
             pg_fatal("could not compress data: %s",
                      LZ4F_getErrorName(status));

-        cs->writeF(AH, state->buffer, status);
+        state->bufdata += status;

-        data = ((char *) data) + chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
     }
 }

@@ -238,29 +244,32 @@ static void
 EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs)
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
+    size_t        required;
     size_t        status;

     /* Nothing needs to be done */
     if (!state)
         return;

-    /*
-     * Write the header if not yet written. The caller is not required to call
-     * writeData if the relation does not contain any data. Thus it is
-     * possible to reach here without having flushed the header. Do it before
-     * ending the compression.
-     */
-    if (state->needs_header_flush)
-        cs->writeF(AH, state->buffer, state->compressedlen);
+    /* We might need to flush the buffer to make room for LZ4F_compressEnd */
+    required = LZ4F_compressBound(0, &state->prefs);
+    if (required > state->buflen - state->bufdata)
+    {
+        cs->writeF(AH, state->buffer, state->bufdata);
+        state->bufdata = 0;
+    }

     status = LZ4F_compressEnd(state->ctx,
-                              state->buffer, state->buflen,
+                              state->buffer + state->bufdata,
+                              state->buflen - state->bufdata,
                               NULL);
     if (LZ4F_isError(status))
         pg_fatal("could not end compression: %s",
                  LZ4F_getErrorName(status));
+    state->bufdata += status;

-    cs->writeF(AH, state->buffer, status);
+    /* Write the final bufferload */
+    cs->writeF(AH, state->buffer, state->bufdata);

     status = LZ4F_freeCompressionContext(state->ctx);
     if (LZ4F_isError(status))
@@ -302,8 +311,6 @@ InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compressi
         pg_fatal("could not initialize LZ4 compression: %s",
                  LZ4F_getErrorName(state->errcode));

-    /* Remember that the header has not been written. */
-    state->needs_header_flush = true;
     cs->private_data = state;
 }

@@ -360,19 +367,10 @@ LZ4Stream_init(LZ4State *state, int size, bool compressing)

     state->compressing = compressing;

-    /* When compressing, write LZ4 header to the output stream. */
     if (state->compressing)
     {
-
         if (!LZ4State_compression_init(state))
             return false;
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, state->compressedlen, state->fp) != state->compressedlen)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            return false;
-        }
     }
     else
     {
@@ -573,8 +571,7 @@ static void
 LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 {
     LZ4State   *state = (LZ4State *) CFH->private_data;
-    size_t        status;
-    int            remaining = size;
+    size_t        remaining = size;

     /* Lazy init */
     if (!LZ4Stream_init(state, size, true))
@@ -583,23 +580,36 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)

     while (remaining > 0)
     {
-        int            chunk = Min(remaining, DEFAULT_IO_BUFFER_SIZE);
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        remaining -= chunk;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+            {
+                errno = (errno) ? errno : ENOSPC;
+                pg_fatal("error during writing: %m");
+            }
+            state->bufdata = 0;
+        }

-        status = LZ4F_compressUpdate(state->ctx, state->buffer, state->buflen,
+        status = LZ4F_compressUpdate(state->ctx,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      ptr, chunk, NULL);
         if (LZ4F_isError(status))
             pg_fatal("error during writing: %s", LZ4F_getErrorName(status));
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, status, state->fp) != status)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            pg_fatal("error during writing: %m");
-        }
+        state->bufdata += status;

         ptr = ((const char *) ptr) + chunk;
+        remaining -= chunk;
     }
 }

@@ -675,6 +685,7 @@ LZ4Stream_close(CompressFileHandle *CFH)
 {
     FILE       *fp;
     LZ4State   *state = (LZ4State *) CFH->private_data;
+    size_t        required;
     size_t        status;
     int            ret;

@@ -683,20 +694,36 @@ LZ4Stream_close(CompressFileHandle *CFH)
     {
         if (state->compressing)
         {
-            status = LZ4F_compressEnd(state->ctx, state->buffer, state->buflen, NULL);
+            /* We might need to flush the buffer to make room */
+            required = LZ4F_compressBound(0, &state->prefs);
+            if (required > state->buflen - state->bufdata)
+            {
+                errno = 0;
+                if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+                {
+                    errno = (errno) ? errno : ENOSPC;
+                    pg_log_error("could not write to output file: %m");
+                }
+                state->bufdata = 0;
+            }
+
+            status = LZ4F_compressEnd(state->ctx,
+                                      state->buffer + state->bufdata,
+                                      state->buflen - state->bufdata,
+                                      NULL);
             if (LZ4F_isError(status))
             {
                 pg_log_error("could not end compression: %s",
                              LZ4F_getErrorName(status));
             }
             else
+                state->bufdata += status;
+
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
             {
-                errno = 0;
-                if (fwrite(state->buffer, 1, status, state->fp) != status)
-                {
-                    errno = (errno) ? errno : ENOSPC;
-                    pg_log_error("could not write to output file: %m");
-                }
+                errno = (errno) ? errno : ENOSPC;
+                pg_log_error("could not write to output file: %m");
             }

             status = LZ4F_freeCompressionContext(state->ctx);
diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index e24d45e1bbe..63b1ebf8583 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -98,24 +98,22 @@ _ZstdWriteCommon(ArchiveHandle *AH, CompressorState *cs, bool flush)
     ZSTD_outBuffer *output = &zstdcs->output;

     /* Loop while there's any input or until flushed */
-    while (input->pos != input->size || flush)
+    while (input->pos < input->size || flush)
     {
         size_t        res;

-        output->pos = 0;
         res = ZSTD_compressStream2(zstdcs->cstream, output,
                                    input, flush ? ZSTD_e_end : ZSTD_e_continue);

         if (ZSTD_isError(res))
             pg_fatal("could not compress data: %s", ZSTD_getErrorName(res));

-        /*
-         * Extra paranoia: avoid zero-length chunks, since a zero length chunk
-         * is the EOF marker in the custom format. This should never happen
-         * but...
-         */
-        if (output->pos > 0)
+        /* Dump output buffer if full, or if we're told to flush */
+        if (output->pos >= output->size || flush)
+        {
             cs->writeF(AH, output->dst, output->pos);
+            output->pos = 0;
+        }

         if (res == 0)
             break;                /* End of frame or all input consumed */

I wrote:
> I'm tempted to increase DEFAULT_IO_BUFFER_SIZE so that gzip
> also produces blocks in the vicinity of 64K, but we'd have
> to decouple the behavior of compress_lz4.c somehow, or it
> would want to produce blocks around a megabyte which might
> be excessive.  (Or if it's not, we'd still want all these
> compression methods to choose similar block sizes, I'd think.)

After a bit of further experimentation, here is a v2 patchset.

0001 is like my previous patch except that it also fixes
Zstd_write and Zstd_close so that the "stream API" code doesn't
behave differently from the older API.  Also, now with draft
commit message.

0002 adjusts things so that lz4 and gzip compression produce
block sizes around 128K, which is what compress_zstd.c already
does after 0001.  While I wouldn't necessarily follow zstd's
lead if it were easy to do differently, it isn't.  We'd have
to ignore ZSTD_CStreamOutSize() in favor of making our own
buffer size choice.  That seems to carry some risks of tickling
bugs that upstream isn't testing for, and the value of 128K is
not so far off that I care to take any such risk.

This brings us to a place where all three compression modes
should yield roughly-comparable data block sizes, which is a good
starting point for further discussion of whether pg_restore needs
seek-versus-read adjustments.

            regards, tom lane

From 8f5ab1e5b18be54dc037a459dee5b9095166232b Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 10 Oct 2025 20:57:15 -0400
Subject: [PATCH v2 1/2] Fix poor buffering logic in pg_dump's lz4 and zstd
 compression code.

Both of these modules dumped each bit of output that they got from
the underlying compression library as a separate "data block" in
the emitted archive file.  In the case of zstd this'd frequently
result in block sizes well under 100 bytes; lz4 is a little better
but still produces blocks around 300 bytes, at least in the test
case I tried.  This bloats the archive file a little bit compared
to larger block sizes, but the real problem is that when pg_restore
has to skip each data block rather than seeking directly to some
target data, tiny block sizes are enormously inefficient.

Fix both modules so that they fill their allocated buffer reasonably
well before dumping a data block.  In the case of lz4, also delete
some redundant logic that caused the lz4 frame header to be emitted
as a separate data block.  (That saves little, but I see no reason
to expend extra code to get worse results.)

I fixed the "stream API" code too.  In those cases, feeding small
amounts of data to fwrite() probably doesn't have any meaningful
performance consequences.  But it seems like a bad idea to leave
the two sets of code doing the same thing in two different ways.

In passing, remove unnecessary "extra paranoia" check in
_ZstdWriteCommon.  _CustomWriteFunc (the only possible referent
of cs->writeF) already protects itself against zero-length writes,
and it's really a modularity violation for _ZstdWriteCommon to know
that the custom format disallows empty data blocks.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_lz4.c  | 167 +++++++++++++++++++-------------
 src/bin/pg_dump/compress_zstd.c |  37 +++----
 2 files changed, 117 insertions(+), 87 deletions(-)

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index e2f7c468293..47ee2e4bbac 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -60,13 +60,11 @@ typedef struct LZ4State
     bool        compressing;

     /*
-     * Used by the Compressor API to mark if the compression headers have been
-     * written after initialization.
+     * I/O buffer area.
      */
-    bool        needs_header_flush;
-
-    size_t        buflen;
-    char       *buffer;
+    char       *buffer;            /* buffer for compressed data */
+    size_t        buflen;            /* allocated size of buffer */
+    size_t        bufdata;        /* amount of valid data currently in buffer */

     /*
      * Used by the Stream API to store already uncompressed data that the
@@ -76,12 +74,6 @@ typedef struct LZ4State
     size_t        overflowlen;
     char       *overflowbuf;

-    /*
-     * Used by both APIs to keep track of the compressed data length stored in
-     * the buffer.
-     */
-    size_t        compressedlen;
-
     /*
      * Used by both APIs to keep track of error codes.
      */
@@ -103,8 +95,17 @@ LZ4State_compression_init(LZ4State *state)
 {
     size_t        status;

+    /*
+     * Compute size needed for buffer, assuming we will present at most
+     * DEFAULT_IO_BUFFER_SIZE input bytes at a time.
+     */
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

+    /*
+     * Then double it, to ensure we're not forced to flush every time.
+     */
+    state->buflen *= 2;
+
     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
      * LZ4F_HEADER_SIZE_MAX. Verify that the requirement is met.
@@ -120,6 +121,10 @@ LZ4State_compression_init(LZ4State *state)
     }

     state->buffer = pg_malloc(state->buflen);
+
+    /*
+     * Insert LZ4 header into buffer.
+     */
     status = LZ4F_compressBegin(state->ctx,
                                 state->buffer, state->buflen,
                                 &state->prefs);
@@ -129,7 +134,7 @@ LZ4State_compression_init(LZ4State *state)
         return false;
     }

-    state->compressedlen = status;
+    state->bufdata = status;

     return true;
 }
@@ -201,36 +206,37 @@ WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs,
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
     size_t        remaining = dLen;
-    size_t        status;
-    size_t        chunk;
-
-    /* Write the header if not yet written. */
-    if (state->needs_header_flush)
-    {
-        cs->writeF(AH, state->buffer, state->compressedlen);
-        state->needs_header_flush = false;
-    }

     while (remaining > 0)
     {
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        if (remaining > DEFAULT_IO_BUFFER_SIZE)
-            chunk = DEFAULT_IO_BUFFER_SIZE;
-        else
-            chunk = remaining;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            cs->writeF(AH, state->buffer, state->bufdata);
+            state->bufdata = 0;
+        }

-        remaining -= chunk;
         status = LZ4F_compressUpdate(state->ctx,
-                                     state->buffer, state->buflen,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      data, chunk, NULL);

         if (LZ4F_isError(status))
             pg_fatal("could not compress data: %s",
                      LZ4F_getErrorName(status));

-        cs->writeF(AH, state->buffer, status);
+        state->bufdata += status;

-        data = ((char *) data) + chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
     }
 }

@@ -238,29 +244,32 @@ static void
 EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs)
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
+    size_t        required;
     size_t        status;

     /* Nothing needs to be done */
     if (!state)
         return;

-    /*
-     * Write the header if not yet written. The caller is not required to call
-     * writeData if the relation does not contain any data. Thus it is
-     * possible to reach here without having flushed the header. Do it before
-     * ending the compression.
-     */
-    if (state->needs_header_flush)
-        cs->writeF(AH, state->buffer, state->compressedlen);
+    /* We might need to flush the buffer to make room for LZ4F_compressEnd */
+    required = LZ4F_compressBound(0, &state->prefs);
+    if (required > state->buflen - state->bufdata)
+    {
+        cs->writeF(AH, state->buffer, state->bufdata);
+        state->bufdata = 0;
+    }

     status = LZ4F_compressEnd(state->ctx,
-                              state->buffer, state->buflen,
+                              state->buffer + state->bufdata,
+                              state->buflen - state->bufdata,
                               NULL);
     if (LZ4F_isError(status))
         pg_fatal("could not end compression: %s",
                  LZ4F_getErrorName(status));
+    state->bufdata += status;

-    cs->writeF(AH, state->buffer, status);
+    /* Write the final bufferload */
+    cs->writeF(AH, state->buffer, state->bufdata);

     status = LZ4F_freeCompressionContext(state->ctx);
     if (LZ4F_isError(status))
@@ -302,8 +311,6 @@ InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compressi
         pg_fatal("could not initialize LZ4 compression: %s",
                  LZ4F_getErrorName(state->errcode));

-    /* Remember that the header has not been written. */
-    state->needs_header_flush = true;
     cs->private_data = state;
 }

@@ -360,19 +367,10 @@ LZ4Stream_init(LZ4State *state, int size, bool compressing)

     state->compressing = compressing;

-    /* When compressing, write LZ4 header to the output stream. */
     if (state->compressing)
     {
-
         if (!LZ4State_compression_init(state))
             return false;
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, state->compressedlen, state->fp) != state->compressedlen)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            return false;
-        }
     }
     else
     {
@@ -573,8 +571,7 @@ static void
 LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 {
     LZ4State   *state = (LZ4State *) CFH->private_data;
-    size_t        status;
-    int            remaining = size;
+    size_t        remaining = size;

     /* Lazy init */
     if (!LZ4Stream_init(state, size, true))
@@ -583,23 +580,36 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)

     while (remaining > 0)
     {
-        int            chunk = Min(remaining, DEFAULT_IO_BUFFER_SIZE);
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        remaining -= chunk;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+            {
+                errno = (errno) ? errno : ENOSPC;
+                pg_fatal("error during writing: %m");
+            }
+            state->bufdata = 0;
+        }

-        status = LZ4F_compressUpdate(state->ctx, state->buffer, state->buflen,
+        status = LZ4F_compressUpdate(state->ctx,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      ptr, chunk, NULL);
         if (LZ4F_isError(status))
             pg_fatal("error during writing: %s", LZ4F_getErrorName(status));
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, status, state->fp) != status)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            pg_fatal("error during writing: %m");
-        }
+        state->bufdata += status;

         ptr = ((const char *) ptr) + chunk;
+        remaining -= chunk;
     }
 }

@@ -675,6 +685,7 @@ LZ4Stream_close(CompressFileHandle *CFH)
 {
     FILE       *fp;
     LZ4State   *state = (LZ4State *) CFH->private_data;
+    size_t        required;
     size_t        status;
     int            ret;

@@ -683,20 +694,36 @@ LZ4Stream_close(CompressFileHandle *CFH)
     {
         if (state->compressing)
         {
-            status = LZ4F_compressEnd(state->ctx, state->buffer, state->buflen, NULL);
+            /* We might need to flush the buffer to make room */
+            required = LZ4F_compressBound(0, &state->prefs);
+            if (required > state->buflen - state->bufdata)
+            {
+                errno = 0;
+                if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+                {
+                    errno = (errno) ? errno : ENOSPC;
+                    pg_log_error("could not write to output file: %m");
+                }
+                state->bufdata = 0;
+            }
+
+            status = LZ4F_compressEnd(state->ctx,
+                                      state->buffer + state->bufdata,
+                                      state->buflen - state->bufdata,
+                                      NULL);
             if (LZ4F_isError(status))
             {
                 pg_log_error("could not end compression: %s",
                              LZ4F_getErrorName(status));
             }
             else
+                state->bufdata += status;
+
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
             {
-                errno = 0;
-                if (fwrite(state->buffer, 1, status, state->fp) != status)
-                {
-                    errno = (errno) ? errno : ENOSPC;
-                    pg_log_error("could not write to output file: %m");
-                }
+                errno = (errno) ? errno : ENOSPC;
+                pg_log_error("could not write to output file: %m");
             }

             status = LZ4F_freeCompressionContext(state->ctx);
diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index e24d45e1bbe..5fe2279faae 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -98,24 +98,22 @@ _ZstdWriteCommon(ArchiveHandle *AH, CompressorState *cs, bool flush)
     ZSTD_outBuffer *output = &zstdcs->output;

     /* Loop while there's any input or until flushed */
-    while (input->pos != input->size || flush)
+    while (input->pos < input->size || flush)
     {
         size_t        res;

-        output->pos = 0;
         res = ZSTD_compressStream2(zstdcs->cstream, output,
                                    input, flush ? ZSTD_e_end : ZSTD_e_continue);

         if (ZSTD_isError(res))
             pg_fatal("could not compress data: %s", ZSTD_getErrorName(res));

-        /*
-         * Extra paranoia: avoid zero-length chunks, since a zero length chunk
-         * is the EOF marker in the custom format. This should never happen
-         * but...
-         */
-        if (output->pos > 0)
+        /* Dump output buffer if full, or if we're told to flush */
+        if (output->pos >= output->size || flush)
+        {
             cs->writeF(AH, output->dst, output->pos);
+            output->pos = 0;
+        }

         if (res == 0)
             break;                /* End of frame or all input consumed */
@@ -367,26 +365,31 @@ Zstd_write(const void *ptr, size_t size, CompressFileHandle *CFH)
     if (zstdcs->cstream == NULL)
     {
         zstdcs->output.size = ZSTD_CStreamOutSize();
-        zstdcs->output.dst = pg_malloc0(zstdcs->output.size);
+        zstdcs->output.dst = pg_malloc(zstdcs->output.size);
+        zstdcs->output.pos = 0;
         zstdcs->cstream = _ZstdCStreamParams(CFH->compression_spec);
         if (zstdcs->cstream == NULL)
             pg_fatal("could not initialize compression library");
     }

     /* Consume all input, to be flushed later */
-    while (input->pos != input->size)
+    while (input->pos < input->size)
     {
-        output->pos = 0;
         res = ZSTD_compressStream2(zstdcs->cstream, output, input, ZSTD_e_continue);
         if (ZSTD_isError(res))
             pg_fatal("could not write to file: %s", ZSTD_getErrorName(res));

-        errno = 0;
-        cnt = fwrite(output->dst, 1, output->pos, zstdcs->fp);
-        if (cnt != output->pos)
+        /* Dump output buffer if full */
+        if (output->pos >= output->size)
         {
-            errno = (errno) ? errno : ENOSPC;
-            pg_fatal("could not write to file: %m");
+            errno = 0;
+            cnt = fwrite(output->dst, 1, output->pos, zstdcs->fp);
+            if (cnt != output->pos)
+            {
+                errno = (errno) ? errno : ENOSPC;
+                pg_fatal("could not write to file: %m");
+            }
+            output->pos = 0;
         }
     }
 }
@@ -448,7 +451,6 @@ Zstd_close(CompressFileHandle *CFH)
         /* Loop until the compression buffers are fully consumed */
         for (;;)
         {
-            output->pos = 0;
             res = ZSTD_compressStream2(zstdcs->cstream, output, input, ZSTD_e_end);
             if (ZSTD_isError(res))
             {
@@ -466,6 +468,7 @@ Zstd_close(CompressFileHandle *CFH)
                 success = false;
                 break;
             }
+            output->pos = 0;

             if (res == 0)
                 break;            /* End of frame */
--
2.43.7

From 865a0ccacf8ec065426b1e2823aba9d3cc2c1caf Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 10 Oct 2025 22:08:13 -0400
Subject: [PATCH v2 2/2] Try to align the block sizes of pg_dump's various
 compression modes.

(This is more of a straw man for discussion than a finished patch.)

After the previous patch, compress_zstd.c tends to produce data block
sizes around 128K, and we don't really have any control over that
unless we want to overrule ZSTD_CStreamOutSize().  Which seems like
a bad idea.  But let's try to align the other compression modes to
produce block sizes roughly comparable to that, so that pg_restore's
skip-data performance isn't enormously different for different modes.

gzip compression can be brought in line simply by setting
DEFAULT_IO_BUFFER_SIZE = 128K, which this patch does.  That
increases some unrelated buffer sizes, but none of them seem
problematic for modern platforms.

lz4's idea of appropriate block size is highly nonlinear:
if we just increase DEFAULT_IO_BUFFER_SIZE then the output
blocks end up around 200K.  I found that adjusting the slop
factor in LZ4State_compression_init was a not-too-ugly way
of bringing that number into line.

With compress = none you get data blocks the same sizes
as the table rows.  We could avoid that by introducing
an additional layer of buffering, but it's not clear to
me that that's a net win, so this patch doesn't do so.

Comments in compress_io.h and 002_pg_dump.pl suggest that if
we increase DEFAULT_IO_BUFFER_SIZE then we need to increase the
amount of data fed through the tests in order to improve coverage.
I've not done that here either.  In my view, the decompression side
of compress_lz4.c needs to be rewritten to be simpler, rather than
tested more.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_io.h  | 2 +-
 src/bin/pg_dump/compress_lz4.c | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 25a7bf0904d..53cf8c9b03b 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -24,7 +24,7 @@
  * still exercise all the branches. This applies especially if the value is
  * increased, in which case the overflow buffer may not be needed.
  */
-#define DEFAULT_IO_BUFFER_SIZE    4096
+#define DEFAULT_IO_BUFFER_SIZE    (128 * 1024)

 extern char *supports_compression(const pg_compress_specification compression_spec);

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index 47ee2e4bbac..c9ea895c137 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -102,9 +102,14 @@ LZ4State_compression_init(LZ4State *state)
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

     /*
-     * Then double it, to ensure we're not forced to flush every time.
+     * Add some slop to ensure we're not forced to flush every time.
+     *
+     * The present slop factor of 50% is chosen so that the typical output
+     * block size is about 128K when DEFAULT_IO_BUFFER_SIZE = 128K.  We might
+     * need a different slop factor to maintain that equivalence if
+     * DEFAULT_IO_BUFFER_SIZE is changed dramatically.
      */
-    state->buflen *= 2;
+    state->buflen += state->buflen / 2;

     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
--
2.43.7


While playing around with the test cases for pg_dump compression,
I was startled to discover that the performance of compress_lz4's
"stream API" code is absolutely abysmal.  Here is a simple test
case to demonstrate, using the regression database as test data:

$ pg_dump -Fd --compress=lz4 -f rlz4.dir regression
$ time pg_restore -f /dev/null rlz4.dir

real    0m0.023s
user    0m0.017s
sys     0m0.006s

So far so good, but now let's compress the toc.dat file:

$ lz4 -f -m --rm rlz4.dir/toc.dat
$ time pg_restore -f /dev/null rlz4.dir

real    0m1.335s
user    0m1.326s
sys     0m0.008s

Considering that lz4 prides itself on fast decompression speed,
that is not a sane result.  Decompressing the file only requires
a couple ms on my machine:

$ time lz4cat rlz4.dir/toc.dat.lz4 >/dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.002s

So on this example, pg_restore is something more than 600x slower
to read the TOC data than it ought to be.

On investigation, the blame mostly affixes to LZ4Stream_read_overflow's
habit of memmove'ing all the remaining buffered data after each read
operation.  Since reading a TOC file tends to involve a lot of small
(even one-byte) decompression calls, that amounts to an O(N^2) cost.
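
To make the cost concrete, here is a minimal sketch of the two buffering
strategies (illustrative only, not the actual pg_dump code; the function
names are made up).  The old code effectively did the first; the rewrite
in 0003 does the second by keeping a read cursor (outbufnext):

```
#include <stddef.h>
#include <string.h>

/* O(N^2) overall: after every small read, slide the leftover data forward */
static size_t
read_with_memmove(char *buf, size_t *buflen, void *out, size_t want)
{
    size_t      n = (want < *buflen) ? want : *buflen;

    memcpy(out, buf, n);
    *buflen -= n;
    memmove(buf, buf + n, *buflen); /* cost proportional to what remains */
    return n;
}

/* O(N) overall: keep a read cursor and move no data at all */
static size_t
read_with_cursor(const char *buf, size_t buflen, size_t *next,
                 void *out, size_t want)
{
    size_t      avail = buflen - *next;
    size_t      n = (want < avail) ? want : avail;

    memcpy(out, buf + *next, n);
    *next += n;
    return n;
}
```

When the caller issues thousands of one- and two-byte reads, the first
variant touches the entire remaining buffer on every call, which is where
the quadratic behavior comes from.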

This could have been fixed with a minimal patch, but to my
eyes LZ4Stream_read_internal and LZ4Stream_read_overflow are
badly-written spaghetti code; in particular the eol_flag logic
is inefficient and duplicative.  I chose to throw the code
away and rewrite from scratch.  This version is about sixty
lines shorter as well as not having the performance issue.

Fortunately, AFAICT the only way to get to this problem is to
manually LZ4-compress the toc.dat and/or blobs.toc files within a
directory-style archive.  Few people do that, which likely explains
the lack of field complaints.

On top of that, a similar case with gzip doesn't work at all,
though it's supposed to:

$ pg_dump -Fd --compress=gzip -f rgzip.dir regression
$ gzip rgzip.dir/toc.dat 
$ pg_restore -f /dev/null rgzip.dir
pg_restore: error: could not read from input file: 

Tracking this down, it seems that Gzip_read doesn't cope with
a request to read zero bytes.  I wonder how long that's been
broken.
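
The mechanism is easy to see in isolation: zlib's gzread() returns 0 both
at end-of-file and for a zero-length request, so a reader that treats a
zero return as a failure falls over as soon as a caller such as ReadStr()
legitimately asks for zero bytes.  A minimal sketch (not the actual
Gzip_read code; names are made up):

```
#include <stdio.h>
#include <zlib.h>

/*
 * Illustration only, not the actual Gzip_read code.  gzread() returns 0
 * both at EOF and when the requested length is 0, so treating 0 as an
 * error misfires on perfectly valid zero-byte requests.
 */
static size_t
naive_gzip_read(void *ptr, size_t size, gzFile gzfp)
{
    int         gzret = gzread(gzfp, ptr, (unsigned) size);

    if (gzret <= 0)             /* wrong: also fires when size == 0 */
    {
        fprintf(stderr, "could not read from input file\n");
        return 0;
    }
    return (size_t) gzret;
}

/* The guard: make a zero-byte request an explicit no-op */
static size_t
guarded_gzip_read(void *ptr, size_t size, gzFile gzfp)
{
    if (size == 0)
        return 0;
    return naive_gzip_read(ptr, size, gzfp);
}
```

Patch 0004 below adds essentially that guard at the top of Gzip_read.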

As far as I can see, 002_pg_dump.pl doesn't exercise the case of
manually-compressed toc.dat files.  I wonder why not.

0001 and 0002 attached are the same as before, then 0003 adds a fix
for the LZ4 performance problem, and 0004 fixes the Gzip_read problem.
While at it, I got rid of a few other minor inefficiencies such as
unnecessary buffer-zeroing.

            regards, tom lane

From 1d5ff3431923b9f75415b80721966ad42c4036f3 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 10 Oct 2025 20:57:15 -0400
Subject: [PATCH v3 1/4] Fix poor buffering logic in pg_dump's lz4 and zstd
 compression code.

Both of these modules dumped each bit of output that they got from
the underlying compression library as a separate "data block" in
the emitted archive file.  In the case of zstd this'd frequently
result in block sizes well under 100 bytes; lz4 is a little better
but still produces blocks around 300 bytes, at least in the test
case I tried.  This bloats the archive file a little bit compared
to larger block sizes, but the real problem is that when pg_restore
has to skip each data block rather than seeking directly to some
target data, tiny block sizes are enormously inefficient.

Fix both modules so that they fill their allocated buffer reasonably
well before dumping a data block.  In the case of lz4, also delete
some redundant logic that caused the lz4 frame header to be emitted
as a separate data block.  (That saves little, but I see no reason
to expend extra code to get worse results.)

I fixed the "stream API" code too.  In those cases, feeding small
amounts of data to fwrite() probably doesn't have any meaningful
performance consequences.  But it seems like a bad idea to leave
the two sets of code doing the same thing in two different ways.

In passing, remove unnecessary "extra paranoia" check in
_ZstdWriteCommon.  _CustomWriteFunc (the only possible referent
of cs->writeF) already protects itself against zero-length writes,
and it's really a modularity violation for _ZstdWriteCommon to know
that the custom format disallows empty data blocks.

Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_lz4.c  | 167 +++++++++++++++++++-------------
 src/bin/pg_dump/compress_zstd.c |  37 +++----
 2 files changed, 117 insertions(+), 87 deletions(-)

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index e2f7c468293..47ee2e4bbac 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -60,13 +60,11 @@ typedef struct LZ4State
     bool        compressing;

     /*
-     * Used by the Compressor API to mark if the compression headers have been
-     * written after initialization.
+     * I/O buffer area.
      */
-    bool        needs_header_flush;
-
-    size_t        buflen;
-    char       *buffer;
+    char       *buffer;            /* buffer for compressed data */
+    size_t        buflen;            /* allocated size of buffer */
+    size_t        bufdata;        /* amount of valid data currently in buffer */

     /*
      * Used by the Stream API to store already uncompressed data that the
@@ -76,12 +74,6 @@ typedef struct LZ4State
     size_t        overflowlen;
     char       *overflowbuf;

-    /*
-     * Used by both APIs to keep track of the compressed data length stored in
-     * the buffer.
-     */
-    size_t        compressedlen;
-
     /*
      * Used by both APIs to keep track of error codes.
      */
@@ -103,8 +95,17 @@ LZ4State_compression_init(LZ4State *state)
 {
     size_t        status;

+    /*
+     * Compute size needed for buffer, assuming we will present at most
+     * DEFAULT_IO_BUFFER_SIZE input bytes at a time.
+     */
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

+    /*
+     * Then double it, to ensure we're not forced to flush every time.
+     */
+    state->buflen *= 2;
+
     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
      * LZ4F_HEADER_SIZE_MAX. Verify that the requirement is met.
@@ -120,6 +121,10 @@ LZ4State_compression_init(LZ4State *state)
     }

     state->buffer = pg_malloc(state->buflen);
+
+    /*
+     * Insert LZ4 header into buffer.
+     */
     status = LZ4F_compressBegin(state->ctx,
                                 state->buffer, state->buflen,
                                 &state->prefs);
@@ -129,7 +134,7 @@ LZ4State_compression_init(LZ4State *state)
         return false;
     }

-    state->compressedlen = status;
+    state->bufdata = status;

     return true;
 }
@@ -201,36 +206,37 @@ WriteDataToArchiveLZ4(ArchiveHandle *AH, CompressorState *cs,
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
     size_t        remaining = dLen;
-    size_t        status;
-    size_t        chunk;
-
-    /* Write the header if not yet written. */
-    if (state->needs_header_flush)
-    {
-        cs->writeF(AH, state->buffer, state->compressedlen);
-        state->needs_header_flush = false;
-    }

     while (remaining > 0)
     {
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        if (remaining > DEFAULT_IO_BUFFER_SIZE)
-            chunk = DEFAULT_IO_BUFFER_SIZE;
-        else
-            chunk = remaining;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            cs->writeF(AH, state->buffer, state->bufdata);
+            state->bufdata = 0;
+        }

-        remaining -= chunk;
         status = LZ4F_compressUpdate(state->ctx,
-                                     state->buffer, state->buflen,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      data, chunk, NULL);

         if (LZ4F_isError(status))
             pg_fatal("could not compress data: %s",
                      LZ4F_getErrorName(status));

-        cs->writeF(AH, state->buffer, status);
+        state->bufdata += status;

-        data = ((char *) data) + chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
     }
 }

@@ -238,29 +244,32 @@ static void
 EndCompressorLZ4(ArchiveHandle *AH, CompressorState *cs)
 {
     LZ4State   *state = (LZ4State *) cs->private_data;
+    size_t        required;
     size_t        status;

     /* Nothing needs to be done */
     if (!state)
         return;

-    /*
-     * Write the header if not yet written. The caller is not required to call
-     * writeData if the relation does not contain any data. Thus it is
-     * possible to reach here without having flushed the header. Do it before
-     * ending the compression.
-     */
-    if (state->needs_header_flush)
-        cs->writeF(AH, state->buffer, state->compressedlen);
+    /* We might need to flush the buffer to make room for LZ4F_compressEnd */
+    required = LZ4F_compressBound(0, &state->prefs);
+    if (required > state->buflen - state->bufdata)
+    {
+        cs->writeF(AH, state->buffer, state->bufdata);
+        state->bufdata = 0;
+    }

     status = LZ4F_compressEnd(state->ctx,
-                              state->buffer, state->buflen,
+                              state->buffer + state->bufdata,
+                              state->buflen - state->bufdata,
                               NULL);
     if (LZ4F_isError(status))
         pg_fatal("could not end compression: %s",
                  LZ4F_getErrorName(status));
+    state->bufdata += status;

-    cs->writeF(AH, state->buffer, status);
+    /* Write the final bufferload */
+    cs->writeF(AH, state->buffer, state->bufdata);

     status = LZ4F_freeCompressionContext(state->ctx);
     if (LZ4F_isError(status))
@@ -302,8 +311,6 @@ InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compressi
         pg_fatal("could not initialize LZ4 compression: %s",
                  LZ4F_getErrorName(state->errcode));

-    /* Remember that the header has not been written. */
-    state->needs_header_flush = true;
     cs->private_data = state;
 }

@@ -360,19 +367,10 @@ LZ4Stream_init(LZ4State *state, int size, bool compressing)

     state->compressing = compressing;

-    /* When compressing, write LZ4 header to the output stream. */
     if (state->compressing)
     {
-
         if (!LZ4State_compression_init(state))
             return false;
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, state->compressedlen, state->fp) != state->compressedlen)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            return false;
-        }
     }
     else
     {
@@ -573,8 +571,7 @@ static void
 LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
 {
     LZ4State   *state = (LZ4State *) CFH->private_data;
-    size_t        status;
-    int            remaining = size;
+    size_t        remaining = size;

     /* Lazy init */
     if (!LZ4Stream_init(state, size, true))
@@ -583,23 +580,36 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)

     while (remaining > 0)
     {
-        int            chunk = Min(remaining, DEFAULT_IO_BUFFER_SIZE);
+        size_t        chunk;
+        size_t        required;
+        size_t        status;

-        remaining -= chunk;
+        /* We don't try to present more than DEFAULT_IO_BUFFER_SIZE bytes */
+        chunk = Min(remaining, (size_t) DEFAULT_IO_BUFFER_SIZE);
+
+        /* If not enough space, must flush buffer */
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+            {
+                errno = (errno) ? errno : ENOSPC;
+                pg_fatal("error during writing: %m");
+            }
+            state->bufdata = 0;
+        }

-        status = LZ4F_compressUpdate(state->ctx, state->buffer, state->buflen,
+        status = LZ4F_compressUpdate(state->ctx,
+                                     state->buffer + state->bufdata,
+                                     state->buflen - state->bufdata,
                                      ptr, chunk, NULL);
         if (LZ4F_isError(status))
             pg_fatal("error during writing: %s", LZ4F_getErrorName(status));
-
-        errno = 0;
-        if (fwrite(state->buffer, 1, status, state->fp) != status)
-        {
-            errno = (errno) ? errno : ENOSPC;
-            pg_fatal("error during writing: %m");
-        }
+        state->bufdata += status;

         ptr = ((const char *) ptr) + chunk;
+        remaining -= chunk;
     }
 }

@@ -675,6 +685,7 @@ LZ4Stream_close(CompressFileHandle *CFH)
 {
     FILE       *fp;
     LZ4State   *state = (LZ4State *) CFH->private_data;
+    size_t        required;
     size_t        status;
     int            ret;

@@ -683,20 +694,36 @@ LZ4Stream_close(CompressFileHandle *CFH)
     {
         if (state->compressing)
         {
-            status = LZ4F_compressEnd(state->ctx, state->buffer, state->buflen, NULL);
+            /* We might need to flush the buffer to make room */
+            required = LZ4F_compressBound(0, &state->prefs);
+            if (required > state->buflen - state->bufdata)
+            {
+                errno = 0;
+                if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
+                {
+                    errno = (errno) ? errno : ENOSPC;
+                    pg_log_error("could not write to output file: %m");
+                }
+                state->bufdata = 0;
+            }
+
+            status = LZ4F_compressEnd(state->ctx,
+                                      state->buffer + state->bufdata,
+                                      state->buflen - state->bufdata,
+                                      NULL);
             if (LZ4F_isError(status))
             {
                 pg_log_error("could not end compression: %s",
                              LZ4F_getErrorName(status));
             }
             else
+                state->bufdata += status;
+
+            errno = 0;
+            if (fwrite(state->buffer, 1, state->bufdata, state->fp) != state->bufdata)
             {
-                errno = 0;
-                if (fwrite(state->buffer, 1, status, state->fp) != status)
-                {
-                    errno = (errno) ? errno : ENOSPC;
-                    pg_log_error("could not write to output file: %m");
-                }
+                errno = (errno) ? errno : ENOSPC;
+                pg_log_error("could not write to output file: %m");
             }

             status = LZ4F_freeCompressionContext(state->ctx);
diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index e24d45e1bbe..5fe2279faae 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -98,24 +98,22 @@ _ZstdWriteCommon(ArchiveHandle *AH, CompressorState *cs, bool flush)
     ZSTD_outBuffer *output = &zstdcs->output;

     /* Loop while there's any input or until flushed */
-    while (input->pos != input->size || flush)
+    while (input->pos < input->size || flush)
     {
         size_t        res;

-        output->pos = 0;
         res = ZSTD_compressStream2(zstdcs->cstream, output,
                                    input, flush ? ZSTD_e_end : ZSTD_e_continue);

         if (ZSTD_isError(res))
             pg_fatal("could not compress data: %s", ZSTD_getErrorName(res));

-        /*
-         * Extra paranoia: avoid zero-length chunks, since a zero length chunk
-         * is the EOF marker in the custom format. This should never happen
-         * but...
-         */
-        if (output->pos > 0)
+        /* Dump output buffer if full, or if we're told to flush */
+        if (output->pos >= output->size || flush)
+        {
             cs->writeF(AH, output->dst, output->pos);
+            output->pos = 0;
+        }

         if (res == 0)
             break;                /* End of frame or all input consumed */
@@ -367,26 +365,31 @@ Zstd_write(const void *ptr, size_t size, CompressFileHandle *CFH)
     if (zstdcs->cstream == NULL)
     {
         zstdcs->output.size = ZSTD_CStreamOutSize();
-        zstdcs->output.dst = pg_malloc0(zstdcs->output.size);
+        zstdcs->output.dst = pg_malloc(zstdcs->output.size);
+        zstdcs->output.pos = 0;
         zstdcs->cstream = _ZstdCStreamParams(CFH->compression_spec);
         if (zstdcs->cstream == NULL)
             pg_fatal("could not initialize compression library");
     }

     /* Consume all input, to be flushed later */
-    while (input->pos != input->size)
+    while (input->pos < input->size)
     {
-        output->pos = 0;
         res = ZSTD_compressStream2(zstdcs->cstream, output, input, ZSTD_e_continue);
         if (ZSTD_isError(res))
             pg_fatal("could not write to file: %s", ZSTD_getErrorName(res));

-        errno = 0;
-        cnt = fwrite(output->dst, 1, output->pos, zstdcs->fp);
-        if (cnt != output->pos)
+        /* Dump output buffer if full */
+        if (output->pos >= output->size)
         {
-            errno = (errno) ? errno : ENOSPC;
-            pg_fatal("could not write to file: %m");
+            errno = 0;
+            cnt = fwrite(output->dst, 1, output->pos, zstdcs->fp);
+            if (cnt != output->pos)
+            {
+                errno = (errno) ? errno : ENOSPC;
+                pg_fatal("could not write to file: %m");
+            }
+            output->pos = 0;
         }
     }
 }
@@ -448,7 +451,6 @@ Zstd_close(CompressFileHandle *CFH)
         /* Loop until the compression buffers are fully consumed */
         for (;;)
         {
-            output->pos = 0;
             res = ZSTD_compressStream2(zstdcs->cstream, output, input, ZSTD_e_end);
             if (ZSTD_isError(res))
             {
@@ -466,6 +468,7 @@ Zstd_close(CompressFileHandle *CFH)
                 success = false;
                 break;
             }
+            output->pos = 0;

             if (res == 0)
                 break;            /* End of frame */
--
2.43.7

From 253f3c9f4bf2916930f4d48730e0dc98f757d66e Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Fri, 10 Oct 2025 22:08:13 -0400
Subject: [PATCH v3 2/4] Try to align the block sizes of pg_dump's various
 compression modes.

(This is more of a straw man for discussion than a finished patch.)

After the previous patch, compress_zstd.c tends to produce data block
sizes around 128K, and we don't really have any control over that
unless we want to overrule ZSTD_CStreamOutSize().  Which seems like
a bad idea.  But let's try to align the other compression modes to
produce block sizes roughly comparable to that, so that pg_restore's
skip-data performance isn't enormously different for different modes.

gzip compression can be brought in line simply by setting
DEFAULT_IO_BUFFER_SIZE = 128K, which this patch does.  That
increases some unrelated buffer sizes, but none of them seem
problematic for modern platforms.

lz4's idea of appropriate block size is highly nonlinear:
if we just increase DEFAULT_IO_BUFFER_SIZE then the output
blocks end up around 200K.  I found that adjusting the slop
factor in LZ4State_compression_init was a not-too-ugly way
of bringing that number into line.

With compress = none you get data blocks the same sizes
as the table rows.  We could avoid that by introducing
an additional layer of buffering, but it's not clear to
me that that's a net win, so this patch doesn't do so.

Comments in compress_io.h and 002_pg_dump.pl suggest that if
we increase DEFAULT_IO_BUFFER_SIZE then we need to increase the
amount of data fed through the tests in order to improve coverage.
I've not done that here either.  In my view, the decompression side
of compress_lz4.c needs to be rewritten to be simpler, rather than
tested more.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_io.h  | 2 +-
 src/bin/pg_dump/compress_lz4.c | 9 +++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 25a7bf0904d..53cf8c9b03b 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -24,7 +24,7 @@
  * still exercise all the branches. This applies especially if the value is
  * increased, in which case the overflow buffer may not be needed.
  */
-#define DEFAULT_IO_BUFFER_SIZE    4096
+#define DEFAULT_IO_BUFFER_SIZE    (128 * 1024)

 extern char *supports_compression(const pg_compress_specification compression_spec);

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index 47ee2e4bbac..c9ea895c137 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -102,9 +102,14 @@ LZ4State_compression_init(LZ4State *state)
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

     /*
-     * Then double it, to ensure we're not forced to flush every time.
+     * Add some slop to ensure we're not forced to flush every time.
+     *
+     * The present slop factor of 50% is chosen so that the typical output
+     * block size is about 128K when DEFAULT_IO_BUFFER_SIZE = 128K.  We might
+     * need a different slop factor to maintain that equivalence if
+     * DEFAULT_IO_BUFFER_SIZE is changed dramatically.
      */
-    state->buflen *= 2;
+    state->buflen += state->buflen / 2;

     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
--
2.43.7

From 0eaeb4009f1b6956d36a89b1139d49ae1f6db2dc Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Oct 2025 20:26:24 -0400
Subject: [PATCH v3 3/4] Fix serious performance problems in
 LZ4Stream_read_internal.

I was distressed to find that reading an LZ4-compressed toc.dat
file was hundreds of times slower than it ought to be.  On
investigation, the blame mostly affixes to LZ4Stream_read_overflow's
habit of memmove'ing all the remaining buffered data after each read
operation.  Since reading a TOC file tends to involve a lot of small
(even one-byte) decompression calls, that amounts to an O(N^2) cost.

This could have been fixed with a minimal patch, but to my
eyes LZ4Stream_read_internal and LZ4Stream_read_overflow are
badly-written spaghetti code; in particular the eol_flag logic
is inefficient and duplicative.  I chose to throw the code
away and rewrite from scratch.  This version is about sixty
lines shorter as well as not having the performance issue.

Fortunately, AFAICT the only way to get to this problem is to
manually LZ4-compress the toc.dat and/or blobs.toc files within a
directory-style archive.  Few people do that, which likely explains
the lack of field complaints.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_lz4.c | 242 ++++++++++++---------------------
 1 file changed, 89 insertions(+), 153 deletions(-)

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index c9ea895c137..450afd4e2be 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -65,14 +65,12 @@ typedef struct LZ4State
     char       *buffer;            /* buffer for compressed data */
     size_t        buflen;            /* allocated size of buffer */
     size_t        bufdata;        /* amount of valid data currently in buffer */
-
-    /*
-     * Used by the Stream API to store already uncompressed data that the
-     * caller has not consumed.
-     */
-    size_t        overflowalloclen;
-    size_t        overflowlen;
-    char       *overflowbuf;
+    /* These fields are used only while decompressing: */
+    size_t        bufnext;        /* next buffer position to decompress */
+    char       *outbuf;            /* buffer for decompressed data */
+    size_t        outbuflen;        /* allocated size of outbuf */
+    size_t        outbufdata;        /* amount of valid data currently in outbuf */
+    size_t        outbufnext;        /* next outbuf position to return */

     /*
      * Used by both APIs to keep track of error codes.
@@ -168,8 +166,8 @@ ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs)
         pg_fatal("could not create LZ4 decompression context: %s",
                  LZ4F_getErrorName(status));

-    outbuf = pg_malloc0(DEFAULT_IO_BUFFER_SIZE);
-    readbuf = pg_malloc0(DEFAULT_IO_BUFFER_SIZE);
+    outbuf = pg_malloc(DEFAULT_IO_BUFFER_SIZE);
+    readbuf = pg_malloc(DEFAULT_IO_BUFFER_SIZE);
     readbuflen = DEFAULT_IO_BUFFER_SIZE;
     while ((r = cs->readF(AH, &readbuf, &readbuflen)) > 0)
     {
@@ -184,7 +182,6 @@ ReadDataFromArchiveLZ4(ArchiveHandle *AH, CompressorState *cs)
             size_t        out_size = DEFAULT_IO_BUFFER_SIZE;
             size_t        read_size = readend - readp;

-            memset(outbuf, 0, DEFAULT_IO_BUFFER_SIZE);
             status = LZ4F_decompress(ctx, outbuf, &out_size,
                                      readp, &read_size, &dec_opt);
             if (LZ4F_isError(status))
@@ -327,15 +324,16 @@ InitCompressorLZ4(CompressorState *cs, const pg_compress_specification compressi

 /*
  * LZ4 equivalent to feof() or gzeof().  Return true iff there is no
- * decompressed output in the overflow buffer and the end of the backing file
- * is reached.
+ * more buffered data and the end of the input file has been reached.
  */
 static bool
 LZ4Stream_eof(CompressFileHandle *CFH)
 {
     LZ4State   *state = (LZ4State *) CFH->private_data;

-    return state->overflowlen == 0 && feof(state->fp);
+    return state->outbufnext >= state->outbufdata &&
+        state->bufnext >= state->bufdata &&
+        feof(state->fp);
 }

 static const char *
@@ -357,13 +355,15 @@ LZ4Stream_get_error(CompressFileHandle *CFH)
  *
  * Creates the necessary contexts for either compression or decompression. When
  * compressing data (indicated by compressing=true), it additionally writes the
- * LZ4 header in the output stream.
+ * LZ4 header in the output buffer.
+ *
+ * It's expected that a not-yet-initialized LZ4State will be zero-filled.
  *
  * Returns true on success. In case of a failure returns false, and stores the
  * error code in state->errcode.
  */
 static bool
-LZ4Stream_init(LZ4State *state, int size, bool compressing)
+LZ4Stream_init(LZ4State *state, bool compressing)
 {
     size_t        status;

@@ -386,66 +386,22 @@ LZ4Stream_init(LZ4State *state, int size, bool compressing)
             return false;
         }

-        state->buflen = Max(size, DEFAULT_IO_BUFFER_SIZE);
+        state->buflen = DEFAULT_IO_BUFFER_SIZE;
         state->buffer = pg_malloc(state->buflen);
-
-        state->overflowalloclen = state->buflen;
-        state->overflowbuf = pg_malloc(state->overflowalloclen);
-        state->overflowlen = 0;
+        state->outbuflen = DEFAULT_IO_BUFFER_SIZE;
+        state->outbuf = pg_malloc(state->outbuflen);
     }

     state->inited = true;
     return true;
 }

-/*
- * Read already decompressed content from the overflow buffer into 'ptr' up to
- * 'size' bytes, if available. If the eol_flag is set, then stop at the first
- * occurrence of the newline char prior to 'size' bytes.
- *
- * Any unread content in the overflow buffer is moved to the beginning.
- *
- * Returns the number of bytes read from the overflow buffer (and copied into
- * the 'ptr' buffer), or 0 if the overflow buffer is empty.
- */
-static int
-LZ4Stream_read_overflow(LZ4State *state, void *ptr, int size, bool eol_flag)
-{
-    char       *p;
-    int            readlen = 0;
-
-    if (state->overflowlen == 0)
-        return 0;
-
-    if (state->overflowlen >= size)
-        readlen = size;
-    else
-        readlen = state->overflowlen;
-
-    if (eol_flag && (p = memchr(state->overflowbuf, '\n', readlen)))
-        /* Include the line terminating char */
-        readlen = p - state->overflowbuf + 1;
-
-    memcpy(ptr, state->overflowbuf, readlen);
-    state->overflowlen -= readlen;
-
-    if (state->overflowlen > 0)
-        memmove(state->overflowbuf, state->overflowbuf + readlen, state->overflowlen);
-
-    return readlen;
-}
-
 /*
  * The workhorse for reading decompressed content out of an LZ4 compressed
  * stream.
  *
  * It will read up to 'ptrsize' decompressed content, or up to the new line
- * char if found first when the eol_flag is set. It is possible that the
- * decompressed output generated by reading any compressed input via the
- * LZ4F API, exceeds 'ptrsize'. Any exceeding decompressed content is stored
- * at an overflow buffer within LZ4State. Of course, when the function is
- * called, it will first try to consume any decompressed content already
- * present in the overflow buffer, before decompressing new content.
+ * char if one is found first when the eol_flag is set.
  *
  * Returns the number of bytes of decompressed data copied into the ptr
  * buffer, or -1 in case of error.
@@ -454,62 +410,85 @@ static int
 LZ4Stream_read_internal(LZ4State *state, void *ptr, int ptrsize, bool eol_flag)
 {
     int            dsize = 0;
-    int            rsize;
-    int            size = ptrsize;
-    bool        eol_found = false;
-
-    void       *readbuf;
+    int            remaining = ptrsize;

     /* Lazy init */
-    if (!LZ4Stream_init(state, size, false /* decompressing */ ))
+    if (!LZ4Stream_init(state, false /* decompressing */ ))
     {
         pg_log_error("unable to initialize LZ4 library: %s",
                      LZ4F_getErrorName(state->errcode));
         return -1;
     }

-    /* No work needs to be done for a zero-sized output buffer */
-    if (size <= 0)
-        return 0;
-
-    /* Verify that there is enough space in the outbuf */
-    if (size > state->buflen)
+    /* Loop until postcondition is satisfied */
+    while (remaining > 0)
     {
-        state->buflen = size;
-        state->buffer = pg_realloc(state->buffer, size);
-    }
-
-    /* use already decompressed content if available */
-    dsize = LZ4Stream_read_overflow(state, ptr, size, eol_flag);
-    if (dsize == size || (eol_flag && memchr(ptr, '\n', dsize)))
-        return dsize;
-
-    readbuf = pg_malloc(size);
+        /*
+         * If we already have some decompressed data, return that.
+         */
+        if (state->outbufnext < state->outbufdata)
+        {
+            char       *outptr = state->outbuf + state->outbufnext;
+            size_t        readlen = state->outbufdata - state->outbufnext;
+            bool        eol_found = false;
+
+            if (readlen > remaining)
+                readlen = remaining;
+            /* If eol_flag is set, don't read beyond a newline */
+            if (eol_flag)
+            {
+                char       *eolptr = memchr(outptr, '\n', readlen);

-    do
-    {
-        char       *rp;
-        char       *rend;
+                if (eolptr)
+                {
+                    readlen = eolptr - outptr + 1;
+                    eol_found = true;
+                }
+            }
+            memcpy(ptr, outptr, readlen);
+            ptr = ((char *) ptr) + readlen;
+            state->outbufnext += readlen;
+            dsize += readlen;
+            remaining -= readlen;
+            if (eol_found || remaining == 0)
+                break;
+            /* We must have emptied outbuf */
+            Assert(state->outbufnext >= state->outbufdata);
+        }

-        rsize = fread(readbuf, 1, size, state->fp);
-        if (rsize < size && !feof(state->fp))
+        /*
+         * If we don't have any pending compressed data, load more into
+         * state->buffer.
+         */
+        if (state->bufnext >= state->bufdata)
         {
-            pg_log_error("could not read from input file: %m");
-            return -1;
-        }
+            size_t        rsize;

-        rp = (char *) readbuf;
-        rend = (char *) readbuf + rsize;
+            rsize = fread(state->buffer, 1, state->buflen, state->fp);
+            if (rsize < state->buflen && !feof(state->fp))
+            {
+                pg_log_error("could not read from input file: %m");
+                return -1;
+            }
+            if (rsize == 0)
+                break;            /* must be EOF */
+            state->bufdata = rsize;
+            state->bufnext = 0;
+        }

-        while (rp < rend)
+        /*
+         * Decompress some data into state->outbuf.
+         */
         {
             size_t        status;
-            size_t        outlen = state->buflen;
-            size_t        read_remain = rend - rp;
-
-            memset(state->buffer, 0, outlen);
-            status = LZ4F_decompress(state->dtx, state->buffer, &outlen,
-                                     rp, &read_remain, NULL);
+            size_t        outlen = state->outbuflen;
+            size_t        inlen = state->bufdata - state->bufnext;
+
+            status = LZ4F_decompress(state->dtx,
+                                     state->outbuf, &outlen,
+                                     state->buffer + state->bufnext,
+                                     &inlen,
+                                     NULL);
             if (LZ4F_isError(status))
             {
                 state->errcode = status;
@@ -517,54 +496,11 @@ LZ4Stream_read_internal(LZ4State *state, void *ptr, int ptrsize, bool eol_flag)
                              LZ4F_getErrorName(state->errcode));
                 return -1;
             }
-
-            rp += read_remain;
-
-            /*
-             * fill in what space is available in ptr if the eol flag is set,
-             * either skip if one already found or fill up to EOL if present
-             * in the outbuf
-             */
-            if (outlen > 0 && dsize < size && eol_found == false)
-            {
-                char       *p;
-                size_t        lib = (!eol_flag) ? size - dsize : size - 1 - dsize;
-                size_t        len = outlen < lib ? outlen : lib;
-
-                if (eol_flag &&
-                    (p = memchr(state->buffer, '\n', outlen)) &&
-                    (size_t) (p - state->buffer + 1) <= len)
-                {
-                    len = p - state->buffer + 1;
-                    eol_found = true;
-                }
-
-                memcpy((char *) ptr + dsize, state->buffer, len);
-                dsize += len;
-
-                /* move what did not fit, if any, at the beginning of the buf */
-                if (len < outlen)
-                    memmove(state->buffer, state->buffer + len, outlen - len);
-                outlen -= len;
-            }
-
-            /* if there is available output, save it */
-            if (outlen > 0)
-            {
-                while (state->overflowlen + outlen > state->overflowalloclen)
-                {
-                    state->overflowalloclen *= 2;
-                    state->overflowbuf = pg_realloc(state->overflowbuf,
-                                                    state->overflowalloclen);
-                }
-
-                memcpy(state->overflowbuf + state->overflowlen, state->buffer, outlen);
-                state->overflowlen += outlen;
-            }
+            state->bufnext += inlen;
+            state->outbufdata = outlen;
+            state->outbufnext = 0;
         }
-    } while (rsize == size && dsize < size && eol_found == false);
-
-    pg_free(readbuf);
+    }

     return dsize;
 }
@@ -579,7 +515,7 @@ LZ4Stream_write(const void *ptr, size_t size, CompressFileHandle *CFH)
     size_t        remaining = size;

     /* Lazy init */
-    if (!LZ4Stream_init(state, size, true))
+    if (!LZ4Stream_init(state, true))
         pg_fatal("unable to initialize LZ4 library: %s",
                  LZ4F_getErrorName(state->errcode));

@@ -742,7 +678,7 @@ LZ4Stream_close(CompressFileHandle *CFH)
             if (LZ4F_isError(status))
                 pg_log_error("could not end decompression: %s",
                              LZ4F_getErrorName(status));
-            pg_free(state->overflowbuf);
+            pg_free(state->outbuf);
         }

         pg_free(state->buffer);
--
2.43.7

From 6e65ac1ff26071024cfcf24d15819aa6b72af137 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Oct 2025 21:16:12 -0400
Subject: [PATCH v3 4/4] Fix issues with reading zero bytes in Gzip_read and
 Zstd_read.

pg_dump expects a read request of zero bytes to be a no-op
(see for example ReadStr()).  Gzip_read got this wrong and
falsely supposed that the resulting gzret == 0 indicated
an error.  Zstd_read got the right result, but only after
doing a lot more work than necessary, because it checked at
the bottom of the loop not the top.

The Gzip_read fix perhaps should be back-patched, because
it breaks the nominally-supported case of manually gzip'ing
the toc.dat file within a directory-style dump.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_gzip.c | 4 ++++
 src/bin/pg_dump/compress_zstd.c | 5 +----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_dump/compress_gzip.c b/src/bin/pg_dump/compress_gzip.c
index 4a067e1402c..ad3b6486027 100644
--- a/src/bin/pg_dump/compress_gzip.c
+++ b/src/bin/pg_dump/compress_gzip.c
@@ -257,6 +257,10 @@ Gzip_read(void *ptr, size_t size, CompressFileHandle *CFH)
     gzFile        gzfp = (gzFile) CFH->private_data;
     int            gzret;

+    /* Reading zero bytes must be a no-op */
+    if (size == 0)
+        return 0;
+
     gzret = gzread(gzfp, ptr, size);

     /*
diff --git a/src/bin/pg_dump/compress_zstd.c b/src/bin/pg_dump/compress_zstd.c
index 5fe2279faae..36c1fd264ee 100644
--- a/src/bin/pg_dump/compress_zstd.c
+++ b/src/bin/pg_dump/compress_zstd.c
@@ -287,7 +287,7 @@ Zstd_read_internal(void *ptr, size_t size, CompressFileHandle *CFH, bool exit_on
     output->dst = ptr;
     output->pos = 0;

-    for (;;)
+    while (output->pos < output->size)
     {
         Assert(input->pos <= input->size);
         Assert(input->size <= input_allocated_size);
@@ -341,9 +341,6 @@ Zstd_read_internal(void *ptr, size_t size, CompressFileHandle *CFH, bool exit_on
             if (res == 0)
                 break;            /* End of frame */
         }
-
-        if (output->pos == output->size)
-            break;                /* We read all the data that fits */
     }

     return output->pos;
--
2.43.7




On Oct 12, 2025, at 09:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:

While playing around with the test cases for pg_dump compression,
I was startled to discover that the performance of compress_lz4's
"stream API" code is absolutely abysmal.  Here is a simple test
case to demonstrate, using the regression database as test data:

$ pg_dump -Fd --compress=lz4 -f rlz4.dir regression
$ time pg_restore -f /dev/null rlz4.dir

real    0m0.023s
user    0m0.017s
sys     0m0.006s

So far so good, but now let's compress the toc.dat file:

$ lz4 -f -m --rm rlz4.dir/toc.dat
$ time pg_restore -f /dev/null rlz4.dir

real    0m1.335s
user    0m1.326s
sys     0m0.008s

Considering that lz4 prides itself on fast decompression speed,
that is not a sane result.  Decompressing the file only requires
a couple ms on my machine:

$ time lz4cat rlz4.dir/toc.dat.lz4 >/dev/null

real    0m0.002s
user    0m0.000s
sys     0m0.002s

So on this example, pg_restore is something more than 600x slower
to read the TOC data than it ought to be.


I tested the patch on my MacBook M4 (Sequoia 15.6.1), compiled with -O2 optimization:

- For LZ4 performance improvement:

Without the fix (latest master):
```
chaol@ChaodeMacBook-Air tmp % pg_dump -Fd --compress=lz4 -f rlz4.dir regression

chaol@ChaodeMacBook-Air tmp % time pg_restore -f /dev/null rlz4.dir
pg_restore -f /dev/null rlz4.dir  0.03s user 0.03s system 11% cpu 0.463 total

chaol@ChaodeMacBook-Air tmp % lz4 -f -m --rm rlz4.dir/toc.dat

# It took 1.59s to restore from compressed data, much slower than from uncompressed data.
chaol@ChaodeMacBook-Air tmp % time pg_restore -f /dev/null rlz4.dir
pg_restore -f /dev/null rlz4.dir  1.59s user 0.02s system 97% cpu 1.653 total
```

With the fix:
```
chaol@ChaodeMacBook-Air tmp % pg_dump -Fd --compress=lz4 -f rlz4.dir regression

chaol@ChaodeMacBook-Air tmp % time pg_restore -f /dev/null rlz4.dir
pg_restore -f /dev/null rlz4.dir  0.02s user 0.03s system 16% cpu 0.305 total

chaol@ChaodeMacBook-Air tmp % lz4 -f -m --rm rlz4.dir/toc.dat

# Much faster !!!
chaol@ChaodeMacBook-Air tmp % time pg_restore -f /dev/null rlz4.dir
pg_restore -f /dev/null rlz4.dir  0.02s user 0.02s system 93% cpu 0.043 total
```

- For Gzip zero read bug:

Without the fix:
```
chaol@ChaodeMacBook-Air tmp % pg_dump -Fd --compress=gzip -f rgzip.dir regression
chaol@ChaodeMacBook-Air tmp % gzip rgzip.dir/toc.dat

# Failed
chaol@ChaodeMacBook-Air tmp % pg_restore -f /dev/null rgzip.dir
pg_restore: error: could not read from input file:
```

With the fix:
```
chaol@ChaodeMacBook-Air tmp % pg_dump -Fd --compress=gzip -f rgzip.dir regression
chaol@ChaodeMacBook-Air tmp % gzip rgzip.dir/toc.dat

# Ran successfully 
chaol@ChaodeMacBook-Air tmp % pg_restore -f /dev/null rgzip.dir
```

I also reviewed the code changes; basically LGTM. Just a couple of trivial comments:

1 - 0001
In WriteDataToArchiveLZ4()

```
+        required = LZ4F_compressBound(chunk, &state->prefs);
+        if (required > state->buflen - state->bufdata)
+        {
+            cs->writeF(AH, state->buffer, state->bufdata);
+            state->bufdata = 0;
+        }
```

And in EndCompressorLZ4()
```
+    required = LZ4F_compressBound(0, &state->prefs);
+    if (required > state->buflen - state->bufdata)
+    {
+        cs->writeF(AH, state->buffer, state->bufdata);
+        state->bufdata = 0;
+    }
```

These two code pieces are similar; the only difference is the first parameter passed to LZ4F_compressBound().

I think we can create an inline function for it. But these are just 5 lines, so I don’t have a strong opinion on that.

Same thing for the other code pieces in LZ4Stream_write() and LZ4Stream_close().

2 - 0003
```
 /*
  * LZ4 equivalent to feof() or gzeof().  Return true iff there is no
- * decompressed output in the overflow buffer and the end of the backing file
```

This doesn’t belong to the current patch. But “iff” seems to be a typo of “if”. You may fix it as you are touching this piece of code.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
"David G. Johnston"
Date
On Sunday, October 12, 2025, Chao Li <li.evan.chao@gmail.com> wrote:
2 - 0003
```
 /*
  * LZ4 equivalent to feof() or gzeof().  Return true iff there is no
- * decompressed output in the overflow buffer and the end of the backing file
```

This doesn’t belong to the current patch. But “iff” seems to be a typo of “if”. You may fix it as you are touching this piece of code.

“iif” is shorthand for “if and only if”.  So it isn’t likely to be a typo; it only needs to be changed if it is wrong.  I haven’t looked to see.

David J.

Re: [PATCH v1] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
"David G. Johnston"
Date
On Sunday, October 12, 2025, David G. Johnston <david.g.johnston@gmail.com> wrote:
On Sunday, October 12, 2025, Chao Li <li.evan.chao@gmail.com> wrote:
2 - 0003
```
 /*
  * LZ4 equivalent to feof() or gzeof().  Return true iff there is no
- * decompressed output in the overflow buffer and the end of the backing file
```

This doesn’t belong to the current patch. But “iff” seems to be a typo of “if”. You may fix it as you are touching this piece of code.

“iif” is shorthand for “if and only if”.  So it isn’t likely to be a typo; it only needs to be changed if it is wrong.  I haven’t looked to see.

Never mind…two f’s, not two i’s …

David J.



On Oct 13, 2025, at 11:39, David G. Johnston <david.g.johnston@gmail.com> wrote:


“iif” is shorthand for “if and only if”.  So it isn’t likely to be a typo; it only needs to be changed if it is wrong.  I haven’t looked to see.


Thanks for the explanation, I wasn’t aware of that. I learned something new.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Chao Li <li.evan.chao@gmail.com> writes:
>> On Oct 12, 2025, at 09:25, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While playing around with the test cases for pg_dump compression,
>> I was startled to discover that the performance of compress_lz4's
>> "stream API" code is absolutely abysmal.

> I tested the patch on my MacBook M4 (Sequoia 15.6.1), compiled with -O2 optimization:

Thanks for looking at it!

> I also reviewed the code change, basically LGTM. Just a couple of trivial comments:

> 1 - 0001
> In WriteDataToArchiveLZ4()

> ```
> +        required = LZ4F_compressBound(chunk, &state->prefs);
> +        if (required > state->buflen - state->bufdata)
> +        {
> +            cs->writeF(AH, state->buffer, state->bufdata);
> +            state->bufdata = 0;
> +        }
> ```

> And in EndCompressorLZ4()
> ```
> +    required = LZ4F_compressBound(0, &state->prefs);
> +    if (required > state->buflen - state->bufdata)
> +    {
> +        cs->writeF(AH, state->buffer, state->bufdata);
> +        state->bufdata = 0;
> +    }
> ```

> These two code pieces are similar, only difference is the first parameter passed to LZ4F_compressBound().

> I think we can create an inline function for it. But these are just 5 lines, I don’t have a strong option on that.

Yeah, I don't think that would improve code readability noticeably.
By the time you got done writing a documentation comment for the
new subroutine, the code would probably be longer not shorter.
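
For the record, the helper under discussion would presumably look
something like the sketch below; the name, signature, and placement are
assumptions, not taken from any posted patch, and it assumes the context
of compress_lz4.c:

```
/*
 * Hypothetical helper for the repeated "flush if the next compressed
 * chunk might not fit" pattern; name and signature are illustrative
 * only, not part of any posted patch.
 */
static inline void
LZ4State_flush_if_needed(LZ4State *state, size_t next_input_len,
                         ArchiveHandle *AH, CompressorState *cs)
{
    size_t      required = LZ4F_compressBound(next_input_len, &state->prefs);

    if (required > state->buflen - state->bufdata)
    {
        cs->writeF(AH, state->buffer, state->bufdata);
        state->bufdata = 0;
    }
}
```

Note also that the stream-API call sites (LZ4Stream_write and
LZ4Stream_close) flush through fwrite() to state->fp rather than through
cs->writeF, so a single helper would not cover all four places without
further parameterization.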

I've pushed the parts of that patch set that I thought were
uncontroversial.  What's left is the business about increasing
DEFAULT_IO_BUFFER_SIZE and then adjusting the tests appropriately.

So, v4-0001 attached is the previous v3-0002 to increase
DEFAULT_IO_BUFFER_SIZE, plus additions in compress_none.c to make
--compress=none also produce predictably large data blocks.
I decided that if we're going to rely on that behavior as part
of the solution for this thread's original problem, we'd better
make it happen for all compression options.

0002 adds a test case in 002_pg_dump.pl to exercise --compress=none,
because without that we don't have any coverage of the new code
0001 added in compress_none.c.  That makes for a small increase
in the runtime of 002_pg_dump.pl, but I'm inclined to think it's
worth doing.

0003 modifies the existing test cases that manually compress
blobs.toc files so that they also compress toc.dat.  I feel
like it's mostly an oversight that that wasn't done to begin
with; if it had been done, we'd have caught the Gzip_read bug
right away.  Also, AFAICT, this doesn't cost anything measurable
in test runtime.

0004 increases the row width in the existing test case that says
it's trying to push more than DEFAULT_IO_BUFFER_SIZE through
the compressors.  While I agree with the premise, this solution
is hugely expensive: it adds about 12% to the already-long runtime
of 002_pg_dump.pl.  I'd like to find a better way, but ran out of
energy for today.  (I think the reason this costs so much is that
it's effectively iterated hundreds of times because of
002_pg_dump.pl's more or less cross-product approach to testing
everything.  Maybe we should pull it out of that structure?)

            regards, tom lane

From 969b1ba5a94449e10e56103f18ccf4f9c5481796 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:09:36 -0400
Subject: [PATCH v4 1/4] Align the data block sizes of pg_dump's various
 compression modes.

After commit fe8192a95, compress_zstd.c tends to produce data block
sizes around 128K, and we don't really have any control over that
unless we want to overrule ZSTD_CStreamOutSize().  Which seems like
a bad idea.  But let's try to align the other compression modes to
produce block sizes roughly comparable to that, so that pg_restore's
skip-data performance isn't enormously different for different modes.

gzip compression can be brought in line simply by setting
DEFAULT_IO_BUFFER_SIZE = 128K, which this patch does.  That
increases some unrelated buffer sizes, but none of them seem
problematic for modern platforms.

lz4's idea of appropriate block size is highly nonlinear:
if we just increase DEFAULT_IO_BUFFER_SIZE then the output
blocks end up around 200K.  I found that adjusting the slop
factor in LZ4State_compression_init was a not-too-ugly way
of bringing that number roughly into line.

With compress = none you get data blocks the same sizes as the
table rows, which seems potentially problematic for narrow tables.
Introduce a layer of buffering to make that case match the others.

Comments in compress_io.h and 002_pg_dump.pl suggest that if
we increase DEFAULT_IO_BUFFER_SIZE then we need to increase the
amount of data fed through the tests in order to improve coverage.
I've not done that here, leaving it for a separate patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_io.h    |  4 +-
 src/bin/pg_dump/compress_lz4.c   |  9 ++++-
 src/bin/pg_dump/compress_none.c  | 64 +++++++++++++++++++++++++++++++-
 src/tools/pgindent/typedefs.list |  1 +
 4 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 25a7bf0904d..ae008585c89 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -22,9 +22,9 @@
  *
  * When changing this value, it's necessary to check the relevant test cases
  * still exercise all the branches. This applies especially if the value is
- * increased, in which case the overflow buffer may not be needed.
+ * increased, in which case some loops may not get iterated.
  */
-#define DEFAULT_IO_BUFFER_SIZE    4096
+#define DEFAULT_IO_BUFFER_SIZE    (128 * 1024)

 extern char *supports_compression(const pg_compress_specification compression_spec);

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index b817a083d38..450afd4e2be 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -100,9 +100,14 @@ LZ4State_compression_init(LZ4State *state)
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

     /*
-     * Then double it, to ensure we're not forced to flush every time.
+     * Add some slop to ensure we're not forced to flush every time.
+     *
+     * The present slop factor of 50% is chosen so that the typical output
+     * block size is about 128K when DEFAULT_IO_BUFFER_SIZE = 128K.  We might
+     * need a different slop factor to maintain that equivalence if
+     * DEFAULT_IO_BUFFER_SIZE is changed dramatically.
      */
-    state->buflen *= 2;
+    state->buflen += state->buflen / 2;

     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
diff --git a/src/bin/pg_dump/compress_none.c b/src/bin/pg_dump/compress_none.c
index 4abb2e95abc..94c155a572d 100644
--- a/src/bin/pg_dump/compress_none.c
+++ b/src/bin/pg_dump/compress_none.c
@@ -22,6 +22,18 @@
  *----------------------
  */

+/*
+ * We buffer outgoing data, just to ensure that data blocks written to the
+ * archive file are of reasonable size.  The read side could use this struct,
+ * but there's no need because it does not retain data across calls.
+ */
+typedef struct NoneCompressorState
+{
+    char       *buffer;            /* buffer for unwritten data */
+    size_t        buflen;            /* allocated size of buffer */
+    size_t        bufdata;        /* amount of valid data currently in buffer */
+} NoneCompressorState;
+
 /*
  * Private routines
  */
@@ -49,13 +61,45 @@ static void
 WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
                        const void *data, size_t dLen)
 {
-    cs->writeF(AH, data, dLen);
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+    size_t        remaining = dLen;
+
+    while (remaining > 0)
+    {
+        size_t        chunk;
+
+        /* Dump buffer if full */
+        if (nonecs->bufdata >= nonecs->buflen)
+        {
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+            nonecs->bufdata = 0;
+        }
+        /* And fill it */
+        chunk = nonecs->buflen - nonecs->bufdata;
+        if (chunk > remaining)
+            chunk = remaining;
+        memcpy(nonecs->buffer + nonecs->bufdata, data, chunk);
+        nonecs->bufdata += chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
+    }
 }

 static void
 EndCompressorNone(ArchiveHandle *AH, CompressorState *cs)
 {
-    /* no op */
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+
+    if (nonecs)
+    {
+        /* Dump buffer if nonempty */
+        if (nonecs->bufdata > 0)
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+        /* Free working state */
+        pg_free(nonecs->buffer);
+        pg_free(nonecs);
+        cs->private_data = NULL;
+    }
 }

 /*
@@ -71,6 +115,22 @@ InitCompressorNone(CompressorState *cs,
     cs->end = EndCompressorNone;

     cs->compression_spec = compression_spec;
+
+    /*
+     * If the caller has defined a write function, prepare the necessary
+     * buffer.
+     */
+    if (cs->writeF)
+    {
+        NoneCompressorState *nonecs;
+
+        nonecs = (NoneCompressorState *) pg_malloc(sizeof(NoneCompressorState));
+        nonecs->buflen = DEFAULT_IO_BUFFER_SIZE;
+        nonecs->buffer = pg_malloc(nonecs->buflen);
+        nonecs->bufdata = 0;
+
+        cs->private_data = nonecs;
+    }
 }


diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..63f9387044b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1758,6 +1758,7 @@ NextValueExpr
 Node
 NodeTag
 NonEmptyRange
+NoneCompressorState
 Notification
 NotificationList
 NotifyStmt
--
2.43.7

From 4407d68c4a4b95979e24fc225573944def989fc9 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:20:38 -0400
Subject: [PATCH v4 2/4] Add a test case to cover pg_dump with --compress=none.

This brings the coverage of compress_none.c up from about 64%
to 90%, in particular covering the new code added in the previous
patch.

This adds perhaps 2% to the runtime of 002_pg_dump.pl.  It could
be argued that compress_none.c is simple enough to not require
permanent memorialization in a test case; but it's less simple
than it was before, so maybe we want this.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 6be6888b977..8a08f9a5f6f 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -81,6 +81,24 @@ my %pgdump_runs = (
         ],
     },

+    compression_none_custom => {
+        test_key => 'compression',
+        dump_cmd => [
+            'pg_dump',
+            '--format' => 'custom',
+            '--compress' => 'none',
+            '--file' => "$tempdir/compression_none_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_none_custom.sql",
+            '--statistics',
+            "$tempdir/compression_none_custom.dump",
+        ],
+    },
+
     # Do not use --no-sync to give test coverage for data sync.
     compression_gzip_custom => {
         test_key => 'compression',
--
2.43.7

From 4f0c4e17ac17dd8fd36cbfc4f887fff8cfecb1f6 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:30:55 -0400
Subject: [PATCH v4 3/4] Include compression of toc.dat in manually-compressed
 test cases.

We would have found the bug fixed in commit a239c4a0c much sooner
if we'd done this.  As far as I can tell, this doesn't reduce
test coverage at all, since there are other tests of directory
format that still use an uncompressed toc.dat.  And its effect
on the total runtime of 002_pg_dump.pl seems lost in the noise.

While here, fix a glitch that I noticed in testing: the
$glob_patterns tests were incapable of failing, because glob()
will return 'foo' as 'foo' whether there is a matching file or
not.  (Indeed, the stanza just above this one relies on that.)

I'm slightly tempted to back-patch this part.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
Discussion: https://postgr.es/m/25345.1760289877@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 41 +++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 8a08f9a5f6f..8b287a673bf 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -140,15 +140,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'GZIP_PROGRAM'},
-            args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ],
+            args => [
+                '-f',
+                "$tempdir/compression_gzip_dir/toc.dat",
+                "$tempdir/compression_gzip_dir/blobs_*.toc",
+            ],
         },
-        # Verify that only data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_gzip_dir/toc.dat",
+            "$tempdir/compression_gzip_dir/toc.dat.gz",
             "$tempdir/compression_gzip_dir/*.dat.gz",
         ],
         restore_cmd => [
@@ -219,18 +222,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'LZ4'},
             args => [
                 '-z', '-f', '-m', '--rm',
+                "$tempdir/compression_lz4_dir/toc.dat",
                 "$tempdir/compression_lz4_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_lz4_dir/toc.dat",
+            "$tempdir/compression_lz4_dir/toc.dat.lz4",
             "$tempdir/compression_lz4_dir/*.dat.lz4",
         ],
         restore_cmd => [
@@ -303,18 +306,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'ZSTD'},
             args => [
-                '-z', '-f',
-                '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc",
+                '-z', '-f', '--rm',
+                "$tempdir/compression_zstd_dir/toc.dat",
+                "$tempdir/compression_zstd_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_zstd_dir/toc.dat",
+            "$tempdir/compression_zstd_dir/toc.dat.zst",
             "$tempdir/compression_zstd_dir/*.dat.zst",
         ],
         restore_cmd => [
@@ -5523,8 +5526,12 @@ foreach my $run (sort keys %pgdump_runs)
         foreach my $glob_pattern (@{$glob_patterns})
         {
             my @glob_output = glob($glob_pattern);
-            is(scalar(@glob_output) > 0,
-                1, "$run: glob check for $glob_pattern");
+            my $ok = 0;
+            # certainly found some files if glob() returned multiple matches
+            $ok = 1 if (scalar(@glob_output) > 1);
+            # if just one match, we need to check if it's real
+            $ok = 1 if (scalar(@glob_output) == 1 && -f $glob_output[0]);
+            is($ok, 1, "$run: glob check for $glob_pattern");
         }
     }

--
2.43.7

From 163956c37940aeee1bf4c48fb62ae22612ec89be Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Mon, 13 Oct 2025 16:43:06 -0400
Subject: [PATCH v4 4/4] Widen the wide row used to verify correct (de)
 compression.

Commit 1a05c1d25 advises us (not without reason) to ensure that this
test case fully fills DEFAULT_IO_BUFFER_SIZE, so that loops within
the compression logic will iterate completely.  To follow that
advice with the proposed DEFAULT_IO_BUFFER_SIZE of 128K, we need
something close to this.  This does indeed increase the reported
code coverage by a few lines.

This is a very expensive test, however: it increases the total
runtime of 002_pg_dump.pl by about 12% for me, and that's already
one of the more expensive TAP tests.  I wonder if there isn't a
less brute-force way of getting the same result.  If there's not,
is this really worth it?

One idea is to take this out of the main structure of 002_pg_dump.pl
so that the test instance is run just once per compression type,
rather than many many times as it is here.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 8b287a673bf..8bbd6ed47a9 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -3732,14 +3732,15 @@ my %tests = (
     },

     # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during
-    # (de)compression operations
+    # (de)compression operations.  The weird regex is because Perl
+    # restricts us to repeat counts of less than 32K.
     'COPY test_compression_method' => {
         create_order => 111,
         create_sql => 'INSERT INTO dump_test.test_compression_method (col1) '
-          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
+          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,65536) a;',
         regexp => qr/^
             \QCOPY dump_test.test_compression_method (col1) FROM stdin;\E
-            \n(?:\d{15277}\n){1}\\\.\n
+            \n(?:(?:\d\d\d\d\d\d\d\d\d\d){31657}\d\d\d\d\n){1}\\\.\n
             /xm,
         like => {
             %full_runs,
--
2.43.7




On Oct 14, 2025, at 05:07, Tom Lane <tgl@sss.pgh.pa.us> wrote:


I've pushed the parts of that patch set that I thought were
uncontroversial.  What's left is the business about increasing
DEFAULT_IO_BUFFER_SIZE and then adjusting the tests appropriately.

So, v4-0001 attached is the previous v3-0002 to increase
DEFAULT_IO_BUFFER_SIZE, plus additions in compress_none.c to make
--compress=none also produce predictably large data blocks.
I decided that if we're going to rely on that behavior as part
of the solution for this thread's original problem, we'd better
make it happen for all compression options.

0002 adds a test case in 002_pg_dump.pl to exercise --compress=none,
because without that we don't have any coverage of the new code
0001 added in compress_none.c.  That makes for a small increase
in the runtime of 002_pg_dump.pl, but I'm inclined to think it's
worth doing.

0003 modifies the existing test cases that manually compress
blobs.toc files so that they also compress toc.dat.  I feel
like it's mostly an oversight that that wasn't done to begin
with; if it had been done, we'd have caught the Gzip_read bug
right away.  Also, AFAICT, this doesn't cost anything measurable
in test runtime.

0004 increases the row width in the existing test case that says
it's trying to push more than DEFAULT_IO_BUFFER_SIZE through
the compressors.  While I agree with the premise, this solution
is hugely expensive: it adds about 12% to the already-long runtime
of 002_pg_dump.pl.  I'd like to find a better way, but ran out of
energy for today.  (I think the reason this costs so much is that
it's effectively iterated hundreds of times because of
002_pg_dump.pl's more or less cross-product approach to testing
everything.  Maybe we should pull it out of that structure?)


In the v4 patch, the code changes are straightforward. 0001 changes compress_none.c to write data into a 128K buffer first and flush the buffer only when it fills up. 0002, 0003 and 0004 add more test cases. I have no comments on the code diff.

I tested DEFAULT_IO_BUFFER_SIZE with 4K, 32K, 64K, 128K and 256K. It looks like increasing the buffer size doesn’t improve the performance significantly; with buffer sizes of 64K, 128K and 256K the test results are very close. I tested with both lz4 and no compression. I am not suggesting tuning the buffer size; these data are only for your reference.

To run the test, I created a test database and filled it with several GB of data.

```
256K ====
% time pg_dump -Fd --compress=lz4 -f dump_A.dir evantest
pg_dump -Fd --compress=lz4 -f dump_A.dir evantest  3.37s user 0.82s system 57% cpu 7.249 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.24s user 0.19s system 43% cpu 0.991 total

% time pg_dump -Fd --compress=none -f dump_A.dir evantest
pg_dump -Fd --compress=none -f dump_A.dir evantest  2.34s user 1.72s system 68% cpu 5.949 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.02s user 0.19s system 22% cpu 0.921 total

128K ===
% time pg_dump -Fd --compress=lz4 -f dump_A.dir evantest
pg_dump -Fd --compress=lz4 -f dump_A.dir evantest  3.38s user 0.85s system 64% cpu 6.525 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.28s user 0.21s system 47% cpu 1.042 total

% time pg_dump -Fd --compress=none -f dump_A.dir evantest
pg_dump -Fd --compress=none -f dump_A.dir evantest  2.34s user 1.67s system 68% cpu 5.835 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.03s user 0.22s system 22% cpu 1.118 total

64K ===
% time pg_dump -Fd --compress=lz4 -f dump_A.dir evantest
pg_dump -Fd --compress=lz4 -f dump_A.dir evantest  3.39s user 0.92s system 63% cpu 6.761 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.33s user 0.24s system 40% cpu 1.420 total

% time pg_dump -Fd --compress=none -f dump_A.dir evantest
pg_dump -Fd --compress=none -f dump_A.dir evantest  2.35s user 1.74s system 69% cpu 5.849 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.04s user 0.22s system 27% cpu 0.939 total

32K ===
% time pg_dump -Fd --compress=lz4 -f dump_A.dir evantest
pg_dump -Fd --compress=lz4 -f dump_A.dir evantest  3.43s user 0.94s system 58% cpu 7.416 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.34s user 0.22s system 56% cpu 0.983 total

% time pg_dump -Fd --compress=none -f dump_A.dir evantest
pg_dump -Fd --compress=none -f dump_A.dir evantest  2.34s user 1.75s system 67% cpu 6.070 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.05s user 0.23s system 29% cpu 0.926 total

4k====
% time pg_dump -Fd --compress=lz4 -f dump_A.dir evantest
pg_dump -Fd --compress=lz4 -f dump_A.dir evantest  3.45s user 0.94s system 60% cpu 7.298 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.37s user 0.29s system 64% cpu 1.016 total

% time pg_dump -Fd --compress=none -f dump_A.dir evantest
pg_dump -Fd --compress=none -f dump_A.dir evantest  2.33s user 1.78s system 69% cpu 5.947 total
% time pg_restore -f /dev/null dump_A.dir
pg_restore -f /dev/null dump_A.dir  0.12s user 0.29s system 40% cpu 1.009 total
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Chao Li <li.evan.chao@gmail.com> writes:
> I tested DEFAULT_IO_BUFFER_SIZE with 4K, 32K, 64K, 128K and 256K. Looks like increasing the buffer size doesn’t
> improve the performance significantly. Actually, with the buffer size 64K, 128K and 256K, the test results are very
> close. I tested both with lz4 and none compression. I am not suggesting tuning the buffer size. These data are only
> for your reference.

Yeah, I would not expect straight pg_dump/pg_restore performance
to vary very much once the buffer size gets above not-too-many KB.
The thing we are really interested in here is how fast pg_restore
can skip over unwanted table data in a large archive file, and that
I believe should be pretty sensitive to block size.
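
To make that concrete, here is a rough sketch of the pattern involved (an illustration only, not the actual pg_backup_custom.c code; read_block_len() is a hypothetical helper): a selective restore that has no TOC offsets must read each data block's length prefix and jump past its payload, once per block, until it reaches the entry it wants.

```
#include <stdio.h>

/* hypothetical helper: parse a block's length prefix; returns 0 at end */
extern int read_block_len(FILE *fp, long *len);

void
skip_data_blocks(FILE *fp)
{
    long        len;

    while (read_block_len(fp, &len) && len > 0)
    {
        /* one forward jump per data block; larger blocks mean fewer jumps */
        if (fseek(fp, len, SEEK_CUR) != 0)
            break;
    }
}
```

The number of iterations, and hence of header reads and forward jumps, scales inversely with the typical block size, which is what DEFAULT_IO_BUFFER_SIZE ends up governing once all compression modes emit comparably sized blocks.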

You could measure that without getting into the complexities of
parallel restore if you make a custom-format dump of a few large
tables that does not have offset data in it, and then seeing how
fast is selective restore of just the last table.

            regards, tom lane





On Oct 14, 2025, at 08:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Chao Li <li.evan.chao@gmail.com> writes:
I tested DEFAULT_IO_BUFFER_SIZE with 4K, 32K, 64K, 128K and 256K. Looks like increasing the buffer size doesn’t improve the performance significantly. Actually, with the buffer size 64K, 128K and 256K, the test results are very close. I tested both with lz4 and none compression. I am not suggesting tuning the buffer size. These data are only for your reference.

Yeah, I would not expect straight pg_dump/pg_restore performance
to vary very much once the buffer size gets above not-too-many KB.
The thing we are really interested in here is how fast pg_restore
can skip over unwanted table data in a large archive file, and that
I believe should be pretty sensitive to block size.

You could measure that without getting into the complexities of
parallel restore if you make a custom-format dump of a few large
tables that does not have offset data in it, and then seeing how
fast is selective restore of just the last table.


I'm not sure if I did something wrong, but I still don’t see much difference between buffer sizes of 4K and 128K with your suggested test.

I created 3 tables, each with 10 million rows:

```
evantest=# CREATE TABLE tbl1 AS SELECT generate_series(1,10000000) AS id;
SELECT 10000000
evantest=# CREATE TABLE tbl2 AS SELECT generate_series(1,10000000) AS id;
SELECT 10000000
evantest=# CREATE TABLE tbl3 AS SELECT generate_series(1,10000000) AS id;
SELECT 10000000
```

And did a custom-format dump:
```
% time pg_dump -Fc -f db.dump evantest
pg_dump -Fc -f db.dump evantest  51.72s user 1.13s system 98% cpu 53.602 total
```

Then I restored the last table, using binaries compiled with buffer sizes of 4K and 128K (I dropped tbl3 before running pg_restore):
```
# 4K ===
% time pg_restore -d evantest -t tbl3 db.dump
pg_restore -d evantest -t tbl3 db.dump  0.06s user 0.04s system 6% cpu 1.528 total

# 128K
% time pg_restore -d evantest -t tbl3 db.dump
pg_restore -d evantest -t tbl3 db.dump  0.05s user 0.04s system 3% cpu 2.146 total
```

The other thing I noticed is that when I do a custom-format dump, if the target file already exists, pg_dump just overwrites it; however, when I do a directory dump, if the target directory exists, pg_dump fails with the error “directory xxx is not empty”. Why are the behaviors different?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Chao Li <li.evan.chao@gmail.com> writes:
>> On Oct 14, 2025, at 08:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The thing we are really interested in here is how fast pg_restore
>> can skip over unwanted table data in a large archive file, and that
>> I believe should be pretty sensitive to block size.

> Not sure if I did something wrong, but I still don’t see much difference between buffer size 4K and 128K with your
> suggested test.
>
> % time pg_dump -Fc -f db.dump evantest

This won't show the effect, because pg_dump will be able to go back
and insert data offsets into the dump's TOC, so pg_restore can just
seek to where the data is.  See upthread discussion about what's
needed to provoke Dimitrios' problem.

I tried this very tiny (relatively speaking) test case:

regression=# create database d1;
CREATE DATABASE
regression=# \c d1
You are now connected to database "d1" as user "postgres".
d1=# create table alpha as select repeat(random()::text, 1000) from generate_series(1,1000000);
SELECT 1000000
d1=# create table omega as select 42 as x;
SELECT 1
d1=# \q

Then

$ pg_dump -Fc d1 | cat >d1.dump
$ time pg_restore -f /dev/null -t omega d1.dump

The point of the pipe-to-cat is to reproduce Dimitrios' problem case
with no data offsets in the TOC.  Then the restore is doing about the
simplest thing I can think of to make it skip over most of the archive
file.  Also, I'm intentionally using the default choice of gzip
because that already responds to DEFAULT_IO_BUFFER_SIZE properly.
(This test is with current HEAD, no patches except adjusting
DEFAULT_IO_BUFFER_SIZE.)

I got these timings:

DEFAULT_IO_BUFFER_SIZE = 1K
real    0m0.020s
user    0m0.002s
sys     0m0.017s

DEFAULT_IO_BUFFER_SIZE = 4K
real    0m0.014s
user    0m0.003s
sys     0m0.011s

DEFAULT_IO_BUFFER_SIZE = 128K
real    0m0.002s
user    0m0.000s
sys     0m0.002s

This test case has only about 50MB worth of compressed data,
so of course the times are very small; scaling it up to
gigabytes would yield more impressive results.  But the
effect is clearly visible.

            regards, tom lane





On Oct 14, 2025, at 10:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:

This won't show the effect, because pg_dump will be able to go back
and insert data offsets into the dump's TOC, so pg_restore can just
seek to where the data is.  See upthread discussion about what's
needed to provoke Dimitrios' problem.

I tried this very tiny (relatively speaking) test case:

regression=# create database d1;
CREATE DATABASE
regression=# \c d1
You are now connected to database "d1" as user "postgres".
d1=# create table alpha as select repeat(random()::text, 1000) from generate_series(1,1000000);
SELECT 1000000
d1=# create table omega as select 42 as x;
SELECT 1
d1=# \q

Then

$ pg_dump -Fc d1 | cat >d1.dump
$ time pg_restore -f /dev/null -t omega d1.dump

The point of the pipe-to-cat is to reproduce Dimitrios' problem case
with no data offsets in the TOC.  Then the restore is doing about the
simplest thing I can think of to make it skip over most of the archive
file.  Also, I'm intentionally using the default choice of gzip
because that already responds to DEFAULT_IO_BUFFER_SIZE properly.
(This test is with current HEAD, no patches except adjusting
DEFAULT_IO_BUFFER_SIZE.)

I got these timings:

DEFAULT_IO_BUFFER_SIZE = 1K
real    0m0.020s
user    0m0.002s
sys     0m0.017s

DEFAULT_IO_BUFFER_SIZE = 4K
real    0m0.014s
user    0m0.003s
sys     0m0.011s

DEFAULT_IO_BUFFER_SIZE = 128K
real    0m0.002s
user    0m0.000s
sys     0m0.002s

This test case has only about 50MB worth of compressed data,
so of course the times are very small; scaling it up to
gigabytes would yield more impressive results.  But the
effect is clearly visible.


With your example I can now see the difference; however, I had to create five times as many rows in the first table:

```
evantest=# CREATE TABLE alpha AS SELECT repeat(random()::text, 1000) FROM generate_series(1, 5000000);
SELECT 5000000
evantest=#
evantest=# CREATE TABLE omega AS SELECT 42 AS x;
SELECT 1
```

My test is with the patch applied; I only adjusted DEFAULT_IO_BUFFER_SIZE.

DEFAULT_IO_BUFFER_SIZE=4K
```
% /usr/bin/time pg_dump -Fc evantest | cat > d1.dump
      294.83 real       220.28 user        45.90 sys

% /usr/bin/time pg_restore -f /dev/null -t omega d1.dump
        0.16 real         0.02 user         0.09 sys
```

DEFAULT_IO_BUFFER_SIZE=128K
```
% /usr/bin/time pg_dump -Fc evantest | cat > d1.dump
      296.89 real       220.85 user        46.64 sys

% /usr/bin/time pg_restore -f /dev/null -t omega d1.dump
        0.01 real         0.00 user         0.00 sys
```

With a bigger block size, pg_restore skips fewer blocks, so it gets faster.
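
As a back-of-the-envelope check (assuming the data blocks track the buffer size, which is what patch 0001 aims for across compression modes), skipping the same amount of archived data with 128K blocks needs roughly 128K / 4K = 32 times fewer length-header reads and forward jumps than with 4K blocks, which matches the direction of the timings above.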

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

I wrote:
> 0004 increases the row width in the existing test case that says
> it's trying to push more than DEFAULT_IO_BUFFER_SIZE through
> the compressors.  While I agree with the premise, this solution
> is hugely expensive: it adds about 12% to the already-long runtime
> of 002_pg_dump.pl.  I'd like to find a better way, but ran out of
> energy for today.  (I think the reason this costs so much is that
> it's effectively iterated hundreds of times because of
> 002_pg_dump.pl's more or less cross-product approach to testing
> everything.  Maybe we should pull it out of that structure?)

The attached patchset accomplishes that by splitting 002_pg_dump.pl
into two scripts, one that is just concerned with the compression
test cases and one that does everything else.  This might not be
the prettiest solution, since it duplicates a lot of perl code.
I thought about refactoring 002_pg_dump.pl so that it could handle
two separate sets of runs-plus-tests, but decided it was overly
complicated already.

Anyway, 0001 attached is the same as in v4, 0002 performs the
test split without intending to change coverage, and then 0003
adds the new test cases I wanted.  For me, this ends up with
just about the same runtime as before, or maybe a smidge less.
I'd hoped for possibly more savings than that, but I'm content
with it being a wash.

I think this is more or less committable, and then we could get
back to the original question of whether it's worth tweaking
pg_restore's seek-vs-scan behavior.

            regards, tom lane

From cf923236ab86f8fedc6bc865a8754bb9fa26f252 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 15 Oct 2025 14:48:42 -0400
Subject: [PATCH v5 1/3] Align the data block sizes of pg_dump's various
 compression modes.

After commit fe8192a95, compress_zstd.c tends to produce data block
sizes around 128K, and we don't really have any control over that
unless we want to overrule ZSTD_CStreamOutSize().  Which seems like
a bad idea.  But let's try to align the other compression modes to
produce block sizes roughly comparable to that, so that pg_restore's
skip-data performance isn't enormously different for different modes.

gzip compression can be brought in line simply by setting
DEFAULT_IO_BUFFER_SIZE = 128K, which this patch does.  That
increases some unrelated buffer sizes, but none of them seem
problematic for modern platforms.

lz4's idea of appropriate block size is highly nonlinear:
if we just increase DEFAULT_IO_BUFFER_SIZE then the output
blocks end up around 200K.  I found that adjusting the slop
factor in LZ4State_compression_init was a not-too-ugly way
of bringing that number roughly into line.

With compress = none you get data blocks the same sizes as the
table rows, which seems potentially problematic for narrow tables.
Introduce a layer of buffering to make that case match the others.

Comments in compress_io.h and 002_pg_dump.pl suggest that if
we increase DEFAULT_IO_BUFFER_SIZE then we need to increase the
amount of data fed through the tests in order to improve coverage.
I've not done that here, leaving it for a separate patch.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/compress_io.h    |  4 +-
 src/bin/pg_dump/compress_lz4.c   |  9 ++++-
 src/bin/pg_dump/compress_none.c  | 64 +++++++++++++++++++++++++++++++-
 src/tools/pgindent/typedefs.list |  1 +
 4 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/src/bin/pg_dump/compress_io.h b/src/bin/pg_dump/compress_io.h
index 25a7bf0904d..ae008585c89 100644
--- a/src/bin/pg_dump/compress_io.h
+++ b/src/bin/pg_dump/compress_io.h
@@ -22,9 +22,9 @@
  *
  * When changing this value, it's necessary to check the relevant test cases
  * still exercise all the branches. This applies especially if the value is
- * increased, in which case the overflow buffer may not be needed.
+ * increased, in which case some loops may not get iterated.
  */
-#define DEFAULT_IO_BUFFER_SIZE    4096
+#define DEFAULT_IO_BUFFER_SIZE    (128 * 1024)

 extern char *supports_compression(const pg_compress_specification compression_spec);

diff --git a/src/bin/pg_dump/compress_lz4.c b/src/bin/pg_dump/compress_lz4.c
index b817a083d38..450afd4e2be 100644
--- a/src/bin/pg_dump/compress_lz4.c
+++ b/src/bin/pg_dump/compress_lz4.c
@@ -100,9 +100,14 @@ LZ4State_compression_init(LZ4State *state)
     state->buflen = LZ4F_compressBound(DEFAULT_IO_BUFFER_SIZE, &state->prefs);

     /*
-     * Then double it, to ensure we're not forced to flush every time.
+     * Add some slop to ensure we're not forced to flush every time.
+     *
+     * The present slop factor of 50% is chosen so that the typical output
+     * block size is about 128K when DEFAULT_IO_BUFFER_SIZE = 128K.  We might
+     * need a different slop factor to maintain that equivalence if
+     * DEFAULT_IO_BUFFER_SIZE is changed dramatically.
      */
-    state->buflen *= 2;
+    state->buflen += state->buflen / 2;

     /*
      * LZ4F_compressBegin requires a buffer that is greater or equal to
diff --git a/src/bin/pg_dump/compress_none.c b/src/bin/pg_dump/compress_none.c
index 4abb2e95abc..94c155a572d 100644
--- a/src/bin/pg_dump/compress_none.c
+++ b/src/bin/pg_dump/compress_none.c
@@ -22,6 +22,18 @@
  *----------------------
  */

+/*
+ * We buffer outgoing data, just to ensure that data blocks written to the
+ * archive file are of reasonable size.  The read side could use this struct,
+ * but there's no need because it does not retain data across calls.
+ */
+typedef struct NoneCompressorState
+{
+    char       *buffer;            /* buffer for unwritten data */
+    size_t        buflen;            /* allocated size of buffer */
+    size_t        bufdata;        /* amount of valid data currently in buffer */
+} NoneCompressorState;
+
 /*
  * Private routines
  */
@@ -49,13 +61,45 @@ static void
 WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
                        const void *data, size_t dLen)
 {
-    cs->writeF(AH, data, dLen);
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+    size_t        remaining = dLen;
+
+    while (remaining > 0)
+    {
+        size_t        chunk;
+
+        /* Dump buffer if full */
+        if (nonecs->bufdata >= nonecs->buflen)
+        {
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+            nonecs->bufdata = 0;
+        }
+        /* And fill it */
+        chunk = nonecs->buflen - nonecs->bufdata;
+        if (chunk > remaining)
+            chunk = remaining;
+        memcpy(nonecs->buffer + nonecs->bufdata, data, chunk);
+        nonecs->bufdata += chunk;
+        data = ((const char *) data) + chunk;
+        remaining -= chunk;
+    }
 }

 static void
 EndCompressorNone(ArchiveHandle *AH, CompressorState *cs)
 {
-    /* no op */
+    NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+
+    if (nonecs)
+    {
+        /* Dump buffer if nonempty */
+        if (nonecs->bufdata > 0)
+            cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+        /* Free working state */
+        pg_free(nonecs->buffer);
+        pg_free(nonecs);
+        cs->private_data = NULL;
+    }
 }

 /*
@@ -71,6 +115,22 @@ InitCompressorNone(CompressorState *cs,
     cs->end = EndCompressorNone;

     cs->compression_spec = compression_spec;
+
+    /*
+     * If the caller has defined a write function, prepare the necessary
+     * buffer.
+     */
+    if (cs->writeF)
+    {
+        NoneCompressorState *nonecs;
+
+        nonecs = (NoneCompressorState *) pg_malloc(sizeof(NoneCompressorState));
+        nonecs->buflen = DEFAULT_IO_BUFFER_SIZE;
+        nonecs->buffer = pg_malloc(nonecs->buflen);
+        nonecs->bufdata = 0;
+
+        cs->private_data = nonecs;
+    }
 }


diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5290b91e83e..63f9387044b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1758,6 +1758,7 @@ NextValueExpr
 Node
 NodeTag
 NonEmptyRange
+NoneCompressorState
 Notification
 NotificationList
 NotifyStmt
--
2.43.7

From e45a7653c47ee11a12c036738ddfdd88ef6ad6cd Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 15 Oct 2025 15:02:04 -0400
Subject: [PATCH v5 2/3] Split 002_pg_dump.pl into two test files.

Add a new test script 006_pg_dump_compress.pl, containing just
the pg_dump tests specifically concerned with compression, and
remove those tests from 002_pg_dump.pl.  We can also drop some
infrastructure in 002_pg_dump.pl that was used only for these tests.

The point of this is to avoid the cost of running these test
cases over and over in all the scenarios (runs) that 002_pg_dump.pl
exercises.  We don't learn anything more about the behavior of the
compression code that way, and we expend significant amounts of
time, since one of these test cases is quite large and due to get
larger.

The intent of this specific patch is to provide exactly the same
coverage as before, except that I went back to using --no-sync
in all the test runs moved over to 006_pg_dump_compress.pl.
I think that avoiding that had basically been cargo-culted into
these test cases as a result of modeling them on the
defaults_custom_format test case; again, doing that over and over
isn't going to teach us anything new.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/meson.build               |   1 +
 src/bin/pg_dump/t/002_pg_dump.pl          | 408 ---------------
 src/bin/pg_dump/t/006_pg_dump_compress.pl | 611 ++++++++++++++++++++++
 3 files changed, 612 insertions(+), 408 deletions(-)
 create mode 100644 src/bin/pg_dump/t/006_pg_dump_compress.pl

diff --git a/src/bin/pg_dump/meson.build b/src/bin/pg_dump/meson.build
index a2233b0a1b4..f3c669f484e 100644
--- a/src/bin/pg_dump/meson.build
+++ b/src/bin/pg_dump/meson.build
@@ -102,6 +102,7 @@ tests += {
       't/003_pg_dump_with_server.pl',
       't/004_pg_dump_parallel.pl',
       't/005_pg_dump_filterfile.pl',
+      't/006_pg_dump_compress.pl',
       't/010_dump_connstr.pl',
     ],
   },
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index 6be6888b977..b789cd2e863 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -20,22 +20,12 @@ my $tempdir = PostgreSQL::Test::Utils::tempdir;
 # test_key indicates that a given run should simply use the same
 # set of like/unlike tests as another run, and which run that is.
 #
-# compile_option indicates if the commands run depend on a compilation
-# option, if any.  This can be used to control if tests should be
-# skipped when a build dependency is not satisfied.
-#
 # dump_cmd is the pg_dump command to run, which is an array of
 # the full command and arguments to run.  Note that this is run
 # using $node->command_ok(), so the port does not need to be
 # specified and is pulled from $PGPORT, which is set by the
 # PostgreSQL::Test::Cluster system.
 #
-# compress_cmd is the utility command for (de)compression, if any.
-# Note that this should generally be used on pg_dump's output
-# either to generate a text file to run the through the tests, or
-# to test pg_restore's ability to parse manually compressed files
-# that otherwise pg_dump does not compress on its own (e.g. *.toc).
-#
 # glob_patterns is an optional array consisting of strings compilable
 # with glob() to check the files generated after a dump.
 #
@@ -55,8 +45,6 @@ my $tempdir = PostgreSQL::Test::Utils::tempdir;

 my $supports_icu = ($ENV{with_icu} eq 'yes');
 my $supports_gzip = check_pg_config("#define HAVE_LIBZ 1");
-my $supports_lz4 = check_pg_config("#define USE_LZ4 1");
-my $supports_zstd = check_pg_config("#define USE_ZSTD 1");

 my %pgdump_runs = (
     binary_upgrade => {
@@ -81,256 +69,6 @@ my %pgdump_runs = (
         ],
     },

-    # Do not use --no-sync to give test coverage for data sync.
-    compression_gzip_custom => {
-        test_key => 'compression',
-        compile_option => 'gzip',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'custom',
-            '--compress' => '1',
-            '--file' => "$tempdir/compression_gzip_custom.dump",
-            '--statistics',
-            'postgres',
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--file' => "$tempdir/compression_gzip_custom.sql",
-            '--statistics',
-            "$tempdir/compression_gzip_custom.dump",
-        ],
-        command_like => {
-            command => [
-                'pg_restore', '--list',
-                "$tempdir/compression_gzip_custom.dump",
-            ],
-            expected => qr/Compression: gzip/,
-            name => 'data content is gzip-compressed'
-        },
-    },
-
-    # Do not use --no-sync to give test coverage for data sync.
-    compression_gzip_dir => {
-        test_key => 'compression',
-        compile_option => 'gzip',
-        dump_cmd => [
-            'pg_dump',
-            '--jobs' => '2',
-            '--format' => 'directory',
-            '--compress' => 'gzip:1',
-            '--file' => "$tempdir/compression_gzip_dir",
-            '--statistics',
-            'postgres',
-        ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
-        compress_cmd => {
-            program => $ENV{'GZIP_PROGRAM'},
-            args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ],
-        },
-        # Verify that only data files were compressed
-        glob_patterns => [
-            "$tempdir/compression_gzip_dir/toc.dat",
-            "$tempdir/compression_gzip_dir/*.dat.gz",
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--jobs' => '2',
-            '--file' => "$tempdir/compression_gzip_dir.sql",
-            '--statistics',
-            "$tempdir/compression_gzip_dir",
-        ],
-    },
-
-    compression_gzip_plain => {
-        test_key => 'compression',
-        compile_option => 'gzip',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'plain',
-            '--compress' => '1',
-            '--file' => "$tempdir/compression_gzip_plain.sql.gz",
-            '--statistics',
-            'postgres',
-        ],
-        # Decompress the generated file to run through the tests.
-        compress_cmd => {
-            program => $ENV{'GZIP_PROGRAM'},
-            args => [ '-d', "$tempdir/compression_gzip_plain.sql.gz", ],
-        },
-    },
-
-    # Do not use --no-sync to give test coverage for data sync.
-    compression_lz4_custom => {
-        test_key => 'compression',
-        compile_option => 'lz4',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'custom',
-            '--compress' => 'lz4',
-            '--file' => "$tempdir/compression_lz4_custom.dump",
-            '--statistics',
-            'postgres',
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--file' => "$tempdir/compression_lz4_custom.sql",
-            '--statistics',
-            "$tempdir/compression_lz4_custom.dump",
-        ],
-        command_like => {
-            command => [
-                'pg_restore', '--list',
-                "$tempdir/compression_lz4_custom.dump",
-            ],
-            expected => qr/Compression: lz4/,
-            name => 'data content is lz4 compressed'
-        },
-    },
-
-    # Do not use --no-sync to give test coverage for data sync.
-    compression_lz4_dir => {
-        test_key => 'compression',
-        compile_option => 'lz4',
-        dump_cmd => [
-            'pg_dump',
-            '--jobs' => '2',
-            '--format' => 'directory',
-            '--compress' => 'lz4:1',
-            '--file' => "$tempdir/compression_lz4_dir",
-            '--statistics',
-            'postgres',
-        ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
-        compress_cmd => {
-            program => $ENV{'LZ4'},
-            args => [
-                '-z', '-f', '-m', '--rm',
-                "$tempdir/compression_lz4_dir/blobs_*.toc",
-            ],
-        },
-        # Verify that data files were compressed
-        glob_patterns => [
-            "$tempdir/compression_lz4_dir/toc.dat",
-            "$tempdir/compression_lz4_dir/*.dat.lz4",
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--jobs' => '2',
-            '--file' => "$tempdir/compression_lz4_dir.sql",
-            '--statistics',
-            "$tempdir/compression_lz4_dir",
-        ],
-    },
-
-    compression_lz4_plain => {
-        test_key => 'compression',
-        compile_option => 'lz4',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'plain',
-            '--compress' => 'lz4',
-            '--file' => "$tempdir/compression_lz4_plain.sql.lz4",
-            '--statistics',
-            'postgres',
-        ],
-        # Decompress the generated file to run through the tests.
-        compress_cmd => {
-            program => $ENV{'LZ4'},
-            args => [
-                '-d', '-f',
-                "$tempdir/compression_lz4_plain.sql.lz4",
-                "$tempdir/compression_lz4_plain.sql",
-            ],
-        },
-    },
-
-    compression_zstd_custom => {
-        test_key => 'compression',
-        compile_option => 'zstd',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'custom',
-            '--compress' => 'zstd',
-            '--file' => "$tempdir/compression_zstd_custom.dump",
-            '--statistics',
-            'postgres',
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--file' => "$tempdir/compression_zstd_custom.sql",
-            '--statistics',
-            "$tempdir/compression_zstd_custom.dump",
-        ],
-        command_like => {
-            command => [
-                'pg_restore', '--list',
-                "$tempdir/compression_zstd_custom.dump",
-            ],
-            expected => qr/Compression: zstd/,
-            name => 'data content is zstd compressed'
-        },
-    },
-
-    compression_zstd_dir => {
-        test_key => 'compression',
-        compile_option => 'zstd',
-        dump_cmd => [
-            'pg_dump',
-            '--jobs' => '2',
-            '--format' => 'directory',
-            '--compress' => 'zstd:1',
-            '--file' => "$tempdir/compression_zstd_dir",
-            '--statistics',
-            'postgres',
-        ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
-        compress_cmd => {
-            program => $ENV{'ZSTD'},
-            args => [
-                '-z', '-f',
-                '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc",
-            ],
-        },
-        # Verify that data files were compressed
-        glob_patterns => [
-            "$tempdir/compression_zstd_dir/toc.dat",
-            "$tempdir/compression_zstd_dir/*.dat.zst",
-        ],
-        restore_cmd => [
-            'pg_restore',
-            '--jobs' => '2',
-            '--file' => "$tempdir/compression_zstd_dir.sql",
-            '--statistics',
-            "$tempdir/compression_zstd_dir",
-        ],
-    },
-
-    # Exercise long mode for test coverage
-    compression_zstd_plain => {
-        test_key => 'compression',
-        compile_option => 'zstd',
-        dump_cmd => [
-            'pg_dump',
-            '--format' => 'plain',
-            '--compress' => 'zstd:long',
-            '--file' => "$tempdir/compression_zstd_plain.sql.zst",
-            '--statistics',
-            'postgres',
-        ],
-        # Decompress the generated file to run through the tests.
-        compress_cmd => {
-            program => $ENV{'ZSTD'},
-            args => [
-                '-d', '-f',
-                "$tempdir/compression_zstd_plain.sql.zst", "-o",
-                "$tempdir/compression_zstd_plain.sql",
-            ],
-        },
-    },
-
     clean => {
         dump_cmd => [
             'pg_dump', '--no-sync',
@@ -891,10 +629,6 @@ my %pgdump_runs = (
 # of the pg_dump runs happening.  This is what "seeds" the
 # system with objects to be dumped out.
 #
-# There can be a flag called 'lz4', which can be set if the test
-# case depends on LZ4.  Tests marked with this flag are skipped if
-# the build used does not support LZ4.
-#
 # Building of this hash takes a bit of time as all of the regexps
 # included in it are compiled.  This greatly improves performance
 # as the regexps are used for each run the test applies to.
@@ -911,7 +645,6 @@ my %full_runs = (
     binary_upgrade => 1,
     clean => 1,
     clean_if_exists => 1,
-    compression => 1,
     createdb => 1,
     defaults => 1,
     exclude_dump_test_schema => 1,
@@ -3210,31 +2943,6 @@ my %tests = (
         },
     },

-    'CREATE MATERIALIZED VIEW matview_compression' => {
-        create_order => 20,
-        create_sql => 'CREATE MATERIALIZED VIEW
-                           dump_test.matview_compression (col2) AS
-                           SELECT col2 FROM dump_test.test_table;
-                           ALTER MATERIALIZED VIEW dump_test.matview_compression
-                           ALTER COLUMN col2 SET COMPRESSION lz4;',
-        regexp => qr/^
-            \QCREATE MATERIALIZED VIEW dump_test.matview_compression AS\E
-            \n\s+\QSELECT col2\E
-            \n\s+\QFROM dump_test.test_table\E
-            \n\s+\QWITH NO DATA;\E
-            .*
-            \QALTER TABLE ONLY dump_test.matview_compression ALTER COLUMN col2 SET COMPRESSION lz4;\E\n
-            /xms,
-        lz4 => 1,
-        like =>
-          { %full_runs, %dump_test_schema_runs, section_pre_data => 1, },
-        unlike => {
-            exclude_dump_test_schema => 1,
-            no_toast_compression => 1,
-            only_dump_measurement => 1,
-        },
-    },
-
     'Check ordering of a matview that depends on a primary key' => {
         create_order => 42,
         create_sql => '
@@ -3691,51 +3399,6 @@ my %tests = (
         },
     },

-    'CREATE TABLE test_compression_method' => {
-        create_order => 110,
-        create_sql => 'CREATE TABLE dump_test.test_compression_method (
-                           col1 text
-                       );',
-        regexp => qr/^
-            \QCREATE TABLE dump_test.test_compression_method (\E\n
-            \s+\Qcol1 text\E\n
-            \Q);\E
-            /xm,
-        like => {
-            %full_runs, %dump_test_schema_runs, section_pre_data => 1,
-        },
-        unlike => {
-            exclude_dump_test_schema => 1,
-            only_dump_measurement => 1,
-        },
-    },
-
-    # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during
-    # (de)compression operations
-    'COPY test_compression_method' => {
-        create_order => 111,
-        create_sql => 'INSERT INTO dump_test.test_compression_method (col1) '
-          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
-        regexp => qr/^
-            \QCOPY dump_test.test_compression_method (col1) FROM stdin;\E
-            \n(?:\d{15277}\n){1}\\\.\n
-            /xm,
-        like => {
-            %full_runs,
-            data_only => 1,
-            no_schema => 1,
-            section_data => 1,
-            only_dump_test_schema => 1,
-            test_schema_plus_large_objects => 1,
-        },
-        unlike => {
-            binary_upgrade => 1,
-            exclude_dump_test_schema => 1,
-            schema_only => 1,
-            schema_only_with_statistics => 1,
-        },
-    },
-
     'CREATE TABLE fk_reference_test_table' => {
         create_order => 21,
         create_sql => 'CREATE TABLE dump_test.fk_reference_test_table (
@@ -3774,30 +3437,6 @@ my %tests = (
         },
     },

-    'CREATE TABLE test_compression' => {
-        create_order => 3,
-        create_sql => 'CREATE TABLE dump_test.test_compression (
-                           col1 int,
-                           col2 text COMPRESSION lz4
-                       );',
-        regexp => qr/^
-            \QCREATE TABLE dump_test.test_compression (\E\n
-            \s+\Qcol1 integer,\E\n
-            \s+\Qcol2 text\E\n
-            \);\n
-            .*
-            \QALTER TABLE ONLY dump_test.test_compression ALTER COLUMN col2 SET COMPRESSION lz4;\E\n
-            /xms,
-        lz4 => 1,
-        like =>
-          { %full_runs, %dump_test_schema_runs, section_pre_data => 1, },
-        unlike => {
-            exclude_dump_test_schema => 1,
-            no_toast_compression => 1,
-            only_dump_measurement => 1,
-        },
-    },
-
     'CREATE TABLE measurement PARTITIONED BY' => {
         create_order => 90,
         create_sql => 'CREATE TABLE dump_test.measurement (
@@ -5280,13 +4919,6 @@ foreach my $test (
             next;
         }

-        # Skip tests specific to LZ4 if this build does not support
-        # this option.
-        if (!$supports_lz4 && defined($tests{$test}->{lz4}))
-        {
-            next;
-        }
-
         # Normalize command ending: strip all line endings, add
         # semicolon if missing, add two newlines.
         my $create_sql = $tests{$test}->{create_sql};
@@ -5463,42 +5095,9 @@ foreach my $run (sort keys %pgdump_runs)
     my $test_key = $run;
     my $run_db = 'postgres';

-    # Skip command-level tests for gzip/lz4/zstd if the tool is not supported
-    if ($pgdump_runs{$run}->{compile_option}
-        && (($pgdump_runs{$run}->{compile_option} eq 'gzip'
-                && !$supports_gzip)
-            || ($pgdump_runs{$run}->{compile_option} eq 'lz4'
-                && !$supports_lz4)
-            || ($pgdump_runs{$run}->{compile_option} eq 'zstd'
-                && !$supports_zstd)))
-    {
-        note
-          "$run: skipped due to no $pgdump_runs{$run}->{compile_option} support";
-        next;
-    }
-
     $node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
         "$run: pg_dump runs");

-    if ($pgdump_runs{$run}->{compress_cmd})
-    {
-        my ($compress_cmd) = $pgdump_runs{$run}->{compress_cmd};
-        my $compress_program = $compress_cmd->{program};
-
-        # Skip the rest of the test if the compression program is
-        # not defined.
-        next if (!defined($compress_program) || $compress_program eq '');
-
-        # Arguments may require globbing.
-        my @full_compress_cmd = ($compress_program);
-        foreach my $arg (@{ $compress_cmd->{args} })
-        {
-            push @full_compress_cmd, glob($arg);
-        }
-
-        command_ok(\@full_compress_cmd, "$run: compression commands");
-    }
-
     if ($pgdump_runs{$run}->{glob_patterns})
     {
         my $glob_patterns = $pgdump_runs{$run}->{glob_patterns};
@@ -5579,13 +5178,6 @@ foreach my $run (sort keys %pgdump_runs)
             next;
         }

-        # Skip tests specific to LZ4 if this build does not support
-        # this option.
-        if (!$supports_lz4 && defined($tests{$test}->{lz4}))
-        {
-            next;
-        }
-
         if ($run_db ne $test_db)
         {
             next;
diff --git a/src/bin/pg_dump/t/006_pg_dump_compress.pl b/src/bin/pg_dump/t/006_pg_dump_compress.pl
new file mode 100644
index 00000000000..3737132645b
--- /dev/null
+++ b/src/bin/pg_dump/t/006_pg_dump_compress.pl
@@ -0,0 +1,611 @@
+
+# Copyright (c) 2021-2025, PostgreSQL Global Development Group
+
+###############################################################
+# This test script uses essentially the same structure as
+# 002_pg_dump.pl, but is specialized to deal with compression
+# concerns.  As such, some of the test cases here are large
+# and would contribute undue amounts of runtime if they were
+# included in 002_pg_dump.pl.
+###############################################################
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $tempdir = PostgreSQL::Test::Utils::tempdir;
+
+###############################################################
+# Definition of the pg_dump runs to make.
+#
+# In addition to the facilities explained in 002_pg_dump.pl,
+# these entries can include:
+#
+# compile_option indicates if the test depends on a compilation
+# option, if any.  This can be used to control if tests should be
+# skipped when a build dependency is not satisfied.
+#
+# compress_cmd is the utility command for (de)compression, if any.
+# Note that this should generally be used on pg_dump's output
+# either to generate a text file to run the through the tests, or
+# to test pg_restore's ability to parse manually compressed files
+# that otherwise pg_dump does not compress on its own (e.g. *.toc).
+
+my $supports_gzip = check_pg_config("#define HAVE_LIBZ 1");
+my $supports_lz4 = check_pg_config("#define USE_LZ4 1");
+my $supports_zstd = check_pg_config("#define USE_ZSTD 1");
+
+my %pgdump_runs = (
+    compression_gzip_custom => {
+        test_key => 'compression',
+        compile_option => 'gzip',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'custom',
+            '--compress' => '1',
+            '--file' => "$tempdir/compression_gzip_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_gzip_custom.sql",
+            '--statistics',
+            "$tempdir/compression_gzip_custom.dump",
+        ],
+        command_like => {
+            command => [
+                'pg_restore', '--list',
+                "$tempdir/compression_gzip_custom.dump",
+            ],
+            expected => qr/Compression: gzip/,
+            name => 'data content is gzip-compressed'
+        },
+    },
+
+    compression_gzip_dir => {
+        test_key => 'compression',
+        compile_option => 'gzip',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--jobs' => '2',
+            '--format' => 'directory',
+            '--compress' => 'gzip:1',
+            '--file' => "$tempdir/compression_gzip_dir",
+            '--statistics',
+            'postgres',
+        ],
+        # Give coverage for manually compressed blobs.toc files during
+        # restore.
+        compress_cmd => {
+            program => $ENV{'GZIP_PROGRAM'},
+            args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ],
+        },
+        # Verify that only data files were compressed
+        glob_patterns => [
+            "$tempdir/compression_gzip_dir/toc.dat",
+            "$tempdir/compression_gzip_dir/*.dat.gz",
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--jobs' => '2',
+            '--file' => "$tempdir/compression_gzip_dir.sql",
+            '--statistics',
+            "$tempdir/compression_gzip_dir",
+        ],
+    },
+
+    compression_gzip_plain => {
+        test_key => 'compression',
+        compile_option => 'gzip',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'plain',
+            '--compress' => '1',
+            '--file' => "$tempdir/compression_gzip_plain.sql.gz",
+            '--statistics',
+            'postgres',
+        ],
+        # Decompress the generated file to run through the tests.
+        compress_cmd => {
+            program => $ENV{'GZIP_PROGRAM'},
+            args => [ '-d', "$tempdir/compression_gzip_plain.sql.gz", ],
+        },
+    },
+
+    compression_lz4_custom => {
+        test_key => 'compression',
+        compile_option => 'lz4',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'custom',
+            '--compress' => 'lz4',
+            '--file' => "$tempdir/compression_lz4_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_lz4_custom.sql",
+            '--statistics',
+            "$tempdir/compression_lz4_custom.dump",
+        ],
+        command_like => {
+            command => [
+                'pg_restore', '--list',
+                "$tempdir/compression_lz4_custom.dump",
+            ],
+            expected => qr/Compression: lz4/,
+            name => 'data content is lz4 compressed'
+        },
+    },
+
+    compression_lz4_dir => {
+        test_key => 'compression',
+        compile_option => 'lz4',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--jobs' => '2',
+            '--format' => 'directory',
+            '--compress' => 'lz4:1',
+            '--file' => "$tempdir/compression_lz4_dir",
+            '--statistics',
+            'postgres',
+        ],
+        # Give coverage for manually compressed blobs.toc files during
+        # restore.
+        compress_cmd => {
+            program => $ENV{'LZ4'},
+            args => [
+                '-z', '-f', '-m', '--rm',
+                "$tempdir/compression_lz4_dir/blobs_*.toc",
+            ],
+        },
+        # Verify that data files were compressed
+        glob_patterns => [
+            "$tempdir/compression_lz4_dir/toc.dat",
+            "$tempdir/compression_lz4_dir/*.dat.lz4",
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--jobs' => '2',
+            '--file' => "$tempdir/compression_lz4_dir.sql",
+            '--statistics',
+            "$tempdir/compression_lz4_dir",
+        ],
+    },
+
+    compression_lz4_plain => {
+        test_key => 'compression',
+        compile_option => 'lz4',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'plain',
+            '--compress' => 'lz4',
+            '--file' => "$tempdir/compression_lz4_plain.sql.lz4",
+            '--statistics',
+            'postgres',
+        ],
+        # Decompress the generated file to run through the tests.
+        compress_cmd => {
+            program => $ENV{'LZ4'},
+            args => [
+                '-d', '-f',
+                "$tempdir/compression_lz4_plain.sql.lz4",
+                "$tempdir/compression_lz4_plain.sql",
+            ],
+        },
+    },
+
+    compression_zstd_custom => {
+        test_key => 'compression',
+        compile_option => 'zstd',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'custom',
+            '--compress' => 'zstd',
+            '--file' => "$tempdir/compression_zstd_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_zstd_custom.sql",
+            '--statistics',
+            "$tempdir/compression_zstd_custom.dump",
+        ],
+        command_like => {
+            command => [
+                'pg_restore', '--list',
+                "$tempdir/compression_zstd_custom.dump",
+            ],
+            expected => qr/Compression: zstd/,
+            name => 'data content is zstd compressed'
+        },
+    },
+
+    compression_zstd_dir => {
+        test_key => 'compression',
+        compile_option => 'zstd',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--jobs' => '2',
+            '--format' => 'directory',
+            '--compress' => 'zstd:1',
+            '--file' => "$tempdir/compression_zstd_dir",
+            '--statistics',
+            'postgres',
+        ],
+        # Give coverage for manually compressed blobs.toc files during
+        # restore.
+        compress_cmd => {
+            program => $ENV{'ZSTD'},
+            args => [
+                '-z', '-f',
+                '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc",
+            ],
+        },
+        # Verify that data files were compressed
+        glob_patterns => [
+            "$tempdir/compression_zstd_dir/toc.dat",
+            "$tempdir/compression_zstd_dir/*.dat.zst",
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--jobs' => '2',
+            '--file' => "$tempdir/compression_zstd_dir.sql",
+            '--statistics',
+            "$tempdir/compression_zstd_dir",
+        ],
+    },
+
+    # Exercise long mode for test coverage
+    compression_zstd_plain => {
+        test_key => 'compression',
+        compile_option => 'zstd',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'plain',
+            '--compress' => 'zstd:long',
+            '--file' => "$tempdir/compression_zstd_plain.sql.zst",
+            '--statistics',
+            'postgres',
+        ],
+        # Decompress the generated file to run through the tests.
+        compress_cmd => {
+            program => $ENV{'ZSTD'},
+            args => [
+                '-d', '-f',
+                "$tempdir/compression_zstd_plain.sql.zst", "-o",
+                "$tempdir/compression_zstd_plain.sql",
+            ],
+        },
+    },);
+
+###############################################################
+# Definition of the tests to run.
+#
+# In addition to the facilities explained in 002_pg_dump.pl,
+# these entries can include:
+#
+# compile_option indicates if the test depends on a compilation
+# option, if any.  This can be used to control if tests should be
+# skipped when a build dependency is not satisfied.
+
+# Tests which are considered 'full' dumps by pg_dump, but there
+# are flags used to exclude specific items (ACLs, LOs, etc).
+my %full_runs = (compression => 1,);
+
+# This is where the actual tests are defined.
+my %tests = (
+    'CREATE MATERIALIZED VIEW matview_compression_lz4' => {
+        create_order => 20,
+        create_sql => 'CREATE MATERIALIZED VIEW
+                           matview_compression_lz4 (col2) AS
+                           SELECT repeat(\'xyzzy\', 10000);
+                           ALTER MATERIALIZED VIEW matview_compression_lz4
+                           ALTER COLUMN col2 SET COMPRESSION lz4;',
+        regexp => qr/^
+            \QCREATE MATERIALIZED VIEW public.matview_compression_lz4 AS\E
+            \n\s+\QSELECT repeat('xyzzy'::text, 10000) AS col2\E
+            \n\s+\QWITH NO DATA;\E
+            .*
+            \QALTER TABLE ONLY public.matview_compression_lz4 ALTER COLUMN col2 SET COMPRESSION lz4;\E\n
+            /xms,
+        compile_option => 'lz4',
+        like => {%full_runs},
+    },
+
+    'CREATE TABLE test_compression_method' => {
+        create_order => 110,
+        create_sql => 'CREATE TABLE test_compression_method (
+                           col1 text
+                       );',
+        regexp => qr/^
+            \QCREATE TABLE public.test_compression_method (\E\n
+            \s+\Qcol1 text\E\n
+            \Q);\E
+            /xm,
+        like => { %full_runs, },
+    },
+
+    # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during
+    # (de)compression operations
+    'COPY test_compression_method' => {
+        create_order => 111,
+        create_sql => 'INSERT INTO test_compression_method (col1) '
+          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
+        regexp => qr/^
+            \QCOPY public.test_compression_method (col1) FROM stdin;\E
+            \n(?:\d{15277}\n){1}\\\.\n
+            /xm,
+        like => { %full_runs, },
+    },
+
+    'CREATE TABLE test_compression' => {
+        create_order => 3,
+        create_sql => 'CREATE TABLE test_compression (
+                           col1 int,
+                           col2 text COMPRESSION lz4
+                       );',
+        regexp => qr/^
+            \QCREATE TABLE public.test_compression (\E\n
+            \s+\Qcol1 integer,\E\n
+            \s+\Qcol2 text\E\n
+            \);\n
+            .*
+            \QALTER TABLE ONLY public.test_compression ALTER COLUMN col2 SET COMPRESSION lz4;\E\n
+            /xms,
+        compile_option => 'lz4',
+        like => {%full_runs},
+    },
+
+    # Create a large object so we can test compression of blobs.toc
+    'LO create (using lo_from_bytea)' => {
+        create_order => 50,
+        create_sql =>
+          'SELECT pg_catalog.lo_from_bytea(0, \'\\x310a320a330a340a350a360a370a380a390a\');',
+        regexp => qr/^SELECT pg_catalog\.lo_create\('\d+'\);/m,
+        like => { %full_runs, },
+    },
+
+    'LO load (using lo_from_bytea)' => {
+        regexp => qr/^
+            \QSELECT pg_catalog.lo_open\E \('\d+',\ \d+\);\n
+            \QSELECT pg_catalog.lowrite(0, \E
+            \Q'\x310a320a330a340a350a360a370a380a390a');\E\n
+            \QSELECT pg_catalog.lo_close(0);\E
+            /xm,
+        like => { %full_runs, },
+    },);
+
+#########################################
+# Create a PG instance to test actually dumping from
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->start;
+
+my $port = $node->port;
+
+#########################################
+# Set up schemas, tables, etc, to be dumped.
+
+# Build up the create statements
+my %create_sql = ();
+
+foreach my $test (
+    sort {
+        if ($tests{$a}->{create_order} and $tests{$b}->{create_order})
+        {
+            $tests{$a}->{create_order} <=> $tests{$b}->{create_order};
+        }
+        elsif ($tests{$a}->{create_order})
+        {
+            -1;
+        }
+        elsif ($tests{$b}->{create_order})
+        {
+            1;
+        }
+        else
+        {
+            0;
+        }
+    } keys %tests)
+{
+    my $test_db = 'postgres';
+
+    if (defined($tests{$test}->{database}))
+    {
+        $test_db = $tests{$test}->{database};
+    }
+
+    # Skip tests that require an unsupported compile option
+    if ($tests{$test}->{compile_option}
+        && (($tests{$test}->{compile_option} eq 'gzip' && !$supports_gzip)
+            || ($tests{$test}->{compile_option} eq 'lz4'
+                && !$supports_lz4)
+            || ($tests{$test}->{compile_option} eq 'zstd'
+                && !$supports_zstd)))
+    {
+        next;
+    }
+
+    if ($tests{$test}->{create_sql})
+    {
+        # Normalize command ending: strip all line endings, add
+        # semicolon if missing, add two newlines.
+        my $create_sql = $tests{$test}->{create_sql};
+        chomp $create_sql;
+        $create_sql .= ';' unless substr($create_sql, -1) eq ';';
+        $create_sql{$test_db} .= $create_sql . "\n\n";
+    }
+}
+
+# Send the combined set of commands to psql
+foreach my $db (sort keys %create_sql)
+{
+    $node->safe_psql($db, $create_sql{$db});
+}
+
+#########################################
+# Run all runs
+
+foreach my $run (sort keys %pgdump_runs)
+{
+    my $test_key = $run;
+    my $run_db = 'postgres';
+
+    # Skip runs that require an unsupported compile option
+    if ($pgdump_runs{$run}->{compile_option}
+        && (($pgdump_runs{$run}->{compile_option} eq 'gzip'
+                && !$supports_gzip)
+            || ($pgdump_runs{$run}->{compile_option} eq 'lz4'
+                && !$supports_lz4)
+            || ($pgdump_runs{$run}->{compile_option} eq 'zstd'
+                && !$supports_zstd)))
+    {
+        note
+          "$run: skipped due to no $pgdump_runs{$run}->{compile_option} support";
+        next;
+    }
+
+    $node->command_ok(\@{ $pgdump_runs{$run}->{dump_cmd} },
+        "$run: pg_dump runs");
+
+    if ($pgdump_runs{$run}->{compress_cmd})
+    {
+        my ($compress_cmd) = $pgdump_runs{$run}->{compress_cmd};
+        my $compress_program = $compress_cmd->{program};
+
+        # Skip the rest of the test if the compression program is
+        # not defined.
+        next if (!defined($compress_program) || $compress_program eq '');
+
+        # Arguments may require globbing.
+        my @full_compress_cmd = ($compress_program);
+        foreach my $arg (@{ $compress_cmd->{args} })
+        {
+            push @full_compress_cmd, glob($arg);
+        }
+
+        command_ok(\@full_compress_cmd, "$run: compression commands");
+    }
+
+    if ($pgdump_runs{$run}->{glob_patterns})
+    {
+        my $glob_patterns = $pgdump_runs{$run}->{glob_patterns};
+        foreach my $glob_pattern (@{$glob_patterns})
+        {
+            my @glob_output = glob($glob_pattern);
+            is(scalar(@glob_output) > 0,
+                1, "$run: glob check for $glob_pattern");
+        }
+    }
+
+    if ($pgdump_runs{$run}->{command_like})
+    {
+        my $cmd_like = $pgdump_runs{$run}->{command_like};
+        $node->command_like(
+            \@{ $cmd_like->{command} },
+            $cmd_like->{expected},
+            "$run: " . $cmd_like->{name});
+    }
+
+    if ($pgdump_runs{$run}->{restore_cmd})
+    {
+        $node->command_ok(\@{ $pgdump_runs{$run}->{restore_cmd} },
+            "$run: pg_restore runs");
+    }
+
+    if ($pgdump_runs{$run}->{test_key})
+    {
+        $test_key = $pgdump_runs{$run}->{test_key};
+    }
+
+    my $output_file = slurp_file("$tempdir/${run}.sql");
+
+    #########################################
+    # Run all tests where this run is included
+    # as either a 'like' or 'unlike' test.
+
+    foreach my $test (sort keys %tests)
+    {
+        my $test_db = 'postgres';
+
+        if (defined($pgdump_runs{$run}->{database}))
+        {
+            $run_db = $pgdump_runs{$run}->{database};
+        }
+
+        if (defined($tests{$test}->{database}))
+        {
+            $test_db = $tests{$test}->{database};
+        }
+
+        # Check for proper test definitions
+        #
+        # Either "all_runs" should be set or there should be a "like" list,
+        # even if it is empty.  (This makes the test more self-documenting.)
+        if (   !defined($tests{$test}->{all_runs})
+            && !defined($tests{$test}->{like}))
+        {
+            die "missing \"like\" in test \"$test\"";
+        }
+        # Check for useless entries in "unlike" list.  Runs that are
+        # not listed in "like" don't need to be excluded in "unlike".
+        if ($tests{$test}->{unlike}->{$test_key}
+            && !defined($tests{$test}->{like}->{$test_key}))
+        {
+            die "useless \"unlike\" entry \"$test_key\" in test \"$test\"";
+        }
+
+        # Skip tests that require an unsupported compile option
+        if ($tests{$test}->{compile_option}
+            && (($tests{$test}->{compile_option} eq 'gzip' && !$supports_gzip)
+                || ($tests{$test}->{compile_option} eq 'lz4'
+                    && !$supports_lz4)
+                || ($tests{$test}->{compile_option} eq 'zstd'
+                    && !$supports_zstd)))
+        {
+            next;
+        }
+
+        if ($run_db ne $test_db)
+        {
+            next;
+        }
+
+        # Run the test if all_runs is set or if listed as a like, unless it is
+        # specifically noted as an unlike (generally due to an explicit
+        # exclusion or similar).
+        if (($tests{$test}->{like}->{$test_key} || $tests{$test}->{all_runs})
+            && !defined($tests{$test}->{unlike}->{$test_key}))
+        {
+            if (!ok($output_file =~ $tests{$test}->{regexp},
+                    "$run: should dump $test"))
+            {
+                diag("Review $run results in $tempdir");
+            }
+        }
+        else
+        {
+            if (!ok($output_file !~ $tests{$test}->{regexp},
+                    "$run: should not dump $test"))
+            {
+                diag("Review $run results in $tempdir");
+            }
+        }
+    }
+}
+
+#########################################
+# Stop the database instance, which will be removed at the end of the tests.
+
+$node->stop('fast');
+
+done_testing();
--
2.43.7

From 42b2728fece03fc4cc5a14fa7d9081e04ec54e18 Mon Sep 17 00:00:00 2001
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Wed, 15 Oct 2025 15:12:22 -0400
Subject: [PATCH v5 3/3] Add more TAP test coverage for pg_dump.

Add a test case to cover pg_dump with --compress=none.  This
brings the coverage of compress_none.c up from about 64% to 90%,
in particular covering the new code added in a previous patch.

Include compression of toc.dat in manually-compressed test cases.
We would have found the bug fixed in commit a239c4a0c much sooner
if we'd done this.  As far as I can tell, this doesn't reduce test
coverage at all, since there are other tests of directory format
that still use an uncompressed toc.dat.

Widen the wide row used to verify correct (de) compression.
Commit 1a05c1d25 advises us (not without reason) to ensure that
this test case fully fills DEFAULT_IO_BUFFER_SIZE, so that loops
within the compression logic will iterate completely.  To follow
that advice with the proposed DEFAULT_IO_BUFFER_SIZE of 128K,
we need something close to this.  This does indeed increase the
reported code coverage by a few lines.

While here, fix a glitch that I noticed in testing: the
$glob_patterns tests were incapable of failing, because glob()
will return 'foo' as 'foo' whether there is a matching file or
not.  (Indeed, the stanza just above that one relies on that.)

In my testing, this patch adds approximately as much runtime
as was saved by the previous patch, so that it's about a wash
compared to the old code.  However, we get better test coverage.

Author: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/3515357.1760128017@sss.pgh.pa.us
---
 src/bin/pg_dump/t/002_pg_dump.pl          |  8 ++-
 src/bin/pg_dump/t/006_pg_dump_compress.pl | 66 ++++++++++++++++-------
 2 files changed, 52 insertions(+), 22 deletions(-)

diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index b789cd2e863..0e94d8151b7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -5104,8 +5104,12 @@ foreach my $run (sort keys %pgdump_runs)
         foreach my $glob_pattern (@{$glob_patterns})
         {
             my @glob_output = glob($glob_pattern);
-            is(scalar(@glob_output) > 0,
-                1, "$run: glob check for $glob_pattern");
+            my $ok = 0;
+            # certainly found some files if glob() returned multiple matches
+            $ok = 1 if (scalar(@glob_output) > 1);
+            # if just one match, we need to check if it's real
+            $ok = 1 if (scalar(@glob_output) == 1 && -f $glob_output[0]);
+            is($ok, 1, "$run: glob check for $glob_pattern");
         }
     }

diff --git a/src/bin/pg_dump/t/006_pg_dump_compress.pl b/src/bin/pg_dump/t/006_pg_dump_compress.pl
index 3737132645b..b429df19a4d 100644
--- a/src/bin/pg_dump/t/006_pg_dump_compress.pl
+++ b/src/bin/pg_dump/t/006_pg_dump_compress.pl
@@ -39,6 +39,24 @@ my $supports_lz4 = check_pg_config("#define USE_LZ4 1");
 my $supports_zstd = check_pg_config("#define USE_ZSTD 1");

 my %pgdump_runs = (
+    compression_none_custom => {
+        test_key => 'compression',
+        dump_cmd => [
+            'pg_dump', '--no-sync',
+            '--format' => 'custom',
+            '--compress' => 'none',
+            '--file' => "$tempdir/compression_none_custom.dump",
+            '--statistics',
+            'postgres',
+        ],
+        restore_cmd => [
+            'pg_restore',
+            '--file' => "$tempdir/compression_none_custom.sql",
+            '--statistics',
+            "$tempdir/compression_none_custom.dump",
+        ],
+    },
+
     compression_gzip_custom => {
         test_key => 'compression',
         compile_option => 'gzip',
@@ -78,15 +96,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'GZIP_PROGRAM'},
-            args => [ '-f', "$tempdir/compression_gzip_dir/blobs_*.toc", ],
+            args => [
+                '-f',
+                "$tempdir/compression_gzip_dir/toc.dat",
+                "$tempdir/compression_gzip_dir/blobs_*.toc",
+            ],
         },
-        # Verify that only data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_gzip_dir/toc.dat",
+            "$tempdir/compression_gzip_dir/toc.dat.gz",
             "$tempdir/compression_gzip_dir/*.dat.gz",
         ],
         restore_cmd => [
@@ -155,18 +176,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'LZ4'},
             args => [
                 '-z', '-f', '-m', '--rm',
+                "$tempdir/compression_lz4_dir/toc.dat",
                 "$tempdir/compression_lz4_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_lz4_dir/toc.dat",
+            "$tempdir/compression_lz4_dir/toc.dat.lz4",
             "$tempdir/compression_lz4_dir/*.dat.lz4",
         ],
         restore_cmd => [
@@ -239,18 +260,18 @@ my %pgdump_runs = (
             '--statistics',
             'postgres',
         ],
-        # Give coverage for manually compressed blobs.toc files during
-        # restore.
+        # Give coverage for manually-compressed TOC files during restore.
         compress_cmd => {
             program => $ENV{'ZSTD'},
             args => [
-                '-z', '-f',
-                '--rm', "$tempdir/compression_zstd_dir/blobs_*.toc",
+                '-z', '-f', '--rm',
+                "$tempdir/compression_zstd_dir/toc.dat",
+                "$tempdir/compression_zstd_dir/blobs_*.toc",
             ],
         },
-        # Verify that data files were compressed
+        # Verify that TOC and data files were compressed
         glob_patterns => [
-            "$tempdir/compression_zstd_dir/toc.dat",
+            "$tempdir/compression_zstd_dir/toc.dat.zst",
             "$tempdir/compression_zstd_dir/*.dat.zst",
         ],
         restore_cmd => [
@@ -333,14 +354,15 @@ my %tests = (
     },

     # Insert enough data to surpass DEFAULT_IO_BUFFER_SIZE during
-    # (de)compression operations
+    # (de)compression operations.  The weird regex is because Perl
+    # restricts us to repeat counts of less than 32K.
     'COPY test_compression_method' => {
         create_order => 111,
         create_sql => 'INSERT INTO test_compression_method (col1) '
-          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,4096) a;',
+          . 'SELECT string_agg(a::text, \'\') FROM generate_series(1,65536) a;',
         regexp => qr/^
             \QCOPY public.test_compression_method (col1) FROM stdin;\E
-            \n(?:\d{15277}\n){1}\\\.\n
+            \n(?:(?:\d\d\d\d\d\d\d\d\d\d){31657}\d\d\d\d\n){1}\\\.\n
             /xm,
         like => { %full_runs, },
     },
@@ -502,8 +524,12 @@ foreach my $run (sort keys %pgdump_runs)
         foreach my $glob_pattern (@{$glob_patterns})
         {
             my @glob_output = glob($glob_pattern);
-            is(scalar(@glob_output) > 0,
-                1, "$run: glob check for $glob_pattern");
+            my $ok = 0;
+            # certainly found some files if glob() returned multiple matches
+            $ok = 1 if (scalar(@glob_output) > 1);
+            # if just one match, we need to check if it's real
+            $ok = 1 if (scalar(@glob_output) == 1 && -f $glob_output[0]);
+            is($ok, 1, "$run: glob check for $glob_pattern");
         }
     }

--
2.43.7


I wrote:
> I think this is more or less committable, and then we could get
> back to the original question of whether it's worth tweaking
> pg_restore's seek-vs-scan behavior.

And done.  Dimitrios, could you re-do your testing against current
HEAD, and see if there's still a benefit to tweaking pg_restore's
seek-vs-read decisions, and if so what's the best number?

            regards, tom lane



Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
Thank you for your work on this, Tom.
I'll try to test it over the weekend.

Dimitris

On 16 October 2025 19:01:10 CEST, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>I wrote:
>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
>And done.  Dimitrios, could you re-do your testing against current
>HEAD, and see if there's still a benefit to tweaking pg_restore's
>seek-vs-read decisions, and if so what's the best number?
>
>            regards, tom lane



[PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
On Thursday 2025-10-16 19:01, Tom Lane wrote:

>> I think this is more or less committable, and then we could get
>> back to the original question of whether it's worth tweaking
>> pg_restore's seek-vs-scan behavior.
>
> And done.  Dimitrios, could you re-do your testing against current
> HEAD, and see if there's still a benefit to tweaking pg_restore's
> seek-vs-read decisions, and if so what's the best number?

Sorry for the delay, I hadn't realized I needed to generate a new
database dump using the current HEAD. So I did that, using
--compress=none and storing it on a compressed btrfs filesystem, since
that's my primary use case.

I notice that things have improved immensely!
Using the test you suggested (see NOTE1):

     pg_restore -t last_table -f /dev/null  huge.pg_dump


1. The strace output is much more reasonable now; basically it's
    repeating the pattern

        read(4k)
        lseek(~128k forward)

   As a reminder, with old archives it was repeating the pattern:

        read(4k)
        lseek(4k forward)
        lseek(same offset as above) x ~80 times

2. The IO speed is better than before:

       On my 20TB HDD I get 30-50 MB/s read rate.

       With old archives I get 10-20 MB/s read rate.

3. Time to complete: ~25 min

4. CPU usage is low. With old archives the pg_restore process shows
    high *system* CPU (because of the amount of syscalls).


I can't really compare the actual runtime between old and new dump,
because the two dumps are very different. But I have no doubt the new
dump is several times faster to seek through.


NOTE1: My original testcase was

           pg_restore -t last_table -j $NCPU -d testdb

        This testcase does not show as big an improvement,
        because every single one of the parallel workers is
        concurrently seeking through the dump file.



*** All of the above was measured on master branch HEAD ***
277dec6514728e2d0d87c1279dd5e0afbf897428
Don't rely on zlib's gzgetc() macro.

*** Below I have applied the attached patch ***


Regarding the attached patch (rebased and edited commit message), it
basically replaces seek(up to 1MB forward) with read(). The 1MB number
comes a bit off the top of my head. But tweaking it between 128KB and
1MB wouldn't really change anything, given that the block size is now
128KB: the read() will always be chosen over the seek(). Do you know
of a real-world case with block sizes >128KB?
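
For anyone skimming the thread, the core of the change is roughly the
sketch below. It is illustrative only, not the submitted patch: the
function name and the threshold macro are made up here, and error
handling is kept minimal.

    #include <stdio.h>
    #include <sys/types.h>

    /* Illustrative threshold; 1MB is the value discussed in this thread,
     * later messages argue for values down to 4KB. */
    #define FORWARD_READ_THRESHOLD  (1024 * 1024)

    /*
     * Advance the archive stream by 'distance' bytes.  Short forward jumps
     * are consumed with fread() so the kernel sees a purely sequential
     * access pattern; long jumps still use fseeko().
     */
    static int
    skip_forward(FILE *fp, off_t distance)
    {
        if (distance > FORWARD_READ_THRESHOLD)
            return fseeko(fp, distance, SEEK_CUR);

        while (distance > 0)
        {
            char    junk[4096];
            size_t  chunk = (distance < (off_t) sizeof(junk))
                            ? (size_t) distance : sizeof(junk);

            if (fread(junk, 1, chunk, fp) != chunk)
                return -1;      /* short read or I/O error */
            distance -= (off_t) chunk;
        }
        return 0;
    }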

Anyway I tried it with the new archive from above.


1. strace output is a loop of the following:

         read(4k)
         read(~128k)

2. Read rate is between 150-250MB/s, basically the max that the HDD can give.

3. Time to complete: ~5 min

4. CPU usage: HIGH (63%), most likely because of the sheer amount
    of data it's parsing.


Regards,
Dimitris

Attachments

Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:

  WriteDataToArchiveNone(ArchiveHandle *AH, CompressorState *cs,
                                            const void *data, size_t dLen)
  {
-       cs->writeF(AH, data, dLen);
+       NoneCompressorState *nonecs = (NoneCompressorState *) cs->private_data;
+       size_t          remaining = dLen;
+
+       while (remaining > 0)
+       {
+               size_t          chunk;
+
+               /* Dump buffer if full */
+               if (nonecs->bufdata >= nonecs->buflen)

Shouldn't this be an equality check instead:
     if (nonecs->bufdata == nonecs->buflen)

And possibly also assert(nonecs->bufdata <= nonecs->buflen) ?

+               {
+                       cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
+                       nonecs->bufdata = 0;
+               }
+               /* And fill it */
+               chunk = nonecs->buflen - nonecs->bufdata;
+               if (chunk > remaining)
+                       chunk = remaining;
+               memcpy(nonecs->buffer + nonecs->bufdata, data, chunk);
+               nonecs->bufdata += chunk;
+               data = ((const char *) data) + chunk;
+               remaining -= chunk;
+       }
  }




Re: [PING] [PATCH v2] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
On Wednesday 2025-10-15 21:21, Tom Lane wrote:

>> 0004 increases the row width in the existing test case that says
>> it's trying to push more than DEFAULT_IO_BUFFER_SIZE through
>> the compressors.  While I agree with the premise, this solution
>> is hugely expensive: it adds about 12% to the already-long runtime
>> of 002_pg_dump.pl.  I'd like to find a better way, but ran out of
>> energy for today.  (I think the reason this costs so much is that
>> it's effectively iterated hundreds of times because of
>> 002_pg_dump.pl's more or less cross-product approach to testing
>> everything.  Maybe we should pull it out of that structure?)
>
> The attached patchset accomplishes that by splitting 002_pg_dump.pl
> into two scripts, one that is just concerned with the compression
> test cases and one that does everything else.  This might not be
> the prettiest solution, since it duplicates a lot of perl code.
> I thought about refactoring 002_pg_dump.pl so that it could handle
> two separate sets of runs-plus-tests, but decided it was overly
> complicated already.
>
> Anyway, 0001 attached is the same as in v4, 0002 performs the
> test split without intending to change coverage, and then 0003
> adds the new test cases I wanted.  For me, this ends up with
> just about the same runtime as before, or maybe a smidge less.
> I'd hoped for possibly more savings than that, but I'm content
> with it being a wash.
>
> I think this is more or less committable, and then we could get
> back to the original question of whether it's worth tweaking
> pg_restore's seek-vs-scan behavior.


Hi Tom, since you are dealing with pg_restore testing, you might want to
have a look at the 2nd patch from here:

https://www.postgresql.org/message-id/413c1cd8-1d6d-90ba-ac7b-b226a4dad5ed%40gmx.net

Direct link to the patch is:

https://www.postgresql.org/message-id/attachment/177661/v3-0002-Add-new-test-file-with-pg_restore-test-cases.patch


It's a much shorter test, focused on pg_restore.

1. It generates two custom-format dumps (with-TOC and TOC-less).

2. Restores each dump to an empty database using pg_restore with
    a couple of switch combinations.
    (One combination, --clean --data-only, will not work without a patch
     of mine, so we might want to remove that and enrich with others.)

3. Tests pg_restore over a pre-existing database

4. Tests pg_restore reading file from stdin.


Regards,
Dimitris




Dimitrios Apostolou <jimis@gmx.net> writes:
> Regarding the attached patch (rebased and edited commit message), it
> basically replaces seek(up to 1MB forward) with read(). The 1MB number
> comes a bit off the top of my head. But tweaking it between 128KB and
> 1MB wouldn't really change anything, given that the block size is now
> 128KB: the read() will always be chosen over the seek(). Do you know
> of a real-world case with block sizes >128KB?

Yeah, with the recent changes I'd expect table data to pretty much
always consist of blocks around 128K, unless the table is smaller
than that of course.

I experimented with this patch locally and came away not too
impressed; it seems the results may be highly platform-dependent.

In the interests of having a common benchmark case that's easy to
replicate, here's precisely what I did:

Use non-assert build of current HEAD (4bea91f21 at the moment).

$ createdb bench
$ time pgbench -i -s 10000 bench

real    14m40.474s
user    1m26.717s
sys     0m5.045s
$ psql bench
...
bench=# create table zedtable(f1 int);
CREATE TABLE
bench=# insert into zedtable values(42);
INSERT 0 1
bench=# \q
$ time pg_dump -Fc --compress=none bench | cat >bench10000.dump

real    7m48.969s
user    0m36.334s
sys     1m35.209s

(At this -s value, the database occupies about 147G and the dump
file about 95G.  It's important the dump file not fit in RAM.)

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real    1m12.646s
user    0m0.355s
sys     0m5.083s

This compares rather favorably to "cat":

$ time cat bench10000.dump >/dev/null

real    3m6.627s
user    0m0.167s
sys     0m30.999s

I then applied your patch and repeated the restore run:

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real    2m39.138s
user    0m0.386s
sys     0m28.493s

So for me, the proposed patch actually makes it 2X slower.

Watching it with "iostat 1", I'm seeing about 40MB/s disk read
rate with HEAD, and 500MB/s with the patch; "cat" also shows
read rate around 500MB/s.  So yeah, we can saturate the disk
interface by doing all reads and no seeks, but that doesn't
net out faster.

I did this on a few-years-old Dell Precision 5820 workstation.
The specs for it are a bit vague about the disk subsystem:
  Storage Drive Controllers
  Integrated Intel AHCI SATA chipset controller (8x 6.0Gb/s), SW RAID 0,1,5,10
  Storage Drive
  2.5 1.92TB SATA AG Enterprise Solid State Drive
and hdparm isn't enormously helpful either:

ATA device, with non-removable media
    Model Number:       SSDSC2KB019T8R
    Serial Number:      PHYF1291017A1P9DGN
    Firmware Revision:  XCV1DL69
    Media Serial Num:
    Media Manufacturer:
    Transport:          Serial, ATA8-AST, SATA 1.0a, SATA II Extensions, SATA Rev 2.5, SATA Rev 2.6, SATA Rev 3.0
Standards:
    Used: unknown (minor revision code 0x006d)
    Supported: 10 9 8 7 6 5
    Likely used: 10

I'm running RHEL 8.10, file system is xfs.

So I find this a bit discouraging.  It's not clear why you're seeing a
win and I'm not, and it's even less clear whether there'd be enough
of a win across enough platforms to make it worth trying to engineer
a solution that helps many more people than it hurts.

            regards, tom lane



Dimitrios Apostolou <jimis@gmx.net> writes:
> +               /* Dump buffer if full */
> +               if (nonecs->bufdata >= nonecs->buflen)

> Shouldn't this be an equality check instead:
>      if (nonecs->bufdata == nonecs->buflen)

Old defensive-programming habit.  The invariant we want to establish
is that there's some space available, ie
    nonecs->bufdata < nonecs->buflen
and if we just test for equality then we haven't proven that.
Agreed that bufdata shouldn't ever be greater than buflen, but if
it somehow is, an equality test here would contribute to making
things worse (writing ever further past the buffer) not better.

> And possibly also assert(nonecs->bufdata <= nonecs->buflen) ?

Maybe, but that code is simple enough that I didn't see a big
need for an assertion check.
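
Spelled out, the pattern under discussion is roughly the following
(a sketch, not the committed code):

    Assert(nonecs->bufdata <= nonecs->buflen);
    if (nonecs->bufdata >= nonecs->buflen)
    {
        /* buffer full (or, defensively, somehow overfull): flush it */
        cs->writeF(AH, nonecs->buffer, nonecs->bufdata);
        nonecs->bufdata = 0;
    }
    /* from here on nonecs->bufdata < nonecs->buflen, so space is free */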

            regards, tom lane



Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
Thanks for the extensive testing! Did you see the same syscall pattern in strace output, as I did?

If yes, then the only reason I can think of that explains the regression with
my patch is that the SATA interface was maxed out when reading sequentially,
while the very short latency of SSDs allows thousands of seek() operations
per second.

I was using an HDD, and in older measurements I was using a VM with a volume
mounted over iSCSI. The first imposes physical limits on the number of seeks,
and the second imposes network round-trip limits.

So you are right, it's probably very platform-dependent, and the most
important fix was to enlarge the underlying block size, which you have done.


Dimitris



I wrote:
> So for me, the proposed patch actually makes it 2X slower.

I went and tried this same test case on a 2024 Mac Mini M4 Pro.
Cutting to the chase:

HEAD:

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real    1m26.525s
user    0m0.364s
sys     0m6.806s

Patched:

$ time pg_restore -f /dev/null -t zedtable bench10000.dump

real    0m15.419s
user    0m0.279s
sys     0m8.224s

So on this hardware it *does* win (although maybe things would
be different for a parallel restore).  The patched pg_restore
takes just about the same amount of time as "cat", and iostat
shows both of them reaching a bit more than 6GB/s read speed.

My feeling at this point is that we'd probably drop the block
size test as irrelevant, and instead simply ignore ctx->hasSeek
within this loop if we think we're on a platform where that's
the right thing.  But how do we figure that out?

Not sure where we go from here, but clearly a bunch of research
is going to be needed to decide whether this is committable.

            regards, tom lane



Dimitrios Apostolou <jimis@gmx.net> writes:
> Thanks for the extensive testing! Did you see the same syscall pattern in strace output, as I did?

Yes, I did look at that, and it's the same as you saw:
HEAD repeats

        read(4k)
        lseek(~128k forward)

which is to be expected if we have to read data block headers
that are ~128K apart; while patched repeats

         read(4k)
         read(~128k)

which is a bit odd in itself, why isn't it merging the reads better?

> I was using an HDD,

Ah.  Your original message mentioned NVMe so I was assuming you
were also looking at solid-state drives.  I can imagine that
seeking is more painful on HDDs ...

            regards, tom lane



Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
On Tuesday 2025-10-21 00:23, Tom Lane wrote:

> HEAD repeats
>
>        read(4k)
>        lseek(~128k forward)
>
> which is to be expected if we have to read data block headers
> that are ~128K apart; while patched repeats
>
>         read(4k)
>         read(~128k)
>
> which is a bit odd in itself, why isn't it merging the reads better?

The read(4k) happens because of the getc() calls that read the next
block's length.

As noticed in a message above [1], glibc seems to do 4KB buffering by
default, for some reason. setvbuf() can mitigate this.

[1] https://www.postgresql.org/message-id/1po8os49-r63o-2923-p37n-12698o1qn7p0%40tzk.arg

I'm attaching a patch that sets glibc buffering to 1MB just as a proof
of concept. It's obviously WIP: it allocates and never frees. :-)
Feel free to pick it up and change it as you see fit.
With this patch, the read() calls are merged in the strace output. The
lseek() calls remain, even when no new data needs to be read.
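
In code terms the proof of concept amounts to something like the sketch
below (illustrative only: the function name and the exact call site are
not from the attached patch):

    #include <stdio.h>
    #include <stdlib.h>

    /* 1MB as in the proof of concept; the size is an arbitrary choice. */
    #define ARCHIVE_STDIO_BUFSIZE   (1024 * 1024)

    /*
     * Enlarge the stdio buffer of the archive stream.  Must run right
     * after fopen(), before anything is read from the stream.  The buffer
     * is intentionally leaked, mirroring the WIP caveat above.
     */
    static void
    use_large_stdio_buffer(FILE *fp)
    {
        char   *buf = malloc(ARCHIVE_STDIO_BUFSIZE);

        if (buf == NULL || setvbuf(fp, buf, _IOFBF, ARCHIVE_STDIO_BUFSIZE) != 0)
            free(buf);          /* fall back to the default (typically 4KB) buffer */
    }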

It seems to me that glibc could implement an optimisation for fseeko():
store the current position in the file, and do not issue the lseek()
system call if the position does not change.
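
As a purely hypothetical user-space illustration of that idea (it
assumes every read also goes through the wrapper, so that the cached
offset stays accurate):

    #include <stdio.h>
    #include <sys/types.h>

    typedef struct
    {
        FILE   *fp;
        off_t   pos;            /* cached file position */
    } TrackedFile;

    /* Read wrapper keeps the cached position in sync. */
    static size_t
    tracked_read(TrackedFile *tf, void *buf, size_t len)
    {
        size_t  n = fread(buf, 1, len, tf->fp);

        tf->pos += (off_t) n;
        return n;
    }

    /* Seek wrapper skips the syscall when the target is where we already are. */
    static int
    tracked_seek(TrackedFile *tf, off_t target)
    {
        if (target == tf->pos)
            return 0;           /* no lseek() issued at all */
        if (fseeko(tf->fp, target, SEEK_SET) != 0)
            return -1;
        tf->pos = target;
        return 0;
    }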


>> I was using an HDD,
>
> Ah.  Your original message mentioned NVMe so I was assuming you
> were also looking at solid-state drives.  I can imagine that
> seeking is more painful on HDDs ...

Sorry for the confusion, in all this time I've run tests on too many
different hardware combinations. Not the best way to draw conclusions,
but it's what I had available each time.


Dimitris

Attachments

Re: [PATCH v4] parallel pg_restore: avoid disk seeks when jumping short distance forward

From
Dimitrios Apostolou
Date:
On Tuesday 2025-10-21 00:15, Tom Lane wrote:

>> So for me, the proposed patch actually makes it 2X slower.
>
> I went and tried this same test case on a 2024 Mac Mini M4 Pro.
> Cutting to the chase:
>
> HEAD:
>
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
>
> real    1m26.525s
> user    0m0.364s
> sys     0m6.806s
>
> Patched:
>
> $ time pg_restore -f /dev/null -t zedtable bench10000.dump
>
> real    0m15.419s
> user    0m0.279s
> sys     0m8.224s
>
> So on this hardware it *does* win (although maybe things would
> be different for a parallel restore).  The patched pg_restore
> takes just about the same amount of time as "cat", and iostat
> shows both of them reaching a bit more than 6GB/s read speed.
>
> My feeling at this point is that we'd probably drop the block
> size test as irrelevant, and instead simply ignore ctx->hasSeek
> within this loop if we think we're on a platform where that's
> the right thing.  But how do we figure that out?
>
> Not sure where we go from here, but clearly a bunch of research
> is going to be needed to decide whether this is committable.

pg_dump files from before your latest fix still exist, and they possibly
contain a block header every 30 bytes (or however wide the table rows are).
A patch in pg_restore would vastly improve this use case.

May I suggest the attached patch, which replaces fseeko() with fread()
if the distance is 32KB or less? Sounds rather improbable that this
would make things worse, but maybe it's possible to generate a dump file
with 32KB wide rows, and try restoring on various hardware?

If this too is controversial, then we can reduce the number to 4KB. This
is the buffering that glibc does internally. By using the same in the
given patch, we avoid all the lseek(same-offset) repetitions between the
4K reads. This should be a strict gain, with no downsides.



Dimitris

Attachments
Dimitrios Apostolou <jimis@gmx.net> writes:
> On Tuesday 2025-10-21 00:15, Tom Lane wrote:
>> Not sure where we go from here, but clearly a bunch of research
>> is going to be needed to decide whether this is committable.

> pg_dump files from before your latest fix still exist, and they possibly
> contain a block header every 30 bytes (or however wide the table rows are).
> A patch in pg_restore would vastly improve this use case.

Yes, that's a fair case to worry about.

> May I suggest the attached patch, which replaces fseeko() with fread() 
> if the distance is 32KB or less? Sounds rather improbable that this 
> would make things worse, but maybe it's possible to generate a dump file 
> with 32KB wide rows, and try restoring on various hardware?
> If this too is controversial, then we can reduce the number to 4KB. This 
> is the buffering that glibc does internally. By using the same in the 
> given patch, we avoid all the lseek(same-offset) repetitions between the 
> 4K reads. This should be a strict gain, with no downsides.

I spent some time strace'ing pg_restore on older dump files with
relatively small block sizes.  You are right that glibc seems quite
stupid about this: it appears to issue a kernel lseek() call for every
fseeko(), even though it keeps using the same 4K worth of data it had
previously read in.  I thought maybe that was an artifact of the
relatively old glibc in RHEL8, but glibc 2.40 from Fedora 41 is no
better.  I also checked current macOS, and it's marginally smarter:
it issues just one seek call per read call.  But that seek is still
useless, since the read is still 4K adjacent to the previous 4K.

(I wonder if maybe there is some POSIX standards compliance issue
involved here?  We don't care about the possibility that the disk
file is changing under us, but other applications do, and maybe
the kernel-visible seek calls are required for some reason.)

Putting in a minimum-block-size-to-seek check gets rid of the seek
calls entirely, producing a straight stream of reads, on all three
platforms.  I agree that doing this with a threshold of 4K seems
like a no-brainer, since that seems to be the common default stdio
buffer size.  Using a larger threshold, or setting up a larger buffer,
seems to risk platform-dependent results, so it would require more
performance testing than I care to do now.  Let's do 4K and call it
a day.

            regards, tom lane