Re: Add LZ4 compression in pg_dump

From Justin Pryzby
Subject Re: Add LZ4 compression in pg_dump
Date
Msg-id 20230227044910.GO1653@telsasoft.com
In response to Re: Add LZ4 compression in pg_dump  (Justin Pryzby <pryzby@telsasoft.com>)
Responses Re: Add LZ4 compression in pg_dump  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On Sat, Feb 25, 2023 at 08:05:53AM -0600, Justin Pryzby wrote:
> On Fri, Feb 24, 2023 at 11:02:14PM -0600, Justin Pryzby wrote:
> > I have some fixes (attached) and questions while polishing the patch for
> > zstd compression.  The fixes are small and could be integrated with the
> > patch for zstd, but could be applied independently.
> 
> One more - WriteDataToArchiveGzip() says:

One more again.

The LZ4 path is using non-streaming mode, which compresses each block
without persistent state, giving poor compression for -Fc compared with
-Fp.  If the data is highly compressible, the difference can be orders
of magnitude.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fp |wc -c
12351763
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
21890708
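
To illustrate why per-block compression without persistent state hurts, here is a minimal sketch in Python.  It uses the standard-library zlib module as a stand-in for LZ4 (an assumption made purely so the example runs anywhere; it is not pg_dump's code), and the rows are made-up sample data.  It compares compressing each row in isolation with feeding the same rows through one persistent compressor.

```python
import zlib

# Hypothetical sample data: many short, similar rows, roughly like COPY output.
rows = [f"{i}\tcustomer_{i % 50}\tactive\n".encode() for i in range(10_000)]


def compress_isolated(rows):
    """Compress each row on its own; no history is shared between rows."""
    return [zlib.compress(row) for row in rows]


def compress_streaming(rows):
    """Feed every row to one compressor so later rows can reference earlier ones."""
    comp = zlib.compressobj()
    out = [comp.compress(row) for row in rows]
    out.append(comp.flush())
    return b"".join(out)


raw_size = sum(len(r) for r in rows)
isolated_size = sum(len(c) for c in compress_isolated(rows))
streaming_size = len(compress_streaming(rows))

print(f"raw={raw_size} isolated={isolated_size} streaming={streaming_size}")
```

The exact sizes depend on the data and the compressor, but the isolated variant cannot exploit redundancy across rows, which is the same class of problem as described above.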

That's not true for gzip:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fc |wc -c
2118869
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z gzip -Fp |wc -c
2115832

The function ought to at least use streaming mode, so each block/row
isn't compressed in isolation.  003 is a simple patch to use
streaming mode, which improves the -Fc case:

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -Z lz4 -Fc |wc -c
15178283

However, that still flushes the compression buffer, writing a block
header, for every row.  With a single-column table, pg_dump -Fc -Z lz4
still outputs ~10% *more* data than with no compression at all.  And
that's for compressible data.

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z lz4 |wc -c
12890296
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Fc -Z none |wc -c
11890296
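
The cost of flushing after every row can be sketched in Python, again with zlib standing in for LZ4 (an assumption; zlib's sync-flush marker is not the same thing as an LZ4 block header, but it is the same kind of fixed per-flush overhead).  The single-column rows are hypothetical.

```python
import zlib

# Hypothetical single-column table: one short value per row.
rows = [f"{i}\n".encode() for i in range(10_000)]


def compress_flush_every_row(rows):
    """Streaming compressor, but the output is forced out after each row."""
    comp = zlib.compressobj()
    out = []
    for row in rows:
        out.append(comp.compress(row))
        # A sync flush ends the current block and emits a marker every time.
        out.append(comp.flush(zlib.Z_SYNC_FLUSH))
    out.append(comp.flush())
    return b"".join(out)


def compress_flush_once(rows):
    """Streaming compressor that is only flushed at the very end."""
    comp = zlib.compressobj()
    out = [comp.compress(row) for row in rows]
    out.append(comp.flush())
    return b"".join(out)


raw_size = sum(len(r) for r in rows)
per_row_size = len(compress_flush_every_row(rows))
once_size = len(compress_flush_once(rows))

print(f"raw={raw_size} flush-per-row={per_row_size} flush-once={once_size}")
```

Both outputs decompress to the same bytes; only the number of forced block boundaries differs.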

I think this should use the LZ4F API with frames, which are buffered to
avoid outputting a header for every single row.  The LZ4F format isn't
compatible with the LZ4 format, so (unlike changing to the streaming
API) that's not something we can change in a bugfix release.  I consider
this an Open Item.
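
As a rough picture of what buffering into frames buys, here is a toy sketch in Python.  Everything in it is hypothetical: the "TOYF" magic, the length-prefixed block layout, and the 64kB block size are invented for illustration and are not the LZ4F format, and zlib again stands in for LZ4.  Rows are accumulated in a buffer, and a block header is written only when the buffer fills.

```python
import struct
import zlib

MAGIC = b"TOYF"          # made-up magic bytes, NOT the LZ4 frame magic
BLOCK_SIZE = 64 * 1024   # assumed threshold: buffer rows until this many bytes


def write_frame(rows, block_size=BLOCK_SIZE):
    """Return (frame_bytes, block_count) for the given rows."""
    out = bytearray(MAGIC)   # one frame header for the whole stream
    pending = bytearray()    # rows waiting to be compressed
    blocks = 0

    def emit():
        nonlocal blocks
        payload = zlib.compress(bytes(pending))
        out.extend(struct.pack("<I", len(payload)))  # one header per block
        out.extend(payload)
        pending.clear()
        blocks += 1

    for row in rows:
        pending += row
        if len(pending) >= block_size:
            emit()
    if pending:
        emit()
    out.extend(struct.pack("<I", 0))  # zero-length block marks the end
    return bytes(out), blocks


def read_frame(data):
    """Inverse of write_frame: check the magic, then decode block by block."""
    if data[:4] != MAGIC:
        raise ValueError("not a toy frame")
    pos, chunks = 4, []
    while True:
        (size,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if size == 0:
            break
        chunks.append(zlib.decompress(data[pos:pos + size]))
        pos += size
    return b"".join(chunks)


rows = [f"{i}\n".encode() for i in range(100_000)]
frame, block_count = write_frame(rows)
print(f"rows={len(rows)} blocks={block_count} bytes={len(frame)}")
```

The number of block headers now scales with the amount of data rather than with the number of rows, which is the property being asked for here.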

With the LZ4F API in 004, -Fp and -Fc are essentially the same size
(like gzip).  (Oh, and the output is three times smaller, too.)

$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fp |wc -c
4155448
$ ./src/bin/pg_dump/pg_dump -h /tmp postgres -t t1 -Z lz4 -Fc |wc -c
4156548

-- 
Justin
