Re: refactoring basebackup.c

From: Jeevan Ladhe
Subject: Re: refactoring basebackup.c
Date:
Msg-id: CAOgcT0NqC3wNZ=sZWZ252xk6Q=BU_aeEMHE7FeiaP473FzTkgg@mail.gmail.com
In response to: Re: refactoring basebackup.c  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: refactoring basebackup.c  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
> 0007 adds server-side compression; currently, it only supports
> server-side compression using gzip, but I hope that it won't be hard
> to generalize that to support LZ4 as well, and Andres told me he
> thinks we should aim to support zstd since that library has built-in
> parallel compression which is very appealing in this context.

Thanks, Robert, for laying the foundation here.
So, I gave the LZ4 streaming API a try for server-side compression.
The LZ4 APIs are documented here[1].

With the attached WIP patch, I am now able to take a backup using lz4
compression. The attached patch applies on top of Robert's V3
patch set[2].

I could take the backup using the command:
pg_basebackup -t server:/tmp/data_lz4 -Xnone --server-compression=lz4

Further, when I restored the backup `/tmp/data_lz4` and started the server, I
could see the tables I had created, along with the data inserted on the
original server.

When I looked at the binary difference between the original data
directory and the backup `data_lz4` directory, here is how it looked:

$ diff -qr data/ /tmp/data_lz4
Only in /tmp/data_lz4: backup_label
Only in /tmp/data_lz4: backup_manifest
Only in data/base: pgsql_tmp
Only in /tmp/data_lz4: base.tar
Only in /tmp/data_lz4: base.tar.lz4
Files data/global/pg_control and /tmp/data_lz4/global/pg_control differ
Files data/logfile and /tmp/data_lz4/logfile differ
Only in data/pg_stat: db_0.stat
Only in data/pg_stat: global.stat
Only in data/pg_subtrans: 0000
Only in data/pg_wal: 000000010000000000000099.00000028.backup
Only in data/pg_wal: 00000001000000000000009A
Only in data/pg_wal: 00000001000000000000009B
Only in data/pg_wal: 00000001000000000000009C
Only in data/pg_wal: 00000001000000000000009D
Only in data/pg_wal: 00000001000000000000009E
Only in data/pg_wal/archive_status: 000000010000000000000099.00000028.backup.done
Only in data/: postmaster.opts

For now, what concerns me is the following `LZ4F_compressUpdate()` API,
which does the core work of streaming compression:

size_t LZ4F_compressUpdate(LZ4F_cctx* cctx,
                           void* dstBuffer, size_t dstCapacity,
                           const void* srcBuffer, size_t srcSize,
                           const LZ4F_compressOptions_t* cOptPtr);

where `dstCapacity` comes from an earlier call to `LZ4F_compressBound()`,
which returns the minimum `dstCapacity` required to guarantee the success of
`LZ4F_compressUpdate()` for a given `srcSize` and `preferences` in the
worst case. `LZ4F_compressBound()` is:

size_t LZ4F_compressBound(size_t srcSize, const LZ4F_preferences_t* prefsPtr);

Now, the hard lesson here is that the `dstCapacity` returned by
`LZ4F_compressBound()` even for a single byte, i.e. 1 as `srcSize`, is about
~256K (this seems to be related to the blockSize we chose for the lz4 frame;
the minimum available is 64K), even though the actual length of the data
compressed by `LZ4F_compressUpdate()` is much smaller. Meanwhile, our
destination buffer length, i.e. `mysink->base.bbs_next->bbs_buffer_length`,
is only 32K. In the call to `LZ4F_compressUpdate()`, if I directly pass
`mysink->base.bbs_next->bbs_buffer + bytes_written` as `dstBuffer` and the
value returned by `LZ4F_compressBound()` as `dstCapacity`, that seems clearly
incorrect to me, since the output buffer space actually remaining is much
less than the worst case computed by `LZ4F_compressBound()`.

For now, I am creating a temporary buffer of the required size, passing it in
for compression, asserting that the actual compressed bytes fit within
whatever space we have available, and then copying the result to our output
buffer.

To give an example, I added some logging statements, and I can see in the log:
"
bytes remaining in mysink->base.bbs_next->bbs_buffer: 16537
input size to be compressed: 512
estimated size for compressed buffer by LZ4F_compressBound(): 262667
actual compressed size: 16
"

I will really appreciate any inputs, comments, or suggestions here.

Regards,
Jeevan Ladhe
 