Re: libpq compression

From: Daniil Zakhlystov
Subject: Re: libpq compression
Date:
Msg-id: 6A45DFAA-1682-4EF2-B835-C5F46615EC49@yandex-team.ru
In response to: Re: libpq compression  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List: pgsql-hackers
Hi!

I’ve contacted Yann Collet (developer of ZSTD) and told him about our discussion. Here is his comment:

> Hi Daniil
>     • Is this an expected behavior of ZSTD to consume more memory during the decompression of data that was
> compressed with a high compression ratio?
>
> I assume that the target application is employing the streaming mode.
> In which case, yes, the memory usage is directly dependent on the Window size, and the Window size tends to increase
> with compression level.
>
>     • how we can restrict the maximal memory usage during decompression?
>
> There are several ways.
>
>     • From a decompression perspective
>
> the first method is to _not_ use the streaming mode,
> and employ the direct buffer-to-buffer compression instead,
> like ZSTD_decompress() for example.
> In which case, the decompressor will not need additional memory, it will only employ the provided buffers.
>
> This however entirely depends on the application and can therefore be unpractical.
> It’s fine when decompressing small blocks, it’s not when decompressing gigantic streams of data.
>
> The second method is more straightforward: set a limit to the window size that the decoder accepts to decode.
> This is the ZSTD_d_windowLogMax parameter, documented here:
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L536
>
> This can be set to any arbitrary power of 2 limit.
> A frame requiring more than this value will be rejected by the decoder, precisely to avoid sustaining large memory
> requirements.
>
> Lastly, note that, in presence of a large window size requirement, the decoder will allocate a correspondingly large
> buffer, but will not necessarily use it.
> For example, if a frame generated with streaming mode at level 22 declares a 128 MB window size, but effectively only
> contains ~200 KB of data, the buffer will only use 200 KB.
> The rest of the buffer is “allocated” from an address space perspective but is not “used” and therefore does not
> really occupy physical RAM space.
> This is a capability of all modern OS and contributes to minimizing the impact of outsized window sizes.
>
>
>     • From a compression perspective
>
> Knowing the set limitation, the compressor should be compliant, and avoid going above the threshold.
> One way to do it is to limit the compression level to those which remain below the set limit.
> For example, if the limit is 8 MB, all levels <= 19 will be compatible, as they require 8 MB max (and generally
> less).
>
> Another method is to manually set a window size, so that it doesn’t exceed the limit.
> This is the ZSTD_c_windowLog parameter, which is documented here:
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L289
>
> Another complementary way is to provide the source size when it’s known.
> By default, the streaming mode doesn’t know the input size, since it’s supposed to receive it in multiple blocks.
> It will only discover it at the end, by which point it’s too late to use this information in the frame header.
> This can be solved, by providing the source size upfront, before starting compression.
> This is the function ZSTD_CCtx_setPledgedSrcSize(), documented here:
> https://github.com/facebook/zstd/blob/v1.4.7/lib/zstd.h#L483
> Of course, then the total amount of data in the frame must be exact, otherwise it’s detected as an error.
>
> Taking again the previous example of compressing 200 KB with level 22, on knowing the source size,
> the compressor will resize the window to fit the input, and therefore employ 200 KB, instead of 128 MB.
> This information will be present in the header, and the decompressor will also be able to use 200 KB instead of
> 128 MB.
> Also, presuming the decompressor has a hard limit set to 8 MB (for example), the header using a 200 KB window size
> will pass and be properly decoded, while the header using 128 MB will be rejected.
> This method is cumulative with the one setting a manual window size (the compressor will select the smallest of
> both).
>
>
> So yes, memory consumption is a serious topic, and there are tools in the `zstd` library to deal with it.
>
>
> Hope it helps
>
> Best Regards
>
> Yann Collet

After reading Yann’s advice, I repeated yesterday’s single-directional decompression benchmarks with ZSTD_d_windowLogMax
set to 23, i.e. an 8 MB max window size.

Total committed memory (Committed_AS) size for ZSTD compression levels 1-19 was pretty much the same:

Committed_AS baseline (size without any benchmark running) - 42.4 GiB

Scenario          Committed_AS    Committed_AS / Baseline
no compression    44.36 GiB       1.05
ZSTD:1            45.03 GiB       1.06
ZSTD:5            46.06 GiB       1.09
ZSTD:9            46.00 GiB       1.08
ZSTD:13           47.46 GiB       1.12
ZSTD:17           50.23 GiB       1.18
ZSTD:19           50.21 GiB       1.18

As for ZSTD levels higher than 19, the decompressor returned the appropriate error (excerpt from the PostgreSQL server log):
LOG:  failed to decompress data: Frame requires too much memory for decoding

Full benchmark report: https://docs.google.com/document/d/1LI8hPzMkzkdQLf7pTN-LXPjIJdjN33bEAqVJj0PLnHA
Pull request with max window size limit: https://github.com/postgrespro/libpq_compression/pull/5

This should fix the possible attack vectors related to high ZSTD compression levels.

—
Daniil Zakhlystov

