Re: pglz performance

From Andrey Borodin
Subject Re: pglz performance
Date
Msg-id CF6BA10B-E36D-4489-BF2B-25F9012ED3CA@yandex-team.ru
Whole thread Raw
In reply to Re: pglz performance  (Petr Jelinek <petr@2ndquadrant.com>)
Responses Re: pglz performance  (Petr Jelinek <petr@2ndquadrant.com>)
List pgsql-hackers

> On 2 Aug 2019, at 21:39, Andres Freund <andres@anarazel.de> wrote:
>
> On 2019-08-02 20:40:51 +0500, Andrey Borodin wrote:
>> We have some kind of "roadmap" for "extensible pglz". We plan to provide an implementation at the November CF.
>
> I don't understand why it's a good idea to improve the compression side
> of pglz. There are plenty of other people who have spent a lot of time
> developing better compression algorithms.
Improving the compression side of pglz covers two different projects:
1. Faster compression with less code and the same compression ratio (the patch in this thread).
2. Better compression ratio with at least the same compression speed.
Why do I want to do a patch for 2? Because it's interesting.
Will 1 or 2 be reviewed or committed? I have no idea.
Will many users benefit from 1 or 2? Yes, clearly, unless we force everyone to stop compressing with pglz.

>> Currently, pglz starts with an empty cache map: there are no prior 4k bytes before the start. We can add an imaginary prefix to any data with common substrings: this will enhance the compression ratio.
>> It is hard to decide on a training data set for this "common prefix". So we want to produce an extension with an aggregate function which produces some "adapted common prefix" from the user's data.
>> Then we can "reserve" a few negative bytes for "decompression commands". Such a command can instruct the database on which common prefix to use.
>> But a system command can also say "invoke decompression from extension".
>>
>> Thus, the user will be able to train database compression on his data and substitute pglz compression with a custom compression method seamlessly.
>>
>> This will make a hard-coded compression choice unneeded, but seems overly hacky. On the other hand, there would be no need to have lz4, zstd, brotli, lzma and others in core. Why not provide e.g. "time series compression"? Or "DNA compression"? Whatever gun the user wants for his foot.
>
> I think this is way too complicated, and will provide not particularly
> much benefit for the majority of users.
>
> In fact, I'll argue that we should flat out reject any such patch until
> we have at least one decent default compression algorithm in
> core. You're trying to work around a poor compression algorithm with
> complicated dictionary improvements
OK. The idea of something plugged into pglz seemed odd even to me.
But it looks like it restarted the lz4 discussion :)

> , that require user interaction, and
> will only work in a relatively small subset of cases, and will very
> often increase compression times.
No; certainly, if the "common prefix" implementation increases compression times I will not even post a patch.
BTW, lz4 also supports a "common prefix"; let's do that too?
Here's a link to the Zstd dictionary builder, which is also compatible with lz4:
https://github.com/facebook/zstd#the-case-for-small-data-compression
We actually have small datums.
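
For illustration, a minimal sketch of what compressing a small datum against such a pre-trained "common prefix" could look like using liblz4's public streaming API. The dictionary buffer is assumed to come from a trainer such as zstd's ZDICT_trainFromBuffer(); the wrapper function names here are made up, not from any patch:

/*
 * Minimal sketch, assuming a pre-built dictionary: compress/decompress
 * a datum against a shared "common prefix" with the stock liblz4 API.
 */
#include <lz4.h>

int
compress_with_common_prefix(const char *dict, int dict_len,
                            const char *src, int src_len,
                            char *dst, int dst_capacity)
{
    LZ4_stream_t stream;

    LZ4_initStream(&stream, sizeof(stream));    /* needs lz4 >= 1.9 */
    LZ4_loadDict(&stream, dict, dict_len);      /* prime the match window */

    /*
     * Matches may now point back into the dictionary, which is what
     * makes small datums with common substrings compress well.
     */
    return LZ4_compress_fast_continue(&stream, src, dst,
                                      src_len, dst_capacity, 1);
}

int
decompress_with_common_prefix(const char *dict, int dict_len,
                              const char *src, int src_len,
                              char *dst, int dst_capacity)
{
    /* The exact same dictionary bytes must be available at read time. */
    return LZ4_decompress_safe_usingDict(src, dst, src_len, dst_capacity,
                                         dict, dict_len);
}

The catch, of course, is that every reader (standbys included) needs the identical dictionary bytes, which is exactly the distribution problem discussed above.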

> On 4 Aug 2019, at 5:41, Petr Jelinek <petr@2ndquadrant.com> wrote:
>
> Just so that we don't idly talk, what do you think about the attached?
> It:
> - adds a new GUC compression_algorithm with possible values of pglz (default) and lz4 (if lz4 is compiled in); requires SIGHUP
> - adds a --with-lz4 configure option (default yes, so the configure option is actually --without-lz4) that enables lz4; it uses the system library
> - uses compression_algorithm for both TOAST and WAL compression (if on)
> - supports slicing for lz4 as well (pglz was already supported)
> - supports reading old TOAST values
> - adds a 1-byte header to the compressed data where we currently store the algorithm kind; that leaves us with 254 more to add :) (that's extra overhead compared to the current state)
> - changes the rawsize in the TOAST header to 31 bits via bit packing
> - uses the extra bit to differentiate between the old and new formats
> - supports reading from a table which has different rows stored with different algorithms (so that the GUC itself can be freely changed)
That's cool. I suggest defaulting to lz4 if it is available. You cannot start a cluster on non-lz4 binaries once it has used lz4.
Do we plan for the possibility of compression algorithms as extensions? Or will all algorithms be packed into that byte in core?
What about an lz4 "common prefix"? System- or user-defined. If lz4 is compiled in we can even offer in-system training; we just have to make sure that trained prefixes make their way to standbys.
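
To make "that byte" concrete, here is a purely hypothetical sketch of the layout Petr describes (one algorithm byte, rawsize packed into 31 bits, the spare bit flagging the new format); none of these names or definitions come from the actual patch:

#include <stdint.h>

/* Hypothetical on-disk layout, not the patch's real definitions. */
#define TOAST_COMPRESS_NEW_FORMAT   0x80000000u  /* spare bit: new header */
#define TOAST_COMPRESS_RAWSIZE_MASK 0x7FFFFFFFu  /* rawsize in 31 bits */

typedef struct
{
    uint32_t info;       /* format bit + 31-bit rawsize */
    uint8_t  algorithm;  /* 0 = pglz, 1 = lz4, 254 values left to add */
    /* compressed payload follows (alignment/padding ignored in sketch) */
} toast_compress_header;

static inline uint32_t
toast_rawsize(const toast_compress_header *hdr)
{
    return hdr->info & TOAST_COMPRESS_RAWSIZE_MASK;
}

static inline int
toast_is_new_format(const toast_compress_header *hdr)
{
    /* Old-format values never set this bit, so both formats stay readable. */
    return (hdr->info & TOAST_COMPRESS_NEW_FORMAT) != 0;
}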

Best regards, Andrey Borodin.

