Re: libpq compression

From: Konstantin Knizhnik
Subject: Re: libpq compression
Date:
Msg-id: 3b1d2752-b7b5-db39-84c9-4ff024041fff@postgrespro.ru
In reply to: Re: libpq compression  (Andres Freund <andres@anarazel.de>)
Responses: Re: libpq compression  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
Hi,
Thanks for the review.

On 28.10.2020 22:27, Andres Freund wrote:
> I don't see a corresponding configure.ac change?
Shame on me - I completely forgot that configure is actually generated 
from configure.ac.
Fixed.
>
>> +     <varlistentry id="libpq-connect-compression" xreflabel="compression">
>> +      <term><literal>compression</literal></term>
>> +      <listitem>
>> +      <para>
>> +        Request compression of libpq traffic. Client sends to the server list of compression algorithms, supported by client library.
>> +        If server supports one of this algorithms, then it acknowledges use of this algorithm and then all libpq messages send both from client to server and
>> +        visa versa will be compressed. If server is not supporting any of the suggested algorithms, then it replies with 'n' (no compression)
>> +        message and it is up to the client whether to continue work without compression or report error.
>> +        Supported compression algorithms are chosen at configure time. Right now two libraries are supported: zlib (default) and zstd (if Postgres was
>> +        configured with --with-zstd option). In both cases streaming mode is used.
>> +      </para>
>> +      </listitem>
>> +     </varlistentry>
>> +
>
> - there should be a reference to potential security impacts when used in
>    combination with encrypted connections
Done

> - What does " and it is up to the client whether to continue work
>    without compression or report error" actually mean for a libpq parameter?
It cannot happen.
The client requests the compressed protocol from the server only if 
"compression=XXX" was specified in the connection string.
But XXX must be supported by the client library, otherwise the request 
is rejected locally.
So the list of supported algorithms sent by the client can never be empty.

> - What is the point of the "streaming mode" reference?

There are two ways of performing compression:
- block mode, where each block is compressed individually (the 
compressor stores a dictionary in each compressed block)
- stream mode
Block mode allows each block to be decompressed independently. It is 
good for implementing page or field compression (as pglz is used to 
compress toast values).
But it is not efficient for compressing client-server protocol messages.
It seems important to me to explain that libpq uses stream mode and why 
there is no pglz compressor.
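To make the block-vs-stream difference concrete, here is a toy sketch (not the actual zpq/zlib code, and deliberately oversimplified): a stream-mode "compressor" keeps state across messages, so content that repeats an earlier message can be encoded as a short back-reference, while a block-mode compressor starts from scratch for every message and cannot exploit cross-message redundancy.

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/*
 * Toy illustration only: a "stream" compressor keeps history across
 * messages, a "block" compressor does not.  Real zlib/zstd streaming
 * works on sliding windows, not whole-message matches.
 */
typedef struct
{
    char    history[256];       /* last message seen on this stream */
    size_t  history_len;
} StreamCtx;

/* Block mode: every message is self-contained, so cost = full length. */
static size_t
block_compress_cost(const char *msg, size_t len)
{
    (void) msg;
    return len;
}

/* Stream mode: a message identical to the previous one costs 1 byte. */
static size_t
stream_compress_cost(StreamCtx *ctx, const char *msg, size_t len)
{
    size_t cost;

    if (len == ctx->history_len && memcmp(msg, ctx->history, len) == 0)
        cost = 1;               /* emit a back-reference token */
    else
        cost = len;             /* emit the literal bytes */

    memcpy(ctx->history, msg, len);     /* state survives to the next call */
    ctx->history_len = len;
    return cost;
}
```

For repetitive protocol traffic (the same query shape sent over and over) the stream variant pays the full cost only once, which is the efficiency argument made above.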
> Why are compression methods identified by one byte identifiers? That
> seems unnecessarily small, given this is commonly a once-per-connection
> action?

It is mostly for simplicity of implementation: it is always simpler to 
work with fixed-size messages (or with an array of chars rather than an 
array of strings).
And I do not think that it decreases flexibility in any way: these 
one-letter algorithm codes are not visible to the user. And I do not 
think that we will ever support more than 127 (or even 64) different 
compression algorithms.
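Since the one-letter codes are internal, the mapping from user-visible names to wire bytes is trivial. A hypothetical sketch (the table and function are illustrative, not the patch's actual API; the 'f'/'z' codes match the patch):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical mapping from connection-string names to the one-byte
 * codes sent on the wire ('f' = zstd, 'z' = zlib as in the patch).
 * The user only ever sees the names, so the byte codes cost no
 * flexibility until we exceed the printable byte range.
 */
typedef struct
{
    const char *name;   /* name used in the connection string */
    char        code;   /* one-byte identifier sent on the wire */
} AlgoName;

static const AlgoName algo_names[] =
{
    {"zstd", 'f'},
    {"zlib", 'z'},
    {NULL, 'n'}         /* sentinel: unknown name -> no compression */
};

static char
algo_code_for_name(const char *name)
{
    const AlgoName *a;

    for (a = algo_names; a->name != NULL; a++)
        if (strcmp(a->name, name) == 0)
            return a->code;
    return 'n';
}
```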
> The protocol sounds to me like there's no way to enable/disable
> compression in an existing connection. To me it seems better to have an
> explicit, client initiated, request to use a specific method of
> compression (including none). That allows to enable compression for bulk
> work, and disable it in other cases (e.g. for security sensitive
> content, or for unlikely to compress well content).

It would significantly complicate the implementation (because of 
buffering at different levels).
Also, it is not clear to me who would control enabling/disabling 
compression in this case, and how.
I can imagine that "\copy" should trigger compression. But what about 
the (server side) "copy" command?
Or just a select returning huge results? I do not think that making the 
user or application enable/disable compression on the fly is really a 
good idea.
The overhead of compressing small commands is not that large.
And concerning security risks... In most cases the problem is not 
relevant at all, because both client and server are located within a 
single trusted network.
If security of communication really matters, you should not enable 
compression at all (including for COPY and other bulk data transfer).
It is a very strange idea to let the client decide which data is 
"security sensitive" and which is not.
>
> I think that would also make cross-version handling easier, because a
> newer client driver can send the compression request and handle the
> error, without needing to reconnect or such.
>
> Most importantly, I think such a design is basically a necessity to make
> connection poolers to work in a sensible way.

I do not completely understand the problem with connection poolers.
Right now the developers of Yandex Odyssey are trying to support libpq 
compression in their pooler.
If they run into problems, I will definitely address them.
> And lastly, wouldn't it be reasonable to allow to specify things like
> compression levels? All that doesn't have to be supported now, but I
> think the protocol should take that into account.
Well, if we want to provide maximal flexibility, then we should allow 
specifying the compression level.
Practically, when I implemented our CFS compressed storage for pgpro-ee 
and libpq_compression, I performed
a lot of benchmarks comparing different compression algorithms with 
different compression levels.
I certainly do not claim to have done rigorous research in this area.
But IMHO the default (fastest) compression level is always the 
preferable choice: it provides the best compromise between speed and 
compression ratio.
Higher compression levels reduce compression speed significantly 
(several times), while their influence on compression ratio is much 
smaller.
Moreover, zstd at the default compression level compresses synthetic 
data (i.e. the space-padded strings generated by pgbench) much better
(with a compression ratio of 63!) than at higher compression levels.
Right now in Postgres we do not allow the user to specify a compression 
level either for compressing TOAST data or for WAL compression...
And IMHO for libpq protocol compression, the possibility to specify a 
compression level is even less useful.

But if you think it is so important, I will try to implement it. 
Many questions arise in this case: which side should control the 
compression level?
Should the client set the compression level for both the client and the 
server side? Or should it be possible to specify the compression level 
separately for client and server?

>> +<para>
>> +        Used compression algorithm. Right now the following streaming compression algorithms are supported: 'f' - Facebook zstd, 'z' - zlib, 'n' - no compression.
>> +</para>
> I would prefer this just be referenced as zstd or zstandard, not
> facebook zstd. There's an RFC (albeit only "informational"), and it
> doesn't name facebook, except as an employer:
> https://tools.ietf.org/html/rfc8478

Please notice that this is an internal encoding; the user will specify
psql -d "dbname=postgres compression=zstd"

If the name "zstd" is not good, I can choose any other.
>
>
>> +int
>> +pq_configure(Port* port)
>> +{
>> +    char* client_compression_algorithms = port->compression_algorithms;
>> +    /*
>> +     * If client request compression, it sends list of supported compression algorithms.
>> +     * Each compression algorirthm is idetified by one letter ('f' - Facebook zsts, 'z' - xlib)
>> +     */
> s/algorirthm/algorithm/
> s/idetified/identified/
> s/zsts/zstd/
> s/xlib/zlib/
>
> That's, uh, quite the typo density.
>
Sorry, fixed


>> +    if (client_compression_algorithms)
>> +    {
>> +        char server_compression_algorithms[ZPQ_MAX_ALGORITHMS];
>> +        char compression_algorithm = ZPQ_NO_COMPRESSION;
>> +        char compression[6] = {'z',0,0,0,5,0}; /* message length = 5 */
>> +        int rc;
> Why is this hand-rolling protocol messages?
Sorry, I do not quite understand your concern.
It seems to me that all libpq message manipulation is more or less 
hand-rolling, isn't it (we are not using protobuf or msgpack)?
Or do you think that calling pq_sendbyte, pq_sendint32, ... would be 
much safer in this case?

>
>> +        /* Intersect lists */
>> +        while (*client_compression_algorithms != '\0')
>> +        {
>> +            if (strchr(server_compression_algorithms, *client_compression_algorithms))
>> +            {
>> +                compression_algorithm = *client_compression_algorithms;
>> +                break;
>> +            }
>> +            client_compression_algorithms += 1;
>> +        }
> Why isn't this is handled within zpq?
>
It seems to be part of the libpq client-server handshake protocol. To 
me, zpq is a lower-level component which is simply told which 
compressor to use.
>> +        /* Send 'z' message to the client with selectde comression algorithm ('n' if match is ont found) */
> s/selectde/selected/
> s/comression/compression/
> s/ont/not/
>

:(
Fixed.
But it looks like you are inspecting not the latest patch 
(libpq_compression-20.patch),
because two of these three typos have already been fixed there.

>> +        socket_set_nonblocking(false);
>> +        while ((rc = secure_write(MyProcPort, compression, sizeof(compression))) < 0
>> +               && errno == EINTR);
>> +        if ((size_t)rc != sizeof(compression))
>> +            return -1;
> Huh? This all seems like an abstraction violation.
>
>
>> +        /* initialize compression */
>> +        if (zpq_set_algorithm(compression_algorithm))
>> +            PqStream = zpq_create((zpq_tx_func)secure_write, (zpq_rx_func)secure_read, MyProcPort);
>> +    }
>> +    return 0;
>> +}
> Why is zpq a wrapper around secure_write/read? I'm a bit worried this
> will reduce the other places we could use zpq.
zpq has to read/write data from the underlying stream.
And it has to be used in both client and server environments.
I did not see another way to provide a single zpq implementation 
without code duplication
except to pass rx/tx functions to it.
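The callback approach can be sketched as follows (a minimal sketch, assuming hypothetical names; the real patch's zpq_tx_func/zpq_rx_func take secure_write/secure_read on the server side and the client socket layer in libpq):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * zpq is handed opaque read/write callbacks plus an argument, so the
 * same compression code can sit on top of either side's IO layer.
 */
typedef ssize_t (*zpq_tx_func)(void *arg, const void *data, size_t size);
typedef ssize_t (*zpq_rx_func)(void *arg, void *data, size_t size);

typedef struct
{
    zpq_tx_func tx;
    zpq_rx_func rx;
    void       *arg;        /* e.g. the Port / PGconn of the connection */
} ZpqStream;

/* "Compressed" write: in this sketch it just forwards to the stream. */
static ssize_t
zpq_write(ZpqStream *zs, const void *data, size_t size)
{
    return zs->tx(zs->arg, data, size);
}

/* A test transport writing into an in-memory buffer. */
typedef struct { char buf[64]; size_t len; } MemSink;

static ssize_t
mem_tx(void *arg, const void *data, size_t size)
{
    MemSink *m = (MemSink *) arg;

    memcpy(m->buf + m->len, data, size);
    m->len += size;
    return (ssize_t) size;
}
```

The in-memory transport also hints at why this design helps testing: zpq can be exercised without a socket at all.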

>
>
>>   static int
>> -pq_recvbuf(void)
>> +pq_recvbuf(bool nowait)
>>   {
>> +        /* If srteaming compression is enabled then use correpondent comression read function. */
> s/srteaming/streaming/
> s/correpondent/correponding/
> s/comression/compression/
>
> Could you please try to proof-read the patch a bit? The typo density
> is quite high.
Once again, sorry
Will do.

>
>> +        r = PqStream
>> +            ? zpq_read(PqStream, PqRecvBuffer + PqRecvLength,
>> +                       PQ_RECV_BUFFER_SIZE - PqRecvLength, &processed)
>> +            : secure_read(MyProcPort, PqRecvBuffer + PqRecvLength,
>> +                          PQ_RECV_BUFFER_SIZE - PqRecvLength);
>> +        PqRecvLength += processed;
> ? : doesn't make sense to me in this case. This should be an if/else.
>
Isn't it a matter of style preference?
Why is if/else in principle better than ?:
I agree that sometimes ?: leads to more complex and obscure expressions.
But do you think that if/else in this case will lead to clearer or 
more readable code?

Another question is whether a conditional expression here is really a 
good idea.
I would prefer to replace it with an indirect function call...
But there are several reasons which make me prefer this 
straightforward, not so nice way
(at least the difference in function signatures).



>>           if (r < 0)
>>           {
>> +            if (r == ZPQ_DECOMPRESS_ERROR)
>> +            {
>> +                char const* msg = zpq_error(PqStream);
>> +                if (msg == NULL)
>> +                    msg = "end of stream";
>> +                ereport(COMMERROR,
>> +                        (errcode_for_socket_access(),
>> +                         errmsg("failed to decompress data: %s", msg)));
>> +                return EOF;
>> +            }
> I don't think we should error out with "failed to decompress data:"
> e.g. when the client closed the connection.

Sorry, but this error is reported only when ZPQ_DECOMPRESS_ERROR is 
returned.
It means that the received data cannot be decompressed, not that the 
client connection is broken.
>
>
>
>> @@ -1413,13 +1457,18 @@ internal_flush(void)
>>       char       *bufptr = PqSendBuffer + PqSendStart;
>>       char       *bufend = PqSendBuffer + PqSendPointer;
>>   
>> -    while (bufptr < bufend)
>> +    while (bufptr < bufend || zpq_buffered(PqStream) != 0) /* has more data to flush or unsent data in internal compression buffer */
>>       {
> Overly long line.
Fixed
> This should at least specify how these functions are supposed to handle
> blocking/nonblocking sockets.
>
Blocking/nonblocking control is done by the upper layer.
The zpq function implementations just call the underlying IO functions 
and do not care whether those calls are blocking or nonblocking.

>> +
>> +#define ZSTD_BUFFER_SIZE (8*1024)
>> +#define ZSTD_COMPRESSION_LEVEL 1
> Add some arguments for choosing these parameters.
>
What is the suggested way to specify them?
I cannot put them in GUCs (because they are also needed at the client 
side).
Maybe it would be possible for the client to specify them in the 
connection string:
psql -d "compression=zstd/level=10/buffer=8k"
but that seems awful and overkill, doesn't it?
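For what it's worth, parsing such an extended option would be cheap even if the UI is debatable. A purely hypothetical sketch (the "alg/level=N" syntax is invented for illustration; nothing like it exists in the patch):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <stddef.h>

/*
 * Hypothetical parser for a "alg/level=N" compression option.
 * Returns 1 on success, 0 on malformed input; level defaults to the
 * fastest setting when not given.
 */
static int
parse_compression_option(const char *value, char *alg, size_t alg_size,
                         int *level)
{
    const char *slash = strchr(value, '/');

    *level = 1;                         /* default: fastest level */
    if (slash == NULL)
    {
        snprintf(alg, alg_size, "%s", value);
        return 1;
    }
    if ((size_t) (slash - value) >= alg_size)
        return 0;                       /* algorithm name too long */
    memcpy(alg, value, slash - value);
    alg[slash - value] = '\0';
    return sscanf(slash + 1, "level=%d", level) == 1;
}
```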

>> +
>> +/*
>> + * Array with all supported compression algorithms.
>> + */
>> +static ZpqAlgorithm const zpq_algorithms[] =
>> +{
>> +#if HAVE_LIBZSTD
>> +    {zstd_name, zstd_create, zstd_read, zstd_write, zstd_free, zstd_error, zstd_buffered},
>> +#endif
>> +#if HAVE_LIBZ
>> +    {zlib_name, zlib_create, zlib_read, zlib_write, zlib_free, zlib_error, zlib_buffered},
>> +#endif
>> +    {NULL}
>> +};
> I think it's preferrable to use designated initializers.
>
> Do we really need zero terminated lists? Works fine, but brrr.

Once again - a matter of taste :)
NULL-terminated lists are standard C practice IMHO - not invented by me 
and widely used in the Postgres code ;)
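For comparison, the reviewer's suggestion might look like this (a sketch with illustrative field and function names, not the patch's actual struct): designated initializers name each field explicitly, and an array-size macro can replace the NULL sentinel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Designated initializers plus an explicit element count instead of a
 * {NULL} terminator.  Names are illustrative.
 */
typedef struct
{
    const char *name;
    char      (*id)(void);      /* one-letter wire identifier */
} ZpqAlgorithm;

static char zstd_id(void) { return 'f'; }
static char zlib_id(void) { return 'z'; }

static const ZpqAlgorithm zpq_algorithms[] =
{
    {.name = "zstd", .id = zstd_id},
    {.name = "zlib", .id = zlib_id},
};

#define ZPQ_N_ALGORITHMS (sizeof(zpq_algorithms) / sizeof(zpq_algorithms[0]))
```

Both styles work; the count macro avoids iterating past a forgotten sentinel, while the sentinel keeps loops independent of the array's declared size.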

>
>> +/*
>> + * Index of used compression algorithm in zpq_algorithms array.
>> + */
>> +static int zpq_algorithm_impl;
> This is just odd API design imo. Why doesn't the dispatch work based on
> an argument for zpq_create() and the ZpqStream * for the rest?
>
> What if there's two libpq connections in one process? To servers
> supporting different compression algorithms? This isn't going to fly.

Fixed.

>
>> +/*
>> + * Get list of the supported algorithms.
>> + * Each algorithm is identified by one letter: 'f' - Facebook zstd, 'z' - zlib.
>> + * Algorithm identifies are appended to the provided buffer and terminated by '\0'.
>> + */
>> +void
>> +zpq_get_supported_algorithms(char algorithms[ZPQ_MAX_ALGORITHMS])
>> +{
>> +    int i;
>> +    for (i = 0; zpq_algorithms[i].name != NULL; i++)
>> +    {
>> +        Assert(i < ZPQ_MAX_ALGORITHMS);
>> +        algorithms[i] = zpq_algorithms[i].name();
>> +    }
>> +    Assert(i < ZPQ_MAX_ALGORITHMS);
>> +    algorithms[i] = '\0';
>> +}
> Uh, doesn't this bake ZPQ_MAX_ALGORITHMS into the ABI? That seems
> entirely unnecessary?

I tried to avoid dynamic memory allocation because zpq is used in both 
client and server environments, which have different memory allocation 
policies.
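A middle ground that avoids both dynamic allocation and baking ZPQ_MAX_ALGORITHMS into the function's contract is to have the caller pass the buffer size. A sketch (names and the 'f'/'z' code list are illustrative):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

static const char supported_codes[] = {'f', 'z'};   /* e.g. zstd, zlib */

/*
 * Copy the supported one-letter codes into a caller-provided buffer,
 * truncating if needed and always NUL-terminating.  Returns the number
 * of codes written, so no fixed constant leaks into the ABI.
 */
static size_t
zpq_supported_algorithms(char *buf, size_t bufsize)
{
    size_t n = sizeof(supported_codes);

    if (bufsize == 0)
        return 0;
    if (n > bufsize - 1)
        n = bufsize - 1;
    memcpy(buf, supported_codes, n);
    buf[n] = '\0';
    return n;
}
```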

>> @@ -2180,6 +2257,20 @@ build_startup_packet(const PGconn *conn, char *packet,
>>           ADD_STARTUP_OPTION("replication", conn->replication);
>>       if (conn->pgoptions && conn->pgoptions[0])
>>           ADD_STARTUP_OPTION("options", conn->pgoptions);
>> +    if (conn->compression && conn->compression[0])
>> +    {
>> +        bool enabled;
>> +        /*
>> +         * If compressoin is enabled, then send to the server list of compression algorithms
>> +         * supported by client
>> +         */
> s/compressoin/compression/

Fixed
>> +        if (parse_bool(conn->compression, &enabled))
>> +        {
>> +            char compression_algorithms[ZPQ_MAX_ALGORITHMS];
>> +            zpq_get_supported_algorithms(compression_algorithms);
>> +            ADD_STARTUP_OPTION("compression", compression_algorithms);
>> +        }
>> +    }
> I think this needs to work in a graceful manner across server
> versions. You can make that work with an argument, using the _pq_
> parameter stuff, but as I said earlier, I think it's a mistake to deal
> with this in the startup packet anyway.

Sorry, but I think it should be quite easy for the user to toggle 
compression.
Originally I suggested adding a -z option to psql and other Postgres 
utilities working with the libpq protocol.
Adding new options was considered a bad idea by the reviewer, so I left 
the corresponding option in the connection string:

psql -d "dbname=postgres compression=zlib"

It is IMHO less convenient than "psql -z postgres".
And I am afraid that using the _pq_ parameter stuff makes enabling 
compression even less user friendly.
Still, I can switch to the _pq_ convention if there is consensus that 
adding "compression" to the startup packet should be avoided.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



