Discussion: Compress ReorderBuffer spill files using LZ4
Hi,

When the content of a large transaction (size exceeding
logical_decoding_work_mem) and its sub-transactions has to be
reordered during logical decoding, all the changes are written to
disk in temporary files located in pg_replslot/<slot_name>. Decoding
very large transactions through multiple replication slots can lead
to disk space saturation and high I/O utilization.

When compiled with LZ4 support (--with-lz4), this patch enables
compression/decompression of these temporary files. Each transaction
change that must be written to disk (ReorderBufferDiskChange) is now
compressed and encapsulated in a new structure. Three different
compression strategies are implemented:

1. LZ4 streaming compression, the preferred strategy, which works
   efficiently for small individual changes.
2. LZ4 regular compression, used when a change is too large for the
   streaming API.
3. No compression: if compression fails, the change is stored
   uncompressed.

Without compression, the following case generates 1590MB of spill
files:

CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);
INSERT INTO t SELECT i, 'Hello number n°'||i::TEXT
  FROM generate_series(1, 10000000) as i;

With LZ4 compression, it creates 653MB of spill files: 58.9% less
disk space usage.

Open items:

1. The spill_bytes column from pg_stat_get_replication_slot() still
   returns the plain data size, not the compressed data size. Should
   we expose the compressed data size when compression occurs?
2. Do we want a GUC to switch compression on/off?

Regards,

JT
Attachments
On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > When the content of a large transaction (size exceeding > logical_decoding_work_mem) and its sub-transactions has to be > reordered during logical decoding, then, all the changes are written > on disk in temporary files located in pg_replslot/<slot_name>. > Decoding very large transactions by multiple replication slots can > lead to disk space saturation and high I/O utilization. > Why can't one use 'streaming' option to send changes to the client once it reaches the configured limit of 'logical_decoding_work_mem'? > > 2. Do we want a GUC to switch compression on/off? > It depends on the overhead of decoding. Did you try to measure the decoding overhead of decompression when reading compressed files? -- With Regards, Amit Kapila.
On Thu, Jun 6, 2024 at 4:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > When the content of a large transaction (size exceeding > > logical_decoding_work_mem) and its sub-transactions has to be > > reordered during logical decoding, then, all the changes are written > > on disk in temporary files located in pg_replslot/<slot_name>. > > Decoding very large transactions by multiple replication slots can > > lead to disk space saturation and high I/O utilization. > > > > Why can't one use 'streaming' option to send changes to the client > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > > 2. Do we want a GUC to switch compression on/off? > > > > It depends on the overhead of decoding. Did you try to measure the > decoding overhead of decompression when reading compressed files? I think it depends on the trade-off between the I/O savings from reducing the data size and the performance cost of compressing and decompressing the data. This balance is highly dependent on the hardware. For example, if you have a very slow disk and a powerful processor, compression could be advantageous. Conversely, if the disk is very fast, the I/O savings might be minimal, and the compression overhead could outweigh the benefits. Additionally, the effectiveness of compression also depends on the compression ratio, which varies with the type of data being compressed. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Le jeu. 6 juin 2024 à 04:13, Amit Kapila <amit.kapila16@gmail.com> a écrit :
>
> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > When the content of a large transaction (size exceeding
> > logical_decoding_work_mem) and its sub-transactions has to be
> > reordered during logical decoding, then, all the changes are written
> > on disk in temporary files located in pg_replslot/<slot_name>.
> > Decoding very large transactions by multiple replication slots can
> > lead to disk space saturation and high I/O utilization.
> >
>
> Why can't one use 'streaming' option to send changes to the client
> once it reaches the configured limit of 'logical_decoding_work_mem'?

That's right, setting subscription's option 'streaming' to 'on' moves
the problem away from the publisher to the subscribers. This patch
tries to improve the default situation when 'streaming' is set to
'off'.

> > 2. Do we want a GUC to switch compression on/off?
> >
>
> It depends on the overhead of decoding. Did you try to measure the
> decoding overhead of decompression when reading compressed files?

Quick benchmarking executed on my laptop shows 1% overhead.

Table DDL:
CREATE TABLE t (i INTEGER PRIMARY KEY, t TEXT);

Data generated with:
INSERT INTO t SELECT i, 'Text number n°'||i::TEXT
  FROM generate_series(1, 10000000) as i;

Restoration duration measured using timestamps of log messages:
"DEBUG: restored XXXX/YYYY changes from disk"

HEAD:  25.54s, 25.94s, 25.516s, 26.267s, 26.11s / avg=25.874s
Patch: 26.872s, 26.311s, 25.753s, 26.003s, 25.843s / avg=26.156s

Regards,

JT
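The averages and the ~1% figure quoted above check out from the raw timings:

```python
# Quick arithmetic check of the benchmark numbers quoted above.
head = [25.54, 25.94, 25.516, 26.267, 26.11]
patch = [26.872, 26.311, 25.753, 26.003, 25.843]

head_avg = sum(head) / len(head)     # ≈ 25.874s
patch_avg = sum(patch) / len(patch)  # ≈ 26.156s

# Relative decoding overhead of decompression, in percent.
overhead = (patch_avg - head_avg) / head_avg * 100  # ≈ 1.09%
```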
On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires <julmon@gmail.com> wrote: > > Le jeu. 6 juin 2024 à 04:13, Amit Kapila <amit.kapila16@gmail.com> a écrit : > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/<slot_name>. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > > > > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > That's right, setting subscription's option 'streaming' to 'on' moves > the problem away from the publisher to the subscribers. This patch > tries to improve the default situation when 'streaming' is set to > 'off'. > Can we think of changing the default to 'parallel'? BTW, it would be better to use 'parallel' for the 'streaming' option, if the workload has large transactions. Is there a reason to use a default value in this case? > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > decoding overhead of decompression when reading compressed files? > > Quick benchmarking executed on my laptop shows 1% overhead. > Thanks. We probably need different types of data (say random data in bytea column, etc.) for this. -- With Regards, Amit Kapila.
On 2024-Jun-06, Amit Kapila wrote:

> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> >
> > When the content of a large transaction (size exceeding
> > logical_decoding_work_mem) and its sub-transactions has to be
> > reordered during logical decoding, then, all the changes are written
> > on disk in temporary files located in pg_replslot/<slot_name>.
> > Decoding very large transactions by multiple replication slots can
> > lead to disk space saturation and high I/O utilization.

I like the general idea of compressing the output of logical decoding.
It's not so clear to me that we only want to do so for spilling to disk;
for instance, if the two nodes communicate over a slow network, it may
even be beneficial to compress when streaming, so to this question:

> Why can't one use 'streaming' option to send changes to the client
> once it reaches the configured limit of 'logical_decoding_work_mem'?

I would say that streaming doesn't necessarily have to mean we don't
want compression, because for some users it might be beneficial.

I think a GUC would be a good idea. Also, what if for whatever reason
you want a different compression algorithm or different compression
parameters? Looking at the existing compression UI we offer in
pg_basebackup, perhaps you could add something like this:

compress_logical_decoding = none
compress_logical_decoding = lz4:42
compress_logical_decoding = spill-zstd:99

"none" says to never use compression (perhaps should be the default),
"lz4:42" says to use lz4 with parameter 42 on both spilling and
streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but
only for spilling to disk.

(I don't mean to say that you should implement Zstd compression with
this patch, only that you should choose the implementation so that
adding Zstd support (or whatever) later is just a matter of adding some
branches here and there. With the current #ifdef you propose, it's hard
to do that. Maybe separate the parts that depend on the specific
algorithm to algorithm-agnostic functions.)

-- 
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
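The value syntax proposed above ("none", "lz4:42", "spill-zstd:99") could be parsed along these lines. This is purely a hypothetical sketch of the proposed GUC grammar — none of these names or behaviors exist in PostgreSQL:

```python
# Hypothetical parser for the suggested compress_logical_decoding
# values: returns a (scope, method, level) triple, where scope is
# "both" unless the method carries a "spill-" prefix.
def parse_compress_option(value: str):
    if value == "none":
        return ("none", None, None)
    method, _, level = value.partition(":")
    scope = "both"
    if method.startswith("spill-"):
        scope = "spill"
        method = method[len("spill-"):]
    if method not in ("lz4", "zstd", "pglz"):
        raise ValueError(f"unknown compression method: {method}")
    return (scope, method, int(level) if level else None)
```

For example, "spill-zstd:99" would yield ("spill", "zstd", 99), and a bare "pglz" would yield ("both", "pglz", None) with the method's default level.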
Le jeu. 6 juin 2024 à 06:40, Amit Kapila <amit.kapila16@gmail.com> a écrit : > > On Thu, Jun 6, 2024 at 6:22 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > Le jeu. 6 juin 2024 à 04:13, Amit Kapila <amit.kapila16@gmail.com> a écrit : > > > > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > > > > > When the content of a large transaction (size exceeding > > > > logical_decoding_work_mem) and its sub-transactions has to be > > > > reordered during logical decoding, then, all the changes are written > > > > on disk in temporary files located in pg_replslot/<slot_name>. > > > > Decoding very large transactions by multiple replication slots can > > > > lead to disk space saturation and high I/O utilization. > > > > > > > > > > Why can't one use 'streaming' option to send changes to the client > > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > > > That's right, setting subscription's option 'streaming' to 'on' moves > > the problem away from the publisher to the subscribers. This patch > > tries to improve the default situation when 'streaming' is set to > > 'off'. > > > > Can we think of changing the default to 'parallel'? BTW, it would be > better to use 'parallel' for the 'streaming' option, if the workload > has large transactions. Is there a reason to use a default value in > this case? You're certainly right, if using the streaming API helps to avoid bad situations and there is no downside, it could be used by default. > > > > 2. Do we want a GUC to switch compression on/off? > > > > > > > > > > It depends on the overhead of decoding. Did you try to measure the > > > decoding overhead of decompression when reading compressed files? > > > > Quick benchmarking executed on my laptop shows 1% overhead. > > > > Thanks. We probably need different types of data (say random data in > bytea column, etc.) for this. Yes, good idea, will run new tests in that sense. Thank you! Regards, JT
Le jeu. 6 juin 2024 à 07:24, Alvaro Herrera <alvherre@alvh.no-ip.org> a écrit : > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/<slot_name>. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. Interesting idea, will try to evaluate how to compress/decompress data transiting via streaming and how good the compression ratio would be. > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. 
I agree, if the server was compiled with support of multiple compression libraries, users should be able to choose which one they want to use. > (I don't mean to say that you should implement Zstd compression with > this patch, only that you should choose the implementation so that > adding Zstd support (or whatever) later is just a matter of adding some > branches here and there. With the current #ifdef you propose, it's hard > to do that. Maybe separate the parts that depend on the specific > algorithm to algorithm-agnostic functions.) Makes sense, will rework this patch in that way. Thank you! Regards, JT
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Jun-06, Amit Kapila wrote: > > > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: > > > > > > When the content of a large transaction (size exceeding > > > logical_decoding_work_mem) and its sub-transactions has to be > > > reordered during logical decoding, then, all the changes are written > > > on disk in temporary files located in pg_replslot/<slot_name>. > > > Decoding very large transactions by multiple replication slots can > > > lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > > > Why can't one use 'streaming' option to send changes to the client > > once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. +1 > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. > I think the compression option should be supported at the CREATE SUBSCRIPTION level instead of being controlled by a GUC. 
This way, we can decide on compression for each subscription individually rather than applying it to all subscribers. It makes more sense for the subscriber to control this, especially when we are planning to compress the data sent downstream. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On 2024-Jun-07, Dilip Kumar wrote: > I think the compression option should be supported at the CREATE > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > can decide on compression for each subscription individually rather > than applying it to all subscribers. It makes more sense for the > subscriber to control this, especially when we are planning to > compress the data sent downstream. True. (I think we have some options that are in GUCs for the general behavior and can be overridden by per-subscription options for specific tailoring; would that make sense here? I think it does, considering that what we mostly want is to save disk space in the publisher when spilling to disk.) -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
On Fri, Jun 7, 2024 at 2:39 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2024-Jun-07, Dilip Kumar wrote: > > > I think the compression option should be supported at the CREATE > > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > > can decide on compression for each subscription individually rather > > than applying it to all subscribers. It makes more sense for the > > subscriber to control this, especially when we are planning to > > compress the data sent downstream. > > True. (I think we have some options that are in GUCs for the general > behavior and can be overridden by per-subscription options for specific > tailoring; would that make sense here? I think it does, considering > that what we mostly want is to save disk space in the publisher when > spilling to disk.) Yeah, that makes sense. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 6, 2024 at 7:54 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2024-Jun-06, Amit Kapila wrote:
>
> > On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote:
> > >
> > > When the content of a large transaction (size exceeding
> > > logical_decoding_work_mem) and its sub-transactions has to be
> > > reordered during logical decoding, then, all the changes are written
> > > on disk in temporary files located in pg_replslot/<slot_name>.
> > > Decoding very large transactions by multiple replication slots can
> > > lead to disk space saturation and high I/O utilization.
>
> I like the general idea of compressing the output of logical decoding.
> It's not so clear to me that we only want to do so for spilling to disk;
> for instance, if the two nodes communicate over a slow network, it may
> even be beneficial to compress when streaming, so to this question:
>
> > Why can't one use 'streaming' option to send changes to the client
> > once it reaches the configured limit of 'logical_decoding_work_mem'?
>
> I would say that streaming doesn't necessarily have to mean we don't
> want compression, because for some users it might be beneficial.
>

Fair enough. It would be an interesting feature if we see the wider
usefulness of compression/decompression of logical changes. For
example, if this can improve the performance of applying large
transactions (aka reduce the apply lag for them) even when the
'streaming' option is 'parallel' then it would have a much wider
impact.

-- 
With Regards,
Amit Kapila.
On Fri, Jun 7, 2024 at 2:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I think the compression option should be supported at the CREATE > SUBSCRIPTION level instead of being controlled by a GUC. This way, we > can decide on compression for each subscription individually rather > than applying it to all subscribers. It makes more sense for the > subscriber to control this, especially when we are planning to > compress the data sent downstream. > Yes, that makes sense. However, we then need to provide this option via SQL APIs as well for other plugins. -- With Regards, Amit Kapila.
On 6/6/24 16:24, Alvaro Herrera wrote: > On 2024-Jun-06, Amit Kapila wrote: > >> On Thu, Jun 6, 2024 at 4:28 PM Julien Tachoires <julmon@gmail.com> wrote: >>> >>> When the content of a large transaction (size exceeding >>> logical_decoding_work_mem) and its sub-transactions has to be >>> reordered during logical decoding, then, all the changes are written >>> on disk in temporary files located in pg_replslot/<slot_name>. >>> Decoding very large transactions by multiple replication slots can >>> lead to disk space saturation and high I/O utilization. > > I like the general idea of compressing the output of logical decoding. > It's not so clear to me that we only want to do so for spilling to disk; > for instance, if the two nodes communicate over a slow network, it may > even be beneficial to compress when streaming, so to this question: > >> Why can't one use 'streaming' option to send changes to the client >> once it reaches the configured limit of 'logical_decoding_work_mem'? > > I would say that streaming doesn't necessarily have to mean we don't > want compression, because for some users it might be beneficial. > > I think a GUC would be a good idea. Also, what if for whatever reason > you want a different compression algorithm or different compression > parameters? Looking at the existing compression UI we offer in > pg_basebackup, perhaps you could add something like this: > > compress_logical_decoding = none > compress_logical_decoding = lz4:42 > compress_logical_decoding = spill-zstd:99 > > "none" says to never use compression (perhaps should be the default), > "lz4:42" says to use lz4 with parameters 42 on both spilling and > streaming, and "spill-zstd:99" says to use Zstd with parameter 99 but > only for spilling to disk. 
> > (I don't mean to say that you should implement Zstd compression with > this patch, only that you should choose the implementation so that > adding Zstd support (or whatever) later is just a matter of adding some > branches here and there. With the current #ifdef you propose, it's hard > to do that. Maybe separate the parts that depend on the specific > algorithm to algorithm-agnostic functions.) > I haven't been following the "libpq compression" thread, but wouldn't that also do compression for the streaming case? That was my assumption, at least, and it seems like the right way - we probably don't want to patch every place that sends data over network independently, right? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/6/24 12:58, Julien Tachoires wrote: > ... > > When compiled with LZ4 support (--with-lz4), this patch enables data > compression/decompression of these temporary files. Each transaction > change that must be written on disk (ReorderBufferDiskChange) is now > compressed and encapsulated in a new structure. > I'm a bit confused, but why tie this to having lz4? Why shouldn't this be supported even for pglz, or whatever algorithms we add in the future? regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le ven. 7 juin 2024 à 05:59, Tomas Vondra <tomas.vondra@enterprisedb.com> a écrit : > > On 6/6/24 12:58, Julien Tachoires wrote: > > ... > > > > When compiled with LZ4 support (--with-lz4), this patch enables data > > compression/decompression of these temporary files. Each transaction > > change that must be written on disk (ReorderBufferDiskChange) is now > > compressed and encapsulated in a new structure. > > > > I'm a bit confused, but why tie this to having lz4? Why shouldn't this > be supported even for pglz, or whatever algorithms we add in the future? That's right, reworking this patch in that sense. Regards, JT
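The algorithm-agnostic split discussed in this thread — callers going through generic functions, with per-algorithm code isolated so that adding zstd or keeping pglz later means adding one entry rather than scattering #ifdefs — could look roughly like this method-table sketch. zlib stands in for the real codecs, and none of these names are PostgreSQL's:

```python
import zlib

# Hypothetical dispatch table: each method maps to a
# (compress, decompress) pair behind a common interface. Adding a new
# algorithm later is a single new entry here.
COMPRESSION_METHODS = {
    "none": (lambda data: data,
             lambda data, raw_size: data),
    "zlib": (zlib.compress,
             lambda data, raw_size: zlib.decompress(data)),
}

def compress_change(method: str, raw: bytes) -> bytes:
    compress, _ = COMPRESSION_METHODS[method]
    return compress(raw)

def restore_change(method: str, stored: bytes, raw_size: int) -> bytes:
    _, decompress = COMPRESSION_METHODS[method]
    return decompress(stored, raw_size)
```

The decompress callbacks take the raw size because some real APIs (LZ4's in particular) need the destination buffer size up front, while others (like zlib's one-shot interface here) can ignore it.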