Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Xuneng Zhou
Date:
Hi Hackers,

The SnapBuildPurgeOlderTxn function previously used a suboptimal
method to remove old XIDs from the committed.xip array. It allocated a
temporary workspace array, copied the surviving elements into it, and
then copied them back, incurring unnecessary memory allocation and
multiple data copies.

This patch refactors the logic to use a standard two-pointer, in-place
compaction algorithm. The new approach filters the array in a single
pass with no extra memory allocation, improving both CPU and memory
efficiency.

No behavioral changes are expected. This resolves a TODO comment
expecting a more efficient algorithm.

Best,
Xuneng
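A minimal sketch of the idea, for readers skimming the thread (the
attached patch is authoritative; this sketch only illustrates the shape
of the change, borrowing the committed.xip field names from snapbuild.c,
with purge_in_place() as a hypothetical name):

    /*
     * Illustrative sketch only: compact builder->committed.xip in place,
     * keeping every XID that does not precede builder->xmin. One read
     * offset, one write offset, no temporary workspace allocation.
     */
    static void
    purge_in_place(SnapBuild *builder)
    {
        size_t      read_off;
        size_t      write_off = 0;

        for (read_off = 0; read_off < builder->committed.xcnt; read_off++)
        {
            TransactionId xid = builder->committed.xip[read_off];

            /* survivors are written back over the array in the same pass */
            if (!NormalTransactionIdPrecedes(xid, builder->xmin))
                builder->committed.xip[write_off++] = xid;
        }
        builder->committed.xcnt = write_off;
    }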
Attachments
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Kirill Reshke
Date:
On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Hackers,

Hi!

> The SnapBuildPurgeOlderTxn function previously used a suboptimal
> method to remove old XIDs from the committed.xip array. It allocated a
> temporary workspace array, copied the surviving elements into it, and
> then copied them back, incurring unnecessary memory allocation and
> multiple data copies.
>
> This patch refactors the logic to use a standard two-pointer, in-place
> compaction algorithm. The new approach filters the array in a single
> pass with no extra memory allocation, improving both CPU and memory
> efficiency.
>
> No behavioral changes are expected. This resolves a TODO comment
> expecting a more efficient algorithm.
>

Indeed, these changes look correct.
I wonder why b89e151054a0 did this place this way, hope we do not miss
anything here.

Can we construct a microbenchmark here which will show some benefit?

--
Best regards,
Kirill Reshke
Hi, thanks for looking into this.

On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi Hackers,
>
> Hi!
>
> > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > method to remove old XIDs from the committed.xip array. It allocated a
> > temporary workspace array, copied the surviving elements into it, and
> > then copied them back, incurring unnecessary memory allocation and
> > multiple data copies.
> >
> > This patch refactors the logic to use a standard two-pointer, in-place
> > compaction algorithm. The new approach filters the array in a single
> > pass with no extra memory allocation, improving both CPU and memory
> > efficiency.
> >
> > No behavioral changes are expected. This resolves a TODO comment
> > expecting a more efficient algorithm.
> >
>
> Indeed, these changes look correct.
> I wonder why b89e151054a0 did this place this way, hope we do not miss
> anything here.

I think this small refactor does not introduce behavioral changes or
breaks given constraints.

> Can we construct a microbenchmark here which will show some benefit?
>

I prepared a simple microbenchmark to evaluate the impact of the
algorithm replacement. The attached results summarize the findings.
An end-to-end benchmark was not included, as this function is unlikely
to be a performance hotspot in typical decoding workloads—the array
being cleaned is expected to be relatively small under normal
operating conditions. However, its impact could become more noticeable
in scenarios with long-running transactions and a large number of
catalog-modifying DML or DDL operations.

Hardware:
AMD EPYC™ Genoa 9454P 48-core 4th generation
DDR5 ECC reg
NVMe SSD Datacenter Edition (Gen 4)

Best,
Xuneng
Attachments
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Kirill Reshke
Date:
On Mon, 20 Oct 2025 at 08:08, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi, thanks for looking into this.
>
> On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi Hackers,
> >
> > Hi!
> >
> > > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > > method to remove old XIDs from the committed.xip array. It allocated a
> > > temporary workspace array, copied the surviving elements into it, and
> > > then copied them back, incurring unnecessary memory allocation and
> > > multiple data copies.
> > >
> > > This patch refactors the logic to use a standard two-pointer, in-place
> > > compaction algorithm. The new approach filters the array in a single
> > > pass with no extra memory allocation, improving both CPU and memory
> > > efficiency.
> > >
> > > No behavioral changes are expected. This resolves a TODO comment
> > > expecting a more efficient algorithm.
> > >
> >
> > Indeed, these changes look correct.
> > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > anything here.
>
> I think this small refactor does not introduce behavioral changes or
> breaks given constraints.
>
> > Can we construct a microbenchmark here which will show some benefit?
> >
>
> I prepared a simple microbenchmark to evaluate the impact of the
> algorithm replacement. The attached results summarize the findings.
> An end-to-end benchmark was not included, as this function is unlikely
> to be a performance hotspot in typical decoding workloads—the array
> being cleaned is expected to be relatively small under normal
> operating conditions. However, its impact could become more noticeable
> in scenarios with long-running transactions and a large number of
> catalog-modifying DML or DDL operations.
>
> Hardware:
> AMD EPYC™ Genoa 9454P 48-core 4th generation
> DDR5 ECC reg
> NVMe SSD Datacenter Edition (Gen 4)
>
> Best,
> Xuneng

At first glance these results look satisfactory.

Can you please describe, how did you get your numbers? Maybe more
script or steps to reproduce, if anyone will be willing to...

--
Best regards,
Kirill Reshke
Hi,

On Mon, Oct 20, 2025 at 11:36 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Mon, 20 Oct 2025 at 08:08, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi, thanks for looking into this.
> >
> > On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > >
> > > On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > >
> > > > Hi Hackers,
> > >
> > > Hi!
> > >
> > > > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > > > method to remove old XIDs from the committed.xip array. It allocated a
> > > > temporary workspace array, copied the surviving elements into it, and
> > > > then copied them back, incurring unnecessary memory allocation and
> > > > multiple data copies.
> > > >
> > > > This patch refactors the logic to use a standard two-pointer, in-place
> > > > compaction algorithm. The new approach filters the array in a single
> > > > pass with no extra memory allocation, improving both CPU and memory
> > > > efficiency.
> > > >
> > > > No behavioral changes are expected. This resolves a TODO comment
> > > > expecting a more efficient algorithm.
> > > >
> > >
> > > Indeed, these changes look correct.
> > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > anything here.
> >
> > I think this small refactor does not introduce behavioral changes or
> > breaks given constraints.
> >
> > > Can we construct a microbenchmark here which will show some benefit?
> > >
> >
> > I prepared a simple microbenchmark to evaluate the impact of the
> > algorithm replacement. The attached results summarize the findings.
> > An end-to-end benchmark was not included, as this function is unlikely
> > to be a performance hotspot in typical decoding workloads—the array
> > being cleaned is expected to be relatively small under normal
> > operating conditions. However, its impact could become more noticeable
> > in scenarios with long-running transactions and a large number of
> > catalog-modifying DML or DDL operations.
> >
> > Hardware:
> > AMD EPYC™ Genoa 9454P 48-core 4th generation
> > DDR5 ECC reg
> > NVMe SSD Datacenter Edition (Gen 4)
> >
> > Best,
> > Xuneng
>
> At first glance these results look satisfactory.
>
> Can you please describe, how did you get your numbers? Maybe more
> script or steps to reproduce, if anyone will be willing to...
>

Sure. Here is a brief description of this experimental benchmark:

1) what Tier 1 measures

Function under test: committed.xip purge in SnapBuild
(OLD=workspace+memcpy vs NEW=in-place compaction).

Inputs:
Array sizes: 100, 500, 1000, 2000, 5000, 10000
Keep ratios: 0.9, 0.5, 0.1, 0.01
Distributions: scattered (Fisher–Yates shuffle), contiguous
Repetitions: 30 per scenario

RNG and determinism: pg_prng_state with seed 42 per dataset ensures
reproducibility.

Metrics recorded per scenario:
Time (ns): mean, median, p95
Survivors (count)
Memory traffic (bytes): bytes_read, bytes_written, bytes_total
OLD: reads = (xcnt + survivors) × sizeof(XID); writes = 2 × survivors
× sizeof(XID)
NEW: reads = xcnt × sizeof(XID); writes = survivors × sizeof(XID)
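To make the model concrete, here is a minimal, self-contained sketch of
the kind of kernel Tier 1 times for a single scenario. It is deliberately
free-standing C rather than backend code: clock_gettime() stands in for
PostgreSQL's instr_time.h, a plain integer comparison stands in for
NormalTransactionIdPrecedes(), and names like purge_inplace() are
hypothetical. (The actual extension, described next, does the equivalent
inside a backend.)

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    typedef uint32_t TransactionId;

    /* two-pointer, in-place compaction: keep XIDs that do not precede xmin */
    static uint32_t
    purge_inplace(TransactionId *xip, uint32_t xcnt, TransactionId xmin)
    {
        uint32_t keep = 0;

        for (uint32_t i = 0; i < xcnt; i++)
        {
            /* plain comparison stands in for !NormalTransactionIdPrecedes() */
            if (xip[i] >= xmin)
                xip[keep++] = xip[i];
        }
        return keep;
    }

    int
    main(void)
    {
        enum { XCNT = 10000 };
        static TransactionId xip[XCNT];
        TransactionId xmin = 9900;      /* keep ratio ~0.01 */
        struct timespec t0, t1;
        uint32_t survivors;
        uint64_t ns, bytes;

        for (uint32_t i = 0; i < XCNT; i++)
            xip[i] = i + 1;             /* "contiguous" dataset */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        survivors = purge_inplace(xip, XCNT, xmin);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        ns = (uint64_t) ((int64_t) (t1.tv_sec - t0.tv_sec) * 1000000000
                         + (t1.tv_nsec - t0.tv_nsec));

        /* NEW traffic model from above: reads = xcnt, writes = survivors */
        bytes = ((uint64_t) XCNT + survivors) * sizeof(TransactionId);

        printf("survivors=%" PRIu32 " time=%" PRIu64 "ns traffic=%" PRIu64 "B\n",
               survivors, ns, bytes);
        return 0;
    }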
2) The core components

# C Extension (snapbuild_bench.c) - The actual benchmark implementation

The C extension contains the actual benchmark implementation that runs
inside the PostgreSQL backend process. It's designed to:
- Mimic real PostgreSQL code paths as closely as possible
- Use actual PostgreSQL data structures (`TransactionId`, `MemoryContext`)
- Call real PostgreSQL functions (`NormalTransactionIdPrecedes`)
- Measure with nanosecond precision using PostgreSQL's timing infrastructure

# SQL Wrapper (snapbuild_bench--1.0.sql) - function definitions

# Orchestration Scripts - Automated benchmark execution and analysis
run_snapbuild_purge_bench

3) Execution Flow

1. Extension Installation
# Build and install
export PG_CONFIG=$HOME/pg/vanilla/bin/pg_config
make -C contrib_extension/snapbuild_bench clean install
# Create extension in database
CREATE EXTENSION snapbuild_bench;

3. Run full benchmark suite
./run_snapbuild_purge_bench.sh --clean --with-baseline <patch>

4. Data Analysis
# Generate plots
python3 plot_tier1_results.py --csv results/unit/base_unit.csv --out plots/
# Compare baseline vs patched
python3 compare_snapbuild_results.py vanilla/ patched/

TBH, the performance improvement from this refactor is fairly
straightforward, and it’s unlikely to introduce regressions. The
experimental benchmark is therefore more complex than necessary.
Still, I treated it as a learning exercise — an opportunity to
practice benchmarking methodology and hopefully to reuse some of these
techniques when evaluating more performance-critical paths in the
future. If anyone has suggestions or spots issues, I’d greatly
appreciate your feedback as well.

Best,
Xuneng
Attachments
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Kirill Reshke
Date:
On Mon, 20 Oct 2025 at 13:47, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Mon, Oct 20, 2025 at 11:36 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > On Mon, 20 Oct 2025 at 08:08, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi, thanks for looking into this.
> > >
> > > On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > > >
> > > > On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi Hackers,
> > > >
> > > > Hi!
> > > >
> > > > > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > > > > method to remove old XIDs from the committed.xip array. It allocated a
> > > > > temporary workspace array, copied the surviving elements into it, and
> > > > > then copied them back, incurring unnecessary memory allocation and
> > > > > multiple data copies.
> > > > >
> > > > > This patch refactors the logic to use a standard two-pointer, in-place
> > > > > compaction algorithm. The new approach filters the array in a single
> > > > > pass with no extra memory allocation, improving both CPU and memory
> > > > > efficiency.
> > > > >
> > > > > No behavioral changes are expected. This resolves a TODO comment
> > > > > expecting a more efficient algorithm.
> > > > >
> > > >
> > > > Indeed, these changes look correct.
> > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > anything here.
> > >
> > > I think this small refactor does not introduce behavioral changes or
> > > breaks given constraints.
> > >
> > > > Can we construct a microbenchmark here which will show some benefit?
> > > >
> > >
> > > I prepared a simple microbenchmark to evaluate the impact of the
> > > algorithm replacement. The attached results summarize the findings.
> > > An end-to-end benchmark was not included, as this function is unlikely
> > > to be a performance hotspot in typical decoding workloads—the array
> > > being cleaned is expected to be relatively small under normal
> > > operating conditions. However, its impact could become more noticeable
> > > in scenarios with long-running transactions and a large number of
> > > catalog-modifying DML or DDL operations.
> > >
> > > Hardware:
> > > AMD EPYC™ Genoa 9454P 48-core 4th generation
> > > DDR5 ECC reg
> > > NVMe SSD Datacenter Edition (Gen 4)
> > >
> > > Best,
> > > Xuneng
> >
> > At first glance these results look satisfactory.
> >
> > Can you please describe, how did you get your numbers? Maybe more
> > script or steps to reproduce, if anyone will be willing to...
> >
>
> Sure. Here is a brief description of this experimental benchmark:
>
> 1) what Tier 1 measures
>
> Function under test: committed.xip purge in SnapBuild
> (OLD=workspace+memcpy vs NEW=in-place compaction).
>
> Inputs:
> Array sizes: 100, 500, 1000, 2000, 5000, 10000
> Keep ratios: 0.9, 0.5, 0.1, 0.01
> Distributions: scattered (Fisher–Yates shuffle), contiguous
> Repetitions: 30 per scenario
>
> RNG and determinism: pg_prng_state with seed 42 per dataset ensures
> reproducibility.
>
> Metrics recorded per scenario:
> Time (ns): mean, median, p95
> Survivors (count)
> Memory traffic (bytes): bytes_read, bytes_written, bytes_total
> OLD: reads = (xcnt + survivors) × sizeof(XID); writes = 2 × survivors
> × sizeof(XID)
> NEW: reads = xcnt × sizeof(XID); writes = survivors × sizeof(XID)
>
> 2) The core components
>
> # C Extension (snapbuild_bench.c) - The actual benchmark implementation
>
> The C extension contains the actual benchmark implementation that runs
> inside the PostgreSQL backend process. It's designed to:
> - Mimic real PostgreSQL code paths as closely as possible
> - Use actual PostgreSQL data structures (`TransactionId`, `MemoryContext`)
> - Call real PostgreSQL functions (`NormalTransactionIdPrecedes`)
> - Measure with nanosecond precision using PostgreSQL's timing infrastructure
>
> # SQL Wrapper (snapbuild_bench--1.0.sql) - function definitions
>
> # Orchestration Scripts - Automated benchmark execution and analysis
> run_snapbuild_purge_bench
>
> 3) Execution Flow
>
> 1. Extension Installation
> # Build and install
> export PG_CONFIG=$HOME/pg/vanilla/bin/pg_config
> make -C contrib_extension/snapbuild_bench clean install
> # Create extension in database
> CREATE EXTENSION snapbuild_bench;
>
> 3. Run full benchmark suite
> ./run_snapbuild_purge_bench.sh --clean --with-baseline <patch>
>
> 4. Data Analysis
> # Generate plots
> python3 plot_tier1_results.py --csv results/unit/base_unit.csv --out plots/
> # Compare baseline vs patched
> python3 compare_snapbuild_results.py vanilla/ patched/

Thank you.

> TBH, the performance improvement from this refactor is fairly
> straightforward, and it’s unlikely to introduce regressions.

Sure.

> The
> experimental benchmark is therefore more complex than necessary.
> Still, I treated it as a learning exercise — an opportunity to
> practice benchmarking methodology and hopefully to reuse some of these
> techniques when evaluating more performance-critical paths in the
> future. If anyone has suggestions or spots issues, I’d greatly
> appreciate your feedback as well.
>
> Best,
> Xuneng

I think this patch is in committable state; should we change its CF
entry accordingly?

--
Best regards,
Kirill Reshke
Hi,
On Mon, Oct 20, 2025 at 6:07 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
On Mon, 20 Oct 2025 at 13:47, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Mon, Oct 20, 2025 at 11:36 AM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > On Mon, 20 Oct 2025 at 08:08, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi, thanks for looking into this.
> > >
> > > On Sat, Oct 18, 2025 at 4:59 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > > >
> > > > On Sat, 18 Oct 2025 at 12:50, Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > > > >
> > > > > Hi Hackers,
> > > >
> > > > Hi!
> > > >
> > > > > The SnapBuildPurgeOlderTxn function previously used a suboptimal
> > > > > method to remove old XIDs from the committed.xip array. It allocated a
> > > > > temporary workspace array, copied the surviving elements into it, and
> > > > > then copied them back, incurring unnecessary memory allocation and
> > > > > multiple data copies.
> > > > >
> > > > > This patch refactors the logic to use a standard two-pointer, in-place
> > > > > compaction algorithm. The new approach filters the array in a single
> > > > > pass with no extra memory allocation, improving both CPU and memory
> > > > > efficiency.
> > > > >
> > > > > No behavioral changes are expected. This resolves a TODO comment
> > > > > expecting a more efficient algorithm.
> > > > >
> > > >
> > > > Indeed, these changes look correct.
> > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > anything here.
> > >
> > > I think this small refactor does not introduce behavioral changes or
> > > breaks given constraints.
> > >
> > > > Can we construct a microbenchmark here which will show some benefit?
> > > >
> > >
> > > I prepared a simple microbenchmark to evaluate the impact of the
> > > algorithm replacement. The attached results summarize the findings.
> > > An end-to-end benchmark was not included, as this function is unlikely
> > > to be a performance hotspot in typical decoding workloads—the array
> > > being cleaned is expected to be relatively small under normal
> > > operating conditions. However, its impact could become more noticeable
> > > in scenarios with long-running transactions and a large number of
> > > catalog-modifying DML or DDL operations.
> > >
> > > Hardware:
> > > AMD EPYC™ Genoa 9454P 48-core 4th generation
> > > DDR5 ECC reg
> > > NVMe SSD Datacenter Edition (Gen 4)
> > >
> > > Best,
> > > Xuneng
> >
> > At first glance these results look satisfactory.
> >
> > Can you please describe, how did you get your numbers? Maybe more
> > script or steps to reproduce, if anyone will be willing to...
> >
>
> Sure. Here is a brief description of this experimental benchmark:
>
> 1) what Tier 1 measures
>
> Function under test: committed.xip purge in SnapBuild
> (OLD=workspace+memcpy vs NEW=in-place compaction).
>
> Inputs:
> Array sizes: 100, 500, 1000, 2000, 5000, 10000
> Keep ratios: 0.9, 0.5, 0.1, 0.01
> Distributions: scattered (Fisher–Yates shuffle), contiguous
> Repetitions: 30 per scenario
>
> RNG and determinism: pg_prng_state with seed 42 per dataset ensures
> reproducibility.
>
> Metrics recorded per scenario:
> Time (ns): mean, median, p95
> Survivors (count)
> Memory traffic (bytes): bytes_read, bytes_written, bytes_total
> OLD: reads = (xcnt + survivors) × sizeof(XID); writes = 2 × survivors
> × sizeof(XID)
> NEW: reads = xcnt × sizeof(XID); writes = survivors × sizeof(XID)
>
> 2) The core components
>
> # C Extension (snapbuild_bench.c) - The actual benchmark implementation
>
> The C extension contains the actual benchmark implementation that runs
> inside the PostgreSQL backend process. It's designed to:
> - Mimic real PostgreSQL code paths as closely as possible
> - Use actual PostgreSQL data structures (`TransactionId`, `MemoryContext`)
> - Call real PostgreSQL functions (`NormalTransactionIdPrecedes`)
> - Measure with nanosecond precision using PostgreSQL's timing infrastructure
>
> # SQL Wrapper (snapbuild_bench--1.0.sql) - function definitions
>
> # Orchestration Scripts - Automated benchmark execution and analysis
> run_snapbuild_purge_bench
>
> 3) Execution Flow
>
> 1. Extension Installation
> # Build and install
> export PG_CONFIG=$HOME/pg/vanilla/bin/pg_config
> make -C contrib_extension/snapbuild_bench clean install
> # Create extension in database
> CREATE EXTENSION snapbuild_bench;
>
> 3. Run full benchmark suite
> ./run_snapbuild_purge_bench.sh --clean --with-baseline <patch>
>
> 4. Data Analysis
> # Generate plots
> python3 plot_tier1_results.py --csv results/unit/base_unit.csv --out plots/
> # Compare baseline vs patched
> python3 compare_snapbuild_results.py vanilla/ patched/
Thank you.
> TBH, the performance improvement from this refactor is fairly
> straightforward, and it’s unlikely to introduce regressions.
Sure.
> The
> experimental benchmark is therefore more complex than necessary.
> Still, I treated it as a learning exercise — an opportunity to
> practice benchmarking methodology and hopefully to reuse some of these
> techniques when evaluating more performance-critical paths in the
> future. If anyone has suggestions or spots issues, I’d greatly
> appreciate your feedback as well.
>
> Best,
> Xuneng
I think this patch is in committable state; should we change its CF
entry accordingly?
Feel free to do so if you agree. Thanks again for your review!
Best,
Xuneng
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Michael Paquier
Date:
On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> Indeed, these changes look correct.
> I wonder why b89e151054a0 did this place this way, hope we do not miss
> anything here.

Perhaps a lack of time back in 2014? It feels like an item where we
would need to research a bit some of the past threads, and see if this
has been discussed, or if there were other potential alternatives
discussed. This is not saying that what you are doing in this
proposal is actually bad, but it's a bit hard to say what an
"algorithm" should look like in this specific code path with XID
manipulations. Perhaps since 2014, we may have other places in the
tree that share similar characteristics as what's done here.

So it feels like this needs a bit more historical investigation first,
rather than saying that your proposal is the best choice on the table.
--
Michael
Attachments
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Kirill Reshke
Date:
On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > Indeed, these changes look correct.
> > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > anything here.
>
> Perhaps a lack of time back in 2014? It feels like an item where we
> would need to research a bit some of the past threads, and see if this
> has been discussed, or if there were other potential alternatives
> discussed. This is not saying that what you are doing in this
> proposal is actually bad, but it's a bit hard to say what an
> "algorithm" should look like in this specific code path with XID
> manipulations. Perhaps since 2014, we may have other places in the
> tree that share similar characteristics as what's done here.
>
> So it feels like this needs a bit more historical investigation first,
> rather than saying that your proposal is the best choice on the table.
> --
> Michael

Sure.

Commit b89e151054a0 comes from [0].
The comment of SnapBuildPurgeCommittedTxn tracks back to [1] (it was in
the form "XXX: Neater algorithm?").

Between these two messages, it was not discussed...

I will also study other related threads like [2], but I don't think
they will give more insight here.

[0] https://www.postgresql.org/message-id/20140303162652.GB16654%40awork2.anarazel.de
[1] https://www.postgresql.org/message-id/20140115002223.GA17204%40awork2.anarazel.de
[2] https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de#:~:text=20130914204913.GA4071%40awork2.anarazel.de

--
Best regards,
Kirill Reshke
Hi,

Thanks for looking into this.

On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
>
> On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > Indeed, these changes look correct.
> > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > anything here.
> >
> > Perhaps a lack of time back in 2014? It feels like an item where we
> > would need to research a bit some of the past threads, and see if this
> > has been discussed, or if there were other potential alternatives
> > discussed. This is not saying that what you are doing in this
> > proposal is actually bad, but it's a bit hard to say what an
> > "algorithm" should look like in this specific code path with XID
> > manipulations. Perhaps since 2014, we may have other places in the
> > tree that share similar characteristics as what's done here.
> >
> > So it feels like this needs a bit more historical investigation first,
> > rather than saying that your proposal is the best choice on the table.
> > --
> > Michael
>
> Sure.
>
> Commit b89e151054a0 comes from [0].
> The comment of SnapBuildPurgeCommittedTxn tracks back to [1] (it was in
> the form "XXX: Neater algorithm?").
>
> Between these two messages, it was not discussed...
>
> I will also study other related threads like [2], but I don't think
> they will give more insight here.
>
> [0] https://www.postgresql.org/message-id/20140303162652.GB16654%40awork2.anarazel.de
> [1] https://www.postgresql.org/message-id/20140115002223.GA17204%40awork2.anarazel.de
> [2] https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de#:~:text=20130914204913.GA4071%40awork2.anarazel.de

Introducing logical decoding was a major feature, and I assume there
were more critical design decisions and trade-offs to consider at that
time.

I proposed this refactor not because it is the most significant
optimization, but because it seems to be a low-hanging fruit with
clear benefits. By using an in-place two-pointer compaction, we can
eliminate the extra memory allocation and copy-back step without
introducing risks to this well-tested code path.

A comparable optimization exists in KnownAssignedXidsCompress() which
uses the same algorithm to remove stale XIDs without workspace
allocation. That implementation also adds a lazy compaction heuristic
that delays compaction until a threshold of removed entries is
reached, amortizing the O(N) cost across multiple operations.

The comment above the data structure mentions the trade-off of keeping
the committed.xip array sorted versus unsorted. If the array were
sorted, we could use a binary search combined with memmove to compact
it efficiently, achieving O(log n + n) complexity for purging.
However, that design would increase the complexity of
SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
wraparound".

/*
 * Array of committed transactions that have modified the catalog.
 *
 * As this array is frequently modified we do *not* keep it in
 * xidComparator order. Instead we sort the array when building &
 * distributing a snapshot.
 *
 * TODO: It's unclear whether that reasoning has much merit. Every
 * time we add something here after becoming consistent will also
 * require distributing a snapshot. Storing them sorted would
 * potentially also make it easier to purge (but more complicated wrt
 * wraparound?). Should be improved if sorting while building the
 * snapshot shows up in profiles.
 */
TransactionId *xip;
} committed;

/*
 * Keep track of a new catalog changing transaction that has committed.
 */
static void
SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
{
    Assert(TransactionIdIsValid(xid));

    if (builder->committed.xcnt == builder->committed.xcnt_space)
    {
        builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;

        elog(DEBUG1, "increasing space for committed transactions to %u",
             (uint32) builder->committed.xcnt_space);

        builder->committed.xip = repalloc(builder->committed.xip,
                                          builder->committed.xcnt_space * sizeof(TransactionId));
    }

    /*
     * TODO: It might make sense to keep the array sorted here instead of
     * doing it every time we build a new snapshot. On the other hand this
     * gets called repeatedly when a transaction with subtransactions commits.
     */
    builder->committed.xip[builder->committed.xcnt++] = xid;
}

It might be worth profiling this function to evaluate whether
maintaining a sorted array could bring potential benefits, although
accurately measuring its end-to-end impact could be difficult if it
isn’t a known hotspot. I also did a brief search on the mailing list
and found no reports of performance concerns or related proposals to
optimize this part of the code.

[1] https://www.postgresql.org/search/?m=1&q=SnapBuildPurgeOlderTxn&l=1&d=-1&s=r
[2] https://www.postgresql.org/search/?m=1&q=committed.xip&l=1&d=-1&s=r&p=2

Best,
Xuneng
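Picking up the sorted-array idea above, a rough sketch of what such a
purge could look like, written stand-alone with a plain monotonic uint32
ordering and deliberately ignoring the XID-wraparound complication the
TODO warns about (purge_sorted() and these simplified types are
hypothetical, not part of any patch):

    #include <stdint.h>
    #include <string.h>

    typedef uint32_t TransactionId;

    /*
     * Hypothetical purge for a sorted committed.xip: O(log n) binary
     * search for the cut-off, then one memmove of the survivors. Real
     * code would have to cope with XID wraparound, which is exactly the
     * "more complicated wrt wraparound" caveat quoted above.
     */
    static void
    purge_sorted(TransactionId *xip, uint32_t *xcnt, TransactionId xmin)
    {
        uint32_t lo = 0;
        uint32_t hi = *xcnt;

        /* find the first XID that does not precede xmin */
        while (lo < hi)
        {
            uint32_t mid = lo + (hi - lo) / 2;

            if (xip[mid] < xmin)
                lo = mid + 1;
            else
                hi = mid;
        }

        if (lo > 0)
        {
            memmove(xip, xip + lo, (*xcnt - lo) * sizeof(TransactionId));
            *xcnt -= lo;
        }
    }

The insertion side is the trade-off: keeping the array sorted would turn
the O(1) append in SnapBuildAddCommittedTxn into a worst-case O(n)
insertion, as noted above.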
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Masahiko Sawada
Date:
On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> Thanks for looking into this.
>
> On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> >
> > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > Indeed, these changes look correct.
> > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > anything here.
> > >
> > > Perhaps a lack of time back in 2014? It feels like an item where we
> > > would need to research a bit some of the past threads, and see if this
> > > has been discussed, or if there were other potential alternatives
> > > discussed. This is not saying that what you are doing in this
> > > proposal is actually bad, but it's a bit hard to say what an
> > > "algorithm" should look like in this specific code path with XID
> > > manipulations. Perhaps since 2014, we may have other places in the
> > > tree that share similar characteristics as what's done here.
> > >
> > > So it feels like this needs a bit more historical investigation first,
> > > rather than saying that your proposal is the best choice on the table.
> > > --
> > > Michael
> >
> > Sure.
> >
> > Commit b89e151054a0 comes from [0].
> > The comment of SnapBuildPurgeCommittedTxn tracks back to [1] (it was in
> > the form "XXX: Neater algorithm?").
> >
> > Between these two messages, it was not discussed...
> >
> > I will also study other related threads like [2], but I don't think
> > they will give more insight here.
> >
> > [0] https://www.postgresql.org/message-id/20140303162652.GB16654%40awork2.anarazel.de
> > [1] https://www.postgresql.org/message-id/20140115002223.GA17204%40awork2.anarazel.de
> > [2] https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de#:~:text=20130914204913.GA4071%40awork2.anarazel.de
>
> Introducing logical decoding was a major feature, and I assume there
> were more critical design decisions and trade-offs to consider at that
> time.
>
> I proposed this refactor not because it is the most significant
> optimization, but because it seems to be a low-hanging fruit with
> clear benefits. By using an in-place two-pointer compaction, we can
> eliminate the extra memory allocation and copy-back step without
> introducing risks to this well-tested code path.

I agree the proposed in-place update is better than the current
copy-and-iterating approach. While the benefit might not be visible as
it's not a known hot path, I find that the proposed patch makes sense
to improve the current code. It could help some workloads where there
are many DDLs (e.g., creating temporary tables in many transactions).

>
> A comparable optimization exists in KnownAssignedXidsCompress() which
> uses the same algorithm to remove stale XIDs without workspace
> allocation. That implementation also adds a lazy compaction heuristic
> that delays compaction until a threshold of removed entries is
> reached, amortizing the O(N) cost across multiple operations.
>
> The comment above the data structure mentions the trade-off of keeping
> the committed.xip array sorted versus unsorted. If the array were
> sorted, we could use a binary search combined with memmove to compact
> it efficiently, achieving O(log n + n) complexity for purging.
> However, that design would increase the complexity of
> SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> wraparound".
>
> /*
>  * Array of committed transactions that have modified the catalog.
>  *
>  * As this array is frequently modified we do *not* keep it in
>  * xidComparator order. Instead we sort the array when building &
>  * distributing a snapshot.
>  *
>  * TODO: It's unclear whether that reasoning has much merit. Every
>  * time we add something here after becoming consistent will also
>  * require distributing a snapshot. Storing them sorted would
>  * potentially also make it easier to purge (but more complicated wrt
>  * wraparound?). Should be improved if sorting while building the
>  * snapshot shows up in profiles.
>  */
> TransactionId *xip;
> } committed;
>
> /*
>  * Keep track of a new catalog changing transaction that has committed.
>  */
> static void
> SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> {
>     Assert(TransactionIdIsValid(xid));
>
>     if (builder->committed.xcnt == builder->committed.xcnt_space)
>     {
>         builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
>
>         elog(DEBUG1, "increasing space for committed transactions to %u",
>              (uint32) builder->committed.xcnt_space);
>
>         builder->committed.xip = repalloc(builder->committed.xip,
>                                           builder->committed.xcnt_space * sizeof(TransactionId));
>     }
>
>     /*
>      * TODO: It might make sense to keep the array sorted here instead of
>      * doing it every time we build a new snapshot. On the other hand this
>      * gets called repeatedly when a transaction with subtransactions commits.
>      */
>     builder->committed.xip[builder->committed.xcnt++] = xid;
> }
>
> It might be worth profiling this function to evaluate whether
> maintaining a sorted array could bring potential benefits, although
> accurately measuring its end-to-end impact could be difficult if it
> isn’t a known hotspot. I also did a brief search on the mailing list
> and found no reports of performance concerns or related proposals to
> optimize this part of the code.

It might also be worth researching what kind of workloads would need a
better algorithm in terms of storing/updating xip and subxip arrays
since it would be the primary motivation. Also, otherwise we cannot
measure the real-world impact of a new algorithm. Having said that,
I find that it should be discussed and developed separately from the
proposed patch on this thread.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi Sawada-san, Michael,

Thanks for your comments on this patch.

On Thu, Oct 23, 2025 at 8:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > Thanks for looking into this.
> >
> > On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > >
> > > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
> > > >
> > > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > > Indeed, these changes look correct.
> > > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > > anything here.
> > > >
> > > > Perhaps a lack of time back in 2014? It feels like an item where we
> > > > would need to research a bit some of the past threads, and see if this
> > > > has been discussed, or if there were other potential alternatives
> > > > discussed. This is not saying that what you are doing in this
> > > > proposal is actually bad, but it's a bit hard to say what an
> > > > "algorithm" should look like in this specific code path with XID
> > > > manipulations. Perhaps since 2014, we may have other places in the
> > > > tree that share similar characteristics as what's done here.
> > > >
> > > > So it feels like this needs a bit more historical investigation first,
> > > > rather than saying that your proposal is the best choice on the table.

Following these suggestions, I carefully searched the mailing list
archives and found no reports of performance issues directly related
to this code path. I also examined other parts of the codebase for
similar patterns. Components like integerset might share some
characteristics with SnapBuildPurgeOlderTxn, but they have constraints
that make them not directly applicable here. I am not very familiar
with the whole tree, so the investigation might not be exhaustive.

> > >
> > > Commit b89e151054a0 comes from [0].
> > > The comment of SnapBuildPurgeCommittedTxn tracks back to [1] (it was in
> > > the form "XXX: Neater algorithm?").
> > >
> > > Between these two messages, it was not discussed...
> > >
> > > I will also study other related threads like [2], but I don't think
> > > they will give more insight here.
> > >
> > > [0] https://www.postgresql.org/message-id/20140303162652.GB16654%40awork2.anarazel.de
> > > [1] https://www.postgresql.org/message-id/20140115002223.GA17204%40awork2.anarazel.de
> > > [2] https://www.postgresql.org/message-id/flat/20130914204913.GA4071%40awork2.anarazel.de#:~:text=20130914204913.GA4071%40awork2.anarazel.de
> >
> > Introducing logical decoding was a major feature, and I assume there
> > were more critical design decisions and trade-offs to consider at that
> > time.
> >
> > I proposed this refactor not because it is the most significant
> > optimization, but because it seems to be a low-hanging fruit with
> > clear benefits. By using an in-place two-pointer compaction, we can
> > eliminate the extra memory allocation and copy-back step without
> > introducing risks to this well-tested code path.
>
> I agree the proposed in-place update is better than the current
> copy-and-iterating approach. While the benefit might not be visible as
> it's not a known hot path, I find that the proposed patch makes sense
> to improve the current code. It could help some workloads where there
> are many DDLs (e.g., creating temporary tables in many transactions).
>

I also agree this approach offers better efficiency within the scope of
current SnapBuildPurgeOlderTxn.

> >
> > A comparable optimization exists in KnownAssignedXidsCompress() which
> > uses the same algorithm to remove stale XIDs without workspace
> > allocation. That implementation also adds a lazy compaction heuristic
> > that delays compaction until a threshold of removed entries is
> > reached, amortizing the O(N) cost across multiple operations.
> >
> > The comment above the data structure mentions the trade-off of keeping
> > the committed.xip array sorted versus unsorted. If the array were
> > sorted, we could use a binary search combined with memmove to compact
> > it efficiently, achieving O(log n + n) complexity for purging.
> > However, that design would increase the complexity of
> > SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> > wraparound".
> >
> > /*
> >  * Array of committed transactions that have modified the catalog.
> >  *
> >  * As this array is frequently modified we do *not* keep it in
> >  * xidComparator order. Instead we sort the array when building &
> >  * distributing a snapshot.
> >  *
> >  * TODO: It's unclear whether that reasoning has much merit. Every
> >  * time we add something here after becoming consistent will also
> >  * require distributing a snapshot. Storing them sorted would
> >  * potentially also make it easier to purge (but more complicated wrt
> >  * wraparound?). Should be improved if sorting while building the
> >  * snapshot shows up in profiles.
> >  */
> > TransactionId *xip;
> > } committed;
> >
> > /*
> >  * Keep track of a new catalog changing transaction that has committed.
> >  */
> > static void
> > SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> > {
> >     Assert(TransactionIdIsValid(xid));
> >
> >     if (builder->committed.xcnt == builder->committed.xcnt_space)
> >     {
> >         builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
> >
> >         elog(DEBUG1, "increasing space for committed transactions to %u",
> >              (uint32) builder->committed.xcnt_space);
> >
> >         builder->committed.xip = repalloc(builder->committed.xip,
> >                                           builder->committed.xcnt_space * sizeof(TransactionId));
> >     }
> >
> >     /*
> >      * TODO: It might make sense to keep the array sorted here instead of
> >      * doing it every time we build a new snapshot. On the other hand this
> >      * gets called repeatedly when a transaction with subtransactions commits.
> >      */
> >     builder->committed.xip[builder->committed.xcnt++] = xid;
> > }
> >
> > It might be worth profiling this function to evaluate whether
> > maintaining a sorted array could bring potential benefits, although
> > accurately measuring its end-to-end impact could be difficult if it
> > isn’t a known hotspot. I also did a brief search on the mailing list
> > and found no reports of performance concerns or related proposals to
> > optimize this part of the code.
>
> It might also be worth researching what kind of workloads would need a
> better algorithm in terms of storing/updating xip and subxip arrays
> since it would be the primary motivation. Also, otherwise we cannot
> measure the real-world impact of a new algorithm. Having said that,
> I find that it should be discussed and developed separately from the
> proposed patch on this thread.

+1 for researching the workloads that might need a sorted array and
more efficient algorithm. This exploration isn’t limited to the scope
of SnapBuildPurgeOlderTxn but relates more broadly to the overall
snapbuild process, which might be worth discussing in a separate
thread as suggested.

* TODO: It's unclear whether that reasoning has much merit. Every
* time we add something here after becoming consistent will also
* require distributing a snapshot. Storing them sorted would
* potentially also make it easier to purge (but more complicated wrt
* wraparound?). Should be improved if sorting while building the
* snapshot shows up in profiles.

I also constructed an artificial workload to try to surface the qsort
call in SnapBuildBuildSnapshot, though such a scenario seems very
unlikely to occur in production.

for ((c=1; c<=DDL_CLIENTS; c++)); do
(
local seq=1
while (( $(date +%s) < tB_end )); do
local tbl="hp_ddl_${c}_$seq"
"$psql" -h 127.0.0.1 -p "$PORT" -d postgres -c "
BEGIN;
CREATE TABLE ${tbl} (id int, data text);
CREATE INDEX idx_${tbl} ON ${tbl} (id);
INSERT INTO ${tbl} VALUES ($seq, 'd');
DROP TABLE ${tbl};
COMMIT;" >/dev/null 2>&1 || true
seq=$((seq+1))
done
)

97.02%  0.00%  postgres  postgres  [.] postmaster_child_launch
        |
        ---postmaster_child_launch
           |
           |--94.93%--BackendMain
           |          PostgresMain
           |          exec_replication_command
           |          StartLogicalReplication
           |          |
           |          --94.92%--WalSndLoop
           |                    |
           |                    |--92.24%--XLogSendLogical
           |                    |          |
           |                    |          --91.63%--LogicalDecodingProcessRecord
           |                    |                    |
           |                    |                    |--89.55%--xact_decode
           |                    |                    |          |
           |                    |                    |          --89.23%--DecodeCommit
           |                    |                    |                    |
           |                    |                    |                    |--64.64%--SnapBuildCommitTxn
           |                    |                    |                    |          |
           |                    |                    |                    |          |--62.60%--SnapBuildBuildSnapshot
           |                    |                    |                    |          |          |
           |                    |                    |                    |          |          --62.33%--pg_qsort
           |                    |                    |                    |          |                    |
           |                    |                    |                    |          |                    |--56.84%--pg_qsort

Best,
Xuneng
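For context on the profile above (an illustration paraphrased from
snapbuild.c rather than copied; see the actual source for the exact
code): the pg_qsort frames come from SnapBuildBuildSnapshot() sorting
the snapshot's copy of the committed.xip array, roughly:

    /* sketch of the relevant steps in SnapBuildBuildSnapshot() */
    memcpy(snapshot->xip, builder->committed.xip,
           builder->committed.xcnt * sizeof(TransactionId));
    snapshot->xcnt = builder->committed.xcnt;

    /* keep the snapshot's xip sorted so consumers can probe it with bsearch() */
    qsort(snapshot->xip, snapshot->xcnt,
          sizeof(TransactionId), xidComparator);

So the sort cost grows as O(n log n) in the number of tracked
catalog-modifying committed transactions, which is why long-running
transactions combined with heavy DDL churn make it visible.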
Re: Optimize SnapBuildPurgeOlderTxn: use in-place compaction instead of temporary array
From
Masahiko Sawada
Date:
On Thu, Oct 23, 2025 at 1:17 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi Sawada-san, Michael,
>
> Thanks for your comments on this patch.
>
> On Thu, Oct 23, 2025 at 8:28 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Oct 20, 2025 at 11:13 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Thanks for looking into this.
> > >
> > > On Tue, Oct 21, 2025 at 1:05 PM Kirill Reshke <reshkekirill@gmail.com> wrote:
> > > >
> > > > On Tue, 21 Oct 2025 at 04:31, Michael Paquier <michael@paquier.xyz> wrote:
> > > > >
> > > > > On Sat, Oct 18, 2025 at 01:59:40PM +0500, Kirill Reshke wrote:
> > > > > > Indeed, these changes look correct.
> > > > > > I wonder why b89e151054a0 did this place this way, hope we do not miss
> > > > > > anything here.
> > > > >
> > > > > Perhaps a lack of time back in 2014? It feels like an item where we
> > > > > would need to research a bit some of the past threads, and see if this
> > > > > has been discussed, or if there were other potential alternatives
> > > > > discussed. This is not saying that what you are doing in this
> > > > > proposal is actually bad, but it's a bit hard to say what an
> > > > > "algorithm" should look like in this specific code path with XID
> > > > > manipulations. Perhaps since 2014, we may have other places in the
> > > > > tree that share similar characteristics as what's done here.
> > > > >
> > > > > So it feels like this needs a bit more historical investigation first,
> > > > > rather than saying that your proposal is the best choice on the table.
>
> Following these suggestions, I carefully searched the mailing list
> archives and found no reports of performance issues directly related
> to this code path. I also examined other parts of the codebase for
> similar patterns. Components like integerset might share some
> characteristics with SnapBuildPurgeOlderTxn, but they have constraints
> that make them not directly applicable here. I am not very familiar
> with the whole tree, so the investigation might not be exhaustive.
Thank you for looking through the archives. In logical replication,
performance problems typically show up as replication delays. Since
logical replication involves many different components and processes,
it's quite rare for investigations to trace problems back to this
specific piece of code. However, I still believe it's important to
optimize the performance of logical decoding itself.
>
> > >
> > > A comparable optimization exists in KnownAssignedXidsCompress() which
> > > uses the same algorithm to remove stale XIDs without workspace
> > > allocation. That implementation also adds a lazy compaction heuristic
> > > that delays compaction until a threshold of removed entries is
> > > reached, amortizing the O(N) cost across multiple operations.
> > >
> > > The comment above the data structure mentions the trade-off of keeping
> > > the committed.xip array sorted versus unsorted. If the array were
> > > sorted, we could use a binary search combined with memmove to compact
> > > it efficiently, achieving O(log n + n) complexity for purging.
> > > However, that design would increase the complexity of
> > > SnapBuildAddCommittedTxn from O(1) to O(n) and "more complicated wrt
> > > wraparound".
> > >
> > > /*
> > > * Array of committed transactions that have modified the catalog.
> > > *
> > > * As this array is frequently modified we do *not* keep it in
> > > * xidComparator order. Instead we sort the array when building &
> > > * distributing a snapshot.
> > > *
> > > * TODO: It's unclear whether that reasoning has much merit. Every
> > > * time we add something here after becoming consistent will also
> > > * require distributing a snapshot. Storing them sorted would
> > > * potentially also make it easier to purge (but more complicated wrt
> > > * wraparound?). Should be improved if sorting while building the
> > > * snapshot shows up in profiles.
> > > */
> > > TransactionId *xip;
> > > } committed;
> > >
> > > /*
> > > * Keep track of a new catalog changing transaction that has committed.
> > > */
> > > static void
> > > SnapBuildAddCommittedTxn(SnapBuild *builder, TransactionId xid)
> > > {
> > > Assert(TransactionIdIsValid(xid));
> > >
> > > if (builder->committed.xcnt == builder->committed.xcnt_space)
> > > {
> > > builder->committed.xcnt_space = builder->committed.xcnt_space * 2 + 1;
> > >
> > > elog(DEBUG1, "increasing space for committed transactions to %u",
> > > (uint32) builder->committed.xcnt_space);
> > >
> > > builder->committed.xip = repalloc(builder->committed.xip,
> > > builder->committed.xcnt_space * sizeof(TransactionId));
> > > }
> > >
> > > /*
> > > * TODO: It might make sense to keep the array sorted here instead of
> > > * doing it every time we build a new snapshot. On the other hand this
> > > * gets called repeatedly when a transaction with subtransactions commits.
> > > */
> > > builder->committed.xip[builder->committed.xcnt++] = xid;
> > > }
> > >
> > > It might be worth profiling this function to evaluate whether
> > > maintaining a sorted array could bring potential benefits, although
> > > accurately measuring its end-to-end impact could be difficult if it
> > > isn’t a known hotspot. I also did a brief search on the mailing list
> > > and found no reports of performance concerns or related proposals to
> > > optimize this part of the code.
> >
> > It might also be worth researching what kind of workloads would need a
> > better algorithm in terms of storing/updating xip and subxip arrays
> > since it would be the primary motivation. Also, otherwise we cannot
> > measure the real-world impact of a new algorithm. Having said that,
> > I find that it should be discussed and developed separately from the
> > proposed patch on this thread.
>
> +1 for researching the workloads that might need a sorted array and
> more efficient algorithm. This exploration isn’t limited to the scope
> of SnapBuildPurgeOlderTxn but relates more broadly to the overall
> snapbuild process, which might be worth discussing in a separate
> thread as suggested.
>
> * TODO: It's unclear whether that reasoning has much merit. Every
> * time we add something here after becoming consistent will also
> * require distributing a snapshot. Storing them sorted would
> * potentially also make it easier to purge (but more complicated wrt
> * wraparound?). Should be improved if sorting while building the
> * snapshot shows up in profiles.
>
> I also constructed an artificial workload to try to surface the qsort
> call in SnapBuildBuildSnapshot, though such a scenario seems very
> unlikely to occur in production.
>
> for ((c=1; c<=DDL_CLIENTS; c++)); do
> (
> local seq=1
> while (( $(date +%s) < tB_end )); do
> local tbl="hp_ddl_${c}_$seq"
> "$psql" -h 127.0.0.1 -p "$PORT" -d postgres -c "
> BEGIN;
> CREATE TABLE ${tbl} (id int, data text);
> CREATE INDEX idx_${tbl} ON ${tbl} (id);
> INSERT INTO ${tbl} VALUES ($seq, 'd');
> DROP TABLE ${tbl};
> COMMIT;" >/dev/null 2>&1 || true
> seq=$((seq+1))
> done
> )
Interesting. To be honest, I think this scenario might actually occur
in practice, especially in cases where users frequently use CREATE
TEMP TABLE ... ON COMMIT DROP.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com