Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention

From: Greg Burd
Subject: Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention
Msg-id: 618241c3-a1e5-4a16-be12-17a361822ee0@app.fastmail.com
In response to: Re: [PATCH] Batched clock sweep to reduce cross-socket atomic contention  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Mon, Apr 27, 2026, at 10:15 AM, Andres Freund wrote:
> Hi,
>
> Thanks for looking into this.
>
> On 2026-04-25 16:08:02 -0400, Greg Burd wrote:
>> Does batching mess up the meaning of usage_count?
>> --------------------------------------------------
>> 
>> Short answer: no.  I want to walk through this because it was my first
>> concern too, and I think it's the question that will come up most on review.
>
>> 
>> The clock sweep's usage_count is an access-frequency approximation measured
>> in units of *complete passes*.  A buffer with usage_count = N survives N
>> passes without a re-pin.  The semantic meaning lives at pass granularity,
>> not at individual-buffer granularity.
>
>> 
>> What batching changes: intra-pass temporal ordering.  Without batching, with
>> N backends sweeping, decrements are interleaved -- backend A hits B[0],
>> backend B hits B[1], backend C hits B[2].  With batching, backend A hits
>> B[0..63] in a tight local burst, then backend B hits B[64..127], etc.  The
>> 64-buffer chunks are decremented in bursts rather than individually.
>
>> 
>> Why it doesn't matter:
>> 
>>   1. Every buffer still gets decremented exactly once per complete
>>      pass.  The invariant the algorithm actually depends on is
>>      untouched.
>> 
>>   2. A buffer's survival window is the time between consecutive
>>      passes.  That's milliseconds to seconds under load.  Whether
>>      B[0] gets decremented 50us before or 50us after B[63] within
>>      the same pass is below the resolution of anything usage_count
>>      is trying to measure.
>
> I don't think this is true, unfortunately. Sure, if you have a completely
> uniform, IO intensive, workload it is, but that's not all there is.  If you
> have a bunch of connections that replace buffers at a low rate and a bunch of
> connections that do so at a high rate, the batches "checked out" by the "low
> rate" connections won't be processed soon. Thus the buffers in that batch
> won't have their usagecount decremented and thus have a stronger "protection"
> against replacement.
>
> You can argue that may be OK, because it'd be unlikely that the next sweep
> would assign the same buffers to a "low rate" backend again. But that'd be an
> argument you'd have to make and validate.

You're right, and my "why it doesn't matter" section overstated things.  The
uniform-workload assumption was sloppy.  Let me try again with the mixed case
in mind.

The scenario I think you're describing: a low-rate backend claims B[0..63],
finds a victim at B[5], and then doesn't call StrategyGetBuffer() again for a
while -- maybe seconds.  During that time B[6..63] sit with their current
usage_count, undecremented, while high-rate backends are sweeping the rest of
the pool at full speed.  Those 58 buffers get a free pass they wouldn't have
gotten in the interleaved case.

I can bound the effect but not dismiss it.  Each slow backend can hold at
most 64 undecremented buffers.  With 32GB shared_buffers (~4.2M buffers), 100
slow backends each holding a full batch means ~6,400 buffers delayed -- 0.15%
of the pool.  The delay lasts until the backend's next StrategyGetBuffer()
call.  So the question is whether 0.15% of buffers with temporarily stale
usage_count produces a measurable eviction quality difference.

Two observations that bound it further:

  1. In the current code, a backend only calls ClockSweepTick() when
     it needs a victim.  A low-rate backend barely moves the global
     hand at all.  Without batching, buffers at positions beyond the
     current hand are also "undecremented" -- they just haven't been
     reached yet.  Batching changes *which* specific 64 buffers are
     pending, not the total count of undecremented buffers in the
     pool at any instant.

  2. The buffers in a held batch are contiguous in buffer-ID space.
     Since buffer-ID assignment to relation blocks is effectively
     random (driven by eviction order), those 64 buffers are
     scattered across relations.  There's no systematic bias toward
     protecting hot or cold data -- it's a random sample.

That said, "bounded and random" isn't the same as "zero."  One mitigation
that's simple to implement: abandon the remaining batch at transaction
boundaries.  Something like resetting MyBatchEnd = MyBatchPos in AtEOXact or
equivalent, so a backend that goes idle between transactions doesn't hold a
stale batch across an idle period.  That limits the staleness window to the
duration of a single transaction, which is when the backend is actively doing
work and likely to consume the batch quickly anyway.
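In sketch form -- AbandonSweepBatch is a hypothetical name, and which
existing end-of-xact hook it belongs in (AtEOXact_Buffers() or similar) is
the part I'd need to get right:

  /*
   * Abandon whatever is left of our checked-out sweep batch.  The
   * skipped buffer IDs just wait for the hand to reach them again on
   * the next complete pass; no other backend needs to be notified.
   */
  static void
  AbandonSweepBatch(void)
  {
      MyBatchEnd = MyBatchPos;    /* forces a fresh fetch_add next tick */
  }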

I'd like to measure the mixed-workload case directly.  A benchmark with e.g.
50 backends doing heavy sequential scans and 50 doing single-row OLTP,
comparing eviction hit rates with and without the patch.  Would that be the
kind of validation you'd want to see?

> I'm somewhat doubtful that batching that's independent of contention and
> independent of the usage rate will work out all that well.  If you instead
> went for a partitioned sweep architecture, with balancing between the
> different partitions, you don't have that issue. And you have a building block
> for more numa awareness etc.

I agree that partitioned sweep is architecturally more principled and gives
you a foundation for deeper NUMA work.  I'm not arguing that batching is a
better long-term architecture.

The pragmatic case for batching is: it's ~30 lines, it addresses the
identified bottleneck, and it doesn't foreclose on partitioned sweep later.
If partitioned sweep lands, batching becomes redundant for its primary use
case.  If partitioned sweep lands incrementally -- which the open items in
that thread suggest is the realistic path -- this gets a chunk of the
multi-socket win into users' hands sooner.

One concrete difference: partitioned sweep needs a stealing mechanism for
correctness when partitions are unevenly loaded.  Batching avoids that
because the "partitions" are ephemeral (one batch cycle) and sequential
(global order preserved), so there's no long-lived imbalance to steal from.
Whether that simplicity is worth the tradeoff you identified above is a
judgment call, and I take your point that the building-block argument favors
partitioning.

I'm also not attached to "batching instead of partitioning."  If you think
the right move is to focus effort on partitioned sweep, I'm happy to help
with that.  But if there's appetite for a smaller change that ships sooner,
this is what I've got.

>> There is one subtle difference worth naming.  When a backend finds a
>> victim at B[5] of its batch, it returns with MyBatchEnd still sitting at
>> B[63].  The next time that backend needs a victim it resumes at B[6], not
>> at wherever the global hand now points.  So the backend drains its batch
>> over multiple StrategyGetBuffer() calls rather than all at once.  Under
>> heavy load, where batches are consumed in microseconds, this is invisible.
>> Under light load, the implication is that some buffers can sit with
>> slightly stale usage_count for longer than they would have before.  But
>> "light load" means "the sweep is barely moving and nothing wants to evict
>> anyway" -- so the effect doesn't show up where it would hurt.
>
> As mentioned above, this assumes that the replacement rate is uniform between
> backends, which I think is not uniformly true outside of benchmarks.
>
>
>> There's also a small positive side-effect: cache locality.  The backend that
>> just touched BufferDescriptor[B[0]] has the adjacent descriptors warm in
>> L1/L2.
>
> A BufferDesc is 64bytes. With common cacheline sizes and stuff like adjacent
> cacheline prefetching you'll have *maybe* 2 consecutive BufferDescs in L1/L2.
> Where it might help more is the TLB.

You're right, I overstated the L1/L2 argument.  At 64 bytes per descriptor,
adjacent cacheline prefetch gets you at most 2 consecutive descriptors, not
64.  TLB is the more plausible benefit -- the batch walks a contiguous
virtual address range, which should reduce TLB misses when the descriptor
array spans multiple pages.  I haven't tried to isolate this in perf and
won't claim it until I have numbers.
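To put numbers on the TLB angle (assuming 4kB pages; with huge pages backing
shared memory the descriptor array spans far fewer TLB entries and the effect
should mostly wash out):

  64 descriptors * 64 bytes = 4,096 bytes = one 4kB page

So a full batch walk costs one TLB entry (two if the batch straddles a page
boundary), versus interleaved ticks that can touch a different page each
time.  Another reason to measure before claiming anything.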

>> Benchmarks
>> ----------
>> 
>> Jim ran these; I'm still working on reproducing them locally and will post
>> independent numbers in a follow-up.  All bare metal, Linux, huge pages
>> enabled throughout (more on that below), postmaster pinned to node 0 with
>> `numactl --cpunodebind=0` because otherwise stock TPS varied from 31K to 40K
>> depending on which node the postmaster happened to land on at launch --
>> worth flagging for anyone trying to reproduce.
>
> That's an odd one that I think you need to investigate separately.

Agreed.  I'll investigate and report separately.  My working hypothesis is
that it's related to where shared memory gets physically allocated relative
to the postmaster's NUMA node, which then affects all child backends.
That's interesting regardless of this patch.

>> Workload is pgbench scale 3000 (~45GB) with shared_buffers=32GB, so the
>> working set always spills and the sweep is hot.
>
> Uhm, is this something worth optimizing substantially for?  What you're
> measuring here is basically the worst possible way of configuring a database,
> with full double buffering and a lot of memory bandwidth dedicated to copying
> buffers from one place to another. That's maybe a sane setup if you have a lot
> of small databases that you can't configure individually, but that's not the
> case when you run a reasonably large workload on a 384vCPU setup.
>
> I think to be really convincing you'd have to do this with actual IO involved
> somewhere.

Fair criticism.  The pgbench setup was designed to isolate the clock sweep
bottleneck by keeping everything in the OS page cache, but you're right that
it doesn't represent how someone would actually run a database on a 384-vCPU
box.  In production you'd either size shared_buffers to hold the working set
(no sweep pressure) or have real storage I/O (where I/O latency dilutes sweep
contention).

The HammerDB TPC-C numbers (which involve I/O and realistic contention
patterns) show flat-to-slightly-positive -- no regression, a small win at
higher concurrency.  I think that's the more honest picture of what
production looks like.  And perhaps a flat-to-slightly-positive delta isn't
enough juice for the squeeze, especially this late in a cycle.

For the follow-up I'll run:

  - Working set 2-3x shared_buffers on NVMe, so StrategyGetBuffer()
    calls actually hit storage on some fraction of evictions.
  - A mixed OLTP workload (not just pgbench -S) with varied access
    patterns, to address the uniform-workload concern above.
  - perf stat showing bus-cycles, cache-misses, and L3 contention
    deltas, so the mechanism is visible independent of TPS.

I should have led with the TPC-C results and framed the pgbench numbers as
"here's where the ceiling is under maximum sweep pressure" rather than
presenting them as the headline result.

>> Relationship to Tomas's NUMA series
>> -----------------------------------
>> 
>> Tomas posted a multi-patch NUMA-awareness series in [1] covering buffer
>> interleaving across nodes, partitioned freelists, partitioned clock sweep,
>> PGPROC interleaving, and related pieces.  I want to be careful here
>> because I don't think we should frame this patch as competing with that
>> work.
>> 
>> One thing I found striking as I re-read the thread: in the benchmarks Tomas
>> posted later in the series, *most of the benefit comes from partitioning the
>> clock sweep*, and the NUMA memory-placement layer on top sometimes runs
>> slower than partitioning alone.  His own conclusion, quoted roughly: the
>> benefit mostly comes from just partitioning the clock sweep, and it's
>> largely independent of the NUMA stuff; the NUMA partitioning is often
>> slower.
>
> That was partially because he measured on something that didn't really have
> significant NUMA effects though...

Fair point.  I shouldn't over-generalize from benchmarks run on hardware that wasn't exercising the NUMA dimension.
Retracted.

>> That observation is the thing that makes me think batching is worth
>> considering on its own.  It's going after the same bottleneck Tomas's
>> partitioning addresses, but:
>> 
>>   - without splitting global eviction visibility (which is where
>>     cross-partition stealing gets complicated),
>
> You *are* doing that tho.

You're right.  When a backend holds B[0..63], those buffers are effectively
invisible to other backends for eviction consideration until the batch is
consumed.  That is a form of split visibility.

The difference from a permanent partition is that the split is short-lived
(one batch consumption cycle, microseconds under load) and sequential (the
next batch picks up where this one left off in the global order).  There's
no long-lived assignment of buffer ranges to backends, so the kind of
structural imbalance that drives the need for cross-partition stealing
doesn't arise.  But I shouldn't have claimed "without splitting."  The
honest framing is: "with much more limited and transient splitting."

>>   - without requiring NUMA-aware buffer placement (which has huge
>>     page alignment, descriptor-partition-mid-page, and resize
>>     complications that are still being worked out in that thread),
>
> You can do the partitioned clock sweep without *any* of that.

Also correct.  The complications I listed are from Tomas's patches 0001 and
0006 (memory interleaving and NUMA-aware buffer-to-node mapping), not from
the clock-sweep partitioning patches 0002-0005.  Partitioned clock sweep
alone doesn't require NUMA-aware buffer placement.  I conflated the two;
apologies.

> Greetings,
>
> Andres Freund

To summarize the open items I'm taking away:

  1. Mixed-workload benchmark (high-rate + low-rate backends) to
     measure eviction quality impact of held batches.
  2. I/O-inclusive benchmarks on NVMe with working set > shared_buffers.
  3. Investigate the postmaster NUMA placement variance separately.
  4. Consider batch-abandonment at transaction boundaries as a
     mitigation for the staleness concern.
  5. perf stat data showing the mechanism (bus-cycles, cache-misses).

I'll post results as I have them.  I greatly appreciate your time and thoughtful review.

best.

-greg


