Thread: Buffer locking is special (hints, checksums, AIO writes)

Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,


I'm working on making bufmgr.c ready for AIO writes. That requires some
infrastructure changes that I wanted to discuss... In this email I'm trying to
outline the problems and what I currently think we should do.


== Problem 1 - hint bits ==

Currently pages that are being written out are copied if checksums are
enabled. The copy is needed because hint bits can be set with just a share
lock, and writing out a buffer only requires a share lock. If we didn't copy
the buffer, concurrently set hint bits could falsify the computed checksum.
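
For reference, the relevant path looks roughly like this today (abridged from
FlushBuffer() in bufmgr.c; error handling and instrumentation omitted):

    /*
     * Copy the page, so that concurrently set hint bits can't change the
     * bytes after the checksum has been computed, then write out the copy.
     */
    bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

    smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum,
              bufToWrite, false);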

Even without checksums hint bits being set while IO is ongoing is an issue,
e.g. btrfs assumes that pages do not change while being written out with
direct-io, and corrupts itself if they do [1].

The copies we make to avoid checksum failures are not at all cheap [3], and
they are particularly problematic for AIO, where a lot of buffers can undergo
AIO at the same time. For AIO the issue is that the longer-lived bounce
buffers that I had initially prototyped actually end up using a lot of
memory, making it hard to figure out what defaults to set.

In [2] I had worked on an approach to avoid copying pages during writes. It
avoided the need to copy buffers by not allowing hint bits to be set while IO
is ongoing. It did so by skipping hint bit writes while IO is ongoing (plus a
fair bit of infrastructure to make that cheap enough).


In that thread Heikki was wondering whether we shouldn't instead go for a
more fundamental solution to the problem, like introducing a separate lock
level that prevents concurrent modifications but allows share lockers. At the
time I was against that, because it seemed like a large project.

However, since then I realized there's an architecturally related second issue:


== Problem 2 - AIO writes vs exclusive locks ==

Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:

The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO; with AIO the share lock is held
until the IO is completed. The problem is that if a backend wants to
exclusively lock a buffer undergoing AIO, it can't just wait for the content
lock as it does today - it might have to actually reap the IO completion from
the operating system. If one just were to wait for the content lock, there's
no forward progress guarantee.

The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race free:

If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
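
To spell the race out in code form (a naive sketch, not actual bufmgr.c code;
LockBufHdr(), UnlockBufHdr() and pgaio_wref_wait() are the real APIs, the
function name is made up):

    static void
    LockBufferExclusiveNaive(BufferDesc *buf)
    {
        uint32      buf_state;
        bool        io_in_progress;

        buf_state = LockBufHdr(buf);
        io_in_progress = (buf_state & BM_IO_IN_PROGRESS) != 0;
        UnlockBufHdr(buf, buf_state);

        if (io_in_progress)
            pgaio_wref_wait(&buf->io_wref);     /* reap the completion */

        /*
         * Race window: between the check above and the acquisition below,
         * another backend can set BM_IO_IN_PROGRESS and submit a new
         * write, so we can block on the content lock without anybody
         * being guaranteed to complete that IO.
         */
        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
    }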

There potentially are ways to synchronize the buffer state and the content
lock, but doing so requires deep integration between bufmgr.c and lwlock.c.


== Problem 3 - Cacheline contention ==

This is unrelated to AIO, but might influence the architecture for potential
solutions.  It might make sense to skip over this section on a quicker
read-through.

The problem is that in some workloads the cacheline containing the BufferDesc
becomes very hot. The most common case is workloads with lots of index nested
loops, where the root page and some of the other inner pages in the index can
become very contended.

I've seen cases where running the same workload in four separate copies in the
same postgres instance yields ~8x the throughput - on a machine with forty
cores, and there are machines with many more these days [4].

There are four main issues:

a) We manipulate the same cache line multiple times. E.g. btree always first
   pins the page and then locks the page, which performs two atomic operations
   on the same cacheline.

   I've experimented with putting the content lock on a separate cacheline,
   but that causes regressions in other cases.

   Leaving nontrivial implementation issues aside, we could release the lock
   and pin at the same time. With a bit more difficulty we could do the same
   for pin and lock acquisition.  nbtree almost always does the two together
   (see the sketch below), so it wouldn't be hard to make it benefit from
   such an optimization.
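
   For illustration, the pattern in question as nbtree does it today
   (simplified; the function name is made up):

       static void
       btree_page_visit(Relation rel, BlockNumber blkno)
       {
           Buffer      buf;

           /* two atomic operations on the same BufferDesc cacheline ... */
           buf = ReadBuffer(rel, blkno);           /* CAS to pin */
           LockBuffer(buf, BUFFER_LOCK_SHARE);     /* atomic op on the
                                                    * content lock embedded
                                                    * in the BufferDesc */

           /* ... use the page ... */

           /* ... and two more atomic operations to undo it */
           LockBuffer(buf, BUFFER_LOCK_UNLOCK);
           ReleaseBuffer(buf);                     /* CAS to unpin */
       }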


b) Many operations, like unpinning a buffer, need to use CAS, instead of an
   atomic subtraction.

   atomic add/sub scales a *lot* better than compare and swap, as there is no
   need to retry. With increased contention the number of retries increases
   further and further.

   The reason we need to use CAS in so many places is the following:

   Historically postgres' spinlocks don't use an atomic operation to release
   the spinlock on x86. When making buffer header locks their own thing, I
   carried that forward, to avoid performance regressions.  Because of that
   BufferDesc.state may not be modified while the buffer header spinlock is
   held - which is incompatible with using atomic-sub.

   However, since then we converted most of the "hot" uses of buffer header
   spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to
   use an atomic operation for the buffer header lock release, which in turn
   allows e.g. unpinning with an atomic sub.
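
   To make the difference concrete (a sketch using the existing atomics API;
   BUF_REFCOUNT_ONE, BM_LOCKED and WaitBufHdrUnlocked() are the real
   bufmgr.c names, the wrapper functions are made up):

       static void
       UnpinViaCas(BufferDesc *buf)    /* roughly what master does today */
       {
           uint32      old_buf_state = pg_atomic_read_u32(&buf->state);

           for (;;)
           {
               uint32      buf_state;

               if (old_buf_state & BM_LOCKED)
                   old_buf_state = WaitBufHdrUnlocked(buf);
               buf_state = old_buf_state - BUF_REFCOUNT_ONE;
               if (pg_atomic_compare_exchange_u32(&buf->state,
                                                  &old_buf_state, buf_state))
                   break;              /* retries under contention */
           }
       }

       static void
       UnpinViaSub(BufferDesc *buf)    /* the goal: never retries */
       {
           pg_atomic_fetch_sub_u32(&buf->state, BUF_REFCOUNT_ONE);
       }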


c) Read accesses to the BufferDesc cause contention

   Some code, like nbtree, relies on functions like
   BufferGetBlockNumber(). Unfortunately that contends with concurrent
   modifications of the buffer descriptor (like pinning). Potential solutions
   are to rely less on functions like BufferGetBlockNumber() or to split out
   the memory for that into a separate (denser?) array.


d) Even after addressing all of the above, there's still a lot of contention

   I think the solution here would be something roughly akin to fastpath
   locks. If a buffer is very contended, we can mark it as super-pinned &
   share locked, avoiding any atomic operation on the buffer descriptor
   itself. Instead the current lock and pin count would be stored in each
   backend's PGPROC.  Obviously evicting or exclusively-locking such a
   buffer would be a lot more expensive.

   I've prototyped it and it helps a *lot*.  The reason I mention this here is
   that this seems impossible to do while using the generic lwlocks for the
   content lock.



== Solution ? ==

My conclusion from the above is that we ought to:


A) Make Buffer Locks something separate from lwlocks

   As part of that introduce a new lock level in-between share and exclusive
   locks (provisionally called share-exclusive, but I hate it). The new lock
   level allows other share lockers, but can only be held by one backend.

   This allows changing the rules so that:

   1) Share lockers are not allowed to modify buffers anymore
   2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
   3) Write IO needs the new lock level

   This addresses problem 1 from above; a sketch of the resulting hint bit
   path follows.
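
   As a sketch, SetHintBits() could then look something like this
   (ConditionalLockBufferShareExclusive() and DowngradeBufferLockToShare()
   are hypothetical names):

       /* in SetHintBits(), under the new rules */
       if (!ConditionalLockBufferShareExclusive(buffer))
           return;             /* e.g. write IO holds the lock: skip */

       tuple->t_infomask |= infomask;
       MarkBufferDirtyHint(buffer, true);

       DowngradeBufferLockToShare(buffer);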


B) Merge BufferDesc.state and the content lock

   This allows addressing problem 2 from above, as we can now atomically
   check whether IO was concurrently initiated.

   Obviously BufferDesc.state is not currently wide enough, therefore the
   buffer state has to be widened to a 64bit variable.
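
   The payoff is that the lock bits and the IO flags then live in one atomic
   word, so an exclusive locker can detect in-progress IO without a race,
   roughly like this (BUF_LOCK_MASK, BUF_LOCK_EXCLUSIVE and
   WaitForBufferLock() are hypothetical; waiter handling omitted):

       uint64      state = pg_atomic_read_u64(&buf->state);

       for (;;)
       {
           if (state & BM_IO_IN_PROGRESS)
           {
               /* reap the IO completion to guarantee forward progress */
               pgaio_wref_wait(&buf->io_wref);
               state = pg_atomic_read_u64(&buf->state);
           }
           else if (state & BUF_LOCK_MASK)
           {
               state = WaitForBufferLock(buf);
           }
           else if (pg_atomic_compare_exchange_u64(&buf->state, &state,
                                                   state | BUF_LOCK_EXCLUSIVE))
           {
               /* locked - no IO can have started in between */
               break;
           }
       }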


C) Allow some modifications of BufferDesc.state while holding spinlock

   Just naively doing the above two things reduces scalability, as the
   likelihood of CAS failures increases due to the increased number of
   modifications of the same atomic variable.

   However, by allowing unpinning while the buffer header spinlock is held,
   scalability improves considerably in my tests.

   Doing so requires changing all uses of LockBufHdr(), but by introducing a
   static inline helper the complexity can largely be encapsulated.
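
   For illustration, such a helper could look roughly like this (a sketch
   modeled on the UnlockBufHdrExt() that shows up later in this thread;
   because pins can now be released concurrently, the spinlock release
   itself has to be a CAS loop rather than a plain store):

       static inline void
       UnlockBufHdrExt(BufferDesc *buf, uint32 old_buf_state,
                       uint32 set_bits, uint32 clear_bits)
       {
           for (;;)
           {
               uint32      new_buf_state = old_buf_state;

               new_buf_state |= set_bits;
               new_buf_state &= ~(clear_bits | BM_LOCKED);

               /*
                * The CAS can only fail due to a concurrent refcount
                * change, which is now allowed while BM_LOCKED is held;
                * retry with the updated state in that case.
                */
               if (pg_atomic_compare_exchange_u32(&buf->state,
                                                  &old_buf_state,
                                                  new_buf_state))
                   break;
           }
       }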



I've prototyped the above. The current state is pretty rough, but before I
spend the non-trivial time to make it into an understandable sequence of
changes, I wanted to get feedback.


Does this plan sound reasonable?


The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.




The order of changes I think makes the most sense is the following:

1) Allow some modifications while holding the buffer header spinlock

2) Reduce buffer pin with just an atomic-sub

   This needs to happen first, otherwise there are performance regressions
   during the later steps.

3) Widen BufferDesc.state to 64 bits

4) Implement buffer locking inside BufferDesc.state

5) Do IO while holding the share-exclusive lock and require all buffer
   modifications to at least hold a share-exclusive lock

6) Wait for AIO when acquiring an exclusive content lock

(some of these will likely be split into parts of their own, but those are details)


Sane?


DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[3] In some cases it causes slowdowns of close to 50% for the checkpointer!
[4] https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Mihail Nikalayeu
Date:
Hello, Andres!

Andres Freund <andres@anarazel.de>:
>
>    As part of that introduce a new lock level in-between share and exclusive
>    locks (provisionally called share-exclusive, but I hate it). The new lock
>    level allows other share lockers, but can only be held by one backend.
>
>    This allows to change the rules so that:
>
>    1) Share lockers are not allowed to modify buffers anymore
>    2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
>    3) Write IO needs to the new lock level

IIUC, it may be mapped to the existing locking system:

BUFFER_LOCK_SHARE       ---->  AccessShareLock
new lock mode           ---->  ExclusiveLock
BUFFER_LOCK_EXCLUSIVE   ---->  AccessExclusiveLock

So, it all may be named:

BUFFER_LOCK_ACCESS_SHARE
BUFFER_LOCK_EXCLUSIVE
BUFFER_LOCK_ACCESS_EXCLUSIVE

being more consistent with the table-level locking system.

Greetings,
Mikhail.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> My conclusion from the above is that we ought to:
>
> A) Make Buffer Locks something separate from lwlocks
> B) Merge BufferDesc.state and the content lock
> C) Allow some modifications of BufferDesc.state while holding spinlock

+1 to (A) and (B). No particular opinion on (C) but if it works well, great.

> The order of changes I think makes the most sense is the following:
>
> 1) Allow some modifications while holding the buffer header spinlock
> 2) Reduce buffer pin with just an atomic-sub
> 3) Widen BufferDesc.state to 64 bits
> 4) Implement buffer locking inside BufferDesc.state
> 5) Do IO while holding share-exclusive lock and require all buffer
>    modifications to at least hold share exclusive lock
> 6) Wait for AIO when acquiring an exclusive content lock

No strong objections. I certainly like getting to (5) and (6) and I
think those are in the right order. I'm not sure about the rest. I
thought (1) and (2) were the same change after reading your email; and
it surprises me a little bit that (2) is separate from (4). But I'm
sure you have a much better sense of this than I do.

> DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
wholly hostile to the idea of coming up with something else, I don't
think our tendency to invent our own way to do everything is one of
our better tendencies as a project.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > My conclusion from the above is that we ought to:
> >
> > A) Make Buffer Locks something separate from lwlocks
> > B) Merge BufferDesc.state and the content lock
> > C) Allow some modifications of BufferDesc.state while holding spinlock
> 
> +1 to (A) and (B). No particular opinion on (C) but if it works well, great.

Without it I see performance regressions, due to the increased rate of CAS
failures caused by having more changes to one atomic variable :/



> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> > 2) Reduce buffer pin with just an atomic-sub
> > 3) Widen BufferDesc.state to 64 bits
> > 4) Implement buffer locking inside BufferDesc.state
> > 5) Do IO while holding share-exclusive lock and require all buffer
> >    modifications to at least hold share exclusive lock
> > 6) Wait for AIO when acquiring an exclusive content lock
> 
> No strong objections. I certainly like getting to (5) and (6) and I
> think those are in the right order. I'm not sure about the rest.


> I thought (1) and (2) were the same change after reading your email

They are certainly related. I thought it'd make sense to split them as
outlined above, as (1) is relatively verbose on its own, but far more
mechanical.


> and it surprises me a little bit that (2) is separate from (4).

Without doing (2) first, I see performance/scalability regressions when doing
(4). Doing (3) without (2) also hurts...



> > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> 
> AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
> wholly hostile to the idea of coming up with something else, I don't
> think our tendency to invent our own way to do everything is one of
> our better tendencies as a project.

I guess it bothers me that we'd use share-exclusive to mean the buffer can't
be modified, but for real (vs. share, which does allow some modifications).
But it's quite plausible that there's no meaningfully better name, in which
case we certainly shouldn't differ from what's somewhat commonly used.


Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> I'm working on making bufmgr.c ready for AIO writes.

Nice!

> == Problem 2 - AIO writes vs exclusive locks ==
> 
> Separate from the hint bit issue, there is a second issue that I didn't have a
> good answer for: Making acquiring an exclusive lock concurrency safe in the
> presence of asynchronous writes:
> 
> The problem is that while a buffer is being written out, it obviously has to
> be share locked. That's true even with AIO. With AIO the share lock is held
> until the IO is completed. The problem is that if a backend wants to
> exclusively lock a buffer undergoing AIO, it can't just wait for the content
> lock as today, it might have to actually reap the IO completion from the
> operating system. If one just were to wait for the content lock, there's no
> forward progress guarantee.
> 
> The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
> locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> problem is that it's surprisingly hard to do so race free:
> 
> If a backend A were to just check if a buffer is undergoing IO before locking
> it, a backend B could start IO on the buffer between A checking for
> BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
> hold the buffer header spinlock across a blocking lwlock acquisition.
> 
> There potentially are ways to synchronize the buffer state and the content
> lock, but it requires deep integration between bufmgr.c and lwlock.c.

You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline.  Anything interesting?  For
example, I imagine these might work with varying degrees of inefficiency:

- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
  there's a non-I/O lock conflict.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

> == Problem 3 - Cacheline contention ==

> c) Read accesses to the BufferDesc cause contention
> 
>    Some code, like nbtree, relies on functions like
>    BufferGetBlockNumber(). Unfortunately that contends with concurrent
>    modifications of the buffer descriptor (like pinning). Potential solutions
>    are to rely less on functions like BufferGetBlockNumber() or to split out
>    the memory for that into a separate (denser?) array.

Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.

> d) Even after addressing all of the above, there's still a lot of contention
> 
>    I think the solution here would be something roughly to fastpath locks. If
>    a buffer is very contended, we can mark it as super-pinned & share locked,
>    avoiding any atomic operation on the buffer descriptor itself. Instead the
>    current lock and pincount would be stored in each backends PGPROC.
>    Obviously evicting or exclusively-locking such a buffer would be a lot more
>    expensive.
> 
>    I've prototyped it and it helps a *lot*.  The reason I mention this here is
>    that this seems impossible to do while using the generic lwlocks for the
>    content lock.

Nice.

On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > The order of changes I think makes the most sense is the following:

No concerns so far.  I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising.  I like your principle
of ordering changes to avoid performance regressions.

> > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

I would consider {AccessShare, Exclusive, AccessExclusive}.  What the $SUBJECT
proposal calls SHARE-EXCLUSIVE would become Exclusive.  That has the same
conflict matrix as the corresponding heavyweight locks, which seems good.  I
don't love our mode names, particularly ShareRowExclusive being unsharable.
However, learning one special taxonomy is better than learning two.

> > AFAIK "share exclusive" or "SX" is standard terminology.

Can you say more about that?  I looked around at
https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find
anything well-aligned with the proposal:

https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked
most relevant, but it doesn't give the big idea.
https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less
authoritative but does articulate the big idea, as "Shared-Exclusive (SX):
offer write access to the resource with inconsistent read. (relaxed
exclusive)."  That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE
can't see inconsistent reads.

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html
has term SX = "sub exclusive".  I gather an SX lock on a table lets one do
SELECT FOR UPDATE on that table (each row is the "sub"component being locked).

https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses
the term "SX", but it's more like our lwlocks.  One acquires S or X, not
blends of them.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Tue, Aug 26, 2025 at 8:14 PM Noah Misch <noah@leadboat.com> wrote:
> > > AFAIK "share exclusive" or "SX" is standard terminology.
>
> Can you say more about that?

Looks like I was misremembering. I was thinking of Gray & Reuter,
Transaction Processing: Concepts and Techniques, 1993. However,
opening it up, I find that his vocabulary is slightly different. He
offers the following six lock modes: IS, IX, S, SIX, Update, X. "I"
means "intent" and acts as a modifier to the letter that follows.
Hence, SIX means "a course-granularity shared lock with intent to set
finer-granularity exclusive locks" (p. 408). His lock manager is
hierarchical, so taking a SIX lock on a table means that you are
allowed to read all the rows in the table and you are allowed to
exclusive-lock individual rows as desired and nobody is allowed to
exclusive-lock any rows in the table. It is compatible only with IS;
that is, it does not preclude other people from share-locking
individual rows (which might delay your exclusive locks on those
rows). Since we don't have intent-locking in PostgreSQL, I think my
brain mentally flattened this hierarchy down to S, X, SX, but that's
not what he actually wrote.

His "Update" locks are also somewhat interesting: an update lock is
exactly like an exclusive lock except that it permits PAST
share-locks. You take an update lock when you currently need a
share-lock but anticipate the possibility of needing an
exclusive-lock. This is a deadlock avoidance strategy: updaters will
take turns, and some of them will ultimately want exclusive locks and
others won't, but they can't deadlock against each other as long as
they all take "Update" locks initially and don't try to upgrade to
that level later. An updater's attempt to upgrade to an exclusive lock
can still be delayed by, or deadlock against, share lockers, but those
typically won't try to acquire higher lock levels later.

If we were to use the existing PostgreSQL naming convention, I think
I'd probably argue that the nearest parallel to this level is
ShareUpdateExclusive: a self-exclusive lock level that permits
ordinary table access to continue while blocking exclusive locks, used
for an in-flight maintenance operation. But that's arguable, of
course.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > == Problem 2 - AIO writes vs exclusive locks ==
> >
> > Separate from the hint bit issue, there is a second issue that I didn't have a
> > good answer for: Making acquiring an exclusive lock concurrency safe in the
> > presence of asynchronous writes:
> >
> > The problem is that while a buffer is being written out, it obviously has to
> > be share locked. That's true even with AIO. With AIO the share lock is held
> > until the IO is completed. The problem is that if a backend wants to
> > exclusively lock a buffer undergoing AIO, it can't just wait for the content
> > lock as today, it might have to actually reap the IO completion from the
> > operating system. If one just were to wait for the content lock, there's no
> > forward progress guarantee.
> >
> > The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> > BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
> > locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> > problem is that it's surprisingly hard to do so race free:
> >
> > If a backend A were to just check if a buffer is undergoing IO before locking
> > it, a backend B could start IO on the buffer between A checking for
> > BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
> > hold the buffer header spinlock across a blocking lwlock acquisition.
> >
> > There potentially are ways to synchronize the buffer state and the content
> > lock, but it requires deep integration between bufmgr.c and lwlock.c.
>
> You may have considered and rejected simpler alternatives for (2) before
> picking the approach you go on to outline.

I tried a few things...


> Anything interesting?

Not really.

The first one you propose is what I looked at for a while:

> For example, I imagine these might work with varying degrees of
> inefficiency:
>
> - Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
>   there's a non-I/O lock conflict.

It's nontrivial to make this race free - the problem is the case where we *do*
have to wait for an exclusive content lock. It's possible for the lwlock to be
released by the owning backend and for IO to be started, after checking
whether IO is in progress (after LWLockConditionalAcquire() had failed).

I came up with a complicated scheme, where setting IO in progress would
afterwards wake up all lwlock waiters and all exclusive content lock waits
were done with LWLockAcquireOrWait().  I think that was working - but it's
also slower and seems really fragile and ugly.


> - Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

That just seems expensive. We could make it cheaper by doing it only if a
LWLockConditionalAcquire() doesn't succeed. But it still seems not great.  And
it doesn't really help with addressing the "setting hint bits while IO is in
progress" part...


> > == Problem 3 - Cacheline contention ==
>
> > c) Read accesses to the BufferDesc cause contention
> >
> >    Some code, like nbtree, relies on functions like
> >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> >    modifications of the buffer descriptor (like pinning). Potential solutions
> >    are to rely less on functions like BufferGetBlockNumber() or to split out
> >    the memory for that into a separate (denser?) array.
>
> Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> structure, since the buffer's mapping can't change until we unpin.

Hm. I didn't think about a backend-local data structure for that, perhaps
because it seems not cheap to maintain (both from a runtime and a space
perspective).

If we store the read-only data for buffers separately from the read-write
data, we could access that from backends without a lock, since it can't change
with the buffer pinned.

One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table, otherwise we could deduplicate the other way round and just put padding
between the modified and read-only part of a buffer desc.



> On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > The order of changes I think makes the most sense is the following:
>
> No concerns so far.  I won't claim I can picture all the implications and be
> sure this is the right thing, but it sounds promising.  I like your principle
> of ordering changes to avoid performance regressions.

I suspect we'll have to merge this incrementally to stay sane. I don't want to
end up with a period of worse performance due to this; that could make it
harder to evaluate other work.


> > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> I would consider {AccessShare, Exclusive, AccessExclusive}.

One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just be a very
lightweight lock level, instead of that being a separate dedicated
infrastructure.  One nice outgrowth of that would be that acquiring a
cleanup lock would just be a real lock acquisition, instead of the dedicated
limited machinery we have right now.

Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup

This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
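
As a compatibility matrix, my reading of those five levels would be roughly
('+' = compatible):

                     ref  share  share-excl  excl  cleanup
    reference         +     +        +        +      -
    share             +     +        +        -      -
    share-exclusive   +     +        -        -      -
    exclusive         +     -        -        -      -
    cleanup           -     -        -        -      -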



> What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.

There are a few hundred references to the lock levels though, seems painful to
rename them :(


> That has the same conflict matrix as the corresponding heavyweight locks,
> which seems good.

> I don't love our mode names, particularly ShareRowExclusive being
> unsharable.

I hate them with a passion :). Except for the most basic ones they just don't
stay in my head for more than a few hours.


> However, learning one special taxonomy is better than learning two.

But that's fair.


Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> > On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > > == Problem 3 - Cacheline contention ==
> >
> > > c) Read accesses to the BufferDesc cause contention
> > >
> > >    Some code, like nbtree, relies on functions like
> > >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> > >    modifications of the buffer descriptor (like pinning). Potential solutions
> > >    are to rely less on functions like BufferGetBlockNumber() or to split out
> > >    the memory for that into a separate (denser?) array.
> >
> > Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> > structure, since the buffer's mapping can't change until we unpin.
> 
> Hm. I didn't think about a backend local datastructure for that, perhaps
> because it seems not cheap to maintain (both from a runtime and a space
> perspective).

Yes, paying off the cost of maintaining it could be tricky.  It could be the
kind of thing where the overhead loses at 10 cores and wins at 40 cores.  It
could also depend heavily on the workload's concurrent pins per backend.

> If we store the read-only data for buffers separately from the read-write
> data, we could access that from backends without a lock, since it can't change
> with the buffer pinned.

Good point.  That alone may be enough of a win.

> One way to do that would be to maintain a back-pointer from the BufferDesc to
> the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> probably don't want to add another indirection to the buffer mapping hash
> table, otherwise we could deduplicate the other way round and just put padding
> between the modified and read-only part of a buffer desc.

I think you're saying clients would save the back-pointer once and dereference
it many times, with each dereference of a saved back-pointer avoiding a shmem
read of BufferDesc.tag.  Is that right?

> > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> 
> > I would consider {AccessShare, Exclusive, AccessExclusive}.
> 
> One thing I forgot to mention is that with the proposed re-architecture in
> place, we could subsequently go further and make pinning just be a very
> lightweight lock level, instead of that being a separate dedicated
> infrastructure.  One nice outgrowth of that would be that acquiring a
> cleanup lock would just be a real lock acquisition, instead of the dedicated
> limited machinery we have right now.
> 
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
> 
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could map it like this:

AccessShare - pins today
RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
Share - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive - cleanup lock or evict the buffer

That has a separate level for hint bits vs. I/O, so multiple backends could
set hint bits.  I don't know whether the benchmarks would favor maintaining
that distinction.

> > What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
> 
> There are a few hundred references to the lock levels though, seems painful to
> rename them :(

Yes, especially in comments and extensions.  Likely more important than that
for the long-term, your latest proposal has the advantage of keeping short
names for the most-commonly-referenced lock types.  (We could keep
BUFFER_LOCK_SHARE with the lower layers translating that into RowShare, but
that weakens or eliminates the benefit of reducing what readers need to
learn.)  For what it's worth, 6 PGXN modules reference BUFFER_LOCK_SHARE
and/or BUFFER_LOCK_EXCLUSIVE.

Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
What do you think of that?  I don't know whether that should win vs. names
like ShareUpdateExclusive, though.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Wed, Aug 27, 2025 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
>
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could do: ACCESS SHARE, SHARE, SHARE UPDATE EXCLUSIVE, EXCLUSIVE,
ACCESS EXCLUSIVE.

I've always thought that a pin was a lot like an access share lock and
a cleanup lock was a lot like an access exclusive lock.

But then again, using the same terminology for two different things
might be confusing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > One way to do that would be to maintain a back-pointer from the BufferDesc to
> > the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> > probably don't want to add another indirection to the buffer mapping hash
> > table, otherwise we could deduplicate the other way round and just put padding
> > between the modified and read-only part of a buffer desc.
> 
> I think you're saying clients would save the back-pointer once and dereference
> it many times, with each dereference of a saved back-pointer avoiding a shmem
> read of BufferDesc.tag.  Is that right?

I was thinking that we'd not have BufferDesc.tag, instead storing it solely
in BufferLookupEnt. To get the tag of a BufferDesc, you'd have to follow the
back-reference every time.  But that's actually why it doesn't work -
reading the back-reference pointer would have the same issue as just reading
BufferDesc.tag...


> > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> > 
> > > I would consider {AccessShare, Exclusive, AccessExclusive}.
> > 
> > One thing I forgot to mention is that with the proposed re-architecture in
> > place, we could subsequently go further and make pinning just be a very
> > lightweight lock level, instead of that being a separate dedicated
> > infrastructure.  One nice outgrowth of that would be that acquiring a
> > cleanup lock would just be a real lock acquisition, instead of the dedicated
> > limited machinery we have right now.
> > 
> > Which would leave us with:
> > - reference (pins today)
> > - share
> > - share-exclusive
> > - exclusive
> > - cleanup
> > 
> > This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> > way...
> 
> Could map it like this:
> 
> AccessShare - pins today
> RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
> Share - set hint bits
> ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
> Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
> AccessExclusive - cleanup lock or evict the buffer

I tend to think having things like RowShare for buffer locking is confusing
enough that the similarity to the heavyweight locks wouldn't actually be a
win...


> That has a separate level for hint bits vs. I/O, so multiple backends could
> set hint bits.  I don't know whether the benchmarks would favor maintaining
> that distinction.

I don't think it would - I actually found that multiple backends setting the
same hint bits *hurts* performance a bit.  But what's more important, we
don't have the space for it, I think.  Every lock level that can be acquired
multiple times needs a lock count of 18 bits, and we need to store the buffer
state flags (10 bits). There's just not enough space in 64 bits for three
18-bit counters as well as the flag bits etc.
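
Spelled out, the bit budget under that scheme would be roughly (my
arithmetic, assuming counts sized for MAX_BACKENDS):

    pin count            18 bits
    share lock count     18 bits
    hint-bit lock count  18 bits
    buffer state flags   10 bits
    ------------------------------
    total                64 bits  - nothing left for the usage count,
                                    waiter bits, etc.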


> Compared to share-exclusive, I think I'd prefer a name that describes the use
> cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

I would too, I just couldn't come up with something that conveys the meanings
in a sufficiently concise way :)


> What do you think of that?  I don't know whether that should win vs. names
> like ShareUpdateExclusive, though.

I think it'd be a win compared to the heavyweight lock names...

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Wed, Aug 27, 2025 at 03:29:02PM -0400, Andres Freund wrote:
> On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> > On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> > > Which would leave us with:
> > > - reference (pins today)
> > > - share
> > > - share-exclusive
> > > - exclusive
> > > - cleanup

> > Compared to share-exclusive, I think I'd prefer a name that describes the use
> > cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

Another name idea is "self-exclusive", to contrast with "exclusive" excluding
all of (exclusive, self-exclusive, share).

Fortunately, not much code will acquire this lock type.  Hence, there's
relatively little damage if the name is less obvious than older lock types or
if the name changes later.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Michael Paquier
Date:
On Wed, Aug 27, 2025 at 10:03:08AM -0400, Robert Haas wrote:
> If we were to use the existing PostgreSQL naming convention, I think
> I'd probably argue that the nearest parallel to this level is
> ShareUpdateExclusive: a self-exclusive lock level that permits
> ordinary table access to continue while blocking exclusive locks, used
> for an in-flight maintenance operation. But that's arguable, of
> course.

ShareUpdateExclusive is a term that's been used for some time now and
relates to knowledge that's quite widespread in the tree, so it feels like
a natural fit for the use-case described on this thread, as we'd want a
self-conflicting lock.  share-exclusive did not sound that bad to me,
TBH, quite the contrary, when applied to buffer locking for AIO.

"intent" is also a word I've bumped quite a lot into while looking at
some naming convention, but this is more related to the fact that a
lock is going to be taken, which we don't really have.  So that feels
off.
--
Michael

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> The hardest part about this change is that everything kind of depends on each
> other. The changes are large enough that they clearly can't just be committed
> at once, but doing them over time risks [temporary] performance regressions.
> 
> 
> 
> 
> The order of changes I think makes the most sense is the following:
> 
> 1) Allow some modifications while holding the buffer header spinlock
> 
> 2) Reduce buffer pin with just an atomic-sub
> 
>    This needs to happen first, otherwise there are performance regressions
>    during the later steps.

Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups.  I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().

Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to reduce the most common PinBuffer_Locked() call. By definition
PinBuffer_Locked() will become a bit slower due to 0003. But even without
0003, 0002 is faster than master. And the previous approach also just seems
pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.

I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.

I'm not sure I like the TODOs added in 0003 and removed in 0004, but squashing
the changes doesn't really seem better either.

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> > The hardest part about this change is that everything kind of depends on each
> > other. The changes are large enough that they clearly can't just be committed
> > at once, but doing them over time risks [temporary] performance regressions.
> >
> >
> >
> >
> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> >
> > 2) Reduce buffer pin with just an atomic-sub
> >
> >    This needs to happen first, otherwise there are performance regressions
> >    during the later steps.
>
> Here are the first few cleaned up patches implementing the above steps, as
> well as some cleanups.  I included a commit from another thread, as it
> conflicts with these changes, and we really should apply it - and it's
> arguably required to make the changes viable, as it removes one more use of
> PinBuffer_Locked().
>
> Another change included is to not return the buffer with the spinlock held
> from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> for that is to reduce the most common PinBuffer_Locked() call. By definition
> PinBuffer_Locked() will become a bit slower due to 0003. But even without
> 0003, 0002 is faster than master. And the previous approach also just seems
> pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> but I don't really have a better idea.
>
> I invite particular attention to the commit message for 0003 as well as the
> comment changes in buf_internals.h within.

Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version.

Changes:

- Updated patch description for 0002, giving a lot more background

- Improved BufferDesc comments a fair bit more in 0003

- Reduced the size of 0003 a bit, by using UnlockBufHdrExt() instead of a CAS
  loop in buffer_stage_common() and reordering some things in
  TerminateBufferIO()

- Made 0004 only use the non-looping atomic op in UnpinBufferNoOwner(), not
  MarkBufferDirty(). I realized the latter would take additional complexity to
  make safe (a CAS loop in TerminateBufferIO()).  I am somewhat doubtful that
  there are workloads where it matters...


Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Matthias van de Meent
Date:
On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> > Here are the first few cleaned up patches implementing the above steps, as
> > well as some cleanups.  I included a commit from another thread, as it
> > conflicts with these changes, and we really should apply it - and it's
> > arguably required to make the changes viable, as it removes one more use of
> > PinBuffer_Locked().
> >
> > Another change included is to not return the buffer with the spinlock held
> > from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> > for that is to reduce the most common PinBuffer_Locked() call. By definition
> > PinBuffer_Locked() will become a bit slower due to 0003. But even without
> > 0003, 0002 is faster than master. And the previous approach also just seems
> > pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> > but I don't really have a better idea.
> >
> > I invite particular attention to the commit message for 0003 as well as the
> > comment changes in buf_internals.h within.
>
> Robert looked at the patches while we were chatting, and I addressed his
> feedback in this new version.

I like these changes, and have some minor comments:

0001 ensures that ReadRecentBuffer increments the usage counter, which
someone who uses an access strategy may want to prevent. I know this
isn't exactly new behaviour, but something I noticed anyway. Apart
from that observation, LGTM

0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
the comment itself needing updates, how about:

+     * Ensure, before we pin a victim buffer, that there's a free refcount
+     * entry, and a resource owner slot for the pin.

Again, LGTM.

0003's UnlockBufHdrExt:
This is implemented with CAS, even when we only want to change bits we
know the state of (or could know, if we spent the effort).
Given its inline nature, wouldn't it be better to use atomic_sub
instructions? Or is this to handle cases where the bits we want to
(un)set might be (un)set by a concurrent process?
If the latter, could we specialize this to do a single atomic_sub
whenever we want to change state bits that we know can be only changed
whilst holding the spinlock?

0004: LGTM

0005: LGTM

0006: LGTM

Kind regards,

Matthias van de Meent



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
> On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
> > On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> > > Here are the first few cleaned up patches implementing the above steps, as
> > > well as some cleanups.  I included a commit from another thread, as it
> > > conflicts with these changes, and we really should apply it - and it's
> > > arguably required to make the changes viable, as it removes one more use of
> > > PinBuffer_Locked().
> > >
> > > Another change included is to not return the buffer with the spinlock held
> > > from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> > > for that is to reduce the most common PinBuffer_Locked() call. By definition
> > > PinBuffer_Locked() will become a bit slower due to 0003. But even without
> > > 0003, 0002 is faster than master. And the previous approach also just seems
> > > pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> > > but I don't really have a better idea.
> > >
> > > I invite particular attention to the commit message for 0003 as well as the
> > > comment changes in buf_internals.h within.
> >
> > Robert looked at the patches while we were chatting, and I addressed his
> > feedback in this new version.
> 
> I like these changes, and have some minor comments:

Thank for reviewing!


> 0001 ensures that ReadRecentBuffer increments the usage counter, which
> someone who uses an access strategy may want to prevent. I know this
> isn't exactly new behaviour, but something I noticed anyway. Apart
> from that observation, LGTM

Are you proposing to change behaviour? Right now ReadRecentBuffer doesn't even
accept a strategy, so I don't really see this as something that needs to be
tackled at this point.

I'm not sure I see any real use cases for ReadRecentBuffer() that would
benefit from a strategy, but I very well might just not be thinking wide
enough.


> 0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
> the comment itself needing updates

Indeed.


> , how about:
> 
> +     * Ensure, before we pin a victim buffer, that there's a free refcount
> +     * entry, and a resource owner slot for the pin.
> 
> Again, LGTM.

WFM.


> 0003's UnlockBufHdrExt:
> This is implemented with CAS, even when we only want to change bits we
> know the state of (or could know, if we spent the effort).
> Given its inline nature, wouldn't it be better to use atomic_sub
> instructions? Or is this to handle cases where the bits we want to
> (un)set might be (un)set by a concurrent process?

Yes, it's to handle concurrent changes to the buffer state.

> If the latter, could we specialize this to do a single atomic_sub
> whenever we want to change state bits that we know can be only changed
> whilst holding the spinlock?

We probably could optimize some cases as an atomic-sub, some others as an
atomic-and and others again as an atomic-or. The latter two, however, are
implemented as a CAS on x86 anyway...

After 0004 I don't think any of the paths using this are actually particularly
hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
there are hot paths, we really should try to work towards not even needing the
buffer header spinlock; that has a bigger impact than improving the code for
unlocking the buffer header...

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Matthias van de Meent
Date:
Hi,

On Tue, 7 Oct 2025 at 00:55, Andres Freund <andres@anarazel.de> wrote:
> On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
> > 0001 ensures that ReadRecentBuffer increments the usage counter, which
> > someone who uses an access strategy may want to prevent. I know this
> > isn't exactly new behaviour, but something I noticed anyway. Apart
> > from that observation, LGTM
>
> Are you proposing to change behaviour?

Eventually, yes, but not necessarily now or with this patchset.

> Right now ReadRecentBuffer doesn't even
> accept a strategy, so I don't really see this as something that needs to be
> tackled at this point.
>
> I'm not sure I see any real use cases for ReadRecentBuffer() that would
> benefit from a strategy, but I very well might just not be thinking wide
> enough.

I think it's rather strange that there is no Extended variant of
ReadRecentBuffer, like how there is a ReadBufferExtended for
ReadBuffer. Yes, ReadRecentBuffer has more arguments to fill and so
has a smaller difference versus ReadBufferExtended, but it's not a
complete replacement for ReadBuffer[Ext] when you're aware of a recent
buffer of the page.

I'm not saying it will definitely happen, but I could see that e.g.
amcheck might want to keep track of buffer IDs of recent heap pages it
accessed to verify an index's results, without holding a pin on all the
pages; and instead use ReadRecentBuffer[Extended] with a
BufferAccessStrategy to allow re-acquiring the buffer pin without
blowing out shared buffers or making parts of the pool take forever to
evict again.

> > 0003's UnlockBufHdrExt:
> > This is implemented with CAS, even when we only want to change bits we
> > know the state of (or could know, if we spent the effort).
> > Given its inline nature, wouldn't it be better to use atomic_sub
> > instructions? Or is this to handle cases where the bits we want to
> > (un)set might be (un)set by a concurrent process?
>
> Yes, it's to handle concurrent changes to the buffer state.
>
> > If the latter, could we specialize this to do a single atomic_sub
> > whenever we want to change state bits that we know can be only changed
> > whilst holding the spinlock?
>
> We probably could optimize some cases as an atomic-sub, some others as an
> atomic-and and others again as an atomic-or. The latter two, however, are
> implemented as a CAS on x86 anyway...
>
> After 0004 I don't think any of the paths using this are actually particularly
> hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
> there are hot paths, we really should try to work towards not even needing the
> buffer header spinlock; that has a bigger impact than improving the code for
> unlocking the buffer header...

Fair enough; I guess we'll see if further optimization would have much
impact once this all has been committed.

Kind regards,

Matthias van de Meent
Databricks



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

I pushed a few commits from this patchset after Matthias' review
(thanks!). Unfortunately, in 5e899859287 I missed that the valgrind
annotations would no longer be done for the buffers returned by
StrategyGetBuffer(), which turned skink red.

The attached 0001 patch centralizes the valgrind initialization in
TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
naming isn't a perfect match, but it seems fine to me.


0002 is a new version of "Allow some buffer state modifications while holding
header lock", mainly with a fair bit of comment polishing around BufferDesc
and one small oversight fixed (it didn't update a buffer_state variable in
one place).

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
> I pushed a few commits from this patchset after Matthias' review
> (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
> would not be done anymore for the buffers returned by
> StrategyGetBuffer(). Which turned skink red.
> 
> The attached 0001 patch centralizes the valgrind initialization in
> TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
> is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
> naming isn't a perfect match, but it seems fine to me.

Forgot to say: I'll push this patch soon, to get skink back to green, unless
somebody says something.  We can adjust this later, if the comment and/or
placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-10-09 17:16:49 -0400, Andres Freund wrote:
> On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
> > I pushed a few commits from this patchset after Matthias' review
> > (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
> > would not be done anymore for the buffers returned by
> > StrategyGetBuffer(). Which turned skink red.
> > 
> > The attached 0001 patch centralizes the valgrind initialization in
> > TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
> > is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
> > naming isn't a perfect match, but it seems fine to me.
> 
> Forgot to say: I'll push this patch soon, to get skink back to green. Unless
> somebody says something.  We can adjust this later, if the comment and/or
> placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyones liking.

I pushed that fix, as well as the subsequent buffer header locking changes,
a while ago.


Attached is a patchset that actually implements the buffer content locks in
bufmgr.c. This isn't that close to a committable shape yet, but it seemed
useful to get it out there.  The first few patches seem closer, so it'll also
be useful to narrow this down.

0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
    effectively harmless.

0002: Not really required, but seems like an improvement to me

0003: A prerequisite to 0004, pretty boring itself

0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
additional bits yet, though.  Some annoying reformatting required to avoid
long lines.

0005: There already was a wait event class for BUFFERPIN. It seems better to
make that more general than to implement them separately.

0006+0007: This is preparatory work for 0008, but also worthwhile on its
own. The private refcount stuff does show up in profiles. The reason it's
related is that without these changes the added information in 0008 makes that
worse.

0008: The main change. Implements buffer content locking independently from
lwlock.c. There's obviously a lot of similarity between lwlock.c code and
this, but I've not found a good way to reduce the duplication without giving
up too much.  This patch does immediately introduce share-exclusive as a new
lock level, mostly because it was too painful to do separately.

0009+0010+0011: Preparatory work for 0012.

0012: One of the main goals of this patchset - use the new share-exclusive
lock level to only allow hint bits to be set while no IO is going on.

0013: Prototype of making UnlockReleaseBuffer() faster and of using it more
widely in nbtree.c

0014: Now that hint bits can't be done while IO is going on, we don't need to
copy pages anymore.  This needs a fair bit more work, as denoted by the FIXMEs
in the code.

I've tried to add detail to the more important commit messages, at least until
0012.


I want to again emphasize that the important commits (i.e. 0008, 0012, 0014)
aren't close to being mergeable. But I think they're in a stage that they
could benefit from "lenient" high-level review.

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Greg Burd
Date:
On Nov 19 2025, at 9:47 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-10-09 17:16:49 -0400, Andres Freund wrote:
>> On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
>> > I pushed a few commits from this patchset after Matthias' review
>> > (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
>> > would not be done anymore for the buffers returned by
>> > StrategyGetBuffer(). Which turned skink red.
>> > 
>> > The attached 0001 patch centralizes the valgrind initialization in
>> > TrackNewBufferPin(), which 5e899859287 had added. The nice side
>> effect of that
>> > is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than
>> before. The
>> > naming isn't the perfect match, but it seems fine to me.
>> 
>> Forgot to say: I'll push this patch soon, to get skink back to green. Unless
>> somebody says something.  We can adjust this later, if the comment and/or
>> placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.
> 
> I have pushed that fix as well as the subsequent buffer header locking changes
> a while ago.

Hello Andres,

After talking to you about these ideas at PGConf in NYC I've been
anxiously awaiting this patch set. Thanks for dedicating the energy and
time to get it to this stage.

High level feedback after reading the patches/email/commit messages is
that it looks to get you to where you wanted to be, unblocking AIO
writes. I think they'll end up being faster than what's in place now,
even before you get to the AIO piece.  Certainly removing the copy of
each page to do a checksum will help.  Opening the door to a future
where we can have super-pinned/locked pages is also a net win.

Everything before/after 0008 was rather easy to digest and understand,
and I found nothing really to call out at this stage.

0008 is understandable too, it's just sizable. While it is large, I find
it well laid out and more readable than before.  I gave the locking code
a good look, it seems correct AFAICT.

Keep going, I'll be happy to dedicate time to testing and digging into
the commits as you get this into a final state.  I look forward to
extending/enhancing this code once integrated.

> I want to again emphasize that the important commits (i.e. 0008, 0012, 0014)
> aren't close to being mergeable. But I think they're in a stage that they
> could benefit from "lenient" high-level review.
> 
> Greetings,
> 
> Andres Freund

cheers.

-greg



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-20 14:08:57 -0500, Greg Burd wrote:
> After talking to you about these ideas at PGConf in NYC I've been
> anxiously awaiting this patch set. Thanks for dedicating the energy and
> time to get it to this stage.

Thanks!


> High level feedback after reading the patches/email/commit messages is
> that it looks to get you to where you wanted to be, unblocking AIO
> writes. I think they'll end up being faster than what's in place now,
> even before you get to the AIO piece.  Certainly removing the copy of
> each page to do a checksum will help.  Opening the door to a future
> where we can have super-pinned/locked pages is also a net win.

I actually had planned to write about the performance effects, in particular
of 0012, a bit more:

It's worth pointing out that the new way of setting hint bits is inherently
more expensive than what we did before - upgrading a lock to a different lock
level isn't free, compared to doing, well, nothing.

For paths that set the hint bits of a whole page, like a seqscan, that cost is
more than amortized by the batched approach introduced in 0011. Those get
faster with the patch, both when already hinted and when not.

However, there are paths that aren't easily amenable to that approach, like
e.g. an ordered index scan referencing unhinted tuples. There we only ever
access a single tuple and release the upgraded lock after every tuple. If the
index scan is perfectly correlated with the table and every tuple is unhinted,
that's a decent amount of additional work.

I've spent a lot of time micro-optimizing that workload, to avoid any
significant regressions. An extreme stress-test started out being about 20%
slower than today; as of my current local version, it's a bit faster (~1%) on
one of my machines and a bit slower (~2%) on another. Partially that was
achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
path for updating a single hint bit, avoiding redundant reads of the lock
state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
locking code.  The latter is a bit of a cheat though - things would be even
faster if we went with the old way of setting hint bits, but with the
independent optimizations applied.

I think that's ok though:

1) the old way of setting hint bits is a pretty dirty hack that causes issues
   in quite a few places.

2) by definition, having to set hint bits is an ephemeral state, once the hint
   bits are set, the difference vanishes

3) no normal workload shows the difference - my stress test does
   SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
   on a perfectly correlated table with very narrow rows, i.e. an index scan
   of the whole table, where none of the scan results are ever used. Once one
   actually uses the resulting rows, the performance difference completely
   vanishes.

4) as part of the index prefetching work, we might get the infrastructure to
   actually batch the hint-bit setting in this case too.
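
To make the single-CAS lock-upgrade fast path mentioned above concrete, here
is a minimal standalone sketch (bit positions and names are made up for
illustration; this is not the patch's actual code). Other share lockers are
fine, that being the point of the new lock level, but exclusive and
share-exclusive holders conflict:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define LOCK_VAL_SHARED          (UINT64_C(1) << 32)
    #define LOCK_VAL_SHARE_EXCLUSIVE (UINT64_C(1) << 50)
    #define LOCK_VAL_EXCLUSIVE       (UINT64_C(1) << 51)

    /*
     * Upgrade our own share lock to share-exclusive in one atomic step,
     * failing if a conflicting holder exists.
     */
    static bool
    try_upgrade_sketch(_Atomic uint64_t *state)
    {
        uint64_t    old_state = atomic_load(state);

        if (old_state & (LOCK_VAL_EXCLUSIVE | LOCK_VAL_SHARE_EXCLUSIVE))
            return false;       /* caller just skips setting the hint bits */

        /* drop our share count and add the share-exclusive bit at once */
        return atomic_compare_exchange_strong(state, &old_state,
                                              old_state
                                              - LOCK_VAL_SHARED
                                              + LOCK_VAL_SHARE_EXCLUSIVE);
    }

    int
    main(void)
    {
        _Atomic uint64_t state = LOCK_VAL_SHARED;   /* we hold a share lock */

        return try_upgrade_sketch(&state) ? 0 : 1;
    }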


I see some mild performance gains in workloads like pgbench [-S], but nothing
to write home about. Which I think is about what I would expect - there's a
minor efficiency gain due to the private refcount changes and not having state
tracking for two error recovery mechanisms (lwlocks' and private refcount).

Non-all-visible seqscans do see some performance gain due to 0011, whether the
table is hinted or not. But it's again something that's mostly noticeable in
microbenchmarks, as e.g. tuple deforming or qual evaluation has a much bigger
impact.


> Everything before/after 0008 was rather easy to digest and understand,
> and I found nothing really to call out at this stage.
>
> 0008 is understandable too, it's just sizable. While it is large, I find
> it well laid out and more readable than before.  I gave the locking code
> a good look, it seems correct AFAICT.

I hope so :). The locking logic is largely the same as lwlock.c, with some
exceptions due to the added lock level and differences in error handling
state.

I don't really see a good way to split 0008 unfortunately...  I previously had
split 0012 into four patches (core changes, heapam changes, the rest of the
adaptations to setting hint bits in various places, adding assertions relying on
the different lock levels), but I found it pretty unwieldy from a "comment
management" perspective, because comments that need to be rewritten are
temporarily wrong, or would need to be modified in yet another patch.  But I'm
open to going back to that approach.

I guess I could pull out the addition of UnlockBuffer() and the "redirection"
to it from LockBuffer() into a separate patch.


> Keep going, I'll be happy to dedicate time to testing and digging into
> the commits as you get this into a final state.  I look forward to
> extending/enhancing this code once integrated.

Cool!

I think 0001, 0002, 0003, 0005 and 0009 should be mergeable pretty soon.  I've
some further polishing to do for 0006 and 0007, but I think they could go in
well ahead of the rest.

For 0008, in addition to what's noted in the commit message, I think there
needs to be an additional section in src/backend/storage/buffer/README (or
such), explaining that buffer content locks used to be lwlocks but aren't
anymore for xyz reasons.  I suspect it'd also be good to have a few references
from lwlock.c to bufmgr.c to make sure the code is co-evolved.

For 0012, I think it might make sense to pull some of the changes to
fsm_vacuum_page() out into a separate commit. Basically changing the code to
acquire a lock on the page - I still can't quite believe that somebody thought
it's sane to update the page without even bothering with a share lock.

0014 needs a lot more polishing; there are references to the hint bit / locking
interactions all over. Pretty hard to find all the references :(.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
This email is just a review of some (specified below) of the patches 0001-0007.

On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> 0002: Not really required, but seems like an improvement to me

The commit message says the point is to get compiler warnings for
switch cases that should be exhaustive, but as soon as I looked for a
switch case like that, I see BufferIsLockedByMeInMode()

        switch (mode)
        {
            case BUFFER_LOCK_EXCLUSIVE:
                lw_mode = LW_EXCLUSIVE;
                break;
            case BUFFER_LOCK_SHARE:
                lw_mode = LW_SHARED;
                break;
            default:
                pg_unreachable();
        }

Which makes it impossible to get such a warning. When I add a lock
mode, it specifically doesn't warn when compiling.
However, I'm a big fan of using enums instead of macros when
appropriate, so I have no issue with this change. I just think the
commit message is a bit confusing.
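
As a standalone illustration of the warning being discussed (this is a
simplified version of the enum, not the real definition): compiled with
-Wswitch (implied by -Wall), adding a value to the enum makes the compiler
warn about the unhandled case in mode_name(), but only because the switch has
no default arm; a default: pg_unreachable() suppresses exactly that warning.

    #include <stdio.h>

    typedef enum BufferLockMode
    {
        BUFFER_LOCK_UNLOCK,
        BUFFER_LOCK_SHARE,
        BUFFER_LOCK_EXCLUSIVE,
    } BufferLockMode;

    static const char *
    mode_name(BufferLockMode mode)
    {
        /* no default arm, so -Wswitch can complain about new enum values */
        switch (mode)
        {
            case BUFFER_LOCK_UNLOCK:
                return "unlock";
            case BUFFER_LOCK_SHARE:
                return "share";
            case BUFFER_LOCK_EXCLUSIVE:
                return "exclusive";
        }
        return "unreachable";
    }

    int
    main(void)
    {
        printf("%s\n", mode_name(BUFFER_LOCK_SHARE));
        return 0;
    }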

> 0003: A prerequisite to 0004, pretty boring itself

LGTM.

> 0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
> additional bits yet, though.  Some annoying reformatting required to avoid
> long lines.

I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
macros cast the return value to a uint32. We won't use the extra bits
but we did bother to keep the macro result sized to the field width
before so keeping it uint32 is probably more confusing now that state
is 64 bit.
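
A standalone sketch of the point (illustrative field positions, not
necessarily the real buf_internals.h layout): the macros mask fields out of
the 64-bit state word, and the question is merely what width the result
should be narrowed to.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BUF_REFCOUNT_BITS    18
    #define BUF_REFCOUNT_MASK    ((UINT64_C(1) << BUF_REFCOUNT_BITS) - 1)
    #define BUF_USAGECOUNT_SHIFT BUF_REFCOUNT_BITS
    #define BUF_USAGECOUNT_MASK  (UINT64_C(0xF) << BUF_USAGECOUNT_SHIFT)

    /* the uint32 narrowing is the part under discussion */
    #define BUF_STATE_GET_REFCOUNT(state) \
        ((uint32_t) ((state) & BUF_REFCOUNT_MASK))
    #define BUF_STATE_GET_USAGECOUNT(state) \
        ((uint32_t) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT))

    int
    main(void)
    {
        uint64_t    state = (UINT64_C(3) << BUF_USAGECOUNT_SHIFT) | 7;

        printf("refcount=%" PRIu32 " usagecount=%" PRIu32 "\n",
               BUF_STATE_GET_REFCOUNT(state),
               BUF_STATE_GET_USAGECOUNT(state));
        return 0;
    }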

Not related to this patch, but I noticed GetBufferDescriptor() calls
for a uint32 and all the callers pretty much pass a signed int —
wonder why it calls for uint32.

> 0005: There already was a wait event class for BUFFERPIN. It seems better to
> make that more general than to implement them separately.

I reviewed and see no issues with the code, but I don't have an
opinion on this wait event naming so maybe you better _wait_ for some
other review ;)

> 0006+0007: This is preparatory work for 0008, but also worthwhile on its
> own. The private refcount stuff does show up in profiles. The reason it's
> related is that without these changes the added information in 0008 makes that
> worse.

I found it slightly confusing that this commit appears to
unnecessarily add the PrivateRefCountData struct (given that it
doesn't need it to do the new parallel array thing). You could wait
until you need it in 0008, but 0008 is big as it is, so it probably is
fine where it is.

in InitBufferManagerAccess(), why do you have

memset(&PrivateRefCountArrayKeys, 0, sizeof(Buffer));
seems like it should be
memset(PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));

I wonder how easy it will be to keep the Buffer in sync between
PrivateRefCountArrayKeys and the PrivateRefCountEntry — would a helper
function help?
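
Something like this hypothetical shape (struct layout and names invented just
to show the idea) would at least give the sync a single home:

    #include <stdint.h>

    typedef int Buffer;

    typedef struct PrivateRefCountEntry
    {
        Buffer      buffer;
        int32_t     refcount;
    } PrivateRefCountEntry;

    #define REFCOUNT_ARRAY_ENTRIES 8

    static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
    static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];

    /* the only place that assigns a slot, so key and entry cannot drift */
    static void
    AssignPrivateRefCountSlot(int slot, Buffer buffer)
    {
        PrivateRefCountArrayKeys[slot] = buffer;
        PrivateRefCountArray[slot].buffer = buffer;
        PrivateRefCountArray[slot].refcount = 0;
    }

    int
    main(void)
    {
        AssignPrivateRefCountSlot(0, 42);
        return PrivateRefCountArrayKeys[0] == PrivateRefCountArray[0].buffer ? 0 : 1;
    }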

ForgetPrivateRefCountEntry doesn’t clear the data member — but maybe
it doesn’t matter...

in ReservePrivateRefCountEntry() there is a superfluous clear

memset(&victim_entry->data, 0, sizeof(victim_entry->data));
victim_entry->data.refcount = 0;

0007
needs a commit message. overall seems fine though.
You should probably capitalize the "c" of "count" in
PrivateRefcountEntryLast to be consistent with the other names.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
>     effectively harmless.

Does anybody have opinions about whether to backpatch this fix? Given that it
has no real consequences I'm mildly inclined not to, but maybe there are cases
where the additional wait list lock cycle matters?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
On Mon, Nov 24, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> > 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
> >       effectively harmless.
>
> Does anybody have opinions about whether to backpatch this fix? Given that it
> has no real consequences I'm mildly inclined not to, but maybe there are cases
> where the additional wait list lock cycle matters?

Since it is a mistake, I am mildly in favor of backporting to avoid
confusion for future developers. It's pretty weird that LWLockWakeup()
has to be called again to actually unset LW_FLAG_HAS_WAITERS. But
since it's not really harmful, this is a very mild opinion.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Thomas Munro
Date:
On Fri, Nov 21, 2025 at 9:51 AM Andres Freund <andres@anarazel.de> wrote:
> It's worth pointing out that the new way of setting hint bits is inherently
> more expensive than what we did before - upgrading a lock to a different lock
> level isn't free, compared to doing, well, nothing.
>
> For paths that set the hint bits of a whole page, like a seqscan, that cost is
> more than amortized by the batched approach introduced in 0011. Those get
> faster with the patch, both when already hinted and when not.

Nice work!

> However, there are paths that aren't easily amenable to that approach, like
> e.g. an ordered index scan referencing unhinted tuples. There we only ever
> access a single tuple and release the upgraded lock after every tuple. If the
> index scan is perfectly correlated with the table and every tuple is unhinted,
> that's a decent amount of additional work.

Yeah, but it was only faster because it was cheating.  It presumably
doesn't happen when you bulk load and then create index.  It
presumably does happen when you insert a lot of data in order, on
first correlated index scan.  Seems like an inherent limitation of the
current tuple-at-a-time architecture when combined with the *required*
interlocking, and not a blocker for this work.

+ Some filesystems, raid implementations, ... do not tolerate the data being

I was aware of BTRFS (EIO on read) and ZFS 2.4 (EIO on read or write
depending on configuration option), but hadn't thought about RAID.
Ugh, right, non-matching RAID1 mirrors (and I guess also b0rked RAID5
parity bits?).  Fun.

https://bugzilla.kernel.org/show_bug.cgi?id=99171

> I've spent a lot of time micro-optimizing that workload, to avoid any
> significant regressions. An extreme stress-test started out being about 20%
> slower than today; as of my current local version, it's a bit faster (~1%) on
> one of my machines and a bit slower (~2%) on another. Partially that was
> achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
> path for updating a single hint bit, avoiding redundant reads of the lock
> state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
> locking code.  The latter is a bit of a cheat though - things would be even
> faster if we went with the old way of setting hint bits, but with the
> independent optimizations applied.
>
> I think that's ok though:
>
> 1) the old way of setting hint bits is a pretty dirty hack that causes issues
>    in quite a few places.
>
> 2) by definition, having to set hint bits is an ephemeral state, once the hint
>    bits are set, the difference vanishes
>
> 3) no normal workload shows the difference - my stress test does
>    SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
>    on a perfectly correlated table with very narrow rows, i.e. an index scan
>    of the whole table, where none of the scan results are ever used. Once one
>    actually uses the resulting rows, the performance difference completely
>    vanishes.
>
> 4) as part of the index prefetching work, we might get the infrastructure to
>    actually batch the hint-bit setting in this case too.

Yeah.  Was just thinking the same.  Both the streaming and batching
projects have opportunities to figure out an amortisation scheme.  I
have a few vague ideas about stream-based approaches already, hmm...

+1, I think this is OK for now.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-24 16:04:41 -0500, Melanie Plageman wrote:
> On Mon, Nov 24, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> > > 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
> > >       effectively harmless.
> >
> > Does anybody have opinions about whether to backpatch this fix? Given that it
> > has no real consequences I'm mildly inclined not to, but maybe there are cases
> > where the additional wait list lock cycle matters?
> 
> Since it is a mistake, I am mildly in favor of backporting to avoid
> confusion for future developers. It's pretty weird that LWLockWakeup()
> has to be called again to actually unset LW_FLAG_HAS_WAITERS. But
> since it's not really harmful, this is a very mild opinion.

Thanks for chiming in, done.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> 0008: The main change. Implements buffer content locking independently from
> lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> this, but I've not found a good way to reduce the duplication without giving
> up too much.  This patch does immediately introduce share-exclusive as a new
> lock level, mostly because it was too painful to do separately.
>
> 0009+0010+0011: Preparatory work for 0012.
>
> 0012: One of the main goals of this patchset - use the new share-exclusive
> lock level to only allow hint bits to be set while no IO is going on.

Below is a review of 0008-0012
I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)

0008:
-------
> [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
 of lwlocks
...
> This commit unfortunately introduces some code that is very similar to the
> code in lwlock.c, however the code is not equivalent enough to easily merge
> it. The future wins that this commit makes possible seem worth the cost.

It is a truly unfortunate amount of duplication. I tried some
refactoring myself just to convince myself it wasn't a good idea.
However, ISTM there is no reason for the lwlock and buffer locking
implementations to have to stay in sync. So they can diverge as
features are added and perhaps the duplication won't be as conspicuous
in the future.

> As of this commit nothing uses the new share-exclusive lock mode. It will be
>documented and used in a future commit. It seemed too complicated to introduce
> the lock-level in a separate commit.

I would have liked this mentioned earlier in the commit message. Also,
I don't know how I feel about it being "documented" in a future
commit...perhaps just don't say that.

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 28519ad2813..0a145d95024 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -32,22 +33,29 @@
 /*
  * Buffer state is a single 64-bit variable where following data is combined.
  *
+ * State of the buffer itself:
  * - 18 bits refcount
  * - 4 bits usage count
  * - 10 bits of flags
  *
+ * State of the content lock:
+ * - 1 bit has_waiter
+ * - 1 bit release_ok
+ * - 1 bit lock state locked

Somewhere you should clearly explain the scenarios in which you still
need to take the buffer header lock with LockBufHdr() vs when you can
just use atomic operations/CAS loops.

Separately, you should clearly explain what BM_LOCKED (I presume what
"lock state locked" refers to) protects.

+/*
+ * Definitions related to buffer content locks
+ */
+#define BM_LOCK_HAS_WAITERS         (UINT64CONST(1) << 63)
+#define BM_LOCK_RELEASE_OK          (UINT64CONST(1) << 62)
+
+#define BM_LOCK_VAL_SHARED          (UINT64CONST(1) << 32)
+#define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64CONST(1) << (32 + MAX_BACKENDS_BITS))
+#define BM_LOCK_VAL_EXCLUSIVE       (UINT64CONST(1) << (32 + 1 + MAX_BACKENDS_BITS))
+
+#define BM_LOCK_MASK                (((uint64)MAX_BACKENDS << 32) | BM_LOCK_VAL_SHARE_EXCLUSIVE | BM_LOCK_VAL_EXCLUSIVE)

Not the fault of this patch, but I always found it confusing that the
BM flags were for buffer manager. I always think BM stands for buffer
mapping.
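
Since the layout is easy to get wrong, here is a standalone sanity check of
the quoted bit layout (assuming MAX_BACKENDS_BITS is 18, per the procnumber.h
comment quoted below): the shared counter occupies bits 32..49, the
share-exclusive and exclusive bits sit at 50 and 51, and nothing collides
with the flag bits at 62/63 or the low 32 bits of buffer state.

    #include <assert.h>
    #include <stdint.h>

    #define MAX_BACKENDS_BITS 18
    #define MAX_BACKENDS ((1U << MAX_BACKENDS_BITS) - 1)

    #define BM_LOCK_HAS_WAITERS         (UINT64_C(1) << 63)
    #define BM_LOCK_RELEASE_OK          (UINT64_C(1) << 62)
    #define BM_LOCK_VAL_SHARED          (UINT64_C(1) << 32)
    #define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64_C(1) << (32 + MAX_BACKENDS_BITS))
    #define BM_LOCK_VAL_EXCLUSIVE       (UINT64_C(1) << (32 + 1 + MAX_BACKENDS_BITS))
    #define BM_LOCK_MASK                (((uint64_t) MAX_BACKENDS << 32) | \
                                         BM_LOCK_VAL_SHARE_EXCLUSIVE | \
                                         BM_LOCK_VAL_EXCLUSIVE)

    static_assert((BM_LOCK_MASK &
                   (BM_LOCK_HAS_WAITERS | BM_LOCK_RELEASE_OK)) == 0,
                  "lock values collide with waiter/release flags");
    static_assert((BM_LOCK_MASK & ((UINT64_C(1) << 32) - 1)) == 0,
                  "lock values collide with the low 32 buffer-state bits");

    int
    main(void)
    {
        return 0;
    }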

On a more actionable note: you probably want to update the comment in
procnumber.h in your earlier patch to take out the part about how the
limitation could be lifted using a 64 bit state -- since you made the
state 64 bit and didn't lift the limitation (see below for the
specific comment).

/*
 * Note: MAX_BACKENDS_BITS is 18 as that is the space available for buffer
 * refcounts in buf_internals.h.  This limitation could be lifted by using a
 * 64bit state; but it's unlikely to be worthwhile as 2^18-1 backends exceed
 * currently realistic configurations. Even if that limitation were removed,
 * we still could not a) exceed 2^23-1 because inval.c stores the ProcNumber
 * as a 3-byte signed integer, b) INT_MAX/4 because some places compute
 * 4*MaxBackends without any overflow check.  We check that the configured
 * number of backends does not exceed MAX_BACKENDS in InitializeMaxBackends().
 */

@@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
  * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At present,
  * there can be only one such waiter per buffer.
  *
+ * The content of buffers is protected via the buffer content lock,
+ * implemented as part buffer state. Note that the buffer header lock is *not*
+ * used to control access to the data in the buffer! We used to use an LWLock
+ * to implement the content lock, but having a dedicated implementation of
+ * content locks allows to implement some otherwise hard things (e.g.
+ * race-freely checking if AIO is in progress before locking a buffer
+ * exclusively) and makes otherwise impossible optimizations possible
+ * (e.g. unlocking and unpinning a buffer in one atomic operation).
+ *

I don't really like that this includes a description of how it used to
work. It makes the description more confusing when trying to
understand the current state. And comments like that make most sense
when the current state is confusing because of a long annoying
history.

@@ -302,7 +303,24 @@ extern void BufferGetTag(Buffer buffer,
RelFileLocator *rlocator,
 extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
-extern void LockBuffer(Buffer buffer, BufferLockMode mode);
+extern void UnlockBuffer(Buffer buffer);
+extern void LockBufferInternal(Buffer buffer, BufferLockMode mode);
+
+/*
+ * Handling BUFFER_LOCK_UNLOCK in bufmgr.c leads to sufficiently worse branch
+ * prediction to impact performance. Therefore handle that switch here, were
+ * most of the time `mode` will be a constant and thus can be optimized out by
+ * the compiler.

Typo: were -> where

+ */
+static inline void
+LockBuffer(Buffer buffer, BufferLockMode mode)
+{
+    if (mode == BUFFER_LOCK_UNLOCK)
+        UnlockBuffer(buffer);
+    else
+        LockBufferInternal(buffer, mode);
+}
+

Is there a reason to stick with the LockBuffer(buf,
BUFFER_LOCK_UNLOCK) interface here? It seems like a good time to
start using UnlockBuffer() which I thought most people find more
intuitive.

 /*
- * Acquire or release the content_lock for the buffer.
+ * Acquire the buffer content lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ *
+ * This uses almost the same locking approach as lwlock.c's
+ * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
+ * discussion.
+ *
+ * The reason that this, and most of the other BufferLock* functions, get both
+ * the Buffer and BufferDesc* as parameters, is that looking up one from the
+ * other repeatedly shows up noticeably in profiles.
+ */
+static inline void
+BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)

Either here or over the lock mode enum, I'd spend a few sentences
describing how the buffer lock modes interact -- what conflicts with
what. You spend more time saying how this is like LWLock than
explaining what it actually is.

+/*
+ * Remove ourselves from the waitlist.
+ *
+ * This is used if we queued ourselves because we thought we needed to sleep
+ * but, after further checking, we discovered that we don't actually need to
+ * do so.
+ */
+static void
+BufferLockDequeueSelf(BufferDesc *buf_hdr)
+{
+    bool        on_waitlist;
+
+    LockBufHdr(buf_hdr);
+
...
+    /* XXX: combine with fetch_and above? */
+    UnlockBufHdr(buf_hdr);

I know you just ported this comment from the LWLock implementation but
I can't see how it could be worth the effort. It won't be in the
hottest path as you must satisfy a few conditions to get here.


+ * Wakeup all the lockers that currently have a chance to acquire the lock.
+ */
+static void
+BufferLockWakeup(BufferDesc *buf_hdr, bool unlocked)

This should have a more explanatory comment. You should especially
document the unlocked parameter. I would expect something somewhere in
or above this function that basically says (maybe less verbosely)

- Before waking anything: allow all
- After waking SHARE:
    wake SHARE only
- After waking SHARE_EXCLUSIVE:
    wake SHARE only
    (no more SHARE_EXCLUSIVE)
- After waking EXCLUSIVE:
    wake none

+{
+    bool        new_release_ok;
+    bool        wake_exclusive = unlocked;
+    bool        wake_share_exclusive = true;
+    proclist_head wakeup;
+    proclist_mutable_iter iter;
+
+    proclist_init(&wakeup);
+
+    new_release_ok = true;
+
+    /* lock wait list while collecting backends to wake up */
+    LockBufHdr(buf_hdr);
+
+    proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
+    {
+        PGPROC       *waiter = GetPGProcByNumber(iter.cur);
+
+        if (!wake_exclusive && waiter->lwWaitMode == BUFFER_LOCK_EXCLUSIVE)
+            continue;
+
+        if (!wake_share_exclusive && waiter->lwWaitMode ==
BUFFER_LOCK_SHARE_EXCLUSIVE)
+            continue;
+

It seems there is a difference between LWLockWakeup() and
BufferLockWakeup() if the queue contains a share lock waiter followed
by an exclusive lock waiter.

In LWLockWakeup(), it will wake up the share waiter, set
wokeup_somebody to true, and then not wake up the exclusive lock
waiter because wokeup_somebody is true.

In BufferLockWakeup() when unlocked is true, it will wake up the share
lock waiter and then wake up the exclusive lock waiter because
wake_exclusive and wake_share_exclusive are both still true.

This might be the intended behavior, but it is a difference from
LWLockWakeup(), so I think it is worth documenting why it is okay.


+         * Signal that the process isn't on the wait list anymore. This allows
+         * BufferLockDequeueSelf() to remove itself of the waitlist with a
+         * proclist_delete(), rather than having to check if it has been

I know this comment was ported over, but the "remove itself of the
waitlist" -- the "of" is confusing in the original comment and it is
confusing here.

+ * Compute subtraction from buffer state for a release of a held lock in
+ * `mode`.
+ *
+ * This is separated from BufferLockUnlock() as we want to combine the lock
+ * release with other atomic operations when possible, leading to the lock
+ * release being done in multiple places.
+ */
+static inline uint64
+BufferLockReleaseSub(BufferLockMode mode)

I don't understand why this is a separate function even with your comment.

+ * BufferLockHeldByMe - test whether my process holds the content lock in any
+ * mode
+ *
+ * This is meant as debug support only.
+ */
+static bool
+BufferLockHeldByMe(BufferDesc *buf_hdr)
+{
+    PrivateRefCountEntry *entry =
+        GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+
+    if (!entry)
+        return false;
+    else
+        return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+}

Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
an assert when unpinning the buffer about how I needed to have
released the lock.

Now, though, it doesn't trip the assert because we don't track
multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
refcount entry. Whereas LWLockHeldByMe() checked held_lwlocks which
had multiple entries and would report that I still held a lock. Now,
if I call LockBuffer(SHARE) twice, I'll only find out later when I
have a hang because someone is trying to get an exclusive lock and the
actual BufferDesc->state still has a lock set.

Maybe you should add an assert to the lock acquisition path that the
private ref count entry mode is UNLOCK?

 void
-LockBuffer(Buffer buffer, BufferLockMode mode)
+LockBufferInternal(Buffer buffer, BufferLockMode mode)
 {
-    buf = GetBufferDescriptor(buffer - 1);
+    buf_hdr = GetBufferDescriptor(buffer - 1);

-    if (mode == BUFFER_LOCK_UNLOCK)
-        LWLockRelease(BufferDescriptorGetContentLock(buf));
-    else if (mode == BUFFER_LOCK_SHARE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+    if (mode == BUFFER_LOCK_SHARE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
+    else if (mode == BUFFER_LOCK_SHARE_EXCLUSIVE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
     else if (mode == BUFFER_LOCK_EXCLUSIVE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
     else
         elog(ERROR, "unrecognized buffer lock mode: %d", mode);

I presume you've stuck with this if statement structure because that
is what LockBuffer() used. Even though now the BufferLockMode passed
through to BufferLockAcquire is the exact thing you are testing. It
caught my eye and I had to check multiple times if the mode being
passed in is the same. Basically I found it a bit distracting.


@@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
     if (BufferIsLocal(buffer))
         UnpinLocalBufferNoOwner(buffer);
     else
+    {
+        PrivateRefCountEntry *ref;
+
+        ref = GetPrivateRefCountEntry(buffer, false);
+
+        /*
+         * If the buffer was locked at the time of the resowner release,
+         * release the lock now. This should only happen after errors.
+         */
+        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+        {
+            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
+
+            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
+            BufferLockUnlock(buffer, buf);
+        }
+
         UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
+    }
 }

 Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.

0009 -- I didn't look closely at 0009

0010:
--------
> [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER

> To be able to guarantee that we can set the hint bit, acquire an exclusive
> lock on the old buffer. We need the hint bits to be set as otherwise
> reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
> get confused.

So, this is an active bug? And what exactly do you mean
heap_freeze_tuple() gets confused? I thought somewhere in there we
would check the clog.

0011:
--------
> [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode

> 2) We would like to stop setting hint bits while pages are being written
> out. The necessary locking becomes visible for page mode scans if done for
> every tuple. With batching the overhead can be amortized to only happen
> once per page.

I don't understand the above point. What does this patch have to do
with not setting hint bits while pages are being written out (that
happens in the next patch)? And I presume you mean don't set hint bits
on a buffer that is being flushed by someone else -- but it sounds
like you mean not to set hint bits as part of flushing a buffer.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4b0c49f4bb0..ddabd1a3ec3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
                     BlockNumber block, int lines,
                     bool all_visible, bool check_serializable)
 {
+    Oid            relid = RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+    BatchMVCCState batchmvcc;
+    HeapTupleData *tuples = batchmvcc.tuples;
+    bool       *visible = batchmvcc.visible;
+#else
+    HeapTupleData tuples[MaxHeapTuplesPerPage];
+    bool        visible[MaxHeapTuplesPerPage];
+#endif
     int            ntup = 0;

It's pretty confusing when visible is an output vs input parameter and
who fills it in when. (i.e. it's filled in in page_collect_tuples() if
check_serializable and all_visible are true, otherwise it's filled in
in HeapTupleSatisfiesMVCCBatch())

Personally, I think I'd almost prefer an all-visible and
not-all-visible version of page_collect_tuples() (or helper functions
containing the loop) that separate this. (I haven't tried it though)

BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
which I assume you are removing.

+        /*
+         * If the page is not all-visible or we need to check serializability,
+         * maintain enough state to be able to refind the tuple efficiently,
+         * without again needing to extract it from the page.
+         */
+        if (!all_visible || check_serializable)
+        {

"enough state" is pretty vague here. Maybe mention what it wouldn't be
valid to do with the tuples array?

/*
* If the page is all visible, these fields won'otherwise wont be
* populated in loop below.
*/

Some spelling issues with "won'" and "wont"

+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+                            int ntups,

I don't really get why this (the patch in general) would be
substantially faster.  You still call HeapTupleSatisfiesMVCC() for
each tuple in a loop. The difference is that you've got pointers into
the array of tuples instead of doing PageGetItem(), then calling
HeapTupleSatisfiesMVCC() for each tuple.


0012:
-------

> [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> To address these issue, this commit changes the rules so that modifications to
> pages are not allowed anymore while holding a share lock. Instead the new

In the commit message, you make it sound like you only change the lock
level for setting the hint bits. But that wouldn't solve any problems
if FlushBuffer() could still happen with a share lock. I would try to
make it clear that you change the lock level both for setting the hint
bits and flushing the buffer.

@@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
      */
     for (i = 0; i < so->numKilled; i++)
     {
+        if (!killedsomething)
+        {
+            /*
+             * Use hint bit infrastructure to be allowed to modify the page
+             * without holding an exclusive lock.
+             */
+            if (!BufferBeginSetHintBits(buffer))
+                goto unlock;
+        }
+

I don't understand why this is in the loop. Clearly you want to call
BufferBeginSetHintBits() once, but why would you do it in the loop?

In the comment, I might also note that the lock level will be upgraded
as needed or something, since we only have a share lock; it is
confusing at first

- * SetHintBits()
+ * To be allowed to set hint bits, SetHintBits() needs to call
+ * BufferBeginSetHintBits(). However, that's not free, and some callsites call
+ * SetHintBits() on many tuples in a row. For those it makes sense to amortize
+ * the cost of BufferBeginSetHintBits(). Additionally it's desirable to defer
+ * the cost of BufferBeginSetHintBits() until a hint bit needs to actually be
+ * set. This enum serves as the necessary state space passed to
+ * SetHintbitsExt().
+ */
+typedef enum SetHintBitsState
+{
+   /* not yet checked if hint bits may be set */
+   SHB_INITIAL,
+   /* failed to get permission to set hint bits, don't check again */
+   SHB_DISABLED,
+   /* allowed to set hint bits */
+   SHB_ENABLED,
+} SetHintBitsState;

I dislike the SHB prefix. Perhaps something involving the word hint?
And should the enum name itself (SetHintBitsState) include the word
"batch"? I know that would make it long. At least the comment should
explain that these are needed when batch setting hint bits.
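
To make the intended state machine concrete, a standalone sketch (stub
functions stand in for BufferBeginSetHintBits() and the actual hint-bit
write): the begin cost is paid lazily, and only once, for a whole page worth
of tuples.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum
    {
        SHB_INITIAL,            /* not yet checked if hint bits may be set */
        SHB_DISABLED,           /* failed to get permission, don't retry */
        SHB_ENABLED,            /* allowed to set hint bits */
    } SetHintBitsState;

    static bool
    begin_set_hint_bits(void)
    {
        return true;            /* pretend the lock upgrade succeeded */
    }

    static void
    set_one_hint(int tup, SetHintBitsState *state)
    {
        if (*state == SHB_INITIAL)
            *state = begin_set_hint_bits() ? SHB_ENABLED : SHB_DISABLED;
        if (*state == SHB_ENABLED)
            printf("hinted tuple %d\n", tup);
    }

    int
    main(void)
    {
        SetHintBitsState state = SHB_INITIAL;

        for (int tup = 0; tup < 3; tup++)
            set_one_hint(tup, &state);

        /* the real code calls BufferFinishSetHintBits() if SHB_ENABLED */
        return 0;
    }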

+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+               uint16 infomask, TransactionId xid, SetHintBitsState *state)
 {
     if (TransactionIdIsValid(xid))
     {
-        /* NB: xid must be known committed here! */
-        XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
+        if (BufferIsPermanent(buffer))
+        {

I really wish there was a way to better pull apart the batch and
non-batch cases in a way that could allow the below block to be in a
helper do_set_hint() (or whatever) which SetHintBitsExt() and
SetHintBits() called. And then you inlined BufferSetHintBits16().

I presume you didn't do this because HeapTupleSatisfiesMVCC() for
SNAPSHOT_MVCC calls the non-batch version (and, of course,
HeapTupleSatisfiesVisibility() is the much more common case).

if (TransactionIdIsValid(xid))
{
        if (BufferIsPermanent(buffer))
        {
                /* NB: xid must be known committed here! */
                XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);

                if (XLogNeedsFlush(commitLSN) &&
                        BufferGetLSNAtomic(buffer) < commitLSN)
                {
                        /* not flushed and no LSN interlock, so don't set hint */
                        return false;
                }
        }
}

Separately, I was thinking, should we assert here about having the
right lock type?

 * It is only safe to set a transaction-committed hint bit if we know the
 * transaction's commit record is guaranteed to be flushed to disk before the
 * buffer, or if the table is temporary or unlogged and will be obliterated by
 * a crash anyway.  We cannot change the LSN of the page here, because we may
 * hold only a share lock on the buffer, so we can only use the LSN to
 * interlock this if the buffer's LSN already is newer than the commit LSN;
 * otherwise we have to just refrain from setting the hint bit until some
 * future re-examination of the tuple.
 *

Should this say "we may hold only a share exclusive lock on the
buffer". Also what is "this" in "only use the LSN to interlock this"?

@@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
Buffer buffer,
+    if (state == SHB_ENABLED)
+        BufferFinishSetHintBits(buffer, true, true);
+
     return nvis;
 }

I wondered if it would be more natural for BufferBeginSetHintBits()
and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
instead of having callers do it. But, I guess you don't do this
because of gist and hash indexes using this for doing their own
modifications. It seems like it also would help make it clear that
BufferBegin and BufferFinish are for batch mode.

I can't help but feel a bit of awkwardness in the whole API between
this and the way you've supported non-batch mode in SetHintBitsExt().
But it's easy for me to criticize without providing concrete ideas of
how to reorganize it.

+static inline bool
+SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *lockstate)
+{
+   uint64      old_state;
+   PrivateRefCountEntry *ref;
+   BufferLockMode mode;

These functions probably ought to have comments. And I wonder if you
should say anything (or even assert anything) about how IO better not
be in progress (though it is enforced by the locks).

+        /*
+         * Can't upgrade if somebody else holds the lock in exlusive or
+         * share-exclusive mode.
+         */

typo: exlusive -> exclusive

-4. It is considered OK to update tuple commit status bits (ie, OR the
-values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+4. Non-critical information on a page ("hint bits") may be modified while
+holding only a share-exclusive lock and pin on the page. To do so in cases
+where only a share lock is already held, use BufferBeginSetHintBits() &
+BufferFinishSetHintBits() (if multiple hint bits are to be set) or
+BufferSetHintBits16() (if a single hit bit is set).

typo: single hit -> single hint
Also, you should probably say that you use BufferBeginSetHintBits() to
actually upgrade the lock.

I also don't see anywhere in the README where flushing a buffer is
mentioned -- and what lock level is needed for that in different
situations. It kind of feels like it is worth mentioning.

-   if ((pg_atomic_read_u64(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
-       (BM_DIRTY | BM_JUST_DIRTIED))
+   if (unlikely((lockstate & (BM_DIRTY | BM_JUST_DIRTIED)) !=
+                (BM_DIRTY | BM_JUST_DIRTIED)))

I don't quite understand why you do the atomic read in the call to
MarkSharedBufferDirtyHint() from MarkBufferDirtyHint() instead of at
the top of MarkSharedBufferDirtyHint() -- and then you wouldn't have
to pass the lockstate as a parameter, right?

--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
    max_avail = fsm_get_max_avail(page);
    /*
-    * Reset the next slot pointer. This encourages the use of low-numbered
-    * pages, increasing the chances that a later vacuum can truncate the
-    * relation.  We don't bother with a lock here, nor with marking the page
-    * dirty if it wasn't already, since this is just a hint.
+    * Try to reset the next slot pointer. This encourages the use of
+    * low-numbered pages, increasing the chances that a later vacuum can
+    * truncate the relation.  We don't bother with a lock here, nor with
+    * marking the page dirty if it wasn't already, since this is just a hint.
+    *
+    * To be allowed to update the page without an exclusive lock, we have to
+    * use the hint bit infrastructure.
     */

What the heck? This didn't even take a share lock before...

diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index 66a5c80b5a6..a59696b6484 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -298,9 +298,18 @@ restart:
     * lock and get a garbled next pointer every now and then, than take the
     * concurrency hit of an exclusive lock.
     *
+    * Without an exclusive lock, we need to use the hint bit infrastructure
+    * to be allowed to modify the page.
+    *

Is the sentence above this still correct?

/*
* Update the next-target pointer. Note that we do this even if we're only
* holding a shared lock, on the grounds that it's better to use a shared
* lock and get a garbled next pointer every now and then, than take the
* concurrency hit of an exclusive lock.

We appear to avoid the garbling now?

In general on 0012, I didn't spend much time checking if you caught
all the places where we mention our hint bit hackery (only taking the
share lock). But those can always be caught later as we inevitably
encounter them.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

Thanks a lot for that detailed review!  A few questions and comments, before I
try to address the comments in the next version.


On 2025-11-25 10:44:31 -0500, Melanie Plageman wrote:
> On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > 0008: The main change. Implements buffer content locking independently from
> > lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> > this, but I've not found a good way to reduce the duplication without giving
> > up too much.  This patch does immediately introduce share-exclusive as a new
> > lock level, mostly because it was too painful to do separately.
> >
> > 0009+0010+0011: Preparatory work for 0012.
> >
> > 0012: One of the main goals of this patchset - use the new share-exclusive
> > lock level to only allow hint bits to be set while no IO is going on.
> 
> Below is a review of 0008-0012
> I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)

I left that in there to make the comparison easier. But clearly that's not to
be committed...


> 0008:
> -------
> > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
>  of lwlocks
> ...
> > This commit unfortunately introduces some code that is very similar to the
> > code in lwlock.c, however the code is not equivalent enough to easily merge
> > it. The future wins that this commit makes possible seem worth the cost.
> 
> It is a truly unfortunate amount of duplication. I tried some
> refactoring myself just to convince myself it wasn't a good idea.

Without success, I guess?


> However, ISTM there is no reason for the lwlock and buffer locking
> implementations to have to stay in sync. So they can diverge as
> features are added and perhaps the duplication won't be as conspicuous
> in the future.

Right - I'd expect further divergence.


> > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > documented and used in a future commit. It seemed too complicated to introduce
> > the lock-level in a separate commit.
> 
> I would have liked this mentioned earlier in the commit message.

Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
because it doesn't just lead to adding code, but to changing code.


> Also, I don't know how I feel about it being "documented" in a future
> commit...perhaps just don't say that.

Hm, why?


> diff --git a/src/include/storage/buf_internals.h
> b/src/include/storage/buf_internals.h
> index 28519ad2813..0a145d95024 100644
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -32,22 +33,29 @@
>  /*
>   * Buffer state is a single 64-bit variable where following data is combined.
>   *
> + * State of the buffer itself:
>   * - 18 bits refcount
>   * - 4 bits usage count
>   * - 10 bits of flags
>   *
> + * State of the content lock:
> + * - 1 bit has_waiter
> + * - 1 bit release_ok
> + * - 1 bit lock state locked
> 
> Somewhere you should clearly explain the scenarios in which you still
> need to take the buffer header lock with LockBufHdr() vs when you can
> just use atomic operations/CAS loops.

There is an explanation of that, or at least my attempt at it ;). See the
documentation for BufferDesc.


> @@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
>   * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At present,
>   * there can be only one such waiter per buffer.
>   *
> + * The content of buffers is protected via the buffer content lock,
> + * implemented as part buffer state. Note that the buffer header lock is *not*
> + * used to control access to the data in the buffer! We used to use an LWLock
> + * to implement the content lock, but having a dedicated implementation of
> + * content locks allows to implement some otherwise hard things (e.g.
> + * race-freely checking if AIO is in progress before locking a buffer
> + * exclusively) and makes otherwise impossible optimizations possible
> + * (e.g. unlocking and unpinning a buffer in one atomic operation).
> + *
> 
> I don't really like that this includes a description of how it used to
> work. It makes the description more confusing when trying to
> understand the current state. And comments like that make most sense
> when the current state is confusing because of a long annoying
> history.

I generally agree with that - however, in this case it seemed like folks, in
the future, might actually be wondering why this isn't using lwlocks.


> + */
> +static inline void
> +LockBuffer(Buffer buffer, BufferLockMode mode)
> +{
> +    if (mode == BUFFER_LOCK_UNLOCK)
> +        UnlockBuffer(buffer);
> +    else
> +        LockBufferInternal(buffer, mode);
> +}
> +
> 
> Is there a reason to stick with the LockBuffer(buf,
> BUFFER_LOCK_UNLOCK) interface here? It seems like a good time to
> start using UnlockBuffer() which I thought most people find more
> intuitive.

There are like 200 uses of BUFFER_LOCK_UNLOCK|GIN_UNLOCK|GIST_UNLOCK. And
probably a lot of external ones.  I'm not against using UnlockBuffer()
directly in the future, but changing all that code as part of this change
doesn't quite seem to make sense.


>  /*
> - * Acquire or release the content_lock for the buffer.
> + * Acquire the buffer content lock in the specified mode
> + *
> + * If the lock is not available, sleep until it is.
> + *
> + * Side effect: cancel/die interrupts are held off until lock release.
> + *
> + * This uses almost the same locking approach as lwlock.c's
> + * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
> + * discussion.
> + *
> + * The reason that this, and most of the other BufferLock* functions, get both
> + * the Buffer and BufferDesc* as parameters, is that looking up one from the
> + * other repeatedly shows up noticeably in profiles.
> + */
> +static inline void
> +BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
> 
> Either here or over the lock mode enum, I'd spend a few sentences
> describing how the buffer lock modes interact -- what conflicts with
> what. You spend more time saying how this is like LWLock than
> explaining what it actually is.

Historically they're just explained in the README. But yea, it makes sense to
add to the enum. I like adding it to the enum better than to the function, as
there are multiple ways to acquire a lock.


> +{
> +    bool        new_release_ok;
> +    bool        wake_exclusive = unlocked;
> +    bool        wake_share_exclusive = true;
> +    proclist_head wakeup;
> +    proclist_mutable_iter iter;
> +
> +    proclist_init(&wakeup);
> +
> +    new_release_ok = true;
> +
> +    /* lock wait list while collecting backends to wake up */
> +    LockBufHdr(buf_hdr);
> +
> +    proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
> +    {
> +        PGPROC       *waiter = GetPGProcByNumber(iter.cur);
> +
> +        if (!wake_exclusive && waiter->lwWaitMode == BUFFER_LOCK_EXCLUSIVE)
> +            continue;
> +
> +        if (!wake_share_exclusive && waiter->lwWaitMode ==
> BUFFER_LOCK_SHARE_EXCLUSIVE)
> +            continue;
> +
> 
> It seems there is a difference between LWLockWakeup() and
> BufferLockWakeup() if the queue contains a share lock waiter followed
> by an exclusive lock waiter.
> 
> In LWLockWakeup(), it will wake up the share waiter, set
> wokeup_somebody to true, and then not wake up the exclusive lock
> waiter because wokeup_somebody is true.
> 
> In BufferLockWakeup() when unlocked is true, it will wake up the share
> lock waiter and then wake up the exclusive lock waiter because
> wake_exclusive and wake_share_exclusive are both still true.
> 
> This might be the intended behavior, but it is a difference from
> LWLockWakeup(), so I think it is worth documenting why it is okay.

Yea, that doesn't look quite right. I think I just whacked the code around too
much and somewhere lost that branch.


> 
> +         * Signal that the process isn't on the wait list anymore. This allows
> +         * BufferLockDequeueSelf() to remove itself of the waitlist with a
> +         * proclist_delete(), rather than having to check if it has been
> 
> I know this comment was ported over, but the "remove itself of the
> waitlist" -- the "of" is confusing in the original comment and it is
> confusing here.

As in s/of/from/?


> + * Compute subtraction from buffer state for a release of a held lock in
> + * `mode`.
> + *
> + * This is separated from BufferLockUnlock() as we want to combine the lock
> + * release with other atomic operations when possible, leading to the lock
> + * release being done in multiple places.
> + */
> +static inline uint64
> +BufferLockReleaseSub(BufferLockMode mode)
> 
> I don't understand why this is a separate function even with your comment.

Because there are operations where we want to unlock the buffer as well as do
something else. E.g. in 0013, UnlockReleaseBuffer() we want to unlock the
buffer and decrease the refcount in one atomic operation. For that we need to
know what to subtract from the state variable for the lock portion - hence
BufferLockReleaseSub().
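
As a standalone illustration of that (layout invented, not the patch's code):
knowing the subtraction constant for the held lock mode lets a caller fold
the unlock and the unpin into one atomic operation.

    #include <stdatomic.h>
    #include <stdint.h>

    #define REFCOUNT_ONE    UINT64_C(1)             /* pin count in low bits */
    #define LOCK_VAL_SHARED (UINT64_C(1) << 32)     /* one share-lock holder */

    /* release the content lock and the pin with a single fetch_sub */
    static void
    unlock_release_sketch(_Atomic uint64_t *state, uint64_t lock_release_sub)
    {
        atomic_fetch_sub(state, lock_release_sub + REFCOUNT_ONE);
    }

    int
    main(void)
    {
        _Atomic uint64_t state = LOCK_VAL_SHARED + REFCOUNT_ONE;

        /* lock_release_sub is what BufferLockReleaseSub() would compute */
        unlock_release_sketch(&state, LOCK_VAL_SHARED);
        return atomic_load(&state) == 0 ? 0 : 1;
    }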
 

> + * BufferLockHeldByMe - test whether my process holds the content lock in any
> + * mode
> + *
> + * This is meant as debug support only.
> + */
> +static bool
> +BufferLockHeldByMe(BufferDesc *buf_hdr)
> +{
> +    PrivateRefCountEntry *entry =
> +        GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
> +
> +    if (!entry)
> +        return false;
> +    else
> +        return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
> +}
> 
> Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
> calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
> an assert when unpinning the buffer about how I needed to have
> released the lock.

Right, that's important to keep that way.


> Now, though, it doesn't trip the assert because we don't track
> multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
> BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
> refcount entry.

That part seems right.


> Whereas LWLockHeldByMe() checked held_lwlocks which had multiple entries and
> would report that I still held a lock.

It'd be much better if we had detected that redundant lock acquisition; that's
not legal...


> Now, if I call LockBuffer(SHARE) twice, I'll only find out later when I have
> a hang because someone is trying to get an exclusive lock and the actual
> BufferDesc->state still has a lock set.
> 
> Maybe you should add an assert to the lock acquisition path that the
> private ref count entry mode is UNLOCK?

Yes, we should...

It'd be nice if we had a decent way to test things that we expect to
crash. Like repeated buffer lock acquisitions...

>  void
> -LockBuffer(Buffer buffer, BufferLockMode mode)
> +LockBufferInternal(Buffer buffer, BufferLockMode mode)
>  {
> -    buf = GetBufferDescriptor(buffer - 1);
> +    buf_hdr = GetBufferDescriptor(buffer - 1);
> 
> -    if (mode == BUFFER_LOCK_UNLOCK)
> -        LWLockRelease(BufferDescriptorGetContentLock(buf));
> -    else if (mode == BUFFER_LOCK_SHARE)
> -        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
> +    if (mode == BUFFER_LOCK_SHARE)
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
> +    else if (mode == BUFFER_LOCK_SHARE_EXCLUSIVE)
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
>      else if (mode == BUFFER_LOCK_EXCLUSIVE)
> -        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
>      else
>          elog(ERROR, "unrecognized buffer lock mode: %d", mode);
> 
> I presume you've stuck with this if statement structure because that
> is what LockBuffer() used.
>
> Even though now the BufferLockMode passed
> through to BufferLockAcquire is the exact thing you are testing.

Ah. Using explicit constants allows the compiler to do constant propagation
(so e.g. BufferLockAttempt() doesn't have runtime branches on mode), which
turns out to generate more efficient code.  I'll add a comment to that effect.
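
Roughly this, as an illustration (the helper names here are made up):

    /*
     * Because each call site passes a literal mode, the inlined callee's
     * branches on `mode` constant-fold away; forwarding the runtime
     * `mode` variable through a single call would keep them.
     */
    static inline bool
    BufferLockAttemptSketch(BufferDesc *buf_hdr, BufferLockMode mode)
    {
        if (mode == BUFFER_LOCK_SHARE)
            return TryLockShared(buf_hdr);      /* hypothetical helper */
        else
            return TryLockExclusive(buf_hdr);   /* hypothetical helper */
    }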


> @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
>      if (BufferIsLocal(buffer))
>          UnpinLocalBufferNoOwner(buffer);
>      else
> +    {
> +        PrivateRefCountEntry *ref;
> +
> +        ref = GetPrivateRefCountEntry(buffer, false);
> +
> +        /*
> +         * If the buffer was locked at the time of the resowner release,
> +         * release the lock now. This should only happen after errors.
> +         */
> +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> +        {
> +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> +
> +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> +            BufferLockUnlock(buffer, buf);
> +        }
> +
>          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> +    }
>  }
> 
>  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.

Do you have a better suggestion? I'll add a comment that makes that explicit,
but other than that I don't have a great idea. Renaming the whole buffer pin
mechanism seems pretty noisy.


> 0009 -- I didn't look closely at 0009

That's hopefully just a boring code move...


> 0010:
> --------
> > [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER
> 
> > To be able to guarantee that we can set the hint bit, acquire an exclusive
> > lock on the old buffer. We need the hint bits to be set as otherwise
> > reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
> > get confused.
> 
> So, this is an active bug?

I don't think so - we have an AEL on the relation at that point, so nobody
else can access the buffer, aside from checkpointer writing it out. Which
doesn't currently block hint bits from being set.


> And what exactly do you mean heap_freeze_tuple() gets confused? I thought
> somewhere in there we would check the clog.

heap_freeze_tuple()->heap_prepare_freeze_tuple() assumes that:
 * It is assumed that the caller has checked the tuple with
 * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
 * (else we should be removing the tuple, not freezing it).

In the new world, if HTSV were not allowed to set the hint bit, e.g. because
the page is in the process of being written out, there wouldn't be a guarantee
that hints were set.  Which then leads to confusion, because some of the code
assumes that hint bits were already set.  TBH, I don't quite remember the
precise details, it's been a while (this was already part of an earlier
attempt at dealing with the hint bit stuff).  I'll try to reconstruct.


> 0011:
> --------
> > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode
> 
> > 2) We would like to stop setting hint bits while pages are being written
> > out. The necessary locking becomes visible for page mode scans if done for
> > every tuple. With batching the overhead can be amortized to only happen
> > once per page.
> 
> I don't understand the above point. What does this patch have to do
> with not setting hint bits while pages are being written out (that
> happens in the next patch)?

It's just an explanation for why we want to use batch mode. Without the
changed locking model the reason for that is a lot less clear.

> And I presume you mean don't set hint bits on a buffer that is being flushed
> by someone else -- but it sounds like you mean not to set hint bits as part
> of flushing a buffer.

Right, I do mean the former.


> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 4b0c49f4bb0..ddabd1a3ec3 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
>                      BlockNumber block, int lines,
>                      bool all_visible, bool check_serializable)
>  {
> +    Oid            relid = RelationGetRelid(scan->rs_base.rs_rd);
> +#ifdef BATCHMVCC_FEWER_ARGS
> +    BatchMVCCState batchmvcc;
> +    HeapTupleData *tuples = batchmvcc.tuples;
> +    bool       *visible = batchmvcc.visible;
> +#else
> +    HeapTupleData tuples[MaxHeapTuplesPerPage];
> +    bool        visible[MaxHeapTuplesPerPage];
> +#endif
>      int            ntup = 0;
> 
> It's pretty confusing when visible is an output vs input parameter and
> who fills it in when. (i.e. it's filled in in page_collect_tuples() if
> check_serializable and all_visible are true, otherwise it's filled in
> in HeapTupleSatisifiesMVCCBatch())

I don't really see a good alternative.


> Personally, I think I'd almost prefer an all-visible and
> not-all-visible version of page_collect_tuples() (or helper functions
> containing the loop) that separate this. (I haven't tried it though)

I have a hard time seeing how that comes out better. Where would you make the
switch between the different functions and how would that lead to easier to
understand code?


> BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
> which I assume you are removing.

Yea, I only left it in so others perhaps can reproduce the performance effect
of needing BATCHMVCC_FEWER_ARGS.


> +        /*
> +         * If the page is not all-visible or we need to check serializability,
> +         * maintain enough state to be able to refind the tuple efficiently,
> +         * without again needing to extract it from the page.
> +         */
> +        if (!all_visible || check_serializable)
> +        {
> 
> "enough state" is pretty vague here.

I don't really follow. Wouldn't going into more detail just restate the code
in a comment?


> Maybe mention what it wouldn't be valid to do with the tuples array?

Hm? The tuples array *is* that state?



> + * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
> + * is set to contain the offsets of visible tuples.
> + *
> + * Returns the number of visible tuples.
> + */
> +int
> +HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
> +                            int ntups,
> 
> I don't really get why this (the patch in general) would be
> substantially faster.  You still call HeapTupleSatisfiesMVCC() for
> each tuple in a loop. The difference is that you've got pointers into
> the array of tuples instead of doing PageGetItem(), then calling
> HeapTupleSatisfiesMVCC() for each tuple.

There are a few reasons (see the sketch after the list):

1) Most importantly, in the batched world, we only need to check if we can set
   hint bits once. That check is far from free. We also only need to mark the
   buffer dirty once.

2) HeapTupleSatisfiesMVCCBatch() can inline HeapTupleSatisfiesMVCC() and avoid
   some redundant work, e.g. it doesn't have to set up a stack frame each
   time.

3) A loop over the tuples in heapam.c needs an external function call to
   HeapTupleSatisfiesMVCC(), as that's in heapam_visibility.c. That function
   call quickly shows up.
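
Schematically, with the per-tuple helper name invented for the sketch, the
buffer-level work sits outside the per-tuple loop:

    static int
    HeapTupleSatisfiesMVCCBatchSketch(Snapshot snapshot, Buffer buffer,
                                      int ntups, HeapTupleData *tuples,
                                      bool *visible,
                                      OffsetNumber *vistuples_dense)
    {
        SetHintBitsState state = SHB_INITIAL;   /* shared across tuples */
        int         nvis = 0;

        for (int i = 0; i < ntups; i++)
        {
            /* inlined visibility check; hypothetical name */
            visible[i] = HeapTupleSatisfiesMVCCOne(&tuples[i], snapshot,
                                                   buffer, &state);
            if (visible[i])
                vistuples_dense[nvis++] =
                    ItemPointerGetOffsetNumber(&tuples[i].t_self);
        }

        /* pay the hint-bit bookkeeping once per page, not per tuple */
        if (state == SHB_ENABLED)
            BufferFinishSetHintBits(buffer, true, true);

        return nvis;
    }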


> 
> 0012:
> -------
> 
> > [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> > To address these issue, this commit changes the rules so that modifications to
> > pages are not allowed anymore while holding a share lock. Instead the new
> 
> In the commit message, you make it sound like you only change the lock
> level for setting the hint bits. But that wouldn't solve any problems
> if FlushBuffer() could still happen with a share lock. I would try to
> make it clear that you change the lock level both for setting the hint
> bits and flushing the buffer.

Good point.


> @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
>       */
>      for (i = 0; i < so->numKilled; i++)
>      {
> +        if (!killedsomething)
> +        {
> +            /*
> +             * Use hint bit infrastructure to be allowed to modify the page
> +             * without holding an exclusive lock.
> +             */
> +            if (!BufferBeginSetHintBits(buffer))
> +                goto unlock;
> +        }
> +
> 
> I don't understand why this is in the loop. Clearly you want to call
> BufferBeginSetHintBits() once, but why would you do it in the loop?

Why would we want to continue if we can't set hint bits?


> In the comment, I might also note that the lock level will be upgraded
> as needed or something since we only have a share lock, it is
> confusing at first

I don't think copying the way this works into all the callers of
BufferBeginSetHintBits() is a good idea. That just makes it harder to adjust
down the road.


> - * SetHintBits()
> + * To be allowed to set hint bits, SetHintBits() needs to call
> + * BufferBeginSetHintBits(). However, that's not free, and some callsites call
> + * SetHintBits() on many tuples in a row. For those it makes sense to amortize
> + * the cost of BufferBeginSetHintBits(). Additionally it's desirable to defer
> + * the cost of BufferBeginSetHintBits() until a hint bit needs to actually be
> + * set. This enum serves as the necessary state space passed to
> + * SetHintbitsExt().
> + */
> +typedef enum SetHintBitsState
> +{
> +   /* not yet checked if hint bits may be set */
> +   SHB_INITIAL,
> +   /* failed to get permission to set hint bits, don't check again */
> +   SHB_DISABLED,
> +   /* allowed to set hint bits */
> +   SHB_ENABLED,
> +} SetHintBitsState;
> 
> I dislike the SHB prefix. Perhaps something involving the word hint?

Shrug. It's a very locally used enum, I don't think it matters terribly.


> And should the enum name itself (SetHintBitsState) include the word
> "batch"? I know that would make it

> long. At least the comment should explain that these are needed when batch
> setting hint bits.

Isn't that what the comment does, explaining that we want to amortize the cost
of BufferBeginSetHintBits()?


> +SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
> +               uint16 infomask, TransactionId xid, SetHintBitsState *state)
>  {
>      if (TransactionIdIsValid(xid))
>      {
> -        /* NB: xid must be known committed here! */
> -        XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> +        if (BufferIsPermanent(buffer))
> +        {
> 
> I really wish there was a way to better pull apart the batch and
> non-batch cases in a way that could allow the below block to be in a
> helper do_set_hint() (or whatever) which SetHintBitsExt() and
> SetHintBits() called.

I couldn't see a way that didn't lead to substantially more code duplication.


> And then you inlined BufferSetHintBits16().

Hm?


> I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> SNAPSHOT_MVCC calls the non-batch version (and, of course,
> HeapTupleSatisifiesVisibility() is the much more common case).
> 
> if (TransactionIdIsValid(xid))
> {
>         if (BufferIsPermanent(buffer))
>         {
>                 /* NB: xid must be known committed here! */
>                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> 
>                 if (XLogNeedsFlush(commitLSN) &&
>                         BufferGetLSNAtomic(buffer) < commitLSN)
>                 {
>                         /* not flushed and no LSN interlock, so don't
> set hint */
>                         return false;
>                 }
>         }
> }
> 
> Separately, I was thinking, should we assert here about having the
> right lock type?

Not sure I get what assert of what locktype where?


>  * It is only safe to set a transaction-committed hint bit if we know the
>  * transaction's commit record is guaranteed to be flushed to disk before the
>  * buffer, or if the table is temporary or unlogged and will be obliterated by
>  * a crash anyway.  We cannot change the LSN of the page here, because we may
>  * hold only a share lock on the buffer, so we can only use the LSN to
>  * interlock this if the buffer's LSN already is newer than the commit LSN;
>  * otherwise we have to just refrain from setting the hint bit until some
>  * future re-examination of the tuple.
>  *
> 
> Should this say "we may hold only a share exclusive lock on the
> buffer". Also what is "this" in "only use the LSN to interlock this"?

That's a pre-existing comment, right?


> @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> Buffer buffer,
> +    if (state == SHB_ENABLED)
> +        BufferFinishSetHintBits(buffer, true, true);
> +
>      return nvis;
>  }
> 
> I wondered if it would be more natural for BufferBeginSetHintBits()
> and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> instead of having callers do it. But, I guess you don't do this
> because of gist and hash indexes using this for doing their own
> modifications.

I thought about it. But yea, the different callers seemed to make that not
really useful. It'd also mean that we'd do an external function call for every
tuple, which I think would be prohibitively expensive.



> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c
> @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
>     max_avail = fsm_get_max_avail(page);
>     /*
> -    * Reset the next slot pointer. This encourages the use of low-numbered
> -    * pages, increasing the chances that a later vacuum can truncate the
> -    * relation.  We don't bother with a lock here, nor with marking the page
> -    * dirty if it wasn't already, since this is just a hint.
> +    * Try to reset the next slot pointer. This encourages the use of
> +    * low-numbered pages, increasing the chances that a later vacuum can
> +    * truncate the relation.  We don't bother with a lock here, nor with
> +    * marking the page dirty if it wasn't already, since this is just a hint.
> +    *
> +    * To be allowed to update the page without an exclusive lock, we have to
> +    * use the hint bit infrastructure.
>      */
> 
> What the heck? This didn't even take a share lock before...

It is insane. I do not understand how anybody thought this was ok.  I think I
probably should split this out into a separate commit, since it's so insane.


> diff --git a/src/backend/storage/freespace/fsmpage.c
> b/src/backend/storage/freesp>
> index 66a5c80b5a6..a59696b6484 100644
> --- a/src/backend/storage/freespace/fsmpage.c
> +++ b/src/backend/storage/freespace/fsmpage.c
> @@ -298,9 +298,18 @@ restart:
>      * lock and get a garbled next pointer every now and then, than take the
>      * concurrency hit of an exclusive lock.
>      *
> +    * Without an exclusive lock, we need to use the hint bit infrastructure
> +    * to be allowed to modify the page.
> +    *
> 
> Is the sentence above this still correct?

Seems ok enough to me.


> /*
> * Update the next-target pointer. Note that we do this even if we're only
> * holding a shared lock, on the grounds that it's better to use a shared
> * lock and get a garbled next pointer every now and then, than take the
> * concurrency hit of an exclusive lock.
> 
> We appear to avoid the garbling now?

I don't think so. Two backends concurrently can do fsm_search_avail() and
one backend might set a hint to a page that is already used up by the other
one. At least I think so?


> In general on 0012, I didn't spend much time checking if you caught
> all the places where we mention our hint bit hackery (only taking the
> share lock). But those can always be caught later as we inevitably
> encounter them.

There's definitely some more search needed as part of polishing that
commit. But I think it's pretty much inevitable that we'll miss some comment
somewhere :(

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)
>
> I left that in there to make the comparison easier. But clearly that's not to
> be committed...

Eh, I mostly just ran out of steam.

> > > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
> > > of lwlocks
> > ...
> > > This commit unfortunately introduces some code that is very similar to the
> > > code in lwlock.c, however the code is not equivalent enough to easily merge
> > > it. The future wins that this commit makes possible seem worth the cost.
> >
> > It is a truly unfortunate amount of duplication. I tried some
> > refactoring myself just to convince myself it wasn't a good idea.
>
> Without success, I guess?

Nothing that seemed better...and not worse :)

> > > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > > documented and used in a future commit. It seemed too complicated to introduce
> > > the lock-level in a separate commit.
> >
> > I would have liked this mentioned earlier in the commit message.
>
> Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
> because it doesn't just lead to adding code, but to changing code.

I don't think you need to introduce share-exclusive in a separate
commit. I just mean it would be good to mention in the commit message
before you get into so many other details that this commit doesn't use
the share-exclusive level yet.

> > Also, I don't know how I feel about it being "documented" in a future
> > commit...perhaps just don't say that.
>
> Hm, why?

I don't really see you documenting share-exclusive lock semantics in a
later commit. Documentation for what the lock level is should be in
the same commit that introduces it -- which I think you mostly have
done. I feel like it is weird to say you will document that lock level
later when 1) you don't really do that and 2) this commit mostly
(besides some requests I had for elaboration) already does that.

> > diff --git a/src/include/storage/buf_internals.h
> > b/src/include/storage/buf_internals.h
> > index 28519ad2813..0a145d95024 100644
> > --- a/src/include/storage/buf_internals.h
> > +++ b/src/include/storage/buf_internals.h
> > @@ -32,22 +33,29 @@
> >  /*
> >   * Buffer state is a single 64-bit variable where following data is combined.
> >   *
> > + * State of the buffer itself:
> >   * - 18 bits refcount
> >   * - 4 bits usage count
> >   * - 10 bits of flags
> >   *
> > + * State of the content lock:
> > + * - 1 bit has_waiter
> > + * - 1 bit release_ok
> > + * - 1 bit lock state locked
> >
> > Somewhere you should clearly explain the scenarios in which you still
> > need to take the buffer header lock with LockBufHdr() vs when you can
> > just use atomic operations/CAS loops.
>
> There is an explanation of that, or at least my attempt at it ;). See the
> > documentation for BufferDesc.

Hmm. I suppose. I was imagining something above the state member since
you have added to it, but okay.

Separately, I'll admit I don't quite understand when I have to use
LockBufHdr() and when I should use a CAS loop to update the
bufferdesc->state.

> >
> > +         * Signal that the process isn't on the wait list anymore. This allows
> > +         * BufferLockDequeueSelf() to remove itself of the waitlist with a
> > +         * proclist_delete(), rather than having to check if it has been
> >
> > I know this comment was ported over, but the "remove itself of the
> > waitlist" -- the "of" is confusing in the original comment and it is
> > confusing here.
>
> As in s/of/from/?

Yes, I wasn't sure if it meant "from" or "off" -- but I believe "from"
is more grammatically correct.

> > +static inline uint64
> > +BufferLockReleaseSub(BufferLockMode mode)
> >
> > I don't understand why this is a separate function even with your comment.
>
> Because there are operations where we want to unlock the buffer as well as do
> something else. E.g. in 0013's UnlockReleaseBuffer() we want to unlock the
> buffer and decrease the refcount in one atomic operation. For that we need to
> know what to subtract from the state variable for the lock portion - hence
> BufferLockReleaseSub().

Hmm. Okay well maybe save it for 0013? I don't care that much, though.

> > Maybe you should add an assert to the lock acquisition path that the
> > private ref count entry mode is UNLOCK?
>
> Yes, we should...
>
> It'd be nice if we had a decent way to test things that we expect to
> crash. Like repeated buffer lock acquisitions...

Indeed.

> > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> >      if (BufferIsLocal(buffer))
> >          UnpinLocalBufferNoOwner(buffer);
> >      else
> > +    {
> > +        PrivateRefCountEntry *ref;
> > +
> > +        ref = GetPrivateRefCountEntry(buffer, false);
> > +
> > +        /*
> > +         * If the buffer was locked at the time of the resowner release,
> > +         * release the lock now. This should only happen after errors.
> > +         */
> > +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> > +        {
> > +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> > +
> > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> > +            BufferLockUnlock(buffer, buf);
> > +        }
> > +
> >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > +    }
> >  }
> >
> >  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.
>
> Do you have a better suggestion? I'll add a comment that makes that explicit,
> but other than that I don't have a great idea. Renaming the whole buffer pin
> mechanism seems pretty noisy.

ResOwnerReleaseBuffer()?

What do you mean renaming the whole buffer pin mechanism?

> > 0011:
> > --------
> > > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode
> >
> > > 2) We would like to stop setting hint bits while pages are being written
> > > out. The necessary locking becomes visible for page mode scans if done for
> > > every tuple. With batching the overhead can be amortized to only happen
> > > once per page.
>
> > And I presume you mean don't set hint bits on a buffer that is being flushed
> > by someone else -- but it sounds like you mean not to set hint bits as part
> > of flushing a buffer.
>
> Right, I do mean the former.

I would try and state it more clearly then.

> > +        /*
> > +         * If the page is not all-visible or we need to check serializability,
> > +         * maintain enough state to be able to refind the tuple efficiently,
> > +         * without again needing to extract it from the page.
> > +         */
> > +        if (!all_visible || check_serializable)
> > +        {
> >
> > "enough state" is pretty vague here.
>
> I don't really follow. Wouldn't going into more detail just restate the code
> in a comment?

I guess maybe something like

If the page is not all-visible or we need to check serializability,
keep track of the tuples so we can examine them later without the
overhead of extracting them from the page again.

> > 0012:
> > -------
>
> > @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
> >       */
> >      for (i = 0; i < so->numKilled; i++)
> >      {
> > +        if (!killedsomething)
> > +        {
> > +            /*
> > +             * Use hint bit infrastructure to be allowed to modify the page
> > +             * without holding an exclusive lock.
> > +             */
> > +            if (!BufferBeginSetHintBits(buffer))
> > +                goto unlock;
> > +        }
> > +
> >
> > I don't understand why this is in the loop. Clearly you want to call
> > BufferBeginSetHintBits() once, but why would you do it in the loop?
>
> Why would we want to continue if we can't set hint bits?

No, I'm suggesting that you move BufferBeginSetHintBits() outside the
loop so it is more obvious that it is happening once at the beginning.
Like this:

    if (so->numKilled > 0 && !BufferBeginSetHintBits(buffer))
        goto unlock;

    for (i = 0; i < so->numKilled; i++)
    {
        offnum = so->killedItems[i];
        iid = PageGetItemId(page, offnum);
        ItemIdMarkDead(iid);
        killedsomething = true;
    }

    if (killedsomething)
    {
        GistMarkPageHasGarbage(page);
        BufferFinishSetHintBits(buffer, true, true);
    }

unlock:
    UnlockReleaseBuffer(buffer);

> > And should the enum name itself (SetHintBitsState) include the word
> > "batch"? I know that would make it
> > long. At least the comment should explain that these are needed when batch
> > setting hint bits.
>
> Isn't that what the comment does, explaining that we want to amortize the cost
> of BufferBeginSetHintBits()?

Well, "amortize" is a pretty fancy word and you don't say "batch" anywhere.

> > And then you inlined BufferSetHintBits16().

This was part of my wishlist for separating the batch and non-batch versions.

> > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > HeapTupleSatisifiesVisibility() is the much more common case).
> >
> > if (TransactionIdIsValid(xid))
> > {
> >         if (BufferIsPermanent(buffer))
> >         {
> >                 /* NB: xid must be known committed here! */
> >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> >
> >                 if (XLogNeedsFlush(commitLSN) &&
> >                         BufferGetLSNAtomic(buffer) < commitLSN)
> >                 {
> >                         /* not flushed and no LSN interlock, so don't
> > set hint */
> > >                         return false;
> >                 }
> >         }
> > }
> >
> > Separately, I was thinking, should we assert here about having the
> > right lock type?
>
> Not sure I get what assert of what locktype where?

In SetHintBitsExt() that we have share-exclusive or above. Or are
there still callers with only a share lock?

> >  * It is only safe to set a transaction-committed hint bit if we know the
> >  * transaction's commit record is guaranteed to be flushed to disk before the
> >  * buffer, or if the table is temporary or unlogged and will be obliterated by
> >  * a crash anyway.  We cannot change the LSN of the page here, because we may
> >  * hold only a share lock on the buffer, so we can only use the LSN to
> >  * interlock this if the buffer's LSN already is newer than the commit LSN;
> >  * otherwise we have to just refrain from setting the hint bit until some
> >  * future re-examination of the tuple.
> >  *
> >
> > Should this say "we may hold only a share exclusive lock on the
> > buffer". Also what is "this" in "only use the LSN to interlock this"?
>
> That's a pre-existing comment, right?

Right, but are there callers that will only have a share lock after your change?

> > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > Buffer buffer,
> > +    if (state == SHB_ENABLED)
> > +        BufferFinishSetHintBits(buffer, true, true);
> > +
> >      return nvis;
> >  }
> >
> > I wondered if it would be more natural for BufferBeginSetHintBits()
> > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > instead of having callers do it. But, I guess you don't do this
> > because of gist and hash indexes using this for doing their own
> > modifications.
>
> I thought about it. But yea, the different callers seemed to make that not
> really useful. It'd also mean that we'd do an external function call for every
> tuple, which I think would be prohibitively expensive.

I'm confused why this would mean more external function calls.

> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> >     max_avail = fsm_get_max_avail(page);
> >     /*
> > -    * Reset the next slot pointer. This encourages the use of low-numbered
> > -    * pages, increasing the chances that a later vacuum can truncate the
> > -    * relation.  We don't bother with a lock here, nor with marking the page
> > -    * dirty if it wasn't already, since this is just a hint.
> > +    * Try to reset the next slot pointer. This encourages the use of
> > +    * low-numbered pages, increasing the chances that a later vacuum can
> > +    * truncate the relation.  We don't bother with a lock here, nor with
> > +    * marking the page dirty if it wasn't already, since this is just a hint.
> > +    *
> > +    * To be allowed to update the page without an exclusive lock, we have to
> > +    * use the hint bit infrastructure.
> >      */
> >
> > What the heck? This didn't even take a share lock before...
>
> It is insane. I do not understand how anybody thought this was ok.  I think I
> probably should split this out into a separate commit, since it's so insane.

Aren't you going to backport having it take a lock? Then it can be
separate (e.g. have it take a share lock in the "fix" commit and then
this commit bumps it to share-exclusive).

> > diff --git a/src/backend/storage/freespace/fsmpage.c
> > b/src/backend/storage/freesp>
> > /*
> > * Update the next-target pointer. Note that we do this even if we're only
> > * holding a shared lock, on the grounds that it's better to use a shared
> > * lock and get a garbled next pointer every now and then, than take the
> > * concurrency hit of an exclusive lock.
> >
> > We appear to avoid the garbling now?
>
> I don't think so. Two backends concurrently can do fsm_search_avail() and
> one backend might set a hint to a page that is already used up by the other
> one. At least I think so?

Maybe I don't know what it meant by garbled, but I thought it was
talking about two backends each trying to set fp_next_slot. If they
now have to have a share-exclusive lock and they can't both have a
share-exclusive lock at the same time, then it seems like that
wouldn't be a problem. It sounds like you may be talking about a
backend taking up the freespace of a page that is referred to by the
fp_next_slot?

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote:
> On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
> > > > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > > > documented and used in a future commit. It seemed too complicated to introduce
> > > > the lock-level in a separate commit.
> > >
> > > I would have liked this mentioned earlier in the commit message.
> >
> > Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
> > because it doesn't just lead to adding code, but to changing code.
> 
> I don't think you need to introduce share-exclusive in a separate
> commit. I just mean it would be good to mention in the commit message
> before you get into so many other details that this commit doesn't use
> the share-exclusive level yet.
> 
> > > Also, I don't know how I feel about it being "documented" in a future
> > > commit...perhaps just don't say that.
> >
> > Hm, why?
> 
> I don't really see you documenting share-exclusive lock semantics in a
> later commit. Documentation for what the lock level is should be in
> the same commit that introduces it -- which I think you mostly have
> done. I feel like it is weird to say you will document that lock level
> later when 1) you don't really do that and 2) this commit mostly
> (besides some requests I had for elaboration) already does that.

The problem I faced was that the explanation for the lock level depends on
later changes - how do you document why share-exclusive is useful without
explaining the hint bit thing, which isn't yet implemented as of that commit.


> > > diff --git a/src/include/storage/buf_internals.h
> > > b/src/include/storage/buf_internals.h
> > > index 28519ad2813..0a145d95024 100644
> > > --- a/src/include/storage/buf_internals.h
> > > +++ b/src/include/storage/buf_internals.h
> > > @@ -32,22 +33,29 @@
> > >  /*
> > >   * Buffer state is a single 64-bit variable where following data is combined.
> > >   *
> > > + * State of the buffer itself:
> > >   * - 18 bits refcount
> > >   * - 4 bits usage count
> > >   * - 10 bits of flags
> > >   *
> > > + * State of the content lock:
> > > + * - 1 bit has_waiter
> > > + * - 1 bit release_ok
> > > + * - 1 bit lock state locked
> > >
> > > Somewhere you should clearly explain the scenarios in which you still
> > > need to take the buffer header lock with LockBufHdr() vs when you can
> > > just use atomic operations/CAS loops.
> >
> > There is an explanation of that, or at least my attempt at it ;). See the
> > dcoumentation for BufferDesc.
> 
> Hmm. I suppose. I was imagining something above the state member since
> you have added to it, but okay.
> 
> Separately, I'll admit I don't quite understand when I have to use
> LockBufHdr() and when I should use a CAS loop to update the
> bufferdesc->state.

It's definitely subtle :(. Not sure what to do about that, other than to work
on eventually just making everything doable with just a CAS. But that's a fair
bit of future work away.


> > > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> > >      if (BufferIsLocal(buffer))
> > >          UnpinLocalBufferNoOwner(buffer);
> > >      else
> > > +    {
> > > +        PrivateRefCountEntry *ref;
> > > +
> > > +        ref = GetPrivateRefCountEntry(buffer, false);
> > > +
> > > +        /*
> > > +         * If the buffer was locked at the time of the resowner release,
> > > +         * release the lock now. This should only happen after errors.
> > > +         */
> > > +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> > > +        {
> > > +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> > > +
> > > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> > > +            BufferLockUnlock(buffer, buf);
> > > +        }
> > > +
> > >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > > +    }
> > >  }
> > >
> > >  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.
> >
> > Do you have a better suggestion? I'll add a comment that makes that explicit,
> > but other than that I don't have a great idea. Renaming the whole buffer pin
> > mechanism seems pretty noisy.
> 
> ResOwnerReleaseBuffer()?
> 
> What do you mean renaming the whole buffer pin mechanism?

All the *PrivateRefCount* stuff arguably ought to be renamed...



> > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > > HeapTupleSatisifiesVisibility() is the much more common case).
> > >
> > > if (TransactionIdIsValid(xid))
> > > {
> > >         if (BufferIsPermanent(buffer))
> > >         {
> > >                 /* NB: xid must be known committed here! */
> > >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> > >
> > >                 if (XLogNeedsFlush(commitLSN) &&
> > >                         BufferGetLSNAtomic(buffer) < commitLSN)
> > >                 {
> > >                         /* not flushed and no LSN interlock, so don't
> > > set hint */
> > >                         return; false;
> > >                 }
> > >         }
> > > }
> > >
> > > Separately, I was thinking, should we assert here about having the
> > > right lock type?
> >
> > Not sure I get what assert of what locktype where?
> 
> In SetHintBitsExt() that we have share-exclusive or above.

We *don't* necessarily hold that though, it'll just be acquired by
BufferBeginSetHintBits().  Or do you mean in the SHB_ENABLED case?


> Or are there still callers with only a share lock?

Almost all of them, I think? We can't just call this with an unconditional
share-exclusive lock, since that'd destroy concurrency. Instead we just try to
upgrade to share-exclusive to set the hint bit. If we can't get
share-exclusive, we don't need to block, we just haven't set the hint bit.
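
I.e. the flow is roughly this sketch (the real SetHintBits() does more):

    static void
    SetHintBitsSketch(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
    {
        /* try to upgrade share -> share-exclusive; never blocks */
        if (!BufferBeginSetHintBits(buffer))
            return;             /* page is being written out; skip hint */

        tuple->t_infomask |= infomask;

        BufferFinishSetHintBits(buffer, true, true);
    }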


> > > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > > Buffer buffer,
> > > +    if (state == SHB_ENABLED)
> > > +        BufferFinishSetHintBits(buffer, true, true);
> > > +
> > >      return nvis;
> > >  }
> > >
> > > I wondered if it would be more natural for BufferBeginSetHintBits()
> > > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > > instead of having callers do it. But, I guess you don't do this
> > > because of gist and hash indexes using this for doing their own
> > > modifications.
> >
> > I thought about it. But yea, the different callers seemed to make that not
> > really useful. It'd also mean that we'd do an external function call for every
> > tuple, which I think would be prohibitively expensive.
> 
> I'm confused why this would mean more external function calls.

The visibility logic is in heapam_visibility.c, the locking for the buffer in
bufmgr.c? If the knowledge about SetHintBitsState lives in
BufferBeginSetHintBits(), we need to call it to do that check.


> > > --- a/src/backend/storage/freespace/freespace.c
> > > +++ b/src/backend/storage/freespace/freespace.c
> > > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> > >     max_avail = fsm_get_max_avail(page);
> > >     /*
> > > -    * Reset the next slot pointer. This encourages the use of low-numbered
> > > -    * pages, increasing the chances that a later vacuum can truncate the
> > > -    * relation.  We don't bother with a lock here, nor with marking the page
> > > -    * dirty if it wasn't already, since this is just a hint.
> > > +    * Try to reset the next slot pointer. This encourages the use of
> > > +    * low-numbered pages, increasing the chances that a later vacuum can
> > > +    * truncate the relation.  We don't bother with a lock here, nor with
> > > +    * marking the page dirty if it wasn't already, since this is just a hint.
> > > +    *
> > > +    * To be allowed to update the page without an exclusive lock, we have to
> > > +    * use the hint bit infrastructure.
> > >      */
> > >
> > > What the heck? This didn't even take a share lock before...
> >
> > It is insane. I do not understand how anybody thought this was ok.  I think I
> > probably should split this out into a separate commit, since it's so insane.
> 
> Aren't you going to backport having it take a lock? Then it can be
> separate (e.g. have it take a share lock in the "fix" commit and then
> this commit bumps it to share-exclusive).

I wasn't thinking we should backport that. It's been this way for ages,
without having caused known issues....


> > > diff --git a/src/backend/storage/freespace/fsmpage.c
> > > b/src/backend/storage/freesp>
> > > /*
> > > * Update the next-target pointer. Note that we do this even if we're only
> > > * holding a shared lock, on the grounds that it's better to use a shared
> > > * lock and get a garbled next pointer every now and then, than take the
> > > * concurrency hit of an exclusive lock.
> > >
> > > We appear to avoid the garbling now?
> >
> > I don't think so. Two backends concurrently can do fsm_search_avail() and
> > one backend might set a hint to a page that is already used up by the other
> > one. At least I think so?
> 
> Maybe I don't know what it meant by garbled, but I thought it was
> talking about two backends each trying to set fp_next_slot. If they
> now have to have a share-exclusive lock and they can't both have a
> share-exclusive lock at the same time, then it seems like that
> wouldn't be a problem. It sounds like you may be talking about a
> backend taking up the freespace of a page that is referred to by the
> fp_next_slot?

Yes, a version of the latter. The value that fp_next_slot will be set to can
be outdated by the time we actually set it, unless we do all of
fsm_search_avail() under some form of exclusive lock - clearly not something
desirable.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Tue, Nov 25, 2025 at 3:46 PM Andres Freund <andres@anarazel.de> wrote:
>
>
> > > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > > > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > > > HeapTupleSatisifiesVisibility() is the much more common case).
> > > >
> > > > if (TransactionIdIsValid(xid))
> > > > {
> > > >         if (BufferIsPermanent(buffer))
> > > >         {
> > > >                 /* NB: xid must be known committed here! */
> > > >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> > > >
> > > >                 if (XLogNeedsFlush(commitLSN) &&
> > > >                         BufferGetLSNAtomic(buffer) < commitLSN)
> > > >                 {
> > > >                         /* not flushed and no LSN interlock, so don't
> > > > set hint */
> > > >                         return; false;
> > > >                 }
> > > >         }
> > > > }
> > > >
> > > > Separately, I was thinking, should we assert here about having the
> > > > right lock type?
> > >
> > > Not sure I get what assert of what locktype where?
> >
> > In SetHintBitsExt() that we have share-exclusive or above.
>
> We *don't* necessarily hold that though, it'll just be acquired by
> BufferBeginSetHintBits().  Or do you mean in the SHB_ENABLED case?

Yea, in the enabled case. Also, can't we skip the whole

    if (TransactionIdIsValid(xid))
    {
        if (BufferIsPermanent(buffer))
        {
            /* NB: xid must be known committed here! */
            XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);

            if (XLogNeedsFlush(commitLSN) &&
                BufferGetLSNAtomic(buffer) < commitLSN)
            {
                /* not flushed and no LSN interlock, so don't set hint */
                return;
            }
        }
    }

part if state is SHB_DISABLED?

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-21 12:52:38 -0500, Melanie Plageman wrote:
> > 0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
> > additional bits yet, though.  Some annoying reformatting required to avoid
> > long lines.
>
> I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
> macros cast the return value to a uint32. We won't use the extra bits
> but we did bother to keep the macro result sized to the field width
> before so keeping it uint32 is probably more confusing now that state
> is 64 bit.

I can't really follow - why would we want to return a 64bit value if none of
the values ever can get anywhere near that big?


> Not related to this patch, but I noticed GetBufferDescriptor() calls
> for a uint32 and all the callers pretty much pass a signed int —
> wonder why it calls for uint32.

It's not strictly required - no Buffers exist between INT32_MAX and
UINT32_MAX. However, GetBufferDescriptor() cannot be used for local buffers
(which would have a negative buffer id), therefore a uint32 is fine. Many of
the callers have dedicated branches to deal with local buffers and therefore
couldn't use a uint32.
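
(For reference, the usual call-site pattern in bufmgr.c today:

    /* shared buffers are 1..NBuffers, local buffers have negative ids */
    if (BufferIsLocal(buffer))
        buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
    else
        buf_hdr = GetBufferDescriptor(buffer - 1);

so only the else-branch argument is known to be non-negative.)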


> > 0006+0007: This is preparatory work for 0008, but also worthwhile on its
> > own. The private refcount stuff does show up in profiles. The reason it's
> > related is that without these changes the added information in 0008 makes that
> > worse.
>
> I found it slightly confusing that this commit appears to
> unnecessarily add the PrivateRefCountData struct (given that it
> doesn't need it to do the new parallel array thing). You could wait
> until you need it in 0008, but 0008 is big as it is, so it probably is
> fine where it is.

It seemed too annoying to whack the code around multiple times...


> in InitBufferManagerAccess(), why do you have
>
> memset(&PrivateRefCountArrayKeys, 0, sizeof(Buffer));
> seems like it should be
> memset(PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));

Ugh, indeed.


> I wonder how easy it will be to keep the Buffer in sync between
> PrivateRefCountArrayKeys and the PrivateRefCountEntry — would a helper
> function help?

I don't think we really need it; the existing helper functions are where it
should be manipulated.


> ForgetPrivateRefCountEntry doesn’t clear the data member — but maybe
> it doesn’t matter...

It asserts that the fields are reset, which seems to suffice.


> in ReservePrivateRefCountEntry() there is a superfluous clear
>
> memset(&victim_entry->data, 0, sizeof(victim_entry->data));
> victim_entry->data.refcount = 0;

It's indeed superfluous today. I guess I put it in as a belt and suspenders
approach to future members... The compiler is easily able to optimize that
redundancy away.


> 0007
> needs a commit message. overall seems fine though.

This is what I've since written:

    bufmgr: Add one-entry cache for private refcount

    The private refcount entry for a buffer is often looked up repeatedly for the
    same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
    worthwhile to have a one-entry cache for that case. With that cache in place,
    it's worth splitting GetPrivateRefCountEntry() into a small inline
    portion (for the cache hit case) and an out-of-line helper for the rest.

    This is helpful for some workloads today, but becomes more important in an
    upcoming patch that will utilize the private refcount infrastructure to also
    store whether the buffer is currently locked, as that increases the rate of
    lookups substantially.
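
The shape of that split, as a sketch (the cache variable and the slow-path
helper name are partly assumed):

    static PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer,
                                                             bool do_move);

    static inline PrivateRefCountEntry *
    GetPrivateRefCountEntrySketch(Buffer buffer, bool do_move)
    {
        /* fast path: same buffer as the previous lookup */
        if (PrivateRefCountEntryLast &&
            PrivateRefCountEntryLast->buffer == buffer)
            return PrivateRefCountEntryLast;

        /* out-of-line slow path: search array/hash, refresh the cache */
        return GetPrivateRefCountEntrySlow(buffer, do_move);
    }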



> You should probably capitalize the "c" of "count" in
> PrivateRefcountEntryLast to be consistent with the other names.

Ooops, yes.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Mon, Dec 1, 2025 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
>
> > I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
> > macros cast the return value to a uint32. We won't use the extra bits
> > but we did bother to keep the macro result sized to the field width
> > before so keeping it uint32 is probably more confusing now that state
> > is 64 bit.
>
> I can't really follow - why would we want to return a 64bit value if none of
> the values ever can get anywhere near that big?

Well, usagecount could never have reached anything close to a uint32
either, so I thought that this cast was meant to match the datatype. I
found it rather confusing otherwise.

> > 0007
> > needs a commit message. overall seems fine though.
>
> This is what I've since written:
>
>     bufmgr: Add one-entry cache for private refcount
>
>     The private refcount entry for a buffer is often looked up repeatedly for the
>     same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
>     worthwhile to have a one-entry cache for that case. With that cache in place,
>     it's worth splitting GetPrivateRefCountEntry() into a small inline
>     portion (for the cache hit case) and an out-of-line helper for the rest.
>
>     This is helpful for some workloads today, but becomes more important in an
>     upcoming patch that will utilize the private refcount infrastructure to also
>     store whether the buffer is currently locked, as that increases the rate of
>     lookups substantially.

Sounds good based on what I recall of the patch.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Heikki Linnakangas
Date:
On 25/11/2025 22:46, Andres Freund wrote:
> On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote:
>> On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
>>>> --- a/src/backend/storage/freespace/freespace.c
>>>> +++ b/src/backend/storage/freespace/freespace.c
>>>> @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
>>>>      max_avail = fsm_get_max_avail(page);
>>>>      /*
>>>> -    * Reset the next slot pointer. This encourages the use of low-numbered
>>>> -    * pages, increasing the chances that a later vacuum can truncate the
>>>> -    * relation.  We don't bother with a lock here, nor with marking the page
>>>> -    * dirty if it wasn't already, since this is just a hint.
>>>> +    * Try to reset the next slot pointer. This encourages the use of
>>>> +    * low-numbered pages, increasing the chances that a later vacuum can
>>>> +    * truncate the relation.  We don't bother with a lock here, nor with
>>>> +    * marking the page dirty if it wasn't already, since this is just a hint.
>>>> +    *
>>>> +    * To be allowed to update the page without an exclusive lock, we have to
>>>> +    * use the hint bit infrastructure.
>>>>       */
>>>>
>>>> What the heck? This didn't even take a share lock before...
>>>
>>> It is insane. I do not understand how anybody thought this was ok.  I think I
>>> probably should split this out into a separate commit, since it's so insane.
>>
>> Aren't you going to backport having it take a lock? Then it can be
>> separate (e.g. have it take a share lock in the "fix" commit and then
>> this commit bumps it to share-exclusive).
> 
> I wasn't thinking we should backport that. It's been this way for ages,
> without having caused known issues....

I wrote that originally :-). The FSM code always treats the FSM page 
contents as potentially corrupted garbage. FSM page updates are not 
WAL-logged, for starters. And as the comment says, the next slot pointer 
is just a hint for where within the page to start looking for free space.

Page checksums were added later, and now we know about the problems of 
modifying a page a write() is in progress, on some filesystems with 
filesystem-level checksums. So it's a good idea to tighten it up, but it 
was fine back then.

>>>> diff --git a/src/backend/storage/freespace/fsmpage.c
>>>> b/src/backend/storage/freesp>
>>>> /*
>>>> * Update the next-target pointer. Note that we do this even if we're only
>>>> * holding a shared lock, on the grounds that it's better to use a shared
>>>> * lock and get a garbled next pointer every now and then, than take the
>>>> * concurrency hit of an exclusive lock.
>>>>
>>>> We appear to avoid the garbling now?
>>>
>>> I don't think so. Two backends concurrently can do fsm_search_avail() and
>>> one backend might set a hint to a page that is already used up by the other
>>> one. At least I think so?
>>
>> Maybe I don't know what it meant by garbled, but I thought it was
>> talking about two backends each trying to set fp_next_slot. If they
>> now have to have a share-exclusive lock and they can't both have a
>> share-exclusive lock at the same time, then it seems like that
>> wouldn't be a problem. It sounds like you may be talking about a
>> backend taking up the freespace of a page that is referred to by the
>> fp_next_slot?
> 
> Yes, a version of the latter. The value that fp_next_slot will be set to can
> be outdated by the time we actually set it, unless we do all of
> fsm_search_avail() under some form of exclusive lock - clearly not something
> desirable.

I'm pretty sure the "garbled" in the comment means the former, not the 
latter. I.e. it means that the pointer itself might become garbage. 
Would be good to update the comment if that's no longer possible.

But speaking of that: why do we not allow two processes to concurrently 
set hint bits on a page anymore?

- Heikki




Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 10:01:06 +0200, Heikki Linnakangas wrote:
> On 25/11/2025 22:46, Andres Freund wrote:
> > > > > diff --git a/src/backend/storage/freespace/fsmpage.c
> > > > > b/src/backend/storage/freesp>
> > > > > /*
> > > > > * Update the next-target pointer. Note that we do this even if we're only
> > > > > * holding a shared lock, on the grounds that it's better to use a shared
> > > > > * lock and get a garbled next pointer every now and then, than take the
> > > > > * concurrency hit of an exclusive lock.
> > > > > 
> > > > > We appear to avoid the garbling now?
> > > > 
> > > > I don't think so. Two backends concurrently can do fsm_search_avail() and
> > > > one backend might set a hint to a page that is already used up by the other
> > > > one. At least I think so?
> > > 
> > > Maybe I don't know what it meant by garbled, but I thought it was
> > > talking about two backends each trying to set fp_next_slot. If they
> > > now have to have a share-exclusive lock and they can't both have a
> > > share-exclusive lock at the same time, then it seems like that
> > > wouldn't be a problem. It sounds like you may be talking about a
> > > backend taking up the freespace of a page that is referred to by the
> > > fp_next_slot?
> > 
> > Yes, a version of the latter. The value that fp_next_slot will be set to can
> > be outdated by the time we actually set it, unless we do all of
> > fsm_search_avail() under some form of exclusive lock - clearly not something
> > desirable.
> 
> I'm pretty sure the "garbled" in the comment means the former, not the
> latter. I.e. it means that the pointer itself might become garbage. Would be
> good to update the comment if that's no longer possible.

Hm. I thought we had always assumed that 4byte values can be read/written
tear-free. Hence my thinking that "garbled" couldn't refer to reading outright
garbage due to a concurrent write.


> But speaking of that: why do we not allow two processes to concurrently set
> hint bits on a page anymore?

It'd make the locking a lot more complicated without much of a benefit.

The new share-exclusive lock mode only requires one additional bit of lock
state, for the single allowed holder. If we wanted a new lockmode that
prevented the page from being written out concurrently, but could be held
multiple times, we'd need at least MAX_BACKENDS bits for the lock level
allowing hint bits to be set and another lock level to acquire while writing
out the buffer.

At the same time, there seems to be little benefit in setting hint bits on a
page concurrently. A very common case is that the same hint bit(s) would be
set by multiple backends; we don't gain anything from that. And in the cases
where hint bits were intended to be set for different tuples, the window in
which that is not allowed is very narrow, and the cost of not setting the
hint bit right then and there is pretty small.
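
As a rough illustration of the bit accounting (the positions and widths here
are invented):

    /* single-holder lock levels each cost one flag bit in the state */
    #define BUF_LOCK_SHARED_MASK      UINT64CONST(0x3FFFF)   /* N holders */
    #define BUF_LOCK_EXCLUSIVE        (UINT64CONST(1) << 18) /* 1 holder */
    #define BUF_LOCK_SHARE_EXCLUSIVE  (UINT64CONST(1) << 19) /* 1 holder */

    /*
     * A hint-bit level that multiple backends could hold concurrently
     * would instead need its own holder count sized for MAX_BACKENDS,
     * plus yet another level for a writer to exclude those holders.
     */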

Makes sense?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Heikki Linnakangas
Date:
On 02/12/2025 15:20, Andres Freund wrote:
> On 2025-12-02 10:01:06 +0200, Heikki Linnakangas wrote:
>> But speaking of that: why do we not allow two processes to concurrently set
>> hint bits on a page anymore?
> 
> It'd make the locking a lot more complicated without much of a benefit.
> 
> The new share-exclusive lock mode only requires one additional bit of lock
> state, for the single allowed holder. If we wanted a new lockmode that
> prevented the page from being written out concurrently, but could be held
> multiple times, we'd need at least MAX_BACKENDS bits for the lock level
> allowing hint bits to be set and another lock level to acquire while writing
> out the buffer.
> 
> At the same time, there seems to be little benefit in setting hint bits on a
> page concurrently. A very common case is that the same hint bit(s) would be
> set by multiple backends; we don't gain anything from that. And in the cases
> where hint bits were intended to be set for different tuples, the window in
> which that is not allowed is very narrow, and the cost of not setting the
> hint bit right then and there is pretty small.
> 
> Makes sense?

Yep, makes sense. Would be good to put that in a comment somewhere, and 
in the commit message. "We could allow multiple backends to set hint 
bits concurrently, but it'd make the lock implementation more complicated"

- Heikki




Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> Thanks a lot for that detailed review!  A few questions and comments, before I
> try to address the comments in the next version.

Here's that new version, with the following changes:

- Some more micro-optimizations, most importantly adding a commit that doesn't
  initialize the delay in LockBufHdr() unless needed (see the sketch below
  this list). With those I don't see a consistent slowdown anymore (slight
  speedup on one workstation, slight slowdown on another, in an absurdly
  adverse workload)

- Tried to address Melanie's feedback, with some exceptions (some noted below,
  but I also need to make another pass through the reviews)

- re-implemented AssertNotCatalogBufferLock() in the new world

- Substantially expanded comments around setting hint bits (in buffer/README,
  heapam_visibility.c and bufmgr.c)

- split out the change that makes fsm_vacuum_page() start to lock the page
  into its own commit

- reordered the patch series so that smaller changes are before the 64bit-state
  and "Implement buffer content locks independently of" commits, so they can
  be committed while we finish cleaning up the later changes

- I didn't invest much in cleaning up the later patches ("Don't copy pages
  while writing out" and "Make UnlockReleaseBuffer() more efficient") yet,
  wanted to focus on the earlier patches first
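
To illustrate the LockBufHdr() delay change mentioned in the first item,
here's a sketch of the idea (not the actual commit, details may differ):

uint32
LockBufHdr(BufferDesc *desc)
{
    SpinDelayStatus delayStatus;
    bool        delay_inited = false;
    uint32      old_buf_state;

    while (true)
    {
        /* set BM_LOCKED flag */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        /* if it wasn't set before we're OK */
        if (!(old_buf_state & BM_LOCKED))
            break;

        /* contended - only now is the delay state worth setting up */
        if (!delay_inited)
        {
            init_local_spin_delay(&delayStatus);
            delay_inited = true;
        }
        perform_spin_delay(&delayStatus);
    }
    if (delay_inited)
        finish_spin_delay(&delayStatus);
    return old_buf_state | BM_LOCKED;
}

That way the common, uncontended, path doesn't touch the delay state at all.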


Todo:

- still need to rename ResOwnerReleaseBufferPin(). Wondering about what to
  rename ResourceOwnerDesc.name to. "buffer ownership" maybe? Not great...

- gistkillitems() complaint by Melanie

- amortize vs batch vs SetHintBits comment + SHB_* names

- for the next version I'll remove the BATCHMVCC_FEWER_ARGS conditionals from
  0010. I don't love needing BatchMVCCState, but I don't really see an
  alternative; the performance difference is pretty persistent.


Questions:
- ForEachLWLockHeldByMe() and LWLockDisown() aren't used anymore; should we
  remove them?


Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2025 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> > Thanks a lot for that detailed review!  A few questions and comments, before I
> > try to address the comments in the next version.
>
> Here's that new version, with the following changes:

_bt_check_unique will hold an exclusive buffer lock on the page being
LP_DEAD-set in the vast majority of cases. Should we expect your
changes to have no effect at all in that common case?

The BTP_HAS_GARBAGE flag is deprecated these days; we basically don't
use it anymore. How much value might there be in avoiding setting
BTP_HAS_GARBAGE as a way of being able to use BufferSetHintBits16 more
often in nbtree?

--
Peter Geoghegan



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 20:12:14 -0500, Peter Geoghegan wrote:
> On Tue, Dec 2, 2025 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> > > Thanks a lot for that detailed review!  A few questions and comments, before I
> > > try to address the comments in the next version.
> >
> > Here's that new version, with the following changes:
> 
> _bt_check_unique will hold an exclusive buffer lock on the page being
> LP_DEAD-set in the vast majority of cases. Should we expect your
> changes to have no effect at all in that common case?

If we already have an exclusive lock, BufferBeginSetHintBits() will quickly
return true and won't ever return false.
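
To sketch the intended caller pattern (the finish function's name and the
exact sequence here are illustrative, not necessarily the final API):

static void
set_xmin_committed_hint(Buffer buffer, HeapTupleHeader tuple)
{
    if (BufferBeginSetHintBits(buffer))
    {
        /* no write IO can start until we call the finish function */
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;
        BufferFinishSetHintBits(buffer);
        MarkBufferDirtyHint(buffer, true);
    }

    /*
     * Otherwise write IO (or another hint bit setter) is in progress and
     * we just skip - the hint can be set on a later access.
     */
}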


> The BTP_HAS_GARBAGE flag is deprecated these days; we basically don't
> use it anymore. How much value might there be in avoiding setting
> BTP_HAS_GARBAGE as a way of being able to use BufferSetHintBits16 more
> often in nbtree?

None of the MarkBufferDirtyHint() cases in nbtree that had to be modified
looked like they would benefit from BufferSetHintBits16(), since they will
typically modify the page multiple times.  But maybe I'm just misunderstanding
what you mean?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 19:47:35 -0500, Andres Freund wrote:
> Here's that new version, with the following changes:
> 
> [...]

I'm planning to work on committing 0001, 0002, 0003, 0008 soon-ish, unless
somebody sees a reason to hold off on that.  After that I think 0005, 0006
would be next.  I think 0004 is a clear improvement, but nobody has looked at
it yet...

For 0007, I wished ConditionalLockBuffer() accepted the lock level; there's no
point in waiting for the lock in that use case. I'm on the fence about whether
it's worth changing the ~12 users of ConditionalLockBuffer()...
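
Roughly what I'd wish for (the share-exclusive constant name is made up here):

extern bool ConditionalLockBuffer(Buffer buffer, int mode);

/* would-be 0007 call site: try the weaker level, never wait */
if (ConditionalLockBuffer(buf, BUFFER_LOCK_SHARE_EXCLUSIVE))
{
    /* ... set hint bits ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}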

Greetings,

Andres Freund