Thread: Buffer locking is special (hints, checksums, AIO writes)

Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,


I'm working on making bufmgr.c ready for AIO writes. That requires some
infrastructure changes that I wanted to discuss... In this email I'm trying to
outline the problems and what I currently think we should do.


== Problem 1 - hint bits ==

Currently pages that are being written out are copied if checksums are
enabled. The copy is needed because hint bits can be set with just a share
lock, and writing out a buffer only requires a share lock. If we didn't copy
the buffer, concurrently set hint bits could falsify the computed checksum.
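
For reference, the relevant path looks roughly like this today (abridged from
FlushBuffer() in bufmgr.c; error handling and instrumentation omitted):

    /*
     * Copy the page, so that concurrently set hint bits can't change the
     * bytes after the checksum has been computed, then write out the copy.
     */
    bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);

    smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum,
              bufToWrite, false);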

Even without checksums hint bits being set while IO is ongoing is an issue,
e.g. btrfs assumes that pages do not change while being written out with
direct-io, and corrupts itself if they do [1].

The copies we make to avoid checksum failures are not at all cheap [3], and
they are particularly problematic for AIO, where a lot of buffers can undergo
AIO at the same time. For AIO the issue is that the longer-lived bounce
buffers that I had initially prototyped actually end up using a lot of
memory, making it hard to figure out what defaults to set.

In [2] I had worked on an approach to avoid copying pages during writes. It
avoided the need to copy buffers by not allowing hint bits to be set while IO
is ongoing. It did so by skipping hint bit writes while IO is ongoing (plus a
fair bit of infrastructure to make that cheap enough).


In that thread Heikki was wondering whether we shouldn't instead go for a
more fundamental solution to the problem, like introducing a separate lock
level that prevents concurrent modifications but allows share lockers. At the
time I was against that, because it seemed like a large project.

However, since then I realized there's an architecturally related second issue:


== Problem 2 - AIO writes vs exclusive locks ==

Separate from the hint bit issue, there is a second issue that I didn't have a
good answer for: Making acquiring an exclusive lock concurrency safe in the
presence of asynchronous writes:

The problem is that while a buffer is being written out, it obviously has to
be share locked. That's true even with AIO; with AIO the share lock is held
until the IO is completed. The problem is that if a backend wants to
exclusively lock a buffer undergoing AIO, it can't just wait for the content
lock as it does today - it might have to actually reap the IO completion from
the operating system. If one just were to wait for the content lock, there's
no forward progress guarantee.

The buffer's state "knows" that it's undergoing write IO (BM_VALID and
BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
problem is that it's surprisingly hard to do so race free:

If a backend A were to just check if a buffer is undergoing IO before locking
it, a backend B could start IO on the buffer between A checking for
BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
hold the buffer header spinlock across a blocking lwlock acquisition.
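
To spell the race out in code form (a naive sketch, not actual bufmgr.c code;
LockBufHdr(), UnlockBufHdr() and pgaio_wref_wait() are the real APIs, the
function name is made up):

    static void
    LockBufferExclusiveNaive(BufferDesc *buf)
    {
        uint32      buf_state;
        bool        io_in_progress;

        buf_state = LockBufHdr(buf);
        io_in_progress = (buf_state & BM_IO_IN_PROGRESS) != 0;
        UnlockBufHdr(buf, buf_state);

        if (io_in_progress)
            pgaio_wref_wait(&buf->io_wref);     /* reap the completion */

        /*
         * Race window: between the check above and the acquisition below,
         * another backend can set BM_IO_IN_PROGRESS and submit a new
         * write, so we can block on the content lock without anybody
         * being guaranteed to complete that IO.
         */
        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
    }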

There potentially are ways to synchronize the buffer state and the content
lock, but doing so requires deep integration between bufmgr.c and lwlock.c.


== Problem 3 - Cacheline contention ==

This is unrelated to AIO, but might influence the architecture for potential
solutions.  It might make sense to skip over this section on a quicker
read-through.

The problem is that in some workloads the cacheline containing the BufferDesc
becomes very hot. The most common case is workloads with lots of index nested
loops, where the root page and some of the other inner pages in the index can
become very contended.

I've seen cases where running the same workload in four separate copies in the
same postgres instance yields ~8x the throughput - on a machine with forty
cores, and there are machines with many more these days [4].

There are four main issues:

a) We manipulate the same cache line multiple times. E.g. btree always first
   pins the page and then locks the page, which performs two atomic operations
   on the same cacheline.

   I've experimented with putting the content lock on a separate cacheline,
   but that causes regressions in other cases.

   Leaving nontrivial implementation issues aside, we could release the lock
   and pin at the same time. With a bit more difficulty we could do the same
   for pin and lock acquisition.  nbtree almost always does the two together
   (see the sketch below), so it wouldn't be hard to make it benefit from
   such an optimization.
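
   For illustration, the pattern in question as nbtree does it today
   (simplified; the function name is made up):

       static void
       btree_page_visit(Relation rel, BlockNumber blkno)
       {
           Buffer      buf;

           /* two atomic operations on the same BufferDesc cacheline ... */
           buf = ReadBuffer(rel, blkno);           /* CAS to pin */
           LockBuffer(buf, BUFFER_LOCK_SHARE);     /* atomic op on the
                                                    * content lock embedded
                                                    * in the BufferDesc */

           /* ... use the page ... */

           /* ... and two more atomic operations to undo it */
           LockBuffer(buf, BUFFER_LOCK_UNLOCK);
           ReleaseBuffer(buf);                     /* CAS to unpin */
       }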


b) Many operations, like unpinning a buffer, need to use CAS, instead of an
   atomic subtraction.

   atomic add/sub scales a *lot* better than compare and swap, as there is no
   need to retry. With increased contention the number of retries increases
   further and further.

   The reason we need to use CAS in so many places is the following:

   Historically postgres' spinlocks don't use an atomic operation to release
   the spinlock on x86. When making buffer header locks their own thing, I
   carried that forward, to avoid performance regressions.  Because of that
   BufferDesc.state may not be modified while the buffer header spinlock is
   held - which is incompatible with using atomic-sub.

   However, since then we converted most of the "hot" uses of buffer header
   spinlocks into CAS loops (see e.g. PinBuffer()). That makes it feasible to
   use an atomic operation for the buffer header lock release, which in turn
   allows e.g. unpinning with an atomic sub.
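
   To make the difference concrete (a sketch using the existing atomics API;
   BUF_REFCOUNT_ONE, BM_LOCKED and WaitBufHdrUnlocked() are the real
   bufmgr.c names, the wrapper functions are made up):

       static void
       UnpinViaCas(BufferDesc *buf)    /* roughly what master does today */
       {
           uint32      old_buf_state = pg_atomic_read_u32(&buf->state);

           for (;;)
           {
               uint32      buf_state;

               if (old_buf_state & BM_LOCKED)
                   old_buf_state = WaitBufHdrUnlocked(buf);
               buf_state = old_buf_state - BUF_REFCOUNT_ONE;
               if (pg_atomic_compare_exchange_u32(&buf->state,
                                                  &old_buf_state, buf_state))
                   break;              /* retries under contention */
           }
       }

       static void
       UnpinViaSub(BufferDesc *buf)    /* the goal: never retries */
       {
           pg_atomic_fetch_sub_u32(&buf->state, BUF_REFCOUNT_ONE);
       }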


c) Read accesses to the BufferDesc cause contention

   Some code, like nbtree, relies on functions like
   BufferGetBlockNumber(). Unfortunately that contends with concurrent
   modifications of the buffer descriptor (like pinning). Potential solutions
   are to rely less on functions like BufferGetBlockNumber() or to split out
   the memory for that into a separate (denser?) array.


d) Even after addressing all of the above, there's still a lot of contention

   I think the solution here would be something roughly akin to fastpath
   locks. If a buffer is very contended, we can mark it as super-pinned &
   share locked, avoiding any atomic operation on the buffer descriptor
   itself. Instead the current lock and pin count would be stored in each
   backend's PGPROC.  Obviously evicting or exclusively-locking such a
   buffer would be a lot more expensive.

   I've prototyped it and it helps a *lot*.  The reason I mention this here is
   that this seems impossible to do while using the generic lwlocks for the
   content lock.



== Solution ? ==

My conclusion from the above is that we ought to:


A) Make Buffer Locks something separate from lwlocks

   As part of that introduce a new lock level in-between share and exclusive
   locks (provisionally called share-exclusive, but I hate it). The new lock
   level allows other share lockers, but can only be held by one backend.

   This allows changing the rules so that:

   1) Share lockers are not allowed to modify buffers anymore
   2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
   3) Write IO needs the new lock level

   This addresses problem 1 from above; a sketch of the resulting hint bit
   path follows.
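
   As a sketch, SetHintBits() could then look something like this
   (ConditionalLockBufferShareExclusive() and DowngradeBufferLockToShare()
   are hypothetical names):

       /* in SetHintBits(), under the new rules */
       if (!ConditionalLockBufferShareExclusive(buffer))
           return;             /* e.g. write IO holds the lock: skip */

       tuple->t_infomask |= infomask;
       MarkBufferDirtyHint(buffer, true);

       DowngradeBufferLockToShare(buffer);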


B) Merge BufferDesc.state and the content lock

   This allows addressing problem 2 from above, as we can now atomically
   check whether IO was concurrently initiated.

   Obviously BufferDesc.state is not currently wide enough, therefore the
   buffer state has to be widened to a 64bit variable.
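
   The payoff is that the lock bits and the IO flags then live in one atomic
   word, so an exclusive locker can detect in-progress IO without a race,
   roughly like this (BUF_LOCK_MASK, BUF_LOCK_EXCLUSIVE and
   WaitForBufferLock() are hypothetical; waiter handling omitted):

       uint64      state = pg_atomic_read_u64(&buf->state);

       for (;;)
       {
           if (state & BM_IO_IN_PROGRESS)
           {
               /* reap the IO completion to guarantee forward progress */
               pgaio_wref_wait(&buf->io_wref);
               state = pg_atomic_read_u64(&buf->state);
           }
           else if (state & BUF_LOCK_MASK)
           {
               state = WaitForBufferLock(buf);
           }
           else if (pg_atomic_compare_exchange_u64(&buf->state, &state,
                                                   state | BUF_LOCK_EXCLUSIVE))
           {
               /* locked - no IO can have started in between */
               break;
           }
       }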


C) Allow some modifications of BufferDesc.state while holding spinlock

   Just naively doing the above two things reduces scalability, as the
   likelihood of CAS failures increases due to the increased number of
   modifications of the same atomic variable.

   However, by allowing unpinning while the buffer header spinlock is held,
   scalability improves considerably in my tests.

   Doing so requires changing all uses of LockBufHdr(), but by introducing a
   static inline helper the complexity can largely be encapsulated.
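
   For illustration, such a helper could look roughly like this (a sketch
   modeled on the UnlockBufHdrExt() that shows up later in this thread;
   because pins can now be released concurrently, the spinlock release
   itself has to be a CAS loop rather than a plain store):

       static inline void
       UnlockBufHdrExt(BufferDesc *buf, uint32 old_buf_state,
                       uint32 set_bits, uint32 clear_bits)
       {
           for (;;)
           {
               uint32      new_buf_state = old_buf_state;

               new_buf_state |= set_bits;
               new_buf_state &= ~(clear_bits | BM_LOCKED);

               /*
                * The CAS can only fail due to a concurrent refcount
                * change, which is now allowed while BM_LOCKED is held;
                * retry with the updated state in that case.
                */
               if (pg_atomic_compare_exchange_u32(&buf->state,
                                                  &old_buf_state,
                                                  new_buf_state))
                   break;
           }
       }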



I've prototyped the above. The current state is pretty rough, but before I
spend the non-trivial time to make it into an understandable sequence of
changes, I wanted to get feedback.


Does this plan sound reasonable?


The hardest part about this change is that everything kind of depends on each
other. The changes are large enough that they clearly can't just be committed
at once, but doing them over time risks [temporary] performance regressions.




The order of changes I think makes the most sense is the following:

1) Allow some modifications while holding the buffer header spinlock

2) Reduce buffer pin with just an atomic-sub

   This needs to happen first, otherwise there are performance regressions
   during the later steps.

3) Widen BufferDesc.state to 64 bits

4) Implement buffer locking inside BufferDesc.state

5) Do IO while holding the share-exclusive lock and require all buffer
   modifications to at least hold a share-exclusive lock

6) Wait for AIO when acquiring an exclusive content lock

(some of these will likely be split into parts of their own, but those are details)


Sane?


DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?


Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/CA%2BhUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf%40gcnactj4z56m
[3] In some cases it causes slowdowns of close to 50% for the checkpointer!
[4] https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf - slide 19



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Mihail Nikalayeu
Date:
Hello, Andres!

Andres Freund <andres@anarazel.de>:
>
>    As part of that introduce a new lock level in-between share and exclusive
>    locks (provisionally called share-exclusive, but I hate it). The new lock
>    level allows other share lockers, but can only be held by one backend.
>
>    This allows to change the rules so that:
>
>    1) Share lockers are not allowed to modify buffers anymore
>    2) Hint bits need the new lock mode (conditionally upgrading the lock in SetHintBits())
>    3) Write IO needs to the new lock level

IIUC, it may be mapped to the existing locking system:

BUFFER_LOCK_SHARE       ---->  AccessShareLock
new lock mode           ---->  ExclusiveLock
BUFFER_LOCK_EXCLUSIVE   ---->  AccessExclusiveLock

So, it all may be named:

BUFFER_LOCK_ACCESS_SHARE
BUFFER_LOCK_EXCLUSIVE
BUFFER_LOCK_ACCESS_EXCLUSIVE

being more consistent with the table-level locking system.

Greetings,
Mikhail.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> My conclusion from the above is that we ought to:
>
> A) Make Buffer Locks something separate from lwlocks
> B) Merge BufferDesc.state and the content lock
> C) Allow some modifications of BufferDesc.state while holding spinlock

+1 to (A) and (B). No particular opinion on (C) but if it works well, great.

> The order of changes I think makes the most sense is the following:
>
> 1) Allow some modifications while holding the buffer header spinlock
> 2) Reduce buffer pin with just an atomic-sub
> 3) Widen BufferDesc.state to 64 bits
> 4) Implement buffer locking inside BufferDesc.state
> 5) Do IO while holding share-exclusive lock and require all buffer
>    modifications to at least hold share exclusive lock
> 6) Wait for AIO when acquiring an exclusive content lock

No strong objections. I certainly like getting to (5) and (6) and I
think those are in the right order. I'm not sure about the rest. I
thought (1) and (2) were the same change after reading your email; and
it surprises me a little bit that (2) is separate from (4). But I'm
sure you have a much better sense of this than I do.

> DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
wholly hostile to the idea of coming up with something else, I don't
think our tendency to invent our own way to do everything is one of
our better tendencies as a project.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > My conclusion from the above is that we ought to:
> >
> > A) Make Buffer Locks something separate from lwlocks
> > B) Merge BufferDesc.state and the content lock
> > C) Allow some modifications of BufferDesc.state while holding spinlock
> 
> +1 to (A) and (B). No particular opinion on (C) but if it works well, great.

Without it I see performance regressions, due to the increased rate of CAS
failures caused by having more changes to one atomic variable :/



> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> > 2) Reduce buffer pin with just an atomic-sub
> > 3) Widen BufferDesc.state to 64 bits
> > 4) Implement buffer locking inside BufferDesc.state
> > 5) Do IO while holding share-exclusive lock and require all buffer
> >    modifications to at least hold share exclusive lock
> > 6) Wait for AIO when acquiring an exclusive content lock
> 
> No strong objections. I certainly like getting to (5) and (6) and I
> think those are in the right order. I'm not sure about the rest.


> I thought (1) and (2) were the same change after reading your email

They are certainly related. I thought it'd make sense to split them as
outlined above, as (1) is relatively verbose on its own, but far more
mechanical.


> and it surprises me a little bit that (2) is separate from (4).

Without doing (2) first, I see performance/scalability regressions when doing
(4). Doing (3) without (2) also hurts...



> > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> 
> AFAIK "share exclusive" or "SX" is standard terminology. While I'm not
> wholly hostile to the idea of coming up with something else, I don't
> think our tendency to invent our own way to do everything is one of
> our better tendencies as a project.

I guess it bothers me that we'd use share-exclusive to mean the buffer can't
be modified, but for real (vs. share, which does allow some modifications).
But it's quite plausible that there's no meaningfully better name, in which
case we certainly shouldn't differ from what's somewhat commonly used.


Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> I'm working on making bufmgr.c ready for AIO writes.

Nice!

> == Problem 2 - AIO writes vs exclusive locks ==
> 
> Separate from the hint bit issue, there is a second issue that I didn't have a
> good answer for: Making acquiring an exclusive lock concurrency safe in the
> presence of asynchronous writes:
> 
> The problem is that while a buffer is being written out, it obviously has to
> be share locked. That's true even with AIO. With AIO the share lock is held
> until the IO is completed. The problem is that if a backend wants to
> exclusively lock a buffer undergoing AIO, it can't just wait for the content
> lock as today, it might have to actually reap the IO completion from the
> operating system. If one just were to wait for the content lock, there's no
> forward progress guarantee.
> 
> The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
> locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> problem is that it's surprisingly hard to do so race free:
> 
> If a backend A were to just check if a buffer is undergoing IO before locking
> it, a backend B could start IO on the buffer between A checking for
> BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
> hold the buffer header spinlock across a blocking lwlock acquisition.
> 
> There potentially are ways to synchronize the buffer state and the content
> lock, but it requires deep integration between bufmgr.c and lwlock.c.

You may have considered and rejected simpler alternatives for (2) before
picking the approach you go on to outline.  Anything interesting?  For
example, I imagine these might work with varying degrees of inefficiency:

- Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
  there's a non-I/O lock conflict.
- Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

> == Problem 3 - Cacheline contention ==

> c) Read accesses to the BufferDesc cause contention
> 
>    Some code, like nbtree, relies on functions like
>    BufferGetBlockNumber(). Unfortunately that contends with concurrent
>    modifications of the buffer descriptor (like pinning). Potential solutions
>    are to rely less on functions like BufferGetBlockNumber() or to split out
>    the memory for that into a separate (denser?) array.

Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
structure, since the buffer's mapping can't change until we unpin.

> d) Even after addressing all of the above, there's still a lot of contention
> 
>    I think the solution here would be something roughly to fastpath locks. If
>    a buffer is very contended, we can mark it as super-pinned & share locked,
>    avoiding any atomic operation on the buffer descriptor itself. Instead the
>    current lock and pincount would be stored in each backends PGPROC.
>    Obviously evicting or exclusively-locking such a buffer would be a lot more
>    expensive.
> 
>    I've prototyped it and it helps a *lot*.  The reason I mention this here is
>    that this seems impossible to do while using the generic lwlocks for the
>    content lock.

Nice.

On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > The order of changes I think makes the most sense is the following:

No concerns so far.  I won't claim I can picture all the implications and be
sure this is the right thing, but it sounds promising.  I like your principle
of ordering changes to avoid performance regressions.

> > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

I would consider {AccessShare, Exclusive, AccessExclusive}.  What the $SUBJECT
proposal calls SHARE-EXCLUSIVE would become Exclusive.  That has the same
conflict matrix as the corresponding heavyweight locks, which seems good.  I
don't love our mode names, particularly ShareRowExclusive being unsharable.
However, learning one special taxonomy is better than learning two.

> > AFAIK "share exclusive" or "SX" is standard terminology.

Can you say more about that?  I looked around at
https://google.com/search?q=share+exclusive+%22sx%22+lock but didn't find
anything well-aligned with the proposal:

https://dev.mysql.com/doc/dev/mysql-server/latest//PAGE_LOCK_ORDER.html looked
most relevant, but it doesn't give the big idea.
https://mysqlonarm.github.io/Understanding-InnoDB-rwlock-stats/ is less
authoritative but does articulate the big idea, as "Shared-Exclusive (SX):
offer write access to the resource with inconsistent read. (relaxed
exclusive)."  That differs from $SUBJECT semantics, in which SHARE-EXCLUSIVE
can't see inconsistent reads.

https://docs.oracle.com/en/database/oracle/oracle-database/19/arpls/DBMS_LOCK.html
has term SX = "sub exclusive".  I gather an SX lock on a table lets one do
SELECT FOR UPDATE on that table (each row is the "sub"component being locked).

https://man.freebsd.org/cgi/man.cgi?query=sx_slock&sektion=9&format=html uses
the term "SX", but it's more like our lwlocks.  One acquires S or X, not
blends of them.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Tue, Aug 26, 2025 at 8:14 PM Noah Misch <noah@leadboat.com> wrote:
> > > AFAIK "share exclusive" or "SX" is standard terminology.
>
> Can you say more about that?

Looks like I was misremembering. I was thinking of Gray & Reuter,
Transaction Processing: Concepts and Techniques, 1993. However,
opening it up, I find that his vocabulary is slightly different. He
offers the following six lock modes: IS, IX, S, SIX, Update, X. "I"
means "intent" and acts as a modifier to the letter that follows.
Hence, SIX means "a course-granularity shared lock with intent to set
finer-granularity exclusive locks" (p. 408). His lock manager is
hierarchical, so taking a SIX lock on a table means that you are
allowed to read all the rows in the table and you are allowed to
exclusive-lock individual rows as desired and nobody is allowed to
exclusive-lock any rows in the table. It is compatible only with IS;
that is, it does not preclude other people from share-locking
individual rows (which might delay your exclusive locks on those
rows). Since we don't have intent-locking in PostgreSQL, I think my
brain mentally flattened this hierarchy down to S, X, SX, but that's
not what he actually wrote.

His "Update" locks are also somewhat interesting: an update lock is
exactly like an exclusive lock except that it permits PAST
share-locks. You take an update lock when you currently need a
share-lock but anticipate the possibility of needing an
exclusive-lock. This is a deadlock avoidance strategy: updaters will
take turns, and some of them will ultimately want exclusive locks and
others won't, but they can't deadlock against each other as long as
they all take "Update" locks initially and don't try to upgrade to
that level later. An updater's attempt to upgrade to an exclusive lock
can still be delayed by, or deadlock against, share lockers, but those
typically won't try to acquire higher lock levels later.

If we were to use the existing PostgreSQL naming convention, I think
I'd probably argue that the nearest parallel to this level is
ShareUpdateExclusive: a self-exclusive lock level that permits
ordinary table access to continue while blocking exclusive locks, used
for an in-flight maintenance operation. But that's arguable, of
course.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > == Problem 2 - AIO writes vs exclusive locks ==
> >
> > Separate from the hint bit issue, there is a second issue that I didn't have a
> > good answer for: Making acquiring an exclusive lock concurrency safe in the
> > presence of asynchronous writes:
> >
> > The problem is that while a buffer is being written out, it obviously has to
> > be share locked. That's true even with AIO. With AIO the share lock is held
> > until the IO is completed. The problem is that if a backend wants to
> > exclusively lock a buffer undergoing AIO, it can't just wait for the content
> > lock as today, it might have to actually reap the IO completion from the
> > operating system. If one just were to wait for the content lock, there's no
> > forward progress guarantee.
> >
> > The buffer's state "knows" that it's undergoing write IO (BM_VALID and
> > BM_IO_IN_PROGRESS are set). To guarantee forward progress, an exclusive
> > locker needs to wait for the IO (pgaio_wref_wait(BufferDesc->io_wref)). The
> > problem is that it's surprisingly hard to do so race free:
> >
> > If a backend A were to just check if a buffer is undergoing IO before locking
> > it, a backend B could start IO on the buffer between A checking for
> > BM_IO_IN_PROGRESS and acquiring the content lock.  We obviously can't just
> > hold the buffer header spinlock across a blocking lwlock acquisition.
> >
> > There potentially are ways to synchronize the buffer state and the content
> > lock, but it requires deep integration between bufmgr.c and lwlock.c.
>
> You may have considered and rejected simpler alternatives for (2) before
> picking the approach you go on to outline.

I tried a few things...


> Anything interesting?

Not really.

The first one you propose is what I looked at for a while:

> For example, I imagine these might work with varying degrees of
> inefficiency:
>
> - Use LWLockConditionalAcquire() with some nonstandard waiting protocol when
>   there's a non-I/O lock conflict.

It's nontrivial to make this race free - the problem is the case where we *do*
have to wait for an exclusive content lock. It's possible for the lwlock to be
released by the owning backend and for IO to be started, after checking
whether IO is in progress (after LWLockConditionalAcquire() had failed).

I came up with a complicated scheme, where setting IO in progress would
afterwards wake up all lwlock waiters and all exclusive content lock waits
were done with LWLockAcquireOrWait().  I think that was working - but it's
also slower and seems really fragile and ugly.


> - Take BM_IO_IN_PROGRESS before exclusive-locking, then release it.

That just seems expensive. We could make it cheaper by doing it only if a
LWLockConditionalAcquire() doesn't succeed. But it still seems not great.  And
it doesn't really help with addressing the "setting hint bits while IO is in
progress" part...


> > == Problem 3 - Cacheline contention ==
>
> > c) Read accesses to the BufferDesc cause contention
> >
> >    Some code, like nbtree, relies on functions like
> >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> >    modifications of the buffer descriptor (like pinning). Potential solutions
> >    are to rely less on functions like BufferGetBlockNumber() or to split out
> >    the memory for that into a separate (denser?) array.
>
> Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> structure, since the buffer's mapping can't change until we unpin.

Hm. I didn't think about a backend-local data structure for that, perhaps
because it seems not cheap to maintain (both from a runtime and a space
perspective).

If we store the read-only data for buffers separately from the read-write
data, we could access that from backends without a lock, since it can't change
with the buffer pinned.

One way to do that would be to maintain a back-pointer from the BufferDesc to
the BufferLookupEnt, since the latter *already* contains the BufferTag. We
probably don't want to add another indirection to the buffer mapping hash
table, otherwise we could deduplicate the other way round and just put padding
between the modified and read-only part of a buffer desc.



> On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > The order of changes I think makes the most sense is the following:
>
> No concerns so far.  I won't claim I can picture all the implications and be
> sure this is the right thing, but it sounds promising.  I like your principle
> of ordering changes to avoid performance regressions.

I suspect we'll have to merge this incrementally to stay sane. I don't want to
end up with a period of worse performance due to this; that could make it
harder to evaluate other work.


> > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> I would consider {AccessShare, Exclusive, AccessExclusive}.

One thing I forgot to mention is that with the proposed re-architecture in
place, we could subsequently go further and make pinning just be a very
lightweight lock level, instead of that being a separate dedicated
infrastructure.  One nice outgrowth of that would be that acquiring a
cleanup lock would just be a real lock acquisition, instead of the dedicated
limited machinery we have right now.

Which would leave us with:
- reference (pins today)
- share
- share-exclusive
- exclusive
- cleanup

This doesn't quite seem to map onto the heavyweight lock levels in a sensible
way...
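
As a compatibility matrix, my reading of those five levels would be roughly
('+' = compatible):

                     ref  share  share-excl  excl  cleanup
    reference         +     +        +        +      -
    share             +     +        +        -      -
    share-exclusive   +     +        -        -      -
    exclusive         +     -        -        -      -
    cleanup           -     -        -        -      -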



> What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.

There are a few hundred references to the lock levels though, seems painful to
rename them :(


> That has the same conflict matrix as the corresponding heavyweight locks,
> which seems good.

> I don't love our mode names, particularly ShareRowExclusive being
> unsharable.

I hate them with a passion :). Except for the most basic ones they just don't
stay in my head for more than a few hours.


> However, learning one special taxonomy is better than learning two.

But that's fair.


Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> On 2025-08-26 17:14:49 -0700, Noah Misch wrote:
> > On Fri, Aug 22, 2025 at 03:44:48PM -0400, Andres Freund wrote:
> > > == Problem 3 - Cacheline contention ==
> >
> > > c) Read accesses to the BufferDesc cause contention
> > >
> > >    Some code, like nbtree, relies on functions like
> > >    BufferGetBlockNumber(). Unfortunately that contends with concurrent
> > >    modifications of the buffer descriptor (like pinning). Potential solutions
> > >    are to rely less on functions like BufferGetBlockNumber() or to split out
> > >    the memory for that into a separate (denser?) array.
> >
> > Agreed.  BufferGetBlockNumber() could even use a new local (non-shmem) data
> > structure, since the buffer's mapping can't change until we unpin.
> 
> Hm. I didn't think about a backend local datastructure for that, perhaps
> because it seems not cheap to maintain (both from a runtime and a space
> perspective).

Yes, paying off the cost of maintaining it could be tricky.  It could be the
kind of thing where the overhead loses at 10 cores and wins at 40 cores.  It
could also depend heavily on the workload's concurrent pins per backend.

> If we store the read-only data for buffers separately from the read-write
> data, we could access that from backends without a lock, since it can't change
> with the buffer pinned.

Good point.  That alone may be enough of a win.

> One way to do that would be to maintain a back-pointer from the BufferDesc to
> the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> probably don't want to add another indirection to the buffer mapping hash
> table, otherwise we could deduplicate the other way round and just put padding
> between the modified and read-only part of a buffer desc.

I think you're saying clients would save the back-pointer once and dereference
it many times, with each dereference of a saved back-pointer avoiding a shmem
read of BufferDesc.tag.  Is that right?

> > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> 
> > I would consider {AccessShare, Exclusive, AccessExclusive}.
> 
> One thing I forgot to mention is that with the proposed re-architecture in
> place, we could subsequently go further and make pinning just be a very
> lightweight lock level, instead of that being a separate dedicated
> infrastructure.  One nice outgrowth of that would be that acquiring a
> cleanup lock would just be a real lock acquisition, instead of the dedicated
> limited machinery we have right now.
> 
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
> 
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could map it like this:

AccessShare - pins today
RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
Share - set hint bits
ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
AccessExclusive - cleanup lock or evict the buffer

That has a separate level for hint bits vs. I/O, so multiple backends could
set hint bits.  I don't know whether the benchmarks would favor maintaining
that distinction.

> > What the $SUBJECT proposal calls SHARE-EXCLUSIVE would become Exclusive.
> 
> There are a few hundred references to the lock levels though, seems painful to
> rename them :(

Yes, especially in comments and extensions.  Likely more important than that
for the long-term, your latest proposal has the advantage of keeping short
names for the most-commonly-referenced lock types.  (We could keep
BUFFER_LOCK_SHARE with the lower layers translating that into RowShare, but
that weakens or eliminates the benefit of reducing what readers need to
learn.)  For what it's worth, 6 PGXN modules reference BUFFER_LOCK_SHARE
and/or BUFFER_LOCK_EXCLUSIVE.

Compared to share-exclusive, I think I'd prefer a name that describes the use
cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).
What do you think of that?  I don't know whether that should win vs. names
like ShareUpdateExclusive, though.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Robert Haas
Date:
On Wed, Aug 27, 2025 at 12:18 PM Andres Freund <andres@anarazel.de> wrote:
> Which would leave us with:
> - reference (pins today)
> - share
> - share-exclusive
> - exclusive
> - cleanup
>
> This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> way...

Could do: ACCESS SHARE, SHARE, SHARE UPDATE EXCLUSIVE, EXCLUSIVE,
ACCESS EXCLUSIVE.

I've always thought that a pin was a lot like an access share lock and
a cleanup lock was a lot like an access exclusive lock.

But then again, using the same terminology for two different things
might be confusing.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > One way to do that would be to maintain a back-pointer from the BufferDesc to
> > the BufferLookupEnt, since the latter *already* contains the BufferTag. We
> > probably don't want to add another indirection to the buffer mapping hash
> > table, otherwise we could deduplicate the other way round and just put padding
> > between the modified and read-only part of a buffer desc.
> 
> I think you're saying clients would save the back-pointer once and dereference
> it many times, with each dereference of a saved back-pointer avoiding a shmem
> read of BufferDesc.tag.  Is that right?

I was thinking that we'd not have BufferDesc.tag, instead storing it solely
in BufferLookupEnt. To get the tag of a BufferDesc, you'd have to follow the
back-reference every time.  But that's actually why it doesn't work -
reading the back-reference pointer would have the same issue as just reading
BufferDesc.tag...


> > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?
> > 
> > > I would consider {AccessShare, Exclusive, AccessExclusive}.
> > 
> > One thing I forgot to mention is that with the proposed re-architecture in
> > place, we could subsequently go further and make pinning just be a very
> > lightweight lock level, instead of that being a separate dedicated
> > infrastructure.  One nice outgrowth of that would be that acquiring a
> > cleanup lock would just be a real lock acquisition, instead of the dedicated
> > limited machinery we have right now.
> > 
> > Which would leave us with:
> > - reference (pins today)
> > - share
> > - share-exclusive
> > - exclusive
> > - cleanup
> > 
> > This doesn't quite seem to map onto the heavyweight lock levels in a sensible
> > way...
> 
> Could map it like this:
> 
> AccessShare - pins today
> RowShare - check tuple visibility (BUFFER_LOCK_SHARE today)
> Share - set hint bits
> ShareUpdateExclusive - clean/write out (borrowing Robert's idea)
> Exclusive - add tuples, change xmax, etc. (BUFFER_LOCK_EXCLUSIVE today)
> AccessExclusive - cleanup lock or evict the buffer

I tend to think having things like RowShare for buffer locking is confusing
enough that the similarity to the heavyweight locks wouldn't actually be a
win...


> That has a separate level for hint bits vs. I/O, so multiple backends could
> set hint bits.  I don't know whether the benchmarks would favor maintaining
> that distinction.

I don't think it would - I actually found that multiple backends setting the
same hint bits *hurts* performance a bit.  But what's more important, we
don't have the space for it, I think.  Every lock level that can be acquired
multiple times needs a lock count of 18 bits, and we need to store the buffer
state flags (10 bits). There's just not enough space in 64 bits for three
18-bit counters as well as the flag bits etc.
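
Spelled out, the bit budget under that scheme would be roughly (my
arithmetic, assuming counts sized for MAX_BACKENDS):

    pin count            18 bits
    share lock count     18 bits
    hint-bit lock count  18 bits
    buffer state flags   10 bits
    ------------------------------
    total                64 bits  - nothing left for the usage count,
                                    waiter bits, etc.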


> Compared to share-exclusive, I think I'd prefer a name that describes the use
> cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

I would too, I just couldn't come up with something that conveys the meanings
in a sufficiently concise way :)


> What do you think of that?  I don't know whether that should win vs. names
> like ShareUpdateExclusive, though.

I think it'd be a win compared to the heavyweight lock names...

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Noah Misch
Date:
On Wed, Aug 27, 2025 at 03:29:02PM -0400, Andres Freund wrote:
> On 2025-08-27 12:14:41 -0700, Noah Misch wrote:
> > On Wed, Aug 27, 2025 at 12:18:27PM -0400, Andres Freund wrote:
> > > > On Tue, Aug 26, 2025 at 05:00:13PM -0400, Andres Freund wrote:
> > > > > On 2025-08-26 16:21:36 -0400, Robert Haas wrote:
> > > > > > On Fri, Aug 22, 2025 at 3:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > > > > > DOES ANYBODY HAVE A BETTER NAME THAN SHARE-EXCLUSIVE???!?

> > > Which would leave us with:
> > > - reference (pins today)
> > > - share
> > > - share-exclusive
> > > - exclusive
> > > - cleanup

> > Compared to share-exclusive, I think I'd prefer a name that describes the use
> > cases, "set-hints-or-write" (or separate "write" and "set-hints" levels).

Another name idea is "self-exclusive", to contrast with "exclusive" excluding
all of (exclusive, self-exclusive, share).

Fortunately, not much code will acquire this lock type.  Hence, there's
relatively little damage if the name is less obvious than older lock types or
if the name changes later.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Michael Paquier
Date:
On Wed, Aug 27, 2025 at 10:03:08AM -0400, Robert Haas wrote:
> If we were to use the existing PostgreSQL naming convention, I think
> I'd probably argue that the nearest parallel to this level is
> ShareUpdateExclusive: a self-exclusive lock level that permits
> ordinary table access to continue while blocking exclusive locks, used
> for an in-flight maintenance operation. But that's arguable, of
> course.

ShareUpdateExclusive is a term that's been used for some time now and
relates to knowledge that's quite widespread in the tree, so it feels like
a natural fit for the use-case described on this thread, as we'd want a
self-conflicting lock.  share-exclusive did not sound that bad to me,
TBH, quite the contrary, when applied to buffer locking for AIO.

"intent" is also a word I've bumped quite a lot into while looking at
some naming convention, but this is more related to the fact that a
lock is going to be taken, which we don't really have.  So that feels
off.
--
Michael

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> The hardest part about this change is that everything kind of depends on each
> other. The changes are large enough that they clearly can't just be committed
> at once, but doing them over time risks [temporary] performance regressions.
> 
> 
> 
> 
> The order of changes I think makes the most sense is the following:
> 
> 1) Allow some modifications while holding the buffer header spinlock
> 
> 2) Reduce buffer pin with just an atomic-sub
> 
>    This needs to happen first, otherwise there are performance regressions
>    during the later steps.

Here are the first few cleaned up patches implementing the above steps, as
well as some cleanups.  I included a commit from another thread, as it
conflicts with these changes, and we really should apply it - and it's
arguably required to make the changes viable, as it removes one more use of
PinBuffer_Locked().

Another change included is to not return the buffer with the spinlock held
from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
for that is to reduce the most common PinBuffer_Locked() call. By definition
PinBuffer_Locked() will become a bit slower due to 0003. But even without
0003, 0002 is faster than master. And the previous approach also just seems
pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
but I don't really have a better idea.

I invite particular attention to the commit message for 0003 as well as the
comment changes in buf_internals.h within.

I'm not sure I like the TODOs added in 0003 and removed in 0004, but squashing
the changes doesn't really seem better either.

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> On 2025-08-22 15:44:48 -0400, Andres Freund wrote:
> > The hardest part about this change is that everything kind of depends on each
> > other. The changes are large enough that they clearly can't just be committed
> > at once, but doing them over time risks [temporary] performance regressions.
> >
> >
> >
> >
> > The order of changes I think makes the most sense is the following:
> >
> > 1) Allow some modifications while holding the buffer header spinlock
> >
> > 2) Reduce buffer pin with just an atomic-sub
> >
> >    This needs to happen first, otherwise there are performance regressions
> >    during the later steps.
>
> Here are the first few cleaned up patches implementing the above steps, as
> well as some cleanups.  I included a commit from another thread, as it
> conflicts with these changes, and we really should apply it - and it's
> arguably required to make the changes viable, as it removes one more use of
> PinBuffer_Locked().
>
> Another change included is to not return the buffer with the spinlock held
> from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> for that is to reduce the most common PinBuffer_Locked() call. By definition
> PinBuffer_Locked() will become a bit slower due to 0003. But even without
> 0003, 0002 is faster than master. And the previous approach also just seems
> pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> but I don't really have a better idea.
>
> I invite particular attention to the commit message for 0003 as well as the
> comment changes in buf_internals.h within.

Robert looked at the patches while we were chatting, and I addressed his
feedback in this new version.

Changes:

- Updated patch description for 0002, giving a lot more background

- Improved BufferDesc comments a fair bit more in 0003

- Reduced the size of 0003 a bit, by using UnlockBufHdrExt() instead of a CAS
  loop in buffer_stage_common() and reordering some things in
  TerminateBufferIO()

- Made 0004 only use the non-looping atomic op in UnpinBufferNoOwner(), not
  MarkBufferDirty(). I realized the latter would take additional complexity to
  make safe (a CAS loop in TerminateBufferIO()).  I am somewhat doubtful that
  there are workloads where it matters...


Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Matthias van de Meent
Date:
On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
> On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> > Here are the first few cleaned up patches implementing the above steps, as
> > well as some cleanups.  I included a commit from another thread, as it
> > conflicts with these changes, and we really should apply it - and it's
> > arguably required to make the changes viable, as it removes one more use of
> > PinBuffer_Locked().
> >
> > Another change included is to not return the buffer with the spinlock held
> > from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> > for that is to reduce the most common PinBuffer_Locked() call. By definition
> > PinBuffer_Locked() will become a bit slower due to 0003. But even without
> > 0003, 0002 is faster than master. And the previous approach also just seems
> > pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> > but I don't really have a better idea.
> >
> > I invite particular attention to the commit message for 0003 as well as the
> > comment changes in buf_internals.h within.
>
> Robert looked at the patches while we were chatting, and I addressed his
> feedback in this new version.

I like these changes, and have some minor comments:

0001 ensures that ReadRecentBuffer increments the usage counter, which
someone who uses an access strategy may want to prevent. I know this
isn't exactly new behaviour, but something I noticed anyway. Apart
from that observation, LGTM

0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
the comment itself needing updates, how about:

+     * Ensure, before we pin a victim buffer, that there's a free refcount
+     * entry, and a resource owner slot for the pin.

Again, LGTM.

0003's UnlockBufHdrExt:
This is implemented with CAS, even when we only want to change bits we
know the state of (or could know, if we spent the effort).
Given its inline nature, wouldn't it be better to use atomic_sub
instructions? Or is this to handle cases where the bits we want to
(un)set might be (un)set by a concurrent process?
If the latter, could we specialize this to do a single atomic_sub
whenever we want to change state bits that we know can be only changed
whilst holding the spinlock?

0004: LGTM

0005: LGTM

0006: LGTM

Kind regards,

Matthias van de Meent



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
> On Tue, 23 Sept 2025 at 00:14, Andres Freund <andres@anarazel.de> wrote:
> > On 2025-09-15 19:05:37 -0400, Andres Freund wrote:
> > > Here are the first few cleaned up patches implementing the above steps, as
> > > well as some cleanups.  I included a commit from another thread, as it
> > > conflicts with these changes, and we really should apply it - and it's
> > > arguably required to make the changes viable, as it removes one more use of
> > > PinBuffer_Locked().
> > >
> > > Another change included is to not return the buffer with the spinlock held
> > > from StrategyGetBuffer(), and instead pin the buffer in freelist.c. The reason
> > > for that is to reduce the most common PinBuffer_Locked() call. By definition
> > > PinBuffer_Locked() will become a bit slower due to 0003. But even without
> > > 0003, 0002 is faster than master. And the previous approach also just seems
> > > pretty unclean.  I don't love that it requires the new TrackNewBufferPin(),
> > > but I don't really have a better idea.
> > >
> > > I invite particular attention to the commit message for 0003 as well as the
> > > comment changes in buf_internals.h within.
> >
> > Robert looked at the patches while we were chatting, and I addressed his
> > feedback in this new version.
> 
> I like these changes, and have some minor comments:

Thank for reviewing!


> 0001 ensures that ReadRecentBuffer increments the usage counter, which
> someone who uses an access strategy may want to prevent. I know this
> isn't exactly new behaviour, but something I noticed anyway. Apart
> from that observation, LGTM

Are you proposing to change behaviour? Right now ReadRecentBuffer doesn't even
accept a strategy, so I don't really see this as something that needs to be
tackled at this point.

I'm not sure I see any real use cases for ReadRecentBuffer() that would
benefit from a strategy, but I very well might just not be thinking wide
enough.


> 0002 has a FIXME in a comment in GetVictimBuffer. Assuming it's about
> the comment itself needing updates

Indeed.


> , how about:
> 
> +     * Ensure, before we pin a victim buffer, that there's a free refcount
> +     * entry, and a resource owner slot for the pin.
> 
> Again, LGTM.

WFM.


> 0003's UnlockBufHdrExt:
> This is implemented with CAS, even when we only want to change bits we
> know the state of (or could know, if we spent the effort).
> Given its inline nature, wouldn't it be better to use atomic_sub
> instructions? Or is this to handle cases where the bits we want to
> (un)set might be (un)set by a concurrent process?

Yes, it's to handle concurrent changes to the buffer state.

> If the latter, could we specialize this to do a single atomic_sub
> whenever we want to change state bits that we know can be only changed
> whilst holding the spinlock?

We probably could optimize some cases as an atomic-sub, some others as an
atomic-and and others again as an atomic-or. The latter two, however, are
implemented as a CAS on x86 anyway...

After 0004 I don't think any of the paths using this are actually particularly
hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
there are hot paths, we really should try to work towards not even needing the
buffer header spinlock; that has a bigger impact than improving the code for
unlocking the buffer header...

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Matthias van de Meent
Date:
Hi,

On Tue, 7 Oct 2025 at 00:55, Andres Freund <andres@anarazel.de> wrote:
> On 2025-10-04 09:05:45 +0200, Matthias van de Meent wrote:
> > 0001 ensures that ReadRecentBuffer increments the usage counter, which
> > someone who uses an access strategy may want to prevent. I know this
> > isn't exactly new behaviour, but something I noticed anyway. Apart
> > from that observation, LGTM
>
> Are you proposing to change behaviour?

Eventually, yes, but not necessarily now or with this patchset.

> Right now ReadRecentBuffer doesn't even
> accept a strategy, so I don't really see this as something that needs to be
> tackled at this point.
>
> I'm not sure I see any real use cases for ReadRecentBuffer() that would
> benefit from a strategy, but I very well might just not be thinking wide
> enough.

I think it's rather strange that there is no Extended variant of
ReadRecentBuffer, like how there is a ReadBufferExtended for
ReadBuffer. Yes, ReadRecentBuffer has more arguments to fill and so
has a smaller difference versus ReadBufferExtended, but it's not a
complete replacement for ReadBuffer[Ext] when you're aware of a recent
buffer of the page.

I'm not saying it will definitely happen, but I could see that e.g.
amcheck might want to keep track of buffer IDs of recent heap pages it
accessed to verify an index's results, without holding a pin on all the
pages; and instead use ReadRecentBuffer[Extended] with a
BufferAccessStrategy to allow re-acquiring the buffer pin without
blowing out shared buffers or making parts of the pool take forever to
evict again.

> > 0003's UnlockBufHdrExt:
> > This is implemented with CAS, even when we only want to change bits we
> > know the state of (or could know, if we spent the effort).
> > Given its inline nature, wouldn't it be better to use atomic_sub
> > instructions? Or is this to handle cases where the bits we want to
> > (un)set might be (un)set by a concurrent process?
>
> Yes, it's to handle concurrent changes to the buffer state.
>
> > If the latter, could we specialize this to do a single atomic_sub
> > whenever we want to change state bits that we know can be only changed
> > whilst holding the spinlock?
>
> We probably could optimize some cases as an atomic-sub, some others as an
> atomic-and and others again as an atomic-or. The latter two, however, are
> implemented as a CAS on x86 anyway...
>
> After 0004 I don't think any of the paths using this are actually particularly
> hot, so I'm somewhat doubtful it's worth trying to optimize this too much. If
> there are hot paths, we really should try to work towards not even needing the
> buffer header spinlock; that has a bigger impact than improving the code for
> unlocking the buffer header...

Fair enough; I guess we'll see if further optimization would have much
impact once this all has been committed.

Kind regards,

Matthias van de Meent
Databricks



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

I pushed a few commits from this patchset after Matthias' review
(thanks!). Unfortunately, in 5e899859287 I missed that the valgrind
annotations would no longer be done for the buffers returned by
StrategyGetBuffer(), which turned skink red.

The attached 0001 patch centralizes the valgrind initialization in
TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
naming isn't a perfect match, but it seems fine to me.


0002 is a new version of "Allow some buffer state modifications while holding
header lock", mainly with a fair bit of comment polishing around BufferDesc
and one small oversight fixed (it didn't update a buffer_state variable in
one place).

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
> I pushed a few commits from this patchset after Matthias' review
> (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
> would not be done anymore for the buffers returned by
> StrategyGetBuffer(). Which turned skink red.
> 
> The attached 0001 patch centralizes the valgrind initialization in
> TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
> is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
> naming isn't a perfect match, but it seems fine to me.

Forgot to say: I'll push this patch soon, to get skink back to green, unless
somebody says something.  We can adjust this later, if the comment and/or
placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-10-09 17:16:49 -0400, Andres Freund wrote:
> On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
> > I pushed a few commits from this patchset after Matthias' review
> > (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
> > would not be done anymore for the buffers returned by
> > StrategyGetBuffer(). Which turned skink red.
> > 
> > The attached 0001 patch centralizes the valgrind initialization in
> > TrackNewBufferPin(), which 5e899859287 had added. The nice side effect of that
> > is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than before. The
> > naming isn't a perfect match, but it seems fine to me.
> 
> Forgot to say: I'll push this patch soon, to get skink back to green. Unless
> somebody says something.  We can adjust this later, if the comment and/or
> placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyones liking.

I pushed that fix, as well as the subsequent buffer header locking changes,
a while ago.


Attached is a patchset that actually implements the buffer content locks in
bufmgr.c. This isn't that close to a committable shape yet, but it seemed
useful to get it out there.  The first few patches seem closer, so it'll also
be useful to narrow this down.

0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
    effectively harmless.

0002: Not really required, but seems like an improvement to me

0003: A prerequisite to 0004, pretty boring itself

0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
additional bits yet, though.  Some annoying reformatting required to avoid
long lines.

0005: There already was a wait event class for BUFFERPIN. It seems better to
make that more general than to implement them separately.

0006+0007: This is preparatory work for 0008, but also worthwhile on its
own. The private refcount stuff does show up in profiles. The reason it's
related is that without these changes the added information in 0008 makes that
worse.

0008: The main change. Implements buffer content locking independently from
lwlock.c. There's obviously a lot of similarity between lwlock.c code and
this, but I've not found a good way to reduce the duplication without giving
up too much.  This patch does immediately introduce share-exclusive as a new
lock level, mostly because it was too painful to do separately.

0009+0010+0011: Preparatory work for 0012.

0012: One of the main goals of this patchset - use the new share-exclusive
lock level to only allow hint bits to be set while no IO is going on.

0013: Prototype of making UnlockReleaseBuffer() faster and of using it more
widely in nbtree.c

0014: Now that hint bits can't be done while IO is going on, we don't need to
copy pages anymore.  This needs a fair bit more work, as denoted by the FIXMEs
in the code.

I've tried to add detail to the more important commit messages, at least until
0012.


I want to again emphasize that the important commits (i.e. 0008, 0012, 0014)
aren't close to being mergeable. But I think they're in a stage that they
could benefit from "lenient" high-level review.

Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Greg Burd
Date:
On Nov 19 2025, at 9:47 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-10-09 17:16:49 -0400, Andres Freund wrote:
>> On 2025-10-09 16:35:44 -0400, Andres Freund wrote:
>> > I pushed a few commits from this patchset after Matthias' review
>> > (thanks!). Unfortunately in 5e899859287 I missed that the valgrind annotations
>> > would not be done anymore for the buffers returned by
>> > StrategyGetBuffer(). Which turned skink red.
>> > 
>> > The attached 0001 patch centralizes the valgrind initialization in
>> > TrackNewBufferPin(), which 5e899859287 had added. The nice side
>> effect of that
>> > is that there are fewer VALGRIND_MAKE_MEM_DEFINED() calls than
>> before. The
>> > naming isn't the perfect match, but it seems fine to me.
>> 
>> Forgot to say: I'll push this patch soon, to get skink back to green. Unless
>> somebody says something.  We can adjust this later, if the comment and/or
>> placement of VALGRIND_MAKE_MEM_DEFINED() isn't to everyone's liking.
> 
> I have pushed that fix as well as the subsequent buffer header locking changes
> a while ago.

Hello Andres,

After talking to you about these ideas at PGConf in NYC I've been
anxiously awaiting this patch set. Thanks for dedicating the energy and
time to get it to this stage.

High level feedback after reading the patches/email/commit messages is
that it looks to get you to where you wanted to be, unblocking AIO
writes. I think they'll end up being faster than what's in place now,
even before you get to the AIO piece.  Certainly removing the copy of
each page to do a checksum will help.  Opening the door to a future
where we can have super-pinned/locked pages is also a net win.

Everything before/after 0008 was rather easy to digest and understand,
and I found nothing really to call out at this stage.

0008 is understandable too, it's just sizable. While it is large, I find
it well laid out and more readable than before.  I gave the locking code
a good look, it seems correct AFAICT.

Keep going, I'll be happy to dedicate time to testing and digging into
the commits as you get this into a final state.  I look forward to
extending/enhancing this code once integrated.

> I want to again emphasize that the important commits (i.e. 0008, 0012, 0014)
> aren't close to being mergeable. But I think they're in a stage that they
> could benefit from "lenient" high-level review.
> 
> Greetings,
> 
> Andres Freund

cheers.

-greg



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-20 14:08:57 -0500, Greg Burd wrote:
> After talking to you about these ideas at PGConf in NYC I've been
> anxiously awaiting this patch set. Thanks for dedicating the energy and
> time to get it to this stage.

Thanks!


> High level feedback after reading the patches/email/commit messages is
> that it looks to get you to where you wanted to be, unblocking AIO
> writes. I think they'll end up being faster than what's in place now,
> even before you get to the AIO piece.  Certainly removing the copy of
> each page to do a checksum will help.  Opening the door to a future
> where we can have super-pinned/locked pages is also a net win.

I actually had planned to write about the performance effects, in particular
of 0012, a bit more:

It's worth pointing out that the new way of setting hint bits is inherently
more expensive than what we did before - upgrading a lock to a different lock
level isn't free, compared to doing, well, nothing.

For paths that set the hint bits of a whole page, like a seqscan, that cost is
more than amortized by the batched approach introduced in 0011. Those get
faster with the patch, both when already hinted and when not.

However, there are paths that aren't easily amenable to that approach, like
e.g. an ordered index scan referencing unhinted tuples. There we only ever
access a single tuple and release the upgraded lock after every tuple. If the
index scan is perfectly correlated with the table and every tuple is unhinted,
that's a decent amount of additional work.

I've spent a lot of time micro-optimizing that workload, to avoid any
significant regressions. An extreme stress-test started out being about 20%
slower than today; as of my current local version, it's a bit faster (~1%) on
one of my machines and a bit slower (~2%) on another. Partially that was
achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
path for updating a single hint bit, avoiding redundant reads of the lock
state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
locking code.  The latter is a bit of a cheat though - things would be even
faster if we went with the old way of setting hint bits, but with the
independent optimizations applied.

I think that's ok though:

1) the old way of setting hint bits is a pretty dirty hack that causes issues
   in quite a few places.

2) by definition, having to set hint bits is an ephemeral state, once the hint
   bits are set, the difference vanishes

3) no normal workload shows the difference - my stress test does
   SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
   on a perfectly correlated table with very narrow rows, i.e. an index scan
   of the whole table, where none of the scan results are ever used. Once one
   actually uses the resulting rows, the performance difference completely
   vanishes.

4) as part of the index prefetching work, we might get the infrastructure to
   actually batch the hint-bit setting in this case too.
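
To make the single-CAS lock-upgrade fast path mentioned above concrete, here
is a minimal standalone sketch (bit positions and names are made up for
illustration; this is not the patch's actual code). Other share lockers are
fine, that being the point of the new lock level, but exclusive and
share-exclusive holders conflict:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define LOCK_VAL_SHARED          (UINT64_C(1) << 32)
    #define LOCK_VAL_SHARE_EXCLUSIVE (UINT64_C(1) << 50)
    #define LOCK_VAL_EXCLUSIVE       (UINT64_C(1) << 51)

    /*
     * Upgrade our own share lock to share-exclusive in one atomic step,
     * failing if a conflicting holder exists.
     */
    static bool
    try_upgrade_sketch(_Atomic uint64_t *state)
    {
        uint64_t    old_state = atomic_load(state);

        if (old_state & (LOCK_VAL_EXCLUSIVE | LOCK_VAL_SHARE_EXCLUSIVE))
            return false;       /* caller just skips setting the hint bits */

        /* drop our share count and add the share-exclusive bit at once */
        return atomic_compare_exchange_strong(state, &old_state,
                                              old_state
                                              - LOCK_VAL_SHARED
                                              + LOCK_VAL_SHARE_EXCLUSIVE);
    }

    int
    main(void)
    {
        _Atomic uint64_t state = LOCK_VAL_SHARED;   /* we hold a share lock */

        return try_upgrade_sketch(&state) ? 0 : 1;
    }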


I see some mild performance gains in workloads like pgbench [-S], but nothing
to write home about. Which I think is about what I would expect - there's a
minor efficiency gain due to the private refcount changes and not having state
tracking for two error recovery mechanisms (lwlocks' and private refcount).

Non-all-visible seqscans do see some performance gain due to 0011, whether the
table is hinted or not. But it's again something that's mostly noticeable in
microbenchmarks, as e.g. tuple deforming or qual evaluation has a much bigger
impact.


> Everything before/after 0008 was rather easy to digest and understand,
> and I found nothing really to call out at this stage.
>
> 0008 is understandable too, it's just sizable. While it is large, I find
> it well laid out and more readable than before.  I gave the locking code
> a good look, it seems correct AFAICT.

I hope so :). The locking logic is largely the same as lwlock.c, with some
exceptions due to the added lock level and differences in error handling
state.

I don't really see a good way to split 0008 unfortunately...  I previously had
split 0012 into four patches (core changes, heapam changes, the rest of the
adaptations to setting hint bits in various places, adding assertions relying on
the different lock levels), but I found it pretty unwieldy from a "comment
management" perspective, because comments that need to be rewritten are
temporarily wrong, or would need to be modified in yet another patch.  But I'm
open to going back to that approach.

I guess I could pull out the addition of UnlockBuffer() and the "redirection"
to it from LockBuffer() into a separate patch.


> Keep going, I'll be happy to dedicate time to testing and digging into
> the commits as you get this into a final state.  I look forward to
> extending/enhancing this code once integrated.

Cool!

I think 0001, 0002, 0003, 0005 and 0009 should be mergeable pretty soon.  I've
some further polishing to do for 0006 and 0007, but I think they could go in
well ahead of the rest.

For 0008, in addition to what's noted in the commit message, I think there
needs to be an additional section in src/backend/storage/buffer/README (or
such), explaining that buffer content locks used to be lwlocks but aren't
anymore for xyz reasons.  I suspect it'd also be good to have a few references
from lwlock.c to bufmgr.c to make sure the code is co-evolved.

For 0012, I think it might make sense to pull some of the changes to
fsm_vacuum_page() out into a separate commit. Basically changing the code to
acquire a lock on the page - I still can't quite believe that somebody thought
it's sane to update the page without even bothering with a share lock.

0014 needs a lot more polishing; there are references to the hint bit / locking
interactions all over. Pretty hard to find all the references :(.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
This email is just a review of some (specified below) of the patches 0001-0007.

On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> 0002: Not really required, but seems like an improvement to me

The commit message says the point is to get compiler warnings for
switch cases that should be exhaustive, but as soon as I looked for a
switch case like that, I see BufferIsLockedByMeInMode()

        switch (mode)
        {
            case BUFFER_LOCK_EXCLUSIVE:
                lw_mode = LW_EXCLUSIVE;
                break;
            case BUFFER_LOCK_SHARE:
                lw_mode = LW_SHARED;
                break;
            default:
                pg_unreachable();
        }

Which makes it impossible to get such a warning. When I add a lock
mode, it specifically doesn't warn when compiling.
However, I'm a big fan of using enums instead of macros when
appropriate, so I have no issue with this change. I just think the
commit message is a bit confusing.
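
As a standalone illustration of the warning being discussed (this is a
simplified version of the enum, not the real definition): compiled with
-Wswitch (implied by -Wall), adding a value to the enum makes the compiler
warn about the unhandled case in mode_name(), but only because the switch has
no default arm; a default: pg_unreachable() suppresses exactly that warning.

    #include <stdio.h>

    typedef enum BufferLockMode
    {
        BUFFER_LOCK_UNLOCK,
        BUFFER_LOCK_SHARE,
        BUFFER_LOCK_EXCLUSIVE,
    } BufferLockMode;

    static const char *
    mode_name(BufferLockMode mode)
    {
        /* no default arm, so -Wswitch can complain about new enum values */
        switch (mode)
        {
            case BUFFER_LOCK_UNLOCK:
                return "unlock";
            case BUFFER_LOCK_SHARE:
                return "share";
            case BUFFER_LOCK_EXCLUSIVE:
                return "exclusive";
        }
        return "unreachable";
    }

    int
    main(void)
    {
        printf("%s\n", mode_name(BUFFER_LOCK_SHARE));
        return 0;
    }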

> 0003: A prerequisite to 0004, pretty boring itself

LGTM.

> 0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
> additional bits yet, though.  Some annoying reformatting required to avoid
> long lines.

I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
macros cast the return value to a uint32. We won't use the extra bits
but we did bother to keep the macro result sized to the field width
before so keeping it uint32 is probably more confusing now that state
is 64 bit.
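
A standalone sketch of the point (illustrative field positions, not
necessarily the real buf_internals.h layout): the macros mask fields out of
the 64-bit state word, and the question is merely what width the result
should be narrowed to.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BUF_REFCOUNT_BITS    18
    #define BUF_REFCOUNT_MASK    ((UINT64_C(1) << BUF_REFCOUNT_BITS) - 1)
    #define BUF_USAGECOUNT_SHIFT BUF_REFCOUNT_BITS
    #define BUF_USAGECOUNT_MASK  (UINT64_C(0xF) << BUF_USAGECOUNT_SHIFT)

    /* the uint32 narrowing is the part under discussion */
    #define BUF_STATE_GET_REFCOUNT(state) \
        ((uint32_t) ((state) & BUF_REFCOUNT_MASK))
    #define BUF_STATE_GET_USAGECOUNT(state) \
        ((uint32_t) (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT))

    int
    main(void)
    {
        uint64_t    state = (UINT64_C(3) << BUF_USAGECOUNT_SHIFT) | 7;

        printf("refcount=%" PRIu32 " usagecount=%" PRIu32 "\n",
               BUF_STATE_GET_REFCOUNT(state),
               BUF_STATE_GET_USAGECOUNT(state));
        return 0;
    }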

Not related to this patch, but I noticed GetBufferDescriptor() calls
for a uint32 and all the callers pretty much pass a signed int —
wonder why it calls for uint32.

> 0005: There already was a wait event class for BUFFERPIN. It seems better to
> make that more general than to implement them separately.

I reviewed and see no issues with the code, but I don't have an
opinion on this wait event naming so maybe you better _wait_ for some
other review ;)

> 0006+0007: This is preparatory work for 0008, but also worthwhile on its
> own. The private refcount stuff does show up in profiles. The reason it's
> related is that without these changes the added information in 0008 makes that
> worse.

I found it slightly confusing that this commit appears to
unnecessarily add the PrivateRefCountData struct (given that it
doesn't need it to do the new parallel array thing). You could wait
until you need it in 0008, but 0008 is big as it is, so it probably is
fine where it is.

in InitBufferManagerAccess(), why do you have

memset(&PrivateRefCountArrayKeys, 0, sizeof(Buffer));
seems like it should be
memset(PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));

I wonder how easy it will be to keep the Buffer in sync between
PrivateRefCountArrayKeys and the PrivateRefCountEntry — would a helper
function help?
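
Something like this hypothetical shape (struct layout and names invented just
to show the idea) would at least give the sync a single home:

    #include <stdint.h>

    typedef int Buffer;

    typedef struct PrivateRefCountEntry
    {
        Buffer      buffer;
        int32_t     refcount;
    } PrivateRefCountEntry;

    #define REFCOUNT_ARRAY_ENTRIES 8

    static Buffer PrivateRefCountArrayKeys[REFCOUNT_ARRAY_ENTRIES];
    static PrivateRefCountEntry PrivateRefCountArray[REFCOUNT_ARRAY_ENTRIES];

    /* the only place that assigns a slot, so key and entry cannot drift */
    static void
    AssignPrivateRefCountSlot(int slot, Buffer buffer)
    {
        PrivateRefCountArrayKeys[slot] = buffer;
        PrivateRefCountArray[slot].buffer = buffer;
        PrivateRefCountArray[slot].refcount = 0;
    }

    int
    main(void)
    {
        AssignPrivateRefCountSlot(0, 42);
        return PrivateRefCountArrayKeys[0] == PrivateRefCountArray[0].buffer ? 0 : 1;
    }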

ForgetPrivateRefCountEntry doesn’t clear the data member — but maybe
it doesn’t matter...

in ReservePrivateRefCountEntry() there is a superfluous clear

memset(&victim_entry->data, 0, sizeof(victim_entry->data));
victim_entry->data.refcount = 0;

0007
needs a commit message. overall seems fine though.
You should probably capitalize the "c" of "count" in
PrivateRefcountEntryLast to be consistent with the other names.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
>     effectively harmless.

Does anybody have opinions about whether to backpatch this fix? Given that it
has no real consequences I'm mildly inclined not to, but maybe there are cases
where the additional wait list lock cycle matters?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
On Mon, Nov 24, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> > 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
> >       effectively harmless.
>
> Does anybody have opinions about whether to backpatch this fix? Given that it
> has no real consequences I'm mildly inclined not to, but maybe there are cases
> where the additional wait list lock cycle matters?

Since it is a mistake, I am mildly in favor of backporting to avoid
confusion for future developers. It's pretty weird that LWLockWakeup()
has to be called again to actually unset LW_FLAG_HAS_WAITERS. But
since it's not really harmful, this is a very mild opinion.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Thomas Munro
Date:
On Fri, Nov 21, 2025 at 9:51 AM Andres Freund <andres@anarazel.de> wrote:
> It's worth pointing out that the new way of setting hint bits is inherently
> more expensive than what we did before - upgrading a lock to a different lock
> level isn't free, compared to doing, well, nothing.
>
> For paths that set the hint bits of a whole page, like a seqscan, that cost is
> more than amortized by the batched approach introduced in 0011. Those get
> faster with the patch, both when already hinted and when not.

Nice work!

> However, there are paths that aren't easily amenable to that approach, like
> e.g. an ordered index scan referencing unhinted tuples. There we only ever
> access a single tuple and release the upgraded lock after every tuple. If the
> index scan is perfectly correlated with the table and every tuple is unhinted,
> that's a decent amount of additional work.

Yeah, but it was only faster because it was cheating.  It presumably
doesn't happen when you bulk load and then create index.  It
presumably does happen when you insert a lot of data in order, on
first correlated index scan.  Seems like an inherent limitation of the
current tuple-at-a-time architecture when combined with the *required*
interlocking, and not a blocker for this work.

+ Some filesystems, raid implementations, ... do not tolerate the data being

I was aware of BTRFS (EIO on read) and ZFS 2.4 (EIO on read or write
depending on configuration option), but hadn't thought about RAID.
Ugh, right, non-matching RAID1 mirrors (and I guess also b0rked RAID5
parity bits?).  Fun.

https://bugzilla.kernel.org/show_bug.cgi?id=99171

> I've spent a lot of time micro-optimizing that workload, to avoid any
> significant regressions. An extreme stress-test started out being about 20%
> slower than today; as of my current local version, it's a bit faster (~1%) on
> one of my machines and a bit slower (~2%) on another. Partially that was
> achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
> path for updating a single hint bit, avoiding redundant reads of the lock
> state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
> locking code.  The latter is a bit of a cheat though - things would be even
> faster if we went with the old way of setting hint bits, but with the
> independent optimizations applied.
>
> I think that's ok though:
>
> 1) the old way of setting hint bits is a pretty dirty hack that causes issues
>    in quite a few places.
>
> 2) by definition, having to set hint bits is an ephemeral state, once the hint
>    bits are set, the difference vanishes
>
> 3) no normal workload shows the difference - my stress test does
>    SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
>    on a perfectly correlated table with very narrow rows, i.e. an index scan
>    of the whole table, where none of the scan results are ever used. Once one
>    actually uses the resulting rows, the performance difference completely
>    vanishes.
>
> 4) as part of the index prefetching work, we might get the infrastructure to
>    actually batch the hint-bit setting in this case too.

Yeah.  Was just thinking the same.  Both the streaming and batching
projects have opportunities to figure out an amortisation scheme.  I
have a few vague ideas about stream-based approaches already, hmm...

+1, I think this is OK for now.



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

On 2025-11-24 16:04:41 -0500, Melanie Plageman wrote:
> On Mon, Nov 24, 2025 at 3:58 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-11-19 21:47:49 -0500, Andres Freund wrote:
> > > 0001: A straight-up bugfix in lwlock.c - albeit for a bug that seems currently
> > >       effectively harmless.
> >
> > Does anybody have opinions about whether to backpatch this fix? Given that it
> > has no real consequences I'm mildly inclined not to, but maybe there are cases
> > where the additional wait list lock cycle matters?
> 
> Since it is a mistake, I am mildly in favor of backporting to avoid
> confusion for future developers. It's pretty weird that LWLockWakeup()
> has to be called again to actually unset LW_FLAG_HAS_WAITERS. But
> since it's not really harmful, this is a very mild opinion.

Thanks for chiming in, done.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Melanie Plageman
Date:
On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
>
> 0008: The main change. Implements buffer content locking independently from
> lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> this, but I've not found a good way to reduce the duplication without giving
> up too much.  This patch does immediately introduce share-exclusive as a new
> lock level, mostly because it was too painful to do separately.
>
> 0009+0010+0011: Preparatory work for 0012.
>
> 0012: One of the main goals of this patchset - use the new share-exclusive
> lock level to only allow hint bits to be set while no IO is going on.

Below is a review of 0008-0012
I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)

0008:
-------
> [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
 of lwlocks
...
> This commit unfortunately introduces some code that is very similar to the
> code in lwlock.c, however the code is not equivalent enough to easily merge
> it. The future wins that this commit makes possible seem worth the cost.

It is a truly unfortunate amount of duplication. I tried some
refactoring myself just to convince myself it wasn't a good idea.
However, ISTM there is no reason for the lwlock and buffer locking
implementations to have to stay in sync. So they can diverge as
features are added and perhaps the duplication won't be as conspicuous
in the future.

> As of this commit nothing uses the new share-exclusive lock mode. It will be
>documented and used in a future commit. It seemed too complicated to introduce
> the lock-level in a separate commit.

I would have liked this mentioned earlier in the commit message. Also,
I don't know how I feel about it being "documented" in a future
commit...perhaps just don't say that.

diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 28519ad2813..0a145d95024 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -32,22 +33,29 @@
 /*
  * Buffer state is a single 64-bit variable where following data is combined.
  *
+ * State of the buffer itself:
  * - 18 bits refcount
  * - 4 bits usage count
  * - 10 bits of flags
  *
+ * State of the content lock:
+ * - 1 bit has_waiter
+ * - 1 bit release_ok
+ * - 1 bit lock state locked

Somewhere you should clearly explain the scenarios in which you still
need to take the buffer header lock with LockBufHdr() vs when you can
just use atomic operations/CAS loops.

Separately, you should clearly explain what BM_LOCKED (I presume what
"lock state locked" refers to) protects.

+/*
+ * Definitions related to buffer content locks
+ */
+#define BM_LOCK_HAS_WAITERS         (UINT64CONST(1) << 63)
+#define BM_LOCK_RELEASE_OK          (UINT64CONST(1) << 62)
+
+#define BM_LOCK_VAL_SHARED          (UINT64CONST(1) << 32)
+#define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64CONST(1) << (32 + MAX_BACKENDS_BITS))
+#define BM_LOCK_VAL_EXCLUSIVE       (UINT64CONST(1) << (32 + 1 + MAX_BACKENDS_BITS))
+
+#define BM_LOCK_MASK                (((uint64)MAX_BACKENDS << 32) | BM_LOCK_VAL_SHARE_EXCLUSIVE | BM_LOCK_VAL_EXCLUSIVE)

Not the fault of this patch, but I always found it confusing that the
BM flags were for buffer manager. I always think BM stands for buffer
mapping.
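
Since the layout is easy to get wrong, here is a standalone sanity check of
the quoted bit layout (assuming MAX_BACKENDS_BITS is 18, per the procnumber.h
comment quoted below): the shared counter occupies bits 32..49, the
share-exclusive and exclusive bits sit at 50 and 51, and nothing collides
with the flag bits at 62/63 or the low 32 bits of buffer state.

    #include <assert.h>
    #include <stdint.h>

    #define MAX_BACKENDS_BITS 18
    #define MAX_BACKENDS ((1U << MAX_BACKENDS_BITS) - 1)

    #define BM_LOCK_HAS_WAITERS         (UINT64_C(1) << 63)
    #define BM_LOCK_RELEASE_OK          (UINT64_C(1) << 62)
    #define BM_LOCK_VAL_SHARED          (UINT64_C(1) << 32)
    #define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64_C(1) << (32 + MAX_BACKENDS_BITS))
    #define BM_LOCK_VAL_EXCLUSIVE       (UINT64_C(1) << (32 + 1 + MAX_BACKENDS_BITS))
    #define BM_LOCK_MASK                (((uint64_t) MAX_BACKENDS << 32) | \
                                         BM_LOCK_VAL_SHARE_EXCLUSIVE | \
                                         BM_LOCK_VAL_EXCLUSIVE)

    static_assert((BM_LOCK_MASK &
                   (BM_LOCK_HAS_WAITERS | BM_LOCK_RELEASE_OK)) == 0,
                  "lock values collide with waiter/release flags");
    static_assert((BM_LOCK_MASK & ((UINT64_C(1) << 32) - 1)) == 0,
                  "lock values collide with the low 32 buffer-state bits");

    int
    main(void)
    {
        return 0;
    }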

On a more actionable note: you probably want to update the comment in
procnumber.h in your earlier patch to take out the part about how the
limitation could be lifted using a 64 bit state -- since you made the
state 64 bit and didn't lift the limitation (see below for the
specific comment).

/*
 * Note: MAX_BACKENDS_BITS is 18 as that is the space available for buffer
 * refcounts in buf_internals.h.  This limitation could be lifted by using a
 * 64bit state; but it's unlikely to be worthwhile as 2^18-1 backends exceed
 * currently realistic configurations. Even if that limitation were removed,
 * we still could not a) exceed 2^23-1 because inval.c stores the ProcNumber
 * as a 3-byte signed integer, b) INT_MAX/4 because some places compute
 * 4*MaxBackends without any overflow check.  We check that the configured
 * number of backends does not exceed MAX_BACKENDS in InitializeMaxBackends().
 */

@@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
  * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At present,
  * there can be only one such waiter per buffer.
  *
+ * The content of buffers is protected via the buffer content lock,
+ * implemented as part buffer state. Note that the buffer header lock is *not*
+ * used to control access to the data in the buffer! We used to use an LWLock
+ * to implement the content lock, but having a dedicated implementation of
+ * content locks allows to implement some otherwise hard things (e.g.
+ * race-freely checking if AIO is in progress before locking a buffer
+ * exclusively) and makes otherwise impossible optimizations possible
+ * (e.g. unlocking and unpinning a buffer in one atomic operation).
+ *

I don't really like that this includes a description of how it used to
work. It makes the description more confusing when trying to
understand the current state. And comments like that make most sense
when the current state is confusing because of a long annoying
history.

@@ -302,7 +303,24 @@ extern void BufferGetTag(Buffer buffer,
RelFileLocator *rlocator,
 extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
-extern void LockBuffer(Buffer buffer, BufferLockMode mode);
+extern void UnlockBuffer(Buffer buffer);
+extern void LockBufferInternal(Buffer buffer, BufferLockMode mode);
+
+/*
+ * Handling BUFFER_LOCK_UNLOCK in bufmgr.c leads to sufficiently worse branch
+ * prediction to impact performance. Therefore handle that switch here, were
+ * most of the time `mode` will be a constant and thus can be optimized out by
+ * the compiler.

Typo: were -> where

+ */
+static inline void
+LockBuffer(Buffer buffer, BufferLockMode mode)
+{
+    if (mode == BUFFER_LOCK_UNLOCK)
+        UnlockBuffer(buffer);
+    else
+        LockBufferInternal(buffer, mode);
+}
+

Is there a reason to stick with the LockBuffer(buf,
BUFFER_LOCK_UNLOCK) interface here? It seems like a good time to
start using UnlockBuffer() which I thought most people find more
intuitive.

 /*
- * Acquire or release the content_lock for the buffer.
+ * Acquire the buffer content lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ *
+ * This uses almost the same locking approach as lwlock.c's
+ * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
+ * discussion.
+ *
+ * The reason that this, and most of the other BufferLock* functions, get both
+ * the Buffer and BufferDesc* as parameters, is that looking up one from the
+ * other repeatedly shows up noticeably in profiles.
+ */
+static inline void
+BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)

Either here or over the lock mode enum, I'd spend a few sentences
describing how the buffer lock modes interact -- what conflicts with
what. You spend more time saying how this is like LWLock than
explaining what it actually is.

+/*
+ * Remove ourselves from the waitlist.
+ *
+ * This is used if we queued ourselves because we thought we needed to sleep
+ * but, after further checking, we discovered that we don't actually need to
+ * do so.
+ */
+static void
+BufferLockDequeueSelf(BufferDesc *buf_hdr)
+{
+    bool        on_waitlist;
+
+    LockBufHdr(buf_hdr);
+
...
+    /* XXX: combine with fetch_and above? */
+    UnlockBufHdr(buf_hdr);

I know you just ported this comment from the LWLock implementation but
I can't see how it could be worth the effort. It won't be in the
hottest path as you must satisfy a few conditions to get here.


+ * Wakeup all the lockers that currently have a chance to acquire the lock.
+ */
+static void
+BufferLockWakeup(BufferDesc *buf_hdr, bool unlocked)

This should have a more explanatory comment. You should especially
document the unlocked parameter. I would expect something somewhere in
or above this function that basically says (maybe less verbosely)

- Before waking anything: allow all
- After waking SHARE:
    wake SHARE only
- After waking SHARE_EXCLUSIVE:
    wake SHARE only
    (no more SHARE_EXCLUSIVE)
- After waking EXCLUSIVE:
    wake none

+{
+    bool        new_release_ok;
+    bool        wake_exclusive = unlocked;
+    bool        wake_share_exclusive = true;
+    proclist_head wakeup;
+    proclist_mutable_iter iter;
+
+    proclist_init(&wakeup);
+
+    new_release_ok = true;
+
+    /* lock wait list while collecting backends to wake up */
+    LockBufHdr(buf_hdr);
+
+    proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
+    {
+        PGPROC       *waiter = GetPGProcByNumber(iter.cur);
+
+        if (!wake_exclusive && waiter->lwWaitMode == BUFFER_LOCK_EXCLUSIVE)
+            continue;
+
+        if (!wake_share_exclusive && waiter->lwWaitMode ==
BUFFER_LOCK_SHARE_EXCLUSIVE)
+            continue;
+

It seems there is a difference between LWLockWakeup() and
BufferLockWakeup() if the queue contains a share lock waiter followed
by an exclusive lock waiter.

In LWLockWakeup(), it will wake up the share waiter, set
wokeup_somebody to true, and then not wake up the exclusive lock
waiter because wokeup_somebody is true.

In BufferLockWakeup() when unlocked is true, it will wake up the share
lock waiter and then wake up the exclusive lock waiter because
wake_exclusive and wake_share_exclusive are both still true.

This might be the intended behavior, but it is a difference from
LWLockWakeup(), so I think it is worth documenting why it is okay.


+         * Signal that the process isn't on the wait list anymore. This allows
+         * BufferLockDequeueSelf() to remove itself of the waitlist with a
+         * proclist_delete(), rather than having to check if it has been

I know this comment was ported over, but the "remove itself of the
waitlist" -- the "of" is confusing in the original comment and it is
confusing here.

+ * Compute subtraction from buffer state for a release of a held lock in
+ * `mode`.
+ *
+ * This is separated from BufferLockUnlock() as we want to combine the lock
+ * release with other atomic operations when possible, leading to the lock
+ * release being done in multiple places.
+ */
+static inline uint64
+BufferLockReleaseSub(BufferLockMode mode)

I don't understand why this is a separate function even with your comment.

+ * BufferLockHeldByMe - test whether my process holds the content lock in any
+ * mode
+ *
+ * This is meant as debug support only.
+ */
+static bool
+BufferLockHeldByMe(BufferDesc *buf_hdr)
+{
+    PrivateRefCountEntry *entry =
+        GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
+
+    if (!entry)
+        return false;
+    else
+        return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
+}

Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
an assert when unpinning the buffer about how I needed to have
released the lock.

Now, though, it doesn't trip the assert because we don't track
multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
refcount entry. Whereas LWLockHeldByMe() checked held_lwlocks which
had multiple entries and would report that I still held a lock. Now,
if I call LockBuffer(SHARE) twice, I'll only find out later when I
have a hang because someone is trying to get an exclusive lock and the
actual BufferDesc->state still has a lock set.

Maybe you should add an assert to the lock acquisition path that the
private ref count entry mode is UNLOCK?

 void
-LockBuffer(Buffer buffer, BufferLockMode mode)
+LockBufferInternal(Buffer buffer, BufferLockMode mode)
 {
-    buf = GetBufferDescriptor(buffer - 1);
+    buf_hdr = GetBufferDescriptor(buffer - 1);

-    if (mode == BUFFER_LOCK_UNLOCK)
-        LWLockRelease(BufferDescriptorGetContentLock(buf));
-    else if (mode == BUFFER_LOCK_SHARE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+    if (mode == BUFFER_LOCK_SHARE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
+    else if (mode == BUFFER_LOCK_SHARE_EXCLUSIVE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
     else if (mode == BUFFER_LOCK_EXCLUSIVE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
     else
         elog(ERROR, "unrecognized buffer lock mode: %d", mode);

I presume you've stuck with this if statement structure because that
is what LockBuffer() used. Even though now the BufferLockMode passed
through to BufferLockAcquire is the exact thing you are testing. It
caught my eye and I had to check multiple times if the mode being
passed in is the same. Basically I found it a bit distracting.


@@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
     if (BufferIsLocal(buffer))
         UnpinLocalBufferNoOwner(buffer);
     else
+    {
+        PrivateRefCountEntry *ref;
+
+        ref = GetPrivateRefCountEntry(buffer, false);
+
+        /*
+         * If the buffer was locked at the time of the resowner release,
+         * release the lock now. This should only happen after errors.
+         */
+        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
+        {
+            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
+
+            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
+            BufferLockUnlock(buffer, buf);
+        }
+
         UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
+    }
 }

 Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.

0009 -- I didn't look closely at 0009

0010:
--------
> [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER

> To be able to guarantee that we can set the hint bit, acquire an exclusive
> lock on the old buffer. We need the hint bits to be set as otherwise
> reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
> get confused.

So, this is an active bug? And what exactly do you mean
heap_freeze_tuple() gets confused? I thought somewhere in there we
would check the clog.

0011:
--------
> [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode

> 2) We would like to stop setting hint bits while pages are being written
> out. The necessary locking becomes visible for page mode scans if done for
> every tuple. With batching the overhead can be amortized to only happen
> once per page.

I don't understand the above point. What does this patch have to do
with not setting hint bits while pages are being written out (that
happens in the next patch)? And I presume you mean don't set hint bits
on a buffer that is being flushed by someone else -- but it sounds
like you mean not to set hint bits as part of flushing a buffer.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4b0c49f4bb0..ddabd1a3ec3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
                     BlockNumber block, int lines,
                     bool all_visible, bool check_serializable)
 {
+    Oid            relid = RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+    BatchMVCCState batchmvcc;
+    HeapTupleData *tuples = batchmvcc.tuples;
+    bool       *visible = batchmvcc.visible;
+#else
+    HeapTupleData tuples[MaxHeapTuplesPerPage];
+    bool        visible[MaxHeapTuplesPerPage];
+#endif
     int            ntup = 0;

It's pretty confusing when visible is an output vs input parameter and
who fills it in when. (i.e. it's filled in in page_collect_tuples() if
check_serializable and all_visible are true, otherwise it's filled in
in HeapTupleSatisfiesMVCCBatch())

Personally, I think I'd almost prefer an all-visible and
not-all-visible version of page_collect_tuples() (or helper functions
containing the loop) that separate this. (I haven't tried it though)

BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
which I assume you are removing.

+        /*
+         * If the page is not all-visible or we need to check serializability,
+         * maintain enough state to be able to refind the tuple efficiently,
+         * without again needing to extract it from the page.
+         */
+        if (!all_visible || check_serializable)
+        {

"enough state" is pretty vague here. Maybe mention what it wouldn't be
valid to do with the tuples array?

/*
* If the page is all visible, these fields won'otherwise wont be
* populated in loop below.
*/

Some spelling issues with "won'" and "wont"

+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+                            int ntups,

I don't really get why this (the patch in general) would be
substantially faster.  You still call HeapTupleSatisfiesMVCC() for
each tuple in a loop. The difference is that you've got pointers into
the array of tuples instead of doing PageGetItem(), then calling
HeapTupleSatisfiesMVCC() for each tuple.


0012:
-------

> [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> To address these issue, this commit changes the rules so that modifications to
> pages are not allowed anymore while holding a share lock. Instead the new

In the commit message, you make it sound like you only change the lock
level for setting the hint bits. But that wouldn't solve any problems
if FlushBuffer() could still happen with a share lock. I would try to
make it clear that you change the lock level both for setting the hint
bits and flushing the buffer.

@@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
      */
     for (i = 0; i < so->numKilled; i++)
     {
+        if (!killedsomething)
+        {
+            /*
+             * Use hint bit infrastructure to be allowed to modify the page
+             * without holding an exclusive lock.
+             */
+            if (!BufferBeginSetHintBits(buffer))
+                goto unlock;
+        }
+

I don't understand why this is in the loop. Clearly you want to call
BufferBeginSetHintBits() once, but why would you do it in the loop?

In the comment, I might also note that the lock level will be upgraded
as needed or something, since we only have a share lock; it is
confusing at first

- * SetHintBits()
+ * To be allowed to set hint bits, SetHintBits() needs to call
+ * BufferBeginSetHintBits(). However, that's not free, and some callsites call
+ * SetHintBits() on many tuples in a row. For those it makes sense to amortize
+ * the cost of BufferBeginSetHintBits(). Additionally it's desirable to defer
+ * the cost of BufferBeginSetHintBits() until a hint bit needs to actually be
+ * set. This enum serves as the necessary state space passed to
+ * SetHintbitsExt().
+ */
+typedef enum SetHintBitsState
+{
+   /* not yet checked if hint bits may be set */
+   SHB_INITIAL,
+   /* failed to get permission to set hint bits, don't check again */
+   SHB_DISABLED,
+   /* allowed to set hint bits */
+   SHB_ENABLED,
+} SetHintBitsState;

I dislike the SHB prefix. Perhaps something involving the word hint?
And should the enum name itself (SetHintBitsState) include the word
"batch"? I know that would make it long. At least the comment should
explain that these are needed when batch setting hint bits.
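
To make the intended state machine concrete, a standalone sketch (stub
functions stand in for BufferBeginSetHintBits() and the actual hint-bit
write): the begin cost is paid lazily, and only once, for a whole page worth
of tuples.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum
    {
        SHB_INITIAL,            /* not yet checked if hint bits may be set */
        SHB_DISABLED,           /* failed to get permission, don't retry */
        SHB_ENABLED,            /* allowed to set hint bits */
    } SetHintBitsState;

    static bool
    begin_set_hint_bits(void)
    {
        return true;            /* pretend the lock upgrade succeeded */
    }

    static void
    set_one_hint(int tup, SetHintBitsState *state)
    {
        if (*state == SHB_INITIAL)
            *state = begin_set_hint_bits() ? SHB_ENABLED : SHB_DISABLED;
        if (*state == SHB_ENABLED)
            printf("hinted tuple %d\n", tup);
    }

    int
    main(void)
    {
        SetHintBitsState state = SHB_INITIAL;

        for (int tup = 0; tup < 3; tup++)
            set_one_hint(tup, &state);

        /* the real code calls BufferFinishSetHintBits() if SHB_ENABLED */
        return 0;
    }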

+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+               uint16 infomask, TransactionId xid, SetHintBitsState *state)
 {
     if (TransactionIdIsValid(xid))
     {
-        /* NB: xid must be known committed here! */
-        XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
+        if (BufferIsPermanent(buffer))
+        {

I really wish there was a way to better pull apart the batch and
non-batch cases in a way that could allow the below block to be in a
helper do_set_hint() (or whatever) which SetHintBitsExt() and
SetHintBits() called. And then you inlined BufferSetHintBits16().

I presume you didn't do this because HeapTupleSatisfiesMVCC() for
SNAPSHOT_MVCC calls the non-batch version (and, of course,
HeapTupleSatisfiesVisibility() is the much more common case).

if (TransactionIdIsValid(xid))
{
        if (BufferIsPermanent(buffer))
        {
                /* NB: xid must be known committed here! */
                XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);

                if (XLogNeedsFlush(commitLSN) &&
                        BufferGetLSNAtomic(buffer) < commitLSN)
                {
                        /* not flushed and no LSN interlock, so don't set hint */
                        return false;
                }
        }
}

Separately, I was thinking, should we assert here about having the
right lock type?

 * It is only safe to set a transaction-committed hint bit if we know the
 * transaction's commit record is guaranteed to be flushed to disk before the
 * buffer, or if the table is temporary or unlogged and will be obliterated by
 * a crash anyway.  We cannot change the LSN of the page here, because we may
 * hold only a share lock on the buffer, so we can only use the LSN to
 * interlock this if the buffer's LSN already is newer than the commit LSN;
 * otherwise we have to just refrain from setting the hint bit until some
 * future re-examination of the tuple.
 *

Should this say "we may hold only a share exclusive lock on the
buffer". Also what is "this" in "only use the LSN to interlock this"?

@@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
Buffer buffer,
+    if (state == SHB_ENABLED)
+        BufferFinishSetHintBits(buffer, true, true);
+
     return nvis;
 }

I wondered if it would be more natural for BufferBeginSetHintBits()
and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
instead of having callers do it. But, I guess you don't do this
because of gist and hash indexes using this for doing their own
modifications. It seems like it also would help make it clear that
BufferBegin and BufferFinish are for batch mode.

I can't help but feel a bit of awkwardness in the whole API between
this and the way you've supported non-batch mode in SetHintBitsExt().
But it's easy for me to criticize without providing concrete ideas of
how to reorganize it.

+static inline bool
+SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *lockstate)
+{
+   uint64      old_state;
+   PrivateRefCountEntry *ref;
+   BufferLockMode mode;

These functions probably ought to have comments. And I wonder if you
should say anything (or even assert anything) about how IO better not
be in progress (though it is enforced by the locks).

+        /*
+         * Can't upgrade if somebody else holds the lock in exlusive or
+         * share-exclusive mode.
+         */

typo: exlusive -> exclusive

-4. It is considered OK to update tuple commit status bits (ie, OR the
-values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+4. Non-critical information on a page ("hint bits") may be modified while
+holding only a share-exclusive lock and pin on the page. To do so in cases
+where only a share lock is already held, use BufferBeginSetHintBits() &
+BufferFinishSetHintBits() (if multiple hint bits are to be set) or
+BufferSetHintBits16() (if a single hit bit is set).

typo: single hit -> single hint
Also, you should probably say that you use BufferBeginSetHintBits() to
actually upgrade the lock.

I also don't see anywhere in the README where flushing a buffer is
mentioned -- and what lock level is needed for that in different
situations. It kind of feels like it is worth mentioning.

-   if ((pg_atomic_read_u64(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED)) !=
-       (BM_DIRTY | BM_JUST_DIRTIED))
+   if (unlikely((lockstate & (BM_DIRTY | BM_JUST_DIRTIED)) !=
+                (BM_DIRTY | BM_JUST_DIRTIED)))

I don't quite understand why you do the atomic read in the call to
MarkSharedBufferDirtyHint() from MarkBufferDirtyHint() instead of at
the top of MarkSharedBufferDirtyHint() -- and then you wouldn't have
to pass the lockstate as a parameter, right?

--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
    max_avail = fsm_get_max_avail(page);
    /*
-    * Reset the next slot pointer. This encourages the use of low-numbered
-    * pages, increasing the chances that a later vacuum can truncate the
-    * relation.  We don't bother with a lock here, nor with marking the page
-    * dirty if it wasn't already, since this is just a hint.
+    * Try to reset the next slot pointer. This encourages the use of
+    * low-numbered pages, increasing the chances that a later vacuum can
+    * truncate the relation.  We don't bother with a lock here, nor with
+    * marking the page dirty if it wasn't already, since this is just a hint.
+    *
+    * To be allowed to update the page without an exclusive lock, we have to
+    * use the hint bit infrastructure.
     */

What the heck? This didn't even take a share lock before...

diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freespace/fsmpage.c
index 66a5c80b5a6..a59696b6484 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -298,9 +298,18 @@ restart:
     * lock and get a garbled next pointer every now and then, than take the
     * concurrency hit of an exclusive lock.
     *
+    * Without an exclusive lock, we need to use the hint bit infrastructure
+    * to be allowed to modify the page.
+    *

Is the sentence above this still correct?

/*
* Update the next-target pointer. Note that we do this even if we're only
* holding a shared lock, on the grounds that it's better to use a shared
* lock and get a garbled next pointer every now and then, than take the
* concurrency hit of an exclusive lock.

We appear to avoid the garbling now?

In general on 0012, I didn't spend much time checking if you caught
all the places where we mention our hint bit hackery (only taking the
share lock). But those can always be caught later as we inevitably
encounter them.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From:
Andres Freund
Date:
Hi,

Thanks a lot for that detailed review!  A few questions and comments, before I
try to address the comments in the next version.


On 2025-11-25 10:44:31 -0500, Melanie Plageman wrote:
> On Wed, Nov 19, 2025 at 9:47 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > 0008: The main change. Implements buffer content locking independently from
> > lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> > this, but I've not found a good way to reduce the duplication without giving
> > up too much.  This patch does immediately introduce share-exclusive as a new
> > lock level, mostly because it was too painful to do separately.
> >
> > 0009+0010+0011: Preparatory work for 0012.
> >
> > 0012: One of the main goals of this patchset - use the new share-exclusive
> > lock level to only allow hint bits to be set while no IO is going on.
> 
> Below is a review of 0008-0012
> I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)

I left that in there to make the comparison easier. But clearly that's not to
be committed...


> 0008:
> -------
> > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
>  of lwlocks
> ...
> > This commit unfortunately introduces some code that is very similar to the
> > code in lwlock.c, however the code is not equivalent enough to easily merge
> > it. The future wins that this commit makes possible seem worth the cost.
> 
> It is a truly unfortunate amount of duplication. I tried some
> refactoring myself just to convince myself it wasn't a good idea.

Without success, I guess?


> However, ISTM there is no reason for the lwlock and buffer locking
> implementations to have to stay in sync. So they can diverge as
> features are added and perhaps the duplication won't be as conspicuous
> in the future.

Right - I'd expect further divergence.


> > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > documented and used in a future commit. It seemed too complicated to introduce
> > the lock-level in a separate commit.
> 
> I would have liked this mentioned earlier in the commit message.

Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
because it doesn't just lead to adding code, but to changing code.


> Also, I don't know how I feel about it being "documented" in a future
> commit...perhaps just don't say that.

Hm, why?


> diff --git a/src/include/storage/buf_internals.h
> b/src/include/storage/buf_internals.h
> index 28519ad2813..0a145d95024 100644
> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -32,22 +33,29 @@
>  /*
>   * Buffer state is a single 64-bit variable where following data is combined.
>   *
> + * State of the buffer itself:
>   * - 18 bits refcount
>   * - 4 bits usage count
>   * - 10 bits of flags
>   *
> + * State of the content lock:
> + * - 1 bit has_waiter
> + * - 1 bit release_ok
> + * - 1 bit lock state locked
> 
> Somewhere you should clearly explain the scenarios in which you still
> need to take the buffer header lock with LockBufHdr() vs when you can
> just use atomic operations/CAS loops.

There is an explanation of that, or at least my attempt at it ;). See the
documentation for BufferDesc.


> @@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
>   * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At present,
>   * there can be only one such waiter per buffer.
>   *
> + * The content of buffers is protected via the buffer content lock,
> + * implemented as part buffer state. Note that the buffer header lock is *not*
> + * used to control access to the data in the buffer! We used to use an LWLock
> + * to implement the content lock, but having a dedicated implementation of
> + * content locks allows to implement some otherwise hard things (e.g.
> + * race-freely checking if AIO is in progress before locking a buffer
> + * exclusively) and makes otherwise impossible optimizations possible
> + * (e.g. unlocking and unpinning a buffer in one atomic operation).
> + *
> 
> I don't really like that this includes a description of how it used to
> work. It makes the description more confusing when trying to
> understand the current state. And comments like that make most sense
> when the current state is confusing because of a long annoying
> history.

I generally agree with that - however, in this case it seemed like folks, in
the future, might actually be wondering why this isn't using lwlocks.


> + */
> +static inline void
> +LockBuffer(Buffer buffer, BufferLockMode mode)
> +{
> +    if (mode == BUFFER_LOCK_UNLOCK)
> +        UnlockBuffer(buffer);
> +    else
> +        LockBufferInternal(buffer, mode);
> +}
> +
> 
> Is there a reason to stick with the LockBuffer(buf,
> BUFFER_LOCK_UNLOCK) interface here? It seems like a good time to
> start using UnlockBuffer() which I thought most people find more
> intuitive.

There are like 200 uses of BUFFER_LOCK_UNLOCK|GIN_UNLOCK|GIST_UNLOCK. And
probably a lot of external ones.  I'm not against using UnlockBuffer()
directly in the future, but changing all that code as part of this change
doesn't quite seem to make sense.


>  /*
> - * Acquire or release the content_lock for the buffer.
> + * Acquire the buffer content lock in the specified mode
> + *
> + * If the lock is not available, sleep until it is.
> + *
> + * Side effect: cancel/die interrupts are held off until lock release.
> + *
> + * This uses almost the same locking approach as lwlock.c's
> + * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
> + * discussion.
> + *
> + * The reason that this, and most of the other BufferLock* functions, get both
> + * the Buffer and BufferDesc* as parameters, is that looking up one from the
> + * other repeatedly shows up noticeably in profiles.
> + */
> +static inline void
> +BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)
> 
> Either here or over the lock mode enum, I'd spend a few sentences
> describing how the buffer lock modes interact -- what conflicts with
> what. You spend more time saying how this is like LWLock than
> explaining what it actually is.

Historically they're just explained in the README. But yea, it makes sense to
add to the enum. I like adding it to the enum better than to the function, as
there are multiple ways to acquire a lock.


> +{
> +    bool        new_release_ok;
> +    bool        wake_exclusive = unlocked;
> +    bool        wake_share_exclusive = true;
> +    proclist_head wakeup;
> +    proclist_mutable_iter iter;
> +
> +    proclist_init(&wakeup);
> +
> +    new_release_ok = true;
> +
> +    /* lock wait list while collecting backends to wake up */
> +    LockBufHdr(buf_hdr);
> +
> +    proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
> +    {
> +        PGPROC       *waiter = GetPGProcByNumber(iter.cur);
> +
> +        if (!wake_exclusive && waiter->lwWaitMode == BUFFER_LOCK_EXCLUSIVE)
> +            continue;
> +
> +        if (!wake_share_exclusive && waiter->lwWaitMode ==
> BUFFER_LOCK_SHARE_EXCLUSIVE)
> +            continue;
> +
> 
> It seems there is a difference between LWLockWakeup() and
> BufferLockWakeup() if the queue contains a share lock waiter followed
> by an exclusive lock waiter.
> 
> In LWLockWakeup(), it will wake up the share waiter, set
> wokeup_somebody to true, and then not wake up the exclusive lock
> waiter because wokeup_somebody is true.
> 
> In BufferLockWakeup() when unlocked is true, it will wake up the share
> lock waiter and then wake up the exclusive lock waiter because
> wake_exclusive and wake_share_exclusive are both still true.
> 
> This might be the intended behavior, but it is a difference from
> LWLockWakeup(), so I think it is worth documenting why it is okay.

Yea, that doesn't look quite right. I think I just whacked the code around too
much and somewhere lost that branch.


> 
> +         * Signal that the process isn't on the wait list anymore. This allows
> +         * BufferLockDequeueSelf() to remove itself of the waitlist with a
> +         * proclist_delete(), rather than having to check if it has been
> 
> I know this comment was ported over, but the "remove itself of the
> waitlist" -- the "of" is confusing in the original comment and it is
> confusing here.

As in s/of/from/?


> + * Compute subtraction from buffer state for a release of a held lock in
> + * `mode`.
> + *
> + * This is separated from BufferLockUnlock() as we want to combine the lock
> + * release with other atomic operations when possible, leading to the lock
> + * release being done in multiple places.
> + */
> +static inline uint64
> +BufferLockReleaseSub(BufferLockMode mode)
> 
> I don't understand why this is a separate function even with your comment.

Because there are operations where we want to unlock the buffer as well as do
something else. E.g. in 0013, UnlockReleaseBuffer() we want to unlock the
buffer and decrease the refcount in one atomic operation. For that we need to
know what to subtract from the state variable for the lock portion - hence
BufferLockReleaseSub().
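
As a standalone illustration of that (layout invented, not the patch's code):
knowing the subtraction constant for the held lock mode lets a caller fold
the unlock and the unpin into one atomic operation.

    #include <stdatomic.h>
    #include <stdint.h>

    #define REFCOUNT_ONE    UINT64_C(1)             /* pin count in low bits */
    #define LOCK_VAL_SHARED (UINT64_C(1) << 32)     /* one share-lock holder */

    /* release the content lock and the pin with a single fetch_sub */
    static void
    unlock_release_sketch(_Atomic uint64_t *state, uint64_t lock_release_sub)
    {
        atomic_fetch_sub(state, lock_release_sub + REFCOUNT_ONE);
    }

    int
    main(void)
    {
        _Atomic uint64_t state = LOCK_VAL_SHARED + REFCOUNT_ONE;

        /* lock_release_sub is what BufferLockReleaseSub() would compute */
        unlock_release_sketch(&state, LOCK_VAL_SHARED);
        return atomic_load(&state) == 0 ? 0 : 1;
    }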
 

> + * BufferLockHeldByMe - test whether my process holds the content lock in any
> + * mode
> + *
> + * This is meant as debug support only.
> + */
> +static bool
> +BufferLockHeldByMe(BufferDesc *buf_hdr)
> +{
> +    PrivateRefCountEntry *entry =
> +        GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false);
> +
> +    if (!entry)
> +        return false;
> +    else
> +        return entry->data.lockmode != BUFFER_LOCK_UNLOCK;
> +}
> 
> Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
> calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
> an assert when unpinning the buffer about how I needed to have
> released the lock.

Right, that's important to keep that way.


> Now, though, it doesn't trip the assert because we don't track
> multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
> BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
> refcount entry.

That part seems right.


> Whereas LWLockHeldByMe() checked held_lwlocks which had multiple entries and
> would report that I still held a lock.

It'd be much better if we had detected that redundant lock acquisition; that's
not legal...


> Now, if I call LockBuffer(SHARE) twice, I'll only find out later when I have
> a hang because someone is trying to get an exclusive lock and the actual
> BufferDesc->state still has a lock set.
> 
> Maybe you should add an assert to the lock acquisition path that the
> private ref count entry mode is UNLOCK?

Yes, we should...

It'd be nice if we had a decent way to test things that we expect to
crash. Like repeated buffer lock acquisitions...

>  void
> -LockBuffer(Buffer buffer, BufferLockMode mode)
> +LockBufferInternal(Buffer buffer, BufferLockMode mode)
>  {
> -    buf = GetBufferDescriptor(buffer - 1);
> +    buf_hdr = GetBufferDescriptor(buffer - 1);
> 
> -    if (mode == BUFFER_LOCK_UNLOCK)
> -        LWLockRelease(BufferDescriptorGetContentLock(buf));
> -    else if (mode == BUFFER_LOCK_SHARE)
> -        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
> +    if (mode == BUFFER_LOCK_SHARE)
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
> +    else if (mode == BUFFER_LOCK_SHARE_EXCLUSIVE)
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
>      else if (mode == BUFFER_LOCK_EXCLUSIVE)
> -        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
> +        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
>      else
>          elog(ERROR, "unrecognized buffer lock mode: %d", mode);
> 
> I presume you've stuck with this if statement structure because that
> is what LockBuffer() used.
>
> Even though now the BufferLockMode passed
> through to BufferLockAcquire is the exact thing you are testing.

Ah. Using explicit constants allows the compiler to do constant propagation
(so e.g. BufferLockAttempt() doesn't have runtime branches on mode), which
turns out to generate more efficient code.  I'll add a comment to that effect.
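
Roughly this, as an illustration (the helper names here are made up):

    /*
     * Because each call site passes a literal mode, the inlined callee's
     * branches on `mode` constant-fold away; forwarding the runtime
     * `mode` variable through a single call would keep them.
     */
    static inline bool
    BufferLockAttemptSketch(BufferDesc *buf_hdr, BufferLockMode mode)
    {
        if (mode == BUFFER_LOCK_SHARE)
            return TryLockShared(buf_hdr);      /* hypothetical helper */
        else
            return TryLockExclusive(buf_hdr);   /* hypothetical helper */
    }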


> @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
>      if (BufferIsLocal(buffer))
>          UnpinLocalBufferNoOwner(buffer);
>      else
> +    {
> +        PrivateRefCountEntry *ref;
> +
> +        ref = GetPrivateRefCountEntry(buffer, false);
> +
> +        /*
> +         * If the buffer was locked at the time of the resowner release,
> +         * release the lock now. This should only happen after errors.
> +         */
> +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> +        {
> +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> +
> +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> +            BufferLockUnlock(buffer, buf);
> +        }
> +
>          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> +    }
>  }
> 
>  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.

Do you have a better suggestion? I'll add a comment that makes that explicit,
but other than that I don't have a great idea. Renaming the whole buffer pin
mechanism seems pretty noisy.


> 0009 -- I didn't look closely at 0009

That's hopefully just a boring code move...


> 0010:
> --------
> > [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER
> 
> > To be able to guarantee that we can set the hint bit, acquire an exclusive
> > lock on the old buffer. We need the hint bits to be set as otherwise
> > reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple() will
> > get confused.
> 
> So, this is an active bug?

I don't think so - we have an AEL on the relation at that point, so nobody
else can access the buffer, aside from checkpointer writing it out. Which
doesn't currently block hint bits from being set.


> And what exactly do you mean heap_freeze_tuple() gets confused? I thought
> somewhere in there we would check the clog.

heap_freeze_tuple()->heap_prepare_freeze_tuple() assumes that:
 * It is assumed that the caller has checked the tuple with
 * HeapTupleSatisfiesVacuum() and determined that it is not HEAPTUPLE_DEAD
 * (else we should be removing the tuple, not freezing it).

In the new world, if HTSV were not allowed to set the hint bit, e.g. because
the page is in the process of being written out, there wouldn't be a guarantee
that hints were set.  Which then leads to confusion, because some of the code
assumes that hint bits were already set.  TBH, I don't quite remember the
precise details, it's been a while (this was already part of an earlier
attempt at dealing with the hint bit stuff).  I'll try to reconstruct.


> 0011:
> --------
> > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode
> 
> > 2) We would like to stop setting hint bits while pages are being written
> > out. The necessary locking becomes visible for page mode scans if done for
> > every tuple. With batching the overhead can be amortized to only happen
> > once per page.
> 
> I don't understand the above point. What does this patch have to do
> with not setting hint bits while pages are being written out (that
> happens in the next patch)?

It's just an explanation for why we want to use batch mode. Without the
changed locking model the reason for that is a lot less clear.

> And I presume you mean don't set hint bits on a buffer that is being flushed
> by someone else -- but it sounds like you mean not to set hint bits as part
> of flushing a buffer.

Right, I do mean the former.


> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
> index 4b0c49f4bb0..ddabd1a3ec3 100644
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
>                      BlockNumber block, int lines,
>                      bool all_visible, bool check_serializable)
>  {
> +    Oid            relid = RelationGetRelid(scan->rs_base.rs_rd);
> +#ifdef BATCHMVCC_FEWER_ARGS
> +    BatchMVCCState batchmvcc;
> +    HeapTupleData *tuples = batchmvcc.tuples;
> +    bool       *visible = batchmvcc.visible;
> +#else
> +    HeapTupleData tuples[MaxHeapTuplesPerPage];
> +    bool        visible[MaxHeapTuplesPerPage];
> +#endif
>      int            ntup = 0;
> 
> It's pretty confusing when visible is an output vs input parameter and
> who fills it in when. (i.e. it's filled in in page_collect_tuples() if
> check_serializable and all_visible are true, otherwise it's filled in
> in HeapTupleSatisifiesMVCCBatch())

I don't really see a good alternative.


> Personally, I think I'd almost prefer an all-visible and
> not-all-visible version of page_collect_tuples() (or helper functions
> containing the loop) that separate this. (I haven't tried it though)

I have a hard time seeing how that comes out better. Where would you make the
switch between the different functions and how would that lead to easier to
understand code?


> BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
> which I assume you are removing.

Yea, I only left it in so others perhaps can reproduce the performance effect
of needing BATCHMVCC_FEWER_ARGS.


> +        /*
> +         * If the page is not all-visible or we need to check serializability,
> +         * maintain enough state to be able to refind the tuple efficiently,
> +         * without again needing to extract it from the page.
> +         */
> +        if (!all_visible || check_serializable)
> +        {
> 
> "enough state" is pretty vague here.

I don't really follow. Wouldn't going into more detail just restate the code
in a comment?


> Maybe mention what it wouldn't be valid to do with the tuples array?

Hm? The tuples array *is* that state?



> + * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_dense
> + * is set to contain the offsets of visible tuples.
> + *
> + * Returns the number of visible tuples.
> + */
> +int
> +HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
> +                            int ntups,
> 
> I don't really get why this (the patch in general) would be
> substantially faster.  You still call HeapTupleSatisfiesMVCC() for
> each tuple in a loop. The difference is that you've got pointers into
> the array of tuples instead of doing PageGetItem(), then calling
> HeapTupleSatisfiesMVCC() for each tuple.

There are a few reasons (see the sketch after the list):

1) Most importantly, in the batched world, we only need to check if we can set
   hint bits once. That check is far from free. We also only need to mark the
   buffer dirty once.

2) HeapTupleSatisfiesMVCCBatch() can inline HeapTupleSatisfiesMVCC() and avoid
   some redundant work, e.g. it doesn't have to set up a stack frame each
   time.

3) A loop over the tuples in heapam.c needs an external function call to
   HeapTupleSatisfiesMVCC(), as that's in heapam_visibility.c. That function
   call quickly shows up.
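
Schematically, with the per-tuple helper name invented for the sketch, the
buffer-level work sits outside the per-tuple loop:

    static int
    HeapTupleSatisfiesMVCCBatchSketch(Snapshot snapshot, Buffer buffer,
                                      int ntups, HeapTupleData *tuples,
                                      bool *visible,
                                      OffsetNumber *vistuples_dense)
    {
        SetHintBitsState state = SHB_INITIAL;   /* shared across tuples */
        int         nvis = 0;

        for (int i = 0; i < ntups; i++)
        {
            /* inlined visibility check; hypothetical name */
            visible[i] = HeapTupleSatisfiesMVCCOne(&tuples[i], snapshot,
                                                   buffer, &state);
            if (visible[i])
                vistuples_dense[nvis++] =
                    ItemPointerGetOffsetNumber(&tuples[i].t_self);
        }

        /* pay the hint-bit bookkeeping once per page, not per tuple */
        if (state == SHB_ENABLED)
            BufferFinishSetHintBits(buffer, true, true);

        return nvis;
    }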


> 
> 0012:
> -------
> 
> > [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> > To address these issue, this commit changes the rules so that modifications to
> > pages are not allowed anymore while holding a share lock. Instead the new
> 
> In the commit message, you make it sound like you only change the lock
> level for setting the hint bits. But that wouldn't solve any problems
> if FlushBuffer() could still happen with a share lock. I would try to
> make it clear that you change the lock level both for setting the hint
> bits and flushing the buffer.

Good point.


> @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
>       */
>      for (i = 0; i < so->numKilled; i++)
>      {
> +        if (!killedsomething)
> +        {
> +            /*
> +             * Use hint bit infrastructure to be allowed to modify the page
> +             * without holding an exclusive lock.
> +             */
> +            if (!BufferBeginSetHintBits(buffer))
> +                goto unlock;
> +        }
> +
> 
> I don't understand why this is in the loop. Clearly you want to call
> BufferBeginSetHintBits() once, but why would you do it in the loop?

Why would we want to continue if we can't set hint bits?


> In the comment, I might also note that the lock level will be upgraded
> as needed or something since we only have a share lock, it is
> confusing at first

I don't think copying the way this works into all the callers of
BufferBeginSetHintBits() is a good idea. That just makes it harder to adjust
down the road.


> - * SetHintBits()
> + * To be allowed to set hint bits, SetHintBits() needs to call
> + * BufferBeginSetHintBits(). However, that's not free, and some callsites call
> + * SetHintBits() on many tuples in a row. For those it makes sense to amortize
> + * the cost of BufferBeginSetHintBits(). Additionally it's desirable to defer
> + * the cost of BufferBeginSetHintBits() until a hint bit needs to actually be
> + * set. This enum serves as the necessary state space passed to
> + * SetHintbitsExt().
> + */
> +typedef enum SetHintBitsState
> +{
> +   /* not yet checked if hint bits may be set */
> +   SHB_INITIAL,
> +   /* failed to get permission to set hint bits, don't check again */
> +   SHB_DISABLED,
> +   /* allowed to set hint bits */
> +   SHB_ENABLED,
> +} SetHintBitsState;
> 
> I dislike the SHB prefix. Perhaps something involving the word hint?

Shrug. It's a very locally used enum, I don't think it matters terribly.


> And should the enum name itself (SetHintBitsState) include the word
> "batch"? I know that would make it

> long. At least the comment should explain that these are needed when batch
> setting hint bits.

Isn't that what the comment does, explaining that we want to amortize the cost
of BufferBeginSetHintBits()?


> +SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
> +               uint16 infomask, TransactionId xid, SetHintBitsState *state)
>  {
>      if (TransactionIdIsValid(xid))
>      {
> -        /* NB: xid must be known committed here! */
> -        XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> +        if (BufferIsPermanent(buffer))
> +        {
> 
> I really wish there was a way to better pull apart the batch and
> non-batch cases in a way that could allow the below block to be in a
> helper do_set_hint() (or whatever) which SetHintBitsExt() and
> SetHintBits() called.

I couldn't see a way that didn't lead to substantially more code duplication.


> And then you inlined BufferSetHintBits16().

Hm?


> I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> SNAPSHOT_MVCC calls the non-batch version (and, of course,
> HeapTupleSatisifiesVisibility() is the much more common case).
> 
> if (TransactionIdIsValid(xid))
> {
>         if (BufferIsPermanent(buffer))
>         {
>                 /* NB: xid must be known committed here! */
>                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> 
>                 if (XLogNeedsFlush(commitLSN) &&
>                         BufferGetLSNAtomic(buffer) < commitLSN)
>                 {
>                         /* not flushed and no LSN interlock, so don't
> set hint */
>                         return false;
>                 }
>         }
> }
> 
> Separately, I was thinking, should we assert here about having the
> right lock type?

Not sure I get what assert of what locktype where?


>  * It is only safe to set a transaction-committed hint bit if we know the
>  * transaction's commit record is guaranteed to be flushed to disk before the
>  * buffer, or if the table is temporary or unlogged and will be obliterated by
>  * a crash anyway.  We cannot change the LSN of the page here, because we may
>  * hold only a share lock on the buffer, so we can only use the LSN to
>  * interlock this if the buffer's LSN already is newer than the commit LSN;
>  * otherwise we have to just refrain from setting the hint bit until some
>  * future re-examination of the tuple.
>  *
> 
> Should this say "we may hold only a share exclusive lock on the
> buffer". Also what is "this" in "only use the LSN to interlock this"?

That's a pre-existing comment, right?


> @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> Buffer buffer,
> +    if (state == SHB_ENABLED)
> +        BufferFinishSetHintBits(buffer, true, true);
> +
>      return nvis;
>  }
> 
> I wondered if it would be more natural for BufferBeginSetHintBits()
> and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> instead of having callers do it. But, I guess you don't do this
> because of gist and hash indexes using this for doing their own
> modifications.

I thought about it. But yea, the different callers seemed to make that not
really useful. It'd also mean that we'd do an external function call for every
tuple, which I think would be prohibitively expensive.



> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c
> @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
>     max_avail = fsm_get_max_avail(page);
>     /*
> -    * Reset the next slot pointer. This encourages the use of low-numbered
> -    * pages, increasing the chances that a later vacuum can truncate the
> -    * relation.  We don't bother with a lock here, nor with marking the page
> -    * dirty if it wasn't already, since this is just a hint.
> +    * Try to reset the next slot pointer. This encourages the use of
> +    * low-numbered pages, increasing the chances that a later vacuum can
> +    * truncate the relation.  We don't bother with a lock here, nor with
> +    * marking the page dirty if it wasn't already, since this is just a hint.
> +    *
> +    * To be allowed to update the page without an exclusive lock, we have to
> +    * use the hint bit infrastructure.
>      */
> 
> What the heck? This didn't even take a share lock before...

It is insane. I do not understand how anybody thought this was ok.  I think I
probably should split this out into a separate commit, since it's so insane.


> diff --git a/src/backend/storage/freespace/fsmpage.c
> b/src/backend/storage/freesp>
> index 66a5c80b5a6..a59696b6484 100644
> --- a/src/backend/storage/freespace/fsmpage.c
> +++ b/src/backend/storage/freespace/fsmpage.c
> @@ -298,9 +298,18 @@ restart:
>      * lock and get a garbled next pointer every now and then, than take the
>      * concurrency hit of an exclusive lock.
>      *
> +    * Without an exclusive lock, we need to use the hint bit infrastructure
> +    * to be allowed to modify the page.
> +    *
> 
> Is the sentence above this still correct?

Seems ok enough to me.


> /*
> * Update the next-target pointer. Note that we do this even if we're only
> * holding a shared lock, on the grounds that it's better to use a shared
> * lock and get a garbled next pointer every now and then, than take the
> * concurrency hit of an exclusive lock.
> 
> We appear to avoid the garbling now?

I don't think so. Two backends concurrently can do fsm_search_avail() and
one backend might set a hint to a page that is already used up by the other
one. At least I think so?


> In general on 0012, I didn't spend much time checking if you caught
> all the places where we mention our hint bit hackery (only taking the
> share lock). But those can always be caught later as we inevitably
> encounter them.

There's definitely some more search needed as part of polishing that
commit. But I think it's pretty much inevitable that we'll miss some comment
somewhere :(

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
>
> > I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)
>
> I left that in there to make the comparison easier. But clearly that's not to
> be committed...

Eh, I mostly just ran out of steam.

> > > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
> > > of lwlocks
> > ...
> > > This commit unfortunately introduces some code that is very similar to the
> > > code in lwlock.c, however the code is not equivalent enough to easily merge
> > > it. The future wins that this commit makes possible seem worth the cost.
> >
> > It is a truly unfortunate amount of duplication. I tried some
> > refactoring myself just to convince myself it wasn't a good idea.
>
> Without success, I guess?

Nothing that seemed better...and not worse :)

> > > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > > documented and used in a future commit. It seemed too complicated to introduce
> > > the lock-level in a separate commit.
> >
> > I would have liked this mentioned earlier in the commit message.
>
> Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
> because it doesn't just lead to adding code, but to changing code.

I don't think you need to introduce share-exclusive in a separate
commit. I just mean it would be good to mention in the commit message
before you get into so many other details that this commit doesn't use
the share-exclusive level yet.

> > Also, I don't know how I feel about it being "documented" in a future
> > commit...perhaps just don't say that.
>
> Hm, why?

I don't really see you documenting share-exclusive lock semantics in a
later commit. Documentation for what the lock level is should be in
the same commit that introduces it -- which I think you mostly have
done. I feel like it is weird to say you will document that lock level
later when 1) you don't really do that and 2) this commit mostly
(besides some requests I had for elaboration) already does that.

> > diff --git a/src/include/storage/buf_internals.h
> > b/src/include/storage/buf_internals.h
> > index 28519ad2813..0a145d95024 100644
> > --- a/src/include/storage/buf_internals.h
> > +++ b/src/include/storage/buf_internals.h
> > @@ -32,22 +33,29 @@
> >  /*
> >   * Buffer state is a single 64-bit variable where following data is combined.
> >   *
> > + * State of the buffer itself:
> >   * - 18 bits refcount
> >   * - 4 bits usage count
> >   * - 10 bits of flags
> >   *
> > + * State of the content lock:
> > + * - 1 bit has_waiter
> > + * - 1 bit release_ok
> > + * - 1 bit lock state locked
> >
> > Somewhere you should clearly explain the scenarios in which you still
> > need to take the buffer header lock with LockBufHdr() vs when you can
> > just use atomic operations/CAS loops.
>
> There is an explanation of that, or at least my attempt at it ;). See the
> > documentation for BufferDesc.

Hmm. I suppose. I was imagining something above the state member since
you have added to it, but okay.

Separately, I'll admit I don't quite understand when I have to use
LockBufHdr() and when I should use a CAS loop to update the
bufferdesc->state.

> >
> > +         * Signal that the process isn't on the wait list anymore. This allows
> > +         * BufferLockDequeueSelf() to remove itself of the waitlist with a
> > +         * proclist_delete(), rather than having to check if it has been
> >
> > I know this comment was ported over, but the "remove itself of the
> > waitlist" -- the "of" is confusing in the original comment and it is
> > confusing here.
>
> As in s/of/from/?

Yes, I wasn't sure if it meant "from" or "off" -- but I believe "from"
is more grammatically correct.

> > +static inline uint64
> > +BufferLockReleaseSub(BufferLockMode mode)
> >
> > I don't understand why this is a separate function even with your comment.
>
> Because there are operations where we want to unlock the buffer as well as do
> something else. E.g. in 0013's UnlockReleaseBuffer() we want to unlock the
> buffer and decrease the refcount in one atomic operation. For that we need to
> know what to subtract from the state variable for the lock portion - hence
> BufferLockReleaseSub().

Hmm. Okay well maybe save it for 0013? I don't care that much, though.

> > Maybe you should add an assert to the lock acquisition path that the
> > private ref count entry mode is UNLOCK?
>
> Yes, we should...
>
> It'd be nice if we had a decent way to test things that we expect to
> crash. Like repeated buffer lock acquisitions...

Indeed.

> > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> >      if (BufferIsLocal(buffer))
> >          UnpinLocalBufferNoOwner(buffer);
> >      else
> > +    {
> > +        PrivateRefCountEntry *ref;
> > +
> > +        ref = GetPrivateRefCountEntry(buffer, false);
> > +
> > +        /*
> > +         * If the buffer was locked at the time of the resowner release,
> > +         * release the lock now. This should only happen after errors.
> > +         */
> > +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> > +        {
> > +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> > +
> > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> > +            BufferLockUnlock(buffer, buf);
> > +        }
> > +
> >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > +    }
> >  }
> >
> >  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.
>
> Do you have a better suggestion? I'll add a comment that makes that explicit,
> but other than that I don't have a great idea. Renaming the whole buffer pin
> mechanism seems pretty noisy.

ResOwnerReleaseBuffer()?

What do you mean renaming the whole buffer pin mechanism?

> > 0011:
> > --------
> > > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mode
> >
> > > 2) We would like to stop setting hint bits while pages are being written
> > > out. The necessary locking becomes visible for page mode scans if done for
> > > every tuple. With batching the overhead can be amortized to only happen
> > > once per page.
>
> > And I presume you mean don't set hint bits on a buffer that is being flushed
> > by someone else -- but it sounds like you mean not to set hint bits as part
> > of flushing a buffer.
>
> Right, I do mean the former.

I would try and state it more clearly then.

> > +        /*
> > +         * If the page is not all-visible or we need to check serializability,
> > +         * maintain enough state to be able to refind the tuple efficiently,
> > +         * without again needing to extract it from the page.
> > +         */
> > +        if (!all_visible || check_serializable)
> > +        {
> >
> > "enough state" is pretty vague here.
>
> I don't really follow. Wouldn't going into more detail just restate the code
> in a comment?

I guess maybe something like

If the page is not all-visible or we need to check serializability,
keep track of the tuples so we can examine them later without the
overhead of extracting them from the page again.

> > 0012:
> > -------
>
> > @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
> >       */
> >      for (i = 0; i < so->numKilled; i++)
> >      {
> > +        if (!killedsomething)
> > +        {
> > +            /*
> > +             * Use hint bit infrastructure to be allowed to modify the page
> > +             * without holding an exclusive lock.
> > +             */
> > +            if (!BufferBeginSetHintBits(buffer))
> > +                goto unlock;
> > +        }
> > +
> >
> > I don't understand why this is in the loop. Clearly you want to call
> > BufferBeginSetHintBits() once, but why would you do it in the loop?
>
> Why would we want to continue if we can't set hint bits?

No, I'm suggesting that you move BufferBeginSetHintBits() outside the
loop so it is more obvious that it is happening once at the beginning.
Like this:

    if (so->numKilled > 0 && !BufferBeginSetHintBits(buffer))
        goto unlock;

    for (i = 0; i < so->numKilled; i++)
    {
        offnum = so->killedItems[i];
        iid = PageGetItemId(page, offnum);
        ItemIdMarkDead(iid);
        killedsomething = true;
    }

    if (killedsomething)
    {
        GistMarkPageHasGarbage(page);
        BufferFinishSetHintBits(buffer, true, true);
    }

unlock:
    UnlockReleaseBuffer(buffer);

> > And should the enum name itself (SetHintBitsState) include the word
> > "batch"? I know that would make it
> > long. At least the comment should explain that these are needed when batch
> > setting hint bits.
>
> Isn't that what the comment does, explaining that we want to amortize the cost
> of BufferBeginSetHintBits()?

Well, "amortize" is a pretty fancy word and you don't say "batch" anywhere.

> > And then you inlined BufferSetHintBits16().

This was part of my wishlist for separating the batch and non-batch versions.

> > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > HeapTupleSatisifiesVisibility() is the much more common case).
> >
> > if (TransactionIdIsValid(xid))
> > {
> >         if (BufferIsPermanent(buffer))
> >         {
> >                 /* NB: xid must be known committed here! */
> >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> >
> >                 if (XLogNeedsFlush(commitLSN) &&
> >                         BufferGetLSNAtomic(buffer) < commitLSN)
> >                 {
> >                         /* not flushed and no LSN interlock, so don't
> > set hint */
> > >                         return false;
> >                 }
> >         }
> > }
> >
> > Separately, I was thinking, should we assert here about having the
> > right lock type?
>
> Not sure I get what assert of what locktype where?

In SetHintBitsExt() that we have share-exclusive or above. Or are
there still callers with only a share lock?

> >  * It is only safe to set a transaction-committed hint bit if we know the
> >  * transaction's commit record is guaranteed to be flushed to disk before the
> >  * buffer, or if the table is temporary or unlogged and will be obliterated by
> >  * a crash anyway.  We cannot change the LSN of the page here, because we may
> >  * hold only a share lock on the buffer, so we can only use the LSN to
> >  * interlock this if the buffer's LSN already is newer than the commit LSN;
> >  * otherwise we have to just refrain from setting the hint bit until some
> >  * future re-examination of the tuple.
> >  *
> >
> > Should this say "we may hold only a share exclusive lock on the
> > buffer". Also what is "this" in "only use the LSN to interlock this"?
>
> That's a pre-existing comment, right?

Right, but are there callers that will only have a share lock after your change?

> > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > Buffer buffer,
> > +    if (state == SHB_ENABLED)
> > +        BufferFinishSetHintBits(buffer, true, true);
> > +
> >      return nvis;
> >  }
> >
> > I wondered if it would be more natural for BufferBeginSetHintBits()
> > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > instead of having callers do it. But, I guess you don't do this
> > because of gist and hash indexes using this for doing their own
> > modifications.
>
> I thought about it. But yea, the different callers seemed to make that not
> really useful. It'd also mean that we'd do an external function call for every
> tuple, which I think would be prohibitively expensive.

I'm confused why this would mean more external function calls.

> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> >     max_avail = fsm_get_max_avail(page);
> >     /*
> > -    * Reset the next slot pointer. This encourages the use of low-numbered
> > -    * pages, increasing the chances that a later vacuum can truncate the
> > -    * relation.  We don't bother with a lock here, nor with marking the page
> > -    * dirty if it wasn't already, since this is just a hint.
> > +    * Try to reset the next slot pointer. This encourages the use of
> > +    * low-numbered pages, increasing the chances that a later vacuum can
> > +    * truncate the relation.  We don't bother with a lock here, nor with
> > +    * marking the page dirty if it wasn't already, since this is just a hint.
> > +    *
> > +    * To be allowed to update the page without an exclusive lock, we have to
> > +    * use the hint bit infrastructure.
> >      */
> >
> > What the heck? This didn't even take a share lock before...
>
> It is insane. I do not understand how anybody thought this was ok.  I think I
> probably should split this out into a separate commit, since it's so insane.

Aren't you going to backport having it take a lock? Then it can be
separate (e.g. have it take a share lock in the "fix" commit and then
this commit bumps it to share-exclusive).

> > diff --git a/src/backend/storage/freespace/fsmpage.c
> > b/src/backend/storage/freesp>
> > /*
> > * Update the next-target pointer. Note that we do this even if we're only
> > * holding a shared lock, on the grounds that it's better to use a shared
> > * lock and get a garbled next pointer every now and then, than take the
> > * concurrency hit of an exclusive lock.
> >
> > We appear to avoid the garbling now?
>
> I don't think so. Two backends concurrently can do fsm_search_avail() and
> one backend might set a hint to a page that is already used up by the other
> one. At least I think so?

Maybe I don't know what it meant by garbled, but I thought it was
talking about two backends each trying to set fp_next_slot. If they
now have to have a share-exclusive lock and they can't both have a
share-exclusive lock at the same time, then it seems like that
wouldn't be a problem. It sounds like you may be talking about a
backend taking up the freespace of a page that is referred to by the
fp_next_slot?

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote:
> On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
> > > > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > > > documented and used in a future commit. It seemed too complicated to introduce
> > > > the lock-level in a separate commit.
> > >
> > > I would have liked this mentioned earlier in the commit message.
> >
> > Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
> > because it doesn't just lead to adding code, but to changing code.
> 
> I don't think you need to introduce share-exclusive in a separate
> commit. I just mean it would be good to mention in the commit message
> before you get into so many other details that this commit doesn't use
> the share-exclusive level yet.
> 
> > > Also, I don't know how I feel about it being "documented" in a future
> > > commit...perhaps just don't say that.
> >
> > Hm, why?
> 
> I don't really see you documenting share-exclusive lock semantics in a
> later commit. Documentation for what the lock level is should be in
> the same commit that introduces it -- which I think you mostly have
> done. I feel like it is weird to say you will document that lock level
> later when 1) you don't really do that and 2) this commit mostly
> (besides some requests I had for elaboration) already does that.

The problem I faced was that the explanation for the lock level depends on
later changes - how do you document why share-exclusive is useful without
explaining the hint bit thing, which isn't yet implemented as of that commit.


> > > diff --git a/src/include/storage/buf_internals.h
> > > b/src/include/storage/buf_internals.h
> > > index 28519ad2813..0a145d95024 100644
> > > --- a/src/include/storage/buf_internals.h
> > > +++ b/src/include/storage/buf_internals.h
> > > @@ -32,22 +33,29 @@
> > >  /*
> > >   * Buffer state is a single 64-bit variable where following data is combined.
> > >   *
> > > + * State of the buffer itself:
> > >   * - 18 bits refcount
> > >   * - 4 bits usage count
> > >   * - 10 bits of flags
> > >   *
> > > + * State of the content lock:
> > > + * - 1 bit has_waiter
> > > + * - 1 bit release_ok
> > > + * - 1 bit lock state locked
> > >
> > > Somewhere you should clearly explain the scenarios in which you still
> > > need to take the buffer header lock with LockBufHdr() vs when you can
> > > just use atomic operations/CAS loops.
> >
> > There is an explanation of that, or at least my attempt at it ;). See the
> > dcoumentation for BufferDesc.
> 
> Hmm. I suppose. I was imagining something above the state member since
> you have added to it, but okay.
> 
> Separately, I'll admit I don't quite understand when I have to use
> LockBufHdr() and when I should use a CAS loop to update the
> bufferdesc->state.

It's definitely subtle :(. Not sure what to do about that, other than to work
on eventually just making everything doable with just a CAS. But that's a fair
bit of future work away.


> > > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> > >      if (BufferIsLocal(buffer))
> > >          UnpinLocalBufferNoOwner(buffer);
> > >      else
> > > +    {
> > > +        PrivateRefCountEntry *ref;
> > > +
> > > +        ref = GetPrivateRefCountEntry(buffer, false);
> > > +
> > > +        /*
> > > +         * If the buffer was locked at the time of the resowner release,
> > > +         * release the lock now. This should only happen after errors.
> > > +         */
> > > +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> > > +        {
> > > +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> > > +
> > > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> > > +            BufferLockUnlock(buffer, buf);
> > > +        }
> > > +
> > >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > > +    }
> > >  }
> > >
> > >  Bit confusing that ResOwnerReleaseBufferPin() now releases locks as well.
> >
> > Do you have a better suggestion? I'll add a comment that makes that explicit,
> > but other than that I don't have a great idea. Renaming the whole buffer pin
> > mechanism seems pretty noisy.
> 
> ResOwnerReleaseBuffer()?
> 
> What do you mean renaming the whole buffer pin mechanism?

All the *PrivateRefCount* stuff arguably ought to be renamed...



> > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > > HeapTupleSatisifiesVisibility() is the much more common case).
> > >
> > > if (TransactionIdIsValid(xid))
> > > {
> > >         if (BufferIsPermanent(buffer))
> > >         {
> > >                 /* NB: xid must be known committed here! */
> > >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> > >
> > >                 if (XLogNeedsFlush(commitLSN) &&
> > >                         BufferGetLSNAtomic(buffer) < commitLSN)
> > >                 {
> > >                         /* not flushed and no LSN interlock, so don't
> > > set hint */
> > >                         return; false;
> > >                 }
> > >         }
> > > }
> > >
> > > Separately, I was thinking, should we assert here about having the
> > > right lock type?
> >
> > Not sure I get what assert of what locktype where?
> 
> In SetHintBitsExt() that we have share-exclusive or above.

We *don't* necessarily hold that though, it'll just be acquired by
BufferBeginSetHintBits().  Or do you mean in the SHB_ENABLED case?


> Or are there still callers with only a share lock?

Almost all of them, I think? We can't just call this with an unconditional
share-exclusive lock, since that'd destroy concurrency. Instead we just try to
upgrade to share-exclusive to set the hint bit. If we can't get
share-exclusive, we don't need to block, we just haven't set the hint bit.
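
I.e. the flow is roughly this sketch (the real SetHintBits() does more):

    static void
    SetHintBitsSketch(HeapTupleHeader tuple, Buffer buffer, uint16 infomask)
    {
        /* try to upgrade share -> share-exclusive; never blocks */
        if (!BufferBeginSetHintBits(buffer))
            return;             /* page is being written out; skip hint */

        tuple->t_infomask |= infomask;

        BufferFinishSetHintBits(buffer, true, true);
    }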


> > > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > > Buffer buffer,
> > > +    if (state == SHB_ENABLED)
> > > +        BufferFinishSetHintBits(buffer, true, true);
> > > +
> > >      return nvis;
> > >  }
> > >
> > > I wondered if it would be more natural for BufferBeginSetHintBits()
> > > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > > instead of having callers do it. But, I guess you don't do this
> > > because of gist and hash indexes using this for doing their own
> > > modifications.
> >
> > I thought about it. But yea, the different callers seemed to make that not
> > really useful. It'd also mean that we'd do an external function call for every
> > tuple, which I think would be prohibitively expensive.
> 
> I'm confused why this would mean more external function calls.

The visibility logic is in heapam_visibility.c, the locking for the buffer in
bufmgr.c? If the knowledge about SetHintBitsState lives in
BufferBeginSetHintBits(), we need to call it to do that check.


> > > --- a/src/backend/storage/freespace/freespace.c
> > > +++ b/src/backend/storage/freespace/freespace.c
> > > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> > >     max_avail = fsm_get_max_avail(page);
> > >     /*
> > > -    * Reset the next slot pointer. This encourages the use of low-numbered
> > > -    * pages, increasing the chances that a later vacuum can truncate the
> > > -    * relation.  We don't bother with a lock here, nor with marking the page
> > > -    * dirty if it wasn't already, since this is just a hint.
> > > +    * Try to reset the next slot pointer. This encourages the use of
> > > +    * low-numbered pages, increasing the chances that a later vacuum can
> > > +    * truncate the relation.  We don't bother with a lock here, nor with
> > > +    * marking the page dirty if it wasn't already, since this is just a hint.
> > > +    *
> > > +    * To be allowed to update the page without an exclusive lock, we have to
> > > +    * use the hint bit infrastructure.
> > >      */
> > >
> > > What the heck? This didn't even take a share lock before...
> >
> > It is insane. I do not understand how anybody thought this was ok.  I think I
> > probably should split this out into a separate commit, since it's so insane.
> 
> Aren't you going to backport having it take a lock? Then it can be
> separate (e.g. have it take a share lock in the "fix" commit and then
> this commit bumps it to share-exclusive).

I wasn't thinking we should backport that. It's been this way for ages,
without having caused known issues....


> > > diff --git a/src/backend/storage/freespace/fsmpage.c
> > > b/src/backend/storage/freesp>
> > > /*
> > > * Update the next-target pointer. Note that we do this even if we're only
> > > * holding a shared lock, on the grounds that it's better to use a shared
> > > * lock and get a garbled next pointer every now and then, than take the
> > > * concurrency hit of an exclusive lock.
> > >
> > > We appear to avoid the garbling now?
> >
> > I don't think so. Two backends concurrently can do fsm_search_avail() and
> > one backend might set a hint to a page that is already used up by the other
> > one. At least I think so?
> 
> Maybe I don't know what it meant by garbled, but I thought it was
> talking about two backends each trying to set fp_next_slot. If they
> now have to have a share-exclusive lock and they can't both have a
> share-exclusive lock at the same time, then it seems like that
> wouldn't be a problem. It sounds like you may be talking about a
> backend taking up the freespace of a page that is referred to by the
> fp_next_slot?

Yes, a version of the latter. The value that fp_next_slot will be set to can
be outdated by the time we actually set it, unless we do all of
fsm_search_avail() under some form of exclusive lock - clearly not something
desirable.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Tue, Nov 25, 2025 at 3:46 PM Andres Freund <andres@anarazel.de> wrote:
>
>
> > > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > > > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > > > HeapTupleSatisifiesVisibility() is the much more common case).
> > > >
> > > > if (TransactionIdIsValid(xid))
> > > > {
> > > >         if (BufferIsPermanent(buffer))
> > > >         {
> > > >                 /* NB: xid must be known committed here! */
> > > >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> > > >
> > > >                 if (XLogNeedsFlush(commitLSN) &&
> > > >                         BufferGetLSNAtomic(buffer) < commitLSN)
> > > >                 {
> > > >                         /* not flushed and no LSN interlock, so don't
> > > > set hint */
> > > >                         return; false;
> > > >                 }
> > > >         }
> > > > }
> > > >
> > > > Separately, I was thinking, should we assert here about having the
> > > > right lock type?
> > >
> > > Not sure I get what assert of what locktype where?
> >
> > In SetHintBitsExt() that we have share-exclusive or above.
>
> We *don't* necessarily hold that though, it'll just be acquired by
> BufferBeginSetHintBits().  Or do you mean in the SHB_ENABLED case?

Yea, in the enabled case. Also, can't we skip the whole

    if (TransactionIdIsValid(xid))
    {
        if (BufferIsPermanent(buffer))
        {
            /* NB: xid must be known committed here! */
            XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);

            if (XLogNeedsFlush(commitLSN) &&
                BufferGetLSNAtomic(buffer) < commitLSN)
            {
                /* not flushed and no LSN interlock, so don't set hint */
                return;
            }
        }
    }

part if state is SHB_DISABLED?

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-21 12:52:38 -0500, Melanie Plageman wrote:
> > 0004: Use 64bit atomics for BufferDesc.state - at this point nothing uses the
> > additional bits yet, though.  Some annoying reformatting required to avoid
> > long lines.
>
> I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
> macros cast the return value to a uint32. We won't use the extra bits
> but we did bother to keep the macro result sized to the field width
> before so keeping it uint32 is probably more confusing now that state
> is 64 bit.

I can't really follow - why would we want to return a 64bit value if none of
the values ever can get anywhere near that big?


> Not related to this patch, but I noticed GetBufferDescriptor() calls
> for a uint32 and all the callers pretty much pass a signed int —
> wonder why it calls for uint32.

It's not strictly required - no Buffers exist between INT32_MAX and
UINT32_MAX. However, GetBufferDescriptor() cannot be used for local buffers
(which would have a negative buffer id), therefore a uint32 is fine. Many of
the callers have dedicated branches to deal with local buffers and therefore
couldn't use a uint32.
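
(For reference, the usual call-site pattern in bufmgr.c today:

    /* shared buffers are 1..NBuffers, local buffers have negative ids */
    if (BufferIsLocal(buffer))
        buf_hdr = GetLocalBufferDescriptor(-buffer - 1);
    else
        buf_hdr = GetBufferDescriptor(buffer - 1);

so only the else-branch argument is known to be non-negative.)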


> > 0006+0007: This is preparatory work for 0008, but also worthwhile on its
> > own. The private refcount stuff does show up in profiles. The reason it's
> > related is that without these changes the added information in 0008 makes that
> > worse.
>
> I found it slightly confusing that this commit appears to
> unnecessarily add the PrivateRefCountData struct (given that it
> doesn't need it to do the new parallel array thing). You could wait
> until you need it in 0008, but 0008 is big as it is, so it probably is
> fine where it is.

It seemed too annoying to whack the code around multiple times...


> in InitBufferManagerAccess(), why do you have
>
> memset(&PrivateRefCountArrayKeys, 0, sizeof(Buffer));
> seems like it should be
> memset(PrivateRefCountArrayKeys, 0, sizeof(PrivateRefCountArrayKeys));

Ugh, indeed.


> I wonder how easy it will be to keep the Buffer in sync between
> PrivateRefCountArrayKeys and the PrivateRefCountEntry — would a helper
> function help?

I don't think we really need it; the existing helper functions are where it
should be manipulated.


> ForgetPrivateRefCountEntry doesn’t clear the data member — but maybe
> it doesn’t matter...

It asserts that the fields are reset, which seems to suffice.


> in ReservePrivateRefCountEntry() there is a superfluous clear
>
> memset(&victim_entry->data, 0, sizeof(victim_entry->data));
> victim_entry->data.refcount = 0;

It's indeed superfluous today. I guess I put it in as a belt and suspenders
approach to future members... The compiler is easily able to optimize that
redundancy away.


> 0007
> needs a commit message. overall seems fine though.

This is what I've since written:

    bufmgr: Add one-entry cache for private refcount

    The private refcount entry for a buffer is often looked up repeatedly for the
    same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
    worthwhile to have a one-entry cache for that case. With that cache in place,
    it's worth splitting GetPrivateRefCountEntry() into a small inline
    portion (for the cache hit case) and an out-of-line helper for the rest.

    This is helpful for some workloads today, but becomes more important in an
    upcoming patch that will utilize the private refcount infrastructure to also
    store whether the buffer is currently locked, as that increases the rate of
    lookups substantially.
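
The shape of that split, as a sketch (the cache variable and the slow-path
helper name are partly assumed):

    static PrivateRefCountEntry *GetPrivateRefCountEntrySlow(Buffer buffer,
                                                             bool do_move);

    static inline PrivateRefCountEntry *
    GetPrivateRefCountEntrySketch(Buffer buffer, bool do_move)
    {
        /* fast path: same buffer as the previous lookup */
        if (PrivateRefCountEntryLast &&
            PrivateRefCountEntryLast->buffer == buffer)
            return PrivateRefCountEntryLast;

        /* out-of-line slow path: search array/hash, refresh the cache */
        return GetPrivateRefCountEntrySlow(buffer, do_move);
    }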



> You should probably capitalize the "c" of "count" in
> PrivateRefcountEntryLast to be consistent with the other names.

Ooops, yes.

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Melanie Plageman
Date:
On Mon, Dec 1, 2025 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
>
> > I noticed that the BUF_STATE_GET_REFCOUNT and BUF_STATE_GET_USAGECOUNT
> > macros cast the return value to a uint32. We won't use the extra bits
> > but we did bother to keep the macro result sized to the field width
> > before so keeping it uint32 is probably more confusing now that state
> > is 64 bit.
>
> I can't really follow - why would we want to return a 64bit value if none of
> the values ever can get anywhere near that big?

Well, usagecount could never have reached anything close to a uint32
either, so I thought that this cast was meant to match the datatype. I
found it rather confusing otherwise.

> > 0007
> > needs a commit message. overall seems fine though.
>
> This is what I've since written:
>
>     bufmgr: Add one-entry cache for private refcount
>
>     The private refcount entry for a buffer is often looked up repeatedly for the
>     same buffer, e.g. to pin and then unpin a buffer. Benchmarking shows that it's
>     worthwhile to have a one-entry cache for that case. With that cache in place,
>     it's worth splitting GetPrivateRefCountEntry() into a small inline
>     portion (for the cache hit case) and an out-of-line helper for the rest.
>
>     This is helpful for some workloads today, but becomes more important in an
>     upcoming patch that will utilize the private refcount infrastructure to also
>     store whether the buffer is currently locked, as that increases the rate of
>     lookups substantially.

Sounds good based on what I recall of the patch.

- Melanie



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Heikki Linnakangas
Date:
On 25/11/2025 22:46, Andres Freund wrote:
> On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote:
>> On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
>>>> --- a/src/backend/storage/freespace/freespace.c
>>>> +++ b/src/backend/storage/freespace/freespace.c
>>>> @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
>>>>      max_avail = fsm_get_max_avail(page);
>>>>      /*
>>>> -    * Reset the next slot pointer. This encourages the use of low-numbered
>>>> -    * pages, increasing the chances that a later vacuum can truncate the
>>>> -    * relation.  We don't bother with a lock here, nor with marking the page
>>>> -    * dirty if it wasn't already, since this is just a hint.
>>>> +    * Try to reset the next slot pointer. This encourages the use of
>>>> +    * low-numbered pages, increasing the chances that a later vacuum can
>>>> +    * truncate the relation.  We don't bother with a lock here, nor with
>>>> +    * marking the page dirty if it wasn't already, since this is just a hint.
>>>> +    *
>>>> +    * To be allowed to update the page without an exclusive lock, we have to
>>>> +    * use the hint bit infrastructure.
>>>>       */
>>>>
>>>> What the heck? This didn't even take a share lock before...
>>>
>>> It is insane. I do not understand how anybody thought this was ok.  I think I
>>> probably should split this out into a separate commit, since it's so insane.
>>
>> Aren't you going to backport having it take a lock? Then it can be
>> separate (e.g. have it take a share lock in the "fix" commit and then
>> this commit bumps it to share-exclusive).
> 
> I wasn't thinking we should backport that. It's been this way for ages,
> without having caused known issues....

I wrote that originally :-). The FSM code always treats the FSM page 
contents as potentially corrupted garbage. FSM page updates are not 
WAL-logged, for starters. And as the comment says, the next slot pointer 
is just a hint for where within the page to start looking for free space.

Page checksums were added later, and now we know about the problems of 
modifying a page a write() is in progress, on some filesystems with 
filesystem-level checksums. So it's a good idea to tighten it up, but it 
was fine back then.

>>>> diff --git a/src/backend/storage/freespace/fsmpage.c
>>>> b/src/backend/storage/freesp>
>>>> /*
>>>> * Update the next-target pointer. Note that we do this even if we're only
>>>> * holding a shared lock, on the grounds that it's better to use a shared
>>>> * lock and get a garbled next pointer every now and then, than take the
>>>> * concurrency hit of an exclusive lock.
>>>>
>>>> We appear to avoid the garbling now?
>>>
>>> I don't think so. Two backends concurrently can do fsm_search_avail() and
>>> one backend might set a hint to a page that is already used up by the other
>>> one. At least I think so?
>>
>> Maybe I don't know what it meant by garbled, but I thought it was
>> talking about two backends each trying to set fp_next_slot. If they
>> now have to have a share-exclusive lock and they can't both have a
>> share-exclusive lock at the same time, then it seems like that
>> wouldn't be a problem. It sounds like you may be talking about a
>> backend taking up the freespace of a page that is referred to by the
>> fp_next_slot?
> 
> Yes, a version of the latter. The value that fp_next_slot will be set to can
> be outdated by the time we actually set it, unless we do all of
> fsm_search_avail() under some form of exclusive lock - clearly not something
> desirable.

I'm pretty sure the "garbled" in the comment means the former, not the 
latter. I.e. it means that the pointer itself might become garbage. 
Would be good to update the comment if that's no longer possible.

But speaking of that: why do we not allow two processes to concurrently 
set hint bits on a page anymore?

- Heikki




Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 10:01:06 +0200, Heikki Linnakangas wrote:
> On 25/11/2025 22:46, Andres Freund wrote:
> > > > > diff --git a/src/backend/storage/freespace/fsmpage.c
> > > > > b/src/backend/storage/freesp>
> > > > > /*
> > > > > * Update the next-target pointer. Note that we do this even if we're only
> > > > > * holding a shared lock, on the grounds that it's better to use a shared
> > > > > * lock and get a garbled next pointer every now and then, than take the
> > > > > * concurrency hit of an exclusive lock.
> > > > > 
> > > > > We appear to avoid the garbling now?
> > > > 
> > > > I don't think so. Two backends concurrently can do fsm_search_avail() and
> > > > one backend might set a hint to a page that is already used up by the other
> > > > one. At least I think so?
> > > 
> > > Maybe I don't know what it meant by garbled, but I thought it was
> > > talking about two backends each trying to set fp_next_slot. If they
> > > now have to have a share-exclusive lock and they can't both have a
> > > share-exclusive lock at the same time, then it seems like that
> > > wouldn't be a problem. It sounds like you may be talking about a
> > > backend taking up the freespace of a page that is referred to by the
> > > fp_next_slot?
> > 
> > Yes, a version of the latter. The value that fp_next_slot will be set to can
> > be outdated by the time we actually set it, unless we do all of
> > fsm_search_avail() under some form of exclusive lock - clearly not something
> > desirable.
> 
> I'm pretty sure the "garbled" in the comment means the former, not the
> latter. I.e. it means that the pointer itself might become garbage. Would be
> good to update the comment if that's no longer possible.

Hm. I thought we had always assumed that 4byte values can be read/written
tear-free. Hence my thinking that "garbled" couldn't refer to reading outright
garbage due to a concurrent write.


> But speaking of that: why do we not allow two processes to concurrently set
> hint bits on a page anymore?

It'd make the locking a lot more complicated without much of a benefit.

The new share-exclusive lock mode only requires one additional bit of lock
state, for the single allowed holder. If we wanted a new lockmode that
prevented the page from being written out concurrently, but could be held
multiple times, we'd need at least MAX_BACKENDS bits for the lock level
allowing hint bits to be set and another lock level to acquire while writing
out the buffer.

At the same time, there seems to be little benefit in setting hint bits on a
page concurrently. A very common case is that the same hint bit(s) would be
set by multiple backends; we don't gain anything from that. And in the cases
where hint bits were intended to be set for different tuples, the window in
which that is not allowed is very narrow, and the cost of not setting the
hint bit right then and there is pretty small.
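
As a rough illustration of the bit accounting (the positions and widths here
are invented):

    /* single-holder lock levels each cost one flag bit in the state */
    #define BUF_LOCK_SHARED_MASK      UINT64CONST(0x3FFFF)   /* N holders */
    #define BUF_LOCK_EXCLUSIVE        (UINT64CONST(1) << 18) /* 1 holder */
    #define BUF_LOCK_SHARE_EXCLUSIVE  (UINT64CONST(1) << 19) /* 1 holder */

    /*
     * A hint-bit level that multiple backends could hold concurrently
     * would instead need its own holder count sized for MAX_BACKENDS,
     * plus yet another level for a writer to exclude those holders.
     */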

Makes sense?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Heikki Linnakangas
Date:
On 02/12/2025 15:20, Andres Freund wrote:
> On 2025-12-02 10:01:06 +0200, Heikki Linnakangas wrote:
>> But speaking of that: why do we not allow two processes to concurrently set
>> hint bits on a page anymore?
> 
> It'd make the locking a lot more complicated without much of a benefit.
> 
> The new share-exclusive lock mode only requires one additional bit of lock
> state, for the single allowed holder. If we wanted a new lockmode that
> prevented the page from being written out concurrently, but could be held
> multiple times, we'd need at least MAX_BACKENDS bits for the lock level
> allowing hint bits to be set and another lock level to acquire while writing
> out the buffer.
> 
> At the same time, there seems to be little benefit in setting hint bits on a
> page concurrently. A very common case is that the same hint bit(s) would be
> set by multiple backends; we don't gain anything from that. And in the cases
> where hint bits were intended to be set for different tuples, the window in
> which that is not allowed is very narrow, and the cost of not setting the
> hint bit right then and there is pretty small.
> 
> Makes sense?

Yep, makes sense. Would be good to put that in a comment somewhere, and 
in the commit message. "We could allow multiple backends to set hint 
bits concurrently, but it'd make the lock implementation more complicated"

- Heikki




Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> Thanks a lot for that detailed review!  A few questions and comments, before I
> try to address the comments in the next version.

Here's that new version, with the following changes:

- Some more micro-optimizations, most importantly adding a commit that doesn't
  initialize the delay in LockBufHdr() unless needed (see the sketch below
  this list). With those I don't see a consistent slowdown anymore (slight
  speedup on one workstation, slight slowdown on another, in an absurdly
  adverse workload)

- Tried to address Melanie's feedback, with some exceptions (some noted below,
  but I also need to make another pass through the reviews)

- re-implemented AssertNotCatalogBufferLock() in the new world

- Substantially expanded comments around setting hint bits (in buffer/README,
  heapam_visibility.c and bufmgr.c)

- split out the change that makes fsm_vacuum_page() start to lock the page
  into its own commit

- reordered the patch series so that smaller changes are before the 64bit-state
  and "Implement buffer content locks independently of" commits, so they can
  be committed while we finish cleaning up the later changes

- I didn't invest much in cleaning up the later patches ("Don't copy pages
  while writing out" and "Make UnlockReleaseBuffer() more efficient") yet,
  wanted to focus on the earlier patches first
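
To illustrate the LockBufHdr() delay change mentioned in the first item,
here's a sketch of the idea (not the actual commit, details may differ):

uint32
LockBufHdr(BufferDesc *desc)
{
    SpinDelayStatus delayStatus;
    bool        delay_inited = false;
    uint32      old_buf_state;

    while (true)
    {
        /* set BM_LOCKED flag */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        /* if it wasn't set before we're OK */
        if (!(old_buf_state & BM_LOCKED))
            break;

        /* contended - only now is the delay state worth setting up */
        if (!delay_inited)
        {
            init_local_spin_delay(&delayStatus);
            delay_inited = true;
        }
        perform_spin_delay(&delayStatus);
    }
    if (delay_inited)
        finish_spin_delay(&delayStatus);
    return old_buf_state | BM_LOCKED;
}

That way the common, uncontended, path doesn't touch the delay state at all.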


Todo:

- still need to rename ResOwnerReleaseBufferPin(). Wondering about what to
  rename ResourceOwnerDesc.name to. "buffer ownership" maybe? Not great...

- gistkillitems() complaint by Melanie

- amortize vs batch vs SetHintBits comment + SHB_* names

- for the next version I'll remove the BATCHMVCC_FEWER_ARGS conditionals from
  0010. I don't love needing BatchMVCCState, but I don't really see an
  alternative; the performance difference is pretty persistent.


Questions:
- ForEachLWLockHeldByMe() and LWLockDisown() aren't used anymore; should we
  remove them?


Greetings,

Andres Freund

Attachments

Re: Buffer locking is special (hints, checksums, AIO writes)

From
Peter Geoghegan
Date:
On Tue, Dec 2, 2025 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
> On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> > Thanks a lot for that detailed review!  A few questions and comments, before I
> > try to address the comments in the next version.
>
> Here's that new version, with the following changes:

_bt_check_unique will hold an exclusive buffer lock on the page being
LP_DEAD-set in the vast majority of cases. Should we expect your
changes to have no effect at all in that common case?

The BTP_HAS_GARBAGE flag is deprecated these days; we basically don't
use it anymore. How much value might there be in avoiding setting
BTP_HAS_GARBAGE as a way of being able to use BufferSetHintBits16 more
often in nbtree?

--
Peter Geoghegan



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 20:12:14 -0500, Peter Geoghegan wrote:
> On Tue, Dec 2, 2025 at 7:47 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2025-11-25 11:54:00 -0500, Andres Freund wrote:
> > > Thanks a lot for that detailed review!  A few questions and comments, before I
> > > try to address the comments in the next version.
> >
> > Here's that new version, with the following changes:
> 
> _bt_check_unique will hold an exclusive buffer lock on the page being
> LP_DEAD-set in the vast majority of cases. Should we expect your
> changes to have no effect at all in that common case?

If we already have an exclusive lock, BufferBeginSetHintBits() will quickly
return true and won't ever return false.
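
To sketch the intended caller pattern (the finish function's name and the
exact sequence here are illustrative, not necessarily the final API):

static void
set_xmin_committed_hint(Buffer buffer, HeapTupleHeader tuple)
{
    if (BufferBeginSetHintBits(buffer))
    {
        /* no write IO can start until we call the finish function */
        tuple->t_infomask |= HEAP_XMIN_COMMITTED;
        BufferFinishSetHintBits(buffer);
        MarkBufferDirtyHint(buffer, true);
    }

    /*
     * Otherwise write IO (or another hint bit setter) is in progress and
     * we just skip - the hint can be set on a later access.
     */
}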


> The BTP_HAS_GARBAGE flag is deprecated these days; we basically don't
> use it anymore. How much value might there be in avoiding setting
> BTP_HAS_GARBAGE as a way of being able to use BufferSetHintBits16 more
> often in nbtree?

None of the MarkBufferDirtyHint() cases in nbtree that had to be modified
looked like they would benefit from BufferSetHintBits16(), since they will
typically modify the page multiple times.  But maybe I'm just misunderstanding
what you mean?

Greetings,

Andres Freund



Re: Buffer locking is special (hints, checksums, AIO writes)

From
Andres Freund
Date:
Hi,

On 2025-12-02 19:47:35 -0500, Andres Freund wrote:
> Here's that new version, with the following changes:
> 
> [...]

I'm planning to work on committing 0001, 0002, 0003, 0008 soon-ish, unless
somebody sees a reason to hold off on that.  After that I think 0005, 0006
would be next.  I think 0004 is a clear improvement, but nobody has looked at
it yet...

For 0007, I wished ConditionalLockBuffer() accepted the lock level; there's no
point in waiting for the lock in that use case. I'm on the fence about whether
it's worth changing the ~12 users of ConditionalLockBuffer()...
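
Roughly what I'd wish for (the share-exclusive constant name is made up here):

extern bool ConditionalLockBuffer(Buffer buffer, int mode);

/* would-be 0007 call site: try the weaker level, never wait */
if (ConditionalLockBuffer(buf, BUFFER_LOCK_SHARE_EXCLUSIVE))
{
    /* ... set hint bits ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
}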

Greetings,

Andres Freund