Discussion: [PATCH] Write Notifications Through WAL
Hello all,
creating a new thread here in continuation of [1], as the patch
has changed direction significantly from the subject there.
TLDR:
The attached patch aims to improve performance of the NOTIFY codepath
in the case of concurrent notifying transactions, by eliminating the
need for a global lock to be held from PreCommit_Notify to the end of
the transaction. We accomplish this through a method inspired by Tom
Lane here [2]. The high-level approach is as follows:
Write Path:
1. Write the notification data into WAL in PreCommit_Notify. Store the
LSN in MyProc.
2. Acquire NotifyQueueLock in exclusive mode.
3. Write a compact entry into the notify queue, referencing the LSN
from step 1.
4. Commit the transaction, writing the notify LSN from step 1 inside
the commit record.
5. Release NotifyQueueLock.
Read Path:
Readers walk the notification queue, where they see just the
notify_data_lsn, and then fetch the actual notification data from the
WAL (a rough sketch of the compact entry layout follows the TLDR).
(End TLDR)
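For reference, the compact queue entry only needs to identify the
sending transaction and point at its payload in WAL, so it can be as
small as something like this (field names here are illustrative; the
attached patch has the exact definition):

typedef struct AsyncQueueEntry
{
    TransactionId xid;              /* sender's transaction id */
    Oid           dboid;            /* sender's database */
    XLogRecPtr    notify_data_lsn;  /* WAL location of the notification data */
} AsyncQueueEntry;

Because entries are fixed size, the number of entries per queue page is
a constant, which is also why the end-of-page padding logic from the
previous version is no longer needed.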
Addressing the concerns from the previous patch, and explaining
the latest patch:
On Wed, Sep 10, 2025 at 6:05 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> There is also some other tests failing, like isolation, regress and
> others.
These are passing in the new patch, after removing the
extraneous SignalBackends call.
On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:
> I agree that listeners don't need to check if the notify transaction
> was committed or not anymore, but it seems that listeners still need
> to wait until the notify transaction is completed before sending its
> notifications. I believe we should continue using XidInMVCCSnapshot as
> the current version does.
Since we hold NotifyQueueLock in exclusive mode while writing the
compact entry in the notification queue, and don't release it until
after commit - it is impossible for a listener to come across a
notification that is not yet committed. In the case of a backend crash
after writing the notification into the queue and before committing,
we would be in a critical section, so even this is not a scenario where
we might read an uncommitted notification.
> 2) Now we have two calls to SignalBackends (is it intentional?). The
> first in AtCommit_Notify() which seems correct to me. The second in
> XactLogCommitRecord() seems too early, because other backends can't
> process new notifications at this point (if the point about
> XidInMVCCSnapshot is correct). And there is Assert failure (Matheus'
> report is about it).
Fixed this in the new patch (removed the SignalBackends call in xact.c).
> 3) I think we still need to check if the queue is full or not while
> adding new entries and ereport if it is (entries are smaller now, but
> we still can run out of space). And it seems to me that we should do
> this check before entering the critical section, because in the
> critical section it will result in PANIC. Master results in ERROR in
> case of the full queue, so should we do the same here?
Thanks for bringing this up - in the new patch I have added a check to
make sure we have enough space in the queue before entering the
critical section. The logic is as follows: in PreCommit_Notify, before
writing the WAL for the notify data, we do the following (a rough
sketch of this check is included below):
- Take NotifyQueueLock EXCLUSIVE.
- Look at the current head, tail, and how many entries are already
reserved by other backends.
- If adding one more would exceed:
  - the allowed page window (max_notify_queue_pages), or
  - the total entry cap (pages × entries per page),
  then ERROR out: "queue is full".
- If our reserved slot would be the last on the page, pre-create the
next page so commit won't stall.
- Increment reservedEntries (our claim) and drop the lock.
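In code, the reservation check is roughly along these lines (a
simplified sketch; the helper names here are illustrative and the
actual patch differs in details):

/* In PreCommit_Notify, before writing the notify data record to WAL */
int     maxEntries;
int     usedEntries;

LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);

maxEntries  = max_notify_queue_pages * QUEUE_ENTRIES_PER_PAGE;
usedEntries = asyncQueueEntryCount();    /* head minus tail; illustrative helper */

/* erroring out here releases NotifyQueueLock during transaction abort */
if (usedEntries + asyncQueueControl->reservedEntries + 1 > maxEntries)
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("too many notifications in the NOTIFY queue")));

/*
 * If our reserved slot would be the last one on its page, create the
 * next page now, so the commit-time critical section never has to
 * extend the queue.
 */
if (asyncQueueSlotIsLastOnPage(usedEntries + asyncQueueControl->reservedEntries))
    asyncQueueReserveNextPage();         /* illustrative helper */

asyncQueueControl->reservedEntries++;
LWLockRelease(NotifyQueueLock);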
I’ve included a TAP test to ensure we error out if we exceed the max
notifications:
src/test/modules/test_listen_notify/t/002_queue_full.pl
> 4) I don't know, but It seems to me that XactLogCommitRecord is not
> the right place for adding listen/notify logic, it seems to be only
> about wal stuff.
Hmm, I understand the concern, but we do need to ensure the commit WAL
record and the notify queue entry are done atomically, and we need to
hold the lock for the shortest time possible. So I think maybe we can
make a tradeoff here for performance over code compartmentalization?
> 5) Do we want to use NotifyQueueLock as a lock that provides the
> commit order? Maybe we can introduce another lock to avoid unnecessary
> contantion? For example, if we use NotifyQueueLock, we block listeners
> that want to read new notifications while we are inserting a commit
> record, which seems unnecessary.
I think we should stick to using this lock, as we write to the notify
queue while holding this lock. We aren't holding it for a particularly
long time, so this should be okay. Additionally this allows listeners
to assume entries in the queue are from committed transactions. In
the current implementation we also use this lock for writing to the
queue.
> 6) We have fixed size queue entries, so I think we don't need this
> "padding" logic at the end of the page anymore, because we know how
> many entries we can have on each page.
Thanks, I've removed this in the new patch.
On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With your patch, since the backends get the notification by reading
> WAL records do we need to prevent WAL records that potentially have
> unconsumed notification from being removed by the checkpointer? Or we
> can save unconsumed notifications in WAL records to somewhere during
> the checkpoint as we do for 2PC transactions.
Yes, we will need to prevent notify data in the WAL from being removed
by the checkpointer. I have come up with the following "WAL pinning"
solution to make sure we do not remove WAL that is still needed:
1. Maintain an array of notify_data LSNs, of length MAX_BACKENDS, for
uncommitted notify data.
It is possible that the LSN of the data for an uncommitted notification
is lower than the minimum LSN across the committed entries. If we only
looked at the committed data and removed WAL records below that
minimum, and the uncommitted transaction eventually committed, we would
no longer be able to read its notification data.
When we enter PreCommit_Notify, we:
- grab NotifyQueueLock in exclusive mode,
- log the notification record and get the notify data LSN,
- find a slot in the UncommittedNotifies array and record the notify
data LSN, along with the xid,
- if we come across any entries whose transaction is no longer in
progress, mark their slots as not in use and set the LSN to
InvalidXLogRecPtr,
- release NotifyQueueLock.
2. Maintain an array of per-page minimum notify_data LSNs, of length
max_notify_queue_pages, to track the minimum LSN (and corresponding
xact) needed for notify data across the currently active pages in the
notify queue.
- When we advance the tail past a page, we set the min LSN for that
page to InvalidXLogRecPtr.
- When we add an entry into the notify queue, we check against the
current min for the page and, if lower, update the min for the page.
- We also remove the current LSN / xid from the uncommitted LSNs list.
3. In RemoveOldXlogFiles (where we recycle WAL segments):
- Call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
following:
  - grab NotifyQueueLock in exclusive mode,
  - check the minimum LSN across both committed and uncommitted
in-flight notifications,
  - store the minimum LSN needed in notify_oldest,
  - prune any invalid entries in both lists while iterating through
them.
- Then we take the segment of notify_oldest, and if it is > 0 and less
than the currently computed cutoff segment, we set
cutoff_seg = oldest_notify_seg - 1, so we keep the segment containing
the oldest needed notify data LSN (rough sketch below).
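On the RemoveOldXlogFiles side, the adjustment is roughly the following
(simplified sketch; variable names follow the description above rather
than the patch verbatim):

XLogRecPtr  notify_oldest = InvalidXLogRecPtr;
XLogSegNo   oldest_notify_seg;

AsyncNotifyOldestRequiredLSN(&notify_oldest);
if (!XLogRecPtrIsInvalid(notify_oldest))
{
    XLByteToSeg(notify_oldest, oldest_notify_seg, wal_segment_size);

    /* keep the segment containing the oldest LSN still needed for NOTIFY */
    if (oldest_notify_seg > 0 && oldest_notify_seg - 1 < cutoff_seg)
        cutoff_seg = oldest_notify_seg - 1;
}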
I’ve included a TAP test that ensures we pin necessary WALs:
src/test/modules/test_listen_notify/t/003_wal_pin_test.pl.
—
PERFORMANCE:
Running the following (pgbench_transaction_notify.sql is attached):
pgbench -c 100 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
I consistently see TPS of 60k - 70k.
Without the patch, the TPS is ~30k.
Additionally, with
pgbench -c 99 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
and 1 listener, we see similar TPS and no missing-WAL errors, so the
listen and WAL retention logic seems correct.
> Also, could you add this patch to the next commit fest if not yet?
Will do!
Thanks for all the feedback on this patch - and apologies for the
delay in coming up with the new patch, as WAL retention was a bit more
complicated than I initially imagined it to be.
The next steps for me are to also test replication and recovery of
notifications.
Please let me know what I need to do to get this into a committable
state.
Thanks,
Rishu
[1]
https://www.postgresql.org/message-id/flat/CAK80%3DjiWjMAfjwJymPoG3%3DFNE%3DhHygn5dXaoax1BWQT1rJrCCw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
Attachments
Attached a new patch that resolves failing tests reported by cfbot.
There were still some failing tests on cfbot related to expecting new
pages in the notify queue to be zeroed; but since we only write a
single compact notify record per transaction, regardless of the size or
number of notifications, we should no longer expect this. I have
updated those tests accordingly.
Attachments
Hi,

On Thu, Sep 25, 2025 at 4:12 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Hello all, creating a new thread here in continuation of [1], as the patch
> has changed direction significantly from the subject there.

There are a lot of improvements, thank you for the new version! I'm
still in the process of reviewing it, but I would like to share several
points:

>
> On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
> > I agree that listeners don't need to check if the notify transaction
> > was committed or not anymore, but it seems that listeners still need
> > to wait until the notify transaction is completed before sending its
> > notifications. I believe we should continue using XidInMVCCSnapshot as
> > the current version does.
>
> Since we hold NotifyQueueLock in exclusive mode while writing the compact
> entry in the notification queue, and don’t release it until after
> commit - it is impossible for a listener to come across a notification
> that is not yet committed. In the case of
> a backend crash after writing the notification into the queue and
> before committing,
> we would be in a critical section, so even this is not a scenario
> where we might read
> an uncommitted notification.

Yes, we hold the lock while we are writing the commit record to wal,
but other transactions don't see our transaction as committed and
can't see changes that our transaction done besides notifications
(inserts/updates/deletes) right after we wrote the commit record.
There are a few more things that need to be done before it. IIUC the
moment when a fresh snapshot will show our transaction as 'not in
progress' is right after updating procarray (ProcArrayEndTransaction()
call in CommitTransaction()). So I think we still need
XidInMVCCSnapshot() check in reading path. Without it we can end up in
a situation where the listener got the notification but still can't
see changes notify transaction done because notification was processed
too early.

>
> > 3) I think we still need to check if the queue is full or not while
> > adding new entries and ereport if it is (entries are smaller now, but
> > we still can run out of space). And it seems to me that we should do
> > this check before entering the critical section, because in the
> > critical section it will result in PANIC. Master results in ERROR in
> > case of the full queue, so should we do the same here?
>
> Thanks for bringing this up - in the new patch I have added a check to make sure
> we have enough space in the queue before entering the critical section.
> The logic is as follows - in PreCommit_Notify, before writing the WAL
> for notify
> data, we do the following:
> - Take NotifyQueueLock EXCLUSIVE.
> - Look at the current head, tail, and how many entries are already
> reserved by other backends.
> - If adding one more would exceed:
> - the allowed page window (max_notify_queue_pages), or
> - the total entry cap (pages × entries per page),
> then ERROR out: “queue is full”.
>
> - If our reserved slot would be the last on the page, pre‑create the
> next page so commit won’t stall.
>
> - Increment reservedEntries (our claim) and drop the lock.
>
> I’ve included a TAP test to ensure we error out if we exceed the max
> notifications: src/test/modules/test_listen_notify/t/002_queue_full.pl

I think reserving slots is a great idea. But I have a question about
checking if the queue is full. Why does PreCommit_Notify need three
checks? Isn't it enough to simply check that we don't exceed
max_total_entries?

> > 4) I don't know, but It seems to me that XactLogCommitRecord is not
> > the right place for adding listen/notify logic, it seems to be only
> > about wal stuff.
>
> Hmm, I understand the concern, but we do need to ensure the commit WAL
> record and the notify queue entry are done atomically, and we need to
> hold the lock for the shortest time possible. So I think maybe we can
> make a tradeoff
> here for performance over code compartmentalization?
>

xl_notify.notify_lsn part looks good to me, but, IMHO, we can move
locking and adding to the queue logic one level up into
RecordTransactionCommit() and write it around the
XactLogCommitRecord() call.

> > 5) Do we want to use NotifyQueueLock as a lock that provides the
> > commit order? Maybe we can introduce another lock to avoid unnecessary
> > contantion? For example, if we use NotifyQueueLock, we block listeners
> > that want to read new notifications while we are inserting a commit
> > record, which seems unnecessary.
>
> I think we should stick to using this lock, as we write to the notify
> queue while holding this lock. We aren’t holding it for a particularly
> long time, so this should be okay. Additionally this allows listeners
> to assume entries in the queue are from committed transactions. In
> the current implementation we also use this lock for writing to the
> queue.

Yeah, probably you are right, we are not holding it for long, so maybe
it's not necessary to introduce something else here.

>
> On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote
>
> > With your patch, since the backends get the notification by reading
> > WAL records do we need to prevent WAL records that potentially have
> > unconsumed notification from being removed by the checkpointer? Or we
> > can save unconsumed notifications in WAL records to somewhere during
> > the checkpoint as we do for 2PC transactions.
>
> Yes we will need to prevent notify data in the WAL from being removed by
> checkpointer. I have come up with the following “WAL pinning” solution
> to make sure we do not remove necessary WALs:
>
> 1. Maintain a notify_data lsn array of MAX_BACKENDS for uncommitted
> notify data.
> It is possible that the LSN for the data of an uncommitted
> notification is lower than the min lsn across the committed entries.
> If we only look at the committed data, and remove WAL records below
> the min lsn there, and we eventually commit the uncommitted
> transaction, then we won’t be able to read its notification data.
>
> When we enter PreCommit_Notify,
> - grab NotifyQueueLock in exclusive mode,
> - Log the notification record and get the notify data lsn.
> - Find a slot in theUncommittedNotifies array, and record the notify
> data lsn, along with xid.
> - If we come across any entries where the transaction is no longer in
> progress, set their slots as not in use, and the lsn as
> InvalidXlogRecPtr.
> - Release the NotifyQueueLock
>
> 2. Maintain a min notify_data LSN array of length max_notify_pages, to track the
> Minimum lsn and corresponding xact needed for notify data across
> currently active pages in the notify queue.
>
> - When we advance the tail past a page, we set the min lsn for the
> page to be InvalidXlogRecPtr.
> - When we add an entry into the notify queue, check against the current min for
> the page, and if lower, then update the min for the page.
> - Remove the current lsn / xid from the Uncommitted LSNs list.
>
> 3. In RemoveOldXlogFiles (where we recycle WAL logs)
>
> - call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
> following:
> - grab NotifiyQueueLock in exclusive mode - check the min lsn
> across both committed and uncommitted in flight notifications
> - store the min lsn needed in notify_oldest
> - prune any invalid entries in both lists while iterating through them.
>
> - then we get the segment of notify_oldest, and if it is > 0, and less
> than the currently
> computed cutoff segment, set cutoff_seg = oldest_notify_seg - 1, so we maintain
> the segment with the oldest needed notify data lsn.
>
> I’ve included a TAP test that ensures we pin necessary WALs:
> src/test/modules/test_listen_notify/t/003_wal_pin_test.pl.
>

I don't have enough knowledge to say if it's the right approach to
prevent wal truncating with unread notifications. But there are two
things about current implementation I noticed. Not sure if it's worth
doing something with it before you know it's the right approach in
general. But anyway, there are two points about it:

We hold NotifyQueueLock while adding the notifications data record to
wal. IIUC one of the points of using wal here was that it lets us to
write notification data in parallel as wal is good for parallel
writes. But with lock it seems like we don't take advantage of it.
IIUC we hold the lock here to prevent truncation of the new record
after we wrote it and before we save its lsn. Is this correct? Maybe
we can solve it somehow differently? For example before writing the
notification data record we can save the current lsn (something like
XactLastRecEnd) in UncommittedNotifies so we can be sure anything
after it will not be truncated, including our record that we are gonna
write. This way we don't need to hold the lock.

We iterate over all entries in UncommittedNotifies every time we want
to add/remove data. What do you think about using MyProcNumber as an
array index instead of iteration as we do with listener positions for
instance? Sounds possible as every backend can't have more than one
uncommitted lsn. Also maybe backends can remove entries from
UncommittedNotifies in AtAbort_Notify() the same way committed
transactions do it in asyncQueueAddCompactEntry()? This way
AsyncNotifyOldestRequiredLSN() will not need to be bothered with
cleaning UncommittedNotifies and checking transaction statuses.

Thank you!

Best regards,
Arseniy Mukhin
Thank you Arseniy for the thoughtful and detailed feedback, I have
addressed the concerns in the new patch.

On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:

> Yes, we hold the lock while we are writing the commit record to wal,
> but other transactions don't see our transaction as committed and
> can't see changes that our transaction done besides notifications
> (inserts/updates/deletes) right after we wrote the commit record.
> There are a few more things that need to be done before it. IIUC the
> moment when a fresh snapshot will show our transaction as 'not in
> progress' is right after updating procarray (ProcArrayEndTransaction()
> call in CommitTransaction()). So I think we still need
> XidInMVCCSnapshot() check in reading path. Without it we can end up in
> a situation where the listener got the notification but still can't
> see changes notify transaction done because notification was processed
> too early.

Good catch, I've fixed this in the new patch.

> I think reserving slots is a great idea. But I have a question about
> checking if the queue is full. Why does PreCommit_Notify need three
> checks? Isn't it enough to simply check that we don't exceed
> max_total_entries?

After taking a closer look, you are right in that we don't need all
three checks; however, it isn't enough to check that we don't exceed
max_total_entries. The reason for this is that we can exceed
max_total_pages without necessarily exceeding max_total_entries, in the
case that the oldest active entry is not the first entry on the tail
page. So actually the only check we need is to confirm that when we
zero a new page, the total number of pages is still <= max_total_pages.
I have updated this in the new patch.

> xl_notify.notify_lsn part looks good to me, but, IMHO, we can move
> locking and adding to the queue logic one level up into
> RecordTransactionCommit() and write it around the
> XactLogCommitRecord() call.

Fixed in the new patch.

>
>
> I don't have enough knowledge to say if it's the right approach to
> prevent wal truncating with unread notifications. But there are two
> things about current implementation I noticed. Not sure if it's worth
> doing something with it before you know it's the right approach in
> general. But anyway, there are two points about it:
>
> We hold NotifyQueueLock while adding the notifications data record to
> wal. IIUC one of the points of using wal here was that it lets us to
> write notification data in parallel as wal is good for parallel
> writes. But with lock it seems like we don't take advantage of it.
> IIUC we hold the lock here to prevent truncation of the new record
> after we wrote it and before we save its lsn. Is this correct? Maybe
> we can solve it somehow differently? For example before writing the
> notification data record we can save the current lsn (something like
> XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> after it will not be truncated, including our record that we are gonna
> write. This way we don't need to hold the lock.
>
> We iterate over all entries in UncommittedNotifies every time we want
> to add/remove data. What do you think about using MyProcNumber as an
> array index instead of iteration as we do with listener positions for
> instance? Sounds possible as every backend can't have more than one
> uncommitted lsn. Also maybe backends can remove entries from
> UncommittedNotifies in AtAbort_Notify() the same way committed
> transactions do it in asyncQueueAddCompactEntry()? This way
> AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> cleaning UncommittedNotifies and checking transaction statuses.

After taking another look, I realized there is a cleaner approach that
does not need the use of UncommittedNotifiesArray in the first place,
and also allows us to keep the parallel WAL writes for notifications.
We can simply have the recycler iterate through the procarray instead,
and find the lowest value of a proc's notifyDataLsn (renamed from
notifyCommitLsn in prev patch). There is a small window where the
notify data WAL record has been written, and its lsn hasn't been
assigned to MyProc->notifyDataLsn, so to make sure we don't lose any
records, before we write the data record, we set MyProc->notifyDataLsn
to the current lsn pointer using GetXLogInsertRecPtr(). If a Proc has a
non-invalid value for notifyDataLsn, then we know that it has
uncommitted notification data written to the WAL, at or above that
value.

The performance remains similar with these changes. I have also
rebased the patch on the latest HEAD.

Thanks,
Rishu
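P.S. To make the above a bit more concrete, the write-side ordering and
the recycler-side scan look roughly like this (simplified sketch; the
attached patch is authoritative):

/* Write side, in PreCommit_Notify: publish a conservative lower bound
 * before inserting the notification data record, so the checkpointer
 * can never truncate WAL we are about to depend on. */
MyProc->notifyDataLsn = GetXLogInsertRecPtr();
pg_write_barrier();
/* ... then insert the notify data record, which necessarily ends up at
 * or above the LSN published above ... */

/* Recycler side, in AsyncNotifyOldestRequiredLSN: take the minimum over
 * all backends' in-flight notification data LSNs. */
XLogRecPtr  oldest = InvalidXLogRecPtr;

LWLockAcquire(ProcArrayLock, LW_SHARED);
for (int i = 0; i < MaxBackends; i++)
{
    PGPROC     *proc = GetPGProcByNumber(i);
    XLogRecPtr  lsn = proc->notifyDataLsn;

    if (!XLogRecPtrIsInvalid(lsn) &&
        (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
        oldest = lsn;
}
LWLockRelease(ProcArrayLock);
/* 'oldest' is then combined with the per-page minimums for committed
 * entries before deciding the WAL cutoff. */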
Attachments
Hi,
Thank you for the new patch version!
On Tue, Oct 14, 2025 at 12:14 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Thank you Arseniy for the thoughtful and detailed feedback, I have
> addressed the concerns in the new patch.
>
> On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
>
> > Yes, we hold the lock while we are writing the commit record to wal,
> > but other transactions don't see our transaction as committed and
> > can't see changes that our transaction done besides notifications
> > (inserts/updates/deletes) right after we wrote the commit record.
> > There are a few more things that need to be done before it. IIUC the
> > moment when a fresh snapshot will show our transaction as 'not in
> > progress' is right after updating procarray (ProcArrayEndTransaction()
> > call in CommitTransaction()). So I think we still need
> > XidInMVCCSnapshot() check in reading path. Without it we can end up in
> > a situation where the listener got the notification but still can't
> > see changes notify transaction done because notification was processed
> > too early.
>
> Good catch, I've fixed this in the new patch.
Hmmm, looks like we still have the same problem in the new version if
I'm not missing something:
Snapshot snap = ActiveSnapshotSet() ? GetActiveSnapshot() : NULL;
...
if (qe->dboid == MyDatabaseId && snap != NULL &&
    XidInMVCCSnapshot(qe->xid, snap))
Here we almost always have snap = NULL (except when we are in
Exec_ListenPreCommit maybe), since listeners read notifications when
they are outside of an active transaction. Why do we need the snap
nullability here? IIUC we can do the XidInMVCCSnapshot check the same
way as we do it in master.
Also the comment
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
seems to still be relevant, so I think it's worth keeping.
>
> > I think reserving slots is a great idea. But I have a question about
> > checking if the queue is full. Why does PreCommit_Notify need three
> > checks? Isn't it enough to simply check that we don't exceed
> > max_total_entries?
>
> After taking a closer look, you are right in that we don't need all three
> checks; however, it isn't enough to check that we don't exceed
> max_total_entries.
> The reason for this is that we can exceed max_total_pages without necessarily
> exceeding max_total_entries, in the case that the oldest active entry is not the
> first entry on the tail page. So actually the only test we need is to
> confirm that
> when we zero a new page, that the total number of pages is still <=
> max_total_pages.
> I have updated this in the new patch.
LGTM
> > >
>
> > I don't have enough knowledge to say if it's the right approach to
> > prevent wal truncating with unread notifications. But there are two
> > things about current implementation I noticed. Not sure if it's worth
> > doing something with it before you know it's the right approach in
> > general. But anyway, there are two points about it:
> >
> > We hold NotifyQueueLock while adding the notifications data record to
> > wal. IIUC one of the points of using wal here was that it lets us to
> > write notification data in parallel as wal is good for parallel
> > writes. But with lock it seems like we don't take advantage of it.
> > IIUC we hold the lock here to prevent truncation of the new record
> > after we wrote it and before we save its lsn. Is this correct? Maybe
> > we can solve it somehow differently? For example before writing the
> > notification data record we can save the current lsn (something like
> > XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> > after it will not be truncated, including our record that we are gonna
> > write. This way we don't need to hold the lock.
> >
> > We iterate over all entries in UncommittedNotifies every time we want
> > to add/remove data. What do you think about using MyProcNumber as an
> > array index instead of iteration as we do with listener positions for
> > instance? Sounds possible as every backend can't have more than one
> > uncommitted lsn. Also maybe backends can remove entries from
> > UncommittedNotifies in AtAbort_Notify() the same way committed
> > transactions do it in asyncQueueAddCompactEntry()? This way
> > AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> > cleaning UncommittedNotifies and checking transaction statuses.
>
> After taking another look, I realized there is a cleaner approach that
> does not need the use of UncommittedNotifiesArray in the first place,
> and also allows us to keep the parallel WAL writes for notifications.
> We can simply have the recycler iterate through the procarray instead, and find
> the lowest value of a proc's notifyDataLsn. (renamed from
> notifyCommitLsn in prev patch)
> There is a small window where the notify data WAL record has been written,
> and its lsn hasn't been assigned to MyProc->notifyDataLsn, so to make
> sure we don't
> lose any records, before we write the data record, we set
> MyProc->notifyDataLsn to
> the current lsn pointer using GetXLogInsertRecPtr(). If a Proc has a
> non-invalid value for
> notifyDataLsn, then we know that it has uncommitted notification data
> written to the WAL,
> at or above that value.
Looks like you have solved both issues with it. Several points about
wal pin logic:
1) I noticed that we can block wal truncating sometimes.
NotifyPageMins array works great when we have a consistent flow of
notifications and can truncate old queue segments regularly. But what
if we don't have it? Please try the next example:
Send a notification first:
NOTIFY ch, 'a';
Then do something that makes wal grow:
create table if not exists t1 (a text);
insert into t1 (a) select gen_random_uuid()
from generate_series(1,10000000);
You will see that wal grows without truncation. It seems that to solve
the issue we should not consider queue entries before the queue tail
in AsyncNotifyOldestRequiredLSN as we don't need their data anymore.
2) I'm worried if it's ok to iterate like this:
/* Scan per-backend pins under ProcArrayLock */
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (int i = 0; i < MaxBackends; i++)
{
    PGPROC *proc = GetPGProcByNumber(i);
We use MaxBackends here, so it looks like we can access a PGPROC that
isn't assigned to any existing backend. I failed to find any examples
in the code that would do the same. Maybe we don't need to deal with
the MyProc structure here at all? We can do the same thing that we do
for listener positions. Store it in QueueBackendStatus and directly
access it with MyProcNumber when we want to update it for our backend.
And in AsyncNotifyOldestRequiredLSN we can iterate over backends the
same way we do it asyncQueueAdvanceTail. What do you think?
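Something like this, just to illustrate the idea (untested sketch;
names are made up along the lines of the existing QUEUE_BACKEND_*
macros in async.c):

/* new field in QueueBackendStatus plus an accessor */
#define QUEUE_BACKEND_NOTIFY_LSN(i) (asyncQueueControl->backend[i].notifyDataLsn)

/* writer side, in PreCommit_Notify: only touches our own slot */
QUEUE_BACKEND_NOTIFY_LSN(MyProcNumber) = GetXLogInsertRecPtr();

/* AsyncNotifyOldestRequiredLSN: scan all backend slots */
XLogRecPtr  oldest = InvalidXLogRecPtr;

for (int i = 0; i < MaxBackends; i++)
{
    XLogRecPtr  lsn = QUEUE_BACKEND_NOTIFY_LSN(i);

    if (!XLogRecPtrIsInvalid(lsn) &&
        (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
        oldest = lsn;
}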
3) It seems to me that we can miss some notification data record's lsn
in AsyncNotifyOldestRequiredLSN. Let's consider the next schedule:

Transaction                                  Checkpointer
-----------                                  ------------
Notify transaction writes notification
data wal record and updates its
notifyDataLsn
                                             AsyncNotifyOldestRequiredLSN call.
                                             We iterate over NotifyPageMins. It
                                             doesn't contain transaction's
                                             notifyDataLsn yet.
Transaction is committed, NotifyPageMins
is updated, notifyDataLsn is cleared.
                                             We iterate over notifyDataLsns. Again
                                             we don't see notifyDataLsn here as it
                                             has already been removed.

Is this real or am I missing something? I see there was the lock in v6
that could prevent it, but probably we can just swap the NotifyPageMins
and notifyDataLsns parts in AsyncNotifyOldestRequiredLSN to solve the
problem?
>
> The performance remains similar with these changes. I have also
> rebased the patch on the latest HEAD.
>
It was surprising that exclusive locking when writing notification
data to wal doesn't affect the result. I tried to do some benchmarking
too:
Tests:
All benchmarks were done without listeners, as we try to improve the
notify path.
insert_only:
BEGIN;
INSERT INTO t1 VALUES(1);
COMMIT;
insert_with_notify:
BEGIN;
INSERT INTO t1 VALUES(1);
NOTIFY benchmark_channel, :payload;
COMMIT;
notify_only:
BEGIN;
NOTIFY benchmark_channel, :payload;
COMMIT;
pgbench -n -c 100 -j 10 -f $script.sql postgres -T 100
Results:
version   test                payload_len   tps
master    insert_only         10            30981.603164
v6        insert_only         10            34940.927688
v7        insert_only         10            30933.015773
master    insert_with_notify  10            817.851691
v6        insert_with_notify  10            33761.706615
v7        insert_with_notify  10            33870.474923
master    notify_only         10            87750.506562
v6        notify_only         10            36141.169978
v7        notify_only         10            34261.854909
I think insert_only is interesting as a reference point that we want
to get as close to as possible with the patch when we measure
insert_with_notify. insert_with_notify shows a stable 30-40x boost with
the patch, and it's quite close to the insert_only tps for v6 and v7,
which is great.
I haven't seen differences between v6 and v7. I thought maybe the
payload size makes the difference, but there was no difference when the
payload length was 2000. It's surprising.
Also an interesting thing is the degradation in the notify_only test:
master is 2.5 times faster than the patched version. Don't know why it
happens.
Also there was some instability in results (sometimes master
insert_only shows 30k, sometimes 38k), I don't know why, but I think
it doesn't affect conclusions in general.
Thank you!
Best regards,
Arseniy Mukhin