Discussion: [PATCH] Write Notifications Through WAL
Hello all,
creating a new thread here in continuation of [1], as the patch
has changed direction significantly from the subject there.
TLDR:
The attached patch aims to improve performance of the NOTIFY codepath
in the case of concurrent notifying transactions, by eliminating the
need for a global lock to be held from PreCommit_Notify to the end of
the transaction. We accomplish this through a method inspired by Tom
Lane here [2]. The high-level approach is as follows:
Write Path:
1. Write the notification data into WAL in PreCommit_Notify. Store the
LSN in MyProc.
2. Acquire NotifyQueueLock in exclusive mode.
3. Write a compact entry into the notify queue, referencing the LSN
from step 1.
4. Commit the transaction, writing the notify LSN from step 1 inside
the commit record.
5. Release NotifyQueueLock.
Read Path:
Readers walk the notification queue, where they see just the
notify_data_lsn, and then fetch the actual notification data from the
WAL (a rough sketch of the compact entry layout follows the TLDR).
(End TLDR)
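For reference, the compact queue entry only needs to identify the
sending transaction and point at its payload in WAL, so it can be as
small as something like this (field names here are illustrative; the
attached patch has the exact definition):

typedef struct AsyncQueueEntry
{
    TransactionId xid;              /* sender's transaction id */
    Oid           dboid;            /* sender's database */
    XLogRecPtr    notify_data_lsn;  /* WAL location of the notification data */
} AsyncQueueEntry;

Because entries are fixed size, the number of entries per queue page is
a constant, which is also why the end-of-page padding logic from the
previous version is no longer needed.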
Addressing the concerns from the previous patch, and explaining
the latest patch:
On Wed, Sep 10, 2025 at 6:05 AM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> There is also some other tests failing, like isolation, regress and
> others.
These are passing in the new patch, after removing the
extraneous SignalBackends call.
On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:
> I agree that listeners don't need to check if the notify transaction
> was committed or not anymore, but it seems that listeners still need
> to wait until the notify transaction is completed before sending its
> notifications. I believe we should continue using XidInMVCCSnapshot as
> the current version does.
Since we hold NotifyQueueLock in exclusive mode while writing the
compact entry in the notification queue, and don't release it until
after commit - it is impossible for a listener to come across a
notification that is not yet committed. In the case of a backend crash
after writing the notification into the queue and before committing,
we would be in a critical section, so even this is not a scenario where
we might read an uncommitted notification.
> 2) Now we have two calls to SignalBackends (is it intentional?). The
> first in AtCommit_Notify() which seems correct to me. The second in
> XactLogCommitRecord() seems too early, because other backends can't
> process new notifications at this point (if the point about
> XidInMVCCSnapshot is correct). And there is Assert failure (Matheus'
> report is about it).
Fixed this in the new patch (removed the SignalBackends call in xact.c).
> 3) I think we still need to check if the queue is full or not while
> adding new entries and ereport if it is (entries are smaller now, but
> we still can run out of space). And it seems to me that we should do
> this check before entering the critical section, because in the
> critical section it will result in PANIC. Master results in ERROR in
> case of the full queue, so should we do the same here?
Thanks for bringing this up - in the new patch I have added a check to
make sure we have enough space in the queue before entering the
critical section. The logic is as follows: in PreCommit_Notify, before
writing the WAL for the notify data, we do the following (a rough
sketch of this check is included below):
- Take NotifyQueueLock EXCLUSIVE.
- Look at the current head, tail, and how many entries are already
reserved by other backends.
- If adding one more would exceed:
  - the allowed page window (max_notify_queue_pages), or
  - the total entry cap (pages × entries per page),
  then ERROR out: "queue is full".
- If our reserved slot would be the last on the page, pre-create the
next page so commit won't stall.
- Increment reservedEntries (our claim) and drop the lock.
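In code, the reservation check is roughly along these lines (a
simplified sketch; the helper names here are illustrative and the
actual patch differs in details):

/* In PreCommit_Notify, before writing the notify data record to WAL */
int     maxEntries;
int     usedEntries;

LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);

maxEntries  = max_notify_queue_pages * QUEUE_ENTRIES_PER_PAGE;
usedEntries = asyncQueueEntryCount();    /* head minus tail; illustrative helper */

/* erroring out here releases NotifyQueueLock during transaction abort */
if (usedEntries + asyncQueueControl->reservedEntries + 1 > maxEntries)
    ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("too many notifications in the NOTIFY queue")));

/*
 * If our reserved slot would be the last one on its page, create the
 * next page now, so the commit-time critical section never has to
 * extend the queue.
 */
if (asyncQueueSlotIsLastOnPage(usedEntries + asyncQueueControl->reservedEntries))
    asyncQueueReserveNextPage();         /* illustrative helper */

asyncQueueControl->reservedEntries++;
LWLockRelease(NotifyQueueLock);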
I’ve included a TAP test to ensure we error out if we exceed the max
notifications:
src/test/modules/test_listen_notify/t/002_queue_full.pl
> 4) I don't know, but It seems to me that XactLogCommitRecord is not
> the right place for adding listen/notify logic, it seems to be only
> about wal stuff.
Hmm, I understand the concern, but we do need to ensure the commit WAL
record and the notify queue entry are done atomically, and we need to
hold the lock for the shortest time possible. So I think maybe we can
make a tradeoff here for performance over code compartmentalization?
> 5) Do we want to use NotifyQueueLock as a lock that provides the
> commit order? Maybe we can introduce another lock to avoid unnecessary
> contantion? For example, if we use NotifyQueueLock, we block listeners
> that want to read new notifications while we are inserting a commit
> record, which seems unnecessary.
I think we should stick to using this lock, as we write to the notify
queue while holding this lock. We aren't holding it for a particularly
long time, so this should be okay. Additionally this allows listeners
to assume entries in the queue are from committed transactions. In
the current implementation we also use this lock for writing to the
queue.
> 6) We have fixed size queue entries, so I think we don't need this
> "padding" logic at the end of the page anymore, because we know how
> many entries we can have on each page.
Thanks, I've removed this in the new patch.
On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> With your patch, since the backends get the notification by reading
> WAL records do we need to prevent WAL records that potentially have
> unconsumed notification from being removed by the checkpointer? Or we
> can save unconsumed notifications in WAL records to somewhere during
> the checkpoint as we do for 2PC transactions.
Yes, we will need to prevent notify data in the WAL from being removed
by the checkpointer. I have come up with the following "WAL pinning"
solution to make sure we do not remove WAL that is still needed:
1. Maintain an array of notify_data LSNs, of length MAX_BACKENDS, for
uncommitted notify data.
It is possible that the LSN of the data for an uncommitted notification
is lower than the minimum LSN across the committed entries. If we only
looked at the committed data and removed WAL records below that
minimum, and the uncommitted transaction eventually committed, we would
no longer be able to read its notification data.
When we enter PreCommit_Notify, we:
- grab NotifyQueueLock in exclusive mode,
- log the notification record and get the notify data LSN,
- find a slot in the UncommittedNotifies array and record the notify
data LSN, along with the xid,
- if we come across any entries whose transaction is no longer in
progress, mark their slots as not in use and set the LSN to
InvalidXLogRecPtr,
- release NotifyQueueLock.
2. Maintain an array of per-page minimum notify_data LSNs, of length
max_notify_queue_pages, to track the minimum LSN (and corresponding
xact) needed for notify data across the currently active pages in the
notify queue.
- When we advance the tail past a page, we set the min LSN for that
page to InvalidXLogRecPtr.
- When we add an entry into the notify queue, we check against the
current min for the page and, if lower, update the min for the page.
- We also remove the current LSN / xid from the uncommitted LSNs list.
3. In RemoveOldXlogFiles (where we recycle WAL segments):
- Call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
following:
  - grab NotifyQueueLock in exclusive mode,
  - check the minimum LSN across both committed and uncommitted
in-flight notifications,
  - store the minimum LSN needed in notify_oldest,
  - prune any invalid entries in both lists while iterating through
them.
- Then we take the segment of notify_oldest, and if it is > 0 and less
than the currently computed cutoff segment, we set
cutoff_seg = oldest_notify_seg - 1, so we keep the segment containing
the oldest needed notify data LSN (rough sketch below).
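On the RemoveOldXlogFiles side, the adjustment is roughly the following
(simplified sketch; variable names follow the description above rather
than the patch verbatim):

XLogRecPtr  notify_oldest = InvalidXLogRecPtr;
XLogSegNo   oldest_notify_seg;

AsyncNotifyOldestRequiredLSN(&notify_oldest);
if (!XLogRecPtrIsInvalid(notify_oldest))
{
    XLByteToSeg(notify_oldest, oldest_notify_seg, wal_segment_size);

    /* keep the segment containing the oldest LSN still needed for NOTIFY */
    if (oldest_notify_seg > 0 && oldest_notify_seg - 1 < cutoff_seg)
        cutoff_seg = oldest_notify_seg - 1;
}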
I’ve included a TAP test that ensures we pin necessary WALs:
src/test/modules/test_listen_notify/t/003_wal_pin_test.pl.
—
PERFORMANCE:
Running the following (pgbench_transaction_notify.sql is attached):
pgbench -c 100 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
I consistently see TPS of 60k - 70k.
Without the patch, the TPS is ~30k.
Additionally, with
pgbench -c 99 -T 100 -j 8 -f pgbench_transaction_notify.sql -d postgres
and 1 listener, we see similar TPS and no missing-WAL errors, so the
listen and WAL retention logic seems correct.
> Also, could you add this patch to the next commit fest if not yet?
Will do!
Thanks for all the feedback on this patch - and apologies for the
delay in coming up with the new patch, as WAL retention was a bit more
complicated than I initially imagined it to be.
The next steps for me are to also test replication and recovery of
notifications.
Please let me know what I need to do to get this into a committable
state.
Thanks,
Rishu
[1]
https://www.postgresql.org/message-id/flat/CAK80%3DjiWjMAfjwJymPoG3%3DFNE%3DhHygn5dXaoax1BWQT1rJrCCw%40mail.gmail.com
[2] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
Attachments
Attached a new patch that resolves failing tests reported by cfbot.
There were still some failing tests on cfbot related to expecting new
pages in the notify queue to be zeroed; but since we only write a
single compact notify record per transaction, regardless of the size or
number of notifications, we should no longer expect this. I have
updated those tests accordingly.
Attachments
Hi,

On Thu, Sep 25, 2025 at 4:12 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Hello all, creating a new thread here in continuation of [1], as the patch
> has changed direction significantly from the subject there.

There are a lot of improvements, thank you for the new version! I'm
still in the process of reviewing it, but I would like to share several
points:

>
> On Wed, Sep 10, 2025 at 12:00 PM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
> > I agree that listeners don't need to check if the notify transaction
> > was committed or not anymore, but it seems that listeners still need
> > to wait until the notify transaction is completed before sending its
> > notifications. I believe we should continue using XidInMVCCSnapshot as
> > the current version does.
>
> Since we hold NotifyQueueLock in exclusive mode while writing the compact
> entry in the notification queue, and don’t release it until after
> commit - it is impossible for a listener to come across a notification
> that is not yet committed. In the case of
> a backend crash after writing the notification into the queue and
> before committing,
> we would be in a critical section, so even this is not a scenario
> where we might read
> an uncommitted notification.

Yes, we hold the lock while we are writing the commit record to wal,
but other transactions don't see our transaction as committed and
can't see changes that our transaction done besides notifications
(inserts/updates/deletes) right after we wrote the commit record.
There are a few more things that need to be done before it. IIUC the
moment when a fresh snapshot will show our transaction as 'not in
progress' is right after updating procarray (ProcArrayEndTransaction()
call in CommitTransaction()). So I think we still need
XidInMVCCSnapshot() check in reading path. Without it we can end up in
a situation where the listener got the notification but still can't
see changes notify transaction done because notification was processed
too early.

>
> > 3) I think we still need to check if the queue is full or not while
> > adding new entries and ereport if it is (entries are smaller now, but
> > we still can run out of space). And it seems to me that we should do
> > this check before entering the critical section, because in the
> > critical section it will result in PANIC. Master results in ERROR in
> > case of the full queue, so should we do the same here?
>
> Thanks for bringing this up - in the new patch I have added a check to make sure
> we have enough space in the queue before entering the critical section.
> The logic is as follows - in PreCommit_Notify, before writing the WAL
> for notify
> data, we do the following:
> - Take NotifyQueueLock EXCLUSIVE.
> - Look at the current head, tail, and how many entries are already
> reserved by other backends.
> - If adding one more would exceed:
> - the allowed page window (max_notify_queue_pages), or
> - the total entry cap (pages × entries per page),
> then ERROR out: “queue is full”.
>
> - If our reserved slot would be the last on the page, pre‑create the
> next page so commit won’t stall.
>
> - Increment reservedEntries (our claim) and drop the lock.
>
> I’ve included a TAP test to ensure we error out if we exceed the max
> notifications: src/test/modules/test_listen_notify/t/002_queue_full.pl

I think reserving slots is a great idea. But I have a question about
checking if the queue is full. Why does PreCommit_Notify need three
checks? Isn't it enough to simply check that we don't exceed
max_total_entries?

> > 4) I don't know, but It seems to me that XactLogCommitRecord is not
> > the right place for adding listen/notify logic, it seems to be only
> > about wal stuff.
>
> Hmm, I understand the concern, but we do need to ensure the commit WAL
> record and the notify queue entry are done atomically, and we need to
> hold the lock for the shortest time possible. So I think maybe we can
> make a tradeoff
> here for performance over code compartmentalization?
>

xl_notify.notify_lsn part looks good to me, but, IMHO, we can move
locking and adding to the queue logic one level up into
RecordTransactionCommit() and write it around the
XactLogCommitRecord() call.

> > 5) Do we want to use NotifyQueueLock as a lock that provides the
> > commit order? Maybe we can introduce another lock to avoid unnecessary
> > contantion? For example, if we use NotifyQueueLock, we block listeners
> > that want to read new notifications while we are inserting a commit
> > record, which seems unnecessary.
>
> I think we should stick to using this lock, as we write to the notify
> queue while holding this lock. We aren’t holding it for a particularly
> long time, so this should be okay. Additionally this allows listeners
> to assume entries in the queue are from committed transactions. In
> the current implementation we also use this lock for writing to the
> queue.

Yeah, probably you are right, we are not holding it for long, so maybe
it's not necessary to introduce something else here.

>
> On Wed, Sep 10, 2025 at 4:31 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote
>
> > With your patch, since the backends get the notification by reading
> > WAL records do we need to prevent WAL records that potentially have
> > unconsumed notification from being removed by the checkpointer? Or we
> > can save unconsumed notifications in WAL records to somewhere during
> > the checkpoint as we do for 2PC transactions.
>
> Yes we will need to prevent notify data in the WAL from being removed by
> checkpointer. I have come up with the following “WAL pinning” solution
> to make sure we do not remove necessary WALs:
>
> 1. Maintain a notify_data lsn array of MAX_BACKENDS for uncommitted
> notify data.
> It is possible that the LSN for the data of an uncommitted
> notification is lower than the min lsn across the committed entries.
> If we only look at the committed data, and remove WAL records below
> the min lsn there, and we eventually commit the uncommitted
> transaction, then we won’t be able to read its notification data.
>
> When we enter PreCommit_Notify,
> - grab NotifyQueueLock in exclusive mode,
> - Log the notification record and get the notify data lsn.
> - Find a slot in theUncommittedNotifies array, and record the notify
> data lsn, along with xid.
> - If we come across any entries where the transaction is no longer in
> progress, set their slots as not in use, and the lsn as
> InvalidXlogRecPtr.
> - Release the NotifyQueueLock
>
> 2. Maintain a min notify_data LSN array of length max_notify_pages, to track the
> Minimum lsn and corresponding xact needed for notify data across
> currently active pages in the notify queue.
>
> - When we advance the tail past a page, we set the min lsn for the
> page to be InvalidXlogRecPtr.
> - When we add an entry into the notify queue, check against the current min for
> the page, and if lower, then update the min for the page.
> - Remove the current lsn / xid from the Uncommitted LSNs list.
>
> 3. In RemoveOldXlogFiles (where we recycle WAL logs)
>
> - call AsyncNotifyOldestRequiredLSN(&notify_oldest), which will do the
> following:
> - grab NotifiyQueueLock in exclusive mode - check the min lsn
> across both committed and uncommitted in flight notifications
> - store the min lsn needed in notify_oldest
> - prune any invalid entries in both lists while iterating through them.
>
> - then we get the segment of notify_oldest, and if it is > 0, and less
> than the currently
> computed cutoff segment, set cutoff_seg = oldest_notify_seg - 1, so we maintain
> the segment with the oldest needed notify data lsn.
>
> I’ve included a TAP test that ensures we pin necessary WALs:
> src/test/modules/test_listen_notify/t/003_wal_pin_test.pl.
>

I don't have enough knowledge to say if it's the right approach to
prevent wal truncating with unread notifications. But there are two
things about current implementation I noticed. Not sure if it's worth
doing something with it before you know it's the right approach in
general. But anyway, there are two points about it:

We hold NotifyQueueLock while adding the notifications data record to
wal. IIUC one of the points of using wal here was that it lets us to
write notification data in parallel as wal is good for parallel
writes. But with lock it seems like we don't take advantage of it.
IIUC we hold the lock here to prevent truncation of the new record
after we wrote it and before we save its lsn. Is this correct? Maybe
we can solve it somehow differently? For example before writing the
notification data record we can save the current lsn (something like
XactLastRecEnd) in UncommittedNotifies so we can be sure anything
after it will not be truncated, including our record that we are gonna
write. This way we don't need to hold the lock.

We iterate over all entries in UncommittedNotifies every time we want
to add/remove data. What do you think about using MyProcNumber as an
array index instead of iteration as we do with listener positions for
instance? Sounds possible as every backend can't have more than one
uncommitted lsn. Also maybe backends can remove entries from
UncommittedNotifies in AtAbort_Notify() the same way committed
transactions do it in asyncQueueAddCompactEntry()? This way
AsyncNotifyOldestRequiredLSN() will not need to be bothered with
cleaning UncommittedNotifies and checking transaction statuses.

Thank you!

Best regards,
Arseniy Mukhin
Thank you Arseniy for the thoughtful and detailed feedback, I have
addressed the concerns in the new patch.

On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
<arseniy.mukhin.dev@gmail.com> wrote:

> Yes, we hold the lock while we are writing the commit record to wal,
> but other transactions don't see our transaction as committed and
> can't see changes that our transaction done besides notifications
> (inserts/updates/deletes) right after we wrote the commit record.
> There are a few more things that need to be done before it. IIUC the
> moment when a fresh snapshot will show our transaction as 'not in
> progress' is right after updating procarray (ProcArrayEndTransaction()
> call in CommitTransaction()). So I think we still need
> XidInMVCCSnapshot() check in reading path. Without it we can end up in
> a situation where the listener got the notification but still can't
> see changes notify transaction done because notification was processed
> too early.

Good catch, I've fixed this in the new patch.

> I think reserving slots is a great idea. But I have a question about
> checking if the queue is full. Why does PreCommit_Notify need three
> checks? Isn't it enough to simply check that we don't exceed
> max_total_entries?

After taking a closer look, you are right in that we don't need all
three checks; however, it isn't enough to check that we don't exceed
max_total_entries. The reason for this is that we can exceed
max_total_pages without necessarily exceeding max_total_entries, in the
case that the oldest active entry is not the first entry on the tail
page. So actually the only check we need is to confirm that when we
zero a new page, the total number of pages is still <= max_total_pages.
I have updated this in the new patch.

> xl_notify.notify_lsn part looks good to me, but, IMHO, we can move
> locking and adding to the queue logic one level up into
> RecordTransactionCommit() and write it around the
> XactLogCommitRecord() call.

Fixed in the new patch.

>
>
> I don't have enough knowledge to say if it's the right approach to
> prevent wal truncating with unread notifications. But there are two
> things about current implementation I noticed. Not sure if it's worth
> doing something with it before you know it's the right approach in
> general. But anyway, there are two points about it:
>
> We hold NotifyQueueLock while adding the notifications data record to
> wal. IIUC one of the points of using wal here was that it lets us to
> write notification data in parallel as wal is good for parallel
> writes. But with lock it seems like we don't take advantage of it.
> IIUC we hold the lock here to prevent truncation of the new record
> after we wrote it and before we save its lsn. Is this correct? Maybe
> we can solve it somehow differently? For example before writing the
> notification data record we can save the current lsn (something like
> XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> after it will not be truncated, including our record that we are gonna
> write. This way we don't need to hold the lock.
>
> We iterate over all entries in UncommittedNotifies every time we want
> to add/remove data. What do you think about using MyProcNumber as an
> array index instead of iteration as we do with listener positions for
> instance? Sounds possible as every backend can't have more than one
> uncommitted lsn. Also maybe backends can remove entries from
> UncommittedNotifies in AtAbort_Notify() the same way committed
> transactions do it in asyncQueueAddCompactEntry()? This way
> AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> cleaning UncommittedNotifies and checking transaction statuses.

After taking another look, I realized there is a cleaner approach that
does not need the use of UncommittedNotifiesArray in the first place,
and also allows us to keep the parallel WAL writes for notifications.
We can simply have the recycler iterate through the procarray instead,
and find the lowest value of a proc's notifyDataLsn (renamed from
notifyCommitLsn in prev patch). There is a small window where the
notify data WAL record has been written, and its lsn hasn't been
assigned to MyProc->notifyDataLsn, so to make sure we don't lose any
records, before we write the data record, we set MyProc->notifyDataLsn
to the current lsn pointer using GetXLogInsertRecPtr(). If a Proc has a
non-invalid value for notifyDataLsn, then we know that it has
uncommitted notification data written to the WAL, at or above that
value.

The performance remains similar with these changes. I have also
rebased the patch on the latest HEAD.

Thanks,
Rishu
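P.S. To make the above a bit more concrete, the write-side ordering and
the recycler-side scan look roughly like this (simplified sketch; the
attached patch is authoritative):

/* Write side, in PreCommit_Notify: publish a conservative lower bound
 * before inserting the notification data record, so the checkpointer
 * can never truncate WAL we are about to depend on. */
MyProc->notifyDataLsn = GetXLogInsertRecPtr();
pg_write_barrier();
/* ... then insert the notify data record, which necessarily ends up at
 * or above the LSN published above ... */

/* Recycler side, in AsyncNotifyOldestRequiredLSN: take the minimum over
 * all backends' in-flight notification data LSNs. */
XLogRecPtr  oldest = InvalidXLogRecPtr;

LWLockAcquire(ProcArrayLock, LW_SHARED);
for (int i = 0; i < MaxBackends; i++)
{
    PGPROC     *proc = GetPGProcByNumber(i);
    XLogRecPtr  lsn = proc->notifyDataLsn;

    if (!XLogRecPtrIsInvalid(lsn) &&
        (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
        oldest = lsn;
}
LWLockRelease(ProcArrayLock);
/* 'oldest' is then combined with the per-page minimums for committed
 * entries before deciding the WAL cutoff. */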
Attachments
Hi,
Thank you for the new patch version!
On Tue, Oct 14, 2025 at 12:14 AM Rishu Bagga <rishu.postgres@gmail.com> wrote:
>
> Thank you Arseniy for the thoughtful and detailed feedback, I have
> addressed the concerns in the new patch.
>
> On Wed, Oct 8, 2025 at 3:07 AM Arseniy Mukhin
> <arseniy.mukhin.dev@gmail.com> wrote:
>
> > Yes, we hold the lock while we are writing the commit record to wal,
> > but other transactions don't see our transaction as committed and
> > can't see changes that our transaction done besides notifications
> > (inserts/updates/deletes) right after we wrote the commit record.
> > There are a few more things that need to be done before it. IIUC the
> > moment when a fresh snapshot will show our transaction as 'not in
> > progress' is right after updating procarray (ProcArrayEndTransaction()
> > call in CommitTransaction()). So I think we still need
> > XidInMVCCSnapshot() check in reading path. Without it we can end up in
> > a situation where the listener got the notification but still can't
> > see changes notify transaction done because notification was processed
> > too early.
>
> Good catch, I've fixed this in the new patch.
Hmmm, looks like we still have the same problem in the new version if
I'm not missing something:
Snapshot snap = ActiveSnapshotSet() ? GetActiveSnapshot() : NULL;
...
if (qe->dboid == MyDatabaseId && snap != NULL &&
    XidInMVCCSnapshot(qe->xid, snap))
Here we almost always have snap = NULL (except when we are in
Exec_ListenPreCommit maybe), since listeners read notifications when
they are outside of an active transaction. Why do we need the snap
nullability here? IIUC we can do the XidInMVCCSnapshot check the same
way as we do it in master.
Also the comment
/*----------
* Get snapshot we'll use to decide which xacts are still in progress.
* This is trickier than it might seem, because of race conditions.
seems to still be relevant, so I think it's worth keeping.
>
> > I think reserving slots is a great idea. But I have a question about
> > checking if the queue is full. Why does PreCommit_Notify need three
> > checks? Isn't it enough to simply check that we don't exceed
> > max_total_entries?
>
> After taking a closer look, you are right in that we don't need all three
> checks; however, it isn't enough to check that we don't exceed
> max_total_entries.
> The reason for this is that we can exceed max_total_pages without necessarily
> exceeding max_total_entries, in the case that the oldest active entry is not the
> first entry on the tail page. So actually the only test we need is to
> confirm that
> when we zero a new page, that the total number of pages is still <=
> max_total_pages.
> I have updated this in the new patch.
LGTM
> > >
>
> > I don't have enough knowledge to say if it's the right approach to
> > prevent wal truncating with unread notifications. But there are two
> > things about current implementation I noticed. Not sure if it's worth
> > doing something with it before you know it's the right approach in
> > general. But anyway, there are two points about it:
> >
> > We hold NotifyQueueLock while adding the notifications data record to
> > wal. IIUC one of the points of using wal here was that it lets us to
> > write notification data in parallel as wal is good for parallel
> > writes. But with lock it seems like we don't take advantage of it.
> > IIUC we hold the lock here to prevent truncation of the new record
> > after we wrote it and before we save its lsn. Is this correct? Maybe
> > we can solve it somehow differently? For example before writing the
> > notification data record we can save the current lsn (something like
> > XactLastRecEnd) in UncommittedNotifies so we can be sure anything
> > after it will not be truncated, including our record that we are gonna
> > write. This way we don't need to hold the lock.
> >
> > We iterate over all entries in UncommittedNotifies every time we want
> > to add/remove data. What do you think about using MyProcNumber as an
> > array index instead of iteration as we do with listener positions for
> > instance? Sounds possible as every backend can't have more than one
> > uncommitted lsn. Also maybe backends can remove entries from
> > UncommittedNotifies in AtAbort_Notify() the same way committed
> > transactions do it in asyncQueueAddCompactEntry()? This way
> > AsyncNotifyOldestRequiredLSN() will not need to be bothered with
> > cleaning UncommittedNotifies and checking transaction statuses.
>
> After taking another look, I realized there is a cleaner approach that
> does not need the use of UncommittedNotifiesArray in the first place,
> and also allows us to keep the parallel WAL writes for notifications.
> We can simply have the recycler iterate through the procarray instead, and find
> the lowest value of a proc's notifyDataLsn. (renamed from
> notifyCommitLsn in prev patch)
> There is a small window where the notify data WAL record has been written,
> and its lsn hasn't been assigned to MyProc->notifyDataLsn, so to make
> sure we don't
> lose any records, before we write the data record, we set
> MyProc->notifyDataLsn to
> the current lsn pointer using GetXLogInsertRecPtr(). If a Proc has a
> non-invalid value for
> notifyDataLsn, then we know that it has uncommitted notification data
> written to the WAL,
> at or above that value.
Looks like you have solved both issues with it. Several points about
wal pin logic:
1) I noticed that we can block wal truncating sometimes.
NotifyPageMins array works great when we have a consistent flow of
notifications and can truncate old queue segments regularly. But what
if we don't have it? Please try the next example:
Send a notification first:
NOTIFY ch, 'a';
Then do something that makes wal grow:
create table if not exists t1 (a text);
insert into t1 (a) select gen_random_uuid()
from generate_series(1,10000000);
You will see that wal grows without truncation. It seems that to solve
the issue we should not consider queue entries before the queue tail
in AsyncNotifyOldestRequiredLSN as we don't need their data anymore.
2) I'm worried if it's ok to iterate like this:
/* Scan per-backend pins under ProcArrayLock */
LWLockAcquire(ProcArrayLock, LW_SHARED);
for (int i = 0; i < MaxBackends; i++)
{
    PGPROC *proc = GetPGProcByNumber(i);
We use MaxBackends here, so it looks like we can access a PGPROC that
isn't assigned to any existing backend. I failed to find any examples
in the code that would do the same. Maybe we don't need to deal with
the MyProc structure here at all? We can do the same thing that we do
for listener positions. Store it in QueueBackendStatus and directly
access it with MyProcNumber when we want to update it for our backend.
And in AsyncNotifyOldestRequiredLSN we can iterate over backends the
same way we do it asyncQueueAdvanceTail. What do you think?
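Something like this, just to illustrate the idea (untested sketch;
names are made up along the lines of the existing QUEUE_BACKEND_*
macros in async.c):

/* new field in QueueBackendStatus plus an accessor */
#define QUEUE_BACKEND_NOTIFY_LSN(i) (asyncQueueControl->backend[i].notifyDataLsn)

/* writer side, in PreCommit_Notify: only touches our own slot */
QUEUE_BACKEND_NOTIFY_LSN(MyProcNumber) = GetXLogInsertRecPtr();

/* AsyncNotifyOldestRequiredLSN: scan all backend slots */
XLogRecPtr  oldest = InvalidXLogRecPtr;

for (int i = 0; i < MaxBackends; i++)
{
    XLogRecPtr  lsn = QUEUE_BACKEND_NOTIFY_LSN(i);

    if (!XLogRecPtrIsInvalid(lsn) &&
        (XLogRecPtrIsInvalid(oldest) || lsn < oldest))
        oldest = lsn;
}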
3) It seems to me that we can miss some notification data record's lsn
in AsyncNotifyOldestRequiredLSN. Let's consider the next schedule:

Transaction                                  Checkpointer
-----------                                  ------------
Notify transaction writes notification
data wal record and updates its
notifyDataLsn
                                             AsyncNotifyOldestRequiredLSN call.
                                             We iterate over NotifyPageMins. It
                                             doesn't contain transaction's
                                             notifyDataLsn yet.
Transaction is committed, NotifyPageMins
is updated, notifyDataLsn is cleared.
                                             We iterate over notifyDataLsns. Again
                                             we don't see notifyDataLsn here as it
                                             has already been removed.

Is this real or am I missing something? I see there was the lock in v6
that could prevent it, but probably we can just swap the NotifyPageMins
and notifyDataLsns parts in AsyncNotifyOldestRequiredLSN to solve the
problem?
>
> The performance remains similar with these changes. I have also
> rebased the patch on the latest HEAD.
>
It was surprising that exclusive locking when writing notification
data to wal doesn't affect the result. I tried to do some benchmarking
too:
Tests:
All benchmarks were done without listeners, as we try to improve the
notify path.
insert_only:
BEGIN;
INSERT INTO t1 VALUES(1);
COMMIT;
insert_with_notify:
BEGIN;
INSERT INTO t1 VALUES(1);
NOTIFY benchmark_channel, :payload;
COMMIT;
notify_only:
BEGIN;
NOTIFY benchmark_channel, :payload;
COMMIT;
pgbench -n -c 100 -j 10 -f $script.sql postgres -T 100
Results:
version   test                payload_len   tps
master    insert_only         10            30981.603164
v6        insert_only         10            34940.927688
v7        insert_only         10            30933.015773
master    insert_with_notify  10            817.851691
v6        insert_with_notify  10            33761.706615
v7        insert_with_notify  10            33870.474923
master    notify_only         10            87750.506562
v6        notify_only         10            36141.169978
v7        notify_only         10            34261.854909
I think insert_only is interesting as a reference point that we want
to get as close to as possible with the patch when we measure
insert_with_notify. insert_with_notify shows a stable 30-40x boost with
the patch, and it's quite close to the insert_only tps for v6 and v7,
which is great.
I haven't seen differences between v6 and v7. I thought maybe the
payload size makes the difference, but there was no difference when the
payload length was 2000. It's surprising.
Also an interesting thing is the degradation in the notify_only test:
master is 2.5 times faster than the patched version. Don't know why it
happens.
Also there was some instability in results (sometimes master
insert_only shows 30k, sometimes 38k), I don't know why, but I think
it doesn't affect conclusions in general.
Thank you!
Best regards,
Arseniy Mukhin