Discussion: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi hackers,

During a performance run [1], I observed heavy polling in
read_local_xlog_page_guts(). Heikki’s comment from a few months ago
suggests replacing the current check–sleep–repeat loop with the
condition-variable (CV) infrastructure used by the walsender:

1) Problem and Background

/*
 * Loop waiting for xlog to be available if necessary
 *
 * TODO: The walsender has its own version of this function, which uses a
 * condition variable to wake up whenever WAL is flushed. We could use the
 * same infrastructure here, instead of the check/sleep/repeat style of
 * loop.
 */

Because read_local_xlog_page_guts() waits for a specific flush or
replay LSN, polling becomes inefficient when waits are long. I built a
POC patch that swaps polling for CVs, but a single global CV (or even
separate “flush” and “replay” CVs) isn’t ideal:

• The wake-up routines don’t know which LSN each waiter cares about,
so they would need to broadcast on every flush/replay.

• Caching the minimum outstanding target LSN could reduce spurious
wake-ups but won’t eliminate them when multiple backends wait for
different LSNs simultaneously.

• The walsender accepts some broadcast overhead via two CVs for
different waiters. A more precise approach would require a request
queue that maps waiters to target LSNs and issues targeted
wake-ups—adding complexity.

2) Proposal

I came across the thread “Implement waiting for WAL LSN replay:
reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that
thread provides a well-established infrastructure for waiting on WAL
replay in backends. With modest adjustments, it could be generalized.

Main changes in patch v1 (Improve read_local_xlog_page_guts by
replacing polling with latch-based waiting):

• Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the
“WAIT FOR” work.

• Replace the busy-wait in read_local_xlog_page_guts() with
WaitForLSNFlush and WaitForLSNReplay.

• Add wake-up calls in XLogFlush and XLogBackgroundFlush.
Edge Case: Timeline Switch During Wait

/*
 * Check which timeline to get the record from.
 *
 * We have to do it each time through the loop because if we're in
 * recovery as a cascading standby, the current timeline might've
 * become historical. We can't rely on RecoveryInProgress() because in
 * a standby configuration like
 *
 * A => B => C
 *
 * if we're a logical decoding session on C, and B gets promoted, our
 * timeline will change while we remain in recovery.
 *
 * We can't just keep reading from the old timeline as the last WAL
 * archive in the timeline will get renamed to .partial by
 * StartupXLOG().
 */

read_local_xlog_page_guts() re-evaluates the active timeline on each
loop iteration because, on a cascading standby, the current timeline
can become historical. Once that happens, there’s no need to keep
waiting for that timeline. A timeline switch could therefore render an
in-progress wait unnecessary.

One option is to add a wake-up at the point where the timeline switch
occurs, so waiting processes exit promptly. The current approach
chooses not to do this, given that most waits are short and timeline
changes on a cascading standby are rare. Supporting timeline-switch
wake-ups would also require additional handling in both
WaitForLSNFlush and WaitForLSNReplay, increasing complexity.

Comments and suggestions are welcome.

[1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com

Best,
Xuneng
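To make the proposed change concrete, here is a toy model of the difference between the old check/sleep/repeat loop and latch/CV-style waiting. This is illustrative only: plain pthreads stand in for PostgreSQL's ConditionVariable machinery, XLogRecPtr is modeled as a bare integer, and advance_flush() plays the role of the wake-up call added in XLogFlush.

```c
#include <pthread.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wal_flushed = PTHREAD_COND_INITIALIZER;
static XLogRecPtr flushed_lsn = 0;

/* Waiter: instead of polling with a fixed sleep, block on the CV and
 * recheck the predicate after every wakeup (CV waits can be spurious,
 * so the loop is still required). */
static XLogRecPtr
wait_for_flush(XLogRecPtr target)
{
    XLogRecPtr result;

    pthread_mutex_lock(&lock);
    while (flushed_lsn < target)
        pthread_cond_wait(&wal_flushed, &lock);
    result = flushed_lsn;
    pthread_mutex_unlock(&lock);
    return result;
}

/* Flusher: advance the flush pointer and broadcast, the analogue of
 * adding a wake-up call in XLogFlush / XLogBackgroundFlush. */
static void
advance_flush(XLogRecPtr to)
{
    pthread_mutex_lock(&lock);
    if (to > flushed_lsn)
    {
        flushed_lsn = to;
        pthread_cond_broadcast(&wal_flushed);
    }
    pthread_mutex_unlock(&lock);
}
```

Note that the broadcast wakes every waiter regardless of its target LSN, which is exactly the false-wakeup cost discussed above.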
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi, Attached the wrong patch v1-0001-Improve-read_local_xlog_page_guts-by-replacing-po.patch. The correct one is attached again. On Wed, Aug 27, 2025 at 11:23 PM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Hi hackers, > > During a performance run [1], I observed heavy polling in > read_local_xlog_page_guts(). Heikki’s comment from a few months ago > suggests replacing the current check–sleep–repeat loop with the > condition-variable (CV) infrastructure used by the walsender: > > 1) Problem and Background > /* > * Loop waiting for xlog to be available if necessary > * > * TODO: The walsender has its own version of this function, which uses a > * condition variable to wake up whenever WAL is flushed. We could use the > * same infrastructure here, instead of the check/sleep/repeat style of > * loop. > */ > > Because read_local_xlog_page_guts() waits for a specific flush or > replay LSN, polling becomes inefficient when waits are long. I built a > POC patch that swaps polling for CVs, but a single global CV (or even > separate “flush” and “replay” CVs) isn’t ideal: > • The wake-up routines don’t know which LSN each waiter cares about, > so they would need to broadcast on every flush/replay. > > • Caching the minimum outstanding target LSN could reduce spurious > wake-ups but won’t eliminate them when multiple backends wait for > different LSNs simultaneously. > > • The walsender accepts some broadcast overhead via two CVs for > different waiters. A more precise approach would require a request > queue that maps waiters to target LSNs and issues targeted > wake-ups—adding complexity. > > 2) Proposal > I came across the thread “Implement waiting for WAL LSN replay: > reloaded” [2] by Alexander. The “Implement WAIT FOR” patch in that > thread provides a well-established infrastructure for waiting on WAL > replay in backends. With modest adjustments, it could be generalized. 
> > Main changes in patch v1 Improve read_local_xlog_page_guts by replacing polling > with latch-based waiting: > • Introduce WaitForLSNFlush, analogous to WaitForLSNReplay from the > “WAIT FOR” work. > > • Replace the busy-wait in read_local_xlog_page_guts() with > WaitForLSNFlush and WaitForLSNReplay. > > • Add wake-up calls in XLogFlush and XLogBackgroundFlush. > > Edge Case: Timeline Switch During Wait > /* > * Check which timeline to get the record from. > * > * We have to do it each time through the loop because if we're in > * recovery as a cascading standby, the current timeline might've > * become historical. We can't rely on RecoveryInProgress() because in > * a standby configuration like > * > * A => B => C > * > * if we're a logical decoding session on C, and B gets promoted, our > * timeline will change while we remain in recovery. > * > * We can't just keep reading from the old timeline as the last WAL > * archive in the timeline will get renamed to .partial by > * StartupXLOG(). > > read_local_xlog_page_guts() re-evaluates the active timeline on each > loop iteration because, on a cascading standby, the current timeline > can become historical. Once that happens, there’s no need to keep > waiting for that timeline. A timeline switch could therefore render an > in-progress wait unnecessary. > > One option is to add a wake-up at the point where the timeline switch > occurs, so waiting processes exit promptly. The current approach > chooses not to do this, given that most waits are short and timeline > changes in cascading standby are rare. Supporting timeline-switch > wake-ups would also require additional handling in both > WaitForLSNFlush and WaitForLSNReplay, increasing complexity. > > Comments and suggestions are welcome. 
> > [1] https://www.postgresql.org/message-id/CABPTF7VuFYm9TtA9vY8ZtS77qsT+yL_HtSDxUFnW3XsdB5b9ew@mail.gmail.com > [2] https://www.postgresql.org/message-id/flat/CAPpHfdsjtZLVzxjGT8rJHCYbM0D5dwkO%2BBBjcirozJ6nYbOW8Q%40mail.gmail.com > > Best, > Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

Some changes in v3:

1) Update the header note of xlogwait.c to reflect that its use has
been extended to flush waiting, and that it now serves internal waits
for both flush and replay.

2) Update the comment above logical_read_xlog_page, which still
described the pre-change behavior of read_local_xlog_page.

Best,
Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> Some changes in v3:
> 1) Update the note of xlogwait.c to reflect the extending of its use
> for flush waiting and internal use for both flush and replay waiting.
> 2) Update the comment above logical_read_xlog_page which describes the
> prior-change behavior of read_local_xlog_page.

In an off-list discussion, Alexander pointed out potential issues with
the current single-heap design for replay and flush when promotion
occurs concurrently with WAIT FOR. The following is a simple example
illustrating the problem.

During promotion, there's a window where we can have mixed waiter
types in the same heap:

T1: Process A calls read_local_xlog_page_guts on standby
T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter
T3: Promotion begins
T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr)
T5: SharedRecoveryState = RECOVERY_STATE_DONE
T6: Process B calls read_local_xlog_page_guts
T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter

The problem is that replay LSNs and flush LSNs represent different
positions in the WAL stream. Having both types in the same heap can
lead to:
- Incorrect wakeup logic (comparing incomparable LSNs)
- Processes waiting forever
- Wrong waiters being woken up

To avoid this problem, patch v4 is updated to use two separate heaps
for flush and replay, as Alexander suggested earlier. It also
introduces a separate minimum-LSN tracking field for flushing.

Best,
Xuneng
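The invariant behind the two-heap design can be sketched in a few lines. This is not the patch's code: the waiter registry is modeled as a flat array rather than a pairing heap, and all names here are hypothetical. The point is only that a wakeup for one LSN kind must never be decided by comparing against a target of the other kind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef enum
{
    WAIT_LSN_REPLAY,            /* waiting for lastReplayedEndRecPtr */
    WAIT_LSN_FLUSH              /* waiting for the flush pointer */
} WaitLSNType;

typedef struct Waiter
{
    WaitLSNType type;
    XLogRecPtr  target;
    bool        woken;
} Waiter;

/* Wake only waiters of the matching kind whose target has been
 * reached; replay and flush LSNs are never compared to each other.
 * Returns the number of waiters woken. */
static int
wakeup(Waiter *waiters, int n, WaitLSNType type, XLogRecPtr current)
{
    int woken = 0;

    for (int i = 0; i < n; i++)
    {
        if (waiters[i].type != type || waiters[i].woken)
            continue;
        if (current >= waiters[i].target)
        {
            waiters[i].woken = true;
            woken++;
        }
    }
    return woken;
}
```

With a single mixed heap, the minimum of the heap could be a replay target while the caller passes a flush LSN, which is how the wrong-wakeup and wait-forever scenarios above arise.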
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi, On Sun, Sep 28, 2025 at 9:47 PM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Hi, > > On Thu, Aug 28, 2025 at 4:22 PM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > > > Hi, > > > > Some changes in v3: > > 1) Update the note of xlogwait.c to reflect the extending of its use > > for flush waiting and internal use for both flush and replay waiting. > > 2) Update the comment above logical_read_xlog_page which describes the > > prior-change behavior of read_local_xlog_page. > > In an off-list discussion, Alexander pointed out potential issues with > the current single-heap design for replay and flush when promotion > occurs concurrently with WAIT FOR. The following is a simple example > illustrating the problem: > > During promotion, there's a window where we can have mixed waiter > types in the same heap: > > T1: Process A calls read_local_xlog_page_guts on standby > T2: RecoveryInProgress() = TRUE, adds to heap as replay waiter > T3: Promotion begins > T4: EndRecovery() calls WaitLSNWakeup(InvalidXLogRecPtr) > T5: SharedRecoveryState = RECOVERY_STATE_DONE > T6: Process B calls read_local_xlog_page_guts > T7: RecoveryInProgress() = FALSE, adds to SAME heap as flush waiter > > The problem is that replay LSNs and flush LSNs represent different > positions in the WAL stream. Having both types in the same heap can > lead to: > - Incorrect wakeup logic (comparing incomparable LSNs) > - Processes waiting forever > - Wrong waiters being woken up > > To avoid this problem, patch v4 is updated to utilize two separate > heaps for flush and replay like Alexander suggested earlier. It also > introduces a new separate min LSN tracking field for flushing. > v5-0002 separates the waitlsn_cmp() comparator function into two distinct functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay and flush heaps, respectively. Best, Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Michael Paquier
Date:
On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote:
> v5-0002 separates the waitlsn_cmp() comparator function into two distinct
> functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay
> and flush heaps, respectively.

The primary goal that you want to achieve here is a replacement of the
wait/sleep logic of read_local_xlog_page_guts() with a condition
variable, and a new facility to make the callback more responsive than
polling. That's a fine idea in itself. However, I would suggest
implementing something that does not depend entirely on WAIT FOR,
which is how your patch is presented. Instead of having your patch
depend on an entirely different feature, it seems to me that you
should try to extract from this other feature the basics that you are
looking for, and make them shared between the WAIT FOR patch and what
you are trying to achieve here. You should not need something as
complex as what the other feature needs for a page read callback in
the backend.

In the end, I suspect that you will reuse only a small portion of it
(or perhaps nothing at all, actually, but I did not look at the full
scope of it). You should present your patch so that it is in a
reviewable state, so that others are able to read it and understand
it. WAIT FOR is much more complex than what you want to do here
because it covers larger areas of the code base and needs to worry
about more cases. So, you should implement things so that the basic
pieces you want to build on top of are simpler, not more complicated.
Simpler means easier to review and easier to catch problems, designed
in a way that depends on how you want to fix your problem, not
designed in a way that depends on how a completely different feature
deals with its own problems.
--
Michael
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi Michael, Thanks for your review. On Fri, Oct 3, 2025 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote: > > v5-0002 separates the waitlsn_cmp() comparator function into two distinct > > functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay > > and flush heaps, respectively. > > The primary goal that you want to achieve here is a replacement of the > wait/sleep logic of read_local_xlog_page_guts() with a condition > variable, and design a new facility to make the callback more > responsive on polls. That's a fine idea in itself. However I would > suggest to implement something that does not depend entirely on WAIT > FOR, which is how your patch is presented. Instead of having your > patch depend on an entirely different feature, it seems to me that you > should try to extract from this other feature the basics that you are > looking for, and make them shared between the WAIT FOR patch and what > you are trying to achieve here. You should not need something as > complex as what the other feature needs for a page read callback in > the backend. > > At the end, I suspect that you will reuse a slight portion of it (or > perhaps nothing at all, actually, but I did not look at the full scope > of it). You should try to present your patch so as it is in a > reviewable state, so as others would be able to read it and understand > it. WAIT FOR is much more complex than what you want to do here > because it covers larger areas of the code base and needs to worry > about more cases. So, you should implement things so as the basic > pieces you want to build on top of are simpler, not more complicated. > Simpler means easier to review and easier to catch problems, designed > in a way that depends on how you want to fix your problem, not > designed in a way that depends on how a completely different feature > deals with its own problems. 
The core infrastructure shared by both this patch and the WAIT FOR
command patch is primarily in xlogwait.c, which provides the mechanism
for waiting until a given LSN is reached. The other parts of the WAIT
FOR patch—covering the SQL command implementation, documentation, and
tests—are not relevant to the current patch. What we need is only the
infrastructure in xlogwait.c, on which we can implement the
optimization for read_local_xlog_page_guts.

Regarding complexity: the initial optimization idea was to introduce
condition-variable based waiting, as Heikki suggested in his comment:

/*
 * Loop waiting for xlog to be available if necessary
 *
 * TODO: The walsender has its own version of this function, which uses a
 * condition variable to wake up whenever WAL is flushed. We could use the
 * same infrastructure here, instead of the check/sleep/repeat style of
 * loop.
 */

I reviewed the relevant code in WalSndWakeup and WalSndWait. While
these mechanisms reduce polling overhead, they don’t prevent false
wakeups. Addressing that would likely require a request queue that
maps waiters to their target LSNs and issues targeted wakeups—a much
more complex design. Given that read_local_xlog_page_guts is not as
performance-sensitive as its equivalents, this added complexity may
not be justified. So I implemented the initial version of the
optimization along the lines of WalSndWakeup and WalSndWait.

After that, I came across the WAIT FOR patch on the mailing list and
noticed that the infrastructure in xlogwait.c aligns well with our
needs. Based on that, I built the current patch on this shared
infrastructure.

If we prioritise simplicity and can tolerate occasional false wakeups,
then waiting in read_local_xlog_page_guts can be implemented in a much
simpler way than the current version. At the same time, the WAIT FOR
command seems to be a valuable feature in its own right, and both
patches can naturally share the same infrastructure. We could also
extract the infrastructure and implement the current patch on it, and
WAIT FOR could then utilize it as well. Personally, I don’t have a
strong preference between the two approaches.

Best,
Xuneng
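A middle ground between a plain broadcast-everything CV and a full request queue is the minimum-outstanding-target-LSN cache mentioned earlier in the thread: the flush path skips the wakeup entirely when no waiter can possibly be satisfied. The sketch below is a single-threaded toy with hypothetical names; the real code would keep min_waited_lsn in shared memory as a pg_atomic_u64 and call ConditionVariableBroadcast() where the counter is bumped.

```c
#include <stdint.h>

typedef uint64_t XLogRecPtr;

static XLogRecPtr min_waited_lsn = UINT64_MAX; /* UINT64_MAX: no waiters */
static int        broadcasts     = 0;

/* A waiter registers its target; keep the cached minimum current. */
static void
register_waiter(XLogRecPtr target)
{
    if (target < min_waited_lsn)
        min_waited_lsn = target;
}

/* Called from the flush path: broadcast only when at least one
 * waiter's target has been reached; otherwise skip the wakeup. */
static void
on_flush(XLogRecPtr flushed)
{
    if (flushed >= min_waited_lsn)
    {
        broadcasts++;                 /* ConditionVariableBroadcast() here */
        min_waited_lsn = UINT64_MAX;  /* unsatisfied waiters re-register */
    }
}
```

This eliminates broadcasts when nobody is close to being satisfied, but as noted above it cannot eliminate false wakeups once several backends wait on different LSNs: any broadcast still wakes them all.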
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi, On Fri, Oct 3, 2025 at 9:50 PM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Hi Michael, > > Thanks for your review. > > On Fri, Oct 3, 2025 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote: > > > v5-0002 separates the waitlsn_cmp() comparator function into two distinct > > > functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay > > > and flush heaps, respectively. > > > > The primary goal that you want to achieve here is a replacement of the > > wait/sleep logic of read_local_xlog_page_guts() with a condition > > variable, and design a new facility to make the callback more > > responsive on polls. That's a fine idea in itself. However I would > > suggest to implement something that does not depend entirely on WAIT > > FOR, which is how your patch is presented. Instead of having your > > patch depend on an entirely different feature, it seems to me that you > > should try to extract from this other feature the basics that you are > > looking for, and make them shared between the WAIT FOR patch and what > > you are trying to achieve here. You should not need something as > > complex as what the other feature needs for a page read callback in > > the backend. > > > > At the end, I suspect that you will reuse a slight portion of it (or > > perhaps nothing at all, actually, but I did not look at the full scope > > of it). You should try to present your patch so as it is in a > > reviewable state, so as others would be able to read it and understand > > it. WAIT FOR is much more complex than what you want to do here > > because it covers larger areas of the code base and needs to worry > > about more cases. So, you should implement things so as the basic > > pieces you want to build on top of are simpler, not more complicated. 
> > Simpler means easier to review and easier to catch problems, designed > > in a way that depends on how you want to fix your problem, not > > designed in a way that depends on how a completely different feature > > deals with its own problems. > > The core infrastructure shared by both this patch and the WAIT FOR > command patch is primarily in xlogwait.c, which provides the mechanism > for waiting until a given LSN is reached. Other parts of the code in > the WAIT FOR patch—covering the SQL command implementation, > documentation, and tests—is not relevant for the current patch. What > we need is only the infrastructure in xlogwait.c, on which we can > implement the optimization for read_local_xlog_page_guts. > > Regarding complexity: the initial optimization idea was to introduce > condition-variable based waiting, as Heikki suggested in his comment: > > /* > * Loop waiting for xlog to be available if necessary > * > * TODO: The walsender has its own version of this function, which uses a > * condition variable to wake up whenever WAL is flushed. We could use the > * same infrastructure here, instead of the check/sleep/repeat style of > * loop. > */ > > I reviewed the relevant code in WalSndWakeup and WalSndWait. While > these mechanisms reduce polling overhead, they don’t prevent false > wakeups. Addressing that would likely require a request queue that > maps waiters to their target LSNs and issues targeted wakeups—a much > more complex design. Given that read_local_xlog_page_guts is not as > performance-sensitive as its equivalents, this added complexity may > not be justified. So I implemented the initial version of the > optimization like WalSndWakeup and WalSndWait. > > After this, I came across the WAIT FOR patch in the mailing list and > noticed that the infrastructure in xlogwait.c aligns well with our > needs. Based on that, I built the current patch using this shared > infra. 
> > If we prioritise simplicity and can tolerate occasional false wakeups,
> > then waiting in read_local_xlog_page_guts can be implemented in a much
> > simpler way than the current version. At the same time, the WAIT FOR
> > command seems to be a valuable feature in its own right, and both
> > patches can naturally share the same infrastructure. We could also
> > extract the infra and implement the current patch on it, then Wait for
> > could utilize it as well. Personally, I don’t have a strong preference
> > between the two approaches.

Another potential use for this infrastructure could be static XLogRecPtr
WalSndWaitForWal(XLogRecPtr loc); I'm planning to hack up a version as
well.

Best,
Xuneng
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi, On Sat, Oct 4, 2025 at 10:25 AM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > Hi, > > On Fri, Oct 3, 2025 at 9:50 PM Xuneng Zhou <xunengzhou@gmail.com> wrote: > > > > Hi Michael, > > > > Thanks for your review. > > > > On Fri, Oct 3, 2025 at 2:24 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Thu, Oct 02, 2025 at 11:06:14PM +0800, Xuneng Zhou wrote: > > > > v5-0002 separates the waitlsn_cmp() comparator function into two distinct > > > > functions (waitlsn_replay_cmp and waitlsn_flush_cmp) for the replay > > > > and flush heaps, respectively. > > > > > > The primary goal that you want to achieve here is a replacement of the > > > wait/sleep logic of read_local_xlog_page_guts() with a condition > > > variable, and design a new facility to make the callback more > > > responsive on polls. That's a fine idea in itself. However I would > > > suggest to implement something that does not depend entirely on WAIT > > > FOR, which is how your patch is presented. Instead of having your > > > patch depend on an entirely different feature, it seems to me that you > > > should try to extract from this other feature the basics that you are > > > looking for, and make them shared between the WAIT FOR patch and what > > > you are trying to achieve here. You should not need something as > > > complex as what the other feature needs for a page read callback in > > > the backend. > > > > > > At the end, I suspect that you will reuse a slight portion of it (or > > > perhaps nothing at all, actually, but I did not look at the full scope > > > of it). You should try to present your patch so as it is in a > > > reviewable state, so as others would be able to read it and understand > > > it. WAIT FOR is much more complex than what you want to do here > > > because it covers larger areas of the code base and needs to worry > > > about more cases. So, you should implement things so as the basic > > > pieces you want to build on top of are simpler, not more complicated. 
> > > Simpler means easier to review and easier to catch problems, designed > > > in a way that depends on how you want to fix your problem, not > > > designed in a way that depends on how a completely different feature > > > deals with its own problems. > > > > The core infrastructure shared by both this patch and the WAIT FOR > > command patch is primarily in xlogwait.c, which provides the mechanism > > for waiting until a given LSN is reached. Other parts of the code in > > the WAIT FOR patch—covering the SQL command implementation, > > documentation, and tests—is not relevant for the current patch. What > > we need is only the infrastructure in xlogwait.c, on which we can > > implement the optimization for read_local_xlog_page_guts. > > > > Regarding complexity: the initial optimization idea was to introduce > > condition-variable based waiting, as Heikki suggested in his comment: > > > > /* > > * Loop waiting for xlog to be available if necessary > > * > > * TODO: The walsender has its own version of this function, which uses a > > * condition variable to wake up whenever WAL is flushed. We could use the > > * same infrastructure here, instead of the check/sleep/repeat style of > > * loop. > > */ > > > > I reviewed the relevant code in WalSndWakeup and WalSndWait. While > > these mechanisms reduce polling overhead, they don’t prevent false > > wakeups. Addressing that would likely require a request queue that > > maps waiters to their target LSNs and issues targeted wakeups—a much > > more complex design. Given that read_local_xlog_page_guts is not as > > performance-sensitive as its equivalents, this added complexity may > > not be justified. So I implemented the initial version of the > > optimization like WalSndWakeup and WalSndWait. > > > > After this, I came across the WAIT FOR patch in the mailing list and > > noticed that the infrastructure in xlogwait.c aligns well with our > > needs. 
> > Based on that, I built the current patch using this shared
> > infra.
> >
> > If we prioritise simplicity and can tolerate occasional false wakeups,
> > then waiting in read_local_xlog_page_guts can be implemented in a much
> > simpler way than the current version. At the same time, the WAIT FOR
> > command seems to be a valuable feature in its own right, and both
> > patches can naturally share the same infrastructure. We could also
> > extract the infra and implement the current patch on it, then Wait for
> > could utilize it as well. Personally, I don’t have a strong preference
> > between the two approaches.
>
> Another potential use for this infra could be static XLogRecPtr
> WalSndWaitForWal(XLogRecPtr loc), I'm planning to hack a version as
> well.
>
> Best,
> Xuneng

v6 refactors and extends the infrastructure from the WAIT FOR command
patch, applying it to read_local_xlog_page_guts. I'm also thinking of
creating a standalone patch/commit for the extended infrastructure in
xlogwait, so it can be reused by different threads.

Best,
Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Michael Paquier
Date:
On Sat, Oct 04, 2025 at 03:21:07PM +0800, Xuneng Zhou wrote:
> v6 refactors and extends the infrastructure from the WAIT FOR command
> patch, applying it to read_local_xlog_page_guts. I'm also thinking of
> creating a standalone patch/commit for the extended
> infra in xlogwait, so it can be reused in different threads.

Yes, I think that you should split your patch where you think that it
can make review easier, because your change touches a very sensitive
area of the code base:
- First patch to introduce what you consider the basic infrastructure
required for your patch, which can be shared between multiple pieces.
I doubt that you really need everything that's in waitlsn.c to achieve
what you want here.
- Second patch to introduce your actual feature, to make the callback
more responsive.
- Then, potentially have a third patch that adds pieces of
infrastructure to waitlsn.c that you did not need in the first patch
but are still required for the waitlsn.c thread. It would then be
possible, optionally, to rebase the waitlsn patch to use patches 1
and 3.

I'd even try to consider the problem from the angle of looking for
independent pieces that could be extracted from the first patch and
split as other patches, to ease the review even further. There is a
limit to this idea because you need a push/pull/reporting facility for
a flush LSN and a replay LSN depending on whether you are on a
primary, on a standby, or even in another case where you are dealing
with a promoted standby where you decide to loop back *inside* the
callback (which I suspect may not always be the right thing to do
depending on the new TLI selected), so there is a limit to what can be
treated as an independent piece. At least the bits about
pairingheap_initialize() may be worth considering.
+ if (waitLSNState &&
+     (XLogRecoveryCtl->lastReplayedEndRecPtr >=
+      pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
+     WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);

This code pattern looks like a copy-paste of what's done in
synchronous replication. Has some consolidation between syncrep.c and
this kind of facility ever been considered? In terms of queues, waits
and wakeups, the requirements are pretty similar, still your patch has
zero changes related to syncrep.c or syncrep.h.

As far as I can see based on your patch, you are repeating some of the
mistakes of the wait LSN patch, where I've complained about
WaitForLSNReplay() and the duplication it had. One thing you have
decided to pull in, for example, is duplicated calls to
GetXLogReplayRecPtr().
--
Michael
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi Michael, Thanks again for your insightful review. On Mon, Oct 6, 2025 at 10:43 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Sat, Oct 04, 2025 at 03:21:07PM +0800, Xuneng Zhou wrote: > > v6 refactors and extends the infrastructure from the WAIT FOR command > > patch, applying it to read_local_xlog_page_guts. I'm also thinking of > > creating a standalone patch/commit for the extended > > infra in xlogwait, so it can be reused in different threads. > > Yes, I think that you should split your patch where you think that it > can make review easier, because your change touches a very sensitive > area of the code base: > - First patch tointroduce what you consider is the basic > infrastructure required for your patch, that can be shared between > multiple pieces. I doubt that you really need to have everything's > that in waitlsn.c to achieve what you want here. > - Second patch to introduce your actual feature, to make the callback > more responsive. > - Then, potentially have a third patch, that adds pieces of > infrastructure to waitlsn.c that you did not need in the first patch, > still are required for the waitlsn.c thread. It would be optionally > possible to rebase the waitlsn patch to use patches 1 and 3, then. > > I'd even try to consider the problem from the angle of looking for > independent pieces that could be extracted from the first patch and > split as other patches, to ease even again the review. There is a > limit to this idea because you need a push/pull/reporting facility for > a flush LSN and a replay LSN depending on if you are on a primary, on > a standby, and even another case where you are dealing with a promoted > standby where you decide to loop back *inside* the callback (which I > suspect may not be always the right thing to do depending on the new > TLI selected), so there is a limit in what could be treated as an > independent piece. At least the bits about pairingheap_initialize() > may be worth considering. 
+1 for further splitting to smooth the review process. The timeout in
the WAIT FOR patch is not needed for the polling problem in the current
thread. I'll remove other unused mechanisms as well.

Yeah, just looping back *inside* the callback could be problematic if
the waited-for LSNs don't exist on the current timeline. I'll add a
check for it.

The new patch set will look like this, per your suggestion:

Patch 0: pairingheap infrastructure (independent)
    src/backend/lib/pairingheap.c | +14 -4
    src/include/lib/pairingheap.h | +3
    Adds pairingheap_initialize() for shared-memory usage.

Patch 1: Minimal LSN waiting infrastructure
    src/backend/access/transam/xlogwait.c | (simplified, no timeout…)
    src/include/access/xlogwait.h | +80
    src/backend/storage/ipc/ipci.c | +3
    src/include/storage/lwlocklist.h | +1
    src/backend/utils/activity/wait_event... | +3
    src/backend/access/transam/xact.c | +6
    src/backend/storage/lmgr/proc.c | +6
    Provides WaitForLSNReplay() and WaitForLSNFlush() for internal WAL
    consumers.

Patch 2: Replace polling in read_local_xlog_page_guts
    src/backend/access/transam/xlogutils.c | +40 -5
    src/backend/access/transam/xlog.c | +10
    src/backend/access/transam/xlogrecovery.c | +6
    src/backend/replication/walsender.c | -4
    Uses the Patch 1 infrastructure to eliminate busy-waiting.

Patch 3: Extend the LSN waiting infrastructure with what WAIT FOR needs

Patch 4: The WAIT FOR command, based on Patches 1 and 3 (SQL interface,
full error handling)

>
> + if (waitLSNState &&
> +     (XLogRecoveryCtl->lastReplayedEndRecPtr >=
> +      pg_atomic_read_u64(&waitLSNState->minWaitedReplayLSN)))
> +     WaitLSNWakeupReplay(XLogRecoveryCtl->lastReplayedEndRecPtr);
>
> This code pattern looks like a copy-paste of what's done in
> synchronous replication. Has some consolidation between syncrep.c and
> this kind of facility ever been considered? In terms of queues, waits
> and wakeups, the requirements are pretty similar, still your patch has
> zero changes related to syncrep.c or syncrep.h.
I was not aware of this before; they do share some basic requirements
here. I'll explore the possibility of consolidating them.

> As far as I can see based on your patch, you are repeating some of the
> mistakes of the wait LSN patch, where I've complained about
> WaitForLSNReplay() and the duplication it had. One thing you have
> decided to pull in, for example, is duplicated calls to
> GetXLogReplayRecPtr().

Will refactor this.

Best,
Xuneng
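To illustrate the queue/wait/wakeup pattern under discussion, here is a
self-contained pthread-based sketch. This is NOT PostgreSQL code:
FlushWaitState, wait_for_lsn_flush, and advance_flush_lsn are invented
names, and pthread primitives stand in for latches and condition
variables. It only shows why caching the minimum outstanding target LSN
lets the flusher skip broadcasting on flushes that satisfy nobody:

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical shared state: a flush pointer plus the smallest LSN any
 * waiter is still waiting for (UINT64_MAX means "no waiters"). */
typedef struct
{
    pthread_mutex_t lock;
    pthread_cond_t cv;
    uint64_t flushed_lsn;
    uint64_t min_waited_lsn;
} FlushWaitState;

static FlushWaitState state = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, UINT64_MAX
};

/* Block until flushed_lsn >= target. */
static void
wait_for_lsn_flush(uint64_t target)
{
    pthread_mutex_lock(&state.lock);
    while (state.flushed_lsn < target)
    {
        /* (Re-)publish our target so the flusher knows when to wake us;
         * re-publishing inside the loop re-arms us after a broadcast that
         * did not satisfy our target. */
        if (target < state.min_waited_lsn)
            state.min_waited_lsn = target;
        pthread_cond_wait(&state.cv, &state.lock);
    }
    pthread_mutex_unlock(&state.lock);
}

/* Advance the flush pointer; wake waiters only if some target is reached. */
static void
advance_flush_lsn(uint64_t new_lsn)
{
    pthread_mutex_lock(&state.lock);
    if (new_lsn > state.flushed_lsn)
        state.flushed_lsn = new_lsn;
    if (state.flushed_lsn >= state.min_waited_lsn)
    {
        state.min_waited_lsn = UINT64_MAX;  /* unsatisfied waiters re-arm */
        pthread_cond_broadcast(&state.cv);
    }
    pthread_mutex_unlock(&state.lock);
}

static void *
waiter_thread(void *arg)
{
    wait_for_lsn_flush((uint64_t) (uintptr_t) arg);
    return NULL;
}

/* Demo: one waiter for LSN 100; returns the final flush pointer. */
static uint64_t
run_demo(void)
{
    pthread_t t;

    pthread_create(&t, NULL, waiter_thread, (void *) (uintptr_t) 100);
    usleep(10000);          /* give the waiter a chance to block */
    advance_flush_lsn(50);  /* below the cached minimum: no broadcast */
    advance_flush_lsn(150); /* crosses the target: waiter is released */
    pthread_join(t, NULL);
    return state.flushed_lsn;
}
```

The check-then-sleep sequence holds the mutex across the recheck and
pthread_cond_wait, so a flush cannot slip in between and lose the
wakeup, which is the same lost-wakeup discipline the real latch and
condition-variable code has to observe.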
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

The following is the split patch set. There are certain limitations to
this simplification effort, particularly in patch 2. The
read_local_xlog_page_guts callback demands more functionality from the
facility than the WAIT FOR patch does: specifically, it must wait for
WAL flush events, though it does not require timeout handling. In some
sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
patch, since it installs wake-up hooks in more locations. Unlike the
WAIT FOR patch, which only needs wake-ups triggered by replay,
read_local_xlog_page_guts must also handle wake-ups triggered by WAL
flushes.

Workload characteristics play a key role here. A sorted dlist performs
well when insertions and removals occur in order, achieving O(1)
complexity in the best case. In synchronous replication, insertion
patterns are generally monotonic with commit LSNs, though not strictly
ordered due to timing variations and contention. When most insertions
remain ordered, a dlist can be efficient. However, as the number of
elements grows and out-of-order insertions become more frequent, the
insertion cost degrades to O(n) more often.

By contrast, a pairing heap maintains stable O(1) insertion for both
ordered and disordered inputs, with amortized O(log n) removals. Since
LSNs in the WAIT FOR command are likely to arrive in a non-sequential
fashion, the pairing heap introduced in v6 provides more predictable
performance under such workloads.

At this stage (v7), no consolidation between syncrep and xlogwait has
been implemented. This is mainly because the dlist and the pairing heap
each work well under different workloads; neither is likely to be
universally optimal. Introducing the facility with a pairing heap first
seems reasonable, as it offers flexibility for future refactoring: we
could later replace the dlist with a heap or adopt a modular design
depending on observed workload characteristics.

Best,
Xuneng
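To make the dlist-versus-heap cost argument concrete, here is a small
self-contained sketch. It is not PostgreSQL's dlist (all names here are
invented for illustration): it inserts into a sorted doubly-linked list
by scanning backwards from the tail, the strategy that makes monotonic
commit-LSN arrivals cheap in a syncrep-style queue, and counts key
comparisons as a proxy for cost:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Minimal sorted doubly-linked list of LSN-like keys. */
typedef struct Node
{
    unsigned long lsn;
    struct Node *prev;
    struct Node *next;
} Node;

typedef struct
{
    Node *head;
    Node *tail;
} SortedList;

/* Insert n keeping ascending LSN order, scanning backwards from the
 * tail; returns the number of key comparisons performed. */
static long
sorted_insert(SortedList *list, Node *n)
{
    long comparisons = 0;
    Node *cur = list->tail;

    while (cur != NULL)
    {
        comparisons++;
        if (cur->lsn <= n->lsn)
            break;              /* n belongs right after cur */
        cur = cur->prev;
    }
    n->prev = cur;
    n->next = (cur != NULL) ? cur->next : list->head;
    if (n->next != NULL)
        n->next->prev = n;
    else
        list->tail = n;
    if (cur != NULL)
        cur->next = n;
    else
        list->head = n;
    return comparisons;
}

/* Insert n keys in ascending or descending arrival order and return the
 * total number of comparisons. */
static long
total_comparisons(int n, bool descending)
{
    SortedList list = {NULL, NULL};
    long total = 0;

    for (int i = 0; i < n; i++)
    {
        Node *node = calloc(1, sizeof(Node));

        node->lsn = (unsigned long) (descending ? n - 1 - i : i);
        total += sorted_insert(&list, node);
    }
    return total;
}
```

With 1000 in-order insertions the scan stops at the tail immediately
(999 comparisons in total), while fully reversed input walks the whole
list on every insert (499500 comparisons), matching the O(1) versus O(n)
behavior described above. A pairing heap keeps insertion O(1) in both
cases, at the price of amortized O(log n) removals.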
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> The following is the split patch set. There are certain limitations to
> this simplification effort, particularly in patch 2. The
> read_local_xlog_page_guts callback demands more functionality from the
> facility than the WAIT FOR patch does: specifically, it must wait for
> WAL flush events, though it does not require timeout handling. In some
> sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> patch, since it installs wake-up hooks in more locations. Unlike the
> WAIT FOR patch, which only needs wake-ups triggered by replay,
> read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> flushes.
>
> Workload characteristics play a key role here. A sorted dlist performs
> well when insertions and removals occur in order, achieving O(1)
> complexity in the best case. In synchronous replication, insertion
> patterns are generally monotonic with commit LSNs, though not strictly
> ordered due to timing variations and contention. When most insertions
> remain ordered, a dlist can be efficient. However, as the number of
> elements grows and out-of-order insertions become more frequent, the
> insertion cost degrades to O(n) more often.
>
> By contrast, a pairing heap maintains stable O(1) insertion for both
> ordered and disordered inputs, with amortized O(log n) removals. Since
> LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> fashion, the pairing heap introduced in v6 provides more predictable
> performance under such workloads.
>
> At this stage (v7), no consolidation between syncrep and xlogwait has
> been implemented. This is mainly because the dlist and the pairing heap
> each work well under different workloads; neither is likely to be
> universally optimal.
> Introducing the facility with a pairing heap
> first seems reasonable, as it offers flexibility for future
> refactoring: we could later replace the dlist with a heap or adopt a
> modular design depending on observed workload characteristics.

v8-0002 removed the early fast check before addLSNWaiter() in
WaitForLSNReplay(), as the likelihood of a server state change there is
small compared to the branching cost and the added code complexity.

Best,
Xuneng
Attachments
Re: Improve read_local_xlog_page_guts by replacing polling with latch-based waiting
From
Xuneng Zhou
Date:
Hi,

On Wed, Oct 15, 2025 at 8:31 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Hi,
>
> On Sat, Oct 11, 2025 at 11:02 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> >
> > Hi,
> >
> > The following is the split patch set. There are certain limitations to
> > this simplification effort, particularly in patch 2. The
> > read_local_xlog_page_guts callback demands more functionality from the
> > facility than the WAIT FOR patch does: specifically, it must wait for
> > WAL flush events, though it does not require timeout handling. In some
> > sense, parts of patch 3 can be viewed as a superset of the WAIT FOR
> > patch, since it installs wake-up hooks in more locations. Unlike the
> > WAIT FOR patch, which only needs wake-ups triggered by replay,
> > read_local_xlog_page_guts must also handle wake-ups triggered by WAL
> > flushes.
> >
> > Workload characteristics play a key role here. A sorted dlist performs
> > well when insertions and removals occur in order, achieving O(1)
> > complexity in the best case. In synchronous replication, insertion
> > patterns are generally monotonic with commit LSNs, though not strictly
> > ordered due to timing variations and contention. When most insertions
> > remain ordered, a dlist can be efficient. However, as the number of
> > elements grows and out-of-order insertions become more frequent, the
> > insertion cost degrades to O(n) more often.
> >
> > By contrast, a pairing heap maintains stable O(1) insertion for both
> > ordered and disordered inputs, with amortized O(log n) removals. Since
> > LSNs in the WAIT FOR command are likely to arrive in a non-sequential
> > fashion, the pairing heap introduced in v6 provides more predictable
> > performance under such workloads.
> >
> > At this stage (v7), no consolidation between syncrep and xlogwait has
> > been implemented. This is mainly because the dlist and the pairing heap
> > each work well under different workloads; neither is likely to be
> > universally optimal.
> > Introducing the facility with a pairing heap
> > first seems reasonable, as it offers flexibility for future
> > refactoring: we could later replace the dlist with a heap or adopt a
> > modular design depending on observed workload characteristics.
>
> v8-0002 removed the early fast check before addLSNWaiter() in
> WaitForLSNReplay(), as the likelihood of a server state change there is
> small compared to the branching cost and the added code complexity.

Made minor changes to the #include of xlogwait.h in patch 2 to calm the
CF bots down.

Best,
Xuneng