Discussion: Optimize LISTEN/NOTIFY
Hi hackers,

The current LISTEN/NOTIFY implementation is well-suited for use-cases like cache invalidation, where many backends listen on the same channel. However, its scalability is limited when many backends listen on distinct channels. The root of the problem is that Async_Notify must signal every listening backend in the database, as it lacks central knowledge of which backend is interested in which channel. This results in an O(N) number of kill(pid, SIGUSR1) syscalls as the listener count grows.

The attached proof-of-concept patch proposes a straightforward optimization for the single-listener case. It introduces a shared-memory hash table mapping (dboid, channelname) to the ProcNumber of a single listener. When NOTIFY is issued, we first check this table. If a single listener is found, we signal only that backend. Otherwise, we fall back to the existing broadcast behavior.

The performance impact for this pattern is significant. A benchmark [1] measuring a NOTIFY "ping-pong" between two connections, while adding a variable number of idle listeners, shows the following:

master (8893c3a):

    0 extra listeners:     9126 TPS
    10 extra listeners:    6233 TPS
    100 extra listeners:   2020 TPS
    1000 extra listeners:   238 TPS

0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener.patch:

    0 extra listeners:     9152 TPS
    10 extra listeners:    9352 TPS
    100 extra listeners:   9320 TPS
    1000 extra listeners:  8937 TPS

As you can see, the patched version's performance is near O(1) with respect to the number of idle listeners, while the current implementation shows the expected O(N) degradation.

This patch is a first step. It uses a simple boolean has_multiple_listeners flag in the hash entry. Once a channel gets a second listener, this flag is set and, crucially, never cleared. The entry will then permanently indicate "multiple listeners", even after all backends on that channel disconnect. A more complete solution would likely use reference counting for each channel's listeners. This would solve the "stuck entry" problem and could also enable a further optimization: targeted signaling to all listeners of a multi-listener channel, avoiding the database-wide broadcast entirely.

The patch also includes a "wake only tail" optimization (contributed by Marko Tikkaja) to help prevent backends from falling too far behind. Instead of waking all lagging backends at once and creating a "thundering herd", this logic signals only the single backend that is currently at the queue tail. This ensures the global queue tail can always advance, relying on a chain reaction to get backends caught up efficiently. This seems like a sensible improvement in its own right.

Thoughts?

/Joel

[1] Benchmark tool and full results: https://github.com/joelonsql/pg-bench-listen-notify
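In sketch form, the shared hash entry and the fast-path decision described above could look like this (struct and helper names are illustrative guesses, not necessarily what the patch uses):

```c
/* Sketch only: struct and helper names are illustrative. */
typedef struct ChannelEntry
{
	Oid			dboid;					/* hash key, part 1 */
	char		channel[NAMEDATALEN];	/* hash key, part 2 */
	ProcNumber	listener;				/* the single listener, if unique */
	bool		has_multiple_listeners; /* sticky once a 2nd listener joins */
} ChannelEntry;

/* Conceptual fast path inside SignalBackends(): */
entry = (ChannelEntry *) hash_search(channelHash, &key, HASH_FIND, &found);
if (found && !entry->has_multiple_listeners)
	SignalSingleBackend(entry->listener);	/* hypothetical helper: one
											 * targeted kill(pid, SIGUSR1) */
else
	SignalAllListeners();					/* hypothetical helper: the
											 * existing O(N) broadcast */
```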
"Joel Jacobson" <joel@compiler.org> writes:
> The attached proof-of-concept patch proposes a straightforward
> optimization for the single-listener case. It introduces a shared-memory
> hash table mapping (dboid, channelname) to the ProcNumber of a single
> listener.
What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
> The patch also includes a "wake only tail" optimization (contributed by
> Marko Tikkaja) to help prevent backends from falling too far behind.
Coulda sworn we dealt with that case some years ago. In any case,
if it's independent of the other idea it should probably get its
own thread.
regards, tom lane
On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> The attached proof-of-concept patch proposes a straightforward
>> optimization for the single-listener case. It introduces a shared-memory
>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>> listener.
>
> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?

Good point. The previous patch would effectively force all LISTEN/UNLISTEN to be serialized, which would at least hurt parallelizability.

New benchmarks confirm this hypothesis.

A new patch is attached that combines two complementary approaches, which together seem to scale well for both common-channel and unique-channel scenarios:

1. Partitioned Hash Locking

The channel hash now uses HASH_PARTITION, with an array of NUM_NOTIFY_PARTITIONS lightweight locks. A given channel is mapped to a partition lock using a custom hash function on (dboid, channelname). This allows LISTEN/UNLISTEN operations on different channels to proceed concurrently without fighting over a single global lock, addressing the "many distinct channels" use-case.

2. Optimistic Read-Locking

For the "many backends on one channel" use-case, lock acquisition now follows a read-then-upgrade pattern. We first acquire a LW_SHARED lock to check the channel's state. If the channel is already marked as has_multiple_listeners, we can return immediately without any need for a write. Only if we are the first or second listener on a channel do we release the shared lock and acquire an LW_EXCLUSIVE lock to modify the hash entry. After getting the exclusive lock, we re-verify the state to guard against race conditions. This avoids serializing the third and all subsequent listeners for a popular channel.

BENCHMARK

https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_connections_equal_jobs.png
https://raw.githubusercontent.com/joelonsql/pg-bench-listen-notify/refs/heads/master/performance_overview_fixed_connections.png

I didn't want to attach the images to this email because they are quite large, due to all the details in the images. However, since it's important that this mailing list contains all relevant data discussed, I've also included all data from the graphs formatted in ASCII/Markdown: performance_overview.md

I've also included the raw parsed data from the pgbench output, which has been used as input to create performance_overview.md as well as the images: pgbench_results_combined.csv

I've benchmarked five times per measurement, in random order. All raw measurements have been included in the Markdown document within { curly braces }, sorted, next to the average values, to give an idea of the variance. Stddev felt potentially misleading, since I'm not sure the data points are normally distributed, given that it's benchmarking data.

I've run the benchmarks on my MacBook Pro Apple M3 Max, using `caffeinate -dims pgbench ...`.

>> The patch also includes a "wake only tail" optimization (contributed by
>> Marko Tikkaja) to help prevent backends from falling too far behind.
>
> Coulda sworn we dealt with that case some years ago. In any case,
> if it's independent of the other idea it should probably get its
> own thread.

Maybe it's been dealt with by some other part of the system, but I can't find any such code anywhere; it's only async.c that currently sends PROCSIG_NOTIFY_INTERRUPT.
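In sketch form, the read-then-upgrade pattern from point 2 above could look like this (the partition-lock mapping and lookup helpers are illustrative assumptions, not the patch's actual code):

```c
static void
ChannelHashAddListener(Oid dboid, const char *channel)
{
	LWLock	   *lock = NotifyPartitionLock(dboid, channel); /* hypothetical */
	ChannelEntry *entry;

	/* Fast path: a shared lock suffices for the common case. */
	LWLockAcquire(lock, LW_SHARED);
	entry = LookupChannelEntry(dboid, channel); /* hypothetical */
	if (entry != NULL && entry->has_multiple_listeners)
	{
		LWLockRelease(lock);	/* third+ listener: nothing to write */
		return;
	}
	LWLockRelease(lock);

	/*
	 * Slow path: we may be the first or second listener.  Re-check under
	 * an exclusive lock, since the state can change between the two
	 * acquisitions.
	 */
	LWLockAcquire(lock, LW_EXCLUSIVE);
	entry = LookupOrCreateChannelEntry(dboid, channel); /* hypothetical */
	if (!entry->has_multiple_listeners)
	{
		if (entry->listener == INVALID_PROC_NUMBER)
			entry->listener = MyProcNumber;			/* first listener */
		else if (entry->listener != MyProcNumber)
			entry->has_multiple_listeners = true;	/* second listener */
	}
	LWLockRelease(lock);
}
```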
The wake-only-tail mechanism seems almost perfect, but I can think of at least one edge case where we could still get a problem situation: with lots of idle backends, the rate of this one-by-one catch-up may not be fast enough to outpace the queue's advancement, causing other idle backends to eventually lag by more than the QUEUE_CLEANUP_DELAY threshold.

To ensure all backends are eventually processed, without re-introducing the thundering herd problem, an additional mechanism seems necessary. I see two main options:

1. Extend the chain reaction

Once woken, a backend could signal the next backend at the queue tail, propagating the catch-up process. This would need to be managed carefully, perhaps with some kind of global advisory lock, to prevent multiple cascades from running at once.

2. Centralize the work

We already have the autovacuum daemon; maybe it could also be made responsible for kicking lagging backends?

Other ideas?

/Joel

Attached:

* pgbench-scripts.tar.gz
  pgbench scripts to reproduce the results, report and images.
* performance_overview.md
  Same results as in the images, but in ASCII/Markdown format.
* pgbench_results_combined.csv
  Parsed output from pgbench runs, used to create performance_overview.md as well as the linked images.
* 0001-Optimize-LISTEN-NOTIFY-signaling-for-single-listener-v2.patch
  The old patch, just renamed to -v2.
* 0002-Partition-channel-hash-to-improve-LISTEN-UNLISTEN-v2.patch
  The new patch with the approach explained above.
On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> The attached proof-of-concept patch proposes a straightforward
>>> optimization for the single-listener case. It introduces a shared-memory
>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>> listener.
>>
>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>
> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
> to be serialized, which would at least hurt parallelizability.
>
> New benchmarks confirm this hypothesis.
>
> A new patch is attached that combines two complementary approaches, which
> together seem to scale well for both common-channel and unique-channel
> scenarios:

Thanks to the FreeBSD animal failing, I see I made a shared memory blunder. A new squashed patch is attached.

/Joel
On Tue, Jul 15, 2025, at 22:56, Joel Jacobson wrote:
> On Tue, Jul 15, 2025, at 09:20, Joel Jacobson wrote:
>> On Sun, Jul 13, 2025, at 01:18, Tom Lane wrote:
>>> "Joel Jacobson" <joel@compiler.org> writes:
>>>> The attached proof-of-concept patch proposes a straightforward
>>>> optimization for the single-listener case. It introduces a shared-memory
>>>> hash table mapping (dboid, channelname) to the ProcNumber of a single
>>>> listener.
>>>
>>> What does that do to the cost and parallelizability of LISTEN/UNLISTEN?
>>
>> Good point. The previous patch would effectively force all LISTEN/UNLISTEN
>> to be serialized, which would at least hurt parallelizability.
>>
>> New benchmarks confirm this hypothesis.
>>
>> A new patch is attached that combines two complementary approaches, which
>> together seem to scale well for both common-channel and unique-channel
>> scenarios:
>
> Thanks to the FreeBSD animal failing, I see I made a shared memory blunder.
> A new squashed patch is attached.
>
> /Joel
> Attachments:
> * 0001-Subject-Optimize-LISTEN-NOTIFY-signaling-for-scalabi-v3.patch

(cfbot is not picking up my patch; I wonder if some filename length limit is exceeded, so I'm trying a shorter filename. Apologies for spamming.)

/Joel
Hi Joel,

Thanks for sharing the patch. I have a few questions based on a cursory first look.

> If a single listener is found, we signal only that backend.
> Otherwise, we fall back to the existing broadcast behavior.

The idea of not wanting to wake up all backends makes sense to me, but I don’t understand why we want this optimization only for the case where there is a single backend listening on a channel.

Is there a pattern of usage in LISTEN/NOTIFY where users typically have either just one or several backends listening on a channel?

If we are doing this optimization, why not maintain a list of backends for each channel, and only wake up those backends?

Thanks,
Rishu
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> Hi Joel,
>
> Thanks for sharing the patch.
> I have a few questions based on a cursory first look.
>
>> If a single listener is found, we signal only that backend.
>> Otherwise, we fall back to the existing broadcast behavior.
>
> The idea of not wanting to wake up all backends makes sense to me,
> but I don’t understand why we want this optimization only for the case
> where there is a single backend listening on a channel.
>
> Is there a pattern of usage in LISTEN/NOTIFY where users typically
> have either just one or several backends listening on a channel?
>
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?
Thanks for the thoughtful question. You've hit on the central design trade-off
in this optimization: how to provide targeted signaling for some workloads
without degrading performance for others.
While we don't have telemetry on real-world usage patterns of LISTEN/NOTIFY,
it seems likely that most applications fall into one of three categories,
which I've been thinking of in networking terms:
1. Broadcast-style ("hub mode")
Many backends listening on the *same* channel (e.g., for cache invalidation).
The current implementation is already well-optimized for this, behaving like
an Ethernet hub that broadcasts to all ports. Waking all listeners is efficient
because they all need the message.
2. Targeted notifications ("switch mode")
Each backend listens on its own private channel (e.g., for session events or
worker queues). This is where the current implementation scales poorly, as every
NOTIFY wakes up all listeners regardless of relevance. My patch is designed
to make this behave like an efficient Ethernet switch.
3. Selective multicast-style ("group mode")
A subset of backends shares a channel, but not all. This is the tricky middle
ground. Your question, "why not maintain a list of backends for each channel,
and only wake up those backends?" is exactly the right one to ask.
A full listener list seems like the obvious path to optimizing for *all* cases.
However, the devil is in the details of concurrency and performance. Managing
such a list would require heavier locking, which would create a new bottleneck
and degrade the scalability of LISTEN/UNLISTEN operations—especially for
the "hub mode" case where many backends rapidly subscribe to the same popular
channel.
This patch makes a deliberate architectural choice:
Prioritize a massive, low-risk win for "switch mode" while rigorously protecting
the performance of "hub mode".
It introduces a targeted fast path for single-listener channels and cleanly
falls back to the existing, well-performing broadcast model for everything else.
This brings us back to "group mode", which remains an open optimization problem.
A possible approach could be to track listeners up to a small threshold *K*
(e.g., store up to 4 ProcNumbers in the hash entry). If the count exceeds *K*,
we would flip a "broadcast" flag and revert to hub-mode behavior.
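For concreteness, such an entry might look roughly like this (K = 4 shown; all names are hypothetical, since no such patch exists):

```c
/* Hypothetical entry layout for the K-listener idea (K = 4 shown). */
typedef struct ChannelEntry
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
	int			nlisteners;		/* valid only while <= K */
	ProcNumber	listeners[4];	/* the first K listeners */
	bool		broadcast;		/* set once a (K+1)-th listener joins */
} ChannelEntry;
```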
However, this path has critical drawbacks:
1. Performance Penalty for Hub Mode
With the current patch, after the second listener joins a channel,
the has_multiple_listeners flag is set. Every subsequent listener can acquire
a shared lock, see the flag is true, and immediately continue. This is
a highly concurrent, read-only operation that does not require mutating shared
state.
In contrast, the K-listener approach would force every new listener (from the
third up to the K-th) to acquire an exclusive lock to mutate the shared
listener array. This would serialize LISTEN operations on popular channels,
creating the very contention point this patch successfully avoids and directly
harming the hub-mode use case that currently works well.
2. Uncertainty
Compounding this, without clear data on typical "group" sizes, choosing a value
for *K* is a shot in the dark. A small *K* might not help much, while
a large *K* would increase the shared memory footprint and worsen the
serialization penalty.
For these reasons, attempting to build a switch that also optimizes for
multicast risks undermining the architectural clarity and performance of
both the switch and hub models.
This patch, therefore, draws a clean line. It provides a precise,
low-cost path for switch-mode workloads and preserves the existing,
well-performing path for hub-mode workloads. While this leaves "group mode"
unoptimized for now, it ensures we make two common use cases better without
making any use case worse. The new infrastructure is flexible, leaving
the door open should a better approach for "group mode" emerge in
the future—one that doesn't compromise the other two.
Benchmarks updated showing master vs 0001-optimize_listen_notify-v3.patch:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections.png
I've not included the benchmark CSV data in this mail, since it's quite heavy,
160kB, and I couldn't see any significant performance changes since v2.
/Joel
On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
> If we are doing this optimization, why not maintain a list of backends
> for each channel, and only wake up those backends?
Thanks for contributing a great idea; it actually turned out to work
really well in practice!
The attached new v4 of the patch implements your multicast idea:
---
Improve NOTIFY scalability with multicast signaling
Previously, NOTIFY would signal all listening backends in a database for
any channel with more than one listener. This broadcast approach scales
poorly for workloads that rely on targeted notifications to small groups
of backends, as every NOTIFY could wake up many unrelated processes.
This commit introduces a multicast signaling optimization to improve
scalability for such use-cases. A new GUC, `notify_multicast_threshold`,
is added to control the maximum number of listeners to track per
channel. When a NOTIFY is issued, if the number of listeners is at or
below this threshold, only those specific backends are signaled. If the
limit is exceeded, the system falls back to the original broadcast
behavior.
The default for this threshold is set to 16. Benchmarks show this
provides a good balance, with significant performance gains for small to
medium-sized listener groups and diminishing returns for higher values.
Setting the threshold to 0 disables multicast signaling, forcing a
fallback to the broadcast path for all notifications.
To implement this, a new partitioned hash table is introduced in shared
memory to track listeners. Locking is managed with an optimistic
read-then-upgrade pattern. This allows concurrent LISTEN/UNLISTEN
operations on *different* channels to proceed in parallel, as they will
only acquire locks on their respective partitions.
For correctness and to prevent deadlocks, a strict lock ordering
hierarchy (NotifyQueueLock before any partition lock) is observed. The
signaling path in NOTIFY must acquire the global NotifyQueueLock first
before consulting the partitioned hash table, which serializes
concurrent NOTIFYs. The primary concurrency win is for LISTEN/UNLISTEN
operations, which are now much more scalable.
The "wake only tail" optimization, which signals backends that are far
behind in the queue, is also included to ensure the global queue tail
can always advance.
Thanks to Rishu Bagga for the multicast idea.
---
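In sketch form, the NOTIFY-side decision the commit message describes (helper names and the overflow flag are illustrative assumptions, not necessarily the patch's exact code):

```c
/* Conceptual decision in SignalBackends(); names are illustrative. */
entry = LookupChannelEntry(dboid, channel);		/* hypothetical */
if (entry != NULL && !entry->overflowed)
{
	/* Tracked set: at most notify_multicast_threshold listeners. */
	for (int i = 0; i < entry->nlisteners; i++)
		SignalSingleBackend(entry->listeners[i]);	/* targeted wakeups */
}
else
	SignalAllListeners();		/* fall back to the O(N) broadcast */
```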
BENCHMARK
To find the optimal default notify_multicast_threshold value,
I created a new benchmark tool that spawns one "ping" worker that sends
notifications to a channel, and multiple "pong" workers that listen on channels
and all immediately reply back to the "ping" worker, and when all replies
have been received, the cycle repeats.
By measuring how many complete round-trips can be performed per second,
it evaluates the impact of different multicast threshold settings.
The results below show the effect of setting notify_multicast_threshold
just below, or exactly at, the N backends per channel, to compare
broadcast vs multicast for different sizes of multicast groups (where 1
would be the old targeted mode that was specifically optimized for
earlier).
K = notify_multicast_threshold
With 2 backends per channel (32 channels total):
patch-v4 (K=1): 8,477 TPS
patch-v4 (K=2): 27,748 TPS (3.3x improvement)
With 4 backends per channel (16 channels total):
patch-v4 (K=1): 7,367 TPS
patch-v4 (K=4): 18,777 TPS (2.6x improvement)
With 8 backends per channel (8 channels total):
patch-v4 (K=1): 5,892 TPS
patch-v4 (K=8): 8,620 TPS (1.5x improvement)
With 16 backends per channel (4 channels total):
patch-v4 (K=1): 4,202 TPS
patch-v4 (K=16): 4,750 TPS (1.1x improvement)
I also reran the old ping-pong as well as the pgbench benchmarks,
and I couldn't detect any negative impact, testing with
notify_multicast_threshold {1, 8, 16}.
Ping-pong benchmark:
Extra Connections: 0
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 9119 baseline {9088, 9095, 9119}
patch-v4 (t=1) 9116 -0.0% {9082, 9090, 9116}
patch-v4 (t=8) 9106 -0.2% {9086, 9102, 9106}
patch-v4 (t=16) 9134 +0.2% {9082, 9116, 9134}
Extra Connections: 10
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 6237 baseline {6224, 6227, 6237}
patch-v4 (t=1) 9358 +50.0% {9302, 9345, 9358}
patch-v4 (t=8) 9348 +49.9% {9266, 9312, 9348}
patch-v4 (t=16) 9408 +50.8% {9339, 9407, 9408}
Extra Connections: 100
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 2028 baseline {2026, 2027, 2028}
patch-v4 (t=1) 9278 +357.3% {9222, 9235, 9278}
patch-v4 (t=8) 9227 +354.8% {9184, 9207, 9227}
patch-v4 (t=16) 9250 +355.9% {9180, 9243, 9250}
Extra Connections: 1000
--------------------------------------------------------------------------------
Version Max TPS vs Master All Values (sorted)
-------------------------------------------------------------------------------------
master 239 baseline {239, 239, 239}
patch-v4 (t=1) 8841 +3594.1% {8819, 8840, 8841}
patch-v4 (t=8) 8835 +3591.7% {8802, 8826, 8835}
patch-v4 (t=16) 8855 +3599.8% {8787, 8843, 8855}
Among my pgbench benchmarks, the results seem unaffected for these scripts:
listen_unique.sql
listen_common.sql
listen_unlisten_unique.sql
listen_unlisten_common.sql
The listen_notify_unique.sql benchmark shows similar improvements
for all notify_multicast_threshold values tested. This is expected:
the benchmark uses unique channels, so a higher
notify_multicast_threshold shouldn't affect the results, and indeed
it didn't:
# TEST `listen_notify_unique.sql`
```sql
LISTEN channel_:client_id;
NOTIFY channel_:client_id;
```
## 1 Connection, 1 Job
- **master**: 63696 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 63377 TPS (-0.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 62890 TPS (-1.3%)
- **optimize_listen_notify_v4 (t=16.0)**: 63114 TPS (-0.9%)
## 2 Connections, 2 Jobs
- **master**: 90967 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 109423 TPS (+20.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 109107 TPS (+19.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 109608 TPS (+20.5%)
## 4 Connections, 4 Jobs
- **master**: 114333 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 140986 TPS (+23.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 141263 TPS (+23.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 141327 TPS (+23.6%)
## 8 Connections, 8 Jobs
- **master**: 64429 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 93787 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=8.0)**: 93828 TPS (+45.6%)
- **optimize_listen_notify_v4 (t=16.0)**: 93875 TPS (+45.7%)
## 16 Connections, 16 Jobs
- **master**: 41704 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 84791 TPS (+103.3%)
- **optimize_listen_notify_v4 (t=8.0)**: 88330 TPS (+111.8%)
- **optimize_listen_notify_v4 (t=16.0)**: 84827 TPS (+103.4%)
## 32 Connections, 32 Jobs
- **master**: 25988 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 83197 TPS (+220.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 83453 TPS (+221.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 83576 TPS (+221.6%)
## 1000 Connections, 1 Job
- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3097 TPS (+2852.1%)
- **optimize_listen_notify_v4 (t=8.0)**: 3079 TPS (+2835.1%)
- **optimize_listen_notify_v4 (t=16.0)**: 3080 TPS (+2835.9%)
## 1000 Connections, 2 Jobs
- **master**: 108 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2981 TPS (+2671.7%)
- **optimize_listen_notify_v4 (t=8.0)**: 3091 TPS (+2774.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 3097 TPS (+2779.6%)
## 1000 Connections, 4 Jobs
- **master**: 105 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2947 TPS (+2705.5%)
- **optimize_listen_notify_v4 (t=8.0)**: 2994 TPS (+2751.0%)
- **optimize_listen_notify_v4 (t=16.0)**: 2992 TPS (+2748.7%)
## 1000 Connections, 8 Jobs
- **master**: 107 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3064 TPS (+2777.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 2981 TPS (+2698.5%)
- **optimize_listen_notify_v4 (t=16.0)**: 2979 TPS (+2696.8%)
## 1000 Connections, 16 Jobs
- **master**: 101 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 3068 TPS (+2923.2%)
- **optimize_listen_notify_v4 (t=8.0)**: 2950 TPS (+2806.4%)
- **optimize_listen_notify_v4 (t=16.0)**: 2940 TPS (+2796.8%)
## 1000 Connections, 32 Jobs
- **master**: 102 TPS (baseline)
- **optimize_listen_notify_v4 (t=1.0)**: 2980 TPS (+2815.0%)
- **optimize_listen_notify_v4 (t=8.0)**: 3034 TPS (+2867.9%)
- **optimize_listen_notify_v4 (t=16.0)**: 2962 TPS (+2798.0%)
Here are some plots that includes the above results:
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/plot-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_connections_equal_jobs-v4.png
https://github.com/joelonsql/pg-bench-listen-notify/raw/master/performance_overview_fixed_connections-v4.png
/Joel
On Thu, Jul 17, 2025, at 09:43, Joel Jacobson wrote:
> On Wed, Jul 16, 2025, at 02:20, Rishu Bagga wrote:
>> If we are doing this optimization, why not maintain a list of backends
>> for each channel, and only wake up those backends?
>
> Thanks for contributing a great idea; it actually turned out to work
> really well in practice!
>
> The attached new v4 of the patch implements your multicast idea:
Hi hackers,
While my previous attempts at $subject have only focused on optimizing
the multi-channel scenario, I thought it would be really nice if
LISTEN/NOTIFY could be optimized in the general case, benefiting all
users, including those who just listen on a single channel.
To my surprise, this was not only possible, but actually quite simple.
The main idea in this patch is to introduce an atomic state machine
with three states, IDLE, SIGNALLED, and PROCESSING, so that we don't
interrupt backends that are already in the process of catching up.
Thanks to Thomas Munro for making me aware of his, Heikki Linnakangas's
and others' work in the "Interrupts vs signals" [1] thread.
Maybe my patch is redundant given their patch set; I'm not really sure.
Their patch seems to refactor the underlying wakeup mechanism. It
replaces the old, complex chain of events (SIGUSR1 signal -> handler ->
flag -> latch) with a single, direct function call: SendInterrupt(). For
async.c, this seems to be a low-level plumbing change that simplifies
how a notification wakeup is delivered.
My patch optimizes the high-level notification protocol. It introduces a
state machine (IDLE, SIGNALLED, PROCESSING) to only signal backends when
needed.
In their patch, in async.c's SignalBackends(), they do
SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
seem to check if the backend is already signalled or not, but maybe
SendInterrupt() has signal coalescing built-in so it would be a noop
with almost no cost?
I'm happy to rebase my LISTEN/NOTIFY work on top of [1], but I could
also see benefits of doing the opposite.
I'm also happy to help with benchmarking of your work in [1].
Note that this patch doesn't contain the hash table to keep track of
listeners per backend, as proposed in earlier patches. I will propose
such a patch again later, but first we need to figure out if I should
rebase onto [1] or master (HEAD).
--- PATCH ---
Optimize NOTIFY signaling to avoid redundant backend signals
Previously, a NOTIFY would send SIGUSR1 to all listening backends, which
could lead to a "thundering herd" of redundant signals under high
traffic. To address this inefficiency, this patch replaces the simple
volatile notifyInterruptPending flag with a per-backend atomic state
machine, stored in asyncQueueControl->backend[i].state. This state
variable can be in one of three states: IDLE (awaiting signal),
SIGNALLED (signal received, work pending), or PROCESSING (actively
reading the queue).
From the notifier's perspective, SignalBackends now uses an atomic
compare-and-swap (CAS) to transition a listener from IDLE to SIGNALLED.
Only on a successful transition is a signal sent. If the listener is
already SIGNALLED or another notifier wins the race, no redundant signal
is sent. If the listener is in the PROCESSING state, the notifier will
also transition it to SIGNALLED to ensure the listener re-scans the
queue after its current work is done.
On the listener side, ProcessIncomingNotify first transitions its state
from SIGNALLED to PROCESSING. After reading notifications, it attempts
to transition from PROCESSING back to IDLE. If this CAS fails, it means
a new notification arrived during processing and a notifier has already
set the state back to SIGNALLED. The listener then simply re-latches
itself to process the new notifications, avoiding a tight loop.
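A minimal sketch of the two CAS transitions described above, assuming a per-listener pg_atomic_uint32 state; constant and helper names are illustrative:

```c
/* Notifier side: coalesce signals with a CAS loop. */
for (;;)
{
	uint32		cur = pg_atomic_read_u32(&slot->state);

	if (cur == ASYNC_SIGNALLED)
		break;					/* already signalled: nothing to do */
	if (pg_atomic_compare_exchange_u32(&slot->state, &cur, ASYNC_SIGNALLED))
	{
		if (cur == ASYNC_IDLE)
			SignalListener(slot);	/* hypothetical: wake the listener */
		/* cur == ASYNC_PROCESSING: it will notice and re-scan; no signal */
		break;
	}
	/* lost a race: cur was reloaded, retry */
}

/* Listener side, in ProcessIncomingNotify(): */
uint32		expected = ASYNC_SIGNALLED;

pg_atomic_compare_exchange_u32(&slot->state, &expected, ASYNC_PROCESSING);
asyncQueueReadAllNotifications();

expected = ASYNC_PROCESSING;
if (!pg_atomic_compare_exchange_u32(&slot->state, &expected, ASYNC_IDLE))
	SetLatch(MyLatch);			/* a notifier flipped us back to SIGNALLED:
								 * re-arm to process the new backlog */
```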
The primary benefit is a significant reduction in syscall overhead and
unnecessary kernel wakeups in high-traffic scenarios. This dramatically
improves performance for workloads with many concurrent notifiers.
Benchmarks show a substantial increase in NOTIFY-only transaction
throughput, with gains exceeding 200% at higher
concurrency levels.
 src/backend/commands/async.c | 209 ++++++++++++++++++++++++++++-----
src/backend/tcop/postgres.c | 4 ++--
src/include/commands/async.h | 4 +++-
3 files changed, 185 insertions(+), 32 deletions(-)
--- BENCHMARK ---
The attached benchmark script does LISTEN on one connection,
and then uses pgbench to send NOTIFY on a varying number of
connections and jobs, to cause a high procsignal load.
I've run the benchmark on my MacBook Pro M3 Max,
10 seconds per run, 3 runs.
(I reused the same benchmark script as in the other thread, "Optimize ProcSignal to avoid redundant SIGUSR1 signals")
Connections=Jobs | TPS (master) | TPS (patch) | Relative Diff (%) | StdDev (master) | StdDev (patch)
------------------+--------------+-------------+-------------------+-----------------+----------------
1 | 118833 | 151510 | 27.50% | 484 | 923
2 | 156005 | 239051 | 53.23% | 3145 | 1596
4 | 177351 | 250910 | 41.48% | 4305 | 4891
8 | 116597 | 171944 | 47.47% | 1549 | 2752
16 | 40835 | 165482 | 305.25% | 2695 | 2825
32 | 37940 | 145150 | 282.58% | 2533 | 1566
64 | 35495 | 131836 | 271.42% | 1837 | 573
128 | 40193 | 121333 | 201.88% | 2254 | 874
(8 rows)
/Joel
[1] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
> In their patch, in async.c's SignalBackends(), they do
> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
> seem to check if the backend is already signalled or not, but maybe
> SendInterrupt() has signal coalescing built-in so it would be a noop
> with almost no cost?

Yeah:

+    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
+
+    /*
+     * If the process is currently blocked waiting for an interrupt to arrive,
+     * and the interrupt wasn't already pending, wake it up.
+     */
+    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) == SLEEPING_ON_INTERRUPTS)
+        WakeupOtherProc(proc);
On Wed, Jul 23, 2025, at 04:44, Thomas Munro wrote:
> On Wed, Jul 23, 2025 at 1:39 PM Joel Jacobson <joel@compiler.org> wrote:
>> In their patch, in async.c's SignalBackends(), they do
>> SendInterrupt(INTERRUPT_ASYNC_NOTIFY, procno) instead of
>> SendProcSignal(pid, PROCSIG_NOTIFY_INTERRUPT, procnos[i]). They don't
>> seem to check if the backend is already signalled or not, but maybe
>> SendInterrupt() has signal coalescing built-in so it would be a noop
>> with almost no cost?
>
> Yeah:
>
> +    old_pending = pg_atomic_fetch_or_u32(&proc->pendingInterrupts, interruptMask);
> +
> +    /*
> +     * If the process is currently blocked waiting for an interrupt to arrive,
> +     * and the interrupt wasn't already pending, wake it up.
> +     */
> +    if ((old_pending & (interruptMask | SLEEPING_ON_INTERRUPTS)) ==
> SLEEPING_ON_INTERRUPTS)
> +        WakeupOtherProc(proc);

Thanks for confirming the coalescing logic in SendInterrupt. That's a great low-level optimization.

It's clear we're both targeting the same problem of redundant wake-ups under contention, but approaching it from different architectural levels. The core difference, as I see it, is *where* the state management resides. The "Interrupts vs signals" patch set creates a unified machinery where the 'pending' state for all subsystems is combined into a single atomic bitmask. This is a valid approach.

However, I've been exploring an alternative pattern that decouples the state management from the signaling machinery, allowing each subsystem to manage its own state independently. I believe this leads to a simpler, more modular migration path. I've developed a two-patch series for `async.c` to demonstrate this concept.

1. The first patch introduces a lock-free, atomic finite state machine (FSM) entirely within async.c. By using a subsystem-specific atomic integer and CAS operations, async.c can now robustly manage its own listener states (IDLE, SIGNALLED, PROCESSING). This solves the redundant signal problem at the source, as notifiers can now observe a listener's state and refrain from sending a wakeup if one is already pending.

2. The second patch demonstrates that once state is managed locally, the wakeup mechanism becomes trivial. The expensive `SendProcSignal` call is replaced with a direct `SetLatch`. This leverages the existing, highly-optimized `WaitEventSet` infrastructure as a simple, efficient "poke."

This suggests a powerful, incremental migration pattern: first, fix a subsystem's state management internally; second, replace its wakeup mechanism. This vertical, module-by-module approach seems complementary to the horizontal, layer-by-layer refactoring in the "Interrupts vs signals" thread. I'll post a more detailed follow-up in that thread to discuss the broader architectural implications.

Attached are the two patches, reframed to better illustrate this two-step pattern.

/Joel
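In sketch form, patch 0002's wakeup path then reduces to a latch poke (TransitionIdleToSignalled() stands in for the CAS from patch 0001 and is illustrative):

```c
if (TransitionIdleToSignalled(slot))	/* hypothetical CAS helper */
{
	PGPROC	   *proc = GetPGProcByNumber(slot->procno);

	SetLatch(&proc->procLatch); /* cheap poke: the listener's WaitEventSet
								 * wakes it up, no ProcSignal round-trip */
}
```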
On Thu, Jul 24, 2025, at 23:03, Joel Jacobson wrote:
> * 0001-Optimize-LISTEN-NOTIFY-signaling-with-a-lock-free-at.patch
> * 0002-Optimize-LISTEN-NOTIFY-wakeup-by-replacing-signal-wi.patch

I'm withdrawing the latest patches, since they won't fix the scalability problems, but only provide some performance improvements by eliminating redundant IPC signalling. This could also be improved outside of async.c, by optimizing ProcSignal [1] or removing ProcSignal as "Interrupts vs Signals" [2] is working on.

There seem to be two different scalability problems, which appear to be orthogonal:

First, there is the thundering herd problem that I tried to solve initially in this thread, by introducing a hash table in shared memory to keep track of which backends listen to which channels, to avoid immediate wakeup of all listening backends for every notification.

Second, there is the heavyweight lock in PreCommit_Notify(), which prevents parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.

My perf+pgbench experiments indicate that which of these two scalability problems is the bottleneck depends on the workload.

I think the idea of keeping track of channels per backend has merit, but I want to take a step back and see what others think about the idea first.

I guess my main question is if we think we should fix one problem first, then the other, both at the same time, or only one or the other?

I've attached some benchmarks using pgbench and running postgres under perf, which I hope can provide some insights.

/Joel

[1] https://www.postgresql.org/message-id/flat/a0b12a70-8200-4bd4-9e24-56796314bdce%40app.fastmail.com
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2B3MkS21yK4jL4cgZywdnnGKiBg0jatoV6kzaniBmcqbQ%40mail.gmail.com
[3] https://www.postgresql.org/message-id/1878165.1752858390%40sss.pgh.pa.us
[ getting back to this... ]
"Joel Jacobson" <joel@compiler.org> writes:
> I'm withdrawing the latest patches, since they won't fix the scalability
> problems, but only provide some performance improvements by eliminating
> redundant IPC signalling. This could also be improved outside of
> async.c, by optimizing ProcSignal [1] or removing ProcSignal as
> "Interrupts vs Signals" [2] is working on.
> There seem to be two different scalability problems, which appear to be
> orthogonal:
> First, there is the thundering herd problem that I tried to solve initially
> in this thread, by introducing a hash table in shared memory to keep
> track of which backends listen to which channels, to avoid immediate
> wakeup of all listening backends for every notification.
> Second, there is the heavyweight lock in PreCommit_Notify(), which prevents
> parallelism of NOTIFY. Tom Lane has an idea [3] on how to improve this.
I concur that these are orthogonal issues, but I don't understand
why you withdrew your patches --- don't they constitute a solution
to the first scalability bottleneck?
> I guess my main question is if we think we should fix one problem first,
> then the other, both at the same time, or only one or the other?
I imagine we'd eventually want to fix both, but it doesn't have to
be done in the same patch.
regards, tom lane
On Tue, Sep 23, 2025, at 18:27, Tom Lane wrote:
> I concur that these are orthogonal issues, but I don't understand
> why you withdrew your patches --- don't they constitute a solution
> to the first scalability bottleneck?

Thanks for getting back to this thread.

I was unhappy with not finding a solution that would improve all use-cases. I had a feeling it would be possible to find one, and I think I've done so now.

>> I guess my main question is if we think we should fix one problem first,
>> then the other, both at the same time, or only one or the other?
>
> I imagine we'd eventually want to fix both, but it doesn't have to
> be done in the same patch.

I've attached a new patch with a new pragmatic approach that specifically addresses the context-switching cost.

The patch is based on the assumption that some extra LISTEN/NOTIFY latency would be acceptable to most users as a trade-off, in order to improve throughput.

One nice thing about this approach is that it has the potential to improve throughput both for users with just a single listening backend, and also for users with lots of listening backends.

More details in the commit message of the patch.

Curious to hear thoughts on this approach.

/Joel
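A minimal sketch of the timeout wiring such an approach implies; the handler and flag names are illustrative, while NOTIFY_DEFERRED_WAKEUP_TIMEOUT and notify_latency_target are the names the patch adds:

```c
static volatile sig_atomic_t notify_deferred_pending = false;

/* Timeout callback: runs in interrupt context, so just set a flag and
 * poke our own latch; the main loop then processes the backlog. */
static void
NotifyDeferredWakeupHandler(void)
{
	notify_deferred_pending = true;
	SetLatch(MyLatch);
}

/* At backend startup: */
RegisterTimeout(NOTIFY_DEFERRED_WAKEUP_TIMEOUT, NotifyDeferredWakeupHandler);

/* When a wakeup arrives sooner than notify_latency_target ms after the
 * previous one, defer instead of reading the queue immediately: */
enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT, notify_latency_target);
```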
Hi Joel,
Thanks for the patch. After reviewing it, I got a few comments.
On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:
> Curious to hear thoughts on this approach.
>
> /Joel
>
> <0001-LISTEN-NOTIFY-make-the-latency-throughput-trade-off-.patch>
1.
```
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
IDLE_SESSION_TIMEOUT,
IDLE_STATS_UPDATE_TIMEOUT,
CLIENT_CONNECTION_CHECK_TIMEOUT,
+ NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
STARTUP_PROGRESS_TIMEOUT,
```
Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to preserve the existing enum value?
2.
```
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -766,6 +766,7 @@ autovacuum_worker_slots = 16 # autovacuum worker slots to allocate
#lock_timeout = 0 # in milliseconds, 0 is disabled
#idle_in_transaction_session_timeout = 0 # in milliseconds, 0 is disabled
#idle_session_timeout = 0 # in milliseconds, 0 is disabled
+#notify_latency_target = 0 # in milliseconds, 0 is disabled
#bytea_output = 'hex' # hex, escape
```
I think we should add one more tab to make the comment align with the previous line’s comment.
3.
```
/* GUC parameters */
bool Trace_notify = false;
+int notify_latency_target;
```
I know the compiler will auto-initialize notify_latency_target to 0. But all other global and static variables around it are explicitly initialized, so it would look better to assign 0 to it, which keeps the coding style consistent.
4.
```
+ /*
+ * Throttling check: if we were last active too recently, defer. This
+ * check is safe without a lock because it's based on a backend-local
+ * timestamp.
+ */
+ if (notify_latency_target > 0 &&
+ !TimestampDifferenceExceeds(last_wakeup_start_time,
+ GetCurrentTimestamp(),
+ notify_latency_target))
+ {
+ /*
+ * Too soon. We leave wakeup_pending_flag untouched (it must be true,
+ * or we wouldn't have been signaled) to tell senders we are
+ * intentionally delaying. Arm a timer to re-awaken and process the
+ * backlog later.
+ */
+ enable_timeout_after(NOTIFY_DEFERRED_WAKEUP_TIMEOUT,
+ notify_latency_target);
+ return;
+ }
+
```
Should we avoid enabling duplicate timeouts? Currently, whenever a duplicate notification is avoided, a new timeout is enabled. I think we can add another variable to remember whether a timeout has already been enabled.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Sep 25, 2025, at 10:25, Chao Li wrote:
> Hi Joel,
>
> Thanks for the patch. After reviewing it, I got a few comments.

Thanks for reviewing!

>> On Sep 25, 2025, at 04:34, Joel Jacobson <joel@compiler.org> wrote:

> 1. ...
> Can we define the new one after STARTUP_PROGRESS_TIMEOUT to try to
> preserve the existing enum value?

Fixed.

> 2. ...
> I think we should add one more tab to make the comment align with the
> previous line's comment.

Fixed.

> 3. ...
> I know the compiler will auto-initialize notify_latency_target to 0. But
> all other global and static variables around it are explicitly
> initialized, so it would look better to assign 0 to it, which keeps the
> coding style consistent.

Fixed.

> 4. ...
> Should we avoid enabling duplicate timeouts? Currently, whenever a
> duplicate notification is avoided, a new timeout is enabled. I think we
> can add another variable to remember whether a timeout has already been
> enabled.

Hmm, I don't see how a duplicate timeout could happen?

Once we decide to defer the wakeup, wakeup_pending_flag remains set, which avoids further signals from notifiers, so I don't see how we could re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset when ProcessIncomingNotify() is called, and notifyInterruptPending is only set when a signal is received (or set directly when in the same process).

New patch attached with 1-3 fixed.

/Joel
On Sep 26, 2025, at 05:13, Joel Jacobson <joel@compiler.org> wrote:
> Hmm, I don't see how a duplicate timeout could happen?
>
> Once we decide to defer the wakeup, wakeup_pending_flag remains set,
> which avoids further signals from notifiers, so I don't see how we could
> re-enter ProcessIncomingNotify(), since notifyInterruptPending is reset
> when ProcessIncomingNotify() is called, and notifyInterruptPending is
> only set when a signal is received (or set directly when in the same
> process).
I think what you explained is partially correct.
Based on my understanding, any backend process may call SignalBackends(), which means that it’s possible that multiple backend processes may call SignalBackends() concurrently.
Looking at your code, between checking QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true, there is a block of code (the “if-else”) to run. So it’s possible that multiple backend processes have passed the QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple signals will be sent to a process, leading to duplicate timeouts being enabled in the receiver process.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
> I think what you explained is partially correct.
>
> Based on my understanding, any backend process may call
> SignalBackends(), which means that it’s possible that multiple backend
> processes may call SignalBackends() concurrently.
>
> Looking at your code, between checking
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true,
> there is a block of code (the “if-else”) to run. So it’s possible that
> multiple backend processes have passed the
> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple
> signals will be sent to a process, leading to duplicate timeouts being
> enabled in the receiver process.

I don't see how that can happen; we're checking wakeup_pending_flag while holding an exclusive lock, so I don't see how multiple backend processes could be within the region where we check/set wakeup_pending_flag at the same time?

/Joel
On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>> I think what you explained is partially correct.
>>
>> Based on my understanding, any backend process may call
>> SignalBackends(), which means that it’s possible that multiple backend
>> processes may call SignalBackends() concurrently.
>>
>> Looking at your code, between checking
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true,
>> there is a block of code (the “if-else”) to run. So it’s possible that
>> multiple backend processes have passed the
>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple
>> signals will be sent to a process, leading to duplicate timeouts being
>> enabled in the receiver process.
>
> I don't see how that can happen; we're checking wakeup_pending_flag
> while holding an exclusive lock, so I don't see how multiple backend
> processes could be within the region where we check/set
> wakeup_pending_flag at the same time?
>
> /Joel
I might have missed the fact that an exclusive lock is held. I will revisit that part again.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Fri, Sep 26, 2025, at 11:44, Chao Li wrote:
>> On Sep 26, 2025, at 17:32, Joel Jacobson <joel@compiler.org> wrote:
>>
>> On Fri, Sep 26, 2025, at 04:26, Chao Li wrote:
>>
>>> I think what you explained is partially correct.
>>>
>>> Based on my understanding, any backend process may call
>>> SignalBackends(), which means that it’s possible that multiple backend
>>> processes may call SignalBackends() concurrently.
>>>
>>> Looking at your code, between checking
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) and setting the flag to true,
>>> there is a block of code (the “if-else”) to run. So it’s possible that
>>> multiple backend processes have passed the
>>> QUEUE_BACKEND_WAKEUP_PENDING_FLAG(i) check, in which case multiple
>>> signals will be sent to a process, leading to duplicate timeouts being
>>> enabled in the receiver process.
>>
>> I don't see how that can happen; we're checking wakeup_pending_flag
>> while holding an exclusive lock, so I don't see how multiple backend
>> processes could be within the region where we check/set
>> wakeup_pending_flag at the same time?
>>
>> /Joel
>
> I might have missed the fact that an exclusive lock is held. I will revisit
> that part again.

I've re-read this entire thread, and I actually think my original approaches are more promising, that is, the 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted signaling.

Therefore, merely consider the latest patch a PoC with some possibly interesting ideas.

Before this patch, I had never used PostgreSQL's timeout mechanism, so I didn't consider it when thinking about how to solve the remaining problems with 0001-optimize_listen_notify-v4.patch, which currently can't guarantee that all listening backends will eventually catch up, since it just kicks one of the most lagging ones for each notification. This could be a problem in practice if there is a long period of time with no notifications coming in. Then some listening backends could end up not being signaled and would stay behind, preventing the queue tail from advancing.

I'm thinking maybe somehow we can use the timeout mechanism here, but I'm not sure how yet. Any ideas?

/Joel
On Sep 28, 2025, at 18:24, Joel Jacobson <joel@compiler.org> wrote:
>> I might have missed the fact that an exclusive lock is held. I will
>> revisit that part again.
>
> I've re-read this entire thread, and I actually think my original
> approaches are more promising, that is, the
> 0001-optimize_listen_notify-v4.patch patch, doing multicast targeted
> signaling.
>
> Therefore, merely consider the latest patch a PoC with some possibly
> interesting ideas.
>
> Before this patch, I had never used PostgreSQL's timeout mechanism,
> so I didn't consider it when thinking about how to solve the
> remaining problems with 0001-optimize_listen_notify-v4.patch, which
> currently can't guarantee that all listening backends will eventually
> catch up, since it just kicks one of the most lagging ones for each
> notification. This could be a problem in practice if there is a long
> period of time with no notifications coming in. Then some listening
> backends could end up not being signaled and would stay behind,
> preventing the queue tail from advancing.
>
> I'm thinking maybe somehow we can use the timeout mechanism here, but
> I'm not sure how yet. Any ideas?
>
> /Joel
Hi Joel,
I never had a concern about using the timeout mechanism. My comment was about enabling the timeout twice.
I just revisited the code, and now I agree that I was over-worried, because I missed considering NotifyQueueLock. With the lock protection, a backend process's QUEUE_BACKEND_WAKEUP_PENDING_FLAG won't have a race condition, so there should be no duplicate signals sent to the same backend process. Then in the backend process, you have "last_wakeup_start_time", which avoids duplicate timeouts within the configured period, and you reset last_wakeup_start_time in asyncQueueReadAllNotifications() together with clearing the QUEUE_BACKEND_WAKEUP_PENDING_FLAG.
So, overall v2 looks good to me.
One last tiny comment is about the naming of last_wakeup_start_time. I think it could be renamed to "last_wakeup_time": since the variable just records when asyncQueueReadAllNotifications() was last called, there doesn't seem to be any meaning of "start" involved.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, Sep 29, 2025, at 04:33, Chao Li wrote:
> I never had a concern about using the timeout mechanism. My comment was
> about enabling the timeout twice.

Thanks for reviewing. However, as I said in my previous email, I'm sorry, but I don't believe in my suggested throughput/latency approach. I unfortunately managed to derail from the IMO more promising approaches I worked on initially.

What I couldn't find a solution to then was the problem of possibly ending up in a situation where some lagging backends would never catch up.

In this new patch, I've simply introduced a new bgworker, given the specific task of kicking lagging backends. I wish of course we could do without the bgworker, but I don't see how that would be possible.

---

optimize_listen_notify-v5.patch:

Fix LISTEN/NOTIFY so it scales with idle listening backends

Currently, idle listening backends cause a dramatic slowdown due to context switching when they are signaled and wake up. This is wasteful when they are not listening to the channel being notified. Just 10 extra idle listening connections cause a slowdown from 8700 TPS to 6100 TPS, 100 extra cause it to drop to 2000 TPS, and at 1000 extra it falls to 250 TPS.

To improve scalability with the number of idle listening backends, this patch introduces a shared hash table to keep track of channels per listening backend. This hash table is partitioned to reduce contention on concurrent LISTEN/UNLISTEN operations.

We keep track of up to NOTIFY_MULTICAST_THRESHOLD (16) listeners per channel. Benchmarks indicated diminishing gains above this level. Setting it lower seems unnecessary, so a constant seemed fine; a GUC did not seem motivated.

This patch also adds a wakeup_pending flag to each backend's queue status, to avoid redundant signaling when a wakeup is already pending and the backend would otherwise be signaled again. The flag is set when a backend is signaled and cleared before processing the queue. This order is important to ensure correctness.

It was also necessary to add a new bgworker, notify_bgworker, whose sole responsibility is to wake up lagging listening backends, ensuring they are kicked when they are about to fall too far behind. This bgworker is always started at postmaster startup, but is only activated upon NOTIFY by signaling it, unless it is already active.

The notify_bgworker staggers the signaling of lagging listening backends by sleeping 100 ms between each signal, to prevent the thundering herd problem we would otherwise get if all listening backends woke up at the same time. It loops until there are no more lagging listening backends, and then becomes inactive.

/Joel
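In sketch form, the notify_bgworker's main loop described above (helper names are illustrative):

```c
for (;;)
{
	ProcNumber	laggard = FindMostLaggingListener();	/* hypothetical */

	if (laggard == INVALID_PROC_NUMBER)
		break;					/* everyone caught up: become inactive */

	SignalBackendToCatchUp(laggard);	/* hypothetical targeted kick */
	pg_usleep(100 * 1000L);		/* stagger wakeups by 100 ms to avoid a
								 * thundering herd */
}
```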
On Tue, Sep 30, 2025, at 20:56, Joel Jacobson wrote:
> Attachments:
> * optimize_listen_notify-v5.patch

Changes since v5:

*) Added missing #include "nodes/pg_list.h" to fix a List type error in headerscheck.
*) Added NOTIFY_DEFERRED_WAKEUP_MAIN to wait_event_names.txt and renamed WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP to WAIT_EVENT_NOTIFY_DEFERRED_WAKEUP_MAIN.

/Joel
"Joel Jacobson" <joel@compiler.org> writes:
> Thanks for reviewing. However, as I said in my previous email, I'm
> sorry, but I don't believe in my suggested throughput/latency approach. I
> unfortunately managed to derail from the IMO more promising approaches I
> worked on initially.
> What I couldn't find a solution to then was the problem of possibly
> ending up in a situation where some lagging backends would never catch
> up.
> In this new patch, I've simply introduced a new bgworker, given the
> specific task of kicking lagging backends. I wish of course we could do
> without the bgworker, but I don't see how that would be possible.
I don't understand why you feel you need a bgworker. The existing
code does not have any provision that guarantees a lost signal will
eventually be re-sent --- it will be if there is continuing NOTIFY
traffic, but not if all the senders suddenly go quiet. AFAIR
we've had zero complaints about that in 25+ years. So I'm perfectly
content to continue the approach of "check for laggards during
NOTIFY". (This could be gated behind an overall check on how long the
notify queue is, so that we don't expend the cycles when things are
performing as-expected.) If you feel that that's not robust enough,
you should split it out as a separate patch that's advertised as a
robustness improvement not a performance improvement, and see if you
can get anyone to bite.
The other thing I'm concerned about with this patch is the new shared
hash table. I don't think we have anywhere near a good enough fix on
how big it needs to be, and that is problematic because of the
frozen-at-startup size of main shared memory. We could imagine
inventing YA GUC to let the user tell us how big to make it,
but I think there is now a better way: use a dshash table
(src/backend/lib/dshash.c). That offers the additional win that we
don't have to create it at all in an installation that never uses
LISTEN/NOTIFY. We could also rethink whether we really need the
NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths,
we could just say that all listeners are registered for every channel.
regards, tom lane
On Thu, Oct 2, 2025, at 18:39, Tom Lane wrote:
> I don't understand why you feel you need a bgworker. The existing
> code does not have any provision that guarantees a lost signal will
> eventually be re-sent --- it will be if there is continuing NOTIFY
> traffic, but not if all the senders suddenly go quiet. AFAIR
> we've had zero complaints about that in 25+ years. So I'm perfectly
> content to continue the approach of "check for laggards during
> NOTIFY". (This could be gated behind an overall check on how long the
> notify queue is, so that we don't expend the cycles when things are
> performing as-expected.) If you feel that that's not robust enough,
> you should split it out as a separate patch that's advertised as a
> robustness improvement not a performance improvement, and see if you
> can get anyone to bite.

Good point. I agree it's better to check for laggards during NOTIFY.

> The other thing I'm concerned about with this patch is the new shared
> hash table. I don't think we have anywhere near a good enough fix on
> how big it needs to be, and that is problematic because of the
> frozen-at-startup size of main shared memory. We could imagine
> inventing YA GUC to let the user tell us how big to make it,
> but I think there is now a better way: use a dshash table
> (src/backend/lib/dshash.c). That offers the additional win that we
> don't have to create it at all in an installation that never uses
> LISTEN/NOTIFY. We could also rethink whether we really need the
> NOTIFY_MULTICAST_THRESHOLD limit: rather than having two code paths,
> we could just say that all listeners are registered for every channel.

Thanks for the guidance; I didn't know about dshash. The patch is now using dshash. I've been looking at the code in launcher.c when implementing it. The function init_channel_hash() ended up being very similar to launcher.c's logicalrep_launcher_attach_dshmem().

/Joel
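A rough sketch of what init_channel_hash() looks like when modeled on logicalrep_launcher_attach_dshmem(); the NotifyChannelCtl struct, field, and tranche names are illustrative, the dshash_parameters field set varies between major versions, and real code must also serialize the create-vs-attach decision (e.g. under an LWLock):

```c
static const dshash_parameters dsh_params = {
	sizeof(ChannelHashKey),
	sizeof(ChannelEntry),
	dshash_memcmp,
	dshash_memhash,
	dshash_memcpy,
	LWTRANCHE_NOTIFY_DSA		/* hypothetical tranche */
};

static void
init_channel_hash(void)
{
	MemoryContext oldcxt;

	if (channel_hash != NULL)
		return;					/* already attached in this backend */

	/* Keep the DSA mapping for the whole backend lifetime. */
	oldcxt = MemoryContextSwitchTo(TopMemoryContext);

	if (NotifyChannelCtl->dsa == DSA_HANDLE_INVALID)
	{
		/* First user: create area and hash, then publish the handles. */
		channel_dsa = dsa_create(LWTRANCHE_NOTIFY_DSA);
		dsa_pin(channel_dsa);
		dsa_pin_mapping(channel_dsa);
		channel_hash = dshash_create(channel_dsa, &dsh_params, NULL);
		NotifyChannelCtl->dsa = dsa_get_handle(channel_dsa);
		NotifyChannelCtl->dsh = dshash_get_hash_table_handle(channel_hash);
	}
	else
	{
		/* Everyone else just attaches. */
		channel_dsa = dsa_attach(NotifyChannelCtl->dsa);
		dsa_pin_mapping(channel_dsa);
		channel_hash = dshash_attach(channel_dsa, &dsh_params,
									 NotifyChannelCtl->dsh, NULL);
	}

	MemoryContextSwitchTo(oldcxt);
}
```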
Attachments
On Mon, Oct 6, 2025, at 22:11, Joel Jacobson wrote:
> The patch is now using dshash. I've been looking at code in launcher.c
> when implementing it. The function init_channel_hash() ended up being
> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
Noticed a mistake on one line just after pressing send.
Sorry about that, new version attached.
/Joel
Attachments
On Mon, Oct 6, 2025, at 22:22, Joel Jacobson wrote:
> On Mon, Oct 6, 2025, at 22:11, Joel Jacobson wrote:
>> The patch is now using dshash. I've been looking at code in launcher.c
>> when implementing it. The function init_channel_hash() ended up being
>> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
>
> Noticed a mistake on one line just after pressing send.
> Sorry about that, new version attached.
Trying to fix the NetBSD failure.
I don't understand why 001_constraint_validation, test 'list_parted2_def
scanned' and test 'part_5 verified by existing constraints' should be
affected by this patch. I guess I could have gotten something wrong with
the locking with dshash, that might somehow affect other tests?
I've changed the dshash_find() in SignalBackends from
dshash_find(..., false) to dshash_find(..., true), that is, to take an
exclusive lock instead. Not sure if this is necessary, since we're not
modifying the entry, but we're already holding an exclusive lock on
NotifyQueueLock here, so I don't think it should affect concurrency.
Any help on looking specifically at the dshash code would be much
appreciated, since I'm new to this interface.
/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes:
> Trying to fix the NetBSD failure.
> I don't understand why 001_constraint_validation, test 'list_parted2_def
> scanned' and test 'part_5 verified by existing constraints' should be
> affected by this patch. I guess I could have gotten something wrong with
> the locking with dshash, that might somehow affect other tests?
Our CI infrastructure is not as stable as one could wish. You
sure this is related at all?
regards, tom lane
On Tue, Oct 7, 2025, at 07:43, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> Trying to fix the NetBSD failure.
>> I don't understand why 001_constraint_validation, test 'list_parted2_def
>> scanned' and test 'part_5 verified by existing constraints' should be
>> affected by this patch. I guess I could have gotten something wrong with
>> the locking with dshash, that might somehow affect other tests?
>
> Our CI infrastructure is not as stable as one could wish. You
> sure this is related at all?
No, not sure at all. OK, then going forward, I guess I should ignore
errors coming from just a single farm animal if the error seems
unrelated to my patch.
/Joel
On Mon Oct 6, 2025 at 5:11 PM -03, Joel Jacobson wrote:
> The patch is now using dshash. I've been looking at code in launcher.c
> when implementing it. The function init_channel_hash() ended up being
> very similar to launcher.c's logicalrep_launcher_attach_dshmem().
>
Hi,
This is not a complete review; I just read the v9 patch and summarized
some points.
1. You may want to add ChannelEntry and ChannelHashKey to typedefs.list
to get pgindent to do the right job on indentation.
2. The ListCell* variables are normally named as lc
+ ListCell *p;
3. This block in ChannelHashRemoveListener() seems contradictory. You
early-return if channel_hash == NULL and then call init_channel_hash(),
which will early-return if channel_hash != NULL. So if channel_hash !=
NULL, I don't think we need to call init_channel_hash()?
+ if (channel_hash == NULL)
+ return;
+
+ init_channel_hash();
A similar check also exists on SignalBackends()
if (channel_hash == NULL)
...
else
{
// channel_hash is != NULL, so init_channel_hash will early
// return.
init_channel_hash();
...
}
4. The ChannelHashRemoveListener() release lock logic could be
simplified to something like the following; what do you think?
+ if (entry->num_listeners == 0)
+ {
+ dsa_free(channel_dsa, entry->listeners_array);
+ dshash_delete_entry(channel_hash, entry);
+ }
+ break;
+ }
+ }
+
+ /* Not found in list */
+ dshash_release_lock(channel_hash, entry);
5. You may want to use list_member() on GetPendingNotifyChannels() to
avoid the inner loop to check for duplicate channel names.
6. s/beind/behind
+ /* Need to signal if a backend has fallen too
far beind */
7. I'm wondering if we could add some TAP tests for this? I think it
would be good to add a case ensuring that we can grow the dshash
correctly and that we manage multiple backends on the same channel
properly. This CF [1] has some examples of how TAP tests can be created
to test LISTEN/NOTIFY
[1] https://commitfest.postgresql.org/patch/6095/
--
Matheus Alcantara
Matheus Alcantara <matheusssilv97@gmail.com> writes:
> 7. I'm wondering if we could add some TAP tests for this?
async.c seems already moderately well covered by existing tests
src/test/regress/sql/async.sql
src/test/isolation/specs/async-notify.spec
Do we need more? If there's something not covered, can we extend
those test cases instead of spinning up a whole new installation
for a TAP test?
Also, I don't think it's the job of this patch to provide test
coverage for dshash. That should be quite well covered already.
regards, tom lane
On Tue, Oct 7, 2025, at 14:40, Matheus Alcantara wrote:
> This is not a complete review, I just read the v9 patch and summarized
> some points.
Many thanks for the review!
> 1. You may want to add ChannelEntry and ChannelHashKey to typedefs.list
> to get pgindent do the right job on indentation.
Fixed.
> 2. The ListCell* variables are normally named as lc
> + ListCell *p;
I agree, better to be consistent. I renamed the variables this patch
adds, but I didn't change the existing ListCell *p variables in async.c.
Would we want to harmonize it to just *lc everywhere in async.c?
I notice we also use ListCell *l in async.c in some places.
> 3. This block on ChannelHashRemoveListener() seems contradictory. You
> early return if channel_hash == NULL and then call init_channel_hash
> that it will early return if channel_hash != NULL. So if channel_hash !=
> NULL I don't think that we need to call init_channel_hash()?
> + if (channel_hash == NULL)
> + return;
> +
> + init_channel_hash();
>
> A similar check also exists on SignalBackends()
> if (channel_hash == NULL)
> ...
> else
> {
> // channel_hash is != NULL, so init_channel_hash will early
> // return.
> init_channel_hash();
> ...
> }
Ahh, right, I agree. I've removed the unnecessary init_channel_hash()
calls.
> 4. The ChannelHashRemoveListener() release lock logic could be
> simplified to something like the following, what do you think?
> + if (entry->num_listeners == 0)
> + {
> + dsa_free(channel_dsa, entry->listeners_array);
> + dshash_delete_entry(channel_hash, entry);
> + }
> + break;
> + }
> + }
> +
> + /* Not found in list */
> + dshash_release_lock(channel_hash, entry);
That would be nicer, but I noted that dshash_delete_entry() releases the
lock just like dshash_release_lock(), so then I think we would need to
return; after dshash_delete_entry(), to prevent attempting to release
the lock twice?
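A sketch of how that could look with the early return, based on the fragment quoted above (the control flow shown is illustrative, not the actual patch code):
```c
	/* Sketch: dshash_delete_entry() releases the entry's lock itself,
	 * so return immediately afterwards to avoid a double release. */
	if (entry->num_listeners == 0)
	{
		dsa_free(channel_dsa, entry->listeners_array);
		dshash_delete_entry(channel_hash, entry);
		return;
	}
	dshash_release_lock(channel_hash, entry);
```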
> 5. You may want to use list_member() on GetPendingNotifyChannels() to
> avoid the inner loop to check for duplicate channel names.
Ahh, much nicer! Fixed.
> 6. s/beind/behind
> + /* Need to signal if a backend has fallen too
> far beind */
Fixed.
> 7. I'm wondering if we could add some TAP tests for this? I think that
> adding a case to ensure that we can grown the dshash correctly and also
> we manage multiple backends to the same channel properly. This CF [1]
> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
I will look over the tests. Maybe we should add some elog DEBUG at the
new code paths, and ensure the tests at least cover all of them?
/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes:
>> 7. I'm wondering if we could add some TAP tests for this? I think that
>> adding a case to ensure that we can grown the dshash correctly and also
>> we manage multiple backends to the same channel properly. This CF [1]
>> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
> I will look over the tests. Maybe we should add some elog DEBUG at the
> new code paths, and ensure the tests at least cover all of them?
I went to do a coverage test on v10, and found that it does not get
through the existing async-notify isolation test: it panics with
"cannot abort transaction %u, it was already committed". It's a bit
premature to worry about adding new tests if you're not passing the
ones that are there.
regards, tom lane
On Tue, Oct 7, 2025, at 20:14, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>>> 7. I'm wondering if we could add some TAP tests for this? I think that
>>> adding a case to ensure that we can grown the dshash correctly and also
>>> we manage multiple backends to the same channel properly. This CF [1]
>>> has some examples of how TAP tests can be created to test LISTEN/NOTIFY
>
>> I will look over the tests. Maybe we should add some elog DEBUG at the
>> new code paths, and ensure the tests at least cover all of them?
>
> I went to do a coverage test on v10, and found that it does not get
> through the existing async-notify isolation test: it panics with
> "cannot abort transaction %u, it was already committed". It's a bit
> premature to worry about adding new tests if you're not passing the
> ones that are there.
Oops, I see I got the list_member() code wrong. I've changed it to now
create String nodes, and then use strVal().
I also changed back to dshash_find(..., false) in SignalBackends(),
since that makes more sense to me, since we're not modifying the entry.
(This was the code change due to me being fooled by the false alarm from
the NetBSD animal.)
/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes:
> Ops, I see I got the list_member() code wrong. I've changed it to now
> create String nodes, and then use strVal().
Might be better to revert to the previous coding. Using String
nodes is going to roughly double the space eaten for the list,
and it seems like it's not buying you a lot.
> I also changed back to dshash_find(..., false) in SignalBackends(),
> since that makes more sense to me, since we're not modifying entry.
Agreed.
I did a code coverage run and it seems like things are in pretty
good shape already. async.c is about 88% covered and a lot of the
omissions are either Trace_notify or unreached error reports, which
I'm not especially concerned about. The visible coverage gaps are:
1. asyncQueueFillWarning. This wasn't covered before either, because
it doesn't seem very practical to exercise it in an everyday
regression test. Since your patch doesn't touch that code nor the
queue contents, I'm not concerned here.
2. AtSubCommit_Notify's reparenting stanza. This is a pre-existing
omission too, but maybe worth doing something about?
3. AtSubAbort_Notify's pendingActions cleanup loop; same comments.
4. notification_match is not called at all. Again, pre-existing
coverage gap.
5. ChannelHashAddListener: "already registered" case is not reached,
which surprises me a bit, and neither is the "grow the array" stanza.
Since this is new code it might be worth adding coverage.
regards, tom lane
On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
> Matheus Alcantara <matheusssilv97@gmail.com> writes:
>> 7. I'm wondering if we could add some TAP tests for this?
>
> async.c seems already moderately well covered by existing tests
> src/test/regress/sql/async.sql
> src/test/isolation/specs/async-notify.spec
>
> Do we need more? If there's something not covered, can we extend
> those test cases instead of spinning up a whole new installation
> for a TAP test?
>
I've run the test coverage on v9 and it seems that we still have good
code coverage. I would have imagined that the new branches could affect
the coverage, but that's not the case. There is just some small piece of
new code added that is not being covered.
> Also, I don't think it's the job of this patch to provide test
> coverage for dshash. That should be quite well covered already.
>
When I mentioned testing that we can grow the dshash correctly, it's
because the v9 patch has logic to grow the array stored in the dshash
entry value, which currently is not covered by the tests. I'm not
saying to test the dshash internal logic, which I agree is not the
job of this patch. Sorry for being confusing.
+ /* Need to add this listener */
+ if (entry->num_listeners >= entry->allocated_listeners)
+ {
+ /* Grow the array (double the size) */
+ int new_size = entry->allocated_listeners * 2;
+ dsa_pointer new_array = dsa_allocate(channel_dsa,
+ sizeof(ProcNumber) * new_size);
+ ProcNumber *new_listeners = (ProcNumber *) dsa_get_address(channel_dsa,
+ new_array);
+
+ /* Copy existing listeners */
+ memcpy(new_listeners, listeners,
+ sizeof(ProcNumber) * entry->num_listeners);
+
+ /* Free old array and update entry */
+ dsa_free(channel_dsa, entry->listeners_array);
+ entry->listeners_array = new_array;
+ entry->allocated_listeners = new_size;
+ listeners = new_listeners;
+ }
--
Matheus Alcantara
"Matheus Alcantara" <matheusssilv97@gmail.com> writes:
> On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
>> Also, I don't think it's the job of this patch to provide test
>> coverage for dshash. That should be quite well covered already.
> When I was mentioning to test that we can grow the dshash correctly it's
> because the v9 patch has a logic to grow the array stored on dshash
> entry value that currently is not being cover by the tests. I'm not
> saying to test the dshash internal logic which I agree that it's not the
> job of this patch. Sorry for being confusing.
Ah, yeah, I misunderstood what you meant. I agree that covering that
"Grow the array" stanza is a good idea, in fact I said the same thing
a little bit ago.
regards, tom lane
On Tue Oct 7, 2025 at 6:17 PM -03, Tom Lane wrote:
> "Matheus Alcantara" <matheusssilv97@gmail.com> writes:
>> On Tue Oct 7, 2025 at 1:51 PM -03, Tom Lane wrote:
>>> Also, I don't think it's the job of this patch to provide test
>>> coverage for dshash. That should be quite well covered already.
>
>> When I was mentioning to test that we can grow the dshash correctly it's
>> because the v9 patch has a logic to grow the array stored on dshash
>> entry value that currently is not being cover by the tests. I'm not
>> saying to test the dshash internal logic which I agree that it's not the
>> job of this patch. Sorry for being confusing.
>
> Ah, yeah, I misunderstood what you meant. I agree that covering that
> "Grow the array" stanza is a good idea, in fact I said the same thing
> a little bit ago.
Yeah, I just saw your response after I sent the email; I agree with all
the points. So I think that we are on the same "page" now.
--
Matheus Alcantara
On Oct 8, 2025, at 03:26, Joel Jacobson <joel@compiler.org> wrote:
> /Joel <optimize_listen_notify-v11.patch>
After several rounds of reviewing, the code is already very good. I just got a few small comments:
1
```
+ channels = GetPendingNotifyChannels();
+
LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
- for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i = QUEUE_NEXT_LISTENER(i))
+ foreach(lc, channels)
```
I don’t see where “channels” is freed. GetPendingNotifyChannels() creates a list of Nodes; both the list and the Nodes it points to should be freed.
2
```
+ foreach(lc, channels)
{
- int32 pid = QUEUE_BACKEND_PID(i);
- QueuePosition pos;
+ char *channel = strVal(lfirst(lc));
+ ChannelEntry *entry;
+ ProcNumber *listeners;
+ ChannelHashKey key;
- Assert(pid != InvalidPid);
- pos = QUEUE_BACKEND_POS(i);
- if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
+ if (channel_hash == NULL)
+ entry = NULL;
+ else
```
I wonder whether or not “channel_hash” can be NULL here? Maybe possible if a channel is un-listened while the event is pending?
So, maybe add a comment here to explain the logic.
3
The same piece of code as 2.
I think the code can be optimized a little bit. First, we can initialize entry to NULL; then we don’t need the if-else. Second, “key” is only used for dshash_find(), so it can be defined where it is used.
foreach(lc, channels)
{
char *channel = strVal(lfirst(lc));
ChannelEntry *entry = NULL;
ProcNumber *listeners;
//ChannelHashKey key;
if (channel_hash != NULL)
{
ChannelHashKey key;
ChannelHashPrepareKey(&key, MyDatabaseId, channel);
entry = dshash_find(channel_hash, &key, false);
}
if (entry == NULL)
continue; /* No listeners registered for this channel */
4
```
+ if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
+ continue;
```
I wonder if “signaled[i]” is a duplicate flag of "QUEUE_BACKEND_WAKEUP_PENDING(i)”?
I understand signaled is local, and QUEUE_BACKEND_WAKEUP_PENDING is in shared memory and may be set by other processes, but locally, when signaled[i] is set, QUEUE_BACKEND_WAKEUP_PENDING(i) is also set. And because of NotifyQueueLock, other processes should not be able to clear the flag.
But if “signaled” is really needed, maybe we can use a Bitmapset (src/backend/nodes/bitmapset.c); that would use 1/8 of the memory compared to the bool array.
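A minimal sketch of that alternative, reusing the listener-walk macros from async.c; the loop body here is illustrative, not the patch's actual code:
```c
	/* Hypothetical variant of the suppression check using a Bitmapset
	 * instead of a bool array. */
	Bitmapset  *signaled = NULL;

	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER;
		 i = QUEUE_NEXT_LISTENER(i))
	{
		if (bms_is_member(i, signaled) || QUEUE_BACKEND_WAKEUP_PENDING(i))
			continue;

		/* ... decide whether to signal backend i ... */
		signaled = bms_add_member(signaled, i);
	}

	bms_free(signaled);
```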
5
```
/*
@@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
LWLockAcquire(NotifyQueueLock, LW_SHARED);
/* Assert checks that we have a valid state entry */
Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
+ QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
```
This piece of code originally only read shared memory, so it could use LW_SHARED lock mode, but now it writes to shared memory; do we need to change the lock mode to “exclusive”?
6
```
+static inline void
+ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
+{
+ memset(key, 0, sizeof(ChannelHashKey));
+ key->dboid = dboid;
+ strlcpy(key->channel, channel, NAMEDATALEN);
+}
```
Do we really need the memset()? If “channel” is of length NAMEDATALEN, then it still results in a non-0-terminated key->channel; if channel is shorter than NAMEDATALEN, strlcpy will automatically add a trailing ‘\0’. I think previous code should have ensured that the length of channel is less than NAMEDATALEN.
7
```
*
* Resist the temptation to make this really large. While that would save
* work in some places, it would add cost in others. In particular, this
@@ -246,6 +280,7 @@ typedef struct QueueBackendStatus
Oid dboid; /* backend's database OID, or InvalidOid */
ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
QueuePosition pos; /* backend has read queue up to here */
+ bool wakeup_pending; /* signal sent but not yet processed */
} QueueBackendStatus;
```
In the same structure, the rest of the fields are in camel case, so I think it’s better to rename the new field to “wakeupPending”.
8
```
@@ -288,11 +323,91 @@ typedef struct AsyncQueueControl
ProcNumber firstListener; /* id of first listener, or
* INVALID_PROC_NUMBER */
TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
+ dsa_handle channel_hash_dsa;
+ dshash_table_handle channel_hash_dsh;
QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
```
Same as 7, but in this case, type names are not camel case, maybe okay for field names. I don’t have a strong opinion here.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Oct 8, 2025, at 11:43, Chao Li <li.evan.chao@gmail.com> wrote:
> 6
> ```
> +static inline void
> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
> +{
> + memset(key, 0, sizeof(ChannelHashKey));
> + key->dboid = dboid;
> + strlcpy(key->channel, channel, NAMEDATALEN);
> +}
> ```
> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
> then it still results in a non-0 terminated key->channel; if channel is
> shorter than NAMEDATALEN, strlcpy will auto add a tailing ‘\0’. I think
> previous code should have ensured length of channel should be less than
> NAMEDATALEN.
For comment 6, the result is the same: I don’t think memset() is needed. However, my previous explanation was wrong; it applies to strncpy(), not strlcpy(). strlcpy() already adds a terminating ‘\0’.
And one more nit comment:
9
```
+ int allocated_listeners; /* Allocated size of array */
```
For “size” here, I guess you meant “length”, though “size” also works, but usually “size” means bytes occupied by an array and “length” means number of elements of an array. So, “length” would be clearer here.
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> Ops, I see I got the list_member() code wrong. I've changed it to now
>> create String nodes, and then use strVal().
>
> Might be better to revert to the previous coding. Using String
> nodes is going to roughly double the space eaten for the list,
> and it seems like it's not buying you a lot.
>
>> I also changed back to dshash_find(..., false) in SignalBackends(),
>> since that makes more sense to me, since we're not modifying entry.
>
> Agreed.
>
> I did a code coverage run and it seems like things are in pretty
> good shape already. async.c is about 88% covered and a lot of the
> omissions are either Trace_notify or unreached error reports, which
> I'm not especially concerned about. The visible coverage gaps are:
>
> 1. asyncQueueFillWarning. This wasn't covered before either, because
> it doesn't seem very practical to exercise it in an everyday
> regression test. Since your patch doesn't touch that code nor the
> queue contents, I'm not concerned here.
I agree.
> 2. AtSubCommit_Notify's reparenting stanza. This is a pre-existing
> omission too, but maybe worth doing something about?
>
> 3. AtSubAbort_Notify's pendingActions cleanup loop; same comments.
>
> 4. notification_match is not called at all. Again, pre-existing
> coverage gap.
I've added test coverage for all three items above.
> 5. ChannelHashAddListener: "already registered" case is not reached,
> which surprises me a bit, and neither is the "grow the array" stanza.
> Since this is new code it might be worth adding coverage.
I've added a test for the "grow the array" stanza.
The "already registered" case seems impossible to reach, since the
caller, Exec_ListenCommit, returns early if IsListeningOn.
Patches:
0001-optimize_listen_notify-v12.patch:
Improve LISTEN/NOTIFY test coverage
0002-optimize_listen_notify-v12.patch:
Optimize LISTEN/NOTIFY with channel-specific listener tracking
I split this into two patches, to make it easier to verify that the
second patch doesn't affect the tests added by the first patch.
The 0001 patch also includes the "grow the array" test, which is
pointless without the 0002 patch, but it felt better to add it first
anyway.
I've also made changes in v12 based on feedback from Chao Li, to which
I will reply shortly.
/Joel
Attachments
On Wed, Oct 8, 2025, at 05:43, Chao Li wrote:
> After several rounds of reviewing, the code is already very good. I
> just got a few small comments:
Thanks for feedback!
The below changes have been incorporated into the v12 version
sent in my previous email.
>> On Oct 8, 2025, at 03:26, Joel Jacobson <joel@compiler.org> wrote:
>>
>>
>> /Joel<optimize_listen_notify-v11.patch>
>
>
> 1
> ```
> + channels = GetPendingNotifyChannels();
> +
> LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
> - for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i
> = QUEUE_NEXT_LISTENER(i))
> + foreach(lc, channels)
> ```
>
> I don’t see where “channels” is freed. GetPendingNotifyChannels()
> creates a list of Nodes, both the list and Nodes the list points to
> should be freed.
Per suggestion from Tom Lane I reverted back GetPendingNotifyChannels(),
so this comment is not applicable any longer.
> 2
> ```
> + foreach(lc, channels)
> {
> - int32 pid = QUEUE_BACKEND_PID(i);
> - QueuePosition pos;
> + char *channel = strVal(lfirst(lc));
> + ChannelEntry *entry;
> + ProcNumber *listeners;
> + ChannelHashKey key;
>
> - Assert(pid != InvalidPid);
> - pos = QUEUE_BACKEND_POS(i);
> - if (QUEUE_BACKEND_DBOID(i) == MyDatabaseId)
> + if (channel_hash == NULL)
> + entry = NULL;
> + else
> ```
>
> I wonder whether or not “channel_hash” can be NULL here? Maybe possible
> if a channel is un-listened while the event is pending?
Yes, I think channelHash can be NULL here if doing a NOTIFY
when there hasn't been a LISTEN yet.
> So, maybe add a comment here to explain the logic.
Not sure I think that's necessary.
What do you suggest that comment would say?
> 3
> The same piece of code as 2.
>
> I think the code can be optimized a little bit. First, we can
> initialize entry to NULL, then we don’t the if-else. Second, “key” is
> only used for dshash_find(), so it can defined where it is used.
>
> foreach(lc, channels)
> {
> char *channel = strVal(lfirst(lc));
> ChannelEntry *entry = NULL;
> ProcNumber *listeners;
> //ChannelHashKey key;
>
> if (channel_hash != NULL)
> {
> ChannelHashKey key;
> ChannelHashPrepareKey(&key, MyDatabaseId, channel);
> entry = dshash_find(channel_hash, &key, false);
> }
>
> if (entry == NULL)
> continue; /* No listeners registered for this channel */
Nice, I agree that's more readable; I changed it like that.
> 4
> ```
> + if (signaled[i] || QUEUE_BACKEND_WAKEUP_PENDING(i))
> + continue;
> ```
>
> I wonder if “signaled[i]” is a duplicate flag of
> "QUEUE_BACKEND_WAKEUP_PENDING(i)”?
>
> I understand signaled is local, and QUEUE_BACKEND_WAKEUP_PENDING is in
> shared memory and may be set by other processes, but in local, when
> signaled[I] is set, QUEUE_BACKEND_WAKEUP_PENDING(i) is also set. And
> because of NotifyQueueLock, other process should not be able to cleanup
> the flag.
>
> But if “signals” is really needed, maybe we can use Bitmapset
> (src/backend/nodes/bitmapset.c), that would use 1/8 of memories
> comparing to the bool array.
I agree; since we're holding an exclusive lock, the signaled array is redundant.
I've removed it, so that we rely only on the wakeupPending flag.
> 5
> ```
> /*
> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
> LWLockAcquire(NotifyQueueLock, LW_SHARED);
> /* Assert checks that we have a valid state entry */
> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
> ```
>
> This piece of code originally only read the shared memory, so it can
> use LW_SHARED lock mode, but now it writes to the shared memory, do we
> need to change the lock mode to “exclusive”?
No, LW_SHARED is sufficient here, since the backend only modifies its own state,
and no other backend could do that, without holding an exclusive lock.
> 6
> ```
> +static inline void
> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
> +{
> + memset(key, 0, sizeof(ChannelHashKey));
> + key->dboid = dboid;
> + strlcpy(key->channel, channel, NAMEDATALEN);
> +}
> ```
>
> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
> then it still results in a non-0 terminated key->channel; if channel is
> shorter than NAMEDATALEN, strlcpy will auto add a tailing ‘\0’. I think
> previous code should have ensured length of channel should be less than
> NAMEDATALEN.
Yes, I think we need memset, since I fear that when the hash table keys
are compared, every byte of the struct might be inspected, so without
zero-initializing it, there could be unused bytes after the null
terminator, that could then cause logically identical keys to be wrongly
considered different.
I haven't checked the implementation though, but my gut feeling says
it's better to be a bit paranoid here.
> 7
> ```
> *
> * Resist the temptation to make this really large. While that would save
> * work in some places, it would add cost in others. In particular, this
> @@ -246,6 +280,7 @@ typedef struct QueueBackendStatus
> Oid dboid; /* backend's database OID, or InvalidOid */
> ProcNumber nextListener; /* id of next listener, or INVALID_PROC_NUMBER */
> QueuePosition pos; /* backend has read queue up to here */
> + bool wakeup_pending; /* signal sent but not yet processed */
> } QueueBackendStatus;
> ```
>
> In the same structure, rest of fields are all in camel case, I think
> it’s better to rename the new field to “wakeupPending”.
>
> 8
> ```
> @@ -288,11 +323,91 @@ typedef struct AsyncQueueControl
> ProcNumber firstListener; /* id of first listener, or
> * INVALID_PROC_NUMBER */
> TimestampTz lastQueueFillWarn; /* time of last queue-full msg */
> + dsa_handle channel_hash_dsa;
> + dshash_table_handle channel_hash_dsh;
> QueueBackendStatus backend[FLEXIBLE_ARRAY_MEMBER];
> ```
>
> Same as 7, but in this case, type names are not camel case, maybe okay
> for field names. I don’t have a strong opinion here.
I did a major renaming of all new code, to better match the casing style.
It seems like helper functions and fields areNamedLikeThis, while
API functions AreNamedLikeThis.
If we don't like this naming, I'm happy to change it again, please advise.
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>> 5. ChannelHashAddListener: "already registered" case is not reached,
>> which surprises me a bit, and neither is the "grow the array" stanza.
> I've added a test for the "grow the array" stanza.
> The "already registered" case seems impossible to reach, since the
> caller, Exec_ListenCommit, returns early if IsListeningOn.
Maybe we should remove the check for "already registered" then,
or reduce it to an Assert? Seems pointless to check twice.
Or thinking a little bigger: why are we maintaining the set of
channels-listened-to both as a list and a hash? Could we remove
the list form?
regards, tom lane
On Oct 8, 2025, at 22:53, Joel Jacobson <joel@compiler.org> wrote:
>> 1
>> ```
>> + channels = GetPendingNotifyChannels();
>> +
>> LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
>> - for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER; i
>> = QUEUE_NEXT_LISTENER(i))
>> + foreach(lc, channels)
>> ```
>> I don’t see where “channels” is freed. GetPendingNotifyChannels()
>> creates a list of Nodes, both the list and Nodes the list points to
>> should be freed.
> Per suggestion from Tom Lane I reverted back GetPendingNotifyChannels(),
> so this comment is not applicable any longer.
I think you just reverted the usage of list_member() and makeNode(), but
the returned “channels” list is still built with lappend(), which
allocates memory for the List structure. So you need to use
“list_free(channels)” to free the memory.
>> 5
>> ```
>> /*
>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>> /* Assert checks that we have a valid state entry */
>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>> ```
>> This piece of code originally only read the shared memory, so it can
>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>> need to change the lock mode to “exclusive”?
> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
> and no other backend could do that, without holding an exclusive lock.
Yes, the backend only modifies its own state to “false”, but other
backends may set its state to “true”; that is a race condition. So I
still think an exclusive lock is needed.
>> 6
>> ```
>> +static inline void
>> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
>> +{
>> + memset(key, 0, sizeof(ChannelHashKey));
>> + key->dboid = dboid;
>> + strlcpy(key->channel, channel, NAMEDATALEN);
>> +}
>> ```
>> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
>> then it still results in a non-0 terminated key->channel; if channel is
>> shorter than NAMEDATALEN, strlcpy will auto add a tailing ‘\0’. I think
>> previous code should have ensured length of channel should be less than
>> NAMEDATALEN.
> Yes, I think we need memset, since I fear that when the hash table keys
> are compared, every byte of the struct might be inspected, so without
> zero-initializing it, there could be unused bytes after the null
> terminator, that could then cause logically identical keys to be wrongly
> considered different.
> I haven't checked the implementation though, but my gut feeling says
> it's better to be a bit paranoid here.
The hash function channel_hash_func() is defined by your own code; it
uses strnlen() to get the length of the channel name, so bytes after
‘\0’ won’t be used.
And I guess you missed comment 9:
9
```
+ int allocated_listeners; /* Allocated size of array */
```
For “size” here, I guess you meant “length”, though “size” also works, but usually “size” means bytes occupied by an array and “length” means number of elements of an array. So, “length” would be clearer here.
And I got a new comment for v12:
10
```
+ found = false;
+ foreach(q, channels)
+ {
+ char *existing = (char *) lfirst(q);
+
+ if (strcmp(existing, channel) == 0)
+ {
```
Might be safer to do “strncmp(existing, channel, NAMEDATALEN)”.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Thu, Oct 9, 2025, at 03:11, Chao Li wrote:
> I think you just reverted the usage of list_member() and makeNode(),
> but returned “channels” is still built by “lappend()” that allocates
> memory for the List structure. So you need to use “list_free(channels)”
> to free the memory.
Right. However, I'll first see if I can make Tom's idea of removing the list form work instead.
>>> ```
>>> /*
>>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>>> /* Assert checks that we have a valid state entry */
>>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>>> ```
>>>
>>> This piece of code originally only read the shared memory, so it can
>>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>>> need to change the lock mode to “exclusive”?
>>
>> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
>> and no other backend could do that, without holding an exclusive lock.
>
> Yes, the backend only modifies its own state to “false”, but other
> backends may set its state to “true”, that is a race condition. So I
> still think an exclusive lock is needed.
No, other backends cannot alter our state without holding an exclusive lock,
and they cannot obtain an exclusive lock on our backend until we've released
the shared lock we're holding.
>>> 6
>>> ```
>>> +static inline void
>>> +ChannelHashPrepareKey(ChannelHashKey *key, Oid dboid, const char *channel)
>>> +{
>>> + memset(key, 0, sizeof(ChannelHashKey));
>>> + key->dboid = dboid;
>>> + strlcpy(key->channel, channel, NAMEDATALEN);
>>> +}
>>> ```
>>>
>>> Do we really need the memset()? If “channel” is of length NAMEDATALEN,
>>> then it still results in a non-0 terminated key->channel; if channel is
>>> shorter than NAMEDATALEN, strlcpy will auto add a tailing ‘\0’. I think
>>> previous code should have ensured length of channel should be less than
>>> NAMEDATALEN.
>>
>> Yes, I think we need memset, since I fear that when the hash table keys
>> are compared, every byte of the struct might be inspected, so without
>> zero-initializing it, there could be unused bytes after the null
>> terminator, that could then cause logically identical keys to be wrongly
>> considered different.
>>
>> I haven't checked the implementation though, but my gut feeling says
>> it's better to be a bit paranoid here.
>
> The hash function channel_hash_func() is defined by your own code, it
> use strnlen() to get length of channel name, so that bytes after ‘\0’
> won’t be used.
No, the hash function is not used for comparison.
We're using the default dshash_memcmp for comparison:
```
/* parameters for the channel hash table */
static const dshash_parameters channelDSHParams = {
sizeof(ChannelHashKey),
sizeof(ChannelEntry),
dshash_memcmp,
channelHashFunc,
dshash_memcpy,
LWTRANCHE_NOTIFY_CHANNEL_HASH
};
```
Looking at its implementation, we can see it's using memcmp under the hood:
```
/*
* A compare function that forwards to memcmp.
*/
int
dshash_memcmp(const void *a, const void *b, size_t size, void *arg)
{
return memcmp(a, b, size);
}
```
Here, the input parameter `size` comes from `sizeof(ChannelHashKey)`,
so it will include all bytes in the comparison.
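To illustrate the concern with a hypothetical example (the key layout is inferred from ChannelHashPrepareKey(); this is not code from the patch):
```c
typedef struct ChannelHashKey
{
	Oid			dboid;
	char		channel[NAMEDATALEN];
} ChannelHashKey;

static void
key_padding_example(void)
{
	ChannelHashKey a;
	ChannelHashKey b;

	memset(&a, 0, sizeof(a));	/* as ChannelHashPrepareKey() does */
	a.dboid = MyDatabaseId;
	strlcpy(a.channel, "my_channel", NAMEDATALEN);

	/* b is built without the memset, from whatever is on the stack: */
	b.dboid = MyDatabaseId;
	strlcpy(b.channel, "my_channel", NAMEDATALEN);

	/*
	 * memcmp(&a, &b, sizeof(ChannelHashKey)) may now be nonzero even
	 * though the keys are logically identical, because the bytes after
	 * the terminating NUL in b.channel are uninitialized.
	 */
}
```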
> And I guess you missed comment 9:
>
> 9
> ```
> + int allocated_listeners; /* Allocated size of array */
> ```
>
> For “size” here, I guess you meant “length”, though “size” also works,
> but usually “size” means bytes occupied by an array and “length” means
> number of elements of an array. So, “length” would be clearer here.
Agreed, will change.
> And I got a new comment for v12:
>
> 10
> ```
> + found = false;
> + foreach(q, channels)
> + {
> + char *existing = (char *) lfirst(q);
> +
> + if (strcmp(existing, channel) == 0)
> + {
> ```
>
> Might be safer to do “strncmp(existing, channel, NAMEDATALEN)”.
Good idea, will change.
/Joel
On Oct 9, 2025, at 16:07, Joel Jacobson <joel@compiler.org> wrote:
>>>> ```
>>>> /*
>>>> @@ -1865,6 +2087,7 @@ asyncQueueReadAllNotifications(void)
>>>> LWLockAcquire(NotifyQueueLock, LW_SHARED);
>>>> /* Assert checks that we have a valid state entry */
>>>> Assert(MyProcPid == QUEUE_BACKEND_PID(MyProcNumber));
>>>> + QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
>>>> ```
>>>> This piece of code originally only read the shared memory, so it can
>>>> use LW_SHARED lock mode, but now it writes to the shared memory, do we
>>>> need to change the lock mode to “exclusive”?
>>> No, LW_SHARED is sufficient here, since the backend only modifies its own state,
>>> and no other backend could do that, without holding an exclusive lock.
>> Yes, the backend only modifies its own state to “false”, but other
>> backends may set its state to “true”, that is a race condition. So I
>> still think an exclusive lock is needed.
> No, other backends cannot alter our state without holding an exclusive lock,
> and they cannot obtain an exclusive lock on our backend until we've released
> the shared lock we're holding.
Ah… That’s true. This comment is resolved.
>> The hash function channel_hash_func() is defined by your own code, it
>> use strnlen() to get length of channel name, so that bytes after ‘\0’
>> won’t be used.
> No, the hash function is not used for comparison.
> We're using the default dshash_memcmp for comparison:
> ```
> /* parameters for the channel hash table */
> static const dshash_parameters channelDSHParams = {
> 	sizeof(ChannelHashKey),
> 	sizeof(ChannelEntry),
> 	dshash_memcmp,
> 	channelHashFunc,
> 	dshash_memcpy,
> 	LWTRANCHE_NOTIFY_CHANNEL_HASH
> };
> ```
> Looking at its implementation, we can see it's using memcmp under the hood:
> ```
> /*
>  * A compare function that forwards to memcmp.
>  */
> int
> dshash_memcmp(const void *a, const void *b, size_t size, void *arg)
> {
> 	return memcmp(a, b, size);
> }
> ```
> Here, the input parameter `size` comes from `sizeof(ChannelHashKey)`,
> so it will include all bytes in the comparison.
Okay, I think I misunderstood hash_function. So, this comment is also resolved.
I am thinking out loud. When a hash key is created, it has been memset to 0, meaning that in key->channel all bytes after ‘\0’ are also 0; there should not be any random bytes in the hash key. So in channelHashFunc() we don’t need to use strnlen() anymore, which improves performance a little bit. Like this:
h = DatumGetUInt32(hash_uint32(k->dboid));
h ^= DatumGetUInt32(hash_any((const unsigned char *) k->channel,
sizeof(k->channel)));
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>> which surprises me a bit, and neither is the "grow the array" stanza.
>
>> I've added a test for the "grow the array" stanza.
>
>> The "already registered" case seems impossible to reach, since the
>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>
> Maybe we should remove the check for "already registered" then,
> or reduce it to an Assert? Seems pointless to check twice.
>
> Or thinking a little bigger: why are we maintaining the set of
> channels-listened-to both as a list and a hash? Could we remove
> the list form?
Yes, it was indeed possible to remove the list form.
Some functions got a bit more complex, but I think it's worth it since a
single source of truth seems like an important design goal.
This also made LISTEN faster when a backend is listening on plenty of
channels, since we can now look up the channel in the hash, instead of
having to go through the list as before. The additional linear scan of
the listenersArray didn't add any noticeable extra cost even with
thousands of listening backends for the channel.
I also tried keeping listenersArray sorted and binary-searching it, but
even with thousands of listening backends, I couldn't measure any
overall latency difference for LISTEN, so I kept the linear scan to keep
it simple.
In Exec_ListenCommit, I've now inlined code that is similar to
IsListeningOn. I didn't want to use IsListeningOn since it felt wasteful
having to do dshash_find, when we instead can just use
dshash_find_or_insert, to handle both cases.
I also added a static int numChannelsListeningOn variable, to avoid the
possibly expensive operation of going through the entire hash, to be
able to check `numChannelsListeningOn == 0` instead of the now removed
`listenChannels == NIL`. It's of course critical to keep
numChannelsListeningOn in sync, but I think it should be safe? It would
of course be better to avoid this variable. Maybe the extra cycles that
would cost would be worth it?
/Joel
Attachments
On Fri, Oct 10, 2025, at 20:46, Joel Jacobson wrote:
> On Wed, Oct 8, 2025, at 20:46, Tom Lane wrote:
>> "Joel Jacobson" <joel@compiler.org> writes:
>>> On Tue, Oct 7, 2025, at 22:15, Tom Lane wrote:
>>>> 5. ChannelHashAddListener: "already registered" case is not reached,
>>>> which surprises me a bit, and neither is the "grow the array" stanza.
>>
>>> I've added a test for the "grow the array" stanza.
>>
>>> The "already registered" case seems impossible to reach, since the
>>> caller, Exec_ListenCommit, returns early if IsListeningOn.
>>
>> Maybe we should remove the check for "already registered" then,
>> or reduce it to an Assert? Seems pointless to check twice.
>>
>> Or thinking a little bigger: why are we maintaining the set of
>> channels-listened-to both as a list and a hash? Could we remove
>> the list form?
>
> Yes, it was indeed possible to remove the list form.
>
> Some functions got a bit more complex, but I think it's worth it since a
> single source of truth seems like an important design goal.
>
> This also made LISTEN faster when a backend is listening on plenty of
> channels, since we can now lookup the channel in the hash, instead of
> having to go through the list as before. The additional linear scan of
> the listenersArray didn't add any noticeable extra cost even with
> thousands of listening backends for the channel.
>
> I also tried to keep listenersArray sorted and binary-search it, but
> even with thousands of listening backends, I couldn't measure any
> overall latency difference of LISTEN, so I kept the linear scan to keep
> it simple.
>
> In Exec_ListenCommit, I've now inlined code that is similar to
> IsListeningOn. I didn't want to use IsListeningOn since it felt wasteful
> having to do dshash_find, when we instead can just use
> dshash_find_or_insert, to handle both cases.
>
> I also added a static int numChannelsListeningOn variable, to avoid the
> possibly expensive operation of going through the entire hash, to be
> able to check `numChannelsListeningOn == 0` instead of the now removed
> `listenChannels == NIL`. It's of course critical to keep
> numChannelsListeningOn in sync, but I think it should be safe? Would of
> course be better to avoid this variable. Maybe the extra cycles that
> would cost would be worth it?
In addition to the previously suggested optimizations, there is another
major one that seems doable, which would mean a great improvement for
workloads with large traffic differences between channels, i.e. some
low-traffic and some high-traffic channels.
I'm not entirely sure this approach is correct though; I might have
misunderstood the guarantees of the heavyweight lock. My assumption is
that there can only be one backend at a time running the code in
PreCommit_Notify after having acquired the heavyweight lock.
If this is not true, then it doesn't work. What made me worried is the
exclusive lock we also take inside the same function; I don't see the
point of it since we're already holding the heavyweight lock, but maybe
this is just to "allow deadlocks to be detected" like the comment says?
---
Patches:
* 0001-optimize_listen_notify-v14.patch:
Just adds additional test coverage of async.c
* 0002-optimize_listen_notify-v14.patch:
Adds the shared channel hash.
Unchanged since 0002-optimize_listen_notify-v13.patch.
* 0003-optimize_listen_notify-v14.patch:
Optimize LISTEN/NOTIFY by advancing idle backends directly
Building on the previous channel-specific listener tracking
optimization, this patch further reduces context switching by detecting
idle listening backends that don't listen to any of the channels being
notified and advancing their queue positions directly without waking
them up.
When a backend commits notifications, it now saves both the queue head
position before and after writing. In SignalBackends(), backends that
are at the old queue head and weren't marked for wakeup (meaning they
don't listen to any of the notified channels) are advanced directly to
the new queue head. This eliminates unnecessary wakeups for these
backends, which would otherwise wake up, scan through all the
notifications, skip each one, and advance to the same position anyway.
The implementation carefully handles the race condition where other
backends may write notifications after the heavyweight lock is released
but before SignalBackends() is called. By saving queueHeadAfterWrite
immediately after writing (before releasing the lock), we ensure
backends are only advanced over the exact notifications we wrote, not
notifications from other concurrent backends.
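A sketch of the advancement step itself, under those assumptions; the queue-walk macros exist in async.c, while QUEUE_BACKEND_WAKEUP_PENDING and the two saved positions come from this patch series, and the exact code here is illustrative:
```c
	/* In SignalBackends(), holding NotifyQueueLock exclusively: any
	 * listener still parked exactly at the pre-write head that we did
	 * not mark for wakeup saw none of our messages and wants none of
	 * them, so just move it past the interval we wrote. */
	for (ProcNumber i = QUEUE_FIRST_LISTENER; i != INVALID_PROC_NUMBER;
		 i = QUEUE_NEXT_LISTENER(i))
	{
		if (!QUEUE_BACKEND_WAKEUP_PENDING(i) &&
			QUEUE_POS_EQUAL(QUEUE_BACKEND_POS(i), queueHeadBeforeWrite))
			QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
	}
```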
---
Benchmark:
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=10
pgbench_patched: starting LISTEN/NOTIFY round-trip benchmark
pgbench_patched: round-trips per iteration: 10000
pgbench_patched: idle listeners added per iteration: 10
master:
idle_listeners round_trips_per_sec max_latency_usec
0 33592.9 2278
10 14251.1 1041
20 9258.7 1367
30 6144.2 2277
40 4653.1 1690
50 3780.7 2869
60 3234.9 3215
70 2818.9 3652
80 2458.7 3219
90 2203.1 3505
100 1951.9 1739
0002-optimize_listen_notify-v14.patch:
idle_listeners round_trips_per_sec max_latency_usec
0 33936.2 889
10 30631.9 1233
20 22404.7 7862
30 19446.2 9539
40 16013.3 13963
50 14310.1 16983
60 12827.0 21363
70 11271.9 24775
80 10764.4 28703
90 9568.1 31693
100 9241.3 32724
0003-optimize_listen_notify-v14.patch:
idle_listeners round_trips_per_sec max_latency_usec
0 33236.8 1090
10 34681.0 1338
20 34530.4 1372
30 34061.6 1339
40 33084.5 913
50 33847.5 955
60 33675.8 1239
70 28857.4 20443
80 33324.9 786
90 33612.3 758
100 31259.2 7706
As we can see, with 0002 the ping-pong round-trips per second degrade
much more slowly than on master, but the wakeup of idle listening
backends still needs to happen at some point; there are far fewer
wakeups, staggered over time, yet they still drag throughput down from
33k to 9k with 100 idle listening backends. With 0003, the round-trips
per second are sustained, unaffected by additional idle listening
backends.
I've also attached the pgbench patch as a .txt in
pgbench-listen-notify-benchmark-patch.txt, since it's not part of this
patch, it's just provided to help others verify the results.
/Joel
Attachments
On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote:
> In addition to previously suggested optimization, there is another major
> one that seems doable, that would mean a great improvement for workload
> having large traffic differences between channels, i.e. some low traffic
> and some high traffic.
>
> I'm not entirely sure this approach is correct though, I've might
> misunderstood the guarantees of the heavyweight lock. My assumption is
> based on that there can only be one backend that is currently running
> the code in PreCommit_Notify after having aquired the heavyweight lock.
> If this is not true, then it doesn't work. What made me worried is the
> exclusive lock we also take inside the same function, I don't see the
> point of it since we're already holding the heavyweight lock, but maybe
> this is just to "allows deadlocks to be detected" like the comment says?
...
> * 0003-optimize_listen_notify-v14.patch:
>
> Optimize LISTEN/NOTIFY by advancing idle backends directly
>
> Building on the previous channel-specific listener tracking
> optimization, this patch further reduces context switching by detecting
> idle listening backends that don't listen to any of the channels being
> notified and advancing their queue positions directly without waking
> them up.
...
> 0003-optimize_listen_notify-v14.patch:
>
> idle_listeners round_trips_per_sec max_latency_usec
> 0 33236.8 1090
> 10 34681.0 1338
> 20 34530.4 1372
> 30 34061.6 1339
> 40 33084.5 913
> 50 33847.5 955
> 60 33675.8 1239
> 70 28857.4 20443
> 80 33324.9 786
> 90 33612.3 758
> 100 31259.2 7706
I noticed the strange data point at idle_listeners=70.
This made me think about the "wake tail only" trick,
and I realized it is now unnecessary with the new 0003 idea.
New version attached that removes that part from the 0003 patch.
This also of course improved the stability of max_latency_usec,
since in this specific benchmark all other listeners are always idle,
so they never need to be woken up:
idle_listeners round_trips_per_sec max_latency_usec
0 33631.8 546
10 34318.0 586
20 34813.0 596
30 35073.4 574
40 34646.1 569
50 33755.5 634
60 33991.6 561
70 34049.0 550
80 33886.0 541
90 33545.0 540
100 33163.1 660
/Joel
Attachments
On Sat, Oct 11, 2025, at 09:43, Joel Jacobson wrote:
> On Sat, Oct 11, 2025, at 08:43, Joel Jacobson wrote:
>> In addition to previously suggested optimization, there is another major
...
>> I'm not entirely sure this approach is correct though
Having investigated this, the "direct advancement" approach seems
correct to me.
(I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
is of course needed because there are other operations that don't
acquire the heavyweight-lock, that take shared/exclusive lock on
NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
QUEUE_HEAD.)
Given all the experiments since my earlier message, here is a fresh,
self-contained write-up:
This series has two patches:
* 0001-optimize_listen_notify-v16.patch:
Improve test coverage of async.c. Adds isolation specs covering
previously untested paths (subxact LISTEN reparenting/merge/abort,
simple NOTIFY reparenting, notification_match dedup, and an array-growth
case used by the follow-on patch).
* 0002-optimize_listen_notify-v16.patch:
Optimize LISTEN/NOTIFY by maintaining a shared channel map and using
direct advancement to avoid useless wakeups.
Problem
-------
Today SignalBackends wakes all listeners in the same database, with no
knowledge of which backends listen on which channels. When some backends
are listening on different channels, each NOTIFY causes unnecessary
wakeups and context switches, which can become the bottleneck in
such workloads.
Overview of the solution (patch 0002)
-------------------------------------
* Introduce a lazily-created DSA+dshash map (dboid, channel) ->
[ProcNumber] (channelHash). AtCommit_Notify maintains it for
LISTEN/UNLISTEN, and SignalBackends consults it to signal only
listeners on the channels notified within the transaction.
* Add a per-backend wakeupPending flag to suppress duplicate signals.
* Direct advancement: while queuing, PreCommit_Notify records the queue
head before and after our writes. Writers are globally serialized, so
the interval [oldHead, newHead) contains only our entries.
SignalBackends advances any backend still at oldHead directly to
newHead, avoiding a pointless wakeup.
* Keep the queue healthy by signaling backends that have fallen too far
behind (lag >= QUEUE_CLEANUP_DELAY) so the global tail can advance.
* pg_listening_channels and IsListeningOn now read from channelHash.
* Add LWLock tranche NOTIFY_CHANNEL_HASH and wait event
NotifyChannelHash.
No user-visible semantic changes are intended; this is an internal
performance improvement.
Benchmark
---------
Using a patched pgbench (adds --listen-notify-benchmark; attached as
.txt to avoid confusing cfbot). Each run performs 10 000 round trips and
adds 100 idle listeners per iteration.
master (HEAD):
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=100
idle_listeners round_trips_per_sec max_latency_usec
0 32123.7 893
100 1952.5 1465
200 991.4 3438
300 663.5 2454
400 494.6 2950
500 398.6 3394
600 332.8 4272
700 287.1 4692
800 252.6 5208
900 225.4 5614
1000 202.5 6212
0002-optimize_listen_notify-v16.patch:
% ./pgbench_patched --listen-notify-benchmark --notify-round-trips=10000 --notify-idle-step=100
idle_listeners round_trips_per_sec max_latency_usec
0 31832.6 1067
100 32341.0 1035
200 31562.5 1054
300 30040.1 1057
400 29287.1 1023
500 28191.9 1201
600 28166.5 1019
700 26994.3 1094
800 26501.0 1043
900 25974.2 1005
1000 25720.6 1008
Benchmarked on MacBook Pro Apple M3 Max.
Files
-----
* 0001-optimize_listen_notify-v16.patch - tests only.
* 0002-optimize_listen_notify-v16.patch - implementation.
* pgbench-listen-notify-benchmark-patch.txt - adds --listen-notify-benchmark.
Feedback and review much welcomed.
/Joel
Attachments
"Joel Jacobson" <joel@compiler.org> writes:
> Having investigated this, the "direct advancement" approach seems
> correct to me.
> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
> is of course needed because there are other operations that don't
> acquire the heavyweight-lock, that take shared/exclusive lock on
> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
> QUEUE_HEAD.)
Right. What the heavyweight lock buys for us in this context is that
we can be sure no other would-be notifier can insert any messages
in between ours, even though we may take and release NotifyQueueLock
several times to allow readers to sneak in. That in turn means that
it's safe to advance readers over that whole set of messages if we
know we didn't wake them up for any of those messages.
There is a false-positive possibility if a reader was previously
signaled but hasn't yet awoken: we will think that maybe we signaled
it and hence not advance its pointer. This is an error in the safe
direction however, and it will advance its pointer when it does
wake up.
A potential complaint is that we are doubling down on the need for
that heavyweight lock, despite the upthread discussion about maybe
getting rid of it for better scalability. However, this patch
only requires holding a lock across all the insertions, not holding
it through commit which I think is the true scalability blockage.
If we did want to get rid of that lock, we'd only need to stop
releasing NotifyQueueLock at insertion page boundary crossings,
which I suspect isn't really that useful anyway. (In connection
with that though, I think you ought to capture both the "before" and
"after" pointers within that lock interval, not expend another lock
acquisition later.)
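Concretely, that could look something like this sketch of PreCommit_Notify's insertion loop; asyncQueueAddEntries() is the existing insertion helper, and the two saved positions are the patch's variables (the single-lock structure is the suggestion here, not current code):
```c
	LWLockAcquire(NotifyQueueLock, LW_EXCLUSIVE);
	queueHeadBeforeWrite = QUEUE_HEAD;
	while (nextNotify != NULL)
		nextNotify = asyncQueueAddEntries(nextNotify);
	queueHeadAfterWrite = QUEUE_HEAD;
	LWLockRelease(NotifyQueueLock);
```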
It would be good if the patch's comments made these points ...
also, the comments above struct AsyncQueueControl need to be
updated, because changing some other backend's queue pos is
not legal under any of the stated rules.
> Given all the experiments since my earlier message, here is a fresh,
> self-contained write-up:
I'm getting itchy about removing the local listenChannels list,
because what you've done is to replace it with a shared data
structure that can't be accessed without a good deal of locking
overhead. That seems like it could easily be a net loss.
Also, I really do not like this implementation of
GetPendingNotifyChannels, as it looks like O(N^2) effort.
The potentially large length of the list it builds is scary too,
considering the comments that SignalBackends had better not fail.
If we have to do it that way it'd be better to collect the list
during PreCommit_Notify.
The "Avoid needing to wake listening backends" loop should probably
be combined with the loop after it; I don't quite see the point of
iterating over all the listening backends twice. Also, why is the
second loop only paying attention to backends in the same DB?
I don't love adding queueHeadBeforeWrite and queueHeadAfterWrite into
the pendingNotifies data structure, as they have no visible connection
to that. In particular, we will have multiple NotificationList
structs when there's nested transactions, and it's certainly
meaningless to have such fields in more than one place. Probably
just making them independent static variables is the best way.
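(For concreteness, a sketch of that layout, using the patch's names as an assumption rather than actual committed code:)
```
/* Sketch only: module-level statics in async.c, per the suggestion above. */
static QueuePosition queueHeadBeforeWrite;	/* QUEUE_HEAD before our inserts */
static QueuePosition queueHeadAfterWrite;	/* QUEUE_HEAD after our inserts */

/* In PreCommit_Notify(), while NotifyQueueLock is held exclusively: */
queueHeadBeforeWrite = QUEUE_HEAD;
/* ... insert all pending notifications via asyncQueueAddEntries() ... */
queueHeadAfterWrite = QUEUE_HEAD;
```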
The overall layout of what the patch injects where needs another
look. I don't like inserting code before typedefs and static
variables within a module: that's not our normal layout style.
regards, tom lane
On Oct 15, 2025, at 05:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.
>
> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.
>
> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
>
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway. (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)
>
> It would be good if the patch's comments made these points ...
> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.
I used to think "direct advancement" was a good idea, but after reading Tom's explanation and rereading v16 carefully, I now think it adds complexity and could be fragile.
I just composed an example of a race condition; please see if it is valid.
Because recording queueHeadBeforeWrite and queueHeadAfterWrite happens in PreCommit_Notify() while checking them happens in AtCommit_Notify(), there is an interval in between during which something may happen.
Say a listener A, with its head pointing to 1.
And current QueueHead is 1.
Now two notifiers B and C are committing:
* B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it records headBeforeWrite = 1 and writes to 3, and records headAfterWrite = 3.
* Now QueueHead is 3.
* C enters PreCommit_Notify(), it records headBeforeWrite = 3 and writes to 5, and records headAfterWrite = 5.
* Now QueueHead is 5
* C starts to run AtCommit_Notify(); as A's head is 1, which doesn't equal C's headBeforeWrite, C won't advance A's head.
* B starts to run AtCommit_Notify(); as A's head equals B's headBeforeWrite, B will advance A's head to 3.
* At this time, QueueHead is 5, and A’s head is 3, so “direct advancement” will never work for A until A wakes up next time.
I am brainstorming. Maybe we can use a simpler strategy. If a backend’s queue lag exceeds a threshold, then wake it up. This solution is simpler and reliable, also reducing the total wake-up count.
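For illustration, a minimal sketch of that strategy (QUEUE_WAKE_THRESHOLD is a hypothetical constant; the patch's existing QUEUE_CLEANUP_DELAY check is similar in spirit):
```
/* Hypothetical sketch: in SignalBackends(), signal any backend whose
 * read position lags the queue head by too many pages. */
#define QUEUE_WAKE_THRESHOLD 4		/* pages of lag before forcing a wakeup */

lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
						 QUEUE_POS_PAGE(QUEUE_BACKEND_POS(i)));
if (lag >= QUEUE_WAKE_THRESHOLD)
{
	QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
	pids[count] = QUEUE_BACKEND_PID(i);
	procnos[count] = i;
	count++;
}
```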
>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.
>
> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.
I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
In PreCommit_Notify(), we can maintain a local hash table to record pending notifications' channel names. dahash also supports hash tables in local memory.
Then in SignalBackends(), we no longer need GetPendingNotifyChannels(); we can just iterate over all keys of the local channel-name hash.
And the local static numChannelsListeningOn is also not needed. We can get the count from the local hash.
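A rough sketch of what I mean, using dynahash in local memory (names are hypothetical, not patch code):
```
/* Sketch: per-transaction local hash of pending channel names. */
HASHCTL		ctl;
HTAB	   *pendingChannelHash;
bool		found;

memset(&ctl, 0, sizeof(ctl));
ctl.keysize = NAMEDATALEN;
ctl.entrysize = NAMEDATALEN;
ctl.hcxt = CurTransactionContext;
pendingChannelHash = hash_create("pending notify channels", 16, &ctl,
								 HASH_ELEM | HASH_STRINGS | HASH_CONTEXT);

/* In PreCommit_Notify(), for each pending notification n: */
hash_search(pendingChannelHash, n->data, HASH_ENTER, &found);
/* In SignalBackends(): iterate with hash_seq_init()/hash_seq_search(). */
```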
WRT v16, I got a few new comments:
1 - 0002
```
* After commit we are called another time (AtCommit_Notify()). Here we
- * make any actual updates to the effective listen state (listenChannels).
+ * make any actual updates to the effective listen state (channelHash).
* Then we signal any backends that may be interested in our messages
* (including our own backend, if listening). This is done by
- * SignalBackends(), which scans the list of listening backends and sends a
- * PROCSIG_NOTIFY_INTERRUPT signal to every listening backend (we don't
- * know which backend is listening on which channel so we must signal them
- * all). We can exclude backends that are already up to date, though, and
- * we can also exclude backends that are in other databases (unless they
- * are way behind and should be kicked to make them advance their
- * pointers).
+ * SignalBackends(), which consults the shared channel hash table to
+ * identify listeners for the channels that have pending notifications
+ * in the current database. Each selected backend is marked as having a
+ * wakeup pending to avoid duplicate signals, and a PROCSIG_NOTIFY_INTERRUPT
+ * signal is sent to it.
```
In this comment, you refer to "channelHash" and "the shared channel hash table"; they are the same thing, but this could easily mislead readers.
2 - 0002
```
pg_listening_channels(PG_FUNCTION_ARGS)
{
FuncCallContext *funcctx;
+ List *listenChannels;
/* stuff done only on the first call of the function */
if (SRF_IS_FIRSTCALL())
{
+ MemoryContext oldcontext;
+ dshash_seq_status status;
+ ChannelEntry *entry;
+
/* create a function context for cross-call persistence */
funcctx = SRF_FIRSTCALL_INIT();
```
listenChannels is only used within the "if", so its definition can be moved into the "if".
3 - 0002
```
+ queue_length = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
+ QUEUE_POS_PAGE(QUEUE_TAIL));
+
+ /* Check for lagging backends when the queue spans multiple pages */
+ if (queue_length > 0)
+ {
```
I wonder why this check is needed. If queue_length is 0, can we return immediately from SignalBackends()?
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Tue, Oct 14, 2025, at 23:19, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> Having investigated this, the "direct advancement" approach seems
>> correct to me.
>
>> (I understand the exclusive lock in PreCommit_Notify on NotifyQueueLock
>> is of course needed because there are other operations that don't
>> acquire the heavyweight-lock, that take shared/exclusive lock on
>> NotifyQueueLock to read/modify QUEUE_HEAD, so the exclusive lock on
>> NotifyQueueLock in PreCommit_Notify is needed, since it modifies the
>> QUEUE_HEAD.)
>
> Right. What the heavyweight lock buys for us in this context is that
> we can be sure no other would-be notifier can insert any messages
> in between ours, even though we may take and release NotifyQueueLock
> several times to allow readers to sneak in. That in turn means that
> it's safe to advance readers over that whole set of messages if we
> know we didn't wake them up for any of those messages.

Right.

> There is a false-positive possibility if a reader was previously
> signaled but hasn't yet awoken: we will think that maybe we signaled
> it and hence not advance its pointer. This is an error in the safe
> direction however, and it will advance its pointer when it does
> wake up.

I've added a comment on this in SignalBackends.

> A potential complaint is that we are doubling down on the need for
> that heavyweight lock, despite the upthread discussion about maybe
> getting rid of it for better scalability. However, this patch
> only requires holding a lock across all the insertions, not holding
> it through commit which I think is the true scalability blockage.
>
> If we did want to get rid of that lock, we'd only need to stop
> releasing NotifyQueueLock at insertion page boundary crossings,
> which I suspect isn't really that useful anyway.

Right. So if the upthread discussion would get rid of the heavyweight
lock we would just need to hold the exclusive lock across all
insertions. Good to know the two efforts are not conflicting.

> (In connection
> with that though, I think you ought to capture both the "before" and
> "after" pointers within that lock interval, not expend another lock
> acquisition later.)

Fixed.

> It would be good if the patch's comments made these points ...

I've added a comment inside PreCommit_Notify on how it would suffice to
hold the exclusive lock across all insertions, for the purpose of
setting the "before" and "after" pointers, if the heavyweight lock
would be removed.

> also, the comments above struct AsyncQueueControl need to be
> updated, because changing some other backend's queue pos is
> not legal under any of the stated rules.

Fixed.

>> Given all the experiments since my earlier message, here is a fresh,
>> self-contained write-up:
>
> I'm getting itchy about removing the local listenChannels list,
> because what you've done is to replace it with a shared data
> structure that can't be accessed without a good deal of locking
> overhead. That seems like it could easily be a net loss.

I agree, I also prefer the local listenChannels list. I've changed it
back.

> Also, I really do not like this implementation of
> GetPendingNotifyChannels, as it looks like O(N^2) effort.
> The potentially large length of the list it builds is scary too,
> considering the comments that SignalBackends had better not fail.
> If we have to do it that way it'd be better to collect the list
> during PreCommit_Notify.

I agree. I've removed GetPendingNotifyChannels and added a local list
named pendingNotifyChannels instead, collected during PreCommit_Notify.

> The "Avoid needing to wake listening backends" loop should probably
> be combined with the loop after it; I don't quite see the point of
> iterating over all the listening backends twice.

I agree. Fixed.

> Also, why is the
> second loop only paying attention to backends in the same DB?

Fixed. (We're already sure it's the same DB, since that's part of the
hash key. I've removed the redundant check.)

> I don't love adding queueHeadBeforeWrite and queueHeadAfterWrite into
> the pendingNotifies data structure, as they have no visible connection
> to that. In particular, we will have multiple NotificationList
> structs when there's nested transactions, and it's certainly
> meaningless to have such fields in more than one place. Probably
> just making them independent static variables is the best way.

Fixed.

> The overall layout of what the patch injects where needs another
> look. I don't like inserting code before typedefs and static
> variables within a module: that's not our normal layout style.

Fixed.

/Joel
Attachments
Hi,

Thank you for working on it! Benchmarking looks great. There are several points:

I tried the patch and it seems listeners sometimes don't receive
notifications. To reproduce it you can try to listen to the channel in
one psql session and send notifications from another psql session. But
all tests are fine, so I tried to write a TAP test to reproduce it. It
passes on master and fails with the patch, so looks like it's real.
Please find the repro in attachments. I added the TAP test to amcheck
module just for simplicity.

I think "Direct advancement" is a good idea. But the way it's
implemented now has a concurrency bug. Listeners store its current
position in the local variable 'pos' during the reading in
asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. It
means that some notifier can directly advance the listener's position
while the listener has an old value in the local variable. The same
time we use listener positions to find out the limit we can truncate
the queue in asyncQueueAdvanceTail(). asyncQueueAdvanceTail() doesn't
know that listeners have a local copy of their positions and can
truncate the queue beyond that, which means listeners can try to read
notifications from the truncated segment. I managed to reproduce it
locally. Please let me know if more details are needed.

BTW the error message is a bit confusing:

2025-10-15 13:32:15.570 MSK [261845] ERROR: could not access status of transaction 0
2025-10-15 13:32:15.570 MSK [261845] DETAIL: Could not open file "pg_notify/000000000000001": No such file or directory.

Looks like all SLRU IO errors have an error message about transaction
status. It's not really a problem as we have the directory path in the
log.

Best regards,
Arseniy Mukhin
Attachments
Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> writes:
> I think "Direct advancement" is a good idea. But the way it's
> implemented now has a concurrency bug. Listeners store its current
> position in the local variable 'pos' during the reading in
> asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. It
> means that some notifier can directly advance the listener's position
> while the listener has an old value in the local variable. The same
> time we use listener positions to find out the limit we can truncate
> the queue in asyncQueueAdvanceTail().
Good catch!
I think we can perhaps salvage the idea if we invent a separate
"advisory" queue position field, which tells its backend "hey,
you could skip as far as here if you want", but is not used for
purposes of SLRU truncation. Alternatively, split the queue pos
into "this is where to read next" and "this is as much as I'm
definitively done with", where the second field gets advanced at
the end of asyncQueueReadAllNotifications. Not sure which
view would be less confusing (in the end I guess they're nearly
the same thing, differently explained).
A different line of thought could be to get rid of
asyncQueueReadAllNotifications's optimization of moving the
queue pos only once, per
* (We could alternatively retake NotifyQueueLock and move the position
* before handling each individual message, but that seems like too much
* lock traffic.)
Since we only need shared lock to advance our own queue pos,
maybe that wouldn't be too awful. Not sure.
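(A sketch of that shape; since we only touch our own slot, shared mode is assumed to suffice:)
```
/* Sketch: after handling each message, publish the new position. */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
```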
regards, tom lane
On Wed, Oct 15, 2025, at 05:19, Chao Li wrote:
> * B enters PreCommit_Notify(), it gets the NotifyQueueLock first, it
> records headBeforeWrite = 1 and writes to 3, and records headAfterWrite
> = 3.
> * Now QueueHead is 3.
> * C enters PreCommit_Notify(), it records headBeforeWrite = 3 and
> writes to 5, and records headAfterWrite = 5.
No, when C enters PreCommit_Notify, it will be waiting on the
heavyweight lock, currently held by B, which B will hold
until it commits. It will then see headBeforeWrite = 3.
> * Now QueueHead is 5
> * C starts to run AtCommit_Notify(); as A's head is 1, which doesn't
> equal C's headBeforeWrite, C won't advance A's head.
> * B starts to run AtCommit_Notify(); as A's head equals B's
> headBeforeWrite, B will advance A's head to 3.
No, like explained above, B cannot be running here,
it must have committed already (or aborted) since C
was waiting on the heavyweight lock held by B.
The example therefore seems invalid to me.
> I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
>
> In PreCommit_Notify(), we can maintain a local hash table to record
> pending notifications' channel names. dahash also supports hash tables in
> local memory.
I'm confused, I assume you mean "dynahash" since there is no "dahash"
in the sources? I see dynahash has local-to-a-backend support,
but I don't see why we would need a hash table for this,
we just iterate over it once in SignalBackends,
I think the local list is fine.
The latest version gets rid of GetPendingNotifyChannels()
and replaces it with the local list pendingNotifyChannels.
> And the local static numChannelsListeningOn is also not needed. We can
> get the count from the local hash.
No, you're mixing up the data structures.
The local hash you suggested was for pending notify channels,
but numChannelsListeningOn was needed when we didn't have
listenChannels. Now that I've reverted back to listenChannels,
I also replaced `(numChannelsListeningOn == 0)`
with `(listenChannels == NIL)`.
> WRT v16, I got a few new comments:
...
> In this comment, you refer to "channelHash" and "the shared channel
> hash table"; they are the same thing, but this could easily mislead
> readers.
Right, will try to improve this in the next version.
> pg_listening_channels(PG_FUNCTION_ARGS)
> {
> FuncCallContext *funcctx;
> + List *listenChannels;
...
> listenChannels is only used within the "if", so its definition can be
> moved into the "if".
Comment no longer applicable, since the local variable listenChannels
has been removed from pg_listening_channels, which now uses the
original static listenChannels instead.
> + /* Check for lagging backends when the queue spans multiple pages */
> + if (queue_length > 0)
...
> I wonder why this check is needed. If queue_length is 0, can we return
> immediately from SignalBackends()?
This check has been removed in the latest version.
/Joel
On Wed, Oct 15, 2025 at 5:16 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> writes:
> > I think "Direct advancement" is a good idea. But the way it's
> > implemented now has a concurrency bug. Listeners store its current
> > position in the local variable 'pos' during the reading in
> > asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. It
> > means that some notifier can directly advance the listener's position
> > while the listener has an old value in the local variable. The same
> > time we use listener positions to find out the limit we can truncate
> > the queue in asyncQueueAdvanceTail().
>
> Good catch!
>
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation. Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).
>
> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.
>
> regards, tom lane

Advisory queue position field sounds good IMHO. Listeners are still
solely responsible for advancing their positions so they still need to
wake up to do it, but they will only do so if there are relevant
notifications, or if they are too far behind. In any case they will be
able to jump over all irrelevant stuff.

Best regards,
Arseniy Mukhin
On Wed, Oct 15, 2025, at 13:19, Arseniy Mukhin wrote:
> I tried the patch and it seems listeners sometimes don't receive
> notifications. To reproduce it you can try to listen to the channel in
> one psql session and send notifications from another psql session. But
> all tests are fine, so I tried to write a TAP test to reproduce it. It
> passes on master and fails with the patch, so looks like it's real.
> Please find the repro in attachments. I added the TAP test to amcheck
> module just for simplicity.

Indeed a good catch! Thanks for the TAP test.

I've migrated it to async-notify.spec, included in
0001-optimize_listen_notify-v18.patch:

* Check that notifications sent from a backend that has not done LISTEN
are properly delivered to a listener in another backend

To fix this, we now do an initChannelHash call at the beginning of
SignalBackends, since the problem was that if no LISTEN had been done
in the session which did a NOTIFY, the channel hash would not have been
initialized.

Added a point about this in the 0002-optimize_listen_notify-v18.patch
header:

* SignalBackends attaches to the channel hash at the start, ensuring
that backends performing NOTIFY without having done LISTEN can still
find listeners in the shared hash table.

On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation. Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).
>
> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.

These all sound like promising ideas.

I went ahead and tried the "split the queue pos" idea, implemented in
0002-optimize_listen_notify-v18.patch:

Position tracking for truncation safety
----------------------------------------

To prevent race conditions during queue truncation when using direct
advancement, backend positions are now tracked using two fields:

* pos: The next position to read from. This can be advanced by other
backends via direct advancement to skip over uninteresting
notifications.

* donePos: What the backend has definitively processed and no longer
needs. This is used for determining safe truncation points.

Without this separation, a backend could be advanced by another backend
while it's reading notifications, then write back its stale local
position that points to an already-truncated page. By using donePos for
truncation decisions and taking the maximum of local and shared pos
when updating, we ensure that truncation waits for backends to finish
reading, while still allowing position advancement for optimization.

On Wed, Oct 15, 2025, at 21:53, Arseniy Mukhin wrote:
> Advisory queue position field sounds good IMHO. Listeners are still
> solely responsible for advancing their positions so they still need to
> wake up to do it, but they will only do so if there are relevant
> notifications, or if they are too far behind. In any case they will be
> able to jump over all irrelevant stuff.

I read your message too late, otherwise I would have tried that
approach first. I will try to implement that one too, and perhaps also
the third one, and then we can evaluate them to see which one we
prefer.

/Joel
Attachments
On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation.
I want to experiment with this idea too.
I assume the separate "advisory" queue position field
would actually need to be two struct fields, since a queue position
consists of a page and an offset, right?
typedef struct QueuePosition
{
int64 page; /* SLRU page number */
int offset; /* byte offset within page */
+ int64 advisoryPage; /* suggested skip-ahead page */
+ int advisoryOffset; /* suggested skip-ahead offset */
} QueuePosition;
Or would we rather want a single "advisory" field that would also
be of type QueuePosition?
/Joel
"Joel Jacobson" <joel@compiler.org> writes:
> I assume the separate "advisory" queue position field
> would actually need to be two struct fields, since a queue position
> consists of a page and an offset, right?
No, I'd think you'd have both
QueuePosition pos; /* backend has read queue up to here */
QueuePosition advisory_pos; /* backend could skip queue to here */
in QueueBackendStatus. The other seems way too confusing.
regards, tom lane
> On Oct 15, 2025, at 23:36, Joel Jacobson <joel@compiler.org> wrote:
>
>> I agree with Tom that GetPendingNotifyChannels() is too heavy and unnecessary.
>>
>> In PreCommit_Notify(), we can maintain a local hash table to record
>> pending notifications' channel names. dahash also supports hash tables in
>> local memory.
>
> I'm confused, I assume you mean "dynahash" since there is no "dahash"
> in the sources? I see dynahash has local-to-a-backend support,
> but I don't see why we would need a hash table for this,
> we just iterate over it once in SignalBackends,
> I think the local list is fine.
>
> The latest version gets rid of GetPendingNotifyChannels()
> and replaces it with the local list pendingNotifyChannels.
Sorry for the typo. Yes, I meant "dynahash", which you have already been using.
In v18, I see you are building “pendingNotifyChannels” in PreCommit_Notify() with “List”:
```
+ /*
+ * Build list of unique channels for SignalBackends().
+ */
+ pendingNotifyChannels = NIL;
+ foreach_ptr(Notification, n, pendingNotifies->events)
+ {
+ char *channel = n->data;
+
+ /* Add if not already in list */
+ if (!list_member_ptr(pendingNotifyChannels, channel))
+ pendingNotifyChannels = lappend(pendingNotifyChannels, channel);
+ }
```
My suggestion of using dynahash was for the same purpose. Because
list_member_ptr() iterates through all list nodes until it finds the
target, this code is still O(n^2).
Using a hash will make it faster. I used to work on project Concourse
[1]. The system heavily uses the LISTEN/NOTIFY mechanism. There would
be thousands of channels at runtime. In that case, a hash search would
be much faster than a linear search.
[1] https://github.com/concourse/concourse
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> writes:
>> I think "Direct advancement" is a good idea. But the way it's
>> implemented now has a concurrency bug. Listeners store its current
>> position in the local variable 'pos' during the reading in
>> asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. It
>> means that some notifier can directly advance the listener's position
>> while the listener has an old value in the local variable. The same
>> time we use listener positions to find out the limit we can truncate
>> the queue in asyncQueueAdvanceTail().
>
> Good catch!

I've implemented the three ideas presented below, attached as .txt files
that are diffs on top of v19, which has these changes since v17:

0002-optimize_listen_notify-v19.patch:
* Improve wording of top comment per request from Chao Li.
* Add initChannelHash call to top of SignalBackends,
to fix bug reported by Arseniy Mukhin.

> I think we can perhaps salvage the idea if we invent a separate
> "advisory" queue position field, which tells its backend "hey,
> you could skip as far as here if you want", but is not used for
> purposes of SLRU truncation.

Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt

> Alternatively, split the queue pos
> into "this is where to read next" and "this is as much as I'm
> definitively done with", where the second field gets advanced at
> the end of asyncQueueReadAllNotifications. Not sure which
> view would be less confusing (in the end I guess they're nearly
> the same thing, differently explained).

Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt

> A different line of thought could be to get rid of
> asyncQueueReadAllNotifications's optimization of moving the
> queue pos only once, per
>
> * (We could alternatively retake NotifyQueueLock and move the position
> * before handling each individual message, but that seems like too much
> * lock traffic.)
>
> Since we only need shared lock to advance our own queue pos,
> maybe that wouldn't be too awful. Not sure.

Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt

/Joel
Attachments
On Thu, Oct 16, 2025, at 04:54, Chao Li wrote:
>> On Oct 15, 2025, at 23:36, Joel Jacobson <joel@compiler.org> wrote:
>> The latest version gets rid of GetPendingNotifyChannels()
>> and replaces it with the local list pendingNotifyChannels.
>
> Sorry for the typo. Yes, I meant "dynahash", which you have already
> been using.
...
> My suggestion of using dynahash was for the same purpose. Because
> list_member_ptr() iterates through all list nodes until it finds the
> target, this code is still O(n^2).
>
> Using a hash will make it faster. I used to work on project Concourse
> [1]. The system heavily uses the LISTEN/NOTIFY mechanism. There would
> be thousands of channels at runtime. In that case, a hash search would
> be much faster than a linear search.
>
> [1] https://github.com/concourse/concourse

Building pendingNotifyChannels is O(N^2), yes, but how large an N is
realistic here?

Note that pendingNotifyChannels only contains the unique channels for
the notifications in the *current transaction*. At Concourse, did you
really do thousands of NOTIFYs, with unique channel names, within the
same transaction?

/Joel
On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
> On Thu, Oct 16, 2025, at 04:54, Chao Li wrote:
>>> On Oct 15, 2025, at 23:36, Joel Jacobson <joel@compiler.org> wrote:
>>> The latest version gets rid of GetPendingNotifyChannels()
>>> and replaces it with the local list pendingNotifyChannels.
>>
>> Sorry for the typo. Yes, I meant "dynahash", which you have already
>> been using.
> ...
>> My suggestion of using dynahash was for the same purpose. Because
>> list_member_ptr() iterates through all list nodes until it finds the
>> target, this code is still O(n^2).
>>
>> Using a hash will make it faster. I used to work on project Concourse
>> [1]. The system heavily uses the LISTEN/NOTIFY mechanism. There would
>> be thousands of channels at runtime. In that case, a hash search would
>> be much faster than a linear search.
>>
>> [1] https://github.com/concourse/concourse
>
> Building pendingNotifyChannels is O(N^2), yes, but how large an N is
> realistic here?
>
> Note that pendingNotifyChannels only contains the unique channels for
> the notifications in the *current transaction*. At Concourse, did you
> really do thousands of NOTIFYs, with unique channel names, within the
> same transaction?

I tested doing

LISTEN ch1;
LISTEN ch2;
...
LISTEN ch100000;

in one backend, and then

\timing on
BEGIN;
NOTIFY ch1;
NOTIFY ch2;
...
NOTIFY ch100000;
COMMIT;

in another backend.

Timing for the final COMMIT of the 100k NOTIFYs:

2.127 ms (master)
1428.441 ms (0002-optimize_listen_notify-v19.patch)

I agree this looks like a real problem, since I guess it's not
completely unthinkable that someone might have some kind of trigger on
a table that fires off a NOTIFY for each row, possibly causing hundreds
of thousands of notifies in the same db txn.

I tried changing pendingNotifyChannels from a list to dynahash, which
improved the timing, down to 15.169 ms.

Once we have decided which of the three alternatives to go forward
with, I will add the dynahash code for pendingNotifyChannels.

Nice catch, thanks.

/Joel
"Joel Jacobson" <joel@compiler.org> writes:
> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>> realistic here?
> I agree this looks like a real problem, since I guess it's not
> completely unthinkable someone might have
> some kind of trigger on a table, that could fire off NOTIFY
> for each row, possibly causing hundreds of thousands of
> notifies in the same db txn.
We already de-duplicate identical NOTIFY operations for exactly that
reason (cf. AsyncExistsPendingNotify). However, non-identical NOTIFYs
obviously can't be merged.
I wonder whether we could adapt that de-duplication logic so that
it produces a list of unique channel names in addition to a list
of unique NOTIFY events. One way could be a list/hashtable of
channels used, and for each one a list/hashtable of unique payloads,
rather than the existing single-level list/hashtable.
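(A sketch of such a two-level structure; the type and field names are hypothetical:)
```
/* Sketch: one hash entry per distinct channel, each holding its own
 * container of unique payloads.  The channel list for SignalBackends()
 * then falls out of the top level for free. */
typedef struct ChannelNotifies
{
	char		channel[NAMEDATALEN];	/* hash key: channel name */
	List	   *payloads;				/* unique payloads on this channel */
} ChannelNotifies;
```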
regards, tom lane
On Thu, Oct 16, 2025 at 12:39 PM Joel Jacobson <joel@compiler.org> wrote:
>
> On Wed, Oct 15, 2025, at 16:16, Tom Lane wrote:
> > Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> writes:
> >> I think "Direct advancement" is a good idea. But the way it's
> >> implemented now has a concurrency bug. Listeners store its current
> >> position in the local variable 'pos' during the reading in
> >> asyncQueueReadAllNotifications() and don't hold NotifyQueueLock. It
> >> means that some notifier can directly advance the listener's position
> >> while the listener has an old value in the local variable. The same
> >> time we use listener positions to find out the limit we can truncate
> >> the queue in asyncQueueAdvanceTail().
> >
> > Good catch!
>
> I've implemented the three ideas presented below, attached as .txt files
> that are diffs on top of v19, which has these changes since v17:
>
Thank you for the new version and all implementations!
> 0002-optimize_listen_notify-v19.patch:
> * Improve wording of top comment per request from Chao Li.
> * Add initChannelHash call to top of SignalBackends,
> to fix bug reported by Arseniy Mukhin.
>
> > I think we can perhaps salvage the idea if we invent a separate
> > "advisory" queue position field, which tells its backend "hey,
> > you could skip as far as here if you want", but is not used for
> > purposes of SLRU truncation.
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
pos = QUEUE_BACKEND_POS(i);
/* Direct advancement for idle backends at the old head */
if (pendingNotifies != NULL &&
QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
{
QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
If we have several notifying backends, it looks like only the first
one will be able to do direct advancement here. Next notifying backend
will fail on QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) as we don't
wake up the listener and pos will be the same as it was for the first
notifying backend. It seems that to accumulate direct advancement from
several notifying backends we need to compare queueHeadBeforeWrite
with advisoryPos here. And we also need to advance advisoryPos to the
listener's position after reading if advisoryPos falls behind.
Minute of brainstorming
I also thought about a workload that can probably be met frequently.
Let's say we have a sequence of notifications:
F F F T F F F T F F F T
Here F - notification from the channel we don't care about and T - the opposite.
It seems that after the first 'T' notification it will be more
difficult for notifying backends to do 'direct advancement' as there
will be some lag before the listener reads the notification and
advances its position. Not sure if it's a problem, probably it depends
on the intensity of notifications. But maybe we can use a bit more
sophisticated data structure here? Something like a list of skip
ranges. Every entry in the list is the range (pos1, pos2) that the
listener can skip during the reading. So instead of advancing
advisoryPos every notifying backend should add skip range to the list.
Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
pos3) -> (pos1, pos3). We also can limit the number of entries to 5
for example. Listeners on their side should clear the list before
reading and skip all ranges from it. What do you think? Is it
overkill?
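To make the idea concrete, a sketch of the data structure (all names hypothetical):
```
/* Sketch: each entry tells the listener "you may jump from start to end
 * without reading".  Notifiers append or merge ranges; the listener
 * drains the list before reading. */
#define MAX_SKIP_RANGES 5

typedef struct SkipRange
{
	QueuePosition start;	/* skip from here ... */
	QueuePosition end;		/* ... up to here */
} SkipRange;

typedef struct SkipRangeList
{
	int			nranges;	/* 0 .. MAX_SKIP_RANGES */
	SkipRange	ranges[MAX_SKIP_RANGES];
} SkipRangeList;
```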
>
> > Alternatively, split the queue pos
> > into "this is where to read next" and "this is as much as I'm
> > definitively done with", where the second field gets advanced at
> > the end of asyncQueueReadAllNotifications. Not sure which
> > view would be less confusing (in the end I guess they're nearly
> > the same thing, differently explained).
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
>
IMHO it's a little bit more confusing than the first option. Two
points I noticed:
1) We have a fast path in asyncQueueReadAllNotifications()
if (QUEUE_POS_EQUAL(pos, head))
{
/* Nothing to do, we have read all notifications already. */
return;
}
Should we update donePos here? It looks like donePos may never be
updated without it.
2) In SignalBackends()
/* Signal backends that have fallen too far behind */
lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
QUEUE_POS_PAGE(pos));
if (lag >= QUEUE_CLEANUP_DELAY)
{
pid = QUEUE_BACKEND_PID(i);
Assert(pid != InvalidPid);
QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
pids[count] = pid;
procnos[count] = i;
count++;
}
Should we use donePos here as it is responsible for queue truncation now?
> > A different line of thought could be to get rid of
> > asyncQueueReadAllNotifications's optimization of moving the
> > queue pos only once, per
> >
> > * (We could alternatively retake NotifyQueueLock and move the position
> > * before handling each individual message, but that seems like too much
> > * lock traffic.)
> >
> > Since we only need shared lock to advance our own queue pos,
> > maybe that wouldn't be too awful. Not sure.
>
> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
>
Hmm, it seems we still have the race when in the beginning of
asyncQueueReadAllNotifications we read pos into the local variable and
release the lock. IIUC to avoid the race without introducing another
field here, the listener needs to hold the lock until it updates its
position so that the notifying backend cannot change it concurrently.
Best regards,
Arseniy Mukhin
On Thu, Oct 16, 2025, at 22:16, Tom Lane wrote:
> "Joel Jacobson" <joel@compiler.org> writes:
>> On Thu, Oct 16, 2025, at 20:16, Joel Jacobson wrote:
>>> Building pendingNotifyChannels is O(N^2) yes, but how large N is
>>> realistic here?
>
>> I agree this looks like a real problem, since I guess it's not
>> completely unthinkable someone might have
>> some kind of trigger on a table, that could fire off NOTIFY
>> for each row, possibly causing hundreds of thousands of
>> notifies in the same db txn.
>
> We already de-duplicate identical NOTIFY operations for exactly that
> reason (cf. AsyncExistsPendingNotify). However, non-identical NOTIFYs
> obviously can't be merged.
>
> I wonder whether we could adapt that de-duplication logic so that
> it produces a list of unique channel names in addition to a list
> of unique NOTIFY events. One way could be a list/hashtable of
> channels used, and for each one a list/hashtable of unique payloads,
> rather than the existing single-level list/hashtable.
Thanks for the great idea! Yes, this was indeed possible.
0002-optimize_listen_notify-v20.patch:
* Added channelHashtab field, created and updated together with hashtab.
If we have channelHashtab, it's used within PreCommit_Notify to
quickly build pendingNotifyChannels.
In this email, I'm also answering the feedback from Arseniy Mukhin,
and I've based the alt1, alt2, alt3 .txt patches on top of v20.
On Sat, Oct 18, 2025, at 18:41, Arseniy Mukhin wrote:
> Thank you for the new version and all implementations!
Thanks for review and great ideas!
>> > I think we can perhaps salvage the idea if we invent a separate
>> > "advisory" queue position field, which tells its backend "hey,
>> > you could skip as far as here if you want", but is not used for
>> > purposes of SLRU truncation.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt1.txt
>
> pos = QUEUE_BACKEND_POS(i);
>
> /* Direct advancement for idle backends at the old head */
> if (pendingNotifies != NULL &&
> QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite))
> {
> QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
>
> If we have several notifying backends, it looks like only the first
> one will be able to do direct advancement here. Next notifying backend
> will fail on QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) as we don't
> wake up the listener and pos will be the same as it was for the first
> notifying backend.
Right.
> It seems that to accumulate direct advancement from
> several notifying backends we need to compare queueHeadBeforeWrite
> with advisoryPos here.
*** 0002-optimize_listen_notify-v20-alt1.txt:
* Fixed; compare advisoryPos with queueHeadBeforeWrite instead of pos.
> And we also need to advance advisoryPos to the
> listener's position after reading if advisoryPos falls behind.
* Fixed; set advisoryPos to max(pos, advisoryPos) in the PG_FINALLY block.
* Also noted Exec_ListenPreCommit didn't set advisoryPos to max
for the first LISTEN, now fixed.
> Minute of brainstorming
>
> I also thought about a workload that probably frequently can be met.
> Let's say we have sequence of notifications:
>
> F F F T F F F T F F F T
>
> Here F - notification from the channel we don't care about and T - the opposite.
> It seems that after the first 'T' notification it will be more
> difficult for notifying backends to do 'direct advancement' as there
> will be some lag before the listener reads the notification and
> advances its position. Not sure if it's a problem, probably it depends
> on the intensity of notifications.
Hmm, I realize both the advisoryPos and donePos ideas share a problem:
they both require listening backends to wake up eventually anyway,
just to advance the 'pos'.
The holy grail would be to avoid this context-switching cost entirely,
and only wake up listening backends when they are actually
interested in the queued notifications. I think the third idea,
alt3, is the most promising in achieving this goal.
> But maybe we can use a bit more
> sophisticated data structure here? Something like a list of skip
> ranges. Every entry in the list is the range (pos1, pos2) that the
> listener can skip during the reading. So instead of advancing
> advisoryPos every notifying backend should add skip range to the list.
> Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> for example. Listeners on their side should clear the list before
> reading and skip all ranges from it. What do you think? Is it
> overkill?
Hmm, maybe, but I'm a bit wary of too much complication.
Hopefully there is a simpler solution that avoids the need for this,
but sure, if we can't find one, I'm open to trying this skip-ranges idea.
>> > Alternatively, split the queue pos
>> > into "this is where to read next" and "this is as much as I'm
>> > definitively done with", where the second field gets advanced at
>> > the end of asyncQueueReadAllNotifications. Not sure which
>> > view would be less confusing (in the end I guess they're nearly
>> > the same thing, differently explained).
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt2.txt
>>
>
> IMHO it's a little bit more confusing than the first option. Two
> points I noticed:
>
> 1) We have a fast path in asyncQueueReadAllNotifications()
>
> if (QUEUE_POS_EQUAL(pos, head))
> {
> /* Nothing to do, we have read all notifications already. */
> return;
> }
>
> Should we update donePos here? It looks like donePos may never be
> updated without it.
*** 0002-optimize_listen_notify-v20-alt2.txt:
* Fixed; update donePos here
> 2) In SignalBackends()
>
> /* Signal backends that have fallen too far behind */
> lag = asyncQueuePageDiff(QUEUE_POS_PAGE(QUEUE_HEAD),
> QUEUE_POS_PAGE(pos));
>
> if (lag >= QUEUE_CLEANUP_DELAY)
> {
> pid = QUEUE_BACKEND_PID(i);
> Assert(pid != InvalidPid);
>
> QUEUE_BACKEND_WAKEUP_PENDING(i) = true;
> pids[count] = pid;
> procnos[count] = i;
> count++;
> }
>
> Should we use donePos here as it is responsible for queue truncation now?
* Fixed; use donePos here
>> > A different line of thought could be to get rid of
>> > asyncQueueReadAllNotifications's optimization of moving the
>> > queue pos only once, per
>> >
>> > * (We could alternatively retake NotifyQueueLock and move the position
>> > * before handling each individual message, but that seems like too much
>> > * lock traffic.)
>> >
>> > Since we only need shared lock to advance our own queue pos,
>> > maybe that wouldn't be too awful. Not sure.
>>
>> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
>>
>
> Hmm, it seems we still have the race when in the beginning of
> asyncQueueReadAllNotifications we read pos into the local variable and
> release the lock. IIUC to avoid the race without introducing another
> field here, the listener needs to hold the lock until it updates its
> position so that the notifying backend cannot change it concurrently.
*** 0002-optimize_listen_notify-v20-alt3.txt:
* Fixed; the shared 'pos' is now only updated if the new position is ahead.
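For clarity, the guard looks roughly like this (a sketch, not the exact patch code):
```
/* Sketch: only ever move our shared pos forward, since a notifier may
 * have direct-advanced it past our stale local copy. */
LWLockAcquire(NotifyQueueLock, LW_SHARED);
if (QUEUE_POS_PRECEDES(QUEUE_BACKEND_POS(MyProcNumber), pos))
	QUEUE_BACKEND_POS(MyProcNumber) = pos;
LWLockRelease(NotifyQueueLock);
```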
To me, it looks like alt3 is the winner in terms of simplicity, and is
also the winner in my ping-pong benchmark, due to avoiding context
switches more effectively than alt1 and alt2.
Eager to hear your thoughts!
/Joel
Attachments
On Mon, Oct 20, 2025, at 00:06, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt

My apologies, I forgot to attach 0002-optimize_listen_notify-v20.patch.

/Joel
Attachments
On Mon, Oct 20, 2025, at 00:10, Joel Jacobson wrote:
> Attachments:
> * 0001-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20.patch
> * 0002-optimize_listen_notify-v20-alt1.txt
> * 0002-optimize_listen_notify-v20-alt3.txt
> * 0002-optimize_listen_notify-v20-alt2.txt

Attaching a new alt1 version that fixes the mistake of using
max(pos, advisoryPos) for the lag calculation, which is wrong, since in
alt1 it's the backend itself that updates its 'pos' when it wakes up,
and it's 'pos' that asyncQueueAdvanceTail looks at in alt1.

/Joel
Attachments
On Mon, Oct 20, 2025 at 1:07 AM Joel Jacobson <joel@compiler.org> wrote:
>
> > Minute of brainstorming
> >
> > I also thought about a workload that can probably be met frequently.
> > Let's say we have a sequence of notifications:
> >
> > F F F T F F F T F F F T
> >
> > Here F - notification from the channel we don't care about and T - the opposite.
> > It seems that after the first 'T' notification it will be more
> > difficult for notifying backends to do 'direct advancement' as there
> > will be some lag before the listener reads the notification and
> > advances its position. Not sure if it's a problem, probably it depends
> > on the intensity of notifications.
>
> Hmm, I realize both the advisoryPos and donePos ideas share a problem:
> they both require listening backends to wake up eventually anyway,
> just to advance the 'pos'.
>
> The holy grail would be to avoid this context-switching cost entirely,
> and only wake up listening backends when they are actually
> interested in the queued notifications. I think the third idea,
> alt3, is the most promising in achieving this goal.

Yeah, it would be great.

> > But maybe we can use a bit more
> > sophisticated data structure here? Something like a list of skip
> > ranges. Every entry in the list is the range (pos1, pos2) that the
> > listener can skip during the reading. So instead of advancing
> > advisoryPos every notifying backend should add skip range to the list.
> > Notifying backends can merge neighbour ranges (pos1, pos2) & (pos2,
> > pos3) -> (pos1, pos3). We also can limit the number of entries to 5
> > for example. Listeners on their side should clear the list before
> > reading and skip all ranges from it. What do you think? Is it
> > overkill?
>
> Hmm, maybe, but I'm a bit wary of too much complication.
> Hopefully there is a simpler solution that avoids the need for this,
> but sure, if we can't find one, I'm open to trying this skip-ranges idea.

Yes, and it's probably worth doing some benchmarking to see if it's a
problem at all before implementing anything.

> >> > A different line of thought could be to get rid of
> >> > asyncQueueReadAllNotifications's optimization of moving the
> >> > queue pos only once, per
> >> >
> >> > * (We could alternatively retake NotifyQueueLock and move the position
> >> > * before handling each individual message, but that seems like too much
> >> > * lock traffic.)
> >> >
> >> > Since we only need shared lock to advance our own queue pos,
> >> > maybe that wouldn't be too awful. Not sure.
> >>
> >> Above idea is implemented in 0002-optimize_listen_notify-v19-alt3.txt
> >>
> >
> > Hmm, it seems we still have the race when in the beginning of
> > asyncQueueReadAllNotifications we read pos into the local variable and
> > release the lock. IIUC to avoid the race without introducing another
> > field here, the listener needs to hold the lock until it updates its
> > position so that the notifying backend cannot change it concurrently.
>
> *** 0002-optimize_listen_notify-v20-alt3.txt:
> * Fixed; the shared 'pos' is now only updated if the new position is ahead.

I managed to reproduce the race with v20-alt3. I tried to write a TAP
test reproducing the issue, so it was easier to validate changes.
Please find the attached TAP test. I added it to some random package
for simplicity.

Best regards,
Arseniy Mukhin
Attachments
> On Oct 21, 2025, at 00:43, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>
>
> I managed to reproduce the race with v20-alt3. I tried to write a TAP
> test reproducing the issue, so it was easier to validate changes.
> Please find the attached TAP test. I added it to some random package
> for simplicity.
>
With alt3, as we have acquired the notification lock after reading every
message to update the pos, I think we can do a little bit more
optimization:
The notifier, in SignalBackends():
* Now we check if a listener's pos equals beforeWritePos, and then do
"direct advancement".
* We can change this to: if a listener's pos is between beforeWritePos
and afterWritePos, then we can do the advancement.
The listener, in asyncQueueReadAllNotifications():
* With alt3, we only lock and update pos.
* We can do more: if the current pos in shared memory is ahead of the
local pos, that means some notifier has done an advancement, so we can
stop reading.
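A sketch of the widened notifier-side check (macro names as in async.c; illustrative only, not the exact patch code):
```
/* Sketch: advance a listener whose pos lies anywhere in
 * [queueHeadBeforeWrite, queueHeadAfterWrite), not only exactly at the
 * old head. */
if (QUEUE_POS_EQUAL(pos, queueHeadBeforeWrite) ||
	(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) &&
	 QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite)))
	QUEUE_BACKEND_POS(i) = queueHeadAfterWrite;
```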
I tried to run your TAP test on my MacBook, but failed:
```
t/008_listen-pos-race.pl .. Dubious, test returned 32 (wstat 8192, 0x2000)
No subtests run
Test Summary Report
-------------------
t/008_listen-pos-race.pl (Wstat: 8192 (exited 32) Tests: 0 Failed: 0)
Non-zero exit status: 32
Parse errors: No plan found in TAP output
Files=1, Tests=0, 3 wallclock secs ( 0.01 usr 0.01 sys + 0.10 cusr 0.29 csys = 0.41 CPU)
Result: FAIL
```
I didn't spend time debugging the problem. If you can figure out the problem, maybe I can run the test from my side.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Hi,

On Thu, Oct 23, 2025 at 11:17 AM Chao Li <li.evan.chao@gmail.com> wrote:
>
> > On Oct 21, 2025, at 00:43, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
> >
> > I managed to reproduce the race with v20-alt3. I tried to write a TAP
> > test reproducing the issue, so it was easier to validate changes.
> > Please find the attached TAP test. I added it to some random package
> > for simplicity.
> >
>
> With alt3, as we have acquired the notification lock after reading every
> message to update the pos, I think we can do a little bit more
> optimization:
>
> The notifier, in SignalBackends():
> * Now we check if a listener's pos equals beforeWritePos, and then do
> "direct advancement".
> * We can change this to: if a listener's pos is between beforeWritePos
> and afterWritePos, then we can do the advancement.
>
> The listener, in asyncQueueReadAllNotifications():
> * With alt3, we only lock and update pos.
> * We can do more: if the current pos in shared memory is ahead of the
> local pos, that means some notifier has done an advancement, so we can
> stop reading.
>

I think this would be a reasonable optimization if it weren't for the
race condition mentioned above. The problem is that if the local pos
lags behind the shared memory pos, it could point to a truncated queue
segment, so we shouldn't allow that.

> I tried to run your TAP test on my MacBook, but failed:
>
> ```
> t/008_listen-pos-race.pl .. Dubious, test returned 32 (wstat 8192, 0x2000)
> No subtests run
>
> Test Summary Report
> -------------------
> t/008_listen-pos-race.pl (Wstat: 8192 (exited 32) Tests: 0 Failed: 0)
> Non-zero exit status: 32
> Parse errors: No plan found in TAP output
> Files=1, Tests=0, 3 wallclock secs ( 0.01 usr 0.01 sys + 0.10 cusr 0.29 csys = 0.41 CPU)
> Result: FAIL
> ```
>
> I didn't spend time debugging the problem. If you can figure out the
> problem, maybe I can run the test from my side.
>

Thank you for trying the test. I think the test works for you as
expected: it should fail with an error, and I have the same error
status. Sorry, I failed to realize it could be confusing; probably it
would have been better to fail on some assert instead, but I thought an
error was enough for a temp reproducer. Please see
008_listen-pos-race_test.log for details.

Best regards,
Arseniy Mukhin
> On Oct 23, 2025, at 18:02, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>
> Hi,
>
> On Thu, Oct 23, 2025 at 11:17 AM Chao Li <li.evan.chao@gmail.com> wrote:
>>
>>> On Oct 21, 2025, at 00:43, Arseniy Mukhin <arseniy.mukhin.dev@gmail.com> wrote:
>>>
>>> I managed to reproduce the race with v20-alt3. I tried to write a TAP
>>> test reproducing the issue, so it was easier to validate changes.
>>> Please find the attached TAP test. I added it to some random package
>>> for simplicity.
>>>
>>
>> With alt3, as we have acquired the notification lock after reading every
>> message to update the pos, I think we can do a little bit more
>> optimization:
>>
>> The notifier, in SignalBackends():
>> * Now we check if a listener's pos equals beforeWritePos, and then do
>> "direct advancement".
>> * We can change this to: if a listener's pos is between beforeWritePos
>> and afterWritePos, then we can do the advancement.
>>
>> The listener, in asyncQueueReadAllNotifications():
>> * With alt3, we only lock and update pos.
>> * We can do more: if the current pos in shared memory is ahead of the
>> local pos, that means some notifier has done an advancement, so we can
>> stop reading.
>>
>
> I think this would be a reasonable optimization if it weren't for the
> race condition mentioned above. The problem is that if the local pos
> lags behind the shared memory pos, it could point to a truncated queue
> segment, so we shouldn't allow that.

I figured out a way to resolve the race condition for alt3:

* add an awakening flag for every listener; this flag is only set by
listeners
* add an advisory pos for every listener, similar to alt1
* if a listener is awake, a notifier only updates the listener's advisory
pos; otherwise it directly advances the position
* if a running listener sees that its current pos is behind the advisory
pos, it stops reading

See more details in the attached patch file; I added code comments for my
changes. Now the TAP test won't hit the race condition.

```
# +++ tap check in src/test/authentication +++
t/008_listen-pos-race.pl .. skipped: Injection points not supported by this build
Files=1, Tests=0, 0 wallclock secs ( 0.00 usr 0.00 sys + 0.03 cusr 0.01 csys = 0.04 CPU)
Result: NOTESTS
```

And with my solution, listeners will still use the local pos, so they
no longer need to acquire the notification lock in every loop.

The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch.
Please see if my solution works.

I also made a tiny change in the TAP script to allow it to terminate
gracefully.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
Attachments
On Sun, Oct 26, 2025, at 05:11, Chao Li wrote:
> I figured out a way to resolve the race condition for alt3:
>
> * add an awakening flag for every listener; this flag is only set by
> listeners
> * add an advisory pos for every listener, similar to alt1
> * if a listener is awake, notify only updates the listener's advisory
> pos; otherwise it directly advances the listener's position
> * if a running listener sees that its current pos is behind the advisory
> pos, it stops reading
>
> See the attached patch file for more details; I added code comments for
> my changes. Now the TAP test won't hit the race condition.
> ```
> # +++ tap check in src/test/authentication +++
> t/008_listen-pos-race.pl .. skipped: Injection points not supported by
> this build
> Files=1, Tests=0, 0 wallclock secs ( 0.00 usr 0.00 sys + 0.03 cusr
> 0.01 csys = 0.04 CPU)
> Result: NOTESTS
> ```
>
> And with my solution, listeners will still use the local pos, so they no
> longer need to acquire the notification lock in every loop.
This sounds promising, and similar to what I had in mind. I was thinking
about using the advisoryPos only when the listening backend is known to
be running (which felt like it would need another shared boolean field),
and moving its pos field directly only when it's not running: if it's
running, we don't need to optimize for context switching, since it's by
definition already running.
What I wanted to investigate is all the concurrency situations that we
can imagine, i.e. to permute all the possible differences we can think
of into a truth table, and reason about each case.
The ones I can think of are, from the perspective of SignalBackends,
reasoning about a specific listening backend:
{is interested in the notifications, is not interested in the notifications} x
{wakeupPending=false, wakeupPending=true} x
{pos < queueHeadBeforeWrite, pos == queueHeadBeforeWrite, pos > queueHeadBeforeWrite, pos == queueHeadAfterWrite, pos > queueHeadAfterWrite} x
{is running, is not running}
This gives 2x2x5x2=40 states to reason about. Some of these combinations
are probably impossible, but I still think it would be good to include
them and explain why they are impossible.
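As a starting point for that exercise, a small throwaway program can
print the full checklist to annotate; this is illustration only, not
backend code:

```c
#include <stdio.h>

/* Enumerate the 2 x 2 x 5 x 2 = 40 combinations described above. */
int
main(void)
{
    const char *interest[] = {"interested", "not interested"};
    const char *pending[] = {"wakeupPending=false", "wakeupPending=true"};
    const char *pos[] = {"pos < queueHeadBeforeWrite",
                         "pos == queueHeadBeforeWrite",
                         "pos > queueHeadBeforeWrite",
                         "pos == queueHeadAfterWrite",
                         "pos > queueHeadAfterWrite"};
    const char *running[] = {"running", "not running"};
    int n = 0;

    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 5; k++)
                for (int l = 0; l < 2; l++)
                    printf("%2d: %s, %s, %s, %s\n", ++n,
                           interest[i], pending[j], pos[k], running[l]);
    return 0;
}
```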
> The patch stack is: v20 patch -> alt3 patch -> tap patch -> my patch.
> Please see if my solution works.
>
> I also made a tiny change in the TAP script to allow it to terminate gracefully.
I haven't looked at the code yet. I tried to apply the patches, but it fails:
shasum of files:
```
ca54ffa02ac54efd65acce0d09b18e630b5d7982 0001-optimize_listen_notify-v20.patch
5755701bb0e7ac7a0cea3abab9d74a0001b7b63a 0002-optimize_listen_notify-v20.patch
5819e23b5760023be70d2582207b72164904e952 0002-optimize_listen_notify-v20-alt3.txt
33d700dc0b3288d46705e85d381cb564d99079d1 0001-TAP-test-with-listener-pos-race.patch.nocfbot
8ee716451bd5f85761b666712bdfd8b5d936f92d fix-race.patch
```
Trying to apply them on top of current master (39dcfda2d23ac39f14ecf4b83e01eae85d07d9e5):
```
% git apply 0001-optimize_listen_notify-v20.patch
% git apply 0002-optimize_listen_notify-v20.patch
% git apply 0002-optimize_listen_notify-v20-alt3.txt
% git apply 0001-TAP-test-with-listener-pos-race.patch.nocfbot
% git apply fix-race.patch
fix-race.patch:100: indent with spaces.
(QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) && QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))) &&
error: patch failed: src/backend/commands/async.c:250
error: src/backend/commands/async.c: patch does not apply
error: patch failed: src/test/authentication/t/008_listen-pos-race.pl:8
error: src/test/authentication/t/008_listen-pos-race.pl: patch does not apply
```
I'll try to resolve it manually, but in case you're quicker to reply, I'm sending this now.
/Joel
On Sun, Oct 26, 2025, at 07:33, Joel Jacobson wrote:
> Trying to apply them on top of current master
> (39dcfda2d23ac39f14ecf4b83e01eae85d07d9e5):
>
> ```
> % git apply 0001-optimize_listen_notify-v20.patch
> % git apply 0002-optimize_listen_notify-v20.patch
> % git apply 0002-optimize_listen_notify-v20-alt3.txt
> % git apply 0001-TAP-test-with-listener-pos-race.patch.nocfbot
> % git apply fix-race.patch
> fix-race.patch:100: indent with spaces.
> (QUEUE_POS_PRECEDES(queueHeadBeforeWrite, pos) &&
> QUEUE_POS_PRECEDES(pos, queueHeadAfterWrite))) &&
> error: patch failed: src/backend/commands/async.c:250
> error: src/backend/commands/async.c: patch does not apply
> error: patch failed: src/test/authentication/t/008_listen-pos-race.pl:8
> error: src/test/authentication/t/008_listen-pos-race.pl: patch does not
> apply
> ```
>
> I'll try to resolve it manually, but in case you're quicker to reply,
> I'm sending this now.
I see the problem: it seems you based fix-race.patch on top of
0002-optimize_listen_notify-v19-alt3.txt, because fix-race.patch contains
this diff block, which is only present in that version:
```
@@ -2441,21 +2485,29 @@ asyncQueueReadAllNotifications(void)
page_buffer.buf,
snapshot);
- /*
- * Update our position in shared memory. The 'pos' variable now
- * holds our new position (advanced past all messages we just
- * processed). This ensures that if we fail while processing
```
I've compared 0002-optimize_listen_notify-v19-alt3.txt with
0002-optimize_listen_notify-v20-alt3.txt, and the only differences are
the addition of QUEUE_POS_PRECEDES (which fix-race.patch also adds) and
some locking and pos-handling changes.
/Joel
On Sun, Oct 26, 2025, at 07:33, Joel Jacobson wrote:
> On Sun, Oct 26, 2025, at 05:11, Chao Li wrote:
>> I figured out a way to resolve the race condition for alt3:
>>
>> * add an awakening flag for every listener; this flag is only set by
>> listeners
>> * add an advisory pos for every listener, similar to alt1
>> * if a listener is awake, notify only updates the listener's advisory
>> pos; otherwise it directly advances the listener's position
>> * if a running listener sees that its current pos is behind the advisory
>> pos, it stops reading
...
> This sounds promising, and similar to what I had in mind. I was thinking
> about using the advisoryPos only when the listening backend is known to
> be running (which felt like it would need another shared boolean field),
> and moving its pos field directly only when it's not running: if it's
> running, we don't need to optimize for context switching, since it's by
> definition already running.

Write-up of changes since v20:

Two new fields have been added to QueueBackendStatus:

+ QueuePosition advisoryPos; /* safe skip-ahead position */
+ bool advancingPos; /* backend is reading the queue */

These are used in SignalBackends and asyncQueueReadAllNotifications to
handle the ephemeral state of the shared queue position, since we don't
take a lock while advancing it in asyncQueueReadAllNotifications.

In SignalBackends, we no longer signal laggers in other databases;
instead we signal any listening backend that could possibly be behind
the old queue head, since we can't know whether such a backend is
interested in the notifications before the old queue head. Realistic
benchmarks will be needed to determine if this happens often enough to
warrant a more complex optimization, such as the ranges idea suggested
by Arseniy Mukhin.

In SignalBackends, if a backend that is uninterested in our
notifications has a shared pos at the old queue head, we check whether
it is currently advancing its pos. If it is not, we can set its shared
pos to the new queue head, i.e. "direct advance" it. Otherwise, if it is
currently advancing its pos and its advisory pos is behind our new queue
head, we update its advisory pos to our new queue head.

In asyncQueueReadAllNotifications, we start by setting wakeupPending to
false and advancingPos to true, to indicate that we've woken up and will
now start advancing the pos. We also check if the pos is behind the
advisory pos, and if so use the advisory pos to update the pos.

In asyncQueueReadAllNotifications's PG_FINALLY block, we reset
advancingPos to false and detect whether the advisoryPos was set by
SignalBackends while we were processing messages on the queue. If so,
and if the advisoryPos is ahead of our pos, we update our shared pos
with the advisoryPos; otherwise we update the shared pos with the new
pos.
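In rough pseudo-C, the listener side of this handshake looks like the
following (a simplified sketch; the QUEUE_BACKEND_* accessors for the
new fields are placeholder names, and error handling is elided):

```c
/* On entry to asyncQueueReadAllNotifications(), under NotifyQueueLock: */
QueuePosition pos;

QUEUE_BACKEND_WAKEUP_PENDING(MyProcNumber) = false;
QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = true;
pos = QUEUE_BACKEND_POS(MyProcNumber);
if (QUEUE_POS_PRECEDES(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber)))
    pos = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);

/* ... release the lock, read pages and process messages, advancing the
 * local 'pos' as we go ... */

/* In the PG_FINALLY block, under NotifyQueueLock again: */
QUEUE_BACKEND_ADVANCING_POS(MyProcNumber) = false;
if (QUEUE_POS_PRECEDES(pos, QUEUE_BACKEND_ADVISORY_POS(MyProcNumber)))
    QUEUE_BACKEND_POS(MyProcNumber) = QUEUE_BACKEND_ADVISORY_POS(MyProcNumber);
else
    QUEUE_BACKEND_POS(MyProcNumber) = pos;
```

/Joel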
Attachments
> On Oct 27, 2025, at 07:24, Joel Jacobson <joel@compiler.org> wrote:
>
> Write-up of changes since v20:
>
> Two new fields have been added to QueueBackendStatus:
> + QueuePosition advisoryPos; /* safe skip-ahead position */
> + bool advancingPos; /* backend is reading the queue */
>
> These are used in SignalBackends and asyncQueueReadAllNotifications to
> handle the ephemeral state of the shared queue position, since we don't
> take a lock while advancing it in asyncQueueReadAllNotifications.
>
> In SignalBackends, we no longer signal laggers in other databases;
> instead we signal any listening backend that could possibly be behind
> the old queue head, since we can't know whether such a backend is
> interested in the notifications before the old queue head. Realistic
> benchmarks will be needed to determine if this happens often enough to
> warrant a more complex optimization, such as the ranges idea suggested
> by Arseniy Mukhin.
>
> In SignalBackends, if a backend that is uninterested in our
> notifications has a shared pos at the old queue head, we check whether
> it is currently advancing its pos. If it is not, we can set its shared
> pos to the new queue head, i.e. "direct advance" it. Otherwise, if it is
> currently advancing its pos and its advisory pos is behind our new queue
> head, we update its advisory pos to our new queue head.
>
> In asyncQueueReadAllNotifications, we start by setting wakeupPending to
> false and advancingPos to true, to indicate that we've woken up and will
> now start advancing the pos. We also check if the pos is behind the
> advisory pos, and if so use the advisory pos to update the pos.
>
> In asyncQueueReadAllNotifications's PG_FINALLY block, we reset
> advancingPos to false and detect whether the advisoryPos was set by
> SignalBackends while we were processing messages on the queue. If so,
> and if the advisoryPos is ahead of our pos, we update our shared pos
> with the advisoryPos; otherwise we update the shared pos with the new
> pos.
>
> /Joel
> <0001-optimize_listen_notify-v21.patch><0002-optimize_listen_notify-v21.patch>

I did a quick review of v21, focusing only on the "direct advancement"
logic.

In v21, you added advisoryPos and advancingPos, which is the same as my
proposed solution. But you missed an important point from mine.

Let's say listener L1 is doing a slow advance, because the last notifier
pushed a bunch of notifications and L1 is interested in them; say the
current QUEUE_HEAD is QH1. So L1 is reading until it reaches QH1.

Now notifier N1 comes. To N1, posBeforeWrite is QH1, and say
posAfterWrite is QH2. In this case, as L1 is reading, if N1 knows that
L1 will read till QH1, then N1 can still set L1's advisoryPos to QH2,
right? From this perspective, we need to add a new field,
advancingTillPos, to QueueBackendStatus. (This field was also missing
from my proposed patch.)

Then notifier N2 comes after N1. To N2, posBeforeWrite is QH2, and say
posAfterWrite is QH3. As L1 is still reading, and its advisoryPos is
QH2, N2 can also advance L1's advisoryPos to QH3.

Finally, L1 finishes reading and reaches QH1. Now it sees that
advisoryPos is QH3, so it can directly bump its pos to QH3.

Do you think this logic is valid?
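Roughly, the notifier-side rule would then become something like this
(placeholder field names, untested sketch):

```c
/*
 * For a listener i that is not interested in what we just wrote
 * (QUEUE_BACKEND_ADVISORY_POS and QUEUE_BACKEND_ADVANCING_TILL_POS are
 * placeholder macros).
 *
 * 'limit' is how far the listener is already guaranteed to get, either
 * by its current read pass (advancingTillPos) or by a previous advisory
 * skip (advisoryPos), whichever is later.
 */
QueuePosition limit = QUEUE_BACKEND_ADVISORY_POS(i);

if (QUEUE_POS_PRECEDES(limit, QUEUE_BACKEND_ADVANCING_TILL_POS(i)))
    limit = QUEUE_BACKEND_ADVANCING_TILL_POS(i);

/*
 * If that guaranteed point is exactly where our write started, the
 * listener can skip our messages as well.
 */
if (QUEUE_POS_EQUAL(limit, queueHeadBeforeWrite))
    QUEUE_BACKEND_ADVISORY_POS(i) = queueHeadAfterWrite;
```

With that rule, N1 would move L1's advisoryPos from QH1 to QH2, N2 would
move it from QH2 to QH3, and L1 would jump straight to QH3 when it
finishes at QH1.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/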