Discussion: confusing results from pg_get_replication_slots()
Hi,

Since v13, pg_get_replication_slots() returns a wal_status field that supposedly tells you whether the slot is reserving WAL. It returns either "reserved", "extended", "unreserved", or "lost". However, the logic is more complicated than you might expect from a reporting function. We normally call GetWALAvailability() and report whatever it tells us, but there are two exceptions. First, if the slot is invalidated, we skip calling GetWALAvailability() and assume that the answer is "lost". Second, if something is still connected to the slot, we assume that any apparent "lost" answer is due to a race condition and instead return "unreserved". Both of these exceptions can occur at the same time, and the checks are done in the order I've listed here. Therefore, a still-connected slot which is invalidated is shown as "unreserved" rather than, as I would have expected, as "lost".

I don't believe we should apply both of these exceptions at the same time. If we actually called GetWALAvailability() and it said the WAL was lost, then perhaps the fact that somebody is still connected to the slot is contrary evidence, and maybe due to some race condition they can catch up again. But if we didn't call GetWALAvailability() and thought that the WAL was lost because the slot is invalidated, the fact that some process is still connected to that slot doesn't invalidate the conclusion. Once the slot is invalidated, it's ignored for purposes of deciding how much WAL to retain in the future, and it's ignored for hot_standby_feedback purposes. It is no longer protecting against any of the things against which slots are supposed to protect. For all practical intents and purposes, such a slot is no more - has ceased to be - has expired and gone to meet its maker - it's an ex-slot. It makes no sense to me to display that slot with a status that shows that there is some hope of recovery when in fact there is none.

Note, by the way, that in existing releases, connections to already-invalidated physical slots are not blocked. This has been changed, but only in master.

Here is a patch to make invalidated slots always report as "lost", which I propose to back-patch to all supported versions. Many people were involved in the diagnosis of this issue, but particular shout-outs are appropriate to my colleague Nitin Chobisa, who produced the first reproducible test case demonstrating the issue, and my colleague Pavan Deolasee, who further refined the test case and clearly established that it was possible for slots to emerge from the "lost" state, going back to "unreserved".

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachments
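To make the ordering of the two exceptions concrete, here is a minimal standalone model of the decision described above. This is only a sketch, not the actual PostgreSQL source: WalAvailability, wal_status(), slot_invalidated, walsender_pid and get_wal_availability are illustrative names.

#include <stdbool.h>

typedef enum
{
    WAL_RESERVED,
    WAL_EXTENDED,
    WAL_UNRESERVED,
    WAL_REMOVED                 /* normally reported as "lost" */
} WalAvailability;

static const char *
wal_status(bool slot_invalidated, int walsender_pid,
           WalAvailability (*get_wal_availability) (void))
{
    WalAvailability avail;

    /* Exception #1: an invalidated slot skips the real check, assumes lost. */
    if (slot_invalidated)
        avail = WAL_REMOVED;
    else
        avail = get_wal_availability();

    switch (avail)
    {
        case WAL_RESERVED:
            return "reserved";
        case WAL_EXTENDED:
            return "extended";
        case WAL_UNRESERVED:
            return "unreserved";
        case WAL_REMOVED:

            /*
             * Exception #2: a connected walsender downgrades "lost" to
             * "unreserved" on the assumption that it is a race -- even when
             * the slot is invalidated, which is the surprising combination
             * described above.  The proposed fix amounts to also requiring
             * !slot_invalidated here, so invalidated slots always say "lost".
             */
            if (walsender_pid != 0)
                return "unreserved";
            return "lost";
    }
    return "lost";              /* not reached */
}

In this model, an invalidated slot with a connected walsender currently comes out as "unreserved"; with the proposed fix it would always come out as "lost".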
Hi,

On 02/01/26 12:40, Robert Haas wrote:
> Here is a patch to make invalidated slots always report as "lost",
> which I propose to back-patch to all supported versions.

The patch looks correct to me. I'm just wondering if/how we could create a test for this. Is it possible to create a regression test or a TAP test? Or is it not worthwhile?

--
Matheus Alcantara
EDB: https://www.enterprisedb.com
On Fri, Jan 2, 2026 at 3:48 PM Matheus Alcantara <matheusssilv97@gmail.com> wrote:
> On 02/01/26 12:40, Robert Haas wrote:
> > Here is a patch to make invalidated slots always report as "lost",
> > which I propose to back-patch to all supported versions.
>
> The patch looks correct to me. I'm just wondering if/how we could
> create a test for this. Is it possible to create a regression test or a
> TAP test? Or is it not worthwhile?

It's relatively difficult to reproduce this, especially on master. Amit Kapila's commit f41d8468ddea34170fe19fdc17b5a247e7d3ac78 changed the behavior for physical replication slots. Before that commit, you couldn't connect to an invalidated logical replication slot, but you could still connect to an invalidated physical replication slot. After this commit, both are prohibited. I imagine that Amit thought this was a distinction without a difference, because of course if the WAL is actually removed then use of the slot will fail later -- but that's not completely true, because there's no guarantee if or when the connection will be used to fetch WAL that has been removed. Nonetheless, I think it's a good change: because invalidated replication slots are ignored, having stuff connect to them and pretend to use them is bad.

However, this means that if you wanted a TAP test for this, you would have to let a replication slot get far enough behind that it could be invalidated, trigger a checkpoint that actually invalidates it, and then have the process using the connection catch up quickly enough that it never tries to fetch removed WAL. In older releases, I believe it's a little easier to hit the problem, because you can actually reconnect to an invalidated slot, but I think you still need the timing to be just right, so that you catch up after the invalidation happens but before the files are actually removed. Even there, I don't see how you could construct a TAP test without injection points, and I'm not really convinced that it's worth adding a bunch of new infrastructure for this. Such a test wouldn't be likely to catch the next bug of this type, if there is one.

The best thing to do to really avoid future bugs of this type, IMHO, would be to modify pg_get_replication_slots() so that it does not editorialize on the value returned by GetWALAvailability(), but how to get there is arguable. Maybe we shouldn't display "lost" when the slot is invalidated but "invalidated", for example, and any other value means we're just returning whatever GetWALAvailability() told us. Also, maybe the exception for connected slots should just be removed, on the assumption that the race condition isn't common enough to matter, or maybe that logic should be pushed down into GetWALAvailability() if we want to keep it. I'm not sure. Any of that seems like too much to change in the back-branches, but I personally believe rethinking the logic here would be a better use of energy than developing test cases that verify the exact details of the current logic.

--
Robert Haas
EDB: http://www.enterprisedb.com
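Building on the standalone model sketched earlier in this thread, the "don't editorialize" idea for master could look roughly like this. Again, this is only an illustrative sketch with made-up names (wal_status_head), not a patch; it reuses the WalAvailability enum from the earlier sketch.

/*
 * Hypothetical master-only variant: report invalidation as its own value
 * and otherwise pass GetWALAvailability()'s answer through untouched.
 */
static const char *
wal_status_head(bool slot_invalidated,
                WalAvailability (*get_wal_availability) (void))
{
    if (slot_invalidated)
        return "invalidated";   /* WAL availability is moot at this point */

    switch (get_wal_availability())
    {
        case WAL_RESERVED:
            return "reserved";
        case WAL_EXTENDED:
            return "extended";
        case WAL_UNRESERVED:
            return "unreserved";
        case WAL_REMOVED:
            return "lost";
    }
    return "lost";              /* not reached */
}

Any connected-walsender race handling would then live inside GetWALAvailability() itself, if it were kept at all.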
Hi Robert!

I've tried to look at how people use wal_status. There are lots of monitoring usages where transient race conditions do not matter much. But in some cases fatal decisions are made on a "lost" basis, e.g.
https://github.com/readysettech/readyset/blob/cb77b75a56d952fb6b1c4171afa9f0b0175fb6d8/replicators/src/postgres_connector/connector.rs#L381

I concur that showing "unreserved" when there is no actual WAL is a bug.
The proposed fix will work and is very succinct. The resulting code structure is not super elegant, but acceptable.

I don't fully understand the circumstances under which this bug can do any harm. Maybe a negative safe_wal_size could be a surprise for some monitoring tools.

> On 2 Jan 2026, at 20:40, Robert Haas <robertmhaas@gmail.com> wrote:
>
> For all practical intents and purposes, such a slot is no
> more - has ceased to be - has expired and gone to meet its maker -
> it's an ex-slot. It makes no sense to me to display that slot with a
> status that shows that there is some hope of recovery when in fact
> there is none.
>
> Note, by the way, that in existing releases, connections to
> already-invalidated physical slots are not blocked. This has been
> changed, but only in master.

I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.

Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing a huge lag, the primary removes some WAL segments due to max_slot_wal_keep_size, the standby is disconnected, consumes several WALs from the archive, catches up and continues. Or, if something was vacuumed, cancels the analytical query. If we disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.

> On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Maybe we shouldn't display "lost" when the slot
> is invalidated but "invalidated", for example, and any other value
> means we're just returning whatever GetWALAvailability() told us.
> Also, maybe the exception for connected slots should just be removed, on
> the assumption that the race condition isn't common enough to matter,
> or maybe that logic should be pushed down into GetWALAvailability() if
> we want to keep it.

I don't think the following logic works: "someone seems to be connected to this slot, perhaps it's still not lost". This is an error-prone heuristic that is trying to work around a possibly stale restart_lsn.

For HEAD I'd propose to actually read restart_lsn, and determine whether the walsender will issue "requested WAL segment has already been removed" on its next attempt to send something. In that case the slot is "lost".

If I understand correctly, a slot might be "invalidated", but not "lost" in this sense yet: the timeout occurred, but the WAL is still there.

Best regards, Andrey Borodin.
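A sketch of the check Andrey describes might look like the following. This is illustrative only, not a proposed patch: XLogGetLastRemovedSegno(), XLByteToSeg() and wal_segment_size are existing backend facilities, while restart_lsn_is_lost() is a made-up helper name, and where exactly such a check should be wired in is left open.

#include "postgres.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"

/*
 * Would a walsender reading from this restart_lsn fail with
 * "requested WAL segment has already been removed"?
 */
static bool
restart_lsn_is_lost(XLogRecPtr restart_lsn)
{
    XLogSegNo   restart_segno;
    XLogSegNo   last_removed = XLogGetLastRemovedSegno();

    if (XLogRecPtrIsInvalid(restart_lsn))
        return true;            /* nothing is reserved at all */

    XLByteToSeg(restart_lsn, restart_segno, wal_segment_size);

    /* If that segment is at or before the newest removed one, it is gone. */
    return restart_segno <= last_removed;
}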
On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> The proposed fix will work and is very succinct. The resulting code structure is not super elegant, but acceptable.

Agreed.

> I don't fully understand the circumstances under which this bug can do any harm. Maybe a negative safe_wal_size could be a surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things that makes me think this outcome was not really intended.

> I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.
>
> Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing a huge lag, the primary removes some WAL segments due to max_slot_wal_keep_size, the standby is disconnected, consumes several WALs from the archive, catches up and continues. Or, if something was vacuumed, cancels the analytical query. If we disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.

I think for physical slots invalidation is a little bit of an odd concept -- why do we ever invalidate a physical slot at all, rather than just stop reserving WAL at some point and let what happens, happen? But the reality is that the slot cannot be resurrected once invalidated; you have to drop and recreate it. Possibly we should revisit that decision or document the logic more clearly, but that's not something to think of back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but "invalidated", for example, and any other value
> > means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connected slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think the following logic works: "someone seems to be connected to this slot, perhaps it's still not lost". This is an error-prone heuristic that is trying to work around a possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn, and determine whether the walsender will issue "requested WAL segment has already been removed" on its next attempt to send something. In that case the slot is "lost".
>
> If I understand correctly, a slot might be "invalidated", but not "lost" in this sense yet: the timeout occurred, but the WAL is still there.

What I think is *really bad* about this situation is that, when the slot is invalidated, showing it as unreserved makes it still look potentially useful. But no matter whether the WAL is present or not, the slot neither serves to reserve WAL nor to hold back xmin once invalidated. Therefore it is not useful. The user would be better off using no slot at all, in which case xmin would be held back and WAL reserved at least while the walreceiver is connected. It is not a question of whether the user can stream from the slot: the user doesn't need a slot to stream. It's a question of whether the user erroneously believes themselves to be protected against something when in fact they are using a defunct slot that is worse than no slot at all.

--
Robert Haas
EDB: http://www.enterprisedb.com
> On 3 Jan 2026, at 18:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> What I think is *really bad* about this situation is that, when the
> slot is invalidated, showing it as unreserved makes it still look
> potentially useful. But no matter whether the WAL is present or not,
> the slot neither serves to reserve WAL nor to hold back xmin once
> invalidated. Therefore it is not useful. The user would be better off
> using no slot at all, in which case xmin would be held back and WAL
> reserved at least while the walreceiver is connected. It is not a
> question of whether the user can stream from the slot: the user
> doesn't need a slot to stream. It's a question of whether the user
> erroneously believes themselves to be protected against something when
> in fact they are using a defunct slot that is worse than no slot at
> all.

Slot state is a mix of 3 values here: WAL reservation, WAL availability, xmin reservation.
WAL reservation is a 3-state value: "reserving", "extended reserving", "not reserving".
WAL availability is binary. Always true if reserving.
xmin reservation is binary, always true if WAL was continuously available (or is it whether the slot is connected at all?).

An "unreserved" slot does not reserve WAL, but holds xmin. WAL must be available.
A "lost" slot does not reserve WAL, and also does not hold xmin. WAL might be available, might be unavailable.

Is it possible to report the state of the slot consistently without race conditions?

Best regards, Andrey Borodin.
On Sat, Jan 3, 2026 at 9:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Slot state is a mix of 3 values here: WAL reservation, WAL availability, xmin reservation.
> WAL reservation is a 3-state value: "reserving", "extended reserving", "not reserving".
> WAL availability is binary. Always true if reserving.
> xmin reservation is binary, always true if WAL was continuously available (or is it whether the slot is connected at all?).
>
> An "unreserved" slot does not reserve WAL, but holds xmin. WAL must be available.
> A "lost" slot does not reserve WAL, and also does not hold xmin. WAL might be available, might be unavailable.
>
> Is it possible to report the state of the slot consistently without race conditions?

I don't know. I think that on master we should seriously consider reporting invalidated slots in some clearly-distinguishable way, e.g. reporting "invalidated" rather than "lost". The fact that the slot is invalidated means that whether or not the WAL is still available is a moot point. In the back-branches, I think introducing a new possible value of that column is too much, but I think that displaying "lost" is clearly better than displaying "unreserved", since the only reason we ever do the latter is a weird exception intended to catch race conditions, which really makes no sense for an invalidated slot, where recovery is not possible.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Sat, Jan 3, 2026 at 1:22 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Robert!
Hi Robert, Andrey,
> I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.
>
> Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing a huge lag,
> the primary removes some WAL segments due to max_slot_wal_keep_size, the standby is disconnected, consumes several WALs from
> the archive, catches up and continues. Or, if something was vacuumed, cancels the analytical query. If we disallow reconnection
> of this standby, it will stay in archive recovery. I don't see how that's a good thing.
The key problem here (as I understand it) is that STABLE branches can
silently disable hot_standby_feedback and cause unexplainable query
cancellations (of type confl_snapshot). So to frame this $thread
properly: for some people such query offload through a standby using
hot_standby_feedback is critical functionality, and it is important
that they know when it stops working (not only after getting conflicts).
So, the behaviour of e.g. v16 is confusing when max_slot_wal_keep_size
is in use (which activates slot invalidation), replication lag might be
present, and restore_command is in play (to make it easier to
reproduce; not strictly necessary, I think), and all of a sudden
confl_snapshot issues might arise. If we add to that that
pg_replication_slots reports wal_status='unreserved' (instead of
"lost") while restart_lsn is progressing, it becomes even harder to
understand what's going on. This fix by Robert simply makes it easier
to spot what's going on, but does not fix or prevent the core issue
itself.
I'll start from the end: from my tests it looks like v19/master
behaves more sanely today and does kill such replication connections
(marks the slot as "lost" and *prevents* connections there), so the
whole query cancellation discussion is simply not possible there.
primary:
2026-01-05 11:31:11.447 CET [40926] LOG: checkpoint starting: wal
2026-01-05 11:31:11.457 CET [40926] LOG: terminating process 41272 to release replication slot "slot1"
2026-01-05 11:31:11.457 CET [40926] DETAIL: The slot's restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
2026-01-05 11:31:11.457 CET [40926] HINT: You might need to increase "max_slot_wal_keep_size".
2026-01-05 11:31:11.457 CET [41272] FATAL: terminating connection due to administrator command
2026-01-05 11:31:11.457 CET [41272] STATEMENT: START_REPLICATION SLOT "slot1" 0/06000000 TIMELINE 1
2026-01-05 11:31:11.460 CET [41272] LOG: released physical replication slot "slot1"
2026-01-05 11:31:11.462 CET [40926] LOG: invalidating obsolete replication slot "slot1"
2026-01-05 11:31:11.462 CET [40926] DETAIL: The slot's restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
2026-01-05 11:31:11.462 CET [40926] HINT: You might need to increase "max_slot_wal_keep_size".
Even with archiving enabled, the standby won't ever be able to
reconnect unless the slot is recreated (the standby will continue
recovery using restore_command, but there won't be any lag shown, as
pg_stat_replication will be empty too since of course there won't be a
connection). This is a clear message and one knows how to deal with
that operationally, and it doesn't cause any mysterious conflicts out
of the blue (which is good). To Andrey's point, the v19 change could be
viewed as a feature regression in the situation where broken primary
replication is worked around by restore_command (v19 simply throws the
above until the slot is recreated, while the older versions would
silently re-connect using a replication connection and switch to the
walreceiver path [instead of restore_command], but that would disarm
hot_standby_feedback silently -- which creates the false premise that
it works, while in reality it does not, which IMHO is a larger gripe
than this $patch itself).
I haven't tested this one, but Amit's
f41d8468ddea34170fe19fdc17b5a247e7d3ac78 is within REL_18_STABLE, so
it's been that way for quite some time in ReplicationSlotAcquire().
So now, with e.g. v16:
- it allows reconnecting to "lost" slots
- this silently disarms hot_standby_feedback and causes unexplainable
query cancellations
- it shows such re-connected slots as "unreserved" without the patch
(which is bizarre and even harder to diagnose), so +1 to making it
"lost" instead. That makes it a little bit more visible (but certainly
it doesn't make it fully obvious that it may be disarming
hot_standby_feedback at all). BTW, I've tested the patch and under the
exact same conditions it now reports "lost" correctly.
- however, even with that, it is pretty unclear how and when people
should arrive at the conclusion that they should recreate their slots
and that this is the issue that might be causing query conflicts.
Anyway, the word "slot" is not mentioned even a single time in "26.4.2.
Handling Query Conflicts" [1], so perhaps we should just update the
docs for < v19 to say that this is a known issue (solved in v19), and
it's quite rare, but if one is actively using a "lost" slot they should
simply disconnect the standby and recreate the slot to avoid the
confl_snapshot cancellations.
v16's primary log:
2026-01-05 13:35:12.560 CET [87976] LOG: checkpoint starting: wal
2026-01-05 13:35:20.174 CET [87976] LOG: terminating process 88033 to release replication slot "slot1"
2026-01-05 13:35:20.174 CET [87976] DETAIL: The slot's restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
2026-01-05 13:35:20.174 CET [87976] HINT: You might need to increase max_slot_wal_keep_size.
2026-01-05 13:35:20.174 CET [88033] FATAL: terminating connection due to administrator command
2026-01-05 13:35:20.174 CET [88033] STATEMENT: START_REPLICATION SLOT "slot1" 0/3000000 TIMELINE 1
2026-01-05 13:35:51.281 CET [87976] LOG: invalidating obsolete replication slot "slot1"
2026-01-05 13:35:51.281 CET [87976] DETAIL: The slot's restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
2026-01-05 13:35:51.281 CET [87976] HINT: You might need to increase max_slot_wal_keep_size.
[..]
2026-01-05 13:35:51.407 CET [88659] LOG: received replication command: IDENTIFY_SYSTEM
2026-01-05 13:35:51.407 CET [88659] STATEMENT: IDENTIFY_SYSTEM
2026-01-05 13:35:51.407 CET [88659] LOG: received replication command: START_REPLICATION SLOT "slot1" 0/A6000000 TIMELINE 1
2026-01-05 13:35:51.407 CET [88659] STATEMENT: START_REPLICATION SLOT "slot1" 0/A6000000 TIMELINE 1
(clearly showing the lack of ReplicationSlotSetInactiveSince() from master).
Then the concern/scenario raised by Andrey -- allowing, in v19 (or in
the future), the switch from restore_command back to WAL streaming
(across an invalidated slot) -- sounds more like a new enhancement for
the (far) future, but this time while keeping backend xmin propagation
working.
-J.
[1] - https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT