Thread: confusing results from pg_get_replication_slots()

confusing results from pg_get_replication_slots()

From:
Robert Haas
Date:
Hi,

Since v13, pg_get_replication_slots() returns a wal_status field that
supposedly tells you whether the slot is reserving WAL. It returns
either "reserved", "extended", "unreserved", or "lost". However, the
logic is more complicated than you might expect from a reporting
function. We normally call GetWALAvailability() and report whatever it
tells us, but there are two exceptions. First, if the slot is
invalidated, we skip calling GetWALAvailability() and assume that the
answer is "lost". Second, if something is still connected to the slot,
we assume that any apparent "lost" answer is due to a race condition
and instead return "unreserved". Both of these exceptions can occur at
the same time, and the checks are done in the order I've listed here.
Therefore, a still-connected slot which is invalidated is shown as
"unreserved" rather than, as I would have expected, as "lost".

I don't believe we should apply both of these exceptions at the same
time. If we actually called GetWALAvailability() and it said the WAL
was lost, then perhaps the fact that somebody is still connected to the
slot is contrary evidence, and maybe due to some race condition they
can catch up again. But if we didn't call GetWALAvailability() and
thought that the WAL was lost because the slot is invalidated, the
fact that some process is still connected to that slot doesn't
invalidate the conclusion. Once the slot is invalidated, it's ignored
for purposes of deciding how much WAL to retain in the future, and
it's ignored for hot_standby_feedback purposes. It is no longer
protecting against any of the things against which slots are supposed
to protect. For all practical intents and purposes, such a slot is no
more - has ceased to be - has expired and gone to meet its maker -
it's an ex-slot. It makes no sense to me to display that slot with a
status that shows that there is some hope of recovery when in fact
there is none.

Note, by the way, that in existing releases, connections to
already-invalidated physical slots are not blocked. This has been
changed, but only in master.

Here is a patch to make invalidated slots always report as "lost",
which I propose to back-patch to all supported versions.
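
In terms of that toy model (this only shows the shape of the change; the
actual patch is in the attachment), the idea is to let the invalidation
check win, so that the still-connected exception no longer overrides it:

    /* In reported_status() above: an invalidated slot is always "lost". */
    if (slot_is_invalidated)
        s = LOST;
    else
    {
        s = get_wal_availability(wal_still_present);

        /* The race-condition exception now applies only to slots that
         * have not been invalidated. */
        if (s == LOST && still_connected)
            s = UNRESERVED;
    }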

Many people were involved in the diagnosis of this issue, but
particular shout-outs go to my colleague Nitin Chobisa,
who produced the first reproducible test case demonstrating the issue,
and my colleague Pavan Deolasee, who further refined the test case and
clearly established that it was possible for slots to emerge from the
"lost" state, going back to "unreserved".

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachments

Re: confusing results from pg_get_replication_slots()

From:
Matheus Alcantara
Date:
Hi,

On 02/01/26 12:40, Robert Haas wrote:
> Here is a patch to make invalidated slots always report as "lost",
> which I propose to back-patch to all supported versions.
> 

The patch looks correct to me. I'm just wondering if/how we could 
create a test for this. Is it possible to create a regression test or a 
TAP test? Or is it not worthwhile?

-- 
Matheus Alcantara
EDB: https://www.enterprisedb.com



Re: confusing results from pg_get_replication_slots()

From:
Robert Haas
Date:
On Fri, Jan 2, 2026 at 3:48 PM Matheus Alcantara
<matheusssilv97@gmail.com> wrote:
> On 02/01/26 12:40, Robert Haas wrote:
> > Here is a patch to make invalidated slots always report as "lost",
> > which I propose to back-patch to all supported versions.
>
> The patch looks correct to me. I'm just wondering if/how we could
> create a test for this. Is it possible to create a regression test or a
> TAP test? Or is it not worthwhile?

It's relatively difficult to reproduce this, especially on master.
Amit Kapila's commit f41d8468ddea34170fe19fdc17b5a247e7d3ac78 changed
the behavior for physical replication slots. Before that commit, you
couldn't connect to an invalidated logical replication slot, but you
could still connect to an invalidated physical replication slot. After
this commit, both are
prohibited. I imagine that Amit thought this was a distinction without
a difference, because of course if the WAL is actually removed then
use of the slot will fail later -- but that's not completely true,
because there's no guarantee if or when the connection will be used to
fetch WAL that has been removed. Nonetheless, I think it's a good
change: because invalidated replication slots are ignored, having
stuff connect to them and pretend to use them is bad.

However, this means that if you wanted a TAP test for this, you would
have to let a replication slot get behind far enough that it could be
invalidated, trigger a checkpoint that actually invalidates it, and
then have the process using the connection catch up quickly enough
that it never tries to fetch removed WAL. In older releases, I believe
it's a little easier to hit the problem, because you can actually
reconnect to an invalidated slot, but I think you still need to the
timing to be just right, so that you catch up after the invalidation
happens but before the files are actually removed. Even there, I don't
see how you could construct a TAP test without injection points, and
I'm not really convinced that it's worth adding a bunch of new
infrastructure for this. Such a test wouldn't be likely to catch the
next bug of this type, if there is one.

The best thing to do to really avoid future bugs of this type, IMHO,
would be to modify pg_get_replication_slots() so that it does not
editorialize on the value returned by GetWALAvailability(), but how to
get there is arguable. Maybe we shouldn't display "lost" when the slot
is invalidated but rather "invalidated", for example, so that any other
value means we're just returning whatever GetWALAvailability() told us.
Also, maybe the exception for connected slots should just be removed, on
the assumption that the race condition isn't common enough to matter,
or maybe that logic should be pushed down into GetWALAvailability() if
we want to keep it. I'm not sure. Any of that seems like too much to
change in the back-branches, but I personally believe rethinking the
logic here would be a better use of energy than developing test cases
that verify the exact details of the current logic.
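
To sketch one possible reading of that idea, again as a toy model rather
than a patch, reusing the invented get_wal_availability() stand-in from my
first message: give invalidation its own value and pass everything else
through untouched.

    /* Variant of the toy enum with a dedicated value for invalidation. */
    typedef enum { RESERVED, EXTENDED, UNRESERVED, LOST, INVALIDATED } wal_status;

    static wal_status reported_status(bool slot_is_invalidated,
                                      bool wal_still_present)
    {
        /* Invalidation gets its own value instead of borrowing "lost"... */
        if (slot_is_invalidated)
            return INVALIDATED;

        /* ...and every other value is exactly what GetWALAvailability()
         * said, with no connected-slot exception at all. */
        return get_wal_availability(wal_still_present);
    }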

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: confusing results from pg_get_replication_slots()

From:
Andrey Borodin
Date:
Hi Robert!

I've tried to look at how people use wal_status.
There are lots of monitoring usages where transient race conditions do not matter much.
But in some cases fatal decisions are made on the basis of "lost", e.g.

https://github.com/readysettech/readyset/blob/cb77b75a56d952fb6b1c4171afa9f0b0175fb6d8/replicators/src/postgres_connector/connector.rs#L381

I concur that showing "unreserved" when there is no actual WAL is a bug.
The proposed fix will work and is very succinct. The resulting code structure is not super elegant, but acceptable.

I don't fully understand the circumstances in which this bug can do any harm. Maybe negative safe_wal_size could be
a surprise for some monitoring tools.

> On 2 Jan 2026, at 20:40, Robert Haas <robertmhaas@gmail.com> wrote:
>
> For all practical intents and purposes, such a slot is no
> more - has ceased to be - has expired and gone to meet its maker -
> it's an ex-slot. It makes no sense to me to display that slot with a
> status that shows that there is some hope of recovery when in fact
> there is none.
>
> Note, by the way, that in existing releases, connections to
> already-invalidated physical slots are not blocked. This has been
> changed, but only in master.

I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.

Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing huge lag;
the primary removes some WAL segments due to max_slot_wal_keep_size; the standby is disconnected, consumes several WAL
files from the archive, catches up and continues. Or, if something was vacuumed, it cancels the analytical query. If we
disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.



> On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Maybe we shouldn't display "lost" when the slot
> is invalidated but rather "invalidated", for example, so that any other
> value means we're just returning whatever GetWALAvailability() told us.
> Also, maybe the exception for connected slots should just be removed, on
> the assumption that the race condition isn't common enough to matter,
> or maybe that logic should be pushed down into GetWALAvailability() if
> we want to keep it.

I don't think the following logic works: "someone seems to be connected to this slot, so perhaps it's still not lost".
This is an error-prone heuristic that tries to work around a possibly stale restart_lsn.
For HEAD I'd propose to actually read restart_lsn and determine whether the walsender would issue "requested WAL segment
has already been removed" on the next attempt to send something. In that case the slot is "lost".
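
As a hedged sketch of that check: in the server, the ingredients would be
the slot's restart_lsn converted to a segment number with XLByteToSeg() and
compared against XLogGetLastRemovedSegno() -- essentially the comparison
CheckXLogRemoved() makes when it raises that error, if I read the code
correctly. As a stand-alone model:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogSegNo;    /* stand-in for the server's XLogSegNo */

    /*
     * The walsender fails with "requested WAL segment ... has already been
     * removed" when the segment holding restart_lsn is at or before the
     * newest removed segment; under this proposal that is exactly when the
     * slot should be reported as "lost".
     */
    static bool walsender_would_fail(XLogSegNo restart_segno,
                                     XLogSegNo last_removed_segno)
    {
        return restart_segno <= last_removed_segno;
    }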

If I understand correctly, a slot might be "invalidated" but not yet "lost" in this sense: the timeout occurred, but the
WAL is still there.


Best regards, Andrey Borodin.


Re: confusing results from pg_get_replication_slots()

From:
Robert Haas
Date:
On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> The proposed fix will work and is very succinct. The resulting code structure is not super elegant, but acceptable.

Agreed.

> I don't fully understand the circumstances in which this bug can do any harm. Maybe negative safe_wal_size could be
> a surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things
that makes me think this outcome was not really intended.

> I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.
>
> Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing huge lag;
> the primary removes some WAL segments due to max_slot_wal_keep_size; the standby is disconnected, consumes several WAL
> files from the archive, catches up and continues. Or, if something was vacuumed, it cancels the analytical query. If we
> disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.

I think for physical slots invalidation is a little bit of an odd
concept -- why do we ever invalidate a physical slot at all, rather
than just stop reserving WAL at some point and let what happens,
happen? But the reality is that the slot cannot be resurrected once
invalidated; you have to drop and recreate it. Possibly we should
revisit that decision or document the logic more clearly, but that's
not something to think about back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but rather "invalidated", for example, so that any other
> > value means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connected slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think the following logic works: "someone seems to be connected to this slot, so perhaps it's still not lost".
> This is an error-prone heuristic that tries to work around a possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn and determine whether the walsender would issue "requested WAL segment
> has already been removed" on the next attempt to send something. In that case the slot is "lost".
>
> If I understand correctly, a slot might be "invalidated" but not yet "lost" in this sense: the timeout occurred, but the
> WAL is still there.

What I think is *really bad* about this situation is that, when the
slot is invalidated, showing it as unreserved makes it still look
potentially useful. But no matter whether the WAL is present or not,
the slot neither serves to reserve WAL nor to hold back xmin once
invalidated. Therefore it is not useful. The user would be better off
using no slot at all, in which case xmin would be held back and WAL
reserved at least while the walreceiver is connected. It is not a
question of whether the user can stream from the slot: the user
doesn't need a slot to stream. It's a question of whether the user
erroneously believes themselves to be protected against something when
in fact they are using a defunct slot that is worse than no slot at
all.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: confusing results from pg_get_replication_slots()

From:
Andrey Borodin
Date:

> On 3 Jan 2026, at 18:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> What I think is *really bad* about this situation is that, when the
> slot is invalidated, showing it as unreserved makes it still look
> potentially useful. But no matter whether the WAL is present or not,
> the slot neither serves to reserve WAL nor to hold back xmin once
> invalidated. Therefore it is not useful. The user would be better off
> using no slot at all, in which case xmin would be held back and WAL
> reserved at least while the walreceiver is connected. It is not a
> question of whether the user can stream from the slot: the user
> doesn't need a slot to stream. It's a question of whether the user
> erroneously believes themselves to be protected against something when
> in fact they are using a defunct slot that is worse than no slot at
> all.

Slot state is a mix of 3 values here: WAL reservation, WAL availability, xmin reservation.
WAL reservation is a 3-state value: "reserving", "extended reserving", "not reserving".
WAL availability is binary. Always true if reserving.
xmin reservation is binary, always true if WAL was continuously available (or is it connected at all?).

An "unreserved" slot does not reserve WAL, but holds xmin. WAL must be available.
"lost" does not reserve WAL, and also does not hold xmin. WAL might be available, might be unavailable.

Is it possible to report state of the slot consistently without race conditions?


Best regards, Andrey Borodin.


Re: confusing results from pg_get_replication_slots()

From:
Robert Haas
Date:
On Sat, Jan 3, 2026 at 9:52 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Slot state is a mix of 3 values here: WAL reservation, WAL availability, xmin reservation.
> WAL reservation is a 3-state value: "reserving", "extended reserving", "not reserving".
> WAL availability is binary. Always true if reserving.
> xmin reservation is binary, always true if WAL was continuously available (or is it connected at all?).
>
> An "unreserved" slot does not reserve WAL, but holds xmin. WAL must be available.
> "lost" does not reserve WAL, and also does not hold xmin. WAL might be available, might be unavailable.
>
> Is it possible to report state of the slot consistently without race conditions?

I don't know. I think that on master we should seriously consider
reporting invalidated slots in some clearly-distinguishable way, e.g.
report "invalidated" rather than "lost". The fact that the slot is
invalidated means that whether or not the WAL is still available is a
moot point. In the back-branches, I think introducing a new possible
value of that column is too much, but I think that displaying "lost" is
clearly better than displaying "unreserved", since the only reason we
ever do the latter is a weird exception intended to catch race
conditions, which really makes no sense for an invalidated slot, where
recovery is not possible.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: confusing results from pg_get_replication_slots()

From:
Jakub Wartak
Date:
On Sat, Jan 3, 2026 at 1:22 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> Hi Robert!

Hi Robert, Andrey,

> I don't understand the reason to disallow reviving a slot. Of course with some new LSN that is currently available in pg_wal.
>
> Imagine the following scenario: in a cluster of a primary and a standby, a long analytical query is causing huge lag;
> the primary removes some WAL segments due to max_slot_wal_keep_size; the standby is disconnected, consumes several WAL
> files from the archive, catches up and continues. Or, if something was vacuumed, it cancels the analytical query. If we
> disallow reconnection of this standby, it will stay in archive recovery. I don't see how that's a good thing.

The key problem here (as I understand it) is that the STABLE branches
can silently disable hot_standby_feedback and cause unexplainable query
cancellations (of type confl_snapshot). So to frame this $thread
properly: for some people, query offload to a standby using
hot_standby_feedback is critical functionality, and it is important
that they know when it stops working (and not only after getting
conflicts).

So the behaviour of e.g. v16 is confusing when max_slot_wal_keep_size
is in use (which activates slot invalidation), replication lag is
present, and restore_command is in play (it makes this easier to
reproduce, though I don't think it is strictly necessary): all of a
sudden confl_snapshot issues may arise. Add to that that
pg_replication_slots reports wal_status='unreserved' (instead of
"lost") while restart_lsn keeps progressing, and it becomes even harder
to understand what's going on. This fix by Robert simply makes it
easier to spot what's going on, but it does not fix or prevent the core
issue itself.

I'll start from the end: from my tests it looks like v19/master
behaves more sanely today and does kill such replication connections
(it marks the slot as "lost" and *prevents* connections to it), so the
whole query-cancellation scenario is simply not possible there.

primary:
    2026-01-05 11:31:11.447 CET [40926] LOG:  checkpoint starting: wal
    2026-01-05 11:31:11.457 CET [40926] LOG:  terminating process
41272 to release replication slot "slot1"
    2026-01-05 11:31:11.457 CET [40926] DETAIL:  The slot's
restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
    2026-01-05 11:31:11.457 CET [40926] HINT:  You might need to
increase "max_slot_wal_keep_size".
    2026-01-05 11:31:11.457 CET [41272] FATAL:  terminating connection
due to administrator command
    2026-01-05 11:31:11.457 CET [41272] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/06000000 TIMELINE 1
    2026-01-05 11:31:11.460 CET [41272] LOG:  released physical
replication slot "slot1"
    2026-01-05 11:31:11.462 CET [40926] LOG:  invalidating obsolete
replication slot "slot1"
    2026-01-05 11:31:11.462 CET [40926] DETAIL:  The slot's
restart_lsn 0/8B000000 exceeds the limit by 16777216 bytes.
    2026-01-05 11:31:11.462 CET [40926] HINT:  You might need to
increase "max_slot_wal_keep_size".

Even with archiving enabled, the standby won't ever be able to
reconnect unless the slot is recreated (the standby will continue
recovery using restore_command, but no lag will show up, as
pg_stat_replication will be empty too since of course there is no
connection). This is a clear message, one knows how to deal with it
operationally, and it doesn't cause any mysterious conflicts out of
the blue (which is good). To Andrey's point, the v19 change could be
viewed as a feature regression in the situation where replication to a
lagging standby is repaired via restore_command (v19 simply throws the
above error until the slot is recreated, while the older versions
would silently re-connect using a replication connection and switch to
the walreceiver path [instead of restore_command], but that would
disarm hot_standby_feedback silently -- giving the false impression
that it works while in reality it does not, which IMHO is a bigger
problem than this $patch itself).

I haven't tested this one, but Amit's
f41d8468ddea34170fe19fdc17b5a247e7d3ac78 is within REL_18_STABLE, so
ReplicationSlotAcquire() has behaved that way for quite some time
already.

So now, with e.g. v16:
- it allows reconnecting to "lost" slots
- this silently disarms hot_standby_feedback and causes unexplainable
query cancellations
- without the patch it shows such re-connected slots as "unreserved"
(which is bizarre and even harder to diagnose), so +1 to making it
"lost" instead. That makes it a little bit more visible (but certainly
it doesn't make it fully obvious that it may be disarming
hot_standby_feedback at all). BTW I've tested the patch and under the
exact same conditions it now reports "lost" correctly.
- however, even with that, it is pretty unclear how and when people
should arrive at the conclusion that they should recreate their slots
and that this is the issue that might be causing query conflicts. The
word "slot" is not mentioned even a single time in "26.4.2. Handling
Query Conflicts" [1], so perhaps we should just update the docs for
< v19 to say that this is a known issue (solved in v19) and quite
rare, but that if one is actively using a "lost" slot, they should
simply disconnect the standby and recreate the slots to avoid those
confl_snapshot cancellations.

v16's primary log
    2026-01-05 13:35:12.560 CET [87976] LOG:  checkpoint starting: wal
    2026-01-05 13:35:20.174 CET [87976] LOG:  terminating process
88033 to release replication slot "slot1"
    2026-01-05 13:35:20.174 CET [87976] DETAIL:  The slot's
restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
    2026-01-05 13:35:20.174 CET [87976] HINT:  You might need to
increase max_slot_wal_keep_size.
    2026-01-05 13:35:20.174 CET [88033] FATAL:  terminating connection
due to administrator command
    2026-01-05 13:35:20.174 CET [88033] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/3000000 TIMELINE 1
    2026-01-05 13:35:51.281 CET [87976] LOG:  invalidating obsolete
replication slot "slot1"
    2026-01-05 13:35:51.281 CET [87976] DETAIL:  The slot's
restart_lsn 0/A2020F50 exceeds the limit by 16642224 bytes.
    2026-01-05 13:35:51.281 CET [87976] HINT:  You might need to
increase max_slot_wal_keep_size.
    [..]
    2026-01-05 13:35:51.407 CET [88659] LOG:  received replication
command: IDENTIFY_SYSTEM
    2026-01-05 13:35:51.407 CET [88659] STATEMENT:  IDENTIFY_SYSTEM
    2026-01-05 13:35:51.407 CET [88659] LOG:  received replication
command: START_REPLICATION SLOT "slot1" 0/A6000000 TIMELINE 1
    2026-01-05 13:35:51.407 CET [88659] STATEMENT:  START_REPLICATION
SLOT "slot1" 0/A6000000 TIMELINE 1

(clearly due to the lack of ReplicationSlotSetInactiveSince() from master).

Then the concern/scenario raised by Andrey -- allowing v19 (or a
future release) to switch from restore_command back to WAL streaming
across an invalidated slot -- sounds more like a new enhancement for
the (far) future, but this time while keeping backend xmin propagation
working.

-J.

[1] - https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT