Re: confusing results from pg_get_replication_slots()

From Robert Haas
Subject Re: confusing results from pg_get_replication_slots()
Msg-id CA+TgmoaLgv13eMwXuqNkipQU3ScK4+YJvBoJHobYGizojpy9iA@mail.gmail.com
In response to Re: confusing results from pg_get_replication_slots()  (Andrey Borodin <x4mmm@yandex-team.ru>)
Responses Re: confusing results from pg_get_replication_slots()
List pgsql-hackers
On Sat, Jan 3, 2026 at 7:22 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I concur that showing "unreserved" when there is no actual WAL is a bug.
> Proposed fix will work and is very succinct. Resulting code structure is not super elegant, but acceptable.

Agreed.

> I don't fully understand circumstances when this bug can do any harm. Maybe negative safe_wal_size could be a
> surprise for some monitoring tools.

Yes, the fact that safe_wal_size can go negative is one of the things
that makes me think this outcome was not really intended.

> I don't understand a reason to disallow reviving a slot. Ofc with some new LSN that is currently available in pg_wal.
>
> Imagine the following scenario: in a cluster of a Primary and a Standby a long analytical query is causing huge lag,
> primary removes some WAL segments due to max_slot_wal_keep_size, standby is disconnected, consumes several WALs from
> archive, catches up and continues. Or, if something was vacuumed, cancels analytical query. If we disallow reconnection
> of this standby, it will stay in archive recovery. I don't see how it's a good thing.

I think for physical slots invalidation is a little bit of an odd
concept -- why do we ever invalidate a physical slot at all, rather
than just stop reserving WAL at some point and let what happens,
happen? But the reality is that the slot cannot be resurrected once
invalidated; you have to drop and recreate it. Possibly we should
revisit that decision or document the logic more clearly, but that's
not something to think of back-patching.

> > On 3 Jan 2026, at 02:10, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > Maybe we shouldn't display "lost" when the slot
> > is invalidated but "invalidated", for example, and any other value
> > means we're just returning whatever GetWALAvailability() told us.
> > Also, maybe the exception for connected slots should just be removed, on
> > the assumption that the race condition isn't common enough to matter,
> > or maybe that logic should be pushed down into GetWALAvailability() if
> > we want to keep it.
>
> I don't think the following logic works: "someone seems to be connected to this slot, perhaps it's still not lost". This
> is an error-prone heuristic that is trying to work around a possibly stale restart_lsn.
> For HEAD I'd propose to actually read restart_lsn, and determine if walsender will issue "requested WAL segment has
> already been removed" on next attempt to send something. In this case slot is "lost".
>
> If I understand correctly, slot might be "invalidated", but not "lost" in this sense yet: timeout occurred, but WAL is
> still there.

What I think is *really bad* about this situation is that, when the
slot is invalidated, showing it as unreserved makes it still look
potentially useful. But no matter whether the WAL is present or not,
the slot serves neither to reserve WAL nor to hold back xmin once
invalidated. Therefore it is not useful. The user would be better off
using no slot at all, in which case xmin would be held back and WAL
reserved at least while the walreceiver is connected. It is not a
question of whether the user can stream from the slot: the user
doesn't need a slot to stream. It's a question of whether the user
erroneously believes themselves to be protected against something when
in fact they are using a defunct slot that is worse than no slot at
all.

--
Robert Haas
EDB: http://www.enterprisedb.com
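
The display rule floated in the quoted 3 Jan message (report "invalidated" for an invalidated slot, otherwise pass through whatever GetWALAvailability() returned) can be modelled roughly as below. This is only an illustrative Python sketch, not PostgreSQL source: the enum `WalAvailability` and the function `displayed_wal_status` are hypothetical names, the enum values simply mirror the wal_status strings mentioned in the thread, and the "connected slot" exception is deliberately left out, as the message suggests it could be.

```python
from enum import Enum
from typing import Optional


class WalAvailability(Enum):
    """Hypothetical stand-in for the result of GetWALAvailability()."""
    RESERVED = "reserved"
    EXTENDED = "extended"
    UNRESERVED = "unreserved"
    LOST = "lost"


def displayed_wal_status(invalidated: bool,
                         availability: Optional[WalAvailability]) -> Optional[str]:
    """Model of the proposed rule for the wal_status column.

    An invalidated slot is always reported as "invalidated", whether or
    not the WAL it once needed still exists, because it no longer
    reserves WAL or holds back xmin. Any other slot reports the
    availability state unchanged; None models a slot with no state to
    report.
    """
    if invalidated:
        return "invalidated"
    if availability is None:
        return None
    return availability.value


if __name__ == "__main__":
    # An invalidated slot whose WAL happens to still be present.
    print(displayed_wal_status(True, WalAvailability.UNRESERVED))
    # A healthy slot passes the availability state straight through.
    print(displayed_wal_status(False, WalAvailability.RESERVED))
```

Under this model an invalidated slot can never show up as "unreserved", which is the misleading outcome the message above objects to.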


