Re: Question about InvalidatePossiblyObsoleteSlot()

Поиск
Список
Период
Сортировка
От Masahiko Sawada
Тема Re: Question about InvalidatePossiblyObsoleteSlot()
Дата
Msg-id CAD21AoBvL8teXPEXK67_DaUsrft16QQ6XyGvuZVCnT-o8nH2ZA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Question about InvalidatePossiblyObsoleteSlot()  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Список pgsql-hackers
On Tue, Oct 14, 2025 at 9:51 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Tue, Oct 14, 2025 at 04:06:44PM -0700, Masahiko Sawada wrote:
> > On Sun, Oct 12, 2025 at 11:27 PM Bertrand Drouvot
> > <bertranddrouvot.pg@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > On Thu, Oct 09, 2025 at 10:49:39AM +0800, suyu.cmj wrote:
> > > > Hi,
> > > > Thank you for the reference to commit 818fefd8fd4 and the related discussion thread. I understand the intent of
introducinginitial_restart_lsn was to preserve a consistent invalidation cause throughout the invalidation loop. 
> > > > However, I still have a few concerns about this design change:
> > > > 1. I understand the intention to keep the invalidation cause consistent, but If a slot's restart_lsn advances
significantlyduring the invalidation check—indicating it is actively in use—shouldn't we reconsider invalidating it? 
> > > > 2. What potential issues arise if we refrain from invalidating slots whose restart_lsn advances during the
invalidationprocess? Intuitively, an actively used slot that has moved it's restart_lsn beyond the problematic point
shouldnot be marked invalid. 
> > > > 3. If the current approach is indeed correct, should we consider making PG15 and earlier consistent with this
behavior?The behavioral difference across versions may lead to different operational outcomes in otherwise similar
situations.
> > > > I would appreciate your insights on these points.
> > >
> > > I agree that before 818fefd8fd4 the invalidation cause could move from
> > > RS_INVAL_WAL_REMOVED to RS_INVAL_NONE if the slot restart lsn has been able to
> > > advance enough between the time we release the mutex and do the next check.
> > >
> > > With 818fefd8fd4 that's not the case anymore and we keep WAL_REMOVED as the
> > > invalidation cause (even if  the slot restart lsn has been able to advance
> > > enough).
> > >
> > > That looks safe to use the pre 818fefd8fd4 behavior for the slot restart lsn
> > > case because the WAL files have not yet been removed by the checkpointer/startup
> > > process when it's busy in InvalidatePossiblyObsoleteSlot().
> > >
> > > I think that we could get rid of the initial_restart_lsn and just use
> > > s->data.restart_lsn here (while keeping initial xmin ones to preserve the
> > > intent of 818fefd8fd4 for those).
> >
> > IIUC
>
> Thanks for looking at it!
>
> > with the proposed patch, it's possible that we report the slot
> > invalidation once but don't actually invalidate the slot if slot's
> > restart_lsn gets advanced and becomes greater than the oldestLSN after
> > the report, is that right?
>
> We don't really report an "invalidation", what we report is:
>
> LOG:  terminating process 3998707 to release replication slot "logical_slot"
> DETAIL:  The slot's restart_lsn 0/00842480 exceeds the limit by 2874240 bytes.
> HINT:  You might need to increase "max_slot_wal_keep_size".
>
> and we terminate the process:
>
> FATAL:  terminating connection due to administrator command
>
> We are not reporting:
>
> DETAIL:  This replication slot has been invalidated due to "wal_removed".
>
> and the slot is still valid.
>
> That's the pre 818fefd8fd4 behavior.

Thank you for the clarification! Understood.

> Ideally, I think that we should not report anything and not terminate the
> process. I did not look at it, maybe we could look at it as a second step (first
> step being to restore the pre 818fefd8fd4 behavior)?

I find that reporting of terminating a process having an possibly
obsolete slot is fine, but reading some related threads[1][2] it seems
to me that a problem we want to avoid is that we report "terminated"
without leading to an "obsolete" message. Does it make sense to report
explicitly that the slot's restart_lsn gets recovered and we therefore
skipped to invalidate it?

Regards,

[1] https://www.postgresql.org/message-id/flat/ZaTjW2Xh%2BTQUCOH0%40ip-10-97-1-34.eu-west-3.compute.internal
[2] https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459@gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



В списке pgsql-hackers по дате отправления: