Re: Newly created replication slot may be invalidated by checkpoint
| От | Amit Kapila |
|---|---|
| Тема | Re: Newly created replication slot may be invalidated by checkpoint |
| Дата | |
| Msg-id | CAA4eK1KhbtdypBc-T=7RaWN7TOuaBN+bojK12=mRbM0HUdJYnQ@mail.gmail.com обсуждение исходный текст |
| Ответ на | RE: Newly created replication slot may be invalidated by checkpoint ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
| Список | pgsql-hackers |
On Thu, Dec 11, 2025 at 12:39 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > > > The other idea to fix this problem is suggested by Alexander in his > > email [1] which is to introduce a new ReplicationSlotReserveWALLock > > for this purpose. I think introducing LWLock in back branches could be > > questionable. Did you evaluate the pros and cons of using that > > approach? > > I reviewed that approach, and I think the main distinction lies in whether to > use a new LWLock to serialize the process or rely on an existing lock. > Introducing a new LWLock in back branches would alter the size of > MainLWLockArray and affect NUM_INDIVIDUAL_LWLOCKS/LWTRANCHE_FIRST_USER_DEFINED. > Although this may not directly impact user applications since users typically > use standard APIs like RequestNamedLWLockTranche and LWLockNewTrancheId to add > private LWLocks, it still has a slight risk. Additionally, using an existing > lock could keep code similarity with the HEAD, which can be helpful for future > bug fixes and analysis. > Fair enough. I'll wait for Sawada-san/Vitaly to see what their opinion on this matter is. > > Yet, another possibility is that we don't fix this in back branches > > prior to 18 but not sure how frequently it can impact users. Suyu, can > > you please tell how you found this problem in the first place? Is it > > via code-review or did you hit this in the production or while doing > > some related tests? > > > > BTW, I have asked a question regarding commit 2090edc6f32f652a2c in > > email [2]. Did you get a chance to look at that? > > Please refer to the next inline reply. > > > + /* > > + * Recalculate the current minimum LSN to be used in the WAL segment > > + * cleanup. Then, we must synchronize the replication slots again in > > + * order to make this LSN safe to use. > > + */ > > + slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN(); > > + CheckPointReplicationSlots(shutdown); > > + > > /* > > * Some slots have been invalidated; recalculate the old-segment > > * horizon, starting again from RedoRecPtr. > > */ > > XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size); > > - KeepLogSeg(recptr, &_logSegNo); > > + KeepLogSeg(recptr, slotsMinReqLSN, &_logSegNo); > > > > > > > > After invalidating the slots, we recalculate the slotsMinReqLSN with the > > latest value of XLogGetReplicationSlotMinimumLSN(). Can't it generate a more > > recent value of slot's restart_lsn which has not been flushed and we may end > > up removing the corresponding WAL? > > Since CheckPointReplicationSlots() is immediately called after recalculating the > slot's minimum LSN, it ensures that the dirty slot's restart_lsn is flushed to > disk before any WAL removal takes place. So, I think only those WALs whose > removal is based on the flushed restart_lsn value are eliminated. > Right. So, we can ignore this point. -- With Regards, Amit Kapila.
В списке pgsql-hackers по дате отправления: