Re: Newly created replication slot may be invalidated by checkpoint
| От | Amit Kapila |
|---|---|
| Тема | Re: Newly created replication slot may be invalidated by checkpoint |
| Дата | |
| Msg-id | CAA4eK1J7Ep6p4CqcETCcYTeQzEjGJt+p_z50aKaEc9n6zZHoyg@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Newly created replication slot may be invalidated by checkpoint (vignesh C <vignesh21@gmail.com>) |
| Ответы |
RE: Newly created replication slot may be invalidated by checkpoint
|
| Список | pgsql-hackers |
On Wed, Jan 7, 2026 at 2:52 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 5 Jan 2026 at 15:01, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Dec 30, 2025 at 5:31 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Sun, Dec 14, 2025 at 8:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Thu, Dec 11, 2025 at 12:39 PM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > > The other idea to fix this problem is suggested by Alexander in his > > > > > > email [1] which is to introduce a new ReplicationSlotReserveWALLock > > > > > > for this purpose. I think introducing LWLock in back branches could be > > > > > > questionable. Did you evaluate the pros and cons of using that > > > > > > approach? > > > > > > > > > > I reviewed that approach, and I think the main distinction lies in whether to > > > > > use a new LWLock to serialize the process or rely on an existing lock. > > > > > Introducing a new LWLock in back branches would alter the size of > > > > > MainLWLockArray and affect NUM_INDIVIDUAL_LWLOCKS/LWTRANCHE_FIRST_USER_DEFINED. > > > > > Although this may not directly impact user applications since users typically > > > > > use standard APIs like RequestNamedLWLockTranche and LWLockNewTrancheId to add > > > > > private LWLocks, it still has a slight risk. Additionally, using an existing > > > > > lock could keep code similarity with the HEAD, which can be helpful for future > > > > > bug fixes and analysis. > > > > > > > > > > > > > Fair enough. I'll wait for Sawada-san/Vitaly to see what their opinion > > > > on this matter is. > > > > > > While it's hacky that the proposed approach takes > > > ReplicationSlotAllocationLock before > > > XLogGetReplicationSlotMinimumLSN() during checkpoint, I find that it's > > > better than introducing a new LWLock in minor versions which could > > > lead unexpected compatibility issues. > > > > > > Regarding the v10 patch, do we need to take > > > ReplicationSlotAllocationLock also at the following place? > > > > > > /* > > > * Recalculate the current minimum LSN to be used in the WAL segment > > > * cleanup. Then, we must synchronize the replication slots again in > > > * order to make this LSN safe to use. > > > */ > > > slotsMinReqLSN = XLogGetReplicationSlotMinimumLSN(); > > > CheckPointReplicationSlots(shutdown); > > > > > > > No, we don't need the allocation lock here because RedoRecPtr won't > > change after the previous computation so the WAL reservation during > > creation of slots won't be impacted. I mean if the slot reservation > > starts just before this computation, it should use the latest (this > > checkpoint cycle's RedoRecPtr) whereas that was not true with the case > > where the patch acquires it. > > > > > I think we need to add some comments regardless of taking the lwlock. > > > > > > > I have added a comment though not sure if it is required. > > Here is a version which includes back branch version patches with > pgindent changes. I have verified the following scenario in PG17, > PG16, PG15 and PG14 branches and works as expected with the patches: > Thanks for preparing and testing backbranch patches. One thing I observed a bit different in PG15 (and PG14) is that we need to do LogStandbySnapshot() and XLogFlush() under ReplicationSlotAllocationLock because of the current coding pattern in those branches. But even without this patch, we perform those actions in a loop if the required WAL is removed which is not ideal and changed in 16 and above. AFAICS, there is no problem in doing LogStandbySnapshot() and XLogFlush() under ReplicationSlotAllocationLock because the LW locks acquired in earlier functions are not acquired after ReplicationSlotAllocationLock, so there is no locking order problem. Also, the slot_creation shouldn't be such a frequent operation to matter. However, if required, we can move LogStandbySnapshot() and XLogFlush() outside the loop similar to 16 and above branches. OTOH, doing minimal changes in bank branches and that too in 15 and 14 to fix this issue seems like a prudent choice. Thoughts? -- With Regards, Amit Kapila.
В списке pgsql-hackers по дате отправления: