Обсуждение: walsender: Assert MyReplicationSlot is set before use
Hi Hackers,
While reviewing patch [1], I noticed that in walsender.c, the function NeedToWaitForStandbys() dereferences
MyReplicationSlot->data.failoverwithout first checking whether MyReplicationSlot is NULL.
Looking at the call chain:
```
logical_read_xlog_page() // XLogReaderRoutine->page_read callback
-> WalSndWaitForWal()
-> NeedToWaitForWal()
-> NeedToWaitForStandbys()
```
None of these callers explicitly checks whether MyReplicationSlot is NULL. Since they also don’t dereference
MyReplicationSlotthemselves, it wouldn’t be fair to push that responsibility up to the callers.
Looking elsewhere in the codebase, other places that dereference MyReplicationSlot (for example
CreateReplicationSlot())either do an explicit if (MyReplicationSlot != NULL) check or assert MyReplicationSlot != NULL.
Soit seems reasonable to make the assumption explicit by adding an Assert(MyReplicationSlot) in
NeedToWaitForStandbys().
Another related issue is in StartLogicalReplication(): the function initially asserts MyReplicationSlot == NULL, then
callsReplicationSlotAcquire(), which is expected to set up MyReplicationSlot. Since MyReplicationSlot is dereferenced
laterin this function, I think it would be clearer and safer to add an Assert(MyReplicationSlot != NULL) immediately
afterthe call to ReplicationSlotAcquire().
The attached patch addresses both of these points.
[1] https://postgr.es/m/CAOzEurSRii0tEYhu5cePmRcvS=ZrxTLEvxm3Kj0d7_uKGdM23g@mail.gmail.com
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On Mon, 2 Feb 2026 at 11:55, Chao Li <li.evan.chao@gmail.com> wrote: > > Hi Hackers, > > While reviewing patch [1], I noticed that in walsender.c, the function NeedToWaitForStandbys() dereferences MyReplicationSlot->data.failoverwithout first checking whether MyReplicationSlot is NULL. > > Looking at the call chain: > ``` > logical_read_xlog_page() // XLogReaderRoutine->page_read callback > -> WalSndWaitForWal() > -> NeedToWaitForWal() > -> NeedToWaitForStandbys() > ``` > > None of these callers explicitly checks whether MyReplicationSlot is NULL. Since they also don’t dereference MyReplicationSlotthemselves, it wouldn’t be fair to push that responsibility up to the callers. > > Looking elsewhere in the codebase, other places that dereference MyReplicationSlot (for example CreateReplicationSlot())either do an explicit if (MyReplicationSlot != NULL) check or assert MyReplicationSlot != NULL. Soit seems reasonable to make the assumption explicit by adding an Assert(MyReplicationSlot) in NeedToWaitForStandbys(). > > Another related issue is in StartLogicalReplication(): the function initially asserts MyReplicationSlot == NULL, then callsReplicationSlotAcquire(), which is expected to set up MyReplicationSlot. Since MyReplicationSlot is dereferenced laterin this function, I think it would be clearer and safer to add an Assert(MyReplicationSlot != NULL) immediately afterthe call to ReplicationSlotAcquire(). > > The attached patch addresses both of these points. You have forgotten to attach the patch. Regards, Vignesh
On Monday, February 2, 2026 2:25 PM Chao Li <li.evan.chao@gmail.com> wrote: > > While reviewing patch [1], I noticed that in walsender.c, the function > NeedToWaitForStandbys() dereferences MyReplicationSlot->data.failover > without first checking whether MyReplicationSlot is NULL. > > Looking at the call chain: > ``` > logical_read_xlog_page() // XLogReaderRoutine->page_read callback > -> WalSndWaitForWal() > -> NeedToWaitForWal() > -> NeedToWaitForStandbys() > ``` > > None of these callers explicitly checks whether MyReplicationSlot is NULL. > Since they also don’t dereference MyReplicationSlot themselves, it wouldn’t > be fair to push that responsibility up to the callers. > > Looking elsewhere in the codebase, other places that dereference > MyReplicationSlot (for example CreateReplicationSlot()) either do an explicit if > (MyReplicationSlot != NULL) check or assert MyReplicationSlot != NULL. So it > seems reasonable to make the assumption explicit by adding an > Assert(MyReplicationSlot) in NeedToWaitForStandbys(). I think it makes sense to add an Assert as you suggested. > > Another related issue is in StartLogicalReplication(): the function initially > asserts MyReplicationSlot == NULL, then calls ReplicationSlotAcquire(), which > is expected to set up MyReplicationSlot. Since MyReplicationSlot is > dereferenced later in this function, I think it would be clearer and safer to add > an Assert(MyReplicationSlot != NULL) immediately after the call to > ReplicationSlotAcquire(). I don't think the suggested Assert is unnecessary, as the primary function of ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid value. Any issues that arise should be handled within the function itself. I think an Assert placed after this function does not add additional safety. Best Regards, Hou zj
> On Feb 2, 2026, at 14:41, vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 2 Feb 2026 at 11:55, Chao Li <li.evan.chao@gmail.com> wrote: >> >> Hi Hackers, >> >> While reviewing patch [1], I noticed that in walsender.c, the function NeedToWaitForStandbys() dereferences MyReplicationSlot->data.failoverwithout first checking whether MyReplicationSlot is NULL. >> >> Looking at the call chain: >> ``` >> logical_read_xlog_page() // XLogReaderRoutine->page_read callback >> -> WalSndWaitForWal() >> -> NeedToWaitForWal() >> -> NeedToWaitForStandbys() >> ``` >> >> None of these callers explicitly checks whether MyReplicationSlot is NULL. Since they also don’t dereference MyReplicationSlotthemselves, it wouldn’t be fair to push that responsibility up to the callers. >> >> Looking elsewhere in the codebase, other places that dereference MyReplicationSlot (for example CreateReplicationSlot())either do an explicit if (MyReplicationSlot != NULL) check or assert MyReplicationSlot != NULL. Soit seems reasonable to make the assumption explicit by adding an Assert(MyReplicationSlot) in NeedToWaitForStandbys(). >> >> Another related issue is in StartLogicalReplication(): the function initially asserts MyReplicationSlot == NULL, thencalls ReplicationSlotAcquire(), which is expected to set up MyReplicationSlot. Since MyReplicationSlot is dereferencedlater in this function, I think it would be clearer and safer to add an Assert(MyReplicationSlot != NULL) immediatelyafter the call to ReplicationSlotAcquire(). >> >> The attached patch addresses both of these points. > > You have forgotten to attach the patch. > > Regards, > Vignesh Oops, here comes it. Thanks for the reminder. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
Вложения
> On Feb 2, 2026, at 14:43, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, February 2, 2026 2:25 PM Chao Li <li.evan.chao@gmail.com> wrote: >> >> While reviewing patch [1], I noticed that in walsender.c, the function >> NeedToWaitForStandbys() dereferences MyReplicationSlot->data.failover >> without first checking whether MyReplicationSlot is NULL. >> >> Looking at the call chain: >> ``` >> logical_read_xlog_page() // XLogReaderRoutine->page_read callback >> -> WalSndWaitForWal() >> -> NeedToWaitForWal() >> -> NeedToWaitForStandbys() >> ``` >> >> None of these callers explicitly checks whether MyReplicationSlot is NULL. >> Since they also don’t dereference MyReplicationSlot themselves, it wouldn’t >> be fair to push that responsibility up to the callers. >> >> Looking elsewhere in the codebase, other places that dereference >> MyReplicationSlot (for example CreateReplicationSlot()) either do an explicit if >> (MyReplicationSlot != NULL) check or assert MyReplicationSlot != NULL. So it >> seems reasonable to make the assumption explicit by adding an >> Assert(MyReplicationSlot) in NeedToWaitForStandbys(). > > I think it makes sense to add an Assert as you suggested. Thanks for confirming. > >> >> Another related issue is in StartLogicalReplication(): the function initially >> asserts MyReplicationSlot == NULL, then calls ReplicationSlotAcquire(), which >> is expected to set up MyReplicationSlot. Since MyReplicationSlot is >> dereferenced later in this function, I think it would be clearer and safer to add >> an Assert(MyReplicationSlot != NULL) immediately after the call to >> ReplicationSlotAcquire(). > > I don't think the suggested Assert is unnecessary, as the primary function of > ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid value. Any > issues that arise should be handled within the function itself. I think an > Assert placed after this function does not add additional safety. > Yes, ReplicationSlotAcquire() has a retry logic to ensure MyReplicationSlot to be set. So adding a NULL check here is a bitredundant. How about Assert(SlotIsLogical(MyReplicationSlot))? Does it make more sense? Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
On Monday, February 2, 2026 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote: > > > On Feb 2, 2026, at 14:43, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> > wrote: > > > > On Monday, February 2, 2026 2:25 PM Chao Li <li.evan.chao@gmail.com> wrote: > > > >> > >> Another related issue is in StartLogicalReplication(): the function > >> initially asserts MyReplicationSlot == NULL, then calls > >> ReplicationSlotAcquire(), which is expected to set up > >> MyReplicationSlot. Since MyReplicationSlot is dereferenced later in > >> this function, I think it would be clearer and safer to add an > >> Assert(MyReplicationSlot != NULL) immediately after the call to > ReplicationSlotAcquire(). > > > > I don't think the suggested Assert is unnecessary, as the primary > > function of > > ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid > > value. Any issues that arise should be handled within the function > > itself. I think an Assert placed after this function does not add additional > safety. > > > > Yes, ReplicationSlotAcquire() has a retry logic to ensure MyReplicationSlot to > be set. So adding a NULL check here is a bit redundant. How about > Assert(SlotIsLogical(MyReplicationSlot))? Does it make more sense? I think we cannot assume the slot type here. A suitable checking might be: If a physical slot was acquired during logical replication, report an error, just like we do in StartReplication(). I haven't verified whether this error will occur in practice, but I think we could reproduce this by dropping the original logical slot and then creating a new physical slot with the same name concurrently. Best Regards, Hou zj
> On Feb 2, 2026, at 15:37, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, February 2, 2026 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote: >> >>> On Feb 2, 2026, at 14:43, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> >> wrote: >>> >>> On Monday, February 2, 2026 2:25 PM Chao Li <li.evan.chao@gmail.com> wrote: >>> >>>> >>>> Another related issue is in StartLogicalReplication(): the function >>>> initially asserts MyReplicationSlot == NULL, then calls >>>> ReplicationSlotAcquire(), which is expected to set up >>>> MyReplicationSlot. Since MyReplicationSlot is dereferenced later in >>>> this function, I think it would be clearer and safer to add an >>>> Assert(MyReplicationSlot != NULL) immediately after the call to >> ReplicationSlotAcquire(). >>> >>> I don't think the suggested Assert is unnecessary, as the primary >>> function of >>> ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid >>> value. Any issues that arise should be handled within the function >>> itself. I think an Assert placed after this function does not add additional >> safety. >>> >> >> Yes, ReplicationSlotAcquire() has a retry logic to ensure MyReplicationSlot to >> be set. So adding a NULL check here is a bit redundant. How about >> Assert(SlotIsLogical(MyReplicationSlot))? Does it make more sense? > > I think we cannot assume the slot type here. A suitable checking might > be: If a physical slot was acquired during logical replication, report an error, > just like we do in StartReplication(). > > I haven't verified whether this error will occur in practice, but I think we > could reproduce this by dropping the original logical slot and then creating a > new physical slot with the same name concurrently. > > Best Regards, > Hou zj Thanks for the information. I will verify that. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
> On Feb 2, 2026, at 15:37, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Monday, February 2, 2026 3:16 PM Chao Li <li.evan.chao@gmail.com> wrote: >> >>> On Feb 2, 2026, at 14:43, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> >> wrote: >>> >>> On Monday, February 2, 2026 2:25 PM Chao Li <li.evan.chao@gmail.com> wrote: >>> >>>> >>>> Another related issue is in StartLogicalReplication(): the function >>>> initially asserts MyReplicationSlot == NULL, then calls >>>> ReplicationSlotAcquire(), which is expected to set up >>>> MyReplicationSlot. Since MyReplicationSlot is dereferenced later in >>>> this function, I think it would be clearer and safer to add an >>>> Assert(MyReplicationSlot != NULL) immediately after the call to >> ReplicationSlotAcquire(). >>> >>> I don't think the suggested Assert is unnecessary, as the primary >>> function of >>> ReplicationSlotAcquire() is to get MyReplicationSlot assigned a valid >>> value. Any issues that arise should be handled within the function >>> itself. I think an Assert placed after this function does not add additional >> safety. >>> >> >> Yes, ReplicationSlotAcquire() has a retry logic to ensure MyReplicationSlot to >> be set. So adding a NULL check here is a bit redundant. How about >> Assert(SlotIsLogical(MyReplicationSlot))? Does it make more sense? > > I think we cannot assume the slot type here. A suitable checking might > be: If a physical slot was acquired during logical replication, report an error, > just like we do in StartReplication(). Good point. In StartReplication(), we check MyReplicationSlot is not logical, correspondingly, in StartLogicalReplication(),we should check MyReplicationSlot is not physical. > > I haven't verified whether this error will occur in practice, but I think we > could reproduce this by dropping the original logical slot and then creating a > new physical slot with the same name concurrently. > Without sufficient load, this race is hard to trigger manually, so I added a TAP test using an injection point. With theinjection point in place, the TAP test can hit the error: ``` # +++ tap check in src/test/recovery +++ t/052_logical_slot_type_race.pl .. ok All tests successful. Files=1, Tests=2, 3 wallclock secs ( 0.00 usr 0.00 sys + 0.11 cusr 0.36 csys = 0.47 CPU) Result: PASS ``` I think this test demonstrates that adding a sanity check in StartLogicalReplication() is necessary. I’ve put the TAP script in 0002 and will leave it to the committer to decide whether to accept the test. PFA v2: * 0001 changes the Assert to explicitly verify that MyReplicationSlot is a logical slot. * 0002 adds a TAP test covering a race where a logical slot is dropped while a physical slot with the same name is createdconcurrently. If 0002 is accepted, I’m fine with squashing it into 0001. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/
Вложения
On Tue, Feb 3, 2026 at 12:38 PM Chao Li <li.evan.chao@gmail.com> wrote: > > I think we cannot assume the slot type here. A suitable checking might > > be: If a physical slot was acquired during logical replication, report an error, > > just like we do in StartReplication(). > > Good point. In StartReplication(), we check MyReplicationSlot is not logical, correspondingly, in StartLogicalReplication(),we should check MyReplicationSlot is not physical. StartLogicalReplication() calls CreateDecodingContext() after ReplicationSlotAcquire(), and CreateDecodingContext() seems to already perform this check. Isn't that sufficient? Regards, -- Fujii Masao
> On Feb 3, 2026, at 13:12, Fujii Masao <masao.fujii@gmail.com> wrote:
>
> On Tue, Feb 3, 2026 at 12:38 PM Chao Li <li.evan.chao@gmail.com> wrote:
>>> I think we cannot assume the slot type here. A suitable checking might
>>> be: If a physical slot was acquired during logical replication, report an error,
>>> just like we do in StartReplication().
>>
>> Good point. In StartReplication(), we check MyReplicationSlot is not logical, correspondingly, in
StartLogicalReplication(),we should check MyReplicationSlot is not physical.
>
> StartLogicalReplication() calls CreateDecodingContext() after
> ReplicationSlotAcquire(), and CreateDecodingContext() seems to
> already perform this check. Isn't that sufficient?
>
> Regards,
>
>
> --
> Fujii Masao
Ah, thank you so much for pointing out that. I didn’t notice CreateDecodingContext() has done the check already:
```
/* shorter lines... */
slot = MyReplicationSlot;
/* first some sanity checks that are unlikely to be violated */
if (slot == NULL)
elog(ERROR, "cannot perform logical decoding without an acquired slot");
/* make sure the passed slot is suitable, these are user facing errors */
if (SlotIsPhysical(slot))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot use physical replication slot for logical decoding")));
```
So, adding the check in StartLogicalReplication() is redundant. But I noticed the error message misses an “a” before
“physicalreplication”. Looking at the error message in StartReplicaton():
```
ReplicationSlotAcquire(cmd->slotname, true, true);
if (SlotIsLogical(MyReplicationSlot))
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("cannot use a logical replication slot for physical replication")));
```
It has an “a” before “logical replication”. Also I fixed that in CreateDecodingContext().
PFA v3:
* 0001:
- removed the validation of logical slot
- added a comment in StartLogicalReplication() to explain sanity check is deferred
- added an “a” in the error message in CreateDecodingContext()
* 0002: updated the expected error message in the TAP script
With v3 applied, the TAP test still passed.
Best reagards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/