Re: Newly created replication slot may be invalidated by checkpoint
| From | Vitaly Davydov |
|---|---|
| Subject | Re: Newly created replication slot may be invalidated by checkpoint |
| Date | |
| Msg-id | 15922-68ca9280-4f-37de2c40@245457797 |
| In reply to | Newly created replication slot may be invalidated by checkpoint ("suyu.cmj" <mengjuan.cmj@alibaba-inc.com>) |
| Replies | Re: Newly created replication slot may be invalidated by checkpoint |
| List | pgsql-hackers |
Hi suyu.cmj

> The commit 2090edc6f32f652a2c introduced a change that the
> minimal restart_lsn is obtained at the start of checkpoint creation. If a
> replication slot is created and performs a WAL reservation concurrently, the
> WAL segment contains the new slot's restart_lsn could be removed by the ongoing
> checkpoint.

Thank you for reporting this issue. I agree, the issue with slot invalidation seems to take place in REL_17_STABLE and earlier, but it is not reproducible in 18+ versions because of a different implementation. The problem may appear if the first persistent slot is created during a checkpoint, when the slots' oldest LSN is invalid. I'm not sure how it works when other persistent slots exist. Probably, invalidation is still possible if the reservation happens with an LSN older than the oldest LSN of the existing slots.

In 17 and earlier versions, when a checkpoint starts, it takes the slots' oldest LSN using XLogGetReplicationSlotMinimumLSN(). This value is used later for WAL segment removal. If a new slot reserves WAL between the reading of the slots' oldest LSN and the WAL removal, it may be invalidated. This happens because ReplicationSlotReserveWal() checks XLogCtl->lastRemovedSegNo, but the segments are not yet removed at that point.

There is a subtle case: the WAL reservation completes while the checkpointer is between KeepLogSeg and RemoveOldXlogFiles, where XLogCtl->lastRemovedSegNo is updated. The slot will not be invalidated, but the segments reserved by the new slot may be removed, I guess.

In 17 and earlier we tried to create a compatible solution, where the oldest LSN was taken before syncing the slots to disk. In the master branch we added a new last_saved_restart_lsn field to the ReplicationSlot structure, which seems to be a better solution.

I prepared a simple fix [1] for 17 and earlier versions. It seems to fix the problem with creation of the first persistent slot. I also think it should work as it did before the patch that introduced this bug.
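
To illustrate the race described above, here is a toy, single-threaded C model. It is not PostgreSQL code and not the attached fix: the `ToyWal` struct, the helper names, the 16-unit segment size and the concrete LSN values are all assumptions made for the illustration. Only the ordering of the steps follows the description: the slots' minimum LSN is read, a new slot reserves WAL and passes the lastRemovedSegNo check, and then old segments are removed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEG_SIZE    16u   /* toy segment size in LSN units (assumption) */
#define INVALID_LSN 0u    /* "no slot has reserved WAL yet" */

/* Hypothetical stand-in for the relevant shared state. */
typedef struct {
    uint64_t last_removed_segno; /* models XLogCtl->lastRemovedSegNo */
    uint64_t slot_restart_lsn;   /* restart_lsn of the single slot, or INVALID_LSN */
    uint64_t redo_lsn;           /* redo pointer of the running checkpoint */
} ToyWal;

static uint64_t lsn_to_segno(uint64_t lsn) { return lsn / SEG_SIZE; }

/* Models the KeepLogSeg step: the first segment that must be kept. */
static uint64_t keep_from_segno(const ToyWal *w, uint64_t slot_min_lsn)
{
    uint64_t segno = lsn_to_segno(w->redo_lsn);
    if (slot_min_lsn != INVALID_LSN && lsn_to_segno(slot_min_lsn) < segno)
        segno = lsn_to_segno(slot_min_lsn);
    return segno;
}

/* Models the reservation check: succeeds if the segment holding the
 * requested LSN is newer than the last removed segment. */
static bool reserve_wal(ToyWal *w, uint64_t lsn)
{
    w->slot_restart_lsn = lsn;
    return lsn_to_segno(lsn) > w->last_removed_segno;
}

/* Models the RemoveOldXlogFiles step: drop every segment below keep_from. */
static void remove_old_segments(ToyWal *w, uint64_t keep_from)
{
    if (keep_from > 0 && keep_from - 1 > w->last_removed_segno)
        w->last_removed_segno = keep_from - 1;
}

/* Returns true if the new slot's WAL still exists after the checkpoint. */
bool simulate(bool min_lsn_taken_at_start)
{
    ToyWal w = { .last_removed_segno = 1,
                 .slot_restart_lsn = INVALID_LSN,
                 .redo_lsn = 100 };
    uint64_t min_lsn = INVALID_LSN;

    /* Checkpoint start: no slot exists yet, so the minimum is invalid. */
    if (min_lsn_taken_at_start)
        min_lsn = w.slot_restart_lsn;

    /* Concurrent slot creation: LSN 40 lies in segment 2, which is newer
     * than last_removed_segno = 1, so the check passes. */
    if (!reserve_wal(&w, 40))
        return false;

    /* Alternative ordering: read the minimum only after the reservation. */
    if (!min_lsn_taken_at_start)
        min_lsn = w.slot_restart_lsn;

    remove_old_segments(&w, keep_from_segno(&w, min_lsn));
    return lsn_to_segno(w.slot_restart_lsn) > w.last_removed_segno;
}
```

In this model, `simulate(true)` loses the slot's segment because the removal cutoff is derived from a stale, invalid minimum, while `simulate(false)` keeps it. The real code paths involve locking and retries that the sketch leaves out.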

I also made some changes to the original test script, for versions 17 ([2]) and 18 ([3]). I continue to investigate and test it.

[1] 0001-Fix-invalidation-when-slot-is-created-during-checkpo.patch
[2] v2-17-0001-Newly-created-replication-slot-may-be-invalidated-by.patch
[3] v2-18-0001-Newly-created-replication-slot-may-be-invalidated-by.patch

With best regards,
Vitaly