Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
From: Alexander Korotkov
Subject: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Date:
Msg-id: CAPpHfdteS1w_ieKO9mChBonQ5xspwtVv69uxht0N6JmG81NsLg@mail.gmail.com
In reply to: Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly (vignesh C <vignesh21@gmail.com>)
List: pgsql-hackers
Hi, Vignesh!

On Fri, Jun 20, 2025 at 3:42 PM vignesh C <vignesh21@gmail.com> wrote:
> On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > > > > Regarding the assertion failure, I've found that the assert in
> > > > > PhysicalConfirmReceivedLocation() conflicts with the restart_lsn
> > > > > previously set by ReplicationSlotReserveWal(). As I can see,
> > > > > ReplicationSlotReserveWal() just picks the fresh XLogCtl->RedoRecPtr lsn,
> > > > > so there doesn't seem to be a guarantee that restart_lsn never goes
> > > > > backward. The comment in ReplicationSlotReserveWal() even states there
> > > > > is a "chance that we have to retry".
> > > >
> > > > I don't see how this theory can lead to a restart_lsn of a slot going
> > > > backwards. The retry mentioned there is just a retry to reserve the
> > > > slot's position again if the required WAL is already removed. Such a
> > > > retry can only get a position later than the previous restart_lsn.
> > >
> > > We analyzed the assertion failure that happened in pg_basebackup/020_pg_receivewal
> > > and confirmed that restart_lsn can go backward. This means that the Assert()
> > > added by ca307d5 can cause a crash.
> > >
> > > Background
> > > ===========
> > > When pg_receivewal starts replication and uses a replication slot, it sets
> > > the beginning of the segment containing restart_lsn as the startpoint.
> > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL
> > > from 0/B00000.
> > > For more detail on this behavior, see f61e1dd2 and d9bae531.
> > >
> > > What happened here
> > > ==================
> > > Based on the above, the walsender sent WAL from the beginning of the segment
> > > (0/B00000). When the walreceiver received it, it tried to send a reply. At that
> > > time the flushed location of WAL was 0/B00000. The walsender set the received
> > > lsn as restart_lsn in PhysicalConfirmReceivedLocation(). Here the restart_lsn
> > > went backward (0/B000D0 -> 0/B00000).
> > >
> > > The assertion failure could happen if a CHECKPOINT happened at that time.
> > > The slot's last_saved_restart_lsn attribute was 0/B000D0, but data.restart_lsn
> > > was 0/B00000, which could not satisfy the assertion added in
> > > InvalidatePossiblyObsoleteSlot().
> >
> > Thank you for your detailed explanation!
> >
> > > Note
> > > ====
> > > 1.
> > > In this case, starting from the beginning of the segment is not a problem,
> > > because the checkpoint process only removes WAL files from segments that
> > > precede the restart_lsn's WAL segment. The current segment (0/B00000) will
> > > not be removed, so there is no risk of data loss or inconsistency.
> > >
> > > 2.
> > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the
> > > requested streaming position to the start of the segment and report the
> > > received LSN as flushed.
> > >
> > > 3.
> > > I considered the theory above, but I could not reproduce
> > > 040_standby_failover_slots_sync because it is a timing issue. Has anyone
> > > else reproduced it?
> > >
> > > We are still investigating the failure in 040_standby_failover_slots_sync.
> >
> > I didn't manage to reproduce this. But as I can see from the logs [1] on
> > mamba, the START_REPLICATION command was issued just before the assert
> > trap. Could it be something similar to what I described in [2]?
> > Namely:
> > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot.
> > 2. A concurrent checkpoint flushes that restart_lsn to disk.
> > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to
> > the beginning of the segment.
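
The segment-start rounding described in the quoted "Background"/"What happened here"
sections, and step 3 of the scenario just above, can be made concrete with a minimal
standalone C sketch. This is not PostgreSQL source: the segment_start() helper and
the 1 MB wal_segment_size are assumptions chosen so the numbers match the 0/B000D0
example, whereas the server itself uses the XLByteToSeg()/XLogSegmentOffset() macros
rather than this open-coded arithmetic.

    /*
     * Standalone sketch, not PostgreSQL source.  Shows how confirming the
     * segment boundary moves the slot position behind the reserved restart_lsn.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    /* Round an LSN down to the start of its WAL segment. */
    static XLogRecPtr
    segment_start(XLogRecPtr lsn, uint64_t wal_segment_size)
    {
        return lsn - (lsn % wal_segment_size);
    }

    int
    main(void)
    {
        uint64_t    wal_segment_size = 1024 * 1024;  /* assumed 1 MB segments */
        XLogRecPtr  reserved_restart_lsn = 0xB000D0; /* 0/B000D0 */

        /* pg_receivewal asks to stream from the segment boundary ... */
        XLogRecPtr  startpoint = segment_start(reserved_restart_lsn, wal_segment_size);

        /*
         * ... and its first flush report confirms that boundary, which
         * PhysicalConfirmReceivedLocation() then stores as the slot's
         * restart_lsn: the slot position has moved backward.
         */
        printf("restart_lsn 0/%X -> confirmed 0/%X, went backward: %s\n",
               (unsigned int) reserved_restart_lsn,
               (unsigned int) startpoint,
               startpoint < reserved_restart_lsn ? "yes" : "no");
        return 0;
    }

Under these assumptions the sketch prints 0/B000D0 -> 0/B00000, i.e. exactly the
backward move described above.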
> Here is my analysis of the 040_standby_failover_slots_sync test failure,
> where a physical replication slot can end up pointing to a backward LSN:
> In certain rare cases the restart LSN can go backwards. This scenario
> can be reproduced by performing checkpoints continuously and slowing
> down WAL apply. I have a patch with changes to handle this.
>
> Here's a summary of the sequence of events:
>
> 1) The standby confirms a new LSN (0/40369C8) when the primary sends some WAL contents:
> After the standby writes the received WAL contents in XLogWalRcvWrite, the
> standby sends this lsn 0/40369C8 as the confirmed flush location. The
> primary acknowledges this and updates the replication slot's
> restart_lsn accordingly:
> 2025-06-20 14:33:21.777 IST [134998] standby1 LOG:
> PhysicalConfirmReceivedLocation replication slot "sb1_slot" set
> restart_lsn to 0/40369C8
> 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT:
> START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1
>
> 2) Shortly after receiving the new LSN, a checkpoint occurs which
> persists this restart_lsn:
> 2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote
> 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0
> removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync
> files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0
> kB; lsn=0/4036A20, redo lsn=0/40369C8
>
> 3) Streaming replication is restarted because of a primary_conninfo
> change and reload. The WAL receiver process is restarted:
> 25-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver
> process due to administrator command
>
> 4) The standby sends an older flush pointer after the restart:
> Upon restart, the WAL receiver sends a flush location (0/401D448)
> derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older
> than the previously confirmed restart_lsn. It is important to note
> that we are sending lastReplayedEndRecPtr, which is the last
> successfully replayed lsn in this case:
> 5-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain
> LogstreamResult.Flush initialized to 0/401D448
> 2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448
> flush 0/401D448 apply 0/401D448
>
> This is done from here:
> ....
> /* Initialize LogstreamResult and buffers for processing messages */
> LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);
>
> initStringInfo(&reply_message);
>
> /* Initialize nap wakeup times. */
> now = GetCurrentTimestamp();
> for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
>     WalRcvComputeNextWakeup(i, now);
>
> /* Send initial reply/feedback messages. */
> XLogWalRcvSendReply(true, false);
> ...
>
> In step 1 we were sending the lsn of the WAL that had been written; since
> WAL apply has been slowed down, the replay location lags behind, and here
> it is the replay location that is sent.
>
> 5) I have added logs to detect this inconsistency:
> This leads to a scenario where the standby tries to confirm a
> restart_lsn older than the one currently held by the primary:
> 2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario -
> slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older
> than the current one (0/40369C8)
>
> If a checkpoint happens concurrently during this condition, it may
> trigger an assertion failure on the primary due to the restart_lsn
> being less than the last_saved_restart_lsn.
> Currently this does not break physical replication, but I'm not sure
> how it will behave if the gap grows to many WAL files and those WAL
> files get deleted.
> I have attached the patch changes with which you can reproduce this.
> Grep for "crash scenario" in the logs; for me it occurs on every run.
> The reproduced logs are attached.
>
> This proves that the restart_lsn can go backward in cases where the
> standby is applying WAL slowly. But as this has nothing to do with this
> thread, I feel you can commit the assert removal patch. I will continue
> the analysis further to see whether there is any impact, and we can add
> comments accordingly later.
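
A compact way to see why steps 1-5 above end with the assertion tripping is to
replay them with plain integers. The following standalone sketch is not PostgreSQL
source: ToySlot, confirm_received() and checkpoint() are simplified stand-ins for
the slot's in-memory state, for PhysicalConfirmReceivedLocation(), and for the
checkpoint-time save of restart_lsn, and the final check mirrors the idea behind
the assertion in InvalidatePossiblyObsoleteSlot() rather than its exact code.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;

    typedef struct
    {
        XLogRecPtr  restart_lsn;            /* in-memory slot position */
        XLogRecPtr  last_saved_restart_lsn; /* value flushed by the last checkpoint */
    } ToySlot;

    /* Stand-in for PhysicalConfirmReceivedLocation(): trust the standby's
     * reported flush position, even if it is older than what we have. */
    static void
    confirm_received(ToySlot *slot, XLogRecPtr reported_flush)
    {
        slot->restart_lsn = reported_flush;
    }

    /* Stand-in for a checkpoint saving the slot to disk. */
    static void
    checkpoint(ToySlot *slot)
    {
        slot->last_saved_restart_lsn = slot->restart_lsn;
    }

    int
    main(void)
    {
        ToySlot     slot = {0, 0};

        confirm_received(&slot, 0x40369C8); /* 1) standby confirms 0/40369C8 */
        checkpoint(&slot);                  /* 2) checkpoint persists it */

        /*
         * 3)+4) the walreceiver restarts and reports its replay position
         * 0/401D448, which is behind the already-confirmed 0/40369C8,
         * so restart_lsn goes backward.
         */
        confirm_received(&slot, 0x401D448);

        /* 5) the condition the assertion effectively required */
        printf("restart_lsn=0/%X last_saved_restart_lsn=0/%X -> %s\n",
               (unsigned int) slot.restart_lsn,
               (unsigned int) slot.last_saved_restart_lsn,
               slot.restart_lsn >= slot.last_saved_restart_lsn ?
               "assertion would hold" : "assertion would fail");
        return 0;
    }

With the LSNs from the logs above it prints "assertion would fail", matching the
crash window that opens when a checkpoint runs concurrently with the backward
confirmation.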
Thank you and the other thread participants for the analysis! I'd also like to
ask for help with [1]. It seems that the new test triggers another bug, which I
suppose was there before.

Links.
1. https://www.postgresql.org/message-id/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com

------
Regards,
Alexander Korotkov
Supabase