Обсуждение: [PATCH] Fix minRecoveryPoint not advanced past checkpoint in CreateRestartPoint

Поиск
Список
Период
Сортировка
Hi hackers,

I ran into this while working on recovery pre-check logic that relies on
pg_controldata to verify whether replay has reached a specific restore point.

Reproducer:

```
  -- on primary:
  CHECKPOINT;
  SELECT pg_create_restore_point('test_rp');

  -- recover with:
  --   recovery_target_name = 'test_rp'
  --   recovery_target_action = 'shutdown'

  -- after recovery shuts down:
  pg_controldata shows minRecoveryPoint 104 bytes behind
  pg_create_restore_point's return value (104 bytes = one
  RESTORE_POINT WAL record).
```

My RCA:

When recovery_target_action=shutdown triggers, the checkpointer performs a
shutdown restartpoint via CreateRestartPoint(). If a CHECKPOINT record was
replayed shortly before the recovery target, CreateRestartPoint advances
minRecoveryPoint to the end of that CHECKPOINT record.

However, any no-op records replayed after the CHECKPOINT (such as
RESTORE_POINT) do not dirty pages, so the lazy minRecoveryPoint update that
normally happens during page flushes never fires for them. As a result,
minRecoveryPoint in pg_control ends up behind the actual replay position.

My Fix:

The attached patch fixes this by reading the current replay position from
shared memory after advancing minRecoveryPoint to the checkpoint end, and
advancing further if replay has progressed past it. This is safe because
CheckPointGuts() has already flushed all dirty buffers and the startup process
has exited, so replayEndRecPtr is stable and all pages are on disk.

-- 
Adam

Вложения

Re: [PATCH] Fix minRecoveryPoint not advanced past checkpoint in CreateRestartPoint

От
Heikki Linnakangas
Дата:
On 01/04/2026 11:53, Adam Lee wrote:
> Hi hackers,
> 
> I ran into this while working on recovery pre-check logic that relies on
> pg_controldata to verify whether replay has reached a specific restore point.
> 
> Reproducer:
> 
> ```
>    -- on primary:
>    CHECKPOINT;
>    SELECT pg_create_restore_point('test_rp');
> 
>    -- recover with:
>    --   recovery_target_name = 'test_rp'
>    --   recovery_target_action = 'shutdown'
> 
>    -- after recovery shuts down:
>    pg_controldata shows minRecoveryPoint 104 bytes behind
>    pg_create_restore_point's return value (104 bytes = one
>    RESTORE_POINT WAL record).
> ```
> 
> My RCA:
> 
> When recovery_target_action=shutdown triggers, the checkpointer performs a
> shutdown restartpoint via CreateRestartPoint(). If a CHECKPOINT record was
> replayed shortly before the recovery target, CreateRestartPoint advances
> minRecoveryPoint to the end of that CHECKPOINT record.
> 
> However, any no-op records replayed after the CHECKPOINT (such as
> RESTORE_POINT) do not dirty pages, so the lazy minRecoveryPoint update that
> normally happens during page flushes never fires for them. As a result,
> minRecoveryPoint in pg_control ends up behind the actual replay position.

Hmm, what exactly does minRecoveryPoint mean? The current behavior is 
correct in the sense that if you restarted recovery, you could still 
stop the recovery at the earlier LSN that's the minRecoveryPoint in the 
control file, and the system would be consistent. I agree it feels 
pretty weird though, it would seem natural to advance minRecoveryPoint 
to the last replayed record on a restartpoint.

> My Fix:
> 
> The attached patch fixes this by reading the current replay position from
> shared memory after advancing minRecoveryPoint to the checkpoint end, and
> advancing further if replay has progressed past it. This is safe because
> CheckPointGuts() has already flushed all dirty buffers and the startup process
> has exited, so replayEndRecPtr is stable and all pages are on disk.

We have this comment earlier in CreateRestartPoint():

>      * We don't explicitly advance minRecoveryPoint when we do create a
>      * restartpoint. It's assumed that flushing the buffers will do that as a
>      * side-effect.

That assumption is not quite right, then.

Perhaps we should simply call UpdateMinRecoveryPoint()? That would cause 
the control file to be flushed twice though, so it's a little 
inefficient, but maybe that's fine.

If we go with your patch, does it make this existing logic below obsolete?

>         if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
>         {
>             if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
>             {
>                 ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
>                 ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
> 
>                 /* update local copy */
>                 LocalMinRecoveryPoint = ControlFile->minRecoveryPoint;
>                 LocalMinRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
>             }
>             if (flags & CHECKPOINT_IS_SHUTDOWN)
>                 ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
>         }

- Heikki




On Wed, Apr 01, 2026 at 12:38:15PM +0300, Heikki Linnakangas wrote:
> > My RCA:
> > 
> > When recovery_target_action=shutdown triggers, the checkpointer performs a
> > shutdown restartpoint via CreateRestartPoint(). If a CHECKPOINT record was
> > replayed shortly before the recovery target, CreateRestartPoint advances
> > minRecoveryPoint to the end of that CHECKPOINT record.
> > 
> > However, any no-op records replayed after the CHECKPOINT (such as
> > RESTORE_POINT) do not dirty pages, so the lazy minRecoveryPoint update that
> > normally happens during page flushes never fires for them. As a result,
> > minRecoveryPoint in pg_control ends up behind the actual replay position.
> 
> Hmm, what exactly does minRecoveryPoint mean? The current behavior is
> correct in the sense that if you restarted recovery, you could still stop
> the recovery at the earlier LSN that's the minRecoveryPoint in the control
> file, and the system would be consistent. I agree it feels pretty weird
> though, it would seem natural to advance minRecoveryPoint to the last
> replayed record on a restartpoint.

Yes, the system is consistent either way. But for the shutdown action,
it would be natural to advance minRecoveryPoint to the last replayed
record, same as the pause and promote actions do, whose startup process
don't exit and have the chance calling UpdateMinRecoveryPoint().

> Perhaps we should simply call UpdateMinRecoveryPoint()? That would cause the
> control file to be flushed twice though, so it's a little inefficient, but
> maybe that's fine.

And calling UpdateMinRecoveryPoint() still needs to explain why later
the codes need to ensure minRecoveryPoint is past the checkpoint record,
to me it doesn't make things simpler, but I'm OK either way.

> If we go with your patch, does it make this existing logic below obsolete?
> 
> >         if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
> >         {
> >             if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
> >             {
> >                 ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
> >                 ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
> > 
> >                 /* update local copy */
> >                 LocalMinRecoveryPoint = ControlFile->minRecoveryPoint;
> >                 LocalMinRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
> >             }
> >             if (flags & CHECKPOINT_IS_SHUTDOWN)
> >                 ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
> >         }

Indeed, no need to check lastCheckPointEndPtr, replayEndRecPtr is always >= lastCheckPointEndPtr

-- 
Adam



PATCH v2 removed the variable lastCheckPointEndPtr and refined the comments.

Thanks for reviewing.

-- 
Adam

Вложения

Re: [PATCH] Fix minRecoveryPoint not advanced past checkpoint in CreateRestartPoint

От
Michael Paquier
Дата:
On Fri, Apr 03, 2026 at 01:10:08PM +0800, Adam Lee wrote:
> PATCH v2 removed the variable lastCheckPointEndPtr and refined the comments.
>
> Thanks for reviewing.

TBH, I am not convinced that this optimization in the control file is
worth it.  minRecoveryPoint refers to a state where the on-disk pages
are all consistent based on their stored LSNs, see also the
cross-check that we do at the end of recovery in the event of
inconsistent pages.  In most cases (say in the 99%-ish range), we will
have page flushes, making it non-relevant.

With time, I have also learnt the hard way that the less code paths
that update minRecoveryPoint in the control file, as well as the local
copies, the better.  Simplifying this code is something we should try
to work on.  Complicating it more has less value.

If we decide that this optimization is worth having, I am going to
request a TAP test to validate the behavior you'd expect out of it.
--
Michael

Вложения
Thanks for reviewing!

On Wed, Apr 08, 2026 at 11:41:32AM +0900, Michael Paquier wrote:
> TBH, I am not convinced that this optimization in the control file is
> worth it.  minRecoveryPoint refers to a state where the on-disk pages
> are all consistent based on their stored LSNs, see also the
> cross-check that we do at the end of recovery in the event of
> inconsistent pages.  In most cases (say in the 99%-ish range), we will
> have page flushes, making it non-relevant.

I understand your point about minRecoveryPoint being primarily about on-disk
page consistency.

However, this value is also exposed via pg_controldata and used by external
tools - for example, pg_rewind reads it to determine where to start replaying
WAL. Third-party backup/recovery tools (like the project I'm working on) may
rely on it to verify that recovery has actually reached a specific restore
point. When the value is behind the actual replay position, these tools can
draw incorrect conclusions.

> With time, I have also learnt the hard way that the less code paths
> that update minRecoveryPoint in the control file, as well as the local
> copies, the better.  Simplifying this code is something we should try
> to work on.  Complicating it more has less value.

Agree. Note that this patch actually removes a code path rather than adding one
- the previous lastCheckPointEndPtr update is subsumed by the GetCurrentReplayRecPtr()
call, the complexity is roughly the same in terms of code paths and logic (I think).

> If we decide that this optimization is worth having, I am going to
> request a TAP test to validate the behavior you'd expect out of it.

PATCH v3 attached, which includes a TAP test covering multiple scenarios:
shutdown with and without a preceding CHECKPOINT, as well as promote and
pause actions for completeness. The test verifies that minRecoveryPoint
reaches at least the restore point LSN in each case.

--
Adam

Вложения