Kyotaro's patch seems good to me and fixes the test case in my patch.
> + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
> +
> /*
> * Remember the prior checkpoint's redo ptr for
> * UpdateCheckPointDistanceEstimate()
> */
> PriorRedoPtr = ControlFile->checkPointCopy.redo;
>
> + Assert (PriorRedoPtr < RedoRecPtr);
Maybe the assignment to PriorRedoPtr does not need to be done while
holding ControlFileLock?
regards.
--
Zhao Rui
Alibaba Cloud: https://www.aliyun.com/
------------------ Original ------------------
From: "Kyotaro Horiguchi" <horikyota.ntt@gmail.com>;
Date: Wed, Mar 16, 2022 09:24 AM
To: "pgsql-hackers"<pgsql-hackers@lists.postgresql.org>;
Cc: "masao.fujii"<masao.fujii@oss.nttdata.com>;
Subject: Possible corruption by CreateRestartPoint at promotion
Hello, (Cc:ed Fujii-san)
This topic diverged from [1] and is summarized as $SUBJECT.
To recap:
While discussing additional LSNs in the checkpoint log message,
Fujii-san pointed out [2] that there is a case where
CreateRestartPoint leaves the database unrecoverable when a concurrent
promotion happens. The next checkpoint "fixes" that corruption, so it
is not severe.
AFAICS, since 9.5 no checkpoint (or restartpoint) runs concurrently
with a restartpoint [3]. So I propose removing that code path, as in
the attached patch.
regards.
[1] https://www.postgresql.org/message-id/20220316.091913.806120467943749797.horikyota.ntt%40gmail.com
[2] https://www.postgresql.org/message-id/7bfad665-db9c-0c2a-2604-9f54763c5f9e%40oss.nttdata.com
[3] https://www.postgresql.org/message-id/20220222.174401.765586897814316743.horikyota.ntt%40gmail.com
--
Kyotaro Horiguchi
NTT Open Source Software Center