Re: Corruption during WAL replay

From Kyotaro Horiguchi
Subject Re: Corruption during WAL replay
Date
Msg-id 20210927.173036.115679524750553023.horikyota.ntt@gmail.com
In response to Re: Corruption during WAL replay  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Corruption during WAL replay  (Daniel Shelepanov <deniel1495@mail.ru>)
List pgsql-hackers
Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in 
> On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> > The patch assumed that CHKPT_START/COMPLETE barrier are exclusively
> > used each other, but MarkBufferDirtyHint which delays checkpoint start
> > is called in RelationTruncate while delaying checkpoint completion.
> > That is not a strange nor harmful behavior.  I changed delayChkpt to a
> > bitmap integer from an enum so that both barrier are separately
> > triggered.
> >
> > I'm not sure this is the way to go here, though.  This fixes the issue
> > of a crash during RelationTruncate, but the issue of smgrtruncate
> > failure during RelationTruncate still remains (unless we treat that
> > failure as PANIC?).
> 
> I like this patch. As I understand it, we're currently cheating by
> allowing checkpoints to complete without necessarily flushing all of
> the pages that were dirty at the time we fixed the redo pointer out to
> disk. We think this is OK because we know that those pages are going
> to get truncated away, but it's not really OK because when the system
> starts up, it has to replay WAL starting from the checkpoint's redo
> pointer, but the state of the page is not the same as it was at the
> time when the redo pointer was the end of WAL, so redo fails. In the
> case described in
> http://postgr.es/m/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
> modifications are made to the page before the redo pointer is fixed
> and those changes never make it to disk, but the truncation also never
> makes it to the disk either. With this patch, that can't happen,
> because no checkpoint can intervene between when we (1) decide we're
> not going to bother writing those dirty pages and (2) actually
> truncate them away. So either the pages will get written as part of
> the checkpoint, or else they'll be gone before the checkpoint
> completes. In the latter case, I suppose redo that would have modified
> those pages will just be skipped, thus dodging the problem.

I think your understanding is right.
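
To make the two barriers concrete, here is a toy model of the check the
checkpointer performs. It is only a sketch: the flag values are the ones the
attached patch defines, but ToyProc and any_proc_delaying() are hypothetical
stand-ins for PGPROC and for the GetVirtualXIDsDelayingChkpt() /
HaveVirtualXIDsDelayingChkpt() pair, which really work on virtual transaction
IDs rather than on a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag values as defined by the patch in src/include/storage/proc.h. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for PGPROC; only the field under discussion. */
typedef struct ToyProc
{
    int         delayChkpt;
} ToyProc;

/*
 * Hypothetical helper: returns true if any backend currently holds the
 * given barrier bit.  A checkpoint would keep sleeping while this returns
 * true for the barrier it is about to pass.
 */
bool
any_proc_delaying(const ToyProc *procs, size_t nprocs, int type)
{
    assert(type != 0);

    for (size_t i = 0; i < nprocs; i++)
    {
        if ((procs[i].delayChkpt & type) != 0)
            return true;
    }
    return false;
}
```

Because each barrier has its own bit, a backend that holds
DELAY_CHKPT_COMPLETE keeps blocking checkpoint completion even if it sets and
clears DELAY_CHKPT_START in the meantime.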

> In RelationTruncate, I suggest that we ought to clear the
> delay-checkpoint flag before rather than after calling
> FreeSpaceMapVacuumRange. Since the free space map is not fully
> WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

> think it might be better if MarkBufferDirtyHint stays closer to the
> existing coding and just uses a Boolean and an if-test to decide
> whether to clear the bit, instead of inventing a new mechanism. I
> don't really see anything wrong with the new mechanism, but I think
> it's better to keep the patch minimal.

Yeah, that was kind of silly. Fixed.
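
For readers skimming the thread, the Boolean-plus-if-test shape looks roughly
like the following. This is a simplified sketch, not the real function:
ToyProc, toy_mark_hint() and the need_wal parameter are hypothetical, and all
buffer handling is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined by the patch in src/include/storage/proc.h. */
#define DELAY_CHKPT_START     (1 << 0)
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for PGPROC; only the field under discussion. */
typedef struct ToyProc
{
    int         delayChkpt;
} ToyProc;

/*
 * Hypothetical miniature of the MarkBufferDirtyHint flow: set the START
 * bit only when a WAL record is needed, remember that in a local Boolean,
 * and clear the bit only if this call set it.
 */
void
toy_mark_hint(ToyProc *proc, bool need_wal)
{
    bool        delayChkpt = false;

    if (need_wal)
    {
        assert((proc->delayChkpt & DELAY_CHKPT_START) == 0);
        proc->delayChkpt |= DELAY_CHKPT_START;
        delayChkpt = true;
        /* ... the hint-bit WAL record would be written here ... */
    }

    /* ... the buffer would be marked dirty here ... */

    if (delayChkpt)
        proc->delayChkpt &= ~DELAY_CHKPT_START;
}
```

Since only the START bit is touched, a caller that already holds
DELAY_CHKPT_COMPLETE (as RelationTruncate does with this patch) still holds it
after the call returns.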

> As you say, this doesn't fix the problem that truncation might fail.
> But as Andres and Sawada-san said, the solution to that is to get rid
> of the comments saying that it's OK for truncation to fail and make it
> a PANIC. However, I don't think that change needs to be part of this
> patch. Even if we do that, we still need to do this. And even if we do
> this, we still need to do that.

OK. In addition to the above, I rewrote the comment in RelationTruncate.

+     * Delay the concurrent checkpoint's completion until this truncation
+     * successfully completes, so that we don't establish a redo-point between
+     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+     * file content against the WAL records after the REDO position and future
+     * recovery fails.
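
The ordering that comment describes can be sketched as follows. It is an
illustration only: ToyProc, ToyTrace and toy_relation_truncate() are
hypothetical, and the three steps are placeholders for the WAL insert, the
smgrtruncate() call and FreeSpaceMapVacuumRange() in the real function.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag value as defined by the patch in src/include/storage/proc.h. */
#define DELAY_CHKPT_COMPLETE  (1 << 1)

/* Hypothetical stand-in for PGPROC; only the field under discussion. */
typedef struct ToyProc
{
    int         delayChkpt;
} ToyProc;

/* Hypothetical record of whether the barrier was held at each step. */
typedef struct ToyTrace
{
    bool        held_at_wal_log;
    bool        held_at_truncate;
    bool        held_at_fsm_vacuum;
} ToyTrace;

ToyTrace
toy_relation_truncate(ToyProc *proc)
{
    ToyTrace    trace;

    /* Block checkpoint completion for the whole critical window. */
    assert((proc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
    proc->delayChkpt |= DELAY_CHKPT_COMPLETE;

    /* Step 1: the truncation would be WAL-logged here. */
    trace.held_at_wal_log = (proc->delayChkpt & DELAY_CHKPT_COMPLETE) != 0;

    /* Step 2: buffers would be dropped and the file truncated here. */
    trace.held_at_truncate = (proc->delayChkpt & DELAY_CHKPT_COMPLETE) != 0;

    /* The window ends before the non-critical FSM work. */
    proc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;

    /* Step 3: upper-level FSM pages would be updated here. */
    trace.held_at_fsm_vacuum = (proc->delayChkpt & DELAY_CHKPT_COMPLETE) != 0;

    return trace;
}
```

The point of the sketch is that the barrier covers WAL-logging and the actual
truncation, but is released before the FSM update.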

However, my difficulty for now is that I cannot reproduce the original
issue.

To avoid further confusion, the attachment is named *.patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index e6c70ed0bc..17357179e3 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -3075,8 +3075,8 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
      * crash/basebackup, even though the state of the data directory would
      * require it.
      */
-    Assert(!MyProc->delayChkpt);
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /* WAL log truncation */
     WriteMTruncateXlogRec(newOldestMultiDB,
@@ -3102,7 +3102,7 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB)
     /* Then offsets */
     PerformOffsetsTruncation(oldestMulti, newOldestMulti);
 
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
     LWLockRelease(MultiXactTruncationLock);
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 2156de187c..b7dc84d6e3 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -463,7 +463,7 @@ MarkAsPreparingGuts(GlobalTransaction gxact, TransactionId xid, const char *gid,
     proc->lxid = (LocalTransactionId) xid;
     proc->xid = xid;
     Assert(proc->xmin == InvalidTransactionId);
-    proc->delayChkpt = false;
+    proc->delayChkpt = 0;
     proc->statusFlags = 0;
     proc->pid = 0;
     proc->backendId = InvalidBackendId;
@@ -1109,7 +1109,8 @@ EndPrepare(GlobalTransaction gxact)
 
     START_CRIT_SECTION();
 
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     XLogBeginInsert();
     for (record = records.head; record != NULL; record = record->next)
@@ -1152,7 +1153,7 @@ EndPrepare(GlobalTransaction gxact)
      * checkpoint starting after this will certainly see the gxact as a
      * candidate for fsyncing.
      */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     /*
      * Remember that we have this GlobalTransaction entry locked for us.  If
@@ -2215,7 +2216,8 @@ RecordTransactionCommitPrepared(TransactionId xid,
     START_CRIT_SECTION();
 
     /* See notes in RecordTransactionCommit */
-    MyProc->delayChkpt = true;
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_START;
 
     /*
      * Emit the XLOG commit record. Note that we mark 2PC commits as
@@ -2263,7 +2265,7 @@ RecordTransactionCommitPrepared(TransactionId xid,
     TransactionIdCommitTree(xid, nchildren, children);
 
     /* Checkpoint can proceed now */
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
     END_CRIT_SECTION();
 
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 6597ec45a9..4a1a0c3c1f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1334,8 +1334,9 @@ RecordTransactionCommit(void)
          * This makes checkpoint's determination of which xacts are delayChkpt
          * a bit fuzzy, but it doesn't matter.
          */
+        Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
         START_CRIT_SECTION();
-        MyProc->delayChkpt = true;
+        MyProc->delayChkpt |= DELAY_CHKPT_START;
 
         SetCurrentTransactionStopTimestamp();
 
@@ -1436,7 +1437,7 @@ RecordTransactionCommit(void)
      */
     if (markXidCommitted)
     {
-        MyProc->delayChkpt = false;
+        MyProc->delayChkpt &= ~DELAY_CHKPT_START;
         END_CRIT_SECTION();
     }
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e51a7a749d..a4d564323a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9153,18 +9153,30 @@ CreateCheckPoint(int flags)
      * and we will correctly flush the update below.  So we cannot miss any
      * xacts we need to wait for.
      */
-    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_START);
     if (nvxids > 0)
     {
         do
         {
             pg_usleep(10000L);    /* wait for 10 msec */
-        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_START));
     }
     pfree(vxids);
 
     CheckPointGuts(checkPoint.redo, flags);
 
+    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
+    if (nvxids > 0)
+    {
+        do
+        {
+            pg_usleep(10000L);    /* wait for 10 msec */
+        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids,
+                                              DELAY_CHKPT_COMPLETE));
+    }
+    pfree(vxids);
+
     /*
      * Take a snapshot of running transactions and write this to WAL. This
      * allows us to reconstruct the state of running transactions during
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index b492c656d7..f7a1f981d5 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -978,7 +978,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
     /*
      * Ensure no checkpoint can change our view of RedoRecPtr.
      */
-    Assert(MyProc->delayChkpt);
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_START) != 0);
 
     /*
      * Update RedoRecPtr so that we can make the right decision
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c5ad28d71f..be9c0e107f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -325,6 +325,16 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
     RelationPreTruncate(rel);
 
+    /*
+     * Delay the concurrent checkpoint's completion until this truncation
+     * successfully completes, so that we don't establish a redo-point between
+     * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+     * file content against the WAL records after the REDO position and future
+     * recovery fails.
+     */
+    Assert((MyProc->delayChkpt & DELAY_CHKPT_COMPLETE) == 0);
+    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;
+
     /*
      * We WAL-log the truncation before actually truncating, which means
      * trouble if the truncation fails. If we then crash, the WAL replay
@@ -366,6 +376,10 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
     /* Do the real work to truncate relation forks */
     smgrtruncate(RelationGetSmgr(rel), forks, nforks, blocks);
 
+
+    /* FSM is not WAL-logged, finish the critical section here. */
+    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;
+
     /*
      * Update upper-level FSM pages to account for the truncation. This is
      * important because the just-truncated pages were likely marked as
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e88e4e918b..c277dc3e1e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3921,7 +3921,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
              * essential that CreateCheckpoint waits for virtual transactions
              * rather than full transactionids.
              */
-            MyProc->delayChkpt = delayChkpt = true;
+            Assert((MyProc->delayChkpt & DELAY_CHKPT_START) == 0);
+            MyProc->delayChkpt |= DELAY_CHKPT_START;
+            delayChkpt = true;
             lsn = XLogSaveBufferForHint(buffer, buffer_std);
         }
 
@@ -3954,7 +3956,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
         UnlockBufHdr(bufHdr, buf_state);
 
         if (delayChkpt)
-            MyProc->delayChkpt = false;
+            MyProc->delayChkpt &= ~DELAY_CHKPT_START;
 
         if (dirtied)
         {
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index bd3c7a47fe..1bc4ea15e9 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -689,7 +689,10 @@ ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
 
         proc->lxid = InvalidLocalTransactionId;
         proc->xmin = InvalidTransactionId;
-        proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+        /* be sure this is cleared in abort */
+        proc->delayChkpt = 0;
+
         proc->recoveryConflictPending = false;
 
         /* must be cleared with xid/xmin: */
@@ -728,7 +731,10 @@ ProcArrayEndTransactionInternal(PGPROC *proc, TransactionId latestXid)
     proc->xid = InvalidTransactionId;
     proc->lxid = InvalidLocalTransactionId;
     proc->xmin = InvalidTransactionId;
-    proc->delayChkpt = false;    /* be sure this is cleared in abort */
+
+    /* be sure this is cleared in abort */
+    proc->delayChkpt = 0;
+
     proc->recoveryConflictPending = false;
 
     /* must be cleared with xid/xmin: */
@@ -3026,7 +3032,8 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * delaying checkpoint because they have critical actions in progress.
  *
  * Constructs an array of VXIDs of transactions that are currently in commit
- * critical sections, as shown by having delayChkpt set in their PGPROC.
+ * critical sections, as shown by having delayChkpt set to the specified value
+ * in their PGPROC.
  *
  * Returns a palloc'd array that should be freed by the caller.
  * *nvxids is the number of valid entries.
@@ -3040,13 +3047,15 @@ GetOldestSafeDecodingTransactionId(bool catalogOnly)
  * for clearing of delayChkpt to propagate is unimportant for correctness.
  */
 VirtualTransactionId *
-GetVirtualXIDsDelayingChkpt(int *nvxids)
+GetVirtualXIDsDelayingChkpt(int *nvxids, int type)
 {
     VirtualTransactionId *vxids;
     ProcArrayStruct *arrayP = procArray;
     int            count = 0;
     int            index;
 
+    Assert(type != 0);
+
     /* allocate what's certainly enough result space */
     vxids = (VirtualTransactionId *)
         palloc(sizeof(VirtualTransactionId) * arrayP->maxProcs);
@@ -3058,7 +3067,7 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
         int            pgprocno = arrayP->pgprocnos[index];
         PGPROC       *proc = &allProcs[pgprocno];
 
-        if (proc->delayChkpt)
+        if ((proc->delayChkpt & type) != 0)
         {
             VirtualTransactionId vxid;
 
@@ -3084,12 +3093,14 @@ GetVirtualXIDsDelayingChkpt(int *nvxids)
  * those numbers should be small enough for it not to be a problem.
  */
 bool
-HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
+HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids, int type)
 {
     bool        result = false;
     ProcArrayStruct *arrayP = procArray;
     int            index;
 
+    Assert(type != 0);
+
     LWLockAcquire(ProcArrayLock, LW_SHARED);
 
     for (index = 0; index < arrayP->numProcs; index++)
@@ -3100,7 +3111,8 @@ HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids)
 
         GET_VXID_FROM_PGPROC(vxid, *proc);
 
-        if (proc->delayChkpt && VirtualTransactionIdIsValid(vxid))
+        if ((proc->delayChkpt & type) != 0 &&
+            VirtualTransactionIdIsValid(vxid))
         {
             int            i;
 
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index b7d9da0aa9..95fdf990e7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -394,7 +394,7 @@ InitProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     /* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
     if (IsAutoVacuumWorkerProcess())
@@ -579,7 +579,7 @@ InitAuxiliaryProcess(void)
     MyProc->roleId = InvalidOid;
     MyProc->tempNamespaceId = InvalidOid;
     MyProc->isBackgroundWorker = IsBackgroundWorker;
-    MyProc->delayChkpt = false;
+    MyProc->delayChkpt = 0;
     MyProc->statusFlags = 0;
     MyProc->lwWaiting = false;
     MyProc->lwWaitMode = 0;
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index be67d8a861..b9be2454c5 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -79,6 +79,10 @@ struct XidCache
  */
 #define INVALID_PGPROCNO        PG_INT32_MAX
 
+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START        (1<<0)
+#define DELAY_CHKPT_COMPLETE    (1<<1)
+
 typedef enum
 {
     PROC_WAIT_STATUS_OK,
@@ -184,7 +188,8 @@ struct PGPROC
     pg_atomic_uint64 waitStart; /* time at which wait for lock acquisition
                                  * started */
 
-    bool        delayChkpt;        /* true if this proc delays checkpoint start */
+    int            delayChkpt;        /* if this proc delays checkpoint start and/or
+                                 * completion.  */
 
     uint8        statusFlags;    /* this backend's status flags, see PROC_*
                                  * above. mirrored in
diff --git a/src/include/storage/procarray.h b/src/include/storage/procarray.h
index b01fa52139..ec40130466 100644
--- a/src/include/storage/procarray.h
+++ b/src/include/storage/procarray.h
@@ -15,11 +15,11 @@
 #define PROCARRAY_H
 
 #include "storage/lock.h"
+#include "storage/proc.h"
 #include "storage/standby.h"
 #include "utils/relcache.h"
 #include "utils/snapshot.h"
 
-
 extern Size ProcArrayShmemSize(void);
 extern void CreateSharedProcArray(void);
 extern void ProcArrayAdd(PGPROC *proc);
@@ -59,8 +59,9 @@ extern TransactionId GetOldestActiveTransactionId(void);
 extern TransactionId GetOldestSafeDecodingTransactionId(bool catalogOnly);
 extern void GetReplicationHorizons(TransactionId *slot_xmin, TransactionId *catalog_xmin);
 
-extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids);
-extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids, int nvxids);
+extern VirtualTransactionId *GetVirtualXIDsDelayingChkpt(int *nvxids, int type);
+extern bool HaveVirtualXIDsDelayingChkpt(VirtualTransactionId *vxids,
+                                         int nvxids, int type);
 
 extern PGPROC *BackendPidGetProc(int pid);
 extern PGPROC *BackendPidGetProcWithLock(int pid);
