Re: prevent immature WAL streaming

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: prevent immature WAL streaming
Дата
Msg-id 20210902.092415.728659185237934198.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: prevent immature WAL streaming  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Список pgsql-hackers
At Wed, 1 Sep 2021 15:01:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/01 12:15, Andres Freund wrote:
> > Hi,
> > On 2021-09-01 11:34:34 +0900, Fujii Masao wrote:
> >> On 2021/09/01 0:53, Andres Freund wrote:
> >>> Of course, we need to be careful to not weaken WAL validity checking
> >>> too
> >>> much. How about the following:
> >>>
> >>> If we're "aborting" a continued record, we set
> >>> XLP_FIRST_IS_ABORTED_PARTIAL on
> >>> the page at which we do so (i.e. the page after the valid end of the
> >>> WAL).
> >>
> >> When do you expect that XLP_FIRST_IS_ABORTED_PARTIAL is set? It's set
> >> when recovery finds a a partially-flushed segment-spanning record?
> >> But maybe we cannot do that (i.e., cannot overwrite the page) because
> >> the page that the flag is set in might have already been archived. No?
> > I was imagining that XLP_FIRST_IS_ABORTED_PARTIAL would be set in the
> > "tail
> > end" of a partial record. I.e. if there's a partial record starting in
> > the
> > successfully archived segment A, but the end of the record, in B, has
> > not been
> > written to disk before a crash, we'd set XLP_FIRST_IS_ABORTED_PARTIAL
> > at the
> > end of the valid data in B. Which could not have been archived yet, or
> > we'd
> > not have a partial record.  So we should never need to set the flag on
> > an
> > already archived page.
> 
> Thanks for clarifying that! Unless I misunderstand that, when recovery
> ends
> at a partially-flushed segment-spanning record, it sets
> XLP_FIRST_IS_ABORTED_PARTIAL in the next segment before starting
> writing
> new WAL data there. Therefore XLP_FIRST_IS_CONTRECORD or
> XLP_FIRST_IS_ABORTED_PARTIAL must be set in the next segment following
> a partially-flushed segment-spanning record. If none of them is set,
> the validation code in recovery should report an error.
> 
> Yes, this design looks good to me.

So, this is a crude PoC of that.

At the end of recovery:

- When XLogReadRecord misses a page where the next part of the current
  continuation record should be seen, xlogreader->ContRecAbortPtr is
  set to the beginning of the missing page.

- When StartupXLOG receives a valid ContRecAbortPtr, the value is used
  as the next WAL insertion location then sets the same to
  XLogCtl->contAbortedRecPtr.

- When XLogCtl->contAbortedRecPtr is set, AdvanceXLInsertBuffer()
  (called under XLogInsertRecord()) sets XLP_FIRST_IS_ABORTED_PARTIAL
  flag to the page.

While recovery:
- When XLogReadRecord meets a XLP_FIRST_IS_ABORT_PARTIAL page, it
  rereads a record from there.

In this PoC,

1. This patch is written on the current master, but it doesn't
  interfare with the seg-boundary-memorize patch since it removes the
  calls to RegisterSegmentBoundary.

2. Since xlogreader cannot emit a log-message immediately, we don't
  have a means to leave a log message to inform recovery met an
  aborted partial continuation record. (In this PoC, it is done by
  fprintf:p)

3. Myebe we need to pg_waldump to show partial continuation records,
  but I'm not sure how to realize that.

4. By the way, is this (method) applicable in this stage?


The attached first is the PoC including debug-crash aid.  The second
is the repro script.  It failes to reproduce the situation once per
several trials.

The following log messages are shown by a run of the script.

> ...
> TRAP: FailedAssertion("c++ < 1", File: "xlog.c", Line: 2675, PID: 254644)
> ...
> LOG:  database system is shut down
> ...
> 
> LOG:  redo starts at 0/2000028
> LOG:  redo done at 0/6FFFFA8 system usage:...
> LOG:  #### Recovery finished: ContRecAbort: 0/7000000 (EndRecPtr: 0/6FFFFE8)

The record from 6FFFFE8 is missing the trailing part after 7000000.

> LOG:  #### EndOfLog=0/7000000
> LOG:  #### set XLP_FIRST_IS_ABORT_PARTIAL@0/7000000

So, WAL insertion starts from 7000000 and the first page is set the flag.

> LOG:  database system is ready to accept connections
> ...
> LOG:  database system is shut down
> ...
> #########################
> ...
> LOG:  redo starts at 0/2000028
> LOG:  consistent recovery state reached at 0/2000100
> ...
> LOG:  restored log file "000000010000000000000007" from archive
> #### aborted partial continuation record found at 0/6FFFFE8, continue from 0/7000000

The record from 6FFFFE8 is immature so skip it and continue reading
from 7000000.

> LOG:  last completed transaction was at log time 2021-09-01 20:40:21.775295+09
> LOG:  #### Recovery finished: ContRecAbort: 0/0 (EndRecPtr: 0/8000000)
> LOG:  restored log file "000000010000000000000007" from archive
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  #### EndOfLog=0/8000000

Recovery ends.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24165ab03e..b0f18e4e5e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -111,6 +111,7 @@ int            CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int            wal_retrieve_retry_interval = 5000;
 int            max_slot_wal_keep_size_mb = -1;
 bool        track_wal_io_timing = false;
+bool        contrec_aborted = false;
 
 #ifdef WAL_DEBUG
 bool        XLOG_DEBUG = false;
@@ -586,6 +587,7 @@ typedef struct XLogCtlData
     XLogRecPtr    replicationSlotMinLSN;    /* oldest LSN needed by any slot */
 
     XLogSegNo    lastRemovedSegNo;    /* latest removed/recycled XLOG segment */
+    XLogRecPtr    contAbortedRecPtr;
 
     /* Fake LSN counter, for unlogged relations. Protected by ulsn_lck. */
     XLogRecPtr    unloggedLSN;
@@ -735,6 +737,10 @@ typedef struct XLogCtlData
     XLogSegNo    latestSegBoundary;
     XLogRecPtr    latestSegBoundaryEndPtr;
 
+    /* BEGIN: FOR DEBUGGING-CRASH USE*/
+    bool        crossseg;
+    /* END: DEBUGGING-CRASH USE*/
+
     slock_t        segtrack_lck;    /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -860,6 +866,7 @@ static XLogSource XLogReceiptSource = XLOG_FROM_ANY;
 /* State information for XLOG reading */
 static XLogRecPtr ReadRecPtr;    /* start of last record read */
 static XLogRecPtr EndRecPtr;    /* end+1 of last record read */
+static XLogRecPtr ContRecAbortPtr;    /* end+1 of last aborted contrec */
 
 /*
  * Local copies of equivalent fields in the control file.  When running
@@ -1178,16 +1185,10 @@ XLogInsertRecord(XLogRecData *rdata,
         XLByteToSeg(StartPos, StartSeg, wal_segment_size);
         XLByteToSeg(EndPos, EndSeg, wal_segment_size);
 
-        /*
-         * Register our crossing the segment boundary if that occurred.
-         *
-         * Note that we did not use XLByteToPrevSeg() for determining the
-         * ending segment.  This is so that a record that fits perfectly into
-         * the end of the segment causes the latter to get marked ready for
-         * archival immediately.
-         */
-        if (StartSeg != EndSeg && XLogArchivingActive())
-            RegisterSegmentBoundary(EndSeg, EndPos);
+        /* BEGIN: FOR DEBUGGING-CRASH USE */
+        if (StartSeg != EndSeg)
+            XLogCtl->crossseg = true;
+        /* END: FOR DEBUGGING-CRASH USE */
 
         /*
          * Advance LogwrtRqst.Write so that it includes new block(s).
@@ -2292,6 +2293,27 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
         if (!Insert->forcePageWrites)
             NewPage->xlp_info |= XLP_BKP_REMOVABLE;
 
+        /*
+         * If the last page ended with an aborted partial continuation record,
+         * mark it to tell the parital record is omittable. Snice this happens
+         * only at the end of crash recovery, no rece condition here.
+         */
+        if (XLogCtl->contAbortedRecPtr >= NewPageBeginPtr)
+        {
+            if (XLogCtl->contAbortedRecPtr == NewPageBeginPtr)
+            {
+                NewPage->xlp_info |= XLP_FIRST_IS_ABORT_PARTIAL;
+                elog(LOG, "#### set XLP_FIRST_IS_ABORT_PARTIAL@%X/%X",
+                     LSN_FORMAT_ARGS(NewPageBeginPtr));
+            }
+            else
+                elog(LOG, "### incosistent abort location %X/%X, expected %X/%X",
+                     LSN_FORMAT_ARGS(XLogCtl->contAbortedRecPtr),
+                     LSN_FORMAT_ARGS(NewPageBeginPtr));
+                     
+            XLogCtl->contAbortedRecPtr = InvalidXLogRecPtr;
+        }
+            
         /*
          * If first page of an XLOG segment file, make it a long header.
          */
@@ -2644,6 +2666,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
             {
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
+                /* BEGIN: FOR DEBUGGING-CRASH USE */
+                if (XLogCtl->crossseg)
+                {
+                    static int c = 0;
+                    struct stat b;
+
+                    if (stat("/tmp/hoge", &b) == 0)
+                        Assert (c++ < 1);
+                }
+                /* END: FOR DEBUGGING-CRASH USE */
+
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
@@ -4568,6 +4601,7 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
         record = XLogReadRecord(xlogreader, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
+        ContRecAbortPtr = xlogreader->ContRecAbortPtr;
         if (record == NULL)
         {
             if (readFile >= 0)
@@ -7873,12 +7907,26 @@ StartupXLOG(void)
     StandbyMode = false;
 
     /*
-     * Re-fetch the last valid or last applied record, so we can identify the
-     * exact endpoint of what we consider the valid portion of WAL.
+     * The last record may be an immature continuation record at the end of a
+     * page. We continue writing from ContRecAbortPtr instead of EndRecPtr that
+     * case.
      */
-    XLogBeginRead(xlogreader, LastRec);
-    record = ReadRecord(xlogreader, PANIC, false);
-    EndOfLog = EndRecPtr;
+    elog(LOG, "#### Recovery finished: ContRecAbort: %X/%X (EndRecPtr: %X/%X)", LSN_FORMAT_ARGS(ContRecAbortPtr),
LSN_FORMAT_ARGS(EndRecPtr));
+    if (XLogRecPtrIsInvalid(ContRecAbortPtr))
+    {
+        /*
+         * Re-fetch the last valid or last applied record, so we can identify
+         * the exact endpoint of what we consider the valid portion of WAL.
+         */
+        XLogBeginRead(xlogreader, LastRec);
+        record = ReadRecord(xlogreader, PANIC, false);
+        EndOfLog = EndRecPtr;
+    }
+    else
+    {
+        EndOfLog = ContRecAbortPtr;
+        XLogCtl->contAbortedRecPtr = ContRecAbortPtr;
+    }
 
     /*
      * EndOfLogTLI is the TLI in the filename of the XLOG segment containing
@@ -8013,7 +8061,8 @@ StartupXLOG(void)
     Insert = &XLogCtl->Insert;
     Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
     Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
-
+    elog(LOG, "#### EndOfLog=%X/%X", LSN_FORMAT_ARGS(EndOfLog));
+    
     /*
      * Tricky point here: readBuf contains the *last* block that the LastRec
      * record spans, not the one it starts in.  The last block is indeed the
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5cf74e181a..404db7ce4d 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -294,6 +294,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
 
     ResetDecoder(state);
 
+    state->ContRecAbortPtr = InvalidXLogRecPtr;
     RecPtr = state->EndRecPtr;
 
     if (state->ReadRecPtr != InvalidXLogRecPtr)
@@ -319,6 +320,7 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
         randAccess = true;
     }
 
+retry:
     state->currRecPtr = RecPtr;
 
     targetPagePtr = RecPtr - (RecPtr % XLOG_BLCKSZ);
@@ -444,12 +446,27 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
                                            XLOG_BLCKSZ));
 
             if (readOff < 0)
-                goto err;
+                goto err_partial_contrec;
 
             Assert(SizeOfXLogShortPHD <= readOff);
 
             /* Check that the continuation on next page looks valid */
             pageHeader = (XLogPageHeader) state->readBuf;
+            if (pageHeader->xlp_info & XLP_FIRST_IS_ABORT_PARTIAL)
+            {
+                if (pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD)
+                {
+                    report_invalid_record(state,
+                                          "both XLP_FIRST_IS_CONTRECORD and XLP_FIRST_IS_ABORT_PARTIAL are set at
%X/%X",
+                                          LSN_FORMAT_ARGS(RecPtr));
+                    goto err;
+                }
+                    
+                fprintf(stderr, "#### aborted partial continuation record found at %X/%X, continue from %X/%X\n",
LSN_FORMAT_ARGS(RecPtr),LSN_FORMAT_ARGS(targetPagePtr));
 
+                ResetDecoder(state);
+                RecPtr = targetPagePtr;
+                goto retry;
+            }                
             if (!(pageHeader->xlp_info & XLP_FIRST_IS_CONTRECORD))
             {
                 report_invalid_record(state,
@@ -550,6 +567,10 @@ XLogReadRecord(XLogReaderState *state, char **errormsg)
     else
         return NULL;
 
+err_partial_contrec:
+    state->ContRecAbortPtr = targetPagePtr;
+        fprintf(stderr, "contrec aborted@%X/%X\n", LSN_FORMAT_ARGS(state->ContRecAbortPtr));
+    
 err:
 
     /*
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 3b5eceff65..6390812a5a 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -76,8 +76,10 @@ typedef XLogLongPageHeaderData *XLogLongPageHeader;
 #define XLP_LONG_HEADER                0x0002
 /* This flag indicates backup blocks starting in this page are optional */
 #define XLP_BKP_REMOVABLE            0x0004
+/* This flag indicates the first record in this page breaks a contrecord */
+#define XLP_FIRST_IS_ABORT_PARTIAL    0x0008
 /* All defined flag bits in xlp_info (used for validity checking of header) */
-#define XLP_ALL_FLAGS                0x0007
+#define XLP_ALL_FLAGS                0x000F
 
 #define XLogPageHeaderSize(hdr)        \
     (((hdr)->xlp_info & XLP_LONG_HEADER) ? SizeOfXLogLongPHD : SizeOfXLogShortPHD)
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 21d200d3df..00a03a628c 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -175,6 +175,8 @@ struct XLogReaderState
     XLogRecPtr    ReadRecPtr;        /* start of last record read */
     XLogRecPtr    EndRecPtr;        /* end+1 of last record read */
 
+    XLogRecPtr    ContRecAbortPtr; /* end+1 of aborted partial contrecord if
+                                  * any */
 
     /* ----------------------------------------
      * Decoded representation of current record
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce..26fc123cdb 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -76,6 +76,7 @@ typedef struct CheckPoint
 #define XLOG_END_OF_RECOVERY            0x90
 #define XLOG_FPI_FOR_HINT                0xA0
 #define XLOG_FPI                        0xB0
+#define XLOG_ABORT_CONTRECORD            x0C0
 
 
 /*
PWD=`pwd`
DATA=data
BKUP=bkup
ARCH=$PWD/arch
rm -rf arch
mkdir arch
rm -rf $DATA
initdb -D $DATA
echo "restart_after_crash = off" >> $DATA/postgresql.conf
echo "archive_mode=on" >> $DATA/postgresql.conf
echo "archive_command='cp %p ${ARCH}/%f'" >> $DATA/postgresql.conf
echo "restart_after_crash = off" >> $DATA/postgresql.conf
rm /tmp/hoge
pg_ctl -D $DATA start
rm -rf $BKUP
pg_basebackup -D $BKUP -h /tmp
echo "archive_mode=off" >> $BKUP/postgresql.conf
echo "restore_command='cp ${ARCH}/%f %p'" >> $BKUP/postgresql.conf
touch bkup/recovery.signal

psql -c 'create table t(a int); insert into t (select a from generate_series(0, 600000) a)'
touch /tmp/hoge
psql -c 'insert into t (select a from generate_series(0, 600000) a)'
rm /tmp/hoge
sleep 5
pg_ctl -D $DATA -w start
psql -c 'checkpoint'
pg_ctl -D $DATA -w stop
echo "#########################"
pg_ctl -D $BKUP -w start
sleep 10
pg_ctl -D $BKUP -w stop





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Jacob Champion
Дата:
Сообщение: Re: [PATCH] test/ssl: rework the sslfiles Makefile target
Следующее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: Possible missing segments in archiving on standby