Re: archive status ".ready" files may be created too early

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: archive status ".ready" files may be created too early
Дата
Msg-id 20201014.090628.839639906081252194.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: archive status ".ready" files may be created too early  (Heikki Linnakangas <hlinnaka@iki.fi>)
Ответы Re: archive status ".ready" files may be created too early  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Список pgsql-hackers
Thanks for visiting this thread.

At Mon, 12 Oct 2020 15:04:40 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in 
> On 07/07/2020 12:02, matsumura.ryo@fujitsu.com wrote:
> > At Monday, July 6, 2020 05:13:40 +0000, "Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com>" wrote in
> >>>> after WAL buffer is filled up to the requested position. So when it
> >>>> crosses segment boundary we know the all past corss segment-boundary
> >>>> records are stable. That means all we need to remember is only the
> >>>> position of the latest corss-boundary record.
> >>>
> >>> I could not agree. In the following case, it may not work well.
> >>> - record-A and record-B (record-B is a newer one) is copied, and
> >>> - lastSegContRecStart/End points to record-B's, and
> >>> - FlushPtr is proceeded to in the middle of record-A.
> >>
> >> IIUC, that means record-B is a cross segment-border record and we hav
> >> e
> >> flushed beyond the recrod-B. In that case crash recovery afterwards
> >> can read the complete record-B and will finish recovery *after* the
> >> record-B. That's what we need here.
> > I'm sorry I didn't explain enough.
> > Record-A and Record-B are cross segment-border records.
> > Record-A spans segment X and X+1
> > Record-B spans segment X+2 and X+3.
> > If both records have been inserted to WAL buffer,
> > lastSegContRecStart/End points to Record-B.
> > If a writer flushes upto the middle of segment-X+1,
> > NotifyStableSegments() allows the writer to notify segment-X.
> > Is my understanding correct?
> 
> I think this little ASCII drawing illustrates the above scenario:
> 
>         AAAAA  F  BBBBB
> |---------|---------|---------|
>    seg X    seg X+1   seg X+2
> 
> AAAAA and BBBBB are Record-A and Record-B. F is the current flush
> pointer.

I modified the figure a bit for the explanation below.

          F0        F1
        AAAAA  F  BBBBB
|---------|---------|---------|
   seg X    seg X+1   seg X+2

Matsumura-san has a concern about the case where there are two (or
more) partially-flushed segment-spanning records at the same time.

This patch remembers only the last cross-segment record. If we were
going to flush up to F0 after Record-B had been written, we would fail
to hold-off archiving seg-X. This patch is based on a assumption that
that case cannot happen because we don't leave a pending page at the
time of segment switch and no records don't span over three or more
segments.

> In this case, it would be OK to notify segment X, as long as F is
> greater than the end of record A. And if I'm reading Kyotaro's patch
> correctly, that's what would happen with the patch.
> 
> The patch seems correct to me. I'm a bit sad that we have to track yet
> another WAL position (two, actually) to fix this, but I don't see a
> better way.

Is the two means Record-A and B? Is it needed even with having the
assumption above?

> I wonder if we should arrange things so that XLogwrtResult.Flush never
> points in the middle of a record? I'm not totally convinced that all

That happens at good percentage of page-boundary. And a record can
span over three or more pages. Do we need to avoid all such cases?

I did that only for the cross-segment case.

> the current callers of GetFlushRecPtr() are OK with a middle-of-WAL
> record value. Could we get into similar trouble if a standby
> replicates half of a cross-segment record to a cascaded standby, and
> the cascaded standby has WAL archiving enabled?

The patch includes a fix for primary->standby case. But I'm not sure
we can do that in the cascaded case. A standby is not aware of the
structure of a WAL blob and has no idea of up-to-where to send the
received blobs. However, if we can rely on the behavior of CopyData
that we always receive a blob as a whole sent from the sender at once,
the cascaded standbys are free from the trouble (as far as the
cascaded-standby doesn't crash just before writing the last-half of a
record into pg_wal and after archiving the last full-segment, which
seems unlikely.).

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From e8320f7486c057ccb97be56ce1859296b8072256 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 13 Oct 2020 20:41:33 +0900
Subject: [PATCH v2] Avoid archiving a WAL segment that continues to the next
 segment

If the last record of a finshed segment continues to the next segment,
we need to defer archiving of the segment until the record is flushed
to the end. Otherwise crash recovery can overwrite the last record of
a segment and history diverges between archive and pg_wal.
---
 src/backend/access/transam/xlog.c   | 160 +++++++++++++++++++++++++++-
 src/backend/replication/walsender.c |  14 +--
 src/include/access/xlog.h           |   1 +
 3 files changed, 163 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 52a67b1170..8fd0fdb598 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -723,6 +723,16 @@ typedef struct XLogCtlData
      */
     XLogRecPtr    lastFpwDisableRecPtr;
 
+    /* The last segment notified to be archived. Protected by WALWriteLock */
+    XLogSegNo    lastNotifiedSeg;
+
+    /*
+     * Remember the range of the last segment-spanning record. Protected by
+     * info_lck
+     */
+    XLogRecPtr    lastSegContRecStart;
+    XLogRecPtr    lastSegContRecEnd;
+
     slock_t        info_lck;        /* locks shared variables shown above */
 } XLogCtlData;
 
@@ -1160,6 +1170,9 @@ XLogInsertRecord(XLogRecData *rdata,
      */
     if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
     {
+        XLogSegNo startseg;
+        XLogSegNo endseg;
+
         SpinLockAcquire(&XLogCtl->info_lck);
         /* advance global request to include new block(s) */
         if (XLogCtl->LogwrtRqst.Write < EndPos)
@@ -1167,6 +1180,21 @@ XLogInsertRecord(XLogRecData *rdata,
         /* update local result copy while I have the chance */
         LogwrtResult = XLogCtl->LogwrtResult;
         SpinLockRelease(&XLogCtl->info_lck);
+
+        /* Remember the range of the record if it spans over segments */
+        XLByteToSeg(StartPos, startseg, wal_segment_size);
+        XLByteToPrevSeg(EndPos, endseg, wal_segment_size);
+
+        if (startseg != endseg)
+        {
+            SpinLockAcquire(&XLogCtl->info_lck);
+            if (XLogCtl->lastSegContRecEnd < StartPos)
+            {
+                XLogCtl->lastSegContRecStart = StartPos;
+                XLogCtl->lastSegContRecEnd = EndPos;
+            }
+            SpinLockRelease(&XLogCtl->info_lck);
+        }
     }
 
     /*
@@ -2399,6 +2427,43 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
     return false;
 }
 
+/*
+ * Returns last notified segment.
+ */
+static XLogSegNo
+GetLastNotifiedSegment(void)
+{
+    XLogSegNo last_notified;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    last_notified = XLogCtl->lastNotifiedSeg;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    return last_notified;
+}
+
+/*
+ * Notify segments that are not yet notified.
+ */
+static void
+NotifySegmentsUpTo(XLogSegNo notifySegNo)
+{
+    XLogSegNo last_notified = GetLastNotifiedSegment();
+    XLogSegNo i;
+
+    if (notifySegNo <= last_notified)
+        return;
+
+    for (i = XLogCtl->lastNotifiedSeg + 1 ; i <= notifySegNo ; i++)
+        XLogArchiveNotifySeg(i);
+
+    /* Don't go back in the case someone else has made it go further. */
+    SpinLockAcquire(&XLogCtl->info_lck);
+    if (XLogCtl->lastNotifiedSeg < notifySegNo)
+        XLogCtl->lastNotifiedSeg = notifySegNo;
+    SpinLockRelease(&XLogCtl->info_lck);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2422,6 +2487,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
     int            npages;
     int            startidx;
     uint32        startoffset;
+    bool        extended = false;
 
     /* We should always be inside a critical section here */
     Assert(CritSectionCount > 0);
@@ -2578,15 +2644,48 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
              */
             if (finishing_seg)
             {
+                XLogRecPtr lastSegContRecStart;
+                XLogRecPtr lastSegContRecEnd;
+
                 issue_xlog_fsync(openLogFile, openLogSegNo);
 
                 /* signal that we need to wakeup walsenders later */
                 WalSndWakeupRequest();
 
-                LogwrtResult.Flush = LogwrtResult.Write;    /* end of page */
+                SpinLockAcquire(&XLogCtl->info_lck);
+                lastSegContRecStart = XLogCtl->lastSegContRecStart;
+                lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+                SpinLockRelease(&XLogCtl->info_lck);
 
-                if (XLogArchivingActive())
-                    XLogArchiveNotifySeg(openLogSegNo);
+                /*
+                 * Don't expose flush location at middle of a corss-segment WAL
+                 * record.
+                 */
+                if (LogwrtResult.Write < lastSegContRecStart ||
+                    lastSegContRecEnd <= LogwrtResult.Write)
+                {
+                    LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
+
+                    if (XLogArchivingActive())
+                        NotifySegmentsUpTo(openLogSegNo);
+
+                    extended = false;
+                }
+                else
+                {
+                    /*
+                     * Sometimes we are told to flush up not to the end of a
+                     * record but to WALbuffer page boundary. We advance the
+                     * request LSN to the end of the record in the case the
+                     * record at the requested LSN continues to the next segment.
+                     */
+                    if (lastSegContRecStart <= WriteRqst.Write &&
+                        WriteRqst.Write <= lastSegContRecEnd)
+                        WriteRqst.Write = lastSegContRecEnd;
+
+                    /* Continue to the next iteration ignoring flexible flag */
+                    extended = true;
+                }
 
                 XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
                 XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
@@ -2615,11 +2714,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
         }
         curridx = NextBufIdx(curridx);
 
-        /* If flexible, break out of loop as soon as we wrote something */
-        if (flexible && npages == 0)
+        /*
+         * If flexible, break out of loop as soon as we wrote something.
+         * However, we don't leave the loop if the last record in the just
+         * finished segment needs to be finished.
+         */
+        if (flexible && !extended && npages == 0)
             break;
     }
 
+    /*
+     * We have extended the write request to the next segment if the record at
+     * the initial WriteRqst.Write continues to the next segment.  In that case
+     * need to notify the last segment here.
+     */
+    if (extended)
+        NotifySegmentsUpTo(openLogSegNo - 1);
+
     Assert(npages == 0);
 
     /*
@@ -7710,6 +7821,18 @@ StartupXLOG(void)
     XLogCtl->LogwrtRqst.Write = EndOfLog;
     XLogCtl->LogwrtRqst.Flush = EndOfLog;
 
+    /*
+     * We have archived up to the previous segment of EndOfLog so far.
+     * Initialize lastNotifiedSeg if needed.
+     */
+    if (XLogArchivingActive())
+    {
+        XLogSegNo    endLogSegNo;
+
+        XLByteToSeg(EndOfLog, endLogSegNo, wal_segment_size);
+        XLogCtl->lastNotifiedSeg = endLogSegNo - 1;
+    }
+
     /*
      * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
      * record before resource manager writes cleanup WAL records or checkpoint
@@ -8434,6 +8557,33 @@ GetFlushRecPtr(void)
     return LogwrtResult.Flush;
 }
 
+/*
+ * GetReplicationTargetRecPtr -- Returns the latest position that is safe to
+ * replicate.
+ */
+XLogRecPtr
+GetReplicationTargetRecPtr(void)
+{
+    XLogRecPtr lastSegContRecStart;
+    XLogRecPtr lastSegContRecEnd;
+
+    SpinLockAcquire(&XLogCtl->info_lck);
+    LogwrtResult = XLogCtl->LogwrtResult;
+    lastSegContRecStart = XLogCtl->lastSegContRecStart;
+    lastSegContRecEnd = XLogCtl->lastSegContRecEnd;
+    SpinLockRelease(&XLogCtl->info_lck);
+
+    /*
+     * Use start position of the last segmenet-spanning continuation record
+     * when the record is not flushed completely.
+     */
+    if (lastSegContRecStart < LogwrtResult.Flush &&
+        LogwrtResult.Flush <= lastSegContRecEnd)
+        return lastSegContRecStart;
+
+    return LogwrtResult.Flush;
+}
+
 /*
  * GetLastImportantRecPtr -- Returns the LSN of the last important record
  * inserted. All records not explicitly marked as unimportant are considered
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7c9d1b67df..e9eaca5ffa 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2642,14 +2642,14 @@ XLogSendPhysical(void)
         /*
          * Streaming the current timeline on a primary.
          *
-         * Attempt to send all data that's already been written out and
-         * fsync'd to disk.  We cannot go further than what's been written out
-         * given the current implementation of WALRead().  And in any case
-         * it's unsafe to send WAL that is not securely down to disk on the
-         * primary: if the primary subsequently crashes and restarts, standbys
-         * must not have applied any WAL that got lost on the primary.
+         * Attempt to send all data that's can be replicated.  We cannot go
+         * further than what's been written out given the current
+         * implementation of WALRead().  And in any case it's unsafe to send
+         * WAL that is not securely down to disk on the primary: if the primary
+         * subsequently crashes and restarts, standbys must not have applied
+         * any WAL that got lost on the primary.
          */
-        SendRqstPtr = GetFlushRecPtr();
+        SendRqstPtr = GetReplicationTargetRecPtr();
     }
 
     /*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..94876f628c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -338,6 +338,7 @@ extern void GetFullPageWriteInfo(XLogRecPtr *RedoRecPtr_p, bool *doPageWrites_p)
 extern XLogRecPtr GetRedoRecPtr(void);
 extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
+extern XLogRecPtr GetReplicationTargetRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void RemovePromoteSignalFiles(void);
 
-- 
2.18.4


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: Add Information during standby recovery conflicts
Следующее
От: Andy Fan
Дата:
Сообщение: Re: [HACKERS] Runtime Partition Pruning