Re: BUG #4879: bgwriter fails to fsync the file in recovery mode

From: Heikki Linnakangas
Subject: Re: BUG #4879: bgwriter fails to fsync the file in recovery mode
Date:
Msg-id: 4A439E7B.7040107@enterprisedb.com
In reply to: Re: BUG #4879: bgwriter fails to fsync the file in recovery mode (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: BUG #4879: bgwriter fails to fsync the file in recovery mode (Simon Riggs <simon@2ndQuadrant.com>)
Re: BUG #4879: bgwriter fails to fsync the file in recovery mode (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-bugs
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Heikki Linnakangas wrote:
>>> Hmm, what happens when the startup process performs a write, and
>>> bgwriter is not running? Do the fsync requests queue up in the shmem
>>> queue until the end of recovery when bgwriter is launched? I guess I'll
>>> have to try it out...
>
>> Oh dear, doesn't look good. The startup process has a pendingOpsTable of
>> its own. bgwriter won't fsync() files that the startup process has
>> written itself. That needs to be fixed, or you can lose data when an
>> archive recovery crashes after a restartpoint.
>
> Ouch.  I'm beginning to think that the best thing is to temporarily
> revert the change that made bgwriter active during recovery.  It's
> obviously not been adequately thought through or tested.

That was my first thought too, but unfortunately we now rely on bgwriter
to perform restartpoints :-(.
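
To make the quoted failure mode concrete, here is a tiny standalone sketch
(not PostgreSQL code; all names here are made up) of the underlying problem:
each process keeps its own private pending-fsync table, so an fsync pass run
by one process never covers writes that another process remembered locally.

#include <stdio.h>

#define MAX_PENDING 8

struct pending_ops
{
    const char *files[MAX_PENDING];
    int         count;
};

/* each process has its own private table, as with pendingOpsTable in md.c */
static void
remember_fsync(struct pending_ops *table, const char *file)
{
    table->files[table->count++] = file;
}

static void
sync_pending(const char *who, struct pending_ops *table)
{
    int         i;

    for (i = 0; i < table->count; i++)
        printf("%s fsyncs %s\n", who, table->files[i]);
    table->count = 0;
}

int
main(void)
{
    struct pending_ops startup = {0};
    struct pending_ops bgwriter = {0};

    /* startup process writes a page during redo and queues the fsync locally */
    remember_fsync(&startup, "base/1/1234");
    /* bgwriter writes a page of its own */
    remember_fsync(&bgwriter, "base/1/5678");

    /* the restartpoint runs in bgwriter: startup's file is never fsynced */
    sync_pending("bgwriter", &bgwriter);

    return 0;
}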

I came up with the attached patch, which includes Simon's patch to have
all fsync requests forwarded to bgwriter during archive recovery. To fix
the startup checkpoint issue, the startup process requests a forced
restartpoint, which flushes any fsync requests bgwriter has accumulated,
before performing the actual checkpoint in the startup process itself.
This is still completely untested, but does anyone immediately see any
further problems?
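
For context on why dropping pendingOpsTable is enough to redirect requests:
register_dirty_segment() in md.c already branches on whether the local table
exists, roughly like this (paraphrased from memory, so details may differ
from the real source; this is not part of the patch):

static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
    if (pendingOpsTable)
    {
        /* push it into the local pending-ops table */
        RememberFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno);
    }
    else
    {
        /* no local table: try to hand the request to bgwriter via shmem */
        if (ForwardFsyncRequest(reln->smgr_rnode, forknum, seg->mdfd_segno))
            return;             /* passed it off successfully */

        /* request queue full, so fall back to fsyncing the file ourselves */
        if (FileSync(seg->mdfd_vfd) < 0)
            ereport(ERROR,
                    (errcode_for_file_access(),
                     errmsg("could not fsync segment: %m")));
    }
}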

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4fd9d41..5114664 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5156,6 +5156,7 @@ StartupXLOG(void)
     XLogRecord *record;
     uint32        freespace;
     TransactionId oldestActiveXID;
+    bool        bgwriterLaunched = false;

     XLogCtl->SharedRecoveryInProgress = true;

@@ -5472,7 +5473,11 @@ StartupXLOG(void)
              * process in addition to postmaster!
              */
             if (InArchiveRecovery && IsUnderPostmaster)
+            {
+                SetForwardFsyncRequests();
                 SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
+                bgwriterLaunched = true;
+            }

             /*
              * main redo apply loop
@@ -5742,7 +5747,16 @@ StartupXLOG(void)
          * assigning a new TLI, using a shutdown checkpoint allows us to have
          * the rule that TLI only changes in shutdown checkpoints, which
          * allows some extra error checking in xlog_redo.
+         *
+         * If bgwriter is active, we can't do the necessary fsyncs in the
+         * startup process because bgwriter has accumulated fsync requests
+         * into its private memory. So we request one last forced restartpoint,
+         * which will flush them to disk, before performing the actual
+         * checkpoint.
          */
+        if (bgwriterLaunched)
+            RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE |
+                              CHECKPOINT_WAIT);
         CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);

         /*
@@ -6219,27 +6233,11 @@ CreateCheckPoint(int flags)

     /*
      * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
-     * During normal operation, bgwriter is the only process that creates
-     * checkpoints, but at the end of archive recovery, the bgwriter can be
-     * busy creating a restartpoint while the startup process tries to perform
-     * the startup checkpoint.
+     * (This is just pro forma, since in the present system structure there is
+     * only one process that is allowed to issue checkpoints at any given
+     * time.)
      */
-    if (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
-    {
-        Assert(InRecovery);
-
-        /*
-         * A restartpoint is in progress. Wait until it finishes. This can
-         * cause an extra restartpoint to be performed, but that's OK because
-         * we're just about to perform a checkpoint anyway. Flushing the
-         * buffers in this restartpoint can take some time, but that time is
-         * saved from the upcoming checkpoint so the net effect is zero.
-         */
-        ereport(DEBUG2, (errmsg("hurrying in-progress restartpoint")));
-        RequestCheckpoint(CHECKPOINT_IMMEDIATE | CHECKPOINT_WAIT);
-
-        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
-    }
+    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

     /*
      * Prepare to accumulate statistics.
@@ -6660,8 +6658,9 @@ CreateRestartPoint(int flags)
      * restartpoint. It's assumed that flushing the buffers will do that as a
      * side-effect.
      */
-    if (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
-        XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo))
+    if (!(flags & CHECKPOINT_FORCE) &&
+        (XLogRecPtrIsInvalid(lastCheckPointRecPtr) ||
+         XLByteLE(lastCheckPoint.redo, ControlFile->checkPointCopy.redo)))
     {
         XLogRecPtr    InvalidXLogRecPtr = {0, 0};

diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index d42e86a..b01b2ab 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -204,6 +204,21 @@ mdinit(void)
 }

 /*
+ * In archive recovery, we rely on bgwriter to do the fsync()s, but we don't
+ * yet know at process startup, when pendingOpsTable has already been created,
+ * whether we are doing archive recovery. Calling this function drops
+ * pendingOpsTable and causes any subsequent requests to be forwarded to bgwriter.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+    /* Perform any pending ops we may have queued up */
+    if (pendingOpsTable)
+        mdsync();
+    pendingOpsTable = NULL;
+}
+
+/*
  *    mdexists() -- Does the physical file exist?
  *
  * Note: this will return true for lingering files, with pending deletions
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 7556b14..fd79d7b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -109,6 +109,7 @@ extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);

+extern void SetForwardFsyncRequests(void);
 extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
                      BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
