Discussion: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync


[PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
"Florian G. Pflug"
Date:
Hi

I believe I have discovered the following problem in pgsql 8.2 and HEAD,
concerning warm-standbys using WAL log shipping.

The problem is that after a crash, the master might complete incomplete
actions via rm_cleanup() - but since it won't wal-log those changes,
the slave won't know about this. This will at least prevent the creation
of any further restart points on the slave (because safe_restartpoint()
will never return true again) - it might even cause data corruption,
if subsequent wal records are interpreted wrongly by the slave because
it sees different data than the master did when it generated them.

Attached is a patch that lets RecoveryRestartPoint call all
rm_cleanup() methods and create a restart point whenever it encounters
a shutdown checkpoint in the wal (because those are generated after
recovery). This ought not cause a performance degradation, because
shutdown checkpoints will occur very infrequently.

The patch is per discussion with Simon Riggs.

I've not yet had a chance to test this patch, I only made sure
that it compiles. I'm sending this out now because I hope this
might make it into 8.2.4.

greetings, Florian Pflug
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6c67821..93c86a1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5060,10 +5060,13 @@ #endif
          * Perform a checkpoint to update all our recovery activity to disk.
          *
          * Note that we write a shutdown checkpoint rather than an on-line
-         * one. This is not particularly critical, but since we may be
-         * assigning a new TLI, using a shutdown checkpoint allows us to have
-         * the rule that TLI only changes in shutdown checkpoints, which
-         * allows some extra error checking in xlog_redo.
+         * one. A slave will always create a restart point if it sees a
+         * shutdown checkpoint, and will call all rm_cleanup() methods before
+         * it does so. This guarantees that any actions taken by the master
+         * in rm_cleanup will also be carried out on the slave.
+         * Additionally, we may be assigning a new TLI, so using a shutdown
+         * checkpoint allows us to have the rule that TLI only changes in shutdown
+         * checkpoints, which allows some extra error checking in xlog_redo.
          */
         CreateCheckPoint(true, true);

@@ -5672,35 +5675,56 @@ CheckPointGuts(XLogRecPtr checkPointRedo
  * restartpoint is needed or not.
  */
 static void
-RecoveryRestartPoint(const CheckPoint *checkPoint)
+RecoveryRestartPoint(const CheckPoint *checkPoint, const bool shutdownCheckpoint)
 {
     int            elapsed_secs;
     int            rmid;

     /*
-     * Do nothing if the elapsed time since the last restartpoint is less than
-     * half of checkpoint_timeout.    (We use a value less than
-     * checkpoint_timeout so that variations in the timing of checkpoints on
-     * the master, or speed of transmission of WAL segments to a slave, won't
-     * make the slave skip a restartpoint once it's synced with the master.)
-     * Checking true elapsed time keeps us from doing restartpoints too often
-     * while rapidly scanning large amounts of WAL.
+     * If the checkpoint we saw in the wal was a shutdown checkpoint, it might
+     * have been written after the recovery following a crash of the master.
+     * In that case, the master will have completed any actions that were
+     * incomplete when it crashed *during recovery*, and these completions
+     * are therefore *not* logged in the wal.
+     * To prevent getting out of sync, we follow what the master did, and
+     * call the rm_cleanup() methods. To be on the safe side, we then create
+     * a RestartPoint, regardless of the time elapsed. Note that asking
+     * the resource managers if they have partial state would be redundant
+     * after calling rm_cleanup().
      */
-    elapsed_secs = time(NULL) - ControlFile->time;
-    if (elapsed_secs < CheckPointTimeout / 2)
-        return;
+    if (shutdownCheckpoint) {
+        for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+        {
+            if (RmgrTable[rmid].rm_cleanup != NULL)
+                RmgrTable[rmid].rm_cleanup();
+        }
+    }
+    else {
+        /*
+         * Do nothing if the elapsed time since the last restartpoint is less than
+         * half of checkpoint_timeout.    (We use a value less than
+         * checkpoint_timeout so that variations in the timing of checkpoints on
+         * the master, or speed of transmission of WAL segments to a slave, won't
+         * make the slave skip a restartpoint once it's synced with the master.)
+         * Checking true elapsed time keeps us from doing restartpoints too often
+         * while rapidly scanning large amounts of WAL.
+         */
+        elapsed_secs = time(NULL) - ControlFile->time;
+        if (elapsed_secs < CheckPointTimeout / 2)
+            return;

-    /*
-     * Is it safe to checkpoint?  We must ask each of the resource managers
-     * whether they have any partial state information that might prevent a
-     * correct restart from this point.  If so, we skip this opportunity, but
-     * return at the next checkpoint record for another try.
-     */
-    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-    {
-        if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
-            if (!(RmgrTable[rmid].rm_safe_restartpoint()))
-                return;
+        /*
+         * Is it safe to checkpoint?  We must ask each of the resource managers
+         * whether they have any partial state information that might prevent a
+         * correct restart from this point.  If so, we skip this opportunity, but
+         * return at the next checkpoint record for another try.
+         */
+        for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+        {
+            if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
+                if (!(RmgrTable[rmid].rm_safe_restartpoint()))
+                    return;
+        }
     }

     /*
@@ -5835,7 +5859,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
             ThisTimeLineID = checkPoint.ThisTimeLineID;
         }

-        RecoveryRestartPoint(&checkPoint);
+        RecoveryRestartPoint(&checkPoint, true);
     }
     else if (info == XLOG_CHECKPOINT_ONLINE)
     {
@@ -5864,7 +5888,7 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *re
                     (errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
                             checkPoint.ThisTimeLineID, ThisTimeLineID)));

-        RecoveryRestartPoint(&checkPoint);
+        RecoveryRestartPoint(&checkPoint, false);
     }
     else if (info == XLOG_SWITCH)
     {

Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
"Simon Riggs"
Date:
On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:

> I believe I have discovered the following problem in pgsql 8.2 and HEAD,
> concerning warm-standbys using WAL log shipping.
> 
> The problem is that after a crash, the master might complete incomplete
> actions via rm_cleanup() - but since it won't wal-log those changes,
> the slave won't know about this. This will at least prevent the creation
> of any further restart points on the slave (because safe_restartpoint()
> will never return true again) - it might even cause data corruption,
> if subsequent wal records are interpreted wrongly by the slave because
> it sees different data than the master did when it generated them.

I agree the problem exists. It is somewhat rare because the idea is that
if the Primary crashes we would failover to the Standby, which would
mean that both Primary and Standby have executed rm_cleanup(), if
needed.

So in the case where the Primary fails and we choose *not* to failover,
there is a potential difficulty on the Standby.

> Attached is a patch that lets RecoveryRestartPoint call all
> rm_cleanup() methods and create a restart point whenever it encounters
> a shutdown checkpoint in the wal (because those are generated after
> recovery). This ought not cause a performance degradation, because
> shutdown checkpoints will occur very infrequently.
> 
> The patch is per discussion with Simon Riggs.
> 
> I've not yet had a chance to test this patch, I only made sure
> that it compiles. I'm sending this out now because I hope this
> might make it into 8.2.4.

The rationale for this fix could be described somewhat differently:

When we shutdown, we know for certain that safe_restartpoint() is true.
However, we don't know whether it is true because we successfully did a
clean shutdown, or because we crashed, recovered and then issued a
shutdown checkpoint following recovery. In the latter case we *must*
execute rm_cleanup() on the standby because it has been executed on the
primary. Not doing so at this point *might* be safe, but is not certain
to be safe. We don't *need* to log a restartpoint at this time, but it
seems sensible to do so.

We need to check that rm_cleanup() routines don't assume that they will
only ever be called once or this will clearly fail. There is also no
need to call rm_cleanup() unless rm_safe_restartpoint() is false.
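
For illustration only, a guarded version of that cleanup loop might look
roughly like this (an untested sketch reusing the RmgrTable identifiers from
the patch above; it is not part of the submitted patch):

    int     rmid;

    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
    {
        /* run cleanup only for rmgrs that actually report unfinished state */
        if (RmgrTable[rmid].rm_safe_restartpoint != NULL &&
            !RmgrTable[rmid].rm_safe_restartpoint() &&
            RmgrTable[rmid].rm_cleanup != NULL)
            RmgrTable[rmid].rm_cleanup();
    }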

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com




Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
"Florian G. Pflug"
Date:
Simon Riggs wrote:
> On Thu, 2007-04-19 at 22:37 +0200, Florian G. Pflug wrote:
>> The problem is that after a crash, the master might complete incomplete
>> actions via rm_cleanup() - but since it won't wal-log those changes,
>> the slave won't know about this. This will at least prevent the creation
>> of any further restart points on the slave (because safe_restartpoint()
>> will never return true again) - it might even cause data corruption,
>> if subsequent wal records are interpreted wrongly by the slave because
>> it sees different data than the master did when it generated them.
> 
> I agree the problem exists. It is somewhat rare because the idea is that
> if the Primary crashes we would failover to the Standby, which would
> mean that both Primary and Standby have executed rm_cleanup(), if
> needed.
> 
> So in the case where the Primary fails and we choose *not* to failover,
> there is a potential difficulty on the Standby.

It occurred to me today that this might plague 8.1 too. As you explain below,
the problem is not really connected to restartpoints - failing to create
them is merely a symptom of this. On 8.1, this might still lead to rm_cleanup()
being called much "later" (if you consider the wal position to be the "time")
on the slave than on the master. I dunno if this really causes trouble - I
don't yet understand the btree code well enough to judge this.

> The rationale for this fix could be described somewhat differently:
> 
> When we shutdown, we know for certain that safe_restartpoint() is true.
> However, we don't know whether it is true because we successfully did a
> clean shutdown, or because we crashed, recovered and then issued a
> shutdown checkpoint following recovery. In the latter case we *must*
> execute rm_cleanup() on the standby because it has been executed on the
> primary. Not doing so at this point *might* be safe, but is not certain
> to be safe. We don't *need* to log a restartpoint at this time, but it
> seems sensible to do so.

While creating the patch, I've been thinking that it might be worthwhile
to note in the ShutdownCheckpoint that we just did recovery
(or to create a new checkpoint type, RecoveryCheckpoint). This would allow
for more error checking, because then the slave could check that
safe_restartpoint() is true for all ShutdownCheckpoints that were not
written after a recovery.
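
Roughly, the extra slave-side check this would enable might look like the
following (just a sketch; the wasRecoveryCheckpoint flag is made up for
illustration and would have to be carried in the checkpoint record somehow):

    /*
     * A shutdown checkpoint that was NOT written right after a crash
     * recovery should find every resource manager in a safe state.
     */
    if (info == XLOG_CHECKPOINT_SHUTDOWN && !wasRecoveryCheckpoint)
    {
        int     rmid;

        for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
        {
            if (RmgrTable[rmid].rm_safe_restartpoint != NULL &&
                !RmgrTable[rmid].rm_safe_restartpoint())
                elog(WARNING, "resource manager %d reports incomplete state "
                     "at shutdown checkpoint", rmid);
        }
    }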

> We need to check that rm_cleanup() routines don't assume that they will
> only ever be called once or this will clearly fail. There is also no
> need to call rm_cleanup() unless rm_safe_restartpoint() is false.

But a non-idempotent rm_cleanup() routine will cause trouble anyway,
if postgres crashes after having called rm_cleanup() but before creating
the ShutdownCheckpoint.

greetings, Florian Pflug



Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


Florian G. Pflug wrote:
> Hi
> 
> I believe I have discovered the following problem in pgsql 8.2 and HEAD,
> concerning warm-standbys using WAL log shipping.
> 
> The problem is that after a crash, the master might complete incomplete
> actions via rm_cleanup() - but since it won't wal-log those changes,
> the slave won't know about this. This will at least prevent the creation
> of any further restart points on the slave (because safe_restartpoint()
> will never return true again) - it might even cause data corruption,
> if subsequent wal records are interpreted wrongly by the slave because
> it sees different data than the master did when it generated them.
> 
> Attached is a patch that lets RecoveryRestartPoint call all
> rm_cleanup() methods and create a restart point whenever it encounters
> a shutdown checkpoint in the wal (because those are generated after
> recovery). This ought not cause a performance degradation, because
> shutdown checkpoints will occur very infrequently.
> 
> The patch is per discussion with Simon Riggs.
> 
> I've not yet had a chance to test this patch, I only made sure
> that it compiles. I'm sending this out now because I hope this
> might make it into 8.2.4.
> 
> greetings, Florian Pflug



--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
Tom Lane
Date:
[ back to dealing with this patch, finally ]

"Florian G. Pflug" <fgp@phlo.org> writes:
> While creating the patch, I've been thinking that it might be worthwhile
> to note in the ShutdownCheckpoint that we just did recovery
> (or to create a new checkpoint type, RecoveryCheckpoint). This would allow
> for more error checking, because then the slave could check that
> safe_restartpoint() is true for all ShutdownCheckpoints that were not
> written after a recovery.

I concur that this is a good idea --- we should have a third checkpoint
record type that shows that a crash recovery occurred.  However, we can
probably only do that for 8.3 and beyond.  If we try to do it in
existing release branches then there's likelihood of trouble due to WAL
incompatibility between master and standby.  While we do advise people
to update their standbys first, I don't think it's worth risking such
problems just to add some more error checking.

Conclusion: we should apply Florian's patch as-is in 8.2, do something
morally equivalent in 8.1 and before, and invent a
CrashRecoveryCheckpoint record type in HEAD.
        regards, tom lane


Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
"Florian G. Pflug"
Date:
Tom Lane wrote:
> [ back to dealing with this patch, finally ]
> 
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> While creating the patch, I've been thinking that it might be worthwhile
>> to note in the ShutdownCheckpoint that we just did recovery
>> (or to create a new checkpoint type, RecoveryCheckpoint). This would allow
>> for more error checking, because then the slave could check that
>> safe_restartpoint() is true for all ShutdownCheckpoints that were not
>> written after a recovery.
> 
> I concur that this is a good idea --- we should have a third checkpoint
> record type that shows that a crash recovery occurred.  However, we can
> probably only do that for 8.3 and beyond.  If we try to do it in
> existing release branches then there's likelihood of trouble due to WAL
> incompatibility between master and standby.  While we do advise people
> to update their standbys first, I don't think it's worth risking such
> problems just to add some more error checking.
>
> Conclusion: we should apply Florian's patch as-is in 8.2, do something
> morally equivalent in 8.1 and before, and invent a
> CrashRecoveryCheckpoint record type in HEAD.

Sounds good.

Do you want me to code up such patches for 8.1 and 8.3 in the next few days,
or is someone else already working on it?

greetings, Florian Pflug




"Florian G. Pflug" <fgp@phlo.org> writes:
> Tom Lane wrote:
>> Conclusion: we should apply Florian's patch as-is in 8.2, do something
>> morally equivalent in 8.1 and before, and invent a
>> CrashRecoveryCheckpoint record type in HEAD.

> Sounds good.

Actually, now that I look closer, this patch seems completely wrong.
It's predicated on an assumption that rm_cleanup won't write WAL entries
describing what it did ... but, at least in the btree case, it does.
(I think gist/gin might not, but that's a bug in those AMs not in xlog.)
I'm therefore wondering what test case led you to think there was
something wrong.
        regards, tom lane


Re: [PATCH] A crash and subsequent recovery of the master can cause the slave to get out-of-sync

From:
"Florian G. Pflug"
Date:
Tom Lane wrote:
> "Florian G. Pflug" <fgp@phlo.org> writes:
>> Tom Lane wrote:
>>> Conclusion: we should apply Florian's patch as-is in 8.2, do something
>>> morally equivalent in 8.1 and before, and invent a
>>> CrashRecoveryCheckpoint record type in HEAD.
> 
>> Sounds good.
> 
> Actually, now that I look closer, this patch seems completely wrong.
> It's predicated on an assumption that rm_cleanup won't write WAL entries
> describing what it did ... but, at least in the btree case, it does.
> (I think gist/gin might not, but that's a bug in those AMs not in xlog.)
> I'm therefore wondering what test case led you to think there was
> something wrong.

It wasn't a test case - I was trying to understand the xlog code while working
on my concurrent walreplay patch, and wondered what happens if the master
crashes and then recovers while the slave keeps running.

I've re-read my original email to Simon, and it seems that I believed
that rm_cleanup methods would not be able to write to the xlog because they are
called during recovery.

But StartupXLOG *does* make the wal appendable *before* the rm_cleanup methods
are called.
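
Heavily abbreviated, the ordering in StartupXLOG is roughly this (a sketch of
the control flow as I read it, not the literal 8.2 source):

    /* ... the main redo loop has finished applying WAL records ... */

    /* WAL is made appendable at this point, i.e. before cleanup runs */

    for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
    {
        if (RmgrTable[rmid].rm_cleanup != NULL)
            RmgrTable[rmid].rm_cleanup();   /* may itself write WAL records */
    }

    /* ... and only afterwards the post-recovery shutdown checkpoint ... */
    CreateCheckPoint(true, true);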

So I now think (at least for btree) that everything is fine, and that I was
just being stupid.

Sorry for the noise, guys
greetings, Florian Pflug