Re: Standby corruption after master is restarted

From: Kyotaro HORIGUCHI
Subject: Re: Standby corruption after master is restarted
Msg-id: 20180426.195304.118373455.horiguchi.kyotaro@lab.ntt.co.jp
In response to: Re: Standby corruption after master is restarted  (Emre Hasegeli <emre@hasegeli.com>)
Responses: Re: Standby corruption after master is restarted  (Michael Paquier <michael@paquier.xyz>)
           Re: Standby corruption after master is restarted  (Michael Paquier <michael@paquier.xyz>)
List: pgsql-bugs
Hello. I found how this occurs.

# added -hackers

I've seen a similar case caused by improper operation, but this time
I found that it can happen under normal operation.

At Tue, 17 Apr 2018 17:14:05 +0200, Emre Hasegeli <emre@hasegeli.com> wrote in
<CAE2gYzwB0+MM3L6-zdK0JKO_aXqHNCKU7VAiDJk_kKRRJ4B=yA@mail.gmail.com>
> > OK, this seems to confirm the theory that there's a race condition between
> > segment recycling and replicating. It's likely limited to short period after
> > a crash, otherwise we'd probably see many more reports.
> 
> I am suspicious about this part.  It cannot be happening 2 times to us
> and never to anybody else.  Maybe people do not report it because it
> is easy to deal with the problem.  You just delete the corrupted WAL
> file.  Or maybe there is something special in our infrastructure.
> 
> Thank you for your help so far.
> 
> >> This is the connection information.  Although the master shows SSL
> >> compression is disabled in despite of being explicitly asked for.
> 
> > Hmmm, that seems like a separate issue. When you say 'master shows SSL
> > compression is disabled' where do you see that?
> 
> I thought it might be related first, so I mentioned.  Then I found out
> on pg_stat_ssl view on the server that it is actually disabled.
> Probably the configuration of OpenSSL is disabling it.

A segment is not cleared on recycling. walreceiver writes WAL
record by record, so the startup process can see an arbitrary byte
sequence after the last valid record when the replication connection
is lost or the standby is restarted.
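
To illustrate the mechanism, here is a minimal stand-alone C sketch
(not PostgreSQL code; the header struct is a simplified stand-in for
XLogRecord, and the 0xC7 filler byte is arbitrary):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the WAL record header; the real XLogRecord
 * has more fields and a different layout. */
typedef struct
{
    uint32_t xl_tot_len;    /* total record length */
    uint64_t xl_prev;       /* location of previous record */
    uint8_t  xl_rmid;       /* resource manager ID */
} SketchRecordHeader;

int
main(void)
{
    char    segment[8192];
    size_t  written;

    /* A recycled segment keeps the bytes from its previous life. */
    memset(segment, 0xC7, sizeof(segment));

    /* walreceiver-style write: only the records received so far are
     * overwritten, with no zero-fill behind them. */
    written = 24;
    memset(segment, 0, written);    /* pretend these are valid records */

    /* After a restart, the reader interprets the stale bytes that
     * follow as the next record header. */
    SketchRecordHeader hdr;
    memcpy(&hdr, segment + written, sizeof(hdr));
    printf("apparent xl_tot_len = %u\n", hdr.xl_tot_len);  /* ~3.4 GB */
    return 0;
}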

The following scenario results in a similar situation.

1. Create a master and a standby and run them.

   This happens easily if wal_keep_segments is set large
   (20 or so) on the master and 0 on the standby.

2. Write enough WAL that segment recycling happens on the standby.
   Explicit checkpoints on the standby make it faster. Several
   rounds may be needed before a recycled segment appears on the
   standby:

   maybe_loop {
     master:
       create table t (a int);
       insert into t (select a from generate_series(0, 150000) a);
       delete from t;
       checkpoint;

     standby:
       checkpoint;
       <check for recycled segments in pg_wal>
   }

3. Stop the master.

4. The standby starts to complain that the master is missing.

  At this point, the standby complains with one of several kinds of
  failure. I saw 'invalid record length' and 'incorrect prev-link'
  this time, and 'invalid resource manager ID' when mixing records
  of different sizes. If XLogReadRecord sees a record with an
  impossibly large xl_tot_len there, it causes a palloc failure and
  the startup process dies (see the sketch after this list).


5. If you see 'zero length record', it's nothing interesting.
  Repeat steps 3 and 4 to see another failure.
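
Such a record kills the startup process because palloc normally caps
requests at MaxAllocSize (0x3fffffff, just under 1 GB), and a garbage
xl_tot_len easily exceeds that. A sketch of the arithmetic (the
constant is PostgreSQL's MaxAllocSize; the check only mimics the one
in palloc_extended()):

#include <stdint.h>
#include <stdio.h>

#define MAX_ALLOC_SIZE ((size_t) 0x3fffffff)   /* PostgreSQL's MaxAllocSize */

int
main(void)
{
    /* A garbage length like the one in the log output further below. */
    uint32_t garbage_tot_len = 1298694144;     /* ~1.2 GB */

    if (garbage_tot_len > MAX_ALLOC_SIZE)
        printf("invalid memory alloc request size %u\n", garbage_tot_len);

    /* In the startup process, the resulting elog(ERROR) is promoted
     * to FATAL, so the process dies. */
    return 0;
}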


This doesn't seem to happen on the master, since XLogWrite writes in
units of pages, so after a crash redo always sees a zero length
record, a wrong magic number, or a wrong pageaddr at a page boundary.
On the other hand, walreceiver writes in units of records, so redo
can see an arbitrary byte sequence as xl_tot_len, xl_prev, and
xl_rmid while fetching the next record. Especially the first of them
can cause FATAL with "invalid memory alloc".
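
The contrast can be sketched the same way (again not PostgreSQL code;
the point is that a page-aligned writer flushes whole pages whose
unused tail comes from the zeroed page buffer):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char        page[8192];
    uint32_t    next_tot_len;

    /* XLogWrite-style: the whole page is flushed, and the bytes after
     * the last record are the zeroed remainder of the page buffer. */
    memset(page, 0, sizeof(page));
    memcpy(page, "last-valid-record", 17);     /* records end mid-page */

    memcpy(&next_tot_len, page + 17, sizeof(next_tot_len));

    /* xl_tot_len == 0 is the benign "zero length record" end of WAL;
     * within the final page it can never be an arbitrary huge value
     * as in the record-by-record case above. */
    printf("apparent xl_tot_len = %u\n", next_tot_len);
    return 0;
}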

I think this behavior is a bug. XLogReadRecord considers the case,
but palloc_extended() breaks it. So the attached patch adds a new
flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
allocate_recordbuf calls it with the new flag. That alone fixes the
problem. In addition, the patch frees the read state buffer on
encountering an erroneous record, since such a record can leave a
too-large buffer allocated.

After applying this patch, the behavior in this situation changes
as follows:

> LOG:  started streaming WAL from primary at 0/83000000 on timeline 1
> LOG:  record length 1298694144 at 0/83C17B70 too long
> FATAL:  terminating walreceiver process due to administrator command
> LOG:  record length 1298694144 at 0/83C17B70 too long
> LOG:  record length 1298694144 at 0/83C17B70 too long
> LOG:  received promote request
> LOG:  redo done at 0/83C17B38
> LOG:  last completed transaction was at log time 2018-04-26 19:10:12.360253+09
> LOG:  selected new timeline ID: 2
> LOG:  archive recovery complete
> LOG:  database system is ready to accept connections


regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3a86f3497e..02c26224fc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -165,7 +165,8 @@ allocate_recordbuf(XLogReaderState *state, uint32 reclength)
     if (state->readRecordBuf)
         pfree(state->readRecordBuf);
     state->readRecordBuf =
-        (char *) palloc_extended(newSize, MCXT_ALLOC_NO_OOM);
+        (char *) palloc_extended(newSize,
+                                 MCXT_ALLOC_NO_OOM | MCXT_ALLOC_NO_PARAMERR);
     if (state->readRecordBuf == NULL)
     {
         state->readRecordBufSize = 0;
@@ -601,10 +602,19 @@ err:
 
 /*
  * Invalidate the xlogreader's read state to force a re-read.
+ *
+ * The read state may hold a too-large buffer after reading an invalid
+ * record, so release the buffer as well.
  */
 void
 XLogReaderInvalReadState(XLogReaderState *state)
 {
+    if (state->readRecordBuf != NULL)
+    {
+        pfree(state->readRecordBuf);
+        state->readRecordBuf = NULL;
+        state->readRecordBufSize = 0;
+    }
     state->readSegNo = 0;
     state->readOff = 0;
     state->readLen = 0;
diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index ebe0342f18..6e7f95d11f 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -996,7 +996,12 @@ palloc_extended(Size size, int flags)
 
     if (((flags & MCXT_ALLOC_HUGE) != 0 && !AllocHugeSizeIsValid(size)) ||
         ((flags & MCXT_ALLOC_HUGE) == 0 && !AllocSizeIsValid(size)))
+    {
+        if (flags & MCXT_ALLOC_NO_PARAMERR)
+            return NULL;
+
         elog(ERROR, "invalid memory alloc request size %zu", size);
+    }
 
     context->isReset = false;
 
diff --git a/src/include/common/fe_memutils.h b/src/include/common/fe_memutils.h
index 458743dd40..5859ec087f 100644
--- a/src/include/common/fe_memutils.h
+++ b/src/include/common/fe_memutils.h
@@ -16,7 +16,8 @@
 #define MCXT_ALLOC_HUGE            0x01    /* allow huge allocation (> 1 GB) not
                                          * actually used for frontends */
 #define MCXT_ALLOC_NO_OOM        0x02    /* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO            0x04    /* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR    0x04    /* no failure on parameter error */
+#define MCXT_ALLOC_ZERO            0x08    /* zero allocated memory */
 
 /*
  * "Safe" memory allocation functions --- these exit(1) on failure
diff --git a/src/include/utils/palloc.h b/src/include/utils/palloc.h
index 781e948f69..eb1ecc9ee6 100644
--- a/src/include/utils/palloc.h
+++ b/src/include/utils/palloc.h
@@ -63,7 +63,8 @@ extern PGDLLIMPORT MemoryContext CurrentMemoryContext;
  */
 #define MCXT_ALLOC_HUGE            0x01    /* allow huge allocation (> 1 GB) */
 #define MCXT_ALLOC_NO_OOM        0x02    /* no failure if out-of-memory */
-#define MCXT_ALLOC_ZERO            0x04    /* zero allocated memory */
+#define MCXT_ALLOC_NO_PARAMERR    0x04    /* no failure on parameter error */
+#define MCXT_ALLOC_ZERO            0x08    /* zero allocated memory */
 
 /*
  * Fundamental memory-allocation operations (more are in utils/memutils.h)
