Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

Поиск
Список
Период
Сортировка
От Imseih (AWS), Sami
Тема Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Дата
Msg-id E37D2989-CFB6-4302-A871-A007DC71DA25@amazon.com
обсуждение исходный текст
Ответ на Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  ("Hsu, John" <hsuchen@amazon.com>)
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
>    So.. I'd like to hear exactly what you did as the testing.

>    When standby mode is requested, if crash recovery fails with immature
>    contrecord, I think we shouldn't continue recovery. But I'm not sure
>    we need to explictly reject that case.  Further study is needed..


Here is more details about my findings and testing.

Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff 
we still see the issue with post promotion checkpoint
resulting in "request to flush past end of generated WAL;"

i.e.

2022-05-25 00:35:38 UTC::@:[371]:LOG:  checkpoint starting: immediate force wait wal time
2022-05-25 00:35:38 UTC::@:[371]:LOG:  request to flush past end of generated WAL; request 10/B1FA3D88, currpos
7/A8000060
2022-05-25 00:35:38 UTC::@:[371]:PANIC:  xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to
7/A8000060
2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR:  current transaction is aborted,
commandsignored until end of transaction block
 

The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff 
was to make sure the standby skips the overwrite contrecord.

However, we still see "missingContrecPtr" is being set 
on the standby before promotion and after the instance is 
promoted, the missingContrecPtr is written to WAL 
and the subsequent flush throws a "PANIC:  xlog flush request"

To Reproduce using TAP tests; 

1) apply the attached patch 0001-Patch_to_repro.patch to the head branch.
   This patch adds logging when an OverWriteContRecord
   is created and comments out the invalidation code
   added in commit  9d92582abf918215d27659d45a4c9e78bda50aff
   
2) Run a tap test under "test/recovery"
   make check PROVE_TESTS='t/026_overwrite_contrecord.pl'

3) What is observed in the logfiles for both the standby
   and primary instance in the tap test is 
   that an overwrite contrecord is created on the primary, 
   which is correct, but also on the standby after promotion, 
   which is incorrect. This is incorrect as the contrecord

simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
2022-05-27 13:17:50.843 CDT [98429] LOG:  creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000

simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
2022-05-27 13:17:51.361 CDT [98421] LOG:  received promote request
2022-05-27 13:17:51.394 CDT [98421] LOG:  creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
simseih@88665a22795f recovery %

What we found:

1. missingContrecPtr is set when 
   StandbyMode is false, therefore
   only a writer should set this value
   and a record is then sent downstream.

   But a standby going through crash 
   recovery will always have StandbyMode = false,
   causing the missingContrecPtr to be incorrectly
   set.

2. If StandbyModeRequested is checked instead,
     we ensure that a standby will not set a 
     missingContrecPtr.

3. After applying the patch below, the tap test succeeded

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 5ee90b6..a727aaf 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
                         * we'll write a record to indicate to downstream WAL readers that
                         * that portion is to be ignored.
                         */
-                       if (!StandbyMode &&
+                       if (!StandbyModeRequested &&
                                !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
                        {
                                abortedRecPtr = xlogreader->abortedRecPtr;

So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff
but to check StandbyModeRequested = false before setting 
missingContrecPtr.

Thank you
---
Sami Imseih
Amazon Web Services


Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: suboverflowed subtransactions concurrency performance optimize
Следующее
От: Cary Huang
Дата:
Сообщение: Re: Allowing REINDEX to have an optional name