Re: [BUG] Panic due to incorrect missingContrecPtr after promotion

Поиск
Список
Период
Сортировка
От Hsu, John
Тема Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
Дата
Msg-id 0071e7fe-b1a9-e38c-5706-657913fde1fb@amazon.com
обсуждение исходный текст
Ответ на Re: [BUG] Panic due to incorrect missingContrecPtr after promotion  ("Imseih (AWS), Sami" <simseih@amazon.com>)
Список pgsql-hackers
Hi,

As an update we have a test suite running for the last few days with 
constant workload and failovers/promotions. With the proposed change 
from Sami we no longer see PANICs due to this.

The comment indicates that the abortedRecPtr and missingContRecPtr 
should only be set when the database is a writer since it's used to 
write a new record to downstream readers. !StandbyMode is a poor proxy 
for this as Sami mentioned since there's cases when a replica is going 
through crash recovery that would then set this record.

Thanks,

John H

On 5/27/22 12:01 PM, Imseih (AWS), Sami wrote:
>>     So.. I'd like to hear exactly what you did as the testing.
>>     When standby mode is requested, if crash recovery fails with immature
>>     contrecord, I think we shouldn't continue recovery. But I'm not sure
>>     we need to explictly reject that case.  Further study is needed..
>
> Here is more details about my findings and testing.
>
> Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff
> we still see the issue with post promotion checkpoint
> resulting in "request to flush past end of generated WAL;"
>
> i.e.
>
> 2022-05-25 00:35:38 UTC::@:[371]:LOG:  checkpoint starting: immediate force wait wal time
> 2022-05-25 00:35:38 UTC::@:[371]:LOG:  request to flush past end of generated WAL; request 10/B1FA3D88, currpos
7/A8000060
> 2022-05-25 00:35:38 UTC::@:[371]:PANIC:  xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to
7/A8000060
> 2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR:  current transaction is aborted,
commandsignored until end of transaction block
 
>
> The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff
> was to make sure the standby skips the overwrite contrecord.
>
> However, we still see "missingContrecPtr" is being set
> on the standby before promotion and after the instance is
> promoted, the missingContrecPtr is written to WAL
> and the subsequent flush throws a "PANIC:  xlog flush request"
>
> To Reproduce using TAP tests;
>
> 1) apply the attached patch 0001-Patch_to_repro.patch to the head branch.
>     This patch adds logging when an OverWriteContRecord
>     is created and comments out the invalidation code
>     added in commit  9d92582abf918215d27659d45a4c9e78bda50aff
>     
> 2) Run a tap test under "test/recovery"
>     make check PROVE_TESTS='t/026_overwrite_contrecord.pl'
>
> 3) What is observed in the logfiles for both the standby
>     and primary instance in the tap test is
>     that an overwrite contrecord is created on the primary,
>     which is correct, but also on the standby after promotion,
>     which is incorrect. This is incorrect as the contrecord
>
> simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
> 2022-05-27 13:17:50.843 CDT [98429] LOG:  creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
>
> simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
> 2022-05-27 13:17:51.361 CDT [98421] LOG:  received promote request
> 2022-05-27 13:17:51.394 CDT [98421] LOG:  creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
> simseih@88665a22795f recovery %
>
> What we found:
>
> 1. missingContrecPtr is set when
>     StandbyMode is false, therefore
>     only a writer should set this value
>     and a record is then sent downstream.
>
>     But a standby going through crash
>     recovery will always have StandbyMode = false,
>     causing the missingContrecPtr to be incorrectly
>     set.
>
> 2. If StandbyModeRequested is checked instead,
>       we ensure that a standby will not set a
>       missingContrecPtr.
>
> 3. After applying the patch below, the tap test succeeded
>
> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index 5ee90b6..a727aaf 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
>                           * we'll write a record to indicate to downstream WAL readers that
>                           * that portion is to be ignored.
>                           */
> -                       if (!StandbyMode &&
> +                       if (!StandbyModeRequested &&
>                                  !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
>                          {
>                                  abortedRecPtr = xlogreader->abortedRecPtr;
>
> So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff
> but to check StandbyModeRequested = false before setting
> missingContrecPtr.
>
> Thank you
> ---
> Sami Imseih
> Amazon Web Services
>



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Nathan Bossart
Дата:
Сообщение: Re: replacing role-level NOINHERIT with a grant-level option
Следующее
От: Robert Haas
Дата:
Сообщение: pg_auth_members.grantor is bunk