Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
| От | Hsu, John |
|---|---|
| Тема | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion |
| Дата | |
| Msg-id | 0071e7fe-b1a9-e38c-5706-657913fde1fb@amazon.com обсуждение исходный текст |
| Ответ на | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion ("Imseih (AWS), Sami" <simseih@amazon.com>) |
| Список | pgsql-hackers |
Hi,
As an update we have a test suite running for the last few days with
constant workload and failovers/promotions. With the proposed change
from Sami we no longer see PANICs due to this.
The comment indicates that the abortedRecPtr and missingContRecPtr
should only be set when the database is a writer since it's used to
write a new record to downstream readers. !StandbyMode is a poor proxy
for this as Sami mentioned since there's cases when a replica is going
through crash recovery that would then set this record.
Thanks,
John H
On 5/27/22 12:01 PM, Imseih (AWS), Sami wrote:
>> So.. I'd like to hear exactly what you did as the testing.
>> When standby mode is requested, if crash recovery fails with immature
>> contrecord, I think we shouldn't continue recovery. But I'm not sure
>> we need to explictly reject that case. Further study is needed..
>
> Here is more details about my findings and testing.
>
> Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff
> we still see the issue with post promotion checkpoint
> resulting in "request to flush past end of generated WAL;"
>
> i.e.
>
> 2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time
> 2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos
7/A8000060
> 2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to
7/A8000060
> 2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted,
commandsignored until end of transaction block
>
> The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff
> was to make sure the standby skips the overwrite contrecord.
>
> However, we still see "missingContrecPtr" is being set
> on the standby before promotion and after the instance is
> promoted, the missingContrecPtr is written to WAL
> and the subsequent flush throws a "PANIC: xlog flush request"
>
> To Reproduce using TAP tests;
>
> 1) apply the attached patch 0001-Patch_to_repro.patch to the head branch.
> This patch adds logging when an OverWriteContRecord
> is created and comments out the invalidation code
> added in commit 9d92582abf918215d27659d45a4c9e78bda50aff
>
> 2) Run a tap test under "test/recovery"
> make check PROVE_TESTS='t/026_overwrite_contrecord.pl'
>
> 3) What is observed in the logfiles for both the standby
> and primary instance in the tap test is
> that an overwrite contrecord is created on the primary,
> which is correct, but also on the standby after promotion,
> which is incorrect. This is incorrect as the contrecord
>
> simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo'
> 2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
>
> simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo'
> 2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request
> 2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000
> simseih@88665a22795f recovery %
>
> What we found:
>
> 1. missingContrecPtr is set when
> StandbyMode is false, therefore
> only a writer should set this value
> and a record is then sent downstream.
>
> But a standby going through crash
> recovery will always have StandbyMode = false,
> causing the missingContrecPtr to be incorrectly
> set.
>
> 2. If StandbyModeRequested is checked instead,
> we ensure that a standby will not set a
> missingContrecPtr.
>
> 3. After applying the patch below, the tap test succeeded
>
> diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
> index 5ee90b6..a727aaf 100644
> --- a/src/backend/access/transam/xlogrecovery.c
> +++ b/src/backend/access/transam/xlogrecovery.c
> @@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode,
> * we'll write a record to indicate to downstream WAL readers that
> * that portion is to be ignored.
> */
> - if (!StandbyMode &&
> + if (!StandbyModeRequested &&
> !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr))
> {
> abortedRecPtr = xlogreader->abortedRecPtr;
>
> So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff
> but to check StandbyModeRequested = false before setting
> missingContrecPtr.
>
> Thank you
> ---
> Sami Imseih
> Amazon Web Services
>
В списке pgsql-hackers по дате отправления: