Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
От | Imseih (AWS), Sami |
---|---|
Тема | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion |
Дата | |
Msg-id | E37D2989-CFB6-4302-A871-A007DC71DA25@amazon.com обсуждение исходный текст |
Ответ на | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Ответы |
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
("Hsu, John" <hsuchen@amazon.com>)
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion (Michael Paquier <michael@paquier.xyz>) |
Список | pgsql-hackers |
> So.. I'd like to hear exactly what you did as the testing. > When standby mode is requested, if crash recovery fails with immature > contrecord, I think we shouldn't continue recovery. But I'm not sure > we need to explictly reject that case. Further study is needed.. Here is more details about my findings and testing. Even with commit 9d92582abf918215d27659d45a4c9e78bda50aff we still see the issue with post promotion checkpoint resulting in "request to flush past end of generated WAL;" i.e. 2022-05-25 00:35:38 UTC::@:[371]:LOG: checkpoint starting: immediate force wait wal time 2022-05-25 00:35:38 UTC::@:[371]:LOG: request to flush past end of generated WAL; request 10/B1FA3D88, currpos 7/A8000060 2022-05-25 00:35:38 UTC::@:[371]:PANIC: xlog flush request 10/B1FA3D88 is not satisfied --- flushed only to 7/A8000060 2022-05-25 00:35:38 UTC:172.31.26.238(38610):administrator@postgres:[23433]:ERROR: current transaction is aborted, commandsignored until end of transaction block The intent of commit 9d92582abf918215d27659d45a4c9e78bda50aff was to make sure the standby skips the overwrite contrecord. However, we still see "missingContrecPtr" is being set on the standby before promotion and after the instance is promoted, the missingContrecPtr is written to WAL and the subsequent flush throws a "PANIC: xlog flush request" To Reproduce using TAP tests; 1) apply the attached patch 0001-Patch_to_repro.patch to the head branch. This patch adds logging when an OverWriteContRecord is created and comments out the invalidation code added in commit 9d92582abf918215d27659d45a4c9e78bda50aff 2) Run a tap test under "test/recovery" make check PROVE_TESTS='t/026_overwrite_contrecord.pl' 3) What is observed in the logfiles for both the standby and primary instance in the tap test is that an overwrite contrecord is created on the primary, which is correct, but also on the standby after promotion, which is incorrect. This is incorrect as the contrecord simseih@88665a22795f recovery % cat tmp_check/log/*prim* | grep 'creating\|promo' 2022-05-27 13:17:50.843 CDT [98429] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000 simseih@88665a22795f recovery % cat tmp_check/log/*stan* | grep 'creating\|promo' 2022-05-27 13:17:51.361 CDT [98421] LOG: received promote request 2022-05-27 13:17:51.394 CDT [98421] LOG: creating overwrite contrecord at 0/2000058 for aborted_lsn 0/1FFD000 simseih@88665a22795f recovery % What we found: 1. missingContrecPtr is set when StandbyMode is false, therefore only a writer should set this value and a record is then sent downstream. But a standby going through crash recovery will always have StandbyMode = false, causing the missingContrecPtr to be incorrectly set. 2. If StandbyModeRequested is checked instead, we ensure that a standby will not set a missingContrecPtr. 3. After applying the patch below, the tap test succeeded diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c index 5ee90b6..a727aaf 100644 --- a/src/backend/access/transam/xlogrecovery.c +++ b/src/backend/access/transam/xlogrecovery.c @@ -2977,7 +2977,7 @@ ReadRecord(XLogPrefetcher *xlogprefetcher, int emode, * we'll write a record to indicate to downstream WAL readers that * that portion is to be ignored. */ - if (!StandbyMode && + if (!StandbyModeRequested && !XLogRecPtrIsInvalid(xlogreader->abortedRecPtr)) { abortedRecPtr = xlogreader->abortedRecPtr; So, it might be that the best fix is not the commit in 9d92582abf918215d27659d45a4c9e78bda50aff but to check StandbyModeRequested = false before setting missingContrecPtr. Thank you --- Sami Imseih Amazon Web Services
Вложения
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Andres FreundДата:
Сообщение: Re: suboverflowed subtransactions concurrency performance optimize