Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
От | Kyotaro Horiguchi |
---|---|
Тема | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion |
Дата | |
Msg-id | 20220620.202830.1255811739875634656.horikyota.ntt@gmail.com обсуждение исходный текст |
Ответ на | Re: [BUG] Panic due to incorrect missingContrecPtr after promotion (Michael Paquier <michael@paquier.xyz>) |
Ответы |
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion
(Robert Haas <robertmhaas@gmail.com>)
Re: [BUG] Panic due to incorrect missingContrecPtr after promotion ("Imseih (AWS), Sami" <simseih@amazon.com>) |
Список | pgsql-hackers |
At Mon, 20 Jun 2022 16:13:43 +0900, Michael Paquier <michael@paquier.xyz> wrote in > On Fri, May 27, 2022 at 07:01:37PM +0000, Imseih (AWS), Sami wrote: > > What we found: > > > > 1. missingContrecPtr is set when > > StandbyMode is false, therefore > > only a writer should set this value > > and a record is then sent downstream. > > > > But a standby going through crash > > recovery will always have StandbyMode = false, > > causing the missingContrecPtr to be incorrectly > > set. > > That stands as true as far as I know, StandbyMode would be switched > only once we get out of crash recovery, and only if archive recovery > completes when there is a restore_command. Anyway the change; - abortedRecPtr = InvalidXLogRecPtr; - missingContrecPtr = InvalidXLogRecPtr; + //abortedRecPtr = InvalidXLogRecPtr; + //missingContrecPtr = InvalidXLogRecPtr; Is injecting a bug that invalid "aborted contrecord" record can be injected for certain conditions. If a bug is intentionally injected, it's quite natrual that some behavior gets broken.. > > 2. If StandbyModeRequested is checked instead, > > we ensure that a standby will not set a > > missingContrecPtr. > > > > 3. After applying the patch below, the tap test succeeded > > Hmm. I have not looked at that in depth, but if the intention is to > check that the database is able to write WAL, looking at > XLogCtl->SharedRecoveryState would be the way to go because that's the > flip switching between crash recovery, archive recovery and the end of > recovery (when WAL can be safely written). What we are checking there is "we are going to write WAL", not "we are writing". (!StanbyMode && StandbyModeRequested) means the server have been running crash-recovery before starting archive recovery. In that case, the server is supposed to continue with archived WAL without insering a record. However, if no archived WAL available and the server promoted, the server needs to insert an "aborted contrecord" record again. (I'm not sure how that case happens in the field, though.) So I don't think !StandbyModeRequested is the right thing here. Actually the attached test fails with the fix. > The check in xlogrecovery_redo() still looks like a good thing to have > anyway, because we know that we can safely skip the contrecord. Now, > for any patch produced, could the existing TAP test be extended so as > we are able to get a PANIC even if we keep around the sanity check in > xlogrecovery_redo(). That would likely involve an immediate shutdown > of a standby followed by a start sequence? Thus, I still don't see what have happened at Imseih's hand, but I can cause PANIC with a bit tricky steps, which I don't think valid. This is what I wanted to know the exact steps to cause the PANIC. The attached 1 is the PoC of the TAP test (it uses system()..), and the second is a tentative fix for that. (I don't like the fix, too, though...) regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Вложения
В списке pgsql-hackers по дате отправления: