Re: PANIC during crash recovery of a recently promoted standby

Поиск
Список
Период
Сортировка
От Pavan Deolasee
Тема Re: PANIC during crash recovery of a recently promoted standby
Дата
Msg-id CABOikdO4BPz-h6myYB255ZrrJzmC66g4w6Tb79uHAqbmA3nC3w@mail.gmail.com
обсуждение исходный текст
Ответ на Re: PANIC during crash recovery of a recently promoted standby  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Ответы Re: PANIC during crash recovery of a recently promoted standby  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers


On Fri, May 11, 2018 at 8:39 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Michael Paquier wrote:
> On Thu, May 10, 2018 at 10:52:12AM +0530, Pavan Deolasee wrote:
> > I propose that we should always clear the minRecoveryPoint after promotion
> > to ensure that crash recovery always run to the end if a just-promoted
> > standby crashes before completing its first regular checkpoint. A WIP patch
> > is attached.
>
> I have been playing with your patch and upgraded the test to check as
> well for cascading standbys.  We could use that in the final patch.
> That's definitely something to add in the recovery test suite, and the
> sleep phases should be replaced by waits on replay and/or flush.
>
> Still, that approach looks sensitive to me.  A restart point could be
> running while the end-of-recovery record is inserted, so your patch
> could update minRecoveryPoint to InvalidXLogRecPtr, and then a restart
> point would happily update again the control file's minRecoveryPoint to
> lastCheckPointEndPtr because it would see that the former is older than
> lastCheckPointEndPtr (let's not forget that InvalidXLogRecPtr is 0), so
> you could still crash on invalid pages?

Yeah, I had this exact comment, but I was unable to come up with a test
case that would cause a problem.

Looks like I didn't understand Alvaro's comment when he mentioned it to me off-list. But I now see what Michael and Alvaro mean and that indeed seems like a problem. I was thinking that the test for (ControlFile->state == DB_IN_ARCHIVE_RECOVERY) will ensure that minRecoveryPoint can't be updated after the standby is promoted. While that's true for a DB_IN_PRODUCTION,  the RestartPoint may finish after we have written end-of-recovery record, but before we're in production and thus the minRecoveryPoint may again be set.
 

> I need to think a bit more about that stuff, but one idea would be to
> use a special state in the control file to mark it as ending recovery,
> this way we would control race conditions with restart points.

Hmm.  Can we change the control file in released branches?  (It should
be possible to make the new server understand both old and new formats,
but I think this is breaking new ground and it looks easy to introduce
more bugs there.)


Can't we just remember that in shared memory state instead of writing to the control file? So if we've already performed end-of-recovery, we don't update the minRecoveryPoint when RestartPoint completes. 

Thanks,
Pavan

--
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Kyotaro HORIGUCHI
Дата:
Сообщение: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?
Следующее
От: Hubert Zhang
Дата:
Сообщение: Re: Considering signal handling in plpython again