9.4 HEAD: select() failed in postmaster

From: Jeff Janes
Subject: 9.4 HEAD: select() failed in postmaster
Date:
Msg-id: CAMkU=1zqrj-r4u0EMWUzUbrAbnRBwi-SHsVf=xU7VvZzUu4zyg@mail.gmail.com
In reply to: Re: 9.4 HEAD: select() failed in postmaster  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses: Re: 9.4 HEAD: select() failed in postmaster  ("MauMau" <maumau307@gmail.com>)
List: pgsql-hackers
On Wednesday, September 11, 2013, Alvaro Herrera wrote:
Noah Misch wrote:
> On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:

> > I think the problem is here, where there should be a Max rather than a Min:
> >
> > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> > Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > Date:   Fri Jun 28 17:20:53 2013 -0400
> >
> >     Send SIGKILL to children if they don't die quickly in immediate shutdown
> >
> > ...
> >
> > +           /* remaining time, but at least 1 second */
> > +           timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> > +                                 (time(NULL) - AbortStartTime), 1);
>
> Agreed; good catch.

Yeah, thanks.  Should be a Max().  The current coding presumably makes
it use one second most of the time, instead of whatever the remaining
time is ... until the abort time is past, in which case it causes the
whole thing to break down as reported.

It might very well be that I used Max() there initially and changed to
Min() at the last minute before commit in a moment of brain fade.
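To make the Min/Max difference concrete, here is a minimal standalone sketch, not the postmaster code itself. The helper names timeout_with_min and timeout_with_max are hypothetical, elapsed seconds are passed in directly instead of calling time(NULL), and the value 5 for SIGKILL_CHILDREN_AFTER_SECS is taken from the "5 second clock" mentioned in this thread.

```c
#include <assert.h>

/* Assumed value, based on the 5-second clock mentioned in the thread. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

/*
 * As committed: the result is capped at 1 second while time remains,
 * and drops to zero or below once the abort deadline has passed.
 */
long
timeout_with_min(long elapsed_secs)
{
    assert(elapsed_secs >= 0);
    return Min(SIGKILL_CHILDREN_AFTER_SECS - elapsed_secs, 1);
}

/*
 * As intended: the remaining time, but never less than 1 second,
 * so the value stored in tv_sec is always positive.
 */
long
timeout_with_max(long elapsed_secs)
{
    assert(elapsed_secs >= 0);
    return Max(SIGKILL_CHILDREN_AFTER_SECS - elapsed_secs, 1);
}
```

With 2 seconds elapsed, the Min version yields 1 where the Max version yields the remaining 3. With 8 seconds elapsed, the Min version yields -3. A negative tv_sec is the kind of timeout select() is expected to reject, which would match the failure in the subject line.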

I've implemented the Min-to-Max change and done some more testing.  Now I have a different but related problem (which I also saw before, but less often than the select() one): the 5-second clock doesn't get turned off.  So after all processes have ended and a new startup process is launched, if that startup process doesn't report back to the postmaster soon enough, it gets SIGKILLed.

In postmaster.c, near line 1681:

        if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
            now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but which states to test I don't really know.

I've added "&& (pmState > PM_RUN)" and have not had any more failures, so I think this is on the right path, but testing an enum for inequality (that is, relying on the declaration order of its values) feels wrong.
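For illustration only, the effect of that extra test can be sketched in a standalone form. Everything here is a simplified stand-in and not the real postmaster.c: the enum is a trimmed, hypothetical DemoPMState whose ordering is an assumption, the function names are made up, and the global variables from the snippet above are passed in as plain parameters.

```c
#include <stdbool.h>

/* Assumed value, based on the 5-second clock mentioned in the thread. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

/*
 * Trimmed stand-in for the postmaster state enum.  The ordering
 * (startup < run < waiting states) is an assumption of this sketch.
 */
typedef enum
{
    PM_STARTUP,
    PM_RUN,
    PM_WAIT_BACKENDS,
    PM_NO_CHILDREN
} DemoPMState;

/* The condition as quoted above, with globals turned into parameters. */
bool
should_sigkill_original(bool immediate_shutdown, bool fatal_error,
                        bool send_stop, long elapsed_secs)
{
    return (immediate_shutdown || (fatal_error && !send_stop)) &&
        elapsed_secs >= SIGKILL_CHILDREN_AFTER_SECS;
}

/* The same condition with the extra "pmState > PM_RUN" test added. */
bool
should_sigkill_guarded(bool immediate_shutdown, bool fatal_error,
                       bool send_stop, long elapsed_secs,
                       DemoPMState pm_state)
{
    return should_sigkill_original(immediate_shutdown, fatal_error,
                                   send_stop, elapsed_secs) &&
        pm_state > PM_RUN;
}
```

In this sketch, a relaunched startup process (state PM_STARTUP, FatalError still set, more than 5 seconds since the abort began) trips the original condition but not the guarded one, while a postmaster still waiting for children in PM_WAIT_BACKENDS trips both. The guard only works as long as every state that should allow the SIGKILL sorts after PM_RUN, which is the fragility of comparing enum values by order.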

Alternatively perhaps FatalError can get cleared when startup is launched, rather than when WAL replay begins.  But I assume it was done the way it is for a reason, even though I don't know that reason.

Cheers,

Jeff
