9.4 HEAD: select() failed in postmaster

From	Jeff Janes
Subject	9.4 HEAD: select() failed in postmaster
Date	13 September 2013, 00:14:06
Msg-id	CAMkU=1zqrj-r4u0EMWUzUbrAbnRBwi-SHsVf=xU7VvZzUu4zyg@mail.gmail.com
In reply to	Re: 9.4 HEAD: select() failed in postmaster (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses	Re: 9.4 HEAD: select() failed in postmaster
List	pgsql-hackers

On Wednesday, September 11, 2013, Alvaro Herrera wrote:

Noah Misch wrote:
> On Tue, Sep 10, 2013 at 05:18:21PM -0700, Jeff Janes wrote:

> > I think the problem is here, where there should be a Max rather than a Min:
> >
> > commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
> > Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> > Date: Fri Jun 28 17:20:53 2013 -0400
> >
> > Send SIGKILL to children if they don't die quickly in immediate shutdown
> >
> > ...
> >
> > + /* remaining time, but at least 1 second */
> > + timeout->tv_sec = Min(SIGKILL_CHILDREN_AFTER_SECS -
> > + (time(NULL) - AbortStartTime), 1);
>
> Agreed; good catch.

Yeah, thanks. Should be a Max(). The current coding presumably makes
it use one second most of the time, instead of whatever the remaining
time is ... until the abort time is past, in which case it causes the
whole thing to break down as reported.

It might very well be that I used Max() there initially and changed to
Min() at the last minute before commit in a moment of brain fade.
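
To make the Min/Max difference concrete, here is a minimal standalone sketch, not the postmaster code itself. The Min and Max macros are local stand-ins, the helper names timeout_with_min and timeout_with_max are hypothetical, and the 5-second value is taken from the "5 second clock" mentioned later in this message. With Min(), the result is capped at 1 second while time remains and goes negative once the deadline has passed; a negative tv_sec is the kind of value select() rejects, which would fit the reported failure.

```c
#include <assert.h>

/* Local stand-ins for the macros used in the quoted patch (assumption). */
#define Min(x, y) ((x) < (y) ? (x) : (y))
#define Max(x, y) ((x) > (y) ? (x) : (y))

/* Assumed to be 5, per the "5 second clock" described in the thread. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

/* As committed: never more than 1 second, and negative after the deadline. */
static long
timeout_with_min(long now, long abort_start)
{
    assert(now >= abort_start);     /* the clock is assumed not to run backwards */
    return Min(SIGKILL_CHILDREN_AFTER_SECS - (now - abort_start), 1);
}

/* As intended: the remaining time, but at least 1 second. */
static long
timeout_with_max(long now, long abort_start)
{
    assert(now >= abort_start);
    return Max(SIGKILL_CHILDREN_AFTER_SECS - (now - abort_start), 1);
}
```

Two seconds into the abort, the Min() form yields 1 where the Max() form yields the remaining 3; ten seconds in, the Min() form yields -5 where the Max() form yields 1.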

I've made the Min-to-Max change and done some more testing. Now I have a different but related problem (which I also saw before, but less often than the select() one): the 5-second clock doesn't get turned off. So after all processes end and a new startup process is launched, if that startup process doesn't report back to the postmaster soon enough, it gets SIGKILLed.

postmaster.c near line 1681:

if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
    now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs an additional AND-test of pmState, but I don't really know which states to test.

I've added "&& (pmState > PM_RUN)" and have not had any more failures, so I think this is on the right path, but testing an enum for inequality feels wrong.
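
As a rough illustration of the guarded condition described above, here is a minimal standalone sketch. The PMState enum below is a simplified, hypothetical subset whose ordering is an assumption, the function name should_sigkill_children is made up for this example, and the boolean parameters stand in for the Shutdown, FatalError and SendStop tests in the quoted condition. It is not the actual postmaster code and does not settle which states ought to be tested.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified, hypothetical subset of postmaster states; ordering is assumed. */
typedef enum
{
    PM_INIT,
    PM_STARTUP,
    PM_RECOVERY,
    PM_RUN,
    PM_WAIT_BACKENDS,
    PM_NO_CHILDREN
} PMState;

/* Assumed to be 5, per the "5 second clock" described in this message. */
#define SIGKILL_CHILDREN_AFTER_SECS 5

/*
 * Returns true when children should be sent SIGKILL: an immediate shutdown
 * or crash abort is in progress, the postmaster is past PM_RUN (the extra
 * guard being tried out), and the grace period has elapsed.
 */
static bool
should_sigkill_children(bool immediate_shutdown, bool fatal_error,
                        bool send_stop, PMState pm_state,
                        long now, long abort_start)
{
    assert(now >= abort_start);     /* the clock is assumed not to run backwards */
    return (immediate_shutdown || (fatal_error && !send_stop)) &&
        pm_state > PM_RUN &&
        now - abort_start >= SIGKILL_CHILDREN_AFTER_SECS;
}
```

With this guard, a freshly launched startup process (pm_state == PM_STARTUP here) is no longer killed even though the fatal-error flag is still set and the old abort timestamp is more than five seconds in the past. The inequality only works because of the declaration order of the enum, which is the fragility noted above.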

Alternatively, perhaps FatalError could be cleared when the startup process is launched, rather than when WAL replay begins. But I assume it was done the way it is for a reason, even though I don't know that reason.

Cheers,

Jeff
