Re: 9.4 HEAD: select() failed in postmaster
From | MauMau |
---|---|
Subject | Re: 9.4 HEAD: select() failed in postmaster |
Date | |
Msg-id | 53F0692AB35345348E29FB46F33127E3@maumau |
In response to | 9.4 HEAD: select() failed in postmaster (Jeff Janes <jeff.janes@gmail.com>) |
Responses |
Re: 9.4 HEAD: select() failed in postmaster
|
List | pgsql-hackers |
From: "Jeff Janes" <jeff.janes@gmail.com>
--------------------------------------------------
I've implemented the Min to Max change and did some more testing. Now I have a different but related problem (which I also saw before, but less often than the select() one). The 5 second clock doesn't get turned off. So after all processes end, and a new startup is launched, if that startup doesn't report back to the postmaster soon enough, it gets SIGKILLED.

postmaster.c near line 1681

    if ((Shutdown >= ImmediateShutdown || (FatalError && !SendStop)) &&
        now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)

It seems like this needs to have an additional and-test of pmState, but which states to test I don't really know. I've added in "&& (pmState>PM_RUN)" and have not had any more failures, so I think that this is on the right path but testing an enum for inequality feels wrong.
--------------------------------------------------

"AbortStartTime > 0" is also necessary to avoid sending SIGKILL repeatedly. I sent the attached patch during the original discussion. The below fragment is relevant:

--- 1663,1688 ----
              TouchSocketLockFiles();
              last_touch_time = now;
          }
+
+         /*
+          * When postmaster got an immediate shutdown request
+          * or some child terminated abnormally (FatalError case),
+          * postmaster sends SIGQUIT to all children except
+          * syslogger and dead_end ones, then wait for them to terminate.
+          * If some children didn't terminate within a certain amount of time,
+          * postmaster sends SIGKILL to them and wait again.
+          * This resolves, for example, the hang situation where
+          * a backend gets stuck in the call chain:
+          * free() acquires some lock -> <received SIGQUIT> ->
+          * quickdie() -> ereport() -> gettext() -> malloc() -> <lock acquisition>
+          */
+         if (AbortStartTime > 0 &&  /* SIGKILL only once */
+             (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
+             now - AbortStartTime >= 10)
+         {
+             SignalAllChildren(SIGKILL);
+             AbortStartTime = 0;
+         }
      }
  }

Regards
MauMau
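
To illustrate why the "AbortStartTime > 0" guard matters, here is a minimal standalone sketch of the timeout check. It is not the postmaster code: the enum, the globals, the stub signal_all_children_stub(), and the function check_sigkill_timeout() are hypothetical stand-ins, and the 5-second value is only taken from the "5 second clock" mentioned in the quoted message. The point it tries to show is that clearing AbortStartTime after the SIGKILL disarms the timer, so later passes through the loop do not kill a freshly launched startup process while FatalError is still set.

```c
#include <time.h>

/* Hypothetical stand-ins for postmaster.c state; not the real definitions. */
typedef enum
{
    NoShutdown,
    SmartShutdown,
    FastShutdown,
    ImmediateShutdown
} ShutdownMode;

#define SIGKILL_CHILDREN_AFTER_SECS 5   /* illustrative value only */

static ShutdownMode Shutdown = NoShutdown;
static int    FatalError = 0;       /* set when a child terminated abnormally */
static int    SendStop = 0;
static time_t AbortStartTime = 0;   /* 0 means "SIGKILL timer not armed" */
static int    sigkill_count = 0;    /* counts how often the stub was called */

/* Stub standing in for signalling the children; it only counts calls. */
static void
signal_all_children_stub(void)
{
    sigkill_count++;
}

/*
 * One pass of the timeout check, as it might run in the server loop.
 * The AbortStartTime > 0 test makes the SIGKILL a one-shot action:
 * once it has fired, the timer is cleared and the condition stays false
 * until something arms the timer again.
 */
static void
check_sigkill_timeout(time_t now)
{
    if (AbortStartTime > 0 &&
        (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
        now - AbortStartTime >= SIGKILL_CHILDREN_AFTER_SECS)
    {
        signal_all_children_stub();
        AbortStartTime = 0;     /* disarm: do not SIGKILL again */
    }
}
```

With FatalError set and AbortStartTime armed at some time t, a call with now = t + 2 does nothing, a call with now = t + 5 fires the stub once, and every later call is a no-op because the timer has been cleared. Without the guard, the elapsed-time test would stay true for as long as FatalError remained set.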