Re: Idea for improving buildfarm robustness

From: Andrew Dunstan
Subject: Re: Idea for improving buildfarm robustness
Msg-id 560AE191.8060504@dunslane.net
In response to: Idea for improving buildfarm robustness  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Idea for improving buildfarm robustness  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

On 09/29/2015 02:48 PM, Tom Lane wrote:
> A problem the buildfarm has had for a long time is that if for some reason
> the scripts fail to stop a test postmaster, the postmaster process will
> hang around and cause subsequent runs to fail because of socket conflicts.
> This seems to have gotten a lot worse lately due to the influx of very
> slow buildfarm machines, but the risk has always been there.
>
> I've been thinking about teaching the buildfarm script to "kill -9"
> any postmasters left around at the end of the run, but that's fairly
> problematic: how do you find such processes (since "ps" output isn't
> hugely portable, especially not to Windows), and how do you tell them
> apart from postmasters not started by the script?  So the idea was on
> hold.
>
> But today I thought of another way: suppose that we teach the postmaster
> to commit hara-kiri if the $PGDATA directory goes away.  Since the
> buildfarm script definitely does remove all the temporary data directories
> it creates, this ought to get the job done.
>
> An easy way to do that would be to have it check every so often if
> pg_control can still be read.  We should not have it fail on ENFILE or
> EMFILE, since that would create a new failure hazard under heavy load,
> but ENOENT or similar would be reasonable grounds for deciding that
> something is horribly broken.  (At least on Windows, failing on EPERM
> doesn't seem wise either, since we've seen antivirus products randomly
> causing such errors.)
>
> I wouldn't want to do this every time through the postmaster's main loop,
> but we could do this once an hour for no added cost by adding the check
> where it does TouchSocketLockFiles; or once every few minutes if we
> carried a separate variable like last_touch_time.  Once an hour would be
> plenty to fix the buildfarm's problem, I should think.
>
> Another question is what exactly "commit hara-kiri" should consist of.
> We could just abort() or _exit(1) and leave it to child processes to
> notice that the postmaster is gone, or we could make an effort to clean
> up.  I'd be a bit inclined to treat it like a SIGQUIT situation, ie
> kill all the children and exit.  The children are probably having
> problems of their own if the data directory's gone, so forcing
> termination might be best to keep them from getting stuck.
>
> Also, perhaps we'd only enable this behavior in --enable-cassert builds,
> to avoid any risk of a postmaster incorrectly choosing to suicide in a
> production scenario.  Or maybe that's overly conservative.
>
> Thoughts?
>
>

It's a fine idea. This is much more likely to be robust than any 
buildfarm client fix.

Not every buildfarm member uses cassert, so I'm not sure that's the best 
way to go. axolotl doesn't, and it's one of those that regularly has 
speed problems. Maybe a not-very-well-publicized GUC, or an environment 
setting? Or maybe just enable this anyway. If the data directory is gone 
what's the point in keeping the postmaster around? Shutting it down 
doesn't seem likely to cause any damage.
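
For concreteness, the pg_control check Tom describes might look roughly like the sketch below. This is only an illustration under stated assumptions, not actual postmaster code: the function names errno_means_datadir_gone and data_directory_still_exists are made up, and treating ENOTDIR as "similar" to ENOENT is a guess at what "or similar" would cover. ENFILE, EMFILE and EPERM are deliberately not treated as fatal, per the concerns about heavy load and Windows antivirus products.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper: decide whether an errno from trying to open
 * pg_control means the data directory has really gone away.
 */
static bool
errno_means_datadir_gone(int err)
{
    switch (err)
    {
        case ENOENT:            /* file or a parent directory is missing */
        case ENOTDIR:           /* assumption: counts as "or similar" */
            return true;
        default:
            /*
             * ENFILE/EMFILE (out of file descriptors), EPERM/EACCES
             * (e.g. antivirus interference) and anything else are
             * treated as transient, so the postmaster keeps running.
             */
            return false;
    }
}

/*
 * Hypothetical check: returns false only when pg_control cannot be
 * opened for a reason that indicates the data directory was removed.
 */
static bool
data_directory_still_exists(const char *control_path)
{
    FILE       *fp;

    assert(control_path != NULL);

    fp = fopen(control_path, "rb");
    if (fp != NULL)
    {
        fclose(fp);
        return true;
    }
    return !errno_means_datadir_gone(errno);
}
```

A caller would run data_directory_still_exists() at whatever interval is chosen (hourly alongside TouchSocketLockFiles, or every few minutes with a separate timestamp) and start shutdown only when it returns false.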


cheers

andrew


