Re: Buildfarm owners: check if your HEAD build is stuck
From | Andrew Dunstan |
---|---|
Subject | Re: Buildfarm owners: check if your HEAD build is stuck |
Date | |
Msg-id | 44DE8DC6.2010903@dunslane.net |
In reply to | Buildfarm owners: check if your HEAD build is stuck (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Buildfarm owners: check if your HEAD build is stuck |
List | pgsql-hackers |
Tom Lane wrote:
> A number of the buildfarm machines have been failing HEAD builds
> at the "make check" stage since last night, with complaints like
> this one from emu:
>
> ================== pgsql.21911/src/test/regress/log/postmaster.log ===================
> FATAL: lock file "/tmp/.s.PGSQL.55678.lock" already exists
> HINT: Is another postmaster (PID 23692) using socket file "/tmp/.s.PGSQL.55678"?
>
> What's happened is that that GUC patch that was in the tree for a few
> hours broke postmaster startup on some machines (for as-yet-unidentified
> reasons). The postmaster does actually start and establish its
> lockfiles, but it never gets to the stage of being able to accept
> connections.
>
> After the buildfarm script rm -rf's the build tree, the postmaster
> process is still there but "disembodied" (its executable file is
> probably gone, for example, or at least in the state of zero remaining
> directory links). But it's still got that socket file and lockfile
> in /tmp, and this prevents another postmaster from starting with the
> same port number.
>
> If you've got this situation, you'll need to do a manual "kill" on the
> PID mentioned in the lock file before things will start working again.
> (pg_ctl won't work because it looks for the data directory
> postmaster.pid file, which is long gone.) More generally you might want
> to look through a ps listing for unexpected postgres-owned processes.
>
> I'm not sure whether there's anything much we can do to prevent such
> problems in future. Maybe it'd be reasonable for pg_regress to do a
> kill -9 on its postmaster child process if it gives up waiting for the
> postmaster to accept connections.

That's amazingly ugly, and well diagnosed.

BTW, buildfarm processes would typically not be postgres-owned, at least not on my machines. I run either as myself or as a special buildfarm user.
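Before doing the manual kill Tom describes, an owner (or a hardened buildfarm script) could first check whether the PID recorded in the stale lock file is still alive. The following is a minimal POSIX sketch in C, not PostgreSQL code: `lockfile_pid` and `pid_exists` are hypothetical helper names, and it assumes the first line of the socket lock file holds the owning postmaster's PID, as the HINT message suggests.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return the PID stored on the first line of a
 * PostgreSQL lock file (assumed layout), or -1 if it cannot be read.
 */
static long lockfile_pid(const char *lockfile)
{
    FILE *fp;
    long  pid = -1;

    assert(lockfile != NULL);
    fp = fopen(lockfile, "r");
    if (fp == NULL)
        return -1;
    if (fscanf(fp, "%ld", &pid) != 1 || pid <= 0)
        pid = -1;
    fclose(fp);
    return pid;
}

/*
 * Hypothetical helper: return 1 if a process with this PID exists, else 0.
 * Signal 0 performs only the existence and permission checks; EPERM means
 * the process exists but belongs to another user (e.g. a buildfarm user).
 */
static int pid_exists(long pid)
{
    if (pid <= 0)
        return 0;   /* never pass 0 or a negative PID: they address groups */
    if (kill((pid_t) pid, 0) == 0)
        return 1;
    return errno == EPERM;
}
```

In emu's case, `lockfile_pid("/tmp/.s.PGSQL.55678.lock")` would be expected to yield 23692; if `pid_exists()` then reports it alive, that is the process to kill by hand, since pg_ctl cannot find it without postmaster.pid.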
I'm trying to think how we could harden the buildfarm script to avoid such situations, although so far I am without any great revelations. The idea of getting pg_regress to send a signal isn't bad, but what if the PID gets reused? We know not all systems allocate PIDs in a cyclical fashion.

Also, I see the pg_regress code has this comment:

    /*
     * Fail immediately if postmaster has exited
     *
     * XXX is there a way to do this on Windows?
     */

As I understand it, the way to do it is to call OpenProcess(): if that succeeds, the process is still there. I guess if needed we could even do that in src/port/kill.c so that kill(pid, 0) would work. But I would want confirmation from the Windows gurus.

cheers

andrew
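Tom's suggestion, combined with the "fail immediately if postmaster has exited" check, can be sketched in C for POSIX systems. This is a minimal sketch, not pg_regress's actual code: `wait_or_kill`, `ready_fn` and `stub_server_ready` are hypothetical names, the readiness probe stands in for whatever connection test the real code uses, and it assumes the postmaster is a direct child of the waiting process. It also bears on the PID-reuse worry: under POSIX, a child that has not yet been reaped with `waitpid()` keeps its PID (as a zombie if it has already exited), so a parent that only signals before reaping cannot hit an unrelated process. Windows would need a different liveness check, such as the OpenProcess() approach mentioned above, which is not sketched here.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical readiness probe, e.g. "does a connection attempt succeed?" */
typedef bool (*ready_fn)(void);

/* Hypothetical stub that never reports ready, to exercise the give-up path. */
static bool stub_server_ready(void)
{
    return false;
}

/*
 * Poll once a second for up to timeout_secs.  Returns 0 if the server
 * became ready, 1 if the child exited on its own (and has been reaped),
 * 2 if we gave up and killed it with SIGKILL, -1 if waitpid() fails
 * (e.g. the PID is not our child).
 *
 * The child is signalled only before it has been reaped, so its PID is
 * still reserved for it (running or zombie) and cannot have been reused.
 */
static int wait_or_kill(pid_t child, int timeout_secs, ready_fn server_ready)
{
    assert(child > 0);  /* kill(0, ...) or kill(-1, ...) would hit many processes */

    for (int i = 0; i < timeout_secs; i++)
    {
        int   status;
        pid_t r;

        if (server_ready())
            return 0;

        /* Fail immediately if the child has already exited. */
        r = waitpid(child, &status, WNOHANG);
        if (r == child)
            return 1;
        if (r < 0)
            return -1;  /* not our child (or already reaped): never signal it */

        sleep(1);
    }

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);    /* reap it so no zombie is left behind */
    return 2;
}
```

In the failure Tom describes, the probe would keep failing, so this path would end with SIGKILL and a reaped child instead of a disembodied postmaster holding the socket lock file after the build tree is removed.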