On 07/25/2015 10:59 AM, Tom Lane wrote:
> Now that we've restored proper logging of "make check", I looked into
> today's failure report from axolotl:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-07-24%2020%3A29%3A18
>
> What evidently happened there is that "pg_ctl start" gave up waiting for
> the postmaster to start too soon. The postmaster log appears to contain
>
> LOG: database system was shut down at 2015-07-24 16:45:40 EDT
> FATAL: the database system is starting up
> LOG: MultiXact member wraparound protections are now enabled
> LOG: database system is ready to accept connections
>
> which indicates that it did successfully come up, but not till after one
> "PQping" probe from pg_ctl, which was rejected with "still starting up".
> Meanwhile we've got this log output from pg_ctl:
>
> waiting for server to start........ stopped waiting
> pg_ctl: could not start server
> Examine the log output.
>
> Counting the dots indicates that pg_ctl gave up after precisely 5 seconds.
> Now, looking at the logic in pg_ctl's test_postmaster_connection(), the
> only explanation that seems to fit the observed output is that the stat()
> on the postmaster pidfile (at line 650 in HEAD) failed. It's not clear
> why though, since the postmaster was clearly still alive at this point,
> and we must have been able to read the pidfile earlier to construct a
> connection string, else there would have been no PQping attempt.
>
> Maybe the stat failed for some unexpected resource-exhaustion kind of
> reason?
>
> It seems plausible to me that we should change pg_ctl to only consider
> stat() failure to be a reason to give up waiting if errno is ENOENT,
> not anything else. At a minimum, I'd like to modify it to print the
> errno if it's anything else, so that we can confirm or deny this theory
> next time we see this buildfarm failure.
>
> Comments?
>
>
Certainly let's look at the errno.
cheers
andrdew