Обсуждение: PGCTLTIMEOUT in pg_regress, or skink versus the clock

Поиск
Список
Период
Сортировка

PGCTLTIMEOUT in pg_regress, or skink versus the clock

От
Tom Lane
Дата:
Buildfarm member skink has failed three times recently like this:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-04-15%2001%3A20%3A44

the relevant part of that being

pg_regress: postmaster did not respond within 60 seconds
Examine /home/andres/build/buildfarm/REL9_5_STABLE/pgsql.build/src/interfaces/ecpg/test/log/postmaster.log for the
reason

where the postmaster log shows nothing particularly surprising,
it's just not reached the ready state yet:

LOG:  database system was shut down at 2016-04-15 05:11:18 UTC
FATAL:  the database system is starting up
LOG:  MultiXact member wraparound protections are now enabled


Now, there are some reasons to suspect that there might be more here than
meets the eye; for one thing, it stretches credulity a bit to believe that
it's only random chance that all three failures are in the 9.5 branch and
all are in the ecpg regression test step.  I'm also curious as to why we
see only one "FATAL: the database system is starting up" connection
rejection and not sixty.  However, by far the simplest explanation for
this failure is just that the postmaster took more than 60 seconds to
start up; and seeing that skink is running Valgrind and is on an AWS
instance, that's not that much of a stretch of credulity either.

Hence, I am thinking that we missed a bet in commit 2ffa86962077c588
et al, and that pg_regress's hard-wired 60-second start timeout ought to
be overridable from an environment variable just as pg_ctl's timeout is.
It might as well be the same environment variable, so I propose the
attached patch.  Note that since the shutdown end of things in pg_regress
uses "pg_ctl stop", that end of it already responds to PGCTLTIMEOUT.
(I could not find any user-facing documentation for pg_regress, so there's
no apparent need for a docs update.)

Any objections?

            regards, tom lane

diff --git a/src/test/regress/pg_regress.c b/src/test/regress/pg_regress.c
index 2f6f56d..574f5b8 100644
*** a/src/test/regress/pg_regress.c
--- b/src/test/regress/pg_regress.c
*************** regression_main(int argc, char *argv[],
*** 2185,2190 ****
--- 2185,2192 ----
      if (temp_instance)
      {
          FILE       *pg_conf;
+         const char *env_wait;
+         int            wait_seconds;

          /*
           * Prepare the temp instance
*************** regression_main(int argc, char *argv[],
*** 2335,2345 ****
          }

          /*
!          * Wait till postmaster is able to accept connections (normally only a
!          * second or so, but Cygwin is reportedly *much* slower).  Don't wait
!          * forever, however.
           */
!         for (i = 0; i < 60; i++)
          {
              /* Done if psql succeeds */
              if (system(buf2) == 0)
--- 2337,2359 ----
          }

          /*
!          * Wait till postmaster is able to accept connections; normally this
!          * is only a second or so, but Cygwin is reportedly *much* slower, and
!          * test builds using Valgrind or similar tools might be too.  Hence,
!          * allow the default timeout of 60 seconds to be overridden from the
!          * PGCTLTIMEOUT environment variable.
           */
!         env_wait = getenv("PGCTLTIMEOUT");
!         if (env_wait != NULL)
!         {
!             wait_seconds = atoi(env_wait);
!             if (wait_seconds <= 0)
!                 wait_seconds = 60;
!         }
!         else
!             wait_seconds = 60;
!
!         for (i = 0; i < wait_seconds; i++)
          {
              /* Done if psql succeeds */
              if (system(buf2) == 0)
*************** regression_main(int argc, char *argv[],
*** 2360,2368 ****

              pg_usleep(1000000L);
          }
!         if (i >= 60)
          {
!             fprintf(stderr, _("\n%s: postmaster did not respond within 60 seconds\nExamine %s/log/postmaster.log for
thereason\n"), progname, outputdir); 

              /*
               * If we get here, the postmaster is probably wedged somewhere in
--- 2374,2383 ----

              pg_usleep(1000000L);
          }
!         if (i >= wait_seconds)
          {
!             fprintf(stderr, _("\n%s: postmaster did not respond within %d seconds\nExamine %s/log/postmaster.log for
thereason\n"), 
!                     progname, wait_seconds, outputdir);

              /*
               * If we get here, the postmaster is probably wedged somewhere in

Re: PGCTLTIMEOUT in pg_regress, or skink versus the clock

От
Noah Misch
Дата:
On Wed, Apr 20, 2016 at 06:38:56PM -0400, Tom Lane wrote:
> I am thinking that we missed a bet in commit 2ffa86962077c588
> et al, and that pg_regress's hard-wired 60-second start timeout ought to
> be overridable from an environment variable just as pg_ctl's timeout is.
> It might as well be the same environment variable, so I propose the
> attached patch.

> Any objections?

Looks reasonable.



Re: PGCTLTIMEOUT in pg_regress, or skink versus the clock

От
Tom Lane
Дата:
Noah Misch <noah@leadboat.com> writes:
> On Wed, Apr 20, 2016 at 06:38:56PM -0400, Tom Lane wrote:
>> I am thinking that we missed a bet in commit 2ffa86962077c588
>> et al, and that pg_regress's hard-wired 60-second start timeout ought to
>> be overridable from an environment variable just as pg_ctl's timeout is.
>> It might as well be the same environment variable, so I propose the
>> attached patch.

>> Any objections?

> Looks reasonable.

Pushed.

I did some more testing here and concluded that slow postmaster startup is
almost certainly the right explanation for skink's problems.  On my
otherwise-idle workstation, postmaster startup under valgrind takes about
10 seconds, of which six or seven seem to involve valgrind just collecting
its thoughts :-(.  The postmaster's socket file does not appear until nine
seconds in, and then by ten seconds it is ready to accept connections.
So that's how come I see just one "FATAL: the database system is starting
up" log entry --- pg_regress's previous eight launches of pg_ctl just failed
with "no such socket file".  So I now think the observed failures on skink
can be explained by supposing that valgrind sometimes takes around a
minute to start up on that platform.  The skink log I quoted before would
fit with the postmaster almost but not quite reaching "ready" status
before pg_ctl's timeout expires.  The other two skink failures actually
have empty postmaster log files, suggesting that valgrind was so slow that
we didn't even get to the "database system was shut down" log message.
        regards, tom lane