Tracing down buildfarm "postmaster does not shut down" failures

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Tracing down buildfarm "postmaster does not shut down" failures
Дата
Msg-id 17763.1454958948@sss.pgh.pa.us
обсуждение исходный текст
Ответы Re: Tracing down buildfarm "postmaster does not shut down" failures  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: Tracing down buildfarm "postmaster does not shut down" failures  (Andrew Dunstan <andrew@dunslane.net>)
Re: Tracing down buildfarm "postmaster does not shut down" failures  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe.  An example is this current failure on hornet:

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut
down...........................................................................................................................
failed
pg_ctl: server does not shut down
=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG:  received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG:  aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG:  autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG:  shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down.

Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
40 or so seconds on this machine.  So it's possible that it was just so
overloaded that it took three times that much time on this run, but I'm
starting to think there may be more to it than that.  We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG().  Normally the
only component of that that takes much time is buffer flushing, but
could something else be happening?  Or perhaps the checkpoint code
has somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting.  A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections?  Anybody have another idea for data to collect?
        regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: [PROPOSAL] VACUUM Progress Checker.
Следующее
От: Fabien COELHO
Дата:
Сообщение: Re: pgbench stats per script & other stuff