Thread: Tracing down buildfarm "postmaster does not shut down" failures


Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe.  An example is this current failure on hornet:

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut down........................................................................................................................... failed
pg_ctl: server does not shut down
=========== db log file ==========
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG:  received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG:  aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG:  autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG:  shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down.
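
Concretely, that stop step boils down to something on the order of the
following (the data directory path is a placeholder, and the exact
run_build.pl invocation may differ slightly):

    pg_ctl stop -D "$PGDATA" -m fast -t 120 -w

i.e. a fast shutdown with a two-minute wait before giving up.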

Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
40 or so seconds on this machine.  So it's possible that it was just so
overloaded that it took three times that much time on this run, but I'm
starting to think there may be more to it than that.  We've seen variants
on this theme on half a dozen machines just in the past week --- and it
seems to mostly happen in 9.5 and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG().  Normally the
only component of that that takes much time is buffer flushing, but
could something else be happening?  Or perhaps the checkpoint code
has somehow not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting.  A few failed runs with
that in place will at least allow us to confirm or deny whether it's
just that the shutdown checkpoint is sometimes really slow, or whether
there's a bug lurking.

Any objections?  Anybody have another idea for data to collect?
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Alvaro Herrera
Date:
Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.

I noticed that.

> An example is this current failure on hornet:
> 
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

FWIW you seem to have edited this URL -- it returns a blank page.
The right one is
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55&stg=stopdb-C-1

> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
> 
> Any objections?  Anybody have another idea for data to collect?

Seems like a reasonable place to start.  I assume you'll be installing
some debug-elog calls, enabled by default, and then once the problem is
fixed, we'll change the default to disabled but keep the actual calls?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> What I'd like to do to investigate this is put in a temporary HEAD-only
>> patch that makes ShutdownXLOG() and its subroutines much chattier about
>> how far they've gotten and what time it is, and also makes pg_ctl print
>> out the current time if it gives up waiting.

> Seems like a reasonable place to start.  I assume you'll be installing
> some debug-elog calls, enabled by default, and then once the problem is
> fixed, we'll change the default to disabled but keep the actual calls?

I had in mind to just "git revert" the patch when we're done with it.
There might be some parts we want to keep (for instance, I'm thinking of
logging "postmaster is done" after we've unlinked the pidfile, which might
be useful permanently).  But I plan to err on the side of noisiness for
the moment, rather than make something I think has long-term value.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/08/2016 02:15 PM, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.  An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
>
> waiting for server to shut down........................................................................................................................... failed
> pg_ctl: server does not shut down
> =========== db log file ==========
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG:  received fast shutdown request
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG:  aborting any active transactions
> 2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG:  autovacuum launcher shutting down
> 2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG:  shutting down
>
> The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
> shows that pg_ctl was honoring that, so apparently the postmaster took
> more than 2 minutes to shut down.
>
> Looking at other recent successful runs, stopdb-C-1 usually takes 30 to
> 40 or so seconds on this machine.  So it's possible that it was just so
> overloaded that it took three times that much time on this run, but I'm
> starting to think there may be more to it than that.  We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> The fact that we got "shutting down" but not "database system is shut
> down" indicates that the server was in ShutdownXLOG().  Normally the
> only component of that that takes much time is buffer flushing, but
> could something else be happening?  Or perhaps the checkpoint code
> has somehow not gotten the word to do its flushing at full speed?
>
> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections?  Anybody have another idea for data to collect?
>
>             


I think that's an excellent idea. This has been a minor running sore for 
ages on slow machines.

cheers

andrew




Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Noah Misch
Date:
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.

> We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.

It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)  I
agree the concentration on 9.5 and HEAD is suspicious; while those branches
get the most buildfarm runs, that factor by itself doesn't explain the
distribution of failures among versions.

> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting.  A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections?  Anybody have another idea for data to collect?

That's reasonable.  If you would like higher-fidelity data, I can run loops of
"pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
that for HEAD and 9.2 simultaneously.  A day of logs from that should show
clearly if HEAD is systematically worse than 9.2.  By the way, you would
almost surely qualify for an account on this machine.
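
A minimal sketch of such a loop, assuming PGDATA and PATH already point at
the installation under test, with log file names picked arbitrarily:

    while :
    do
        pg_ctl -w -l server.log start || break
        make installcheck
        pg_ctl -t 900 -w stop || break
        date >> stop-times.log      # timestamp each successful stop
    done

Timestamping the stops makes it easy to line the runs up against the
postmaster log afterwards.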


I had drafted the following message and patch last week, and I suppose it
belongs in this thread:

On Mon, Oct 12, 2015 at 06:41:06PM -0400, Tom Lane wrote:
> I'm not sure if this will completely fix our problems with "pg_ctl start"
> related buildfarm failures on very slow critters.  It does get rid of the
> hard wired 5-second timeout, but the 60-second timeout could still be an
> issue.  I think Noah was considering a patch to allow that number to be
> raised.  I'd be in favor of letting pg_ctl accept a default timeout length
> from an environment variable, and then the slower critters could be fixed
> by adjusting their buildfarm configurations.

Your commit 6bcce25 made src/bin test suites stop failing due to pg_ctl
startup timeouts, but other suites have been failing on the AIX buildfarm zoo
due to slow shutdown.  Example taking 72s to even reach ShutdownXLOG():
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-15%2012%3A43%3A00

So, I wish to raise the timeout for those animals.  Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
change behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope.  (With said unintended consequences in mind, I
made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
the service created.)
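
With such a patch in place, raising the limit for a whole test session would
look roughly like this (900 is only an example value):

    export PGCTLTIMEOUT=900   # replaces the built-in 60-second default wait
    pg_ctl -w stop            # now waits up to 900 seconds for shutdown

An explicit -t on the command line would still take precedence, which is why
the buildfarm's hard-wired "-t 120" comes up again below.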

Thanks,
nm

Attachments

Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>> We've seen variants
>> on this theme on half a dozen machines just in the past week --- and it
>> seems to mostly happen in 9.5 and HEAD, which is fishy.

> It has been affecting only the four AIX animals, which do share hardware.
> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

Certainly your AIX critters have shown this a bunch, but here's another
current example:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

> That's reasonable.  If you would like higher-fidelity data, I can run loops of
> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
> that for HEAD and 9.2 simultaneously.  A day of logs from that should show
> clearly if HEAD is systematically worse than 9.2.

That sounds like a fine plan, please do it.

> So, I wish to raise the timeout for those animals.  Using an environment
> variable was a good idea; it's one less thing for test authors to remember.
> Since the variable affects a performance-related fudge factor rather than
> change behavior per se, I'm less skittish than usual about unintended
> consequences of dynamic scope.  (With said unintended consequences in mind, I
> made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
> the service created.)

While this isn't necessarily a bad idea in isolation, the current
buildfarm scripts explicitly specify a -t value to pg_ctl stop, which
I would not expect an environment variable to override.  So we need
to fix the buildfarm script to allow the timeout to be configurable.
I'm not sure if there would be any value-add in having pg_ctl answer
to an environment variable once we've done that.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/08/2016 10:55 PM, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.
>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
>
>> That's reasonable.  If you would like higher-fidelity data, I can run loops of
>> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
>> that for HEAD and 9.2 simultaneously.  A day of logs from that should show
>> clearly if HEAD is systematically worse than 9.2.
> That sounds like a fine plan, please do it.
>
>> So, I wish to raise the timeout for those animals.  Using an environment
>> variable was a good idea; it's one less thing for test authors to remember.
>> Since the variable affects a performance-related fudge factor rather than
>> change behavior per se, I'm less skittish than usual about unintended
>> consequences of dynamic scope.  (With said unintended consequences in mind, I
>> made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
>> the service created.)
> While this isn't necessarily a bad idea in isolation, the current
> buildfarm scripts explicitly specify a -t value to pg_ctl stop, which
> I would not expect an environment variable to override.  So we need
> to fix the buildfarm script to allow the timeout to be configurable.
> I'm not sure if there would be any value-add in having pg_ctl answer
> to an environment variable once we've done that.


The failure on axolotl was for the ECPG tests, which don't use the 
buildfarm's startup/stop db code. They don't honour TEMP_CONFIG either, 
which they probably should - then we might get better log traces.

cheers

andrew



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
I wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.

>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive.  See for example
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2
which is

Snapshot: 2016-02-09 16:27:05

waiting for server to shut down at 2016-02-09 18:04:09.159 UTC................................................................................done
server stopped
=========== db log file ==========
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG:  received fast shutdown request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG:  aborting any active transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG:  autovacuum launcher shutting down at 2016-02-09 18:04:09.160 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG:  shutting down at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG:  CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG:  CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG:  CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG:  CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG:  CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG:  CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG:  CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG:  CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG:  CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG:  CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG:  BufferSync(5) beginning to write 632 buffers at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG:  BufferSync(5) done, wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG:  CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG:  CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG:  CheckPointGuts done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG:  checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG:  pg_control updated at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG:  in mdpostckpt, 3072 unlinks remain to do at 2016-02-09 18:04:29.375 UTC
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG:  in mdpostckpt, 2048 unlinks remain to do at 2016-02-09 18:04:48.207 UTC
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG:  in mdpostckpt, 1024 unlinks remain to do at 2016-02-09 18:05:07.381 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG:  smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG:  RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG:  TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG:  shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG:  ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG:  ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG:  ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG:  database system is shut down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG:  lock files all released at 2016-02-09 18:05:26.165 UTC

This says that the bulk of the shutdown time is being spent in mdpostckpt,
where we are finally unlinking files belonging to relations that were
dropped since the last checkpoint.  That took 75 seconds here, while the
parts before and after it took less than a second each.  There were
somewhere between 3000 and 4000 files to be unlinked, so it's not
astonishing that it took some time, but our other critters seem to get
through this a lot quicker.  (All four AIX critters reported pretty
similar results, BTW.)

So my diagnosis at this point is

(1) Slow file system, specifically slow unlink, is the core of the
problem.  (I wonder if the AIX critters are using an NFS filesystem?)

(2) The apparent increase in failure rate in more recent branches is
probably real, but the likely cause is simply more test cases = more
stuff to be cleaned up at the end = mdpostckpt takes longer.

I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails.  There might be another
slowdown mechanism somewhere, but I rather doubt it.  Thoughts?

In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
I wrote:
> In any case, we should proceed with fixing things so that buildfarm owners
> can specify a higher shutdown timeout for especially slow critters.

I looked into doing this as I suggested yesterday, namely modifying the
buildfarm scripts, and soon decided that it would be a mess; there are
too many cases where "pg_ctl stop" is not invoked directly by the script.

I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
*removing* the two existing hacks in run_build.pl that try to force -t 120.

The only real argument I can see against that approach is that we'd have
to back-patch the PGCTLTIMEOUT patch to all active branches if we want
to stop the buildfarm failures.  We don't usually back-patch feature
additions.  On the other hand, this wouldn't be the first time we've
back-patched something on grounds of helping the buildfarm, so I find
that argument pretty weak.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 03:05 PM, Tom Lane wrote:
> I wrote:
>> In any case, we should proceed with fixing things so that buildfarm owners
>> can specify a higher shutdown timeout for especially slow critters.
> I looked into doing this as I suggested yesterday, namely modifying the
> buildfarm scripts, and soon decided that it would be a mess; there are
> too many cases where "pg_ctl stop" is not invoked directly by the script.
>
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
>
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures.  We don't usually back-patch feature
> additions.  On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.
>
>             



OK. I can put out a new release as required.

cheers

andrew



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
I wrote:
> I'm not sure whether there's anything to be gained by leaving the tracing
> code in there till we see actual buildfarm fails.  There might be another
> slowdown mechanism somewhere, but I rather doubt it.  Thoughts?

Hmmm ... I take that back.  AFAICT, the failures on Noah's AIX zoo are
sufficiently explained by the "mdpostckpt takes a long time after the
regression tests" theory.  However, there is something else happening
on axolotl.  Looking at the HEAD and 9.5 branches, there are three very
similar failures in the ECPG step within the past 60 days:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39

In all three, we got "pg_ctl: server does not shut down", but the
postmaster log claims that it shut down, and pretty speedily too.
For example, in the 2015-12-12 failure,

LOG:  received fast shutdown request
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  checkpoint starting: shutdown immediate
LOG:  checkpoint complete: wrote 176 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.000 s, total=0.059 s; sync files=0, longest=0.000 s, average=0.000 s; distance=978 kB, estimate=978 kB
LOG:  database system is shut down

We have no theory that would account for postmaster shutdown stalling
after the end of ShutdownXLOG, but that seems to be what happened.
How come?  Why does only the ECPG test seem to be affected?

It's also pretty fishy that we have three failures in 60 days on HEAD+9.5
but none before that, and none in the older branches.  That smells like
a recently-introduced bug, though I have no idea what.

Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.
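
Something along these lines, run from the top of a configured build tree,
would loop until the suite fails:

    while make -C src/interfaces/ecpg check
    do
        :
    done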
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
I wrote:
> ... However, there is something else happening
> on axolotl.  Looking at the HEAD and 9.5 branches, there are three very
> similar failures in the ECPG step within the past 60 days:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39

Oh look, searching with the "failures" BF webpage turns up a fourth such
failure on sungazer:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-18%2014%3A45%3A15

The plot thickens.  I still have no clue what's going wrong though.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 05:53 PM, Tom Lane wrote:


>
> Andrew, I wonder if I could prevail on you to make axolotl run "make
> check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
> see if the logging I added tells anything useful about this.
>
>             


Will do.

cheers

andrew



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 06:46 PM, Andrew Dunstan wrote:
>
>
> On 02/09/2016 05:53 PM, Tom Lane wrote:
>
>
>>
>> Andrew, I wonder if I could prevail on you to make axolotl run "make
>> check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
>> see if the logging I added tells anything useful about this.
>>
>>
>
>
> Will do.


Incidentally, as I noted earlier, the ecpg tests don't honour 
TEMP_CONFIG, and in axolotl's case this could well make a difference, as 
it is set up like this:
        extra_config => {
            DEFAULT => [
                q(log_line_prefix = '%m [%c:%l] '),
                "log_connections = 'true'",
                "log_disconnections = 'true'",
                "log_statement = 'all'",
                "fsync = off",
                "stats_temp_directory = '/home/andrew/bf/root/stats_temp/$branch'"
            ],
        },

So it's not running with fsync off or using the ramdisk for 
stats_temp_directory. Of course, that doesn't explain why we're not 
seeing it on branches earlier than 9.5, but it could explain why we're 
only seeing it on the ecpg tests.
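
For the suites that do honour it, TEMP_CONFIG is just an environment variable
naming a file of extra settings that pg_regress appends to the temporary
installation's postgresql.conf, so a hand-rolled equivalent of the relevant
settings above would be something like this (file name arbitrary, $branch
replaced by HEAD for illustration):

    echo "fsync = off"                                                      >  /tmp/bf-extra.conf
    echo "stats_temp_directory = '/home/andrew/bf/root/stats_temp/HEAD'"    >> /tmp/bf-extra.conf
    TEMP_CONFIG=/tmp/bf-extra.conf make check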

cheers

andrew





Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> Incidentally, as I noted earlier, the ecpg tests don't honour 
> TEMP_CONFIG, and in axolotl's case this could well make a difference, as 
> it is set up like this:
> ...
> So it's not running with fsync off or using the ramdisk for 
> stats_temp_directory.

Oooohh ...

I had just been looking into what else the postmaster does after "database
system is shut down" comes out.  One of those activities is telling the
stats collector to shut down, and then waiting for it to do so.  So a
really slow write of the collected stats might possibly account for
delaying things here.

What I was doing was adding some more logging in the
post-shutdown-checkpoint area, and this is what I see on dromedary,
when shutting down just after a regression test run:

LOG:  database system is shut down at 2016-02-09 19:12:29.497 EST
LOG:  doing before_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG:  doing on_shmem_exit 2 at 2016-02-09 19:12:29.498 EST
LOG:  doing on_shmem_exit 1 at 2016-02-09 19:12:29.498 EST
LOG:  doing on_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG:  doing on_proc_exit 1 at 2016-02-09 19:12:29.498 EST
LOG:  doing on_proc_exit 0 at 2016-02-09 19:12:29.498 EST
LOG:  calling exit(0) at 2016-02-09 19:12:29.498 EST
LOG:  checkpointer dead at 2016-02-09 19:12:32.382 EST
LOG:  all children dead at 2016-02-09 19:12:32.387 EST
LOG:  doing on_shmem_exit 3 at 2016-02-09 19:12:32.387 EST
LOG:  doing on_shmem_exit 2 at 2016-02-09 19:12:32.387 EST
LOG:  doing on_shmem_exit 1 at 2016-02-09 19:12:32.387 EST
LOG:  doing on_shmem_exit 0 at 2016-02-09 19:12:32.396 EST
LOG:  doing on_proc_exit 1 at 2016-02-09 19:12:32.396 EST
LOG:  doing on_proc_exit 0 at 2016-02-09 19:12:32.396 EST
LOG:  lock files all released at 2016-02-09 19:12:32.397 EST
LOG:  calling exit(0) at 2016-02-09 19:12:32.397 EST

The first batch of those "on_shmem/proc_exit" reports are from the
checkpointer process, and the second set are from the postmaster.
The stats collector shutdown would be between "checkpointer dead"
and "all children dead".  We can see that on this machine, that's
not really an issue ... but how in the world did it take the postmaster
nigh three seconds to notice the checkpoint process exit?  Something
very odd indeed there.

Anyway, I think I should push this additional instrumentation so you
can use it on axolotl.

This also makes it look like we really need to revisit where/when the
"database system is shut down" message is printed.  But that's for
later.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
I wrote:
> Anyway, I think I should push this additional instrumentation so you
> can use it on axolotl.

Done.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> So it's not running with fsync off or using the ramdisk for 
> stats_temp_directory. Of course, that doesn't explain why we're not 
> seeing it on branches earlier than 9.5, but it could explain why we're 
> only seeing it on the ecpg tests.

BTW, the evidence we already have from axolotl's last run is that
post-checkpoint shutdown in the ecpg test was pretty quick:

LOG:  database system is shut down at 2016-02-09 16:31:14.784 EST
LOG:  lock files all released at 2016-02-09 16:31:14.817 EST

However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously.  I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 07:49 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> So it's not running with fsync off or using the ramdisk for
>> stats_temp_directory. Of course, that doesn't explain why we're not
>> seeing it on branches earlier than 9.5, but it could explain why we're
>> only seeing it on the ecpg tests.
> BTW, the evidence we already have from axolotl's last run is that
> post-checkpoint shutdown in the ecpg test was pretty quick:
>
> LOG:  database system is shut down at 2016-02-09 16:31:14.784 EST
> LOG:  lock files all released at 2016-02-09 16:31:14.817 EST
>
> However, I'd already noted from some other digging in the buildfarm
> logs that axolotl's speed seems to vary tremendously.  I do not
> know what else you typically run on that hardware, but putting it
> under full load might help prove things.
>
>             

Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a 
buildfarm animal. About the only other thing of note is a very lightly 
configured Nagios instance.

Of course, I could put it under continuous load running a compilation or 
something.
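
Any sustained CPU/disk hog should do for that; even something as crude as

    # throwaway load generator: repeatedly compress a stream of zeroes
    while :
    do
        dd if=/dev/zero bs=1M count=256 2>/dev/null | gzip > /tmp/load.$$.gz
    done

run in a spare shell would keep the machine busy during a test run.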

cheers

andrew




Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 02/09/2016 07:49 PM, Tom Lane wrote:
>> However, I'd already noted from some other digging in the buildfarm
>> logs that axolotl's speed seems to vary tremendously.  I do not
>> know what else you typically run on that hardware, but putting it
>> under full load might help prove things.

> Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a 
> buildfarm animal. About the only other thing of note is a very lightly 
> configured Nagios instance.

Huh, that's quite strange.  There is one fairly recent report of
axolotl failing "make check" because of taking over a minute to shut down:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52

but the runs before and after that show shutdown times of only a second or
two.  
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 08:49 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> On 02/09/2016 07:49 PM, Tom Lane wrote:
>>> However, I'd already noted from some other digging in the buildfarm
>>> logs that axolotl's speed seems to vary tremendously.  I do not
>>> know what else you typically run on that hardware, but putting it
>>> under full load might help prove things.
>> Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
>> buildfarm animal. About the only other thing of note is a very lightly
>> configured Nagios instance.
> Huh, that's quite strange.  There is one fairly recent report of
> axolotl failing "make check" because of taking over a minute to shut down:
>
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52
>
> but the runs before and after that show shutdown times of only a second or
> two.


No idea why.

anyway, we got a failure pretty quickly:
   pg_ctl: server does not shut down at 2016-02-09 21:10:11.914 EST
   pg_regress: could not stop postmaster: exit code was 256
   Makefile:81: recipe for target 'check' failed
   make: *** [check] Error 2
 

The log traces from the shutdown are below.

cheers

andrew



LOG:  received fast shutdown request at 2016-02-09 21:09:11.824 EST
LOG:  aborting any active transactions
LOG:  autovacuum launcher shutting down at 2016-02-09 21:09:11.830 EST
LOG:  shutting down at 2016-02-09 21:09:11.839 EST
LOG:  checkpoint starting: shutdown immediate
LOG:  CheckPointGuts starting at 2016-02-09 21:09:11.840 EST
LOG:  CheckPointCLOG() done at 2016-02-09 21:09:11.840 EST
LOG:  CheckPointCommitTs() done at 2016-02-09 21:09:11.840 EST
LOG:  CheckPointSUBTRANS() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointMultiXact() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointPredicate() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointRelationMap() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointReplicationSlots() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointSnapBuild() done at 2016-02-09 21:09:11.841 EST
LOG:  CheckPointLogicalRewriteHeap() done at 2016-02-09 21:09:11.842 EST
LOG:  BufferSync(5) beginning to write 172 buffers at 2016-02-09 21:09:11.845 EST
LOG:  BufferSync(5) done, wrote 172/172 buffers at 2016-02-09 21:09:14.635 EST
LOG:  CheckPointBuffers() done at 2016-02-09 21:09:14.636 EST
LOG:  CheckPointReplicationOrigin() done at 2016-02-09 21:09:14.636 EST
LOG:  CheckPointGuts done at 2016-02-09 21:09:14.636 EST
LOG:  checkpoint WAL record flushed at 2016-02-09 21:09:14.636 EST
LOG:  pg_control updated at 2016-02-09 21:09:14.637 EST
LOG:  smgrpostckpt() done at 2016-02-09 21:09:14.668 EST
LOG:  RemoveOldXlogFiles() done at 2016-02-09 21:09:14.668 EST
LOG:  TruncateSUBTRANS() done at 2016-02-09 21:09:14.669 EST
LOG:  checkpoint complete: wrote 172 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=2.794 s, sync=0.000 s, total=2.829 s; sync files=0, longest=0.000 s, average=0.000 s; distance=966 kB, estimate=966 kB
LOG:  shutdown checkpoint complete at 2016-02-09 21:09:14.669 EST
LOG:  ShutdownCLOG() complete at 2016-02-09 21:09:14.669 EST
LOG:  ShutdownCommitTs() complete at 2016-02-09 21:09:14.669 EST
LOG:  ShutdownSUBTRANS() complete at 2016-02-09 21:09:14.669 EST
LOG:  database system is shut down at 2016-02-09 21:09:14.669 EST
LOG:  doing before_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG:  doing on_shmem_exit 2 at 2016-02-09 21:09:14.670 EST
LOG:  doing on_shmem_exit 1 at 2016-02-09 21:09:14.670 EST
LOG:  doing on_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG:  doing on_proc_exit 1 at 2016-02-09 21:09:14.670 EST
LOG:  doing on_proc_exit 0 at 2016-02-09 21:09:14.671 EST
LOG:  calling exit(0) at 2016-02-09 21:09:14.671 EST
LOG:  checkpointer dead at 2016-02-09 21:09:14.683 EST
LOG:  all children dead at 2016-02-09 21:10:11.184 EST
LOG:  doing on_shmem_exit 3 at 2016-02-09 21:10:11.191 EST
LOG:  doing on_shmem_exit 2 at 2016-02-09 21:10:11.191 EST
LOG:  doing on_shmem_exit 1 at 2016-02-09 21:10:11.192 EST
LOG:  doing on_shmem_exit 0 at 2016-02-09 21:10:11.208 EST
LOG:  doing on_proc_exit 1 at 2016-02-09 21:10:11.209 EST
LOG:  doing on_proc_exit 0 at 2016-02-09 21:10:11.209 EST
LOG:  lock files all released at 2016-02-09 21:10:11.211 EST
LOG:  calling exit(0) at 2016-02-09 21:10:11.211 EST



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Jim Nasby
Date:
On 2/8/16 2:45 PM, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> Tom Lane wrote:
>>> What I'd like to do to investigate this is put in a temporary HEAD-only
>>> patch that makes ShutdownXLOG() and its subroutines much chattier about
>>> how far they've gotten and what time it is, and also makes pg_ctl print
>>> out the current time if it gives up waiting.
>
>> Seems like a reasonable place to start.  I assume you'll be installing
>> some debug-elog calls, enabled by default, and then once the problem is
>> fixed, we'll change the default to disabled but keep the actual calls?
>
> I had in mind to just "git revert" the patch when we're done with it.
> There might be some parts we want to keep (for instance, I'm thinking of
> logging "postmaster is done" after we've unlinked the pidfile, which might
> be useful permanently).  But I plan to err on the side of noisiness for
> the moment, rather than make something I think has long-term value.

It's already difficult enough for DBAs to debug some performance issues, 
so getting rid of logging is a step backwards. I realize it's unlikely 
that you could run DEBUG2 in a prod environment, but we still shouldn't 
be reducing visibility.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> anyway, we got a failure pretty quickly:
>     pg_ctl: server does not shut down at 2016-02-09 21:10:11.914 EST
> ...
> LOG:  received fast shutdown request at 2016-02-09 21:09:11.824 EST
> ...
> LOG:  checkpointer dead at 2016-02-09 21:09:14.683 EST
> LOG:  all children dead at 2016-02-09 21:10:11.184 EST
> ...
> LOG:  lock files all released at 2016-02-09 21:10:11.211 EST

Hmm.  Apparently, pg_ctl gave up exactly one second too early.
The way the wait loop in pg_ctl is coded, it waits one second more
after the last get_pgpid() probe before complaining --- so the last
time it looked for the pidfile was approximately 21:10:10.914, just
300ms before the postmaster removed it.  I wonder if that's entirely
coincidence.
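
Conceptually, that wait loop amounts to the following shell approximation
(the real logic is C code in pg_ctl.c; WAIT_SECONDS stands in for the -t
value and PGDATA for the data directory):

    for sec in $(seq 1 "$WAIT_SECONDS")
    do
        test -f "$PGDATA/postmaster.pid" || { echo " done"; exit 0; }
        printf "."                       # the dots counted in the buildfarm log
        sleep 1                          # so the last pidfile probe is ~1s before giving up
    done
    echo " failed"
    echo "pg_ctl: server does not shut down"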

Still, it seems clear that the bulk of the shutdown time is indeed the
stats collector taking its time about shutting down, which is doubly
weird because the ecpg tests shouldn't have created very many tables,
so why would there be a lot of data to write?  Even granting that it's
not writing to ramdisk, 57 seconds to shut down seems pretty excessive.

I wonder if it's worth sticking some instrumentation into stats
collector shutdown?
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Noah Misch
Date:
On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
> Still, it seems clear that the bulk of the shutdown time is indeed the
> stats collector taking its time about shutting down, which is doubly
> weird because the ecpg tests shouldn't have created very many tables,
> so why would there be a lot of data to write?  Even granting that it's
> not writing to ramdisk, 57 seconds to shut down seems pretty excessive.
> 
> I wonder if it's worth sticking some instrumentation into stats
> collector shutdown?

I wouldn't be surprised if the collector got backlogged during the main phase
of testing and took awhile to chew through its message queue before even
starting the write of the final stats.



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 2/8/16 2:45 PM, Tom Lane wrote:
>> I had in mind to just "git revert" the patch when we're done with it.

> It's already difficult enough for DBAs to debug some performance issues, 
> so getting rid of logging is a step backwards. I realize it's unlikely 
> that you could run DEBUG2 in a prod environment, but we still shouldn't 
> be reducing visibility.

Meh.  It seems clear to me that we should move the "database system is
shut down" message so that it comes out only after the postmaster has
removed its lockfiles (and hence is really gone so far as any outside
vantage point is concerned).  But I do not think any of the rest of
what I put in has any lasting value.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
>> I wonder if it's worth sticking some instrumentation into stats
>> collector shutdown?

> I wouldn't be surprised if the collector got backlogged during the main phase
> of testing and took awhile to chew through its message queue before even
> starting the write of the final stats.

But why would the ecpg tests show such an effect when the main regression
tests don't?  AFAIK the ecpg tests don't exactly stress the server ---
note the trivial amount of data written by the shutdown checkpoint,
for instance.

The other weird thing is that it's only sometimes slow.  If you look at
the last buildfarm result from axolotl, for instance, the tail end of
the ecpg log is

LOG:  ShutdownSUBTRANS() complete at 2016-02-09 16:31:14.784 EST
LOG:  database system is shut down at 2016-02-09 16:31:14.784 EST
LOG:  lock files all released at 2016-02-09 16:31:14.817 EST

so we only spent ~50ms on stats write that time.

The idea I was toying with is that previous filesystem activity (making
the temp install, the server's never-fsync'd writes, etc) has built up a
bunch of dirty kernel buffers, and at some point the kernel goes nuts
writing all that data.  So the issues we're seeing would come and go
depending on the timing of that I/O spike.  I'm not sure how to prove
such a theory from here.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 10:27 PM, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
>>> I wonder if it's worth sticking some instrumentation into stats
>>> collector shutdown?
>> I wouldn't be surprised if the collector got backlogged during the main phase
>> of testing and took awhile to chew through its message queue before even
>> starting the write of the final stats.
> But why would the ecpg tests show such an effect when the main regression
> tests don't?  AFAIK the ecpg tests don't exactly stress the server ---
> note the trivial amount of data written by the shutdown checkpoint,
> for instance.


The main regression tests run with the stats file on the ramdisk.


>
> The other weird thing is that it's only sometimes slow.  If you look at
> the last buildfarm result from axolotl, for instance, the tail end of
> the ecpg log is
>
> LOG:  ShutdownSUBTRANS() complete at 2016-02-09 16:31:14.784 EST
> LOG:  database system is shut down at 2016-02-09 16:31:14.784 EST
> LOG:  lock files all released at 2016-02-09 16:31:14.817 EST
>
> so we only spent ~50ms on stats write that time.


That part is puzzling.

> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data.  So the issues we're seeing would come and go
> depending on the timing of that I/O spike.  I'm not sure how to prove
> such a theory from here.


Yeah. It's faintly possible that a kernel upgrade will help.

cheers

andrew





Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Noah Misch
Date:
On Mon, Feb 08, 2016 at 10:55:24PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> >> We've seen variants
> >> on this theme on half a dozen machines just in the past week --- and it
> >> seems to mostly happen in 9.5 and HEAD, which is fishy.
> 
> > It has been affecting only the four AIX animals, which do share hardware.
> > (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
> 
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

Oops; I did not consider Monday's results before asserting that.

> > That's reasonable.  If you would like higher-fidelity data, I can run loops of
> > "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
> > that for HEAD and 9.2 simultaneously.  A day of logs from that should show
> > clearly if HEAD is systematically worse than 9.2.
> 
> That sounds like a fine plan, please do it.

Log files:

HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
92:   https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

While I didn't study those logs in detail, a few things jumped out.  Since
9.2, we've raised the default shared_buffers from 32MB to 128MB, and we've
replaced checkpoint_segments=3 with max_wal_size=1GB.  Both changes encourage
bulkier checkpoints.  The 9.2 test runs get one xlog-driven checkpoint before
the shutdown checkpoint, while HEAD gets one time-driven checkpoint.  Also,
the HEAD suite just tests more things.  Here's pg_stat_bgwriter afterward:

HEAD:
checkpoints_timed     | 156
checkpoints_req       | 799
checkpoint_write_time | 16035847
checkpoint_sync_time  | 6555396
buffers_checkpoint    | 744131
buffers_clean         | 0
maxwritten_clean      | 0
buffers_backend       | 3023444
buffers_backend_fsync | 0
buffers_alloc         | 1777010
stats_reset           | 2016-02-08 21:04:24.499607-08

9.2:
checkpoints_timed     | 39
checkpoints_req       | 1369
checkpoint_write_time | 14875776
checkpoint_sync_time  | 8397536
buffers_checkpoint    | 396272
buffers_clean         | 466392
maxwritten_clean      | 1336
buffers_backend       | 1961531
buffers_backend_fsync | 0
buffers_alloc         | 1681324
stats_reset           | 2016-02-08 21:09:21.925487-08

Most notable there is the lack of bgwriter help in HEAD.  The clusters had
initdb-default configuration apart from these additions:

listen_addresses=''
log_line_prefix = '%p %m '
logging_collector = on
log_autovacuum_min_duration = 0
log_checkpoints = on
log_lock_waits = on
log_temp_files = 128kB

I should have added fsync=off, too.  Notice how the AIX animals failed left
and right today, likely thanks to contention from these runs.

On Tue, Feb 09, 2016 at 02:10:50PM -0500, Tom Lane wrote:
> (1) Slow file system, specifically slow unlink, is the core of the
> problem.  (I wonder if the AIX critters are using an NFS filesystem?)

The buildfarm files lie on 600 GiB SAS disks.  I suspect metadata operations,
like unlink(), bottleneck on the jfs2 journal.

On Tue, Feb 09, 2016 at 03:05:01PM -0500, Tom Lane wrote:
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
> 
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures.  We don't usually back-patch feature
> additions.  On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.

If I were a purist against back-patching features, I might name the variable
PGINTERNAL_TEST_PGCTLTIMEOUT and not document it.  Meh to that.  I'll plan to
commit the original tomorrow.



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
>>> That's reasonable.  If you would like higher-fidelity data, I can run loops of
>>> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
>>> that for HEAD and 9.2 simultaneously.  A day of logs from that should show
>>> clearly if HEAD is systematically worse than 9.2.

>> That sounds like a fine plan, please do it.

> Log files:
> HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
> 92:   https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

Thanks for doing this.

I did a bit of analysis of these logs.  There is no case in the HEAD log
where 'lock files all released' comes out more than 2.221 seconds after
'shutdown checkpoint complete'.  So it doesn't appear that stats writing
is ever a big problem for your machine.  The checkpoints themselves,
though ... here are the slowest few checkpoints on HEAD:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-09 03:33:14.671-08 | 287.691 | 355.528 |  12.509 | 655.728
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 12:36:49.753-08 | 149.652 | 481.736 |  13.094 | 644.482
 2016-02-09 04:32:53.708-08 | 214.896 | 304.102 |  17.929 | 536.927
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 07:37:57.999-08 | 280.960 |  10.408 |  14.287 | 305.655
 2016-02-09 07:06:30.121-08 | 155.331 |  93.433 |  23.814 | 272.578
 2016-02-09 17:45:34.422-08 | 149.471 | 104.185 |  18.598 | 272.254
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
(10 rows)

and the same for 9.2:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-08 23:33:36.16-08  |  66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |   2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 12:37:43.52-08  |  17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 03:27:12.904-08 |  48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 04:32:49.668-08 |  11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 16:31:25.657-08 |   3.860 | 184.042 | 248.488 | 436.390
 2016-02-09 08:27:04.407-08 |  12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 03:32:30.718-08 | 296.907 |  15.260 |   5.644 | 317.811
 2016-02-09 15:31:41.691-08 |   0.385 | 307.996 |   0.155 | 308.536
 2016-02-09 12:23:10.92-08  |  93.634 | 161.581 |  19.342 | 274.557
(10 rows)

It looks like you're going to have to set PGCTLTIMEOUT somewhere
north of 10 minutes to keep these animals from failing under load.

Interestingly, we seem to have managed to greatly reduce the "other"
time (which I presume is basically mdpostchkpt unlinking) since 9.2.
The worst case observed in HEAD is about 100s:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total-write-sync desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
 2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
 2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
 2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
(10 rows)

but 9.2's worst cases greatly exceed that:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total-write-sync desc limit 10;
             ts             | write  |  sync   |  other  |  total  
----------------------------+--------+---------+---------+---------
 2016-02-09 03:27:12.904-08 | 48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 16:31:25.657-08 |  3.860 | 184.042 | 248.488 | 436.390
 2016-02-08 23:33:36.16-08  | 66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |  2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 07:37:44.48-08  |  3.317 |   1.494 | 158.159 | 162.970
 2016-02-09 04:32:49.668-08 | 11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 12:37:43.52-08  | 17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 08:27:04.407-08 | 12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 18:17:32.193-08 | 90.370 |  25.369 |  63.626 | 179.365
 2016-02-09 11:02:14.977-08 |  2.713 |   2.306 |  38.581 |  43.600
(10 rows)

Not real sure where that improvement came from.  We could hardly
be unlinking fewer files overall, and the bottleneck seems to be
in the kernel, so what changed?  The *average*, as opposed to worst
case, checkpoint time has definitely grown since 9.2:

regression=# select avg(write) as write,avg(sync) as sync,avg(total-write-sync) as other, avg(total) as total from chkpts;
        write        |        sync        |       other        |        total        
---------------------+--------------------+--------------------+---------------------
 14.4432267389340560 | 7.4136738934056007 | 4.9619313459801265 | 26.8188319783197832
(1 row)

regression=# select avg(write) as write,avg(sync) as sync,avg(total-write-sync) as other, avg(total) as total from chkpts92;
       write        |        sync        |       other        |        total        
--------------------+--------------------+--------------------+---------------------
 9.1272341075794621 | 6.2059480440097800 | 4.0349553789731051 | 19.3681375305623472
(1 row)

I don't find that troublesome, since we're probably syncing more
data in the newer branch's tests.  But the improvement in worst
case behavior is puzzling.
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andres Freund
Date:
On 2016-02-09 22:27:07 -0500, Tom Lane wrote:
> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data.  So the issues we're seeing would come and go
> depending on the timing of that I/O spike.  I'm not sure how to prove
> such a theory from here.

It'd be interesting to monitor
$ grep -E '^(Dirty|Writeback):' /proc/meminfo
output. At least on linux. It's terribly easy to get the kernel into a
state where it has so much data needing to be written back that an
immediate checkpoint takes pretty much forever.
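
To line that up with the shutdown timeline, one could sample it once per
second while the tests run, e.g. (log file name arbitrary):

    # Linux-only: log dirty/writeback counters with timestamps
    while sleep 1
    do
        date '+%F %T'
        grep -E '^(Dirty|Writeback):' /proc/meminfo
    done >> meminfo.log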

If I understand the code correctly, once a buffer has been placed into
'writeback', it'll be more-or-less processed in order. That can e.g. be
because these buffers have been written to more than 30s ago. If there
then are buffers later that also need to be written back (e.g. due to an
fsync()), you'll often wait for the earlier ones.

Andres



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/09/2016 11:21 PM, Andrew Dunstan wrote:
>
>
>
>> The idea I was toying with is that previous filesystem activity (making
>> the temp install, the server's never-fsync'd writes, etc) has built up a
>> bunch of dirty kernel buffers, and at some point the kernel goes nuts
>> writing all that data.  So the issues we're seeing would come and go
>> depending on the timing of that I/O spike.  I'm not sure how to prove
>> such a theory from here.
>
>
> Yeah. It's faintly possible that a kernel upgrade will  help.
>
>



Another data point. I have another RPi2B that is running Debian Wheezy 
rather than the Fedora remix. I'm running the same test on it we ran 
yesterday on axolotl. It seems to be running without having the same 
problems. So maybe the particular kernel port to ARM is what's tripping 
us up.


cheers

andrew




Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
>> Yeah. It's faintly possible that a kernel upgrade will  help.

> Another data point. I have another RPi2B that is running Debian Wheezy 
> rather than the Fedora remix. I'm running the same test on it we ran 
> yesterday on axolotl. It seems to be running without having the same 
> problems. So maybe the particular kernel port to ARM is what's tripping 
> us up.

I'd bet against it being a port-specific problem; it sounds more like a
filesystem performance-tuning issue, which could easily change from one
kernel version to another.  What are the kernel version numbers exactly?
        regards, tom lane



Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Andrew Dunstan
Date:

On 02/10/2016 12:53 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>>> Yeah. It's faintly possible that a kernel upgrade will  help.
>> Another data point. I have another RPi2B that is running Debian Wheezy
>> rather than the Fedora remix. I'm running the same test on it we ran
>> yesterday on axolotl. It seems to be running without having the same
>> problems. So maybe the particular kernel port to ARM is what's tripping
>> us up.
> I'd bet against it being a port-specific problem; it sounds more like a
> filesystem performance-tuning issue, which could easily change from one
> kernel version to another.  What are the kernel version numbers exactly?

axolotl: Linux pirouette 3.18.13-501.20150510gitf36e19f.sc20.armv7hl.bcm2709 #1 SMP PREEMPT
debian: Linux piratic 3.18.7-v7+ #755 SMP PREEMPT

cheers

andrew




Re: Tracing down buildfarm "postmaster does not shut down" failures

From
Noah Misch
Date:
On Wed, Feb 10, 2016 at 10:55:10AM -0500, Tom Lane wrote:
> Interestingly, we seem to have managed to greatly reduce the "other"
> time (which I presume is basically mdpostchkpt unlinking) since 9.2.
> The worst case observed in HEAD is about 100s:
> 
> regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total-write-sync desc limit 10;
>              ts             |  write  |  sync   |  other  |  total  
> ----------------------------+---------+---------+---------+---------
>  2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
>  2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
>  2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
>  2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
>  2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
>  2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
>  2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
>  2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
>  2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
>  2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
> (10 rows)
> 
> but 9.2's worst cases greatly exceed that:
> 
> regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total-write-sync desc limit 10;
>              ts             | write  |  sync   |  other  |  total  
> ----------------------------+--------+---------+---------+---------
>  2016-02-09 03:27:12.904-08 | 48.761 | 124.453 | 299.877 | 473.091
>  2016-02-09 16:31:25.657-08 |  3.860 | 184.042 | 248.488 | 436.390
>  2016-02-08 23:33:36.16-08  | 66.904 | 397.653 | 229.020 | 693.577
>  2016-02-09 11:39:57.061-08 |  2.471 | 303.924 | 201.923 | 508.318
>  2016-02-09 07:37:44.48-08  |  3.317 |   1.494 | 158.159 | 162.970
>  2016-02-09 04:32:49.668-08 | 11.328 | 292.310 | 148.088 | 451.726
>  2016-02-09 12:37:43.52-08  | 17.620 | 333.983 | 141.172 | 492.775
>  2016-02-09 08:27:04.407-08 | 12.019 | 300.510 | 102.855 | 415.384
>  2016-02-09 18:17:32.193-08 | 90.370 |  25.369 |  63.626 | 179.365
>  2016-02-09 11:02:14.977-08 |  2.713 |   2.306 |  38.581 |  43.600
> (10 rows)
> 
> Not real sure where that improvement came from.  We could hardly
> be unlinking fewer files overall, and the bottleneck seems to be
> in the kernel, so what changed?  The *average*, as opposed to worst
> case, checkpoint time has definitely grown since 9.2:

Good question; I don't know.  I would have expected 9.2's xlog-driven
checkpoints to create more stability than HEAD's time-driven checkpoints.


On the AIX animals, I have now set PGCTLTIMEOUT=900 and removed the "-t 120"
arguments.