Re: Tracing down buildfarm "postmaster does not shut down" failures

Поиск

Список

Период

Сортировка

От	Tom Lane
Тема	Re: Tracing down buildfarm "postmaster does not shut down" failures
Дата	9 февраля 2016 г. 19:13:07
Msg-id	17908.1455045050@sss.pgh.pa.us обсуждение
Ответ на	Re: Tracing down buildfarm "postmaster does not shut down" failures (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Tracing down buildfarm "postmaster does not shut down" failures Re: Tracing down buildfarm "postmaster does not shut down" failures
Список	pgsql-hackers

Дерево обсуждения

I wrote:
> Noah Misch <noah@leadboat.com> writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.

>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive.  See for example
http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2
which is

Snapshot: 2016-02-09 16:27:05

waiting for server to shut down at 2016-02-09 18:04:09.159
UTC................................................................................done
 
server stopped
=========== db log file ==========
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG:  received fast shutdown request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG:  aborting any active transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG:  autovacuum launcher shutting down at 2016-02-09 18:04:09.160
UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG:  shutting down at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG:  CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG:  CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG:  CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG:  CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG:  CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG:  CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG:  CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG:  CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258
UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG:  CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG:  CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259
UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG:  BufferSync(5) beginning to write 632 buffers at 2016-02-09
18:04:09.259UTC
 
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG:  BufferSync(5) done, wrote 632/632 buffers at 2016-02-09
18:04:09.966UTC
 
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG:  CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG:  CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966
UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG:  CheckPointGuts done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG:  checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG:  pg_control updated at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG:  in mdpostckpt, 3072 unlinks remain to do at 2016-02-09
18:04:29.375UTC
 
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG:  in mdpostckpt, 2048 unlinks remain to do at 2016-02-09
18:04:48.207UTC
 
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG:  in mdpostckpt, 1024 unlinks remain to do at 2016-02-09
18:05:07.381UTC
 
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG:  smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG:  RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG:  TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG:  shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG:  ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG:  ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG:  ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG:  database system is shut down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG:  lock files all released at 2016-02-09 18:05:26.165 UTC

This says that the bulk of the shutdown time is being spent in mdpostckpt,
where we are finally unlinking files belonging to relations that were
dropped since the last checkpoint.  That took 75 seconds here, while the
parts before and after it took less than a second each.  There were
somewhere between 3000 and 4000 files to be unlinked, so it's not
astonishing that it took some time, but our other critters seem to get
through this a lot quicker.  (All four AIX critters reported pretty
similar results, BTW.)

So my diagnosis at this point is

(1) Slow file system, specifically slow unlink, is the core of the
problem.  (I wonder if the AIX critters are using an NFS filesystem?)

(2) The apparent increase in failure rate in more recent branches is
probably real, but the likely cause is simply more test cases = more
stuff to be cleaned up at the end = mdpostckpt takes longer.

I'm not sure whether there's anything to be gained by leaving the tracing
code in there till we see actual buildfarm fails.  There might be another
slowdown mechanism somewhere, but I rather doubt it.  Thoughts?

In any case, we should proceed with fixing things so that buildfarm owners
can specify a higher shutdown timeout for especially slow critters.
        regards, tom lane

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Tracing down buildfarm "postmaster does not shut down" failures