Re: Tracing down buildfarm "postmaster does not shut down" failures

From: Tom Lane
Subject: Re: Tracing down buildfarm "postmaster does not shut down" failures
Msg-id: 26216.1455119710@sss.pgh.pa.us
In reply to: Re: Tracing down buildfarm "postmaster does not shut down" failures  (Noah Misch <noah@leadboat.com>)
Responses: Re: Tracing down buildfarm "postmaster does not shut down" failures  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
Noah Misch <noah@leadboat.com> writes:
>>> That's reasonable.  If you would like higher-fidelity data, I can run loops of
>>> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
>>> that for HEAD and 9.2 simultaneously.  A day of logs from that should show
>>> clearly if HEAD is systematically worse than 9.2.

>> That sounds like a fine plan, please do it.

> Log files:
> HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
> 92:   https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

Thanks for doing this.
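
(In case anyone wants to repeat this sort of analysis: a minimal sketch of
how such a chkpts table could be built, assuming the postmaster log has been
loaded into a hypothetical one-column staging table rawlog(line text) and
that log_line_prefix starts each line with a timestamp in the format shown
below; this is just an illustration, not necessarily how these particular
tables were built.)

create table chkpts (ts timestamptz, write numeric, sync numeric, total numeric);

-- Pull the timings out of log_checkpoints' "checkpoint complete: ...
-- write=N.NNN s, sync=N.NNN s, total=N.NNN s" lines.  rawlog(line text)
-- is an assumed staging table holding the raw log lines.
insert into chkpts
select substring(line from '^(\d{4}-\d\d-\d\d \d\d:\d\d:[0-9.]+-\d\d)')::timestamptz,
       (regexp_matches(line, 'write=([0-9.]+) s'))[1]::numeric,
       (regexp_matches(line, 'sync=([0-9.]+) s'))[1]::numeric,
       (regexp_matches(line, 'total=([0-9.]+) s'))[1]::numeric
from rawlog
where line like '%checkpoint complete:%';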

I did a bit of analysis of these logs.  There is no case in the HEAD log
where 'lock files all released' comes out more than 2.221 seconds after
'shutdown checkpoint complete'.  So it doesn't appear that stats writing
is ever a big problem for your machine.  The checkpoints themselves,
though ... here are the slowest few checkpoints on HEAD:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-09 03:33:14.671-08 | 287.691 | 355.528 |  12.509 | 655.728
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 12:36:49.753-08 | 149.652 | 481.736 |  13.094 | 644.482
 2016-02-09 04:32:53.708-08 | 214.896 | 304.102 |  17.929 | 536.927
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 07:37:57.999-08 | 280.960 |  10.408 |  14.287 | 305.655
 2016-02-09 07:06:30.121-08 | 155.331 |  93.433 |  23.814 | 272.578
 2016-02-09 17:45:34.422-08 | 149.471 | 104.185 |  18.598 | 272.254
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
(10 rows)

and the same for 9.2:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-08 23:33:36.16-08  |  66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |   2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 12:37:43.52-08  |  17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 03:27:12.904-08 |  48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 04:32:49.668-08 |  11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 16:31:25.657-08 |   3.860 | 184.042 | 248.488 | 436.390
 2016-02-09 08:27:04.407-08 |  12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 03:32:30.718-08 | 296.907 |  15.260 |   5.644 | 317.811
 2016-02-09 15:31:41.691-08 |   0.385 | 307.996 |   0.155 | 308.536
 2016-02-09 12:23:10.92-08  |  93.634 | 161.581 |  19.342 | 274.557
(10 rows)

It looks like you're going to have to set PGCTLTIMEOUT somewhere
north of 10 minutes to keep these animals from failing under load.
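
(To put a rough number on that, here's one way to count how many checkpoints
in the HEAD log would outlast a given pg_ctl wait; the thresholds are just
illustrative, 60s being pg_ctl's default:)

select sum(case when total > 60  then 1 else 0 end) as over_1min,
       sum(case when total > 300 then 1 else 0 end) as over_5min,
       sum(case when total > 600 then 1 else 0 end) as over_10min
from chkpts;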

Interestingly, we seem to have managed to greatly reduce the "other"
time (which I presume is basically mdpostckpt unlinking) since 9.2.
The worst case observed in HEAD is about 100s:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts order by total-write-sync desc limit 10;
             ts             |  write  |  sync   |  other  |  total  
----------------------------+---------+---------+---------+---------
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
 2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
 2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
 2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
(10 rows)

but 9.2's worst cases greatly exceed that:

regression=# select ts,write,sync,total-write-sync as other, total from chkpts92 order by total-write-sync desc limit 10;
             ts             | write  |  sync   |  other  |  total  
----------------------------+--------+---------+---------+---------
 2016-02-09 03:27:12.904-08 | 48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 16:31:25.657-08 |  3.860 | 184.042 | 248.488 | 436.390
 2016-02-08 23:33:36.16-08  | 66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |  2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 07:37:44.48-08  |  3.317 |   1.494 | 158.159 | 162.970
 2016-02-09 04:32:49.668-08 | 11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 12:37:43.52-08  | 17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 08:27:04.407-08 | 12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 18:17:32.193-08 | 90.370 |  25.369 |  63.626 | 179.365
 2016-02-09 11:02:14.977-08 |  2.713 |   2.306 |  38.581 |  43.600
(10 rows)

Not real sure where that improvement came from.  We could hardly
be unlinking fewer files overall, and the bottleneck seems to be
in the kernel, so what changed?  The *average*, as opposed to worst
case, checkpoint time has definitely grown since 9.2:

regression=# select avg(write) as write,avg(sync) as sync,avg(total-write-sync) as other, avg(total) as total from chkpts;
        write         |        sync        |       other        |        total        
----------------------+--------------------+--------------------+---------------------
 14.4432267389340560  | 7.4136738934056007 | 4.9619313459801265 | 26.8188319783197832
(1 row)

regression=# select avg(write) as write,avg(sync) as sync,avg(total-write-sync) as other, avg(total) as total from chkpts92;
       write        |        sync        |       other        |        total        
--------------------+--------------------+--------------------+---------------------
 9.1272341075794621 | 6.2059480440097800 | 4.0349553789731051 | 19.3681375305623472
(1 row)
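
(The same averages can also be put side by side for easier comparison, just
restating the two queries above against the same tables:)

select 'HEAD' as branch, avg(write) as write, avg(sync) as sync,
       avg(total-write-sync) as other, avg(total) as total
from chkpts
union all
select '9.2', avg(write), avg(sync), avg(total-write-sync), avg(total)
from chkpts92;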

I don't find that troublesome, since we're probably syncing more
data in the newer branch's tests.  But the improvement in worst
case behavior is puzzling.
        regards, tom lane


