Re: Race conditions with checkpointer and shutdown

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Race conditions with checkpointer and shutdown
Дата
Msg-id 28461.1555624659@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: Race conditions with checkpointer and shutdown  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Race conditions with checkpointer and shutdown  (Michael Paquier <michael@paquier.xyz>)
Список pgsql-hackers
Robert Haas <robertmhaas@gmail.com> wrote (in the other thread):
> Any idea whether it's something newly-introduced or of long standing?

It's the latter.  I searched the buildfarm database for failure logs
including the string "server does not shut down" within the last three
years, and got all of the hits attached.  Not all of these look like
the failure pattern Michael pointed to, but enough of them do to say
that the problem has existed since at least mid-2017.  To be concrete,
we have quite a sample of cases where a standby server has received a
"fast shutdown" signal and acknowledged that in its log, but it never
gets to the expected "shutting down" message, meaning it never starts
the shutdown checkpoint let alone finishes it.  The oldest case that
clearly looks like that is

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=nightjar&dt=2017-06-02%2018%3A54%3A29

A significant majority of the recent cases look just like the piculet
failure Michael pointed to, that is we fail to shut down the "london"
server while it's acting as standby in the recovery/t/009_twophase.pl
test.  But there are very similar failures in other tests.

I also notice that the population of machines showing the problem seems
heavily skewed towards, um, weird cases.  For instance, in the set
that have shown this type of failure since January, we have

dragonet: uses JIT
francolin: --disable-spinlocks
gull: armv7
mereswine: armv7
piculet: --disable-atomics
sidewinder: amd64, but running netbsd 7 (and this was 9.6, note)
spurfowl: fairly generic amd64

This leads me to suspect that the problem is (a) some very low-level issue
in spinlocks or or latches or the like, or (b) a timing problem that just
doesn't show up on generic Intel-oid platforms.  The timing theory is
maybe a bit stronger given that one test case shows this more often than
others.  I've not got any clear ideas beyond that.

Anyway, this is *not* new in v12.

            regards, tom lane

    sysname    |      snapshot       |        stage        |                  l
---------------+---------------------+---------------------+-------------------------------------
 jacana        | 2016-07-23 06:15:32 | pg_upgradeCheck     | pg_ctl: server does not shut down\r
 pademelon     | 2016-08-14 03:49:36 | ECPG-Check          | pg_ctl: server does not shut down
 mereswine     | 2017-02-13 14:24:37 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-03-04 20:06:10 | StopDb-C:4          | pg_ctl: server does not shut down
 nightjar      | 2017-06-02 18:54:29 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-02 19:54:11 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 15:54:05 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 17:54:18 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-03 21:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 00:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 16:34:32 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-04 17:54:16 | SubscriptionCheck   | pg_ctl: server does not shut down
 hornet        | 2017-06-05 16:22:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-05 16:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-05 20:26:24 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 03:30:02 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 15:54:18 | SubscriptionCheck   | pg_ctl: server does not shut down
 hornet        | 2017-06-06 17:10:02 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-06 18:54:27 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 00:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 02:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 15:12:15 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 17:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 18:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 sungazer      | 2017-06-07 19:46:53 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-07 21:03:43 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 01:54:07 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 15:54:10 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 16:57:03 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 17:54:09 | SubscriptionCheck   | pg_ctl: server does not shut down
 nightjar      | 2017-06-08 19:54:06 | SubscriptionCheck   | pg_ctl: server does not shut down
 nudibranch    | 2017-07-23 08:31:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 nudibranch    | 2017-08-06 08:31:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 spurfowl      | 2017-08-21 18:43:08 | BinInstallCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-10 22:12:01 | pg_upgradeCheck     | pg_ctl: server does not shut down
 calliphoridae | 2017-10-11 20:32:24 | recoveryCheck       | pg_ctl: server does not shut down
 arapaima      | 2017-10-12 16:36:18 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-12 16:36:18 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-13 10:26:13 | test-decoding-check | pg_ctl: server does not shut down
 arapaima      | 2017-10-14 07:40:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-14 23:22:28 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-15 10:32:28 | test-decoding-check | pg_ctl: server does not shut down
 arapaima      | 2017-10-16 08:40:39 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-17 20:32:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-18 13:11:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 arapaima      | 2017-10-18 23:47:09 | Check               | pg_ctl: server does not shut down
 arapaima      | 2017-10-19 09:41:15 | Check               | pg_ctl: server does not shut down
 spurfowl      | 2017-11-15 13:29:06 | BinInstallCheck     | pg_ctl: server does not shut down
 baiji         | 2017-11-27 23:00:01 | Check               | pg_ctl: server does not shut down\r
 skink         | 2017-11-30 08:47:34 | recoveryCheck       | pg_ctl: server does not shut down
 skink         | 2017-12-02 14:30:01 | recoveryCheck       | pg_ctl: server does not shut down
 gharial       | 2017-12-15 09:12:45 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2017-12-17 09:33:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 nudibranch    | 2018-01-20 10:08:33 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2018-01-27 09:44:39 | Check               | pg_ctl: server does not shut down
 nudibranch    | 2018-02-01 09:49:26 | test-decoding-check | pg_ctl: server does not shut down
 mereswine     | 2018-03-01 11:05:39 | Check               | pg_ctl: server does not shut down
 topminnow     | 2018-03-10 12:45:57 | pg_rewindCheck      | pg_ctl: server does not shut down
 mereswine     | 2018-03-14 08:10:08 | Check               | pg_ctl: server does not shut down
 gull          | 2018-03-30 02:10:15 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-08 11:37:13 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-08 21:41:39 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-10 23:37:22 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-13 16:06:17 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-13 23:59:24 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-04-14 20:09:25 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-17 11:59:57 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-19 07:31:31 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-20 08:02:00 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-21 04:21:33 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-21 23:06:17 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-22 17:03:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-22 17:03:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-23 13:08:51 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-24 07:52:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-04-25 01:32:16 | Check               | pg_ctl: server does not shut down
 gull          | 2018-04-25 02:10:12 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-26 17:42:38 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-28 05:58:59 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-04-30 13:11:22 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-03 02:36:32 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-04 11:07:37 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-04 18:02:59 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-06 02:52:23 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-07 19:16:33 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-08 13:03:52 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 04:36:06 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 10:41:59 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-09 16:41:32 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-10 14:06:04 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-11 09:00:33 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-12 04:07:50 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-15 22:11:06 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-16 10:14:21 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-16 23:01:57 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-17 07:06:03 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-17 14:22:36 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 00:15:19 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 11:23:03 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-18 22:36:45 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-19 11:54:21 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-19 19:06:08 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-05-20 09:32:53 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-20 16:47:17 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-21 03:36:24 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-21 14:38:21 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-22 03:02:05 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-22 11:06:05 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-23 12:13:37 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-23 21:05:13 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-24 08:29:25 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-05-25 14:08:16 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-25 19:11:35 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-27 16:56:10 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-05-29 16:09:14 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-05-30 15:07:44 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 13:10:53 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 21:12:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-05-31 21:12:12 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-04 17:07:20 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-05 00:12:13 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 17:06:03 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 21:49:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-06 21:49:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-07 06:53:06 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-07 19:36:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-08 15:47:15 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-09 07:11:38 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-09 21:54:40 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-10 08:12:11 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-12 00:12:15 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-13 23:29:02 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 04:38:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 04:38:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 13:54:08 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-15 20:19:02 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-16 08:09:36 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-17 04:38:19 | StopDb-C:1          | pg_ctl: server does not shut down
 aholehole     | 2018-06-17 19:47:19 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-18 04:06:04 | StopDb-C:4          | pg_ctl: server does not shut down
 aholehole     | 2018-06-18 16:23:17 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-19 18:45:19 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-20 09:15:42 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-20 09:15:42 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 04:41:45 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 13:06:30 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-21 23:50:37 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-22 13:11:55 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-23 02:09:06 | Check               | pg_ctl: server does not shut down
 aholehole     | 2018-06-23 13:12:49 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 01:06:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 01:06:02 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 08:25:08 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-25 16:10:48 | test-decoding-check | pg_ctl: server does not shut down
 aholehole     | 2018-06-26 03:34:45 | pg_upgradeCheck     | pg_ctl: server does not shut down
 aholehole     | 2018-06-26 12:47:21 | StopDb-C:1          | pg_ctl: server does not shut down
 gull          | 2018-08-24 03:27:16 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-08-24 08:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-08-29 09:26:52 | pg_rewindCheck      | pg_ctl: server does not shut down
 gull          | 2018-08-30 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-09-06 02:10:16 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-09-14 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 sungazer      | 2018-09-18 19:44:53 | subscriptionCheck   | pg_ctl: server does not shut down
 gull          | 2018-09-26 02:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-10-01 08:10:04 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-03 02:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-07 03:45:39 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-07 05:07:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-10-07 08:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-13 02:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-13 05:21:13 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-10-29 03:25:28 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-15 11:55:56 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-16 09:10:13 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-30 10:34:17 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2018-11-30 11:44:54 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2018-12-08 04:29:34 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-01-30 05:18:59 | recoveryCheck       | pg_ctl: server does not shut down
 francolin     | 2019-02-07 07:55:41 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-02-20 03:10:06 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-02-22 03:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 spurfowl      | 2019-02-25 11:02:04 | pg_rewindCheck      | pg_ctl: server does not shut down
 sidewinder    | 2019-02-25 16:45:00 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-03-06 03:10:05 | recoveryCheck       | pg_ctl: server does not shut down
 piculet       | 2019-03-23 22:28:59 | recoveryCheck       | pg_ctl: server does not shut down
 gull          | 2019-04-11 06:25:20 | recoveryCheck       | pg_ctl: server does not shut down
 mereswine     | 2019-04-12 06:31:30 | recoveryCheck       | pg_ctl: server does not shut down
 dragonet      | 2019-04-16 06:14:01 | recoveryCheck       | pg_ctl: server does not shut down
(188 rows)


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: finding changed blocks using WAL scanning
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: Race conditions with checkpointer and shutdown