[HACKERS] Another reason why the recovery tests take a long time

From: Tom Lane
Subject: [HACKERS] Another reason why the recovery tests take a long time
Msg-id: 21344.1498494720@sss.pgh.pa.us
Responses: Re: [HACKERS] Another reason why the recovery tests take a long time  (Andres Freund <andres@anarazel.de>)
Re: [HACKERS] Another reason why the recovery tests take a long time  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers

I've found another edge-case bug through investigation of unexpectedly
slow recovery test runs.  It goes like this:

* While streaming from master to slave, test script shuts down master
while slave is left running.  We soon restart the master, but meanwhile:

* slave's walreceiver process fails, reporting

2017-06-26 16:06:50.209 UTC [13209] LOG:  replication terminated by primary server
2017-06-26 16:06:50.209 UTC [13209] DETAIL:  End of WAL reached on timeline 1 at 0/3000098.
2017-06-26 16:06:50.209 UTC [13209] FATAL:  could not send end-of-streaming message to primary: no COPY in progress

* slave's startup process observes that walreceiver is gone and sends
PMSIGNAL_START_WALRECEIVER to ask for a new one

* more often than you would guess, in fact nearly 100% reproducibly for
me, the postmaster receives/services the PMSIGNAL before it receives
SIGCHLD for the walreceiver.  In this situation sigusr1_handler just
throws away the walreceiver start request, reasoning that the walreceiver
is already running.

* eventually, it dawns on the startup process that the walreceiver
isn't starting, and it asks for a new one.  But that takes ten seconds
(WALRCV_STARTUP_TIMEOUT).

So this looks like a pretty obvious race condition in the postmaster,
which should be resolved by having it set a flag on receipt of
PMSIGNAL_START_WALRECEIVER that's cleared only when it does start a
new walreceiver.  But I wonder whether it's intentional that the old
walreceiver dies in the first place.  That FATAL exit looks suspiciously
like it wasn't originally-designed-in behavior.
        regards, tom lane
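
For illustration only, here is a minimal sketch of the race and of the proposed flag. It is a toy model, not the actual postmaster code: ToyPostmaster, launch_walreceiver, on_start_request, on_walreceiver_exit and replacement_started are hypothetical names, and signal delivery is reduced to two function calls made in the problematic order (start request first, child exit second).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the postmaster's bookkeeping (not real PostgreSQL code). */
typedef struct ToyPostmaster
{
    bool walreceiver_running;   /* postmaster still believes a walreceiver exists */
    bool start_requested;       /* proposed flag: a start request is pending */
    bool use_flag;              /* false = behavior described above, true = proposed fix */
} ToyPostmaster;

static void launch_walreceiver(ToyPostmaster *pm)
{
    assert(!pm->walreceiver_running);   /* never run two walreceivers at once */
    pm->walreceiver_running = true;
    pm->start_requested = false;        /* cleared only when a start really happens */
}

/* Models servicing PMSIGNAL_START_WALRECEIVER. */
static void on_start_request(ToyPostmaster *pm)
{
    if (pm->use_flag)
        pm->start_requested = true;     /* remember the request */
    if (!pm->walreceiver_running)
        launch_walreceiver(pm);
    /* Without the flag, a request arriving while the old child is
     * still counted as running is simply forgotten here. */
}

/* Models reaping the dead walreceiver after SIGCHLD. */
static void on_walreceiver_exit(ToyPostmaster *pm)
{
    pm->walreceiver_running = false;
    if (pm->start_requested)
        launch_walreceiver(pm);         /* honor the remembered request */
}

/* Replays the ordering from the report and says whether a replacement
 * walreceiver is running once both events have been handled. */
bool replacement_started(bool use_flag)
{
    ToyPostmaster pm = {
        .walreceiver_running = true,
        .start_requested = false,
        .use_flag = use_flag
    };

    on_start_request(&pm);      /* PMSIGNAL serviced first ... */
    on_walreceiver_exit(&pm);   /* ... SIGCHLD only afterwards */
    return pm.walreceiver_running;
}
```

In this model, replacement_started(false) returns false: nothing restarts the walreceiver until the startup process asks again after WALRCV_STARTUP_TIMEOUT. With the flag, replacement_started(true) returns true, because the remembered request is acted on as soon as the old child is reaped.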


