Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)

From: Robert Haas
Subject: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
Msg-id: CA+Tgmobr+Q2WgWeasdbDNefVwJkAGALxA=-VtEGNtQgL1V2Ryw@mail.gmail.com
In reply to: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: test_shm_mq failing on anole (was: Sending out a request for more buildfarm animals?)
List: pgsql-hackers
On Fri, Oct 3, 2014 at 1:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Further debugging reveals that sigusr1_handler() gets called
> repeatedly, to start autovacuum workers, and it keeps waking up and
> starting them.  But that doesn't cause the background workers to get
> started either, because although sigusr1_handler() contains a call to
> maybe_start_bgworker, it only does that if start_bgworker = true,
> which only happens if
> CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE) is true,
> which on these calls it isn't.
>
> And I think this might also be the missing ingredient answering
> Andres's question from before: why doesn't the 60-second
> select()-timeout cause the background worker to eventually start even
> if the SELECT doesn't get interrupted?  There seems to be a SIGUSR1
> arriving about every 3 seconds, and I bet that's resetting the timer
> every time.
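
To make that concrete, the logic in sigusr1_handler() is shaped
roughly like this (condensed, not the verbatim source):

    static void
    sigusr1_handler(SIGNAL_ARGS)
    {
        bool        start_bgworker = false;

        if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
        {
            BackgroundWorkerStateChange();
            start_bgworker = true;
        }

        /* Autovacuum-worker requests arrive here too ... */
        if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER))
            StartAutovacuumWorker();

        /* ... but they don't set start_bgworker, so this never runs: */
        if (start_bgworker)
            maybe_start_bgworker();
    }

And each such signal also restarts the select() timeout in
ServerLoop(), which is why the 60-second fallback never fires.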

For now I propose to address this by committing the attached patch,
which gets rid of the separate start_bgworker flag inside
sigusr1_handler() and instead uses the same StartWorkerNeeded ||
HaveCrashedWorker test that we use inside ServerLoop() to decide
whether to call maybe_start_bgworker().  Either more signals will
arrive (in which case the signal handler will launch an additional
background worker every time a signal arrives) or they won't (in which
case the 60-second timeout will eventually expire, and ServerLoop()
will kick into high gear and satisfy all outstanding requests).  This
isn't really right, because there might still be a quite noticeable
delay waiting for workers to get launched, but at least the delay
would be bounded to at most 60 seconds rather than, as at present,
potentially infinite.
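
The guts of the change, roughly:

    -    if (start_bgworker)
    +    if (StartWorkerNeeded || HaveCrashedWorker)
             maybe_start_bgworker();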

A totally correct fix will require a bit more thought.  A search of
the git history reveals that the problem of a signal restarting the
timeout is not new: Tom fixed a similar problem back in 2007 by making
the autovacuum launcher sleep for at most a second at a time.  Such a
fix isn't ideal here, because we really don't want an up-to-1-second
delay launching a newly-registered background worker if there's a way
to avoid that -- it's probably OK for launching daemons, but it's not
so hot for parallel query.  However, we could:

(1) Use the self-pipe trick (rough sketch below).  We could not
directly use a latch, at least not without a new API, because we might
be waiting on more than one socket.

(2) Have the postmaster not set SA_RESTART for the sigusr1 handler
(again, sketch below).  I don't know how platform-independent this
approach would be.

(3) Have sigusr1_handler() repeatedly call maybe_start_bgworker()
until StartWorkerNeeded becomes false, instead of just calling it once
(sketch below).  ServerLoop() is carefully coded to call
maybe_start_bgworker() just once per iteration, presumably to make
sure the server stays responsive even if the bgworker-starting
machinery is quite busy; looping inside the signal handler would give
up that nice property unless we had some way to break out of the loop
if there's activity on the socket.
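
For (1), the standard shape of the self-pipe trick would be something
along these lines (untested sketch; the names are invented):

    #include <fcntl.h>
    #include <unistd.h>

    static int  self_pipe[2];

    /* Called once at postmaster startup. */
    static void
    init_self_pipe(void)
    {
        if (pipe(self_pipe) < 0)
            ereport(FATAL, (errmsg("pipe() failed: %m")));
        /* Non-blocking on both ends so the handler can never block. */
        fcntl(self_pipe[0], F_SETFL, O_NONBLOCK);
        fcntl(self_pipe[1], F_SETFL, O_NONBLOCK);
    }

    /* Called from sigusr1_handler(): wake anyone blocked in select(). */
    static void
    self_pipe_wakeup(void)
    {
        char        c = 0;

        /* EAGAIN when the pipe is full is fine: a wakeup is already
         * pending. */
        (void) write(self_pipe[1], &c, 1);
    }

ServerLoop() would then include self_pipe[0] in its select() read mask
alongside the listen sockets, and drain it (read until EAGAIN) before
retesting StartWorkerNeeded || HaveCrashedWorker.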
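For (2), it would mean installing the handler more or less like this,
instead of going through pqsignal(), which sets SA_RESTART:

    #include <signal.h>
    #include <string.h>

    struct sigaction act;

    memset(&act, 0, sizeof(act));
    act.sa_handler = sigusr1_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;   /* no SA_RESTART: select() should fail with EINTR */
    if (sigaction(SIGUSR1, &act, NULL) < 0)
        ereport(FATAL, (errmsg("sigaction() failed: %m")));

Whether every platform we care about then really returns EINTR from
select() instead of restarting it is exactly the portability question.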
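And for (3), the naive version is a one-line loop in the handler,
assuming maybe_start_bgworker() eventually clears StartWorkerNeeded
once nothing is left to start:

    /* In sigusr1_handler(), replacing the single call: */
    while (StartWorkerNeeded)
        maybe_start_bgworker();

To keep the responsiveness property ServerLoop() has, the loop would
also need some cheap way to notice pending connections and bail out,
e.g. a zero-timeout select() on the listen sockets -- which rather
undercuts the simplicity.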

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments
