Re: checkpointer code behaving strangely on postmaster -T

From	Alvaro Herrera
Subject	Re: checkpointer code behaving strangely on postmaster -T
Date	May 10, 2012 15:05:15
Msg-id	1336659379-sup-7447@alvh.no-ip.org
In reply to	Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>) Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Excerpts from Tom Lane's message of Thu May 10 02:27:32 -0400 2012:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I noticed while doing some tests that the checkpointer process does not
> > recover very nicely after a backend crashes under postmaster -T (after
> > all processes have been kill -CONTd, of course, and postmaster told to
> > shut down via Ctrl-C on its console).  For some reason it seems to get
> > stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
> > do a checkpoint, but it was progressing a single page each time and then
> > sleeping.  In that condition, the checkpoint took a very long time to
> > finish.
>
> Is this still a problem as of HEAD?  I think I've fixed some issues in
> the checkpointer's outer loop logic, but not sure if what you saw is
> still there.

Yep, it's still there as far as I can tell.  A backtrace from the
checkpointer shows it's waiting on the latch.

It seems to me that the bug is in the postmaster state machine rather
than checkpointer itself.  After a few false starts, this seems to fix
it:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                    signal_child(WalWriterPID, SIGTERM);
                if (BgWriterPID != 0)
                    signal_child(BgWriterPID, SIGTERM);
+               if (FatalError && CheckpointerPID != 0)
+                   signal_child(CheckpointerPID, SIGUSR2);
 
                /*
                 * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
                signal_child(WalReceiverPID, SIGTERM);
            if (BgWriterPID != 0)
                signal_child(BgWriterPID, SIGTERM);
+           if (FatalError && CheckpointerPID != 0)
+               signal_child(CheckpointerPID, SIGUSR2);
            if (pmState == PM_RECOVERY)
            {
                /* only checkpointer is active in this state */

Note that since checkpointer can only be running after we enter
FatalError when the -T (send SIGSTOP instead of SIGQUIT) switch is used,
this bug doesn't seem to affect normal usage.  (I'm not sure SIGUSR2 is
the most appropriate signal to send at this time -- since we're in
FatalError, probably SIGQUIT is better suited.)
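
To make the condition concrete, here is a rough standalone model of the two
decisions involved.  This is only a sketch, not PostgreSQL code: ModelState,
crash_signal() and checkpointer_shutdown_signal() are names made up for
illustration, and the real logic lives in postmaster.c.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the bits of postmaster state that matter here. */
typedef struct ModelState
{
    bool  fatal_error;       /* models FatalError: a backend has crashed */
    pid_t checkpointer_pid;  /* models CheckpointerPID; 0 means not running */
} ModelState;

/*
 * Signal sent to the surviving children after a backend crash.
 * With -T the children are stopped (SIGSTOP) instead of killed (SIGQUIT),
 * which is why checkpointer can still exist while FatalError is set.
 */
static int
crash_signal(bool send_stop)
{
    return send_stop ? SIGSTOP : SIGQUIT;
}

/*
 * Signal the patched shutdown path would send to the checkpointer,
 * or 0 when the added branch does not fire.
 */
static int
checkpointer_shutdown_signal(const ModelState *st)
{
    assert(st != NULL);
    if (st->fatal_error && st->checkpointer_pid != 0)
        return SIGUSR2;
    return 0;
}
```

Without -T the checkpointer gets SIGQUIT at crash time, so by the time the
shutdown request arrives checkpointer_pid would already be 0 and the new
branch is never taken.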

One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion while checkpointer was still running; the debug output let me
know that checkpointer went away immediately.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
