Re: checkpointer code behaving strangely on postmaster -T

From Alvaro Herrera
Subject Re: checkpointer code behaving strangely on postmaster -T
Date
Msg-id 1336659379-sup-7447@alvh.no-ip.org
In reply to Re: checkpointer code behaving strangely on postmaster -T  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: checkpointer code behaving strangely on postmaster -T  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: checkpointer code behaving strangely on postmaster -T  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Excerpts from Tom Lane's message of Thu May 10 02:27:32 -0400 2012:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I noticed while doing some tests that the checkpointer process does not
> > recover very nicely after a backend crashes under postmaster -T (after
> > all processes have been kill -CONTd, of course, and postmaster told to
> > shut down via Ctrl-C on its console).  For some reason it seems to get
> > stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
> > do a checkpoint, but it was progressing a single page each time and then
> > sleeping.  In that condition, the checkpoint took a very long time to
> > finish.
>
> Is this still a problem as of HEAD?  I think I've fixed some issues in
> the checkpointer's outer loop logic, but not sure if what you saw is
> still there.

Yep, it's still there as far as I can tell.  A backtrace from the
checkpointer shows it's waiting on the latch.

It seems to me that the bug is in the postmaster state machine rather
than checkpointer itself.  After a few false starts, this seems to fix
it:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                    signal_child(WalWriterPID, SIGTERM);
                if (BgWriterPID != 0)
                    signal_child(BgWriterPID, SIGTERM);
+               if (FatalError && CheckpointerPID != 0)
+                   signal_child(CheckpointerPID, SIGUSR2);
 
                /*
                 * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
                signal_child(WalReceiverPID, SIGTERM);
            if (BgWriterPID != 0)
                signal_child(BgWriterPID, SIGTERM);
+           if (FatalError && CheckpointerPID != 0)
+               signal_child(CheckpointerPID, SIGUSR2);
            if (pmState == PM_RECOVERY)
            {
                /* only checkpointer is active in this state */


Note that since checkpointer can only be running after we enter
FatalError when the -T (send SIGSTOP instead of SIGQUIT) switch is used,
this bug doesn't seem to affect normal usage.  (I'm not sure SIGUSR2 is
the most appropriate signal to send at this time -- since we're in
FatalError, probably SIGQUIT is better suited.)
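
To make the signal question concrete, here is a minimal standalone sketch of
the decision the patch adds.  It is not postmaster code: the helper name
checkpointer_shutdown_signal and its prefer_sigquit parameter are made up
for illustration, and the real change simply calls signal_child() inline as
in the diff.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper, not part of postmaster.c.  Returns the signal that
 * would be sent to the checkpointer during shutdown, or 0 for "send nothing".
 */
static int
checkpointer_shutdown_signal(bool fatal_error, pid_t checkpointer_pid,
                             bool prefer_sigquit)
{
    /* A negative PID never denotes a child process here. */
    assert(checkpointer_pid >= 0);

    /*
     * Mirror the patch's condition: signal only when we are in FatalError
     * and a checkpointer is actually running (PID 0 means "none").
     */
    if (!fatal_error || checkpointer_pid == 0)
        return 0;

    /* The posted patch sends SIGUSR2; SIGQUIT is the alternative raised. */
    return prefer_sigquit ? SIGQUIT : SIGUSR2;
}
```

With prefer_sigquit set to false this matches the posted patch; setting it
to true models the SIGQUIT variant, which may well be the better fit given
that we are already in FatalError.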

One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion while checkpointer was still running; the debug output let me
know that checkpointer went away immediately.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

