Re: checkpointer code behaving strangely on postmaster -T
From | Alvaro Herrera |
---|---|
Subject | Re: checkpointer code behaving strangely on postmaster -T |
Date | |
Msg-id | 1336659379-sup-7447@alvh.no-ip.org |
In response to | Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>)
Re: checkpointer code behaving strangely on postmaster -T (Tom Lane <tgl@sss.pgh.pa.us>) |
List | pgsql-hackers |
Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> > I noticed while doing some tests that the checkpointer process does not
> > recover very nicely after a backend crashes under postmaster -T (after
> > all processes have been kill -CONTd, of course, and postmaster told to
> > shut down via Ctrl-C on its console).  For some reason it seems to get
> > stuck in a loop doing sleep(0.5s).  In another case I caught it trying to
> > do a checkpoint, but it was progressing a single page each time and then
> > sleeping.  In that condition, the checkpoint took a very long time to
> > finish.
>
> Is this still a problem as of HEAD?  I think I've fixed some issues in
> the checkpointer's outer loop logic, but not sure if what you saw is
> still there.

Yep, it's still there as far as I can tell.  A backtrace from the
checkpointer shows it's waiting on the latch.

It seems to me that the bug is in the postmaster state machine rather
than checkpointer itself.  After a few false starts, this seems to fix
it:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                 signal_child(WalWriterPID, SIGTERM);
             if (BgWriterPID != 0)
                 signal_child(BgWriterPID, SIGTERM);
+            if (FatalError && CheckpointerPID != 0)
+                signal_child(CheckpointerPID, SIGUSR2);
 
             /*
              * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
                 signal_child(WalReceiverPID, SIGTERM);
             if (BgWriterPID != 0)
                 signal_child(BgWriterPID, SIGTERM);
+            if (FatalError && CheckpointerPID != 0)
+                signal_child(CheckpointerPID, SIGUSR2);
             if (pmState == PM_RECOVERY)
             {
                 /* only checkpointer is active in this state */

Note that since checkpointer can only be running after we enter
FatalError when the -T (send SIGSTOP instead of SIGQUIT) switch is used,
this bug doesn't seem to affect normal usage.
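
For readers who want the condition in isolation, the guard that both hunks add can be sketched outside PostgreSQL roughly as follows.  This is only an illustration, not postmaster code: the function name checkpointer_shutdown_signal and its two parameters are hypothetical stand-ins for the FatalError flag and CheckpointerPID, and the real patch calls signal_child() rather than returning a value.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical model of the guard added in pmdie(): decide which signal,
 * if any, a shutdown request should send to the checkpointer.
 *
 * fatal_error      stands in for the postmaster's FatalError flag
 * checkpointer_pid stands in for CheckpointerPID (0 means "not running")
 *
 * Returns SIGUSR2 when the checkpointer is still alive after a crash,
 * and 0 when nothing should be sent from this code path.
 */
int
checkpointer_shutdown_signal(bool fatal_error, pid_t checkpointer_pid)
{
    if (fatal_error && checkpointer_pid != 0)
        return SIGUSR2;
    return 0;
}
```

The sketch returns 0 whenever FatalError is not set, which mirrors the patch leaving the non-crash shutdown paths untouched.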

(I'm not sure SIGUSR2 is the most appropriate signal to send at this
time -- since we're in FatalError, probably SIGQUIT is better suited.)

One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion while checkpointer was still running; the debug output let me
know that checkpointer went away immediately.

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support