Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)

From Andres Freund
Subject Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Msg-id 20130621094158.GA29841@alap2.anarazel.de
In reply to Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 2013-06-20 22:36:45 -0400, Alvaro Herrera wrote:
> Noah Misch wrote:
> > On Thu, Jun 20, 2013 at 12:33:25PM -0400, Alvaro Herrera wrote:
> > > MauMau wrote:
> > > > Here, "reliable" means that the database server is certainly shut
> > > > down when pg_ctl returns, not telling a lie that "I shut down the
> > > > server processes for you, so you do not have to be worried that some
> > > > postgres process might still remain and write to disk".  I suppose
> > > > reliable shutdown is crucial especially in HA cluster.  If pg_ctl
> > > > stop -mi gets stuck forever when there is an unkillable process (in
> > > > what situations does this happen? OS bug, or NFS hard mount?), I
> > > > think the DBA has to notice this situation from the unfinished
> > > > pg_ctl, investigate the cause, and take corrective action.
> > > 
> > > So you're suggesting that keeping postmaster up is a useful sign that
> > > the shutdown is not going well?  I'm not really sure about this.  What
> > > do others think?
> > 
> > It would be valuable for "pg_ctl -w -m immediate stop" to have the property
> > that a subsequent start attempt will not fail due to the presence of some
> > backend still attached to shared memory.  (Maybe that's true anyway or can be
> > achieved a better way; I have not investigated.)
> 
> Well, the only case where a process that's been SIGKILLed does not go
> away, as far as I know, is when it is in some uninterruptible sleep due
> to in-kernel operations that get stuck.  Personally I have never seen
> this happen in any other case than some network filesystem getting
> disconnected, or a disk that doesn't respond.  And whenever the
> filesystem starts to respond again, the process gets out of its sleep
> only to die due to the signal.

Those are the situations in which it takes a really long time, yes. But
there can also be timing issues. Consider a backend that's currently
stuck in a write() because it hit the dirty limit. Say you have a
postgres cluster that's slowing down to a crawl because it's
overloaded and hitting the dirty limit. Somebody might very well just
want to restart it with -m immediate. In that case a delay of a second
or two, until enough dirty memory has been written back that write() can
continue, is enough for the new postmaster to start up, try to
attach to shared memory, and find that the shared memory is still
in use.

I don't really see any argument for *not* waiting. Sure, it might take a
bit longer until it's shut down, but if it doesn't wait, that will cause
problems down the road.
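
To make "still in use" concrete: for a System V segment, the kernel keeps a count of attached processes that can be read with shmctl(IPC_STAT). The following is only a rough sketch of that kind of check, not the actual postmaster code. attached_count() and demo_attach_count() are names made up for illustration, and the demo uses a throwaway private segment rather than a real cluster's segment.

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Hypothetical helper: number of processes currently attached to the
 * System V segment shmid, or -1 if the segment cannot be inspected.
 */
long attached_count(int shmid)
{
    struct shmid_ds ds;

    if (shmctl(shmid, IPC_STAT, &ds) < 0)
        return -1;
    return (long) ds.shm_nattch;
}

/*
 * Hypothetical demo: create a private segment and watch the attach
 * count while this process attaches and detaches.
 * Returns 1 if the count went 0 -> 1 -> 0, 0 if it did not, and -1 if
 * System V shared memory is unavailable in this environment.
 */
int demo_attach_count(void)
{
    int   shmid = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    long  before, during, after;
    void *addr;

    if (shmid < 0)
        return -1;

    before = attached_count(shmid);

    addr = shmat(shmid, NULL, 0);
    if (addr == (void *) -1)
    {
        shmctl(shmid, IPC_RMID, NULL);
        return -1;
    }
    during = attached_count(shmid);

    shmdt(addr);
    after = attached_count(shmid);

    /* Remove the throwaway segment again. */
    shmctl(shmid, IPC_RMID, NULL);

    return before == 0 && during == 1 && after == 0;
}
```

A backend stuck in write() has not exited yet, so it still counts as attached. That is exactly the window in which a too-early restart would see a nonzero count.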

> If we leave postmaster running after SIGKILLing its children, the only
> thing we can do is have it continue to SIGKILL processes continuously
> every few seconds until they die (or just sit around doing nothing until
> they all die).  I don't think this will have a different effect than
> postmaster going away trusting the first SIGKILL to do its job
> eventually.

I think it should just wait till all its child processes are dead. No
need to repeat sending the signals - as you say, that won't help.
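
As a minimal sketch of "signal once, then just wait", assuming a plain blocking waitpid() loop (the real postmaster reaps children from its own event loop, so this is not how it is structured): wait_for_all_children() and demo_kill_and_wait() are hypothetical names used only for this example.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_DEMO_CHILDREN 16

/*
 * Block until every child of the calling process has exited.
 * Returns the number of children reaped, or -1 on an unexpected error.
 */
int wait_for_all_children(void)
{
    int reaped = 0;

    for (;;)
    {
        pid_t pid = waitpid(-1, NULL, 0);

        if (pid > 0)
        {
            reaped++;
            continue;
        }
        if (errno == EINTR)
            continue;           /* interrupted by a signal, retry */
        if (errno == ECHILD)
            return reaped;      /* no children left */
        return -1;
    }
}

/*
 * Hypothetical demo: fork n children that sleep forever, send each one
 * a single SIGKILL, then wait without re-sending anything.
 * Returns the number of children reaped, or -1 on failure.
 */
int demo_kill_and_wait(int n)
{
    pid_t pids[MAX_DEMO_CHILDREN];
    int   started = 0;
    int   failed = 0;
    int   reaped;

    if (n < 0 || n > MAX_DEMO_CHILDREN)
        return -1;

    for (int i = 0; i < n; i++)
    {
        pid_t pid = fork();

        if (pid < 0)
        {
            failed = 1;
            break;
        }
        if (pid == 0)
        {
            for (;;)
                pause();        /* child: sleep until killed */
        }
        pids[started++] = pid;
    }

    /* One SIGKILL per child; no retries. */
    for (int i = 0; i < started; i++)
        kill(pids[i], SIGKILL);

    reaped = wait_for_all_children();
    return failed ? -1 : reaped;
}
```

If one child is stuck in an uninterruptible sleep, the loop simply blocks until the kernel lets that child die, which is the behaviour argued for here.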



What we could do to improve the robustness a bit - at least on Linux -
is to call prctl(PR_SET_PDEATHSIG, SIGKILL) in the children, which would
cause them to be killed if the postmaster goes away...
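
A rough, Linux-only sketch of what that could look like in a freshly forked child follows. die_with_parent(), current_pdeathsig() and demo_pdeathsig() are hypothetical names, not existing PostgreSQL functions. The getppid() comparison is an assumption on my side about how to close the window in which the parent dies before prctl() has taken effect.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ask the kernel to send SIGKILL to the calling process when its parent
 * goes away. Meant to be called in the child right after fork().
 * Returns 0 on success, -1 if prctl() failed or the parent is already gone.
 */
int die_with_parent(pid_t expected_parent)
{
    if (prctl(PR_SET_PDEATHSIG, SIGKILL) < 0)
        return -1;

    /* The parent may have exited before the prctl() call took effect. */
    if (getppid() != expected_parent)
        return -1;

    return 0;
}

/* Parent-death signal currently configured for the caller, or -1. */
int current_pdeathsig(void)
{
    int sig = 0;

    if (prctl(PR_GET_PDEATHSIG, &sig) < 0)
        return -1;
    return sig;
}

/*
 * Hypothetical demo: fork a child, let it arm the parent-death signal
 * and report through its exit status whether that worked.
 * Returns 0 on success, 1 if the child reported a problem, -1 on error.
 */
int demo_pdeathsig(void)
{
    pid_t parent = getpid();
    pid_t child = fork();
    int   status;

    if (child < 0)
        return -1;

    if (child == 0)
    {
        int ok = die_with_parent(parent) == 0 &&
                 current_pdeathsig() == SIGKILL;

        _exit(ok ? 0 : 1);
    }

    if (waitpid(child, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
}
```

Note that the setting is per process and is cleared in the child of a fork(), so every child would have to arm it for itself.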

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


