Re: problems on Solaris

From Robert Haas
Subject Re: problems on Solaris
Date
Msg-id CA+TgmoajRM0RJePuDxw2FK1Gts4gMAgVmbQ+9tHszYr4UsomEw@mail.gmail.com
In response to Re: problems on Solaris  (Andres Freund <andres@anarazel.de>)
Responses Re: problems on Solaris  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Mon, May 25, 2015 at 10:05 PM, Andres Freund <andres@anarazel.de> wrote:
> Hm. So we have a *occasional* stack size exceeded failure and an
> occasional spinlock error in test_shm_mq. I'm inclined to think that
> this is a shm_mq problem, and not a more general locking problem - it
> seems likely, but not guaranteed, that that'd have materialized
> elsewhere.

I think the problem might be that the spinlock-based memory barrier is
not re-entrant.  Suppose some kind of barrier operation is in progress,
and we've acquired the dummy spinlock but not yet released it.  Just
then, we receive a signal.  Since the shm_mq code sets
set_latch_on_sigusr1, procsignal_sigusr1_handler will set MyLatch.
SetLatch now includes barrier operations, so we'll try to acquire and
release the spinlock despite already holding it.  Oops.

> Robert: IIRC there was some problems with shm_mq tests being stuck
> before, right?

The last round of investigation, on anole, resulted in this fix:

commit d0410d66037c2f3f9bee45e0a2db9e47eeba2bb4
Author: Robert Haas <rhaas@postgresql.org>
Date:   Sat Oct 4 21:25:41 2014 -0400
   Eliminate one background-worker-related flag variable.

   Teach sigusr1_handler() to use the same test for whether a worker
   might need to be started as ServerLoop().  Aside from being perhaps
   a bit simpler, this prevents a potentially-unbounded delay when
   starting a background worker.  On some platforms, select() doesn't
   return when interrupted by a signal, but is instead restarted,
   including a reset of the timeout to the originally-requested value.
   If signals arrive often enough, but no connection requests arrive,
   sigusr1_handler() will be executed repeatedly, but the body of
   ServerLoop() won't be reached.  This change ensures that, even in
   that case, background workers will eventually get launched.

   This is far from a perfect fix; really, we need select() to return
   control to ServerLoop() after an interrupt, either via the self-pipe
   trick or some other mechanism.  But that's going to require more
   work and discussion, so let's do this for now to at least mitigate
   the damage.

   Per investigation of test_shm_mq failures on buildfarm member anole.

The problem here isn't really with test_shm_mq; it's with the
postmaster.  To really make this work properly, we need to be able to
use latches in the postmaster, and we need to generalize
WaitLatchOrSocket so that it can wait for a latch or any of n sockets.
Then ServerLoop can use that instead of calling select directly.  This
will probably look a lot like what you did to get rid of
ImmediateInterruptOK.
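
For illustration only, the self-pipe idea combined with a wait on n
sockets could look roughly like the sketch below.  This is not the
PostgreSQL latch API: wake_init, wake_set, wait_wake_or_sockets,
MAX_SOCKS and the WAIT_* codes are all hypothetical names, and poll()
is assumed to be available.  The point is that a wakeup written from a
signal handler makes the pipe readable, so the wait returns even on
platforms that restart the system call after a signal.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#define MAX_SOCKS 16            /* arbitrary limit for this sketch */

enum { WAIT_ERROR = -3, WAIT_WOKEN = -2, WAIT_TIMEOUT = -1 };

static int wake_pipe[2] = {-1, -1};

static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    return (flags < 0) ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Create the self-pipe; both ends non-blocking.  Returns 0 on success. */
int wake_init(void)
{
    if (pipe(wake_pipe) < 0)
        return -1;
    if (set_nonblocking(wake_pipe[0]) < 0 || set_nonblocking(wake_pipe[1]) < 0)
        return -1;
    return 0;
}

/* Safe to call from a signal handler: only write() and errno save/restore. */
void wake_set(void)
{
    int  save_errno = errno;
    char b = 0;

    if (write(wake_pipe[1], &b, 1) < 0)
    {
        /* EAGAIN means the pipe is full, so a wakeup is already pending. */
    }
    errno = save_errno;
}

/*
 * Wait until woken, a socket is readable, or timeout_ms elapses.
 * Returns WAIT_WOKEN, WAIT_TIMEOUT, WAIT_ERROR, or the index of a
 * ready socket in socks[].
 */
int wait_wake_or_sockets(const int *socks, int nsocks, int timeout_ms)
{
    struct pollfd pfds[MAX_SOCKS + 1];
    char drain[64];
    int  i, rc;

    if (nsocks < 0 || nsocks > MAX_SOCKS)
        return WAIT_ERROR;

    pfds[0].fd = wake_pipe[0];
    pfds[0].events = POLLIN;
    for (i = 0; i < nsocks; i++)
    {
        pfds[i + 1].fd = socks[i];
        pfds[i + 1].events = POLLIN;
    }

    /* Retrying on EINTR is harmless: a pending wakeup keeps the pipe readable. */
    do
        rc = poll(pfds, (nfds_t) (nsocks + 1), timeout_ms);
    while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return WAIT_ERROR;
    if (rc == 0)
        return WAIT_TIMEOUT;

    if (pfds[0].revents & POLLIN)
    {
        /* Drain all pending wakeup bytes so the next wait blocks again. */
        while (read(wake_pipe[0], drain, sizeof(drain)) > 0)
            ;
        return WAIT_WOKEN;
    }
    for (i = 0; i < nsocks; i++)
        if (pfds[i + 1].revents & (POLLIN | POLLHUP | POLLERR))
            return i;
    return WAIT_ERROR;
}
```

A loop like ServerLoop would call wake_init() once, call wake_set() from
its signal handlers, and then dispatch on the return value of
wait_wake_or_sockets() rather than calling select directly.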

But all of that seems unrelated to the current problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


