Thread: Condition variable live lock


Condition variable live lock

From: Thomas Munro
Date:
Hi hackers,

While debugging a build farm assertion failure after commit 18042840,
and working on the assumption that the problem is timing/scheduling
sensitive, I tried hammering the problem workload on a few different
machines.  I noticed that my slow 2-core test machine fairly regularly
got into a live lock state for tens to millions of milliseconds at a
time when there were 3+ active processes, in here:

int
ConditionVariableBroadcast(ConditionVariable *cv)
{
        int                     nwoken = 0;

        /*
         * Let's just do this the dumbest way possible.  We could try to dequeue
         * all the sleepers at once to save spinlock cycles, but it's a bit hard
         * to get that right in the face of possible sleep cancelations, and we
         * don't want to loop holding the mutex.
         */
        while (ConditionVariableSignal(cv))
                ++nwoken;

        return nwoken;
}

The problem is that another backend can be woken up, determine that it
would like to wait for the condition variable again, and then get
itself added to the back of the wait queue *before the above loop has
finished*, so this interprocess ping-pong isn't guaranteed to
terminate.  It seems that we'll need something slightly smarter than
the above to avoid that.
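
For context, the waiting side of the API is a loop of this shape (the
predicate and wait_event_info here are just placeholders):

    ConditionVariablePrepareToSleep(cv);
    while (!condition_is_satisfied())       /* hypothetical predicate */
        ConditionVariableSleep(cv, wait_event_info);
    ConditionVariableCancelSleep();

A woken process whose predicate is still (or again) false -- with
barrier.c, for example, one that immediately starts waiting for the next
phase -- gets pushed back onto the tail of the very queue the broadcaster
is still draining.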

I don't currently suspect this phenomenon of being responsible for the
problem I'm hunting, even though it occurs on the only machine I've
been able to reproduce my real problem on.  AFAICT the problem
described in this email can deliver arbitrary numbers of spurious
wake-ups and waste arbitrary amounts of CPU time, but causes no harm
that would affect program correctness.  So I didn't try to write a
patch to fix it just yet.  I think we should probably back-patch a fix
when we have one though, because it could bite Parallel Index Scan in
REL_10_STABLE.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From: Thomas Munro
Date:
On Fri, Dec 22, 2017 at 4:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
>         while (ConditionVariableSignal(cv))
>                 ++nwoken;
>
> The problem is that another backend can be woken up, determine that it
> would like to wait for the condition variable again, and then get
> itself added to the back of the wait queue *before the above loop has
> finished*, so this interprocess ping-pong isn't guaranteed to
> terminate.  It seems that we'll need something slightly smarter than
> the above to avoid that.

Here is one way to fix it: track the wait queue size and use that
number to limit the wakeup loop.  See attached.
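
Roughly, the idea is this (a sketch only -- the field and variable names
are illustrative, not necessarily what the attached patch uses):

    typedef struct
    {
        slock_t         mutex;
        proclist_head   wakeup;
        int             nwaiters;   /* maintained by prepare/cancel/signal */
    } ConditionVariable;

    /* ... and in ConditionVariableBroadcast: */
    int         nwaiters;

    SpinLockAcquire(&cv->mutex);
    nwaiters = cv->nwaiters;
    SpinLockRelease(&cv->mutex);

    /* Wake at most as many processes as were waiting when we started. */
    while (nwaiters-- > 0 && ConditionVariableSignal(cv))
        ++nwoken;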

That's unbackpatchable though, because it changes the size of struct
ConditionVariable, potentially breaking extensions compiled against an
earlier point release.  Maybe this problem won't really cause problems
in v10 anyway?  It requires a particular interaction pattern that
barrier.c produces but more typical client code might not: the awoken
backends keep re-adding themselves because they're waiting for
everyone (including the waker) to do something, but the waker is stuck
in that broadcast loop.

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com

Attachments

Re: Condition variable live lock

From: Andres Freund
Date:
On 2017-12-29 12:16:20 +1300, Thomas Munro wrote:
> Here is one way to fix it: track the wait queue size and use that
> number to limit the wakeup loop.  See attached.
>
> That's unbackpatchable though, because it changes the size of struct
> ConditionVariable, potentially breaking extensions compiled against an
> earlier point release.  Maybe this problem won't really cause problems
> in v10 anyway?  It requires a particular interaction pattern that
> barrier.c produces but more typical client code might not: the awoken
> backends keep re-adding themselves because they're waiting for
> everyone (including the waker) to do something, but the waker is stuck
> in that broadcast loop.

Hm, I'm not quite convinced by this approach. Partially because of the
backpatch issue you mention, partially because using the list length as
a limit doesn't seem quite nice.

Given that the proclist_contains() checks in condition_variable.c are
already racy, I think it might be feasible to collect all procnos to
signal while holding the spinlock, and then signal all of them in one
go.
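
In sketch form (untested; the buffer and its size are illustrative, the
obvious upper bound being the number of backends):

    PGPROC     *victims[MAX_CV_WAITERS];    /* hypothetical bound */
    int         nvictims = 0;
    int         i;

    SpinLockAcquire(&cv->mutex);
    while (!proclist_is_empty(&cv->wakeup))
        victims[nvictims++] = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
    SpinLockRelease(&cv->mutex);

    for (i = 0; i < nvictims; i++)
        SetLatch(&victims[i]->procLatch);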

Obviously it'd be nicer to not hold a spinlock while looping, but that
seems like something we can't fix in the back branches. [insert rant
about never using spinlocks unless there's very very clear convincing
reasons].

- Andres


Re: Condition variable live lock

From: Robert Haas
Date:
On Fri, Dec 29, 2017 at 2:38 PM, Andres Freund <andres@anarazel.de> wrote:
> Hm, I'm not quite convinced by this approach. Partially because of the
> backpatch issue you mention, partially because using the list length as
> a limit doesn't seem quite nice.

Seems OK to me.  Certainly better than your competing proposal.

> Given that the proclist_contains() checks in condition_variable.c are
> already racy, I think it might be feasible to collect all procnos to
> signal while holding the spinlock, and then signal all of them in one
> go.

That doesn't seem very nice at all.  Not only does it violate the
coding rule against looping while holding a spinlock, but it seems
that it would require allocating memory while holding one, which is a
non-starter.

> Obviously it'd be nicer to not hold a spinlock while looping, but that
> seems like something we can't fix in the back branches. [insert rant
> about never using spinlocks unless there's very very clear convincing
> reasons].

I don't think that's a coding rule that I'd be prepared to endorse.
We've routinely used spinlocks for years in cases where the critical
section was very short, just to keep the overhead down.  I think it
works fine in that case, although I admit that I failed to appreciate
how unpleasant the livelock possibilities were in this case.

It's not clear to me that we entirely need a back-patchable fix for
this.  It could be that parallel index scan can have the same issue,
but I'm not aware of any user complaints.  Parallel bitmap heap scan
only ever waits once, so it's probably fine.  If we do need a
back-patchable fix, I suppose the slock_t mutex could be replaced by a
pg_atomic_uint32 state.  I think that would avoid changing the size of
the structure on common platforms, though obscure systems with
spinlocks > 4 bytes might be affected.
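
For illustration, the struct change being contemplated would be roughly
this (a sketch only; the atomics-based locking protocol itself isn't
shown):

    typedef struct
    {
        pg_atomic_uint32 state;     /* replaces slock_t mutex; 4 bytes, so
                                     * the struct size stays the same on
                                     * common platforms */
        proclist_head    wakeup;
    } ConditionVariable;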

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Condition variable live lock

From: Andres Freund
Date:
On 2018-01-04 12:39:47 -0500, Robert Haas wrote:
> > Given that the proclist_contains() checks in condition_variable.c are
> > already racy, I think it might be feasible to collect all procnos to
> > signal while holding the spinlock, and then signal all of them in one
> > go.
> 
> That doesn't seem very nice at all.  Not only does it violate the
> coding rule against looping while holding a spinlock, but it seems
> that it would require allocating memory while holding one, which is a
> non-starter.

We could just use a sufficiently sized buffer beforehand.  There's an
obvious upper bound, so that shouldn't be a big issue.


> > Obviously it'd be nicer to not hold a spinlock while looping, but that
> > seems like something we can't fix in the back branches. [insert rant
> > about never using spinlocks unless there's very very clear convincing
> > reasons].
> 
> I don't think that's a coding rule that I'd be prepared to endorse.
> We've routinely used spinlocks for years in cases where the critical
> section was very short, just to keep the overhead down.

The problem is that due to the contention handling they really don't
keep the overhead that low unless you're absolutely maximizing for a
low number of cycles and have very little contention.  Which isn't
actually common.  I think part of the conventional wisdom about when
to use spinlocks vs. lwlocks went out the window once we got
better-scaling lwlocks.


> It's not clear to me that we entirely need a back-patchable fix for
> this.  It could be that parallel index scan can have the same issue,
> but I'm not aware of any user complaints.

I don't think many users are going to be able to diagnose this one, and
it's probably not easily diagnosable even if they complain about
performance.

Greetings,

Andres Freund


Re: Condition variable live lock

From: Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-01-04 12:39:47 -0500, Robert Haas wrote:
>>> Given that the proclist_contains() checks in condition_variable.c are
>>> already racy, I think it might be feasible to collect all procnos to
>>> signal while holding the spinlock, and then signal all of them in one
>>> go.

>> That doesn't seem very nice at all.  Not only does it violate the
>> coding rule against looping while holding a spinlock, but it seems
>> that it would require allocating memory while holding one, which is a
>> non-starter.

> We could just use a sufficiently sized buffer beforehand. There's an
> obvious upper boundary, so that shouldn't be a big issue.

I share Robert's discomfort with that solution, but it seems to me there
might be a better way.  The attached patch uses our own cvWaitLink as a
sentinel to detect when we've woken everybody who was on the wait list
before we arrived.  That gives exactly the desired semantics, not just an
approximation to them.

Now, the limitation with this is that we can't be waiting for any *other*
condition variable, because then we'd be trashing our state about that
variable.  As coded, we can't be waiting for the target CV either, but
that case could actually be handled if we needed to, as per the comment.
I do not know if this is likely to be a problematic limitation
... discuss.  (The patch does survive check-world, FWIW.)

            regards, tom lane

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 41378d6..0d08eb5 100644
*** a/src/backend/storage/lmgr/condition_variable.c
--- b/src/backend/storage/lmgr/condition_variable.c
*************** int
*** 214,228 ****
  ConditionVariableBroadcast(ConditionVariable *cv)
  {
      int            nwoken = 0;

      /*
!      * Let's just do this the dumbest way possible.  We could try to dequeue
!      * all the sleepers at once to save spinlock cycles, but it's a bit hard
!      * to get that right in the face of possible sleep cancelations, and we
!      * don't want to loop holding the mutex.
       */
!     while (ConditionVariableSignal(cv))
          ++nwoken;

      return nwoken;
  }
--- 214,273 ----
  ConditionVariableBroadcast(ConditionVariable *cv)
  {
      int            nwoken = 0;
+     int            pgprocno = MyProc->pgprocno;

      /*
!      * In some use-cases, it is common for awakened processes to immediately
!      * re-queue themselves.  If we just naively try to reduce the wakeup list
!      * to empty, we'll get into a potentially-indefinite loop against such a
!      * process.  The semantics we really want are just to be sure that we have
!      * wakened all processes that were in the list at entry.  We can use our
!      * own cvWaitLink as a sentinel to detect when we've finished.
!      *
!      * A seeming flaw in this approach is that someone else might signal the
!      * CV and in doing so remove our sentinel entry.  But that's fine: since
!      * CV waiters are always added and removed in order, that must mean that
!      * every previous waiter has been wakened, so we're done.  We'd get an
!      * extra "set" on our latch, which is slightly inefficient but harmless.
!      *
!      * This coding assumes (and asserts) that we never broadcast on a CV that
!      * we are also waiting for.  If that turns out to be necessary, we could
!      * just remove our entry and insert it at the end, and it still works as a
!      * sentinel.
       */
!
!     /* We should not be attempting a CV wait. */
!     Assert(cv_sleep_target == NULL);
!
!     /* Establish the sentinel entry. */
!     SpinLockAcquire(&cv->mutex);
!     Assert(!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink));
!     proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
!     SpinLockRelease(&cv->mutex);
!
!     for (;;)
!     {
!         bool        aminlist;
!         PGPROC       *proc = NULL;
!
!         /*
!          * Check if the sentinel is still there, and if so, remove the first
!          * queue entry (which might be the sentinel, or not).
!          */
!         SpinLockAcquire(&cv->mutex);
!         aminlist = proclist_contains(&cv->wakeup, pgprocno, cvWaitLink);
!         if (aminlist)
!             proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
!         SpinLockRelease(&cv->mutex);
!
!         /* If we or anyone else removed the sentinel, we're done. */
!         if (!aminlist || proc == MyProc)
!             break;
!
!         /* We found someone sleeping, so set their latch to wake them up. */
!         SetLatch(&proc->procLatch);
          ++nwoken;
+     }

      return nwoken;
  }

Re: Condition variable live lock

From: Thomas Munro
Date:
On Fri, Jan 5, 2018 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I share Robert's discomfort with that solution, but it seems to me there
> might be a better way.  The attached patch uses our own cvWaitLink as a
> sentinel to detect when we've woken everybody who was on the wait list
> before we arrived.  That gives exactly the desired semantics, not just an
> approximation to them.

Very clever.  It works correctly for my test case.

> Now, the limitation with this is that we can't be waiting for any *other*
> condition variable, because then we'd be trashing our state about that
> variable.  As coded, we can't be waiting for the target CV either, but
> that case could actually be handled if we needed to, as per the comment.
> I do not know if this is likely to be a problematic limitation
> ... discuss.  (The patch does survive check-world, FWIW.)

I think that restriction is probably OK.  Even if you have some kind
of chain of CVs where you get woken up, check your interesting
condition, and discover that it's now true so you exit your loop and
immediately want to broadcast a signal to some other CV, you'd simply
have to make sure that you put ConditionVariableCancelSleep() before
ConditionVariableBroadcast():

   ConditionVariablePrepareToSleep(cv1);
   while (condition for which we are waiting is not true)
       ConditionVariableSleep(cv1, wait_event_info);
   ConditionVariableCancelSleep();
   ConditionVariableBroadcast(cv2);

It would only be a problem if you are interested in broadcasting to
cv2 when you've been woken up and the condition is *still not true*,
that is, when you've been spuriously woken.  But why would anyone want
to forward spurious wakeups to another CV?

But if that seems too arbitrary, one way to lift the restriction would
be to teach ConditionVariableBroadcast() to call
ConditionVariableCancelSleep() if cv_sleep_target is non-NULL where
you have the current assertion.  Code that is still waiting for a CV
must be in a loop that will eventually re-add it in
ConditionVariableSleep(), and it won't miss any signals that it can't
afford to miss because the first call to ConditionVariableSleep() will
return immediately so the caller will recheck its condition.
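
That is, roughly:

    /* in ConditionVariableBroadcast, instead of Assert(cv_sleep_target == NULL) */
    if (cv_sleep_target != NULL)
        ConditionVariableCancelSleep();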

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, Jan 5, 2018 at 5:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Now, the limitation with this is that we can't be waiting for any *other*
>> condition variable, because then we'd be trashing our state about that
>> variable.  As coded, we can't be waiting for the target CV either, but
>> that case could actually be handled if we needed to, as per the comment.
>> I do not know if this is likely to be a problematic limitation
>> ... discuss.  (The patch does survive check-world, FWIW.)

> ... one way to lift the restriction would
> be to teach ConditionVariableBroadcast() to call
> ConditionVariableCancelSleep() if cv_sleep_target is non-NULL where
> you have the current assertion.  Code that is still waiting for a CV
> must be in a loop that will eventually re-add it in
> ConditionVariableSleep(), and it won't miss any signals that it can't
> afford to miss because the first call to ConditionVariableSleep() will
> return immediately so the caller will recheck its condition.

Oh, of course, very simple.

I thought of another possible issue, though.  In the situation where
someone else has removed our sentinel (presumably, by issuing
ConditionVariableSignal just before we were about to remove the
sentinel), my patch assumes we can just do nothing.  But it seems
like that amounts to losing one signal.  Whoever the someone else
was probably expected to awaken a waiter, and now that won't happen.
Should we rejigger the logic so that it awakens one additional waiter
(if there is one) after detecting that someone else has removed the
sentinel?  Obviously, this trades a risk of loss of wakeup for a risk
of spurious wakeup, but presumably the latter is something we can
cope with.
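
In terms of the patch above, that would amount to something like this
(untested sketch):

    /*
     * If someone else consumed our sentinel, forward the wakeup rather
     * than silently swallowing it; a spurious wakeup is harmless.
     */
    if (!aminlist)
        ConditionVariableSignal(cv);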

            regards, tom lane


Re: Condition variable live lock

From: Thomas Munro
Date:
On Fri, Jan 5, 2018 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I thought of another possible issue, though.  In the situation where
> someone else has removed our sentinel (presumably, by issuing
> ConditionVariableSignal just before we were about to remove the
> sentinel), my patch assumes we can just do nothing.  But it seems
> like that amounts to losing one signal.  Whoever the someone else
> was probably expected to awaken a waiter, and now that won't happen.

Yeah, that's bad.

> Should we rejigger the logic so that it awakens one additional waiter
> (if there is one) after detecting that someone else has removed the
> sentinel?  Obviously, this trades a risk of loss of wakeup for a risk
> of spurious wakeup, but presumably the latter is something we can
> cope with.

One detail is that the caller of ConditionVariableSignal() got a true
return value when it took out the sentinel (indicating that someone
received the signal), and now when you call ConditionVariableSignal()
because !aminlist there may be no one there.  I'm not sure if that's a
problem.  For comparison, pthread_cond_signal() doesn't tell you if
you actually signalled anyone.  Maybe the only reason we have that
return code is so that ConditionVariableBroadcast() can use it the way
it does in master...

An alternative would be to mark sentinel entries somehow so that
signallers can detect them and signal again, but that's not
backpatchable.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Fri, Jan 5, 2018 at 7:10 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Should we rejigger the logic so that it awakens one additional waiter
>> (if there is one) after detecting that someone else has removed the
>> sentinel?  Obviously, this trades a risk of loss of wakeup for a risk
>> of spurious wakeup, but presumably the latter is something we can
>> cope with.

> One detail is that the caller of ConditionVariableSignal() got a true
> return value when it took out the sentinel (indicating that someone
> received the signal), and now when you call ConditionVariableSignal()
> because !aminlist there may be no one there.  I'm not sure if that's a
> problem.  For comparison, pthread_cond_signal() doesn't tell you if
> you actually signalled anyone.  Maybe the only reason we have that
> return code is so that ConditionVariableBroadcast() can use it the way
> it does in master...

Indeed, it looks like no other caller is paying attention to the result.
We could live with the uncertainty in the back branches, and redefine
ConditionVariableSignal as returning void in master.

            regards, tom lane


Re: Condition variable live lock

From: Thomas Munro
Date:
On Fri, Jan 5, 2018 at 7:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Indeed, it looks like no other caller is paying attention to the result.
> We could live with the uncertainty in the back branches, and redefine
> ConditionVariableSignal as returning void in master.

+1

Could we install the sentinel and pop the first entry at the same
time, so that we're not adding an extra spinlock acquire/release?

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From: Alvaro Herrera
Date:
Tom Lane wrote:

> Obviously, this trades a risk of loss of wakeup for a risk
> of spurious wakeup, but presumably the latter is something we can
> cope with.

I wonder if it'd be useful to have a test mode for condition variables
in which spurious wakeups happen randomly, to verify that every user is
prepared to cope correctly.
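
A minimal sketch of such a mode, say near the top of the wait loop in
ConditionVariableSleep (the build option name and probability are made
up):

    #ifdef TRAP_CONDITION_VARIABLE_SLEEP    /* hypothetical debug option */
        /*
         * Occasionally return without having been signaled.  We're still
         * in the wait queue, so the caller's loop just re-checks its
         * condition and sleeps again; no wakeups can be lost.
         */
        if (random() % 16 == 0)
            return;
    #endif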

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Condition variable live lock

From: Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> Could we install the sentinel and pop the first entry at the same
> time, so that we're not adding an extra spinlock acquire/release?

Hm, maybe.  Other ideas in that space:

* if queue is empty when we first acquire the spinlock, we don't
have to do anything at all.

* if queue is empty after we pop the first entry, we needn't bother
installing our sentinel, just signal that proc and we're done.

It's a question of how complicated you're willing to make this
logic, and whether you trust that we'll be able to test all the
code paths.

            regards, tom lane


Re: Condition variable live lock

From: Tom Lane
Date:
I wrote:
> It's a question of how complicated you're willing to make this
> logic, and whether you trust that we'll be able to test all the
> code paths.

Attached is a patch incorporating all the ideas mentioned in this thread,
except that I think in HEAD we should change both ConditionVariableSignal
and ConditionVariableBroadcast to return void rather than a possibly
misleading wakeup count.  This could be back-patched as is, though.

As I feared, the existing regression tests are not really adequate for
this: gcov testing shows that the sentinel-inserting code path is
never entered, meaning ConditionVariableBroadcast never sees more
than one waiter.  What's more, it's now also apparent that no outside
caller of ConditionVariableSignal ever actually awakens anything.
So I think it'd be a good idea to expand the regression tests if we
can do so cheaply.  Anybody have ideas about that?  Perhaps a new
module under src/test/modules would be the best way?  Alternatively,
we could drop some of the optimization ideas.

BTW, at least on gaur, this does nothing for the runtime of the join
test, meaning I'd still like to see some effort put into reducing that.

            regards, tom lane

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 41378d6..55275cf 100644
*** a/src/backend/storage/lmgr/condition_variable.c
--- b/src/backend/storage/lmgr/condition_variable.c
*************** int
*** 214,228 ****
  ConditionVariableBroadcast(ConditionVariable *cv)
  {
      int            nwoken = 0;

      /*
!      * Let's just do this the dumbest way possible.  We could try to dequeue
!      * all the sleepers at once to save spinlock cycles, but it's a bit hard
!      * to get that right in the face of possible sleep cancelations, and we
!      * don't want to loop holding the mutex.
       */
!     while (ConditionVariableSignal(cv))
          ++nwoken;

      return nwoken;
  }
--- 214,300 ----
  ConditionVariableBroadcast(ConditionVariable *cv)
  {
      int            nwoken = 0;
+     int            pgprocno = MyProc->pgprocno;
+     PGPROC       *proc = NULL;
+     bool        have_sentinel = false;

      /*
!      * In some use-cases, it is common for awakened processes to immediately
!      * re-queue themselves.  If we just naively try to reduce the wakeup list
!      * to empty, we'll get into a potentially-indefinite loop against such a
!      * process.  The semantics we really want are just to be sure that we have
!      * wakened all processes that were in the list at entry.  We can use our
!      * own cvWaitLink as a sentinel to detect when we've finished.
!      *
!      * A seeming flaw in this approach is that someone else might signal the
!      * CV and in doing so remove our sentinel entry.  But that's fine: since
!      * CV waiters are always added and removed in order, that must mean that
!      * every previous waiter has been wakened, so we're done.  We'll get an
!      * extra "set" on our latch from the someone else's signal, which is
!      * slightly inefficient but harmless.
!      *
!      * We can't insert our cvWaitLink as a sentinel if it's already in use in
!      * some other proclist.  While that's not expected to be true for typical
!      * uses of this function, we can deal with it by simply canceling any
!      * prepared CV sleep.  The next call to ConditionVariableSleep will take
!      * care of re-establishing the lost state.
       */
!     ConditionVariableCancelSleep();
!
!     /*
!      * Inspect the state of the queue.  If it's empty, we have nothing to do.
!      * If there's exactly one entry, we need only remove and signal that
!      * entry.  Otherwise, remove the first entry and insert our sentinel.
!      */
!     SpinLockAcquire(&cv->mutex);
!     /* While we're here, let's assert we're not in the list. */
!     Assert(!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink));
!
!     if (!proclist_is_empty(&cv->wakeup))
!     {
!         proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
!         if (!proclist_is_empty(&cv->wakeup))
!         {
!             proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
!             have_sentinel = true;
!         }
!     }
!     SpinLockRelease(&cv->mutex);
!
!     /* Awaken first waiter, if there was one. */
!     if (proc != NULL)
!     {
!         SetLatch(&proc->procLatch);
          ++nwoken;
+     }
+
+     while (have_sentinel)
+     {
+         /*
+          * Each time through the loop, remove the first wakeup list entry, and
+          * signal it unless it's our sentinel.  Repeat as long as the sentinel
+          * remains in the list.
+          *
+          * Notice that if someone else removes our sentinel, we will waken one
+          * additional process before exiting.  That's intentional, because if
+          * someone else signals the CV, they may be intending to waken some
+          * third process that added itself to the list after we added the
+          * sentinel.  Better to give a spurious wakeup (which should be
+          * harmless beyond wasting some cycles) than to lose a wakeup.
+          */
+         proc = NULL;
+         SpinLockAcquire(&cv->mutex);
+         if (!proclist_is_empty(&cv->wakeup))
+             proc = proclist_pop_head_node(&cv->wakeup, cvWaitLink);
+         have_sentinel = proclist_contains(&cv->wakeup, pgprocno, cvWaitLink);
+         SpinLockRelease(&cv->mutex);
+
+         if (proc != NULL && proc != MyProc)
+         {
+             SetLatch(&proc->procLatch);
+             ++nwoken;
+         }
+     }

      return nwoken;
  }

Re: Condition variable live lock

From: Tom Lane
Date:
I wrote:
> As I feared, the existing regression tests are not really adequate for
> this: gcov testing shows that the sentinel-inserting code path is
> never entered, meaning ConditionVariableBroadcast never sees more
> than one waiter.

Hmm ... adding tracing log printouts in BarrierArriveAndWait disproves
that: the regression tests do, repeatedly, call ConditionVariableBroadcast
when there are two waiters to be woken (and you can get it to be more if
you raise the number of parallel workers specified in the PHJ tests).
It just doesn't happen with --enable-coverage.  Heisenberg wins again!

I am not sure why coverage tracking changes the behavior so much, but
it's clear from my results that if you change all the PHJ test cases in
join.sql to use 4 parallel workers, then you get plenty of barrier release
events with 4 or 5 barrier participants --- but running the identical test
under --enable-coverage results in only a very small number of releases
with even as many as 2 participants, let alone more.  Perhaps the PHJ test
cases don't run long enough to let slow-starting workers join in?

Anyway, that may or may not indicate something we should tune at a higher
level, but I'm now satisfied that the patch as presented works and does
get tested by our existing tests.  So barring objection I'll commit that
shortly.

There are some other things I don't like about condition_variable.c:

* I think the Asserts in ConditionVariablePrepareToSleep and
ConditionVariableSleep ought to be replaced by full-fledged test and
elog(ERROR), so that they are enforced even in non-assert builds.
I don't have a lot of confidence that corner cases that could violate
those usage restrictions would get caught during developer testing.
Nor do I see an argument that we can't afford the cycles to check.

* ConditionVariablePrepareToSleep needs to be rearranged so that failure
to create the WaitEventSet doesn't leave us in an invalid state.

* A lot of the comments could be improved, IMHO.

Barring objection I'll go deal with those things, too.

            regards, tom lane


Re: Condition variable live lock

From: Robert Haas
Date:
On Fri, Jan 5, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> * I think the Asserts in ConditionVariablePrepareToSleep and
> ConditionVariableSleep ought to be replaced by full-fledged test and
> elog(ERROR), so that they are enforced even in non-assert builds.
> I don't have a lot of confidence that corner cases that could violate
> those usage restrictions would get caught during developer testing.
> Nor do I see an argument that we can't afford the cycles to check.

I think those usage restrictions are pretty basic, and I think this
might be used in some places where performance does matter.  So -1
from me for this change.

> * ConditionVariablePrepareToSleep needs to be rearranged so that failure
> to create the WaitEventSet doesn't leave us in an invalid state.

+1.

> * A lot of the comments could be improved, IMHO.

No opinion without seeing what you propose to change.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Condition variable live lock

From: Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Jan 5, 2018 at 2:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * I think the Asserts in ConditionVariablePrepareToSleep and
>> ConditionVariableSleep ought to be replaced by full-fledged test and
>> elog(ERROR), so that they are enforced even in non-assert builds.
>> I don't have a lot of confidence that corner cases that could violate
>> those usage restrictions would get caught during developer testing.
>> Nor do I see an argument that we can't afford the cycles to check.

> I think those usage restrictions are pretty basic, and I think this
> might be used in some places where performance does matter.  So -1
> from me for this change.

Really?  We're about to do a process sleep, and we can't afford a single
test and branch to make sure we're doing it sanely?

>> * A lot of the comments could be improved, IMHO.

> No opinion without seeing what you propose to change.

OK, will put out a proposal.

            regards, tom lane


Re: Condition variable live lock

From: Thomas Munro
Date:
On Sat, Jan 6, 2018 at 6:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As I feared, the existing regression tests are not really adequate for
> this: gcov testing shows that the sentinel-inserting code path is
> never entered, meaning ConditionVariableBroadcast never sees more
> than one waiter.  What's more, it's now also apparent that no outside
> caller of ConditionVariableSignal ever actually awakens anything.
> So I think it'd be a good idea to expand the regression tests if we
> can do so cheaply.  Anybody have ideas about that?  Perhaps a new
> module under src/test/modules would be the best way?  Alternatively,
> we could drop some of the optimization ideas.

I think I might have a suitable test module already.  I'll tidy it up
and propose it in a few days.

> BTW, at least on gaur, this does nothing for the runtime of the join
> test, meaning I'd still like to see some effort put into reducing that.

Will do.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From: Tom Lane
Date:
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> No opinion without seeing what you propose to change.

> OK, will put out a proposal.

I began with the intention of making no non-cosmetic changes, but then
I started to wonder why ConditionVariablePrepareToSleep bothers with a
!proclist_contains test, when the calling process surely ought not be
in the list -- or any other list -- since it wasn't prepared to sleep.
And that led me down a different rabbit hole ending in the conclusion
that proclist.h could stand some improvements too.  I do not like the
fact that it's impossible to tell whether a proclist_node is in any
proclist or not.  Initially, a proclist_node contains zeroes which is
a distinguishable state, but proclist_delete_offset resets it to
next = prev = INVALID_PGPROCNO which looks the same as a node that's in a
singleton list.  We should have it reset to the initial state of zeroes
instead, and then we can add assertions to proclist_push_xxx that the
supplied node is not already in a list.  Hence, I propose the first
attached patch which tightens things up in proclist.h and then removes
the !proclist_contains test in ConditionVariablePrepareToSleep; the
assertion in proclist_push_tail supersedes that.

The second attached patch is the cosmetic changes I want to make in
condition_variable.c/.h.

I still think that we ought to change the Asserts on cv_sleep_target in
ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
test-and-elog tests.  Those are checking a global correctness property
("global" meaning "interactions between completely unrelated modules can
break this"), and they'd be extremely cheap compared to the rest of what
those functions are doing, so I think insisting that they be Asserts is
penny wise and pound foolish.  Anybody besides Robert want to vote on
that?

Another loose end that I'm seeing here is that while a process waiting on
a condition variable will respond to a cancel or die interrupt, it will
not notice postmaster death.  This seems unwise to me.  I think we should
adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
condition and just do a summary proc_exit(1) if it sees that.  I'd even
argue that that is a back-patchable bug fix.
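
Concretely, something like this (untested sketch; the wait actually goes
through the long-lived cv_wait_event_set rather than a bare WaitLatch
call):

    /* When creating cv_wait_event_set in ConditionVariablePrepareToSleep: */
    AddWaitEventToSet(cv_wait_event_set, WL_POSTMASTER_DEATH,
                      PGINVALID_SOCKET, NULL, NULL);

    /* ... and after WaitEventSetWait() returns in ConditionVariableSleep: */
    if (event.events & WL_POSTMASTER_DEATH)
        proc_exit(1);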

            regards, tom lane

diff --git a/src/include/storage/proclist_types.h b/src/include/storage/proclist_types.h
index 237fb76..f4dac10 100644
--- a/src/include/storage/proclist_types.h
+++ b/src/include/storage/proclist_types.h
@@ -16,7 +16,12 @@
 #define PROCLIST_TYPES_H

 /*
- * A node in a list of processes.
+ * A node in a doubly-linked list of processes.  The link fields contain
+ * the 0-based PGPROC indexes of the next and previous process, or
+ * INVALID_PGPROCNO in the next-link of the last node and the prev-link
+ * of the first node.  A node that is currently not in any list
+ * should have next == prev == 0; this is not a possible state for a node
+ * that is in a list, because we disallow circularity.
  */
 typedef struct proclist_node
 {
@@ -25,7 +30,8 @@ typedef struct proclist_node
 } proclist_node;

 /*
- * Head of a doubly-linked list of PGPROCs, identified by pgprocno.
+ * Header of a doubly-linked list of PGPROCs, identified by pgprocno.
+ * An empty list is represented by head == tail == INVALID_PGPROCNO.
  */
 typedef struct proclist_head
 {
@@ -42,4 +48,4 @@ typedef struct proclist_mutable_iter
     int            next;            /* pgprocno of the next PGPROC */
 } proclist_mutable_iter;

-#endif
+#endif                            /* PROCLIST_TYPES_H */
diff --git a/src/include/storage/proclist.h b/src/include/storage/proclist.h
index 4e25b47..59a478e 100644
--- a/src/include/storage/proclist.h
+++ b/src/include/storage/proclist.h
@@ -42,7 +42,7 @@ proclist_is_empty(proclist_head *list)

 /*
  * Get a pointer to a proclist_node inside a given PGPROC, given a procno and
- * an offset.
+ * the proclist_node field's offset within struct PGPROC.
  */
 static inline proclist_node *
 proclist_node_get(int procno, size_t node_offset)
@@ -53,13 +53,15 @@ proclist_node_get(int procno, size_t node_offset)
 }

 /*
- * Insert a node at the beginning of a list.
+ * Insert a process at the beginning of a list.
  */
 static inline void
 proclist_push_head_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next == 0 && node->prev == 0);
+
     if (list->head == INVALID_PGPROCNO)
     {
         Assert(list->tail == INVALID_PGPROCNO);
@@ -79,13 +81,15 @@ proclist_push_head_offset(proclist_head *list, int procno, size_t node_offset)
 }

 /*
- * Insert a node at the end of a list.
+ * Insert a process at the end of a list.
  */
 static inline void
 proclist_push_tail_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next == 0 && node->prev == 0);
+
     if (list->tail == INVALID_PGPROCNO)
     {
         Assert(list->head == INVALID_PGPROCNO);
@@ -105,30 +109,38 @@ proclist_push_tail_offset(proclist_head *list, int procno, size_t node_offset)
 }

 /*
- * Delete a node.  The node must be in the list.
+ * Delete a process from a list --- it must be in the list!
  */
 static inline void
 proclist_delete_offset(proclist_head *list, int procno, size_t node_offset)
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

+    Assert(node->next != 0 || node->prev != 0);
+
     if (node->prev == INVALID_PGPROCNO)
+    {
+        Assert(list->head == procno);
         list->head = node->next;
+    }
     else
         proclist_node_get(node->prev, node_offset)->next = node->next;

     if (node->next == INVALID_PGPROCNO)
+    {
+        Assert(list->tail == procno);
         list->tail = node->prev;
+    }
     else
         proclist_node_get(node->next, node_offset)->prev = node->prev;

-    node->next = node->prev = INVALID_PGPROCNO;
+    node->next = node->prev = 0;
 }

 /*
- * Check if a node is currently in a list.  It must be known that the node is
- * not in any _other_ proclist that uses the same proclist_node, so that the
- * only possibilities are that it is in this list or none.
+ * Check if a process is currently in a list.  It must be known that the
+ * process is not in any _other_ proclist that uses the same proclist_node,
+ * so that the only possibilities are that it is in this list or none.
  */
 static inline bool
 proclist_contains_offset(proclist_head *list, int procno,
@@ -136,27 +148,26 @@ proclist_contains_offset(proclist_head *list, int procno,
 {
     proclist_node *node = proclist_node_get(procno, node_offset);

-    /*
-     * If this is not a member of a proclist, then the next and prev pointers
-     * should be 0. Circular lists are not allowed so this condition is not
-     * confusable with a real pgprocno 0.
-     */
+    /* If it's not in any list, it's definitely not in this one. */
     if (node->prev == 0 && node->next == 0)
         return false;

-    /* If there is a previous node, then this node must be in the list. */
-    if (node->prev != INVALID_PGPROCNO)
-        return true;
-
     /*
-     * There is no previous node, so the only way this node can be in the list
-     * is if it's the head node.
+     * It must, in fact, be in this list.  Ideally, in assert-enabled builds,
+     * we'd verify that.  But since this function is typically used while
+     * holding a spinlock, crawling the whole list is unacceptable.  However,
+     * we can verify matters in O(1) time when the node is a list head or
+     * tail, and that seems worth doing, since in practice that should often
+     * be enough to catch mistakes.
      */
-    return list->head == procno;
+    Assert(node->prev != INVALID_PGPROCNO || list->head == procno);
+    Assert(node->next != INVALID_PGPROCNO || list->tail == procno);
+
+    return true;
 }

 /*
- * Remove and return the first node from a list (there must be one).
+ * Remove and return the first process from a list (there must be one).
  */
 static inline PGPROC *
 proclist_pop_head_node_offset(proclist_head *list, size_t node_offset)
@@ -205,4 +216,4 @@ proclist_pop_head_node_offset(proclist_head *list, size_t node_offset)
              proclist_node_get((iter).cur,                                    \
                                offsetof(PGPROC, link_member))->next)

-#endif
+#endif                            /* PROCLIST_H */
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 60234db..e3bc034 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -86,8 +86,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)

     /* Add myself to the wait queue. */
     SpinLockAcquire(&cv->mutex);
-    if (!proclist_contains(&cv->wakeup, pgprocno, cvWaitLink))
-        proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
+    proclist_push_tail(&cv->wakeup, pgprocno, cvWaitLink);
     SpinLockRelease(&cv->mutex);
 }

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index e3bc034..0b9d676 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -43,11 +43,17 @@ ConditionVariableInit(ConditionVariable *cv)
 }

 /*
- * Prepare to wait on a given condition variable.  This can optionally be
- * called before entering a test/sleep loop.  Alternatively, the call to
- * ConditionVariablePrepareToSleep can be omitted.  The only advantage of
- * calling ConditionVariablePrepareToSleep is that it avoids an initial
- * double-test of the user's predicate in the case that we need to wait.
+ * Prepare to wait on a given condition variable.
+ *
+ * This can optionally be called before entering a test/sleep loop.
+ * Doing so is more efficient if we'll need to sleep at least once.
+ * However, if the first test of the exit condition is likely to succeed,
+ * it's more efficient to omit the ConditionVariablePrepareToSleep call.
+ * See comments in ConditionVariableSleep for more detail.
+ *
+ * Only one condition variable can be used at a time, ie,
+ * ConditionVariableCancelSleep must be called before any attempt is made
+ * to sleep on a different condition variable.
  */
 void
 ConditionVariablePrepareToSleep(ConditionVariable *cv)
@@ -79,8 +85,8 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     cv_sleep_target = cv;

     /*
-     * Reset my latch before adding myself to the queue and before entering
-     * the caller's predicate loop.
+     * Reset my latch before adding myself to the queue, to ensure that we
+     * don't miss a wakeup that occurs immediately.
      */
     ResetLatch(MyLatch);

@@ -90,20 +96,21 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     SpinLockRelease(&cv->mutex);
 }

-/*--------------------------------------------------------------------------
- * Wait for the given condition variable to be signaled.  This should be
- * called in a predicate loop that tests for a specific exit condition and
- * otherwise sleeps, like so:
+/*
+ * Wait for the given condition variable to be signaled.
  *
- *     ConditionVariablePrepareToSleep(cv); [optional]
+ * This should be called in a predicate loop that tests for a specific exit
+ * condition and otherwise sleeps, like so:
+ *
+ *     ConditionVariablePrepareToSleep(cv);  // optional
  *     while (condition for which we are waiting is not true)
  *         ConditionVariableSleep(cv, wait_event_info);
  *     ConditionVariableCancelSleep();
  *
- * Supply a value from one of the WaitEventXXX enums defined in pgstat.h to
- * control the contents of pg_stat_activity's wait_event_type and wait_event
- * columns while waiting.
- *-------------------------------------------------------------------------*/
+ * wait_event_info should be a value from one of the WaitEventXXX enums
+ * defined in pgstat.h.  This controls the contents of pg_stat_activity's
+ * wait_event_type and wait_event columns while waiting.
+ */
 void
 ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
 {
@@ -113,13 +120,14 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
     /*
      * If the caller didn't prepare to sleep explicitly, then do so now and
      * return immediately.  The caller's predicate loop should immediately
-     * call again if its exit condition is not yet met.  This initial spurious
-     * return can be avoided by calling ConditionVariablePrepareToSleep(cv)
+     * call again if its exit condition is not yet met.  This will result in
+     * the exit condition being tested twice before we first sleep.  The extra
+     * test can be prevented by calling ConditionVariablePrepareToSleep(cv)
      * first.  Whether it's worth doing that depends on whether you expect the
-     * condition to be met initially, in which case skipping the prepare
-     * allows you to skip manipulation of the wait list, or not met initially,
-     * in which case preparing first allows you to skip a spurious test of the
-     * caller's exit condition.
+     * condition to be met initially, in which case skipping the prepare is
+     * recommended because it avoids manipulations of the wait list, or not
+     * met initially, in which case preparing first is better because it
+     * avoids one extra test of the exit condition.
      */
     if (cv_sleep_target == NULL)
     {
@@ -130,7 +138,7 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
     /* Any earlier condition variable sleep must have been canceled. */
     Assert(cv_sleep_target == cv);

-    while (!done)
+    do
     {
         CHECK_FOR_INTERRUPTS();

@@ -140,18 +148,23 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
          */
         WaitEventSetWait(cv_wait_event_set, -1, &event, 1, wait_event_info);

-        /* Reset latch before testing whether we can return. */
+        /* Reset latch before examining the state of the wait list. */
         ResetLatch(MyLatch);

         /*
          * If this process has been taken out of the wait list, then we know
-         * that is has been signaled by ConditionVariableSignal.  We put it
-         * back into the wait list, so we don't miss any further signals while
-         * the caller's loop checks its condition.  If it hasn't been taken
-         * out of the wait list, then the latch must have been set by
-         * something other than ConditionVariableSignal; though we don't
-         * guarantee not to return spuriously, we'll avoid these obvious
-         * cases.
+         * that it has been signaled by ConditionVariableSignal (or
+         * ConditionVariableBroadcast), so we should return to the caller. But
+         * that doesn't guarantee that the exit condition is met, only that we
+         * ought to check it.  So we must put the process back into the wait
+         * list, to ensure we don't miss any additional wakeup occurring while
+         * the caller checks its exit condition.  We can take ourselves out of
+         * the wait list only when the caller calls
+         * ConditionVariableCancelSleep.
+         *
+         * If we're still in the wait list, then the latch must have been set
+         * by something other than ConditionVariableSignal; though we don't
+         * guarantee not to return spuriously, we'll avoid this obvious case.
          */
         SpinLockAcquire(&cv->mutex);
         if (!proclist_contains(&cv->wakeup, MyProc->pgprocno, cvWaitLink))
@@ -160,13 +173,17 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
             proclist_push_tail(&cv->wakeup, MyProc->pgprocno, cvWaitLink);
         }
         SpinLockRelease(&cv->mutex);
-    }
+    } while (!done);
 }

 /*
- * Cancel any pending sleep operation.  We just need to remove ourselves
- * from the wait queue of any condition variable for which we have previously
- * prepared a sleep.
+ * Cancel any pending sleep operation.
+ *
+ * We just need to remove ourselves from the wait queue of any condition
+ * variable for which we have previously prepared a sleep.
+ *
+ * Do nothing if nothing is pending; this allows this function to be called
+ * during transaction abort to clean up any unfinished CV sleep.
  */
 void
 ConditionVariableCancelSleep(void)
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index c7afbbc..7dac477 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -27,30 +27,33 @@

 typedef struct
 {
-    slock_t        mutex;
-    proclist_head wakeup;
+    slock_t        mutex;            /* spinlock protecting the wakeup list */
+    proclist_head wakeup;        /* list of wake-able processes */
 } ConditionVariable;

 /* Initialize a condition variable. */
-extern void ConditionVariableInit(ConditionVariable *);
+extern void ConditionVariableInit(ConditionVariable *cv);

 /*
  * To sleep on a condition variable, a process should use a loop which first
  * checks the condition, exiting the loop if it is met, and then calls
  * ConditionVariableSleep.  Spurious wakeups are possible, but should be
- * infrequent.  After exiting the loop, ConditionVariableCancelSleep should
+ * infrequent.  After exiting the loop, ConditionVariableCancelSleep must
  * be called to ensure that the process is no longer in the wait list for
- * the condition variable.
+ * the condition variable.  Only one condition variable can be used at a
+ * time, ie, ConditionVariableCancelSleep must be called before any attempt
+ * is made to sleep on a different condition variable.
  */
-extern void ConditionVariableSleep(ConditionVariable *, uint32 wait_event_info);
+extern void ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info);
 extern void ConditionVariableCancelSleep(void);

 /*
- * The use of this function is optional and not necessary for correctness;
- * for efficiency, it should be called prior entering the loop described above
- * if it is thought that the condition is unlikely to hold immediately.
+ * Optionally, ConditionVariablePrepareToSleep can be called before entering
+ * the test-and-sleep loop described above.  Doing so is more efficient if
+ * at least one sleep is needed, whereas not doing so is more efficient when
+ * no sleep is needed because the test condition is true the first time.
  */
-extern void ConditionVariablePrepareToSleep(ConditionVariable *);
+extern void ConditionVariablePrepareToSleep(ConditionVariable *cv);

 /* Wake up a single waiter (via signal) or all waiters (via broadcast). */
 extern void ConditionVariableSignal(ConditionVariable *cv);

Re: Condition variable live lock

From: Tom Lane
Date:
I wrote:
> I still think that we ought to change the Asserts on cv_sleep_target in
> ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
> test-and-elog tests.  Those are checking a global correctness property
> ("global" meaning "interactions between completely unrelated modules can
> break this"), and they'd be extremely cheap compared to the rest of what
> those functions are doing, so I think insisting that they be Asserts is
> penny wise and pound foolish.

Actually ... perhaps a better design would be to have
ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
a different condition variable, analogously to what we just did in
ConditionVariableBroadcast, on the same theory that whenever control
returns to the other CV wait loop it can re-establish the relevant
state easily enough.  I have to think that if the use of CVs grows
much, the existing restriction is going to become untenable anyway,
so why not just get rid of it?

            regards, tom lane


Re: Condition variable live lock

From: Tom Lane
Date:
I wrote:
> Actually ... perhaps a better design would be to have
> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
> a different condition variable, analogously to what we just did in
> ConditionVariableBroadcast, on the same theory that whenever control
> returns to the other CV wait loop it can re-establish the relevant
> state easily enough.  I have to think that if the use of CVs grows
> much, the existing restriction is going to become untenable anyway,
> so why not just get rid of it?

Concretely, as per attached.

            regards, tom lane

diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 0b9d676..b01864c 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -50,10 +50,6 @@ ConditionVariableInit(ConditionVariable *cv)
  * However, if the first test of the exit condition is likely to succeed,
  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
  * See comments in ConditionVariableSleep for more detail.
- *
- * Only one condition variable can be used at a time, ie,
- * ConditionVariableCancelSleep must be called before any attempt is made
- * to sleep on a different condition variable.
  */
 void
 ConditionVariablePrepareToSleep(ConditionVariable *cv)
@@ -76,10 +72,14 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
     }

     /*
-     * It's not legal to prepare a sleep until the previous sleep has been
-     * completed or canceled.
+     * If some other sleep is already prepared, cancel it; this is necessary
+     * because we have just one static variable tracking the prepared sleep.
+     * It's okay to do this because whenever control does return to the other
+     * test-and-sleep loop, its ConditionVariableSleep call will just
+     * re-establish that sleep as the prepared one.
      */
-    Assert(cv_sleep_target == NULL);
+    if (cv_sleep_target != NULL)
+        ConditionVariableCancelSleep();

     /* Record the condition variable on which we will sleep. */
     cv_sleep_target = cv;
@@ -128,16 +128,16 @@ ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info)
      * recommended because it avoids manipulations of the wait list, or not
      * met initially, in which case preparing first is better because it
      * avoids one extra test of the exit condition.
+     *
+     * If we are currently prepared to sleep on some other CV, we just cancel
+     * that and prepare this one; see ConditionVariablePrepareToSleep.
      */
-    if (cv_sleep_target == NULL)
+    if (cv_sleep_target != cv)
     {
         ConditionVariablePrepareToSleep(cv);
         return;
     }

-    /* Any earlier condition variable sleep must have been canceled. */
-    Assert(cv_sleep_target == cv);
-
     do
     {
         CHECK_FOR_INTERRUPTS();
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index 7dac477..32e645c 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -40,9 +40,7 @@ extern void ConditionVariableInit(ConditionVariable *cv);
  * ConditionVariableSleep.  Spurious wakeups are possible, but should be
  * infrequent.  After exiting the loop, ConditionVariableCancelSleep must
  * be called to ensure that the process is no longer in the wait list for
- * the condition variable.  Only one condition variable can be used at a
- * time, ie, ConditionVariableCancelSleep must be called before any attempt
- * is made to sleep on a different condition variable.
+ * the condition variable.
  */
 extern void ConditionVariableSleep(ConditionVariable *cv, uint32 wait_event_info);
 extern void ConditionVariableCancelSleep(void);

Re: Condition variable live lock

From: Thomas Munro
Date:
On Mon, Jan 8, 2018 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Actually ... perhaps a better design would be to have
>> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
>> a different condition variable, analogously to what we just did in
>> ConditionVariableBroadcast, on the same theory that whenever control
>> returns to the other CV wait loop it can re-establish the relevant
>> state easily enough.  I have to think that if the use of CVs grows
>> much, the existing restriction is going to become untenable anyway,
>> so why not just get rid of it?
>
> Concretely, as per attached.

+1 for the idea.  Haven't looked at the code yet but I'll review this
and the proclist patch shortly.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Mon, Jan 8, 2018 at 12:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Concretely, as per attached.

> +1 for the idea.  Haven't looked at the code yet but I'll review this
> and the proclist patch shortly.

Thanks.  BTW, I realized that there is a second (and perhaps more
important) reason why we can only prepare one CV sleep at a time:
we only have one cvWaitLink in our PGPROC.  So I'm now inclined
to word the revised comment in ConditionVariablePrepareToSleep as

    /*
     * If some other sleep is already prepared, cancel it; this is necessary
     * because we have just one static variable tracking the prepared sleep,
     * and also only one cvWaitLink in our PGPROC.  It's okay to do this
     * because whenever control does return to the other test-and-sleep loop,
     * its ConditionVariableSleep call will just re-establish that sleep as
     * the prepared one.
     */

            regards, tom lane


Re: Condition variable live lock

From
Thomas Munro
Date:
On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I began with the intention of making no non-cosmetic changes, but then
> I started to wonder why ConditionVariablePrepareToSleep bothers with a
> !proclist_contains test, when the calling process surely ought not be
> in the list -- or any other list -- since it wasn't prepared to sleep.
> And that led me down a different rabbit hole ending in the conclusion
> that proclist.h could stand some improvements too.  I do not like the
> fact that it's impossible to tell whether a proclist_node is in any
> proclist or not.  Initially, a proclist_node contains zeroes which is
> a distinguishable state, but proclist_delete_offset resets it to
> next = prev = INVALID_PGPROCNO which looks the same as a node that's in a
> singleton list.  We should have it reset to the initial state of zeroes
> instead, and then we can add assertions to proclist_push_xxx that the
> supplied node is not already in a list.  Hence, I propose the first
> attached patch which tightens things up in proclist.h and then removes
> the !proclist_contains test in ConditionVariablePrepareToSleep; the
> assertion in proclist_push_tail supersedes that.

+1

> The second attached patch is the cosmetic changes I want to make in
> condition_variable.c/.h.

+1

> I still think that we ought to change the Asserts on cv_sleep_target in
> ConditionVariablePrepareToSleep and ConditionVariableSleep to be full
> test-and-elog tests.  Those are checking a global correctness property
> ("global" meaning "interactions between completely unrelated modules can
> break this"), and they'd be extremely cheap compared to the rest of what
> those functions are doing, so I think insisting that they be Asserts is
> penny wise and pound foolish.  Anybody besides Robert want to vote on
> that?

I liked your follow-up idea better (see below).

> Another loose end that I'm seeing here is that while a process waiting on
> a condition variable will respond to a cancel or die interrupt, it will
> not notice postmaster death.  This seems unwise to me.  I think we should
> adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
> condition and just do a summary proc_exit(1) if it sees that.  I'd even
> argue that that is a back-patchable bug fix.

Yeah.  As far as I know so far, every place where we wait on a
WaitEventSet falls into one of 4 categories when it comes to
postmaster death:

1.  We proc_exit(1) if WL_POSTMASTER_DEATH is returned.
2.  We ereport(FATAL) if WL_POSTMASTER_DEATH is returned.
3.  We asked for WL_POSTMASTER_DEATH in the WaitEventSet, but we don't
actually bother checking for it in the returned value.  Instead we
call PostmasterIsAlive() every time through the loop (pgarch.c,
syncrep.c, walsender.c and walreceiver.c).
4.  We didn't ask for WL_POSTMASTER_DEATH.

Do I have that right?  I guess category 3 is suboptimal especially on
some clunky but loveable kernels[1], and all cases of 4, including this
one are probably bugs.  That makes me wonder why we don't change the
WaitEventSet API so that it calls proc_exit(1) for you by default if
you didn't ask to receive WL_POSTMASTER_DEATH explicitly, to handle
category 1 for free.  If you asked for WL_POSTMASTER_DEATH explicitly
then we could return it, to support category 2 callers that want to do
something different.  Just a thought.

Another possibility for this particular case would be that the client
of ConditionVariable would like to be able to choose how to handle
postmaster death, and in turn the client of Barrier (itself a CV client) might
like to be able to choose too.  But in every case I'm aware of today
proc_exit(1) is the right thing to do, so teaching
ConditionVariableSleep() to do that seems OK to me.
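
For reference, the category-1 pattern looks roughly like the following minimal sketch, with a hypothetical work_is_available() predicate and the generic PG_WAIT_EXTENSION wait event:

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/ipc.h"
#include "storage/latch.h"

extern bool work_is_available(void);    /* hypothetical exit condition */

static void
wait_for_work(void)
{
    while (!work_is_available())
    {
        int     rc;

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_POSTMASTER_DEATH,
                       -1L,
                       PG_WAIT_EXTENSION);

        /* Category 1: give up promptly if the postmaster is gone. */
        if (rc & WL_POSTMASTER_DEATH)
            proc_exit(1);

        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}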

On Sun, Jan 7, 2018 at 10:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Actually ... perhaps a better design would be to have
> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
> a different condition variable, analogously to what we just did in
> ConditionVariableBroadcast, on the same theory that whenever control
> returns to the other CV wait loop it can re-establish the relevant
> state easily enough.  I have to think that if the use of CVs grows
> much, the existing restriction is going to become untenable anyway,
> so why not just get rid of it?

+1

It's a more robust API this way.

On Mon, Jan 8, 2018 at 3:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thanks.  BTW, I realized that there is a second (and perhaps more
> important) reason why we can only prepare one CV sleep at a time:
> we only have one cvWaitLink in our PGPROC.  So I'm now inclined
> to word the revised comment in ConditionVariablePrepareToSleep as
>
>     /*
>      * If some other sleep is already prepared, cancel it; this is necessary
>      * because we have just one static variable tracking the prepared sleep,
>      * and also only one cvWaitLink in our PGPROC.  It's okay to do this
>      * because whenever control does return to the other test-and-sleep loop,
>      * its ConditionVariableSleep call will just re-establish that sleep as
>      * the prepared one.
>      */

+1

[1]
https://www.postgresql.org/message-id/flat/CAEepm%3D0qQ6DO-u%3D25ny5EJAUbWeHbAQgJj1UJFAL1NWJNxC%2Bgg%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From
Thomas Munro
Date:
On Mon, Jan 8, 2018 at 5:25 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, Jan 7, 2018 at 10:43 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually ... perhaps a better design would be to have
>> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
>> a different condition variable, analogously to what we just did in
>> ConditionVariableBroadcast, on the same theory that whenever control
>> returns to the other CV wait loop it can re-establish the relevant
>> state easily enough.  I have to think that if the use of CVs grows
>> much, the existing restriction is going to become untenable anyway,
>> so why not just get rid of it?
>
> +1
>
> It's a more robust API this way.

One very small thing after another look:

-       Assert(cv_sleep_target == NULL);
+       if (cv_sleep_target != NULL)
+               ConditionVariableCancelSleep();

The test for cv_sleep_target != NULL is redundant since
ConditionVariableCancelSleep() would return early.
ConditionVariableBroadcast() doesn't do that.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Condition variable live lock

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> One very small thing after another look:

> -       Assert(cv_sleep_target == NULL);
> +       if (cv_sleep_target != NULL)
> +               ConditionVariableCancelSleep();

> The test for cv_sleep_target != NULL is redundant since
> ConditionVariableCancelSleep() would return early.

True.  I did that because Robert was already objecting to the cost
of an added test-and-branch here, so I figured he'd be really
unhappy with the cost of a function call plus test-and-branch ;-)

> ConditionVariableBroadcast() doesn't do that.

Yup.  I considered removing the discrepancy by adding a similar
if-guard in ConditionVariableBroadcast().  The internal test in
ConditionVariableCancelSleep would then be only for the benefit
of outside callers such as AbortTransaction, but that seems fine
and per its documentation.

Or we could remove those if's and save a few bytes at the
cost of some cycles.  I don't care much.

            regards, tom lane


Re: Condition variable live lock

From
Robert Haas
Date:
On Sun, Jan 7, 2018 at 6:38 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Actually ... perhaps a better design would be to have
>> ConditionVariable[PrepareTo]Sleep auto-cancel any prepared sleep for
>> a different condition variable, analogously to what we just did in
>> ConditionVariableBroadcast, on the same theory that whenever control
>> returns to the other CV wait loop it can re-establish the relevant
>> state easily enough.  I have to think that if the use of CVs grows
>> much, the existing restriction is going to become untenable anyway,
>> so why not just get rid of it?
>
> Concretely, as per attached.

I guess my aversion to converting the existing Assert-test into an
elog test was really a concern that we'd be countenancing the use of
CVs in any coding pattern more complicated than a very simple
test-and-wait loop.  Suppose someone were to propose adding runtime
checks that when we release a spinlock, it is held by the process that
tried to release it.  Someone might reply that such a check ought to
be unnecessary because the code protected by a spinlock ought to be a
short, straight-line critical section and therefore we should be able
to spot any such coding error by inspection.

And I wonder if the same isn't true here.  Is it really sensible,
within a CV-wait loop, to call some other function that contains its
own CV-wait loop?  Is that really a use case we want to support?  If
you do have such a thing, with the present coding, you don't *need* an
Assert() at runtime; you just need to run the code with assertions
enabled AT LEAST ONCE.  If the code flow is so complex that it doesn't
reliably fail an assertion in test environments, maybe it's just too
complex.  Perhaps our answer to that, or so my thinking went, ought to
be "don't write code like that in the first place".

Now, that may be myopic on my part.  If we want to support complex
control flows involving CVs, the approach here has a lot to recommend
it.  Instead of making it the caller's problem to work it out, we do
it automatically.  There is some loss of efficiency, perhaps, since
when control returns to the outer CV-wait loop it will have to recheck
the condition twice before potentially waiting, but maybe that doesn't
matter.  It's often the case that mechanisms like this end up getting
used profitably in a lot of places not imagined by their original
creator, and that might be the case here.

I think an extremely *likely* programming error when programming with
CVs is to have a code path where ConditionVariableCancelSleep() does
not get called.  The proposed change could make such mistakes much
less noticeable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Condition variable live lock

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I guess my aversion to converting the existing Assert-test into an
> elog test was really a concern that we'd be countenancing the use of
> CVs in any coding pattern more complicated than a very simple
> test-and-wait loop.  Suppose someone were to propose adding runtime
> checks that when we release a spinlock, it is held by the process that
> tried to release it.  Someone might reply that such a check ought to
> be unnecessary because the code protected by a spinlock ought to be a
> short, straight-line critical section and therefore we should be able
> to spot any such coding error by inspection.

Right, *because* we have the rule that spinlock-protected code must be
short straight-line segments.  There is certainly no such rule for
LWLocks, and I'm not sure why you think that CVs have more restricted
use-cases than LWLocks.

There is a potential issue, if you call random code inside the wait loop,
that that code might reset the process latch and thereby lose a signal for
the CV.  As far as I can see at the moment, that'd only be a hazard for
code executed before the loop's ConditionVariableSleep, not after, so even
that can be avoided if you code properly.  In any case, the proposed
generalization of CVs poses no such hazard: yes, we might reset the latch,
but if so the outer CV loop will see that it's lost its prepared sleep
state and will be forced to recheck its exit condition.
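
Concretely, in the usual test-and-sleep pattern (my_cv and my_condition() are hypothetical stand-ins), the risky and safe spots look like this:

#include "postgres.h"
#include "pgstat.h"
#include "storage/condition_variable.h"

extern ConditionVariable *my_cv;        /* hypothetical */
extern bool my_condition(void);         /* hypothetical exit condition */

static void
wait_for_my_condition(void)
{
    ConditionVariablePrepareToSleep(my_cv);
    while (!my_condition())
    {
        /*
         * Code placed here, before the sleep, is the risky spot: if it
         * resets the process latch it can swallow a wakeup that arrived
         * after the condition test above.
         */
        ConditionVariableSleep(my_cv, PG_WAIT_EXTENSION);

        /*
         * Code placed here is safe in that respect, because we loop back
         * and re-test the exit condition before sleeping again.
         */
    }
    ConditionVariableCancelSleep();
}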

> It's often the case that mechanisms like this end up getting
> used profitably in a lot of places not imagined by their original
> creator, and that might be the case here.

Yeah, that's exactly my expectation.  If I thought we were only going to
have five uses of condition variables forevermore, I wouldn't be putting
much time into them.

> I think an extremely *likely* programming error when programming with
> CVs is to have a code path where ConditionVariableCancelSleep() does
> not get called.  The proposed change could make such mistakes much
> less noticeable.

Actually, the proposed change would turn it into barely an error at all.
The only advantage of cancelling a sleep immediately is that you avoid
possibly getting a no-longer-useful latch event later.

            regards, tom lane


Re: Condition variable live lock

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I began with the intention of making no non-cosmetic changes, but then
>> I started to wonder why ConditionVariablePrepareToSleep bothers with a
>> !proclist_contains test, when the calling process surely ought not be
>> in the list -- or any other list -- since it wasn't prepared to sleep.
>> And that led me down a different rabbit hole ending in the conclusion
>> that proclist.h could stand some improvements too.

> +1

>> The second attached patch is the cosmetic changes I want to make in
>> condition_variable.c/.h.

> +1

I pushed these, with an addition to ConditionVariablePrepareToSleep's
comment emphasizing that you're supposed to call it before the loop test;
this because some desultory digging found that one of the existing callers
is violating that rule.  replorigin_drop() in replication/logical/origin.c
has

                cv = &state->origin_cv;

                LWLockRelease(ReplicationOriginLock);
                ConditionVariablePrepareToSleep(cv);
                ConditionVariableSleep(cv, WAIT_EVENT_REPLICATION_ORIGIN_DROP);
                ConditionVariableCancelSleep();
                goto restart;

and unless I'm much mistaken, that is flat broken.  Somebody could
change the ReplicationState's acquired_by before we reach
ConditionVariablePrepareToSleep, in which case we won't get any signal
for that and will just sleep here indefinitely.

It would still be broken if we removed the PrepareToSleep call, but
at least it'd be a busy-wait not a hang.

Not sure about a convenient fix.  One idea is to move the
PrepareToSleep call inside the hold on ReplicationOriginLock, which
would fix things only if the various signallers of origin_cv did the
signalling while holding that lock, which they do not.  It's not clear
to me whether making them do so inside that lock would be problematic
for performance.

We can't just move the prepare/cancel sleep calls to around this whole
loop, because we do not know outside the loop which CV is to be waited
on.  So another idea is to get rid of that and have just one CV for all
ReplicationStates.

Or (and I'm sure Robert sees this coming), if we applied my proposed
patch to let ConditionVariableSleep auto-switch to a different CV,
then we could fix this code by simply removing the PrepareToSleep and
moving the ConditionVariableCancelSleep call to below the loop.
This would work even in the perhaps-unlikely case where the slot as
identified by roident changed positions, though you'd get an additional
trip through the loop (a/k/a test of the exit condition) if that
happened.
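
Roughly, the resulting shape would be as sketched below, using hypothetical stand-ins (ToyState, ToyLock, toy_drop_guts) rather than the real origin.c code:

#include "postgres.h"
#include "pgstat.h"
#include "storage/condition_variable.h"
#include "storage/lwlock.h"

typedef struct ToyState
{
    int                 acquired_by;    /* pid holding the slot, 0 if free */
    ConditionVariable   origin_cv;
} ToyState;

extern LWLock *ToyLock;
extern void toy_drop_guts(ToyState *state);

static void
toy_drop(ToyState *state)
{
    for (;;)
    {
        LWLockAcquire(ToyLock, LW_EXCLUSIVE);
        if (state->acquired_by == 0)
        {
            toy_drop_guts(state);
            LWLockRelease(ToyLock);
            break;
        }
        LWLockRelease(ToyLock);

        /*
         * No explicit PrepareToSleep: the first call here only prepares
         * the sleep and returns, so the exit condition is re-tested
         * before we ever block, closing the missed-signal window.  With
         * the auto-switching patch this stays correct even if some other
         * CV sleep happens to be prepared at this point.
         */
        ConditionVariableSleep(&state->origin_cv, PG_WAIT_EXTENSION);
    }

    /* One cancel after the retry loop, as described above. */
    ConditionVariableCancelSleep();
}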

>> Another loose end that I'm seeing here is that while a process waiting on
>> a condition variable will respond to a cancel or die interrupt, it will
>> not notice postmaster death.

> Yeah.

I'll respond to that separately to keep it from getting confused with
this replorigin_drop() bug.

            regards, tom lane


Re: Condition variable live lock

From
Tom Lane
Date:
Thomas Munro <thomas.munro@enterprisedb.com> writes:
> On Sun, Jan 7, 2018 at 10:00 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Another loose end that I'm seeing here is that while a process waiting on
>> a condition variable will respond to a cancel or die interrupt, it will
>> not notice postmaster death.  This seems unwise to me.  I think we should
>> adjust the WaitLatch call to include WL_POSTMASTER_DEATH as a wake
>> condition and just do a summary proc_exit(1) if it sees that.  I'd even
>> argue that that is a back-patchable bug fix.

> Yeah.  As far as I know so far, every place where we wait on a
> WaitEventSet falls into one of 4 categories when it comes to
> postmaster death:

> 1.  We proc_exit(1) if WL_POSTMASTER_DEATH is returned.
> 2.  We ereport(FATAL) if WL_POSTMASTER_DEATH is returned.
> 3.  We asked for WL_POSTMASTER_DEATH in the WaitEventSet, but we don't
> actually bother checking for it in the returned value.  Instead we
> call PostmasterIsAlive() every time through the loop (pgarch.c,
> syncrep.c, walsender.c and walreceiver.c).
> 4.  We didn't ask for WL_POSTMASTER_DEATH.

> Do I have that right?  I guess category 3 is suboptimal especially on
> some clunky but loveable kernels[1] and all cases of 4 including this
> one are probably bugs.  That makes me wonder why we don't change the
> WaitEventSet API so that it calls proc_exit(1) for you by default if
> you didn't ask to receive WL_POSTMASTER_DEATH explicitly, to handle
> category 1 for free.  If you asked for WL_POSTMASTER_DEATH explicitly
> then we could return it, to support category 2 callers that want to do
> something different.  Just a thought.

Yeah, that's worth thinking about, because I'm pretty sure that we have
had this type of bug before.  We have to be a bit careful because
Andres is thinking of using the WaitEventState support in the postmaster
loop, where it definitely shouldn't check WL_POSTMASTER_DEATH, and it
is probably also possible to reach some of those calls in standalone
backends, which shouldn't either.  So I'm imagining:

1. The WaitEventSet code always includes WL_POSTMASTER_DEATH checking
if IsUnderPostmaster --- and conversely, if !IsUnderPostmaster, it should
ignore any caller request for that.

2. If caller specifies WL_POSTMASTER_DEATH then we'll return that
flag bit, otherwise just exit(1) --- or should the default be
ereport(FATAL)?

3. Remove explicit specification of WL_POSTMASTER_DEATH anywhere that
the default handling is OK, which is probably almost everywhere.
Get rid of those now-redundant PostmasterIsAlive checks, too.
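
As a rough illustration of points 1 and 2 only, a hypothetical wrapper sketching the proposed default behaviour (not a proposed change to latch.c itself):

#include "postgres.h"
#include "miscadmin.h"
#include "storage/ipc.h"
#include "storage/latch.h"

static int
WaitLatchDefaultingPostmasterDeath(Latch *latch, int wakeEvents, long timeout,
                                   uint32 wait_event_info)
{
    bool    caller_wants_death = (wakeEvents & WL_POSTMASTER_DEATH) != 0;
    int     rc;

    /* Point 1: always watch for postmaster death when under the postmaster. */
    if (IsUnderPostmaster)
        wakeEvents |= WL_POSTMASTER_DEATH;
    else
        wakeEvents &= ~WL_POSTMASTER_DEATH;

    rc = WaitLatch(latch, wakeEvents, timeout, wait_event_info);

    /* Point 2: callers that didn't opt in get the default reaction. */
    if ((rc & WL_POSTMASTER_DEATH) && !caller_wants_death)
        proc_exit(1);

    return rc;
}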

I'm not real sure BTW why we have some callers that ereport and some
that just exit(1).  Seems like it would be better to be consistent,
though I'm not entirely sure which behavior to standardize on.

(Of course, this would only be an appropriate thing to do in HEAD.
In v10 I think we should do the narrow fix I suggested up top.)

            regards, tom lane


Re: Condition variable live lock

From
Robert Haas
Date:
On Mon, Jan 8, 2018 at 7:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm not real sure BTW why we have some callers that ereport and some
> that just exit(1).  Seems like it would be better to be consistent,
> though I'm not entirely sure which behavior to standardize on.

I think at one point we had an idea that regular backends would FATAL
if the postmaster fell over and other processes (e.g. checkpointer,
bgwriter) would exit silently.  Whether that was the right idea, and
whether it still is (or ever was) what the code actually does, I'm not sure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company