Re: parallel mode and parallel contexts

From: Noah Misch
Subject: Re: parallel mode and parallel contexts
Date:
Msg-id: 20150502163522.GA3461810@tornado.leadboat.com
In reply to: Re: parallel mode and parallel contexts  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: parallel mode and parallel contexts  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote:
> On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote:

[proposal: teach deadlock detector about parallel master waiting on workers]

> > deadlock.c is far from simple, and at least I don't find the control
> > flow to be particularly clear. So it's not easy.  It'd be advantageous
> > to tackle things at that level because it'd avoid the need to acquire
> > locks on some lock's waitqueue when blocking; we're going to do that a
> > lot.
> >
> > But it seems to me that it should be possible to succeed: In
> > FindLockCycleRecurse(), in the case that we're not waiting for an actual
> > lock (checkProc->links.next == NULL) we can add a case that considers
> > the 'blocking parallelism' case. ISTM that that's just a
> > FindLockCycleRecurse() call on the process that we're waiting for. We'd
> > either have to find the other side's locktag for DEADLOCK_INFO or invent
> > another field in there; but that seems like a solvable problem.
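
[A standalone toy of the recursion shape described above.  ToyProc and its
wait fields are invented for illustration, and this leaves out the soft-edge
and wait-queue handling that makes the real deadlock.c hard.]

/* Toy sketch: a FindLockCycleRecurse()-style walk with an extra case for
 * "blocked on parallel workers" alongside "blocked on a lock holder".
 * Standalone; build with: cc -std=c99 -o toy_cycle toy_cycle.c */
#include <stdbool.h>
#include <stdio.h>

#define MAX_WAITS 8

typedef struct ToyProc
{
    const char     *name;
    struct ToyProc *lock_wait;              /* proc holding the lock we want, or NULL */
    struct ToyProc *worker_wait[MAX_WAITS]; /* parallel workers we are blocked on */
    int             nworker_wait;
    bool            visiting;               /* already on the current search path? */
} ToyProc;

static bool
find_cycle(ToyProc *p)
{
    if (p->visiting)
        return true;            /* reached a proc already on the path: cycle */
    p->visiting = true;

    if (p->lock_wait && find_cycle(p->lock_wait))
        return true;
    for (int i = 0; i < p->nworker_wait; i++)
        if (find_cycle(p->worker_wait[i]))
            return true;

    p->visiting = false;
    return false;
}

int
main(void)
{
    ToyProc     leader = {"leader"};
    ToyProc     worker = {"worker"};
    ToyProc     other = {"other"};

    /* leader blocks on its worker; the worker wants a lock held by another
     * backend; that backend wants a lock held by the leader: a cycle. */
    leader.worker_wait[leader.nworker_wait++] = &worker;
    worker.lock_wait = &other;
    other.lock_wait = &leader;

    printf("deadlock: %s\n", find_cycle(&leader) ? "yes" : "no");
    return 0;
}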

> 1. The master doesn't actually *wait* until the very end of the parallel phase.
> 2. When the master does wait, it waits for all of the parallel workers
> at once, not each one individually.
> 
> So, I don't think anything as simplistic as teaching a blocking
> shm_mq_receive() to tip off the deadlock detector that we're waiting
> for the process on the other end of that particular queue can ever
> work.  Because of point #2, that never happens.  When I first started
> thinking about how to fix this, I said, well, that's no big deal, we
> can just advertise the whole list of processes that we're waiting for
> in shared memory, rather than just the one.  This is a bit tricky,
> though. Any general API for any backend to advertise that it's waiting
> for an arbitrary subset of the other backends would require O(n^2)
> shared memory state.  That wouldn't be completely insane, but it's not
> great, either.  For this particular case, we could optimize that down
> to O(n) by just chaining all of the children of a given parallel group
> leader in a linked list whose nodes are inlined in their PGPROCs, but that
> doesn't feel very general, because it forecloses the possibility of
> the children ever using that API, and I think they might need to.  If
> nothing else, they might need to advertise that they're blocking on
> the master if they are trying to send data, the queue is full, and
> they have to wait for the master to drain some of it before they can
> proceed.
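
[For concreteness, the O(n) chaining could be inlined links of roughly this
shape.  Field names here are made up, the real PGPROC looks nothing like this
toy struct, and, as Robert notes just above, it gives the workers no way to
advertise waits of their own.]

/* Toy sketch: chain a parallel group's workers off the leader using links
 * stored inline in each proc, so advertising "leader waits on its group"
 * costs O(n) shared state rather than O(n^2) pairwise flags. */
#include <stdio.h>

typedef struct ToyPGPROC
{
    int                 pid;
    struct ToyPGPROC   *group_leader;   /* NULL for the leader itself */
    struct ToyPGPROC   *group_next;     /* next member of the same group */
} ToyPGPROC;

int
main(void)
{
    ToyPGPROC   leader = {1000, NULL, NULL};
    ToyPGPROC   w1 = {1001, &leader, NULL};
    ToyPGPROC   w2 = {1002, &leader, NULL};
    ToyPGPROC   w3 = {1003, &leader, NULL};

    /* Nodes are inlined in the procs themselves; no extra allocation. */
    leader.group_next = &w1;
    w1.group_next = &w2;
    w2.group_next = &w3;

    /* A deadlock checker can walk the whole group in O(n). */
    for (ToyPGPROC *p = leader.group_next; p != NULL; p = p->group_next)
        printf("leader %d waits on worker %d\n", leader.pid, p->pid);
    return 0;
}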

Perhaps, rather than model it as master M waiting on worker list W1|W2|W3,
model it with queue-nonempty and queue-nonfull events, one pair per queue.  M
subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it.  M
publishes to queue0-nonfull, and the workers subscribe.  An edge forms in the
deadlock detector's graph when all of an event's subscribers wait on it.  (One
can approximate that in userspace with a pair of advisory locks.)
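
[A standalone toy of that rule, under one reading of it: once every subscriber
of an event is waiting on the event, add a wait-for edge from each such
subscriber to each of the event's publishers.  All names and structures below
are invented for illustration.]

#include <stdbool.h>
#include <stdio.h>

#define MAX_PROCS 8

typedef struct
{
    const char *name;
    int         npub;
    int         nsub;
    int         pub[MAX_PROCS];     /* indexes of publishing processes */
    int         sub[MAX_PROCS];     /* indexes of subscribing processes */
} Event;

static const char *proc_name[] = {"M", "W1", "W2", "W3"};
static bool waiting_on[4][2];       /* waiting_on[proc][event]? */
static bool edge[4][4];             /* resulting wait-for edges */

static void
compute_edges(const Event *ev, int evno)
{
    /* Edges form only once all of the event's subscribers wait on it. */
    for (int s = 0; s < ev->nsub; s++)
        if (!waiting_on[ev->sub[s]][evno])
            return;
    for (int s = 0; s < ev->nsub; s++)
        for (int p = 0; p < ev->npub; p++)
            edge[ev->sub[s]][ev->pub[p]] = true;
}

int
main(void)
{
    /* queue0-nonempty: W1..W3 publish, M subscribes.
     * queue0-nonfull:  M publishes, W1..W3 subscribe. */
    Event events[2] = {
        {"queue0-nonempty", 3, 1, {1, 2, 3}, {0}},
        {"queue0-nonfull", 1, 3, {0}, {1, 2, 3}},
    };

    /* Suppose everyone is stuck: M waits for data, the workers for space. */
    waiting_on[0][0] = true;
    waiting_on[1][1] = waiting_on[2][1] = waiting_on[3][1] = true;

    for (int e = 0; e < 2; e++)
        compute_edges(&events[e], e);

    /* Prints M -> W1/W2/W3 and W1/W2/W3 -> M; a cycle search over these
     * edges would then decide whether to report a deadlock. */
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++)
            if (edge[a][b])
                printf("%s waits for %s\n", proc_name[a], proc_name[b]);
    return 0;
}

[Whether M's outgoing edges should count as hard edges is exactly the
any-of-the-publishers question discussed below.]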

> After thinking about it a bit more, I realized that even if we settle
> on some solution to that problem, there's another issue: the
> wait-edges created by this system don't really have the same semantics
> as regular lock waits.  Suppose A is waiting on a lock held by B and B
> is waiting for a lock held by A; that's a deadlock.  But if A is
> waiting for B to write to a tuple queue and B is waiting for A to read
> from a tuple queue, that's not a deadlock if the queues in question
> are the same.

I do see a deadlock.  B wants to block until A reads from the queue, so it
advertises that and sleeps.  Waking up deadlock_timeout later, it notices that
A is waiting for B to write something.  B will not spontaneously suspend
waiting to write to the queue, nor will A suspend waiting to read from the
queue.  Thus, the deadlock is valid.  This assumes the deadlock detector
reasons from an authoritative picture of pending waits and that we reliably
wake up a process when the condition it sought has arrived.  We have that for
heavyweight lock acquisitions.  The assumption may be incompatible with
Andres's hope, quoted above, of avoiding the full lock acquisition procedure.

> A further difference in the semantics of these wait-edges is that if
> process A is awaiting AccessExclusiveLock on a resource held by B, C,
> and D at AccessShareLock, it needs to wait for all of those processes
> to release their locks before it can do anything else.  On the other
> hand, if process A is awaiting tuples from B, C, and D, it just needs
> ONE of those processes to emit tuples in order to make progress.  Now

Right.

> maybe that doesn't make any difference in practice, because even if
> two of those processes are making lively progress and A is receiving
> tuples from them and processing them like gangbusters, that's probably
> not going to help the third one get unstuck.  If we adopt the approach
> of discounting that possibility, then as long as the parallel leader
> can generate tuples locally (that is, local_scan_done = false) we
> don't report the deadlock, but as soon as it can no longer do that
> (local_scan_done = true) then we do, even though we could still
> theoretically read more tuples from the non-stuck workers.

I can't discount the possibility that the master could unstick a worker in the
course of ingesting tuples from other workers.  We could accept a coding rule
against parallel algorithms behaving that way, on pain of unprincipled
deadlock.  It's unattractive to bake such a high-level notion of acceptable
wait patterns into the deadlock detector, and every coding rule has noteworthy
cost.  Since detecting deadlocks long after they were inevitable is also
unattractive, the rule might be worthwhile.

> So then
> you have to wonder why we're not solving problem #1, because the
> deadlock was just as certain before we generated the maximum possible
> number of tuples locally as it was afterwards.

The "1." above reads like a benefit, not a problem.  What might you solve
about it?

Thanks,
nm


