Discussion: parallel mode and parallel contexts
Attached is a patch that adds two new concepts: parallel mode, and parallel contexts. The idea of this code is to provide a framework for specific parallel things that you want to do, such as parallel sequential scan or parallel sort. When you're in parallel mode, certain operations - like DDL, and anything that would update the command counter - are prohibited. But you gain the ability to create a parallel context, which in turn can be used to fire up parallel workers. And if you do that, then your snapshot, combo CID hash, and GUC values will be copied to the worker, which is handy.

This patch is very much half-baked. Among the things that aren't right yet:

- There's no handling of heavyweight locking, so I'm quite sure it'll be possible to cause undetected deadlocks if you work at it. There are some existing threads on this topic and perhaps we can incorporate one of those concepts into this patch, but this version does not.

- There's no provision for copying the parent's XID and sub-XIDs, if any, to the background workers, which means that if you use this and your transaction has written data, you will get wrong answers, because TransactionIdIsCurrentTransactionId() will do the wrong thing.

- There's no really deep integration with the transaction system yet. Previous discussions seem to point toward the need to do various types of coordinated cleanup when the parallel phase is done, or when an error happens. In particular, you probably don't want the abort record to get written while there may still be backends that are part of that transaction doing work; and you certainly don't want files created by the current transaction to get removed while some other backend is still writing them. The right way to work all of this out needs some deep thought; agreeing on what the design should be is probably harder than implementing it.

Despite the above, I think this does a fairly good job laying out how I believe parallelism can be made to work in PostgreSQL: copy a bunch of state from the user backend to the parallel workers, compute for a while, and then shut everything down. Meanwhile, while parallelism is running, forbid changes to state that's already been synchronized, so that things don't get out of step. I think the patch shows how the act of synchronizing state from the master to the workers can be made quite modular and painless, even though it doesn't synchronize everything relevant.

I'd really appreciate any design thoughts anyone may have on how to fix the problems mentioned above, how to fix any other problems you foresee, or even just a list of reasons why you think this will blow up. What I think is that we're really pretty close to doing real parallelism, and that this is probably the last major piece of infrastructure that we need in order to support parallel execution in a reasonable way. That's a pretty bold statement, but I believe it to be true: despite the limitations of the current version of this patch, I think we're very close to being able to sit down and code up a parallel algorithm in PostgreSQL and have that not be all that hard. Once we get the first one, I expect a whole bunch more to come together far more quickly than the first one did.

I would be remiss if I failed to mention that this patch includes work by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as well as my former colleague Noah Misch; and that it would not have been possible without the patient support of EnterpriseDB management.
Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
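To make the shape of the proposed API concrete, here is a minimal sketch of how a caller might drive it, using the function names that come up later in this thread (CreateParallelContext, LaunchParallelWorkers, WaitForParallelWorkersToFinish, DestroyParallelContext). The exact signatures, the header location, the enter/exit calls for parallel mode, and my_worker_main are illustrative assumptions, not the patch's actual interface:

    #include "postgres.h"
    #include "access/parallel.h"    /* assumed header for the proposed API */
    #include "storage/dsm.h"

    /* hypothetical worker entrypoint; each launched worker runs this */
    static void my_worker_main(dsm_segment *seg);

    static void
    do_something_in_parallel(void)
    {
        ParallelContext *pcxt;

        EnterParallelMode();            /* assumed: start forbidding unsafe operations */

        /* ask for four workers, each running my_worker_main */
        pcxt = CreateParallelContext(my_worker_main, 4);
        LaunchParallelWorkers(pcxt);    /* may start fewer workers than requested */

        /* ... the master does its own share of the work here ... */

        WaitForParallelWorkersToFinish(pcxt);   /* wait before reading any results */
        DestroyParallelContext(pcxt);           /* only then tear down the DSM segment */

        ExitParallelMode();             /* assumed counterpart of EnterParallelMode */
    }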
On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote: > I would be remiss if I failed to mention that this patch includes work > by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as > well as my former colleague Noah Misch; and that it would not have > been possible without the patient support of EnterpriseDB management. Noted and thanks to all. I will make this my priority for review, but regrettably not until New Year, about 2.5-3 weeks away. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote: > >> I would be remiss if I failed to mention that this patch includes work >> by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as >> well as my former colleague Noah Misch; and that it would not have >> been possible without the patient support of EnterpriseDB management. > > Noted and thanks to all. > > I will make this my priority for review, but regrettably not until New > Year, about 2.5-3 weeks away. OK! In the meantime, I had a good chat with Heikki on IM yesterday which gave me some new ideas on how to fix up the transaction handling in here, which I am working on implementing. So hopefully I will have that by then. I am also hoping Amit will be adapting his parallel seq-scan patch to this framework. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Dec 17, 2014 at 2:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > In the meantime, I had a good chat with Heikki on IM yesterday which > gave me some new ideas on how to fix up the transaction handling in > here, which I am working on implementing. So hopefully I will have > that by then. And here is a new version. This version has some real integration with the transaction system, along the lines of what Heikki suggested to me, so hopefully tuple visibility calculations in a parallel worker will now produce the right answers, though I need to integrate this with code to actually do something-or-other in parallel in order to really test that. There are still some problems with parallel worker shutdown. As hard as I tried to fight it, I'm gradually resigning myself to the fact that we're probably going to have to set things up so that the worker waits for all of its children to die during abort processing (and during commit processing, but that's not bothersome). Otherwise, to take just one example, chaos potentially ensues if you run a parallel query in a subtransaction and then roll back to a savepoint. But this version just kills the workers and doesn't actually wait for them to die; I'll see about fixing that, but wanted to send this around for comments first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Thu, Dec 18, 2014 at 4:53 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Dec 17, 2014 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote: >> On 12 December 2014 at 22:52, Robert Haas <robertmhaas@gmail.com> wrote: >> >>> I would be remiss if I failed to mention that this patch includes work >>> by my colleagues Amit Kapila, Rushabh Lathia, and Jeevan Chalke, as >>> well as my former colleague Noah Misch; and that it would not have >>> been possible without the patient support of EnterpriseDB management. >> >> Noted and thanks to all. >> >> I will make this my priority for review, but regrettably not until New >> Year, about 2.5-3 weeks away. > > OK! > > In the meantime, I had a good chat with Heikki on IM yesterday which > gave me some new ideas on how to fix up the transaction handling in > here, which I am working on implementing. So hopefully I will have > that by then. > > I am also hoping Amit will be adapting his parallel seq-scan patch to > this framework. Simon, if you are planning to review this patch soon, could you add your name as a reviewer for this patch in the commit fest app? Thanks, -- Michael
On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> And here is a new version.

Here is another new version, with lots of bugs fixed. The worker shutdown sequence is now much more robust, although I think there may still be a bug or two lurking, and I fixed a bunch of other things too. There's now a function called parallel_count() in the parallel_dummy extension contained herein, which does a parallel count of a relation you choose:

rhaas=# select count(*) from pgbench_accounts;
  count
---------
 4000000
(1 row)

Time: 396.635 ms

rhaas=# select parallel_count('pgbench_accounts'::regclass, 0);
NOTICE: PID 2429 counted 4000000 tuples
 parallel_count
----------------
        4000000
(1 row)

Time: 234.445 ms

rhaas=# select parallel_count('pgbench_accounts'::regclass, 4);
NOTICE: PID 2499 counted 583343 tuples
CONTEXT: parallel worker, pid 2499
NOTICE: PID 2500 counted 646478 tuples
CONTEXT: parallel worker, pid 2500
NOTICE: PID 2501 counted 599813 tuples
CONTEXT: parallel worker, pid 2501
NOTICE: PID 2502 counted 611389 tuples
CONTEXT: parallel worker, pid 2502
NOTICE: PID 2429 counted 1558977 tuples
 parallel_count
----------------
        4000000
(1 row)

Time: 150.004 ms

rhaas=# select parallel_count('pgbench_accounts'::regclass, 8);
NOTICE: PID 2429 counted 1267566 tuples
NOTICE: PID 2504 counted 346236 tuples
CONTEXT: parallel worker, pid 2504
NOTICE: PID 2505 counted 345077 tuples
CONTEXT: parallel worker, pid 2505
NOTICE: PID 2506 counted 355325 tuples
CONTEXT: parallel worker, pid 2506
NOTICE: PID 2507 counted 350872 tuples
CONTEXT: parallel worker, pid 2507
NOTICE: PID 2508 counted 338855 tuples
CONTEXT: parallel worker, pid 2508
NOTICE: PID 2509 counted 336903 tuples
CONTEXT: parallel worker, pid 2509
NOTICE: PID 2511 counted 326716 tuples
CONTEXT: parallel worker, pid 2511
NOTICE: PID 2510 counted 332450 tuples
CONTEXT: parallel worker, pid 2510
 parallel_count
----------------
        4000000
(1 row)

Time: 166.347 ms

This example table (pgbench_accounts, scale 40, ~537 MB) is small enough that parallelism doesn't really make sense; you can see from the notice messages above that the master manages to count a quarter of the table before the workers get themselves up and running. The point here is rather to show how the infrastructure works and that it can be used to write code to do practically useful tasks in a surprisingly small number of lines of code; parallel_count is only maybe ~100 lines on top of the base patch.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Thu, Dec 18, 2014 at 1:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> In the meantime, I had a good chat with Heikki on IM yesterday which
> gave me some new ideas on how to fix up the transaction handling in
> here, which I am working on implementing. So hopefully I will have
> that by then.
>
> I am also hoping Amit will be adapting his parallel seq-scan patch to
> this framework.
>
While working on parallel seq-scan patch to adapt this framework, I
noticed a few things and have questions regarding the same.
1.
Currently the parallel worker just attaches to the error queue; for the tuple queue,
do you expect that to be done in the same place or in the caller-supplied
function? If the latter, then we need the segment address as input to that function
to attach the queue to the segment (shm_mq_attach()).
Another question I have in this regard is that if we have redirected
messages to the error queue by using pq_redirect_to_shm_mq, then how can
we set up the tuple queue for the same purpose? Similarly, I think more handling
is needed for the tuple queue in the master backend, and the answer to the above will
dictate the best way to do it.
2.
Currently there is no interface for wait_for_workers_to_become_ready()
in your patch. Don't you think it is important that, before starting to fetch
tuples, we make sure all workers are started? What if some worker
fails to start?
On 2014-12-22 14:14:31 -0500, Robert Haas wrote:
> On Fri, Dec 19, 2014 at 8:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > And here is a new version.
>
> Here is another new version, with lots of bugs fixed.

A couple remarks:

* Shouldn't this provide more infrastructure to deal with the fact that we might get less parallel workers than we had planned for?
* I wonder if parallel contexts shouldn't be tracked via resowners
* combocid.c should error out in parallel mode, as insurance
* I doubt it's a good idea to allow heap_insert, heap_inplace_update, index_insert. I'm not convinced that those will be handled correct and relaxing restrictions is easier than adding them.
* I'd much rather separate HandleParallelMessageInterrupt() in one variant that just tells the machinery there's a interrupt (called from signal handlers) and one that actually handles them. We shouldn't even consider adding any new code that does allocations, errors and such in signal handlers. That's just a *bad* idea.
* I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd much rather place a struct there and be careful about not using pointers. That also would obviate the need to reserve some ids.
* Doesn't the restriction in GetSerializableTransactionSnapshotInt() apply for repeatable read just the same?
* I'm not a fan of relying on GetComboCommandId() to restore things in the same order as before.
* I'd say go ahead and commit the BgworkerByOid thing earlier/independently. I've seen need for that a couple times.
* s/entrypints/entrypoints/

Greetings,

Andres Freund

-- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
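For reference, the split Andres is asking for in the HandleParallelMessageInterrupt() remark would look roughly like this: the signal handler only records that something arrived and sets the process latch, and any work involving allocations or elog happens later from normal interrupt-processing code. ParallelMessagePending and HandleParallelMessages appear elsewhere in this thread; the handler name and body shown here are illustrative:

    #include "postgres.h"
    #include <signal.h>
    #include "miscadmin.h"
    #include "storage/latch.h"
    #include "storage/proc.h"

    volatile sig_atomic_t ParallelMessagePending = false;

    /* signal handler: no palloc, no elog; just note the fact and wake up */
    static void
    parallel_message_signal_handler(SIGNAL_ARGS)   /* assumed handler name */
    {
        int     save_errno = errno;

        ParallelMessagePending = true;
        SetLatch(&MyProc->procLatch);

        errno = save_errno;
    }

    /* called later, e.g. from ProcessInterrupts(), in normal backend context */
    void
    HandleParallelMessages(void)
    {
        ParallelMessagePending = false;

        /* safe to palloc, ereport, read the workers' error queues, etc. here */
    }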
On Fri, Jan 2, 2015 at 9:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> While working on parallel seq-scan patch to adapt this framework, I
> noticed a few things and have questions regarding the same.
>
> 1.
> Currently the parallel worker just attaches to the error queue; for the tuple queue,
> do you expect that to be done in the same place or in the caller-supplied
> function? If the latter, then we need the segment address as input to that function
> to attach the queue to the segment (shm_mq_attach()).
> Another question I have in this regard is that if we have redirected
> messages to the error queue by using pq_redirect_to_shm_mq, then how can
> we set up the tuple queue for the same purpose? Similarly, I think more handling
> is needed for the tuple queue in the master backend, and the answer to the above will
> dictate the best way to do it.

I've come to the conclusion that it's a bad idea for tuples to be sent through the same queue as errors. We want errors (or notices, but especially errors) to be processed promptly, but there may be a considerable delay in processing tuples. For example, imagine a plan that looks like this:

Nested Loop
-> Parallel Seq Scan on p
-> Index Scan on q
     Index Cond: q.x = p.x

The parallel workers should fill up the tuple queues used by the parallel seq scan so that the master doesn't have to do any of that work itself. Therefore, the normal situation will be that those tuple queues are all full. If an error occurs in a worker at that point, it can't add it to the tuple queue, because the tuple queue is full. But even if it could do that, the master then won't notice the error until it reads all of the queued-up tuple messages that are in the queue in front of the error, and maybe some messages from the other queues as well, since it probably round-robins between the queues or something like that. Basically, it could do a lot of extra work before noticing that error in there.

Now we could avoid that by having the master read messages from the queue immediately and just save them off to local storage if they aren't error messages. But that's not very desirable either, because now we have no flow control. The workers will just keep spamming tuples that the master isn't ready for into the queues, and the master will keep reading them and saving them to local storage, and eventually it will run out of memory and die. We could engineer some solution to this problem, of course, but it seems quite a bit simpler to just have two queues. The error queues don't need to be very big (I made them 16kB, which is trivial on any system on which you care about having working parallelism) and the tuple queues can be sized as needed to avoid pipeline stalls.

> 2.
> Currently there is no interface for wait_for_workers_to_become_ready()
> in your patch. Don't you think it is important that, before starting to fetch
> tuples, we make sure all workers are started? What if some worker
> fails to start?

I think that, in general, getting the most benefit out of parallelism means *avoiding* situations where backends have to wait for each other. If the relation being scanned is not too large, the user backend might be able to finish the whole scan - or a significant fraction of it - before the workers initialize. Of course in that case it might have been a bad idea to parallelize in the first place, but we should still try to make the best of the situation.
If some worker fails to start, then instead of having the full degree N parallelism we were hoping for, we have some degree K < N, so things will take a little longer, but everything should still work. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
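The two-queue arrangement described above could be set up per worker roughly as follows. The queue sizes, the DSM layout, and setup_worker_queues() itself are illustrative assumptions; only the shm_mq and dsm calls are existing infrastructure:

    #include "postgres.h"
    #include "storage/dsm.h"
    #include "storage/proc.h"
    #include "storage/shm_mq.h"

    #define ERROR_QUEUE_SIZE    16384       /* small: errors and notices only */
    #define TUPLE_QUEUE_SIZE    65536       /* illustrative; sized to avoid stalls */

    /*
     * Lay out one error queue and one tuple queue at the start of an
     * already-created DSM segment and attach to both as the receiving
     * (master) side.  Real code would do this once per worker and record
     * the offsets where the workers can find them.
     */
    static void
    setup_worker_queues(dsm_segment *seg,
                        shm_mq_handle **error_mqh,
                        shm_mq_handle **tuple_mqh)
    {
        char       *base = dsm_segment_address(seg);
        shm_mq     *err_mq = shm_mq_create(base, ERROR_QUEUE_SIZE);
        shm_mq     *tup_mq = shm_mq_create(base + ERROR_QUEUE_SIZE,
                                           TUPLE_QUEUE_SIZE);

        /* the master reads from both queues */
        shm_mq_set_receiver(err_mq, MyProc);
        shm_mq_set_receiver(tup_mq, MyProc);

        *error_mqh = shm_mq_attach(err_mq, seg, NULL);
        *tuple_mqh = shm_mq_attach(tup_mq, seg, NULL);
    }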
On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote: > A couple remarks: > * Shouldn't this provide more infrastructure to deal with the fact that > we might get less parallel workers than we had planned for? Maybe, but I'm not really sure what that should look like. My working theory is that individual parallel applications (executor nodes, or functions that use parallelism internally, or whatever) should be coded in such a way that they work correctly even if the number of workers that starts is smaller than what they requested, or even zero. It may turn out that this is impractical in some cases, but I don't know what those cases are yet so I don't know what the common threads will be. I think parallel seq scan should work by establishing N tuple queues (where N is the number of workers we have). When asked to return a tuple, the master should poll all of those queues one after another until it finds one that contains a tuple. If none of them contain a tuple then it should claim a block to scan and return a tuple from that block. That way, if there are enough running workers to keep up with the master's demand for tuples, the master won't do any of the actual scan itself. But if the workers can't keep up -- e.g. suppose 90% of the CPU consumed by the query is in the filter qual for the scan -- then the master can pitch in along with everyone else. As a non-trivial fringe benefit, if the workers don't all start, or take a while to initialize, the user backend isn't stalled meanwhile. > * I wonder if parallel contexts shouldn't be tracked via resowners That is a good question. I confess that I'm somewhat fuzzy about which things should be tracked via the resowner mechanism vs. which things should have their own bespoke bookkeeping. However, the AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), which makes merging them seem not particularly clean. > * combocid.c should error out in parallel mode, as insurance Eh, which function? HeapTupleHeaderGetCmin(), HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work in parallel mode. HeapTupleHeaderAdjustCmax() could Assert(!IsInParallelMode()) but the only calls are in heap_update() and heap_delete() which already have error checks, so putting yet another elog() there seems like overkill. > * I doubt it's a good idea to allow heap_insert, heap_inplace_update, > index_insert. I'm not convinced that those will be handled correct and > relaxing restrictions is easier than adding them. I'm fine with adding checks to heap_insert() and heap_inplace_update(). I'm not sure we really need to add anything to index_insert(); how are we going to get there without hitting some other prohibited operation first? > * I'd much rather separate HandleParallelMessageInterrupt() in one > variant that just tells the machinery there's a interrupt (called from > signal handlers) and one that actually handles them. We shouldn't even > consider adding any new code that does allocations, errors and such in > signal handlers. That's just a *bad* idea. That's a nice idea, but I didn't invent the way this crap works today. ImmediateInterruptOK gets set to true while performing a lock wait, and we need to be able to respond to errors while in that state. I think there's got to be a better way to handle that than force every asynchronous operation to have to cope with the fact that ImmediateInterruptOK may be set or not set, but as of today that's what you have to do. > * I'm not a fan of the shm_toc stuff... 
> Verbose and hard to read. I'd much rather place a struct there and be
> careful about not using pointers. That also would obviate the need to
> reserve some ids.

I don't see how that's going to work with variable-size data structures, and a bunch of the things that we need to serialize are variable-size. Moreover, I'm not really interested in rehashing this design again. I know it's not your favorite; you've said that before. But it makes it possible to write code to do useful things in parallel, a capability that we do not otherwise have. And I like it fine.

> * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
> apply for repeatable read just the same?

Yeah. I'm not sure whether we really need that check at all, because there is a check in GetTransactionSnapshot() that probably checks the same thing.

> * I'm not a fan of relying on GetComboCommandId() to restore things in
> the same order as before.

I thought that was a little wonky, but it's not likely to break, and there is an elog(ERROR) there to catch it if it does, so I'd rather not make it more complicated.

> * I'd say go ahead and commit the BgworkerByOid thing
> earlier/independently. I've seen need for that a couple times.

Woohoo! I was hoping someone would say that.

> * s/entrypints/entrypoints/

Thanks.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
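The master-side behaviour Robert sketches earlier in this message - poll each worker's tuple queue without blocking, and scan a block yourself if nothing is ready - could look roughly like the following. deform_queued_tuple() and scan_next_block_locally() are assumed helpers standing in for the real tuple-deserialization and block-claiming logic, and a real implementation would wait on a latch rather than spin:

    #include "postgres.h"
    #include "access/htup.h"
    #include "storage/shm_mq.h"

    /* assumed helpers, not part of the patch */
    extern HeapTuple deform_queued_tuple(void *data, Size nbytes);
    extern HeapTuple scan_next_block_locally(void);

    static HeapTuple
    parallel_scan_next(shm_mq_handle **tuple_mqh, int nworkers)
    {
        for (;;)
        {
            int         i;
            bool        all_detached = true;
            HeapTuple   tuple;

            for (i = 0; i < nworkers; ++i)
            {
                Size            nbytes;
                void           *data;
                shm_mq_result   res;

                if (tuple_mqh[i] == NULL)
                    continue;           /* that worker already detached */
                all_detached = false;

                res = shm_mq_receive(tuple_mqh[i], &nbytes, &data, true);
                if (res == SHM_MQ_SUCCESS)
                    return deform_queued_tuple(data, nbytes);
                if (res == SHM_MQ_DETACHED)
                    tuple_mqh[i] = NULL;    /* stop polling this queue */
                /* SHM_MQ_WOULD_BLOCK: nothing queued right now; try the next */
            }

            /* no queued tuple was ready: pitch in and scan a block ourselves */
            tuple = scan_next_block_locally();
            if (tuple != NULL)
                return tuple;

            /* done only when our scan is exhausted and every worker detached */
            if (all_detached)
                return NULL;

            /* otherwise loop; a real implementation would wait on a latch here */
        }
    }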
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> * I wonder if parallel contexts shouldn't be tracked via resowners > That is a good question. I confess that I'm somewhat fuzzy about > which things should be tracked via the resowner mechanism vs. which > things should have their own bespoke bookkeeping. However, the > AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), > which makes merging them seem not particularly clean. FWIW, the resowner mechanism was never meant as a substitute for bespoke bookkeeping. What it is is a helper mechanism to reduce the need for PG_TRY blocks that guarantee that a resource-releasing function will be called even in error paths. I'm not sure whether that analogy applies well in parallel-execution cases. regards, tom lane
On 22 December 2014 at 19:14, Robert Haas <robertmhaas@gmail.com> wrote:
> Here is another new version, with lots of bugs fixed.

An initial blind review, independent of other comments already made on thread.

OK src/backend/access/heap/heapam.c
heapam.c prohibitions on update and delete look fine

OK src/backend/access/transam/Makefile

OK src/backend/access/transam/README.parallel
README.parallel and all concepts look good

PARTIAL src/backend/access/transam/parallel.c
wait_patiently mentioned in comment but doesn’t exist
Why not make nworkers into a uint?
Trampoline? Really? I think we should define what we mean by that, somewhere, rather than just use the term as if it was self-evident.
Entrypints? ..

OK src/backend/access/transam/varsup.c

QUESTIONS src/backend/access/transam/xact.c
These comments don’t have any explanation or justification

+ * This stores XID of the toplevel transaction to which we belong.
+ * Normally, it's the same as TopTransactionState.transactionId, but in
+ * a parallel worker it may not be.

+ * If we're a parallel worker, there may be additional XIDs that we need
+ * to regard as part of our transaction even though they don't appear
+ * in the transaction state stack.

This comment is copied-and-pasted too many times for safety and elegance

+ /*
+ * Workers synchronize transaction state at the beginning of each parallel
+ * operation, so we can't account for transaction state change after that
+ * point. (Note that this check will certainly error out if s->blockState
+ * is TBLOCK_PARALLEL_INPROGRESS, so we can treat that as an invalid case
+ * below.)
+ */

We need a single Check() that contains more detail and comments within

Major comments

* Parallel stuff at start sounded OK
* Transaction stuff strikes me as overcomplicated and error prone. Given there is no explanation or justification for most of it, I’d be inclined to question why it's there
* If we have a trampoline for DSM, why don’t we use a trampoline for snapshots, then you wouldn’t need to do all this serialize stuff

This is very impressive, but it looks like we’re trying to lift too much weight on the first lift. If we want to go anywhere near this, we need to have very clear documentation about how and why it's like that.

I’m actually very sorry to say that because the review started very well, much much better than most.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 5, 2015 at 9:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> > * Doesn't the restriction in GetSerializableTransactionSnapshotInt()
> > apply for repeatable read just the same?
>
> Yeah. I'm not sure whether we really need that check at all, because
> there is a check in GetTransactionSnapshot() that probably checks the
> same thing.
>
The check in GetSerializableTransactionSnapshotInt() will also prohibit
any user/caller of SetSerializableTransactionSnapshot() in parallel mode
as that can't be prohibited by GetTransactionSnapshot().
On Mon, Jan 5, 2015 at 6:21 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > PARTIAL src/backend/access/transam/parallel.c > wait_patiently mentioned in comment but doesn’t exist Woops. Fixed. Originally, DestroyParallelContext() had a Boolean argument wait_patiently, specifying whether it should exit at once or do the equivalent of WaitForParallelWorkersToFinish() first. But that turned out not to work well - for example, in the included parallel_count example, it's very important to (1) first wait for all the workers to exit, (2) then read the final count, and (3) only after that destroy the dynamic shared memory segment. So WaitForParallelWorkersToFinish() got separated out into a different function, but I failed to update the comments. > Why not make nworkers into a uint? We don't have or use a type called uint. I could make it uint32 or uint64 or uint16, but I don't see much advantage in that. I could also make it unsigned, but we make very little use of the unsigned type generally, so I picked int as most consistent with our general practice. If there's a consensus that something else is much better than int, I will change it throughout, but I think we have bigger fish to fry. > Trampoline? Really? I think we should define what we mean by that, > somewhere, rather than just use the term as if it was self-evident. Comments improved. > Entrypints? Already noted by Andres; fixed in the attached version. > These comments don’t have any explanation or justification OK, I rewrote them. Hopefully it's better now. > This comment is copied-and-pasted too many times for safety and elegance It's not the same comment each time it appears; it appears in the exact form you quoted just twice. It does get a little long-winded, I guess, but I think that parallelism is a sufficiently-unfamiliar concept to us as a group that it makes sense to have a detailed comment in each place explaining exactly why that particular callsite requires a check, so that's what I've tried to do. > * Parallel stuff at start sounded OK > * Transaction stuff strikes me as overcomplicated and error prone. > Given there is no explanation or justification for most of it, I’d be > inclined to question why its there Gosh, I was pretty pleased with how simple the transaction integration turned out to be. Most of what's there right now is either (a) code to copy state from the master to the parallel workers or (b) code to throw errors if the workers try to things that aren't safe. I suspect there are a few things missing, but I don't see anything there that looks unnecessary. > * If we have a trampoline for DSM, why don’t we use a trampoline for > snapshots, then you wouldn’t need to do all this serialize stuff The trampoline is just to let extensions use this infrastructure if they want to; there's no way to avoid the serialization-and-restore stuff unless we switch to creating the child processes using fork(), but that would: 1. Not work on Windows. 2. Require the postmaster to deal with processes that are not its immediate children. 3. Possibly introduce other bugs. Since I've spent a year and a half trying to get this method to work and am now, I think, almost there, I'm not particularly sanguine about totally changing the approach. > This is very impressive, but it looks like we’re trying to lift too > much weight on the first lift. If we want to go anywhere near this, we > need to have very clear documentation about how and why its like that. 
> I’m actually very sorry to say that because the review started very > well, much much better than most. When I posted the group locking patch, it got criticized because it didn't actually do anything useful by itself; similarly, the pg_background patch was criticized for not being a large enough step toward parallelism. So, this time, I posted something more comprehensive. I don't think it's quite complete yet. I expect a committable version of this patch to be maybe another 500-1000 lines over what I have here right now -- I think it needs to do something about heavyweight locking, and I expect that there are some unsafe things that aren't quite prohibited yet. But the current patch is only 2300 lines, which is not astonishingly large for a feature of this magnitude; if anything, I'd say it's surprisingly small, due to a year and a half of effort laying the necessary groundwork via long series of preliminary commits. I'm not unwilling to divide this up some more if we can agree on a way to do that that makes sense, but I think we're nearing the point where we need to take the plunge and say, look, this is version one of parallelism. Thunk. In addition to the changes mentioned above, the attached version prohibits a few more things (as suggested by Andres) and passes the dsm_segment to the user-supplied entrypoint (as requested off-list by Andres, because otherwise you can't set up additional shm_mq structures). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
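The ordering Robert describes above for parallel_count, sketched as a small helper. The names (pcxt, local_count, a plain int64 counter placed in the DSM segment) are assumptions; the point is that reading the shared counter with an ordinary load is only safe because step 1 has already guaranteed the workers are gone:

    #include "postgres.h"
    #include "access/parallel.h"    /* assumed location of the ParallelContext API */

    static int64
    parallel_count_finish(ParallelContext *pcxt, int64 local_count,
                          int64 *shared_count)
    {
        int64   total;

        WaitForParallelWorkersToFinish(pcxt);   /* 1. wait for every worker to exit */
        total = local_count + *shared_count;    /* 2. only then read the final count */
        DestroyParallelContext(pcxt);           /* 3. only then tear down the DSM segment */

        return total;
    }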
On 6 January 2015 at 16:33, Robert Haas <robertmhaas@gmail.com> wrote: >> These comments don’t have any explanation or justification > > OK, I rewrote them. Hopefully it's better now. Thanks for new version and answers. >> * Transaction stuff strikes me as overcomplicated and error prone. >> Given there is no explanation or justification for most of it, I’d be >> inclined to question why its there > > Gosh, I was pretty pleased with how simple the transaction integration > turned out to be. Most of what's there right now is either (a) code > to copy state from the master to the parallel workers or (b) code to > throw errors if the workers try to things that aren't safe. I suspect > there are a few things missing, but I don't see anything there that > looks unnecessary. If you can explain it in more detail in comments and README, I may agree. At present, I don't get it and it makes me nervous. The comment says "Only the top frame of the transaction state stack is copied to a parallel worker" but I'm not sure why. Top meaning the current subxact or the main xact? If main, why are do we need XactTopTransactionId >> This is very impressive, but it looks like we’re trying to lift too >> much weight on the first lift. If we want to go anywhere near this, we >> need to have very clear documentation about how and why its like that. >> I’m actually very sorry to say that because the review started very >> well, much much better than most. > > When I posted the group locking patch, it got criticized because it > didn't actually do anything useful by itself; similarly, the > pg_background patch was criticized for not being a large enough step > toward parallelism. So, this time, I posted something more > comprehensive. I don't think it's quite complete yet. I expect a > committable version of this patch to be maybe another 500-1000 lines > over what I have here right now -- I think it needs to do something > about heavyweight locking, and I expect that there are some unsafe > things that aren't quite prohibited yet. But the current patch is > only 2300 lines, which is not astonishingly large for a feature of > this magnitude; if anything, I'd say it's surprisingly small, due to a > year and a half of effort laying the necessary groundwork via long > series of preliminary commits. I'm not unwilling to divide this up > some more if we can agree on a way to do that that makes sense, but I > think we're nearing the point where we need to take the plunge and > say, look, this is version one of parallelism. Thunk. I want this also; the only debate is where to draw the line and please don't see that as criticism. I'm very happy it's so short, I agree it could be longer. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > If you can explain it in more detail in comments and README, I may > agree. At present, I don't get it and it makes me nervous. > > The comment says > "Only the top frame of the transaction state stack is copied to a parallel > worker" > but I'm not sure why. > > Top meaning the current subxact or the main xact? > If main, why are do we need XactTopTransactionId Current subxact. I initially thought of copying the entire TransactionStateData stack, but Heikki suggested (via IM) that I do it this way instead. I believe his concern was that it's never valid to commit or roll back to a subtransaction that is not at the top of the stack, and if you don't copy the stack, you avoid the risk of somehow ending up in that state. Also, you avoid having to invent resource owners for (sub)transactions that don't really exist in the current process. On the other hand, you do end up with a few special cases that wouldn't exist with the other approach. Still, I'm pretty happy to have taken Heikki's advice: it was certainly simple to implement this way, plus hopefully that way at least one person likes what I ended up with. :-) What else needs clarification? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6 January 2015 at 21:01, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 6, 2015 at 3:04 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> If you can explain it in more detail in comments and README, I may >> agree. At present, I don't get it and it makes me nervous. >> >> The comment says >> "Only the top frame of the transaction state stack is copied to a parallel >> worker" >> but I'm not sure why. >> >> Top meaning the current subxact or the main xact? >> If main, why are do we need XactTopTransactionId > > Current subxact. TopTransactionStateData points to the top-level transaction data, but XactTopTransactionId points to the current subxact. So when you say "Only the top frame of the transaction state stack is copied" you don't mean the top, you mean the bottom (the latest subxact)? Which then becomes the top in the parallel worker? OK... > I initially thought of copying the entire TransactionStateData stack, > but Heikki suggested (via IM) that I do it this way instead. I > believe his concern was that it's never valid to commit or roll back > to a subtransaction that is not at the top of the stack, and if you > don't copy the stack, you avoid the risk of somehow ending up in that > state. Also, you avoid having to invent resource owners for > (sub)transactions that don't really exist in the current process. On > the other hand, you do end up with a few special cases that wouldn't > exist with the other approach. Still, I'm pretty happy to have taken > Heikki's advice: it was certainly simple to implement this way, plus > hopefully that way at least one person likes what I ended up with. > :-) > > What else needs clarification? Those comments really belong in a README, not the first visible comment in xact.c You need to start with the explanation that parallel workers have a faked-up xact stack to make it easier to copy and manage. That is valid because we never change xact state during a worker operation. I get it now and agree, but please work some more on clarity of README, comments and variable naming! -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 1/6/15, 10:33 AM, Robert Haas wrote:
>> Entrypints?
> Already noted by Andres; fixed in the attached version.

Perhaps we only parallelize while drinking beer... ;)

CreateParallelContext(): Does it actually make sense to have nworkers=0? ISTM that's a bogus case. Also, since the number of workers will normally be determined dynamically by the planner, should that check be a regular conditional instead of just an Assert?

In LaunchParallelWorkers() the "Start Workers" comment states that we give up registering more workers if one fails to register, but there's nothing in the if condition to do that, and I don't see RegisterDynamicBackgroundWorker() doing it either. Is the comment just incorrect?

SerializeTransactionState(): instead of looping through the transaction stack to calculate nxids, couldn't we just set it to maxsize - sizeof(TransactionId) * 3? If we're looping a second time for safety a comment explaining that would be useful...

sequence.c: Is it safe to read a sequence value in a parallel worker if the cache_value is > 1?

This may be a dumb question, but for functions do we know that all pl's besides C and SQL use SPI? If not I think they could end up writing in a worker.

@@ -2968,7 +2969,8 @@ ProcessInterrupts(void)
 					 errmsg("canceling statement due to user request")));
 		}
 	}
-	/* If we get here, do nothing (probably, QueryCancelPending was reset) */
+	if (ParallelMessagePending)
+		HandleParallelMessageInterrupt(false);

ISTM it'd be good to leave that comment in place (after the if).

RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be &&? AIUI both should always be either set or 0.

Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a parallel operation");

The comment for RestoreSnapshot refers to set_transaction_snapshot, but that doesn't actually exist (or isn't referenced).

-- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On 6 January 2015 at 21:37, Simon Riggs <simon@2ndquadrant.com> wrote: > I get it now and agree Yes, very much. Should we copy both the top-level frame and the current subxid? Hot Standby links subxids directly to the top-level, so this would be similar. If we copied both, we wouldn't need to special case the Get functions. It would still be O(1). > but please work some more on clarity of > README, comments and variable naming! Something other than "top" sounds good. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: > So when you say "Only the top frame of the transaction state stack is > copied" you don't mean the top, you mean the bottom (the latest > subxact)? Which then becomes the top in the parallel worker? OK... The item most recently added to the stack is properly called the top, but I guess it's confusing in this case because the item on the bottom of the stack is referred to as the TopTransaction. I'll see if I can rephrase that. > Those comments really belong in a README, not the first visible > comment in xact.c OK. > You need to start with the explanation that parallel workers have a > faked-up xact stack to make it easier to copy and manage. That is > valid because we never change xact state during a worker operation. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 7 January 2015 at 13:11, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, Jan 6, 2015 at 4:37 PM, Simon Riggs <simon@2ndquadrant.com> wrote: >> So when you say "Only the top frame of the transaction state stack is >> copied" you don't mean the top, you mean the bottom (the latest >> subxact)? Which then becomes the top in the parallel worker? OK... > > The item most recently added to the stack is properly called the top, > but I guess it's confusing in this case because the item on the bottom > of the stack is referred to as the TopTransaction. I'll see if I can > rephrase that. Yes, I didn't mean to suggest the confusion was introduced by you - it's just you stepped into the mess by correctly using the word top in a place where its meaning would be opposite to the close-by usage of TopTransaction. Anyway, feeling good about this now. Thanks for your patience. -- Simon Riggs http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 6, 2015 at 9:37 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > CreateParallelContext(): Does it actually make sense to have nworkers=0? > ISTM that's a bogus case. I'm not sure whether we'll ever use the zero-worker case in production, but I've certainly found it useful for performance-testing. > Also, since the number of workers will normally be > determined dynamically by the planner, should that check be a regular > conditional instead of just an Assert? I don't think that's really necessary. It shouldn't be too much of a stretch for the planner to come up with a non-negative value. > In LaunchParallelWorkers() the "Start Workers" comment states that we give > up registering more workers if one fails to register, but there's nothing in > the if condition to do that, and I don't see > RegisterDynamicBackgroundWorker() doing it either. Is the comment just > incorrect? Woops, that got changed at some point and I forgot to update the comment. Will fix. > SerializeTransactionState(): instead of looping through the transaction > stack to calculate nxids, couldn't we just set it to maxsize - > sizeof(TransactionId) * 3? If we're looping a second time for safety a > comment explaining that would be useful... Yeah, I guess we could do that. I don't think it really matters very much one way or the other. > sequence.c: Is it safe to read a sequence value in a parallel worker if the > cache_value is > 1? No, because the sequence cache isn't synchronized between the workers. Maybe it would be safe if cache_value == 1, but there's not much use-case: how often are you going to have a read-only query that uses a sequence value. At some point we can look at making sequences parallel-safe, but worrying about it right now doesn't seem like a good use of time. > This may be a dumb question, but for functions do we know that all pl's > besides C and SQL use SPI? If not I think they could end up writing in a > worker. Well, the lower-level checks would catch that. But it is generally true that there's no way to prevent arbitrary C code from doing things that are unsafe in parallel mode and that we can't tell are unsafe. As I've said before, I think that we'll need to have a method of labeling functions as parallel-safe or not, but this patch isn't trying to solve that part of the problem. > @@ -2968,7 +2969,8 @@ ProcessInterrupts(void) > errmsg("canceling statement due to > user request"))); > } > } > - /* If we get here, do nothing (probably, QueryCancelPending was > reset) */ > + if (ParallelMessagePending) > + HandleParallelMessageInterrupt(false); > ISTM it'd be good to leave that comment in place (after the if). > > RestoreComboCIDState(): Assert(!comboCids || !comboHash): Shouldn't that be > &&? AIUI both should always be either set or 0. Fixed. > Typo: + elog(ERROR, "cannot update SecondarySnapshpt during a > parallel operation"); Fixed. > The comment for RestoreSnapshot refers to set_transaction_snapshot, but that > doesn't actually exist (or isn't referenced). Fixed. Will post a new version in a bit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 7, 2015 at 10:34 AM, Simon Riggs <simon@2ndquadrant.com> wrote: > Yes, I didn't mean to suggest the confusion was introduced by you - > it's just you stepped into the mess by correctly using the word top in > a place where its meaning would be opposite to the close-by usage of > TopTransaction. > > Anyway, feeling good about this now. Thanks for your patience. Thanks for the kind words, and the review. Here's a new version with a much-expanded README and some other changes to the comments to hopefully address some of your other concerns. I also fixed a couple of other problems while I was at it: the comments in xact.c claimed that parallel shutdown waits for workers to exit, but parallel.c didn't know anything about that. Also, I fixed it so that when an error is propagated from the parallel worker to the user backend, the error context in effect at the time the parallel context was created is used, rather than whatever is in effect when we notice the error. I have little doubt that this version is still afflicted with various bugs, and the heavyweight locking issue remains to be dealt with, but on the whole I think this is headed in the right direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 1/7/15, 9:39 AM, Robert Haas wrote: >> sequence.c: Is it safe to read a sequence value in a parallel worker if the >> >cache_value is > 1? > No, because the sequence cache isn't synchronized between the workers. > Maybe it would be safe if cache_value == 1, but there's not much > use-case: how often are you going to have a read-only query that uses > a sequence value. At some point we can look at making sequences > parallel-safe, but worrying about it right now doesn't seem like a > good use of time. Agreed, I was more concerned with calls to nextval(), which don't seem to be prevented in parallel mode? >> >This may be a dumb question, but for functions do we know that all pl's >> >besides C and SQL use SPI? If not I think they could end up writing in a >> >worker. > Well, the lower-level checks would catch that. But it is generally > true that there's no way to prevent arbitrary C code from doing things > that are unsafe in parallel mode and that we can't tell are unsafe. > As I've said before, I think that we'll need to have a method of > labeling functions as parallel-safe or not, but this patch isn't > trying to solve that part of the problem. I was more thinking about all the add-on pl's like pl/ruby. But yeah, I don't see that there's much we can do if they're not using SPI. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Wed, Jan 7, 2015 at 11:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I have little doubt that this version is still afflicted with various
> bugs, and the heavyweight locking issue remains to be dealt with, but
> on the whole I think this is headed in the right direction.
>
+ParallelMain(Datum main_arg)
{
..
+ /*
+ * Now that we have a resource owner, we can attach to the dynamic
+ * shared memory segment and read the table of contents.
+ */
+ seg = dsm_attach(DatumGetInt32(main_arg));
Here, I think DatumGetUInt32() needs to be used instead of
DatumGetInt32() as the segment handle is uint32.
On Wed, Jan 7, 2015 at 3:54 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> Agreed, I was more concerned with calls to nextval(), which don't seem to be
> prevented in parallel mode?

It looks prevented:

	/*
	 * Forbid this during parallel operation because, to make it work,
	 * the cooperating backends would need to share the backend-local cached
	 * sequence information.  Currently, we don't support that.
	 */
	PreventCommandIfParallelMode("nextval()");

> I was more thinking about all the add-on pl's like pl/ruby. But yeah, I
> don't see that there's much we can do if they're not using SPI.

Actually, there is: it's forbidden at the layer of heap_insert(), heap_update(), heap_delete(). It's hard to imagine anyone trying to modify the database at a lower level than that.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > + seg = dsm_attach(DatumGetInt32(main_arg)); > > Here, I think DatumGetUInt32() needs to be used instead of > DatumGetInt32() as the segment handle is uint32. OK, I'll change that in the next version. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > + seg = dsm_attach(DatumGetInt32(main_arg));
> >
> > Here, I think DatumGetUInt32() needs to be used instead of
> > DatumGetInt32() as the segment handle is uint32.
>
> OK, I'll change that in the next version.
>
No issues, I have another question related to below code:
+HandleParallelMessages(void)
+{
..
..
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ /*
+ * Read messages for as long as we have an error queue; if we
+ * have hit (or hit while reading) ReadyForQuery, this will go to
+ * NULL.
+ */
+ while (pcxt->worker[i].error_mqh != NULL)
+ {
+ shm_mq_result res;
+
+ CHECK_FOR_INTERRUPTS();
+
+ res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
+ &data, true);
+ if (res == SHM_MQ_SUCCESS)
Here we are checking the error queue for all the workers, and this loop
will continue until all have sent the ReadyForQuery() message ('Z'), which
will make this loop continue till all workers have finished their work.
Assume a situation where the first worker has completed the work and sent
the 'Z' message and the second worker is still sending some tuples; now the above
code will keep on waiting for the 'Z' message from the second worker and won't
allow tuples sent by the second worker to be received till it sends its 'Z' message.
As each worker sends its own 'Z' message after completion, ideally
the above code should receive the message only for the worker which has
sent the message. I think for that it needs information about which worker has
sent the message.
On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 13, 2015 at 1:33 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Jan 8, 2015 at 6:52 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > + seg = dsm_attach(DatumGetInt32(main_arg)); >> > >> > Here, I think DatumGetUInt32() needs to be used instead of >> > DatumGetInt32() as the segment handle is uint32. >> >> OK, I'll change that in the next version. >> > > No issues, I have another question related to below code: > > +HandleParallelMessages(void) > +{ > .. > .. > + for (i = 0; i < pcxt->nworkers; ++i) > + { > + /* > + * Read messages for as long as we have an error queue; if we > + * have hit (or hit while reading) ReadyForQuery, this will go to > + * NULL. > + */ > + while (pcxt->worker[i].error_mqh != NULL) > + { > + shm_mq_result res; > + > + CHECK_FOR_INTERRUPTS(); > + > + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, > + &data, true); > + if (res == SHM_MQ_SUCCESS) > > Here we are checking the error queue for all the workers and this loop > will continue untill all have sent ReadyForQuery() message ('Z') which > will make this loop continue till all workers have finished their work. > Assume situation where first worker has completed the work and sent > 'Z' message and second worker is still sending some tuples, now above > code will keep on waiting for 'Z' message from second worker and won't > allow to receive tuples sent by second worker till it send 'Z' message. > > As each worker send its own 'Z' message after completion, so ideally > the above code should receive the message only for worker which has > sent the message. I think for that it needs worker information who has > sent the message. Are you talking about HandleParallelMessages() or WaitForParallelWorkersToFinish()? The former doesn't wait for anything; it just handles any messages that are available now. The latter does wait for all workers to finish, but the intention is that you only call it when you're ready to wind up the entire parallel operation, so that's OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > +HandleParallelMessages(void)
> > +{
> > ..
> > ..
> > + for (i = 0; i < pcxt->nworkers; ++i)
> > + {
> > + /*
> > + * Read messages for as long as we have an error queue; if we
> > + * have hit (or hit while reading) ReadyForQuery, this will go to
> > + * NULL.
> > + */
> > + while (pcxt->worker[i].error_mqh != NULL)
> > + {
> > + shm_mq_result res;
> > +
> > + CHECK_FOR_INTERRUPTS();
> > +
> > + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
> > + &data, true);
> > + if (res == SHM_MQ_SUCCESS)
> >
> > Here we are checking the error queue for all the workers and this loop
> > will continue untill all have sent ReadyForQuery() message ('Z') which
> > will make this loop continue till all workers have finished their work.
> > Assume situation where first worker has completed the work and sent
> > 'Z' message and second worker is still sending some tuples, now above
> > code will keep on waiting for 'Z' message from second worker and won't
> > allow to receive tuples sent by second worker till it send 'Z' message.
> >
> > As each worker send its own 'Z' message after completion, so ideally
> > the above code should receive the message only for worker which has
> > sent the message. I think for that it needs worker information who has
> > sent the message.
>
> Are you talking about HandleParallelMessages() or
> WaitForParallelWorkersToFinish()? The former doesn't wait for
> anything; it just handles any messages that are available now.
>
I am talking about HandleParallelMessages(). It doesn't wait, but
it loops, which will make it run for a longer time as explained
above. Just imagine a case where there are two workers: the first
worker has sent its 'Z' message and the second worker is still doing some
work; in such a scenario the loop will not finish until the second worker
also sends a 'Z' message or an error. Am I missing something?
On Thu, Jan 15, 2015 at 9:09 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Thu, Jan 15, 2015 at 6:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Thu, Jan 15, 2015 at 7:00 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > +HandleParallelMessages(void) >> > +{ >> > .. >> > .. >> > + for (i = 0; i < pcxt->nworkers; ++i) >> > + { >> > + /* >> > + * Read messages for as long as we have an error queue; if we >> > + * have hit (or hit while reading) ReadyForQuery, this will go to >> > + * NULL. >> > + */ >> > + while (pcxt->worker[i].error_mqh != NULL) >> > + { >> > + shm_mq_result res; >> > + >> > + CHECK_FOR_INTERRUPTS(); >> > + >> > + res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes, >> > + &data, true); >> > + if (res == SHM_MQ_SUCCESS) >> > >> > Here we are checking the error queue for all the workers and this loop >> > will continue untill all have sent ReadyForQuery() message ('Z') which >> > will make this loop continue till all workers have finished their work. >> > Assume situation where first worker has completed the work and sent >> > 'Z' message and second worker is still sending some tuples, now above >> > code will keep on waiting for 'Z' message from second worker and won't >> > allow to receive tuples sent by second worker till it send 'Z' message. >> > >> > As each worker send its own 'Z' message after completion, so ideally >> > the above code should receive the message only for worker which has >> > sent the message. I think for that it needs worker information who has >> > sent the message. >> >> Are you talking about HandleParallelMessages() or >> WaitForParallelWorkersToFinish()? The former doesn't wait for >> anything; it just handles any messages that are available now. > > I am talking about HandleParallelMessages(). It doesn't wait but > it is looping which will make it run for longer time as explained > above. Just imagine a case where there are two workers and first > worker has sent 'Z' message and second worker is doing some > work, now in such a scenario loop will not finish until second worker > also send 'Z' message or error. Am I missing something? Blah. You're right. I intended to write this loop so that it only runs until shm_mq_receive() returns SHM_MQ_WOULD_BLOCK. But that's not what I did. Will fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
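For illustration, the loop Robert describes would presumably stop polling a worker as soon as shm_mq_receive() reports SHM_MQ_WOULD_BLOCK instead of looping until the queue goes away. A minimal sketch of that shape (the HandleParallelMessage() helper and the exact bookkeeping are assumptions for illustration, not code from the patch):

    for (i = 0; i < pcxt->nworkers; ++i)
    {
        /* Drain only what is already queued; never wait on a single worker. */
        while (pcxt->worker[i].error_mqh != NULL)
        {
            shm_mq_result res;
            Size        nbytes;
            void       *data;

            CHECK_FOR_INTERRUPTS();

            res = shm_mq_receive(pcxt->worker[i].error_mqh, &nbytes,
                                 &data, true);      /* nowait = true */
            if (res == SHM_MQ_WOULD_BLOCK)
                break;              /* nothing more from this worker right now */
            else if (res == SHM_MQ_SUCCESS)
                HandleParallelMessage(pcxt, i, data, nbytes);   /* assumed helper */
            else
                pcxt->worker[i].error_mqh = NULL;   /* worker detached */
        }
    }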
On 2015-01-05 11:27:57 -0500, Robert Haas wrote: > On Sat, Jan 3, 2015 at 7:31 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > * I wonder if parallel contexts shouldn't be tracked via resowners > > That is a good question. I confess that I'm somewhat fuzzy about > which things should be tracked via the resowner mechanism vs. which > things should have their own bespoke bookkeeping. However, the > AtEOXact_Parallel() stuff happens well before ResourceOwnerRelease(), > which makes merging them seem not particularly clean. I'm not sure either. But I think the current location is wrong anyway - during AtEOXact_Parallel() before running user defined queries via AfterTriggerFireDeferred() seems wrong. > > * combocid.c should error out in parallel mode, as insurance > > Eh, which function? HeapTupleHeaderGetCmin(), > HeapTupleHeaderGetCmax(), and AtEOXact_ComboCid() are intended to work > in parallel mode. HeapTupleHeaderAdjustCmax() could > Assert(!IsInParallelMode()) but the only calls are in heap_update() > and heap_delete() which already have error checks, so putting yet > another elog() there seems like overkill. To me it seems like a good idea, but whatever. > > * I doubt it's a good idea to allow heap_insert, heap_inplace_update, > > index_insert. I'm not convinced that those will be handled correct and > > relaxing restrictions is easier than adding them. > > I'm fine with adding checks to heap_insert() and > heap_inplace_update(). I'm not sure we really need to add anything to > index_insert(); how are we going to get there without hitting some > other prohibited operation first? I'm not sure. But it's not that hard to imagine that somebody will start adding codepaths that insert into indexes first. Think upsert. > > * I'd much rather separate HandleParallelMessageInterrupt() in one > > variant that just tells the machinery there's a interrupt (called from > > signal handlers) and one that actually handles them. We shouldn't even > > consider adding any new code that does allocations, errors and such in > > signal handlers. That's just a *bad* idea. > > That's a nice idea, but I didn't invent the way this crap works today. > ImmediateInterruptOK gets set to true while performing a lock wait, > and we need to be able to respond to errors while in that state. I > think there's got to be a better way to handle that than force every > asynchronous operation to have to cope with the fact that > ImmediateInterruptOK may be set or not set, but as of today that's > what you have to do. I personally think it's absolutely unacceptable to add any more of that. That the current mess works is more luck than anything else - and I'm pretty sure there's several bugs in it. But since I realize I can't force you to develop a alternative solution, I tried to implement enough infrastructure to deal with it without too much work. As far as I can see this could relatively easily be implemented ontop of the removal of ImmediateInterruptOK in the patch series I posted yesterday? Afaics you just need to remove most of +HandleParallelMessageInterrupt() and replace it by a single SetLatch(). > > * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd > > much rather place a struct there and be careful about not using > > pointers. That also would obliviate the need to reserve some ids. > > I don't see how that's going to work with variable-size data > structures, and a bunch of the things that we need to serialize are > variable-size. Meh. 
Just appending the variable data to the end of the structure and calculating offsets works just fine. > Moreover, I'm not really interested in rehashing this > design again. I know it's not your favorite; you've said that before. Well, if I keep having to read code using it, I'll keep being annoyed by it. I guess we both have to live with that. Greetings, Andres Freund
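As an aside, the struct-plus-offsets layout Andres alludes to usually looks something like the sketch below: a fixed header at the start of the segment, with variable-length state appended after it and addressed by offsets so nothing stored in shared memory depends on pointer values. All names here are invented for illustration.

    typedef struct FixedParallelHeader      /* hypothetical layout */
    {
        Oid     database_id;
        Oid     authenticated_user_id;
        Size    guc_state_offset;           /* offset from segment start */
        Size    guc_state_len;
        Size    snapshot_offset;
        Size    snapshot_len;
    } FixedParallelHeader;

    /* Worker side: locate a chunk without storing any pointers in shmem. */
    FixedParallelHeader *hdr = (FixedParallelHeader *) dsm_segment_address(seg);
    char *gucstate = ((char *) hdr) + hdr->guc_state_offset;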
On Fri, Jan 16, 2015 at 8:05 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I'm not sure either. But I think the current location is wrong anyway - > during AtEOXact_Parallel() before running user defined queries via > AfterTriggerFireDeferred() seems wrong. Good point. >> I'm fine with adding checks to heap_insert() and >> heap_inplace_update(). I'm not sure we really need to add anything to >> index_insert(); how are we going to get there without hitting some >> other prohibited operation first? > > I'm not sure. But it's not that hard to imagine that somebody will start > adding codepaths that insert into indexes first. Think upsert. OK, but what's the specific reason it's unsafe? The motivation for prohibiting update and delete is that, if a new combo CID were to be created mid-scan, we have no way to make other workers aware of it. There's no special reason to think that heap_insert() or heap_inplace_update() are unsafe, but, sure, we can prohibit them for symmetry. If we're going to start extending the net further and further, we should have specific comments explaining what the hazards. People will eventually want to relax these restrictions, I think, and there's nothing scarier than removing a prohibition that has absolutely no comments to explain why that thing was restricted in the first place. > As far as I can see this could relatively easily be implemented ontop of > the removal of ImmediateInterruptOK in the patch series I posted > yesterday? Afaics you just need to remove most of > +HandleParallelMessageInterrupt() and replace it by a single SetLatch(). That would be swell. I'll have a look, when I have time, or when it's committed, whichever happens first. >> > * I'm not a fan of the shm_toc stuff... Verbose and hard to read. I'd >> > much rather place a struct there and be careful about not using >> > pointers. That also would obliviate the need to reserve some ids. >> >> I don't see how that's going to work with variable-size data >> structures, and a bunch of the things that we need to serialize are >> variable-size. > > Meh. Just appending the variable data to the end of the structure and > calculating offsets works just fine. I think coding it all up ad-hoc would be pretty bug-prone. This provides some structure, where each module only needs to know about its own serialization format. To me, that's a lot easier to work with. New patch attached. I'm going to take the risk of calling this v1 (previous versions have been 0.x), since I've now done something about the heavyweight locking issue, as well as fixed the message-looping bug Amit pointed out. It doubtless needs more work, but it's starting to smell a bit more like a real patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
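For readers unfamiliar with the shm_toc mechanism being debated above, the modular pattern Robert describes looks roughly like this: each module reserves space under its own key while the segment is being sized, fills that chunk in on the leader side, and looks it up by key in the worker. The key value and the Estimate/Serialize/Restore helpers are placeholders, and the estimator/toc fields are assumed to be exposed by the patch's parallel context.

    #define PARALLEL_KEY_MYSTATE    UINT64CONST(0xFF00000000000001)  /* made up */

    /* Leader, while estimating the segment size: */
    shm_toc_estimate_chunk(&pcxt->estimator, EstimateMyStateSpace());
    shm_toc_estimate_keys(&pcxt->estimator, 1);

    /* Leader, once the segment exists: */
    mystate = shm_toc_allocate(pcxt->toc, EstimateMyStateSpace());
    SerializeMyState(mystate);
    shm_toc_insert(pcxt->toc, PARALLEL_KEY_MYSTATE, mystate);

    /* Worker: find the chunk by key, knowing nothing about other modules. */
    mystate = shm_toc_lookup(toc, PARALLEL_KEY_MYSTATE);
    RestoreMyState(mystate);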
On Sat, Jan 17, 2015 at 3:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> New patch attached. I'm going to take the risk of calling this v1
> (previous versions have been 0.x), since I've now done something about
> the heavyweight locking issue, as well as fixed the message-looping
> bug Amit pointed out. It doubtless needs more work, but it's starting
> to smell a bit more like a real patch.
>
I need some clarification regarding the code below:
+BgwHandleStatus
+WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)
+{
+    BgwHandleStatus status;
+    int         rc;
+    bool        save_set_latch_on_sigusr1;
+
+    save_set_latch_on_sigusr1 = set_latch_on_sigusr1;
+    set_latch_on_sigusr1 = true;
+
+    PG_TRY();
+    {
+        for (;;)
+        {
+            pid_t       pid;
+
+            CHECK_FOR_INTERRUPTS();
+
+            status = GetBackgroundWorkerPid(handle, &pid);
+            if (status == BGWH_STOPPED)
+                return status;
+
+            rc = WaitLatch(&MyProc->procLatch,
+                           WL_LATCH_SET | WL_POSTMASTER_DEATH, 0);
+
+            if (rc & WL_POSTMASTER_DEATH)
+                return BGWH_POSTMASTER_DIED;
It seems this code can wait forever.
Assume one of the workers is not able to start (not able to attach
to shared memory, or for some other reason); then the status returned by
GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
and after that it can wait forever in WaitLatch().
On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever. > Assume one of the worker is not able to start (not able to attach > to shared memory or some other reason), then status returned by > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED > and after that it can wait forever in WaitLatch. I don't think that's right. The status only remains BGWH_NOT_YET_STARTED until the postmaster forks the child process. At that point it immediately changes to BGWH_STARTED. If it starts up and then dies early on, for example because it can't attach to shared memory or somesuch, the status will change to BGWH_STOPPED. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait forever.
> > Assume one of the worker is not able to start (not able to attach
> > to shared memory or some other reason), then status returned by
> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
> > and after that it can wait forever in WaitLatch.
>
> I don't think that's right. The status only remains
> BGWH_NOT_YET_STARTED until the postmaster forks the child process.
>
I think the control flow can reach the above location before the
postmaster has forked all the workers. Consider a case where
we have launched all the workers during the ExecutorStart phase,
and then, before the postmaster could start all of them, an error
occurs in the master backend, which then tries to abort the transaction
and destroy the parallel context; at that moment it will get the
above status and wait forever in the above code.
I am able to reproduce the above scenario with a debugger by using
the parallel_seqscan patch.
On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com> >> wrote: >> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait >> > forever. >> > Assume one of the worker is not able to start (not able to attach >> > to shared memory or some other reason), then status returned by >> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED >> > and after that it can wait forever in WaitLatch. >> >> I don't think that's right. The status only remains >> BGWH_NOT_YET_STARTED until the postmaster forks the child process. > > I think the control flow can reach the above location before > postmaster could fork all the workers. Consider a case that > we have launched all workers during ExecutorStart phase > and then before postmaster could start all workers an error > occurs in master backend and then it try to Abort the transaction > and destroy the parallel context, at that moment it will get the > above status and wait forever in above code. > > I am able to reproduce above scenario with debugger by using > parallel_seqscan patch. Hmm. Well, if you can reproduce it, there clearly must be a bug. But I'm not quite sure where. What should happen in that case is that the process that started the worker has to wait for the postmaster to actually start it, and then after that for the new process to die, and then it should return. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jan 21, 2015 at 6:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Jan 21, 2015 at 2:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Tue, Jan 20, 2015 at 9:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> On Tue, Jan 20, 2015 at 9:41 AM, Amit Kapila <amit.kapila16@gmail.com>
> >> wrote:
> >> > It seems [WaitForBackgroundWorkerShutdown] has possibility to wait
> >> > forever.
> >> > Assume one of the worker is not able to start (not able to attach
> >> > to shared memory or some other reason), then status returned by
> >> > GetBackgroundWorkerPid() will be BGWH_NOT_YET_STARTED
> >> > and after that it can wait forever in WaitLatch.
> >>
> >> I don't think that's right. The status only remains
> >> BGWH_NOT_YET_STARTED until the postmaster forks the child process.
> >
> > I think the control flow can reach the above location before
> > postmaster could fork all the workers. Consider a case that
> > we have launched all workers during ExecutorStart phase
> > and then before postmaster could start all workers an error
> > occurs in master backend and then it try to Abort the transaction
> > and destroy the parallel context, at that moment it will get the
> > above status and wait forever in above code.
> >
> > I am able to reproduce above scenario with debugger by using
> > parallel_seqscan patch.
>
> Hmm. Well, if you can reproduce it, there clearly must be a bug. But
> I'm not quite sure where. What should happen in that case is that the
> process that started the worker has to wait for the postmaster to
> actually start it,
>
Okay, I think this should solve the issue; also, it should be done
before the call to TerminateBackgroundWorker().
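A sketch of the ordering Amit suggests: first make sure the postmaster has actually acted on the registration, so the shutdown wait can never observe BGWH_NOT_YET_STARTED indefinitely. This only illustrates the idea; it is not the fix that was eventually applied.

    pid_t           pid;
    BgwHandleStatus status;

    /* Wait until the postmaster has started (or failed to start) the worker. */
    status = WaitForBackgroundWorkerStartup(handle, &pid);
    if (status != BGWH_POSTMASTER_DIED)
    {
        TerminateBackgroundWorker(handle);          /* ask it to exit... */
        WaitForBackgroundWorkerShutdown(handle);    /* ...then wait for it */
    }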
On Fri, Jan 16, 2015 at 5:10 PM, Robert Haas <robertmhaas@gmail.com> wrote: > New patch attached. I'm going to take the risk of calling this v1 > (previous versions have been 0.x), since I've now done something about > the heavyweight locking issue, as well as fixed the message-looping > bug Amit pointed out. It doubtless needs more work, but it's starting > to smell a bit more like a real patch. Here's a new version. Andres mentioned previously that he thought it would be a good idea to commit the addition of BackgroundWorkerInitializeConnectionByOid() separately, as he's had cause to want it independently of the rest of this stuff. It would be useful for pg_background, too. So I've broken that out into a separate patch here (bgworker-by-oid.patch) and will commit that RSN unless somebody thinks it's a bad idea for some reason. AFAICS it should be uncontroversial. The main patch (parallel-mode-v2.patch) has evolved just a bit more since the previous version. It now arranges for all libraries libraries which we'd dynamically loaded into the original process to be loaded into the worker as well, which fixes a possible failure when those libraries define custom GUCs that need to be properly restored. I'm wondering how much more there really is to do here. From the parallel sequential scan discussion, and off-list conversations with Amit, it's becoming clear to me that there are a bunch of things that are generic to query planning and execution that this patch doesn't currently consider, so Amit's handling them in the parallel sequential scan patch. It seems likely that some of that stuff should be pulled out of that patch and formed into some more generic infrastructure. But I don't favor putting that stuff in this patch, because there are already useful things you can do with what I've got here. If you want to write a parallel version of some SQL-callable function, for example, you can do that without anything more than this. You could probably also parallelize utility commands with just this infrastructure. The stuff that may need to be pulled out of Amit's patch is generic to query planning and execution, but probably not for other applications of server-side parallelism. That seems like a relatively coherent place to draw a line. So I'm feeling like it might be just as well to push this much into the tree and move on to the next set of problems. Of course, I don't want to preempt anyone's desire to review and comment on this further, either. The final patch attached her (parallel-dummy-v2.patch) has been updated slightly to incorporate some prefetching logic. It's still just demo code and is not intended for commit. I'm not sure whether the prefetching logic can actually be made to improve performance, either; if anyone feels like playing with that and reporting results back, that would be swell. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 1/30/15 11:08 AM, Robert Haas wrote: > The final patch attached her (parallel-dummy-v2.patch) has been > updated slightly to incorporate some prefetching logic. It's still > just demo code and is not intended for commit. I'm not sure whether > the prefetching logic can actually be made to improve performance, > either; if anyone feels like playing with that and reporting results > back, that would be swell. Wouldn't we want the prefetching to happen after ReadBuffer() instead of before? This way we risk pushing the buffer we actually want out, plus if it's not already available the OS probably won't try and read it until it's read all the prefetch blocks. -- Jim Nasby, Data Architect, Blue Treble Consulting Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Jan 30, 2015 at 6:38 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote: > On 1/30/15 11:08 AM, Robert Haas wrote: >> The final patch attached her (parallel-dummy-v2.patch) has been >> updated slightly to incorporate some prefetching logic. It's still >> just demo code and is not intended for commit. I'm not sure whether >> the prefetching logic can actually be made to improve performance, >> either; if anyone feels like playing with that and reporting results >> back, that would be swell. > > Wouldn't we want the prefetching to happen after ReadBuffer() instead of > before? This way we risk pushing the buffer we actually want out, plus if > it's not already available the OS probably won't try and read it until it's > read all the prefetch blocks. Please feel free to try it both ways and let me know what you find out. We've had a decent amount of commentary on this patch and the parallel sequential scan patch, but AFAICT, not much actual testing. The best way to find out whether your proposal would be an improvement is to try it. I have my own theory based on the testing I've done so far, but what would be better than a theory is some evidence, preferably collected by diverse users on diverse systems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
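For anyone who wants to run the experiment Robert suggests, the two orderings under discussion differ only in where the PrefetchBuffer() call sits relative to ReadBuffer(). A stripped-down sketch (the PREFETCH_DISTANCE constant and the loop shape are assumptions, not the demo patch's actual code):

    BlockNumber blkno;
    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);
    Buffer      buf;

    /* Variant A: prefetch a block some distance ahead, then read. */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (blkno + PREFETCH_DISTANCE < nblocks)
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno + PREFETCH_DISTANCE);
        buf = ReadBuffer(rel, blkno);
        /* ... examine the page ... */
        ReleaseBuffer(buf);
    }

    /* Variant B (Jim's suggestion): read first, then issue the prefetch. */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        buf = ReadBuffer(rel, blkno);
        if (blkno + PREFETCH_DISTANCE < nblocks)
            PrefetchBuffer(rel, MAIN_FORKNUM, blkno + PREFETCH_DISTANCE);
        /* ... examine the page ... */
        ReleaseBuffer(buf);
    }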
On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Here's a new version. Andres mentioned previously that he thought it > would be a good idea to commit the addition of > BackgroundWorkerInitializeConnectionByOid() separately, as he's had > cause to want it independently of the rest of this stuff. It would be > useful for pg_background, too. So I've broken that out into a > separate patch here (bgworker-by-oid.patch) and will commit that RSN > unless somebody thinks it's a bad idea for some reason. AFAICS it > should be uncontroversial. This is now done. The main patch needed some updating in light of Andres's recent assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so here's a new version. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
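For context, the helper committed separately here lets a worker connect using OIDs captured by the leader instead of looking names up again. A hedged sketch of the intended usage, assuming the two-argument signature being committed at the time; the setup struct and entry-point name are invented:

    void
    my_worker_main(Datum main_arg)              /* hypothetical entry point */
    {
        dsm_segment   *seg;
        MyWorkerSetup *setup;                   /* assumed leader-built struct */

        BackgroundWorkerUnblockSignals();
        CurrentResourceOwner = ResourceOwnerCreate(NULL, "my worker");

        seg = dsm_attach(DatumGetUInt32(main_arg));
        setup = (MyWorkerSetup *) dsm_segment_address(seg);

        /* Reconnect to exactly the leader's database and role, by OID. */
        BackgroundWorkerInitializeConnectionByOid(setup->database_id,
                                                  setup->authenticated_user_id);
        /* ... start a transaction and do the real work ... */
    }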
On Wed, Feb 4, 2015 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Jan 30, 2015 at 12:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > Here's a new version. Andres mentioned previously that he thought it
> > would be a good idea to commit the addition of
> > BackgroundWorkerInitializeConnectionByOid() separately, as he's had
> > cause to want it independently of the rest of this stuff. It would be
> > useful for pg_background, too. So I've broken that out into a
> > separate patch here (bgworker-by-oid.patch) and will commit that RSN
> > unless somebody thinks it's a bad idea for some reason. AFAICS it
> > should be uncontroversial.
>
> This is now done.
>
> The main patch needed some updating in light of Andres's recent
> assault on ImmediateInterruptOK (final result: Andres 1-0 IIOK) so
> here's a new version.
>
I think we should expose the variable ParallelWorkerNumber (or, if you don't
want to expose it, then at least a GetParallel* function call is required to get
its value), as that is needed by external applications wherever
they want to allocate something for each worker. Some examples w.r.t. the
parallel seq scan patch: each worker should have a separate tuple
queue, and we might also need it for the implementation of the Explain
statement.
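To make the request concrete, the sort of usage Amit has in mind presumably looks like this on the worker side: index into leader-created per-worker resources by the worker's own number. The layout (queue_space, QUEUE_SIZE) and the exact way the variable is exported are assumptions for illustration.

    /* Assuming the patch exports: extern int ParallelWorkerNumber; */
    shm_mq         *mq;
    shm_mq_handle  *mqh;

    /* Pick the tuple queue the leader laid out for this worker's slot. */
    mq = (shm_mq *) (queue_space + ParallelWorkerNumber * QUEUE_SIZE);
    shm_mq_set_sender(mq, MyProc);
    mqh = shm_mq_attach(mq, seg, NULL);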
On Wed, Feb 4, 2015 at 10:47 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think we should expose variable ParallelWorkerNumber (or if you don't > want to expose it then atleast GetParallel* function call is required to get > the value of same), as that is needed for external applications wherever > they want to allocate something for each worker, some examples w.r.t > parallel seq scan patch are each worker should have separate tuple > queue and probably for implementation of Explain statement also we > might need it. Oh, right. You asked for that before, and I made the variable itself non-static, but didn't add the prototype to the header file. Oops. Here's v4, with that fixed and a few more tweaks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi, On 2015-02-06 22:43:21 -0500, Robert Haas wrote: > Here's v4, with that fixed and a few more tweaks. If you attached files generated with 'git format-patch' I could directly apply then with the commit message and such. All at once if it's mutliple patches, as individual commits. On nontrivial patches it's nice to see the commit message(s) along the diff(s). Observations: * Some tailing whitespace in the readme. Very nice otherwise. * Don't like CreateParallelContextForExtension much as a function name - I don't think we normally equate the fact that codeis located in a loadable library with the extension mechanism. * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's that about? * the plain shm calculations intentionally use mul_size/add_size to deal with overflows. On 32bit it doesn't seem impossible,but unlikely to overflow size_t. * I'd s/very // i "We might be running in a very short-lived memory context.". Or replace it with "too". * In +LaunchParallelWorkers, does it make sense trying to start workers if one failed? ISTM that's likely to not be helpful.I.e. it should just break; after the first failure. * +WaitForParallelWorkersToFinish says that it waits for workers to exit cleanly. To me that's ambiguous. How about "fully"? * ParallelMain restores libraries before GUC state. Given that librararies can, and actually somewhat frequently do, inspectGUCs on load, it seems better to do it the other way round? You argue "We want to do this before restoring GUCs, becausethe libraries might define custom variables.", but I don't buy that. It's completely normal for namespaced GUCs tobe present before a library is loaded? Especially as you then go and process_session_preload_libraries() after settingthe GUCs. * Should ParallelMain maybe enter the parallel context before loading user defined libraries? It's far from absurd to executecode touching the database on library initialization... * rename ParallelMain to ParallelWorkerMain? * I think restoring snapshots needs to fudge the worker's PGXACT->xmin to be the minimum of the top transaction id and thesnapshots. Otherwise if the initiating process dies badly (e.g. because postmaster died) the workers will continue towork, while other processes may remove things. Also, you don't seem to prohibit popping the active snapshot (should thatbe prohibitted maybe?) which might bump the initiator's xmin horizon. * Comment about 'signal_handler' in +HandleParallelMessageInterrupt is outdated. * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker exited cleanly'? Shouldn't it at least be T/Terminate?I'm generally not sure it's wise to use this faux FE/BE protocol here... * HandlingParallelMessages looks dead. * ParallelErrorContext has the wrong comment. * ParallelErrorContext() provides the worker's pid in the context message. I guess that's so it's easier to interpret whensent to the initiator? It'll look odd when logged by the failing process. * We now have pretty much the same handle_sigterm in a bunch of places. Can't we get rid of those? You actually probablycan just use die(). * The comments in xact.c above XactTopTransactionId pretty much assume that the reader knows that that is about parallelstuff. * I'm a bit confused by the fact that Start/CommitTransactionCommand() emit a generic elog when encountering a PARALLEL_INPROGRESS,whereas ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be pretty much impossible tohit a ReleaseSavepoint() with a parallel transaction in progress? 
We'd have had to end the previous transaction command while parallel stuff was in progress - right? (Internal/ReleaseCurrentSubTransaction is different, that's used in code) * Why are you deviating from the sorting order used in other places for ParallelCurrentXids? That seems really weird, especially as we use something else a couple lines down. Especially as you actually seem to send stuff in xidComparator order? * I don't think skipping AtEOXact_Namespace() entirely if parallel is a good idea. That function already does some other stuff than cleaning up temp tables. I think you should pass in parallel and do the skip in there. * Start/DndParallelWorkerTransaction assert the current state, whereas the rest of the file FATALs in that case. I think it'd actually be good to be conservative and do the same in this case. * You're manually passing function names to PreventCommandIfParallelMode() in a fair number of cases. I'd either try and keep the names consistent with what the functions are actually called at the sql level (adding the types in the parens) or just use PG_FUNCNAME_MACRO to keep them correct. * Wait. You now copy all held relation locks "as is" to the standby? I quite doubt that's a good idea, and it's not my reading of the conclusions made in the group locking thread. At the very least this part needs to be extensively documented. And while LockAcquireExtended() refers to src/backend/access/transam/README.parallel for details I don't see anything pertinent in there. And the function header sounds like the only difference is the HS logging - not mentioning that it essentially disables lock queuing entirely. This seems unsafe (e.g. consider if the initiating backend died and somebody else acquired the lock, possible e.g. if postmaster died) and not even remotely enough discussed. I think this should be removed from the patch for now. Greetings, Andres Freund
On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Observations: > * Some tailing whitespace in the readme. Very nice otherwise. Fixed. Thanks. > * Don't like CreateParallelContextForExtension much as a function name - > I don't think we normally equate the fact that code is located in a > loadable library with the extension mechanism. Suggestions for a better name? CreateParallelContextForLoadableFunction? > * StaticAssertStmt(BUFFERALIGN(PARALLEL_ERROR_QUEUE_SIZE) ...) what's > that about? Gee, maybe I should have added a comment so I'd remember. If the buffer size isn't MAXALIGN'd, that would be really bad, because shm_mq_create() assumes that it's being given an aligned address. Maybe I should add an Assert() there. If it is MAXALIGN'd but not BUFFERALIGN'd, we might waste a few bytes of space, since shm_toc_allocate() always allocates in BUFFERALIGN'd chunks, but I don't think anything will actually break. Not sure if that's worth an assert or not. > * the plain shm calculations intentionally use mul_size/add_size to deal > with overflows. On 32bit it doesn't seem impossible, but unlikely to > overflow size_t. Yes, I do that here too, though as with the plain shm calculations, only in the estimate functions. The functions that actually serialize stuff don't have to worry about overflow because it's already been checked. > * I'd s/very // i "We might be running in a very short-lived memory > context.". Or replace it with "too". Removed "very". > * In +LaunchParallelWorkers, does it make sense trying to start workers > if one failed? ISTM that's likely to not be helpful. I.e. it should > just break; after the first failure. It can't just break, because clearing pcxt->worker[i].error_mqh is an essential step. I could add a flag variable that tracks whether any registrations have failed and change the if statement to if (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that but decided it was very unlikely to affect the real-world performance of anything, so easier just to keep the code simple. But I'll change it if you want. > * +WaitForParallelWorkersToFinish says that it waits for workers to exit > cleanly. To me that's ambiguous. How about "fully"? I've removed the word "cleanly" and added a comment to more fully explain the danger: + * Even if the parallel operation seems to have completed successfully, it's + * important to call this function afterwards. We must not miss any errors + * the workers may have thrown during the parallel operation, or any that they + * may yet throw while shutting down. > * ParallelMain restores libraries before GUC state. Given that > librararies can, and actually somewhat frequently do, inspect GUCs on > load, it seems better to do it the other way round? You argue "We want > to do this before restoring GUCs, because the libraries might define > custom variables.", but I don't buy that. It's completely normal for > namespaced GUCs to be present before a library is loaded? Especially > as you then go and process_session_preload_libraries() after setting > the GUCs. This is a good question, but the answer is not entirely clear to me. I'm thinking I should probably just remove process_session_preload_libraries() altogether at this point. 
That was added at a time when RestoreLibraryState() didn't exist yet, and I think RestoreLibraryState() is a far superior way of handling this, because surely the list of libraries that the parallel leader *actually had loaded* is more definitive than any GUC. Now, that doesn't answer the question about whether we should load libraries first or GUCs. I agree that the comment's reasoning is bogus, but I'm not sure I understand why you think the other order is better. It *is* normal for namespaced GUCs to be present before a library is loaded, but it's equally normal (and, I think, often more desirable in practice) to load the library first and then set the GUCs. Generally, I think that libraries ought to be loaded as early as possible, because they may install hooks that change the way other stuff works, and should therefore be loaded before that other stuff happens. > * Should ParallelMain maybe enter the parallel context before loading > user defined libraries? It's far from absurd to execute code touching > the database on library initialization... It's hard to judge without specific examples. What kinds of things do they do? Are they counting on a transaction being active? I would have thought that was a no-no, since there are many instances in which it won't be true. Also, you might have just gotten loaded because a function stored in your library was called, so you could be in a transaction that's busy doing something else, or deep in a subtransaction stack, etc. It seems unsafe to do very much more than a few syscache lookups here, even if there does happen to be a transaction active. > * rename ParallelMain to ParallelWorkerMain? Sounds good. Done. > * I think restoring snapshots needs to fudge the worker's PGXACT->xmin > to be the minimum of the top transaction id and the > snapshots. Otherwise if the initiating process dies badly > (e.g. because postmaster died) the workers will continue to work, > while other processes may remove things. RestoreTransactionSnapshot() works the same way as the existing import/export snapshot stuff, so that ought to be no less safe than what we're doing already. Any other snapshots that we're restoring had better not have an xmin lower than that one; if they do, the master messed up. Possibly it would be a good idea to have additional safeguards there; not sure exactly what. Perhaps RestoreSnapshot() could assert that the xmin of the restored snapshot follows-or-is-equal-to PGXACT->xmin? That would be safe for the first snapshot we restore because our xmin will be InvalidTransactionId, and after that it should check the condition you're worried about? Thoughts? > Also, you don't seem to > prohibit popping the active snapshot (should that be prohibitted > maybe?) which might bump the initiator's xmin horizon. I think as long as our transaction snapshot is installed correctly our xmin horizon can't advance; am I missing something? It's generally OK to pop the active snapshot, as long as you only pop what you pushed. In a worker, you can push additional snapshots on the active snapshot stack and then pop them, but you can't pop the one ParallelWorkerMain installed for you. You'll probably notice if you do, because then the snapshot stack will be empty when you get back to ParallelWorkerMain() and you'll fail an assertion. Similarly, the master shouldn't pop the snapshot that was active at the start of parallelism until parallelism is done, but again if you did you'd probably fail an assertion later on. 
Generally, we haven't had much of a problem with PushActiveSnapshot() and PopActiveSnapshot() calls being unbalanced, so I don't really think this is an area that needs especially strong cross-checks. At least not unless we get some evidence that this is a more-common mistake in code that touches parallelism than it is in general. > * Comment about 'signal_handler' in +HandleParallelMessageInterrupt > is outdated. Removed. > * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker > exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally > not sure it's wise to use this faux FE/BE protocol here... Well, I'm not sure about that either and never have been, but I was even less sure inventing a new one was any better. We might need a few new protocol messages (or to reuse a few existing ones for other things) but being able to reuse the existing format for ErrorResponse, NoticeResponse, etc. seems like a pretty solid win. Those are reasonably complex message formats and reimplementing them for no reason seems like a bad idea. Terminate is 'X', not 'T', and it's a frontend-only message. The worker is speaking the backend half of the protocol. We could use it anyway; that wouldn't be silly. I picked ReadyForQuery because that's what the backend sends when it is completely done processing everything that the user most recently requested, which seems defensible here. > * HandlingParallelMessages looks dead. Good catch. Removed. > * ParallelErrorContext has the wrong comment. Doh. Fixed. > * ParallelErrorContext() provides the worker's pid in the context > message. I guess that's so it's easier to interpret when sent to the > initiator? It'll look odd when logged by the failing process. Yes, that's why. Regarding logging, true. I guess the master could add the context instead, although making sure the PID is available looks pretty annoying. At the time we establish the queue, the PID isn't known yet, and by the time we read the error from it, the worker might be gone, such that we can't read its PID. To fix, we'd have to invent a new protocol message that means "here's my PID". Another alternative is to just say that the error came from a parallel worker (e.g. "while executing in parallel worker") and not mention the PID, but that seems like it'd be losing possibly-useful information. > * We now have pretty much the same handle_sigterm in a bunch of > places. Can't we get rid of those? You actually probably can just use > die(). Good idea. Done. > * The comments in xact.c above XactTopTransactionId pretty much assume > that the reader knows that that is about parallel stuff. What would you suggest? The comment begins "Only a single TransactionStateData is placed on the parallel worker's state stack", which seems like a pretty clear way of giving the user a hint that we are talking about parallel stuff. An older version of the patch was much less clear, but I thought I'd remedied that. > * I'm a bit confused by the fact that Start/CommitTransactionCommand() > emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas > ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be > pretty much impossible to hit a ReleaseSavepoint() with a parallel > transaction in progress? We'd have had to end the previous transaction > command while parallel stuff was in progress - right? > (Internal/ReleaseCurrentSubTransaction is different, that's used in code) It's pretty simple to hit ReleaseSavepoint() while a transaction is in progress. 
It's pretty much directly SQL-callable, so a PL function run in a parallel worker could easily hit it, or anything that uses SPI. There are similar checks in e.g. EndTransaction(), but you can't invoke CommitTransactionCommand() directly. > * Why are you deviating from the sorting order used in other places for > ParallelCurrentXids? That seems really wierd, especially as we use > something else a couple lines down. Especially as you actually seem to > send stuff in xidComparator order? The transaction IDs have to be sorted into some order so that they can be binary-searched, and this seemed simplest. xidComparator sorts in numerical order, not transaction-ID order, so that's how we send them. That turns out to be convenient anyway, because binary-search on a sorted array of integers is really simple. If we sent them sorted in transaction-ID order we'd have to make sure that the reference transaction ID was the same for both backends, which might not be hard, but I don't see how it would be better than this. > * I don't think skipping AtEOXact_Namespace() entirely if parallel is a > good idea. That function already does some other stuff than cleaning > up temp tables. I think you should pass in parallel and do the skip in > there. That's a very good point. Fixed. > * Start/DndParallelWorkerTransaction assert the current state, whereas > the rest of the file FATALs in that case. I think it'd actually be > good to be conservative and do the same in this case. Well, actually, StartTransaction() does this: if (s->state != TRANS_DEFAULT) elog(WARNING, "StartTransaction while in %s state", TransStateAsString(s->state)); I could copy and paste that code into StartParallelWorkerTransaction() and changing WARNING to FATAL, but the first thing StartParallelWorkerTransaction() does is call StartTransaction(). It seems pretty stupid to have two identical tests that differ only in their log level. The same considerations apply to EndParalellWorkerTransaction() and CommitTransaction(). A worthwhile question is why somebody thought that it was a good idea for the log level there to be WARNING rather than FATAL. But I don't think it's this patch's job to second-guess that decision. > * You're manually passing function names to > PreventCommandIfParallelMode() in a fair number of cases. I'd either > try and keep the names consistent with what the functions are actually > called at the sql level (adding the types in the parens) or just use > PG_FUNCNAME_MACRO to keep them correct. I think putting the type names in is too chatty; I'm not aware we use that style in other error messages. We don't want to lead people to believe that only the form with the particular argument types they used is not OK. PG_FUNCNAME_MACRO will give us the C name, not the SQL name. > * Wait. You now copy all held relation held "as is" to the standby? I > quite doubt that's a good idea, and it's not my reading of the > conclusions made in the group locking thread. At the very least this > part needs to be extensively documented. And while > LockAcquireExtended() refers to > src/backend/access/transam/README.parallel for details I don't see > anything pertinent in there. And the function header sounds like the > only difference is the HS logging - not mentioning that it essentially > disables lock queuing entirely. > > This seems unsafe (e.g. consider if the initiating backend died and > somebody else acquired the lock, possible e.g. if postmaster died) and > not even remotely enough discussed. 
I think this should be removed > from the patch for now. If it's broken, we need to identify what's wrong and fix it, not just rip it out. It's possible that something is broken with that code, but it's dead certain that something is broken without it: rhaas=# select parallel_count('pgbench_accounts', 1); NOTICE: PID 57956 counted 2434815 tuples NOTICE: PID 57957 counted 1565185 tuples CONTEXT: parallel worker, pid 57957parallel_count ---------------- 4000000 (1 row) rhaas=# begin; BEGIN rhaas=# lock pgbench_accounts; LOCK TABLE rhaas=# select parallel_count('pgbench_accounts', 1); NOTICE: PID 57956 counted 4000000 tuples ...and then it hangs forever. On the specific issues: 1. I agree that it's very dangerous for the parallel backend to acquire the lock this way if the master no longer holds it. Originally, I was imagining that there would be no interlock between the master shutting down and the worker starting up, but you and others convinced me that was a bad idea. So now transaction commit or abort waits for all workers to be gone, which I think reduces the scope of possible problems here pretty significantly. However, it's quite possible that it isn't airtight. One thing we could maybe do to make it safer is pass a pointer to the initiator's PGPROC. If we get the lock via the fast-path we are safe anyway, but if we have to acquire the partition lock, then we can cross-check that the initiator's lock is still there. I think that would button this up pretty tight. 2. My reading of the group locking discussions was that everybody agreed that the originally-proposed group locking approach, which involved considering locks from the same locking group as mutually non-conflicting, was not OK. Several specific examples were offered - e.g. it's clearly not OK for two backends to extend a relation at the same time just because the same locking group. So I abandoned that approach. When I instead proposed the approach of copying only the locks that the master *already* had at the beginning of parallelism and considering *only those* as mutually conflicting, I believe I got several comments to the effect that this was "less scary". Considering the topic area, I'm not sure I'm going to do any better than that. 3. I welcome proposals for other ways of handling this problem, even if they restrict the functionality that can be offered. For example, a proposal that would make parallel_count revert to single-threaded mode but terminate without an indefinite wait would be acceptable to me, provided that it offers some advantage in safety and security over what I've already got. A proposal to make it parallel_count error out in the above case would not be acceptable to me; the planner must not generate parallel plans that will sometimes fail unexpectedly at execution-time. I generally believe that we will be much happier if application programmers need not worry about the failure of parallel workers to obtain locks already held by the master; some such failure modes may be very complex and hard to predict. The fact that the current approach handles the problem entirely within the lock manager, combined with the fact that it is extremely simple, is therefore very appealing to me. Nonetheless, a test case that demonstrates this approach falling down badly would force a rethink; do you have one? Or an idea about what it might look like? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-02-10 11:49:58 -0500, Robert Haas wrote: > On Sat, Feb 7, 2015 at 7:20 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > * Don't like CreateParallelContextForExtension much as a function name - > > I don't think we normally equate the fact that code is located in a > > loadable library with the extension mechanism. > > Suggestions for a better name? CreateParallelContextForLoadableFunction? *ExternalFunction maybe? That's dfmgr.c calls it. > > * the plain shm calculations intentionally use mul_size/add_size to deal > > with overflows. On 32bit it doesn't seem impossible, but unlikely to > > overflow size_t. > > Yes, I do that here too, though as with the plain shm calculations, > only in the estimate functions. The functions that actually serialize > stuff don't have to worry about overflow because it's already been > checked. I thought I'd seen some estimation functions that didn't use it. But apparently I was wrong. > > * In +LaunchParallelWorkers, does it make sense trying to start workers > > if one failed? ISTM that's likely to not be helpful. I.e. it should > > just break; after the first failure. > > It can't just break, because clearing pcxt->worker[i].error_mqh is an > essential step. I could add a flag variable that tracks whether any > registrations have failed and change the if statement to if > (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, > &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that > but decided it was very unlikely to affect the real-world performance > of anything, so easier just to keep the code simple. But I'll change > it if you want. I think it'd be better. > > * ParallelMain restores libraries before GUC state. Given that > > librararies can, and actually somewhat frequently do, inspect GUCs on > > load, it seems better to do it the other way round? You argue "We want > > to do this before restoring GUCs, because the libraries might define > > custom variables.", but I don't buy that. It's completely normal for > > namespaced GUCs to be present before a library is loaded? Especially > > as you then go and process_session_preload_libraries() after setting > > the GUCs. > > This is a good question, but the answer is not entirely clear to me. > I'm thinking I should probably just remove > process_session_preload_libraries() altogether at this point. That > was added at a time when RestoreLibraryState() didn't exist yet, and I > think RestoreLibraryState() is a far superior way of handling this, > because surely the list of libraries that the parallel leader > *actually had loaded* is more definitive than any GUC. That sounds like a good idea to me. > Now, that doesn't answer the question about whether we should load > libraries first or GUCs. I agree that the comment's reasoning is > bogus, but I'm not sure I understand why you think the other order is > better. It *is* normal for namespaced GUCs to be present before a > library is loaded, but it's equally normal (and, I think, often more > desirable in practice) to load the library first and then set the > GUCs. Well, it's pretty much never the case that the library is loaded before postgresql.conf gucs, right? A changed postgresql.conf is the only exception I can see. Neither is it the normal case for session|local_preload_libraries. Not even when GUCs are loaded via pg_db_role_setting or the startup packet... 
> Generally, I think that libraries ought to be loaded as early > as possible, because they may install hooks that change the way other > stuff works, and should therefore be loaded before that other stuff > happens. While that may be desirable I don't really see a reason for this to be treated differently for worker processes than the majority of cases otherwise. Anyway, I think this is a relatively minor issue. > > * Should ParallelMain maybe enter the parallel context before loading > > user defined libraries? It's far from absurd to execute code touching > > the database on library initialization... > > It's hard to judge without specific examples. What kinds of things do > they do? I've seen code filling lookup caches and creating system objects (including tables and extensions). > Are they counting on a transaction being active? > I would have thought that was a no-no, since there are many instances > in which it won't be true. Also, you might have just gotten loaded > because a function stored in your library was called, so you could be > in a transaction that's busy doing something else, or deep in a > subtransaction stack, etc. It seems unsafe to do very much more than > a few syscache lookups here, even if there does happen to be a > transaction active. The only reason I'd like it to be active is because that'd *prohibit* doing the crazier stuff. There seems little reason to not da it under the additional protection against crazy things that'd give us? > > * I think restoring snapshots needs to fudge the worker's PGXACT->xmin > > to be the minimum of the top transaction id and the > > snapshots. Otherwise if the initiating process dies badly > > (e.g. because postmaster died) the workers will continue to work, > > while other processes may remove things. > > RestoreTransactionSnapshot() works the same way as the existing > import/export snapshot stuff, so that ought to be no less safe than > what we're doing already. Well, ExportSnapshot()/Import has quite a bit more restrictions than what you're doing... Most importantly it cannot import in transactions unless using read committed isolation, forces xid assignment during export, forces the old snapshot to stick around for the whole transaction and only works on a primary. I'm not actually sure it makes a relevant difference, but it's certainly worth calling attention to. The fuding I was wondering about certainly is unnecessary though - GetSnapshotData() has already performed it... I'm not particularly awake right now and I think this needs a closer look either by someone awake... I'm not fully convinced this is safe. > > Also, you don't seem to > > prohibit popping the active snapshot (should that be prohibitted > > maybe?) which might bump the initiator's xmin horizon. > > I think as long as our transaction snapshot is installed correctly our > xmin horizon can't advance; am I missing something? Maybe I'm missing something, but why would that be the case in a read committed session? ImportSnapshot() only ever calls SetTransactionSnapshot it such a case (why does it contain code to cope without it?), but your patch doesn't seem to guarantee that. But as don't actually transport XactIsoLevel anywhere (omission probably?) that seems to mean that the SetTransactionSnapshot() won't do much, even if the source transaction is repeatable read. 
Now the thing that actually does give some guarantees is that you immediately afterwards restore the active snapshot and do a PushActiveSnapshot(), which'll prevent MyPgXact->xmin from changing because SnapshotResetXmin() refuses to do anything if there's an ActiveSnapshot. Seems a bit complicated and fragile though. > > * Is it really a good idea to overlay Z/ReadyForQuery with 'this worker > > exited cleanly'? Shouldn't it at least be T/Terminate? I'm generally > > not sure it's wise to use this faux FE/BE protocol here... > > Well, I'm not sure about that either and never have been, but I was > even less sure inventing a new one was any better. We might need a > few new protocol messages (or to reuse a few existing ones for other > things) but being able to reuse the existing format for ErrorResponse, > NoticeResponse, etc. seems like a pretty solid win. Those are > reasonably complex message formats and reimplementing them for no > reason seems like a bad idea. > > Terminate is 'X', not 'T' Oops, yes. > and it's a frontend-only message. The worker is speaking the backend > half of the protocol. We could use it anyway; that wouldn't be silly. Even if it's a frontend only one it doesn't seem like a bad idea. I've occasionally wished the backend would send explicit termination messages in a bunch of scenarios. I'm a bit tired of seeing:

FATAL:  57P01: terminating connection due to administrator command
LOCATION:  ProcessInterrupts, postgres.c:2888
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

when the session was terminated while a query is active. An explicit termination message seems like a nicer solution than interpreting the FATAL severity. > I picked ReadyForQuery because that's > what the backend sends when it is completely done processing > everything that the user most recently requested, which seems > defensible here. I'm pretty sure that we're going to reuse workers within a parallel query at some point and ready for query seems like a nicer code for saying 'finished with my work, give me the next thing'. > > * ParallelErrorContext() provides the worker's pid in the context > message. I guess that's so it's easier to interpret when sent to the > initiator? It'll look odd when logged by the failing process. > > Yes, that's why. Regarding logging, true. I guess the master could > add the context instead, although making sure the PID is available > looks pretty annoying. At the time we establish the queue, the PID > isn't known yet, and by the time we read the error from it, the worker > might be gone, such that we can't read its PID. To fix, we'd have to > invent a new protocol message that means "here's my PID". Hm. Or we could attach the pid to the error message in that case - just like there already is schema_name etc. Or, to stay more in the FE/BE vibe - we could just send a 'K' message at startup which sends MyProcPid, MyCancelKey in normal connections. > > * The comments in xact.c above XactTopTransactionId pretty much assume > > that the reader knows that that is about parallel stuff. > > What would you suggest? The comment begins "Only a single > TransactionStateData is placed on the parallel worker's state stack", > which seems like a pretty clear way of giving the user a hint that we > are talking about parallel stuff. Replacing "Only" by "When running as parallel worker we only place a ..." would already help.
To me the comment currently reads like it desperately wishes to be located in the function initiating parallelism rather than global file scope. Maybe it's lonely or such. > > * I'm a bit confused by the fact that Start/CommitTransactionCommand() > > emit a generic elog when encountering a PARALLEL_INPROGRESS, whereas > > ReleaseSavepoint()/RollbackTo has a special message. Shouldn't it be > > pretty much impossible to hit a ReleaseSavepoint() with a parallel > > transaction in progress? We'd have had to end the previous transaction > > command while parallel stuff was in progress - right? > > (Internal/ReleaseCurrentSubTransaction is different, that's used in code) > > It's pretty simple to hit ReleaseSavepoint() while a transaction is in > progress. It's pretty much directly SQL-callable, so a PL function > run in a parallel worker could easily hit it, or anything that uses > SPI. SPI doesn't really work across subtransaction boundaries, so it really should never be called that way. Which is why it (and thus the PLs) return SPI_ERROR_TRANSACTION in case you try to execute it. If it's possible to hit it, we have a problem - I think it'd pretty much lead to rampant memory clobbering and such. Now, I'm not against providing an error check, I was just surprised about the verbosity of the different locations. > > * Why are you deviating from the sorting order used in other places for > > ParallelCurrentXids? That seems really weird, especially as we use > > something else a couple lines down. Especially as you actually seem to > > send stuff in xidComparator order? > > The transaction IDs have to be sorted into some order so that they can > be binary-searched, and this seemed simplest. xidComparator sorts in > numerical order, not transaction-ID order, so that's how we send them. Forget that bit. Was tired. I'm not sure what I thought. > > * Start/EndParallelWorkerTransaction assert the current state, whereas > > the rest of the file FATALs in that case. I think it'd actually be > > good to be conservative and do the same in this case. > > Well, actually, StartTransaction() does this: > > if (s->state != TRANS_DEFAULT) > elog(WARNING, "StartTransaction while in %s state", > TransStateAsString(s->state)); But StartTransactionCommand does elog(ERROR, "StartTransactionCommand: unexpected state %s", BlockStateAsString(s->blockState)); and CommitTransactionCommand does elog(FATAL, "CommitTransactionCommand: unexpected state %s", BlockStateAsString(s->blockState)); and some more. These only deal with blockState though... > A worthwhile question is why somebody thought that it was a good idea > for the log level there to be WARNING rather than FATAL. Yea, it's really rather odd. Every time I've seen those WARNINGs, e.g. in the recent "hung backends stuck in spinlock" thread, things got even more pear-shaped shortly afterwards. > But I don't think it's this patch's job to second-guess that decision. Fair enough. > > * You're manually passing function names to > > PreventCommandIfParallelMode() in a fair number of cases. I'd either > > try and keep the names consistent with what the functions are actually > > called at the sql level (adding the types in the parens) or just use > > PG_FUNCNAME_MACRO to keep them correct. > > I think putting the type names in is too chatty; I'm not aware we use > that style in other error messages. We don't want to lead people to > believe that only the form with the particular argument types they > used is not OK. > PG_FUNCNAME_MACRO will give us the C name, not the SQL name.
Your manual ones don't either, that's what made me complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL level. Instead they use the C name + parens. > > * Wait. You now copy all held relation locks "as is" to the standby? I > > quite doubt that's a good idea, and it's not my reading of the > > conclusions made in the group locking thread. At the very least this > > part needs to be extensively documented. And while > > LockAcquireExtended() refers to > > src/backend/access/transam/README.parallel for details I don't see > > anything pertinent in there. And the function header sounds like the > > only difference is the HS logging - not mentioning that it essentially > > disables lock queuing entirely. > > > > This seems unsafe (e.g. consider if the initiating backend died and > > somebody else acquired the lock, possible e.g. if postmaster died) and > > not even remotely enough discussed. I think this should be removed > > from the patch for now. > > If it's broken, we need to identify what's wrong and fix it, not just > rip it out. It's possible that something is broken with that code, > but it's dead certain that something is broken without it: > > rhaas=# select parallel_count('pgbench_accounts', 1); > NOTICE: PID 57956 counted 2434815 tuples > NOTICE: PID 57957 counted 1565185 tuples > CONTEXT: parallel worker, pid 57957 > parallel_count > ---------------- > 4000000 > (1 row) > > rhaas=# begin; > BEGIN > rhaas=# lock pgbench_accounts; > LOCK TABLE > rhaas=# select parallel_count('pgbench_accounts', 1); > NOTICE: PID 57956 counted 4000000 tuples > > ...and then it hangs forever. Which is *not* a good example for the problem. Your primary reasoning for needing something more than sharing the locks that we know the individual query is going to need (which we acquire in the parse analyze/planner/executor) is that we can't predict what some random code does. Now, I still don't think that argument should hold too much sway because we'll only be able to run carefully controlled code *anyway*. But *if* we take it as reasoning for doing more than granting the locks we need for the individual query, we *still* need to cope with exactly the variants of the above invisible deadlock. You can still call a function acquiring some form of lock on either side. If you'd, as I've argued for before, provide an API that granted workers precisely the locks needed for the execution of a certain type of parallel action, I'd be happy. I.e. regular parallel queries would transport an extended version of what InitPlan() does with relation locks. > On the specific issues: > > 1. I agree that it's very dangerous for the parallel backend to > acquire the lock this way if the master no longer holds it. > Originally, I was imagining that there would be no interlock between > the master shutting down and the worker starting up, but you and > others convinced me that was a bad idea. So now transaction commit or > abort waits for all workers to be gone, which I think reduces the > scope of possible problems here pretty significantly. However, it's > quite possible that it isn't airtight. One thing we could maybe do to > make it safer is pass a pointer to the initiator's PGPROC. If we get > the lock via the fast-path we are safe anyway, but if we have to > acquire the partition lock, then we can cross-check that the > initiator's lock is still there. I think that would button this up > pretty tight. That'd certainly make me feel less bad about it. > 2.
My reading of the group locking discussions was that everybody > agreed that the originally-proposed group locking approach, which > involved considering locks from the same locking group as mutually > non-conflicting, was not OK. Several specific examples were offered - > e.g. it's clearly not OK for two backends to extend a relation at the > same time just because they are in the same locking group. So I abandoned that > approach. When I instead proposed the approach of copying only the > locks that the master *already* had at the beginning of parallelism > and considering *only those* as mutually conflicting, I believe I got > several comments to the effect that this was "less scary". > Considering the topic area, I'm not sure I'm going to do any better > than that. It surely is less scary, but still darn scary. It's just friggin hard to have a mental model where (self) exclusive locks suddenly aren't that anymore. It'll get more and more dangerous the more we start to relax restrictions around parallelism - and that *will* come. This concern does not only apply to user level commands, but also our own C code. We will find more and more reasons to parallelize commands and then we will have problems with suddenly not having a chance to prevent parallelism during certain actions. Say we parallelize VACUUM (grand idea for the index scans!) and someone wants to also improve the truncation. Suddenly the AEL during truncation doesn't give us guarantees anymore. There'll be many more of these, and we can't really reason about them because we're used to working locks. > 3. I welcome proposals for other ways of handling this problem, even > if they restrict the functionality that can be offered. For example, > a proposal that would make parallel_count revert to single-threaded > mode but terminate without an indefinite wait would be acceptable to > me, provided that it offers some advantage in safety and security over > what I've already got. I think the above example where we only grant the locks required for a specific thing and only on the relevant severity would already be a big improvement. Even better would be to use the computed set of locks to check whether workers could acquire them and refuse parallelism in that case. Another thing to do is to mark the lock, both in the master and workers, as not effectively being of the original severity anymore. If the own process again tries to acquire the lock on a heavier severity than what was needed at the time of query execution, error out. At least until parallel mode has confirmed to have ended. That should pretty much never happen. I don't see how we can avoid teaching the deadlock detector about all these silent deadlocks in any case. By your own reasoning. > A proposal to make it parallel_count error out > in the above case would not be acceptable to me; the planner must not > generate parallel plans that will sometimes fail unexpectedly at > execution-time. I generally believe that we will be much happier if > application programmers need not worry about the failure of parallel > workers to obtain locks already held by the master; some such failure > modes may be very complex and hard to predict. The fact that the > current approach handles the problem entirely within the lock manager, > combined with the fact that it is extremely simple, is therefore very > appealing to me. It's trivial to be simple and unsafe... > Nonetheless, a test case that demonstrates this approach falling down > badly would force a rethink; do you have one?
Or an idea about what > it might look like? No, I don't have one handy. I have the very strong gut feeling that this would introduce a severe architectural debt that we won't be able to get rid of afterwards, even if it proves to be a really bad idea. To me it's taking a shortcut directly through the hard problems. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
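For what it's worth, the 'K' message idea floated above could reuse the same message-building calls the backend already uses for BackendKeyData. This is only a sketch; whether the worker should send its cancel key at all, and whether 'K' is the right message type to reuse, are open questions:

	StringInfoData buf;

	/* announce the worker's identity on its queue, BackendKeyData-style */
	pq_beginmessage(&buf, 'K');
	pq_sendint(&buf, (int32) MyProcPid, 4);
	pq_sendint(&buf, (int32) MyCancelKey, 4);
	pq_endmessage(&buf);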
On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Suggestions for a better name? CreateParallelContextForLoadableFunction? > > *ExternalFunction maybe? That's dfmgr.c calls it. Done. >> It can't just break, because clearing pcxt->worker[i].error_mqh is an >> essential step. I could add a flag variable that tracks whether any >> registrations have failed and change the if statement to if >> (!any_registrations_failed && RegisterDynamicBackgroundWorker(&worker, >> &pcxt->worker[i].bgwhandle)), if you want. I thought about doing that >> but decided it was very unlikely to affect the real-world performance >> of anything, so easier just to keep the code simple. But I'll change >> it if you want. > > I think it'd be better. Done. >> This is a good question, but the answer is not entirely clear to me. >> I'm thinking I should probably just remove >> process_session_preload_libraries() altogether at this point. That >> was added at a time when RestoreLibraryState() didn't exist yet, and I >> think RestoreLibraryState() is a far superior way of handling this, >> because surely the list of libraries that the parallel leader >> *actually had loaded* is more definitive than any GUC. > > That sounds like a good idea to me. Done. > Well, it's pretty much never the case that the library is loaded before > postgresql.conf gucs, right? A changed postgresql.conf is the only > exception I can see. Neither is it the normal case for > session|local_preload_libraries. Not even when GUCs are loaded via > pg_db_role_setting or the startup packet... Well, it can happen when you issue an explicit LOAD, or an implicit one by calling a function that's in that module, and then go back and set the GUC afterward, if nothing else. > Anyway, I think this is a relatively minor issue. I'm going to leave this alone for now. Evidence may emerge that this is better done in the other order, but in the absence of evidence I'm going to follow my gut rather than yours. With all due respect for your gut, of course. > The only reason I'd like it to be active is because that'd *prohibit* > doing the crazier stuff. There seems little reason to not da it under > the additional protection against crazy things that'd give us? Trying to load additional libraries once parallel mode is in progress can provide failures, because the load of the libraries causes the system to do GUC_ACTION_SET on the GUCs whose initialization was deferred, and that trips up the no-changing-things-while-in-parallel-mode checks. I'm not sure if there's anything we can, or should, do about that. > Well, ExportSnapshot()/Import has quite a bit more restrictions than > what you're doing... Most importantly it cannot import in transactions > unless using read committed isolation, forces xid assignment during > export, forces the old snapshot to stick around for the whole > transaction and only works on a primary. I'm not actually sure it makes > a relevant difference, but it's certainly worth calling attention to. > > The fuding I was wondering about certainly is unnecessary though - > GetSnapshotData() has already performed it... > > I'm not particularly awake right now and I think this needs a closer > look either by someone awake... I'm not fully convinced this is safe. I'm not 100% comfortable with it either, but I just spent some time looking at it and can't see what's wrong with it. 
Basically, we're trying to get the parallel worker into a state that matches the master's state after doing GetTransactionSnapshot() - namely, CurrentSnapshot should point to the same snapshot on the master, and FirstSnapshotSet should be true, plus the same additional processing that GetTransactionSnapshot() would have done if we're in a higher transaction isolation level. It's possible we don't need to mimic that state, but it seems like a good idea. Still, I wonder if we ought to be banning GetTransactionSnapshot() altogether. I'm not sure if there's ever a time when it's safe for a worker to call that. >> > Also, you don't seem to >> > prohibit popping the active snapshot (should that be prohibitted >> > maybe?) which might bump the initiator's xmin horizon. >> >> I think as long as our transaction snapshot is installed correctly our >> xmin horizon can't advance; am I missing something? > > Maybe I'm missing something, but why would that be the case in a read > committed session? ImportSnapshot() only ever calls > SetTransactionSnapshot it such a case (why does it contain code to cope > without it?), but your patch doesn't seem to guarantee that. > > But as don't actually transport XactIsoLevel anywhere (omission > probably?) that seems to mean that the SetTransactionSnapshot() won't do > much, even if the source transaction is repeatable read. Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore the GUC state? > Now the thing that actually does give some guarantees is that you > immediately afterwards restore the active snapshot and do a > PushActiveSnapshot(), which'll prevent MyPgXact->xmin from changing > because SnapshotResetXmin() refuses to do anything if there's an > ActiveSnapshot. Seems a bit comlicated and fragile though. Suggestions? >> Terminate is 'X', not 'T' > > Oops, yes. > >> and it's a frontend-only message. The worker is speaking the backend >> half of the protocol. We could use it anyway; that wouldn't be silly. > > Even if it's a frontend only one it doesn't seem like a bad idea. Done. >> I picked ReadyForQuery because that's >> what the backend sends when it is completely done processing >> everything that the user most recently requested, which seems >> defensible here. > > I'm pretty sure that we're going to reuse workers within a parallel > query at some point and ready for query seems like a nicer code for > saying 'finished with my work, give me the next thing'. Yeah, could well be. > Hm. Or we could attach the pid to the error message in that case - just > like there already is schema_name etc. Or, to stay more in the FE/BE > vibe - we could just send a 'K' message at startup which sends > MyProcPid, MyCancelKey in normal connections. Done. > Replacing "Only" by "When running as parallel worker we only place a > ..." would already help. To me the comment currently readss like it > desperately wishes to be located in the function initiating parallelism > rather than global file scope. Maybe it's lonely or such. Heh, heh. Done. >> > * You're manually passing function names to >> > PreventCommandIfParallelMode() in a fair number of cases. I'd either >> > try and keep the names consistent with what the functions are actually >> > called at the sql level (adding the types in the parens) or just use >> > PG_FUNCNAME_MACRO to keep them correct. >> >> I think putting the type names in is too chatty; I'm not aware we use >> that style in other error messages. 
We don't want to lead people to >> believe that only the form with the particular argument types they >> used is not OK. > >> PG_FUNCNAME_MACRO will give us the C name, not the SQL name. > > Your manual ones don't either, that's what made me > complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL > level. Instead they use the C name + parens. On further reflection, I think I should probably just change all of those to use a general message about advisory locks, i.e.: ERROR: cannot use advisory locks during a parallel operation That sound good to you? > Which is *not* a good example for the problem. Your primary reasoning > for needing something more than sharing the locks that we know the > individual query is going to need (which we acquire in the parse > analzye/planner/executor) is that we can't predict what some random code > does. Now, I still don't think that argument should hold too much sway > because we'll only be able to run carefully controlled code > *anyway*. I'll avoid repeating what I've said about this before, except to say that I still don't believe for a minute you can predict which locks you'll need. > But *if* we take it as reasoning for doing more than granting > the locks we need for the individual query, we *still* need to cope with > exactly the variants of the above invisible deadlock. You can still can > call a function acquiring some form of lock on either side. That by itself is not a deadlock. If the worker blocks trying to acquire a lock that the master held before the start of parallelism, it will wait forever, even if the master never acquires a lock. But if the worker blocks trying to acquire a lock that the master acquired *during* parallelism, it's presumably something like a catalog lock or a tuple lock or a relation extension lock that the master is going to release quickly. So the master will finish with that resource and then the worker will go ahead. You're right that there are other scenarios to consider, but you need more than that. >> On the specific issues: >> >> 1. I agree that it's very dangerous for the parallel backend to >> acquire the lock this way if the master no longer holds it. >> Originally, I was imagining that there would be no interlock between >> the master shutting down and the worker starting up, but you and >> others convinced me that was a bad idea. So now transaction commit or >> abort waits for all workers to be gone, which I think reduces the >> scope of possible problems here pretty significantly. However, it's >> quite possible that it isn't airtight. One thing we could maybe do to >> make it safer is pass a pointer to the initiator's PGPROC. If we get >> the lock via the fast-path we are safe anyway, but if we have to >> acquire the partition lock, then we can cross-check that the >> initiator's lock is still there. I think that would button this up >> pretty tight. > > That'd certainly make me feel less bad about it. Done. >> 2. My reading of the group locking discussions was that everybody >> agreed that the originally-proposed group locking approach, which >> involved considering locks from the same locking group as mutually >> non-conflicting, was not OK. Several specific examples were offered - >> e.g. it's clearly not OK for two backends to extend a relation at the >> same time just because the same locking group. So I abandoned that >> approach. 
When I instead proposed the approach of copying only the >> locks that the master *already* had at the beginning of parallelism >> and considering *only those* as mutually conflicting, I believe I got >> several comments to the effect that this was "less scary". >> Considering the topic area, I'm not sure I'm going to do any better >> than that. > > It surely is less scary, but still darn scary. It's just friggin hard to > have a mental model where (self) exclusive locks suddenly aren't that > anymore. It'll get more and more dangerous the more we start to relax > restrictions around parallelism - and that *will* come. Assuming we get a first version committed at some point, yes. > This concern does not only apply to user level commands, but also our > own C code. We will find more and more reasons to parallelize commands > and if we will have problems with suddenly not having a chance to > prevent parallelism during certain actions. Say we parallelize VACUUM > (grand idea for the index scans!) and someone wants to also improve the > truncation. Suddenly the AEL during truncation doesn't give us > guarantees anymore If the VACUUM leader process tries to grab an AccessExclusiveLock after starting parallel mode and before terminating it, surely that's going to conflict with the locks held by any remaining workers at that point. Even if it weren't going to conflict, you'd have to be pretty stupid to code it that way, because you can't remove dead line pointers until you've finished all of the index scans, and you can't very well truncate the relation until you've removed dead line pointers. So to me, this seems like an impossible scenario that wouldn't break even if it were possible. > There'll be many more of these, and we can't really > reason about them because we used to working locks. FUD. I think one thing that might help allay your concerns in this area to a degree is to discuss under what circumstances we think it's safe to create a parallel context in the first place. For example, if you were to insert code that tries to go parallel in the middle of heap_update() or in the middle of relation extension, that's probably not going to work out well with this model, or with any model. It's not *impossible* for it to work out well - if you wanted to extend the relation by a whole bunch of blocks, the worker could do that while the master goes and tries to read from CLOG or something, but you'd have to be awfully careful and it's probably a stupid idea for lots of reasons. It really only makes sense to enter parallelism when the backend is in a relatively quiescent state - like at the beginning of a query, or after vacuum scans the heap but before it scans the indexes. At that point it's a lot easier to reason about what the parallel backends can safely do. Broadly, I think that there's a close connection between what state the master is in when we initiate parallelism and what the workers can safely do. I'm not sure exactly how to describe what the connection is there, but I think it exists, and firming up the way we think about that might make it easier to reason about this. >> 3. I welcome proposals for other ways of handling this problem, even >> if they restrict the functionality that can be offered. For example, >> a proposal that would make parallel_count revert to single-threaded >> mode but terminate without an indefinite wait would be acceptable to >> me, provided that it offers some advantage in safety and security over >> what I've already got.
> > I think the above example where we only grant the locks required for a > specific thing and only on the relevant severity would already be a big > improvement. Even better would be to to use the computed set of locks to > check whether workers could acquire them and refuse paralellism in that > case. I think that will just produce lots of very-hard-to-debug undetected deadlocks and huge volumes of code that tries and fails to assess what locks the worker will need. > Another thing to do is to mark the lock, both in the master and workers, > as not effectively being of the original severity anymore. If the own > process again tries to acquire the lock on a heavier severity than what > was needed at the time of query execution, error out. At least until > parallel mode has confirmed to have ended. That should pretty much never > happen. I'm not sure what you mean by the "severity" of the lock. How about marking the locks that the worker inherited from the parallel master and throwing an error if it tries to lock one of those objects in a mode that it does not already hold? That'd probably catch a sizeable number of programming errors without tripping up many legitimate use cases (and we can always relax or modify the prohibition if it later turns out to be problematic). > I don't see how we can avoid teaching the deadlock detector about all > these silent deadlocks in any cases. By your own reasoning. We're not seeing eye to eye here yet, since I don't accept your example scenario and you don't accept mine. Let's keep discussing. Meanwhile, here's an updated patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Wed, Feb 11, 2015 at 1:59 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I'm not sure what you mean by the "severity" of the lock. How about > marking the locks that the worker inherited from the parallel master > and throwing an error if it tries to lock one of those objects in a > mode that it does not already hold? That'd probably catch a sizeable > number of programming errors without tripping up many legitimate use > cases (and we can always relax or modify the prohibition if it later > turns out to be problematic). Or maybe do that only when the new lock mode is stronger than one it already holds. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
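A rough sketch of that check, under the assumption that there is some way to look up the mode in which a lock was inherited from the leader; the InheritedLockMode() helper and the use of numeric LOCKMODE ordering as "strength" are assumptions for illustration, not part of the patch:

	static void
	CheckInheritedLockMode(const LOCKTAG *tag, LOCKMODE reqmode)
	{
		LOCKMODE	held = InheritedLockMode(tag);	/* hypothetical lookup */

		/* only complain when the requested mode is stronger than inherited */
		if (held != NoLock && reqmode > held)
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
					 errmsg("cannot acquire a stronger lock on an object locked by the parallel leader")));
	}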
On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> We're not seeing eye to eye here yet, since I don't accept your
> example scenario and you don't accept mine. Let's keep discussing.
>
> Meanwhile, here's an updated patch.

A lot of cool activity is showing up here, so moved the patch to CF 2015-02. Perhaps Andres you could add yourself as a reviewer?
--
Michael
On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> We're not seeing eye to eye here yet, since I don't accept your >> example scenario and you don't accept mine. Let's keep discussing. >> >> Meanwhile, here's an updated patch. > > A lot of cool activity is showing up here, so moved the patch to CF 2015-02. > Perhaps Andres you could add yourself as a reviewer? Here's a new version of the patch with (a) some additional restrictions on heavyweight locking both at the start of, and during, parallel mode and (b) a write-up in the README explaining the restrictions and my theory of why the handling of heavyweight locking is safe. Hopefully this goes some way towards addressing Andres's concerns. I've also replaced the specific (and wrong) messages about advisory locks with a more generic message, as previously proposed; and I've fixed at least one bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On 2015-02-11 13:59:04 -0500, Robert Haas wrote: > On Tue, Feb 10, 2015 at 8:45 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > The only reason I'd like it to be active is because that'd *prohibit* > > doing the crazier stuff. There seems little reason to not da it under > > the additional protection against crazy things that'd give us? > > Trying to load additional libraries once parallel mode is in progress > can provide failures, because the load of the libraries causes the > system to do GUC_ACTION_SET on the GUCs whose initialization was > deferred, and that trips up the > no-changing-things-while-in-parallel-mode checks. Oh, that's a good point. > I'm not sure if there's anything we can, or should, do about that. Fine with me. > > Well, ExportSnapshot()/Import has quite a bit more restrictions than > > what you're doing... Most importantly it cannot import in transactions > > unless using read committed isolation, forces xid assignment during > > export, forces the old snapshot to stick around for the whole > > transaction and only works on a primary. I'm not actually sure it makes > > a relevant difference, but it's certainly worth calling attention to. > > > > The fuding I was wondering about certainly is unnecessary though - > > GetSnapshotData() has already performed it... > > > > I'm not particularly awake right now and I think this needs a closer > > look either by someone awake... I'm not fully convinced this is safe. > > I'm not 100% comfortable with it either, but I just spent some time > looking at it and can't see what's wrong with it. Basically, we're > trying to get the parallel worker into a state that matches the > master's state after doing GetTransactionSnapshot() - namely, > CurrentSnapshot should point to the same snapshot on the master, and > FirstSnapshotSet should be true, plus the same additional processing > that GetTransactionSnapshot() would have done if we're in a higher > transaction isolation level. It's possible we don't need to mimic > that state, but it seems like a good idea. I plan to look at this soonish. > Still, I wonder if we ought to be banning GetTransactionSnapshot() > altogether. I'm not sure if there's ever a time when it's safe for a > worker to call that. Why? > >> > Also, you don't seem to > >> > prohibit popping the active snapshot (should that be prohibitted > >> > maybe?) which might bump the initiator's xmin horizon. > >> > >> I think as long as our transaction snapshot is installed correctly our > >> xmin horizon can't advance; am I missing something? > > > > Maybe I'm missing something, but why would that be the case in a read > > committed session? ImportSnapshot() only ever calls > > SetTransactionSnapshot it such a case (why does it contain code to cope > > without it?), but your patch doesn't seem to guarantee that. > > > > But as don't actually transport XactIsoLevel anywhere (omission > > probably?) that seems to mean that the SetTransactionSnapshot() won't do > > much, even if the source transaction is repeatable read. > > Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore > the GUC state? Yes, but afaics the transaction start will just overwrite it again:

static void
StartTransaction(void)
{
	...
	XactDeferrable = DefaultXactDeferrable;
	XactIsoLevel = DefaultXactIsoLevel;
	...
For a client issued BEGIN it works because utility.c does:

		case TRANS_STMT_BEGIN:
		case TRANS_STMT_START:
			{
				ListCell   *lc;

				BeginTransactionBlock();
				foreach(lc, stmt->options)
				{
					DefElem    *item = (DefElem *) lfirst(lc);

					if (strcmp(item->defname, "transaction_isolation") == 0)
						SetPGVariable("transaction_isolation",
									  list_make1(item->arg),
									  true);
					else if (strcmp(item->defname, "transaction_read_only") == 0)
						SetPGVariable("transaction_read_only",
									  list_make1(item->arg),
									  true);
					else if (strcmp(item->defname, "transaction_deferrable") == 0)
						SetPGVariable("transaction_deferrable",
									  list_make1(item->arg),
									  true);
				}

Pretty, isn't it? > > Your manual ones don't either, that's what made me > > complain. E.g. pg_advisory_lock_int8 isn't called that on the SQL > > level. Instead they use the C name + parens. > > On further reflection, I think I should probably just change all of > those to use a general message about advisory locks, i.e.: > > ERROR: cannot use advisory locks during a parallel operation > > That sound good to you? Perfectly fine with me. > > Which is *not* a good example for the problem. Your primary reasoning > > for needing something more than sharing the locks that we know the > > individual query is going to need (which we acquire in the parse > > analzye/planner/executor) is that we can't predict what some random code > > does. Now, I still don't think that argument should hold too much sway > > because we'll only be able to run carefully controlled code > > *anyway*. > > I'll avoid repeating what I've said about this before, except to say > that I still don't believe for a minute you can predict which locks > you'll need. I don't understand. Leaving AEL locks on catalog tables aside, pretty much everything else is easily visible? We already do that for RTE permission checks and such? There might be some holes, but those should rather be fixed anyway. What's so hard about determining the locks required for a query? > > But *if* we take it as reasoning for doing more than granting > > the locks we need for the individual query, we *still* need to cope with > > exactly the variants of the above invisible deadlock. You can still can > > call a function acquiring some form of lock on either side. > > That by itself is not a deadlock. If the worker blocks trying to > acquire a lock that the master held before the start of parallelism, > it will wait forever, even if the master never acquires a lock. But > if the worker blocks trying to acquire a lock that the master acquired > *during* parallelism, it's presumably something like a catalog lock or > a tuple lock or a relation extension lock that the master is going to > release quickly. So the master will finish with that resource and then > the worker will go ahead. You're right that there are other scenarios > to consider, but you need more than that. If there's one lock that's acquired that way, there can easily be two. And then they just need to be acquired in an inconsistent order and you have a deadlock. > > There'll be many more of these, and we can't really > > reason about them because we used to working locks. > > FUD. It certainly scares me. > >> 3. I welcome proposals for other ways of handling this problem, even > >> if they restrict the functionality that can be offered. For example, > >> a proposal that would make parallel_count revert to single-threaded > >> mode but terminate without an indefinite wait would be acceptable to > >> me, provided that it offers some advantage in safety and security over > >> what I've already got.
> > > > I think the above example where we only grant the locks required for a > > specific thing and only on the relevant severity would already be a big > > improvement. Even better would be to use the computed set of locks to > > check whether workers could acquire them and refuse parallelism in that > > case. > > I think that will just produce lots of very-hard-to-debug undetected > deadlocks and huge volumes of code that tries and fails to assess what > locks the worker will need. Teaching the deadlock detector to recognize that... > > Another thing to do is to mark the lock, both in the master and workers, > > as not effectively being of the original severity anymore. If the own > > process again tries to acquire the lock on a heavier severity than what > > was needed at the time of query execution, error out. At least until > > parallel mode has confirmed to have ended. That should pretty much never > > happen. > > I'm not sure what you mean by the "severity" of the lock. The lock level. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
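To make the "locks required for a query are visible" point from the previous message concrete: for a plain read-only query, the relations involved can be read straight off the plan's range table, roughly the way InitPlan() already does when it opens them. This is only an illustration of the idea, not code from either version of the patch:

	ListCell   *lc;

	foreach(lc, plannedstmt->rtable)
	{
		RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

		if (rte->rtekind != RTE_RELATION)
			continue;

		/* a read-only parallel query would need at least this much */
		LockRelationOid(rte->relid, AccessShareLock);
	}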
On Sun, Feb 15, 2015 at 11:48 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 13, 2015 at 2:22 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Feb 12, 2015 at 3:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >>
> >> We're not seeing eye to eye here yet, since I don't accept your
> >> example scenario and you don't accept mine. Let's keep discussing.
> >>
> >> Meanwhile, here's an updated patch.
> >
> > A lot of cool activity is showing up here, so moved the patch to CF 2015-02.
> > Perhaps Andres you could add yourself as a reviewer?
>
> Here's a new version of the patch with (a) some additional
> restrictions on heavyweight locking both at the start of, and during,
> parallel mode and (b) a write-up in the README explaining the
> restrictions and my theory of why the handling of heavyweight locking
> is safe. Hopefully this goes some way towards addressing Andres's
> concerns. I've also replaced the specific (and wrong) messages about
> advisory locks with a more generic message, as previously proposed;
> and I've fixed at least one bug.
>
Today, while testing the parallel_seqscan patch, I encountered one
intermittent issue (it hangs in the function below), and I have a question
related to that function.
+void
+WaitForParallelWorkersToFinish(ParallelContext *pcxt)
+{
..
+ for (;;)
+ {
..
+ CHECK_FOR_INTERRUPTS();
+ for (i = 0; i < pcxt->nworkers; ++i)
+ {
+ if (pcxt->worker[i].error_mqh != NULL)
+ {
+ anyone_alive = true;
+ break;
+ }
+ }
+
+ if (!anyone_alive)
+ break;
+
+ WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
+ ResetLatch(&MyProc->procLatch);
+ }
Isn't there a race condition in this function: after it finds that some
worker is still alive, but before it does WaitLatch(), the worker completes
its work and exits. In such a case, who is going to wake the backend
waiting on procLatch?
On Fri, Mar 6, 2015 at 7:01 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: > Today, while testing parallel_seqscan patch, I encountered one > intermittent issue (it hangs in below function) and I have question > related to below function. > > +void > +WaitForParallelWorkersToFinish(ParallelContext *pcxt) > +{ > .. > + for (;;) > + { > .. > + CHECK_FOR_INTERRUPTS(); > + for (i = 0; i < pcxt->nworkers; ++i) > + { > + if (pcxt->worker[i].error_mqh != NULL) > + { > + anyone_alive = true; > + break; > + } > + } > + > + if (!anyone_alive) > + break; > + > + WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1); > + ResetLatch(&MyProc->procLatch); > + } > > Isn't there a race condition in this function such that after it finds > that there is some alive worker and before it does WaitLatch(), the > worker completes its work and exits, now in such a case who is > going to wake the backend waiting on procLatch? It doesn't matter whether some other backend sets the process latch before we reach WaitLatch() or after we begin waiting. Either way, it's fine. It would be bad if the process could exit without setting the latch at all, though. I hope that's not the case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
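The reason the ordering doesn't matter is the usual latch idiom: the exit condition is re-checked after every ResetLatch(), and a latch that was already set before WaitLatch() just makes the wait return immediately. As a sketch (the worker-liveness re-check here is a hypothetical stand-in for the error_mqh scan in the quoted code):

	for (;;)
	{
		CHECK_FOR_INTERRUPTS();

		if (!any_worker_still_attached(pcxt))	/* hypothetical re-check */
			break;

		WaitLatch(&MyProc->procLatch, WL_LATCH_SET, -1);
		ResetLatch(&MyProc->procLatch);
	}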
On Tue, Feb 17, 2015 at 11:01 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> I'm not 100% comfortable with it either, but I just spent some time >> looking at it and can't see what's wrong with it. Basically, we're >> trying to get the parallel worker into a state that matches the >> master's state after doing GetTransactionSnapshot() - namely, >> CurrentSnapshot should point to the same snapshot on the master, and >> FirstSnapshotSet should be true, plus the same additional processing >> that GetTransactionSnapshot() would have done if we're in a higher >> transaction isolation level. It's possible we don't need to mimic >> that state, but it seems like a good idea. > > I plan to look at this soonish. Did you get a chance to look at this yet? >> Still, I wonder if we ought to be banning GetTransactionSnapshot() >> altogether. I'm not sure if there's ever a time when it's safe for a >> worker to call that. > > Why? I don't know. I looked at it more and I don't really see a problem, but what do I know? >> Doesn't XactIsoLevel get set by assign_XactIsoLevel() when we restore >> the GUC state? > > Yes, but afaics the transaction start will just overwrite it again: > > static void > StartTransaction(void) > { > ... > XactDeferrable = DefaultXactDeferrable; > XactIsoLevel = DefaultXactIsoLevel; > ... Ah, crap. So, yeah, we need to save and restore that, too. Fixed in the attached version. > For a client issued BEGIN it works because utility.c does: > case TRANS_STMT_BEGIN: > case TRANS_STMT_START: > { > ListCell *lc; > > BeginTransactionBlock(); > foreach(lc, stmt->options) > { > DefElem *item = (DefElem *) lfirst(lc); > > if (strcmp(item->defname, "transaction_isolation") == 0) > SetPGVariable("transaction_isolation", > list_make1(item->arg), > true); > else if (strcmp(item->defname, "transaction_read_only") == 0) > SetPGVariable("transaction_read_only", > list_make1(item->arg), > true); > else if (strcmp(item->defname, "transaction_deferrable") == 0) > SetPGVariable("transaction_deferrable", > list_make1(item->arg), > true); > } > > Pretty, isn't it? I think that's just to handle things like BEGIN ISOLATION LEVEL SERIALIZABLE. A plain BEGIN would work fine without that AFAICS. It doesn't seem particularly ugly to me either, but I guess that's neither here nor there. >> I'll avoid repeating what I've said about this before, except to say >> that I still don't believe for a minute you can predict which locks >> you'll need. > > I don't understand. Leaving AEL locks on catalog tables aside, pretty > much everything else is easily visible? We already do that for RTE > permission checks and such? There might be some holes, but those should > rather be fixed anyway. What's so hard about determining the locks > required for a query? Well, if you're only going to call built-in functions, and if you exclude all of the built-in functions that that can take locks on arbitrary objects like pg_get_object_address() and table_to_xml(), and if you leave locks on catalog tables aside, then, given further that we're restricting ourselves to read-only transactions, you *can* determine the locks that will be required for a query -- there won't be any. But one cannot exclude catalog tables by fiat, and all of those other restrictions are ones that I'd like to have a chance of relaxing at some point. It's entirely reasonable for a user to want to parallelize a query that contains a user-defined PL/pgsql function, though, and that might do anything. 
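Concretely, the XactIsoLevel fix mentioned earlier in this message amounts to carrying the isolation level (and, presumably, XactDeferrable) in the state copied to workers and reinstalling it after StartTransaction() has clobbered it with the defaults. A sketch only, with an assumed struct name rather than the patch's actual one:

	typedef struct SerializedTransactionState
	{
		int			xactIsoLevel;
		bool		xactDeferrable;
		/* ... xids, snapshot-related fields, etc. ... */
	} SerializedTransactionState;

	static void
	RestoreTransactionState(SerializedTransactionState *tstate)
	{
		/* must run after StartTransaction() has installed the defaults */
		XactIsoLevel = tstate->xactIsoLevel;
		XactDeferrable = tstate->xactDeferrable;
	}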
> If there's one lock that's acquired that way, there can easily be > two. And then they just need to be acquired in an inconsistent > order and you have a deadlock. There is a detailed and hopefully rigorous analysis of locking-related scenarios in README.parallel in the patch version after the one you reviewed (posted 2015-02-15). Have you looked at that? (It's also in this version.) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Fri, Mar 6, 2015 at 10:16 AM, Robert Haas <robertmhaas@gmail.com> wrote: > [ responses to review comments ] Here's a new version, fixing a bug Amit found (missing SetLatch) and a more significant oversight identified by Noah (XactLastRecEnd propagation) on the "assessing parallel safety" thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
On Tue, Mar 17, 2015 at 6:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Mar 6, 2015 at 10:16 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> [ responses to review comments ] > > Here's a new version, fixing a bug Amit found (missing SetLatch) and a > more significant oversight identified by Noah (XactLastRecEnd > propagation) on the "assessing parallel safety" thread. Rebased. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachments
Hi, Reading the README first, the rest later. So you can comment on my comments, while I actually look at the code. Parallelism, yay! On 2015-03-18 12:02:07 -0400, Robert Haas wrote: > +Instead, we take a more pragmatic approach: we try to make as many of the > +operations that are safe outside of parallel mode work correctly in parallel > +mode as well, and we try to prohibit the rest via suitable error > checks. I'd say that we'd try to prohibit a bunch of common cases. Among them the ones that are triggerable from SQL. We don't really try to prohibit many kinds of traps, as you describe. > + - The authenticated user ID and current database. Each parallel worker > + will connect to the same database as the initiating backend, using the > + same user ID. This sentence immediately makes me wonder why only the authenticated user, and not the currently active role, is mentioned. > + - The values of all GUCs. Accordingly, permanent changes to the value of > + any GUC are forbidden while in parallel mode; but temporary changes, > + such as entering a function with non-NULL proconfig, are potentially OK. "potentially OK" sounds odd to me. Which danger are you seeing that isn't relevant for normal... > + - The combo CID mappings. This is needed to ensure consistent answers to > + tuple visibility checks. The need to synchronize this data structure is > + a major reason why we can't support writes in parallel mode: such writes > + might create new combo CIDs, and we have no way to let other workers > + (or the initiating backend) know about them. Absolutely *not* in the initial version, but we really need to find a better solution for this. > + - The currently active user ID and security context. Note that this is > + the fourth user ID we restore: the initial step of binding to the correct > + database also involves restoring the authenticated user ID. When GUC > + values are restored, this incidentally sets SessionUserId and OuterUserId > + to the correct values. This final step restores CurrentUserId. Ah. That's the answer for above. Could you just move it next to the other user bit? > +We could copy the entire transaction state stack, > +but most of it would be useless: for example, you can't roll back to a > +savepoint from within a parallel worker, and there are no resources > +associated with the memory contexts or resource owners of intermediate > +subtransactions. I do wonder if we're not going to have to change this in the not too far away future. But then, this isn't externally visible at all, so whatever. > +At the end of a parallel operation, which can happen either because it > +completed successfully or because it was interrupted by an error, parallel > +workers associated with that operation exit. In the error case, transaction > +abort processing in the parallel leader kills off any remaining workers, and > +the parallel leader then waits for them to die. Very slightly awkward because first you talk about successful *or* error and then about abort processing. > + - Cleanup of pg_temp namespaces is not done. The initiating backend is > + responsible for this, too. How could a worker have its own pg_temp namespace? > +Lock Management > +=============== > + > +Certain heavyweight locks that the initiating backend holds at the beginning > +of parallelism are copied to each worker, which unconditionally acquires them. > +The parallel backends acquire - without waiting - each such lock that the > +leader holds, even if that lock is self-exclusive.
This creates the unusual > +situation that a lock which could normally only be held by a single backend > +can be shared among several backends in a parallel group. > + > +Obviously, this presents significant hazards that would not be present in > +normal execution. If, for example, a backend were to initiate parallelism > +while ReindexIsProcessingIndex() were true for some index, the parallel > +backends launched at that time would neither share this state nor be excluded > +from accessing the index via the heavyweight lock mechanism. It is therefore > +imperative that backends only initiate parallelism from places where it will > +be safe for parallel workers to access the relations on which they hold locks. > +It is also important that they not afterwards do anything which causes access > +to those relations to become unsafe, or at least not until after parallelism > +has concluded. The fact that parallelism is strictly read-only means that the > +opportunities for such mishaps are few and far between; furthermore, most > +operations which present these hazards are DDL operations, which will be > +rejected by CheckTableNotInUse() if parallel mode is active. > + > +Only relation locks, locks on database or shared objects, and perhaps the lock > +on our transaction ID are copied to workers. "perhaps"? > Advisory locks are not copied, > +but the leader may hold them at the start of parallelism; they cannot > +subsequently be manipulated while parallel mode is active. It is not safe to > +attempt parallelism while holding a lock of any other type, such as a page, > +tuple, or relation extension lock, and such attempts will fail. Although > +there is currently no reason to do so, such locks could be taken and released > +during parallel mode; they merely cannot be held at the start of parallel > +mode, since we would then fail to provide necessary mutual exclusion. Is it really true that no such locks are acquired? What about e.g. hash indexes? They seem to be acquiring page locks while searching. > +Copying locks to workers is important for the avoidance of undetected > +deadlock between the initiating process and its parallel workers. If the > +initiating process holds a lock on an object at the start of parallelism, > +and the worker subsequently attempts to acquire a lock on that same object > +and blocks, this will result in an undetected deadlock, because the > +initiating process cannot finish the transaction (thus releasing the lock) > +until the worker terminates, and the worker cannot acquire the lock while > +the initiating process holds it. Locks which the processes involved acquire > +and then release during parallelism do not present this hazard; they simply > +force the processes involved to take turns accessing the protected resource. I don't think this is a strong guarantee. There very well can be lack of forward progress if they're waiting on each other in some way. Say the master backend holds the lock and waits on output from a worker. The worker then will endlessly wait for the lock to become free. A deadlock. Or, as another scenario, consider cooperating backends that both try to send tuples to each other but the queue is full. A deadlock. To me it seems the deadlock detector has to be enhanced to be able to see 'waiting for' edges. Independent on how we resolve our difference of opinion on the copying of locks. 
It seems to me that this isn't all that hard: Whenever we block waiting for another backend we enter ourselves on the wait queue to that backend's virtual transaction. When finished we take the blocking backend off. That, afaics, should do it. Alternatively we can just publish what backend we're waiting for in PGPROC and make deadlock also look at that; but, while slightly cleaner, that looks like being more invasive. > +Copying locks to workers is also important for the avoidance of undetected > +deadlock involving both the parallel processe and other processes. "processe" > For > +example, suppose processes A1 and A2 are cooperating parallel processes and > +B is an unrelated process. Suppose A1 holds a lock L1 and waits for A2 to > +finish; B holds a lock L2 and blocks waiting for L1; and A2 blocks waiting > +for L2. Without lock copying, the waits-for graph is A2 -> B -> A1; there > +is no cycle. With lock copying, A2 will also hold the lock on L1 and the > +deadlock detector can find the cycle A2 -> B -> A2. As in the case of > +deadlocks within the parallel group, undetected deadlock occur if either A1 > +or A2 acquired a lock after the start of parallelism and attempted to > +retain it beyond the end of parallelism. The prohibitions discussed above > +protect us against this case. I think we'd need to add more restrictions to actually make this guarantee anything. At the very least it would not only have to be prohibited to end with a lock held, but also to wait for a worker (or the leader) backend with a not initially granted lock. Am I missing something or does the copying currently break deadlock.c? Because afaics that'll compute lock conflicts in FindLockCycleRecurse() without being aware of the conflicting lock being granted to two backends. Won't this at least trigger spurious deadlocks? It might happen to be without consequence for some reason, but this would, at the very least, need some very careful review. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
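The "publish what we're waiting for in PGPROC" alternative could be as small as an extra field plus a pair of helpers, with the deadlock detector taught to follow that edge. The field name and helpers below are assumptions for illustration, not anything in the patch:

	/* hypothetical addition to struct PGPROC: */
	/*     PGPROC *waitingForBackend;    backend whose output we are blocked on */

	static void
	MarkWaitingForBackend(PGPROC *other)
	{
		/* a real implementation would publish this under a suitable lock */
		MyProc->waitingForBackend = other;
	}

	static void
	ClearWaitingForBackend(void)
	{
		MyProc->waitingForBackend = NULL;
	}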
On 2015-03-18 12:02:07 -0400, Robert Haas wrote: > diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c > index cb6f8a3..173f0ba 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -2234,6 +2234,17 @@ static HeapTuple > heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid, > CommandId cid, int options) > { > + /* > + * For now, parallel operations are required to be strictly read-only. > + * Unlike heap_update() and heap_delete(), an insert should never create > + * a combo CID, so it might be possible to relax this restrction, but > + * not without more thought and testing. > + */ > + if (IsInParallelMode()) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > + errmsg("cannot insert tuples during a parallel operation"))); > + Minor nitpick: Should we perhaps move the ereport to a separate function? Akin to PreventTransactionChain()? Seems a bit nicer to not have the same message everywhere. > +void > +DestroyParallelContext(ParallelContext *pcxt) > +{ > + int i; > + > + /* > + * Be careful about order of operations here! We remove the parallel > + * context from the list before we do anything else; otherwise, if an > + * error occurs during a subsequent step, we might try to nuke it again > + * from AtEOXact_Parallel or AtEOSubXact_Parallel. > + */ > + dlist_delete(&pcxt->node); Hm. I'm wondering about this. What if we actually fail below? Like in dsm_detach() or it's callbacks? Then we'll enter abort again at some point (during proc_exit at the latest) but will not wait for the child workers. Right? > +/* > + * End-of-subtransaction cleanup for parallel contexts. > + * > + * Currently, it's forbidden to enter or leave a subtransaction while > + * parallel mode is in effect, so we could just blow away everything. But > + * we may want to relax that restriction in the future, so this code > + * contemplates that there may be multiple subtransaction IDs in pcxt_list. > + */ > +void > +AtEOSubXact_Parallel(bool isCommit, SubTransactionId mySubId) > +{ > + while (!dlist_is_empty(&pcxt_list)) > + { > + ParallelContext *pcxt; > + > + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list); > + if (pcxt->subid != mySubId) > + break; > + if (isCommit) > + elog(WARNING, "leaked parallel context"); > + DestroyParallelContext(pcxt); > + } > +} > +/* > + * End-of-transaction cleanup for parallel contexts. > + */ > +void > +AtEOXact_Parallel(bool isCommit) > +{ > + while (!dlist_is_empty(&pcxt_list)) > + { > + ParallelContext *pcxt; > + > + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list); > + if (isCommit) > + elog(WARNING, "leaked parallel context"); > + DestroyParallelContext(pcxt); > + } > +} Is there any reason to treat the isCommit case as a warning? To me that sounds like a pretty much guaranteed programming error. If your're going to argue that a couple resource leakage warnings do similarly: I don't think that counts for that much. For one e.g. a leaked tupdesc has much less consequences, for another IIRC those warnings were introduced after the fact. > + * When running as a parallel worker, we place only a single > + * TransactionStateData is placed on the parallel worker's > + * state stack, 'we place .. 
is placed' > /* > + * EnterParallelMode > + */ > +void > +EnterParallelMode(void) > +{ > + TransactionState s = CurrentTransactionState; > + > + Assert(s->parallelModeLevel >= 0); > + > + ++s->parallelModeLevel; > +} > + > +/* > + * ExitParallelMode > + */ > +void > +ExitParallelMode(void) > +{ > + TransactionState s = CurrentTransactionState; > + > + Assert(s->parallelModeLevel > 0); > + Assert(s->parallelModeLevel > 1 || !ParallelContextActive()); > + > + --s->parallelModeLevel; > + > + if (s->parallelModeLevel == 0) > + CheckForRetainedParallelLocks(); > +} Hm. Is it actually ok for nested parallel mode to retain locks on their own? Won't that pose a problem? Or did you do it that way just because we don't have more state? If so that deserves a comment explaining that htat's the case and why that's acceptable. > @@ -1769,6 +1904,9 @@ CommitTransaction(void) > { > TransactionState s = CurrentTransactionState; > TransactionId latestXid; > + bool parallel; > + > + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS); > + /* If we might have parallel workers, clean them up now. */ > + if (IsInParallelMode()) > + { > + CheckForRetainedParallelLocks(); > + AtEOXact_Parallel(true); > + s->parallelModeLevel = 0; > + } 'parallel' looks strange when we're also, rightly so, do stuff like checking IsInParallelMode(). How about naming it is_parallel_worker or something? Sorry, ran out of concentration here. It's been a long day. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
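For orientation, here is a rough sketch of how a caller is expected to drive the functions quoted above. The call sequence follows the patch's README; the CreateParallelContext()/LaunchParallelWorkers() signatures, the worker entry-point signature, and the header names are assumptions made for illustration rather than the patch's actual declarations.

    #include "postgres.h"
    #include "access/parallel.h"        /* assumed header for the new API */

    /* Worker entry point; signature is assumed for illustration. */
    static void
    my_worker_main(dsm_segment *seg, shm_toc *toc)
    {
        /* runs in each worker; must remain strictly read-only */
    }

    static void
    run_something_in_parallel(int nworkers)
    {
        ParallelContext *pcxt;

        EnterParallelMode();            /* writes, DDL, GUC changes now rejected */

        pcxt = CreateParallelContext(my_worker_main, nworkers); /* assumed signature */
        /* ... any shared-memory setup for pcxt would happen here ... */
        LaunchParallelWorkers(pcxt);    /* assumed name */

        /*
         * Cooperate with the workers.  Any locks taken here must be released
         * again before waiting, per the rules discussed in this thread.
         */

        WaitForParallelWorkersToFinish(pcxt);   /* also checks for retained locks */
        DestroyParallelContext(pcxt);           /* unlinks itself from pcxt_list first */

        ExitParallelMode();
    }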
On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > Reading the README first, the rest later. So you can comment on my > comments, while I actually look at the code. Parallelism, yay! Sorry, we don't support parallelism yet. :-) > On 2015-03-18 12:02:07 -0400, Robert Haas wrote: >> +Instead, we take a more pragmatic approach: we try to make as many of the >> +operations that are safe outside of parallel mode work correctly in parallel >> +mode as well, and we try to prohibit the rest via suitable error >> checks. > > I'd say that we'd try to prohibit a bunch of common cases. Among them > the ones that are triggerable from SQL. We don't really try to prohibit > many kinds of traps, as you describe. I revised the text along these lines. >> + - The values of all GUCs. Accordingly, permanent changes to the value of >> + any GUC are forbidden while in parallel mode; but temporary changes, >> + such as entering a function with non-NULL proconfig, are potentially OK. > > "potentially OK" sounds odd to me. Which danger are you seing that isn't > relevan tfor norm Removed "potentially". >> + - The combo CID mappings. This is needed to ensure consistent answers to >> + tuple visibility checks. The need to synchronize this data structure is >> + a major reason why we can't support writes in parallel mode: such writes >> + might create new combo CIDs, and we have now way to let other workers >> + (or the initiating backend) know about them. > > Absolutely *not* in the initial version, but we really need to find a > better solution for this. I have some ideas, but that's not for right now. >> + - The currently active user ID and security context. Note that this is >> + the fourth user ID we restore: the initial step of binding to the correct >> + database also involves restoring the authenticated user ID. When GUC >> + values are restored, this incidentally sets SessionUserId and OuterUserId >> + to the correct values. This final step restores CurrentUserId. > > Ah. That's the answer for above. Could you just move it next to the > other user bit? Well, I think it's good to keep this in the same order it happens. That's almost true, with the exception of the libraries, which were out of order. (I've fixed that now.) >> +We could copy the entire transaction state stack, >> +but most of it would be useless: for example, you can't roll back to a >> +savepoint from within a parallel worker, and there are no resources to >> +associated with the memory contexts or resource owners of intermediate >> +subtransactions. > > I do wonder if we're not going to have to change this in the not too far > away future. But then, this isn't externally visible at all, so > whatever. I definitely thought about copying the whole stack, but Heikki suggested this approach, and I didn't see a reason to argue with it. As you say, we can change it in the future if it proves problematic, but so far I don't see a problem. >> +At the end of a parallel operation, which can happen either because it >> +completed successfully or because it was interrupted by an error, parallel >> +workers associated with that operation exit. In the error case, transaction >> +abort processing in the parallel leader kills of any remaining workers, and >> +the parallel leader then waits for them to die. > > Very slightly awkward because first you talk about successful *or* error > and then about abort processing. I don't understand what's awkward about that. 
I make a general statement about what happens at the end of a parallel operation, and then the next few sentences follow up by explaining what happens in the error case, and what happens in the success case. >> + - Cleanup of pg_temp namespaces is not done. The initiating backend is >> + responsible for this, too. > > How could a worker have its own pg_temp namespace? It can't. My point here is that it won't clean up the master's pg_temp namespace. I'll add an explicit prohibition against accessing temporary relations and clarify these remarks. >> +Lock Management >> +=============== >> + >> +Certain heavyweight locks that the initiating backend holds at the beginning >> +of parallelism are copied to each worker, which unconditionally acquires them. >> +The parallel backends acquire - without waiting - each such lock that the >> +leader holds, even if that lock is self-exclusive. This creates the unusual >> +situation that a lock which could normally only be held by a single backend >> +can be shared among several backends in a parallel group. >> + >> +Obviously, this presents significant hazards that would not be present in >> +normal execution. If, for example, a backend were to initiate parallelism >> +while ReindexIsProcessingIndex() were true for some index, the parallel >> +backends launched at that time would neither share this state nor be excluded >> +from accessing the index via the heavyweight lock mechanism. It is therefore >> +imperative that backends only initiate parallelism from places where it will >> +be safe for parallel workers to access the relations on which they hold locks. >> +It is also important that they not afterwards do anything which causes access >> +to those relations to become unsafe, or at least not until after parallelism >> +has concluded. The fact that parallelism is strictly read-only means that the >> +opportunities for such mishaps are few and far between; furthermore, most >> +operations which present these hazards are DDL operations, which will be >> +rejected by CheckTableNotInUse() if parallel mode is active. >> + >> +Only relation locks, locks on database or shared objects, and perhaps the lock >> +on our transaction ID are copied to workers. > > "perhaps"? I just meant "if we have one". Changed to "and any lock on our transaction ID". >> Advisory locks are not copied, >> +but the leader may hold them at the start of parallelism; they cannot >> +subsequently be manipulated while parallel mode is active. It is not safe to >> +attempt parallelism while holding a lock of any other type, such as a page, >> +tuple, or relation extension lock, and such attempts will fail. Although >> +there is currently no reason to do so, such locks could be taken and released >> +during parallel mode; they merely cannot be held at the start of parallel >> +mode, since we would then fail to provide necessary mutual exclusion. > > Is it really true that no such locks are acquired? What about e.g. hash > indexes? They seem to be acquiring page locks while searching. Ah, right. I have removed "although ... do so". I was thinking that until we allow writes it wouldn't come up, but that's wrong. >> +Copying locks to workers is important for the avoidance of undetected >> +deadlock between the initiating process and its parallel workers. 
If the >> +initiating process holds a lock on an object at the start of parallelism, >> +and the worker subsequently attempts to acquire a lock on that same object >> +and blocks, this will result in an undetected deadlock, because the >> +initiating process cannot finish the transaction (thus releasing the lock) >> +until the worker terminates, and the worker cannot acquire the lock while >> +the initiating process holds it. Locks which the processes involved acquire >> +and then release during parallelism do not present this hazard; they simply >> +force the processes involved to take turns accessing the protected resource. > > I don't think this is a strong guarantee. There very well can be lack of > forward progress if they're waiting on each other in some way. Say the > master backend holds the lock and waits on output from a worker. The > worker then will endlessly wait for the lock to become free. A > deadlock. Or, as another scenario, consider cooperating backends that > both try to send tuples to each other but the queue is full. A deadlock. The idea is that both the master and the workers are restricted to locks which they take and release again. The idea is, specifically, to allow syscache lookups. Taking a lock and then waiting for the worker is no good, which is why WaitForParallelWorkersToFinish() calls CheckForRetainedParallelLocks(). That would catch your first scenario. Your second scenario seems to me to be a different kind of problem. If that comes up, what you're going to want to do is rewrite the workers to avoid the deadlock by using non-blocking messaging. (The recent discussions of fixing similar deadlocks where a libpq client and a postgres server are both blocked on write while both output buffers are full centers around similar issues.) Detecting the deadlock and aborting is better than nothing, but not by much. In any case, I don't think it's really this patch's job to prevent deadlocks in code that doesn't exist today and in no way involves the lock manager. > To me it seems the deadlock detector has to be enhanced to be able to > see 'waiting for' edges. Independent on how we resolve our difference of > opinion on the copying of locks. > > It seems to me that this isn't all that hard: Whenever we block waiting > for another backend we enter ourselves on the wait queue to that > backend's virtual transaction. When finished we take the blocking > backend off. That, afaics, should do it. Alternatively we can just > publish what backend we're waiting for in PGPROC and make deadlock also > look at that; but, while slightly cleaner, that looks like being more > invasive. That's an interesting idea. It would be more flexible than what I've got here right now, in that parallel backends could take and retain locks on arbitrary objects, and we'd only error out if it actually created a deadlock, instead of erroring out because of the potential for a deadlock under some unlikely circumstances. But it can't be done with existing lock manager APIs - right now there is no way to put yourself on a wait queue for a virtual transaction except to try to acquire a conflicting lock, and that's no good because then you aren't actually trying to read data from it. You'd need some kind of API that says "pretend I'm waiting for this lock, but don't really wait for it", and you'd need to be darn sure that you removed yourself from the wait queue again before doing any other heavyweight lock manipulation. Do you have specific thoughts on how to implement this? 
>> +Copying locks to workers is also important for the avoidance of undetected >> +deadlock involving both the parallel processe and other processes. > > "processe" Fixed. >> For >> +example, suppose processes A1 and A2 are cooperating parallel processes and >> +B is an unrelated process. Suppose A1 holds a lock L1 and waits for A2 to >> +finish; B holds a lock L2 and blocks waiting for L1; and A2 blocks waiting >> +for L2. Without lock copying, the waits-for graph is A2 -> B -> A1; there >> +is no cycle. With lock copying, A2 will also hold the lock on L1 and the >> +deadlock detector can find the cycle A2 -> B -> A2. As in the case of >> +deadlocks within the parallel group, undetected deadlock occur if either A1 >> +or A2 acquired a lock after the start of parallelism and attempted to >> +retain it beyond the end of parallelism. The prohibitions discussed above >> +protect us against this case. > > I think we'd need to add more restrictions to actually make this > guarantee anything. At the very least it would not only have to be > prohibited to end with a lock held, but also to wait for a worker (or > the leader) backend with a not initially granted lock. I don't think so. The latter is safe. The other backend will finish with the lock and then you get it afterwards. > Am I missing something or does the copying currently break deadlock.c? > Because afaics that'll compute lock conflicts in FindLockCycleRecurse() > without being aware of the conflicting lock being granted to two > backends. Won't this at least trigger spurious deadlocks? It might > happen to be without consequence for some reason, but this would, at the > very least, need some very careful review. There may be a problem here, but I'm not seeing it. Initially, we're blocking on a lock that we don't already hold, so it's not one of the copied ones. If we follow one or more waits-for edges and arrive back at a copied lock, that's a real deadlock and should be reported as such. Here is yet another version of this patch. In addition to the fixes mentioned above, this version includes some minor rebasing around recent commits, and also better handling of the case where we discover that we cannot launch workers after all. This can happen because (1) dynamic_shared_memory_type=none, (2) the maximum number of DSM segments supported by the system configuration are already in use, or (3) the user creates a parallel context with nworkers=0. In any of those cases, the system will now create a backend-private memory segment instead of a dynamic shared memory segment, and will skip steps that don't need to be done in that case. This is obviously an undesirable scenario. If we choose a parallel sequential scan, we want it to launch workers and really run in parallel. Hopefully, in case (1) or case (3), we will avoid choosing a parallel plan in the first place, but case (2) is pretty hard to avoid completely, as we have no idea what other processes may or may not be doing with dynamic shared memory segments ... and, in any case, degrading to non-parallel execution beats failing outright. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote: > On 2015-03-18 12:02:07 -0400, Robert Haas wrote: >> diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c >> index cb6f8a3..173f0ba 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -2234,6 +2234,17 @@ static HeapTuple >> heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid, >> CommandId cid, int options) >> { >> + /* >> + * For now, parallel operations are required to be strictly read-only. >> + * Unlike heap_update() and heap_delete(), an insert should never create >> + * a combo CID, so it might be possible to relax this restrction, but >> + * not without more thought and testing. >> + */ >> + if (IsInParallelMode()) >> + ereport(ERROR, >> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), >> + errmsg("cannot insert tuples during a parallel operation"))); >> + > > Minor nitpick: Should we perhaps move the ereport to a separate > function? Akin to PreventTransactionChain()? Seems a bit nicer to not > have the same message everywhere. Well, it's not actually the same message. They're all a bit different. Or mostly all of them. And the variable part is not a command name, as in the PreventTransactionChain() case, so it would affect translatability if we did that. I don't actually think this is a big deal. >> +void >> +DestroyParallelContext(ParallelContext *pcxt) >> +{ >> + int i; >> + >> + /* >> + * Be careful about order of operations here! We remove the parallel >> + * context from the list before we do anything else; otherwise, if an >> + * error occurs during a subsequent step, we might try to nuke it again >> + * from AtEOXact_Parallel or AtEOSubXact_Parallel. >> + */ >> + dlist_delete(&pcxt->node); > > Hm. I'm wondering about this. What if we actually fail below? Like in > dsm_detach() or it's callbacks? Then we'll enter abort again at some > point (during proc_exit at the latest) but will not wait for the child > workers. Right? Right. It's actually pretty hard to recover when things that get called in the abort path fail, which is why dsm_detach() and shm_mq_detach_callback() try pretty hard to only do things that are no-fail. For example, failing to detach a dsm gives a WARNING, not an ERROR. Now, I did make some attempt in previous patches to ensure that proc_exit() has some ability to recover even if a callback fails (cf 001a573a2011d605f2a6e10aee9996de8581d099) but I'm not too sure how useful that's going to be in practice. I'm open to ideas on how to improve this. >> +void >> +AtEOXact_Parallel(bool isCommit) >> +{ >> + while (!dlist_is_empty(&pcxt_list)) >> + { >> + ParallelContext *pcxt; >> + >> + pcxt = dlist_head_element(ParallelContext, node, &pcxt_list); >> + if (isCommit) >> + elog(WARNING, "leaked parallel context"); >> + DestroyParallelContext(pcxt); >> + } >> +} > > Is there any reason to treat the isCommit case as a warning? To me that > sounds like a pretty much guaranteed programming error. If your're going > to argue that a couple resource leakage warnings do similarly: I don't > think that counts for that much. For one e.g. a leaked tupdesc has much > less consequences, for another IIRC those warnings were introduced > after the fact. Yeah, I'm going to argue that. A leaked parallel context is pretty harmless if there are no workers associated with it. If there are, it's less harmless, of course, but it doesn't seem to me that making that an ERROR buys us much. 
I mean, the transaction is going to end either way, and a message is going to get printed either way, and it's a bug either way, so whatever. >> + * When running as a parallel worker, we place only a single >> + * TransactionStateData is placed on the parallel worker's >> + * state stack, > > 'we place .. is placed' Will fix in next update. >> + if (s->parallelModeLevel == 0) >> + CheckForRetainedParallelLocks(); >> +} > > Hm. Is it actually ok for nested parallel mode to retain locks on their > own? Won't that pose a problem? Or did you do it that way just because > we don't have more state? If so that deserves a comment explaining that > htat's the case and why that's acceptable. The only time it's really a problem to retain locks is if you are doing WaitForParallelWorkersToFinish(). This is pretty much just a belt-and-suspenders check to make it easier to notice that you've goofed. But, sure, removing the if part makes sense. I'll do that in the next update. >> @@ -1769,6 +1904,9 @@ CommitTransaction(void) >> { >> TransactionState s = CurrentTransactionState; >> TransactionId latestXid; >> + bool parallel; >> + >> + parallel = (s->blockState == TBLOCK_PARALLEL_INPROGRESS); > >> + /* If we might have parallel workers, clean them up now. */ >> + if (IsInParallelMode()) >> + { >> + CheckForRetainedParallelLocks(); >> + AtEOXact_Parallel(true); >> + s->parallelModeLevel = 0; >> + } > > 'parallel' looks strange when we're also, rightly so, do stuff like > checking IsInParallelMode(). How about naming it is_parallel_worker or > something? Yeah, makes sense. Will do that, too. > Sorry, ran out of concentration here. It's been a long day. Thanks for the review so far. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2015-03-19 14:13:59 -0400, Robert Haas wrote: > On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: > > Reading the README first, the rest later. So you can comment on my > > comments, while I actually look at the code. Parallelism, yay! > > Sorry, we don't support parallelism yet. :-) And not even proper sequential work apparently... > >> + - The currently active user ID and security context. Note that this is > >> + the fourth user ID we restore: the initial step of binding to the correct > >> + database also involves restoring the authenticated user ID. When GUC > >> + values are restored, this incidentally sets SessionUserId and OuterUserId > >> + to the correct values. This final step restores CurrentUserId. > > > > Ah. That's the answer for above. Could you just move it next to the > > other user bit? > > Well, I think it's good to keep this in the same order it happens. > That's almost true, with the exception of the libraries, which were > out of order. (I've fixed that now.) I don't find this convincing in the least. This should explain a developer which state he can rely on being shared. For that a sane order is helpful. In which precise order this happens doesn't seem to matter for that at all; it's also likely ot get out of date. > >> +At the end of a parallel operation, which can happen either because it > >> +completed successfully or because it was interrupted by an error, parallel > >> +workers associated with that operation exit. In the error case, transaction > >> +abort processing in the parallel leader kills of any remaining workers, and > >> +the parallel leader then waits for them to die. > > > > Very slightly awkward because first you talk about successful *or* error > > and then about abort processing. > > I don't understand what's awkward about that. I make a general > statement about what happens at the end of a parallel operation, and > then the next few sentences follow up by explaining what happens in > the error case, and what happens in the success case. I seem to have completely overread the "In the error case, " part of the sentence. Forget what I wrote. > >> +Copying locks to workers is important for the avoidance of undetected > >> +deadlock between the initiating process and its parallel workers. If the > >> +initiating process holds a lock on an object at the start of parallelism, > >> +and the worker subsequently attempts to acquire a lock on that same object > >> +and blocks, this will result in an undetected deadlock, because the > >> +initiating process cannot finish the transaction (thus releasing the lock) > >> +until the worker terminates, and the worker cannot acquire the lock while > >> +the initiating process holds it. Locks which the processes involved acquire > >> +and then release during parallelism do not present this hazard; they simply > >> +force the processes involved to take turns accessing the protected resource. > > > > I don't think this is a strong guarantee. There very well can be lack of > > forward progress if they're waiting on each other in some way. Say the > > master backend holds the lock and waits on output from a worker. The > > worker then will endlessly wait for the lock to become free. A > > deadlock. Or, as another scenario, consider cooperating backends that > > both try to send tuples to each other but the queue is full. A deadlock. > > The idea is that both the master and the workers are restricted to > locks which they take and release again. 
The idea is, specifically, > to allow syscache lookups. Taking a lock and then waiting for the > worker is no good, which is why WaitForParallelWorkersToFinish() calls > CheckForRetainedParallelLocks(). That would catch your first > scenario. > > Your second scenario seems to me to be a different kind of problem. > If that comes up, what you're going to want to do is rewrite the > workers to avoid the deadlock by using non-blocking messaging. I don't think that actually really solves the problem. You'll still end up blocking with full "pipes" at some point. Whether you do it inside the messaging or outside the messaging code doesn't particularly matter. > In any case, I don't think it's really this patch's job to prevent > deadlocks in code that doesn't exist today and in no way involves the > lock manager. Well, you're arguing that we need a solution in the parallelism infrastructure for the lock deadlocks problem. I don't think it's absurd to extend care to other cases. > > To me it seems the deadlock detector has to be enhanced to be able to > > see 'waiting for' edges. Independent on how we resolve our difference of > > opinion on the copying of locks. > > > > It seems to me that this isn't all that hard: Whenever we block waiting > > for another backend we enter ourselves on the wait queue to that > > backend's virtual transaction. When finished we take the blocking > > backend off. That, afaics, should do it. Alternatively we can just > > publish what backend we're waiting for in PGPROC and make deadlock also > > look at that; but, while slightly cleaner, that looks like being more > > invasive. > > That's an interesting idea. It would be more flexible than what I've > got here right now, in that parallel backends could take and retain > locks on arbitrary objects, and we'd only error out if it actually > created a deadlock, instead of erroring out because of the potential > for a deadlock under some unlikely circumstances. It also seems like it'd be able to deal with a bunch of scenarios that the current approach wouldn't be able to deal with. > But it can't be > done with existing lock manager APIs - right now there is no way to > put yourself on a wait queue for a virtual transaction except to try > to acquire a conflicting lock, and that's no good because then you > aren't actually trying to read data from it. Right. > You'd need some kind of > API that says "pretend I'm waiting for this lock, but don't really > wait for it", and you'd need to be darn sure that you removed yourself > from the wait queue again before doing any other heavyweight lock > manipulation. Do you have specific thoughts on how to implement this? I've thought some about this, and I think it's a bit easier to not do it on the actual lock waitqueues, but teach deadlock.c about that kind of blocking. deadlock.c is far from simple, and at least I don't find the control flow to be particularly clear. So it's not easy. It'd be advantageous to tackle things at that level because it'd avoid the need to acquire locks on some lock's waitqueue when blocking; we're going to do that a lot. But It seems to me that it should be possible to suceed: In FindLockCycleRecurse(), in the case that we're not waiting for an actual lock (checkProc->links.next == NULL) we can add a case that considers the 'blocking parallelism' case. ISTM that that's just a FindLockCycleRecurse() call on the process that we're waiting for. 
We'd either have to find the other side's locktag for DEADLOCK_INFO or invent another field in there; but that seems like a solveable problem. > > Am I missing something or does the copying currently break deadlock.c? > > Because afaics that'll compute lock conflicts in FindLockCycleRecurse() > > without being aware of the conflicting lock being granted to two > > backends. Won't this at least trigger spurious deadlocks? It might > > happen to be without consequence for some reason, but this would, at the > > very least, need some very careful review. > > There may be a problem here, but I'm not seeing it. Initially, we're > blocking on a lock that we don't already hold, so it's not one of the > copied ones. If we follow one or more waits-for edges and arrive back > at a copied lock, that's a real deadlock and should be reported as > such. That's a fair point. I was worried that this is going to introduce additional hard edges between processes. I think it actually might, but only when a lock is upgraded; which really shouldn't happen for the copied locks. Also: Man, trying to understand the guts of deadlock.c only made me understand how *friggin* expensive deadlock checking is. I'm really rather surprised that it only infrequently causes problems. I think I have seen a report or two where it might have been the deadlock check that went bonkers during schema upgrades on larger setups, but that's it. I'm not sure if I said that somewhere before: If we aborted parallelism if any of the to-be-copied locks would conflict with its copy, instead of duplicating them, I could live with this. Then this would amount to jumping the lock queue, which seems reasonable to me. Since not being able to do parallelism already needs to be handled gracefull, this doesn't seem to be too bad. Greetings, Andres Freund --Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > Also: Man, trying to understand the guts of deadlock.c only made me > understand how *friggin* expensive deadlock checking is. I'm really > rather surprised that it only infrequently causes problems. The reason for that is that we only run deadlock checking if something's been waiting for at least one second, which pretty much takes it out of any performance-relevant code pathway. I think it would be a serious error to put any deadlock-checking-like behavior into mainline code. Which probably means that Andres is right that teaching deadlock.c about any new sources of deadlock is the way to approach this. regards, tom lane
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> >> + - The currently active user ID and security context. Note that this is >> >> + the fourth user ID we restore: the initial step of binding to the correct >> >> + database also involves restoring the authenticated user ID. When GUC >> >> + values are restored, this incidentally sets SessionUserId and OuterUserId >> >> + to the correct values. This final step restores CurrentUserId. >> > >> > Ah. That's the answer for above. Could you just move it next to the >> > other user bit? >> >> Well, I think it's good to keep this in the same order it happens. >> That's almost true, with the exception of the libraries, which were >> out of order. (I've fixed that now.) > > I don't find this convincing in the least. This should explain a > developer which state he can rely on being shared. For that a sane order > is helpful. In which precise order this happens doesn't seem to matter > for that at all; it's also likely ot get out of date. I'm hoping we can agree to disagree on this point. I like my order order better, you like your order better; we're arguing about the ordering of paragraphs in a README. >> You'd need some kind of >> API that says "pretend I'm waiting for this lock, but don't really >> wait for it", and you'd need to be darn sure that you removed yourself >> from the wait queue again before doing any other heavyweight lock >> manipulation. Do you have specific thoughts on how to implement this? > > I've thought some about this, and I think it's a bit easier to not do it > on the actual lock waitqueues, but teach deadlock.c about that kind of > blocking. > > deadlock.c is far from simple, and at least I don't find the control > flow to be particularly clear. So it's not easy. It'd be advantageous > to tackle things at that level because it'd avoid the need to acquire > locks on some lock's waitqueue when blocking; we're going to do that a > lot. > > But It seems to me that it should be possible to suceed: In > FindLockCycleRecurse(), in the case that we're not waiting for an actual > lock (checkProc->links.next == NULL) we can add a case that considers > the 'blocking parallelism' case. ISTM that that's just a > FindLockCycleRecurse() call on the process that we're waiting for. We'd > either have to find the other side's locktag for DEADLOCK_INFO or invent > another field in there; but that seems like a solveable problem. I'll take a look at this. Thanks for the pointers. > That's a fair point. I was worried that this is going to introduce > additional hard edges between processes. I think it actually might, but > only when a lock is upgraded; which really shouldn't happen for the > copied locks. Agreed. And if it does, and we get a deadlock, I think I'm sorta OK with that. It's a known fact that lock upgrades are a deadlock hazard, and the possible existence of obscure scenarios where that's more likely when parallelism is involved than without it doesn't bother me terribly. I mean, we have to look at the specifics, but I think it's far from obvious that > I'm not sure if I said that somewhere before: If we aborted parallelism > if any of the to-be-copied locks would conflict with its copy, instead > of duplicating them, I could live with this. Then this would amount to > jumping the lock queue, which seems reasonable to me. Since not being > able to do parallelism already needs to be handled gracefull, this > doesn't seem to be too bad. 
It's a tempting offer since it would shorten the path to getting this committed, but I think that's leaving a lot on the table. In particular, if we ever want to have parallel CLUSTER, VACUUM FULL, or ALTER TABLE, that's not going to get us there. If we don't have the right infrastructure to support that in whatever the initial commit is, fine. But we've got to have a plan that gets us there, or we're just boxing ourselves into a corner. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas wrote: > On Wed, Mar 18, 2015 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote: > >> + /* > >> + * For now, parallel operations are required to be strictly read-only. > >> + * Unlike heap_update() and heap_delete(), an insert should never create > >> + * a combo CID, so it might be possible to relax this restrction, but > >> + * not without more thought and testing. > >> + */ > >> + if (IsInParallelMode()) > >> + ereport(ERROR, > >> + (errcode(ERRCODE_INVALID_TRANSACTION_STATE), > >> + errmsg("cannot insert tuples during a parallel operation"))); > >> + > > > > Minor nitpick: Should we perhaps move the ereport to a separate > > function? Akin to PreventTransactionChain()? Seems a bit nicer to not > > have the same message everywhere. > > Well, it's not actually the same message. They're all a bit > different. Or mostly all of them. And the variable part is not a > command name, as in the PreventTransactionChain() case, so it would > affect translatability if we did that. Three things that vary are 1) the fact that some check IsParallelWorker() and others check IsInParallelMode(), and 2) that some of them are ereports() while others are elog(), and 3) that some are ERROR and others are FATAL. Is there a reason for these differences? (Note typo "restrction" in quoted paragraph above.) Maybe something similar to what commit f88d4cfc did: have a function where all possible messages are together, and one is selected with some enum. That way, it's easier to maintain consistency if more cases are added in the future. > I don't actually think this is a big deal. Yeah. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
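As a concrete illustration of what Alvaro is suggesting, a helper along the lines of the sketch below would keep all of the messages in one place and select one via an enum; every name and message string here is hypothetical, and (as Robert notes) each string would still have to be marked for translation individually.

    #include "postgres.h"
    #include "access/xact.h"            /* IsInParallelMode(); assumed location */

    typedef enum ParallelUnsafeOp
    {
        PARALLEL_OP_INSERT,
        PARALLEL_OP_UPDATE,
        PARALLEL_OP_DELETE
    } ParallelUnsafeOp;

    /*
     * Hypothetical consolidated check, in the spirit of commit f88d4cfc:
     * one function owns every "cannot ... during a parallel operation"
     * message, and the caller picks one with an enum value.
     */
    static void
    PreventOperationDuringParallelMode(ParallelUnsafeOp op)
    {
        /* each string would still need gettext_noop() for translatability */
        static const char *const msgs[] = {
            "cannot insert tuples during a parallel operation",
            "cannot update tuples during a parallel operation",
            "cannot delete tuples during a parallel operation"
        };

        if (IsInParallelMode())
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("%s", msgs[op])));
    }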
On Thu, Mar 19, 2015 at 11:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Here is yet another version of this patch. In addition to the fixes > mentioned above, this version includes some minor rebasing around > recent commits, and also better handling of the case where we discover > that we cannot launch workers after all. This can happen because (1) > dynamic_shared_memory_type=none, (2) the maximum number of DSM > segments supported by the system configuration are already in use, or > (3) the user creates a parallel context with nworkers=0. In any of > those cases, the system will now create a backend-private memory > segment instead of a dynamic shared memory segment, and will skip > steps that don't need to be done in that case. This is obviously an > undesirable scenario. If we choose a parallel sequential scan, we > want it to launch workers and really run in parallel. Hopefully, in > case (1) or case (3), we will avoid choosing a parallel plan in the > first place, but case (2) is pretty hard to avoid completely, as we > have no idea what other processes may or may not be doing with dynamic > shared memory segments ... and, in any case, degrading to non-parallel > execution beats failing outright. I see that you're using git format-patch to generate this. But the patch is only patch 1/4. Is that intentional? Where are the other pieces? I think that the parallel seqscan patch, and the assessing parallel safety patch are intended to fit together with this patch, but I can't find a place where there is a high level overview explaining just how they fit together (I notice Amit's patch has an "#include "access/parallel.h", which is here, but that wasn't trivial to figure out). I haven't been paying too much attention to this patch series. -- Peter Geoghegan
On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> You'd need some kind of >> API that says "pretend I'm waiting for this lock, but don't really >> wait for it", and you'd need to be darn sure that you removed yourself >> from the wait queue again before doing any other heavyweight lock >> manipulation. Do you have specific thoughts on how to implement this? > > I've thought some about this, and I think it's a bit easier to not do it > on the actual lock waitqueues, but teach deadlock.c about that kind of > blocking. > > deadlock.c is far from simple, and at least I don't find the control > flow to be particularly clear. So it's not easy. It'd be advantageous > to tackle things at that level because it'd avoid the need to acquire > locks on some lock's waitqueue when blocking; we're going to do that a > lot. > > But It seems to me that it should be possible to suceed: In > FindLockCycleRecurse(), in the case that we're not waiting for an actual > lock (checkProc->links.next == NULL) we can add a case that considers > the 'blocking parallelism' case. ISTM that that's just a > FindLockCycleRecurse() call on the process that we're waiting for. We'd > either have to find the other side's locktag for DEADLOCK_INFO or invent > another field in there; but that seems like a solveable problem. I (finally) had some time to look at this today. Initially, it looked good. Unfortunately, the longer I looked at it, the less convinced I got that we could solve the problem this way. The core of the issue is that the Funnel node in the parallel group leader basically does this: while (!local_scan_done || !remote_scan_done) { attempt a read from each remaining worker's tuple queue, blocking only if local_scan_done; if (we got a tuple) return it; else if (there are no remaining workers) remote_scan_done= true; attempt to produce a tuple just as if we were a worker ourselves; if (we got a tuple) return it; else local_scan_done = true; } Imagine that we're doing a parallel sequential scan; each worker claims one page but goes into the tank before it has returned all of the tuples on that page. The master reads all the remaining pages but must now wait for the workers to finish returning the tuples on the pages they claimed. So what this means is: 1. The master doesn't actually *wait* until the very end of the parallel phase. 2. When the master does wait, it waits for all of the parallel workers at once, not each one individually. So, I don't think anything as simplistic as teaching a blocking shm_mq_receive() to tip off the deadlock detector that we're waiting for the process on the other end of that particular queue can ever work. Because of point #2, that never happens. When I first started thinking about how to fix this, I said, well, that's no big deal, we can just advertise the whole list of processes that we're waiting for in shared memory, rather than just the one. This is a bit tricky, though. Any general API for any backend to advertise that it's waiting for an arbitrary subset of the other backends would require O(n^2) shared memory state. That wouldn't be completely insane, but it's not great, either. For this particular case, we could optimize that down to O(n) by just chaining all of the children of a given parallel group leader in a linked whose nodes are inlined in their PGPROCs, but that doesn't feel very general, because it forecloses the possibility of the children ever using that API, and I think they might need to. 
If nothing else, they might need to advertise that they're blocking on the master if they are trying to send data, the queue is full, and they have to wait for the master to drain some of it before they can proceed. After thinking about it a bit more, I realized that even if we settle on some solution to that problem, there's another issues: the wait-edges created by this system don't really have the same semantics as regular lock waits. Suppose A is waiting on a lock held by B and B is waiting for a lock held by A; that's a deadlock. But if A is waiting for B to write to a tuple queue and B is waiting for A to read from a tuple queue, that's not a deadlock if the queues in question are the same. If they are different queues, it might be a deadlock, but then again maybe not. It may be that A is prepared to accept B's message from one queue, and that upon fully receiving it, it will do some further work that will lead it to write a tuple into the other queue. If so, we're OK; if not, it's a deadlock. I'm not sure whether you'll want to argue that that is an implausible scenario, but I'm not too sure it is. The worker could be saying "hey, I need some additional piece of your backend-local state in order to finish this computation", and the master could then provide it. I don't have any plans like that, but it's been suggested previously by others, so it's not an obviously nonsensical thing to want to do. A further difference in the semantics of these wait-edges is that if process A is awaiting AccessExclusiveLock on a resource held by B, C, and D at AccessShareLock, it needs to wait for all of those processes to release their locks before it can do anything else. On the other hand, if process A is awaiting tuples from B, C, and D, it just needs ONE of those processes to emit tuples in order to make progress. Now maybe that doesn't make any difference in practice, because even if two of those processes are making lively progress and A is receiving tuples from them and processing them like gangbusters, that's probably not going to help the third one get unstuck. If we adopt the approach of discounting that possibility, then as long as the parallel leader can generate tuples locally (that is, local_scan_done = false) we don't report the deadlock, but as soon as it can no longer do that (local_scan_done = true) then we do, even though we could still theoretically read more tuples from the non-stuck workers. So then you have to wonder why we're not solving problem #1, because the deadlock was just as certain before we generated the maximum possible number of tuples locally as it was afterwards. Thoughts? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
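Reformatted for readability, the leader loop Robert describes above looks roughly like the sketch below; the node type and the helper names are placeholders rather than anything from the patch.

    /* Placeholder sketch of the Funnel node's outer loop described above. */
    static HeapTuple
    funnel_next(FunnelState *node)      /* hypothetical node type */
    {
        while (!node->local_scan_done || !node->remote_scan_done)
        {
            HeapTuple   tup;

            /* Poll each remaining worker queue; block only once local scan is done. */
            tup = try_read_any_worker_queue(node, node->local_scan_done);
            if (tup != NULL)
                return tup;
            if (no_remaining_workers(node))
                node->remote_scan_done = true;

            /* Then try to produce a tuple ourselves, just as a worker would. */
            tup = try_produce_tuple_locally(node);
            if (tup != NULL)
                return tup;
            node->local_scan_done = true;
        }
        return NULL;                    /* both sides exhausted */
    }

The property that matters for the deadlock discussion is visible here: the leader only blocks on its workers once local_scan_done is set, and when it does block it is waiting on all remaining queues at once rather than on any single worker.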
On Tue, Mar 24, 2015 at 11:56 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Well, it's not actually the same message. They're all a bit >> different. Or mostly all of them. And the variable part is not a >> command name, as in the PreventTransactionChain() case, so it would >> affect translatability if we did that. > > Three things that vary are 1) the fact that some check IsParallelWorker() > and others check IsInParallelMode(), and 2) that some of them are > ereports() while others are elog(), and 3) that some are ERROR and > others are FATAL. Is there a reason for these differences? The message text also varies, because it states the particular operation that is prohibited. We should check IsParallelWorker() for operations that are allowed in the master during parallel mode, but not allowed in the workers - e.g. the master can scan its own temporary relations, but its workers can't. We should check IsInParallelMode() for operations that are completely off-limits in parallel mode - i.e. writes. We use ereport() where we expect that SQL could hit that check, and elog() where we expect that only (buggy?) C code could hit that check. We use FATAL for some of the checks in xact.c, for parity with other, similar checks in xact.c that relate to checking the transaction state. I think this is because we assume that if the transaction state is hosed, it's necessary to terminate the backend to recover. In other cases, we use ERROR. > (Note typo "restrction" in quoted paragraph above.) Thanks, will fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
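A condensed example of the convention described above, with the first check mirroring the one quoted earlier from heap_prepare_insert(); the wrapper function, the header locations, and the temporary-table message are illustrative assumptions.

    #include "postgres.h"
    #include "access/parallel.h"        /* IsParallelWorker(); assumed location */
    #include "access/xact.h"            /* IsInParallelMode(); assumed location */

    static void
    example_parallel_checks(void)
    {
        /* Writes are off-limits for the leader and the workers alike. */
        if (IsInParallelMode())
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot insert tuples during a parallel operation")));

        /* Leader-only operations are rejected only when run in a worker. */
        if (IsParallelWorker())
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot access temporary tables during a parallel operation")));
    }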
On Mon, Apr 20, 2015 at 6:49 PM, Peter Geoghegan <pg@heroku.com> wrote: > I see that you're using git format-patch to generate this. But the > patch is only patch 1/4. Is that intentional? Where are the other > pieces? > > I think that the parallel seqscan patch, and the assessing parallel > safety patch are intended to fit together with this patch, but I can't > find a place where there is a high level overview explaining just how > they fit together (I notice Amit's patch has an "#include > "access/parallel.h", which is here, but that wasn't trivial to figure > out). I haven't been paying too much attention to this patch series. The intended order of application is: - parallel mode/contexts - assess parallel safety - parallel heap scan - parallel seq scan The parallel heap scan patch should probably be merged into the parallel seq scan patch, which should probably then be split into several patches, one or more with general parallel query infrastructure and then another with stuff specific to parallel sequential scan. But we're not quite there yet. Until we can get agreement on how to finalize the parallel mode/contexts patch, I haven't been too worried about getting the other stuff sorted out exactly right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 21, 2015 at 11:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>
> After thinking about it a bit more, I realized that even if we settle
> on some solution to that problem, there's another issues: the
> wait-edges created by this system don't really have the same semantics
> as regular lock waits. Suppose A is waiting on a lock held by B and B
> is waiting for a lock held by A; that's a deadlock. But if A is
> waiting for B to write to a tuple queue and B is waiting for A to read
> from a tuple queue, that's not a deadlock if the queues in question
> are the same. If they are different queues, it might be a deadlock,
> but then again maybe not. It may be that A is prepared to accept B's
> message from one queue, and that upon fully receiving it, it will do
> some further work that will lead it to write a tuple into the other
> queue. If so, we're OK; if not, it's a deadlock.
I agree that the way the deadlock detector works for locks might not be
the same as the way it needs to work for communication buffers (tuple
queues). I think we would also need to devise some different way of
removing resources from the wait/resource queue; it might not be a good
idea to handle them the same way as locks, because locks are released at
transaction end whereas these new resources (communication buffers)
don't operate at transaction boundaries.
> I'm not sure
> whether you'll want to argue that that is an implausible scenario, but
> I'm not too sure it is. The worker could be saying "hey, I need some
> additional piece of your backend-local state in order to finish this
> computation", and the master could then provide it. I don't have any
> plans like that, but it's been suggested previously by others, so it's
> not an obviously nonsensical thing to want to do.
If such a thing is not possible today, and it does not seem like a good
design for solving any problem anyway, then why spend so much effort
trying to find ways to detect deadlocks for it?
> A further difference in the semantics of these wait-edges is that if
> process A is awaiting AccessExclusiveLock on a resource held by B, C,
> and D at AccessShareLock, it needs to wait for all of those processes
> to release their locks before it can do anything else. On the other
> hand, if process A is awaiting tuples from B, C, and D, it just needs
> ONE of those processes to emit tuples in order to make progress. Now
> maybe that doesn't make any difference in practice, because even if
> two of those processes are making lively progress and A is receiving
> tuples from them and processing them like gangbusters, that's probably
> not going to help the third one get unstuck. If we adopt the approach
> of discounting that possibility, then as long as the parallel leader
> can generate tuples locally (that is, local_scan_done = false) we
> don't report the deadlock, but as soon as it can no longer do that
> (local_scan_done = true) then we do, even though we could still
> theoretically read more tuples from the non-stuck workers. So then
> you have to wonder why we're not solving problem #1, because the
> deadlock was just as certain before we generated the maximum possible
> number of tuples locally as it was afterwards.
>
I think the deadlock detection code should run any time we have to wait
for more than deadlock_timeout. The important point here is when we
actually start waiting: as per the current parallel_seqscan patch we
wait only after checking the tuples from all queues, but we could have
done it the other way (wait if we can't fetch from one queue) as well.
Both approaches have pros and cons, so we can proceed on the assumption
that the current way is okay and consider changing it in the future if
we find a reason to.
Having said that, I feel this is not a problem we should try to solve
at this point unless there is some scenario in which we could actually
hit a deadlock due to communication buffers. I think such a solution
would be required for more advanced forms of parallelism (like
intra-query parallelism). By the way, why are we trying to solve this
problem at all; is there any way we can hit it with parallel sequential
scan?
On 19-03-2015 15:13, Robert Haas wrote: > On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: >> Reading the README first, the rest later. So you can comment on my >> comments, while I actually look at the code. Parallelism, yay! > I'm also looking at this piece of code and found some low-hanging fruit. -- Euler Taveira Timbira - http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On 2015-04-22 AM 04:14, Robert Haas wrote: > > We should check IsParallelWorker() for operations that are allowed in > the master during parallel mode, but not allowed in the workers - e.g. > the master can scan its own temporary relations, but its workers > can't. We should check IsInParallelMode() for operations that are > completely off-limits in parallel mode - i.e. writes. > > We use ereport() where we expect that SQL could hit that check, and > elog() where we expect that only (buggy?) C code could hit that check. > By the way, perhaps orthogonal to the basic issue Alvaro and you are discussing here - when playing around with (parallel-mode + parallel-safety + parallel heapscan + parallel seqscan), I'd observed (it's been a while since then) that if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use parallel scan, the statement as a whole fails because the top-level statement (CTAS) is not read-only. So, the only way to make CTAS succeed is to disable seqscan; figuring that out may require some consideration of what we report to the user. I guess it happens because the would-be parallel leader of the scan would also be the one doing DefineRelation() DDL. Apologies if this has been addressed since. Thanks, Amit
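A short sketch of the ereport()-versus-elog() convention quoted in the message above. IsInParallelMode() and IsParallelWorker() are the real checks; the guarded operations and error texts below are hypothetical placeholders chosen only for illustration.

#include "postgres.h"
#include "access/parallel.h"
#include "access/xact.h"

static void
do_some_write(void)
{
    /* Off-limits anywhere in parallel mode, and reachable from plain SQL. */
    if (IsInParallelMode())
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                 errmsg("cannot perform this operation during a parallel operation")));
    /* ... perform the write ... */
}

static void
scan_my_temp_rel(void)
{
    /*
     * Allowed in the leader, never in a worker; only buggy C code should
     * reach this in a worker, so elog() rather than ereport().
     */
    if (IsParallelWorker())
        elog(ERROR, "cannot access temporary relations from a parallel worker");
    /* ... scan the temporary relation ... */
}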
On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira <euler@timbira.com.br> wrote: > On 19-03-2015 15:13, Robert Haas wrote: >> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>> Reading the README first, the rest later. So you can comment on my >>> comments, while I actually look at the code. Parallelism, yay! >> > I'm also looking at this piece of code and found some low-hanging fruit. Thanks. 0001 and 0003 look good, but 0002 is actually unrelated to this patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Apr 26, 2015 at 9:57 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > On 2015-04-22 AM 04:14, Robert Haas wrote: >> We should check IsParallelWorker() for operations that are allowed in >> the master during parallel mode, but not allowed in the workers - e.g. >> the master can scan its own temporary relations, but its workers >> can't. We should check IsInParallelMode() for operations that are >> completely off-limits in parallel mode - i.e. writes. >> >> We use ereport() where we expect that SQL could hit that check, and >> elog() where we expect that only (buggy?) C code could hit that check. >> > > By the way, perhaps orthogonal to the basic issue Alvaro and you are > discussing here - when playing around with (parallel-mode + parallel-safety + > parallel heapscan + parallel seqscan), I'd observed (been a while since) that > if you run a CREATE TABLE AS ... SELECT and the SELECT happens to use parallel > scan, the statement as a whole fails because the top level statement (CTAS) is > not read-only. So, only way to make CTAS succeed is to disable seqscan; which > may require some considerations in reporting to user to figure out. I guess it > happens because the would-be parallel leader of the scan would also be the one > doing DefineRelation() DDL. Apologies if this has been addressed since. That sounds like a bug in the assess-parallel-safety patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 28, 2015 at 2:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Apr 26, 2015 at 2:03 PM, Euler Taveira <euler@timbira.com.br> wrote: >> On 19-03-2015 15:13, Robert Haas wrote: >>> On Wed, Mar 18, 2015 at 5:36 PM, Andres Freund <andres@2ndquadrant.com> wrote: >>>> Reading the README first, the rest later. So you can comment on my >>>> comments, while I actually look at the code. Parallelism, yay! >>> >> I'm also looking at this piece of code and found some low-hanging fruit. > > Thanks. 0001 and 0003 look good, but 0002 is actually unrelated to this patch. So, I think it makes sense to split up this patch in two. There's no real debate, AFAICS, about anything in the patch other than the heavyweight locking stuff. So I'd like to go ahead and commit the rest. That's attached here as parallel-mode-v10.patch. The remaining question here is what to do about the heavyweight locking. There seem to be a couple of possible approaches to that, and perhaps with this specific issue separated out we can focus in on that aspect of the problem. Let me start by restating my goals for that portion of the patch: 1. Avoid undetected deadlocks. For example, suppose A and A1 are cooperating parallel processes, and B is an unrelated process. If A1 waits for a lock held by B, and B waits for a lock held by A, and A is waiting for A1 to finish executing before winding up parallelism, the deadlock detector currently won't detect that. Either the deadlock detector needs to know that A waits for A1, or it needs to view the lock held by A as also held by A1, so that it can detect a cycle A1 -> B -> A1. 2. Avoid unprincipled deadlocks. For example, suppose A holds AccessExclusiveLock on a relation R and attempts a parallel sequential scan on R. It launches a worker A1. If A1 attempts to lock R and blocks, and A then waits for A1, we have a deadlock. Detecting this deadlock (as per goal 1) is better than not detecting it, but it's not good enough. Either A shouldn't have attempted a parallel sequential scan in the first place, or it should run to completion without deadlocking with its own workers. 3. Allow parallel DDL. For example, parallel VACUUM FULL or parallel CLUSTER. We don't have code for these things today, but the locking architecture we choose should not preclude doing them later. 4. Avoid restricting parallelism based on the history of the transaction. If BEGIN; COPY foo FROM ...; SELECT ... FROM foo can use parallelism, then BEGIN; TRUNCATE foo; COPY foo FROM ...; SELECT ... FROM foo should also be able to use parallelism. 5. Impose as few restrictions as possible on what can be done in parallel mode. For example, the current version of this patch (attached as parallel-mode-locking.patch) allows parallel workers to do system cache lookups, where a relation lock is taken and released, but they cannot take and retain any heavyweight locks; so for example a user-defined function that runs an SQL query against some unrelated table and returns the results isn't parallel-safe right now. This is not ideal. That's all I've got for goals. How do we achieve those goals? So far I think we've hit on three approaches to deadlock detection that I think are at least somewhat plausible. None of them are perfect: D-1. Teach the deadlock detector about implicit edges between cooperating parallel processes. Upthread, I wrote about some problems with this approach: http://www.postgresql.org/message-id/CA+TgmoZ0tOPvfuz5BmEUqmdQ9xYQHzkG4jjJXduYDgQJXmPFbQ@mail.gmail.com D-2.
Copy locks from the master to every worker, and forbid workers from acquiring any new, long-lived locks. (Locks that are taken and released quickly, like a system catalog lock for a syscache lookup, or a relation extension lock, won't cause deadlock.) Copying locks ensures that whenever there is a dangerous structure like A1 -> B -> A in the waits-for graph, there will also be a cycle A1 -> B -> A1, so the unmodified deadlock detector will detect the deadlock. Forbidding workers from acquiring any new, long-lived locks prevents locks taken after the copying step from causing similar problems. The restriction against new, long-lived locks is annoying. Also, a transaction holding many locks (like pg_dump) and using many workers will use an enormous number of lock table slots. D-3. Teach the deadlock detector to consider all locks held by any member of the locking group as held by the leader. This guarantees that all deadlocks between the parallel group and other processes in the system will be detected. When a process in the group requests a lock that conflicts with one already held by another group member, but not with any held by a process outside the group, grant it anyway, and without waiting behind pending conflicting lock requests. At the end of parallelism, transfer back to the leader any locks held by workers and not yet released. In a world where lock requests within a group can't conflict, there's no such thing as a deadlock within a parallel group, so we don't need to detect deadlocks of that type, and there will never be any unprincipled deadlocks within a group because there will never be any deadlocks within a group at all. The biggest problem with this approach is figuring out how to make it safe. I originally proposed this approach on the "group locking" thread and it kind of went down in flames. If two parallel backends are trying to extend the same relation at the same time, or lock the same tuple or page at the same time, those requests have got to conflict. In those cases, we are using locking to enforce real mutual exclusion. However, that doesn't seem like a serious problem. First, locks of those types will not be held at the start of parallelism; starting parallelism while you hold a lock of one of those types would be ridiculous. Second, locks of those types are never long-lived: you take them, do what you need to do, and then release them. Finally, I don't believe you ever take any other heavyweight locks while holding them; I *think* they're always the last heavyweight lock you take. So I think we could allow these locks to conflict normally without introducing any deadlock risk. The only locks that we need to consider as mutually non-conflicting are relation and database object locks. Those are a different kettle of fish: when we change a relation's relfilenode, for example, we must keep the relation locked for the duration of the transaction because of MVCC concerns. The new pg_class tuple isn't visible to anyone else yet, and any further modifications to the relation must be done using the new relfilenode. But none of that matters for a parallel worker, which shares the same snapshot. There are still cases that do matter; for example, if a parallel backend could REINDEX a relation being concurrently scanned by another parallel backend in the same group, that would cause problems, because REINDEX uses backend-local state that wouldn't be shared.
But these cases can also arise without parallelism, due to cursors, and there is an existing mechanism (CheckTableNotInUse) to prevent them. So I think that's OK, too. If not, we can fix other problems as we find them; I don't see any major architectural problems here. The other big problem with this approach is figuring out how to implement it. It doesn't work to have the worker processes manipulate the leader's lock-related data structures as if they were the parent, both because the locking code relies on the shared data structures to match its backend-private data structures and also because if we do block waiting for a lock, there could be more than one waiting process in the lock group at a time, and we need to know which specific processes are waiting so we can wake them up at the correct time. Instead, I think the way to do this is to add an additional field to each PROCLOCK storing a pointer to the leader's PGPROC; FindLockCycleRecurse() can check whether two PGPROCs have the same leader pointer instead of whether the PGPROCs themselves are the same. It can also refuse to follow a waits-for edge if that edge is of one of the types we do want to conflict within a locking group (e.g. relation extension). --- I'm OK with any of these approaches if we can agree on how to solve the problems they respectively present. Currently, I've got an implementation of D-2; I started with an implementation of D-3, which was flawed, but perhaps the idea is salvageable, per the discussion above. Andres has consistently advocated for D-1, but see the link above for where I'm stuck in terms of implementing that. There may be another approach we haven't come up with yet, too, for all I know, and that's fine too if it accomplishes the goals listed above. Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
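A rough sketch of the leader-pointer comparison described for D-3 above. No group-leader field exists in PGPROC or PROCLOCK today, so the struct and helpers below are assumptions made purely for illustration, not working code from the patch.

#include "postgres.h"
#include "storage/proc.h"

/* Hypothetical per-backend state; in a real patch this would live in PGPROC. */
typedef struct LockGroupMember
{
    PGPROC     *proc;           /* the backend itself */
    PGPROC     *groupLeader;    /* leader's PGPROC, or NULL if not in a group */
} LockGroupMember;

static PGPROC *
lock_group_leader(const LockGroupMember *m)
{
    /* A process that is in no group is treated as its own leader. */
    return m->groupLeader != NULL ? m->groupLeader : m->proc;
}

static bool
in_same_lock_group(const LockGroupMember *a, const LockGroupMember *b)
{
    return lock_group_leader(a) == lock_group_leader(b);
}

/*
 * Inside FindLockCycleRecurse(), the "have we come back to the start?" test
 * would then compare group leaders (in_same_lock_group) instead of comparing
 * the PGPROCs themselves, while edges of the kinds that should still conflict
 * within a group (e.g. relation extension) continue to be followed normally.
 */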
On Wed, Apr 29, 2015 at 12:23 PM, Robert Haas <robertmhaas@gmail.com> wrote: > So, I think it makes sense to split up this patch in two. There's no > real debate, AFAICS, about anything in the patch other than the > heavyweight locking stuff. So I'd like to go ahead and commit the > rest. That's attached here as parallel-mode-v10.patch. Hearing no objections, done. Still hoping for some input on the rest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote: > On Tue, Mar 24, 2015 at 3:26 PM, Andres Freund <andres@2ndquadrant.com> wrote: [proposal: teach deadlock detector about parallel master waiting on workers] > > deadlock.c is far from simple, and at least I don't find the control > > flow to be particularly clear. So it's not easy. It'd be advantageous > > to tackle things at that level because it'd avoid the need to acquire > > locks on some lock's waitqueue when blocking; we're going to do that a > > lot. > > > > But It seems to me that it should be possible to suceed: In > > FindLockCycleRecurse(), in the case that we're not waiting for an actual > > lock (checkProc->links.next == NULL) we can add a case that considers > > the 'blocking parallelism' case. ISTM that that's just a > > FindLockCycleRecurse() call on the process that we're waiting for. We'd > > either have to find the other side's locktag for DEADLOCK_INFO or invent > > another field in there; but that seems like a solveable problem. > 1. The master doesn't actually *wait* until the very end of the parallel phase. > 2. When the master does wait, it waits for all of the parallel workers > at once, not each one individually. > > So, I don't think anything as simplistic as teaching a blocking > shm_mq_receive() to tip off the deadlock detector that we're waiting > for the process on the other end of that particular queue can ever > work. Because of point #2, that never happens. When I first started > thinking about how to fix this, I said, well, that's no big deal, we > can just advertise the whole list of processes that we're waiting for > in shared memory, rather than just the one. This is a bit tricky, > though. Any general API for any backend to advertise that it's waiting > for an arbitrary subset of the other backends would require O(n^2) > shared memory state. That wouldn't be completely insane, but it's not > great, either. For this particular case, we could optimize that down > to O(n) by just chaining all of the children of a given parallel group > leader in a linked whose nodes are inlined in their PGPROCs, but that > doesn't feel very general, because it forecloses the possibility of > the children ever using that API, and I think they might need to. If > nothing else, they might need to advertise that they're blocking on > the master if they are trying to send data, the queue is full, and > they have to wait for the master to drain some of it before they can > proceed. Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, model it with queue-nonempty and queue-nonfull events, one pair per queue. M subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M publishes to queue0-nonfull, and the workers subscribe. An edge forms in the deadlock detector's graph when all of an event's subscribers wait on it. (One can approximate that in userspace with a pair of advisory locks.) > After thinking about it a bit more, I realized that even if we settle > on some solution to that problem, there's another issues: the > wait-edges created by this system don't really have the same semantics > as regular lock waits. Suppose A is waiting on a lock held by B and B > is waiting for a lock held by A; that's a deadlock. But if A is > waiting for B to write to a tuple queue and B is waiting for A to read > from a tuple queue, that's not a deadlock if the queues in question > are the same. I do see a deadlock. B wants to block until A reads from the queue, so it advertises that and sleeps. 
Waking up deadlock_timeout later, it notices that A is waiting for B to write something. B will not spontaneously suspend waiting to write to the queue, nor will A suspend waiting to read from the queue. Thus, the deadlock is valid. This assumes the deadlock detector reasons from an authoritative picture of pending waits and that we reliably wake up a process when the condition it sought has arrived. We have that for heavyweight lock acquisitions. The assumption may be incompatible with Andres's hope, quoted above, of avoiding the full lock acquisition procedure. > A further difference in the semantics of these wait-edges is that if > process A is awaiting AccessExclusiveLock on a resource held by B, C, > and D at AccessShareLock, it needs to wait for all of those processes > to release their locks before it can do anything else. On the other > hand, if process A is awaiting tuples from B, C, and D, it just needs > ONE of those processes to emit tuples in order to make progress. Now Right. > maybe that doesn't make any difference in practice, because even if > two of those processes are making lively progress and A is receiving > tuples from them and processing them like gangbusters, that's probably > not going to help the third one get unstuck. If we adopt the approach > of discounting that possibility, then as long as the parallel leader > can generate tuples locally (that is, local_scan_done = false) we > don't report the deadlock, but as soon as it can no longer do that > (local_scan_done = true) then we do, even though we could still > theoretically read more tuples from the non-stuck workers. I can't discount the possibility that the master could unstick a worker in the course of ingesting tuples from other workers. We could accept a coding rule against parallel algorithms behaving that way, on pain of unprincipled deadlock. It's unattractive to bake such a high-level notion of acceptable wait patterns into the deadlock detector, and every coding rule has noteworthy cost. Since detecting deadlocks long after they were inevitable is also unattractive, the rule might be worthwhile. > So then > you have to wonder why we're not solving problem #1, because the > deadlock was just as certain before we generated the maximum possible > number of tuples locally as it was afterwards. The "1." above reads like a benefit, not a problem. What might you solve about it? Thanks, nm
On Sat, May 2, 2015 at 12:35 PM, Noah Misch <noah@leadboat.com> wrote: > Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, > model it with queue-nonempty and queue-nonfull events, one pair per queue. Each individual queue has only a single reader and a single writer. In your example, there would be three queues, not just one. Of course, one could design a new shared memory data structure representing a collection of related queues, but I still don't see exactly how we get around the problem of that requiring O(MaxBackends^2) storage space. > M > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the > deadlock detector's graph when all of an event's subscribers wait on it. (One > can approximate that in userspace with a pair of advisory locks.) An edge between which two processes? >> After thinking about it a bit more, I realized that even if we settle >> on some solution to that problem, there's another issues: the >> wait-edges created by this system don't really have the same semantics >> as regular lock waits. Suppose A is waiting on a lock held by B and B >> is waiting for a lock held by A; that's a deadlock. But if A is >> waiting for B to write to a tuple queue and B is waiting for A to read >> from a tuple queue, that's not a deadlock if the queues in question >> are the same. > > I do see a deadlock. B wants to block until A reads from the queue, so it > advertises that and sleeps. Waking up deadlock_timeout later, it notices that > A is waiting for B to write something. B will not spontaneously suspend > waiting to write to the queue, nor will A suspend waiting to read from the > queue. Thus, the deadlock is valid. This assumes the deadlock detector > reasons from an authoritative picture of pending waits and that we reliably > wake up a process when the condition it sought has arrived. We have that for > heavyweight lock acquisitions. The assumption may be incompatible with > Andres's hope, quoted above, of avoiding the full lock acquisition procedure. I don't understand. What's typically going to happen here is that the queue will initially be empty, and A will block. Suppose B has a large message to write to the queue, let's say 100x the queue size. It will write the first 1% of the message into the queue, set A's latch, and go to sleep. A will now wake up, drain the queue, set B's latch, and go to sleep. B will then wake up, write the next chunk of the message, set A's latch again, and go to sleep. They'll go back and forth like this until the entire message has been transmitted. There's no deadlock here: everything is working as designed. But there may be instants where both A and B are waiting, because (for example) A may set B's latch and finish going to sleep before B gets around to waking up. That's probably something that we could patch around, but I think it's missing the larger point. When you're dealing with locks, there is one operation that causes blocking (acquiring a lock) and another operation that unblocks other processes (releasing a lock). With message queues, there are still two operations (reading and writing), but either of them can either cause you to block yourself, or to unblock another process. To put that another way, locks enforce mutual exclusion; message queues force mutual *inclusion*. 
Locks cause blocking when two processes try to operate on the same object at the same time; message queues cause blocking *unless* two processes operate on the same object at the same time. That difference seems to me to be quite fundamental. >> So then >> you have to wonder why we're not solving problem #1, because the >> deadlock was just as certain before we generated the maximum possible >> number of tuples locally as it was afterwards. > > The "1." above reads like a benefit, not a problem. What might you solve > about it? Sorry, that wasn't very clear. Normally, it is a benefit, but in some cases, it could result in a long delay in reporting an inevitable deadlock. What I mean is: suppose we construct a working mechanism where the deadlock detector knows about tuple queue waits. Suppose further that we have a parallel leader and two workers cooperating to do a task that takes 300 s and can be parallelized with perfect efficiency so that, working together, those three processes can get the job done in just 100 s. If the leader tries to take a lock held in a conflicting mode by one of the workers, the worker will fill up the tuple queue - probably quite quickly - and wait. We now detect a deadlock. Our detection is timely, and life is good. On the other hand, suppose that one of the *workers* tries to take a lock held in a conflicting mode by the other worker or by the leader. There is no immediate deadlock, because the leader is not waiting. Instead, we'll finish the computation with the remaining worker and the leader, which will take 150 seconds since we only have two processes instead of three, and *then* the leader waits. I predict that users will be unhappy about doing the whole computation - and only then detecting, well after the time the query would normally have finished, a deadlock that was inevitable from the beginning. That problem is actually not too hard to avoid if you're OK with extremely frequent lock manager manipulations. It's not a deadlock if a worker is waiting for a lock (say, a relation extension lock) held by the leader, because the leader might release that lock quickly. But if the leader gets to the top of the funnel node's main loop and one of its workers is still waiting for the lock at that point, that is almost certainly a deadlock. Locks might get taken and released during a single pass through that loop, but any lock held across iterations is presumably going to be held until end-of-transaction, so we're hosed. But checking for deadlock at the top of every main loop iteration is far too expensive to contemplate. Even doing some much lighter-weight manipulation of the lock manager data structures at the top of every loop iteration is probably going to recreate the same kind of lock manager contention problems that I tried to solve with the fast-path locking stuff in 9.2. To put all of my cards on the table, I've never really believed in this approach. The reason we have a deadlock detector in the first place is because we can't prevent users from making fundamentally incompatible locking requests, like locking two tables in incompatible orderings in two different sessions. But we *don't* have deadlock detection for lwlocks because we (rightly) view it as our job to write the code in such a way that deadlocks don't happen in the first place. So we take locks in a predictable order, even in cases (like updating a tuple) where that involves some fancy gymnastics. 
We could have update always lock the old-tuple buffer before the new-tuple buffer and make it the deadlock detector's job to sort that out, but that's pretty obviously inferior. We can expect users to tolerate a deadlock when it is their own actions that have precipitated the deadlock, but it's a whole different thing to ask them to accept that what from the user's point of view is an indivisible action (like an UPDATE, or a parallel query, or perhaps an UPSERT) occasionally deadlocks for some reason that they can neither understand nor prevent. Tuple queues are an implementation detail. They shouldn't deadlock for the same reason that lwlock acquisition shouldn't deadlock, and this whole project should be entirely unnecessary. The way we backed into worrying about exposing tuple queues to the deadlock detector is that, if you let one backend in a parallel group get and keep a lock on some resource and then another backend in the same group tries to lock that resource and can't get the lock, you will eventually get an undetected deadlock. Andres's view is that we should fix the deadlock detector to detect and report such cases, but I think that's wrong. If a process in a parallel group can take and retain until end of transaction a lock on some resource without which other processes in the same parallel group cannot do useful work, that's a BUG. We shouldn't need the deadlock detector to report that problem for the same reason we shouldn't need the deadlock detector to report lwlock-based deadlocks: if they are ever happening, you need to fix your code, not the deadlock detector. I think where this project went off the rails is when we made the decision as to how to fix the problems in the original group locking approach. In the original concept, locks between processes in the same parallel group just didn't ever conflict; two otherwise-conflicting locks held by different backends within the same group were regarded as those processes sharing that lock. Andres and possibly others pointed out that for stuff like relation locks, that's probably not the right behavior. I agree with that. It was also suggested that this would be less scary if we limited ourselves to sharing the locks held at the beginning of parallelism. That had the further advantage of making things like relation extension locks, which won't be held at the start of parallelism, unshared, while relation locks would, in most cases, be shared. That felt pretty good, so I did it, but I now think that was a mistake, because it creates edge cases where parallel groups can self-deadlock. If we instead regard relation locks between parallel group members as NEVER conflicting, rather than conflicting only when both locks were acquired after the start of parallelism, those edge cases go away. The only downside is that if a worker A manages to do something to a relation R that makes it unsafe for worker B to access, and worker B then gets the lock anyway, we get no-holds-barred insanity instead of a deadlock. But I am unconvinced that downside amounts to much, because I believe the only way that A can make it unsafe for B to access the relation is by doing something that CheckTableNotInUse() will already prohibit in parallel mode. If it turns out there is some oversight there, I don't think it'll be hard to plug. In contrast, exposing this tuple queue wait information to the deadlock detector is looking quite complex. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
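A simplified sketch of the latch-based hand-off described earlier in this message, from the writer's side. SetLatch(), ResetLatch(), WaitLatch() and CHECK_FOR_INTERRUPTS() are real interfaces; the ToyQueue type and its helpers are stand-ins for the shared-memory queue in shm_mq.c, so this is illustrative rather than the actual implementation.

#include "postgres.h"
#include "miscadmin.h"
#include "storage/latch.h"

typedef struct ToyQueue ToyQueue;   /* hypothetical ring buffer in shared memory */
extern Size toyqueue_put_some(ToyQueue *q, const char *data, Size len);
extern Latch *toyqueue_reader_latch(ToyQueue *q);

/* Writer side: send a message possibly much larger than the queue itself. */
static void
send_large_message(ToyQueue *q, const char *data, Size len)
{
    Size    sent = 0;

    while (sent < len)
    {
        Size    written = toyqueue_put_some(q, data + sent, len - sent);

        if (written > 0)
        {
            sent += written;
            SetLatch(toyqueue_reader_latch(q));     /* unblock the reader */
            continue;
        }

        /*
         * Queue full: wait for the reader to drain it and set our latch.
         * The reader may likewise be "waiting" for an instant after it sets
         * our latch, which is why naive wait-for edges between reader and
         * writer would look like a deadlock even though progress continues.
         */
        WaitLatch(MyLatch, WL_LATCH_SET, 0);
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }
}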
On Tue, May 05, 2015 at 01:05:38PM -0400, Robert Haas wrote: > On Sat, May 2, 2015 at 12:35 PM, Noah Misch <noah@leadboat.com> wrote: > > Perhaps, rather than model it as master M waiting on worker list W1|W2|W3, > > model it with queue-nonempty and queue-nonfull events, one pair per queue. That comment of mine was junk; it suggested a data structure containing the same worker list it purported to remove. Oops. > Each individual queue has only a single reader and a single writer. > In your example, there would be three queues, not just one. Of > course, one could design a new shared memory data structure > representing a collection of related queues, but I still don't see > exactly how we get around the problem of that requiring > O(MaxBackends^2) storage space. If each backend waits on a uniformly-distributed 50% of other backends, tracking that wait graph indeed requires O(MaxBackends^2) space. Backends completing useful work in the field will have far-sparser wait graphs, and that should inform the choice of data structure: On Tue, Apr 21, 2015 at 02:08:00PM -0400, Robert Haas wrote: > When I first started thinking about how to fix this, I said, well, > that's no big deal, we can just advertise the whole list of processes > that we're waiting for in shared memory, rather than just the one. That boiled down to representing the wait graph with an adjacency list, which sounds like an efficient choice. > > M > > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M > > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the > > deadlock detector's graph when all of an event's subscribers wait on it. (One > > can approximate that in userspace with a pair of advisory locks.) > > An edge between which two processes? I hadn't de-fuzzed my thinking that far yet. If you still need the answer after this message, let me know, and I'll work on that. As a guess, I think it's three edges M-W1, M-W2 and M-W3. > >> After thinking about it a bit more, I realized that even if we settle > >> on some solution to that problem, there's another issues: the > >> wait-edges created by this system don't really have the same semantics > >> as regular lock waits. Suppose A is waiting on a lock held by B and B > >> is waiting for a lock held by A; that's a deadlock. But if A is > >> waiting for B to write to a tuple queue and B is waiting for A to read > >> from a tuple queue, that's not a deadlock if the queues in question > >> are the same. > > > > I do see a deadlock. B wants to block until A reads from the queue, so it > > advertises that and sleeps. Waking up deadlock_timeout later, it notices that > > A is waiting for B to write something. B will not spontaneously suspend > > waiting to write to the queue, nor will A suspend waiting to read from the > > queue. Thus, the deadlock is valid. This assumes the deadlock detector > > reasons from an authoritative picture of pending waits and that we reliably > > wake up a process when the condition it sought has arrived. We have that for > > heavyweight lock acquisitions. The assumption may be incompatible with > > Andres's hope, quoted above, of avoiding the full lock acquisition procedure. > > I don't understand. What's typically going to happen here is that the > queue will initially be empty, and A will block. Suppose B has a > large message to write to the queue, let's say 100x the queue size. > It will write the first 1% of the message into the queue, set A's > latch, and go to sleep.
A will now wake up, drain the queue, set B's > latch, and go to sleep. B will then wake up, write the next chunk of > the message, set A's latch again, and go to sleep. They'll go back > and forth like this until the entire message has been transmitted. > There's no deadlock here: everything is working as designed. But > there may be instants where both A and B are waiting, because (for > example) A may set B's latch and finish going to sleep before B gets > around to waking up. I see what you had in mind. The problem showed up in the last sentence; namely, the hand-off from process to process is not atomic like it is for heavyweight locks. That's exactly what I was (not too clearly) getting at when I wrote, "This assumes ..." above. For the sake of illustration, suppose you replace your queue in this algorithm with a heavyweight lock and a small buffer. Initially, B holds the lock and A waits for the lock. The following event sequence repeats until the transfer is done:
B fills the buffer
B releases the lock, granting it to A during LockRelease()
B starts waiting for the lock
A empties the buffer
A releases the lock, granting it to B during LockRelease()
A starts waiting for the lock
The deadlock detector will rightly never report a deadlock. To have the same safety in your example, it's not enough for one process to set the latch of another process. The process must update something in static shared memory to indicate that the waited-for condition (queue-nonempty for A, queue-nonfull for B) is now satisfied. You need such a unit of state anyway, don't you? How else would the deadlock detector know that A waits for B to write to the tuple queue? (When the deadlock detector notices this deadlock, it may be running in a process not participating in the parallel group. It can't rely on anything in backend or dynamic shared memory.) With that, the deadlock detector can distinguish a process waiting for a yet-unsatisfied condition from a process that will soon wake to exploit a recently-satisfied condition. > That's probably something that we could patch around, but I think it's > missing the larger point. When you're dealing with locks, there is > one operation that causes blocking (acquiring a lock) and another > operation that unblocks other processes (releasing a lock). With > message queues, there are still two operations (reading and writing), > but either of them can either cause you to block yourself, or to > unblock another process. To put that another way, locks enforce > mutual exclusion; message queues force mutual *inclusion*. Locks > cause blocking when two processes try to operate on the same object at > the same time; message queues cause blocking *unless* two processes > operate on the same object at the same time. That difference seems to > me to be quite fundamental. Interesting thought. > That problem is actually not too hard to avoid if you're OK with > extremely frequent lock manager manipulations. It's not a deadlock if > a worker is waiting for a lock (say, a relation extension lock) held > by the leader, because the leader might release that lock quickly. > But if the leader gets to the top of the funnel node's main loop and > one of its workers is still waiting for the lock at that point, that > is almost certainly a deadlock. Locks might get taken and released > during a single pass through that loop, but any lock held across > iterations is presumably going to be held until end-of-transaction, so > we're hosed.
But checking for deadlock at the top of every main loop > iteration is far too expensive to contemplate. Even doing some much > lighter-weight manipulation of the lock manager data structures at the > top of every loop iteration is probably going to recreate the same > kind of lock manager contention problems that I tried to solve with > the fast-path locking stuff in 9.2. Agreed. > To put all of my cards on the table, I've never really believed in > this approach. The reason we have a deadlock detector in the first > place is because we can't prevent users from making fundamentally > incompatible locking requests, like locking two tables in incompatible > orderings in two different sessions. But we *don't* have deadlock > detection for lwlocks because we (rightly) view it as our job to write > the code in such a way that deadlocks don't happen in the first place. > So we take locks in a predictable order, even in cases (like updating > a tuple) where that involves some fancy gymnastics. We could have > update always lock the old-tuple buffer before the new-tuple buffer > and make it the deadlock detector's job to sort that out, but that's > pretty obviously inferior. We can expect users to tolerate a deadlock > when it is their own actions that have precipitated the deadlock, but > it's a whole different thing to ask them to accept that what from the > user's point of view is an indivisible action (like an UPDATE, or a > parallel query, or perhaps an UPSERT) occasionally deadlocks for some > reason that they can neither understand nor prevent. Tuple queues are > an implementation detail. They shouldn't deadlock for the same reason > that lwlock acquisition shouldn't deadlock, and this whole project > should be entirely unnecessary. > > The way we backed into worrying about exposing tuple queues to the > deadlock detector is that, if you let one backend in a parallel group > get and keep a lock on some resource and then another backend in the > same group tries to lock that resource and can't get the lock, you > will eventually an undetected deadlock. Andres's view is that we > should fix the deadlock detector to detect and report such cases, but > I think that's wrong. If a process in a parallel group can take and > retain until end of transaction a lock on some resource without which > other processes in the same parallel group cannot do useful work, > that's a BUG. We shouldn't need the deadlock detector to report that > problem for the same reason we shouldn't need the deadlock detector to > report lwlock-based deadlocks: if they are ever happening, you need to > fix your code, not the deadlock detector. The number of held LWLocks is always zero when entering user-defined code and again when exiting user-defined code. Therefore, the possible LWLock wait graphs are known at compile time, and we could prove that none contain a deadlock. Therefore, we scarcely miss having an LWLock deadlock detector. That does not map to the tuple queue behavior at hand, because we hope to allow the queue's writer to run user-defined code while the queue's reader is waiting. My term "user-defined code" does come with some hand-waving. A superuser can cause an undetected deadlock via a C-language hook. We would not call that a PostgreSQL bug, though the hook is literally user-defined. Let's keep the deadlock detector able to identify every deadlock a non-superuser can cause. That includes deadlocks arising from heavyweight lock acquisition in parallel-safe functions. 
> I think where this project went off the rails is when we made the > decision as to how to fix the problems in the original group locking > approach. In the original concept, locks between processes in the > same parallel group just didn't ever conflict; two > otherwise-conflicting locks held by different backends within the same > group were regarded as those processes sharing that lock. Andres and > possibly others pointed out that for stuff like relation locks, that's I think you mean "relation extension locks". > probably not the right behavior. I agree with that. It was also > suggested that this would be less scary if we limited ourselves to > sharing the locks held at the beginning of parallelism. That had the > further advantage of making things like relation extension locks, > which won't be held at the starting of paralellism, unshared, while > relation locks would, in most cases, be shared. That felt pretty > good, so I did it, but I now think that was a mistake, because it > creates edge cases where parallel groups can self-deadlock. If we > instead regard relation locks between parallel group members as NEVER > conflicting, rather than conflicting only when both locks were > acquired after the start of parallelism, those edge cases go away. Yes. Then you're back to something like the LWLock scenario, where an undetected deadlock implies a PostgreSQL bug. That's a good place to be. > The only downside is that if a worker A manages to do something to a > relation R that makes it unsafe for worker B to access, and worker B > then gets the lock anyway, we get no-holds-barred insanity instead of > a deadlock. But I am unconvinced that downside amounts to much, > because I believe the only way that A can make it unsafe for B to > access the relation is by doing something that CheckTableNotInUse() > will already prohibit in parallel mode. If it turns out there is some > oversight there, I don't think it'll be hard to plug. This is the bottom line, and I agree. I wrote more about that here: http://www.postgresql.org/message-id/20141226040546.GC1971688@tornado.leadboat.com I admit that it's alarming to have different conflict semantics by locktag. The tension originates, I think, from e.g. LOCKTAG_RELATION_EXTEND serving two distinct purposes simultaneously. As I wrote in the mail just cited, before altering a table in a manner that threatens concurrent usage, one must ensure that (a) other transactions and (b) other parts of my own transaction aren't in the way. Customarily, a heavyweight lock rules out (a), and CheckTableNotInUse() rules out (b). Relation extension does not have or need a CheckTableNotInUse() call or similar, because it doesn't call arbitrary code that might reenter the relation extension process. Under parallelism, though, it will be possible for multiple processes of a given transaction to attempt relation extension simultaneously. So we need a (b) test. It just so happens that allowing LOCKTAG_RELATION_EXTEND to conflict in a parallel group causes that lock acquisition to suffice for both purposes (a) and (b). That's neat but arguably impure. Every cure I've pondered has been worse than the disease, though. It will be okay. > In contrast, > exposing this tuple queue wait information to the deadlock detector is > looking quite complex. Yep. Thanks, nm
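A sketch of the conflict rule this exchange converges on: within a lock group, relation and object locks never conflict, while locktags that enforce real mutual exclusion keep their usual semantics, so acquiring them also covers the (b) check. The helper below is only an assumption about how such a rule might be expressed; it is not existing lock.c logic, and the same_group flag is presumed to come from a group-leader comparison elsewhere.

#include "postgres.h"
#include "storage/lock.h"

static bool
locktag_conflicts_within_group(const LOCKTAG *tag)
{
    switch ((LockTagType) tag->locktag_type)
    {
        case LOCKTAG_RELATION_EXTEND:
        case LOCKTAG_PAGE:
        case LOCKTAG_TUPLE:
            return true;        /* real mutual exclusion: always conflict */
        default:
            return false;       /* relation/object locks are shared in-group */
    }
}

static bool
lock_conflict_matters(bool same_group, const LOCKTAG *tag)
{
    if (same_group && !locktag_conflicts_within_group(tag))
        return false;           /* same parallel group: treat as non-conflicting */
    return true;                /* otherwise apply the normal conflict table */
}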
On Thu, May 7, 2015 at 3:09 AM, Noah Misch <noah@leadboat.com> wrote: >> Each individual queue has only a single reader and a single writer. >> In your example, there would be three queues, not just one. Of >> course, one could design a new shared memory data structure >> representing a collection of related queues, but I still don't see >> exactly how we get around the problem of that requiring >> O(MaxBackends^2) storage space. > > If each backend waits on a uniformly-distributed 50% of other backends, > tracking that wait graph indeed requires O(MaxBackends^2) space. Backends > completing useful work in the field will have far-sparser wait graphs, and > that should inform the choice of data structure: We can, of course, underallocate and hope for the best. >> > M >> > subscribes to queue0-nonempty, and each of W1,W2,W3 publish to it. M >> > publishes to queue0-nonfull, and the workers subscribe. An edge forms in the >> > deadlock detector's graph when all of an event's subscribers wait on it. (One >> > can approximate that in userspace with a pair of advisory locks.) >> >> An edge between which two processes? > > I hadn't de-fuzzed my thinking that far yet. If you still need the answer > after this message, let me know, and I'll work on that. As a guess, I think > it's three edges M-W1, M-W2 and M-W3. I think it's an edge where one end is M and the other end is fuzzily diffused across W1, W2, and W3, because we only need one of them to wake up. There's no way to represent such a thing in the deadlock detector today. We could invent one, of course, but it sounds complicated. Also, I suspect deadlock checking in this world reduces to http://en.wikipedia.org/wiki/Boolean_satisfiability_problem and is therefore NP-complete. > I see what you had in mind. The problem showed up in the last sentence; > namely, the hand-off from process to process is not atomic like it is for > heavyweight locks. That's exactly what I was (not too clearly) getting at > when I wrote, "This assumes ..." above. For the sake of illustration, suppose > you replace your queue in this algorithm with a heavyweight lock and a small > buffer. Initially, B holds the lock and A waits for the lock. The following > event sequence repeats until the transfer is done: > > B fills the buffer > B releases the lock, granting it to A during LockRelease() > B starts waiting for the lock > A empties the buffer > A releases the lock, granting it to B during LockRelease() > A starts waiting for the lock > > The deadlock detector will rightly never report a deadlock. To have the same > safety in your example, it's not enough for one process to set the latch of > another process. The process must update something in static shared memory to > indicate that the waited-for condition (queue-nonempty for A, queue-nonfull > for B) is now satisfied. You need such a unit of state anyway, don't you? Yes. It's the number of bytes read and written, and it's stored in dynamic shared memory. You can move some of the state to the main shared memory segment, but I don't look forward to the performance consequences of a lock acquire & release cycle every time the queue fills or drains. Of course, there is also a loss of flexibility there: a big reason for developing this stuff in the first place was to avoid being constrained by the main shared memory segment. > The number of held LWLocks is always zero when entering user-defined code and > again when exiting user-defined code. 
Therefore, the possible LWLock wait > graphs are known at compile time, and we could prove that none contain a > deadlock. Therefore, we scarcely miss having an LWLock deadlock detector. > That does not map to the tuple queue behavior at hand, because we hope to > allow the queue's writer to run user-defined code while the queue's reader is > waiting. My term "user-defined code" does come with some hand-waving. A > superuser can cause an undetected deadlock via a C-language hook. We would > not call that a PostgreSQL bug, though the hook is literally user-defined. > Let's keep the deadlock detector able to identify every deadlock a > non-superuser can cause. That includes deadlocks arising from heavyweight > lock acquisition in parallel-safe functions. I'm in agreement with that goal. >> I think where this project went off the rails is when we made the >> decision as to how to fix the problems in the original group locking >> approach. In the original concept, locks between processes in the >> same parallel group just didn't ever conflict; two >> otherwise-conflicting locks held by different backends within the same >> group were regarded as those processes sharing that lock. Andres and >> possibly others pointed out that for stuff like relation locks, that's > > I think you mean "relation extension locks". Yes, thanks. >> probably not the right behavior. I agree with that. It was also >> suggested that this would be less scary if we limited ourselves to >> sharing the locks held at the beginning of parallelism. That had the >> further advantage of making things like relation extension locks, >> which won't be held at the starting of paralellism, unshared, while >> relation locks would, in most cases, be shared. That felt pretty >> good, so I did it, but I now think that was a mistake, because it >> creates edge cases where parallel groups can self-deadlock. If we >> instead regard relation locks between parallel group members as NEVER >> conflicting, rather than conflicting only when both locks were >> acquired after the start of parallelism, those edge cases go away. > > Yes. Then you're back to something like the LWLock scenario, where an > undetected deadlock implies a PostgreSQL bug. That's a good place to be. Good. >> The only downside is that if a worker A manages to do something to a >> relation R that makes it unsafe for worker B to access, and worker B >> then gets the lock anyway, we get no-holds-barred insanity instead of >> a deadlock. But I am unconvinced that downside amounts to much, >> because I believe the only way that A can make it unsafe for B to >> access the relation is by doing something that CheckTableNotInUse() >> will already prohibit in parallel mode. If it turns out there is some >> oversight there, I don't think it'll be hard to plug. > > This is the bottom line, and I agree. I wrote more about that here: > http://www.postgresql.org/message-id/20141226040546.GC1971688@tornado.leadboat.com > > I admit that it's alarming to have different conflict semantics by locktag. > The tension originates, I think, from e.g. LOCKTAG_RELATION_EXTEND serving two > distinct purposes simultaneously. As I wrote in the mail just cited, before > altering a table in a manner that threatens concurrent usage, one must ensure > that (a) other transactions and (b) other parts of my own transaction aren't > in the way. Customarily, a heavyweight lock rules out (a), and > CheckTableNotInUse() rules out (b). 
Relation extension does not have or need > a CheckTableNotInUse() call or similar, because it doesn't call arbitrary code > that might reenter the relation extension process. Under parallelism, though, > it will be possible for multiple processes of a given transaction to attempt > relation extension simultaneously. So we need a (b) test. It just so happens > that allowing LOCKTAG_RELATION_EXTEND to conflict in a parallel group causes > that lock acquisition to suffice for both purposes (a) and (b). That's neat > but arguably impure. Every cure I've pondered has been worse than the > disease, though. It will be okay. That's my opinion as well. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
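A simplified sketch of the per-queue state Robert refers to just above: counts of bytes written and bytes read living in dynamic shared memory, from which fullness and emptiness fall out by subtraction. The real structure is the shm_mq in shm_mq.c; the type and field names below are illustrative only.

#include "postgres.h"

typedef struct ToyQueueState
{
    uint64      bytes_written;  /* total bytes ever written by the sender */
    uint64      bytes_read;     /* total bytes ever consumed by the receiver */
    Size        ring_size;      /* capacity of the ring buffer */
} ToyQueueState;

static inline Size
toyqueue_used(const ToyQueueState *q)
{
    return (Size) (q->bytes_written - q->bytes_read);
}

static inline bool
toyqueue_is_empty(const ToyQueueState *q)
{
    return toyqueue_used(q) == 0;               /* reader would have to wait */
}

static inline bool
toyqueue_is_full(const ToyQueueState *q)
{
    return toyqueue_used(q) == q->ring_size;    /* writer would have to wait */
}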