background workers, round three
From | Robert Haas
Subject | background workers, round three
Date |
Msg-id | CA+TgmoaoB0GWr0Tm71GXt6h9ZA4xx5nk+K3NiFJ30qdz7ChjwQ@mail.gmail.com
Replies | Re: background workers, round three (Michael Paquier <michael.paquier@gmail.com>)
List | pgsql-hackers
Last week, I attempted to write some code to perform a trivial operation in parallel by launching background workers. Despite my earlier convictions that I'd built enough infrastructure to make this possible, the experiment turned out badly - so, patches!

It's been pretty obvious to me from the beginning that any sort of parallel computation would need a way to make sure that if any worker dies, everybody dies. Conversely, if the parent aborts, all the workers should die. My thought was that this could be implemented using PG_TRY()/PG_CATCH() blocks over the existing infrastructure, but this turned out to be naive. If the original backend encounters an error before the child manages to start up, there's no good recovery strategy. The parent can't kill the child because it doesn't exist yet, and the parent can't stop the postmaster from starting the child later. The parent could try waiting until the child starts up and THEN killing it, but that's probably unreliable and surely mind-numbingly lame.

The first attached patch, terminate-worker-v1.patch, rectifies this problem by providing a TerminateBackgroundWorker() function. When this function is invoked, the postmaster will become unwilling to restart the worker and will send it a SIGTERM if it's already running. (It's important that the SIGTERM be sent by the postmaster, because if the original backend tries to send it, there's a race condition: the process might not be started at the time the original backend tries to send the signal, but the postmaster might start it before it sees the terminate request.) By itself, this is useful, but painful. The pain comes from the fact that all of the house-keeping is left to the programmer. It is possible but not elegant to use something like PG_ENSURE_ERROR_CLEANUP() to ensure that all background workers are terminated even on an error exit from the affected code.
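[Editor's sketch] To illustrate the housekeeping burden just described, a backend might pair TerminateBackgroundWorker() with PG_ENSURE_ERROR_CLEANUP() roughly as follows. This is a non-runnable sketch against the dynamic-bgworker API as it later stabilized in PostgreSQL; `my_extension`, `my_worker_main`, and `run_with_worker` are hypothetical names, not part of the patch.

```c
#include "postgres.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"

static BackgroundWorkerHandle *worker_handle = NULL;

/*
 * Error-cleanup callback: ask the postmaster to terminate the worker and
 * refuse to restart it.  Routing the SIGTERM through the postmaster avoids
 * the race where the backend signals a worker that has not started yet.
 */
static void
terminate_my_worker(int code, Datum arg)
{
	if (worker_handle != NULL)
		TerminateBackgroundWorker(worker_handle);
}

static void
run_with_worker(void)
{
	BackgroundWorker worker;

	MemSet(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	snprintf(worker.bgw_name, BGW_MAXLEN, "demo worker");
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "my_extension");    /* hypothetical */
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "my_worker_main"); /* hypothetical */

	if (!RegisterDynamicBackgroundWorker(&worker, &worker_handle))
		elog(ERROR, "could not register background worker");

	/* Guarantee the worker is terminated even on an error exit. */
	PG_ENSURE_ERROR_CLEANUP(terminate_my_worker, (Datum) 0);
	{
		/* ... cooperate with the worker ... */
	}
	PG_END_ENSURE_ERROR_CLEANUP(terminate_my_worker, (Datum) 0);

	TerminateBackgroundWorker(worker_handle);
}
```

Note that the cleanup callback and the explicit PG_ENSURE_ERROR_CLEANUP bracketing are exactly the per-call-site boilerplate the message calls "possible but not elegant".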
The other half of the problem is harder: how do we ensure not only that the untimely demise of a worker process aborts the original backend's transaction, but that it does so in a relatively timely fashion? If the original backend is executing only a very limited amount of code while the parallel workers remain in existence, it would be possible to add explicit checks for the demise of a worker at periodic intervals through all of that code. But this seems a very limiting approach.

In the hope of making things better, the second patch attached herewith, ephemeral-precious-v1.patch, adds two new flags, BGWORKER_EPHEMERAL and BGWORKER_PRECIOUS.

Setting the BGWORKER_EPHEMERAL flag causes the background worker to be killed when the registrant's (sub)transaction ends. This eliminates the need to catch errors and explicitly invoke TerminateBackgroundWorker() in the error path. You can simply register an ephemeral worker, write code to do stuff with it, and then terminate it. If an error occurs part-way through, the worker will be terminated as part of the abort path.

Setting the BGWORKER_PRECIOUS flag causes the unexpected death of the worker to abort the registrant's current (sub)transaction. This eliminates the need to sprinkle the code with checks for a deceased worker. Instead, you can simply register a precious worker, and then just remember to CHECK_FOR_INTERRUPTS().

There were a couple of awkward cases here. First, all the existing stuff that hooks into ProcessInterrupts() makes provision for handling the ImmediateInterruptOK case. I felt that was superfluous here, so instead simply prohibited leaving a precious background worker running beyond the end of the statement. The point is to enable parallel computation, which will, I think, begin and end within the lifespan of one query. Second, odd things happen if the original backend launches precious workers and then begins a subtransaction.
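[Editor's sketch] The usage pattern the two flags aim to enable might reduce to something like the fragment below. BGWORKER_EPHEMERAL and BGWORKER_PRECIOUS are the patch's proposed flags, not a committed PostgreSQL API, and `work_is_done()` is a hypothetical placeholder.

```c
BackgroundWorkerHandle *handle;
BackgroundWorker worker;

MemSet(&worker, 0, sizeof(worker));
snprintf(worker.bgw_name, BGW_MAXLEN, "parallel helper");
worker.bgw_flags = BGWORKER_SHMEM_ACCESS
	| BGWORKER_EPHEMERAL	/* killed when my (sub)transaction ends */
	| BGWORKER_PRECIOUS;	/* its unexpected death aborts my (sub)transaction */
worker.bgw_start_time = BgWorkerStart_ConsistentState;
worker.bgw_restart_time = BGW_NEVER_RESTART;

if (!RegisterDynamicBackgroundWorker(&worker, &handle))
	elog(ERROR, "could not register background worker");

/*
 * No explicit error-path cleanup needed: if anything below throws, the
 * abort path terminates the ephemeral worker.  If the worker instead dies
 * unexpectedly, the next CHECK_FOR_INTERRUPTS() raises an error here.
 */
while (!work_is_done())			/* hypothetical */
{
	CHECK_FOR_INTERRUPTS();
	/* ... cooperate with the worker ... */
}
```

Compared with the first patch alone, both the PG_ENSURE_ERROR_CLEANUP bracketing and the scattered is-the-worker-alive checks disappear.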
Throwing an error in the subtransaction will not do the right thing; the subtransaction may easily be something like a PL/pgsql exception block, in which case the error will be caught and result in unexpected behavior. So we don't. I just added a comment saying that if you do decide to start a subtransaction while you've got precious workers outstanding, you'd better insert an explicit check for whether they're still alive after unwinding the subtransaction (there's a function to make that easy). We could probably build a mechanism to allow an error to be thrown against a (sub)xact other than the innermost one, implicitly aborting everything below that with extreme prejudice, but it seems like overkill. I can't help but imagine that early versions of parallel-anything will include "starting subtransactions" on the list of activities which are prohibited in parallel mode.

Using the infrastructure provided by those patches, I was able to write some test code, attached as pingpong-v1.patch. You can make a backend fire up a background worker, and the two will take turns setting each other's latches for a number of iterations you specify. This could possibly be adapted into a regression test, if people think it's valuable, but for the moment I'm just including it as a demonstration of the functionality, not intended for commit. Rather gratifyingly, you can set a large iteration count and then interrupt or terminate the foreground process, or terminate the background process, and the other one goes away as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments: terminate-worker-v1.patch, ephemeral-precious-v1.patch, pingpong-v1.patch