Thread: ExecGather() + nworkers
The Gather node executor function ExecGather() does this:

    /*
     * Register backend workers.  We might not get as many as we
     * requested, or indeed any at all.
     */
    pcxt = node->pei->pcxt;
    LaunchParallelWorkers(pcxt);

    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers > 0)
    {
        ...
    }

    /* No workers?  Then never mind. */
    if (!got_any_worker)
        ExecShutdownGatherWorkers(node);

I'm not sure why the test for nworkers following the
LaunchParallelWorkers() call doesn't look like this, though:

    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers_launched > 0)
    {
        ...
    }

ISTM, having followed the chain of calls, that the "if" statement I
highlight here is currently tautological (excluding the possibility of
one or two special cases in the CreateParallelContext() call performed
by ExecInitParallelPlan() -- e.g., the IsolationIsSerializable() case
*can* result in the caller's nworkers being overridden to 0).

I guess it isn't surprising that the code evolved to look like this,
since the commit message of b0b0d84b ponders: "I suspect we'll want to
revise the Gather node to make use of this new capability [relaunching
workers], but even if not it may be useful elsewhere and requires very
little additional code". The nworkers_launched tracking only came in a
later commit.

From my point of view, as a student of this code working on parallel
index builds (i.e. new code that will also be a client of parallel.c,
a module which right now has only nodeGather.c as a client to learn
from), this is confusing -- simply because the simpler approach isn't
taken when it could have been. It isn't actually wrong that we figure
out whether any workers were launched in this relatively laborious
way:

    /* Set up tuple queue readers to read the results. */
    if (pcxt->nworkers > 0)
    {
        ...
        for (i = 0; i < pcxt->nworkers; ++i)
        {
            if (pcxt->worker[i].bgwhandle == NULL)
                continue;
            ...
            got_any_worker = true;
            ...
But going to this additional trouble (detecting no workers launched on
the basis of !nworkers_launched) suggests that simply testing
nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
do that, and in so doing also totally remove the "for" loop shown
here?

In the case of parallel sequential scan, it looks like one worker can
be helpful, because then the gather node (leader process) can run the
plan itself to some degree, and so there are effectively 2 processes
scanning at a minimum (unless 0 workers could be allocated to begin
with). How useful is it to have a parallel scan when this happens,
though?

I guess it isn't obvious to me how to reliably back out of not being
able to launch at least 2 workers in the case of my parallel index
build patch, because I suspect 2 workers (plus the leader process) are
the minimum number that will make index builds faster. Right now, it
looks like I'll have to check nworkers_launched in the leader (which
will be the only process to access the ParallelContext, since it's in
its local memory). Then, having established that there are at least
the minimum useful number of worker processes launched for sorting,
the leader can "permit" worker processes to "really" start based on
changing some state in the TOC/segment in common use. Otherwise, the
leader must call the whole thing off and do a conventional, serial
index build, even though technically the main worker process function
has started execution in worker processes.

I think what might be better is a general solution to my problem,
which I imagine will crop up again and again as new clients are added.
I would like an API that lets callers of LaunchParallelWorkers() only
actually launch *any* worker on the basis of having been able to
launch some minimum sensible number (typically 2). Otherwise, indicate
failure, allowing callers to call the whole thing off in a general
way, without the postmaster having actually launched anything, and
without custom "call it all off" code for parallel index builds. This
would probably involve introducing a distinction between a
BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
postmaster accidentally launch 1 worker process before we established
definitively that launching any is really a good idea.

Does that make sense?

Thanks
--
Peter Geoghegan
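[The reservation API sketched above might amount to something like the
following. This is purely illustrative -- the name and signature of
ReserveParallelWorkers() are invented here; nothing like it exists in
parallel.c.]

```c
#include <assert.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Hypothetical sketch only, not PostgreSQL code.  Decide how many
 * workers to launch, given the number of free BackgroundWorkerSlots.
 * If we cannot reserve at least min_workers slots, reserve none at
 * all, so the caller can fall back to a serial plan before the
 * postmaster has launched anything.
 */
int
ReserveParallelWorkers(int slots_free, int nworkers_requested,
                       int min_workers)
{
    int reserved = Min(slots_free, nworkers_requested);

    if (reserved < min_workers)
        return 0;               /* indicate failure; launch nothing */
    return reserved;
}
```

[Under such a scheme, a parallel index build could ask for 8 workers
with a minimum of 2, and either get at least 2 or cleanly do a serial
build, with no "call it all off" signaling after launch.]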
On Sun, Jan 10, 2016 at 12:29 AM, Peter Geoghegan <pg@heroku.com> wrote:
> The Gather node executor function ExecGather() does this:
> [ code ]
> I'm not sure why the test for nworkers following the
> LaunchParallelWorkers() call doesn't look like this, though:
>
> /* Set up tuple queue readers to read the results. */
> if (pcxt->nworkers_launched > 0)
> {
>     ...
> }

Hmm, yeah, I guess it could do that.

> But going to this additional trouble (detecting no workers launched on
> the basis of !nworkers_launched) suggests that simply testing
> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> do that, and in so doing also totally remove the "for" loop shown
> here?

I don't see how the for loop goes away.

> In the case of parallel sequential scan, it looks like one worker can
> be helpful, because then the gather node (leader process) can run the
> plan itself to some degree, and so there are effectively 2 processes
> scanning at a minimum (unless 0 workers could be allocated to begin
> with). How useful is it to have a parallel scan when this happens,
> though?

Empirically, that's really quite useful. When you have 3 or 4
workers, the leader really doesn't make a significant contribution to
the work, but what I've seen in my testing is that 1 worker often runs
almost twice as fast as 0 workers.

> I guess it isn't obvious to me how to reliably back out of not being
> able to launch at least 2 workers in the case of my parallel index
> build patch, because I suspect 2 workers (plus the leader process) are
> the minimum number that will make index builds faster. Right now, it
> looks like I'll have to check nworkers_launched in the leader (which
> will be the only process to access the ParallelContext, since it's in
> its local memory). Then, having established that there are at least
> the minimum useful number of worker processes launched for sorting,
> the leader can "permit" worker processes to "really" start based on
> changing some state in the TOC/segment in common use. Otherwise, the
> leader must call the whole thing off and do a conventional, serial
> index build, even though technically the main worker process function
> has started execution in worker processes.

I don't really understand why this should be so. I thought the idea
of parallel sort is (roughly) that each worker should read data until
it fills work_mem, sort that data, and write a tape. Repeat until no
data remains. Then, merge the tapes. I don't see any reason at all
why this shouldn't work just fine with a leader and 1 worker.

> I think what might be better is a general solution to my problem,
> which I imagine will crop up again and again as new clients are added.
> I would like an API that lets callers of LaunchParallelWorkers() only
> actually launch *any* worker on the basis of having been able to
> launch some minimum sensible number (typically 2). Otherwise, indicate
> failure, allowing callers to call the whole thing off in a general
> way, without the postmaster having actually launched anything, and
> without custom "call it all off" code for parallel index builds. This
> would probably involve introducing a distinction between a
> BackgroundWorkerSlot being "reserved" rather than "in_use", lest the
> postmaster accidentally launch 1 worker process before we established
> definitively that launching any is really a good idea.

I think that's probably over-engineered. I mean, it wouldn't be that
hard to have the workers just exit if you decide you don't want them,
and I don't really want to make the signaling here more complicated
than it really needs to be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
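[The run-forming scheme Robert describes can be sketched in miniature.
This is purely illustrative C, not tuplesort.c's implementation;
"work_mem" here is just an element count, and a sorted sub-array
stands in for a tape.]

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for ints */
int
cmp_int(const void *a, const void *b)
{
    int x = *(const int *) a;
    int y = *(const int *) b;

    return (x > y) - (x < y);
}

/*
 * Toy model of run formation in an external sort: read input until
 * the memory budget fills, sort that chunk, "write a tape" (here,
 * sort the sub-array in place), and repeat until no data remains.
 * Returns the number of runs produced; a later merge pass combines
 * them into one sorted stream.
 */
int
form_runs(int *data, int n, int work_mem)
{
    int nruns = 0;
    int i;

    for (i = 0; i < n; i += work_mem)
    {
        int len = (n - i < work_mem) ? n - i : work_mem;

        qsort(data + i, len, sizeof(int), cmp_int);
        nruns++;
    }
    return nruns;
}
```

[Each worker would independently do the equivalent of form_runs() over
its share of the input; whether there is 1 worker or 8, the merge step
that follows is the same, which is why a leader plus 1 worker needs no
special case in this scheme.]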
On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not sure why the test for nworkers following the
>> LaunchParallelWorkers() call doesn't look like this, though:
>>
>> /* Set up tuple queue readers to read the results. */
>> if (pcxt->nworkers_launched > 0)
>> {
>>     ...
>> }
>
> Hmm, yeah, I guess it could do that.

That would make it clearer as an example.

>> But going to this additional trouble (detecting no workers launched on
>> the basis of !nworkers_launched) suggests that simply testing
>> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
>> do that, and in so doing also totally remove the "for" loop shown
>> here?
>
> I don't see how the for loop goes away.

I meant that some code in the "for" loop goes away. Not all of it.
Just the more obscure code. As I said, I'm mostly pointing this out
out of concern for making it clearer as example code.

>> In the case of parallel sequential scan, it looks like one worker can
>> be helpful, because then the gather node (leader process) can run the
>> plan itself to some degree, and so there are effectively 2 processes
>> scanning at a minimum (unless 0 workers could be allocated to begin
>> with). How useful is it to have a parallel scan when this happens,
>> though?
>
> Empirically, that's really quite useful. When you have 3 or 4
> workers, the leader really doesn't make a significant contribution to
> the work, but what I've seen in my testing is that 1 worker often runs
> almost twice as fast as 0 workers.

I suppose that makes sense, given that parallel sequential scan works
best when most tuples are eliminated in workers; there ought not to be
many tuples filling the single worker's queue anyway.

> I don't really understand why this should be so. I thought the idea
> of parallel sort is (roughly) that each worker should read data until
> it fills work_mem, sort that data, and write a tape. Repeat until no
> data remains. Then, merge the tapes. I don't see any reason at all
> why this shouldn't work just fine with a leader and 1 worker.

It will work fine with a leader and 1 worker -- the code will be
correct, and without any special cases. But it will be a suboptimal
use of resources. From the caller's point of view, there is no reason
to think it will be faster, and some reason to think it will be
slower. A particular concern for parallel sort is that the sort might
not use enough memory to need to be an external sort, but you
effectively force it to be one by making it a parallel sort (that is
not ideal in the long term, but it's a good compromise for 9.6's
parallel index build stuff). You're also consuming a
BackgroundWorkerSlot for the duration of the sort, in an environment
where evidently those are in short supply.

Now, you might wonder why it is that the leader cannot also sort runs,
just as a worker would. It's possible, but it isn't exactly
straightforward. You have to have special cases in several places,
even though it probably is going to be uncommon to only have one
BackgroundWorkerSlot available in practice. It's simpler to just opt
out, and seems better given that max_parallel_degree is a way of
resource limiting based on available cores (it's certainly not about
the availability of shared memory for the BackgroundWorkerSlot array).

More importantly, I have other, entirely general concerns. Other major
RDBMSs have settings that are very similar to max_parallel_degree,
with a setting of 1 effectively disabling all parallelism. Both Oracle
and SQL Server have settings that they call the "maximum degree of
parallelism". I think it's a bit odd that with Postgres,
max_parallel_degree = 1 can still use parallelism at all. I have to
wonder: are we conflating controlling the resources used by parallel
operations with how shared memory is doled out?

I could actually "give back" my parallel worker slots early if I
really wanted to (that would be messy, but the then-quiesced workers
do nothing for the duration of the merge beyond conceptually owning
the shared tape temp files). I don't think releasing the slots early
makes sense, because I tend to think that hanging on to the workers
helps the DBA in managing the server's resources. The still-serial
merge phase is likely to become a big bottleneck with parallel sort.

With parallel sequential scan, a max_parallel_degree of 8 could result
in 16 processes scanning in parallel. That's a concern, and not least
because it happens only sometimes, when things are timed just right.
The fact that only half of those processes are "genuine" workers seems
to me like a distinction without a difference.

> I think that's probably over-engineered. I mean, it wouldn't be that
> hard to have the workers just exit if you decide you don't want them,
> and I don't really want to make the signaling here more complicated
> than it really needs to be.

I worry about the additional overhead of constantly starting and
stopping a single worker in some cases (not so much with parallel
index build, but other uses of parallel sort beyond 9.6). Furthermore,
the coordination between worker and leader processes to make this
happen seems messy -- you actually have the postmaster launch
processes, but they must immediately get permission to do anything.

It wouldn't be that hard to offer a general way of doing this, so why
not?

--
Peter Geoghegan
On Sun, Jan 10, 2016 at 1:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel.

I meant a max_worker_processes setting, which of course is different.
Nevertheless, I find it surprising that max_parallel_degree = 1 does
not prevent parallel operations, and that max_parallel_degree is
defined in terms of the availability of worker processes (in the
strict sense of worker processes that are launched by
LaunchParallelWorkers(), and not a broader and more practical
definition).

--
Peter Geoghegan
On Sun, Jan 10, 2016 at 4:44 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> I don't really understand why this should be so. I thought the idea
>> of parallel sort is (roughly) that each worker should read data until
>> it fills work_mem, sort that data, and write a tape. Repeat until no
>> data remains. Then, merge the tapes. I don't see any reason at all
>> why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.

Well, in general, the parallel sort code doesn't really get to pick
whether or not a BackgroundWorkerSlot gets used or not. Whoever
created the parallel context decides how many workers to request, and
then the context got as many of those as it could. It then did
arbitrary computation, which at some point in the middle involves one
or more parallel sorts. You can't just have one of those workers up
and exit in the middle. Now, in the specific case of parallel index
build, you probably can do that, if you want to. But to be honest,
I'd be inclined not to include that in the first version. If you get
fewer workers than you asked for, just use the number you got. Let's
see how that actually works out before we decide that we need a lot
more mechanism here. You may find that it's surprisingly effective to
do it this way.

> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just opt
> out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).

I am surprised that this is not straightforward. I don't see why it
shouldn't be, and it worries me that you think it isn't.

> More importantly, I have other, entirely general concerns. Other major
> RDBMSs have settings that are very similar to max_parallel_degree,
> with a setting of 1 effectively disabling all parallelism. Both Oracle
> and SQL Server have settings that they call the "maximum degree of
> parallelism". I think it's a bit odd that with Postgres,
> max_parallel_degree = 1 can still use parallelism at all. I have to
> wonder: are we conflating controlling the resources used by parallel
> operations with how shared memory is doled out?

We could redefine things so that max_parallel_degree = N means use N
- 1 workers, with a minimum value of 1 rather than 0, if there's a
consensus that that's better. Personally, I prefer it the way we've
got it: it's real darned clear in my mind that max_parallel_degree=0
means "not parallel". But I won't cry into my beer if a consensus
emerges that the other way would be better.

> I could actually "give back" my parallel worker slots early if I
> really wanted to (that would be messy, but the then-quiesced workers
> do nothing for the duration of the merge beyond conceptually owning
> the shared tape temp files). I don't think releasing the slots early
> makes sense, because I tend to think that hanging on to the workers
> helps the DBA in managing the server's resources. The still-serial
> merge phase is likely to become a big bottleneck with parallel sort.

Like I say, the sort code better not know anything about this
directly, or it's going to break when embedded in a query.

> With parallel sequential scan, a max_parallel_degree of 8 could result
> in 16 processes scanning in parallel. That's a concern, and not least
> because it happens only sometimes, when things are timed just right.
> The fact that only half of those processes are "genuine" workers seems
> to me like a distinction without a difference.

This seems dead wrong. A max_parallel_degree of 8 means you have a
leader and 8 workers. Where are the other 7 processes coming from?
What you should have is 8 processes each of which is participating in
both the parallel seq scan and the parallel sort, not 8 processes
scanning and 8 separate processes sorting.

>> I think that's probably over-engineered. I mean, it wouldn't be that
>> hard to have the workers just exit if you decide you don't want them,
>> and I don't really want to make the signaling here more complicated
>> than it really needs to be.
>
> I worry about the additional overhead of constantly starting and
> stopping a single worker in some cases (not so much with parallel
> index build, but other uses of parallel sort beyond 9.6). Furthermore,
> the coordination between worker and leader processes to make this
> happen seems messy -- you actually have the postmaster launch
> processes, but they must immediately get permission to do anything.
>
> It wouldn't be that hard to offer a general way of doing this, so why not?

Well, if these things become actual problems, fine, we can fix them.
But let's not decide to add the API before we're agreed that we need
it to solve an actual problem that we both agree we have. We are not
there yet.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> >> I'm not sure why the test for nworkers following the
> >> LaunchParallelWorkers() call doesn't look like this, though:
> >>
> >> /* Set up tuple queue readers to read the results. */
> >> if (pcxt->nworkers_launched > 0)
> >> {
> >> ...
> >> }
> >
> > Hmm, yeah, I guess it could do that.
>
> That would make it clearer as an example.
>
> >> But going to this additional trouble (detecting no workers launched on
> >> the basis of !nworkers_launched) suggests that simply testing
> >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> >> do that, and in so doing also totally remove the "for" loop shown
> >> here?
> >
> > I don't see how the for loop goes away.
>
> I meant that some code in the "for" loop goes away. Not all of it.
> Just the more obscure code. As I said, I'm mostly pointing this out
> out of concern for making it clearer as example code.
>
Right, I can write a patch to do it in the way you are suggesting, if
you are not planning to do it.
>
> >> In the case of parallel sequential scan, it looks like one worker can
> >> be helpful, because then the gather node (leader process) can run the
> >> plan itself to some degree, and so there are effectively 2 processes
> >> scanning at a minimum (unless 0 workers could be allocated to begin
> >> with). How useful is it to have a parallel scan when this happens,
> >> though?
> >
> > Empirically, that's really quite useful. When you have 3 or 4
> > workers, the leader really doesn't make a significant contribution to
> > the work, but what I've seen in my testing is that 1 worker often runs
> > almost twice as fast as 0 workers.
>
> I suppose that makes sense, given that parallel sequential scan works
> best when most tuples are eliminated in workers; there ought not to be
> many tuples filling the single worker's queue anyway.
>
> > I don't really understand why this should be so. I thought the idea
> > of parallel sort is (roughly) that each worker should read data until
> > it fills work_mem, sort that data, and write a tape. Repeat until no
> > data remains. Then, merge the tapes. I don't see any reason at all
> > why this shouldn't work just fine with a leader and 1 worker.
>
> It will work fine with a leader and 1 worker -- the code will be
> correct, and without any special cases. But it will be a suboptimal
> use of resources. From the caller's point of view, there is no reason
> to think it will be faster, and some reason to think it will be
> slower. A particular concern for parallel sort is that the sort might
> not use enough memory to need to be an external sort, but you
> effectively force it to be one by making it a parallel sort (that is
> not ideal in the long term, but it's a good compromise for 9.6's
> parallel index build stuff). You're also consuming a
> BackgroundWorkerSlot for the duration of the sort, in an environment
> where evidently those are in short supply.
>
> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).
>
If I understand correctly, you are worried about the case where, if
the leader is not able to launch the minimum required number of
workers, a parallel index build will be slower than a serial index
build. I think that is a genuine worry, but it won't be difficult to
simply make parallel execution behave as serial execution (basically,
you have all the work done by the leader). Now, one could worry that
there will be some overhead from setting up and destroying workers in
this case, but I think that could be treated as a limitation of the
initial version of the implementation, and if such a case turns out to
be common in general usage, then we could have some mechanism to
reserve the workers and start parallelism only when the leader is able
to reserve the required number of workers.
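[The "just behave as serial execution" fallback has a nice property:
serial execution is simply the degenerate case of the same code path.
A toy illustration -- participant_share() is a hypothetical helper
invented here, not PostgreSQL code:]

```c
#include <assert.h>

/*
 * Hypothetical sketch: divide n items among the leader plus however
 * many workers were actually launched.  Participant 0 is the leader;
 * the remainder is spread over the first n % nparticipants
 * participants.  With nworkers_launched == 0, the leader's share is
 * everything -- serial execution falls out with no special case.
 */
int
participant_share(int n, int nworkers_launched, int participant)
{
    int nparticipants = 1 + nworkers_launched;

    return n / nparticipants +
        (participant < n % nparticipants ? 1 : 0);
}
```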
On Sun, Jan 10, 2016, Robert Haas <robertmhaas@gmail.com> wrote:
>> More importantly, I have other, entirely general concerns. Other major
>> RDBMSs have settings that are very similar to max_parallel_degree,
>> with a setting of 1 effectively disabling all parallelism. Both Oracle
>> and SQL Server have settings that they call the "maximum degree of
>> parallelism". I think it's a bit odd that with Postgres,
>> max_parallel_degree = 1 can still use parallelism at all. I have to
>> wonder: are we conflating controlling the resources used by parallel
>> operations with how shared memory is doled out?
>
> We could redefine things so that max_parallel_degree = N means use N
> - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better. Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel". But I won't cry into my beer if a consensus
> emerges that the other way would be better.
When max_parallel_degree is renamed to max_query_workers or something
similar, the new metric will make sense, and can be more intuitive.

Regards

Pavel
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Sun, Jan 10, 2016 at 5:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> Well, in general, the parallel sort code doesn't really get to pick
> whether or not a BackgroundWorkerSlot gets used or not. Whoever
> created the parallel context decides how many workers to request, and
> then the context got as many of those as it could. It then did
> arbitrary computation, which at some point in the middle involves one
> or more parallel sorts. You can't just have one of those workers up
> and exit in the middle. Now, in the specific case of parallel index
> build, you probably can do that, if you want to. But to be honest,
> I'd be inclined not to include that in the first version. If you get
> fewer workers than you asked for, just use the number you got. Let's
> see how that actually works out before we decide that we need a lot
> more mechanism here. You may find that it's surprisingly effective to
> do it this way.

I am inclined to just accept for the time being (until I post the
patch) that one worker and one leader may sometimes be all that we
get. I will put off dealing with the problem until I show code, in
other words.

>> Now, you might wonder why it is that the leader cannot also sort runs,
>> just as a worker would. It's possible, but it isn't exactly
>> straightforward.
>
> I am surprised that this is not straightforward. I don't see why it
> shouldn't be, and it worries me that you think it isn't.

It isn't entirely straightforward because it requires special case
handling. For example, I must teach the leader not to try and wait on
itself to finish sorting runs. It might otherwise attempt that ahead
of its final on-the-fly merge.

>> More importantly, I have other, entirely general concerns. Other major
>> RDBMSs have settings that are very similar to max_parallel_degree,
>> with a setting of 1 effectively disabling all parallelism. Both Oracle
>> and SQL Server have settings that they both call the "maximum degree
>> of parallelism". I think it's a bit odd that with Postgres,
>> max_parallel_degree = 1 can still use parallelism at all. I have to
>> wonder: are we conflating controlling the resources used by parallel
>> operations with how shared memory is doled out?
>
> We could redefine things so that max_parallel_degree = N means use N
> - 1 workers, with a minimum value of 1 rather than 0, if there's a
> consensus that that's better. Personally, I prefer it the way we've
> got it: it's real darned clear in my mind that max_parallel_degree=0
> means "not parallel". But I won't cry into my beer if a consensus
> emerges that the other way would be better.

The fact that we don't do that isn't quite the issue, though. It may
or may not make sense to count the leader as an additional worker when
the leader has very little work to do. In good cases for parallel
sequential scan, the leader has very little leader-specific work to
do, because most of the time is spent on workers (including the
leader) determining that tuples must not be returned to the leader.
When that is less true, maybe the leader could reasonably count as a
fixed cost for a parallel operation. Hard to say. I'm sorry that
that's not really actionable, but I'm still working this stuff out.

>> I could actually "give back" my parallel worker slots early if I
>> really wanted to (that would be messy, but the then-quiesced workers
>> do nothing for the duration of the merge beyond conceptually owning
>> the shared tape temp files). I don't think releasing the slots early
>> makes sense, because I tend to think that hanging on to the workers
>> helps the DBA in managing the server's resources. The still-serial
>> merge phase is likely to become a big bottleneck with parallel sort.
>
> Like I say, the sort code better not know anything about this
> directly, or it's going to break when embedded in a query.

tuplesort.c knows very little. nbtsort.c manages workers, and their
worker tuplesort states, as well as the leader and its tuplesort
state. So tuplesort.c knows a little bit about how the leader
tuplesort state may need to reconstruct worker state in order to do
its final on-the-fly merge. It knows nothing else, though, and
provides generic hooks for assigning worker numbers to worker
processes, or logical run numbers (this keeps trace_sort output
straight, plus a few other things). Parallel workers are all managed
in nbtsort.c, which seems appropriate. Note that I have introduced a
way in which a single tuplesort state doesn't perfectly encapsulate a
single sort operation, though.

> This seems dead wrong. A max_parallel_degree of 8 means you have a
> leader and 8 workers. Where are the other 7 processes coming from?
> What you should have is 8 processes each of which is participating in
> both the parallel seq scan and the parallel sort, not 8 processes
> scanning and 8 separate processes sorting.

I simply conflated max_parallel_degree and max_worker_processes for a
moment. The point is that max_worker_processes is defined in terms of
one definition of a worker process that could in theory be quite
narrow.

A related issue is that I have no way to force Postgres to use however
many worker processes I feel like. Currently, I don't have a cost
model for parallel index builds, because I just use
max_worker_processes; this will probably change soon. I think having
something better than just using max_worker_processes is *essential*
for parallel index builds. There is already such a model for parallel
seq scan in the optimizer, of course. The inability to force a certain
number of workers is less of a concern with parallel sequential scan,
but it would be nice to test it that way. For parallel index builds,
it seems reasonable to suppose that advanced DBAs will want to avoid
using the cost model sometimes, for example because there are no
statistics available following a bulk load. The possible range of
sensible nworkers could be drastically different from
max_parallel_degree. Other systems directly support a way of doing
this. Have you thought about adding, say, a "parallel_degree" GUC,
defaulting to 0 or -1 ("use cost model"), but otherwise used directly
as an nworkers input for CreateParallelContext()?

>>> I think that's probably over-engineered. I mean, it wouldn't be that
>>> hard to have the workers just exit if you decide you don't want them,
>>> and I don't really want to make the signaling here more complicated
>>> than it really needs to be.
>>
>> I worry about the additional overhead of constantly starting and
>> stopping a single worker in some cases (not so much with parallel
>> index build, but other uses of parallel sort beyond 9.6). Furthermore,
>> the coordination between worker and leader processes to make this
>> happen seems messy -- you actually have the postmaster launch
>> processes, but they must immediately get permission to do anything.
>>
>> It wouldn't be that hard to offer a general way of doing this, so why not?
>
> Well, if these things become actual problems, fine, we can fix them.
> But let's not decide to add the API before we're agreed that we need
> it to solve an actual problem that we both agree we have. We are not
> there yet.

That's fair. Before too long, I'll post a patch that doesn't attempt
to deal with there only being one worker process, where that isn't
expected to be any faster than a serial sort. We can iterate on that.

--
Peter Geoghegan
On Mon, Jan 11, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
> >
> > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >> I'm not sure why the test for nworkers following the
> > >> LaunchParallelWorkers() call doesn't look like this, though:
> > >>
> > >> /* Set up tuple queue readers to read the results. */
> > >> if (pcxt->nworkers_launched > 0)
> > >> {
> > >> ...
> > >> }
> > >
> > > Hmm, yeah, I guess it could do that.
> >
> > That would make it clearer as an example.
> >
> > >> But going to this additional trouble (detecting no workers launched on
> > >> the basis of !nworkers_launched) suggests that simply testing
> > >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
> > >> do that, and in so doing also totally remove the "for" loop shown
> > >> here?
> > >
> > > I don't see how the for loop goes away.
> >
> > I meant that some code in the "for" loop goes away. Not all of it.
> > Just the more obscure code. As I said, I'm mostly pointing this out
> > out of concern for making it clearer as example code.
> >
>
> Right, I can write a patch to do it in a way you are suggesting if you
> are not planning to do it.
>
Changed the code such that nworkers_launched gets used wherever
appropriate instead of nworkers. This includes places other than
pointed out above.
Attachments
On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 11, 2016 at 9:16 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>> > On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > >> I'm not sure why the test for nworkers following the
>> > >> LaunchParallelWorkers() call doesn't look like this, though:
>> > >>
>> > >> /* Set up tuple queue readers to read the results. */
>> > >> if (pcxt->nworkers_launched > 0)
>> > >> {
>> > >> ...
>> > >> }
>> > >
>> > > Hmm, yeah, I guess it could do that.
>> >
>> > That would make it clearer as an example.
>> >
>> > >> But going to this additional trouble (detecting no workers launched on
>> > >> the basis of !nworkers_launched) suggests that simply testing
>> > >> nworkers_launched would be wrong, which AFAICT it isn't. Can't we just
>> > >> do that, and in so doing also totally remove the "for" loop shown
>> > >> here?
>> > >
>> > > I don't see how the for loop goes away.
>> >
>> > I meant that some code in the "for" loop goes away. Not all of it.
>> > Just the more obscure code. As I said, I'm mostly pointing this out
>> > out of concern for making it clearer as example code.
>>
>> Right, I can write a patch to do it in a way you are suggesting if you
>> are not planning to do it.
>
> Changed the code such that nworkers_launched gets used wherever
> appropriate instead of nworkers. This includes places other than
> pointed out above.

The changes of the patch are simple optimizations that are trivial.
I didn't find any problem regarding the changes. I think the same
optimization is required in "ExecParallelFinish" function also.

Regards,
Hari Babu
Fujitsu Australia
On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Changed the code such that nworkers_launched gets used wherever
> > appropriate instead of nworkers. This includes places other than
> > pointed out above.
>
> The changes of the patch are simple optimizations that are trivial.
> I didn't find any problem regarding the changes. I think the same
> optimization is required in "ExecParallelFinish" function also.
There is already one change as below for ExecParallelFinish() in patch.
@@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
     WaitForParallelWorkersToFinish(pei->pcxt);

     /* Next, accumulate buffer usage. */
-    for (i = 0; i < pei->pcxt->nworkers; ++i)
+    for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
         InstrAccumParallelQuery(&pei->buffer_usage[i]);
Can you be slightly more specific, where exactly you are expecting more changes?
On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > Changed the code such that nworkers_launched gets used wherever
>> > appropriate instead of nworkers. This includes places other than
>> > pointed out above.
>>
>> The changes of the patch are simple optimizations that are trivial.
>> I didn't find any problem regarding the changes. I think the same
>> optimization is required in "ExecParallelFinish" function also.
>
> There is already one change as below for ExecParallelFinish() in patch.
>
> @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>      WaitForParallelWorkersToFinish(pei->pcxt);
>
>      /* Next, accumulate buffer usage. */
> -    for (i = 0; i < pei->pcxt->nworkers; ++i)
> +    for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>          InstrAccumParallelQuery(&pei->buffer_usage[i]);
>
> Can you be slightly more specific, where exactly you are expecting more
> changes?

I missed it during the comparison with existing code and patch.
Everything is fine with the patch. I marked the patch as ready for committer.

Regards,
Hari Babu
Fujitsu Australia
On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> >>
>> >
>> > Changed the code such that nworkers_launched gets used wherever
>> > appropriate instead of nworkers. This includes places other than
>> > pointed out above.
>>
>> The changes of the patch are simple optimizations that are trivial.
>> I didn't find any problem regarding the changes. I think the same
>> optimization is required in "ExecParallelFinish" function also.
>>
>
> There is already one change as below for ExecParallelFinish() in patch.
>
> @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>
> WaitForParallelWorkersToFinish(pei->pcxt);
>
>
>
> /* Next, accumulate buffer usage. */
>
> - for (i = 0; i < pei->pcxt->nworkers; ++i)
>
> + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>
> InstrAccumParallelQuery(&pei->buffer_usage[i]);
>
>
> Can you be slightly more specific, where exactly you are expecting more
> changes?
I missed it during the comparison with existing code and patch.
Everything is fine with the patch. I marked the patch as ready for committer.
Thanks!
On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> >>
>> >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >> >
>> >> > Changed the code such that nworkers_launched gets used wherever
>> >> > appropriate instead of nworkers. This includes places other than
>> >> > pointed out above.
>> >>
>> >> The changes of the patch are simple optimizations that are trivial.
>> >> I didn't find any problem regarding the changes. I think the same
>> >> optimization is required in "ExecParallelFinish" function also.
>> >
>> > There is already one change as below for ExecParallelFinish() in patch.
>> >
>> > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>> >      WaitForParallelWorkersToFinish(pei->pcxt);
>> >
>> >      /* Next, accumulate buffer usage. */
>> > -    for (i = 0; i < pei->pcxt->nworkers; ++i)
>> > +    for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>> >          InstrAccumParallelQuery(&pei->buffer_usage[i]);
>> >
>> > Can you be slightly more specific, where exactly you are expecting more
>> > changes?
>>
>> I missed it during the comparison with existing code and patch.
>> Everything is fine with the patch. I marked the patch as ready for
>> committer.
>
> Thanks!

OK, committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 4, 2016 at 11:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, committed.
On Fri, Mar 4, 2016 at 6:55 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, Mar 4, 2016 at 5:21 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Fri, Mar 4, 2016 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>> > On Fri, Mar 4, 2016 at 11:57 AM, Haribabu Kommi
>> > <kommi.haribabu@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Jan 13, 2016 at 7:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> >> wrote:
>> >> >>
>> >> >
>> >> > Changed the code such that nworkers_launched gets used wherever
>> >> > appropriate instead of nworkers. This includes places other than
>> >> > pointed out above.
>> >>
>> >> The changes of the patch are simple optimizations that are trivial.
>> >> I didn't find any problem regarding the changes. I think the same
>> >> optimization is required in "ExecParallelFinish" function also.
>> >>
>> >
>> > There is already one change as below for ExecParallelFinish() in patch.
>> >
>> > @@ -492,7 +492,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
>> >
>> > WaitForParallelWorkersToFinish(pei->pcxt);
>> >
>> >
>> >
>> > /* Next, accumulate buffer usage. */
>> >
>> > - for (i = 0; i < pei->pcxt->nworkers; ++i)
>> >
>> > + for (i = 0; i < pei->pcxt->nworkers_launched; ++i)
>> >
>> > InstrAccumParallelQuery(&pei->buffer_usage[i]);
>> >
>> >
>> > Can you be slightly more specific, where exactly you are expecting more
>> > changes?
>>
>> I missed it during the comparison with existing code and patch.
>> Everything is fine with the patch. I marked the patch as ready for
>> committer.
>>
>
> Thanks!
Thanks.
On Mon, Jan 11, 2016 at 3:14 AM, Peter Geoghegan <pg@heroku.com> wrote:
>
> On Sun, Jan 10, 2016 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> Now, you might wonder why it is that the leader cannot also sort runs,
> just as a worker would. It's possible, but it isn't exactly
> straightforward. You have to have special cases in several places,
> even though it probably is going to be uncommon to only have one
> BackgroundWorkerSlot available in practice. It's simpler to just
> opt-out, and seems better given that max_parallel_degree is a way of
> resource limiting based on available cores (it's certainly not about
> the availability of shared memory for the BackgroundWorkerSlot array).
>
> More importantly, I have other, entirely general concerns. Other major
> RDBMSs have settings that are very similar to max_parallel_degree,
> with a setting of 1 effectively disabling all parallelism. Both Oracle
> and SQL Server have settings that they both call the "maximum degree
> of parallelism". I think it's a bit odd that with Postgres,
> max_parallel_degree = 1 can still use parallelism at all. I have to
> wonder: are we conflating controlling the resources used by parallel
> operations with how shared memory is doled out?
>
Your point is genuine, but OTOH, if max_parallel_degree = 1 were to
mean that parallelism is disabled, then setting max_parallel_degree = 2
would mean that one worker process can be used for a parallel query,
which looks somewhat odd to me. Also, under that definition a parallel
query would be able to use up to max_worker_processes + 1 workers,
which seems less intuitive than the current behavior.

On your point about other databases: I have also checked, and it seems
some other databases such as Sybase [1] provide a similar parameter
where a value of 1 means serial execution, so we might want to consider
treating it similarly. To me, though, the current setting seems quite
intuitive; if more people also feel the same as you do, then we should
change it.
On Mon, Mar 7, 2016 at 4:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Your point is genuine, but OTOH let us say if max_parallel_degree = 1 means
> parallelism is disabled then when somebody sets max_parallel_degree = 2,
> then it looks somewhat odd to me that it will mean that 1 worker process
> can be used for parallel query.

I'm not sure that that has to be true.

What is the argument for only using one worker process, say in the
case of parallel seq scan? I understand that parallel seq scan can
consume tuples itself, which seems like a good principle, but how far
does it go, and how useful is it in the general case? I'm not
suggesting that it isn't, but I'm not sure.

How common is it for the leader process to do anything other than
coordinate and consume from worker processes?

--
Peter Geoghegan
On Mon, Mar 7, 2016 at 2:13 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Mon, Mar 7, 2016 at 4:04 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Your point is genuine, but OTOH let us say if max_parallel_degree = 1 means
>> parallelism is disabled then when somebody sets max_parallel_degree = 2,
>> then it looks somewhat odd to me that it will mean that 1 worker process
>> can be used for parallel query.
>
> I'm not sure that that has to be true.
>
> What is the argument for only using one worker process, say in the
> case of parallel seq scan? I understand that parallel seq scan can
> consume tuples itself, which seems like a good principle, but how far
> does it go, and how useful is it in the general case? I'm not
> suggesting that it isn't, but I'm not sure.
>
> How common is it for the leader process to do anything other than
> coordinate and consume from worker processes?

1 worker is often a very big speedup vs. 0 workers, and the work can
easily be evenly distributed between the worker and the leader.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company