Discussion: Size of Path nodes
So over in my private branch where I've been fooling around with upper planner pathification, I've been sweating bullets to avoid enlarging the size of struct Path, because I was aware that common path types like IndexPath and AppendPath were already a power-of-2 size (at least on 64-bit machines), meaning that palloc'ing them would cost twice as much space if I added any fields.

When I got around to merging up to HEAD, I found this in commit f0661c4e8:

@@ -753,6 +753,7 @@ typedef struct Path

     RelOptInfo *parent;         /* the relation this path can build */
     ParamPathInfo *param_info;  /* parameterization info, or NULL if none */
+    bool        parallel_aware; /* engage parallel-aware logic? */

     /* estimated size/costs for path (see costsize.c for more info) */
     double      rows;           /* estimated number of result tuples */

which means Robert has already blown the planner's space consumption out by close to a factor of 2, and I should stop worrying. Or else he should start worrying. Do we really need this field? Did anyone pay any attention to what happened to planner space consumption and performance when the parallel-scan patch went in? Or am I just too conditioned by working with last-century hardware, and I should stop caring about how large these nodes are?

While I'm bitching about this: a field added to a struct as fundamental as Path ought to have a pretty well-defined meaning, and this does not as far as I can tell. There was certainly no documentation worthy of the name added by the above commit. What is the semantic difference between a Path with this flag set and the identical Path without? Likewise for struct Plan?

			regards, tom lane
On 2015-12-04 12:50:09 -0500, Tom Lane wrote:
> So over in my private branch where I've been fooling around with upper
> planner pathification, I've been sweating bullets to avoid enlarging the
> size of struct Path, because I was aware that common path types like
> IndexPath and AppendPath were already a power-of-2 size (at least on
> 64-bit machines), meaning that palloc'ing them would cost twice as much
> space if I added any fields.
>
> When I got around to merging up to HEAD, I found this in commit f0661c4e8:
>
> @@ -753,6 +753,7 @@ typedef struct Path
>
>      RelOptInfo *parent;         /* the relation this path can build */
>      ParamPathInfo *param_info;  /* parameterization info, or NULL if none */
> +    bool        parallel_aware; /* engage parallel-aware logic? */
>
>      /* estimated size/costs for path (see costsize.c for more info) */
>      double      rows;           /* estimated number of result tuples */
>
> which means Robert has already blown the planner's space consumption out
> by close to a factor of 2, and I should stop worrying.

FWIW, for me it's still <= 64 bytes:

struct Path {
        NodeTag            type;             /*     0     4 */
        NodeTag            pathtype;         /*     4     4 */
        RelOptInfo *       parent;           /*     8     8 */
        ParamPathInfo *    param_info;       /*    16     8 */
        bool               parallel_aware;   /*    24     1 */

        /* XXX 7 bytes hole, try to pack */

        double             rows;             /*    32     8 */
        Cost               startup_cost;     /*    40     8 */
        Cost               total_cost;       /*    48     8 */
        List *             pathkeys;         /*    56     8 */

        /* --- cacheline 1 boundary (64 bytes) --- */

        /* size: 64, cachelines: 1, members: 9 */
        /* sum members: 57, holes: 1, sum holes: 7 */
};

> While I'm bitching about this: a field added to a struct as fundamental
> as Path ought to have a pretty well-defined meaning, and this does not
> as far as I can tell. There was certainly no documentation worthy of
> the name added by the above commit. What is the semantic difference
> between a Path with this flag set and the identical Path without?
> Likewise for struct Plan?

That part is obviously independently true, regardless of the size.

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2015-12-04 12:50:09 -0500, Tom Lane wrote:
>> which means Robert has already blown the planner's space consumption out
>> by close to a factor of 2, and I should stop worrying.

> FWIW, for me it's still <= 64 bytes:

No, it's not bare Path I'm worried about, it's IndexPath, which went from 128 bytes in previous versions to 136 in HEAD. Likewise for BitmapHeapPath, TidPath, ForeignPath, AppendPath, ResultPath, MaterialPath, and possibly others.

			regards, tom lane
On Fri, Dec 4, 2015 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So over in my private branch where I've been fooling around with upper
> planner pathification, I've been sweating bullets to avoid enlarging the
> size of struct Path, because I was aware that common path types like
> IndexPath and AppendPath were already a power-of-2 size (at least on
> 64-bit machines), meaning that palloc'ing them would cost twice as much
> space if I added any fields.
>
> When I got around to merging up to HEAD, I found this in commit f0661c4e8:
>
> @@ -753,6 +753,7 @@ typedef struct Path
>
>      RelOptInfo *parent;         /* the relation this path can build */
>      ParamPathInfo *param_info;  /* parameterization info, or NULL if none */
> +    bool        parallel_aware; /* engage parallel-aware logic? */
>
>      /* estimated size/costs for path (see costsize.c for more info) */
>      double      rows;           /* estimated number of result tuples */
>
> which means Robert has already blown the planner's space consumption out
> by close to a factor of 2, and I should stop worrying. Or else he should
> start worrying. Do we really need this field? Did anyone pay any
> attention to what happened to planner space consumption and performance
> when the parallel-scan patch went in? Or am I just too conditioned by
> working with last-century hardware, and I should stop caring about how
> large these nodes are?

If it's really true that the extra byte I added there has doubled planner memory use, then that's definitely cause for concern. However, I am skeptical that's really what has happened here. Not every path has crossed a power-of-two threshold, and paths are not the only things the planner allocates. What's a reasonable way to assess the effect of this on planner memory use in general?

I really HOPE that this isn't a serious problem. The pending patch for parallel joins adds another int right after that bool, and I doubt rather deeply that'll be the end of it.
I'm skeptical that int should be a double, and that maybe there should be a few more things.... I don't expect to need tons of space here, but the current code is pretty stupid about estimating the number of workers that are needed for a parallel plan, and at some point we're going to want to make that less stupid, and that's probably going to require some more bookkeeping. I don't know exactly how that's all going to work yet, but it's pretty hard to make the planner generate parallel query paths for a wide variety of plan types if none of those plan types carry any information specific to parallel query.

> While I'm bitching about this: a field added to a struct as fundamental
> as Path ought to have a pretty well-defined meaning, and this does not
> as far as I can tell. There was certainly no documentation worthy of
> the name added by the above commit. What is the semantic difference
> between a Path with this flag set and the identical Path without?
> Likewise for struct Plan?

/me crawls into a hole.

Well, funny story about that. The parallel sequential scan code forgot to make that flag actually do what it was budgeted to do, but it turns out not to matter as long as the only thing that can be done in parallel is a sequential scan. So there's no live bug right now, and the parallel join patch adds the missing code to make that flag actually meaningful. So I'm not very surprised that you found the current state of things stupid. The reason it's needed is because the parallel join patch generates plans like this:

Gather
  -> Hash Join
       -> Parallel Seq Scan
       -> Hash
            -> Seq Scan

The idea of this plan is that each worker builds a hash table on the inner rel, and then reads a subset of the outer rel and performs the join for the subset of rows which it reads. Then the Gather brings all the worker results together in the leader. Well, when I initially coded this, it wasn't working properly, and I couldn't for the life of me figure out why.
Eventually, after hours of debugging, I realized that while I'd added this parallel_aware flag with the idea of distinguishing the outer seq scan (which needs to be split between the workers) from the inner seq scan (which needs to be executed to completion by every worker), there was nothing in the code which would actually cause it to work like that. The inner seq scan was behaving as if it were parallel - that is, it returned only a subset of the rows in each worker - even though the parallel_aware flag was not set. I had thought that there was some logic in execParallel.c or, uh, somewhere, which would give that flag meaning but it turns out that, in master, there isn't. So the parallel join patch, which is otherwise just an optimizer patch, also adds the missing bits to execParallel.c. Maybe I should have committed that separately as a bug fix.

To try to clarify a little further, execParallel.c contains three functions that make passes over the PlanState tree. The first one is made before creating the parallel DSM, and its purpose is to estimate the amount of space that each plan node will require in the DSM (it can overestimate, but it can't underestimate). The second one is made after creating the parallel DSM, and gives plan nodes a chance to initialize the space they estimated they would use (or some lesser amount). The third one is made in each worker, and gives plan nodes a chance to fish out a pointer to the space previously initialized in the master. All of this processing is of course skipped for plan nodes that have no parallel smarts. But the parallel-aware flag is supposed to allow these steps to be skipped even for a node that does have parallel smarts, so that those smarts can in effect be shut off when they're not desired.

Of course, it would be possible to rejigger things to avoid having this flag in every path. Currently, SeqScans are the only plan type that is ever parallel-aware, so we could put the flag in only that path type.
However, I expect to want to expand the set of parallel-aware paths pretty soon. In particular, I want to add at least AppendPath and IndexPath to the list. So I fear that this approach will just be kicking the can down the road, and maybe not all that far down the road.

I'm pretty skeptical that we can continue to add planner features without making the data structures any bigger. It looks to me like I'm ending up touching all the same places that the parameterized path stuff touched - that needed a few extra bytes in a whole bunch of data structures, and parallelism wants a few bytes in many of those very same data structures for pretty similar reasons, and from the sound of it, the upper planner path-ification also wants a few bytes in all of those places (for tlists, maybe?). I think that if we make a rule that IndexPath can never again get bigger, we are going to have a very difficult time continuing to improve the planner.

One thing to keep in mind is that most modern allocators do not do power-of-two allocation, at least AFAICT. They carve up super-blocks into a bunch of equal-sized chunks. So instead of falling over a power-of-two boundary where space suddenly doubles, they instead just move into a larger size class, and those size classes are closely-enough spaced that this isn't such a big deal. We're stuck using palloc, and I can't help noticing that the last time aset.c had a commit touching a three-digit number of lines was in July of 2000. It might be time to consider other alternatives there. I posted a new allocator on -hackers a while back that didn't get a real warm reception, but it was significantly more memory-efficient on memory contexts that contain many allocations than what we're doing today. It may well be that what I did there wasn't a good plan for any number of reasons, but the principle is worth thinking about.
Of course, no allocator can avoid the problem that large allocations will need to span cache lines, but it CAN avoid doubling memory consumption when you cross a power-of-two boundary.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 12/4/15 11:50 AM, Tom Lane wrote:
> which means Robert has already blown the planner's space consumption out
> by close to a factor of 2, and I should stop worrying. Or else he should
> start worrying. Do we really need this field? Did anyone pay any
> attention to what happened to planner space consumption and performance
> when the parallel-scan patch went in? Or am I just too conditioned by
> working with last-century hardware, and I should stop caring about how
> large these nodes are?

I suspect Cachegrind[1] would answer a lot of these questions (though I've never actually used it). I can't get postgres to run under valgrind on my laptop, but maybe someone that's been successful at valgrind can try cachegrind (it's just another mode of valgrind).

[1] http://valgrind.org/docs/manual/cg-manual.html
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 4, 2015 at 2:44 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
> I suspect Cachegrind[1] would answer a lot of these questions (though I've
> never actually used it). I can't get postgres to run under valgrind on my
> laptop, but maybe someone that's been successful at valgrind can try
> cachegrind (It's just another mode of valgrind).

I've used Cachegrind, and think it's pretty good. You still need a test case that exercises what you're interested in, though. Distributed costs are really hard to quantify. Sometimes that's because they don't exist, and sometimes it's because they can only be quantified as part of a value judgement.

As frustrated as I've sometimes been with those discussions, I do recognize that there has to be a middle ground, and that the emphasis on distributed costs has as much to do with fairness for every contributor as anything else. I would have appreciated some attempt to have quantified the overhead here, but would not have insisted on Robert being as thorough as he conceivably could have been.

-- 
Peter Geoghegan
On 12/4/15 5:14 PM, Peter Geoghegan wrote:
> On Fri, Dec 4, 2015 at 2:44 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>> I suspect Cachegrind[1] would answer a lot of these questions (though I've
>> never actually used it). I can't get postgres to run under valgrind on my
>> laptop, but maybe someone that's been successful at valgrind can try
>> cachegrind (It's just another mode of valgrind).

> I've used Cachegrind, and think it's pretty good. You still need a
> test case that exercises what you're interested in, though.
> Distributed costs are really hard to quantify. Sometimes that's
> because they don't exist, and sometimes it's because they can only be
> quantified as part of a value judgement.

If we had a good way to run cachegrind (and maybe if it was run automatically somewhere) then at least we'd know what effect a patch had on things. (For those not familiar, there is a tool for diffing two cachegrind runs.) Knowing is half the battle and all that.

Another interesting possibility would be a standardized perf test. [1] makes an argument for that. Maybe a useful way to set stuff like this up would be to create support for things to run in travis-ci. Time-based tools presumably would be useless, but something doing analysis like cachegrind would probably be OK (though I think they do cap test runs to an hour or something).

[1] https://news.ycombinator.com/item?id=8426302
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
On Fri, Dec 4, 2015 at 6:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> As frustrated as I've sometimes been with those discussions, I do
> recognize that there has to be a middle ground, and that the emphasis
> on distributed costs has as much to do with fairness for every
> contributor as anything else. I would have appreciated some attempt to
> have quantified the overhead here, but would not have insisted on
> Robert being as thorough as he conceivably could have been.

Honestly, I really never thought of the issue of whether this would push some structure over a power-of-two size. It's not like either Path or IndexPath have comments saying "for the love of heaven, don't make this bigger". This is in contrast to, say, PGXACT, where there is such a comment, because I knew it was an issue and I added a comment explaining that issue. Now maybe I should have foreseen the objection anyway, but, you know, it's hard to foresee everything somebody might object to.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 5, 2015 at 2:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Dec 4, 2015 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > So over in my private branch where I've been fooling around with upper
> > planner pathification, I've been sweating bullets to avoid enlarging the
> > size of struct Path, because I was aware that common path types like
> > IndexPath and AppendPath were already a power-of-2 size (at least on
> > 64-bit machines), meaning that palloc'ing them would cost twice as much
> > space if I added any fields.
> >
> > When I got around to merging up to HEAD, I found this in commit f0661c4e8:
> >
>
> > While I'm bitching about this: a field added to a struct as fundamental
> > as Path ought to have a pretty well-defined meaning, and this does not
> > as far as I can tell. There was certainly no documentation worthy of
> > the name added by the above commit. What is the semantic difference
> > between a Path with this flag set and the identical Path without?
> > Likewise for struct Plan?
>
> /me crawls into a hole.
>
> Well, funny story about that. The parallel sequential scan code
> forgot to make that flag actually do what it was budgeted to do, but
> it turns out not to matter as long as the only thing that can be done
> in parallel is a sequential scan. So there's no live bug right now,
> and the parallel join patch adds the missing code to make that flag
> actually meaningful. So I'm not very surprised that you found the
> current state of things stupid. The reason it's needed is because the
> parallel join patch generates plans like this:
>
> Gather
> -> Hash Join
> -> Parallel Seq Scan
> -> Hash
> -> Seq Scan
>
> The idea of this plan is that each worker builds a hash table on the
> inner rel, and then reads a subset of the outer rel and performs the
> join for the subset of rows which it reads. Then the Gather brings
> all the worker results together in the leader. Well, when I initially
> coded this, it wasn't working properly, and I couldn't for the life of
> me figure out why.
>
> Eventually, after hours of debugging, I realized that while I'd added
> this parallel_aware flag with the idea of distinguishing between the
> outer seq scan (which needs to be split between the workers) from the
> inner seq scan (which needs to be executed to completion by every
> worker) there was nothing in the code which would actually cause it to
> work like that. The inner seq scan was behaving as if it were
> parallel - that is, it returned only a subset of the rows in each
> worker - even though the parallel_aware flag was not set. I had
> thought that there was some logic in execParallel.c or, uh, somewhere,
> which would give that flag meaning but it turns out that, in master,
> there isn't. So the parallel join patch, which is otherwise just an
> optimizer patch, also adds the missing bits to execParallel.c. Maybe
> I should have committed that separately as a bug fix.
>
> To try to clarify a little further, execParallel.c contains three
> functions that make passes over the PlanState tree. The first one is
> made before creating the parallel DSM, and its purpose is to estimate
> the amount of space that each plan node will require in the DSM (it
> can overestimate, but it can't underestimate). The second one is made
> after creating the parallel DSM, and gives plan nodes a chance to
> initialize the space they estimated they would use (or some lesser
> amount). The third one is made in each worker, and gives plan nodes a
> chance to fish out a pointer to the space previously initialized in
> the master. All of this processing is of course skipped for plan
> nodes that have no parallel smarts. But the parallel-aware flag is
> supposed to allow these steps to be skipped even for a node that does
> have parallel smarts, so that those smarts can in effect be shut off
> when they're not desired.
>
>
> Of course, it would be possible to rejigger things to avoid having
> this flag in every path. Currently, SeqScans are the only plan type
> that is ever parallel-aware, so we could put the flag in only that
> path type. However, I expect to want to expand the set of
> parallel-aware paths pretty soon. In particular, I want to add at
> least AppendPath and IndexPath to the list.
To add to whatever has been said above, the intention of adding that flag was also to avoid adding new nodes for parallelism: basically, modify the existing nodes (like SeqScan) to take care of whatever is needed for parallel execution. Now fortunately, not much is needed for Seq Scan (except for the explain part and distinguishing backward scans), but I think that will not be the case going forward when we need to make other nodes like Append parallel-aware. The need and idea for this flag have been proposed and discussed on the parallel seq scan thread [1].
On Fri, Dec 4, 2015 at 4:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> If it's really true that the extra byte I added there has doubled
> planner memory use, then that's definitely cause for concern.
> However, I am skeptical that's really what has happened here. Not
> every path has crossed a power-of-two threshold, and paths are not the
> only things the planner allocates. What's a reasonable way to assess
> the effect of this on planner memory use in general?

So I did a really crude test of this. I put MemoryContextStats(MessageContext) - which seems to be where the planner garbage is going - at the end of the message-processing loop, and then ran the regression tests with and without parallel_aware in the Path structure. Then I ran a little grep and awk magic over the postmaster logs and compared the sizes of contexts. For reasons I haven't tracked down, the number of instrumentation lines I got with and without the flag differed. But the overall pattern seems pretty clear. In the results below, the "without" number is the number of times MessageContext had allocated the specified amount of storage space without the parallel_aware flag; the "with" number is the number of times it had allocated the specified amount of storage with the parallel_aware flag.
   size     without    with
   8192        7589    7605
  16384        6074    6078
  16448           1       1
  24576          26      27
  24640          75      68
  26448           3       3
  32768        1747    1760
  36512           0       1
  42832           1       1
  57344           7       9
  57520         151     152
  65536        1319    1349
  66448           1       1
  73728           4       5
  73792           1       1
  73904           2       2
 116448           4       4
 122880           4       4
 131072         631     638
 139264          12      12
 216512           4       4
 262144         496     504
 270336           5       5
 316448           1       1
 516448           2       2
 524288          73      74
 532480           1       1
 816512           1       1
1048576          19      19
1216448           1       0
2097152           4       5

queries_with=18337 queries_without=18259
total_with=612886960 total_without=605744096

What I think this is showing is that making Path bigger occasionally pushes palloc over a boundary so that it allocates another chunk, but most of the time it doesn't. Also, it suggests to me that if we're concerned about keeping memory utilization tight on these kinds of queries, we could think about changing palloc's allocation pattern. For example, if we did 8k, 16k, 32k, 64k, 64k, 64k, 64k, 128k, 128k, 128k, 128k, 256k, 256k, 256k, 256k ... a lot of these queries would consume less memory.

If there's a strong feeling that I should find a way to make this testing more rigorous, I'm willing to do so, but I suspect that we're not going to find anything very exciting here. A more likely angle of investigation here is to try to figure out what a worst case for enlarging the Path structure might look like, and test that. I don't have a brilliant idea there right at the moment, but I'll mull it over.
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Amit Kapila <amit.kapila16@gmail.com> writes:
> To add to whatever has been said above, intention of adding that flag
> was also to avoid adding new nodes for parallelism. Basically modify
> the existing nodes (like SeqScan) to take care of whatever is needed
> for parallel execution.

TBH, I would say that that's a damn-fool idea. I think you should instead create a separate ParallelSeqScan path type and plan type, and the same for every other thing that becomes parallel-aware. The planner does not normally[1] use the same path type to represent two fundamentally different execution plans with enormously different cost estimates, but that is the direction you want to push in for parallel query. I think it will lead to a mess: lots of unreadable code that has to do things in a way unlike the code around it, and lots of bugs-of-omission in places that should have distinguished seq and parallel cases but didn't.

			regards, tom lane

[1] Yes, I'm aware that UniquePath is an exception. It has reasons to live though, in particular that if it were multiple path types there would still need to be places that understood the common property of those path types all delivering unique results. I do not see any corresponding benefit of fuzzing the distinction between SeqScan and ParallelSeqScan.
On Sat, Dec 5, 2015 at 12:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Kapila <amit.kapila16@gmail.com> writes:
>> To add to whatever has been said above, intention of adding that flag
>> was also to avoid adding new nodes for parallelism. Basically modify
>> the existing nodes (like SeqScan) to take care of whatever is needed
>> for parallel execution.
>
> TBH, I would say that that's a damn-fool idea. I think you should instead
> create a separate ParallelSeqScan path type and plan type, and the same
> for every other thing that becomes parallel-aware. The planner does not
> normally[1] use the same path type to represent two fundamentally different
> execution plans with enormously different cost estimates, but that is the
> direction you want to push in for parallel query. I think it will lead to
> a mess: lots of unreadable code that has to do things in a way unlike the
> code around it, and lots of bugs-of-omission in places that should have
> distinguished seq and parallel cases but didn't.

Maybe. But if we go down that path, we're eventually going to have ParallelSeqScan, ParallelIndexScan, ParallelBitmapHeapScan, ParallelForeignScan, ParallelAppend, ParallelHashJoin, and probably a bunch of others. That could lead to a lot of code duplication. Even for ParallelSeqScan:

 src/backend/executor/nodeSeqscan.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 103 insertions(+), 33 deletions(-)

Now that file has 344 lines today, but only 33 of those lines are actual changes to pre-existing code, and most of those are mechanical. So the effect of making that a whole separate node type would have been to copy about 200 lines of code and comments. That didn't seem like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 5, 2015 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The planner does not
> normally[1] use the same path type to represent two fundamentally different
> execution plans with enormously different cost estimates

Eh, this is a typical object modelling dilemma.  There are lots of
different path types, many of which share a lot of properties, and
depending on which way you look at things different sets of them seem
to be the same thing.  BitmapScan is very like a SeqScan, and
MergeAppend is like Append, yet those got different node types; but
nodes with Filters attached didn't become FilteredSeqScan,
FilteredIndexScan, etc.  I'm not sure which of those cases Parallel is
more like, and it may be more convenient for the planner one way and
for other parts the other.

-- 
greg
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Dec 5, 2015 at 12:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Kapila <amit.kapila16@gmail.com> writes:
>>> To add to whatever has been said above, intention of adding that flag
>>> was also to avoid adding new nodes for parallelism. Basically modify
>>> the existing nodes (like SeqScan) to take care of whatever is needed
>>> for parallel execution.

>> TBH, I would say that that's a damn-fool idea.  I think you should instead
>> create a separate ParallelSeqScan path type and plan type, and the same
>> for every other thing that becomes parallel-aware.

> Maybe.  But if we go down that path, we're eventually going to have
> ParallelSeqScan, ParallelIndexScan, ParallelBitmapHeapScan,
> ParallelForeignScan, ParallelAppend, ParallelHashJoin, and probably a
> bunch of others.  That could lead to a lot of code duplication.  Even
> for ParallelSeqScan:

>  src/backend/executor/nodeSeqscan.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 103 insertions(+), 33 deletions(-)

> Now that file has 344 lines today, but only 33 of those lines are
> actual changes to pre-existing code, and most of those are mechanical.
> So the effect of making that a whole separate node type would have
> been to copy about 200 lines of code and comments.  That didn't seem
> like a good idea.

Meh.  Size of first patch is a pretty bad guide to whether a particular
structural decision is going to be a good idea in the long run.  Do you
need some examples of patches we've rejected because they were done in a
way that minimized the size of the patch rather than a way that made sense
from a system structural viewpoint?

I would also point out that if we'd followed this approach in the past,
nodeIndexscan.c, nodeBitmapIndexscan.c and nodeIndexonlyscan.c would be
one file, and it would be nigh unintelligible.  Those files nonetheless
manage to share code where it makes sense for them to.

In any case, even if there's an argument for keeping the two cases
together in the executor, I remain of the opinion that you'd be better
off with them being distinct Path types in the planner.

			regards, tom lane
On Sat, Dec 5, 2015 at 2:29 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  Size of first patch is a pretty bad guide to whether a particular
> structural decision is going to be a good idea in the long run.  Do you
> need some examples of patches we've rejected because they were done in a
> way that minimized the size of the patch rather than a way that made sense
> from a system structural viewpoint?

You can provide some examples if you like, but I thought about the
issue of overall system structure and settled on this design after
coming to the conclusion that this design was better from that
viewpoint.  If it becomes clear that I'm wrong, fine: we can change
it.  I think I've done a pretty good job - I'm going to go out on a
limb and say an excellent job - modularizing the parallelism code so
that, if it turns out I've picked the wrong design for some part of
it, we can replace just that part without having a huge impact on all
the rest of it.  But at this point I am not by any means convinced
that I've got it wrong.  I've been working on this for several years
and have thought about it quite deeply, discussed it extensively on
and off the mailing list with anyone and everyone who was willing to
participate in that discussion, and read what relevant literature I
could find.  That doesn't guarantee that I've got it right, but if
your premise here is that I haven't thought about system structure,
I'm going to say that premise is wrong.  I've thought about it a lot.

> I would also point out that if we'd followed this approach in the past,
> nodeIndexscan.c, nodeBitmapIndexscan.c and nodeIndexonlyscan.c would be
> one file, and it would be nigh unintelligible.  Those files nonetheless
> manage to share code where it makes sense for them to.

True, but I don't think that's a good analogy.  I think the best
analogy to parallelism is parameterized plans.
And we DON'T have separate nodes for parameterized index scans and
unparameterized index scans, nor do we have separate path nodes.  We
have something called an index scan, it can be parameterized or
unparameterized, and we keep track of that distinction during both
planning and execution.  So here.  For example, in the case of
nodeSeqscan.c, almost two-thirds of the new lines of code are to add
three additional methods that aren't called in parallel aware mode,
and about half of the rest are due to a SeqScanState now having one
integer field that's not part of ScanState, which require mechanical
changes in a bunch of places ("ps" becomes "ss.ps").  The rest amounts
to 15-20 lines of real code change.

It may be useful context - although you will already know this, if you
have been reading the relevant threads - to know that up until a
couple of months ago I actually had this exactly the way you are
describing, with parallel sequential scan as a separate node type,
mostly because I know that you've leaned in the direction of making
things separate node types whenever there's any doubt about the
matter.  The main reason I changed my mind, aside from the massive
code duplication that seemed to be resulting, was the realization that
every scan type needs a parallel-aware mode, and that the
parallel-aware mode needs to degenerate back to exactly the same thing
as the non-parallel-aware case when no workers are available.

It makes no sense to me to say that we should double the number of
executor nodes, or even the number of path nodes, just because we have
parallelism.  I'm sure we'd never actually double it, as there are
some plan nodes like Result for which I can't foresee a parallel-aware
version.  But we'd add a heck of a lot of them, and in every case the
parallel-aware version would be a strict superset of what the
non-parallel-aware version can do.

Consider Append, for example.  A regular Append runs all of the child
plans one after another.
A parallel-aware Append, like any other parallel-aware node, will be
running in multiple copies, one per worker.  Well, what it should do
in order to minimize contention is spread the workers out among the
subplans, rather than throwing all the workers at the first subplan,
then when that gets done throwing them all at the second subplan, and
so forth.  See the parallel append thread for further discussion of
these issues.

Now, if you are in a parallel plan, you always want this behavior:
there is never any reason (at least that I can see currently) to
prefer the regular Append behavior (though the architecture I've
chosen could support that if it turns out we need it).  And if you are
not in a parallel plan, the behavior of the parallel-aware version
degenerates to exactly what our current Append node does anyway.  So
if you separate that into two nodes, a regular Append and a Parallel
Append, it's just stupid.  You just have one node that is a dumber
version of another node, and you have code somewhere to distinguish
between them when you could just as well always use the more capable
one.

I believe this kind of pattern is quite general.  Most - though maybe
not all - parallel-aware nodes are going to do basically the same
thing that they do in a regular computation, with a little bit of
extra smarts specific to parallelism.  If we end up with
parallel-aware nodes that do things noticeably different from what the
non-parallel-aware versions do, well, your bias in this matter is
certainly known to me and I will respect it.  But I can't stomach the
idea that as soon as we add even one line of code to cater to
parallelism in a node, we have to duplicate that node and all of its
supporting infrastructure.  I really doubt you would be happy with
that design in the long term either, or even the medium term.

We have drifted a bit from the subject of this thread.
Maybe that is OK, but we should make sure to finish dealing with the
original issue you raised.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
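The worker-spreading behavior described above for a parallel-aware Append might look, in toy form, like this (a sketch of the scheduling idea only, not code from the parallel append patch; the fixed-size counter array stands in for state that would really live in dynamic shared memory):

```c
#include <assert.h>

#define NSUBPLANS 4

/* Count of workers currently executing each subplan; in a real
 * implementation this would live in shared memory and be updated
 * under a lock or with atomics. */
static int workers_on[NSUBPLANS];

/* Each worker (or the leader) arriving at the Append picks the subplan
 * with the fewest workers already on it, spreading workers out to
 * minimize contention.  With no workers, the leader alone simply walks
 * the subplans in order, 0..NSUBPLANS-1, degenerating to the behavior
 * of a plain Append. */
static int
choose_subplan(void)
{
    int best = 0;

    for (int i = 1; i < NSUBPLANS; i++)
        if (workers_on[i] < workers_on[best])
            best = i;
    workers_on[best]++;         /* claim it; decremented when finished */
    return best;
}
```

Successive calls with all counts starting at zero claim subplans 0, 1, 2, 3 and then wrap back to 0, which is the "spread out rather than pile on" property being argued for.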
On Sat, Dec 5, 2015 at 9:11 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> here.  For example, in the case of nodeSeqscan.c, almost two-thirds of
> the new lines of code are to add three additional methods that aren't
> called in parallel aware mode, and about half of the rest are due to a
> SeqScanState now having one integer field that's not part of
> ScanState, which require mechanical changes in a bunch of places ("ps"
> becomes "ss.ps").  The rest amounts to 15-20 lines of real code
> change.

Oops, bad proofreading.  It should say "are ONLY called in parallel
mode", not "aren't called in parallel aware mode".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
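The mechanical change quoted above ("ps" becomes "ss.ps") comes from struct embedding; a rough sketch of what that looks like (hypothetical, cut-down field names rather than the real executor headers):

```c
#include <assert.h>

/* Miniature stand-ins for the executor state nodes. */
typedef struct PlanState
{
    int plan_node_id;           /* stand-in for the common per-node state */
} PlanState;

typedef struct ScanState
{
    PlanState ps;               /* common scan-node state */
} ScanState;

/* Previously code could reach the PlanState as node->ps.  Embedding
 * ScanState as a named field, to make room for one parallel-only
 * integer, turns that same access into node->ss.ps everywhere - a
 * purely mechanical change. */
typedef struct SeqScanState
{
    ScanState ss;               /* embedded base state */
    int       pscan_len;        /* hypothetical parallel-only field */
} SeqScanState;
```

Since `ss` is the first member, a pointer to a `SeqScanState` still points at its `ScanState`, so code that treats the node generically keeps working; only the field paths in nodeSeqscan.c itself had to change.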