Discussion: Parallel append plan instability/randomness
According to buildfarm member silverfish,

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=silverfish&dt=2018-01-06%2008%3A53%3A38

it's possible to sometimes get this failure in the regression tests:

*** /mnt/buildfarm/buildroot/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out	Tue Dec 19 20:24:02 2017
--- /mnt/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/select_parallel.out	Sat Jan 6 09:21:39 2018
***************
*** 75,84 ****
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
                       ->  Seq Scan on d_star
                       ->  Seq Scan on f_star
                       ->  Seq Scan on e_star
-                      ->  Seq Scan on b_star
                       ->  Seq Scan on c_star
                       ->  Seq Scan on a_star
  (11 rows)
--- 75,84 ----
           Workers Planned: 3
           ->  Partial Aggregate
                 ->  Parallel Append
+                      ->  Seq Scan on b_star
                       ->  Seq Scan on d_star
                       ->  Seq Scan on f_star
                       ->  Seq Scan on e_star
                       ->  Seq Scan on c_star
                       ->  Seq Scan on a_star
  (11 rows)

Irreproducible failures in the regression tests are not very acceptable.
Furthermore, considering that the query being tested is

explain (costs off)
  select round(avg(aa)), sum(aa) from a_star;

it seems to me that the "expected" order of the sub-scans is mighty
random to begin with.  A non-parallel implementation of the same query
will consistently scan these tables in their inheritance order:

 Aggregate
   ->  Append
         ->  Seq Scan on a_star
         ->  Seq Scan on b_star
         ->  Seq Scan on c_star
         ->  Seq Scan on d_star
         ->  Seq Scan on e_star
         ->  Seq Scan on f_star

Could we fix the parallel case to behave similarly please?

			regards, tom lane
On Sun, Jan 7, 2018 at 5:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> According to buildfarm member silverfish,
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=silverfish&dt=2018-01-06%2008%3A53%3A38
>
> it's possible to sometimes get this failure in the regression tests:
>
> *** /mnt/buildfarm/buildroot/HEAD/pgsql.build/../pgsql/src/test/regress/expected/select_parallel.out	Tue Dec 19 20:24:02 2017
> --- /mnt/buildfarm/buildroot/HEAD/pgsql.build/src/test/regress/results/select_parallel.out	Sat Jan 6 09:21:39 2018
> ***************
> *** 75,84 ****
>            Workers Planned: 3
>            ->  Partial Aggregate
>                  ->  Parallel Append
>                        ->  Seq Scan on d_star
>                        ->  Seq Scan on f_star
>                        ->  Seq Scan on e_star
> -                      ->  Seq Scan on b_star
>                        ->  Seq Scan on c_star
>                        ->  Seq Scan on a_star
>   (11 rows)
> --- 75,84 ----
>            Workers Planned: 3
>            ->  Partial Aggregate
>                  ->  Parallel Append
> +                      ->  Seq Scan on b_star
>                        ->  Seq Scan on d_star
>                        ->  Seq Scan on f_star
>                        ->  Seq Scan on e_star
>                        ->  Seq Scan on c_star
>                        ->  Seq Scan on a_star
>   (11 rows)
>
> Irreproducible failures in the regression tests are not very acceptable.
> Furthermore, considering that the query being tested is
>
> explain (costs off)
>   select round(avg(aa)), sum(aa) from a_star;
>
> it seems to me that the "expected" order of the sub-scans is mighty
> random to begin with.
>

I think the order of sub-scans can be random if the number of rows in
child relations can vary across runs.  For the above case, the subpaths
(non-partial paths) are always in the descending order of their cost,
and I can see that by running it locally.
On my local machine, the output is as below:

regression=# explain select round(avg(aa)), sum(aa) from a_star;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Finalize Aggregate  (cost=2.30..2.31 rows=1 width=40)
   ->  Gather  (cost=2.28..2.29 rows=3 width=40)
         Workers Planned: 3
         ->  Partial Aggregate  (cost=2.28..2.29 rows=1 width=40)
               ->  Parallel Append  (cost=0.00..2.20 rows=15 width=4)
                     ->  Seq Scan on d_star  (cost=0.00..1.16 rows=16 width=4)
                     ->  Seq Scan on f_star  (cost=0.00..1.16 rows=16 width=4)
                     ->  Seq Scan on e_star  (cost=0.00..1.07 rows=7 width=4)
                     ->  Seq Scan on b_star  (cost=0.00..1.04 rows=4 width=4)
                     ->  Seq Scan on c_star  (cost=0.00..1.04 rows=4 width=4)
                     ->  Seq Scan on a_star  (cost=0.00..1.03 rows=3 width=4)
(11 rows)

The above indicates that the paths are listed in the expected order.
What makes you think that the order of sub-scans can be random?  Is it
possible that the number of rows in child relations can vary across
runs?

One theory that can explain the above failure is that the costs of
scanning some of the sub-paths are very close, due to which the results
can sometimes vary.  If that is the case, then probably using a
fuzz_factor in the cost comparison (as is done in the attached patch)
can improve the situation; maybe we also have to consider some other
factors, like the number of rows in each subpath.  However, it might be
better to first somehow reproduce this case and see what is going
wrong.  Any ideas?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments
On 8 January 2018 at 10:10, Amit Kapila <amit.kapila16@gmail.com> wrote:
> regression=# explain select round(avg(aa)), sum(aa) from a_star;
>                                   QUERY PLAN
> -------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=2.30..2.31 rows=1 width=40)
>    ->  Gather  (cost=2.28..2.29 rows=3 width=40)
>          Workers Planned: 3
>          ->  Partial Aggregate  (cost=2.28..2.29 rows=1 width=40)
>                ->  Parallel Append  (cost=0.00..2.20 rows=15 width=4)
>                      ->  Seq Scan on d_star  (cost=0.00..1.16 rows=16 width=4)
>                      ->  Seq Scan on f_star  (cost=0.00..1.16 rows=16 width=4)
>                      ->  Seq Scan on e_star  (cost=0.00..1.07 rows=7 width=4)
>                      ->  Seq Scan on b_star  (cost=0.00..1.04 rows=4 width=4)
>                      ->  Seq Scan on c_star  (cost=0.00..1.04 rows=4 width=4)
>                      ->  Seq Scan on a_star  (cost=0.00..1.03 rows=3 width=4)
> (11 rows)
>
> The above indicates that paths are listed in the order as expected.
> What makes you think that the order of sub-scans can be random?  Is it
> possible that the number of rows in child relations can vary across
> runs?
>
> One theory that can explain above failure is that the costs of
> scanning some of the sub-paths is very close due to which sometimes
> the results can vary.  If that is the case, then probably using
> fuzz_factor in costs comparison (as is done in attached patch) can
> improve the situation, may be we have to consider some other factors
> like number of rows in each subpath.  However, it might be better to
> first somehow reproduce this case and see what is going wrong, any
> ideas?

The fact that b_star gets moved from 5th position to the first position
in the scans indicates that its cost shoots up from 1.04 to a value
greater than 1.16.  It does not look like a case where two costs are
almost the same, due to which their positions swap sometimes.  I am
trying to figure out what else it can be ...

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Amit Khandekar <amitdkhan.pg@gmail.com> writes:
> The fact that b_star gets moved from 5th position to the first
> position in the scans indicates that its cost shoots up from 1.04 to
> a value greater than 1.16.  It does not look like a case where two
> costs are almost the same, due to which their positions swap
> sometimes.  I am trying to figure out what else it can be ...

The gut feeling I had upon seeing the failure was that the plan shape
depends on the order in which rows happen to be read from the system
catalogs by a heapscan.  I've not tried to run that idea to ground yet.

			regards, tom lane
On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>> The fact that b_star gets moved from 5th position to the first
>> position in the scans indicates that its cost shoots up from 1.04 to
>> a value greater than 1.16.  It does not look like a case where two
>> costs are almost the same, due to which their positions swap
>> sometimes.  I am trying to figure out what else it can be ...
>

That occurred to me as well, but still, the change in plan can happen
due to similar costs.  Another possibility, as indicated in the
previous email, is that the stats of the table (reltuples, relpages)
are somehow not appropriate, say because for some reason analyze
doesn't happen on the table.  For example, if you just manually update
the value of reltuples for b_star in pg_class to 20 or so, you will see
the plan as indicated in the failure.  If that is true, then probably
doing ANALYZE before Parallel Append should do the trick.

> The gut feeling I had upon seeing the failure was that the plan shape
> depends on the order in which rows happen to be read from the system
> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>

I don't see how something like that can happen, because we internally
sort the subpaths for Parallel Append.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 8 January 2018 at 13:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>>> The fact that b_star gets moved from 5th position to the first
>>> position in the scans indicates that its cost shoots up from 1.04 to
>>> a value greater than 1.16.  It does not look like a case where two
>>> costs are almost the same, due to which their positions swap
>>> sometimes.  I am trying to figure out what else it can be ...
>>
>
> That occurred to me as well, but still, the change in plan can happen
> due to the similar costs.

Agreed.  But I think we should first fix the issue due to which the
test might be failing in this case.  BTW, for your patch, I am thinking
we could have a separate factor other than STD_FUZZ_FACTOR?  That way,
we can make it much smaller than 1.01.  And anyway, STD_FUZZ_FACTOR is
used only for comparing paths on the same relation, whereas in our
case, the comparison goal is different.

> Another possibility as indicated in the previous email is that if
> somehow the stats of the table (reltuples, relpages) are not
> appropriate, say due to some reason analyze doesn't happen on the
> table.

Yes, I am also thinking on the same lines.  E.g., if relpages is 0 (due
to no analyze run yet), the tuple density calculation follows a
different logic, due to which reltuples can be quite a bit bigger.  I
suspect this also might be the reason.  So yes, I think it's worth
having ANALYZE on *_star.

> For example, if you just manually update the value of reltuples for
> b_star in pg_class to 20 or so, you will see the plan as indicated in
> the failure.  If that is true, then probably doing ANALYZE before
> Parallel Append should do the trick.

Or better still, we can have ANALYZE in create_misc.sql and
create_table.sql, where the tables are populated.
>> The gut feeling I had upon seeing the failure was that the plan shape
>> depends on the order in which rows happen to be read from the system
>> catalogs by a heapscan.  I've not tried to run that idea to ground yet.
>
> I don't see how something like that can happen because we internally
> sort the subpaths for parallel append.

True.  Or maybe I didn't understand.  Tom, are you referring to reading
pg_class.reltuples or similar, when you say "read from the system
catalogs"?

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On Mon, Jan 8, 2018 at 2:11 PM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On 8 January 2018 at 13:35, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Mon, Jan 8, 2018 at 11:26 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Amit Khandekar <amitdkhan.pg@gmail.com> writes:
>>>> The fact that b_star gets moved from 5th position to the first
>>>> position in the scans indicates that its cost shoots up from 1.04 to
>>>> a value greater than 1.16.  It does not look like a case where two
>>>> costs are almost the same, due to which their positions swap
>>>> sometimes.  I am trying to figure out what else it can be ...
>>>
>>
>> That occurred to me as well, but still, the change in plan can happen
>> due to the similar costs.
>
> Agreed.  But I think we should first fix the issue due to which the
> test might be failing in this case.  BTW, for your patch, I am thinking
> we could have a separate factor other than STD_FUZZ_FACTOR?  That way,
> we can make it much smaller than 1.01.
>

Sounds good.  However, I am not sure what the right value for it should
be; apart from STD_FUZZ_FACTOR, we also use 1.0000000001 as a
fuzz_factor in the code.  Any suggestions?

> And anyway, STD_FUZZ_FACTOR is used only for comparing paths on the
> same relation, whereas in our case, the comparison goal is different.
>
>> Another possibility as indicated in the previous email is that if
>> somehow the stats of the table (reltuples, relpages) are not
>> appropriate, say due to some reason analyze doesn't happen on the
>> table.
>
> Yes, I am also thinking on the same lines.  E.g., if relpages is 0
> (due to no analyze run yet), the tuple density calculation follows a
> different logic, due to which reltuples can be quite a bit bigger.  I
> suspect this also might be the reason.  So yes, I think it's worth
> having ANALYZE on *_star.
>
>> For example, if you just manually update the value of reltuples for
>> b_star in pg_class to 20 or so, you will see the plan as indicated in
>> the failure.
>> If that is true, then probably doing ANALYZE before Parallel Append
>> should do the trick.
>
> Or better still, we can have ANALYZE in create_misc.sql and
> create_table.sql, where the tables are populated.
>

Then we probably want to do it in the misc.sql file, as that alters
some of these tables, which can lead to a rewrite of the table.  I
think we can do it either way, but let's first wait for Tom or Robert
to comment on whether they agree with this theory, or whether they see
some other reason for this behavior.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> One theory that can explain above failure is that the costs of
> scanning some of the sub-paths is very close due to which sometimes
> the results can vary.  If that is the case, then probably using
> fuzz_factor in costs comparison (as is done in attached patch) can
> improve the situation, may be we have to consider some other factors
> like number of rows in each subpath.

This isn't an acceptable solution because sorting requires that the
comparison operator satisfy the transitive property; that is, if a = b
and b = c then a = c.  With your proposed comparator, you could have
a = b and b = c but a < c.  That will break stuff.

It seems like the obvious fix here is to use a query where the contents
of the partitions are such that the sorting always produces the same
result.  We could do that either by changing the query or by changing
the data in the partitions or, maybe, by inserting ANALYZE someplace.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
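[Editor's note: the transitivity problem described above is easy to demonstrate. The sketch below uses a hypothetical 1% fuzz factor standing in for the patch's cost comparison — it models the idea of a fuzzy comparator, not the actual patch code — and constructs three costs where a compares equal to b, and b to c, yet a compares below c:]

```python
def fuzzy_cmp(a, b, fuzz=1.01):
    """Compare two costs, treating them as equal if within a 1% fuzz factor.

    A hypothetical model of a fuzz-factor-based cost comparator; NOT the
    actual PostgreSQL code.  Returns -1, 0, or +1 qsort-style.
    """
    if a > b * fuzz:
        return 1
    if b > a * fuzz:
        return -1
    return 0

a, b, c = 100.0, 100.9, 101.9

# a "equals" b and b "equals" c under the fuzzy rule...
assert fuzzy_cmp(a, b) == 0
assert fuzzy_cmp(b, c) == 0
# ...but a < c: the relation is not transitive, so qsort's contract
# (a consistent total order) is violated, and the sort result is undefined.
assert fuzzy_cmp(a, c) == -1
```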
Ordering the plan output elements by estimated cost will cause spurious plan changes to be reported after table cardinalities change. Can we choose an explain output order that is stable to changes in cardinality, please? thank you, /Jim -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
On Mon, Jan 8, 2018 at 11:28 AM, Jim Finnerty <jfinnert@amazon.com> wrote:
> Ordering the plan output elements by estimated cost will cause spurious
> plan changes to be reported after table cardinalities change.  Can we
> choose an explain output order that is stable to changes in
> cardinality, please?

No.  The order of the plans in the EXPLAIN output isn't some kind of
display artifact.  Parallel Append sorts the paths by cost and by
whether they are partial, for reasons explained in the comments in
create_append_path(), and EXPLAIN prints them in that order.

In general, if you expect to be able to change the contents of the
tables in the regression test database without changing the results of
the regression test queries, you are in for a lot of disappointment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Jim Finnerty <jfinnert@amazon.com> writes:
> Ordering the plan output elements by estimated cost will cause spurious
> plan changes to be reported after table cardinalities change.  Can we
> choose an explain output order that is stable to changes in
> cardinality, please?

I found the code that's doing this, in create_append_path, and it says:

 * For parallel append, non-partial paths are sorted by descending total
 * costs. That way, the total time to finish all non-partial paths is
 * minimized.  Also, the partial paths are sorted by descending startup
 * costs.  There may be some paths that require to do startup work by a
 * single worker.  In such case, it's better for workers to choose the
 * expensive ones first, whereas the leader should choose the cheapest
 * startup plan.

There's some merit in that argument, although I'm not sure how much.
It's certainly pointless to sort that way if the expected number of
workers is >= the number of subpaths.  More generally, I wonder if it
wouldn't be better to implement this behavior at runtime rather than
plan time.  Something along the line of "workers choose the
highest-cost remaining subplan, but leader chooses the lowest-cost
one".

			regards, tom lane
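[Editor's note: based on the comment quoted above, the resulting subpath ordering can be sketched as follows — non-partial subpaths by descending total cost, then partial subpaths by descending startup cost. This is a simplified model of the ordering rule, not the list_qsort-based C code; the path names and cost figures are illustrative only:]

```python
def order_append_subpaths(non_partial, partial):
    """Model of the Parallel Append subpath ordering described in the
    create_append_path comment: non-partial paths first, by descending
    total cost; partial paths after, by descending startup cost.

    Each path is a (name, startup_cost, total_cost) tuple.
    """
    non_partial = sorted(non_partial, key=lambda p: p[2], reverse=True)
    partial = sorted(partial, key=lambda p: p[1], reverse=True)
    return non_partial + partial

# Illustrative inputs (costs loosely echo the a_star example upthread).
non_partial = [("b_star", 0.0, 1.04), ("d_star", 0.0, 1.16), ("a_star", 0.0, 1.03)]
partial = [("p1", 5.0, 50.0), ("p2", 20.0, 30.0)]

print([p[0] for p in order_append_subpaths(non_partial, partial)])
# → ['d_star', 'b_star', 'a_star', 'p2', 'p1']
```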
On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jim Finnerty <jfinnert@amazon.com> writes:
>> Ordering the plan output elements by estimated cost will cause
>> spurious plan changes to be reported after table cardinalities
>> change.  Can we choose an explain output order that is stable to
>> changes in cardinality, please?
>
> I found the code that's doing this, in create_append_path, and it says:
>
> * For parallel append, non-partial paths are sorted by descending total
> * costs. That way, the total time to finish all non-partial paths is
> * minimized.  Also, the partial paths are sorted by descending startup
> * costs.  There may be some paths that require to do startup work by a
> * single worker.  In such case, it's better for workers to choose the
> * expensive ones first, whereas the leader should choose the cheapest
> * startup plan.
>
> There's some merit in that argument, although I'm not sure how much.
> It's certainly pointless to sort that way if the expected number of
> workers is >= the number of subpaths.

That is not completely true, because the leader will generally pick
first before the workers have finished starting up, and we want it to
pick something cheap and, ideally, partial, so that it's not committed
to doing a ton of work while the workers sit there idle after filling
their tuple queues.

> More generally, I wonder if it wouldn't be better to implement this
> behavior at runtime rather than plan time.  Something along the line
> of "workers choose the highest-cost remaining subplan, but leader
> chooses the lowest-cost one".

Ignoring some details around partial vs. non-partial plans, that's
pretty much what we ARE doing, but to make it efficient, we sort the
paths at plan time so that those choices are easy to make at runtime.
If we didn't do that, we could have every worker sort the paths at execution time instead, or have the first process to arrive perform the sort and store the results in shared memory while everyone else waits, but that seems to be more complicated and less efficient, so I don't understand why you're proposing it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jan 8, 2018 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> More generally, I wonder if it wouldn't be better to implement this
>> behavior at runtime rather than plan time.

> Ignoring some details around partial vs. non-partial plans, that's
> pretty much what we ARE doing, but to make it efficient, we sort the
> paths at plan time so that those choices are easy to make at runtime.
> If we didn't do that, we could have every worker sort the paths at
> execution time instead, or have the first process to arrive perform
> the sort and store the results in shared memory while everyone else
> waits, but that seems to be more complicated and less efficient, so I
> don't understand why you're proposing it.

The main bit of info we'd have at runtime that we lack at plan time is
certainty about the number of available workers.  Maybe that doesn't
really add anything useful to the order in which subplans would be
doled out; not sure.

			regards, tom lane
On Mon, Jan 8, 2018 at 11:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ignoring some details around partial vs. non-partial plans, that's
>> pretty much what we ARE doing, but to make it efficient, we sort the
>> paths at plan time so that those choices are easy to make at runtime.
>> If we didn't do that, we could have every worker sort the paths at
>> execution time instead, or have the first process to arrive perform
>> the sort and store the results in shared memory while everyone else
>> waits, but that seems to be more complicated and less efficient, so I
>> don't understand why you're proposing it.
>
> The main bit of info we'd have at runtime that we lack at plan time is
> certainty about the number of available workers.  Maybe that doesn't
> really add anything useful to the order in which subplans would be
> doled out; not sure.

The current algorithm doesn't have any use for that information; I'm
not sure whether some other algorithm might.  If we had perfect
information, each participant would always choose the next task in such
a way as to minimize overall execution time.  Is it possible that the
correct choice might depend on how many workers we have in total?

/me thinks for a bit.

Yes, that's possible.  Suppose that the execution times of 5
non-partial plans are

100 99 60 40 1

and that the number of participants is either 2 or 3.  Suppose further
that the number of rows returned is nominal so that there's no issue of
the worker(s) having to wait.  If there are 2 participants, one process
should get the plans with execution times 99 and 60 and the other
should get the rest.  If there are 3 participants, one process should
get only 100, the second should get 99 and 1, and the last should get
60 and 40.

In the actually implemented algorithm, the leader will grab the cheap
plan at the end, the worker(s) will grab the expensive plans at the
beginning, and things probably won't actually work out very nicely.
With three participants, the leader will end up with 1 and 40, the first worker with 100 only, and the second worker with 99 and 60. Fail. The choice of which plan the leader takes might also depend on the number of rows being returned by the various plans. If there's one plan that returns a lot of rows, it would be nice to run that one in the leader, because then we save the cost of transferring those rows between processes. On the other hand, that also risks starving all of the workers; it could be faster to incur the overhead of transferring those rows so that the leader focuses on servicing the tuple queues rather than running plans. There are also problems with partial plans having high startup costs. In general, if a partial plan has a high startup cost, we'd like the leader not to choose it, because then it's not available to read tuples. So, if we have one plan with a startup cost with 1000 and another plan with a startup cost of 800, then we might think that the leader should prefer the latter. However, if the former plan is a parallel hash join and the latter is a non-parallel-aware join, then the reverse *may* be true, because the startup cost of parallel hash can (mostly) be *shared* among as many participants as we have, whereas non-parallel-aware nodes will *repeat* the startup work. Of course, the plan tree doesn't carry information about whether startup costs of parallel plans are likely to be amortized across workers or repeated in each one, and even if it did, it does not seem to be trivial to put that information to good use. Yet another problem with parallel append -- and really, with parallel planning in general -- is that it ignores the cost of resource contention between cooperating backends. 
If we do a Parallel Append over Parallel Seq Scan operations, spreading
workers out figures to win if (1) there are enough of them to generate
significant contention on the in-memory data structures, which is
especially likely if the data is resident in memory, or if (2) the
different partitions are on different filesystems that can satisfy I/O
requests in parallel, or even if (3) the different partitions are all
on the same filesystem but that filesystem has enough spindles to
sequential-scan them all at once.  But it could easily lose if, say,
the partitions are on different filesystems serviced by a single disk
head, and the alternating I/O pattern produces a continuous stream of
long seeks.  On the third hand, that case could work out too if the
per-row processing is enough that we weren't I/O-bound in the first
place.

I'd be quite happy if somebody wanted to dig into these issues more.  I
don't pretend that the existing implementation is anything more than a
first approximation of what we really want here.  It's just based on
the observations that (1) in practice, the leader is very often the
bottleneck when executing parallel queries and (2) a task that takes a
long time to run and can't be sped up by adding workers must get
started as soon as possible.  I have a feeling that to do this really
well is going to require information that we don't have readily
available, like which resources which subplans will use at which times
and how many CPUs we need to use to render the query I/O-bound.
However, there may be clear improvements that can be made with only
localized changes, and I welcome ideas.  Query planning appears to be a
hard problem.  :-p

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
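[Editor's note: the 100/99/60/40/1 example above can be checked with a little arithmetic. The sketch below just computes the finish time (makespan) of the two three-participant assignments described upthread — the outcome attributed to the implemented algorithm versus the optimal split — and is not a simulation of the actual executor:]

```python
def makespan(assignment):
    """Overall finish time when each participant runs its plans serially:
    the maximum of the per-participant cost sums."""
    return max(sum(plans) for plans in assignment)

# Non-partial plans costing 100, 99, 60, 40, 1, run by three
# participants (leader + two workers), per the example upthread.
described = [[1, 40], [100], [99, 60]]   # outcome described for the current algorithm
optimal   = [[100], [99, 1], [60, 40]]   # best possible three-way split

print(makespan(described), makespan(optimal))  # → 159 100
```

So the described assignment finishes 59% later than the optimal one, which is the "Fail" in the example.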
Robert Haas <robertmhaas@gmail.com> writes:
> [ example where current parallel-append dispatch algorithm loses ]

I should clarify my thinking a bit more: I was imagining that we might
switch to a system where, as each worker comes free, it makes a dynamic
choice of which subplan to take next.  That means that not only do we
have more knowledge of the number of available workers than the planner
did, but we also have hard knowledge of which subplans actually
finished first.  So even (or especially?) granted that the cost
estimates are imperfect, that might provide a leg up in choosing how to
schedule the remaining tasks.

I haven't thought about it hard enough to have a clear idea of how that
might win.  But for sure I don't think we should dismiss out of hand
the idea that it could be better than choosing a fixed dispatch order
at plan time.  Anyway, I'm merely suggesting that as an approach worth
investigating in future research; I doubt anyone is going to produce an
implementation right away.

So we still want to figure out exactly what happened on silverfish and
how we can nail down that expected plan shape.  (Or, perhaps, decide
that we don't need to see that plan in the output?)

			regards, tom lane
I see.  Maybe the EXPLAIN output can be made more stable when "EXPLAIN
(costs FALSE)", though?
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> One theory that can explain above failure is that the costs of
>> scanning some of the sub-paths is very close due to which sometimes
>> the results can vary.  If that is the case, then probably using
>> fuzz_factor in costs comparison (as is done in attached patch) can
>> improve the situation, may be we have to consider some other factors
>> like number of rows in each subpath.

> This isn't an acceptable solution because sorting requires that the
> comparison operator satisfy the transitive property; that is, if a = b
> and b = c then a = c.  With your proposed comparator, you could have
> a = b and b = c but a < c.  That will break stuff.

> It seems like the obvious fix here is to use a query where the
> contents of the partitions are such that the sorting always produces
> the same result.  We could do that either by changing the query or by
> changing the data in the partitions or, maybe, by inserting ANALYZE
> someplace.

The foo_star tables are made in create_table.sql, filled in
create_misc.sql, and not modified thereafter.  The fact that we have
accurate rowcounts for them in select_parallel.sql is because of the
database-wide VACUUM that happens at the start of sanity_check.sql.
Given the lack of any WHERE condition, the costs in this particular
query depend only on the rowcount and physical table size, so inserting
an ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
concerned about side-effects on other queries anyway if we were to
ANALYZE tables that have never been ANALYZEd in the regression tests
before.

I remain pretty much at a loss for exactly what happened on silverfish,
but that doesn't mean I have no ideas for improving things.  Both
append_total_cost_compare and append_startup_cost_compare are seriously
deficient IMO, because they ignore the question of what to do when
total costs (resp. startup costs) are exactly equal, leaving the
outcome to the none too tender mercies of qsort().  Which you'll recall
is not a stable algorithm.  (For fewer than 7 items, our implementation
of qsort falls back to an insertion sort, which *is* stable, perhaps
explaining why we've not seen failures in this test case many times
before.  So this line of thought doesn't explain what happened on
silverfish.  But there's surely room for improvement here.)

The normal rule in the planner, as embodied in places like
compare_path_costs, is to break ties on total cost by comparing startup
cost, and vice versa, which these two functions fail to do; so my first
point is that they ought to be using compare_path_costs rather than
inventing their own approach.  This would especially affect
append_startup_cost_compare, because of the frequency with which plans
have identical startup costs (ie 0).  Right now I'm afraid we're
getting basically random ordering of the partial subpaths in most
cases.

In the regression test case at hand, the startup costs are all zero so
this change wouldn't improve the test case's stability.  So I'm
thinking that in addition, it would be a good idea for these functions
to break exact compare_path_costs ties in some arbitrary but
deterministic way, rather than letting qsort() have the final say on
what happens.  If the subplans were all simple relation scans we could
order them by planner relid, but I'm not sure what to do if they're
not.

BTW, I had not looked closely at list_qsort before, but now that I have
it seems seriously bizarre.  Why is it allocating a new List header
rather than reusing the old one?  Because it stomps on the next-links
of the list cells, the old header is effectively corrupted and can't
possibly be useful afterwards, so we might as well re-use it and save a
palloc cycle.  (By the same token, it's pretty ludicrous that it claims
the input List is "const".  Not stomping on the list header is a pretty
poor definition of not modifying the list.)
regards, tom lane
I wrote:
> In the regression test case at hand, the startup costs are all zero so
> this change wouldn't improve the test case's stability.  So I'm
> thinking that in addition, it would be a good idea for these functions
> to break exact compare_path_costs ties in some arbitrary but
> deterministic way, rather than letting qsort() have the final say on
> what happens.  If the subplans were all simple relation scans we could
> order them by planner relid, but I'm not sure what to do if they're
> not.

Ah, I have an idea --- let's create a bms_compare() qsort comparator
for bitmapsets, and use that on the paths' relid sets.  It hardly
matters what the exact semantics of the comparator are, as long as it
behaves sanely for equal sets.

			regards, tom lane
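[Editor's note: the semantics proposed here — equal iff the sets are equal, otherwise decided by the highest-numbered bit set in one set but not the other — can be sanity-checked quickly. If a bitmapset is modeled as a nonnegative integer (bit i set ⇔ member i), that rule is just ordinary integer comparison. This is a model of the idea, not the eventual multi-word C implementation:]

```python
def bms_compare(a, b):
    """Compare two bitmapsets, modeled here as Python ints (bit i = member i).

    Equal sets compare as 0; otherwise the highest-numbered bit set in one
    set but not the other decides the result -- which, for nonnegative
    integers, is exactly plain integer comparison.
    """
    return (a > b) - (a < b)

five = 1 << 5            # the set {5}
six = 1 << 6             # the set {6}

assert bms_compare(six, five) == 1                # {6} > {5}, as proposed
assert bms_compare(five | six, five | six) == 0   # equal sets tie
assert bms_compare(0, 0) == 0                     # empty sets tie
```

Used as a final tie-breaker after total and startup cost, this makes the sort order fully deterministic whenever no two subpaths scan exactly the same relations.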
On Mon, Jan 8, 2018 at 1:26 PM, Jim Finnerty <jfinnert@amazon.com> wrote:
> I see.  Maybe the EXPLAIN output can be made more stable when "EXPLAIN
> (costs FALSE)", though?

I think that would be fixing the problem from the wrong end.  We want
to actually get a stable choice of plan, not display something other
than the actual plan because that other thing is more stable.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I wrote:
>> In the regression test case at hand, the startup costs are all zero so
>> this change wouldn't improve the test case's stability.  So I'm thinking
>> that in addition, it would be a good idea for these functions to break
>> exact compare_path_costs ties in some arbitrary but deterministic way,
>> rather than letting qsort() have the final say on what happens.  If the
>> subplans were all simple relation scans we could order them by planner
>> relid, but I'm not sure what to do if they're not.

> Ah, I have an idea --- let's create a bms_compare() qsort comparator for
> bitmapsets, and use that on the paths' relid sets.  It hardly matters
> what the exact semantics of the comparator are, as long as it behaves
> sanely for equal sets.

Concretely, I propose the attached. I spent a little bit of extra effort on the comparator in hopes that it might be useful for something else.

			regards, tom lane

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 733fe3c..edcd19a 100644
*** a/src/backend/nodes/bitmapset.c
--- b/src/backend/nodes/bitmapset.c
*************** bms_equal(const Bitmapset *a, const Bitm
*** 173,178 ****
--- 173,222 ----
  }
  
  /*
+  * bms_compare - qsort-style comparator for bitmapsets
+  *
+  * This guarantees to report values as equal iff bms_equal would say they are
+  * equal.  Otherwise, the highest-numbered bit that is set in one value but
+  * not the other determines the result.  (This rule means that, for example,
+  * {6} is greater than {5}, which seems plausible.)
+  */
+ int
+ bms_compare(const Bitmapset *a, const Bitmapset *b)
+ {
+ 	int			shortlen;
+ 	int			i;
+ 
+ 	/* Handle cases where either input is NULL */
+ 	if (a == NULL)
+ 		return bms_is_empty(b) ? 0 : -1;
+ 	else if (b == NULL)
+ 		return bms_is_empty(a) ? 0 : +1;
+ 	/* Handle cases where one input is longer than the other */
+ 	shortlen = Min(a->nwords, b->nwords);
+ 	for (i = shortlen; i < a->nwords; i++)
+ 	{
+ 		if (a->words[i] != 0)
+ 			return +1;
+ 	}
+ 	for (i = shortlen; i < b->nwords; i++)
+ 	{
+ 		if (b->words[i] != 0)
+ 			return -1;
+ 	}
+ 	/* Process words in common */
+ 	i = shortlen;
+ 	while (--i >= 0)
+ 	{
+ 		bitmapword	aw = a->words[i];
+ 		bitmapword	bw = b->words[i];
+ 
+ 		if (aw != bw)
+ 			return (aw > bw) ? +1 : -1;
+ 	}
+ 	return 0;
+ }
+ 
  /*
   * bms_make_singleton - build a bitmapset containing a single member
   */
  Bitmapset *
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 7df8761..5c2b996 100644
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
*************** create_append_path(RelOptInfo *rel,
*** 1274,1311 ****
  /*
   * append_total_cost_compare
!  *	  list_qsort comparator for sorting append child paths by total_cost
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
  
! 	if (path1->total_cost > path2->total_cost)
! 		return -1;
! 	if (path1->total_cost < path2->total_cost)
! 		return 1;
! 
! 	return 0;
  }
  
  /*
   * append_startup_cost_compare
!  *	  list_qsort comparator for sorting append child paths by startup_cost
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
  
! 	if (path1->startup_cost > path2->startup_cost)
! 		return -1;
! 	if (path1->startup_cost < path2->startup_cost)
! 		return 1;
! 
! 	return 0;
  }
  
  /*
--- 1274,1317 ----
  /*
   * append_total_cost_compare
!  *	  qsort comparator for sorting append child paths by total_cost descending
!  *
!  * For equal total costs, we fall back to comparing startup costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_total_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
+ 	int			cmp;
  
! 	cmp = compare_path_costs(path1, path2, TOTAL_COST);
! 	if (cmp == 0)
! 		cmp = bms_compare(path1->parent->relids, path2->parent->relids);
! 	return -cmp;
  }
  
  /*
   * append_startup_cost_compare
!  *	  qsort comparator for sorting append child paths by startup_cost descending
!  *
!  * For equal startup costs, we fall back to comparing total costs; if those
!  * are equal too, break ties using bms_compare on the paths' relids.
!  * (This is to avoid getting unpredictable results from qsort.)
   */
  static int
  append_startup_cost_compare(const void *a, const void *b)
  {
  	Path	   *path1 = (Path *) lfirst(*(ListCell **) a);
  	Path	   *path2 = (Path *) lfirst(*(ListCell **) b);
+ 	int			cmp;
  
! 	cmp = compare_path_costs(path1, path2, STARTUP_COST);
! 	if (cmp == 0)
! 		cmp = bms_compare(path1->parent->relids, path2->parent->relids);
! 	return -cmp;
  }
  
  /*
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 15397e9..67e8920 100644
*** a/src/include/nodes/bitmapset.h
--- b/src/include/nodes/bitmapset.h
*************** typedef enum
*** 65,70 ****
--- 65,71 ----
  
  extern Bitmapset *bms_copy(const Bitmapset *a);
  extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
+ extern int	bms_compare(const Bitmapset *a, const Bitmapset *b);
  extern Bitmapset *bms_make_singleton(int x);
  extern void bms_free(Bitmapset *a);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 7824ca5..5f73be4 100644
*** a/src/test/regress/expected/select_parallel.out
--- b/src/test/regress/expected/select_parallel.out
*************** explain (costs off)
*** 21,32 ****
           Workers Planned: 3
           ->  Partial Aggregate
               ->  Parallel Append
!                    ->  Parallel Seq Scan on a_star
!                    ->  Parallel Seq Scan on b_star
!                    ->  Parallel Seq Scan on c_star
                     ->  Parallel Seq Scan on d_star
                     ->  Parallel Seq Scan on e_star
!                    ->  Parallel Seq Scan on f_star
  (11 rows)
  
  select round(avg(aa)), sum(aa) from a_star a1;
--- 21,32 ----
           Workers Planned: 3
           ->  Partial Aggregate
               ->  Parallel Append
!                    ->  Parallel Seq Scan on f_star
                     ->  Parallel Seq Scan on d_star
                     ->  Parallel Seq Scan on e_star
!                    ->  Parallel Seq Scan on c_star
!                    ->  Parallel Seq Scan on b_star
!                    ->  Parallel Seq Scan on a_star
  (11 rows)
  
  select round(avg(aa)), sum(aa) from a_star a1;
*************** explain (costs off)
*** 49,58 ****
               ->  Parallel Append
                     ->  Seq Scan on d_star
                     ->  Seq Scan on c_star
-                    ->  Parallel Seq Scan on a_star
-                    ->  Parallel Seq Scan on b_star
-                    ->  Parallel Seq Scan on e_star
                     ->  Parallel Seq Scan on f_star
  (11 rows)
  
  select round(avg(aa)), sum(aa) from a_star a2;
--- 49,58 ----
               ->  Parallel Append
                     ->  Seq Scan on d_star
                     ->  Seq Scan on c_star
                     ->  Parallel Seq Scan on f_star
+                    ->  Parallel Seq Scan on e_star
+                    ->  Parallel Seq Scan on b_star
+                    ->  Parallel Seq Scan on a_star
  (11 rows)
  
  select round(avg(aa)), sum(aa) from a_star a2;
*************** explain (costs off)
*** 75,85 ****
           Workers Planned: 3
           ->  Partial Aggregate
               ->  Parallel Append
-                    ->  Seq Scan on d_star
                     ->  Seq Scan on f_star
                     ->  Seq Scan on e_star
-                    ->  Seq Scan on b_star
                     ->  Seq Scan on c_star
                     ->  Seq Scan on a_star
  (11 rows)
--- 75,85 ----
           Workers Planned: 3
           ->  Partial Aggregate
               ->  Parallel Append
                     ->  Seq Scan on f_star
+                    ->  Seq Scan on d_star
                     ->  Seq Scan on e_star
                     ->  Seq Scan on c_star
+                    ->  Seq Scan on b_star
                     ->  Seq Scan on a_star
  (11 rows)
I wrote:
> BTW, I had not looked closely at list_qsort before, but now that I have
> it seems seriously bizarre.  Why is it allocating a new List header
> rather than reusing the old one?  Because it stomps on the next-links
> of the list cells, the old header is effectively corrupted and can't
> possibly be useful afterwards, so we might as well re-use it and save
> a palloc cycle.  (By the same token, it's pretty ludicrous that it
> claims the input List is "const".  Not stomping on the list header
> is a pretty poor definition of not modifying the list.)

Actually, after looking closer, I think the current implementation of list_qsort is actively dangerous. If we left the code like this, we would have to document that create_append_path corrupts its caller's copies of subpaths and partial_subpaths when parallel_aware is true (and by "corrupt" I don't just mean "reorder": the list headers the caller has pointers to are now flat wrong). Even with the change I imagine above, we'd have to document that the caller's lists get reordered.

Although the sole caller that's currently affected doesn't do anything with those lists after the call, it's not terribly hard to imagine that future improvements there would lead to trying to build multiple paths with the same lists or something like that, so that we'd have hidden interactions between different Paths. The same types of hazards would apply to every future use of list_qsort. (In fact, I spent some time trying to think of a way that this could have led to the silverfish failure we started the thread with. I couldn't convince myself that it was possible, but I'm not entirely convinced it's not, either.)

So I think that re-using the caller's list cells is another penny-wise-but-pound-foolish idea, and that we ought to just build a new list to avoid the hazard. As per attached.
			regards, tom lane

diff --git a/src/backend/nodes/list.c b/src/backend/nodes/list.c
index 083538f..fe085d2 100644
*** a/src/backend/nodes/list.c
--- b/src/backend/nodes/list.c
*************** list_copy_tail(const List *oldlist, int
*** 1250,1290 ****
  }
  
  /*
!  * Sort a list using qsort.  A sorted list is built but the cells of the
!  * original list are re-used.  The comparator function receives arguments of
!  * type ListCell **
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
- 	ListCell   *cell;
- 	int			i;
  	int			len = list_length(list);
  	ListCell  **list_arr;
! 	List	   *new_list;
  
  	if (len == 0)
  		return NIL;
  
  	i = 0;
- 	list_arr = palloc(sizeof(ListCell *) * len);
  	foreach(cell, list)
  		list_arr[i++] = cell;
  
  	qsort(list_arr, len, sizeof(ListCell *), cmp);
  
! 	new_list = (List *) palloc(sizeof(List));
! 	new_list->type = list->type;
! 	new_list->length = len;
! 	new_list->head = list_arr[0];
! 	new_list->tail = list_arr[len - 1];
  
! 	for (i = 0; i < len - 1; i++)
! 		list_arr[i]->next = list_arr[i + 1];
  
! 	list_arr[len - 1]->next = NULL;
  	pfree(list_arr);
! 	return new_list;
  }
  
  /*
--- 1250,1312 ----
  }
  
  /*
!  * Sort a list as though by qsort.
!  *
!  * A new list is built and returned.
!  * The comparator function receives arguments of type ListCell **.
   */
  List *
  list_qsort(const List *list, list_qsort_comparator cmp)
  {
  	int			len = list_length(list);
  	ListCell  **list_arr;
! 	List	   *newlist;
! 	ListCell   *newlist_prev;
! 	ListCell   *cell;
! 	int			i;
  
+ 	/* Empty list is easy */
  	if (len == 0)
  		return NIL;
  
+ 	/* Flatten list cells into an array, so we can use qsort */
+ 	list_arr = (ListCell **) palloc(sizeof(ListCell *) * len);
  	i = 0;
  	foreach(cell, list)
  		list_arr[i++] = cell;
  
  	qsort(list_arr, len, sizeof(ListCell *), cmp);
  
! 	/* Construct new list (this code is much like list_copy) */
! 	newlist = new_list(list->type);
! 	newlist->length = len;
  
! 	/*
! 	 * Copy over the data in the first cell; new_list() has already allocated
! 	 * the head cell itself
! 	 */
! 	newlist->head->data = list_arr[0]->data;
! 	newlist_prev = newlist->head;
! 	for (i = 1; i < len; i++)
! 	{
! 		ListCell   *newlist_cur;
! 
! 		newlist_cur = (ListCell *) palloc(sizeof(*newlist_cur));
! 		newlist_cur->data = list_arr[i]->data;
! 		newlist_prev->next = newlist_cur;
! 
! 		newlist_prev = newlist_cur;
! 	}
! 
! 	newlist_prev->next = NULL;
! 	newlist->tail = newlist_prev;
! 
! 	/* Might as well free the workspace array */
  	pfree(list_arr);
! 
! 	check_list_invariants(newlist);
! 	return newlist;
  }
  
  /*
On Tue, Jan 9, 2018 at 12:48 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Jan 7, 2018 at 11:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> One theory that can explain above failure is that the costs of
>>> scanning some of the sub-paths is very close due to which sometimes
>>> the results can vary.  If that is the case, then probably using
>>> fuzz_factor in costs comparison (as is done in attached patch) can
>>> improve the situation, may be we have to consider some other factors
>>> like number of rows in each subpath.
>
>> This isn't an acceptable solution because sorting requires that the
>> comparison operator satisfy the transitive property; that is, if a = b
>> and b = c then a = c.  With your proposed comparator, you could have a
>> = b and b = c but a < c.  That will break stuff.
>
>> It seems like the obvious fix here is to use a query where the
>> contents of the partitions are such that the sorting always produces
>> the same result.  We could do that either by changing the query or by
>> changing the data in the partitions or, maybe, by inserting ANALYZE
>> someplace.
>
> The foo_star tables are made in create_table.sql, filled in
> create_misc.sql, and not modified thereafter.  The fact that we have
> accurate rowcounts for them in select_parallel.sql is because of the
> database-wide VACUUM that happens at the start of sanity_check.sql.
> Given the lack of any WHERE condition, the costs in this particular query
> depend only on the rowcount and physical table size, so inserting an
> ANALYZE shouldn't (and doesn't, for me) change anything.  I would be
> concerned about side-effects on other queries anyway if we were to ANALYZE
> tables that have never been ANALYZEd in the regression tests before.
>

Fair point. This seems to indicate that wrong rowcounts are probably not the reason for the failure.
However, I think it might still be good to use a different set of tables (probably new tables created with appropriate data for these queries) and to analyze them explicitly before these queries, rather than relying on the execution order of some not-directly-related tests.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com