Discussion: Top-N sorts verses parallelism
Hi hackers,

The start-up cost of bounded (top-N) sorts is sensitive at the small
end of N, and the
comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
seem to correspond to reality. Here's a contrived example that comes
from a benchmark:

create table foo as
select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
analyze foo;
select * from foo order by b limit N;

Estimated start-up costs for different N:

limit 1 = 19425.00
limit 2 = 24425.00
limit 3 = 27349.81
limit 4 = 29425.00
limit 10 = 36034.64
limit 50 = 47644.28
limit 100 = 52644.28
limit 133 = 54701.41

But the actual time doesn't really change on this random development
system: it's around 300ms. I stopped at limit 133 because for this
particular query, at 134 a Gather Merge plan kicked in and I got
degree 3 parallel sorting and I got my answer just shy of three times
faster. The speed up definitely paid for the parallel_setup_cost and
only one tuple was sent from each worker to the leader because the
limit was pushed down.

I want to get my answer three times as fast when I say limit 1 too!
But I can't because we seem to think that limit 1 is dramatically
cheaper than limit 134, so parallelism isn't worth bothering with.
With wider tuples the same effect can be seen, though it gets much
more difficult to convince Gather Merge to be selected at all for
reasons I haven't looked into yet.

Is Gather Merge for top-N sorts a missed opportunity due to
underestimation of top-N for small values of N? Or is there some
other way to think about this problem?

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
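For reference, here is a minimal sketch (a throwaway program, not the actual costsize.c code) that reproduces the estimated start-up costs listed above. It assumes the default cpu_operator_cost of 0.0025, so comparison_cost = 2 * cpu_operator_cost, and takes the Seq Scan total cost of 14425.00 reported by EXPLAIN as a given:

    /* Reproduce the planner's bounded-sort start-up estimates for the
     * example query, under the assumptions stated above.
     * Build with: cc -std=c99 topn_cost.c -lm */
    #include <stdio.h>
    #include <math.h>

    int
    main(void)
    {
        const double comparison_cost = 2.0 * 0.0025;  /* 2 * cpu_operator_cost */
        const double tuples = 1000000.0;              /* rows fed into the sort */
        const double seqscan_cost = 14425.00;         /* cost of the Seq Scan below it */
        const int    limits[] = {1, 2, 3, 4, 10, 50, 100, 133};

        for (int i = 0; i < 8; i++)
        {
            double output_tuples = (double) limits[i];
            /* the top-N heapsort comparison curve quoted above */
            double sort_startup =
                comparison_cost * tuples * log2(2.0 * output_tuples);

            printf("limit %3d = %.2f\n", limits[i], seqscan_cost + sort_startup);
        }
        return 0;
    }

The LOG2(2.0 * output_tuples) factor models the depth of the heap of N retained tuples in a top-N heapsort, which is why the estimate climbs steeply over the first few values of N even though the measured runtime barely moves.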
On Wed, Dec 13, 2017 at 1:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the
> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
> seem to correspond to reality. Here's a contrived example that comes
> from a benchmark:
>
> create table foo as
> select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
> analyze foo;
> select * from foo order by b limit N;
>
> But the actual time doesn't really change on this random development
> system: it's around 300ms. I stopped at limit 133 because for this
> particular query, at 134 a Gather Merge plan kicked in and I got
> degree 3 parallel sorting and I got my answer just shy of three times
> faster. The speed up definitely paid for the parallel_setup_cost and
> only one tuple was sent from each worker to the leader because the
> limit was pushed down.

I get:

rhaas=# explain analyze select * from foo order by b limit 1;
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=19425.00..19425.00 rows=1 width=8) (actual time=183.129..183.129 rows=1 loops=1)
   ->  Sort  (cost=19425.00..21925.00 rows=1000000 width=8) (actual time=183.128..183.128 rows=1 loops=1)
         Sort Key: b
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Seq Scan on foo  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.046..89.440 rows=1000000 loops=1)
 Planning time: 0.083 ms
 Execution time: 183.171 ms
(7 rows)

If we assume the costing of the Seq Scan node is right, then the
startup cost of the Sort node is too low. For the Seq Scan node, each
1ms of execution time corresponds to 161.28 cost units. If that were
accurate, then the 183.128 - 89.440 = 93.688 ms of startup time for
the Sort ought to correspond to a cost of 14425.00 + (93.688 * 161.28)
= 29535.00, but the actual cost is only 19425.00. In other words, the
ratio of the startup cost of the sort to the total cost of the seq
scan ought to be about 2:1, to match the ratio of the execution times,
but it's actually more like 4:3.

I also see that it switches to Gather Merge at LIMIT 134, but if I
lower random_page_cost and seq_page_cost from the default values to
0.1, then for me it switches earlier, at 63. Of course, at that point
the sort cost is now 3.5x the Seq Scan cost, so now things have gone
too far in the other direction and we're still not getting the plan
right, but I am out of time to investigate this for the moment.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
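As a quick cross-check of that arithmetic, here is a tiny throwaway program (again not planner code) that redoes the calculation using only the numbers printed by EXPLAIN ANALYZE above:

    /* Back-of-the-envelope check: what startup cost the Sort "ought" to
     * have if cost tracked execution time linearly. */
    #include <stdio.h>

    int
    main(void)
    {
        double seqscan_cost = 14425.00;             /* Seq Scan total cost */
        double seqscan_ms   = 89.440;               /* Seq Scan actual time */
        double sort_ms      = 183.128 - 89.440;     /* extra time spent in the Sort */
        double units_per_ms = seqscan_cost / seqscan_ms;   /* roughly 161.28 */

        printf("implied sort startup cost = %.2f\n",
               seqscan_cost + sort_ms * units_per_ms);
        printf("actual sort startup cost  = %.2f\n", 19425.00);
        return 0;
    }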
On Tue, Dec 12, 2017 at 10:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the
> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
> seem to correspond to reality.
Do we want the cost estimate to be accurate for the worst case (which is not terribly unlikely to happen: the input order is the opposite of the desired output order), or for the average case?
Cheers,
Jeff
On Tue, Dec 12, 2017 at 10:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hi hackers,
>
> The start-up cost of bounded (top-N) sorts is sensitive at the small
> end of N, and the
> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
> seem to correspond to reality. Here's a contrived example that comes
> from a benchmark:
>
> create table foo as
> select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
> analyze foo;
> select * from foo order by b limit N;
This looks like a costing bug. The estimated cost of sorting 416,667 estimated tuples in one parallel worker is almost identical to the estimated cost of sorting 1,000,000 tuples when parallelism is turned off. Adding some logging to the cost_sort function, it looks like the limit does not get sent down for the parallel estimate:
NOTICE: JJ cost_sort tuples 1000000.000000, limit 61.000000, sort_mem 65536
NOTICE: JJ cost_sort tuples 416667.000000, limit -1.000000, sort_mem 65536
So the LIMIT gets passed down for the execution step, but not for the planning step.
(On my system, LIMIT 61 is the point at which it switches to parallel)
In any case, if we "fixed" the top-N estimate to use the random case rather than the worst case, that would make LIMIT 133 look more like LIMIT 1, so it would move things in the opposite direction from what you want.
Cheers,
Jeff
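To put rough numbers on those two NOTICE lines, here is a small sketch of the two comparison-cost curves involved. The formulas and the 0.0025 cpu_operator_cost default are assumptions about the in-memory branches of cost_sort (the bounded curve is the one quoted at the top of the thread); this is an illustration, not the real code:

    /* Compare the sort comparison cost for: the serial bounded sort of
     * 1,000,000 tuples with limit 61, a worker's sort of 416,667 tuples
     * costed without the limit (-1), and the same worker sort with the
     * limit passed down.  Build with: cc -std=c99 worker_cost.c -lm */
    #include <stdio.h>
    #include <math.h>

    static double
    sort_comparison_cost(double tuples, double limit)
    {
        const double comparison_cost = 2.0 * 0.0025;  /* 2 * cpu_operator_cost */

        if (limit > 0 && limit < tuples)
            return comparison_cost * tuples * log2(2.0 * limit); /* top-N heapsort */
        return comparison_cost * tuples * log2(tuples);          /* full in-memory sort */
    }

    int
    main(void)
    {
        printf("serial, 1000000 tuples, limit 61: %.0f\n",
               sort_comparison_cost(1000000.0, 61.0));
        printf("worker,  416667 tuples, limit -1: %.0f\n",
               sort_comparison_cost(416667.0, -1.0));
        printf("worker,  416667 tuples, limit 61: %.0f\n",
               sort_comparison_cost(416667.0, 61.0));
        return 0;
    }

With the limit ignored (-1), the per-worker sort of 416,667 tuples is costed about as high as the whole serial top-N sort of 1,000,000 tuples, so the parallel plan only looks attractive for large N; once the limit is passed down, the per-worker estimate drops to well under half of that.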
On Fri, Dec 15, 2017 at 10:05 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Tue, Dec 12, 2017 at 10:46 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>>
>> Hi hackers,
>>
>> The start-up cost of bounded (top-N) sorts is sensitive at the small
>> end of N, and the
>> comparison_cost * tuples * LOG2(2.0 * output_tuples) curve doesn't
>> seem to correspond to reality. Here's a contrived example that comes
>> from a benchmark:
>>
>> create table foo as
>> select generate_series(1, 1000000)::int a, (random() * 1000)::int b;
>> analyze foo;
>> select * from foo order by b limit N;
>
> This looks like a costing bug. The estimated cost of sorting 416,667
> estimated tuples in one parallel worker is almost identical to the estimated
> cost of sorting 1,000,000 tuples when parallelism is turned off. Adding
> some logging to the cost_sort function, it looks like the limit does not get
> sent down for the parallel estimate:
>
> NOTICE: JJ cost_sort tuples 1000000.000000, limit 61.000000, sort_mem 65536
> NOTICE: JJ cost_sort tuples 416667.000000, limit -1.000000, sort_mem 65536
>
> So the LIMIT gets passed down for the execution step, but not for the
> planning step.

Oh, well spotted. I was looking in the wrong place. Maybe the fix is
as simple as the attached? It certainly helps in the cases I tested,
including with wider tables.

--
Thomas Munro
http://www.enterprisedb.com
Attachments
On Thu, Dec 14, 2017 at 5:12 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>
>> This looks like a costing bug. The estimated cost of sorting 416,667
>> estimated tuples in one parallel worker is almost identical to the estimated
>> cost of sorting 1,000,000 tuples when parallelism is turned off. Adding
>> some logging to the cost_sort function, it looks like the limit does not get
>> sent down for the parallel estimate:
>>
>> NOTICE: JJ cost_sort tuples 1000000.000000, limit 61.000000, sort_mem 65536
>> NOTICE: JJ cost_sort tuples 416667.000000, limit -1.000000, sort_mem 65536
>>
>> So the LIMIT gets passed down for the execution step, but not for the
>> planning step.
>
> Oh, well spotted. I was looking in the wrong place. Maybe the fix is
> as simple as the attached? It certainly helps in the cases I tested,
> including with wider tables.

I had hit on the same change. And was also surprised that it was located where it was. With the change, it uses the parallel plan all the way down to LIMIT 1.
With the patch, it still satisfies make check, so if it introduces errors they are subtle ones. If we can't actually do this and it needs to stay -1, then I think we need a comment to explain why.
Cheers,
Jeff
On Fri, Dec 15, 2017 at 2:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> I had hit on the same change. And was also surprised that it was located
> where it was. With the change, it uses the parallel plan all the way down
> to LIMIT 1.
>
> With the patch, it still satisfies make check, so if it introduces errors
> they are subtle ones. If we can't actually do this and it needs to stay -1,
> then I think we need a comment to explain why.

Interesting. I suspect this is correct now, but would not have been
before commit 3452dc5240da43e833118484e1e9b4894d04431c. AFAICS, this
doesn't affect any execution-time behavior, just the cost estimate.
And, prior to that commit, the execution-time behavior was different:
there would not have been any way for the worker to do a top-N sort,
because the LIMIT was not pushed through the Gather.

Does that sound right, or am I still confused?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Dec 16, 2017 at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Dec 15, 2017 at 2:10 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> I had hit on the same change. And was also surprised that it was located
>> where it was. With the change, it uses the parallel plan all the way down
>> to LIMIT 1.
>>
>> With the patch, it still satisfies make check, so if it introduces errors
>> they are subtle ones. If we can't actually do this and it needs to stay -1,
>> then I think we need a comment to explain why.
>
> Interesting. I suspect this is correct now, but would not have been
> before commit 3452dc5240da43e833118484e1e9b4894d04431c. AFAICS, this
> doesn't affect any execution-time behavior, just the cost estimate.
> And, prior to that commit, the execution-time behavior was different:
> there would not have been any way for the worker to do a top-N sort,
> because the LIMIT was not pushed through the Gather.
>
> Does that sound right, or am I still confused?

Looks right to me. Commit 3452dc52 just forgot to tell the planner.
I'm pleased about that, because it makes this a slam-dunk bug fix and
not some confusing, hard-to-justify costing problem.

BTW, with this patch, the same test using a smaller foo table with
250k rows and limit 1 runs a parallel plan in 45ms here, but with
220k rows and limit 1 it runs a non-parallel plan in 80ms. That's
just a regular costing problem, though: the point at which the curves
cross over will be hard to get right due to all kinds of other
variables, so I guess that kind of thing is expected.

PS Oops, I do actually know how to spell "versus". Typos are so much
more embarrassing in subject lines!

--
Thomas Munro
http://www.enterprisedb.com
On Fri, Dec 15, 2017 at 4:13 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Looks right to me. Commit 3452dc52 just forgot to tell the planner.
> I'm pleased about that because it makes this a slam-dunk bug-fix and
> not some confusing hard to justify costing problem.

Jeff Janes inquired off-list about other places where cost_sort()
gets called.

In costsize.c, it is called from initial_cost_mergejoin, which seems
like it does not present an opportunity for pushing down limits,
since we don't know how many rows we'll have to join to get a certain
number of outputs.

In createplan.c, it is called only from label_sort_with_costsize().
That in turn is called from create_merge_append_plan(), passing the
tuple limit from the path being converted to a plan, which seems
unimpeachable. It's also called from make_unique_plan() and
create_mergejoin_plan() with -1, which seems OK since in neither case
do we know how many input rows we need to read.

In planner.c, it's called from plan_cluster_use_sort() with -1;
CLUSTER has to read the whole input, so that's fine.

In prepunion.c, it's called from choose_hashed_setop() with -1. I
think set operations also need to read the whole input.

In pathnode.c, it's called from create_merge_append_path,
create_unique_path, create_gather_merge_path,
create_groupingsets_path, and create_sort_path.
create_merge_append_path passes any limit applicable to the subpath.
create_unique_path passes -1. create_gather_merge_path also passes
-1, which as Jeff also pointed out seems possibly wrong.
create_sort_path also passes -1, and so does create_groupingsets_path.

I went through the callers to create_sort_path and the only one that
looks like it can pass a limit is the one you and Jeff already
identified. So I think the question is just whether
create_gather_merge_path needs a similar fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I went through the callers to create_sort_path and the only one that
> looks like it can pass a limit is the one you and Jeff already
> identified. So I think the question is just whether
> create_gather_merge_path needs a similar fix.

I might be missing something, but it looks like there are no cases
where we have a limit_tuples value we could use AND we're relying on
create_gather_merge_path's own ability to create sort paths. So I
suspect there is no reason to change create_gather_merge_path itself
to deal with tuple limits. I looked at each of its callers:

1. create_ordered_paths is the case the patch posted earlier covers:
it has a useful limit_tuples value, but it creates the sort path
itself first, so there is no need for create_gather_merge_path to be
aware of it.

2. create_grouping_paths doesn't have a limit_tuples value because
grouping always inhibits limits.

3. generate_gather_paths is in turn called by:

3.1. standard_join_search can't use limits (at least in general)
since it's dealing with a join.

3.2. GEQO's merge_clump is also about joins, so ditto.

3.3. set_rel_pathlist will consider only pathkeys from existing index
scans that set_plain_rel_pathlist found, not creating new pathkeys by
sorting.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 5:24 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I went through the callers to create_sort_path and the only one that
>> looks like it can pass a limit is the one you and Jeff already
>> identified. So I think the question is just whether
>> create_gather_merge_path needs a similar fix.
>
> I might be missing something, but it looks like there are no cases
> where we have a limit_tuples value we could use AND we're relying on
> create_gather_merge_path's own ability to create sort paths. So I
> suspect there is no reason to change create_gather_merge_path itself
> to deal with tuple limits.
>

Exactly. I was about to post the same; my analysis results are the
same as yours. I think this raises the question: do we really need
cost_sort at that place, and if so, for which case?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Dec 19, 2017 at 6:54 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, Dec 18, 2017 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I went through the callers to create_sort_path and the only one that
>> looks like it can pass a limit is the one you and Jeff already
>> identified. So I think the question is just whether
>> create_gather_merge_path needs a similar fix.
>
> I might be missing something, but it looks like there are no cases
> where we have a limit_tuples value we could use AND we're relying on
> create_gather_merge_path's own ability to create sort paths. So I
> suspect there is no reason to change create_gather_merge_path itself
> to deal with tuple limits. I looked at each of its callers:
>
> 1. create_ordered_paths is the case the patch posted earlier covers:
> it has a useful limit_tuples value but it creates the sort path itself
> first, so there is no need for create_gather_merge_path to be aware of
> it.
>
> 2. create_grouping_paths doesn't have a limit_tuples value because
> grouping always inhibits limits.
>
> 3. generate_gather_paths is in turn called by:
>
> 3.1. standard_join_search can't use limits (at least in general) since
> it's dealing with a join.
>
> 3.2. GEQO's merge_clump is also about joins, so ditto.
>
> 3.3. set_rel_pathlist will consider only pathkeys from existing index
> scans that set_plain_rel_pathlist found, not creating new pathkeys by
> sorting.

Well, it might be good future-proofing, but at least it's good to know
that it's not an active bug.

I pushed your earlier fix, which turns out to be a revert of commit
dc02c7bca4dccf7de278cdc6b3325a829e75b252.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company