Re: parallel joins, and better parallel explain
From    | Robert Haas
Subject | Re: parallel joins, and better parallel explain
Date    |
Msg-id  | CA+Tgmob+e67gP2i833ZrocK966SSO7z1er9BeQKWj7H8yO97Qw@mail.gmail.com
In reply to | Re: parallel joins, and better parallel explain (Dilip Kumar <dilipbalaut@gmail.com>)
Responses | Re: parallel joins, and better parallel explain (Dilip Kumar <dilipbalaut@gmail.com>)
List    | pgsql-hackers
On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah right, after applying all three patches this problem is fixed; now
> parallel hash join is faster than normal hash join.
>
> I have tested one more case which Amit mentioned, and I can see that in
> that case the parallel plan (parallel degree >= 3) is still slow. In the
> normal case it selects "Hash Join", but with parallel degree >= 3 it
> selects a parallel "Nested Loop Join", which is making it costlier.

While investigating this problem, I discovered that I can produce a
regression even on unpatched master:

rhaas=# set max_parallel_degree = 0;
SET
rhaas=# explain select sum(1) from t1;
                             QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=1553572.00..1553572.01 rows=1 width=0)
   ->  Seq Scan on t1  (cost=0.00..1528572.00 rows=10000000 width=0)
(2 rows)

rhaas=# set max_parallel_degree = 3;
SET
rhaas=# explain select sum(1) from t1;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=1462734.86..1462734.87 rows=1 width=0)
   ->  Gather  (cost=1000.00..1437734.86 rows=10000000 width=0)
         Number of Workers: 3
         ->  Parallel Seq Scan on t1  (cost=0.00..436734.86 rows=10000000 width=0)
(4 rows)

The problem here is that the planner imagines that the sequential scan
will parallelize perfectly, which is not the case. A Gather node is ten
times as expensive per tuple as a sequential scan, but a sequential scan
doesn't need to pay a per-page cost, so if you crank the number of
workers up high enough, the cost per tuple appears to drop until it
eventually gets low enough that paying the cost of a Gather node looks
worthwhile. I tweaked cost_seqscan() so that it spreads out the CPU cost
among all of the workers but assumes the disk cost has to be paid
regardless, and that fixes this problem. It doesn't fix your example,
though.
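The shape of that costing tweak is roughly as follows (an illustrative
Python sketch, not the actual cost_seqscan() code; the constants are
PostgreSQL's default seq_page_cost and cpu_tuple_cost, and counting the
leader as one extra full participant is my simplifying assumption):

```python
# Sketch of the tweak: divide the CPU cost among the participating
# processes, but charge the disk (per-page) cost in full, since adding
# workers doesn't reduce the total I/O the scan performs.
SEQ_PAGE_COST = 1.0     # PostgreSQL's default seq_page_cost
CPU_TUPLE_COST = 0.01   # PostgreSQL's default cpu_tuple_cost

def parallel_seqscan_run_cost(pages, tuples, workers):
    disk_cost = SEQ_PAGE_COST * pages            # paid regardless of workers
    cpu_cost = CPU_TUPLE_COST * tuples           # spread across processes
    return disk_cost + cpu_cost / (workers + 1)  # +1: leader participates (assumption)

# Before the tweak, dividing the whole cost made the scan look
# arbitrarily cheap as workers increased; with it, the disk cost puts a
# floor under the estimate no matter how many workers you add.
```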
Even with the costing changes mentioned above, the planner still thinks
a nested loop over two seqscans has something to recommend it:

rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=161755.97..161755.98 rows=1 width=0) (actual time=41164.506..41164.507 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=1000.00..161755.97 rows=1 width=0) (actual time=0.436..41164.388 rows=101 loops=1)
         Number of Workers: 3
         ->  Nested Loop  (cost=0.00..160755.87 rows=1 width=0) (actual time=329.227..12123.414 rows=25 loops=4)
               Join Filter: (t1.c1 = t2.c1)
               Rows Removed by Join Filter: 75749975
               Worker 0: actual time=439.924..439.924 rows=0 loops=1
               Worker 1: actual time=440.776..440.776 rows=0 loops=1
               Worker 2: actual time=436.100..6449.041 rows=15 loops=1
               ->  Parallel Seq Scan on public.t1  (cost=0.00..102442.10 rows=0 width=4) (actual time=220.185..220.228 rows=25 loops=4)
                     Output: t1.c1, t1.c2
                     Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                     Rows Removed by Filter: 2499975
                     Worker 0: actual time=439.922..439.922 rows=0 loops=1
                     Worker 1: actual time=440.773..440.773 rows=0 loops=1
                     Worker 2: actual time=0.016..0.055 rows=15 loops=1
               ->  Seq Scan on public.t2  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.007..235.143 rows=3000000 loops=101)
                     Output: t2.c1, t2.c2
                     Worker 2: actual time=0.012..215.711 rows=3000000 loops=15
 Planning time: 0.150 ms
 Execution time: 41164.597 ms

But this is not entirely the fault of the parallel query code. If you
force a seqscan-over-seqscan plan in the non-parallel case, it estimates
the join cost as 287772.00, only slightly more than the 261522.02 cost
units it thinks a non-parallel hash join will cost. In fact, however,
the non-parallel hash join runs in 1.2 seconds and the non-parallel
nested loop takes 46 seconds.
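The reason a small cardinality misestimate blows up like this is that a
nested loop rescans its inner side once per outer row, so its real cost
scales linearly with the outer cardinality. A back-of-the-envelope
sketch (illustrative Python, using the inner scan's total cost from the
plan above and ignoring per-tuple join overhead; not the planner's full
cost formula):

```python
# A nested loop runs its inner scan once per outer row, so its cost is
# roughly outer_rows * inner_scan_cost.  46217 is the Seq Scan on t2
# cost from the EXPLAIN output above.
INNER_SCAN_COST = 46217.0

def nestloop_cost(outer_rows):
    return outer_rows * INNER_SCAN_COST

estimated = nestloop_cost(1)    # planner's estimate: 1 matching outer row
actual = nestloop_cost(100)     # reality: ~100 rows satisfy the BETWEEN
# The true cost is 100x the estimate, which is how a plan priced as only
# slightly more expensive can end up running ~40x longer.
```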
So the first problem here is that a plan that the query planner thinks
is only 10% more expensive actually runs for almost 40 times longer. If
the planner had accurately estimated the real cost of the nested loop,
this plan wouldn't have been chosen. If you set enable_nestloop=false,
then you get this plan:

rhaas=# set enable_nestloop=false;
SET
rhaas=# set max_parallel_degree=3;
SET
rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=160909.22..160909.23 rows=1 width=0) (actual time=647.010..647.011 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=103442.21..160909.22 rows=1 width=0) (actual time=234.397..646.985 rows=101 loops=1)
         Hash Cond: (t2.c1 = t1.c1)
         ->  Seq Scan on public.t2  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.033..197.595 rows=3000000 loops=1)
               Output: t2.c1, t2.c2
         ->  Hash  (cost=103442.20..103442.20 rows=1 width=4) (actual time=234.235..234.235 rows=101 loops=1)
               Output: t1.c1
               Buckets: 1024  Batches: 1  Memory Usage: 12kB
               ->  Gather  (cost=1000.00..103442.20 rows=1 width=4) (actual time=0.289..234.199 rows=101 loops=1)
                     Output: t1.c1
                     Number of Workers: 3
                     ->  Parallel Seq Scan on public.t1  (cost=0.00..102442.10 rows=0 width=4) (actual time=171.667..230.080 rows=25 loops=4)
                           Output: t1.c1
                           Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                           Rows Removed by Filter: 2499975
                           Worker 0: actual time=228.628..228.628 rows=0 loops=1
                           Worker 1: actual time=228.432..228.432 rows=0 loops=1
                           Worker 2: actual time=229.566..229.566 rows=0 loops=1
 Planning time: 0.160 ms
 Execution time: 647.133 ms
(21 rows)

And that's a good plan. The parallel nested loop case also suffers from
the fact that workers 0 and 1 don't happen to find any of the
interesting rows in t1 at all, and worker 2 only finds 15 of them.
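The skew happens because the matching rows (c1 BETWEEN 100 AND 200) are
physically clustered in one small region of t1, so whichever process
happens to scan that part of the table gets essentially all of the
matches. A toy simulation of the effect (illustrative Python only; this
is not PostgreSQL's actual block-claiming logic, and the chunked
round-robin assignment is my assumption for the sketch):

```python
# Deal the table out to n_procs processes in fixed-size chunks of rows,
# round-robin, and count how many rows in the matching range [match_lo,
# match_hi] each process ends up with.  Clustered matches land almost
# entirely on one process.
def split_matches(total_rows, match_lo, match_hi, n_procs, chunk):
    counts = [0] * n_procs
    for start in range(0, total_rows, chunk):
        proc = (start // chunk) % n_procs
        lo, hi = start, min(start + chunk, total_rows)
        overlap = max(0, min(hi, match_hi + 1) - max(lo, match_lo))
        counts[proc] += overlap
    return counts

# 10M rows, matches clustered at rows 100..200, 4 processes, big chunks:
# one process gets all 101 matching rows, the other three get none --
# and that one process must then drive every inner-side rescan.
```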
The leader finds the other 85 and thus has to run most of the iterations
of the scan on t2 itself. If the work were divided equally, the parallel
nested loop would probably run significantly faster, although it would
still be ten times slower than the non-parallel hash join.

In the long term, I think the way to fix the uneven work distribution
that happens here is to construct the hash table in parallel, as already
discussed with Simon upthread. Then we could have a Gather node on top
of a Hash Join both inputs to which are Parallel Seq Scans, and now
there's basically no risk of a skewed work distribution. While that
would be nice to have, I think the big thing to focus on here is how
inaccurate the nested loop costing is - as already mentioned, it thinks
the non-parallel nested loop is 10% slower than the hash join when it's
really forty times slower.

The main reason for that is that ((t1.c1 >= 100) AND (t1.c1 <= 200))
actually matches 100 rows, but the planner expects it to match just one.
In a real table, there would probably be a unique index on t1 (c1), and
that also fixes the problem. If I add that, the non-parallel query runs
in 422 ms (with EXPLAIN ANALYZE, on a debug build) and the parallel
query runs in 125 ms, and the row count estimates are correct, too. Even
if I disable actually using the index, the fact that it fixes the
cardinality estimates causes the query to choose a good (parallel!)
plan.

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments