Re: parallel joins, and better parallel explain

From: Robert Haas
Subject: Re: parallel joins, and better parallel explain
Msg-id: CA+Tgmob+e67gP2i833ZrocK966SSO7z1er9BeQKWj7H8yO97Qw@mail.gmail.com
In reply to: Re: parallel joins, and better parallel explain  (Dilip Kumar <dilipbalaut@gmail.com>)
Responses: Re: parallel joins, and better parallel explain  (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Wed, Dec 23, 2015 at 2:34 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Yeah right, After applying all three patches this problem is fixed, now
> parallel hash join is faster than normal hash join.
>
> I have tested one more case which Amit mentioned, I can see in that case
> parallel plan (parallel degree>= 3) is still slow, In Normal case it selects
> "Hash Join" but in case of parallel worker > 3 it selects Parallel "Nest
> Loop Join" which is making it costlier.

While investigating this problem, I discovered that I can produce a
regression even on unpatched master:

rhaas=# set max_parallel_degree = 0;
SET
rhaas=# explain select sum(1) from t1;
                             QUERY PLAN
---------------------------------------------------------------------
 Aggregate  (cost=1553572.00..1553572.01 rows=1 width=0)
   ->  Seq Scan on t1  (cost=0.00..1528572.00 rows=10000000 width=0)
(2 rows)

rhaas=# set max_parallel_degree = 3;
SET
rhaas=# explain select sum(1) from t1;
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=1462734.86..1462734.87 rows=1 width=0)
   ->  Gather  (cost=1000.00..1437734.86 rows=10000000 width=0)
         Number of Workers: 3
         ->  Parallel Seq Scan on t1  (cost=0.00..436734.86 rows=10000000 width=0)
(4 rows)

The problem here is that the planner imagines that the sequential scan
is going to parallelize perfectly, which is not the case.  A Gather
node is ten times as expensive per tuple as a sequential scan, but a
Gather node doesn't pay a per-page cost, so if you crank the number of
workers up high enough, the estimated cost per tuple of the parallel
scan drops until it eventually gets low enough that paying the cost of
a Gather node looks worthwhile.  I tweaked cost_seqscan() so that it
spreads the CPU cost across all of the workers but assumes the disk
cost has to be paid regardless, and that fixes this problem.
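To illustrate the costing change, here is a deliberately simplified model
(a sketch, not PostgreSQL's actual cost_seqscan(); the constants stand in
for seq_page_cost, cpu_tuple_cost, and parallel_tuple_cost, the page and
tuple counts are chosen to resemble the numbers above, and leader
participation in the scan is ignored):

```python
# Simplified, illustrative seq-scan costing; NOT PostgreSQL's real code.
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
GATHER_TUPLE_COST = 0.1  # Gather charges ~10x per tuple, per the text above

def seqscan_cost_naive(pages, tuples, workers):
    """Broken model: both disk and CPU cost parallelize perfectly."""
    total = pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST
    return total / max(workers, 1)

def seqscan_cost_fixed(pages, tuples, workers):
    """Tweaked model: CPU cost is spread across workers, disk cost is not."""
    return pages * SEQ_PAGE_COST + (tuples * CPU_TUPLE_COST) / max(workers, 1)

def parallel_plan_cost(scan_cost, tuples):
    """A Gather node adds a per-tuple charge but no per-page charge."""
    return scan_cost + tuples * GATHER_TUPLE_COST
```

With pages and tuples resembling t1 above, the naive model makes the
parallel plan look slightly cheaper than the serial scan (as in the
regression shown), while the tweaked model keeps it more expensive.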

It doesn't fix your example, though.  Even with the costing changes
mentioned above, the planner still thinks a nested loop over two
seqscans has something to recommend it:

rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=161755.97..161755.98 rows=1 width=0) (actual time=41164.506..41164.507 rows=1 loops=1)
   Output: count(*)
   ->  Gather  (cost=1000.00..161755.97 rows=1 width=0) (actual time=0.436..41164.388 rows=101 loops=1)
         Number of Workers: 3
         ->  Nested Loop  (cost=0.00..160755.87 rows=1 width=0) (actual time=329.227..12123.414 rows=25 loops=4)
               Join Filter: (t1.c1 = t2.c1)
               Rows Removed by Join Filter: 75749975
               Worker 0: actual time=439.924..439.924 rows=0 loops=1
               Worker 1: actual time=440.776..440.776 rows=0 loops=1
               Worker 2: actual time=436.100..6449.041 rows=15 loops=1
               ->  Parallel Seq Scan on public.t1  (cost=0.00..102442.10 rows=0 width=4) (actual time=220.185..220.228 rows=25 loops=4)
                     Output: t1.c1, t1.c2
                     Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                     Rows Removed by Filter: 2499975
                     Worker 0: actual time=439.922..439.922 rows=0 loops=1
                     Worker 1: actual time=440.773..440.773 rows=0 loops=1
                     Worker 2: actual time=0.016..0.055 rows=15 loops=1
               ->  Seq Scan on public.t2  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.007..235.143 rows=3000000 loops=101)
                     Output: t2.c1, t2.c2
                     Worker 2: actual time=0.012..215.711 rows=3000000 loops=15
 Planning time: 0.150 ms
 Execution time: 41164.597 ms

But this is not entirely the fault of the parallel query code.  If you
force a seqscan-over-seqscan plan in the non-parallel case, it
estimates the join cost as 287772.00, only slightly more than the
261522.02 cost units it thinks a non-parallel hash join will cost.  In
fact, however, the non-parallel hash join runs in 1.2 seconds and the
non-parallel nested loop takes 46 seconds.  So the first problem here
is that a plan that the query planner thinks is only 10% more
expensive actually runs for almost 40 times longer.  If the planner
had accurately estimated the real cost of the nested loop, this plan
wouldn't have been chosen.  If you set enable_nestloop=false, then you
get this plan:

rhaas=# set enable_nestloop=false;
SET
rhaas=# set max_parallel_degree=3;
SET
rhaas=# Explain (Analyze, verbose) SELECT count(*) FROM t1 JOIN t2 ON
t1.c1 = t2.c1 AND t1.c1 BETWEEN 100 AND 200;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=160909.22..160909.23 rows=1 width=0) (actual time=647.010..647.011 rows=1 loops=1)
   Output: count(*)
   ->  Hash Join  (cost=103442.21..160909.22 rows=1 width=0) (actual time=234.397..646.985 rows=101 loops=1)
         Hash Cond: (t2.c1 = t1.c1)
         ->  Seq Scan on public.t2  (cost=0.00..46217.00 rows=3000000 width=4) (actual time=0.033..197.595 rows=3000000 loops=1)
               Output: t2.c1, t2.c2
         ->  Hash  (cost=103442.20..103442.20 rows=1 width=4) (actual time=234.235..234.235 rows=101 loops=1)
               Output: t1.c1
               Buckets: 1024  Batches: 1  Memory Usage: 12kB
               ->  Gather  (cost=1000.00..103442.20 rows=1 width=4) (actual time=0.289..234.199 rows=101 loops=1)
                     Output: t1.c1
                     Number of Workers: 3
                     ->  Parallel Seq Scan on public.t1  (cost=0.00..102442.10 rows=0 width=4) (actual time=171.667..230.080 rows=25 loops=4)
                           Output: t1.c1
                           Filter: ((t1.c1 >= 100) AND (t1.c1 <= 200))
                           Rows Removed by Filter: 2499975
                           Worker 0: actual time=228.628..228.628 rows=0 loops=1
                           Worker 1: actual time=228.432..228.432 rows=0 loops=1
                           Worker 2: actual time=229.566..229.566 rows=0 loops=1
 Planning time: 0.160 ms
 Execution time: 647.133 ms
(21 rows)
(21 rows)

And that's a good plan.

The parallel nested loop case also suffers from the fact that workers
0 and 1 don't happen to find any of the interesting rows in t1 at all,
and worker 2 only finds 15 of them.  The leader finds the other 85 and
thus has to run most of the iterations of the scan on t2 itself.  If
the work were divided equally, the parallel nested loop would probably
run significantly faster, although it would still be ten times slower
than the non-parallel hash join.  In the long term, I think the way to
fix the uneven work distribution that happens here is to construct the
hash table in parallel, as already discussed with Simon upthread.
Then we could have a Gather node on top of a Hash Join whose inputs
are both Parallel Seq Scans, and there would be essentially no risk of
a skewed work distribution.
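A rough back-of-envelope model of that skew (a sketch with numbers taken
loosely from the EXPLAIN output above, not a measurement): each matching
t1 row drives one full scan of t2, at roughly 235 ms per loop, and the
slowest participant gates the overall runtime.

```python
# Approximate per-loop time for the Seq Scan on t2 in the plan above.
INNER_SCAN_MS = 235.0

def parallel_runtime_ms(rows_per_participant):
    # Runtime is dominated by whichever participant rescans t2 the most.
    return max(rows_per_participant) * INNER_SCAN_MS

skewed = parallel_runtime_ms([85, 0, 0, 15])      # leader and workers 0-2 above
balanced = parallel_runtime_ms([26, 25, 25, 25])  # hypothetical even split
```

The skewed split is gated by the leader's 85 rescans of t2, roughly three
times more than any participant would need under an even split, which
matches the intuition that the parallel nested loop would run
significantly faster with an even distribution.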

While that would be nice to have, I think the big thing to focus on
here is how inaccurate the nested loop costing is: as already
mentioned, the planner thinks the non-parallel nested loop is only
about 10% more expensive than the hash join when it's really forty
times slower.  The main reason
for that is that ((t1.c1 >= 100) AND (t1.c1 <= 200)) actually matches
100 rows, but the planner expects it to match just one.  In a real
table, there would probably be a unique index on t1 (c1), and that
also fixes the problem.  If I add that, the non-parallel query runs in
422 ms (with EXPLAIN ANALYZE, on a debug build) and the parallel query
runs in 125 ms, and the row count estimates are correct, too.  Even if
I disable actually using the index, the fact that it fixes the
cardinality estimates causes the query to choose a good (parallel!)
plan.
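The effect of that cardinality error can be sketched with toy arithmetic
(a simplification of the planner's real formula: costs are taken loosely
from the plans above, and per-tuple join-filter charges and startup costs
are ignored):

```python
# A nested loop rescans its inner side once per outer row, so its real
# cost scales with the actual outer row count, not the estimated one.
T2_SCAN_COST = 46217.0   # Seq Scan on public.t2, from the plan above
T1_SCAN_COST = 102442.0  # scan of t1, from the plan above

def nestloop_cost(outer_rows, outer_cost, inner_rescan_cost):
    return outer_cost + outer_rows * inner_rescan_cost

estimated = nestloop_cost(1, T1_SCAN_COST, T2_SCAN_COST)    # planner: 1 row
actual = nestloop_cost(100, T1_SCAN_COST, T2_SCAN_COST)     # reality: ~100 rows
```

Even in this toy model, a 100x error in the outer row count inflates the
true cost by more than 30x, which is why a plan that looked only 10% more
expensive could run roughly forty times longer.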

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
