Re: parallel joins, and better parallel explain

Поиск
Список
Период
Сортировка
От Amit Kapila
Тема Re: parallel joins, and better parallel explain
Дата
Msg-id CAA4eK1JNGK1cZvBGaLvetZsPmFDTCkj3TLc13NFrNWQL4Oxhkw@mail.gmail.com
обсуждение исходный текст
Ответ на Re: parallel joins, and better parallel explain  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: parallel joins, and better parallel explain  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-hackers
On Thu, Dec 3, 2015 at 3:25 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Dec 1, 2015 at 7:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > It would be better if we can split this patch into multiple patches like
> > Explain related changes, Append pushdown related changes, Join
> > Push down related changes.  You can choose to push the patches as
> > you prefer, but splitting can certainly help in review/verification of the
> > code.
>
> I don't think it really makes sense to split the append push-down
> changes from the join push-down changes; those share a great deal of
> code.

Not an issue.  I have started looking into parallel join patch and below are
few findings:

1.
There are few compilation errors in the patch. It seems patch needs
to adapt the latest changes done in commit-edca44b1.

1>src/backend/optimizer/path/joinpath.c(420): error C2039: 'extra_lateral_rels' : is not a member of 
'JoinPathExtraData'
1>          E:\WorkSpace\PostgreSQL\master\postgresql\src\include\nodes/relation.h(1727) : see declaration of 
'JoinPathExtraData'
..
..

2.
Why consider_parallel_nestloop() doesn't consider materializing inner
relation as we do in match_unsorted_outer()?

I have generated a test as below where non-parallel Nestloop join is
faster than parallel Nestloop join.  I am using 'hydra' for testing this
patch.

CREATE TABLE t1(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 10000000) g;

CREATE TABLE t2(c1, c2) AS SELECT g, repeat('x', 5) FROM
generate_series(1, 2000000) g;

Analyze t1;
Analyze t2;

Restart Server
Connect with psql

set enable_hashjoin=off;
set enable_mergejoin=off;
postgres=# Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100000 AND 
100100;
                                                        QUERY PLAN                                            
             
-------------------------------------------------------------------------------------------------------------
--------------
 Aggregate  (cost=3294864.21..3294864.21 rows=1 width=0) (actual time=42614.102..42614.102 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..3294864.16 rows=20 width=0) (actual time=4123.463..42614.084 rows=101 
loops=1)
         Join Filter: (t1.c1 = t2.c1)
         Rows Removed by Join Filter: 201999899
         ->  Seq Scan on t2  (cost=0.00..30811.00 rows=2000000 width=4) (actual time=0.027..284.979 
rows=2000000 loops=1)
         ->  Materialize  (cost=0.00..204053.41 rows=102 width=4) (actual time=0.000..0.008 rows=101 
loops=2000000)
               ->  Seq Scan on t1  (cost=0.00..204052.90 rows=102 width=4) (actual time=13.920..2024.684 
rows=101 loops=1)
                     Filter: ((c1 >= 100000) AND (c1 <= 100100))
                     Rows Removed by Filter: 9999899
 Planning time: 0.085 ms
 Execution time: 42614.135 ms
 
I have repeated the above statement 3 times and the above result is
median of 3 runs.

Restart Server
Connect with psql

set enable_hashjoin=off;
set enable_mergejoin=off;

set max_parallel_degree=4;

Explain Analyze SELECT count(*) FROM t1 JOIN t2 ON t1.c1 = t2.c1 AND t1.c1 BETWEEN 100000 AND 100100;
                                                            QUERY PLAN                                        
                    
-------------------------------------------------------------------------------------------------------------
---------------------
 Aggregate  (cost=1311396.47..1311396.48 rows=1 width=0) (actual time=45736.973..45736.973 rows=1 loops=1)
   ->  Gather  (cost=1000.00..1311396.42 rows=20 width=0) (actual time=709.083..45736.925 rows=101 loops=1)
         Number of Workers: 4
         ->  Nested Loop  (cost=0.00..1310394.42 rows=20 width=0) (actual time=436.460..11240.321 rows=20 
loops=5)
               Join Filter: (t1.c1 = t2.c1)
               Rows Removed by Join Filter: 40399980
               ->  Parallel Seq Scan on t1  (cost=0.00..45345.09 rows=23 width=4) (actual 
time=425.178..425.232 rows=20 loops=5)
                     Filter: ((c1 >= 100000) AND (c1 <= 100100))
                     Rows Removed by Filter: 1999980
               ->  Seq Scan on t2  (cost=0.00..30811.00 rows=2000000 width=4) (actual time=0.011..270.986 
rows=2000000 loops=101)
 Planning time: 0.115 ms
 Execution time: 45737.863 ms

I have repeated the above statement 3 times and the above result is
median of 3 runs.

Now here the point to observe is that non-parallel case uses both less
Execution time and Planning time to complete the statement.  There
is a considerable increase in planning time without any benefit in
execution.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Michael Paquier
Дата:
Сообщение: Re: psql tab completion bug for ALL IN TABLESPACE
Следующее
От: Petr Jelinek
Дата:
Сообщение: Re: WIP: Rework access method interface