Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables

From	Ashutosh Bapat
Subject	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables
Date	21 September 2017, 16:07:25
Msg-id	CAFjFpRee-0oj4Fq1RjZ78fiwhfSUFAvMa05WtAUtUUCjXOpH9A@mail.gmail.com
In reply to	Re: [HACKERS] Partition-wise join for join between (declaratively)partitioned tables (Rafia Sabih <rafia.sabih@enterprisedb.com>)
List	pgsql-hackers

On Mon, Sep 18, 2017 at 10:18 AM, Rafia Sabih
<rafia.sabih@enterprisedb.com> wrote:
>>
>
>  Limit  (cost=83341943.28..83341943.35 rows=1 width=92) (actual
> time=1556989.996..1556989.997 rows=1 loops=1)
>    ->  Finalize GroupAggregate  (cost=83341943.28..83342723.24
> rows=10064 width=92) (actual time=1556989.994..1556989.994 rows=1
> loops=1)
>          Group Key: n1.n_name, n2.n_name, (date_part('year'::text,
> (lineitem_001.l_shipdate)::timestamp without time zone))
>          ->  Sort  (cost=83341943.28..83342043.92 rows=40256 width=92)
> (actual time=1556989.910..1556989.911 rows=6 loops=1)
>                Sort Key: n1.n_name, n2.n_name,
> (date_part('year'::text, (lineitem_001.l_shipdate)::timestamp without
> time zone))
>                Sort Method: quicksort  Memory: 27kB
>                ->  Gather  (cost=83326804.81..83338864.31 rows=40256
> width=92) (actual time=1550598.855..1556989.760 rows=20 loops=1)
>                      Workers Planned: 4
>                      Workers Launched: 4
>
> AFAICU the node above sort is group-aggregate and then there is limit,
> and the number of rows for sort node in explain analyse is returned
> number of rows. So, what is happening here is once one group is
> completed it is aggregated and fetched by limit, now there is no need
> for sort to return any more rows and hence the result.

Thanks for your explanation. That makes sense. I forgot about the LIMIT
node on top.

I debugged the plans today and performed some experiments. Here are my
observations:

The join order differs with and without partition-wise join. Without
partition-wise join it is
(lineitem, (supplier, nation1)), (orders, (customer, nation2)). The
join (lineitem, (supplier, nation1)) is executed by one gather node
and (orders, (customer, nation2)) is executed by another. Thus the plan
has two gather nodes, which feed the topmost join.
With partition-wise join the join order is ((lineitem, orders),
(supplier, nation1)), (customer, nation2). The join (lineitem, orders)
uses partition-wise join. This plan executes the whole join tree along
with partial group aggregation under a gather merge.

The rows estimated for various nodes under Gather/GatherMerge are
different from the actual rows e.g.
->  Hash Join  (cost=113164.47..61031454.40 rows=10789501 width=46)
(actual time=3379.931..731987.943 rows=8744357 loops=5) (in
non-partition-wise join plan) OR
->  Append  (cost=179532.36..80681785.95 rows=134868761 width=24)
(actual time=9437.573..1360219.567 rows=109372134 loops=5) (in
partition-wise join plan).
I first thought that this was a real estimation error and spent some
time investigating it. But I eventually realised that this is simply
how a parallel query plan is reported, when I saw that the Gather node
estimated the correct number of rows even though the nodes under it
showed this difference. Here's the explanation. There are 4
parallel workers, so the leader's contribution is estimated to
be 0 by get_parallel_divisor(). So these estimates are per worker, and
the total estimated rows produced by any of the nodes is 4 times
the reported number. But when the query actually runs, the leader also
participates, so the number of loops = 5 and the actual rows reported
are (total actual rows) / (number of loops, i.e. number of backends
that executed the query). The total estimated rows and total actual
rows are roughly equal. So there's no real estimation error, as I had
thought earlier. Maybe we want to make EXPLAIN (ANALYZE) output easier
to understand.
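
The arithmetic above can be checked with a small sketch. It assumes the
rule in get_parallel_divisor() is "leader contribution = 1.0 - 0.3 *
workers, counted only when positive", which is how I understand the
PostgreSQL 10 planner; the Python function names are illustrative and
are not PostgreSQL APIs. The row counts are the ones from the two plan
nodes quoted above.

```python
def parallel_divisor(workers: int) -> float:
    """Assumed model of the planner's get_parallel_divisor() (PostgreSQL 10)."""
    divisor = float(workers)
    # Assumption: the leader is estimated to spend 30% of its time per worker.
    leader_contribution = 1.0 - 0.3 * workers
    if leader_contribution > 0:
        divisor += leader_contribution
    return divisor


def totals(est_rows_per_worker: int, actual_rows_per_loop: int,
           workers: int, loops: int) -> tuple[float, int]:
    """Scale the per-process figures in EXPLAIN back up to whole-node totals."""
    est_total = est_rows_per_worker * parallel_divisor(workers)
    actual_total = actual_rows_per_loop * loops
    return est_total, actual_total


if __name__ == "__main__":
    nodes = [
        ("Hash Join", 10789501, 8744357),
        ("Append", 134868761, 109372134),
    ]
    for name, est, actual in nodes:
        est_total, actual_total = totals(est, actual, workers=4, loops=5)
        ratio = actual_total / est_total
        print(f"{name}: estimated {est_total:.0f}, "
              f"actual {actual_total}, ratio {ratio:.3f}")
```

With 4 workers the assumed divisor is 4, so both nodes come out within
about 2% once the per-process numbers are scaled back to totals.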

When I tried the same query on my laptop with scale 20, I found that the
leader really contributes as much as the other workers. So the
partial paths were created based on an estimate which was 20%
off. The cost difference between the partition-wise join plan and the
non-partition-wise join plan is hardly 1.5%. So it's possible that if
we correct this estimation error, the partition-wise join plan won't be
chosen because it will have a higher cost. Remember there are two
gather nodes in the non-partition-wise join plan, while the
partition-wise join plan has one. So the non-partition-wise join path
gets the 20% decreased estimates twice and partition-wise join gets it
only once.
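
The 20% figure follows from dividing the work by 4 when 5 processes
actually share it. A minimal sketch of that arithmetic, assuming the
leader does exactly as much work as each worker; total_rows is a made-up
number used only for illustration:

```python
workers = 4
total_rows = 1_000_000  # hypothetical total; any value gives the same ratio

# The planner divides the work among the 4 workers only.
estimated_per_process = total_rows / workers
# At run time the leader takes an equal share, so 5 processes split the rows.
actual_per_process = total_rows / (workers + 1)

shortfall = (estimated_per_process - actual_per_process) / estimated_per_process
print(f"actual rows per process are {shortfall:.0%} below the estimate")
```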

The explain (analyze, verbose) of a parallel node looks like
->  Parallel Seq Scan on public.lineitem_002  (cost=0.00..168752.99
rows=573464 width=24) (actual time=1.395..3075.485 rows=454464
loops=5)
      Filter: ((lineitem_002.l_shipdate >= '1995-01-01'::date) AND
(lineitem_002.l_shipdate <= '1996-12-31'::date))
      Rows Removed by Filter: 1045065
      Worker 0: actual time=3.358..3131.426 rows=458267 loops=1
      Worker 1: actual time=0.860..3146.282 rows=447231 loops=1
      Worker 2: actual time=1.317..3123.646 rows=489960 loops=1
      Worker 3: actual time=0.927..3130.497 rows=475545 loops=1
If we sum the rows returned by each worker they don't add up to
(actual rows) * (actual loops). So I assumed that the unreported
number of rows were processed by the leader. Is that right?
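
Under that assumption, the leader's share can be derived from the
numbers in the plan above. This is only a sketch: the rows value that
EXPLAIN prints is a rounded per-loop average, so the total (and hence
the leader's count) may be off by a few rows.

```python
avg_rows_per_loop = 454464   # "rows" on the Parallel Seq Scan line
loops = 5                    # 4 workers plus the leader
worker_rows = [458267, 447231, 489960, 475545]  # Worker 0..3

# Total rows emitted by the node across all processes (approximate).
total_rows = avg_rows_per_loop * loops
# Whatever the workers did not report is attributed to the leader.
leader_rows = total_rows - sum(worker_rows)

print(f"total rows:  {total_rows}")
print(f"worker rows: {sum(worker_rows)}")
print(f"leader rows: {leader_rows}")
```

The leader's count comes out in the same range as the individual
workers, which is consistent with the leader doing a comparable share
of this scan.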

I might be misunderstanding how parallel query works, but here's my
analysis so far. I will continue investigating further.

Any clues would be helpful.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
