Обсуждение: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

Поиск
Список
Период
Сортировка

EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Maciek Sakrejda
Дата:
Hello,

I had some questions about the behavior of some accounting in parallel
EXPLAIN plans. Take the following plan:

```
Gather  (cost=1000.43..750173.74 rows=2 width=235) (actual
time=1665.122..1665.122 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=27683 read=239573
  I/O Timings: read=687.358
  ->  Nested Loop  (cost=0.43..749173.54 rows=1 width=235) (actual
time=1660.095..1660.095 rows=0 loops=3)
        Inner Unique: true
        Buffers: shared hit=77579 read=657847
        I/O Timings: read=2090.189
        Worker 0: actual time=1657.443..1657.443 rows=0 loops=1
          Buffers: shared hit=23645 read=208365
          I/O Timings: read=702.270
        Worker 1: actual time=1658.277..1658.277 rows=0 loops=1
          Buffers: shared hit=26251 read=209909
          I/O Timings: read=700.560
        ->  Parallel Seq Scan on public.schema_indices
(cost=0.00..748877.88 rows=35 width=235) (actual
time=136.744..1659.629 rows=32 loops=3)
              Filter: ((schema_indices.invalidated_at_snapshot_id IS
NULL) AND (NOT schema_indices.is_valid))
              Rows Removed by Filter: 703421
              Buffers: shared hit=77193 read=657847
              I/O Timings: read=2090.189
              Worker 0: actual time=69.248..1656.950 rows=32 loops=1
                Buffers: shared hit=23516 read=208365
                I/O Timings: read=702.270
              Worker 1: actual time=260.188..1657.875 rows=27 loops=1
                Buffers: shared hit=26140 read=209909
                I/O Timings: read=700.560
        ->  Index Scan using schema_tables_pkey on
public.schema_tables  (cost=0.43..8.45 rows=1 width=8) (actual
time=0.011..0.011 rows=0 loops=95)
              Index Cond: (schema_tables.id = schema_indices.table_id)
              Filter: (schema_tables.database_id = 123)
              Rows Removed by Filter: 1
              Buffers: shared hit=386
              Worker 0: actual time=0.011..0.011 rows=0 loops=32
                Buffers: shared hit=129
              Worker 1: actual time=0.011..0.011 rows=0 loops=27
                Buffers: shared hit=111
Planning Time: 0.429 ms
Execution Time: 1667.373 ms
```

The Nested Loop here aggregates data for metrics like `buffers read`
from its workers, and to calculate a metric like `buffers read` for
the parallel leader, we can subtract the values recorded in each
individual worker. This happens in the Seq Scan and Index Scan
children, as well. However, the Gather node appears to only include
values from its direct parallel leader child (excluding that child's
workers).

This leads to the odd situation that the Gather has lower values for
some of these metrics than its child (because the child node reporting
includes the worker metrics) even though the values are supposed to be
cumulative. This is even more surprising for something like I/O
timing, where the Gather has a lower `read` value than one of the
Nested Loop workers, which doesn't make sense in terms of wall-clock
time.

Is this behavior intentional? If so, is there an explanation of the
reasoning or the trade-offs involved? Would it not make sense to
propagate those cumulative parallel costs up the tree all the way to
the root, instead of only using the parallel leader metrics under
Gather?

Thanks,
Maciek



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Amit Kapila
Дата:
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
> ```
> Gather  (cost=1000.43..750173.74 rows=2 width=235) (actual
> time=1665.122..1665.122 rows=0 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   Buffers: shared hit=27683 read=239573
>   I/O Timings: read=687.358
>   ->  Nested Loop  (cost=0.43..749173.54 rows=1 width=235) (actual
> time=1660.095..1660.095 rows=0 loops=3)
>         Inner Unique: true
>         Buffers: shared hit=77579 read=657847
>         I/O Timings: read=2090.189
..
> ```
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>
> This leads to the odd situation that the Gather has lower values for
> some of these metrics than its child (because the child node reporting
> includes the worker metrics) even though the values are supposed to be
> cumulative.
>

I don't think this is an odd situation because in this case, child
nodes like "Nested Loop" and "Parallel Seq Scan" has a value of
'loops' as 3.  So, to get the correct stats at those nodes, you need
to divide it by 3 whereas, at Gather node, the value of 'loops' is 1.
If you want to verify this thing then try with a plan where loops
should be 1 for child nodes as well, you should get the same value at
both Gather and Parallel Seq Scan nodes.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Maciek Sakrejda
Дата:
On Tue, Jun 23, 2020 at 2:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't think this is an odd situation because in this case, child
> nodes like "Nested Loop" and "Parallel Seq Scan" has a value of
> 'loops' as 3.  So, to get the correct stats at those nodes, you need
> to divide it by 3 whereas, at Gather node, the value of 'loops' is 1.
> If you want to verify this thing then try with a plan where loops
> should be 1 for child nodes as well, you should get the same value at
> both Gather and Parallel Seq Scan nodes.

Thanks for the response, but I still don't follow. I had assumed that
loops=3 was just from loops=1 for the parallel leader plus loops=1 for
each worker--is that not right? I don't see any other reason for
looping over the NL node itself in this plan. The Gather itself
doesn't do any real looping, right?

But even so, the documentation [1] states:

>In some query plans, it is possible for a subplan node to be executed more than once. For example, the inner index
scanwill be executed once per outer row in the above nested-loop plan. In such cases, the loops value reports the total
numberof executions of the node, and the actual time and rows values shown are averages per-execution. This is done to
makethe numbers comparable with the way that the cost estimates are shown. Multiply by the loops value to get the total
timeactually spent in the node. 

So we should be seeing an average, not a sum, right?

[1]: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Amit Kapila
Дата:
On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 2:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't think this is an odd situation because in this case, child
> > nodes like "Nested Loop" and "Parallel Seq Scan" has a value of
> > 'loops' as 3.  So, to get the correct stats at those nodes, you need
> > to divide it by 3 whereas, at Gather node, the value of 'loops' is 1.
> > If you want to verify this thing then try with a plan where loops
> > should be 1 for child nodes as well, you should get the same value at
> > both Gather and Parallel Seq Scan nodes.
>
> Thanks for the response, but I still don't follow. I had assumed that
> loops=3 was just from loops=1 for the parallel leader plus loops=1 for
> each worker--is that not right?
>

No, I don't think so.

> I don't see any other reason for
> looping over the NL node itself in this plan. The Gather itself
> doesn't do any real looping, right?

It is right that Gather doesn't do looping but Parallel Seq Scan node does so.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Maciek Sakrejda
Дата:
On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't see any other reason for
> > looping over the NL node itself in this plan. The Gather itself
> > doesn't do any real looping, right?
>
> It is right that Gather doesn't do looping but Parallel Seq Scan node does so.

Sorry, I still don't follow. How does a Parallel Seq Scan do looping?
I looked at the parallel plan docs but I don't see looping mentioned
anywhere[1]. Also, is looping not normally indicated on children,
rather than on the node doing the looping? E.g., with a standard
Nested Loop, the outer child will have loops=1 and the inner child
will have loops equal to the row count produced by the outer child
(and the Nested Loop itself will have loops=1 unless it also is being
looped over by a parent node), right?

But even aside from that, why do I need to divide by the number of
loops here, when normally Postgres presents a per-loop average?

[1]: https://www.postgresql.org/docs/current/parallel-plans.html



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Amit Kapila
Дата:
On Wed, Jun 24, 2020 at 12:41 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > I don't see any other reason for
> > > looping over the NL node itself in this plan. The Gather itself
> > > doesn't do any real looping, right?
> >
> > It is right that Gather doesn't do looping but Parallel Seq Scan node does so.
>
> Sorry, I still don't follow. How does a Parallel Seq Scan do looping?

Sorry, I intend to say that Parallel Seq Scan is involved in looping.
Let me try by example:

Gather (actual time=6.444..722.642 rows=10000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop (actual time=0.046..705.936 rows=5000 loops=2)
         ->  Parallel Seq Scan on t1 (actual time=0.010..45.423
rows=250000 loops=2)
         ->  Index Scan using idx_t2 on t2 (actual time=0.002..0.002
rows=0 loops=500000)
               Index Cond: (c1 = t1.c1)

In the above plan, each of the worker runs
NestLoop
   -> Parallel Seq Scan on t1
   -> Index Scan using idx_t2 on t2

So, that leads to loops as 2 on "Parallel Seq Scan" and "Nested Loop"
nodes.  Does this make sense now?

> I looked at the parallel plan docs but I don't see looping mentioned
> anywhere[1]. Also, is looping not normally indicated on children,
> rather than on the node doing the looping? E.g., with a standard
> Nested Loop, the outer child will have loops=1 and the inner child
> will have loops equal to the row count produced by the outer child
> (and the Nested Loop itself will have loops=1 unless it also is being
> looped over by a parent node), right?
>

Yeah, I hope the above has clarified it.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Amit Kapila
Дата:
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
>
..
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>

I have tried to check a similar plan and for me, the values at Gather
node seems to be considering the values from all workers and leader
(aka whatever is displayed at "Nested Loop " node), see below.  I have
tried the test on HEAD.  Which version of PostgreSQL are you using?
If you are also using the latest version then it is possible that in
some cases it is not displaying correct data. If that turns out to be
the case, then feel to share the test case.  Sorry, for the confusion
caused by my previous reply.

 Gather (actual time=2.083..550.093 rows=10000 loops=1)
   Output: t1.c1, t1.c2, t2.c1, t2.c2
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=1012621 read=84
   I/O Timings: read=0.819
   ->  Nested Loop (actual time=0.084..541.882 rows=3333 loops=3)
         Output: t1.c1, t1.c2, t2.c1, t2.c2
         Buffers: shared hit=1012621 read=84
         I/O Timings: read=0.819
         Worker 0:  actual time=0.069..541.249 rows=3155 loops=1
           Buffers: shared hit=326529 read=29
           I/O Timings: read=0.325
         Worker 1:  actual time=0.063..541.376 rows=3330 loops=1
           Buffers: shared hit=352045 read=26
           I/O Timings: read=0.179
         ->  Parallel Seq Scan on public.t1 (actual time=0.011..34.250
rows=166667 loops=3)
               Output: t1.c1, t1.c2
               Buffers: shared hit=2703
               Worker 0:  actual time=0.011..33.785 rows=161265 loops=1
                 Buffers: shared hit=872
               Worker 1:  actual time=0.009..34.582 rows=173900 loops=1
                 Buffers: shared hit=940
         ->  Index Scan using idx_t2 on public.t2 (actual
time=0.003..0.003 rows=0 loops=500000)
               Output: t2.c1, t2.c2
               Index Cond: (t2.c1 = t1.c1)
               Buffers: shared hit=1009918 read=84
               I/O Timings: read=0.819
               Worker 0:  actual time=0.003..0.003 rows=0 loops=161265
                 Buffers: shared hit=325657 read=29
                 I/O Timings: read=0.325
               Worker 1:  actual time=0.002..0.002 rows=0 loops=173900
                 Buffers: shared hit=351105 read=26
                 I/O Timings: read=0.179

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation

От
Maciek Sakrejda
Дата:
On Wed, Jun 24, 2020 at 2:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> So, that leads to loops as 2 on "Parallel Seq Scan" and "Nested Loop" nodes.  Does this make sense now?

Yes, I think we're on the same page. Thanks for the additional details.

It turns out that the plan I sent at the top of the thread is actually
an older plan we had saved, all the way from April 2018. We're fairly
certain this was Postgres 10, but not sure what point release. I tried
to reproduce this on 10, 11, 12, and 13 beta, but I am now seeing
similar results to yours: Buffers and I/O Timings are rolled up into
the parallel leader, and that is propagated as expected to the Gather.
Sorry for the confusion.

On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>So we should be seeing an average, not a sum, right?

And here I missed that the documentation specifies rows and actual
time as per-loop, but other metrics are not--they're just cumulative.
So actual time and rows are still per-"loop" values, but while rows
values are additive (the Gather combines rows from the parallel leader
and the workers), the actual time is not because the whole point is
that this work happens in parallel.

I'll report back if I can reproduce the weird numbers we saw in that
original plan or find out exactly what Postgres version it was from.

Thanks,
Maciek