Thread: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
From:
Maciek Sakrejda
Date:
Hello,

I had some questions about the behavior of some accounting in parallel EXPLAIN plans. Take the following plan:

```
Gather  (cost=1000.43..750173.74 rows=2 width=235) (actual time=1665.122..1665.122 rows=0 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=27683 read=239573
  I/O Timings: read=687.358
  ->  Nested Loop  (cost=0.43..749173.54 rows=1 width=235) (actual time=1660.095..1660.095 rows=0 loops=3)
        Inner Unique: true
        Buffers: shared hit=77579 read=657847
        I/O Timings: read=2090.189
        Worker 0: actual time=1657.443..1657.443 rows=0 loops=1
          Buffers: shared hit=23645 read=208365
          I/O Timings: read=702.270
        Worker 1: actual time=1658.277..1658.277 rows=0 loops=1
          Buffers: shared hit=26251 read=209909
          I/O Timings: read=700.560
        ->  Parallel Seq Scan on public.schema_indices  (cost=0.00..748877.88 rows=35 width=235) (actual time=136.744..1659.629 rows=32 loops=3)
              Filter: ((schema_indices.invalidated_at_snapshot_id IS NULL) AND (NOT schema_indices.is_valid))
              Rows Removed by Filter: 703421
              Buffers: shared hit=77193 read=657847
              I/O Timings: read=2090.189
              Worker 0: actual time=69.248..1656.950 rows=32 loops=1
                Buffers: shared hit=23516 read=208365
                I/O Timings: read=702.270
              Worker 1: actual time=260.188..1657.875 rows=27 loops=1
                Buffers: shared hit=26140 read=209909
                I/O Timings: read=700.560
        ->  Index Scan using schema_tables_pkey on public.schema_tables  (cost=0.43..8.45 rows=1 width=8) (actual time=0.011..0.011 rows=0 loops=95)
              Index Cond: (schema_tables.id = schema_indices.table_id)
              Filter: (schema_tables.database_id = 123)
              Rows Removed by Filter: 1
              Buffers: shared hit=386
              Worker 0: actual time=0.011..0.011 rows=0 loops=32
                Buffers: shared hit=129
              Worker 1: actual time=0.011..0.011 rows=0 loops=27
                Buffers: shared hit=111
Planning Time: 0.429 ms
Execution Time: 1667.373 ms
```

The Nested Loop here aggregates data for metrics like `buffers read` from its workers, and to calculate a metric like `buffers read` for the parallel leader, we can subtract the values recorded in each individual worker. This happens in the Seq Scan and Index Scan children, as well. However, the Gather node appears to only include values from its direct parallel leader child (excluding that child's workers).

This leads to the odd situation that the Gather has lower values for some of these metrics than its child (because the child node's reporting includes the worker metrics), even though the values are supposed to be cumulative. This is even more surprising for something like I/O timing, where the Gather has a lower `read` value than one of the Nested Loop workers, which doesn't make sense in terms of wall-clock time.

Is this behavior intentional? If so, is there an explanation of the reasoning or the trade-offs involved? Would it not make sense to propagate those cumulative parallel costs up the tree all the way to the root, instead of only using the parallel leader metrics under Gather?

Thanks,
Maciek
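The subtraction described above can be checked directly against the numbers in this plan. The following is a quick sketch in plain Python (the helper name is illustrative, not any EXPLAIN API), using the `Buffers: shared read` figures from the Nested Loop and its workers:

```python
def leader_contribution(node_total, worker_values):
    """Leader's share of a cumulative counter: node total minus all worker shares."""
    return node_total - sum(worker_values)

# From the plan above: Nested Loop "Buffers: shared read" and the two workers.
nested_loop_read = 657847
worker_reads = [208365, 209909]  # Worker 0, Worker 1

leader_read = leader_contribution(nested_loop_read, worker_reads)
print(leader_read)  # 239573 -- exactly the "shared read" shown at the Gather node
```

That the result matches the Gather's `read=239573` is the observation being raised: the Gather appears to carry only the leader's contribution, not the workers'.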
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
> ```
> Gather  (cost=1000.43..750173.74 rows=2 width=235) (actual
> time=1665.122..1665.122 rows=0 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   Buffers: shared hit=27683 read=239573
>   I/O Timings: read=687.358
>   ->  Nested Loop  (cost=0.43..749173.54 rows=1 width=235) (actual
> time=1660.095..1660.095 rows=0 loops=3)
>         Inner Unique: true
>         Buffers: shared hit=77579 read=657847
>         I/O Timings: read=2090.189
..
> ```
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>
> This leads to the odd situation that the Gather has lower values for
> some of these metrics than its child (because the child node reporting
> includes the worker metrics) even though the values are supposed to be
> cumulative.
>

I don't think this is an odd situation, because in this case child nodes like "Nested Loop" and "Parallel Seq Scan" have a 'loops' value of 3. So, to get the correct stats at those nodes, you need to divide by 3, whereas at the Gather node the value of 'loops' is 1. If you want to verify this, try a plan where loops is 1 for the child nodes as well; you should get the same value at both the Gather and Parallel Seq Scan nodes.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
From:
Maciek Sakrejda
Date:
On Tue, Jun 23, 2020 at 2:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I don't think this is an odd situation, because in this case child
> nodes like "Nested Loop" and "Parallel Seq Scan" have a 'loops' value
> of 3. So, to get the correct stats at those nodes, you need to divide
> by 3, whereas at the Gather node the value of 'loops' is 1. If you
> want to verify this, try a plan where loops is 1 for the child nodes
> as well; you should get the same value at both the Gather and
> Parallel Seq Scan nodes.

Thanks for the response, but I still don't follow. I had assumed that loops=3 was just from loops=1 for the parallel leader plus loops=1 for each worker--is that not right? I don't see any other reason for looping over the NL node itself in this plan. The Gather itself doesn't do any real looping, right?

But even so, the documentation [1] states:

> In some query plans, it is possible for a subplan node to be executed more
> than once. For example, the inner index scan will be executed once per outer
> row in the above nested-loop plan. In such cases, the loops value reports the
> total number of executions of the node, and the actual time and rows values
> shown are averages per-execution. This is done to make the numbers comparable
> with the way that the cost estimates are shown. Multiply by the loops value to
> get the total time actually spent in the node.

So we should be seeing an average, not a sum, right?

[1]: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE
On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 2:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't think this is an odd situation, because in this case child
> > nodes like "Nested Loop" and "Parallel Seq Scan" have a 'loops' value
> > of 3. So, to get the correct stats at those nodes, you need to divide
> > by 3, whereas at the Gather node the value of 'loops' is 1. If you
> > want to verify this, try a plan where loops is 1 for the child nodes
> > as well; you should get the same value at both the Gather and
> > Parallel Seq Scan nodes.
>
> Thanks for the response, but I still don't follow. I had assumed that
> loops=3 was just from loops=1 for the parallel leader plus loops=1 for
> each worker--is that not right?
>

No, I don't think so.

> I don't see any other reason for
> looping over the NL node itself in this plan. The Gather itself
> doesn't do any real looping, right?

It is right that Gather doesn't do any looping, but the Parallel Seq Scan node does.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
From:
Maciek Sakrejda
Date:
On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't see any other reason for
> > looping over the NL node itself in this plan. The Gather itself
> > doesn't do any real looping, right?
>
> It is right that Gather doesn't do any looping, but the Parallel Seq
> Scan node does.

Sorry, I still don't follow. How does a Parallel Seq Scan do looping? I looked at the parallel plan docs but I don't see looping mentioned anywhere [1]. Also, is looping not normally indicated on children, rather than on the node doing the looping? E.g., with a standard Nested Loop, the outer child will have loops=1 and the inner child will have loops equal to the row count produced by the outer child (and the Nested Loop itself will have loops=1 unless it also is being looped over by a parent node), right?

But even aside from that, why do I need to divide by the number of loops here, when normally Postgres presents a per-loop average?

[1]: https://www.postgresql.org/docs/current/parallel-plans.html
On Wed, Jun 24, 2020 at 12:41 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 7:55 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > It is right that Gather doesn't do any looping, but the Parallel Seq
> > Scan node does.
>
> Sorry, I still don't follow. How does a Parallel Seq Scan do looping?

Sorry, I meant to say that the Parallel Seq Scan is involved in looping. Let me try an example:

Gather  (actual time=6.444..722.642 rows=10000 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Nested Loop  (actual time=0.046..705.936 rows=5000 loops=2)
        ->  Parallel Seq Scan on t1  (actual time=0.010..45.423 rows=250000 loops=2)
        ->  Index Scan using idx_t2 on t2  (actual time=0.002..0.002 rows=0 loops=500000)
              Index Cond: (c1 = t1.c1)

In the above plan, each of the workers runs

NestLoop
  -> Parallel Seq Scan on t1
  -> Index Scan using idx_t2 on t2

so that leads to loops being 2 on the "Parallel Seq Scan" and "Nested Loop" nodes. Does this make sense now?

> I looked at the parallel plan docs but I don't see looping mentioned
> anywhere [1]. Also, is looping not normally indicated on children,
> rather than on the node doing the looping? E.g., with a standard
> Nested Loop, the outer child will have loops=1 and the inner child
> will have loops equal to the row count produced by the outer child
> (and the Nested Loop itself will have loops=1 unless it also is being
> looped over by a parent node), right?
>

Yeah, I hope the above has clarified it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
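The relationship in the example plan above can be restated as arithmetic: `loops` on a node under a Gather counts how many processes executed it, and multiplying the per-loop `rows` average by `loops` recovers the total row count, which in turn equals the inner Index Scan's `loops`. A sketch in plain Python (variable names are illustrative):

```python
# Parallel Seq Scan from the example plan: rows=250000 loops=2, i.e. two
# processes each executed the scan once, averaging 250000 rows apiece.
scan_rows_per_loop = 250000
scan_loops = 2

# Per-loop average * loops = total rows produced across all processes.
total_scan_rows = scan_rows_per_loop * scan_loops
print(total_scan_rows)  # 500000

# The inner Index Scan runs once per outer row, so its loops value in the
# plan (loops=500000) matches the outer scan's total row count.
index_scan_loops = 500000
assert total_scan_rows == index_scan_loops
```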
On Tue, Jun 23, 2020 at 12:55 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
>
> Hello,
>
> I had some questions about the behavior of some accounting in parallel
> EXPLAIN plans. Take the following plan:
>
> ..
>
> The Nested Loop here aggregates data for metrics like `buffers read`
> from its workers, and to calculate a metric like `buffers read` for
> the parallel leader, we can subtract the values recorded in each
> individual worker. This happens in the Seq Scan and Index Scan
> children, as well. However, the Gather node appears to only include
> values from its direct parallel leader child (excluding that child's
> workers).
>

I have tried to check a similar plan, and for me the values at the Gather node seem to consider the values from all workers and the leader (aka whatever is displayed at the "Nested Loop" node); see below. I have tried the test on HEAD. Which version of PostgreSQL are you using? If you are also using the latest version, then it is possible that in some cases it is not displaying correct data. If that turns out to be the case, then feel free to share the test case. Sorry for the confusion caused by my previous reply.
Gather  (actual time=2.083..550.093 rows=10000 loops=1)
  Output: t1.c1, t1.c2, t2.c1, t2.c2
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=1012621 read=84
  I/O Timings: read=0.819
  ->  Nested Loop  (actual time=0.084..541.882 rows=3333 loops=3)
        Output: t1.c1, t1.c2, t2.c1, t2.c2
        Buffers: shared hit=1012621 read=84
        I/O Timings: read=0.819
        Worker 0: actual time=0.069..541.249 rows=3155 loops=1
          Buffers: shared hit=326529 read=29
          I/O Timings: read=0.325
        Worker 1: actual time=0.063..541.376 rows=3330 loops=1
          Buffers: shared hit=352045 read=26
          I/O Timings: read=0.179
        ->  Parallel Seq Scan on public.t1  (actual time=0.011..34.250 rows=166667 loops=3)
              Output: t1.c1, t1.c2
              Buffers: shared hit=2703
              Worker 0: actual time=0.011..33.785 rows=161265 loops=1
                Buffers: shared hit=872
              Worker 1: actual time=0.009..34.582 rows=173900 loops=1
                Buffers: shared hit=940
        ->  Index Scan using idx_t2 on public.t2  (actual time=0.003..0.003 rows=0 loops=500000)
              Output: t2.c1, t2.c2
              Index Cond: (t2.c1 = t1.c1)
              Buffers: shared hit=1009918 read=84
              I/O Timings: read=0.819
              Worker 0: actual time=0.003..0.003 rows=0 loops=161265
                Buffers: shared hit=325657 read=29
                I/O Timings: read=0.325
              Worker 1: actual time=0.002..0.002 rows=0 loops=173900
                Buffers: shared hit=351105 read=26
                I/O Timings: read=0.179

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
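The rollup described in this message can be sanity-checked with the plan's own numbers: for a cumulative counter like shared buffer hits, the Gather total equals the Nested Loop total, and subtracting the workers' shares leaves the leader's. A sketch in plain Python (values copied from the plan above):

```python
# Cumulative counters (buffers, I/O timings) are not per-loop averages:
# the Nested Loop total already includes the leader and both workers,
# and the Gather reports that same total.
gather_hit = 1012621
nested_loop_hit = 1012621
worker_hits = [326529, 352045]  # Worker 0, Worker 1 shared hits

assert gather_hit == nested_loop_hit  # the rollup propagates unchanged

# The leader's own share falls out by subtraction.
leader_hit = nested_loop_hit - sum(worker_hits)
print(leader_hit)  # 334047
```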
Re: EXPLAIN: Non-parallel ancestor plan nodes exclude parallel worker instrumentation
From:
Maciek Sakrejda
Date:
On Wed, Jun 24, 2020 at 2:44 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> So that leads to loops being 2 on the "Parallel Seq Scan" and "Nested
> Loop" nodes. Does this make sense now?

Yes, I think we're on the same page. Thanks for the additional details.

It turns out that the plan I sent at the top of the thread is actually an older plan we had saved, all the way from April 2018. We're fairly certain this was Postgres 10, but not sure what point release. I tried to reproduce this on 10, 11, 12, and 13 beta, but I am now seeing similar results to yours: Buffers and I/O Timings are rolled up into the parallel leader, and that is propagated as expected to the Gather. Sorry for the confusion.

On Wed, Jun 24, 2020 at 3:18 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
> So we should be seeing an average, not a sum, right?

And here I missed that the documentation specifies rows and actual time as per-loop, but other metrics are not--they're just cumulative. So actual time and rows are still per-"loop" values, but while rows values are additive (the Gather combines rows from the parallel leader and the workers), the actual time is not, because the whole point is that this work happens in parallel.

I'll report back if I can reproduce the weird numbers we saw in that original plan, or find out exactly what Postgres version it was from.

Thanks,
Maciek
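The distinction Maciek summarizes here can be sketched as a small rule in plain Python (the field names are illustrative, not the EXPLAIN JSON keys):

```python
# Per-loop fields (actual time, rows) are averages: multiply by loops to
# recover a total. Cumulative fields (buffers, I/O timings) are already
# totals: use them as-is.
PER_LOOP_FIELDS = {"actual_time_ms", "rows"}

def total_value(field, value, loops):
    """Node-wide total for an EXPLAIN ANALYZE field, per the documented semantics."""
    return value * loops if field in PER_LOOP_FIELDS else value

# Parallel Seq Scan from the first plan: rows=32 is an average over loops=3.
print(total_value("rows", 32, 3))  # 96 (approximate total, since the shown average is rounded)
# Buffers are cumulative: shared read=657847 is already the total.
print(total_value("shared_read", 657847, 3))  # 657847
```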