Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied

From: Yura Sokolov
Subject: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied
Date:
Msg-id: 0f58e3496cfa13c024dee2c645cacdb621864c8a.camel@postgrespro.ru
In response to: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied  (Yura Sokolov <y.sokolov@postgrespro.ru>)
Responses: Re: BUG #17335: Duplicate result rows in Gather node when "Suppress Append ... that have a single child" applied  (Yura Sokolov <y.sokolov@postgrespro.ru>)
List: pgsql-bugs
On Fri, 17/12/2021 at 09:00 +0300, Yura Sokolov wrote:
> On Mon, 13/12/2021 at 14:45 +0000, PG Bug reporting form wrote:
> > The following bug has been logged on the website:
> > 
> > Bug reference:      17335
> > Logged by:          Yura Sokolov
> > Email address:      y.sokolov@postgrespro.ru
> > PostgreSQL version: 14.1
> > Operating system:   any
> > Description:
> > 
> > Our customer hit a query that returns duplicate rows when parallel
> > execution is enabled (max_parallel_workers_per_gather > 0) after an
> > 11->12 upgrade.
> > 
> > Investigation shows the bug was introduced in
> > 8edd0e79460b414b1d971895312e549e95e12e4f
> > "Suppress Append and MergeAppend plan nodes that have a single child.".
> > The bug is present in all versions since 12 beta and is reproducible on
> > the master branch as well.
> > 
> > Reproduction.
> > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75
> > 
> > The dump contains definitions of two tables (with 3 and 2 columns), an
> > index, and a data dump large enough to trigger parallel plan execution.
> > An `analyze public.testi3, public.testi4` line is also included.
> > 
> > https://gist.github.com/funny-falcon/45fa9460507c015390db99a7efd18a75/raw/06a55e57bc6c460ea8800600b1b597c1ed216259/repro_dump.sql.zst
> > 
> > Query:
> > 
> >     SELECT count(1)
> >     FROM (
> >       SELECT fk, id
> >       FROM testi3
> >       WHERE flagged
> >       UNION ALL
> >       SELECT fk, id
> >       from testi3
> >       WHERE false
> >     ) as t
> >     LEFT OUTER JOIN testi4
> >     USING (fk);
> > 
> > With `set max_parallel_workers_per_gather = 0;` it returns 44, which is
> > correct.
> > With `set max_parallel_workers_per_gather = 2;` it returns 88, which is
> > wrong.
> > 
> > Sometimes (for example, when you try to debug it) it returns 44 even
> > with max_parallel_workers_per_gather = 2.
> > It seems to happen when a worker returns the first row fast enough that
> > the "Gather" node doesn't start executing the plan itself.
> > 
> > This is confirmed by `set parallel_leader_participation = false`, which
> > forces the correct result.
> > `set enable_parallel_append = false` also forces the correct result.
> > 
> > The original plan included a Gather Merge node and
> > max_parallel_workers_per_gather = 1.
> > The reproduction uses Gather and max_parallel_workers_per_gather = 2
> > simply because it was easier to get the bad plan.
> > 
> > Plan before change 8edd0e79460b414b1d971895312e549e95e12e4f (returns the correct result):
> > 
> >  Finalize Aggregate  (cost=25003.16..25003.17 rows=1 width=8) (actual time=435.473..435.473 rows=1 loops=1)
> >    ->  Gather  (cost=25003.05..25003.16 rows=1 width=8) (actual time=434.646..456.889 rows=2 loops=1)
> >          Workers Planned: 1
> >          Workers Launched: 1
> >          ->  Partial Aggregate  (cost=24003.05..24003.06 rows=1 width=8) (actual time=433.000..433.000 rows=1 loops=2)
> >                ->  Parallel Hash Left Join  (cost=21701.47..24002.86 rows=74 width=0) (actual time=312.576..432.981 rows=22 loops=2)
> >                      Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >                      ->  Parallel Append  (cost=0.42..9.18 rows=24 width=4) (actual time=0.021..0.112 rows=16 loops=2)
> >                            ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.11 rows=24 width=4) (actual time=0.041..0.216 rows=33 loops=1)
> >                                  ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.87 rows=24 width=8) (actual time=0.039..0.207 rows=33 loops=1)
> >                                        Index Cond: (flagged = true)
> >                      ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=305.473..305.473 rows=703309 loops=2)
> >                            Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> >                            ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.032..138.958 rows=703309 loops=2)
> > 
> > Plan after that change (returns the wrong result):
> > 
> >  Finalize Aggregate  (cost=25003.87..25003.88 rows=1 width=8) (actual time=652.062..652.062 rows=1 loops=1)
> >    ->  Gather  (cost=25003.75..25003.86 rows=1 width=8) (actual time=651.482..668.718 rows=2 loops=1)
> >          Workers Planned: 1
> >          Workers Launched: 1
> >          ->  Partial Aggregate  (cost=24003.75..24003.76 rows=1 width=8) (actual time=649.150..649.150 rows=1 loops=2)
> >                ->  Parallel Hash Left Join  (cost=21701.47..24003.49 rows=106 width=0) (actual time=461.751..649.111 rows=44 loops=2)
> >                      Hash Cond: ("*SELECT* 1".fk = testi4.fk)
> >                      ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.51 rows=36 width=4) (actual time=0.050..0.229 rows=33 loops=2)
> >                            ->  Index Scan using testi3_ix on testi3  (cost=0.42..9.15 rows=36 width=8) (actual time=0.048..0.214 rows=33 loops=2)
> >                                  Index Cond: (flagged = true)
> >                      ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=4) (actual time=444.931..444.931 rows=703309 loops=2)
> >                            Buckets: 131072  Batches: 32  Memory Usage: 2784kB
> >                            ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=4) (actual time=0.045..202.913 rows=703309 loops=2)
> 
> I found a simple fix:
> 
> --- a/src/backend/optimizer/util/pathnode.c
> +++ b/src/backend/optimizer/util/pathnode.c
> @@ -1284,14 +1284,17 @@ create_append_path(PlannerInfo *root,
>         if (list_length(pathnode->subpaths) == 1)
>         {
>                 Path       *child = (Path *) linitial(pathnode->subpaths);
>  
>                 pathnode->path.rows = child->rows;
>                 pathnode->path.startup_cost = child->startup_cost;
>                 pathnode->path.total_cost = child->total_cost;
>                 pathnode->path.pathkeys = child->pathkeys;
> +               pathnode->path.parallel_aware = child->parallel_aware;
> +               pathnode->path.parallel_safe = child->parallel_safe;
> +               pathnode->path.parallel_workers = child->parallel_workers;
>         }
>         else
>         {
>                 pathnode->path.pathkeys = NIL;  /* unsorted if more than 1 subpath */
>                 cost_append(pathnode);
>         }
> 
> With this fix, the Gather node deduces `Single Copy: true`, and the leader
> doesn't participate in query execution.
> 
> It seems to be a correct change:
> - if the Append node is going to be removed, it should take parallel
>   awareness/safety from its single child.
> 
> With this change, a plan with a Gather Merge node fails an assertion in
> cost_gather_merge:
> 
>         /*
>          * Add one to the number of workers to account for the leader.  This might
>          * be overgenerous since the leader will do less work than other workers
>          * in typical cases, but we'll go with it for now.
>          */
>         Assert(path->num_workers > 0);
>         N = (double) path->num_workers + 1;
>         logN = LOG2(N);
> 
> Commenting out this assertion doesn't seem to break anything and doesn't seem
> to mess up the cost calculation (i.e. it remains logically correct).
> 
> To preserve the assertion, I've changed it to `Assert(path->num_workers >= 0);`.
> 
> With these changes, the queries from the reproduction return the correct 44 rows.
> 
> # explain analyze SELECT id, val
> FROM (
>   SELECT fk, id
>   FROM testi3
>   WHERE flagged
>   UNION ALL
>   SELECT fk, id
>   from testi3
>   WHERE false
> ) as t
> LEFT OUTER JOIN testi4
> USING (fk);
>                                                                  QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------
>  Gather  (cost=22701.47..25015.79 rows=130 width=8) (actual time=1268.071..1686.918 rows=44 loops=1)
>    Workers Planned: 1
>    Workers Launched: 1
>    Single Copy: true
>    ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=1242.936..1634.421 rows=44 loops=1)
>          Hash Cond: ("*SELECT* 1".fk = testi4.fk)
>          ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.049..0.166 rows=33 loops=1)
>                ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.047..0.151 rows=33 loops=1)
>                      Index Cond: (flagged = true)
>          ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=1238.029..1238.029 rows=1406618 loops=1)
>                Buckets: 131072  Batches: 32  Memory Usage: 2784kB
>                ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.070..540.652 rows=1406618 loops=1)
> 
> Interestingly, the Gather Merge node doesn't run workers:
> 
> # explain analyze SELECT id, val
> FROM (
>   SELECT fk, id
>   FROM testi3
>   WHERE flagged
>   UNION ALL
>   SELECT fk, id
>   from testi3
>   WHERE false
> ) as t
> LEFT OUTER JOIN testi4
> USING (fk) ORDER BY val;
>                                                                     QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------
>  Gather Merge  (cost=25007.36..25007.68 rows=0 width=8) (actual time=909.975..909.984 rows=44 loops=1)
>    Workers Planned: 0
>    Workers Launched: 0
>    ->  Sort  (cost=24007.36..24007.68 rows=130 width=8) (actual time=909.948..909.952 rows=44 loops=1)
>          Sort Key: testi4.val
>          Sort Method: quicksort  Memory: 27kB
>          ->  Parallel Hash Left Join  (cost=21701.47..24002.79 rows=130 width=8) (actual time=757.099..909.779 rows=44 loops=1)
>                Hash Cond: ("*SELECT* 1".fk = testi4.fk)
>                ->  Subquery Scan on "*SELECT* 1"  (cost=0.42..9.09 rows=24 width=8) (actual time=0.224..0.576 rows=33 loops=1)
>                      ->  Index Scan using testi3_ix on testi3  (cost=0.42..8.85 rows=24 width=8) (actual time=0.202..0.543 rows=33 loops=1)
>                            Index Cond: (flagged = true)
>                ->  Parallel Hash  (cost=12084.91..12084.91 rows=586091 width=8) (actual time=755.983..755.983 rows=1406618 loops=1)
>                      Buckets: 131072  Batches: 32  Memory Usage: 2730kB
>                      ->  Parallel Seq Scan on testi4  (cost=0.00..12084.91 rows=586091 width=8) (actual time=0.052..330.983 rows=1406618 loops=1)
> 
> So these changes could be taken as a quick, valid fix.
> 
> A further improvement would be to remove the Append node earlier, so that we
> can detect there is no need for parallelism when the single Append child
> doesn't support it.
> 

I've attached these changes as a patch.
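
For anyone who wants to re-check the behaviour on an unpatched server, here is
a minimal session sketch assembled from the reproduction above (it assumes the
tables and data from the linked repro_dump.sql.zst have already been loaded):

    -- On affected versions (12 through master) this returns 88 instead of 44:
    SET max_parallel_workers_per_gather = 2;

    SELECT count(1)
    FROM (
      SELECT fk, id FROM testi3 WHERE flagged
      UNION ALL
      SELECT fk, id FROM testi3 WHERE false
    ) AS t
    LEFT OUTER JOIN testi4 USING (fk);

    -- Each of the following settings makes the query return the correct 44 again:
    SET max_parallel_workers_per_gather = 0;    -- no parallel plan at all
    SET parallel_leader_participation = false;  -- keep the leader out of the scan
    SET enable_parallel_append = false;         -- avoid the Parallel Append path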

--------
regards,
Sokolov Yura
Postgres Professional
y.sokolov@postgrespro.ru
funny.falcon@gmail.com


