Обсуждение: BUG #16749: EXPLAIN only shows filtering from the first query in a union chain when using distinct on.
BUG #16749: EXPLAIN only shows filtering from the first query in a union chain when using distinct on.
От
PG Bug reporting form
Дата:
The following bug has been logged on the website: Bug reference: 16749 Logged by: Michael Richards Email address: mike22e@gmail.com PostgreSQL version: 12.2 Operating system: macOS Big Sur (11.0.1) Description: ### Bug description: The query plan produced by EXPLAIN ANALYZE does not show all filters being applied when querying with distinct on over a union'd set of queries. It only shows a filter as if there was just a single query (the first one in the union chain). Confirmed in both Postgres 12 and 13. ### Example to reproduce: Take the following query: select * from generate_series(0,20) n1 left join ( select distinct on (id) * from ( select n2 as id, n2+2 as a from generate_series(0,10) n2 union all select n3+5 as id, n3+12 as a from generate_series(0,10) n3 ) s1 order by id, a ) s2 on s2.id = n1 where 20 is distinct from s2.a; Which produces the correct results (row with column a = 20 is removed), as shown below: n1 | id | a ----+----+---- 0 | 0 | 2 1 | 1 | 3 2 | 2 | 4 3 | 3 | 5 4 | 4 | 6 5 | 5 | 7 6 | 6 | 8 7 | 7 | 9 8 | 8 | 10 9 | 9 | 11 10 | 10 | 12 11 | 11 | 18 12 | 12 | 19 14 | 14 | 21 15 | 15 | 22 16 | | 17 | | 18 | | 19 | | 20 | | (20 rows) However, the row that contained a = 20 was produced using n3+12 in the second query in the union, but the EXPLAIN ANALYZE for this query only shows the n2+2 as a filter; the n3+12 is nowhere to be seen as a filter in the query plan. Hash Left Join (cost=1.52..1.79 rows=20 width=12) (actual time=0.098..0.105 rows=20 loops=1) Hash Cond: (n1.n1 = n2.n2) Filter: (20 IS DISTINCT FROM ((n2.n2 + 2))) Rows Removed by Filter: 1 -> Function Scan on generate_series n1 (cost=0.00..0.21 rows=21 width=4) (actual time=0.005..0.007 rows=21 loops=1) -> Hash (cost=1.24..1.24 rows=22 width=8) (actual time=0.068..0.068 rows=16 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Unique (cost=0.91..1.02 rows=22 width=8) (actual time=0.058..0.064 rows=16 loops=1) -> Sort (cost=0.91..0.96 rows=22 width=8) (actual time=0.058..0.059 rows=22 loops=1) Sort Key: n2.n2, ((n2.n2 + 2)) Sort Method: quicksort Memory: 26kB -> Append (cost=0.00..0.42 rows=22 width=8) (actual time=0.010..0.024 rows=22 loops=1) -> Function Scan on generate_series n2 (cost=0.00..0.14 rows=11 width=8) (actual time=0.009..0.011 rows=11 loops=1) -> Function Scan on generate_series n3 (cost=0.00..0.17 rows=11 width=8) (actual time=0.009..0.011 rows=11 loops=1) The relevant part of the query plan is lines 3-4. Filter: (20 IS DISTINCT FROM ((n2.n2 + 2))) Rows Removed by Filter: 1 The above filter did not actually remove any rows at all. The filter that removed that 1 row should be Filter: (20 IS DISTINCT FROM ((n3.n3 + 12))). This bug also applies when there are any number of union'd queries, not just two—the query plan will still only show the filter from the first query in the union chain. The query results seem to be correct, but the displayed query plan does not account for it. Cheers, Michael
PG Bug reporting form <noreply@postgresql.org> writes: > The query plan produced by EXPLAIN ANALYZE does not show all filters being > applied when querying with distinct on over a union'd set of queries. Hm? Your example has only one filter condition, and there's only one in the plan. =# explain verbose select * from generate_series(0,20) n1 left join ( select distinct on (id) * from ( select n2 as id, n2+2 as a from generate_series(0,10) n2 union all select n3+5 as id, n3+12 as a from generate_series(0,10) n3 ) s1 order by id, a ) s2 on s2.id = n1 where 20 is distinct from s2.a; QUERY PLAN ----------------------------------------------------------------------------------------------------------------- Hash Left Join (cost=1.52..1.79 rows=20 width=12) Output: n1.n1, n2.n2, ((n2.n2 + 2)) Inner Unique: true Hash Cond: (n1.n1 = n2.n2) Filter: (20 IS DISTINCT FROM ((n2.n2 + 2))) -> Function Scan on pg_catalog.generate_series n1 (cost=0.00..0.21 rows=21 width=4) Output: n1.n1 Function Call: generate_series(0, 20) -> Hash (cost=1.24..1.24 rows=22 width=8) Output: n2.n2, ((n2.n2 + 2)) -> Unique (cost=0.91..1.02 rows=22 width=8) Output: n2.n2, ((n2.n2 + 2)) -> Sort (cost=0.91..0.96 rows=22 width=8) Output: n2.n2, ((n2.n2 + 2)) Sort Key: n2.n2, ((n2.n2 + 2)) -> Append (cost=0.00..0.42 rows=22 width=8) -> Function Scan on pg_catalog.generate_series n2 (cost=0.00..0.14 rows=11 width=8) Output: n2.n2, (n2.n2 + 2) Function Call: generate_series(0, 10) -> Function Scan on pg_catalog.generate_series n3 (cost=0.00..0.17 rows=11 width=8) Output: (n3.n3 + 5), (n3.n3 + 12) Function Call: generate_series(0, 10) (22 rows) I think the actual issue here is that EXPLAIN has no good way to reconstruct the subquery alias "s2.a", so what it prints is an expansion based on the first append child. Even if we could reconstruct "s2.a", printing the expansion is more useful and less confusing in most cases (admittedly not so much in this one). The extra parentheses that you see around "(n2.n2 + 2)" in the upper query nodes are indicators that these are just references to the output of the bottom plan node, ie "n2.n2 + 2" is only being computed once at the n2 function scan node, and then bubbled up into the upper levels. While we could print those references as just Vars, in a lot of cases there'd be no very good name to use for them. regards, tom lane