Обсуждение: Reference to parent query from ANY sublink
So far, a suquery of ANY sublink located in WHERE/ON clause can't reference vars exactly one level up, as long as pull-up into the join tree is expected. Now that we have LATERAL subqueries (there seem to be no specifics of SEMI JOIN when it comes to parameterization etc), I think this restriction can be lifted. Thus a subplan should be avoided often. Not sure if something like that is applicable to EXISTS: various parts are cut-off, so there are probably no vars having (varlevelsup == 1). The attachments show cases where the SEMI JOIN should be inserted above INNER JOIN and into the nullable side of OUTER JOIN respectively, each before the patch is applied and after that. So far I didn't test recursive processing, but don't expect problems here. Can the change be as simple as this or do I neglect anything? // Antonin Houska (Tony)
Вложения
On 10/31/2013 03:46 PM, Antonin Houska wrote: > Can the change be as simple as this or do I neglect anything? Well, the example of outer join is wrong. Instead I think query SELECT * FROM tab1 aLEFT JOINtab1 bON b.i = ANY ( SELECT tab2.k FROM tab2 WHERE k = a.j); should be converted to SELECT * FROM tab1 aLEFT JOIN( tab1 b LATERAL SEMI JOIN ( SELECT tab2.k FROM tab2 WHERE k = a.j ) AS ANY_subquery ON b.i = sub.k) I'm not sure if it's legal for the WHERE clause to reference LHS of the original outer join (a.j). Some more restriction may be needed. I need to think about it a bit more. // Tony
On 10/31/2013 09:37 PM, Antonin Houska wrote: > On 10/31/2013 03:46 PM, Antonin Houska wrote: > I'm not sure if it's legal for the WHERE clause to reference LHS of the > original outer join (a.j). Some more restriction may be needed. I need > to think about it a bit more. For a subquery or sublink expression referencing the outer table of an OJ (see tab1) SELECT * FROM tab1 a LEFT JOIN tab2 b ON a.i = ANY ( SELECT k FROM tab3 c WHERE k = a.i); I started my considerations by inserting the SEMI JOIN in a form of subquery, instead of a join node - see SJ_subquery here: SELECT * FROM tab1 a LEFT JOIN ( SELECT * tab2 b SEMI JOIN ( SELECT k FROM tab3 c WHERE k = a.i ) AS ANY_subquery ON a.i = ANY_subquery.k ) AS SJ_subquery ON true; (To allow a.i in the sublink expression, we'd only need to pass both tab1 and tab2 to pull_up_sublinks_qual_recurse() in available_rels1.) However it seem to be these lateral references (from the subquery and/or the sublink expression) to tab1 that make it impossible for SJ_subquery to be pulled up into the parent query's join tree - see jointree_contains_lateral_outer_refs(). I'm not sure if it makes much sense to pull up the sublink in such a case, does it? I ended up with this logic: if the join is INNER, both the subquery and sublink expression can reference either side. If the join is OUTER, only the inner side can be referenced. Otherwise no attempt to introduce the SEMI JOIN. Can this be considered a patch, or is it wrong/incomplete? // Antonin Houska (Tony)
Вложения
Antonin Houska <antonin.houska@gmail.com> wrote: > SELECT * > FROM tab1 a > LEFT JOIN > tab2 b > ON a.i = ANY ( > SELECT k > FROM tab3 c > WHERE k = a.i); This query works with k in any or all tables, but the semantics certainly vary depending on where k happens to be. It would help a lot if you showed SQL statements to create and populate the tables involved and/or if you qualified all referenced column names with the table alias to avoid ambiguity. If I assume that the k reference is supposed to be a column in tab3, what you have is a query where you always get all rows from tab1, and for each row from tab1 you either match it to all rows from tab2 or no rows from tab2 depending on whether the tab1 row has a match in tab3. test=# create table tab1 (i int); CREATE TABLE test=# create table tab2 (j int); CREATE TABLE test=# create table tab3 (k int); CREATE TABLE test=# insert into tab1 values (1), (2), (3); INSERT 0 3 test=# insert into tab2 values (4), (5), (6); INSERT 0 3 test=# insert into tab3 values (1), (3); INSERT 0 2 test=# SELECT * FROM tab1 a LEFT JOIN tab2 b ON a.i = ANY ( SELECT k FROM tab3 c WHERE k = a.i); i | j ---+--- 1 | 4 1 | 5 1 | 6 2 | 3 | 4 3 | 5 3 | 6 (7 rows) > SELECT * > FROM tab1 a > LEFT JOIN > ( > SELECT * > tab2 b > SEMI JOIN > ( SELECT k > FROM tab3 c > WHERE k = a.i > ) AS ANY_subquery > ON a.i = ANY_subquery.k > ) AS SJ_subquery > ON true; It is hard to see what you intend here, since this is not valid syntax. I assume you want a FROM keyword before the tab2 reference, but it's less clear what you intend with the SEMI JOIN syntax. PostgreSQL supports semi-joins; but that is an implementation detail for the EXISTS or IN syntax. Could you clarify your intent? -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > test=# SELECT * > FROM tab1 a > LEFT JOIN > tab2 b > ON a.i = ANY ( > SELECT k > FROM tab3 c > WHERE k = a.i); > i | j > ---+--- > 1 | 4 > 1 | 5 > 1 | 6 > 2 | > 3 | 4 > 3 | 5 > 3 | 6 > (7 rows) > >> SELECT * >> FROM tab1 a >> LEFT JOIN >> ( >> SELECT * >> tab2 b >> SEMI JOIN >> ( SELECT k >> FROM tab3 c >> WHERE k = a.i >> ) AS ANY_subquery >> ON a.i = ANY_subquery.k >> ) AS SJ_subquery >> ON true; > > It is hard to see what you intend here Perhaps you were looking for a way to formulate it something like this?: test=# SELECT * test-# FROM tab1 a test-# LEFT JOIN LATERAL test-# ( test(# SELECT * test(# FROM tab2 b test(# WHERE EXISTS test(# ( test(# SELECT * test(# FROM tab3 c test(# WHERE c.k = a.i test(# ) test(# ) AS SJ_subquery test-# ON true; i | j ---+--- 1 | 4 1 | 5 1 | 6 2 | 3 | 4 3 | 5 3 | 6 (7 rows) Without LATERAL you get an error: ERROR: invalid reference to FROM-clause entry for table "a" LINE 11: WHERE c.k = a.i ^ -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/06/2013 03:33 PM, Kevin Grittner wrote: > Antonin Houska <antonin.houska@gmail.com> wrote: > >> SELECT * >> FROM tab1 a >> LEFT JOIN >> tab2 b >> ON a.i = ANY ( >> SELECT k >> FROM tab3 c >> WHERE k = a.i); > > This query works with k in any or all tables, but the semantics > certainly vary depending on where k happens to be. It would help a > lot if you showed SQL statements to create and populate the tables > involved and/or if you qualified all referenced column names with > the table alias to avoid ambiguity. I used the DDLs attached (tables.ddl) for this query too, not only for the queries in quaries.sql. Yes, if I had mentioned it and/or qualified the 'k' column reference, it wouldn't have broken anything. > If I assume that the k reference is supposed to be a column in > tab3, what you have is a query where you always get all rows from > tab1, and for each row from tab1 you either match it to all rows > from tab2 or no rows from tab2 depending on whether the tab1 row > has a match in tab3. I concede this particular query is not useful. But the important thing to consider here is which side of the LEFT JOIN the subquery references. >> SELECT * >> FROM tab1 a >> LEFT JOIN >> ( >> SELECT * >> tab2 b >> SEMI JOIN >> ( SELECT k >> FROM tab3 c >> WHERE k = a.i >> ) AS ANY_subquery >> ON a.i = ANY_subquery.k >> ) AS SJ_subquery >> ON true; > > It is hard to see what you intend here, since this is not valid > syntax. This is what I - after having read the related source code - imagine to happen internally when the ANY predicate of the first query is being processed. In fact it should become something like this (also internal stuff) SELECT * FROM tab1 a LEFT JOIN ( tab2 b SEMI JOIN ( SELECT k FROM tab3 c WHERE k = a.i ) AS ANY_subquery ON a.i = ANY_subquery.k) ON true; that is, SEMI JOIN node inserted into the tree rather than a subquery (SJ_subquery). I posted the construct with SJ_subquery to show how I thought about the problem: I thought it's safe (even though not necessarily beautiful) to wrap the SEMI JOIN into the SJ_subquery and let the existing infrastructure decide whether it's legal to turn it into a join node. I concluded that the subquery's references to the tab1 ensure that SJ_subquery won't be flattened, so the patch does nothing if such a reference exists. > PostgreSQL supports semi-joins; but that is an implementation detail > for the EXISTS or IN syntax. ... and for ANY, see subselect.c:convert_ANY_sublink_to_join() > Could you clarify your intent? To get rid of a subplan in some cases that require it so far: when the subquery references table exactly 1 level higher (i.e. the immediate parent query). (I got the idea while reading the source code, as opposed to query tuning.) // Antonin Houska (Tony) > -- > Kevin Grittner > EDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company >
Antonin Houska <antonin.houska@gmail.com> wrote: > I used the DDLs attached (tables.ddl) for this query too, not > only for the queries in quaries.sql. Yes, if I had mentioned it > and/or qualified the 'k' column reference, it wouldn't have > broken anything. Apologies; I missed the attachments. It makes a lot more sense now that I see those. I see this was a patch originally posted on 2013-10-31 and not added to the CommitFest. I applied it to master and ran the regression tests, and one of the subselect tests failed. This query: SELECT '' AS six, f1 AS "Correlated Field", f3 AS "Second Field" FROM SUBSELECT_TBL upper WHERE f1 IN (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3); Should have this for a result set: six | Correlated Field | Second Field -----+------------------+-------------- | 2 | 4 | 3 | 5 | 1 | 1 | 2 | 2 | 3 | 3 (5 rows) But during the `make check` or `make install-check` it is missing the last two rows. Oddly, if I go into the database later and try it, the rows show up. It's not immediately apparent to me what's wrong. Will look again later, or maybe someone more familiar with the planner can spot the problem. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote: > I applied it to master and ran the regression tests, and one of > the subselect tests failed. > > This query: > > SELECT '' AS six, f1 AS "Correlated Field", f3 AS "Second > Field" > FROM SUBSELECT_TBL upper > WHERE f1 IN > (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3); > [ ... ] during the `make check` or `make install-check` [ ... ] > is missing the last two rows. Oddly, if I go into the database > later and try it, the rows show up. It's not immediately > apparent to me what's wrong. Using the v2 patch, with the default statistics from table creation, the query modified with an alias of "lower" for the second reference, just for clarity, yields a plan which generates incorrect results: Hash Join (cost=37.12..80.40 rows=442 width=12) (actual time=0.059..0.064 rows=3 loops=1) Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2)) -> Seq Scan on subselect_tbl upper (cost=0.00..27.70 rows=1770 width=16) (actual time=0.006..0.007 rows=8 loops=1) -> Hash (cost=34.12..34.12 rows=200 width=12) (actual time=0.020..0.020 rows=5 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> HashAggregate (cost=32.12..34.12 rows=200 width=12) (actual time=0.014..0.018 rows=6 loops=1) -> Seq Scan on subselect_tbl lower (cost=0.00..27.70 rows=1770 width=12) (actual time=0.002..0.004 rows=8loops=1) Total runtime: 0.111 ms As soon as there is a VACUUM and/or ANALYZE it generates a plan more like what the OP was hoping for: Hash Semi Join (cost=1.20..2.43 rows=6 width=12) (actual time=0.031..0.036 rows=5 loops=1) Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2)) -> Seq Scan on subselect_tbl upper (cost=0.00..1.08 rows=8 width=16) (actual time=0.004..0.007 rows=8 loops=1) -> Hash (cost=1.08..1.08 rows=8 width=12) (actual time=0.012..0.012 rows=7 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 1kB -> Seq Scan on subselect_tbl lower (cost=0.00..1.08 rows=8 width=12) (actual time=0.003..0.005 rows=8 loops=1) Total runtime: 0.074 ms By comparison, without the patch this is the plan: Seq Scan on subselect_tbl upper (cost=0.00..5.59 rows=4 width=12) (actual time=0.022..0.037 rows=5 loops=1) Filter: (SubPlan 1) Rows Removed by Filter: 3 SubPlan 1 -> Seq Scan on subselect_tbl lower (cost=0.00..1.12 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=8) Filter: ((upper.f2)::double precision = f3) Rows Removed by Filter: 4 Total runtime: 0.066 ms When I run the query with fresh statistics and without EXPLAIN both ways, the unpatched is consistently about 10% faster. So pulling up the subquery can yield an incorrect plan, and even when it yields the "desired" plan with the semi-join it is marginally slower than using the subplan, at least for this small data set. I think it's an interesting idea, but it still needs work. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes: > Kevin Grittner <kgrittn@ymail.com> wrote: >> I applied it to master and ran the regression tests, and one of >> the subselect tests failed. >> >> This query: >> >> SELECT '' AS six, f1 AS "Correlated Field", f3 AS "Second >> Field" >> �� FROM SUBSELECT_TBL upper >> �� WHERE f1 IN >> ���� (SELECT f2 FROM SUBSELECT_TBL WHERE CAST(upper.f2 AS float) = f3); >> [ ... ] during the `make check` or `make install-check` [ ... ] >> is missing the last two rows.� Oddly, if I go into the database >> later and try it, the rows show up.� It's not immediately >> apparent to me what's wrong. > Using the v2 patch, with the default statistics from table > creation, the query modified with an alias of "lower" for the > second reference, just for clarity, yields a plan which generates > incorrect results: > �Hash Join� (cost=37.12..80.40 rows=442 width=12) (actual time=0.059..0.064 rows=3 loops=1) > �� Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2)) > �� ->� Seq Scan on subselect_tbl upper� (cost=0.00..27.70 rows=1770 width=16) (actual time=0.006..0.007 rows=8 loops=1) > �� ->� Hash� (cost=34.12..34.12 rows=200 width=12) (actual time=0.020..0.020 rows=5 loops=1) > �������� Buckets: 1024� Batches: 1� Memory Usage: 1kB > �������� ->� HashAggregate� (cost=32.12..34.12 rows=200 width=12) (actual time=0.014..0.018 rows=6 loops=1) > �������������� ->� Seq Scan on subselect_tbl lower� (cost=0.00..27.70 rows=1770 width=12) (actual time=0.002..0.004 rows=8loops=1) > �Total runtime: 0.111 ms FWIW, that plan isn't obviously wrong; if it is broken, most likely the reason is that the HashAggregate is incorrectly unique-ifying the lower table. (Unfortunately, EXPLAIN doesn't show enough about the HashAgg to know what it's doing exactly.) The given query is, I think, in principle equivalent to SELECT ... FROM SUBSELECT_TBL upper WHERE (f1, f2::float) IN (SELECT f2, f3 FROM SUBSELECT_TBL); and if you ask unmodified HEAD to plan that you get Hash Join (cost=41.55..84.83 rows=442 width=16) Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double precision= subselect_tbl.f3)) -> Seq Scan on subselect_tbl upper (cost=0.00..27.70 rows=1770 width=16) -> Hash (cost=38.55..38.55rows=200 width=12) -> HashAggregate (cost=36.55..38.55 rows=200 width=12) -> SeqScan on subselect_tbl (cost=0.00..27.70 rows=1770 width=12) which is the same thing at the visible level of detail ... but this version computes the correct result. The cost of the HashAggregate is estimated higher, though, which suggests that maybe it's distinct'ing on two columns where the bogus plan only does one. Not sure about where Antonin's patch is going off the rails. I suspect it's too simple somehow, but it's also possible that it's OK and the real issue is some previously undetected bug in LATERAL processing. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote: > FWIW, that plan isn't obviously wrong; if it is broken, most > likely the reason is that the HashAggregate is incorrectly > unique-ifying the lower table. (Unfortunately, EXPLAIN doesn't > show enough about the HashAgg to know what it's doing exactly.) Yeah, I found myself wishing for an EXPLAIN option that would show that. > The cost of the HashAggregate is estimated higher, though, which > suggests that maybe it's distinct'ing on two columns where the > bogus plan only does one. FWIW, I noticed that the actual row counts suggested that, too. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> FWIW, that plan isn't obviously wrong; if it is broken, most >> likely the reason is that the HashAggregate is incorrectly >> unique-ifying the lower table.� (Unfortunately, EXPLAIN doesn't >> show enough about the HashAgg to know what it's doing exactly.) > Yeah, I found myself wishing for an EXPLAIN option that would show > that. It's not hard to do ... how about the attached? I chose to print grouping keys for both Agg and Group nodes, and to show them unconditionally. There's some case maybe for only including them in verbose mode, but since sort keys are shown unconditionally, it seemed more consistent this way. regards, tom lane diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c index bd5428d..9969a25 100644 *** a/src/backend/commands/explain.c --- b/src/backend/commands/explain.c *************** static void show_sort_keys(SortState *so *** 76,84 **** ExplainState *es); static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ExplainState *es); ! static void show_sort_keys_common(PlanState *planstate, ! int nkeys, AttrNumber *keycols, ! List *ancestors, ExplainState *es); static void show_sort_info(SortState *sortstate, ExplainState *es); static void show_hash_info(HashState *hashstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, --- 76,88 ---- ExplainState *es); static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors, ExplainState *es); ! static void show_agg_keys(AggState *astate, List *ancestors, ! ExplainState *es); ! static void show_group_keys(GroupState *gstate, List *ancestors, ! ExplainState *es); ! static void show_sort_group_keys(PlanState *planstate, const char *qlabel, ! int nkeys, AttrNumber *keycols, ! List *ancestors, ExplainState *es); static void show_sort_info(SortState *sortstate, ExplainState *es); static void show_hash_info(HashState *hashstate, ExplainState *es); static void show_instrumentation_count(const char *qlabel, int which, *************** ExplainNode(PlanState *planstate, List * *** 1341,1347 **** --- 1345,1358 ---- planstate, es); break; case T_Agg: + show_agg_keys((AggState *) planstate, ancestors, es); + show_upper_qual(plan->qual, "Filter", planstate, ancestors, es); + if (plan->qual) + show_instrumentation_count("Rows Removed by Filter", 1, + planstate, es); + break; case T_Group: + show_group_keys((GroupState *) planstate, ancestors, es); show_upper_qual(plan->qual, "Filter", planstate, ancestors, es); if (plan->qual) show_instrumentation_count("Rows Removed by Filter", 1, *************** show_sort_keys(SortState *sortstate, Lis *** 1693,1701 **** { Sort *plan = (Sort *) sortstate->ss.ps.plan; ! show_sort_keys_common((PlanState *) sortstate, ! plan->numCols, plan->sortColIdx, ! ancestors, es); } /* --- 1704,1712 ---- { Sort *plan = (Sort *) sortstate->ss.ps.plan; ! show_sort_group_keys((PlanState *) sortstate, "Sort Key", ! plan->numCols, plan->sortColIdx, ! ancestors, es); } /* *************** show_merge_append_keys(MergeAppendState *** 1707,1720 **** { MergeAppend *plan = (MergeAppend *) mstate->ps.plan; ! show_sort_keys_common((PlanState *) mstate, ! plan->numCols, plan->sortColIdx, ! ancestors, es); } static void ! show_sort_keys_common(PlanState *planstate, int nkeys, AttrNumber *keycols, ! List *ancestors, ExplainState *es) { Plan *plan = planstate->plan; List *context; --- 1718,1773 ---- { MergeAppend *plan = (MergeAppend *) mstate->ps.plan; ! show_sort_group_keys((PlanState *) mstate, "Sort Key", ! plan->numCols, plan->sortColIdx, ! ancestors, es); } + /* + * Show the grouping keys for an Agg node. + */ static void ! show_agg_keys(AggState *astate, List *ancestors, ! ExplainState *es) ! { ! Agg *plan = (Agg *) astate->ss.ps.plan; ! ! if (plan->numCols > 0) ! { ! /* The key columns refer to the tlist of the child plan */ ! ancestors = lcons(astate, ancestors); ! show_sort_group_keys(outerPlanState(astate), "Group Key", ! plan->numCols, plan->grpColIdx, ! ancestors, es); ! ancestors = list_delete_first(ancestors); ! } ! } ! ! /* ! * Show the grouping keys for a Group node. ! */ ! static void ! show_group_keys(GroupState *gstate, List *ancestors, ! ExplainState *es) ! { ! Group *plan = (Group *) gstate->ss.ps.plan; ! ! /* The key columns refer to the tlist of the child plan */ ! ancestors = lcons(gstate, ancestors); ! show_sort_group_keys(outerPlanState(gstate), "Group Key", ! plan->numCols, plan->grpColIdx, ! ancestors, es); ! ancestors = list_delete_first(ancestors); ! } ! ! /* ! * Common code to show sort/group keys, which are represented in plan nodes ! * as arrays of targetlist indexes ! */ ! static void ! show_sort_group_keys(PlanState *planstate, const char *qlabel, ! int nkeys, AttrNumber *keycols, ! List *ancestors, ExplainState *es) { Plan *plan = planstate->plan; List *context; *************** show_sort_keys_common(PlanState *plansta *** 1748,1754 **** result = lappend(result, exprstr); } ! ExplainPropertyList("Sort Key", result, es); } /* --- 1801,1807 ---- result = lappend(result, exprstr); } ! ExplainPropertyList(qlabel, result, es); } /* diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out index c05b39c..1a0ca5c 100644 *** a/src/test/regress/expected/aggregates.out --- b/src/test/regress/expected/aggregates.out *************** explain (costs off) *** 659,670 **** QUERY PLAN --------------------------------------------------------------------- HashAggregate InitPlan 1 (returns $0) -> Limit -> Index Only Scan Backward using tenk1_unique2 on tenk1 Index Cond: (unique2 IS NOT NULL) -> Result ! (6 rows) select distinct max(unique2) from tenk1; max --- 659,671 ---- QUERY PLAN --------------------------------------------------------------------- HashAggregate + Group Key: $0 InitPlan 1 (returns $0) -> Limit -> Index Only Scan Backward using tenk1_unique2 on tenk1 Index Cond: (unique2 IS NOT NULL) -> Result ! (7 rows) select distinct max(unique2) from tenk1; max *************** explain (costs off) *** 806,811 **** --- 807,813 ---- QUERY PLAN ---------------------------------------------------------------------------------------------- HashAggregate + Group Key: $0, $1 InitPlan 1 (returns $0) -> Limit -> Merge Append *************** explain (costs off) *** 831,837 **** -> Index Only Scan Backward using minmaxtest3i on minmaxtest3 minmaxtest3_1 Index Cond: (f1 IS NOT NULL) -> Result ! (26 rows) select distinct min(f1), max(f1) from minmaxtest; min | max --- 833,839 ---- -> Index Only Scan Backward using minmaxtest3i on minmaxtest3 minmaxtest3_1 Index Cond: (f1 IS NOT NULL) -> Result ! (27 rows) select distinct min(f1), max(f1) from minmaxtest; min | max diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out index 9ba2b9a..31751eb 100644 *** a/src/test/regress/expected/matview.out --- b/src/test/regress/expected/matview.out *************** EXPLAIN (costs off) *** 22,29 **** QUERY PLAN --------------------- HashAggregate -> Seq Scan on t ! (2 rows) CREATE MATERIALIZED VIEW tm AS SELECT type, sum(amt) AS totamt FROM t GROUP BY type WITH NO DATA; SELECT relispopulated FROM pg_class WHERE oid = 'tm'::regclass; --- 22,30 ---- QUERY PLAN --------------------- HashAggregate + Group Key: type -> Seq Scan on t ! (3 rows) CREATE MATERIALIZED VIEW tm AS SELECT type, sum(amt) AS totamt FROM t GROUP BY type WITH NO DATA; SELECT relispopulated FROM pg_class WHERE oid = 'tm'::regclass; *************** EXPLAIN (costs off) *** 59,66 **** Sort Sort Key: t.type -> HashAggregate -> Seq Scan on t ! (4 rows) CREATE MATERIALIZED VIEW tvm AS SELECT * FROM tv ORDER BY type; SELECT * FROM tvm; --- 60,68 ---- Sort Sort Key: t.type -> HashAggregate + Group Key: t.type -> Seq Scan on t ! (5 rows) CREATE MATERIALIZED VIEW tvm AS SELECT * FROM tv ORDER BY type; SELECT * FROM tvm; *************** EXPLAIN (costs off) *** 82,89 **** --------------------------- Aggregate -> HashAggregate -> Seq Scan on t ! (3 rows) CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv; CREATE VIEW tvvmv AS SELECT * FROM tvvm; --- 84,92 ---- --------------------------- Aggregate -> HashAggregate + Group Key: t.type -> Seq Scan on t ! (4 rows) CREATE MATERIALIZED VIEW tvvm AS SELECT * FROM tvv; CREATE VIEW tvvmv AS SELECT * FROM tvvm; diff --git a/src/test/regress/expected/union.out b/src/test/regress/expected/union.out index ae690cf..6f9ee5e 100644 *** a/src/test/regress/expected/union.out --- b/src/test/regress/expected/union.out *************** explain (costs off) *** 494,505 **** QUERY PLAN --------------------------------------------------- HashAggregate -> Append -> Index Scan using t1_ab_idx on t1 Index Cond: ((a || b) = 'ab'::text) -> Index Only Scan using t2_pkey on t2 Index Cond: (ab = 'ab'::text) ! (6 rows) reset enable_seqscan; reset enable_indexscan; --- 494,506 ---- QUERY PLAN --------------------------------------------------- HashAggregate + Group Key: ((t1.a || t1.b)) -> Append -> Index Scan using t1_ab_idx on t1 Index Cond: ((a || b) = 'ab'::text) -> Index Only Scan using t2_pkey on t2 Index Cond: (ab = 'ab'::text) ! (7 rows) reset enable_seqscan; reset enable_indexscan; *************** SELECT * FROM *** 552,568 **** SELECT 2 AS t, 4 AS x) ss WHERE x < 4 ORDER BY x; ! QUERY PLAN ! -------------------------------- Sort Sort Key: ss.x -> Subquery Scan on ss Filter: (ss.x < 4) -> HashAggregate -> Append -> Result -> Result ! (8 rows) SELECT * FROM (SELECT 1 AS t, generate_series(1,10) AS x --- 553,570 ---- SELECT 2 AS t, 4 AS x) ss WHERE x < 4 ORDER BY x; ! QUERY PLAN ! -------------------------------------------------------- Sort Sort Key: ss.x -> Subquery Scan on ss Filter: (ss.x < 4) -> HashAggregate + Group Key: (1), (generate_series(1, 10)) -> Append -> Result -> Result ! (9 rows) SELECT * FROM (SELECT 1 AS t, generate_series(1,10) AS x diff --git a/src/test/regress/expected/window.out b/src/test/regress/expected/window.out index 1e6365b..0f21fcb 100644 *** a/src/test/regress/expected/window.out --- b/src/test/regress/expected/window.out *************** explain (costs off) *** 619,630 **** select first_value(max(x)) over (), y from (select unique1 as x, ten+four as y from tenk1) ss group by y; ! QUERY PLAN ! ------------------------------- WindowAgg -> HashAggregate -> Seq Scan on tenk1 ! (3 rows) -- test non-default frame specifications SELECT four, ten, --- 619,631 ---- select first_value(max(x)) over (), y from (select unique1 as x, ten+four as y from tenk1) ss group by y; ! QUERY PLAN ! --------------------------------------------- WindowAgg -> HashAggregate + Group Key: (tenk1.ten + tenk1.four) -> Seq Scan on tenk1 ! (4 rows) -- test non-default frame specifications SELECT four, ten,
On 12/11/2013 10:15 PM, Tom Lane wrote: > > FWIW, that plan isn't obviously wrong; if it is broken, most likely the > reason is that the HashAggregate is incorrectly unique-ifying the lower > table. (Unfortunately, EXPLAIN doesn't show enough about the HashAgg > to know what it's doing exactly.) The given query is, I think, in > principle equivalent to > > SELECT ... > FROM SUBSELECT_TBL upper > WHERE (f1, f2::float) IN > (SELECT f2, f3 FROM SUBSELECT_TBL); > > and if you ask unmodified HEAD to plan that you get > > Hash Join (cost=41.55..84.83 rows=442 width=16) > Hash Cond: ((upper.f1 = subselect_tbl.f2) AND ((upper.f2)::double precision = subselect_tbl.f3)) > -> Seq Scan on subselect_tbl upper (cost=0.00..27.70 rows=1770 width=16) > -> Hash (cost=38.55..38.55 rows=200 width=12) > -> HashAggregate (cost=36.55..38.55 rows=200 width=12) > -> Seq Scan on subselect_tbl (cost=0.00..27.70 rows=1770 width=12) Before I opened your mail, I also recalled the technique that I noticed in the planner code, to evaluate SEMI JOIN as INNER JOIN with the RHS uniquified, so also thought it could be about the uniquification. > which is the same thing at the visible level of detail ... but this > version computes the correct result. The cost of the HashAggregate is > estimated higher, though, which suggests that maybe it's distinct'ing on > two columns where the bogus plan only does one. debug_print_plan output contains :grpColIdx 2 in the AGG node. I think this corresponds to the join condition, which IMO should be(upper.f1 = subselect_tbl.f2) while the other condition was not in the list of join clauses and therefore ignored for the uniquification's purpose. And gdb tells me that create_unique_path() never gets more than 1 clause. I can't tell whether it should do just for this special purpose. > Not sure about where Antonin's patch is going off the rails. I suspect > it's too simple somehow, but it's also possible that it's OK and the > real issue is some previously undetected bug in LATERAL processing. So far I have no idea how to achieve such conditions without this patch. Thanks for your comments. // Antonin Houska (Tony)
Antonin Houska <antonin.houska@gmail.com> writes: > debug_print_plan output contains > :grpColIdx 2 > in the AGG node. Hm, that means there's only one grouping column (and it's the second tlist entry of the child plan node). So that seems conclusive that the unique-ification is being done wrong. It's not very clear why though. It doesn't seem like your patch is doing anything that would directly affect that. For comparison purposes, using the patch I just posted, I get this description of a correct plan: regression=# explain verbose SELECT * FROM SUBSELECT_TBL upper WHERE (f1, f2::float) IN (SELECT f2, f3 FROM SUBSELECT_TBL); QUERY PLAN ----------------------------------------------------------------------------------------------------Hash Join (cost=41.55..84.83rows=442 width=16) Output: upper.f1, upper.f2, upper.f3 Hash Cond: ((upper.f1 = subselect_tbl.f2) AND((upper.f2)::double precision = subselect_tbl.f3)) -> Seq Scan on public.subselect_tbl upper (cost=0.00..27.70 rows=1770width=16) Output: upper.f1, upper.f2, upper.f3 -> Hash (cost=38.55..38.55 rows=200 width=12) Output:subselect_tbl.f2, subselect_tbl.f3 -> HashAggregate (cost=36.55..38.55 rows=200 width=12) Output:subselect_tbl.f2, subselect_tbl.f3 Group Key: subselect_tbl.f2, subselect_tbl.f3 -> SeqScan on public.subselect_tbl (cost=0.00..27.70 rows=1770 width=12) Output: subselect_tbl.f1, subselect_tbl.f2,subselect_tbl.f3 (12 rows) so it's unique-ifying on both f2 and f3, which is clearly necessary for executing the IN with a plain join. regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes: >> Yeah, I found myself wishing for an EXPLAIN option that would show >> that. > > It's not hard to do ... how about the attached? +1 > I chose to print grouping keys for both Agg and Group nodes, and to > show them unconditionally. There's some case maybe for only including > them in verbose mode, but since sort keys are shown unconditionally, > it seemed more consistent this way. +1 Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Tom Lane <tgl@sss.pgh.pa.us> wrote: > Hm, that means there's only one grouping column (and it's the second > tlist entry of the child plan node). So that seems conclusive that > the unique-ification is being done wrong. Further confirmation using the EXPLAIN patch with Antonin's v2 patch against the table before any EXPLAIN or ANALYZE: Hash Join (cost=37.12..80.40 rows=442 width=12) Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2)) -> Seq Scan on subselect_tbl upper (cost=0.00..27.70 rows=1770 width=16) -> Hash (cost=34.12..34.12 rows=200 width=12) -> HashAggregate (cost=32.12..34.12 rows=200 width=12) Group Key: lower.f2 -> Seq Scan on subselect_tbl lower (cost=0.00..27.70 rows=1770 width=12) The additional information is so useful, I'm all for committing that patch. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes: > Further confirmation using the EXPLAIN patch with Antonin's v2 > patch against the table before any EXPLAIN or ANALYZE: > �Hash Join� (cost=37.12..80.40 rows=442 width=12) > �� Hash Cond: (((upper.f2)::double precision = lower.f3) AND (upper.f1 = lower.f2)) > �� ->� Seq Scan on subselect_tbl upper� (cost=0.00..27.70 rows=1770 width=16) > �� ->� Hash� (cost=34.12..34.12 rows=200 width=12) > �������� ->� HashAggregate� (cost=32.12..34.12 rows=200 width=12) > �������������� Group Key: lower.f2 > �������������� ->� Seq Scan on subselect_tbl lower� (cost=0.00..27.70 rows=1770 width=12) That's about what I thought: it's unique-ifying according to the original semijoin qual, without realizing that the pulled-up clause from the lower WHERE would need to be treated as part of the semijoin qual. This isn't a bug in the existing code, because the case can never arise, since we don't treat an IN/=ANY as a semijoin if the sub-select contains any outer-level Vars. But if you remove that check from convert_ANY_sublink_to_join then you've got to deal with the problem. That said, I'm not too sure where the problem is in detail. I'd have thought that subquery pullup would stick the subquery's WHERE clause into the join quals of the parent JoinExpr node. Is that not happening, or is it perhaps not sufficient to cue the UniquePath machinery? > The additional information is so useful, I'm all for committing > that patch. Will do. regards, tom lane
I wrote: > That's about what I thought: it's unique-ifying according to the original > semijoin qual, without realizing that the pulled-up clause from the lower > WHERE would need to be treated as part of the semijoin qual. This isn't > a bug in the existing code, because the case can never arise, since we > don't treat an IN/=ANY as a semijoin if the sub-select contains any > outer-level Vars. But if you remove that check from > convert_ANY_sublink_to_join then you've got to deal with the problem. > That said, I'm not too sure where the problem is in detail. I'd have > thought that subquery pullup would stick the subquery's WHERE clause > into the join quals of the parent JoinExpr node. Is that not happening, > or is it perhaps not sufficient to cue the UniquePath machinery? BTW, on further thought, I'm afraid this is a bigger can of worms than it appears. The remarks above presume that the subquery is simple enough to be pulled up, which is the case in this example. It might not be too hard to make that case work. But what if the subquery *isn't* simple enough to be pulled up --- for instance, it includes grouping or aggregation? Then there's no way to unify its WHERE clause with the upper semijoin qual. At the very least, this breaks the uniqueify-then-do-a- plain-join implementation strategy for semijoins. So I'm now thinking this patch isn't worth pursuing. Getting all the corner cases right would be a significant amount of work, and in the end it would only benefit strangely-written queries. regards, tom lane
On 12/12/2013 04:36 PM, Tom Lane wrote: > BTW, on further thought, I'm afraid this is a bigger can of worms than > it appears. The remarks above presume that the subquery is simple enough > to be pulled up, which is the case in this example. It might not be too > hard to make that case work. But what if the subquery *isn't* simple > enough to be pulled up --- for instance, it includes grouping or > aggregation? Then there's no way to unify its WHERE clause with the upper > semijoin qual. At the very least, this breaks the uniqueify-then-do-a- > plain-join implementation strategy for semijoins. After having thought about it further, I think I understand. > So I'm now thinking this patch isn't worth pursuing. Getting all the > corner cases right would be a significant amount of work, and in the > end it would only benefit strangely-written queries. Originally it seemed to me that I just (luckily) found a new opportunity for the existing infrastructure. To change the infrastructure because of this small feature would be exactly the opposite. Thanks for having taken a look at it. // Antonin Houska (Tony)