[HACKERS] Pulling up more complicated subqueries

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема [HACKERS] Pulling up more complicated subqueries
Дата
Msg-id 67e353e8-c20e-7980-a282-538779edf25b@iki.fi
обсуждение исходный текст
Ответы Re: [HACKERS] Pulling up more complicated subqueries  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Pulling up more complicated subqueries  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Список pgsql-hackers
Hi,

I spent some time staring at TPC-DS benchmark's query 6. It contains a 
somewhat complicated subquery, and most of the time spent on that query 
is currently spent on executing the subquery again and again. The 
essence of the query boils down to this:

CREATE TABLE foo (i int4, j int4);
CREATE TABLE bar (i int4, j int4);

INSERT INTO foo SELECT g%10, g FROM generate_series(1, 10000) g;
INSERT INTO bar SELECT g%10, g FROM generate_series(1, 10000) g;


SELECT * FROM foo
WHERE foo.j >= 1.2 *  (SELECT avg(bar.j) FROM bar WHERE foo.i = bar.i);


The planner can pull up simpler subqueries, converting them to joins, 
but unfortunately this case is beyond its capabilities. If it was smart 
enough, it could be transformed into something like this:

SELECT * FROM foo
LEFT JOIN (SELECT avg(bar.j) as avg, bar.i FROM bar GROUP BY bar.i) as 
avg_bar ON foo.i = avg_bar.i
WHERE foo.j >= 1.2 * avg_bar.avg


There was some brief discussion on doing this kind of transformation 
back in 2011, at 
https://www.postgresql.org/message-id/4E2F163B.6060105%40gmail.com. I 
think we're in a better position to do that now than back then, thanks 
to all the other planner work that's been done since.

So how do we get there? I've identified a number of smaller steps that, 
when all put together, can handle this, as well as many other queries of 
course.

0. This simple case is already handled:

SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar)

It's turned into a semi-join. The next steps will extend this basic 
capability, to handle more complicated cases.

postgres=# explain SELECT * FROM foo WHERE foo.j IN (select bar.j from bar);                               QUERY PLAN
------------------------------------------------------------------------- Hash Join  (cost=42.75..93.85 rows=1130
width=8)  Hash Cond: (foo.j = bar.j)   ->  Seq Scan on foo  (cost=0.00..32.60 rows=2260 width=8)   ->  Hash
(cost=40.25..40.25rows=200 width=4)         ->  HashAggregate  (cost=38.25..40.25 rows=200 width=4)               Group
Key:bar.j               ->  Seq Scan on bar  (cost=0.00..32.60 rows=2260 width=4)
 
(7 rows)


1. Currently, the IN/EXISTS pullup is only done when the IN/EXISTS is a 
top-level qual. For example, this isn't currently pulled-up:

SELECT * FROM foo
WHERE foo.j IN (select bar.j from bar) OR foo.j < 10;

That's not a straight semi-join, but we could still turn it into a new 
kind of LEFT-SEMI join. A left-semi join is like a left join, in that it 
returns all rows from the left side, and NULLs for any non-matches. And 
like a semi-join, it returns only one matching row from the right-side, 
if there are duplicates. In the qual, replace the SubLink with an IS NOT 
NULL test. So logically, the above would be converted into:

SELECT * FROM foo
/* SEMI- */ LEFT JOIN (select bar.j from bar) ON foo.j = bar.j
WHERE bar.j IS NOT NULL OR foo.j < 10

This transformation can be applied to IN/EXIST type SubPlan references 
anywhere in the query, not only those in the WHERE clause.


2. Using a similar trick, we can transform subqueries that are not of 
the IN/EXISTS kind. (This isn't useful on its own, because this example 
would be handled as an InitPlan, which performs well. But it's a 
stepping stone in explaining this.)

For example:

SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar)

This can be implemented using yet another new join type, a LEFT-UNIQUE 
join. It's like a LEFT JOIN, but it must check that there are no 
duplicates in the right-hand-side, and throw an error if there are 
(ERROR:  more than one row returned by a subquery used as an expression).

SELECT * FROM foo
/* unique-checking */ LEFT JOIN (select avg(bar.j) as min_j from bar) as 
subq ON TRUE
WHERE foo.j > subq.min_j


3. All the previous examples are uncorrelated subqueries, but all these 
transformations can also be performed on correlated subqueries. The 
difference is that the subquery that's pulled up to the range table must 
be marked as LATERAL. For example:

SELECT * FROM foo
WHERE foo.j > (select avg(bar.j) from bar where bar.i = foo.i)

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select avg(bar.j) from bar 
where bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j


4. The previous transformation isn't very interesting on its own, 
because the only way to implement a LATERAL join is a nested loop join, 
which is essentially the same as executing the subplan the naive way. 
But we can potentially de-correlate it, by pulling up the correlation qual:


SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select bar.j, bar.i from bar 
WHERE bar.i = foo.i) AS subq(j, i) ON true
WHERE foo.j > subq.j

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select bar.j, bar.i from bar) AS 
subq(j, i) ON subq.i = foo.i
WHERE foo.j > subq.j


This will get inlined into the parent query, getting rid of the subquery 
altogether. But even if that cannot be done, this is potentially much 
better. (I think we already do such de-correlation, at least in simple 
cases, actually).


5. If the subquery contains aggregation, we can still perform the 
previous transformations, by adding the correlation vars to the GROUP 
BY. Like this:

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN LATERAL (select AVG(bar.j) from bar 
WHERE bar.i = foo.i) AS subq(j) ON true
WHERE foo.j > subq.j

->

SELECT * FROM foo
/* UNIQUE-CHECKING */ LEFT JOIN (select AVG(bar.j), bar.i from bar GROUP 
BY bar.i) AS subq(j, i) ON foo.i = subq.i
WHERE foo.j > subq.j


6. De-correlating the subquery is not necessary a win. Consider the 
previous query, if 'foo' is very small, and 'bar' is a huge. The 
parameterized plan we generate for a correlated LATERAL subquery, with a 
nested loop join, is actually better in that case.

If we teach the planner to consider parameterized nested loop plans for 
non-lateral subqueries, by pushing down join quals back to the subquery. 
There's a comment on this in set_subquery_pathlist:
 * We don't currently support generating parameterized paths for subqueries * by pushing join clauses down into them;
itseems too expensive to 
 
re-plan * the subquery multiple times to consider different alternatives. * (XXX that could stand to be reconsidered,
nowthat we use Paths.)
 


In the old thread from 2011 that I mentioned above, Tom said:

> This leads me to think that we need to represent both cases as the same
> sort of query and make a cost-based decision as to which way to go.
> Thinking of it as a pull-up or push-down transformation is the wrong
> approach because those sorts of transformations are done too early to
> be able to use cost comparisons.

This last step, to push down join quals to a subquery, allows the 
planner to make a cost-based decision. A nested-loop, with the join qual 
pushed back down to the subquery, isn't exactly the same as the SubPlan, 
but it should be comparable in performance.


Thoughts? I'm planning to work on this in the next release cycle, 
although I'm not too familiar with the planner, and I don't know if I'll 
be distracted with something more shiny, so no promises on any results.

As a final note, I found an interesting paper called "Unnesting 
Arbitrary Queries", by Thomas Neumann and Alfons Kemper 
(http://www.btw-2015.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf). 
It describes a series of transformations that can be applied to 
de-correlate (de-lateralize?) any lateral subquery join. The trick is to 
build a temporary relation that contains all the distinct outer values 
of the join, and pass that relation down to the subquery. So instead of 
"executing" the subquery for every outer row, passing the outer values 
as parameters (using PostgreSQL terminology), you collect a list of all 
the outer values first, and then execute the subquery only once. In the 
subquery, you perform a join with the temporary relation. And then the 
paper describes a bunch of rules and optimizations, that can optimize
away the temporary relation in many cases. That might be one way to 
implement the de-lateralization steps in PostgreSQL.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Bossart, Nathan"
Дата:
Сообщение: Re: [HACKERS] [Proposal] Allow users to specify multiple tables inVACUUM commands
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] [COMMITTERS] pgsql: Preventive maintenance in advance of pgindent run.