Implied BETWEEN from join quals

From:
Simon Riggs
Date:
I was recently nudged to describe an optimisation of merge
joins/sorts. Rather than describe that, I've looked at the general
case:

There are some additional implications we may make when joining
tables... a particularly interesting one is that

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col

can be rewritten as

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE
(
Fact.col BETWEEN (SELECT min(col) FROM Dim)
AND (SELECT max(col) FROM Dim)
)
AND
(
Dim.col BETWEEN (SELECT min(col) FROM Fact)
AND (SELECT max(col) FROM Fact)
)

which also works similarly for anti/semi-joins.

If we have indexes on A.col and B.col then these additional quals can
be derived cheaply at run-time and could have an important effect on
optimisation.
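
For concreteness, here is a minimal sketch of a schema this applies to; the
table names, index and row counts are assumptions chosen to line up with the
example plans further down, not something taken from the thread:

-- Toy star schema: a large fact table with an index on the join column,
-- and a tiny dimension table carrying the restriction column.
CREATE TABLE dim  (col int, other int);
CREATE TABLE fact (col int);
CREATE INDEX fact_col_idx ON fact (col);

INSERT INTO dim VALUES (1, 1), (2, 2);
INSERT INTO fact SELECT 1 FROM generate_series(1, 100000);
ANALYZE dim;
ANALYZE fact;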

1) With various kinds of index, we would be able to use these implied
quals to further restrict the scan. Perhaps that doesn't sound very
interesting, but it is very important when solving an "outside-in"
join on a star schema, such as...

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

since there is no qual that can be applied directly to the Fact table,
causing us to scan the entire table.

We can rewrite this query as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

Note that the implied qual on the Dim table has been dropped as
uninteresting. This is because we can calculate the cost and potential
benefit of applying the rewrite, allowing us to discard one or both
implied clauses.

Note also that this has nothing to do with join order. This is solely
about making inferences using the join quals between any two tables.

2) We calculate the join selectivity by comparing the MFVs of the join
columns on the tables being joined. ISTM that we could also use the min()
and max() values to refine that selectivity estimate, which is often
wrong otherwise.

- - -

The current planner doesn't add these predicates automatically, but if
it did, it would come up with the following slightly sub-optimal
plan...

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
Aggregate  (cost=31.79..31.80 rows=1 width=0) (actual rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
          ->  Seq Scan on dim dim_1  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                Filter: (other = 1)
                Rows Removed by Filter: 1
  InitPlan 2 (returns $1)
    ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
          ->  Seq Scan on dim dim_2  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                Filter: (other = 1)
                Rows Removed by Filter: 1
  ->  Merge Join  (cost=1.33..29.09 rows=250 width=0) (actual rows=100000 loops=1)
        Merge Cond: (dim.col = fact.col)
        ->  Sort  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
              Sort Key: dim.col
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                    Filter: (other = 1)
                    Rows Removed by Filter: 1
        ->  Index Only Scan using fact_col_idx on fact  (cost=0.29..24.29 rows=500 width=4) (actual rows=100000 loops=1)
              Index Cond: ((col >= $0) AND (col <= $1))
              Heap Fetches: 100000

which is sub-optimal only because of the mis-estimation of the effect
of the min() and max() quals, so we would still benefit from resolving
the parameters to constants before planning the main query. A better
plan would be

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND Fact.col BETWEEN 1 AND 1
Aggregate  (cost=2944.04..2944.05 rows=1 width=0) (actual rows=1 loops=1)
  ->  Hash Join  (cost=1.04..2819.04 rows=50000 width=0) (actual rows=100000 loops=1)
        Hash Cond: (fact.col = dim.col)
        ->  Seq Scan on fact  (cost=0.00..1943.00 rows=100000 width=4) (actual rows=100000 loops=1)
              Filter: ((col >= 1) AND (col <= 1))
        ->  Hash  (cost=1.02..1.02 rows=1 width=4) (actual rows=1 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 1kB
              ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                    Filter: (other = 1)
                    Rows Removed by Filter: 1

but we can probably live with that, as we do with other dynamic index plans.

So when can we use this?

The additional cost of adding this to the query is
* additional qual:  2 * cpu_operator_cost * rows
* getting values: 2 * cost of getting one row from index
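
The "getting values" part is cheap because, on an indexed column, min() and
max() are already planned as single index descents; a quick way to confirm
that (standard behaviour, shown as a sketch rather than output from the
tests above):

EXPLAIN SELECT min(col) FROM fact;  -- InitPlan: Limit over an Index Only Scan
EXPLAIN SELECT max(col) FROM fact;  -- likewise, scanning the index backwards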

We get a reduced-cost plan if the additional quals lower the total cost
of the plan/scan, either because they significantly reduce selectivity
or because they allow an index plan. We don't know in advance what
reduction is possible, if any, so I suggest we look at the base rel plan
near the bottom of set_plain_rel_pathlist(). If the cost of scanning the
base rel is high in comparison with the cost of the additional quals,
then we add them: if there turns out to be no benefit we lose only a
small amount, whereas if there is a benefit it could be substantial.

Proposal: in set_plain_rel_pathlist(), if the total cost is already greater
than 100 times the cost of adding the quals, then add the additional implied
quals and look again for index/seq plans.

It is possible that this could be implemented for 9.5, if it is
considered a good idea.

Good?

- - -

Current proposal ends there, but there is a further optimization that
allows us to remove the join altogether if
* There is a FK between Fact and Dim
* The join is an anti or semi join
* There are no rows in Dim whose join column value lies between the Min()
and Max() yet fail the restriction on Dim

It seems a little less practical, as will become clear...

This optimization allows

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

to be written as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact
WHERE
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
AND
/* when */
(0 = (
SELECT count(*)
FROM Dim
WHERE
(
Dim.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
AND
NOT (Dim.other = 1)
));

which looks unbelievably complex, but performs quite well.

For those with a long memory, you may recognise my Proof Join from
2007. Note that my Proof Join contained the (mistaken) requirement
that the key values were contiguous whereas in fact that doesn't
matter, as shown. This has also been described as the BETWEEN join
rewrite (Abadi et al 2008).

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Implied BETWEEN from join quals

From:
Stephen Frost
Date:
Simon,

This all looks good, and at the risk of being slightly off-topic for
this thread, I just wanted to mention..

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Current proposal ends there, but there is a further optimization that
> allows us to remove the join altogether if
> * There is a FK between Fact and Dim

It'd be great if we could start by looking for the above requirement and
then doing join-removal when it exists and no columns from Dim are
referenced (outside of the FK reference, which must result in 'true' or
we've screwed something up).
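
In concrete terms (just a sketch, reusing the toy fact/dim schema from
upthread; the added constraints are assumptions needed for the removal to
be safe, not something already present in that example):

ALTER TABLE dim  ADD PRIMARY KEY (col);
ALTER TABLE fact ALTER COLUMN col SET NOT NULL;
ALTER TABLE fact ADD FOREIGN KEY (col) REFERENCES dim (col);

-- No Dim columns are referenced and the FK guarantees every fact row has a
-- match, so the join contributes nothing and could in principle be removed:
SELECT count(*)
FROM fact JOIN dim ON fact.col = dim.col;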

I had been looking at this about a month ago and just ran out of time to
play with it, but it doesn't seem like it'd be terribly difficult to
do..
Thanks,
    Stephen

Re: Implied BETWEEN from join quals

From:
Simon Riggs
Date:
On 22 April 2014 17:00, Stephen Frost <sfrost@snowman.net> wrote:
> Simon,
>
> This all looks good, and at the risk of being slightly off-topic for
> this thread, I just wanted to mention..
>
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>> Current proposal ends there, but there is a further optimization that
>> allows us to remove the join altogether if
>> * There is a FK between Fact and Dim
>
> It'd be great if we could start by looking for the above requirement and
> then doing join-removal when it exists and no columns from Dim are
> referenced (outside of the FK reference, which must result in 'true' or
> we've screwed something up).
>
> I had been looking at this about a month ago and just ran out of time to
> play with it, but it doesn't seem like it'd be terribly difficult to
> do..

Yeh, it was on my list. I had a few optimizer changes I've been
looking to make for a while.

Hopefully 9.5 is the year I get the chance at a running start on that
since they're never as easy as you'd hope.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: Implied BETWEEN from join quals

From:
Robert Haas
Date:
On Tue, Apr 22, 2014 at 11:44 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> I was recently nudged to describe an optimisation of merge
> joins/sorts. Rather than describe that, I've looked at the general
> case:
>
> There are some additional implications we may make when joining
> tables... a particularly interesting one is that
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
>
> can be rewritten as
>
> SELECT *
> FROM Fact JOIN Dim on Fact.col = Dim.col
> WHERE
> (
> Fact.col BETWEEN
>  (SELECT min(col) FROM Dim)
> AND
>  (SELECT max(col) FROM Dim)
> )
> AND
> (
> Dim.col BETWEEN
>  (SELECT min(col) FROM Fact)
> AND
>  (SELECT max(col) FROM Fact)
> )
>
> which also works similarly for anti/semi-joins.
>
> If we have indexes on A.col and B.col then these additional quals can
> be derived cheaply at run-time and could have an important effect on
> optimisation.

Interesting.  This can pretty obviously produce a big win if things
break our way.  But it'll take some effort to determine whether the
range of possible values for the join column is narrow enough to make
the inferred BETWEEN clause selective enough to be useful, and that
effort will be wasted if it turns out that the answer is no.  I can
certainly think of times when this would have helped me, so I kind of
like the idea in theory ... but I won't be surprised if it turns out
to be possible to construct realistic test cases where trying to
figure this out causes too much increase in planning time.

Actually, it seems possible to me that this could end up winning even
if it doesn't enable the use of an index scan rather than a sequential
scan, because eliminating rows during the scan is typically much
cheaper than eliminating them during the join step.  But it also seems
possible that it could lose heavily in that case, if the extra steps
during the scan phase don't eliminate enough rows to reduce the join
cost to a sufficient degree.  I'm not really sure how it will work out
on balance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Implied BETWEEN from join quals

From:
Simon Riggs
Date:
On 28 April 2014 15:01, Robert Haas <robertmhaas@gmail.com> wrote:

> Interesting.  This can pretty obviously produce a big win if things
> break our way.  But it'll take some effort to determine whether the
> range of possible values for the join column is narrow enough to make
> the inferred BETWEEN clause selective enough to be useful, and that
> effort will be wasted if it turns out that the answer is no.  I can
> certainly think of times when this would have helped me, so I kind of
> like the idea in theory ... but I won't be surprised if it turns out
> to be possible to construct realistic test cases where trying to
> figure this out causes too much increase in planning time.
>
> Actually, it seems possible to me that this could end up winning even
> if it doesn't enable the use of an index scan rather than a sequential
> scan, because eliminating rows during the scan is typically much
> cheaper than eliminating them during the join step.  But it also seems
> possible that it could lose heavily in that case, if the extra steps
> during the scan phase don't eliminate enough rows to reduce the join
> cost to a sufficient degree.  I'm not really sure how it will work out
> on balance.

I agree it's not always a win and that the planning overhead is a
little high. That's why I suggest we only attempt to calculate such a
plan if the normal plan for the scan is already expensive; that way the
extra planning is worth it. I'm sure there are other optimizations
that we might only wish to consider for plans that are otherwise high
cost. This need hardly touch the main planning path for short queries
at all.

The extra run-time cost of the test adds 2*cpu_operator_cost to the
scan. However, it will save at least (OldSelectivity -
NewSelectivity)*(cpu_tuple_cost + cpu_operator_cost) in the join node
directly above it (and possibly more above that - though without being
able to inspect the join tree we won't know that until later - and the
output plan might actually change the later join plan anyway). So we
can calculate the threshold at which we should choose the modified
plan, though it will likely be a conservative value.
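
Taking that formula at face value with the default cost parameters
(cpu_operator_cost = 0.0025, cpu_tuple_cost = 0.01), the per-row break-even
point works out like this; purely illustrative arithmetic, not a planner rule:

SELECT 2 * 0.0025                   AS extra_cost_per_row,       -- the two added comparisons
       0.01 + 0.0025                AS saving_per_filtered_row,  -- per-tuple cost avoided at the join
       2 * 0.0025 / (0.01 + 0.0025) AS break_even_selectivity_drop;  -- = 0.4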

The new selectivity can be calculated just as other selectivities, I
see no problem there.

So we have a proposed way to limit planning time in cases where the
benefit would be low. Plus we have a way to estimate the reduction in
cost once we add the new tests.

The real value is whether we can use the additional tests to pick up
an index we hadn't been able to use before. Given that MinMax index
use will be a win even for fairly high selectivities (because of the
low cost of index access), it looks like this optimization will play
very nicely with MinMax.
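
(For reference, the MinMax index work referred to here eventually shipped in
9.5 under the name BRIN; assuming that is the index type meant, combining the
two would look something like the following.)

-- A BRIN index stores per-block-range min/max summaries, so even a fairly
-- wide implied BETWEEN can skip most of a large fact table cheaply.
CREATE INDEX fact_col_brin ON fact USING brin (col);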

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services