Implied BETWEEN from join quals

From Simon Riggs
Subject Implied BETWEEN from join quals
Date
Msg-id CA+U5nMLYf2cgbq+YUw-ArLBTcPrqanBf5QiFEC-PBRJCFzOngg@mail.gmail.com
Whole thread Raw
Responses Re: Implied BETWEEN from join quals  (Stephen Frost <sfrost@snowman.net>)
Re: Implied BETWEEN from join quals  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
I was recently nudged to describe an optimisation of merge
joins/sorts. Rather than describe that, I've looked at the general
case:

There are some additional inferences we can draw when joining
tables... a particularly interesting one is that

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col

can be rewritten as

SELECT *
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE
(
Fact.col BETWEEN (SELECT min(col) FROM Dim)
             AND (SELECT max(col) FROM Dim)
)
AND
(
Dim.col BETWEEN (SELECT min(col) FROM Fact)
            AND (SELECT max(col) FROM Fact)
)

which also works similarly for anti/semi-joins.
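The semantic equivalence of the two forms is easy to check mechanically on any SQL engine; a minimal sketch using Python's sqlite3, with toy data standing in for the Fact/Dim tables above:

```python
import sqlite3

# In-memory demo that the BETWEEN-augmented query returns exactly the
# same rows as the plain join (toy data, not the real schema).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Fact (col INTEGER);
CREATE TABLE Dim  (col INTEGER);
INSERT INTO Fact VALUES (1), (2), (3), (7), (9);
INSERT INTO Dim  VALUES (2), (3), (5);
""")

plain = con.execute(
    "SELECT Fact.col FROM Fact JOIN Dim ON Fact.col = Dim.col "
    "ORDER BY Fact.col").fetchall()

rewritten = con.execute(
    "SELECT Fact.col FROM Fact JOIN Dim ON Fact.col = Dim.col "
    "WHERE Fact.col BETWEEN (SELECT min(col) FROM Dim) "
    "                   AND (SELECT max(col) FROM Dim) "
    "  AND Dim.col BETWEEN (SELECT min(col) FROM Fact) "
    "                  AND (SELECT max(col) FROM Fact) "
    "ORDER BY Fact.col").fetchall()

# The implied quals cannot change the result: any row failing them
# could never have found a join partner in the first place.
assert plain == rewritten
print(plain)  # [(2,), (3,)]
```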

If we have indexes on Fact.col and Dim.col then these additional quals
can be derived cheaply at run-time and could have an important effect
on optimisation.

1) With various kinds of index, we would be able to use these implied
quals to further restrict the scan. Perhaps that doesn't sound very
interesting, but it is very important when solving an "outside-in"
join on a star schema, such as...

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

since there is no qual that can be applied directly to the Fact table,
causing us to scan the entire table.

We can rewrite this query as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
             AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)

Note that the implied qual on the Dim table has been dropped as
uninteresting. This is because we can calculate the cost and potential
benefit of applying the rewrite, allowing us to discard one or both
implied clauses.

Note also that this has nothing to do with join order. This is solely
about making inferences using the join quals between any two tables.

2) We calculate the join selectivity by comparing the MFVs of the join
columns on the tables being joined, an estimate that can often be
wrong. ISTM that we could use the min() and max() values to refine
that selectivity.
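As a rough illustration of how min/max bounds could clamp a selectivity estimate (a sketch with made-up numbers and a crude uniformity assumption, not the planner's actual MFV logic):

```python
# Hypothetical sketch: bound an equality-join selectivity estimate by
# the fraction of one side's value range that overlaps the other's.
def overlap_fraction(lo, hi, other_lo, other_hi):
    """Fraction of [lo, hi] overlapping [other_lo, other_hi],
    crudely assuming uniformly distributed values."""
    if hi <= lo:
        return 1.0
    width = min(hi, other_hi) - max(lo, other_lo)
    return max(0.0, min(1.0, width / (hi - lo)))

# If Fact.col spans 1..100000 but Dim.col only spans 1..100, at most
# ~0.1% of Fact's range can possibly find a join partner, whatever
# the MFV comparison says.
frac = overlap_fraction(1, 100000, 1, 100)
print(frac)  # ≈ 0.00099
```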

- - -

The current planner doesn't add these predicates automatically, but if
it did, it would come up with the following slightly sub-optimal
plan...

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
             AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
Aggregate  (cost=31.79..31.80 rows=1 width=0) (actual rows=1 loops=1)
  InitPlan 1 (returns $0)
    ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
          ->  Seq Scan on dim dim_1  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                Filter: (other = 1)
                Rows Removed by Filter: 1
  InitPlan 2 (returns $1)
    ->  Aggregate  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
          ->  Seq Scan on dim dim_2  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                Filter: (other = 1)
                Rows Removed by Filter: 1
  ->  Merge Join  (cost=1.33..29.09 rows=250 width=0) (actual rows=100000 loops=1)
        Merge Cond: (dim.col = fact.col)
        ->  Sort  (cost=1.03..1.04 rows=1 width=4) (actual rows=1 loops=1)
              Sort Key: dim.col
              Sort Method: quicksort  Memory: 25kB
              ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                    Filter: (other = 1)
                    Rows Removed by Filter: 1
        ->  Index Only Scan using fact_col_idx on fact  (cost=0.29..24.29 rows=500 width=4) (actual rows=100000 loops=1)
              Index Cond: ((col >= $0) AND (col <= $1))
              Heap Fetches: 100000

which is sub-optimal only because of the mis-estimation of the effect
of the min() and max(), so we will still benefit from resolving the
parameters to a constant before proceeding with the main query. A
better plan would be

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1
AND Fact.col BETWEEN 1 AND 1
Aggregate  (cost=2944.04..2944.05 rows=1 width=0) (actual rows=1 loops=1)
  ->  Hash Join  (cost=1.04..2819.04 rows=50000 width=0) (actual rows=100000 loops=1)
        Hash Cond: (fact.col = dim.col)
        ->  Seq Scan on fact  (cost=0.00..1943.00 rows=100000 width=4) (actual rows=100000 loops=1)
              Filter: ((col >= 1) AND (col <= 1))
        ->  Hash  (cost=1.02..1.02 rows=1 width=4) (actual rows=1 loops=1)
              Buckets: 1024  Batches: 1  Memory Usage: 1kB
              ->  Seq Scan on dim  (cost=0.00..1.02 rows=1 width=4) (actual rows=1 loops=1)
                    Filter: (other = 1)
                    Rows Removed by Filter: 1

but we can probably live with that, as we do with other dynamic index plans.

So when can we use this?

The additional cost of adding this to the query is
* additional qual:  2 * cpu_operator_cost * rows
* getting values: 2 * cost of getting one row from index

Reduced cost plans happen if the additional quals reduce the total
cost of the plan/scan because they significantly reduce selectivity
and possibly because they use an index plan. We don't know the
reduction that is possible, if any, so I suggest we look at the base
rel plan near the bottom of set_plain_rel_pathlist(). If the cost of
scanning the base rel is high in comparison with the cost of the
additional quals, then we add them because if we add them and there is
no benefit we lose a small amount, yet if we do gain benefit it could
be substantial.

Proposal: in set_plain_rel_pathlist(), if the total cost is already
greater than 100x the cost of adding the quals, then we add the
additional implied quals and look again for index/seq scan plans.

It is possible that this could be implemented for 9.5, if it is
considered a good idea.

Good?

- - -

Current proposal ends there, but there is a further optimization that
allows us to remove the join altogether if
* There is a FK between Fact and Dim
* The join is an anti or semi join
* There are no values of the join column in Dim between the Min()
and Max() that are excluded by the filter on Dim

It seems a little less practical, as will become clear...

This optimization allows

SELECT count(*)
FROM Fact JOIN Dim on Fact.col = Dim.col
WHERE Dim.other = 1

to be written as

EXPLAIN (ANALYZE on, TIMING off, COSTS on)
SELECT count(*)
FROM Fact
WHERE
(
Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
             AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
AND
/* when */
(0 = (
SELECT count(*)
FROM Dim
WHERE
(
Dim.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
            AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
)
AND
NOT (Dim.other = 1)
));
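This equivalence too can be checked mechanically; a toy sqlite3 sketch, with Dim.col unique and every Fact.col present in Dim, standing in for the FK requirement above:

```python
import sqlite3

# Toy check of the join-removal rewrite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Dim  (col INTEGER PRIMARY KEY, other INTEGER);
CREATE TABLE Fact (col INTEGER REFERENCES Dim(col));
INSERT INTO Dim  VALUES (1, 1), (2, 1), (3, 1), (4, 2);
INSERT INTO Fact VALUES (1), (1), (2), (3), (4);
""")

with_join = con.execute("""
    SELECT count(*) FROM Fact JOIN Dim ON Fact.col = Dim.col
    WHERE Dim.other = 1""").fetchone()[0]

# Join removed: range check plus the "no excluded Dim values in
# range" guard from the rewrite above.
without_join = con.execute("""
    SELECT count(*) FROM Fact
    WHERE Fact.col BETWEEN (SELECT min(col) FROM Dim WHERE Dim.other = 1)
                       AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
      AND 0 = (SELECT count(*) FROM Dim
               WHERE Dim.col BETWEEN
                         (SELECT min(col) FROM Dim WHERE Dim.other = 1)
                     AND (SELECT max(col) FROM Dim WHERE Dim.other = 1)
                 AND NOT (Dim.other = 1))""").fetchone()[0]

assert with_join == without_join
print(with_join)  # 4
```

Note that the guard subquery is what makes non-contiguous keys safe: it only vetoes the rewrite when a filtered-out Dim value actually falls inside the min/max range.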

which looks unbelievably complex, but performs quite well.

For those with a long memory, you may recognise my Proof Join from
2007. Note that my Proof Join contained the (mistaken) requirement
that the key values were contiguous whereas in fact that doesn't
matter, as shown. This has also been described as the BETWEEN join
rewrite (Abadi et al 2008).

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


