Join consolidation / Removing duplicate joins

From David Rowley
Subject Join consolidation / Removing duplicate joins
Date
Msg-id CAApHDvopmWq4i2BCf0VqU4mYmxzHCYwfnUMat9TWuKzdvo7=8w@mail.gmail.com
Replies Re: Join consolidation / Removing duplicate joins  (Marti Raudsepp <marti@juffo.org>)
List pgsql-hackers
I've been looking into improving the performance of queries that have correlated subqueries, such as:

SELECT id,(SELECT value FROM t2 WHERE t2.id = t1.id) FROM t1;

Currently we produce a plan that executes the subquery as a subplan, like:

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on t1  (cost=0.00..8456925.00 rows=1000000 width=4) (actual time=0.139..4071.598 rows=1000000 loops=1)
   SubPlan 1
     ->  Index Scan using t2_pkey on t2  (cost=0.42..8.44 rows=1 width=4) (actual time=0.003..0.003 rows=1 loops=1000000)
           Index Cond: (id = t1.id)
 Planning time: 0.169 ms
 Execution time: 4107.809 ms
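
For reference, the examples in this post assume a schema along these lines (reconstructed here, since the original tables aren't shown; names and row counts are guesses matching the plans):

CREATE TABLE t1 (id int PRIMARY KEY, value int, value2 int);
CREATE TABLE t2 (id int PRIMARY KEY, value int, value2 int);
-- 1M rows each, matching the row estimates in the plans
INSERT INTO t1 SELECT x, x, x FROM generate_series(1, 1000000) x;
INSERT INTO t2 SELECT x, x, x FROM generate_series(1, 1000000) x;
ANALYZE t1;
ANALYZE t2;
-- The primary key on t2(id) is what proves the subquery can return
-- at most one row per outer row.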

Though, if the subquery can be proved to return at most 1 record, then this could be rewritten to become:

explain analyze SELECT t1.id,t2.value FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=30832.00..70728.00 rows=1000000 width=8) (actual time=384.337..1452.387 rows=1000000 loops=1)
   Hash Cond: (t1.id = t2.id)
   ->  Seq Scan on t1  (cost=0.00..14425.00 rows=1000000 width=4) (actual time=0.015..199.716 rows=1000000 loops=1)
   ->  Hash  (cost=14425.00..14425.00 rows=1000000 width=8) (actual time=382.387..382.387 rows=1000000 loops=1)
         Buckets: 131072  Batches: 16  Memory Usage: 2463kB
         ->  Seq Scan on t2  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.010..179.911 rows=1000000 loops=1)
 Planning time: 0.396 ms
 Execution time: 1473.392 ms
(8 rows)

(Notice the performance increase: 1473 ms vs 4108 ms.)

Of course, you might ask: why not just write the 2nd query in the first place? Well, good idea, but it's not always that simple. Take the following update statement:

explain update t1
set value = (select value from t2 where t1.id = t2.id),
    value2 = (select value2 from t2 where t1.id = t2.id)
where exists (select 1 from t2 where t1.id = t2.id);

We end up with quite a few extra subqueries where 1 join would probably have done the trick. We could probably have written this using the UPDATE ... FROM syntax (see the sketch below), but if you like to write standard SQL then that might not fly.
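
For comparison, here's roughly what the PostgreSQL-specific UPDATE ... FROM form would look like, collapsing the three subqueries into a single join:

update t1
set value = t2.value,
    value2 = t2.value2
from t2
where t1.id = t2.id;

-- The inner-join semantics of UPDATE ... FROM match the EXISTS clause
-- above: t1 rows with no match in t2 are left untouched.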

Anyway... I've been thinking of writing some code that converts these subplans into left joins where it can be proved that the subquery produces at most 1 row. But for cases like the UPDATE statement above, I didn't really want the code to create a new join for each subplan it pulls into the outer query; it would be nice if the code could detect whether a suitable join already exists, and make use of it.

I started to think about what an efficient way to do this might be, and thought maybe we could add a bool flag to RelOptInfo, named something like uniquejoin, which would be set to true if we could detect that a unique index exists on the relation whose columns are a subset of the join condition (similar to the requirement for LEFT JOIN removals). We could then have a function that tries to pull up all these subplans into the outer query: wherever it can detect that at most 1 subplan row can exist for each outer plan row, it would convert the subplan to a LEFT OUTER JOIN on the outer query.
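
To illustrate the "subset" requirement with a hypothetical query (my example, not from any patch): the join condition below mentions both id and value, but the unique index on t2(id) alone is enough to guarantee at most one matching t2 row per t1 row, so uniquejoin could be set:

-- t2.id is unique, and {id} is a subset of the join columns {id, value},
-- so each t1 row can match at most one t2 row.
SELECT t1.id, t2.value
FROM t1
LEFT JOIN t2 ON t2.id = t1.id AND t2.value = t1.value;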

A bit later in planning, once all the subqueries are planned or pulled up, a "join consolidation" function could run that merges all of these duplicate joins into 1 join per relation. It could do this by looking only at relations that have this uniquejoin flag set to true, and then digging deeper to ensure that the join conditions also match. If we also went to the trouble of setting this flag for outer rels, and not just the ones we add from this pullup operation, then we could merge duplicate INNER JOINs too.
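
To make that concrete, here's a hand-written sketch of the rewrite join consolidation would perform (again my own illustration): the two pulled-up subplans from the UPDATE example would initially produce two identical left joins, which can be merged because t2.id is unique:

-- Before consolidation: one left join per pulled-up subplan
SELECT t1.id, a.value, b.value2
FROM t1
LEFT JOIN t2 a ON a.id = t1.id
LEFT JOIN t2 b ON b.id = t1.id;

-- After consolidation: both joins must match the same single t2 row,
-- so one join suffices
SELECT t1.id, t2.value, t2.value2
FROM t1
LEFT JOIN t2 ON t2.id = t1.id;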

Another example of where removing these duplicate joins could be useful is when 2 views that share a non-empty intersection of their relations get joined, and the join conditions are the same.
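
For instance, something like this (a hypothetical pair of views):

CREATE VIEW v1 AS
  SELECT t1.id, t2.value FROM t1 JOIN t2 ON t2.id = t1.id;
CREATE VIEW v2 AS
  SELECT t1.id, t2.value2 FROM t1 JOIN t2 ON t2.id = t1.id;

-- Joining the views repeats the identical t1/t2 join; with uniquejoin
-- set on both sides, consolidation could collapse the duplicates.
SELECT v1.id, v1.value, v2.value2
FROM v1 JOIN v2 ON v2.id = v1.id;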

I kind of thought that perhaps all of this extra processing to look for unique indexes might be more processing than it's worth, as the success cases for join consolidation might be small, but then I started looking at the executor code around merge and hash join. It seems that there might be some performance gains to be had in here too. I see that with SEMI joins we move to the next outer row once we've found 1 matching inner row; I believe that with these uniquejoin relations we could also skip to the next outer row, the same as with semi joins. I did some quick hacks in the merge join code to test whether there was much performance to be gained here, and found that on a join condition involving 2 varchar fields of about 16 chars, I could get the query to run in 97% of the time (saving 1 comparison on the join condition for each row). So perhaps that's promising enough to make it worth the extra planning cycles of setting this uniquejoin flag.
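
The quick test had roughly this shape (reconstructed; the original benchmark tables aren't shown): a merge join on two ~16-char varchar columns, where a unique constraint on the inner side means each outer row can match at most once:

CREATE TABLE inner_t (k1 varchar(16), k2 varchar(16), payload int,
                      UNIQUE (k1, k2));
CREATE TABLE outer_t (k1 varchar(16), k2 varchar(16));

SET enable_hashjoin = off;  -- force a merge join for the test
-- With uniquejoin detected from the UNIQUE constraint, the executor could
-- skip to the next outer row after the first match, saving one comparison
-- of the two varchar join keys per row.
SELECT o.k1, o.k2, i.payload
FROM outer_t o
JOIN inner_t i ON i.k1 = o.k1 AND i.k2 = o.k2;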

Does anyone have any thoughts on this?

Regards

David Rowley

