Discussion: Multiple-Table-Spanning Joins with ORs in WHERE Clause
Hi pgsql-performance list,

what is the recommended way of doing **multiple-table-spanning joins with ORs in the WHERE clause**?

Until now, we've used a LEFT OUTER JOIN to filter big_table like so:

SELECT DISTINCT <fields of big_table>
FROM "big_table"
LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE
    "table_a"."item_id" IN (<handful of items>)
    OR
    "table_b"."item_id" IN (<handful of items>);

However, this results in an awfully slow plan (it scans the complete big_table, which obviously isn't optimal).

So we decided (at least for now) to split the query into two separate ones and merge/de-duplicate the results with application logic:

SELECT <fields of big_table>
FROM "big_table"
INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE "table_a"."item_id" IN (<handful of items>);

SELECT <fields of big_table>
FROM "big_table"
INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE "table_b"."item_id" IN (<handful of items>);

As you can imagine, we would be very glad to solve this with a single query and without having to re-code existing logic of PostgreSQL. But how?

Best,
Sven

PS: if you require EXPLAIN ANALYZE output, I can post it as well.
> However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal)
Do you mean there is a sequential scan? An EXPLAIN would be helpful. Are there indexes on the columns used in the WHERE clauses?
Postgres can do a bitmap heap scan to combine indexes; there is no need to fire two separate queries.
On Thu, Sep 22, 2016 at 6:54 PM, Sven R. Kunze <srkunze@mail.de> wrote:
--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
Regards,
Madusudanan.B.N
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Sven R. Kunze
Sent: Thursday, September 22, 2016 9:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause
______________________________________________________________________________________________

What about:

SELECT <fields of big_table>
FROM "big_table"
INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE "table_a"."item_id" IN (<handful of items>)
UNION
SELECT <fields of big_table>
FROM "big_table"
INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE "table_b"."item_id" IN (<handful of items>);

Regards,
Igor Neyman
-----Original Message-----
From: Igor Neyman
Sent: Thursday, September 22, 2016 10:33 AM
To: 'Sven R. Kunze' <srkunze@mail.de>; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause
______________________________________________________________________________________________

Another option to try:

SELECT DISTINCT <fields of big_table>
FROM "big_table"
LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id" AND "table_a"."item_id" IN (<handful of items>))
LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id" AND "table_b"."item_id" IN (<handful of items>));

Regards,
Igor Neyman
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Thursday, September 22, 2016 10:36 AM
To: Sven R. Kunze <srkunze@mail.de>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

_______________________________________________________________________________________________________

Please disregard my last suggestion, it will not produce the required results. The solution using UNION should work.

Regards,
Igor Neyman
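For illustration, here is a minimal, self-contained sketch of why the retracted variant cannot filter. The tables and data are hypothetical, and SQLite (via Python) stands in for PostgreSQL purely because it is easy to run; the LEFT OUTER JOIN semantics in question are the same: conditions moved into the ON clause only suppress the joined columns, while every row of big_table is still returned.

```python
import sqlite3

# Hypothetical miniature schema mirroring the thread's big_table/table_a.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE big_table (id INTEGER PRIMARY KEY);
CREATE TABLE table_a (big_table_id INTEGER, item_id INTEGER);
INSERT INTO big_table VALUES (1),(2),(3);
INSERT INTO table_a VALUES (1,10);
""")

# Igor's retracted form: the item_id condition lives in the ON clause.
# With LEFT OUTER JOIN, non-matching big_table rows are kept (with NULLs),
# so the condition does not filter big_table at all.
rows = cur.execute("""
SELECT DISTINCT big_table.id
FROM big_table
LEFT OUTER JOIN table_a
  ON big_table.id = table_a.big_table_id AND table_a.item_id IN (10)
""").fetchall()

assert sorted(rows) == [(1,), (2,), (3,)]  # every big_table row survives
```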
On Thu, Sep 22, 2016 at 6:37 AM, Madusudanan.B.N <b.n.madusudanan@gmail.com> wrote:
> > However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal)
>
> You mean to say there is a sequential scan? An explain would be helpful. Are there indexes on the provided where clauses. Postgres can do a Bitmap heap scan to combine indexes, there is no need to fire two separate queries.
It can't combine bitmap scans that come from different tables.
But he can just combine the two queries into one, with a UNION.
Cheers,
Jeff
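To illustrate the equivalence behind the UNION suggestion, here is a minimal sketch with hypothetical tables and data (SQLite via Python, purely for easy reproduction; PostgreSQL behaves the same for these set semantics). It checks that the original OR-join with DISTINCT and the UNION of two inner joins return the same rows:

```python
import sqlite3

# Hypothetical miniature versions of the thread's tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE table_a (big_table_id INTEGER, item_id INTEGER);
CREATE TABLE table_b (big_table_id INTEGER, item_id INTEGER);
INSERT INTO big_table VALUES (1,'x'),(2,'y'),(3,'z'),(4,'w');
INSERT INTO table_a VALUES (1,10),(2,11),(3,99);
INSERT INTO table_b VALUES (2,20),(4,21),(3,98);
""")

# Sven's original form: OR across two left-joined tables, de-duplicated.
or_join = """
SELECT DISTINCT big_table.id, big_table.payload
FROM big_table
LEFT OUTER JOIN table_a ON big_table.id = table_a.big_table_id
LEFT OUTER JOIN table_b ON big_table.id = table_b.big_table_id
WHERE table_a.item_id IN (10, 11) OR table_b.item_id IN (20, 21)
"""

# Igor's/Jeff's form: two inner joins merged with UNION (which de-duplicates).
union = """
SELECT big_table.id, big_table.payload
FROM big_table JOIN table_a ON big_table.id = table_a.big_table_id
WHERE table_a.item_id IN (10, 11)
UNION
SELECT big_table.id, big_table.payload
FROM big_table JOIN table_b ON big_table.id = table_b.big_table_id
WHERE table_b.item_id IN (20, 21)
"""

rows_or = sorted(cur.execute(or_join).fetchall())
rows_union = sorted(cur.execute(union).fetchall())
assert rows_or == rows_union == [(1, 'x'), (2, 'y'), (4, 'w')]
```

Note the equivalence holds here because item_id conditions only reference the joined table on their own side of the OR; the planner's difficulty, as discussed later in the thread, is proving this in general.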
Thanks a lot Madusudanan, Igor, Lutz and Jeff for your suggestions.

What I can confirm is that the UNION idea runs extremely fast (I don't have access to the db right now to test the subquery idea, but will check next week as I am travelling right now). Thanks again! :)

I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or the subquery plan, if it's also fast) on its own?

Thanks,
Sven

On 22.09.2016 16:44, lfischer wrote:
> Hi Sven
>
> Why not do something like
>
> SELECT * FROM big_table
> WHERE
>     id IN (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>))
>     OR
>     id IN (SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (<handful of items>))
>
> that way you don't need the "distinct" and therefore there should be less comparison going on.
>
> Lutz
On 23.09.2016 11:00, Pavel Stehule wrote:
> 2016-09-23 8:35 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
>> I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or subquery plan if it's also fast) on its own?
>
> It depends on the real data. On your specific data the UNION variant is pretty fast; on a different set, the UNION can be pretty slow. It is related to the difficult estimation of OR predicates.
I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.
Is there something I can do to help here?
Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?
Cheers,
Sven
On Thu, Sep 22, 2016 at 11:35 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Thanks a lot Madusudanan, Igor, Lutz and Jeff for your suggestions.
What I can confirm is that the UNION idea runs extremely fast (I don't have access to the db right now to test the subquery idea, but will check next week as I am travelling right now). Thanks again! :)
I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or the subquery plan, if it's also fast) on its own?
I don't know what the subquery plan is; I don't see references to it in the email chain.
I don't believe that current versions of PostgreSQL are capable of rewriting the plan in the style of a union. It is not just a matter of tweaking the cost estimates, it simply never considers such a plan in the first place given the text of your query.
Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely. It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.
Why not just use the union? Are you using a framework which generates the query automatically and you have no control over it? Or do you just think it is ugly or fragile for some other reason?
Perhaps moving the union from the outside to the inside would be more suitable? That way the select list is only specified once, and if you AND more clauses into the WHERE condition they also only need to be specified once.
SELECT * FROM big_table
WHERE
    id IN (
        SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>)
        UNION
        SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (<handful of items>)
    );
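A minimal runnable sketch of this union-inside-IN form, using hypothetical tables and data (SQLite via Python, only for easy reproduction):

```python
import sqlite3

# Hypothetical miniature schema matching the thread's tables.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE big_table (id INTEGER PRIMARY KEY, payload TEXT);
CREATE TABLE table_a (big_table_id INTEGER, item_id INTEGER);
CREATE TABLE table_b (big_table_id INTEGER, item_id INTEGER);
INSERT INTO big_table VALUES (1,'x'),(2,'y'),(3,'z');
INSERT INTO table_a VALUES (1,10),(3,99);
INSERT INTO table_b VALUES (2,20),(3,98);
""")

# The union of matching ids is built once inside IN (...), so the outer
# select list and any additional WHERE clauses appear only once.
rows = sorted(cur.execute("""
SELECT id, payload FROM big_table
WHERE id IN (
    SELECT big_table_id FROM table_a WHERE item_id IN (10)
    UNION
    SELECT big_table_id FROM table_b WHERE item_id IN (20)
)
""").fetchall())

assert rows == [(1, 'x'), (2, 'y')]  # ids 1 (via table_a) and 2 (via table_b)
```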
Cheers,
Jeff
2016-09-29 14:20 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
> I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.
>
> Is there something I can do to help here?
>
> Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?

In an ideal world, the plan should be independent of the form used. The most difficult part is safe estimation of OR predicates. With correct estimation, the transformation to the UNION form should not be necessary, I think.

Regards
Pavel
Hi Jeff,
On 29.09.2016 20:03, Jeff Janes wrote:
> I don't know what the subquery plan is, I don't see references to that in the email chain.
Lutz posted the following solution:
SELECT * FROM big_table
WHERE
id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>))
OR
id in (SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (<handful of items>))
> I don't believe that current versions of PostgreSQL are capable of rewriting the plan in the style of a union. It is not just a matter of tweaking the cost estimates, it simply never considers such a plan in the first place given the text of your query.
That's okay and that's why I am asking here. :)
> Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely. It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.
I hope not; in business and reporting/stats applications there is a lot of room for this.
Why do you think that OR-ing several tables is a niche issue? I can at least name 3 different projects (from 3 different domains) where combining 3 or more tables with OR is relevant and should be reasonably fast.
Most domains that could benefit would probably have star-like schemas. So, big_table corresponds to the center of the star, whereas the rays correspond to various (even dynamic) extensions to the base data structure.
> Why not just use the union?
Sure that would work in this particular case. However, this thread actually sought a general answer to "how to OR more than two tables".
> Are you using a framework which generates the query automatically and you have no control over it?
We use a framework and we can use the UNION if we want to.
> Or do you just think it is ugly or fragile for some other reason?
I don't think it's ugly or fragile. I am just used to the fact that **if it's equivalent** then PostgreSQL can figure it out (without constant supervision from application developers).
So, it's just a matter of inconvenience. ;)
> Perhaps moving the union from the outside to the inside would be more suitable? That way the select list is only specified once, and if you AND more clauses into the WHERE condition they also only need to be specified once.
>
> SELECT * FROM big_table
> WHERE
>     id IN (
>         SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>)
>         UNION
>         SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (<handful of items>)
>     );
Yet another solution I guess, so thanks a lot. :)
This multitude of solutions also shows that application developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now, but that could change in the future even for the very same use case at hand.
Cheers,
Sven
On 29.09.2016 20:12, Pavel Stehule wrote:
> In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.

Ah, okay. That's interesting. So how can I help here?

Regards,
Sven
2016-09-29 20:49 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
> On 29.09.2016 20:12, Pavel Stehule wrote:
>> In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.
>
> Ah, okay. That's interesting.
>
> So how can I help here?
Try to write a patch :) Or better, help with enhancing PostgreSQL's estimation model. Tomas Vondra has been working for 2 years on multicolumn statistics. He needs help with reviews.
Regards
Pavel
On Thu, Sep 29, 2016 at 11:48 AM, Sven R. Kunze <srkunze@mail.de> wrote:
> On 29.09.2016 20:03, Jeff Janes wrote:
>> Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely. It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.
>
> I don't hope so; in business and reports/stats applications there is a lot of room for this.
>
> Why do you think that OR-ing several tables is a niche issue? I can at least name 3 different projects (from 3 different domains) where combining 3 or more tables with OR is relevant and should be reasonably fast.
Well, I don't recall seeing this issue on this list before (or on a few other forums I read), while I see several other issues over and over again. So that is why I think it is a niche issue. Perhaps I have seen it before and just forgotten, or have not recognized it as being the same issue each time.
> This multitude of solutions also shows that application developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now, but that could change in the future even for the very same use case at hand.
I'm not sure what would cause it to change. Do you mean if you suddenly start selecting a much larger portion of the table? I don't know that the union would be particularly bad in that case, either.
I'm not saying it wouldn't be nice to fix it. I just don't think it is particularly likely to happen soon. I could be wrong (especially if you can write the code to make it happen).
Cheers,
Jeff
On Thu, Sep 29, 2016 at 11:12 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> 2016-09-29 14:20 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
>> I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.
>>
>> Is there something I can do to help here?
>>
>> Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?
>
> In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.
I don't think it is an estimation issue. If it were, the planner would always choose the same inefficient plan (providing the join collapse limits, etc. don't come into play, which I don't think they do here) for all the different ways of writing the query.
Since that is not happening, the planner must not be able to prove that the different queries are semantically identical to each other, which means that it can't pick the other plan no matter how good the estimates look.
Cheers,
Jeff
On 29.09.2016 22:26, Jeff Janes wrote:
> Well, I don't recall seeing this issue on this list before (or on a few other forums I read), while I see several other issues over and over again. So that is why I think it is a niche issue. Perhaps I have seen it before and just forgotten, or have not recognized it as being the same issue each time.
Understood.
>> This multitude of solutions also shows that application developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now, but that could change in the future even for the very same use case at hand.
>
> I'm not sure what would cause it to change. Do you mean if you suddenly start selecting a much larger portion of the table? I don't know that the union would be particularly bad in that case, either.
Not suddenly but gradually. Data can change and we don't know for sure how people will use our systems in the future. Hence, another plan would be more optimal or even a seq scan on big_table would be faster.
In the case at hand, I doubt it but you never know.
> I'm not saying it wouldn't be nice to fix it. I just don't think it is particularly likely to happen soon. I could be wrong (especially if you can write the code to make it happen).
I have been thinking about this. It would be an interesting exercise, as I haven't written much C in the last decade, but sometimes one needs to get out of the comfort zone to get things going.
Sven
Now I found time to investigate all proposed queries side by side. Here are the results (warmup + multiple executions).

TL;DR - Jeff's proposed answer performs significantly faster with our data than any other solution (both planning and execution time).

I have no real idea how PostgreSQL does query rewriting, but I guess the following steps (and the reverse ones) are necessary:

1) detect "DISTINCT + LEFT OUTER JOIN" and rewrite to "SUBQUERY"
2) detect "MUTUAL JOIN ON KEY + OR" and rewrite to "UNION"
3) detect "MUTUAL IN KEY + OR" and rewrite to "UNION"
4) detect "UNION + MUTUAL JOIN ON KEY" and rewrite to "SUBQUERY + UNION"

Doing (1+2) or (3+4) would result in the optimal query. (1+2) seems easier to do, although a "common SELECT lift up"/"UNION push down" (if that's even the correct name) would also be great to have (that's (4)).

Is this somehow correct?

Regarding cost estimation: it seems like PostgreSQL is clever enough here. So, I tend to agree with Jeff that this is not an issue with cost estimation.
---- DISTINCT + LEFT OUTER JOIN

explain analyze
SELECT DISTINCT <columns of big_table>
FROM "big_table"
LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE ("table_a"."item_id" IN (<handful of items>) OR "table_b"."item_id" IN (<handful of items>));

HashAggregate  (cost=206268.67..206269.46 rows=79 width=185) (actual time=904.859..904.860 rows=5 loops=1)
  Group Key: <columns of big_table>
  ->  Merge Left Join  (cost=1.26..206265.11 rows=79 width=185) (actual time=904.835..904.846 rows=6 loops=1)
        Merge Cond: (big_table.id = table_a.big_table_id)
        Filter: (((table_a.item_id)::text = ANY ('<handful of items>'::text[])) OR ((table_b.item_id)::text = ANY ('<handful of items>'::text[])))
        Rows Removed by Filter: 901355
        ->  Merge Left Join  (cost=0.85..196703.22 rows=858293 width=243) (actual time=0.009..745.736 rows=858690 loops=1)
              Merge Cond: (big_table.id = table_b.big_table_id)
              ->  Index Scan using big_table_pkey on big_table  (cost=0.42..180776.64 rows=858293 width=185) (actual time=0.005..399.102 rows=858690 loops=1)
              ->  Index Scan using table_b_pkey on table_b  (cost=0.42..10343.86 rows=274959 width=62) (actual time=0.003..60.961 rows=274958 loops=1)
        ->  Index Scan using table_a_big_table_id on table_a  (cost=0.42..4445.35 rows=118836 width=57) (actual time=0.003..25.456 rows=118833 loops=1)
Planning time: 0.934 ms
Execution time: 904.936 ms

---- SUBQUERY

explain analyze
SELECT <columns of big_table>
FROM "big_table"
WHERE "big_table"."id" IN (SELECT "table_a"."big_table_id" FROM "table_a" WHERE "table_a"."item_id" IN (<handful of items>))
   OR "big_table"."id" IN (SELECT "table_b"."big_table_id" FROM "table_b" WHERE "table_b"."item_id" IN (<handful of items>));

Seq Scan on big_table  (cost=100.41..115110.80 rows=643720 width=185) (actual time=229.819..229.825 rows=5 loops=1)
  Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
  Rows Removed by Filter: 858685
  SubPlan 1
    ->  Index Scan using table_a_item_id_211f18d89c25bc21_uniq on table_a  (cost=0.42..58.22 rows=9 width=4) (actual time=0.026..0.043 rows=5 loops=1)
          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
  SubPlan 2
    ->  Index Scan using table_b_item_id_611f9f519d835e89_uniq on table_b  (cost=0.42..42.15 rows=5 width=4) (actual time=0.007..0.040 rows=5 loops=1)
          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
Planning time: 0.261 ms
Execution time: 229.901 ms

---- UNION

explain analyze
SELECT <columns of big_table>
FROM "big_table" INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE "table_a"."item_id" IN (<handful of items>)
UNION
SELECT <columns of big_table>
FROM "big_table" INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE "table_b"."item_id" IN (<handful of items>);

HashAggregate  (cost=216.84..216.98 rows=14 width=185) (actual time=0.092..0.093 rows=5 loops=1)
  Group Key: <columns of big_table>
  ->  Append  (cost=22.59..216.21 rows=14 width=185) (actual time=0.035..0.080 rows=10 loops=1)
        ->  Nested Loop  (cost=22.59..132.17 rows=9 width=185) (actual time=0.034..0.044 rows=5 loops=1)
              ->  Bitmap Heap Scan on table_a  (cost=22.16..56.10 rows=9 width=4) (actual time=0.029..0.029 rows=5 loops=1)
                    Recheck Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
                    Heap Blocks: exact=1
                    ->  Bitmap Index Scan on table_a_item_id_211f18d89c25bc21_uniq  (cost=0.00..22.16 rows=9 width=0) (actual time=0.027..0.027 rows=5 loops=1)
                          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
              ->  Index Scan using big_table_pkey on big_table  (cost=0.42..8.44 rows=1 width=185) (actual time=0.002..0.002 rows=1 loops=5)
                    Index Cond: (id = table_a.big_table_id)
        ->  Nested Loop  (cost=22.58..83.90 rows=5 width=185) (actual time=0.029..0.035 rows=5 loops=1)
              ->  Bitmap Heap Scan on table_b  (cost=22.15..41.64 rows=5 width=4) (actual time=0.026..0.026 rows=5 loops=1)
                    Recheck Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
                    Heap Blocks: exact=1
                    ->  Bitmap Index Scan on table_b_item_id_611f9f519d835e89_uniq  (cost=0.00..22.15 rows=5 width=0) (actual time=0.025..0.025 rows=5 loops=1)
                          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
              ->  Index Scan using big_table_pkey on big_table big_table_1  (cost=0.42..8.44 rows=1 width=185) (actual time=0.001..0.001 rows=1 loops=5)
                    Index Cond: (id = table_b.big_table_id)
Planning time: 0.594 ms
Execution time: 0.177 ms

---- SUBQUERY + UNION

On 29.09.2016 20:03, Jeff Janes wrote:
> SELECT * FROM big_table
> WHERE
>     id IN (
>         SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>)
>         UNION
>         SELECT big_table_id FROM table_b WHERE "table_b"."item_id" IN (<handful of items>)
>     );

Nested Loop  (cost=98.34..216.53 rows=14 width=185) (actual time=0.061..0.069 rows=5 loops=1)
  ->  HashAggregate  (cost=97.91..98.05 rows=14 width=4) (actual time=0.057..0.058 rows=5 loops=1)
        Group Key: table_a.big_table_id
        ->  Append  (cost=22.16..97.88 rows=14 width=4) (actual time=0.028..0.054 rows=10 loops=1)
              ->  Bitmap Heap Scan on table_a  (cost=22.16..56.10 rows=9 width=4) (actual time=0.028..0.029 rows=5 loops=1)
                    Recheck Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
                    Heap Blocks: exact=1
                    ->  Bitmap Index Scan on table_a_item_id_211f18d89c25bc21_uniq  (cost=0.00..22.16 rows=9 width=0) (actual time=0.026..0.026 rows=5 loops=1)
                          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
              ->  Bitmap Heap Scan on table_b  (cost=22.15..41.64 rows=5 width=4) (actual time=0.024..0.024 rows=5 loops=1)
                    Recheck Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
                    Heap Blocks: exact=1
                    ->  Bitmap Index Scan on table_b_item_id_611f9f519d835e89_uniq  (cost=0.00..22.15 rows=5 width=0) (actual time=0.023..0.023 rows=5 loops=1)
                          Index Cond: ((item_id)::text = ANY ('<handful of items>'::text[]))
  ->  Index Scan using big_table_pkey on big_table  (cost=0.42..8.44 rows=1 width=185) (actual time=0.001..0.001 rows=1 loops=5)
        Index Cond: (id = table_a.big_table_id)
Planning time: 0.250 ms
Execution time: 0.104 ms

Cheers,
Sven

NOTE: I added a file with the results for better readability.