Обсуждение: Multiple-Table-Spanning Joins with ORs in WHERE Clause

Поиск
Список
Период
Сортировка

Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
Hi pgsql-performance list,


what is the recommended way of doing **multiple-table-spanning joins
with ORs in the WHERE-clause**?


Until now, we've used the LEFT OUTER JOIN to filter big_table like so:


SELECT DISTINCT <fields of big_table>
FROM
     "big_table"
     LEFT OUTER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
     LEFT OUTER JOIN "table_b" ON ("big_table"."id" =
"table_b"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
     OR
     "table_b"."item_id" IN (<handful of items>);


However, this results in an awful slow plan (requiring to scan the
complete big_table which obviously isn't optimal).
So, we decided (at least for now) to split up the query into two
separate ones and merge/de-duplicate the result with application logic:


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>);


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_b" ON ("big_table"."id" =
"table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);


As you can imagine we would be very glad to solve this issue with a
single query and without having to re-code existing logic of PostgreSQL.
But how?


Best,
Sven


PS: if you require EXPLAIN ANALYZE, I can post them as well.


Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Madusudanan.B.N"
Дата:
> However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal)

You mean to say there is a sequential scan ? An explain would be helpful. Are there indexes on the provided where clauses. 

Postgres can do a Bitmap heap scan to combine indexes, there is no need to fire two separate queries.

On Thu, Sep 22, 2016 at 6:54 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Hi pgsql-performance list,


what is the recommended way of doing **multiple-table-spanning joins with ORs in the WHERE-clause**?


Until now, we've used the LEFT OUTER JOIN to filter big_table like so:


SELECT DISTINCT <fields of big_table>
FROM
    "big_table"
    LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
    LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE
    "table_a"."item_id" IN (<handful of items>)
    OR
    "table_b"."item_id" IN (<handful of items>);


However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal).
So, we decided (at least for now) to split up the query into two separate ones and merge/de-duplicate the result with application logic:


SELECT <fields of big_table>
FROM
    "big_table" INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE
    "table_a"."item_id" IN (<handful of items>);


SELECT <fields of big_table>
FROM
    "big_table" INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE
    "table_b"."item_id" IN (<handful of items>);


As you can imagine we would be very glad to solve this issue with a single query and without having to re-code existing logic of PostgreSQL. But how?


Best,
Sven


PS: if you require EXPLAIN ANALYZE, I can post them as well.


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance



--

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Igor Neyman
Дата:
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Sven R.
Kunze
Sent: Thursday, September 22, 2016 9:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

Hi pgsql-performance list,


what is the recommended way of doing **multiple-table-spanning joins with ORs in the WHERE-clause**?


Until now, we've used the LEFT OUTER JOIN to filter big_table like so:


SELECT DISTINCT <fields of big_table>
FROM
     "big_table"
     LEFT OUTER JOIN "table_a" ON ("big_table"."id" = 
"table_a"."big_table_id")
     LEFT OUTER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
     OR
     "table_b"."item_id" IN (<handful of items>);


However, this results in an awful slow plan (requiring to scan the 
complete big_table which obviously isn't optimal).
So, we decided (at least for now) to split up the query into two 
separate ones and merge/de-duplicate the result with application logic:


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_a" ON ("big_table"."id" = 
"table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>);


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);


As you can imagine we would be very glad to solve this issue with a 
single query and without having to re-code existing logic of PostgreSQL. 
But how?


Best,
Sven


PS: if you require EXPLAIN ANALYZE, I can post them as well.

______________________________________________________________________________________________

What about:

SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_a" ON ("big_table"."id" = 
"table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
UNION
SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);


Regards,
Igor Neyman


Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Igor Neyman
Дата:
-----Original Message-----
From: Igor Neyman 
Sent: Thursday, September 22, 2016 10:33 AM
To: 'Sven R. Kunze' <srkunze@mail.de>; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause


-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Sven R.
Kunze
Sent: Thursday, September 22, 2016 9:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

Hi pgsql-performance list,


what is the recommended way of doing **multiple-table-spanning joins with ORs in the WHERE-clause**?


Until now, we've used the LEFT OUTER JOIN to filter big_table like so:


SELECT DISTINCT <fields of big_table>
FROM
     "big_table"
     LEFT OUTER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
     LEFT OUTER JOIN "table_b" ON ("big_table"."id" =
"table_b"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
     OR
     "table_b"."item_id" IN (<handful of items>);


However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal).
So, we decided (at least for now) to split up the query into two separate ones and merge/de-duplicate the result with
applicationlogic:
 


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>);


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);


As you can imagine we would be very glad to solve this issue with a 
single query and without having to re-code existing logic of PostgreSQL. 
But how?


Best,
Sven


PS: if you require EXPLAIN ANALYZE, I can post them as well.

______________________________________________________________________________________________

Another option to try::


SELECT DISTINCT <fields of big_table>
FROM
    "big_table"
    LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id" AND  "table_a"."item_id" IN (<handful of
items>))
    LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id" AND "table_b"."item_id" IN (<handful of
items>));

Regards,
Igor Neyman


Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Igor Neyman
Дата:
-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Igor Neyman
Sent: Thursday, September 22, 2016 10:36 AM
To: Sven R. Kunze <srkunze@mail.de>; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause


-----Original Message-----
From: Igor Neyman 
Sent: Thursday, September 22, 2016 10:33 AM
To: 'Sven R. Kunze' <srkunze@mail.de>; pgsql-performance@postgresql.org
Subject: RE: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause


-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Sven R.
Kunze
Sent: Thursday, September 22, 2016 9:25 AM
To: pgsql-performance@postgresql.org
Subject: [PERFORM] Multiple-Table-Spanning Joins with ORs in WHERE Clause

Hi pgsql-performance list,


what is the recommended way of doing **multiple-table-spanning joins with ORs in the WHERE-clause**?


Until now, we've used the LEFT OUTER JOIN to filter big_table like so:


SELECT DISTINCT <fields of big_table>
FROM
     "big_table"
     LEFT OUTER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
     LEFT OUTER JOIN "table_b" ON ("big_table"."id" =
"table_b"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
     OR
     "table_b"."item_id" IN (<handful of items>);


However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal).
So, we decided (at least for now) to split up the query into two separate ones and merge/de-duplicate the result with
applicationlogic:
 


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>);


SELECT <fields of big_table>
FROM
     "big_table" INNER JOIN "table_b" ON ("big_table"."id" = 
"table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);


As you can imagine we would be very glad to solve this issue with a 
single query and without having to re-code existing logic of PostgreSQL. 
But how?


Best,
Sven


PS: if you require EXPLAIN ANALYZE, I can post them as well.

______________________________________________________________________________________________

Another option to try::


SELECT DISTINCT <fields of big_table>
FROM
    "big_table"
    LEFT OUTER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id" AND  "table_a"."item_id" IN (<handful of
items>))
    LEFT OUTER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id" AND "table_b"."item_id" IN (<handful of
items>));

Regards,
Igor Neyman

_______________________________________________________________________________________________________

Please disregard this last suggestion, it'll not produce required results.

Solution using UNION should work.

Regards,
Igor Neyman

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Jeff Janes
Дата:
On Thu, Sep 22, 2016 at 6:37 AM, Madusudanan.B.N <b.n.madusudanan@gmail.com> wrote:
> However, this results in an awful slow plan (requiring to scan the complete big_table which obviously isn't optimal)

You mean to say there is a sequential scan ? An explain would be helpful. Are there indexes on the provided where clauses. 

Postgres can do a Bitmap heap scan to combine indexes, there is no need to fire two separate queries.

It can't combine bitmap scans that come from different tables.

But he can just combine the two queries into one, with a UNION.

Cheers,

Jeff

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
Thanks a lot Madusudanan, Igor, Lutz and Jeff for your suggestions.

What I can confirm is that the UNION ideas runs extremely fast (don't
have access to the db right now to test the subquery idea, but will
check next week as I travel right now). Thanks again! :)


I was wondering: would it be possible for PostgreSQL to rewrite the
query to generate the UNION (or subquery plan if it's also fast) on it's
own?


Thanks,
Sven

On 22.09.2016 16:44, lfischer wrote:
> Hi Sven
>
> Why not do something like
>
> SELECT * FROM big_table
> WHERE
>      id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id"
> IN (<handful of items>))
>     OR
>      id in (SELECT big_table_id FROM table_a WHERE "table_b"."item_id"
> IN (<handful of items>))
>
> that way you don't need the "distinct" and therefore there should be
> less comparison going on.
>
> Lutz
>
> On 22/09/16 14:24, Sven R. Kunze wrote:
>> Hi pgsql-performance list,
>>
>>
>> what is the recommended way of doing **multiple-table-spanning joins
>> with ORs in the WHERE-clause**?
>>
>>
>> Until now, we've used the LEFT OUTER JOIN to filter big_table like so:
>>
>>
>> SELECT DISTINCT <fields of big_table>
>> FROM
>>     "big_table"
>>     LEFT OUTER JOIN "table_a" ON ("big_table"."id" =
>> "table_a"."big_table_id")
>>     LEFT OUTER JOIN "table_b" ON ("big_table"."id" =
>> "table_b"."big_table_id")
>> WHERE
>>     "table_a"."item_id" IN (<handful of items>)
>>     OR
>>     "table_b"."item_id" IN (<handful of items>);
>>
>>
>> However, this results in an awful slow plan (requiring to scan the
>> complete big_table which obviously isn't optimal).
>> So, we decided (at least for now) to split up the query into two
>> separate ones and merge/de-duplicate the result with application logic:
>>
>>
>> SELECT <fields of big_table>
>> FROM
>>     "big_table" INNER JOIN "table_a" ON ("big_table"."id" =
>> "table_a"."big_table_id")
>> WHERE
>>     "table_a"."item_id" IN (<handful of items>);
>>
>>
>> SELECT <fields of big_table>
>> FROM
>>     "big_table" INNER JOIN "table_b" ON ("big_table"."id" =
>> "table_b"."big_table_id")
>> WHERE
>>     "table_b"."item_id" IN (<handful of items>);
>>
>>
>> As you can imagine we would be very glad to solve this issue with a
>> single query and without having to re-code existing logic of
>> PostgreSQL. But how?
>>
>>
>> Best,
>> Sven
>>
>>
>> PS: if you require EXPLAIN ANALYZE, I can post them as well.
>>
>>
>
>



Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
On 23.09.2016 11:00, Pavel Stehule wrote:
2016-09-23 8:35 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or subquery plan if it's also fast) on it's own?

It depends on real data. On your specific data the UNION variant is pretty fast, on different set, the UNION can be pretty slow. It is related to difficult OR predicate estimation.

I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.


Is there something I can do to help here?

Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?


Cheers,
Sven

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Jeff Janes
Дата:
On Thu, Sep 22, 2016 at 11:35 PM, Sven R. Kunze <srkunze@mail.de> wrote:
Thanks a lot Madusudanan, Igor, Lutz and Jeff for your suggestions.

What I can confirm is that the UNION ideas runs extremely fast (don't have access to the db right now to test the subquery idea, but will check next week as I travel right now). Thanks again! :)


I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or subquery plan if it's also fast) on it's own?

I don't know what the subquery plan is, I don't see references to that in the email chain.

I don't believe that current versions of PostgreSQL are capable of rewriting the plan in the style of a union.  It is not just a matter of tweaking the cost estimates, it simply never considers such a plan in the first place given the text of your query.

Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely.  It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.

Why not just use the union?  Are you using a framework which generates the query automatically and you have no control over it?  Or do you just think it is ugly or fragile for some other reason?

Perhaps moving the union from the outside to the inside would be more suitable?  That way teh select list is only specified once, and if you AND more clauses into the WHERE condition they also only need to be specified once.

SELECT * FROM big_table
WHERE
     id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>) union 
             SELECT big_table_id FROM table_a WHERE "table_b"."item_id" IN (<handful of items>)
      );


Cheers,

Jeff

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Pavel Stehule
Дата:


2016-09-29 14:20 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
On 23.09.2016 11:00, Pavel Stehule wrote:
2016-09-23 8:35 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or subquery plan if it's also fast) on it's own?

It depends on real data. On your specific data the UNION variant is pretty fast, on different set, the UNION can be pretty slow. It is related to difficult OR predicate estimation.

I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.


Is there something I can do to help here?

Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?

In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.

Regards

Pavel



Cheers,
Sven

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:

Hi Jeff,

On 29.09.2016 20:03, Jeff Janes wrote:
I don't know what the subquery plan is, I don't see references to that in the email chain.

Lutz posted the following solution:

SELECT * FROM big_table
WHERE
     id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>))
    OR
     id in (SELECT big_table_id FROM table_a WHERE "table_b"."item_id" IN (<handful of items>))

I don't believe that current versions of PostgreSQL are capable of rewriting the plan in the style of a union.  It is not just a matter of tweaking the cost estimates, it simply never considers such a plan in the first place given the text of your query.

That's okay and that's why I am asking here. :)

Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely.  It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.

I don't hope so; in business and reports/stats applications there is a lot of room for this.

Why do you think that OR-ing several tables is a niche issue? I can at least name 3 different projects (from 3 different domains) where combining 3 or more tables with OR is relevant and should be reasonably fast.

Most domains that could benefit would probably have star-like schemas. So, big_table corresponds to the center of the star, whereas the rays correspond to various (even dynamic) extensions to the base data structure.

Why not just use the union?

Sure that would work in this particular case. However, this thread actually sought a general answer to "how to OR more than two tables".

Are you using a framework which generates the query automatically and you have no control over it?

We use a framework and we can use the UNION if we want to.

Or do you just think it is ugly or fragile for some other reason?

I don't think it's ugly or fragile. I am just used to the fact that **if it's equivalent** then PostgreSQL can figure it out (without constant supervision from application developers).

So, it's just a matter of inconvenience. ;)

Perhaps moving the union from the outside to the inside would be more suitable?  That way teh select list is only specified once, and if you AND more clauses into the WHERE condition they also only need to be specified once.

SELECT * FROM big_table
WHERE
     id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id" IN (<handful of items>) union 
             SELECT big_table_id FROM table_a WHERE "table_b"."item_id" IN (<handful of items>)
      );

Yet another solution I guess, so thanks a lot. :)

This multitude of solution also shows that applications developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now but that could change in the future even for the very same use-case at hand.

Cheers,
Sven

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
On 29.09.2016 20:12, Pavel Stehule wrote:
> In ideal world then plan should be independent on used form. The most
> difficult is safe estimation of OR predicates. With correct estimation
> the transformation to UNION form should not be necessary I am think.

Ah, okay. That's interesting.

So how can I help here?

Regards,
Sven


Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Pavel Stehule
Дата:


2016-09-29 20:49 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
On 29.09.2016 20:12, Pavel Stehule wrote:
In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.

Ah, okay. That's interesting.

So how can I help here?

try to write a patch :) or better, help with enhancing PostgreSQL's estimation model. Tomas Vondra is working 2 years on multicolumn statistics. He needs help with review.

Regards

Pavel

Regards,
Sven



--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Jeff Janes
Дата:
On Thu, Sep 29, 2016 at 11:48 AM, Sven R. Kunze <srkunze@mail.de> wrote:

On 29.09.2016 20:03, Jeff Janes wrote:

Perhaps some future version of PostgreSQL could do so, but my gut feeling is that that is not very likely.  It would take a lot of work, would risk breaking or slowing down other things, and is probably too much of a niche issue to attract a lot of interest.

I don't hope so; in business and reports/stats applications there is a lot of room for this.

Why do you think that OR-ing several tables is a niche issue? I can at least name 3 different projects (from 3 different domains) where combining 3 or more tables with OR is relevant and should be reasonably fast.

Well, I don't recall seeing this issue on this list before (or a few other forums I read) while I see several other issues over and over again.  So that is why I think it is a niche issue.  Perhaps I've have seen it before and just forgotten, or have not recognized it as being the same issue each time.
 


This multitude of solution also shows that applications developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now but that could change in the future even for the very same use-case at hand.

I'm not sure what would cause it to change.  Do you mean if you suddenly start selecting a much larger portion of the table?  I don't know that the union would be particularly bad in that case, either.

I'm not saying it wouldn't be nice to fix it.  I just don't think it is particularly likely to happen soon.  I could be wrong (especially if you can write the code to make it happen).

Cheers,

Jeff

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
Jeff Janes
Дата:
On Thu, Sep 29, 2016 at 11:12 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:


2016-09-29 14:20 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
On 23.09.2016 11:00, Pavel Stehule wrote:
2016-09-23 8:35 GMT+02:00 Sven R. Kunze <srkunze@mail.de>:
I was wondering: would it be possible for PostgreSQL to rewrite the query to generate the UNION (or subquery plan if it's also fast) on it's own?

It depends on real data. On your specific data the UNION variant is pretty fast, on different set, the UNION can be pretty slow. It is related to difficult OR predicate estimation.

I figure that the UNION is fast if the sub-results are small (which they are in our case). On the contrary, when they are huge, the OUTER JOIN variant might be preferable.


Is there something I can do to help here?

Or do you think it's naturally application-dependent and thus should be solved with application logic just as we did?

In ideal world then plan should be independent on used form. The most difficult is safe estimation of OR predicates. With correct estimation the transformation to UNION form should not be necessary I am think.

I don't think it is an estimation issue.  If it were, the planner would always choose the same inefficient plan (providing the join collapse limits, etc. don't come into play, which I don't think they do here) for all the different ways of writing the query.

Since that is not happening, the planner must not be able to prove that the different queries are semantically identical to each other, which means that it can't pick the other plan no matter how good the estimates look.

Cheers,

Jeff

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
On 29.09.2016 22:26, Jeff Janes wrote:
Well, I don't recall seeing this issue on this list before (or a few other forums I read) while I see several other issues over and over again.  So that is why I think it is a niche issue.  Perhaps I've have seen it before and just forgotten, or have not recognized it as being the same issue each time.

Understood.

 

This multitude of solution also shows that applications developers might be overwhelmed by choosing the most appropriate AND most long-lasting one. Because what I take from the discussion is that a UNION might be appropriate right now but that could change in the future even for the very same use-case at hand.

I'm not sure what would cause it to change.  Do you mean if you suddenly start selecting a much larger portion of the table?  I don't know that the union would be particularly bad in that case, either.

Not suddenly but gradually. Data can change and we don't know for sure how people will use our systems in the future. Hence, another plan would be more optimal or even a seq scan on big_table would be faster.

In the case at hand, I doubt it but you never know.

I'm not saying it wouldn't be nice to fix it.  I just don't think it is particularly likely to happen soon.  I could be wrong (especially if you can write the code to make it happen).

I have been thinking about this. It would be an interesting exercise as I haven't written much of C in the last decade but sometimes one needs to get out of the comfort zone to get things going.


Sven

Re: Multiple-Table-Spanning Joins with ORs in WHERE Clause

От
"Sven R. Kunze"
Дата:
Now I found time to investigate all proposed queries side by side. Here
are the results (warmup + multiple executions). TL;DR - Jeff's proposed
answer performs significantly faster with our data than any other
solution (both planning and execution time).


I have no real idea how PostgreSQL does query rewriting but I guess the
following steps (and reverse ones) are necessary:

1) detect "DISTINCT+LEFT OUTER JOIN" and rewrite to "SUBQUERY"

2) detect "MUTUAL JOIN ON KEY + OR" and rewrite to "UNION"

3) detect "MUTUAL IN KEY+ OR" and rewrite to "UNION"

4) detect "UNION + MUTUAL JOIN ON KEY" and rewrite to "SUBQUERY + UNION"


Doing (1+2) or (3+4) would result in the optimal query. To (1+2) seems
easier to do, although a "common SELECT lift up"/"UNION push down" (if
that's even the correct name) would also be great to have (that's 4)).
Is this somehow correct?


Regarding cost estimation: it seems like PostgreSQL is clever enough
here. So, I tend to agree with Jeff that this is not an issue with cost
estimation.


---- DISTINCT + LEFT OUTER JOIN


explain analyze
SELECT distinct <columns of big_table>
FROM "big_table"
     LEFT OUTER JOIN "table_a" ON ("big_table"."id" =
"table_a"."big_table_id")
     LEFT OUTER JOIN "table_b" ON ("big_table"."id" =
"table_b"."big_table_id")
WHERE
     ("table_a"."item_id" IN (<handful of items>)
     OR
     "table_b"."item_id" IN (<handful of items>));



  HashAggregate  (cost=206268.67..206269.46 rows=79 width=185) (actual
time=904.859..904.860 rows=5 loops=1)
    Group Key: <columns of big_table>
    ->  Merge Left Join  (cost=1.26..206265.11 rows=79 width=185)
(actual time=904.835..904.846 rows=6 loops=1)
          Merge Cond: (big_table.id = table_a.big_table_id)
          Filter: (((table_a.item_id)::text = ANY ('<handful of
items>'::text[])) OR ((table_b.item_id)::text = ANY ('<handful of
items>'::text[])))
          Rows Removed by Filter: 901355
          ->  Merge Left Join  (cost=0.85..196703.22 rows=858293
width=243) (actual time=0.009..745.736 rows=858690 loops=1)
                Merge Cond: (big_table.id = table_b.big_table_id)
                ->  Index Scan using big_table_pkey on big_table
(cost=0.42..180776.64 rows=858293 width=185) (actual time=0.005..399.102
rows=858690 loops=1)
                ->  Index Scan using table_b_pkey on table_b
(cost=0.42..10343.86 rows=274959 width=62) (actual time=0.003..60.961
rows=274958 loops=1)
          ->  Index Scan using table_a_big_table_id on table_a
(cost=0.42..4445.35 rows=118836 width=57) (actual time=0.003..25.456
rows=118833 loops=1)
  Planning time: 0.934 ms
  Execution time: 904.936 ms




---- SUBQUERY

explain analyze
SELECT <columns of big_table>
FROM "big_table"
WHERE
     "big_table"."id" in (SELECT "table_a"."big_table_id" FROM "table_a"
WHERE "table_a"."item_id" in (<handful of items>))
     OR
     "big_table"."id" in (SELECT "table_b"."big_table_id" FROM "table_b"
WHERE "table_b"."item_id" IN (<handful of items>));



  Seq Scan on big_table  (cost=100.41..115110.80 rows=643720 width=185)
(actual time=229.819..229.825 rows=5 loops=1)
    Filter: ((hashed SubPlan 1) OR (hashed SubPlan 2))
    Rows Removed by Filter: 858685
    SubPlan 1
      ->  Index Scan using table_a_item_id_211f18d89c25bc21_uniq on
table_a (cost=0.42..58.22 rows=9 width=4) (actual time=0.026..0.043
rows=5 loops=1)
            Index Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
    SubPlan 2
      ->  Index Scan using table_b_item_id_611f9f519d835e89_uniq on
table_b (cost=0.42..42.15 rows=5 width=4) (actual time=0.007..0.040
rows=5 loops=1)
            Index Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
  Planning time: 0.261 ms
  Execution time: 229.901 ms



---- UNION

explain analyze
SELECT <columns of big_table>
FROM "big_table"
     INNER JOIN "table_a" ON ("big_table"."id" = "table_a"."big_table_id")
WHERE
     "table_a"."item_id" IN (<handful of items>)
UNION
SELECT <columns of big_table>
FROM "big_table"
     INNER JOIN "table_b" ON ("big_table"."id" = "table_b"."big_table_id")
WHERE
     "table_b"."item_id" IN (<handful of items>);

  HashAggregate  (cost=216.84..216.98 rows=14 width=185) (actual
time=0.092..0.093 rows=5 loops=1)
    Group Key: <columns of big_table>
    ->  Append  (cost=22.59..216.21 rows=14 width=185) (actual
time=0.035..0.080 rows=10 loops=1)
          ->  Nested Loop  (cost=22.59..132.17 rows=9 width=185) (actual
time=0.034..0.044 rows=5 loops=1)
                ->  Bitmap Heap Scan on table_a (cost=22.16..56.10
rows=9 width=4) (actual time=0.029..0.029 rows=5 loops=1)
                      Recheck Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
                      Heap Blocks: exact=1
                      ->  Bitmap Index Scan on
table_a_item_id_211f18d89c25bc21_uniq  (cost=0.00..22.16 rows=9 width=0)
(actual time=0.027..0.027 rows=5 loops=1)
                            Index Cond: ((item_id)::text = ANY
('<handful of items>'::text[]))
                ->  Index Scan using big_table_pkey on big_table
(cost=0.42..8.44 rows=1 width=185) (actual time=0.002..0.002 rows=1 loops=5)
                      Index Cond: (id = table_a.big_table_id)
          ->  Nested Loop  (cost=22.58..83.90 rows=5 width=185) (actual
time=0.029..0.035 rows=5 loops=1)
                ->  Bitmap Heap Scan on table_b (cost=22.15..41.64
rows=5 width=4) (actual time=0.026..0.026 rows=5 loops=1)
                      Recheck Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
                      Heap Blocks: exact=1
                      ->  Bitmap Index Scan on
table_b_item_id_611f9f519d835e89_uniq  (cost=0.00..22.15 rows=5 width=0)
(actual time=0.025..0.025 rows=5 loops=1)
                            Index Cond: ((item_id)::text = ANY
('<handful of items>'::text[]))
                ->  Index Scan using big_table_pkey on big_table
big_table_1  (cost=0.42..8.44 rows=1 width=185) (actual
time=0.001..0.001 rows=1 loops=5)
                      Index Cond: (id = table_b.big_table_id)
  Planning time: 0.594 ms
  Execution time: 0.177 ms




---- SUBQUERY + UNION


On 29.09.2016 20:03, Jeff Janes wrote:
> SELECT * FROM big_table
> WHERE
>    id in (SELECT big_table_id FROM table_a WHERE "table_a"."item_id"
> IN (<handful of items>) union
>          SELECT big_table_id FROM table_a WHERE "table_b"."item_id" IN
> (<handful of items>)
>   );


  Nested Loop  (cost=98.34..216.53 rows=14 width=185) (actual
time=0.061..0.069 rows=5 loops=1)
    ->  HashAggregate  (cost=97.91..98.05 rows=14 width=4) (actual
time=0.057..0.058 rows=5 loops=1)
          Group Key: table_a.big_table_id
          ->  Append  (cost=22.16..97.88 rows=14 width=4) (actual
time=0.028..0.054 rows=10 loops=1)
                ->  Bitmap Heap Scan on table_a (cost=22.16..56.10
rows=9 width=4) (actual time=0.028..0.029 rows=5 loops=1)
                      Recheck Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
                      Heap Blocks: exact=1
                      ->  Bitmap Index Scan on
table_a_item_id_211f18d89c25bc21_uniq  (cost=0.00..22.16 rows=9 width=0)
(actual time=0.026..0.026 rows=5 loops=1)
                            Index Cond: ((item_id)::text = ANY
('<handful of items>'::text[]))
                ->  Bitmap Heap Scan on table_b (cost=22.15..41.64
rows=5 width=4) (actual time=0.024..0.024 rows=5 loops=1)
                      Recheck Cond: ((item_id)::text = ANY ('<handful of
items>'::text[]))
                      Heap Blocks: exact=1
                      ->  Bitmap Index Scan on
table_b_item_id_611f9f519d835e89_uniq  (cost=0.00..22.15 rows=5 width=0)
(actual time=0.023..0.023 rows=5 loops=1)
                            Index Cond: ((item_id)::text = ANY
('<handful of items>'::text[]))
    ->  Index Scan using big_table_pkey on big_table (cost=0.42..8.44
rows=1 width=185) (actual time=0.001..0.001 rows=1 loops=5)
          Index Cond: (id = table_a.big_table_id)
  Planning time: 0.250 ms
  Execution time: 0.104 ms






Cheers,
Sven


NOTE: I added a file with the results for better readability.

Вложения