Обсуждение: 2 left joins causes seqscan

Поиск
Список
Период
Сортировка

2 left joins causes seqscan

От
Willy-Bas Loos
Дата:
Hi,

Today i ran into a situation where a second left join on an indexed field would prevent the index from being used, even though the index is clearly more efficient.
Removing either of the 2 joins would cause that the planner will use the index again.
I tested this on postgres 9.1 and 9.3 on my ubuntu (amd64) laptop.

--Here's the test data:

create table a (id serial primary key, field1 text);
create table b (id integer, title text, lang integer);
create index b_title_lowerto on b using btree (lower(title) text_pattern_ops);
vacuum analyze;

with x as (
insert into a
select generate_series(1,40000) as id
returning id
)
insert into b
select id, translate((random()*100*id)::text, '1234567890.', 'abcdefghij'), 1
from x;
update a set field1=translate(id::text, '1234567890.', 'abcdefghij');
insert into b
select b2.id, translate((random()*100*b2.id)::text, '1234567890.', 'abcdefghij'), 2
from b b2;

--Here's the query that doesn't use the index on "b":

select a.field1, b1.title , b2.title
from a
left join b b1 on b1.id = a.id and b1.lang=1
left join b b2 on b2.id = a.id and b2.lang=2
where (lower(b1.title) like'abcd%' or lower(b2.title) like 'abcd%')

--plan:
Hash Right Join  (cost=4298.60..7214.76 rows=8 width=35)
  Hash Cond: (b1.id = a.id)
  Filter: ((lower(b1.title) ~~ 'abcd%'::text) OR (lower(b2.title) ~~ 'abcd%'::text))
  ->  Seq Scan on b b1  (cost=0.00..1510.00 rows=40176 width=19)
        Filter: (lang = 1)
  ->  Hash  (cost=3798.60..3798.60 rows=40000 width=24)
        ->  Hash Right Join  (cost=1293.00..3798.60 rows=40000 width=24)
              Hash Cond: (b2.id = a.id)
              ->  Seq Scan on b b2  (cost=0.00..1510.00 rows=39824 width=19)
                    Filter: (lang = 2)
              ->  Hash  (cost=793.00..793.00 rows=40000 width=9)
                    ->  Seq Scan on a  (cost=0.00..793.00 rows=40000 width=9)



--Here's the query that does use the index on "b":

select a.field1, b1.title
from a
left join b b1 on b1.id = a.id and b1.lang=1
where lower(b1.title) like 'abcd%'
union
select a.field1, b2.title
from a
left join b b2 on b2.id = a.id and b2.lang=2
where lower(b2.title) like 'abcd%'

--plan:
HashAggregate  (cost=98.31..98.39 rows=8 width=20)
  ->  Append  (cost=4.74..98.27 rows=8 width=20)
        ->  Nested Loop  (cost=4.74..49.10 rows=4 width=20)
              ->  Bitmap Heap Scan on b b1  (cost=4.45..15.82 rows=4 width=19)
                    Filter: ((lang = 1) AND (lower(title) ~~ 'abcd%'::text))
                    ->  Bitmap Index Scan on b_title_lowerto  (cost=0.00..4.45 rows=3 width=0)
                          Index Cond: ((lower(title) ~>=~ 'abcd'::text) AND (lower(title) ~<~ 'abce'::text))
              ->  Index Scan using a_pkey on a  (cost=0.29..8.31 rows=1 width=9)
                    Index Cond: (id = b1.id)
        ->  Nested Loop  (cost=4.74..49.10 rows=4 width=20)
              ->  Bitmap Heap Scan on b b2  (cost=4.45..15.82 rows=4 width=19)
                    Filter: ((lang = 2) AND (lower(title) ~~ 'abcd%'::text))
                    ->  Bitmap Index Scan on b_title_lowerto  (cost=0.00..4.45 rows=3 width=0)
                          Index Cond: ((lower(title) ~>=~ 'abcd'::text) AND (lower(title) ~<~ 'abce'::text))
              ->  Index Scan using a_pkey on a a_1  (cost=0.29..8.31 rows=1 width=9)
                    Index Cond: (id = b2.id)


As you can see, the second query is far more efficient, even though it scans both tables twice to combine the results.
Is this some glitch in the query planner?

Cheers,
--
Willy-Bas Loos

Re: 2 left joins causes seqscan

От
Kevin Grittner
Дата:
Willy-Bas Loos <willybas@gmail.com> wrote:

> As you can see, the second query is far more efficient, even
> though it scans both tables twice to combine the results.

But the two queries don't return the same results.  Of course the
second one will be faster.  The simple equivalent of your second
query is:

explain analyze select a.field1, b.title
  from a
  join b on b.id = a.id
  where lower(b.title) like 'abcd%'
    and lang in (1, 2);

The equivalent of your first query is to take the result sets from
these two queries:

select a1.field1, b1.title, b2.title
  from a a1
  join b b1 on b1.id = a1.id and b1.lang = 1
  left join b b2 on (b2.id = a1.id and b2.lang = 2)
  where lower(b1.title) like'abcd%'
union
select a2.field1, b4.title, b3.title
  from a a2
  join b b3 on b3.id = a2.id and b3.lang = 2
  left join b b4 on (b4.id = a2.id and b4.lang = 1)
  where lower(b3.title) like'abcd%';

The above form does optimize better than the original, but it's not
too surprising that the planner can't come up with the optimal
plan; you've posed quite a challenge for it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 2 left joins causes seqscan

От
Willy-Bas Loos
Дата:

But the two queries don't return the same results.  Of course the
second one will be faster. 
The equivalent of your first query is to take the result sets from
these two queries
(...)
it's not
too surprising that the planner can't come up with the optimal
plan; you've posed quite a challenge for it.


The point that i was trying to make by doing 2 queries and unioning them is, that it is faster to use 2 index scans than to use sequential scans.
I can't quite recognize the challenge that i'm posing the query planner, but i am willing/hoping to learn more about it.

AFAIK, the planner has some statistics about the frequencies in which values in the columns occur. That way, it can calculate the approx number of records that will have to be fetched and considering the latency of a rotating hard disk, it can calculate what is likely to be faster: a sequential scan or using the index for random reads.

In this case, the planner can calculate the number of records that need to be fetched from B, in my case it says it expects 4 of them in both cases. Combined, it would fetch max 8 records from B, in contrast to 40K or even twice that.
I can't understand what is confusing the planner.

Cheers,

Willy-Bas

Re: 2 left joins causes seqscan

От
Kevin Grittner
Дата:
Willy-Bas Loos <willybas@gmail.com> wrote:

> I can't understand what is confusing the planner.

Well, it doesn't do exhaustive proofs of whether two queries are
equivalent.  If it did, it would still not have come up with a plan
like your second one, because it is not equivalent.  The trick in
planning is to stop when the cost of further analysis is likely to
exceed the benefits derived from a better plan.  The fact that the
first query was complex enough that *you* weren't able to
accurately optimize it better before posting is pretty good
evidence that it's moving into the realm of "expensive to
optimize".

Now, if you want to propose some specific check the planner can
make to try to find a plan like the one generated for the query I
showed (which I believe actually *is* equivalent to your first
query), perhaps someone will find implementing that a better use of
their time than any of the other things they have in front of them
to work on, and benchmarks can establish what the planning cost of
that is for people running queries it *won't* benefit compared to
the benefits seen in queries that *do* benefit.  There is an
understandable reluctance to add planning costs to every query run
in order to cause better plan choice for a very small percentage of
queries, especially when there is a workaround -- of explicitly
writing a UNION for those cases.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: 2 left joins causes seqscan

От
Tom Lane
Дата:
Kevin Grittner <kgrittn@ymail.com> writes:
> Willy-Bas Loos <willybas@gmail.com> wrote:
>> I can't understand what is confusing the planner.

> Well, it doesn't do exhaustive proofs of whether two queries are
> equivalent.  If it did, it would still not have come up with a plan
> like your second one, because it is not equivalent.

Yeah.  The short reason why the index was not used in the original
query is that the supposedly indexable condition was inside an OR,
which made it useless as an index qualification: rows not satisfying
that condition at all might yet satisfy the query as a whole.  The
planner does have some ability to use indexes when every arm of the
OR includes an indexable condition on the same table, but that was
not the case here.

Another point about the proposed transformation is that an OR in
WHERE is far from equivalent to a UNION: WHERE ... OR does not result
in full de-duplication.  You could possibly conclude they were
equivalent if the query output columns included a primary key,
but that was not the case here.  In any case, the planner includes
no logic that could transform OR into UNION, and I'd be pretty
hesitant to add any even if the transformation were formally correct,
because the planner has no ability to optimize UNION meaningfully.
You'd often get a worse plan than you get now.  (Perhaps that will
change someday, but it's not very high on the priority list.)

            regards, tom lane


Re: 2 left joins causes seqscan

От
Willy-Bas Loos
Дата:
On Sun, Sep 14, 2014 at 3:23 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
The fact that the
first query was complex enough that *you* weren't able to
accurately optimize it better before posting is pretty good
evidence that it's moving into the realm of "expensive to
optimize".


Touche
BTW i don't mean any offense. I don't expect for anyone to waste their time on cases that aren't worth the while. I just brought this up because i want to help make the planner even more awesome. I believe that planner hints are a bad thing, because the planner should be able to solve it automatically. If it doesn't, users and developers should talk to each other so that the planner (and/or knowledge how to use it) keeps improving.

On Sun, Sep 14, 2014 at 3:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The
planner does have some ability to use indexes when every arm of the
OR includes an indexable condition on the same table, but that was
not the case here.

It wasn't? I think that it was. I guess that this is the core of my question.
But, in a way you are right. The planner probably sees it as a different table, because it is a different join, even though it is the same table.
Ok, i understand now that it is probably too much to ask from the planner. Because OR negates the effect of the qualifications in the WHERE clause and both joins might be the same table, but that is -arguably- a corner case.
Thanks both of you.
 

In any case, the planner includes
no logic that could transform OR into UNION, and I'd be pretty
hesitant to add any even if the transformation were formally correct,
because the planner has no ability to optimize UNION meaningfully.

No, i didn't mean to say that the planner should do that, just that getting the data in 2 queries (and appending with union (all)) which was faster than the 1 query.


Cheers,
--
Willy-Bas Loos

Re: 2 left joins causes seqscan

От
Willy-Bas Loos
Дата:


On Fri, Sep 12, 2014 at 11:11 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

The equivalent of your first query is to take the result sets from
these two queries:

select a1.field1, b1.title, b2.title
  from a a1
  join b b1 on b1.id = a1.id and b1.lang = 1
  left join b b2 on (b2.id = a1.id and b2.lang = 2)
  where lower(b1.title) like'abcd%'
union
select a2.field1, b4.title, b3.title
  from a a2
  join b b3 on b3.id = a2.id and b3.lang = 2
  left join b b4 on (b4.id = a2.id and b4.lang = 1)
  where lower(b3.title) like'abcd%';


Thanks for that query, it really helps.

Cheers,

--
Willy-Bas Loos