Обсуждение: BUG #13824: EXISTS sometimes uses seq scan instead of index

Поиск
Список
Период
Сортировка

BUG #13824: EXISTS sometimes uses seq scan instead of index

От
grzegorz@thulium.pl
Дата:
The following bug has been logged on the website:

Bug reference:      13824
Logged by:          Grzegorz G
Email address:      grzegorz@thulium.pl
PostgreSQL version: 9.4.5
Operating system:   Debian
Description:

Seems related to

(9.4.4,9.3.9) Improve planner's cost estimates for semi-joins and anti-joins
with inner indexscans (Tom Lane, Tomas Vondra)

I have 2 tables: master(t_outbound) and detail (t_outbound_number). I want
to select only those records from t_outbound that have detailed records in
t_outbound number in certain statuses (id_status).

There is an index on t_outbound_number:
    "t_on_id_status_1_or_9" btree (id_outbound) WHERE id_status = 1 OR
id_status = 9


Whenever I perform select like below, planner thinks it's going to look up
many rows and falls back to seq scan. If I disable seq scan, it correctly
uses the index.
When the t_outbound_number table grows and percentage of records with status
1 or 9 decreases, index is used.

Seems like planner thinks it's going to retrieve all of the records to
verify EXISTS clause, not just one.


psql=> EXPLAIN ANALYZE SELECT id_outbound,
     (EXISTS (SELECT 1 FROM t_outbound_number ton WHERE (id_status = 1 OR
id_status = 9) AND ton.id_outbound=tou.id_outbound))


           from t_outbound tou
ORDER BY id ASC;
                                                              QUERY PLAN


--------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2.58..2.59 rows=5 width=10) (actual time=94.805..94.806 rows=5
loops=1)
   Sort Key: tou.id
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on t_outbound tou  (cost=0.00..2.52 rows=5 width=10) (actual
time=62.955..94.763 rows=5 loops=1)
         SubPlan 1
           ->  Seq Scan on t_outbound_number ton  (cost=0.00..11910.24
rows=40453 width=0) (actual time=18.942..18.942 rows=1 loops=5)
                 Filter: ((id_outbound = tou.id_outbound) AND ((id_status =
1) OR (id_status = 9)))
                 Rows Removed by Filter: 114596
 Planning time: 0.775 ms
 Execution time: 94.853 ms



psql=> set enable_seqscan =off;
SET



psql=> EXPLAIN ANALYZE SELECT id_outbound,
     (EXISTS (SELECT 1 FROM t_outbound_number ton WHERE (id_status = 1 OR
id_status = 9) AND ton.id_outbound=tou.id_outbound))


           from t_outbound tou
ORDER BY id ASC;

QUERY PLAN


-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_outbound_id_key on t_outbound tou  (cost=0.13..14.69
rows=5 width=10) (actual time=0.033..0.103 rows=5 loops=1)
   SubPlan 1
     ->  Index Scan using t_on_id_status_1_or_9 on t_outbound_number ton
(cost=0.42..3082.02 rows=40453 width=0) (actual time=0.015..0.015 rows=1
loops=5)
           Index Cond: (id_outbound = tou.id_outbound)
 Planning time: 0.811 ms
 Execution time: 0.142 ms

Re: BUG #13824: EXISTS sometimes uses seq scan instead of index

От
Tom Lane
Дата:
grzegorz@thulium.pl writes:
> Whenever I perform select like below, planner thinks it's going to look up
> many rows and falls back to seq scan. If I disable seq scan, it correctly
> uses the index.

You might need to reduce random_page_cost to reflect your environment
better ... especially if you're most concerned about performance with
all data already cached in memory, which is what these examples are
probably showing.

            regards, tom lane

Re: BUG #13824: EXISTS sometimes uses seq scan instead of index

От
Grzegorz Garlewicz
Дата:
I did just what you said - reduced random_page cost from 4 to 2 then 1 and
then 0.5. Neither did change anything (in fact I did not believe it would
change anything but did the test anyway).
If I'm not mistaken, the issue seems to originate from the planner's
thinking it needs to look up all the rows for EXISTS clause, not just a
single one, so it thinks the cost would be much bigger.
regards,
grzegorz

On Fri, Dec 18, 2015 at 2:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> grzegorz@thulium.pl writes:
> > Whenever I perform select like below, planner thinks it's going to look
> up
> > many rows and falls back to seq scan. If I disable seq scan, it correctly
> > uses the index.
>
> You might need to reduce random_page_cost to reflect your environment
> better ... especially if you're most concerned about performance with
> all data already cached in memory, which is what these examples are
> probably showing.
>
>                         regards, tom lane
>

Re: BUG #13824: EXISTS sometimes uses seq scan instead of index

От
Grzegorz Garlewicz
Дата:
Could you please take a look at this one once again?

On Fri, Dec 18, 2015 at 10:23 AM, Grzegorz Garlewicz <grzegorz@thulium.pl>
wrote:

> I did just what you said - reduced random_page cost from 4 to 2 then 1 and
> then 0.5. Neither did change anything (in fact I did not believe it would
> change anything but did the test anyway).
> If I'm not mistaken, the issue seems to originate from the planner's
> thinking it needs to look up all the rows for EXISTS clause, not just a
> single one, so it thinks the cost would be much bigger.
> regards,
> grzegorz
>
> On Fri, Dec 18, 2015 at 2:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>> grzegorz@thulium.pl writes:
>> > Whenever I perform select like below, planner thinks it's going to look
>> up
>> > many rows and falls back to seq scan. If I disable seq scan, it
>> correctly
>> > uses the index.
>>
>> You might need to reduce random_page_cost to reflect your environment
>> better ... especially if you're most concerned about performance with
>> all data already cached in memory, which is what these examples are
>> probably showing.
>>
>>                         regards, tom lane
>>
>
>

Re: BUG #13824: EXISTS sometimes uses seq scan instead of index

От
Kevin Grittner
Дата:
On Mon, Dec 21, 2015 at 9:06 AM, Grzegorz Garlewicz <grzegorz@thulium.pl> wrote:
> Could you please take a look at this one once again?
>
> On Fri, Dec 18, 2015 at 10:23 AM, Grzegorz Garlewicz <grzegorz@thulium.pl> wrote:
>>
>> I did just what you said - reduced random_page cost from 4 to 2
>> then 1 and then 0.5.

It never really makes sense to set random_page_cost less than
seq_page_cost; although setting them to equal values can make sense
if your data is heavily cached.

>> If I'm not mistaken, the issue seems to originate from the
>> planner's thinking it needs to look up all the rows for EXISTS
>> clause, not just a single one, so it thinks the cost would be
>> much bigger.

No, it does not think that.  It knows that an EXISTS test can stop
after finding a single matching row.

What it does think is that values are fairly evenly distributed
(i.e., if a value is 1% of a table you will only need to read about
100 rows before seeing one, rather than finding them all clumped at
the end of the table) and that there is no correlation between
values (i.e., any given id_status value is not more common for one
id_outbound value than another).  There is work in progress to try
to allow for correlated values, and ideas on how to deal with
uneven distribution.

One thing that might help, in addition to reducing random_page_cost
to be equal to or just above seq_page_cost, is to increase
cpu_tuple_cost to something like 0.03 to 0.05.  Benchmarking with
real-world applications I was running showed better plans chosen
with numbers in that range than with lower numbers.

Of course, if the statistics are not up-to-date it doesn't have
much chance of using accurate numbers and is likely to choose a bad
plan.  Make sure that autovacuum is tuned to be aggressive enough,
and if there are big changes to a table you may want to use an
explicit ANALYZE (or VACUUM ANALYZE) before running queries which
reference data in the table.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company