Discussion: force the use of a particular index

From:
Scott Cain
Date:

Hello,

I am wondering if there is a way to force the use of a particular index
when doing a query.  I have two tables that are pretty big (each >3
million rows), and when I do a join between them the performance is
generally quite poor as it does not use the indexes that I think it
should use.  Here is an example query:

 SELECT DISTINCT f.name,fl.fmin,fl.fmax,fl.strand,f.type_id,f.feature_id
 FROM feature f, featureloc fl
 WHERE
 f.feature_id = fl.feature_id and
 fl.srcfeature_id = 6 and fl.fmin <= 2585581 and fl.fmax >= 2565581 and
 f.type_id = 219

Now, I know that if the query planner will use an index on featureloc on
(srcfeature_id, fmin, fmax) that will reduce the amount of data from the
featureloc table from over 3 million to at most a few thousand, and it
will go quite quickly (if I drop other indexes on this table, it does
use that index and completes in about 1/1000th of the time).  After
that, the join with the feature table should go quite quickly as well
using the primary key on feature.

So, the question is, is there a way I can force the query planner to use
the index I want it to use?  I have experimented with using INNER JOIN
and changing the order of the tables in the join clause, but nothing
seems to work.  Any suggestions?
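For reference, the index I want the planner to pick is built over featureloc roughly like this (a sketch of the DDL, not the exact definition from our schema; the name featureloc_idx3 is what this index is called later in the thread):

```sql
-- Composite index covering the featureloc restriction:
-- srcfeature_id is matched by equality, fmin/fmax by range.
CREATE INDEX featureloc_idx3 ON featureloc (srcfeature_id, fmin, fmax);
```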

Thanks much,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


From:
Rod Taylor
Date:

On Fri, 2003-07-11 at 13:17, Scott Cain wrote:
> The problem (at least as it appears to me) is not that it is performing
> a table scan instead of an index scan, it is that it is using the wrong
> index.  Here is the output from EXPLAIN ANALYZE:
>
>                                                                          QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>  Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
>    ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
>          Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
>          ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
>                ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
>                      Filter: (type_id = 219)
>                ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
>                      Index Cond: ("outer".feature_id = fl.feature_id)
>                      Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
>  Total runtime: 114660.91 msec

> it is using on featureloc (featureloc_idx1) is on the foreign key
> feature_id.  It should instead be using another index, featureloc_idx3,
> which is built on (srcfeature_id, fmin, fmax).

Nope.. The optimizer is right in the decision to use featureloc_idx1.
You will notice it is expecting to retrieve a single row from this
index, but the featureloc_idx3 is bound to be larger (due to indexing
more data), and thus takes more disk reads for the exact same information (or
in this case, lack thereof).

What is taking a long time is the scan on feature_pkey. It looks like it
is throwing away a ton of rows that are not type_id = 219.  Either that,
or you've done a pile of deletes and haven't run REINDEX recently.

Create an index consisting of (feature_id, type_id).  This will probably
make a significant difference in execution time.
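In DDL form, something like this (a sketch; the index name is arbitrary):

```sql
-- Composite index so the type_id filter can be checked from the
-- index entries instead of fetching and discarding heap rows.
CREATE INDEX feature_feature_id_type_id_idx ON feature (feature_id, type_id);
```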

From:
Rod Taylor
Date:

On Thu, 2003-07-10 at 15:18, Scott Cain wrote:
> Hello,
>
> I am wondering if there is a way to force the use of a particular index
> when doing a query.  I have two tables that are pretty big (each >3
> million rows), and when I do a join between them the performance is
> generally quite poor as it does not use the indexes that I think it
> should use.  Here is an example query:

Please send the EXPLAIN ANALYZE results for that query with and without
sequential scans enabled.

set enable_seqscan = true;
EXPLAIN ANALYZE <query>;

set enable_seqscan = false;
EXPLAIN ANALYZE <query>;

From:
Scott Cain
Date:

On Fri, 2003-07-11 at 12:20, Rod Taylor wrote:
> On Fri, 2003-07-11 at 11:36, Scott Cain wrote:
> > Any other ideas?
>
> Out of curiosity, what do you get if you disable hash joins?
>
> set enable_hashjoin = false;

BINGO!
                                                                            QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=161718.69..161738.38 rows=113 width=53) (actual time=529.03..529.03 rows=1 loops=1)
   ->  Sort  (cost=161718.69..161721.50 rows=1125 width=53) (actual time=529.02..529.02 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Merge Join  (cost=26493.64..161661.65 rows=1125 width=53) (actual time=416.46..528.77 rows=1 loops=1)
               Merge Cond: ("outer".feature_id = "inner".feature_id)
               ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)
                     Filter: (type_id = 219)
               ->  Sort  (cost=26493.64..26722.33 rows=91476 width=14) (actual time=23.98..24.38 rows=570 loops=1)
                     Sort Key: fl.feature_id
                     ->  Index Scan using featureloc_src_6 on featureloc fl  (cost=0.00..18039.22 rows=91476 width=14) (actual time=15.16..21.85 rows=570 loops=1)
                           Index Cond: ((fmin <= 2585581) AND (fmax >= 2565581))
                           Filter: (srcfeature_id = 6)
 Total runtime: 529.52 msec
(13 rows)

>
> How about a partial index on (feature_id) where type_id = 219?

That is a possibility.  type_id is a foreign key on another table that
has several thousand rows, but in practice there will be only a subset
of those that we are interested in using with this query, so it may not
be too unwieldy to create one for each interesting type_id.
However, for testing I just created the partial index on type_id=219 and
it was not used, so it may not make a difference anyway.

Thanks much,
Scott



From:
Rod Taylor
Date:

> > set enable_hashjoin = false;
>
> BINGO!

I'm not so sure about that.  Your dataset seems to have changed fairly
significantly since the last test.

> ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)

Notice it only pulled out 431 rows where prior runs pulled out several
thousand (~13000).  I think what really happened was something came
along and deleted a bunch of stuff, then vacuum ran.

From:
Tom Lane
Date:

Scott Cain <> writes:
> So, the question is, is there a way I can force the query planner to use
> the index I want it to use?

No (and I don't think there should be).  Given that it *can* generate
the plan you want, this is clearly an estimation failure.  What is the
index it does use?  Would you show us EXPLAIN ANALYZE results when
using each index?

            regards, tom lane

From:
Scott Cain
Date:

Hi Tom,

Embarrassingly, I can't.  I've been monkeying with the database so much
that I can't seem to get it back to the state where I reproduce the
behavior I want.  A database drop and reload may be the only way, but
since that is a time-consuming thing to do, I won't be able to do it
until this evening.

Thanks,
Scott

On Fri, 2003-07-11 at 11:24, Tom Lane wrote:
> Scott Cain <> writes:
> > So, the question is, is there a way I can force the query planner to use
> > the index I want it to use?
>
> No (and I don't think there should be).  Given that it *can* generate
> the plan you want, this is clearly an estimation failure.  What is the
> index it does use?  Would you show us EXPLAIN ANALYZE results when
> using each index?
>
>             regards, tom lane
--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory


From:
Scott Cain
Date:

The problem (at least as it appears to me) is not that it is performing
a table scan instead of an index scan, it is that it is using the wrong
index.  Here is the output from EXPLAIN ANALYZE:

                                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
   ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
               ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
                     Filter: (type_id = 219)
               ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
                     Index Cond: ("outer".feature_id = fl.feature_id)
                     Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
 Total runtime: 114660.91 msec

This is the same regardless of enable_seqscan's setting.  The index that
it is using on featureloc (featureloc_idx1) is on the foreign key
feature_id.  It should instead be using another index, featureloc_idx3,
which is built on (srcfeature_id, fmin, fmax).

I should also mention that I've done a VACUUM FULL ANALYZE on this
database, and I've been using it for a while, and this is the primary
type of query I perform on the database.

Thanks,
Scott



On Fri, 2003-07-11 at 06:51, Rod Taylor wrote:
> On Thu, 2003-07-10 at 15:18, Scott Cain wrote:
> > Hello,
> >
> > I am wondering if there is a way to force the use of a particular index
> > when doing a query.  I have two tables that are pretty big (each >3
> > million rows), and when I do a join between them the performance is
> > generally quite poor as it does not use the indexes that I think it
> > should use.  Here is an example query:
>
> Please send the EXPLAIN ANALYZE results for that query with and without
> sequential scans enabled.
>
> set enable_seqscan = true;
> EXPLAIN ANALYZE <query>;
>
> set enable_seqscan = false;
> EXPLAIN ANALYZE <query>;


From:
Scott Cain
Date:

Rod,

I see what you mean about the scan on the feature_pkey taking a long
time.  I tried several things to remedy that.  I created an index on
feature (feature_id,type_id) (which I don't think makes sense since
feature_id is the primary key, so adding another column really doesn't
help).  I also created an index on feature (type_id, feature_id), but the
planner doesn't use it.  Also, there was an already existing index on
feature (type_id) that the planner never used.

One thing I tried that changed the query plan and improved performance
slightly (but still nowhere near what I need) was to add a partial index
on featureloc on (fmin,fmax) where srcfeature_id=6.  This is something I
could realistically do, since there are relatively few (<30)
srcfeature_ids that I am interested in, so putting in place a partial
index for each of them would not be a big deal.  Nevertheless, the
performance is still not there.  Here is the EXPLAIN ANALYZE for this
situation:

                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=156172.23..156200.11 rows=159 width=54) (actual time=63631.93..63631.93 rows=1 loops=1)
   ->  Sort  (cost=156172.23..156176.21 rows=1594 width=54) (actual time=63631.93..63631.93 rows=1 loops=1)
         Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
         ->  Hash Join  (cost=135100.30..156087.46 rows=1594 width=54) (actual time=63631.29..63631.79 rows=1 loops=1)
               Hash Cond: ("outer".feature_id = "inner".feature_id)
               ->  Index Scan using featureloc_src_6 on featureloc fl  (cost=0.00..18064.99 rows=101883 width=14) (actual time=26.11..430.00 rows=570 loops=1)
                     Index Cond: ((fmin <= 2585581) AND (fmax >= 2565581))
                     Filter: (srcfeature_id = 6)
               ->  Hash  (cost=134601.43..134601.43 rows=48347 width=40) (actual time=63182.86..63182.86 rows=0 loops=1)
                     ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=48347 width=40) (actual time=69.98..62978.27 rows=13825 loops=1)
                           Filter: (type_id = 219)
 Total runtime: 63632.28 msec
(12 rows)
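For reference, the featureloc_src_6 partial index in the plan above was created along these lines (a sketch; one such index per interesting srcfeature_id):

```sql
-- Partial index: only the rows with srcfeature_id = 6 are indexed,
-- keyed on the coordinate columns used in the range condition.
CREATE INDEX featureloc_src_6 ON featureloc (fmin, fmax)
    WHERE srcfeature_id = 6;
```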

Any other ideas?

Thanks,
Scott

On Fri, 2003-07-11 at 09:38, Rod Taylor wrote:
> On Fri, 2003-07-11 at 13:17, Scott Cain wrote:
> > The problem (at least as it appears to me) is not that it is performing
> > a table scan instead of an index scan, it is that it is using the wrong
> > index.  Here is the output from EXPLAIN ANALYZE:
> >
> >                                                                          QUERY PLAN
> > ------------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Unique  (cost=494008.47..494037.59 rows=166 width=54) (actual time=114660.37..114660.38 rows=1 loops=1)
> >    ->  Sort  (cost=494008.47..494012.63 rows=1664 width=54) (actual time=114660.37..114660.37 rows=1 loops=1)
> >          Sort Key: f.name, fl.fmin, fl.fmax, fl.strand, f.type_id, f.feature_id
> >          ->  Nested Loop  (cost=0.00..493919.44 rows=1664 width=54) (actual time=2596.13..114632.90 rows=1 loops=1)
> >                ->  Index Scan using feature_pkey on feature f  (cost=0.00..134601.43 rows=52231 width=40) (actual time=105.74..56048.87 rows=13825 loops=1)
> >                      Filter: (type_id = 219)
> >                ->  Index Scan using featureloc_idx1 on featureloc fl  (cost=0.00..6.87 rows=1 width=14) (actual time=4.23..4.23 rows=0 loops=13825)
> >                      Index Cond: ("outer".feature_id = fl.feature_id)
> >                      Filter: ((srcfeature_id = 6) AND (fmin <= 2585581) AND (fmax >= 2565581))
> >  Total runtime: 114660.91 msec
>
> > it is using on featureloc (featureloc_idx1) is on the foreign key
> > feature_id.  It should instead be using another index, featureloc_idx3,
> > which is built on (srcfeature_id, fmin, fmax).
>
> Nope.. The optimizer is right in the decision to use featureloc_idx1.
> You will notice it is expecting to retrieve a single row from this
> index, but the featureloc_idx3 is bound to be larger (due to indexing
> more data), and thus takes more disk reads for the exact same information (or
> in this case, lack thereof).
>
> What is taking a long time is the scan on feature_pkey. It looks like it
> is throwing away a ton of rows that are not type_id = 219.  Either that,
> or you've done a pile of deletes and haven't run REINDEX recently.
>
> Create an index consisting of (feature_id, type_id).  This will probably
> make a significant difference in execution time.


From:
Rod Taylor
Date:

On Fri, 2003-07-11 at 11:36, Scott Cain wrote:
> Rod,
>
> I see what you mean about the scan on the feature_pkey taking a long
> time.  I tried several things to remedy that.  I created an index on
> feature (feature_id,type_id) (which I don't think makes sense since
> feature_id is the primary key, so adding another column really doesn't

It may be the primary key, but the system looked like it was throwing
away many rows based on type_id.  If it was throwing away many more rows
than found, the index with type_id may have been cheaper.

It is difficult to tell from an EXPLAIN ANALYZE as it doesn't tell you
exactly how many rows were filtered, just the cost to read them and how
many were used after the filter.

> help).  I also created an index on feature (type_id, feature_id), but the
> planner doesn't use it.  Also, there was an already existing index on
> feature (type_id) that the planner never used.

It cannot use more than one index for a given table scan at the moment.
There are proposals on how to 'fix' that, but those require significant
overhauls of various systems.

> Any other ideas?

Out of curiosity, what do you get if you disable hash joins?

set enable_hashjoin = false;


How about a partial index on (feature_id) where type_id = 219?
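That is, something like (a sketch; the index name is arbitrary):

```sql
-- Partial index: only features of the interesting type are indexed,
-- so a scan for type_id = 219 never touches the other rows.
CREATE INDEX feature_id_type_219_idx ON feature (feature_id)
    WHERE type_id = 219;
```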


From:
Scott Cain
Date:

On Fri, 2003-07-11 at 14:14, Rod Taylor wrote:
> > > set enable_hashjoin = false;
> >
> > BINGO!
>
> I'm not so sure about that.  Your dataset seems to have changed fairly
> significantly since the last test.
>
> > ->  Index Scan using feature_pkey on feature f  (cost=0.00..134592.43 rows=47912 width=39) (actual time=0.46..502.50 rows=431 loops=1)
>
> Notice it only pulled out 431 rows where prior runs pulled out several
> thousand (~13000).  I think what really happened was something came
> along and deleted a bunch of stuff, then vacuum ran.

There is nearly a zero chance that happened.  This database is
accessible only by me, I haven't deleted anything.  The only things I
have done are to create and drop various indexes and run vacuum.  Is
there anything else that could explain the difference?  Is the index
scan on feature_pkey using information from the index scan on
featureloc_src_6 to limit the number of rows to get from feature?

Scott



From:
Tom Lane
Date:

Scott Cain <> writes:
> Embarrassingly, I can't.  I've been monkeying with the database so much
> that I can't seem to get it back to the state where I reproduce the
> behavior I want.

If the thing works as desired after a VACUUM ANALYZE, then I suggest
the estimation failure was just due to out-of-date statistics ...
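A sketch of what that would look like (the statistics target of 100 is an arbitrary example value, not a recommendation from this thread):

```sql
-- Refresh the planner's statistics for both tables.
VACUUM ANALYZE feature;
VACUUM ANALYZE featureloc;

-- Optionally collect finer-grained statistics on the skewed column,
-- then re-analyze so the new target takes effect.
ALTER TABLE feature ALTER COLUMN type_id SET STATISTICS 100;
ANALYZE feature;
```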

            regards, tom lane