Discussion: why is the LIMIT clause slowing down this SELECT?


why is the LIMIT clause slowing down this SELECT?

From: "Mason Hale"
Date:
On an 8.1.9 database that has recently been vacuumed and analyzed, I'm
seeing dramatic performance degradation whenever a LIMIT clause is
included in the query. This seems counter-intuitive to me.

Here's the query and explain plan WITH the LIMIT clause:

SELECT *
   FROM topic_feed
 WHERE topic_id = 106947234
  ORDER BY score DESC
     LIMIT 25

Limit  (cost=0.00..651.69 rows=25 width=29) (actual
time=72644.652..72655.029 rows=25 loops=1)
  ->  Index Scan Backward using topic_feed_score_index on topic_feed
(cost=0.00..21219.08 rows=814 width=29) (actual
time=72644.644..72654.855 rows=25 loops=1)
        Filter: (topic_id = 106947234)
Total runtime: 72655.733 ms

==============

and now WITHOUT the LIMIT clause:

SELECT *
   FROM topic_feed
 WHERE topic_id = 106947234
  ORDER BY score DESC

Sort  (cost=1683.75..1685.78 rows=814 width=29) (actual
time=900.553..902.267 rows=492 loops=1)
  Sort Key: score
  ->  Bitmap Heap Scan on topic_feed  (cost=7.85..1644.40 rows=814
width=29) (actual time=307.900..897.993 rows=492 loops=1)
        Recheck Cond: (topic_id = 106947234)
        ->  Bitmap Index Scan on
index_topic_feed_on_topic_id_and_feed_id  (cost=0.00..7.85 rows=814
width=0) (actual time=213.205..213.205 rows=2460 loops=1)
              Index Cond: (topic_id = 106947234)
Total runtime: 904.049 ms

-----------------------------------------

That's a pretty big delta (72.6 sec vs. 0.9 sec), and I would expect
the query with the limit to be faster.

Can anyone explain why this happens and what I can do about it?

thanks in advance,
Mason

Re: why is the LIMIT clause slowing down this SELECT?

From: Jeff Davis
Date:
On Wed, 2007-08-01 at 15:56 -0500, Mason Hale wrote:
> On a 8.1.9 version database that has been recently vacuumed and
> analyzed, I'm seeing some dramatic performance degradation if a limit
> clause is included in the query. This seems counter-intuitive to me.
>
> Here's the query and explain plan WITH the LIMIT clause:
>
> SELECT *
>    FROM topic_feed
>  WHERE topic_id = 106947234
>   ORDER BY score DESC
>      LIMIT 25
>
> Limit  (cost=0.00..651.69 rows=25 width=29) (actual
> time=72644.652..72655.029 rows=25 loops=1)
>   ->  Index Scan Backward using topic_feed_score_index on topic_feed
> (cost=0.00..21219.08 rows=814 width=29) (actual
> time=72644.644..72654.855 rows=25 loops=1)
>         Filter: (topic_id = 106947234)
> Total runtime: 72655.733 ms
>
> ==============
>
> and now WITHOUT the LIMIT clause:
>
> SELECT *
>    FROM topic_feed
>  WHERE topic_id = 106947234
>   ORDER BY score DESC
>
> Sort  (cost=1683.75..1685.78 rows=814 width=29) (actual
> time=900.553..902.267 rows=492 loops=1)
>   Sort Key: score
>   ->  Bitmap Heap Scan on topic_feed  (cost=7.85..1644.40 rows=814
> width=29) (actual time=307.900..897.993 rows=492 loops=1)
>         Recheck Cond: (topic_id = 106947234)
>         ->  Bitmap Index Scan on
> index_topic_feed_on_topic_id_and_feed_id  (cost=0.00..7.85 rows=814
> width=0) (actual time=213.205..213.205 rows=2460 loops=1)
>               Index Cond: (topic_id = 106947234)
> Total runtime: 904.049 ms

Let's call those plan 1 and plan 2.

In plan 1, the planner thinks that it will find 25 tuples matching that
topic_id quickly during the backwards index scan on
topic_feed_score_index. Instead, it looks like it has to go through a
lot of tuples before it finds the necessary 25.

In plan 2, it works the other way around: first it finds the tuples
matching the topic_id, then it sorts them and returns the first 25.

If I'm reading the plans correctly (plan 2, in particular), there are
2460 tuples matching the topic_id, but only 492 are visible. I don't
know why there are so many invisible tuples that still exist in the
index after your VACUUM.
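
If it helps to see the distinction directly, the visible-row figure is
just what a plain count returns (a trivial check, not something run in
the thread):

```sql
-- Counts only visible tuples (492 in the plans above); the bitmap index
-- scan's rows=2460 also includes dead row versions still in the index.
SELECT count(*) FROM topic_feed WHERE topic_id = 106947234;
```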

Are there a lot of UPDATEs/DELETEs on tuples with that topic_id between
when you run the VACUUM and when you run the SELECT? Do you have big
transactions open when running the VACUUM?

Also, how many tuples are in the table overall?

The stats look fairly accurate: it's only off on the estimate of
matching rows by about a factor of two (which isn't terribly bad), 814
estimated versus 492 that are actually visible and match the topic_id.

If I had to guess, I'd say that either:

(a) you have too many invisible tuples, which you can correct with more
frequent VACUUMing; or
(b) the distribution of tuples with a matching topic_id is uneven, and
many of those tuples happen to have a low score. This is harder to fix,
because the stats collector can't detect it; you'd probably have to
coerce the planner to use plan 2.
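
One common coercion trick (my sketch, not something suggested in the
thread; it assumes score is a numeric type) is to wrap the sort key in a
no-op expression, which a plain index on (score) can no longer match:

```sql
-- "score + 0" yields the same ordering, but the planner cannot use a
-- plain index on (score) to produce it, so it falls back to filtering
-- by topic_id first and sorting the survivors, i.e. the shape of plan 2.
SELECT *
  FROM topic_feed
 WHERE topic_id = 106947234
 ORDER BY score + 0 DESC
 LIMIT 25;
```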

Regards,
    Jeff Davis




Re: why is the LIMIT clause slowing down this SELECT?

From: Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> On Wed, 2007-08-01 at 15:56 -0500, Mason Hale wrote:
>> SELECT *
>> FROM topic_feed
>> WHERE topic_id = 106947234
>> ORDER BY score DESC
>> LIMIT 25

> In plan 1, the planner thinks that it will find 25 tuples matching that
> topic_id quickly during the backwards index scan on
> topic_feed_score_index. Instead, it looks like it has to go through a
> lot of tuples before it finds the necessary 25.

Yeah.  An index on (topic_id, score) would perhaps be helpful.
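
As a sketch (the index name is invented here), that would look like:

```sql
-- With this index the planner can scan backward within the slice where
-- topic_id = 106947234, which is already in score order, and stop as
-- soon as it has produced 25 rows.
CREATE INDEX topic_feed_topic_id_score_idx
    ON topic_feed (topic_id, score);
```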

            regards, tom lane

Re: why is the LIMIT clause slowing down this SELECT?

From: "Scott Marlowe"
Date:
On 8/1/07, Mason Hale <masonhale@gmail.com> wrote:
> On a 8.1.9 version database that has been recently vacuumed and
> analyzed, I'm seeing some dramatic performance degradation if a limit
> clause is included in the query. This seems counter-intuitive to me.
>
> Here's the query and explain plan WITH the LIMIT clause:
>
> SELECT *
>    FROM topic_feed
>  WHERE topic_id = 106947234
>   ORDER BY score DESC
>      LIMIT 25
>
> Limit  (cost=0.00..651.69 rows=25 width=29) (actual
> time=72644.652..72655.029 rows=25 loops=1)
>   ->  Index Scan Backward using topic_feed_score_index on topic_feed
> (cost=0.00..21219.08 rows=814 width=29) (actual
> time=72644.644..72654.855 rows=25 loops=1)
>         Filter: (topic_id = 106947234)
> Total runtime: 72655.733 ms
>
> ==============
>
> and now WITHOUT the LIMIT clause:
>
> SELECT *
>    FROM topic_feed
>  WHERE topic_id = 106947234
>   ORDER BY score DESC
>
> Sort  (cost=1683.75..1685.78 rows=814 width=29) (actual
> time=900.553..902.267 rows=492 loops=1)
>   Sort Key: score
>   ->  Bitmap Heap Scan on topic_feed  (cost=7.85..1644.40 rows=814
> width=29) (actual time=307.900..897.993 rows=492 loops=1)
>         Recheck Cond: (topic_id = 106947234)
>         ->  Bitmap Index Scan on
> index_topic_feed_on_topic_id_and_feed_id  (cost=0.00..7.85 rows=814
> width=0) (actual time=213.205..213.205 rows=2460 loops=1)
>               Index Cond: (topic_id = 106947234)
> Total runtime: 904.049 ms

Something seems wrong here.  The cost of the second plan adds up to
1685, while the cost of the first plan adds up to 651.69 with an
intermediate step that adds up to 21219.08.  I thought the outer parts
of a plan always contained the costs of the inner parts?  This doesn't
make sense.

If the inner cost really is 21219, then the planner should have
switched to the cheaper plan with a limit.  If it thinks the total is
going to be 651, then why is it ignoring the cost of 21219?

Re: why is the LIMIT clause slowing down this SELECT?

From: Stephan Szabo
Date:
On Wed, 1 Aug 2007, Scott Marlowe wrote:

> On 8/1/07, Mason Hale <masonhale@gmail.com> wrote:
> > On a 8.1.9 version database that has been recently vacuumed and
> > analyzed, I'm seeing some dramatic performance degradation if a limit
> > clause is included in the query. This seems counter-intuitive to me.
> >
> > Here's the query and explain plan WITH the LIMIT clause:
> >
> > SELECT *
> >    FROM topic_feed
> >  WHERE topic_id = 106947234
> >   ORDER BY score DESC
> >      LIMIT 25
> >
> > Limit  (cost=0.00..651.69 rows=25 width=29) (actual
> > time=72644.652..72655.029 rows=25 loops=1)
> >   ->  Index Scan Backward using topic_feed_score_index on topic_feed
> > (cost=0.00..21219.08 rows=814 width=29) (actual
> > time=72644.644..72654.855 rows=25 loops=1)
> >         Filter: (topic_id = 106947234)
> > Total runtime: 72655.733 ms
> >
> > ==============
> >
> > and now WITHOUT the LIMIT clause:
> >
> > SELECT *
> >    FROM topic_feed
> >  WHERE topic_id = 106947234
> >   ORDER BY score DESC
> >
> > Sort  (cost=1683.75..1685.78 rows=814 width=29) (actual
> > time=900.553..902.267 rows=492 loops=1)
> >   Sort Key: score
> >   ->  Bitmap Heap Scan on topic_feed  (cost=7.85..1644.40 rows=814
> > width=29) (actual time=307.900..897.993 rows=492 loops=1)
> >         Recheck Cond: (topic_id = 106947234)
> >         ->  Bitmap Index Scan on
> > index_topic_feed_on_topic_id_and_feed_id  (cost=0.00..7.85 rows=814
> > width=0) (actual time=213.205..213.205 rows=2460 loops=1)
> >               Index Cond: (topic_id = 106947234)
> > Total runtime: 904.049 ms
>
> Something seems wrong here.  The cost of the second plan adds up to
> 1685, the cost of the first plan adds up to 651.69 with an
> intermediate step that adds up to 21219.08.  ??? I thought the outer
> parts of the plan always contained the inner parts?  This doesn't make
> sense.

I think it's because the top node is a Limit node over a node that
doesn't need to run to completion in order to satisfy the request. The
planner therefore expects the output cost to be about 25/814ths (limit
25 over 814 estimated rows) of the input cost, since it expects to run
only that fraction of the inner plan.
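
That fraction reproduces plan 1's top-level cost almost exactly; as a
quick arithmetic check (runnable in any SQL session):

```sql
-- 25 rows out of an estimated 814, applied to the inner cost 21219.08:
SELECT round(21219.08 * 25 / 814, 2) AS expected_limit_cost;  -- 651.69
```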

Re: why is the LIMIT clause slowing down this SELECT?

From: "Mason Hale"
Date:
> Let's call those plan 1 and plan 2.
>
> In plan 1, the planner thinks that it will find 25 tuples matching that
> topic_id quickly during the backwards index scan on
> topic_feed_score_index. Instead, it looks like it has to go through a
> lot of tuples before it finds the necessary 25.
>
> In plan 2, it does it backward: first it finds the tuples matching the
> topic_id, then it sorts those and returns the first 25.
>
> If I'm reading the plans correctly (plan 2, in particular), there are
> 2460 tuples matching the topic_id, but only 492 are visible. I don't
> know why there are so many invisible tuples that still exist in the
> index after your VACUUM.
>
> Are there a lot of UPDATEs/DELETEs on tuples with that topic_id between
> when you run the VACUUM and when you run the SELECT? Do you have big
> transactions open when running the VACUUM?

We have autovacuum running, but we do update this table frequently:
every row in it probably gets updated at least once per day.
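
A generic way to see how much dead space that update rate leaves behind
(my sketch, not advice from the thread) is to read what VACUUM itself
reports:

```sql
-- VACUUM VERBOSE prints, for the table and each index, how many
-- removable and nonremovable row versions it found; a consistently
-- large nonremovable count suggests long-running transactions are
-- preventing cleanup.
VACUUM VERBOSE topic_feed;
```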

>
> Also, how many tuples are in the table overall?
>

147,207

> The stats look fairly accurate: it's only off on the estimate of
> matching rows by about a factor of two (which isn't terribly bad), 814
> estimated versus 492 that are actually visible and match the topic_id.
>
> If I had to guess, I'd say you either:
>
> (a) have too many invisible tuples - you can correct this with more
> frequent VACUUMing
> (b) the distribution of tuples with matching topic_id is not even, and
> many of the tuples with a matching topic_id happen to have a low score -
> this is harder to fix, because the stats collector can't detect it.
> You'd probably have to coerce it to use plan 2.

The scores in this case are definitely not normally distributed; they
follow a power-law pattern, with a few very high scores and a long
tail.

I ended up coercing it to use plan 2 by dropping the index on topic_feed(score).

Which raises another question: if the planner has already used an
index on topic_id to select the rows, would it ever use another index
on score to order them? Or is a compound topic_feed(topic_id, score)
index the way to go here?

thanks,
Mason

Re: why is the LIMIT clause slowing down this SELECT?

From: Jeff Davis
Date:
On Wed, 2007-08-01 at 21:42 -0500, Mason Hale wrote:
> The score in this case are definitely not a normal distribution. They
> follow a power law pattern, with a few with very high scores and a
> long tail.
>
> I ended up coercing it to use plan 2 by dropping the index on topic_feed(score).
>

I think Tom had the correct advice: you should try an index on
(topic_id, score).

> Which raises another question -- if the planner has already used an
> index on topic_id to select the rows, would it ever us another index
> on score to order the rows? Or is a compound topic_feed(topic_id,
> score) index the way to go there?
>

Two indexes can only be combined via a bitmap index scan, and a bitmap
is in heap order, not index order. That means additional indexes only
help to do additional filtering before the executor fetches from the
table itself. In your case there is no filter on "score" at all;
"score" is just a sort order.

A compound index should give you what you want.

Regards,
    Jeff Davis