Discussion: Selecting max(pk) is slow on empty set

Selecting max(pk) is slow on empty set

From: "Alexander Staubo"
Date:
This is on a fresh pg_restore copy that I have additionally vacuumed
and analyzed. These queries, on a table containing 2.8 million rows,
are very fast:

# select count(*) from user_messages where user_id = 13604;
 count
-------
     0
(1 row)
Time: 0.604 ms

# select * from user_messages where user_id = 13604;
 id | user_id | sender_id | sent_at | dismissed_at | message
----+---------+-----------+---------+--------------+---------
(0 rows)
Time: 0.678 ms

But doing a max() on this empty set takes a long time to run:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
           ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160.697 rows=0 loops=1)
                 Filter: ((id IS NOT NULL) AND (user_id = 13604))
 Total runtime: 339160.770 ms
(6 rows)

Note that it's using the correct index -- user_messages_pkey is on the
id attribute. (Why rows=1000 here?)

PostgreSQL 8.2.5 on Linux and OS X Leopard.

Alexander.

Re: Selecting max(pk) is slow on empty set

From: Richard Huxton
Date:
Alexander Staubo wrote:
> # explain analyze select max(id) from user_messages where user_id = 13604;
>
>                                                                     QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
>    InitPlan
>      ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
>            ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160.697 rows=0 loops=1)
>                  Filter: ((id IS NOT NULL) AND (user_id = 13604))
>  Total runtime: 339160.770 ms
> (6 rows)
>
> Note that it's using the correct index -- user_messages_pkey is on the
> id attribute. (Why rows=1000 here?)

1000 looks suspiciously like a default estimate if the planner knows no
better. Odd since you say that you've just analysed.

Do you have an index on user_id? Presumably that's what's being used in
the case of SELECT * or count(*).
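
A quick way to check which indexes are there (just a sketch, using the standard pg_indexes view):

SELECT indexname, indexdef
  FROM pg_indexes
 WHERE tablename = 'user_messages';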

What cost does the count(*) come up with?

Can you trick it with a sub-query (to see the explain)?
SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
AS foo;

--
   Richard Huxton
   Archonet Ltd

Re: Selecting max(pk) is slow on empty set

From: "Alexander Staubo"
Date:
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Alexander Staubo wrote:
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> >
> >                                                                     QUERY PLAN
> > ------------------------------------------------------------------------------------------------------------------------------------------------------
> >  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
> >    InitPlan
> >      ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
> >            ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160.697 rows=0 loops=1)
> >                  Filter: ((id IS NOT NULL) AND (user_id = 13604))
> >  Total runtime: 339160.770 ms
> > (6 rows)
> >
> > Note that it's using the correct index -- user_messages_pkey is on the
> > id attribute. (Why rows=1000 here?)
>
> 1000 looks suspiciously like a default estimate if the planner knows no
> better. Odd since you say that you've just analysed.
>
> Do you have an index on user_id? Presumably that's what's being used in
> the case of SELECT * or count(*).

Yes, I do. However, for some reason it's not being used here. The
index is clustered -- but I haven't run "cluster" on it recently. Does
that matter?

> What cost does the count(*) come up with?

# explain analyze select count(*) from user_messages where user_id = 13604;

                                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3646.04..3646.05 rows=1 width=0) (actual time=39.448..39.448 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..3643.53 rows=1000 width=0) (actual time=39.410..39.410 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 39.648 ms
(4 rows)

So here it's using the right index.

> Can you trick it with a sub-query (to see the explain)?
> SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
> AS foo;

No, I tried that as well; PostgreSQL is clever enough to optimize it
into exactly the same query as the original.

Alexander.

Re: Selecting max(pk) is slow on empty set

From: Richard Huxton
Date:
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Alexander Staubo wrote:
>>> # explain analyze select max(id) from user_messages where user_id = 13604;
>>>
>>>                                                                     QUERY PLAN
>>> ------------------------------------------------------------------------------------------------------------------------------------------------------
>>>  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)

>> Do you have an index on user_id? Presumably that's what's being used in
>> the case of SELECT * or count(*).
>
> Yes, I do. However, for some reason it's not being used here. The
> index is clustered -- but I haven't run "cluster" on it recently. Does
> that matter?

The index is still an index...

>> What cost does the count(*) come up with?
>
> # explain analyze select count(*) from user_messages where user_id = 13604;
>
>                                                                   QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3646.04..3646.05 rows=1 width=0) (actual time=39.448..39.448 rows=1 loops=1)
>    ->  Index Scan using user_messages_user on user_messages  (cost=0.00..3643.53 rows=1000 width=0) (actual time=39.410..39.410 rows=0 loops=1)
>          Index Cond: (user_id = 13604)
>  Total runtime: 39.648 ms
> (4 rows)
>
> So here it's using the right index.

Hmm, but with an estimated cost of 3646 (vs. 633 for the max(id) query,
which uses the wrong index). That explains why it's walking backwards
through the pkey index: it thinks that route is nearly six times cheaper.

It looks like it thinks that because the estimated cost of scanning the
whole index backwards is 633188 for 1000 matching rows; you only want one
row, so it charges 1/1000 of that cost.
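(Working that through: 633188.12 / 1000 ≈ 633.19, which is exactly the
cost shown on the Limit node.)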

But why 1000 rows? Actually, it thinks 1000 rows above too. Could it be
inadequate stats on the users column? If the users it gathered stats on
all have > 1000 rows then it might use the default.

Have a look at most_common_vals,most_common_freqs in pg_stats for
tbl=user_messages, att=user perhaps. Then see if an ALTER TABLE SET
STATISTICS 100 makes a difference.
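
For reference, the statistics target is set per column, roughly like this
(a sketch, assuming the column in question is user_id; the new target only
takes effect at the next ANALYZE):

SELECT most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'user_messages' AND attname = 'user_id';

ALTER TABLE user_messages ALTER COLUMN user_id SET STATISTICS 100;
ANALYZE user_messages;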

>> Can you trick it with a sub-query (to see the explain)?
>> SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
>> AS foo;
>
> No, I tried that as well; PostgreSQL is clever enough to optimize it
> into exactly the same query as the original.

Damn :-)



--
   Richard Huxton
   Archonet Ltd

Re: Selecting max(pk) is slow on empty set

From: "Alexander Staubo"
Date:
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Hmm, but with an estimated cost of 3646 (vs. 633 for the max(id) query,
> which uses the wrong index). That explains why it's walking backwards
> through the pkey index: it thinks that route is nearly six times cheaper.
[...]
> Have a look at most_common_vals,most_common_freqs in pg_stats for
> tbl=user_messages, att=user perhaps.

# select histogram_bounds from pg_stats where tablename='user_messages' and attname='user_id';
                    histogram_bounds
--------------------------------------------------------
 {1,489,1097,1824,2555,3452,4488,5679,6879,8637,13448}

# select null_frac, n_distinct, most_common_vals, most_common_freqs from pg_stats where tablename='user_messages' and attname='user_id';
 null_frac | n_distinct |                 most_common_vals                 |                                           most_common_freqs
-----------+------------+--------------------------------------------------+-------------------------------------------------------------------------------------------------------
         0 |       2652 | {5826,484,1206,780,823,4085,4157,5852,1962,6453} | {0.00933333,0.00766667,0.00666667,0.00633333,0.006,0.00566667,0.00566667,0.00533333,0.005,0.00466667}

> Then see if an ALTER TABLE SET
> STATISTICS 100 makes a difference.

So it does:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1320.52..1320.53 rows=1 width=4) (actual time=13.640..13.640 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1319.62 rows=358 width=4) (actual time=13.631..13.631 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 13.712 ms

Thank you! That solves my performance problem, at least.

But it's worrying that PostgreSQL should be so off in planning the
query. Does this behaviour qualify as a bug, or is this -- that is,
the need to tweak statistics parameters -- just your garden-variety
application-specific optimization?

Alexander.

Re: Selecting max(pk) is slow on empty set

From: Richard Huxton
Date:
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Then see if an ALTER TABLE SET
>> STATISTICS 100 makes a difference.
>
> So it does:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
>
>                                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1320.52..1320.53 rows=1 width=4) (actual time=13.640..13.640 rows=1 loops=1)
>    ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1319.62 rows=358 width=4) (actual time=13.631..13.631 rows=0 loops=1)
>          Index Cond: (user_id = 13604)
>  Total runtime: 13.712 ms
>
> Thank you! That solves my performance problem, at least.

Although the row-estimate still seems quite high. You might want to
increase it even further (maximum is 1000). If this is a common query,
I'd look at an index on (user,id) rather than just (user) perhaps.
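
To spell that out, something along these lines (just a sketch; the column
names follow the rest of the thread, and the index name is only illustrative):

CREATE INDEX user_messages_user_id_id ON user_messages (user_id, id);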

> But it's worrying that PostgreSQL should be so off in planning the
> query. Does this behaviour qualify as a bug, or is this -- that is,
> the need to tweak statistics parameters -- just your garden-variety
> application-specific optimization?

Well, it's data-specific rather than application-specific, I suppose. The
issue is that there is a cost to tracking 100 values and you don't
want to pay that on every column in every table. If user 13604 isn't in
the list of most-common users then all it can really do is fix an upper
bound on how many matches it can have. Of course you and I can reason
outside of the data and guess that many users won't have more than a
handful of messages, but that's not something PG can do.
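
The same trade-off is why there are two knobs: the per-column target, as
above, and a server-wide default. A sketch of the latter
(default_statistics_target defaults to 10 in 8.2, and raising it makes
ANALYZE more expensive for every column of every table):

SHOW default_statistics_target;
SET default_statistics_target = 100;  -- session-level; put it in postgresql.conf to make it permanent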

In theory, PG could auto-tune itself for various parameters. The problem
then is, do you:
1. Learn constantly, meaning you constantly pay the cost of checking
your decisions and never get consistent plans.
2. Learn once, in which case a change in data frequencies or usage
patterns renders your learning out of date.

You might also find http://pgfoundry.org/ useful: the fouine and pqa
projects there can analyse query logs.

--
   Richard Huxton
   Archonet Ltd

Re: Selecting max(pk) is slow on empty set

From: "Alexander Staubo"
Date:
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Although the row-estimate still seems quite high. You might want to
> increase it even further (maximum is 1000). If this is a common query,
> I'd look at an index on (user,id) rather than just (user) perhaps.

Actually that index (with the same statistics setting as before)
yields slightly worse performance:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3.86..3.87 rows=1 width=0) (actual time=0.051..0.052 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..3.86 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
           ->  Index Scan Backward using user_messages_user_id_id on user_messages  (cost=0.00..1486.79 rows=385 width=4) (actual time=0.042..0.042 rows=0 loops=1)
                 Index Cond: (user_id = 13604)
                 Filter: (id IS NOT NULL)
 Total runtime: 0.128 ms

Compare with the plain index on the one attribute:

# explain analyze select max(id) from user_messages where user_id = 13604;
                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1388.34..1388.35 rows=1 width=4) (actual time=0.034..0.035 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1387.40 rows=374 width=4) (actual time=0.030..0.030 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 0.085 ms

> > But it's worrying that PostgreSQL should be so off in planning the
> > query. Does this behaviour qualify as a bug, or is this -- that is,
> > the need to tweak statistics parameters -- just your garden-variety
> > application-specific optimization?
>
> Well, it's data-specific rather than application-specific, I suppose. The
> issue is that there is a cost to tracking 100 values and you don't
> want to pay that on every column in every table. If user 13604 isn't in
> the list of most-common users then all it can really do is fix an upper
> bound on how many matches it can have. Of course you and I can reason
> outside of the data and guess that many users won't have more than a
> handful of messages, but that's not something PG can do.

Absolutely. Thanks for the pointers.

Alexander.

Re: Selecting max(pk) is slow on empty set

From: Richard Huxton
Date:
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Although the row-estimate still seems quite high. You might want to
>> increase it even further (maximum is 1000). If this is a common query,
>> I'd look at an index on (user,id) rather than just (user) perhaps.
>
> Actually that index (with the same statistics setting as before)
> yields slightly worse performance:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
>  Total runtime: 0.128 ms
>
> Compare with the plain index on the one attribute:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
>  Total runtime: 0.085 ms

Ah, but:
1. Those times are so small, I'm not sure you can reliably separate
them. Certainly not from one run.
2. For a range of different user-ids I'd expect user_id_id index to
maintain a near-constant time regardless of the number of messages for
that user.
3. You might be able to reduce your statistics on the user column and
still keep the fast plan.
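
On point 1, something like this gives a fairer comparison (a sketch:
\timing is psql's per-statement timer, and the user ids other than 13604
are just illustrative values taken from the pg_stats output earlier in
the thread -- run each query a few times):

\timing on
SELECT max(id) FROM user_messages WHERE user_id = 13604;  -- the empty case
SELECT max(id) FROM user_messages WHERE user_id = 5826;   -- a most-common user
SELECT max(id) FROM user_messages WHERE user_id = 489;    -- a histogram-bound user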

--
   Richard Huxton
   Archonet Ltd

Re: Selecting max(pk) is slow on empty set

From: "Alexander Staubo"
Date:
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Alexander Staubo wrote:
> > On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> >> Although the row-estimate still seems quite high. You might want to
> >> increase it even further (maximum is 1000). If this is a common query,
> >> I'd look at an index on (user,id) rather than just (user) perhaps.
> >
> > Actually that index (with the same statistics setting as before)
> > yields slightly worse performance:
> >
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> >  Total runtime: 0.128 ms
> >
> > Compare with the plain index on the one attribute:
> >
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> >  Total runtime: 0.085 ms
>
> Ah, but:
> 1. Those times are so small, I'm not sure you can reliably separate
> them. Certainly not from one run.
> 2. For a range of different user-ids I'd expect user_id_id index to
> maintain a near-constant time regardless of the number of messages for
> that user.
> 3. You might be able to reduce your statistics on the user column and
> still keep the fast plan.

Actually, I wasn't looking at the time, but at the cost and estimated
number of rows, which are both lower for the original index, and the
complexity of the plan, which looks (at least to me) simpler than the
backwards scan.

But you're right. With the combined index I can set the granularity
back to 1000, and empty queries as well as non-empty queries perform
well. The row estimate is still way off, though.

What are the drawbacks of making the statistics buckets finer-grained?

Alexander.

Re: Selecting max(pk) is slow on empty set

From: "Pavel Stehule"
Date:
>
> But you're right. With the combined index I can set the granularity
> back to 1000, and empty queries as well as non-empty queries perform
> well. The row estimate is still way off, though.

A bigger value means a slower ANALYZE. In practice the useful maximum is about 200-300, so be careful.
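
If you want to check what a column's target currently is, it is stored in
pg_attribute (a sketch; -1 means the column just uses default_statistics_target):

SELECT attname, attstattarget
  FROM pg_attribute
 WHERE attrelid = 'user_messages'::regclass
   AND attname = 'user_id';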

Regards

Pavel