Discussion: Selecting max(pk) is slow on empty set
This is on a fresh pg_restore copy that I have additionally vacuumed
and analyzed. These queries, on a table containing 2.8 million rows,
are very fast:

# select count(*) from user_messages where user_id = 13604;
 count
-------
     0
(1 row)

Time: 0.604 ms

# select * from user_messages where user_id = 13604;
 id | user_id | sender_id | sent_at | dismissed_at | message
----+---------+-----------+---------+--------------+---------
(0 rows)

Time: 0.678 ms

But doing a max() on this empty set takes a long time to run:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
           ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160
                 Filter: ((id IS NOT NULL) AND (user_id = 13604))
 Total runtime: 339160.770 ms
(6 rows)

Note that it's using the correct index -- user_messages_pkey is on the
id attribute. (Why rows=1000 here?)

PostgreSQL 8.2.5 on Linux and OS X Leopard.

Alexander.
Alexander Staubo wrote:
> # explain analyze select max(id) from user_messages where user_id = 13604;
>
>                                                                       QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------
>  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
>    InitPlan
>      ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
>            ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160
>                  Filter: ((id IS NOT NULL) AND (user_id = 13604))
>  Total runtime: 339160.770 ms
> (6 rows)
>
> Note that it's using the correct index -- user_messages_pkey is on the
> id attribute. (Why rows=1000 here?)

1000 looks suspiciously like a default estimate if the planner knows no
better. Odd since you say that you've just analysed.

Do you have an index on user_id? Presumably that's what's being used in
the case of SELECT * or count(*).

What cost does the count(*) come up with?

Can you trick it with a sub-query (to see the explain)?

SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
AS foo;

--
  Richard Huxton
  Archonet Ltd
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Alexander Staubo wrote:
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> >
> >  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
> >    InitPlan
> >      ->  Limit  (cost=0.00..633.19 rows=1 width=4) (actual time=339160.700..339160.700 rows=0 loops=1)
> >            ->  Index Scan Backward using user_messages_pkey on user_messages  (cost=0.00..633188.12 rows=1000 width=4) (actual time=339160.697..339160
> >                  Filter: ((id IS NOT NULL) AND (user_id = 13604))
> >  Total runtime: 339160.770 ms
> > (6 rows)
> >
> > Note that it's using the correct index -- user_messages_pkey is on the
> > id attribute. (Why rows=1000 here?)
>
> 1000 looks suspiciously like a default estimate if the planner knows no
> better. Odd since you say that you've just analysed.
>
> Do you have an index on user_id? Presumably that's what's being used in
> the case of SELECT * or count(*).

Yes, I do. However, for some reason it's not being used here. The
index is clustered -- but I haven't run "cluster" on it recently. Does
that matter?

> What cost does the count(*) come up with?

# explain analyze select count(*) from user_messages where user_id = 13604;

                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3646.04..3646.05 rows=1 width=0) (actual time=39.448..39.448 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..3643.53 rows=1000 width=0) (actual time=39.410..39.410 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 39.648 ms
(4 rows)

So here it's using the right index.

> Can you trick it with a sub-query (to see the explain)?
>
> SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
> AS foo;

No, I tried that as well; PostgreSQL is clever enough to optimize it
into exactly the same query as the original.

Alexander.
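[A commonly used way to stop the planner from flattening such a sub-query -- not tried in this thread, so treat it as a sketch -- is to add OFFSET 0, which acts as an optimization fence:]

```sql
-- OFFSET 0 prevents the planner from pulling the sub-query up into the
-- outer query, so the inner scan can keep using the user_id index:
SELECT max(id)
FROM (SELECT id
      FROM user_messages
      WHERE user_id = 13604
      OFFSET 0) AS foo;
```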
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Alexander Staubo wrote:
>>> # explain analyze select max(id) from user_messages where user_id = 13604;
>>>
>>>  Result  (cost=633.19..633.20 rows=1 width=0) (actual time=339160.704..339160.704 rows=1 loops=1)
>
>> Do you have an index on user_id? Presumably that's what's being used in
>> the case of SELECT * or count(*).
>
> Yes, I do. However, for some reason it's not being used here. The
> index is clustered -- but I haven't run "cluster" on it recently. Does
> that matter?

The index is still an index...

>> What cost does the count(*) come up with?
>
> # explain analyze select count(*) from user_messages where user_id = 13604;
>
>  Aggregate  (cost=3646.04..3646.05 rows=1 width=0) (actual time=39.448..39.448 rows=1 loops=1)
>    ->  Index Scan using user_messages_user on user_messages  (cost=0.00..3643.53 rows=1000 width=0) (actual time=39.410..39.410 rows=0 loops=1)
>          Index Cond: (user_id = 13604)
>  Total runtime: 39.648 ms
> (4 rows)
>
> So here it's using the right index.

Hmm, but with an estimated cost of 3646 (vs. 633 for the max(id) plan,
which uses the wrong index). That explains why it's walking backwards
through the pkey index: it thinks that it's 8 times cheaper.

It looks like it thinks that because the estimated cost of scanning the
whole index backwards is 633188 for 1000 rows, and you only want one
row, so that's 1/1000 of that cost. But why 1000 rows? Actually, it
thinks 1000 rows above too. Could it be inadequate stats on the user_id
column? If the users it gathered stats on all have > 1000 rows then it
might use the default.

Have a look at most_common_vals, most_common_freqs in pg_stats for
tbl=user_messages, att=user_id perhaps. Then see if an ALTER TABLE ...
SET STATISTICS 100 makes a difference.

>> Can you trick it with a sub-query (to see the explain)?
>> SELECT max(id) FROM (SELECT id FROM user_messages WHERE user_id = 13604)
>> AS foo;
>
> No, I tried that as well; PostgreSQL is clever enough to optimize it
> into exactly the same query as the original.

Damn :-)

--
  Richard Huxton
  Archonet Ltd
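[Spelled out, the inspection and tweak suggested above would presumably look like this; the column name user_id is taken from the earlier queries:]

```sql
-- Inspect the statistics gathered for the user_id column:
SELECT most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'user_messages' AND attname = 'user_id';

-- Raise the per-column statistics target (the 8.2 server-wide default
-- is 10), then re-gather the statistics so the planner sees them:
ALTER TABLE user_messages ALTER COLUMN user_id SET STATISTICS 100;
ANALYZE user_messages;
```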
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Hmm, but with an estimated cost of 3646 (vs. 633 for the max(id) plan,
> which uses the wrong index). That explains why it's walking backwards
> through the pkey index: it thinks that it's 8 times cheaper.
[...]
> Have a look at most_common_vals, most_common_freqs in pg_stats for
> tbl=user_messages, att=user_id perhaps.

# select histogram_bounds from pg_stats where
tablename='user_messages' and attname='user_id';

                   histogram_bounds
-------------------------------------------------------
 {1,489,1097,1824,2555,3452,4488,5679,6879,8637,13448}

# select null_frac, n_distinct, most_common_vals, most_common_freqs
from pg_stats where tablename='user_messages' and attname='user_id';

 null_frac | n_distinct |                 most_common_vals                 |                                           most_common_freqs
-----------+------------+--------------------------------------------------+-------------------------------------------------------------------------------------------------------
         0 |       2652 | {5826,484,1206,780,823,4085,4157,5852,1962,6453} | {0.00933333,0.00766667,0.00666667,0.00633333,0.006,0.00566667,0.00566667,0.00533333,0.005,0.00466667}

> Then see if an ALTER TABLE ... SET
> STATISTICS 100 makes a difference.

So it does:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1320.52..1320.53 rows=1 width=4) (actual time=13.640..13.640 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1319.62 rows=358 width=4) (actual time=13.631..13.631 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 13.712 ms

Thank you! That solves my performance problem, at least.

But it's worrying that PostgreSQL should be so off in planning the
query. Does this behaviour qualify as a bug, or is this -- that is, the
need to tweak statistics parameters -- just your garden-variety
application-specific optimization?

Alexander.
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Then see if an ALTER TABLE ... SET
>> STATISTICS 100 makes a difference.
>
> So it does:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
>
>  Aggregate  (cost=1320.52..1320.53 rows=1 width=4) (actual time=13.640..13.640 rows=1 loops=1)
>    ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1319.62 rows=358 width=4) (actual time=13.631..13.631 rows=0 loops=1)
>          Index Cond: (user_id = 13604)
>  Total runtime: 13.712 ms
>
> Thank you! That solves my performance problem, at least.

Although the row-estimate still seems quite high. You might want to
increase it even further (maximum is 1000). If this is a common query,
I'd look at an index on (user_id, id) rather than just (user_id)
perhaps.

> But it's worrying that PostgreSQL should be so off in planning the
> query. Does this behaviour qualify as a bug, or is this -- that is,
> the need to tweak statistics parameters -- just your garden-variety
> application-specific optimization?

Well, it's data-specific rather than application-specific, I suppose.
The issue is that there is a cost to tracking 100 values, and you don't
want to pay that on every column in every table. If user 13604 isn't in
the list of most-common users then all it can really do is fix an upper
bound on how many matches it can have. Of course, you and I can reason
outside of the data and guess that many users won't have more than a
handful of messages, but that's not something PG can do.

In theory, PG could auto-tune itself for various parameters. The
problem then is, do you:
1. Learn constantly, meaning you constantly pay the cost of checking
   your decisions and never get consistent plans.
2. Learn once, in which case a change in data frequencies or usage
   patterns renders your learning out of date.

You might find http://pgfoundry.org/ useful, with the fouine / pqa
projects to analyse query logs.

--
  Richard Huxton
  Archonet Ltd
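[The suggested composite index would presumably be created like this, using the index name that appears in the plans later in the thread:]

```sql
-- A composite index lets the planner satisfy the user_id filter and
-- the max(id) ordering from a single index, so the empty case needs
-- only one index probe:
CREATE INDEX user_messages_user_id_id ON user_messages (user_id, id);
```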
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Although the row-estimate still seems quite high. You might want to
> increase it even further (maximum is 1000). If this is a common query,
> I'd look at an index on (user_id, id) rather than just (user_id)
> perhaps.

Actually that index (with the same statistics setting as before) yields
slightly worse performance:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                             QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=3.86..3.87 rows=1 width=0) (actual time=0.051..0.052 rows=1 loops=1)
   InitPlan
     ->  Limit  (cost=0.00..3.86 rows=1 width=4) (actual time=0.045..0.045 rows=0 loops=1)
           ->  Index Scan Backward using user_messages_user_id_id on user_messages  (cost=0.00..1486.79 rows=385 width=4) (actual time=0.042..0.042 rows=0 loops=1)
                 Index Cond: (user_id = 13604)
                 Filter: (id IS NOT NULL)
 Total runtime: 0.128 ms

Compare with the plain index on the one attribute:

# explain analyze select max(id) from user_messages where user_id = 13604;

                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1388.34..1388.35 rows=1 width=4) (actual time=0.034..0.035 rows=1 loops=1)
   ->  Index Scan using user_messages_user on user_messages  (cost=0.00..1387.40 rows=374 width=4) (actual time=0.030..0.030 rows=0 loops=1)
         Index Cond: (user_id = 13604)
 Total runtime: 0.085 ms

> > But it's worrying that PostgreSQL should be so off in planning the
> > query. Does this behaviour qualify as a bug, or is this -- that is,
> > the need to tweak statistics parameters -- just your garden-variety
> > application-specific optimization?
>
> Well, it's data-specific rather than application-specific, I suppose.
> The issue is that there is a cost to tracking 100 values, and you don't
> want to pay that on every column in every table. If user 13604 isn't in
> the list of most-common users then all it can really do is fix an upper
> bound on how many matches it can have. Of course, you and I can reason
> outside of the data and guess that many users won't have more than a
> handful of messages, but that's not something PG can do.

Absolutely. Thanks for the pointers.

Alexander.
Alexander Staubo wrote:
> On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
>> Although the row-estimate still seems quite high. You might want to
>> increase it even further (maximum is 1000). If this is a common query,
>> I'd look at an index on (user_id, id) rather than just (user_id)
>> perhaps.
>
> Actually that index (with the same statistics setting as before)
> yields slightly worse performance:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
> Total runtime: 0.128 ms
>
> Compare with the plain index on the one attribute:
>
> # explain analyze select max(id) from user_messages where user_id = 13604;
> Total runtime: 0.085 ms

Ah, but:
1. Those times are so small, I'm not sure you can reliably separate
   them. Certainly not from one run.
2. For a range of different user-ids I'd expect the user_id_id index to
   maintain a near-constant time regardless of the number of messages
   for that user.
3. You might be able to reduce your statistics on the user_id column
   and still keep the fast plan.

--
  Richard Huxton
  Archonet Ltd
On 1/22/08, Richard Huxton <dev@archonet.com> wrote:
> Alexander Staubo wrote:
> > Actually that index (with the same statistics setting as before)
> > yields slightly worse performance:
> >
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> > Total runtime: 0.128 ms
> >
> > Compare with the plain index on the one attribute:
> >
> > # explain analyze select max(id) from user_messages where user_id = 13604;
> > Total runtime: 0.085 ms
>
> Ah, but:
> 1. Those times are so small, I'm not sure you can reliably separate
>    them. Certainly not from one run.
> 2. For a range of different user-ids I'd expect the user_id_id index to
>    maintain a near-constant time regardless of the number of messages
>    for that user.
> 3. You might be able to reduce your statistics on the user_id column
>    and still keep the fast plan.

Actually, I wasn't looking at the time, but at the cost and estimated
number of rows, which are both lower for the original index, and at the
complexity of the plan, which looks (at least to me) simpler than the
backwards scan.

But you're right. With the combined index I can set the granularity
back to 1000, and empty queries as well as non-empty queries perform
well. The row estimate is still way off, though.

What are the drawbacks of making the statistics buckets finer-grained?

Alexander.
> But you're right. With the combined index I can set the granularity
> back to 1000, and empty queries as well as non-empty queries perform
> well. The row estimate is still way off, though.

A bigger value means a slower ANALYZE. The practical maximum is about
200-300, so be careful.

Regards
Pavel
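[Given that warning, the per-column target can also be dropped back to the server-wide default rather than left at a large hand-picked value; a sketch, assuming the same table and column as above:]

```sql
-- A statistics target of -1 reverts the column to the server-wide
-- default_statistics_target (10 by default in 8.2):
ALTER TABLE user_messages ALTER COLUMN user_id SET STATISTICS -1;
ANALYZE user_messages;
```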