Planner's choice
| От | Nigel J. Andrews | 
|---|---|
| Тема | Planner's choice | 
| Дата | |
| Msg-id | Pine.LNX.4.21.0211131510460.14887-100000@ponder.fairway2k.co.uk обсуждение исходный текст | 
| Ответы | Re: Planner's choice | 
| Список | pgsql-general | 
Ok, I'm not sure why the planner is making the choices it is.
Given:
From pg_stats:
 tablename |   attname   | null_frac | avg_width | n_distinct |
most_common_vals                                                       |
most_common_freqs                                          |
                                                                      histogram_bounds
                                                                                                | correlation 
-----------+-------------+-----------+-----------+------------+--------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------
 chat_post | poster_id   |         0 |         2 |        341 | {2149,1130,731,2595,1879,2473,1842,688,521,1656}
                                                                      |
{0.066,0.039,0.0276667,0.0256667,0.0253333,0.023,0.0226667,0.022,0.021,0.021}                        |
{4,252,582,896,1162,1526,1747,1907,2114,2472,2946}
                                  |    0.036261 
 chat_post | time        |         0 |         8 |  -0.417335 | {"1998-07-08 15:09:00-04","1999-02-26
19:31:00-05","2000-01-2712:07:00-05","2001-05-24 14:30:00-04","2002-01-22 10:04:00-05"} |
{0.000666667,0.000666667,0.000666667,0.000666667,0.000666667}                                        | {"1998-03-05
22:47:00-05","1998-08-2911:49:00-04","1999-03-19 19:39:00-05","1999-08-23 09:26:00-04","2000-03-07
13:45:00-05","2000-11-2915:48:00-05","2001-04-11 11:32:00-04","2001-08-31 15:43:00-04","2002-02-01
12:40:00-05","2002-06-1906:32:00-04","2002-11-12 04:44:00-05"} |           1 
with the table definition:
                 Table "chat_post"
   Column    |           Type           | Modifiers
-------------+--------------------------+-----------
 session_id  | smallint                 | not null
 poster_id   | smallint                 | not null
 time        | timestamp with time zone | not null
 post_number | smallint                 | not null
 fts         | txtidx                   |
Indexes: chat_post_text_idx,
         chat_post_time_idx,
         chat_post_timeuser_idx,
         chat_post_user_idx,
         chat_post_usertime_idx
Primary key: chat_post_pkey
Triggers: RI_ConstraintTrigger_23712080,
          RI_ConstraintTrigger_23706474
where chat_post_timeuser_idx is defined on the columns (time,poster_id)
and chat_post_usertime_idx is defined on the columns (poster_id,time)
Why is the planner not choosing the user_time index? For the first of these I
suspect it's the sky high correlation number on the time column but for the
other two?
avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc limit 2;
NOTICE:  QUERY PLAN:
Limit  (cost=0.00..32.40 rows=2 width=46) (actual time=96204.53..96204.71 rows=2 loops=1)
  ->  Index Scan Backward using chat_post_time_idx on chat_post  (cost=0.00..42370.93 rows=2616 width=46) (actual
time=96204.49..96204.64rows=3 loops=1) 
Total runtime: 96205.18 msec
EXPLAIN
avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by user, time desc limit 2;
NOTICE:  QUERY PLAN:
Limit  (cost=10262.07..10262.07 rows=2 width=46) (actual time=17400.89..17400.95 rows=2 loops=1)
  ->  Sort  (cost=10262.07..10262.07 rows=2616 width=46) (actual time=17400.85..17400.88 rows=3 loops=1)
        ->  Index Scan using chat_post_user_idx on chat_post  (cost=0.00..10113.60 rows=2616 width=46) (actual
time=99.53..16327.97rows=3372 loops=1) 
Total runtime: 17450.35 msec
EXPLAIN
avid_chat_archive=> explain analyze select * from chat_post where poster_id = '1600' order by time desc, user limit 2;
NOTICE:  QUERY PLAN:
Limit  (cost=10262.07..10262.07 rows=2 width=46) (actual time=1109.68..1109.73 rows=2 loops=1)
  ->  Sort  (cost=10262.07..10262.07 rows=2616 width=46) (actual time=1109.65..1109.67 rows=3 loops=1)
        ->  Index Scan using chat_post_user_idx on chat_post  (cost=0.00..10113.60 rows=2616 width=46) (actual
time=1.47..598.56rows=3372 loops=1) 
Total runtime: 1166.65 msec
               (don't forget this is a bogus time due to the caching)
EXPLAIN
This is part of a larger query/issue. By redoing a query sprinkling 'limit's
throughout I managed to make a >90s query complete in 1s (with the loss of the
total number of results that would be returned). However, I then noticed that
was a feature of the particular poster_id I'd been testing with. Picking a far
less frequent poster_id pushed the time up to >300s. The reason for this was
the planner's choice of chat_post_time_idx which for low numbers of occurances
of poster_id was effectively kicking off a seqscan through an indexscan. My
plan now is to maintain my own set of poster_id stats and use one of several
query variants depending on what they say but this requires at least some
understanding of the choices made by the planner.
BTW, the largest number of occurances of a single poster_id is still only 6% of
the entire table. So changing the stats gathering shouldn't make any
difference? There's also >2000 distinct values of poster_id, not the 341
estimated by the stats, but again this shouldn't matter?
--
Nigel J. Andrews
		
	В списке pgsql-general по дате отправления: