Thread: PG7.4.5: query not using index on date column


PG7.4.5: query not using index on date column

From: Dave Steinberg
Date:
Hi Folks, I was hoping someone could help me improve the performance of a query I've got that insists on doing a
seq. scan on a large table.  I'm trying to do some reporting based on my spam logs, which I've partly digested and stored
in a table.  Here are the particulars:
 

The messages table:
       Column       |         Type          |                              Modifiers
 -------------------+-----------------------+-----------------------------------------------------------------------
  message_id        | integer               | not null default nextval('spamreport.messages_message_id_seq'::text)
  received_date     | date                  | not null
  score             | numeric               | not null
  user_threshold    | numeric               | not null
  raw_email_address | character varying(64) | not null
  processing_time   | numeric               | not null
  size              | integer               | not null
  fuzzed_address    | character varying(64) | not null
  domain            | character varying(64) | not null
 Indexes:
     "messages_pkey" primary key, btree (message_id)
     "domain_idx" btree ("domain")
     "fuzzy_idx" btree (fuzzed_address)
     "received_date_idx" btree (received_date)
 

And here's the primary query I run, along with explain analyze output:

>> explain analyze SELECT * FROM (
     SELECT
       domain,
       count(*) as count,
       max(score) as max_score,
       avg(score) as average_score,
       stddev(score) as stddev_score,
       sum(CASE WHEN score > user_threshold THEN 1 ELSE 0 END) as spam_count,
       avg(processing_time) as average_time,
       avg(size) as average_size
     FROM messages
     WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30'
     GROUP BY domain
   ) AS aggs
   ORDER BY count DESC;
 
                                                              QUERY PLAN
 -------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=30303.51..30303.60 rows=35 width=226) (actual time=29869.716..29869.883 rows=69 loops=1)
    Sort Key: count
    ->  Subquery Scan aggs  (cost=30301.56..30302.61 rows=35 width=226) (actual time=29861.705..29869.240 rows=69 loops=1)
          ->  HashAggregate  (cost=30301.56..30302.26 rows=35 width=54) (actual time=29861.681..29868.261 rows=69 loops=1)
                ->  Seq Scan on messages  (cost=0.00..21573.04 rows=436426 width=54) (actual time=5.523..6304.657 rows=462931 loops=1)
                      Filter: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
  Total runtime: 29870.437 ms
 

This database gets vacuumed nightly.  Also, the query plan stays the same even if I restrict the received_date column
down to a single day.
 

Thanks in advance,
-- 
Dave Steinberg
http://www.geekisp.com/


Re: PG7.4.5: query not using index on date column

From: Tom Lane
Date:
Dave Steinberg <dave-dated-1101824919.46cd20@redterror.net> writes:
>                ->  Seq Scan on messages  (cost=0.00..21573.04 rows=436426 width=54) (actual time=5.523..6304.657 rows=462931 loops=1)
>                      Filter: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))

How many rows in the table altogether?  A rough guess is a few million
based on the estimated seqscan cost.  That would mean that this query
is retrieving about 10% of the table, which is a large enough fraction
that the planner will probably think a seqscan is best.  It may be right.
If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
output change?
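
For instance, a sketch of that test (the query is a trimmed-down version of the one in the first message, and
enable_seqscan only affects the current session):

    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT domain, count(*) AS count
      FROM messages
     WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30'
     GROUP BY domain;
    RESET enable_seqscan;   -- put the planner back to its normal behavior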

If it's not right, you may want to try to adjust random_page_cost and/or
effective_cache_size so that the planner's estimated costs are more in
line with reality.  Beware of making such adjustments on the basis of
only one test case, though.
        regards, tom lane


Re: PG7.4.5: query not using index on date column

From: Dave Steinberg
Date:
Sorry for the delay in getting back on this thread -- painting my
apartment meant no internet over the Thanksgiving break... :)

> How many rows in the table altogether?  A rough guess is a few million
> based on the estimated seqscan cost.  That would mean that this query
> is retrieving about 10% of the table, which is a large enough fraction
> that the planner will probably think a seqscan is best.  It may be  
> right.
> If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
> output change?

Right now it's a hair under 800k rows.  With enable_seqscan = off, I  
get this as my analyze results:
                                                               QUERY PLAN
 -----------------------------------------------------------------------------------------------------------------------------------------------
  Sort  (cost=361676.23..361676.33 rows=38 width=226) (actual time=46076.756..46076.924 rows=69 loops=1)
    Sort Key: count
    ->  Subquery Scan aggs  (cost=361674.10..361675.24 rows=38 width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
          ->  HashAggregate  (cost=361674.10..361674.86 rows=38 width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
                ->  Index Scan using received_date_idx on messages  (cost=0.00..349968.44 rows=585283 width=54) (actual time=20.988..15020.821 rows=589543 loops=1)
                      Index Cond: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
  Total runtime: 46091.315 ms

Quite a bit higher with these estimates.  Following your 10%-of-the-table
guess above, I did a breakdown by date and got these results:

geekadmin@db:geekisp=> select received_date, count(received_date) from
spamreport.messages group by received_date order by received_date asc;

 received_date | count
---------------+-------
 2004-10-20    |  7592
 2004-10-21    | 19699
 2004-10-22    | 17311
 2004-10-23    | 16730
 2004-10-24    | 18249
 2004-10-25    | 16718
 2004-10-26    | 16951
 2004-10-27    | 19818
 2004-10-28    | 19580
 2004-10-29    | 17610
 2004-10-30    | 16210
 2004-10-31    | 20468
 2004-11-01    | 12337
 2004-11-02    |  9012
 2004-11-03    | 20871
 2004-11-04    | 20103
 2004-11-05    | 18807
 2004-11-06    | 20131
 2004-11-07    | 22291
 2004-11-08    | 23041
 2004-11-09    | 20254
 2004-11-10    | 17810
 2004-11-11    | 21091
 2004-11-12    | 21976
 2004-11-13    | 18824
 2004-11-14    | 20543
 2004-11-15    | 18829
 2004-11-16    | 24248
 2004-11-17    | 18093
 2004-11-18    | 25675
 2004-11-19    | 27084
 2004-11-20    | 22362
 2004-11-21    | 25187
 2004-11-22    | 26451
 2004-11-23    | 26016
 2004-11-24    | 23147
 2004-11-25    | 25785
 2004-11-26    | 20584
 2004-11-27    | 25615
 2004-11-28    |  6931
 2004-11-29    |  6549
(41 rows)

So it looks like an aggregation of two weeks' worth of data is more than
10% of the table, which lines up with what you were saying.  That also jibes
with the fact that when I re-enable enable_seqscan and run the report for
only one or two days of data, it uses the index (previously it did not --
perhaps there was not enough data).
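
For reference, here's a sketch of how one could check that fraction directly
(a hypothetical query, I haven't captured its output; the CASE expression
avoids a second scan and works on 7.4):

    select sum(case when received_date between '2004-11-01' and '2004-11-30'
                    then 1 else 0 end)                              as rows_in_range,
           count(*)                                                 as total_rows,
           round(100.0 * sum(case when received_date between '2004-11-01' and '2004-11-30'
                                  then 1 else 0 end) / count(*), 1) as pct_of_table
      from spamreport.messages;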

> If it's not right, you may want to try to adjust random_page_cost  
> and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality.  Beware of making such adjustments on the basis of
> only one test case, though.

I'll look in the manual and try playing with these options -- thanks for
your suggestions!  In the end, I'm guessing that if I need to retrieve
a few hundred thousand rows from disk, faster disks might have the
biggest impact on performance.
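
It looks like both settings can be tried per session before touching
postgresql.conf; a sketch with purely illustrative values (not
recommendations, and worth testing against more than this one query):

    -- Illustrative values only; the 7.4 defaults are random_page_cost = 4
    -- and effective_cache_size = 1000 (measured in 8 kB pages).
    SET random_page_cost = 3;           -- lower => index scans look cheaper to the planner
    SET effective_cache_size = 65536;   -- ~512 MB assumed available in the OS cache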

Thanks for your help!
-- 
Dave Steinberg
http://www.geekisp.com/