index v. seqscan for certain values

From: Jeremy Dunn
Subject: index v. seqscan for certain values
Date:
Msg-id: 000901c420a4$7b733650$4f01a8c0@jeremydunn
Replies: Re: index v. seqscan for certain values  (Bill Moran <wmoran@potentialtech.com>)
         Re: index v. seqscan for certain values  (Stephan Szabo <sszabo@megazone.bigpanda.com>)
         Re: index v. seqscan for certain values  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-performance
I've searched the archives and can't find an answer to this seemingly simple question.  Apologies if it's too common.
 
The table in question has ~1.3M rows.  It has 85 columns, 5 of which have single-column indexes.
 
The column in question (CID) has 183 distinct values.  For these values, the largest has ~38,000 rows, and the smallest has 1 row.  About 30 values have < 100 rows, and about 10 values have > 20,000 rows.
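For what it's worth, a distribution like this can be confirmed with a simple aggregate (a sketch only; "xxx" stands in for the real table name, as elsewhere in this message):

    -- row count per CID value, largest first
    select cid, count(*) as rows_per_cid
    from xxx
    group by cid
    order by rows_per_cid desc;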
 
The database is PostgreSQL 7.2.3 running on RedHat 7.1 (we are in the process of upgrading to PG 7.4.2).  All of the query-plan options are enabled, and the CPU costs are set to the default values (cpu_tuple_cost is 0.01, cpu_index_tuple_cost is 0.001).  The database is VACUUMed every night.
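For completeness, the current values can be confirmed per session with SHOW; a minimal sketch covering the settings referred to above:

    show enable_seqscan;
    show enable_indexscan;
    show cpu_tuple_cost;
    show cpu_index_tuple_cost;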
 
The problem:
A simple query:
    select count(*) from xxx where CID=<smallval>
where <smallval> is a CID value with relatively few rows, returns a plan that uses the index on that column.
 
   explain analyze select count(*) from xxx where cid=869366;
   Aggregate  (cost=19136.33..19136.33 rows=1 width=0) (actual time=78.49..78.49 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..19122.21 rows=5648 width=0) (actual time=63.40..78.46 rows=1 loops=1)
   Total runtime: 78.69 msec
 
The same plan is chosen for values that have up to about 20,000 rows:
 
   explain analyze select count(*) from xxx where cid=6223341;
   Aggregate  (cost=74384.19..74384.19 rows=1 width=0) (actual time=11614.89..11614.89 rows=1 loops=1)
     ->  Index Scan using xxx_cid on emailrcpts  (cost=0.00..74329.26 rows=21974 width=0) (actual time=35.75..11582.10 rows=20114 loops=1)
   Total runtime: 11615.05 msec
However, for values with > 20,000 rows, the plan changes to a sequential scan, which is proportionately much slower.
 
   explain analyze select count(*) from xxx where cid=7191032;
   Aggregate  (cost=97357.61..97357.61 rows=1 width=0) (actual time=46427.81..46427.82 rows=1 loops=1)
     ->  Seq Scan on xxx  (cost=0.00..97230.62 rows=50792 width=0) (actual time=9104.45..46370.27 rows=37765 loops=1)
   Total runtime: 46428.00 msec
 
 
The question: why does the planner consider a sequential scan to be better for these top 10 values?  Measured in elapsed time per row, the sequential scan is more than twice as slow as the index scan.
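To see what the planner's row estimates are based on, the per-value statistics for the column can be pulled from the standard pg_stats view (a sketch; 'xxx' is the anonymized table name used above):

    -- number of distinct values, most common values, and their estimated frequencies (per ANALYZE)
    select n_distinct, most_common_vals, most_common_freqs
    from pg_stats
    where tablename = 'xxx' and attname = 'cid';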
 
What I tried:
 
A) alter table xxx alter column cid set statistics 500;   
    analyze xxx;
This does not affect the results.
 
B) Dropped and rebuilt the index, with no improvement.
 
C) Decreased cpu_index_tuple_cost by a factor of up to 1000, with no success.
 
D) Forced an index scan for the larger values by setting a very high cpu_tuple_cost (e.g. 0.5), but this doesn't seem like a wise thing to do; a per-session sketch of this experiment follows below.
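For experimenting with D) without touching the server-wide configuration, the overrides can be applied per session and reset afterwards; a sketch:

    -- per-session experiment only; nothing is written to postgresql.conf
    set cpu_tuple_cost = 0.5;
    explain analyze select count(*) from xxx where cid=7191032;
    reset cpu_tuple_cost;

    -- alternatively, disable seq scans just to compare the two plans' actual times
    set enable_seqscan = off;
    explain analyze select count(*) from xxx where cid=7191032;
    reset enable_seqscan;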
 
Your thoughts appreciated in advance!
 
- Jeremy 
 
7+ years of experience with Oracle performance tuning
relatively new to PostgreSQL
