Re: Correlation in cost_index()

Поиск
Список
Период
Сортировка
От Manfred Koizar
Тема Re: Correlation in cost_index()
Дата
Msg-id f298jvkfqeejig0ed08rpti9nf0754o3eh@4ax.com
обсуждение исходный текст
Ответ на Re: Correlation in cost_index()  (Sean Chittenden <sean@chittenden.org>)
Список pgsql-hackers
On Fri, 8 Aug 2003 15:10:06 -0700, Sean Chittenden
<sean@chittenden.org> wrote:
>> Yes, we knew that already.  Oliver had suggested simply dropping the
>> division by nKeys, thus pretending that the first-column correlation
>> is close enough.  That seems to me to be going too far in the other
>> direction,
>
>But is it really?

Yes.  I once posted two use cases showing that one side is as wrong as
the other.  Wait a momemt ...  Here it is:
http://archives.postgresql.org/pgsql-hackers/2002-10/msg00229.php.

>indexCorrelation is 1.0 for the 1st key in a multi-column index. [...]

Unfortunately there are cases, where the correlation of the first
column doesn't tell us anything about the real index correlation.

>  An indexCorrelation
>of 0.0 doesn't mean that the index is useless either, just that we
>take the full hit of a completely random page read as opposed to some
>fraction of a random page cost.

... and that we cannot expect the tuples we are looking for to be on
the same page or on pages near to each other.  So we have to read a
higher number of pages.

>        indexCorrelation: 0.183380
> Seq Scan on report_user_cat_count rucc  (cost=0.00..21472.69 rows=31893 width=64) (actual time=444.25..2489.27
rows=514loops=1)
 
>
> Index Scan using report_user_cat_count_html_bytes_idx on report_user_cat_count rucc  (cost=0.00..112165.07 rows=31893
width=64)(actual time=68.64..85.75 rows=514 loops=1)
 

The main problem here is the bad estimation for number of rows: 31893
vs. 514.  The seq scan cost depends on the size of the heap relation
and is always the same (~ 21000) for different search conditions.
With low correlation, index scan cost is roughly proportional to the
number of tuples fetched (ignoring the effects of effective_cache_size
for now).  So for a correct guess of number of rows we'd get
            seq    idx
estimated   21000  1800
actual       2500    86
ratio         8.4  20.4

As estimated and actual are not given in the same units, we look at
the estimated/actual ratio and see that index scan cost is
over-estimated by a factor 2.5.  index_cost_algorithm 4 (geometric
interpolation) should get this right.  Can you ANALYSE with a higher
statistics_target and then test with different index_cost_algorithms? 

ServusManfred


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: Correlation in cost_index()
Следующее
От: Manfred Koizar
Дата:
Сообщение: Re: Correlation in cost_index()