Re: Group-count estimation statistics

Поиск

Список

Период

Сортировка

От	Greg Stark
Тема	Re: Group-count estimation statistics
Дата	28 января 2005 г. 21:29:46
Msg-id	87u0p1tnk9.fsf@stark.xeocode.com обсуждение исходный текст
Ответ на	Group-count estimation statistics (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Group-count estimation statistics
Список	pgsql-hackers

Дерево обсуждения

Tom Lane <tgl@sss.pgh.pa.us> writes:

> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table".  It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.

So why is it any more reasonable for Postgres to assume 0 correlation than any
other value. Perhaps Postgres should calculate these cases assuming some
arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the most
selective value, then multiply -- instead of by the next most selective --
just the square root of the next value, then the cube root of the third value,
and the fourth root of the fourth value, etc.

So for 10M records grouped over three columns, each of which has 1,000
distinct values, you would get 1,000 * 1,000 ^ 1/2 * 1,000 ^ 1/3 or 316,228
distinct values. Which seems like not a bad guess actually.

To be honest I took the "successive roots" thing out of nowhere. I suspect
there's a more rigorous statistics approach which would have some actual
motivation. For a given assumed correlation and distribution there ought be a
way to calculate the expected number of distinct combinations. Then when we
have some mechanism for providing real correlation we can just plug that in in
place of the arbitrarily assumed correlation.

Actually the formula would be quite complex. As the total number of records
goes up the expected number of distinct values should approach the total
number of records, even if the number of distinct values of each column
doesn't change.

> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution, if Postgres kept statistics on the actual
results of the query it could later use that feedback to come up with better
guesses even if it doesn't know *why* they're better.

-- 
greg

В списке pgsql-hackers по дате отправления:

Предыдущее

От: Stephen Frost
Дата: 28 января 2005 г., 21:25:47
Сообщение: Re: Group-count estimation statistics

Следующее

От: Darcy Buskermolen
Дата: 28 января 2005 г., 21:43:05
Сообщение: -HEAD on FreeBSD 6-CURRENT build failures

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: Group-count estimation statistics

Предыдущее

Следующее