Re: Group-count estimation statistics

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: Group-count estimation statistics
Дата
Msg-id 87u0p1tnk9.fsf@stark.xeocode.com
обсуждение исходный текст
Ответ на Group-count estimation statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Group-count estimation statistics  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
Tom Lane <tgl@sss.pgh.pa.us> writes:

> The reason this happens even with stats is that the algorithm for
> estimating the number of groups in a multi-group-column situation
> is basically "take the product of the number of distinct values of
> each grouping column, but clamp to the number of rows estimated for
> the whole table".  It turns out that customer_id and item_id are
> pretty highly correlated, enough that the actual number of distinct
> groups is barely a hundredth of what you get from the product rule.

So why is it any more reasonable for Postgres to assume 0 correlation than any
other value. Perhaps Postgres should calculate these cases assuming some
arbitrary level of correlation.

For example, pulling an algorithm out of nowhere, it could take the most
selective value, then multiply -- instead of by the next most selective --
just the square root of the next value, then the cube root of the third value,
and the fourth root of the fourth value, etc.

So for 10M records grouped over three columns, each of which has 1,000
distinct values, you would get 1,000 * 1,000 ^ 1/2 * 1,000 ^ 1/3 or 316,228
distinct values. Which seems like not a bad guess actually.

To be honest I took the "successive roots" thing out of nowhere. I suspect
there's a more rigorous statistics approach which would have some actual
motivation. For a given assumed correlation and distribution there ought be a
way to calculate the expected number of distinct combinations. Then when we
have some mechanism for providing real correlation we can just plug that in in
place of the arbitrarily assumed correlation.

Actually the formula would be quite complex. As the total number of records
goes up the expected number of distinct values should approach the total
number of records, even if the number of distinct values of each column
doesn't change.


> The only real solution, of course, is to acquire cross-column
> statistics, but I don't see that happening in the near future.

There's another possible solution, if Postgres kept statistics on the actual
results of the query it could later use that feedback to come up with better
guesses even if it doesn't know *why* they're better.


-- 
greg



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Stephen Frost
Дата:
Сообщение: Re: Group-count estimation statistics
Следующее
От: Darcy Buskermolen
Дата:
Сообщение: -HEAD on FreeBSD 6-CURRENT build failures