Re: benchmarking the query planner

From: Greg Stark
Subject: Re: benchmarking the query planner
Date:
Msg-id: 4136ffa0812111823u645b6ec9wdca60b3da4b00499@mail.gmail.com
In reply to: Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On Thu, Dec 11, 2008 at 11:44 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> On Thu, 2008-12-11 at 22:29 +0000, Gregory Stark wrote:
>
>> > And I would like it even more if the sample size increased according
>> > to table size, since that makes ndistinct values fairly random for
>> > large tables.
>>
>> Unfortunately _any_ ndistinct estimate based on a sample of the table
>> is going to be pretty random.
>
> We know that constructed data distributions can destroy the
> effectiveness of the ndistinct estimate and make sample size irrelevant.
> But typical real world data distributions do improve their estimations
> with increased sample size and so it is worthwhile.

Well, that just means "more is always better", which puts us back in the
same boat of always needing more.

The existing sampling mechanism is grounded in solid statistics. It
provides the correct sample size to get a consistent confidence
interval for range queries. This is the same mathematics that governs
election polling and other surveys. The sample size you need to get
+/- 5%, 19 times out of 20, increases as the population increases, but
not by very much.
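As a back-of-envelope illustration of that polling math (my own sketch,
not anything in the tree), the classic sample-size formula with a finite
population correction shows how slowly the required sample grows with
table size:

```python
import math

def sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion within +/- margin,
    19 times out of 20 (z = 1.96), with finite population correction."""
    # infinite-population sample size: ~384 rows for +/- 5% at 95%
    n0 = z * z * p * (1 - p) / (margin * margin)
    # finite population correction: grows with N, but levels off fast
    return math.ceil(n0 / (1 + (n0 - 1) / population))

for N in (1_000, 100_000, 10_000_000):
    print(N, sample_size(N))  # → 278, 383, 385: the sample barely grows
```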

However, ndistinct is a different kind of question. Instead of the
relatively small and slowly growing sample size, you need a percentage
of the whole table. It would mean letting this one measure control the
whole sample size decision. And the end result would be pretty
unimpressive: the papers I read showed pretty poor results even for
sample sizes as large as 50% of the table.
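To make that concrete, here is an illustrative simulation (the data
distribution and numbers are mine, purely hypothetical) of a Duj1-style
estimator from the Haas & Stokes family — if memory serves, the same
family ANALYZE's comments cite — applied to samples of a skewed column:

```python
import random
from collections import Counter

def duj1(sample, N):
    """Haas & Stokes Duj1 estimator: scale the sampled distinct count
    by a correction driven by f1, the number of values seen exactly
    once in the sample."""
    n = len(sample)
    counts = Counter(sample)
    d = len(counts)                                  # distinct in sample
    f1 = sum(1 for c in counts.values() if c == 1)   # sampled singletons
    return n * d / (n - f1 + f1 * n / N)

random.seed(42)
N = 100_000
# Zipf-ish skewed column: a few frequent values plus a long tail
table = [int(random.paretovariate(1.0)) for _ in range(N)]
true_d = len(set(table))
for frac in (0.01, 0.10, 0.50):
    sample = random.sample(table, int(N * frac))
    print(f"{frac:4.0%} sample: estimate {duj1(sample, N):8.0f}"
          f" vs true {true_d}")
```

On heavy-tailed data like this the estimate typically stays well off
even at large sample fractions, which is the behavior the papers report.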

Simply raising the statistics target for larger tables would not be
very helpful. All it would do is raise the table size at which the
sample becomes too small. You would have to change to using a
percentage of the whole table -- which would make ANALYZE a *lot*
slower for larger tables.

Really the only way we'll get good ndistinct calculations is if we
decide to scan the whole table. If we have to do that for some other
reason then there are algorithms for gathering enough information to
handle ndistinct properly.
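One such algorithm, sketched here purely for illustration (not a
proposal of this exact code), is a k-minimum-values sketch: a single
full-scan pass keeps only the k smallest hash values, giving an
ndistinct estimate in O(k) memory.

```python
import hashlib
import heapq

def kmv_estimate(values, k=256):
    """K-minimum-values sketch: one streaming pass over all rows,
    retaining only the k smallest hashes of the values seen."""
    heap = []          # holds -x, so heap[0] is -(max of kept hashes)
    in_sketch = set()  # hash values currently retained
    for v in values:
        digest = hashlib.blake2b(str(v).encode(), digest_size=8).digest()
        x = int.from_bytes(digest, "big") / 2**64  # uniform-ish in [0, 1)
        if x in in_sketch:
            continue                    # duplicate of a kept value
        if len(heap) < k:
            heapq.heappush(heap, -x)
            in_sketch.add(x)
        elif x < -heap[0]:              # beats the current kth minimum
            dropped = -heapq.heappushpop(heap, -x)
            in_sketch.discard(dropped)
            in_sketch.add(x)
    if len(heap) < k:
        return len(heap)                # under k distinct hashes: exact
    return round((k - 1) / -heap[0])    # classic KMV estimate
```

The estimate's relative error shrinks like 1/sqrt(k), so unlike
sample-based estimators you can dial in accuracy with a small, fixed
memory budget -- but only because every row gets scanned.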

-- 
greg

