Re: Query optimizer 8.0.1 (and 8.0)

From pgsql@mohawksoft.com
Subject Re: Query optimizer 8.0.1 (and 8.0)
Date
Msg-id 16881.24.91.171.78.1107786645.squirrel@mail.mohawksoft.com
In reply to Re: Query optimizer 8.0.1 (and 8.0)  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Query optimizer 8.0.1 (and 8.0)  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: Query optimizer 8.0.1 (and 8.0)  (Mark Kirkwood <markir@coretech.co.nz>)
List pgsql-hackers
> pgsql@mohawksoft.com writes:
>> One of the things that is disturbing to me about the analyze settings is
>> that it wants to sample the same number of records from a table
>> regardless
>> of the size of that table.
>
> The papers that I looked at say that this rule has a good solid
> statistical foundation, at least for the case of estimating histograms.
> See the references cited in analyze.c.


Any and all random sampling assumes a degree of uniform distribution. This
is the basis of the model: it assumes that chunks of the whole will be
representative of the whole (to some degree). This works when normal
variations are more or less uniformly distributed. As variations and
trends become less uniformly distributed, more samples are required to
characterize them.
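
To make that concrete, here is a toy C program, nothing to do with
analyze.c itself (the sample size and the frequencies are invented for
illustration). It estimates value frequencies from a fixed-size sample:
for a common value the estimate is tight, while for a rare one the same
sample size returns mostly noise:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SAMPLE 3000    /* fixed sample size, regardless of table size */
#define TRIALS 500

/* relative spread (stddev / p) of a sampled frequency estimate for a
 * value whose true frequency in the table is p */
static double rel_error(double p)
{
    double sum = 0.0, sumsq = 0.0;
    int t, i;

    for (t = 0; t < TRIALS; t++)
    {
        int hits = 0;
        double est;

        for (i = 0; i < SAMPLE; i++)
            if ((double) rand() / RAND_MAX < p)
                hits++;
        est = (double) hits / SAMPLE;
        sum += est;
        sumsq += est * est;
    }
    sum /= TRIALS;
    sumsq /= TRIALS;
    return sqrt(sumsq - sum * sum) / p;
}

int main(void)
{
    srand(1);
    printf("common value (p=0.5):   +/- %4.1f%%\n", 100.0 * rel_error(0.5));
    printf("rare value   (p=0.001): +/- %4.1f%%\n", 100.0 * rel_error(0.001));
    return 0;
}

The common value comes back within a couple of percent; the rare one is
off by more than half its own size, at the same sample size.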

Douglas Adams had a great device called the "Total Perspective Vortex"
which inferred the whole of the universe from a piece of fairy cake. It was
a subtle play on the absurd notion that a very small sample could lead to
an understanding of an infinitely larger whole.

On a very basic level, why bother sampling the whole table at all? Why not
check one block and infer all information from that? Because we know that
isn't enough data. In a table of 4.6 million rows, can you say with any
mathematical certainty that a sample of 100 points can be, in any way,
representative?
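
Roughly: the chance that a contiguous run of k rows receives none of n
uniform random samples is (1 - k/N)^n. A few lines of C, plugging in the
numbers above, show how much local structure a 100-point sample never
sees:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N = 4600000.0;                   /* rows in the table */
    double n = 100.0;                       /* sample size */
    double runs[] = {1000.0, 10000.0, 100000.0};
    int i;

    for (i = 0; i < 3; i++)
        /* probability that every sampled row falls outside the run */
        printf("a run of %6.0f rows is missed entirely "
               "with probability %.2f\n",
               runs[i], pow(1.0 - runs[i] / N, n));
    return 0;
}

That works out to about 0.98, 0.80, and 0.11 respectively: even a
100,000-row pattern has an 11% chance of leaving no trace in the sample.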

Another problem with random sampling is trend analysis. Oftentimes there
are minor trends in data. Ron pointed out the lastname/firstname trend:
although there seems to be no correlation between firstnames in the table,
there are clearly groups or clusters of ordered data, an ordering that is
missed by too small a sample.
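
As a toy simulation of that effect (the cluster size and the value scheme
are invented purely for illustration): take a table whose rows are sorted
within clusters of 1000, and sample it at 100 points. Adjacent sampled
rows land in different clusters almost every time, so the fine-grained
ordering never shows up:

#include <stdio.h>
#include <stdlib.h>

#define N 4600000L  /* rows */
#define B 1000L     /* rows per internally ordered cluster */
#define SAMPLES 100

/* deterministic pseudo-random base value for each cluster */
static unsigned long block_base(long block)
{
    return (unsigned long) (block * 2654435761UL) % 1000003UL;
}

/* the value stored at row i: ascending within its cluster */
static double row_value(long i)
{
    return (double) block_base(i / B) * B + (double) (i % B);
}

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *) a, y = *(const long *) b;
    return (x > y) - (x < y);
}

int main(void)
{
    long pos[SAMPLES];
    long i, true_asc = 0;
    int s, sample_asc = 0;

    /* exhaustive count: fraction of adjacent row pairs in order */
    for (i = 1; i < N; i++)
        if (row_value(i) > row_value(i - 1))
            true_asc++;

    srand(12345);
    for (s = 0; s < SAMPLES; s++)
        pos[s] = (long) ((double) rand() / RAND_MAX * (N - 1));
    qsort(pos, SAMPLES, sizeof(long), cmp_long);

    for (s = 1; s < SAMPLES; s++)
        if (row_value(pos[s]) > row_value(pos[s - 1]))
            sample_asc++;

    printf("adjacent pairs in order, whole table: %.4f\n",
           (double) true_asc / (N - 1));
    printf("adjacent pairs in order, 100 samples: %.4f\n",
           (double) sample_asc / (SAMPLES - 1));
    return 0;
}

The whole-table figure comes out near 0.999; the sampled figure hovers
around 0.5, i.e. the sample sees what looks like random order.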

I understand why you chose the Vitter algorithm: it provides a basically
sound methodology for sampling without knowing the size of the whole in
advance. But I think we can do better. I would suggest using the current
algorithm the first time through, then adjusting the number of samples [n]
based on the previous estimate of the size of the table [N]. Each
successive ANALYZE would then become more accurate. The Vitter algorithm
remains useful because [N] will always be an estimate.
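
As a rough sketch in C of what I mean (this is not analyze.c code, and
the sqrt() scaling rule below is only a placeholder for whatever function
of [N] we settle on):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Pick [n] from the previous estimate of [N], i.e. what
 * pg_class.reltuples holds after the last ANALYZE.  The sqrt() rule is
 * illustrative only. */
static int choose_sample_size(double prev_reltuples, int default_n)
{
    int n;

    if (prev_reltuples <= 0)
        return default_n;       /* first time through: current behavior */
    n = (int) (10.0 * sqrt(prev_reltuples));
    return (n < default_n) ? default_n : n;
}

/* Reservoir sampling (Algorithm R here; Vitter's Algorithm Z is the
 * faster skip-based refinement).  It needs no advance knowledge of the
 * true row count, which is why it still fits when [N] is an estimate. */
static void sample_rows(long nrows, long *reservoir, int n)
{
    long i;

    for (i = 0; i < nrows; i++)
    {
        if (i < n)
            reservoir[i] = i;   /* stand-in for "keep row i" */
        else
        {
            long j = (long) ((double) rand() / ((double) RAND_MAX + 1.0)
                             * (double) (i + 1));

            if (j < n)
                reservoir[j] = i;
        }
    }
}

int main(void)
{
    double prev_reltuples = 4600000.0;  /* estimate left by the last run */
    int n = choose_sample_size(prev_reltuples, 3000);
    long *res = malloc(n * sizeof(long));

    srand(1);
    sample_rows((long) prev_reltuples, res, n);
    printf("sample size grows from 3000 to %d for ~%.0f rows\n",
           n, prev_reltuples);
    free(res);
    return 0;
}

The only real point is that [n] stops being a constant; the right growth
rate for it is exactly the thing to argue about.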

