Re: ANALYZE sampling is too good

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: ANALYZE sampling is too good
Дата
Msg-id CAM-w4HNfAaWSZKLoSjnF4SMU4qUoumtMLcYwG=YQQEjdTwnHXA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: ANALYZE sampling is too good  (Josh Berkus <josh@agliodbs.com>)
Список pgsql-hackers
On Sun, Dec 8, 2013 at 7:03 PM, Josh Berkus <josh@agliodbs.com> wrote:
> They will certainly do so if you don't apply any statistical adjustments
> for selecting more rows from the same pages.
>
> So there's a set of math designed to calculate for the skew introduced
> by reading *all* of the rows in each block.

I just think this is an oversimplification. There's no skew introduced
just by reading all the rows in each block unless there's some kind of
dependency between the block a row is placed on and the data in it. So
I don't believe there can be some single set of math that
automatically removes any skew automatically. The math will depend on
what the dependency is.

Just to be clear, you have to think pretty hard about the way Postgres
internals work to see what kinds of skew might be appearing here. Due
to the way Postgres updates work and HOT cleanups work "hot" tuples
will be weighted less than "cold" tuples. That's not going to be
something someone in ACM knew to design into their maths.

I do have access to ACM or other academic articles if you remember any
author names or any keywords but if it's a database journal I would
worry about patent issues. Do you remember if it was over 17 years
old?


> Obviously, you need to look at more rows than you would with a
> pure-random sample.  Like I said, the 80%+ accurate point in the papers
> seemed to be at a 5% sample.

I really don't believe the 5% thing. It's not enough for n_distinct
and it's *far* too high a value for linear properties like histograms
or nullfrac etc. From a computer point of view it's too high to be
worth bothering. If we have to read 5% of the table we might as well
do a full scan anyways, it'll be marginally slower but much better
quality results.



-- 
greg



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "MauMau"
Дата:
Сообщение: Re: [RFC] Shouldn't we remove annoying FATAL messages from server log?
Следующее
От: Mark Kirkwood
Дата:
Сообщение: Re: ANALYZE sampling is too good