Re: Improving N-Distinct estimation by ANALYZE

From: Greg Stark
Subject: Re: Improving N-Distinct estimation by ANALYZE
Date:
Msg-id: 87d5j64ski.fsf@stark.xeocode.com
In reply to: Re: Improving N-Distinct estimation by ANALYZE (Josh Berkus <josh@agliodbs.com>)
Responses: Re: Improving N-Distinct estimation by ANALYZE ("Jim C. Nasby" <jnasby@pervasive.com>)
List: pgsql-hackers
Josh Berkus <josh@agliodbs.com> writes:

> > These statements are at odds with my admittedly basic understanding of
> > statistics.  Isn't the power of a sample more related to the absolute size of
> > the sample than the sample as fraction of the population?  Why not just pick
> > a smallish sample size, say about 3000, and apply it to all the tables, even
> > the ones with just a single row (modify appropriately from block sampling).
> 
> Nope, it's definitely proportional.   As a simple example, a sample of 500 rows
> in a table of 1000 rows should yield stats estimates with 90%+ accuracy.  But a
> sample of 500 rows in a 600,000,000 row table is so small as to be nearly
> useless; it's quite possible to get all the same value in a random sample of <
> 0.1% even on a column with a D/N of 0.001.   If you look at the papers cited,
> almost all researchers more recent than Chaudhuri use a proportional sample
> size.

To be clear, Josh is talking specifically about the estimate of how many
distinct values a query will see, not the more usual estimate of how many
records the query will see.

For estimating how many records a query like

 SELECT * FROM tab WHERE x BETWEEN ? AND ?

will return, the fixed-size sample is on fairly solid ground. A sample of 600
gives (iirc) +/- 2% accuracy 19 times out of 20. That's the same sample size
most major opinion polls use...

However, this same logic doesn't work for estimating distinct values. A single
occurrence of a distinct value is just as important as hundreds of
occurrences, and your chance of finding that single occurrence is proportional
to what percentage of the overall table you sample. So to maintain a given
accuracy you're going to have to sample a fixed percentage of the overall
table, not a fixed number of rows.
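
To make the contrast concrete, here is a similar sketch under assumptions of
my own (a column where every value occurs exactly twice, and a 5% proportional
sample; neither figure comes from the papers). With a fixed 600-row sample the
share of distinct values that show up in the sample at all collapses as the
table grows, while the proportional sample keeps seeing a roughly constant
share:

    import random

    def share_of_distincts_seen(table_size, sample_size):
        # Every value occurs exactly twice, so n_distinct = table_size/2
        # and each value is nearly as easy to miss as a unique one.
        column = [i // 2 for i in range(table_size)]
        sample = random.sample(column, sample_size)
        return len(set(sample)) / (table_size // 2)

    for n in (10_000, 100_000, 1_000_000):
        fixed = share_of_distincts_seen(n, 600)
        prop = share_of_distincts_seen(n, n // 20)   # a 5% sample
        print(f"rows={n:>9}  fixed 600: {fixed:6.1%}"
              f"  5% sample: {prop:6.1%}")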

Worse, my recollection from the paper I mentioned earlier is that sampling
small percentages like 3-5% didn't get you acceptable accuracy. Before you got
anything reliable you found you were sampling very large percentages of the
table. And note that if you have to sample anything over 10-20% you may as
well just read the whole table; random-access reads are that much slower.
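
The back-of-the-envelope arithmetic behind that last remark, as a sketch: the
4:1 random-versus-sequential ratio below matches PostgreSQL's default
random_page_cost of 4.0 relative to sequential page reads, but the real ratio
on any particular disk is an assumption here, not a measurement.

    # Cost of randomly reading a fraction of the table's pages,
    # relative to a sequential scan of the whole table.
    seq_page_cost, random_page_cost = 1.0, 4.0

    def relative_cost(sampled_fraction):
        return sampled_fraction * random_page_cost / seq_page_cost

    for f in (0.05, 0.10, 0.20, 0.25):
        print(f"sampling {f:4.0%} of the pages costs"
              f" {relative_cost(f):4.0%} of a full scan")

At 25% sampled the random reads already cost as much as reading everything,
which is why sampling much past 10-20% stops paying off.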

-- 
greg


