Re: benchmarking the query planner

From: Gregory Stark
Subject: Re: benchmarking the query planner
Date:
Msg-id: 87tz99s7bu.fsf@oxford.xeocode.com
In reply to: Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
Responses: Re: benchmarking the query planner  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
           Re: benchmarking the query planner  (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote:
>> It is a pretty well-known mathematical fact that for something like an
>> opinion poll your margin of error does not depend on the size of the
>> population but only on the size of your sample.
>
> Yes, I agree with that *but* that refers to population statistics and
> has nothing at all to do with the calculation of ndistinct, which is
> what we were discussing. You can't just switch topics and have the
> statement remain true.

If you go back to my email, that was kind of my point. The existing sample size
rests on a solid foundation for the histograms and for most uses of the
statistics, but it is entirely bogus for ndistinct.
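
To make the quoted poll fact concrete, here is a quick simulation (my own
sketch, not anything from the planner): estimate the fraction of rows matching
a predicate from a fixed 3000-row sample, once on a 1M-row table and once on a
100M-row table. The spread of the estimates is essentially the same in both
cases, which is why the histogram-style stats hold up:

    import random

    def estimate_fraction(table_rows, sample_rows, true_fraction=0.3):
        # A row "matches" if its index falls in the first true_fraction
        # of the table; sample indices without replacement and count hits.
        cutoff = int(table_rows * true_fraction)
        sample = random.sample(range(table_rows), sample_rows)
        return sum(1 for i in sample if i < cutoff) / sample_rows

    random.seed(42)
    for table_rows in (10**6, 10**8):
        estimates = [estimate_fraction(table_rows, 3000) for _ in range(200)]
        print(f"N={table_rows:>11,}  "
              f"min={min(estimates):.3f}  max={max(estimates):.3f}")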

The ndistinct estimate is just piggy-backing on that data. However, fixing it
would require switching over to scanning a percentage of the whole table, which
would be a massive increase in work for that one calculation. You can't fix it
by just adjusting the sample size slightly.
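
Contrast that with distinct-value estimation from the same fixed sample. Here
is a sketch (my construction for illustration, not the actual analyze.c code)
using the Duj1 form from the Haas & Stokes line of work, D = n*d / (n - f1 +
f1*n/N), where n is the sample size, d the number of distinct values seen, f1
the number of values seen exactly once, and N the table size. On a skewed
column the estimate stays stuck near the sample size however big the table
gets:

    import random
    from collections import Counter

    def duj1_estimate(sample, table_rows):
        # D = n*d / (n - f1 + f1*n/N): n = sample size, d = distinct
        # values in the sample, f1 = values seen exactly once, N = rows.
        counts = Counter(sample)
        n, d = len(sample), len(counts)
        f1 = sum(1 for c in counts.values() if c == 1)
        return n * d / (n - f1 + f1 * n / table_rows)

    def row_value(i):
        # Skewed column: even rows share 100 common values,
        # odd rows are all unique.
        return i % 100 if i % 2 == 0 else 100 + i

    random.seed(42)
    for table_rows in (10**6, 10**8):
        true_d = 100 + table_rows // 2
        sample = [row_value(i)
                  for i in random.sample(range(table_rows), 3000)]
        est = duj1_estimate(sample, table_rows)
        print(f"N={table_rows:>11,}  true={true_d:>11,}  est={est:>11,.0f}")

Both tables come back with an estimate in the low thousands, even though the
true counts differ by two orders of magnitude; a 3000-row sample simply never
sees enough repeats to tell half a million distinct values from fifty million.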

> Note that the estimator we use was shown to be stable in the range of
> sample size between 5-20%.
> http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf

Uhm, this paper is a survey of lots of different methods with a great deal of
analysis. I don't see any simple conclusions about stability. Perhaps I'm just
missing it in the technical details. Could you point out exactly which part of
the paper you're basing this on, and what "stable" means?

> We currently use a sample size of 300*stats_target. With default=10 that
> means our sample size is 0.3% on a 1 million row table, and 0.003% on a
> 100 million row table (they're common, I find). 
>
> That size of sample is OK for some kinds of statistics, but not much
> good for ndistinct estimation.

Right, but increasing our sample size by a factor of 150 for a 100M row table
doesn't seem like a reasonable solution to one metric being bogus.
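
For reference, the arithmetic behind the quoted percentages (my own check,
assuming the 300 * stats_target sample-size rule):

    stats_target = 10                  # the default being discussed
    sample_rows = 300 * stats_target   # 3,000 rows regardless of table size
    for table_rows in (10**6, 10**8):
        pct = 100.0 * sample_rows / table_rows
        print(f"{table_rows:>11,} rows: sample covers {pct:.3f}%")
    # => 1,000,000 rows: sample covers 0.300%
    # => 100,000,000 rows: sample covers 0.003%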

For that matter, if we do consider sampling 5% of the table we may as well
just go ahead and scan the whole table. It wouldn't take much longer and it
would actually produce good estimates.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's Slony Replication support!

