Re: benchmarking the query planner
| From | Gregory Stark |
|---|---|
| Subject | Re: benchmarking the query planner |
| Date | |
| Msg-id | 87tz99s7bu.fsf@oxford.xeocode.com |
| In response to | Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>) |
| Responses | Re: benchmarking the query planner (Tom Lane <tgl@sss.pgh.pa.us>); Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>); Re: benchmarking the query planner (Simon Riggs <simon@2ndQuadrant.com>) |
| List | pgsql-hackers |
Simon Riggs <simon@2ndQuadrant.com> writes:

> On Fri, 2008-12-12 at 06:44 -0500, Robert Haas wrote:
>> It is a pretty well-known mathematical fact that for something like an
>> opinion poll your margin of error does not depend on the size of the
>> population but only on the size of your sample.
>
> Yes, I agree with that *but* that refers to population statistics and
> has nothing at all to do with the calculation of ndistinct, which is
> what we were discussing. You can't just switch topics and have the
> statement remain true.

If you go back to my email, that was exactly my point. The existing sample size rests on a solid foundation for the histograms and for most uses of the statistics, but it is entirely bogus for ndistinct; the ndistinct estimate is just piggy-backing on that data. Fixing it, however, would require switching over to scanning a percentage of the whole table, which would be a massive increase in work for that one calculation. You can't fix it by adjusting the sample size slightly.

> Note that the estimator we use was shown to be stable in the range of
> sample size between 5-20%.
> http://www.almaden.ibm.com/cs/people/peterh/jasa3rj.pdf

Uhm, this paper surveys lots of different methods and does lots of analysis; I don't see any simple conclusion about stability in it, though perhaps I'm just missing it in the technical details. Could you point out exactly which part of the paper you're basing this on, and what "stable" means there?

> We currently use a sample size of 300*stats_target. With default=10 that
> means our sample size is 0.3% on a 1 million row table, and 0.003% on a
> 100 million row table (they're common, I find).
>
> That size of sample is OK for some kinds of statistics, but not much
> good for ndistinct estimation.

Right, but increasing our sample size by a factor of 150 for a 100M-row table doesn't seem like a reasonable solution to one metric being bogus.
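To make the asymmetry concrete, here is a small simulation sketch (not PostgreSQL's actual Haas-Stokes estimator; the table shape, seed, and naive scale-up rule are illustrative assumptions). It draws a 3000-row sample (300 * the default stats_target of 10, i.e. 0.3% of a 1M-row table) and shows that the sample median lands near the true median while a naive ndistinct scale-up is off by a large factor:

```python
import random

random.seed(42)

N = 1_000_000   # "table" size
D = 100_000     # values drawn uniformly from 1..D, so true ndistinct ~= D
table = [random.randint(1, D) for _ in range(N)]

# 300 * stats_target, with the default stats_target of 10 => 3000 rows (0.3%)
sample = random.sample(table, 3000)

# Quantile-style statistic: the sample median tracks the true median (~D/2)
# closely, exactly as the margin-of-error argument predicts.
sample_median = sorted(sample)[len(sample) // 2]

# Naive ndistinct: count distinct values in the sample and scale up by the
# sampling fraction. Most values appear only once in a 0.3% sample, so this
# wildly overestimates (here by roughly 10x).
d_in_sample = len(set(sample))
naive_ndistinct = d_in_sample * (N / len(sample))

true_ndistinct = len(set(table))
print(sample_median, round(naive_ndistinct), true_ndistinct)
```

The point of the sketch is only that the same 3000-row sample supports one statistic and not the other; no slight bump in sample size rescues the second.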
For that matter, if we do consider sampling 5% of the table, we may as well just go ahead and scan the whole table. It wouldn't take much longer, and it would actually produce good estimates.

--
Gregory Stark
EnterpriseDB          http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!