Re: Improving N-Distinct estimation by ANALYZE
| От | Simon Riggs | 
|---|---|
| Тема | Re: Improving N-Distinct estimation by ANALYZE | 
| Дата | |
| Msg-id | 1136422621.21025.244.camel@localhost.localdomain обсуждение исходный текст | 
| Ответ на | Re: Improving N-Distinct estimation by ANALYZE (Greg Stark <gsstark@mit.edu>) | 
| Список | pgsql-hackers | 
On Wed, 2006-01-04 at 19:22 -0500, Greg Stark wrote: > I think you're right that a reasonable sample size for this kind of > estimate > is going to be proportional to the table size, not the constant sized > sample > that regular statistics need. Agreed [I said exactly that in April]; the counter argument at that time was that proportional samples on large tables lead to external sorts in many cases which would lead to unacceptably long run times - since we need to sort the values for each attribute in turn. I've proposed limiting ourselves to maintenance_work_mem (I credit Josh with that idea from April). If you can allocate 1 GB of memory to an ANALYZE then this would be a very large proportion of a medium sized table (or partition). Considering how few rows we sample at the moment, increasing the actual sample size by many 000s would have a very beneficial effect. > On Tue, 2005-04-26 at 19:03 -0400, Greg Stark wrote: > > This one looks *really* good. > > > > http://www.aladdin.cs.cmu.edu/papers/pdfs/y2001/dist_sampl.pdf > > > > It does require a single full table scan The Distinct Sampling approach you mention would scan the whole table and also use large in-memory hash tables. So it uses more I/O, the same memory and probably less CPU - no sorting required. The technique is the same implementation as a HashAgg, just one that loses rows in a predictable manner when it spills out of memory. It doesn't identify columns that scale with N, nor does it calculate correlation. Thats the same as re-writing Count(Distinct) to use hashing, which is a TODO item. So perhaps you could plan the code to do the Distinct Sampling approach at the same time. Hmmm. I'll think about that. Best Regards, Simon Riggs
В списке pgsql-hackers по дате отправления: