Re: Improving N-Distinct estimation by ANALYZE
| From | Simon Riggs |
|---|---|
| Subject | Re: Improving N-Distinct estimation by ANALYZE |
| Date | |
| Msg-id | 1136448661.21025.251.camel@localhost.localdomain |
| In reply to | Re: Improving N-Distinct estimation by ANALYZE (Greg Stark <gsstark@mit.edu>) |
| List | pgsql-hackers |
On Thu, 2006-01-05 at 00:33 -0500, Greg Stark wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
>
> > The approach I suggested uses the existing technique for selecting
> > random blocks, then either an exhaustive check on all of the rows in a
> > block or the existing random row approach, depending upon available
> > memory. We need to check all of the rows in a reasonable sample of
> > blocks otherwise we might miss clusters of rows in large tables - which
> > is the source of the problems identified.
> >
> > The other reason was to increase the sample size, which is a win in any
> > form of statistics.
>
> Only if your sample is random and independent. The existing mechanism tries
> fairly hard to ensure that every record has an equal chance of being selected.
> If you read the entire block and not appropriate samples then you'll introduce
> systematic sampling errors. For example, if you read an entire block you'll be
> biasing towards smaller records.

Yes, I discussed that, following Brutlag & Richardson [2000]. The bottom
line is: if there is no clustering, block sampling is random, which is
good; if there is clustering, then you spot it, which is good.

> I think it would be useful to have a knob to increase the sample size
> separately from the knob for the amount of data retained in the statistics
> tables. Though I think you'll be disappointed and find you have to read an
> unreasonably large sample out of the table before you get more useful distinct
> estimates.

OK, I'll look at doing that.

Best Regards, Simon Riggs
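[Editor's note: a minimal sketch of the two-level sampling scheme being discussed, for readers following along. It is in Python rather than the backend's C, and the names here (sample_rows, table_blocks, rows_per_block_budget) are invented for illustration; this is not the ANALYZE implementation, just the shape of the idea: pick random blocks first, then either read every row in a chosen block or fall back to random rows within it when memory is tight.]

```python
import random

def sample_rows(table_blocks, n_sample_blocks, rows_per_block_budget=None):
    """Two-level sampling sketch.

    table_blocks: the table as a list of blocks, each block a list of rows
                  (a toy in-memory stand-in for heap pages).
    n_sample_blocks: how many blocks to pick at random (without replacement).
    rows_per_block_budget: if set, cap the rows kept per block by taking a
                  random subset, mimicking the fallback to row sampling.
    """
    chosen = random.sample(range(len(table_blocks)), n_sample_blocks)
    sample = []
    for blkno in chosen:
        block = table_blocks[blkno]
        if rows_per_block_budget is None or len(block) <= rows_per_block_budget:
            # Exhaustive: read every row in the block, so within-block
            # clusters of duplicate values are actually observed.
            sample.extend(block)
        else:
            # Memory-constrained: random rows within the block.
            sample.extend(random.sample(block, rows_per_block_budget))
    return sample

def estimate_ndistinct(sample):
    # Placeholder estimator: just the distinct values seen in the sample.
    # A real estimator would scale this up by the sampling fraction.
    return len(set(sample))

# Toy usage: 1000 blocks of 100 rows, values clustered per block, so
# whole-block reads reveal the clustering that sparse row picks can miss.
table = [[blk * 5 + (r % 5) for r in range(100)] for blk in range(1000)]
sample = sample_rows(table, n_sample_blocks=30)
print(len(sample), estimate_ndistinct(sample))
```

The point of the exhaustive branch is the one made above: if values are clustered by block, per-row sampling rarely lands two rows in the same block and so underestimates duplication, whereas reading whole blocks exposes it.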