> -----Original Message-----
> From: Andrew Dunstan [mailto:]
> Sent: Monday, April 25, 2005 3:43 PM
> To:
> Cc: pgsql-perform;
> Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks
> suggested?
>
> Josh Berkus wrote:
>
> >Simon, Tom:
> >
> >While it's not possible to get accurate estimates from a
> >fixed size sample, I think it would be possible from a
> >small but scalable sample: say, 0.1% of all data pages on
> >large tables, up to the limit of maintenance_work_mem.
Note that the results obtained in the cited paper were obtained
from samples of 5 and 10%. It should also warrant caution
that the authors don't offer any proofs of confidence bounds,
even for the "average" case.
> [...]
> After some more experimentation, I'm wondering about some
> sort of adaptive algorithm, a bit along the lines suggested
> by Marko Ristola, but limited to 2 rounds.
One path might be to use the published algorithm and simply
recompute the statistics after every K blocks are sampled,
where K is a reasonably small number. If it looks like the
statistics are converging on a value, then take a few more
samples, check against the trend value and quit. Otherwise
continue until some artificial limit is reached.
> The idea would be that we take a sample (either of fixed
> size, or some small proportion of the table), see how well
> it fits a larger sample (say a few times the size of the
> first sample), and then adjust the formula accordingly to
> project from the larger sample the estimate for the full
> population. Math not worked out yet - I think we want to
> ensure that the result remains bounded by [d,N].
The crudest algorithm could be something like the Newton-
Ralphson method for finding roots. Just adjust the predicted
value up or down until it comes within an error tolerance of
the observed value for the current sample. No need to choose
powers of 2, and I would argue that simply checking every so
often on the way to a large sample that can be terminated
early is more efficient than sampling and resampling. Of
course, the crude algorithm would almost certainly be I/O
bound, so if a more sophisticated algorithm would give a
better prediction by spending a few more CPU cycles on each
sample block gathered, then that seems like a worthwhile
avenue to pursue.
As far as configuration goes, the user is most likely to
care about how long it takes to gather the statistics or
how accurate they are. So it would probably be best to
terminate the sampling process on a user-defined percentage
of the table size and the minimum error tolerance of the
algorithmic prediction value vs. the computed sample value.
If someone wants a fast and dirty statistic, they set the
row percent low and the error tolerance high, which will
effectively make the blocks read the limiting factor. If
they want an accurate statistic, they set the row percent
as high as they feel they can afford, and the error
tolerance as low as they need to in order to get the query
plans they want.
__
David B. Held
Software Engineer/Array Services Group
200 14th Ave. East, Sartell, MN 56377
320.534.3637 320.253.7800 800.752.8129