On Wed, Oct 20, 2010 at 9:53 PM, Nathan Boley <npboley@gmail.com> wrote:
>> Robert explained why having more MCVs might be useful because we use
>> the frequency of the least common MCV as an upper bound on the
>> frequency of any value not in the MCV list.
>
> Where is that being used?
In var_eq_const, in src/backend/utils/adt/selfuncs.c.
> The only non-MCV frequency estimate that I
> recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously
> changing the number of mcv's affects this by lowering
> n_ndistinct_rows, but it's always a pretty coarse estimate.
That one's used, too, but the other is used as an upper bound.
n_distinct tends to come out too small on large tables, so that
formula is prone to overestimation. Actually, both formulas are prone
to overestimation.
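To make the upper-bound point concrete, here's a toy Python sketch (not the actual PostgreSQL C code) of the estimate var_eq_const produces for a constant that isn't in the MCV list: spread the leftover frequency evenly over the remaining distinct values, then clamp at the frequency of the least common MCV, since a value that didn't make the list can't be more common than one that did. The function name and numbers are illustrative.

```python
def non_mcv_selectivity(mcv_freqs, ndistinct, nullfrac=0.0):
    """Estimated selectivity of `col = const` when const is not an MCV."""
    sumcommon = sum(mcv_freqs)
    otherdistinct = ndistinct - len(mcv_freqs)
    # Leftover frequency spread evenly over the non-MCV distinct values.
    selec = (1.0 - sumcommon - nullfrac) / otherdistinct
    # Upper bound: a non-MCV value can't be more frequent than the
    # least common value that did make the MCV list.
    if mcv_freqs:
        selec = min(selec, min(mcv_freqs))
    return max(selec, 0.0)

# Example: 3 MCVs covering 60% of rows, 100 distinct values overall.
# The leftover 0.4 spread over 97 values gives ~0.0041, well under the
# least common MCV (0.15), so the clamp doesn't kick in here.
print(non_mcv_selectivity([0.30, 0.15, 0.15], ndistinct=100))
```

Note how both the even-spread estimate and the clamp depend on n_distinct and the MCV list being accurate, which is where the overestimation creeps in.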
>> Binding the length of the MCV list to the size of the histogram is
>> arbitrary but so would any other value
>
> Wouldn't the best approach be to stop adding MCV's/histogram buckets
> when adding new ones doesn't decrease your prediction error
> 'substantially'?
>
> One very hacky threshold heuristic is to stop adding MCV's when a
> simple equality select ( SELECT col FROM table WHERE col = VALUE )
> would switch the plan from an index to a sequential scan ( or vice
> versa, although with the current code this would never happen ). ie,
> if the non_mcv frequency estimate is 0.1% ( producing an index scan ),
Where that crossover happens depends on the values of a whole boatload of GUCs...
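As a heavily simplified illustration (not the real planner cost model), the sketch below shows why the seq-scan/index-scan crossover point isn't a fixed selectivity: it moves with cost GUCs such as seq_page_cost, random_page_cost, and cpu_tuple_cost. All formulas and numbers here are made up for the example.

```python
def seq_scan_cost(pages, tuples, seq_page_cost=1.0, cpu_tuple_cost=0.01):
    # Read every page sequentially, process every tuple.
    return seq_page_cost * pages + cpu_tuple_cost * tuples

def index_scan_cost(selectivity, tuples, random_page_cost=4.0,
                    cpu_tuple_cost=0.01):
    # Crude assumption: one random page fetch per matching row.
    matching = selectivity * tuples
    return random_page_cost * matching + cpu_tuple_cost * matching

pages, tuples = 10_000, 1_000_000

# At a 0.1% selectivity estimate the index scan wins with these costs...
print(index_scan_cost(0.001, tuples) < seq_scan_cost(pages, tuples))
# ...but change random_page_cost or seq_page_cost and the crossover
# selectivity shifts, so a plan-switch-based MCV cutoff is a moving target.
```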
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company