Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

От: Simon Riggs
Тема: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Дата: ,
Msg-id: 1114547498.21529.349.camel@localhost.localdomain
(см: обсуждение, исходный текст)
Ответ на: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane)
Ответы: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus)
Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan)
Список: pgsql-hackers

Скрыть дерево обсуждения

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  ("Andrew Dunstan", )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Marko Ristola, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
       Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
        Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Markus Schaber, )
          Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
           Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (John A Meinel, )
        Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres  (Josh Berkus, )
        Re: Distinct-Sampling (Gibbons paper) for Postgres  (, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku, )
      Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku, )

On Sun, 2005-04-24 at 00:48 -0400, Tom Lane wrote:
> Josh Berkus <> writes:
> > Overall, our formula is inherently conservative of n_distinct.   That is, I
> > believe that it is actually computing the *smallest* number of distinct
> > values which would reasonably produce the given sample, rather than the
> > *median* one.  This is contrary to the notes in analyze.c, which seem to
> > think that we're *overestimating* n_distinct.
>
> Well, the notes are there because the early tests I ran on that formula
> did show it overestimating n_distinct more often than not.  Greg is
> correct that this is inherently a hard problem :-(
>
> I have nothing against adopting a different formula, if you can find
> something with a comparable amount of math behind it ... but I fear
> it'd only shift the failure cases around.
>

Perhaps the formula is not actually being applied?

The code looks like this...
 if (nmultiple == 0)
 {
    /* If we found no repeated values, assume it's a unique column */
    stats->stadistinct = -1.0;
 }
 else if (toowide_cnt == 0 && nmultiple == ndistinct)
 {
    /*
     * Every value in the sample appeared more than once.  Assume
     * the column has just these values.
     */
    stats->stadistinct = ndistinct;
 }
 else
 {
    /*----------
     * Estimate the number of distinct values using the estimator
     * proposed by Haas and Stokes in IBM Research Report RJ 10025:


The middle chunk of code looks to me like if we find a distribution
where values all occur at least twice, then we won't bother to apply the
Haas and Stokes equation. That type of frequency distribution would be
very common in a set of values with very high ndistinct, especially when
sampled.

The comment
     * Every value in the sample appeared more than once.  Assume
     * the column has just these values.
doesn't seem to apply when using larger samples, as Josh is using.

Looking at Josh's application it does seem likely that when taking a
sample, all site visitors clicked more than once during their session,
especially if they include home page, adverts, images etc for each page.

Could it be that we have overlooked this simple explanation and that the
Haas and Stokes equation is actually quite good, but just not being
applied?

Best Regards, Simon Riggs



В списке pgsql-hackers по дате сообщения:

От: David Wheeler
Дата:
Сообщение: Re: DO INSTEAD and conditional rules
От: Tom Lane
Дата:
Сообщение: Re: [proposal] protocol extension to support loadable stream filters