Re: [HACKERS] Bad n_distinct estimation; hacks suggested?

От: Andrew Dunstan
Тема: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?
Дата: ,
Msg-id: 426BDE32.4070004@dunslane.net
(см: обсуждение, исходный текст)
Ответ на: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Tom Lane)
Список: pgsql-performance

Скрыть дерево обсуждения

Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: Bad n_distinct estimation; hacks suggested?  ("Dave Held", )
 Re: Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
    Re: Bad n_distinct estimation; hacks suggested?  (Marko Ristola, )
     Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
      Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
       Re: Bad n_distinct estimation; hacks suggested?  (Greg Stark, )
        Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
         Re: Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
         Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  ("Andrew Dunstan", )
         Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
          Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Marko Ristola, )
          Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
           Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
           Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
          Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
         Re: Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
          Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
           Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
            Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
             Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
             Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
              Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
               Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
                Re: Distinct-Sampling (Gibbons paper) for Postgres  (, )
                 Re: Distinct-Sampling (Gibbons paper) for Postgres  (Josh Berkus, )
            Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
             Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
              Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku, )
               Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
              Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku, )
               Foreign key constraints compile faster in 7.4  (Ashish Arte, )
                Re: Foreign key constraints compile faster in 7.4  (Tom Lane, )
 Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
  Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
 Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
  Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
  Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Markus Schaber, )
   Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
    Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (John A Meinel, )
      Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [HACKERS] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )


Tom Lane wrote:

>Josh Berkus <> writes:
>
>
>>Overall, our formula is inherently conservative of n_distinct.   That is, I
>>believe that it is actually computing the *smallest* number of distinct
>>values which would reasonably produce the given sample, rather than the
>>*median* one.  This is contrary to the notes in analyze.c, which seem to
>>think that we're *overestimating* n_distinct.
>>
>>
>
>Well, the notes are there because the early tests I ran on that formula
>did show it overestimating n_distinct more often than not.  Greg is
>correct that this is inherently a hard problem :-(
>
>I have nothing against adopting a different formula, if you can find
>something with a comparable amount of math behind it ... but I fear
>it'd only shift the failure cases around.
>
>
>
>

The math in the paper does not seem to look at very low levels of q (=
sample to pop ratio).

The formula has a range of [d,N]. It appears intuitively (i.e. I have
not done any analysis) that at very low levels of q, as f1 moves down
from n, the formula moves down from N towards d very rapidly. I did a
test based on the l_comments field in a TPC lineitems table. The test
set has N = 6001215, D =  2921877. In my random sample of 1000 I got d =
976 and f1 = 961, for a DUJ1 figure of 24923, which is too low by 2
orders of magnitude.

I wonder if this paper has anything that might help:
http://www.stat.washington.edu/www/research/reports/1999/tr355.ps - if I
were more of a statistician I might be able to answer :-)

cheers

andrew




В списке pgsql-performance по дате сообщения:

От: Kevin Brown
Дата:
Сообщение: Re: Joel's Performance Issues WAS : Opteron vs Xeon
От: Mischa Sandberg
Дата:
Сообщение: Re: [HACKERS] Bad n_distinct estimation; hacks suggested?