Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

От: Andrew Dunstan
Тема: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Дата: ,
Msg-id: 426D565E.8040400@dunslane.net
(см: обсуждение, исходный текст)
Ответ на: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus)
Ответы: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg)
Список: pgsql-hackers

Скрыть дерево обсуждения

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  ("Andrew Dunstan", )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Marko Ristola, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
       Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
        Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Markus Schaber, )
          Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
           Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (John A Meinel, )
        Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres  (Josh Berkus, )
        Re: Distinct-Sampling (Gibbons paper) for Postgres  (, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku, )
      Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku, )


Josh Berkus wrote:

>Simon, Tom:
>
>While it's not possible to get accurate estimates from a fixed size sample, I
>think it would be possible from a small but scalable sample: say, 0.1% of all
>data pages on large tables, up to the limit of maintenance_work_mem.
>
>Setting up these samples as a % of data pages, rather than a pure random sort,
>makes this more feasable; for example, a 70GB table would only need to sample
>about 9000 data pages (or 70MB).  Of course, larger samples would lead to
>better accuracy, and this could be set through a revised GUC (i.e.,
>maximum_sample_size, minimum_sample_size).
>
>I just need a little help doing the math ... please?
>
>


After some more experimentation, I'm wondering about some sort of
adaptive algorithm, a bit along the lines suggested by Marko Ristola,
but limited to 2 rounds.

The idea would be that we take a sample (either of fixed size, or some
small proportion of the table) , see how well it fits a larger sample
(say a few times the size of the first sample), and then adjust the
formula accordingly to project from the larger sample the estimate for
the full population. Math not worked out yet - I think we want to ensure
that the result remains bounded by [d,N].

cheers

andrew




В списке pgsql-hackers по дате сообщения:

От: Christopher Kings-Lynne
Дата:
Сообщение: Re: [PATCHES] Continue transactions after errors in psql
От: Mark Kirkwood
Дата:
Сообщение: Re: [PATCHES] Continue transactions after errors in psql