Re: [PERFORM] Bad n_distinct estimation; hacks suggested?

От: Simon Riggs
Тема: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?
Дата: ,
Msg-id: 1114549351.21529.363.camel@localhost.localdomain
(см: обсуждение, исходный текст)
Ответ на: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane)
Ответы: Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku)
Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku)
Список: pgsql-hackers

Скрыть дерево обсуждения

Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  ("Andrew Dunstan", )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Marko Ristola, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
 Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
  Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
   Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
       Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Andrew Dunstan, )
        Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
         Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Markus Schaber, )
          Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
           Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Josh Berkus, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Mischa Sandberg, )
            Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (John A Meinel, )
        Re: [PERFORM] Distinct-Sampling (Gibbons paper) for Postgres  (Josh Berkus, )
        Re: Distinct-Sampling (Gibbons paper) for Postgres  (, )
    Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Tom Lane, )
     Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Simon Riggs, )
      Re: [PERFORM] Bad n_distinct estimation; hacks suggested?  (Gurmeet Manku, )
      Citation for "Bad n_distinct estimation; hacks suggested?"  (Gurmeet Manku, )

On Mon, 2005-04-25 at 17:10 -0400, Tom Lane wrote:
> Simon Riggs <> writes:
> > On Mon, 2005-04-25 at 11:23 -0400, Tom Lane wrote:
> >> It's not just the scan --- you also have to sort, or something like
> >> that, if you want to count distinct values.  I doubt anyone is really
> >> going to consider this a feasible answer for large tables.
>
> > Assuming you don't use the HashAgg plan, which seems very appropriate
> > for the task? (...but I understand the plan otherwise).
>
> The context here is a case with a very large number of distinct
> values...

Yes, but is there another way of doing this other than sampling a larger
proportion of the table? I don't like that answer either, for the
reasons you give.

The manual doesn't actually say this, but you can already alter the
sample size by setting one of the statistics targets higher, but all of
those samples are fixed sample sizes, not a proportion of the table
itself. It seems reasonable to allow an option to scan a higher
proportion of the table. (It would be even better if you could say "keep
going until you run out of memory, then stop", to avoid needing to have
an external sort mode added to ANALYZE).

Oracle and DB2 allow a proportion of the table to be specified as a
sample size during statistics collection. IBM seem to be ignoring their
own research note on estimating ndistinct...

> keep in mind also that we have to do this for *all* the
> columns of the table.

You can collect stats for individual columns. You need only use an
option to increase sample size when required.

Also, if you have a large table and the performance of ANALYZE worries
you, set some fields to 0. Perhaps that should be the default setting
for very long text columns, since analyzing those doesn't help much
(usually) and takes ages. (I'm aware we already don't analyze var length
column values > 1024 bytes).

> A full-table scan for each column seems
> right out to me.

Some systems analyze multiple columns simultaneously.

Best Regards, Simon Riggs



В списке pgsql-hackers по дате сообщения:

От: Tom Lane
Дата:
Сообщение: Re: [proposal] protocol extension to support loadable stream filters
От: Tom Lane
Дата:
Сообщение: Re: pg_restore stuck in a loop?