Re: proposal : cross-column stats

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: proposal : cross-column stats
Дата
Msg-id 4D14A52F.8040302@fuzzy.cz
обсуждение исходный текст
Ответ на Re: proposal : cross-column stats  (Florian Pflug <fgp@phlo.org>)
Список pgsql-hackers
Dne 24.12.2010 04:41, Florian Pflug napsal(a):
> The filter size could be derived from the table's statistics target, or
> be otherwise user-definable. We could also auto-resize once it gets too
> full. But still, that all seems awfully complex :-(

Using a statistics target is a good idea I think. I think we could use
it to determine error rate of the filter. Something like
  error rate = 10 - 0.9 * (statistics_target - 100)

which gives
  1%  for statistics target 1000  10% for statistics target 100

or maybe something like this (where the error rate grows faster for
smaller statistic target values)
  error rate = 11 - 91000 / (statistics_target^2)

which gives about
  1%  for statistics target 1000  10% for statistics targer 100  36% for statistics target 50

But I guess 10% error rate is the minimum we need so it does not make
much sense to use lower values.

>> >   Another possibility is to collect the data from just a small portion
>> >   of a table and then use the result to estimate the number of distinct
>> >   values for the whole table. But I'm not sure we can do this reliably,
>> >   I see many traps in this.
> This is how it works currently. The problem with this approach is that
> it gives you very little guarantees about how precise the result will be.
> Extrapolating works very well for things like MKVs and histograms, because
> there you're by definition interested mostly in values which occur often -
> and thus with a high probability in the relative few rows you sample. For
> the number of distinct values, however, this isn't true - if ndistinct
> is an order of magnitude smaller than the number of rows, relatively few
> rows can account for a large percentage of the distinct values...

That basically means we need to sample a large portion of the table :-(

> Another idea would be to obtain the ndistinct values from an index somehow.
> Postgres cannot currently scan an index in physical order, only in logical
> order, due to locking considerations. But since we'd only be interested in
> an estimate, maybe a scan in physical block order would work for ndistinc
> estimates? Just a wild idea, mind you, I haven't checked at all if that'd
> be even remotely feasible.

I was thinking about that too, and I think we could do this using
pageinspect contrib module. Sure, there might be a problem with bloated
indexes.

And relying on this actually means it's required to have a multi-column
index on all the columns. Individual indexes are not enough as we need
to get the number of distinct combinations too.

regards
Tomas


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Itagaki Takahiro
Дата:
Сообщение: Re: SQL/MED - file_fdw
Следующее
От: Peter Eisentraut
Дата:
Сообщение: Re: Re: [COMMITTERS] pgsql: Move the documentation of --no-security-label to a more sensible