Re: proposal : cross-column stats

Поиск

Список

Период

Сортировка

От	Tomas Vondra
Тема	Re: proposal : cross-column stats
Дата	23 декабря 2010 г. 15:39:50
Msg-id	4D13A57C.8040400@fuzzy.cz обсуждение исходный текст
Ответ на	Re: proposal : cross-column stats (Florian Pflug <fgp@phlo.org>)
Ответы	Re: proposal : cross-column stats
Список	pgsql-hackers

Дерево обсуждения

Dne 21.12.2010 16:54, Florian Pflug napsal(a):
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must only do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...

I've read some basics about a Bloom filters, and I'm not sure we can use
it in this case. I see two problems with this approach:

1) impossibility to predict the number of elements in advance
  To build a Bloom filter with limited error rate, you need to size it  properly (number of hash function and size of
thebit array). But  that's exactly the the information we're looking for.
 
  I guess we could use the highest possible value (equal to the number  of tuples) - according to wiki you need about
10bits per element  with 1% error, i.e. about 10MB of memory for each million of  elements.
 
  That's not much but this needs to be done for each column separately  and for the combination - for N columns you
need(at least) N+1  filters.
 
  Sure - increasing the error rate and using a different estimate to  build the bloom filter would significantly
decreasethe memory  requirements.
 

2) sampling the whole table
  A naive approach would be to sample the whole table each time, but  for large tables that might be a bit of a
problem.So I'm thinking  about what alternatives are there ...
 
  One possibility is to build the Bloom filter incrementally and store  it on disk, or something like that. I guess
thiscould be done  during VACUUM or something like that. Anyway there's an issue how to  set the filter size initially
(estimatethe number of elements),  just as in the previous section.
 
  Another possibility is to collect the data from just a small portion  of a table and then use the result to estimate
thenumber of distinct  values for the whole table. But I'm not sure we can do this reliably,  I see many traps in
this.

But maybe I'm just missing something important.

regards
Tomas

В списке pgsql-hackers по дате отправления:

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Re: proposal : cross-column stats