Dne 21.12.2010 16:54, Florian Pflug napsal(a):
> I think that maybe it'd be acceptable to scan a large portion of the
> table to estimate dist(A,B), *if* we must only do so only once in a while. But even with
> a full scan, how would we arrive at an estimate for dist(A,B)? Keeping all values in memory,
> and spilling them into, say, an on-disk hash table adds even more overhead to the already
> expensive full scan. Maybe using a bloom filter instead of a hash table could avoid
> the spilling to disk, in exchange for a slightly less precise result...
I've read some basics about a Bloom filters, and I'm not sure we can use
it in this case. I see two problems with this approach:
1) impossibility to predict the number of elements in advance
To build a Bloom filter with limited error rate, you need to size it properly (number of hash function and size of
thebit array). But that's exactly the the information we're looking for.
I guess we could use the highest possible value (equal to the number of tuples) - according to wiki you need about
10bits per element with 1% error, i.e. about 10MB of memory for each million of elements.
That's not much but this needs to be done for each column separately and for the combination - for N columns you
need(at least) N+1 filters.
Sure - increasing the error rate and using a different estimate to build the bloom filter would significantly
decreasethe memory requirements.
2) sampling the whole table
A naive approach would be to sample the whole table each time, but for large tables that might be a bit of a
problem.So I'm thinking about what alternatives are there ...
One possibility is to build the Bloom filter incrementally and store it on disk, or something like that. I guess
thiscould be done during VACUUM or something like that. Anyway there's an issue how to set the filter size initially
(estimatethe number of elements), just as in the previous section.
Another possibility is to collect the data from just a small portion of a table and then use the result to estimate
thenumber of distinct values for the whole table. But I'm not sure we can do this reliably, I see many traps in
this.
But maybe I'm just missing something important.
regards
Tomas