Re: estimating # of distinct values

From: Tomas Vondra
Subject: Re: estimating # of distinct values
Msg-id: 4D1CDF9B.5070703@fuzzy.cz
In reply to: Re: estimating # of distinct values  (Florian Pflug <fgp@phlo.org>)
Responses: Re: estimating # of distinct values  (Alvaro Herrera <alvherre@commandprompt.com>)
List: pgsql-hackers
On 30.12.2010 15:43, Florian Pflug wrote:
> On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>
>>> With respect to (b), I think I'd need to see a much more detailed
>>> design for how you intend to make this work.  Off the top of my
>>> head there seems to be some pretty serious feasibility problems.
>>
>> I had one random thought on that -- it seemed like a large concern
>> was that there would need to be at least an occasional scan of the
>> entire table to rebuild the distinct value information.
> 
> I believe we could actually avoid that.
> 
> First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
> actually contains an algorithm which *does* handle deletes - it's called
> "L_0" estimate there.

Hmmm, that's interesting. I know there's a part about L_0 estimation,
but that's about estimating the Hamming norm of a vector, so I ignored
it because I thought we couldn't use it to estimate the number of
distinct values. But if it really handles deletions and we can use it,
then it's definitely worth a closer look.

> Second, as Tomas pointed out, the stream-based estimator is essentially a
> simplified version of a bloom filter. It starts out with a field of
> N zero bits, and sets K of them to 1 for each value v in the stream.
> Which bits are set to 1 depends on some hash function(s) H_i(v). It's
> then easy to compute how many 1-bits you'd expect to find in the bit
> field after seeing D distinct values, and by reversing that you can
> estimate D from the number of 1-bits in the bit field.

No, I didn't say the stream-based estimators are simplified versions
of a Bloom filter. I said the approach is very similar - all of the
algorithms use bitmaps and hash functions - but the algorithms themselves
(Bloom filter vs. probabilistic counting and adaptive sampling) are very
different.

The Bloom filter is much more straightforward; the other algorithms are
much more sophisticated, which allows them to use less space.
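
Just to make the "reverse the expected number of 1-bits" step concrete,
here is a toy C sketch of the simplest single-hash variant (i.e. linear
counting). It's only an illustration - the bitmap size, the hash and the
names are all made up, this is not code from any patch:

/*
 * Toy single-hash bitmap estimator ("linear counting"), illustration only.
 * Each value hashes to one bit; with D distinct values the expected number
 * of zero bits is N * exp(-D/N), so D can be estimated as -N * ln(Z/N),
 * where Z is the observed number of zero bits.
 */
#include <math.h>
#include <stdint.h>

#define NBITS   8192                    /* bitmap size, chosen arbitrarily */

typedef struct
{
    uint8_t     bits[NBITS / 8];
} bitmap_estimator;

static uint32_t
toy_hash(uint32_t v)                    /* any reasonable hash would do */
{
    v ^= v >> 16;  v *= 0x7feb352dU;
    v ^= v >> 15;  v *= 0x846ca68bU;
    v ^= v >> 16;
    return v;
}

static void
bitmap_add(bitmap_estimator *e, uint32_t value)
{
    uint32_t    bit = toy_hash(value) % NBITS;

    e->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
}

static double
bitmap_estimate(const bitmap_estimator *e)
{
    int         zero = 0;

    for (int i = 0; i < NBITS; i++)
        if (!(e->bits[i / 8] & (1 << (i % 8))))
            zero++;

    if (zero == 0)
        return (double) NBITS;          /* saturated, estimate unusable */

    return -(double) NBITS * log((double) zero / NBITS);
}

With K > 1 hash functions per value the same idea applies, only the
formula relating D to the number of 1-bits changes.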

> To avoid having to rescan large tables, instead of storing one such
> bit field, we'd store one per B pages of data. We'd then only need
> to scan a range of B pages around every updated or deleted tuple,
> and could afterwards compute a new global estimate of D by combining
> the individual bit fields with bitwise and.

I don't think this would help.

1) This works only with the Bloom filters, not with the other algorithms
(with those you can't combine the segments using a bitmap OR).

2) With heavily modified tables the updates are usually 'spread' through
the whole table, so you'd have to rebuild all the segments anyway.
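
To be clear about point (1): for the simple bitmaps from the sketch above
the merge itself is trivially a bitwise OR over the segments' bit fields,
and the estimate is then computed from the merged bitmap - the problem is
that the more sophisticated estimators don't keep a structure you can
combine that way. Illustration only:

/* Illustration only: merging two per-segment bit fields of the simple kind. */
#include <stddef.h>
#include <stdint.h>

static void
bitmap_merge(uint8_t *dst, const uint8_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] |= src[i];               /* union of the two segments */
}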
 

> Since the need to regularly VACUUM tables hit by updates or deletes
> won't go away any time soon, we could piggy-back the bit field
> rebuilding onto VACUUM to avoid a second scan.

Well, I guess it's a bit more complicated. First of all, there's a local
VACUUM when doing HOT updates. Second, you need to handle inserts too
(what if the table just grows?).

But I'm not a VACUUM expert, so maybe I'm wrong and this is the right
place to handle rebuilds of distinct stats.

I was thinking about something else - we could 'attach' the rebuild to
an actual seq scan if the amount of changes since the last rebuild
reaches some threshold. Only if the amount of changes reached a higher
threshold would we rebuild the stats on our own.

Something like

IF (# of updates + deletes > 5%)  THEN wait for a seq scan
IF (# of updates + deletes > 10%) THEN rebuild the stats
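
In C-ish terms the decision could look something like this (all names and
thresholds are made up, just to illustrate the two-level threshold -
nothing like this exists yet):

/* Illustration only - hypothetical names, not an existing API. */
typedef enum
{
    NDISTINCT_KEEP,                     /* estimate still considered fresh */
    NDISTINCT_REBUILD_ON_SEQSCAN,       /* piggy-back on the next seq scan */
    NDISTINCT_REBUILD_NOW               /* rebuild the stats on our own */
} ndistinct_action;

static ndistinct_action
ndistinct_check(double total_tuples, double changed_since_rebuild)
{
    double      frac = changed_since_rebuild / total_tuples;

    if (frac > 0.10)
        return NDISTINCT_REBUILD_NOW;
    if (frac > 0.05)
        return NDISTINCT_REBUILD_ON_SEQSCAN;
    return NDISTINCT_KEEP;
}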

I've found a nice ppt describing how Oracle does this:
  http://www.oraclegeek.net/downloads/One_Pass_Distinct_Sampling.ppt

and there's a PDF version:
  http://www.oraclegeek.net/downloads/OnePassDistinctSampling.pdf

According to this, Oracle is using the probabilistic counting approach
(see slide 26). And once they find out there were too many changes in
the table, they rebuild the whole thing.
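
(For reference, the classic probabilistic counting sketch from the
Flajolet-Martin paper looks roughly like this - my illustration of the
general technique, not what Oracle actually implements; a real estimator
averages many such sketches to reduce the variance.)

/* Toy Flajolet-Martin sketch, illustration only. */
#include <math.h>
#include <stdint.h>

typedef struct
{
    uint32_t    sketch;                 /* bit r set => some hash had its lowest 1-bit at r */
} fm_counter;

static void
fm_add(fm_counter *c, uint32_t hashed_value)
{
    int         rho = 0;

    if (hashed_value == 0)
        return;                         /* no set bit to record */
    while (!(hashed_value & (1U << rho)))
        rho++;                          /* position of the lowest 1-bit */
    c->sketch |= 1U << rho;
}

static double
fm_estimate(const fm_counter *c)
{
    int         R = 0;

    while (R < 32 && (c->sketch & (1U << R)))
        R++;                            /* position of the lowest 0-bit */
    return pow(2.0, R) / 0.77351;       /* Flajolet-Martin correction constant */
}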

I'm not saying we should do exactly the same, just that this might be a
good direction.

regards
Tomas

