Re: estimating # of distinct values

From: Florian Pflug
Subject: Re: estimating # of distinct values
Date:
Msg-id: 97F89154-01B4-4F22-ADD3-48D96C84927A@phlo.org
In reply to: Re: estimating # of distinct values  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses: Re: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
List: pgsql-hackers
On Dec27, 2010, at 23:49 , Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> With respect to (b), I think I'd need to see a much more detailed
>> design for how you intend to make this work.  Off the top of my
>> head there seems to be some pretty serious feasibility problems.
> 
> I had one random thought on that -- it seemed like a large concern
> was that there would need to be at least an occasional scan of the
> entire table to rebuild the distinct value information.

I believe we could actually avoid that.

First, the paper "An Optimal Algorithm for the Distinct Elements Problem"
actually contains an algorithm which *does* handle deletes - it's called
the "L_0" estimate there.

Second, as Tomas pointed out, the stream-based estimator is essentially a
simplified version of a bloom filter. It starts out with a field of
N zero bits, and sets K of them to 1 for each value v in the stream.
Which bits are set to 1 depends on some hash function(s) H_i(v). It's
then easy to compute how many 1-bits you'd expect to find in the bit
field after seeing D distinct values, and by reversing that you can
estimate D from the number of 1-bits in the bit field.
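Just to illustrate the idea, here's a minimal stand-alone sketch of such an
estimator (not meant as actual backend code; N, K and the toy hash function
are arbitrary choices for illustration). Each bit stays zero with probability
roughly exp(-K*D/N), so counting the zero bits and inverting that gives the
estimate:

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NBITS 8192              /* N: bits in the field */
#define NHASH 3                 /* K: hash functions per value */

typedef struct { uint8_t bits[NBITS / 8]; } DistinctSketch;

/* toy hash function; a real implementation would use something stronger */
static uint32_t hash_value(uint32_t v, uint32_t seed)
{
    uint32_t h = v * 2654435761u + seed * 40503u;
    h ^= h >> 16;
    h *= 2246822519u;
    h ^= h >> 13;
    return h;
}

/* set the K bits corresponding to value v */
static void sketch_add(DistinctSketch *s, uint32_t v)
{
    int i;
    for (i = 0; i < NHASH; i++)
    {
        uint32_t bit = hash_value(v, i) % NBITS;
        s->bits[bit / 8] |= (uint8_t) (1 << (bit % 8));
    }
}

/* estimate D from the number of zero bits: D ~= -(N/K) * ln(zero_bits/N) */
static double sketch_estimate(const DistinctSketch *s)
{
    int zero_bits = 0;
    int i;
    for (i = 0; i < NBITS; i++)
        if (!(s->bits[i / 8] & (1 << (i % 8))))
            zero_bits++;
    if (zero_bits == 0)
        return (double) NBITS;  /* field saturated, estimate unreliable */
    return -((double) NBITS / NHASH) * log((double) zero_bits / NBITS);
}

int main(void)
{
    DistinctSketch s;
    uint32_t v;
    memset(&s, 0, sizeof(s));
    for (v = 0; v < 10000; v++)
        sketch_add(&s, v % 500);    /* stream with 500 distinct values */
    printf("estimated D = %.1f\n", sketch_estimate(&s));
    return 0;
}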

To avoid having to rescan large tables, instead of storing one such
bit field, we'd store one per B pages of data. We'd then only need
to scan a range of B pages around every updated or deleted tuple,
and could afterwards compute a new global estimate of D by combining
the individual bit fields with bitwise or.
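Combining the per-range bit fields is then just a bitwise union. A rough
sketch, reusing the DistinctSketch type and sketch_estimate() from the
snippet above (the function name and signature here are made up for
illustration):

/*
 * One bit field per range of B pages.  After a single range has been
 * rescanned and its field rebuilt, recompute the global estimate of D
 * by OR-ing all per-range fields together and applying the same
 * zero-bit formula to the merged field.
 */
static double estimate_global(const DistinctSketch *ranges, int nranges)
{
    DistinctSketch merged;
    int r, i;

    memset(&merged, 0, sizeof(merged));

    /* union of the per-range sketches: bitwise OR of the bit fields */
    for (r = 0; r < nranges; r++)
        for (i = 0; i < NBITS / 8; i++)
            merged.bits[i] |= ranges[r].bits[i];

    return sketch_estimate(&merged);
}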

Since the need to regularly VACUUM tables hit by updates or deletes
won't go away any time soon, we could piggy-back the bit field
rebuilding onto VACUUM to avoid a second scan.

A good value for B would probably be around
1000*<size of bit field>/<page size>. If the bit field needs ~100 kB, that'd
make B ~= 12000 pages, i.e. ~100 MB of data covered by each bit field.
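Spelling that arithmetic out with PostgreSQL's default 8 kB page size:
B = 1000 * 100 kB / 8 kB = 12500 pages (rounded to ~12000 above), and
12500 pages * 8 kB/page ~= 100 MB of heap per bit field.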

best regards,
Florian Pflug


