estimating # of distinct values

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема estimating # of distinct values
Дата
Msg-id 4D18BB0F.5010303@fuzzy.cz
обсуждение исходный текст
Ответы Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
Re: estimating # of distinct values  (Josh Berkus <josh@agliodbs.com>)
Re: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
Список pgsql-hackers
Hi everyone,

about two weeks ago I've started a thread about cross-column stats. One
of the proposed solutions is based on number of distinct values, and the
truth is the current distinct estimator has some problems.

I've done some small research - I have read about 20 papers on this, and
I'd like to present a short summary here, possible solutions etc.

The simple truth is

1) sampling-based estimators are a dead-end
2) there are very interesting stream-based estimators
3) the stream-based estimators have some disadvantages

I wrote a more thorough summary on a wiki
(http://wiki.postgresql.org/wiki/Estimating_Distinct) with a list of the
most interesting papers etc. so just a very short explanation.

1) sampling-based estimators are a dead-end
  The paper from Charikar & Chaudhuri states (and proves) that for  each sampling-based estimator, there's a data set
wherethe estimator  produces arbitrary error (with an indispensable probability). So  replacing one sampling-based
estimatorwith another generally does  not improve the situation - fixes one dataset, breaks another one.
 
  The error is directly related to the size of the sample, so the  truth is that to get a good distinct estimate you
needto scan a  significant portion of the table (almost all of it).
 
  So the estimators based on tiny samples are a dead end.

2) there are very interesting stream-based estimators
  A very interesting idea comes from the field of data streams. These  estimates are based no a one pass through the
data,and then  incremental updates. A very nice thing is that these algorithms  don't need that much space, as they
don'tstore the actual values.
 
  The idea behind this is similar to the Bloom filter, i.e. set bits  of a bitmap using a hash of the value. But in
thiscase it's not  required to be able to answer questions like 'is this an element  of the set X?' so it's actually
evenmore space efficient. For an  introduction see the first paper published in 1985 by Flajolet
(http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.7100).
  One of the recent articles (published in June 2010) actually presents  an optimal algorithm with O(e^-2 + log(n))
spacecomplexity and O(1)  update complexity. The "e" means precision, i.e. that the estimate  will be within [(1-e)F,
(1+e)F]where F is the real value.
 
  The paper on self-learning bitmap states that to track 10^9 distinct  values with 1% relative error you need about 61
kilobitsof space  (which is absolutely awesome). The optimal algorithm needs a bit more  space I think,
 
  A very interesting solution id "distinct sampling" that somehow  extends the Wegman's adaptive sampling approach.

3) the stream-based estimators have some disadvantages
  Not everything is perfect, though - the most serious disadvantage is  that those algorithms (usually) don't handle
removalof elements.  Inserts are easy to handle, but deletes are hard (just as in case of  Bloom filters).
 
  So using this stream-based approach might lead to degradation in  case of heavily updated tables :-(
  I see two possible solutions here:
  (a) use counters instead of bits, and track actual counts - but this      is tricky, especially with 'probabilistic'
algorithmsand needs      much more space (but still, 32x 61kb is just 250kB)
 
  (b) count how many deletes/updates were performed, and if needed      rebuild the whole bitmap
  But even though these disadvantages, there really is no other  way to enhance the estimates. I don't think this
shouldbe a  default behavior - just as in case of cross-column stats this should  be optional when the current
estimatordoes not work well.
 

regards
Tomas


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: C++ keywords in headers (was Re: [GENERAL] #include )
Следующее
От: Tom Lane
Дата:
Сообщение: Re: C++ keywords in headers (was Re: [GENERAL] #include )