Re: PATCH: adaptive ndistinct estimator v4

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: PATCH: adaptive ndistinct estimator v4
Дата
Msg-id 55207A7A.1050802@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: PATCH: adaptive ndistinct estimator v4  (Greg Stark <stark@mit.edu>)
Список pgsql-hackers
Hi,

On 04/03/15 15:46, Greg Stark wrote:
>  > The simple workaround for this was adding a fallback to GEE when f[1]
> or f[2] is 0. GEE is another estimator described in the paper, behaving
> much better in those cases.
>
> For completeness, what's the downside in just always using GEE?

That's a good question.

GEE is the estimator with minimal average error, as defined in Theorem 1 
in that paper. The exact formulation of the theorem is a bit complex, 
but it essentially says that knowing just the sizes of the data set and 
sample, there's an accuracy limit.

Or put another way, it's possible to construct the data set so that the 
estimator gives you estimates with error exceeding some limit (with a 
certain probability).

Knowledge of how much the data set is skewed gives us opportunity to 
improve the estimates by choosing an estimator performing better with 
such data sets. The problem is we don't know the skew - we can only 
estimate it from the sample, which is what the hybrid estimators do.

The AE estimator (presented in the paper and implemented in the patch) 
is an example of such hybrid estimators, but based on my experiments it 
does not work terribly well with one particular type of skew that I'd 
expect to be relatively common (few very common values, many very rare 
values).

Luckily, GEE performs pretty well in this case, but we can use the AE 
otherwise (ISTM it gives really good estimates).

But of course - there'll always be data sets that are poorly estimated 
(pretty much as Theorem 1 in the paper says). I'd be nice to do more 
testing on real-world data sets, to see if this performs better or worse 
than our current estimator.

regards
Tomas



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Petr Jelinek
Дата:
Сообщение: Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: PATCH: nbklno in _hash_splitbucket unused (after ed9cc2b)