Re: estimating # of distinct values

From: Josh Berkus
Subject: Re: estimating # of distinct values
Msg-id: 4D198633.8070406@agliodbs.com
In reply to: estimating # of distinct values  (Tomas Vondra <tv@fuzzy.cz>)
Responses: Re: estimating # of distinct values  (Robert Haas <robertmhaas@gmail.com>)
           Re: estimating # of distinct values  (tv@fuzzy.cz)
List: pgsql-hackers

> The simple truth is
> 
> 1) sampling-based estimators are a dead-end

While I don't want to discourage you from working on stream-based
estimators ... I'd love to see you implement a proof-of-concept for
PostgreSQL, and test it ... the above is a non-argument.  It requires us
to accept that sampling-based estimates cannot ever be made to work,
simply because you say so.

The Charikar and Chaudhuri paper does not, in fact, say that it is
impossible to improve sampling-based estimators as you claim it does. In
fact, the authors offer several ways to improve sampling-based
estimators.  Further, 2000 was hardly the end of published work on
sampling-based estimation; there are later papers with newer ideas.

For example, I still think we could tremendously improve our current
sampling-based estimator without increasing I/O by moving to block-based
estimation*.  The accuracy statistics for block-based samples of 5% of
the table look quite good.

I would agree that it's impossible to get a decent estimate of
n-distinct from a 1% sample.  But there's a huge difference between 5%
or 10% and "a majority of the table".
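To make the sample-size question concrete, here is a minimal Python sketch
rather than anything taken from the PostgreSQL source. It applies the
Haas-Stokes "Duj1" formula, n*d / (n - f1 + f1*n/N), which as far as I know
is the formula ANALYZE uses, to samples made of whole blocks. The function
names, the synthetic table, and the 100-rows-per-block figure are all
assumptions made for illustration, and the block sampler is a naive one, not
the method from the paper cited below.

```python
import random
from collections import Counter


def duj1_estimate(sample, total_rows):
    """Haas-Stokes Duj1 estimate of n-distinct: n*d / (n - f1 + f1*n/N)."""
    n = len(sample)
    if n == 0:
        return 0.0
    counts = Counter(sample)
    d = len(counts)                                 # distinct values seen in the sample
    f1 = sum(1 for c in counts.values() if c == 1)  # values seen exactly once
    return n * d / (n - f1 + f1 * n / total_rows)


def block_sample(table, rows_per_block, fraction, rng):
    """Pick whole blocks at random and return every row they contain."""
    blocks = [table[i:i + rows_per_block]
              for i in range(0, len(table), rows_per_block)]
    k = max(1, round(len(blocks) * fraction))       # number of blocks to read
    chosen = rng.sample(blocks, k)
    return [row for block in chosen for row in block]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical column: 100,000 rows drawn uniformly from 5,000 values.
    table = [rng.randrange(5_000) for _ in range(100_000)]
    actual = len(set(table))
    for fraction in (0.01, 0.05, 0.10):
        sample = block_sample(table, rows_per_block=100,
                              fraction=fraction, rng=rng)
        estimate = duj1_estimate(sample, len(table))
        print(f"{fraction:.0%} of blocks: estimate {estimate:.0f}, "
              f"actual {actual}")
```

A uniform synthetic column is the easy case; any real comparison would need
skewed and physically clustered data, which is where block-level samples and
row-level samples can be expected to behave differently.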

Again, don't let this discourage you from attempting to write a
stream-based estimator.  But do realize that you'll need to *prove* its
superiority, head-to-head, against sampling-based estimators.

[* http://www.jstor.org/pss/1391058 (unfortunately, no longer
public-access)]

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
 

