Re: Gsoc2012 idea, tablesample

From: Ants Aasma
Subject: Re: Gsoc2012 idea, tablesample
Date:
Msg-id: CA+CSw_tT64CuakJZWnMkyYBtCmOqdrKhCZReMRdJKt+c+2PMDg@mail.gmail.com
In reply to: Re: Gsoc2012 idea, tablesample  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Gsoc2012 idea, tablesample  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
On Thu, May 10, 2012 at 6:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm worried this project is getting so complicated that it will be
> beyond the ability of a new hacker to get anything useful done.  Can
> we simplify the requirements here to something that is reasonable for
> a beginner?

It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c (acquire_sample_rows) and use that to
implement the SYSTEM sampling method. The language in the standard
leads me to believe that Vitter's algorithm used in analyze.c is
exactly what was intended by the authors. The difference between
Vitter's algorithm and a pure Bernoulli process is precisely that
Vitter's method increases the chance of seeing multiple rows picked
from a single page.
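
For context, the row-level part of acquire_sample_rows is reservoir
sampling. Here is a minimal Python sketch of the Algorithm R form of
the idea (analyze.c actually uses the skip-based Algorithm Z
refinement; the names below are mine, not PostgreSQL's):

```python
import random

def reservoir_sample(rows, k, seed=None):
    """One-pass uniform sample of k items from a stream of unknown
    length (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability k/(i+1),
            # which keeps every row's inclusion probability at k/n.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample
```

The one-pass property is what makes this attractive for SYSTEM
sampling: the scan never needs to know the table's row count up front.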

One tricky issue is that tablesample is defined in terms of a
percentage of the underlying table while Vitter's algorithm needs a
fixed number of rows. The standard does state that the result needs to
contain "approximately" the stated percentage of rows. I'm not sure if
calculating the number of rows to return from reltuples would fit that
definition of approximate. If not, would re-estimating reltuples after
sampling and taking appropriate corrective action be enough, or would
an accurate number be necessary? Getting an accurate count efficiently
would require solving the COUNT(*) issue.
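
To make the reltuples approach concrete, a sketch of the conversion
plus the corrective re-estimation step I have in mind (both helper
names and the density-based re-estimate are hypothetical, not anything
in the tree):

```python
def target_rows(reltuples_estimate, percent):
    """Turn a TABLESAMPLE percentage into the fixed row count that
    Vitter's algorithm needs, from the planner's reltuples estimate."""
    return max(1, round(reltuples_estimate * percent / 100.0))

def corrected_sample(sample, rows_scanned, pages_scanned, total_pages, percent):
    """After the scan, re-estimate the row count from the observed
    row density and trim the sample if reltuples was too high.
    (An estimate that was too low cannot be fixed this cheaply.)"""
    live_estimate = rows_scanned / pages_scanned * total_pages
    want = round(live_estimate * percent / 100.0)
    return sample[:want] if want < len(sample) else sample
```

Whether this trimmed result still counts as "approximately" the stated
percentage is exactly the open question above.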

For the Bernoulli case I can't think of anything simple that would be
better than just scanning the table or poking it with random TIDs (the
latter has the same problem of estimating the desired result set
size). It seems to me that Vitter's approach could be amended to
produce independent samples by selecting slightly more pages than
result tuples and tweaking the acceptance levels to cancel out the
bias, but that definitely isn't in the territory of simple and would
require rigorous statistical analysis.
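
The full-scan variant of the Bernoulli case is at least trivial to
state: one independent coin flip per row, which is also why the result
size is only approximately the requested percentage. A sketch
(function name is mine):

```python
import random

def bernoulli_sample(rows, percent, seed=None):
    """Pure Bernoulli sampling over a full scan: each row is kept
    independently with probability percent/100, so the sample size
    varies around percent% of the input."""
    rng = random.Random(seed)
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]
```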

And as for the monetary unit sampling, I agree that this is better
left as an optional extra.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

