Re: Gsoc2012 idea, tablesample

From: Ants Aasma
Subject: Re: Gsoc2012 idea, tablesample
Date:
Msg-id: CA+CSw_tT64CuakJZWnMkyYBtCmOqdrKhCZReMRdJKt+c+2PMDg@mail.gmail.com
In reply to: Re: Gsoc2012 idea, tablesample  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: Gsoc2012 idea, tablesample  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
On Thu, May 10, 2012 at 6:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I'm worried this project is getting so complicated that it will be
> beyond the ability of a new hacker to get anything useful done.  Can
> we simplify the requirements here to something that is reasonable for
> a beginner?

It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c (acquire_sample_rows) and use that to
implement the SYSTEM sampling method. The language in the standard
leads me to believe that Vitter's algorithm used in analyze.c is
exactly what was intended by the authors. The difference between
Vitter's algorithm and a pure Bernoulli process is precisely that
Vitter's method increases the chance of seeing multiple rows picked
from a single page.
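
For context, the row-level part of acquire_sample_rows is reservoir
sampling. Here is a minimal Python sketch of the Algorithm R form of
the idea (analyze.c actually uses the skip-based Algorithm Z
refinement; the names below are mine, not PostgreSQL's):

```python
import random

def reservoir_sample(rows, k, seed=None):
    """One-pass uniform sample of k items from a stream of unknown
    length (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)  # fill the reservoir first
        else:
            # Replace a reservoir slot with probability k/(i+1),
            # which keeps every row's inclusion probability at k/n.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = row
    return sample
```

The one-pass property is what makes this attractive for SYSTEM
sampling: the scan never needs to know the table's row count up front.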

One tricky issue is that tablesample is defined in terms of a
percentage of the underlying table while Vitter's algorithm needs a
fixed number of rows. The standard does state that the result needs to
contain "approximately" the stated percentage of rows. I'm not sure if
calculating the number of rows to return from reltuples would fit that
definition of approximate. If not, would re-estimating reltuples after
sampling and taking appropriate corrective action be enough, or would
an accurate number be necessary? Getting an accurate count efficiently
would require solving the COUNT(*) issue.
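
To make the reltuples approach concrete, a sketch of the conversion
plus the corrective re-estimation step I have in mind (both helper
names and the density-based re-estimate are hypothetical, not anything
in the tree):

```python
def target_rows(reltuples_estimate, percent):
    """Turn a TABLESAMPLE percentage into the fixed row count that
    Vitter's algorithm needs, from the planner's reltuples estimate."""
    return max(1, round(reltuples_estimate * percent / 100.0))

def corrected_sample(sample, rows_scanned, pages_scanned, total_pages, percent):
    """After the scan, re-estimate the row count from the observed
    row density and trim the sample if reltuples was too high.
    (An estimate that was too low cannot be fixed this cheaply.)"""
    live_estimate = rows_scanned / pages_scanned * total_pages
    want = round(live_estimate * percent / 100.0)
    return sample[:want] if want < len(sample) else sample
```

Whether this trimmed result still counts as "approximately" the stated
percentage is exactly the open question above.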

For the Bernoulli case I can't think of anything simple that would be
better than just scanning the table or poking it with random TIDs (the
latter has the same problem of estimating the desired result set
size). It seems to me that Vitter's approach could be amended to
produce independent samples by selecting slightly more pages than
result tuples and tweaking the acceptance levels to cancel out the
bias, but that definitely isn't in the territory of simple and would
require rigorous statistical analysis.
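
The full-scan variant of the Bernoulli case is at least trivial to
state: one independent coin flip per row, which is also why the result
size is only approximately the requested percentage. A sketch
(function name is mine):

```python
import random

def bernoulli_sample(rows, percent, seed=None):
    """Pure Bernoulli sampling over a full scan: each row is kept
    independently with probability percent/100, so the sample size
    varies around percent% of the input."""
    rng = random.Random(seed)
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]
```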

And as for the monetary unit sampling, I agree that this is better
left as an optional extra.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de

