Re: Gsoc2012 idea, tablesample

From: Florian Pflug
Subject: Re: Gsoc2012 idea, tablesample
Date:
Msg-id: 73B6F18C-C210-4BC2-BEA2-AE9FF50B71B3@phlo.org
In reply to: Re: Gsoc2012 idea, tablesample  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers
On May10, 2012, at 18:36 , Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> 
>> I wonder if you could do this with something akin to the Bitmap
>> Heap Scan machinery.  Populate a TID bitmap with a bunch of
>> randomly chosen TIDs, fetch them all in physical order
>> and if you don't get as many rows as you need, rinse and repeat
>> until you do.
> 
> Ay, there's the rub.  If you get too many, it is important that you
> read all the way to the end and then randomly omit some of them.

Why is that? From a statistical point of view it shouldn't matter
whether you pick N random samples, or pick M >= N random samples and
then randomly pick N from M. (Random here meaning uniformly
distributed.)
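
A minimal sketch of that equivalence, assuming a tiny in-memory
"table" of 5 rows; the function names and the sizes (M = 4, N = 2) are
made up for illustration and have nothing to do with any PostgreSQL
code. It draws N rows directly, and alternatively draws M rows and then
N of those, and tallies how often each possible N-subset shows up. Both
tallies should come out roughly flat.

```python
import random
from collections import Counter

def sample_direct(population, n, rng):
    """Pick n distinct rows uniformly at random."""
    return frozenset(rng.sample(population, n))

def sample_two_stage(population, m, n, rng):
    """Pick m >= n distinct rows uniformly, then pick n of those uniformly."""
    oversample = rng.sample(population, m)
    return frozenset(rng.sample(oversample, n))

def subset_frequencies(draw, trials):
    """Relative frequency of each distinct subset returned by draw()."""
    counts = Counter(draw() for _ in range(trials))
    return {subset: count / trials for subset, count in counts.items()}

rng = random.Random(42)          # fixed seed so runs are repeatable
population = list(range(5))      # hypothetical 5-row table

direct = subset_frequencies(lambda: sample_direct(population, 2, rng), 20000)
two_stage = subset_frequencies(lambda: sample_two_stage(population, 4, 2, rng), 20000)

for subset in sorted(direct, key=sorted):
    print(sorted(subset), round(direct[subset], 3), round(two_stage[subset], 3))
```

There are 10 possible 2-row subsets of 5 rows, so each should appear
with frequency close to 0.1 under either method.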

> While a bit of a bother, that's pretty straightforward and should be
> pretty fast, assuming you're not, like, an order of magnitude high. 
> But falling short is tougher; making up the difference could be an
> iterative process, which could always wind up with having you read
> all tuples in the table without filling your sample.

But the likelihood of that happening is extremely low, no? Unless the
sampling percentage is very high, that is, but that case isn't of much
practical importance anyway.
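
A rough way to put numbers on this, under assumptions that are mine
rather than anything from the thread: treat the table as a fixed number
of TID slots of which some are live, draw a batch of distinct slots
uniformly at random, and ask how likely it is that the batch contains
fewer live tuples than wanted. That is a hypergeometric tail, and
`shortfall_probability` and its parameters are hypothetical names.

```python
from math import comb

def shortfall_probability(total_slots, live_slots, draws, wanted):
    """P(fewer than `wanted` live tuples among `draws` distinct random slots).

    Hypergeometric tail: sum over k = 0 .. wanted-1 live hits.
    """
    dead_slots = total_slots - live_slots
    ways_total = comb(total_slots, draws)
    ways_short = sum(
        comb(live_slots, k) * comb(dead_slots, draws - k)
        for k in range(min(wanted, draws + 1))
    )
    return ways_short / ways_total

# Hypothetical table: 1000 slots, half of them live, 40 rows wanted.
for draws in (80, 100, 150, 200):
    print(draws, shortfall_probability(1000, 500, draws, 40))
```

The shortfall probability drops quickly as the batch is oversized
relative to the expected hit rate, which is the intuition behind a
second round rarely being needed when the sampling percentage is small.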

But something else comes to mind. Does the standard permit samples taken
with the BERNOULLI method to contain the same tuple multiple times? If
not, any kind of TID-based approach will have to remember all previously
fetched TIDs, which seems doable but unfortunate...
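
To make the bookkeeping concrete, here is a toy sketch of the
"rinse and repeat" loop with duplicate tracking. Everything in it is an
assumption for illustration: TIDs are modelled as (block, slot) pairs,
`live_tids` stands in for the heap visibility check, and `sample_tids`
is a hypothetical name, not an existing function.

```python
import random

def sample_tids(n_blocks, slots_per_block, live_tids, n_wanted, rng):
    """Collect up to n_wanted distinct live TIDs by repeated random probing."""
    total_slots = n_blocks * slots_per_block
    seen = set()     # every TID probed so far, live or not
    sample = []
    while len(sample) < n_wanted and len(seen) < total_slots:
        # Ask only for what is still missing, so a round never overshoots.
        batch_size = min(n_wanted - len(sample), total_slots - len(seen))
        batch = set()
        while len(batch) < batch_size:
            tid = (rng.randrange(n_blocks), rng.randrange(slots_per_block))
            if tid not in seen:          # skip TIDs from earlier rounds
                batch.add(tid)
        seen |= batch
        # Visit the batch in physical (block, slot) order, keep live tuples.
        sample.extend(tid for tid in sorted(batch) if tid in live_tids)
    return sample

# Hypothetical table: 20 blocks of 10 slots, about a third of them live.
live = {(b, s) for b in range(20) for s in range(10) if (b + s) % 3 == 0}
picked = sample_tids(20, 10, live, 15, random.Random(1))
print(len(picked), len(set(picked)))
```

The `seen` set is the cost mentioned above: it grows with the number of
probed TIDs, not just with the number of rows returned. In the worst
case (more rows requested than are live) the loop ends only after every
slot has been probed.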

best regards,
Florian Pflug


