Re: Gsoc2012 idea, tablesample

From: Kevin Grittner
Subject: Re: Gsoc2012 idea, tablesample
Msg-id: 4FAC3BB20200002500047B24@gw.wicourts.gov
In reply to: Gsoc2012 idea, tablesample  (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>)
Responses: Re: Gsoc2012 idea, tablesample  (Florian Pflug <fgp@phlo.org>)
List: pgsql-hackers

[point splintered in quoting re-joined]
Florian Pflug  wrote:
> On May 10, 2012, at 18:36, Kevin Grittner wrote:
>> Robert Haas  wrote:
>>
>>> I wonder if you could do this with something akin to the Bitmap
>>> Heap Scan machinery. Populate a TID bitmap with a bunch of
>>> randomly chosen TIDs, fetch them all in physical order
>>> and if you don't get as many rows as you need, rinse and repeat
>>> until you do.
>>
>> If you get too many, it is important that you read all the way to
>> the end and then randomly omit some of them.  While a bit of a
>> bother, that's pretty straightforward and should be pretty fast,
>> assuming you're not, like, an order of magnitude high.
>
> Why is that? From a statistical point of view it shouldn't matter
> whether you pick N random samples, or pick M >= N random samples an
> then randomly pick N from M. (random implying uniformly distributed
> here).
That sounds to me like exactly what Robert and I both said.
While passing the heap with the bitmap, if you get to the number you
want, you don't stop there -- you read all of them ("M" in your
parlance) and randomly drop M minus N of them.  Or, if you prefer, I
guess you could *pick* N of them.  I don't see a logical difference.
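Just to illustrate that equivalence: once all M candidate rows have
been read, keeping exactly N of them can be done in one pass with
selection sampling (Knuth's Algorithm S).  A toy sketch in plain C --
nothing to do with the real bitmap heap scan code, with plain ints
standing in for the fetched TIDs:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Keep exactly n of the m candidates; every size-n subset equally likely. */
static void
keep_n_of_m(const int *candidates, int m, int n, int *kept)
{
    int         needed = n;

    for (int i = 0; i < m && needed > 0; i++)
    {
        int         remaining = m - i;

        /* keep candidates[i] with probability needed / remaining */
        if (rand() % remaining < needed)
        {
            kept[n - needed] = candidates[i];
            needed--;
        }
    }
}

int
main(void)
{
    int         candidates[12];
    int         kept[5];

    srand((unsigned) time(NULL));
    for (int i = 0; i < 12; i++)
        candidates[i] = i;      /* stand-ins for the M fetched TIDs */

    keep_n_of_m(candidates, 12, 5, kept);
    for (int i = 0; i < 5; i++)
        printf("%d ", kept[i]);
    printf("\n");
    return 0;
}

Note that the kept rows come out in physical order, which is what you
want anyway if you're dropping the excess at the end of the scan.
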
>> But falling short is tougher; making up the difference could be an
>> iterative process, which could always wind up with having you read
>> all tuples in the table without filling your sample.
>
> But the likelihood of that happening is extremely low, no?
That depends.  What if someone just did a mass delete and your
statistics aren't yet up to date when they ask to pick a relatively
large percentage of the rows?
> Unless the sampling percentage is very high
Or the statistics are not current.  I agree, this shouldn't happen
often, but we can never know, going in, whether it *is* the case.
You *could* always wind up needing to read the entire table, and
still not hit the initially-targeted number of rows.  Now, arguably
you could use data gleaned from each pass to adjust the target or
inform the size of the next pass.  My point is that "we selected too
few" is a lot more complicated that the "we selected too many" case.
> but that case isn't of much practical importance anyway.
It's important that it's handled in some sane way when it happens.
And it will happen.
> But something else comes to mind. Does the standard permit samples
> taken with the BERNOULLI method to contain the same tuple multiple
> times?
I'm pretty sure not.  That would be nonsensical.
> If not, any kind of TID-based approach will have to track all
> previously fetched TIDs, which seems doable but unfortunate...
Right.  You would always need to ignore a duplicate random choice in
any one cycle of generating ctid values; and if you are iterating
because you fell short, you would also need to ignore values from
previous iterations.  And OR your new bitmap against the previous
ones to save for the next iteration, if needed.  I never said it
couldn't be done; it's just very fussy, and you want to avoid a very
large number of iterations in the case that someone deleted 99.99% of
your rows right before you ran your sample and before autovacuum
caught up.
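To show the shape of that fussiness, here's a toy simulation in plain C
(nothing resembling the real executor code; the boolean arrays stand in
for the visibility checks and the TID bitmaps).  A bitmap of
already-tried slots is carried across passes -- the OR step above -- so
a duplicate random pick is never fetched twice, and the loop stops
either when the target is met or when every slot has been tried:

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>

int
main(void)
{
    const int   ntuples = 10000;
    const int   target = 50;
    bool       *live = malloc(ntuples * sizeof(bool));
    bool       *tried = calloc(ntuples, sizeof(bool)); /* kept across passes */
    int         found = 0;
    int         tried_count = 0;

    srand((unsigned) time(NULL));

    /* toy table: only 1% of the slots still hold a live tuple */
    for (int i = 0; i < ntuples; i++)
        live[i] = (rand() % 100 == 0);

    while (found < target && tried_count < ntuples)
    {
        /* one pass: a batch of random picks, duplicates skipped */
        for (int i = 0; i < 2 * target && found < target; i++)
        {
            int         slot = rand() % ntuples;

            if (tried[slot])
                continue;       /* duplicate, within or across passes */
            tried[slot] = true;
            tried_count++;
            if (live[slot])
                found++;
        }
    }

    printf("found %d of %d after trying %d of %d slots\n",
           found, target, tried_count, ntuples);
    free(live);
    free(tried);
    return 0;
}

With the live fraction turned down far enough, it ends up trying every
slot and still falls short of the target -- the "read the whole table
without filling the sample" case.
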
-Kevin

