Re: Gsoc2012 idea, tablesample

From: Florian Pflug
Subject: Re: Gsoc2012 idea, tablesample
Msg-id: 7DE33781-4A07-4D3C-8E29-B33D37ADF51B@phlo.org
In reply to: Re: Gsoc2012 idea, tablesample  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
Responses: Re: Gsoc2012 idea, tablesample  (Florian Pflug <fgp@phlo.org>)
           Re: Gsoc2012 idea, tablesample  ("Kevin Grittner" <Kevin.Grittner@wicourts.gov>)
List: pgsql-hackers

> Florian Pflug  wrote:
>> On May10, 2012, at 18:36 , Kevin Grittner wrote:
>>> Robert Haas  wrote:
>>> 
>>>> I wonder if you could do this with something akin to the Bitmap
>>>> Heap Scan machinery. Populate a TID bitmap with a bunch of
>>>> randomly chosen TIDs, fetch them all in physical order
>>>> and if you don't get as many rows as you need, rinse and repeat
>>>> until you do.
>>> 
>>> If you get too many, it is important that you read all the way to
>>> the end and then randomly omit some of them.  While a bit of a
>>> bother, that's pretty straightforward and should be pretty fast,
>>> assuming you're not, like, an order of magnitude high.
>> 
>> Why is that? From a statistical point of view it shouldn't matter
>> whether you pick N random samples, or pick M >= N random samples and
>> then randomly pick N from M. (random implying uniformly distributed
>> here).
> 
> That sounds to me like exactly what Robert and I both said.
> While passing the heap with the bitmap, if you get to the number you
> want you don't stop there -- you read all of them ("M" in your
> parlance) and randomly drop M minus N of them.  Or, if you prefer, I
> guess you could *pick* N of them.  I don't see a logical difference.

What I meant to say was "and drop the last M minus N", not "and randomly
drop M minus N". Which, of course, makes my statement utterly wrong since
the tuples are sorted by TID so you'd penalize tuples with a larger TID.
Duh! Sorry for the noise, guys...

>>> But falling short is tougher; making up the difference could be an
>>> iterative process, which could always wind up with having you read
>>> all tuples in the table without filling your sample.
>> 
>> But the likelihood of that happening is extremely low, no?
> 
> That depends.  What if someone just did a mass delete and your
> statistics aren't yet up to date when they ask to pick a relatively
> large percentage of the rows?
> 
>> Unless the sampling percentage is very high
> 
> Or the statistics are not current.  I agree, this shouldn't happen
> often, but we can never know, going in, whether it *is* the case.
> You *could* always wind up needing to read the entire table, and
> still not hit the initially-targeted number of rows.  Now, arguably
> you could use data gleaned from each pass to adjust the target or
> inform the size of the next pass.  My point is that "we selected too
> few" is a lot more complicated that the "we selected too many" case.
> 
>> but that case isn't of much practical importance anyway.
> 
> It's important that it's handled in some sane way when it happens.
> And it will happen.

Hm. Maybe one can get rid of these sorts of problems by factoring in
the expected density of the table beforehand and simply accepting that
the results will be inaccurate if the statistics are outdated?

One could, for example, simply pick
 N := SamplingPercentage * MaxTuplesPerPage / AvgLiveTuplesPerPage

where
AvgLiveTuplesPerPage := #Tuples / #Pages

random TIDs, fetch the live ones, and return them. I'm not totally sure
whether this approach is sensitive to non-uniformity in the
tuple-to-line-pointer assignment, though.
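
Just to make the idea concrete, a rough Python simulation (purely a
sketch with made-up names, not executor code; the number of TIDs drawn
is worked out from the probability that a random TID hits a live tuple):

  import random

  def sample_by_random_tids(pages, sampling_pct, max_tuples_per_page):
      # pages: one dict per heap page, mapping line-pointer number to a
      # tuple, or to None for dead/unused slots.
      n_pages = len(pages)
      n_live = sum(1 for page in pages for t in page.values() if t is not None)
      avg_live_per_page = n_live / n_pages
      # A random TID hits a live tuple with probability
      # avg_live_per_page / max_tuples_per_page, so draw enough TIDs to
      # expect roughly sampling_pct * n_live live tuples back.
      n_tids = int(sampling_pct * n_live * max_tuples_per_page / avg_live_per_page)
      result = []
      for _ in range(n_tids):
          page = random.randrange(n_pages)
          lp = random.randrange(max_tuples_per_page)
          tup = pages[page].get(lp)       # None means the TID was a miss
          if tup is not None:
              result.append(tup)          # duplicates are not filtered here
      return result

(This deliberately ignores the duplicate-TID question raised below.)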

>> But something else comes to mind. Does the standard permit samples
>> taken with the BERNOULLI method to contain the same tuple multiple
>> times?
> 
> I'm pretty sure not.  That would be nonsensical.
> 
>> If not, any kind of TID-based approach will have to remember all previously
>> fetched TIDs, which seems doable but unfortunate...
> 
> Right.  You would always need to ignore a duplicate random choice in
> any one cycle of generating ctid values; and if you are iterating
> because you fell short, you would also need to ignore values from
> previous iterations.  And OR your new bitmap against the previous
> ones to save for the next iteration, if needed.  I never said it
> couldn't be done; it's just very fussy, and you want to avoid a very
> large number of iterations in the case that someone deleted 99.99% of
> your rows right before you ran your sample and before autovacuum
> caught up.
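
For illustration, that bookkeeping might look roughly like the following
(plain Python standing in for the real TID-bitmap machinery; all names
are made up):

  import random

  def sample_with_retries(pages, target, max_tuples_per_page, max_iterations=10):
      # Draw random TIDs until `target` distinct live tuples are found,
      # remembering every TID tried so far (the OR-ed bitmap you describe).
      n_pages = len(pages)
      total_tids = n_pages * max_tuples_per_page
      tried = set()                       # accumulated set of tried TIDs
      picked = []
      for _ in range(max_iterations):
          batch = set()
          while len(batch) < target and len(tried) + len(batch) < total_tids:
              tid = (random.randrange(n_pages), random.randrange(max_tuples_per_page))
              if tid not in tried and tid not in batch:
                  batch.add(tid)
          tried |= batch
          for page, lp in sorted(batch):              # fetch in physical order
              tup = pages[page].get(lp)
              if tup is not None:
                  picked.append(tup)
          if len(picked) >= target:
              return random.sample(picked, target)    # drop the surplus at random
          if len(tried) == total_tids:                # whole table tried; give up
              break
      return picked                                   # fell short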

Actually, thinking more about this, if the density-based approach sketched above
turns out to work, then one might be able to get away without remembering
previously computed TIDs, thus removing a lot of complexity.

Say, for example, you simply start out with a single random TID tid[0].
Then you produce the sequence of "random" TIDs by setting
 tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)

where D is the expected distance between one TID from the sample set
and the next higher one, i.e. D = #TIDs / N. (You'd simply stop once
tid[n] > #TIDs.) This will give you approximately uniformly distributed
TIDs, I think, and will even return them in physical order. The $100 question
is, by how much does this violate the independence requirement, i.e. how
far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
Some search through the literature should be able to answer that.
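
A minimal sketch of that generator (Python again; TIDs are flattened to a
single index 0 .. #TIDs-1, which maps back to a (page, line pointer) pair
via divmod(tid, MaxTuplesPerPage); starting the walk somewhere inside the
first expected gap is just one way of picking tid[0]):

  import random

  def random_walk_tids(total_tids, n_wanted):
      # Yields an increasing sequence of roughly n_wanted indexes by
      # taking uniformly random steps whose mean is D = total_tids / n_wanted.
      d = max(1, total_tids // n_wanted)
      tid = random.randint(0, 2 * d - 2)          # random starting offset
      while tid < total_tids:
          yield tid
          tid += random.randint(1, 2 * d - 1)     # uniform step with mean d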

Should the dependency turn out to be too high, one might be able to
lower it by scaling up N by a factor q, and then discarding each generated
TID with probability 1 - 1/q (i.e. keeping it with probability 1/q). This
amounts to a gradual switch to a simple
seqscan + bernoulli-experiment algorithm, so for large q the dependency
between P(tuple X picked) and P(tuple Y picked) should go to zero - or
at least so I think.
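
Continuing that sketch, the thinned variant could look like this (same
flattened-TID convention as before; assumes n_wanted * q stays below the
total number of TIDs):

  import random

  def thinned_random_walk_tids(total_tids, n_wanted, q):
      # Run the walk with an expected gap q times smaller, then keep each
      # candidate only with probability 1/q, so ~n_wanted TIDs come out.
      d = max(1, total_tids // (n_wanted * q))
      tid = random.randint(0, 2 * d - 2)
      while tid < total_tids:
          if random.random() < 1.0 / q:            # the thinning step
              yield tid
          tid += random.randint(1, 2 * d - 1)

Once q reaches about #TIDs / n_wanted the gap shrinks to one, the walk
visits every TID, and this degenerates into the plain
seqscan-plus-coin-flip case where the dependency should vanish.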

best regards,
Florian Pflug


