Re: TABLESAMPLE patch is really in pretty sad shape

Поиск
Список
Период
Сортировка
От Petr Jelinek
Тема Re: TABLESAMPLE patch is really in pretty sad shape
Дата
Msg-id 55A7B0DB.6030000@2ndquadrant.com
обсуждение исходный текст
Ответ на Re: TABLESAMPLE patch is really in pretty sad shape  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-hackers
On 2015-07-13 18:00, Tom Lane wrote:
>
> And here's that.  I do *not* claim that this is a complete list of design
> issues with the patch, it's just things I happened to notice in the amount
> of time I've spent so far (which is already way more than I wanted to
> spend on TABLESAMPLE right now).
>
>
> I'm not sure that we need an extensible sampling-method capability at all,
> let alone that an API for that needs to be the primary focus of a patch.
> Certainly the offered examples of extension modules aren't inspiring:
> tsm_system_time is broken by design (more about that below) and nobody
> would miss tsm_system_rows if it weren't there.  What is far more
> important is to get sampling capability into indexscans, and designing
> an API ahead of time seems like mostly a distraction from that.
>
> I'd think seriously about tossing the entire executor-stage part of the
> patch, creating a stateless (history-independent) sampling filter that
> just accepts or rejects TIDs, and sticking calls to that into all the
> table scan node types.  Once you had something like that working well
> it might be worth considering whether to try to expose an API to
> generalize it.  But even then it's not clear that we really need any
> more options than true-Bernoulli and block-level sampling.

I think this is not really API issue as much as opposite approach on 
what to implement first. I prioritized in first iteration for the true 
block sampling support as that's what I've been mostly asked for.

But my plan was to have at same later stage (9.6+) also ability to do 
subquery scans etc. Basically new SamplingFilter executor node which 
would pass the tuples to examinetuple() which would then decide what to 
do with it. The selection between using nextblock/nexttuple and 
examinetuple was supposed to be extension of the API where the sampling 
method would say if it supports examinetuple or nextblock/nexttuple or 
both. And eventually I wanted to rewrite bernoulli to just use the 
examinetuple() on top of whatever scan once this additional support was 
in. I think I explained this during the review stage.

>
> The IBM paper I linked to in the other thread mentions that their
> optimizer will sometimes choose to do Bernoulli sampling even if SYSTEM
> was requested.  Probably that happens when it decides to do a simple
> indexscan, because then there's no advantage to trying to cluster the

Yeah it happens when there is an index which is used in WHERE clause and 
has bigger selectivity than the percentage specified in the TABLESAMPLE 
clause. This of course breaks one of the common use-case though:
SELECT count(*) * 100 FROM table TABLESAMPLE SYSTEM(1) WHERE foo = bar;

> sampled rows.  But in the case of a bitmap scan, you could very easily do
> either true Bernoulli or block-level sampling simply by adjusting the
> rules about which bits you keep or clear in the bitmap (given that you
> apply the filter between collection of the index bitmap and accessing the
> heap, which seems natural).  The only case where a special scan type
> really seems to be needed is if you want block-level sampling, the query
> would otherwise use a seqscan, *and* the sampling percentage is pretty low
> --- if you'd be reading every second or third block anyway, you're likely
> better off with a plain seqscan so that the kernel sees sequential access
> and does prefetching.  The current API doesn't seem to make it possible to
> switch between seqscan and read-only-selected-blocks based on the sampling
> percentage, but I think that could be an interesting optimization.

Well you can do that if you write your own sampling method. We don't do 
that in optimizer and that's design choice, because you can't really do 
that on high level like that if you want to keep extensibility. And 
given the amount of people that asked if they can do their own sampling 
when I talked to them about this during the design stage, I consider the 
extensibility as more important. Especially if extensibility gives you 
the option to do the switching anyway, albeit on lower-level and not out 
of the box.

> (Another bet that's been missed is having the samplescan logicrequest
> prefetching when it is doing selective block reads.  The current API can't
> support that AFAICS, since there's no expectation that nextblock calls
> could be done asynchronously from nexttuple calls.)

Not sure I follow, would not it be possible to achieve this using the 
tsmseqscan set to true (it's a misnomer then, I know)?

>
> Another issue with the API as designed is the examinetuple() callback.
> Allowing sample methods to see invisible tuples is quite possibly the
> worst idea in the whole patch.  They can *not* do anything with such
> tuples, or they'll totally break reproducibility: if the tuple is
> invisible to your scan, it might well be or soon become invisible to
> everybody, whereupon it would be subject to garbage collection at the
> drop of a hat.  So if an invisible tuple affects the sample method's
> behavior at all, repeated scans in the same query would not produce
> identical results, which as mentioned before is required both by spec
> and for minimally sane query behavior.  Moreover, examining the contents
> of the tuple is unsafe (it might contain pointers to TOAST values that
> no longer exist); and even if it were safe, what's the point?  Sampling
> that pays attention to the data is the very definition of biased.  So
> if we do re-introduce an API like the current one, I'd definitely lose
> this bit and only allow sample methods to consider visible tuples.

I didn't actually have the examinetuple() call on invisible tuples 
originally, it was something I added after discussing with several 
people off-list who mentioned independently on each other that it would 
be useful to have this. I didn't realize the TOAST issue though, 
otherwise I would not do that.

> (I'm not exactly convinced that the system or tsm_system_rows methods
> are adequately reproducible either, given that their sampling pattern
> will change when the relation block count changes.  Perhaps that's
> unavoidable, but it seems like it might be possible to define things
> in such a way that adding blocks doesn't change which existing blocks
> get sampled.)

Note that even standard says that REPEATABLE only returns same result 
"provided certain implementation-defined conditions are satisfied" (I am 
not saying we should not try, I am saying it's ok to have some 
limitations there).

>
> A more localized issue that I noticed is that nowhere is it documented
> what the REPEATABLE parameter value is.  Digging in the code eventually
> reveals that the value is assignment-coerced to an int4, which I find
> rather problematic because a user might reasonably assume that the
> parameter works like setseed's parameter (float8 in the range -1 to 1).
> If he does then he'll get no errors and identical samples from say
> REPEATABLE(0.1) and REPEATABLE(0.2), which is bad.  On the other hand,
> it looks like DB2 allows integer values, so implementing it just like
> setseed might cause problems for people coming from DB2.  I'm inclined to
> suggest that we should define the parameter as being any float8 value,
> and obtain a seed from it with hashfloat8().  That way, no matter whether
> users think that usable values are fractional or integral, they'll get
> sane behavior with different supplied seeds almost always producing
> different samples.
>

Sounds reasonable.

--  Petr Jelinek                  http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training &
Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Christensen
Дата:
Сообщение: Re: [DESIGN] Incremental checksums
Следующее
От: Alvaro Herrera
Дата:
Сообщение: Re: TABLESAMPLE patch is really in pretty sad shape