Re: asynchronous and vectorized execution

From: Andres Freund
Subject: Re: asynchronous and vectorized execution
Date:
Msg-id: 20160510235735.q436vhm7pukuzgqz@alap3.anarazel.de
In response to: asynchronous and vectorized execution  (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: asynchronous and vectorized execution  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

On 2016-05-09 13:33:55 -0400, Robert Haas wrote:
> I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:


3) We use a lot of very cache-inefficient data structures.

Especially the pervasive use of linked lists in the executor is pretty
bad for performance: every element is likely to incur cache misses,
pretty much every list cell sits on its own cache line (reducing the
overall cache hit ratio), and the cells carry horrible allocation
overhead, in both space and palloc runtime.
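For illustration, a minimal sketch (not PostgreSQL code) of the
difference between the two layouts: in the list case every load depends
on the previous one and each cell tends to sit on its own cache line,
while the array is read sequentially, which the hardware prefetcher
handles well.

typedef struct IntCell
{
    int             value;
    struct IntCell *next;
} IntCell;

static int
sum_list(const IntCell *head)
{
    int         sum = 0;

    for (const IntCell *lc = head; lc != NULL; lc = lc->next)
        sum += lc->value;       /* pointer chase: likely a cache miss */
    return sum;
}

static int
sum_array(const int *vals, int nvals)
{
    int         sum = 0;

    for (int i = 0; i < nvals; i++)
        sum += vals[i];         /* sequential access, prefetchable */
    return sum;
}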


> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  [...].  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be a lot of easier avenues for performance
improvements.


> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in batches rather than one by one.  Andres has opined
> more than once that repeated trips through ExecProcNode defeat the
> ability of the CPU to do branch prediction correctly, slowing the
> whole system down, and that they also result in poor CPU cache
> behavior, since we jump all over the place executing a little bit of
> code from each node before moving onto the next rather than running
> one bit of code first, and then another later.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant.  Besides it being only a
weekend hack project, the big thing I got stuck on was how exactly to
determine when to batch and when not to.


I'd personally say that the CPU-pipeline-defeating aspect is worse than
the effect of the cache/branch misses themselves. Today's CPUs are
heavily superscalar, and our instructions-per-cycle throughput shows
pretty clearly that we're not good at exploiting (micro-)instruction
parallelism: we're quite frequently at well below one instruction per
cycle.
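
As a rough sketch of the two call patterns (all the types here are
invented for illustration, this isn't executor code): the per-tuple
loop pays a hard-to-predict indirect call per row, while the batched
loop is tight and predictable, so the CPU can keep many independent
additions in flight.

typedef struct Node Node;                    /* opaque for this sketch */
typedef const int *(*ProcFunc) (Node *node); /* one value, NULL at end */
typedef int (*BatchFunc) (Node *node, int *out, int maxvals);

static long
consume_per_tuple(Node *node, ProcFunc proc)
{
    long        sum = 0;
    const int  *val;

    while ((val = proc(node)) != NULL)  /* indirect call per tuple */
        sum += *val;
    return sum;
}

static long
consume_batched(Node *node, BatchFunc batch)
{
    int         vals[1024];
    long        sum = 0;
    int         n;

    while ((n = batch(node, vals, 1024)) > 0)
        for (int i = 0; i < n; i++)     /* branch-predictable hot loop */
            sum += vals[i];
    return sum;
}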


> My proposal for how to do this is to make ExecProcNode function as a
> backward-compatibility wrapper.  For asynchronous execution, a node
> might return a not-ready-yet indication, but if that node is called
> via ExecProcNode, it means the caller isn't prepared to receive such
> an indication, so ExecProcNode will just wait for the node to become
> ready and then return the tuple.  Similarly, for vectorized execution,
> a node might return a bunch of tuples all at once.  ExecProcNode will
> extract the first one and return it to the caller, and subsequent
> calls to ExecProcNode will iterate through the rest of the batch, only
> calling the underlying node-specific function when the batch is
> exhausted.  In this way, code that doesn't know about the new stuff
> can continue to work pretty much as it does today.

I agree that that generally is a reasonable way forward.
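
To make that concrete, a rough sketch of the wrapper behaviour as I
understand it; everything except the "result" field from your proposal
(the TupleBatch and NotReady node types, ExecDispatchNode(),
WaitForNode()) is invented here:

typedef struct TupleBatch       /* hypothetical batch result node */
{
    NodeTag     type;           /* invented T_TupleBatch tag */
    int         ntuples;        /* tuples in this batch */
    int         next;           /* next tuple to hand out */
    TupleTableSlot **slots;
} TupleBatch;

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    for (;;)
    {
        if (node->result == NULL)
        {
            ExecDispatchNode(node);     /* invokes ExecSeqScan() etc. */
            if (node->result == NULL)
                return NULL;            /* node is exhausted */
        }

        if (IsA(node->result, TupleBatch))
        {
            TupleBatch *batch = (TupleBatch *) node->result;

            if (batch->next < batch->ntuples)
                return batch->slots[batch->next++];
            node->result = NULL;        /* batch drained; refill */
        }
        else if (IsA(node->result, NotReady))
        {
            /* the caller can't cope with not-ready-yet: block here */
            WaitForNode(node);
            node->result = NULL;
        }
        else
        {
            TupleTableSlot *slot = (TupleTableSlot *) node->result;

            node->result = NULL;        /* plain single-tuple result */
            return slot;
        }
    }
}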


> Also, and I think
> this is important, nodes don't need the permission of their parent
> node to use these new capabilities.  They can use them whenever they
> wish, without worrying about whether the upper node is prepared to
> deal with it.  If not, ExecProcNode will paper over the problem.  This
> seems to me to be a good way to keep the code simple.

Maybe not permission, but for some cases it seems important to be able
to hint that a node should *not* prefetch a lot of rows; anti joins
come to mind. Forcing batching unconditionally seems likely to regress
some queries quite badly (e.g., an expensive join inside an EXISTS()
that returns many tuples).


> For asynchronous execution, I have gone so far as to mock up a bit of
> what this might look like.  This shouldn't be taken very seriously at
> this point, but I'm attaching a few very-much-WIP patches to show the
> direction of my line of thinking.  Basically, I propose to have
> ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
> by putting them into a new PlanState member called "result", which is
> just a Node * so that we can support multiple types of results,
> instead of returning them.

What different types of results are you envisioning?



> By the way, one smaller executor project that I think we should also
> look at has to do with this comment in nodeSeqScan.c:
> 
> static bool
> SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
> {
>         /*
>          * Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
>          * (and this is very bad) - so, here we do not check are keys ok or not.
>          */
>         return true;
> }
> 
> Some quick prototyping by my colleague Dilip Kumar suggests that, in
> fact, there are cases where pushing down keys into heap_beginscan()
> could be significantly faster.

I can immediately believe that.
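
The ScanKey machinery that heap_beginscan() already accepts should make
this straightforward to prototype; a hypothetical helper (the function
count_matching() is made up, but the APIs it calls are the existing
ones) might look like this:

static uint64
count_matching(Relation rel, Snapshot snapshot, AttrNumber attno, int32 val)
{
    ScanKeyData key;
    HeapScanDesc scan;
    HeapTuple   tuple;
    uint64      matches = 0;

    ScanKeyInit(&key,
                attno,                  /* column to test */
                BTEqualStrategyNumber,  /* equality strategy */
                F_INT4EQ,               /* int4eq comparison proc */
                Int32GetDatum(val));

    /* the key is evaluated by HeapKeyTest() during the page scan */
    scan = heap_beginscan(rel, snapshot, 1, &key);
    while ((tuple = heap_getnext(scan, ForwardScanDirection)) != NULL)
        matches++;                      /* only matching tuples arrive */
    heap_endscan(scan);

    return matches;
}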


> Some care is required here because any
> functions we execute as scan keys are run with the buffer locked, so
> we had better not run anything very complicated.  But doing this for
> simple things like integer equality operators seems like it could save
> quite a few buffer lock/unlock cycles and some other executor overhead
> as well.

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

Greetings,

Andres Freund


