Hi,
As previously mentioned, tuple deforming is a major bottleneck, and
JITing it can be highly beneficial. I previously posted a prototype
that does JITing at the slot_deform_tuple() level, caching the deforming
function in the tupledesc.
Storing things in the tupledesc isn't a great concept however - the
lifetime of the generated function is hard to manage. But more
importantly, and even if we moved this into the slot, it precludes
important optimizations.
JITing the deforming is a *lot* more efficient if we can combine it with
the JITing of the expressions using the deformed columns. There's a
couple of reasons for that:
1) By knowing the exact attnum the caller is going to request, the code
   can be optimized. No need to generate code for columns not
   deformed. If there's NOT NULL columns at/after the last
   to-be-deformed column, there's no need to generate checks about the
   length of the null-bitmap - getting rid of about half the branches!
2) By generating the deforming code in the generated expression code,
   the code will be generated together. That saves a good chunk of the
   overhead, including the memory mapping overhead, and it noticeably
   reduces function call overhead (because relative near calls can be
   used).
3) LLVM's optimizer can inline parts / all of the tuple deforming code
   into the expression evaluation function, further reducing
   overhead. In simpler cases and with some additional prodding, LLVM
   can even interleave deforming of individual columns and their use
   (note that I'm not proposing to do so initially).
4) If we know that the underlying tuple is an actual nonvirtual tuple,
   e.g. on the scan level, the slot deforming of NOT NULL columns can
   be replaced with direct byte accesses to the relevant column - a
   good chunk faster again (note that I'm not proposing to do so
   initially).
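
To illustrate point 1, here's a rough sketch of the difference between
generic and specialized deforming. This is a toy model, not the
prototype's code: ToyTuple, its layout (int32-only columns, a one-byte
null bitmap where a set bit means NULL) and both function names are
made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy tuple: packed int32 columns, NULL columns not stored. */
typedef struct ToyTuple
{
	int			natts;		/* number of attributes physically present */
	bool		hasnulls;	/* is nullbits meaningful? */
	uint8_t		nullbits;	/* toy convention: bit i set => column i is NULL */
	const uint8_t *data;	/* packed values of the non-NULL columns */
} ToyTuple;

/*
 * Generic path: knows nothing about the descriptor or the caller, so every
 * column pays for a "tuple too short" check and a null-bitmap check.
 */
static void
toy_deform_generic(const ToyTuple *tup, int natts_wanted,
				   int32_t *values, bool *isnull)
{
	size_t		off = 0;

	assert(natts_wanted <= 8);	/* toy bitmap only covers 8 columns */

	for (int i = 0; i < natts_wanted; i++)
	{
		if (i >= tup->natts ||
			(tup->hasnulls && (tup->nullbits & (1u << i))))
		{
			values[i] = 0;
			isnull[i] = true;
			continue;
		}
		memcpy(&values[i], tup->data + off, sizeof(int32_t));
		isnull[i] = false;
		off += sizeof(int32_t);
	}
}

/*
 * What specialized code could look like when the generator knows that only
 * columns 0 and 1 are needed and both are NOT NULL: no loop, no length
 * check (a NOT NULL column 1 is assumed to always be stored), no null
 * checks, constant offsets.
 */
static void
toy_deform_2_notnull(const ToyTuple *tup, int32_t *values, bool *isnull)
{
	memcpy(&values[0], tup->data, sizeof(int32_t));
	memcpy(&values[1], tup->data + sizeof(int32_t), sizeof(int32_t));
	isnull[0] = false;
	isnull[1] = false;
}
```

Both functions produce the same result for a two-column NOT NULL tuple;
the second one just has no branches left.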
The problem however is that when generating the expression code we don't
have the necessary information. In my current prototype I'm emitting the
LLVM IR (the input to LLVM) at ExecInitExpr() time for all expressions
in a tree. That allows emitting the code for all functions in an executor
tree in one go. But unfortunately the current executor initialization
"framework" doesn't provide information about the underlying slot
tupledescs at that time. Nor does it actually guarantee that the
tupledesc / slots stay the same over the course of the execution.
Therefore I'd like to somehow change things so that the executor keeps
track of whether the tupledesc of inner/outer/scan are going to change,
and if not provide them.
The right approach here seems to be to add a bit of extra data to
ExecAssignScanType etc., and move ExecInitExpr / ExecInitQual /
ExecAssignScanProjectionInfo /... to after that. We then could keep
track of the relevant tupledescs somewhere in PlanState - that's a
bit ugly, but I don't quite see how to avoid that unless we want to add
major executor-node awareness into expression evaluation.
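
Very roughly, the bookkeeping could look something like the following.
This is only a sketch to make the idea concrete: ToyPlanState, the
"fixed" flags and all the function names are hypothetical stand-ins,
not existing executor structures or APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a tuple descriptor. */
typedef struct ToyTupleDesc
{
	int			natts;
} ToyTupleDesc;

typedef enum ToySlotKind
{
	TOY_INNER,
	TOY_OUTER,
	TOY_SCAN
} ToySlotKind;

/* Hypothetical per-node state: one descriptor per slot kind, plus a flag. */
typedef struct ToyPlanState
{
	const ToyTupleDesc *desc[3];
	bool		desc_fixed[3];	/* guaranteed not to change during execution? */
} ToyPlanState;

/* Would run first, in the spirit of ExecAssignScanType etc. */
static void
toy_assign_slot_type(ToyPlanState *ps, ToySlotKind kind,
					 const ToyTupleDesc *desc, bool fixed)
{
	ps->desc[kind] = desc;
	ps->desc_fixed[kind] = fixed;
}

/* Hand out the descriptor only if it is known and guaranteed stable. */
static const ToyTupleDesc *
toy_known_desc(const ToyPlanState *ps, ToySlotKind kind)
{
	return ps->desc_fixed[kind] ? ps->desc[kind] : NULL;
}

/*
 * Would run afterwards, at expression initialization: returns true if
 * specialized deforming code could be emitted for columns 1..last_attnum,
 * false if the generic deforming path has to be used.
 */
static bool
toy_can_specialize_deform(const ToyPlanState *ps, ToySlotKind kind,
						  int last_attnum)
{
	const ToyTupleDesc *desc = toy_known_desc(ps, kind);

	return desc != NULL && last_attnum <= desc->natts;
}
```

The important part is just the ordering: the descriptor gets recorded
before expressions are initialized, and anything not marked as fixed
falls back to generic deforming.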
Thoughts? Better ideas?
Greetings,
Andres Freund