While reading the code generated by llvmjit, I realized the number of LLVM basic blocks used in tuple deforming was directly visible in the generated assembly code with the following code:
0x723382b781c1: jmp 0x723382b781c3 0x723382b781c3: jmp 0x723382b781eb
0x723382b781c5: mov -0x20(%rsp),%rax
0x723382b781..: ... .....
0x723382b781e7: mov %cx,(%rax)
0x723382b781ea: ret
0x723382b781eb: jmp 0x723382b781ed
0x723382b781ed: jmp 0x723382b781ef
0x723382b781ef: jmp 0x723382b781f1
0x723382b781f1: jmp 0x723382b781f3
0x723382b781f3: mov -0x30(%rsp),%rax
0x723382b781..: ... ......
0x723382b78208: mov %rcx,(%rax)
0x723382b7820b: jmp 0x723382b781c5
That's a lot of useless jumps, and LLVM has a specific pass to get rid of these. The attached patch modifies the llvmjit code to always call this pass, even below jit_optimize_above_cost.
On a basic benchmark (a simple select * from table where f = 42), this optimization saved 7ms of runtime while using only 0.1 ms of extra optimization time.