On Sat, Feb 28, 2026 at 4:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I also wondered if we might have a reasonable case for using alloca(),
> > where available. It's pretty much the thing we are emulating, but
> > keeps the stack nice and compact without big holes to step over for
> > the following call to strcoll_l() or whatever it might be.
>
> +1 for investigating alloca(). The one disadvantage I can see
> to making this coding pattern more common is that it'll result in
> increased stack usage, which is not great now and will become
> considerably less great in our hypothetical multithreaded future.
> If we can fix it so the typical stack consumption is a good deal
> less than 1KB, that worry would be alleviated.

Yeah. I thought about all that quite a bit, and the dangers of
longjmp(), and various portability arcana, and came up with a v2.

Safety:

* DECLARE_STACK_BUFFER() is intended for general use, and gives you
only 128 bytes with the array implementation, enough for lots of
values/nulls arrays and such, and there is also
DECLARE_STACK_BUFFER_LARGE() for the pre-existing locales code that
already uses 1024-byte arrays. To be really paranoid, I suppose we
could instead always use palloc() if you don't have alloca() support,
except when you ask for _LARGE, but I suspect 128 might be OK: it's in
the territory of plain-old-stack-frame-with-a-few-variables. But I
don't think it matters terribly much either way, as no modern system
would really use array mode.
* If using alloca(), as far as I can see we don't have to be so
cautious *if* we additionally impose a total stack size limit. Here I
chose halfway to check_stack_depth()'s trigger point, but much less
than that would also be fine I think. You can see what that compiles
to at the end. Thoughts?
* I came up with a way to implement
static_assert(!pg_in_lexical_scope(PG_{TRY,CATCH,FINALLY})).
setlongjmp()ing around inside the same stack frame obviously breaks
this stuff fundamentally. I think it might be possible that some very
restricted patterns could be made to work: if you allocated before (but
not inside) PG_TRY, either using volatile variables (or perhaps if
PG_TRY contained a compiler barrier?), then freeing it might be OK in
PG_CATCH, for example if plpy_exec.c moves its allocations up. I'm
not too sure about that, though, and for now all stack buffer
operations are banned in all three scopes. Better implementation
techniques for the resulting caveman meta-programming exercise are
welcome... it took me quite some time to make it out of the valley of
shadow warnings.
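
To make the lexical-scope check concrete, here is a minimal sketch of one way such a compile-time ban could be expressed. It is not the patch's implementation: SKETCH_TRY, SKETCH_END_TRY, SKETCH_DECLARE_STACK_BUFFER and sketch_in_try_scope are hypothetical names, and the trick assumed here is simply that a block-scope enum constant shadows a file-scope one (which is also why -Wshadow would complain about it).

```c
#include <assert.h>

/* File scope: outside any guarded block the constant is 0. */
enum { sketch_in_try_scope = 0 };

/* Hypothetical stand-in for PG_TRY(): opens a block that shadows the constant. */
#define SKETCH_TRY() \
    do { \
        enum { sketch_in_try_scope = 1 }

#define SKETCH_END_TRY() \
    } while (0)

/* Hypothetical stand-in for DECLARE_STACK_BUFFER(): refuses guarded scopes. */
#define SKETCH_DECLARE_STACK_BUFFER() \
    _Static_assert(!sketch_in_try_scope, \
                   "stack buffers are not allowed inside SKETCH_TRY")

static int
sketch_outside(void)
{
    SKETCH_DECLARE_STACK_BUFFER();  /* compiles: the constant is 0 here */
    return sketch_in_try_scope;
}

static int
sketch_inside(void)
{
    int seen = 0;

    SKETCH_TRY();
    {
        /* SKETCH_DECLARE_STACK_BUFFER() here would be a compile-time error. */
        seen = sketch_in_try_scope;
    }
    SKETCH_END_TRY();
    return seen;
}
```

The real macros have three scopes to cover and must stay quiet under the project's warning flags, which this sketch does not attempt.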

Portability:

* There are no general purpose systems left that have stacks that grow
upwards AFAICS, right? So I propose that we define PG_STACK_DIRECTION
to -1 in pg_config_manual.h, but continue to write code that should
work correctly if you set it to +1. That simplifies the macrology and
de-branchifies the code. It doesn't make sense to tolerate
non-existent computers by running zillions of runtime tests around the
world every day.
* I guess I should probably actually test it on PA-RISC (the last
stack-grows-up system?) under Qemu or something. The way I approached
this was to write expressions in terms of PG_STACK_DIRECTION, instead
of chunks of #ifdef'd code that would never be seen by a living
compiler and thus inevitably bit-rot. We could still assert that
reality matches expectations in a few spots, and then worry about
configure-time, #ifdef or template file control if a new platform ever
appears with an upward stack.
* PostgreSQL already wouldn't work with "split stack" code (it would
bogusly error out of check_stack_depth() when it meets a far away
stack segment), or other exotic systems like spaghetti stacks. I
don't think this is much of an additional jump in our assumptions
about how C is actually implemented, it's just assuming that there are
no systems that grow down one day and up the next. (A number of CPU
architectures could technically do that, but their ABI firmly nails
down which way it has to be, and the answer is down).
* For estimating whether there is enough free space to call alloca(),
it prefers to use builtins to read the stack pointer (GCC, Clang 22+),
but failing that it wastes a register that initially holds a
conservative estimate (address of a local variable, itself, which is
surely deeper-or-equal to the real stack pointer), and later holds the
result of the last alloca() call, which is 100% accurate on
downward-stack systems, and potentially only out by a small amount of
alignment fuzz on non-existent upward-stack systems because I assumed
that new sp = alloca(size) + size. (It would be more accurate if I
did TYPALIGN(X, size), but I don't know what its X is, I only know
that it's >= alignof(max_align_t), so I'd have to go and read the
relevant ABI manual for this hypothetical upward-stack system to find
out, but since it doesn't exist, all thoughts about a perfect
estimation disappear in a puff of logic.)
* When judging if a pointer came from the stack, it additionally needs
a least-deep bound. It uses the builtin for the current stack frame
address if there is one, and otherwise check_stack_depth.c's
stack_base_ptr. Both
are reliable, all we need is a pointer to anywhere less deep in the
stack than anything we've alloca()'d. The builtin might technically
give you something from a function you're inlined into, but that's OK
too. In practice probably only MSVC uses stack_base_ptr.
* It only uses literal "alloca()" for MSVC, which turns into a builtin
and thus has a property that I'm relying on. For Unixen it uses a
GCC/Clang builtin directly, because otherwise I'd have to worry about
that property too. Builtins are safe in a function call argument
because the compiler implements and understands them, but the original
alloca() from old Unixen somehow moved the stack pointer without the
compiler's knowledge, which used to trash the stack if it had already
started pushing other stuff onto the stack for the function call, at
least in some circumstances. I assume no such things exist now, but
you'd still have to figure out which header to include, and that
varies on niche systems (<alloca.h>, <stddef.h>, ...?).
* For builtin detection, I ran into CC vs CXX vs CLANG problems, so I
gave up and invented pg_has_builtin(). I don't propose it be used
except where needed though since it doesn't exist on GCC < 10 (it's an
idea from Clang). This means that old build farm compilers would use
the array version, which I think is probably desirable for coverage.
Current distros are shipping GCC 11+, probably mostly 14 or so, so I
think it's OK to exclude < 10 from a minor optimisation project.
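
As an illustration of writing expressions in terms of the stack direction instead of #ifdef'd branches, here is a rough sketch. All names are hypothetical (SKETCH_STACK_DIRECTION stands in for PG_STACK_DIRECTION), addresses are passed as plain integers so the logic can be read in isolation, and the upward-stack estimate follows the "new sp = alloca(size) + size" assumption described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PG_STACK_DIRECTION: -1 means the stack grows down. */
#define SKETCH_STACK_DIRECTION (-1)

/* True if address a is at least as deep in the stack as address b. */
static inline bool
sketch_deeper_or_equal(uintptr_t a, uintptr_t b)
{
    return SKETCH_STACK_DIRECTION < 0 ? a <= b : a >= b;
}

/*
 * Estimate the stack pointer after an alloca()-style call returned "result".
 * Downward: the result is taken as the new stack pointer.  Upward: assume the
 * new stack pointer is result + size, ignoring any alignment padding.
 */
static inline uintptr_t
sketch_sp_after_alloca(uintptr_t result, size_t size)
{
    return SKETCH_STACK_DIRECTION < 0 ? result : result + size;
}

/*
 * True if p lies between a least-deep bound (for example a frame address)
 * and the current stack pointer estimate, inclusive.
 */
static inline bool
sketch_is_stack_pointer(uintptr_t p, uintptr_t sp, uintptr_t least_deep)
{
    return sketch_deeper_or_equal(p, least_deep) &&
        sketch_deeper_or_equal(sp, p);
}
```

Because the direction is a compile-time constant, the conditional expressions fold away, and both directions still get parsed and type-checked by every compiler.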

Potential sites that can use this:

I have attached a bunch of easy candidate changes, almost all
mechanical changes for small temporary arrays of values/nulls,
scankeys, etc or temporary C-strings where no refactoring was needed
and the no-escape property was obvious to the eye. In the case of the
planner, I worked a bit harder to find a couple of things that have
pass-down-but-never-capture semantics, based on hunches... FWIW I do
see a small planner speed-up in the join-order-optimisation test David
mentioned the other day in simple tests, though I haven't studied it
seriously enough to report on yet. I'm actually not too sure yet how
to evaluate this stuff at a higher level, ie the effects of tiny
micro-optimisations scattered all over the tree, and only in the
arbitrary places that didn't need refactoring. So for now I'm
reporting only on a toy C-stringification microbenchmark: see end.

Ideas for further development:

* I wanted this to be useful for SIMD, so I made sure that
stack_buffer_alloc_object(T) would handle strict alignment
automatically based on alignof(T). palloc_object(T) et al should get
the same treatment, as I wrote in another thread about the funky __128
situation (and I have basic patches, for another day).
* I see how to implement _realloc(): if ptr == sp, then you have the
most recent allocation, so if you (somehow) know the original size you
can allocate the difference and memmove(). (This condition would
never be met for stack-upwards, but that's OK). In all other cases
you'd probably fall straight back to palloc() I think. I don't know
if there'd be much call for it though.
* Presumably this could benefit from a sprinkle of VALGRIND and sanitizer clues.
* It would be nice to figure out how to make at least one compiler
complain about very obvious escapes, as they do if you return a
pointer to a normal stack variable, but I guess the main problem isn't
the obvious ones.
* Many uses of temporary C strings wouldn't be needed if we accepted
known-length strings in more APIs; it may be unavoidable for libc
functions, but our own reader etc should probably ideally be able to
cope... I have a separate project investigating a centralised string
iterator so that we can make all of our string processing behave the
same way and will think about that...
* There are quite a few places that form or copy temporary
non-escaping index tuples, tuples, tupledescs, values/nulls extracted
from arrays etc that could probably live on the stack quite happily if
we split a few functions into "how much do you need?" and "in place"
pairs. That's been done in a few ad hoc spots for parallel query eg
tuple descriptors.
* Independently of this project, but really the same sort of thing, I
have long wanted to look into all the places that form a tuple, copy
it and then free it (hashing, sorting, materialising) when they should
form into caller-supplied destination, be it the stack, a hash table,
a tuple queue or whatever else...
* Something like "defer" for magic automatic pfree() on scope exit if
we can figure out how to do it without waiting for the year 202y + 10,
as mentioned in reply to Andres...
* Even without that, I suppose GCC cleanup callbacks could still be
used to scribble on the memory that goes out of scope. I guess it
must be pretty hard to miss bugs anyway since the stack is already
scribbled to smithereens.
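
To illustrate the "how much do you need?" / "in place" split mentioned in the list above, here is a small sketch. It uses C strings as a stand-in for tuples, and every name in it is hypothetical rather than an existing PostgreSQL function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "How much do you need?": bytes required for a NUL-terminated copy. */
static inline size_t
sketch_cstring_size(size_t len)
{
    return len + 1;
}

/* "In place": build the C string in caller-supplied memory of that size. */
static inline char *
sketch_cstring_form(char *dst, const char *data, size_t len)
{
    memcpy(dst, data, len);
    dst[len] = '\0';
    return dst;
}

/* A caller that keeps the temporary in its own stack frame when it is small. */
static size_t
sketch_count_spaces(const char *data, size_t len)
{
    char   local[64];
    size_t n = 0;

    if (sketch_cstring_size(len) > sizeof(local))
        return (size_t) -1;     /* a real caller would fall back to the heap */

    for (const char *p = sketch_cstring_form(local, data, len); *p; p++)
        n += (*p == ' ');
    return n;
}
```

The point of the split is that the caller, not the forming function, decides where the bytes live: the stack, a hash table entry, a tuple queue slot and so on.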
For a very simple micro-benchmark, I tried a trivial C-string
conversion wrapper:
    for (int i = 0; i < 10000000; ++i)
        my_strlen("hello world", 11);
Given this definition of a completely useless function that exercises
C-string construction:
    size_t
    my_strlen(const char *data, size_t size)
    {
        char *cstr;
        size_t result;

        DECLARE_STACK_BUFFER();

        cstr = stack_buffer_strdup_with_len(data, size);
        result = strlen(cstr);
        stack_buffer_free(cstr);

        return result;
    }
With STACK_BUFFER_USE_PALLOC, which just expands those macros to
palloc()/pfree(), it runs in ~216ms here, and with
STACK_BUFFER_USE_ALLOCA or STACK_BUFFER_USE_ARRAY it runs in ~155ms,
so that's a 1.39x speedup.
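
For anyone who wants to try the shape of this test outside the tree, here is a rough standalone stand-in for my_strlen() using the plain array trick. It is only a sketch: the names are hypothetical, malloc()/free() stand in for palloc()/pfree(), and the 128-byte size merely echoes the array-mode default mentioned earlier.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LOCAL_SIZE 128   /* assumed size, echoing the array-mode default */

static size_t
sketch_my_strlen(const char *data, size_t size)
{
    char    local[SKETCH_LOCAL_SIZE];
    char   *cstr = local;
    size_t  result;

    /* Fall back to the heap when the string plus NUL will not fit locally. */
    if (size >= sizeof(local))
    {
        cstr = malloc(size + 1);
        if (cstr == NULL)
            abort();
    }

    memcpy(cstr, data, size);
    cstr[size] = '\0';
    result = strlen(cstr);

    /* Only heap memory needs freeing; the array vanishes with the frame. */
    if (cstr != local)
        free(cstr);

    return result;
}
```

This does not reproduce the timings above, which depend on palloc() and on the patch's macros.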
The generated code for my_strlen() in alloca() mode looks like:
0x0000000000979323 <+19>: mov 0x442a06(%rip),%rdx # 0xdbbd30 <stack_soft_limit_ptr>
0x000000000097932a <+26>: lea -0x400(%rsp),%rax <-- sp - STACK_BUFFER_DEFAULT
0x0000000000979332 <+34>: cmp %rdx,%rax
0x0000000000979335 <+37>: cmovb %rdx,%rax <-- take the nearer address
%rax now holds the threshold stack address against which all
allocation attempts will be compared. Uninteresting computation of
-(size + 1) for NUL byte:
0x0000000000979339 <+41>: mov %rsi,%rdx
0x000000000097933c <+44>: not %rdx
Then allocation attempt:
0x000000000097933f <+47>: add %rsp,%rdx
0x0000000000979342 <+50>: cmp %rdx,%rax <-- would it fit?
0x0000000000979345 <+53>: jae 0x979380 <my_strlen+112> <-- nope, jump to slow path
0x0000000000979347 <+55>: lea 0x10(%rsi),%rax
0x000000000097934b <+59>: mov %rsi,%rdx
0x000000000097934e <+62>: mov %rdi,%rsi
0x0000000000979351 <+65>: and $0xfffffffffffffff0,%rax <-- alloca() aligns
0x0000000000979355 <+69>: sub %rax,%rsp <-- alloca() allocates
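
Read back into C, the threshold computation and fit test in the listing above amount to roughly the following. This is a sketch reconstructed from the disassembly, not the macro source: the function names are hypothetical, addresses are plain integers, and a downward-growing stack is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Matches the -0x400 displacement annotated as STACK_BUFFER_DEFAULT above. */
#define SKETCH_BUFFER_DEFAULT 0x400

/*
 * The deepest address this frame may allocate down to: either a fixed budget
 * below the current stack pointer or the global soft limit, whichever is
 * nearer (the lea/cmp/cmovb sequence).
 */
static inline uintptr_t
sketch_threshold(uintptr_t sp, uintptr_t soft_limit)
{
    uintptr_t local_limit = sp - SKETCH_BUFFER_DEFAULT;

    return local_limit < soft_limit ? soft_limit : local_limit;
}

/*
 * Would a string of "size" bytes plus its NUL terminator fit?  sp + ~size is
 * sp - (size + 1), the mov/not/add sequence; the jae to the slow path is
 * taken when the threshold is not strictly below that address.
 */
static inline bool
sketch_fits(uintptr_t sp, uintptr_t threshold, size_t size)
{
    uintptr_t candidate = sp + ~(uintptr_t) size;

    return threshold < candidate;
}
```

With sp = 0x10000 and a distant soft limit, an 11-byte string fits and a 2000-byte one takes the slow path; with a soft limit only a few bytes below sp, even the 11-byte string is refused.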
Then it's a straight-line run through the uninteresting bits to
return the result, with stack_buffer_free() entirely elided:
0x0000000000979358 <+72>: mov %rsp,%rdi
0x000000000097935b <+75>: call 0x4950d0 <memcpy@plt>
0x0000000000979360 <+80>: movb $0x0,(%rsp,%rbx,1)
0x0000000000979364 <+84>: mov %rsp,%rdi
0x0000000000979367 <+87>: call 0x494890 <strlen@plt>
0x000000000097936c <+92>: lea -0x18(%rbp),%rsp
0x0000000000979370 <+96>: pop %rbx
0x0000000000979371 <+97>: pop %r12
0x0000000000979373 <+99>: pop %r13
0x0000000000979375 <+101>: pop %rbp
0x0000000000979376 <+102>: ret
If it wouldn't fit, the out-of-line implementation looks like what you
get with STACK_BUFFER_USE_PALLOC:
0x0000000000979377 <+103>: nopw 0x0(%rax,%rax,1)
0x0000000000979380 <+112>: lea 0x1(%rsi),%rdi
0x0000000000979384 <+116>: call 0xa252f0 <palloc>
0x0000000000979389 <+121>: mov %rbx,%rdx
0x000000000097938c <+124>: mov %r12,%rsi
0x000000000097938f <+127>: mov %rax,%r13
0x0000000000979392 <+130>: mov %rax,%rdi
0x0000000000979395 <+133>: call 0x4950d0 <memcpy@plt>
0x000000000097939a <+138>: movb $0x0,0x0(%r13,%rbx,1)
0x00000000009793a0 <+144>: mov %r13,%rdi
0x00000000009793a3 <+147>: call 0x494890 <strlen@plt>
0x00000000009793a8 <+152>: cmp %rsp,%r13
0x00000000009793ab <+155>: jb 0x9793b2 <my_strlen+162>
0x00000000009793ad <+157>: cmp %r13,%rbp
0x00000000009793b0 <+160>: jae 0x97936c <my_strlen+92>
0x00000000009793b2 <+162>: mov %r13,%rdi
0x00000000009793b5 <+165>: mov %rax,-0x28(%rbp)
0x00000000009793b9 <+169>: call 0xa255c0 <pfree>
0x00000000009793be <+174>: mov -0x28(%rbp),%rax
0x00000000009793c2 <+178>: jmp 0x97936c <my_strlen+92>
In more complicated functions GCC's costing only moves the actual
palloc() and pfree() calls out-of-line and jumps back to the
straight-line code, while here it decided it was cheaper to duplicate
the memcpy() and strlen() calls. Either way the straight-line code
assumes the stack will be used, which is hopefully correct most of the
time.
It became a lot more aggressive about that sort of thing when I added
the "stack_buffer_maybe_pfree" flag which seemed to flip GCC's
costing; before that it would often leave the
if-it's-not-between-%rbp-and-%rsp-then-call-pfree code in the main
code. I'm not 100% sure about all that, but it *looks* like an
effective optimisation.
I've attached some data on allocation sizes from the regression tests
(so not hitting all the places changed here), captured with
STACK_BUFFER_USE_PALLOC_LOG. Obviously not representative of real
usage, but that technique can be used to check real allocation sizes
and stack depths for any workload.
. o O ( I originally called it stack_buffer because I started with
"let's standardise the array trick", so it was a buffer. Perhaps a
better name for all this stuff would be pg_stack_alloc() or such,
since it doesn't really have a buffer anymore except in a fallback
mode... )