Thread: A stack allocation API

A stack allocation API

From

Thomas Munro

Date:

February 27, 14:59:42

Hi,

In the locale code we often use a 1KB array for copies of strings
where we need a NUL-terminated or transcoded version to give a library
function, with a fallback to palloc() + pfree() if we need more space
than that, but:

* we open code it repeatedly
* we often have two allocations but won't use the stack if we can't fit both
* we don't use it in nearby places that are obviously similar,
probably because it's a bit tedious to repeat
* in the past we've forgotten to pfree() large allocations and had to fix leaks
* it's not very type-safe
* we don't seem to consider alignment for non-char types, eg UChar,
wchar_t (apparently ASAN has never complained about that and I think I
see why it's always OK as written, but I suspect that might be UB)
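
For readers unfamiliar with the pattern being criticised, here is a minimal sketch of the open-coded "1KB array with heap fallback" idiom. It is not copied from the PostgreSQL tree: the names `TEXTBUFLEN` and `compare_counted_strings` are illustrative, and `malloc()`/`free()` stand in for `palloc()`/`pfree()` so that it compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TEXTBUFLEN 1024         /* illustrative size of the on-stack array */

/*
 * Compare two length-counted (not NUL-terminated) strings with strcoll().
 * malloc()/free() are stand-ins for palloc()/pfree().
 */
static int
compare_counted_strings(const char *a, size_t alen,
                        const char *b, size_t blen)
{
    char        stackbuf[TEXTBUFLEN];
    char       *buf = stackbuf;
    char       *a_cstr;
    char       *b_cstr;
    int         result;

    /* Both copies must fit together, or neither one uses the stack. */
    if (alen + blen + 2 > TEXTBUFLEN)
    {
        buf = malloc(alen + blen + 2);
        if (buf == NULL)
            abort();
    }

    /* Manual pointer/size arithmetic, repeated at every call site. */
    a_cstr = buf;
    b_cstr = buf + alen + 1;
    memcpy(a_cstr, a, alen);
    a_cstr[alen] = '\0';
    memcpy(b_cstr, b, blen);
    b_cstr[blen] = '\0';

    result = strcoll(a_cstr, b_cstr);

    /* Easy to forget: only the heap case needs freeing. */
    if (buf != stackbuf)
        free(buf);

    return result;
}
```

The sketch shows several of the complaints at once: the size test covers both strings together, the arithmetic is done by hand, and the conditional free is a separate step that a later edit can drop by accident.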

In the attached, I tried to tidy that up with an interface that lets you write:

    DECLARE_STACK_BUFFER();

    p = stack_buffer_alloc(n);
    ...
    stack_buffer_free(p);

The point of the _free() call is that it might need to call pfree() if
it was a large allocation and not from the stack.
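
The attached patch is not reproduced in this thread, so the following is only a sketch of how an array-backed version of such an interface could behave. Every `sketch_`/`SKETCH_` name is hypothetical, the 128-byte size is just an example, and `malloc()`/`free()` again stand in for `palloc()`/`pfree()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_BUFFER_SIZE 128  /* example size, not taken from the patch */

/* Hidden per-function state: a small aligned array plus a bump offset. */
typedef struct SketchStackBuffer
{
    size_t      used;
    _Alignas(max_align_t) char data[SKETCH_BUFFER_SIZE];
} SketchStackBuffer;

#define SKETCH_DECLARE_STACK_BUFFER() \
    SketchStackBuffer sketch_buf; \
    sketch_buf.used = 0

static inline int
sketch_in_buffer(const SketchStackBuffer *buf, const void *p)
{
    /* Compare as integers; comparing unrelated pointers directly is UB. */
    uintptr_t   lo = (uintptr_t) buf->data;
    uintptr_t   v = (uintptr_t) p;

    return v >= lo && v < lo + SKETCH_BUFFER_SIZE;
}

static inline void *
sketch_alloc(SketchStackBuffer *buf, size_t size)
{
    size_t      align = _Alignof(max_align_t);
    size_t      rounded = (size + align - 1) & ~(align - 1);
    void       *p;

    /* Bump-allocate from the array while the rounded request still fits. */
    if (rounded >= size && rounded <= SKETCH_BUFFER_SIZE - buf->used)
    {
        p = buf->data + buf->used;
        buf->used += rounded;
        return p;
    }

    /* Too large for the array: fall back to the heap. */
    p = malloc(size);
    if (p == NULL)
        abort();
    return p;
}

static inline void
sketch_free(SketchStackBuffer *buf, void *p)
{
    /* Array memory needs no work; anything else came from malloc(). */
    if (!sketch_in_buffer(buf, p))
        free(p);
}

#define sketch_stack_buffer_alloc(size) sketch_alloc(&sketch_buf, (size))
#define sketch_stack_buffer_free(p)     sketch_free(&sketch_buf, (p))

/* Returns 1 if an n-byte request was served from the on-stack array. */
static int
sketch_demo(size_t n)
{
    char       *p;
    int         on_stack;

    SKETCH_DECLARE_STACK_BUFFER();

    p = sketch_stack_buffer_alloc(n);
    on_stack = sketch_in_buffer(&sketch_buf, p);
    memset(p, 0, n);
    sketch_stack_buffer_free(p);
    return on_stack;
}
```

The caller writes the same three calls whichever path is taken, and the free call decides for itself whether there is anything to release.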

Or slightly higher level and supporting the most common use cases with
a one-liner:

    cstr1 = stack_buffer_strdup_with_len(str1, len1);
    cstr2 = stack_buffer_strdup_with_len(str2, len2);
    result = strcoll_l(cstr1, cstr2, locale);
    stack_buffer_free(cstr1);
    stack_buffer_free(cstr2);

Or for non-char cases without casts or pointer/size arithmetic, in the
style of recent palloc() variants:

    wcstr = stack_buffer_alloc_array(wchar_t, len);
    uchar = stack_buffer_alloc_array(UChar, len);
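
To illustrate what a typed array macro saves the caller from, here is a sketch that carves a correctly aligned array of `type` out of a raw `char` buffer. The names `sketch_carve` and `sketch_carve_array` are hypothetical and are not the patch's implementation. The sketch only computes the pointer; whether storing non-`char` objects in a declared `char` array is well defined is exactly the kind of question raised in the list above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/*
 * Return a pointer to room for 'count' elements of 'elem_size' bytes,
 * aligned to 'align' (a power of two), inside buf[0..bufsize), or NULL
 * if they do not fit.
 */
static inline void *
sketch_carve(char *buf, size_t bufsize,
             size_t elem_size, size_t align, size_t count)
{
    uintptr_t   start = (uintptr_t) buf;
    uintptr_t   aligned = (start + align - 1) & ~(uintptr_t) (align - 1);
    size_t      padding = (size_t) (aligned - start);

    if (padding > bufsize)
        return NULL;
    /* Overflow-safe form of: elem_size * count <= bufsize - padding */
    if (count != 0 && elem_size > (bufsize - padding) / count)
        return NULL;
    return buf + padding;
}

/* The macro supplies sizeof, alignof and the cast, so callers need none. */
#define sketch_carve_array(type, buf, bufsize, count) \
    ((type *) sketch_carve((buf), (bufsize), sizeof(type), \
                           _Alignof(type), (count)))

/* Returns 1 if alignment and bounds handling behave as described. */
static int
sketch_carve_demo(void)
{
    char        raw[64];
    wchar_t    *w;
    wchar_t    *too_big;

    /* Start one byte in, so the raw pointer is likely misaligned. */
    w = sketch_carve_array(wchar_t, raw + 1, sizeof(raw) - 1, 8);
    too_big = sketch_carve_array(wchar_t, raw, sizeof(raw), 1000);

    return w != NULL &&
        ((uintptr_t) w % _Alignof(wchar_t)) == 0 &&
        too_big == NULL;
}
```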

Better names/ideas welcome.

I also wondered if we might have a reasonable case for using alloca(),
where available.  It's pretty much the thing we are emulating, but
keeps the stack nice and compact without big holes to step over for
the following call to strcoll_l() or whatever it might be.  Though
it's non-standard and often discouraged due to the inherent danger of
overflow, our usage is metered.  I don't see why it's any more
dangerous than the existing code as long as our cap is applied to it,
or am I missing some other problem with that idea?  One issue with
USE_ALLOCA is that we have no systems where that wouldn't be used, so
the fallback code would be untested unless you comment the #define
out...
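
A minimal sketch of "metered" alloca() use, assuming a GCC or Clang compiler (it calls `__builtin_alloca` directly). The cap value and every `sketch_` name are hypothetical, and `malloc()`/`free()` stand in for `palloc()`/`pfree()`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ALLOCA_CAP 1024  /* hypothetical per-allocation cap */

static void *
sketch_heap_alloc(size_t size)
{
    void       *p = malloc(size);

    if (p == NULL)
        abort();
    return p;
}

/*
 * These must be macros: alloca() memory belongs to the frame of the
 * function that executes it, so a helper function would release it on
 * return.  Note that 'size' is evaluated more than once.
 */
#define sketch_tmp_alloc(size) \
    ((size) <= SKETCH_ALLOCA_CAP ? __builtin_alloca(size) \
                                 : sketch_heap_alloc(size))

/* The caller passes the size back so we know which path was taken. */
#define sketch_tmp_free(ptr, size) \
    do { \
        if ((size) > SKETCH_ALLOCA_CAP) \
            free(ptr); \
    } while (0)

/* Build a NUL-terminated copy of a counted string and measure it. */
static size_t
sketch_counted_strlen(const char *data, size_t size)
{
    char       *cstr = sketch_tmp_alloc(size + 1);
    size_t      result;

    memcpy(cstr, data, size);
    cstr[size] = '\0';
    result = strlen(cstr);
    sketch_tmp_free(cstr, size + 1);
    return result;
}
```

Small requests only move the stack pointer by what they need, with no fixed-size array left as a hole in the frame. The cap bounds a single allocation only; it says nothing about how deep the stack already is.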

Attachments

v1-0001-Provide-stack-allocation-API.patch

Re: A stack allocation API

From

Tom Lane

Date:

February 27, 18:35:39

Thomas Munro <thomas.munro@gmail.com> writes:
> In the locale code we often use a 1KB array for copies of strings
> where we need a NUL-terminated or transcoded version to give a library
> function, with a fallback to palloc() + pfree() if we need more space
> than that, but:

Yeah, I think there are some other use-cases too.

> I also wondered if we might have a reasonable case for using alloca(),
> where available.  It's pretty much the thing we are emulating, but
> keeps the stack nice and compact without big holes to step over for
> the following call to strcoll_l() or whatever it might be.

+1 for investigating alloca().  The one disadvantage I can see
to making this coding pattern more common is that it'll result in
increased stack usage, which is not great now and will become
considerably less great in our hypothetical multithreaded future.
If we can fix it so the typical stack consumption is a good deal
less than 1KB, that worry would be alleviated.

            regards, tom lane

Re: A stack allocation API

From

Andres Freund

Date:

February 27, 21:02:30

Hi,

On 2026-02-27 10:35:39 -0500, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > I also wondered if we might have a reasonable case for using alloca(),
> > where available.  It's pretty much the thing we are emulating, but
> > keeps the stack nice and compact without big holes to step over for
> > the following call to strcoll_l() or whatever it might be.
>
> +1 for investigating alloca().  The one disadvantage I can see
> to making this coding pattern more common is that it'll result in
> increased stack usage, which is not great now and will become
> considerably less great in our hypothetical multithreaded future.

Yea, that's what I immediately was thinking about too. IIRC, on linux, the
stack for the "main" thread is allocated on-demand, but the stack for threads
is mapped entirely upon creation (I think because it'd be hard to ensure
there's space for the stack otherwise). So there's more benefit in keeping the
stack small-ish with threads than there is in a process based model.

That said, I've thought about accelerating a few things with an
'on-stack-if-small-palloc-otherwise' approach as well.  Particularly things
like small StringInfos could really benefit from it - but it'd be a nontrivial
conversion, due to code calling pfree on the memory.  I guess we could introduce
a memory context that'd do nothing for pfree(), which could be used when using
the stack version, but IDK, that seems mighty ugly.

However, I'm pretty unconvinced of this argument

> in the past we've forgotten to pfree() large allocations and had to fix leaks

because we'll continue to rely on calling something to free anyway (due to
large objects) and using a different path for smaller objects just will make
it harder to find those.

I wish msvc implemented something akin to gcc/clang's
attribute(cleanup(cleanup_function)), but it doesn't look like it
does. Obviously it would if we were to compile with C++, but I don't think
anybody has appetite for the work it'd need to get there.
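
For reference, a minimal sketch of the GCC/Clang cleanup attribute mentioned above, assuming one of those compilers. The names `sketch_free_ptr` and `SKETCH_AUTOFREE` are hypothetical, and `malloc()`/`free()` stand in for `palloc()`/`pfree()`.

```c
#include <assert.h>
#include <stdlib.h>

static int  sketch_cleanup_calls = 0;   /* counts calls, for the demo only */

/* The cleanup function receives a pointer to the annotated variable. */
static void
sketch_free_ptr(void *varaddr)
{
    free(*(void **) varaddr);
    sketch_cleanup_calls++;
}

#define SKETCH_AUTOFREE __attribute__((cleanup(sketch_free_ptr)))

static int
sketch_use_autofree(int early_exit)
{
    SKETCH_AUTOFREE char *buf = malloc(64);

    if (buf == NULL || early_exit)
        return -1;              /* cleanup runs on this path too */

    buf[0] = 'x';
    return buf[0];              /* value is read before cleanup runs */
}
```

The cleanup runs on every normal exit from the scope, so no explicit free call is needed. As far as I know it does not run when the frame is abandoned by longjmp(), which matters for code that exits through PostgreSQL's error handling.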

Greetings,

Andres Freund

Re: A stack allocation API

From

Thomas Munro

Date:

March 11, 03:05:22

On Sat, Feb 28, 2026 at 7:02 AM Andres Freund <andres@anarazel.de> wrote:
> I wish msvc implemented something akin to gcc/clang's
> attribute(cleanup(cleanup_function)), but it doesn't look like it
> does. Obviously it would if we were to compile with C++, but I don't think
> anybody has appetite for the work it'd need to get there.

Well, it does have __try/__finally, but the future of the idea for C2y
looks like Go or Zig:

    #include <stddefer.h>

    p = malloc(size);
    defer free(p);

It's already available in bleeding edge GCC and Clang 22, based on the
TS 25755 draft[1].  Unfortunately for the real world, it's annoyingly
difficult to come up with *nice* looking macros that can expand to the
dialects we'd need to cover even if we were OK with abandoning the
notion of supporting C < 2y compilers other than the 3 we talk about,
due to syntactic structure and semantic differences, presumably:

1. C2y defer
2. C++ RAII
3. GCC/Clang cleanup attributes
4. MSVC __try/__finally

But with big honking framing macros BEGIN_MAGIC_PIXIE_DUST(), ...
pg_defer_pfree(p) (assuming we'd want a style with a pathway to the
future standard?); ... END_MAGIC_PIXIE_DUST(), I think it's probably
doable...  There are lots of people trying to suffer through
portable-enough-for-me and just-enough-functionality-for-my-project,
maybe some inspiration[2]...

[1] https://www.open-std.org/JTC1/SC22/WG14/www/docs/n3734.pdf
[2] https://antonz.org/defer-in-c/

Re: A stack allocation API

From

Thomas Munro

Date:

March 12, 04:52:16

On Sat, Feb 28, 2026 at 4:35 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I also wondered if we might have a reasonable case for using alloca(),
> > where available.  It's pretty much the thing we are emulating, but
> > keeps the stack nice and compact without big holes to step over for
> > the following call to strcoll_l() or whatever it might be.
>
> +1 for investigating alloca().  The one disadvantage I can see
> to making this coding pattern more common is that it'll result in
> increased stack usage, which is not great now and will become
> considerably less great in our hypothetical multithreaded future.
> If we can fix it so the typical stack consumption is a good deal
> less than 1KB, that worry would be alleviated.

Yeah.  I thought about all that quite a bit, and the dangers of
longjmp(), and various portability arcana, and came up with a v2.

Safety:

* DECLARE_STACK_BUFFER() is intended for general use, and gives you
only 128 bytes with the array implementation, enough for lots of
values/nulls arrays and such, and there is also
DECLARE_STACK_BUFFER_LARGE() for the pre-existing locales code that
already uses 1024-byte arrays.  To be really paranoid, I suppose we
could instead always use palloc() if you don't have alloca() support,
except when you ask for _LARGE, but I suspect 128 might be OK: it's in
the territory of plain-old-stack-frame-with-a-few-variables.  But I
don't think it matters terribly much either way, as no modern system
would really use array mode.

* If using alloca(), as far as I can see we don't have to be so
cautious *if* we additionally impose a total stack size limit.  Here I
chose halfway to check_stack_depth()'s trigger point, but much less
than that would also be fine I think.  You can see what that compiles
to at the end.  Thoughts?

* I came up with a way to implement
static_assert(!pg_in_lexical_scope(PG_{TRY,CATCH,FINALLY})).
setlongjmp()ing around inside the same stack frame obviously breaks
this stuff fundamentally.  I think it might be possible that some very
restricted patterns could be made to work: if you allocated before (but
not inside) PG_TRY, using volatile variables (or perhaps if
PG_TRY contained a compiler barrier?), then freeing it might be OK in
PG_CATCH, for example if plpy_exec.c moves its allocations up.  I'm
not too sure about that, though, and for now all stack buffer
operations are banned in all three scopes.  Better implementation
techniques for the resulting caveman meta-programming exercise are
welcome... it took me quite some time to make it out of the valley of
shadow warnings.
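
The total-stack-limit idea can be sketched as plain address arithmetic for a downward-growing stack. This follows the shape of the disassembly shown later in this message (take the nearer of "sp minus a per-function budget" and a global soft limit, then test each request against it). The function names and the use of `uintptr_t` are assumptions for illustration, not the patch's code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Downward-growing stack: "deeper" means a numerically smaller address.
 * The threshold is the shallower (numerically larger) of:
 *   - sp minus the budget this function may take from the stack
 *   - a process-wide soft limit that no allocation may reach
 */
static inline uintptr_t
sketch_alloc_threshold(uintptr_t sp, uintptr_t budget, uintptr_t soft_limit)
{
    uintptr_t   budget_floor = (budget <= sp) ? sp - budget : 0;

    return budget_floor < soft_limit ? soft_limit : budget_floor;
}

/* Would moving sp down by 'size' bytes stay strictly above the threshold? */
static inline int
sketch_fits_on_stack(uintptr_t sp, uintptr_t size, uintptr_t threshold)
{
    return size <= sp && sp - size > threshold;
}
```

With sp = 0x70000000, a 1024-byte budget and a distant soft limit, requests of up to 1023 bytes pass. With a soft limit only 256 bytes below sp, the soft limit becomes the threshold and a 300-byte request is sent to the slow path.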

Portability:

* There are no general purpose systems left that have stacks that grow
upwards AFAICS, right?  So I propose that we define PG_STACK_DIRECTION
to -1 in pg_config_manual.h, but continue to write code that should
work correctly if you set it to +1.  That simplifies the macrology and
de-branchifies the code.  It doesn't make sense to tolerate
non-existent computers by running zillions of runtime tests around the
world every day.

* I guess I should probably actually test it on PA-RISC (the last
stack-grows-up system?) under Qemu or something.  The way I approached
this was to write expressions in terms of PG_STACK_DIRECTION, instead
of chunks of #ifdef'd code that would never be seen by a living
compiler and thus inevitably bit-rot.  We could still assert that
reality matches expectations in a few spots, and then worry about
configure-time, #ifdef or template file control if a new platform ever
appears with an upward stack.

* PostgreSQL already wouldn't work with "split stack" code (it would
bogusly error out of check_stack_depth() when it meets a far away
stack segment), or other exotic systems like spaghetti stacks.  I
don't think this is much of an additional jump in our assumptions
about how C is actually implemented, it's just assuming that there are
no systems that grow down one day and up the next.  (A number of CPU
architectures could technically do that, but their ABI firmly nails
down which way it has to be, and the answer is down).

* For estimating whether there is enough free space to call alloca(),
it prefers to use builtins to read the stack pointer (GCC, Clang 22+),
but failing that it wastes a register that initially holds a
conservative estimate (address of a local variable, itself, which is
surely deeper-or-equal to the real stack pointer), and later holds the
result of the last alloca() call, which is 100% accurate on
downward-stack systems, and potentially only out by a small amount of
alignment fuzz on non-existent upward-stack systems because I assumed
that new sp = alloca(size) + size.  (It would be more accurate if I
did TYPALIGN(X, size), but I don't know what its X is, I only know
that it's >= alignof(max_align_t), so I'd have to go and read the
relevant ABI manual for this hypothetical upward-stack system to find
out, but since it doesn't exist, all thoughts about a perfect
estimation disappear in a puff of logic.)

* When judging if a pointer came from the stack, it additionally needs
a least-deep bound, but if there is no builtin for the current stack
frame address then it uses check_stack_depth.c's stack_base_ptr.  Both
are reliable, all we need is a pointer to anywhere less deep in the
stack than anything we've alloca()'d.  The builtin might technically
give you something from a function you're inlined into, but that's OK
too.  In practice probably only MSVC uses stack_base_ptr.

* It only uses literal "alloca()" for MSVC, which turns into a builtin
and thus has a property that I'm relying on.  For Unixen it uses a
GCC/Clang builtin directly, because otherwise I'd have to worry about
that property too.  Builtins are safe in a function call argument
because the compiler implements and understands them, but the original
alloca() from old Unixen somehow moved the stack pointer without the
compiler's knowledge, which used to trash the stack if it had already
started pushing other stuff onto the stack for the function call, at
least in some circumstances.  I assume no such things exist now, but
you'd still have to figure out which header to include, and that
varies on niche systems (<alloca.h>, <stddef.h>, ...?).

* For builtin detection, I ran into CC vs CXX vs CLANG problems, so I
gave up and invented pg_has_builtin().  I don't propose it be used
except where needed though since it doesn't exist on GCC < 10 (it's an
idea from Clang).  This means that old build farm compilers would use
the array version, which I think is probably desirable for coverage.
Current distros are shipping GCC 11+, probably mostly 14 or so, so I
think it's OK to exclude 10 from a minor optimisation project.
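
A sketch of what "expressions in terms of PG_STACK_DIRECTION" might look like, including the "did this pointer come from the stack?" test described above. `SKETCH_STACK_DIRECTION` and both functions are hypothetical stand-ins, and the sketch assumes the usual two's-complement conversion between `uintptr_t` and `intptr_t`.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_STACK_DIRECTION (-1) /* -1: grows toward lower addresses */

/* How many bytes deeper 'p' is than 'base' (negative if shallower). */
static inline intptr_t
sketch_stack_depth(uintptr_t base, uintptr_t p)
{
    return SKETCH_STACK_DIRECTION * (intptr_t) (p - base);
}

/*
 * True if 'p' lies between a least-deep bound ('shallow', for example a
 * frame address or the stack base) and the current stack pointer 'sp'.
 * No #ifdef is needed: flipping SKETCH_STACK_DIRECTION flips the test.
 */
static inline int
sketch_is_stack_pointer(uintptr_t shallow, uintptr_t sp, uintptr_t p)
{
    intptr_t    d = sketch_stack_depth(shallow, p);

    return d >= 0 && d <= sketch_stack_depth(shallow, sp);
}
```

Because the direction is a compile-time constant, the multiplication should fold away, and code for the other direction is still compiled and type-checked rather than hidden behind a preprocessor branch.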

Potential sites that can use this:

I have attached a bunch of easy candidate changes, almost all
mechanical changes for small temporary arrays of values/nulls,
scankeys, etc or temporary C-strings where no refactoring was needed
and the no-escape property was obvious to the eye.  In the case of the
planner, I worked a bit harder to find a couple of things that have
pass-down-but-never-capture semantics, based on hunches...  FWIW I do
see a small planner speed-up in the join-order-optimisation test David
mentioned the other day in simple tests, though I haven't studied it
seriously enough to report on yet.  I'm actually not too sure yet how
to evaluate this stuff at a higher level, ie the effects of tiny
micro-optimisations scattered all over the tree, and only in the
arbitrary places that didn't need refactoring.  So for now I'm
reporting only on a toy C-stringification microbenchmark: see end.

Ideas for further development:

* I wanted this to be useful for SIMD, so I made sure that
stack_buffer_alloc_object(T) would handle strict alignment
automatically based on alignof(T).  palloc_object(T) et al should get
the same treatment, as I wrote in another thread about the funky __128
situation (and I have basic patches, for another day).

* I see how to implement _realloc(): if ptr == sp, then you have the
most recent allocation, so if you (somehow) know the original size you
can allocate the difference and memmove().  (This condition would
never be met for stack-upwards, but that's OK).  In all other cases
you'd probably fall straight back to palloc() I think.  I don't know
if there'd be much call for it though.

* Presumably this could benefit from a sprinkle of VALGRIND and sanitizer clues.

* It would be nice to figure out how to make at least one compiler
complain about very obvious escapes, as they do if you return a
pointer to a normal stack variable, but I guess the main problem isn't
the obvious ones.

* Many uses of temporary C strings wouldn't be needed if we accepted
known-length strings in more APIs; it may be unavoidable for libc
functions, but our own reader etc should probably ideally be able to
cope... I have a separate project investigating a centralised string
iterator so that we can make all of our string processing behave the
same way and will think about that...

* There are quite a few places that form or copy temporary
non-escaping index tuples, tuples, tupledescs, values/nulls extracted
from arrays etc that could probably live on the stack quite happily if
we split a few functions into "how much do you need?" and "in place"
pairs.  That's been done in a few ad hoc spots for parallel query eg
tuple descriptors.

* Independently of this project, but really the same sort of thing, I
have long wanted to look into all the places that form a tuple, copy
it and then free it (hashing, sorting, materialising) when they should
form into caller-supplied destination, be it the stack, a hash table,
a tuple queue or whatever else...

* Something like "defer" for magic automatic pfree() on scope exit if
we can figure out how to do it without waiting for the year 202y + 10,
as mentioned in reply to Andres...

* Even without that, I suppose GCC cleanup callbacks could still be
used to scribble on the memory that goes out of scope.  I guess it
must be pretty hard to miss bugs anyway since the stack is already
scribbled to smithereens.

For a very simple micro-benchmark, I tried a trivial C-string
conversion wrapper:

   for (int i = 0; i < 10000000; ++i)
       my_strlen("hello world", 11);

Given this definition of a completely useless function that exercises
C-string construction:

size_t
my_strlen(const char *data, size_t size)
{
    char *cstr;
    size_t result;

    DECLARE_STACK_BUFFER();

    cstr = stack_buffer_strdup_with_len(data, size);
    result = strlen(cstr);
    stack_buffer_free(cstr);

    return result;
}

With STACK_BUFFER_USE_PALLOC, which just expands those macros to
palloc()/pfree(), it runs in ~216ms here, and with
STACK_BUFFER_USE_ALLOCA or STACK_BUFFER_USE_ARRAY it runs in ~155ms,
so that's a 1.39x speedup.

The generated code looks like:

   0x0000000000979323 <+19>:    mov    0x442a06(%rip),%rdx        # 0xdbbd30 <stack_soft_limit_ptr>
   0x000000000097932a <+26>:    lea    -0x400(%rsp),%rax          <-- sp - STACK_BUFFER_DEFAULT
   0x0000000000979332 <+34>:    cmp    %rdx,%rax
   0x0000000000979335 <+37>:    cmovb  %rdx,%rax                  <-- take the nearer address
%rax now holds the threshold stack address against which all
allocation attempts will be compared.  Uninteresting computation of
-(size + 1) for NUL byte:

   0x0000000000979339 <+41>:    mov    %rsi,%rdx
   0x000000000097933c <+44>:    not    %rdx

Then allocation attempt:

   0x000000000097933f <+47>:    add    %rsp,%rdx
   0x0000000000979342 <+50>:    cmp    %rdx,%rax                 <-- would it fit?
   0x0000000000979345 <+53>:    jae    0x979380 <my_strlen+112>  <-- nope, jump to slow path
   0x0000000000979347 <+55>:    lea    0x10(%rsi),%rax
   0x000000000097934b <+59>:    mov    %rsi,%rdx
   0x000000000097934e <+62>:    mov    %rdi,%rsi
   0x0000000000979351 <+65>:    and    $0xfffffffffffffff0,%rax  <-- alloca() aligns
   0x0000000000979355 <+69>:    sub    %rax,%rsp                 <-- alloca() allocates

It's then a straight-line run through the uninteresting bits to
return the result, with stack_buffer_free() entirely elided:

   0x0000000000979358 <+72>:    mov    %rsp,%rdi
   0x000000000097935b <+75>:    call   0x4950d0 <memcpy@plt>
   0x0000000000979360 <+80>:    movb   $0x0,(%rsp,%rbx,1)
   0x0000000000979364 <+84>:    mov    %rsp,%rdi
   0x0000000000979367 <+87>:    call   0x494890 <strlen@plt>
   0x000000000097936c <+92>:    lea    -0x18(%rbp),%rsp
   0x0000000000979370 <+96>:    pop    %rbx
   0x0000000000979371 <+97>:    pop    %r12
   0x0000000000979373 <+99>:    pop    %r13
   0x0000000000979375 <+101>:   pop    %rbp
   0x0000000000979376 <+102>:   ret

If it wouldn't fit, the out-of-line implementation looks like what you
get with STACK_BUFFER_USE_PALLOC:

   0x0000000000979377 <+103>:   nopw   0x0(%rax,%rax,1)
   0x0000000000979380 <+112>:   lea    0x1(%rsi),%rdi
   0x0000000000979384 <+116>:   call   0xa252f0 <palloc>
   0x0000000000979389 <+121>:   mov    %rbx,%rdx
   0x000000000097938c <+124>:   mov    %r12,%rsi
   0x000000000097938f <+127>:   mov    %rax,%r13
   0x0000000000979392 <+130>:   mov    %rax,%rdi
   0x0000000000979395 <+133>:   call   0x4950d0 <memcpy@plt>
   0x000000000097939a <+138>:   movb   $0x0,0x0(%r13,%rbx,1)
   0x00000000009793a0 <+144>:   mov    %r13,%rdi
   0x00000000009793a3 <+147>:   call   0x494890 <strlen@plt>
   0x00000000009793a8 <+152>:   cmp    %rsp,%r13
   0x00000000009793ab <+155>:   jb     0x9793b2 <my_strlen+162>
   0x00000000009793ad <+157>:   cmp    %r13,%rbp
   0x00000000009793b0 <+160>:   jae    0x97936c <my_strlen+92>
   0x00000000009793b2 <+162>:   mov    %r13,%rdi
   0x00000000009793b5 <+165>:   mov    %rax,-0x28(%rbp)
   0x00000000009793b9 <+169>:   call   0xa255c0 <pfree>
   0x00000000009793be <+174>:   mov    -0x28(%rbp),%rax
   0x00000000009793c2 <+178>:   jmp    0x97936c <my_strlen+92>

In more complicated functions GCC's costing only moves the actual
palloc() and pfree() calls out-of-line and jumps back to the
straight-line code, while here it decided it was cheaper to duplicate
the memcpy(), strlen() calls.  Either way the straight-line code
assumes the stack will be used, which is hopefully correct most of the
time.

It became a lot more aggressive about that sort of thing when I added
the "stack_buffer_maybe_pfree" flag which seemed to flip GCC's
costing; before that it would often leave the
if-it's-not-between-%rbp-and-%rsp-then-call-pfree code in the main
code.  I'm not 100% sure about all that, but it *looks* like an
effective optimisation.

I've attached some data on allocation sizes from the regression tests
(so not hitting all the places changed here), captured with
STACK_BUFFER_USE_PALLOC_LOG.  Obviously not representative of real
usage, but that technique can be used to check real allocation sizes
and stack depths for any workload.

. o O ( I originally called it stack_buffer because I started with
"let's standardise the array trick", so it was a buffer.  Perhaps a
better name for all this stuff would be pg_stack_alloc() or such,
since it doesn't really have a buffer anymore except in a fallback
mode... )
