Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

From: "Jonah H. Harris"
Date:
Hi everyone,

I've been working on a federated database project that heavily relies on foreign data wrappers. During benchmarking, we noticed high system CPU usage in OLTP-related cases, which we traced back to multiple brk calls resulting from block frees in AllocSetReset upon ExecutorEnd's FreeExecutorState. This is because FDWs allocate their own derived execution states and required data structures within this context, exceeding the initial 8K allocation; all of that memory then needs to be cleaned up.

Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES to a larger initial "appropriate size" solved the issue and almost doubled the throughput. However, the "appropriate size" is workload and implementation dependent, so making it configurable may be better than increasing the defaults, which would negatively impact users (memory-wise) who aren't encountering this scenario.
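
For reference, the context in question is created in CreateExecutorState() (execUtils.c) with ALLOCSET_DEFAULT_SIZES. A minimal sketch of the kind of change I'm describing, with a hypothetical GUC supplying the initial block size, would look something like:

  /* in CreateExecutorState(); executor_state_init_size is a hypothetical GUC */
  qcontext = AllocSetContextCreate(CurrentMemoryContext,
                                   "ExecutorState",
                                   ALLOCSET_DEFAULT_MINSIZE,
                                   executor_state_init_size,  /* e.g. 64 * 1024 */
                                   ALLOCSET_DEFAULT_MAXSIZE);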

I have a patch to make it configurable, but before submitting it, I wanted to hear your thoughts and feedback on this and any other alternative ideas you may have.

--
Jonah H. Harris

Hi,

On 2023-02-16 16:49:07 -0500, Jonah H. Harris wrote:
> I've been working on a federated database project that heavily relies on
> foreign data wrappers. During benchmarking, we noticed high system CPU
> usage in OLTP-related cases, which we traced back to multiple brk calls
> resulting from block frees in AllocSetReset upon ExecutorEnd's
> FreeExecutorState. This is because FDWs allocate their own derived
> execution states and required data structures within this context,
> exceeding the initial 8K allocation; all of that memory then needs to be
> cleaned up.

What PG version?

Do you have a way to reproduce this with core code,
e.g. postgres_fdw/file_fdw?

What is all that memory used for? Is it possible that the real issue is too
many tiny allocations, due to some allocation growing slowly?


> Increasing the default query context allocation from ALLOCSET_DEFAULT_SIZES
> to a larger initial "appropriate size" solved the issue and almost doubled
> the throughput. However, the "appropriate size" is workload and
> implementation dependent, so making it configurable may be better than
> increasing the defaults, which would negatively impact users (memory-wise)
> who aren't encountering this scenario.
> 
> I have a patch to make it configurable, but before submitting it, I wanted
> to hear your thoughts and feedback on this and any other alternative ideas
> you may have.

This seems way too magic to expose to users. How would they ever know how to
set it? And it will depend heavily on the specific queries, so a global config won't
work well.

If the issue is a specific FDW needing to make a lot of allocations, I can see
adding an API to tell a memory context that it ought to be ready to allocate a
certain amount of memory efficiently (e.g. by increasing the block size of the
next allocation by more than 2x).
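
As a rough sketch (the name and the direct field pokes are illustrative only,
not an existing interface):

  /*
   * Hint that roughly 'size' bytes of allocations are coming, so that the
   * next malloc'd block can absorb them; dispatch through the context
   * methods is omitted here.
   */
  void
  MemoryContextPrepareAlloc(MemoryContext context, Size size)
  {
      AllocSet    set = (AllocSet) context;

      if (size > set->nextBlockSize)
          set->nextBlockSize = Min(size, set->maxBlockSize);
  }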

Greetings,

Andres Freund



Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

From: "Jonah H. Harris"
Date:
On Thu, Feb 16, 2023 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
What PG version?

Hey, Andres. Thanks for the reply.

Given that not much changed regarding that allocation context IIRC, I’d think all recent versions. It was observed in 13, 14, and 15.

Do you have a way to reproduce this with core code,
e.g. postgres_fdw/file_fdw?

I’ll have to create one - it was most evident on a TPC-C or sysbench test using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible with pgbench as well.

What is all that memory used for? Is it possible that the real issue is too
many tiny allocations, due to some allocation growing slowly?

The FDW state management allocations and whatever each FDW needs to accomplish its goals. Different FDWs do different things.

This seems way too magic to expose to users. How would they ever know how to
set it? And it will depend heavily on the specific queries, so a global config won't
work well.

Agreed on the nastiness of exposing it directly. Not that we don’t give users control of memory anyway, but that one is easier to mess up without at least putting some custom set bounds on it.


If the issue is a specific FDW needing to make a lot of allocations, I can see
adding an API to tell a memory context that it ought to be ready to allocate a
certain amount of memory efficiently (e.g. by increasing the block size of the
next allocation by more than 2x).

While I’m happy to be wrong, it seems to be an inherent problem not really specific to FDW implementations themselves, but rather to the general expectation that FDWs use more of that context than non-FDW cases do for similar types of operations, which wasn’t really a consideration in how that allocation has been sized over time.

If we come up with some sort of alternate allocation strategy, I don’t know how it would be very clean API-wise, but it’s definitely an idea.





--
Jonah H. Harris

Hi,

On 2023-02-16 21:34:18 -0500, Jonah H. Harris wrote:
> On Thu, Feb 16, 2023 at 7:32 PM Andres Freund <andres@anarazel.de> wrote:
> Given that not much changed regarding that allocation context IIRC, I’d think
> all recent versions. It was observed in 13, 14, and 15.

We did have a fair number of changes in related code in the last few years,
including some in 16. I wouldn't expect them to help *hugely*, but also
wouldn't be surprised if they showed.


> I’ll have to create one - it was most evident on a TPC-C or sysbench test
> using the Postgres, MySQL, SQLite, and Oracle FDWs. It may be reproducible
> with pgbench as well.

I'd like a workload that hits a perf issue with this, because I think there
likely are some general performance improvements that we could make, without
changing the initial size or the "growth rate".

Perhaps, as a starting point, you could get
  MemoryContextStats(queryDesc->estate->es_query_cxt)
both at the end of standard_ExecutorStart() and at the beginning of
standard_ExecutorFinish(), for one of the queries triggering the performance
issues?


> > If the issue is a specific FDW needing to make a lot of allocations, I can
> > see adding an API to tell a memory context that it ought to be ready to
> > allocate a certain amount of memory efficiently (e.g. by increasing the
> > block size of the next allocation by more than 2x).
>
>
> While I’m happy to be wrong, it seems to be an inherent problem not really
> specific to FDW implementations themselves, but rather to the general
> expectation that FDWs use more of that context than non-FDW cases do for
> similar types of operations, which wasn’t really a consideration in how that
> allocation has been sized over time.

Lots of things can end up in the query context; it's really not FDW-specific
for it to be of nontrivial size. E.g. most tuples passed around end up in it.

Similar performance issues also exist for plenty of other memory contexts,
which for me is a reason *not* to make it configurable just for
CreateExecutorState. Or were you proposing to make ALLOCSET_DEFAULT_INITSIZE
configurable? That would end up with a lot of waste, I think.


The executor context case might actually be a comparatively easy case to
address. There are really two "phases" of use for es_query_cxt. First, we create
the entire executor tree in it, during standard_ExecutorStart(). Second,
during query execution, we allocate things with query lifetime (be that
because they need to live till the end, or because they are otherwise
managed, like tuples).

Even very simple queries end up with multiple blocks at the end:
E.g.
  SELECT relname FROM pg_class WHERE relkind = 'r' AND relname = 'frak';
yields:
  ExecutorState: 43784 total in 3 blocks; 8960 free (5 chunks); 34824 used
    ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Grand total: 51976 bytes in 4 blocks; 16888 free (5 chunks); 35088 used

So quite justifiably we could just increase the hardcoded size in
CreateExecutorState. I'd expect that starting a few size classes up would help
noticeably.


But I think we likely could do better here. The amount of memory that ends up
in es_query_cxt during "phase 1" strongly correlates with the complexity of
the statement, as the whole executor tree ends up in it.  Using information
about the complexity of the planned statement to influence es_query_cxt's
block sizes would make sense to me.  I suspect it's a decent enough proxy for
"phase 2" as well.


Medium-long term I really want to allocate at least all the executor nodes
themselves in a single allocation. But that's a bit further out than what
we're talking about here.

Greetings,

Andres Freund



On Fri, 17 Feb 2023 at 16:40, Andres Freund <andres@anarazel.de> wrote:
> I'd like a workload that hits a perf issue with this, because I think there
> likely are some general performance improvements that we could make, without
> changing the initial size or the "growth rate".

I didn't hear it mentioned explicitly here, but I suspect it's faster
when increasing the initial size due to the memory context caching
code that reuses aset MemoryContexts (see context_freelists[] in
aset.c). Since we reset the context before caching it, it'll
remain fast when we can reuse a context, provided we don't need to do
a malloc for an additional block beyond the initial block that's kept
in the cache.

Maybe we should think of a more general-purpose way of doing this
caching which just keeps a global-to-the-process dclist of blocks
laying around.  We could see if we have any free blocks both when
creating the context and also when we need to allocate another block.
I see no reason why this couldn't be shared among the other context
types rather than keeping this cache stuff specific to aset.c.  slab.c
might need to be pickier if the size isn't exactly what it needs, but
generation.c should be able to make use of it the same as aset.c
could.  I'm unsure what we'd need in the way of size classing for
this, but I suspect we'd need to pay attention to that rather than do
things like hand over 16MBs of memory to some context that only wants
a 1KB initial block.
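
In rough terms, I'm imagining something like the following (all names are
hypothetical, and each list would need a dclist_init() at process start):

  #define NUM_BLOCK_SIZE_CLASSES  3       /* e.g. 1KB, 8KB, 16KB */

  typedef struct CachedBlock
  {
      dlist_node  node;       /* embedded at the start of the cached block */
  } CachedBlock;

  static dclist_head block_cache[NUM_BLOCK_SIZE_CLASSES];

  /* try the cache before falling back to malloc() */
  static void *
  block_cache_get(int size_class)
  {
      if (!dclist_is_empty(&block_cache[size_class]))
          return dclist_pop_head_node(&block_cache[size_class]);
      return NULL;
  }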

David



Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

From: "Jonah H. Harris"
Date:
On Thu, Feb 16, 2023 at 11:26 PM David Rowley <dgrowleyml@gmail.com> wrote:
I didn't hear it mentioned explicitly here, but I suspect it's faster
when increasing the initial size due to the memory context caching
code that reuses aset MemoryContexts (see context_freelists[] in
aset.c). Since we reset the context before caching it, it'll
remain fast when we can reuse a context, provided we don't need to do
a malloc for an additional block beyond the initial block that's kept
in the cache.

This is what we were seeing. The larger initial size reduces/eliminates the multiple smaller blocks that are malloc'd and freed on each query execution.

Maybe we should think of a more general-purpose way of doing this
caching which just keeps a global-to-the-process dclist of blocks
laying around.  We could see if we have any free blocks both when
creating the context and also when we need to allocate another block.
I see no reason why this couldn't be shared among the other context
types rather than keeping this cache stuff specific to aset.c.  slab.c
might need to be pickier if the size isn't exactly what it needs, but
generation.c should be able to make use of it the same as aset.c
could.  I'm unsure what we'd need in the way of size classing for
this, but I suspect we'd need to pay attention to that rather than do
things like hand over 16MBs of memory to some context that only wants
a 1KB initial block.

Yeah. There’s definitely a smarter and more reusable approach than I was proposing. A lot of that code is fairly mature and I figured more people wouldn’t want to alter it in such ways - but I’m up for it if an approach like this is the direction we’d want to go in.



--
Jonah H. Harris

On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Yeah. There’s definitely a smarter and more reusable approach than I was proposing. A lot of that code is fairly
> mature and I figured more people wouldn’t want to alter it in such ways - but I’m up for it if an approach like this
> is the direction we’d want to go in.

I've spent quite a bit of time in this area recently and I think that
context_freelists[] is showing its age now. It does seem that slab and
generation were added before context_freelists[] (9fa6f00b), but not
by much, and those new contexts had fewer users back then. It feels a
little unfair that aset should get to cache but the other context
types don't.  I don't think each context type should have some
separate cache either as that probably means more memory wasted.
Having something agnostic to whether it's allocating a new context or
adding a block to an existing one seems like a good idea to me.

I think the tricky part will be the discussion around which size
classes to keep around and in which cases we can use a larger
allocation without worrying too much that it'll be wasted. We also
don't really want to make the minimum memory that a backend can keep
around too bad. Patches such as [1] are trying to reduce that.  Maybe
we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or
more accurately put, ALLOCSET_SMALL_INITSIZE,
ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that
it works correctly if someone adjusts those definitions.

I think you'll want to look at what the maximum memory a backend can
keep around in context_freelists[] and not make the worst-case memory
consumption worse than it is today.

I imagine this would be some new .c file in src/backend/utils/mmgr
from which aset.c, generation.c and slab.c would each call a function to see
if we have any cached blocks of that size.  You'd want to call that in
all places we call malloc() from those files apart from when aset.c
and generation.c malloc() for a dedicated block.  You can probably get
away with replacing all of the free() calls with a call to another
function where you pass the pointer and the size of the block to have
it decide if it's going to free() it or cache it.  I doubt you need to
care too much if the block is from a dedicated allocation or a normal
block.  We'd just always free() if it's not in the size classes that
we care about.
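
i.e. roughly, continuing the hypothetical names from my earlier sketch:

  #define MAX_CACHED_PER_CLASS    4       /* arbitrary worst-case bound */

  /* replacement for free(): keep the block if it's a size class we cache */
  static void
  block_cache_release(void *block, Size size)
  {
      int     size_class = block_size_class(size);    /* hypothetical helper */

      if (size_class >= 0 &&
          dclist_count(&block_cache[size_class]) < MAX_CACHED_PER_CLASS)
          dclist_push_head(&block_cache[size_class], (dlist_node *) block);
      else
          free(block);
  }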

David

[1] https://commitfest.postgresql.org/42/3867/



Re: Reducing System Allocator Thrashing of ExecutorState to Alleviate FDW-related Performance Degradations

From: "Jonah H. Harris"
Date:
On Fri, Feb 17, 2023 at 12:03 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 17 Feb 2023 at 17:40, Jonah H. Harris <jonah.harris@gmail.com> wrote:
> Yeah. There’s definitely a smarter and more reusable approach than I was proposing. A lot of that code is fairly mature and I figured more people wouldn’t want to alter it in such ways - but I’m up for it if an approach like this is the direction we’d want to go in.

Having something agnostic to whether it's allocating a new context or
adding a block to an existing one seems like a good idea to me.

I like this idea.
 
I think the tricky part will be the discussion around which size
classes to keep around and in which cases we can use a larger
allocation without worrying too much that it'll be wasted. We also
don't really want to make the minimum memory that a backend can keep
around too bad. Patches such as [1] are trying to reduce that.  Maybe
we can just keep a handful of blocks of 1KB, 8KB and 16KB around, or
more accurately put, ALLOCSET_SMALL_INITSIZE,
ALLOCSET_DEFAULT_INITSIZE and ALLOCSET_DEFAULT_INITSIZE * 2, so that
it works correctly if someone adjusts those definitions.

Per that patch and the general idea, what do you think of either:

1. A single GUC, something like backend_keep_mem, that represents the cached memory we'd retain rather than send directly to free()?
2. Multiple GUCs, one per block size?

While #2 would give more granularity, I'm not sure it would necessarily be needed. The main issue I'd see in that case would be the selection approach to block sizes to keep given a fixed amount of keep memory. We'd generally want the majority of subsequent queries to make use of it as well as possible, so we'd either need each size to be equally represented or some heuristic.

I don't really like #2, but threw it out there :)
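
For #1, I'd picture a plain memory GUC, i.e. a hypothetical guc_tables.c entry along these lines (name and defaults are placeholders):

  {
      {"backend_keep_mem", PGC_USERSET, RESOURCES_MEM,
          gettext_noop("Maximum amount of freed block memory retained for reuse."),
          NULL,
          GUC_UNIT_KB
      },
      &backend_keep_mem,
      1024, 0, MAX_KILOBYTES,
      NULL, NULL, NULL
  },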

I think you'll want to look at what the maximum memory a backend can
keep around in context_freelists[] and not make the worst-case memory
consumption worse than it is today.

Agreed.
 
I imagine this would be some new .c file in src/backend/utils/mmgr
from which aset.c, generation.c and slab.c would each call a function to see
if we have any cached blocks of that size.  You'd want to call that in
all places we call malloc() from those files apart from when aset.c
and generation.c malloc() for a dedicated block.  You can probably get
away with replacing all of the free() calls with a call to another
function where you pass the pointer and the size of the block to have
it decide if it's going to free() it or cache it.

Agreed. I would see this as practically just a generic allocator free-list; is that how you view it also?
 
I doubt you need to care too much if the block is from a dedicated allocation or a normal
block.  We'd just always free() if it's not in the size classes that
we care about.

Agreed.

--
Jonah H. Harris

Hi,

On 2023-02-17 17:26:20 +1300, David Rowley wrote:
> I didn't hear it mentioned explicitly here, but I suspect it's faster
> when increasing the initial size due to the memory context caching
> code that reuses aset MemoryContexts (see context_freelists[] in
> aset.c). Since we reset the context before caching it, it'll
> remain fast when we can reuse a context, provided we don't need to do
> a malloc for an additional block beyond the initial block that's kept
> in the cache.

I'm not so sure this is the case. Which is one of the reasons I'd really like
to see a) memory context stats for executor context b) a CPU profile of the
problem c) a reproducer.

Jonah, did you just increase the initial size, or did you potentially also
increase the maximum block size?

And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just pass a
larger block size in CreateExecutorState()?  If the latter, the context
freelist wouldn't even come into play.


An 8MB max block size is pretty darn small if you have a workload that ends up
with gigabytes worth of blocks.

And the problem also could just be that the default initial block size takes
too long to ramp up to a reasonable block size. I think it's 20 blocks to get
from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE.  Even if you
allocate a good bit more than 8MB, having to additionally go through 20
smaller chunks is going to be noticeable until you reach a good bit higher
number of blocks.


> Maybe we should think of a more general-purpose way of doing this
> caching which just keeps a global-to-the-process dclist of blocks
> laying around.  We could see if we have any free blocks both when
> creating the context and also when we need to allocate another block.

Not so sure about that. I suspect the problem could just as well be the
maximum block size, leading to too many blocks being allocated. Perhaps we
should scale that to a certain fraction of work_mem, by default?
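
Sketch-wise, something like this (the divisor is arbitrary):

  /* work_mem is in KB; let large-work_mem sessions grow larger blocks */
  Size    maxBlockSize = Max(ALLOCSET_DEFAULT_MAXSIZE,
                             ((Size) work_mem * 1024) / 16);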

Either way, I don't think we should go too deep without some data, too likely
to miss the actual problem.



> I see no reason why this couldn't be shared among the other context
> types rather than keeping this cache stuff specific to aset.c.  slab.c
> might need to be pickier if the size isn't exactly what it needs, but
> generation.c should be able to make use of it the same as aset.c
> could.  I'm unsure what we'd need in the way of size classing for
> this, but I suspect we'd need to pay attention to that rather than do
> things like hand over 16MBs of memory to some context that only wants
> a 1KB initial block.

Possible. I can see something like a generic "free block" allocator being
useful. Potentially with allocating the underlying memory with larger mmap()s
than we need for individual blocks.


Random note:

I wonder if we should have a bitmap (in an int) in front of aset's
freelist. In a lot of cases we incur plenty of cache misses, just to find the
freelist bucket empty.
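
Roughly (the bitmap field and its upkeep are hypothetical; fidx/chunk are the
locals aset.c already uses):

  uint32      freelist_bitmap;    /* new field in AllocSetContext */

  /* when pushing a chunk onto freelist[fidx] */
  set->freelist_bitmap |= (1u << fidx);

  /* when popping empties freelist[fidx] */
  if (set->freelist[fidx] == NULL)
      set->freelist_bitmap &= ~(1u << fidx);

  /* allocation path: one register test instead of a likely cache miss */
  if (set->freelist_bitmap & (1u << fidx))
      chunk = set->freelist[fidx];    /* guaranteed non-NULL */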

Greetings,

Andres Freund



Hi,

On 2023-02-17 09:52:01 -0800, Andres Freund wrote:
> On 2023-02-17 17:26:20 +1300, David Rowley wrote:
> Random note:
>
> I wonder if we should have a bitmap (in an int) in front of aset's
> freelist. In a lot of cases we incur plenty of cache misses, just to find the
> freelist bucket empty.

Two somewhat related thoughts:

1) We should move AllocBlockData->freeptr into AllocSetContext. It's only ever
   used for the block at the head of ->blocks.

   We completely unnecessarily incur more cache line misses due to this (and
   waste a tiny bit of space).

2) We should introduce an mcxt.c API to perform allocations that the
   caller promises not to individually free.  We've talked a bunch about
   introducing a bump allocator memory context, but that requires using
   dedicated memory contexts, which incurs noticeable space overhead, whereas
   just having a separate function call for the existing memory contexts
   doesn't have that issue.

   For aset.c we should just allocate from set->freeptr, without going through
   the freelist. Obviously we'd not round up to a power of 2. And likely, at
   least outside of assert builds, we should not have a chunk header.
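
   To make 2) concrete, assuming the freeptr/endptr from 1) now live in
   AllocSetContext, the fast path could be as simple as (names hypothetical):

     void *
     palloc_nofree(MemoryContext context, Size size)
     {
         AllocSet    set = (AllocSet) context;
         void       *ret;

         size = MAXALIGN(size);      /* align, but no power-of-2 rounding */

         if ((Size) (set->endptr - set->freeptr) < size)
             allocset_grab_block(set, size);     /* hypothetical slow path */

         ret = set->freeptr;
         set->freeptr += size;       /* no chunk header, no freelist entry */
         return ret;
     }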


Greetings,

Andres Freund



On Tue, 21 Feb 2023 at 07:30, Andres Freund <andres@anarazel.de> wrote:
> 2) We should introduce an mcxt.c API to perform allocations that the
>    caller promises not to individually free.

It's not just pfree. Offhand, there's also repalloc,
GetMemoryChunkSpace and GetMemoryChunkContext too.

I am interested in a bump allocator for tuplesort.c. There it would be
used in isolation and all the code which would touch pointers
allocated by the bump allocator would be self-contained to the
tuplesorting code.

What use case do you have in mind?

David



Hi,


On 2023-02-21 08:33:22 +1300, David Rowley wrote:
> On Tue, 21 Feb 2023 at 07:30, Andres Freund <andres@anarazel.de> wrote:
> > 2) We should introduce an mcxt.c API to perform allocations that the
> >    caller promises not to individually free.
> 
> It's not just pfree. Offhand, there's also repalloc,
> GetMemoryChunkSpace and GetMemoryChunkContext too.

Sure, and all of those should assert out / crash if used on an allocation
made that way.


> I am interested in a bump allocator for tuplesort.c. There it would be
> used in isolation and all the code which would touch pointers
> allocated by the bump allocator would be self-contained to the
> tuplesorting code.
> 
> What use case do you have in mind?

E.g. the whole executor state tree (and likely also the plan tree) should be
allocated that way. They're never individually freed. But we also allocate
other things in the same context, and those do need to be individually
freeable. We could use a separate memory context, but that'd increase memory
usage in many cases, because there'd be two different blocks being allocated
from at the same time.

To me opting into this on a per-allocation basis seems likely to make this
more widely usable than requiring a distinct memory context type.

Greetings,

Andres Freund



On Sat, 18 Feb 2023 at 06:52, Andres Freund <andres@anarazel.de> wrote:
> And did you increase ALLOCSET_DEFAULT_INITSIZE everywhere, or just pass a
> larger block size in CreateExecutorState()?  If the latter, the context
> freelist wouldn't even come into play.

I think this piece of information is critical to confirm what the issue is.

> An 8MB max block size is pretty darn small if you have a workload that ends up
> with gigabytes worth of blocks.

We should probably review that separately.  These kinds of definitions
don't age well. The current ones appear about 23 years old now, so we
might be overdue to reconsider what they're set to.

2002-12-15 21:01:34 +0000 150) #define ALLOCSET_DEFAULT_MINSIZE   0
2000-06-28 03:33:33 +0000 151) #define ALLOCSET_DEFAULT_INITSIZE  (8 * 1024)
2000-06-28 03:33:33 +0000 152) #define ALLOCSET_DEFAULT_MAXSIZE   (8 * 1024 * 1024)

... I recall having a desktop with 256MB of RAM back then...

Let's get to the bottom of where the problem is here before we
consider adjusting those. If the problem is unrelated to that then we
shouldn't be discussing that here.

> And the problem also could just be that the default initial block size takes
> too long to ramp up to a reasonable block size. I think it's 20 blocks to get
> from ALLOCSET_DEFAULT_INITSIZE to ALLOCSET_DEFAULT_MAXSIZE.  Even if you
> allocate a good bit more than 8MB, having to additionally go through 20
> smaller chunks is going to be noticeable until you reach a good bit higher
> number of blocks.

Well, let's try to help Jonah get the information to us. I've attached
a quickly put together patch which adds some debug stuff to aset.c.
Jonah, if you have a suitable test instance to try this on, can you
send us the filtered DEBUG output from the log messages starting with
"AllocSet" with and without your change?  Just the output for just the
2nd execution of the query in question is fine.  The first execution
is not useful as the cache of MemoryContexts may not be populated by
that time.  It sounds like it's the foreign server that would need to
be patched with this to test it.

If you can send that in two files we should be able to easily see what
has changed in terms of malloc() calls between the two runs.
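
The instrumentation amounts to something like the following in aset.c's
block-allocation paths (illustrative; the attached patch is authoritative):

  /* in AllocSetAlloc(), wherever a new block is malloc'd */
  elog(DEBUG1, "AllocSet %s: malloc'd block of %zu bytes",
       set->header.name, blksize);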

David

Attachments

On Tue, Feb 21, 2023 at 2:46 AM Andres Freund <andres@anarazel.de> wrote:

> On 2023-02-21 08:33:22 +1300, David Rowley wrote:
> > I am interested in a bump allocator for tuplesort.c. There it would be
> > used in isolation and all the code which would touch pointers
> > allocated by the bump allocator would be self-contained to the
> > tuplesorting code.
> >
> > What use case do you have in mind?
>
> E.g. the whole executor state tree (and likely also the plan tree) should be
> allocated that way. They're never individually freed. But we also allocate
> other things in the same context, and those do need to be individually
> freeable. We could use a separate memory context, but that'd increase memory
> usage in many cases, because there'd be two different blocks being allocated
> from at the same time.

That reminds me of this thread I recently stumbled across about memory management of prepared statements:

https://www.postgresql.org/message-id/20190726004124.prcb55bp43537vyw%40alap3.anarazel.de

I recently heard of a technique for relative pointers that could enable tree structures within a single allocation.

If "a" needs to store the location of "b" relative to "a", it would be calculated like

a = (char *) &b - (char *) &a;

...then to find b again, do

typeof_b *b_ptr;
b_ptr = (typeof_b *) ((char *) &a + a);

One issue with this naive sketch is that zero would make a pointer point to itself, and it would be better if zero still meant "invalid pointer" so that memset(0) does the right thing.

Using signed byte-sized offsets as an example, the range is -128 to 127, so we can call -128 the invalid pointer, or in binary 0b1000_0000.

To interpret a raw zero as invalid, we need an encoding, and here we can just XOR it:

#define Encode(a) ((a) ^ 0x80)      /* 0x80 == 0b1000_0000 */
#define Decode(a) ((a) ^ 0x80)

Then Encode(-128) == 0 and Decode(0) == -128, and memset(0) will do the right thing and that value will be decoded as invalid.

Conversely, this preserves the ability to point to self, if needed:

Encode(0) == -128 and Decode(-128) == 0

...so we can store any relative offset in the range -127..127, as well as "invalid offset". This extends to larger signed integer types in the obvious way.

Putting the above two calculations together, the math ends up like this, which can be put into macros:

absolute to relative:
a = Encode((int32) ((char *) &b - (char *) &a));

relative to absolute:
typeof_b *b_ptr;
b_ptr = (typeof_b *) ((char *) &a + Decode(a));

I'm not yet familiar enough with parse/plan/execute trees to know if this would work or not, but that might be a good thing to look into next cycle.
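
To make that concrete, here's a tiny self-contained demo of the scheme using int32 offsets, where the XOR mask becomes INT32_MIN (the 32-bit analogue of 0b1000_0000):

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define INVALID_OFFSET  INT32_MIN
  #define Encode(off)     ((int32_t) ((off) ^ INT32_MIN))
  #define Decode(off)     ((int32_t) ((off) ^ INT32_MIN))

  typedef struct Node
  {
      int32_t     value;
      int32_t     next;   /* encoded offset to next Node, relative to this field */
  } Node;

  int
  main(void)
  {
      Node        nodes[2];

      memset(nodes, 0, sizeof(nodes));    /* raw 0 now decodes as invalid */
      nodes[0].value = 1;
      nodes[1].value = 2;

      /* absolute to relative */
      nodes[0].next =
          Encode((int32_t) ((char *) &nodes[1] - (char *) &nodes[0].next));

      /* relative to absolute */
      if (Decode(nodes[0].next) != INVALID_OFFSET)
      {
          Node   *b = (Node *) ((char *) &nodes[0].next + Decode(nodes[0].next));

          printf("reached value %d\n", b->value);     /* prints 2 */
      }
      if (Decode(nodes[1].next) == INVALID_OFFSET)
          printf("memset(0) decoded as invalid\n");
      return 0;
  }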

--
John Naylor
EDB: http://www.enterprisedb.com