Thread: sb_alloc: a new memory allocator for PostgreSQL

sb_alloc: a new memory allocator for PostgreSQL

From:
Robert Haas
Date:
Over the last several months, I've been working on a new memory
allocator for PostgreSQL.  While it's not done, and there are
problems, I think I've reached the point where it makes sense to get
this out in front of a wider audience and get some feedback,
preferably of the sort that doesn't do any permanent damage.  I
started out down this path because parallel sort will really be much
happier if it can allocate from either dynamic shared memory or
backend-local memory using the same interface.  Arguably, we could
implement parallel internal sort using some kind of bespoke memory
management strategy, like packing all of the tuples into the memory
segment tightly starting from the end, and growing the tuple array
from the beginning, but then what happens if we decide to give up on
parallelism and do an external sort after all?  Even if we could
engineer our way around that problem, I think there are bound to be
other things that people want to do with dynamic shared memory
segments where being able to easily allocate and free memory is
desirable.

As I got into the problem a little bit further, I realized that
AllocSetAlloc is actually a pretty poor allocator for sorting.  As far
as I can see, the basic goal of aset.c was to make context resets
cheap - which makes sense, because we have a lot of very short-lived
memory contexts.  However, when you're sorting a lot of data, being
able to reset the context quickly is not as important as using memory
efficiently, and AllocSetAlloc is pretty terrible at that, especially
for small allocations.  Each allocation is rounded up to a power of
two, and then we add 16 bytes of overhead on top of that (on 64-bit
systems), which means that palloc has 100% overhead for a large number
of 16-byte allocations, and 200% overhead for a large number of 8-byte
allocations.  The 16-byte overhead doesn't hurt quite as much on big
allocations, but the rounding to a power of two can sometimes be very
bad.  For example, for repeated 96-byte allocations, palloc has 50%
overhead.  I decided I wanted to come up with something better.

I read through the literature and found that most modern allocators
seemed to use superblocks.  A superblock is a chunk of memory, perhaps
64kB, which is carved up into a bunch of chunks of equal size.  Those
chunks don't have individual headers; instead, the pointer address is
used to locate the metadata for the chunk.  Requests are binned into
size classes, which are generally much more fine-grained than the
powers-of-two strategy used by AllocSetAlloc.  Of course, too many
size classes can waste memory if you end up with many partially-full
superblocks, each for a different size class, but the consensus seems
to be that you make that up, and more, by wasting less memory within each
allocation (e.g. a 96 byte allocation can be exactly 96 bytes, rather
than 128 bytes).

I investigated whether there might be an existing allocator we could
use; I did not find anything that looked suitable.  No allocator that
I ran across could handle allocation of shared memory; in fact, pretty
much all of them assumed the existence of an underlying "system
allocator" that they could use to allocate memory for their own
metadata.  Those that included a system allocator (like tcmalloc)
still didn't support anything that looked like dynamic shared memory.
Most of them were also unsuitable for other reasons (e.g. tcmalloc is
written in C++).  I also didn't find anything that looked like our
memory context paradigm, and in particular the ability to cheaply
reset a context, in any other allocator.  So I decided to write my
own; a first cut is attached.  It's not integrated with the
MemoryContext stuff yet, and very likely needs to be.  There are some
functions for which I've added prototypes but not implementations; the
support for allocation from dynamic shared memory segments is just a
stub right now.  And there are various other warts as well which will
need to be cleaned up in due time if this is to go anywhere.

Performance-wise, the raw allocation performance (sb_alloc) is
reasonably good.  On PPC64, it lost out to palloc on 8-byte
allocations but was comparable for larger allocations; on my MacBook,
it was faster across the board.  The raw deallocation performance
(sb_free) is appreciably slower than pfree; that can probably be
optimized somewhat.  It's not really a fair comparison at this point,
though, among other reasons because palloc is jumping through an extra
layer of indirection.  Leaving that problem to one side, I hacked up
index builds to use the new allocator and found that on REINDEX INDEX
pgbench_accounts_pkey at scale factor 100, this uses 22% less memory
than palloc, but runs about 4-7% slower on PPC64.  It may be possible
to eliminate some of that with a tighter integration of the new
allocator and some more fine-tuning.

As you can probably tell, this is all somewhat preliminary and
experimental at this point, but thoughts welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Greg Stark
Date:
On Tue, May 6, 2014 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>  I also didn't find anything that looked like our
> memory context paradigm, and in particular the ability to cheaply
> reset a context, in any other allocator.


You probably knew this but just in case the term for this strategy is
called a "ripcord allocator". I believe GCC had one, not sure if it
still uses it. I doubt any existing ones are especially helpful though
compared to looking at regular allocators and seeing how to integrate
them.

I assume you're also aware of lock-free data structures and the like.
A memory allocator seems like something that needs to be super aware
of causing concurrency bottlenecks since it has no idea where it'll be
used.

-- 
greg



Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Heikki Linnakangas
Date:
On 05/06/2014 04:04 PM, Robert Haas wrote:
> Over the last several months, I've been working on a new memory
> allocator for PostgreSQL.  While it's not done, and there are
> problems, I think I've reached the point where it makes sense to get
> this out in front of a wider audience and get some feedback,
> preferably of the sort that doesn't do any permanent damage.  I
> started out down this path because parallel sort will really be much
> happier if it can allocate from either dynamic shared memory or
> backend-local memory using the same interface.  Arguably, we could
> implement parallel internal sort using some kind of bespoke memory
> management strategy, like packing all of the tuples into the memory
> segment tightly starting from the end, and growing the tuple array
> from the beginning, but then what happens if we decide to give up on
> parallelism and do an external sort after all?  Even if we could
> engineer our way around that problem, I think there are bound to be
> other things that people want to do with dynamic shared memory
> segments where being able to easily allocate and free memory is
> desirable.

As a generic remark, I wish that whatever parallel algorithms we will 
use won't need a lot of ad hoc memory allocations from shared memory. 
Even though we have dynamic shared memory now, complex data structures 
with a lot of pointers and different allocations are more painful to 
debug, tune, and make concurrency-safe. But I have no idea what exactly 
you have in mind, so I'll just have to take your word for it that this is 
sensible.

> As I got into the problem a little bit further, I realized that
> AllocSetAlloc is actually pretty poor allocator for sorting.  As far
> as I can see, the basic goal of aset.c was to make context resets
> cheap - which makes sense, because we have a lot of very short-lived
> memory contexts.  However, when you're sorting a lot of data, being
> able to reset the context quickly is not as important as using memory
> efficiently, and AllocSetAlloc is pretty terrible at that, especially
> for small allocations.  Each allocation is rounded up to a power of
> two, and then we add 16 bytes of overhead on top of that (on 64-bit
> systems), which means that palloc has 100% overhead for a large number
> of 16-byte allocations, and 200% overhead for a large number of 8-byte
> allocations.  The 16-byte overhead doesn't hurt quite as much on big
> allocations, but the rounding to a power of two can sometimes be very
> bad.  For example, for repeated 96-byte allocations, palloc has 50%
> overhead.  I decided I wanted to come up with something better.

Yeah, I saw in some tests that about 50% of the memory used for catalog 
caches was waste caused by rounding up all the allocations to power-of-two.

> I read through the literature and found that most modern allocators
> seemed to use superblocks.  A superblock is a chunk of memory, perhaps
> 64kB, which is carved up into a bunch of chunks of equal size.  Those
> chunks don't have individual headers; instead, the pointer address is
> used to locate the metadata for the chunk.  Requests are binned into
> size classes, which are generally much more fine-grained than the
> powers-of-two strategy used by AllocSetAlloc.  Of course, too many
> size classes can waste memory if you end up with many partially-full
> superblocks, each for a different size class, but the consensus seems
> to be that you make it up and more by wasting less memory within each
> allocation (e.g. a 96 byte allocation can be exactly 96 bytes, rather
> than 128 bytes).

Interesting work.

Another kind of memory allocator that I've played with in the past is a 
simple stack allocator that only supports wholesale release of the whole 
context and pfree() is a no-op. That's dead simple and very efficient 
when you don't need retail pfree(), but obviously cannot completely 
replace the current allocator.

I wouldn't conflate shared memory with this. If a piece of code needs to 
work with either one, I think the way to go is to have some sort of 
wrapper functions that route the calls to either the shared or private 
memory allocator, similar to how the same interface is used to deal with 
local, temporary buffers and shared buffers.

- Heikki



Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Robert Haas
Date:
On Tue, May 6, 2014 at 9:31 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> As a generic remark, I wish that whatever parallel algorithms we will use
> won't need a lot of ad hoc memory allocations from shared memory. Even
> though we have dynamic shared memory now, complex data structures with a lot
> of pointers and different allocations are more painful to debug, tune, and
> make concurrency-safe. But I have no idea what exactly you have in mind, so
> I'll just have to take your word on it that this is sensible.

Yeah, I agree.  Actually, I'm hoping that a lot of what we want to do
can be done using the shm_mq stuff, which uses a messaging paradigm.
If the queue is full, you wait for the consumer to read some data
before writing more.  That is much simpler and avoids a lot of
problems.

There are several problems with using pointers in dynamic shared
memory.  The ones I'm most concerned about are:

1. Segments are relocatable, so you can't actually use absolute
pointers.  Maybe someday we'll have a facility for dynamic shared
memory segments that are mapped at the same address in every process,
or maybe not, but right now we sure don't.

2. You've got to decide up-front how much memory to set aside for
dynamic allocation, and you can't easily change your mind later.  Some
of the DSM implementations support growing the segment, but you've got
to somehow get everyone who is using it to remap it, possibly at a
different address, so it's a long way from being transparent.

But that having been said, pointers are a pretty fundamental data
structure, and I think trying to get rid of them completely isn't
likely to be feasible.  For sorting, you need a big SortTuple array
and then that needs to point to the individual tuples.  I think that's
simple enough to be reasonable, and at any rate I don't see a simpler
approach.

> Yeah, I saw in some tests that about 50% of the memory used for catalog
> caches was waste caused by rounding up all the allocations to power-of-two.

Sadly, I can't see using this allocator for the catalog caches as-is.
The problem is that AllocSetAlloc can start those caches off with a
tiny 1kB allocation.  This allocator is intended to be efficient for
large contexts, so you start off with 4kB of superblock descriptors
and a 64kB chunk for each size class that is in use.  Very reasonable
for multi-megabyte allocations; not so hot for tiny ones.  There may
be a way to serve both needs, but I haven't found it yet.

> I wouldn't conflate shared memory with this. If a piece of code needs to
> work with either one, I think the way to go is to have some sort of wrapper
> functions that route the calls to either the shared or private memory
> allocator, similar to how the same interface is used to deal with local,
> temporary buffers and shared buffers.

Well, that has several disadvantages.  One of them is code
duplication.  This allocator could certainly be a lot simpler if it
only handled shared memory, or for that matter if it only handled
backend-private memory.  But if the right way to do allocation for
sorting is to carve chunks out of superblocks, then it's the right way
regardless of whether you're allocating from shared memory or
backend-private memory.  And if that's the wrong way, then it's wrong
for both.  Using completely different allocators for parallel sort and
non-parallel sort doesn't seem like a great idea to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Robert Haas
Date:
On Tue, May 6, 2014 at 9:25 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, May 6, 2014 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>  I also didn't find anything that looked like our
>> memory context paradigm, and in particular the ability to cheaply
>> reset a context, in any other allocator.
>
> You probably knew this but just in case the term for this strategy is
> called a "ripcord allocator". I believe GCC had one, not sure if it
> still uses it. I doubt any existing ones are especially helpful though
> compared to looking at regular allocators and seeing how to integrate
> them.

Actually, I hadn't heard that term.

> I assume you're also aware of lock-free data structures and the like.
> A memory allocator seems like something that needs to be super aware
> of causing concurrency bottlenecks since it has no idea where it'll be
> used.

Lock-free data structures are cool, but you tend to pay for contention
avoidance with larger constant overheads.  Earlier versions of this
included more provisions for reducing locking bottlenecks than the
current version does; that stuff ended up being surprisingly expensive and
I had to rip it out.  There are a few remaining vestiges that need to
be cleaned up yet.  What I'm thinking is probably cheap enough to live
with - and what the patch roughly contemplates right now - is one
lwlock per size class; so shared allocations can proceed in parallel
if they're for objects of different sizes, but not otherwise.  If we
need an allocator that is better-optimized for concurrency than that,
then I think that's going to need to be a separate effort, and the
result will be something that nobody will want to use for
backend-private allocations, because the overhead in the non-contended
case will be too high.

I don't expect concurrency to be a big problem for the problems I
intend to solve in the medium term.  For parallel internal sort, I
actually only need one backend at a time to be able to allocate from
the shared memory segment; the other backends just move data around,
without actually allocating anything.  So the locking is a non-issue.
Parallel external sort might be different; e.g. you can imagine the
user backend pushing tuples into the DSM and worker backends pulling
them out and writing them to tuplestores, or something like that.
Similarly, you can imagine something like a parallel hash table build
with multiple inserts working at once.  But I don't want to spend too
much time worrying about concurrency for future projects at the
expense of solving problems nearer at hand; one could argue that I've
gone too far in that direction already.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Simon Riggs
Date:
On 6 May 2014 14:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 6, 2014 at 9:31 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> As a generic remark, I wish that whatever parallel algorithms we will use
>> won't need a lot of ad hoc memory allocations from shared memory. Even
>> though we have dynamic shared memory now, complex data structures with a lot
>> of pointers and different allocations are more painful to debug, tune, and
>> make concurrency-safe. But I have no idea what exactly you have in mind, so
>> I'll just have to take your word on it that this is sensible.
>
> Yeah, I agree.  Actually, I'm hoping that a lot of what we want to do
> can be done using the shm_mq stuff, which uses a messaging paradigm.
> If the queue is full, you wait for the consumer to read some data
> before writing more.  That is much simpler and avoids a lot of
> problems.
>
> There are several problems with using pointers in dynamic shared
> memory.  The ones I'm most concerned about are:
>
> 1. Segments are relocatable, so you can't actually use absolute
> pointers.  Maybe someday we'll have a facility for dynamic shared
> memory segments that are mapped at the same address in every process,
> or maybe not, but right now we sure don't.

Sounds like a problem for static allocations, not dynamic ones.

It makes a lot of sense to use dynamic shared memory for sorts
especially, since you can just share the base pointer and other info
and a "blind worker" can then do the sort for you without needing
transactions, snapshots etc..

I'd also like to consider putting common reference tables as hash
tables into shmem.

> 2. You've got to decide up-front how much memory to set aside for
> dynamic allocation, and you can't easily change your mind later.  Some
> of the DSM implementations support growing the segment, but you've got
> to somehow get everyone who is using it to remap it, possibly at a
> different address, so it's a long way from being transparent.

Again, depends on the algorithm. If we program the sort to work in
fixed size chunks, we can then use a merge sort at end to link the
chunks together. So we just use an array of fixed size chunks. We
might need to dynamically add more chunks, but nobody needs to remap.

Doing it that way means we do *not* need to change situation if it
becomes an external sort. We just mix shmem and external files, all
merged together at the end.

We need to take account of the amount of memory locally available per
CPU, so there is a maximum size for these things. Not sure what tho'

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: sb_alloc: a new memory allocator for PostgreSQL

From:
Robert Haas
Date:
On Tue, May 6, 2014 at 10:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> 1. Segments are relocatable, so you can't actually use absolute
>> pointers.  Maybe someday we'll have a facility for dynamic shared
>> memory segments that are mapped at the same address in every process,
>> or maybe not, but right now we sure don't.
>
> Sounds like a problem for static allocations, not dynamic ones.

Maybe I didn't say that well: the dynamic shared memory segment isn't
guaranteed to be mapped at the same address in every backend.  So if
you use absolute pointers in your data structures, they might be
incorrect from the point of view of some other backend accessing the
same shared memory segment.  This is true regardless of how the
allocation is done; it's an artifact of the decision not to try to map
dynamic shared memory segments at a constant address in every process
using them.

> It makes a lot of sense to use dynamic shared memory for sorts
> especially, since you can just share the base pointer and other info
> and a "blind worker" can then do the sort for you without needing
> transactions, snapshots etc..

That would be nice, but I think we're actually going to need a lot of
that stuff to make parallel sort work, because the workers need to
have enough of an environment set up to run the comparison functions, and
those may try to de-TOAST.

> I'd also like to consider putting common reference tables as hash
> tables into shmem.

I'm not sure exactly what you have in mind here, but some kind of hash
table facility for dynamic memory is on my bucket list.

>> 2. You've got to decide up-front how much memory to set aside for
>> dynamic allocation, and you can't easily change your mind later.  Some
>> of the DSM implementations support growing the segment, but you've got
>> to somehow get everyone who is using it to remap it, possibly at a
>> different address, so it's a long way from being transparent.
>
> Again, depends on the algorithm. If we program the sort to work in
> fixed size chunks, we can then use a merge sort at end to link the
> chunks together. So we just use an array of fixed size chunks. We
> might need to dynamically add more chunks, but nobody needs to remap.

This is a possible approach, but since each segment might end up in a
different spot in each process that maps it, a pointer now needs to
consist of a segment identifier and an offset within that segment.

> Doing it that way means we do *not* need to change situation if it
> becomes an external sort. We just mix shmem and external files, all
> merged together at the end.

Huh.  Interesting idea.  I haven't researched external sort much yet.

> We need to take account of the amount of memory locally available per
> CPU, so there is a maximum size for these things. Not sure what tho'

I agree that NUMA effects could be significant, but so far I don't
know enough to have any idea what, if anything, makes sense to do in
that area.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company