Discussion: sb_alloc: a new memory allocator for PostgreSQL
Over the last several months, I've been working on a new memory allocator for PostgreSQL. While it's not done, and there are problems, I think I've reached the point where it makes sense to get this out in front of a wider audience and get some feedback, preferably of the sort that doesn't do any permanent damage.

I started out down this path because parallel sort will really be much happier if it can allocate from either dynamic shared memory or backend-local memory using the same interface. Arguably, we could implement parallel internal sort using some kind of bespoke memory management strategy, like packing all of the tuples into the memory segment tightly starting from the end, and growing the tuple array from the beginning, but then what happens if we decide to give up on parallelism and do an external sort after all? Even if we could engineer our way around that problem, I think there are bound to be other things that people want to do with dynamic shared memory segments where being able to easily allocate and free memory is desirable.

As I got into the problem a little bit further, I realized that AllocSetAlloc is actually a pretty poor allocator for sorting. As far as I can see, the basic goal of aset.c was to make context resets cheap - which makes sense, because we have a lot of very short-lived memory contexts. However, when you're sorting a lot of data, being able to reset the context quickly is not as important as using memory efficiently, and AllocSetAlloc is pretty terrible at that, especially for small allocations. Each allocation is rounded up to a power of two, and then we add 16 bytes of overhead on top of that (on 64-bit systems), which means that palloc has 100% overhead for a large number of 16-byte allocations, and 200% overhead for a large number of 8-byte allocations. The 16-byte overhead doesn't hurt quite as much on big allocations, but the rounding to a power of two can sometimes be very bad. For example, for repeated 96-byte allocations, palloc has 50% overhead. I decided I wanted to come up with something better.

I read through the literature and found that most modern allocators seemed to use superblocks. A superblock is a chunk of memory, perhaps 64kB, which is carved up into a bunch of chunks of equal size. Those chunks don't have individual headers; instead, the pointer address is used to locate the metadata for the chunk. Requests are binned into size classes, which are generally much more fine-grained than the powers-of-two strategy used by AllocSetAlloc. Of course, too many size classes can waste memory if you end up with many partially-full superblocks, each for a different size class, but the consensus seems to be that you make it up, and more, by wasting less memory within each allocation (e.g. a 96-byte allocation can be exactly 96 bytes, rather than 128 bytes).

I investigated whether there might be an existing allocator we could use; I did not find anything that looked suitable. No allocator that I ran across could handle allocation of shared memory; in fact, pretty much all of them assumed the existence of an underlying "system allocator" that they could use to allocate memory for their own metadata. Those that included a system allocator (like tcmalloc) still didn't support anything that looked like dynamic shared memory. Most of them were also unsuitable for other reasons (e.g. tcmalloc is written in C++).
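As an aside, the overhead arithmetic above is easy to reproduce. The following is only a sketch of the rounding rule described earlier; the helper names are invented, not aset.c's actual code:

    #include <stdio.h>

    /*
     * Sketch of the overhead arithmetic described above: a request is
     * rounded up to a power of two, and a 16-byte chunk header is added
     * (on 64-bit systems).
     */
    #define CHUNK_HEADER_SIZE 16

    static size_t
    round_up_pow2(size_t request)
    {
        size_t size = 8;            /* assumed smallest chunk size */

        while (size < request)
            size <<= 1;
        return size;
    }

    int
    main(void)
    {
        size_t requests[] = {8, 16, 96};

        for (int i = 0; i < 3; i++)
        {
            size_t req = requests[i];
            size_t total = round_up_pow2(req) + CHUNK_HEADER_SIZE;

            /* prints: 8 -> 24 (200%), 16 -> 32 (100%), 96 -> 144 (50%) */
            printf("%zu-byte request consumes %zu bytes (%.0f%% overhead)\n",
                   req, total, 100.0 * (total - req) / req);
        }
        return 0;
    }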
I also didn't find anything that looked like our memory context paradigm, and in particular the ability to cheaply reset a context, in any other allocator. So I decided to write my own; a first cut is attached. It's not integrated with the MemoryContext stuff yet, and very likely needs to be. There are some functions for which I've added prototypes but not implementations; the support for allocation from dynamic shared memory segments is just a stub right now. And there are various other warts as well which will need to be cleaned up in due time if this is to go anywhere.

Performance-wise, the raw allocation performance (sb_alloc) is reasonably good. On PPC64, it lost out to palloc on 8-byte allocations but was comparable for larger allocations; on my MacBook, it was faster across the board. The raw deallocation performance (sb_free) is appreciably slower than pfree; that can probably be optimized somewhat. It's not really a fair comparison at this point, though, among other reasons because palloc is jumping through an extra layer of indirection.

Leaving that problem to one side, I hacked up index builds to use the new allocator and found that REINDEX INDEX pgbench_accounts_pkey at scale factor 100 uses 22% less memory than with palloc, but runs about 4-7% slower on PPC64. It may be possible to eliminate some of that with a tighter integration of the new allocator and some more fine-tuning.

As you can probably tell, this is all somewhat preliminary and experimental at this point, but thoughts welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
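To make the superblock scheme concrete, here is a minimal illustrative sketch - not the code from the attached patch - of carving an aligned superblock into equal-size chunks and finding a chunk's metadata from its address alone:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /*
     * A 64kB superblock, aligned to its own size, is carved into
     * equal-size chunks for one size class.  Chunks carry no headers;
     * metadata is found by masking a chunk's address down to the
     * superblock boundary.
     */
    #define SUPERBLOCK_SIZE (64 * 1024)

    typedef struct Superblock
    {
        uint32_t    chunk_size;     /* size class served by this superblock */
        uint32_t    nfree;          /* chunks currently on the free list */
        char       *freelist;       /* free list threaded through the chunks */
        char        chunks[];       /* the chunks themselves */
    } Superblock;

    static Superblock *
    superblock_create(uint32_t chunk_size)   /* must be >= sizeof(char *) */
    {
        Superblock *sb = aligned_alloc(SUPERBLOCK_SIZE, SUPERBLOCK_SIZE);
        uint32_t    nchunks;

        if (sb == NULL)
            return NULL;
        nchunks = (SUPERBLOCK_SIZE - offsetof(Superblock, chunks)) / chunk_size;
        sb->chunk_size = chunk_size;
        sb->nfree = nchunks;
        sb->freelist = NULL;
        for (uint32_t i = 0; i < nchunks; i++)
        {
            char *chunk = sb->chunks + (size_t) i * chunk_size;

            *(char **) chunk = sb->freelist;    /* thread the free list */
            sb->freelist = chunk;
        }
        return sb;
    }

    /* No per-chunk header: recover the superblock from the pointer address. */
    static Superblock *
    superblock_of(void *ptr)
    {
        return (Superblock *) ((uintptr_t) ptr & ~((uintptr_t) SUPERBLOCK_SIZE - 1));
    }

    static void *
    chunk_alloc(Superblock *sb)
    {
        char *chunk = sb->freelist;

        if (chunk == NULL)
            return NULL;        /* superblock exhausted; caller takes another */
        sb->freelist = *(char **) chunk;
        sb->nfree--;
        return chunk;
    }

    static void
    chunk_free(void *ptr)
    {
        Superblock *sb = superblock_of(ptr);

        *(char **) ptr = sb->freelist;
        sb->freelist = (char *) ptr;
        sb->nfree++;
    }

With 96-byte chunks, a 96-byte request consumes exactly 96 bytes, which is where the savings over power-of-two rounding come from.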
On Tue, May 6, 2014 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I also didn't find anything that looked like our memory context paradigm, and in particular the ability to cheaply reset a context, in any other allocator.

You probably knew this but just in case, the term for this strategy is a "ripcord allocator". I believe GCC had one; I'm not sure if it still uses it. I doubt any existing ones are especially helpful, though, compared to looking at regular allocators and seeing how to integrate them.

I assume you're also aware of lock-free data structures and the like. A memory allocator seems like something that needs to be super aware of causing concurrency bottlenecks, since it has no idea where it'll be used.

--
greg
On 05/06/2014 04:04 PM, Robert Haas wrote:
> Over the last several months, I've been working on a new memory allocator for PostgreSQL. While it's not done, and there are problems, I think I've reached the point where it makes sense to get this out in front of a wider audience and get some feedback, preferably of the sort that doesn't do any permanent damage. I started out down this path because parallel sort will really be much happier if it can allocate from either dynamic shared memory or backend-local memory using the same interface. Arguably, we could implement parallel internal sort using some kind of bespoke memory management strategy, like packing all of the tuples into the memory segment tightly starting from the end, and growing the tuple array from the beginning, but then what happens if we decide to give up on parallelism and do an external sort after all? Even if we could engineer our way around that problem, I think there are bound to be other things that people want to do with dynamic shared memory segments where being able to easily allocate and free memory is desirable.

As a generic remark, I wish that whatever parallel algorithms we will use won't need a lot of ad hoc memory allocations from shared memory. Even though we have dynamic shared memory now, complex data structures with a lot of pointers and different allocations are more painful to debug, tune, and make concurrency-safe. But I have no idea what exactly you have in mind, so I'll just have to take your word on it that this is sensible.

> As I got into the problem a little bit further, I realized that AllocSetAlloc is actually a pretty poor allocator for sorting. As far as I can see, the basic goal of aset.c was to make context resets cheap - which makes sense, because we have a lot of very short-lived memory contexts. However, when you're sorting a lot of data, being able to reset the context quickly is not as important as using memory efficiently, and AllocSetAlloc is pretty terrible at that, especially for small allocations. Each allocation is rounded up to a power of two, and then we add 16 bytes of overhead on top of that (on 64-bit systems), which means that palloc has 100% overhead for a large number of 16-byte allocations, and 200% overhead for a large number of 8-byte allocations. The 16-byte overhead doesn't hurt quite as much on big allocations, but the rounding to a power of two can sometimes be very bad. For example, for repeated 96-byte allocations, palloc has 50% overhead. I decided I wanted to come up with something better.

Yeah, I saw in some tests that about 50% of the memory used for catalog caches was waste caused by rounding up all the allocations to powers of two.

> I read through the literature and found that most modern allocators seemed to use superblocks. A superblock is a chunk of memory, perhaps 64kB, which is carved up into a bunch of chunks of equal size. Those chunks don't have individual headers; instead, the pointer address is used to locate the metadata for the chunk. Requests are binned into size classes, which are generally much more fine-grained than the powers-of-two strategy used by AllocSetAlloc. Of course, too many size classes can waste memory if you end up with many partially-full superblocks, each for a different size class, but the consensus seems to be that you make it up, and more, by wasting less memory within each allocation (e.g. a 96-byte allocation can be exactly 96 bytes, rather than 128 bytes).

Interesting work. Another kind of memory allocator that I've played with in the past is a simple stack allocator that only supports wholesale release of the whole context, and where pfree() is a no-op. That's dead simple and very efficient when you don't need retail pfree(), but it obviously cannot completely replace the current allocator.

I wouldn't conflate shared memory with this. If a piece of code needs to work with either one, I think the way to go is to have some sort of wrapper functions that route the calls to either the shared or private memory allocator, similar to how the same interface is used to deal with local, temporary buffers and shared buffers.

- Heikki
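The stack allocator Heikki describes can be sketched in a few lines. Everything below is illustrative (invented names, a fixed-size arena, error handling omitted); a real version would grow by chaining blocks:

    #include <stdlib.h>

    /*
     * Allocation bumps a pointer within one block, pfree() is a no-op,
     * and the only way to release memory is to reset the whole context.
     */
    typedef struct StackContext
    {
        char   *base;           /* start of the arena */
        size_t  used;           /* bytes handed out so far */
        size_t  capacity;       /* total arena size */
    } StackContext;

    static StackContext *
    stack_context_create(size_t capacity)
    {
        StackContext *cxt = malloc(sizeof(StackContext));

        cxt->base = malloc(capacity);
        cxt->used = 0;
        cxt->capacity = capacity;
        return cxt;
    }

    static void *
    stack_alloc(StackContext *cxt, size_t size)
    {
        void *result;

        size = (size + 7) & ~(size_t) 7;    /* keep allocations 8-byte aligned */
        if (cxt->used + size > cxt->capacity)
            return NULL;                    /* a real version would add a block */
        result = cxt->base + cxt->used;
        cxt->used += size;
        return result;
    }

    static void
    stack_free(void *ptr)
    {
        (void) ptr;             /* intentionally a no-op */
    }

    static void
    stack_context_reset(StackContext *cxt)
    {
        cxt->used = 0;          /* wholesale release: every allocation is gone */
    }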
On Tue, May 6, 2014 at 9:31 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> As a generic remark, I wish that whatever parallel algorithms we will use won't need a lot of ad hoc memory allocations from shared memory. Even though we have dynamic shared memory now, complex data structures with a lot of pointers and different allocations are more painful to debug, tune, and make concurrency-safe. But I have no idea what exactly you have in mind, so I'll just have to take your word on it that this is sensible.

Yeah, I agree. Actually, I'm hoping that a lot of what we want to do can be done using the shm_mq stuff, which uses a messaging paradigm. If the queue is full, you wait for the consumer to read some data before writing more. That is much simpler and avoids a lot of problems.

There are several problems with using pointers in dynamic shared memory. The ones I'm most concerned about are:

1. Segments are relocatable, so you can't actually use absolute pointers. Maybe someday we'll have a facility for dynamic shared memory segments that are mapped at the same address in every process, or maybe not, but right now we sure don't.

2. You've got to decide up-front how much memory to set aside for dynamic allocation, and you can't easily change your mind later. Some of the DSM implementations support growing the segment, but you've got to somehow get everyone who is using it to remap it, possibly at a different address, so it's a long way from being transparent.

But that having been said, pointers are a pretty fundamental data structure, and I think trying to get rid of them completely isn't likely to be feasible. For sorting, you need a big SortTuple array, and that needs to point to the individual tuples. I think that's simple enough to be reasonable, and at any rate I don't see a simpler approach.

> Yeah, I saw in some tests that about 50% of the memory used for catalog caches was waste caused by rounding up all the allocations to powers of two.

Sadly, I can't see using this allocator for the catalog caches as-is. The problem is that AllocSetAlloc can start those caches off with a tiny 1kB allocation. This allocator is intended to be efficient for large contexts, so you start off with 4kB of superblock descriptors and a 64kB chunk for each size class that is in use. Very reasonable for multi-megabyte allocations; not so hot for tiny ones. There may be a way to serve both needs, but I haven't found it yet.

> I wouldn't conflate shared memory with this. If a piece of code needs to work with either one, I think the way to go is to have some sort of wrapper functions that route the calls to either the shared or private memory allocator, similar to how the same interface is used to deal with local, temporary buffers and shared buffers.

Well, that has several disadvantages. One of them is code duplication. This allocator could certainly be a lot simpler if it only handled shared memory, or for that matter if it only handled backend-private memory. But if the right way to do allocation for sorting is to carve chunks out of superblocks, then it's the right way regardless of whether you're allocating from shared memory or backend-private memory. And if that's the wrong way, then it's wrong for both. Using completely different allocators for parallel sort and non-parallel sort doesn't seem like a great idea to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
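For what it's worth, the usual workaround for problem #1 is to store offsets from the segment base rather than raw pointers, translating at the point of use against each backend's own mapping address. A sketch, illustrative rather than from the patch, which assumes offset 0 is never a valid allocation (a segment header is presumed to live there) so it can double as NULL:

    #include <stdint.h>

    typedef uint64_t relptr;    /* offset from the segment base; 0 means NULL */

    static relptr
    relptr_store(char *segment_base, void *ptr)
    {
        return (ptr == NULL) ? 0 : (relptr) ((char *) ptr - segment_base);
    }

    static void *
    relptr_access(char *segment_base, relptr rp)
    {
        return (rp == 0) ? NULL : (void *) (segment_base + rp);
    }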
On Tue, May 6, 2014 at 9:25 AM, Greg Stark <stark@mit.edu> wrote:
> On Tue, May 6, 2014 at 2:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I also didn't find anything that looked like our memory context paradigm, and in particular the ability to cheaply reset a context, in any other allocator.
>
> You probably knew this but just in case, the term for this strategy is a "ripcord allocator". I believe GCC had one; I'm not sure if it still uses it. I doubt any existing ones are especially helpful, though, compared to looking at regular allocators and seeing how to integrate them.

Actually, I hadn't heard that term.

> I assume you're also aware of lock-free data structures and the like. A memory allocator seems like something that needs to be super aware of causing concurrency bottlenecks, since it has no idea where it'll be used.

Lock-free data structures are cool, but you tend to pay for contention avoidance with larger constant overheads. Earlier versions of this included more provisions for reducing locking bottlenecks than the current version; that stuff ended up being surprisingly expensive and I had to rip it out. There are a few remaining vestiges that need to be cleaned up yet.

What I'm thinking is probably cheap enough to live with - and what the patch roughly contemplates right now - is one lwlock per size class; so shared allocations can proceed in parallel if they're for objects of different sizes, but not otherwise. If we need an allocator that is better optimized for concurrency than that, then I think that's going to need to be a separate effort, and the result will be something that nobody will want to use for backend-private allocations, because the overhead in the non-contended case will be too high.

I don't expect concurrency to be a big problem for the problems I intend to solve in the medium term. For parallel internal sort, I actually only need one backend at a time to be able to allocate from the shared memory segment; the other backends just move data around, without actually allocating anything. So the locking is a non-issue. Parallel external sort might be different; e.g. you can imagine the user backend pushing tuples into the DSM and worker backends pulling them out and writing them to tuplestores, or something like that. Similarly, you can imagine something like a parallel hash table build with multiple inserters working at once. But I don't want to spend too much time worrying about concurrency for future projects at the expense of solving problems nearer at hand; one could argue that I've gone too far in that direction already.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
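A sketch of what one-lock-per-size-class looks like in practice. Pthread mutexes stand in for lwlocks here, the superblock bookkeeping is elided, and all names are invented for illustration:

    #include <pthread.h>
    #include <stddef.h>

    /*
     * Allocations for different size classes take different locks and so
     * can proceed in parallel; only same-size allocations contend.
     */
    #define NUM_SIZE_CLASSES 32

    typedef struct SizeClassHeap
    {
        pthread_mutex_t lock;
        void           *superblocks;    /* superblocks for this class (elided) */
    } SizeClassHeap;

    static SizeClassHeap size_classes[NUM_SIZE_CLASSES];

    static void
    size_classes_init(void)
    {
        for (int i = 0; i < NUM_SIZE_CLASSES; i++)
            pthread_mutex_init(&size_classes[i].lock, NULL);
    }

    static void *
    take_chunk(SizeClassHeap *heap)
    {
        (void) heap;
        return NULL;            /* superblock search elided in this sketch */
    }

    static void *
    shared_alloc(int size_class)
    {
        SizeClassHeap *heap = &size_classes[size_class];
        void *chunk;

        pthread_mutex_lock(&heap->lock);    /* blocks only same-class callers */
        chunk = take_chunk(heap);
        pthread_mutex_unlock(&heap->lock);
        return chunk;
    }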
On 6 May 2014 14:49, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 6, 2014 at 9:31 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
>> As a generic remark, I wish that whatever parallel algorithms we will use won't need a lot of ad hoc memory allocations from shared memory. Even though we have dynamic shared memory now, complex data structures with a lot of pointers and different allocations are more painful to debug, tune, and make concurrency-safe. But I have no idea what exactly you have in mind, so I'll just have to take your word on it that this is sensible.
>
> Yeah, I agree. Actually, I'm hoping that a lot of what we want to do can be done using the shm_mq stuff, which uses a messaging paradigm. If the queue is full, you wait for the consumer to read some data before writing more. That is much simpler and avoids a lot of problems.
>
> There are several problems with using pointers in dynamic shared memory. The ones I'm most concerned about are:
>
> 1. Segments are relocatable, so you can't actually use absolute pointers. Maybe someday we'll have a facility for dynamic shared memory segments that are mapped at the same address in every process, or maybe not, but right now we sure don't.

Sounds like a problem for static allocations, not dynamic ones.

It makes a lot of sense to use dynamic shared memory for sorts especially, since you can just share the base pointer and other info, and a "blind worker" can then do the sort for you without needing transactions, snapshots, etc.

I'd also like to consider putting common reference tables as hash tables into shmem.

> 2. You've got to decide up-front how much memory to set aside for dynamic allocation, and you can't easily change your mind later. Some of the DSM implementations support growing the segment, but you've got to somehow get everyone who is using it to remap it, possibly at a different address, so it's a long way from being transparent.

Again, depends on the algorithm. If we program the sort to work in fixed-size chunks, we can then use a merge sort at the end to link the chunks together. So we just use an array of fixed-size chunks. We might need to dynamically add more chunks, but nobody needs to remap.

Doing it that way means we do *not* need to change the situation if it becomes an external sort. We just mix shmem and external files, all merged together at the end.

We need to take account of the amount of memory locally available per CPU, so there is a maximum size for these things. Not sure what, though.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
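A sketch of the final merge step Simon describes, once each fixed-size chunk has been sorted in place. This is a toy k-way merge over int arrays; a real version would merge SortTuples and use a binary heap to pick the minimum rather than a linear scan:

    #include <stddef.h>

    typedef struct Chunk
    {
        const int  *values;     /* sorted values within this chunk */
        size_t      count;      /* number of values */
        size_t      pos;        /* merge cursor */
    } Chunk;

    static void
    merge_chunks(Chunk *chunks, int nchunks, int *out)
    {
        size_t written = 0;

        for (;;)
        {
            int best = -1;

            /* Find the chunk with the smallest unconsumed head value. */
            for (int i = 0; i < nchunks; i++)
            {
                if (chunks[i].pos < chunks[i].count &&
                    (best < 0 ||
                     chunks[i].values[chunks[i].pos] <
                     chunks[best].values[chunks[best].pos]))
                    best = i;
            }
            if (best < 0)
                break;          /* every chunk exhausted */
            out[written++] = chunks[best].values[chunks[best].pos++];
        }
    }

Nothing here moves or remaps the chunks themselves, which is the point: chunks can live in shmem or be spilled to external files and merged the same way.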
On Tue, May 6, 2014 at 10:40 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> 1. Segments are relocatable, so you can't actually use absolute pointers. Maybe someday we'll have a facility for dynamic shared memory segments that are mapped at the same address in every process, or maybe not, but right now we sure don't.
>
> Sounds like a problem for static allocations, not dynamic ones.

Maybe I didn't say that well: the dynamic shared memory segment isn't guaranteed to be mapped at the same address in every backend. So if you use absolute pointers in your data structures, they might be incorrect from the point of view of some other backend accessing the same shared memory segment. This is true regardless of how the allocation is done; it's an artifact of the decision not to try to map dynamic shared memory segments at a constant address in every process using them.

> It makes a lot of sense to use dynamic shared memory for sorts especially, since you can just share the base pointer and other info, and a "blind worker" can then do the sort for you without needing transactions, snapshots, etc.

That would be nice, but I think we're actually going to need a lot of that stuff to make parallel sort work, because the workers need to have enough of an environment set up to run the comparison functions, and those may try to de-TOAST.

> I'd also like to consider putting common reference tables as hash tables into shmem.

I'm not sure exactly what you have in mind here, but some kind of hash table facility for dynamic shared memory is on my bucket list.

>> 2. You've got to decide up-front how much memory to set aside for dynamic allocation, and you can't easily change your mind later. Some of the DSM implementations support growing the segment, but you've got to somehow get everyone who is using it to remap it, possibly at a different address, so it's a long way from being transparent.
>
> Again, depends on the algorithm. If we program the sort to work in fixed-size chunks, we can then use a merge sort at the end to link the chunks together. So we just use an array of fixed-size chunks. We might need to dynamically add more chunks, but nobody needs to remap.

This is a possible approach, but since each segment might end up in a different spot in each process that maps it, a pointer now needs to consist of a segment identifier and an offset within that segment.

> Doing it that way means we do *not* need to change the situation if it becomes an external sort. We just mix shmem and external files, all merged together at the end.

Huh. Interesting idea. I haven't researched external sort much yet.

> We need to take account of the amount of memory locally available per CPU, so there is a maximum size for these things. Not sure what, though.

I agree that NUMA effects could be significant, but so far I don't know enough to have any idea what, if anything, makes sense to do in that area.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
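Extending the segment-relative offsets sketched upthread to the multi-segment case Robert describes, a two-part pointer might look like this. Illustrative only; the patch defines no such type, and the names are invented:

    #include <stdint.h>
    #include <stddef.h>

    /*
     * A segment identifier plus an offset, decoded against this backend's
     * own table of mapping addresses.
     */
    #define MAX_SEGMENTS 64

    typedef struct dsm_pointer
    {
        uint32_t    segment;    /* which dynamic shared memory segment */
        uint32_t    offset;     /* byte offset within that segment */
    } dsm_pointer;

    /* Per-backend: where each segment happens to be mapped in this process. */
    static char *segment_base[MAX_SEGMENTS];

    static void *
    dsm_pointer_decode(dsm_pointer dp)
    {
        char *base = segment_base[dp.segment];

        return (base == NULL) ? NULL : (void *) (base + dp.offset);
    }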