Discussion: rethinking dense_alloc (HashJoin) as a memory context
Hi,

In the thread [1] dealing with the hashjoin bug introduced in 9.5, Tom voiced his dislike of dense_alloc. I kinda agree with him that introducing "local allocators" may not be the best idea, and as dense_alloc was introduced by me I was playing with the idea to wrap this into a regular memory context, perhaps with some restrictions (e.g. no pfree). But I'm having trouble with that approach ...

Let me quickly explain the idea behind dense_alloc. When building the tuple hash table in hash join, we simply allocate a large chunk of memory using palloc (~32kB), and then store the tuples into the chunk on our own, without calling palloc for each tuple. Each tuple already has its length in the header, so we don't need a chunk header. Also, we don't do the 2^k chunk sizes and instead store the tuples densely. This means we can't do repalloc or pfree on the tuples, but that's fine. We never did repalloc in hashjoin anyway, and pfree is only needed when increasing the number of batches. But with dense_alloc we can simply walk through the tuples as stored in the allocated chunks, which has the nice benefit that the walk is sequential, making memory prefetching more efficient than with the old code (walking through buckets). Also, no freelists and such.

So dense_alloc has several benefits:

(a) memory reduction thanks to eliminating StandardChunkHeader (which is 16B, and quite noticeable for narrow tuples)

(b) memory reduction thanks to packing the tuples densely (not leaving free space in each chunk)

(c) improved efficiency thanks to sequential memory accesses (compared to the random accesses caused by access through buckets)

Per the measurements done in thread [2], (a) and (b) may reduce memory requirements by 50% in some cases.
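The scheme described above can be sketched roughly as follows (illustrative names only, not the actual nodeHash.c code; a real version would MAXALIGN each request and handle tuples larger than a chunk):

```c
#include <stddef.h>
#include <stdlib.h>

#define DENSE_CHUNK_SIZE (32 * 1024)    /* ~32kB chunks, as described */

typedef struct DenseChunk
{
    struct DenseChunk *next;    /* chunks form a simple linked list,
                                   so the tuples can be walked sequentially */
    size_t used;                /* bytes consumed so far in this chunk */
    char   data[];              /* tuples packed back to back */
} DenseChunk;

typedef struct
{
    DenseChunk *chunks;         /* head of the chunk list */
} DenseAllocator;

/*
 * Hand out len bytes with no per-tuple chunk header and no 2^k
 * rounding; because tuples are packed densely, pfree/repalloc are
 * unsupported by design.
 */
static void *
dense_alloc(DenseAllocator *a, size_t len)
{
    DenseChunk *c = a->chunks;
    void       *p;

    /* start a new chunk when the current one can't fit the tuple */
    if (c == NULL || c->used + len > DENSE_CHUNK_SIZE)
    {
        c = malloc(offsetof(DenseChunk, data) + DENSE_CHUNK_SIZE);
        c->next = a->chunks;
        c->used = 0;
        a->chunks = c;
    }

    /* just bump the offset: no header, no freelist */
    p = c->data + c->used;
    c->used += len;
    return p;
}
```

Walking `a->chunks` and scanning `data` up to `used` then visits the tuples in (reverse chunk) allocation order, which is where the sequential-access benefit (c) comes from.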
I also vaguely remember doing benchmarks for (c) and seeing measurable improvements, but I don't see the numbers in the thread, so either it was posted somewhere else or not at all :-/

Anyway, I'm explaining this because I think it's important that the new reworked code achieves the same benefits. But when trying to implement it as a special memory context, I quickly ran into the requirement that each chunk has a chunk header [3], which would prevent (a). I know it was proposed to only include the chunk header when compiled with asserts, but I don't like the idea of having reusable code that depends on that (and fails with a segfault without it). If I have to choose between a memory context that is essentially meant to be reused, but likely to fail unexpectedly in non-assert builds, and a special local allocator isolated to a single node, I choose the latter.

Perhaps I'd see this differently had there been other places that could use the new memory context, but I can't think of one.

Opinions?

[1] https://www.postgresql.org/message-id/flat/7034.1454722453%40sss.pgh.pa.us
[2] https://www.postgresql.org/message-id/flat/53B4A03F.3070409%40fuzzy.cz
[3] https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L49

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
> his dislike of dense_alloc. I kinda agree with him that introducing "local
> allocators" may not be the best idea, and as dense_alloc was introduced by
> me I was playing with the idea to wrap this into a regular memory context,
> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
> that approach ...

I think that the "no pfree()" restriction would be necessary to get the same benefit. But doesn't that undermine the whole idea of making it a memory context?

In my view, these "local allocators" are not so bad. They're a bit ugly, but that seems to be worth it so far, and I don't think that there is that much incidental complexity that could be encapsulated. For a few modules, including tuplesort.c, the hash join code, tuplestore.c, and possibly a couple of others, having precise control over memory just seems like a good idea to me (i.e. doling it out from some initial large batch palloc() allocations according to some considerations about the relevant data structures, leaving a cache-friendly layout).

I suspect that there are not that many places where it is worth it to even contemplate batch or dense allocators, so I doubt that we will see all that many more instances of "local allocators".

--
Peter Geoghegan
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
>> his dislike of dense_alloc. I kinda agree with him that introducing "local
>> allocators" may not be the best idea, and as dense_alloc was introduced by
>> me I was playing with the idea to wrap this into a regular memory context,
>> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
>> that approach ...

> I think that the "no pfree()" restriction would be necessary to get
> the same benefit. But, doesn't that undermine the whole idea of making
> it a memory context?

The other thing that doesn't seem to square at all with a general-purpose memory context is the desire to walk through the stored tuples directly, knowing that they are adjacent. That means nothing else can be allocated via the same mechanism. So I tend to agree that if we accept Tomas' three requirements as non-negotiable, then trying to make the allocator match the MemoryContext API is probably impractical.

My feeling at this point is that we should leave it alone until/unless we see similar requirements elsewhere, and then look to see if we can derive a common abstraction. I always find that it's easier to design APIs based on concrete use-cases than on guesses about what will be needed.

I wonder though if we don't already have another similar use-case in the ad-hoc "slab allocators" in reorderbuffer.c. We already know that that code has performance issues, cf bug #14231, so I suspect there's a redesign in its future anyway.

			regards, tom lane
On 2016-07-13 13:37:55 -0400, Tom Lane wrote:
> I wonder though if we don't already have another similar use-case in
> the ad-hoc "slab allocators" in reorderbuffer.c.

That seems to call more for a general slab allocator design than for something like here. After all, there's plenty of freeing there.

> We already know that
> that code has performance issues, cf bug #14231, so I suspect there's
> a redesign in its future anyway.

Note that it's not the slab allocators that are having problems, it's aset.c, iterating through all allocated blocks.

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-13 13:37:55 -0400, Tom Lane wrote:
>> We already know that
>> that code has performance issues, cf bug #14231, so I suspect there's
>> a redesign in its future anyway.

> Note that it's not the slab allocators that are having problems, it's
> aset.c, iterating through all allocated blocks.

Well, my point is that that code is using allocation patterns that aset.c isn't very well suited for. The slab allocator logic attempts to paper over one part of that impedance mismatch, but we're seeing there's more. I haven't looked at it closely enough to have a solution proposal.

			regards, tom lane
On 07/13/2016 07:37 PM, Tom Lane wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
>>> his dislike of dense_alloc. I kinda agree with him that introducing "local
>>> allocators" may not be the best idea, and as dense_alloc was introduced by
>>> me I was playing with the idea to wrap this into a regular memory context,
>>> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
>>> that approach ...
>
>> I think that the "no pfree()" restriction would be necessary to get
>> the same benefit. But, doesn't that undermine the whole idea of making
>> it a memory context?
>
> The other thing that doesn't seem to square at all with a general-purpose
> memory context is the desire to walk through the stored tuples directly,
> knowing that they are adjacent. That means nothing else can be allocated
> via the same mechanism. So I tend to agree that if we accept Tomas' three
> requirements as non-negotiable, then trying to make the allocator match
> the MemoryContext API is probably impractical.
>
> My feeling at this point is that we should leave it alone until/unless
> we see similar requirements elsewhere, and then look to see if we can
> derive a common abstraction. I always find that it's easier to design
> APIs based on concrete use-cases than on guesses about what will be
> needed.

I agree with both points. I think the MemoryContext API may not be the right abstraction for this. Given a hammer big enough it would probably work in the end, but it'd probably require changes to the public MemoryContext API (e.g. relaxing the StandardChunkHeader requirement). And that seems a bit too risky. So we probably need a new independent abstraction for this, but doing that based on a single use case is a bit silly.
> I wonder though if we don't already have another similar use-case in
> the ad-hoc "slab allocators" in reorderbuffer.c. We already know that
> that code has performance issues, cf bug #14231, so I suspect there's
> a redesign in its future anyway.

I'm not sure - I'm not familiar with reorderbuffer.c, but it seems to do a fair number of pfrees and such. Also, the pfrees seem to be the root of the performance issue. I suspect the slab allocator (or rather the allocation strategy in general) may need rethinking, but let's discuss that in that thread.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I think the MemoryContext API may not be the right abstraction for this.

+1. The MemoryContext API is little short of an absolute bar to implementing an allocator that works significantly differently than aset.c, and that's a shame, because aset.c is great for little tiny contexts that don't live very long and sucks for everything else. In particular, it sucks for anything that does a large number of allocations. It's also hostile to any kind of context stored inside a DSM segment, since those don't have fixed addresses across all backends that share them.

The alternative that I played with a while back, when I was working on the sb_alloc allocator, was to set things up so that we could identify the memory context based on the pointer address rather than the chunk header. That's a whole lot better for memory efficiency, since you can squeeze out the chunk headers, but it inevitably slows down pfree.

What's not clear to me is to what extent slowing down pfree is an acceptable price for improving the behavior in other ways. I wonder how many of the pfree calls in our current codebase are useless or even counterproductive, or could be made so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> What's not clear to me is to what extent slowing down pfree is an
> acceptable price for improving the behavior in other ways. I wonder
> how many of the pfree calls in our current codebase are useless or
> even counterproductive, or could be made so.

I think there's a lot, but I'm afraid most of them are in contexts (pun intended) where aset.c already works pretty well, ie it's a short-lived context anyway. The areas where we're having pain are where there are fairly long-lived contexts with lots of pfree traffic; certainly that seems to be the case in reorderbuffer.c. Because they're long-lived, you can't just write off the pfrees as ignorable.

I wonder whether we could compromise by reducing the minimum "standard chunk header" to be just a pointer to owning context, with the other fields becoming specific to particular mcxt implementations. That would be enough to allow contexts to decide that pfree was a no-op, say, or that they wouldn't support GetMemoryChunkSpace(), without having to decree that misuse can lead to crashes. But that's still more than zero overhead per-chunk.

			regards, tom lane
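The compromise being proposed here could look roughly like the sketch below: the only mandatory per-chunk overhead is a pointer to the owning context, and everything else (chunk size, flags) becomes implementation-specific. All names (SketchContext, sketch_pfree, etc.) are hypothetical, not the actual memutils.h definitions:

```c
#include <stddef.h>
#include <stdlib.h>

/* stand-in for MemoryContext: just a methods table plus private state */
typedef struct SketchContext
{
    void (*free_p)(struct SketchContext *ctx, void *pointer);
    size_t noop_free_calls;     /* bookkeeping for the no-op example */
} SketchContext;

/* the reduced "standard chunk header": one pointer, nothing else */
typedef struct
{
    SketchContext *context;
} MinimalChunkHeader;

static void *
sketch_alloc(SketchContext *ctx, size_t size)
{
    MinimalChunkHeader *h = malloc(sizeof(MinimalChunkHeader) + size);

    h->context = ctx;           /* every implementation stores this... */
    return h + 1;               /* ...and nothing else is required */
}

/* pfree() can still dispatch safely through the context pointer */
static void
sketch_pfree(void *pointer)
{
    MinimalChunkHeader *h = ((MinimalChunkHeader *) pointer) - 1;

    h->context->free_p(h->context, pointer);
}

/* a context may legitimately decide that freeing is a no-op,
 * as a dense/bump allocator would */
static void
noop_free(SketchContext *ctx, void *pointer)
{
    (void) pointer;
    ctx->noop_free_calls++;     /* count the call, release nothing */
}
```

The point of the sketch is that misdirected pfree() calls still land in the owning context's code rather than crashing, while the size field (8 bytes of the current 16B StandardChunkHeader) no longer has to exist for every implementation.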
On 2016-07-13 16:39:58 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> What's not clear to me is to what extent slowing down pfree is an
>> acceptable price for improving the behavior in other ways. I wonder
>> how many of the pfree calls in our current codebase are useless or
>> even counterproductive, or could be made so.
>
> I think there's a lot, but I'm afraid most of them are in contexts
> (pun intended) where aset.c already works pretty well, ie it's a
> short-lived context anyway.

FWIW, hacking up aset.c/mcxt.c to use a trivial allocator with less overhead (i.e. just hand out sections out of a contiguous block of memory) results in a noticeable speedup in parse-heavy workloads. It's a bit ugly though, because of the amount of retail pfrees in random places.

> The areas where we're having pain are
> where there are fairly long-lived contexts with lots of pfree traffic;
> certainly that seems to be the case in reorderbuffer.c. Because they're
> long-lived, you can't just write off the pfrees as ignorable.

That's a problem too.

> I wonder whether we could compromise by reducing the minimum "standard
> chunk header" to be just a pointer to owning context, with the other
> fields becoming specific to particular mcxt implementations. That would
> be enough to allow contexts to decide that pfree was a no-op, say, or that
> they wouldn't support GetMemoryChunkSpace(), without having to decree that
> misuse can lead to crashes. But that's still more than zero overhead
> per-chunk.

I think that's a sensible compromise for some use-cases (e.g. parsing, parse analysis, potentially expression contexts).

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-13 16:39:58 -0400, Tom Lane wrote:
>> I think there's a lot, but I'm afraid most of them are in contexts
>> (pun intended) where aset.c already works pretty well, ie it's a
>> short-lived context anyway.

> FWIW, hacking up aset.c/mcxt.c to use a trivial allocator with less
> overhead (i.e. just hand out sections out of a contiguous block of
> memory) results in a noticeable speedup in parse-heavy workloads. It's a
> bit ugly though, because of the amount of retail pfrees in random
> places.

Yeah, both the parser and planner are already in the "palloc a lot and don't worry too much about pfrees" camp. I think they could both benefit from an allocator that ignores pfree and doesn't round up request sizes (except for MAXALIGN). I'm not sure how much of the rest of the system would like that behavior, though.

The design intention behind mcxt.c/aset.c was always that we'd have multiple allocators (multiple context types). I'm surprised that we've gone this long without following through on that.

			regards, tom lane
On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> What's not clear to me is to what extent slowing down pfree is an
>> acceptable price for improving the behavior in other ways. I wonder
>> how many of the pfree calls in our current codebase are useless or
>> even counterproductive, or could be made so.
>
> I think there's a lot, but I'm afraid most of them are in contexts
> (pun intended) where aset.c already works pretty well, ie it's a
> short-lived context anyway. The areas where we're having pain are
> where there are fairly long-lived contexts with lots of pfree traffic;
> certainly that seems to be the case in reorderbuffer.c. Because they're
> long-lived, you can't just write off the pfrees as ignorable.
>
> I wonder whether we could compromise by reducing the minimum "standard
> chunk header" to be just a pointer to owning context, with the other
> fields becoming specific to particular mcxt implementations. That would
> be enough to allow contexts to decide that pfree was a no-op, say, or that
> they wouldn't support GetMemoryChunkSpace(), without having to decree that
> misuse can lead to crashes. But that's still more than zero overhead
> per-chunk.

I think that would be worth doing. It's not perfect, and the extra 8 (or 4) bytes per chunk certainly do matter. On the other hand, it's certainly better than what we're doing today, which basically closes off all meaningful innovation in this area. If you are locked into having a 16 byte chunk header that includes the size, the individual memory context implementations don't have much latitude to vary their behavior from what aset.c already does. You can change the policy for allocating chunks from the operating system, and the way you try to recycle chunks that have been freed, but that's about it.
Removing the size from the standard header would, I think, make it easier to experiment more widely with alternative memory context implementations and get a better idea what can be saved. We can always think about further changes later, but this seems like a good start. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jul 17, 2016 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I wonder whether we could compromise by reducing the minimum "standard
>> chunk header" to be just a pointer to owning context, with the other
>> fields becoming specific to particular mcxt implementations.
>
> I think that would be worth doing. It's not perfect, and the extra 8
> (or 4) bytes per chunk certainly do matter.

I wonder if we could go further. If we don't imagine having a very large number of allocators, then we could just ask each one in turn if this block is one of theirs and which context it came from. That would allow an allocator that just allocated everything in a contiguous block to recognize pointers and return the memory context just by the range the pointer lies in.

There could be optimizations, like: if the leading bytes point to a structure with a decently large magic number, then assume it's a valid header, to avoid the cost of checking with lots of allocators. But I'm imagining that the list of allocators in use concurrently will be fairly small, so it might not even be necessary.

greg
Greg Stark <stark@mit.edu> writes:
> I wonder if we could go further. If we don't imagine having a very
> large number of allocators then we could just ask each one in turn if
> this block is one of theirs and which context it came from. That would
> allow an allocator that just allocated everything in a contiguous
> block to recognize pointers and return the memory context just by the
> range the pointer lies in.

The performance problem is not "large number of allocators", it's "large number of distinct memory ranges". Also, this will fail utterly to recognize duplicate-pfree and bad-pointer cases. Not that the existing method is bulletproof, but this way has zero robustness against caller bugs.

			regards, tom lane
On Mon, Jul 18, 2016 at 10:19 AM, Greg Stark <stark@mit.edu> wrote:
> On Sun, Jul 17, 2016 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether we could compromise by reducing the minimum "standard
>>> chunk header" to be just a pointer to owning context, with the other
>>> fields becoming specific to particular mcxt implementations.
>>
>> I think that would be worth doing. It's not perfect, and the extra 8
>> (or 4) bytes per chunk certainly do matter.
>
> I wonder if we could go further. If we don't imagine having a very
> large number of allocators then we could just ask each one in turn if
> this block is one of theirs and which context it came from. That would
> allow an allocator that just allocated everything in a contiguous
> block to recognize pointers and return the memory context just by the
> range the pointer lies in.
>
> There could be optimizations like if the leading bytes point to a
> structure with a decently large magic number then assume it's a valid
> header to avoid the cost of checking with lots of allocators. But I'm
> imagining that the list of allocators in use concurrently will be
> fairly small so it might not even be necessary.

Well, if we were going to do this, it would make more sense to have a central registry of blocks that is shared across all memory context types; when you see a pointer, you just look it up in the chunk map and find the containing context that way. I actually did something like this when I was working on sb_alloc (see archives); it looked up whether the chunk was of the new type and, if not, assumed it came from aset.c. To do this, it used a three-level trie (like Google's tcmalloc), keyed on the pointer address, with, IIRC, some optimizations for the case where we repeatedly free from the same chunk. That slowed down pfree noticeably, though.
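The idea of identifying the owning context from the pointer address alone can be sketched with a simple linear registry standing in for the three-level trie (all names here are hypothetical, and an integer id stands in for a MemoryContext pointer):

```c
#include <stddef.h>

#define MAX_BLOCKS 8

typedef struct
{
    const char *start;      /* block start address */
    const char *end;        /* one past the last byte */
    int         context_id; /* stand-in for the owning MemoryContext */
} BlockRange;

static BlockRange registry[MAX_BLOCKS];
static int nregistered = 0;

/* record that [start, start+len) belongs to context_id */
static void
register_block(const void *start, size_t len, int context_id)
{
    if (nregistered >= MAX_BLOCKS)
        return;                 /* sketch: silently ignore overflow */
    registry[nregistered].start = start;
    registry[nregistered].end = (const char *) start + len;
    registry[nregistered].context_id = context_id;
    nregistered++;
}

/*
 * GetMemoryChunkContext() without any chunk header: find the block
 * containing the pointer. A real implementation would replace this
 * linear scan with a trie keyed on the address bits, as tcmalloc does,
 * because "large number of distinct memory ranges" is exactly where
 * the linear scan falls over.
 */
static int
context_for_pointer(const void *p)
{
    for (int i = 0; i < nregistered; i++)
        if ((const char *) p >= registry[i].start &&
            (const char *) p < registry[i].end)
            return registry[i].context_id;
    return -1;                  /* not ours: caller bug, or aset.c's */
}
```

Note that, per Tom's objection upthread, a lookup like this cannot distinguish a duplicate pfree from a valid one: the address still lies in a registered range either way.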
The problem is that it's pretty well impossible to do something that is as cheap as fetching a memory context pointer from the chunk header, chasing a pointer or two from there, and then jumping. That's just really cheap.

The test case I used previously was an external sort, which does lots of retail pfrees. Now that we've mostly abandoned replacement selection, there will be many fewer pfrees while building runs, I think, but still quite a few while merging runs. Now it might be the case that if the allocating is fast enough and we save a bunch of memory, spending a few additional cycles freeing things is no big deal. It might also be the case that this is problematic in a few cases but that we can eliminate those cases. It's likely to take some work, though.

Anyway, my point is really that doing what Tom proposes would be a good first step and probably would not involve much risk. And with that change, any new allocator that has some extrinsic way of determining chunk sizes can save 8 bytes per chunk, which is certainly meaningful for any context that does lots of allocations. If somebody writes a new allocator that has good performance characteristics and uses 400MB of memory on some test where aset.c would have used 535MB of memory, and we can compute that with no chunk headers at all it would use only 346MB of memory, then we can talk about how to get there. Doing things stepwise is good.

One random thought is that we could have a memory allocator altogether separate from palloc/pfree. For example, suppose we introduce qalloc/qfree. A chunk of memory allocated with qalloc can only be freed with qfree, not pfree. In this world, you're completely freed from the requirement to care about chunk headers in any form. The downside, of course, is that you can't let qalloc'd tuples escape into the executor, say, because it only knows how to pfree things, not how to qfree things.
But there are plenty of places where you can easily see that allocations won't escape outside the module where they are performed; in fact, there are many cases where it would be easy to pass the memory context *as an argument to the free function* -- which would presumably be quite advantageous for any allocator that doesn't store a pointer to the context in the chunk header. I'm sure there will be some reluctance to go down these kinds of paths because it is undeniably convenient from a programmer's perspective to be able to just say pfree and forget the details, but I believe there will be some cases - especially for contexts that hold lots of allocations - where the performance gains from these kinds of techniques are quite measurable on macrobenchmarks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
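The qalloc/qfree idea sketched in the preceding paragraphs could look something like this; because the context (and, in this variant, the allocation size) is passed explicitly to the free function, chunks need no header at all. Names and signatures are illustrative, not a proposal-level API:

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct QContext
{
    size_t live_chunks;     /* bookkeeping lives in the context... */
    size_t live_bytes;      /* ...not in per-chunk headers */
} QContext;

/* allocate with no header prepended at all */
static void *
qalloc(QContext *ctx, size_t size)
{
    ctx->live_chunks++;
    ctx->live_bytes += size;
    return malloc(size);
}

/*
 * The caller must supply both the owning context and the size it
 * originally allocated; that is the price of dropping the header
 * entirely. pfree() on a qalloc'd pointer is simply not allowed,
 * which is why such chunks must not escape the module that made them.
 */
static void
qfree(QContext *ctx, void *pointer, size_t size)
{
    ctx->live_chunks--;
    ctx->live_bytes -= size;
    free(pointer);
}
```

This only works in modules where, as Robert notes, it is easy to see that allocations never escape: the tuplesort.c or hash join style of code that already tracks its own memory precisely, not anything that hands tuples to the executor.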
On Mon, Jul 18, 2016 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The test case I used previously was an external sort, which does lots
> of retail pfrees. Now that we've mostly abandoned replacement
> selection, there will be many fewer pfrees while building runs, I
> think, but still quite a few while merging runs.

Surely you mean that in 9.6 there are just as many palloc() + pfree() calls as before when building runs, but many fewer when merging (provided you limit your consideration to a final on-the-fly merge, which is the large majority of merges in practice)? Nothing changed about how tuplesort caller tuples are originally formed in 9.6, so work remains even there. I think we should be using batch memory for caller tuples (e.g., MinimalTuples) past the first run as an optimization, but that isn't something that I plan to do soon.

Separately, I've already written a patch to make final merges that are not on-the-fly (i.e. the final merge of a randomAccess caller, where a single materialized output to one tape is formed) use batch memory, mostly to suit parallel sort workers. Parallel sort could increase the prevalence of non-on-the-fly merges by quite a bit, so that is on my agenda for the next release.

> Now it might be the
> case that if the allocating is fast enough and we save a bunch of
> memory, spending a few additional cycles freeing things is no big
> deal. It might also be the case that this is problematic in a few
> cases but that we can eliminate those cases. It's likely to take some
> work, though.

Perhaps I simply lack imagination here, but I still suspect that ad-hoc approaches will tend to work best, because most of the benefit can be derived from specialized, precise memory management (what I've called batch memory) for just a few modules, and what remains isn't several broad swathes that can be delineated easily.
I can see a "palloc a lot and don't worry too much about pfrees" allocator having some value, but I suspect that that isn't going to move the needle in the same way. -- Peter Geoghegan