Discussion: rethinking dense_alloc (HashJoin) as a memory context

rethinking dense_alloc (HashJoin) as a memory context

From
Tomas Vondra
Date:
Hi,

In the thread [1] dealing with the hashjoin bug introduced in 9.5, Tom 
voiced his dislike of dense_alloc. I kinda agree with him that 
introducing "local allocators" may not be the best idea, and as 
dense_alloc was introduced by me, I've been playing with the idea of 
wrapping it in a regular memory context, perhaps with some restrictions 
(e.g. no pfree). But I'm having trouble with that approach ...

Let me quickly explain the idea behind dense_alloc. When building the 
tuple hash table in hash join, we simply allocate a large chunk of 
memory using palloc (~32kB), and then store the tuples into that chunk 
ourselves, without calling palloc for each tuple. Each tuple already has 
its length in its header, so we don't need a chunk header. Also, we skip 
the 2^k rounding of chunk sizes and instead store the tuples densely.
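
As a rough illustration, here is a minimal sketch of that allocation 
scheme (invented names, not the actual nodeHash.c code; the real thing 
also needs a separate path for tuples larger than the chunk size):

#include "postgres.h"

#define DENSE_CHUNK_SIZE    (32 * 1024)     /* ~32kB of usable space */

typedef struct DenseChunk
{
    struct DenseChunk *next;    /* chunks kept in a simple linked list */
    Size        used;           /* bytes handed out so far */
    Size        maxlen;         /* usable bytes in data[] */
    char        data[FLEXIBLE_ARRAY_MEMBER];
} DenseChunk;

static void *
dense_alloc_sketch(DenseChunk **chunks, MemoryContext cxt, Size size)
{
    DenseChunk *chunk = *chunks;
    char       *ptr;

    size = MAXALIGN(size);

    /* start a new ~32kB chunk if there is none, or the current one is full */
    if (chunk == NULL || chunk->maxlen - chunk->used < size)
    {
        chunk = MemoryContextAlloc(cxt,
                                   offsetof(DenseChunk, data) + DENSE_CHUNK_SIZE);
        chunk->next = *chunks;
        chunk->used = 0;
        chunk->maxlen = DENSE_CHUNK_SIZE;
        *chunks = chunk;
    }

    /* carve the tuple out of the chunk: no per-tuple header, no 2^k rounding */
    ptr = chunk->data + chunk->used;
    chunk->used += size;

    return ptr;
}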

This means we can't do repalloc or pfree on the tuples, but that's fine. 
We never did repalloc in hashjoin anyway, and pfree is only needed when 
increasing the number of batches. With dense_alloc we can instead simply 
walk through the tuples as stored in the allocated chunks, which has the 
nice benefit of being sequential, making memory prefetching more 
effective than with the old code (walking through the buckets). There 
are also no freelists and the like to maintain.
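
The sequential walk could then be sketched like this (again with 
invented names, reusing the DenseChunk sketch above; it assumes each 
entry was allocated with exactly its MinimalTuple t_len, whereas the 
real code has an extra small hashjoin-specific header per tuple):

#include "access/htup_details.h"    /* MinimalTuple, t_len */

static void
dense_walk_sketch(DenseChunk *chunks, void (*callback) (MinimalTuple tup))
{
    DenseChunk *chunk;

    for (chunk = chunks; chunk != NULL; chunk = chunk->next)
    {
        Size        idx = 0;

        /* visit the tuples in the order they were packed into the chunk */
        while (idx < chunk->used)
        {
            MinimalTuple tup = (MinimalTuple) (chunk->data + idx);

            callback(tup);

            /* advance by the same MAXALIGNed size the allocator handed out */
            idx += MAXALIGN(tup->t_len);
        }
    }
}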

So dense_alloc has several benefits:

(a) memory reduction thanks to eliminating the StandardChunkHeader 
(which is 16B, and quite noticeable for narrow tuples)

(b) memory reduction thanks to packing the tuples densely (not leaving 
free space in each chunk)

(c) improved efficiency thanks to sequential memory access (compared to 
the random accesses caused by walking through the buckets)

Per the measurements done in thread [2], (a) and (b) may reduce memory 
requirements by 50% in some cases. I also vaguely remember doing 
benchmarks for (c) and seeing measurable improvements, but I don't see 
the numbers in the thread, so either they were posted somewhere else or 
not at all :-/

Anyway, I'm explaining this because I think it's important that the 
reworked code achieves the same benefits. But when trying to implement 
it as a special memory context, I quickly ran into the requirement that 
each chunk carry a standard chunk header [3], which would prevent (a).
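
(For reference, the required header looks roughly like this; paraphrased 
from memory, see [3] for the authoritative definition. On 64-bit builds 
that's 16 bytes per chunk in non-assert builds.)

typedef struct StandardChunkHeader
{
    MemoryContext context;      /* owning context */
    Size        size;           /* size of data space allocated in chunk */
#ifdef MEMORY_CONTEXT_CHECKING
    Size        requested_size; /* original request size, for debugging */
#endif
} StandardChunkHeader;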

I know it was proposed to only include the chunk header when compiled 
with asserts, but I don't like the idea of having reusable code that 
depends on that (and fails with a segfault without it).

If I have to choose between a memory context that is essentially meant 
to be reused, but likely to fail unexpectedly in non-assert builds, and 
a special local allocator isolated to a single node, I choose the 
latter. Perhaps I'd see this differently had there been other places 
that could use the new memory context, but I can't think of one.

Opinions?


[1] 
https://www.postgresql.org/message-id/flat/7034.1454722453%40sss.pgh.pa.us

[2] https://www.postgresql.org/message-id/flat/53B4A03F.3070409%40fuzzy.cz

[3] 
https://github.com/postgres/postgres/blob/master/src/include/utils/memutils.h#L49

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Peter Geoghegan
Date:
On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
> his dislike of dense_alloc. I kinda agree with him that introducing "local
> allocators" may not be the best idea, and as dense_alloc was introduced by
> me I was playing with the idea to wrap this into a regular memory context,
> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
> that approach ...

I think that the "no pfree()" restriction would be necessary to get
the same benefit. But, doesn't that undermine the whole idea of making
it a memory context?

In my view, these "local allocators" are not so bad. They're a bit
ugly, but that seems to be worth it so far, and I don't think that
there is that much incidental complexity that could be encapsulated.
For a few modules, including tuplesort.c, the hash join code,
tuplestore.c, and possibly a couple of others, having precise control
over memory just seems like a good idea to me (i.e. doling it out from
some initial large batch palloc() allocations according to some
considerations about the relevant data structures, leaving a
cache-friendly layout).

I suspect that there are not that many places where it is worth it to
even contemplate batch or dense allocators, so I doubt that we will see
all that many more instances of "local allocators".

-- 
Peter Geoghegan



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
>> his dislike of dense_alloc. I kinda agree with him that introducing "local
>> allocators" may not be the best idea, and as dense_alloc was introduced by
>> me I was playing with the idea to wrap this into a regular memory context,
>> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
>> that approach ...

> I think that the "no pfree()" restriction would be necessary to get
> the same benefit. But, doesn't that undermine the whole idea of making
> it a memory context?

The other thing that doesn't seem to square at all with a general-purpose
memory context is the desire to walk through the stored tuples directly,
knowing that they are adjacent.  That means nothing else can be allocated
via the same mechanism.  So I tend to agree that if we accept Tomas' three
requirements as non-negotiable, then trying to make the allocator match
the MemoryContext API is probably impractical.

My feeling at this point is that we should leave it alone until/unless
we see similar requirements elsewhere, and then look to see if we can
derive a common abstraction.  I always find that it's easier to design
APIs based on concrete use-cases than on guesses about what will be
needed.

I wonder though if we don't already have another similar use-case in
the ad-hoc "slab allocators" in reorderbuffer.c.  We already know that
that code has performance issues, cf bug #14231, so I suspect there's
a redesign in its future anyway.
        regards, tom lane



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Andres Freund
Date:
On 2016-07-13 13:37:55 -0400, Tom Lane wrote:
> I wonder though if we don't already have another similar use-case in
> the ad-hoc "slab allocators" in reorderbuffer.c.

That seems to call more for a general slab allocator design than for
something like what's being done here. After all, there's plenty of
freeing there.

> We already know that
> that code has performance issues, cf bug #14231, so I suspect there's
> a redesign in its future anyway.

Note that it's not the slab allocators that are having problems, it's
aset.c, iterating through all allocated blocks.

Andres



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-13 13:37:55 -0400, Tom Lane wrote:
>> We already know that
>> that code has performance issues, cf bug #14231, so I suspect there's
>> a redesign in its future anyway.

> Note that it's not the slab allocators that is having problems, it's
> aset.c, iterating through all allocated blocks.

Well, my point is that that code is using allocation patterns that aset.c
isn't very well suited for.  The slab allocator logic attempts to paper
over one part of that impedance mismatch, but we're seeing there's more.
I haven't looked at it closely enough to have a solution proposal.
        regards, tom lane



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tomas Vondra
Date:
On 07/13/2016 07:37 PM, Tom Lane wrote:
> Peter Geoghegan <pg@heroku.com> writes:
>> On Wed, Jul 13, 2016 at 7:53 AM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> In the thread [1] dealing with hashjoin bug introduced in 9.5, Tom voiced
>>> his dislike of dense_alloc. I kinda agree with him that introducing "local
>>> allocators" may not be the best idea, and as dense_alloc was introduced by
>>> me I was playing with the idea to wrap this into a regular memory context,
>>> perhaps with some restrictions (e.g. no pfree). But I'm having trouble with
>>> that approach ...
>
>> I think that the "no pfree()" restriction would be necessary to get
>> the same benefit. But, doesn't that undermine the whole idea of making
>> it a memory context?
>
> The other thing that doesn't seem to square at all with a general-purpose
> memory context is the desire to walk through the stored tuples directly,
> knowing that they are adjacent.  That means nothing else can be allocated
> via the same mechanism.  So I tend to agree that if we accept Tomas' three
> requirements as non-negotiable, then trying to make the allocator match
> the MemoryContext API is probably impractical.
>
> My feeling at this point is that we should leave it alone until/unless
> we see similar requirements elsewhere, and then look to see if we can
> derive a common abstraction.  I always find that it's easier to design
> APIs based on concrete use-cases than on guesses about what will be
> needed.

I agree with both points.

I think the MemoryContext API may not be the right abstraction for this. 
Given a big enough hammer it would probably work in the end, but it'd 
likely require changes to the public MemoryContext API (e.g. relaxing 
the StandardChunkHeader requirement). And that seems a bit too risky.

So we probably need a new independent abstraction for this, but doing 
that based on a single use case is a bit silly.

>
> I wonder though if we don't already have another similar use-case in
> the ad-hoc "slab allocators" in reorderbuffer.c.  We already know that
> that code has performance issues, cf bug #14231, so I suspect there's
> a redesign in its future anyway.
>

I'm not sure - I'm not familiar with reorderbuffer.c, but it seems to do 
a fair number of pfrees and such. Also, pfrees seem to be the root of 
the performance issue. I suspect the slab allocator (or rather the 
allocation strategy in general) may need rethinking, but let's discuss 
that in that thread.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Robert Haas
Date:
On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> I think the MemoryContext API may not be right abstraction for this.

+1. The MemoryContext API is little short of an absolute bar to
implementing an allocator that works significantly differently than
aset.c, and that's a shame because aset.c is great for little tiny
contexts that don't live very long and sucks for everything else.  In
particular, it sucks for anything that does a large number of
allocations.  It's also hostile to any kind of context stored inside a
DSM segment, since those don't have a fixed address across all
backends that share them.

The alternative that I played with a while back when I was working on
the sb_alloc allocator was to set things up so that we could identify
the memory context based on the pointer address rather than the chunk
header.  That's a whole lot better for memory efficiency since you can
squeeze out the chunk headers, but it inevitably slows down pfree.
What's not clear to me is to what extent slowing down pfree is an
acceptable price for improving the behavior in other ways.  I wonder
how many of the pfree calls in our current codebase are useless or
even counterproductive, or could be made so.
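
To make that trade-off concrete, here is a purely illustrative sketch 
(not the sb_alloc code) of resolving the owning context from the 
pointer alone, by searching a table of block ranges the contexts have 
registered; the per-chunk header bytes disappear, but every pfree pays 
for the lookup:

typedef struct BlockRange
{
    char       *start;          /* first byte of a block owned by 'context' */
    char       *end;            /* first byte past the block */
    MemoryContext context;
} BlockRange;

/* 'ranges' is assumed sorted by start address and non-overlapping */
static MemoryContext
context_for_pointer(const BlockRange *ranges, int nranges, const void *ptr)
{
    int         lo = 0;
    int         hi = nranges - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if ((const char *) ptr < ranges[mid].start)
            hi = mid - 1;
        else if ((const char *) ptr >= ranges[mid].end)
            lo = mid + 1;
        else
            return ranges[mid].context;
    }
    return NULL;                /* pointer not registered with any context */
}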

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> What's not clear to me is to what extent slowing down pfree is an
> acceptable price for improving the behavior in other ways.  I wonder
> how many of the pfree calls in our current codebase are useless or
> even counterproductive, or could be made so.

I think there's a lot, but I'm afraid most of them are in contexts
(pun intended) where aset.c already works pretty well, ie it's a
short-lived context anyway.  The areas where we're having pain are
where there are fairly long-lived contexts with lots of pfree traffic;
certainly that seems to be the case in reorderbuffer.c.  Because they're
long-lived, you can't just write off the pfrees as ignorable.

I wonder whether we could compromise by reducing the minimum "standard
chunk header" to be just a pointer to owning context, with the other
fields becoming specific to particular mcxt implementations.  That would
be enough to allow contexts to decide that pfree was a no-op, say, or that
they wouldn't support GetMemoryChunkSpace(), without having to decree that
misuse can lead to crashes.  But that's still more than zero overhead
per-chunk.
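
In code terms the compromise might amount to nothing more than this 
(sketch only):

typedef struct MinimalChunkHeader
{
    MemoryContext context;      /* only mandatory field: who owns this chunk */
    /* size, sentinel, etc. become specific to the context implementation */
} MinimalChunkHeader;

pfree() could still find the owning context and dispatch to its free 
method, which a given context type would then be free to implement as a 
no-op.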
        regards, tom lane



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Andres Freund
Date:
On 2016-07-13 16:39:58 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
> > <tomas.vondra@2ndquadrant.com> wrote:
> > What's not clear to me is to what extent slowing down pfree is an
> > acceptable price for improving the behavior in other ways.  I wonder
> > how many of the pfree calls in our current codebase are useless or
> > even counterproductive, or could be made so.
> 
> I think there's a lot, but I'm afraid most of them are in contexts
> (pun intended) where aset.c already works pretty well, ie it's a
> short-lived context anyway.

FWIW, hacking up aset.c/mcxt.c to use a trivial allocator with less
overhead (i.e. just handing out sections of one contiguous block of
memory) results in a noticeable speedup in parse-heavy workloads. It's a
bit ugly, though, because of the amount of retail pfrees in random
places.
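
(As a sketch of the behaviour being described, with invented names: 
retail pfree gives nothing back, and memory is reclaimed only when the 
whole context is reset, which is exactly why the retail pfrees scattered 
around are the ugly part.)

typedef struct TrivialContext
{
    char       *block;          /* one contiguous block handed out by bumping */
    Size        used;           /* bump pointer; allocation just advances this */
} TrivialContext;

static void
trivial_free(TrivialContext *cxt, void *pointer)
{
    /* intentionally a no-op: individual frees return nothing */
}

static void
trivial_reset(TrivialContext *cxt)
{
    cxt->used = 0;              /* everything is reclaimed at once */
}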


> The areas where we're having pain are
> where there are fairly long-lived contexts with lots of pfree traffic;
> certainly that seems to be the case in reorderbuffer.c.  Because they're
> long-lived, you can't just write off the pfrees as ignorable.

That's a problem too.


> I wonder whether we could compromise by reducing the minimum "standard
> chunk header" to be just a pointer to owning context, with the other
> fields becoming specific to particular mcxt implementations.  That would
> be enough to allow contexts to decide that pfree was a no-op, say, or that
> they wouldn't support GetMemoryChunkSpace(), without having to decree that
> misuse can lead to crashes.  But that's still more than zero overhead
> per-chunk.

I think that's a sensible compromise for some use-cases (e.g. parsing,
parse analysis, potentially expression contexts).

Andres



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-13 16:39:58 -0400, Tom Lane wrote:
>> I think there's a lot, but I'm afraid most of them are in contexts
>> (pun intended) where aset.c already works pretty well, ie it's a
>> short-lived context anyway.

> FWIW, hacking up the aset/mcxt.c to use a trivial allocator with less
> overhead (i.e. just hand out sections out of a continuous block of
> memory) results in a noticeable speedup in parse heavy workloads. It's a
> bit ugly though, because of the amount of retail pfrees in random
> places.

Yeah, both the parser and planner are already in the "palloc a lot and
don't worry too much about pfrees" camp.  I think they could both benefit
from an allocator that ignores pfree and doesn't round up request sizes
(except for MAXALIGN).  I'm not sure how much of the rest of the system
would like that behavior, though.

The design intention behind mcxt.c/aset.c was always that we'd have
multiple allocators (multiple context types).  I'm surprised that we've
gone this long without following through on that.
        regards, tom lane



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Robert Haas
Date:
On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Jul 13, 2016 at 1:10 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>> What's not clear to me is to what extent slowing down pfree is an
>> acceptable price for improving the behavior in other ways.  I wonder
>> how many of the pfree calls in our current codebase are useless or
>> even counterproductive, or could be made so.
>
> I think there's a lot, but I'm afraid most of them are in contexts
> (pun intended) where aset.c already works pretty well, ie it's a
> short-lived context anyway.  The areas where we're having pain are
> where there are fairly long-lived contexts with lots of pfree traffic;
> certainly that seems to be the case in reorderbuffer.c.  Because they're
> long-lived, you can't just write off the pfrees as ignorable.
>
> I wonder whether we could compromise by reducing the minimum "standard
> chunk header" to be just a pointer to owning context, with the other
> fields becoming specific to particular mcxt implementations.  That would
> be enough to allow contexts to decide that pfree was a no-op, say, or that
> they wouldn't support GetMemoryChunkSpace(), without having to decree that
> misuse can lead to crashes.  But that's still more than zero overhead
> per-chunk.

I think that would be worth doing.  It's not perfect, and the extra 8
(or 4) bytes per chunk certainly do matter.  On the other hand, it's
certainly better than what we're doing today, which basically closes
off all meaningful innovation in this area.  If you are locked into
having a 16 byte chunk header that includes the size, the individual
memory context implementations don't have much latitude to vary their
behavior from what aset.c already does.  You can change the policy for
allocating chunks from the operating system, and the way you try to
recycle chunks that have been freed, but that's about it.  Removing
the size from the standard header would, I think, make it easier to
experiment more widely with alternative memory context implementations
and get a better idea what can be saved.

We can always think about further changes later, but this seems like a
good start.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Greg Stark
Date:
On Sun, Jul 17, 2016 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
>On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> I wonder whether we could compromise by reducing the minimum "standard
>> chunk header" to be just a pointer to owning context, with the other
>> fields becoming specific to particular mcxt implementations.
>
> I think that would be worth doing.  It's not perfect, and the extra 8
> (or 4) bytes per chunk certainly do matter.

I wonder if we could go further. If we don't imagine having a very
large number of allocators then we could just ask each one in turn if
this block is one of theirs and which context it came from. That would
allow an allocator that just allocated everything in a contiguous
block to recognize pointers and return the memory context just by the
range the pointer lies in.

There could be optimizations, like checking whether the leading pointer
points to a structure with a decently large magic number and, if so,
assuming it's a valid header, to avoid the cost of checking with lots of
allocators. But I'm imagining that the list of allocators in use
concurrently will be fairly small, so it might not even be necessary.
greg



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Tom Lane
Date:
Greg Stark <stark@mit.edu> writes:
> I wonder if we could go further. If we don't imagine having a very
> large number of allocators then we could just ask each one in turn if
> this block is one of theirs and which context it came from. That would
> allow an allocator that just allocated everything in a contiguous
> block to recognize pointers and return the memory context just by the
> range the pointer lies in.

The performance problem is not "large number of allocators", it's "large
number of distinct memory ranges".  Also, this will fail utterly to
recognize duplicate-pfree and bad-pointer cases.  Not that the existing
method is bulletproof, but this way has zero robustness against caller
bugs.
        regards, tom lane



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Robert Haas
Date:
On Mon, Jul 18, 2016 at 10:19 AM, Greg Stark <stark@mit.edu> wrote:
> On Sun, Jul 17, 2016 at 1:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>On Wed, Jul 13, 2016 at 4:39 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I wonder whether we could compromise by reducing the minimum "standard
>>> chunk header" to be just a pointer to owning context, with the other
>>> fields becoming specific to particular mcxt implementations.
>>
>> I think that would be worth doing.  It's not perfect, and the extra 8
>> (or 4) bytes per chunk certainly do matter.
>
> I wonder if we could go further. If we don't imagine having a very
> large number of allocators then we could just ask each one in turn if
> this block is one of theirs and which context it came from. That would
> allow an allocator that just allocated everything in a contiguous
> block to recognize pointers and return the memory context just by the
> range the pointer lies in.
>
> There could be optimizations like if the leading points point to a
> structure with a decently large magic number then assume it's a valid
> header to avoid the cost of checking with lots of allocators. But I'm
> imagining that the list of allocators in use concurrenlty will be
> fairly small so it might not even be necessary.

Well, if we were going to do this, it would make more sense to have a
central registry of blocks that is shared across all memory context
types, and when you see a pointer you just look it up in the chunk map
and find the containing context that way.  I actually did something
like this when I was working on sb_alloc (see archives); it looked up
whether the chunk was of the new type and, if not, it assumed it came
from aset.c.  To do this, it used a three-level trie keyed on the
pointer address (like Google's tcmalloc) with, IIRC, some optimizations
for the case where we repeatedly free from the same chunk.  That
slowed down pfree noticeably, though. The problem is that it's pretty
well impossible to do something that is as cheap as fetching a memory
context pointer from the chunk header, chasing a pointer or two from
there, and then jumping.  That's just really cheap.

The test case I used previously was an external sort, which does lots
of retail pfrees.  Now that we've mostly abandoned replacement
selection, there will be many fewer pfrees while building runs, I
think, but still quite a few while merging runs.  Now it might be the
case that if the allocating is fast enough and we save a bunch of
memory, spending a few additional cycles freeing things is no big
deal.  It might also be the case that this is problematic in a few
cases but that we can eliminate those cases.  It's likely to take some
work, though.

Anyway, my point is really that doing what Tom proposes would be a
good first step and probably would not involve much risk.  And with
that change, any new allocator that has some extrinsic way of
determining chunk sizes can save 8 bytes per chunk, which is certainly
meaningful for any context that does lots of allocations.  If somebody
writes a new allocator that has good performance characteristics and
uses 400MB of memory on some test where aset.c would have used 535MB
of memory, and we can compute that with no chunk headers at all would
use only 346MB of memory, then we can talk about how to get there.
Doing things stepwise is good.

One random thought is that we could have a memory allocator altogether
separate from palloc/pfree.  For example, suppose we introduce
qalloc/qfree.  A chunk of memory allocated with qalloc can only be
freed with qfree, not pfree.  In this world, you're completely freed
from the requirement to care about chunk headers in any form.  The
downside, of course, is that you can't let qalloc'd tuples escape into
the executor, say, because it only knows how to pfree things, not how
to qfree things.  But there are plenty of places where you can easily
see that allocations won't escape outside the module where they are
performed; in fact, there are many cases where it would be easy to
pass the memory context *as an argument to the free function* -- which
would presumably be quite advantageous for any allocator that doesn't
store a pointer to the context in the chunk header.  I'm sure there
will be some reluctance to go down these kinds of paths because it is
undeniably convenient from a programmer's perspective to be able to
just say pfree and forget the details, but I believe there will be
some cases - especially for contexts that hold lots of allocations -
where the performance gains from these kinds of techniques are quite
measurable on macrobenchmarks.
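
As a sketch of the API shape being floated here (qalloc/qfree and 
QContext are of course just placeholder names from the paragraph above, 
not a real proposal):

typedef struct QContext QContext;   /* opaque; no chunk headers anywhere */

extern QContext *qcontext_create(Size initial_size);
extern void *qalloc(QContext *cxt, Size size);
extern void qfree(QContext *cxt, void *ptr);    /* caller names the context */
extern void qcontext_destroy(QContext *cxt);

/* usage sketch inside a module whose allocations never escape it */
static void *
copy_blob_local(QContext *cxt, const void *src, Size len)
{
    void       *dst = qalloc(cxt, len);

    memcpy(dst, src, len);
    return dst;     /* must go back via qfree(cxt, dst), never pfree() */
}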

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: rethinking dense_alloc (HashJoin) as a memory context

From
Peter Geoghegan
Date:
On Mon, Jul 18, 2016 at 7:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> The test case I used previously was an external sort, which does lots
> of retail pfrees.  Now that we've mostly abandoned replacement
> selection, there will be many fewer pfrees while building runs, I
> think, but still quite a few while merging runs.

Surely you mean that in 9.6 there are just as many palloc() + pfree()
calls as before when building runs, but many fewer when merging
(provided you limit your consideration to final on-the-fly merges,
which are the large majority of merges in practice)? Nothing changed
about how tuplesort caller tuples are originally formed in 9.6, so
work remains even there.

I think we should be using batch memory for caller tuples (e.g.,
MinimalTuples) past the first run as an optimization, but that isn't
something that I plan to do soon. Separately, I've already written a
patch to make final merges that are not on-the-fly (i.e. the final
merge of a randomAccess caller, where the output is materialized onto a
single tape) use batch memory, mostly to suit parallel sort
workers. Parallel sort could increase the prevalence of non-on-the-fly
merges by quite a bit, so that is on my agenda for the next release.

> Now it might be the
> case that if the allocating is fast enough and we save a bunch of
> memory, spending a few additional cycles freeing things is no big
> deal.  It might also be the case that this is problematic in a few
> cases but that we can eliminate those cases.  It's likely to take some
> work, though.

Perhaps I simply lack imagination here, but I still suspect that
ad-hoc approaches will tend to work best, because most of the benefit
can be derived from specialized, precise memory management (what I've
called batch memory) for just a few modules, and what remains isn't
several broad swathes that can be delineated easily. I can see a
"palloc a lot and don't worry too much about pfrees" allocator having
some value, but I suspect that that isn't going to move the needle in
the same way.

-- 
Peter Geoghegan