Discussion: Memory Accounting

Memory Accounting

From: Jeff Davis
Previous discussion: 
https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop

This patch introduces a way to ask a memory context how much memory it
currently has allocated. Each time a new block (not an individual
chunk, but a new malloc'd block) is allocated, it increments a struct
member; and each time a block is free'd, it decrements the struct
member. So it tracks blocks allocated by malloc, not what is actually
used for chunks allocated by palloc.
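
For illustration, the bookkeeping amounts to something like the
following in the block alloc/free paths of aset.c (simplified sketch;
the real call sites also deal with keeper blocks, freelists, etc.):

    /* a new block is malloc'd for the context, e.g. in AllocSetAlloc */
    block = (AllocBlock) malloc(blksize);
    if (block == NULL)
        return NULL;
    context->mem_allocated += blksize;

    /* a block is given back to the OS, e.g. in AllocSetFree */
    context->mem_allocated -= block->endptr - ((char *) block);
    free(block);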

The purpose is for Memory Bounded Hash Aggregation, but it may be
useful in more cases as we try to better adapt to and control memory
usage at execution time.

I ran the same tests as Robert did before[1] on my laptop[2]. The only
difference is that I also set max_parallel_workers[_per_gather]=0 to be
sure. I did 5 runs, alternating between memory-accounting and master,
and I got the following results for "elapsed" (as reported by
trace_sort):


regression=# select version, min(s), max(s), percentile_disc(0.5)
within group (order by s) as median, avg(s)::numeric(10,2) from tmp
group by version;
      version      |  min  |  max  | median |  avg
-------------------+-------+-------+--------+-------
 master            | 13.92 | 14.40 |  14.06 | 14.12
 memory accounting | 13.43 | 14.46 |  14.11 | 14.09
(2 rows)


So, I don't see any significant performance impact for this patch in
this test. That may be because:

* It was never really significant except on PPC64.
* I changed it to only update mem_allocated for the current context,
not recursively for all parent contexts. It's now up to the function
that reports memory usage to recurse or not (as needed); see the sketch
after this list.
* A lot of changes to sort have happened since that time, so perhaps
it's not a good test of memory contexts any more.
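
To sketch what the reporting side looks like (the patch exposes this as
MemoryContextMemAllocated(); the body below is illustrative rather than
the exact code):

    int64
    MemoryContextMemAllocated(MemoryContext context, bool recurse)
    {
        int64       total = context->mem_allocated;

        if (recurse)
        {
            MemoryContext child;

            for (child = context->firstchild;
                 child != NULL;
                 child = child->nextchild)
                total += MemoryContextMemAllocated(child, true);
        }

        return total;
    }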

pgbench didn't show any slowdown either.

I also did another test with hash aggregation that uses significant
memory (t10m is a table with 10M distinct values and work_mem is 1GB):

postgres=# select (select (i, count(*)) from t10m group by i having
count(*) > n) from (values(1),(2),(3),(4),(5)) as s(n);

I didn't see any noticeable difference there, either.

Regards,
    Jeff Davis

[1] 
https://postgr.es/m/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com
[2] Linux jdavis 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24
UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Attachments

Re: Memory Accounting

From: Heikki Linnakangas
On 18/07/2019 21:24, Jeff Davis wrote:
> Previous discussion:
> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
> 
> This patch introduces a way to ask a memory context how much memory it
> currently has allocated. Each time a new block (not an individual
> chunk, but a new malloc'd block) is allocated, it increments a struct
> member; and each time a block is free'd, it decrements the struct
> member. So it tracks blocks allocated by malloc, not what is actually
> used for chunks allocated by palloc.
> 
> The purpose is for Memory Bounded Hash Aggregation, but it may be
> useful in more cases as we try to better adapt to and control memory
> usage at execution time.

Seems handy.

> * I changed it to only update mem_allocated for the current context,
> not recursively for all parent contexts. It's now up to the function
> that reports memory usage to recurse or not (as needed).

Is that OK for memory bounded hash aggregation? Might there be a lot of 
sub-contexts during hash aggregation?

- Heikki



Re: Memory Accounting

From: Tomas Vondra
On Mon, Jul 22, 2019 at 11:30:37AM +0300, Heikki Linnakangas wrote:
>On 18/07/2019 21:24, Jeff Davis wrote:
>>Previous discussion:
>>https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
>>
>>This patch introduces a way to ask a memory context how much memory it
>>currently has allocated. Each time a new block (not an individual
>>chunk, but a new malloc'd block) is allocated, it increments a struct
>>member; and each time a block is free'd, it decrements the struct
>>member. So it tracks blocks allocated by malloc, not what is actually
>>used for chunks allocated by palloc.
>>
>>The purpose is for Memory Bounded Hash Aggregation, but it may be
>>useful in more cases as we try to better adapt to and control memory
>>usage at execution time.
>
>Seems handy.
>

Indeed.

>>* I changed it to only update mem_allocated for the current context,
>>not recursively for all parent contexts. It's now up to the function
>>that reports memory usage to recurse or not (as needed).
>
>Is that OK for memory bounded hash aggregation? Might there be a lot 
>of sub-contexts during hash aggregation?
>

There shouldn't be, at least not since b419865a814abbc. There might be
cases where custom aggregates still do that, but I think that's simply a
design we should discourage.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

From: Jeff Davis
On Mon, 2019-07-22 at 18:16 +0200, Tomas Vondra wrote:
> > > * I changed it to only update mem_allocated for the current context,
> > > not recursively for all parent contexts. It's now up to the function
> > > that reports memory usage to recurse or not (as needed).
> > 
> > Is that OK for memory bounded hash aggregation? Might there be a
> > lot of sub-contexts during hash aggregation?
> > 
> 
> There shouldn't be, at least not since b419865a814abbc. There might be
> cases where custom aggregates still do that, but I think that's simply
> a design we should discourage.

Right, I don't think memory-context-per-group is something we should
optimize for.

Discussion:
https://www.postgresql.org/message-id/3839201.Nfa2RvcheX%40techfox.foxi
https://www.postgresql.org/message-id/5334D7A5.2000907%40fuzzy.cz

Commit link:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=b419865a814abbca12bdd6eef6a3d5ed67f432e1

Regards,
    Jeff Davis





Re: Memory Accounting

From: Melanie Plageman


On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
> Previous discussion:
> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
>
> This patch introduces a way to ask a memory context how much memory it
> currently has allocated. Each time a new block (not an individual
> chunk, but a new malloc'd block) is allocated, it increments a struct
> member; and each time a block is free'd, it decrements the struct
> member. So it tracks blocks allocated by malloc, not what is actually
> used for chunks allocated by palloc.


Cool! I like how straightforward this approach is. It seems easy to
build on, as well.

Are there cases where we are likely to palloc a lot without needing to
malloc in a certain memory context? For example, do we have a pattern
where, for some kind of memory-intensive operator, we might palloc in
a per-tuple context and consistently get chunks without having to
malloc, and then later, were we to check the bytes allocated for one
of these per-tuple contexts to decide on some behavior, the number
would not be representative?

I think that is basically what Heikki is asking about with HashAgg,
but I wondered if there were other cases that you had already thought
through where this might happen.
 
> The purpose is for Memory Bounded Hash Aggregation, but it may be
> useful in more cases as we try to better adapt to and control memory
> usage at execution time.


This approach seems like it would be good for memory intensive
operators which use a large, representative context. I think the
HashTableContext for HashJoin might be one?

--
Melanie Plageman

Re: Memory Accounting

From: Tomas Vondra
On Tue, Jul 23, 2019 at 06:18:26PM -0700, Melanie Plageman wrote:
>On Thu, Jul 18, 2019 at 11:24 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
>> Previous discussion:
>> https://postgr.es/m/1407012053.15301.53.camel%40jeff-desktop
>>
>> This patch introduces a way to ask a memory context how much memory it
>> currently has allocated. Each time a new block (not an individual
>> chunk, but a new malloc'd block) is allocated, it increments a struct
>> member; and each time a block is free'd, it decrements the struct
>> member. So it tracks blocks allocated by malloc, not what is actually
>> used for chunks allocated by palloc.
>>
>>
>Cool! I like how straight-forward this approach is. It seems easy to
>build on, as well.
>
>Are there cases where we are likely to palloc a lot without needing to
>malloc in a certain memory context? For example, do we have a pattern
>where, for some kind of memory-intensive operator, we might palloc in
>a per-tuple context and consistently get chunks without having to
>malloc, and then later, were we to check the bytes allocated for one
>of these per-tuple contexts to decide on some behavior, the number
>would not be representative?
>

I think there are plenty of places where we quickly get into a state
with enough chunks in the freelist - the per-tuple contexts are a good
example of that.

>I think that is basically what Heikki is asking about with HashAgg,
>but I wondered if there were other cases that you had already thought
>through where this might happen.
>

I think Heikki was asking about places with a lot of sub-contexts, which is a
completely different issue. It used to be the case that some aggregates
created a separate context for each group - like array_agg. That would
make Jeff's approach to accounting rather inefficient, because checking
how much memory is used would be very expensive (having to loop over a
large number of contexts).

>
>> The purpose is for Memory Bounded Hash Aggregation, but it may be
>> useful in more cases as we try to better adapt to and control memory
>> usage at execution time.
>>
>>
>This approach seems like it would be good for memory intensive
>operators which use a large, representative context. I think the
>HashTableContext for HashJoin might be one?
>

Yes, that might be a good candidate (and it would be much simpler than
the manual accounting we use now).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

From: Melanie Plageman
I wanted to address a couple of questions Jeff asked me off-list about
Greenplum's implementations of Memory Accounting.

Greenplum has two memory accounting sub-systems -- one is the
MemoryContext-based system proposed here.
The other memory accounting system tracks "logical" memory owners in
global accounts. For example, every node in a plan has an account,
however, there are other execution contexts, such as the parser, which
have their own logical memory accounts.
Notably, this logical memory account system also tracks chunks instead
of blocks.

The rationale for tracking memory at the logical owner level was that
memory for a logical owner may allocate memory across multiple
contexts and a single context may contain memory belonging to several
of these logical owners.

More compellingly, many of the allocations done during execution are
done directly in the per query or per tuple context--as opposed to
being done in their own uniquely named context. Arguably, this is
because those logical owners (a Result node, for example) are not
memory-intensive and thus do not require specific memory accounting.
However, when debugging a memory leak or OOM, the specificity of
logical owner accounts was seen as desirable. A discrepancy between
memory allocated and memory freed in the per query context doesn't
provide a lot of hints as to the source of the leak.
At the least, there was no meaningful way to represent MemoryContext
account balances in EXPLAIN ANALYZE. Where would the TopMemoryContext
be represented, for example?

Also, by using logical accounts, each node in the plan could be
assigned a quota at plan time--because many memory intensive operators
will not have relinquished the memory they hold when other nodes are
executing (e.g. Materialize)--so, instead of granting each node
work_mem, work_mem is divided up into quotas for each operator in a
particular way. This was meant to pave the way for work_mem
enforcement. This is a topic that has come up in various ways in other
forums. For example, in the XPRS thread, the discussion of erroring
out for queries with no "escape mechanism" brought up by Thomas Munro
[1] is a kind of work_mem enforcement (this discussion was focused
more on a proposed session-level memory setting, but it is still
enforcement of a memory setting).
It was also discussed at PGCon this year in an unconference session on
OOM-detection and debugging, runaway query termination, and
session-level memory consumption tracking [2].

The motivation for tracking chunks instead of blocks was to understand
the "actual" memory consumption of different components in the
database. Then, eventually, memory consumption patterns would emerge
and improvements could be made to memory allocation strategies to suit
different use cases--perhaps other implementations of the
MemoryContext API similar to Slab and Generation were envisioned.
Apparently, it did lead to the discovery of some memory fragmentation
issues which were tuned.

I bring these up not just to answer Jeff's question but also to
provide some anecdotal evidence that the patch here is a good base for
other memory accounting and tracking schemes.

Even if HashAgg is the only initial consumer of the memory accounting
framework, we know that tuplesort can make use of it in its current
state as well. And, if another node or component requires
chunk-tracking, it could provide a different MemoryContext API
implementation which uses the MemoryContextData->mem_allocated field
to track chunks instead of blocks in its alloc/free functions.

Ideas like logical memory accounting could leverage the mem_allocated
field and build on top of it.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g%40mail.gmail.com
[2] https://wiki.postgresql.org/wiki/PgCon_2019_Developer_Unconference

Re: Memory Accounting

From: Jeff Davis
On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> Hi Jeff,

Hi Soumyadeep and Melanie,

Thank you for the review!

> max_stack_depth    max level    lazy (ms)    eager (ms)    (eager/lazy)
> 2MB                82           302.715      427.554       1.4123978
> 3MB                3474         567.829      896.143       1.578191674
> 7.67MB             8694         2657.972     4903.063      1.844663149

Thank you for collecting data on this. Were you able to find any
regression when compared to no memory accounting at all?

It looks like you agree with the approach and the results. Did you find
any other issues with the patch?

I am also including Robert in this thread. He had some concerns the
last time around due to a small regression on POWER.

Regards,
    Jeff Davis




Re: Memory Accounting

From: Michael Paquier
On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
> I think Heikki was asking about places with a lot of sub-contexts, which is a
> completely different issue. It used to be the case that some aggregates
> created a separate context for each group - like array_agg. That would
> make Jeff's approach to accounting rather inefficient, because checking
> how much memory is used would be very expensive (having to loop over a
> large number of contexts).

The patch has been marked as ready for committer for a week or so, but
it seems to me that this comment has not been addressed, no?  Are we
sure that we want this method if it proves to be inefficient when
there are many sub-contexts and shouldn't we at least test such
scenarios with a worst-case, custom-made function?
--
Michael

Attachments

Re: Memory Accounting

From: Tomas Vondra
On Tue, Sep 24, 2019 at 02:21:40PM +0900, Michael Paquier wrote:
>On Wed, Jul 24, 2019 at 11:52:28PM +0200, Tomas Vondra wrote:
>> I think Heikki was asking about places with a lot of sub-contexts, which is a
>> completely different issue. It used to be the case that some aggregates
>> created a separate context for each group - like array_agg. That would
>> make Jeff's approach to accounting rather inefficient, because checking
>> how much memory is used would be very expensive (having to loop over a
>> large number of contexts).
>
>The patch has been marked as ready for committer for a week or so, but
>it seems to me that this comment has not been addressed, no?  Are we
>sure that we want this method if it proves to be inefficient when
>there are many sub-contexts and shouldn't we at least test such
>scenarios with a worst-case, customly-made, function?

I don't think so.

Aggregates creating many memory contexts (context for each group) was
discussed extensively in the thread about v11 [1] in 2015. And back then
the conclusion was that that's a pretty awful pattern anyway, as it uses
much more memory (no cross-context freelists), and has various other
issues. In a way, those aggregates are wrong and should be fixed just
like we fixed array_agg/string_agg (even without the memory accounting).

The way I see it we can do either eager or lazy accounting. Eager might
work better for aggregates with many contexts, but it does increase the
overhead for the "regular" aggregates with just one or two contexts.
Considering how rare those many-context aggregates are (I'm not aware of
any such aggregate at the moment), it seems reasonable to pick the lazy
accounting.

(Note: Another factor affecting the lazy vs. eager efficiency is the
number of palloc/pfree calls vs. calls to determine the amount of memory
used, but that's mostly orthogonal and we can ignore it here.)
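
To make the trade-off concrete, the two approaches differ roughly like
this (illustrative sketch only, not the actual patch code):

    /* eager: every block allocation updates the whole parent chain */
    MemoryContext c;

    for (c = context; c != NULL; c = c->parent)
        c->mem_allocated += blksize;

    /* lazy (what the patch does): allocation touches only one context */
    context->mem_allocated += blksize;

    /* ... and the reader recurses over children on demand */
    total = MemoryContextMemAllocated(context, true);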

So I think the approach Jeff ended up with is sensible - certainly as a
first step. We may improve it in the future, of course, once we have
more practical experience.

Barring objections, I do plan to get this committed by the end of this
CF (i.e. sometime later this week).

[1] https://www.postgresql.org/message-id/1434311039.4369.39.camel%40jeff-desktop

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Memory Accounting

From: Melanie Plageman

On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
> On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
> > Hi Jeff,
> 
> Hi Soumyadeep and Melanie,
> 
> Thank you for the review!
> 
> > max_stack_depth    max level    lazy (ms)    eager (ms)    (eager/lazy)
> > 2MB                82           302.715      427.554       1.4123978
> > 3MB                3474         567.829      896.143       1.578191674
> > 7.67MB             8694         2657.972     4903.063      1.844663149
> 
> Thank you for collecting data on this. Were you able to find any
> regression when compared to no memory accounting at all?

We didn't spend much time comparing performance with and without
memory accounting, as it seems like this was discussed extensively in
the previous thread.

> It looks like you agree with the approach and the results. Did you find
> any other issues with the patch?

We didn't observe any other problems with the patch and agree with the
approach. It is a good start.

> I am also including Robert in this thread. He had some concerns the
> last time around due to a small regression on POWER.

I think it would be helpful if we could repeat the performance tests
Robert did on that machine with the current patch (unless this version
of the patch is exactly the same as the ones he tested previously).

Thanks,
Soumyadeep & Melanie

Re: Memory Accounting

From: Michael Paquier
On Tue, Sep 24, 2019 at 02:05:51PM +0200, Tomas Vondra wrote:
> The way I see it we can do either eager or lazy accounting. Eager might
> work better for aggregates with many contexts, but it does increase the
> overhead for the "regular" aggregates with just one or two contexts.
> Considering how rare those many-context aggregates are (I'm not aware of
> any such aggregate at the moment), it seems reasonable to pick the lazy
> accounting.

Okay.

> So I think the approach Jeff ended up with is sensible - certainly as a
> first step. We may improve it in the future, of course, once we have
> more practical experience.
>
> Barring objections, I do plan to get this committed by the end of this
> CF (i.e. sometime later this week).

Sounds good to me.  Though I have not looked at the patch in detail,
the arguments are sensible.  Thanks for confirming.
--
Michael

Attachments

Re: Memory Accounting

From: Tomas Vondra
On Tue, Sep 24, 2019 at 11:46:49AM -0700, Melanie Plageman wrote:
>On Thu, Sep 19, 2019 at 11:00 AM Jeff Davis <pgsql@j-davis.com> wrote:
>
>> On Wed, 2019-09-18 at 13:50 -0700, Soumyadeep Chakraborty wrote:
>> > Hi Jeff,
>>
>> Hi Soumyadeep and Melanie,
>>
>> Thank you for the review!
>>
>> > max_stack_depth    max level    lazy (ms)    eager (ms)    (eager/lazy)
>> > 2MB                82           302.715      427.554       1.4123978
>> > 3MB                3474         567.829      896.143       1.578191674
>> > 7.67MB             8694         2657.972     4903.063      1.844663149
>>
>> Thank you for collecting data on this. Were you able to find any
>> regression when compared to no memory accounting at all?
>>
>>
>We didn't spend much time comparing performance with and without
>memory accounting, as it seems like this was discussed extensively in
>the previous thread.
>
>
>> It looks like you agree with the approach and the results. Did you find
>> any other issues with the patch?
>>
>
>We didn't observe any other problems with the patch and agree with the
>approach. It is a good start.
>
>
>>
>> I am also including Robert in this thread. He had some concerns the
>> last time around due to a small regression on POWER.
>>
>
>I think it would be helpful if we could repeat the performance tests
>Robert did on that machine with the current patch (unless this version
>of the patch is exactly the same as the ones he tested previously).
>

I agree that would be nice, but I don't have access to any Power machine
suitable for this kind of benchmarking :-( Robert, any chance you still
have access to that machine?

It's worth mentioning that those benchmarks (I'm assuming we're talking
about the numbers Robert shared in [1]) were done on patches that used
the eager accounting approach (i.e. walking all parent contexts and
updating the accounting for them).

I'm pretty sure the current "lazy accounting" patches don't have that
issue, so unless someone objects and/or can show numbers demonstrating
I'm wrong, I'll stick to my plan to get this committed soon.

regards

[1]
https://www.postgresql.org/message-id/flat/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com#3e2dc9e70a9f9eb2d695ab94a580c5a2

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting

From: Jeff Davis
On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
> It's worth mentioning that those benchmarks (I'm assuming we're
> talking about the numbers Robert shared in [1]) were done on patches
> that used the eager accounting approach (i.e. walking all parent
> contexts and updating the accounting for them).
> 
> I'm pretty sure the current "lazy accounting" patches don't have that
> issue, so unless someone objects and/or can show numbers demonstrating
> I'm wrong, I'll stick to my plan to get this committed soon.

That was my conclusion, as well.

Regards,
    Jeff Davis




Re: Memory Accounting

From: Tomas Vondra
On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:
>On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
>> It's worth mentioning that those benchmarks (I'm assuming we're
>> talking about the numbers Robert shared in [1]) were done on patches
>> that used the eager accounting approach (i.e. walking all parent
>> contexts and updating the accounting for them).
>>
>> I'm pretty sure the current "lazy accounting" patches don't have that
>> issue, so unless someone objects and/or can show numbers demonstrating
>> I'm wrong, I'll stick to my plan to get this committed soon.
>
>That was my conclusion, as well.
>

I was about to commit the patch, but during the final review I've
noticed two places that I think are bugs:

1) aset.c (AllocSetDelete)
--------------------------

#ifdef CLOBBER_FREED_MEMORY
        wipe_mem(block, block->freeptr - ((char *) block));
#endif

        if (block != set->keeper)
        {
            context->mem_allocated -= block->endptr - ((char *) block);
            free(block);
        }

2) generation.c (GenerationReset)
---------------------------------

#ifdef CLOBBER_FREED_MEMORY
        wipe_mem(block, block->blksize);
#endif
        context->mem_allocated -= block->blksize;


Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
wipe_mem and then accesses fields of the (wiped) block. Interestingly
enough, the regression tests don't seem to exercise these bits - I've
tried adding elog(ERROR) and it still passes. For (2) that's not very
surprising because the Generation context is only really used in logical
decoding (and we don't delete the context, I think). Not sure about (1),
but it might be because AllocSetReset does the right thing and only
leaves behind the keeper block.

I'm pretty sure a custom function calling the contexts explicitly would
fall over, but I haven't tried.

Aside from that, I've repeated the REINDEX benchmarks done by Robert in
[1] with different scales on two different machines, and I've measured
no difference. Both machines are x86_64, I don't have access to any
Power machine at the moment, unfortunately.

[1] https://www.postgresql.org/message-id/CA%2BTgmobnu7XEn1gRdXnFo37P79bF%3DqLt46%3D37ajP3Cro9dBRaA%40mail.gmail.com


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting

From: Tomas Vondra
On Sun, Sep 29, 2019 at 12:12:49AM +0200, Tomas Vondra wrote:
>On Thu, Sep 26, 2019 at 01:36:46PM -0700, Jeff Davis wrote:
>>On Thu, 2019-09-26 at 21:22 +0200, Tomas Vondra wrote:
>>>It's worth mentioning that those benchmarks (I'm assuming we're
>>>talking about the numbers Robert shared in [1]) were done on patches
>>>that used the eager accounting approach (i.e. walking all parent
>>>contexts and updating the accounting for them).
>>>
>>>I'm pretty sure the current "lazy accounting" patches don't have that
>>>issue, so unless someone objects and/or can show numbers demonstrating
>>>I'm wrong, I'll stick to my plan to get this committed soon.
>>
>>That was my conclusion, as well.
>>
>
>I was about to commit the patch, but during the final review I've
>noticed two places that I think are bugs:
>
>1) aset.c (AllocSetDelete)
>--------------------------
>
>#ifdef CLOBBER_FREED_MEMORY
>       wipe_mem(block, block->freeptr - ((char *) block));
>#endif
>
>       if (block != set->keeper)
>       {
>           context->mem_allocated -= block->endptr - ((char *) block);
>           free(block);
>       }
>
>2) generation.c (GenerationReset)
>---------------------------------
>
>#ifdef CLOBBER_FREED_MEMORY
>       wipe_mem(block, block->blksize);
>#endif
>       context->mem_allocated -= block->blksize;
>
>
>Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
>wipe_mem and then accesses fields of the (wiped) block. Interestingly
>enough, the regression tests don't seem to exercise these bits - I've
>tried adding elog(ERROR) and it still passes. For (2) that's not very
>surprising because the Generation context is only really used in logical
>decoding (and we don't delete the context, I think). Not sure about (1),
>but it might be because AllocSetReset does the right thing and only
>leaves behind the keeper block.
>
>I'm pretty sure a custom function calling the contexts explicitly would
>fall over, but I haven't tried.
>

Oh, and one more thing - this probably needs to add at least some basic 
explanation of the accounting to src/backend/mmgr/README.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting

From: Jeff Davis
On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:
> Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
> wipe_mem and then accesses fields of the (wiped) block. Interestingly
> enough, the regression tests don't seem to exercise these bits - I've
> tried adding elog(ERROR) and it still passes. For (2) that's not very
> surprising because the Generation context is only really used in logical
> decoding (and we don't delete the context, I think). Not sure about (1),
> but it might be because AllocSetReset does the right thing and only
> leaves behind the keeper block.
> 
> I'm pretty sure a custom function calling the contexts explicitly would
> fall over, but I haven't tried.
> 

Fixed.

I tested with some custom use of memory contexts. The reason
AllocSetDelete() didn't fail before is that most memory contexts use
the free lists (the list of free memory contexts, not the free list of
chunks), so you need to specify a non-default minsize in order to
prevent that and trigger the bug.

AllocSetReset() worked, but it was reading the header of the keeper
block after wiping the contents of the keeper block. It technically
worked, because the header of the keeper block was not wiped, but it
seems more clear to explicitly save the size of the keeper block. In
AllocSetDelete(), saving the keeper size is required, because it wipes
the block headers in addition to the contents.
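
Concretely, the pattern in the fix is to capture the size before the
block is wiped, e.g. (sketch; the committed code differs in details):

    Size        blksize = block->endptr - ((char *) block);

#ifdef CLOBBER_FREED_MEMORY
    wipe_mem(block, block->freeptr - ((char *) block));
#endif

    if (block != set->keeper)
    {
        context->mem_allocated -= blksize;
        free(block);
    }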

> Oh, and one more thing - this probably needs to add at least some
> basic 
> explanation of the accounting to src/backend/mmgr/README.

Added.

Regards,
    Jeff Davis


Вложения

Re: Memory Accounting

From: Tomas Vondra
On Mon, Sep 30, 2019 at 01:34:13PM -0700, Jeff Davis wrote:
>On Sun, 2019-09-29 at 00:22 +0200, Tomas Vondra wrote:
>> Notice that when CLOBBER_FREED_MEMORY is defined, the code first calls
>> wipe_mem and then accesses fields of the (wiped) block. Interestingly
>> enough, the regression tests don't seem to exercise these bits - I've
>> tried adding elog(ERROR) and it still passes. For (2) that's not very
>> surprising because the Generation context is only really used in logical
>> decoding (and we don't delete the context, I think). Not sure about (1),
>> but it might be because AllocSetReset does the right thing and only
>> leaves behind the keeper block.
>>
>> I'm pretty sure a custom function calling the contexts explicitly would
>> fall over, but I haven't tried.
>>
>
>Fixed.
>
>I tested with some custom use of memory contexts. The reason
>AllocSetDelete() didn't fail before is that most memory contexts use
>the free lists (the list of free memory contexts, not the free list of
>chunks), so you need to specify a non-default minsize in order to
>prevent that and trigger the bug.
>
>AllocSetReset() worked, but it was reading the header of the keeper
>block after wiping the contents of the keeper block. It technically
>worked, because the header of the keeper block was not wiped, but it
>seems more clear to explicitly save the size of the keeper block. In
>AllocSetDelete(), saving the keeper size is required, because it wipes
>the block headers in addition to the contents.
>

OK, looks good to me now.

>> Oh, and one more thing - this probably needs to add at least some
>> basic
>> explanation of the accounting to src/backend/mmgr/README.
>
>Added.
>

Well, I meant a couple of paragraphs explaining the motivation and the
relevant trade-offs considered. So I've written a brief summary of the
design as I understand it and pushed it like that. Of course, if you
could proof-read it, that'd be good.

I had a bit of a hard time deciding who to list as a reviewer - this
patch started sometime in ~2015, and it was initially discussed as part
of the larger hashagg effort, with plenty of people discussing various
ancient versions of the patch. In the end I've included just people from
the current thread. If that omits important past reviews, I'm sorry.

For the record, results of the benchmarks I've done over the past couple
of days are in [1]. They include both the reindex benchmark used by
Robert in 2015, and a regular read-only pgbench. In general, the
overhead of the accounting is pretty much indistinguishable from noise
(at least on those two machines).

In any case, thanks for the perseverance working on this.


[1] https://github.com/tvondra/memory-accounting-benchmarks


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Memory Accounting

From: Tom Lane
So ... why exactly did this patch define MemoryContextData.mem_allocated
as int64?  That seems to me to be doubly wrong: it is not the right width
on 32-bit machines, and it is not the right signedness anywhere.  I think
that field ought to be of type Size (a/k/a size_t, but memnodes.h always
calls it Size).

I let this pass when the patch went in, but now I'm on the warpath
about it, because since c477f3e449 went in, some of the 32-bit buildfarm
members are failing with

2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG:  statement: CREATE INDEX text_idx on test__int using gist
(a gist__int_ops ); 
TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
2019-10-04 00:42:25.505 CEST [63836:11] LOG:  server process (PID 66916) was terminated by signal 6: Abort trap
2019-10-04 00:42:25.505 CEST [63836:12] DETAIL:  Failed process was running: CREATE INDEX text_idx on test__int using
gist( a gist__int_ops ); 

What I think is happening is that c477f3e449 allowed this bit in
AllocSetRealloc:

    context->mem_allocated += blksize - oldblksize;

to be executed in situations where blksize < oldblksize, where before
that was not possible.  Of course blksize and oldblksize are of type
Size, hence unsigned, so the subtraction result underflows in this
case.  If mem_allocated is of the same width as Size then this does
not matter because the final result wraps around to the proper value,
but if we're going to allow mem_allocated to be wider than Size then
we will need more logic here to add or subtract as appropriate.
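
To spell out the arithmetic on a 32-bit build (illustrative numbers):

    /* Size == uint32; mem_allocated declared as int64 */
    Size        blksize = 4096;        /* realloc'd block got smaller */
    Size        oldblksize = 8192;

    /* blksize - oldblksize underflows to 2^32 - 4096 = 4294963200 */
    context->mem_allocated += blksize - oldblksize;

    /*
     * With an int64 accumulator this adds ~4GB instead of subtracting
     * 4096, so the assertion comparing it against the real total fails.
     * A Size-width accumulator would wrap back around to the correct
     * value.
     */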

(I'm not quite sure why we're not seeing this failure on *all* the
32-bit machines; maybe there's some other factor involved?)

I see no value in defining mem_allocated to be wider than Size.
Yes, the C standard contemplates the possibility that the total
available address space is larger than the largest chunk you can
ever ask malloc for; but nobody has built a platform like that in
this century, and they sucked to program on back in the dark ages
when they did exist.  (I speak from experience.)  I do not think
we need to design Postgres to allow for that.

Likewise, there's no evident value in allowing mem_allocated
to be negative.

I haven't chased down exactly what else would need to change.
It might be that s/int64/Size/g throughout the patch is
sufficient, but I haven't analyzed it.

            regards, tom lane



Re: Memory Accounting

From: Tomas Vondra
On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
>So ... why exactly did this patch define MemoryContextData.mem_allocated
>as int64?  That seems to me to be doubly wrong: it is not the right width
>on 32-bit machines, and it is not the right signedness anywhere.  I think
>that field ought to be of type Size (a/k/a size_t, but memnodes.h always
>calls it Size).
>

Yeah, I think that's an oversight. Maybe there's a reason why Jeff used
int64, but I can't think of any.

>I let this pass when the patch went in, but now I'm on the warpath
>about it, because since c477f3e449 went in, some of the 32-bit buildfarm
>members are failing with
>
>2019-10-04 00:41:56.569 CEST [66916:86] pg_regress/_int LOG:  statement: CREATE INDEX text_idx on test__int using gist
(a gist__int_ops );
 
>TRAP: FailedAssertion("total_allocated == context->mem_allocated", File: "aset.c", Line: 1533)
>2019-10-04 00:42:25.505 CEST [63836:11] LOG:  server process (PID 66916) was terminated by signal 6: Abort trap
>2019-10-04 00:42:25.505 CEST [63836:12] DETAIL:  Failed process was running: CREATE INDEX text_idx on test__int using
gist( a gist__int_ops );
 
>
>What I think is happening is that c477f3e449 allowed this bit in
>AllocSetRealloc:
>
>    context->mem_allocated += blksize - oldblksize;
>
>to be executed in situations where blksize < oldblksize, where before
>that was not possible.  Of course blksize and oldblksize are of type
>Size, hence unsigned, so the subtraction result underflows in this
>case.  If mem_allocated is of the same width as Size then this does
>not matter because the final result wraps around to the proper value,
>but if we're going to allow mem_allocated to be wider than Size then
>we will need more logic here to add or subtract as appropriate.
>
>(I'm not quite sure why we're not seeing this failure on *all* the
>32-bit machines; maybe there's some other factor involved?)
>

Interesting failure mode (especially that it does *not* fail on some
32-bit machines).

>I see no value in defining mem_allocated to be wider than Size.
>Yes, the C standard contemplates the possibility that the total
>available address space is larger than the largest chunk you can
>ever ask malloc for; but nobody has built a platform like that in
>this century, and they sucked to program on back in the dark ages
>when they did exist.  (I speak from experience.)  I do not think
>we need to design Postgres to allow for that.
>
>Likewise, there's no evident value in allowing mem_allocated
>to be negative.
>
>I haven't chased down exactly what else would need to change.
>It might be that s/int64/Size/g throughout the patch is
>sufficient, but I haven't analyzed it.
>

I think so too, but I'll take a closer look in the afternoon, unless you
beat me to it.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Memory Accounting

From: Tomas Vondra
On Fri, Oct 04, 2019 at 10:26:44AM +0200, Tomas Vondra wrote:
>On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
>>I haven't chased down exactly what else would need to change.
>>It might be that s/int64/Size/g throughout the patch is
>>sufficient, but I haven't analyzed it.
>>
>
>I think so too, but I'll take a closer look in the afternoon, unless you
>beat me to it.
>

I've pushed a fix changing the type to Size and splitting the
mem_allocated update into two separate steps (to prevent any underflow
in the subtraction). Hopefully this fixes the 32-bit machines ...
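
In AllocSetRealloc that means, roughly (sketch of the committed change):

    /*
     * mem_allocated is now a Size; update it in two steps so that the
     * blksize < oldblksize case cannot underflow an unsigned subtraction.
     */
    context->mem_allocated -= oldblksize;
    context->mem_allocated += blksize;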

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Memory Accounting

From: Jeff Davis
On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
> On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
> > So ... why exactly did this patch define
> > MemoryContextData.mem_allocated
> > as int64?  That seems to me to be doubly wrong: it is not the right
> > width
> > on 32-bit machines, and it is not the right signedness anywhere.  I
> > think
> > that field ought to be of type Size (a/k/a size_t, but memnodes.h
> > always
> > calls it Size).
> > 
> 
> Yeah, I think that's an oversight. Maybe there's a reason why Jeff
> used
> int64, but I can't think of any.

I had chosen a 64-bit value to account for the situation Tom mentioned:
that, in theory, Size might not be large enough to represent all
allocations in a memory context. Apparently, that theoretical situation
is not worth being concerned about.

The patch has been floating around for a very long time, so I don't
remember exactly why I chose a signed value. Sorry.

Regards,
    Jeff Davis





Re: Memory Accounting

From: Tom Lane
Jeff Davis <pgsql@j-davis.com> writes:
> On Fri, 2019-10-04 at 10:26 +0200, Tomas Vondra wrote:
>> Yeah, I think that's an oversight. Maybe there's a reason why Jeff
>> used int64, but I can't think of any.

> I had chosen a 64-bit value to account for the situation Tom mentioned:
> that, in theory, Size might not be large enough to represent all
> allocations in a memory context. Apparently, that theoretical situation
> is not worth being concerned about.

Well, you could also argue it the other way: maybe in our children's
time, int64 won't be as wide as Size.  (Yeah, I know that sounds
ridiculous, but needing pointers wider than 32 bits was a ridiculous
idea too when I started in this business.)

The committed fix seems OK to me except that I think you should've
also changed MemoryContextMemAllocated() to return Size.

            regards, tom lane



Re: Memory Accounting

From: Tom Lane
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> On Fri, Oct 04, 2019 at 12:36:01AM -0400, Tom Lane wrote:
>> What I think is happening is that c477f3e449 allowed this bit in
>> AllocSetRealloc:
>> context->mem_allocated += blksize - oldblksize;
>> to be executed in situations where blksize < oldblksize, where before
>> that was not possible.
>> ...
>> (I'm not quite sure why we're not seeing this failure on *all* the
>> 32-bit machines; maybe there's some other factor involved?)

> Interesting failure mode (especially that it does *not* fail on some
> 32-bit machines).

Just to make things even more mysterious, prairiedog finally showed
the Assert failure on its fourth run with c477f3e449 included:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41

It's also now apparent that lapwing and locust were failing only
sometimes, as well.  I totally don't understand why that failure
would've been only partially reproducible.  Maybe we should dig
a bit harder, rather than just deciding that we fixed it.

            regards, tom lane



Re: Memory Accounting

From: Tom Lane
I wrote:
> Just to make things even more mysterious, prairiedog finally showed
> the Assert failure on its fourth run with c477f3e449 included:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2019-10-04%2012%3A35%3A41
> It's also now apparent that lapwing and locust were failing only
> sometimes, as well.  I totally don't understand why that failure
> would've been only partially reproducible.  Maybe we should dig
> a bit harder, rather than just deciding that we fixed it.

I did that digging, reproducing the problem on florican's host
(again intermittently).  Here's a stack trace from the spot where
we sometimes downsize a large chunk:

#5  0x0851c70a in AllocSetRealloc (context=0x31b35000, pointer=0x319e5020,
    size=1224) at aset.c:1158
#6  0x085232eb in repalloc (pointer=0x319e5020, size=1224) at mcxt.c:1082
#7  0x31b69591 in resize_intArrayType (num=300, a=0x319e5020)
    at _int_tool.c:268
#8  resize_intArrayType (a=0x319e5020, num=300) at _int_tool.c:250
#9  0x31b6995d in _int_unique (r=0x319e5020) at _int_tool.c:329
#10 0x31b66a00 in g_int_union (fcinfo=0xffbfcc5c) at _int_gist.c:146
#11 0x084ff9c4 in FunctionCall2Coll (flinfo=0x319e2bb4, collation=100,
    arg1=834250780, arg2=4290759884) at fmgr.c:1162
#12 0x080db3a3 in gistMakeUnionItVec (giststate=0x319e2820, itvec=0x31bae4a4,
    len=15, attr=0xffbfce5c, isnull=0xffbfcedc) at gistutil.c:204
#13 0x080e410d in gistunionsubkeyvec (giststate=giststate@entry=0x319e2820,
    itvec=itvec@entry=0x31bb5ed4, gsvp=gsvp@entry=0xffbfcd4c) at gistsplit.c:64
#14 0x080e416f in gistunionsubkey (giststate=giststate@entry=0x319e2820,
    itvec=itvec@entry=0x31bb5ed4, spl=spl@entry=0xffbfce3c) at gistsplit.c:91
#15 0x080e4456 in gistSplitByKey (r=<optimized out>, page=<optimized out>,
    itup=<optimized out>, len=<optimized out>, giststate=<optimized out>,
    v=<optimized out>, attno=<optimized out>) at gistsplit.c:689
#16 0x080d8797 in gistSplit (r=0x31bbb424, page=0x297e0b80 "",
    itup=0x31bb5ed4, len=16, giststate=0x319e2820) at gist.c:1432
#17 0x080d8d9c in gistplacetopage (rel=<optimized out>,
    freespace=<optimized out>, giststate=<optimized out>,
    buffer=<optimized out>, itup=<optimized out>, ntup=<optimized out>,
    oldoffnum=<optimized out>, newblkno=<optimized out>,
    leftchildbuf=<optimized out>, splitinfo=<optimized out>,
    markfollowright=<optimized out>, heapRel=<optimized out>,
    is_build=<optimized out>) at gist.c:299

So the potential downsize is expected, triggered by _int_unique
being able to remove some duplicate entries from a GIST union key.
AFAICT the fact that it happens only intermittently must boil down
to the randomized insertion choices that gistchoose() sometimes makes.

In short: nothing to see here, move along.

            regards, tom lane



Re: Memory Accounting

From: Peter Geoghegan
On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
> The patch has been floating around for a very long time, so I don't
> remember exactly why I chose a signed value. Sorry.

I am reminded of the fact that int64 is used to size buffers within
tuplesort.c, because it needs to support negative availMem sizes --
when huge allocations were first supported, tuplesort.c briefly used
"Size", which didn't work. Perhaps it had something to do with that.

--
Peter Geoghegan



Re: Memory Accounting

From: Tom Lane
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Oct 4, 2019 at 7:32 AM Jeff Davis <pgsql@j-davis.com> wrote:
>> The patch has been floating around for a very long time, so I don't
>> remember exactly why I chose a signed value. Sorry.

> I am reminded of the fact that int64 is used to size buffers within
> tuplesort.c, because it needs to support negative availMem sizes --
> when huge allocations were first supported, tuplesort.c briefly used
> "Size", which didn't work. Perhaps it had something to do with that.

I wonder if we should make that use ssize_t instead.  Probably
not worth the trouble.

            regards, tom lane



Re: Memory Accounting

From: Robert Haas
On Tue, Sep 24, 2019 at 2:47 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> I think it would be helpful if we could repeat the performance tests
> Robert did on that machine with the current patch (unless this version
> of the patch is exactly the same as the ones he tested previously).

I still have access to a POWER machine, but it's currently being used
by somebody else for a benchmarking project, so I can't test this
immediately.

It's probably worth noting that, in addition to whatever's changed in
this patch, tuplesort.c's memory management has been altered
significantly since 2015 - see
0011c0091e886b874e485a46ff2c94222ffbf550 and
e94568ecc10f2638e542ae34f2990b821bbf90ac. I'm not immediately sure how
that would affect the REINDEX case that I tested back then, but it
seems at least possible that they would have the effect of making
palloc overhead less significant. More generally, so much of the
sorting machinery has been overhauled by Peter Geoghegan since then
that what happens now may just not be very comparable to what happened
back then.

I do agree that this approach looks pretty light weight. Tomas's point
about the difference between updating only the current context and
updating all parent contexts seems right on target to me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company