scalability bottlenecks with (many) partitions (and more)

From:
Tomas Vondra
Date:
Hi,

I happened to investigate a query involving a partitioned table, which
led me to a couple of bottlenecks severely affecting queries dealing
with multiple partitions (or relations in general). After a while I came
up with three WIP patches that improve the behavior by an order of
magnitude, and not just in some extreme cases.


Consider a partitioned pgbench with 100 partitions, say:

   pgbench -i -s 100 --partitions 100 testdb

but let's modify the pgbench_accounts a little bit:

   ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
   UPDATE pgbench_accounts SET aid_parent = aid;
   CREATE INDEX ON pgbench_accounts(aid_parent);
   VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And
now let's run a query

    SELECT * FROM pgbench_accounts pa JOIN pgbench_branches pb
        ON (pa.bid = pb.bid) WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except filtering on a column that
does not allow partition elimination. Now, the plan looks like this:

                                QUERY PLAN
----------------------------------------------------------------------
 Hash Join  (cost=1.52..34.41 rows=10 width=465)
   Hash Cond: (pa.bid = pb.bid)
   ->  Append  (cost=0.29..33.15 rows=10 width=101)
         ->  Index Scan using pgbench_accounts_1_aid_parent_idx on
pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_2_aid_parent_idx on
pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_3_aid_parent_idx on
pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         ->  Index Scan using pgbench_accounts_4_aid_parent_idx on
pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
               Index Cond: (aid_parent = 3489734)
         -> ...
   ->  Hash  (cost=1.10..1.10 rows=10 width=364)
         ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10
width=364)


So yeah, scanning all 100 partitions. Not great, but no partitioning
scheme is perfect for all queries. Anyway, let's see how this works on a
big AMD EPYC machine with 96/192 cores - with "-M simple" we get:

parts      1       8      16      32     64       96     160      224
-----------------------------------------------------------------------
0      13877  105732  210890  410452  709509  844683  1050658  1163026
100      653    3957    7120   12022   12707   11813    10349     9633
1000      20     142     270     474     757     808      567      427

These are transactions per second, for different numbers of clients
(numbers in the header). With -M prepared the story doesn't change - the
numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput initially grows by
~13k tps per client, and then the growth gradually levels off - but it
keeps increasing the whole time.

But with 100 or 1000 partitions, the throughput peaks and then starts
dropping. Moreover, it's just a tiny fraction of the non-partitioned
value, and the gap is roughly equal to the number of partitions - for
example with 96 clients, the ratio between 0 and 1000 partitions is
844683/808 = 1045.

I could demonstrate the same behavior with fewer partitions - e.g. with
10 partitions you get ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning.
Imagine a star schema with a fact table and dimensions - you'll get the
same behavior depending on the number of dimensions you need to join
with. With "-M simple" you may get this, for example:

dims        1      8      16      32      64      96     160      224
----------------------------------------------------------------------
1       11737  92925  183678  361497  636598  768956  958679  1042799
10        462   3558    7086   13889   25367   29503   25353    24030
100         4     31      61     122     231     292     292      288

So, similar story - significant slowdown as we're adding dimensions.


Now, what could be causing this? Clearly, there's a bottleneck of some
kind, and we're hitting it. Some of this may simply be due to the execution
doing more stuff (more index scans, more initialization, ...), but maybe
not - one of the reasons I started looking into this was that the CPU was
not fully used even at small scales - it was maybe 60% utilized.

So I started poking at things. The first thing I thought about was
locking, obviously. That's consistent with the limited CPU utilization
(waiting on a lock = not running), and it's somewhat expected when using
many partitions - we need to lock all of them, and if we have 100 or
1000 of them, that's potentially a lot of locks.

From past experiments I know about two places where such a bottleneck
could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to give
it a try: increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see the 0001 patch). The
shared lock table has 16 partitions (each protected by its own LWLock) by
default - it's quite possible that on a machine with many cores and/or many
partitions we can easily hit this. So I bumped this 4x, to 64 partitions.
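
For context, the partition count is a compile-time constant, and each lock
hashcode maps to a partition by a simple modulo - so the change is
essentially just raising the exponent. Roughly like this (a sketch; the
actual 0001 patch may differ in details):

    /* src/include/storage/lock.h - sketch of the 0001 change */

    /* master has LOG2_NUM_LOCK_PARTITIONS = 4, i.e. 16 partitions */
    #define LOG2_NUM_LOCK_PARTITIONS    6    /* 4x more: 64 partitions */
    #define NUM_LOCK_PARTITIONS         (1 << LOG2_NUM_LOCK_PARTITIONS)

    /* each lock hashcode maps to exactly one partition (and its LWLock) */
    #define LockHashPartition(hashcode) \
        ((hashcode) % NUM_LOCK_PARTITIONS)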

For fast-path locking the changes are more complicated (see 0002). We
allow keeping 16 relation locks right in PGPROC, and only when those slots
fill up do we promote the locks to the actual lock table. But with enough
partitions we're guaranteed to fill these 16 slots, of course. And
increasing the number of slots is not simple - firstly, the information is
split between an array of 16 OIDs and a uint64 serving as a bitmap. Growing
the OID array is easy, but the auxiliary bitmap is harder. And there are
more problems - with more OIDs a simple linear search won't do, but a plain
hash table is not a good idea either, because of poor locality and the need
to delete stuff ...

What I ended up doing is having a hash table of 16-element arrays. There
are 64 "pieces", each essentially the (16 x OID + uint64 bitmap)
combination we have now. Each OID maps to exactly one of these pieces, as
in a hash table, and within each piece we do exactly the same thing we do
now (linear search, removal, etc.). This works great, and the locality is
great too. The one disadvantage is that it makes PGPROC larger, but I did a
lot of benchmarks and I haven't seen any regression that I could attribute
to this. (More about this later.)
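
To illustrate the layout (just a sketch of the idea - the names and the
exact bit encoding are illustrative, the actual 0002 patch may differ):

    /* sketch: fast-path locks split into fixed-size groups in PGPROC */
    #define FP_LOCK_GROUPS              64   /* number of "pieces" */
    #define FP_LOCK_SLOTS_PER_GROUP     16   /* same as today's slot count */

    typedef struct FastPathLockGroup
    {
        uint64      lockBits;   /* per-slot lockmode bits, packed as today */
        Oid         relid[FP_LOCK_SLOTS_PER_GROUP];
    } FastPathLockGroup;

    /* in PGPROC:  FastPathLockGroup fpGroups[FP_LOCK_GROUPS]; */

    /* a relation always hashes to exactly one group ... */
    #define FastPathLockGroupIndex(relid)   ((relid) % FP_LOCK_GROUPS)

    /* ... and within that group we keep the existing linear search,
     * insertion and removal logic, so locality stays as good as now */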

Unfortunately, for the pgbench join this does not make much difference.
But for the "star join" (with -M prepared) it does this:

             1      8     16    32       64       96      160       224
------------------------------------------------------------------------
master   21610 137450 247541 300902   270932   229692   191454   189233
patched  21664 151695 301451 594615  1036424  1211716  1480953  1656203
speedup    1.0    1.1    1.2    2.0      3.8      5.3      7.7      8.8

That's a pretty nice speedup, I think.

However, why doesn't the partitioned join improve (or at least not by
much)? Well, the perf profile says stuff like this:


9.16%    0.77%  postgres      [kernel.kallsyms]      [k] asm_exc_page_fault
     |      
      --8.39%--asm_exc_page_fault
            |      
             --7.52%--exc_page_fault
                    |      
                    --7.13%--do_user_addr_fault
                           |      
                            --6.64%--handle_mm_fault
                                  |      
                                   --6.29%--__handle_mm_fault
                                          |      
                                          |--2.17%--__mem_cgroup_charge
                                          |       |      
                                          |       |--1.25%--charge_memcg
                                          |       |       |      
                                          |       |        --0.57%-- ...
                                          |       |      
                                          |        --0.67%-- ...
                                          |      
                                          |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be
some sort of scalability problem in glibc malloc. I decided to see whether
the "memory pool" patch (which I mentioned in the memory limit thread as an
alternative way to introduce backend-level accounting/limits) could serve
as a backend-level malloc cache, and how that would work. So I cleaned up
the PoC patch I already had (see 0003) and gave it a try.

And with both patches applied, the results for the partitioned join with
100 partitions look like this:

-M simple

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master        653   3957    7120   12022   12707   11813   10349   9633
both patches  954   7356   14580   28259   51552   65278   70607  69598
speedup       1.5    1.9     2.0     2.4     4.1     5.5     6.8    7.2


-M prepared

                1      8      16      32      64      96     160    224
------------------------------------------------------------------------
master       1639   8273   14138   14746   13446   14001   11129  10136
both patches 4792  30102   62208  122157  220984  267763  315632 323567
speedup       2.9    3.6     4.4     8.3    16.4    19.1    28.4   31.9


That's pretty nice, I think. And I've seen many such improvements, it's
not a cherry-picked example. For the star join, the improvements are
very similar.

I'm attaching PDF files with a table visualizing results for these two
benchmarks - there are results for different numbers of partitions/scales,
and for different builds (master, one or both of the patches). There's also
a comparison to master, with the color scale "red = slower, green = faster"
(but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change is
much smaller. It's as if the two bottlenecks (locking and malloc) are in
balance - if you only address one of them, you don't get much. But if you
address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do have
some caching of malloc-ed memory in memory contexts, that doesn't quite
work cross-query, because we destroy the contexts at the end of the
query. We attempt to cache the memory contexts too, but in this case
that can't help because the allocations come from btbeginscan() where we
do this:

    so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
thus always allocated using a separate malloc() call. Maybe we could
break it into smaller/cacheable parts, but I haven't tried, and I doubt
it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I
think it's something we should consider doing - I'm sure there's stuff to
improve, but caching the malloc-ed blocks can clearly be very beneficial.
The basic idea is to have a cache that is "adaptive" (i.e. it adjusts to
caching blocks of the sizes needed by the workload) but also cheap. The
patch is PoC/WIP and needs more work, but I think it works quite well. If
anyone wants to take a look or have a chat at FOSDEM, for example, I'm
available.
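
To give an idea of the shape of the thing, here's a much simplified sketch
(not the actual 0003 code - the names are made up, and the real patch sizes
the cache adaptively rather than using fixed per-class limits):

    #include <stdlib.h>

    /* simplified sketch of a backend-local cache of malloc-ed blocks */
    #define POOL_MIN_SIZE       1024    /* smallest cached block size */
    #define POOL_NUM_CLASSES    16      /* power-of-two size classes */
    #define POOL_MAX_CACHED     32      /* free blocks kept per class */

    typedef struct PoolSizeClass
    {
        int     nfree;
        void   *blocks[POOL_MAX_CACHED];
    } PoolSizeClass;

    static PoolSizeClass pool[POOL_NUM_CLASSES];

    /* map a request size to a power-of-two size class */
    static int
    pool_class(size_t size)
    {
        int     c = 0;
        size_t  s = POOL_MIN_SIZE;

        while (s < size && c < POOL_NUM_CLASSES - 1)
        {
            s <<= 1;
            c++;
        }
        return c;
    }

    static size_t
    pool_class_size(int c)
    {
        return (size_t) POOL_MIN_SIZE << c;
    }

    void *
    pool_alloc(size_t size)
    {
        int     c = pool_class(size);

        if (size > pool_class_size(POOL_NUM_CLASSES - 1))
            return malloc(size);                    /* too large to cache */
        if (pool[c].nfree > 0)
            return pool[c].blocks[--pool[c].nfree]; /* reuse, no malloc */
        return malloc(pool_class_size(c));          /* miss: round up */
    }

    /* caller passes the same size it requested, as memory contexts do */
    void
    pool_free(void *ptr, size_t size)
    {
        int     c = pool_class(size);

        if (size > pool_class_size(POOL_NUM_CLASSES - 1) ||
            pool[c].nfree >= POOL_MAX_CACHED)
            free(ptr);                              /* don't keep it */
        else
            pool[c].blocks[pool[c].nfree++] = ptr;  /* keep for next query */
    }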

FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
the behavior a lot - it gets us maybe ~80% of the mempool benefits.
Which is nice, it confirms it's glibc-specific (I wonder if there's a
way to tweak glibc to address this), and it also means systems using
jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
says the mempool has ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the
number of file descriptors. Once you get to >1000 relations, you can easily
get into a situation like this:


54.18%    0.48%  postgres       [kernel.kallsyms]       [k]
entry_SYSCALL_64_after_hwframe
      |       
       --53.70%--entry_SYSCALL_64_after_hwframe
               |       
               --53.03%--do_syscall_64
                       |       
                       |--28.29%--__x64_sys_openat
                       |        |       
                       |         --28.14%--do_sys_openat2
                       |                |       
                       |                |--23.14%--do_filp_open
                       |                |        |       
                       |                |         --22.72%--path_openat


That's pretty bad - it means we're closing/opening file descriptors like
crazy, because every query needs the files. If I increase the number of
file descriptors (both in ulimit and max_files_per_process) to prevent this
thrashing, I can increase the throughput ~5x. Of course, this is not a
bottleneck that we can "fix" in code, it's simply a consequence of not
having enough file descriptors etc. But I wonder if we might make it easier
to monitor this, e.g. by tracking the fd cache hit ratio, or something like
that ...
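
For example (purely hypothetical - these counters don't exist today), it
might be enough to count how often a file access finds the kernel fd
already open vs. how often we have to evict an LRU entry and re-open() the
file, and expose the ratio somewhere:

    /* hypothetical counters fd.c could maintain and expose, e.g. in a view */
    static uint64 vfd_open_hits;    /* kernel fd was already open */
    static uint64 vfd_open_misses;  /* had to (re)open the file, possibly
                                     * evicting another fd from the LRU */

    static double
    fd_cache_hit_ratio(void)
    {
        uint64  total = vfd_open_hits + vfd_open_misses;

        return (total == 0) ? 1.0 : (double) vfd_open_hits / total;
    }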


There's a more complete set of benchmarking scripts and results for
these and other tests, in various formats (PDF, ODS, ...) at

    https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine,
but also smaller Intel machines (4C and 16C), and even two rpi5s (yes, it
helps even on a rpi5, quite a bit).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: scalability bottlenecks with (many) partitions (and more)

From:
Ronan Dunklau
Date:
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas!

I'll comment on the glibc malloc part, as I studied it last year and
proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop


> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some
insights into what malloc is doing. A more sophisticated approach would be
using stap and plugging it into the malloc probes, for example
memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in that regard comes from the
adjustment of the mmap and free threshold. By default, mmap adjusts them
dynamically and you can poke into that using the
memory_mallopt_free_dyn_thresholds probe.

>
> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either
M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
adjustment" beheviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using
sbrk isn't freed as easily, and you don't run into a pattern of moving the
sbrk pointer up and down repeatedly. The automatic trade off between the mmap
and trim thresholds is supposed to prevent that, but the way it is incremented
means you can end in a bad place depending on your particular allocation
pattern.
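
For example, something like this pins both thresholds (the values are just
placeholders, not recommendations); the same effect can be had without
recompiling via the MALLOC_MMAP_THRESHOLD_ / MALLOC_TRIM_THRESHOLD_
environment variables:

    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* setting either threshold explicitly disables the dynamic
         * auto-adjustment described above */
        mallopt(M_MMAP_THRESHOLD, 1024 * 1024);     /* <1MB stays on the heap */
        mallopt(M_TRIM_THRESHOLD, 8 * 1024 * 1024); /* keep up to 8MB untrimmed */
    }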

Best regards,

--
Ronan Dunklau






Re: scalability bottlenecks with (many) partitions (and more)

From:
Tomas Vondra
Date:
On 1/29/24 09:53, Ronan Dunklau wrote:
> On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:
> 
> Hi Tomas!
> 
> I'll comment on the glibc malloc part, as I studied it last year and
> proposed some things here:
> https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop
> 

Thanks for reminding me. I'll re-read that thread.

> 
>> FWIW where does the malloc overhead come from? For one, while we do have
>> some caching of malloc-ed memory in memory contexts, that doesn't quite
>> work cross-query, because we destroy the contexts at the end of the
>> query. We attempt to cache the memory contexts too, but in this case
>> that can't help because the allocations come from btbeginscan() where we
>> do this:
>>
>>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>>
>> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
>> thus always allocated using a separate malloc() call. Maybe we could
>> break it into smaller/cacheable parts, but I haven't tried, and I doubt
>> it's the only such allocation.
> 
> Did you try running an strace on the process? That may give you some
> insights into what malloc is doing. A more sophisticated approach would be
> using stap and plugging it into the malloc probes, for example 
> memory_sbrk_more and memory_sbrk_less. 
> 

No, I haven't tried that. In my experience strace is pretty expensive,
and if the issue is in glibc itself (before it does the syscalls),
strace won't really tell us much. Not sure, ofc.

> An important part of glibc's malloc behaviour in that regard comes from the 
> adjustment of the mmap and free threshold. By default, mmap adjusts them 
> dynamically and you can poke into that using the 
> memory_mallopt_free_dyn_thresholds probe.
> 

Thanks, I'll take a look at that.

>>
>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>> way to tweak glibc to address this), and it also means systems using
>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>> says the mempool has ~20% benefit on top of jemalloc.
> 
> GLIBC's malloc offers some tuning for this. In particular, setting either 
> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto 
> adjustment" beheviour and allow you to control what it's doing. 
> 
> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using 
> sbrk isn't freed as easily, and you don't run into a pattern of moving the 
> sbrk pointer up and down repeatedly. The automatic trade off between the mmap 
> and trim thresholds is supposed to prevent that, but the way it is incremented 
> means you can end in a bad place depending on your particular allocation 
> pattern.
> 

So, what values would you recommend for these parameters?

My concern is increasing those values would lead to (much) higher memory
usage, with little control over it. With the mempool we keep more
blocks, ofc, but we have control over freeing the memory.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: scalability bottlenecks with (many) partitions (and more)

From:
Ronan Dunklau
Date:
On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
> > Did you try running an strace on the process? That may give you some
> > insights into what malloc is doing. A more sophisticated approach would
> > be using stap and plugging it into the malloc probes, for example
> > memory_sbrk_more and memory_sbrk_less.
>
> No, I haven't tried that. In my experience strace is pretty expensive,
> and if the issue is in glibc itself (before it does the syscalls),
> strace won't really tell us much. Not sure, ofc.

It would tell you how malloc actually performs your allocations, and how often
they end up translated into syscalls. The main issue with glibc would be that
it releases the memory too aggressively to the OS, IMO.

>
> > An important part of glibc's malloc behaviour in that regard comes from
> > the
> > adjustment of the mmap and free threshold. By default, mmap adjusts them
> > dynamically and you can poke into that using the
> > memory_mallopt_free_dyn_thresholds probe.
>
> Thanks, I'll take a look at that.
>
> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> >> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> >> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> >> way to tweak glibc to address this), and it also means systems using
> >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> >> says the mempool has ~20% benefit on top of jemalloc.
> >
> > GLIBC's malloc offers some tuning for this. In particular, setting either
> > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
> > adjustment" beheviour and allow you to control what it's doing.
> >
> > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
> > using sbrk isn't freed as easily, and you don't run into a pattern of
> > moving the sbrk pointer up and down repeatedly. The automatic trade off
> > between the mmap and trim thresholds is supposed to prevent that, but the
> > way it is incremented means you can end in a bad place depending on your
> > particular allocation pattern.
>
> So, what values would you recommend for these parameters?
>
> My concern is increasing those values would lead to (much) higher memory
> usage, with little control over it. With the mempool we keep more
> blocks, ofc, but we have control over freeing the memory.

Right now depending on your workload (especially if you use connection
pooling) you can end up with something like 32 or 64MB of dynamically adjusted
trim-threshold which will never be released back.

The first heuristic I had in mind was to set it to work_mem, up to a
"reasonable" limit I guess. One can argue that it is expected for a backend to
use work_mem frequently, and as such it shouldn't be released back. By setting
work_mem to a lower value, we could ask glibc at the same time to trim the
excess kept memory. That could be useful when a long-lived connection is
pooled, and sees a spike in memory usage only once. Currently that could well
end up with 32MB "wasted" permanently but tuning it ourselves could allow us
to release it back.

Since it was last year I worked on this, I'm a bit fuzzy on the details but I
hope this helps.






Re: scalability bottlenecks with (many) partitions (and more)

From:
Tomas Vondra
Date:

On 1/29/24 15:15, Ronan Dunklau wrote:
> On Monday, January 29, 2024 at 13:17:07 CET, Tomas Vondra wrote:
>>> Did you try running an strace on the process? That may give you some
>>> insights into what malloc is doing. A more sophisticated approach would
>>> be using stap and plugging it into the malloc probes, for example
>>> memory_sbrk_more and memory_sbrk_less.
>>
>> No, I haven't tried that. In my experience strace is pretty expensive,
>> and if the issue is in glibc itself (before it does the syscalls),
>> strace won't really tell us much. Not sure, ofc.
> 
> It would tell you how malloc actually performs your allocations, and how often 
> they end up translated into syscalls. The main issue with glibc would be that 
> it releases the memory too aggressively to the OS, IMO.
> 
>>
>>> An important part of glibc's malloc behaviour in that regard comes from
>>> the
>>> adjustment of the mmap and free threshold. By default, mmap adjusts them
>>> dynamically and you can poke into that using the
>>> memory_mallopt_free_dyn_thresholds probe.
>>
>> Thanks, I'll take a look at that.
>>
>>>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
>>>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
>>>> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
>>>> Which is nice, it confirms it's glibc-specific (I wonder if there's a
>>>> way to tweak glibc to address this), and it also means systems using
>>>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
>>>> says the mempool has ~20% benefit on top of jemalloc.
>>>
>>> GLIBC's malloc offers some tuning for this. In particular, setting either
>>> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto
>>> adjustment" beheviour and allow you to control what it's doing.
>>>
>>> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated
>>> using sbrk isn't freed as easily, and you don't run into a pattern of
>>> moving the sbrk pointer up and down repeatedly. The automatic trade off
>>> between the mmap and trim thresholds is supposed to prevent that, but the
>>> way it is incremented means you can end in a bad place depending on your
>>> particular allocation pattern.
>>
>> So, what values would you recommend for these parameters?
>>
>> My concern is increasing those values would lead to (much) higher memory
>> usage, with little control over it. With the mempool we keep more
>> blocks, ofc, but we have control over freeing the memory.
> 
> Right now depending on your workload (especially if you use connection 
> pooling) you can end up with something like 32 or 64MB of dynamically adjusted 
> trim-threshold which will never be released back. 
> 

OK, so let's say I expect each backend to use ~90MB of memory (allocated
at once through memory contexts). How would you set the two limits? By
default M_MMAP_THRESHOLD is 128kB, which means blocks larger than that are
mmap-ed and released back to the OS immediately when freed.

But there are very few such allocations - the vast majority of blocks in
the benchmark workloads are <= 8kB or ~27kB (those from btbeginscan).

So I'm thinking about leaving M_MMAP_THRESHOLD as is, but increasing the
M_TRIM_THRESHOLD value to a couple of MB. But I doubt that'll really help,
because what I expect to happen is that we execute a query and it allocates
memory up to a high watermark of ~90MB. Then the query completes and we
release almost all of it - and even with the trim threshold set to e.g.
8MB, we'll return almost all of it to the OS, no?

What we want to do is say - hey, we needed 90MB, and now we need 8MB. We
could free 82MB, but maybe let's wait a bit and see if we need that
memory again. And that's pretty much what the mempool does, but I don't
see how to do that using the mmap options.

> The first heuristic I had in mind was to set it to work_mem, up to a
> "reasonable" limit I guess. One can argue that it is expected for a backend to 
> use work_mem frequently, and as such it shouldn't be released back. By setting 
> work_mem to a lower value, we could ask glibc at the same time to trim the 
> excess kept memory. That could be useful when a long-lived connection is 
> pooled, and sees a spike in memory usage only once. Currently that could well 
> end up with 32MB "wasted" permanently but tuning it ourselves could allow us 
> to release it back.
> 

I'm not sure work_mem is a good parameter to drive this. It doesn't say
how much memory we expect the backend to use - it's a per-operation
limit, so it doesn't work particularly well with partitioning (e.g. with
100 partitions, we may get 100 nodes, which is completely unrelated to
what work_mem says). A backend running the join query with 1000
partitions uses ~90MB (judging by data reported by the mempool), even
with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

The mempool could tell us how much memory we need (but we could track
this in some other way too, probably). And we could even adjust the mmap
parameters regularly, based on current workload.

But then there's the problem that the mmap parameters don't tell
us how much memory to keep, but how large chunks to release.

Let's say we want to keep the 90MB (to allocate the memory once and then
reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
but then it takes just a little bit of extra free memory to cross the
threshold and release all of it, or something.

> Since it was last year I worked on this, I'm a bit fuzzy on the details but I 
> hope this helps.
> 

Thanks for the feedback / insights!


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: scalability bottlenecks with (many) partitions (and more)

From:
Ronan Dunklau
Date:
On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
> I'm not sure work_mem is a good parameter to drive this. It doesn't say
> how much memory we expect the backend to use - it's a per-operation
> limit, so it doesn't work particularly well with partitioning (e.g. with
> 100 partitions, we may get 100 nodes, which is completely unrelated to
> what work_mem says). A backend running the join query with 1000
> partitions uses ~90MB (judging by data reported by the mempool), even
> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.

I understand your point - I was basing my previous observations on what a
backend typically does during execution.

>
> The mempool could tell us how much memory we need (but we could track
> this in some other way too, probably). And we could even adjust the mmap
> parameters regularly, based on current workload.
>
> But then there's the problem that the mmap parameters don't tell
> us how much memory to keep, but how large chunks to release.
>
> Let's say we want to keep the 90MB (to allocate the memory once and then
> reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
> but then it takes just a little bit of extra free memory to cross the
> threshold and release all of it, or something.

For doing this you can set M_TOP_PAD using glibc malloc, which makes sure
a certain amount of memory is always kept.

But the way the dynamic adjustment works makes it sort-of work like this.
MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't
expect to keep much memory around.

So even "small" memory allocations will be served using mmap at first. Once
mmaped memory is released, glibc considers it a benchmark for "normal"
allocations that can be routinely freed, and adjusts mmap_threshold to the
released mmaped region size, and trim threshold to two times that.

It means over time the two values will converge either to the max value (32MB
for MMAP_THRESHOLD, 64MB for the trim threshold) or to something big enough to
accommodate your released memory, since anything bigger than half the trim
threshold will be allocated using mmap.

Setting either parameter disables that.
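
As a concrete example (the value is just a placeholder), keeping ~64MB of
free memory at the top of the heap regardless of the thresholds could look
like this; the same thing can be set via the MALLOC_TOP_PAD_ environment
variable:

    #include <malloc.h>

    static void
    keep_top_pad(void)
    {
        /* keep ~64MB of free memory at the top of the heap when trimming,
         * instead of returning it to the kernel */
        mallopt(M_TOP_PAD, 64 * 1024 * 1024);
    }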

But I'm not arguing against the mempool, just chiming in with glibc's malloc
tuning possibilities :-)





Re: scalability bottlenecks with (many) partitions (and more)

From:
Tomas Vondra
Date:

On 1/29/24 16:42, Ronan Dunklau wrote:
> On Monday, January 29, 2024 at 15:59:04 CET, Tomas Vondra wrote:
>> I'm not sure work_mem is a good parameter to drive this. It doesn't say
>> how much memory we expect the backend to use - it's a per-operation
>> limit, so it doesn't work particularly well with partitioning (e.g. with
>> 100 partitions, we may get 100 nodes, which is completely unrelated to
>> what work_mem says). A backend running the join query with 1000
>> partitions uses ~90MB (judging by data reported by the mempool), even
>> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless.
> 
> I understand your point,  I was basing my previous observations on what a 
> backend typically does during the execution.
> 
>>
>> The mempool could tell us how much memory we need (but we could track
>> this in some other way too, probably). And we could even adjust the mmap
>> parameters regularly, based on current workload.
>>
>> But then there's the problem that the mmap parameters don't tell
>> us how much memory to keep, but how large chunks to release.
>>
>> Let's say we want to keep the 90MB (to allocate the memory once and then
>> reuse it). How would you do that? We could set M_TRIM_THRESHOLD to 100MB,
>> but then it takes just a little bit of extra free memory to cross the
>> threshold and release all of it, or something.
> 
> For doing this you can set M_TOP_PAD using glibc malloc, which makes sure
> a certain amount of memory is always kept.
> 
> But the way the dynamic adjustment works makes it sort-of work like this. 
> MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't 
> expect to keep much memory around. 
> 
> So even "small" memory allocations will be served using mmap at first. Once 
> mmaped memory is released, glibc considers it a benchmark for "normal"
> allocations that can be routinely freed, and adjusts mmap_threshold to the 
> released mmaped region size, and trim threshold to two times that. 
> 
> It means over time the two values will converge either to the max value (32MB 
> for MMAP_THRESHOLD, 64MB for the trim threshold) or to something big enough to
> accommodate your released memory, since anything bigger than half the trim
> threshold will be allocated using mmap. 
> 
> Setting either parameter disables that.
> 

Thanks. I gave this a try, and I started the tests with these settings:

export MALLOC_TOP_PAD_=$((64*1024*1024))
export MALLOC_MMAP_THRESHOLD_=$((1024*1024))
export MALLOC_TRIM_THRESHOLD_=$((1024*1024))

which I believe means that:

1) we'll keep 64MB of "extra" memory on top of the heap, serving as a
cache for future allocations

2) everything below 1MB (so most of the blocks we allocate for contexts)
will be allocated on the heap (hence from the cache)

3) we won't trim the heap unless there's at least 1MB of contiguous free
space at the top (I wonder if this should be the same as MALLOC_TOP_PAD)

Those are mostly arbitrary values / guesses, and I don't have complete
results yet. But from the results I have it seems this has almost the
same effect as the mempool thing - see the attached PDF, with results
for the "partitioned join" benchmark.

first column - "master" (17dev) with no patches, default glibc

second column - 17dev + locking + mempool, default glibc

third column - 17dev + locking, tuned glibc

The color scale on the right is throughput comparison (third/second), as
a percentage with e.g. 90% meaning tuned glibc is 10% slower than the
mempool results. Most of the time it's slower but very close to 100%,
sometimes it's a bit faster. So overall it's roughly the same.

The color scales below the results are a comparison of each branch to
master (without patches), i.e. a comparison to current performance. It's
almost the same, although the tuned glibc build has a couple of regressions
that the mempool does not have.

> But I'm not arguing against the mempool, just chiming in with glibc's malloc 
> tuning possibilities :-)
> 

Yeah. I think the main problem with the glibc parameters is that they're
very implementation-specific and also static - the mempool is more
adaptive, I think. But it's an interesting experiment.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments