Thread: scalability bottlenecks with (many) partitions (and more)
Hi,

I happened to investigate a query involving a partitioned table, which led me to a couple of bottlenecks severely affecting queries dealing with multiple partitions (or relations in general). After a while I came up with three WIP patches that improve the behavior by an order of magnitude, and not just in some extreme cases.

Consider a partitioned pgbench with 100 partitions, say:

  pgbench -i -s 100 --partitions 100 testdb

but let's modify pgbench_accounts a little bit:

  ALTER TABLE pgbench_accounts ADD COLUMN aid_parent INT;
  UPDATE pgbench_accounts SET aid_parent = aid;
  CREATE INDEX ON pgbench_accounts(aid_parent);
  VACUUM FULL pgbench_accounts;

which simply adds an "aid_parent" column that is not a partition key. And now let's run this query:

  SELECT * FROM pgbench_accounts pa
    JOIN pgbench_branches pb ON (pa.bid = pb.bid)
   WHERE pa.aid_parent = :aid

so pretty much the regular "pgbench -S", except on a column that does not allow partition elimination. The plan looks like this:

                               QUERY PLAN
  ----------------------------------------------------------------------
   Hash Join  (cost=1.52..34.41 rows=10 width=465)
     Hash Cond: (pa.bid = pb.bid)
     ->  Append  (cost=0.29..33.15 rows=10 width=101)
           ->  Index Scan using pgbench_accounts_1_aid_parent_idx on pgbench_accounts_1 pa_1  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_2_aid_parent_idx on pgbench_accounts_2 pa_2  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_3_aid_parent_idx on pgbench_accounts_3 pa_3  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  Index Scan using pgbench_accounts_4_aid_parent_idx on pgbench_accounts_4 pa_4  (cost=0.29..3.31 rows=1 width=101)
                 Index Cond: (aid_parent = 3489734)
           ->  ...
     ->  Hash  (cost=1.10..1.10 rows=10 width=364)
           ->  Seq Scan on pgbench_branches pb  (cost=0.00..1.10 rows=10 width=364)

So yeah, scanning all 100 partitions. Not great, but no partitioning scheme is perfect for all queries.

Anyway, let's see how this works on a big AMD EPYC machine with 96/192 cores. With "-M simple" we get:

  parts        1        8       16       32       64       96      160      224
  ------------------------------------------------------------------------------
      0    13877   105732   210890   410452   709509   844683  1050658  1163026
    100      653     3957     7120    12022    12707    11813    10349     9633
   1000       20      142      270      474      757      808      567      427

These are transactions per second, for different numbers of clients (in the header). With -M prepared the story doesn't change - the numbers are higher, but the overall behavior is pretty much the same.

Firstly, with no partitions (first row), the throughput initially grows by ~13k tps per client and then gradually levels off, but it keeps growing. With 100 or 1000 partitions, however, it peaks and then starts dropping again. Moreover, the throughput with 100 or 1000 partitions is just a tiny fraction of the non-partitioned value - the difference is roughly equal to the number of partitions. For example with 96 clients, the difference between 0 and 1000 partitions is 844683/808 = 1045x. I could demonstrate the same behavior with fewer partitions - e.g. with 10 partitions you get a ~10x difference, and so on.

Another thing I'd mention is that this is not just about partitioning. Imagine a star schema with a fact table and dimensions - you'll get the same behavior depending on the number of dimensions you need to join with.
With "-M simple" you may get this, for example:

  dims         1        8       16       32       64       96      160      224
  ------------------------------------------------------------------------------
     1     11737    92925   183678   361497   636598   768956   958679  1042799
    10       462     3558     7086    13889    25367    29503    25353    24030
   100         4       31       61      122      231      292      292      288

So, a similar story - a significant slowdown as we add dimensions.

Now, what could be causing this? Clearly, there's a bottleneck of some kind and we're hitting it. Some of this may simply be due to execution doing more stuff (more index scans, more initialization, ...), but maybe not - one of the reasons why I started looking into this was that we weren't using all the CPU even for small scales; the CPU was maybe 60% utilized.

So I started poking at things. The first thing I thought about was locking, obviously. That's consistent with the limited CPU utilization (waiting on a lock = not running), and it's somewhat expected when using many partitions - we need to lock all of them, and if we have 100 or 1000 of them, that's potentially a lot of locks.

From past experiments I knew about two places where such a bottleneck could be - NUM_LOCK_PARTITIONS and fast-path locking. So I decided to give it a try, increase these values and see what happens.

For NUM_LOCK_PARTITIONS this is pretty simple (see the 0001 patch). The LWLock table has 16 partitions by default - it's quite possible that on a machine with many cores and/or many partitions we can easily hit this. So I bumped this 4x, to 64 partitions.

For fast-path locking the changes are more complicated (see 0002). We allow keeping 16 relation locks right in PGPROC, and only when this gets full do we promote them to the actual lock table. But with enough partitions we're guaranteed to fill these 16 slots, of course.

Increasing the number of slots is not simple, though. Firstly, the information is split between an array of 16 OIDs and a uint64 serving as a bitmap. Increasing the size of the OID array is simple, but it's harder for the auxiliary bitmap. There are more problems - with more OIDs a simple linear search won't do. But a simple hash table is not a good idea either, because of poor locality and the need to delete stuff ...

What I ended up doing is having a hash table of 16-element arrays. There are 64 "pieces", each essentially the (16 x OID + uint64 bitmap) that we have now. Each OID is mapped to exactly one of these parts as if in a hash table, and in each of those 16-element parts we do exactly the same thing we do now (linear search, removal, etc.). This works great, the locality is great, etc. The one disadvantage is this makes PGPROC larger, but I did a lot of benchmarks and I haven't seen any regression that I could attribute to this. (More about this later.)

Unfortunately, for the pgbench join this does not make much difference. But for the "star join" (with -M prepared) it does this:

                   1        8       16       32       64       96      160      224
  ----------------------------------------------------------------------------------
  master       21610   137450   247541   300902   270932   229692   191454   189233
  patched      21664   151695   301451   594615  1036424  1211716  1480953  1656203
  speedup        1.0      1.1      1.2      2.0      3.8      5.3      7.7      8.8

That's a pretty nice speedup, I think. But why doesn't the partitioned join improve too (or not by very much)?
Well, the perf profile says stuff like this:

    9.16%  0.77%  postgres  [kernel.kallsyms]  [k] asm_exc_page_fault
            |
            --8.39%--asm_exc_page_fault
                      |
                      --7.52%--exc_page_fault
                                |
                                --7.13%--do_user_addr_fault
                                          |
                                          --6.64%--handle_mm_fault
                                                    |
                                                    --6.29%--__handle_mm_fault
                                                              |
                                                              |--2.17%--__mem_cgroup_charge
                                                              |          |
                                                              |          |--1.25%--charge_memcg
                                                              |          |          |
                                                              |          |           --0.57%-- ...
                                                              |          |
                                                              |           --0.67%-- ...
                                                              |
                                                              |--2.04%--vma_alloc_folio

After investigating this for a bit, I came to the conclusion this may be some sort of scalability problem in glibc/malloc. I decided to check whether the "memory pool" patch (which I've mentioned in the memory limit thread as an alternative way to introduce backend-level accounting/limits) could serve as a backend-level malloc cache, and how that would work. So I cleaned up the PoC patch I already had (see 0003) and gave it a try. With both patches applied, the results for the partitioned join with 100 partitions look like this:

  -M simple          1        8       16       32       64       96      160      224
  ------------------------------------------------------------------------------------
  master           653     3957     7120    12022    12707    11813    10349     9633
  both patches     954     7356    14580    28259    51552    65278    70607    69598
  speedup          1.5      1.9      2.0      2.4      4.1      5.5      6.8      7.2

  -M prepared        1        8       16       32       64       96      160      224
  ------------------------------------------------------------------------------------
  master          1639     8273    14138    14746    13446    14001    11129    10136
  both patches    4792    30102    62208   122157   220984   267763   315632   323567
  speedup          2.9      3.6      4.4      8.3     16.4     19.1     28.4     31.9

That's pretty nice, I think. And I've seen many such improvements - it's not a cherry-picked example. For the star join, the improvements are very similar.

I'm attaching PDF files with tables visualizing the results for these two benchmarks - there are results for different numbers of partitions/scales and different builds (master, one or both of the patches). There's also a comparison to master, with a color scale "red = slower, green = faster" (but there's no red anywhere, not even for low client counts).

It's also interesting that with just the 0003 patch applied, the change is much smaller. It's as if the two bottlenecks (locking and malloc) are in balance - if you only address one of them, you don't gain much. But if you address both, it flies.

FWIW where does the malloc overhead come from? For one, while we do have some caching of malloc-ed memory in memory contexts, that doesn't quite work cross-query, because we destroy the contexts at the end of the query. We attempt to cache the memory contexts too, but in this case that can't help because the allocations come from btbeginscan() where we do this:

    so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));

and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and thus always allocated using a separate malloc() call. Maybe we could break it into smaller/cacheable parts, but I haven't tried, and I doubt it's the only such allocation.

I don't want to get into too much detail about the memory pool, but I think it's something we should consider doing - I'm sure there's stuff to improve, but caching the malloc-ed memory can clearly be very beneficial. The basic idea is to have a cache that is "adaptive" (i.e. it adjusts to caching blocks of the sizes needed by the workload) but also cheap. The patch is PoC/WIP and needs more work, but I think it works quite well. If anyone wants to take a look or have a chat at FOSDEM, for example, I'm available.
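To make the memory pool idea a bit more concrete, here is a minimal C sketch of a backend-local cache of malloc-ed blocks, organized by power-of-two size classes. This is not the 0003 patch - the names, the size classes and the per-class limits are made up for illustration - but it shows the basic mechanism: keep a few freed blocks around instead of returning them to glibc, so the next query can reuse them without new sbrk/mmap traffic.

    #include <stdlib.h>

    #define POOL_CLASSES     8        /* 8kB ... 1MB, doubling each step */
    #define POOL_MIN_SIZE    8192
    #define POOL_MAX_CACHED  16       /* blocks cached per size class */

    typedef struct PoolClass
    {
        int     nfree;                       /* cached blocks available */
        void   *blocks[POOL_MAX_CACHED];
    } PoolClass;

    static PoolClass pool[POOL_CLASSES];

    /* map a request to a size class, or -1 if it's too large to cache */
    static int
    pool_class(size_t size)
    {
        size_t  class_size = POOL_MIN_SIZE;

        for (int i = 0; i < POOL_CLASSES; i++, class_size <<= 1)
        {
            if (size <= class_size)
                return i;
        }
        return -1;
    }

    void *
    pool_alloc(size_t size)
    {
        int     cls = pool_class(size);

        /* cache hit - reuse a block freed by an earlier query */
        if (cls >= 0 && pool[cls].nfree > 0)
            return pool[cls].blocks[--pool[cls].nfree];

        /* cache miss - round up to the class size so the block is reusable */
        return malloc(cls >= 0 ? ((size_t) POOL_MIN_SIZE << cls) : size);
    }

    void
    pool_free(void *ptr, size_t size)
    {
        int     cls = pool_class(size);

        /* keep the block for reuse if the class isn't full, else release it */
        if (cls >= 0 && pool[cls].nfree < POOL_MAX_CACHED)
            pool[cls].blocks[pool[cls].nfree++] = ptr;
        else
            free(ptr);
    }

The actual patch is adaptive (it adjusts what it caches to the block sizes the workload actually allocates) and ties into backend-level accounting, but the hit/miss structure is essentially this.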
FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves the behavior a lot - it gets us maybe ~80% of the mempool benefits. Which is nice, it confirms it's glibc-specific (I wonder if there's a way to tweak glibc to address this), and it also means systems using jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also says the mempool has ~20% benefit on top of jemalloc.

FWIW there's another bottleneck people may not realize, and that's the number of file descriptors. Once you get to >1000 relations, you can easily get into a situation like this:

   54.18%  0.48%  postgres  [kernel.kallsyms]  [k] entry_SYSCALL_64_after_hwframe
            |
            --53.70%--entry_SYSCALL_64_after_hwframe
                      |
                      --53.03%--do_syscall_64
                                |
                                |--28.29%--__x64_sys_openat
                                |          |
                                |           --28.14%--do_sys_openat2
                                |                     |
                                |                     |--23.14%--do_filp_open
                                |                     |          |
                                |                     |           --22.72%--path_openat

That's pretty bad - it means we're closing/opening file descriptors like crazy, because every query needs the files. If I increase the number of file descriptors (both in ulimit and max_files_per_process) to prevent this thrashing, I can increase the throughput ~5x. Of course, this is not a bottleneck that we can "fix" in code - it's simply a consequence of not having enough file descriptors. But I wonder if we might make it easier to monitor, e.g. by tracking the fd cache hit ratio, or something like that ...

There's a more complete set of benchmarking scripts and results for these and other tests, in various formats (PDF, ODS, ...), at

  https://github.com/tvondra/scalability-patches

There are results from multiple machines - not just the big EPYC machine, but also smaller Intel machines (4C and 16C), and even two rpi5 boxes (yes, it helps even on an rpi5, quite a bit).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
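For readers unfamiliar with why a low fd limit turns into the openat() profile above: the backend keeps a bounded cache of open file descriptors, and once a workload touches more relation files than fit into it, accesses keep evicting and reopening files. A toy illustration follows - this is not PostgreSQL's fd.c, the names and the round-robin eviction are simplified assumptions:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define FD_CACHE_SIZE 16          /* stand-in for max_files_per_process */

    static struct
    {
        char    path[64];
        int     fd;
    }           fd_cache[FD_CACHE_SIZE];
    static int  next_victim;

    static int
    cached_open(const char *path)
    {
        int     slot;

        /* hit: the file is already open, no syscall needed */
        for (int i = 0; i < FD_CACHE_SIZE; i++)
        {
            if (fd_cache[i].fd > 0 && strcmp(fd_cache[i].path, path) == 0)
                return fd_cache[i].fd;
        }

        /*
         * Miss: evict a victim and open the file.  With more relations than
         * cache slots, this close()/openat() pair happens over and over,
         * which is exactly the churn visible in the profile above.
         */
        slot = next_victim++ % FD_CACHE_SIZE;
        if (fd_cache[slot].fd > 0)
            close(fd_cache[slot].fd);

        fd_cache[slot].fd = open(path, O_RDONLY);
        snprintf(fd_cache[slot].path, sizeof(fd_cache[slot].path), "%s", path);
        return fd_cache[slot].fd;
    }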
On Sunday, January 28, 2024 at 22:57:02 CET, Tomas Vondra wrote:

Hi Tomas!

I'll comment on the glibc-malloc part, as I studied it last year and proposed some things here:
https://www.postgresql.org/message-id/3424675.QJadu78ljV%40aivenlaptop

> FWIW where does the malloc overhead come from? For one, while we do have
> some caching of malloc-ed memory in memory contexts, that doesn't quite
> work cross-query, because we destroy the contexts at the end of the
> query. We attempt to cache the memory contexts too, but in this case
> that can't help because the allocations come from btbeginscan() where we
> do this:
>
>     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
>
> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and
> thus always allocated using a separate malloc() call. Maybe we could
> break it into smaller/cacheable parts, but I haven't tried, and I doubt
> it's the only such allocation.

Did you try running an strace on the process? That may give you some insights into what malloc is doing. A more sophisticated approach would be using stap and plugging it into the malloc probes, for example memory_sbrk_more and memory_sbrk_less.

An important part of glibc's malloc behaviour in this regard comes from the adjustment of the mmap and trim thresholds. By default, malloc adjusts them dynamically, and you can poke into that using the memory_mallopt_free_dyn_thresholds probe.

> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I
> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves
> the behavior a lot - it gets us maybe ~80% of the mempool benefits.
> Which is nice, it confirms it's glibc-specific (I wonder if there's a
> way to tweak glibc to address this), and it also means systems using
> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also
> says the mempool has ~20% benefit on top of jemalloc.

GLIBC's malloc offers some tuning for this. In particular, setting either M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto adjustment" behaviour and allow you to control what it's doing.

By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using sbrk isn't freed as easily, and you don't run into a pattern of moving the sbrk pointer up and down repeatedly. The automatic trade-off between the mmap and trim thresholds is supposed to prevent that, but the way it is incremented means you can end up in a bad place depending on your particular allocation pattern.

Best regards,

--
Ronan Dunklau
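For reference, the tuning described here is exposed through glibc's mallopt(3). A hedged example of pinning both thresholds - the 32MB values are arbitrary and only illustrate the calls; setting either parameter disables the dynamic adjustment mentioned above:

    #include <malloc.h>

    static void
    tune_glibc_malloc(void)
    {
        /* serve allocations up to 32MB from the heap (sbrk), not mmap */
        mallopt(M_MMAP_THRESHOLD, 32 * 1024 * 1024);

        /* only trim the heap once 32MB of contiguous free space accumulates */
        mallopt(M_TRIM_THRESHOLD, 32 * 1024 * 1024);
    }

The same knobs can also be set without code changes via the MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ environment variables, which is what the experiments later in this thread do.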
On 1/29/24 09:53, Ronan Dunklau wrote: > Le dimanche 28 janvier 2024, 22:57:02 CET Tomas Vondra a écrit : > > Hi Tomas ! > > I'll comment on glibc-malloc part as I studied that part last year, and > proposed some things here: https://www.postgresql.org/message-id/ > 3424675.QJadu78ljV%40aivenlaptop > Thanks for reminding me. I'll re-read that thread. > >> FWIW where does the malloc overhead come from? For one, while we do have >> some caching of malloc-ed memory in memory contexts, that doesn't quite >> work cross-query, because we destroy the contexts at the end of the >> query. We attempt to cache the memory contexts too, but in this case >> that can't help because the allocations come from btbeginscan() where we >> do this: >> >> so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData)); >> >> and BTScanOpaqueData is ~27kB, which means it's an oversized chunk and >> thus always allocated using a separate malloc() call. Maybe we could >> break it into smaller/cacheable parts, but I haven't tried, and I doubt >>>>> it's the only such allocation. > > Did you try running an strace on the process ? That may give you some > hindsights into what malloc is doing. A more sophisticated approach would be > using stap and plugging it into the malloc probes, for example > memory_sbrk_more and memory_sbrk_less. > No, I haven't tried that. In my experience strace is pretty expensive, and if the issue is in glibc itself (before it does the syscalls), strace won't really tell us much. Not sure, ofc. > An important part of glibc's malloc behaviour in that regard comes from the > adjustment of the mmap and free threshold. By default, mmap adjusts them > dynamically and you can poke into that using the > memory_mallopt_free_dyn_thresholds probe. > Thanks, I'll take a look at that. >> >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves >> the behavior a lot - it gets us maybe ~80% of the mempool benefits. >> Which is nice, it confirms it's glibc-specific (I wonder if there's a >> way to tweak glibc to address this), and it also means systems using >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also >> says the mempool has ~20% benefit on top of jemalloc. > > GLIBC's malloc offers some tuning for this. In particular, setting either > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto > adjustment" beheviour and allow you to control what it's doing. > > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated using > sbrk isn't freed as easily, and you don't run into a pattern of moving the > sbrk pointer up and down repeatedly. The automatic trade off between the mmap > and trim thresholds is supposed to prevent that, but the way it is incremented > means you can end in a bad place depending on your particular allocation > patttern. > So, what values would you recommend for these parameters? My concern is increasing those value would lead to (much) higher memory usage, with little control over it. With the mempool we keep more blocks, ofc, but we have control over freeing the memory. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le lundi 29 janvier 2024, 13:17:07 CET Tomas Vondra a écrit : > > Did you try running an strace on the process ? That may give you some > > hindsights into what malloc is doing. A more sophisticated approach would > > be using stap and plugging it into the malloc probes, for example > > memory_sbrk_more and memory_sbrk_less. > > No, I haven't tried that. In my experience strace is pretty expensive, > and if the issue is in glibc itself (before it does the syscalls), > strace won't really tell us much. Not sure, ofc. It would tell you how malloc actually performs your allocations, and how often they end up translated into syscalls. The main issue with glibc would be that it releases the memory too agressively to the OS, IMO. > > > An important part of glibc's malloc behaviour in that regard comes from > > the > > adjustment of the mmap and free threshold. By default, mmap adjusts them > > dynamically and you can poke into that using the > > memory_mallopt_free_dyn_thresholds probe. > > Thanks, I'll take a look at that. > > >> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I > >> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves > >> the behavior a lot - it gets us maybe ~80% of the mempool benefits. > >> Which is nice, it confirms it's glibc-specific (I wonder if there's a > >> way to tweak glibc to address this), and it also means systems using > >> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also > >> says the mempool has ~20% benefit on top of jemalloc. > > > > GLIBC's malloc offers some tuning for this. In particular, setting either > > M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto > > adjustment" beheviour and allow you to control what it's doing. > > > > By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated > > using sbrk isn't freed as easily, and you don't run into a pattern of > > moving the sbrk pointer up and down repeatedly. The automatic trade off > > between the mmap and trim thresholds is supposed to prevent that, but the > > way it is incremented means you can end in a bad place depending on your > > particular allocation patttern. > > So, what values would you recommend for these parameters? > > My concern is increasing those value would lead to (much) higher memory > usage, with little control over it. With the mempool we keep more > blocks, ofc, but we have control over freeing the memory. Right now depending on your workload (especially if you use connection pooling) you can end up with something like 32 or 64MB of dynamically adjusted trim-threshold which will never be released back. The first heurstic I had in mind was to set it to work_mem, up to a "reasonable" limit I guess. One can argue that it is expected for a backend to use work_mem frequently, and as such it shouldn't be released back. By setting work_mem to a lower value, we could ask glibc at the same time to trim the excess kept memory. That could be useful when a long-lived connection is pooled, and sees a spike in memory usage only once. Currently that could well end up with 32MB "wasted" permanently but tuning it ourselves could allow us to releaase it back. Since it was last year I worked on this, I'm a bit fuzzy on the details but I hope this helps.
On 1/29/24 15:15, Ronan Dunklau wrote: > Le lundi 29 janvier 2024, 13:17:07 CET Tomas Vondra a écrit : >>> Did you try running an strace on the process ? That may give you some >>> hindsights into what malloc is doing. A more sophisticated approach would >>> be using stap and plugging it into the malloc probes, for example >>> memory_sbrk_more and memory_sbrk_less. >> >> No, I haven't tried that. In my experience strace is pretty expensive, >> and if the issue is in glibc itself (before it does the syscalls), >> strace won't really tell us much. Not sure, ofc. > > It would tell you how malloc actually performs your allocations, and how often > they end up translated into syscalls. The main issue with glibc would be that > it releases the memory too agressively to the OS, IMO. > >> >>> An important part of glibc's malloc behaviour in that regard comes from >>> the >>> adjustment of the mmap and free threshold. By default, mmap adjusts them >>> dynamically and you can poke into that using the >>> memory_mallopt_free_dyn_thresholds probe. >> >> Thanks, I'll take a look at that. >> >>>> FWIW I was wondering if this is a glibc-specific malloc bottleneck, so I >>>> tried running the benchmarks with LD_PRELOAD=jemalloc, and that improves >>>> the behavior a lot - it gets us maybe ~80% of the mempool benefits. >>>> Which is nice, it confirms it's glibc-specific (I wonder if there's a >>>> way to tweak glibc to address this), and it also means systems using >>>> jemalloc (e.g. FreeBSD, right?) don't have this problem. But it also >>>> says the mempool has ~20% benefit on top of jemalloc. >>> >>> GLIBC's malloc offers some tuning for this. In particular, setting either >>> M_MMAP_THRESHOLD or M_TRIM_THRESHOLD will disable the unpredictable "auto >>> adjustment" beheviour and allow you to control what it's doing. >>> >>> By setting a bigger M_TRIM_THRESHOLD, one can make sure memory allocated >>> using sbrk isn't freed as easily, and you don't run into a pattern of >>> moving the sbrk pointer up and down repeatedly. The automatic trade off >>> between the mmap and trim thresholds is supposed to prevent that, but the >>> way it is incremented means you can end in a bad place depending on your >>> particular allocation patttern. >> >> So, what values would you recommend for these parameters? >> >> My concern is increasing those value would lead to (much) higher memory >> usage, with little control over it. With the mempool we keep more >> blocks, ofc, but we have control over freeing the memory. > > Right now depending on your workload (especially if you use connection > pooling) you can end up with something like 32 or 64MB of dynamically adjusted > trim-threshold which will never be released back. > OK, so let's say I expect each backend to use ~90MB of memory (allocated at once through memory contexts). How would you set the two limits? By default it's set to 128kB, which means blocks larger than 128kB are mmap-ed and released immediately. But there's very few such allocations - a vast majority of blocks in the benchmark workloads is <= 8kB or ~27kB (those from btbeginscan). So I'm thinking about leaving M_TRIM_THRESHOLD as is, but increasing the M_TRIM_THRESHOLD value to a couple MBs. But I doubt that'll really help, because what I expect to happen is we execute a query and it allocates all memory up to a high watermark of ~90MB. And then the query completes, and we release almost all of it. And even with trim threshold set to e.g. 8MB we'll free almost all of it, no? 
What we want to do is say - hey, we needed 90MB, and now we need 8MB. We could free 82MB, but maybe let's wait a bit and see if we need that memory again. And that's pretty much what the mempool does, but I don't see how to do that using the mmap options. > The first heurstic I had in mind was to set it to work_mem, up to a > "reasonable" limit I guess. One can argue that it is expected for a backend to > use work_mem frequently, and as such it shouldn't be released back. By setting > work_mem to a lower value, we could ask glibc at the same time to trim the > excess kept memory. That could be useful when a long-lived connection is > pooled, and sees a spike in memory usage only once. Currently that could well > end up with 32MB "wasted" permanently but tuning it ourselves could allow us > to releaase it back. > I'm not sure work_mem is a good parameter to drive this. It doesn't say how much memory we expect the backend to use - it's a per-operation limit, so it doesn't work particularly well with partitioning (e.g. with 100 partitions, we may get 100 nodes, which is completely unrelated to what work_mem says). A backend running the join query with 1000 partitions uses ~90MB (judging by data reported by the mempool), even with work_mem=4MB. So setting the trim limit to 4MB is pretty useless. The mempool could tell us how much memory we need (but we could track this in some other way too, probably). And we could even adjust the mmap parameters regularly, based on current workload. But there's then there's the problem that the mmap parameters don't tell us how much memory to keep, but how large chunks to release. Let's say we want to keep the 90MB (to allocate the memory once and then reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB, but then it takes just a little bit of extra memory to release all the memory, or something. > Since it was last year I worked on this, I'm a bit fuzzy on the details but I > hope this helps. > Thanks for the feedback / insights! regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Le lundi 29 janvier 2024, 15:59:04 CET Tomas Vondra a écrit : > I'm not sure work_mem is a good parameter to drive this. It doesn't say > how much memory we expect the backend to use - it's a per-operation > limit, so it doesn't work particularly well with partitioning (e.g. with > 100 partitions, we may get 100 nodes, which is completely unrelated to > what work_mem says). A backend running the join query with 1000 > partitions uses ~90MB (judging by data reported by the mempool), even > with work_mem=4MB. So setting the trim limit to 4MB is pretty useless. I understand your point, I was basing my previous observations on what a backend typically does during the execution. > > The mempool could tell us how much memory we need (but we could track > this in some other way too, probably). And we could even adjust the mmap > parameters regularly, based on current workload. > > But there's then there's the problem that the mmap parameters don't tell > If we > > us how much memory to keep, but how large chunks to release. > > Let's say we want to keep the 90MB (to allocate the memory once and then > reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB, > but then it takes just a little bit of extra memory to release all the > memory, or something. For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a certain amount of memory is always kept. But the way the dynamic adjustment works makes it sort-of work like this. MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't expect to keep much memory around. So even "small" memory allocations will be served using mmap at first. Once mmaped memory is released, glibc's consider it a benchmark for "normal" allocations that can be routinely freed, and adjusts mmap_threshold to the released mmaped region size, and trim threshold to two times that. It means over time the two values will converge either to the max value (32MB for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to accomodate your released memory, since anything bigger than half trim threshold will be allocated using mmap. Setting any parameter disable that. But I'm not arguing against the mempool, just chiming in with glibc's malloc tuning possibilities :-)
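A hedged sketch of the M_TOP_PAD idea from this message, for concreteness: ask glibc to keep a given amount of heap around instead of trimming it as soon as a query releases its memory. The 90MB figure matches the per-backend peak mentioned earlier in the thread; in a real implementation it would presumably have to be derived from the workload somehow.

    #include <malloc.h>

    static void
    keep_heap_for_reuse(int expected_peak_bytes)
    {
        /* keep roughly this much padding at the top of the heap ... */
        mallopt(M_TOP_PAD, expected_peak_bytes);

        /* ... and don't trim it away eagerly when memory is freed */
        mallopt(M_TRIM_THRESHOLD, expected_peak_bytes);
    }

    /* e.g. keep_heap_for_reuse(90 * 1024 * 1024); */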
On 1/29/24 16:42, Ronan Dunklau wrote: > Le lundi 29 janvier 2024, 15:59:04 CET Tomas Vondra a écrit : >> I'm not sure work_mem is a good parameter to drive this. It doesn't say >> how much memory we expect the backend to use - it's a per-operation >> limit, so it doesn't work particularly well with partitioning (e.g. with >> 100 partitions, we may get 100 nodes, which is completely unrelated to >> what work_mem says). A backend running the join query with 1000 >> partitions uses ~90MB (judging by data reported by the mempool), even >> with work_mem=4MB. So setting the trim limit to 4MB is pretty useless. > > I understand your point, I was basing my previous observations on what a > backend typically does during the execution. > >> >> The mempool could tell us how much memory we need (but we could track >> this in some other way too, probably). And we could even adjust the mmap >> parameters regularly, based on current workload. >> >> But there's then there's the problem that the mmap parameters don't tell >> If we > > us how much memory to keep, but how large chunks to release. >> >> Let's say we want to keep the 90MB (to allocate the memory once and then >> reuse it). How would you do that? We could set MMAP_TRIM_TRESHOLD 100MB, >> but then it takes just a little bit of extra memory to release all the >> memory, or something. > > For doing this you can set M_TOP_PAD using glibc malloc. Which makes sure a > certain amount of memory is always kept. > > But the way the dynamic adjustment works makes it sort-of work like this. > MMAP_THRESHOLD and TRIM_THRESHOLD start with low values, meaning we don't > expect to keep much memory around. > > So even "small" memory allocations will be served using mmap at first. Once > mmaped memory is released, glibc's consider it a benchmark for "normal" > allocations that can be routinely freed, and adjusts mmap_threshold to the > released mmaped region size, and trim threshold to two times that. > > It means over time the two values will converge either to the max value (32MB > for MMAP_THRESHOLD, 64 for trim threshold) or to something big enough to > accomodate your released memory, since anything bigger than half trim > threshold will be allocated using mmap. > > Setting any parameter disable that. > Thanks. I gave this a try, and I started the tests with this setting: export MALLOC_TOP_PAD_=$((64*1024*1024)) export MALLOC_MMAP_THRESHOLD_=$((1024*1024)) export MALLOC_TRIM_THRESHOLD_=$((1024*1024)) which I believe means that: 1) we'll keep 64MB "extra" memory on top of heap, serving as a cache for future allocations 2) everything below 1MB (so most of the blocks we allocate for contexts) will be allocated on heap (hence from the cache) 3) we won't trim heap unless there's at least 1MB of free contiguous space (I wonder if this should be the same as MALLOC_TOP_PAD) Those are mostly arbitrary values / guesses, and I don't have complete results yet. But from the results I have it seems this has almost the same effect as the mempool thing - see the attached PDF, with results for the "partitioned join" benchmark. first column - "master" (17dev) with no patches, default glibc second column - 17dev + locking + mempool, default glibc third column - 17dev + locking, tuned glibc The color scale on the right is throughput comparison (third/second), as a percentage with e.g. 90% meaning tuned glibc is 10% slower than the mempool results. Most of the time it's slower but very close to 100%, sometimes it's a bit faster. So overall it's roughly the same. 
The color scales below the results is a comparison of each branch to the master (without patches), showing comparison to current performance. It's almost the same, although the tuned glibc has a couple regressions that the mempool does not have. > But I'm not arguing against the mempool, just chiming in with glibc's malloc > tuning possibilities :-) > Yeah. I think the main problem with the glibc parameters is that it's very implementation-specific and also static - the mempool is more adaptive, I think. But it's an interesting experiment. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
> LWLock table has 16 partitions by default - it's quite possible that on
> machine with many cores and/or many partitions, we can easily hit this.
> So I bumped this 4x to 64 partitions.

I think this probably makes sense. I'm a little worried that we're just kicking the can down the road here where maybe we should be solving the problem in some more fundamental way, and I'm also a little worried that we might be reducing single-core performance. But it's probably fine.

> What I ended up doing is having a hash table of 16-element arrays. There
> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
> have now. Each OID is mapped to exactly one of these parts as if in a
> hash table, and in each of those 16-element parts we do exactly the same
> thing we do now (linear search, removal, etc.). This works great, the
> locality is great, etc. The one disadvantage is this makes PGPROC
> larger, but I did a lot of benchmarks and I haven't seen any regression
> that I could attribute to this. (More about this later.)

I think this is a reasonable approach. Some comments:

- FastPathLocalUseInitialized seems unnecessary to me; the contents of an uninitialized local variable are undefined, but an uninitialized global variable always starts out zeroed.

- You need comments in various places, including here, where someone is certain to have questions about the algorithm and choice of constants:

+#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)

When I originally coded up the fast-path locking stuff, I supposed that we couldn't make the number of slots too big because the algorithm requires a linear search of the whole array. But with this one trick (a partially-associative cache), there's no real reason that I can think of why you can't make the number of slots almost arbitrarily large. At some point you're going to use too much memory, and probably before that point you're going to make the cache big enough that it doesn't fit in the CPU cache of an individual core, at which point possibly it will stop working as well. But honestly ... I don't quite see why this approach couldn't be scaled quite far.

Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value of 64 to say 65536, would that still perform well? I'm not saying we should do that, because that's probably a silly amount of memory to use for this, but the point is that when you start to have enough partitions that you run out of lock slots, performance is going to degrade, so you can imagine wanting to try to have enough lock groups to make that unlikely. Which leads me to wonder if there's any particular number of lock groups that is clearly "too many" or whether it's just about how much memory we want to use.

--
Robert Haas
EDB: http://www.enterprisedb.com
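To illustrate the kind of commentary being asked for here, a hedged sketch of how the quoted macro could be documented - the macro and the constants are from the 0002 patch as quoted above, while the typedef and the wording of the comments are not from the patch:

    #include <stdint.h>

    typedef uint64_t uint64;    /* stand-in for PostgreSQL's uint64 typedef */

    /* number of fast-path lock groups in each PGPROC (from the patch) */
    #define FP_LOCK_GROUPS_PER_BACKEND  64

    /*
     * Map a relation OID to one of the fast-path lock groups.
     *
     * This is a cheap multiplicative hash; 7883 and 4481 are small primes
     * with no deeper significance, used only to spread consecutive OIDs
     * (common with partitioned tables) across different groups.  Within the
     * selected group the lock is handled exactly like the original 16-slot
     * fast-path array: linear search, removal, etc.
     */
    #define FAST_PATH_LOCK_REL_GROUP(rel) \
        (((uint64) (rel) * 7883 + 4481) % FP_LOCK_GROUPS_PER_BACKEND)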
On 6/24/24 17:05, Robert Haas wrote:
> On Sun, Jan 28, 2024 at 4:57 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> For NUM_LOCK_PARTITIONS this is pretty simple (see 0001 patch). The
>> LWLock table has 16 partitions by default - it's quite possible that on
>> machine with many cores and/or many partitions, we can easily hit this.
>> So I bumped this 4x to 64 partitions.
>
> I think this probably makes sense. I'm a little worried that we're
> just kicking the can down the road here where maybe we should be
> solving the problem in some more fundamental way, and I'm also a
> little worried that we might be reducing single-core performance. But
> it's probably fine.

Yeah, I haven't seen this causing any regressions - the sensitive paths typically lock only one partition, so having more of them does not affect that. Or if it does, it's likely a reasonable trade-off, as it reduces the risk of lock contention.

That being said, I don't recall benchmarking this patch in isolation, only with the other patches. Maybe I should do that ...

>> What I ended up doing is having a hash table of 16-element arrays. There
>> are 64 "pieces", each essentially the (16 x OID + UINT64 bitmap) that we
>> have now. Each OID is mapped to exactly one of these parts as if in a
>> hash table, and in each of those 16-element parts we do exactly the same
>> thing we do now (linear search, removal, etc.). This works great, the
>> locality is great, etc. The one disadvantage is this makes PGPROC
>> larger, but I did a lot of benchmarks and I haven't seen any regression
>> that I could attribute to this. (More about this later.)
>
> I think this is a reasonable approach. Some comments:
>
> - FastPathLocalUseInitialized seems unnecessary to me; the contents of
> an uninitialized local variable are undefined, but an uninitialized
> global variable always starts out zeroed.

OK. I didn't realize global variables start out zeroed.

> - You need comments in various places, including here, where someone
> is certain to have questions about the algorithm and choice of
> constants:
>
> +#define FAST_PATH_LOCK_REL_GROUP(rel) (((uint64) (rel) * 7883 + 4481)
> % FP_LOCK_GROUPS_PER_BACKEND)

Yeah, this definitely needs a comment explaining it.

I admit those numbers are pretty arbitrary primes, to implement a trivial hash function. That was good enough for a PoC patch, but maybe for a "proper" version this should use a better hash function. It needs to be fast, and maybe it doesn't matter that much if it's not perfect.

> When I originally coded up the fast-path locking stuff, I supposed
> that we couldn't make the number of slots too big because the
> algorithm requires a linear search of the whole array. But with this
> one trick (a partially-associative cache), there's no real reason that
> I can think of why you can't make the number of slots almost
> arbitrarily large. At some point you're going to use too much memory,
> and probably before that point you're going to make the cache big
> enough that it doesn't fit in the CPU cache of an individual core, at
> which point possibly it will stop working as well. But honestly ... I
> don't quite see why this approach couldn't be scaled quite far.

I don't think I've heard the term "partially-associative cache" before, but now that I look at the approach again, it very much reminds me of how set-associative caches work (e.g. with cachelines in CPU caches). It's essentially a 16-way set-associative cache - each relation maps to exactly one group, and within that group it can use any of the 16 slots.

I must have been reading some papers in this area shortly before the PoC patch, and the idea probably came from there. Which is good, because it means it's a well-understood and widely-used approach.

> Like, if we raised FP_LOCK_GROUPS_PER_BACKEND from your proposed value
> of 64 to say 65536, would that still perform well? I'm not saying we
> should do that, because that's probably a silly amount of memory to
> use for this, but the point is that when you start to have enough
> partitions that you run out of lock slots, performance is going to
> degrade, so you can imagine wanting to try to have enough lock groups
> to make that unlikely. Which leads me to wonder if there's any
> particular number of lock groups that is clearly "too many" or whether
> it's just about how much memory we want to use.

That's an excellent question. I don't know.

I agree 64 groups is pretty arbitrary, and the 1024 slots it gives us may not be enough even with a modest number of partitions. When I was thinking about using a higher value, my main concern was that it'd make the PGPROC entry larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that felt like quite a bit.

But maybe it's fine and we could make it much larger - L3 caches tend to be many MBs these days, although AFAIK they're shared by the threads running on the CPU.

I'll see if I can do some more testing of this, and see if there's a value where it stops improving / starts degrading, etc.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
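To spell out the set-associative structure being discussed, here is a small self-contained sketch (not the patch itself): each group holds 16 OIDs plus a 64-bit word of per-slot lock-mode bits, and a relation is only ever looked up in the single group its OID hashes to. The names are made up, and the 3-bits-per-slot layout is assumed to match the existing fast-path code.

    #include <stdbool.h>
    #include <stdint.h>

    #define FP_GROUPS           64      /* FP_LOCK_GROUPS_PER_BACKEND */
    #define FP_SLOTS_PER_GROUP  16

    typedef struct FastPathGroup
    {
        uint32_t    relids[FP_SLOTS_PER_GROUP];     /* relation OIDs, 0 = empty */
        uint64_t    lockbits;                       /* 3 mode bits per slot */
    } FastPathGroup;

    static FastPathGroup fp_groups[FP_GROUPS];

    /* same multiplicative-hash idea as FAST_PATH_LOCK_REL_GROUP */
    static inline int
    fp_group(uint32_t relid)
    {
        return (int) (((uint64_t) relid * 7883 + 4481) % FP_GROUPS);
    }

    /* linear search, but only within the one group the OID maps to */
    static bool
    fp_lock_held(uint32_t relid)
    {
        FastPathGroup *group = &fp_groups[fp_group(relid)];

        for (int i = 0; i < FP_SLOTS_PER_GROUP; i++)
        {
            if (group->relids[i] == relid)
                return ((group->lockbits >> (i * 3)) & 7) != 0;
        }
        return false;
    }

With 16 x 4B OIDs plus the 8B bitmap, each group is 72B, which is where the ~4.5kB figure for 64 groups comes from.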
On Tue, Jun 25, 2024 at 6:04 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote: > Yeah, definitely needs comment explaining this. > > I admit those numbers are pretty arbitrary primes, to implement a > trivial hash function. That was good enough for a PoC patch, but maybe > for a "proper" version this should use a better hash function. It needs > to be fast, and maybe it doesn't matter that much if it's not perfect. Right. My guess is that if we try too hard to make the hash function good, there will be a performance hit. Unlike, say, strings that come from the user, we have no reason to believe that relfilenumbers will have any particular structure or pattern to them, so a low-quality, fast function seems like a good trade-off to me. But I'm *far* from a hashing expert, so I'm prepared for someone who is to tell me that I'm full of garbage. > I don't think I've heard the term "partially-associative cache" before > That's an excellent question. I don't know. > > I agree 64 groups is pretty arbitrary, and having 1024 may not be enough > even with a modest number of partitions. When I was thinking about using > a higher value, my main concern was that it'd made the PGPROC entry > larger. Each "fast-path" group is ~72B, so 64 groups is ~4.5kB, and that > felt like quite a bit. > > But maybe it's fine and we could make it much larger - L3 caches tend to > be many MBs these days, although AFAIK it's shared by threads running on > the CPU. > > I'll see if I can do some more testing of this, and see if there's a > value where it stops improving / starts degrading, etc. Sounds good. -- Robert Haas EDB: http://www.enterprisedb.com