Re: NUMA shared memory interleaving
From | Jakub Wartak
---|---
Subject | Re: NUMA shared memory interleaving
Date |
Msg-id | CAKZiRmy4r+CYFX+x7fAj9t4tjMJZnBY73phuBVM7pbEfxueiEg@mail.gmail.com
In reply to | Re: NUMA shared memory interleaving (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
List | pgsql-hackers
On Fri, Apr 18, 2025 at 7:43 PM Bertrand Drouvot <bertranddrouvot.pg@gmail.com> wrote:
>
> Hi,
>
> On Wed, Apr 16, 2025 at 10:05:04AM -0400, Robert Haas wrote:
> > On Wed, Apr 16, 2025 at 5:14 AM Jakub Wartak
> > <jakub.wartak@enterprisedb.com> wrote:
> > > Normal pgbench workloads tend to be not affected, as each backend
> > > tends to touch just a small partition of shm (thanks to BAS
> > > strategies). Some remaining questions are:
> > > 1. How to name this GUC (numa or numa_shm_interleave) ? I prefer the
> > > first option, as we could potentially in future add more optimizations
> > > behind that GUC.
> >
> > I wonder whether the GUC needs to support interleaving between a
> > designated set of nodes rather than only being able to do all nodes.
> > For example, suppose someone is pinning the processes to a certain set
> > of NUMA nodes; perhaps then they wouldn't want to use memory from
> > other nodes.
>
> +1. That could be used for instances consolidation on the same host. One could
> ensure that numa nodes are not shared across instances (cpu and memory resource
> isolation per instance). Bonus point, adding Direct I/O into the game would
> ensure that the OS page cache is not shared too.

Hi,

the attached patch has two changes:

1. It adds more modes and supports the 'consolidation' and 'isolation'
   scenarios from above. The doc in the patch briefly explains the merit.
2. It adds trivial NUMA interleaving for PQ (dynamic shared memory).

The initial test, expanded, on the very same machine (4s32c128t, QPI
interconnect):

numa='off'
  latency average = 1271.019 ms
  latency stddev = 245.061 ms
  tps = 49.683923 (without initial connection time)
  explanation (pcm-memory): 3 sockets doing ~46 MB/s on RAM (almost idle),
  1 socket doing ~17 GB/s (fully saturated, because in this scenario s_b
  ended up on a single NUMA node)

numa='all'
  latency average = 702.622 ms
  latency stddev = 13.259 ms
  tps = 90.026526 (without initial connection time)
  explanation (pcm-memory): this forces s_b to be interleaved across the 4
  NUMA nodes and each socket gets an equal share of the workload
  (9.2-10 GB/s), totalling ~37 GB/s of memory bandwidth

This gives a boost of 90/49.6 = 1.8x. The memory bandwidth values are
combined read+write.

NUMA impact on the Parallel Query:
----------------------------------

Setup: the most simplistic interleaving of s_b, with the
dynamic_shared_memory for PQ interleaved too, and:

  max_worker_processes = max_parallel_workers = max_parallel_workers_per_gather = 64
  ALTER on 1 partition to force real 64 parallel seq scans

The query 'select sum(octet_length(filler)) from pgbench_accounts;'
launched 64 effective parallel workers for 64 partitions of 400 MB each
(25600 MB in total). All of that fit into s_b (32 GB), so everything was
fetched from s_b. Everything was hot; the first several runs are not shown.
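Before the numbers, a short digression for anyone who wants to see what
"interleaving" means in terms of actual calls: a minimal, standalone libnuma
sketch of the technique (illustration only -- this is not the patch's code;
the region size and the node string are invented):

/* illustration only -- not the patch's code; build with: cc interleave.c -lnuma */
#include <stdio.h>
#include <sys/mman.h>
#include <numa.h>

int
main(void)
{
	size_t		shm_size = (size_t) 1 << 30;	/* stand-in for s_b / a DSM segment */
	void	   *shm;
	struct bitmask *nodes;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	/* stand-in for the shared memory segment PostgreSQL would create */
	shm = mmap(NULL, shm_size, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (shm == MAP_FAILED)
		return 1;

	/* numa='all'; a designated set of nodes could be e.g. "0,2" */
	nodes = numa_parse_nodestring("all");
	if (nodes == NULL)
		return 1;

	/*
	 * Apply the interleave policy before the region is first touched:
	 * pages are then spread round-robin across the selected nodes as
	 * they are faulted in.
	 */
	numa_interleave_memory(shm, shm_size, nodes);
	numa_bitmask_free(nodes);

	/* ... normal use of the region (first touch) happens afterwards ... */
	return 0;
}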
select sum(octet_length(filler)) from pgbench_accounts;

numa='off'
  Time: 1108.178 ms (00:01.108)
  Time: 1118.494 ms (00:01.118)
  Time: 1104.491 ms (00:01.104)
  Time: 1112.221 ms (00:01.112)
  Time: 1105.501 ms (00:01.106)
  avg: 1109 ms

-- not interleaved, more like appended:
postgres=# select * from pg_shmem_allocations_numa where name = 'Buffer Blocks';
     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 | 9277800448
 Buffer Blocks |         1 | 7044333568
 Buffer Blocks |         2 | 9097445376
 Buffer Blocks |         3 | 8942256128

numa='all'
  Time: 1026.747 ms (00:01.027)
  Time: 1024.087 ms (00:01.024)
  Time: 1024.179 ms (00:01.024)
  Time: 1037.026 ms (00:01.037)
  avg: 1027 ms

postgres=# select * from pg_shmem_allocations_numa where name = 'Buffer Blocks';
     name      | numa_node |    size
---------------+-----------+------------
 Buffer Blocks |         0 | 8589934592
 Buffer Blocks |         1 | 8592031744
 Buffer Blocks |         2 | 8589934592
 Buffer Blocks |         3 | 8589934592

1109/1027 = 1.079x, not bad for such a trivial change, and the paper
referenced by Thomas also stated (`We can see an improvement by a factor of
more than three by just running the non-NUMA-aware implementation on
interleaved memory`). It could probably be improved much further, but I'm
not planning to work on this more.

So if anything:
- latency-wise: it would be best to place the leader and all PQ workers
  close to s_b, provided s_b fits into that NUMA node's shared/huge page
  memory and you won't need more CPU than there is on that node...
  (assuming e.g. hosting 4 DBs on a 4-socket box, each on its own node, it
  would be best to pin everything there: shm, but the PQ workers too)
- capacity/TPS-wise, or when s_b > one NUMA node: just interleave to
  maximize bandwidth and get uniform CPU performance out of it

The patch supports e.g. numa='@1', which should fully isolate the workload
to just the memory and CPUs of node #1.

As for the patch: I'm lost with our C headers policy :)

One of the less obvious reasons (besides the better efficiency of
consolidating multiple PostgreSQL clusters on a single NUMA server) why
I've implemented '=' and '@' is that CXL memory can apparently be attached
as a CPU-less(!) NUMA node, so Linux - depending on the sysctl/sysfs setup -
could use it for automatic memory tiering, and the above provides a
configurable way to prevent allocation on such (potential) systems: simply
exclude such a NUMA node via config for now and we are covered, I think. I
have no access to real hardware, so I haven't researched it further, but it
looks like in the far future we could probably indicate preferred NUMA
memory nodes (think big s_b, bigger than "CPU" RAM) and the kernel could
transparently do NUMA auto balancing/demotion for us (AKA Transparent Page
Placement, AKA memory tiering), or vice versa: use a small s_b, not use the
CXL node at all, and expect that the VFS cache could be spilled there.

numa_weighted_interleave_memory() / MPOL_WEIGHTED_INTERLEAVE is not yet
supported in distros (although new libnuma has support for it), so I have
not included it in the patch, as it was too early.

BTW: DO NOT USE meson's --buildtype=debug, as it somehow hides the benefit
of the NUMA optimizations; I've lost hours on it (probably -O0 is so slow
that it wasn't stressing the interconnects enough). The default,
--buildtype=debugoptimized, is good. Also, if testing performance, first
check that the HW has proper, realistic NUMA remote access distances; e.g.
here remote access was 2x or even 3x the local cost.
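A quick way to eyeball those distances (the same matrix that
numactl --hardware prints, with 10 meaning local) is a tiny libnuma program
along these lines (illustrative only):

/* illustrative helper: print the NUMA node distance matrix */
#include <stdio.h>
#include <numa.h>

int
main(void)
{
	int			maxnode;
	int			i,
				j;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	maxnode = numa_max_node();
	printf("node distances (10 = local):\n");
	for (i = 0; i <= maxnode; i++)
	{
		for (j = 0; j <= maxnode; j++)
			printf("%4d", numa_distance(i, j));
		printf("\n");
	}
	return 0;
}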
Probably this is worth testing only on multi-socket systems, which have
much higher remote latencies and throughput limitations, but reports from
single-socket MCM CPUs (with various Nodes-per-Socket BIOS settings) are
welcome too.

kernel 6.14.7 was used, with full isolation:

cpupower frequency-set --governor performance
cpupower idle-set -D0
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

max_connections = '10000'
huge_pages = 'on'
wal_level = 'minimal'
wal_buffers = '1024MB'
max_wal_senders = '0'
shared_buffers = '4 GB'
autovacuum = 'off'
max_parallel_workers_per_gather = '0'
numa = 'all'
#numa = 'off'

[1] - https://lwn.net/Articles/897536/
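PS: in libnuma terms, the '@1' isolation mentioned earlier boils down to
restricting both the CPU mask and the memory policy to one node; a
bare-bones illustration (again, not the patch's actual code):

/* illustration only: confine the current process (CPUs + memory) to NUMA node 1 */
#include <stdio.h>
#include <numa.h>

int
main(void)
{
	struct bitmask *nodes;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	/* roughly the spirit of numa='@1': run on node 1, allocate from node 1 only */
	nodes = numa_parse_nodestring("1");
	if (nodes == NULL)
		return 1;
	numa_bind(nodes);			/* sets both the allowed CPUs and the memory policy */
	numa_bitmask_free(nodes);

	/* ... everything the process does from here on stays on node 1 ... */
	return 0;
}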