Re: Adding basic NUMA awareness
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | 7b824e42-02de-4f4a-a81d-5acc89d417ea@vondra.me |
| In reply to | Re: Adding basic NUMA awareness (Jakub Wartak <jakub.wartak@enterprisedb.com>) |
| List | pgsql-hackers |
Hi, here's a rebased patch series, fixing most of the smaller issues from
v20251101, and making cfbot happy (hopefully).

On 11/6/25 15:02, Jakub Wartak wrote:
> On Tue, Nov 4, 2025 at 10:21 PM Tomas Vondra <tomas@vondra.me> wrote:
>
> Hi Tomas,
>
>>> 0007a: pg_buffercache_pgproc returns pgproc_ptr and fastpath_ptr in
>>> bigint and not hex? I've wanted to adjust that to TEXTOID, but instead
>>> I've thought it is going to be simpler to use to_hex() -- see 0009
>>> attached.
>>>
>>
>> I don't know. I added it simply because it might be useful for
>> development, but we probably don't want to expose these pointers at all.
>>
>>> 0007b: pg_buffercache_pgproc -- nitpick, but maybe it would be better
>>> called pg_shm_pgproc?
>>>
>>
>> Right. It does not belong to pg_buffercache at all, I just added it
>> there because I've been messing with that code already.
>
> Please keep them in at least for some time (perhaps a standalone
> patch marked as not intended to be committed would work?). I find the
> view extremely useful, as it will allow us to pinpoint local-vs-remote
> NUMA fetches (we need to know the address).
>

Are you referring to the _pgproc view specifically, or also to the view
with buffer partitions? I don't intend to remove the view for shared
buffers, that's indeed useful.

>>> 0007c: with check_numa='buffers,procs' it throws 'mbind: Invalid
>>> argument' during start:
>>>
>>> 2025-11-04 10:02:27.055 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30400000 endptr 0x7f8d30800000
>>> num_procs 2523 node 0
>>> 2025-11-04 10:02:27.057 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30800000 endptr 0x7f8d30c00000
>>> num_procs 2523 node 1
>>> 2025-11-04 10:02:27.059 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d30c00000 endptr 0x7f8d31000000
>>> num_procs 2523 node 2
>>> 2025-11-04 10:02:27.061 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d31000000 endptr 0x7f8d31400000
>>> num_procs 2523 node 3
>>> 2025-11-04 10:02:27.062 CET [58464] DEBUG: NUMA:
>>> pgproc_init_partition procs 0x7f8d31400000 endptr 0x7f8d31407cb0
>>> num_procs 38 node -1
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>> mbind: Invalid argument
>>>
>>
>> I'll take a look, but I don't recall seeing such errors.
>>
>
> Alexy also reported this earlier, here
> https://www.postgresql.org/message-id/92e23c85-f646-4bab-b5e0-df30d8ddf4bd%40postgrespro.ru
> (just use HP, set some high max_connections). I've double-checked this
> too; the numa_tonode_memory() len needs to be the HP size.
>

OK, I'll investigate this.

>>> 0007d: so we probably need numa_warn()/numa_error() wrappers (this was
>>> initially part of the NUMA observability patches but got removed during
>>> the course of action), I'm attaching 0008. With that you'll get
>>> something a little more up to our standards:
>>> 2025-11-04 10:27:07.140 CET [59696] DEBUG:
>>> fastpath_parititon_init node = 3, ptr = 0x7f4f4d400000, endptr =
>>> 0x7f4f4d4b1660
>>> 2025-11-04 10:27:07.140 CET [59696] WARNING: libnuma: ERROR: mbind
>>>
>>
>> Not sure.
>
> Any particular objections? We need to somehow emit them into the logs.
>

No idea, I think it'd be better to make sure this failure can't happen,
but maybe it's not possible. I don't understand the mbind failure well
enough.
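As an aside on the wrapper question: libnuma exports numa_error() and
numa_warn() as weak symbols exactly so that applications can take over
error reporting. A minimal sketch of what such overrides might look like
(illustrative only -- routing through ereport() is an assumption here, not
necessarily what the attached 0008 does):

#include "postgres.h"

#include <stdarg.h>
#include <numa.h>

/*
 * libnuma calls numa_error() when one of its syscalls (e.g. mbind) fails,
 * with errno still set, and numa_warn() for non-fatal issues.  Defining
 * them in the backend overrides the default handlers that print to
 * stderr, so the failures end up in the server log instead.
 */
void
numa_error(char *where)
{
	ereport(WARNING,
			(errmsg("libnuma: ERROR: %s: %m", where)));
}

void
numa_warn(int num, char *fmt,...)
{
	char		buf[1024];
	va_list		ap;

	va_start(ap, fmt);
	vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	ereport(WARNING,
			(errmsg("libnuma: WARNING (%d): %s", num, buf)));
}

At minimum that gets the errno of the failing mbind() into the log, which
would make the "Invalid argument" cases above easier to diagnose.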
>>> 0007f: The "mbind: Invalid argument" issue itself, with the below addition:
> [..]
>>>
>>> but mbind() was called for just 0x7f39eeab1660 - 0x7f39eea00000 =
>>> 0xB1660 = 726624 bytes, but if I blindly adjust endptr in that
>>> fastpath_partition_init() to be "char *endptr = ptr + 2*1024*1024;"
>>> (HP) it doesn't complain anymore and I get success:
> [..]
>>
>> Hmm, so it seems like another hugepage-related issue. The mbind manpage
>> says this about "len":
>>
>> EINVAL An invalid value was specified for flags or mode; or addr + len
>> was less than addr; or addr is not a multiple of the system page size.
>>
>> I don't think that requires (addr+len) to be a multiple of page size,
>> but maybe that is required.
>
> I do think that 'system page size' means the HP page size here, but this
> time it's just for fastpath_partition_init(); the earlier one seems to
> be aligned fine (?? -- I haven't really checked, but there's no error)
>

Hmmm, OK. Will check. But maybe let's not focus too much on the PGPROC
partitioning, I don't think that's likely to go into 19.
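If the EINVAL really is just the length not being a multiple of the
backing (huge) page size, the fix might be as small as rounding the
partition length up before handing it to libnuma. A minimal sketch,
assuming the mapping's actual page size is available to the caller (the
helper name and its arguments are made up for illustration):

#include "postgres.h"

#include <numa.h>

/*
 * Bind [startptr, endptr) to the given NUMA node, rounding the length up
 * to the page size backing the mapping (2MB huge pages with huge_pages=on,
 * the regular 4kB page otherwise), since mbind() appears to reject lengths
 * that are not whole pages.
 */
static void
numa_bind_partition(char *startptr, char *endptr, int node, Size pagesize)
{
	Size		len = endptr - startptr;

	len = TYPEALIGN(pagesize, len);

	numa_tonode_memory(startptr, len, node);
}

The caveat is that rounding up is only safe if the trailing partial huge
page isn't supposed to belong to a different partition (and thus a
different node); otherwise the partition boundaries themselves need to be
hugepage-aligned.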
>>> 0006d: I've got one SIGBUS during a call to select
>>> pg_buffercache_numa_pages(); and it looks like the memory accessed is
>>> simply not mapped? (bug)
>>>
>>> Program received signal SIGBUS, Bus error.
>>> pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
>>> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>>> 386 pg_numa_touch_mem_if_required(ptr);
>>> (gdb) print ptr
>>> $1 = 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
>>> (gdb) where
>>> #0 pg_buffercache_numa_pages (fcinfo=0x561a97e8e680) at
>>> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
>>> #1 0x0000561a672a0efe in ExecMakeFunctionResultSet
>>> (fcache=0x561a97e8e5d0, econtext=econtext@entry=0x561a97e8dab8,
>>> argContext=0x561a97ec62a0, isNull=0x561a97e8e578,
>>> isDone=isDone@entry=0x561a97e8e5c0) at
>>> ../src/backend/executor/execSRF.c:624
>>> [..]
>>>
>>> The postmaster still had the shm attached (visible via smaps), and if
>>> you compare 0x7f4ed0200000 closely against sorted smaps:
>>>
>>> 7f4921400000-7f4b21400000 rw-s 252600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4b21400000-7f4d21400000 rw-s 452600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4d21400000-7f4f21400000 rw-s 652600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4f21400000-7f4f4bc00000 rw-s 852600000 00:11 151111
>>> /anon_hugepage (deleted)
>>> 7f4f4bc00000-7f4f4c000000 rw-s 87ce00000 00:11 151111
>>> /anon_hugepage (deleted)
>>>
>>> it's NOT there at all (there's no mmap region starting with
>>> 0x"7f4e"). It looks like that's because pg_buffercache_numa_pages() is
>>> not aware of these new mmap()ed regions and instead does a simple loop
>>> over all NBuffers with "for (char *ptr = startptr; ptr < endptr; ptr +=
>>> os_page_size)"?
>>>
>>
>> I'm confused. How could that mapping be missing? Was this with huge
>> pages / how many did you reserve on the nodes?
>
>
> OK, I made an error: I partially got it correct (it crashes reliably)
> and partially misled you, apologies, let me explain. There were two
> questions for me:
> a) why we make a single mmap() and after numa_tonode_memory() we get
> plenty of mappings
> b) why we get SIGBUS (I thought the regions were not contiguous, but
> they are, after triple-checking)
>
> ad a) My testing shows this on HP, as stated initially ("all of this
> was on 4s/4 NUMA nodes with HP on"). That's what the code does: you
> get a single mmap() (resulting in a single entry in smaps), but after
> numa_tonode_memory() there are many of them. Even on a laptop:
>
> System has 1 NUMA nodes (0 to 0).
> Attempting to allocate 8.000000 MB of HugeTLB memory...
> Successfully allocated HugeTLB memory at 0x755828800000, smaps before:
> 755828800000-755829000000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> Pinning first part (from 0x755828800000) to NUMA node 0...
> smaps after:
> 755828800000-755828c00000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> 755828c00000-755829000000 rw-s 00400000 00:11 259808
> /anon_hugepage (deleted)
> Pinning second part (from 0x755828c00000) to NUMA node 0...
> smaps after:
> 755828800000-755828c00000 rw-s 00000000 00:11 259808
> /anon_hugepage (deleted)
> 755828c00000-755829000000 rw-s 00400000 00:11 259808
> /anon_hugepage (deleted)
>
> It gets even funnier. Below I have 8MB with HP=on, but just issue
> numa_tonode_memory() twice (len 2MB, node 0): once for ptr at the start
> of the region, and the second time at its halfway point:
>
> System has 1 NUMA nodes (0 to 0).
> Attempting to allocate 8.000000 MB of HugeTLB memory...
> Successfully allocated HugeTLB memory at 0x7302dda00000, smaps before:
> 7302dda00000-7302de200000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> Pinning first part (from 0x7302dda00000) to NUMA node 0...
> smaps after:
> 7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> 7302ddc00000-7302de200000 rw-s 00200000 00:11 284859
> /anon_hugepage (deleted)
> Pinning second part (from 0x7302dde00000) to NUMA node 0...
> smaps after:
> 7302dda00000-7302ddc00000 rw-s 00000000 00:11 284859
> /anon_hugepage (deleted)
> 7302ddc00000-7302dde00000 rw-s 00200000 00:11 284859
> /anon_hugepage (deleted)
> 7302dde00000-7302de000000 rw-s 00400000 00:11 284859
> /anon_hugepage (deleted)
> 7302de000000-7302de200000 rw-s 00600000 00:11 284859
> /anon_hugepage (deleted)
>
> Why 4 instead of 1? Because some mappings are now "default", because
> their policy was not altered:
>
> $ grep huge /proc/$(pidof testnumammapsplit)/numa_maps
> 7302dda00000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7302ddc00000 default file=/anon_hugepage\040(deleted) huge
> 7302dde00000 bind:0 file=/anon_hugepage\040(deleted) huge
> 7302de000000 default file=/anon_hugepage\040(deleted) huge
>
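For completeness, a standalone reconstruction of the kind of test program
described above (the actual testnumammapsplit source isn't included in
this message, so this is just a sketch assuming a MAP_HUGETLB mapping and
two numa_tonode_memory() calls; compile with -lnuma):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numa.h>

#define ALLOC_SIZE	(8 * 1024 * 1024)	/* four 2MB huge pages */
#define CHUNK		(2 * 1024 * 1024)

int
main(void)
{
	char	   *ptr;

	if (numa_available() < 0)
	{
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	printf("System has %d NUMA nodes.\n", numa_num_configured_nodes());

	ptr = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}
	printf("allocated HugeTLB memory at %p\n", ptr);

	/* pin the first 2MB chunk to node 0 */
	numa_tonode_memory(ptr, CHUNK, 0);

	/* pin another 2MB chunk starting at the halfway point (+4MB) */
	numa_tonode_memory(ptr + 2 * CHUNK, CHUNK, 0);

	/* keep the process alive so smaps/numa_maps can be inspected */
	printf("inspect /proc/%d/smaps and numa_maps, then press Enter\n",
		   (int) getpid());
	getchar();

	munmap(ptr, ALLOC_SIZE);
	return 0;
}

With 2MB huge pages this produces exactly the bind:0/default alternation
shown in numa_maps above: only the two explicitly pinned chunks get a
policy, while the rest of the single original mapping stays on "default".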
> Back to the original error: those are consecutive regions, and the
> earlier problem is
>
> error: 0x7f4ed0200000 <error: Cannot access memory at address 0x7f4ed0200000>
> start: 0x7f4921400000
> end: 0x7f4f4c000000
>
> so it does fit into that range (that was my mistake earlier, using just
> grep and not checking whether it really falls within the range), but...
>
>> Maybe there were not enough huge pages left on one of the nodes?
>
> ad b) right, something like that. I've investigated that SIGBUS there
> (it's going to be long):
>
> with shared_buffers=32GB, huge_pages 17715 (+1 from what postgres -C
> shared_memory_size_in_huge_pages returns), right after startup, but no
> touch:
>
> Program received signal SIGBUS, Bus error.
> pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> 386 pg_numa_touch_mem_if_required(ptr);
> (gdb) where
> #0 pg_buffercache_numa_pages (fcinfo=0x5572038790b8) at
> ../contrib/pg_buffercache/pg_buffercache_pages.c:386
> #1 0x00005571f54ddb7d in ExecMakeTableFunctionResult
> (setexpr=0x557203870d40, econtext=0x557203870ba8,
> argContext=<optimized out>, expectedDesc=0x557203870f80,
> randomAccess=false) at ../src/backend/executor/execSRF.c:234
> [..]
> (gdb) print ptr
> $1 = 0x7f6cf8400000 <error: Cannot access memory at address 0x7f6cf8400000>
> (gdb)
>
>
> and then it shows (?!) that there are no hugepages available on one of
> the nodes (while gdb is hanging and preventing autorestart):
>
> root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 299
> node1/meminfo:Node 1 HugePages_Free: 299
> node2/meminfo:Node 2 HugePages_Free: 299
> node3/meminfo:Node 3 HugePages_Free: 0
>
> but they are equal in terms of size:
> node0/meminfo:Node 0 HugePages_Total: 4429
> node1/meminfo:Node 1 HugePages_Total: 4429
> node2/meminfo:Node 2 HugePages_Total: 4429
> node3/meminfo:Node 3 HugePages_Total: 4428
>
> smaps shows that this address (7f6cf8400000) is mapped in this mapping:
> 7f6b49c00000-7f6d49c00000 rw-s 652600000 00:11 86064
> /anon_hugepage (deleted)
>
> numa_maps for this region shows that mapping is on node3 (notice
> N3 + bind:3 matches the lack of memory in Node 3 HugePages_Free):
> 7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
> N3=3444 kernelpagesize_kB=2048
>
> the surrounding area looks like this:
>
> 7f6549c00000 bind:0 file=/anon_hugepage\040(deleted) huge dirty=4096
> N0=4096 kernelpagesize_kB=2048
> 7f6749c00000 bind:1 file=/anon_hugepage\040(deleted) huge dirty=4096
> N1=4096 kernelpagesize_kB=2048
> 7f6949c00000 bind:2 file=/anon_hugepage\040(deleted) huge dirty=4096
> N2=4096 kernelpagesize_kB=2048
> 7f6b49c00000 bind:3 file=/anon_hugepage\040(deleted) huge dirty=3444
> N3=3444 kernelpagesize_kB=2048 <-- this is the one
> 7f6d49c00000 default file=/anon_hugepage\040(deleted) huge dirty=107
> mapmax=6 N3=107 kernelpagesize_kB=2048
>
> Notice it's just N3=3444, while the others are much larger. So
> something was using that hugepage memory on N3:
>
> # grep kernelpagesize_kB=2048 /proc/1679/numa_maps | grep -Po
> N[0-4]=[0-9]+ | sort
> N0=2
> N0=4096
> N1=2
> N1=4096
> N2=2
> N2=4096
> N3=1
> N3=1
> N3=1
> N3=1
> N3=107
> N3=13
> N3=3
> N3=3444
>
> So per the above it's not there (at least not as a 2MB HP). But the
> number of mappings is wild there! (The node where it is failing has
> plenty of memory and no hugepage memory left, but it has something like
> 40k+ small mappings!)
>
> # grep -Po 'N[0-3]=' /proc/1679/numa_maps | sort | uniq -c
> 17 N0=
> 10 N1=
> 3 N2=
> 40434 N3=
>
> Most of them are `anon_inode:[io_uring]` (and I had
> max_connections=10k). You may ask why, in spite of Andres' optimization
> for reducing the number of segments for io_uring, it's not working for
> me? Well, I've just noticed a way-too-silent failure to activate it
> (although I'm on 6.14.x):
> 2025-11-06 13:34:49.128 CET [1658] DEBUG: can't use combined
> memory mapping for io_uring, kernel or liburing too old
> and apparently I don't have
> io_uring_queue_init_mem()/HAVE_LIBURING_QUEUE_INIT_MEM on liburing-2.3
> (Debian's default). See [1] for more info (the fix is not committed
> yet, sadly).
>
> Next try, now with io_method = worker, and right before start:
>
> root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Total
> node*/meminfo
> node0/meminfo:Node 0 HugePages_Total: 4429
> node1/meminfo:Node 1 HugePages_Total: 4429
> node2/meminfo:Node 2 HugePages_Total: 4429
> node3/meminfo:Node 3 HugePages_Total: 4428
> and HugePages_Free was at 100% (with postgresql down).
> After start (but without doing anything else):
> root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 4393
> node1/meminfo:Node 1 HugePages_Free: 4395
> node2/meminfo:Node 2 HugePages_Free: 4395
> node3/meminfo:Node 3 HugePages_Free: 3446
>
> So sadly the picture is the same (something stole my HP on N3, and it's
> PostgreSQL on its own). After some time investigating that ("who stole
> my hugepages across the whole OS"), I just added MAP_POPULATE to the
> mix of PG_MMAP_FLAGS and got this after start:
>
> root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 0
> node1/meminfo:Node 1 HugePages_Free: 0
> node2/meminfo:Node 2 HugePages_Free: 0
> node3/meminfo:Node 3 HugePages_Free: 1
>
> and then the SELECT from pg_buffercache_numa works fine(!).
>
> Other ways that I have found to eliminate that SIGBUS:
> a. Throw many more HugePages at it (so that the node does not run out
> of HugePages_Free), but that's not a real option.
> b. Then I reminded myself that I could be running a custom kernel with
> the experimental CONFIG_READ_ONLY_THP_FOR_FS (to transparently reduce
> iTLB misses with a specially linked PG; I will double-check the exact
> setup later), so I've thrown 'never' into
> /sys/kernel/mm/transparent_hugepage/enabled and defrag too (yes,
> disabled THP), and with that -- drumroll -- the SELECT works. The very
> same PG setup after startup (where earlier it would crash), now after
> the SELECT, looks like this:
>
> root@swiatowid:/sys/devices/system/node# grep -r -i HugePages_Free node*/meminfo
> node0/meminfo:Node 0 HugePages_Free: 83
> node1/meminfo:Node 1 HugePages_Free: 0
> node2/meminfo:Node 2 HugePages_Free: 81
> node3/meminfo:Node 3 HugePages_Free: 82
>
> Hope that helps a little. To me it sounds like THP somehow used that
> memory which we also wanted to use. With numa_interleave_ptr() that
> wouldn't be a problem, because it would probably use something else
> that's available, but not here, as we indicated the exact node.
>
>>> 0006e:
>>> I'm seeking confirmation, but is this the issue we discussed at
>>> PGConf.EU related to the lack of detection of mems_allowed, right? e.g.
>>> $ numactl --membind="0,1" --cpunodebind="0,1"
>>> /usr/pgsql19/bin/pg_ctl -D /path start
>>> still shows 4 NUMA nodes used. Current patches use
>>> numa_num_configured_nodes(), but it says 'This count includes any
>>> nodes that are currently DISABLED'. So I was wondering if I could help
>>> by migrating towards numa_num_task_nodes() / numa_get_mems_allowed()?
>>> Is it the same as what you wrote earlier to Alexy?
>>>
>>
>> If "mems_allowed" refers to nodes allowing memory allocation, then yes,
>> this would be one way to get into that issue. Oh, is this what happened
>> in 0006d?
>
> OK, thanks for the confirmation. No, 0006d was about a normal numactl
> run, without --membind.
>

I didn't have time to look into all this info about the mappings and
io_uring yet, so no response from me on that.
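On the 0006e point above, a minimal sketch of what counting only the
nodes the process is actually allowed to allocate from might look like
(just an illustration of the numa_get_mems_allowed() idea, not something
the current patches do):

#include <numa.h>

/*
 * Count the NUMA nodes this process may allocate memory from, honoring
 * cpuset/numactl restrictions such as --membind, unlike
 * numa_num_configured_nodes() which also counts nodes we can't use.
 */
static int
count_allowed_nodes(void)
{
	struct bitmask *allowed = numa_get_mems_allowed();
	int			max_node = numa_max_node();
	int			count = 0;

	for (int node = 0; node <= max_node; node++)
	{
		if (numa_bitmask_isbitset(allowed, node))
			count++;
	}

	numa_bitmask_free(allowed);

	return count;
}

numa_num_task_nodes() is supposed to return the same count directly; the
explicit loop just makes the relationship to the allowed-nodes mask (and
thus to which specific nodes partitions could be placed on) visible.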
>> I did get a couple of "operation canceled" failures, but only on fairly
>> old kernel versions (6.1 which came as default with the VM).
>
> OK, I'll try to see that later too.
>
> btw, a quick question regarding the partitioned clocksweep, as I had
> thought: does this open a road towards multiple bgwriters? (outside of
> this $thread/v1/PoC)
>

I don't think the clocksweep partitioning is required for multiple
bgwriters, but it might make it easier.

regards

--
Tomas Vondra

Attachments
- v20251111-0007-NUMA-partition-PGPROC.patch
- v20251111-0006-NUMA-shared-buffers-partitioning.patch
- v20251111-0005-clock-sweep-weighted-balancing.patch
- v20251111-0004-clock-sweep-scan-all-partitions.patch
- v20251111-0003-clock-sweep-balancing-of-allocations.patch
- v20251111-0002-clock-sweep-basic-partitioning.patch
- v20251111-0001-Infrastructure-for-partitioning-shared-buf.patch