Re: Adding basic NUMA awareness
From: Tomas Vondra
Subject: Re: Adding basic NUMA awareness
Date:
Msg-id: 1d57d68d-b178-415a-ba11-be0c3714638e@vondra.me
In reply to: Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>)
List: pgsql-hackers
On 9/11/25 10:32, Tomas Vondra wrote:
> ...
>
> For example, we may get confused about the memory page size. The "size"
> happens before allocation, and at that point we don't know if we succeed
> in getting enough huge pages. When "init" happens, we already know that,
> so our "memory page size" could be different. We must be careful, e.g.
> to not need more memory than we requested.

I forgot to mention the other issue with huge pages on NUMA. I already
reported [1] that it's trivial to crash with a SIGBUS, because

(1) huge pages get reserved on all NUMA nodes (evenly)

(2) the decision whether to use huge pages is made by mmap(), which only
needs to check whether there are enough huge pages in total

(3) numa_tonode_memory() is called later, and does not verify whether
the target node has enough free pages (I'm not sure it should / can)

(4) we only partition (and place on NUMA nodes) some of the memory, and
the rest (which is much smaller, but still sizeable) is what likely
causes the "imbalance" - it gets placed on one (random) node, which then
does not have enough space left for the stuff we explicitly placed there

(5) then at some point we try accessing one of the shared buffers, which
triggers a page fault, tries to get a huge page on that NUMA node,
realizes there are no free huge pages, and crashes with SIGBUS

(there's a minimal standalone sketch of this sequence at the end of this
message)

It clearly is not an option to just let it crash, but I still don't have
a great idea how to address it.

The only idea I have is to manually interleave the whole shared memory
(when using huge pages), page by page, so that this imbalance does not
happen (a rough sketch of that is at the end too). But it's harder than
it looks, because we don't necessarily partition everything evenly. For
example, one node can get a smaller chunk of shared buffers, because we
try to partition buffers and buffer descriptors in a "nice" way. The
PGPROC stuff is also not distributed quite evenly (e.g. aux/2PC entries
are not mapped to any node).

A different approach would be to calculate how many huge pages each node
might need - for the stuff we partition explicitly (buffers and PGPROC),
plus the rest of the memory that can get placed on any node - and then
require that "maximum" number of pages on every node. But that's
annoyingly wasteful, because every other node will end up with unusable
memory.

regards

[1]
https://www.postgresql.org/message-id/71a46484-053c-4b81-ba32-ddac050a8b5d%40vondra.me

--
Tomas Vondra
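
To make the failure mode concrete, here is a minimal standalone sketch
of the sequence above (not the actual postgres code - the 1GB mapping
size and node 0 are arbitrary, and it assumes node 0's share of the huge
page reservation is smaller than the mapping):

/*
 * Sketch of steps (2), (3) and (5): mmap() only checks the total huge
 * page reservation, numa_tonode_memory() binds the range to one node
 * without checking its free pages, and the page fault on first access
 * is what hits SIGBUS once that node runs out of huge pages.
 *
 * Build with: gcc -o sigbus sigbus.c -lnuma
 */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define MAPPING_SIZE	(1024UL * 1024 * 1024)	/* larger than one node's share */

int
main(void)
{
	char	   *ptr;

	if (numa_available() < 0)
		return 1;

	/* succeeds as long as there are enough huge pages in total */
	ptr = mmap(NULL, MAPPING_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (ptr == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}

	/* bind the whole range to node 0 - no check of free pages there */
	numa_tonode_memory(ptr, MAPPING_SIZE, 0);

	/* faulting the pages in is where the SIGBUS happens */
	memset(ptr, 0, MAPPING_SIZE);

	return 0;
}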
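
And a rough sketch of the "interleave manually" idea - the function name
is made up, the hard-coded 2MB huge page size is an assumption (the real
thing would use the actual memory page size of the mapping), and it does
not yet skip the regions we partition explicitly:

/*
 * Spread a range of huge-page backed shared memory across all NUMA
 * nodes, one memory page at a time, so that no single node ends up
 * with a disproportionate share of the "unpartitioned" memory.
 */
#include <numa.h>
#include <stddef.h>

#define HUGE_PAGE_SIZE	(2 * 1024 * 1024)	/* assumed huge page size */

static void
interleave_range(char *start, size_t size)
{
	int			nnodes = numa_num_configured_nodes();
	int			node = 0;
	size_t		offset;

	if (numa_available() < 0 || nnodes <= 1)
		return;

	for (offset = 0; offset < size; offset += HUGE_PAGE_SIZE)
	{
		size_t		chunk = size - offset;

		if (chunk > HUGE_PAGE_SIZE)
			chunk = HUGE_PAGE_SIZE;

		/* place this memory page on the "next" node, round-robin */
		numa_tonode_memory(start + offset, chunk, node);
		node = (node + 1) % nnodes;
	}
}

numa_interleave_memory() would do the round-robin in a single call, but
doing it chunk by chunk is what would allow skipping (or weighting) the
explicitly partitioned regions.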