Re: Adding basic NUMA awareness
| From | Tomas Vondra |
|---|---|
| Subject | Re: Adding basic NUMA awareness |
| Date | |
| Msg-id | 6636d842-e8a5-44bc-a206-6df415380a7b@vondra.me |
| In reply to | Re: Adding basic NUMA awareness (Tomas Vondra <tomas@vondra.me>) |
| Responses | Re: Adding basic NUMA awareness |
| List | pgsql-hackers |
Hi,

Here's a fresh version of the NUMA patch series. There are a number of substantial improvements:

1) Rebase to current master, particularly on top of 2c7894052759 which removed the freelist. The patch that partitioned the freelist is gone.

2) A bunch of fixes; it now passes the CI workflows on github, and all other testing I did. Of course, more testing is needed.

3) It builds with/without libnuma support, and so on.

4) The separate GUCs were replaced by a single list GUC, similar to what we do for debug_io_direct. The GUC is called debug_numa, and accepts "buffers" and "procs", to partition the two areas. I'm considering adding a "clock-sweep" option in a future patch. I didn't do that here, because the buffer partitioning is already enabled by "buffers", and the clock-sweep just builds on that (partitioning the same way, pretty much).

5) It also works with EXEC_BACKEND, but this turned out to be a bit more challenging than expected. The trouble is that some of the parameters (e.g. memory page size) are used both in the "size" and "init" phases, and we need to be extra careful not to make something inconsistent. For example, we may get confused about the memory page size. The "size" phase happens before allocation, and at that point we don't know if we'll succeed in getting enough huge pages. When "init" happens, we already know that, so our "memory page size" could be different. We must be careful, e.g. not to need more memory than we requested. This is a general problem, but EXEC_BACKEND makes it a bit trickier. In the regular fork() case we can simply set some global variables in "size" and use them later in "init", but that doesn't work for EXEC_BACKEND. The current approach simply does the calculations twice, in a way that should end with the same results. But I'm not 100% happy with it, and I suspect it just confirms we should store the results in shared memory (which is what I called the "NUMA registry" before). That's still a TODO.

6) BufferDescPadded is 64B everywhere. Originally, the padding was applied only on 64-bit platforms, and on 32-bit systems the struct was left at 60B. But that is not compatible with buffer partitioning, which relies on the memory page size being a multiple of the BufferDescPadded size. Perhaps it could be relaxed (so that a BufferDesc might span two memory pages), but this seems like the cleanest solution. I don't expect it to make any measurable difference.

7) I've moved some of the code to BgBufferSyncPartition, to make review easier (without all the indentation changes).

8) I've realized some of the TAP tests occasionally fail with

    ERROR: no unpinned buffers

and I think I know why. Some of the tests set shared_buffers to a very low value - like 1MB or even 128kB - and StrategyGetBuffer() may search only a single partition (but not always). We may run out of unpinned buffers in that one partition. This apparently happens more easily on rpi5, due to the weird NUMA layout (there are 8 nodes with memory, but getcpu() reports node 0 for all cores). I suspect the correct fix is to ensure StrategyGetBuffer() scans all partitions if there are no unpinned buffers in the current one. On realistic setups this shouldn't happen very often, I think. The other issue I just realized is that StrategyGetBuffer() recalculates the partition index over and over, which seems unnecessary (and possibly expensive, due to the modulo). And it also does too many loops, because it used NBuffers instead of the partition size. I'll fix those later.

9) I'm keeping the clock-sweep balancing patches separate for now. In the end all the clock-sweep patches should be merged, but it keeps the changes easier to review this way, I think.

10) There are not many changes in the PGPROC partitioning patch. I merely fixed issues that broke it on EXEC_BACKEND, and did some smaller tweaks.

11) I'm not including the experimental patches to pin backends to CPUs (or nodes), and so on. It's clear those are unlikely to go in, so they'd be just a distraction.
12) What's the right / portable way to determine the current CPU for a process? The clock-sweep / PGPROC patches need this to pick the right partition, but it's not clear to me which API is the most portable. In the end I used sched_getcpu() and then numa_node_of_cpu(), but maybe there's a better way.

regards

--
Tomas Vondra
Attachments