Re: Draft for basic NUMA observability
| От | Andres Freund | 
|---|---|
| Тема | Re: Draft for basic NUMA observability | 
| Дата | |
| Msg-id | lwp2vhbc6klr42mse7y6oze55p3mfutx2nt3crjv2wdyovst3t@u55ai45p25y4 обсуждение исходный текст | 
| Ответ на | Re: Draft for basic NUMA observability (Jakub Wartak <jakub.wartak@enterprisedb.com>) | 
| Ответы | Re: Draft for basic NUMA observability Re: Draft for basic NUMA observability | 
| Список | pgsql-hackers | 
Hi, On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote: > On Thu, Feb 20, 2025 at 9:32 AM Bertrand Drouvot > <bertranddrouvot.pg@gmail.com> wrote: > > OTOH, one of the main issue that I see with 3. is that the monitoring could > > probably influence the kernel's decision to start pages migration (I'm not 100% > > sure but I could imagine it may influence the kernel's decision due to having to > > read/touch the pages). > > > > But I'm thinking: do we really need to know the page location of every single page? > > I think what we want to see is if the pages are "equally" distributed on all > > the nodes or are somehow "stuck" to one (or more) nodes. In that case what about > > using get_mempolicy() but on a subset of the buffer cache? (say every Nth buffer > > or contiguous chunks). We could create a new function that would accept a > > "sampling distance" as parameter for example, thoughts? > > The way I envision it (and I think what Andres wanted, not sure, still > yet to see him comment on all of this) is to give PG devs a way to > quickly spot NUMA imbalances, even for single relation. Yea. E.g. for some benchmark workloads the difference whether the root btree page is on the same NUMA node as the workload or not makes a roughly 2x perf difference. It's really hard to determine that today. > If there would be agreement that this is the way we want to have it > (from the backend and not from checkpointer), here's what's left on > the table to be done here: > a. isn't there something quicker for touching / page-faulting memory ? If you actually fault in a page the kernel actually has to allocate memory and then zero it out. That rather severely limits the throughput... > If not then maybe add CHECKS_FOR_INTERRUPTS() there? Should definitely be there. > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't > help (it probably only works for parent//postmaster). Yes, needs to be in postmaster. Does the issue with "new" backends seeing pages as not present exist both with and without huge pages? FWIW, what you posted fails on CI: https://cirrus-ci.com/task/5114213770723328 Probably some ifdefs are missing. The sanity-check task configures with minimal dependencies, which is why you're seeing this even on linux. > b. refactor shared code so that it goes into src/port (but with > Linux-only support so far) > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not? You mean a specific context instead of CurrentMemoryContext? > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml > index 91b51142d2e..e3b7554d9e8 100644 > --- a/.cirrus.tasks.yml > +++ b/.cirrus.tasks.yml > @@ -436,6 +436,10 @@ task: > SANITIZER_FLAGS: -fsanitize=address > PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range > > + > + # FIXME: use or not the libnuma? > + # --with-libnuma \ > + # > # Normally, the "relation segment" code basically has no coverage in our > # tests, because we (quite reasonably) don't generate tables large > # enough in tests. We've had plenty bugs that we didn't notice due > the I don't see why not. > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > new file mode 100644 > index 00000000000..e5b3d1f7dd2 > --- /dev/null > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql > @@ -0,0 +1,30 @@ > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit > + > +-- Register the function. > +DROP FUNCTION pg_buffercache_pages() CASCADE; Why? I think that's going to cause problems, as the pg_buffercache view depends on it, and user views might turn in depend on pg_buffercache. I think CASCADE is rarely, if ever, ok to use in an extension scripot. > +CREATE OR REPLACE FUNCTION pg_buffercache_pages(boolean) > +RETURNS SETOF RECORD > +AS 'MODULE_PATHNAME', 'pg_buffercache_pages' > +LANGUAGE C PARALLEL SAFE; > > +-- Create a view for convenient access. > +CREATE OR REPLACE VIEW pg_buffercache AS > + SELECT P.* FROM pg_buffercache_pages(false) AS P > + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, > + relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, > + pinning_backends int4); > + > +CREATE OR REPLACE VIEW pg_buffercache_numa AS > + SELECT P.* FROM pg_buffercache_pages(true) AS P > + (bufferid integer, relfilenode oid, reltablespace oid, reldatabase oid, > + relforknumber int2, relblocknumber int8, isdirty bool, usagecount int2, > + pinning_backends int4, numa_zone_id int4); > + > +-- Don't want these to be available to public. > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC; > +REVOKE ALL ON pg_buffercache FROM PUBLIC; > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC; We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same for _numa? > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS) > else > fctx->record[i].isvalid = false; > > +#ifdef USE_LIBNUMA > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */ > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) > + blk2page = (int) i * pages_per_blk; BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here. > + j = 0; > + do { > + /* > + * Many buffers can point to the same page, but we want to > + * query just first address. > + * > + * In order to get reliable results we also need to touch memory pages > + * so that inquiry about NUMA zone doesn't return -2. > + */ > + if(os_page_ptrs[blk2page+j] == 0) { > + volatile uint64 touch pg_attribute_unused(); > + os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j); > + touch = *(uint64 *)os_page_ptrs[blk2page+j]; > + } > + j++; > + } while(j < (int)pages_per_blk); > +#endif > + Why is this done before we even have gotten -2 back? Even if we need it, it seems like we ought to defer this until necessary. > +#ifdef USE_LIBNUMA > + if(query_numa) { > + /* According to numa(3) it is required to initialize library even if that's no-op. */ > + if(numa_available() == -1) { > + pg_buffercache_mark_numa_invalid(fctx, NBuffers); > + elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");; > + } else { > + /* Amortize the number of pages we need to query about */ > + if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) { > + elog(ERROR, "failed NUMA pages inquiry status"); > + } I wonder if we ought to override numa_error() so we can display more useful errors. > + > + LWLockAcquire(ShmemIndexLock, LW_SHARED); Doing multiple memory allocations while holding an lwlock is probably not a great idea, even if the lock normally isn't contended. > + os_page_ptrs = palloc(sizeof(void *) * os_page_count); > + os_pages_status = palloc(sizeof(int) * os_page_count); Why do this in very loop iteration? Greetings, Andres Freund
В списке pgsql-hackers по дате отправления: