Discussion: failed NUMA pages inquiry status: Operation not permitted
> src/test/regress/expected/numa.out   | 13 +++
> src/test/regress/expected/numa_1.out |  5 +

numa_1.out is catching this error:

ERROR: libnuma initialization failed or NUMA is not supported on this platform

This is what I'm getting when running PG18 in docker on Debian trixie
(libnuma 2.0.19).

However, on older distributions, the error is different:

postgres=# select * from pg_shmem_allocations_numa;
ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
LOCATION: pg_get_shmem_allocations_numa, shmem.c:691

This makes the numa regression tests fail in Docker on Debian bookworm
(libnuma 2.0.16) and older, and on all of the Ubuntu LTS releases.

The attached patch makes it accept these errors, but perhaps it would
be better to detect it in pg_numa_available().

Christoph
Attachments
On 10/16/25 13:38, Christoph Berg wrote:
>> src/test/regress/expected/numa.out   | 13 +++
>> src/test/regress/expected/numa_1.out |  5 +
>
> numa_1.out is catching this error:
>
> ERROR: libnuma initialization failed or NUMA is not supported on this platform
>
> This is what I'm getting when running PG18 in docker on Debian trixie
> (libnuma 2.0.19).
>
> However, on older distributions, the error is different:
>
> postgres=# select * from pg_shmem_allocations_numa;
> ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
> LOCATION: pg_get_shmem_allocations_numa, shmem.c:691
>
> This makes the numa regression tests fail in Docker on Debian bookworm
> (libnuma 2.0.16) and older, and on all of the Ubuntu LTS releases.

It's probably more about the kernel version. What kernels are used by
these systems?

> The attached patch makes it accept these errors, but perhaps it would
> be better to detect it in pg_numa_available().

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or shared
by multiple processes).

thanks

--
Tomas Vondra
Re: Tomas Vondra
> It's probably more about the kernel version. What kernels are used by
> these systems?

It's the very same kernel, just different docker containers on the
same system. I did not investigate yet where the problem is coming
from, different libnuma versions seemed like the best bet.

Same (differing) results on both these systems:

Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

> Not sure how that would work. It seems this is some sort of permission
> check in numa_move_pages, which is not what pg_numa_available does. Also,
> it may depend on the page queried (e.g. whether it's exclusive or shared
> by multiple processes).

It's probably the lack of some process capability in that environment.
Maybe there is a way to query that, but I don't know much about that
yet.

Christoph
Re: To Tomas Vondra
> It's the very same kernel, just different docker containers on the
> same system. I did not investigate yet where the problem is coming
> from, different libnuma versions seemed like the best bet.

numactl shows the problem already:

Host system:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

debian:trixie-slim container:

$ numactl --show
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
No NUMA support available on this system.

debian:bookworm-slim container:

$ numactl --show
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

Running with sudo does not change the result.

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available()?

Christoph
Re: To Tomas Vondra
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available()?

Or perhaps give up on pg_numa_available, and just have two _1.out and
_2.out that just contain the two different error messages, without
trying to catch the problem.

Christoph
On 10/16/25 16:54, Christoph Berg wrote:
> Re: Tomas Vondra
>> It's probably more about the kernel version. What kernels are used by
>> these systems?
>
> It's the very same kernel, just different docker containers on the
> same system. I did not investigate yet where the problem is coming
> from, different libnuma versions seemed like the best bet.
>
> Same (differing) results on both these systems:
> Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
> Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

Hmmm. Those seem like relatively recent kernels.

>> Not sure how that would work. It seems this is some sort of permission
>> check in numa_move_pages, which is not what pg_numa_available does. Also,
>> it may depend on the page queried (e.g. whether it's exclusive or shared
>> by multiple processes).
>
> It's probably the lack of some process capability in that environment.
> Maybe there is a way to query that, but I don't know much about that
> yet.

The move_pages() manpage mentions PTRACE_MODE_READ_REALCREDS (man ptrace),
so maybe that's it.

--
Tomas Vondra
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available() ?
numactl 2.0.19 --show does this:
	if (numa_available() < 0) {
		show_physcpubind();
		printf("No NUMA support available on this system.\n");
		exit(1);
	}
int numa_available(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == ENOSYS || errno == EPERM))
		return -1;
	return 0;
}
pg_numa_available is already calling numa_available.
But numactl 2.0.16 has this:
int numa_available(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS)
		return -1;
	return 0;
}
... which is not catching the "permission denied" error I am seeing.
So maybe PG should implement numa_available itself like that. (Or
accept the output difference so the regression tests are passing.)
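(For illustration, an EPERM-aware check along those lines, mirroring the
numactl 2.0.19 code quoted above, could look roughly like this. It is only
a sketch, not the actual pg_numa.c code, and the function name is made up.)

#include <errno.h>
#include <numaif.h>		/* get_mempolicy() */
#include <stdbool.h>

/*
 * Sketch: treat both "syscall not implemented" (ENOSYS) and "operation
 * not permitted" (EPERM, e.g. in seccomp-restricted containers) as
 * "NUMA not usable".
 */
static bool
pg_numa_usable(void)
{
	if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 &&
		(errno == ENOSYS || errno == EPERM))
		return false;
	return true;
}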
Christoph
On 10/16/25 17:19, Christoph Berg wrote:
>> So maybe all that's needed is a get_mempolicy() call in
>> pg_numa_available()?
>
> ...
>
> So maybe PG should implement numa_available itself like that. (Or
> accept the output difference so the regression tests are passing.)

I'm not sure which of those options is better. I'm a bit worried just
accepting the alternative output would hide some failures in the future
(although it's a low risk).

So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

regards

--
Tomas Vondra
Attachments
Re: To Tomas Vondra
> So maybe PG should implement numa_available itself like that.

Following our discussion at pgconf.eu last week, I just implemented that.
The numa and pg_buffercache tests pass in Docker on Debian bookworm now.

Christoph
Attachments
Re: Tomas Vondra
> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
> attached patch. It still calls numa_available(), so that we don't
> silently miss future libnuma changes.
>
> Can you check this makes it work inside the docker container?

Yes, your patch works. (Sorry I meant to test earlier, but RL...)

Christoph
On 11/14/25 13:52, Christoph Berg wrote:
> Re: Tomas Vondra
>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>> attached patch. It still calls numa_available(), so that we don't
>> silently miss future libnuma changes.
>>
>> Can you check this makes it work inside the docker container?
>
> Yes, your patch works. (Sorry I meant to test earlier, but RL...)

Thanks. I've pushed the fix (and backpatched to 18).

regards

--
Tomas Vondra
Re: Tomas Vondra
>>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>>> attached patch. It still calls numa_available(), so that we don't
>>> silently miss future libnuma changes.
>>>
>>> Can you check this makes it work inside the docker container?
>>
>> Yes, your patch works. (Sorry I meant to test earlier, but RL...)
>
> Thanks. I've pushed the fix (and backpatched to 18).

It looks like we are not done here yet :(

postgresql-18 is failing here intermittently with this diff:

12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out	2025-11-10 21:52:06.000000000+0000
12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out	2025-12-11 11:20:22.618989603+0000
12:20:24 @@ -6,8 +6,4 @@
12:20:24  -- switch to superuser
12:20:24  \c -
12:20:24  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
12:20:24 - ok
12:20:24 -----
12:20:24 - t
12:20:24 -(1 row)
12:20:24 -
12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.

I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
questing/amd64, where libnuma should take care of this itself, without
the extra patch in PG. There was another case on bullseye/amd64 which
has the old libnuma.

It's been frequent enough so it killed 4 out of the 10 builds
currently visible on
https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
(Though to be fair, only one distribution/arch combination was failing
for each of them.)

There is also one instance of it in
https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/

I currently have no idea what's happening.

Christoph
On 12/11/25 13:29, Christoph Berg wrote:
> Re: Tomas Vondra
>>>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>>>> attached patch. It still calls numa_available(), so that we don't
>>>> silently miss future libnuma changes.
>>>>
>>>> Can you check this makes it work inside the docker container?
>>>
>>> Yes, your patch works. (Sorry I meant to test earlier, but RL...)
>>
>> Thanks. I've pushed the fix (and backpatched to 18).
>
> It looks like we are not done here yet :(
>
> postgresql-18 is failing here intermittently with this diff:
>
> 12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out	2025-11-10 21:52:06.000000000+0000
> 12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out	2025-12-11 11:20:22.618989603+0000
> 12:20:24 @@ -6,8 +6,4 @@
> 12:20:24  -- switch to superuser
> 12:20:24  \c -
> 12:20:24  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> 12:20:24 - ok
> 12:20:24 -----
> 12:20:24 - t
> 12:20:24 -(1 row)
> 12:20:24 -
> 12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
>
> That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.
>
> I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
> questing/amd64, where libnuma should take care of this itself, without
> the extra patch in PG. There was another case on bullseye/amd64 which
> has the old libnuma.
>
> It's been frequent enough so it killed 4 out of the 10 builds
> currently visible on
> https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
> (Though to be fair, only one distribution/arch combination was failing
> for each of them.)
>
> There is also one instance of it in
> https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/
>
> I currently have no idea what's happening.

Hmmm, strange. -2 is ENOENT, which should mean this:

-ENOENT
       The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?

regards

--
Tomas Vondra
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
>        The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

Sorry, I forgot to mention that this is now in the normal apt.pg.o build
environment (chroots without any funky permission restrictions). I have
not tried Docker yet.

I think it was not happening before the backport of the Docker fix. But
I have no idea why this should have broken anything, and why it would
only happen like 3% of the time.

Christoph
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
> The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?
I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few 100 iterations:
while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.
I tried reading the kernel source and it sounds related:
* If the source virtual memory range has any unmapped holes, or if
* the destination virtual memory range is not a whole unmapped hole,
* move_pages() will fail respectively with -ENOENT or -EEXIST. This
* provides a very strict behavior to avoid any chance of memory
* corruption going unnoticed if there are userland race conditions.
* Only one thread should resolve the userland page fault at any given
* time for any given faulting address. This means that if two threads
* try to both call move_pages() on the same destination address at the
* same time, the second thread will get an explicit error from this
* command.
...
* The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
* prevent -ENOENT errors to materialize if there are holes in the
* source virtual range that is being remapped. The holes will be
* accounted as successfully remapped in the retval of the
* command. This is mostly useful to remap hugepage naturally aligned
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
		   unsigned long src_start, unsigned long len, __u64 mode)
...
		if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
			err = -ENOENT;
			break;
What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):
int numa_move_pages(int pid, unsigned long count,
		    void **pages, const int *nodes, int *status, int flags)
{
	return move_pages(pid, count, pages, nodes, status, flags);
}
I guess the answer is somewhere in that gap.
> ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
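(As an illustration of the "negative numbers to NULL" idea, the per-page
handling could look something like the sketch below; the helper and
variable names are hypothetical, not the actual shmem.c code.)

#include "postgres.h"

/*
 * Hypothetical helper: turn one move_pages() status value into the
 * numa_node column.  Negative values are errno codes (e.g. -ENOENT when
 * the page is not resident), which would become NULL instead of an error.
 */
static void
numa_node_from_status(int status, Datum *value, bool *isnull)
{
	if (status < 0)
	{
		*isnull = true;			/* swapped out / not mapped */
		*value = (Datum) 0;
	}
	else
	{
		*isnull = false;
		*value = Int32GetDatum(status);
	}
}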
Christoph
Re: To Tomas Vondra
> I've managed to reproduce it once, running this loop on
> 18-as-of-today. It errored out after a few 100 iterations:
>
> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
>
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
>
> That was on the apt.pg.o amd64 build machine while a few things were
> just building. Maybe ENOENT "The page is not present" means something
> was just swapped out because the machine was under heavy load.

I played a bit more with it.

* It seems to trigger only once for a running cluster. The next one
  needs a restart
* If it doesn't trigger within the first 30s, it probably never will
* It seems easier to trigger on a system that is under load (I started
  a few pgmodeler compile runs in parallel (C++))

But none of that answers the "why".

Christoph
On 12/16/25 15:48, Christoph Berg wrote:
> Re: To Tomas Vondra
>> I've managed to reproduce it once, running this loop on
>> 18-as-of-today. It errored out after a few 100 iterations:
>>
>> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
>>
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
>>
>> That was on the apt.pg.o amd64 build machine while a few things were
>> just building. Maybe ENOENT "The page is not present" means something
>> was just swapped out because the machine was under heavy load.
>
> I played a bit more with it.
>
> * It seems to trigger only once for a running cluster. The next one
> needs a restart
> * If it doesn't trigger within the first 30s, it probably never will
> * It seems easier to trigger on a system that is under load (I started
> a few pgmodeler compile runs in parallel (C++))
>
> But none of that answers the "why".
>
Hmmm, so this is interesting. I tried this on my workstation (with a
single NUMA node), and I see this:
1) right after opening a connection, I get this
test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 |   290
        -2 | 32478
(2 rows)
2) but a select from pg_shmem_allocations_numa works fine
test=# select numa_node, count(*) from pg_shmem_allocations_numa group by 1;
 numa_node | count
-----------+-------
         0 |    72
(1 row)
3) and if I repeat the pg_buffercache_numa query, it now works
test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 | 32768
(1 row)
That's a bit strange. I have no idea why this is happening. If I
reconnect, I start getting the failures again.
regards
--
Tomas Vondra
Re: Tomas Vondra
> 1) right after opening a connection, I get this
>
> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>  numa_node | count
> -----------+-------
>          0 |   290
>         -2 | 32478

Does that mean that the "touch all pages" logic is missing in some
code paths?

But even with that, it seems to be able to degenerate again and
accepting -2 in the regression tests would be required to make it
stable.

Christoph
On 12/16/25 18:54, Christoph Berg wrote:
> Re: Tomas Vondra
>> 1) right after opening a connection, I get this
>>
>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>  numa_node | count
>> -----------+-------
>>          0 |   290
>>         -2 | 32478
>
> Does that mean that the "touch all pages" logic is missing in some
> code paths?

I did check and AFAICS we are touching the pages in pg_buffercache_numa.

To make it even more confusing, I can no longer reproduce the behavior I
reported yesterday. It just consistently reports "0" and I have no idea
why it changed :-( I did restart since yesterday, so maybe that changed
something.

> But even with that, it seems to be able to degenerate again and
> accepting -2 in the regression tests would be required to make it
> stable.

No opinion yet. Either the -2 can happen occasionally, and then we'd
need to adjust the regression tests. Or maybe it's some thinko, and then
it'd be good to figure out why it's happening.

I find it interesting it does not seem to fail on the buildfarm. Or at
least I'm not aware of such failures. Even a rare failure should show
itself on the buildfarm a couple times, so how come it didn't?

regards

--
Tomas Vondra
On 12/17/25 12:07, Tomas Vondra wrote:
>
>
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>> numa_node | count
>>> -----------+-------
>>> 0 | 290
>>> -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
>
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
>
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
>
I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).
What I did is I created two instances - one to keep the system busy, one
for experimentation. The "busy" one is set to use shared_buffers=16GB,
and then running read-only pgbench.
pgbench -i -s 4500 test
pgbench -S -j 16 -c 64 -T 600 -P 1 test
The system has 64GB of RAM and 12 cores, so this is a lot of load.
Then, the other instance is set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):
pg_ctl -D data -l pg.log start;
for r in $(seq 1 10); do
    psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
done;
pg_ctl -D data -l pg.log stop;
And this often fails like this:
----------------------------------------------------------------------
waiting for server to start.... done
server started
 numa_node |  count
-----------+---------
         0 | 1045302
        -2 |    3274
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1025321
        -2 |   23255
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1038596
        -2 |    9980
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048518
        -2 |      58
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048525
        -2 |      51
(2 rows)
waiting for server to shut down.... done
server stopped
----------------------------------------------------------------------
So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.
Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 for a status.
In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.
FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too. Which I find a bit weird.
Because why should those be paged out?
The question is what to do about this. I don't think we can prevent the
-2 values, and error-ing out does not seem great either (most systems
have swap, so -2 may not be all that rare).
In fact, pg_shmem_allocations_numa probably should not error-out either,
because it's now reliably failing (on the busy instance).
I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...
regards
--
Tomas Vondra
Re: Tomas Vondra
> I guess the only solution is to accept -2 as a possible value (unknown
> node). But that makes regression testing harder, because it means the
> output could change a lot ...

Or just not test that, or do something like

select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;

Christoph
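(A hedged variant of that idea, collapsing the check into a single
deterministic row so the expected output stays stable; the upper bound
1024 is an arbitrary sanity limit, not a PostgreSQL constant:)

-- every entry either has a plausible node id or the "not resident" marker
SELECT bool_and(numa_node = -2 OR numa_node BETWEEN 0 AND 1024) AS ok
FROM pg_shmem_allocations_numa;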
On Mon, Jan 5, 2026 at 11:30 PM Christoph Berg <myon@debian.org> wrote:
>
> Re: Tomas Vondra
> > I guess the only solution is to accept -2 as a possible value (unknown
> > node). But that makes regression testing harder, because it means the
> > output could change a lot ...
Hi Tomas! That's pretty wild, nice find about that swapping s_b thing!
So just to confirm, that was reproduced outside containers/docker,
right?
> Or just not test that, or do something like
>
> select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;
Well, with huge pages it should not be swappable, so another idea
would be to simply alter the first line of src/test/regress/sql/numa.sql
and sql/pg_buffercache_numa.sql like below:
- SELECT NOT(pg_numa_available()) AS skip_test \gset
+ SELECT (pg_numa_available() is false OR
          current_setting('huge_pages_status')::bool is false) as skip_test \gset
(I'm making the assumption that there are buildfarm animals with
huge_pages enabled; no idea how to check that)
-J.
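(For reference, huge_pages_status reports 'on', 'off' or 'unknown', so a
sketch of that skip check that compares the text value instead of casting
to bool might be:)

-- skip unless NUMA is usable and huge pages are actually in use
SELECT (NOT pg_numa_available()
        OR current_setting('huge_pages_status') <> 'on') AS skip_test \gset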
On 1/6/26 14:23, Jakub Wartak wrote:
> On Mon, Jan 5, 2026 at 11:30 PM Christoph Berg <myon@debian.org> wrote:
>>
>> Re: Tomas Vondra
>>> I guess the only solution is to accept -2 as a possible value (unknown
>>> node). But that makes regression testing harder, because it means the
>>> output could change a lot ...
>
> Hi Tomas! That's pretty wild, nice find about that swapping s_b thing!
> So just to confirm, that was reproduced outside containers/docker,
> right?
>
Yes, this is a regular bare-metal Debian system.
>> Or just not test that, or do something like
>>
>> select numa_node = -2 or numa_node between 0 and 1000 from pg_shmem_allocations_numa;
>
> Well, with the huge-pages it should be not swappable, so another idea
> would be simply alter first line of src/test/regress/sql/numa.sql and
> sql/pg_buffercache_numa.sql just like below:
> - SELECT NOT(pg_numa_available()) AS skip_test \gset
> + SELECT (pg_numa_available() is false OR
> current_setting('huge_pages_status')::bool is false) as skip_test
> \gset
>
> (I'm making assumption that there are buildfarm animals that
> huge_pages enabled, no idea how to check that)
>
Yes, using huge pages makes this go away.
I'm also even more sure it's about swap, because /proc/PID/smaps for
postmaster tracks how much of the mapping is in swap, and with regular
memory pages I get values like this for the main shmem segment:
Swap: 90508 kB
Swap: 275272 kB
Swap: 135020 kB
Swap: 116460 kB
Swap: 102388 kB
Swap: 93832 kB
Swap: 155616 kB
Swap: 165692 kB
These are just values from "grep" while the pgbench is running. The
instance has 16GB shared buffers, so 200MB is close to 1%. Not a huge
part, but still ...
I've always "known" shared buffers could be swapped out, but I've never
realized it would affect cases like this one.
I'm not a huge fan of fixing just the tests. Sure, the tests will pass,
but what's the point of that if you then can't run this on production
because it also fails (I mean, the pg_shmem_allocations_numa will fail)?
I think it's clear we need to tweak this to handle -2 status. And then
also adjust tests to accept non-deterministic results.
regards
--
Tomas Vondra
Hi Tomas,

On Tue, Jan 6, 2026 at 4:36 PM Tomas Vondra <tomas@vondra.me> wrote:
[..]
> I've always "known" shared buffers could be swapped out, but I've never
> realized it would affect cases like this one.

Same, I'm a little surprised by it, but it makes sense. In my old and
more recent tests I've always reasoned the following way: NUMA (2+
sockets) --> probably a big production system --> huge_pages literally
always enabled to avoid a variety of surprises (locks the region).

Also this kind of reminds me of our previous discussion about dividing
shm allocations into smaller requests (potentially 4kB shm regions that
are not huge_pages, so in theory swappable) [1].

> I'm not a huge fan of fixing just the tests. Sure, the tests will pass,
> but what's the point of that if you then can't run this on production
> because it also fails (I mean, the pg_shmem_allocations_numa will fail)?

Well, you are probably right.

> I think it's clear we need to tweak this to handle -2 status. And then
> also adjust tests to accept non-deterministic results.

The only question that remains is whether we want to expose it to the
user or not. We could:

a) silently ignore ENOENT in the back branches so that "size" won't
contain it (well, just change pg_get_shmem_allocations_numa()). It is
not part of any NUMA node anyway. Maybe we could emit a DEBUG1 message
or add a source code comment noting that we think such pages may be
swapped out.

b) not sure if it is a good idea, but in master we could expose it as a
new column "swapped_out_size" (or change the current datatype of the
"numa" column from ::integer to something like ::text, to allow
outputting numa_node as an integer but also putting node="swapped-out"
with the proper size). Sounds like a new minor feature that would be
able to tell the user that he has swapped out shm and needs to really
enable huge pages (?)

-J.

[1] - https://www.postgresql.org/message-id/jqg6jd32sw4s6gjkezauer372xrww7xnupvrcsqkegh2uhv6vg%40ppiwoigzz6v4
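(As a rough illustration of option (a), with hypothetical names rather
than the actual pg_get_shmem_allocations_numa() code, the per-page
accounting loop could simply skip non-resident pages:)

/*
 * Sketch: pages whose move_pages() status is -ENOENT (not resident,
 * e.g. swapped out) are skipped instead of raising an error, so they
 * are not attributed to any NUMA node.  All identifiers here are
 * illustrative.
 */
for (uint64 i = 0; i < page_count; i++)
{
	int		node = pages_status[i];

	if (node == -ENOENT)
		continue;				/* not resident: not on any node */

	if (node < 0 || node > max_node_id)
		elog(ERROR, "invalid NUMA node id outside of allowed range [0, %d]: %d",
			 max_node_id, node);

	node_sizes[node] += os_page_size;
}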