Discussion: failed NUMA pages inquiry status: Operation not permitted
> src/test/regress/expected/numa.out   | 13 +++
> src/test/regress/expected/numa_1.out |  5 +

numa_1.out is catching this error:

ERROR: libnuma initialization failed or NUMA is not supported on this platform

This is what I'm getting when running PG18 in docker on Debian trixie
(libnuma 2.0.19).

However, on older distributions, the error is different:

postgres =# select * from pg_shmem_allocations_numa;
ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
LOCATION: pg_get_shmem_allocations_numa, shmem.c:691

This makes the numa regression tests fail in Docker on Debian bookworm
(libnuma 2.0.16) and older and all of the Ubuntu LTS releases.

The attached patch makes it accept these errors, but perhaps it would
be better to detect it in pg_numa_available().

Christoph
Attachments
On 10/16/25 13:38, Christoph Berg wrote:
>> src/test/regress/expected/numa.out   | 13 +++
>> src/test/regress/expected/numa_1.out |  5 +
>
> numa_1.out is catching this error:
>
> ERROR: libnuma initialization failed or NUMA is not supported on this platform
>
> This is what I'm getting when running PG18 in docker on Debian trixie
> (libnuma 2.0.19).
>
> However, on older distributions, the error is different:
>
> postgres =# select * from pg_shmem_allocations_numa;
> ERROR: XX000: failed NUMA pages inquiry status: Operation not permitted
> LOCATION: pg_get_shmem_allocations_numa, shmem.c:691
>
> This makes the numa regression tests fail in Docker on Debian bookworm
> (libnuma 2.0.16) and older and all of the Ubuntu LTS releases.
>

It's probably more about the kernel version. What kernels are used by
these systems?

> The attached patch makes it accept these errors, but perhaps it would
> be better to detect it in pg_numa_available().
>

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or
shared by multiple processes).

thanks

--
Tomas Vondra
Re: Tomas Vondra
> It's probably more about the kernel version. What kernels are used by
> these systems?

It's the very same kernel, just different docker containers on the
same system. I have not yet investigated where the problem is coming
from; different libnuma versions seemed like the best bet.

Same (differing) results on both these systems:
Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

> Not sure how that would work. It seems this is some sort of permission
> check in numa_move_pages, which is not what pg_numa_available does. Also,
> it may depend on the page queried (e.g. whether it's exclusive or
> shared by multiple processes).

It's probably the lack of some process capability in that environment.
Maybe there is a way to query that, but I don't know much about that
yet.

Christoph
Re: To Tomas Vondra
> It's the very same kernel, just different docker containers on the
> same system. I have not yet investigated where the problem is coming
> from; different libnuma versions seemed like the best bet.

numactl shows the problem already:

Host system:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

debian:trixie-slim container:

$ numactl --show
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
No NUMA support available on this system.

debian:bookworm-slim container:

$ numactl --show
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

Running with sudo does not change the result.

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available() ?

Christoph
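A quick way to see which errno is actually involved is a standalone probe
that issues the same get_mempolicy() call libnuma's numa_available() uses.
This is only a sketch for illustration (the file name probe.c is made up):

/* probe.c: report how get_mempolicy(NULL, NULL, 0, 0, 0) fails, if at all.
 * Build with: cc probe.c -o probe -lnuma */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <numaif.h>

int main(void)
{
    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0)
    {
        printf("get_mempolicy failed: %s (errno=%d)\n", strerror(errno), errno);
        return 1;
    }
    printf("get_mempolicy succeeded, NUMA syscalls look usable here\n");
    return 0;
}

Run inside the two containers, it should show directly whether the syscall
fails with EPERM there, independent of which libnuma version the image ships.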
Re: To Tomas Vondra
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available() ?

Or perhaps give up on pg_numa_available, and just have two alternative
expected files (_1.out and _2.out) that contain the two different error
messages, without trying to catch the problem.

Christoph
On 10/16/25 16:54, Christoph Berg wrote:
> Re: Tomas Vondra
>> It's probably more about the kernel version. What kernels are used by
>> these systems?
>
> It's the very same kernel, just different docker containers on the
> same system. I have not yet investigated where the problem is coming
> from; different libnuma versions seemed like the best bet.
>
> Same (differing) results on both these systems:
> Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
> Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux
>

Hmmm. Those seem like relatively recent kernels.

>> Not sure how that would work. It seems this is some sort of permission
>> check in numa_move_pages, which is not what pg_numa_available does. Also,
>> it may depend on the page queried (e.g. whether it's exclusive or
>> shared by multiple processes).
>
> It's probably the lack of some process capability in that environment.
> Maybe there is a way to query that, but I don't know much about that
> yet.
>

The move_pages(2) manpage mentions PTRACE_MODE_READ_REALCREDS (man ptrace),
so maybe that's it.

--
Tomas Vondra
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available() ?
numactl 2.0.19 --show does this:
    if (numa_available() < 0) {
        show_physcpubind();
        printf("No NUMA support available on this system.\n");
        exit(1);
    }
int numa_available(void)
{
    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == ENOSYS || errno == EPERM))
        return -1;
    return 0;
}
pg_numa_available is already calling numa_available.
But numactl 2.0.16 has this:
int numa_available(void)
{
    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS)
        return -1;
    return 0;
}
... which is not catching the "permission denied" error I am seeing.
So maybe PG should implement numa_available itself like that. (Or
accept the output difference so the regression tests are passing.)
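For illustration, a minimal sketch of what such a check could look like on
the PostgreSQL side (the helper name pg_numa_probe_available is made up; this
is not an actual patch):

#include <errno.h>
#include <numaif.h>

/* Mirror numa_available() from numactl 2.0.19: treat both ENOSYS and EPERM
 * from get_mempolicy() as "NUMA not usable", regardless of which libnuma
 * version is installed. */
static int
pg_numa_probe_available(void)
{
    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 &&
        (errno == ENOSYS || errno == EPERM))
        return -1;
    return 0;
}

The caller could then treat -1 here the same way a failing numa_available()
is treated today.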
Christoph
On 10/16/25 17:19, Christoph Berg wrote:
>> So maybe all that's needed is a get_mempolicy() call in
>> pg_numa_available() ?
>
> ...
>
> So maybe PG should implement numa_available itself like that. (Or
> accept the output difference so the regression tests are passing.)
>

I'm not sure which of those options is better. I'm a bit worried just
accepting the alternative output would hide some failures in the future
(although it's a low risk).

So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?

regards

--
Tomas Vondra
Attachments
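The patch itself is attached rather than inlined. Purely as an illustration
of the approach described above (an assumption about its shape, not the
actual patch; the helper name pg_numa_init_check is made up), such a check
might look like this:

#include <errno.h>
#include <numa.h>
#include <numaif.h>

/* Keep calling numa_available(), but additionally treat EPERM from
 * get_mempolicy() as "NUMA not available", since older libnuma versions
 * only check for ENOSYS. */
static int
pg_numa_init_check(void)
{
    if (numa_available() < 0)
        return -1;

    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == EPERM)
        return -1;

    return 0;
}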
Re: To Tomas Vondra
> So maybe PG should implement numa_available itself like that.

Following our discussion at pgconf.eu last week, I just implemented
that. The numa and pg_buffercache tests pass in Docker on Debian
bookworm now.

Christoph
Attachments
Re: Tomas Vondra
> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
> attached patch. It still calls numa_available(), so that we don't
> silently miss future libnuma changes.
>
> Can you check this makes it work inside the docker container?

Yes your patch works. (Sorry I meant to test earlier, but RL...)

Christoph
On 11/14/25 13:52, Christoph Berg wrote:
> Re: Tomas Vondra
>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>> attached patch. It still calls numa_available(), so that we don't
>> silently miss future libnuma changes.
>>
>> Can you check this makes it work inside the docker container?
>
> Yes your patch works. (Sorry I meant to test earlier, but RL...)
>

Thanks. I've pushed the fix (and backpatched to 18).

regards

--
Tomas Vondra
Re: Tomas Vondra
> >> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
> >> attached patch. It still calls numa_available(), so that we don't
> >> silently miss future libnuma changes.
> >>
> >> Can you check this makes it work inside the docker container?
> >
> > Yes your patch works. (Sorry I meant to test earlier, but RL...)
>
> Thanks. I've pushed the fix (and backpatched to 18).

It looks like we are not done here yet :(

postgresql-18 is failing here intermittently with this diff:

12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out 2025-11-10 21:52:06.000000000+0000
12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out 2025-12-11 11:20:22.618989603+0000
12:20:24 @@ -6,8 +6,4 @@
12:20:24 -- switch to superuser
12:20:24 \c -
12:20:24 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
12:20:24 - ok
12:20:24 -----
12:20:24 - t
12:20:24 -(1 row)
12:20:24 -
12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.

I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
questing/amd64, where libnuma should take care of this itself, without
the extra patch in PG. There was another case on bullseye/amd64 which
has the old libnuma.

It's been frequent enough so it killed 4 out of the 10 builds
currently visible on
https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
(Though to be fair, only one distribution/arch combination was failing
for each of them.)

There is also one instance of it in
https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/

I currently have no idea what's happening.

Christoph
On 12/11/25 13:29, Christoph Berg wrote:
> Re: Tomas Vondra
>>>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>>>> attached patch. It still calls numa_available(), so that we don't
>>>> silently miss future libnuma changes.
>>>>
>>>> Can you check this makes it work inside the docker container?
>>>
>>> Yes your patch works. (Sorry I meant to test earlier, but RL...)
>>
>> Thanks. I've pushed the fix (and backpatched to 18).
>
> It looks like we are not done here yet :(
>
> postgresql-18 is failing here intermittently with this diff:
>
> 12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out 2025-11-10 21:52:06.000000000+0000
> 12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out 2025-12-11 11:20:22.618989603+0000
> 12:20:24 @@ -6,8 +6,4 @@
> 12:20:24 -- switch to superuser
> 12:20:24 \c -
> 12:20:24 SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> 12:20:24 - ok
> 12:20:24 -----
> 12:20:24 - t
> 12:20:24 -(1 row)
> 12:20:24 -
> 12:20:24 +ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
>
> That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.
>
> I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
> questing/amd64, where libnuma should take care of this itself, without
> the extra patch in PG. There was another case on bullseye/amd64 which
> has the old libnuma.
>
> It's been frequent enough so it killed 4 out of the 10 builds
> currently visible on
> https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
> (Though to be fair, only one distribution/arch combination was failing
> for each of them.)
>
> There is also one instance of it in
> https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/
>
> I currently have no idea what's happening.
>

Hmmm, strange. -2 is ENOENT, which should mean this:

-ENOENT
The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?

regards

--
Tomas Vondra
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
> The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

Sorry I forgot to mention that this is now in the normal apt.pg.o build
environment (chroots without any funky permission restrictions). I have
not tried Docker yet.

I think it was not happening before the backport of the Docker fix. But
I have no idea why this should have broken anything, and why it would
only happen like 3% of the time.

Christoph
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
> -ENOENT
> The page is not present.
>
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?
I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few 100 iterations:
while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.
I tried reading the kernel source and it sounds related:
* If the source virtual memory range has any unmapped holes, or if
* the destination virtual memory range is not a whole unmapped hole,
* move_pages() will fail respectively with -ENOENT or -EEXIST. This
* provides a very strict behavior to avoid any chance of memory
* corruption going unnoticed if there are userland race conditions.
* Only one thread should resolve the userland page fault at any given
* time for any given faulting address. This means that if two threads
* try to both call move_pages() on the same destination address at the
* same time, the second thread will get an explicit error from this
* command.
...
* The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
* prevent -ENOENT errors to materialize if there are holes in the
* source virtual range that is being remapped. The holes will be
* accounted as successfully remapped in the retval of the
* command. This is mostly useful to remap hugepage naturally aligned
* virtual regions without knowing if there are transparent hugepage
* in the regions or not, but preventing the risk of having to split
* the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
unsigned long src_start, unsigned long len, __u64 mode)
...
    if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
        err = -ENOENT;
        break;
What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):
int numa_move_pages(int pid, unsigned long count,
                    void **pages, const int *nodes, int *status, int flags)
{
    return move_pages(pid, count, pages, nodes, status, flags);
}
I guess the answer is somewhere in that gap.
> ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)
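A sketch of the second idea, kept deliberately generic (the type and function
names below are made up; the real code would set the nulls[] flag for the
numa_node column when building the row):

#include <stdbool.h>

/* One output value for the numa_node column: either a node id, or SQL NULL
 * when the kernel reported a negative status (e.g. -2 = -ENOENT). */
typedef struct NumaNodeValue
{
    bool isnull;
    int  node;
} NumaNodeValue;

static NumaNodeValue
numa_status_to_value(int status)
{
    NumaNodeValue v;

    if (status < 0)
    {
        /* pass "no node known for this page" through as NULL */
        v.isnull = true;
        v.node = 0;
    }
    else
    {
        v.isnull = false;
        v.node = status;
    }
    return v;
}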
Christoph
Re: To Tomas Vondra
> I've managed to reproduce it once, running this loop on
> 18-as-of-today. It errored out after a few 100 iterations:
>
> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
>
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
>
> That was on the apt.pg.o amd64 build machine while a few things were
> just building. Maybe ENOENT "The page is not present" means something
> was just swapped out because the machine was under heavy load.

I played a bit more with it.

* It seems to trigger only once for a running cluster. The next one
needs a restart
* If it doesn't trigger within the first 30s, it probably never will
* It seems easier to trigger on a system that is under load (I started
a few pgmodeler compile runs in parallel (C++))

But none of that answers the "why".

Christoph
On 12/16/25 15:48, Christoph Berg wrote:
> Re: To Tomas Vondra
>> I've managed to reproduce it once, running this loop on
>> 18-as-of-today. It errored out after a few 100 iterations:
>>
>> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
>>
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT: SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
>>
>> That was on the apt.pg.o amd64 build machine while a few things were
>> just building. Maybe ENOENT "The page is not present" means something
>> was just swapped out because the machine was under heavy load.
>
> I played a bit more with it.
>
> * It seems to trigger only once for a running cluster. The next one
> needs a restart
> * If it doesn't trigger within the first 30s, it probably never will
> * It seems easier to trigger on a system that is under load (I started
> a few pgmodeler compile runs in parallel (C++))
>
> But none of that answers the "why".
>
Hmmm, so this is interesting. I tried this on my workstation (with a
single NUMA node), and I see this:
1) right after opening a connection, I get this
test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 |   290
        -2 | 32478
(2 rows)
2) but a select from pg_shmem_allocations_numa works fine
test=# select numa_node, count(*) from pg_shmem_allocations_numa group by 1;
 numa_node | count
-----------+-------
         0 |    72
(1 row)
3) and if I repeat the pg_buffercache_numa query, it now works
test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 | 32768
(1 row)
That's a bit strange. I have no idea why this is happening. If I
reconnect, I start getting the failures again.
regards
--
Tomas Vondra
Re: Tomas Vondra
> 1) right after opening a connection, I get this
>
> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>  numa_node | count
> -----------+-------
>          0 |   290
>         -2 | 32478

Does that mean that the "touch all pages" logic is missing in some code
paths?

But even with that, it seems to be able to degenerate again, and
accepting -2 in the regression tests would be required to make it
stable.

Christoph
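For what it's worth, the untouched-page theory can be checked outside
PostgreSQL with a small standalone program (a sketch under the assumption
that this is the mechanism at play; the file name demo.c is made up):

/* demo.c: query move_pages() status for one anonymous page before and
 * after touching it.  Build with: cc demo.c -o demo -lnuma */
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
    long  pagesize = sysconf(_SC_PAGESIZE);
    char *p;
    void *pages[1];
    int   status[1];

    p = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
    {
        perror("mmap");
        return 1;
    }
    pages[0] = p;

    /* page allocated but never faulted in */
    if (move_pages(0, 1, pages, NULL, status, 0) < 0)
        perror("move_pages");
    else
        printf("before touching: status = %d\n", status[0]);

    /* fault the page in, then ask again */
    p[0] = 1;
    if (move_pages(0, 1, pages, NULL, status, 0) < 0)
        perror("move_pages");
    else
        printf("after touching:  status = %d\n", status[0]);

    munmap(p, pagesize);
    return 0;
}

If this prints -2 before the page is touched and a real node id afterwards,
that would support the idea that the failing pages simply had no physical
page behind them (never faulted in, or reclaimed) at the time of the query,
and that faulting them in first (a volatile read per page) is what normally
avoids the -ENOENT status.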