Discussion: failed NUMA pages inquiry status: Operation not permitted


failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
> src/test/regress/expected/numa.out       |  13 +++
> src/test/regress/expected/numa_1.out     |   5 +

numa_1.out is catching this error:

ERROR:  libnuma initialization failed or NUMA is not supported on this platform

This is what I'm getting when running PG18 in docker on Debian trixie
(libnuma 2.0.19).

However, on older distributions, the error is different:

postgres =# select * from pg_shmem_allocations_numa;
ERROR:  XX000: failed NUMA pages inquiry status: Operation not permitted
LOCATION:  pg_get_shmem_allocations_numa, shmem.c:691

This makes the numa regression tests fail in Docker on Debian bookworm
(libnuma 2.0.16) and older, and on all of the Ubuntu LTS releases.

The attached patch makes it accept these errors, but perhaps it would
be better to detect it in pg_numa_available().

Christoph

Attachments

Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:

On 10/16/25 13:38, Christoph Berg wrote:
>> src/test/regress/expected/numa.out       |  13 +++
>> src/test/regress/expected/numa_1.out     |   5 +
> 
> numa_1.out is catching this error:
> 
> ERROR:  libnuma initialization failed or NUMA is not supported on this platform
> 
> This is what I'm getting when running PG18 in docker on Debian trixie
> (libnuma 2.0.19).
> 
> However, on older distributions, the error is different:
> 
> postgres =# select * from pg_shmem_allocations_numa;
> ERROR:  XX000: failed NUMA pages inquiry status: Operation not permitted
> LOCATION:  pg_get_shmem_allocations_numa, shmem.c:691
> 
> This makes the numa regression tests fail in Docker on Debian bookworm
> (libnuma 2.0.16) and older, and on all of the Ubuntu LTS releases.
> 

It's probably more about the kernel version. What kernels are used by
these systems?

> The attached patch makes it accept these errors, but perhaps it would
> be better to detect it in pg_numa_available().
> 

Not sure how that would work. It seems this is some sort of permission
check in numa_move_pages, which is not what pg_numa_available does. Also,
it may depend on the page queried (e.g. whether it's exclusive or
shared by multiple processes).

thanks

-- 
Tomas Vondra




Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> It's probably more about the kernel version. What kernels are used by
> these systems?

It's the very same kernel, just different docker containers on the
same system. I did not investigate yet where the problem is coming
from, different libnuma versions seemed like the best bet.

Same (differing) results on both these systems:
Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux

> Not sure how that would work. It seems this is some sort of permission
> check in numa_move_pages, which is not what pg_numa_available does. Also,
> it may depend on the page queried (e.g. whether it's exclusive or
> shared by multiple processes).

It's probably the lack of some process capability in that environment.
Maybe there is a way to query that, but I don't know much about that
yet.

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: To Tomas Vondra
> It's the very same kernel, just different docker containers on the
> same system. I did not investigate yet where the problem is coming
> from, different libnuma versions seemed like the best bet.

numactl shows the problem already:

Host system:

$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

debian:trixie-slim container:

$ numactl --show
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
No NUMA support available on this system.

debian:bookworm-slim container:

$ numactl --show
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
get_mempolicy: Operation not permitted
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
cpubind: 0
nodebind: 0
membind: 0
preferred:

Running with sudo does not change the result.

So maybe all that's needed is a get_mempolicy() call in
pg_numa_available() ?

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: To Tomas Vondra
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available() ?

Or perhaps give up on pg_numa_available, and just have two _1.out and
_2.out that just contain the two different error messages, without
trying to catch the problem.

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:

On 10/16/25 16:54, Christoph Berg wrote:
> Re: Tomas Vondra
>> It's probably more about the kernel version. What kernels are used by
>> these systems?
> 
> It's the very same kernel, just different docker containers on the
> same system. I did not investigate yet where the problem is coming
> from, different libnuma versions seemed like the best bet.
> 
> Same (differing) results on both these systems:
> Linux turing 6.16.7+deb14-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.16.7-1 (2025-09-11) x86_64 GNU/Linux
> Linux jenkins 6.1.0-39-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.148-1 (2025-08-26) x86_64 GNU/Linux
> 

Hmmm. Those seem like relatively recent kernels.

>> Not sure how that would work. It seems this is some sort of permission
>> check in numa_move_pages, which is not what pg_numa_available does. Also,
>> it may depend on the page queried (e.g. whether it's exclusive or
>> shared by multiple processes).
> 
> It's probably the lack of some process capability in that environment.
> Maybe there is a way to query that, but I don't know much about that
> yet.
> 

The move_pages(2) manpage mentions PTRACE_MODE_READ_REALCREDS (see
ptrace(2)), so maybe that's it.

-- 
Tomas Vondra




Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
> So maybe all that's needed is a get_mempolicy() call in
> pg_numa_available() ?

numactl 2.0.19 --show does this:

        if (numa_available() < 0) {
                show_physcpubind();
                printf("No NUMA support available on this system.\n");
                exit(1);
        }

int numa_available(void)
{
        if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && (errno == ENOSYS || errno == EPERM))
                return -1;
        return 0;
}

pg_numa_available is already calling numa_available.

But numactl 2.0.16 has this:

int numa_available(void)
{
    if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS)
        return -1;
    return 0;
}

... which is not catching the "permission denied" error I am seeing.

So maybe PG should implement numa_available itself like that. (Or
accept the output difference so the regression tests pass.)

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:
On 10/16/25 17:19, Christoph Berg wrote:
>> So maybe all that's needed is a get_mempolicy() call in
>> pg_numa_available() ?
> 
> ...
> 
> So maybe PG should implement numa_available itself like that. (Or
> accept the output difference so the regression tests are passing.)
> 

I'm not sure which of those options is better. I'm a bit worried that
just accepting the alternative output would hide some failures in the
future (though the risk is low).

So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
attached patch. It still calls numa_available(), so that we don't
silently miss future libnuma changes.

Can you check this makes it work inside the docker container?


regards

-- 
Tomas Vondra
Attachments

Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: To Tomas Vondra
> So maybe PG should implement numa_available itself like that.

Following our discussion at pgconf.eu last week, I just implemented
that. The numa and pg_buffercache tests pass in Docker on Debian
bookworm now.

Christoph

Attachments

Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
> attached patch. It still calls numa_available(), so that we don't
> silently miss future libnuma changes.
> 
> Can you check this makes it work inside the docker container?

Yes your patch works. (Sorry I meant to test earlier, but RL...)

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:
On 11/14/25 13:52, Christoph Berg wrote:
> Re: Tomas Vondra
>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>> attached patch. It still calls numa_available(), so that we don't
>> silently miss future libnuma changes.
>>
>> Can you check this makes it work inside the docker container?
> 
> Yes your patch works. (Sorry I meant to test earlier, but RL...)
> 

Thanks. I've pushed the fix (and backpatched to 18).


regards

-- 
Tomas Vondra



Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> >> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
> >> attached patch. It still calls numa_available(), so that we don't
> >> silently miss future libnuma changes.
> >>
> >> Can you check this makes it work inside the docker container?
> > 
> > Yes your patch works. (Sorry I meant to test earlier, but RL...)
> 
> Thanks. I've pushed the fix (and backpatched to 18).

It looks like we are not done here yet :(

postgresql-18 is failing here intermittently with this diff:

12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out    2025-11-10 21:52:06.000000000 +0000
12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out    2025-12-11 11:20:22.618989603 +0000
12:20:24 @@ -6,8 +6,4 @@
12:20:24  -- switch to superuser
12:20:24  \c -
12:20:24  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
12:20:24 - ok
12:20:24 -----
12:20:24 - t
12:20:24 -(1 row)
12:20:24 -
12:20:24 +ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2

That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.

I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
questing/amd64, where libnuma should take care of this itself, without
the extra patch in PG. There was another case on bullseye/amd64 which
has the old libnuma.

It's been frequent enough that it killed 4 of the 10 builds
currently visible on
https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
(Though to be fair, only one distribution/arch combination was failing
for each of them.)

There is also one instance of it in
https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/

I currently have no idea what's happening.

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:

On 12/11/25 13:29, Christoph Berg wrote:
> Re: Tomas Vondra
>>>> So I'm leaning to adjust pg_numa_init() to also check EPERM, per the
>>>> attached patch. It still calls numa_available(), so that we don't
>>>> silently miss future libnuma changes.
>>>>
>>>> Can you check this makes it work inside the docker container?
>>>
>>> Yes your patch works. (Sorry I meant to test earlier, but RL...)
>>
>> Thanks. I've pushed the fix (and backpatched to 18).
> 
> It looks like we are not done here yet :(
> 
> postgresql-18 is failing here intermittently with this diff:
> 
> 12:20:24 --- /build/reproducible-path/postgresql-18-18.1/src/test/regress/expected/numa.out    2025-11-10 21:52:06.000000000 +0000
> 12:20:24 +++ /build/reproducible-path/postgresql-18-18.1/build/src/test/regress/results/numa.out    2025-12-11 11:20:22.618989603 +0000
> 12:20:24 @@ -6,8 +6,4 @@
> 12:20:24  -- switch to superuser
> 12:20:24  \c -
> 12:20:24  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> 12:20:24 - ok
> 12:20:24 -----
> 12:20:24 - t
> 12:20:24 -(1 row)
> 12:20:24 -
> 12:20:24 +ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
> 
> That's REL_18_STABLE @ 580b5c, with the Debian packaging on top.
> 
> I've seen it on unstable/amd64, unstable/arm64, and Ubuntu
> questing/amd64, where libnuma should take care of this itself, without
> the extra patch in PG. There was another case on bullseye/amd64 which
> has the old libnuma.
> 
> It's been frequent enough that it killed 4 of the 10 builds
> currently visible on
> https://jengus.postgresql.org/job/postgresql-18-binaries-snapshot/.
> (Though to be fair, only one distribution/arch combination was failing
> for each of them.)
> 
> There is also one instance of it in
> https://jengus.postgresql.org/job/postgresql-19-binaries-snapshot/
> 
> I currently have no idea what's happening.
> 

Hmmm, strange. -2 is ENOENT, which should mean this:

       -ENOENT
              The page is not present.

But what does "not present" mean in this context? And why would that be
only intermittent? Presumably this is still running in Docker, so maybe
it's another weird consequence of that?

regards

-- 
Tomas Vondra




Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
> 
>        -ENOENT
>               The page is not present.
> 
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

Sorry I forgot to mention that this is now in the normal apt.pg.o
build environment (chroots without any funky permission restrictions).
I have not tried Docker yet.

I think it was not happening before the backport of the Docker fix.
But I have no idea why this should have broken anything, and why it
would only happen like 3% of the time.

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
> 
>        -ENOENT
>               The page is not present.
> 
> But what does "not present" mean in this context? And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

I've managed to reproduce it once, running this loop on
18-as-of-today. It errored out after a few hundred iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.

I tried reading the kernel source and it sounds related:

 * If the source virtual memory range has any unmapped holes, or if
 * the destination virtual memory range is not a whole unmapped hole,
 * move_pages() will fail respectively with -ENOENT or -EEXIST. This
 * provides a very strict behavior to avoid any chance of memory
 * corruption going unnoticed if there are userland race conditions.
 * Only one thread should resolve the userland page fault at any given
 * time for any given faulting address. This means that if two threads
 * try to both call move_pages() on the same destination address at the
 * same time, the second thread will get an explicit error from this
 * command.
...
 * The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
 * prevent -ENOENT errors to materialize if there are holes in the
 * source virtual range that is being remapped. The holes will be
 * accounted as successfully remapped in the retval of the
 * command. This is mostly useful to remap hugepage naturally aligned
 * virtual regions without knowing if there are transparent hugepage
 * in the regions or not, but preventing the risk of having to split
 * the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
                   unsigned long src_start, unsigned long len, __u64 mode)
...
                        if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
                                err = -ENOENT;
                                break;

What I don't understand yet is why this move_pages() signature does
not match the one from libnuma and move_pages(2) (note "mode" vs "flags"):

int numa_move_pages(int pid, unsigned long count,
        void **pages, const int *nodes, int *status, int flags)
{
        return move_pages(pid, count, pages, nodes, status, flags);
}

I guess the answer is somewhere in that gap.

> ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2

Maybe instead of putting sanity checks on what the kernel is
returning, we should just pass that through to the user? (Or perhaps
transform negative numbers to NULL?)

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: To Tomas Vondra
> I've managed to reproduce it once, running this loop on
> 18-as-of-today. It errored out after a few hundred iterations:
> 
> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
> 
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
> 
> That was on the apt.pg.o amd64 build machine while a few things were
> just building. Maybe ENOENT "The page is not present" means something
> was just swapped out because the machine was under heavy load.

I played a bit more with it.

* It seems to trigger only once for a running cluster. The next one
  needs a restart
* If it doesn't trigger within the first 30s, it probably never will
* It seems easier to trigger on a system that is under load (I started
  a few pgmodeler compile runs in parallel (C++))

But none of that answers the "why".

Christoph



Re: failed NUMA pages inquiry status: Operation not permitted

From
Tomas Vondra
Date:
On 12/16/25 15:48, Christoph Berg wrote:
> Re: To Tomas Vondra
>> I've managed to reproduce it once, running this loop on
>> 18-as-of-today. It errored out after a few hundred iterations:
>>
>> while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done
>>
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
>> 2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa
>>
>> That was on the apt.pg.o amd64 build machine while a few things were
>> just building. Maybe ENOENT "The page is not present" means something
>> was just swapped out because the machine was under heavy load.
> 
> I played a bit more with it.
> 
> * It seems to trigger only once for a running cluster. The next one
>   needs a restart
> * If it doesn't trigger within the first 30s, it probably never will
> * It seems easier to trigger on a system that is under load (I started
>   a few pgmodeler compile runs in parallel (C++))
> 
> But none of that answers the "why".
> 

Hmmm, so this is interesting. I tried this on my workstation (with a
single NUMA node), and I see this:

1) right after opening a connection, I get this

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 |   290
        -2 | 32478
(2 rows)


2) but a select from pg_shmem_allocations_numa works fine

test=# select numa_node, count(*) from pg_shmem_allocations_numa group by 1;
 numa_node | count
-----------+-------
         0 |    72
(1 row)


3) and if I repeat the pg_buffercache_numa query, it now works

test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
 numa_node | count
-----------+-------
         0 | 32768
(1 row)


That's a bit strange. I have no idea why this is happening. If I
reconnect, I start getting the failures again.


regards

-- 
Tomas Vondra




Re: failed NUMA pages inquiry status: Operation not permitted

From
Christoph Berg
Date:
Re: Tomas Vondra
> 1) right after opening a connection, I get this
> 
> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>  numa_node | count
> -----------+-------
>          0 |   290
>         -2 | 32478

Does that mean that the "touch all pages" logic is missing in some
code paths?

But even with that, it seems it can degenerate again, so accepting
-2 in the regression tests would be required to make them stable.

Christoph