Re: failed NUMA pages inquiry status: Operation not permitted

From Tomas Vondra
Subject Re: failed NUMA pages inquiry status: Operation not permitted
Date
Msg-id b93d876b-67c1-4f0e-b0c5-a4296f09f5b5@vondra.me
In reply to Re: failed NUMA pages inquiry status: Operation not permitted  (Tomas Vondra <tomas@vondra.me>)
Responses Re: failed NUMA pages inquiry status: Operation not permitted
List pgsql-hackers
On 12/17/25 12:07, Tomas Vondra wrote:
> 
> 
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>>  numa_node | count
>>> -----------+-------
>>>          0 |   290
>>>         -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
> 
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
> 
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
> 

I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).

I created two instances - one to keep the system busy, one for
experimentation. The "busy" one is set to use shared_buffers=16GB, and
runs a read-only pgbench:

  pgbench -i -s 4500 test
  pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

Then, the other instance is set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):

  pg_ctl -D data -l pg.log start;

  for r in $(seq 1 10); do
    psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
  done;

  pg_ctl -D data -l pg.log stop;

And this often fails like this:

----------------------------------------------------------------------

waiting for server to start.... done
server started
 numa_node |  count
-----------+---------
         0 | 1045302
        -2 |    3274
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1025321
        -2 |   23255
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1038596
        -2 |    9980
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048518
        -2 |      58
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048525
        -2 |      51
(2 rows)

waiting for server to shut down.... done
server stopped

----------------------------------------------------------------------

So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.
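
To put a rough number on "quite often", the loop output can be piped
through a small filter. This is only a sketch: summarize_numa_runs is a
hypothetical helper (not part of PostgreSQL), and it assumes psql's
default aligned output with a "(N rows)" footer after each result.

```shell
# Hypothetical helper: summarize aligned psql output from repeated runs of
#   select numa_node, count(*) from pg_buffercache_numa group by 1
# Reads stdin and prints one line:
#   runs=<n> runs_with_unknown=<n> max_unknown=<n>
summarize_numa_runs() {
  awk -F'|' '
    # The "(N rows)" footer marks the end of one query result.
    /^\([0-9]+ rows?\)/ { runs++; if (seen) bad++; seen = 0; next }
    # A row whose first column is -2 means some pages had no known node.
    $1 ~ /^[ \t]*-2[ \t]*$/ { seen = 1; n = $2 + 0; if (n > max) max = n }
    END { printf "runs=%d runs_with_unknown=%d max_unknown=%d\n", runs, bad, max }
  '
}
```

For example, the "for r in $(seq 1 10); do psql ...; done" loop shown
earlier could be followed by "| summarize_numa_runs" to get a single
summary line per experiment.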

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 (-ENOENT, i.e. the page is not present) as the status.

In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.

FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too, which I find a bit weird:
why should those pages be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...
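
One possible way to keep a test stable, sketched here under the
assumption that only the per-node split varies between runs, is to
compare the total across all numa_node values instead of the individual
rows. In the output above every run adds up to 1048576 (e.g.
1045302 + 3274), whether or not -2 shows up. numa_run_totals is a
hypothetical helper, and it again assumes psql's default aligned output.

```shell
# Hypothetical helper: print one line per query result with the total
# count summed over all numa_node values (including -2), so that runs can
# be compared without depending on which pages happened to be paged out.
numa_run_totals() {
  awk -F'|' '
    # The "(N rows)" footer ends one result: emit its total and reset.
    /^\([0-9]+ rows?\)/ { print total + 0; total = 0; next }
    # Data rows have an integer (possibly negative) in the first column.
    $1 ~ /^[ \t]*-?[0-9]+[ \t]*$/ { total += $2 }
  '
}
```

Piping the query loop into "numa_run_totals | sort -u" would then be
expected to print a single value if the totals really are stable.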


regards

-- 
Tomas Vondra
