Re: failed NUMA pages inquiry status: Operation not permitted
| From | Tomas Vondra |
|---|---|
| Subject | Re: failed NUMA pages inquiry status: Operation not permitted |
| Date | |
| Msg-id | b93d876b-67c1-4f0e-b0c5-a4296f09f5b5@vondra.me |
| In reply to | Re: failed NUMA pages inquiry status: Operation not permitted (Tomas Vondra <tomas@vondra.me>) |
| Replies | Re: failed NUMA pages inquiry status: Operation not permitted |
| List | pgsql-hackers |
On 12/17/25 12:07, Tomas Vondra wrote:
>
>
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>> numa_node | count
>>> -----------+-------
>>> 0 | 290
>>> -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
>
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
>
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
>
I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).

I created two instances: one to keep the system busy, and one for
experimentation. The "busy" one is set to use shared_buffers=16GB and
runs a read-only pgbench:

  pgbench -i -s 4500 test
  pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

Then the other instance, set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):

  pg_ctl -D data -l pg.log start;
  for r in $(seq 1 10); do
    psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
  done;
  pg_ctl -D data -l pg.log stop;

And this often fails like this:
----------------------------------------------------------------------
waiting for server to start.... done
server started
numa_node | count
-----------+---------
0 | 1045302
-2 | 3274
(2 rows)
numa_node | count
-----------+---------
0 | 1048576
(1 row)
numa_node | count
-----------+---------
0 | 1048576
(1 row)
numa_node | count
-----------+---------
0 | 1048576
(1 row)
numa_node | count
-----------+---------
0 | 1048576
(1 row)
numa_node | count
-----------+---------
0 | 1048576
(1 row)
numa_node | count
-----------+---------
0 | 1025321
-2 | 23255
(2 rows)
numa_node | count
-----------+---------
0 | 1038596
-2 | 9980
(2 rows)
numa_node | count
-----------+---------
0 | 1048518
-2 | 58
(2 rows)
numa_node | count
-----------+---------
0 | 1048525
-2 | 51
(2 rows)
waiting for server to shut down.... done
server stopped
----------------------------------------------------------------------
So it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap; once I removed the swap, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 (-ENOENT) as its status.

In hindsight, that's not all that surprising. It's interesting that it
can happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.
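
The touch-then-query sequence, and the window it leaves open, can be sketched in C roughly as follows. This is a minimal, Linux-only illustration and not the pg_buffercache code: the helper names (touch_pages, query_batch, count_not_present) and the MAX_BATCH limit are assumptions made for the example, and move_pages(2) is called through the raw syscall to avoid depending on libnuma. With the nodes argument set to NULL, move_pages only reports a per-page status, and -ENOENT (-2) means the page is not present.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define MAX_BATCH 1024          /* hypothetical upper bound on batch size */

/* Hypothetical helper: read one byte of every page so it gets faulted in. */
static void
touch_pages(char *base, size_t npages, size_t page_size)
{
    for (size_t i = 0; i < npages; i++)
        (void) *(volatile char *) (base + i * page_size);
}

/*
 * Hypothetical helper: ask the kernel where each page of one batch lives.
 * Returns 0 on success, -1 with errno set on failure (e.g. EPERM).
 */
static int
query_batch(void **pages, int *status, unsigned long count)
{
    /* pid 0 = calling process; nodes == NULL = report only, move nothing */
    return (int) syscall(SYS_move_pages, 0L, count, pages, NULL, status, 0L);
}

/*
 * Touch all pages, then query them batch by batch. Returns the number of
 * pages reported as -ENOENT, or -1 if the syscall itself failed.
 */
static long
count_not_present(char *base, size_t npages, size_t page_size, size_t batch)
{
    long        missing = 0;

    assert(batch > 0 && batch <= MAX_BATCH);
    touch_pages(base, npages, page_size);

    for (size_t start = 0; start < npages; start += batch)
    {
        void       *pages[MAX_BATCH];
        int         status[MAX_BATCH];
        size_t      n = (npages - start < batch) ? npages - start : batch;

        for (size_t i = 0; i < n; i++)
            pages[i] = base + (start + i) * page_size;

        /* Window: a page touched above may have been swapped out by now. */
        if (query_batch(pages, status, (unsigned long) n) != 0)
            return -1;

        for (size_t i = 0; i < n; i++)
            if (status[i] == -ENOENT)
                missing++;
    }
    return missing;
}
```

Nothing here pins the pages between the touch and the query, so under swap pressure a nonzero result from count_not_present would be consistent with the race described above. Touching each batch immediately before querying it would presumably shrink the window, but not close it.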
FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too, which I find a bit weird:
why should those pages be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...
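
To make the "accept -2" idea concrete, here is a small C sketch of how per-page statuses could be tallied without erroring out. It is only an illustration under stated assumptions: the NumaTally struct and tally_statuses function are hypothetical names, not anything in PostgreSQL, and which negative statuses should still count as hard errors is an open question rather than something settled in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical summary of a move_pages(2) status array. */
typedef struct NumaTally
{
    size_t      known;          /* status >= 0: page is on that NUMA node */
    size_t      unknown;        /* status == -ENOENT: page not present */
    size_t      errors;         /* any other negative status */
} NumaTally;

/*
 * Classify each status instead of failing on the first negative value.
 * -ENOENT (-2) is treated as "unknown node" rather than as an error.
 */
static NumaTally
tally_statuses(const int *status, size_t n)
{
    NumaTally   t = {0, 0, 0};

    assert(status != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (status[i] >= 0)
            t.known++;
        else if (status[i] == -ENOENT)
            t.unknown++;
        else
            t.errors++;
    }
    return t;
}
```

A regression test built on a summary like this could check an invariant that holds regardless of swapping, such as known + unknown equalling the number of pages inspected, instead of comparing exact per-node counts.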
regards
--
Tomas Vondra