Re: Draft for basic NUMA observability

From Patrick Stählin
Subject Re: Draft for basic NUMA observability
Date
Msg-id 3d8bccef-1395-40ec-bc3d-cccd1882227a@packi.ch
In response to Re: Draft for basic NUMA observability  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers

Hi Jakub

On 7/24/25 10:01 AM, Jakub Wartak wrote:
> On Tue, Jul 22, 2025 at 11:30 AM Patrick Stählin <me@packi.ch> wrote:
>>
>> Hi!
>>
>> On 4/7/25 11:27 PM, Tomas Vondra wrote:
>>>
>>> I've pushed all three parts of v29, with some additional corrections
>>> (picked lower OIDs, bumped catversion, fixed commit messages).
>>
>> While building the PG18 beta1/2 packages I noticed that in our build
>> containers the selftest for pg_buffercache_numa and numa failed. It
>> seems that libnuma was available and pg_numa_init/numa_available returns
>> no errors, we still fail in pg_numa_query_pages/move_pages with EPERM
>> yielding the following error when accessing
>> pg_buffercache_numa/pg_shmem_allocations_numa:
>>
>>     ERROR: failed NUMA pages inquiry: Operation not permitted
>>
>> The man-page of move_pages led me to believe that this is because of
>> the missing capability CAP_SYS_NICE on the process but I couldn't prove
>> that theory with the attached patch.
>> The patch did make the tests pass but also disabled NUMA permanently on
>> a vanilla Debian VM and that is certainly not wanted. It may well be
>> that my understanding of checking capabilities and how they work is
>> incomplete. I also think that adding a new dependency for the reason of
>> just checking the capability is probably a bit of an overkill, maybe we
>> can check if we can access move_pages once without an error before
>> treating it as one?
>>
>> I'd be happy to debug this further but I have limited access to our
>> build-infra, I should be able to sneak in commands during the build though.
> 
> 
> Hi Patrick,
> 
> So is it because the container was started without CAP_SYS_NICE so
> even root -> postgres is not having this cap? In my book container
> would be rather small and certainly single container wouldn't be
> spanning multiple CPU sockets, so I would just disable libnuma, anyway
> if I do on regular VM:
> [...]

This is only the build environment, but it runs the selftest, which then
fails. The containers we run in prod are a totally different setup, and
there the NUMA calls actually work. Disabling libnuma may be an option,
but it would be nice to detect at runtime that we can't access it.
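
To make the "detect it at runtime" idea a bit more concrete, here is a
rough sketch in Python (not PostgreSQL code). It asks move_pages, through
libnuma, which node one page of our own memory lives on and treats EPERM
as "NUMA inquiry not permitted here" instead of a hard error. The function
names classify_probe and probe_move_pages and the state strings are made
up for illustration; only move_pages itself and the EPERM symptom come
from this thread.

```python
import ctypes
import errno
import mmap


def classify_probe(rc, err):
    """Map a move_pages() return code and errno to a coarse state."""
    if rc >= 0:
        return "ok"
    if err == errno.EPERM:
        return "denied"  # the EPERM case reported in this thread
    return "error"       # anything else is still treated as a real error


def probe_move_pages():
    """Query (not move) one page of our own memory via libnuma."""
    try:
        libnuma = ctypes.CDLL("libnuma.so.1", use_errno=True)
        move_pages = libnuma.move_pages
    except (OSError, AttributeError):
        return "unavailable"  # libnuma is not installed on this system

    move_pages.restype = ctypes.c_long
    move_pages.argtypes = [
        ctypes.c_int, ctypes.c_ulong,
        ctypes.POINTER(ctypes.c_void_p),
        ctypes.POINTER(ctypes.c_int),
        ctypes.POINTER(ctypes.c_int),
        ctypes.c_int,
    ]

    page = mmap.mmap(-1, mmap.PAGESIZE)
    page[0] = 1  # touch the page so it is actually backed by memory
    addr = ctypes.addressof(ctypes.c_char.from_buffer(page))
    pages = (ctypes.c_void_p * 1)(addr)
    status = (ctypes.c_int * 1)(0)

    # pid 0 is the calling process; nodes=NULL means "only report the
    # node of each page", so nothing is migrated.
    rc = move_pages(0, 1, pages, None, status, 0)
    return classify_probe(rc, ctypes.get_errno())


if __name__ == "__main__":
    print(probe_move_pages())
```

A check along these lines would avoid a new dependency just for reading
capabilities, at the cost of one extra call before the first real inquiry.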

> Can you provide exact details about this container technology?

We use podman to set everything up.

> Can you provide /usr/sbin/capsh --print just before starting PG there?
> Maybe this is more cgroup/cpuset somehow related too?

Here is the output, it seems that cap_sys_nice is missing from the 
bounding set:

+ /usr/sbin/capsh --print
Current: =
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_sys_chroot,cap_setfcap
Ambient set =
Current IAB: !cap_dac_read_search,!cap_linux_immutable,!cap_net_broadcast,!cap_net_admin,!cap_net_raw,!cap_ipc_lock,!cap_ipc_owner,!cap_sys_module,!cap_sys_rawio,!cap_sys_ptrace,!cap_sys_pacct,!cap_sys_admin,!cap_sys_boot,!cap_sys_nice,!cap_sys_resource,!cap_sys_time,!cap_sys_tty_config,!cap_mknod,!cap_lease,!cap_audit_write,!cap_audit_control,!cap_mac_override,!cap_mac_admin,!cap_syslog,!cap_wake_alarm,!cap_block_suspend,!cap_audit_read,!cap_perfmon,!cap_bpf,!cap_checkpoint_restore
Securebits: 00/0x0/1'b0 (no-new-privs=0)
  secure-noroot: no (unlocked)
  secure-no-suid-fixup: no (unlocked)
  secure-keep-caps: no (unlocked)
  secure-no-ambient-raise: no (unlocked)
uid=2000(buildkite-agent) euid=2000(buildkite-agent)
gid=2000(buildkite-agent)
groups=2000(buildkite-agent)
Guessed mode: HYBRID (4)
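
For anyone who wants to check this in a build script rather than by eye,
a small sketch in Python that pulls the bounding set out of the
capsh --print text above. The helper name bounding_set and the SAMPLE
constant are hypothetical; the sample is the "Bounding set" line from the
output above, assumed to be on a single line as capsh prints it.

```python
def bounding_set(capsh_output):
    """Return the capability names on the 'Bounding set =' line."""
    for line in capsh_output.splitlines():
        if line.startswith("Bounding set ="):
            names = line.split("=", 1)[1]
            return {name.strip() for name in names.split(",") if name.strip()}
    return set()


# The relevant lines from the capsh output in this mail.
SAMPLE = (
    "Current: =\n"
    "Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,"
    "cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,"
    "cap_sys_chroot,cap_setfcap\n"
    "Ambient set =\n"
)

if __name__ == "__main__":
    caps = bounding_set(SAMPLE)
    print("cap_sys_nice present:", "cap_sys_nice" in caps)
```

In a real script the text would come from running /usr/sbin/capsh --print
instead of the hard-coded sample.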

> Anyway, there is a simpler way to make the tests pass if that's what
> you are after. We do have
> contrib/pg_buffercache/sql/pg_buffercache_numa.sql which is expected
> to match outputs in pg_buffercache_numa.out OR (!)
> pg_buffercache_numa_1.out. We could just handle this edge case by
> adding pg_buffercache_numa_2.out too probably (which would just
> contain semi-valid scenario for "ERROR: failed NUMA pages inquiry:
> Operation not permitted")

Ah, I didn't know that was a possibility. As long as this isn't used for
more than querying the state, that may be a nice workaround. If the
problem turns out to be more widespread, we probably need something a bit
more robust for the detection. I already patch out the tests for our
build-env, so for me it's "solved", but that is certainly not a proper
solution.
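
As I understand the alternative-expected-file idea, the test passes if
the actual output matches any one of several expected files. A simplified
model in Python, only to illustrate the matching rule: the function name
matching_expected_file is made up, exact string comparison stands in for
the real diff, and the _0 to _9 suffix range is my reading of the
regression test documentation rather than something taken from the
pg_regress source.

```python
from pathlib import Path


def matching_expected_file(actual, expected_dir, test_name):
    """Return the name of the first expected file equal to `actual`."""
    expected_dir = Path(expected_dir)
    # Base file first, then the numbered variants (assumed range 0-9).
    candidates = [expected_dir / f"{test_name}.out"]
    candidates += [expected_dir / f"{test_name}_{i}.out" for i in range(10)]
    for path in candidates:
        if path.is_file() and path.read_text() == actual:
            return path.name
    return None  # no variant matched, so the test would fail
```

With a pg_buffercache_numa_2.out that contains just the EPERM error, a
run in a restricted container would match that variant while a normal
host keeps matching one of the existing files.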

Just FYI, I'll be on PTO, so I won't have access to the build-env for the
next two weeks.

Thanks,
Patrick


