core dumps in auto_prewarm, tests succeed

Поиск
Список
Период
Сортировка
От Andres Freund
Тема core dumps in auto_prewarm, tests succeed
Дата
Msg-id 20240122204117.swton324xcoodnyi@awork3.anarazel.de
обсуждение исходный текст
Ответы Re: core dumps in auto_prewarm, tests succeed
Re: core dumps in auto_prewarm, tests succeed
Список pgsql-hackers
Hi,

I noticed that I was getting core dumps while executing the tests, without the
tests failing. Backtraces are vriations of this:

#0  0x0000000000ca29cd in pg_atomic_read_u32_impl (ptr=0x7fe13497a004) at
../../../../../home/andres/src/postgresql/src/include/port/atomics/generic.h:48
#1  0x0000000000ca2b08 in pg_atomic_read_u32 (ptr=0x7fe13497a004) at
../../../../../home/andres/src/postgresql/src/include/port/atomics.h:239
#2  0x0000000000ca3c3d in LWLockAttemptLock (lock=0x7fe13497a000, mode=LW_EXCLUSIVE)
    at ../../../../../home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:825
#3  0x0000000000ca440c in LWLockAcquire (lock=0x7fe13497a000, mode=LW_EXCLUSIVE)
    at ../../../../../home/andres/src/postgresql/src/backend/storage/lmgr/lwlock.c:1264
#4  0x00007fe130204ab4 in apw_detach_shmem (code=0, arg=0) at
../../../../../home/andres/src/postgresql/contrib/pg_prewarm/autoprewarm.c:788
#5  0x0000000000c81c99 in shmem_exit (code=0) at
../../../../../home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:276
#6  0x0000000000c81a7c in proc_exit_prepare (code=0) at
../../../../../home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:198
#7  0x0000000000c819a8 in proc_exit (code=0) at
../../../../../home/andres/src/postgresql/src/backend/storage/ipc/ipc.c:111
#8  0x0000000000bdd0b3 in BackgroundWorkerMain () at
../../../../../home/andres/src/postgresql/src/backend/postmaster/bgworker.c:841
#9  0x0000000000be861d in do_start_bgworker (rw=0x341ff20) at
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5756
#10 0x0000000000be8a34 in maybe_start_bgworkers () at
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:5980
#11 0x0000000000be4f9f in process_pm_child_exit () at
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:3039
#12 0x0000000000be2de4 in ServerLoop () at
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1765
#13 0x0000000000be27bf in PostmasterMain (argc=4, argv=0x33dbba0) at
../../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1475
#14 0x0000000000aca326 in main (argc=4, argv=0x33dbba0) at
../../../../../home/andres/src/postgresql/src/backend/main/main.c:198

The most likely culprit seems to be:
https://postgr.es/m/E1rQvjC-002Chd-Cd%40gemulon.postgresql.org

The test encountering this is pg_prewarm/001_basic:
(gdb) p DataDir
$12 = 0x33ef8a0 "/srv/dev/build/postgres/m-dev-assert/testrun/pg_prewarm/001_basic/data/t_001_basic_main_data/pgdata"


A secondary issue is that the tests suceed despite two segfaults. The problem
here likely is that we don't have sufficient error handling during shutdowns:

2024-01-22 12:31:34.133 PST [3054823] LOG:  background worker "logical replication launcher" (PID 3054836) exited with
exitcode 1
 
2024-01-22 12:31:34.443 PST [3054823] LOG:  background worker "autoprewarm leader" (PID 3054835) was terminated by
signal11: Segmentation fault
 
2024-01-22 12:31:34.443 PST [3054823] LOG:  terminating any other active server processes
2024-01-22 12:31:34.444 PST [3054823] LOG:  abnormal database system shutdown
2024-01-22 12:31:34.469 PST [3054823] LOG:  database system is shut down

2024-01-22 12:31:34.555 PST [3055133] LOG:  starting PostgreSQL 17devel on x86_64-linux, compiled by gcc-14.0.0,
64-bit
2024-01-22 12:31:34.555 PST [3055133] LOG:  listening on Unix socket "/tmp/p6XG0kQW9w/.s.PGSQL.60402"
2024-01-22 12:31:34.557 PST [3055148] LOG:  database system was interrupted; last known up at 2024-01-22 12:31:33 PST
2024-01-22 12:31:34.557 PST [3055148] LOG:  database system was not properly shut down; automatic recovery in progress
2024-01-22 12:31:34.558 PST [3055148] LOG:  redo starts at 0/1531090
2024-01-22 12:31:34.559 PST [3055148] LOG:  invalid record length at 0/1555F60: expected at least 24, got 0
2024-01-22 12:31:34.559 PST [3055148] LOG:  redo done at 0/1555F38 system usage: CPU: user: 0.00 s, system: 0.00 s,
elapsed:0.00 s
 
2024-01-22 12:31:34.559 PST [3055146] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-01-22 12:31:34.570 PST [3055146] LOG:  checkpoint complete: wrote 42 buffers (0.3%); 0 WAL file(s) added, 0
removed,0 recycled; write=0.002 s, sync=0.001 s, total=0.011 s; sync files=0, longest=0.000 s, average=0.000 s;
distance=147kB, estimate=147 kB; lsn=0/1555F60, redo lsn=0/1555F60
 
2024-01-22 12:31:34.573 PST [3055133] LOG:  database system is ready to accept connections

ISTM that we shouldn't basically silently overlook shutdowns due to crashes in
the tests. How to not do so is unfortunately not immediately obvious to me...


Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: Improve WALRead() to suck data directly from WAL buffers when possible
Следующее
От: Nathan Bossart
Дата:
Сообщение: Re: core dumps in auto_prewarm, tests succeed