Core dumps from recovery/017_shm
From | Thomas Munro |
---|---|
Subject | Core dumps from recovery/017_shm |
Date | |
Msg-id | CA+hUKGKzfkN6re3yboQ+9qbhV3+f8Qk__ZCApSKY+NoC1Y1thA@mail.gmail.com |
List | pgsql-hackers |
While looking for something else, I noticed that we occasionally see assertion failures like this:

    TRAP: failed Assert("latch->maybe_sleeping == false"), File: "latch.c", Line: 378, PID: 28023

Here's one in the build farm:

    https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2025-08-05%2005:52:51

And here are some recent cases on CI, which again fail somewhere else, but that might be expected as these are cfbot branches from patches on the mailing list:

         task_id      |            task_name
    ------------------+---------------------------------
     6347210574528512 | Linux - Debian Bookworm - Meson
     6420333948829696 | FreeBSD - Meson
     5616450825617408 | FreeBSD - Meson
     4515661445070848 | Linux - Debian Bookworm - Meson
     4945927242252288 | Linux - Debian Bookworm - Meson
     5133563223343104 | Linux - Debian Bookworm - Meson

You can drop those task IDs into these URLs:

    https://cirrus-ci.com/task/$TASK_ID
    https://api.cirrus-ci.com/v1/artifact/task/$TASK_ID/testrun/build/testrun/recovery/017_shm/log/017_shm_gnat.log

My current theory is that backends are exiting when the test kills the postmaster, but a backend that is concurrently starting up takes over its latch, and then its first ResetLatch(MyLatch) fails that assertion because maybe_sleeping was never cleared. So I suppose it should be cleared in ... DisownLatch()? That sails close to the topic in these threads:

    https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru
    https://www.postgresql.org/message-id/flat/CA+hUKGKp0kTpummCPa97+WFJTm+uYzQ9Ex8UMdH8ZXkLwO0QgA@mail.gmail.com

If we didn't use proc_exit(), we wouldn't recycle the latch, so the problem would go away with the new emergency cleanup solution I'm working on (which incidentally also gets rid of the other source of core dump spam that clogs up BF and CI systems: archive scripts and other subprocesses of backends). More about that soon on that last thread, but...
That would still leave versions 15-18 with these rare assertion failures, since they have commit c8f3bc24. So I think the thing to do is change DisownLatch() to clear maybe_sleeping just where it also clears owner_pid, and backpatch that. Another idea would be to do it in WaitEventSetWaitBlock() before exiting, but that'd be duplicated in several places.
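To make the DisownLatch() idea concrete, here is a toy model of the latch hand-over, not a patch against latch.c. The ToyLatch struct and the toy_* functions are hypothetical stand-ins that only mirror the fields named above (maybe_sleeping, owner_pid) plus is_set and is_shared; the real functions use Assert() and MyProcPid rather than an explicit pid argument.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a shared latch. */
typedef struct ToyLatch
{
    sig_atomic_t is_set;
    sig_atomic_t maybe_sleeping;
    bool         is_shared;
    int          owner_pid;
} ToyLatch;

/* A starting backend takes ownership of a currently unowned shared latch. */
void
toy_own_latch(ToyLatch *latch, int pid)
{
    assert(latch->is_shared);
    assert(latch->owner_pid == 0);
    latch->owner_pid = pid;
}

/* Models a backend that has advertised that it may be about to sleep. */
void
toy_begin_wait(ToyLatch *latch)
{
    latch->maybe_sleeping = true;
}

/*
 * Proposed behavior: clear maybe_sleeping in the same place owner_pid is
 * cleared, so that the next owner never inherits a stale flag.
 */
void
toy_disown_latch(ToyLatch *latch, int pid)
{
    assert(latch->is_shared);
    assert(latch->owner_pid == pid);
    latch->owner_pid = 0;
    latch->maybe_sleeping = false;
}

/*
 * Returns true if a reset by 'pid' would satisfy both conditions that the
 * real ResetLatch() asserts: caller is the owner, and maybe_sleeping is
 * false.  On success the latch is reset.
 */
bool
toy_reset_latch_ok(ToyLatch *latch, int pid)
{
    if (latch->owner_pid != pid || latch->maybe_sleeping)
        return false;
    latch->is_set = false;
    return true;
}
```

In this model, a backend that exits mid-wait (toy_begin_wait() followed by toy_disown_latch()) leaves the latch clean, so the new owner's first reset succeeds. Removing the maybe_sleeping assignment from toy_disown_latch() reproduces the failing sequence described in the theory above.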