Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects
| From | Thomas Munro |
|---|---|
| Subject | Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects |
| Date | |
| Msg-id | CA+hUKGJo6hu6GToiXarBRF+AqhFPnzMTW2Nksm0x-+9m2=dskQ@mail.gmail.com |
| In response to | Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects (Bryan Green <dbryan.green@gmail.com>) |
| List | pgsql-hackers |

On Fri, Nov 7, 2025 at 3:13 AM Bryan Green <dbryan.green@gmail.com> wrote:
> The reason to still do this patch and clean up the handle inheritance
> mess is that there are states (suspended state, infinite loop, spinlock
> hold, whatever) that a process can be in that keeps it from processing
> the event. We don't need to wait on the children to voluntarily exit
> when postmaster crashes.

Agreed on all points. We'd recently come to the same conclusion on this thread:

https://www.postgresql.org/message-id/flat/B3C69B86-7F82-4111-B97F-0005497BB745%40yandex-team.ru

I think there might arguably be a sort of weak forward progress guarantee in the existing design, and it's been a while since we've had problem reports AFAIR*: locks were released (which turns out to be fundamentally unsafe at least while in a critical section, as analysed in that thread, but it does allow progress in blocked backends, so that they can learn of the postmaster's demise), and no one should enter WaitEventSet() while holding a spinlock, and infinite loops are against the law, and it's previously been considered acceptable-ish that a backend might continue to run a long query until completion before exiting (without supporting auxiliary or worker backends, which sounds potentially suspect, but at least you can't wait for another backend without learning of the postmaster's demise, assuming the only possible waits are LWLocks or latches).

But clearly it's not good enough. The fact that Windows backends are born in suspended state until the postmaster resumes them is indeed a new and significant hole in that theory. Preemptive termination is the only thing that makes sense.

*We used to have places that waited but forgot to handle PM exit, and I don't recall "manual orphan cleanup needed" reports since we enforced a central handler. But see also my earlier note about systemd potentially hiding problems these days, if using "mixed" mode to SIGKILL the whole cgroup.