Discussion: [PATCH] Fix orphaned backend processes on Windows using Job Objects

[PATCH] Fix orphaned backend processes on Windows using Job Objects

From
Bryan Green
Date:
Greetings,

When the postmaster exits unexpectedly on Windows (crash, kill, debugger
abort), backend processes continue running. Windows lacks any equivalent
to Unix's getppid() orphan detection. These orphaned backends hold locks
and shared memory, preventing a clean restart. The result is delayed
restarts and manual killing of the orphaned processes.

The problem is easy to reproduce. Start postgres, open a transaction
with LOCK TABLE, then kill the postmaster with taskkill /F. The backend
continues running and restart fails. Manual cleanup is required.

Current approaches (inherited event handles, shared memory flags) depend
on the postmaster running code during exit. A segfault or kill bypasses
all of that.

My proposed solution is to use Windows Job Objects with KILL_ON_JOB_CLOSE.

We just need to call CreateJobObject() in PostmasterMain(), configure
with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
Children inherit membership automatically. When the job handle closes on
postmaster exit, the kernel terminates all children atomically. This is
kernel-enforced with no polling and no race conditions.
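
A minimal sketch of the setup described above, for illustration only and not
taken from the attached patch. The names enable_orphan_guard() and
orphan_guard_job are hypothetical; the Win32 calls are the ones named in this
message plus SetInformationJobObject(), which is how the limit flag gets
applied. Off Windows the function is a stub so the sketch still builds.

```c
#ifdef _WIN32
#include <windows.h>

/* Held for the life of the process; hypothetical name, not from the patch. */
static HANDLE orphan_guard_job = NULL;
#endif

/*
 * Returns 1 if the current process is now in a kill-on-close job,
 * 0 if protection could not be set up (or this is not Windows).
 */
int
enable_orphan_guard(void)
{
#ifdef _WIN32
	JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
	HANDLE		job;

	if (orphan_guard_job != NULL)
		return 1;				/* already set up */

	/* NULL attributes: unnamed job, handle not inheritable by children */
	job = CreateJobObject(NULL, NULL);
	if (job == NULL)
		return 0;

	ZeroMemory(&limits, sizeof(limits));
	limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;

	if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
								 &limits, sizeof(limits)) ||
		!AssignProcessToJobObject(job, GetCurrentProcess()))
	{
		CloseHandle(job);		/* job is still empty, so nothing is killed */
		return 0;
	}

	/*
	 * Never closed explicitly. The OS closes it when this process exits,
	 * and closing the last handle is what triggers the kill.
	 */
	orphan_guard_job = job;
	return 1;
#else
	return 0;
#endif
}
```

One design point worth noting in the sketch: the job handle is created
non-inheritable. If a child also held a handle to the job, the postmaster's
exit would presumably no longer close the last handle, and the kill would not
fire.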

Job creation can fail if postgres runs under an existing job (service
managers, debuggers). Windows 7 disallows nested jobs. We detect this
with IsProcessInJob(), and if AssignProcessToJobObject() returns
ERROR_ACCESS_DENIED, we log and continue without orphan protection.
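
The fallback decision can be sketched as below. This is an assumption about
how the pieces fit together, not the patch's code: JobOutcome,
classify_assign_result() and try_assign_self() are hypothetical names. The
non-Windows stand-ins exist only so the decision logic compiles elsewhere;
their values match winerror.h.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins for building off Windows; values match winerror.h. */
typedef unsigned long DWORD;
#define ERROR_SUCCESS		0L
#define ERROR_ACCESS_DENIED	5L
#endif

typedef enum
{
	JOB_PROTECTED,				/* assignment worked */
	JOB_SKIPPED_NESTED,			/* already in a job, nesting refused */
	JOB_FAILED					/* some other failure */
} JobOutcome;

/*
 * Map "were we already in a job?" plus the error from
 * AssignProcessToJobObject() to an outcome. In both non-success cases
 * the caller is expected to log and continue without orphan protection.
 */
JobOutcome
classify_assign_result(int already_in_job, DWORD err)
{
	if (err == ERROR_SUCCESS)
		return JOB_PROTECTED;
	if (already_in_job && err == ERROR_ACCESS_DENIED)
		return JOB_SKIPPED_NESTED;
	return JOB_FAILED;
}

#ifdef _WIN32
/* Try to put the current process into 'job' and classify the result. */
JobOutcome
try_assign_self(HANDLE job)
{
	BOOL		in_job = FALSE;
	DWORD		err = ERROR_SUCCESS;

	/* A NULL job handle asks "is this process in any job at all?" */
	if (!IsProcessInJob(GetCurrentProcess(), NULL, &in_job))
		in_job = FALSE;

	if (!AssignProcessToJobObject(job, GetCurrentProcess()))
		err = GetLastError();

	return classify_assign_result(in_job, err);
}
#endif
```

Keeping the classification separate from the Win32 calls makes the fallback
rule easy to check on its own, for example that
classify_assign_result(1, ERROR_ACCESS_DENIED) yields JOB_SKIPPED_NESTED.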

KILL_ON_JOB_CLOSE doesn't interfere with clean shutdown. Normal shutdown
signals backends via SetEvent, they exit, postmaster exits, job closes.
Nothing left to kill. The flag only fires during crashes when backends
are still running - exactly when forced termination is correct.

The code is ~200 lines in pg_job_object.c, less than win32/signal.c
(~500 lines). It fails gracefully and works regardless of how postgres
is started, unlike service manager approaches. It also avoids the
unreliability of polling.

The patch has been tested on Windows 10/11 with both MSVC and MinGW
builds. Nested jobs fail gracefully as expected. Clean shutdown is
unaffected. Crash tests with taskkill /F, debugger abort, and access
violations all correctly terminate children immediately with zero orphans.

This patch does not include automated tests because the core
functionality (orphan prevention on crash) requires simulating process
termination, which is difficult to test reliably in CI.

Patch attached. Can add documentation if this approach is approved.

Thoughts?

Bryan Green
Attachments

Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects

From
Andres Freund
Date:
Hi,

On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
> We just need to call CreateJobObject() in PostmasterMain(), configure
> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
> Children inherit membership automatically. When the job handle closes on
> postmaster exit, the kernel terminates all children atomically. This is
> kernel-enforced with no polling and no race conditions.

What happens if a postmaster child exits irregularly? Is postmaster terminated
as well?

> The patch has been tested on Windows 10/11 with both MSVC and MinGW
> builds. Nested jobs fail gracefully as expected. Clean shutdown is
> unaffected. Crash tests with taskkill /F, debugger abort, and access
> violations all correctly terminate children immediately with zero orphans.
> 
> This patch does not include automated tests because the core
> functionality (orphan prevention on crash) requires simulating process
> termination, which is difficult to test reliably in CI.

Why is it difficult to test in CI? We do some related tests in
013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
tests for postmaster?

Greetings,

Andres Freund



Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects

From
Bryan Green
Date:
On 11/3/2025 9:19 AM, Andres Freund wrote:
> Hi,
> 
> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>> We just need to call CreateJobObject() in PostmasterMain(), configure
>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>> Children inherit membership automatically. When the job handle closes on
>> postmaster exit, the kernel terminates all children atomically. This is
>> kernel-enforced with no polling and no race conditions.
> 
> What happens if a postmaster child exits irregularly? Is postmaster terminated
> as well?
> 

No, Job Objects are unidirectional. KILL_ON_JOB_CLOSE only acts when the
postmaster (which holds the job handle) exits. Backend crashes are
handled through PostgreSQL's existing crash recovery mechanism - the
postmaster detects the crash via WaitForMultipleObjects() and initiates
recovery as normal.

The Job Object only takes action when the job handle closes, which
happens when the postmaster exits. It's loosely analogous to signaling a
Unix process group - a signal sent to the group reaches every member,
but a child dying doesn't affect the parent.

>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>> violations all correctly terminate children immediately with zero orphans.
>>
>> This patch does not include automated tests because the core
>> functionality (orphan prevention on crash) requires simulating process
>> termination, which is difficult to test reliably in CI.
> 
> Why is it difficult to test in CI? We do some related tests in
> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
> tests for postmaster?
>

Fair point. I was hesitant because testing the actual orphan prevention
requires killing the postmaster while backends are active, which seemed
fragile. But you're right that we already test similar scenarios.

I can add a test to 013_crash_restart.pl (or a new Windows-specific test
file) that:
1. Starts server with active backend
2. Kills postmaster ungracefully (taskkill /F)
3. Verifies backend process terminates automatically
4. Confirms clean restart

Would that be sufficient, or do you have other test scenarios in mind?
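
Step 3 of the list above could be checked along these lines. This is a C
sketch for illustration only; the actual test would live in the Perl TAP
suite, and process_gone_within() is a hypothetical helper, not part of the
patch. Off Windows it just reports "could not tell".

```c
#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper: 1 = process gone within timeout_ms,
 * 0 = still running at the deadline, -1 = could not tell.
 */
int
process_gone_within(unsigned long pid, unsigned long timeout_ms)
{
#ifdef _WIN32
	HANDLE		proc = OpenProcess(SYNCHRONIZE, FALSE, (DWORD) pid);
	DWORD		rc;

	if (proc == NULL)
	{
		/* a PID that no longer exists is normally ERROR_INVALID_PARAMETER */
		return (GetLastError() == ERROR_INVALID_PARAMETER) ? 1 : -1;
	}

	/* a process handle becomes signaled when the process terminates */
	rc = WaitForSingleObject(proc, (DWORD) timeout_ms);
	CloseHandle(proc);

	if (rc == WAIT_OBJECT_0)
		return 1;
	return (rc == WAIT_TIMEOUT) ? 0 : -1;
#else
	(void) pid;
	(void) timeout_ms;
	return -1;
#endif
}
```

A caveat with any PID-based check is PID reuse: the backend's PID should be
recorded before the postmaster is killed and checked promptly afterwards.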


> Greetings,
> 
> Andres Freund





Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects

From
Andres Freund
Date:
On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
> On 11/3/2025 9:19 AM, Andres Freund wrote:
> > Hi,
> > 
> > On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
> >> We just need to call CreateJobObject() in PostmasterMain(), configure
> >> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
> >> Children inherit membership automatically. When the job handle closes on
> >> postmaster exit, the kernel terminates all children atomically. This is
> >> kernel-enforced with no polling and no race conditions.
> > 
> > What happens if a postmaster child exits irregularly? Is postmaster terminated
> > as well?
> > 
> 
> No, Job Objects are unidirectional.

Great.


> >> The patch has been tested on Windows 10/11 with both MSVC and MinGW
> >> builds. Nested jobs fail gracefully as expected. Clean shutdown is
> >> unaffected. Crash tests with taskkill /F, debugger abort, and access
> >> violations all correctly terminate children immediately with zero orphans.
> >>
> >> This patch does not include automated tests because the core
> >> functionality (orphan prevention on crash) requires simulating process
> >> termination, which is difficult to test reliably in CI.
> > 
> > Why is it difficult to test in CI? We do some related tests in
> > 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
> > tests for postmaster?
> >
> 
> Fair point. I was hesitant because testing the actual orphan prevention
> requires killing the postmaster while backends are active, which seemed
> fragile. But you're right that we already test similar scenarios.
> 
> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
> file) that:
> 1. Starts server with active backend
> 2. Kills postmaster ungracefully (taskkill /F)
> 3. Verifies backend process terminates automatically
> 4. Confirms clean restart
> 
> Would that be sufficient, or do you have other test scenarios in mind?

That's pretty much what I had in mind.

Greetings,

Andres Freund



Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects

From
Bryan Green
Date:
On 11/3/2025 9:29 AM, Andres Freund wrote:
> On 2025-11-03 09:25:11 -0600, Bryan Green wrote:
>> On 11/3/2025 9:19 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2025-11-03 09:12:03 -0600, Bryan Green wrote:
>>>> We just need to call CreateJobObject() in PostmasterMain(), configure
>>>> with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster.
>>>> Children inherit membership automatically. When the job handle closes on
>>>> postmaster exit, the kernel terminates all children atomically. This is
>>>> kernel-enforced with no polling and no race conditions.
>>>
>>> What happens if a postmaster child exits irregularly? Is postmaster terminated
>>> as well?
>>>
>>
>> No, Job Objects are unidirectional.
> 
> Great.
> 
> 
>>>> The patch has been tested on Windows 10/11 with both MSVC and MinGW
>>>> builds. Nested jobs fail gracefully as expected. Clean shutdown is
>>>> unaffected. Crash tests with taskkill /F, debugger abort, and access
>>>> violations all correctly terminate children immediately with zero orphans.
>>>>
>>>> This patch does not include automated tests because the core
>>>> functionality (orphan prevention on crash) requires simulating process
>>>> termination, which is difficult to test reliably in CI.
>>>
>>> Why is it difficult to test in CI? We do some related tests in
>>> 013_crash_restart.pl, it doesn't seem like it ought to be hard to also add
>>> tests for postmaster?
>>>
>>
>> Fair point. I was hesitant because testing the actual orphan prevention
>> requires killing the postmaster while backends are active, which seemed
>> fragile. But you're right that we already test similar scenarios.
>>
>> I can add a test to 013_crash_restart.pl (or a new Windows-specific test
>> file) that:
>> 1. Starts server with active backend
>> 2. Kills postmaster ungracefully (taskkill /F)
>> 3. Verifies backend process terminates automatically
>> 4. Confirms clean restart
>>
>> Would that be sufficient, or do you have other test scenarios in mind?
> 
> That's pretty much what I had in mind.
> 
> Greetings,
> 
> Andres Freund


I've implemented the test in 013_crash_restart.pl.

The test passes on Windows 10/11 with both MSVC and MinGW builds.
Backends are typically terminated within 100-200ms after postmaster
kill, confirming the Job Object KILL_ON_JOB_CLOSE mechanism works as
intended.

Updated patch (v2) attached.

--
Bryan
Attachments