[PATCH] Fix orphaned backend processes on Windows using Job Objects
| From | Bryan Green |
|---|---|
| Subject | [PATCH] Fix orphaned backend processes on Windows using Job Objects |
| Date | |
| Msg-id | 880214db-ab8c-4b9e-852c-b0f6d90d3f3d@gmail.com |
| Responses | Re: [PATCH] Fix orphaned backend processes on Windows using Job Objects |
| List | pgsql-hackers |

Greetings,

When the postmaster exits unexpectedly on Windows (crash, kill, debugger abort), backend processes continue running. Windows lacks any equivalent to Unix's getppid() orphan detection. These orphaned backends hold locks and shared memory, preventing a clean restart. The result is delayed restarts and manual killing of orphans.

The problem is easy to reproduce. Start postgres, open a transaction with LOCK TABLE, then kill the postmaster with taskkill /F. The backend continues running and restart fails. Manual cleanup is required.

Current approaches (inherited event handles, shared memory flags) depend on the postmaster running code during exit. A segfault or kill bypasses all of that.

My proposed solution is to use Windows Job Objects with KILL_ON_JOB_CLOSE. We just need to call CreateJobObject() in PostmasterMain(), configure the job with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE, and assign the postmaster to it. Children inherit membership automatically. When the job handle closes on postmaster exit, the kernel terminates all children atomically. This is kernel-enforced with no polling and no race conditions.

Job setup can fail if postgres already runs under an existing job (service managers, debuggers). Windows 7 disallows nested jobs. We detect this with IsProcessInJob(), and if AssignProcessToJobObject() returns ERROR_ACCESS_DENIED, we log and continue without orphan protection.

KILL_ON_JOB_CLOSE doesn't interfere with clean shutdown. Normal shutdown signals backends via SetEvent, they exit, the postmaster exits, and the job closes. Nothing is left to kill. The flag only fires during crashes when backends are still running, which is exactly when forced termination is correct.

The code is ~200 lines in pg_job_object.c, less than win32/signal.c (~500 lines). It fails gracefully and works regardless of how postgres is started, unlike service manager approaches, and it avoids the unreliability of polling.

The patch has been tested on Windows 10/11 with both MSVC and MinGW builds.
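
For readers who want to see the shape of the job setup described above, here is a minimal sketch. It is not the attached patch: the function name sketch_setup_job_object() and the JobSetupResult enum are hypothetical, logging goes to stderr rather than through ereport(), and the non-Windows branch exists only so the snippet compiles everywhere. The Win32 calls are the ones named in this message.

```c
#include <stdio.h>
#include <string.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical result codes for this sketch; not taken from the patch. */
typedef enum
{
	JOB_SETUP_OK,			/* process is now in a kill-on-close job */
	JOB_SETUP_SKIPPED,		/* assignment denied; running unprotected */
	JOB_SETUP_FAILED,		/* unexpected Win32 error */
	JOB_SETUP_UNSUPPORTED	/* not a Windows build */
} JobSetupResult;

/* Hypothetical helper, meant to be called once early in startup. */
JobSetupResult
sketch_setup_job_object(void)
{
#ifdef _WIN32
	/* Deliberately never closed: the job must live as long as the process. */
	static HANDLE job = NULL;
	JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
	BOOL		in_job = FALSE;

	/* Informational only: are we already inside someone else's job? */
	if (IsProcessInJob(GetCurrentProcess(), NULL, &in_job) && in_job)
		fprintf(stderr, "already in a job; assignment may be denied\n");

	/* NULL attributes: unnamed job, handle not inheritable by children. */
	job = CreateJobObject(NULL, NULL);
	if (job == NULL)
		return JOB_SETUP_FAILED;

	/* Ask the kernel to kill every member when the last handle closes. */
	memset(&limits, 0, sizeof(limits));
	limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
	if (!SetInformationJobObject(job, JobObjectExtendedLimitInformation,
								 &limits, sizeof(limits)))
	{
		CloseHandle(job);
		job = NULL;
		return JOB_SETUP_FAILED;
	}

	/* Processes created after this point become members automatically. */
	if (!AssignProcessToJobObject(job, GetCurrentProcess()))
	{
		DWORD		err = GetLastError();

		CloseHandle(job);
		job = NULL;
		if (err == ERROR_ACCESS_DENIED)
		{
			fprintf(stderr, "job assignment denied; no orphan protection\n");
			return JOB_SETUP_SKIPPED;
		}
		return JOB_SETUP_FAILED;
	}
	return JOB_SETUP_OK;
#else
	return JOB_SETUP_UNSUPPORTED;
#endif
}
```

The sketch keeps the job handle open for the life of the process and creates it as non-inheritable, so the handle held by the calling process is the only one. If a child inherited the handle, the job would stay open after the parent died and the kill would not fire.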
Nested jobs fail gracefully as expected. Clean shutdown is unaffected. Crash tests with taskkill /F, debugger abort, and access violations all correctly terminate children immediately with zero orphans.

This patch does not include automated tests because the core functionality (orphan prevention on crash) requires simulating process termination, which is difficult to test reliably in CI.

Patch attached. Can add documentation if this approach is approved.

Thoughts?

Bryan Green