Hi,
I found a crash in pg_restore. It occurs when there is a failure before all the child workers have been created. The backtrace is given below:
#0 0x00007f9c6d31e337 in raise () from /lib64/libc.so.6
#1 0x00007f9c6d31fa28 in abort () from /lib64/libc.so.6
#2 0x00007f9c6d317156 in __assert_fail_base () from /lib64/libc.so.6
#3 0x00007f9c6d317202 in __assert_fail () from /lib64/libc.so.6
#4 0x0000000000407c9e in WaitForTerminatingWorkers (pstate=0x14af7f0) at parallel.c:515
#5 0x0000000000407bf9 in ShutdownWorkersHard (pstate=0x14af7f0) at parallel.c:451
#6 0x0000000000407ae9 in archive_close_connection (code=1, arg=0x6315a0 <shutdown_info>) at parallel.c:368
#7 0x000000000041a7c7 in exit_nicely (code=1) at pg_backup_utils.c:99
#8 0x0000000000408180 in ParallelBackupStart (AH=0x14972e0) at parallel.c:967
#9 0x000000000040a3dd in RestoreArchive (AHX=0x14972e0) at pg_backup_archiver.c:661
#10 0x0000000000404125 in main (argc=6, argv=0x7ffd5146f308) at pg_restore.c:443
The problem is as follows:
- ParallelBackupStart initially sets pstate->numWorkers to the requested number of workers.
- The workers are then created one by one.
- A failure occurs before all the worker processes have been created.
- The parent then terminates the child processes and waits for all of them to exit.
- WaitForTerminatingWorkers checks whether every process has terminated by calling HasEveryWorkerTerminated.
- HasEveryWorkerTerminated always returns false, because it checks numWorkers slots rather than the number of processes that were actually forked, so execution reaches the next assertion, "Assert(j < pstate->numWorkers);", which fails.
The attached patch fixes this by setting pstate->numWorkers to the actual worker count as each child process is created.
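
To make the sequence easier to follow, here is a small standalone sketch of the counting problem. It is a toy model, not the code in parallel.c and not the attached patch: ToyParallelState, the terminated[] array, and all toy_* functions are hypothetical names, and the assumption that a never-forked slot counts as "not terminated" is taken from the description above. Only the numWorkers field name comes from the real code.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WORKERS 8

/* Toy stand-in for the parallel state; everything except the
 * numWorkers field name is hypothetical. */
typedef struct
{
    int  numWorkers;                   /* slots the parent will wait on */
    bool terminated[TOY_MAX_WORKERS];  /* set once slot i's child is reaped */
} ToyParallelState;

/*
 * Try to start `requested` workers, with a simulated failure after
 * `succeed` of them. count_up_front = true models the reported
 * behaviour (numWorkers set before any worker exists); false models
 * the idea of the fix (numWorkers tracks workers actually started).
 * Returns the number of workers started.
 */
static int
toy_start_workers(ToyParallelState *pstate, int requested, int succeed,
                  bool count_up_front)
{
    int started = 0;

    assert(requested <= TOY_MAX_WORKERS && succeed <= requested);

    pstate->numWorkers = count_up_front ? requested : 0;
    for (int i = 0; i < requested; i++)
        pstate->terminated[i] = false;

    for (int i = 0; i < requested; i++)
    {
        if (i == succeed)
            break;              /* failure before all workers are created */
        started++;
        if (!count_up_front)
            pstate->numWorkers = started;
    }
    return started;
}

/* The parent may stop waiting only when every counted slot is terminated. */
static bool
toy_has_every_worker_terminated(const ToyParallelState *pstate)
{
    for (int i = 0; i < pstate->numWorkers; i++)
    {
        if (!pstate->terminated[i])
            return false;
    }
    return true;
}

/* Reap the children that really exist, then report whether the parent
 * considers the shutdown complete. */
static bool
toy_shutdown(ToyParallelState *pstate, int started)
{
    for (int i = 0; i < started; i++)
        pstate->terminated[i] = true;
    return toy_has_every_worker_terminated(pstate);
}
```

With 4 workers requested and a failure after 2, the up-front count leaves two slots that can never be marked terminated, so toy_shutdown() reports that the parent is still waiting. When the count follows the workers actually started, the same scenario completes.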
Thoughts?
Regards,
Vignesh
EnterpriseDB:
http://www.enterprisedb.com