Jeff Janes <jeff.janes@gmail.com> writes:
> It looks to me like the SIGQUIT from the postmaster is simply getting
> lost. And from what little I understand of signal handling, this is a
> known race with system(3). The archive_command, child of archiver,
> exits before it can receive the signal sent to the entire archiver
> process group, so it doesn't set its exit status to show it was
> signalled. But the signal sent directly to the archiver reaches it
> while it is still ignoring SIGQUITs.
Ugh.
> If the SIGQUIT is getting lost in a race, could it just be blocked
> during the system(3) call?
> I don't know what happens if you call system(3) with SIGQUIT being blocked.
On my machine, man system(3) saith:
system() ignores the SIGINT and SIGQUIT signals, and blocks the SIGCHLD signal, while waiting for the command to
terminate. If this might cause the application to miss a signal that would have killed it, the application should
examine the return value from system() and take whatever action is appropriate to the application if the command
terminated due to receipt of a signal.
Now, the code that directly calls system(), namely pgarch_archiveXlog(),
knows this perfectly well, as per the comment at lines 590ff in HEAD.
However, the code that *calls* it did not get the memo :-(, and appears
to be willing to retry regardless.
> Or maybe the postmaster should not be infinitely patient, but send
> another round of signals after a brief delay.
If the first one was ignored, later ones might be too.
I'm inclined to think that we should change pgarch_archiveXlog to
detect these specific signal conditions and just directly exit(),
rather than giving its caller a chance to blow the decision.
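Roughly what I have in mind, as a sketch only and not a patch: the function names below are made up for illustration, and the choice of which statuses count as fatal is an assumption. In particular, treating exit codes above 128 as fatal relies on the common shell convention of reporting a signalled child as 128 plus the signal number.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical predicate: does this system() result mean the command
 * (or something the shell ran on its behalf) was killed by a signal?
 */
static bool
status_is_lethal(int rc)
{
    if (rc == -1)
        return false;           /* system() itself failed; no wait status */
    if (WIFSIGNALED(rc))
        return true;            /* the shell itself died from a signal */
    if (WIFEXITED(rc) && WEXITSTATUS(rc) > 128)
        return true;            /* shell reports 128 + signo (assumption) */
    return false;
}

/*
 * Hypothetical wrapper: run the command, and on a signal-related failure
 * exit immediately instead of returning a status the caller might retry.
 */
static bool
run_archive_command(const char *cmd)
{
    int         rc = system(cmd);

    if (rc == 0)
        return true;            /* command succeeded */
    if (status_is_lethal(rc))
        exit(1);                /* do not let the caller decide to retry */
    return false;               /* ordinary failure; a retry is reasonable */
}
```

The point of exiting inside the wrapper is that the retry loop never sees the signalled case at all, so it cannot get the decision wrong.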
regards, tom lane