Andres Freund <andres@anarazel.de> writes:
> On 2017-04-20 00:50:13 -0400, Tom Lane wrote:
>> My first reaction was that that sounded like a lot more work than removing
>> two lines from maybe_start_bgworker and adjusting some comments. But on
>> closer inspection, the slow-bgworker-start issue isn't the only problem
>> here.
> FWIW, I vaguely remember somewhat related issues on x86/linux too.
After sleeping and thinking more, I've realized that the
slow-bgworker-start issue actually exists on *every* platform, it's just
harder to hit when select() is interruptible. But consider the case
where multiple bgworker-start requests arrive while ServerLoop is
actively executing (perhaps because a connection request just came in).
The postmaster has signals blocked, so nothing happens for the moment.
When we go around the loop and reach
	PG_SETMASK(&UnBlockSig);
the pending SIGUSR1 is delivered, and sigusr1_handler reads all the
bgworker start requests, and services just one of them. Then control
returns and proceeds to
	selres = select(nSockets, &rmask, NULL, NULL, &timeout);
But now there's no interrupt pending. So the remaining start requests
do not get serviced until (a) some other postmaster interrupt arrives,
or (b) the one-minute timeout elapses. They could be waiting awhile.
Bottom line is that any request for more than one bgworker at a time
faces a non-negligible risk of suffering serious latency.
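To put rough numbers on that, here is a toy C model of the scenario above. It is not PostgreSQL code. The function name is hypothetical, and two things are my assumptions: no other postmaster interrupt arrives, and each select() timeout wakeup starts exactly one more worker. The key point it encodes is that several requests made while signals are blocked collapse into a single pending SIGUSR1, so only the first request is serviced right away.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed one-minute select() timeout, per the description above. */
#define SELECT_TIMEOUT_SECS 60

/*
 * Toy model (hypothetical, not the real ServerLoop): returns the simulated
 * seconds until the last of `pending_requests` start requests is serviced,
 * assuming no other interrupts and one worker started per wakeup.
 */
static int seconds_until_all_started(int pending_requests)
{
    int elapsed = 0;
    /* Standard signals coalesce: at most one SIGUSR1 is pending. */
    bool signal_pending = pending_requests > 0;

    assert(pending_requests >= 0);
    while (pending_requests > 0)
    {
        if (signal_pending)
        {
            /* Unblocking delivers the pending SIGUSR1 immediately... */
            signal_pending = false;
            pending_requests--;     /* ...and the handler services just one. */
        }
        else
        {
            /* Nothing pending: select() sleeps for the full timeout. */
            elapsed += SELECT_TIMEOUT_SECS;
            pending_requests--;     /* assumed: the wakeup starts one more */
        }
    }
    return elapsed;
}
```

Under these assumptions, three simultaneous requests leave the last one waiting 120 seconds, while a single request is serviced with no delay.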
I'm coming back to the idea that at least in the back branches, the
thing to do is allow maybe_start_bgworker to start multiple workers.
Is there any actual evidence for the claim that that might have
bad side effects?
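As a sketch of what "start multiple workers" would mean for the scan itself, here is a toy version using a hypothetical ToyWorker array and toy_maybe_start_workers() rather than the real postmaster data structures. The only difference between the two modes is whether the loop stops after the first successful start.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an entry in the postmaster's worker list. */
typedef struct
{
    bool start_requested;   /* a start has been requested for this worker */
    bool running;           /* worker has been (pretend-)started */
} ToyWorker;

/*
 * Hypothetical sketch, not maybe_start_bgworker() itself. With
 * start_all == false it returns after one start (current behavior);
 * with start_all == true it starts every requested worker in one pass.
 * Returns the number of workers started.
 */
static int toy_maybe_start_workers(ToyWorker *workers, int n, bool start_all)
{
    int started = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
    {
        if (!workers[i].start_requested || workers[i].running)
            continue;
        workers[i].running = true;          /* stands in for forking it */
        workers[i].start_requested = false;
        started++;
        if (!start_all)
            break;                          /* one start per call */
    }
    return started;
}
```

Called with start_all == true, a single SIGUSR1 would be enough to service every request that was pending when the handler ran.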
regards, tom lane