Thread: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown


BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15036
Logged by:          David Kohn
Email address:      djk447@gmail.com
PostgreSQL version: 10.1
Operating system:   Ubuntu 16.04
Description:

I have been experiencing a consistent problem with queries that I cannot
kill with pg_cancel_backend or pg_terminate_backend. In many cases they have
been running for days inside a transaction, so they eventually cause
rather large bloat and other problems. All of the backends have
wait_event_type = IPC in pg_stat_activity. The main client backend shows
wait_event = BgWorkerShutdown, and for the background workers I see two
wait_event values (though I'm not sure this is all of them): BtreePage and
MessageQueuePutMessage. I'm quite sure the clients for these are dead: they
had statement timeouts set to an hour at most, and they might have died
sooner than that from other causes. I assume this is a bug and I should be
reporting it here, but if I'm putting it on the wrong list let me know and
I'll move it!
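For reference, a query along these lines against pg_stat_activity is how the stuck processes show up (the wait_event values are the ones described above; adjust as needed):

```sql
-- Illustrative: list the stuck leader and worker backends by wait event (PostgreSQL 10 columns).
SELECT pid, backend_type, state, wait_event_type, wait_event,
       now() - query_start AS running_for
FROM pg_stat_activity
WHERE wait_event IN ('BgWorkerShutdown', 'BtreePage', 'MessageQueuePutMessage');
```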
Best,
David


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Tue, Jan 30, 2018 at 5:48 AM, PG Bug reporting form
<noreply@postgresql.org> wrote:
> The following bug has been logged on the website:
>
> Bug reference:      15036
> Logged by:          David Kohn
> Email address:      djk447@gmail.com
> PostgreSQL version: 10.1
> Operating system:   Ubuntu 16.04
> Description:
>
> I have been experiencing a consistent problem with queries that I cannot
> kill with pg_cancel_backend or pg_terminate_backend. In many cases they have
> been running for days and are in a transaction so it eventually causes
> rather large bloat etc problems. All the backends are in the IPC wait_event.
> The backends appear to either be a main client_backend, in which case
> wait_event_type fields in pg_stat_activity say BgWorkerShutdown and for the
> background workers I see two (though I'm not sure that that this is all of
> them): BtreePage and MessageQueuePutMessage. I'm quite sure the clients for
> these are dead, they had statement timeouts set to an hour at most, they
> might have died sooner than that of other causes. I assume this is a bug and
> I should be reporting it here, but if I'm putting it on the wrong list let
> me know and I'll move it!

Hi David,

Thanks for the report!  Based on the mention of BtreePage, this sounds
like the following bug:

https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com

The fix for that will be in 10.2 (current target date: February 8th).
The workaround in the meantime would be to disable parallelism, at
least for the queries doing parallel index scans if you can identify
them.
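For example, something along these lines (a sketch only; pick whichever scope fits):

```sql
-- Per session, or for the role that runs the affected queries:
SET max_parallel_workers_per_gather = 0;

-- Or cluster-wide until 10.2 is deployed:
ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();
```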

However, I'm not entirely sure why you're not able to cancel these
backends politely with pg_cancel_backend().  For example, the
BtreePage waiter should be in ConditionVariableSleep() and should be
interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: David Kohn
Date:
Responses interleaved with yours. 
On Mon, Jan 29, 2018 at 9:07 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
 
> Hi David,
>
> Thanks for the report!  Based on the mention of BtreePage, this sounds
> like the following bug:
> https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com
>
> The fix for that will be in 10.2 (current target date: February 8th).
> The workaround in the meantime would be to disable parallelism, at
> least for the queries doing parallel index scans if you can identify
> them.

That sounds great, and I hope that patch will fix it, though I'm not quite sure it will. Some of the hung queries have workers in the BtreePage state, but at least as many have only workers in the MessageQueuePutMessage state. Would you expect the patch to fix those as well, or could it be something different?

> However, I'm not entirely sure why you're not able to cancel these
> backends politely with pg_cancel_backend().  For example, the
> BtreePage waiter should be in ConditionVariableSleep() and should be
> interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().

All of them are definitely un-killable by anything other than kill -9, as far as I've found. I have a feeling it has something to do with https://jobs.zalando.com/tech/blog/hack-to-terminate-tcp-conn-postgres/?gh_src=4n3gxh1 but I'm not 100% sure, as I didn't set the TCP settings low enough to make catching a packet all that reasonable. I'm happy to try to investigate further; I just don't quite know what that should entail. If you have things that you think would be helpful, please do let me know.
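(For the record, the polite attempts look like this, with 12345 standing in for one of the stuck PIDs from pg_stat_activity; neither has any visible effect here:)

```sql
-- 12345 stands in for one of the stuck PIDs from pg_stat_activity.
SELECT pg_cancel_backend(12345);     -- sends SIGINT to the backend
SELECT pg_terminate_backend(12345);  -- sends SIGTERM to the backend
```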

Thanks for the help! 
D
 

Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Tue, Jan 30, 2018 at 4:33 PM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 9:07 PM Thomas Munro <thomas.munro@enterprisedb.com>
>> Thanks for the report!  Based on the mention of BtreePage, this sounds
>> like the following bug:
>>
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com
>>
>>
>> The fix for that will be in 10.2 (current target date: February 8th).
>> The workaround in the meantime would be to disable parallelism, at
>> least for the queries doing parallel index scans if you can identify
>> them.
>
> That sounds great, I hope that patch will fix it, I'm not quite sure it will
> though. Some of them have workers that are in the BtreePage state, however
> at least as many of the hung queries have only workers in the
> MessageQueuePutMessage state. Would you expect the patch to fix those as
> well? Or could it be something different?

Maybe like this:

1.  Leader process encounters the bug and starts waiting for itself
forever (caused by encountering concurrently deleted btree pages on a
busy system, see that other thread for gory details).  This looks like
wait event = BtreePage.
2.  Worker backend has emitted a bunch of tuples and fills up its
output tuple queue, but the leader isn't reading from the queue, so
the worker waits forever.  This looks like wait event =
MessageQueuePutMessage.

The second thing is just expected and correct behaviour in workers if
the leader process is jammed.

>> However, I'm not entirely sure why you're not able to cancel these
>> backends politely with pg_cancel_backend().  For example, the
>> BtreePage waiter should be in ConditionVariableSleep() and should be
>> interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().
>
> All of them are definitely un-killable by anything other than a kill -9 that
> I've found so far. I have a feeling it has something to do with:
> https://jobs.zalando.com/tech/blog/hack-to-terminate-tcp-conn-postgres/?gh_src=4n3gxh1
> but I'm not 100% sure, as I didn't set tcp settings low enough to make
> catching a packet all that reasonable. I'm happy to try to investigate
> further, I just don't quite know what that should entail. If you have things
> that you think would be helpful, please do let me know.

Hmm.  Well, usually in a case like this the most useful thing would
be a backtrace ("gdb /path/to/binary -p PID", then "bt") to
show exactly where they're stuck.  But in this case we already know
more-or-less where they're waiting (the wait event names tell us), and
the real question is: why on earth aren't the wait loops responding to
SIGINT and SIGTERM?  I wonder if there might be something funky about
parallel query + statement timeouts.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: David Kohn
Date:


On Mon, Jan 29, 2018 at 11:08 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> the real question is: why on earth aren't the wait loops responding to
> SIGINT and SIGTERM?  I wonder if there might be something funky about
> parallel query + statement timeouts.

Agreed. It seems like a backtrace wouldn't help much. I saw the other thread with similar cancellation issues; a couple of notes that might help:
1) I also have a lateral select inside of a view there. It seems doubtful that the lateral has anything to do with it, but in case it could, I thought I'd pass that along.
2) Are there any settings that could potentially help with this? For instance, this isn't on a replica, so max_standby_archive_delay doesn't apply (it would, potentially, cancel a query more forcefully); is there anything similar that could work here? As you noted, we've already set a statement timeout and the query isn't responding to that, but it does get cancelled when another (hung) process is SIGKILL-ed. When that happens the DB goes into recovery mode, so is it being sent SIGKILL at that point as well, or some other signal that is a little less invasive? Probably not, but I thought I'd ask.

best,

Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Fri, Feb 2, 2018 at 11:54 AM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:08 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> the real question is: why on earth aren't the wait loops responding to
>> SIGINT and SIGTERM?  I wonder if there might be something funky about
>> parallel query + statement timeouts.
>
> Agreed. Seems like a backtrace wouldn't help much. I saw the other thread
> with similar cancellation issues a couple notes that might help:
> 1) I also have a lateral select inside of a view there. seems doubtful that
> the lateral has anything to do with it, but in case that could be it,
> thought I'd pass that along.

I don't think that's directly relevant -- the cause of BtreePage is
Parallel Index Scan, which you could prevent by setting
max_parallel_workers_per_gather = 0 or min_parallel_index_scan_size =
'5TB' (assuming your indexes are smaller than that).  The 10.2 release
that fixes the parallel btree scan bug is due in a couple of days.
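Concretely, either of these would do it (session-level shown here as a sketch):

```sql
-- Keep parallel query but steer the planner away from Parallel Index Scan,
-- assuming no index is anywhere near this size:
SET min_parallel_index_scan_size = '5TB';

-- Or disable parallel query entirely:
SET max_parallel_workers_per_gather = 0;
```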

> 2) Are there any settings that could potentially help with this? for
> instance, this isn't on a replica, so max_standby_archive_delay wouldn't
> more forcefully (potentially) cancel a query, is there anything similar that
> could work here? as you noted we've already set a statement timeout, so it
> isn't responding to that, but it does get cancelled when another (hung)
> process is SIGKILL-ed. When that happens the db goes into recovery mode - so
> is it being sent SIGKILL at that point as well? Or is it some other signal
> that is a little less invasive? Probably not, but thought I'd ask.

As Robert mentioned on that other thread, there is a place where the
leader waits for backends to exit while ignoring interrupts.  It'd be
good to check if that's happening here, and also figure out what
exactly is happening with the workers (and any other backends that may
be involved in this tangle, for example autovacuum).  Can you get
stack traces for all the relevant processes?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

You mentioned that one or more uninterruptible backends were in
mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage").  I
can't immediately see how that can be uninterruptible (for a while I
wondered if interrupting it was causing it to recurse to try to report
an error, but I don't think that's it).  If you kill -QUIT a worker
that's waiting there, you'll get this:

  background worker "parallel worker" (PID 46693) exited with exit code 2
  terminating any other active server processes

If you kill -KILL you'll get:

  background worker "parallel worker" (PID 46721) was terminated by
signal 9: Killed
  terminating any other active server processes

Either way, your cluster restarts with a load of "terminating
connection because of crash of another server process" errors. It
seems problematic that if the leader becomes non-interruptible while
the workers are blocked on a full message queue, there is apparently
no way to orchestrate a graceful stop.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: David Kohn
Date:
It appears that the BTree problem is actually much less common than the other one, and I'm perfectly happy that a bug fix for it is coming out in the next few days; thanks for the work on that. The other one does appear to be different, so I dove into the code a bit to try to figure it out. I'm unsure of my reasoning, but perhaps you can illuminate.

On Mon, Feb 5, 2018 at 4:13 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:


> As Robert mentioned on that other thread, there is a place where the
> leader waits for backends to exit while ignoring interrupts.  It'd be
> good to check if that's happening here, and also figure out what
> exactly is happening with the workers (and any other backends that may
> be involved in this tangle, for example autovacuum).  Can you get
> stack traces for all the relevant processes?
>
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
>
> You mentioned that one or more uninterruptible backends were in
> mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage").  I
> can't immediately see how that can be uninterruptible (for a while I
> wondered if interrupting it was causing it to recurse to try to report
> an error, but I don't think that's it).

I haven't yet had a chance to get a stack trace, but from my reading of the code, the only place a worker will get a wait event of MessageQueuePutMessage is at line 171 of pqmq.c, and the only place a leader would get a wait event of BgWorkerShutdown is at line 1160 of bgworker.c. Given that I am ending up in a state where I have a leader and one or more workers in those states (and, as far as I can tell, after a statement timeout), it seems to me that the following series of events could cause it (though I haven't quite figured out what code path is taken on a cancel and whether this is plausible):
1) The leader gets canceled due to the statement timeout, so it effectively does a rollback, calling AtEOXact_Parallel(), which calls DestroyParallelContext() without first calling WaitForParallelWorkersToFinish(), because we want to end the query immediately without getting any more results. So we detach from the error queue and, a bit later, from any message queues. We haven't checked for interrupts to process any parallel messages that have come in; we then enter our uninterruptible state while we wait for the workers to exit, and we never see the messages we would otherwise have picked up in the CHECK_FOR_INTERRUPTS() call in WaitForBackgroundWorkerShutdown().
2) The background worker is either trying to send a message on the normal tuple queue and has hit the WaitLatch there, or is trying to send a message on the error queue and hit the WaitLatch because the error message is long. (The latter might be the more plausible explanation: I would have potentially long error messages, and we did just attempt to terminate the background worker. I don't know what happens if the worker attempts to send a message and ends up in the WaitLatch between when the terminate message was sent and when the leader detaches from the error queue; perhaps a SetLatch before detaching from the error queue would help?)
3) The worker is uninterruptible because it is waiting on a latch from the parent process before hitting the CHECK_FOR_INTERRUPTS() below the WaitLatch, and the leader process is uninterruptible because it is in that spot where it holds interrupts, so they each wait for the other? (Or perhaps, in the error queue case, the worker is uninterruptible because it is already being forcibly terminated but is waiting inside that call?)

I'm not clear on when we do a SetLatch on those message queues during a cancel of parallel workers, and there are a number of other things that could definitely invalidate this analysis, but I think there could be a plausible explanation in there somewhere. Would a timeout on the WaitLatch inside pqmq.c (a relatively large one, say 5 seconds) be too expensive? It seems like it could solve this problem, but I'm not sure whether the overhead would be a significant slowdown in normal parallel execution, or whether it would cause problems if processing a message were delayed for longer than the timeout.

Thanks for the help on this. I hope the analysis is useful, and do let me know if a stack trace or anything else would be helpful on my end.

Best,
David


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Thu, Feb 8, 2018 at 7:06 AM, David Kohn <djk447@gmail.com> wrote:
> I'm not clear on when we do a SetLatch on those message queues during a
> cancel of parallel workers, and a number of other things that could
> definitely invalidate this analysis, but I think there could be a plausible
> explanation in there somewhere.

shm_mq_detach_internal() does SetLatch(&victim->procLatch) ("victim"
being the counterparty process) after setting mq_detached.  So ideally
no one should ever be able to wait forever on a queue from which the
other end has detached, but perhaps there is some race condition style
bug lurking in here.  I'm going to do some testing and see if I can
break this...

> Thanks for the help on this, I hope this is helpful and do let me know if a
> stacktrace or anything else would be helpful on my end.

Yeah stack traces would be great, if you can.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Thu, Feb 8, 2018 at 1:34 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Thu, Feb 8, 2018 at 7:06 AM, David Kohn <djk447@gmail.com> wrote:
>> I'm not clear on when we do a SetLatch on those message queues during a
>> cancel of parallel workers, and a number of other things that could
>> definitely invalidate this analysis, but I think there could be a plausible
>> explanation in there somewhere.
>
> shm_mq_detach_internal() does SetLatch(&victim->procLatch) ("victim"
> being the counterparty process) after setting mq_detached.  So ideally
> no one should ever be able to wait forever on a queue from which the
> other end has detached, but perhaps there is some race condition style
> bug lurking in here.  I'm going to do some testing and see if I can
> break this...

I tried, but didn't get anywhere with this.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: David Kohn
Date:
I tried to get a stack trace, but because nothing is crashing, normal stack traces don't happen. I tried profiling with perf, sampling specifically the procs that are stuck, but because they're not doing anything there are no samples... so stack traces aren't helping so much. Thoughts on how to get around that?
Best,
D

Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Date:
On Sun, Feb 11, 2018 at 2:58 PM, David Kohn <djk447@gmail.com> wrote:
> I tried to get a stacktrace, but because nothing's crashing normal
> stacktraces don't happen, tried profiling with perf, sampling specifically
> the procs that are stuck, but because they're not doing anything there's no
> samples...so stacktrace not helping so much, thoughts on how to get around
> that?

You can get a backtrace from a running program by connecting to
it with gdb -p PID, then typing bt for the backtrace.  You might need to
install the symbols package if you only see addresses (on debianoid
systems postgresql-10-dbgsym, not sure what it's called on RHELish
systems).

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

-- 
Thomas Munro
http://www.enterprisedb.com


Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: David Kohn
Date:
Sorry for the delay on this.

> You can get a backtrace from a running program by connecting to
> it with gdb -p PID, then typing bt for the backtrace.  You might need to
> install the symbols package if you only see addresses (on debianoid
> systems postgresql-10-dbgsym, not sure what it's called on RHELish
> systems).
>
> https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

Here's a backtrace from one of the running pids:
```
#0  0x00007f699037c9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00005639b9d28c61 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffd43678d90, cur_timeout=-1, set=0x5639bbef83c8) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x5639bbef83c8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffd43678d90, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217728)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:1000
#3  0x00005639b9d290d4 in WaitLatchOrSocket (latch=0x7f697e506f54, wakeEvents=wakeEvents@entry=17, sock=sock@entry=-1, timeout=-1, timeout@entry=0, wait_event_info=wait_event_info@entry=134217728)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:385
#4  0x00005639b9d29185 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=17, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=134217728)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:339
#5  0x00005639b9cccf1b in WaitForBackgroundWorkerShutdown (handle=0x5639bbe8aaf0) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/bgworker.c:1154
#6  0x00005639b9af36fd in WaitForParallelWorkersToExit (pcxt=0x5639bbe8a118, pcxt=0x5639bbe8a118) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/access/transam/parallel.c:655
#7  0x00005639b9af4417 in DestroyParallelContext (pcxt=0x5639bbe8a118) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/access/transam/parallel.c:737
#8  0x00005639b9af4a28 in AtEOXact_Parallel (isCommit=isCommit@entry=0 '\000') at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/access/transam/parallel.c:1006
#9  0x00005639b9affde7 in AbortTransaction () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/access/transam/xact.c:2538
#10 0x00005639b9b00545 in AbortCurrentTransaction () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/access/transam/xact.c:3097
#11 0x00005639b9d4bd6d in PostgresMain (argc=1, argv=argv@entry=0x5639bbe9ae40, dbname=0x5639bbe9ad58 "marjory", username=0x5639bbe39a08 "reporter") at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/tcop/postgres.c:3879
#12 0x00005639b9a850d9 in BackendRun (port=0x5639bbe978f0) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:4405
#13 BackendStartup (port=0x5639bbe978f0) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:4077
#14 ServerLoop () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1755
#15 0x00005639b9cdb78b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1363
#16 0x00005639b9a864d5 in main (argc=5, argv=0x5639bbe37850) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/main/main.c:228
```
Joe, cc'd here, is a colleague who will be able to help out in the future.

Thanks for all the help on this,
David

Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Joseph B
Date:
On Fri, Mar 16, 2018 at 2:42 PM, David Kohn <djk447@gmail.com> wrote:
> Here's a backtrace from one of the running pids:
> [...]


I think this should be the other part of the equation:

```
#0  0x00007f699037c9f3 in __epoll_wait_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00005639b9d28c61 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffd43678540, cur_timeout=-1, set=0x5639bbe3a388) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:1048
#2  WaitEventSetWait (set=set@entry=0x5639bbe3a388, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffd43678540, nevents=nevents@entry=1, wait_event_info=wait_event_info@entry=134217735)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:1000
#3  0x00005639b9d290d4 in WaitLatchOrSocket (latch=0x7f697e544ea4, wakeEvents=wakeEvents@entry=1, sock=sock@entry=-1, timeout=-1, timeout@entry=0, wait_event_info=wait_event_info@entry=134217735)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:385
#4  0x00005639b9d29185 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=1, timeout=timeout@entry=0, wait_event_info=wait_event_info@entry=134217735)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/storage/ipc/latch.c:339
#5  0x00005639b9c55580 in mq_putmessage (msgtype=69 'E', s=<optimized out>, len=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/libpq/pqmq.c:171
#6  0x00005639b9c54d44 in pq_endmessage (buf=buf@entry=0x7ffd43678670) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/libpq/pqformat.c:347
#7  0x00005639b9e5d68c in send_message_to_frontend (edata=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/error/elog.c:3314
#8  EmitErrorReport () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/error/elog.c:1483
#9  0x00005639b9ccc826 in StartBackgroundWorker () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/bgworker.c:779
#10 0x00005639b9cd96cb in do_start_bgworker (rw=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:5728
#11 maybe_start_bgworkers () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:5941
#12 0x00005639b9cda385 in sigusr1_handler (postgres_signal_arg=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:5121
#13 <signal handler called>
#14 0x00007f69903725b3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:84
#15 0x00005639b9a8468c in ServerLoop () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1719
#16 0x00005639b9cdb78b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1363
#17 0x00005639b9a864d5 in main (argc=5, argv=0x5639bbe37850) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/main/main.c:228
```