Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro
Subject: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Date:
Msg-id: CAEepm=1_COFxB1c+Kso=JSH9CZtsJWLOji2d4EJB8E15ory3tw@mail.gmail.com
In reply to: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown  (David Kohn <djk447@gmail.com>)
Responses: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
List: pgsql-bugs
On Fri, Feb 2, 2018 at 11:54 AM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 11:08 PM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> the real question is: why on earth aren't the wait loops responding to
>> SIGINT and SIGTERM?  I wonder if there might be something funky about
>> parallel query + statement timeouts.
>
> Agreed. Seems like a backtrace wouldn't help much. I saw the other thread
> with similar cancellation issues; a couple of notes that might help:
> 1) I also have a lateral select inside of a view there. Seems doubtful that
> the lateral has anything to do with it, but in case that could be it, I
> thought I'd pass that along.

I don't think that's directly relevant -- the BtreePage wait event is
caused by Parallel Index Scan, which you could prevent by setting
max_parallel_workers_per_gather = 0 or min_parallel_index_scan_size =
'5TB' (assuming your indexes are smaller than that).  The 10.2 release
that fixes the parallel btree scan bug is due in a couple of days.
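
For example, a quick stopgap until 10.2 might look like this, assuming
superuser access (a plain SET in the affected sessions would work too):

  # Disable parallel query cluster-wide and reload the configuration:
  psql -c "ALTER SYSTEM SET max_parallel_workers_per_gather = 0"
  psql -c "SELECT pg_reload_conf()"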

> 2) Are there any settings that could potentially help with this? For
> instance, this isn't on a replica, so max_standby_archive_delay wouldn't
> (potentially) cancel a query more forcefully; is there anything similar
> that could work here? As you noted, we've already set a statement timeout,
> so it isn't responding to that, but it does get cancelled when another
> (hung) process is SIGKILL-ed. When that happens the db goes into recovery
> mode -- so is it being sent SIGKILL at that point as well? Or is it some
> other signal that is a little less invasive? Probably not, but thought I'd
> ask.

As Robert mentioned on that other thread, there is a place where the
leader waits for backends to exit while ignoring interrupts.  It'd be
good to check whether that's happening here, and also to figure out
what exactly is happening with the workers (and any other backends that
may be involved in this tangle, for example autovacuum).  Can you get
stack traces for all the relevant processes?

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
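
On Linux something like this is usually enough (a rough sketch,
assuming gdb is installed and you can attach to the postgres
processes; see the wiki page above for details and other platforms):

  # Find candidate PIDs (leader, parallel workers, autovacuum, ...):
  ps -C postgres -o pid,command

  # Attach to each suspect process and dump a backtrace:
  gdb -p <PID> --batch -ex 'bt'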

You mentioned that one or more uninterruptible backends were in
mq_putmessage() in pqmq.c (wait event "MessageQueuePutMessage").  I
can't immediately see how that can be uninterruptible (for a while I
wondered if interrupting it was causing it to recurse to try to report
an error, but I don't think that's it).  If you kill -QUIT a worker
that's waiting there, you'll get this:

  background worker "parallel worker" (PID 46693) exited with exit code 2
  terminating any other active server processes

If you kill -KILL you'll get:

  background worker "parallel worker" (PID 46721) was terminated by
  signal 9: Killed
  terminating any other active server processes

Either way, your cluster restarts with a load of "terminating
connection because of crash of another server process" errors. It
seems problematic that if the leader becomes non-interruptible while
the workers are blocked on a full message queue, there is apparently
no way to orchestrate a graceful stop.
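
In the meantime, something like this may help to identify which
processes are wedged and where (a sketch; those are the wait event
names as of 10.x):

  # List backends sitting in the suspect wait events:
  psql -c "SELECT pid, backend_type, wait_event_type, wait_event
           FROM pg_stat_activity
           WHERE wait_event IN ('MessageQueuePutMessage', 'BtreePage')"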

-- 
Thomas Munro
http://www.enterprisedb.com

