Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From Thomas Munro
Subject Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Msg-id CAEepm=0TBygUnw0MuR6HCZ5mZ483U0ur+GwEKZsKNR4+E1asAQ@mail.gmail.com
In reply to Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown  (David Kohn <djk447@gmail.com>)
Responses Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown  (David Kohn <djk447@gmail.com>)
List pgsql-bugs
On Tue, Jan 30, 2018 at 4:33 PM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 9:07 PM Thomas Munro <thomas.munro@enterprisedb.com>
>> Thanks for the report!  Based on the mention of BtreePage, this sounds
>> like the following bug:
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com
>>
>> The fix for that will be in 10.2 (current target date: February 8th).
>> The workaround in the meantime would be to disable parallelism, at
>> least for the queries doing parallel index scans if you can identify
>> them.
>
> That sounds great, I hope that patch will fix it, I'm not quite sure it will
> though. Some of them have workers that are in the BtreePage state, however
> at least as many of the hung queries have only workers in the
> MessageQueuePutMessage state. Would you expect the patch to fix those as
> well? Or could it be something different?

Maybe like this:

1.  Leader process encounters the bug and starts waiting for itself
forever (caused by encountering concurrently deleted btree pages on a
busy system, see that other thread for gory details).  This looks like
wait event = BtreePage.
2.  Worker backend has emitted a bunch of tuples and fills up its
output tuple queue, but the leader isn't reading from the queue, so
the worker waits forever.  This looks like wait event =
MessageQueuePutMessage.

The second thing is just expected and correct behaviour in workers if
the leader process is jammed.
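
To make the second step concrete, here is a rough Python analogy, not PostgreSQL code: a bounded queue stands in for the worker's tuple queue, and a thread stands in for the worker. The names QUEUE_CAPACITY, TUPLES_TO_SEND, worker and simulate are made up for illustration, and the capacity is arbitrary. It only shows that a producer blocks once a bounded queue is full and the consumer is not reading, and resumes as soon as the consumer reads again.

```python
import queue
import threading

QUEUE_CAPACITY = 4    # hypothetical size, chosen only for the demo
TUPLES_TO_SEND = 10   # hypothetical number of tuples the worker emits


def worker(tuple_queue, progress):
    """Emit tuples; put() blocks once the queue is full, which is the
    analogue of a worker waiting in MessageQueuePutMessage."""
    for i in range(TUPLES_TO_SEND):
        tuple_queue.put(i)      # blocks while the queue is full
        progress.append(i)      # records tuples actually handed over


def simulate():
    tuple_queue = queue.Queue(maxsize=QUEUE_CAPACITY)
    progress = []
    t = threading.Thread(target=worker, args=(tuple_queue, progress), daemon=True)
    t.start()

    # The "leader" is jammed and does not read, so the worker stalls.
    t.join(timeout=0.5)
    stalled = t.is_alive()
    sent_while_jammed = len(progress)

    # Once the leader reads again, the worker finishes on its own.
    received = []
    while len(received) < TUPLES_TO_SEND:
        received.append(tuple_queue.get(timeout=2))
    t.join(timeout=2)
    return stalled, sent_while_jammed, received, t.is_alive()


if __name__ == "__main__":
    print(simulate())
```

The worker itself is doing nothing wrong here; it is only stuck because nobody drains the queue.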

>> However, I'm not entirely sure why you're not able to cancel these
>> backends politely with pg_cancel_backend().  For example, the
>> BtreePage waiter should be in ConditionVariableSleep() and should be
>> interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().
>
> All of them are definitely un-killable by anything other than a kill -9 that
> I've found so far. I have a feeling it has something to do with:
> https://jobs.zalando.com/tech/blog/hack-to-terminate-tcp-conn-postgres/?gh_src=4n3gxh1
> but I'm not 100% sure, as I didn't set tcp settings low enough to make
> catching a packet all that reasonable. I'm happy to try to investigate
> further, I just don't quite know what that should entail. If you have things
> that you think would be helpful, please do let me know.

Hmm.  Well, in a case like this the most useful thing would
usually be a backtrace ("gdb /path/to/binary -p PID", then "bt") to
show exactly where they're stuck.  But in this case we already know
more-or-less where they're waiting (the wait event names tell us), and
the real question is: why on earth aren't the wait loops responding to
SIGINT and SIGTERM?  I wonder if there might be something funky about
parallel query + statement timeouts.
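
For reference, the behaviour I'd expect from a well-behaved wait loop looks roughly like the following Python analogy. This is a sketch of the general pattern only, not the actual ConditionVariableSleep() or CHECK_FOR_INTERRUPTS() code; Waiter, QueryCanceled, cancel and demo are hypothetical names. The point is that the loop re-checks a pending-interrupt flag on every wakeup, and the canceller both sets the flag and wakes the sleeper.

```python
import threading


class QueryCanceled(Exception):
    """Stand-in for the error a cancelled backend would raise."""


class Waiter:
    def __init__(self):
        self.cv = threading.Condition()
        self.condition_met = False       # e.g. "the page we want is available"
        self.interrupt_pending = False   # set by the simulated cancel request

    def check_for_interrupts(self):
        if self.interrupt_pending:
            raise QueryCanceled("canceling statement due to user request")

    def sleep_until_condition(self):
        with self.cv:
            while not self.condition_met:
                # Checked before every sleep, hence also after every wakeup.
                self.check_for_interrupts()
                self.cv.wait()

    def cancel(self):
        # Set the flag and wake the sleeper under the same lock,
        # so the wakeup cannot be lost.
        with self.cv:
            self.interrupt_pending = True
            self.cv.notify_all()


def demo():
    waiter = Waiter()
    outcome = []

    def run():
        try:
            waiter.sleep_until_condition()
            outcome.append("condition met")
        except QueryCanceled as exc:
            outcome.append("canceled: " + str(exc))

    t = threading.Thread(target=run, daemon=True)
    t.start()
    waiter.cancel()
    t.join(timeout=2)
    return outcome, t.is_alive()


if __name__ == "__main__":
    print(demo())
```

If either half is missing (the flag is never checked, or the sleeper is never woken), the loop waits forever, which is the kind of thing a backtrace would help pin down.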

-- 
Thomas Munro
http://www.enterprisedb.com

