Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
From | Thomas Munro |
---|---|
Subject | Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown |
Date | |
Msg-id | CAEepm=0TBygUnw0MuR6HCZ5mZ483U0ur+GwEKZsKNR4+E1asAQ@mail.gmail.com |
In reply to | Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown (David Kohn <djk447@gmail.com>) |
Responses | Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown (David Kohn <djk447@gmail.com>) |
List | pgsql-bugs |
On Tue, Jan 30, 2018 at 4:33 PM, David Kohn <djk447@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 9:07 PM Thomas Munro <thomas.munro@enterprisedb.com>
>> Thanks for the report! Based on the mention of BtreePage, this sounds
>> like the following bug:
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com
>>
>>
>> The fix for that will be in 10.2 (current target date: February 8th).
>> The workaround in the meantime would be to disable parallelism, at
>> least for the queries doing parallel index scans if you can identify
>> them.
>
> That sounds great, I hope that patch will fix it, I'm not quite sure it will
> though. Some of them have workers that are in the BtreePage state, however
> at least as many of the hung queries have only workers in the
> MessageQueuePutMessage state. Would you expect the patch to fix those as
> well? Or could it be something different?

Maybe like this:

1. Leader process encounters the bug and starts waiting for itself
forever (caused by encountering concurrently deleted btree pages on a
busy system, see that other thread for gory details). This looks like
wait event = BtreePage.

2. Worker backend has emitted a bunch of tuples and fills up its
output tuple queue, but the leader isn't reading from the queue, so
the worker waits forever. This looks like wait event =
MessageQueuePutMessage.

The second thing is just expected and correct behaviour in workers if
the leader process is jammed.

>> However, I'm not entirely sure why you're not able to cancel these
>> backends politely with pg_cancel_backend(). For example, the
>> BtreePage waiter should be in ConditionVariableSleep() and should be
>> interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().
>
> All of them are definitely un-killable by anything other than a kill -9 that
> I've found so far. I have a feeling it has something to do with:
> https://jobs.zalando.com/tech/blog/hack-to-terminate-tcp-conn-postgres/?gh_src=4n3gxh1
> but I'm not 100% sure, as I didn't set tcp settings low enough to make
> catching a packet all that reasonable. I'm happy to try to investigate
> further, I just don't quite know what that should entail. If you have things
> that you think would be helpful, please do let me know.

Hmm. Well, in a case like this the most useful thing would usually be
a backtrace ("gdb /path/to/binary -p PID", then "bt") to show exactly
where they're stuck. But in this case we already know more-or-less
where they're waiting (the wait event names tell us), and the real
question is: why on earth aren't the wait loops responding to SIGINT
and SIGTERM? I wonder if there might be something funky about
parallel query + statement timeouts.

-- 
Thomas Munro
http://www.enterprisedb.com
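
For readers who want to see the two-step hang from the message in miniature, here is a toy model in Python. It is only a sketch and not PostgreSQL code: the names `worker`, `run` and `QUEUE_CAPACITY` are made up for illustration, a thread stands in for the worker backend, and a bounded `queue.Queue` stands in for the worker's output tuple queue. A stuck leader (step 1) is modelled simply by never reading from the queue, which leaves the worker blocked on a full queue (step 2).

```python
import queue
import threading

# Hypothetical capacity for the demo; the real tuple queue is a
# shared-memory message queue and is not sized in "tuples" like this.
QUEUE_CAPACITY = 4


def worker(tuple_queue, n_tuples, result):
    """Emit tuples and record how far we got before the queue filled up."""
    sent = 0
    for i in range(n_tuples):
        try:
            # A real worker would wait here indefinitely. The timeout exists
            # only so that this demonstration terminates.
            tuple_queue.put(i, timeout=0.2)
        except queue.Full:
            result["state"] = "blocked: queue full, leader not reading"
            break
        sent += 1
    else:
        result["state"] = "finished"
    result["sent"] = sent


def run(leader_is_stuck, n_tuples=10):
    """Run one worker against a leader that either drains the queue or not."""
    tuple_queue = queue.Queue(maxsize=QUEUE_CAPACITY)
    result = {}
    t = threading.Thread(target=worker, args=(tuple_queue, n_tuples, result))
    t.start()
    if not leader_is_stuck:
        # A healthy leader keeps reading, so the worker never stays blocked.
        for _ in range(n_tuples):
            tuple_queue.get()
    # A stuck leader reads nothing at all.
    t.join()
    return result


if __name__ == "__main__":
    print("healthy leader:", run(leader_is_stuck=False))
    print("stuck leader:  ", run(leader_is_stuck=True))
```

With the stuck leader, the worker gets exactly `QUEUE_CAPACITY` tuples into the queue and then blocks, which is the situation that would show up as a worker waiting in MessageQueuePutMessage while the leader itself sits in a different wait.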