Discussion: BUG #16817: kill process cause postmaster hang
The following bug has been logged on the website:

Bug reference: 16817
Logged by: Bo Chen
Email address: bchen90@163.com
PostgreSQL version: 11.8
Operating system: euleros v2r7 x86_64
Description:

Hi hackers,

Recently we encountered a problem: after killing the walwriter, we expected the database to recover normally, but it did not (the postmaster hangs in the state 'wait dead end', and the archiver doesn't exit).

After analyzing this problem, we found it may be a long-standing bug. The archiver uses 'system' to call the configured archive command. For 'system', the Linux programmer's manual says: 'During execution of the command, SIGCHLD will be blocked, and SIGINT and SIGQUIT will be ignored'.

So, when a child crashes, we send SIGQUIT to the archiver only once. If the archiver is executing 'system' at that moment, SIGQUIT is ignored, and the postmaster then hangs in the state 'wait dead end'.

As a workaround, we now send SIGUSR2 to the archiver after SIGQUIT in HandleChildCrash. Is there any other solution?

regards, ChenBo
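The behaviour of `system()` quoted from the manual can be observed outside PostgreSQL. The following is a minimal sketch, not PostgreSQL code: a forked child stands in for the archiver and the parent for the postmaster. The function name `child_survives_sigquit`, the `sleep 2` command, and the one-second delay are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo helper (not PostgreSQL code).
 *
 * Forks a child that waits about two seconds, either inside system()
 * (use_system = 1) or in a plain sleep() (use_system = 0). The parent
 * sends SIGQUIT once while the child is waiting.
 *
 * Returns 1 if the child exited normally (the signal had no effect),
 * 0 if the child was terminated by a signal, -1 on error.
 */
int child_survives_sigquit(int use_system)
{
    assert(use_system == 0 || use_system == 1);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: give SIGQUIT its default action and make sure it is unblocked. */
        sigset_t set;
        signal(SIGQUIT, SIG_DFL);
        sigemptyset(&set);
        sigaddset(&set, SIGQUIT);
        sigprocmask(SIG_UNBLOCK, &set, NULL);

        if (use_system)
            _exit(system("sleep 2") == 0 ? 0 : 1);  /* SIGQUIT ignored in here */
        sleep(2);
        _exit(0);
    }

    /* Parent: wait until the child is very likely inside its two-second wait. */
    sleep(1);
    if (kill(pid, SIGQUIT) < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? 1 : 0;
}
```

On Linux with glibc, `child_survives_sigquit(1)` should report that the child survived, because the single SIGQUIT arrives while `system()` has the signal ignored, whereas `child_survives_sigquit(0)` should report that the child was terminated. The one-second delay is a crude way to land the signal inside the call and is only adequate for a demonstration.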
PG Bug reporting form <noreply@postgresql.org> writes:
> Recently we encountered a problem that after killed walwriter, we expect
> the database can recover normally, but it not (the postmaster hang in the
> state of 'wait dead end', and the archiver doesn't exit).
> After analysis this problem, we found it could be a bug for a long time.
> for archiver now use 'system' to call the configured archive command. For
> 'system' the linux programmer's manual describe the following 'During
> execution of the command, SIGCHLD will be blocked, and SIGINT and SIGQUIT
> will be ignored'.
> So, when a child crash, we now just SIGQUIT the archiver just one time,
> while the archiver just execute 'system', SIGQUIT will be ignored, then the
> postmaster hang in state of 'wait dead end'.

Not sure I believe this: why wouldn't the SIGKILL-after-5-seconds logic get us out of that situation?

regards, tom lane
Hi, tom

Thanks for your reply, and can you elaborate on the "SIGKILL-after-5-seconds logic"?

regards, chenbo

--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html
On Mon, Jan 25, 2021 at 9:01 AM bchen90 <bchen90@163.com> wrote:
> Hi, tom
>
> Thanks for your reply, and can you elaborate on the "SIGKILL-after-5-seconds
> logic"?
>
> regards, chenbo
82233ce7ea42d6ba519aaec63008aff49da6c7af should be the commit Tom was
talking about.
commit 82233ce7ea42d6ba519aaec63008aff49da6c7af
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Fri Jun 28 17:20:53 2013 -0400
Send SIGKILL to children if they don't die quickly in immediate shutdown
On immediate shutdown, or during a restart-after-crash sequence,
postmaster used to send SIGQUIT (and then abandon ship if shutdown); but
this is not a good strategy if backends don't die because of that
signal. (This might happen, for example, if a backend gets tangled
trying to malloc() due to gettext(), as in an example illustrated by
MauMau.) This causes problems when later trying to restart the server,
because some processes are still attached to the shared memory segment.
Instead of just abandoning such backends to their fates, we now have
postmaster hang around for a little while longer, send a SIGKILL after
some reasonable waiting period, and then exit. This makes immediate
shutdown more reliable.
There is disagreement on whether it's best for postmaster to exit after
sending SIGKILL, or to stick around until all children have reported
death. If this controversy is resolved differently than what this patch
implements, it's an easy change to make.
Bug reported by MauMau in message 20DAEA8949EC4E2289C6E8E58560DEC0@maumau
MauMau and Álvaro Herrera
Best Regards
Andy Fan (https://www.aliyun.com/)
On Sun, Jan 24, 2021 at 06:01:04PM -0700, bchen90 wrote:
> Thanks for you reply, and can you elaborate "SIGKILL-after-5-seconds
> logic"?

You are looking for the changes related to this command, as of postmaster.c:
git grep SIGKILL_CHILDREN_AFTER_SECS
--
Michael
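The escalation discussed in this thread (SIGQUIT first, then SIGKILL if the child is still around after a waiting period) can be sketched in isolation as follows. This is not the postmaster.c implementation: `quit_then_kill`, `spawn_child_ignoring_sigquit`, the polling loop, and the 100 ms interval are all hypothetical choices made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: send SIGQUIT to a child, poll for its exit, and
 * escalate to SIGKILL once grace_secs have passed.
 * Returns the signal that terminated the child, 0 if it exited normally,
 * or -1 on error.
 */
int quit_then_kill(pid_t pid, int grace_secs)
{
    const struct timespec tick = {0, 100 * 1000 * 1000};   /* 100 ms */
    int status = 0;
    int polls_left = grace_secs * 10;

    assert(grace_secs >= 0);
    if (pid <= 0 || kill(pid, SIGQUIT) < 0)
        return -1;

    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            break;                      /* child is gone */
        if (r < 0 && errno != EINTR)
            return -1;
        if (polls_left-- <= 0) {
            /* Grace period over: SIGKILL cannot be caught or ignored. */
            if (kill(pid, SIGKILL) < 0)
                return -1;
            do {
                r = waitpid(pid, &status, 0);
            } while (r < 0 && errno == EINTR);
            if (r < 0)
                return -1;
            break;
        }
        nanosleep(&tick, NULL);
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/*
 * Hypothetical test helper: fork a child that ignores SIGQUIT, loosely
 * imitating a process that is inside system(). A pipe tells the parent
 * when the child has finished setting up. Returns the child's pid or -1.
 */
pid_t spawn_child_ignoring_sigquit(void)
{
    int ready[2];
    char c;

    if (pipe(ready) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (pid == 0) {
        close(ready[0]);
        signal(SIGQUIT, SIG_IGN);
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        close(ready[1]);
        for (;;)
            pause();                    /* wait until killed */
    }
    close(ready[1]);
    if (read(ready[0], &c, 1) != 1) {
        /* Child never reported ready: reap it and report failure. */
        close(ready[0]);
        kill(pid, SIGKILL);
        waitpid(pid, NULL, 0);
        return -1;
    }
    close(ready[0]);
    return pid;
}
```

With a child created by `spawn_child_ignoring_sigquit()`, a call such as `quit_then_kill(pid, 1)` should return `SIGKILL` after roughly one second, since the SIGQUIT has no effect on that child. The sketch waits for the child after SIGKILL, which is only one of the two options the commit message describes.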