Discussion: pgsql: Perform an immediate shutdown if the postmaster.pid file is remo
Perform an immediate shutdown if the postmaster.pid file is removed.

The postmaster now checks every minute or so (worst case, at most two
minutes) that postmaster.pid is still there and still contains its own PID.
If not, it performs an immediate shutdown, as though it had received
SIGQUIT.

The original goal behind this change was to ensure that failed buildfarm
runs would get fully cleaned up, even if the test scripts had left a
postmaster running, which is not an infrequent occurrence. When the
buildfarm script removes a test postmaster's $PGDATA directory, its next
check on postmaster.pid will fail and cause it to exit. Previously, manual
intervention was often needed to get rid of such orphaned postmasters,
since they'd block new test postmasters from obtaining the expected socket
address.

However, by checking postmaster.pid and not something else, we can provide
additional robustness: manual removal of postmaster.pid is a frequent DBA
mistake, and now we can at least limit the damage that will ensue if a new
postmaster is started while the old one is still alive.

Back-patch to all supported branches, since we won't get the desired
improvement in buildfarm reliability otherwise.

Branch
------
REL9_3_STABLE

Details
-------
http://git.postgresql.org/pg/commitdiff/31bc563b9be306623c5e9a52816b432945fa6df9

Modified Files
--------------
src/backend/postmaster/postmaster.c | 52 ++++++++++++++++++++------
src/backend/utils/init/miscinit.c   | 70 +++++++++++++++++++++++++++++++++++
src/include/miscadmin.h             |  1 +
3 files changed, 112 insertions(+), 11 deletions(-)
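For readers who want to see the shape of the check described in the commit message, here is a minimal sketch in C. It is not the code from the commit: the function name lock_file_still_ours and its parameters are hypothetical, and the only assumption about the file format is that the first line holds the owning PID as a decimal number. Every failure to open the file is treated as "lock lost" here, which is a simplification; a production check would probably want to tell a missing file apart from transient errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the function added by the commit.
 *
 * Returns true if the file at `path` exists and its first line parses to
 * `expected_pid`. Returns false if the file is missing, unreadable, empty,
 * non-numeric, or names a different PID.
 */
bool
lock_file_still_ours(const char *path, pid_t expected_pid)
{
    char        line[64];
    char       *end;
    long        file_pid;
    FILE       *fp;

    assert(path != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return false;           /* file removed, or could not be opened */

    if (fgets(line, sizeof(line), fp) == NULL)
    {
        fclose(fp);
        return false;           /* empty file */
    }
    fclose(fp);

    file_pid = strtol(line, &end, 10);
    if (end == line)
        return false;           /* first line does not start with a number */

    return file_pid == (long) expected_pid;
}
```

A supervising loop could call lock_file_still_ours("postmaster.pid", getpid()) about once a minute and begin an immediate shutdown as soon as it returns false.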
On 6 October 2015 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Perform an immediate shutdown if the postmaster.pid file is removed.
>
> The postmaster now checks every minute or so (worst case, at most two
> minutes) that postmaster.pid is still there and still contains its own PID.
> If not, it performs an immediate shutdown, as though it had received
> SIGQUIT.
>
> The original goal behind this change was to ensure that failed buildfarm
> runs would get fully cleaned up, even if the test scripts had left a
> postmaster running, which is not an infrequent occurrence. When the
> buildfarm script removes a test postmaster's $PGDATA directory, its next
> check on postmaster.pid will fail and cause it to exit. Previously, manual
> intervention was often needed to get rid of such orphaned postmasters,
> since they'd block new test postmasters from obtaining the expected socket
> address.
>
> However, by checking postmaster.pid and not something else, we can provide
> additional robustness: manual removal of postmaster.pid is a frequent DBA
> mistake, and now we can at least limit the damage that will ensue if a new
> postmaster is started while the old one is still alive.
>
> Back-patch to all supported branches, since we won't get the desired
> improvement in buildfarm reliability otherwise.
>
> Branch
> ------
> REL9_3_STABLE
>
> Details
> -------
> http://git.postgresql.org/pg/commitdiff/31bc563b9be306623c5e9a52816b432945fa6df9
>
> Modified Files
> --------------
> src/backend/postmaster/postmaster.c | 52 ++++++++++++++++++++------
> src/backend/utils/init/miscinit.c   | 70 +++++++++++++++++++++++++++++++++++
> src/include/miscadmin.h             |  1 +
> 3 files changed, 112 insertions(+), 11 deletions(-)

The log contains a misleading output following the removal of the pid file:

2015-10-09 15:39:32 BST [31507]: [4-1] user=,db=,client= LOG: could not open file "postmaster.pid": No such file or directory
2015-10-09 15:39:32 BST [31507]: [5-1] user=,db=,client= LOG: performing immediate shutdown because data directory lock file is invalid
2015-10-09 15:39:32 BST [31507]: [6-1] user=,db=,client= LOG: received immediate shutdown request
2015-10-09 15:39:32 BST [31556]: [1-1] user=,db=,client= WARNING: terminating connection because of crash of another server process
2015-10-09 15:39:32 BST [31556]: [2-1] user=,db=,client= DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2015-10-09 15:39:32 BST [31556]: [3-1] user=,db=,client= HINT: In a moment you should be able to reconnect to the database and repeat your command.

Is this anything we need to worry about?

--
Thom
Thom Brown <thom@linux.com> writes:
> The log contains a misleading output following the removal of the pid file:

> 2015-10-09 15:39:32 BST [31507]: [4-1] user=,db=,client= LOG: could
> not open file "postmaster.pid": No such file or directory
> 2015-10-09 15:39:32 BST [31507]: [5-1] user=,db=,client= LOG:
> performing immediate shutdown because data directory lock file is
> invalid
> 2015-10-09 15:39:32 BST [31507]: [6-1] user=,db=,client= LOG:
> received immediate shutdown request
> 2015-10-09 15:39:32 BST [31556]: [1-1] user=,db=,client= WARNING:
> terminating connection because of crash of another server process
> 2015-10-09 15:39:32 BST [31556]: [2-1] user=,db=,client= DETAIL: The
> postmaster has commanded this server process to roll back the current
> transaction and exit, because another server process exited abnormally
> and possibly corrupted shared memory.
> 2015-10-09 15:39:32 BST [31556]: [3-1] user=,db=,client= HINT: In a
> moment you should be able to reconnect to the database and repeat your
> command.

Looks as-expected to me. We're forcing a panic stop.

regards, tom lane
Tom Lane wrote:
> Thom Brown <thom@linux.com> writes:
> > The log contains a misleading output following the removal of the pid file:
> >
> > 2015-10-09 15:39:32 BST [31507]: [4-1] user=,db=,client= LOG: could
> > not open file "postmaster.pid": No such file or directory
> > 2015-10-09 15:39:32 BST [31507]: [5-1] user=,db=,client= LOG:
> > performing immediate shutdown because data directory lock file is
> > invalid
> > 2015-10-09 15:39:32 BST [31507]: [6-1] user=,db=,client= LOG:
> > received immediate shutdown request
> > 2015-10-09 15:39:32 BST [31556]: [1-1] user=,db=,client= WARNING:
> > terminating connection because of crash of another server process
> > 2015-10-09 15:39:32 BST [31556]: [2-1] user=,db=,client= DETAIL: The
> > postmaster has commanded this server process to roll back the current
> > transaction and exit, because another server process exited abnormally
> > and possibly corrupted shared memory.
> > 2015-10-09 15:39:32 BST [31556]: [3-1] user=,db=,client= HINT: In a
> > moment you should be able to reconnect to the database and repeat your
> > command.
>
> Looks as-expected to me. We're forcing a panic stop.

I think he's complaining that the final HINT is misleading.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Looks as-expected to me. We're forcing a panic stop.

> I think he's complaining that the final HINT is misleading.

Well, all the particular backend knows is that it got SIGQUIT. Maybe we
should rewrite the message text for that entirely, but that didn't seem
in-scope for this patch.

regards, tom lane