Обсуждение: Archive recovery won't be completed on some situation.
Hello, we found that postgreql won't complete archive recovery foever on some situation. This occurs HEAD, 9.3.3, 9.2.7, 9.1.12. Restarting server with archive recovery fails as following just after it was killed with SIGKILL after pg_start_backup and some wal writes but before pg_stop_backup. | FATAL: WAL ends before end of online backup | HINT: Online backup started with pg_start_backup() must be | ended with pg_stop_backup(), and all WAL up to that point must | be available at recovery. What the mess is once entering this situation, I could find no formal operation to exit from it. On this situation, 'Backup start location' in controldata has some valid location but corresponding 'end of backup' WAL record won't come forever. But I think PG cannot tell the situation dintinctly whether the 'end of backup' reocred is not exists at all or it will come later especially when the server starts as a streaming replication hot-standby. One solution for it would be a new parameter in recovery.conf which tells that the operator wants the server to start as if there were no backup label ever before when the situation comes. It looks ugly and somewhat danger but seems necessary. The first attached file is the script to replay the problem, and the second is the patch trying to do what is described above. After applying this patch on HEAD and uncommneting the 'cancel_backup_label_on_failure = true' in test.sh, the test script runs as following, | LOG: record with zero length at 0/2010F40 | WARNING: backup_label was canceled. | HINT: server might have crashed during backup mode. | LOG: consistent recovery state reached at 0/2010F40 | LOG: redo done at 0/2010DA0 What do you thing about this? regards, -- Kyotaro Horiguchi NTT Open Source Software Center #! /bin/sh killall postgres rm -rf $PGDATA/* initdb cat >> $PGDATA/postgresql.conf <<EOF wal_level = hot_standby EOF pg_ctl start -w sleep 1 psql postgres -c "select pg_start_backup('hoge');" psql postgres -c "create table t (a int);" killall -9 postgres # pg_ctl stop -m f cat > $PGDATA/recovery.conf <<EOF # standby_mode = on restore_command = '/bin/true' recovery_target_timeline = 'latest' # cancel_backup_label_on_failure = true EOF pg_ctl start sleep 5 pg_ctl stop -w pg_ctl start sleep 5 pg_ctl stop -w diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index cdbe305..d1f93bb 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -230,6 +230,7 @@ static TimestampTz recoveryDelayUntilTime;static bool StandbyModeRequested = false;static char *PrimaryConnInfo= NULL;static char *PrimarySlotName = NULL; +static bool cancelBackupLabelOnFailure = false;static char *TriggerFile = NULL;/* are we currently in standby mode? */ @@ -5569,6 +5570,16 @@ readRecoveryCommandFile(void) ereport(DEBUG2, (errmsg("min_recovery_apply_delay= '%s'", item->value))); } + else if (strcmp(item->name, "cancel_backup_label_on_failure") == 0) + { + if (!parse_bool(item->value, &cancelBackupLabelOnFailure)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("parameter \"%s\" requires a Boolean value", + "cancel_backup_label_on_failure"))); + ereport(DEBUG2, + (errmsg_internal("cancel_backup_label_on_failure = '%s'", item->value))); + } else ereport(FATAL, (errmsg("unrecognized recovery parameter \"%s\"", @@ -7111,6 +7122,21 @@ StartupXLOG(void) record = ReadRecord(xlogreader, InvalidXLogRecPtr, LOG, false); } while (record != NULL); + if (cancelBackupLabelOnFailure && + ControlFile->backupStartPoint != InvalidXLogRecPtr) + { + /* + * Try to force complete recocovery when backup_label was + * found but end-of-backup record has not been found. + */ + + ControlFile->backupStartPoint = InvalidXLogRecPtr; + + ereport(WARNING, + (errmsg("backup_label was canceled."), + errhint("server might have crashed during backup mode."))); + CheckRecoveryConsistency(); + } /* * end of main redo apply loop */
On 03/14/2014 12:32 PM, Kyotaro HORIGUCHI wrote: > Restarting server with archive recovery fails as following just > after it was killed with SIGKILL after pg_start_backup and some > wal writes but before pg_stop_backup. > > | FATAL: WAL ends before end of online backup > | HINT: Online backup started with pg_start_backup() must be > | ended with pg_stop_backup(), and all WAL up to that point must > | be available at recovery. > > What the mess is once entering this situation, I could find no > formal operation to exit from it. > > On this situation, 'Backup start location' in controldata has > some valid location but corresponding 'end of backup' WAL record > won't come forever. > > But I think PG cannot tell the situation dintinctly whether the > 'end of backup' reocred is not exists at all or it will come > later especially when the server starts as a streaming > replication hot-standby. If you kill the server while a backup is in progress, the backup is broken. It's correct that the server refuses to start up from the broken backup. So basically, don't do that. - Heikki
<p dir="ltr">Thank you.<p dir="ltr">2014/03/14 19:42 "Heikki Linnakangas" <<a href="mailto:hlinnakangas@vmware.com">hlinnakangas@vmware.com</a>>:<br/> ><br /> > On 03/14/2014 12:32 PM, KyotaroHORIGUCHI wrote:<br /> >><br /> >> Restarting server with archive recovery fails as following just<br/> >> after it was killed with SIGKILL after pg_start_backup and some<br /> >> wal writes but before pg_stop_backup.<br/> >><br /> >> | FATAL: WAL ends before end of online backup<br /> >> | HINT: Onlinebackup started with pg_start_backup() must be<br /> >> | ended with pg_stop_backup(), and all WAL up to thatpoint must<br /> >> | be available at recovery.<br /> >><br /> >> What the mess is once entering thissituation, I could find no<br /> >> formal operation to exit from it.<br /> >><p dir="ltr">> If you killthe server while a backup is in progress, the backup is broken. It's correct that the server refuses to start up fromthe broken backup. So basically, don't do that.<p dir="ltr">Hmm.. What I did is simplly restarting server just aftera crash but the server was accidentially in backup mode. No backup copy is used. Basically, the server is in the samesituation with the simple restart after crash. The difference here is the restarting made the database completly uselesswhile it had been not. I wish to save the database for the case and I suppose it so acceptable.<p dir="ltr">regards,<br/> -- <br /> Kyotaro Horiguchi<br /> NTT Opensource Software Center<br />
<p dir="ltr">Sorry, I wrote a little wrong.<p dir="ltr">2014/03/14 20:24 "Kyotaro HORIGUCHI" <<a href="mailto:horiguchi.kyotaro@lab.ntt.co.jp">horiguchi.kyotaro@lab.ntt.co.jp</a>>:<br/> > I wish to save the databasefor the case and I suppose it so acceptable.<p dir="ltr">and I don't suppose it so unacceptable.<p dir="ltr">regards,<br/> -- <br /> Kyotaro Horiguchi<br /> NTT Opensource Software Center<br />
On 03/14/2014 01:24 PM, Kyotaro HORIGUCHI wrote: > Hmm.. What I did is simplly restarting server just after a crash but the > server was accidentially in backup mode. No backup copy is used. Basically, > the server is in the same situation with the simple restart after crash. You created recovery.conf in the master server after crash. Just don't do that. - Heikki
<p dir="ltr">Hello,<p dir="ltr">2014/03/14 20:51 "Heikki Linnakangas" <<a href="mailto:hlinnakangas@vmware.com">hlinnakangas@vmware.com</a>>:<br/> > You created recovery.conf in the masterserver after crash. Just don't do that.<p dir="ltr">Ah, ok. I understood what you meant.<br /> Sorry that I can't confirmrihgt now, the original issue should occur on the standby. I might've oversimplicated.<p dir="ltr">regards,<br />-- <br /> Kyotaro Horiguchi<br /> NTT Opensource Software Center<br />
<p dir="ltr">Umm.. Sorry for repeated correction.<p dir="ltr">2014/03/14 21:12 "Kyotaro HORIGUCHI" <<a href="mailto:kyota.horiguchi@gmail.com">kyota.horiguchi@gmail.com</a>>:<br/> ><br /> > Ah, ok. I understood whatyou meant.<br /> > Sorry that I can't confirm rihgt now, the original issue should occur on the standby. <p dir="ltr">Theoriginal issue should have occurred on standby<p dir="ltr">> I might've oversimplicated.<br /> ><br />> regards,<br /> > -- <br /> > Kyotaro Horiguchi<br /> > NTT Opensource Software Center<br />
On Fri, Mar 14, 2014 at 7:32 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, we found that postgreql won't complete archive recovery > foever on some situation. This occurs HEAD, 9.3.3, 9.2.7, 9.1.12. > > Restarting server with archive recovery fails as following just > after it was killed with SIGKILL after pg_start_backup and some > wal writes but before pg_stop_backup. > > | FATAL: WAL ends before end of online backup > | HINT: Online backup started with pg_start_backup() must be > | ended with pg_stop_backup(), and all WAL up to that point must > | be available at recovery. > > What the mess is once entering this situation, I could find no > formal operation to exit from it. Though this is formal way, you can exit from that situation by (1) Remove recovery.conf and start the server with crash recovery (2) Execute pg_start_backup() after crash recovery ends (3) Copy backup_label to somewhere (4) Execute pg_stop_backup() and shutdown the server (5) Copy backup_label back to $PGDATA (6) Create recovery.conf and start the server with archive recovery > On this situation, 'Backup start location' in controldata has > some valid location but corresponding 'end of backup' WAL record > won't come forever. > > But I think PG cannot tell the situation dintinctly whether the > 'end of backup' reocred is not exists at all or it will come > later especially when the server starts as a streaming > replication hot-standby. > > One solution for it would be a new parameter in recovery.conf > which tells that the operator wants the server to start as if > there were no backup label ever before when the situation > comes. It looks ugly and somewhat danger but seems necessary. > > The first attached file is the script to replay the problem, and > the second is the patch trying to do what is described above. > > After applying this patch on HEAD and uncommneting the > 'cancel_backup_label_on_failure = true' in test.sh, the test > script runs as following, > > | LOG: record with zero length at 0/2010F40 > | WARNING: backup_label was canceled. > | HINT: server might have crashed during backup mode. > | LOG: consistent recovery state reached at 0/2010F40 > | LOG: redo done at 0/2010DA0 > > What do you thing about this? What about adding new option into pg_resetxlog so that we can reset the pg_control's backup start location? Even after we've accidentally entered into the situation that you described, we can exit from that by resetting the backup start location in pg_control. Also this option seems helpful to salvage the data as a last resort from the corrupted backup. Regards, -- Fujii Masao
Thank you for good suggestion. > > What the mess is once entering this situation, I could find no > > formal operation to exit from it. > > Though this is formal way, you can exit from that situation by > > (1) Remove recovery.conf and start the server with crash recovery > (2) Execute pg_start_backup() after crash recovery ends > (3) Copy backup_label to somewhere > (4) Execute pg_stop_backup() and shutdown the server > (5) Copy backup_label back to $PGDATA > (6) Create recovery.conf and start the server with archive recovery It will do. And pg_resetxlog was the first thing I checked out for reseting backupStartPoint. > What about adding new option into pg_resetxlog so that we can > reset the pg_control's backup start location? Even after we've > accidentally entered into the situation that you described, we can > exit from that by resetting the backup start location in pg_control. > Also this option seems helpful to salvage the data as a last resort > from the corrupted backup. It is in far better proportion than recovery.conf option:), since it is already warned to be dangerous as its nature. Anyway I'll make sure the situation under the trouble fist. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, very sorry to have bothered you by silly question. me> It is in far better proportion than recovery.conf option:), since me> it is already warned to be dangerous as its nature. Anyway I'll me> make sure the situation under the trouble fist. It looks exactly the 'starting up as standby of ex-master which crashed during backup mode' case as I checked out the original issue. I agree that no save is needed for the case since it is simply a db corruption. Usefulness of pg_resetxlog's resetting-backup_label-related-items feature is not clear so far, so I don't wish it realised for this time. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On 03/15/2014 05:59 PM, Fujii Masao wrote: > What about adding new option into pg_resetxlog so that we can > reset the pg_control's backup start location? Even after we've > accidentally entered into the situation that you described, we can > exit from that by resetting the backup start location in pg_control. > Also this option seems helpful to salvage the data as a last resort > from the corrupted backup. Yeah, seems reasonable. After you run pg_resetxlog, there's no hope that the backup end record would arrive any time later. And if it does, it won't really do much good after you've reset the WAL. We probably should just clear out the backup start/stop location always when you run pg_resetxlog. Your database is potentially broken if you reset the WAL before reaching consistency, but if forcibly do that with "pg_resetxlog -f", you've been warned. - Heikki
Hello, thank you for suggestions. The *problematic* operation sequence I saw was performed by pgsql-RA/Pacemaker. It stops a server already with immediate mode and starts the Master as a Standby at first, then promote. Focusing on this situation, there would be reasonable to reset backup positions. 9.4 canceles backup mode even on immediate shutdown so the operation causes no problem, but 9.3 and before are doesn't. Finally, needed amendments per versions are 9.4: Nothing more is needed (but resetting backup mode by resetxlog is acceptable) 9.3: Can be recovered without resetting backup positions in controlfile. (but smarter with it) 9.2: Same to 9.3 9.1: Cannot be recoverd without directly resetting backup position in controlfile. Resetting feature is needed. At Mon, 17 Mar 2014 15:59:09 +0200, Heikki Linnakangas wrote > On 03/15/2014 05:59 PM, Fujii Masao wrote: > > What about adding new option into pg_resetxlog so that we can > > reset the pg_control's backup start location? Even after we've > > accidentally entered into the situation that you described, we can > > exit from that by resetting the backup start location in pg_control. > > Also this option seems helpful to salvage the data as a last resort > > from the corrupted backup. > > Yeah, seems reasonable. After you run pg_resetxlog, there's no hope > that the backup end record would arrive any time later. And if it > does, it won't really do much good after you've reset the WAL. > > We probably should just clear out the backup start/stop location > always when you run pg_resetxlog. Your database is potentially broken > if you reset the WAL before reaching consistency, but if forcibly do > that with "pg_resetxlog -f", you've been warned. Agreed. Attached patches do that and I could "recover" the database state with following steps, (1) Remove recovery.conf and do pg_resetxlog -bf (the option name 'b' would be arguable) (2) Start the server (with crash recovery) (3) Stop the server (in any mode) (4) Create recovery.conf and start the server with archive recovery. Some annoyance in step 2 and 3 but I don't want to support the pacemaker's in-a-sense broken sequence no further:( This is alterable by the following steps suggested in Masao's previous mail for 9.2 and alter, but 9.1 needs forcibly resetting startBackupPoint. At Sun, 16 Mar 2014 00:59:01 +0900, Fujii Masao wrote > Though this is formal way, you can exit from that situation by > > (1) Remove recovery.conf and start the server with crash recovery > (2) Execute pg_start_backup() after crash recovery ends > (3) Copy backup_label to somewhere > (4) Execute pg_stop_backup() and shutdown the server > (5) Copy backup_label back to $PGDATA > (6) Create recovery.conf and start the server with archive recovery This worked for 9.2, 9.3 and HEAD but failed for 9.1 at step 1. | 2014-03-19 15:53:02.512 JST FATAL: WAL ends before end of online backup | 2014-03-19 15:53:02.512 JST HINT: Online backup started with pg_start_backup() must be ended with pg_stop_backup(), andall WAL up to that point must be available at recovery. This seems inevitable. | if (InRecovery && | (XLByteLT(EndOfLog, minRecoveryPoint) || | !XLogRecPtrIsInvalid(ControlFile->backupStartPoint))) | { ... | /* | * Ran off end of WAL before reaching end-of-backup WAL record, or | * minRecoveryPoint. | */ | if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint)) | ereport(FATAL, | (errmsg("WAL ends before end of online backup"), regards, -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c index 28a4f19..7d9cf6d 100644 --- a/src/bin/pg_resetxlog/pg_resetxlog.c +++ b/src/bin/pg_resetxlog/pg_resetxlog.c @@ -85,6 +85,7 @@ main(int argc, char *argv[]) int c; bool force = false; bool noupdate= false; + bool resetbackuppos = false; MultiXactId set_oldestmxid = 0; char *endptr; char *endptr2; @@ -110,7 +111,7 @@ main(int argc, char *argv[]) } - while ((c = getopt(argc, argv, "fl:m:no:O:x:e:")) != -1) + while ((c = getopt(argc, argv, "fl:m:no:O:x:e:b")) != -1) { switch (c) { @@ -122,6 +123,10 @@ main(int argc, char *argv[]) noupdate = true; break; + case 'b': + resetbackuppos = true; + break; + case 'e': set_xid_epoch = strtoul(optarg, &endptr, 0); if (endptr == optarg ||*endptr != '\0') @@ -350,6 +355,13 @@ main(int argc, char *argv[]) ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli; } + if (resetbackuppos) + { + ControlFile.backupStartPoint = InvalidXLogRecPtr; + ControlFile.backupEndPoint = InvalidXLogRecPtr; + ControlFile.backupEndRequired = false; + } + if (minXlogSegNo > newXlogSegNo) newXlogSegNo = minXlogSegNo; @@ -1098,6 +1110,7 @@ usage(void) printf(_(" -O OFFSET set next multitransaction offset\n")); printf(_(" -V,--version output version information, then exit\n")); printf(_(" -x XID set next transaction ID\n")); + printf(_(" -b reset backup positions\n")); printf(_(" -?, --help show this help, then exit\n")); printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));} diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c index cd003f4..8b578c8 100644 --- a/src/bin/pg_resetxlog/pg_resetxlog.c +++ b/src/bin/pg_resetxlog/pg_resetxlog.c @@ -82,6 +82,7 @@ main(int argc, char *argv[]) int c; bool force = false; bool noupdate= false; + bool resetbackuppos = false; uint32 set_xid_epoch = (uint32) -1; TransactionId set_xid = 0; Oid set_oid = 0; @@ -114,7 +115,7 @@ main(int argc, char *argv[]) } - while ((c = getopt(argc, argv, "fl:m:no:O:x:e:")) != -1) + while ((c = getopt(argc, argv, "fl:m:no:O:x:e:b")) != -1) { switch (c) { @@ -126,6 +127,10 @@ main(int argc, char *argv[]) noupdate = true; break; + case 'b': + resetbackuppos = true; + break; + case 'e': set_xid_epoch = strtoul(optarg, &endptr, 0); if (endptr == optarg ||*endptr != '\0') @@ -347,6 +352,13 @@ main(int argc, char *argv[]) ControlFile.checkPointCopy.PrevTimeLineID = minXlogTli; } + if (resetbackuppos) + { + ControlFile.backupStartPoint = InvalidXLogRecPtr; + ControlFile.backupEndPoint = InvalidXLogRecPtr; + ControlFile.backupEndRequired = false; + } + if (minXlogSegNo > newXlogSegNo) newXlogSegNo = minXlogSegNo; @@ -1042,6 +1054,7 @@ usage(void) printf(_(" -O OFFSET set next multitransaction offset\n")); printf(_(" -V,--version output version information, then exit\n")); printf(_(" -x XID set next transaction ID\n")); + printf(_(" -b reset backup positions\n")); printf(_(" -?, --help show this help, then exit\n")); printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));} diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c index 80e8268..149639b 100644 --- a/src/bin/pg_resetxlog/pg_resetxlog.c +++ b/src/bin/pg_resetxlog/pg_resetxlog.c @@ -82,6 +82,7 @@ main(int argc, char *argv[]) int c; bool force = false; bool noupdate= false; + bool resetbackuppos = false; uint32 set_xid_epoch = (uint32) -1; TransactionId set_xid = 0; Oid set_oid = 0; @@ -115,7 +116,7 @@ main(int argc, char *argv[]) } - while ((c = getopt(argc, argv, "fl:m:no:O:x:e:")) != -1) + while ((c = getopt(argc, argv, "fl:m:no:O:x:e:b")) != -1) { switch (c) { @@ -127,6 +128,10 @@ main(int argc, char *argv[]) noupdate = true; break; + case 'b': + resetbackuppos = true; + break; + case 'e': set_xid_epoch = strtoul(optarg, &endptr, 0); if (endptr == optarg ||*endptr != '\0') @@ -333,6 +338,15 @@ main(int argc, char *argv[]) if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID) ControlFile.checkPointCopy.ThisTimeLineID= minXlogTli; + if (resetbackuppos) + { + ControlFile.backupStartPoint.xlogid = 0; + ControlFile.backupStartPoint.xrecoff = 0; + ControlFile.backupEndPoint.xlogid = 0; + ControlFile.backupEndPoint.xrecoff = 0; + ControlFile.backupEndRequired = false; + } + if (minXlogId > newXlogId || (minXlogId == newXlogId && minXlogSeg > newXlogSeg)) @@ -1035,6 +1049,7 @@ usage(void) printf(_(" -O OFFSET set next multitransaction offset\n")); printf(_(" -V,--version output version information, then exit\n")); printf(_(" -x XID set next transaction ID\n")); + printf(_(" -b reset backup start position\n")); printf(_(" -?, --help show this help, thenexit\n")); printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));} diff --git a/src/bin/pg_resetxlog/pg_resetxlog.c b/src/bin/pg_resetxlog/pg_resetxlog.c index 54cc5b0..3ecfef8 100644 --- a/src/bin/pg_resetxlog/pg_resetxlog.c +++ b/src/bin/pg_resetxlog/pg_resetxlog.c @@ -82,6 +82,7 @@ main(int argc, char *argv[]) int c; bool force = false; bool noupdate= false; + bool resetbackuppos = false; uint32 set_xid_epoch = (uint32) -1; TransactionId set_xid = 0; Oid set_oid = 0; @@ -115,7 +116,7 @@ main(int argc, char *argv[]) } - while ((c = getopt(argc, argv, "fl:m:no:O:x:e:")) != -1) + while ((c = getopt(argc, argv, "fl:m:no:O:x:e:b")) != -1) { switch (c) { @@ -127,6 +128,10 @@ main(int argc, char *argv[]) noupdate = true; break; + case 'b': + resetbackuppos = true; + break; + case 'e': set_xid_epoch = strtoul(optarg, &endptr, 0); if (endptr == optarg ||*endptr != '\0') @@ -333,6 +338,12 @@ main(int argc, char *argv[]) if (minXlogTli > ControlFile.checkPointCopy.ThisTimeLineID) ControlFile.checkPointCopy.ThisTimeLineID= minXlogTli; + if (resetbackuppos) + { + ControlFile.backupStartPoint.xlogid = 0; + ControlFile.backupStartPoint.xrecoff = 0; + } + if (minXlogId > newXlogId || (minXlogId == newXlogId && minXlogSeg > newXlogSeg)) @@ -1028,6 +1039,7 @@ usage(void) printf(_(" -o OID set next OID\n")); printf(_(" -O OFFSET set nextmultitransaction offset\n")); printf(_(" -x XID set next transaction ID\n")); + printf(_(" -b reset backup start position\n")); printf(_(" --help show this help, then exit\n")); printf(_(" --version output version information, then exit\n")); printf(_("\nReport bugs to <pgsql-bugs@postgresql.org>.\n"));
On Wed, Mar 19, 2014 at 5:28 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, thank you for suggestions. > > The *problematic* operation sequence I saw was performed by > pgsql-RA/Pacemaker. It stops a server already with immediate mode > and starts the Master as a Standby at first, then > promote. Focusing on this situation, there would be reasonable to > reset backup positions. 9.4 canceles backup mode even on > immediate shutdown so the operation causes no problem, but 9.3 > and before are doesn't. Finally, needed amendments per versions > are > > 9.4: Nothing more is needed (but resetting backup mode by > resetxlog is acceptable) > > 9.3: Can be recovered without resetting backup positions in > controlfile. (but smarter with it) > > 9.2: Same to 9.3 > > 9.1: Cannot be recoverd without directly resetting backup > position in controlfile. Resetting feature is needed. > > > At Mon, 17 Mar 2014 15:59:09 +0200, Heikki Linnakangas wrote >> On 03/15/2014 05:59 PM, Fujii Masao wrote: >> > What about adding new option into pg_resetxlog so that we can >> > reset the pg_control's backup start location? Even after we've >> > accidentally entered into the situation that you described, we can >> > exit from that by resetting the backup start location in pg_control. >> > Also this option seems helpful to salvage the data as a last resort >> > from the corrupted backup. >> >> Yeah, seems reasonable. After you run pg_resetxlog, there's no hope >> that the backup end record would arrive any time later. And if it >> does, it won't really do much good after you've reset the WAL. >> >> We probably should just clear out the backup start/stop location >> always when you run pg_resetxlog. Your database is potentially broken >> if you reset the WAL before reaching consistency, but if forcibly do >> that with "pg_resetxlog -f", you've been warned. > > Agreed. Attached patches do that and I could "recover" the > database state with following steps, Adding new option looks like new feature rather than bug fix. I'm afraid that the backpatch of such a change to 9.3 or before is not acceptable. Regards, -- Fujii Masao
On 03/19/2014 10:28 AM, Kyotaro HORIGUCHI wrote: > The*problematic* operation sequence I saw was performed by > pgsql-RA/Pacemaker. It stops a server already with immediate mode > and starts the Master as a Standby at first, then > promote. Focusing on this situation, there would be reasonable to > reset backup positions. Well, that's scary. I would suggest doing a fast shutdown instead. But if you really want to do an immediate shutdown, you should delete the backup_label file after the shutdown When restarting after immediate shutdown and a backup_label file is present, the system doesn't know if the system crashed during a backup, and it needs to perform crash recovery, or if you're trying restore from a backup. It makes a compromise, and starts recovery from the checkpoint given in the backup_label, as if it was restoring from a backup, but if it doesn't see a backup-end WAL record, it just starts up anyway (which would be wrong if you are indeed restoring from a backup). But if you create a recovery.conf file, that indicates that you are definitely restoring from a backup, so it's more strict and insists that the backup-end record must be replayed. > 9.4 canceles backup mode even on > immediate shutdown so the operation causes no problem, but 9.3 > and before are doesn't. Hmm, I don't think we've changed that behavior in 9.4. - Heikki
On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 03/19/2014 10:28 AM, Kyotaro HORIGUCHI wrote: >> >> The*problematic* operation sequence I saw was performed by >> >> pgsql-RA/Pacemaker. It stops a server already with immediate mode >> and starts the Master as a Standby at first, then >> promote. Focusing on this situation, there would be reasonable to >> reset backup positions. > > > Well, that's scary. I would suggest doing a fast shutdown instead. But if > you really want to do an immediate shutdown, you should delete the > backup_label file after the shutdown > > When restarting after immediate shutdown and a backup_label file is present, > the system doesn't know if the system crashed during a backup, and it needs > to perform crash recovery, or if you're trying restore from a backup. It > makes a compromise, and starts recovery from the checkpoint given in the > backup_label, as if it was restoring from a backup, but if it doesn't see a > backup-end WAL record, it just starts up anyway (which would be wrong if you > are indeed restoring from a backup). But if you create a recovery.conf file, > that indicates that you are definitely restoring from a backup, so it's more > strict and insists that the backup-end record must be replayed. > > >> 9.4 canceles backup mode even on >> immediate shutdown so the operation causes no problem, but 9.3 >> and before are doesn't. > > > Hmm, I don't think we've changed that behavior in 9.4. ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate shutdown that way. Regards, -- Fujii Masao
Fujii Masao escribió: > On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: > >> 9.4 canceles backup mode even on immediate shutdown so the > >> operation causes no problem, but 9.3 and before are doesn't. > > > > Hmm, I don't think we've changed that behavior in 9.4. > > ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate > shutdown that way. Uh, interesting. I didn't see that secondary effect. I hope it's not for ill? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Hi, I confirmed that 82233ce7ea4 surely did it. At Wed, 19 Mar 2014 09:35:16 -0300, Alvaro Herrera wrote > Fujii Masao escribió: > > On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas > > <hlinnakangas@vmware.com> wrote: > > > >> 9.4 canceles backup mode even on immediate shutdown so the > > >> operation causes no problem, but 9.3 and before are doesn't. > > > > > > Hmm, I don't think we've changed that behavior in 9.4. > > > > ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate > > shutdown that way. > > Uh, interesting. I didn't see that secondary effect. I hope it's not > for ill? The crucial factor for the behavior change is that pmdie has become not to exit immediately for SIGQUIT. 'case SIGQUIT:' in pmdie() ended with "ExitPostmaster(0)" before the patch but now it ends with 'PostmasterStateMachine(); break;' so continues to run with pmState = PM_WAIT_BACKENDS, similar to SIGINT (fast shutdown). After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END by SIGCHLDs from non-significant processes, then CancelBackup(). Focusing on the point described above, the small patch below rewinds the behavior back to 9.3 and before but I don't know the appropriateness in regard to the intention of the patch. diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index e9072b7..f87c25c 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2498,16 +2498,7 @@ pmdie(SIGNAL_ARGS) (errmsg("received immediate shutdown request"))); TerminateChildren(SIGQUIT); - pmState = PM_WAIT_BACKENDS; - - /* set stopwatch for them to die */ - AbortStartTime = time(NULL); - - /* - * Now wait for backends to exit. If there are none, - * PostmasterStateMachine will take the next step. - */ - PostmasterStateMachine(); + ExitPostmaster(0); break; } regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, At Wed, 19 Mar 2014 19:34:10 +0900, Fujii Masao wrote > > Agreed. Attached patches do that and I could "recover" the > > database state with following steps, > > Adding new option looks like new feature rather than bug fix. > I'm afraid that the backpatch of such a change to 9.3 or before > is not acceptable. Me too. But on the other hand it simplly is a relief for the consequence of the behavior of server (altough it was ill operation:), and especially it is needed for at least 9.1 which seems cannot be saved without it. Plus it has utterly no impact on servers' behavior of any corresponding versions. So I hope it is accepted. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, > On 03/19/2014 10:28 AM, Kyotaro HORIGUCHI wrote: > > The*problematic* operation sequence I saw was performed by > > pgsql-RA/Pacemaker. It stops a server already with immediate mode > > and starts the Master as a Standby at first, then > > promote. Focusing on this situation, there would be reasonable to > > reset backup positions. > > Well, that's scary. I would suggest doing a fast shutdown instead. But > if you really want to do an immediate shutdown, you should delete the > backup_label file after the shutdown "We"'d also said them the former thing on several occations. They answered that they hate shutdown checkpoint to take long time before shutdown is completed. The latter one has not come on my mind and seems promising. Thank you for the suggestion. > When restarting after immediate shutdown and a backup_label file is > present, the system doesn't know if the system crashed during a > backup, and it needs to perform crash recovery, or if you're trying > restore from a backup. It makes a compromise, and starts recovery from > the checkpoint given in the backup_label, as if it was restoring from > a backup, but if it doesn't see a backup-end WAL record, it just > starts up anyway (which would be wrong if you are indeed restoring > from a backup). But if you create a recovery.conf file, that indicates > that you are definitely restoring from a backup, so it's more strict > and insists that the backup-end record must be replayed. > > > 9.4 canceles backup mode even on > > immediate shutdown so the operation causes no problem, but 9.3 > > and before are doesn't. > > Hmm, I don't think we've changed that behavior in 9.4. Now pmdie behaves in the similar manner between fast and immediate shutdown after 82233ce7ea42d6ba519. It is an side effect of a change on immediate shutdown which make it to wait the children to die by SIGQUIT. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Kyotaro HORIGUCHI escribió: > Hi, I confirmed that 82233ce7ea4 surely did it. > > At Wed, 19 Mar 2014 09:35:16 -0300, Alvaro Herrera wrote > > Fujii Masao escribió: > > > On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas > > > <hlinnakangas@vmware.com> wrote: > > > > > >> 9.4 canceles backup mode even on immediate shutdown so the > > > >> operation causes no problem, but 9.3 and before are doesn't. > > > > > > > > Hmm, I don't think we've changed that behavior in 9.4. > > > > > > ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate > > > shutdown that way. > > > > Uh, interesting. I didn't see that secondary effect. I hope it's not > > for ill? > > The crucial factor for the behavior change is that pmdie has > become not to exit immediately for SIGQUIT. 'case SIGQUIT:' in > pmdie() ended with "ExitPostmaster(0)" before the patch but now > it ends with 'PostmasterStateMachine(); break;' so continues to > run with pmState = PM_WAIT_BACKENDS, similar to SIGINT (fast > shutdown). > > After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END > by SIGCHLDs from non-significant processes, then CancelBackup(). Judging from what was being said on the thread, it seems that running CancelBackup() after an immediate shutdown is better than not doing it, correct? > Focusing on the point described above, the small patch below > rewinds the behavior back to 9.3 and before but I don't know the > appropriateness in regard to the intention of the patch. I see. Obviously your patch would, in effect, revert 82233ce7ea completely, which is not something we want. I think if we want to go back to the previous behavior of not stopping the backup, some other method should be used. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Mar 20, 2014 at 3:43 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, > > At Wed, 19 Mar 2014 19:34:10 +0900, Fujii Masao wrote >> > Agreed. Attached patches do that and I could "recover" the >> > database state with following steps, >> >> Adding new option looks like new feature rather than bug fix. >> I'm afraid that the backpatch of such a change to 9.3 or before >> is not acceptable. > > Me too. But on the other hand it simplly is a relief for the > consequence of the behavior of server (altough it was ill > operation:), and especially it is needed for at least 9.1 which > seems cannot be saved without it. Plus it has utterly no impact > on servers' behavior of any corresponding versions. So I hope it > is accepted. Even in 9.1, we can think that problematic situation as database corruption and restart the server from the backup which was successfully taken before. No? Regards, -- Fujii Masao
Hello, > > After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END > > by SIGCHLDs from non-significant processes, then CancelBackup(). > > Judging from what was being said on the thread, it seems that running > CancelBackup() after an immediate shutdown is better than not doing it, > correct? Agreed. I like that behavior:) It removes backup_label at immediate shutdown and (altough I didn't see by myself but as far as I saw PostmasterStateMachine) it would skip shutdown checkponit. > > Focusing on the point described above, the small patch below > > rewinds the behavior back to 9.3 and before but I don't know the > > appropriateness in regard to the intention of the patch. > > I see. Obviously your patch would, in effect, revert 82233ce7ea > completely, which is not something we want. I think if we want to go > back to the previous behavior of not stopping the backup, some other > method should be used. As I mentioned above, I don't want to rewind 9.4's behavior back to that of previous ones. What I hope to be realized for now is '-b'(provisional optname) of pg_resetxlog for at least versions which would fall into this problem. What do you think about this maybe 'New Feature' but has meaning practically only for older versions? Of course I agree with that 'you should erase the backup_label just after master has crashed' is the most clean and sane way to *avoid* the situation but the penalty seems a bit too large for the mistake. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
Hello, > > At Wed, 19 Mar 2014 19:34:10 +0900, Fujii Masao wrote > >> > Agreed. Attached patches do that and I could "recover" the > >> > database state with following steps, > >> > >> Adding new option looks like new feature rather than bug fix. > >> I'm afraid that the backpatch of such a change to 9.3 or before > >> is not acceptable. > > > > Me too. But on the other hand it simplly is a relief for the > > consequence of the behavior of server (altough it was ill > > operation:), and especially it is needed for at least 9.1 which > > seems cannot be saved without it. Plus it has utterly no impact > > on servers' behavior of any corresponding versions. So I hope it > > is accepted. > > Even in 9.1, we can think that problematic situation as database corruption > and restart the server from the backup which was successfully taken before. > No? Mmm. I don't think it is relevant to this problem. The problem specific here is 'The database was running until just now, but shutdown the master (by pacemaker), then restart, won't run anymore'. Deleting backup_label after immediate shutdown is the radical measure but existing system would be saved by the option. But, honestly saying, I (also?) don't have sympathy for the situation so much and if all or most of you think the option can cause another problem, I won't insist about that any more. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Fri, Mar 28, 2014 at 1:06 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Mmm. I don't think it is relevant to this problem. The problem > specific here is 'The database was running until just now, but > shutdown the master (by pacemaker), then restart, won't run > anymore'. Deleting backup_label after immediate shutdown is the > radical measure but existing system would be saved by the option. I don't find that very radical at all. The backup_label file is *supposed* to be removed on the master if it crashes during the backup; and it should never be removed from the backup itself. At least that's how I understand it. Unfortunately, people too often remove the file from the backup and, judging by your report, leave it there on the master. (We could try to fix this confusion - and thereby confuse all the people who understand it now - by changing things so that you have to leave the file there on the master, and remove it from the backup. Bwahaha!) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Monday, March 31, 2014, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Mar 28, 2014 at 1:06 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Mmm. I don't think it is relevant to this problem. The problem
> specific here is 'The database was running until just now, but
> shutdown the master (by pacemaker), then restart, won't run
> anymore'. Deleting backup_label after immediate shutdown is the
> radical measure but existing system would be saved by the option.
I don't find that very radical at all. The backup_label file is
*supposed* to be removed on the master if it crashes during the
backup; and it should never be removed from the backup itself. At
least that's how I understand it. Unfortunately, people too often
remove the file from the backup and, judging by your report, leave it
there on the master.
At first blush it seems pretty radical to me. Just because the server was e-stopped doesn't mean the backup rsync/cp -r/scp etc. isn't still running, and it is not clear to me that yanking the backup label file out from under it wouldn't cause problems. I mean, you already have problems if you are trying to restore from that backup, but the missing file might make those problems less obvious.
Of course first blush is often wrong, but...
Cheers,
Jeff
On Tue, Apr 1, 2014 at 12:39 AM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Monday, March 31, 2014, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Fri, Mar 28, 2014 at 1:06 AM, Kyotaro HORIGUCHI >> <horiguchi.kyotaro@lab.ntt.co.jp> wrote: >> > Mmm. I don't think it is relevant to this problem. The problem >> > specific here is 'The database was running until just now, but >> > shutdown the master (by pacemaker), then restart, won't run >> > anymore'. Deleting backup_label after immediate shutdown is the >> > radical measure but existing system would be saved by the option. >> >> I don't find that very radical at all. The backup_label file is >> *supposed* to be removed on the master if it crashes during the >> backup; and it should never be removed from the backup itself. At >> least that's how I understand it. Unfortunately, people too often >> remove the file from the backup and, judging by your report, leave it >> there on the master. > > At first blush it seems pretty radical to me. Just because the server was > e-stopped doesn't mean the backup rsync/cp -r/scp etc. isn't still running, > and it is not clear to me that yanking the backup label file out from under > it wouldn't cause problems. I mean, you already have problems if you are > trying to restore from that backup, but the missing file might make those > problems less obvious. > > Of course first blush is often wrong, but... Well, I guess I was thinking mostly of the case where the whole server's been restarted, in which case none of that stuff is still running any more. If there is only a database server crash, then I agree it's murkier. Still, you probably ought to kill off those things if the database server crashes, and then restart the whole base backup. Otherwise I think you're going to be in trouble whether the backup label file sticks around or not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, thank you for the discussion. At Tue, 1 Apr 2014 11:41:20 -0400, Robert Haas wrote >> I don't find that very radical at all. The backup_label file is >> *supposed* to be removed on the master if it crashes during the >> backup; and it should never be removed from the backup itself. At >> least that's how I understand it. Unfortunately, people too often The code indeed seems to assume that, and I couldn't think of any measure to avoid that dead-end once recovery sequence reads backup label accidentially left behind. I thought up to remove backup label during immediate shutdown on prvious versions, like 9.4 does. CancelBackup does only stat-unlink-rename sequence so I think this doesn't obstruct immediate shutdown sequence. And this doesn't change any seeming behavior or interfaces just except for this case. What do you think about this? Isn't this also applicable for older versions? postmaster.c@9.3.3:2339 pmdie(SIGNAL_ARGS) { ... switch (postgres_signal_arg) { ... case SIGQUIT: ... SignalUnconnectedWorkers(SIGQUIT); + + /* + * Terminate exclusive backup mode. This is done in + * PostmasterStateMachine() for other shutdown modes. + */ + if (ReachedNormalRunning) + CancelBackup(); ExitPostmaster(0); break; Aside from this, I'll post the new option for pg_resetxlog for the next CF. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Tue, Apr 15, 2014 at 2:52 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > Hello, thank you for the discussion. > > At Tue, 1 Apr 2014 11:41:20 -0400, Robert Haas wrote >>> I don't find that very radical at all. The backup_label file is >>> *supposed* to be removed on the master if it crashes during the >>> backup; and it should never be removed from the backup itself. At >>> least that's how I understand it. Unfortunately, people too often > > The code indeed seems to assume that, and I couldn't think of any > measure to avoid that dead-end once recovery sequence reads > backup label accidentially left behind. I thought up to remove > backup label during immediate shutdown on prvious versions, like > 9.4 does. > > CancelBackup does only stat-unlink-rename sequence so I think > this doesn't obstruct immediate shutdown sequence. And this > doesn't change any seeming behavior or interfaces just except for > this case. What do you think about this? Isn't this also > applicable for older versions? I don't think we should consider changing long-established behavior in the back-branches. The old behavior may not be ideal, but that doesn't mean it's a bug. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello, > I don't think we should consider changing long-established behavior in > the back-branches. The old behavior may not be ideal, but that > doesn't mean it's a bug. Ok, I understand that. I give it up. regards, -- Kyotaro Horiguchi NTT Open Source Software Center
On Thu, Mar 20, 2014 at 11:38 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Kyotaro HORIGUCHI escribió: >> Hi, I confirmed that 82233ce7ea4 surely did it. >> >> At Wed, 19 Mar 2014 09:35:16 -0300, Alvaro Herrera wrote >> > Fujii Masao escribió: >> > > On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas >> > > <hlinnakangas@vmware.com> wrote: >> > >> > > >> 9.4 canceles backup mode even on immediate shutdown so the >> > > >> operation causes no problem, but 9.3 and before are doesn't. >> > > > >> > > > Hmm, I don't think we've changed that behavior in 9.4. >> > > >> > > ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate >> > > shutdown that way. >> > >> > Uh, interesting. I didn't see that secondary effect. I hope it's not >> > for ill? >> >> The crucial factor for the behavior change is that pmdie has >> become not to exit immediately for SIGQUIT. 'case SIGQUIT:' in >> pmdie() ended with "ExitPostmaster(0)" before the patch but now >> it ends with 'PostmasterStateMachine(); break;' so continues to >> run with pmState = PM_WAIT_BACKENDS, similar to SIGINT (fast >> shutdown). >> >> After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END >> by SIGCHLDs from non-significant processes, then CancelBackup(). > > Judging from what was being said on the thread, it seems that running > CancelBackup() after an immediate shutdown is better than not doing it, > correct? This is listed as a 9.4 Open Item, but no one seems to want to revert this change. So I'll drop this from the Open Item list barring objections. Regards, -- Fujii Masao
On 05/09/2014 05:19 PM, Fujii Masao wrote: > On Thu, Mar 20, 2014 at 11:38 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Kyotaro HORIGUCHI escribió: >>> Hi, I confirmed that 82233ce7ea4 surely did it. >>> >>> At Wed, 19 Mar 2014 09:35:16 -0300, Alvaro Herrera wrote >>>> Fujii Masao escribió: >>>>> On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas >>>>> <hlinnakangas@vmware.com> wrote: >>>> >>>>>>> 9.4 canceles backup mode even on immediate shutdown so the >>>>>>> operation causes no problem, but 9.3 and before are doesn't. >>>>>> >>>>>> Hmm, I don't think we've changed that behavior in 9.4. >>>>> >>>>> ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate >>>>> shutdown that way. >>>> >>>> Uh, interesting. I didn't see that secondary effect. I hope it's not >>>> for ill? >>> >>> The crucial factor for the behavior change is that pmdie has >>> become not to exit immediately for SIGQUIT. 'case SIGQUIT:' in >>> pmdie() ended with "ExitPostmaster(0)" before the patch but now >>> it ends with 'PostmasterStateMachine(); break;' so continues to >>> run with pmState = PM_WAIT_BACKENDS, similar to SIGINT (fast >>> shutdown). >>> >>> After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END >>> by SIGCHLDs from non-significant processes, then CancelBackup(). >> >> Judging from what was being said on the thread, it seems that running >> CancelBackup() after an immediate shutdown is better than not doing it, >> correct? > > This is listed as a 9.4 Open Item, but no one seems to want to revert > this change. > So I'll drop this from the Open Item list barring objections. I object. We used to call CancelBackup() on immediate shutdown, which was good. That was inadvertently changed by commit 82233ce. That's a regression we should fix. I agree with Alvaro upthread that we don't want to revert 82233ce, but we should come up with a fix. - Heikki
On Mon, May 12, 2014 at 4:52 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 05/09/2014 05:19 PM, Fujii Masao wrote: >> >> On Thu, Mar 20, 2014 at 11:38 PM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> >>> Kyotaro HORIGUCHI escribió: >>>> >>>> Hi, I confirmed that 82233ce7ea4 surely did it. >>>> >>>> At Wed, 19 Mar 2014 09:35:16 -0300, Alvaro Herrera wrote >>>>> >>>>> Fujii Masao escribió: >>>>>> >>>>>> On Wed, Mar 19, 2014 at 7:57 PM, Heikki Linnakangas >>>>>> <hlinnakangas@vmware.com> wrote: >>>>> >>>>> >>>>>>>> 9.4 canceles backup mode even on immediate shutdown so the >>>>>>>> operation causes no problem, but 9.3 and before are doesn't. >>>>>>> >>>>>>> >>>>>>> Hmm, I don't think we've changed that behavior in 9.4. >>>>>> >>>>>> >>>>>> ISTM 82233ce7ea42d6ba519aaec63008aff49da6c7af changed immdiate >>>>>> shutdown that way. >>>>> >>>>> >>>>> Uh, interesting. I didn't see that secondary effect. I hope it's not >>>>> for ill? >>>> >>>> >>>> The crucial factor for the behavior change is that pmdie has >>>> become not to exit immediately for SIGQUIT. 'case SIGQUIT:' in >>>> pmdie() ended with "ExitPostmaster(0)" before the patch but now >>>> it ends with 'PostmasterStateMachine(); break;' so continues to >>>> run with pmState = PM_WAIT_BACKENDS, similar to SIGINT (fast >>>> shutdown). >>>> >>>> After all, pmState changes to PM_NO_CHILDREN via PM_WAIT_DEAD_END >>>> by SIGCHLDs from non-significant processes, then CancelBackup(). >>> >>> >>> Judging from what was being said on the thread, it seems that running >>> CancelBackup() after an immediate shutdown is better than not doing it, >>> correct? >> >> >> This is listed as a 9.4 Open Item, but no one seems to want to revert >> this change. >> So I'll drop this from the Open Item list barring objections. > > > I object. We used to call CancelBackup() on immediate shutdown, which was > good. That was inadvertently changed by commit 82233ce. That's a regression > we should fix. I agree with Alvaro upthread that we don't want to revert > 82233ce, but we should come up with a fix. Hmm.. probably I have the same opinion with you. If I understand this correctly, an immediate shutdown doesn't call CancelBackup() in 9.4 or before. But the commit 82233ce unintentionally changed an immediate shutdown so that it calls CancelBackup(). For now, no one wants to revert the current behavior. So I think there is nothing that we have to do now. No? Regards, -- Fujii Masao
On 05/12/2014 02:29 PM, Fujii Masao wrote: > Hmm.. probably I have the same opinion with you. If I understand this correctly, > an immediate shutdown doesn't call CancelBackup() in 9.4 or before. But the > commit 82233ce unintentionally changed an immediate shutdown so that it calls > CancelBackup(). Oh, sorry. I thought it was the other way 'round: that we used to remove backup_label on an immediate shutdown on 9.3 and before, but that 9.4 doesn't do that anymore. Now that I re-read this thread and tested it myself, I see that I got it backwards. I agree the new behavior is better, and we should just remove the Open Items entry. - Heikki
On Mon, May 12, 2014 at 8:40 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 05/12/2014 02:29 PM, Fujii Masao wrote: >> >> Hmm.. probably I have the same opinion with you. If I understand this >> correctly, >> an immediate shutdown doesn't call CancelBackup() in 9.4 or before. But >> the >> commit 82233ce unintentionally changed an immediate shutdown so that it >> calls >> CancelBackup(). > > > Oh, sorry. I thought it was the other way 'round: that we used to remove > backup_label on an immediate shutdown on 9.3 and before, but that 9.4 > doesn't do that anymore. Now that I re-read this thread and tested it > myself, I see that I got it backwards. > > I agree the new behavior is better, and we should just remove the Open Items > entry. Yes. I just removed that entry. Regards, -- Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes: > On Mon, May 12, 2014 at 8:40 PM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> I agree the new behavior is better, and we should just remove the Open Items >> entry. > Yes. I just removed that entry. Our practice in past years has been to move items to a separate "Resolved Issues" section rather than just delete them. I fixed the page to look that way. regards, tom lane
On Tue, May 13, 2014 at 1:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Fujii Masao <masao.fujii@gmail.com> writes: >> On Mon, May 12, 2014 at 8:40 PM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> I agree the new behavior is better, and we should just remove the Open Items >>> entry. > >> Yes. I just removed that entry. > > Our practice in past years has been to move items to a separate "Resolved > Issues" section rather than just delete them. I fixed the page to look > that way. Yes. Thanks! Regards, -- Fujii Masao