Re: BUG #8686: Standby could not restart.
От | Tomonari Katsumata |
---|---|
Тема | Re: BUG #8686: Standby could not restart. |
Дата | |
Msg-id | 52B26197.9080208@po.ntts.co.jp обсуждение исходный текст |
Ответ на | BUG #8686: Standby could not restart. (katsumata.tomonari@po.ntts.co.jp) |
Список | pgsql-bugs |
Hi, This is a patch against REL9_2_STABLE(*) (*)0c07ef1ad2ba73a6a050f3e1a19267f961a43586 regards, --------------- Tomonari Katsumata (2013/12/19 11:57), katsumata.tomonari@po.ntts.co.jp wrote: > The following bug has been logged on the website: > > Bug reference: 8686 > Logged by: Tomonari Katsumata > Email address: katsumata.tomonari@po.ntts.co.jp > PostgreSQL version: 9.2.6 > Operating system: Red Hat Enterprise Linux 6.2 x86_64 > Description: > > Hi, > > > I'm testing whether a standby could restart with asynchronous replication. > > > The testcase is very simple like below. > (A) start asyncronous replication > (B) stop standby with "-m f" > (C) start standby > > > I tried (B) and (C) periodically. > > > In almost cases, it worked fine. > But standby could not start one time because of PANIC. > ---- Log ----- > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: database system was shut down > in recovery at 2013-12-09 10:42:29 JST > [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6273 > cp: cannot stat `../arc/00000002.history': No such file or directory > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: entering standby mode > [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6359 > cp: cannot stat `../arc/000000010000000100000008': No such file or > directory > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: consistent recovery state > reached at 1/8E7F110 > [Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency, > xlog.c:7366 > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: restored log file > "000000010000000100000007" from archive > [Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile, > xlog.c:3273 > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: redo starts at 1/783B230 > [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6827 > [Standby] 2013-12-09 10:42:30 JST WARNING: 01000: page 1365 of relation > base/16384/16388 does not exist > [Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel > 1663/16384/16388; tid 1365/152; new 1365/153 > [Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page, > xlogutils.c:66 > [Standby] 2013-12-09 10:42:30 JST PANIC: XX000: WAL contains references to > invalid pages > [Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel > 1663/16384/16388; tid 1365/152; new 1365/153 > [Standby] 2013-12-09 10:42:30 JST LOCATION: log_invalid_page, > xlogutils.c:91 > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: startup process (PID 12560) > was terminated by signal 6: Aborted > [Standby] 2013-12-09 10:42:30 JST LOCATION: LogChildExit, > postmaster.c:2895 > [Standby] 2013-12-09 10:42:30 JST LOG: 00000: terminating any other active > server processes > [Standby] 2013-12-09 10:42:30 JST LOCATION: HandleChildCrash, > postmaster.c:2682 > [Standby] 2013-12-09 11:10:12 JST LOG: 00000: database system was > interrupted while in recovery at log time 2013-12-09 10:32:14 JST > [Standby] 2013-12-09 11:10:12 JST HINT: If this has occurred more than once > some data might be corrupted and you might need to choose an earlier > recovery target. > [Standby] 2013-12-09 11:10:12 JST LOCATION: StartupXLOG, xlog.c:6289 > --------- > > > I tried to fix this problem. > At first, I doubted the recovery state reached "consistent" before redo > starts. > And then I checked pg_control and related WAL. > The WAL sequence is like below. > > > WAL--(a)--(b)--(c)--(d)--(e)--> > ================================================ > (a) Latest checkpoint's REDO location > 1/783B230 > > > (b) hot_update > 1/7842010 > > > (c) truncate > 1/8E7E5C8 > > > (d) Latest checkpoint location > 1/8E7F0B0 > > > (e) Minimum recovery ending location > 1/8E7F110 > ================================================ > > > >From these things, I found it has happened with this scenario. > ---------- > (1) standby starting > (2) seeking checkpoint location 1/8E7F0B0 because backup_label is not > absecnt > (3) reachedConsistency is set to true at 1/8E7F110 in > CheckRecoveryConsistent > (4) redo start from 1/783B230 > (5) PANIC at 1/7842010 because reachedConsistency has set already and > operating against a block which will be truncated at (c). > ---------- > > > At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0), > so reachedConsistency is set to true at step(3). > > > I think it's not need to increase EndRecPtr while seeking checkpoint > location. > I tried to revise it and this worked fine. > > > I think this is very very narrow gate, but it could happen. > > > I'm not sure this problem happen with synchronous replication or > another version of PostgreSQL(ie. 9.3/9.1/9.0), > but at least we need to fix it. > > > regards, > ---------------- > Tomonari Katsumata > > > >
Вложения
В списке pgsql-bugs по дате отправления:
Предыдущее
От: katsumata.tomonari@po.ntts.co.jpДата:
Сообщение: BUG #8686: Standby could not restart.
Следующее
От: Alvaro HerreraДата:
Сообщение: Re: Update with subselect sometimes returns wrong result