Обсуждение: PANIC: heap_update_redo: no block
Hi. Upon rebooting one of our main production database servers, we were greeted with this: (@)<2006-03-18 23:30:32.687 MST>[3791]LOG: database system was interrupted while in recovery at 2006-03-18 23:30:26 MST (@)<2006-03-18 23:30:32.687 MST>[3791]HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. (@)<2006-03-18 23:30:32.688 MST>[3791]LOG: checkpoint record is at D/1919D5F0 (@)<2006-03-18 23:30:32.688 MST>[3791]LOG: redo record is at D/191722C8; undo record is at 0/0; shutdown FALSE (@)<2006-03-18 23:30:32.688 MST>[3791]LOG: next transaction ID: 81148900; next OID: 16566476 (@)<2006-03-18 23:30:32.688 MST>[3791]LOG: next MultiXactId: 1; next MultiXactOffset: 0 (@)<2006-03-18 23:30:32.689 MST>[3791]LOG: database system was not properly shut down; automatic recovery in progress (@)<2006-03-18 23:30:33.032 MST>[3791]LOG: redo starts at D/191722C8 (@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block (@)<2006-03-18 23:30:33.036 MST>[3790]LOG: startup process (PID 3791) was terminated by signal 6 (@)<2006-03-18 23:30:33.036 MST>[3790]LOG: aborting startup due to startup process failure As far as i know the postgresql was shutdown properly before the reboot (pg_ctl stop -m fast). Though im not positive, I was just brought in when they couldn't figure out why postgresql would not start. Any ideas as to how this happened or how to fix it? Right now im copying over the database, and then going to try pg_resetxlog. Just to make sure, the only possible lost data are things that are/would be in the xlog right? So i dont need to go looking at all the tables, just ones I know were modified then. Are there any other solutions that dont involve possibly loosing data? (Yes I know backups, unfortunately the last back up was about 2 hours ago and is not as up to date as i would like) Just curious, ive also been investigating pitr for instead of doing backups every 2 hours. If this problem were to surface when i was using pitr as a backup solution, would all my data then be hosed (or at least what pg_resetlog can not restore)?
On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote: > Hi. > > Upon rebooting one of our main production database servers, we were > greeted with this: To help you at all, we *really* need to know what version and platform you're running. In particular, are you running the most recent release of your branch. There have been bug fixes related to WAL recovery in some versions... Have a nice day, -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Вложения
On 3/19/06, Martijn van Oosterhout <kleptog@svana.org> wrote: > On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote: > > Hi. > > > > Upon rebooting one of our main production database servers, we were > > greeted with this: > > To help you at all, we *really* need to know what version and platform > you're running. In particular, are you running the most recent release > of your branch. There have been bug fixes related to WAL recovery in > some versions... Ahh of course! sorry! select version(); version -------------------------------------------------------------------------------------------------------------------- PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc-3.4 (GCC) 3.4.4 20050314 (prerelease) (Debian 3.4.3-13) > Have a nice day, > -- > Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > > tool for doing 5% of the work and then sitting around waiting for someone > > else to do the other 95% so you can sue them. > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.1 (GNU/Linux) > > iD8DBQFEHSreIB7bNG8LQkwRAo5fAJ9TGnk9HFyx7DKsL5C5bSvtmMQIPgCdEVhr > 7nTMsnzvyk1TI5az+GDCx9I= > =zZR+ > -----END PGP SIGNATURE----- > > >
On 3/19/06, Alex bahdushka <bahdushka@gmail.com> wrote: > On 3/19/06, Martijn van Oosterhout <kleptog@svana.org> wrote: > > On Sat, Mar 18, 2006 at 11:55:35PM -0700, Alex bahdushka wrote: > > > Hi. > > > > > > Upon rebooting one of our main production database servers, we were > > > greeted with this: > > > > To help you at all, we *really* need to know what version and platform > > you're running. In particular, are you running the most recent release > > of your branch. There have been bug fixes related to WAL recovery in > > some versions... > > Ahh of course! sorry! > > select version(); > version > -------------------------------------------------------------------------------------------------------------------- > PostgreSQL 8.1.3 on i686-pc-linux-gnu, compiled by GCC gcc-3.4 (GCC) > 3.4.4 20050314 (prerelease) (Debian 3.4.3-13) > After doing some more digging, it looks like that server was missing the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown (shutdown -h now)... my guess is it got a sigterm (you know where it says Sending all processes a TERM signal or whatever), then it (init) waited 5 seconds or whatever the timeout is and sent a sigkill. If postgresql took longer to shutdown than that timeout and so was then given a sigkill and then server turned off.... Could that do it? (not to mention i don't exactly remember where file system get unmounted, before or after it sends out those signals, i think its before though so it might have mounted it read only (couldn't of unmounted it because it was in use by postgresql)). Im mainly asking because i would love for this to be user error. It scares the hell out of me (and my boss obviously). Though i must say for the 2+ years we have been using postgresql its proven to be very stable, robust and fast. Thanks! <snipped>
""Alex bahdushka"" <bahdushka@gmail.com> wrote > > After doing some more digging, it looks like that server was missing > the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown > (shutdown -h now)... my guess is it got a sigterm (you know where it > says Sending all processes a TERM signal or whatever), then it (init) > waited 5 seconds or whatever the timeout is and sent a sigkill. > > If postgresql took longer to shutdown than that timeout and so was > then given a sigkill and then server turned off.... Could that do it? > I don't believe in this explaination actually. According the startup message, the error "heap_update_redo: no block" could most possibly happen when PostgreSQL tried to read an existing block but found that the file length is not long enough to have it. How could a SIGKILL truncate a data file like that? Regards, Qingqing
On 3/20/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote: > > ""Alex bahdushka"" <bahdushka@gmail.com> wrote > > > > After doing some more digging, it looks like that server was missing > > the appropriate Kpostgresql symlink in /etc/rc0.d/. So upon shutdown > > (shutdown -h now)... my guess is it got a sigterm (you know where it > > says Sending all processes a TERM signal or whatever), then it (init) > > waited 5 seconds or whatever the timeout is and sent a sigkill. > > > > If postgresql took longer to shutdown than that timeout and so was > > then given a sigkill and then server turned off.... Could that do it? > > > > I don't believe in this explaination actually. According the startup > message, the error "heap_update_redo: no block" could most possibly happen > when PostgreSQL tried to read an existing block but found that the file > length is not long enough to have it. How could a SIGKILL truncate a data > file like that? > Hrm... well i obviously have restored the database by now (using pg_resetxlog; pg_dump; initdb; pg_restore). However i did make a backup of the broken directory before I created the new database. If anyone has any thing they would like me to try to possibly help track down this possible bug. I would be more than glad to do it. Since it sounds like its something wrong with the xlog here is the contents of the dir... Im not sure how useful this is but here it is anyways. pg_xlog# du -ak 16404 ./000000010000000D00000022 16404 ./000000010000000D0000001E 16404 ./000000010000000D00000019 16404 ./000000010000000D0000001A 16404 ./000000010000000D0000001D 16404 ./000000010000000D0000001C 16404 ./000000010000000D00000020 16404 ./000000010000000D00000021 16404 ./000000010000000D0000001B 4 ./archive_status 16404 ./000000010000000D00000023 16404 ./000000010000000D0000001F 16404 ./000000010000000D00000018 196856 . They are all the same size, so it does not look like a truncated file... Or am i just misinterpreting the error message and its one of the files elsewhere? The file system is ext3 and it fscked fine, and nothing is the the lost+found dir/. As far as i know the computer was allowed to sync the buffers to disk before the reboot (the plug was not pulled or anything). Any Ideas? Otherwise it sounds like ill just have to chalk this one up to the gods, and hope its fixed in 8.1.4....
""Alex bahdushka"" <bahdushka@gmail.com> wrote > > Hrm... well i obviously have restored the database by now (using > pg_resetxlog; pg_dump; initdb; pg_restore). However i did make a > backup of the broken directory before I created the new database. If > anyone has any thing they would like me to try to possibly help track > down this possible bug. I would be more than glad to do it. > pg_resetxlog is the last resort to "avoid" the real problem. Once you reset it, then the xlog after that some offset will not get replayed (so the problem disappered) and possibly some data get lost :-(. Can you patch the heap/heapam.c/heap_xlog_update() like this: - elog(PANIC, "heap_update_redo: no block"); + elog(PANIC, "heap_update_redo: no block: target block: %u, relation length: %u", + ItemPointerGetBlockNumber(&(xlrec->target.tid)), + RelationGetNumberOfBlocks(reln)); And restart your database to see what's the output? Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > Can you patch the heap/heapam.c/heap_xlog_update() like this: > - elog(PANIC, "heap_update_redo: no block"); > + elog(PANIC, "heap_update_redo: no block: target block: %u, relation > length: %u", > + ItemPointerGetBlockNumber(&(xlrec->target.tid)), > + RelationGetNumberOfBlocks(reln)); > And restart your database to see what's the output? While at it, you should extend the error message to include the relation ID, so you have some idea which table is affected ... this is certainly not a very informative message ... regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > > While at it, you should extend the error message to include the relation > ID, so you have some idea which table is affected ... this is certainly > not a very informative message ... > Exactly. Please use the following version: - elog(PANIC, "heap_update_redo: no block"); + elog(PANIC, "heap_update_redo: no block: target blcknum: %u, relation(%u/%u/%u) length: %u", + ItemPointerGetBlockNumber(&(xlrec->target.tid)), + reln->rd_node.spcNode, + reln->rd_node.dbNode, + reln->rd_node.relNode, + RelationGetNumberOfBlocks(reln)); BTW: I just realized that there is another (better) way to do so is to enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't have much error message in xlog redo. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > BTW: I just realized that there is another (better) way to do so is to > enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't > have much error message in xlog redo. That was probably Vadim's original reasoning for not being very verbose in the redo routines' various PANIC messages. But for failures in the field it'd be awfully nice to be able to see this info from a standard build, so I'm thinking we should improve the elog messages. If you feel like creating a patch I'll be glad to apply it ... regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> wrote > "Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > > BTW: I just realized that there is another (better) way to do so is to > > enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't > > have much error message in xlog redo. > > That was probably Vadim's original reasoning for not being very verbose > in the redo routines' various PANIC messages. But for failures in the > field it'd be awfully nice to be able to see this info from a standard > build, so I'm thinking we should improve the elog messages. If you feel > like creating a patch I'll be glad to apply it ... > So there are three ways to do it: (1) enable WALD_DEBUG by default So it is user's reponsibility to enable XLOG_DEBUG to print verbose information while at error. This adds no cost during normal running, but the problem is that too much information (only the last is useful) may pollute the log file. (2) print verbose information after errror We can change StartupXLOG like this: PG_TRY(); { RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record); } PG_CATCH(); { RmgrTable[record->xl_rmid].rm_desc(buf, record->xl_info, XLogRecGetData(record)); abort(); } PG_END_CATCH(); Also, channge err_finish() so that if PANIC if InRecovery, do a PG_RE_THROW(). The problem is this looks like a little bit hack. (3) Replace elog in every xlog_ABC_redo() to some xlog_elog(), so the description information can be automatically appended. I vote for method 2. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > (2) print verbose information after errror If you want to do it that way, the correct thing is for the WAL replay logic to add an error context hook to print the current WAL record, not invent weird hacks in the error processing logic. Compare the way bufmgr.c reports errors during smgrwrite() operations (buffer_write_error_callback()). One problem with the approach is that I'm not sure how robust the WAL record description routines really are. They seem at the least vulnerable to buffer-overflow issues. A crash while trying to describe the current record wouldn't be a net improvement :-(. I'd kind of want to modify them to have a more robust API, eg write into a StringInfo instead of an unspecified-size buffer. regards, tom lane
On 3/21/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote: > > "Tom Lane" <tgl@sss.pgh.pa.us> wrote > > > > While at it, you should extend the error message to include the relation > > ID, so you have some idea which table is affected ... this is certainly > > not a very informative message ... > > > > Exactly. Please use the following version: > > - elog(PANIC, "heap_update_redo: no block"); > + elog(PANIC, "heap_update_redo: no block: target blcknum: %u, > relation(%u/%u/%u) length: %u", > + ItemPointerGetBlockNumber(&(xlrec->target.tid)), > + reln->rd_node.spcNode, > + reln->rd_node.dbNode, > + reln->rd_node.relNode, > + RelationGetNumberOfBlocks(reln)); > > BTW: I just realized that there is another (better) way to do so is to > enable WAL_DEBUG in xlog.h and SET XLOG_DEBUG=true. And that's why we don't > have much error message in xlog redo. > Hrm ok ill see about doing either the patch or setting wal_debug to true (or both). However im currently on vacation till Saturday so ill do this first thing then and report the results back. Thank you very much! > Regards, > Qingqing > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings >
<snip> > Hrm ok ill see about doing either the patch or setting wal_debug to > true (or both). However im currently on vacation till Saturday so ill > do this first thing then and report the results back. Thank you very > much! Ok here is the output with wal_debug set to 1 in the config and the final supplied patch by Qingqing. (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: database system was interrupted while in recovery at 2006-03-25 20:51:58 MST (@)<2006-03-25 20:54:17.509 MST>[26571]HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery. (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: checkpoint record is at D/1919D5F0 (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: redo record is at D/191722C8; undo record is at 0/0; shutdown FALSE (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: next transaction ID: 81148900; next OID: 16566476 (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: next MultiXactId: 1; next MultiXactOffset: 0 (@)<2006-03-25 20:54:17.509 MST>[26571]LOG: database system was not properly shut down; automatic recovery in progress (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: redo starts at D/191722C8 (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191722C8; LSN D/19172678: prev D/191722A0; xid 81148908: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172678; LSN D/191726B8: prev D/191722C8; xid 81148908: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191726B8; LSN D/19172788: prev D/19172678; xid 81148908: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172788; LSN D/191727C8: prev D/191726B8; xid 81148908: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/191727C8; LSN D/19172A30: prev D/19172788; xid 81148908: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172A30; LSN D/19172A70: prev D/191727C8; xid 81148908: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172A70; LSN D/19172F68: prev D/19172A30; xid 81148908: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172F68; LSN D/19172FA8: prev D/19172A70; xid 81148908: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.524 MST>[26571]LOG: REDO @ D/19172FA8; LSN D/19172FD0: prev D/19172F68; xid 81148908: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19172FD0; LSN D/19173088: prev D/19172FA8; xid 81148942: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173088; LSN D/191730C8: prev D/19172FD0; xid 81148942: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191730C8; LSN D/191731E8: prev D/19173088; xid 81148942: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191731E8; LSN D/19173228: prev D/191730C8; xid 81148942: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173228; LSN D/19173250: prev D/191731E8; xid 81148942: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173250; LSN D/19173414: prev D/19173228; xid 81148944: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173414; LSN D/19173454: prev D/19173250; xid 81148944: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173454; LSN D/19173760: prev D/19173414; xid 81148944: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173760; LSN D/191737A0: prev D/19173454; xid 81148944: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191737A0; LSN D/191737C8: prev D/19173760; xid 81148944: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191737C8; LSN D/1917380C: prev D/191737A0; xid 81148946: Heap - update: rel 1663/16386/16865; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/1917380C; LSN D/19173834: prev D/191737C8; xid 81148946: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173834; LSN D/191739B8: prev D/1917380C; xid 81148948: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191739B8; LSN D/191739F8: prev D/19173834; xid 81148948: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/191739F8; LSN D/19173AE4: prev D/191739B8; xid 81148948: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173AE4; LSN D/19173B24: prev D/191739F8; xid 81148948: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173B24; LSN D/19173B4C: prev D/19173AE4; xid 81148948: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173B4C; LSN D/19173C38: prev D/19173B24; xid 81148960: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173C38; LSN D/19173C78: prev D/19173B4C; xid 81148960: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.525 MST>[26571]LOG: REDO @ D/19173C78; LSN D/19173EFC: prev D/19173C38; xid 81148960: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173EFC; LSN D/19173F3C: prev D/19173C78; xid 81148960: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173F3C; LSN D/19173F64: prev D/19173EFC; xid 81148960: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19173F64; LSN D/19174064: prev D/19173F3C; xid 81148962: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174064; LSN D/191740A4: prev D/19173F64; xid 81148962: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191740A4; LSN D/19174294: prev D/19174064; xid 81148962: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174294; LSN D/191742D4: prev D/191740A4; xid 81148962: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191742D4; LSN D/191742FC: prev D/19174294; xid 81148962: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191742FC; LSN D/191743E8: prev D/191742D4; xid 81148964: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191743E8; LSN D/19174428: prev D/191742FC; xid 81148964: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174428; LSN D/191746FC: prev D/191743E8; xid 81148964: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191746FC; LSN D/1917473C: prev D/19174428; xid 81148964: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/1917473C; LSN D/19174764: prev D/191746FC; xid 81148964: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174764; LSN D/19174824: prev D/1917473C; xid 81148966: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174824; LSN D/19174864: prev D/19174764; xid 81148966: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174864; LSN D/191749A4: prev D/19174824; xid 81148966: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191749A4; LSN D/191749E4: prev D/19174864; xid 81148966: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/191749E4; LSN D/19174A0C: prev D/191749A4; xid 81148966: Transaction - commit: 2006-03-18 22:51:5 (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174A0C; LSN D/19174D18: prev D/191749E4; xid 81148968: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174D18; LSN D/19174D58: prev D/19174A0C; xid 81148968: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.526 MST>[26571]LOG: REDO @ D/19174D58; LSN D/19174E60: prev D/19174D18; xid 81148968: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174E60; LSN D/19174EA0: prev D/19174D58; xid 81148968: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174EA0; LSN D/19174EC8: prev D/19174E60; xid 81148968: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19174EC8; LSN D/1917508C: prev D/19174EA0; xid 81148970: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/1917508C; LSN D/191750CC: prev D/19174EC8; xid 81148970: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191750CC; LSN D/19175194: prev D/1917508C; xid 81148970: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175194; LSN D/191751D4: prev D/191750CC; xid 81148970: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191751D4; LSN D/191752C8: prev D/19175194; xid 81148970: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191752C8; LSN D/19175308: prev D/191751D4; xid 81148970: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175308; LSN D/191753D0: prev D/191752C8; xid 81148970: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191753D0; LSN D/19175410: prev D/19175308; xid 81148970: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175410; LSN D/19175438: prev D/191753D0; xid 81148970: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175438; LSN D/19175524: prev D/19175410; xid 81148972: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175524; LSN D/19175564: prev D/19175438; xid 81148972: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175564; LSN D/19175844: prev D/19175524; xid 81148972: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175844; LSN D/19175884: prev D/19175564; xid 81148972: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175884; LSN D/1917596C: prev D/19175844; xid 81148972: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/1917596C; LSN D/191759AC: prev D/19175884; xid 81148972: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191759AC; LSN D/191759D4: prev D/1917596C; xid 81148972: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/191759D4; LSN D/19175B98: prev D/191759AC; xid 81148974: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175B98; LSN D/19175BD8: prev D/191759D4; xid 81148974: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.527 MST>[26571]LOG: REDO @ D/19175BD8; LSN D/19175EE4: prev D/19175B98; xid 81148974: Heap - update: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175EE4; LSN D/19175F24: prev D/19175BD8; xid 81148974: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175F24; LSN D/19175F4C: prev D/19175EE4; xid 81148974: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19175F4C; LSN D/19176104: prev D/19175F24; xid 81148976: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176104; LSN D/19176144: prev D/19175F4C; xid 81148976: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176144; LSN D/19176230: prev D/19176104; xid 81148976: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176230; LSN D/19176270: prev D/19176144; xid 81148976: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176270; LSN D/19176298: prev D/19176230; xid 81148976: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176298; LSN D/19176384: prev D/19176270; xid 81148978: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176384; LSN D/191763C4: prev D/19176298; xid 81148978: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191763C4; LSN D/191765A8: prev D/19176384; xid 81148978: Heap - insert: rel 1663/16386/2619; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191765A8; LSN D/191765E8: prev D/191763C4; xid 81148978: Btree - insert: rel 1663/16386/2696; tid (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/191765E8; LSN D/19176610: prev D/191765A8; xid 81148978: Transaction - commit: 2006-03-18 22:52:0 (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel 1663/16386/16559898; b (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel 1663/16386/16559898; ti (@)<2006-03-25 20:54:17.528 MST>[26571]PANIC: heap_update_redo: no block: target blcknum: 1, relation(1663/16386/16559898) length: 1 (@)<2006-03-25 20:54:17.529 MST>[26570]LOG: startup process (PID 26571) was terminated by signal 6 (@)<2006-03-25 20:54:17.530 MST>[26570]LOG: aborting startup due to startup process failure Ok so whats next?
""Alex bahdushka"" <bahdushka@gmail.com> wrote > > (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176610; LSN > D/19176644: prev D/191765E8; xid 81148979: Heap - clean: rel > 1663/16386/16559898; b > (@)<2006-03-25 20:54:17.528 MST>[26571]LOG: REDO @ D/19176644; LSN > D/191766A4: prev D/19176610; xid 81148979: Heap - move: rel > 1663/16386/16559898; ti > (@)<2006-03-25 20:54:17.528 MST>[26571]PANIC: heap_update_redo: no > block: target blcknum: 1, relation(1663/16386/16559898) length: 1 > (@)<2006-03-25 20:54:17.529 MST>[26570]LOG: startup process (PID > 26571) was terminated by signal 6 > It is like the problem due to the confliction of vacuum & update. The update is on the page that vacuum was just removed. Before we try to understand/attack the bug exactly, first I'd like to see the complete xlog output. Your xlog output is imcomplete -- for example, the first line after "b" should be "lk %u", the second line after "ti" should be "d %u%u". Can you check out why the output looks like that? If you want to examine the source code, look at heapam.c/heap_desc(). Regards, Qingqing
> It is like the problem due to the confliction of vacuum & update. The update > is on the page that vacuum was just removed. > > Before we try to understand/attack the bug exactly, first I'd like to see > the complete xlog output. Your xlog output is imcomplete -- for example, the > first line after "b" should be "lk %u", the second line after "ti" should be > "d %u%u". Can you check out why the output looks like that? If you want to > examine the source code, look at heapam.c/heap_desc(). Gah, yeah it looks like a stupid copy and paste error, im attaching it this time... hope thats ok
Вложения
""Alex bahdushka"" <bahdushka@gmail.com> wrote > > LOG: REDO @ D/19176610; LSN D/19176644: prev D/191765E8; xid 81148979: > Heap - clean: rel 1663/16386/16559898; blk 0 > LOG: REDO @ D/19176644; LSN D/191766A4: prev D/19176610; xid 81148979: > Heap - move: rel 1663/16386/16559898; tid 1/1; new 0/10 > PANIC: heap_update_redo: no block: target blcknum: 1, > relation(1663/16386/16559898) length: 1 > What happened that time is like this: heap 16559898 has two pages {0, 1}. A vacuum full first examines page 0, do some cleaning; then it comes to the second page, move the item 1/1 to page 0 -- as a byproduct, the heap is truncated to only 1 page since 1/1 is the only item on the second page. At this time, system crashed, and heap 16559898 has only 1 page left. At the xlog startup time, for some unknown reason( or I am totally wrong), PostgreSQL didn't extend the heap to 2 blocks, thus the heap_update_redo(more exactly, should be heap_move_redo) failed. But what's the unknown reason, I really don't have a clue :-(. Actually I tried to simulate your situation, but everytime I got a neat recovery -- the only carvet is that depends on the XLOG_SMGR_TRUNCATE written down or not, there may one extra useless page at the end of heap, which is not a problem at all. Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes: > Actually I tried to simulate your situation, but everytime I got a neat > recovery -- You probably filled the test table and then vacuumed within a single checkpoint cycle, so that the replay sequence included loading data into page 1 in the first place. The risk case is fill table, checkpoint, vacuum, crash; because then the replay starts from the checkpoint and won't re-create page 1. regards, tom lane
""Alex bahdushka"" <bahdushka@gmail.com> wrote > > (@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block > According to the discussion in pgsql-hackers, to finish this case, did you turn off the full_page_writes parameter? I hope the answer is "yes" ... Regards, Qingqing
On 3/31/06, Qingqing Zhou <zhouqq@cs.toronto.edu> wrote: > > ""Alex bahdushka"" <bahdushka@gmail.com> wrote > > > > (@)<2006-03-18 23:30:33.035 MST>[3791]PANIC: heap_update_redo: no block > > > > According to the discussion in pgsql-hackers, to finish this case, did you > turn off the full_page_writes parameter? I hope the answer is "yes" ... > If by off you mean full_page_writes = on then yes. Thanks for all your help!