Thread: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup
Hi all,

After setting up a warm standby (pg_start_backup/rsync/pg_stop_backup) and promoting to master, we encountered an error in the middle of an ANALYZE of the new standby db. (The standby server is a fresh server.)

Source db: PostgreSQL 8.4.2
Standby db: PostgreSQL 8.4.6

...
INFO:  analyzing "public.offer2offer"
ERROR:  could not open relation base/2757655/6930168: No such file or directory

That file does not exist on the source db, nor on the standby db. The offer2offer table exists in the source db (42MB), but is 0 bytes on the standby.

-- on the standby
select * from pg_class where relfilenode = 6930168;
-[ RECORD 1 ]--+------------
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6930168
reltablespace  | 0
relpages       | 5210
reltuples      | 324102
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1227738213

select * from offer2offer;
ERROR:  could not open relation base/2757655/6930168: No such file or directory

-- on the source db
select * from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+------------
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6946955
reltablespace  | 0
relpages       | 5216
reltuples      | 324642
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1228781185

-- on the source server
ls -lh `locate 6946955`
-rw------- 1 postgres postgres 41M Dec 28 15:17 /var/lib/pgsql/data/base/2757655/6946955
-rw------- 1 postgres postgres 32K Dec 28 15:17 /var/lib/pgsql/data/base/2757655/6946955_fsm

We noticed after the initial rsync that we had around 3-4 GB less in the data dir on the standby than on the source. I assumed that was simply because the pg_xlog dir on the standby did not have the WAL files that existed on the source (they were stored on a different partition).

We are running badblocks right now, then we'll do some more disk testing and hopefully memtest86. Does this look like a hardware problem, and/or some catalog corruption? Any suggestions on what steps we should take next?

Thanks!
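For readers of the archive: the failing path follows PostgreSQL's on-disk layout, where the main fork of a heap relation lives at $PGDATA/base/&lt;database oid&gt;/&lt;relfilenode&gt;. A quick sanity check comparing a pg_class relfilenode against the filesystem can be sketched like this (hypothetical helper, not a PostgreSQL tool; the paths are illustrative):

```python
import os

def relation_path(pgdata, db_oid, relfilenode):
    """Build the expected main-fork file path for a heap relation."""
    return os.path.join(pgdata, "base", str(db_oid), str(relfilenode))

def check_relation(pgdata, db_oid, relfilenode):
    """Return (path, size in bytes), or (path, None) if the file is missing,
    which is exactly the situation the ERROR above reports."""
    path = relation_path(pgdata, db_oid, relfilenode)
    if not os.path.exists(path):
        return (path, None)
    return (path, os.path.getsize(path))
```

Feeding in the values from the error message (db oid 2757655, relfilenode 6930168) reproduces the path PostgreSQL complained about, so the check confirms whether the catalog and the filesystem disagree.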
bricklen <bricklen@gmail.com> writes:
> After setting up a warm standby
> (pg_start_backup/rsync/pg_stop_backup), and promoting to master, we
> encountered an error in the middle of an analyze of the new standby
> db. (the standby server is a fresh server)
> [ relfilenode doesn't match on source and standby ]

What can you tell us about what was happening on the source DB while
the backup was being taken? In particular I'm wondering if anything
that would've given offer2offer a new relfilenode was in progress.
Also, does the pg_class entry for offer2offer have the same xmin and
ctid in both DBs?

			regards, tom lane
On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What can you tell us about what was happening on the source DB while
> the backup was being taken? In particular I'm wondering if anything
> that would've given offer2offer a new relfilenode was in progress.
> Also, does the pg_class entry for offer2offer have the same xmin and
> ctid in both DBs?

A couple other notes:

- There are two tables that are affected, not just one. I ran
  individual ANALYZE commands on every table in the db and found that.
- The rsync command that we are using uses the "-e ssh -p" switch so
  we can specify a port number:

  rsync -av -e "ssh -p 9001" --progress --partial -z /var/lib/pgsql/data postgres@standby-tunnel:/var/lib/pgsql/

The pg_start_backup/pg_stop_backup range was about 10 hours, as the
transfer took that long (480GB transfer).

The source db has between 1000 and 3000 transactions/s, so is
reasonably volatile. The two tables in question are not accessed very
heavily though.

Looking at the ctid and xmin between both databases, no, they don't
seem to match exactly. Pardon my ignorance, but would those have
changed due to vacuums, analyze, or any other forms of access?
Source offer2offer:

select ctid,xmin,* from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+------------
ctid           | (142,2)
xmin           | 1228781192
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6946955
reltablespace  | 0
relpages       | 5216
reltuples      | 324642
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1228781185

Standby offer2offer:

select ctid,xmin,* from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+------------
ctid           | (142,1)
xmin           | 1227738244
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6930168
reltablespace  | 0
relpages       | 5210
reltuples      | 324102
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1227738213
bricklen <bricklen@gmail.com> writes:
> On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What can you tell us about what was happening on the source DB while
>> the backup was being taken?

> The source db has between 1000 and 3000 transactions/s, so is
> reasonably volatile. The two tables in question are not accessed very
> heavily though.

> Looking at the ctid and xmin between both databases, no, they don't
> seem to match exactly. Pardon my ignorance, but would those have
> changed due to vacuums, analyze, or any other forms of access?

The difference in ctid, and the values of xmin and relfrozenxid,
seems to confirm my suspicion that this wasn't just random cosmic rays.
You did something on the source DB that rewrote the table with a new
relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
or ANALYZE wouldn't do it). And for some reason the standby hasn't
picked up that change in the pg_class row. I suspect the explanation
is that your technique for setting up the standby is flawed. You can't
just rsync and have a valid snapshot of the DB --- you need to be sure
that enough WAL gets replayed to fix any inconsistencies arising from
the time-extended nature of the rsync operation. But you didn't say
exactly how you did that.

			regards, tom lane
On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The difference in ctid, and the values of xmin and relfrozenxid,
> seems to confirm my suspicion that this wasn't just random cosmic rays.
> You did something on the source DB that rewrote the table with a new
> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
> or ANALYZE wouldn't do it). And for some reason the standby hasn't
> picked up that change in the pg_class row. I suspect the explanation
> is that your technique for setting up the standby is flawed. You can't
> just rsync and have a valid snapshot of the DB --- you need to be sure
> that enough WAL gets replayed to fix any inconsistencies arising from
> the time-extended nature of the rsync operation. But you didn't say
> exactly how you did that.

Definitely no CLUSTER commands were issued, and there should have been
no ALTER commands issued (yesterday was a holiday, no one was here).
Would a TRUNCATE have the same effect though? I grep'd through our
application, and it appears that at least 3 tables get truncated, one
of them several times per hour. The often-truncated table wasn't one
of the bad ones, but the others are the ones I've already identified
as non-existent.
The standby setup process was:

On the source server:
- start shipping WAL files to the standby @ /var/lib/pgsql/WAL_Archive dir
- "select pg_start_backup('standby')"
- rsync -av -e "ssh -p 9001" --progress --partial -z /var/lib/pgsql/data postgres@standby-tunnel:/var/lib/pgsql/
- "select pg_stop_backup()"

On the standby:
- recovery.conf file:
  restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live /WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
  Note: /WAL_Archive is a symlink to /var/lib/pgsql/WAL_Archive; that didn't get touched though, as it is on the same directory level as the data dir.
- pg_ctl -D /var/lib/pgsql/data start
- touch /var/lib/pgsql/WAL_Archive/go_live

Looking over the log of the restore process, there are a couple of messages that might be helpful in pinpointing the problem area.

2010-12-29 08:07:34 PST [16589]: [2-1] (user=) (rhost=) LOG:  starting archive recovery
2010-12-29 08:07:34 PST [16589]: [3-1] (user=) (rhost=) LOG:  restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live /WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
2010-12-29 08:07:34 PST [16589]: [4-1] (user=) (rhost=) LOG:  recovery_end_command = 'rm -f /WAL_Archive/go_live'
2010-12-29 08:07:46 PST [16589]: [5-1] (user=) (rhost=) LOG:  restored log file "0000000100001FB400000080" from archive
2010-12-29 08:07:46 PST [16589]: [6-1] (user=) (rhost=) LOG:  automatic recovery in progress
2010-12-29 08:07:46 PST [16589]: [7-1] (user=) (rhost=) LOG:  restored log file "0000000100001FB40000007B" from archive
2010-12-29 08:07:46 PST [16589]: [8-1] (user=) (rhost=) LOG:  redo starts at 1FB4/7BD68388
2010-12-29 08:07:46 PST [16589]: [9-1] (user=) (rhost=) LOG:  consistent recovery state reached
2010-12-29 08:07:46 PST [16589]: [10-1] (user=) (rhost=) LOG:  file "pg_clog/0494" doesn't exist, reading as zeroes
2010-12-29 08:07:46 PST [16589]: [11-1] (user=) (rhost=) CONTEXT:  xlog redo commit: 2010-12-28 22:35:18.152971-08
2010-12-29 08:08:21 PST [16589]: [12-1] (user=) (rhost=) LOG:  restored log file "0000000100001FB40000007C" from archive
2010-12-29 08:08:22 PST [16589]: [13-1] (user=) (rhost=) LOG:  restored log file "0000000100001FB40000007D" from archive
2010-12-29 08:08:22 PST [16589]: [14-1] (user=) (rhost=) LOG:  restored log file "0000000100001FB40000007E" from archive
... thousands of WAL logs later
2010-12-29 08:43:51 PST [16589]: [2529-1] (user=) (rhost=) LOG:  restored log file "0000000100001FBE0000005B" from archive
2010-12-29 08:43:59 PST [16589]: [2530-1] (user=) (rhost=) LOG:  restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:43:59 PST [16589]: [2531-1] (user=) (rhost=) LOG:  could not open file "pg_xlog/0000000100001FBE0000005D" (log file 8126, segment 93): No such file or directory
2010-12-29 08:43:59 PST [16589]: [2532-1] (user=) (rhost=) LOG:  redo done at 1FBE/5CFFEA58
2010-12-29 08:43:59 PST [16589]: [2533-1] (user=) (rhost=) LOG:  last completed transaction was at log time 2010-12-29 08:44:14.784874-08
2010-12-29 08:43:59 PST [16589]: [2534-1] (user=) (rhost=) LOG:  restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:44:12 PST [16589]: [2535-1] (user=) (rhost=) LOG:  selected new timeline ID: 2
2010-12-29 08:44:24 PST [16589]: [2536-1] (user=) (rhost=) LOG:  archive recovery complete
2010-12-29 08:44:24 PST [25155]: [1-1] (user=) (rhost=) LOG:  autovacuum launcher started
2010-12-29 08:44:24 PST [16587]: [2-1] (user=) (rhost=) LOG:  database system is ready to accept connections

Do those log entries look suspicious?
On Wed, Dec 29, 2010 at 1:53 PM, bricklen <bricklen@gmail.com> wrote:
> On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suspect the explanation
>> is that your technique for setting up the standby is flawed. You can't
>> just rsync and have a valid snapshot of the DB --- you need to be sure
>> that enough WAL gets replayed to fix any inconsistencies arising from
>> the time-extended nature of the rsync operation. But you didn't say
>> exactly how you did that.

> Definitely no CLUSTER commands were issued, and there should have been
> no ALTER commands issued (yesterday was a holiday, no one was here).
> Would a TRUNCATE have the same effect though? I grep'd through our
> application, and it appears that at least 3 tables get truncated, one
> of them several times per hour. The often-truncated table wasn't one
> of the bad ones, but the others are the ones I've already identified
> as non-existent.

Update: Set up the warm standby again and encountered the same issue,
with two of the three previously-identified tables -- the ones that
can get truncated throughout the day. We're going to try again
overnight when those tables are not truncated and see if that gives us
a correctly-working standby.

From what I could find from posts to these lists, TRUNCATE commands do
reset the relfilenode, and that could account for the issue we are
experiencing. What I find odd is that we have one other table that is
truncated every 15 minutes (aggregate table) but that one was fine in
both attempts at the warm standby.
On Fri, Dec 31, 2010 at 1:13 PM, bricklen <bricklen@gmail.com> wrote:
> Update: Set up the warm standby again and encountered the same issue,
> with two of the three previously-identified tables -- the ones that
> can get truncated throughout the day. We're going to try again
> overnight when those tables are not truncated and see if that gives us
> a correctly-working standby.
>
> From what I could find from posts to these lists, TRUNCATE commands do
> reset the relfilenode, and that could account for the issue we are
> experiencing. What I find odd is that we have one other table that is
> truncated every 15 minutes (aggregate table) but that one was fine in
> both attempts at the warm standby.

What O/S, kernel, and filesystem are you using?

-- 
Jon
bricklen <bricklen@gmail.com> writes:
> On Wed, Dec 29, 2010 at 1:53 PM, bricklen <bricklen@gmail.com> wrote:
>> On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You did something on the source DB that rewrote the table with a new
>>> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
>>> or ANALYZE wouldn't do it). And for some reason the standby hasn't
>>> picked up that change in the pg_class row.

>> Definitely no CLUSTER commands were issued, and there should have been
>> no ALTER commands issued (yesterday was a holiday, no one was here).
>> Would a TRUNCATE have the same effect though?

Yeah, that would assign a new relfilenode too.

> Update: Set up the warm standby again and encountered the same issue,
> with two of the three previously-identified tables -- the ones that
> can get truncated throughout the day. We're going to try again
> overnight when those tables are not truncated and see if that gives us
> a correctly-working standby.

I think you're missing the point here: there is something about your
standby setup procedure that is causing you to get an inconsistent set
of row states. Failing to access a table at all because of a wrong
relfilenode in pg_class is a really obvious symptom; but it's just a
symptom, not the core disease. Almost certainly, your user data is out
of sync as well, ie there are probably partially-applied updates and so
on. Avoiding a TRUNCATE isn't going to eliminate those other problems.

There's another thread going on about an inconsistent standby, in which
I think we identified the problem as being failure to copy over the
backup_label file. That would cause the standby to not replay enough
WAL, which would explain your problem nicely. I wonder whether you're
doing that or something like it.

			regards, tom lane
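For readers of the archive: the backup_label point above can be checked mechanically on the standby before starting recovery. pg_start_backup() writes a backup_label file into $PGDATA whose first line records the WAL segment redo must begin from; if the file is missing, recovery starts from a later checkpoint and replays too little WAL. A rough pre-flight check might look like this (hypothetical helper, not part of PostgreSQL):

```python
import os
import re

def backup_start_segment(data_dir):
    """Return the WAL segment named in backup_label's START WAL LOCATION line,
    or None if backup_label is absent (a red flag for a copied base backup)."""
    label = os.path.join(data_dir, "backup_label")
    if not os.path.exists(label):
        return None
    with open(label) as f:
        for line in f:
            # backup_label's first line looks like:
            # START WAL LOCATION: 1FB4/7BD68388 (file 0000000100001FB40000007B)
            m = re.match(r"START WAL LOCATION: .* \(file ([0-9A-F]+)\)", line)
            if m:
                return m.group(1)
    return None
```

If this returns None on a freshly rsync'd data directory, the label was lost in transit and the standby cannot be trusted to have replayed from the right starting point.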
On Fri, Dec 31, 2010 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you're missing the point here: there is something about your
> standby setup procedure that is causing you to get an inconsistent set
> of row states.

For the sake of the archives, the problem turned out to be due to
impatience. We were not waiting long enough for all the .ready files in
/data/pg_xlog/archive_status/ to be archived to the standby (noted by
those files on the master getting renamed to .done). It was taking
around 15 minutes for those files to get applied to the standby, not
the 1 minute or so that I expected.

Sorry for the noise, and thanks for the help!
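For anyone replaying this thread from the archives: the fix boils down to not promoting the standby until pg_xlog/archive_status on the master holds no more .ready markers (the archiver renames each to .done once the segment has shipped). That wait can be scripted; a small sketch (hypothetical helper, not a PostgreSQL tool):

```python
import os
import time

def pending_wal_segments(pgdata):
    """List WAL segments still waiting to be archived (.ready markers)."""
    status_dir = os.path.join(pgdata, "pg_xlog", "archive_status")
    return sorted(f[:-len(".ready")]
                  for f in os.listdir(status_dir)
                  if f.endswith(".ready"))

def wait_for_archiving(pgdata, poll_seconds=30, timeout=3600):
    """Block until every .ready marker has been renamed to .done,
    i.e. the archive_command has shipped all pending segments."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if not pending_wal_segments(pgdata):
            return True
        time.sleep(poll_seconds)
    return False
```

Running wait_for_archiving() between pg_stop_backup() and the promotion trigger would have avoided the 15-minute race described above.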