Discussion: ERROR: could not open relation base/2757655/6930168: No such file or directory -- during warm standby setup

Hi all,

After setting up a warm standby
(pg_start_backup/rsync/pg_stop_backup), and promoting to master, we
encountered an error in the middle of an analyze of the new standby
db. (the standby server is a fresh server)

Source db: PostgreSQL 8.4.2
Standby db: PostgreSQL 8.4.6

...
INFO:  analyzing "public.offer2offer"
ERROR:  could not open relation base/2757655/6930168: No such file or directory

That file does not exist on the source db, nor the standby db. That
offer2offer table exists in the source db (42MB), but is 0 bytes on
the standby.

-- on the standby
 select * from pg_class where relfilenode = 6930168;
-[ RECORD 1 ]--+---------------------------------------------
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6930168
reltablespace  | 0
relpages       | 5210
reltuples      | 324102
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1227738213


select * from offer2offer ;
ERROR:  could not open relation base/2757655/6930168: No such file or directory



-- on the source db
 select * from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+----------------------------------------------------
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6946955
reltablespace  | 0
relpages       | 5216
reltuples      | 324642
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1228781185

-- on the source server
ls -lh `locate 6946955`
-rw------- 1 postgres postgres 41M Dec 28 15:17
/var/lib/pgsql/data/base/2757655/6946955
-rw------- 1 postgres postgres 32K Dec 28 15:17
/var/lib/pgsql/data/base/2757655/6946955_fsm

We noticed after the initial rsync that the standby's data dir was around
3-4 GB smaller than the source's. I assumed that it was simply because the
pg_xlog dir on the standby did not have the WAL files that existed on the
source (they were stored on a different partition).
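
For reference, a rough way to compare the two data directories while
ignoring WAL is something like the following (a sketch using the paths and
tunnel port from our setup; --exclude is GNU du):

# On the source server: data dir size, ignoring pg_xlog
du -sh --exclude=pg_xlog /var/lib/pgsql/data

# On the standby, over the same SSH tunnel port used for the rsync
ssh -p 9001 postgres@standby-tunnel "du -sh --exclude=pg_xlog /var/lib/pgsql/data"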

We are running badblocks right now; then we'll do some more disk
testing and hopefully memtest86.

Does this look like a hardware problem, and/or some catalog corruption?

Any suggestions on what steps we should take next?

Thanks!

bricklen <bricklen@gmail.com> writes:
> After setting up a warm standby
> (pg_start_backup/rsync/pg_stop_backup), and promoting to master, we
> encountered an error in the middle of an analyze of the new standby
> db. (the standby server is a fresh server)
> [ relfilenode doesn't match on source and standby ]

What can you tell us about what was happening on the source DB while
the backup was being taken?  In particular I'm wondering if anything
that would've given offer2offer a new relfilenode was in progress.
Also, does the pg_class entry for offer2offer have the same xmin and
ctid in both DBs?

            regards, tom lane

On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> bricklen <bricklen@gmail.com> writes:
>> After setting up a warm standby
>> (pg_start_backup/rsync/pg_stop_backup), and promoting to master, we
>> encountered an error in the middle of an analyze of the new standby
>> db. (the standby server is a fresh server)
>> [ relfilenode doesn't match on source and standby ]
>
> What can you tell us about what was happening on the source DB while
> the backup was being taken?  In particular I'm wondering if anything
> that would've given offer2offer a new relfilenode was in progress.
> Also, does the pg_class entry for offer2offer have the same xmin and
> ctid in both DBs?
>
>                        regards, tom lane

A couple of other notes:
- There are two tables affected, not just one; I found that by running
individual ANALYZE commands on every table in the db.
- The rsync command we are using passes the -e "ssh -p" switch so
we can specify a port number:

rsync -av -e "ssh -p 9001" --progress --partial -z /var/lib/pgsql/data
postgres@standby-tunnel:/var/lib/pgsql/

The pg_start_backup/pg_stop_backup window was about 10 hours, as the
transfer took that long (480 GB).

The source db has between 1000 and 3000 transactions/s, so is
reasonably volatile. The two tables in question are not accessed very
heavily though.

Looking at the ctid and xmin between both databases, no, they don't
seem to match exactly. Pardon my ignorance, but would those have
changed due to vacuums, analyze, or any other forms of access?


Source offer2offer:
select ctid,xmin,* from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+--------------------------------------------------------------------
ctid           | (142,2)
xmin           | 1228781192
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6946955
reltablespace  | 0
relpages       | 5216
reltuples      | 324642
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1228781185


Standby offer2offer:
select ctid,xmin,* from pg_class where relname='offer2offer';
-[ RECORD 1 ]--+---------------------------------------------------------------------
ctid           | (142,1)
xmin           | 1227738244
relname        | offer2offer
relnamespace   | 2200
reltype        | 2760224
relowner       | 10
relam          | 0
relfilenode    | 6930168
reltablespace  | 0
relpages       | 5210
reltuples      | 324102
reltoastrelid  | 2760225
reltoastidxid  | 0
relhasindex    | f
relisshared    | f
relistemp      | f
relkind        | r
relnatts       | 12
relchecks      | 0
relhasoids     | f
relhaspkey     | f
relhasrules    | f
relhastriggers | f
relhassubclass | f
relfrozenxid   | 1227738213

bricklen <bricklen@gmail.com> writes:
> On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What can you tell us about what was happening on the source DB while
>> the backup was being taken?

> The source db has between 1000 and 3000 transactions/s, so is
> reasonably volatile. The two tables in question are not accessed very
> heavily though.

> Looking at the ctid and xmin between both databases, no, they don't
> seem to match exactly. Pardon my ignorance, but would those have
> changed due to vacuums, analyze, or any other forms of access?

The difference in ctid, and the values of xmin and relfrozenxid,
seems to confirm my suspicion that this wasn't just random cosmic rays.
You did something on the source DB that rewrote the table with a new
relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
or ANALYZE wouldn't do it).  And for some reason the standby hasn't
picked up that change in the pg_class row.  I suspect the explanation
is that your technique for setting up the standby is flawed.  You can't
just rsync and have a valid snapshot of the DB --- you need to be sure
that enough WAL gets replayed to fix any inconsistencies arising from
the time-extended nature of the rsync operation.  But you didn't say
exactly how you did that.
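
One concrete cross-check (assuming the WAL archive location described
elsewhere in this thread) is the backup history file that pg_stop_backup()
writes into the archive; recovery has to reach at least its STOP WAL
LOCATION before the copy can be considered consistent:

# On the standby: the *.backup history file produced by pg_stop_backup()
# records the minimum WAL position recovery must replay through
grep 'STOP WAL LOCATION' /var/lib/pgsql/WAL_Archive/*.backup
# ...then compare that against the "redo done at ..." line in the
# standby server's log once recovery finishes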

            regards, tom lane

On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> bricklen <bricklen@gmail.com> writes:
>> On Wed, Dec 29, 2010 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What can you tell us about what was happening on the source DB while
>>> the backup was being taken?
>
>> The source db has between 1000 and 3000 transactions/s, so is
>> reasonably volatile. The two tables in question are not accessed very
>> heavily though.
>
>> Looking at the ctid and xmin between both databases, no, they don't
>> seem to match exactly. Pardon my ignorance, but would those have
>> changed due to vacuums, analyze, or any other forms of access?
>
> The difference in ctid, and the values of xmin and relfrozenxid,
> seems to confirm my suspicion that this wasn't just random cosmic rays.
> You did something on the source DB that rewrote the table with a new
> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
> or ANALYZE wouldn't do it).  And for some reason the standby hasn't
> picked up that change in the pg_class row.  I suspect the explanation
> is that your technique for setting up the standby is flawed.  You can't
> just rsync and have a valid snapshot of the DB --- you need to be sure
> that enough WAL gets replayed to fix any inconsistencies arising from
> the time-extended nature of the rsync operation.  But you didn't say
> exactly how you did that.
>

Definitely no CLUSTER commands were issued, and there should have been
no ALTER commands issued (yesterday was a holiday, no one was here).
Would a TRUNCATE have the same effect though? I grep'd through our
application, and it appears that at least 3 tables get truncated, one
of them several times per hour. The often-truncated table wasn't one
of the bad ones, but the others are the ones I've already identified
as non-existent.


The standby setup process was:

On the source server:
- start shipping WAL files to the standby's /var/lib/pgsql/WAL_Archive dir
(a sketch of this step follows at the end of this procedure)
- "select pg_start_backup('standby')"
- rsync -av -e "ssh -p 9001" --progress --partial -z /var/lib/pgsql/data postgres@standby-tunnel:/var/lib/pgsql/
- "select pg_stop_backup()"

On the standby:
- recovery.conf file:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live /WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'

Note: /WAL_Archive is a symlink to /var/lib/pgsql/WAL_Archive; the symlink
itself didn't get touched by the rsync, as it sits at the same directory
level as the data dir.

- pg_ctl -D /var/lib/pgsql/data start
- touch /var/lib/pgsql/WAL_Archive/go_live
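
The "start shipping WAL files" step above is driven by archive_command on
the source. Our exact command isn't shown here, but a sketch of the idea
(port and paths taken from the rsync above; the specific rsync form is an
assumption) would be:

# postgresql.conf on the source -- hypothetical sketch, not our exact setting;
# %p is the path of the WAL segment to archive, %f is its file name
archive_mode = on
archive_command = 'rsync -a -e "ssh -p 9001" %p postgres@standby-tunnel:/var/lib/pgsql/WAL_Archive/%f'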

Looking over the log of the restore process, there are a couple of
messages that might help pinpoint the problem area.


2010-12-29 08:07:34 PST [16589]: [2-1] (user=) (rhost=) LOG:  starting
archive recovery
2010-12-29 08:07:34 PST [16589]: [3-1] (user=) (rhost=) LOG:
restore_command = 'pg_standby -d -s 2 -t /WAL_Archive/go_live
/WAL_Archive %f %p %r 2>>/var/lib/pgsql/data/pg_log/standby.log'
2010-12-29 08:07:34 PST [16589]: [4-1] (user=) (rhost=) LOG:
recovery_end_command = 'rm -f /WAL_Archive/go_live'
2010-12-29 08:07:46 PST [16589]: [5-1] (user=) (rhost=) LOG:  restored
log file "0000000100001FB400000080" from archive
2010-12-29 08:07:46 PST [16589]: [6-1] (user=) (rhost=) LOG:
automatic recovery in progress
2010-12-29 08:07:46 PST [16589]: [7-1] (user=) (rhost=) LOG:  restored
log file "0000000100001FB40000007B" from archive
2010-12-29 08:07:46 PST [16589]: [8-1] (user=) (rhost=) LOG:  redo
starts at 1FB4/7BD68388
2010-12-29 08:07:46 PST [16589]: [9-1] (user=) (rhost=) LOG:
consistent recovery state reached
2010-12-29 08:07:46 PST [16589]: [10-1] (user=) (rhost=) LOG:  file
"pg_clog/0494" doesn't exist, reading as zeroes
2010-12-29 08:07:46 PST [16589]: [11-1] (user=) (rhost=) CONTEXT:
xlog redo commit: 2010-12-28 22:35:18.152971-08
2010-12-29 08:08:21 PST [16589]: [12-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007C" from archive
2010-12-29 08:08:22 PST [16589]: [13-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007D" from archive
2010-12-29 08:08:22 PST [16589]: [14-1] (user=) (rhost=) LOG:
restored log file "0000000100001FB40000007E" from archive
... thousands of WAL logs later
2010-12-29 08:43:51 PST [16589]: [2529-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005B" from archive
2010-12-29 08:43:59 PST [16589]: [2530-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:43:59 PST [16589]: [2531-1] (user=) (rhost=) LOG:  could
not open file "pg_xlog/0000000100001FBE0000005D" (log file 8126,
segment 93): No such file or directory
2010-12-29 08:43:59 PST [16589]: [2532-1] (user=) (rhost=) LOG:  redo
done at 1FBE/5CFFEA58
2010-12-29 08:43:59 PST [16589]: [2533-1] (user=) (rhost=) LOG:  last
completed transaction was at log time 2010-12-29 08:44:14.784874-08
2010-12-29 08:43:59 PST [16589]: [2534-1] (user=) (rhost=) LOG:
restored log file "0000000100001FBE0000005C" from archive
2010-12-29 08:44:12 PST [16589]: [2535-1] (user=) (rhost=) LOG:
selected new timeline ID: 2
2010-12-29 08:44:24 PST [16589]: [2536-1] (user=) (rhost=) LOG:
archive recovery complete
2010-12-29 08:44:24 PST [25155]: [1-1] (user=) (rhost=) LOG:
autovacuum launcher started
2010-12-29 08:44:24 PST [16587]: [2-1] (user=) (rhost=) LOG:  database
system is ready to accept connections

Do those log entries look suspicious?

On Wed, Dec 29, 2010 at 1:53 PM, bricklen <bricklen@gmail.com> wrote:
> On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> The difference in ctid, and the values of xmin and relfrozenxid,
>> seems to confirm my suspicion that this wasn't just random cosmic rays.
>> You did something on the source DB that rewrote the table with a new
>> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
>> or ANALYZE wouldn't do it).  And for some reason the standby hasn't
>> picked up that change in the pg_class row.  I suspect the explanation
>> is that your technique for setting up the standby is flawed.  You can't
>> just rsync and have a valid snapshot of the DB --- you need to be sure
>> that enough WAL gets replayed to fix any inconsistencies arising from
>> the time-extended nature of the rsync operation.  But you didn't say
>> exactly how you did that.
>>
>
> Definitely no CLUSTER commands were issued, and there should have been
> no ALTER commands issued (yesterday was a holiday, no one was here).
> Would a TRUNCATE have the same effect though? I grep'd through our
> application, and it appears that at least 3 tables get truncated, one
> of them several times per hour. The often-truncated table wasn't one
> of the bad ones, but the others are the ones I've already identified
> as non-existent.

Update: Set up the warm standby again and encountered the same issue,
with two of the three previously-identified tables -- the ones that
can get truncated throughout the day. We're going to try again
overnight when those tables are not truncated and see if that gives us
a correctly-working standby.

From what I could find from posts to these lists, TRUNCATE commands do
reset the relfilenode, and that could account for the issue we are
experiencing. What I find odd is that we have one other table that is
truncated every 15 minutes (aggregate table) but that one was fine in
both attempts at the warm standby.
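
A quick way to see the relfilenode change for yourself, with a throwaway
table in a scratch database (hypothetical names):

psql -d scratchdb <<'EOF'
CREATE TABLE trunc_test (id int);
-- note the relfilenode before the TRUNCATE
SELECT relfilenode FROM pg_class WHERE relname = 'trunc_test';
TRUNCATE trunc_test;
-- the relfilenode is now different; the old relation file is unlinked at commit
SELECT relfilenode FROM pg_class WHERE relname = 'trunc_test';
DROP TABLE trunc_test;
EOF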

On Fri, Dec 31, 2010 at 1:13 PM, bricklen <bricklen@gmail.com> wrote:
> On Wed, Dec 29, 2010 at 1:53 PM, bricklen <bricklen@gmail.com> wrote:
>> On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>
>>> The difference in ctid, and the values of xmin and relfrozenxid,
>>> seems to confirm my suspicion that this wasn't just random cosmic rays.
>>> You did something on the source DB that rewrote the table with a new
>>> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
>>> or ANALYZE wouldn't do it).  And for some reason the standby hasn't
>>> picked up that change in the pg_class row.  I suspect the explanation
>>> is that your technique for setting up the standby is flawed.  You can't
>>> just rsync and have a valid snapshot of the DB --- you need to be sure
>>> that enough WAL gets replayed to fix any inconsistencies arising from
>>> the time-extended nature of the rsync operation.  But you didn't say
>>> exactly how you did that.
>>>
>>
>> Definitely no CLUSTER commands were issued, and there should have been
>> no ALTER commands issued (yesterday was a holiday, no one was here).
>> Would a TRUNCATE have the same effect though? I grep'd through our
>> application, and it appears that at least 3 tables get truncated, one
>> of them several times per hour. The often-truncated table wasn't one
>> of the bad ones, but the others are the ones I've already identified
>> as non-existent.
>
> Update: Set up the warm standby again and encountered the same issue,
> with two of the three previously-identified tables -- the ones that
> can get truncated throughout the day. We're going to try again
> overnight when those tables are not truncated and see if that gives us
> a correctly-working standby.
>
> From what I could find from posts to these lists, TRUNCATE commands do
> reset the relfilenode, and that could account for the issue we are
> experiencing. What I find odd is that we have one other table that is
> truncated every 15 minutes (aggregate table) but that one was fine in
> both attempts at the warm standby.

What O/S, kernel, and filesystem are you using?


--
Jon

bricklen <bricklen@gmail.com> writes:
> On Wed, Dec 29, 2010 at 1:53 PM, bricklen <bricklen@gmail.com> wrote:
>> On Wed, Dec 29, 2010 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> You did something on the source DB that rewrote the table with a new
>>> relfilenode (possibly CLUSTER or some form of ALTER TABLE; plain VACUUM
>>> or ANALYZE wouldn't do it).  And for some reason the standby hasn't
>>> picked up that change in the pg_class row.

>> Definitely no CLUSTER commands were issued, and there should have been
>> no ALTER commands issued (yesterday was a holiday, no one was here).
>> Would a TRUNCATE have the same effect though?

Yeah, that would assign a new relfilenode too.

> Update: Set up the warm standby again and encountered the same issue,
> with two of the three previously-identified tables -- the ones that
> can get truncated throughout the day. We're going to try again
> overnight when those tables are not truncated and see if that gives us
> a correctly-working standby.

I think you're missing the point here: there is something about your
standby setup procedure that is causing you to get an inconsistent set
of row states.  Failing to access a table at all because of wrong
relfilenode in pg_class is a really obvious symptom; but it's just a
symptom, not the core disease.  Almost certainly, your user data
is out of sync as well, i.e. there are probably partially-applied updates
and so on.  Avoiding a TRUNCATE isn't going to eliminate those other
problems.

There's another thread going on about an inconsistent standby, in which
I think we identified the problem as being failure to copy over the
backup_label file.  That would cause the standby to not replay enough
WAL, which would explain your problem nicely.  I wonder whether you're
doing that or something like it.
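
A quick sanity check along those lines, with the data-dir path used
earlier in this thread, is to confirm before starting the standby that the
label file written by pg_start_backup actually came across in the rsync:

# On the standby, after the rsync completes and before pg_ctl start
ls -l /var/lib/pgsql/data/backup_label
grep 'START WAL LOCATION' /var/lib/pgsql/data/backup_label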

            regards, tom lane

On Fri, Dec 31, 2010 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I think you're missing the point here: there is something about your
> standby setup procedure that is causing you to get an inconsistent set
> of row states.

For the sake of the archives, the problem turned out to be due to
impatience. We were not waiting long enough for all of the WAL segments
flagged as .ready in /data/pg_xlog/archive_status/ to be archived to the
standby (indicated by those files on the master being renamed to .done).
It was taking around 15 minutes for those segments to be shipped and
applied on the standby, not the 1 minute or so that I expected.
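
In case it helps anyone else, a simple guard against the same mistake is to
wait on the master until archive_status has no .ready files left before
touching the trigger file -- a sketch, using the archive_status path above:

# On the master: loop until every pending WAL segment has been archived;
# each segment's .ready file is renamed to .done once archive_command succeeds
while ls /data/pg_xlog/archive_status/*.ready >/dev/null 2>&1; do
    echo "waiting for WAL segments to be archived..."
    sleep 30
done
echo "archive_status is clean; safe to touch the go_live trigger file"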

Sorry for the noise, and thanks for the help!