Обсуждение: "could not open relation with OID" errors after promoting the standby to master

Поиск
Список
Период
Сортировка

"could not open relation with OID" errors after promoting the standby to master

От
Joachim Wieland
Дата:
I've switched servers yesterday night and the previous slave is now
the master. This is 9.0.6 (originally) / 9.0.7 (now) on Linux.

Now I'm seeing a bunch of

ERROR:  could not open relation with OID 1990987633
STATEMENT:  create temp table seen_files (fileid integer)

Interestingly enough, 90% of these happen with "create temp table"
statements, but I also see them with regular read-only select
statements, but I'd say it's at most 10% of these.

Some reports for this error message suggested running reindex, so I've
run reindex table and also vacuum full on all system catalogs (just
for good measure) but that didn't help much. Restarting the cluster
didn't help either.

If it matters, I have not promoted the master with a trigger file but
restarted it after deleting recovery.conf.

Everything else appears to be running fine but since it's a) annoying
and b) not very comforting, I might dump and restore tomorrow night.

I added a sleep() after this error message so that I could attach a
debugger and grab a stack trace:

(gdb) bt
#0  0x0000003eb509a170 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x0000003eb5099fc4 in sleep () from /lib64/libc.so.6
#2  0x000000000046375c in relation_open (relationId=1990987633,
lockmode=<value optimized out>) at heapam.c:906
#3  0x00000000004b26e6 in heap_drop_with_catalog (relid=1990987633) at
heap.c:1567
#4  0x00000000004ace01 in doDeletion (object=0x62dfbc8,
depRel=0x2b9ea756f6a8) at dependency.c:1046
#5  deleteOneObject (object=0x62dfbc8, depRel=0x2b9ea756f6a8) at
dependency.c:1004
#6  0x00000000004aeb00 in deleteWhatDependsOn (object=<value optimized
out>, showNotices=0 '\000') at dependency.c:401
#7  0x00000000004b6e90 in RemoveTempRelations () at namespace.c:3234
#8  InitTempTableNamespace () at namespace.c:3066
#9  0x00000000004b70b5 in RangeVarGetCreationNamespace
(newRelation=<value optimized out>) at namespace.c:351
#10 0x0000000000536e13 in DefineRelation (stmt=0x638da00, relkind=114
'r', ownerId=0) at tablecmds.c:409
#11 0x000000000063c15f in standard_ProcessUtility (parsetree=<value
optimized out>,   queryString=0x6298460 "create temp table temp_defs_table_20068
(symhash char(32), symbolid integer)", params=0x0, isTopLevel=<value
optimized out>,   dest=<value optimized out>, completionTag=<value optimized out>)
at utility.c:512
#12 0x0000000000637d81 in PortalRunUtility (portal=0x61ec860,
utilityStmt=0x62992d0, isTopLevel=1 '\001', dest=0x6299670,
completionTag=0x7ffffa9c1c30 "") at pquery.c:1191
#13 0x00000000006384de in PortalRunMulti (portal=0x61ec860,
isTopLevel=1 '\001', dest=0x6299670, altdest=0x6299670,
completionTag=0x7ffffa9c1c30 "") at pquery.c:1296
#14 0x0000000000639732 in PortalRun (portal=0x61ec860,
count=9223372036854775807, isTopLevel=1 '\001', dest=0x6299670,
altdest=0x6299670, completionTag=0x7ffffa9c1c30 "")   at pquery.c:822
#15 0x00000000006359e5 in exec_simple_query (argc=<value optimized
out>, argv=<value optimized out>, username=<value optimized out>) at
postgres.c:1058
#16 PostgresMain (argc=<value optimized out>, argv=<value optimized
out>, username=<value optimized out>) at postgres.c:3936
#17 0x00000000005f95d6 in BackendRun () at postmaster.c:3560
#18 BackendStartup () at postmaster.c:3247
#19 ServerLoop () at postmaster.c:1431
#20 0x00000000005fb934 in PostmasterMain (argc=<value optimized out>,
argv=<value optimized out>) at postmaster.c:1092
#21 0x000000000058de36 in main (argc=3, argv=0x61dd1d0) at main.c:188
(gdb)

Any idea? It looks suspicious that it calls into RemoveTempRelations()
from InitTempNamespace() thereby removing the table it is about to
create?


Re: "could not open relation with OID" errors after promoting the standby to master

От
Alvaro Herrera
Дата:
Excerpts from Joachim Wieland's message of mar may 15 22:37:15 -0400 2012:
> I've switched servers yesterday night and the previous slave is now
> the master. This is 9.0.6 (originally) / 9.0.7 (now) on Linux.
>
> Now I'm seeing a bunch of
>
> ERROR:  could not open relation with OID 1990987633

> #7  0x00000000004b6e90 in RemoveTempRelations () at namespace.c:3234
> #8  InitTempTableNamespace () at namespace.c:3066

> Any idea? It looks suspicious that it calls into RemoveTempRelations()
> from InitTempNamespace() thereby removing the table it is about to
> create?

Well, that is not surprising in itself -- InitTempTableNamespace calls
RemoveTempRelations to cleanup from a possibly crashed previous backend
with the same ID.  So that part of the backtrace looks normal to me
(unless there is something weird going on, which might very well be the
case).

It looks to me like you might have leftover pg_depend entries without
the corresponding entries in other catalogs (pg_class, etc).  No idea
how that might happen, but maybe it's a starting point to investigate.

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: "could not open relation with OID" errors after promoting the standby to master

От
Joachim Wieland
Дата:
On Wed, May 16, 2012 at 11:38 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:
> Well, that is not surprising in itself -- InitTempTableNamespace calls
> RemoveTempRelations to cleanup from a possibly crashed previous backend
> with the same ID.  So that part of the backtrace looks normal to me
> (unless there is something weird going on, which might very well be the
> case).

Right, I guess the stack trace is okay but some state was obviously wrong.

I was able to clean that up now by some catalog hacking, but I'm
definitely going to dump and reload soon.

I found out that it was certain backend ids which couldn't create
temporary tables, meaning that when I did a "create temp table" in
these few certain backend ids (about 4-5 all with low id numbers which
is why I hit them quite often), it would give me this "could not open
relation with OID x" error.

I also couldn't drop the temp schema in these backends:

# drop schema pg_temp_4;
ERROR:  cache lookup failed for relation 1990987636

# select oid, * from pg_namespace ;
(got oid 4664506 for "pg_temp_4")

# select * from pg_class where oid = 1990987636;
(no rows returned)

# delete from pg_namespace where oid = 4664506;
DELETE 1

# create temp table myfoo(a int);
CREATE TABLE

Later on I also found some leftover pg_type entries from temporary
tables that didn't exist anymore. I'm quite that certain I shouldn't
see these anymore... And I also find a few entries in pg_class with
relistemp='t' whose oid is considerably older than anything recent.
This kinda suggests that there might be something weird going on when
you have temp tables in flight and fail over, at least that's the only
explanation I have for how this could have happened.


Re: "could not open relation with OID" errors after promoting the standby to master

От
Robert Haas
Дата:
On Tue, May 15, 2012 at 10:37 PM, Joachim Wieland <joe@mcknight.de> wrote:
> If it matters, I have not promoted the master with a trigger file but
> restarted it after deleting recovery.conf.

Hmm.  I think that if you do it this way, the minimum recovery point
won't be respected, which could leave you with a corrupted database.
Now, if all the WAL files that you need are present in pg_xlog anyway,
then they ought to get replayed anyway, but I think that if you are
using restore_command (as opposed to streaming replication) we restore
WAL segments under a different file name, which might cause this
problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: "could not open relation with OID" errors after promoting the standby to master

От
Joachim Wieland
Дата:
On Tue, May 22, 2012 at 9:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Hmm.  I think that if you do it this way, the minimum recovery point
> won't be respected, which could leave you with a corrupted database.
> Now, if all the WAL files that you need are present in pg_xlog anyway,
> then they ought to get replayed anyway, but I think that if you are
> using restore_command (as opposed to streaming replication) we restore
> WAL segments under a different file name, which might cause this
> problem.

Uhm, maybe I add some more details, so you get a better idea of what I
did: The idea was to promote the standby to be the new master. There
was streaming replication active but at some time I had to take the
master down. IIRC from the log I saw that after the master went down,
the standby continued recovering from a bunch of archived log files
(via recovery_command), I had suspected that either the standby was
lagging behind a bit or that the master archived them during shutdown.
When the standby didn't have anything else left to recover from
(saying both "xlog file foo doesn't exist" and "cannot connect to
master"), I deleted recovery.conf on the standby and restarted it.

I wouldn't have assumed any corruption was possible given that I did
clean shutdowns on both sides...


Re: "could not open relation with OID" errors after promoting the standby to master

От
Robert Haas
Дата:
On Thu, May 24, 2012 at 10:13 AM, Joachim Wieland <joe@mcknight.de> wrote:
> On Tue, May 22, 2012 at 9:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Hmm.  I think that if you do it this way, the minimum recovery point
>> won't be respected, which could leave you with a corrupted database.
>> Now, if all the WAL files that you need are present in pg_xlog anyway,
>> then they ought to get replayed anyway, but I think that if you are
>> using restore_command (as opposed to streaming replication) we restore
>> WAL segments under a different file name, which might cause this
>> problem.
>
> Uhm, maybe I add some more details, so you get a better idea of what I
> did: The idea was to promote the standby to be the new master. There
> was streaming replication active but at some time I had to take the
> master down. IIRC from the log I saw that after the master went down,
> the standby continued recovering from a bunch of archived log files
> (via recovery_command), I had suspected that either the standby was
> lagging behind a bit or that the master archived them during shutdown.
> When the standby didn't have anything else left to recover from
> (saying both "xlog file foo doesn't exist" and "cannot connect to
> master"), I deleted recovery.conf on the standby and restarted it.
>
> I wouldn't have assumed any corruption was possible given that I did
> clean shutdowns on both sides...

The thing that's worrying me is that there's not really any such thing
as a "clean" shutdown on a standby.  When you shut down the master, it
checkpoints.  When you shut down the standby, it can't checkpoint, so
I think it's still going to enter recovery at startup.  It'd be
interesting to know where that recovery began and ended as compared
with the minimum recovery point just before the shutdown.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: "could not open relation with OID" errors after promoting the standby to master

От
Heikki Linnakangas
Дата:
On 24.05.2012 18:16, Robert Haas wrote:
> On Thu, May 24, 2012 at 10:13 AM, Joachim Wieland<joe@mcknight.de>  wrote:
>> I wouldn't have assumed any corruption was possible given that I did
>> clean shutdowns on both sides...
>
> The thing that's worrying me is that there's not really any such thing
> as a "clean" shutdown on a standby.  When you shut down the master, it
> checkpoints.  When you shut down the standby, it can't checkpoint, so
> I think it's still going to enter recovery at startup.  It'd be
> interesting to know where that recovery began and ended as compared
> with the minimum recovery point just before the shutdown.

Perhaps we should introduce the concept of a clean standby shutdown. We 
can't write a checkpoint record, but we could write the same information 
somewhere else. Like in the control file. At startup, we'd see that we 
did a clean shutdown at WAL point X/X, and start up without having to 
read through all the WAL from the last master checkpoint.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


Re: "could not open relation with OID" errors after promoting the standby to master

От
Andres Freund
Дата:
On Thursday, May 24, 2012 06:35:06 PM Heikki Linnakangas wrote:
> On 24.05.2012 18:16, Robert Haas wrote:
> > On Thu, May 24, 2012 at 10:13 AM, Joachim Wieland<joe@mcknight.de>  wrote:
> >> I wouldn't have assumed any corruption was possible given that I did
> >> clean shutdowns on both sides...
> > 
> > The thing that's worrying me is that there's not really any such thing
> > as a "clean" shutdown on a standby.  When you shut down the master, it
> > checkpoints.  When you shut down the standby, it can't checkpoint, so
> > I think it's still going to enter recovery at startup.  It'd be
> > interesting to know where that recovery began and ended as compared
> > with the minimum recovery point just before the shutdown.
> 
> Perhaps we should introduce the concept of a clean standby shutdown. We
> can't write a checkpoint record, but we could write the same information
> somewhere else. Like in the control file. At startup, we'd see that we
> did a clean shutdown at WAL point X/X, and start up without having to
> read through all the WAL from the last master checkpoint.
The control file currently is not a very good match because of the current 
requirement of staying below 512 bytes. If we would include the list of 
running xacts that wouldn't be enough.
I wondered before if there is more to do to fix that then to do the atomic 
write();fsync();rename();fsync(); dance. I don't see a problem with the cost 
of that personally...

Andres


Re: "could not open relation with OID" errors after promoting the standby to master

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> The control file currently is not a very good match because of the current 
> requirement of staying below 512 bytes. If we would include the list of 
> running xacts that wouldn't be enough.
> I wondered before if there is more to do to fix that then to do the atomic 
> write();fsync();rename();fsync(); dance. I don't see a problem with the cost 
> of that personally...

The reason for keeping it to one sector is that you're screwed if the
file is broken, so the fewer failure modes the better.

I'm not sure I believe that we can make a recovery resume from an
arbitrary point in WAL anyway, or that it would be worth the trouble.
Can't we just resume from the last restartpoint?
        regards, tom lane


Re: "could not open relation with OID" errors after promoting the standby to master

От
Andres Freund
Дата:
On Thursday, May 24, 2012 08:32:47 PM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The control file currently is not a very good match because of the
> > current requirement of staying below 512 bytes. If we would include the
> > list of running xacts that wouldn't be enough.
> > I wondered before if there is more to do to fix that then to do the
> > atomic write();fsync();rename();fsync(); dance. I don't see a problem
> > with the cost of that personally...
> 
> The reason for keeping it to one sector is that you're screwed if the
> file is broken, so the fewer failure modes the better.
Yea, sure. But given the amount of software that depends on the above sequence 
to work correctly I don't really see much of a problem...

> I'm not sure I believe that we can make a recovery resume from an
> arbitrary point in WAL anyway, or that it would be worth the trouble.
> Can't we just resume from the last restartpoint?
Well, with a decent sized checkpoint_segments getting up2date can take quite a 
noticeable amount of time...

Andres


Re: "could not open relation with OID" errors after promoting the standby to master

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On Thursday, May 24, 2012 08:32:47 PM Tom Lane wrote:
>> I'm not sure I believe that we can make a recovery resume from an
>> arbitrary point in WAL anyway, or that it would be worth the trouble.
>> Can't we just resume from the last restartpoint?

> Well, with a decent sized checkpoint_segments getting up2date can take quite a 
> noticeable amount of time...

So?  Slow restart is precisely the price you pay for a large checkpoint
interval.
        regards, tom lane


Re: "could not open relation with OID" errors after promoting the standby to master

От
Andres Freund
Дата:
On Thursday, May 24, 2012 08:46:21 PM Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Thursday, May 24, 2012 08:32:47 PM Tom Lane wrote:
> >> I'm not sure I believe that we can make a recovery resume from an
> >> arbitrary point in WAL anyway, or that it would be worth the trouble.
> >> Can't we just resume from the last restartpoint?
> > 
> > Well, with a decent sized checkpoint_segments getting up2date can take
> > quite a noticeable amount of time...
> 
> So?  Slow restart is precisely the price you pay for a large checkpoint
> interval.
Well. no. If you stop the master in a orderly fashion it will checkpoint and 
thus have a sensible startup time. There is no such possibility for a standby  
without interaction with the master.

Andres

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services