Обсуждение: postgres 8.2.6 core dump when initialising.

Поиск
Список
Период
Сортировка

postgres 8.2.6 core dump when initialising.

От
Darren Reed
Дата:
Some minutes after doing a live reindex of all my tables,
I started to get these messages:
ERROR:  could not open relation with OID 16461
...for every table(!)

So I shutdown...

LOG:  received smart shutdown request
LOG:  shutting down
LOG:  database system is shut down

And reboot....

LOG:  database system was shut down at 2008-04-13 22:38:35 PDT
LOG:  checkpoint record is at 2/8F9B5C14
LOG:  redo record is at 2/8F9B5C14; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 0/669090; next OID: 24651
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system is ready
LOG:  server process (PID 19944) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2008-04-13 22:38:38 PDT
LOG:  checkpoint record is at 2/8F9B5C14
LOG:  redo record is at 2/8F9B5C14; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 0/669090; next OID: 24651
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 2/8F9B5C5C
LOG:  redo is not required
LOG:  database system is ready
LOG:  server process (PID 25402) was terminated by signal 11
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2008-04-13 22:38:51 PDT
LOG:  checkpoint record is at 2/8F9B5C5C
LOG:  redo record is at 2/8F9B5C5C; undo record is at 0/0; shutdown TRUE
LOG:  next transaction ID: 0/669090; next OID: 24651
LOG:  next MultiXactId: 1; next MultiXactOffset: 0
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  record with zero length at 2/8F9B5CA4
LOG:  redo is not required
LOG:  database system is ready
LOG:  server process (PID 13888) was terminated by signal 11
LOG:  terminating any other active server processes
...

 /usr/pkg/bin/psql --version
psql (PostgreSQL) 8.2.6
contains support for command-line editing

Program terminated with signal 11, Segmentation fault.
#0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400
2400                    LOAD_CRIT_INDEX(TriggerRelidNameIndexId);
(gdb) where
#0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400
#1  0x082ac19c in InitPostgres (dbname=0x83ba450 "postgres",
    username=0x83c1428 "postgres") at postinit.c:459
#2  0x0820b533 in PostgresMain (argc=4, argv=0x83c1458,
    username=0x83c1428 "postgres") at postgres.c:3143
#3  0x081dc572 in BackendRun (port=0x83bc800) at postmaster.c:2934
#4  0x081dbb7e in BackendStartup (port=0x83bc800) at postmaster.c:2561
#5  0x081d985c in ServerLoop () at postmaster.c:1214
#6  0x081d9292 in PostmasterMain (argc=3, argv=0xbfbfeb34) at
postmaster.c:966
#7  0x08187989 in main (argc=3, argv=0xbfbfeb34) at main.c:188
(gdb) p status
$1 = {hashp = 0x8, curBucket = 2, curEntry = 0x8360777}
(gdb) p idhentry
$2 = (RelIdCacheEnt *) 0x836b168
(gdb) p oldcxt
$3 = (MemoryContext) 0x83de258
(gdb) p needNewCacheFile
$4 = 1 '\001'
(gdb) p ird
$5 = (Relation) 0x0
(gdb) p *oldcxt
$1 = {type = T_AllocSetContext, methods = 0x8372de0, parent = 0x83c4080,
  firstchild = 0x0, nextchild = 0x83de1d0,
  name = 0x83de2b4 "TopTransactionContext"}
(gdb) p *idhentry
$2 = {reloid = 137801832, reldesc = 0xbbbed200}

How to fix this?


Re: postgres 8.2.6 core dump when initialising.

От
Tom Lane
Дата:
Darren Reed <darrenr+postgres@fastmail.net> writes:
> Program terminated with signal 11, Segmentation fault.
> #0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400
> 2400                    LOAD_CRIT_INDEX(TriggerRelidNameIndexId);
> (gdb) where
> #0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400

This appears to be the exact same problem you reported back in February:
http://archives.postgresql.org/pgsql-admin/2008-02/msg00370.php

I'm thinking there must be something about what you are doing that
is triggering this issue.  Care to give us a full data dump on your
maintenance habits?

It might be useful if we could look at a pg_filedump dump of your
pg_class table, too.
http://sources.redhat.com/rhdb/

            regards, tom lane

Re: postgres 8.2.6 core dump when initialising.

От
Darren Reed
Дата:
Tom Lane wrote:
> Darren Reed <darrenr+postgres@fastmail.net> writes:
> > Program terminated with signal 11, Segmentation fault.
> > #0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400
> > 2400                    LOAD_CRIT_INDEX(TriggerRelidNameIndexId);
> > (gdb) where
> > #0  0x0829845c in RelationCacheInitializePhase2 () at relcache.c:2400
>
> This appears to be the exact same problem you reported back in February:
> http://archives.postgresql.org/pgsql-admin/2008-02/msg00370.php
>
> I'm thinking there must be something about what you are doing that
> is triggering this issue.  Care to give us a full data dump on your
> maintenance habits?

So, from time to time I run an online pg_dump of all the tables as a backup
but not often enough (of course.)

Some time ago I saw this message:
ERROR:  index "pg_depend_reference_index" contains unexpected zero page
at block 23
HINT:  Please REINDEX it.
ERROR:  index "table_p_hash_idx" contains unexpected zero page at block 7
HINT:  Please REINDEX it.

...and the end result was that I needed to take the database
down into single user mode and reindex it.  I read up on it
and it appears that this problem occurs with specific work
load types - whatever that is, I appear to fit.
Not a big problem.

More recently, I've seen postgres get "stuck" doing an insert
and spin doing nothing.  I tried to grab a core with gcore but
it didn't cooperate.  The evidence for that was the usual "INSERT"
next to the postgres process.  This wasn't fun to deal with,
kill -INT had no affect, kill -TERM had no effect and this I was
left with kill -QUIT.  The database has always started up cleanly
after this, so I didn't think too much of it (should have done
another dump....sigh...)  After the last instance of this problem,
I ran an online REINDEX over all of the tables hoping that this
might solve the problem but not long after, I hit the OID problem.


> It might be useful if we could look at a pg_filedump dump of your
> pg_class table, too.

How do I know which one is pg_class?

I can tell which ones hold the data ;)

Darren


Re: postgres 8.2.6 core dump when initialising.

От
Tom Lane
Дата:
Darren Reed <darrenr+postgres@fastmail.net> writes:
> Some time ago I saw this message:
> ERROR:  index "pg_depend_reference_index" contains unexpected zero page
> at block 23
> HINT:  Please REINDEX it.
> ERROR:  index "table_p_hash_idx" contains unexpected zero page at block 7
> HINT:  Please REINDEX it.

[ squint... ]  If something is randomly zeroing out pages, and it
happens to hit pg_class, that could explain your troubles.  Maybe
you are dealing with a kernel bug or flaky hardware.

>> It might be useful if we could look at a pg_filedump dump of your
>> pg_class table, too.

> How do I know which one is pg_class?

It'd be $PGDATA/base/NNN/1259 where NNN is the OID of the broken
database.  With a dead system the easiest way to find out that OID
is to look at the text file $PGDATA/global/pg_database --- the
first number after a database's name is its OID.

I find that "pg_filedump -i -f FILENAME" gives the most useful output.

            regards, tom lane

Re: postgres 8.2.6 core dump when initialising.

От
Tom Lane
Дата:
Darren Reed <darrenr+postgres@fastmail.net> writes:
> See if the attached helps.

Well, here's the entry for the index it's trying to load:

>  Item  12 -- Length:  164  Offset: 6224 (0x1850)  Flags: USED
>   XMIN: 1  CMIN: 0  XMAX: 667033  CMAX|XVAC: 6  OID: 2701
>   Block Id: 7  linp Index: 3   Attributes: 27   Size: 36
>   infomask: 0x0511 (HASNULL|HASOID|XMIN_COMMITTED|XMAX_COMMITTED)
>   t_bits: [0]: 0xff [1]: 0xff [2]: 0xff [3]: 0x01

>   1850: 01000000 00000000 992d0a00 06000000  ........�-......
>   1860: 00000700 03001b00 110524ff ffff0100  ..........$.....
>   1870: 8d0a0000 70675f74 72696767 65725f74  �...pg_trigger_t
>   1880: 6772656c 69645f74 676e616d 655f696e  grelid_tgname_in
>   1890: 64657800 00000000 00000000 00000000  dex.............
>   18a0: 00000000 00000000 00000000 00000000  ................
>   18b0: 00000000 0b000000 00000000 0a000000  ................
>   18c0: 93010000 8d0a0000 00000000 02000000  �...�...........
>   18d0: 00004040 00000000 00000000 00006900  ..@@..........i.
>   18e0: 02000000 00000000 00000000 00000000  ................
>   18f0: 00000000                             ....

So it's been updated, as is to be expected if you just reindexed
everything, and the updated row went into block 7.  Only problem
is, there is no block 7 :-( ... pg_class ends at block 6.

There are 25 other entries that have update links pointing into
block 7, all of them with very recent XMAX (667033 to 667072,
and we saw you crashed at XID 669090).  So this is consistent
with the idea that your REINDEX was responsible for updating
all those rows.  But what happened to block 7?

What I am suspicious of is that you had an occurrence of the
apparent block-zeroing problem that took out block 7, and shortly
thereafter vacuum came along, found nothing in block 7, and truncated
it away.  If this happened then there should be something about
    relation "pg_class" page 7 is uninitialized --- fixing
in the postmaster log; this would be after the REINDEX and before
the crash, so I'm hoping it's still there?  Did you do a manual
vacuum after reindexing, or maybe autovacuum is on?

            regards, tom lane

Re: postgres 8.2.6 core dump when initialising.

От
Tom Lane
Дата:
Darren Reed <darrenr@fastmail.net> writes:
> Tom Lane wrote:
>> BTW, could you send me the contents of file 24579 in that database
>> (dump it with pg_filedump, same as before)?  According to your dump
>> that's the reindexed version of pg_class_oid_index, and I'm curious
>> to see whether it contains index entries pointing at block 7 of
>> pg_class.  If my theory is right then it will.

> Attached.

Sure enough, there are 26 index entries referencing block 7, which
matches the number of updates into that block that seemed to have
occurred according to pg_class.  So those updates definitely happened,
and then *something* made block 7 go away.

What's the platform here --- what kernel, what version exactly,
what hardware and disk subsystem?  I think your problem is somewhere
in there ...

            regards, tom lane