Обсуждение: Suggestions for post-mortem...

Поиск
Список
Период
Сортировка

Suggestions for post-mortem...

От
Philip Warner
Дата:
We just had a DB die quite nastily, and have no clear idea why.

Looking in the system logs shows nothing out of the ordinary, and
looking in the db logs shows a few odd records:

2006-01-25 12:25:31 EST [mail,5017]: ERROR:  failed to fetch new tuple
for AFTER trigger
2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "XXXX_pkey"
contains 1416 row versions, but table contains 1410 row versions
2006-01-25 12:26:01 EST [mail,93689]: HINT:  Rebuild the index with REINDEX.
2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "YYYY" contains
1416 row versions, but table contains 1410 row versions

...repeated several times for several indexes of the same table.

These messages occurred almost immediately before we noticed the dead
state of the DB. Over an hour before these messages there was a
deadlock, but that's not too worrying -- the DB was still OK.

After the above messages, about 80 rows were missing from the table, and
a REINDEX did not restore them (not really surprising). The table in
question has only a small number of rows (1400-ish), but gets updated up
to 5 to 10 times per second.

Thankfully, we had replication in place and just failed over, but we'd
like to try to understand what happened to the old DB.

Any suggestions where to start? Or what the first error might signify?
Or what to put in place to catch more details next time?

It's been running fine for several months (until now) using PG 8.0.3.









Re: Suggestions for post-mortem...

От
Alvaro Herrera
Дата:
Philip Warner wrote:
> We just had a DB die quite nastily, and have no clear idea why.
> 
> Looking in the system logs shows nothing out of the ordinary, and
> looking in the db logs shows a few odd records:
> 
> 2006-01-25 12:25:31 EST [mail,5017]: ERROR:  failed to fetch new tuple
> for AFTER trigger
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "XXXX_pkey"
> contains 1416 row versions, but table contains 1410 row versions
> 2006-01-25 12:26:01 EST [mail,93689]: HINT:  Rebuild the index with REINDEX.
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "YYYY" contains
> 1416 row versions, but table contains 1410 row versions

Seems like a rather severe bug.  Maybe race condition somewhere.  Note
that the trigger problem occurs 30 seconds before the VACUUM error shows
up, and being a small table I doubt a vacuum could take so long.

Can you confirm how long does the vacuum take to run?  Is this problem
isolated to this one table, or does it manifest somewhere else?  Do you
have other errors that may indicate a hardware problem?

It would be interesting to see where do the extra index entries point
to.  To do this we would need to read the complete pg_filedump of the
index however ... and if you REINDEXed it, the evidence is already gone.

-- 
Alvaro Herrera                        http://www.advogato.org/person/alvherre
"The Postgresql hackers have what I call a "NASA space shot" mentality.Quite refreshing in a world of "weekend drag
racer"developers."
 
(Scott Marlowe)


Re: Suggestions for post-mortem...

От
Tom Lane
Дата:
Philip Warner <pjw@rhyme.com.au> writes:
> We just had a DB die quite nastily, and have no clear idea why.

Define "die quite nastily" ... you haven't really said what went wrong.

> 2006-01-25 12:25:31 EST [mail,5017]: ERROR:  failed to fetch new tuple
> for AFTER trigger
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "XXXX_pkey"
> contains 1416 row versions, but table contains 1410 row versions
> 2006-01-25 12:26:01 EST [mail,93689]: HINT:  Rebuild the index with REINDEX.
> 2006-01-25 12:26:01 EST [mail,93689]: WARNING:  index "YYYY" contains
> 1416 row versions, but table contains 1410 row versions

> ...repeated several times for several indexes of the same table.

These could all be manifestations of the ReadBuffer bug fixed in 8.0.6.
Tickling that bug would result in zeroing out a recently-added table page,
which would result in (a) more index entries than table entries, and
(b) possible bleating from other processes wondering where their freshly
inserted tuples went.  Seems to square with the facts presented,
especially if the indexes all agree on how many tuples there should be
--- that makes it pretty unlikely that index corruption is the cause.
        regards, tom lane


Re: Suggestions for post-mortem...

От
Philip Warner
Дата:
Tom Lane wrote:
> Define "die quite nastily" ... you haven't really said what went wrong.
>   
We lost data from this table, and ended up with transactions rolled back
that were in fact committed (ie. data was in DB, program code thought
they were rolled back).

End result was we deemed the database to be in an unknown, uncorrectable
and unstable state.


> These could all be manifestations of the ReadBuffer bug fixed in 8.0.6.
> Tickling that bug would result in zeroing out a recently-added table page,
> which would result in (a) more index entries than table entries, and
> (b) possible bleating from other processes wondering where their freshly
> inserted tuples went.

This sounds consistent, I'd guess the 80 missing records correspond to
80 most recently updated. Not sure about the missing user -- I'll see
what I can find.


Thanks for the help -- we now have a probable cause, and a way forward.

Alvaro Herrera wrote:
> Can you confirm how long does the vacuum take to run?

Usually very quick, runs every minute -- so the problem occurred between
the two vacuums.

> isolated to this one table, or does it manifest somewhere else?  Do you
> have other errors that may indicate a hardware problem?
>
No.