Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

Поиск
Список
Период
Сортировка
От Ned Wolpert
Тема Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid
Дата
Msg-id CAFehBF=ktRXFex1tQOAx-cXUaph-jBXkhuSfO9oqgH_4PtNoOA@mail.gmail.com
обсуждение исходный текст
Ответы Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid  (Kevin Grittner <kgrittn@ymail.com>)
Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
(I originally posted this to pgsql-admin and was pointed to here instead.)

Folks-

  I'm doing a postmortem on a corruption event we had. I have an idea on
what happened, but not sure. I figure I'd share what happened and see if
I'm close to right here.

  Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
and even a nightly pg_dump all. 50G database.  Trying to update or delete a
row in a small (21 row, but heavily used table) would lock up completely.
Never finish. Removed all clients, restarted the db instance, no joy. Check
pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
up.

  Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients touching this row, fresh restart of the db,
and saw virtualtransactions against this same table where the values would
be -1/nnnnn were nnnnn was a huge number.  Turns out we had about 159
entries from different tables in the database.  Checked hot-standby and, of
course, no locks or anything.  Switched to hot-standby.

  Hot-standby instantly gained these locks, Also noticed that 2 prepared
transactions migrated to the hot-standby. Binary upgraded to 9.1.8, locks
still existed. Ended up executing the one thing we knew would work. Take
the site down, pg_dumpall to fresh instance. Everything is fine.

A little more background: We were running 9.1.4 back when 9.1.6 came out.
We saw there was possible corruption issues and did a binary upgrade and
reindexing. Everything seemed to be fine, but I wonder if we really had
problems back then. We rebuilt the hot-standby after the binary upgrade via
normal restore and wal-file replays.  I should also note that this row that
had the lock on it that would not go away, was created by an app server
that was killed (via -9) since it was non-responsive, and the row 'create
date' (in db and logs) is the exact time the app server was killed.

I was wondering if a) these virtualtransactions that start with '-1/'
indicate a problem, b) if this could have happened from pre 9.1.6
corruption that was fixed in 9.1.6. Or, could this have occurred when we
killed that app server? Or.... am I looking in the wrong place.

I do still have the old data directories so I can start them up and check
out the dataset. Any advice?

Currently running environment: CentOS 5.x
Used the http://yum.postgresql.org/9.1 repo...
Currently using PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
but was on 9.1.6 at failure time.
The hardware on both primary and secondary servers are in good condition,
raid5 via a hardware card with no issues.  Each a 2x4x2 (2 hyperthreaded
Xeon E5620 2.4Ghz, 4 cores each, 16 threads total) 32G Ram.

Data from pg_dumpall checked out fine; no data loss occurred that we could
tell. Just this transaction lock and what seemed like invalid backend ids
listed in the virtualtransaction from the pg_locks table/view.
(-1/verylargenumber)






--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe

В списке pgsql-general по дате отправления:

Предыдущее
От: bricklen
Дата:
Сообщение: Re: View to show privileges on views/tables/sequences/foreign tables
Следующее
От: Kevin Grittner
Дата:
Сообщение: Re: Full text and removing dashes from names