Re: BUG #10432: failed to re-find parent key in index

From: Greg Stark
Subject: Re: BUG #10432: failed to re-find parent key in index
Msg-id: CAM-w4HM3wLZU_qd6fU7XhEiPX9DO-R_FBszfQqE9GhrNy-kfYw@mail.gmail.com
In reply to: Re: BUG #10432: failed to re-find parent key in index  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: BUG #10432: failed to re-find parent key in index  (Greg Stark <stark@mit.edu>)
Re: BUG #10432: failed to re-find parent key in index  (Andres Freund <andres@2ndquadrant.com>)
Re: BUG #10432: failed to re-find parent key in index  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-bugs

Ok, I made some progress. It turns out this was a pre-existing problem
in the master. They've been getting "failed to re-find parent" errors
for weeks, far longer than I have any WAL or backups for.

What I did find that was interesting is that this error basically made
the backups worthless. I could build a hot standby and connect and
query it. But as soon as recovery finished, it would try to clean up
the incomplete split and fail. Because it had noticed the incomplete
split, it had skipped every restartpoint, so the next time I tried to
start it, it insisted on restarting recovery from the beginning. If we
had been lucky enough not to do any page splits in the broken index
while the backup was being taken, all would have been fine. But that
doesn't seem to have happened, so all the backups were unrecoverable.
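
To make the failure mode above concrete, here is a toy model of it.
This is only a sketch under my own assumptions, not PostgreSQL code:
the class ToyRecovery, the record kinds and the LSN numbers are all
made up for illustration. The one rule it encodes is the one described
above: a restartpoint is skipped whenever an incomplete split is still
pending.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecovery:
    """Toy model of WAL replay (hypothetical, not PostgreSQL code)."""
    pending_splits: set = field(default_factory=set)  # splits with no parent insert yet
    last_restartpoint: int = 0                        # LSN a restart would replay from

    def replay(self, lsn, kind, split_id=None):
        if kind == "split":
            self.pending_splits.add(split_id)
        elif kind == "parent_insert":
            self.pending_splits.discard(split_id)
        elif kind == "checkpoint":
            # A restartpoint is only taken when no cleanup action is pending.
            if not self.pending_splits:
                self.last_restartpoint = lsn


def run(records):
    rec = ToyRecovery()
    for lsn, kind, split_id in records:
        rec.replay(lsn, kind, split_id)
    return rec


# Split "A" never gets its parent insert, so it stays pending forever.
broken_wal = [
    (10, "split", "A"),
    (20, "checkpoint", None),      # skipped: A is pending
    (30, "split", "B"),
    (40, "parent_insert", "B"),
    (50, "checkpoint", None),      # still skipped: A is pending
]
# The same WAL without the dangling split.
healthy_wal = [r for r in broken_wal if r[2] != "A"]

broken = run(broken_wal)
healthy = run(healthy_wal)
print(broken.last_restartpoint, healthy.last_restartpoint)
```

In the broken case last_restartpoint never moves off its initial
value, which is the "restart recovery from the beginning" behaviour;
in the healthy case it advances to the last checkpoint.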

So a few thoughts on how to improve things:

1) Failed to re-find parent should perhaps not be FATAL to recovery.
In fact, it would be really nice not to have to crash on any index
replay error. All crashing does is prevent the user from being able to
bring up their database and REINDEX the btree. This may be another use
case for the machinery that would protect against corrupt hash indexes
or user-defined indexes -- if we could mark the index invalid and
proceed (perhaps ignoring subsequent records for it) that would be
great.

2) When we see an abort record we could check for any cleanup actions
triggered by that transaction and run them right away. I think the
checkpoints (and maybe hot standby snapshots or vacuum cleanup
records?) also include information about the oldest xid running; they
would also let us prune the cleanup actions sooner. That would at
least find the error sooner. In conjunction with (1) it would also
mean subsequent restartpoints would be effective instead of
restartpoints being suppressed right to the end of recovery.

3) The lack of logs around an error during recovery makes it hard to
decipher what's going on. It would be nice to see "Beginning Xlog
cleanup (1 incomplete splits to replay)" and when it crashed "Last
safe point to restart recovery is 324/ABCDEF". As it was, it was a
pretty big mystery why the database crashed; the logs made it appear
as if it had started up fine. And it was unclear why restarting it
caused it to replay from the beginning; I thought maybe something was
wrong with our scripts.
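
The three ideas above can be sketched together in a toy model. Again
this is only an illustration under my own assumptions, not PostgreSQL
code: ToyRecovery, broken_indexes, the record kinds, and the idea that
each pending split carries the xid that caused it are all hypothetical.
It shows cleanup being run as soon as the abort record is seen (2), a
failed cleanup marking the index invalid instead of stopping recovery
(1), and a log line for each step (3).

```python
from dataclasses import dataclass, field


@dataclass
class ToyRecovery:
    """Toy model of the proposed replay behaviour (hypothetical)."""
    broken_indexes: frozenset = frozenset()      # indexes where the parent key cannot be re-found
    pending: dict = field(default_factory=dict)  # split_id -> (xid, index)
    invalid: set = field(default_factory=set)    # indexes marked invalid during replay
    last_restartpoint: int = 0
    log: list = field(default_factory=list)

    def _finish_split(self, split_id):
        _xid, index = self.pending.pop(split_id)
        if index in self.broken_indexes:
            # Proposal 1: mark the index invalid rather than failing recovery.
            self.invalid.add(index)
            self.log.append(f"index {index} marked invalid, REINDEX needed")

    def replay(self, lsn, kind, xid=None, index=None, split_id=None):
        if index in self.invalid:
            return  # ignore subsequent records for an invalid index
        if kind == "split":
            self.pending[split_id] = (xid, index)
        elif kind == "parent_insert":
            self.pending.pop(split_id, None)
        elif kind == "abort":
            # Proposal 2: run this transaction's cleanup actions right away.
            mine = [s for s, (x, _) in self.pending.items() if x == xid]
            if mine:
                # Proposal 3: say what is about to happen.
                self.log.append(
                    f"Beginning Xlog cleanup ({len(mine)} incomplete splits to replay)")
            for s in mine:
                self._finish_split(s)
        elif kind == "checkpoint" and not self.pending:
            self.last_restartpoint = lsn
            self.log.append(f"Last safe point to restart recovery is {lsn}")


rec = ToyRecovery(broken_indexes=frozenset({"idx_a"}))
rec.replay(10, "split", xid=7, index="idx_a", split_id="A")
rec.replay(20, "abort", xid=7)             # cleanup runs here, not at end of recovery
rec.replay(30, "checkpoint")               # restartpoint is no longer suppressed
rec.replay(40, "split", xid=8, index="idx_b", split_id="B")
rec.replay(50, "parent_insert", index="idx_b", split_id="B")
rec.replay(60, "checkpoint")
print("\n".join(rec.log))
```

With the cleanup pulled forward to the abort record, the bad index is
flagged at that point and both later checkpoints become usable
restartpoints, so a restart would not have to replay from the
beginning.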
