Re: BUG #10432: failed to re-find parent key in index

From: Andres Freund
Subject: Re: BUG #10432: failed to re-find parent key in index
Msg-id: 20140604113519.GG1220@awork2.anarazel.de
In reply to: Re: BUG #10432: failed to re-find parent key in index (Greg Stark <stark@mit.edu>)
List: pgsql-bugs

Hi,

On 2014-06-04 12:14:27 +0100, Greg Stark wrote:
> Ok, I made some progress. It turns out this was a pre-existing problem
> in the master. They've been getting "failed to re-find parent" errors
> for weeks. Far longer than I have any WAL or backups for.

Ok.

> 1) Failed to re-find parent should perhaps not be FATAL to recovery.
> In fact any index replay error would really be nice not to have to
> crash on.

I think that's not really realistic. We'd need to put a significant
amount of machinery in for this to be workable. Suddenly a crash restart
doesn't guarantee that your indexes are there anymore? Not nice.

> All crashing does is prevent the user from being able to
> bring up their database and REINDEX the btree. This may be another use
> case for the machinery that would protect against corrupt hash indexes
> or user-defined indexes -- if we could mark the index invalid and
> proceed (perhaps ignoring subsequent records for it) that would be
> great.
>
> 2) When we see an abort record we could check for any cleanup actions
> triggered by that transaction and run them right away. I think the
> checkpoints (and maybe hot standby snapshots or vacuum cleanup
> records?) also include information about the oldest xid running, they
> would also let us prune the cleanup actions sooner. That would at
> least find the error sooner. In conjunction with (1) it would also
> mean subsequent restartpoints would be effective instead of
> suppressing restartpoints right to the end of recovery.

Heikki removed restartpoints from 9.4 altogether so most of these are
gone. As all of these, even if they were doable, sound far too large for
backpatching, I think it's luckily mostly done.

> 3) The lack of logs around an error during recovery makes it hard to
> decipher what's going on. It would be nice to see "Beginning Xlog
> cleanup (1 incomplete splits to replay)" and when it crashed "Last
> safe point to restart recovery is 324/ABCDEF". As it was it was a
> pretty big mystery why the database crashed, the logs made it appear
> as if it had started up fine.  And it was unclear why restarting it
> caused it to replay from the beginning, I thought maybe something was
> wrong with our scripts.

I think this should be fixed by setting up error context stack support
in two places: a) in StartupXLOG(), before the rm_cleanup() calls; b) in
versions before 9.4, inside the individual cleanup routines.
We do all that around redo routines, but, as evidenced here, that's not
always enough.
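
To illustrate the idea, here is a minimal standalone sketch of the error
context stack pattern. It is not PostgreSQL source: the real mechanism is
ErrorContextCallback and error_context_stack, while every name below
(ErrCtx, ctx_stack, report_error, run_cleanup_with_context,
btree_cleanup_sim) is a hypothetical stand-in. It also assumes the error
report returns to the caller, whereas a real ERROR would not return.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one entry on an error context stack. */
typedef struct ErrCtx
{
    struct ErrCtx *previous;                        /* next outer entry */
    void (*callback)(void *arg, char *buf, size_t buflen);
    void *arg;                                      /* passed to callback */
} ErrCtx;

static ErrCtx *ctx_stack = NULL;                    /* innermost entry first */

/* Format the message, then one CONTEXT line per stack entry. */
static void
report_error(const char *msg, char *out, size_t outlen)
{
    size_t used = (size_t) snprintf(out, outlen, "ERROR: %s\n", msg);

    for (ErrCtx *c = ctx_stack; c != NULL && used < outlen; c = c->previous)
    {
        char line[128];

        c->callback(c->arg, line, sizeof line);
        used += (size_t) snprintf(out + used, outlen - used,
                                  "CONTEXT: %s\n", line);
    }
}

/* Data the callback needs in order to describe the cleanup in progress. */
typedef struct CleanupInfo
{
    const char *rmgr_name;
} CleanupInfo;

static void
cleanup_ctx_callback(void *arg, char *buf, size_t buflen)
{
    const CleanupInfo *info = arg;

    snprintf(buf, buflen, "xlog cleanup for resource manager %s",
             info->rmgr_name);
}

/* Simulated cleanup routine that hits the error from this report. */
static void
btree_cleanup_sim(char *out, size_t outlen)
{
    report_error("failed to re-find parent key in index", out, outlen);
}

/* Push a context entry, run the cleanup routine, then pop the entry. */
static void
run_cleanup_with_context(const char *rmgr_name,
                         void (*cleanup)(char *, size_t),
                         char *out, size_t outlen)
{
    CleanupInfo info = { rmgr_name };
    ErrCtx ctx = { ctx_stack, cleanup_ctx_callback, &info };

    ctx_stack = &ctx;
    cleanup(out, outlen);
    assert(ctx_stack == &ctx);                      /* stack must be balanced */
    ctx_stack = ctx.previous;
}
```

With the entry pushed around the cleanup call, the report names the phase
that failed instead of leaving only the bare error message in the log.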

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
