Re: post-recovery amcheck expectations

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: post-recovery amcheck expectations
Дата
Msg-id 20231021035457.cb.nmisch@google.com
обсуждение исходный текст
Ответ на Re: post-recovery amcheck expectations  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: post-recovery amcheck expectations
Список pgsql-hackers
On Mon, Oct 09, 2023 at 04:46:26PM -0700, Peter Geoghegan wrote:
> On Wed, Oct 4, 2023 at 7:52 PM Noah Misch <noah@leadboat.com> wrote:

> Might make sense to test the fix for this issue using a similar
> approach: by adding custom code that randomly throws errors at a point
> that stresses the implementation. I'm referring to the point at which
> VACUUM is between the first and second phase of page deletion: right
> before (or directly after) _bt_unlink_halfdead_page() is called. That
> isn't fundamentally different to the approach from your TAP test, but
> seems like it might add some interesting variation.

My initial manual test was like that, actually.

> > I lean toward fixing this by
> > having amcheck scan left; if left links reach only half-dead or deleted pages,
> > that's as good as the present child block being P_LEFTMOST.
> 
> Also my preference.

Done mostly that way, except I didn't accept deleted pages.  Making this work
on !readonly would take more than that, and readonly shouldn't need that.

> > There's a
> > different error from bt_index_check(), and I've not yet studied how to fix
> > that:
> >
> >   ERROR:  left link/right link pair in index "not_leftmost_pk" not in agreement
> >   DETAIL:  Block=0 left block=0 left link from block=4294967295.
> 
> This looks like this might be a straightforward case of confusing
> P_NONE for InvalidBlockNumber (or vice-versa). They're both used to
> indicate "no such block" here.

Roughly so.  It turned out this scenario was passing leftcurrent=P_NONE to
bt_recheck_sibling_links(), causing that function to use BTPageGetOpaque() on
the metapage.

> > For some other amcheck expectations, the comments suggest reliance on the
> > bt_index_parent_check() ShareLock.  I haven't tried to make test cases for
> > them, but perhaps recovery can trick them the same way.  Examples:
> >
> >   errmsg("downlink or sibling link points to deleted block in index \"%s\"",
> >   errmsg("block %u is not leftmost in index \"%s\"",
> >   errmsg("block %u is not true root in index \"%s\"",
> 
> These are all much older. They're certainly all from before the
> relevant checks were first added (by commit d114cc53), and seem much
> less likely to be buggy.

After I fixed the original error, the "block %u is not leftmost" surfaced
next.  The attached patch fixes that, too.  I didn't investigate the others.
The original test was flaky in response to WAL flush timing, but this one
survives thousands of runs.

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bharath Rupireddy
Дата:
Сообщение: Remove extraneous break condition in logical slot advance function
Следующее
От: Andres Freund
Дата:
Сообщение: Re: Add support for AT LOCAL