Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Дата
Msg-id 20211122223436.ov3qa27pakxay6z7@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Peter Geoghegan <pg@bowt.ie>)
Ответы Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum  (Andres Freund <andres@anarazel.de>)
Список pgsql-bugs
Hi,

On 2021-11-22 12:41:32 -0800, Peter Geoghegan wrote:
> On Mon, Nov 22, 2021 at 9:59 AM Andres Freund <andres@anarazel.de> wrote:
> > I'm not quite sure what you referring to with "convincing me"? Whether to go
> > for something comprehensive or more minimal? I'm kind of agnostic on what the
> > right approach is.
> 
> I thought that you were clearly leaning in the direction of a
> minimal-only fix for backpatch. I just meant that there is no point in
> saying much more about it. It's not like I have a problem with your
> approach. My personal opinion is that we'd be somewhat better off if
> we went with my more comprehensive approach, even on 14. That's not
> how you see it - which is fair enough (reasoning about these risks is
> hard, and it tends to be quite subjective).

I think I'm actually just not sure what the better approach for the
backbranches is. I'm worried about the size of your patch, I'm worried about
the craziness of the current architecture (if you can call it that) of
heap_page_prune() with my patch.

I think either way your approach should incorporate the dedicated loop to
check the visibility - for performance reasons alone.


> > I think it's worth to clean up the regression test I wrote and use it
> > regardless of which version of the fix we end up choosing. But I'm a bit bit
> > on the fence - it's quite complicated.
> 
> +1 to the idea of keeping it somewhere. Without necessarily running it
> on the BF.
> 
> > OTOH, I think with the framework in place it'd not be too hard to write a few
> > more tests for odd pruning scenarios...
> 
> Can we commit a test like this without running it by default, at all?
> It's not like it has never been done before.

Is there a reason not to run it if we commit it? IME tests not run by default
tend to bitrot very quickly.  The only issue that I can think of is that it
might be hard to make the test useful and robust in the presence of
debug_discard_caches != 0.


> The nice thing about using amcheck for something like this (compared
> to pageinspect) is that we don't have to specify exact details about
> what it looks like when the test passes. Because passing the test just
> means the absence of any error.

Unless amcheck gets more in-depth there's just too much it doesn't show :(


> How hard would it be to just add amcheck coverage on HEAD? Something
> that goes just a bit further with LP_REDIRECT validation (currently
> verify_heapam does practically nothing like that). We should add
> comprehensive HOT chain validation too, but that can wait a little
> while longer. It's a lot more complicated.

What exactly do you mean with "amcheck coverage" in this context?


> Another thing is that the new assertions themselves are part of our
> overall testing strategy. I see that you've gone further with that
> than I did in your v3-0003-* patch. I should adopt that code to my
> more comprehensive fix for HEAD, while still keeping around the
> existing heap_prune_chain() assertions (perhaps just as
> documentation/comments to make the code easier to follow).

I found that style of check helpful because it notices problems sooner. What
took the longest debugging this issue was that simple assertions (e.g. during
heap_page_prune_execute() or heap_prune_record_unused()) would trigger well
after the actual bug. E.g. marking an item unused that is referenced by a
pre-existing LP_REDIRECT item can't easily be detected at that time. It'd
trigger an assertion at the time the LP_REDIRECT item is marked dead, but that
doesn't help to pinpoint the origin of the problem much.

I'm inclined to think that the costs of the check are worth incurring by
default for cassert builds.


I wish I knew of a good way to have pgbench check the validity of its
results. There's a huge difference between not crashing and returning correct
results. And there's just a lot of bugs that will result in valid-enough
on-disk/in-buffer data, but which are not correct.

Greetings,

Andres Freund



В списке pgsql-bugs по дате отправления:

Предыдущее
От: Peter Geoghegan
Дата:
Сообщение: Re: BUG #17255: Server crashes in index_delete_sort_cmp() due to race condition with vacuum
Следующее
От: Sandeep Thakkar
Дата:
Сообщение: Re: BUG #17290: Error Installation