Re: Assert() failures during RI checks

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: Assert() failures during RI checks
Дата
Msg-id 22523.1584903651@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Assert() failures during RI checks  (Antonin Houska <ah@cybertec.at>)
Ответы Re: Assert() failures during RI checks  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
Antonin Houska <ah@cybertec.at> writes:
> I was trying to figure out what exactly the "crosscheck snapshot" does in the
> RI checks, and hit some assertion failures:

Yeah, your example reproduces for me.

> I'm not familiar enough with this code but I wonder if it's only about
> incorrect assertions.

Mmm ... not really.  I dug around in the history a bit, and there are
a few moving parts here.  The case that's problematic is where the
"crosscheck" stanza starting at heapam.c:2639 fires, and replaces
result = TM_Ok with result = TM_Updated.  That causes the code to
go into the next stanza, which is all about trying to follow a ctid
update link so it can report a valid TID to the caller along with
the failure result.  But in this case, the tuple at hand has never
been updated, so the assertions fire.

The crosscheck stanza dates back to 55d85f42a, and at the time,
just setting result = TM_Updated was sufficient to make the
then-far-simpler next stanza do the right thing --- basically,
we wanted to just release whatever locks we hold and return the
failure result.  However, later patches added assertions to verify
that that next stanza was doing something sane ... and really, it
isn't in this case.

Another thing that is very peculiar in this area is that the initial
assertion in the second stanza allows the case of result == TM_Deleted.
It makes *no* sense to allow that if what we think we're doing is
chasing a ctid update link.  That alternative was not allowed until
really recently (Andres' commit 5db6df0c0, which appears otherwise
to be just refactoring), and I wonder how carefully it was really
thought through.

So I think there are two somewhat separable issues we have to address:

1. The crosscheck stanza can't get away with just setting result =
TM_Updated.  Maybe the best thing is for it not to try to share code
with what follows, but take responsibility for releasing locks and
returning consistent data to the caller on its own.  I'm not quite
sure what's the most consistent thing for it to return anyway, or
how much we care about what TID is returned in this case.  All we
really want is for an error to be raised, and I don't think the
error path is going to look too closely at the returned TID.

2. Does the following stanza actually need to allow for the case
of a deleted tuple as well as an updated one?  If so, blindly
trying to follow the ctid link is not so cool.

I think the situation in heap_update is just the same, though
it's unclear why the code is slightly different.  The extra
Assert in that crosscheck stanza certainly appears 100% redundant
with the later Assert, though both are wrong per this analysis.

In a non-assert build, this would just boil down to what TID
is returned along with the failure result code, so it might
be that there's no obvious bug, depending on what callers are
doing with that TID.  That would perhaps explain the lack of
associated field trouble reports.  These issues are pretty old
so you'd think we'd have heard about it if there were live bugs.

In any case, seems like we're missing some isolation test
coverage here ...

            regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Pavel Stehule
Дата:
Сообщение: Re: [HACKERS] [PATCH] Generic type subscripting
Следующее
От: "David G. Johnston"
Дата:
Сообщение: Re: Database recovery from tablespace only