Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
Дата
Msg-id 16208.1502224491@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling  (Robert Haas <robertmhaas@gmail.com>)
Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling  (Andreas Seltenreich <seltenreich@gmx.de>)
Список pgsql-hackers
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I thought about weakening the assertions too, but I couldn't
>> see a fix of that kind that didn't seem mighty ad-hoc.

> More concretely, the present example seems no different than the
> ResourceOwner stuff emitting warnings on commit and remaining silent
> on abort.  We could make it complain on both commit and abort, but
> then it would fail spuriously because there's no other mechanism to
> release resources in the abort path, so we don't.  Similarly here, we
> have every reason to expect the check to pass during ERROR recovery
> but there is no reason to expect it to pass during FATAL recovery, yet
> as coded we will do the test anyway.  That's wrong.

I think that's arguing from expedience not principle.  We had every
reason to think it would pass during FATAL errors too, until we noticed
this specific misbehavior; and there is at least as much of an argument
that this is a bug in SearchCatCacheList as there is that the check
is too strong.

>> Now, there is some room to argue that AtEOXact_CatCache() is just
>> obsolete and we should remove it altogether; I don't think it's
>> caught a real bug since we invented resowners in 8.1.

> Yeah, the same thought crossed my mind.  IIUC, we'd crash if a
> catcache reference were acquired without CurrentResourceOwner being
> valid, so this is really just a belt-and-suspenders check.

Right.  It was worth keeping it around till we were sure all the bugs
were shaken out of ResourceOwners, but surely we crossed the point of
diminishing returns for that long ago.

> ...  I'm also not deadly opposed to
> redesigning the whole mechanism either, but I think that should be
> approached with a measure of caution: it might end up reducing
> reliability rather than increasing it.  I suggest in any case that we
> start with a surgical fix.

In the absence of clear evidence that there are similar bugs elsewhere,
I agree that redesigning FATAL exits would likely cause more problems
than it solves.  But I feel like it would be a good thing to test for.
I wonder if Andreas would be interested in trying the randomly-timed-
SIGTERM thing with sqlsmith.

In the meantime, I think my vote would be to remove AtEOXact_CatCache.
        regards, tom lane



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: [HACKERS] Infrastructure for JIT compiling tuple deforming
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] Timing-sensitive case in src/test/recovery TAP tests