Обсуждение: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
[HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Jeevan Chalke
		    Дата:
		        Hi,
We have observed a random server crash (FailedAssertion), while running few tests at our end. Stack-trace is attached.
By looking at the stack-trace, and as discussed it with my team members; what we have observed that in SearchCatCacheList(), we are incrementing refcount and then decrementing it at the end. However for some reason, if we are in TRY() block (where we increment the refcount), and hit with any interrupt, we failed to decrement the refcount due to which later we get assertion failure.
To mimic the scenario, I have added a sleep in SearchCatCacheList() as given below:
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e7e8e3b..eb6d4b5 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1520,6 +1520,9 @@ SearchCatCacheList(CatCache *cache,
hashValue = CatalogCacheComputeTupleHashValue(cache, ntp);
hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
+ elog(INFO, "Sleeping for 0.1 seconds.");
+ pg_usleep(100000L); /* 0.1 seconds */
+
bucket = &cache->cc_bucket[hashIndex];
dlist_foreach(iter, bucket)
{
And then followed these steps to get a server crash:
-- Terminal 1
DROP TYPE typ;
DROP FUNCTION func(x int);
CREATE TYPE typ AS (X VARCHAR(50), Y INT);
CREATE OR REPLACE FUNCTION func(x int) RETURNS int AS $$
DECLARE
rec typ;
var2 numeric;
BEGIN
RAISE NOTICE 'Function Called.';
REC.X := 'Hello';
REC.Y := 0;
  
IF (rec.Y + var2) = 0 THEN
RAISE NOTICE 'Check Pass';
END IF;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
SELECT pg_backend_pid();
SELECT func(1);
-- Terminal 2, should be run in parallel when SELECT func(1) is in progress in terminal 1.
SELECT pg_terminate_backend(<pid of backend obtained in terminal 1>);
I thought it worth posting here to get others attention.
I have observed this on the master branch, but can also be reproducible on back-branches.
Thanks
--
			
		We have observed a random server crash (FailedAssertion), while running few tests at our end. Stack-trace is attached.
By looking at the stack-trace, and as discussed it with my team members; what we have observed that in SearchCatCacheList(), we are incrementing refcount and then decrementing it at the end. However for some reason, if we are in TRY() block (where we increment the refcount), and hit with any interrupt, we failed to decrement the refcount due to which later we get assertion failure.
To mimic the scenario, I have added a sleep in SearchCatCacheList() as given below:
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e7e8e3b..eb6d4b5 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1520,6 +1520,9 @@ SearchCatCacheList(CatCache *cache,
hashValue = CatalogCacheComputeTupleHashValue(cache, ntp);
hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+ elog(INFO, "Sleeping for 0.1 seconds.");
+ pg_usleep(100000L); /* 0.1 seconds */
+
bucket = &cache->cc_bucket[hashIndex];
dlist_foreach(iter, bucket)
{
And then followed these steps to get a server crash:
-- Terminal 1
DROP TYPE typ;
DROP FUNCTION func(x int);
CREATE TYPE typ AS (X VARCHAR(50), Y INT);
CREATE OR REPLACE FUNCTION func(x int) RETURNS int AS $$
DECLARE
rec typ;
var2 numeric;
BEGIN
RAISE NOTICE 'Function Called.';
REC.X := 'Hello';
REC.Y := 0;
IF (rec.Y + var2) = 0 THEN
RAISE NOTICE 'Check Pass';
END IF;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
SELECT pg_backend_pid();
SELECT func(1);
-- Terminal 2, should be run in parallel when SELECT func(1) is in progress in terminal 1.
SELECT pg_terminate_backend(<pid of backend obtained in terminal 1>);
I thought it worth posting here to get others attention.
I have observed this on the master branch, but can also be reproducible on back-branches.
Thanks
--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company
Вложения
Jeevan Chalke <jeevan.chalke@enterprisedb.com> writes:
> We have observed a random server crash (FailedAssertion), while running few
> tests at our end. Stack-trace is attached.
> By looking at the stack-trace, and as discussed it with my team members;
> what we have observed that in SearchCatCacheList(), we are incrementing
> refcount and then decrementing it at the end. However for some reason, if
> we are in TRY() block (where we increment the refcount), and hit with any
> interrupt, we failed to decrement the refcount due to which later we get
> assertion failure.
Hm.  So SearchCatCacheList has a PG_TRY block that is meant to release
those refcounts, but if you hit the backend with a SIGTERM while it's
in that function, control goes out through elog(FATAL) which doesn't
execute the PG_CATCH cleanup.  But it does do AbortTransaction which
calls AtEOXact_CatCache, and that is expecting that all the cache
refcounts have reached zero.
We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
of plain PG_TRY.  But I have an itchy feeling that there may be a lot
of places with similar issues.  Should we be revisiting the basic way
that elog(FATAL) works, to make it less unlike elog(ERROR)?
        regards, tom lane
			
		Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Tue, Aug 8, 2017 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> By looking at the stack-trace, and as discussed it with my team members; >> what we have observed that in SearchCatCacheList(), we are incrementing >> refcount and then decrementing it at the end. However for some reason, if >> we are in TRY() block (where we increment the refcount), and hit with any >> interrupt, we failed to decrement the refcount due to which later we get >> assertion failure. > > Hm. So SearchCatCacheList has a PG_TRY block that is meant to release > those refcounts, but if you hit the backend with a SIGTERM while it's > in that function, control goes out through elog(FATAL) which doesn't > execute the PG_CATCH cleanup. But it does do AbortTransaction which > calls AtEOXact_CatCache, and that is expecting that all the cache > refcounts have reached zero. > > We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead > of plain PG_TRY. But I have an itchy feeling that there may be a lot > of places with similar issues. Should we be revisiting the basic way > that elog(FATAL) works, to make it less unlike elog(ERROR)? I'm not sure what the technically best fix here is. It strikes me that the charter of FATAL ought to be to exit the backend cleanly, safely, and expeditiously. PG_ENSURE_ERROR_CLEANUP, or some similar mechanism, is necessary when we need to correct the contents of shared memory so that other backends can continue to function, but there's no such problem here. You're proposing to do more work in the FATAL pathway so that the debugging cross-checks in the FATAL pathway will pass, which is not an entirely palatable idea. It's definitely frustrating that the ERROR and FATAL pathways are so different - that generated a surprising amount of the work around DSM - but I'm still skeptical about a solution that involves doing more cleanup of what are essentially irrelevant backend-local data structures in the FATAL path. Arguably, this assertion is merely overzealous; we could skip this processing when proc_exit_inprogress, for example. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 8, 2017 at 11:36 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> We could respond to this by using PG_ENSURE_ERROR_CLEANUP there instead
>> of plain PG_TRY.  But I have an itchy feeling that there may be a lot
>> of places with similar issues.  Should we be revisiting the basic way
>> that elog(FATAL) works, to make it less unlike elog(ERROR)?
> ... Arguably, this assertion is merely overzealous; we could skip
> this processing when proc_exit_inprogress, for example.
Yeah, I thought about weakening the assertions too, but I couldn't
see a fix of that kind that didn't seem mighty ad-hoc.
Now, there is some room to argue that AtEOXact_CatCache() is just
obsolete and we should remove it altogether; I don't think it's
caught a real bug since we invented resowners in 8.1.  Short of that,
though, I'm not really happy with just arbitrarily weakening the
checks.
But, again, the larger question to my mind is where else we may have
similar issues.  There's certainly never been any methodical attempt
to see whether elog(FATAL) will work everywhere.
        regards, tom lane
			
		Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yeah, I thought about weakening the assertions too, but I couldn't > see a fix of that kind that didn't seem mighty ad-hoc. I don't really see what's ad-hoc about skipping it in the case where a FATAL is in progress. I mean, skipping a sanity check only in the cases where we know it might pass - and are OK with the fact that it might not pass - seems to me to be an extremely difficult policy to argue against on rational grounds. That's just the definition of writing correct sanity checks. More concretely, the present example seems no different than the ResourceOwner stuff emitting warnings on commit and remaining silent on abort. We could make it complain on both commit and abort, but then it would fail spuriously because there's no other mechanism to release resources in the abort path, so we don't. Similarly here, we have every reason to expect the check to pass during ERROR recovery but there is no reason to expect it to pass during FATAL recovery, yet as coded we will do the test anyway. That's wrong. > Now, there is some room to argue that AtEOXact_CatCache() is just > obsolete and we should remove it altogether; I don't think it's > caught a real bug since we invented resowners in 8.1. Yeah, the same thought crossed my mind. IIUC, we'd crash if a catcache reference were acquired without CurrentResourceOwner being valid, so this is really just a belt-and-suspenders check. > But, again, the larger question to my mind is where else we may have > similar issues. There's certainly never been any methodical attempt > to see whether elog(FATAL) will work everywhere. IIRC, you raised a similar objection back when Bruce added pg_terminate_backend(), which caused that change to be briefly reverted before ultimately being put back. Despite the hardening you did back then, I think it's highly likely that bugs remain in that path, and I am of course not opposed to trying to improve the situation. That having been said, the bugs that remain are probably mostly quite low-probability, because otherwise we'd have found them by now. I think parallel query is likely to flush out a few of the ones that remain by creating a new class of backends that terminate after a single query and may get killed by the leader at any time; that's how we discovered this issue. Fuzz testing could be done, too, e.g. run something like sqlsmith simultaneously in many sessions while killing off backends at random. I'm also not deadly opposed to redesigning the whole mechanism either, but I think that should be approached with a measure of caution: it might end up reducing reliability rather than increasing it. I suggest in any case that we start with a surgical fix. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 8, 2017 at 12:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yeah, I thought about weakening the assertions too, but I couldn't
>> see a fix of that kind that didn't seem mighty ad-hoc.
> More concretely, the present example seems no different than the
> ResourceOwner stuff emitting warnings on commit and remaining silent
> on abort.  We could make it complain on both commit and abort, but
> then it would fail spuriously because there's no other mechanism to
> release resources in the abort path, so we don't.  Similarly here, we
> have every reason to expect the check to pass during ERROR recovery
> but there is no reason to expect it to pass during FATAL recovery, yet
> as coded we will do the test anyway.  That's wrong.
I think that's arguing from expedience not principle.  We had every
reason to think it would pass during FATAL errors too, until we noticed
this specific misbehavior; and there is at least as much of an argument
that this is a bug in SearchCatCacheList as there is that the check
is too strong.
>> Now, there is some room to argue that AtEOXact_CatCache() is just
>> obsolete and we should remove it altogether; I don't think it's
>> caught a real bug since we invented resowners in 8.1.
> Yeah, the same thought crossed my mind.  IIUC, we'd crash if a
> catcache reference were acquired without CurrentResourceOwner being
> valid, so this is really just a belt-and-suspenders check.
Right.  It was worth keeping it around till we were sure all the bugs
were shaken out of ResourceOwners, but surely we crossed the point of
diminishing returns for that long ago.
> ...  I'm also not deadly opposed to
> redesigning the whole mechanism either, but I think that should be
> approached with a measure of caution: it might end up reducing
> reliability rather than increasing it.  I suggest in any case that we
> start with a surgical fix.
In the absence of clear evidence that there are similar bugs elsewhere,
I agree that redesigning FATAL exits would likely cause more problems
than it solves.  But I feel like it would be a good thing to test for.
I wonder if Andreas would be interested in trying the randomly-timed-
SIGTERM thing with sqlsmith.
In the meantime, I think my vote would be to remove AtEOXact_CatCache.
        regards, tom lane
			
		Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > In the meantime, I think my vote would be to remove AtEOXact_CatCache. In all supported branches? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> In the meantime, I think my vote would be to remove AtEOXact_CatCache.
> In all supported branches?
Whatever we do about this issue, I don't feel a need to do it further
back than HEAD.  It's a non-problem except in an assert-enabled build,
and we don't recommend running those for production, only development.
        regards, tom lane
			
		Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Andreas Seltenreich
		    Дата:
		        Tom Lane writes: > I wonder if Andreas would be interested in trying the randomly-timed- > SIGTERM thing with sqlsmith. Will do. Won't miss this chance to try out discostu's extension pg_rage_terminator[1] :-) regards, Andreas Footnotes: [1] https://github.com/disco-stu/pg_rage_terminator
Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote: > Will do. Won't miss this chance to try out discostu's extension > pg_rage_terminator[1] :-) > [1] https://github.com/disco-stu/pg_rage_terminator Oh, that's *awesome*. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Tue, Aug 8, 2017 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> In the meantime, I think my vote would be to remove AtEOXact_CatCache. > >> In all supported branches? > > Whatever we do about this issue, I don't feel a need to do it further > back than HEAD. It's a non-problem except in an assert-enabled build, > and we don't recommend running those for production, only development. Sure, but people still do testing and development against older branches - bug fixes, for example. It doesn't make much sense to me to leave code that we know does the wrong thing in the back branches. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Michael Paquier
		    Дата:
		        On Thu, Aug 10, 2017 at 9:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Thu, Aug 10, 2017 at 2:50 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote: >> Will do. Won't miss this chance to try out discostu's extension >> pg_rage_terminator[1] :-) >> [1] https://github.com/disco-stu/pg_rage_terminator > > Oh, that's *awesome*. You can do that at query level using the planner hook: https://github.com/michaelpq/pg_plugins/tree/master/pg_panic The PANIC should be lowered to a FATAL though. -- Michael
Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Andreas Seltenreich
		    Дата:
		        Tom Lane writes:
> I wonder if Andreas would be interested in trying the randomly-timed-
> SIGTERM thing with sqlsmith.
So far, most of the core dumps generated are Jeevan's assertion failing
with backtraces through SearchCatCacheList.  The rest is failing this
assertion:
   TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)
Example backtrace below.  They all happened during a rollback statement.
Testing was done on master at 2336f84284.
regards,
Andreas
Core was generated by `postgres: smith regression [local] ROLLBACK                      '.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1  0x00007f4c26d3240a in __GI_abort () at abort.c:89
#2  0x0000559d18897a73 in ExceptionalCondition (conditionName=conditionName@entry=0x559d18a81370 "!(portal->cleanup ==
((void*)0))", errorType=errorType@entry=0x559d188e3f7d "FailedAssertion", fileName=fileName@entry=0x559d18a81013
"portalmem.c",lineNumber=lineNumber@entry=846) at assert.c:54
 
#3  0x0000559d188c42f1 in AtCleanup_Portals () at portalmem.c:846
#4  0x0000559d18536cb7 in CleanupTransaction () at xact.c:2652
#5  0x0000559d1853b825 in AbortOutOfAnyTransaction () at xact.c:4278
#6  0x0000559d188a7799 in ShutdownPostgres (code=<optimized out>, arg=<optimized out>) at postinit.c:1146
#7  0x0000559d1876b4e9 in shmem_exit (code=code@entry=1) at ipc.c:228
#8  0x0000559d1876b5fa in proc_exit_prepare (code=code@entry=1) at ipc.c:185
#9  0x0000559d1876b688 in proc_exit (code=code@entry=1) at ipc.c:102
#10 0x0000559d188999b1 in errfinish (dummy=<optimized out>) at elog.c:543
#11 0x0000559d1878fefa in ProcessInterrupts () at postgres.c:2841
#12 0x0000559d18790829 in ProcessInterrupts () at postgres.c:2828
#13 0x0000559d18795395 in PortalRunMulti (portal=portal@entry=0x559d197f2bf0, isTopLevel=isTopLevel@entry=1 '\001',
setHoldSnapshot=setHoldSnapshot@entry=0'\000', dest=dest@entry=0x559d19850c40, altdest=altdest@entry=0x559d19850c40,
completionTag=completionTag@entry=0x7ffc04f1b560"") at pquery.c:1239
 
#14 0x0000559d18796069 in PortalRun (portal=portal@entry=0x559d197f2bf0, count=count@entry=9223372036854775807,
isTopLevel=isTopLevel@entry=1'\001', run_once=run_once@entry=1 '\001', dest=dest@entry=0x559d19850c40,
altdest=altdest@entry=0x559d19850c40,completionTag=0x7ffc04f1b560 "") at pquery.c:799
 
#15 0x0000559d18791dca in exec_simple_query (query_string=0x559d1984fe00 "ROLLBACK;") at postgres.c:1099
#16 0x0000559d18793af1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x559d197fa078, dbname=<optimized out>,
username=<optimizedout>) at postgres.c:4090
 
#17 0x0000559d184a3428 in BackendRun (port=0x559d197e8f00) at postmaster.c:4357
#18 BackendStartup (port=0x559d197e8f00) at postmaster.c:4029
#19 ServerLoop () at postmaster.c:1753
#20 0x0000559d1871ad65 in PostmasterMain (argc=3, argv=0x559d197be5a0) at postmaster.c:1361
#21 0x0000559d184a4a6d in main (argc=3, argv=0x559d197be5a0) at main.c:228
			
		Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> I wonder if Andreas would be interested in trying the randomly-timed-
>> SIGTERM thing with sqlsmith.
> So far, most of the core dumps generated are Jeevan's assertion failing
> with backtraces through SearchCatCacheList.  The rest is failing this
> assertion:
>     TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)
> Example backtrace below.  They all happened during a rollback statement.
> Testing was done on master at 2336f84284.
Interesting.  I can reproduce this by forcing a FATAL exit right there,
eg by adding
        if (pstmt->utilityStmt && IsA(pstmt->utilityStmt, TransactionStmt) &&
            ((TransactionStmt *) pstmt->utilityStmt)->kind == TRANS_STMT_ROLLBACK)
            InterruptPending = ProcDiePending = true;
before PortalRunMulti's CHECK_FOR_INTERRUPTS.  But it only fails if the
rollback is trying to exit a transaction that's already suffered an error.
The explanation seems to be:
1. Because we already aborted the transaction, xact.c is in state
blockState == TBLOCK_ABORT.
2. This causes AbortOutOfAnyTransaction to think that it can skip
doing AbortTransaction() and go straight to CleanupTransaction().
3. AtCleanup_Portals() is expecting that we already ran, and cleared,
the portal cleanup hook (PortalCleanup) for every live portal.  But
we have not done so for the active portal that we were in the midst
of running ROLLBACK in.  So it fails the mentioned assertion.
Thus, the basic problem is that we get to CleanupTransaction() without
having done PortalDrop on the portal we're running ROLLBACK in.
We could take this as an argument for simply removing that assertion,
which would mean that when AtCleanup_Portals calls PortalDrop, the cleanup
hook would get run, the same as it would have if exec_simple_query had had
a chance to drop the portal.  However, I'm still pretty afraid of allowing
arbitrary code to execute during CleanupTransaction().  What seems like
a better idea is to make AtCleanup_Portals just summarily clear the
cleanup hook, perhaps while emitting a warning.
I tried that (see portalmem.c changes in attached patch), and what I got
was this:
regression=# rollback;
WARNING:  skipping cleanup for portal ""
FATAL:  terminating connection due to administrator command
FATAL:  cannot drop active portal ""
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
or in the server log:
2017-08-13 13:49:50.312 EDT [8737] FATAL:  terminating connection due to administrator command
2017-08-13 13:49:50.312 EDT [8737] STATEMENT:  rollback;
2017-08-13 13:49:50.312 EDT [8737] WARNING:  skipping cleanup for portal ""
2017-08-13 13:49:50.312 EDT [8737] FATAL:  cannot drop active portal ""
Well, we could tolerate that maybe, but it doesn't seem like it's actually
acting nicely; the active portal is still creating problems for us.
After some further thought, I decided to try making
AbortOutOfAnyTransaction call AtAbort_Portals() in this situation, thus
basically doing the minimum part of AbortTransaction() needed to clean up
the active portal.  That almost worked --- my initial try got an assertion
failure in mcxt.c, because somebody was trying to drop the
CurrentMemoryContext.  So the minimum part of AbortTransaction that we
need here is really AtAbort_Memory + AtAbort_Portals.  After further
thought I decided it'd be a good idea to phrase that as an unconditional
AtAbort_Memory at the top of AbortOutOfAnyTransaction, thus making sure
we are in some valid context to start with; and then, in case the loop
itself doesn't have anything to do, we need AtCleanup_Memory() at the
bottom of the function to revert CurrentMemoryContext to TopMemoryContext.
In short, the attached patch seems to fix it nicely.  We definitely need
the xact.c changes, but the ones in portalmem.c are perhaps optional, as
in theory now the assertion will never be violated.  But I'm inclined
to keep the portalmem changes anyway, as that will make it more robust
if the situation ever happens in a non-assert build.
Note: there are a couple of comments elsewhere in portalmem.c that need
adjustment if we keep the portalmem changes.  I have not bothered to do
that yet, as it's just cosmetic.
Comments?
            regards, tom lane
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 50c3c3b..1eabc67 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** AbortOutOfAnyTransaction(void)
*** 4233,4238 ****
--- 4233,4241 ----
  {
      TransactionState s = CurrentTransactionState;
+     /* Ensure we're not running in a doomed memory context */
+     AtAbort_Memory();
+
      /*
       * Get out of any transaction or nested transaction
       */
*************** AbortOutOfAnyTransaction(void)
*** 4274,4280 ****
                  break;
              case TBLOCK_ABORT:
              case TBLOCK_ABORT_END:
!                 /* AbortTransaction already done, still need Cleanup */
                  CleanupTransaction();
                  s->blockState = TBLOCK_DEFAULT;
                  break;
--- 4277,4290 ----
                  break;
              case TBLOCK_ABORT:
              case TBLOCK_ABORT_END:
!
!                 /*
!                  * AbortTransaction is already done, still need Cleanup.
!                  * However, if we failed partway through running ROLLBACK,
!                  * there will be an active portal running that command, which
!                  * we need to shut down before doing CleanupTransaction.
!                  */
!                 AtAbort_Portals();
                  CleanupTransaction();
                  s->blockState = TBLOCK_DEFAULT;
                  break;
*************** AbortOutOfAnyTransaction(void)
*** 4297,4302 ****
--- 4307,4320 ----
              case TBLOCK_SUBABORT_END:
              case TBLOCK_SUBABORT_RESTART:
                  /* As above, but AbortSubTransaction already done */
+                 if (s->curTransactionOwner)
+                 {
+                     /* As in TBLOCK_ABORT, might have a live portal to zap */
+                     AtSubAbort_Portals(s->subTransactionId,
+                                        s->parent->subTransactionId,
+                                        s->curTransactionOwner,
+                                        s->parent->curTransactionOwner);
+                 }
                  CleanupSubTransaction();
                  s = CurrentTransactionState;    /* changed by pop */
                  break;
*************** AbortOutOfAnyTransaction(void)
*** 4305,4310 ****
--- 4323,4331 ----
      /* Should be out of all subxacts now */
      Assert(s->parent == NULL);
+
+     /* If we didn't actually have anything to do, revert to TopMemoryContext */
+     AtCleanup_Memory();
  }
  /*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 5983aed..e3a3526 100644
*** a/src/backend/utils/mmgr/portalmem.c
--- b/src/backend/utils/mmgr/portalmem.c
*************** AtCleanup_Portals(void)
*** 842,849 ****
          if (portal->portalPinned)
              portal->portalPinned = false;
!         /* We had better not be calling any user-defined code here */
!         Assert(portal->cleanup == NULL);
          /* Zap it. */
          PortalDrop(portal, false);
--- 842,856 ----
          if (portal->portalPinned)
              portal->portalPinned = false;
!         /*
!          * We had better not call any user-defined code during cleanup, so if
!          * the cleanup hook hasn't been run yet, too bad.
!          */
!         if (PointerIsValid(portal->cleanup))
!         {
!             elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
!             portal->cleanup = NULL;
!         }
          /* Zap it. */
          PortalDrop(portal, false);
*************** AtSubCleanup_Portals(SubTransactionId my
*** 1026,1033 ****
          if (portal->portalPinned)
              portal->portalPinned = false;
!         /* We had better not be calling any user-defined code here */
!         Assert(portal->cleanup == NULL);
          /* Zap it. */
          PortalDrop(portal, false);
--- 1033,1047 ----
          if (portal->portalPinned)
              portal->portalPinned = false;
!         /*
!          * We had better not call any user-defined code during cleanup, so if
!          * the cleanup hook hasn't been run yet, too bad.
!          */
!         if (PointerIsValid(portal->cleanup))
!         {
!             elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
!             portal->cleanup = NULL;
!         }
          /* Zap it. */
          PortalDrop(portal, false);
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
			
		Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 8, 2017 at 10:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Tue, Aug 8, 2017 at 4:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> In the meantime, I think my vote would be to remove AtEOXact_CatCache.
>>> In all supported branches?
>> Whatever we do about this issue, I don't feel a need to do it further
>> back than HEAD.  It's a non-problem except in an assert-enabled build,
>> and we don't recommend running those for production, only development.
> Sure, but people still do testing and development against older
> branches - bug fixes, for example.  It doesn't make much sense to me
> to leave code that we know does the wrong thing in the back branches.
Not having heard anyone arguing against that, I'll go make it so,
ie AtEOXact_CatCache is toast in all branches.
        regards, tom lane
			
		Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
От
 
		    	Robert Haas
		    Дата:
		        On Sun, Aug 13, 2017 at 3:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Not having heard anyone arguing against that, I'll go make it so, > ie AtEOXact_CatCache is toast in all branches. Great, thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company