Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling
Дата
Msg-id 9291.1502647481@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: [HACKERS] Server crash (FailedAssertion) due to catcache refcount mis-handling  (Andreas Seltenreich <seltenreich@gmx.de>)
Список pgsql-hackers
Andreas Seltenreich <seltenreich@gmx.de> writes:
> Tom Lane writes:
>> I wonder if Andreas would be interested in trying the randomly-timed-
>> SIGTERM thing with sqlsmith.

> So far, most of the core dumps generated are Jeevan's assertion failing
> with backtraces through SearchCatCacheList.  The rest is failing this
> assertion:
>     TRAP: FailedAssertion("!(portal->cleanup == ((void *)0))", File: "portalmem.c", Line: 846)
> Example backtrace below.  They all happened during a rollback statement.
> Testing was done on master at 2336f84284.

Interesting.  I can reproduce this by forcing a FATAL exit right there,
eg by adding

        if (pstmt->utilityStmt && IsA(pstmt->utilityStmt, TransactionStmt) &&
            ((TransactionStmt *) pstmt->utilityStmt)->kind == TRANS_STMT_ROLLBACK)
            InterruptPending = ProcDiePending = true;

before PortalRunMulti's CHECK_FOR_INTERRUPTS.  But it only fails if the
rollback is trying to exit a transaction that's already suffered an error.
The explanation seems to be:

1. Because we already aborted the transaction, xact.c is in state
blockState == TBLOCK_ABORT.

2. This causes AbortOutOfAnyTransaction to think that it can skip
doing AbortTransaction() and go straight to CleanupTransaction().

3. AtCleanup_Portals() is expecting that we already ran, and cleared,
the portal cleanup hook (PortalCleanup) for every live portal.  But
we have not done so for the active portal that we were in the midst
of running ROLLBACK in.  So it fails the mentioned assertion.

Thus, the basic problem is that we get to CleanupTransaction() without
having done PortalDrop on the portal we're running ROLLBACK in.

We could take this as an argument for simply removing that assertion,
which would mean that when AtCleanup_Portals calls PortalDrop, the cleanup
hook would get run, the same as it would have if exec_simple_query had had
a chance to drop the portal.  However, I'm still pretty afraid of allowing
arbitrary code to execute during CleanupTransaction().  What seems like
a better idea is to make AtCleanup_Portals just summarily clear the
cleanup hook, perhaps while emitting a warning.

I tried that (see portalmem.c changes in attached patch), and what I got
was this:

regression=# rollback;
WARNING:  skipping cleanup for portal ""
FATAL:  terminating connection due to administrator command
FATAL:  cannot drop active portal ""
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

or in the server log:

2017-08-13 13:49:50.312 EDT [8737] FATAL:  terminating connection due to administrator command
2017-08-13 13:49:50.312 EDT [8737] STATEMENT:  rollback;
2017-08-13 13:49:50.312 EDT [8737] WARNING:  skipping cleanup for portal ""
2017-08-13 13:49:50.312 EDT [8737] FATAL:  cannot drop active portal ""

Well, we could tolerate that maybe, but it doesn't seem like it's actually
acting nicely; the active portal is still creating problems for us.

After some further thought, I decided to try making
AbortOutOfAnyTransaction call AtAbort_Portals() in this situation, thus
basically doing the minimum part of AbortTransaction() needed to clean up
the active portal.  That almost worked --- my initial try got an assertion
failure in mcxt.c, because somebody was trying to drop the
CurrentMemoryContext.  So the minimum part of AbortTransaction that we
need here is really AtAbort_Memory + AtAbort_Portals.  After further
thought I decided it'd be a good idea to phrase that as an unconditional
AtAbort_Memory at the top of AbortOutOfAnyTransaction, thus making sure
we are in some valid context to start with; and then, in case the loop
itself doesn't have anything to do, we need AtCleanup_Memory() at the
bottom of the function to revert CurrentMemoryContext to TopMemoryContext.

In short, the attached patch seems to fix it nicely.  We definitely need
the xact.c changes, but the ones in portalmem.c are perhaps optional, as
in theory now the assertion will never be violated.  But I'm inclined
to keep the portalmem changes anyway, as that will make it more robust
if the situation ever happens in a non-assert build.

Note: there are a couple of comments elsewhere in portalmem.c that need
adjustment if we keep the portalmem changes.  I have not bothered to do
that yet, as it's just cosmetic.

Comments?

            regards, tom lane

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 50c3c3b..1eabc67 100644
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** AbortOutOfAnyTransaction(void)
*** 4233,4238 ****
--- 4233,4241 ----
  {
      TransactionState s = CurrentTransactionState;

+     /* Ensure we're not running in a doomed memory context */
+     AtAbort_Memory();
+
      /*
       * Get out of any transaction or nested transaction
       */
*************** AbortOutOfAnyTransaction(void)
*** 4274,4280 ****
                  break;
              case TBLOCK_ABORT:
              case TBLOCK_ABORT_END:
!                 /* AbortTransaction already done, still need Cleanup */
                  CleanupTransaction();
                  s->blockState = TBLOCK_DEFAULT;
                  break;
--- 4277,4290 ----
                  break;
              case TBLOCK_ABORT:
              case TBLOCK_ABORT_END:
!
!                 /*
!                  * AbortTransaction is already done, still need Cleanup.
!                  * However, if we failed partway through running ROLLBACK,
!                  * there will be an active portal running that command, which
!                  * we need to shut down before doing CleanupTransaction.
!                  */
!                 AtAbort_Portals();
                  CleanupTransaction();
                  s->blockState = TBLOCK_DEFAULT;
                  break;
*************** AbortOutOfAnyTransaction(void)
*** 4297,4302 ****
--- 4307,4320 ----
              case TBLOCK_SUBABORT_END:
              case TBLOCK_SUBABORT_RESTART:
                  /* As above, but AbortSubTransaction already done */
+                 if (s->curTransactionOwner)
+                 {
+                     /* As in TBLOCK_ABORT, might have a live portal to zap */
+                     AtSubAbort_Portals(s->subTransactionId,
+                                        s->parent->subTransactionId,
+                                        s->curTransactionOwner,
+                                        s->parent->curTransactionOwner);
+                 }
                  CleanupSubTransaction();
                  s = CurrentTransactionState;    /* changed by pop */
                  break;
*************** AbortOutOfAnyTransaction(void)
*** 4305,4310 ****
--- 4323,4331 ----

      /* Should be out of all subxacts now */
      Assert(s->parent == NULL);
+
+     /* If we didn't actually have anything to do, revert to TopMemoryContext */
+     AtCleanup_Memory();
  }

  /*
diff --git a/src/backend/utils/mmgr/portalmem.c b/src/backend/utils/mmgr/portalmem.c
index 5983aed..e3a3526 100644
*** a/src/backend/utils/mmgr/portalmem.c
--- b/src/backend/utils/mmgr/portalmem.c
*************** AtCleanup_Portals(void)
*** 842,849 ****
          if (portal->portalPinned)
              portal->portalPinned = false;

!         /* We had better not be calling any user-defined code here */
!         Assert(portal->cleanup == NULL);

          /* Zap it. */
          PortalDrop(portal, false);
--- 842,856 ----
          if (portal->portalPinned)
              portal->portalPinned = false;

!         /*
!          * We had better not call any user-defined code during cleanup, so if
!          * the cleanup hook hasn't been run yet, too bad.
!          */
!         if (PointerIsValid(portal->cleanup))
!         {
!             elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
!             portal->cleanup = NULL;
!         }

          /* Zap it. */
          PortalDrop(portal, false);
*************** AtSubCleanup_Portals(SubTransactionId my
*** 1026,1033 ****
          if (portal->portalPinned)
              portal->portalPinned = false;

!         /* We had better not be calling any user-defined code here */
!         Assert(portal->cleanup == NULL);

          /* Zap it. */
          PortalDrop(portal, false);
--- 1033,1047 ----
          if (portal->portalPinned)
              portal->portalPinned = false;

!         /*
!          * We had better not call any user-defined code during cleanup, so if
!          * the cleanup hook hasn't been run yet, too bad.
!          */
!         if (PointerIsValid(portal->cleanup))
!         {
!             elog(WARNING, "skipping cleanup for portal \"%s\"", portal->name);
!             portal->cleanup = NULL;
!         }

          /* Zap it. */
          PortalDrop(portal, false);

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Fabien COELHO
Дата:
Сообщение: Re: [HACKERS] [WIP] Zipfian distribution in pgbench
Следующее
От: Fabien COELHO
Дата:
Сообщение: Re: [HACKERS] pgbench - allow to store select results intovariables