Assertion failure in syncrep.c

From: Pavan Deolasee
Subject: Assertion failure in syncrep.c
Msg-id: CABOikdMpnbjpb1nmBnOVqRwJMKh96-moaOCekPu=JtrEgm8RWQ@mail.gmail.com
Replies: Re: Assertion failure in syncrep.c  (Pavan Deolasee <pavan.deolasee@gmail.com>)
         Re: Assertion failure in syncrep.c  (Simon Riggs <simon@2ndquadrant.com>)
List: pgsql-hackers
Hi All,

While running some tests on the REL9_2_STABLE branch, I saw an assertion failure in syncrep.c. The stack trace looks like this:

    frame #3: 0x00000001055a2da9 postgres`ExceptionalCondition(conditionName=0x000000010567b8c5, errorType=0x00000001055ff193, fileName=0x000000010567b8f4, lineNumber=257) + 137 at assert.c:54
    frame #4: 0x00000001053f19d2 postgres`SyncRepWaitForLSN(XactCommitLSN=XLogRecPtr at 0x00007fff5ab80d08) + 1410 at syncrep.c:257
    frame #5: 0x00000001050f9a52 postgres`RecordTransactionCommit + 1586 at xact.c:1221
    frame #6: 0x00000001050f3e5f postgres`CommitTransaction + 303 at xact.c:1920
    frame #7: 0x00000001050f38f4 postgres`CommitTransactionCommand + 180 at xact.c:2588
    frame #8: 0x000000010543fd8e postgres`finish_xact_command + 126 at postgres.c:2409
<snip>

The failing assertion is at line 257 of syncrep.c:

251     /*
252      * WalSender has checked our LSN and has removed us from queue. Clean up
253      * state and leave.  It's OK to reset these shared memory fields without
254      * holding SyncRepLock, because any walsenders will ignore us anyway when
255      * we're not on the queue.
256      */
257     Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));

My theory, which I validated using breakpoints in gdb, is that the code in SyncRepWakeQueue() sets proc->syncRepState to SYNC_REP_WAIT_COMPLETE before it removes the proc from the queue:

        /*
         * Set state to complete; see SyncRepWaitForLSN() for discussion of
         * the various states.
         */
        thisproc->syncRepState = SYNC_REP_WAIT_COMPLETE;

        /*
         * Remove thisproc from queue.
         */
        SHMQueueDelete(&(thisproc->syncRepLinks));


But before the walsender can detach the proc from the syncRepLinks shared queue, the backend checks MyProc->syncRepState, decides not to sleep after all, and goes on to check the above Assert. The assertion fails because the proc has not yet been removed from the queue.

        syncRepState = MyProc->syncRepState;
        if (syncRepState == SYNC_REP_WAITING)
        {
            LWLockAcquire(SyncRepLock, LW_SHARED);
            syncRepState = MyProc->syncRepState;
            LWLockRelease(SyncRepLock);
        }
        if (syncRepState == SYNC_REP_WAIT_COMPLETE)
            break;
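
To make the window concrete, here is a minimal single-threaded sketch of the interleaving described above. It is not PostgreSQL code: DemoProc, the onQueue flag (standing in for "not SHMQueueIsDetached") and backend_trips_assert() are hypothetical names for illustration, and the enum is a simplified stand-in for the real state values. It steps the walsender through its two actions in the current order and lets the backend do its unlocked read at each point.

```c
#include <stdbool.h>

/* Simplified stand-ins for the PostgreSQL definitions (assumed, not copied). */
typedef enum
{
    SYNC_REP_NOT_WAITING,
    SYNC_REP_WAITING,
    SYNC_REP_WAIT_COMPLETE
} DemoSyncRepState;

typedef struct
{
    DemoSyncRepState syncRepState;
    bool             onQueue;   /* models !SHMQueueIsDetached(&proc->syncRepLinks) */
} DemoProc;

/*
 * Run the first `steps_done` walsender steps in the current order
 * (1: set the state, 2: unlink from the queue), then let the backend look.
 * Returns true if the backend would leave its wait loop while still linked,
 * which is the condition under which the Assert fails.
 */
bool
backend_trips_assert(int steps_done)
{
    DemoProc proc = { SYNC_REP_WAITING, true };

    if (steps_done >= 1)
        proc.syncRepState = SYNC_REP_WAIT_COMPLETE;   /* walsender step 1 */
    if (steps_done >= 2)
        proc.onQueue = false;                         /* walsender step 2 */

    /* Backend: unlocked read of the state, as in the quoted wait loop. */
    bool leaves_loop = (proc.syncRepState == SYNC_REP_WAIT_COMPLETE);

    return leaves_loop && proc.onQueue;
}
```

Only the middle point (state already set, proc not yet unlinked) trips the check; before the walsender starts and after it finishes, the backend sees a consistent picture.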

We could just remove the assert, but there is another similar assert at the start of the function, and it may fail for similar reasons (though that is very unlikely). Instead, we could reverse the order of setting syncRepState and removing the proc from the syncRepLinks queue. There is then a very rare possibility that the walsender crashes after removing the proc from the queue but before changing its state. I wonder if that would cause an infinite wait for the process.
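
The reversed order can be sketched the same way. This is again a toy model under stated assumptions, not a patch: DemoState, Outcome and observe_reordered() are hypothetical names, and locking and memory ordering are deliberately left out. It unlinks first and sets the state second, then classifies what the backend would observe at each point.

```c
#include <stdbool.h>

/* Hypothetical, simplified state for illustration only. */
typedef enum { DEMO_WAITING, DEMO_WAIT_COMPLETE } DemoState;

typedef enum
{
    KEEPS_WAITING_ON_QUEUE,    /* walsender has not started yet */
    KEEPS_WAITING_OFF_QUEUE,   /* unlinked, but state not yet changed */
    LEAVES_DETACHED,           /* backend leaves the loop; the Assert would hold */
    LEAVES_STILL_QUEUED        /* backend leaves the loop; the Assert would fail */
} Outcome;

/*
 * Run the first `steps_done` walsender steps in the REVERSED order
 * (1: unlink from the queue, 2: set the state), then report what the
 * backend would see if it looked at that moment.
 */
Outcome
observe_reordered(int steps_done)
{
    DemoState state = DEMO_WAITING;
    bool      on_queue = true;

    if (steps_done >= 1)
        on_queue = false;               /* walsender step 1: unlink */
    if (steps_done >= 2)
        state = DEMO_WAIT_COMPLETE;     /* walsender step 2: set state */

    if (state == DEMO_WAIT_COMPLETE)
        return on_queue ? LEAVES_STILL_QUEUED : LEAVES_DETACHED;
    return on_queue ? KEEPS_WAITING_ON_QUEUE : KEEPS_WAITING_OFF_QUEUE;
}
```

In this model the failing combination (LEAVES_STILL_QUEUED) is never reached. The intermediate point, where the backend is still waiting but no longer on the queue, is the window the crash concern is about; whether a backend can really get stuck there depends on behavior outside this sketch.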

BTW, even though I was working on the REL9_2_STABLE branch, it looks like the issue exists in the master branch too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
