Re: [BUGS] PANIC during exit on behalf of FATAL semop error

Поиск
Список
Период
Сортировка
От Dave Vitek
Тема Re: [BUGS] PANIC during exit on behalf of FATAL semop error
Дата
Msg-id 65a8f998-c284-eda7-bfea-9e4e1786613a@grammatech.com
обсуждение исходный текст
Ответ на Re: [BUGS] PANIC during exit on behalf of FATAL semop error  (Andres Freund <andres@anarazel.de>)
Список pgsql-bugs
On 9/15/2017 5:42 PM, Andres Freund wrote:
> Hi,
>
> On 2017-09-15 17:30:51 -0400, Dave Vitek wrote:
>> We have an x86_64 linux machine running postgresql 9.6.2.  Our application
>> uses LISTEN/NOTIFY.  We recently made a change so that our testing
>> infrastructure would notice postgres crashes and out popped this crash:
> Oh. That's curious.
>
>> elog(PANIC, "queueing for lock while waiting on another one");
>>
>> in this code:
>>
>> /*
>>   * Add ourselves to the end of the queue.
>>   *
>>   * NB: Mode can be LW_WAIT_UNTIL_FREE here!
>>   */
>> static void
>> LWLockQueueSelf(LWLock *lock, LWLockMode mode)
>> {
>>          /*
>>           * If we don't have a PGPROC structure, there's no way to wait. This
>>           * should never occur, since MyProc should only be null during
>> shared
>>           * memory initialization.
>>           */
>>          if (MyProc == NULL)
>>                  elog(PANIC, "cannot wait without a PGPROC structure");
>>
>>          if (MyProc->lwWaiting)
>> ----->       elog(PANIC, "queueing for lock while waiting on another one");
>> <-------------------
>> Here's a stack trace and a more verbose stack trace.
>>
>> #0  0x00007ff27bb50c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x00007ff27bb54028 in __GI_abort () at abort.c:89
>> #2  0x0000000000978119 in errfinish (dummy=0) at elog.c:557
>> #3  0x000000000097a671 in elog_finish (elevel=22, fmt=0xc5a000 "queueing for lock while waiting on another one") at
elog.c:1378
>> #4  0x000000000080eb73 in LWLockQueueSelf (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1035
>> #5  0x000000000080ee30 in LWLockAcquire (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1250
>> #6  0x00000000007fe416 in CleanupInvalidationState (status=1, arg=140679430974720) at sinvaladt.c:344
>> #7  0x00000000007f5132 in shmem_exit (code=1) at ipc.c:261
>> #8  0x00000000007f4f63 in proc_exit_prepare (code=1) at ipc.c:185
>> #9  0x00000000007f4eb3 in proc_exit (code=1) at ipc.c:102
>> #10 0x00000000009780e7 in errfinish (dummy=0) at elog.c:543
>> #11 0x000000000097a671 in elog_finish (elevel=21, fmt=0xc45b4a "semop(id=%d) failed: %m") at elog.c:1378
>> #12 0x00000000007881d9 in PGSemaphoreLock (sema=0x7ff27b7d6740) at pg_sema.c:391
>> #13 0x000000000080ee74 in LWLockAcquire (lock=0x7ff272b06f00, mode=LW_SHARED) at lwlock.c:1287
>> #14 0x0000000000604177 in asyncQueueReadAllNotifications () at async.c:1877
>> #15 0x000000000060464c in ProcessIncomingNotify () at async.c:2058
>> #16 0x0000000000603f04 in ProcessNotifyInterrupt () at async.c:1732
>> #17 0x000000000081da4d in ProcessClientReadInterrupt (blocked=1 '\001') at postgres.c:537
>> #18 0x00000000006d7a8d in secure_read (port=0x2bb90b0, ptr=0xfe8da0 <PqRecvBuffer>, len=8192) at be-secure.c:177
>> #19 0x00000000006e3f82 in pq_recvbuf () at pqcomm.c:921
>> #20 0x00000000006e4022 in pq_getbyte () at pqcomm.c:964
>> #21 0x000000000081d495 in SocketBackend (inBuf=0x7ffe148694a0) at postgres.c:334
>> #22 0x000000000081d9db in ReadCommand (inBuf=0x7ffe148694a0) at postgres.c:507
>> #23 0x00000000008228ff in PostgresMain (argc=1, argv=0x2bbb308, dbname=0x2bbb2f0 "cshub", username=0x2bbb2d0
"cshubuser")at postgres.c:4021
 
>> #24 0x000000000079f903 in BackendRun (port=0x2bb90b0) at postmaster.c:4272
>> #25 0x000000000079ef9d in BackendStartup (port=0x2bb90b0) at postmaster.c:3946
>> #26 0x000000000079b645 in ServerLoop () at postmaster.c:1701
>> #27 0x000000000079ab97 in PostmasterMain (argc=3, argv=0x2b918b0) at postmaster.c:1309
>> #28 0x00000000006e8f8c in main (argc=3, argv=0x2b918b0) at main.c:228
>>
>>
>> So the PANIC is really happening because of the first problem:
>>
>> /*
>>   * PGSemaphoreLock
>>   *
>>   * Lock a semaphore (decrement count), blocking if count would be < 0
>>   */
>> void
>> PGSemaphoreLock(PGSemaphore sema)
>> {
>>          if (errStatus < 0)
>> ------>     elog(FATAL, "semop(id=%d) failed: %m", sema->semId);
>> <--------------
>> }
>>
>> Should this one use PANIC instead of FATAL given that the FATAL exit path
>> causes a PANIC in some cases?  Is there an opportunity to repair the state
>> of things enough that a FATAL exit is possible here?
> I'm right now more curious to discover how this happened. Are you by any
> chance running this with systemd/logind in the mix?  It's RemoveIPC=
> setting can cause such things...
>
> Greetings,
>
> Andres Freund
Thanks for the quick reply.

I can see a systemd-logind process running on the machine.  It's using 
Ubuntu "trusty."  I did not set it up, but assume it might be largely 
using default settings.

The logind.conf looks like:

[Login]
#NAutoVTs=6
#ReserveVT=6
#KillUserProcesses=no
#KillOnlyUsers=
#KillExcludeUsers=root
Controllers=blkio cpu cpuacct cpuset devices freezer hugetlb memory 
perf_event net_cls net_prio
ResetControllers=
#InhibitDelayMaxSec=5
#HandlePowerKey=poweroff
#HandleSuspendKey=suspend
#HandleHibernateKey=hibernate
#HandleLidSwitch=suspend
#PowerKeyIgnoreInhibited=no
#SuspendKeyIgnoreInhibited=no
#HibernateKeyIgnoreInhibited=no
#LidSwitchIgnoreInhibited=yes
#IdleAction=ignore
#IdleActionSec=30min

so no mention of RemoveIPC.  I will add the line to disable it and 
reboot the system anyway.  I wonder how it complains about unrecognized 
variables?

It's also not completely impossible that a human trying to fix something 
ran ipcrm in error.

- Dave


-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Andres Freund
Дата:
Сообщение: Re: [BUGS] PANIC during exit on behalf of FATAL semop error
Следующее
От: Thomas Munro
Дата:
Сообщение: Re: [BUGS] BUG #14808: V10-beta4, backend abort