Discussion: [BUGS] PANIC during exit on behalf of FATAL semop error
We have an x86_64 linux machine running postgresql 9.6.2. Our application uses LISTEN/NOTIFY. We recently made a change so that our testing infrastructure would notice postgres crashes and out popped this crash:
elog(PANIC, "queueing for lock while waiting on another one");
in this code:
/*
 * Add ourselves to the end of the queue.
 *
 * NB: Mode can be LW_WAIT_UNTIL_FREE here!
 */
static void
LWLockQueueSelf(LWLock *lock, LWLockMode mode)
{
    /*
     * If we don't have a PGPROC structure, there's no way to wait. This
     * should never occur, since MyProc should only be null during shared
     * memory initialization.
     */
    if (MyProc == NULL)
        elog(PANIC, "cannot wait without a PGPROC structure");

    if (MyProc->lwWaiting)
----->      elog(PANIC, "queueing for lock while waiting on another one"); <-------------------

    LWLockWaitListLock(lock);
    ...
Here's a stack trace and a more verbose stack trace.
#0  0x00007ff27bb50c37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007ff27bb54028 in __GI_abort () at abort.c:89
#2  0x0000000000978119 in errfinish (dummy=0) at elog.c:557
#3  0x000000000097a671 in elog_finish (elevel=22, fmt=0xc5a000 "queueing for lock while waiting on another one") at elog.c:1378
#4  0x000000000080eb73 in LWLockQueueSelf (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1035
#5  0x000000000080ee30 in LWLockAcquire (lock=0x7ff272b06480, mode=LW_EXCLUSIVE) at lwlock.c:1250
#6  0x00000000007fe416 in CleanupInvalidationState (status=1, arg=140679430974720) at sinvaladt.c:344
#7  0x00000000007f5132 in shmem_exit (code=1) at ipc.c:261
#8  0x00000000007f4f63 in proc_exit_prepare (code=1) at ipc.c:185
#9  0x00000000007f4eb3 in proc_exit (code=1) at ipc.c:102
#10 0x00000000009780e7 in errfinish (dummy=0) at elog.c:543
#11 0x000000000097a671 in elog_finish (elevel=21, fmt=0xc45b4a "semop(id=%d) failed: %m") at elog.c:1378
#12 0x00000000007881d9 in PGSemaphoreLock (sema=0x7ff27b7d6740) at pg_sema.c:391
#13 0x000000000080ee74 in LWLockAcquire (lock=0x7ff272b06f00, mode=LW_SHARED) at lwlock.c:1287
#14 0x0000000000604177 in asyncQueueReadAllNotifications () at async.c:1877
#15 0x000000000060464c in ProcessIncomingNotify () at async.c:2058
#16 0x0000000000603f04 in ProcessNotifyInterrupt () at async.c:1732
#17 0x000000000081da4d in ProcessClientReadInterrupt (blocked=1 '\001') at postgres.c:537
#18 0x00000000006d7a8d in secure_read (port=0x2bb90b0, ptr=0xfe8da0 <PqRecvBuffer>, len=8192) at be-secure.c:177
#19 0x00000000006e3f82 in pq_recvbuf () at pqcomm.c:921
#20 0x00000000006e4022 in pq_getbyte () at pqcomm.c:964
#21 0x000000000081d495 in SocketBackend (inBuf=0x7ffe148694a0) at postgres.c:334
#22 0x000000000081d9db in ReadCommand (inBuf=0x7ffe148694a0) at postgres.c:507
#23 0x00000000008228ff in PostgresMain (argc=1, argv=0x2bbb308, dbname=0x2bbb2f0 "cshub", username=0x2bbb2d0 "cshubuser") at postgres.c:4021
#24 0x000000000079f903 in BackendRun (port=0x2bb90b0) at postmaster.c:4272
#25 0x000000000079ef9d in BackendStartup (port=0x2bb90b0) at postmaster.c:3946
#26 0x000000000079b645 in ServerLoop () at postmaster.c:1701
#27 0x000000000079ab97 in PostmasterMain (argc=3, argv=0x2b918b0) at postmaster.c:1309
#28 0x00000000006e8f8c in main (argc=3, argv=0x2b918b0) at main.c:228

So the PANIC is really happening because of the first problem:
/*
 * PGSemaphoreLock
 *
 * Lock a semaphore (decrement count), blocking if count would be < 0
 */
void
PGSemaphoreLock(PGSemaphore sema)
{
    int         errStatus;
    struct sembuf sops;

    sops.sem_op = -1;           /* decrement */
    sops.sem_flg = 0;
    sops.sem_num = sema->semNum;

    /*
     * Note: if errStatus is -1 and errno == EINTR then it means we returned
     * from the operation prematurely because we were sent a signal. So we
     * try and lock the semaphore again.
     *
     * We used to check interrupts here, but that required servicing
     * interrupts directly from signal handlers. Which is hard to do safely
     * and portably.
     */
    do
    {
        errStatus = semop(sema->semId, &sops, 1);
    } while (errStatus < 0 && errno == EINTR);

    if (errStatus < 0)
------>     elog(FATAL, "semop(id=%d) failed: %m", sema->semId); <--------------
}
Should this one use PANIC instead of FATAL given that the FATAL exit path causes a PANIC in some cases? Is there an opportunity to repair the state of things enough that a FATAL exit is possible here?
I'm not seeing either of the messages in the postgres logs, so I'm not sure what %m expanded to. The logs are short and uninteresting--just startup and shutdown messages. It is possible the machine was being stressed when this occurred.
Hi,

On 2017-09-15 17:30:51 -0400, Dave Vitek wrote:
> We have an x86_64 linux machine running postgresql 9.6.2. Our application
> uses LISTEN/NOTIFY. We recently made a change so that our testing
> infrastructure would notice postgres crashes and out popped this crash:

Oh. That's curious.

> [...]
>
> Should this one use PANIC instead of FATAL given that the FATAL exit path
> causes a PANIC in some cases? Is there an opportunity to repair the state
> of things enough that a FATAL exit is possible here?

I'm right now more curious to discover how this happened. Are you by any
chance running this with systemd/logind in the mix? Its RemoveIPC=
setting can cause such things...

Greetings,

Andres Freund

--
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
On 9/15/2017 5:42 PM, Andres Freund wrote:
> [...]
>
> I'm right now more curious to discover how this happened. Are you by any
> chance running this with systemd/logind in the mix? Its RemoveIPC=
> setting can cause such things...

Thanks for the quick reply. I can see a systemd-logind process running on
the machine. It's running Ubuntu "trusty." I did not set it up, but I assume
it is largely using default settings. The logind.conf looks like:

[Login]
#NAutoVTs=6
#ReserveVT=6
#KillUserProcesses=no
#KillOnlyUsers=
#KillExcludeUsers=root
Controllers=blkio cpu cpuacct cpuset devices freezer hugetlb memory perf_event net_cls net_prio
ResetControllers=
#InhibitDelayMaxSec=5
#HandlePowerKey=poweroff
#HandleSuspendKey=suspend
#HandleHibernateKey=hibernate
#HandleLidSwitch=suspend
#PowerKeyIgnoreInhibited=no
#SuspendKeyIgnoreInhibited=no
#HibernateKeyIgnoreInhibited=no
#LidSwitchIgnoreInhibited=yes
#IdleAction=ignore
#IdleActionSec=30min

so there is no mention of RemoveIPC. I will add the line to disable it and
reboot the system anyway; I wonder whether logind complains about
unrecognized variables. It's also not completely impossible that a human
trying to fix something ran ipcrm in error.

- Dave
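For reference, the line in question goes in /etc/systemd/logind.conf; a sketch of the mitigation (whether an older logind such as trusty's warns about, or silently ignores, a key it does not know is exactly the open question above):

```ini
# /etc/systemd/logind.conf
[Login]
# Keep SysV semaphores and shared memory alive when the owning user's
# last session ends; otherwise logind may remove a running postgres
# instance's IPC objects out from under it.
RemoveIPC=no
```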