Deadlock condition in current sources
From | Tom Lane |
---|---|
Subject | Deadlock condition in current sources |
Date | |
Msg-id | 9598.1008646141@sss.pgh.pa.us |
List | pgsql-hackers |
I have observed a nasty three-way deadlock condition.

This proc is trying to generate a new transaction ID, and has hit the one case in every 32K where a new page must be added to the CLOG. That means that an XLOG record must be written to record the creation of the new CLOG page:

USER  PID  %CPU %MEM VSZ   RSS  TTY   STAT START TIME COMMAND
tgl   1135 0.0  3.4  41012 8812 pts/2 SN   19:54 0:00 postgres: tgl bench [local] idle

#0 0x401d63b2 in semop (semid=1474560, sops=0xbfffcd20, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1474560, sem=4, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=WALInsertLock, mode=LW_EXCLUSIVE) at lwlock.c:271
#3 0x08091d90 in XLogInsert (rmid=3 '\003', info=0 '\000', rdata=0xbfffef10) at xlog.c:644
#4 0x08090237 in WriteZeroPageXlogRec (pageno=2) at clog.c:962
#5 0x0808f7e0 in ZeroCLOGPage (pageno=2, writeXlog=1 '\001') at clog.c:357
This proc is holding CLogControlLock, LW_EXCLUSIVE:
#6 0x0808ff50 in ExtendCLOG (newestXact=65536) at clog.c:778
This proc is holding XidGenLock, LW_EXCLUSIVE:
#7 0x08090590 in GetNewTransactionId () at varsup.c:58
#8 0x08090d77 in StartTransaction () at xact.c:863
#9 0x080910f9 in StartTransactionCommand () at xact.c:1156
#10 0x08126753 in pg_exec_query_string (query_string=0x8271410 "begin", dest=Remote, parse_context=0x8247adc) at postgres.c:603
#11 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

The first proc is waiting for the second, who already holds WALInsertLock. The second proc is trying to make the first XLOG entry of his transaction. Therefore he needs to set MyProc->logRec, which presently requires him to obtain SInvalLock:

USER  PID  %CPU %MEM VSZ   RSS  TTY   STAT START TIME COMMAND
tgl   1196 0.0  3.5  41028 8928 pts/2 SN   19:54 0:00 postgres: tgl bench [local] UPDATE

#0 0x401d63b2 in semop (semid=1572867, sops=0xbfffcb50, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1572867, sem=15, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=SInvalLock, mode=LW_EXCLUSIVE) at lwlock.c:271
This proc is holding WALInsertLock, LW_EXCLUSIVE:
#3 0x0809222f in XLogInsert (rmid=10 '\n', info=40 '(', rdata=0xbfffed50) at xlog.c:747
#4 0x08079f4e in log_heap_update (reln=0x425c7fe0, oldbuf=238, from={ip_blkid = {bi_hi = 1, bi_lo = 5639}, ip_posid = 16}, newbuf=3307, newtup=0x82865e8, move=0 '\000') at heapam.c:1931
#5 0x0807948f in heap_update (relation=0x425c7fe0, otid=0xbfffef10, newtup=0x82865e8, ctid=0xbfffee80) at heapam.c:1565
#6 0x080d6216 in ExecReplace (slot=0x827a9ec, tupleid=0xbfffef10, estate=0x827ae38) at execMain.c:1454
#7 0x080d5f1d in ExecutePlan (estate=0x827ae38, plan=0x827ad90, operation=CMD_UPDATE, numberTuples=0, direction=ForwardScanDirection, destfunc=0x827b6e4) at execMain.c:1129
#8 0x080d5260 in ExecutorRun (queryDesc=0x827ae1c, estate=0x827ae38, feature=3, count=0) at execMain.c:233
#9 0x08127e13 in ProcessQuery (parsetree=0x8272148, plan=0x827ad90, dest=Remote) at pquery.c:293
#10 0x08126942 in pg_exec_query_string (query_string=0x8271d90 "update accounts set abalance = abalance + 735 where aid= 4270516\n", dest=Remote, parse_context=0x824845c) at postgres.c:781
#11 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

And this proc is trying to obtain XidGenLock while already holding SInvalLock:

USER  PID  %CPU %MEM VSZ   RSS  TTY   STAT START TIME COMMAND
tgl   1138 0.0  3.5  41020 8936 pts/2 SN   19:54 0:00 postgres: tgl bench [local] idle in transaction

#0 0x401d63b2 in semop (semid=1474560, sops=0xbfffef00, nsops=1) at ../sysdeps/unix/sysv/linux/semop.c:36
#1 0x0811ccab in IpcSemaphoreLock (semId=1474560, sem=7, interruptOK=0 '\000') at ipc.c:422
#2 0x0812332f in LWLockAcquire (lockid=XidGenLock, mode=LW_SHARED) at lwlock.c:271
#3 0x080905d4 in ReadNewTransactionId () at varsup.c:103
This proc is holding SInvalLock, LW_SHARED:
#4 0x0811e2ae in GetSnapshotData (serializable=0 '\000') at sinval.c:359
#5 0x081767ce in SetQuerySnapshot () at tqual.c:752
#6 0x081268f9 in pg_exec_query_string (query_string=0x8271458 "insert into history(tid,bid,aid,delta,mtime) values(336,81,9860149,356,'now')", dest=Remote, parse_context=0x8247b0c) at postgres.c:764
#7 0x081278da in PostgresMain (argc=4, argv=0xbffff1c0, username=0x822dce9 "tgl") at postgres.c:1849

Unfortunately the first proc is holding XidGenLock, ergo deadlock.

I don't think we have any room to wiggle in terms of the locking sequence of the first proc (see comments in GetNewTransactionId), nor of the third (see comments in GetSnapshotData). That means the only way to resolve the deadlock is to not grab SInvalLock while holding the WALInsertLock in XLogInsert.

I believe this is actually safe, because the only code that looks at the logRec fields of other backends' PROC structures is GetUndoRecPtr, which is only called while holding WALInsertLock in CreateCheckPoint. Therefore, we could re-document proc->logRec as being protected by WALInsertLock not SInvalLock and not have to get SInvalLock in XLogInsert.

However, there's still a problem: GetUndoRecPtr also gets SInvalLock while its caller holds WALInsertLock, and therefore this routine could create the second leg of the deadlock too. Removing the SInvalLock lock there creates the problem that backends might be added to or deleted from the PROC array while GetUndoRecPtr runs. I think it might be possible to survive that, by adding an assumption that logRec.xrecoff can be set to zero atomically, but it seems tricky.

Comments? Anyone see a better approach?
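
The cycle described by the three backtraces can be written down as a small wait-for graph. The following is only a minimal sketch of that snapshot: the lock names and PIDs are taken from the traces, but the struct, the table, and the helpers holder_of() and wait_cycle_length() are hypothetical and are not PostgreSQL code. It also assumes each lock has a single relevant holder, which glosses over the fact that SInvalLock is held in LW_SHARED mode here.

```c
/* Lock names come from the backtraces; the data model is hypothetical
 * and is not how lwlock.c tracks holders or waiters. */
enum lock_id {
    CLOG_CONTROL_LOCK,
    XID_GEN_LOCK,
    WAL_INSERT_LOCK,
    SINVAL_LOCK,
    NUM_LOCKS
};

#define NUM_PROCS 3

struct proc_state {
    int pid;                 /* PID from the ps output */
    int holds[NUM_LOCKS];    /* nonzero if the proc holds that lock */
    enum lock_id waits_for;  /* lock the proc is blocked on */
};

static const struct proc_state procs[NUM_PROCS] = {
    /* first proc: holds CLogControlLock and XidGenLock, wants WALInsertLock */
    { 1135, { 1, 1, 0, 0 }, WAL_INSERT_LOCK },
    /* second proc: holds WALInsertLock, wants SInvalLock */
    { 1196, { 0, 0, 1, 0 }, SINVAL_LOCK },
    /* third proc: holds SInvalLock (shared), wants XidGenLock */
    { 1138, { 0, 0, 0, 1 }, XID_GEN_LOCK },
};

/* Return the index of the proc holding the lock, or -1 if it is free. */
static int holder_of(enum lock_id lock)
{
    for (int i = 0; i < NUM_PROCS; i++)
        if (procs[i].holds[lock])
            return i;
    return -1;
}

/* Follow "waits for the holder of" edges from procs[start].
 * Returns the number of edges in a cycle leading back to start,
 * or 0 if the chain ends at a free lock or never returns to start. */
int wait_cycle_length(int start)
{
    int cur = start;

    for (int steps = 1; steps <= NUM_PROCS; steps++) {
        int next = holder_of(procs[cur].waits_for);

        if (next < 0)
            return 0;       /* lock is free, no deadlock on this path */
        if (next == start)
            return steps;   /* back at the starting proc: a cycle */
        cur = next;
    }
    return 0;
}
```

In this model, wait_cycle_length() returns 3 for each of the three procs, matching the chain 1135 -> 1196 -> 1138 -> 1135. Not taking SInvalLock in XLogInsert corresponds to removing the second proc's wait edge, which breaks the cycle in this snapshot; the GetUndoRecPtr case would add a similar edge back.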
regards, tom lane