Re: [HACKERS] Deadlock in XLogInsert at AIX

From: Konstantin Knizhnik
Subject: Re: [HACKERS] Deadlock in XLogInsert at AIX
Date:
Msg-id: 100ec553-b398-7730-c676-57a5e375cb68@postgrespro.ru
In reply to: [HACKERS] Deadlock in XLogInsert at AIX  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Replies: Re: [HACKERS] Deadlock in XLogInsert at AIX  (Bernd Helmle <mailings@oopsware.de>)
List: pgsql-hackers
More information about the problem: the Postgres log contains several records like

2017-01-24 19:15:20.272 MSK [19270462] LOG:  request to flush past end of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0

and they correspond to the times when the deadlocks happened.
There is the following comment in xlog.c concerning this message:

    /*
     * No-one should request to flush a piece of WAL that hasn't even been
     * reserved yet. However, it can happen if there is a block with a bogus
     * LSN on disk, for example. XLogFlush checks for that situation and
     * complains, but only after the flush. Here we just assume that to mean
     * that all WAL that has been reserved needs to be finished. In this
     * corner-case, the return value can be smaller than 'upto' argument.
     */

So it looks like this should not happen.
The first thing to suspect is the spinlock implementation, which is different for GCC and XLC.
But even if I rebuild Postgres without spinlocks, the problem is still reproduced.

On 24.01.2017 17:47, Konstantin Knizhnik wrote:
Hi Hackers,

We are running Postgres on AIX and have encountered two strange problems: active zombie processes and a deadlock in the XLOG writer.
I will explain the first problem in a separate mail; right now I am mostly concerned about the deadlock.
It reproduces irregularly with standard pgbench launched with 100 connections.

It sometimes happens with the 9.6 stable version of Postgres, but only when it is compiled with the xlc compiler.
We failed to reproduce the problem with GCC, so it looks like a bug in the compiler or in the xlc-specific atomics implementation...
But there are a few observations that contradict this hypothesis:

1. The problem reproduces with Postgres built without optimization, while compiler bugs usually affect only optimized code.
2. Disabling atomics doesn't help.
3. Without optimization and with LOCK_DEBUG defined, the time needed to reproduce the problem increases significantly: with optimized code it is almost always reproduced within a few minutes, while with the debug build it usually takes much longer.

But the most confusing thing is the stack trace:

(dbx) where
semop(??, ??, ??) at 0x9000000001f5790
PGSemaphoreLock(sema = 0x0a00000044b95928), line 387 in "pg_sema.c"
unnamed block in LWLockWaitForVar(lock = 0x0a0000000000d980, valptr = 0x0a0000000000d9a8, oldval = 102067874256, newval = 0x0fffffffffff9c10), line 1666 in "lwlock.c"
LWLockWaitForVar(lock = 0x0a0000000000d980, valptr = 0x0a0000000000d9a8, oldval = 102067874256, newval = 0x0fffffffffff9c10), line 1666 in "lwlock.c"
unnamed block in WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
AdvanceXLInsertBuffer(upto = 102067874256, opportunistic = '\0'), line 1916 in "xlog.c"
unnamed block in GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
CopyXLogRecordToWAL(write_len = 70, isLogSwitch = '\0', rdata = 0x000000011007ce10, StartPos = 102067874256, EndPos = 102067874328), line 1279 in "xlog.c"
XLogInsertRecord(rdata = 0x000000011007ce10, fpw_lsn = 102067718328), line 1011 in "xlog.c"
unnamed block in XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
log_heap_update(reln = 0x0000000110273540, oldbuf = 40544, newbuf = 40544, oldtup = 0x0fffffffffffa2a0, newtup = 0x00000001102bb958, old_key_tuple = (nil), all_visible_cleared = '\0', new_all_visible_cleared = '\0'), line 7708 in "heapam.c"
unnamed block in heap_update(relation = 0x0000000110273540, otid = 0x0fffffffffffa6f8, newtup = 0x00000001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffffffffffa5b0, lockmode = 0x0fffffffffffa5c8), line 4212 in "heapam.c"
heap_update(relation = 0x0000000110273540, otid = 0x0fffffffffffa6f8, newtup = 0x00000001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffffffffffa5b0, lockmode = 0x0fffffffffffa5c8), line 4212 in "heapam.c"
unnamed block in ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple = (nil), slot = 0x00000001102bb308, planSlot = 0x00000001102b4630, epqstate = 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c"
ExecUpdate(tupleid = 0x0fffffffffffa6f8, oldtuple = (nil), slot = 0x00000001102bb308, planSlot = 0x00000001102b4630, epqstate = 0x00000001102b2cd8, estate = 0x00000001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c"
ExecModifyTable(node = 0x00000001102b2c30), line 1516 in "nodeModifyTable.c"
ExecProcNode(node = 0x00000001102b2c30), line 396 in "execProcnode.c"
ExecutePlan(estate = 0x00000001102b29e0, planstate = 0x00000001102b2c30, use_parallel_mode = '\0', operation = CMD_UPDATE, sendTuples = '\0', numberTuples = 0, direction = ForwardScanDirection, dest = 0x00000001102b7520), line 1569 in "execMain.c"
standard_ExecutorRun(queryDesc = 0x00000001102b25c0, direction = ForwardScanDirection, count = 0), line 338 in "execMain.c"
ExecutorRun(queryDesc = 0x00000001102b25c0, direction = ForwardScanDirection, count = 0), line 286 in "execMain.c"
ProcessQuery(plan = 0x00000001102b1510, sourceText = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;", params = (nil), dest = 0x00000001102b7520, completionTag = ""), line 187 in "pquery.c"
unnamed block in PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
unnamed block in PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
PortalRunMulti(portal = 0x0000000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 1303 in "pquery.c"
unnamed block in PortalRun(portal = 0x0000000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c"
PortalRun(portal = 0x0000000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x00000001102b7520, altdest = 0x00000001102b7520, completionTag = ""), line 815 in "pquery.c"
unnamed block in exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
unnamed block in PostgresMain(argc = 1, argv = 0x0000000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
PostgresMain(argc = 1, argv = 0x0000000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
BackendRun(port = 0x0000000110114290), line 4279 in "postmaster.c"
BackendStartup(port = 0x0000000110114290), line 3953 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
ServerLoop(), line 1701 in "postmaster.c"
PostmasterMain(argc = 3, argv = 0x00000001100c6190), line 1309 in "postmaster.c"
main(argc = 3, argv = 0x00000001100c6190), line 228 in "main.c"


As I already mentioned, we built Postgres with LOCK_DEBUG, so we can inspect the lock owner. The backend is waiting for itself!
Now please look at two frames in this stack trace: XLogInsertRecord and WaitXLogInsertionsToFinish.
XLogInsertRecord acquires a WAL insert lock at the beginning of the function:

    if (isLogSwitch)
        WALInsertLockAcquireExclusive();
    else
        WALInsertLockAcquire();

WALInsertLockAcquire just selects an item from the WALInsertLocks array (based on the backend's pgprocno) and locks it exclusively:

    if (lockToTry == -1)
        lockToTry = MyProc->pgprocno % NUM_XLOGINSERT_LOCKS;
    MyLockNo = lockToTry;
    immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);

Then, following the stack trace, AdvanceXLInsertBuffer calls WaitXLogInsertionsToFinish:

            /*
             * Now that we have an up-to-date LogwrtResult value, see if we
             * still need to write it or if someone else already did.
             */
            if (LogwrtResult.Write < OldPageRqstPtr)
            {
                /*
                 * Must acquire write lock. Release WALBufMappingLock first,
                 * to make sure that all insertions that we need to wait for
                 * can finish (up to this same position). Otherwise we risk
                 * deadlock.
                 */
                LWLockRelease(WALBufMappingLock);

                WaitXLogInsertionsToFinish(OldPageRqstPtr);

                LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);


It releases WALBufMappingLock, but not the WAL insert locks!
Finally, WaitXLogInsertionsToFinish tries to wait on all the insert locks:

    for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
    {
        XLogRecPtr    insertingat = InvalidXLogRecPtr;

        do
        {
            /*
             * See if this insertion is in progress. LWLockWait will wait for
             * the lock to be released, or for the 'value' to be set by a
             * LWLockUpdateVar call.  When a lock is initially acquired, its
             * value is 0 (InvalidXLogRecPtr), which means that we don't know
             * where it's inserting yet.  We will have to wait for it.  If
             * it's a small insertion, the record will most likely fit on the
             * same page and the inserter will release the lock without ever
             * calling LWLockUpdateVar.  But if it has to sleep, it will
             * advertise the insertion point with LWLockUpdateVar before
             * sleeping.
             */
            if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
                                 &WALInsertLocks[i].l.insertingAt,
                                 insertingat, &insertingat))

And here we get stuck!
The comment on WaitXLogInsertionsToFinish says:

 * Note: When you are about to write out WAL, you must call this function
 * *before* acquiring WALWriteLock, to avoid deadlocks. This function might
 * need to wait for an insertion to finish (or at least advance to next
 * uninitialized page), and the inserter might need to evict an old WAL buffer
 * to make room for a new one, which in turn requires WALWriteLock.

which contradicts the observed stack trace.

I wonder whether this is really a synchronization bug in xlog.c, or whether there is something wrong with this stack trace and it cannot happen during normal operation.

Thanks in advance,
-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

