Re: Missing error handling for FATALs in checkpointer/bgwriter

From Andres Freund
Subject Re: Missing error handling for FATALs in checkpointer/bgwriter
Date
Msg-id 20160505210159.yycie6dvmjcj4m5q@alap3.anarazel.de
In reply to Re: atomic pin/unpin causing errors  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 2016-05-05 11:52:46 -0700, Andres Freund wrote:
> Hi Jeff,
> 
> On 2016-04-29 10:38:55 -0700, Jeff Janes wrote:
> > I don't see the problem with a cassert-enabled build, probably because it
> > is just too slow to ever reach the point where the problem occurs.
> 
> Running the test with cassert enabled I actually get assertion failures,
> due to the FATAL you added.
> 
> #1  0x0000000000958dde in ExceptionalCondition (conditionName=0xb36c2a "!(RefCountErrors == 0)", errorType=0xb361af "FailedAssertion",
>     fileName=0xb36170 "/home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c", lineNumber=2506) at /home/admin/src/postgresql/src/backend/utils/error/assert.c:54
> #2  0x00000000007c9fc9 in CheckForBufferLeaks () at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2506
> #3  0x00000000007c9f09 in AtProcExit_Buffers (code=1, arg=0) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2459
> #4  0x00000000007d927f in shmem_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:261
> #5  0x00000000007d90dd in proc_exit_prepare (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:185
> #6  0x00000000007d904b in proc_exit (code=1) at /home/admin/src/postgresql/src/backend/storage/ipc/ipc.c:102
> #7  0x000000000095958d in errfinish (dummy=0) at /home/admin/src/postgresql/src/backend/utils/error/elog.c:543
> #8  0x000000000080214b in mdwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000')
>     at /home/admin/src/postgresql/src/backend/storage/smgr/md.c:832
> #9  0x0000000000804633 in smgrwrite (reln=0x2e8b4a8, forknum=MAIN_FORKNUM, blocknum=154, buffer=0x2e8e5a8 "", skipFsync=0 '\000')
>     at /home/admin/src/postgresql/src/backend/storage/smgr/smgr.c:650
> #10 0x00000000007ca548 in FlushBuffer (buf=0x7f0285955330, reln=0x2e8b4a8) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2734
> #11 0x00000000007c9d5a in SyncOneBuffer (buf_id=2503, skip_recently_used=0 '\000', wb_context=0x7ffe7305d290) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2377
> #12 0x00000000007c964e in BufferSync (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:1967
> #13 0x00000000007ca185 in CheckPointBuffers (flags=64) at /home/admin/src/postgresql/src/backend/storage/buffer/bufmgr.c:2561
> #14 0x000000000052d497 in CheckPointGuts (checkPointRedo=382762776, flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8644
> #15 0x000000000052cede in CreateCheckPoint (flags=64) at /home/admin/src/postgresql/src/backend/access/transam/xlog.c:8430
> #16 0x00000000007706ac in CheckpointerMain () at /home/admin/src/postgresql/src/backend/postmaster/checkpointer.c:488
> #17 0x000000000053e0d5 in AuxiliaryProcessMain (argc=2, argv=0x7ffe7305ea40) at /home/admin/src/postgresql/src/backend/bootstrap/bootstrap.c:429
> #18 0x000000000078099f in StartChildProcess (type=CheckpointerProcess) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:5227
> #19 0x000000000077dcc3 in reaper (postgres_signal_arg=17) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:2781
> #20 <signal handler called>
> #21 0x00007f028ebbdac3 in __select_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #22 0x000000000077c049 in ServerLoop () at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1654
> #23 0x000000000077b7a9 in PostmasterMain (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/postmaster/postmaster.c:1298
> #24 0x00000000006c5849 in main (argc=4, argv=0x2e49f20) at /home/admin/src/postgresql/src/backend/main/main.c:228
> 
> You didn't see those?
> 
> 
> The trigger here appears to be that the checkpointer doesn't have an
> on-exit callback similar to a normal backend's ShutdownPostgres() et al,
> and thus doesn't trigger a resource owner release.  The normal ERROR
> path has:
>         /* buffer pins are released here: */
>         ResourceOwnerRelease(CurrentResourceOwner,
>                              RESOURCE_RELEASE_BEFORE_LOCKS,
>                              false, true);
>         /* we needn't bother with the other ResourceOwnerRelease phases */
> 
> That clearly is a bug. But I'm not immediately seeing how this could
> trigger the corruption issue you observed.


The same issue exists in bgwriter afaics. ISTM that we need to provide
a before_shmem_exit (or on_shmem_exit?) handler for both which
essentially does

        /*
         * These operations are really just a minimal subset of
         * AbortTransaction().  We don't have very many resources to worry
         * about in bgwriter, but we do have LWLocks, buffers, and temp files.
         */
        LWLockReleaseAll();
        AbortBufferIO();
        UnlockBuffers();
        /* buffer pins are released here: */
        ResourceOwnerRelease(CurrentResourceOwner,
                             RESOURCE_RELEASE_BEFORE_LOCKS,
                             false, true);

It looks to me like that should be backpatched?

There's some question about how to make the ordering
vs. AtProcExit_Buffers robust, which is why I'm explicitly doing
LWLockReleaseAll/AbortBufferIO/UnlockBuffers above.
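
To illustrate the ordering concern, here is a toy, self-contained C model.
It is not PostgreSQL source: every toy_* name is made up for illustration,
and the pin counter is just an int. It assumes only that shmem_exit() runs
the before_shmem_exit callbacks first and the on_shmem_exit callbacks
(where AtProcExit_Buffers is registered) afterwards, each list in reverse
registration order.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of this is PostgreSQL source. */
#define MAX_CALLBACKS 8

typedef void (*exit_callback) (int code);

typedef struct CallbackList
{
	exit_callback fn[MAX_CALLBACKS];
	int			n;
} CallbackList;

static CallbackList before_list;	/* models before_shmem_exit() callbacks */
static CallbackList on_list;		/* models on_shmem_exit() callbacks */

static int	toy_pins = 0;			/* models buffer pins held by the process */
static int	toy_leaks_seen = -1;	/* pins seen by the leak check; -1 = not run */

static void
toy_register(CallbackList *list, exit_callback fn)
{
	assert(list->n < MAX_CALLBACKS);
	list->fn[list->n++] = fn;
}

/* Run one list in reverse registration order. */
static void
toy_run(CallbackList *list, int code)
{
	while (list->n > 0)
		list->fn[--list->n] (code);
}

/* Models the proposed handler: drop whatever pins are still held. */
static void
toy_release_pins(int code)
{
	(void) code;
	toy_pins = 0;
}

/* Models AtProcExit_Buffers(): record how many pins are still held. */
static void
toy_check_leaks(int code)
{
	(void) code;
	toy_leaks_seen = toy_pins;
}

/* Models shmem_exit(): "before" callbacks first, then "on" callbacks. */
static void
toy_shmem_exit(int code)
{
	toy_run(&before_list, code);
	toy_run(&on_list, code);
}

/* Simulate a FATAL raised while `pins` pins are held; return leaks seen. */
static int
toy_fatal_exit(int pins, int with_cleanup_handler)
{
	before_list.n = on_list.n = 0;
	toy_pins = pins;
	toy_leaks_seen = -1;
	toy_register(&on_list, toy_check_leaks);
	if (with_cleanup_handler)
		toy_register(&before_list, toy_release_pins);
	toy_shmem_exit(1);
	return toy_leaks_seen;
}
```

In this model toy_fatal_exit(1, 0) reports one leaked pin, which is the
situation in the backtrace, while toy_fatal_exit(1, 1) reports none because
the cleanup runs in the "before" phase. If the cleanup were instead
registered with on_shmem_exit, whether it runs ahead of the leak check would
depend on registration order, which is the fragile part.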

Any better ideas?


