Re: VM corruption on standby
From | Yura Sokolov |
---|---|
Subject | Re: VM corruption on standby |
Date | |
Msg-id | fe039a5c-7c15-415c-a082-eaec856b4433@postgrespro.ru |
In response to | Re: VM corruption on standby (Kirill Reshke <reshkekirill@gmail.com>) |
Responses | Re: VM corruption on standby |
List | pgsql-hackers |
19.08.2025 16:17, Kirill Reshke wrote:
> On Tue, 19 Aug 2025 at 14:14, Kirill Reshke <reshkekirill@gmail.com> wrote:
>>
>> This thread is a candidate for [0]
>>
>>
>> [0] https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items
>>
>
> Let me summarize this thread for ease of understanding of what's going on:
>
> Timeline:
> 1) Andrey Borodin sends a patch (on 6 Aug) claiming there is
> corruption in VM bits.
> 2) We investigate: the problem is not with how PostgreSQL modified buffers
> or logs changes, but with LWLockReleaseAll in proc_exit(1) after
> kill-9 PM
> 3) We have reached the conclusion that there is no corruption, and
> that injection points are not a valid way to reproduce them, because
> of WaitLatch and friends.
>
> 4) But we now suspect there is another corruption with ANY critical
> section in scenario:
>
> I wrote:
>
>> Maybe I'm very wrong about this, but I'm currently suspecting there is
>> corruption involving CHECKPOINT, process in CRIT section and kill -9.
>> 1) Some process p1 locks some buffer (name it buf1), enters CRIT
>> section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar
>> in (GetXLogBuffer -> AdvanceXLInsertBuffer).
>> 2) CHECKPOINT (p2) starts and tries to FLUSH dirty buffers, awaiting lock on buf1
>> 3) Postmaster kill-9-ed
>> 4) signal of postmaster death delivered to p1, it wakes up in
>> WaitLatch/WaitEventSetWaitBlock functions, checks postmaster
>> aliveness, and exits releasing all locks.
>> 5) p2 acquires locks on buf1 and flushes it to disk.
>> 6) signal of postmaster death delivered to p2, p2 exits.
>
> 5) We create an open item for pg18 and propose reverting
> bc22dc0e0ddc2dcb6043a732415019cc6b6bf683 or fix it quickly.

Latch and ConditionVariable (which uses Latch) are among the basic
synchronization primitives in PostgreSQL. Therefore they have to work
correctly everywhere: in critical sections, in WAL logging, etc.
The current behavior of WaitEventSetWaitBlock is certainly a bug and
ought to be fixed.
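To make the difference between the two exit paths concrete, here is a toy sketch. It is not PostgreSQL code, and every name in it (toy_report_fd, toy_release_all_locks, cleanup_ran_in_child) is made up for illustration. It only relies on standard POSIX behavior: exit() runs handlers registered with atexit(), while _exit() terminates the process without running them. The handler stands in for the "release all locks" cleanup from step 4 of the quoted scenario.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: write end of a pipe on which the child reports "locks released". */
static int toy_report_fd = -1;

/* Stand-in for the release-everything cleanup that a graceful exit performs. */
static void toy_release_all_locks(void)
{
    if (toy_report_fd >= 0)
    {
        ssize_t n = write(toy_report_fd, "R", 1);
        (void) n;               /* best effort, the process is exiting anyway */
    }
}

/*
 * Fork a child that pretends to sit in a critical section and then notices
 * that its parent supervisor is gone.
 * Returns 1 if the cleanup handler ran in the child, 0 if it did not,
 * and -1 on a system call failure.
 */
int cleanup_ran_in_child(int immediate_exit)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    fflush(NULL);               /* avoid duplicating buffered stdio output */
    pid_t pid = fork();
    if (pid < 0)
    {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0)
    {
        close(fds[0]);
        toy_report_fd = fds[1];
        if (atexit(toy_release_all_locks) != 0)
            _exit(3);
        /* Imagine: buffer locked, critical section entered, page half modified. */
        if (immediate_exit)
            _exit(2);           /* no atexit handlers: locks are never "released" */
        exit(1);                /* atexit handlers run: locks are "released" */
    }

    close(fds[1]);
    char byte;
    ssize_t got = read(fds[0], &byte, 1);   /* 1 byte if handler ran, 0 on EOF */
    close(fds[0]);

    int status;
    if (waitpid(pid, &status, 0) < 0 || got < 0)
        return -1;
    return got == 1 ? 1 : 0;
}
```

With the graceful path the cleanup is observable to other processes, which is exactly what lets p2 take the lock and flush the half-modified buffer. With the immediate path nothing is released, so nobody else gets to see the intermediate state.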
So +1 for _exit(2) as Tom suggested.

--
regards
Yura Sokolov aka funny-falcon