Re: Corruption during WAL replay

Поиск
Список
Период
Сортировка
От Heikki Linnakangas
Тема Re: Corruption during WAL replay
Дата
Msg-id 7555d464-5bf7-7d6f-febd-0df9bcffe2a7@iki.fi
обсуждение исходный текст
Ответ на Re: Corruption during WAL replay  (Teja Mupparti <tejeswarm@hotmail.com>)
Ответы Re: Corruption during WAL replay  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On 14/04/2020 22:04, Teja Mupparti wrote:
> Thanks Kyotaro and Masahiko for the feedback. I think there is a 
> consensus on the critical-section around truncate,

+1

> but I just want to emphasize the need for reversing the order of the
> dropping the buffers and the truncation.
> 
>   Repro details (when full page write = off)
> 
>           1) Page on disk has empty LP 1, Insert into page LP 1
>           2) checkpoint START (Recovery REDO eventually starts here)
>           3) Delete all rows on the page (page is empty now)
>           4) Autovacuum kicks in and truncates the pages
>                   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 
> on disk still empty
>           5) Checkpoint completes
>           6) Crash
>           7) smgrtruncate - Not reached (this is where we do the 
> physical truncate)
> 
>   Now the crash-recovery starts
> 
>           Delete-log-replay (above step-3) reads page with empty LP 1 
> and the delete fails with PANIC (old page on disk with no insert)
> 
> Doing recovery, truncate is even not reached, a WAL replay of the 
> truncation will happen in the future but the recovery fails (repeatedly) 
> even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and 
truncating the file would open a different issue:

   1) Page on disk has empty LP 1, Insert into page LP 1
   2) checkpoint START (Recovery REDO eventually starts here)
   3) Delete all rows on the page (page is empty now)
   4) Autovacuum kicks in and starts truncating
   5) smgrtruncate() truncates the file
   6) checkpoint writes out buffers for pages that were just truncated 
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before 
truncating the file to fix that, but I'm wary of that approach. Firstly, 
it requires scanning the buffers that are dropped twice, which can take 
a long time. I remember that people have already complained that 
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers 
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems 
bad. For starters, because you're not holding buffer's I/O lock, I 
believe the checkpointer would busy-wait on the buffers until the 
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from 
completing, until all in-progress truncations have completed. We have a 
mechanism to wait out in-progress commits at the beginning of a 
checkpoint, right after the redo point has been established. See 
comments around the GetVirtualXIDsDelayingChkpt() function call in 
CreateCheckPoint(). We could have a similar mechanism to wait out the 
truncations before *completing* a checkpoint.

- Heikki



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Optimising compactify_tuples()
Следующее
От: Ibrar Ahmed
Дата:
Сообщение: Re: COPY FREEZE and setting PD_ALL_VISIBLE/visibility map bits