Re: Corruption during WAL replay

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Corruption during WAL replay
Дата
Msg-id 20200330233159.af74fwkcod6y7u7k@alap3.anarazel.de
обсуждение исходный текст
Ответ на Re: Corruption during WAL replay  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: Corruption during WAL replay  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Re: Corruption during WAL replay  (Teja Mupparti <tejeswarm@hotmail.com>)
Re: Corruption during WAL replay  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Список pgsql-hackers
Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:
> At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in 
> > The original bug reporting-email and the relevant discussion is here
> ...
> > The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash
beforethe truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file
anddrop the buffer later.
 
> 
> BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?


> I'm not sure it's acceptable to remember all to-be-deleted buffers
> while truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).


> 
> +     /*START_CRIT_SECTION();*/

> Is this a point of argument?  It is not needed if we choose the
> strategy (c) in [1], since the protocol is aiming to allow server to
> continue running after truncation failure.
> 
> [1]: https://www.postgresql.org/message-id/20191207001232.klidxnm756wqxvwx%40alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the trunction, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.


If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Steele
Дата:
Сообщение: Re: backup manifests
Следующее
От: Cary Huang
Дата:
Сообщение: Re: Internal key management system