Re: [patch] Fix checksum verification in base backups for zero page headers

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: [patch] Fix checksum verification in base backups for zero page headers
Дата
Msg-id 20201022012519.GF1475@paquier.xyz
обсуждение исходный текст
Ответ на Re: [patch] Fix checksum verification in base backups for zero page headers  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Ответы Re: [patch] Fix checksum verification in base backups for zero page headers  (Julien Rouhaud <rjuju123@gmail.com>)
Re: [patch] Fix checksum verification in base backups for zero page headers  (Anastasia Lubennikova <a.lubennikova@postgrespro.ru>)
Список pgsql-hackers
On Thu, Oct 22, 2020 at 12:47:03AM +0300, Anastasia Lubennikova wrote:
> Thank you. I always forget about this. Do we have any checklist for such
> changes, that patch authors and reviewers can use?

Not really.  That's more a habit than anything else where any
non-static routine that we publish could be used by some out-of-core
code, so maintaining a pure API compatibility on stable branches is
essential.

> We can also read such pages via shared buffers to be 100% sure.

Yeah, but this has its limits as well.  One can use
ignore_checksum_failure, but this can actually be very dangerous as
you can finish by loading into shared buffers a page that has a header
thought as sane but with a large portion of its page randomly
corrupted, spreading corruption around and leading to more fancy
logic failures in Postgres, with more panic from customers.  Not using
ignore_checksum_failure is safer, but it makes an analyze of the
damages for a given file harder as things would stop at the first
failure of a file with a seqscan.  pg_prewarm can help here, but
that's not the goal of the tool to do that either.

We definitely need a different approach that guarantees that a page is
correctly locked with no concurrent I/O while checked on retry, and I
am looking at that for the SQL-level check.  That's not something I
would do in v13 though, but we can make the existing logic much more
reliable with a set of fixed retries and some sleeps in-between.  A
maximum of 5 retries with 100ms seems like a good balance seen from
here, but that's not a perfect science of course depending on the
hardware latency.

This has been a sensitive topic during the stability period of v13
with many committers commenting on the issue, so I'd rather be very
careful that there are no objections to what's published here, and in
consequence I am going to ping the other thread on the matter.  For
now I have the attached to address the zero-case and the random LSN
case.
--
Michael

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Masahiro Ikeda
Дата:
Сообщение: Re: Add statistics to pg_stat_wal view for wal related parameter tuning
Следующее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: Is Recovery actually paused?