Online checksums verification in the backend

From Julien Rouhaud
Subject Online checksums verification in the backend
Msg-id CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com
Responses Re: Online checksums verification in the backend  (Robert Haas <robertmhaas@gmail.com>)
Re: Online checksums verification in the backend  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Hi,

This topic was discussed several times, with the most recent
discussions found at [1] and [2].  Based on those discussions, my
understanding is that the current approach in BASE_BACKUP has too many
drawbacks and we should instead do this check in the backend.  I've
been working on such an approach at VMware, and I'm submitting it here
to discuss the approach and its rationale, and hopefully to get such a
feature integrated.

First, this was originally developed as an extension, which means that
the check is performed using an SRF.  That's maybe not the best
approach, as a transaction has to be kept open for the total processing
time.  This can be mitigated by checking each relation independently,
but that's still not ideal.  Maybe using some utility command (as part
of VACUUM or a new CHECK command for instance) would be a better
approach.

This brings the second consideration: how to report the list of
corrupted blocks to end users.  As I said, this is for now returned via
the SRF, but this is clearly not ideal and should rather be made
available more globally.  One usage of this information could be block
level recovery.  I'm Cc-ing Sawada-san, as I know he's working on this
and mentioned to me that he had ideas on passing the list of corrupted
blocks using the stats collector.

Finally, the read and locking considerations.  I tried to cover that
extensively in the comments, but here are some details on how I tried
to make the check safe while trying to keep the overhead as low as
possible.  First thing is that this is only doing buffered reads,
without any attempt to discard OS cache.  Therefore, any discrepancy
between the OS cache and the disk cannot be detected unless you do
other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held,
meaning that it can't get deleted/truncated.  The total number of
blocks for the given fork is retrieved first, so any new block will be
ignored.  Such new blocks are considered out of scope as being written
after the start of the check.
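
To illustrate the scoping rule above, here is a minimal sketch in
Python (a simplified model, not code from the patch).  The relation
fork is modelled as a plain list and check_fork / check_block are
hypothetical names.  The point is only that the block count is read
once, so blocks added during the check are never visited.

```python
def check_fork(relation_blocks, check_block):
    """Check only the blocks that existed when the check started.

    relation_blocks is a list standing in for a relation fork that may
    be extended while the check runs (an assumption of this model).
    """
    nblocks = len(relation_blocks)  # snapshot taken once, up front
    checked = []
    for blkno in range(nblocks):
        check_block(blkno)
        checked.append(blkno)
    return checked
```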

Each time a buffer is being checked, the target buffer mapping
partition lock is acquired in shared mode, to prevent concurrent
eviction.  If the buffer is found in shared buffers, it's pinned and
released immediately, just to get the state.  If the buffer is found
dirty, no check is performed as it'll be written to disk by the
checkpointer, or during recovery in case of unclean shutdown.
Otherwise, an IO lock is held while the buffer is being read into a
private buffer.  The IO lock and buffer mapping lock are then released
and the check is performed.
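
The decision flow for a block that may be in shared buffers can be
sketched roughly as follows.  This is a simplified Python model under
stated assumptions, not the patch's code: BufferDesc,
fetch_cached_block and the status strings are hypothetical, the pin
step is omitted, and a plain mutex stands in for the partition lock
that the real code takes in shared mode.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class BufferDesc:
    # Hypothetical, heavily simplified buffer descriptor.
    dirty: bool = False
    io_lock: threading.Lock = field(default_factory=threading.Lock)


def fetch_cached_block(tag, shared_buffers, partition_lock, read_from_disk):
    """Return (status, data) for the block identified by tag.

    status is "not_cached", "skipped_dirty" or "read_under_io_lock".
    All locks are released before returning, so the caller performs
    the actual checksum check without holding any lock.
    """
    with partition_lock:  # prevents concurrent eviction during lookup
        buf = shared_buffers.get(tag)
        if buf is None:
            return ("not_cached", None)
        if buf.dirty:
            # Will be written out later anyway, so there is nothing
            # meaningful to verify on disk right now.
            return ("skipped_dirty", None)
        with buf.io_lock:  # no concurrent write while we read
            private_copy = read_from_disk(tag)
    return ("read_under_io_lock", private_copy)
```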

If the buffer is not found in shared buffers, the buffer mapping
partition lock is released immediately and the block is read from
disk.  It's therefore possible to get a false positive here, as the
block could be concurrently read, modified and partially written to
disk.  So, if an error is detected in this case, the check is
restarted from scratch and if the buffer is still not found in shared
buffers, the read will be done while still holding the buffer mapping
partition lock to make sure that it can't get concurrently loaded and
modified.  This is an optimistic approach to avoid performance
overhead, assuming that there shouldn't be many positives and that the
window for false positives is very narrow.
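
A rough sketch of this optimistic retry, again as a simplified Python
model rather than the patch's code.  The function name, the callbacks
and the returned status strings are all hypothetical, and a plain
mutex stands in for the buffer mapping partition lock.

```python
import threading


def check_uncached_block(blkno, read_block, is_valid,
                         in_shared_buffers, partition_lock):
    """Check a block that was not found in shared buffers.

    Returns "ok", "corrupted", or "recheck_cached" when the block has
    been loaded in the meantime and the cached path should be used.
    """
    # First attempt: read without holding any lock (the cheap path).
    if is_valid(read_block(blkno)):
        return "ok"

    # The failure may come from a concurrent partial write.  Restart,
    # this time keeping the partition lock for the whole read so the
    # block cannot be loaded and modified underneath us.
    with partition_lock:
        if in_shared_buffers(blkno):
            return "recheck_cached"
        return "ok" if is_valid(read_block(blkno)) else "corrupted"
```

Only a block that fails both reads is reported, so the lock is held
across IO only in the rare case where the first read looked bad.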

The check consists of simply comparing the stored and computed
checksums, with an additional check that the page is really new (using
PageIsVerified) if PageIsNew() reports it as new.  Since this is done
after releasing all locks, we could definitely add more checks without
causing much overhead, like pd_lower/pd_upper sanity.  I prefer to
keep the check simple for now and rather focus on the general
approach.
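
As an illustration of the shape of that check, here is a small Python
sketch.  It assumes the usual page header layout (pd_checksum at byte
offset 8, pd_lower at 12, pd_upper at 14), an 8192-byte block and
little-endian storage.  toy_checksum is a deliberately trivial
stand-in and is NOT the pg_checksum_page() algorithm; make_page and
check_page are hypothetical helpers.

```python
import struct

BLCKSZ = 8192
CHECKSUM_OFFSET = 8   # pd_checksum follows the 8-byte pd_lsn
PD_LOWER_OFFSET = 12
PD_UPPER_OFFSET = 14


def toy_checksum(page, blkno):
    """Stand-in 16-bit checksum (NOT the real PostgreSQL algorithm).

    Like the real one, it ignores the stored checksum field and mixes
    in the block number.
    """
    zeroed = page[:CHECKSUM_OFFSET] + b"\x00\x00" + page[CHECKSUM_OFFSET + 2:]
    return (sum(zeroed) ^ blkno) & 0xFFFF


def make_page(blkno, payload=b"data"):
    """Build a non-new page with a matching stored checksum."""
    page = bytearray(BLCKSZ)
    struct.pack_into("<HH", page, PD_LOWER_OFFSET, 24, BLCKSZ - len(payload))
    page[BLCKSZ - len(payload):] = payload
    struct.pack_into("<H", page, CHECKSUM_OFFSET, toy_checksum(bytes(page), blkno))
    return bytes(page)


def check_page(page, blkno):
    """Return True if the page looks fine, False if it looks corrupted."""
    (pd_upper,) = struct.unpack_from("<H", page, PD_UPPER_OFFSET)
    if pd_upper == 0:
        # Looks new: accept it only if it really is all zeroes.
        return page == bytes(BLCKSZ)
    (stored,) = struct.unpack_from("<H", page, CHECKSUM_OFFSET)
    return stored == toy_checksum(page, blkno)
```

Extra sanity checks such as pd_lower <= pd_upper could be added in
check_page without touching any of the locking logic.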

Lastly, I also reused the vacuum cost GUCs (for simplicity) and the
same approach to add some throttling.
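
For readers unfamiliar with cost-based throttling, the general idea
can be sketched as below.  This is a simplified Python model, not the
patch's code: CostThrottle is a hypothetical name, the default values
are only illustrative, and the real vacuum delay logic has more
details (for instance it scales the sleep time).

```python
import time


class CostThrottle:
    """Accumulate a cost per block and nap once a limit is reached."""

    def __init__(self, cost_limit=200, cost_delay_ms=2.0, sleep=time.sleep):
        self.cost_limit = cost_limit
        self.cost_delay_ms = cost_delay_ms  # 0 disables throttling
        self.sleep = sleep                  # injectable for testing
        self.balance = 0

    def charge(self, cost):
        """Add cost to the balance; return True if a nap was taken."""
        self.balance += cost
        if self.cost_delay_ms > 0 and self.balance >= self.cost_limit:
            self.sleep(self.cost_delay_ms / 1000.0)
            self.balance = 0
            return True
        return False
```

A checker would call charge() once per block, with a smaller cost for
a block found in shared buffers than for one read from disk.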

I'm attaching a patch that adds a new pg_check_relation() SQL function
to perform a check of one or all relations, and some simple regression
tests.

[1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de
