Re: Online verification of checksums

From: Michael Banck
Subject: Re: Online verification of checksums
Date:
Msg-id: 20180917140418.GA23519@nighthawk.caipicrew.dd-dns.de
In reply to: Re: Online verification of checksums  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses: Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
    Re: Online verification of checksums  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers
Hi,

On Mon, Sep 03, 2018 at 10:29:18PM +0200, Tomas Vondra wrote:
> The patch is mostly copying the verification / retry logic from
> basebackup.c, but I think it omitted a rather important detail that
> makes it incorrect in the presence of concurrent writes.
> 
> The very first thing basebackup does is this:
> 
>     startptr = do_pg_start_backup(...);
> 
> i.e. it waits for a checkpoint, remembering the LSN. And then when
> checking a page it does this:
> 
>    if (!PageIsNew(page) && PageGetLSN(page) < startptr)
>    {
>        ... verify the page checksum
>    }
> 
> Obviously, pg_verify_checksums can't do that easily because it's
> supposed to run from outside the database instance. 

It reads pg_control anyway, so couldn't we just take
ControlFile->checkPoint?

Other than that, basebackup.c seems to only look at pages which haven't
been changed since the backup starting checkpoint (see above if
statement). That's reasonable for backups, but is it just as reasonable
for online verification?

> But the startptr detail is pretty important because it supports this
> retry reasoning:
> 
>     /*
>      * Retry the block on the first failure.  It's
>      * possible that we read the first 4K page of the
>      * block just before postgres updated the entire block
>      * so it ends up looking torn to us.  We only need to
>      * retry once because the LSN should be updated to
>      * something we can ignore on the next pass.  If the
>      * error happens again then it is a true validation
>      * failure.
>      */
> 
> Imagine the 8kB page as two 4kB pages, with the initial state being
> [A1,A2] and another process over-writing it with [B1,B2]. If you read
> the 8kB page, what states can you see?
> 
> I don't think POSIX provides any guarantees about atomicity of the write
> calls (and even if it does, the filesystems on Linux don't seem to). So
> you may observe both [A1,B2] or [A2,B1], or various inconsistent mixes
> of the two versions, depending on timing. Well, torn pages ...
> 
> Pretty much the only thing you can rely on is that when one process does
> 
>     write([B1,B2])
> 
> the other process may first read [A1,B2], but the next read will return
> [B1,B2] (or possibly newer data, if there was another write). It will
> not read the "stale" A1 again.
> 
> The basebackup relies on this kinda implicitly - on the retry it'll
> notice the LSN changed (thanks to the startptr check), and the page will
> be skipped entirely. This is pretty important, because the new page
> might be torn in some other way.
>
> The pg_verify_checksum apparently ignores this skip logic, because on
> the retry it simply re-reads the page again, verifies the checksum and
> reports an error. Which is broken, because the newly read page might be
> torn again due to a concurrent write.

Well, ok.
 
> So IMHO this should do something similar to basebackup - check the page
> LSN, and if it changed then skip the page.
> 
> I'm afraid this requires using the last checkpoint LSN, the way startptr
> is used in basebackup. In particular we can't simply remember LSN from
> the first read, because we might actually read [B1,A2] on the first try,
> and then [B1,B2] or [B1,C2] on the retry. (Actually, the page may be
> torn in various other ways, not necessarily at the 4kB boundary - it
> might be torn right after the LSN, for example).

I'd prefer to come up with a plan where we don't just give up once we
see a new LSN, if possible. If I run a modified pg_verify_checksums
which skips on newer pages in a tight benchmark, basically everything
gets skipped as checkpoints don't happen often enough.

So how about we check every page, but if one fails on the retry and its
LSN is newer than the checkpoint, we then skip it? Is that logic sound?

In any case, if we decide we really should skip the page if it is newer
than the checkpoint, I think it makes sense to track those skipped pages
and print their number out at the end, if there are any.

> FWIW I also don't understand the purpose of pg_sleep(), it does not seem
> to protect against anything, really.

Well, I noticed that without it I got sporadic checksum failures on
reread, so I added it to make them go away. It was certainly an
empirical stopgap that I am happy to trade for a better solution.

Also, I noticed there's sometimes a 'data/global/pg_internal.init.606'
or similar file which confuses pg_verify_checksums, so I guess we
should skip that as well.  Can we assume that all files whose names
start with one of the entries in skip[] are safe to skip, or should we
add an explicit exception for files starting with pg_internal.init?


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is governed by
the following provisions: https://www.credativ.de/datenschutz

