Re: Online verification of checksums

From: Tomas Vondra
Subject: Re: Online verification of checksums
Date:
Msg-id: a6f6a9f7-3fb6-1cb5-631a-3e36012abde0@2ndquadrant.com
In response to: Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On 09/17/2018 07:35 PM, Stephen Frost wrote:
> Greetings,
> 
> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
> 
>     On 09/17/2018 07:11 PM, Stephen Frost wrote:
>     > Greetings,
>     >
>     > * Tomas Vondra (tomas.vondra@2ndquadrant.com
>     <mailto:tomas.vondra@2ndquadrant.com>) wrote:
>     >> On 09/17/2018 06:42 PM, Stephen Frost wrote:
>     >>> Ok, good, though I'm not sure what you mean by 'eliminates the
>     >>> consistency guarantees provided by the checkpoint'.  The point
>     is that
>     >>> the page will be in the WAL and the WAL will be replayed during the
>     >>> restore of the backup.
>     >>
>     >> The checkpoint guarantees that the whole page was written and
>     flushed to
>     >> disk with an LSN before the checkpoint LSN. So when you read a
>     page with
>     >> that LSN, you know the whole write already completed and a read won't
>     >> return data from before the LSN.
>     >
>     > Well, you know that the first part was written out at some prior
>     point,
>     > but you could end up reading the first part of a page with an
>     older LSN
>     > while also reading the second part with new data.
> 
> 
> 
>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
> 
> 
> How? Either it’s possible for the latter half of a page to be updated
> before the first half (where the LSN lives), or it isn’t. If it’s
> possible then that LSN could be ancient and it wouldn’t matter. 
> 

I'm not sure I understand what you're saying here.

It is not about the latter half of the page being updated before the
first half. I don't think that's possible, because write() into the page
cache does in fact copy the data sequentially.

The problem is that the write is not atomic, and AFAIK it happens in
sectors (which are either 512B or 4kB these days). And it may
arbitrarily interleave with reads.

So you may issue a single 8kB write(), but it actually happens in 512B
chunks and a concurrent read may observe some mix of those.

But the trick is that if the read sees the effect of the write somewhere
in the middle of the page, the next read is guaranteed to see all the
preceding new data.

Without the checkpoint we risk seeing the same in-progress write() on
both the read and the re-read, just at a different stage - so the LSN
would not change, making the check futile.

But by waiting for the checkpoint we know that the original write is no
longer in progress, so if we saw a partial write we're guaranteed to see
a new LSN on re-read.

This is what I mean by the checkpoint / fsync guarantee.
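To make that concrete, here is a small deterministic simulation (in Python, with made-up page contents - the "O"/"N" bytes and the prefix model of a sector-granular write are illustrative assumptions, not PostgreSQL code) of how a torn read can mix old and new sectors, and why a read issued after the write has completed must see the new LSN in the first sector:

```python
SECTOR = 512
PAGE = 8192
NSECT = PAGE // SECTOR  # 16 sectors per 8kB page

old_page = b"O" * PAGE  # old page version; old LSN lives in sector 0
new_page = b"N" * PAGE  # new page version; new LSN lives in sector 0

def torn_read(completed_sectors):
    """What a concurrent read may observe: the write() copies the page
    into the page cache sector by sector, so a racing read sees some
    prefix of new sectors followed by the remaining old sectors."""
    split = completed_sectors * SECTOR
    return new_page[:split] + old_page[split:]

# A read racing with the write may see a mix such as [new, new, old, ...]:
mix = torn_read(2)
assert mix[:2 * SECTOR] == b"N" * (2 * SECTOR)
assert mix[2 * SECTOR:] == b"O" * (PAGE - 2 * SECTOR)

# Once the write is known to be complete (which the checkpoint fsync
# guarantees), a re-read sees the whole new page, new LSN included:
assert torn_read(NSECT) == new_page
```

The key property is that the mix is always a new-data prefix followed by an old-data suffix, never the other way around.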


>     >> Without the checkpoint that's not guaranteed, and simply
>     re-reading the
>     >> page and rechecking it vs. the first read does not help:
>     >>
>     >> 1) write the first 512B of the page (sector), which includes the LSN
>     >>
>     >> 2) read the whole page, which will be a mix [new 512B, ... old ... ]
>     >>
>     >> 3) the checksum verification fails
>     >>
>     >> 4) read the page again (possibly reading a bit more new data)
>     >>
>     >> 5) the LSN did not change compared to the first read, yet the
>     checksum
>     >> still fails
>     >
>     > So, I agree with all of the above though I've found it to be extremely
>     > rare to get a single read which you've managed to catch part-way
>     through
>     > a write, getting multiple of them over a period of time strikes me as
>     > even more unlikely.  Still, if we can come up with a solution to solve
>     > all of this, great, but I'm not sure that I'm hearing one.
> 
> 
>     I don't recall claiming catching many such torn pages - I'm sure it's
>     not very common in most workloads. But I suspect constructing workloads
>     hitting them regularly is not very difficult either (something with a
>     lot of churn in shared buffers should do the trick).
> 
> 
> The question is if it’s possible to catch a torn page where the second
> half is updated *before* the first half of the page in a read (and then
> in subsequent reads having that state be maintained).  I have some
> skepticism that it’s really possible to happen in the first place but
> having an interrupted system call be stalled across two more system
> calls just seems terribly unlikely, and this is all based on the
> assumption that the kernel might write the second half of a write before
> the first to the kernel cache in the first place.
> 

Yes, if that was possible, the explanation about the checkpoint fsync
guarantee would be bogus, obviously.

I've spent quite a bit of time looking into how write() is handled, and
I believe seeing only the second half is not possible. You may observe a
page torn in various ways (not necessarily in half), e.g.

    [old,new,old]

but then on re-read you should be guaranteed to see new data up to and
including the last "new" chunk:

    [new,new,old]

At least that's my understanding. I failed to deduce what POSIX says
about this, or how it behaves on various OS/filesystems.

The one thing I've done is write a simple stress test that writes a
single 8kB page in a loop, reads it concurrently, and checks the
behavior. And it seems consistent with my understanding.
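A stress test along those lines might look like the following sketch (Python rather than the original test program; the alternating "A"/"B" patterns and the scratch file are my own assumptions). Actual tearing is timing-dependent, so the harness only asserts an invariant that must hold regardless: every observed byte belongs to one of the two page versions.

```python
import os
import threading

PAGE = 8192
PATTERNS = (b"A" * PAGE, b"B" * PAGE)

def stress(path, iterations=200):
    """Rewrite a single 8kB page in a loop while a reader thread
    re-reads it concurrently, collecting every observed page image."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    os.pwrite(fd, PATTERNS[0], 0)  # fully initialize the page first
    seen = []
    stop = threading.Event()

    def reader():
        while not stop.is_set():
            seen.append(os.pread(fd, PAGE, 0))

    t = threading.Thread(target=reader)
    t.start()
    for i in range(iterations):  # alternate between the two page versions
        os.pwrite(fd, PATTERNS[i % 2], 0)
    stop.set()
    t.join()
    os.close(fd)
    return seen

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile() as f:
        reads = stress(f.name)
    # Every read is a full page made only of bytes from the two versions;
    # a torn read would show up as a mix of "A" and "B" runs in one page.
    assert all(len(r) == PAGE for r in reads)
    assert all(set(r) <= {ord("A"), ord("B")} for r in reads)
```

A mixed page here would be direct evidence of a torn read; checking that any mix is always a new-prefix/old-suffix requires logging the read images and inspecting where the runs switch.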

> 
>     >>> Now, that said, I do think it's a good *idea* to check against the
>     >>> checkpoint LSN (presuming this is for online checking of
>     checksums- for
>     >>> basebackup, we could just check against the backup-start LSN as
>     anything
>     >>> after that point will be rewritten by WAL anyway).  The reason
>     that I
>     >>> think it's a good idea to check against the checkpoint LSN is
>     that we'd
>     >>> want to throw a big warning if the kernel is just feeding us random
>     >>> garbage on reads and only finding a difference between two reads
>     isn't
>     >>> really doing any kind of validation, whereas checking against the
>     >>> checkpoint-LSN would at least give us some idea that the value being
>     >>> read isn't completely ridiculous.
>     >>>
>     >>> When it comes to if the pg_sleep() is necessary or not, I have
>     to admit
>     >>> to being unsure about that..  I could see how it might be but it
>     seems a
>     >>> bit surprising- I'd probably want to see exactly what the page
>     was at
>     >>> the time of the failure and at the time of the second (no-sleep)
>     re-read
>     >>> and then after a delay and convince myself that it was just an
>     unlucky
>     >>> case of being scheduled in twice to read that page before the
>     process
>     >>> writing it out got a chance to finish the write.
>     >>
>     >> I think the pg_sleep() is a pretty strong sign there's something
>     broken.
>     >> At the very least, it's likely to misbehave on machines with
>     different
>     >> timings, machines under memory and/or memory pressure, etc.
>     >
>     > If we assume that what you've outlined above is a serious enough issue
>     > that we have to address it, and do so without a pg_sleep(), then I
>     think
>     > we have to bake into this a way for the process to check with PG as to
>     > what the page's current LSN is, in shared buffers, because that's the
>     > only place where we've got the locking required to ensure that we
>     don't
>     > end up with a read of a partially written page, and I'm really not
>     > entirely convinced that we need to go to that level.  It'd
>     certainly add
>     > a huge amount of additional complexity for what appears to be a quite
>     > unlikely gain.
>     >
>     > I'll chat w/ David shortly about this again though and get his
>     thoughts
>     > on it.  This is certainly an area we've spent time thinking about but
>     > are obviously also open to finding a better solution.
> 
> 
>     Why not to simply look at the last checkpoint LSN and use that the same
>     way basebackup does? AFAICS that should make the pg_sleep() unnecessary.
> 
> 
> Use that to compare to what?  The LSN in the first half of the page
> could be from well before the checkpoint or even the backup started.
> 

Not sure I follow. If the LSN in the page header is old, and the
checksum check failed, then on re-read we either find a new LSN (in
which case we skip the page) or consider this to be a checksum failure.
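In code, that re-read decision could be sketched as follows (a minimal Python sketch; the read_page/verify_checksum/page_lsn helpers are hypothetical stand-ins for the real tool's routines, and LSNs are compared as plain numbers):

```python
def recheck_page(read_page, verify_checksum, page_lsn, checkpoint_lsn, blkno):
    """A checksum failure is only reported if the re-read still shows an
    old LSN; a newer LSN means the page was being rewritten, and WAL
    replay will restore it during recovery anyway."""
    page = read_page(blkno)
    if verify_checksum(page):
        return "ok"
    page = read_page(blkno)  # re-read the possibly torn page
    if verify_checksum(page):
        return "ok"
    if page_lsn(page) > checkpoint_lsn:  # page is being rewritten: skip it
        return "skipped"
    return "checksum failure"  # old LSN and still broken: report it
```

No sleep is needed: by the time the check runs, the last checkpoint's fsync guarantees that any write producing an old-LSN page has completed, so a persistent failure with an old LSN is a genuine corruption.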


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

