Re: Online verification of checksums

From: Tomas Vondra
Subject: Re: Online verification of checksums
Date:
Msg-id: a96dcaa9-3e36-bcb5-34f3-804fefd2a571@2ndquadrant.com
In reply to: Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
Responses: Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers
On 09/18/2018 12:01 AM, Stephen Frost wrote:
> Greetings,
> 
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/17/2018 07:35 PM, Stephen Frost wrote:
>>> On Mon, Sep 17, 2018 at 13:20 Tomas Vondra <tomas.vondra@2ndquadrant.com
>>> <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>>>     Doesn't the checkpoint fsync pretty much guarantee this can't happen?
>>>
>>> How? Either it’s possible for the latter half of a page to be updated
>>> before the first half (where the LSN lives), or it isn’t. If it’s
>>> possible then that LSN could be ancient and it wouldn’t matter. 
>>
>> I'm not sure I understand what you're saying here.
>>
>> It is not about the latter half of the page being updated before the
>> first half. I don't think that's quite possible, because write() into
>> the page cache does in fact write the data sequentially.
> 
> Well, maybe 'updated before' wasn't quite the right way to talk about
> it, but consider if a read(8K) gets only half-way through the copy
> before having to go do something else and by the time it gets back, a
> write has come in and rewritten the page, such that the read(8K)
> returns half-old and half-new data.
> 
>> The problem is that the write is not atomic, and AFAIK it happens in
>> sectors (which are either 512B or 4K these days). And it may arbitrarily
>> interleave with reads.
> 
> Yes, of course the write isn't atomic, that's clear.
> 
>> So you may do write(8k), but it actually happens in 512B chunks and a
>> concurrent read may observe some mix of those.
> 
> Right, I'm not sure that we really need to worry about sub-4K writes
> though I suppose they're technically possible, but it doesn't much
> matter in this case since the LSN is early on in the page, of course.
> 
>> But the trick is that if the read sees the effect of the write somewhere
>> in the middle of the page, the next read is guaranteed to see all the
>> preceding new data.
> 
> If that's guaranteed then we can just check the LSN and be done.
> 

What do you mean by "check the LSN"? Compare it to the LSN from the
first read? You don't know whether the first read already saw the new
LSN or not (see the example below).

>> Without the checkpoint we risk seeing the same write() both in read and
>> re-read, just in a different stage - so the LSN would not change, making
>> the check futile.
> 
> This is the part that isn't making much sense to me.  If we are
> guaranteed that writes into the kernel cache are always in order and
> always at least 512B in size, then if we check the LSN first and
> discover it's "old", and then read the rest of the page and calculate
> the checksum, discover it's a bad checksum, and then go back and re-read
> the page then we *must* see that the LSN has changed OR conclude that
> the checksum is invalidated.
> 

Even if the writes are in order and in 512B chunks, you don't know how
they are interleaved with the reads.

Let's assume we're doing a write(), which splits the 8kB page into 512B
chunks. A concurrent read may observe a random mix of old and new data,
depending on timing.

So let's say a read sees the first 4kB of the page (eight 512B chunks)
like this:

[new, new, new, old, new, old, new, old]

OK, the page is obviously torn, the checksum fails, and we try reading
it again. On re-read we should see new data at least up to the last
'new' chunk from the first read, so let's say we get this:

[new, new, new, new, new, new, new, old]

Obviously, this page is also torn (there is still old data at the end),
but both reads saw the new beginning of the page, which includes the
LSN. So the LSN is the same in both reads, and your detection fails.

Comparing the page LSN to the last checkpoint LSN solves this: if the
page LSN is older than the checkpoint LSN, that write must have
completed by now, so we're not in danger of seeing only its incomplete
effects. And any newer write will update the LSN.
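
To spell the logic out, the check might look roughly like this (just a
sketch -- read_block(), page_checksum_ok() and page_lsn() are made-up
helpers, not actual PostgreSQL APIs):

    #include <stdbool.h>
    #include <stdint.h>
    #include <sys/types.h>

    /* hypothetical helpers, for illustration only */
    extern void     read_block(int fd, off_t offset, char *buf);
    extern bool     page_checksum_ok(const char *buf);
    extern uint64_t page_lsn(const char *buf);

    #define BLCKSZ 8192

    /*
     * Verify one block, re-reading it once on checksum failure.
     * checkpoint_lsn is the LSN of the last checkpoint.
     */
    static bool
    block_ok(int fd, off_t offset, uint64_t checkpoint_lsn)
    {
        char    buf[BLCKSZ];

        read_block(fd, offset, buf);
        if (page_checksum_ok(buf))
            return true;

        /*
         * Checksum failed -- the page may just be torn by a write that
         * is in progress right now.  Re-read it: a write started after
         * the checkpoint must set a page LSN newer than the checkpoint
         * LSN, while a write from before the checkpoint is guaranteed
         * to be complete already.
         */
        read_block(fd, offset, buf);
        if (page_checksum_ok(buf) || page_lsn(buf) > checkpoint_lsn)
            return true;        /* valid, or currently being written */

        return false;           /* old LSN, bad checksum: real failure */
    }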

> The reason this can happen in the first place is that our 8K read might
> only get half-way done before getting scheduled off and a 8K write
> happened on the page before our read(8K) gets back to finishing the
> read, but if what you're saying is true, then we can't ever have a case
> where such a thing would happen and a re-read would still see the "old"
> LSN.
> 
> If we check the LSN first and discover it's "new" (as in, more recent
> than our last checkpoint, or the checkpoint where the backup started)
> then, sure, there's going to be a risk that the page is currently being
> written right that moment and isn't yet completely valid.
> 

Right.

> The problem that we aren't solving for is if, somehow, we do a read(8K)
> and get the first half/second half mixup and then on a subsequent
> read(8K) we see that *again*, implying that somehow the kernel's copy
> has the latter-half of the page updated consistently but not the first
> half.  That's a problem that I haven't got a solution to today.  I'd
> love to have a guarantee that it's not possible- we've certainly never
> seen it but it's been a concern and I thought Michael was suggesting
> he'd seen that, but it sounds like there wasn't a check on the LSN in
> the first read, in which case it could have just been a 'regular' torn
> page case.
> 

Well, yeah. If that were possible, we'd be in serious trouble. I've
done quite a bit of experimentation with concurrent reads and writes,
and I have not observed such behavior. Of course, that's hardly proof
it can't happen, and it wouldn't be the first surprise with respect to
kernel I/O this year ...
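
A toy version of that experiment looks about like this (a simplified
sketch of the idea, not the actual program): one thread keeps rewriting
a single 8kB page, stamping every 512B sector with a sequence number,
while another thread reads the page and checks that everything up to
the last "new" sector seen by one read is at least that new on the next
read.

    #include <fcntl.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE    8192
    #define SECTOR  512
    #define NSEC    (PAGE / SECTOR)
    #define LOOPS   1000000UL

    static int fd;

    static void *
    writer(void *arg)
    {
        char        page[PAGE];

        for (uint64_t seq = 1; seq <= LOOPS; seq++)
        {
            /* stamp every sector with the same sequence number */
            for (int i = 0; i < NSEC; i++)
                memcpy(page + i * SECTOR, &seq, sizeof(seq));
            if (pwrite(fd, page, PAGE, 0) != PAGE)
            {
                perror("pwrite");
                exit(1);
            }
        }
        return NULL;
    }

    static void *
    reader(void *arg)
    {
        char        page[PAGE];
        uint64_t    seq[NSEC];
        uint64_t    prev_max = 0;   /* newest value seen by previous read */
        int         prev_last = -1; /* last sector that held it */

        for (;;)
        {
            if (pread(fd, page, PAGE, 0) != PAGE)
            {
                perror("pread");
                exit(1);
            }
            for (int i = 0; i < NSEC; i++)
                memcpy(&seq[i], page + i * SECTOR, sizeof(seq[i]));

            /* sectors up to the previous read's last "new" sector
             * must not have gone backwards */
            for (int i = 0; i <= prev_last; i++)
                if (seq[i] < prev_max)
                    printf("anomaly: sector %d regressed to %llu < %llu\n",
                           i, (unsigned long long) seq[i],
                           (unsigned long long) prev_max);

            prev_max = 0;
            prev_last = -1;
            for (int i = 0; i < NSEC; i++)
                if (seq[i] >= prev_max)
                {
                    prev_max = seq[i];
                    prev_last = i;
                }

            if (prev_max >= LOOPS)  /* writer has finished */
                return NULL;
        }
    }

    int
    main(void)
    {
        pthread_t   w, r;
        char        zero[PAGE] = {0};

        fd = open("stress.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
        if (fd < 0 || pwrite(fd, zero, PAGE, 0) != PAGE)
        {
            perror("stress.dat");
            return 1;
        }

        pthread_create(&w, NULL, writer, NULL);
        pthread_create(&r, NULL, reader, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        close(fd);
        return 0;
    }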

>> But by waiting for the checkpoint we know that the original write is no
>> longer in progress, so if we saw a partial write we're guaranteed to see
>> a new LSN on re-read.
>>
>> This is what I mean by the checkpoint / fsync guarantee.
> 
> I don't think any of this really has anything to do with either fsync
> being called or with the actual checkpointing process (except to the
> extent that the checkpointer is the thing doing the writing, and that we
> should be checking the LSN against the LSN of the last checkpoint when
> we started, or against the start of the backup LSN if we're talking
> about doing a backup).
> 

You're right, it's not about the fsync; sorry for the confusion. My
point is that using the checkpoint LSN gives us a guarantee that the
write is no longer in progress, and so we can't see a page torn because
of it. And if we see a partial write due to a new write, it's
guaranteed to update the page LSN (and we'll notice it).

>>> The question is if it’s possible to catch a torn page where the second
>>> half is updated *before* the first half of the page in a read (and then
>>> in subsequent reads having that state be maintained).  I have some
>>> skepticism that it’s really possible to happen in the first place but
>>> having an interrupted system call be stalled across two more system
>>> calls just seems terribly unlikely, and this is all based on the
>>> assumption that the kernel might write the second half of a write before
>>> the first to the kernel cache in the first place.
>>
>> Yes, if that was possible, the explanation about the checkpoint fsync
>> guarantee would be bogus, obviously.
>>
>> I've spent quite a bit of time looking into how write() is handled, and
>> I believe seeing only the second half is not possible. You may observe a
>> page torn in various ways (not necessarily in half), e.g.
>>
>>     [old,new,old]
>>
>> but then on re-read you should be guaranteed to see new data up until
>> the last "new" chunk:
>>
>>     [new,new,old]
>>
>> At least that's my understanding. I failed to deduce what POSIX says
>> about this, or how it behaves on various OS/filesystems.
>>
>> The one thing I've done is write a simple stress test that writes a
>> single 8kB page in a loop, reads it concurrently and checks the
>> behavior. And it seems consistent with my understanding.
> 
> Good.
> 
>>> Use that to compare to what?  The LSN in the first half of the page
>>> could be from well before the checkpoint or even the backup started.
>>
>> Not sure I follow. If the LSN in the page header is old, and the
>> checksum check failed, then on re-read we either find a new LSN (in
>> which case we skip the page) or consider this to be a checksum failure.
> 
> Right, I'm in agreement with doing that and it's what is done in
> pgbasebackup and pgBackRest.
> 

OK. All I'm saying is that pg_verify_checksums should probably do the
same thing, i.e. grab the checkpoint LSN and roll with that.
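
For instance, it could read the checkpoint LSN from pg_control the way
other frontend tools do. A sketch only -- this assumes get_controlfile()
as it exists in src/common/controldata_utils.c in PG 11; the exact
signature has varied between releases:

    #include "postgres_fe.h"
    #include "catalog/pg_control.h"
    #include "common/controldata_utils.h"

    static XLogRecPtr
    last_checkpoint_lsn(const char *datadir, const char *progname)
    {
        bool             crc_ok;
        ControlFileData *controlfile;

        controlfile = get_controlfile(datadir, progname, &crc_ok);
        if (!crc_ok)
        {
            fprintf(stderr, "%s: pg_control CRC check failed\n", progname);
            exit(EXIT_FAILURE);
        }

        /* LSN of the last checkpoint record */
        return controlfile->checkPoint;
    }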


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

