On 4/10/18 5:24 PM, Tomas Vondra wrote:
>
> I think there's a bug in sendFile(). We do check checksums on all pages
> that pass this LSN check:
>
> /*
> * Only check pages which have not been modified since the
> * start of the base backup. Otherwise, they might have been
> * written only halfway and the checksum would not be valid.
> * However, replaying WAL would reinstate the correct page in
> * this case.
> */
> if (PageGetLSN(page) < startptr)
> {
> ...
> }
>
> Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
> too, and we'll try to verify the checksum - but we actually do not set
> checksums on empty pages.
>
> So I think it should be something like this:
>
> if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
> {
> ...
> }
>
> It might be worth verifying that the page is actually all-zeroes (and
> not just with corrupted pd_upper value. Not sure it's worth it.
>
> I've found this by fairly trivial stress testing - running pgbench and
> pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
> With the proposed change I see no further failures.
Good catch, Tomas!
I should have seen this since I had the same issue when I implemented
this feature in pgBackRest.
Anyway, I agree that your fix looks correct.
Thanks,
--
-David
david@pgmasters.net