Hi,
I think there's a bug in sendFile(). We do check checksums on all pages
that pass this LSN check:
/*
* Only check pages which have not been modified since the
* start of the base backup. Otherwise, they might have been
* written only halfway and the checksum would not be valid.
* However, replaying WAL would reinstate the correct page in
* this case.
*/
if (PageGetLSN(page) < startptr)
{
...
}
Now, imagine the page is new, i.e. all-zeroes. That means the LSN is 0/0
too, and we'll try to verify the checksum - but we actually do not set
checksums on empty pages.
So I think it should be something like this:
if ((!PageIsNew(page)) && (PageGetLSN(page) < startptr))
{
...
}
It might be worth verifying that the page is actually all-zeroes (and
not just with corrupted pd_upper value. Not sure it's worth it.
I've found this by fairly trivial stress testing - running pgbench and
pg_basebackup in a loop. It was failing pretty reliably (~75% of runs).
With the proposed change I see no further failures.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services