On Thu, May 7, 2020 at 10:25 AM Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Ashwin Agrawal (aagrawal@pivotal.io) wrote: > On Wed, May 6, 2020 at 3:02 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, May 6, 2020 at 5:48 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote: > > > If pg_basebackup is not able to read BLCKSZ content from file, then it > > > just emits a warning "could not verify checksum in file "____" block > > > X: read buffer size X and page size 8192 differ" currently but misses > > > to error with "checksum error occurred". Only if it can read 8192 and > > > checksum mismatch happens will it error in the end. > > > > I don't think it's a good idea to conflate "hey, we can't checksum > > this because the size is strange" with "hey, the checksum didn't > > match". Suppose the a file has 1000 full blocks and a partial block. > > All 1000 blocks have good checksums. With your change, ISTM that we'd > > first emit a warning saying that the checksum couldn't be verified, > > and then we'd emit a second warning saying that there was 1 checksum > > verification failure, which would also be reported to the stats > > system. I don't think that's what we want. > > I feel the intent of reporting "total checksum verification failure" is to > report corruption. Which way is the secondary piece of the puzzle. Not > being able to read checksum itself to verify is also corruption and is > checksum verification failure I think. WARNINGs will provide fine grained > clarity on what type of checksum verification failure it is, so I am not > sure we really need fine grained clarity in "total numbers" to > differentiate these two types.
Are we absolutely sure that there's no way for a partial block to end up being seen by pg_basebackup, which is just doing routine filesystem read() calls, during normal operation though..? Across all platforms?
Okay, that's a good point, I didn't think about it. This comment to skip verifying checksum, I suppose convinces, can't be sure and hence can't report partial blocks as corruption.
/* * Only check pages which have not been modified since the * start of the base backup. Otherwise, they might have been * written only halfway and the checksum would not be valid. * However, replaying WAL would reinstate the correct page in * this case. We also skip completely new pages, since they * don't have a checksum yet. */
Might be nice to have a similar comment for the partial block case to document why we can't report it as corruption. Thanks.