Re: [PATCH] Verify Checksums during Basebackups

Поиск
Список
Период
Сортировка
От Magnus Hagander
Тема Re: [PATCH] Verify Checksums during Basebackups
Дата
Msg-id CABUevEzz9NfUtjXtfjCyjMuHGBdFQAd=OfUsohK=epeGfBeqjg@mail.gmail.com
обсуждение исходный текст
Ответ на Re: [PATCH] Verify Checksums during Basebackups  (Michael Banck <michael.banck@credativ.de>)
Ответы Re: [PATCH] Verify Checksums during Basebackups  (Michael Banck <michael.banck@credativ.de>)
Список pgsql-hackers
On Sat, Mar 31, 2018 at 2:54 PM, Michael Banck <michael.banck@credativ.de> wrote:
Hi,

On Fri, Mar 30, 2018 at 07:46:02AM -0400, Stephen Frost wrote:
> * Magnus Hagander (magnus@hagander.net) wrote:
> > On Fri, Mar 30, 2018 at 5:35 AM, David Steele <david@pgmasters.net> wrote:
> >
> > > On 3/24/18 10:32 AM, Michael Banck wrote:
> > > > Am Freitag, den 23.03.2018, 17:43 +0100 schrieb Michael Banck:
> > > >> Am Freitag, den 23.03.2018, 10:54 -0400 schrieb David Steele:
> > > >>> In my experience actual block errors are relatively rare, so there
> > > >>> aren't likely to be more than a few in a file.  More common are
> > > >>> overwritten or transposed files, rogue files, etc.  These produce a lot
> > > >>> of output.
> > > >>>
> > > >>> Maybe stop after five?
> > > >
> > > > The attached patch does that, and outputs the total number of
> > > > verification failures of that file after it got sent.
> > > >
> > > >> I'm on board with this, but I have the feeling that this is not a very
> > > >> common pattern in Postgres, or might not be project style at all.  I
> > > >> can't remember even seen an error message like that.
> > > >>
> > > >> Anybody know whether we're doing this in a similar fashion elsewhere?
> > > >
> > > > I tried to have look around and couldn't find any examples, so I'm not
> > > > sure that patch should go in. On the other hand, we abort on checksum
> > > > failures usually (in pg_dump e.g.), so limiting the number of warnings
> > > > does makes sense.
> > > >
> > > > I guess we need to see what others think.
> > >
> > > Well, at this point I would say silence more or less gives consent.
> > >
> > > Can you provide a rebased patch with the validation retry and warning
> > > limiting logic added? I would like to take another pass through it but I
> > > think this is getting close.
> >
> > I was meaning to mention it, but ran out of cycles.
> >
> > I think this is the right way to do it, except the 5 should be a #define
> > and not an inline hardcoded value :) We could argue whether it should be "5
> > total" or "5 per file". When I read the emails I thought it was going to be
> > 5 total, but I see the implementation does 5 / file. In a super-damanged
> > system that will still lead to horrible amounts of logging, but I think
> > maybe if your system is in that bad shape, then it's a lost cause anyway.
>
> 5/file seems reasonable to me as well.
>
> > I also think the "total number of checksum errors" should be logged if
> > they're >0, not >5. And I think *that* one should be logged at the end of
> > the entire process, not per file. That'd be the kind of output that would
> > be the most interesting, I think (e.g. if I have it spread out with 1 block
> > each across 4 files, I want that logged at the end because it's easy to
> > otherwise miss one or two of them that may have happened a long time apart).
>
> I definitely like having a total # of checksum errors included at the
> end, if there are any at all.  When someone is looking to see why the
> process returned a non-zero exit code, they're likely to start looking
> at the end of the log, so having that easily available and clear as to
> why the backup failed is definitely valuable.
>
> > I don't think we have a good comparison elsewhere, and that is as David
> > mention because other codepaths fail hard when they run into something like
> > that. And we explicitly want to *not* fail hard, per previous discussion.
>
> Agreed.

Attached is a new and rebased patch which does the above, plus
integrates the suggested changes by David Steele. The output is now:

$ initdb -k --pgdata=data1 1> /dev/null 2> /dev/null
$ pg_ctl --pgdata=data1 --log=pg1.log start > /dev/null
$ dd conv=notrunc oflag=seek_bytes seek=4000 bs=8 count=1 if=/dev/zero  of=data1/base/12374/2610 2> /dev/null
$ for i in 4000 13000 21000 29000 37000 43000; do dd conv=notrunc oflag=seek_bytes seek=$i bs=8 count=1 if=/dev/zero of=data1/base/12374/1259; done 2> /dev/null
$ pg_basebackup -v -h /tmp --pgdata=data2
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/2000060 on timeline 1
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_13882"
WARNING:  checksum verification failed in file "./base/12374/2610", block 0: calculated C2C9 but expected EC78
WARNING:  checksum verification failed in file "./base/12374/1259", block 0: calculated 8BAE but expected 46B8
WARNING:  checksum verification failed in file "./base/12374/1259", block 1: calculated E413 but expected 7701
WARNING:  checksum verification failed in file "./base/12374/1259", block 2: calculated 5DA9 but expected D5AA
WARNING:  checksum verification failed in file "./base/12374/1259", block 3: calculated 5651 but expected 4F5E
WARNING:  checksum verification failed in file "./base/12374/1259", block 4: calculated DF39 but expected DF00
WARNING:  further checksum verification failures in file "./base/12374/1259" will not be reported
WARNING:  file "./base/12374/1259" has a total of 6 checksum verification failures
WARNING:  7 total checksum verification failures
pg_basebackup: write-ahead log end point: 0/2000130
pg_basebackup: checksum error occured
$ echo $?
1
$ du -s data2
38820   data2

I ommitted the 'file "foo" has a total of X checksum verification
failures' if there is only one, as seen with file "./base/12374/2610"
above. Same with the "X total checksum verification failures" if there
was only one.

Is that ok with everybody?


I like most of this patch now :)


It might be a micro-optimisation, but ISTM that we shouldn't do the basename(palloc(fn)) in is_heap_file() unless we actually need it -- so not at the top of the function. (And surely "." and ".." should not occur here? I think that's a result of copy/paste from the online checksum patch?

We also do exactly the same basename(palloc(fn)) in sendFile().  Can we find a way to reuse that duplication? Perhaps we want to split it into the two pieces already out in sendFile() and pass it in as separate parameters?

If not then this second one should at least only be done inside the if (verify_checksums).

There is a bigger problem next to that though -- I believe  basename() does not exist on Win32. I haven't tested it, but there is zero documentation of it existing, which usually indicates it doesn't. That's the part that definitely needs to get fixed.

I think you need to look into the functionality in port/path.c, in particular last_dir_separator()?

--

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Дмитрий Воронин
Дата:
Сообщение: Re: Diagonal storage model
Следующее
От: Magnus Hagander
Дата:
Сообщение: Re: [PATCH] pg_hba.conf : new auth option : clientcert=verify-full