Re: Online verification of checksums

From Michael Banck
Subject Re: Online verification of checksums
Date
Msg-id 1537281442.3800.20.camel@credativ.de
In response to Online verification of checksums  (Michael Banck <michael.banck@credativ.de>)
Responses Re: Online verification of checksums  (Stephen Frost <sfrost@snowman.net>)
Re: Online verification of checksums  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

please find attached version 2 of the patch.

On Thursday, 26.07.2018 at 13:59 +0200, Michael Banck wrote:
> I've now forward-ported this change to pg_verify_checksums, in order to
> make this application useful for online clusters, see attached patch.
> 
> I've tested this in a tight loop (while true; do pg_verify_checksums -D
> data1 -d > /dev/null || /bin/true; done)[2] while doing "while true; do
> createdb pgbench; pgbench -i -s 10 pgbench > /dev/null; dropdb pgbench;
> done", which I already used to develop the original code in the fork and
> which brought up a few bugs.
> 
> I got one checksums verification failure this way, all others were
> caught by the recheck (I've introduced a 500ms delay for the first ten
> failures) like this:
> 
> > pg_verify_checksums: checksum verification failed on first attempt in
> > file "data1/base/16837/16850", block 7770: calculated checksum 785 but
> > expected 5063
> > pg_verify_checksums: block 7770 in file "data1/base/16837/16850"
> > verified ok on recheck

I have now replaced the pg_sleep() delay with a check against the
checkpoint LSN, as discussed upthread.
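
A rough sketch of what such a recheck could look like, written in Python
for brevity rather than in the C of the patch. It assumes (the mail does
not spell this out) that a block which fails verification is treated as
concurrently written, and therefore skipped, when its page LSN is newer
than the checkpoint LSN from the control file. The names page_lsn and
recheck_verdict and the little-endian byte order are illustrative
assumptions, not taken from the patch.

```python
import struct


def page_lsn(page: bytes) -> int:
    """Extract pd_lsn from the first 8 bytes of a page header.

    pd_lsn is stored as two 32-bit halves (xlogid, xrecoff) in the
    server's native byte order; little-endian is assumed here.
    """
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return (xlogid << 32) | xrecoff


def recheck_verdict(page: bytes, checkpoint_lsn: int) -> str:
    """Classify a block whose checksum did not match.

    Assumption: a page LSN newer than the checkpoint LSN means the block
    was modified after the checkpoint, so the mismatch is attributed to a
    concurrent write and the block is skipped instead of reported.
    """
    if page_lsn(page) > checkpoint_lsn:
        return "skip"
    return "fail"
```

Under that assumption, only blocks that have not been touched since the
last checkpoint would be reported as real checksum failures.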

> However, I am also seeing sporadic (maybe 0.5 times per pgbench run)
> failures like this:
> 
> > pg_verify_checksums: short read of block 2644 in file
> > "data1/base/16637/16650", got only 4096 bytes
> 
> This is not strictly a verification failure, should we do anything about
> this? In my fork, I am also rechecking on this[3] (and I am happy to
> extend the patch that way), but that makes the code and the patch more
> complicated and I wanted to check the general opinion on this case
> first.

I have now added a retry for this case too, again without a pg_sleep().
This catches around 80% of the short reads, but a few slip through. At
that point we bail out with exit(1) and the user can try again, which I
think is fine?

Alternatively, we could just skip to the next file at that point and not
count it as a checksum failure.
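
To make the retry concrete, here is a minimal sketch in Python (not the C
code of the patch). It re-reads a block immediately, with no sleep, when
fewer than BLCKSZ bytes come back, and gives up after the retries are
exhausted. The names read_block, pread_reader and ShortReadError, and the
single retry, are assumptions for illustration; end-of-file handling is
left out.

```python
import os

BLCKSZ = 8192  # default PostgreSQL block size


class ShortReadError(Exception):
    """Raised when a block is still incomplete after all retries."""


def pread_reader(fd):
    """Wrap a file descriptor as a read_at(offset, size) callable."""
    return lambda offset, size: os.pread(fd, size, offset)


def read_block(read_at, blockno, max_retries=1):
    """Read one block, re-reading immediately (no sleep) on a short read."""
    offset = blockno * BLCKSZ
    data = b""
    for _ in range(max_retries + 1):
        data = read_at(offset, BLCKSZ)
        if len(data) == BLCKSZ:
            return data
    raise ShortReadError(
        f"short read of block {blockno}, got only {len(data)} bytes"
    )
```

A caller would then choose between the two options discussed above:
treat ShortReadError as fatal and exit with status 1, or catch it, move
on to the next file and not count it as a checksum failure.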

Other changes from V1:

1. Rebased to 422952ee
2. Ignore ENOENT failure during file open and skip to next file
3. Report the total number of skipped blocks in the summary at the end
of the run
4. Skip files starting with pg_internal.init*
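
Items 2 and 4 could be sketched like this (Python for illustration only;
the patch itself is C, and the helper names should_skip_file and
open_if_exists are hypothetical):

```python
def should_skip_file(name: str) -> bool:
    """Item 4: skip files whose name starts with pg_internal.init."""
    return name.startswith("pg_internal.init")


def open_if_exists(path: str):
    """Item 2: a file that vanished before open() (ENOENT) is not an error.

    Returns an open binary file object, or None if the file no longer
    exists, in which case the caller moves on to the next file.
    """
    try:
        return open(path, "rb")
    except FileNotFoundError:
        return None
```

On an online cluster a relation file can be dropped between the
directory scan and the open() call, which is why a missing file is
skipped here rather than reported.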


Michael

-- 
Michael Banck
Project Manager / Senior Consultant
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
VAT ID number: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Management: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the
following provisions: https://www.credativ.de/datenschutz