corrupt pages detected by enabling checksums

From: Jeff Janes
Subject: corrupt pages detected by enabling checksums
Date:
Msg-id: CAMkU=1yTvoc5D2MzL2KcWxm_vS-kbN6SY_WVHCJVZKOaQ-MB2g@mail.gmail.com
Responses: Re: corrupt pages detected by enabling checksums  (Andres Freund <andres@2ndquadrant.com>)
           Re: corrupt pages detected by enabling checksums  (Jeff Davis <pgsql@j-davis.com>)
List: pgsql-hackers

I've changed the subject from "regression test failed when enabling checksum" because I now know the two issues are totally unrelated.

My test case didn't need to depend on archiving being on, and so with a simple tweak I rendered the two issues orthogonal.


On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:

> > What I would probably really want is the data as it existed after the
> > crash but before recovery started, but since the postmaster
> > immediately starts recovery after the crash, I don't know of a good
> > way to capture this.

> Can you just turn off the restart_after_crash GUC? I had a chance to
> look at this, and seeing the block before and after recovery would be
> nice. I didn't see a log file in the data directory, but it didn't go
> through recovery, so I assume it already did that.

You don't know that the cluster is in the bad state until after it goes through recovery, because most crashes recover perfectly fine.  So it would have to make a side-copy of the cluster after the crash, then recover the original and see how things go, then either retain or delete the side-copy.  Unfortunately my testing harness can't do this at the moment, because the Perl script storing the consistency info needs to survive over the crash and recovery.  It might take me a while to figure out how to make it do this.
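
A rough sketch of the side-copy step I have in mind is below, assuming restart_after_crash is turned off so the postmaster does not launch recovery on its own; all paths here are placeholders, not the ones from my actual harness:

    # postgresql.conf: keep the postmaster from recovering automatically
    #   restart_after_crash = off

    PGDATA=/path/to/test/data          # placeholder

    # after the crash, with nothing running, snapshot the cluster as-is
    cp -a "$PGDATA" "${PGDATA}.post-crash"

    # then recover the original and see how things go
    pg_ctl -D "$PGDATA" -w start

    # if recovery turns up corruption, the pre-recovery state is still
    # sitting in ${PGDATA}.post-crash for inspection; otherwise delete it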

I have the server log just go to stderr, where it gets mingled together with messages from my testing harness.  I had not uploaded that file.

Here is a new upload.  It contains both a data_dir tarball including xlog, and the mingled stderr ("do_new.out").


The other 3 files in it constitute the harness.  It is quite a mess, with some hard-coded paths.  The just-posted fix for wal_keep_segments will also have to be applied.

 

> The block is corrupt as far as I can tell. The first third is written,
> and the remainder is all zeros. The header looks like this:

Yes, that part is by my design.  Why it didn't get fixed from an FPI (full-page image) is not by my design, of course.
 

> So, the page may be corrupt without checksums as well, but it just
> happens to be hidden for the same reason. Can you try to reproduce it
> without -k?

No, things run (seemingly) fine without -k.  
 
> And on the checkin right before checksums were added?
> Without checksums, you'll need to use pg_filedump (or similar) to find
> whether an error has happened.

I'll work on it, but it will take a while to figure out exactly how to use pg_filedump to do that, and how to automate that process and incorporate it into the harness.
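
Roughly the kind of invocation I imagine, though the table name, relfilenode path, and block range below are only placeholders, and the real harness would have to map each table to its backing file first:

    # find the file backing the suspect table (hypothetical table/db names)
    psql -At -c "select pg_relation_filepath('pgbench_accounts')" test_db

    # dump and interpret the first few blocks of that file, looking for
    # zeroed tails, bogus headers, and the like
    pg_filedump -f -i -R 0 16 "$PGDATA/base/16384/16385" > block_dump.txt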

In the meantime, I tested the checksum commit itself (96ef3b8ff1c) and the problem occurs there, so if the problem is not the checksums themselves (and I agree it probably isn't) it must have been introduced before that rather than after.
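
Testing at a given commit boils down to something like the following; the configure flags and install prefix here are only illustrative, not the exact ones from my setup:

    # build and install at the checksum commit itself
    git checkout 96ef3b8ff1c
    ./configure --prefix=$HOME/pgsql_test --enable-debug --enable-cassert
    make -j4 && make install

    # reinitialize the test cluster with checksums turned on
    $HOME/pgsql_test/bin/initdb -k -D "$PGDATA"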
 
Cheers,

Jeff
