Re: corrupt pages detected by enabling checksums

Поиск
Список
Период
Сортировка
От Jeff Davis
Тема Re: corrupt pages detected by enabling checksums
Дата
Msg-id 1368137927.24407.85.camel@jdavis
обсуждение исходный текст
Ответ на Re: corrupt pages detected by enabling checksums  (Jim Nasby <jim@nasby.net>)
Ответы Re: corrupt pages detected by enabling checksums  (Jim Nasby <jim@nasby.net>)
Список pgsql-hackers
On Thu, 2013-05-09 at 14:28 -0500, Jim Nasby wrote:
> What about moving some critical data from the beginning of the WAL
> record to the end? That would make it easier to detect that we don't
> have a complete record. It wouldn't necessarily replace the CRC
> though, so maybe that's not good enough.
> 
> Actually, what if we actually *duplicated* some of the same WAL header
> info at the end of the record? Given a reasonable amount of data that
> would damn-near ensure that a torn record was detected, because the
> odds of having the exact same sequence of random bytes would be so
> low. Potentially even just duplicating the LSN would suffice.

I think both of these ideas have some false positives and false
negatives.

If the corruption happens at the record boundary, and wipes out the
special information at the end of the record, then you might think it
was not fully flushed, and we're in the same position as today.

If the WAL record is large, and somehow the beginning and the end get
written to disk but not the middle, then it will look like corruption;
but really the WAL was just not completely flushed. This seems pretty
unlikely, but not impossible.

That being said, I like the idea of introducing some extra checks if a
perfect solution is not possible.

> On the separate write idea, if that could be controlled by a GUC I
> think it'd be worth doing. Anyone that needs to worry about this
> corner case probably has hardware that would support that.

It sounds pretty easy to do that naively. I'm just worried that the
performance will be so bad for so many users that it's not a very
reasonable choice.

Today, it would probably make more sense to just use sync rep. If the
master's WAL is corrupt, and it starts up too early, then that should be
obvious when you try to reconnect streaming replication. I haven't tried
it, but I'm assuming that it gives a useful error message.

Regards,Jeff Davis





В списке pgsql-hackers по дате отправления:

Предыдущее
От: Greg Stark
Дата:
Сообщение: Re: corrupt pages detected by enabling checksums
Следующее
От: Bruce Momjian
Дата:
Сообщение: Re: Re: [GENERAL] pg_upgrade fails, "mismatch of relation OID" - 9.1.9 to 9.2.4