Discussion: Proposed WAL changes

Proposed WAL changes

From
Tom Lane
Date:
I have just sent to the pgsql-patches list a rather large set of
proposed diffs for the WAL code.  These changes:

* Store two past checkpoint locations, not just one, in pg_control.  On
startup, we fall back to the older checkpoint if the newer one is
unreadable.  Also, a physical copy of the newest checkpoint record is
kept in pg_control for possible use in disaster recovery (ie, complete
loss of pg_xlog).  Also add a version number for pg_control itself.
Remove archdir from pg_control; it ought to be a GUC parameter, not a
special case (not that it's implemented yet anyway).  (A sketch of the
resulting pg_control layout appears after this list.)

* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one.  This is not so much to avoid I/O as
to make it actually useful to keep track of the last two checkpoints.
If the things are right next to each other then there's not a lot of
redundancy gained...

* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs on alternate bytes.

* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.

* Change XID allocation to work more like OID allocation, so that we
can flush XID alloc info to the log before there is any chance an XID
will appear in heap files.

* Fix a number of minor bugs, such as off-by-one logic for XLOG file wraparound at the 4 gig mark.

* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.

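For concreteness, the resulting pg_control could look roughly like this
in C (a sketch only: the field names, types, and the 64-byte copy are
illustrative stand-ins, not the patch's actual declarations):

    /* Sketch of a pg_control carrying two checkpoint pointers. */
    #include <stdint.h>

    typedef uint64_t XLogRecPtrSketch;      /* stand-in for XLogRecPtr */

    typedef struct ControlFileSketch
    {
        uint32_t    pg_control_version;     /* version of this layout itself */
        XLogRecPtrSketch checkPoint;        /* newest checkpoint location */
        XLogRecPtrSketch prevCheckPoint;    /* the one before it: the fallback */
        char        checkPointCopy[64];     /* physical copy of the newest
                                             * checkpoint record, for disaster
                                             * recovery if pg_xlog is lost */
        uint64_t    crc;                    /* CRC guarding all of the above */
    } ControlFileSketch;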

Before committing this stuff, I intend to prepare a contrib utility that
can be used to reset pg_control and pg_xlog.  This is mainly for
disaster recovery purposes, but as a side benefit it will allow people
to update 7.1beta installations to this new code without doing initdb.
I need to update contrib/pg_controldata, too.
        regards, tom lane


RE: Proposed WAL changes

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Tom Lane
> 
> I have just sent to the pgsql-patches list a rather large set of
> proposed diffs for the WAL code.  These changes:
> 
> * Store two past checkpoint locations, not just one, in pg_control.
>   On startup, we fall back to the older checkpoint if the newer one
>   is unreadable.  Also, a physical copy of the newest checkpoint record
>   is kept in pg_control for possible use in disaster recovery (ie,
>   complete loss of pg_xlog).  Also add a version number for pg_control
>   itself.  Remove archdir from pg_control; it ought to be a GUC
>   parameter, not a special case (not that it's implemented yet anyway).
> 

Is archdir really a GUC parameter?

Regards,
Hiroshi Inoue 


Re: Proposed WAL changes

From
Tom Lane
Date:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> Remove archdir from pg_control; it ought to be a GUC
>> parameter, not a special case (not that it's implemented yet anyway).

> Is archdir really a GUC parameter?

Why shouldn't it be?  I see nothing wrong with changing it on-the-fly.
        regards, tom lane


Re: Proposed WAL changes

From
"Vadim Mikheev"
Date:
> I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - pgsql-patches' archive is dated by Feb -:(

> proposed diffs for the WAL code.  These changes:
> 
> * Store two past checkpoint locations, not just one, in pg_control.
>   On startup, we fall back to the older checkpoint if the newer one
>   is unreadable.  Also, a physical copy of the newest checkpoint record

And what to do if the older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64? -:))
And what to do if pg_control was lost? (We already discussed that we
should read all logs from newest to oldest to find the checkpoint.)
And why keep old log files with the older checkpoint?

>   is kept in pg_control for possible use in disaster recovery (ie,
>   complete loss of pg_xlog).  Also add a version number for pg_control

Mmmm, how is recovery possible if the log was lost? All that could be done
with the DB in the event of a corrupted/lost log is dumping data from tables
*as is*, without any guarantee about consistency. How could the checkpoint's
content be useful?

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

>   itself.  Remove archdir from pg_control; it ought to be a GUC
>   parameter, not a special case (not that it's implemented yet anyway).

I would discuss WAL-based BAR management before deciding how to
store/assign archdir. On the other hand, it's easy to add archdir
to pg_control later -:)

> * Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
>   on alternate bytes.

Great, if you've found a reliable CRC64 implementation!
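
For reference, a bit-at-a-time CRC-64 is small enough to sketch inline
(illustrative only: the ECMA-182 polynomial and the all-ones init/final
conventions below are assumptions, not necessarily what the patch uses):

    #include <stdint.h>
    #include <stddef.h>

    #define CRC64_POLY  UINT64_C(0x42F0E1EBA9EA3693)    /* ECMA-182 */

    uint64_t
    crc64(const unsigned char *buf, size_t len)
    {
        uint64_t    crc = ~UINT64_C(0);

        while (len-- > 0)
        {
            int         i;

            crc ^= (uint64_t) *buf++ << 56;
            for (i = 0; i < 8; i++)
            {
                if (crc & (UINT64_C(1) << 63))
                    crc = (crc << 1) ^ CRC64_POLY;
                else
                    crc <<= 1;
            }
        }
        return ~crc;
    }

A production implementation would presumably be table-driven for speed.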

> * Fix XLOG record length handling so that it will work at BLCKSZ = 32k.

Case I've overlooked -:(
(Though, I always considered BLCKSZ > 8K a temp hack -:))

> * Change XID allocation to work more like OID allocation, so that we
>   can flush XID alloc info to the log before there is any chance an XID
>   will appear in heap files.

I haven't read your postings about this yet.

> * Add documentation and clean up some coding infelicities; move file
>   format declarations out to include files where planned contrib
>   utilities can get at them.

Thanks for that!

> Before committing this stuff, I intend to prepare a contrib utility that
> can be used to reset pg_control and pg_xlog.  This is mainly for
> disaster recovery purposes, but as a side benefit it will allow people

Once again, I would call this "disaster *dump* purposes" -:)
After such an operation the DB shouldn't be used for anything but a dump!

> to update 7.1beta installations to this new code without doing initdb.
> I need to update contrib/pg_controldata, too.

Vadim




Re: Proposed WAL changes

From
Tom Lane
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:
>> I have just sent to the pgsql-patches list a rather large set of
> Please send it to me directly - pgsql-patches' archive is dated by Feb -:(

Done under separate cover.

>> proposed diffs for the WAL code.  These changes:
>> 
>> * Store two past checkpoint locations, not just one, in pg_control.
>> On startup, we fall back to the older checkpoint if the newer one
>> is unreadable.  Also, a physical copy of the newest checkpoint record

> And what to do if the older one is unreadable too?
> (Isn't it like using 2 x CRC32 instead of CRC64? -:))

Then you lose --- but two checkpoints gives you twice the chance of
recovery (probably more, actually, since it's much more likely that
the previous checkpoint will have reached disk safely).

> And what to do if pg_control was lost? (We already discussed that we
> should read all logs from newest to oldest to find the checkpoint.)

If you have valid WAL files and broken pg_control, then reading the WAL
files is a way to recover.  If you have valid pg_control and broken WAL
files, you have a big problem, but using pg_control to generate a new
empty WAL will at least let you get at your heap files.

> And why keep old log files with the older checkpoint?

Not much point in remembering the older checkpoint location if the
associated WAL file is removed...

> Mmmm, how is recovery possible if the log was lost? All that could be done
> with the DB in the event of a corrupted/lost log is dumping data from tables
> *as is*, without any guarantee about consistency.

Exactly.  That is still better than not being able to dump the data at
all.

>> * Change XID allocation to work more like OID allocation, so that we
>> can flush XID alloc info to the log before there is any chance an XID
>> will appear in heap files.

> I haven't read your postings about this yet.

See later discussion --- Andreas convinced me that flushing NEXTXID
records to disk isn't really needed after all.  (I didn't take the flush
out of my patch yet, but will do so.)  I still want to leave the NEXTXID
records in there, though, because I think that XID and OID assignment
ought to work as nearly alike as possible.

>> Before committing this stuff, I intend to prepare a contrib utility that
>> can be used to reset pg_control and pg_xlog.  This is mainly for
>> disaster recovery purposes, but as a side benefit it will allow people

> Once again, I would call this "disaster *dump* purposes" -:)
> After such an operation the DB shouldn't be used for anything but a dump!

Fair enough.  But we need it.
        regards, tom lane


Re: Proposed WAL changes

From
The Hermit Hacker
Date:
On Wed, 7 Mar 2001, Vadim Mikheev wrote:

> > I have just sent to the pgsql-patches list a rather large set of
>
> Please send it to me directly - pgsql-patches' archive is dated by Feb -:(

Huh?

http://www.postgresql.org/mhonarc/pgsql-patches/2001-03/index.html




Re: Proposed WAL changes

From
Ian Lance Taylor
Date:
"Vadim Mikheev" <vmikheev@sectorbase.com> writes:

> I feel that the fact that
> 
> WAL can't help in the event of disk errors
> 
> is often overlooked.

This is true in general.  But, nevertheless, WAL can be written to
protect against predictable disk errors, when possible.  Failing to
write a couple of disk blocks when the system crashes is a reasonably
predictable disk error.  WAL should ideally be written to work
correctly in that situation.

Ian



RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > I feel that the fact that
> > 
> > WAL can't help in the event of disk errors
> > 
> > is often overlooked.
> 
> This is true in general.  But, nevertheless, WAL can be written to
> protect against predictable disk errors, when possible.  Failing to
> write a couple of disk blocks when the system crashes is a reasonably
> predictable disk error.  WAL should ideally be written to work
> correctly in that situation.

But what can be done if fsync returns before pages are flushed?

Vadim


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > > I have just sent to the pgsql-patches list a rather large set of
> >
> > Please send it to me directly - pgsql-patches' archive is dated by Feb -:(
> 
> Huh?
> 
> http://www.postgresql.org/mhonarc/pgsql-patches/2001-03/index.html

But there was nothing there yesterday -:)
(Seems the archive wasn't updated at that time.)

Vadim


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > But what can be done if fsync returns before pages are flushed?
> No, it won't. When fsync returns, data is promised by the OS to be on
> disk.

Seems you didn't follow the discussions about this issue.

Vadim


RE: Proposed WAL changes

From
Alex Pilosov
Date:
On Wed, 7 Mar 2001, Mikheev, Vadim wrote:

> But what can be done if fsync returns before pages are flushed?
No, it won't. When fsync returns, data is promised by the OS to be on
disk.

-alex



Re: Proposed WAL changes

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Mar 07, 2001 at 11:09:25AM -0500, Tom Lane wrote:
> "Vadim Mikheev" <vmikheev@sectorbase.com> writes:
> >> * Store two past checkpoint locations, not just one, in pg_control.
> >> On startup, we fall back to the older checkpoint if the newer one
> >> is unreadable.  Also, a physical copy of the newest checkpoint record
> 
> > And what to do if the older one is unreadable too?
> > (Isn't it like using 2 x CRC32 instead of CRC64? -:))
> 
> Then you lose --- but two checkpoints gives you twice the chance of
> recovery (probably more, actually, since it's much more likely that
> the previous checkpoint will have reached disk safely).

Actually far more: if the checkpoints are minutes apart, even the 
worst disk drive will certainly have flushed any blocks written for 
the earlier checkpoint.

--
Nathan Myers
ncm@zembu.com


Re: Proposed WAL changes

From
ncm@zembu.com (Nathan Myers)
Date:
On Wed, Mar 07, 2001 at 12:03:41PM -0800, Mikheev, Vadim wrote:
> Ian wrote:
> > > I feel that the fact that
> > > 
> > > WAL can't help in the event of disk errors
> > > 
> > > is often overlooked.
> > 
> > This is true in general.  But, nevertheless, WAL can be written to
> > protect against predictable disk errors, when possible.  Failing to
> > write a couple of disk blocks when the system crashes 

or, more likely, when power drops; a system crash shouldn't keep the
disk from draining its buffers ...

> > is a reasonably predictable disk error.  WAL should ideally be 
> > written to work correctly in that situation.
> 
> But what can be done if fsync returns before pages are flushed?

Just what Tom has done: preserve a little more history.  If it's not
too expensive, then it doesn't hurt you when running on sound hardware,
but it offers a good chance of preventing embarrassments for (the 
overwhelming fraction of) users on garbage hardware.

Nathan Myers
ncm@zembu.com


Re: Proposed WAL changes

From
Ian Lance Taylor
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

> > > I feel that the fact that
> > > 
> > > WAL can't help in the event of disk errors
> > > 
> > > is often overlooked.
> > 
> > This is true in general.  But, nevertheless, WAL can be written to
> > protect against predictable disk errors, when possible.  Failing to
> > write a couple of disk blocks when the system crashes is a reasonably
> > predictable disk error.  WAL should ideally be written to work
> > correctly in that situation.
> 
> But what can be done if fsync returns before pages are flushed?

When you write out critical information, you keep earlier versions of
it.  On startup, if the critical information is corrupt, you use the
earlier versions of it.  This helps protect against the scenario I
mentioned: a few disk blocks may not have been written when the power
goes out.

My impression is that that is what Tom is doing with his patches.
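
Schematically, that fallback amounts to something like this (a sketch
under assumed names; the checksum is a toy stand-in for a real CRC):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct
    {
        uint64_t    payload;    /* e.g. a checkpoint location */
        uint64_t    sum;        /* checksum stored when the copy was written */
    } SavedCopy;

    static uint64_t
    checksum(uint64_t v)
    {
        /* toy stand-in for the real CRC-64 */
        return v * UINT64_C(0x9E3779B97F4A7C15) ^ UINT64_C(0xDEADBEEF);
    }

    /* Return the newest copy whose checksum validates, or NULL if none do. */
    const SavedCopy *
    pick_valid(const SavedCopy *copies, size_t n)
    {
        size_t      i;

        for (i = 0; i < n; i++)             /* copies[0] is the newest */
            if (checksum(copies[i].payload) == copies[i].sum)
                return &copies[i];
        return NULL;                        /* all generations unreadable */
    }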

Ian



RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > But what can be done if fsync returns before pages are flushed?
> 
> When you write out critical information, you keep earlier versions of
> it.  On startup, if the critical information is corrupt, you use the
> earlier versions of it.  This helps protect against the scenario I
> mentioned: a few disk blocks may not have been written when the power
> goes out.
> 
> My impression is that that is what Tom is doing with his patches.

If fsync may return before data is *actually* flushed then you may have
unlogged data page changes, which breaks the WAL rule and means a corrupted
(inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
STATE. Now please explain to me how saving the positions of two checkpoints
(what Tom is doing) can help here?

Vadim


Re: Proposed WAL changes

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> If fsync may return before data is *actually* flushed then you may have
> unlogged data page changes, which breaks the WAL rule and means a corrupted
> (inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
> STATE. Now please explain to me how saving the positions of two checkpoints
> (what Tom is doing) can help here?

If the WAL log is broken then there is (AFAICT) no way to *guarantee*
recovery to a consistent state.  There might be changes in heap or index
pages that you cannot find a trace of in the readable part of the log,
and hence don't know you need to fix.

However, I think that Vadim is focusing too narrowly on what he can
guarantee, and neglecting the problem of providing reasonable behavior
when things are too broken to meet the guarantee.  In particular,
complete refusal to start up when the WAL log is corrupted is neither
reasonable nor acceptable.  We have to be able to recover from that
and provide access to as-close-to-accurate data as we can manage.
I believe that backing up to the prior checkpoint and replaying as much
of the log as we can read is a useful fallback capability if the most
recent checkpoint record is unreadable.

In any case, "guarantees" that are based on the assumption that WAL
writes hit the disk before data writes do are going to fail anyway
if we are using disk drives that reorder writes.  This is the real
reason why we can't think only about guarantees, but also about how
we will respond to "guaranteed impossible" failures.

If you can improve on this recovery idea, by all means do so.  But
don't tell me that we don't need to worry about it, because we do.
        regards, tom lane


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> >> * Store two past checkpoint locations, not just one, in pg_control.
> >> On startup, we fall back to the older checkpoint if the newer one
> >> is unreadable.  Also, a physical copy of the newest checkpoint record
> 
> > And what to do if the older one is unreadable too?
> > (Isn't it like using 2 x CRC32 instead of CRC64? -:))
> 
> Then you lose --- but two checkpoints gives you twice the chance of
> recovery (probably more, actually, since it's much more likely that
> the previous checkpoint will have reached disk safely).

This is not correct. If the log is corrupted somehow (a checkpoint wasn't
flushed as promised) then you have no chance to *recover*, because the
DB will (most likely) be in an inconsistent state (data pages flushed
before the corresponding log records, etc). So, the second checkpoint gives
us twice the chance to *restart* in the normal way - read the checkpoint and
roll forward from the redo record - not to *recover*. But this approach
doubles the on-line log size requirements and doesn't help to
handle cases where pg_control was corrupted. Note, I agreed that
disaster *restart* must be implemented; I just think that the
"two checkpoints" approach is not the best way to follow.
From my POV, scanning logs is much better - it doesn't require
doubling the size of the on-line logs and it allows a *restart* if pg_control
was lost/corrupted: 

If there is no pg_control, or it's corrupted, or it points to a
nonexistent/corrupted checkpoint record, then scan the logs from
newest to oldest until we find the last valid checkpoint record
or the oldest valid log record, and then redo from there.
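
In outline, such a scan could look like this (a sketch only: the
segment-name test matches 7.1-era names, and the validation step is a
stub standing in for real XLOG record parsing):

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXSEGS 1024

    static int
    is_segment_name(const char *name)
    {
        /* 7.1-era segment names are 16 uppercase hex digits */
        return strlen(name) == 16 &&
               strspn(name, "0123456789ABCDEF") == 16;
    }

    static int
    cmp_desc(const void *a, const void *b)
    {
        /* fixed-width hex names: lexicographic order is creation order */
        return strcmp(*(const char *const *) b, *(const char *const *) a);
    }

    /* stub: would really read the segment and CRC-check its records */
    static int
    segment_has_valid_checkpoint(const char *path)
    {
        (void) path;
        return 0;
    }

    int
    main(void)
    {
        DIR        *dir = opendir("pg_xlog");
        struct dirent *de;
        char       *names[MAXSEGS];
        int         n = 0, i;

        if (dir == NULL)
        {
            perror("pg_xlog");
            return 1;
        }
        while ((de = readdir(dir)) != NULL && n < MAXSEGS)
            if (is_segment_name(de->d_name))
                names[n++] = strdup(de->d_name);
        closedir(dir);

        qsort(names, n, sizeof(char *), cmp_desc);
        for (i = 0; i < n; i++)         /* newest segment first */
        {
            char        path[64];

            snprintf(path, sizeof(path), "pg_xlog/%s", names[i]);
            if (segment_has_valid_checkpoint(path))
            {
                printf("redo from checkpoint in %s\n", names[i]);
                return 0;
            }
        }
        fprintf(stderr, "no valid checkpoint; redo from oldest valid record\n");
        return 1;
    }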

> See later discussion --- Andreas convinced me that flushing NEXTXID
> records to disk isn't really needed after all.  (I didn't 
> take the flush out of my patch yet, but will do so.) I still want
> to leave the NEXTXID records in there, though, because I think that
> XID and OID assignment ought to work as nearly alike as possible.

As I explained in short already: with UNDO we'll be able to reuse
XIDs after restart - ie there will be no point in having NEXTXID
records at all. And there is no point in adding them now.
Does it fix anything? Isn't "fixing" all that we must do in beta?

Vadim


Re: Proposed WAL changes

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> "two checkpoints" approach is not the best way to follow.
> From my POV, scanning logs is much better - it doesn't require
> doubling size of on-line logs and allows to *restart* if pg_control
> was lost/corrupted: 

> If there is no pg_control, or it's corrupted, or it points to a
> nonexistent/corrupted checkpoint record, then scan the logs from
> newest to oldest until we find the last valid checkpoint record
> or the oldest valid log record, and then redo from there.

And how well will that approach work if the last checkpoint record
got written near the start of a log segment file, and then the
checkpointer discarded all your prior log segments because "you don't
need those anymore"?  If the checkpoint record gets corrupted,
you have no readable log at all.

Even if you prefer to scan the logs to find the latest checkpoint
(to which I have no objection), it's still useful for the system
to keep track of the checkpoint-before-latest and only delete log
segments that are older than that.

Of course checkpoint-before-latest is a somewhat arbitrary way of
deciding how far back we need to keep logs, but I think it's more
defensible than anything else we can implement with little work.

> As I explained in short already: with UNDO we'll be able to reuse
> XIDs after restart - ie there will be no point in having NEXTXID
> records at all. And there is no point in adding them now.
> Does it fix anything? Isn't "fixing" all that we must do in beta?

If you really insist, I'll take it out again, but I'm unconvinced
that that's necessary.
        regards, tom lane


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > If there is no pg_control, or it's corrupted, or it points to a
> > nonexistent/corrupted checkpoint record, then scan the logs from
> > newest to oldest until we find the last valid checkpoint record
> > or the oldest valid log record, and then redo from there.
> 
> And how well will that approach work if the last checkpoint record
> got written near the start of a log segment file, and then the
> checkpointer discarded all your prior log segments because "you don't
> need those anymore"?  If the checkpoint record gets corrupted,
> you have no readable log at all.

The question - why should we have it? It is assumed that data files
are flushed before the checkpoint appears in the log. If this assumption
is wrong due to *bogus* fsync/disk/whatever, why should we increase
disk space requirements, which will affect *good* systems too?
What will we buy with extra logs? Just some data whose consistency
we can't guarantee anyway?
It seems that you want to guarantee more than me, Tom -:)

BTW, in some of my tests the size of the on-line logs was ~200Mb with the
default checkpoint interval. So, it's worth caring about on-line log size.

> > As I explained in short already: with UNDO we'll be able to reuse
> > XIDs after restart - ie there will be no point to have NEXTXID
> > records at all. And there is no point to add it now.
> > Does it fix anything? Isn't "fixing" all what we must do in beta?
> 
> If you really insist, I'll take it out again, but I'm unconvinced
> that that's necessary.

Please convince me that NEXTXID is necessary.
Why add anything that is not useful?

Vadim


Re: Proposed WAL changes

From
Ian Lance Taylor
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

> If fsync may return before data is *actually* flushed then you may have
> unlogged data page changes, which breaks the WAL rule and means a corrupted
> (inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
> STATE. Now please explain to me how saving the positions of two checkpoints
> (what Tom is doing) can help here?

Maybe it can't.  I don't know.

I do know that a system which can only function if all disk writes
complete in order is doomed when run on a Unix system with modern IDE
disks.  It won't work correctly in the case of a sudden power outage.

Since many people run Postgres in just that sort of environment, it's
not a good idea to completely fail to handle it.  Clearly a sudden
power outage can cause disk data to be corrupted.  People can accept
that.  They need to know what to do in that case.

For example, for typical Unix filesystems, after a sudden power
outage, people run fsck.  fsck puts the filesystem back in shape.  In
the process, it may find some files it can not fix; they are put in
/lost+found.  In the process, it may lose some files entirely; too
bad.  The point is, after fsck is run, the filesystem is known to be
good, and in most cases all the data will still be present.

Imagine if every time the power went out, people were required to
restore their disks from the last backup.

So, I think that what Postgres needs to do when it sees corruption is
just what fsck does: warn the user, put the database in a usable
state, and provide whatever information can be provided about lost or
inconsistent data.  Users who need full bore gold plated consistency
will restore from their last good backup, and invest in a larger UPS.
Many users will simply keep going.

Ian



Re: Proposed WAL changes

From
"Vadim Mikheev"
Date:
What do we debate?
I never said that we shouldn't worry about WAL's current inability to restart.
And WAL already works correctly in the situation of "failing to write a couple
of disk blocks when the system crashes".
My statement in the first place that "WAL can't help in the event of disk errors"
was to remind us that we should think over how much we are going to guarantee,
and by what means, in the event that our *base* requirements were not met
(guaranteed -:)). My POV is that two checkpoints increase disk space
requirements for *everyday usage* while buying near nothing, because data
consistency cannot be guaranteed anyway.
On the other hand, this is the fastest way to implement WAL restart-ability
- *which is the real problem we have to fix*.

Vadim




Re: Proposed WAL changes

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> And how well will that approach work if the last checkpoint record
>> got written near the start of a log segment file, and then the
>> checkpointer discarded all your prior log segments because "you don't
>> need those anymore"?  If the checkpoint record gets corrupted,
>> you have no readable log at all.

> The question - why should we have it? It is assumed that data files
> are flushed before the checkpoint appears in the log. If this assumption
> is wrong due to *bogus* fsync/disk/whatever, why should we increase
> disk space requirements, which will affect *good* systems too?
> What will we buy with extra logs? Just some data whose consistency
> we can't guarantee anyway?
> It seems that you want to guarantee more than me, Tom -:)

No, but I want a system that's not brittle.  You seem to be content to
design a system that is reliable as long as the WAL log is OK but loses
the entire database unrecoverably as soon as one bit goes bad in the
log.  I'd like a slightly softer failure mode.  WAL logs *will* go bad
(even without system crashes; what of unrecoverable disk read errors?)
and we ought to be able to deal with that with some degree of grace.
Yes, we lost our guarantee of consistency.  That doesn't mean we should
not do the best we can with what we've got left.

> BTW, in some of my tests the size of the on-line logs was ~200Mb with the
> default checkpoint interval. So, it's worth caring about on-line log size.

Okay, but to me that suggests we need a smarter log management strategy,
not a management strategy that throws away data we might wish we still
had (for manual analysis if nothing else).  Perhaps the checkpoint
creation rule should be "every M seconds *or* every N megabytes of log,
whichever comes first".  It'd be fairly easy to signal the postmaster to
start up a new checkpoint process when XLogWrite rolls over to a new log
segment, if the last checkpoint was further back than some number of
segments.  Comments?
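
In code, the trigger would be about this simple (a sketch; the constants
are hypothetical placeholders for what would presumably become
configurable settings):

    #include <stdbool.h>
    #include <time.h>

    /* hypothetical settings, not actual GUC names */
    #define CHECKPOINT_TIMEOUT_SECS 300
    #define CHECKPOINT_SEGMENTS     16

    /* checkpoint every M seconds *or* every N segments, whichever first */
    bool
    time_for_checkpoint(time_t last_ckpt_time, unsigned segs_since_ckpt)
    {
        return time(NULL) - last_ckpt_time >= CHECKPOINT_TIMEOUT_SECS ||
               segs_since_ckpt >= CHECKPOINT_SEGMENTS;
    }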

> Please convince me that NEXTXID is necessary.
> Why add anything that is not useful?

I'm not convinced that it's not necessary.  In particular, consider the
case where we are trying to recover from a crash using an on-line
checkpoint as our last readable WAL entry.  In the pre-NEXTXID code,
this checkpoint would contain the current XID counter and an
advanced-beyond-current OID counter.  I think both of those numbers
should be advanced beyond current, so that there's some safety margin
against reusing XIDs/OIDs that were allocated by now-lost XLOG entries.
The OID code is doing this right, but the XID code wasn't.
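
In miniature, the advance-beyond-current trick looks like this (a sketch
with illustrative names; cf. VAR_OID_PREFETCH in the OID code):

    #include <stdint.h>
    #include <stdio.h>

    #define PREFETCH 8192           /* cf. VAR_OID_PREFETCH */

    static uint32_t next_val;       /* next value to hand out */
    static uint32_t logged_upto;    /* highest value covered by the log */

    static void
    log_next_val(uint32_t upto)
    {
        /* stand-in for emitting a NEXTOID/NEXTXID-style log record */
        printf("log: counter valid up to %u\n", upto);
    }

    uint32_t
    alloc_val(void)
    {
        if (next_val >= logged_upto)
        {
            /* log PREFETCH values ahead of what is actually handed out,
             * so replaying the log can never reissue a used value */
            logged_upto = next_val + PREFETCH;
            log_next_val(logged_upto);
        }
        return next_val++;
    }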

Again, it's a question of brittleness.  Yes, as long as everything
operates as designed and the WAL log never drops a bit, we don't need
it.  But I want a safety margin for when things aren't perfect.
        regards, tom lane


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> > It seems that you want guarantee more than me, Tom -:)
> 
> No, but I want a system that's not brittle. You seem to be content to
> design a system that is reliable as long as the WAL log is OK 
> but loses the entire database unrecoverably as soon as one bit goes bad
> in the log.

I don't see how the absence of the old checkpoint forces losing the entire
database. You probably will get better consistency by re-applying
modifications which were supposed to be in the data files already, but it
seems questionable to me.

> BTW, in some of my tests the size of the on-line logs was ~200Mb with the
> default checkpoint interval. So, it's worth caring about on-line log size.
> 
> Okay, but to me that suggests we need a smarter log management strategy,
> not a management strategy that throws away data we might wish we still
> had (for manual analysis if nothing else).

This is what should be covered by archiving log files. Unimplemented -:(

> Perhaps the checkpoint creation rule should be "every M seconds *or*
> every N megabytes of log, whichever comes first".

I like this! Regardless of the usability of keeping the older checkpoint
(especially in the future, with log archiving), your rule is worthwhile in
any case.
(Nevertheless, keeping two checkpoints still increases disk requirements -:)
But it seems I have to waive my objection if I didn't convince you - it's
really the simplest way to get WAL restart-ability and I personally have no
ability to implement log scanning now).

> > Please convince me that NEXTXID is necessary.
> > Why add anything that is not useful?
> 
> I'm not convinced that it's not necessary. In particular, 
> consider the case where we are trying to recover from a crash using
> an on-line checkpoint as our last readable WAL entry. In the pre-NEXTXID
> code, this checkpoint would contain the current XID counter and an
> advanced-beyond-current OID counter. I think both of those numbers should
> be advanced beyond current, so that there's some safety margin against
> reusing XIDs/OIDs that were allocated by now-lost XLOG entries.
> The OID code is doing this right, but the XID code wasn't.
> 
> Again, it's a question of brittleness.  Yes, as long as everything
> operates as designed and the WAL log never drops a bit, we don't need
> it.  But I want a safety margin for when things aren't perfect.

Once again - my point is that in the event of a lost log record one shouldn't
try to use the existing database but just dump it, etc - ie there is no reason
to keep info about allocated XIDs. But keeping NEXTXID costs nothing, at
least -:)

Vadim


Re: Proposed WAL changes

From
Tom Lane
Date:
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> No, but I want a system that's not brittle. You seem to be content to
>> design a system that is reliable as long as the WAL log is OK 
>> but loses the entire database unrecoverably as soon as one bit goes bad
>> in the log.

> I don't see how the absence of the old checkpoint forces losing the entire database.

As the code stood last week that's what would happen, because the system
would not restart unless pg_control pointed to a valid checkpoint
record.  I addressed that in a way that seemed good to me.

Now, from what you've said in this conversation you would rather have
the system scan XLOG to decide where to replay from if it cannot read
the last checkpoint record.  That would be okay with me, but even with
that approach I do not think it's safe to truncate the log to nothing
as soon as we've written a checkpoint record.  I want to see a reasonable
amount of log data there at all times.  I don't insist that "reasonable
amount" necessarily means "back to the prior checkpoint" --- but that's
a simple and easy-to-implement interpretation.

> You probably will get better consistency by re-applying modifications
> which were supposed to be in the data files already, but it seems
> questionable to me.

It's not a guarantee, no, but it gives you a better probability of
recovering recent changes when things are hosed.

BTW, can we really trust checkpoint to mean that all data file changes
are down on disk?  I see that the actual implementation of checkpoint is
    write out all dirty shmem buffers;
    sync();
    if (IsUnderPostmaster)
        sleep(2);
    sync();
    write checkpoint record to XLOG;
    fsync XLOG;

Now HP's man page for sync() says
    The writing, although scheduled, is not necessarily complete upon
    return from sync.

I can assure you that 2 seconds is nowhere near enough to ensure that a
sync is complete on my workstation... and I doubt that "scheduled" means
"guaranteed to complete before any subsequently-requested I/O is done".
I think it's entirely possible that the checkpoint record will hit the
disk before the last heap buffer does.

Therefore, even without considering disk drive write reordering, I do
not believe that a checkpoint guarantees very much, and so I think it's
pretty foolish to delete the preceding XLOG data immediately afterwards.
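
For contrast, an fsync per modified file is the portable way to actually
wait for data to reach the disk; a sketch only, and even this is subject
to the drive's own write cache:

    #include <fcntl.h>
    #include <unistd.h>

    /* Flush one data file and wait for completion; returns 0 on success.
     * Unlike sync(), fsync() does not return until the kernel has pushed
     * the file's dirty pages out to the device.
     */
    int
    flush_file(const char *path)
    {
        int     fd = open(path, O_RDWR);

        if (fd < 0)
            return -1;
        if (fsync(fd) != 0)
        {
            close(fd);
            return -1;
        }
        return close(fd);
    }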


>> Perhaps the checkpoint creation rule should be "every M seconds *or*
>> every N megabytes of log, whichever comes first".

> I like this! Regardless of the usability of keeping the older checkpoint
> (especially in the future, with log archiving), your rule is worthwhile in any case.

Okay, I'll see if I can do something with this idea.

Other than what we've discussed, do you have any comments/objections to
my proposed patch?  I've been holding off committing it so that you have
time to review it...
        regards, tom lane


RE: Proposed WAL changes

From
"Mikheev, Vadim"
Date:
> Other than what we've discussed, do you have any 
> comments/objections to my proposed patch?
> I've been holding off committing it so that you have
> time to review it...

Sorry - I'm heading to meet Marc, Thomas & Geoff right now,
will try to comment asap.

Vadim


Re: Proposed WAL changes

From
Christopher Masto
Date:
Please forgive me if I'm misunderstanding something of these rather
complex issues.  But I think this is an important question from the
perspective of a system administrator responsible for the safety and
uncorruptedness of his users' data.

If I understand correctly, it is possible, through power failure,
Postgres or OS crash, or disk failure, to end up with a situation
where Postgres cannot put the database in a consistent state.  Rather
than failing to start at all, something will be put in place that
allows PG to partially recover and start up, so that you can get to
your data.  I think that leaves DBAs wondering two things:

First, how will I know that my database is corrupted?  I may not be
present to witness the power failure, unattended reboot, and automatic
restart/quasi-recovery.  If the DB has become inconsistent, it is
critical that users do not continue to use it.  I'm all for having
Postgres throw up its hands and refuse to start until I put it in
disaster-dump mode.

Secondly, since disaster-dump seems to be a good term for it, is there
some way to know the extent of the damage?  I.e. in the typical case
of power failure, is the inconsistent part just the "recent" data,
and would it be possible to find out which records are part of that
damaged set?
-- 
Christopher Masto         Senior Network Monkey      NetMonger Communications
chris@netmonger.net        info@netmonger.net        http://www.netmonger.net

Free yourself, free your machine, free the daemon -- http://www.freebsd.org/