Discussion: CRCs
Vadim wrote:
> Tom wrote:
> > Bruce wrote:
> > > ... If the CRC on
> > > the WAL log checks for errors that are not checked anywhere else,
> > > then fine, but I thought disk CRC would just duplicate the I/O
> > > subsystem/disk.
> >
> > A disk-block CRC would detect partially written blocks (ie,
> > power drops after disk has written M of the N sectors in a
> > block). The disk's own checks will NOT consider this condition a
> > failure. I'm not convinced that WAL will reliably detect it either
> > (Vadim?).
>
> Idea proposed by Andreas about "physical log" is implemented! Now WAL
> saves whole data blocks on first after checkpoint modification. This
> way on recovery modified data blocks will be first restored *as a
> whole*. Isn't it much better than just detection of partially writes?

This seems to protect against some partial writes, but see below.

> > Certainly WAL will not help for corruption caused by external agents,
> > away from any updates that are actually being performed/logged.
>
> What do you mean by "external agents"?

External agents include RAM bit drops and noise on cables when
blocks are (read and re-) written. Every time data is moved,
there is a chance of an undetected error being introduced. The
disk only promises (within limits) to deliver the sector that
was written; it doesn't promise that what was written is what
you meant to write. Errors of this sort accumulate unless
caught by end-to-end checks.

External agents include bugs in database code, bugs in OS code,
bugs in disk controller firmware, and bugs in disk firmware.
Each can result in clobbered data, blocks being written in the
wrong place, blocks said to be written but not, and any number
of other variations. All this code is written by humans, and
even the most thorough testing cannot cover even the majority
of code paths.

External agents include sector errors not caught by the disk CRC:
the disk only promises to keep the number of errors delivered to a
reasonably low (and documented) level. It's up to the user to
notice the errors that slip through.

And Andreas wrote:
> > A disk-block CRC would detect partially written blocks (ie, power
> > drops after disk has written M of the N sectors in a block). The
> > disk's own checks will NOT consider this condition a failure.
>
> But physical log recovery will rewrite every page that was changed
> after last checkpoint, thus this is not an issue anymore.

No. That assumes that when the drive _says_ the block is written,
it is really on the disk. That is not true for IDE drives. It is
true for SCSI drives only when the SCSI spec is implemented correctly,
but implementing the spec correctly interferes with favorable
benchmark results.

> > I'm not convinced that WAL will reliably detect it either
> > (Vadim?). Certainly WAL will not help for corruption caused by
> > external agents, away from any updates that are actually being
> > performed/logged.
>
> The external agent (if malevolent) could write a correct CRC anyway
> If on the other hand the agent writes complete garbage, vacuum will
> notice.

Vacuum does not check most of the bits in the blocks it reads.
(Bad bits in metadata will cause a crash only if you're lucky.
If not, they result in more corruption.)

A database is unusual among computer applications in that an error
introduced today can sit unnoticed on the disk, and then result in
an unnoticed wrong answer six months later. We need to be able to
detect bad bits as soon as possible, before the backups have been
overwritten. CRCs are how we can detect cumulative corruption from
all sources.

Nathan Myers
ncm@zembu.com
> > But physical log recovery will rewrite every page that was changed
> > after last checkpoint, thus this is not an issue anymore.
>
> No. That assumes that when the drive _says_ the block is written,
> it is really on the disk. That is not true for IDE drives. It is
> true for SCSI drives only when the SCSI spec is implemented correctly,
> but implementing the spec correctly interferes with favorable
> benchmark results.

You know - this is *core* assumption. If drive lies about this then
*nothing* will help you. Do you remember core rule of WAL?
"Changes must be logged *before* changed data pages written".
If this rule will be broken then data files will be inconsistent
after crash recovery and you will not notice this, w/wo CRC in
data blocks.

I agreed that CRCs could help to detect other errors but probably
it's too late for 7.1.

Vadim
On Fri, Jan 12, 2001 at 01:07:56PM -0800, Mikheev, Vadim wrote:
> > > But physical log recovery will rewrite every page that was changed
> > > after last checkpoint, thus this is not an issue anymore.
> >
> > No. That assumes that when the drive _says_ the block is written,
> > it is really on the disk. That is not true for IDE drives. It is
> > true for SCSI drives only when the SCSI spec is implemented correctly,
> > but implementing the spec correctly interferes with favorable
> > benchmark results.
>
> You know - this is *core* assumption. If drive lies about this then
> *nothing* will help you. Do you remember core rule of WAL?
> "Changes must be logged *before* changed data pages written".
> If this rule will be broken then data files will be inconsistent
> after crash recovery and you will not notice this, w/wo CRC in
> data blocks.

You can include the data blocks' CRCs in the log entries.

> I agreed that CRCs could help to detect other errors but probably
> it's too late for 7.1.

7.2 is not too far off. I'm hoping to see it then.

Nathan Myers
ncm@zembu.com
On Fri, Jan 12, 2001 at 12:35:14PM -0800, Nathan Myers wrote:
> Vadim wrote:
> > What do you mean by "external agents"?
>
> External agents include RAM bit drops and noise on cables when
> blocks are (read and re-) written. Every time data is moved,
> there is a chance of an undetected error being introduced. The
> disk only promises (within limits) to deliver the sector that
> was written; it doesn't promise that what was written is what
> you meant to write. Errors of this sort accumulate unless
> caught by end-to-end checks.
>
> External agents include bugs in database code, bugs in OS code,
> bugs in disk controller firmware, and bugs in disk firmware.
> Each can result in clobbered data, blocks being written in the
> wrong place, blocks said to be written but not, and any number
> of other variations. All this code is written by humans, and
> even the most thorough testing cannot cover even the majority
> of code paths.
>
> External agents include sector errors not caught by the disk CRC:
> the disk only promises to keep the number of errors delivered to a
> reasonably low (and documented) level. It's up to the user to
> notice the errors that slip through.

Interestingly, right after I posted this I noticed that cron noticed
a corrupt inode in /dev on my machine. The disk is happy with it,
but I'm not...

Nathan Myers
ncm@zembu.com
> > You know - this is *core* assumption. If drive lies about this then
> > *nothing* will help you. Do you remember core rule of WAL?
> > "Changes must be logged *before* changed data pages written".
> > If this rule will be broken then data files will be inconsistent
> > after crash recovery and you will not notice this, w/wo CRC in
> > data blocks.
>
> You can include the data blocks' CRCs in the log entries.

How could it help?

Vadim
On Fri, Jan 12, 2001 at 02:16:07PM -0800, Mikheev, Vadim wrote:
> > > You know - this is *core* assumption. If drive lies about this then
> > > *nothing* will help you. Do you remember core rule of WAL?
> > > "Changes must be logged *before* changed data pages written".
> > > If this rule will be broken then data files will be inconsistent
> > > after crash recovery and you will not notice this, w/wo CRC in
> > > data blocks.
> >
> > You can include the data blocks' CRCs in the log entries.
>
> How could it help?

It wouldn't help you recover, but you would be able to report that
you cannot recover. To be more specific, if the blocks referenced
in the log are partially written, their CRCs will (probably) be
wrong. If they are not physically written at all, their CRCs will
be correct but will not match what is in the log. In either case
the user will know immediately that the database has been corrupted,
and must fall back on a failover image or backup.

It would be no bad thing to include the CRC of the block referenced
wherever in the file format that a block reference lives.

Nathan Myers
ncm@zembu.com
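
The two cases described in the message above (a partially written block versus a block never written at all) can be told apart roughly as follows. This is only a sketch under stated assumptions, not PostgreSQL's WAL or page format: the 4-byte CRC header, the block and sector sizes, and the names `seal` and `check_block` are all hypothetical, and Python's `zlib.crc32` stands in for whatever CRC would really be used.

```python
import struct
import zlib

BLOCK_SIZE = 8192   # assumed page size for this sketch
SECTOR_SIZE = 512   # assumed disk sector size


def seal(payload: bytes) -> bytes:
    """Build a block: 4-byte CRC-32 of the payload, then the payload."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def check_block(block: bytes, logged_crc: int) -> str:
    """Classify an on-disk block against the CRC recorded in its log entry."""
    (stored_crc,) = struct.unpack_from("<I", block)
    if zlib.crc32(block[4:]) != stored_crc:
        return "torn"    # block disagrees with its own CRC: partial write
    if stored_crc != logged_crc:
        return "stale"   # self-consistent, but not the version the log describes
    return "ok"


# Old and new versions of the same block; the log entry records the new CRC.
old = seal(b"a" * (BLOCK_SIZE - 4))
new = seal(b"b" * (BLOCK_SIZE - 4))
logged_crc = zlib.crc32(new[4:])

# Power dropped after 3 of the 16 sectors of the new version reached disk.
torn = new[:3 * SECTOR_SIZE] + old[3 * SECTOR_SIZE:]

for name, block in (("new", new), ("old", old), ("torn", torn)):
    print(name, check_block(block, logged_crc))
```

As noted in the message, neither failure verdict helps with recovery; it only lets the database report that the block is not what the log says it should be.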
ncm@zembu.com (Nathan Myers) writes:
>>>>>> "Changes must be logged *before* changed data pages written".
>>>>>> If this rule will be broken then data files will be inconsistent
>>>>>> after crash recovery and you will not notice this, w/wo CRC in
>>>>>> data blocks.
>>>> 
>>>> You can include the data blocks' CRCs in the log entries.
>> 
>> How could it help?
> It wouldn't help you recover, but you would be able to report that 
> you cannot recover.
How?  The scenario Vadim is pointing out is where the disk drive writes
a changed data block in advance of the WAL log entry describing the
change.  Then power drops and the WAL entry never gets made.  At
restart, how will you realize that that data block now contains data you
don't want?  There's not even a log entry telling you you need to look
at it, much less one that tells you what should be in it.
AFAICS, disk-block CRCs do not guard against mishaps involving intended
writes.  They will help guard against data corruption that might creep
in due to outside factors, however.
        regards, tom lane
			
> > It wouldn't help you recover, but you would be able to report that
> > you cannot recover.
>
> How? The scenario Vadim is pointing out is where the disk
> drive writes a changed data block in advance of the WAL log entry
> describing the change. Then power drops and the WAL entry never gets
> made. At restart, how will you realize that that data block now
> contains data you don't want? There's not even a log entry telling
> you you need to look at it, much less one that tells you what should
> be in it.
>
> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> writes. They will help guard against data corruption that might creep
> in due to outside factors, however.

I couldn't describe better -:)

Vadim
On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> >>>>>> "Changes must be logged *before* changed data pages written".
> >>>>>> If this rule will be broken then data files will be inconsistent
> >>>>>> after crash recovery and you will not notice this, w/wo CRC in
> >>>>>> data blocks.
> >>>>
> >>>> You can include the data blocks' CRCs in the log entries.
> >>
> >> How could it help?
>
> > It wouldn't help you recover, but you would be able to report that
> > you cannot recover.
>
> How? The scenario Vadim is pointing out is where the disk drive writes
> a changed data block in advance of the WAL log entry describing the
> change. Then power drops and the WAL entry never gets made. At
> restart, how will you realize that that data block now contains data you
> don't want? There's not even a log entry telling you you need to look
> at it, much less one that tells you what should be in it.

OK. In that case, recent transactions that were acknowledged to user
programs just disappear. The database isn't corrupt, but it doesn't
contain what the user believes is in it.

The only way I can think of to guard against that is to have a sequence
number in each acknowledgement sent to users, and also reported when the
database recovers. If users log their ACK numbers, they can be compared
when the database comes back up.

Obviously it's better to configure the disk so that it doesn't lie about
what's been written.

> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> writes. They will help guard against data corruption that might creep
> in due to outside factors, however.

Right.

Nathan Myers
ncm@zembu.com
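
The ACK-number comparison suggested above might look like this on the client side. It is a sketch only: it assumes, purely for illustration, that acknowledgement numbers increase monotonically with commit order and that the recovered database reports the highest number it still contains. The name `lost_acks` is hypothetical.

```python
from typing import Iterable, List


def lost_acks(client_acked: Iterable[int], recovered_upto: int) -> List[int]:
    """Return the ACK numbers the client saw that the recovered database lacks.

    client_acked   -- sequence numbers the client logged on each acknowledgement
    recovered_upto -- highest sequence number the database reports after recovery
    """
    return sorted(n for n in client_acked if n > recovered_upto)


# The client was told transactions 101..105 were committed, but after the
# power outage the database reports that it recovered only up to 103.
print(lost_acks([101, 102, 103, 104, 105], recovered_upto=103))
```

A non-empty result means acknowledged transactions have disappeared, which is exactly the case the database cannot detect by itself.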
* Nathan Myers <ncm@zembu.com> [010112 15:49] wrote:
> On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote:
> > ncm@zembu.com (Nathan Myers) writes:
> > >>>>>> "Changes must be logged *before* changed data pages written".
> > >>>>>> If this rule will be broken then data files will be inconsistent
> > >>>>>> after crash recovery and you will not notice this, w/wo CRC in
> > >>>>>> data blocks.
> > >>>>
> > >>>> You can include the data blocks' CRCs in the log entries.
> > >>
> > >> How could it help?
> >
> > > It wouldn't help you recover, but you would be able to report that
> > > you cannot recover.
> >
> > How? The scenario Vadim is pointing out is where the disk drive writes
> > a changed data block in advance of the WAL log entry describing the
> > change. Then power drops and the WAL entry never gets made. At
> > restart, how will you realize that that data block now contains data you
> > don't want? There's not even a log entry telling you you need to look
> > at it, much less one that tells you what should be in it.
>
> OK. In that case, recent transactions that were acknowledged to user
> programs just disappear. The database isn't corrupt, but it doesn't
> contain what the user believes is in it.
>
> The only way I can think of to guard against that is to have a sequence
> number in each acknowledgement sent to users, and also reported when the
> database recovers. If users log their ACK numbers, they can be compared
> when the database comes back up.
>
> Obviously it's better to configure the disk so that it doesn't lie about
> what's been written.

I thought WAL+fsync wasn't supposed to allow this to happen?

--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
> > How? The scenario Vadim is pointing out is where the disk
> > drive writes a changed data block in advance of the WAL log
> > entry describing the change. Then power drops and the WAL
> > entry never gets made. At restart, how will you realize that
> > that data block now contains data you don't want? There's not
> > even a log entry telling you you need to look at it, much less
> > one that tells you what should be in it.
>
> OK. In that case, recent transactions that were acknowledged to user
> programs just disappear. The database isn't corrupt, but it doesn't
> contain what the user believes is in it.

Example.

1. Tuple was inserted into index.
2. Looking for free buffer bufmgr decides to write index block.
3. Following WAL core rule bufmgr first calls XLogFlush() to write
   and fsync log record related to index tuple insertion.
4. *Believing* that log record is on disk now (after successful fsync)
   bufmgr writes index block.

If log record was not really flushed on disk in 3. but on-disk image of
index block was updated in 4. and system crashed after this then after
restart recovery you'll have unlawful index tuple pointing to where?
Who knows! No guarantee that corresponding heap tuple was flushed on
disk.

Isn't database corrupted now?

Vadim
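
Steps 3 and 4 of the example above, and what goes wrong when the drive lies about the flush, can be mimicked with a toy model. Everything here is hypothetical: the `Disk` class, its volatile cache, and the helper names are assumptions made for illustration and have nothing to do with the real bufmgr or XLogFlush() code. Note also that `pages_without_log` is an after-the-fact inspection of the toy disk, not something crash recovery could run, since recovery has no log entry telling it where to look.

```python
class Disk:
    """Toy drive with a volatile write cache and a durable platter."""

    def __init__(self, honest: bool):
        self.honest = honest
        self.cache = {}     # writes accepted but not yet durable
        self.platter = {}   # what survives a power loss

    def write(self, key, value):
        self.cache[key] = value

    def fsync(self, keys):
        """Report success; only an honest drive really flushes the keys."""
        if self.honest:
            for key in keys:
                if key in self.cache:
                    self.platter[key] = self.cache.pop(key)

    def power_loss(self, survivors=()):
        """Cached writes are lost, except those the drive happened to finish."""
        for key in survivors:
            if key in self.cache:
                self.platter[key] = self.cache[key]
        self.cache.clear()


def write_page_wal_first(disk, lsn, page_id, contents):
    disk.write(("log", lsn), (page_id, contents))
    disk.fsync([("log", lsn)])                       # step 3: flush the log record
    disk.write(("page", page_id), (lsn, contents))   # step 4: write the data page


def pages_without_log(platter):
    """Pages on the platter whose log record never became durable."""
    logged = {key[1] for key in platter if key[0] == "log"}
    return sorted(key[1] for key, value in platter.items()
                  if key[0] == "page" and value[0] not in logged)


for honest in (True, False):
    disk = Disk(honest)
    write_page_wal_first(disk, lsn=1, page_id="index_block_7", contents="tuple")
    # Crash: the drive completed the page write but nothing else in its cache.
    disk.power_loss(survivors=[("page", "index_block_7")])
    print("honest" if honest else "lying", pages_without_log(disk.platter))
```

With the honest drive the log record is durable before the page is written, so the list is empty; with the lying drive the index block is on disk with no log record behind it.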
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.
This example doesn't seem very convincing.  Wouldn't the XLOG entry
describing creation of the heap tuple appear in the log before the one
for the index tuple?  Or are you assuming that both these XLOG entries
are lost due to disk drive malfeasance?
        regards, tom lane
			
On Fri, Jan 12, 2001 at 04:10:36PM -0800, Alfred Perlstein wrote:
> Nathan Myers <ncm@zembu.com> [010112 15:49] wrote:
> >
> > Obviously it's better to configure the disk so that it doesn't
> > lie about what's been written.
>
> I thought WAL+fsync wasn't supposed to allow this to happen?

It's an OS and hardware configuration matter; you only get correct
WAL+fsync semantics if the underlying system is configured right.
IDE disks are almost always configured wrong, to spoof benchmarks;
SCSI disks sometimes are.

If they're configured wrong, then (now that we have a CRC in the
log entry) in the event of a power outage the database might come
back with recently-acknowledged transaction results discarded.
That's a lot better than a corrupt database, but it's not
industrial-grade semantics. (Use a UPS.)

Nathan Myers
ncm@zembu.com
> > If log record was not really flushed on disk in 3. but
> > on-disk image of index block was updated in 4. and system
> > crashed after this then after restart recovery you'll have
> > unlawful index tuple pointing to where? Who knows!
> > No guarantee that corresponding heap tuple was flushed on
> > disk.
>
> This example doesn't seem very convincing. Wouldn't the XLOG entry
> describing creation of the heap tuple appear in the log before the one
> for the index tuple? Or are you assuming that both these XLOG entries
> are lost due to disk drive malfeasance?

Yes, that was assumed.
When UNDO will be implemented and uncommitted tuples will be removed by
rollback part of after crash recovery we'll get corrupted database
without that assumption.

Vadim
Nathan Myers wrote:
>
> It wouldn't help you recover, but you would be able to report that
> you cannot recover.

While this could help detecting hardware problems, you still won't be
able to detect some (many) memory errors because the CRC will be
calculated on the already corrupted data.

Of course there are other situations where the CRC will not match, and
a mismatch, appropriately logged, is a reliable heads-up warning.

Bye!

--
Daniele
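
The limitation raised above can be shown in a few lines. This is a sketch with hypothetical helper names, using Python's `zlib.crc32` as a stand-in CRC: a bit that flips in memory before the CRC is computed is sealed in as if it were valid data, while the same flip after the CRC is computed is caught.

```python
import zlib
from typing import Tuple


def write_with_crc(buffer: bytes) -> Tuple[bytes, int]:
    """Compute the CRC over whatever is in the buffer at write time."""
    return buffer, zlib.crc32(buffer)


def verify(data: bytes, crc: int) -> bool:
    return zlib.crc32(data) == crc


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


good = b"heap tuple payload"

# Case 1: the bit flips in RAM before the CRC is calculated.
data, crc = write_with_crc(flip_bit(good, 5))
print("flip before CRC, check passes:", verify(data, crc), "data intact:", data == good)

# Case 2: the bit flips after the CRC is calculated (cable, disk, later rewrite).
data, crc = write_with_crc(good)
print("flip after CRC, check passes:", verify(flip_bit(data, 5), crc))
```

The earlier in the data path the check value is computed, the smaller the window in which case 1 can happen.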
>> AFAICS, disk-block CRCs do not guard against mishaps involving intended
>> writes.  They will help guard against data corruption that might creep
>> in due to outside factors, however.
> Right.  
Given that we seem to have agreed on that, I withdraw my complaint about
disk-block-CRC not being in there for 7.1.  I think we are still a ways
away from the point where externally-induced corruption is a major share
of our failure rate ;-).  7.2 or so will be time enough to add this
feature, and I'd really rather not force another initdb for 7.1.
        regards, tom lane
			
On Fri, Jan 12, 2001 at 11:30:30PM -0500, Tom Lane wrote:
> >> AFAICS, disk-block CRCs do not guard against mishaps involving intended
> >> writes. They will help guard against data corruption that might creep
> >> in due to outside factors, however.
>
> > Right.
>
> Given that we seem to have agreed on that, I withdraw my complaint about
> disk-block-CRC not being in there for 7.1. I think we are still a ways
> away from the point where externally-induced corruption is a major share
> of our failure rate ;-). 7.2 or so will be time enough to add this
> feature, and I'd really rather not force another initdb for 7.1.

More to the point, putting CRCs on data blocks might have unintended
consequences for dump or vacuum processes. 7.1 is a monumental
accomplishment even without corruption detection, and the sooner the
world has it, the better.

Nathan Myers
ncm@zembu.com
On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote:
> Example.
> 1. Tuple was inserted into index.
> 2. Looking for free buffer bufmgr decides to write index block.
> 3. Following WAL core rule bufmgr first calls XLogFlush() to write
>    and fsync log record related to index tuple insertion.
> 4. *Believing* that log record is on disk now (after successful fsync)
>    bufmgr writes index block.
>
> If log record was not really flushed on disk in 3. but on-disk image of
> index block was updated in 4. and system crashed after this then after
> restart recovery you'll have unlawful index tuple pointing to where?
> Who knows! No guarantee that corresponding heap tuple was flushed on
> disk.
>
> Isn't database corrupted now?

Note, I haven't read the WAL code, so much of what I've said is based
on what I know is and isn't possible with logging, rather than on
Vadim's actual choices. I know it's *possible* to implement a logging
database which can maintain consistency without need for strict write
ordering; but without strict write ordering, it is not possible to
guarantee durable transactions. That is, after a power outage, such
a database may be guaranteed to recover uncorrupted, but some number
(>= 0) of the last few acknowledged/committed transactions may be lost.

Vadim's implementation assumes strict write ordering, so that (e.g.)
with IDE disks a corrupt database is possible in the event of a power
outage. (Database and OS crashes don't count; those don't keep the
blocks from finding their way from on-disk buffers to disk.) This is
no criticism; it is more efficient to assume strict write ordering,
and a database that can lose (the last few) committed transactions
has limited value.

To achieve disk write-order independence is probably not a worthwhile
goal, but for systems that cannot provide strict write ordering (e.g.,
most PCs) it would be helpful to be able to detect that the database
has become corrupted. In Vadim's example above, if the index were to
contain not only the heap blocks' numbers, but also their CRCs, then
the corruption could be detected when the index is used. When the
block is read in, its CRC is checked, and when it is referenced via
the index, the two CRC values are simply compared and the corruption
is revealed.

On a machine that does provide strict write ordering, the CRCs in the
index might be unnecessary overhead, but they also provide cross-checks
to help detect corruption introduced by bugs and whatnot.

Or maybe I don't know what I'm talking about.

Nathan Myers
ncm@zembu.com
ncm@zembu.com (Nathan Myers) writes:
> To achieve disk write-order independence is probably not a worthwhile 
> goal, but for systems that cannot provide strict write ordering (e.g., 
> most PCs) it would be helpful to be able to detect that the database 
> has become corrupted.  In Vadim's example above, if the index were to
> contain not only the heap blocks' numbers, but also their CRCs, then 
> the corruption could be detected when the index is used.  When the 
> block is read in, its CRC is checked, and when it is referenced via 
> the index, the two CRC values are simply compared and the corruption
> is revealed. 
A row-level CRC might be useful for this, but it would have to be on
the data only (not the tuple commit-status bits).  It'd be totally
impractical with a block CRC, I think.  To do it with a block CRC, every
time you changed *anything* in a heap page, you'd have to find all the
index items for each row on the page and update their copies of the
heap block's CRC.  That could easily turn one disk-write into hundreds,
not to mention the index search costs.  Similarly, a check value that is
affected by tuple status updates would enormously increase the cost of
marking tuples committed or dead.
Instead of a partial row CRC, we could just as well use some other bit
of identifying information, say the row OID.  Given a block CRC on the
heap page, we'll be pretty confident already that the heap page is OK,
we just need to guard against the possibility that it's older than the
index item.  Checking that there is a valid tuple at the slot indicated
by the index item, and that it has the right OID, should be a good
enough (and cheap enough) test.
        regards, tom lane
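
The cheaper test proposed above (a valid tuple at the slot the index item points to, carrying the expected OID) could be sketched like this. The `HeapTuple` and `IndexItem` types, the dictionary used as a heap, and the function name are all assumptions made for illustration; they do not reflect PostgreSQL's actual index or heap tuple layouts.

```python
from typing import Dict, List, NamedTuple, Optional


class HeapTuple(NamedTuple):
    oid: int
    data: str


class IndexItem(NamedTuple):
    block: int   # heap block number (the TID's block part)
    slot: int    # slot within that block (the TID's offset part)
    oid: int     # extra identifying information stored in the index


# Hypothetical heap: block number -> list of slots, None meaning "no tuple".
Heap = Dict[int, List[Optional[HeapTuple]]]


def index_item_is_consistent(heap: Heap, item: IndexItem) -> bool:
    """True if the heap holds a tuple with the expected OID at the item's TID."""
    page = heap.get(item.block)
    if page is None or not 0 <= item.slot < len(page):
        return False
    tup = page[item.slot]
    return tup is not None and tup.oid == item.oid


heap: Heap = {0: [HeapTuple(oid=5001, data="alice"), None]}

print(index_item_is_consistent(heap, IndexItem(block=0, slot=0, oid=5001)))
print(index_item_is_consistent(heap, IndexItem(block=0, slot=1, oid=5002)))
print(index_item_is_consistent(heap, IndexItem(block=0, slot=0, oid=4999)))
```

The second and third lookups model a heap page that is older than the index item: the slot is either empty or still holds a different row.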
			
On Sat, Jan 13, 2001 at 12:49:34PM -0500, Tom Lane wrote:
> ncm@zembu.com (Nathan Myers) writes:
> > ... for systems that cannot provide strict write ordering (e.g.,
> > most PCs) it would be helpful to be able to detect that the database
> > has become corrupted. In Vadim's example above, if the index were to
> > contain not only the heap blocks' numbers, but also their CRCs, then
> > the corruption could be detected when the index is used. ...
>
> A row-level CRC might be useful for this, but it would have to be on
> the data only (not the tuple commit-status bits). It'd be totally
> impractical with a block CRC, I think. ...

I almost wrote about an indirect scheme to share the expected block CRC
value among all the index entries that need it, but thought it would
distract from the correct approach:

> Instead of a partial row CRC, we could just as well use some other bit
> of identifying information, say the row OID. ...

Good. But, wouldn't the TID be more specific? True, it would be pretty
unlikely for a block to have an old tuple with the right OID in the same
place. Belt-and-braces says check both :-).

Either way, the check seems independent of block CRCs. Would this check
be simple enough to be safe for 7.1?

Nathan Myers
ncm@zembu.com
On Sunday 14 January 2001 04:49, Tom Lane wrote:
> A row-level CRC might be useful for this, but it would have to be on
> the data only (not the tuple commit-status bits). It'd be totally
> impractical with a block CRC, I think. To do it with a block CRC, every
> time you changed *anything* in a heap page, you'd have to find all the
> index items for each row on the page and update their copies of the
> heap block's CRC. That could easily turn one disk-write into hundreds,
> not to mention the index search costs. Similarly, a check value that is
> affected by tuple status updates would enormously increase the cost of
> marking tuples committed or dead.

Ah, finally. Looks like we are moving in circles (or spirals ;-) )

Remember that some 3-4 months ago I requested help from this list
several times regarding a trigger function that implements a CRC only
on the user defined attributes? I wrote one in pgtcl which was slow,
and had trouble with the C equivalent due to lack of documentation.
I still believe this is useful enough that it should be an option in
Postgres and not a user defined function.

Horst
ncm@zembu.com (Nathan Myers) writes:
>> Instead of a partial row CRC, we could just as well use some other bit
>> of identifying information, say the row OID.   ...
> Good.  But, wouldn't the TID be more specific?
Uh, the TID *is* the pointer from index to heap.  There's no redundancy
that way.
> Would this check be simple enough to be safe for 7.1? 
It'd probably be safe, but adding OIDs to index tuples would force an
initdb, which I'd rather avoid at this stage of the cycle.
        regards, tom lane
			
> Instead of a partial row CRC, we could just as well use some other bit
> of identifying information, say the row OID. Given a block CRC on the
> heap page, we'll be pretty confident already that the heap page is OK,
> we just need to guard against the possibility that it's older than the
> index item. Checking that there is a valid tuple at the slot
> indicated by the index item, and that it has the right OID, should be
> a good enough (and cheap enough) test.

This would work in 7.1 but not in 7.2 anyway (assuming UNDO and true
transaction rollback to be implemented). There will be no permanent
pg_log and after crash recovery any heap tuple with unknown t_xmin
status will be assumed as committed. Rollback will remove tuples
inserted by uncommitted transactions but this will be possible only
for *logged* modifications.

One should properly configure disk drives instead of hacking around
this problem. "Log before modifying data pages" is *rule* for any WAL
system like Oracle, Informix and dozen others.

Vadim