Re: Block-level CRC checks
From | Greg Stark |
---|---|
Subject | Re: Block-level CRC checks |
Date | |
Msg-id | 7CE61C21-DA8B-4C7F-AC77-1E3B76E3BB0D@enterprisedb.com |
In reply to | Re: Block-level CRC checks (Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>) |
Responses | Re: Block-level CRC checks (Aidan Van Dyk <aidan@highrise.ca>) |
List | pgsql-hackers |
[sorry for top-posting - damn phone]

I thought of saying that too but it doesn't really solve the problem. Think of what happens if someone sets a hint bit on a dirty page.

greg

On 17 Nov 2008, at 08:26 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:

> Martijn van Oosterhout wrote:
>> On Fri, Nov 14, 2008 at 10:51:57AM -0500, Tom Lane wrote:
>>> In fact, if the patch were to break torn-page handling, it would be
>>> 100% likely to be a net *decrease* in system reliability. It would add
>>> detection of a situation that is not supposed to happen (ie, storage
>>> system fails to return the same data it stored) at the cost of breaking
>>> one's database when the storage system acts as it's expected and
>>> documented to in a routine power-loss situation.
>>
>> Ok, I see it's a problem because the hint changes are not WAL logged,
>> so torn pages are expected to work in normal operation. But simply
>> skipping the hint bits during checksumming is a terrible solution,
>> since then any errors in those bits will go undetected. To not be able
>> to say in the documentation that you'll detect 100% of single-bit
>> errors is pretty darn terrible, since that's kind of the goal of the
>> exercise.
>
> Agreed, trying to explain that in the documentation would look like
> making excuses.
>
> The requirement that all hint bit changes are WAL-logged seems like
> a pretty big change. I don't like doing that, just for CRCing.
>
> There has been discussion before about not writing out pages to disk
> that only have hint-bit updates on them. That means that the next
> time the page is read, the reader needs to do the clog lookups and
> set the hint bits again. It's a tradeoff, making the first SELECT
> after modifying a page cheaper, I/O-wise, at the cost of making all
> subsequent SELECTs that need to read the page from disk or kernel
> cache more expensive, CPU-wise.
>
> I'm not sure if I like that idea or not, but it would also solve the
> CRC problem with torn pages. FWIW, it would also solve the problem
> suggested with IBM DTLA disks and others that might zero-out a
> sector in case of an interrupted write. I'm not totally convinced
> that's a problem, as there's apparently other software that makes the
> same assumption as we do, and we haven't heard of any torn-page
> corruption in real life, but still.
>
> If we made the behavior configurable, that would be pretty hard to
> explain in the docs. We'd have three options with dependencies
>
> - CRC on/off
> - write pages with only hint bit changes on/off
> - full_page_writes on/off
>
> If you disable full_page_writes, you're vulnerable to torn pages. If you
> enable it, you're not. Except if you also turn CRC on. Except if you
> also turn "write pages with only hint bit changes" off.
>
>> Unfortunately, there's not a lot of easy solutions here. You could do
>> two checksums, one with and one without hint bits. The overall checksum
>> tells you if there's a problem. If it doesn't match the second checksum
>> will tell you if it's the hint bits or not (torn page problem). If it's
>> the hint bits you can reset them all and continue. The checksums need
>> not be of equal strength.
>
> Hmm, that would work I guess.
>
>> The extreme case is an ECC where you explicitly can set it so you can
>> alter N bits before you need to recalculate the checksum.
>> Computationally though, that sucks.
>
> Yep. Also, in case of a torn page, you're very likely going to have
> several hint bits from the old image and several from the new image.
> An error-correcting code would need to be unfeasibly long to cope
> with that.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
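
To make the two-checksum idea from the quoted text concrete, here is a minimal Python sketch. It is not PostgreSQL code, and it rests on assumptions that are not in the thread: a toy 64-byte page, hypothetical hint-bit positions (`HINT_OFFSETS`, `HINT_MASK`), `zlib.crc32` for both checksums, and checksums stored outside the page so that a torn write cannot damage them. The function names are illustrative only.

```python
import zlib

# All layout constants are hypothetical, chosen only for illustration.
PAGE_SIZE = 64              # toy page size, far smaller than a real page
HINT_OFFSETS = (8, 24, 40)  # assumed byte offsets that carry hint bits
HINT_MASK = 0x0F            # assumed: low 4 bits of those bytes are hint bits


def mask_hints(page: bytes) -> bytes:
    """Return a copy of the page with every hint bit cleared."""
    buf = bytearray(page)
    for off in HINT_OFFSETS:
        buf[off] &= ~HINT_MASK & 0xFF
    return bytes(buf)


def checksums(page: bytes) -> tuple[int, int]:
    """Compute (checksum over everything, checksum ignoring hint bits)."""
    return zlib.crc32(page), zlib.crc32(mask_hints(page))


def verify(page: bytes, stored_full: int, stored_masked: int):
    """Classify a page read back from storage.

    Returns ("ok", page) when the full checksum matches,
    ("hints_reset", page_with_hints_cleared) when only hint bits differ,
    and ("corrupt", None) when the non-hint data is damaged.
    """
    if zlib.crc32(page) == stored_full:
        return "ok", page
    cleared = mask_hints(page)
    if zlib.crc32(cleared) == stored_masked:
        # Only hint bits disagree: drop them all and carry on, since they
        # can be recomputed later from the commit log.
        return "hints_reset", cleared
    return "corrupt", None
```

With a page written as `full, masked = checksums(page)`, flipping a bit inside one of the assumed hint positions makes `verify` report `"hints_reset"`, while flipping any other bit yields `"corrupt"`. The sketch does not address the objection raised at the top of this message about hint bits being set on a page that is already dirty.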