Discussion: amcheck's verify_heapam(), and HOT chain verification
I recently pushed a commit that formalized what we expect of HOT chains during pruning -- see the assertions and comments added by commit 5cd7eb1f, "Add various assertions to heap pruning code". I notice that verify_heapam() doesn't really do anything with HOT chains, and I'd say that that's a gap. Fortunately, it should not be terribly difficult to fill.

The short version is this: I think that verify_heapam() should validate whole HOT chains, not just individual heap-only tuples. This can be accomplished by taking an approach similar to the one that heap_prune_chain() already takes for pruning.

For a given heap page, pruning fills in its PruneState variable to build a coherent picture of the state of all HOT chains on the entire page before modifying anything (barring hint bit setting). We only reach heap_page_prune_execute() once we have the whole picture about HOT chains. This high-level approach is essential when pruning. Consider LP_REDIRECT items. We might need to set an existing LP_REDIRECT item to LP_DEAD, rendering an entire HOT chain dead. Whether or not that's the appropriate course of action for a given LP_REDIRECT item depends on the state of the HOT chain as a whole.

The HOT chain could of course have several heap-only tuples, and if only one of them is visible, the LP_REDIRECT root item must not be set LP_DEAD -- it should be made to link to the remaining live heap-only tuple instead (while the dead intermediate heap-only tuples become LP_UNUSED, never LP_DEAD). This is made even trickier by the fact that we have to consider HOT updaters with aborted transactions during pruning.

We even have to consider the case where a HOT updater aborts, but another HOT updater tries again and commits. Now we have a heap-only tuple that is "disconnected from its original HOT chain" (or will be until the next prune).
Having a stray/disconnected heap-only tuple like this ought to be fine -- the next pruning operation will get to it eventually (see the special handling for these aborted heap-only tuples at the top of heap_prune_chain()). But any disconnected heap-only tuple had better be from an aborted xact -- otherwise we have corruption. This is corruption that looks like index corruption, because sequential scans still give correct answers (I've seen this several times before, ask me how). But it's not index corruption, contrary to appearances -- it's heap corruption.

Here are some specific checks I have in mind:

* Right now, we throw an error when an LP_REDIRECT points to an LP_UNUSED item. But that existing test should go further, like the assertions I added recently. It should make sure that the LP_REDIRECT links directly to an LP_NORMAL item that has tuple storage. And the stored tuple must be a heap-only tuple.

Note, in particular, that it's not okay if the LP_REDIRECT points to an LP_DEAD item (which amcheck won't complain about today). That is at least my position on it, and so far it looks like the right position (no assertion failures reported yet).

* HOT chain validation: every visible/committed heap-only tuple should be reachable through a recognizable root item for its HOT chain, or through an earlier member of the HOT chain via its t_ctid link. The root item can either be an LP_REDIRECT or a regular not-heap-only tuple -- but it can't be anything else.

* All other heap-only tuples (all heap-only tuples that don't seem to be part of a valid HOT chain) must be from aborted transactions -- as I said, anything else has to be corruption (heap corruption disguised as index corruption, even).

Note that we're very permissive about broken HOT chains -- in general a heap-only tuple's t_ctid link can have pretty wild values, and that's okay. For example, it can point past the current bounds of the page's line pointer array.
Everything is still robust because of the way that we validate HOT chains during chain traversal -- we match tuple N's xmax to tuple N+1's xmin, all the time, inside code like heap_hot_search_buffer() that is routinely used during index scans. However, the root item of a HOT chain (usually an LP_REDIRECT) absolutely must point to a valid heap-only tuple at all times -- that's very different indeed. It's the starting point for an index scan -- the TID stored in indexes, which needs to be stable over time. And so pruning needs to leave behind whole HOT chains that look coherent, just as we ought to verify that pruning consistently gets it right.

--
Peter Geoghegan
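To make the invariants in the message above concrete, here is a toy sketch of page-level HOT chain validation. This is emphatically not verify_heapam()'s code or data model -- every name here (LinePointer, HeapTuple, check_hot_chains) is invented for illustration, and the real code works on raw page data and real transaction status. But the shape is the same: find the root items, walk each chain while matching a tuple's xmax to its successor's xmin, then complain about any unreached heap-only tuple whose inserting transaction committed.

```python
# Illustrative toy model only -- these structures and names are invented,
# not PostgreSQL's. Offsets index a Python list standing in for the
# page's line pointer array.
from dataclasses import dataclass
from typing import Optional

LP_UNUSED, LP_NORMAL, LP_REDIRECT, LP_DEAD = range(4)

@dataclass
class HeapTuple:
    xmin: int                     # inserting xact
    xmax: Optional[int] = None    # updating xact, if any
    heap_only: bool = False
    ctid: Optional[int] = None    # next chain member's offset, if any

@dataclass
class LinePointer:
    state: int
    target: Optional[int] = None      # LP_REDIRECT: redirected-to offset
    tup: Optional[HeapTuple] = None   # LP_NORMAL: stored tuple

def check_hot_chains(page, aborted_xids):
    """Return corruption complaints for one (toy) heap page."""
    problems, reached = [], set()
    for off, lp in enumerate(page):
        # Chain roots: LP_REDIRECT items, or normal non-heap-only tuples.
        if lp.state == LP_REDIRECT:
            t = lp.target
            if t is None or not 0 <= t < len(page):
                problems.append(f"{off}: redirect outside line pointer array")
                continue
            if page[t].state != LP_NORMAL or not page[t].tup.heap_only:
                # Covers LP_DEAD and LP_UNUSED targets, too.
                problems.append(f"{off}: redirect to non-heap-only item")
                continue
            cur = t
        elif lp.state == LP_NORMAL and not lp.tup.heap_only:
            cur = off
        else:
            continue
        # Walk the chain, matching each tuple's xmax to its successor's xmin.
        while cur is not None and cur not in reached:
            reached.add(cur)
            tup = page[cur].tup
            nxt = tup.ctid
            if nxt is None or not 0 <= nxt < len(page):
                break  # wild t_ctid values are tolerated
            if page[nxt].state != LP_NORMAL or tup.xmax != page[nxt].tup.xmin:
                break  # chain simply ends here
            cur = nxt
    # Any heap-only tuple not reached must be from an aborted inserter.
    for off, lp in enumerate(page):
        if (lp.state == LP_NORMAL and lp.tup.heap_only
                and off not in reached and lp.tup.xmin not in aborted_xids):
            problems.append(f"{off}: disconnected heap-only tuple, committed xmin")
    return problems
```

A page whose LP_REDIRECT points at a live heap-only chain passes cleanly; a committed heap-only tuple that no chain reaches, or a redirect landing on an LP_DEAD item, gets reported.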
> On Nov 5, 2021, at 7:51 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I recently pushed a commit which formalized what we expect of HOT
> chains during pruning -- see assertions and comments added by commit
> 5cd7eb1f, "Add various assertions to heap pruning code". I notice that
> verify_heapam() doesn't really do anything with HOT chains, and I'd
> say that that's a gap. Fortunately it should not be terribly difficult
> to fill in this gap.
>
> [...]

Thanks, Peter, for this analysis. It's getting a bit late in the day, but I'll try this out tomorrow, I expect.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Nov 5, 2021 at 8:18 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> Thanks, Peter, for this analysis.

It's strategically important work. What you've done so far has every chance of catching corruption involving storage issues, which is great. But we ought to do more to catch certain known general patterns of corruption. The so-called "freeze the dead" bug (see commit 9c2f0a6c for the final fix) is the single weirdest and scariest bug that I can recall (there were a couple of incorrect bug fixes that had to be reverted), and it is very much the kind of thing that I'd like to see verify_heapam.c handle exhaustively. That would de-risk a lot of things. Note that there have been several other HOT chain corruption bugs of the same variety besides that one, so this is really a general class of corruption, not one specific bug.

The heapallindexed verification option actually detected the "freeze the dead" bug [1], although that was a total accident -- that checking option happens to use CREATE INDEX infrastructure, which happens to sanitize HOT chains by calling heap_get_root_tuples() and detecting contradictions -- heap_get_root_tuples() is code that sits right next to the pruning code I mentioned. That level of detection is certainly better than nothing, but it's still not good enough -- we need this in verify_heapam, too.

Here's why I think that we need very thorough HOT chain verification that lives directly in verify_heapam:

First of all, the heapallindexed option is rather expensive, whereas the additional overhead for verify_heapam would be minimal -- you probably wouldn't even need to make it optional.

Second of all, why should you need to use bt_index_check() to detect what is, after all, heap corruption?

And third of all, the approach of detecting corruption by outsourcing detection to heapam_index_build_range_scan() (which calls heap_get_root_tuples() itself) probably risks missing corrupt HOT chains where the corruption doesn't look a certain way -- it was not designed with amcheck in mind. heapam_index_build_range_scan() + heap_get_root_tuples() will notice a heap-only tuple without a parent, which is a good start. But what about an LP_REDIRECT item whose lp_off (i.e., page offset number redirect) somehow points to any item on the page other than a valid heap-only tuple? It also doesn't catch contradictory commit states for heap-only tuples that form a HOT chain through their t_ctid links. I'm sure that there are other things that I haven't thought of, too -- there is a lot going on here.

This HOT chain verification business is a little like the checking that we do in verify_nbtree.c, actually. The individual tuples (in this case heap tuples) may well not be corrupt (or demonstrably corrupt) in isolation. But taken together, in a particular page offset number sequence order, they can nevertheless *imply* corruption -- it's implied if the tuples tell a contradictory story about what's going on at the level of the whole page. A HOT chain's LP_REDIRECT is, after all, a little like a "mini index" stored on a heap page. Sequential scans don't care about LP_REDIRECTs at all. But index scans expect a great deal from LP_REDIRECTs.

[1] https://www.postgresql.org/message-id/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com

--
Peter Geoghegan
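The "contradictory commit states" point above can be illustrated with a small sketch (invented names again, not amcheck code): once a tuple's updater is known to have committed, its t_ctid successor's xmin must be that very same xact, so a mismatch on a committed link implies corruption, whereas a mismatch on an aborted link is just the ordinary dangling case that pruning cleans up.

```python
# Toy sketch of the "contradictory commit states" check described above.
# ChainTuple and find_contradictory_links are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChainTuple:
    xmin: int
    xmax: Optional[int] = None  # updater's xact, if the tuple was updated

def find_contradictory_links(chain, committed_xids):
    """Flag chain positions where a committed updater's successor mismatches.

    An aborted updater legitimately leaves a dangling link behind, so only
    links whose xmax is known-committed are treated as contradictions.
    """
    problems = []
    for n in range(len(chain) - 1):
        tup, succ = chain[n], chain[n + 1]
        if (tup.xmax is not None and tup.xmax in committed_xids
                and tup.xmax != succ.xmin):
            problems.append(n)
    return problems
```

A chain [xmin=5/xmax=7 -> xmin=7] is consistent when xact 7 committed; [xmin=5/xmax=7 -> xmin=9] is contradictory if 7 committed, but unremarkable if 7 aborted.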
On Fri, Nov 5, 2021 at 7:51 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Here are some specific checks I have in mind:

One more for the list:

* Validate PageIsAllVisible() for each page. In other words, pg_visibility should be merged into verify_heapam.c (or at least pg_visibility's pg_check_frozen() and pg_check_visible() functions should be moved, merged, or whatever). This would mean that verify_heapam() would directly check whether the page-level PD_ALL_VISIBLE flag contradicts the tuple headers of tuples with storage on the page, the presence (or absence) of LP_DEAD stub line pointers on the page, or the corresponding visibility map bit (e.g., VISIBILITYMAP_ALL_VISIBLE) for the page.

There is value in teaching verify_heapam() about any possible problem, including problems with the visibility map, but it's certainly less valuable than the HOT chain verification stuff -- and probably trickier to get right. I'm mentioning it now to be exhaustive, but it's less of a priority for me personally.

I am quite willing to help out with all this, if you're interested.

One more thing about HOT chain validation: I can give you another example of the kind of bug I'd expect verify_heapam() to catch only with full HOT chain validation. This one is a vintage MultiXact bug with the same basic HOT-chain-corruption, looks-like-index-corruption-but-isn't quality as the more memorable freeze-the-dead bug (this one was fixed by commit 6bfa88ac):

https://www.postgresql.org/message-id/CAM3SWZTMQiCi5PV5OWHb%2BbYkUcnCk%3DO67w0cSswPvV7XfUcU5g%40mail.gmail.com

In general I think that reviewing historic examples of pernicious corruption bugs is a valuable exercise when designing tools like amcheck. Maybe even revert the fix during testing, to be sure it would have been caught had the final tool been available. History doesn't repeat itself, but it does rhyme.

--
Peter Geoghegan
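The PD_ALL_VISIBLE cross-checks suggested above could be sketched roughly as follows. This is a simplified toy model with invented names: the flag, the visibility-map bit, and the xmin horizon are passed in explicitly here, whereas the real code would read them from the page header, the visibility map fork, and the running snapshot machinery, and real visibility rules involve considerably more than a committed-xmin-before-horizon test.

```python
# Toy sketch of cross-checking the all-visible flag against page contents.
# All names here are invented for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class VisTuple:
    xmin: int
    xmax: Optional[int] = None

def check_all_visible(pd_all_visible, vm_all_visible, tuples,
                      lp_dead_count, oldest_xmin, committed_xids):
    """Return complaints where visibility metadata contradicts the page."""
    problems = []
    if pd_all_visible:
        if lp_dead_count > 0:
            problems.append("PD_ALL_VISIBLE set, but LP_DEAD items remain")
        for i, t in enumerate(tuples):
            # All-visible promises every stored tuple is visible to everyone:
            # a committed inserter older than the horizon, and no deleter.
            if (t.xmin not in committed_xids or t.xmin >= oldest_xmin
                    or t.xmax is not None):
                problems.append(f"tuple {i} is not visible to everyone")
    if vm_all_visible and not pd_all_visible:
        problems.append("VM all-visible bit set, but page-level flag is clear")
    return problems
```

The interesting cases are the contradictions: an all-visible page still carrying LP_DEAD stubs, a tuple with a live xmax on an all-visible page, or a set visibility map bit over a page whose own flag is clear.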
> On Nov 6, 2021, at 3:09 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I am quite willing to help out with all this, if you're interested.

Yes, I am quite interested, though I will have to alternate between this work and the various patch sets that I've already submitted for this development cycle.

I think we need a corruption-generating framework that can be deployed on the buildfarm. What I have in mind is inspired by your comments about the "freeze the dead" bug. We need to be able to build versions of the database with known bugs enabled, both real-world bugs from the past and contrived bugs we write only for this purpose, so as to create non-trivial corruption on the buildfarm. I think it would be sufficient if such builds were run less frequently, perhaps triggered by commits to a corruption-testing branch? We could keep the old bugs alive on that branch while periodically merging in everything else from HEAD? That seems a tad tiresome, but maybe it wouldn't be too bad if the intentional bugs were limited to just a few backend files, and kept as much as possible in code at the end of the file, or in separate files, to reduce merge conflicts?

I'm cc'ing Andrew to get his thoughts about the buildfarm integration....

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Nov 7, 2021 at 9:30 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> Yes, I am quite interested, though I will have to alternate between this work and the various patch sets that I've already submitted for this development cycle.

Great! We still have a lot of work to do with HOT chain level invariants. I have seen very clear evidence of that in the last 24 hours:

https://postgr.es/m/CAH2-Wzma=Y3O+LRx2Wj_HwGbbbeNwr6FoJzXni8hxOMw55pcZg@mail.gmail.com

> I think we need a corruption generating framework that can be deployed on the buildfarm. What I have in mind is inspired by your comments about the "freeze the dead" bug.

I'm not sure that that's truly necessary. IMV the important thing is that we formalize the invariants, and have tooling that can test them. Maintaining tests that actually exhibit specific broken behavior (as opposed to its absence) seems like it might be quite a burden. There are many specific ways that these invariants might break, but the specifics shouldn't matter -- the invariants should cut through that (to the extent that that's possible). The "freeze the dead" bug is mostly useful as a way of framing the discussion (perhaps even in code comments).

I did this with a historic CREATE INDEX CONCURRENTLY bug, in code comments for heapallindexed verification. Verification using heapallindexed actually detected two more CIC bugs years later -- not including the earlier fixes that didn't quite get everything right (thinking of the prepared-xact CIC bugs found using amcheck). The invariants that heapallindexed tests generalize to many different situations, including situations that I couldn't have possibly anticipated with any kind of precision. Having the right high-level idea is what really mattered.

--
Peter Geoghegan