Обсуждение: Fixes for two separate bugs in nbtree VACUUM's page deletion
Attached are two patches, both of which are fixes for bugs in nbtree VACUUM page deletion. The first fix for a bug in commit 857f9c36cda. The immediate issue is that the code that maintains the oldest btpo.xact in the index accesses the special area of pages without holding a buffer pin. More fundamentally, the logic for examining pages that are deleted by the current VACUUM operation for the purposes of maintaining the oldest btpo.xact needs to be pushed down into the guts of the page deletion code. This has been described on other threads [1][2], but I'm starting a new one to focus attention on the bugs themselves (any discussion of Valgrind instrumentation can remain on the other thread). The second fix is a spin-off of the first. It fixes a much less serious issue that is present in all supported versions (it has probably been around forever). The issue is with undercounting in the "%u index pages have been deleted" figure reported by VACUUM VERBOSE. For example, suppose I delete most of the tuples from a table with a B-Tree index, leading to lots of empty pages that are candidates for deletion (the specifics really aren't very important). I then run VACUUM VERBOSE, and can see something like this: 4900 index pages have been deleted, 0 are currently reusable. If I immediately run another VACUUM VERBOSE, I can see this: 4924 index pages have been deleted, 4924 are currently reusable. It makes sense that the pages weren't reusable in the first VACUUM, but were in the second VACUUM. But where did the extra 24 pages come from? That doesn't make any sense at all. With the second patch applied, the same test case shows the correct number ("4924 index pages have been deleted") consistently. See the patch for details of how the accounting is wrong. (The short version is that we need to push the accounting down into the guts of page deletion, which is why the second patch relies on the first patch.) I would like to backpatch both patches to all branches that have commit 857f9c36cda -- v11, v12, and master. The second issue isn't serious, but it seems worth keeping v11+ in sync in this area. Note that any backpatch theoretically creates an ABI break for callers of the _bt_pagedel() function. Perhaps I could get around this by using global state in the back branches or something, but that's ugly as sin. Besides, I have a very hard time imagining an extension that feels entitled to call _bt_pagedel(). There are all kinds of ways in which that's a terrible idea. And it's hard to imagine how that could ever seem useful. I'd like to hear what other people think about this ABI issue in particular, though. [1] https://postgr.es/m/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com [2] https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com -- Peter Geoghegan
Вложения
On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote: > I would like to backpatch both patches to all branches that have > commit 857f9c36cda -- v11, v12, and master. The second issue isn't > serious, but it seems worth keeping v11+ in sync in this area. Note > that any backpatch theoretically creates an ABI break for callers of > the _bt_pagedel() function. I plan to commit both fixes, while backpatching to v11, v12 and the master branch on Friday -- barring objections. I'm almost sure that the ABI thing is a non-issue, but it would be nice to get that seconded. -- Peter Geoghegan
On Tue, 28 Apr 2020 at 11:35, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote: > > I would like to backpatch both patches to all branches that have > > commit 857f9c36cda -- v11, v12, and master. The second issue isn't > > serious, but it seems worth keeping v11+ in sync in this area. Note > > that any backpatch theoretically creates an ABI break for callers of > > the _bt_pagedel() function. > > I plan to commit both fixes, while backpatching to v11, v12 and the > master branch on Friday -- barring objections. Thank you for the patches and for starting the new thread. I agree with both patches. For the first fix it seems better to push down the logic to the page deletion code as your 0001 patch does so. The following change changes the page deletion code so that it emits LOG message indicating the index corruption when a deleted page is passed. But we used to ignore in this case. @@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf) for (;;) { - page = BufferGetPage(buf); + page = BufferGetPage(leafbuf); opaque = (BTPageOpaque) PageGetSpecialPointer(page); /* * Internal pages are never deleted directly, only as part of deleting * the whole branch all the way down to leaf level. + * + * Also check for deleted pages here. Caller never passes us a fully + * deleted page. Only VACUUM can delete pages, so there can't have + * been a concurrent deletion. Assume that we reached any deleted + * page encountered here by following a sibling link, and that the + * index is corrupt. */ - if (!P_ISLEAF(opaque)) + Assert(!P_ISDELETED(opaque)); + if (!P_ISLEAF(opaque) || P_ISDELETED(opaque)) { /* * Pre-9.4 page deletion only marked internal pages as half-dead, @@ -1537,13 +1556,21 @@ _bt_pagedel(Relation rel, Buffer buf) errmsg("index \"%s\" contains a half-dead internal page", RelationGetRelationName(rel)), errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it."))); - _bt_relbuf(rel, buf); + + if (P_ISDELETED(opaque)) + ereport(LOG, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" contains a deleted page with a live sibling link", + RelationGetRelationName(rel)))); + + _bt_relbuf(rel, leafbuf); return ndeleted; } I agree with this change but since I'm not sure this change directly relates to this bug I wonder if it can be a separate patch. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > I agree with both patches. Thanks for the review. > For the first fix it seems better to push down the logic to the page > deletion code as your 0001 patch does so. The following change changes > the page deletion code so that it emits LOG message indicating the > index corruption when a deleted page is passed. But we used to ignore > in this case. > I agree with this change but since I'm not sure this change directly > relates to this bug I wonder if it can be a separate patch. I am not surprised that you are concerned about this. That was the hardest decision I had to make when writing the patch. The central idea in the 0001-* patch (as I say at the end of the commit message) is that we have a strict rule about where we handle maintaining the oldest btpo.xact for already-deleted pages, and where we handle it for pages that become deleted. The rules is this: btvacuumpage() is strictly responsible for the former category (as well as totally live pages), while _bt_pagedel() is responsible for the latter category (this includes pages that started out as half-dead but become deleted). In order for this to work, _bt_pagedel() must not ever see a deleted page that it did not delete itself. If I didn't explicitly take a strict position on this, then I suppose that I'd have to have similar handling for maintaining the oldest btpo.xact for existing deleted pages encountered within _bt_pagedel(). But that doesn't make any sense, even with corruption, so I took a strict position. Even with corruption, how could we encounter an *existing* deleted page in _bt_pagedel() but not encounter the same page within some call to btvacuumpage() made from btvacuumscan()? In other words, how can we fail to maintain the oldest btpo.xact for deleted pages even if we assume that the index has a corrupt page with sibling links that point to a fully deleted page? It seems impossible, even in this extreme, contrived scenario. Then there is the separate question of the new log message about corruption. It's not too hard to see why _bt_pagedel() cannot ever access an existing deleted page unless there is corruption: * _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply never pass a deleted page -- it handles those directly. * _bt_pagedel() can only access additional leaf pages to consider deleting them by following the right sibling link of the btvacuumpage() caller's leaf page (or possibly some page even further to the right if it ends up deleting many leaf pages in one go). Following right links and finding a deleted page can only happen when there is a concurrent page deletion by VACUUM, since the sibling links to the deleted page are changed by the second stage of deletion (i.e. by the atomic actionat where the page is actually marked deleted, within _bt_unlink_halfdead_page()). * There cannot be a concurrent VACUUM, though, since _bt_pagedel() runs in VACUUM. (Note that we already depend on this for correctness in _bt_unlink_halfdead_page(), which has a comment that says "a half-dead page cannot resurrect because there can be only one vacuum process running at a time".) The strict position seems justified. It really isn't that strict. I'm not concerned about the new LOG message concerning corruption being annoying or wrong -- Peter Geoghegan
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > For the first fix it seems better to push down the logic to the page > deletion code as your 0001 patch does so. The following change changes > the page deletion code so that it emits LOG message indicating the > index corruption when a deleted page is passed. But we used to ignore > in this case. Attached is v2 of the patch set, which includes a new patch that I want to target the master branch with. This patch (v2-0003-*) refactors btvacuumscan(). This revision also changed the first bugfix patch. I have simplified some of the details in nbtree.c that were added by commit 857f9c36cda. Can't we call _bt_update_meta_cleanup_info() at a lower level, like in the patch? AFAICT it makes more sense to just call it in btvacuumscan(). Please let me know what you think of those changes. The big change in the new master-only refactoring patch (v2-0003-*) is that I now treat a call to btvacuumpage() that has to "backtrack/recurse" and doesn't find a page that looks like the expected right half of a page split (a page split that occurred since the VACUUM operation began) as indicating corruption. This corruption is logged. I believe that we should never see this happen, for reasons that are similar to the reasons why _bt_pagedel() never finds a deleted page when moving right, as I went into in my recent e-mail to this thread. I think that this is worth tightening up for Postgres 13. I will hold off on committing v2-0003-*, since I need to nail down the reasoning for treating this condition as corruption. Plus it's not urgent. I think that the general direction taken in v2-0003-* is the right one, in any case. The "recursion" in btvacuumpage() doesn't make much sense -- it obscures what's really going on IMV. Also, using the variable name "scanblkno" at every level -- btvacuumscan(), btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same block at all levels of the code. And that the "backtrack/recursion" case doesn't kill tuples from the "scanblkno" block (and doesn't attempt to call _bt_pagedel(), which is designed only to handle blocks passed by btvacuumpage() when they are the current "scanblkno"). It seems unlikely that the VACUUM VERBOSE pages deleted accounting bug would have happened if the code was already structured this way. -- Peter Geoghegan
Вложения
On Wed, 29 Apr 2020 at 01:17, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > I agree with both patches. > > Thanks for the review. > > > For the first fix it seems better to push down the logic to the page > > deletion code as your 0001 patch does so. The following change changes > > the page deletion code so that it emits LOG message indicating the > > index corruption when a deleted page is passed. But we used to ignore > > in this case. > > > I agree with this change but since I'm not sure this change directly > > relates to this bug I wonder if it can be a separate patch. > > I am not surprised that you are concerned about this. That was the > hardest decision I had to make when writing the patch. > > The central idea in the 0001-* patch (as I say at the end of the > commit message) is that we have a strict rule about where we handle > maintaining the oldest btpo.xact for already-deleted pages, and where > we handle it for pages that become deleted. The rules is this: > btvacuumpage() is strictly responsible for the former category (as > well as totally live pages), while _bt_pagedel() is responsible for > the latter category (this includes pages that started out as half-dead > but become deleted). > > In order for this to work, _bt_pagedel() must not ever see a deleted > page that it did not delete itself. If I didn't explicitly take a > strict position on this, then I suppose that I'd have to have similar > handling for maintaining the oldest btpo.xact for existing deleted > pages encountered within _bt_pagedel(). But that doesn't make any > sense, even with corruption, so I took a strict position. Even with > corruption, how could we encounter an *existing* deleted page in > _bt_pagedel() but not encounter the same page within some call to > btvacuumpage() made from btvacuumscan()? In other words, how can we > fail to maintain the oldest btpo.xact for deleted pages even if we > assume that the index has a corrupt page with sibling links that point > to a fully deleted page? It seems impossible, even in this extreme, > contrived scenario. > > Then there is the separate question of the new log message about > corruption. It's not too hard to see why _bt_pagedel() cannot ever > access an existing deleted page unless there is corruption: > > * _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply > never pass a deleted page -- it handles those directly. > > * _bt_pagedel() can only access additional leaf pages to consider > deleting them by following the right sibling link of the > btvacuumpage() caller's leaf page (or possibly some page even further > to the right if it ends up deleting many leaf pages in one go). > Following right links and finding a deleted page can only happen when > there is a concurrent page deletion by VACUUM, since the sibling links > to the deleted page are changed by the second stage of deletion (i.e. > by the atomic actionat where the page is actually marked deleted, > within _bt_unlink_halfdead_page()). > > * There cannot be a concurrent VACUUM, though, since _bt_pagedel() > runs in VACUUM. (Note that we already depend on this for correctness > in _bt_unlink_halfdead_page(), which has a comment that says "a > half-dead page cannot resurrect because there can be only one vacuum > process running at a time".) > > The strict position seems justified. It really isn't that strict. I'm > not concerned about the new LOG message concerning corruption being > annoying or wrong Thank you for the explanation. This makes sense to me. We've ever been able to regard it as an index corruption that an existing deleted page is detected within _bt_pagedel(). But with changes required for fixing this bug, the responsibilities will change so that _bt_pagedel() must not see a deleted page. Therefore we can take a more strict position. It's not that since this bug could lead an index corruption we will start warning it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 30 Apr 2020 at 07:29, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > For the first fix it seems better to push down the logic to the page > > deletion code as your 0001 patch does so. The following change changes > > the page deletion code so that it emits LOG message indicating the > > index corruption when a deleted page is passed. But we used to ignore > > in this case. > > Attached is v2 of the patch set, which includes a new patch that I > want to target the master branch with. This patch (v2-0003-*) > refactors btvacuumscan(). > > This revision also changed the first bugfix patch. I have simplified > some of the details in nbtree.c that were added by commit 857f9c36cda. > Can't we call _bt_update_meta_cleanup_info() at a lower level, like in > the patch? AFAICT it makes more sense to just call it in > btvacuumscan(). Please let me know what you think of those changes. Yes, I agree. > > The big change in the new master-only refactoring patch (v2-0003-*) is > that I now treat a call to btvacuumpage() that has to > "backtrack/recurse" and doesn't find a page that looks like the > expected right half of a page split (a page split that occurred since > the VACUUM operation began) as indicating corruption. This corruption > is logged. I believe that we should never see this happen, for reasons > that are similar to the reasons why _bt_pagedel() never finds a > deleted page when moving right, as I went into in my recent e-mail to > this thread. I think that this is worth tightening up for Postgres 13. > > I will hold off on committing v2-0003-*, since I need to nail down the > reasoning for treating this condition as corruption. Plus it's not > urgent. I think that the general direction taken in v2-0003-* is the > right one, in any case. The "recursion" in btvacuumpage() doesn't make > much sense -- it obscures what's really going on IMV. Also, using the > variable name "scanblkno" at every level -- btvacuumscan(), > btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same > block at all levels of the code. And that the "backtrack/recursion" > case doesn't kill tuples from the "scanblkno" block (and doesn't > attempt to call _bt_pagedel(), which is designed only to handle blocks > passed by btvacuumpage() when they are the current "scanblkno"). It > seems unlikely that the VACUUM VERBOSE pages deleted accounting bug > would have happened if the code was already structured this way. +1 for refactoring. I often get confused that btvacuumpage() takes two block numbers (blkno and orig_blkno) in spite of the same value. Your change also makes it clear. For the part of treating that case as an index corruption I will need some time to review because of lacking knowledge of btree indexes. So I'll review it later. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Apr 29, 2020 at 11:38 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > Thank you for the explanation. This makes sense to me. Pushed both of the fixes. Thanks! -- Peter Geoghegan
On Thu, Apr 30, 2020 at 12:20 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > For the part of treating that case as an index corruption I will need > some time to review because of lacking knowledge of btree indexes. So > I'll review it later. I pushed the refactoring patch today. Thanks for the review. The final test for corruption that I added to btvacuumscan() is less aggressive than what you saw in the patch I posted. We only report corruption when backtracking/recursing if the page is "new", not a leaf page, or is half-dead. We don't treat a fully deleted page as corruption, because there is a case where the same call to btvacuumscan() may have called _bt_pagedel() already, which may have deleted the block that we backtrack/recurse to. The "sibling links cannot point to a deleted page without concurrent deletion, and we know that can't happen because we are VACUUM" stuff doesn't really apply -- we remember which block we will recurse to *before* we actually call _bt_pagedel(). -- Peter Geoghegan