Обсуждение: Fixes for two separate bugs in nbtree VACUUM's page deletion

Поиск
Список
Период
Сортировка

Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
Attached are two patches, both of which are fixes for bugs in nbtree
VACUUM page deletion.

The first fix for a bug in commit 857f9c36cda. The immediate issue is
that the code that maintains the oldest btpo.xact in the index
accesses the special area of pages without holding a buffer pin. More
fundamentally, the logic for examining pages that are deleted by the
current VACUUM operation for the purposes of maintaining the oldest
btpo.xact needs to be pushed down into the guts of the page deletion
code. This has been described on other threads [1][2], but I'm
starting a new one to focus attention on the bugs themselves (any
discussion of Valgrind instrumentation can remain on the other
thread).

The second fix is a spin-off of the first. It fixes a much less
serious issue that is present in all supported versions (it has
probably been around forever). The issue is with undercounting in the
"%u index pages have been deleted" figure reported by VACUUM VERBOSE.

For example, suppose I delete most of the tuples from a table with a
B-Tree index, leading to lots of empty pages that are candidates for
deletion (the specifics really aren't very important). I then run
VACUUM VERBOSE, and can see something like this:

4900 index pages have been deleted, 0 are currently reusable.

If I immediately run another VACUUM VERBOSE, I can see this:

4924 index pages have been deleted, 4924 are currently reusable.

It makes sense that the pages weren't reusable in the first VACUUM,
but were in the second VACUUM. But where did the extra 24 pages come
from? That doesn't make any sense at all. With the second patch
applied, the same test case shows the correct number ("4924 index
pages have been deleted") consistently. See the patch for details of
how the accounting is wrong. (The short version is that we need to
push the accounting down into the guts of page deletion, which is why
the second patch relies on the first patch.)

I would like to backpatch both patches to all branches that have
commit 857f9c36cda -- v11, v12, and master. The second issue isn't
serious, but it seems worth keeping v11+ in sync in this area. Note
that any backpatch theoretically creates an ABI break for callers of
the _bt_pagedel() function. Perhaps I could get around this by using
global state in the back branches or something, but that's ugly as
sin. Besides, I have a very hard time imagining an extension that
feels entitled to call _bt_pagedel(). There are all kinds of ways in
which that's a terrible idea. And it's hard to imagine how that could
ever seem useful. I'd like to hear what other people think about this
ABI issue in particular, though.

[1] https://postgr.es/m/CAH2-Wz=mWPBLZ2cr95cBV=kzPav8OOBtkHTfg+-+AUiozANK0w@mail.gmail.com
[2] https://postgr.es/m/CAH2-WzkLgyN3zBvRZ1pkNJThC=xi_0gpWRUb_45eexLH1+k2_Q@mail.gmail.com
-- 
Peter Geoghegan

Вложения

Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I would like to backpatch both patches to all branches that have
> commit 857f9c36cda -- v11, v12, and master. The second issue isn't
> serious, but it seems worth keeping v11+ in sync in this area. Note
> that any backpatch theoretically creates an ABI break for callers of
> the _bt_pagedel() function.

I plan to commit both fixes, while backpatching to v11, v12 and the
master branch on Friday -- barring objections.

I'm almost sure that the ABI thing is a non-issue, but it would be
nice to get that seconded.

-- 
Peter Geoghegan



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Masahiko Sawada
Дата:
On Tue, 28 Apr 2020 at 11:35, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Apr 27, 2020 at 11:02 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > I would like to backpatch both patches to all branches that have
> > commit 857f9c36cda -- v11, v12, and master. The second issue isn't
> > serious, but it seems worth keeping v11+ in sync in this area. Note
> > that any backpatch theoretically creates an ABI break for callers of
> > the _bt_pagedel() function.
>
> I plan to commit both fixes, while backpatching to v11, v12 and the
> master branch on Friday -- barring objections.

Thank you for the patches and for starting the new thread.

I agree with both patches.

For the first fix it seems better to push down the logic to the page
deletion code as your 0001 patch does so. The following change changes
the page deletion code so that it emits LOG message indicating the
index corruption when a deleted page is passed. But we used to ignore
in this case.

@@ -1511,14 +1523,21 @@ _bt_pagedel(Relation rel, Buffer buf)

    for (;;)
    {
-       page = BufferGetPage(buf);
+       page = BufferGetPage(leafbuf);
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /*
         * Internal pages are never deleted directly, only as part of deleting
         * the whole branch all the way down to leaf level.
+        *
+        * Also check for deleted pages here.  Caller never passes us a fully
+        * deleted page.  Only VACUUM can delete pages, so there can't have
+        * been a concurrent deletion.  Assume that we reached any deleted
+        * page encountered here by following a sibling link, and that the
+        * index is corrupt.
         */
-       if (!P_ISLEAF(opaque))
+       Assert(!P_ISDELETED(opaque));
+       if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
        {
            /*
             * Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1537,13 +1556,21 @@ _bt_pagedel(Relation rel, Buffer buf)
                         errmsg("index \"%s\" contains a half-dead
internal page",
                                RelationGetRelationName(rel)),
                         errhint("This can be caused by an interrupted
VACUUM in version 9.3 or older, before upgrade. Please REINDEX
it.")));
-           _bt_relbuf(rel, buf);
+
+           if (P_ISDELETED(opaque))
+               ereport(LOG,
+                       (errcode(ERRCODE_INDEX_CORRUPTED),
+                        errmsg("index \"%s\" contains a deleted page
with a live sibling link",
+                               RelationGetRelationName(rel))));
+
+           _bt_relbuf(rel, leafbuf);
            return ndeleted;
        }

I agree with this change but since I'm not sure this change directly
relates to this bug I wonder if it can be a separate patch.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> I agree with both patches.

Thanks for the review.

> For the first fix it seems better to push down the logic to the page
> deletion code as your 0001 patch does so. The following change changes
> the page deletion code so that it emits LOG message indicating the
> index corruption when a deleted page is passed. But we used to ignore
> in this case.

> I agree with this change but since I'm not sure this change directly
> relates to this bug I wonder if it can be a separate patch.

I am not surprised that you are concerned about this. That was the
hardest decision I had to make when writing the patch.

The central idea in the 0001-* patch (as I say at the end of the
commit message) is that we have a strict rule about where we handle
maintaining the oldest btpo.xact for already-deleted pages, and where
we handle it for pages that become deleted. The rules is this:
btvacuumpage() is strictly responsible for the former category (as
well as totally live pages), while _bt_pagedel() is responsible for
the latter category (this includes pages that started out as half-dead
but become deleted).

In order for this to work, _bt_pagedel() must not ever see a deleted
page that it did not delete itself. If I didn't explicitly take a
strict position on this, then I suppose that I'd have to have similar
handling for maintaining the oldest btpo.xact for existing deleted
pages encountered within _bt_pagedel(). But that doesn't make any
sense, even with corruption, so I took a strict position. Even with
corruption, how could we encounter an *existing* deleted page in
_bt_pagedel() but not encounter the same page within some call to
btvacuumpage() made from btvacuumscan()? In other words, how can we
fail to maintain the oldest btpo.xact for deleted pages even if we
assume that the index has a corrupt page with sibling links that point
to a fully deleted page? It seems impossible, even in this extreme,
contrived scenario.

Then there is the separate question of the new log message about
corruption. It's not too hard to see why _bt_pagedel() cannot ever
access an existing deleted page unless there is corruption:

* _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
never pass a deleted page -- it handles those directly.

* _bt_pagedel() can only access additional leaf pages to consider
deleting them by following the right sibling link of the
btvacuumpage() caller's leaf page (or possibly some page even further
to the right if it ends up deleting many leaf pages in one go).
Following right links and finding a deleted page can only happen when
there is a concurrent page deletion by VACUUM, since the sibling links
to the deleted page are changed by the second stage of deletion (i.e.
by the atomic actionat where the page is actually marked deleted,
within _bt_unlink_halfdead_page()).

* There cannot be a concurrent VACUUM, though, since _bt_pagedel()
runs in VACUUM. (Note that we already depend on this for correctness
in _bt_unlink_halfdead_page(), which has a comment that says "a
half-dead page cannot resurrect because there can be only one vacuum
process running at a time".)

The strict position seems justified. It really isn't that strict. I'm
not concerned about the new LOG message concerning corruption being
annoying or wrong

-- 
Peter Geoghegan



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> For the first fix it seems better to push down the logic to the page
> deletion code as your 0001 patch does so. The following change changes
> the page deletion code so that it emits LOG message indicating the
> index corruption when a deleted page is passed. But we used to ignore
> in this case.

Attached is v2 of the patch set, which includes a new patch that I
want to target the master branch with. This patch (v2-0003-*)
refactors btvacuumscan().

This revision also changed the first bugfix patch. I have simplified
some of the details in nbtree.c that were added by commit 857f9c36cda.
Can't we call _bt_update_meta_cleanup_info() at a lower level, like in
the patch? AFAICT it makes more sense to just call it in
btvacuumscan(). Please let me know what you think of those changes.

The big change in the new master-only refactoring patch (v2-0003-*) is
that I now treat a call to btvacuumpage() that has to
"backtrack/recurse" and doesn't find a page that looks like the
expected right half of a page split (a page split that occurred since
the VACUUM operation began) as indicating corruption. This corruption
is logged. I believe that we should never see this happen, for reasons
that are similar to the reasons why _bt_pagedel() never finds a
deleted page when moving right, as I went into in my recent e-mail to
this thread. I think that this is worth tightening up for Postgres 13.

I will hold off on committing v2-0003-*, since I need to nail down the
reasoning for treating this condition as corruption. Plus it's not
urgent. I think that the general direction taken in v2-0003-* is the
right one, in any case. The "recursion" in btvacuumpage() doesn't make
much sense -- it obscures what's really going on IMV. Also, using the
variable name "scanblkno" at every level -- btvacuumscan(),
btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same
block at all levels of the code. And that the "backtrack/recursion"
case doesn't kill tuples from the "scanblkno" block (and doesn't
attempt to call _bt_pagedel(), which is designed only to handle blocks
passed by btvacuumpage() when they are the current "scanblkno"). It
seems unlikely that the VACUUM VERBOSE pages deleted accounting bug
would have happened if the code was already structured this way.

--
Peter Geoghegan

Вложения

Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Masahiko Sawada
Дата:
On Wed, 29 Apr 2020 at 01:17, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > I agree with both patches.
>
> Thanks for the review.
>
> > For the first fix it seems better to push down the logic to the page
> > deletion code as your 0001 patch does so. The following change changes
> > the page deletion code so that it emits LOG message indicating the
> > index corruption when a deleted page is passed. But we used to ignore
> > in this case.
>
> > I agree with this change but since I'm not sure this change directly
> > relates to this bug I wonder if it can be a separate patch.
>
> I am not surprised that you are concerned about this. That was the
> hardest decision I had to make when writing the patch.
>
> The central idea in the 0001-* patch (as I say at the end of the
> commit message) is that we have a strict rule about where we handle
> maintaining the oldest btpo.xact for already-deleted pages, and where
> we handle it for pages that become deleted. The rules is this:
> btvacuumpage() is strictly responsible for the former category (as
> well as totally live pages), while _bt_pagedel() is responsible for
> the latter category (this includes pages that started out as half-dead
> but become deleted).
>
> In order for this to work, _bt_pagedel() must not ever see a deleted
> page that it did not delete itself. If I didn't explicitly take a
> strict position on this, then I suppose that I'd have to have similar
> handling for maintaining the oldest btpo.xact for existing deleted
> pages encountered within _bt_pagedel(). But that doesn't make any
> sense, even with corruption, so I took a strict position. Even with
> corruption, how could we encounter an *existing* deleted page in
> _bt_pagedel() but not encounter the same page within some call to
> btvacuumpage() made from btvacuumscan()? In other words, how can we
> fail to maintain the oldest btpo.xact for deleted pages even if we
> assume that the index has a corrupt page with sibling links that point
> to a fully deleted page? It seems impossible, even in this extreme,
> contrived scenario.
>
> Then there is the separate question of the new log message about
> corruption. It's not too hard to see why _bt_pagedel() cannot ever
> access an existing deleted page unless there is corruption:
>
> * _bt_pagedel()'s only caller (the btvacuumpage() caller) will simply
> never pass a deleted page -- it handles those directly.
>
> * _bt_pagedel() can only access additional leaf pages to consider
> deleting them by following the right sibling link of the
> btvacuumpage() caller's leaf page (or possibly some page even further
> to the right if it ends up deleting many leaf pages in one go).
> Following right links and finding a deleted page can only happen when
> there is a concurrent page deletion by VACUUM, since the sibling links
> to the deleted page are changed by the second stage of deletion (i.e.
> by the atomic actionat where the page is actually marked deleted,
> within _bt_unlink_halfdead_page()).
>
> * There cannot be a concurrent VACUUM, though, since _bt_pagedel()
> runs in VACUUM. (Note that we already depend on this for correctness
> in _bt_unlink_halfdead_page(), which has a comment that says "a
> half-dead page cannot resurrect because there can be only one vacuum
> process running at a time".)
>
> The strict position seems justified. It really isn't that strict. I'm
> not concerned about the new LOG message concerning corruption being
> annoying or wrong

Thank you for the explanation. This makes sense to me. We've ever been
able to regard it as an index corruption that an existing deleted page
is detected within _bt_pagedel(). But with changes required for fixing
this bug, the responsibilities will change so that _bt_pagedel() must
not see a deleted page. Therefore we can take a more strict position.
It's not that since this bug could lead an index corruption we will
start warning it.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Masahiko Sawada
Дата:
On Thu, 30 Apr 2020 at 07:29, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Apr 28, 2020 at 12:21 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > For the first fix it seems better to push down the logic to the page
> > deletion code as your 0001 patch does so. The following change changes
> > the page deletion code so that it emits LOG message indicating the
> > index corruption when a deleted page is passed. But we used to ignore
> > in this case.
>
> Attached is v2 of the patch set, which includes a new patch that I
> want to target the master branch with. This patch (v2-0003-*)
> refactors btvacuumscan().
>
> This revision also changed the first bugfix patch. I have simplified
> some of the details in nbtree.c that were added by commit 857f9c36cda.
> Can't we call _bt_update_meta_cleanup_info() at a lower level, like in
> the patch? AFAICT it makes more sense to just call it in
> btvacuumscan(). Please let me know what you think of those changes.

Yes, I agree.

>
> The big change in the new master-only refactoring patch (v2-0003-*) is
> that I now treat a call to btvacuumpage() that has to
> "backtrack/recurse" and doesn't find a page that looks like the
> expected right half of a page split (a page split that occurred since
> the VACUUM operation began) as indicating corruption. This corruption
> is logged. I believe that we should never see this happen, for reasons
> that are similar to the reasons why _bt_pagedel() never finds a
> deleted page when moving right, as I went into in my recent e-mail to
> this thread. I think that this is worth tightening up for Postgres 13.
>
> I will hold off on committing v2-0003-*, since I need to nail down the
> reasoning for treating this condition as corruption. Plus it's not
> urgent. I think that the general direction taken in v2-0003-* is the
> right one, in any case. The "recursion" in btvacuumpage() doesn't make
> much sense -- it obscures what's really going on IMV. Also, using the
> variable name "scanblkno" at every level -- btvacuumscan(),
> btvacuumpage(), and _bt_pagedel() -- makes it clear that it's the same
> block at all levels of the code. And that the "backtrack/recursion"
> case doesn't kill tuples from the "scanblkno" block (and doesn't
> attempt to call _bt_pagedel(), which is designed only to handle blocks
> passed by btvacuumpage() when they are the current "scanblkno"). It
> seems unlikely that the VACUUM VERBOSE pages deleted accounting bug
> would have happened if the code was already structured this way.

+1 for refactoring. I often get confused that btvacuumpage() takes two
block numbers (blkno and orig_blkno) in spite of the same value. Your
change also makes it clear.

For the part of treating that case as an index corruption I will need
some time to review because of lacking knowledge of btree indexes. So
I'll review it later.


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
On Wed, Apr 29, 2020 at 11:38 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> Thank you for the explanation. This makes sense to me.

Pushed both of the fixes.

Thanks!
-- 
Peter Geoghegan



Re: Fixes for two separate bugs in nbtree VACUUM's page deletion

От
Peter Geoghegan
Дата:
On Thu, Apr 30, 2020 at 12:20 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> For the part of treating that case as an index corruption I will need
> some time to review because of lacking knowledge of btree indexes. So
> I'll review it later.

I pushed the refactoring patch today. Thanks for the review.

The final test for corruption that I added to btvacuumscan() is less
aggressive than what you saw in the patch I posted. We only report
corruption when backtracking/recursing if the page is "new", not a
leaf page, or is half-dead. We don't treat a fully deleted page as
corruption, because there is a case where the same call to
btvacuumscan() may have called _bt_pagedel() already, which may have
deleted the block that we backtrack/recurse to. The "sibling links
cannot point to a deleted page without concurrent deletion, and we
know that can't happen because we are VACUUM" stuff doesn't really
apply -- we remember which block we will recurse to *before* we
actually call _bt_pagedel().

-- 
Peter Geoghegan