Thread: xid wraparound danger due to INDEX_CLEANUP false


xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
Author: Robert Haas <rhaas@postgresql.org>
Date:   2019-04-04 14:58:53 -0400

    Allow VACUUM to be run with index cleanup disabled.

    This commit adds a new reloption, vacuum_index_cleanup, which
    controls whether index cleanup is performed for a particular
    relation by default.  It also adds a new option to the VACUUM
    command, INDEX_CLEANUP, which can be used to override the
    reloption.  If neither the reloption nor the VACUUM option is
    used, the default is true, as before.

    Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
    Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
    The wording of the documentation is mostly due to me.

    Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com

made the index scan that is part of vacuum optional.  I'm afraid that it
is not safe to do so unconditionally. Until this commit indexes could
rely on at least the amvacuumcleanup callback being called once per
vacuum, which guaranteed that an index could ensure that there are no
too-old xids anywhere in the index.

But now that's not the case anymore:

    vacrelstats->useindex = (nindexes > 0 &&
                             params->index_cleanup == VACOPT_TERNARY_ENABLED);
...
    /* Do post-vacuum cleanup */
    if (vacrelstats->useindex)
        lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);

E.g. btree has xids both in the metapage contents, as well as using it
on normal index pages as part of page deletion.

The slightly older feature to avoid unnecessary scans during cleanup
protects against this issue by skipping the scan inside the index AM:

/*
 * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
 *            btbulkdelete() wasn't called.
 */
static bool
_bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
{
...
    else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
             TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                   RecentGlobalXmin))
    {
        /*
         * If oldest btpo.xact in the deleted pages is older than
         * RecentGlobalXmin, then at least one deleted page can be recycled.
         */
        result = true;
    }

which will afaict result in all such xids getting removed (or at least
give the AM the choice to do so).


It's possible that something protects against dangers in the case of
INDEX_CLEANUP false, or that the consequences aren't too bad. But I
didn't see any comments about the dangers in the patch.

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Robert Haas

On Wed, Apr 15, 2020 at 7:38 PM Andres Freund <andres@anarazel.de> wrote:
> It's possible that something protects against dangers in the case of
> INDEX_CLEANUP false, or that the consequences aren't too bad. But I
> didn't see any comments about the dangers in the patch.

I seem to recall Simon raising this issue at the time that the patch
was being discussed, and I thought that we had eventually decided that
it was OK for some reason. But I don't remember the details, and it is
possible that we got it wrong. :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I seem to recall Simon raising this issue at the time that the patch
> was being discussed, and I thought that we had eventually decided that
> it was OK for some reason. But I don't remember the details, and it is
> possible that we got it wrong. :-(

It must be unreliable because it's based on something that is known to
be unreliable:


https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Also, the commit message of 6655a729 says that nbtree has had this
problem "since time immemorial". I am planning to work on that
problem, eventually.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Masahiko Sawada

On Thu, Apr 16, 2020 at 8:38 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2019-04-04 14:58:53 -0400
>
>     Allow VACUUM to be run with index cleanup disabled.
>
>     This commit adds a new reloption, vacuum_index_cleanup, which
>     controls whether index cleanup is performed for a particular
>     relation by default.  It also adds a new option to the VACUUM
>     command, INDEX_CLEANUP, which can be used to override the
>     reloption.  If neither the reloption nor the VACUUM option is
>     used, the default is true, as before.
>
>     Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
>     Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
>     The wording of the documentation is mostly due to me.
>
>     Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com
>
> made the index scan that is part of vacuum optional.  I'm afraid that it
> is not safe to do so unconditionally. Until this commit indexes could
> rely on at least the amvacuumcleanup callback being called once per
> vacuum, which guaranteed that an index could ensure that there are no
> too-old xids anywhere in the index.
>
> But now that's not the case anymore:
>
>         vacrelstats->useindex = (nindexes > 0 &&
>                                                          params->index_cleanup == VACOPT_TERNARY_ENABLED);
> ...
>         /* Do post-vacuum cleanup */
>         if (vacrelstats->useindex)
>                 lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);
>
> E.g. btree has xids both in the metapage contents, as well as using it
> on normal index pages as part of page deletion.
>
> The slightly older feature to avoid unnecessary scans during cleanup
> protects against this issue by skipping the scan inside the index AM:
>
> /*
>  * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
>  *                      btbulkdelete() wasn't called.
>  */
> static bool
> _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
> {
> ...
>         else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
>                          TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
>                                                                    RecentGlobalXmin))
>         {
>                 /*
>                  * If oldest btpo.xact in the deleted pages is older than
>                  * RecentGlobalXmin, then at least one deleted page can be recycled.
>                  */
>                 result = true;
>         }
>
> which will afaict result in all such xids getting removed (or at least
> give the AM the choice to do so).

For btree indexes, IIRC skipping index cleanup cannot be a cause of
corruption, but it can be a cause of index bloat, since it leaves
recyclable pages which are not marked as recyclable. The index bloat is
the main side effect of skipping index cleanup. When a user executes
VACUUM with INDEX_CLEANUP to reclaim index garbage, will such pages also
be recycled sooner or later? Or can skipping index cleanup cause a
recyclable page to never be recycled?

Regards,

--
Masahiko Sawada  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-16 16:30:02 +0900, Masahiko Sawada wrote:
> For btree indexes, IIRC skipping index cleanup cannot be a cause of
> corruption, but it can be a cause of index bloat, since it leaves
> recyclable pages which are not marked as recyclable. The index bloat is
> the main side effect of skipping index cleanup. When a user executes
> VACUUM with INDEX_CLEANUP to reclaim index garbage, will such pages also
> be recycled sooner or later? Or can skipping index cleanup cause a
> recyclable page to never be recycled?

Well, it depends on what you define as "never". Once the xids on the
pages have wrapped around, the page level xids will appear to be from
the future for a long time. And the metapage xid appearing to be from
the future will prevent some vacuums from actually doing the scan too,
even if INDEX_CLEANUP is reenabled.  So a VACUUM, even with
INDEX_CLEANUP on, will not be able to recycle those pages anymore.  At
some point the wrapped around xids will be "current" again, if there's
enough new xids.
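
Just to make that failure mode concrete, here is a tiny standalone sketch
(not PostgreSQL code, and ignoring the special-casing of permanent xids)
of the circular comparison that TransactionIdPrecedes() effectively does
for normal xids. Once roughly 2^31 xids have been consumed after a page
was deleted, the xid stamped on the deleted page stops comparing as "in
the past", so neither btvacuumpage() nor _bt_getbuf() will treat it as
recyclable:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

static int
xid_precedes(TransactionId id1, TransactionId id2)
{
    /* same idea as transam.c: compare modulo 2^32 via signed difference */
    int32_t     diff = (int32_t) (id1 - id2);

    return diff < 0;
}

int
main(void)
{
    TransactionId btpo_xact = 1000;     /* xid stamped on a deleted page */
    TransactionId cur;

    cur = btpo_xact + 10;               /* shortly after the deletion */
    printf("soon after deletion: precedes = %d\n",
           xid_precedes(btpo_xact, cur));   /* 1: page looks old enough */

    cur = btpo_xact + (1U << 31) + 10;  /* ~2^31 xids later */
    printf("after wraparound:    precedes = %d\n",
           xid_precedes(btpo_xact, cur));   /* 0: page looks "from the future" */
    return 0;
}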


It's not ok for vacuumlazy.c to make decisions like this. I think the
INDEX_CLEANUP logic clearly needs to be pushed down into the
amvacuumcleanup callbacks, and it needs to be left to the index AMs to
decide what the correct behaviour is.


You can't just change things like this without reviewing the
consequences to AMs and documenting them?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-15 18:11:40 -0700, Peter Geoghegan wrote:
> On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I seem to recall Simon raising this issue at the time that the patch
> > was being discussed, and I thought that we had eventually decided that
> > it was OK for some reason. But I don't remember the details, and it is
> > possible that we got it wrong. :-(
> 
> It must be unreliable because it's based on something that is known to
> be unreliable:
> 
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Sure, there is some pre-existing wraparound danger for individual
pages. But it's a pretty narrow corner case before INDEX_CLEANUP
off.

That comment says something about "shared-memory free space map", making
it sound like any crash would lose the page. But it's a normal FSM
these days. Vacuum will insert the deleted page into the free space
map. So either the FSM would need to be corrupted to not find the
inserted page anymore, or the index would need to grow slowly enough to
not use a page before the wraparound.  And then such wrapped around xids
would exist on individual pages. Not on all deleted pages, like with
INDEX_CLEANUP false.

And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
INDEX_CLEANUP on might not even visit the index. As there very well
might not be many dead heap tuples around anymore (previous vacuums with
cleanup off will have removed them), the
vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
contrast to the normal situations where the btm_oldest_btpo_xact check
will prevent that from becoming a problem.


Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
be able to recycle half-dead pages? And thus would effectively never
recycle any dead space? Is that correct?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.

It's a matter of degree. Hard to judge something like that.

> And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> INDEX_CLEANUP on might not even visit the index. As there very well
> might not be many dead heap tuples around anymore (previous vacuums with
> cleanup off will have removed them), the
> vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> contrast to the normal situations where the btm_oldest_btpo_xact check
> will prevent that from becoming a problem.

I guess that they should visit the metapage to see if they need to do
that much. That would allow us to fix the problem while mostly
honoring INDEX_CLEANUP off, I think.

> Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
> be able to recycle half-dead pages? And thus would effectively never
> recycle any dead space? Is that correct?

I agree. The fact that btm_oldest_btpo_xact is an all-or-nothing thing
(with wraparound hazards) is bad in itself, and introduced new risk to
v11 compared to previous versions (without the INDEX_CLEANUP = off
feature entering into it).  The simple fact that we don't even check
it with INDEX_CLEANUP = off is a bigger problem, though, and one that
now seems unrelated.

BTW, a lot of people get confused about what half-dead pages are. I
would like to make something clear that may not be obvious: While it's
bad that the implementation leaks pages that should go in the FSM,
it's not the end of the world. They should get evicted from
shared_buffers pretty quickly if there is any pressure, and impose no
real cost on index scans.

There are (roughly) 3 types of pages that we're concerned about here
in the common case where we're just deleting a leaf page:

* A half-dead page -- no downlink in its parent, marked dead.

* A deleted page -- now no sidelinks, either. Not initially safe to recycle.

* A deleted page in the FSM -- this is what we have the interlock for.

Half-dead pages are pretty rare, because VACUUM really has to have a
hard crash for that to happen (that might not be 100% true, but it's
at least 99% true). That's always been the case, and we don't really
need to talk about them here at all. We're just concerned with deleted
pages in the context of this discussion (and whether or not they can
be recycled without confusing in-flight index scans). These are the
only pages that are marked with an XID at all.

Another thing that's worth pointing out is that this whole
RecentGlobalXmin business is how we opted to implement what Lanin &
Sasha call "the drain technique". It is rather different to the usual
ways in which we use RecentGlobalXmin. We're only using it as a proxy
(an absurdly conservative proxy) for whether or not there might be an
in-flight index scan that lands on a concurrently recycled index page
and gets completely confused. So it is purely about the integrity of
the data structure itself. It is a consequence of doing so little
locking when descending the tree -- our index scans don't need to
couple buffer locks on the way down the tree at all. So we make VACUUM
worry about that, rather than making index scans worry about VACUUM
(though the latter design is a reasonable and common one).

There is absolutely no reason why we have to delay recycling for very
long, even in cases with long running transactions or whatever. I
agree that it's just an accident that it works that way. VACUUM could
probably remember deleted pages, and then revisit those pages at the
end of the index vacuuming -- that might make a big difference in a
lot of workloads. Or it could chain them together as a linked list
which can be accessed much more eagerly in some cases.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > INDEX_CLEANUP on might not even visit the index. As there very well
> > might not be many dead heap tuples around anymore (previous vacuums with
> > cleanup off will have removed them), the
> > vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> > contrast to the normal situations where the btm_oldest_btpo_xact check
> > will prevent that from becoming a problem.
> 
> I guess that they should visit the metapage to see if they need to do
> that much. That would allow us to fix the problem while mostly
> honoring INDEX_CLEANUP off, I think.

Yea. _bt_vacuum_needs_cleanup() needs to check if
metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
false.


> BTW, a lot of people get confused about what half-dead pages are. I
> would like to make something clear that may not be obvious: While it's
> bad that the implementation leaks pages that should go in the FSM,
> it's not the end of the world. They should get evicted from
> shared_buffers pretty quickly if there is any pressure, and impose no
> real cost on index scans.

Yea, half-dead pages aren't the main problem. It's pages that contain
only dead tuples, but aren't unlinked from the tree. Without a vacuum
scan we'll never reuse them - even if we know they're all dead.

Note that the page being in the FSM is not protection against wraparound
:(. We recheck whether a page is recyclable when getting it from the FSM
(probably required, due to the FSM not being crashsafe). It's of course
much less likely to happen at that stage, because the pages can get
reused.

I think we should really just stop being miserly and update the xid to be
FrozenTransactionId or InvalidTransactionId when vacuum encounters one
that's from before the xid cutoff used by vacuum (i.e. what could
become the new relfrozenxid).  That seems like it'd be a few lines, not
more.
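
Something along these lines, perhaps (hedged sketch only, not a tested
patch; "freeze_limit" stands for the FreezeLimit that
vacuum_set_xid_limits() computes, which isn't currently passed down to
btvacuumpage()):

    /* in btvacuumpage(), while the deleted page is still locked */
    if (P_ISDELETED(opaque) &&
        TransactionIdIsValid(opaque->btpo.xact) &&
        TransactionIdPrecedes(opaque->btpo.xact, freeze_limit))
    {
        /*
         * The deletion is older than anything the new relfrozenxid could
         * be, so remember it as "frozen" rather than leaving behind a raw
         * xid that could eventually wrap around.
         */
        opaque->btpo.xact = FrozenTransactionId;
        MarkBufferDirty(buf);
        /* a real patch would presumably need to WAL-log this too */
    }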


> Another thing that's worth pointing out is that this whole
> RecentGlobalXmin business is how we opted to implement what Lanin &
> Sasha call "the drain technique". It is rather different to the usual
> ways in which we use RecentGlobalXmin. We're only using it as a proxy
> (an absurdly conservative proxy) for whether or not there might be an
> in-flight index scan that lands on a concurrently recycled index page
> and gets completely confused. So it is purely about the integrity of
> the data structure itself. It is a consequence of doing so little
> locking when descending the tree -- our index scans don't need to
> couple buffer locks on the way down the tree at all. So we make VACUUM
> worry about that, rather than making index scans worry about VACUUM
> (though the latter design is a reasonable and common one).
>
> There is absolutely no reason why we have to delay recycling for very
> long, even in cases with long running transactions or whatever. I
> agree that it's just an accident that it works that way. VACUUM could
> probably remember deleted pages, and then revisit those pages at the
> end of the index vacuuming -- that might make a big difference in a
> lot of workloads. Or it could chain them together as a linked list
> which can be accessed much more eagerly in some cases.

I think it doesn't really help meaningfully for vacuum to be a bit
smarter about when to recognize pages as being recyclable. IMO the big
issue is that vacuum won't be very frequent, so we'll grow the index
until that time, even if there's many "effectively empty" pages.

I.e. even if the killtuples logic allows us to recognize that all actual
index tuples are fully dead, we'll not benefit from that unless there's
a new insertion that belongs onto the "empty" page. That's fine for
indexes that are updated roughly evenly across the value range, but
terrible for indexes that "grow" mostly on one side, and "shrink" on the
other.

I'd bet it'd be beneficial if we were to either have scans unlink such
pages directly, or if they just entered the page into the FSM and have
_bt_getbuf() do the unlinking.  I'm not sure if the current locking
model assumes anywhere that there is only one process (vacuum) unlinking
pages though?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Thu, Apr 16, 2020 at 3:49 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should really just stop being miserly and update the xid to be
> FrozenTransactionId or InvalidTransactionId when vacuum encounters one
> that's from before the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid).  That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisit those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not the
problems with the Postgres 11 and 12 features in this area -- you're talking
about what happens to empty pages, not what happens to deleted pages.)

I'll say some more things about the less ambitious goal of more eager
recycling of pages, as they are deleted:

An individual VACUUM operation cannot recycle a page right after
_bt_pagedel() is called to delete the page. VACUUM will both set a
target leaf page half dead and delete it all at once in _bt_pagedel()
(it does that much in the simple and common case). Again, this is
because recycling very soon after the call to _bt_pagedel() will
introduce races with concurrent index scans -- they could fail to
observe the deletion in the parent (i.e. see its downlink, since child
isn't even half dead), and land on a concurrently recycled page
(VACUUM concurrently marks the page half-dead, fully dead/deleted, and
then even goes as far as recycling it). So the current design makes a
certain amount of sense -- we can't be super aggressive like that.
(Actually, maybe it doesn't make sense to not just put the page in the
FSM there and then -- see "Thinks some more" below.)

Even still, nothing stops the same VACUUM operation from (for example)
remembering a list of pages it has deleted during the current scan,
and then coming back at the end of the bulk scan of the index to
reconsider if it can recycle the pages now (2 minutes later instead of
2 months later). With a new RecentGlobalXmin (or something that's
conceptually like a new RecentGlobalXmin).

Similarly, we could do limited VACUUMs that only visit previously
deleted pages, once VACUUM is taught to chain deleted pages together
to optimize recycling. We don't have to repeat another pass over the
entire index to recycle the pages because of this special deleted page
linking. This is something that we use when we have to recycle pages,
but it's an "INDEX_CLEANUP = off" index VACUUM -- we don't really want
to do most of the stuff that index vacuuming needs to do, but we must
still visit the metapage to check btm_oldest_btpo_xact, and then maybe
walk the deleted page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has
its own _bt_page_recyclable() check, probably because the FSM isn't
crash safe. Maybe we could improve matters by teaching _bt_pagedel()
to put a page it deleted in the FSM immediately -- no need to wait
until the next index VACUUM for the RecordFreeIndexPage() call. The fact
that _bt_getbuf() distrusts the FSM still isn't quite enough on its own,
so we'd also have to teach _bt_getbuf() some heuristics that make it
understand that VACUUM is now designed to put stuff in the FSM
immediately, so we don't have to wait for the next VACUUM operation to
get to it. Maybe _bt_getbuf() should try the FSM a few times before
giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it
would make those pages reclaimable much sooner. Also, it wouldn't
solve the wraparound problem, but that is a bug, not a
performance/efficiency issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space
utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it'd be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and have
> _bt_getbuf() do the unlinking.

That won't work, since you're now talking about pages that aren't
deleted (or even half-dead) that are just candidates to be deleted
because they're empty. So you'd have to do all the steps in
_bt_pagedel() within a new _bt_getbuf() path, which would have many
deadlock hazards. Unlinking the page from the tree itself (deleting)
is really complicated.

> I'm not sure if the current locking
> model assumes anywhere that there is only one process (vacuum) unlinking
> pages though?

I'm not sure, though _bt_unlink_halfdead_page() has comments supposing
that there could be concurrent page deletions like that.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Masahiko Sawada

On Fri, 17 Apr 2020 at 02:58, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-16 16:30:02 +0900, Masahiko Sawada wrote:
> > For btree indexes, IIRC skipping index cleanup cannot be a cause of
> > corruption, but it can be a cause of index bloat, since it leaves
> > recyclable pages which are not marked as recyclable. The index bloat is
> > the main side effect of skipping index cleanup. When a user executes
> > VACUUM with INDEX_CLEANUP to reclaim index garbage, will such pages also
> > be recycled sooner or later? Or can skipping index cleanup cause a
> > recyclable page to never be recycled?
>
> Well, it depends on what you define as "never". Once the xids on the
> pages have wrapped around, the page level xids will appear to be from
> the future for a long time. And the metapage xid appearing to be from
> the future will prevent some vacuums from actually doing the scan too,
> even if INDEX_CLEANUP is reenabled.  So a VACUUM, even with
> INDEX_CLEANUP on, will not be able to recycle those pages anymore.  At
> some point the wrapped around xids will be "current" again, if there's
> enough new xids.
>
>
> It's not ok for vacuumlazy.c to make decisions like this. I think the
> INDEX_CLEANUP logic clearly needs to be pushed down into the
> amvacuumcleanup callbacks, and it needs to be left to the index AMs to
> decide what the correct behaviour is.

I wanted to clarify the impact of this bug. I agree with you.

On Fri, 17 Apr 2020 at 07:49, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > > INDEX_CLEANUP on might not even visit the index. As there very well
> > > might not be many dead heap tuples around anymore (previous vacuums with
> > > cleanup off will have removed them), the
> > > vacuum_cleanup_index_scale_factor logic may prevent index vacuums.

I think this doesn't happen because, in the INDEX_CLEANUP off case,
vacuum marks linepointers of dead tuples as dead but leaves them.
Therefore future VACUUMs with INDEX_CLEANUP on will see these dead
linepointers and invoke ambulkdelete.

> > I guess that they should visit the metapage to see if they need to do
> > that much. That would allow us to fix the problem while mostly
> > honoring INDEX_CLEANUP off, I think.
>
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

Agreed. So _bt_vacuum_needs_cleanup() would become something like the following?

if (metad->btm_version < BTREE_NOVAC_VERSION)
    result = true;
else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
         TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                               FreezeLimit))
    result = true;
else if (index_cleanup_disabled)
    result = false;
else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
         TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                               RecentGlobalXmin))
    result = true;
else
   result = determine based on vacuum_cleanup_index_scale_factor;

Or perhaps we can change _bt_vacuum_needs_cleanup() so that it does
index cleanup if metad->btm_oldest_btpo_xact is older than the
FreezeLimit *and* it's an aggressive vacuum.

Anyway, a problem with changing IndexVacuumInfo to tell the AM that the
INDEX_CLEANUP option is disabled, and to pass down the FreezeLimit, is
that it would break compatibility.

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Robert Haas

On Thu, Apr 16, 2020 at 6:49 PM Andres Freund <andres@anarazel.de> wrote:
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

I'm still fairly unclear on what the actual problem is here, and on
how we propose to fix it. It seems to me that we probably don't have a
problem in the case where we don't advance relfrozenxid or relminmxid,
because in that case there's not much difference between the behavior
created by this patch and a case where we just error out due to an
interrupt or something before reaching the index cleanup stage. I
think that the problem is that in the case where we do advance relfrozenxid,
we might advance it past some XID value stored in the index metadata.
Is that right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Thu, Apr 16, 2020 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> For btree indexes, IIRC skipping index cleanup cannot be a cause of
> corruption, but it can be a cause of index bloat, since it leaves
> recyclable pages which are not marked as recyclable.

I spotted a bug in "Skip full index scan during cleanup of B-tree
indexes when possible" which is unrelated to the index cleanup issue.

This code is wrong, because we don't have a buffer lock (or a buffer
pin) on the buffer anymore:

        ndel = _bt_pagedel(rel, buf);

        /* count only this page, else may double-count parent */
        if (ndel)
        {
            stats->pages_deleted++;
            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
                TransactionIdPrecedes(opaque->btpo.xact,
                                      vstate->oldestBtpoXact))
                vstate->oldestBtpoXact = opaque->btpo.xact;
        }

        MemoryContextSwitchTo(oldcontext);
        /* pagedel released buffer, so we shouldn't */

(As the comment says, _bt_pagedel() releases it.)

There is another, more fundamental issue, though: _bt_pagedel() can
delete more than one page. That might be okay if the "extra" pages
were always internal pages, but they're not -- it says so in the
comments above _bt_pagedel(). See the code at the end of
_bt_pagedel(), that says something about why we delete the right
sibling page in some cases.

I think that the fix is to push down the vstate into lower level code
in nbtpage.c. Want to have a go at fixing it?
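
(For what it's worth, one possible shape of that -- just a sketch of the
direction, not the actual fix -- would be to have the page deletion code
maintain the oldest btpo.xact itself, while it still has the buffer, e.g.
by passing a pointer down:

    ndel = _bt_pagedel(rel, buf, &vstate->oldestBtpoXact);

and then, inside nbtpage.c, for every page that actually gets marked
deleted:

    if (!TransactionIdIsValid(*oldestBtpoXact) ||
        TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
        *oldestBtpoXact = opaque->btpo.xact;

That way btvacuumpage() never touches the special area after the buffer
has been released, and every page deleted by the call -- not just the
leaf -- is accounted for.)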

(It would be nice if we could teach Valgrind to "poison" buffers when
we don't have a pin held...that would probably have caught this issue
almost immediately.)

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 22, 2020 at 6:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> (It would be nice if we could teach Valgrind to "poison" buffers when
> we don't have a pin held...that would probably have caught this issue
> almost immediately.)

I can get Valgrind to complain about it when the regression tests are
run with the attached patch applied. I see this in the logs at several
points when "make installcheck" runs:

==1082059== VALGRINDERROR-BEGIN
==1082059== Invalid read of size 4
==1082059==    at 0x21D8DE: btvacuumpage (nbtree.c:1370)
==1082059==    by 0x21DA61: btvacuumscan (nbtree.c:1039)
==1082059==    by 0x21DBD5: btbulkdelete (nbtree.c:879)
==1082059==    by 0x215821: index_bulk_delete (indexam.c:698)
==1082059==    by 0x20FDCE: lazy_vacuum_index (vacuumlazy.c:2427)
==1082059==    by 0x2103EA: lazy_vacuum_all_indexes (vacuumlazy.c:1794)
==1082059==    by 0x211EA1: lazy_scan_heap (vacuumlazy.c:1681)
==1082059==    by 0x211EA1: heap_vacuum_rel (vacuumlazy.c:510)
==1082059==    by 0x360414: table_relation_vacuum (tableam.h:1457)
==1082059==    by 0x360414: vacuum_rel (vacuum.c:1880)
==1082059==    by 0x361785: vacuum (vacuum.c:449)
==1082059==    by 0x361F0E: ExecVacuum (vacuum.c:249)
==1082059==    by 0x4D979C: standard_ProcessUtility (utility.c:823)
==1082059==    by 0x4D9C7F: ProcessUtility (utility.c:522)
==1082059==    by 0x4D6791: PortalRunUtility (pquery.c:1157)
==1082059==    by 0x4D725F: PortalRunMulti (pquery.c:1303)
==1082059==    by 0x4D7CEF: PortalRun (pquery.c:779)
==1082059==    by 0x4D3BB7: exec_simple_query (postgres.c:1239)
==1082059==    by 0x4D4ABD: PostgresMain (postgres.c:4315)
==1082059==    by 0x45B0C9: BackendRun (postmaster.c:4510)
==1082059==    by 0x45B0C9: BackendStartup (postmaster.c:4202)
==1082059==    by 0x45B0C9: ServerLoop (postmaster.c:1727)
==1082059==    by 0x45C754: PostmasterMain (postmaster.c:1400)
==1082059==    by 0x3BDD68: main (main.c:210)
==1082059==  Address 0x6cc7378 is in a rw- anonymous segment
==1082059==
==1082059== VALGRINDERROR-END

(The line numbers might be slightly different to master here, but the
line from btvacuumpage() is definitely the one that accesses the
special area of the B-Tree page after we drop the pin.)

This patch is very rough -- it was just the first thing that I tried.
I don't know how Valgrind remembers the status of shared memory
regions across backends when they're marked with
VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
should try to come up with a committable patch before too long.

The good news is that the error I showed is the only error that I see,
at least with this rough patch + "make installcheck". It's possible
that the patch isn't as effective as it could be, though. For one
thing, it definitely won't detect incorrect buffer accesses where a
pin is held but a buffer lock is not held. That seems possible, but a
bit harder.
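
For anyone following along: the release side of the idea is to make the
page inaccessible again once this backend drops its last pin on it, so
that any later access without a pin trips Valgrind. A sketch of what that
could look like in UnpinBuffer(), right after the local refcount is
decremented ("ref" and "b" being the local variables that function
already uses in bufmgr.c) -- the attached patch may well do it
differently:

#ifdef USE_VALGRIND
    if (ref->refcount == 0)
        VALGRIND_MAKE_MEM_NOACCESS(BufferGetPage(b), BLCKSZ);
#endif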

-- 
Peter Geoghegan

Attachments

Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote:
> I can get Valgrind to complain about it when the regression tests are
> run with the attached patch applied.

Nice!  Have you checked how much of an incremental slowdown this causes?


> This patch is very rough -- it was just the first thing that I tried.
> I don't know how Valgrind remembers the status of shared memory
> regions across backends when they're marked with
> VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
> should try to come up with a committable patch before too long.

IIRC valgrind doesn't at all share access markings across processes.


> The good news is that the error I showed is the only error that I see,
> at least with this rough patch + "make installcheck". It's possible
> that the patch isn't as effective as it could be, though. For one
> thing, it definitely won't detect incorrect buffer accesses where a
> pin is held but a buffer lock is not held. That seems possible, but a
> bit harder.

Given hint bits it seems fairly hard to make that a reliable check.


> +#ifdef USE_VALGRIND
> +    if (!isLocalBuf)
> +    {
> +        Buffer b = BufferDescriptorGetBuffer(bufHdr);
> +        VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> +    }
> +#endif

Hm. It's a bit annoying that we have to mark the contents defined. It'd
be kinda useful to be able to mark unused parts of pages as undefined
initially. But there's afaictl no way to just set/unset addressability,
while not touching definedness. So this is probably the best we can do
without adding a lot of complexity.


>      if (isExtend)
>      {
>          /* new buffers are zero-filled */
> @@ -1039,6 +1047,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>          buf = GetBufferDescriptor(buf_id);
>  
>          valid = PinBuffer(buf, strategy);
> +#ifdef USE_VALGRIND
> +        {
> +            Buffer b = BufferDescriptorGetBuffer(buf);
> +            VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> +        }
> +#endif

Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at
their callsites?


> @@ -1633,6 +1653,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
>                                                 buf_state))
>              {
>                  result = (buf_state & BM_VALID) != 0;
> +#ifdef USE_VALGRIND
> +                {
> +                    Buffer b = BufferDescriptorGetBuffer(buf);
> +                    VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> +                }
> +#endif
>                  break;
>              }
>          }

Oh, but we actually are doing it in PinBuffer() too?


>          /*
>           * Decrement the shared reference count.
> @@ -2007,6 +2039,12 @@ BufferSync(int flags)
>           */
>          if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
>          {
> +#ifdef USE_VALGRIND
> +            {
> +                Buffer b = BufferDescriptorGetBuffer(bufHdr);
> +                VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> +            }
> +#endif
>              if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
>              {
>                  TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);

Shouldn't the pin we finally acquire in SyncOneBuffer() be sufficient?


> @@ -2730,6 +2768,12 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
>       * Run PageGetLSN while holding header lock, since we don't have the
>       * buffer locked exclusively in all cases.
>       */
> +#ifdef USE_VALGRIND
> +    {
> +        Buffer b = BufferDescriptorGetBuffer(buf);
> +        VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ);
> +    }
> +#endif
>      recptr = BufferGetLSN(buf);

This shouldn't be needed, as the caller ought to hold a pin:
 *
 * The caller must hold a pin on the buffer and have share-locked the
 * buffer contents.  (Note: a share-lock does not prevent updates of
 * hint bits in the buffer, so the page could change while the write
 * is in progress, but we assume that that will not invalidate the data
 * written.)
 *

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 22, 2020 at 8:33 PM Andres Freund <andres@anarazel.de> wrote:
> On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote:
> > I can get Valgrind to complain about it when the regression tests are
> > run with the attached patch applied.
>
> Nice!  Have you checked how much of an incremental slowdown this causes?

No, but I didn't notice much of a slowdown.

> > This patch is very rough -- it was just the first thing that I tried.
> > I don't know how Valgrind remembers the status of shared memory
> > regions across backends when they're marked with
> > VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I
> > should try to come up with a committable patch before too long.
>
> IIRC valgrind doesn't at all share access markings across processes.

I didn't think so.

> > The good news is that the error I showed is the only error that I see,
> > at least with this rough patch + "make installcheck". It's possible
> > that the patch isn't as effective as it could be, though. For one
> > thing, it definitely won't detect incorrect buffer accesses where a
> > pin is held but a buffer lock is not held. That seems possible, but a
> > bit harder.
>
> Given hint bits it seems fairly hard to make that a reliable check.

I don't follow. It doesn't have to be a perfect check. Detecting if
there is *any* buffer lock held at all would be a big improvement.

> Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at
> their callsites?

I wrote this patch in a completely careless manner in less than 10
minutes, just to see how hard it was (I thought that it might have
been much harder). I wasn't expecting you to review it. I thought that
I was clear about that.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 22, 2020 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Given hint bits it seems fairly hard to make that a reliable check.
>
> I don't follow. It doesn't have to be a perfect check. Detecting if
> there is *any* buffer lock held at all would be a big improvement.

It is true that the assumptions that heapam makes about what a buffer
pin will prevent (that a pin will prevent any kind of page
defragmentation) are not really compatible with marking pages as
undefined in lower level code like bufmgr.c. There are too many
exceptions for it to work like that.

The case I was really thinking about was the nbtree
_bt_drop_lock_and_maybe_pin() stuff, which is very confusing. The
confusing structure of the
BTScanPosIsPinned()/_bt_drop_lock_and_maybe_pin() code more or less
caused the skip scan patch to have numerous bugs involving code
holding a buffer pin, but not a buffer lock, at least when I last
looked at it a couple of months ago. The only thing having a pin on a
leaf page guarantees is that the TIDs from tuples on the page won't be
concurrently recycled by VACUUM. This is a very weak guarantee -- in
particular, it's much weaker than the guarantees around buffer pins
that apply in heapam. It's certainly not going to prevent any kind of
defragmentation of the page -- the page can even split, for example.
Any code that relies on holding a pin to prevent anything more than
that is broken, but possibly only in a subtle way. It's not like page
splits happen all that frequently.

Given that I was concerned about a fairly specific situation, a
specific solution seems like it might be the best way to structure the
extra checks. The attached rough patch shows the kind of approach that
might be practical in specific index access methods. This works on top
of the patch I posted yesterday. The idea is to mark the buffer's page
as a noaccess region within _bt_drop_lock_and_maybe_pin(), and then
mark it defined again at either of the two points that we might have
to relock (but not repin) the buffer to re-read the page. This doesn't
cause any regression test failures, so maybe there are no bugs like
this currently, but it still seems like it might be worth pursuing on
top of the buffer pin stuff.

--
Peter Geoghegan

Attachments

Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.
>
> That comment says something about "shared-memory free space map", making
> it sound like any crash would lose the page. But it's a normal FSM
> these days. Vacuum will insert the deleted page into the free space
> map. So either the FSM would need to be corrupted to not find the
> inserted page anymore, or the index would need to grow slowly enough to
> not use a page before the wraparound.  And then such wrapped around xids
> would exist on individual pages. Not on all deleted pages, like with
> INDEX_CLEANUP false.

Is that really that narrow, even without "INDEX_CLEANUP false"? It's
not as if the index needs to grow very slowly to have only very few
page splits hour to hour (it depends on whether the inserts are random
or not, and so on). Especially if you had a bulk DELETE affecting many
rows, which is hardly that uncommon.

Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
btpo.xact) when it recycles deleted pages. It simply puts them in the
FSM without changing anything about the page itself. This means
surprisingly little in the context of nbtree: the
_bt_page_recyclable() XID check that takes place in btvacuumpage()
also takes place in _bt_getbuf(), at the point where the page actually
gets recycled by the client. That's not great.
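
For reference, the check that both callers rely on is roughly this
(paraphrasing the nbtree code as of v12):

    /* _bt_page_recyclable(): is it safe to reuse this deleted page? */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    if (P_ISDELETED(opaque) &&
        TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
        return true;
    return false;

so a btpo.xact that has wrapped around into the apparent future simply
never satisfies the TransactionIdPrecedes() test, in either place.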

It wouldn't be so unreasonable if btvacuumpage() actually did freeze
the bpto.xact value at the point where it puts the page in the FSM. It
doesn't need to be crash safe; it can work as a hint. Maybe "freezing"
is the wrong word (too much baggage). More like we'd have VACUUM
represent that "this deleted B-Tree page is definitely not considered
to still be a part of the tree by any possible other backend" using a
page flag hint -- btvacuumpage() would "mark the deleted page as
recyclable" explicitly. Note that we still need to keep the original
btpo.xact XID around for _bt_log_reuse_page() (also, do we need to
worry about _bt_log_reuse_page() with a wrapped-around XID?).

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-29 11:28:00 -0700, Peter Geoghegan wrote:
> On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
> > Sure, there is some pre-existing wraparound danger for individual
> > pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> > off.
> >
> > That comment says something about "shared-memory free space map", making
> > it sound like any crash would loose the page. But it's a normal FSM
> > these days. Vacuum will insert the deleted page into the free space
> > map. So either the FSM would need to be corrupted to not find the
> > inserted page anymore, or the index would need to grow slow enough to
> > not use a page before the wraparound.  And then such wrapped around xids
> > would exist on individual pages. Not on all deleted pages, like with
> > INDEX_CLEANUP false.
> 
> Is that really that narrow, even without "INDEX_CLEANUP false"? It's
> not as if the index needs to grow very slowly to have only very few
> page splits hour to hour (it depends on whether the inserts are random
> or not, and so on). Especially if you had a bulk DELETE affecting many
> rows, which is hardly that uncommon.

Well, you'd need to have a workload that has bulk deletes, high xid
usage *and* doesn't insert new data that would use those empty pages.


> Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
> btpo.xact) when it recycles deleted pages. It simply puts them in the
> FSM without changing anything about the page itself. This means
> surprisingly little in the context of nbtree: the
> _bt_page_recyclable() XID check that takes place in btvacuumpage()
> also takes place in _bt_getbuf(), at the point where the page actually
> gets recycled by the client. That's not great.

I think it's quite foolish for btvacuumpage() to not freeze xids. If we
only do so when necessary (i.e. older than a potential new relfrozenxid,
and only when the vacuum didn't yet skip pages), the costs are pretty
miniscule.


> It wouldn't be so unreasonable if btvacuumpage() actually did freeze
> the btpo.xact value at the point where it puts the page in the FSM. It
> doesn't need to be crash safe; it can work as a hint.

I'd much rather make sure the xid is guaranteed to be removed. As
outlined above, the cost would be small, and I think the likelihood of
the consequences of wrapped around xids getting worse over time is
substantial.


> Note that we still need to keep the original btpo.xact XID around for
> _bt_log_reuse_page() (also, do we need to worry about
> _bt_log_reuse_page() with a wrapped-around XID?).

I'd just WAL log the reuse when freezing the xid. Then there's no worry
about wraparounds. And I don't think it'd cause additional conflicts;
the vacuum itself (or a prior vacuum) would also have to cause them, I
think?

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
> > btpo.xact) when it recycles deleted pages. It simply puts them in the
> > FSM without changing anything about the page itself. This means
> > surprisingly little in the context of nbtree: the
> > _bt_page_recyclable() XID check that takes place in btvacuumpage()
> > also takes place in _bt_getbuf(), at the point where the page actually
> > gets recycled by the client. That's not great.
>
> I think it's quite foolish for btvacuumpage() to not freeze xids. If we
> only do so when necessary (i.e. older than a potential new relfrozenxid,
> and only when the vacuum didn't yet skip pages), the costs are pretty
> miniscule.

I wonder if we should just bite the bullet and mark pages that are
recycled by VACUUM as explicitly recycled, with WAL-logging and
everything (this is like freezing, but stronger). That way, the
_bt_page_recyclable() call within _bt_getbuf() would only be required
to check that state (while btvacuumpage() would use something like a
_bt_page_eligible_for_recycling(), which would do the same thing as
the current _bt_page_recyclable()).
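
Concretely, that could be as little as a new page flag plus a WAL record
(hypothetical sketch; BTP_RECYCLABLE does not exist, and the flag bit is
made up):

    #define BTP_RECYCLABLE  (1 << 8)    /* deleted page is safe to reuse */

    /* btvacuumpage(), once _bt_page_eligible_for_recycling() says yes: */
    opaque->btpo_flags |= BTP_RECYCLABLE;
    MarkBufferDirty(buf);
    /* ... plus a WAL record, so the marking survives a crash ... */

_bt_getbuf() would then trust only the flag, never the raw btpo.xact.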

I'm not sure how low the costs would be, but at least we'd only have
to do it once per already-deleted page (i.e. only the first time a
VACUUM operation found _bt_page_eligible_for_recycling() returned true
for the page and marked it recycled in a crash safe manner). That
design would be quite a lot simpler, because it expresses the problem
in terms that make sense to the nbtree code. _bt_getbuf() should not
have to make a distinction between "possibly recycled" versus
"definitely recycled".

It makes sense that the FSM is not crash safe, I suppose, but we're
talking about relatively large amounts of free space here. Can't we
just do it properly/reliably?

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 29, 2020 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm not sure how low the costs would be, but at least we'd only have
> to do it once per already-deleted page (i.e. only the first time a
> VACUUM operation found _bt_page_eligible_for_recycling() returned true
> for the page and marked it recycled in a crash safe manner). That
> design would be quite a lot simpler, because it expresses the problem
> in terms that make sense to the nbtree code. _bt_getbuf() should not
> have to make a distinction between "possibly recycled" versus
> "definitely recycled".

As a bonus, we now have an easy correctness cross-check: if some
random piece of nbtree code lands on a page (having followed a
downlink or sibling link) that is marked recycled, then clearly
something is very wrong -- throw a "can't happen" error.

This would be especially useful in places like _bt_readpage(), I
suppose. Think of extreme cases like cursors, which can have a scan
that remembers a block number of a leaf page, that only actually gets
accessed hours or days later (for whatever reason). If that code was
buggy in some way, we might have a hope of figuring it out at some
point with this cross-check.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Peter Geoghegan

On Wed, Apr 29, 2020 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote:
> As a bonus, we now have an easy correctness cross-check: if some
> random piece of nbtree code lands on a page (having followed a
> downlink or sibling link) that is marked recycled, then clearly
> something is very wrong -- throw a "can't happen" error.

If this doesn't sound that appealing to any of you, then bear this in
mind: nbtree has a terrifying tendency to *not* produce wrong answers
to queries when we land on a concurrently-recycled page.
_bt_moveright() is willing to move right as many times as it takes to
arrive at the correct page, even though in typical cases having to
move right once is rare -- twice is exceptional. I suppose that there
is roughly a 50% chance that we'll end up landing at a point in the
key space that is to the left of the point where we're supposed to
arrive at.  It might take many, many page accesses before
_bt_moveright() finds the correct page, but that often won't be very
noticeable. Or if it is noticed, corruption won't be suspected --
we're still getting a correct answer.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From: Andres Freund

Hi,

On 2020-04-29 13:40:55 -0700, Peter Geoghegan wrote:
> On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> > > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from
> > > btpo.xact) when it recycles deleted pages. It simply puts them in the
> > > FSM without changing anything about the page itself. This means
> > > surprisingly little in the context of nbtree: the
> > > _bt_page_recyclable() XID check that takes place in btvacuumpage()
> > > also takes place in _bt_getbuf(), at the point where the page actually
> > > gets recycled by the client. That's not great.
> >
> > I think it's quite foolish for btvacuumpage() to not freeze xids. If we
> > only do so when necessary (i.e. older than a potential new relfrozenxid,
> > and only when the vacuum didn't yet skip pages), the costs are pretty
> > miniscule.
>
> I wonder if we should just bite the bullet and mark pages that are
> recycled by VACUUM as explicitly recycled, with WAL-logging and
> everything (this is like freezing, but stronger). That way, the
> _bt_page_recyclable() call within _bt_getbuf() would only be required
> to check that state (while btvacuumpage() would use something like a
> _bt_page_eligible_for_recycling(), which would do the same thing as
> the current _bt_page_recyclable()).

I'm not sure I see the advantage. Only doing so in the freezing case
seems unlikely to cause additional conflicts, but I'm less sure about
doing it always. btpo.xact will often be quite recent, right? So it'd
likely cause more conflicts.


> I'm not sure how low the costs would be, but at least we'd only have
> to do it once per already-deleted page (i.e. only the first time a
> VACUUM operation found _bt_page_eligible_for_recycling() returned true
> for the page and marked it recycled in a crash safe manner). That
> design would be quite a lot simpler, because it expresses the problem
> in terms that make sense to the nbtree code. _bt_getbuf() should not
> have to make a distinction between "possibly recycled" versus
> "definitely recycled".

I don't really see the problem with the check in _bt_getbuf()? I'd
actually like to be *more* aggressive about putting pages in the FSM (or
whatever), and that'd probably require checks like this. E.g. whenever
we unlink a page, we should put it into the FSM (or something different,
see below). And then do all the necessary checks in _bt_getbuf().

It's pretty sad that one right now basically needs to vacuum twice to
reuse pages in nbtree (once to delete the page, once to record it in the
fsm). Usually the xmin horizon should advance much more quickly than
that, allowing reuse earlier.

As far as I can tell, even just adding them to the FSM when setting
ISDELETED would be advantageous. There's obviously the downside that we'll
cause backends to visit a lot of pages that can't actually be reused... But if
we did what I suggest below, that danger probably could largely be
avoided.


> It makes sense that the FSM is not crash safe, I suppose, but we're
> talking about relatively large amounts of free space here. Can't we
> just do it properly/reliably?

What do you mean by that? To have the FSM be crash-safe?

It could make sense to just not have the FSM, and have a linked-list
style stack of pages reachable from the meta page.  That'd be especially
advantageous if we kept xids associated with the "about to be
recyclable" pages in the metapage, so it'd be cheap to evaluate.

There's possibly some not-entirely-trivial locking concerns around such
a list. Adding entries seems easy enough, because we currently only
delete pages from within vacuum. But popping entries could be more
complicated, I'm not exactly sure if there are potential lock nesting
issues (nor am I actually sure there aren't existing ones).

Greetings,

Andres Freund



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Wed, Apr 29, 2020 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not sure I see the advantage. Only doing so in the freezing case
> seems unlikely to cause additional conflicts, but I'm less sure about
> doing it always. btpo.xact will often be quite recent, right? So it'd
> likely cause more conflicts.

btpo.xact values come from a call to ReadNewTransactionId(). There
pretty much has to be one call to ReadNewTransactionId() per page
deletion (see comments about it within _bt_unlink_halfdead_page()). So
yes, I'd say that it could be very recent.
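
(The relevant step in _bt_unlink_halfdead_page() is, roughly:

    /* stamp the deleted page with an upper bound on interested xids */
    opaque = (BTPageOpaque) PageGetSpecialPointer(page);
    opaque->btpo.xact = ReadNewTransactionId();

so the stored xid is the newest xid at the time of the deletion.)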

I don't mind continuing to do the conflicts in _bt_getbuf(), which
would delay them until the point where we really need the page --
especially if we could do that in a way that captures temporal
locality (e.g. your recyclable page chaining idea).

> I don't really see the problem with the check in _bt_getbuf()? I'd
> actually like to be *more* aggressive about putting pages in the FSM (or
> whatever), and that'd probably require checks like this. E.g. whenever
> we unlink a page, we should put it into the FSM (or something different,
> see below). And then do all the necessary checks in _bt_getbuf().

Basically, I would like to have a page state that represents "it
should be impossible for any scan to land on this page, except for
btvacuumscan()". Without relying on 32-bit XIDs, and ideally without
relying on any other state to interpret what it really means. In
principle we can set a deleted page to that state at the earliest
possible point when that becomes true, without it meaning anything
else, or requiring that we do anything else at the same time (e.g.
actually using it for the right half of a page in a page split,
generating recovery conflicts).

> It's pretty sad that one right now basically needs to vacuum twice to
> reuse pages in nbtree (once to delete the page, once to record it in the
> fsm). Usually the xmin horizon should advance much more quickly than
> that, allowing reuse earlier.

Yes, that's definitely bad. I like the general idea of making us more
aggressive with recycling. Marking pages as "recyclable" (not
"recycled") not too long after they first get deleted in VACUUM, and
then separately recycling them in _bt_getbuf() is a cleaner design.
Separation of concerns. That would build confidence in a more
aggressive approach -- we could add lots of cross-checks against
landing on a recyclable page. Note that we have had a bug in this
exact mechanism in the past -- see commit d3abbbebe52.

If there was a bug then we might still land on the page after it gets
fully recycled, in which case the cross-checks won't detect the bug.
But ISTM that we always have a good chance of landing on the page
before that happens, in which case the cross-check complains and we
get a log message, and possibly even a bug report. We don't have to be
truly lucky to see when our approach is buggy when we go on to make
page deletion more aggressive (in whatever way). And we get the same
cross-checks on standbys.

> > It makes sense that the FSM is not crash safe, I suppose, but we're
> > talking about relatively large amounts of free space here. Can't we
> > just do it properly/reliably?
>
> What do you mean by that? To have the FSM be crash-safe?

That, or the equivalent (pretty much your chaining idea) may well be a
good idea. But what I really meant was an explicit "recyclable" page
state. That's all. We may or may not also continue to rely on the FSM
in the same way.

I suppose that we should try to get rid of the FSM in nbtree. I see
the advantages. It's not essential to my proposal, though.

> It could make sense to just not have the FSM, and have a linked-list
> style stack of pages reachable from the meta page.  That'd be especially
> advantageous if we kept xids associated with the "about to be
> recyclable" pages in the metapage, so it'd be cheap to evaluate.

I like that idea. But doesn't that also argue for delaying the
conflicts until we actually recycle a "recyclable" page?

> There's possibly some not-entirely-trivial locking concerns around such
> a list. Adding entries seems easy enough, because we currently only
> delete pages from within vacuum. But popping entries could be more
> complicated, I'm not exactly sure if there are potential lock nesting
> issues (nor am I actually sure there aren't existing ones).

A "recyclable" page state might help with this, too. _bt_getbuf() is a
bag of tricks, even leaving aside generating recovery conflicts.

If we are going to be more eager, then the cost of dirtying the page a
second time to mark it "recyclable" might mostly not matter.
Especially if we chain the pages. That is, maybe VACUUM recomputes
RecentGlobalXmin at the end of its first btvacuumscan() scan (or at
the end of the whole VACUUM operation), when it notices that it is
already possible to mark many pages as "recyclable". Perhaps we won't
write out the page twice much of the time, because it won't have been
that long since VACUUM dirtied the page in order to delete it.

Yeah, we could be a lot more aggressive here, in a bunch of ways. As
I've said quite a few times, it seems like our implementation of "the
drain technique" is way more conservative than it needs to be (i.e. we
use ReadNewTransactionId() without considering any of the specifics of
the index). But if we mess up, we can't expect amcheck to detect the
problems, which would be totally transient. We're talking about
incredibly subtle concurrency bugs. So IMV it's just not going to
happen until the whole thing becomes way less scary.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Sat, 18 Apr 2020 at 03:22, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Apr 16, 2020 at 6:49 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea. _bt_vacuum_needs_cleanup() needs to check if
> > metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> > vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> > false.
>
> I'm still fairly unclear on what the actual problem is here, and on
> how we propose to fix it. It seems to me that we probably don't have a
> problem in the case where we don't advance relfrozenxid or relminmxid,
> because in that case there's not much difference between the behavior
> created by this patch and a case where we just error out due to an
> interrupt or something before reaching the index cleanup stage. I
> think that the problem is that in the case where we do advance relfrozenxid,
> we might advance it past some XID value stored in the index metadata.
> Is that right?

I think advancing relfrozenxid past oldest_btpo_xact actually cannot
be a problem. If a subsequent vacuum sees that oldest_btpo_xact is an
old xid, we can recycle pages. Before the introduction of INDEX_CLEANUP =
false, we used to invoke either bulkdelete or vacuumcleanup at least
once in each vacuum. And thanks to relfrozenxid, a table is
periodically vacuumed by an anti-wraparound vacuum. But with this
feature, we can unconditionally skip both bulkdelete and
vacuumcleanup. So IIUC the problem is that, since we skip both,
oldest_btpo_xact could be seen as a "future" xid during vacuum, which
causes vacuum to miss pages that can actually be recycled.
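
The reason an ancient xid can look like a "future" xid is the circular
comparison used for normal xids (roughly the logic of
TransactionIdPrecedes()):

bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
    int32   diff;

    /* permanent (non-normal) xids compare by raw value */
    if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
        return (id1 < id2);

    /*
     * Normal xids compare modulo 2^32: once id1 falls more than about two
     * billion xids behind id2, the signed difference changes sign and id1
     * suddenly appears to be "newer" than id2.
     */
    diff = (int32) (id1 - id2);
    return (diff < 0);
}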

I think we can fix this issue by calling the vacuumcleanup callback
during an anti-wraparound vacuum even if INDEX_CLEANUP is false. That
way we can let the index AM decide whether to do index cleanup at
least once before XID wraparound, same as before. Originally the
motivation for disabling INDEX_CLEANUP was to skip the full index scan
during an anti-wraparound vacuum to reduce the execution time. With
this change, we will end up doing a full index scan in some
anti-wraparound vacuum cases, but we can still skip it if there are no
recyclable pages in the index.
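
A minimal sketch of what I mean, in terms of vacuumlazy.c (hypothetical
illustration only; the actual patch may end up looking different):

    /*
     * Run the index cleanup phase when this is an aggressive
     * (anti-wraparound) vacuum, even if INDEX_CLEANUP was set to off, so
     * that the index AM still gets a chance to deal with old xids such as
     * btpo.xact before they wrap around.
     */
    if (nindexes > 0 &&
        (params->index_cleanup == VACOPT_TERNARY_ENABLED || aggressive))
        lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);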

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> So IIUC the problem is that since we skip both,
> oldest_btpo_xact could be seen as a "future" xid during vacuum. Which
> will be a cause of that vacuum misses pages which can actually be
> recycled.

This is also my understanding of the problem.

> I think we can fix this issue by calling vacuumcleanup callback when
> an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can
> let index AM make decisions whether doing cleanup index at least once
> until XID wraparound, same as before.

+1

Can you work on a patch?

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Wed, 6 May 2020 at 07:14, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > So IIUC the problem is that since we skip both,
> > oldest_btpo_xact could be seen as a "future" xid during vacuum. Which
> > will be a cause of that vacuum misses pages which can actually be
> > recycled.
>
> This is also my understanding of the problem.
>
> > I think we can fix this issue by calling vacuumcleanup callback when
> > an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can
> > let index AM make decisions whether doing cleanup index at least once
> > until XID wraparound, same as before.
>
> +1
>
> Can you work on a patch?

Yes, I'll submit a bug fix patch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Wed, 6 May 2020 at 07:17, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Wed, 6 May 2020 at 07:14, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > So IIUC the problem is that since we skip both,
> > > oldest_btpo_xact could be seen as a "future" xid during vacuum. Which
> > > will be a cause of that vacuum misses pages which can actually be
> > > recycled.
> >
> > This is also my understanding of the problem.
> >
> > > I think we can fix this issue by calling vacuumcleanup callback when
> > > an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can
> > > let index AM make decisions whether doing cleanup index at least once
> > > until XID wraparound, same as before.
> >
> > +1
> >
> > Can you work on a patch?
>
> Yes, I'll submit a bug fix patch.
>

I've attached a patch that fixes this issue.

With this patch, when performing an aggressive vacuum we no longer
skip the index cleanup phase (only the index vacuum phase is still
skipped). The reason for un-skipping only the index cleanup phase is
that the index vacuum phase can be called multiple times, which takes
a very long time. Since the purpose of this index cleanup is to
process recyclable pages, it's enough to do only the index cleanup
phase. However, it also means we do index cleanup even when the table
might have garbage, whereas we used to call index cleanup only when
there was no garbage in the table. As far as I can tell that's not a
problem, but it perhaps needs more research.


Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> I've attached the patch fixes this issue.
>
> With this patch, we don't skip only index cleanup phase when
> performing an aggressive vacuum. The reason why I don't skip only
> index cleanup phase is that index vacuum phase can be called multiple
> times, which takes a very long time. Since the purpose of this index
> cleanup is to process recyclable pages it's enough to do only index
> cleanup phase.

That's only true in nbtree, though. The way that I imagined we'd want
to fix this is by putting control in each index access method. So,
we'd revise the way that amvacuumcleanup() worked -- the
amvacuumcleanup routine for each index AM would sometimes be called in
a mode that means "I don't really want you to do any cleanup, but if
you absolutely have to for your own reasons then you can". This could
be represented using something similar to
IndexVacuumInfo.analyze_only.
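
For concreteness, the existing analyze_only escape hatch at the top of
btvacuumcleanup() looks roughly like the first check below; the idea
would add a second, similarly advisory check (the cleanup_requested
field name is invented here purely for illustration -- it is not part
of IndexVacuumInfo today):

IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /* existing: amvacuumcleanup is a no-op during ANALYZE-only calls */
    if (info->analyze_only)
        return stats;

    /*
     * hypothetical: the caller asked us to skip cleanup (INDEX_CLEANUP
     * off), but the AM can still insist on it for its own reasons, e.g.
     * when _bt_vacuum_needs_cleanup() says deleted pages must be recycled
     * before their xids get too old.
     */
    if (!info->cleanup_requested && !_bt_vacuum_needs_cleanup(info))
        return stats;

    /* ... the normal cleanup path follows ... */
}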

This approach has an obvious disadvantage: the patch really has to
teach *every* index AM to do something with that state (most will
simply do no work). It seems logical to have the index AM control what
happens, though. This allows the logic to live inside
_bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
place where we make decisions like this.

Most other AMs don't have this problem. GiST has a similar issue with
recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't
need to care about this stuff at all. Besides, it seems like it might
be a good idea to do other basic maintenance of the index when we're
"skipping" index cleanup. We probably should always do things that
require only a small, fixed amount of work. Things like maintaining
metadata in the metapage.

There may be practical reasons why this approach isn't suitable for
backpatch even if it is a superior approach. What do you think? Also,
what do you think about this Robert and Andres?

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Wed, May 6, 2020 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> This approach has an obvious disadvantage: the patch really has to
> teach *every* index AM to do something with that state (most will
> simply do no work). It seems logical to have the index AM control what
> happens, though. This allows the logic to live inside
> _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
> place where we make decisions like this.

Also, do we really want to skip summarization of BRIN indexes? This
cleanup is rather dissimilar to the cleanup that takes place in most
other AMs -- it isn't really that related to garbage collection (BRIN
is rather unique in this respect). I think that BRIN might be an
inappropriate target for "index_cleanup off" VACUUMs for that reason.

See the explanation of how this takes place from the docs:
https://www.postgresql.org/docs/devel/brin-intro.html#BRIN-OPERATION

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Alvaro Herrera
Дата:
On 2020-May-06, Peter Geoghegan wrote:

> Also, do we really want to skip summarization of BRIN indexes? This
> cleanup is rather dissimilar to the cleanup that takes place in most
> other AMs -- it isn't really that related to garbage collection (BRIN
> is rather unique in this respect). I think that BRIN might be an
> inappropriate target for "index_cleanup off" VACUUMs for that reason.
> 
> See the explanation of how this takes place from the docs:
> https://www.postgresql.org/docs/devel/brin-intro.html#BRIN-OPERATION

Good question.  I agree that BRIN summarization is not at all related to
what other index AMs do during the cleanup phase.  However, if the
index_cleanup feature is there to make it faster to process a table
that's nearing wraparound hazards (or at least the warnings), then I
think it makes sense to complete the vacuum as fast as possible -- which
includes not trying to summarize it for brin indexes.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Wed, May 6, 2020 at 1:06 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Good question.  I agree that BRIN summarization is not at all related to
> what other index AMs do during the cleanup phase.  However, if the
> index_cleanup feature is there to make it faster to process a table
> that's nearing wraparound hazards (or at least the warnings), then I
> think it makes sense to complete the vacuum as fast as possible -- which
> includes not trying to summarize it for brin indexes.

I forgot about the fact that the AutoVacuumRequestWork() interface
exists at all until just now. That's how "autosummarize = on" makes
sure that autosummarization takes place. These work items are not
affected by the fact that the VACUUM happens to be a "index_cleanup
off" VACUUM. Fortunately, the user is required to explicitly opt-in to
autosummarization (by setting "autosummarize = on") in order for
autovacuum to spend extra time processing work items when it might be
important to advance relfrozenxid ASAP. (My initial assumption was
that the autosummarization business happened within
brinvacuumcleanup(), but I now see that I was mistaken.)
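
(The request actually comes from brininsert(), when autosummarize is on
and the previous page range has just been filled but is not yet
summarized; simplified, with the surrounding logic elided:

    if (!lastPageTuple)
        (void) AutoVacuumRequestWork(AVW_BRINSummarizeRange,
                                     RelationGetRelid(idxRel),
                                     lastPageRange);

so it is driven by inserts, not by VACUUM.)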

There is a separate question (nothing to do with summarization) about
the cleanup steps performed in brinvacuumcleanup(), which are unlike
any of the cleanup/maintenance that we expect for an amvacuumcleanup
routine in general. As I said in my last e-mail, these steps have
nothing to do with garbage tuples. Rather, it's deferred maintenance
that we need to do even with append-only workloads (including when
autosummarization has not been enabled). What about that? Is that
okay?

ISTM that the fundamental issue is that BRIN imagines that it is in
control, which isn't quite true in light of the "index_cleanup off"
stuff -- a call to brinvacuumcleanup() is expected to take place at
fairly consistent intervals to take care of revmap processing, which,
in general, might not happen now. I blame commit a96c41feec6 for this,
not BRIN. ISTM that whatever behavior we deem appropriate, the proper
place to decide on it is within BRIN. Not within vacuumlazy.c.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > I've attached the patch fixes this issue.
> >
> > With this patch, we don't skip only index cleanup phase when
> > performing an aggressive vacuum. The reason why I don't skip only
> > index cleanup phase is that index vacuum phase can be called multiple
> > times, which takes a very long time. Since the purpose of this index
> > cleanup is to process recyclable pages it's enough to do only index
> > cleanup phase.
>
> That's only true in nbtree, though. The way that I imagined we'd want
> to fix this is by putting control in each index access method. So,
> we'd revise the way that amvacuumcleanup() worked -- the
> amvacuumcleanup routine for each index AM would sometimes be called in
> a mode that means "I don't really want you to do any cleanup, but if
> you absolutely have to for your own reasons then you can". This could
> be represented using something similar to
> IndexVacuumInfo.analyze_only.
>
> This approach has an obvious disadvantage: the patch really has to
> teach *every* index AM to do something with that state (most will
> simply do no work). It seems logical to have the index AM control what
> happens, though. This allows the logic to live inside
> _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
> place where we make decisions like this.
>
> Most other AMs don't have this problem. GiST has a similar issue with
> recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't
> need to care about this stuff at all. Besides, it seems like it might
> be a good idea to do other basic maintenance of the index when we're
> "skipping" index cleanup. We probably should always do things that
> require only a small, fixed amount of work. Things like maintaining
> metadata in the metapage.
>
> There may be practical reasons why this approach isn't suitable for
> backpatch even if it is a superior approach. What do you think?

I agree this idea is better. I was thinking of the same approach but
was concerned about backpatching. In particular, since I was thinking
of adding one or two fields to IndexVacuumInfo, existing index AMs
might not work with the new IndexVacuumInfo structure.

If we go with this idea, we need to change lazy vacuum so that it
uses the two-pass vacuum strategy even if INDEX_CLEANUP is false. Also,
for parallel vacuum, we need to launch workers. But I think these
changes are no big problem.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Thu, 7 May 2020 at 15:40, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > I've attached the patch fixes this issue.
> > >
> > > With this patch, we don't skip only index cleanup phase when
> > > performing an aggressive vacuum. The reason why I don't skip only
> > > index cleanup phase is that index vacuum phase can be called multiple
> > > times, which takes a very long time. Since the purpose of this index
> > > cleanup is to process recyclable pages it's enough to do only index
> > > cleanup phase.
> >
> > That's only true in nbtree, though. The way that I imagined we'd want
> > to fix this is by putting control in each index access method. So,
> > we'd revise the way that amvacuumcleanup() worked -- the
> > amvacuumcleanup routine for each index AM would sometimes be called in
> > a mode that means "I don't really want you to do any cleanup, but if
> > you absolutely have to for your own reasons then you can". This could
> > be represented using something similar to
> > IndexVacuumInfo.analyze_only.
> >
> > This approach has an obvious disadvantage: the patch really has to
> > teach *every* index AM to do something with that state (most will
> > simply do no work). It seems logical to have the index AM control what
> > happens, though. This allows the logic to live inside
> > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
> > place where we make decisions like this.
> >
> > Most other AMs don't have this problem. GiST has a similar issue with
> > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't
> > need to care about this stuff at all. Besides, it seems like it might
> > be a good idea to do other basic maintenance of the index when we're
> > "skipping" index cleanup. We probably should always do things that
> > require only a small, fixed amount of work. Things like maintaining
> > metadata in the metapage.
> >
> > There may be practical reasons why this approach isn't suitable for
> > backpatch even if it is a superior approach. What do you think?
>
> I agree this idea is better. I was thinking the same approach but I
> was concerned about backpatching. Especially since I was thinking to
> add one or two fields to IndexVacuumInfo existing index AM might not
> work with the new VacuumInfo structure.

Would it be OK if we added these fields at the end of the IndexVacuumInfo structure?

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Thu, 7 May 2020 at 16:26, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 7 May 2020 at 15:40, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada
> > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > > I've attached the patch fixes this issue.
> > > >
> > > > With this patch, we don't skip only index cleanup phase when
> > > > performing an aggressive vacuum. The reason why I don't skip only
> > > > index cleanup phase is that index vacuum phase can be called multiple
> > > > times, which takes a very long time. Since the purpose of this index
> > > > cleanup is to process recyclable pages it's enough to do only index
> > > > cleanup phase.
> > >
> > > That's only true in nbtree, though. The way that I imagined we'd want
> > > to fix this is by putting control in each index access method. So,
> > > we'd revise the way that amvacuumcleanup() worked -- the
> > > amvacuumcleanup routine for each index AM would sometimes be called in
> > > a mode that means "I don't really want you to do any cleanup, but if
> > > you absolutely have to for your own reasons then you can". This could
> > > be represented using something similar to
> > > IndexVacuumInfo.analyze_only.
> > >
> > > This approach has an obvious disadvantage: the patch really has to
> > > teach *every* index AM to do something with that state (most will
> > > simply do no work). It seems logical to have the index AM control what
> > > happens, though. This allows the logic to live inside
> > > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
> > > place where we make decisions like this.
> > >
> > > Most other AMs don't have this problem. GiST has a similar issue with
> > > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't
> > > need to care about this stuff at all. Besides, it seems like it might
> > > be a good idea to do other basic maintenance of the index when we're
> > > "skipping" index cleanup. We probably should always do things that
> > > require only a small, fixed amount of work. Things like maintaining
> > > metadata in the metapage.
> > >
> > > There may be practical reasons why this approach isn't suitable for
> > > backpatch even if it is a superior approach. What do you think?
> >
> > I agree this idea is better. I was thinking the same approach but I
> > was concerned about backpatching. Especially since I was thinking to
> > add one or two fields to IndexVacuumInfo existing index AM might not
> > work with the new VacuumInfo structure.
>
> It would be ok if we added these fields at the end of VacuumInfo structure?
>

I've attached a WIP patch for HEAD. With this patch, the core passes
index_cleanup to the bulkdelete and vacuumcleanup callbacks so that
they can decide whether to run vacuum or not.

If the direction of this patch seems good, I'll change the patch so
that we pass some information to these callbacks indicating whether
this vacuum is an anti-wraparound vacuum. This is necessary because,
per the discussion so far, it's enough to invoke index cleanup before
XID wraparound.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> I've attached WIP patch for HEAD. With this patch, the core pass
> index_cleanup to bulkdelete and vacuumcleanup callbacks so that they
> can make decision whether run vacuum or not.
>
> If the direction of this patch seems good, I'll change the patch so
> that we pass something information to these callbacks indicating
> whether this vacuum is anti-wraparound vacuum. This is necessary
> because it's enough to invoke index cleanup before XID wraparound as
> per discussion so far.

It seems like the right direction to me. Robert?

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Fri, May 22, 2020 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> It seems like the right direction to me. Robert?

BTW, this is an interesting report of somebody using the INDEX_CLEANUP
feature when they had to deal with a difficult production issue:

https://www.buildkitestatus.com/incidents/h0vnx4gp7djx

This report is not really relevant to our discussion, but I thought
you might find it interesting.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Tue, 19 May 2020 at 11:32, Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Thu, 7 May 2020 at 16:26, Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> >
> > On Thu, 7 May 2020 at 15:40, Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > >
> > > On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote:
> > > >
> > > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada
> > > > <masahiko.sawada@2ndquadrant.com> wrote:
> > > > > I've attached the patch fixes this issue.
> > > > >
> > > > > With this patch, we don't skip only index cleanup phase when
> > > > > performing an aggressive vacuum. The reason why I don't skip only
> > > > > index cleanup phase is that index vacuum phase can be called multiple
> > > > > times, which takes a very long time. Since the purpose of this index
> > > > > cleanup is to process recyclable pages it's enough to do only index
> > > > > cleanup phase.
> > > >
> > > > That's only true in nbtree, though. The way that I imagined we'd want
> > > > to fix this is by putting control in each index access method. So,
> > > > we'd revise the way that amvacuumcleanup() worked -- the
> > > > amvacuumcleanup routine for each index AM would sometimes be called in
> > > > a mode that means "I don't really want you to do any cleanup, but if
> > > > you absolutely have to for your own reasons then you can". This could
> > > > be represented using something similar to
> > > > IndexVacuumInfo.analyze_only.
> > > >
> > > > This approach has an obvious disadvantage: the patch really has to
> > > > teach *every* index AM to do something with that state (most will
> > > > simply do no work). It seems logical to have the index AM control what
> > > > happens, though. This allows the logic to live inside
> > > > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one
> > > > place where we make decisions like this.
> > > >
> > > > Most other AMs don't have this problem. GiST has a similar issue with
> > > > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't
> > > > need to care about this stuff at all. Besides, it seems like it might
> > > > be a good idea to do other basic maintenance of the index when we're
> > > > "skipping" index cleanup. We probably should always do things that
> > > > require only a small, fixed amount of work. Things like maintaining
> > > > metadata in the metapage.
> > > >
> > > > There may be practical reasons why this approach isn't suitable for
> > > > backpatch even if it is a superior approach. What do you think?
> > >
> > > I agree this idea is better. I was thinking the same approach but I
> > > was concerned about backpatching. Especially since I was thinking to
> > > add one or two fields to IndexVacuumInfo existing index AM might not
> > > work with the new VacuumInfo structure.
> >
> > It would be ok if we added these fields at the end of VacuumInfo structure?
> >
>
> I've attached WIP patch for HEAD. With this patch, the core pass
> index_cleanup to bulkdelete and vacuumcleanup callbacks so that they
> can make decision whether run vacuum or not.
>
> If the direction of this patch seems good, I'll change the patch so
> that we pass something information to these callbacks indicating
> whether this vacuum is anti-wraparound vacuum. This is necessary
> because it's enough to invoke index cleanup before XID wraparound as
> per discussion so far.
>

I've updated the patch so that vacuum passes the is_wraparound flag to
bulkdelete and vacuumcleanup. Therefore I've added two new variables
in total: index_cleanup and is_wraparound. Index AMs can decide
whether to skip bulkdelete and index cleanup or not.
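
Roughly, the two additions to the information passed down to the index
AM look like this (shown as booleans for illustration; see the attached
patch for the exact definitions and placement):

    bool    index_cleanup;  /* did the user ask for index cleanup? */
    bool    is_wraparound;  /* is this an anti-wraparound (aggressive) vacuum? */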

Also, I've added this item to Older Bugs so as not to forget.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Вложения

Re: xid wraparound danger due to INDEX_CLEANUP false

От
Robert Haas
Дата:
On Fri, May 22, 2020 at 4:40 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > I've attached WIP patch for HEAD. With this patch, the core pass
> > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they
> > can make decision whether run vacuum or not.
> >
> > If the direction of this patch seems good, I'll change the patch so
> > that we pass something information to these callbacks indicating
> > whether this vacuum is anti-wraparound vacuum. This is necessary
> > because it's enough to invoke index cleanup before XID wraparound as
> > per discussion so far.
>
> It seems like the right direction to me. Robert?

Sorry, I'm so far behind on my email. Argh.

I think, especially on the blog post you linked, that we should aim to
have INDEX_CLEANUP OFF mode do the minimum possible amount of work
while still keeping us safe against transaction ID wraparound. So if,
for example, it's desirable but not imperative for BRIN to
resummarize, then it's OK normally but should be skipped with
INDEX_CLEANUP OFF.

I find the patch itself confusing and the comments inadequate,
especially the changes to lazy_scan_heap(). Before the INDEX_CLEANUP
patch went into the tree, LVRelStats had a member hasindex indicating
whether or not the table had any indexes. The patch changed that
member to useindex, indicating whether or not we were going to do
index vacuuming; thus, it would be false if either there were no
indexes or if we were going to ignore them. This patch redefines
useindex to mean whether or not the table has any indexes, but without
renaming the variable. There's also really no comments anywhere in the
vacuumlazy.c explaining the overall scheme here; what are we actually
doing? Apparently, what we're really doing here is that even when
INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
That seems sad, but if it's what we have to do then it at least needs
comments explaining it.

As for the btree portion of the change, I expect you understand this
better than I do, so I'm reluctant to stick my neck out, but it seems
that what the patch does is force index cleanup to happen even when
INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
the btree index has at least 1 recyclable page. My first reaction is
to wonder whether that doesn't nerf this feature into oblivion. Isn't
it likely that an index that is being vacuumed for wraparound will
have a recyclable page someplace? If the presence of that 1 recyclable
page causes the "help, I'm about to run out of XIDs, please do the
least possible work" flag to become a no-op, I don't think users are
going to be too happy with that. Maybe I am misunderstanding.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Wed, Jun 24, 2020 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Sorry, I'm so far behind on my email. Argh.

That's okay.

> I think, especially on the blog post you linked, that we should aim to
> have INDEX_CLEANUP OFF mode do the minimum possible amount of work
> while still keeping us safe against transaction ID wraparound. So if,
> for example, it's desirable but not imperative for BRIN to
> resummarize, then it's OK normally but should be skipped with
> INDEX_CLEANUP OFF.

I agree that that's very important.

> Apparently, what we're really doing here is that even when
> INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> That seems sad, but if it's what we have to do then it at least needs
> comments explaining it.

+1. Though I think that AMs should technically have the right to
consider it advisory.

> As for the btree portion of the change, I expect you understand this
> better than I do, so I'm reluctant to stick my neck out, but it seems
> that what the patch does is force index cleanup to happen even when
> INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
> the btree index has at least 1 recyclable page. My first reaction is
> to wonder whether that doesn't nerf this feature into oblivion. Isn't
> it likely that an index that is being vacuumed for wraparound will
> have a recyclable page someplace? If the presence of that 1 recyclable
> page causes the "help, I'm about to run out of XIDs, please do the
> least possible work" flag to become a no-op, I don't think users are
> going to be too happy with that. Maybe I am misunderstanding.

I have mixed feelings about it myself. These are valid concerns.

This is a problem to the extent that the user has a table that they'd
like to use INDEX_CLEANUP with, that has indexes that regularly
require cleanup due to page deletion. ISTM, then, that the really
relevant high level design questions for this patch are:

1. How often is that likely to happen in The Real World™?

2. If we fail to do cleanup and leak already-deleted pages, how bad is
that? (Both in general, and in the worst case.)

I'll hazard a guess for 1: I think that it might not come up that
often. Page deletion is often something that we hardly ever need. And,
unlike some DB systems, we only do it when pages are fully empty
(which, as it turns out, isn't necessarily better than our simple
approach [1]). I tend to think it's unlikely to happen in cases where
INDEX_CLEANUP is used, because those are cases that also must not have
that much index churn to begin with.

Then there's question 2. The intuition behind the approach from
Sawada-san's patch was that allowing wraparound here feels wrong -- it
should be in the AM's hands. However, it's not like I can point to
some ironclad guarantee about not leaking deleted pages that existed
before the INDEX_CLEANUP feature. We know that the FSM is not crash
safe, and that's that. Is this really all that different? Maybe it is,
but it seems like a quantitative difference to me.

I'm kind of arguing against myself even as I try to advance my
original argument. If workloads that use INDEX_CLEANUP don't need to
delete and recycle pages in any case, then why should we care that
those same workloads might leak pages on account of the wraparound
hazard? There's nothing to leak! Maybe some compromise is possible, at
least for the backbranches. Perhaps nbtree is told about the
requirements in a bit more detail than we'd imagined. It's not just an
INDEX_CLEANUP boolean. It could be an enum or something. Not sure,
though.

The real reason that I want to push the mechanism down into index
access methods is because that design is clearly better overall; it
just so happens that the specific way in which we currently defer
recycling in nbtree makes very little sense, so it's harder to see the
big picture. The xid-cleanup design that we have was approximately the
easiest way to do it, so that's what we got. We should figure out a
way to recycle the pages at something close to the earliest possible
opportunity, without having to perform a full scan on the index
relation within btvacuumscan(). Maybe we can use the autovacuum work
item mechanism for that. For indexes that get VACUUMed once a week on
average, it makes zero sense to wait another week to recycle the pages
that get deleted, in a staggered fashion. It should be possible to
recycle the pages a minute or two after VACUUM proper finishes, with
extra work that's proportionate to the number of deleted pages. This
is still conservative. I am currently very busy with an unrelated
B-Tree prototype, so I might not get around to it this year. Maybe
Sawada-san can think about this?
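
To make the work-item idea concrete, roughly this kind of request at
the end of btree VACUUM (a sketch only: AVW_BTRecycleDeletedPages and
oldest_deleted_blkno are invented names; AutoVacuumRequestWork() itself
and the BRIN summarization work item are what exist today):

    /*
     * Ask an autovacuum worker to come back soon and recycle the pages
     * this VACUUM just deleted, rather than waiting for the next VACUUM
     * of the table to stumble over them.
     */
    (void) AutoVacuumRequestWork(AVW_BTRecycleDeletedPages,
                                 RelationGetRelid(info->index),
                                 oldest_deleted_blkno);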

Also, handling of the vacuum_cleanup_index_scale_factor stuff (added
to Postgres 11 by commit 857f9c36) should live next to code for the
confusingly-similar INDEX_CLEANUP stuff (added to Postgres 12 by
commit a96c41feec6), on general principle. I think that that
organization is a lot easier to follow.

[1] https://www.sciencedirect.com/science/article/pii/002200009390020W
--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Masahiko Sawada
Дата:
On Thu, 25 Jun 2020 at 05:02, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Jun 24, 2020 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Sorry, I'm so far behind on my email. Argh.
>
> That's okay.
>
> > I think, especially on the blog post you linked, that we should aim to
> > have INDEX_CLEANUP OFF mode do the minimum possible amount of work
> > while still keeping us safe against transaction ID wraparound. So if,
> > for example, it's desirable but not imperative for BRIN to
> > resummarize, then it's OK normally but should be skipped with
> > INDEX_CLEANUP OFF.
>
> I agree that that's very important.
>
> > Apparently, what we're really doing here is that even when
> > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> > That seems sad, but if it's what we have to do then it at least needs
> > comments explaining it.
>
> +1. Though I think that AMs should technically have the right to
> consider it advisory.
>
> > As for the btree portion of the change, I expect you understand this
> > better than I do, so I'm reluctant to stick my neck out, but it seems
> > that what the patch does is force index cleanup to happen even when
> > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and
> > the btree index has at least 1 recyclable page. My first reaction is
> > to wonder whether that doesn't nerf this feature into oblivion. Isn't
> > it likely that an index that is being vacuumed for wraparound will
> > have a recyclable page someplace? If the presence of that 1 recyclable
> > page causes the "help, I'm about to run out of XIDs, please do the
> > least possible work" flag to become a no-op, I don't think users are
> > going to be too happy with that. Maybe I am misunderstanding.
>
> I have mixed feelings about it myself. These are valid concerns.
>
> This is a problem to the extent that the user has a table that they'd
> like to use INDEX_CLEANUP with, that has indexes that regularly
> require cleanup due to page deletion. ISTM, then, that the really
> relevant high level design questions for this patch are:
>
> 1. How often is that likely to happen in The Real World™?
>
> 2. If we fail to do cleanup and leak already-deleted pages, how bad is
> that? ( Both in general, and in the worst case.)
>
> I'll hazard a guess for 1: I think that it might not come up that
> often. Page deletion is often something that we hardly ever need. And,
> unlike some DB systems, we only do it when pages are fully empty
> (which, as it turns out, isn't necessarily better than our simple
> approach [1]). I tend to think it's unlikely to happen in cases where
> INDEX_CLEANUP is used, because those are cases that also must not have
> that much index churn to begin with.
>
> Then there's question 2. The intuition behind the approach from
> Sawada-san's patch was that allowing wraparound here feels wrong -- it
> should be in the AM's hands. However, it's not like I can point to
> some ironclad guarantee about not leaking deleted pages that existed
> before the INDEX_CLEANUP feature. We know that the FSM is not crash
> safe, and that's that. Is this really all that different? Maybe it is,
> but it seems like a quantitative difference to me.

I think that with the approach implemented in my patch, it could be a
problem that the user cannot easily know in advance whether a vacuum
with INDEX_CLEANUP false will perform index cleanup, even if page
deletion doesn’t happen in most cases. They can check whether the
vacuum will be a wraparound vacuum, but it’s relatively hard for users
to check in advance whether there are recyclable pages in the btree
index. This will be a restriction for users, especially those who want
to use the INDEX_CLEANUP feature to speed up an impending XID
wraparound vacuum, as described in the blog post that Peter shared.

I haven’t come up with a good solution to keep us safe against XID
wraparound yet, but it seems to me that it’s better to have an option
that forces index cleanup not to happen.

>
> I'm kind of arguing against myself even as I try to advance my
> original argument. If workloads that use INDEX_CLEANUP don't need to
> delete and recycle pages in any case, then why should we care that
> those same workloads might leak pages on account of the wraparound
> hazard?

I had the same impression at first.

> The real reason that I want to push the mechanism down into index
> access methods is because that design is clearly better overall; it
> just so happens that the specific way in which we currently defer
> recycling in nbtree makes very little sense, so it's harder to see the
> big picture. The xid-cleanup design that we have was approximately the
> easiest way to do it, so that's what we got. We should figure out a
> way to recycle the pages at something close to the earliest possible
> opportunity, without having to perform a full scan on the index
> relation within btvacuumscan().

+1

> Maybe we can use the autovacuum work
> item mechanism for that. For indexes that get VACUUMed once a week on
> average, it makes zero sense to wait another week to recycle the pages
> that get deleted, in a staggered fashion. It should be possible to
> recycle the pages a minute or two after VACUUM proper finishes, with
> extra work that's proportionate to the number of deleted pages. This
> is still conservative. I am currently very busy with an unrelated
> B-Tree prototype, so I might not get around to it this year. Maybe
> Sawada-san can think about this?

I thought that btbulkdelete and/or btvacuumcleanup could register an
autovacuum work item to recycle the pages that get deleted, but it
might not be able to recycle enough of those pages because the
autovacuum work items could be taken just after the vacuum. And if
page deletion is relatively rare in practice, we might be able to take
an optimistic approach: vacuum registers deleted pages in the FSM at
deletion time, and a process that takes a free page checks whether the
page is really recyclable. Anyway, I’ll try to think more about this.
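
For reference, the reuse path in _bt_getbuf() already re-checks
recyclability when it pulls a block out of the FSM, which is what would
make that optimistic registration workable; simplified from the
existing code:

    /* try to reuse a page from the FSM, skipping pages not yet recyclable */
    for (;;)
    {
        blkno = GetFreeIndexPage(rel);
        if (blkno == InvalidBlockNumber)
            break;              /* nothing free: extend the relation instead */

        buf = ReadBuffer(rel, blkno);
        if (ConditionalLockBuffer(buf))
        {
            page = BufferGetPage(buf);
            if (_bt_page_recyclable(page))
            {
                /* okay to reuse; the real code also WAL-logs the reuse so
                 * that standbys can generate recovery conflicts */
                return buf;
            }
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
        }
        ReleaseBuffer(buf);
    }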

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Robert Haas
Дата:
On Wed, Jun 24, 2020 at 4:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Apparently, what we're really doing here is that even when
> > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> > That seems sad, but if it's what we have to do then it at least needs
> > comments explaining it.
>
> +1. Though I think that AMs should technically have the right to
> consider it advisory.

I'm not really convinced. I agree that from a theoretical point of
view an index can have arbitrary needs and is the arbiter of its own
needs, but when I pull the emergency brake, I want the vehicle to
stop, not think about stopping. There's a fine argument for the idea
that depressing the regular brake pedal entitles the vehicle to
exercise some discretion, and on modern cars it does (think ABS, if
nothing else). But pulling the emergency brake is a statement that I
wish to override any contrary judgement about whether stopping is a
good idea. I think this option is rightly viewed as an emergency
brake, and giving AMs the right to decide that we'll instead pull off
at the next exit doesn't sit well with me. At the end of the day, the
human being should be in charge, not the program.

(Great, now Skynet will be gunning for me...)

> 1. How often is that likely to happen in The Real World™?
>
> 2. If we fail to do cleanup and leak already-deleted pages, how bad is
> that? ( Both in general, and in the worst case.)
>
> I'll hazard a guess for 1: I think that it might not come up that
> often. Page deletion is often something that we hardly ever need. And,
> unlike some DB systems, we only do it when pages are fully empty
> (which, as it turns out, isn't necessarily better than our simple
> approach [1]). I tend to think it's unlikely to happen in cases where
> INDEX_CLEANUP is used, because those are cases that also must not have
> that much index churn to begin with.

I don't think I believe this. All you need is one small range-deletion, right?

> Then there's question 2. The intuition behind the approach from
> Sawada-san's patch was that allowing wraparound here feels wrong -- it
> should be in the AM's hands. However, it's not like I can point to
> some ironclad guarantee about not leaking deleted pages that existed
> before the INDEX_CLEANUP feature. We know that the FSM is not crash
> safe, and that's that. Is this really all that different? Maybe it is,
> but it seems like a quantitative difference to me.

I don't think I believe this, either. In the real-world example to
which you linked, the user ran REINDEX afterward to recover from index
bloat, and we could advise other people who use this option that it
may leak space that a subsequent VACUUM may fail to recover, and
therefore they too should consider REINDEX. Bloat sucks and I hate it,
but in the vehicle analogy from up above, it's the equivalent of
getting lost while driving someplace. It is inconvenient and may cause
you many problems, but you will not be dead. Running out of XIDs is a
brick wall. Either the car stops or you hit the wall. Ideally you can
manage to both not get lost and also not hit a brick wall, but in an
emergency situation where you have to choose either to get lost or to
hit a brick wall, there's only one right answer. As bad as bloat is,
and it's really bad, there are users who manage to run incredibly
bloated databases for long periods of time just because the stuff that
gets slow is either stuff that they're not doing at all, or only doing
in batch jobs where it's OK if it runs super-slow and where it may
even be possible to disable the batch job altogether, at least for a
while. The set of users who can survive running out of XIDs is limited
to those who can get by with just read-only queries, and that's
practically nobody. I have yet to encounter a customer who didn't
consider running out of XIDs to be an emergency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Thu, Jun 25, 2020 at 6:59 AM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> I think that with the approach implemented in my patch, it could be a
> problem for the user that the user cannot easily know in advance
> whether vacuum with INDEX_CLEANUP false will perform index cleanup,
> even if page deletion doesn’t happen in most cases.

I was unclear. I agree that the VACUUM command with "INDEX_CLEANUP =
off" is an emergency mechanism that should be fully respected, even
when that means that we'll leak deleted pages.

Perhaps it would make sense to behave differently when the index is on
a table that has "vacuum_index_cleanup = off" set, and the vacuum is
started by autovacuum, and is not an anti-wraparound vacuum. That
doesn't seem all that appealing now that I write it down, though,
because it's a non-obvious behavioral difference among cases that
users probably expect to behave similarly. On the other hand, what
user knows that there is something called an aggressive vacuum, which
isn't exactly the same thing as an anti-wraparound vacuum?

I find it hard to decide what the least-worst thing is for the
backbranches. What do you think?

> I don’t come up with a good solution to keep us safe against XID
> wraparound yet but it seems to me that it’s better to have an option
> that forces index cleanup not to happen.

I don't think that there is a good solution that is suitable for
backpatching. The real solution is to redesign the recycling along the
lines I described.

I don't think that it's terrible that we can leak deleted pages,
especially considering the way that users are expected to use the
INDEX_CLEANUP feature. I would like to be sure that the problem is
well understood, though -- we should at least have a plan for Postgres
v14.

> I thought that btbulkdelete and/or btvacuumcleanup can register an
> autovacuum work item to recycle the page that gets deleted but it
> might not able to recycle those pages enough because the autovacuum
> work items could be taken just after vacuum. And if page deletion is
> relatively a rare case in practice, we might be able to take an
> optimistic approach that vacuum registers deleted pages to FSM on the
> deletion and a process who takes a free page checks if the page is
> really recyclable. Anyway, I’ll try to think more about this.

Right -- just putting the pages in the FSM immediately, and making it
a problem that we deal with within _bt_getbuf() is an alternative
approach that might be better.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Peter Geoghegan
Дата:
On Thu, Jun 25, 2020 at 8:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm not really convinced. I agree that from a theoretical point of
> view an index can have arbitrary needs and is the arbiter of its own
> needs, but when I pull the emergency brake, I want the vehicle to
> stop, not think about stopping.

Making this theoretical argument in the context of this discussion was
probably a mistake. I agree that this is the emergency brake, and it
needs to work like one.

It might be worth considering some compromise in the event of using
the "vacuum_index_cleanup" reloption (i.e. when the user has set it to
'off'), provided there is good reason to believe that we're not in an
emergency -- I mentioned this to Masahiko just now. I admit that that
isn't very appealing for other reasons, but it's worth considering a
way of ameliorating the problem in back branches. We really ought to
change how recycling works, so that it happens at the tail end of the
same VACUUM operation that deleted the pages -- but that cannot be
backpatched.

It might be that the most appropriate mitigation in the back branches
is a log message that reports on the fact that we've probably leaked
pages due to this issue. Plus some documentation. Though even that
would require calling nbtree to check if that is actually true (by
checking the metapage), so it still requires backpatching something
close to Masahiko's draft patch.

> I don't think I believe this. All you need is one small range-deletion, right?

Right.

> > Then there's question 2. The intuition behind the approach from
> > Sawada-san's patch was that allowing wraparound here feels wrong -- it
> > should be in the AM's hands. However, it's not like I can point to
> > some ironclad guarantee about not leaking deleted pages that existed
> > before the INDEX_CLEANUP feature. We know that the FSM is not crash
> > safe, and that's that. Is this really all that different? Maybe it is,
> > but it seems like a quantitative difference to me.
>
> I don't think I believe this, either. In the real-world example to
> which you linked, the user ran REINDEX afterward to recover from index
> bloat, and we could advise other people who use this option that it
> may leak space that a subsequent VACUUM may fail to recover, and
> therefore they too should consider REINDEX.

I was talking about the intuition behind the design. I did not intend
to suggest that nbtree should ignore "INDEX_CLEANUP = off" regardless
of the consequences.

I am sure about this much: The design embodied by Masahiko's patch is
clearly a better one overall, even if it doesn't fix the problem on
its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP
= off", even if that means leaking pages that could otherwise be
recycled. I'm not sure what we should do about any of this in the back
branches, though. I wish I had a simple idea about what to do there.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

От
Robert Haas
Дата:
On Thu, Jun 25, 2020 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I am sure about this much: The design embodied by Masahiko's patch is
> clearly a better one overall, even if it doesn't fix the problem on
> its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP
> = off", even if that means leaking pages that could otherwise be
> recycled. I'm not sure what we should do about any of this in the back
> branches, though. I wish I had a simple idea about what to do there.

My opinion is that there's no need to change the code in the
back-branches, and that I don't really like the approach in master
either.

I think what we're saying is that there is no worse consequence to
turning off index_cleanup than some bloat that isn't likely to be
recovered unless you REINDEX. If the problem in question were going to
cause data loss or data corruption or something, we'd have to take
stronger action, but I don't think anyone's saying that this is the
case. Therefore, I think we can handle the back-branches by letting
users know about the bloat hazard and suggesting that they avoid this
option unless it's necessary to avoid running out of XIDs.

Now, what about master? I think it's fine to offer the AM a callback
even when index_cleanup = false, for example so that it can freeze
something in its metapage, but I don't agree with passing it the TIDs.
That seems like it's just inviting it to ignore the emergency brake,
and it's also incurring real overhead, because remembering all those
TIDs can use a lot of memory. If that API limitation causes a problem
for some future index AM, that will be a good point to discuss when
the patch for said AM is submitted for review. I entirely agree with
you that the way btree arranges for btree recycling is crude, and I
would be delighted if you want to improve it, either for v14 or for
any future release, or if somebody else wants to do so. However, even
if that never happens, so what?
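
To be concrete about the kind of thing I have in mind, here is a very
rough sketch (the "cleanup_only" field is hypothetical, not something
that exists today):

/*
 * Very rough sketch, not a real patch: vacuumlazy.c could call the
 * cleanup callback for every index even when INDEX_CLEANUP is off,
 * without ever collecting dead TIDs or doing a bulkdelete pass.
 */
for (i = 0; i < nindexes; i++)
{
    IndexVacuumInfo ivinfo;

    ivinfo.index = Irel[i];
    ivinfo.analyze_only = false;
    ivinfo.estimated_count = true;
    ivinfo.message_level = elevel;
    ivinfo.num_heap_tuples = vacrelstats->new_rel_tuples;
    ivinfo.strategy = vac_strategy;
    ivinfo.cleanup_only = true;     /* hypothetical: metapage upkeep only */

    indstats[i] = index_vacuum_cleanup(&ivinfo, indstats[i]);
}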

In retrospect, I regret committing this patch without better
understanding the issues in this area. That was a fail on my part. At
the same time, it doesn't really sound like the issues are all that
bad. The potential index bloat does suck, but it can still suck less
than the alternatives, and we have evidence that for at least one
user, it was worth a major version upgrade just to replace the
suckitude they had with the suckitude this patch creates.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Fri, Jun 26, 2020 at 5:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> My opinion is that there's no need to change the code in the
> back-branches, and that I don't really like the approach in master
> either.

I guess it's hard to see a way that we could fix this in the back
branches, provided we aren't willing to tolerate a big refactor,
or a cleanup scan of the index (note that I mean btvacuumcleanup(),
not btvacuumscan(), which is quite different).

> I think what we're saying is that there is no worse consequence to
> turning off index_cleanup than some bloat that isn't likely to be
> recovered unless you REINDEX.

That's true.

> Now, what about master? I think it's fine to offer the AM a callback
> even when index_cleanup = false, for example so that it can freeze
> something in its metapage, but I don't agree with passing it the TIDs.
> That seems like it's just inviting it to ignore the emergency brake,
> and it's also incurring real overhead, because remembering all those
> TIDs can use a lot of memory.

You don't have to do anything with TIDs passed from vacuumlazy.c to
recycle pages that need to be recycled, since you only have to go
through btvacuumcleanup() to avoid the problem that we're talking
about (you don't have to call btvacuumscan() to kill TIDs that
vacuumlazy.c will have pruned). Killing TIDs/tuples in the index was
never something that would make sense, even within the confines of the
existing flawed nbtree recycling design. However, you do need to scan
the entire index to do that much. FWIW, that doesn't seem like it
*totally* violates the spirit of "index_cleanup = false", since you're
still not doing most of the usual nbtree vacuuming stuff (even though
you have to scan the index, there is still much less work total).

> If that API limitation causes a problem
> for some future index AM, that will be a good point to discuss when
> the patch for said AM is submitted for review. I entirely agree with
> you that the way btree arranges for btree recycling is crude, and I
> would be delighted if you want to improve it, either for v14 or for
> any future release, or if somebody else wants to do so. However, even
> if that never happens, so what?

I think that it's important to be able to describe an ideal (though
still realistic) design, even if it might remain aspirational for a
long time. I suspect that pushing the mechanism down into index AMs
has other non-obvious benefits.

> In retrospect, I regret committing this patch without better
> understanding the issues in this area. That was a fail on my part. At
> the same time, it doesn't really sound like the issues are all that
> bad. The potential index bloat does suck, but it can still suck less
> than the alternatives, and we have evidence that for at least one
> user, it was worth a major version upgrade just to replace the
> suckitude they had with the suckitude this patch creates.

I actually agree -- this is a really important feature, and I'm glad
that we have it. Even in this slightly flawed form. I remember a great
need for the feature back when I was involved in supporting Postgres
in production.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Masahiko Sawada
Date:
On Sat, 27 Jun 2020 at 08:00, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Jun 26, 2020 at 5:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > My opinion is that there's no need to change the code in the
> > back-branches, and that I don't really like the approach in master
> > either.
>
> I guess it's hard to see a way that we could fix this in the
> backbranches, provided we aren't willing to tolerate a big refactor,
> or a cleanup scan of the index (note that I mean btvacuumcleanup(),
> not btvacuumscan(), which is quite different).

Agreed.

>
> > I think what we're saying is that there is no worse consequence to
> > turning off index_cleanup than some bloat that isn't likely to be
> > recovered unless you REINDEX.
>
> That's true.

Regarding the extent of the impact, this bug will mainly affect users
who have kept vacuum_index_cleanup off, or who have manually run VACUUM
with INDEX_CLEANUP off, for a long time after some earlier vacuums. On
the other hand, users who use INDEX_CLEANUP off only as a one-off, or
who turn vacuum_index_cleanup off for a table from the start, would be
less affected or not affected at all.

>
> > In retrospect, I regret committing this patch without better
> > understanding the issues in this area. That was a fail on my part. At
> > the same time, it doesn't really sound like the issues are all that
> > bad. The potential index bloat does suck, but it can still suck less
> > than the alternatives, and we have evidence that for at least one
> > user, it was worth a major version upgrade just to replace the
> > suckitude they had with the suckitude this patch creates.
>
> I actually agree -- this is a really important feature, and I'm glad
> that we have it. Even in this slightly flawed form. I remember a great
> need for the feature back when I was involved in supporting Postgres
> in production.

I apologize for writing this patch without enough consideration. I
should have been more careful, as I had learned about the nbtree page
recycling strategy when discussing the
vacuum_cleanup_index_scale_factor patch.

Regards,

-- 
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
> Regarding to the extent of the impact, this bug will affect the user
> who turned vacuum_index_cleanup off or executed manually vacuum with
> INDEX_CLEANUP off for a long time, after some vacuums. On the other
> hand, the user who uses INDEX_CLEANUP off on the spot or turns
> vacuum_index_cleanup off of the table from the start would not be
> affected or less affected.

I don't think that it's likely to cause too much trouble. It's already
possible to leak deleted pages, if only because the FSM isn't crash
safe. Actually, the nbtree README says this, and has since 2003:

"""
(Note: if we find a deleted page with an extremely old transaction
number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
a later xid wraparound can't cause us to think the page is unreclaimable.
But in more normal situations this would be a waste of a disk write.)
"""

But, uh, isn't the btvacuumcleanup() call supposed to avoid
wraparound? Who knows?!

It doesn't seem like the recycling aspect of page deletion was
rigorously designed, possibly because it's harder to test than page
deletion itself. This is a problem that we should fix.

> I apologize for writing this patch without enough consideration. I
> should have been more careful as I learned the nbtree page recycle
> strategy when discussing  vacuum_cleanup_index_scale_factor patch.

While it's unfortunate that this was missed, let's not lose
perspective. Anybody using the INDEX_CLEANUP feature (whether it's
through a direct VACUUM, or by using the reloption) is already asking
for an extreme behavior: skipping regular index vacuuming. I imagine
that the vast majority of users that are in that position just don't
care about the possibility of leaking deleted pages. They care about
avoiding a real disaster from XID wraparound.

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Masahiko Sawada
Date:
On Sun, 28 Jun 2020 at 02:44, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada
> <masahiko.sawada@2ndquadrant.com> wrote:
> > Regarding to the extent of the impact, this bug will affect the user
> > who turned vacuum_index_cleanup off or executed manually vacuum with
> > INDEX_CLEANUP off for a long time, after some vacuums. On the other
> > hand, the user who uses INDEX_CLEANUP off on the spot or turns
> > vacuum_index_cleanup off of the table from the start would not be
> > affected or less affected.
>
> I don't think that it's likely to cause too much trouble. It's already
> possible to leak deleted pages, if only because the FSM isn't crash
> safe. Actually, the nbtree README says this, and has since 2003:
>
> """
> (Note: if we find a deleted page with an extremely old transaction
> number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
> a later xid wraparound can't cause us to think the page is unreclaimable.
> But in more normal situations this would be a waste of a disk write.)
> """
>
> But, uh, isn't the btvacuumcleanup() call supposed to avoid
> wraparound? Who knows?!
>
> It doesn't seem like the recycling aspect of page deletion was
> rigorously designed, possibly because it's harder to test than page
> deletion itself. This is a problem that we should fix.

Agreed.

>
> > I apologize for writing this patch without enough consideration. I
> > should have been more careful as I learned the nbtree page recycle
> > strategy when discussing  vacuum_cleanup_index_scale_factor patch.
>
> While it's unfortunate that this was missed, let's not lose
> perspective. Anybody using the INDEX_CLEANUP feature (whether it's
> through a direct VACUUM, or by using the reloption) is already asking
> for an extreme behavior: skipping regular index vacuuming. I imagine
> that the vast majority of users that are in that position just don't
> care about the possibility of leaking deleted pages. They care about
> avoiding a real disaster from XID wraparound.

For the back branches, I'm considering how we let users know about
this. For safety, we could recommend in the documentation and/or the
release notes that users avoid INDEX_CLEANUP false unless it's
necessary to avoid running out of XIDs. But on the other hand, since
leaking recyclable pages is already possible, as you mentioned, I'm
concerned that this could confuse users and might needlessly cause
alarm. I'm thinking about what we can do for users, in addition to
leaving a summary of this discussion as a source code comment. What do
you think?

Regards,

--
Masahiko Sawada            http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Masahiko Sawada
Date:
On Mon, Jun 29, 2020 at 9:51 PM Masahiko Sawada
<masahiko.sawada@2ndquadrant.com> wrote:
>
> On Sun, 28 Jun 2020 at 02:44, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada
> > <masahiko.sawada@2ndquadrant.com> wrote:
> > > Regarding to the extent of the impact, this bug will affect the user
> > > who turned vacuum_index_cleanup off or executed manually vacuum with
> > > INDEX_CLEANUP off for a long time, after some vacuums. On the other
> > > hand, the user who uses INDEX_CLEANUP off on the spot or turns
> > > vacuum_index_cleanup off of the table from the start would not be
> > > affected or less affected.
> >
> > I don't think that it's likely to cause too much trouble. It's already
> > possible to leak deleted pages, if only because the FSM isn't crash
> > safe. Actually, the nbtree README says this, and has since 2003:
> >
> > """
> > (Note: if we find a deleted page with an extremely old transaction
> > number, it'd be worthwhile to re-mark it with FrozenTransactionId so that
> > a later xid wraparound can't cause us to think the page is unreclaimable.
> > But in more normal situations this would be a waste of a disk write.)
> > """
> >
> > But, uh, isn't the btvacuumcleanup() call supposed to avoid
> > wraparound? Who knows?!
> >
> > It doesn't seem like the recycling aspect of page deletion was
> > rigorously designed, possibly because it's harder to test than page
> > deletion itself. This is a problem that we should fix.
>
> Agreed.
>
> >
> > > I apologize for writing this patch without enough consideration. I
> > > should have been more careful as I learned the nbtree page recycle
> > > strategy when discussing  vacuum_cleanup_index_scale_factor patch.
> >
> > While it's unfortunate that this was missed, let's not lose
> > perspective. Anybody using the INDEX_CLEANUP feature (whether it's
> > through a direct VACUUM, or by using the reloption) is already asking
> > for an extreme behavior: skipping regular index vacuuming. I imagine
> > that the vast majority of users that are in that position just don't
> > care about the possibility of leaking deleted pages. They care about
> > avoiding a real disaster from XID wraparound.
>
> For back branches, I'm considering how we let users know about this.
> For safety, we can let users know that we recommend avoiding
> INDEX_CLEANUP false unless it's necessary to avoid running out of XIDs
> on the documentation and/or the release note. But on the other hand,
> since there is the fact that leaving recyclable pages is already
> possible to happen as you mentioned I'm concerned it gets the user
> into confusion and might needlessly incite unrest of users. I'm
> thinking what we can do for users, in addition to leaving the summary
> of this discussion as a source code comment. What do you think?
>

Several months have passed since that discussion. We decided not to do
anything in the back branches, but IIUC the fundamental issue is not
fixed yet. The point raised by Andres -- that we should leave it to the
index AM to decide whether to do vacuum cleanup when INDEX_CLEANUP is
specified -- is still valid. Is that right?

For HEAD, there was a discussion about changing lazy vacuum and the
bulkdelete and vacuumcleanup APIs so that these APIs are called even
when INDEX_CLEANUP false is specified. That is, when INDEX_CLEANUP
false is specified, lazy vacuum still collects dead tuple TIDs into
maintenance_work_mem and passes a flag to the index AMs indicating
whether INDEX_CLEANUP was specified or not. Each index AM then decides
whether to do bulkdelete/vacuumcleanup. A downside of this idea is that
we will end up using maintenance_work_mem even if none of the table's
index AMs do bulkdelete/vacuumcleanup at all.
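
To illustrate the first idea, here is a rough sketch (the
"index_cleanup" field and the surrounding details are just placeholders
for discussion, not an actual patch):

/* Rough sketch, placeholder names only */

/* vacuumlazy.c always collects dead TIDs and always calls the index
 * AMs, but tells them whether INDEX_CLEANUP was disabled: */
ivinfo.index_cleanup = (params->index_cleanup != VACOPT_TERNARY_DISABLED);

indstats[i] = index_bulk_delete(&ivinfo, indstats[i],
                                lazy_tid_reaped, (void *) vacrelstats);

/* An AM such as nbtree could then decide for itself in btbulkdelete():
 * skip the physical scan when cleanup was not requested, unless it has
 * its own reason (e.g. too-old btpo.xact values) not to skip. */
if (!info->index_cleanup)
    return stats;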

The second idea I came up with is to add a new index AM API (say,
amcanskipindexcleanup = true/false) telling whether index cleanup is
skippable or not. Lazy vacuum checks this flag for each index on the
table before starting. If index cleanup is skippable for all indexes,
it can choose one-pass vacuum, meaning there is no need to collect dead
tuple TIDs in maintenance_work_mem. All in-core index AMs will set it
to true. Perhaps it should be true (skippable) by default, for backward
compatibility.
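
The second idea would look something like this (again, only a sketch
with a placeholder name; "amcanskipindexcleanup" does not exist today):

/* In each AM's handler function, e.g. bthandler(): */
amroutine->amcanskipindexcleanup = true;

/* In vacuumlazy.c, before deciding whether to collect dead tuple TIDs: */
bool        all_indexes_skippable = true;

for (i = 0; i < nindexes; i++)
{
    if (!Irel[i]->rd_indam->amcanskipindexcleanup)
    {
        all_indexes_skippable = false;
        break;
    }
}

/* Only when every index is skippable (and INDEX_CLEANUP is off) can we
 * use the one-pass strategy that never collects dead tuple TIDs. */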

The in-core AMs, including btree indexes, will work the same as
before. This fix makes the behavior more desirable, and could also help
other AMs that require vacuumcleanup to be called in all cases. Once we
fix it, I wonder if we can disable index cleanup during autovacuum's
anti-wraparound vacuums.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Thu, Nov 19, 2020 at 5:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> Several months passed from the discussion. We decided not to do
> anything on back branches but IIUC the fundamental issue is not fixed
> yet. The issue pointed out by Andres that we should leave the index AM
> to decide whether doing vacuum cleanup or not when INDEX_CLEANUP is
> specified is still valid. Is that right?

I don't remember if Andres actually said that publicly, but I
definitely did. I do remember discussing this with him privately, at
which point he agreed with what I said. Which you just summarized
well.

> For HEAD, there was a discussion that we change lazy vacuum and
> bulkdelete and vacuumcleanup APIs so that it calls these APIs even
> when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
> specified, it collects dead tuple TIDs into maintenance_work_mem space
> and passes the flag indicating INDEX_CLEANUP is specified or not to
> index AMs.

Right.

> Index AM decides whether doing bulkdelete/vacuumcleanup. A
> downside of this idea would be that we will end up using
> maintenance_work_mem even if all index AMs of the table don't do
> bulkdelete/vacuumcleanup at all.

That is a downside, but I don't think that it's a serious downside.
But it may not matter, because there are lots of reasons to move in
this direction.

> The second idea I came up with is to add an index AM API  (say,
> amcanskipindexcleanup = true/false) telling index cleanup is skippable
> or not. Lazy vacuum checks this flag for each index on the table
> before starting. If index cleanup is skippable in all indexes, it can
> choose one-pass vacuum, meaning no need to collect dead tuple TIDs in
> maintenance_work_mem. All in-core index AM will set to true. Perhaps
> it’s true (skippable) by default for backward compatibility.

(The terminology here is very confusing, because the goal of the
INDEX_CLEANUP feature in v12 is not really to skip a call to
btvacuumcleanup(). The goal is really to skip a call to
btbulkdelete(). I will try to be careful.)

I think that the ideal design here would be a new hybrid of two
existing features:

1.) Your INDEX_CLEANUP feature from Postgres 12.

and:

2.) Your vacuum_cleanup_index_scale_factor feature from Postgres 11.

The INDEX_CLEANUP feature is very useful, because skipping indexes
entirely can be very helpful for many reasons (e.g. faster manual
VACUUM in the event of wraparound related emergencies).  But
INDEX_CLEANUP has 2 problems today:

A. It doesn't interact well with vacuum_cleanup_index_scale_factor.
This is the problem that has been discussed on this thread.

and:

B. It is an "all or nothing" thing. Unlike the
vacuum_cleanup_index_scale_factor feature, it does not care about what
the index AM/individual index wants. But it should.

(***Thinks some more***)

Actually, on second thought, maybe INDEX_CLEANUP only has one problem.
Problem A is actually just a special case of problem B. There are many
interesting opportunities created by solving problem B
comprehensively.

So, what useful enhancements to VACUUM are possible once we have
something like INDEX_CLEANUP, that is sensitive to the needs of
indexes? Well, you already identified one yourself, so obviously
you're thinking about this in a similar way already:

> The in-core AMs including btree indexes will work same as before. This
> fix is to make it more desirable behavior and possibly to help other
> AMs that require to call vacuumcleanup in all cases. Once we fix it I
> wonder if we can disable index cleanup when autovacuum’s
> anti-wraparound vacuum.

Obviously this is a good idea. The fact that anti-wraparound vacuum
isn't really special compared to regular autovacuum is *bad*.
Obviously anti-wraparound is in some sense more important than regular
vacuum. Making it as similar as possible to vacuum simply isn't
helpful. Maybe it is slightly more elegant in theory, but in the real
world it is a poor design. (See also: every single PostgreSQL post
mortem that has ever been written.)

But why stop there? There are other big advantages to allowing
individual indexes/index AMs to influence the INDEX_CLEANUP behavior.
Especially if they're sensitive to the needs of particular indexes on
a table (not just all of the indexes on the table taken together).

As you may know, my bottom-up index deletion patch can more or less
eliminate index bloat in indexes that don't get "logically changed" by
many non-HOT updates. It's *very* effective with non-HOT updates and
lots of indexes. See this benchmark result for a recent example:

https://postgr.es/m/CAGnEbohYF_K6b0v=2uc289=v67qNhc3n01Ftic8X94zP7kKqtw@mail.gmail.com

The feature is effective enough to make it almost unnecessary to
VACUUM certain indexes -- though not necessarily other indexes on the
same table. Of course, in the long term it will eventually be
necessary to really vacuum these indexes. Not because the indexes
themselves care, though -- they really don't (if they don't receive
logical changes from non-HOT updates, and so benefit very well from
the proposed bottom-up index deletion mechanism, they really have no
selfish reason to care if they ever get vacuumed by autovacuum).

The reason we eventually need to call ambulkdelete() with these
indexes (with all indexes, actually) even with these enhancements is
related to the heap. We eventually want to make LP_DEAD line pointers
in the heap LP_UNUSED. But we should be lazy about it, and wait until
it becomes a real problem. Maybe we can only do a traditional VACUUM
(with a second pass of the heap for heap LP_UNUSED stuff) much much
less frequently than today. At the same time, we can set the FSM for
index-only scans much more frequently.

It's also important that we really make index vacuuming a per-index
thing. You can see this in the example benchmark I linked to, which
was posted by Victor: no page splits in one never-logically-modified
index, and some page splits in other indexes that were actually
changed by UPDATEs again and again. Clearly you can have several
different indexes on the same table that have very different needs.

With some indexes we want to be extra lazy (these are indexes that
make good use of bottom-up deletion). But with other indexes on the
same table we want to be eager. Maybe even very eager. If we can make
per-index decisions, then every individual part of the system works
well. Currently, the problem with autovacuum scheduling is that it
probably makes sense for the heap with the defaults (or something like
them), and probably doesn't make any sense for indexes (though it also
varies among indexes). So today we have maybe 7 different things
(maybe 6 indexes + 1 table), and we pretend that they are only one
thing. It's just a fantasy. The reality is that we have 7 things that
have only a very loose and complicated relationship to each other. We
need to stop believing in this fantasy, and start paying attention to
the more complicated reality. The only way to do that is ask each
index directly, while being prepared to get very different answers
from each index on the same table.

Here is what I mean by that: it would also probably be very useful to
do something like a ambulkdelete() call for only a subset of indexes
that really need it. So you aggressively vacuum the one index that
really does get logically modified by an UPDATE, and not the other 6
that don't. (Of course it's still true that we cannot have a second
heap pass to make LP_DEAD line pointers in the heap LP_UNUSED --
obviously that's unsafe unless we're 100% sure that nothing in any
index points to the now-LP_UNUSED line pointer. But many big
improvements are possible without violating this basic invariant.)

If you are able to pursue this project, in whole or in part, I would
definitely be supportive of that. I may be able to commit it. I think
that this project has many benefits, not just one or two. It seems
strategic.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Masahiko Sawada
Date:
On Fri, Nov 20, 2020 at 12:56 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Nov 19, 2020 at 5:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > Several months passed from the discussion. We decided not to do
> > anything on back branches but IIUC the fundamental issue is not fixed
> > yet. The issue pointed out by Andres that we should leave the index AM
> > to decide whether doing vacuum cleanup or not when INDEX_CLEANUP is
> > specified is still valid. Is that right?
>
> I don't remember if Andres actually said that publicly, but I
> definitely did. I do remember discussing this with him privately, at
> which point he agreed with what I said. Which you just summarized
> well.
>
> > For HEAD, there was a discussion that we change lazy vacuum and
> > bulkdelete and vacuumcleanup APIs so that it calls these APIs even
> > when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
> > specified, it collects dead tuple TIDs into maintenance_work_mem space
> > and passes the flag indicating INDEX_CLEANUP is specified or not to
> > index AMs.
>
> Right.
>
> > Index AM decides whether doing bulkdelete/vacuumcleanup. A
> > downside of this idea would be that we will end up using
> > maintenance_work_mem even if all index AMs of the table don't do
> > bulkdelete/vacuumcleanup at all.
>
> That is a downside, but I don't think that it's a serious downside.
> But it may not matter, because there are lots of reasons to move in
> this direction.
>
> > The second idea I came up with is to add an index AM API  (say,
> > amcanskipindexcleanup = true/false) telling index cleanup is skippable
> > or not. Lazy vacuum checks this flag for each index on the table
> > before starting. If index cleanup is skippable in all indexes, it can
> > choose one-pass vacuum, meaning no need to collect dead tuple TIDs in
> > maintenance_work_mem. All in-core index AM will set to true. Perhaps
> > it’s true (skippable) by default for backward compatibility.
>
> (The terminology here is very confusing, because the goal of the
> INDEX_CLEANUP feature in v12 is not really to skip a call to
> btvacuumcleanup(). The goal is really to skip a call to
> btbulkdelete(). I will try to be careful.)
>
> I think that the ideal design here would be a new hybrid of two
> existing features:
>
> 1.) Your INDEX_CLEANUP feature from Postgres 12.
>
> and:
>
> 2.) Your vacuum_cleanup_index_scale_factor feature from Postgres 11.
>
> The INDEX_CLEANUP feature is very useful, because skipping indexes
> entirely can be very helpful for many reasons (e.g. faster manual
> VACUUM in the event of wraparound related emergencies).  But
> INDEX_CLEANUP has 2 problems today:
>
> A. It doesn't interact well with vacuum_cleanup_index_scale_factor.
> This is the problem that has been discussed on this thread.
>
> and:
>
> B. It is an "all or nothing" thing. Unlike the
> vacuum_cleanup_index_scale_factor feature, it does not care about what
> the index AM/individual index wants. But it should.

Agreed.

>
> (**Thinks some more***)
>
> Actually, on second thought, maybe INDEX_CLEANUP only has one problem.
> Problem A is actually just a special case of problem B. There are many
> interesting opportunities created by solving problem B
> comprehensively.
>
> So, what useful enhancements to VACUUM are possible once we have
> something like INDEX_CLEANUP, that is sensitive to the needs of
> indexes? Well, you already identified one yourself, so obviously
> you're thinking about this in a similar way already:
>
> > The in-core AMs including btree indexes will work same as before. This
> > fix is to make it more desirable behavior and possibly to help other
> > AMs that require to call vacuumcleanup in all cases. Once we fix it I
> > wonder if we can disable index cleanup when autovacuum’s
> > anti-wraparound vacuum.
>
> Obviously this is a good idea. The fact that anti-wraparound vacuum
> isn't really special compared to regular autovacuum is *bad*.
> Obviously anti-wraparound is in some sense more important than regular
> vacuum. Making it as similar as possible to vacuum simply isn't
> helpful. Maybe it is slightly more elegant in theory, but in the real
> world it is a poor design. (See also: every single PostgreSQL post
> mortem that has ever been written.)
>
> But why stop with this? There are other big advantages to allowing
> individual indexes/index AMs influence of the INDEX_CLEANUP behavior.
> Especially if they're sensitive to the needs of particular indexes on
> a table (not just all of the indexes on the table taken together).
>
> As you may know, my bottom-up index deletion patch can more or less
> eliminate index bloat in indexes that don't get "logically changed" by
> many non-HOT updates. It's *very* effective with non-HOT updates and
> lots of indexes. See this benchmark result for a recent example:
>
> https://postgr.es/m/CAGnEbohYF_K6b0v=2uc289=v67qNhc3n01Ftic8X94zP7kKqtw@mail.gmail.com
>
> The feature is effective enough to make it almost unnecessary to
> VACUUM certain indexes -- though not necessarily other indexes on the
> same table. Of course, in the long term it will eventually be
> necessary to really vacuum these indexes. Not because the indexes
> themselves care, though -- they really don't (if they don't receive
> logical changes from non-HOT updates, and so benefit very well from
> the proposed bottom-up index deletion mechanism, they really have no
> selfish reason to care if they ever get vacuumed by autovacuum).
>
> The reason we eventually need to call ambulkdelete() with these
> indexes (with all indexes, actually) even with these enhancements is
> related to the heap. We eventually want to make LP_DEAD line pointers
> in the heap LP_UNUSED. But we should be lazy about it, and wait until
> it becomes a real problem. Maybe we can only do a traditional VACUUM
> (with a second pass of the heap for heap LP_UNUSED stuff) much much
> less frequently than today. At the same time, we can set the FSM for
> index-only scans much more frequently.
>
> It's also important that we really make index vacuuming a per-index
> thing. You can see this in the example benchmark I linked to, which
> was posted by Victor: no page splits in one never-logically-modified
> index, and some page splits in other indexes that were actually
> changed by UPDATEs again and again. Clearly you can have several
> different indexes on the same table that have very different needs.
>
> With some indexes we want to be extra lazy (these are indexes that
> make good use of bottom-up deletion). But with other indexes on the
> same table we want to be eager. Maybe even very eager. If we can make
> per-index decisions, then every individual part of the system works
> well. Currently, the problem with autovacuum scheduling is that it
> probably makes sense for the heap with the defaults (or something like
> them), and probably doesn't make any sense for indexes (though it also
> varies among indexes). So today we have maybe 7 different things
> (maybe 6 indexes + 1 table), and we pretend that they are only one
> thing. It's just a fantasy. The reality is that we have 7 things that
> have only a very loose and complicated relationship to each other. We
> need to stop believing in this fantasy, and start paying attention to
> the more complicated reality. The only way to do that is ask each
> index directly, while being prepared to get very different answers
> from each index on the same table.
>
> Here is what I mean by that: it would also probably be very useful to
> do something like a ambulkdelete() call for only a subset of indexes
> that really need it. So you aggressively vacuum the one index that
> really does get logically modified by an UPDATE, and not the other 6
> that don't. (Of course it's still true that we cannot have a second
> heap pass to make LP_DEAD line pointers in the heap LP_UNUSED --
> obviously that's unsafe unless we're 100% sure that nothing in any
> index points to the now-LP_UNUSED line pointer. But many big
> improvements are possible without violating this basic invariant.)

I had missed your bottom-up index deletion patch but it's a promising
improvement. With that patch, the number of dead tuples in individual
indexes may differ. So it's important that we make index vacuuming a
per-index thing.

Given that patch, it seems to me that it would be better to ask each
individual index AM about its need for bulkdelete before calling it.
That is, by passing the VACUUM options, the number of collected dead
tuples, etc. to the index AM, we ask it via a new index AM API whether
it wants to do bulkdelete or not. We call bulkdelete only for indexes
that answered 'yes'. If we get 'no' from any one of the indexes, we
cannot have a second heap pass. INDEX_CLEANUP is not enforceable: when
INDEX_CLEANUP is set to false, we expect index AMs to return 'no'
unless they have a special reason to need bulkdelete.
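
In terms of code, I'm thinking of something like the following (all
names, including the wants_bulkdelete array, are provisional):

/* Provisional sketch -- names and signature are not final */
typedef bool (*amwantsbulkdelete_function) (IndexVacuumInfo *info,
                                            double num_dead_tuples);

/* vacuumlazy.c: ask every index before the index vacuuming pass */
bool        can_do_heap_clean = true;

for (i = 0; i < nindexes; i++)
{
    wants_bulkdelete[i] = Irel[i]->rd_indam->amwantsbulkdelete(&ivinfo[i],
                                                               num_dead_tuples);
    if (!wants_bulkdelete[i])
        can_do_heap_clean = false;  /* this index keeps its TIDs, so the
                                     * LP_DEAD line pointers must remain */
}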

One possible benefit of this idea, even without the bottom-up index
deletion patch, would be something like a
vacuum_index_cleanup_scale_factor for bulkdelete. For example, when the
number of dead tuples is slightly larger than what fits in
maintenance_work_mem, the second call to bulkdelete is made with only a
small number of dead tuples, which is less efficient. If an index AM
can decide not to do bulkdelete by comparing the number of dead tuples
to a threshold, it can avoid such a bulkdelete call.

Also, as a future extension, once we have a retail index deletion
feature, we might be able to make that API return a ternary value --
'no', 'do_bulkdelete', or 'do_indexscandelete' -- so that the index AM
can choose the appropriate method of index deletion based on the
statistics.
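
Such a ternary answer might simply be an enum along these lines (names
are provisional):

/* Provisional sketch of a ternary answer from the index AM */
typedef enum AmVacuumDeleteMode
{
    AMVAC_NO_DELETE,        /* 'no': skip index vacuuming this cycle */
    AMVAC_BULKDELETE,       /* 'do_bulkdelete': full ambulkdelete() scan */
    AMVAC_INDEXSCANDELETE   /* 'do_indexscandelete': retail TID deletion */
} AmVacuumDeleteMode;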

But to make index vacuuming a per-index thing, we need to deal with
the problem that we cannot know which indexes of the table still have
index tuples pointing to the collected dead tuples. For example, if an
index always says 'no' (it doesn't need bulkdelete, and therefore we
need to keep the dead line pointers), the collected dead tuples might
already be marked as LP_DEAD, and the other indexes might no longer
have index tuples pointing to them. In that case we don't want to call
bulkdelete for those other indexes. Probably we can keep additional
statistics, like the number of dead tuples in individual indexes, so
that each index can determine its need for bulkdelete. But that's not a
comprehensive solution.

>
> If you are able to pursue this project, in whole or in part, I would
> definitely be supportive of that. I may be able to commit it. I think
> that this project has many benefits, not just one or two. It seems
> strategic.

Thanks, that’s really helpful. I’m going to work on that. Since things
became complicated because of these two features that I proposed, I’ll
do my best to sort out the problem and improve it in PG14.

Regards,

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Robert Haas
Date:
On Thu, Nov 19, 2020 at 8:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> For HEAD, there was a discussion that we change lazy vacuum and
> bulkdelete and vacuumcleanup APIs so that it calls these APIs even
> when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
> specified, it collects dead tuple TIDs into maintenance_work_mem space
> and passes the flag indicating INDEX_CLEANUP is specified or not to
> index AMs. Index AM decides whether doing bulkdelete/vacuumcleanup. A
> downside of this idea would be that we will end up using
> maintenance_work_mem even if all index AMs of the table don't do
> bulkdelete/vacuumcleanup at all.
>
> The second idea I came up with is to add an index AM API  (say,
> amcanskipindexcleanup = true/false) telling index cleanup is skippable
> or not. Lazy vacuum checks this flag for each index on the table
> before starting. If index cleanup is skippable in all indexes, it can
> choose one-pass vacuum, meaning no need to collect dead tuple TIDs in
> maintenance_work_mem. All in-core index AM will set to true. Perhaps
> it’s true (skippable) by default for backward compatibility.
>
> The in-core AMs including btree indexes will work same as before. This
> fix is to make it more desirable behavior and possibly to help other
> AMs that require to call vacuumcleanup in all cases. Once we fix it I
> wonder if we can disable index cleanup when autovacuum’s
> anti-wraparound vacuum.

It (still) doesn't seem very sane to me to have an index that requires
cleanup in all cases. I mean, VACUUM could error or be killed just
before the index cleanup phase happens anyway, so it's not like an
index AM can licitly depend on getting called just because we visited
the heap. It could, of course, depend on getting called before
relfrozenxid is advanced, or before the heap's dead line pointers are
marked unused, or something like that, but it can't just be like, hey,
you have to call me.

I think this whole discussion is to some extent a product of the
contract between the index AM and the table AM being more than
slightly unclear. Maybe we need to clear up the definitional problems
first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Fri, Nov 20, 2020 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> I had missed your bottom-up index deletion patch but it's a promising
> improvement. With that patch, the number of dead tuples in individual
> indexes may differ. So it's important that we make index vacuuming a
> per-index thing.

Right, but it's important to remember that even bottom-up index
deletion isn't really special in theory. Bottom-up index deletion is
"just" a reliable version of the existing LP_DEAD index deletion thing
(which has been around since Postgres 8.2). In theory it doesn't
change the fundamental nature of the problem. In practice it does,
because it makes it very obvious to pgsql-hackers that indexes on the
same table can have very different needs from VACUUM. And the actual
differences we see are much bigger now. Even still, the fact that you
had certain big differences across indexes on the same table is not a
new thing. (Actually, you can even see this on the master branch in
Victor's bottom-up deletion benchmark, where the primary key index
actually doesn't grow on the master branch, even after 8 hours.)

The bottom-up index deletion patch (and the enhancements we're talking
about here, for VACUUM itself) are based on "the generational
hypothesis" that underlies generational garbage collection. The
philosophy is the same. See:

https://plumbr.io/handbook/garbage-collection-in-java/generational-hypothesis

In theory, "most garbage comes from new objects" is "just" an
empirical observation, that may or may not be true with each
workload/Java program/whatever. In practice it is important enough to
be a big part of how every modern garbage collector works -- it's
almost always true, and even when it isn't true it doesn't actually
hurt to make the assumption that it is true and then be wrong. I
believe that we have to take a holistic view of the problem to make
real progress.

Andres said something similar in a recent blog post:


https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon

"In most workloads the majority of accesses are to live tuples, and
when encountering non-live tuple versions they are either very old, or
very new."

(This was just a coincidence, but it was good to see that he made the
same observation.)

> Given that patch, it seems to me that it would be better to ask
> individual index AM before calling to bulkdelete about the needs of
> bulkdelete. That is, passing VACUUM options and the number of
> collected dead tuples etc. to index AM, we ask index AM via a new
> index AM API whether it wants to do bulkdelete or not. We call
> bulkdelete for only indexes that answered 'yes'. If we got 'no' from
> any one of the indexes, we cannot have a second heap pass.
> INDEX_CLEANUP is not enforceable. When INDEX_CLEANUP is set to false,
> we expect index AMs to return 'no' unless they have a special reason
> for the needs of bulkdelete.

I don't have a very detailed idea of the interface or anything. There
are a few questions that naturally present themselves, that I don't
have good answers to right now. Obviously vacuumlazy.c will only treat
this feedback from each index as an advisory thing. So what happens
when 50% of the indexes say yes and 50% say no? This is a subproblem
that must be solved as part of this work. Ideally it will be solved by
you.  :-)

> One possible benefit of this idea even without bottom-up index
> deleteion patch would be something like
> vacuum_index_cleanup_scale_factor for bulkdelete. For example, in the
> case where the amount of dead tuple is slightly larger than
> maitenance_work_mem the second time calling to bulkdelete will be
> called with a small amount of dead tuples, which is less efficient. If
> an index AM is able to determine not to do bulkdelete by comparing the
> number of dead tuples to a threshold, it can avoid such bulkdelete
> calling.

I agree. Actually, I thought the same thing myself, even before I
realized that bottom-up index deletion was possible.

> Also, as a future extension, once we have retail index deletion
> feature, we might be able to make that API return a ternary value:
> 'no', 'do_bulkdelete', ‘do_indexscandelete, so that index AM can
> choose the appropriate method of index deletion based on the
> statistics.

I agree again!

We may eventually be able to make autovacuum run very frequently
against each table in many important cases, with each VACUUM taking
very little wall-clock time. We don't have to change the fundamental
design to fix most of the current problems. I suspect that the "top
down" nature of VACUUM is sometimes helpful. We just need to
compensate when this design is insufficient. Getting the "best of both
worlds" is possible.

> But for making index vacuuming per-index thing, we need to deal with
> the problem that we cannot know which indexes of the table still has
> index tuples pointing to the collected dead tuple. For example, if an
> index always says 'no' (not need bulkdelete therefore we need to keep
> dead line pointers), the collected dead tuples might already be marked
> as LP_DEAD and there might already not be index tuples pointing to
> them in other index AMs. In that case we don't want to call to
> bulkdelete for other indexes. Probably we can have additional
> statistics like the number of dead tuples in individual indexes so
> that they can determine the needs of bulkdelete. But it’s not a
> comprehensive solution.

Right. Maybe we don't ask the index AMs for discrete yes/no answers.
Maybe we can ask them for a continuous answer, such as a value between
0.0 and 1.0 that represents the urgency/bloat, or something like that.
And so the final yes/no answer that really does have to be made for
the table as a whole (does VACUUM do a second pass over the heap to
make LP_DEAD items into LP_UNUSED items?) can at least consider the
worst case for each index. And maybe the average case, too.

(I am just making random suggestions to stimulate discussion. Don't
take these specific suggestions about the am interface too seriously.)
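
But just to illustrate the shape of it (the names here are completely
made up):

/* Made-up names, just to illustrate the shape of the interface */
typedef double (*amvacuumurgency_function) (IndexVacuumInfo *info);

/* vacuumlazy.c could aggregate the per-index answers: */
double      worst_urgency = 0.0;

for (i = 0; i < nindexes; i++)
    worst_urgency = Max(worst_urgency,
                        Irel[i]->rd_indam->amvacuumurgency(&ivinfo[i]));

/* The final yes/no decision for the table as a whole (second heap pass
 * or not) can then weigh the worst-case index urgency against how badly
 * the LP_DEAD line pointers have accumulated in the heap. */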

> > If you are able to pursue this project, in whole or in part, I would
> > definitely be supportive of that. I may be able to commit it. I think
> > that this project has many benefits, not just one or two. It seems
> > strategic.
>
> Thanks, that’s really helpful. I’m going to work on that. Since things
> became complicated by these two features that I proposed I’ll do my
> best to sort out the problem and improve it in PG14.

Excellent! Thank you.

--
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Robert Haas
Date:
On Fri, Nov 20, 2020 at 2:37 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right. Maybe we don't ask the index AMs for discrete yes/no answers.
> Maybe we can ask them for a continuous answer, such as a value between
> 0.0 and 1.0 that represents the urgency/bloat, or something like that.
> And so the final yes/no answer that really does have to be made for
> the table as a whole (does VACUUM do a second pass over the heap to
> make LP_DEAD items into LP_UNUSED items?) can at least consider the
> worst case for each index. And maybe the average case, too.

That's an interesting idea. We should think about the needs of brin
indexes when designing something better than the current system. They
have the interesting property that the heap deciding to change LP_DEAD
to LP_UNUSED doesn't break anything even if nothing's been done to the
index, because they don't store TIDs anyway. So that's an example of
an index AM that might want to do some work to keep performance up,
but it's not actually required. This might be orthogonal to the
0.0-1.0 scale you were thinking about, but it might be good to factor
it into the thinking somehow.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Fri, Nov 20, 2020 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
> That's an interesting idea. We should think about the needs of brin
> indexes when designing something better than the current system. They
> have the interesting property that the heap deciding to change LP_DEAD
> to LP_UNUSED doesn't break anything even if nothing's been done to the
> index, because they don't store TIDs anyway. So that's an example of
> an index AM that might want to do some work to keep performance up,
> but it's not actually required. This might be orthogonal to the
> 0.0-1.0 scale you were thinking about, but it might be good to factor
> it into the thinking somehow.

I actually made this exact suggestion about BRIN myself, several years ago.

As I've said, it seems like it would be a good idea to ask the exact
same generic question of each index in turn (which is answered using
local smarts added to the index AM). Again, the question is: How
important is it that you get vacuumed now, from your own
narrow/selfish point of view? The way that BRIN answers this question
is not the novel thing about BRIN among other index access methods,
though. (Not that you claimed otherwise -- just framing the discussion
carefully.)

BRIN has no selfish reason to care if the table never gets to have its
LP_DEAD line pointers set to LP_UNUSED -- that's just not something
that it can be expected to understand directly. But all index access
methods should be thought of as not caring about this, because it's
just not their problem. (Especially with bottom-up index deletion, but
even without it.)

The interesting and novel thing about BRIN here is this: vacuumlazy.c
can be taught that a BRIN index alone is no reason to have to do a
second pass over the heap (to make the LP_DEAD/pruned-by-VACUUM line
pointers LP_UNUSED). A BRIN index never gets affected by the usual
considerations about the heapam invariant (the usual thing about TIDs
in an index not pointing to a line pointer that is at risk of being
recycled), which presents us with a unique-to-BRIN opportunity. Which
is exactly what you said.

(***Thinks some more***)

Actually, now I think that BRIN shouldn't be special to vacuumlazy.c
in any way. It doesn't make sense as part of this future world in
which index vacuuming can be skipped for individual indexes (which is
what I talked to Sawada-san about a little earlier in this thread).
Why should it be useful to exploit the "no-real-TIDs" property of BRIN
in this future world? It can only solve a problem that the main
enhancement is itself expected to solve without any special help from
BRIN (just the generic am callback that asks the same generic question
about index vacuuming urgency).

The only reason we press ahead with a second scan (the
LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table
problem. The bloat eventually gets out of hand *in the table*. We have
now conceptually decoupled the problems experienced in the table/heap
from the problems for each index (mostly), so this actually makes
sense. The theory behind AV scheduling becomes much closer to reality
-- by changing the reality! (The need to "prune the table to VACUUM
any one index" notwithstanding -- that's still necessary, of course,
but we still basically decouple table bloat from index bloat at the
conceptual level.)

Does that make sense?
-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Robert Haas
Date:
On Fri, Nov 20, 2020 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, now I think that BRIN shouldn't be special to vacuumlazy.c
> in any way. It doesn't make sense as part of this future world in
> which index vacuuming can be skipped for individual indexes (which is
> what I talked to Sawada-san about a little earlier in this thread).
> Why should it be useful to exploit the "no-real-TIDs" property of BRIN
> in this future world? It can only solve a problem that the main
> enhancement is itself expected to solve without any special help from
> BRIN (just the generic am callback that asks the same generic question
> about index vacuuming urgency).
>
> The only reason we press ahead with a second scan (the
> LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table
> problem. The bloat eventually gets out of hand *in the table*. We have
> now conceptually decoupled the problems experienced in the table/heap
> from the problems for each index (mostly), so this actually makes
> sense. The theory behind AV scheduling becomes much closer to reality
> -- by changing the reality! (The need to "prune the table to VACUUM
> any one index" notwithstanding -- that's still necessary, of course,
> but we still basically decouple table bloat from index bloat at the
> conceptual level.)
>
> Does that make sense?

I *think* so. For me the point is that the index never has a right to
insist on being vacuumed, but it can offer an opinion on how helpful
it would be.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Fri, Nov 20, 2020 at 2:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Does that make sense?
>
> I *think* so. For me the point is that the index never has a right to
> insist on being vacuumed, but it can offer an opinion on how helpful
> it would be.

Right, that might be the single most important point. It's a somewhat
more bottom-up direction for VACUUM, that is still fundamentally
top-down. Because that's still necessary.

Opportunistic heap pruning is usually very effective, so today we
realistically have these 4 byte line pointers accumulating in heap
pages. The corresponding "bloatum" in index pages is an index tuple +
line pointer (at least 16 bytes + 4 bytes). Meaning that we accumulate
*at least* 20 bytes in the indexes for each 4 bytes in the table. And, indexes
care about *where* items go, making the problem even worse. So in the
absence of index tuple LP_DEAD setting/deletion (or bottom-up index
deletion in Postgres 14), the problem in indexes is probably at least
5x worse.

The precise extent to which this is true will vary. It's a mistake to
try to reason about it at a high level, because there is just too much
variation for that approach to work. We should just give index access
methods *some* say. Sometimes this allows index vacuuming to be very
lazy, other times it allows index vacuuming to be very eager. Often
this variation exists among indexes on the same table.

Of course, vacuumlazy.c is still responsible for not letting the
accumulation of LP_DEAD heap line pointers get out of hand (without
allowing index TIDs to point to the wrong thing due to dangerous TID
recycling issues/bugs). The accumulation of LP_DEAD heap line pointers
will often take a very long time to get out of hand. But when it does
finally get out of hand, index access methods don't get to veto being
vacuumed. Because this isn't actually about their needs anymore.

Actually, the index access methods never truly veto anything. They
merely give some abstract signal about how urgent it is to them (like
the 0.0 - 1.0 thing). This difference actually matters. One index
among many on a table saying "meh, I guess I could benefit from some
index vacuuming if it's no real trouble to you vacuumlazy.c" rather
than saying "it's absolutely unnecessary, don't waste CPU cycles
vacuumlazy.c" may actually shift how vacuumlazy.c processes the heap
(at least occasionally). Maybe the high level VACUUM operation decides
that it is worth taking care of everything all at once -- if all the
indexes together either say "meh" or "now would be a good time", and
vacuumlazy.c then notices that the accumulation of LP_DEAD line
pointers is *also* becoming a problem (it's also a "meh" situation),
then it can be *more* ambitious. It can do a traditional VACUUM early.
Which might still make sense.

This also means that vacuumlazy.c would ideally think about this as an
optimization problem. It may be lazy or eager for the whole table,
just as it may be lazy or eager for individual indexes. (Though the
eagerness/laziness dynamic is probably much more noticeable with
indexes in practice.)

-- 
Peter Geoghegan



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Masahiko Sawada
Date:
On Sat, Nov 21, 2020 at 8:03 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Fri, Nov 20, 2020 at 2:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > Does that make sense?
> >
> > I *think* so. For me the point is that the index never has a right to
> > insist on being vacuumed, but it can offer an opinion on how helpful
> > it would be.
>
> Right, that might be the single most important point. It's a somewhat
> more bottom-up direction for VACUUM, that is still fundamentally
> top-down. Because that's still necessary.
>
> Opportunistic heap pruning is usually very effective, so today we
> realistically have these 4 byte line pointers accumulating in heap
> pages. The corresponding "bloatum" in index pages is an index tuple +
> line pointer (at least 16 bytes + 4 bytes). Meaning that we accumulate
> that *at least* 20 bytes for each 4 bytes in the table. And, indexes
> care about *where* items go, making the problem even worse. So in the
> absence of index tuple LP_DEAD setting/deletion (or bottom-up index
> deletion in Postgres 14), the problem in indexes is probably at least
> 5x worse.
>
> The precise extent to which this is true will vary. It's a mistake to
> try to reason about it at a high level, because there is just too much
> variation for that approach to work. We should just give index access
> methods *some* say. Sometimes this allows index vacuuming to be very
> lazy, other times it allows index vacuuming to be very eager. Often
> this variation exists among indexes on the same table.
>
> Of course, vacuumlazy.c is still responsible for not letting the
> accumulation of LP_DEAD heap line pointers get out of hand (without
> allowing index TIDs to point to the wrong thing due to dangerous TID
> recycling issues/bugs). The accumulation of LP_DEAD heap line pointers
> will often take a very long time to get out of hand. But when it does
> finally get out of hand, index access methods don't get to veto being
> vacuumed. Because this isn't actually about their needs anymore.
>
> Actually, the index access methods never truly veto anything. They
> merely give some abstract signal about how urgent it is to them (like
> the 0.0 - 1.0 thing). This difference actually matters. One index
> among many on a table saying "meh, I guess I could benefit from some
> index vacuuming if it's no real trouble to you vacuumlazy.c" rather
> than saying "it's absolutely unnecessary, don't waste CPU cycles
> vacuumlazy.c" may actually shift how vacuumlazy.c processes the heap
> (at least occasionally). Maybe the high level VACUUM operation decides
> that it is worth taking care of everything all at once -- if all the
> indexes together either say "meh" or "now would be a good time", and
> vacuumlazy.c then notices that the accumulation of LP_DEAD line
> pointers is *also* becoming a problem (it's also a "meh" situation),
> then it can be *more* ambitious. It can do a traditional VACUUM early.
> Which might still make sense.
>
> This also means that vacuumlazy.c would ideally think about this as an
> optimization problem. It may be lazy or eager for the whole table,
> just as it may be lazy or eager for individual indexes. (Though the
> eagerness/laziness dynamic is probably much more noticeable with
> indexes in practice.)
>

I discussed this topic off-list with Peter Geoghegan, and we think
that this fix can lead to future improvements. Let me summarize the
proposals.

The first proposal is to fix the inappropriate behavior discussed on
this thread. We pass a new flag when calling bulkdelete(), indicating
whether or not the index can safely skip this bulkdelete() call. This
is determined by whether or not lazy vacuum will do the heap clean
(making LP_DEAD line pointers LP_UNUSED in lazy_vacuum_heap()). If
the heap clean will be done, dead tuples referenced by index tuples
will be physically removed, so the index AM has to delete those index
tuples. If it will not be done, we still call bulkdelete(), but with
the flag set so that the index AM can safely skip the work. Of
course, an index AM can also choose not to skip it for its own
reasons. Index AMs that don't store heap TIDs, such as BRIN, can
decide whether or not to do the bulk deletion regardless of this
flag.
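
To illustrate, here is a very rough sketch of what I mean. This is
not a patch; the field heap_clean_planned, the AM name, and the
helper foo_wants_bulkdelete_anyway() are all made up:

/*
 * Suppose IndexVacuumInfo grew a new field (name illustrative only):
 *
 *     bool    heap_clean_planned;   will lazy vacuum make LP_DEAD
 *                                   line pointers LP_UNUSED?
 *
 * A hypothetical index AM could then use it like this:
 */
IndexBulkDeleteResult *
foo_bulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
               IndexBulkDeleteCallback callback, void *callback_state)
{
    if (!info->heap_clean_planned && !foo_wants_bulkdelete_anyway(info))
    {
        /*
         * Heap line pointers will not be recycled this cycle, so it is
         * safe for this AM to skip the expensive index scan.  An AM
         * that does not store heap TIDs (e.g. BRIN) could decide
         * differently.
         */
        return stats;
    }

    /* ... otherwise perform the normal bulk-deletion scan ... */
    return stats;
}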

The next proposal, building on the above, is to add a new index AM
API, say ambulkdeletestrategy(), which is called before bulkdelete()
for each index and asks for the index's bulk-deletion strategy.
Through this API, lazy vacuum asks, "Hey index X, I collected garbage
heap tuples during heap scanning, how urgent is vacuuming for you?",
and the index answers either "it's urgent", when it wants to do
bulk-deletion, or "it's not urgent, I can skip it". The point of this
proposal is to decouple heap vacuum from index vacuum for each index
so that we can employ different strategies per index. Lazy vacuum can
decide whether or not to do the heap clean based on the answers from
the indexes, and can set the flag I proposed above according to that
decision. If all indexes answer 'yes' (meaning they will do
bulkdelete()), lazy vacuum can do the heap clean. If even one index
answers 'no' (meaning it will not do bulkdelete()), lazy vacuum
cannot do the heap clean. At the same time, lazy vacuum would also be
able to require indexes to do bulkdelete() for its own reasons. That
is something like saying, "Hey index X, you answered that you would
not do bulkdelete(), but since the heap clean is necessary for me,
please don't skip bulkdelete()".
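
Here is a very rough sketch of how the new callback and its caller in
vacuumlazy.c might interact; all of the names below are made up for
illustration, and the real patch may look quite different:

/* Hypothetical answer type for ambulkdeletestrategy() */
typedef enum IndexBulkDeleteStrategy
{
    INDEX_BULKDELETE_URGENT,        /* "please call bulkdelete() this cycle" */
    INDEX_BULKDELETE_SKIPPABLE      /* "I can safely skip it this cycle" */
} IndexBulkDeleteStrategy;

/* In vacuumlazy.c, before index vacuuming and the heap clean */
bool        do_heap_clean = true;

for (int i = 0; i < nindexes; i++)
{
    if (index_bulkdelete_strategy(Irel[i], vacrelstats) ==
        INDEX_BULKDELETE_SKIPPABLE)
        do_heap_clean = false;      /* one skipping index is enough to
                                     * make the heap clean unsafe */
}

/*
 * vacuumlazy.c can still override the answers for its own reasons,
 * e.g. when LP_DEAD accumulation in the heap is getting out of hand,
 * by telling every index (via the flag above) that bulkdelete() must
 * not be skipped this time.
 */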

In connection with this change, we would need to rethink the meaning
of the INDEX_CLEANUP option. As of now, if it's not set (i.e.
VACOPT_TERNARY_DEFAULT in the code), it's treated as true and the
heap clean will be done. But I think we can make the default
something like a neutral state. This neutral state could resolve to
"on" or "off" depending on several factors, including the answers of
ambulkdeletestrategy(), the table status, and the user's request;
specifying INDEX_CLEANUP would then mean forcing the neutral state to
"on" or "off" by the user's request. Concretely, the table status
that could influence the decision might include the following (a
rough sketch of resolving the neutral state follows the list):

* Removing the LP_DEAD line pointers that have accumulated because
bulkdelete() has been skipped for a long time.
* Making pages all-visible for index-only scans.
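
The neutral state could then be resolved along these lines. The
VACOPT_TERNARY_* values exist today, but the two decision inputs
below are invented placeholders for the factors above:

if (params->index_cleanup == VACOPT_TERNARY_DEFAULT)
{
    /*
     * Neither the vacuum_index_cleanup reloption nor the INDEX_CLEANUP
     * option forced a choice, so resolve the neutral state from the
     * indexes' answers and the table status.
     */
    if (all_indexes_can_skip && !table_needs_heap_clean)
        params->index_cleanup = VACOPT_TERNARY_DISABLED;
    else
        params->index_cleanup = VACOPT_TERNARY_ENABLED;
}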

We would not benefit much from the ambulkdeletestrategy() idea right
now, but there are potential enhancements that could use this API (a
rough sketch follows the list):

* If the bottom-up index deletion feature[1] is introduced,
individual indexes could be in different situations in terms of dead
tuple accumulation; some indexes on the table can delete their
garbage index tuples without bulkdelete(). The problem is that doing
bulkdelete() for such indexes would not be efficient. This proposal
solves that problem because we can do bulkdelete() for just a subset
of the indexes on the table.

* If the retail index deletion feature[2] is introduced, we can make
the return value of ambulkdeletestrategy() a ternary value:
"do_bulkdelete", "do_indexscandelete", and "no".

* We can probably introduce a threshold on the number of dead tuples
to control whether or not to do index tuple bulk-deletion (a
bulkdelete() counterpart of vacuum_cleanup_index_scale_factor). In
the case where the amount of dead tuples is only slightly larger than
maintenance_work_mem, the second call to bulkdelete() will be made
with a small number of dead tuples, which is inefficient. This
proposal also solves that problem by allowing a subset of indexes to
skip bulkdelete() if the number of dead tuples doesn't exceed the
threshold.
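
Here is a rough sketch of how the last two ideas could fit together.
Every identifier below is invented for illustration only (the enum
simply extends the two-value answer from the earlier sketch, and the
scale factor stands in for a hypothetical knob, not an existing GUC
or reloption):

/* Possible ternary answer once retail index deletion exists */
typedef enum IndexBulkDeleteStrategy
{
    INDEX_BULKDELETE_DO_BULKDELETE,      /* full index scan deletion */
    INDEX_BULKDELETE_DO_INDEXSCANDELETE, /* retail deletion of known TIDs */
    INDEX_BULKDELETE_SKIP                /* no index deletion this cycle */
} IndexBulkDeleteStrategy;

/* Threshold-based part of a per-index strategy decision */
static IndexBulkDeleteStrategy
choose_bulkdelete_strategy(double reltuples, double num_dead_tuples,
                           double scale_factor)
{
    /* scale_factor plays the role of a hypothetical
     * index_bulkdelete_scale_factor setting */
    double      skip_threshold = scale_factor * reltuples;

    if (num_dead_tuples < skip_threshold)
        return INDEX_BULKDELETE_SKIP;   /* too few dead tuples to justify
                                         * a whole-index scan this cycle */

    return INDEX_BULKDELETE_DO_BULKDELETE;
}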

Any thoughts?

I'm writing a PoC patch so will share it.

Regards,

[1] https://www.postgresql.org/message-id/CAH2-Wzm%2BmaE3apHB8NOtmM%3Dp-DO65j2V5GzAWCOEEuy3JZgb2g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

--
Masahiko Sawada
EnterpriseDB:  https://www.enterprisedb.com/



Re: xid wraparound danger due to INDEX_CLEANUP false

From
Peter Geoghegan
Date:
On Tue, Dec 15, 2020 at 6:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In connection with this change, we would need to rethink the meaning
> of the INDEX_CLEANUP option. As of now, if it's not set (i.e.
> VACOPT_TERNARY_DEFAULT in the code), it's treated as true and the
> heap clean will be done. But I think we can make the default
> something like a neutral state. This neutral state could resolve to
> "on" or "off" depending on several factors, including the answers of
> ambulkdeletestrategy(), the table status, and the user's request;
> specifying INDEX_CLEANUP would then mean forcing the neutral state
> to "on" or "off" by the user's request.

I think a new value such as "smart" should be introduced, which can
become the default.

> Concretely, the table status that could influence the decision might
> include:
>
> * Removing the LP_DEAD line pointers that have accumulated because
> bulkdelete() has been skipped for a long time.
> * Making pages all-visible for index-only scans.
>
> We would not benefit much from the ambulkdeletestrategy() idea right
> now, but there are potential enhancements that could use this API:
>
> * If the bottom-up index deletion feature[1] is introduced,
> individual indexes could be in different situations in terms of dead
> tuple accumulation; some indexes on the table can delete their
> garbage index tuples without bulkdelete(). The problem is that doing
> bulkdelete() for such indexes would not be efficient. This proposal
> solves that problem because we can do bulkdelete() for just a subset
> of the indexes on the table.

The chances of the bottom-up index deletion feature being committed
for PostgreSQL 14 are very high. While it hasn't received much review
yet, there seems to be very little downside and a lot of upside.

> * If the retail index deletion feature[2] is introduced, we can make
> the return value of ambulkdeletestrategy() a ternary value:
> "do_bulkdelete", "do_indexscandelete", and "no".

Makes sense.

> * We can probably introduce a threshold on the number of dead tuples
> to control whether or not to do index tuple bulk-deletion (a
> bulkdelete() counterpart of vacuum_cleanup_index_scale_factor). In
> the case where the amount of dead tuples is only slightly larger
> than maintenance_work_mem, the second call to bulkdelete() will be
> made with a small number of dead tuples, which is inefficient. This
> proposal also solves that problem by allowing a subset of indexes to
> skip bulkdelete() if the number of dead tuples doesn't exceed the
> threshold.

Good idea. Maybe this won't be possible for PostgreSQL 14, but this
is the kind of possibility that we should try to unlock. I actually
had a similar-yet-different idea to this one of Masahiko's, which is
to use the page LSN to determine (unreliably) whether a B-Tree leaf
page is likely to have garbage tuples during VACUUM.
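
A minimal sketch of what that heuristic could look like, assuming we
had somewhere to remember the LSN as of the last cleanup pass (say,
in the metapage); nothing like this exists today:

static bool
leaf_page_might_need_vacuum(Page page, XLogRecPtr last_cleanup_lsn)
{
    /*
     * If the leaf page has not been modified since the last time
     * VACUUM cleaned it, it is unlikely (though not guaranteed) to
     * contain newly dead index tuples, so VACUUM could
     * opportunistically skip it.
     */
    return PageGetLSN(page) > last_cleanup_lsn;
}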

This other idea probably also won't happen for PostgreSQL 14, and
that's not important. The truly important thing is that we come up
with the right *general* design, one that can support either
technique in the future. I'm not sure which precise design will work
best, but I am confident that *some* combination of these two ideas
(or other ideas) will work very well. Right now we don't have the
appropriate general framework.

> Any thoughts?

Nothing to add to what you said, really. I agree that it makes sense
to think of all of these things at the same time.

It'll be easier to see how far these different ideas can be pushed
once a prototype is available.

> I'm writing a PoC patch so will share it.

Great! I suggest starting a new thread for that.


--
Peter Geoghegan