Thread: xid wraparound danger due to INDEX_CLEANUP false
Hi,

commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
Author: Robert Haas <rhaas@postgresql.org>
Date:   2019-04-04 14:58:53 -0400

    Allow VACUUM to be run with index cleanup disabled.

    This commit adds a new reloption, vacuum_index_cleanup, which
    controls whether index cleanup is performed for a particular
    relation by default.  It also adds a new option to the VACUUM
    command, INDEX_CLEANUP, which can be used to override the
    reloption.  If neither the reloption nor the VACUUM option is
    used, the default is true, as before.

    Masahiko Sawada, reviewed and tested by Nathan Bossart, Alvaro
    Herrera, Kyotaro Horiguchi, Darafei Praliaskouski, and me.
    The wording of the documentation is mostly due to me.

    Discussion: http://postgr.es/m/CAD21AoAt5R3DNUZSjOoXDUY=naYPUOuffVsRzuTYMz29yLzQCA@mail.gmail.com

made the index scan that is part of vacuum optional. I'm afraid that it
is not safe to do so unconditionally. Until this commit indexes could
rely on at least the amvacuumcleanup callback being called once per
vacuum. Which guaranteed that an index could ensure that there are no
too-old xids anywhere in the index.

But now that's not the case anymore:

    vacrelstats->useindex = (nindexes > 0 &&
                             params->index_cleanup == VACOPT_TERNARY_ENABLED);
    ...
    /* Do post-vacuum cleanup */
    if (vacrelstats->useindex)
        lazy_cleanup_all_indexes(Irel, indstats, vacrelstats, lps, nindexes);

E.g. btree has xids both in the metapage contents, as well as on normal
index pages as part of page deletion.

The slightly older feature to avoid unnecessary scans during cleanup
protects against this issue by skipping the scan inside the index AM:

    /*
     * _bt_vacuum_needs_cleanup() -- Checks if index needs cleanup assuming that
     *          btbulkdelete() wasn't called.
     */
    static bool
    _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
    {
    ...
        else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
                 TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                       RecentGlobalXmin))
        {
            /*
             * If oldest btpo.xact in the deleted pages is older than
             * RecentGlobalXmin, then at least one deleted page can be recycled.
             */
            result = true;
        }

which will afaict result in all such xids getting removed (or at least
give the AM the choice to do so).

It's possible that something protects against dangers in the case of
INDEX_CLEANUP false, or that the consequences aren't too bad. But I
didn't see any comments about the dangers in the patch.

Greetings,

Andres Freund
On Wed, Apr 15, 2020 at 7:38 PM Andres Freund <andres@anarazel.de> wrote:
> It's possible that something protects against dangers in the case of
> INDEX_CLEANUP false, or that the consequences aren't too bad. But I
> didn't see any comments about the dangers in the patch.

I seem to recall Simon raising this issue at the time that the patch
was being discussed, and I thought that we had eventually decided that
it was OK for some reason. But I don't remember the details, and it is
possible that we got it wrong. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I seem to recall Simon raising this issue at the time that the patch
> was being discussed, and I thought that we had eventually decided that
> it was OK for some reason. But I don't remember the details, and it is
> possible that we got it wrong. :-(

It must be unreliable because it's based on something that is known to
be unreliable:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Also, the commit message of 6655a729 says that nbtree has had this
problem "since time immemorial". I am planning to work on that problem,
eventually.

--
Peter Geoghegan
On Thu, Apr 16, 2020 at 8:38 AM Andres Freund <andres@anarazel.de> wrote:
>
> commit a96c41feec6b6616eb9d5baee9a9e08c20533c38
>
>     Allow VACUUM to be run with index cleanup disabled.
> [...]
> made the index scan that is part of vacuum optional. I'm afraid that it
> is not safe to do so unconditionally. Until this commit indexes could
> rely on at least the amvacuumcleanup callback being called once per
> vacuum. Which guaranteed that an index could ensure that there are no
> too-old xids anywhere in the index.
> [...]
> which will afaict result in all such xids getting removed (or at least
> give the AM the choice to do so).

For btree indexes, IIRC skipping index cleanup could not be a cause of
corruption, but can be a cause of index bloat, since it leaves
recyclable pages which are not marked as recyclable. The index bloat is
the main side effect of skipping index cleanup. When a user executes
VACUUM with INDEX_CLEANUP on to reclaim index garbage, will such pages
also be recycled sooner or later? Or can skipping index cleanup be a
cause of a recyclable page never being recycled?

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2020-04-16 16:30:02 +0900, Masahiko Sawada wrote:
> For btree indexes, IIRC skipping index cleanup could not be a cause of
> corruption, but be a cause of index bloat since it leaves recyclable
> pages which are not marked as recyclable. The index bloat is the main
> side effect of skipping index cleanup. When user executes VACUUM with
> INDEX_CLEANUP to reclaim index garbage, such pages will also be
> recycled sooner or later? Or skipping index cleanup can be a cause of
> recyclable page never being recycled?

Well, it depends on what you define as "never". Once the xids on the
pages have wrapped around, the page level xids will appear to be from
the future for a long time. And the metapage xid appearing to be from
the future will prevent some vacuums from actually doing the scan too,
even if INDEX_CLEANUP is reenabled. So a VACUUM, even with
INDEX_CLEANUP on, will not be able to recycle those pages anymore. At
some point the wrapped around xids will be "current" again, if there's
enough new xids.

It's not ok for vacuumlazy.c to make decisions like this. I think the
INDEX_CLEANUP logic clearly needs to be pushed down into the
amvacuumcleanup callbacks, and it needs to be left to the index AMs to
decide what the correct behaviour is. You can't just change things like
this without reviewing the consequences to AMs and documenting them.

Greetings,

Andres Freund
Hi,

On 2020-04-15 18:11:40 -0700, Peter Geoghegan wrote:
> On Wed, Apr 15, 2020 at 4:57 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I seem to recall Simon raising this issue at the time that the patch
> > was being discussed, and I thought that we had eventually decided that
> > it was OK for some reason. But I don't remember the details, and it is
> > possible that we got it wrong. :-(
>
> It must be unreliable because it's based on something that is known to
> be unreliable:
>
> https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/README;h=c5b0a30e4ebd4fe3bd4a6f8192284c452d1170b9;hb=refs/heads/REL_12_STABLE#l331

Sure, there is some pre-existing wraparound danger for individual
pages. But it's a pretty narrow corner case before INDEX_CLEANUP off.

That comment says something about a "shared-memory free space map",
making it sound like any crash would lose the page. But it's a normal
FSM these days. Vacuum will insert the deleted page into the free space
map. So either the FSM would need to be corrupted to not find the
inserted page anymore, or the index would need to grow slowly enough to
not use a page before the wraparound.

And then such wrapped around xids would exist on individual pages. Not
on all deleted pages, like with INDEX_CLEANUP false.

And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
INDEX_CLEANUP on might not even visit the index. As there very well
might not be many dead heap tuples around anymore (previous vacuums
with cleanup off will have removed them), the
vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
contrast to the normal situations where the btm_oldest_btpo_xact check
will prevent that from becoming a problem.

Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
be able to recycle half-dead pages? And thus would effectively never
recycle any dead space? Is that correct?

Greetings,

Andres Freund
On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
> Sure, there is some pre-existing wraparound danger for individual
> pages. But it's a pretty narrow corner case before INDEX_CLEANUP
> off.

It's a matter of degree. Hard to judge something like that.

> And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> INDEX_CLEANUP on might not even visit the index. As there very well
> might not be many dead heap tuples around anymore (previous vacuums with
> cleanup off will have removed them), the
> vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> contrast to the normal situations where the btm_oldest_btpo_xact check
> will prevent that from becoming a problem.

I guess that they should visit the metapage to see if they need to do
that much. That would allow us to fix the problem while mostly
honoring INDEX_CLEANUP off, I think.

> Peter, as far as I can tell, with INDEX_CLEANUP off, nbtree will never
> be able to recycle half-dead pages? And thus would effectively never
> recycle any dead space? Is that correct?

I agree. The fact that btm_oldest_btpo_xact is an all-or-nothing thing
(with wraparound hazards) is bad in itself, and introduced new risk to
v11 compared to previous versions (without the INDEX_CLEANUP = off
feature entering into it). The simple fact that we don't even check it
with INDEX_CLEANUP = off is a bigger problem, though, and one that now
seems unrelated.

BTW, a lot of people get confused about what half-dead pages are. I
would like to make something clear that may not be obvious: while it's
bad that the implementation leaks pages that should go in the FSM, it's
not the end of the world. They should get evicted from shared_buffers
pretty quickly if there is any pressure, and impose no real cost on
index scans.

There are (roughly) three types of pages that we're concerned about
here in the common case where we're just deleting a leaf page:

* A half-dead page -- no downlink in its parent, marked dead.

* A deleted page -- now no sidelinks, either. Not initially safe to
recycle.

* A deleted page in the FSM -- this is what we have the interlock for.

Half-dead pages are pretty rare, because VACUUM really has to have a
hard crash for that to happen (that might not be 100% true, but it's at
least 99% true). That's always been the case, and we don't really need
to talk about them here at all. We're just concerned with deleted pages
in the context of this discussion (and whether or not they can be
recycled without confusing in-flight index scans). These are the only
pages that are marked with an XID at all.

Another thing that's worth pointing out is that this whole
RecentGlobalXmin business is how we opted to implement what Lanin &
Shasha call "the drain technique". It is rather different to the usual
ways in which we use RecentGlobalXmin. We're only using it as a proxy
(an absurdly conservative proxy) for whether or not there might be an
in-flight index scan that lands on a concurrently recycled index page
and gets completely confused. So it is purely about the integrity of
the data structure itself. It is a consequence of doing so little
locking when descending the tree -- our index scans don't need to
couple buffer locks on the way down the tree at all. So we make VACUUM
worry about that, rather than making index scans worry about VACUUM
(though the latter design is a reasonable and common one).

There is absolutely no reason why we have to delay recycling for very
long, even in cases with long running transactions or whatever. I agree
that it's just an accident that it works that way. VACUUM could
probably remember deleted pages, and then revisit those pages at the
end of the index vacuuming -- that might make a big difference in a lot
of workloads. Or it could chain them together as a linked list which
can be accessed much more eagerly in some cases.

--
Peter Geoghegan
Hi,

On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > INDEX_CLEANUP on might not even visit the index. As there very well
> > might not be many dead heap tuples around anymore (previous vacuums with
> > cleanup off will have removed them), the
> > vacuum_cleanup_index_scale_factor logic may prevent index vacuums. In
> > contrast to the normal situations where the btm_oldest_btpo_xact check
> > will prevent that from becoming a problem.
>
> I guess that they should visit the metapage to see if they need to do
> that much. That would allow us to fix the problem while mostly
> honoring INDEX_CLEANUP off, I think.

Yea. _bt_vacuum_needs_cleanup() needs to check if
metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
vacuum_set_xid_limits(), and vacuum the index if so, even if
INDEX_CLEANUP is false.

> BTW, a lot of people get confused about what half-dead pages are. I
> would like to make something clear that may not be obvious: while it's
> bad that the implementation leaks pages that should go in the FSM,
> it's not the end of the world. They should get evicted from
> shared_buffers pretty quickly if there is any pressure, and impose no
> real cost on index scans.

Yea, half-dead pages aren't the main problem. It's pages that contain
only dead tuples, but aren't unlinked from the tree. Without a vacuum
scan we'll never reuse them - even if we know they're all dead.

Note that the page being in the FSM is not protection against
wraparound :(. We recheck whether a page is recyclable when getting it
from the FSM (probably required, due to the FSM not being crashsafe).
It's of course much less likely to happen at that stage, because the
pages can get reused.

I think we should really just stop being miserly and update the xid to
be FrozenTransactionId or InvalidTransactionId when vacuum encounters
one that's from before the xid cutoff used by vacuum (i.e. what could
become the new relfrozenxid). That seems like it'd be a few lines, not
more.

> There is absolutely no reason why we have to delay recycling for very
> long, even in cases with long running transactions or whatever. I
> agree that it's just an accident that it works that way. VACUUM could
> probably remember deleted pages, and then revisit those pages at the
> end of the index vacuuming -- that might make a big difference in a
> lot of workloads. Or it could chain them together as a linked list
> which can be accessed much more eagerly in some cases.

I think it doesn't really help meaningfully for vacuum to be a bit
smarter about when to recognize pages as being recyclable. IMO the big
issue is that vacuum won't be very frequent, so we'll grow the index
until that time, even if there are many "effectively empty" pages.

I.e. even if the killtuples logic allows us to recognize that all
actual index tuples are fully dead, we'll not benefit from that unless
there's a new insertion that belongs onto the "empty" page. That's fine
for indexes that are updated roughly evenly across the value range, but
terrible for indexes that "grow" mostly on one side, and "shrink" on
the other.

I'd bet it'd be beneficial if we were to either have scans unlink such
pages directly, or if they just entered the page into the FSM and had
_bt_getbuf() do the unlinking. I'm not sure if the current locking
model assumes anywhere that there is only one process (vacuum)
unlinking pages though?

Greetings,

Andres Freund
On Thu, Apr 16, 2020 at 3:49 PM Andres Freund <andres@anarazel.de> wrote:
> I think we should really just stop being miserly and update the xid to
> be FrozenTransactionId or InvalidTransactionId when vacuum encounters
> one that's from before the xid cutoff used by vacuum (i.e. what could
> become the new relfrozenxid). That seems like it'd be a few lines, not
> more.

Okay.

> > There is absolutely no reason why we have to delay recycling for very
> > long, even in cases with long running transactions or whatever. I
> > agree that it's just an accident that it works that way. VACUUM could
> > probably remember deleted pages, and then revisit those pages at the
> > end of the index vacuuming -- that might make a big difference in a
> > lot of workloads. Or it could chain them together as a linked list
> > which can be accessed much more eagerly in some cases.
>
> I think it doesn't really help meaningfully for vacuum to be a bit
> smarter about when to recognize pages as being recyclable. IMO the big
> issue is that vacuum won't be very frequent, so we'll grow the index
> until that time, even if there's many "effectively empty" pages.

(It seems like you're talking about the big picture now, not the
problems in the Postgres 11 and 12 features in this area -- you're
talking about what happens to empty pages, not what happens to deleted
pages.)

I'll say some more things about the less ambitious goal of more eager
recycling of pages, as they are deleted:

An individual VACUUM operation cannot recycle a page right after
_bt_pagedel() is called to delete the page. VACUUM will both set a
target leaf page half-dead and delete it all at once in _bt_pagedel()
(it does that much in the simple and common case). Again, this is
because recycling very soon after the call to _bt_pagedel() will
introduce races with concurrent index scans -- they could fail to
observe the deletion in the parent (i.e. see its downlink, since the
child isn't even half-dead), and land on a concurrently recycled page
(VACUUM concurrently marks the page half-dead, fully dead/deleted, and
then even goes as far as recycling it). So the current design makes a
certain amount of sense -- we can't be super aggressive like that.
(Actually, maybe it doesn't make sense to not just put the page in the
FSM there and then -- see "Thinks some more" below.)

Even still, nothing stops the same VACUUM operation from (for example)
remembering a list of pages it has deleted during the current scan, and
then coming back at the end of the bulk scan of the index to reconsider
if it can recycle the pages now (2 minutes later instead of 2 months
later). With a new RecentGlobalXmin (or something that's conceptually
like a new RecentGlobalXmin).

Similarly, we could do limited VACUUMs that only visit previously
deleted pages, once VACUUM is taught to chain deleted pages together to
optimize recycling. We don't have to repeat another pass over the
entire index to recycle the pages, because of this special deleted page
linking. This is something that we could use when we have to recycle
pages, but it's an "INDEX_CLEANUP = off" index VACUUM -- we don't
really want to do most of the stuff that index vacuuming needs to do,
but we must still visit the metapage to check btm_oldest_btpo_xact, and
then maybe walk the deleted page linked list.

*** Thinks some more ***

As you pointed out, _bt_getbuf() already distrusts the FSM -- it has
its own _bt_page_recyclable() check, probably because the FSM isn't
crash safe. Maybe we could improve matters by teaching _bt_pagedel() to
put a page it deleted in the FSM immediately -- no need to wait until
the next index VACUUM for the RecordFreeIndexPage() call.

It still isn't quite enough that _bt_getbuf() distrusts the FSM, so
we'd also have to teach _bt_getbuf() some heuristics that made it
understand that VACUUM is now designed to put stuff in the FSM
immediately, so we don't have to wait for the next VACUUM operation to
get to it. Maybe _bt_getbuf() should try the FSM a few times before
giving up and allocating a new page, etc.

This wouldn't make VACUUM delete any more pages any sooner, but it
would make those pages reclaimable much sooner. Also, it wouldn't solve
the wraparound problem, but that is a bug, not a performance/efficiency
issue.

> I.e. even if the killtuples logic allows us to recognize that all actual
> index tuples are fully dead, we'll not benefit from that unless there's
> a new insertion that belongs onto the "empty" page. That's fine for
> indexes that are updated roughly evenly across the value range, but
> terrible for indexes that "grow" mostly on one side, and "shrink" on the
> other.

That could be true, but there are certain things about B-Tree space
utilization that might surprise you:

https://www.drdobbs.com/reexamining-b-trees/184408694?pgno=3

> I'd bet it'd be beneficial if we were to either have scans unlink such
> pages directly, or if they just entered the page into the FSM and had
> _bt_getbuf() do the unlinking.

That won't work, since you're now talking about pages that aren't
deleted (or even half-dead) that are just candidates to be deleted
because they're empty. So you'd have to do all the steps in
_bt_pagedel() within a new _bt_getbuf() path, which would have many
deadlock hazards. Unlinking the page from the tree itself (deleting) is
really complicated.

> I'm not sure if the current locking
> model assumes anywhere that there is only one process (vacuum)
> unlinking pages though?

I'm not sure, though _bt_unlink_halfdead_page() has comments supposing
that there could be concurrent page deletions like that.

--
Peter Geoghegan
On Fri, 17 Apr 2020 at 02:58, Andres Freund <andres@anarazel.de> wrote:
> Well, it depends on what you define as "never". Once the xids on the
> pages have wrapped around, the page level xids will appear to be from
> the future for a long time. And the metapage xid appearing to be from
> the future will prevent some vacuums from actually doing the scan too,
> even if INDEX_CLEANUP is reenabled. So a VACUUM, even with
> INDEX_CLEANUP on, will not be able to recycle those pages anymore. At
> some point the wrapped around xids will be "current" again, if there's
> enough new xids.

I wanted to clarify the impact of this bug.

> It's not ok for vacuumlazy.c to make decisions like this. I think the
> INDEX_CLEANUP logic clearly needs to be pushed down into the
> amvacuumcleanup callbacks, and it needs to be left to the index AMs to
> decide what the correct behaviour is.

I agree with you.

On Fri, 17 Apr 2020 at 07:49, Andres Freund <andres@anarazel.de> wrote:
> On 2020-04-16 13:28:00 -0700, Peter Geoghegan wrote:
> > > And, what's worse, in the INDEX_CLEANUP off case, future VACUUMs with
> > > INDEX_CLEANUP on might not even visit the index. As there very well
> > > might not be many dead heap tuples around anymore (previous vacuums with
> > > cleanup off will have removed them), the
> > > vacuum_cleanup_index_scale_factor logic may prevent index vacuums.

I think this doesn't happen, because in the INDEX_CLEANUP off case
vacuum marks the line pointers of dead tuples as dead but leaves them
in place. Therefore future VACUUMs with INDEX_CLEANUP on will see these
dead line pointers and invoke ambulkdelete.

> > I guess that they should visit the metapage to see if they need to do
> > that much. That would allow us to fix the problem while mostly
> > honoring INDEX_CLEANUP off, I think.
>
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

Agreed. So _bt_vacuum_needs_cleanup() would become something like the
following?

    if (metad->btm_version < BTREE_NOVAC_VERSION)
        result = true;
    else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
             TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                   FreezeLimit))
        result = true;
    else if (index_cleanup_disabled)
        result = false;
    else if (TransactionIdIsValid(metad->btm_oldest_btpo_xact) &&
             TransactionIdPrecedes(metad->btm_oldest_btpo_xact,
                                   RecentGlobalXmin))
        result = true;
    else
        result = /* determine based on vacuum_cleanup_index_scale_factor */;

Or perhaps we can change _bt_vacuum_needs_cleanup() so that it does
index cleanup if metad->btm_oldest_btpo_xact is older than the
FreezeLimit *and* it's an aggressive vacuum. Anyway, a problem is that
if we change IndexVacuumInfo to tell the AM that the INDEX_CLEANUP
option is disabled, and to pass FreezeLimit, it would break
compatibility.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 16, 2020 at 6:49 PM Andres Freund <andres@anarazel.de> wrote:
> Yea. _bt_vacuum_needs_cleanup() needs to check if
> metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by
> vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP
> false.

I'm still fairly unclear on what the actual problem is here, and on how
we propose to fix it. It seems to me that we probably don't have a
problem in the case where we don't advance relfrozenxid or relminmxid,
because in that case there's not much difference between the behavior
created by this patch and a case where we just error out due to an
interrupt or something before reaching the index cleanup stage. I think
that the problem is that in the case where we do advance relfrozenxid,
we might advance it past some XID value stored in the index metadata.
Is that right?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Apr 16, 2020 at 12:30 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> For btree indexes, IIRC skipping index cleanup could not be a cause of
> corruption, but be a cause of index bloat since it leaves recyclable
> pages which are not marked as recyclable.

I spotted a bug in "Skip full index scan during cleanup of B-tree
indexes when possible" which is unrelated to the index cleanup issue.

This code is wrong, because we don't have a buffer lock (or a buffer
pin) on the buffer anymore:

        ndel = _bt_pagedel(rel, buf);

        /* count only this page, else may double-count parent */
        if (ndel)
        {
            stats->pages_deleted++;
            if (!TransactionIdIsValid(vstate->oldestBtpoXact) ||
                TransactionIdPrecedes(opaque->btpo.xact, vstate->oldestBtpoXact))
                vstate->oldestBtpoXact = opaque->btpo.xact;
        }

        MemoryContextSwitchTo(oldcontext);
        /* pagedel released buffer, so we shouldn't */

(As the comment says, _bt_pagedel() releases it.)

There is another, more fundamental issue, though: _bt_pagedel() can
delete more than one page. That might be okay if the "extra" pages were
always internal pages, but they're not -- it says so in the comments
above _bt_pagedel(). See the code at the end of _bt_pagedel(), that
says something about why we delete the right sibling page in some
cases.

I think that the fix is to push down the vstate into lower level code
in nbtpage.c. Want to have a go at fixing it?

(It would be nice if we could teach Valgrind to "poison" buffers when
we don't have a pin held... that would probably have caught this issue
almost immediately.)

--
Peter Geoghegan
On Wed, Apr 22, 2020 at 6:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> (It would be nice if we could teach Valgrind to "poison" buffers when
> we don't have a pin held... that would probably have caught this issue
> almost immediately.)

I can get Valgrind to complain about it when the regression tests are
run with the attached patch applied. I see this in the logs at several
points when "make installcheck" runs:

==1082059== VALGRINDERROR-BEGIN
==1082059== Invalid read of size 4
==1082059==    at 0x21D8DE: btvacuumpage (nbtree.c:1370)
==1082059==    by 0x21DA61: btvacuumscan (nbtree.c:1039)
==1082059==    by 0x21DBD5: btbulkdelete (nbtree.c:879)
==1082059==    by 0x215821: index_bulk_delete (indexam.c:698)
==1082059==    by 0x20FDCE: lazy_vacuum_index (vacuumlazy.c:2427)
==1082059==    by 0x2103EA: lazy_vacuum_all_indexes (vacuumlazy.c:1794)
==1082059==    by 0x211EA1: lazy_scan_heap (vacuumlazy.c:1681)
==1082059==    by 0x211EA1: heap_vacuum_rel (vacuumlazy.c:510)
==1082059==    by 0x360414: table_relation_vacuum (tableam.h:1457)
==1082059==    by 0x360414: vacuum_rel (vacuum.c:1880)
==1082059==    by 0x361785: vacuum (vacuum.c:449)
==1082059==    by 0x361F0E: ExecVacuum (vacuum.c:249)
==1082059==    by 0x4D979C: standard_ProcessUtility (utility.c:823)
==1082059==    by 0x4D9C7F: ProcessUtility (utility.c:522)
==1082059==    by 0x4D6791: PortalRunUtility (pquery.c:1157)
==1082059==    by 0x4D725F: PortalRunMulti (pquery.c:1303)
==1082059==    by 0x4D7CEF: PortalRun (pquery.c:779)
==1082059==    by 0x4D3BB7: exec_simple_query (postgres.c:1239)
==1082059==    by 0x4D4ABD: PostgresMain (postgres.c:4315)
==1082059==    by 0x45B0C9: BackendRun (postmaster.c:4510)
==1082059==    by 0x45B0C9: BackendStartup (postmaster.c:4202)
==1082059==    by 0x45B0C9: ServerLoop (postmaster.c:1727)
==1082059==    by 0x45C754: PostmasterMain (postmaster.c:1400)
==1082059==    by 0x3BDD68: main (main.c:210)
==1082059==  Address 0x6cc7378 is in a rw- anonymous segment
==1082059==
==1082059== VALGRINDERROR-END

(The line numbers might be slightly different to master here, but the
line from btvacuumpage() is definitely the one that accesses the
special area of the B-Tree page after we drop the pin.)

This patch is very rough -- it was just the first thing that I tried. I
don't know how Valgrind remembers the status of shared memory regions
across backends when they're marked with VALGRIND_MAKE_MEM_NOACCESS().
Even still, this idea looks promising. I should try to come up with a
committable patch before too long.

The good news is that the error I showed is the only error that I see,
at least with this rough patch + "make installcheck". It's possible
that the patch isn't as effective as it could be, though. For one
thing, it definitely won't detect incorrect buffer accesses where a pin
is held but a buffer lock is not held. That seems possible, but a bit
harder.

--
Peter Geoghegan
Attachments
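[Editor's note] A minimal sketch of the instrumentation idea being discussed: mark a shared buffer's page NOACCESS when this backend's pin count drops to zero, and DEFINED again when it is pinned. The macros below are stand-ins for Valgrind's client requests from `<valgrind/memcheck.h>` (under Valgrind they change the addressability of the range; outside it they are no-ops), and the structure and function names are hypothetical, not the actual bufmgr.c code:

```c
#include <stdint.h>

/* Stand-ins for <valgrind/memcheck.h> client requests. */
#define VALGRIND_MAKE_MEM_DEFINED(addr, len)  ((void) 0)
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void) 0)

#define MY_BLCKSZ 8192

typedef struct MyBufferDesc
{
    int     local_pin_count;    /* this backend's pin count */
    char    page[MY_BLCKSZ];    /* the buffer's page contents */
} MyBufferDesc;

/* Pin: the page becomes (and is marked) accessible to this backend. */
static void
my_pin_buffer(MyBufferDesc *buf)
{
    if (buf->local_pin_count++ == 0)
        VALGRIND_MAKE_MEM_DEFINED(buf->page, MY_BLCKSZ);
}

/* Unpin: once no pins remain, any later dereference of the page is a bug,
 * which Valgrind would then report as an invalid read or write. */
static void
my_unpin_buffer(MyBufferDesc *buf)
{
    if (--buf->local_pin_count == 0)
        VALGRIND_MAKE_MEM_NOACCESS(buf->page, MY_BLCKSZ);
}
```

Because the markings are per-process, each backend tracks only its own pins, which matches the observation below that Valgrind does not share access markings across processes.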
Hi, On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote: > I can get Valgrind to complain about it when the regression tests are > run with the attached patch applied. Nice! Have you checked how much of an incremental slowdown this causes? > This patch is very rough -- it was just the first thing that I tried. > I don't know how Valgrind remembers the status of shared memory > regions across backends when they're marked with > VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I > should try to come up with a committable patch before too long. IIRC valgrind doesn't at all share access markings across processes. > The good news is that the error I showed is the only error that I see, > at least with this rough patch + "make installcheck". It's possible > that the patch isn't as effective as it could be, though. For one > thing, it definitely won't detect incorrect buffer accesses where a > pin is held but a buffer lock is not held. That seems possible, but a > bit harder. Given hint bits it seems fairly hard to make that a reliable check. > +#ifdef USE_VALGRIND > + if (!isLocalBuf) > + { > + Buffer b = BufferDescriptorGetBuffer(bufHdr); > + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ); > + } > +#endif Hm. It's a bit annoying that we have to mark the contents defined. It'd be kinda useful to be able to mark unused parts of pages as undefined initially. But there's afaict no way to just set/unset addressability, while not touching definedness. So this is probably the best we can do without adding a lot of complexity. 
> if (isExtend) > { > /* new buffers are zero-filled */ > @@ -1039,6 +1047,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum, > buf = GetBufferDescriptor(buf_id); > > valid = PinBuffer(buf, strategy); > +#ifdef USE_VALGRIND > + { > + Buffer b = BufferDescriptorGetBuffer(buf); > + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ); > + } > +#endif Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at their callsites? > @@ -1633,6 +1653,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy) > buf_state)) > { > result = (buf_state & BM_VALID) != 0; > +#ifdef USE_VALGRIND > + { > + Buffer b = BufferDescriptorGetBuffer(buf); > + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ); > + } > +#endif > break; > } > } Oh, but we actually are doing it in PinBuffer() too? > /* > * Decrement the shared reference count. > @@ -2007,6 +2039,12 @@ BufferSync(int flags) > */ > if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED) > { > +#ifdef USE_VALGRIND > + { > + Buffer b = BufferDescriptorGetBuffer(bufHdr); > + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ); > + } > +#endif > if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN) > { > TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id); Shouldn't the pin we finally acquire in SyncOneBuffer() be sufficient? > @@ -2730,6 +2768,12 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln) > * Run PageGetLSN while holding header lock, since we don't have the > * buffer locked exclusively in all cases. > */ > +#ifdef USE_VALGRIND > + { > + Buffer b = BufferDescriptorGetBuffer(buf); > + VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(b), BLCKSZ); > + } > +#endif > recptr = BufferGetLSN(buf); This shouldn't be needed, as the caller ought to hold a pin: * * The caller must hold a pin on the buffer and have share-locked the * buffer contents. 
(Note: a share-lock does not prevent updates of * hint bits in the buffer, so the page could change while the write * is in progress, but we assume that that will not invalidate the data * written.) * Greetings, Andres Freund
On Wed, Apr 22, 2020 at 8:33 PM Andres Freund <andres@anarazel.de> wrote: > On 2020-04-22 20:08:42 -0700, Peter Geoghegan wrote: > > I can get Valgrind to complain about it when the regression tests are > > run with the attached patch applied. > > Nice! Have you checked how much of an incremental slowdown this causes? No, but I didn't notice much of a slowdown. > > This patch is very rough -- it was just the first thing that I tried. > > I don't know how Valgrind remembers the status of shared memory > > regions across backends when they're marked with > > VALGRIND_MAKE_MEM_NOACCESS(). Even still, this idea looks promising. I > > should try to come up with a committable patch before too long. > > IIRC valgrind doesn't at all share access markings across processes. I didn't think so. > > The good news is that the error I showed is the only error that I see, > > at least with this rough patch + "make installcheck". It's possible > > that the patch isn't as effective as it could be, though. For one > > thing, it definitely won't detect incorrect buffer accesses where a > > pin is held but a buffer lock is not held. That seems possible, but a > > bit harder. > > Given hint bits it seems fairly hard to make that a reliable check. I don't follow. It doesn't have to be a perfect check. Detecting if there is *any* buffer lock held at all would be a big improvement. > Why aren't we doing this in PinBuffer() and PinBuffer_Locked(), but at > their callsites? I wrote this patch in a completely careless manner in less than 10 minutes, just to see how hard it was (I thought that it might have been much harder). I wasn't expecting you to review it. I thought that I was clear about that. -- Peter Geoghegan
On Wed, Apr 22, 2020 at 9:05 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Given hint bits it seems fairly hard to make that a reliable check. > > I don't follow. It doesn't have to be a perfect check. Detecting if > there is *any* buffer lock held at all would be a big improvement. It is true that the assumptions that heapam makes about what a buffer pin will prevent (that a pin will prevent any kind of page defragmentation) are not really compatible with marking pages as undefined in lower level code like bufmgr.c. There are too many exceptions for it to work like that. The case I was really thinking about was the nbtree _bt_drop_lock_and_maybe_pin() stuff, which is very confusing. The confusing structure of the BTScanPosIsPinned()/_bt_drop_lock_and_maybe_pin() code more or less caused the skip scan patch to have numerous bugs involving code holding a buffer pin, but not a buffer lock, at least when I last looked at it a couple of months ago. The only thing having a pin on a leaf page guarantees is that the TIDs from tuples on the page won't be concurrently recycled by VACUUM. This is a very weak guarantee -- in particular, it's much weaker than the guarantees around buffer pins that apply in heapam. It's certainly not going to prevent any kind of defragmentation of the page -- the page can even split, for example. Any code that relies on holding a pin to prevent anything more than that is broken, but possibly only in a subtle way. It's not like page splits happen all that frequently. Given that I was concerned about a fairly specific situation, a specific solution seems like it might be the best way to structure the extra checks. The attached rough patch shows the kind of approach that might be practical in specific index access methods. This works on top of the patch I posted yesterday. 
The idea is to mark the buffer's page as a noaccess region within _bt_drop_lock_and_maybe_pin(), and then mark it defined again at either of the two points that we might have to relock (but not repin) the buffer to re-read the page. This doesn't cause any regression test failures, so maybe there are no bugs like this currently, but it still seems like it might be worth pursuing on top of the buffer pin stuff. -- Peter Geoghegan
Attachments
On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote: > Sure, there is some pre-existing wraparound danger for individual > pages. But it's a pretty narrow corner case before INDEX_CLEANUP > off. > > That comment says something about "shared-memory free space map", making > it sound like any crash would lose the page. But it's a normal FSM > these days. Vacuum will insert the deleted page into the free space > map. So either the FSM would need to be corrupted to not find the > inserted page anymore, or the index would need to grow slow enough to > not use a page before the wraparound. And then such wrapped around xids > would exist on individual pages. Not on all deleted pages, like with > INDEX_CLEANUP false. Is that really that narrow, even without "INDEX_CLEANUP false"? It's not as if the index needs to grow very slowly to have only very few page splits hour to hour (it depends on whether the inserts are random or not, and so on). Especially if you had a bulk DELETE affecting many rows, which is hardly that uncommon. Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from btpo.xact) when it recycles deleted pages. It simply puts them in the FSM without changing anything about the page itself. This means surprisingly little in the context of nbtree: the _bt_page_recyclable() XID check that takes place in btvacuumpage() also takes place in _bt_getbuf(), at the point where the page actually gets recycled by the client. That's not great. It wouldn't be so unreasonable if btvacuumpage() actually did freeze the btpo.xact value at the point where it puts the page in the FSM. It doesn't need to be crash safe; it can work as a hint. Maybe "freezing" is the wrong word (too much baggage). More like we'd have VACUUM represent that "this deleted B-Tree page is definitely not considered to still be a part of the tree by any possible other backend" using a page flag hint -- btvacuumpage() would "mark the deleted page as recyclable" explicitly. 
Note that we still need to keep the original btpo.xact XID around for _bt_log_reuse_page() (also, do we need to worry about _bt_log_reuse_page() being called with a wrapped-around XID?). -- Peter Geoghegan
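[Editor's note] The danger Peter describes comes from the circular 32-bit XID comparison. A hedged sketch of how TransactionIdPrecedes-style modulo-2^32 arithmetic makes a stale btpo.xact flip from "safely in the past" to "in the future" once roughly 2^31 XIDs have been consumed (simplified: it ignores permanent XIDs such as FrozenTransactionId, which the real routine special-cases):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Circular comparison, as in PostgreSQL's transam.c: true iff id1 logically
 * precedes id2, interpreting the 32-bit difference as a signed value. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    int32_t diff = (int32_t) (id1 - id2);

    return diff < 0;
}

/* Simplified model of _bt_page_recyclable(): a deleted page is recyclable
 * once its recorded btpo.xact is older than every snapshot's xmin. */
static bool
page_looks_recyclable(TransactionId btpo_xact, TransactionId oldest_xmin)
{
    return xid_precedes(btpo_xact, oldest_xmin);
}
```

With btpo_xact = 1000, the page looks recyclable while oldest_xmin stays within 2^31 XIDs of it; after more than 2^31 further XIDs, `xid_precedes(1000, oldest_xmin)` returns false, the page appears to have been deleted "in the future", and it is never recycled.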
Hi, On 2020-04-29 11:28:00 -0700, Peter Geoghegan wrote: > On Thu, Apr 16, 2020 at 11:27 AM Andres Freund <andres@anarazel.de> wrote: > > Sure, there is some pre-existing wraparound danger for individual > > pages. But it's a pretty narrow corner case before INDEX_CLEANUP > > off. > > > > That comment says something about "shared-memory free space map", making > > it sound like any crash would lose the page. But it's a normal FSM > > these days. Vacuum will insert the deleted page into the free space > > map. So either the FSM would need to be corrupted to not find the > > inserted page anymore, or the index would need to grow slow enough to > > not use a page before the wraparound. And then such wrapped around xids > > would exist on individual pages. Not on all deleted pages, like with > > INDEX_CLEANUP false. > > Is that really that narrow, even without "INDEX_CLEANUP false"? It's > not as if the index needs to grow very slowly to have only very few > page splits hour to hour (it depends on whether the inserts are random > or not, and so on). Especially if you had a bulk DELETE affecting many > rows, which is hardly that uncommon. Well, you'd need to have a workload that has bulk deletes, high xid usage *and* doesn't insert new data to use those empty pages. > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from > btpo.xact) when it recycles deleted pages. It simply puts them in the > FSM without changing anything about the page itself. This means > surprisingly little in the context of nbtree: the > _bt_page_recyclable() XID check that takes place in btvacuumpage() > also takes place in _bt_getbuf(), at the point where the page actually > gets recycled by the client. That's not great. I think it's quite foolish for btvacuumpage() to not freeze xids. If we only do so when necessary (i.e. older than a potential new relfrozenxid, and only when the vacuum didn't yet skip pages), the costs are pretty minuscule. 
> It wouldn't be so unreasonable if btvacuumpage() actually did freeze > the bpto.xact value at the point where it puts the page in the FSM. It > doesn't need to be crash safe; it can work as a hint. I'd much rather make sure the xid is guaranteed to be removed. As outlined above, the cost would be small, and I think the likelihood of the consequences of wrapped around xids getting worse over time is substantial. > Note that we still need to keep the original bpto.xact XID around for > _bt_log_reuse_page() (also, do we need to worry _bt_log_reuse_page() > with a wrapped-around XID?). I'd just WAL log the reuse when freezing the xid. Then there's no worry about wraparounds. And I don't think it'd cause additional conflicts; the vacuum itself (or a prior vacuum) would also have to cause them, I think? Greetings, Andres Freund
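[Editor's note] A sketch of the freezing Andres proposes (hypothetical names; as he notes, the real change would also need to WAL-log the reuse together with the freeze): when vacuum scans a deleted page whose btpo.xact is already older than the oldest running snapshot, overwrite it with a permanent value so that it can never wrap around:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

#define FrozenTransactionId ((TransactionId) 2) /* permanent, as in transam.h */

/* Circular 32-bit XID comparison, TransactionIdPrecedes()-style. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * Hypothetical helper for btvacuumpage(): if the deleted page's XID is
 * already older than every snapshot, replace it with FrozenTransactionId.
 * Returns true if the page was changed (and so would need to be dirtied
 * and WAL-logged, carrying the conflict information that the reuse record
 * would otherwise have carried).
 */
static bool
maybe_freeze_btpo_xact(TransactionId *btpo_xact, TransactionId oldest_xmin)
{
    if (*btpo_xact != FrozenTransactionId &&
        xid_precedes(*btpo_xact, oldest_xmin))
    {
        *btpo_xact = FrozenTransactionId;
        return true;
    }
    return false;
}
```

Once frozen this way, the stored XID can no longer be misread as a "future" XID by a later vacuum, no matter how many XIDs are consumed in between.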
On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from > > bpto.xact) when it recycles deleted pages. It simply puts them in the > > FSM without changing anything about the page itself. This means > > surprisingly little in the context of nbtree: the > > _bt_page_recyclable() XID check that takes place in btvacuumpage() > > also takes place in _bt_getbuf(), at the point where the page actually > > gets recycled by the client. That's not great. > > I think it's quite foolish for btvacuumpage() to not freeze xids. If we > only do so when necessary (i.e. older than a potential new relfrozenxid, > and only when the vacuum didn't yet skip pages), the costs are pretty > miniscule. I wonder if we should just bite the bullet and mark pages that are recycled by VACUUM as explicitly recycled, with WAL-logging and everything (this is like freezing, but stronger). That way, the _bt_page_recyclable() call within _bt_getbuf() would only be required to check that state (while btvacuumpage() would use something like a _bt_page_eligible_for_recycling(), which would do the same thing as the current _bt_page_recyclable()). I'm not sure how low the costs would be, but at least we'd only have to do it once per already-deleted page (i.e. only the first time a VACUUM operation found _bt_page_eligible_for_recycling() returned true for the page and marked it recycled in a crash safe manner). That design would be quite a lot simpler, because it expresses the problem in terms that make sense to the nbtree code. _bt_getbuf() should not have to make a distinction between "possibly recycled" versus "definitely recycled". It makes sense that the FSM is not crash safe, I suppose, but we're talking about relatively large amounts of free space here. Can't we just do it properly/reliably? -- Peter Geoghegan
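[Editor's note] A sketch of the page-state distinction Peter suggests (the flag name and helpers are hypothetical, not actual nbtree flags): VACUUM would promote a page from "deleted" to an explicit, WAL-logged "recycled" state once it becomes eligible, and _bt_getbuf() would then only need to test the flag, with no XID arithmetic at reuse time:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Hypothetical nbtree-style status bits for a page's special area. */
#define MYBTP_DELETED  (1 << 0)
#define MYBTP_RECYCLED (1 << 1)   /* the proposed new, crash-safe state */

typedef struct MyBTPageState
{
    uint16_t      flags;
    TransactionId btpo_xact;   /* XID recorded when the page was deleted */
} MyBTPageState;

/* Circular 32-bit XID comparison, TransactionIdPrecedes()-style. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* What btvacuumpage() would test before marking the page recycled;
 * compare the current _bt_page_recyclable(). */
static bool
page_eligible_for_recycling(const MyBTPageState *p, TransactionId oldest_xmin)
{
    return (p->flags & MYBTP_DELETED) != 0 &&
           (p->flags & MYBTP_RECYCLED) == 0 &&
           xid_precedes(p->btpo_xact, oldest_xmin);
}

/* What _bt_getbuf() would test: just the explicit state, and any other
 * code path that lands on a recycled page can raise a "can't happen" error. */
static bool
page_is_recycled(const MyBTPageState *p)
{
    return (p->flags & MYBTP_RECYCLED) != 0;
}
```

The cross-check Peter describes in the next message falls out of this naturally: scans that follow a downlink or sibling link onto a `page_is_recycled()` page can immediately report corruption instead of silently moving right.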
On Wed, Apr 29, 2020 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > I'm not sure how low the costs would be, but at least we'd only have > to do it once per already-deleted page (i.e. only the first time a > VACUUM operation found _bt_page_eligible_for_recycling() returned true > for the page and marked it recycled in a crash safe manner). That > design would be quite a lot simpler, because it expresses the problem > in terms that make sense to the nbtree code. _bt_getbuf() should not > have to make a distinction between "possibly recycled" versus > "definitely recycled". As a bonus, we now have an easy correctness cross-check: if some random piece of nbtree code lands on a page (having followed a downlink or sibling link) that is marked recycled, then clearly something is very wrong -- throw a "can't happen" error. This would be especially useful in places like _bt_readpage(), I suppose. Think of extreme cases like cursors, which can have a scan that remembers a block number of a leaf page, that only actually gets accessed hours or days later (for whatever reason). If that code was buggy in some way, we might have a hope of figuring it out at some point with this cross-check. -- Peter Geoghegan
On Wed, Apr 29, 2020 at 2:04 PM Peter Geoghegan <pg@bowt.ie> wrote: > As a bonus, we now have an easy correctness cross-check: if some > random piece of nbtree code lands on a page (having followed a > downlink or sibling link) that is marked recycled, then clearly > something is very wrong -- throw a "can't happen" error. If this doesn't sound that appealing to any of you, then bear this in mind: nbtree has a terrifying tendency to *not* produce wrong answers to queries when we land on a concurrently-recycled page. _bt_moveright() is willing to move right as many times as it takes to arrive at the correct page, even though in typical cases having to move right once is rare -- twice is exceptional. I suppose that there is roughly a 50% chance that we'll end up landing at a point in the key space that is to the left of the point where we're supposed to arrive at. It might take many, many page accesses before _bt_moveright() finds the correct page, but that often won't be very noticeable. Or if it is noticed, corruption won't be suspected -- we're still getting a correct answer. -- Peter Geoghegan
Hi, On 2020-04-29 13:40:55 -0700, Peter Geoghegan wrote: > On Wed, Apr 29, 2020 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > > > Fundamentally, btvacuumpage() doesn't freeze 32-bit XIDs (from > > > bpto.xact) when it recycles deleted pages. It simply puts them in the > > > FSM without changing anything about the page itself. This means > > > surprisingly little in the context of nbtree: the > > > _bt_page_recyclable() XID check that takes place in btvacuumpage() > > > also takes place in _bt_getbuf(), at the point where the page actually > > > gets recycled by the client. That's not great. > > > > I think it's quite foolish for btvacuumpage() to not freeze xids. If we > > only do so when necessary (i.e. older than a potential new relfrozenxid, > > and only when the vacuum didn't yet skip pages), the costs are pretty > > miniscule. > > I wonder if we should just bite the bullet and mark pages that are > recycled by VACUUM as explicitly recycled, with WAL-logging and > everything (this is like freezing, but stronger). That way, the > _bt_page_recyclable() call within _bt_getbuf() would only be required > to check that state (while btvacuumpage() would use something like a > _bt_page_eligible_for_recycling(), which would do the same thing as > the current _bt_page_recyclable()). I'm not sure I see the advantage. Only doing so in the freezing case seems unlikely to cause additional conflicts, but I'm less sure about doing it always. btpo.xact will often be quite recent, right? So it'd likely cause more conflicts. > I'm not sure how low the costs would be, but at least we'd only have > to do it once per already-deleted page (i.e. only the first time a > VACUUM operation found _bt_page_eligible_for_recycling() returned true > for the page and marked it recycled in a crash safe manner). That > design would be quite a lot simpler, because it expresses the problem > in terms that make sense to the nbtree code. 
_bt_getbuf() should not > have to make a distinction between "possibly recycled" versus > "definitely recycled". I don't really see the problem with the check in _bt_getbuf()? I'd actually like to be *more* aggressive about putting pages in the FSM (or whatever), and that'd probably require checks like this. E.g. whenever we unlink a page, we should put it into the FSM (or something different, see below). And then do all the necessary checks in _bt_getbuf(). It's pretty sad that one right now basically needs to vacuum twice to reuse pages in nbtree (once to delete the page, once to record it in the fsm). Usually the xmin horizon should advance much more quickly than that, allowing reuse earlier. As far as I can tell, even just adding them to the FSM when setting ISDELETED would be advantageous. There's obviously the downside that we'll cause backends to visit a lot of pages that can't actually be reused... But if we did what I suggest below, that danger probably could largely be avoided. > It makes sense that the FSM is not crash safe, I suppose, but we're > talking about relatively large amounts of free space here. Can't we > just do it properly/reliably? What do you mean by that? To have the FSM be crash-safe? It could make sense to just not have the FSM, and have a linked-list style stack of pages reachable from the meta page. That'd be especially advantageous if we kept xids associated with the "about to be recyclable" pages in the metapage, so it'd be cheap to evaluate. There are possibly some not-entirely-trivial locking concerns around such a list. Adding entries seems easy enough, because we currently only delete pages from within vacuum. But popping entries could be more complicated, I'm not exactly sure if there are potential lock nesting issues (nor am I actually sure there aren't existing ones). Greetings, Andres Freund
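[Editor's note] One way to picture the metapage-anchored free list Andres floats (an entirely hypothetical layout; the locking questions he raises are not addressed here): keep a stack of (block number, btpo.xact) pairs, push on page deletion, and pop only entries whose XID is already older than the oldest snapshot:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;

#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define FREELIST_CAP 16     /* toy capacity; a real design would chain pages */

typedef struct FreePageList
{
    int           nentries;
    BlockNumber   blkno[FREELIST_CAP];
    TransactionId xact[FREELIST_CAP]; /* XID after which entry is reusable */
} FreePageList;

/* Circular 32-bit XID comparison, TransactionIdPrecedes()-style. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/* Called when a page is unlinked: remember it together with its XID. */
static bool
freelist_push(FreePageList *fl, BlockNumber blkno, TransactionId xact)
{
    if (fl->nentries >= FREELIST_CAP)
        return false;           /* toy overflow handling */
    fl->blkno[fl->nentries] = blkno;
    fl->xact[fl->nentries] = xact;
    fl->nentries++;
    return true;
}

/* Called from _bt_getbuf()-like code: pop a page only if it is safe now. */
static BlockNumber
freelist_pop_recyclable(FreePageList *fl, TransactionId oldest_xmin)
{
    int     i;

    for (i = 0; i < fl->nentries; i++)
    {
        if (xid_precedes(fl->xact[i], oldest_xmin))
        {
            BlockNumber result = fl->blkno[i];

            /* unordered removal: move the last entry into the hole */
            fl->blkno[i] = fl->blkno[fl->nentries - 1];
            fl->xact[i] = fl->xact[fl->nentries - 1];
            fl->nentries--;
            return result;
        }
    }
    return InvalidBlockNumber;  /* nothing reusable yet */
}
```

Because the XIDs travel with the entries, a backend can cheaply skip pages that are not yet reusable, which addresses the "visit a lot of pages that can't actually be reused" concern above.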
On Wed, Apr 29, 2020 at 2:56 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure I see the advantage. Only doing so in the freezing case > seems unlikely to cause additional conflicts, but I'm less sure about > doing it always. btpo.xact will often be quite recent, right? So it'd > likely cause more conflicts. btpo.xact values come from a call to ReadNewTransactionId(). There pretty much has to be one call to ReadNewTransactionId() per page deletion (see comments about it within _bt_unlink_halfdead_page()). So yes, I'd say that it could be very recent. I don't mind continuing to do the conflicts in _bt_getbuf(), which would delay it until the point where we really need the page -- especially if we could do that in a way that captures temporal locality (e.g. your recyclable page chaining idea). > I don't really see the problem with the check in _bt_getbuf()? I'd > actually like to be *more* aggressive about putting pages in the FSM (or > whatever), and that'd probably require checks like this. E.g. whenever > we unlink a page, we should put it into the FSM (or something different, > see below). And then do all the necessary checks in _bt_getbuf(). Basically, I would like to have a page state that represents "it should be impossible for any scan to land on this page, except for btvacuumscan()". Without relying on 32-bit XIDs, and ideally without relying on any other state to interpret what it really means. In principle we can set a deleted page to that state at the earliest possible point when that becomes true, without it meaning anything else, or requiring that we do anything else at the same time (e.g. actually using it for the right half of a page in a page split, generating recovery conflicts). > It's pretty sad that one right now basically needs to vacuum twice to > reuse pages in nbtree (once to delete the page, once to record it in the > fsm). Usually the xmin horizon should advance much more quickly than > that, allowing reuse earlier. 
Yes, that's definitely bad. I like the general idea of making us more aggressive with recycling. Marking pages as "recyclable" (not "recycled") not too long after they first get deleted in VACUUM, and then separately recycling them in _bt_getbuf() is a cleaner design. Separation of concerns. That would build confidence in a more aggressive approach -- we could add lots of cross-checks against landing on a recyclable page. Note that we have had a bug in this exact mechanism in the past -- see commit d3abbbebe52. If there was a bug then we might still land on the page after it gets fully recycled, in which case the cross-checks won't detect the bug. But ISTM that we always have a good chance of landing on the page before that happens, in which case the cross-check complains and we get a log message, and possibly even a bug report. We don't have to be truly lucky to see when our approach is buggy when we go on to make page deletion more aggressive (in whatever way). And we get the same cross-checks on standbys. > > It makes sense that the FSM is not crash safe, I suppose, but we're > > talking about relatively large amounts of free space here. Can't we > > just do it properly/reliably? > > What do you mean by that? To have the FSM be crash-safe? That, or the equivalent (pretty much your chaining idea) may well be a good idea. But what I really meant was an explicit "recyclable" page state. That's all. We may or may not also continue to rely on the FSM in the same way. I suppose that we should try to get rid of the FSM in nbtree. I see the advantages. It's not essential to my proposal, though. > It could make sense to just not have the FSM, and have a linked-list > style stack of pages reachable from the meta page. That'd be especially > advantageous if we kept xids associated with the "about to be > recyclable" pages in the metapage, so it'd be cheap to evaluate. I like that idea. 
But doesn't that also argue for delaying the conflicts until we actually recycle a "recyclable" page? > There's possibly some not-entirely-trivial locking concerns around such > a list. Adding entries seems easy enough, because we currently only > delete pages from within vacuum. But popping entries could be more > complicated, I'm not exactly sure if there are potential lock nesting > issues (nor am I actually sure there aren't existing ones). A "recyclable" page state might help with this, too. _bt_getbuf() is a bag of tricks, even leaving aside generating recovery conflicts. If we are going to be more eager, then the cost of dirtying the page a second time to mark it "recyclable" might mostly not matter. Especially if we chain the pages. That is, maybe VACUUM recomputes RecentGlobalXmin at the end of its first btvacuumscan() scan (or at the end of the whole VACUUM operation), when it notices that it is already possible to mark many pages as "recyclable". Perhaps we won't write out the page twice much of the time, because it won't have been that long since VACUUM dirtied the page in order to delete it. Yeah, we could be a lot more aggressive here, in a bunch of ways. As I've said quite a few times, it seems like our implementation of "the drain technique" is way more conservative than it needs to be (i.e. we use ReadNewTransactionId() without considering any of the specifics of the index). But if we mess up, we can't expect amcheck to detect the problems, which would be totally transient. We're talking about incredibly subtle concurrency bugs. So IMV it's just not going to happen until the whole thing becomes way less scary. -- Peter Geoghegan
On Sat, 18 Apr 2020 at 03:22, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Apr 16, 2020 at 6:49 PM Andres Freund <andres@anarazel.de> wrote: > > Yea. _bt_vacuum_needs_cleanup() needs to check if > > metad->btm_oldest_btpo_xact is older than the FreezeLimit computed by > > vacuum_set_xid_limits() and vacuum the index if so even if INDEX_CLEANUP > > false. > > I'm still fairly unclear on what the actual problem is here, and on > how we propose to fix it. It seems to me that we probably don't have a > problem in the case where we don't advance relfrozenxid or relminmxid, > because in that case there's not much difference between the behavior > created by this patch and a case where we just error out due to an > interrupt or something before reaching the index cleanup stage. I > think that the problem is that in the case where we do relfrozenxid, > we might advance it past some XID value stored in the index metadata. > Is that right? I think advancing relfrozenxid past oldest_btpo_xact actually cannot be a problem. If a subsequent vacuum sees that oldest_btpo_xact is an old xid, we can recycle pages. Before the introduction of INDEX_CLEANUP = false, we used to invoke either bulkdelete or vacuumcleanup at least once in each vacuum. And thanks to relfrozenxid, a table is periodically vacuumed by an anti-wraparound vacuum. But with this feature, we can unconditionally skip both bulkdelete and vacuumcleanup. So IIUC the problem is that since we skip both, oldest_btpo_xact could be seen as a "future" xid during vacuum, which will cause vacuum to miss pages that can actually be recycled. I think we can fix this issue by calling the vacuumcleanup callback during an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can let the index AM decide whether to do index cleanup at least once before XID wraparound, same as before. 
Originally, the motivation for disabling INDEX_CLEANUP was to skip the full index scan during anti-wraparound vacuums to reduce the execution time. With this change, we will end up doing a full index scan in some anti-wraparound vacuum cases, but we can still skip it if there are no recyclable pages in the index. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
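[Editor's note] A sketch of the decision rule Sawada proposes, in the shape of _bt_vacuum_needs_cleanup() (simplified, with hypothetical parameters): even with INDEX_CLEANUP off, an aggressive (anti-wraparound) vacuum would run the cleanup phase when the metapage's btm_oldest_btpo_xact has fallen behind the freeze limit, so recyclable pages are processed before their XIDs can wrap:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

#define InvalidTransactionId ((TransactionId) 0)

/* Circular 32-bit XID comparison, TransactionIdPrecedes()-style. */
static bool
xid_precedes(TransactionId id1, TransactionId id2)
{
    return (int32_t) (id1 - id2) < 0;
}

/*
 * Should this vacuum call the index's cleanup phase?  Simplified model of
 * the proposed rule: honor INDEX_CLEANUP=off for ordinary vacuums, but let
 * an aggressive (anti-wraparound) vacuum override it when the oldest
 * deleted-page XID recorded in the metapage is older than the freeze limit.
 */
static bool
vacuum_needs_index_cleanup(bool index_cleanup_enabled,
                           bool aggressive,
                           TransactionId oldest_btpo_xact,
                           TransactionId freeze_limit)
{
    if (index_cleanup_enabled)
        return true;
    if (aggressive &&
        oldest_btpo_xact != InvalidTransactionId &&
        xid_precedes(oldest_btpo_xact, freeze_limit))
        return true;
    return false;
}
```

Under this rule the bulkdelete phase can still be skipped; only the single cleanup pass runs, matching the point above that processing recyclable pages is the only work that must not be deferred past wraparound.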
On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > So IIUC the problem is that since we skip both, > oldst_btpo_xact could be seen as a "future" xid during vacuum. Which > will be a cause of that vacuum misses pages which can actually be > recycled. This is also my understanding of the problem. > I think we can fix this issue by calling vacuumcleanup callback when > an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can > let index AM make decisions whether doing cleanup index at least once > until XID wraparound, same as before. +1 Can you work on a patch? -- Peter Geoghegan
On Wed, 6 May 2020 at 07:14, Peter Geoghegan <pg@bowt.ie> wrote: > > On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > So IIUC the problem is that since we skip both, > > oldst_btpo_xact could be seen as a "future" xid during vacuum. Which > > will be a cause of that vacuum misses pages which can actually be > > recycled. > > This is also my understanding of the problem. > > > I think we can fix this issue by calling vacuumcleanup callback when > > an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can > > let index AM make decisions whether doing cleanup index at least once > > until XID wraparound, same as before. > > +1 > > Can you work on a patch? Yes, I'll submit a bug fix patch. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, 6 May 2020 at 07:17, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Wed, 6 May 2020 at 07:14, Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Tue, May 5, 2020 at 2:52 PM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > So IIUC the problem is that since we skip both, > > > oldst_btpo_xact could be seen as a "future" xid during vacuum. Which > > > will be a cause of that vacuum misses pages which can actually be > > > recycled. > > > > This is also my understanding of the problem. > > > > > I think we can fix this issue by calling vacuumcleanup callback when > > > an anti-wraparound vacuum even if INDEX_CLEANUP is false. That way we can > > > let index AM make decisions whether doing cleanup index at least once > > > until XID wraparound, same as before. > > > > +1 > > > > Can you work on a patch? > > Yes, I'll submit a bug fix patch. > I've attached a patch that fixes this issue. With this patch, we no longer skip the index cleanup phase when performing an aggressive vacuum. The reason for performing only the index cleanup phase (while still skipping the index vacuum phase) is that the index vacuum phase can be called multiple times, which takes a very long time. Since the purpose of this index cleanup is to process recyclable pages, it's enough to do only the index cleanup phase. However, it also means we do index cleanup even when the table might have garbage, whereas we used to call index cleanup only when there is no garbage in the table. As far as I can tell that's no problem, but perhaps it needs more research. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > I've attached the patch fixes this issue. > > With this patch, we don't skip only index cleanup phase when > performing an aggressive vacuum. The reason why I don't skip only > index cleanup phase is that index vacuum phase can be called multiple > times, which takes a very long time. Since the purpose of this index > cleanup is to process recyclable pages it's enough to do only index > cleanup phase. That's only true in nbtree, though. The way that I imagined we'd want to fix this is by putting control in each index access method. So, we'd revise the way that amvacuumcleanup() worked -- the amvacuumcleanup routine for each index AM would sometimes be called in a mode that means "I don't really want you to do any cleanup, but if you absolutely have to for your own reasons then you can". This could be represented using something similar to IndexVacuumInfo.analyze_only. This approach has an obvious disadvantage: the patch really has to teach *every* index AM to do something with that state (most will simply do no work). It seems logical to have the index AM control what happens, though. This allows the logic to live inside _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one place where we make decisions like this. Most other AMs don't have this problem. GiST has a similar issue with recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't need to care about this stuff at all. Besides, it seems like it might be a good idea to do other basic maintenance of the index when we're "skipping" index cleanup. We probably should always do things that require only a small, fixed amount of work. Things like maintaining metadata in the metapage. There may be practical reasons why this approach isn't suitable for backpatch even if it is a superior approach. What do you think? Also, what do you think about this Robert and Andres? -- Peter Geoghegan
On Wed, May 6, 2020 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote: > This approach has an obvious disadvantage: the patch really has to > teach *every* index AM to do something with that state (most will > simply do no work). It seems logical to have the index AM control what > happens, though. This allows the logic to live inside > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one > place where we make decisions like this. Also, do we really want to skip summarization of BRIN indexes? This cleanup is rather dissimilar to the cleanup that takes place in most other AMs -- it isn't really that related to garbage collection (BRIN is rather unique in this respect). I think that BRIN might be an inappropriate target for "index_cleanup off" VACUUMs for that reason. See the explanation of how this takes place from the docs: https://www.postgresql.org/docs/devel/brin-intro.html#BRIN-OPERATION -- Peter Geoghegan
On 2020-May-06, Peter Geoghegan wrote: > Also, do we really want to skip summarization of BRIN indexes? This > cleanup is rather dissimilar to the cleanup that takes place in most > other AMs -- it isn't really that related to garbage collection (BRIN > is rather unique in this respect). I think that BRIN might be an > inappropriate target for "index_cleanup off" VACUUMs for that reason. > > See the explanation of how this takes place from the docs: > https://www.postgresql.org/docs/devel/brin-intro.html#BRIN-OPERATION Good question. I agree that BRIN summarization is not at all related to what other index AMs do during the cleanup phase. However, if the index_cleanup feature is there to make it faster to process a table that's nearing wraparound hazards (or at least the warnings), then I think it makes sense to complete the vacuum as fast as possible -- which includes not trying to summarize it for brin indexes. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 6, 2020 at 1:06 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Good question. I agree that BRIN summarization is not at all related to > what other index AMs do during the cleanup phase. However, if the > index_cleanup feature is there to make it faster to process a table > that's nearing wraparound hazards (or at least the warnings), then I > think it makes sense to complete the vacuum as fast as possible -- which > includes not trying to summarize it for brin indexes. I forgot about the fact that the AutoVacuumRequestWork() interface exists at all until just now. That's how "autosummarize = on" makes sure that autosummarization takes place. These work items are not affected by the fact that the VACUUM happens to be a "index_cleanup off" VACUUM. Fortunately, the user is required to explicitly opt-in to autosummarization (by setting "autosummarize = on") in order for autovacuum to spend extra time processing work items when it might be important to advance relfrozenxid ASAP. (My initial assumption was that the autosummarization business happened within brinvacuumcleanup(), but I now see that I was mistaken.) There is a separate question (nothing to do with summarization) about the cleanup steps performed in brinvacuumcleanup(), which are unlike any of the cleanup/maintenance that we expect for an amvacuumcleanup routine in general. As I said in my last e-mail, these steps have nothing to do with garbage tuples. Rather, it's deferred maintenance that we need to do even with append-only workloads (including when autosummarization has not been enabled). What about that? Is that okay? ISTM that the fundamental issue is that BRIN imagines that it is in control, which isn't quite true in light of the "index_cleanup off" stuff -- a call to brinvacuumcleanup() is expected to take place at fairly consistent intervals to take care of revmap processing, which, in general, might not happen now. I blame commit a96c41feec6 for this, not BRIN. 
ISTM that whatever behavior we deem appropriate, the proper place to decide on it is within BRIN. Not within vacuumlazy.c. -- Peter Geoghegan
On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > I've attached the patch fixes this issue. > > > > With this patch, we don't skip only index cleanup phase when > > performing an aggressive vacuum. The reason why I don't skip only > > index cleanup phase is that index vacuum phase can be called multiple > > times, which takes a very long time. Since the purpose of this index > > cleanup is to process recyclable pages it's enough to do only index > > cleanup phase. > > That's only true in nbtree, though. The way that I imagined we'd want > to fix this is by putting control in each index access method. So, > we'd revise the way that amvacuumcleanup() worked -- the > amvacuumcleanup routine for each index AM would sometimes be called in > a mode that means "I don't really want you to do any cleanup, but if > you absolutely have to for your own reasons then you can". This could > be represented using something similar to > IndexVacuumInfo.analyze_only. > > This approach has an obvious disadvantage: the patch really has to > teach *every* index AM to do something with that state (most will > simply do no work). It seems logical to have the index AM control what > happens, though. This allows the logic to live inside > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one > place where we make decisions like this. > > Most other AMs don't have this problem. GiST has a similar issue with > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't > need to care about this stuff at all. Besides, it seems like it might > be a good idea to do other basic maintenance of the index when we're > "skipping" index cleanup. We probably should always do things that > require only a small, fixed amount of work. Things like maintaining > metadata in the metapage. 
> > There may be practical reasons why this approach isn't suitable for > backpatch even if it is a superior approach. What do you think? I agree this idea is better. I was thinking of the same approach, but I was concerned about backpatching. In particular, since I was thinking of adding one or two fields to IndexVacuumInfo, existing index AMs might not work with the new IndexVacuumInfo structure. If we go with this idea, we need to change lazy vacuum so that it uses the two-pass vacuum strategy even if INDEX_CLEANUP is false. Also, in parallel vacuum, we need to launch workers. But I think these changes are no big problem. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 7 May 2020 at 15:40, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > I've attached the patch fixes this issue. > > > > > > With this patch, we don't skip only index cleanup phase when > > > performing an aggressive vacuum. The reason why I don't skip only > > > index cleanup phase is that index vacuum phase can be called multiple > > > times, which takes a very long time. Since the purpose of this index > > > cleanup is to process recyclable pages it's enough to do only index > > > cleanup phase. > > > > That's only true in nbtree, though. The way that I imagined we'd want > > to fix this is by putting control in each index access method. So, > > we'd revise the way that amvacuumcleanup() worked -- the > > amvacuumcleanup routine for each index AM would sometimes be called in > > a mode that means "I don't really want you to do any cleanup, but if > > you absolutely have to for your own reasons then you can". This could > > be represented using something similar to > > IndexVacuumInfo.analyze_only. > > > > This approach has an obvious disadvantage: the patch really has to > > teach *every* index AM to do something with that state (most will > > simply do no work). It seems logical to have the index AM control what > > happens, though. This allows the logic to live inside > > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one > > place where we make decisions like this. > > > > Most other AMs don't have this problem. GiST has a similar issue with > > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't > > need to care about this stuff at all. Besides, it seems like it might > > be a good idea to do other basic maintenance of the index when we're > > "skipping" index cleanup. 
We probably should always do things that > > require only a small, fixed amount of work. Things like maintaining > > metadata in the metapage. > > > > There may be practical reasons why this approach isn't suitable for > > backpatch even if it is a superior approach. What do you think? > > I agree this idea is better. I was thinking the same approach but I > was concerned about backpatching. Especially since I was thinking to > add one or two fields to IndexVacuumInfo existing index AM might not > work with the new VacuumInfo structure. Would it be OK if we added these fields at the end of the IndexVacuumInfo structure? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, 7 May 2020 at 16:26, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 7 May 2020 at 15:40, Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote: > > > > > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > I've attached the patch fixes this issue. > > > > > > > > With this patch, we don't skip only index cleanup phase when > > > > performing an aggressive vacuum. The reason why I don't skip only > > > > index cleanup phase is that index vacuum phase can be called multiple > > > > times, which takes a very long time. Since the purpose of this index > > > > cleanup is to process recyclable pages it's enough to do only index > > > > cleanup phase. > > > > > > That's only true in nbtree, though. The way that I imagined we'd want > > > to fix this is by putting control in each index access method. So, > > > we'd revise the way that amvacuumcleanup() worked -- the > > > amvacuumcleanup routine for each index AM would sometimes be called in > > > a mode that means "I don't really want you to do any cleanup, but if > > > you absolutely have to for your own reasons then you can". This could > > > be represented using something similar to > > > IndexVacuumInfo.analyze_only. > > > > > > This approach has an obvious disadvantage: the patch really has to > > > teach *every* index AM to do something with that state (most will > > > simply do no work). It seems logical to have the index AM control what > > > happens, though. This allows the logic to live inside > > > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one > > > place where we make decisions like this. > > > > > > Most other AMs don't have this problem. GiST has a similar issue with > > > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't > > > need to care about this stuff at all. 
Besides, it seems like it might > > > be a good idea to do other basic maintenance of the index when we're > > > "skipping" index cleanup. We probably should always do things that > > > require only a small, fixed amount of work. Things like maintaining > > > metadata in the metapage. > > > > > > There may be practical reasons why this approach isn't suitable for > > > backpatch even if it is a superior approach. What do you think? > > > > I agree this idea is better. I was thinking the same approach but I > > was concerned about backpatching. Especially since I was thinking to > > add one or two fields to IndexVacuumInfo existing index AM might not > > work with the new VacuumInfo structure. > > It would be ok if we added these fields at the end of VacuumInfo structure? > I've attached a WIP patch for HEAD. With this patch, the core passes index_cleanup to the bulkdelete and vacuumcleanup callbacks so that they can decide whether to run vacuum or not. If the direction of this patch seems good, I'll change the patch so that we pass some information to these callbacks indicating whether this vacuum is an anti-wraparound vacuum. This is necessary because, per the discussion so far, it's enough to invoke index cleanup at least once before XID wraparound. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > I've attached WIP patch for HEAD. With this patch, the core pass > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they > can make decision whether run vacuum or not. > > If the direction of this patch seems good, I'll change the patch so > that we pass something information to these callbacks indicating > whether this vacuum is anti-wraparound vacuum. This is necessary > because it's enough to invoke index cleanup before XID wraparound as > per discussion so far. It seems like the right direction to me. Robert? -- Peter Geoghegan
On Fri, May 22, 2020 at 1:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > It. seems like the right direction to me. Robert? BTW, this is an interesting report of somebody using the INDEX_CLEANUP feature when they had to deal with a difficult production issue: https://www.buildkitestatus.com/incidents/h0vnx4gp7djx This report is not really relevant to our discussion, but I thought you might find it interesting. -- Peter Geoghegan
On Tue, 19 May 2020 at 11:32, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Thu, 7 May 2020 at 16:26, Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > > > On Thu, 7 May 2020 at 15:40, Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > > On Thu, 7 May 2020 at 03:28, Peter Geoghegan <pg@bowt.ie> wrote: > > > > > > > > On Wed, May 6, 2020 at 2:28 AM Masahiko Sawada > > > > <masahiko.sawada@2ndquadrant.com> wrote: > > > > > I've attached the patch fixes this issue. > > > > > > > > > > With this patch, we don't skip only index cleanup phase when > > > > > performing an aggressive vacuum. The reason why I don't skip only > > > > > index cleanup phase is that index vacuum phase can be called multiple > > > > > times, which takes a very long time. Since the purpose of this index > > > > > cleanup is to process recyclable pages it's enough to do only index > > > > > cleanup phase. > > > > > > > > That's only true in nbtree, though. The way that I imagined we'd want > > > > to fix this is by putting control in each index access method. So, > > > > we'd revise the way that amvacuumcleanup() worked -- the > > > > amvacuumcleanup routine for each index AM would sometimes be called in > > > > a mode that means "I don't really want you to do any cleanup, but if > > > > you absolutely have to for your own reasons then you can". This could > > > > be represented using something similar to > > > > IndexVacuumInfo.analyze_only. > > > > > > > > This approach has an obvious disadvantage: the patch really has to > > > > teach *every* index AM to do something with that state (most will > > > > simply do no work). It seems logical to have the index AM control what > > > > happens, though. This allows the logic to live inside > > > > _bt_vacuum_needs_cleanup() in the case of nbtree, so there is only one > > > > place where we make decisions like this. > > > > > > > > Most other AMs don't have this problem. 
GiST has a similar issue with > > > > recyclable pages, except that it doesn't use 32-bit XIDs so it doesn't > > > > need to care about this stuff at all. Besides, it seems like it might > > > > be a good idea to do other basic maintenance of the index when we're > > > > "skipping" index cleanup. We probably should always do things that > > > > require only a small, fixed amount of work. Things like maintaining > > > > metadata in the metapage. > > > > > > > > There may be practical reasons why this approach isn't suitable for > > > > backpatch even if it is a superior approach. What do you think? > > > > > > I agree this idea is better. I was thinking the same approach but I > > > was concerned about backpatching. Especially since I was thinking to > > > add one or two fields to IndexVacuumInfo existing index AM might not > > > work with the new VacuumInfo structure. > > > > It would be ok if we added these fields at the end of VacuumInfo structure? > > > > I've attached WIP patch for HEAD. With this patch, the core pass > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they > can make decision whether run vacuum or not. > > If the direction of this patch seems good, I'll change the patch so > that we pass something information to these callbacks indicating > whether this vacuum is anti-wraparound vacuum. This is necessary > because it's enough to invoke index cleanup before XID wraparound as > per discussion so far. > I've updated the patch so that vacuum passes the is_wraparound flag to bulkdelete and vacuumcleanup. So I've added two new fields in total: index_cleanup and is_wraparound. Index AMs can decide whether to skip bulkdelete and index cleanup or not. Also, I've added this item to Older Bugs so as not to forget it. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments
On Fri, May 22, 2020 at 4:40 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Mon, May 18, 2020 at 7:32 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > I've attached WIP patch for HEAD. With this patch, the core pass > > index_cleanup to bulkdelete and vacuumcleanup callbacks so that they > > can make decision whether run vacuum or not. > > > > If the direction of this patch seems good, I'll change the patch so > > that we pass something information to these callbacks indicating > > whether this vacuum is anti-wraparound vacuum. This is necessary > > because it's enough to invoke index cleanup before XID wraparound as > > per discussion so far. > > It. seems like the right direction to me. Robert? Sorry, I'm so far behind on my email. Argh. I think, especially on the blog post you linked, that we should aim to have INDEX_CLEANUP OFF mode do the minimum possible amount of work while still keeping us safe against transaction ID wraparound. So if, for example, it's desirable but not imperative for BRIN to resummarize, then it's OK normally but should be skipped with INDEX_CLEANUP OFF. I find the patch itself confusing and the comments inadequate, especially the changes to lazy_scan_heap(). Before the INDEX_CLEANUP patch went into the tree, LVRelStats had a member hasindex indicating whether or not the table had any indexes. The patch changed that member to useindex, indicating whether or not we were going to do index vacuuming; thus, it would be false if either there were no indexes or if we were going to ignore them. This patch redefines useindex to mean whether or not the table has any indexes, but without renaming the variable. There's also really no comments anywhere in the vacuumlazy.c explaining the overall scheme here; what are we actually doing? Apparently, what we're really doing here is that even when INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. 
That seems sad, but if it's what we have to do then it at least needs comments explaining it. As for the btree portion of the change, I expect you understand this better than I do, so I'm reluctant to stick my neck out, but it seems that what the patch does is force index cleanup to happen even when INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and the btree index has at least 1 recyclable page. My first reaction is to wonder whether that doesn't nerf this feature into oblivion. Isn't it likely that an index that is being vacuumed for wraparound will have a recyclable page someplace? If the presence of that 1 recyclable page causes the "help, I'm about to run out of XIDs, please do the least possible work" flag to become a no-op, I don't think users are going to be too happy with that. Maybe I am misunderstanding. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 24, 2020 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > Sorry, I'm so far behind on my email. Argh. That's okay. > I think, especially on the blog post you linked, that we should aim to > have INDEX_CLEANUP OFF mode do the minimum possible amount of work > while still keeping us safe against transaction ID wraparound. So if, > for example, it's desirable but not imperative for BRIN to > resummarize, then it's OK normally but should be skipped with > INDEX_CLEANUP OFF. I agree that that's very important. > Apparently, what we're really doing here is that even when > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > That seems sad, but if it's what we have to do then it at least needs > comments explaining it. +1. Though I think that AMs should technically have the right to consider it advisory. > As for the btree portion of the change, I expect you understand this > better than I do, so I'm reluctant to stick my neck out, but it seems > that what the patch does is force index cleanup to happen even when > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and > the btree index has at least 1 recyclable page. My first reaction is > to wonder whether that doesn't nerf this feature into oblivion. Isn't > it likely that an index that is being vacuumed for wraparound will > have a recyclable page someplace? If the presence of that 1 recyclable > page causes the "help, I'm about to run out of XIDs, please do the > least possible work" flag to become a no-op, I don't think users are > going to be too happy with that. Maybe I am misunderstanding. I have mixed feelings about it myself. These are valid concerns. This is a problem to the extent that the user has a table that they'd like to use INDEX_CLEANUP with, that has indexes that regularly require cleanup due to page deletion. ISTM, then, that the really relevant high level design questions for this patch are: 1. How often is that likely to happen in The Real World™? 
2. If we fail to do cleanup and leak already-deleted pages, how bad is that? (Both in general, and in the worst case.) I'll hazard a guess for 1: I think that it might not come up that often. Page deletion is often something that we hardly ever need. And, unlike some DB systems, we only do it when pages are fully empty (which, as it turns out, isn't necessarily better than our simple approach [1]). I tend to think it's unlikely to happen in cases where INDEX_CLEANUP is used, because those are cases that also must not have that much index churn to begin with. Then there's question 2. The intuition behind the approach from Sawada-san's patch was that allowing wraparound here feels wrong -- it should be in the AM's hands. However, it's not like I can point to some ironclad guarantee about not leaking deleted pages that existed before the INDEX_CLEANUP feature. We know that the FSM is not crash safe, and that's that. Is this really all that different? Maybe it is, but it seems like a quantitative difference to me. I'm kind of arguing against myself even as I try to advance my original argument. If workloads that use INDEX_CLEANUP don't need to delete and recycle pages in any case, then why should we care that those same workloads might leak pages on account of the wraparound hazard? There's nothing to leak! Maybe some compromise is possible, at least for the backbranches. Perhaps nbtree is told about the requirements in a bit more detail than we'd imagined. It's not just an INDEX_CLEANUP boolean. It could be an enum or something. Not sure, though. The real reason that I want to push the mechanism down into index access methods is because that design is clearly better overall; it just so happens that the specific way in which we currently defer recycling in nbtree makes very little sense, so it's harder to see the big picture. The xid-cleanup design that we have was approximately the easiest way to do it, so that's what we got.
We should figure out a way to recycle the pages at something close to the earliest possible opportunity, without having to perform a full scan on the index relation within btvacuumscan(). Maybe we can use the autovacuum work item mechanism for that. For indexes that get VACUUMed once a week on average, it makes zero sense to wait another week to recycle the pages that get deleted, in a staggered fashion. It should be possible to recycle the pages a minute or two after VACUUM proper finishes, with extra work that's proportionate to the number of deleted pages. This is still conservative. I am currently very busy with an unrelated B-Tree prototype, so I might not get around to it this year. Maybe Sawada-san can think about this? Also, handling of the vacuum_cleanup_index_scale_factor stuff (added to Postgres 11 by commit 857f9c36) should live next to code for the confusingly-similar INDEX_CLEANUP stuff (added to Postgres 12 by commit a96c41feec6), on general principle. I think that that organization is a lot easier to follow. [1] https://www.sciencedirect.com/science/article/pii/002200009390020W -- Peter Geoghegan
On Thu, 25 Jun 2020 at 05:02, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, Jun 24, 2020 at 10:21 AM Robert Haas <robertmhaas@gmail.com> wrote: > > Sorry, I'm so far behind on my email. Argh. > > That's okay. > > > I think, especially on the blog post you linked, that we should aim to > > have INDEX_CLEANUP OFF mode do the minimum possible amount of work > > while still keeping us safe against transaction ID wraparound. So if, > > for example, it's desirable but not imperative for BRIN to > > resummarize, then it's OK normally but should be skipped with > > INDEX_CLEANUP OFF. > > I agree that that's very important. > > > Apparently, what we're really doing here is that even when > > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples. > > That seems sad, but if it's what we have to do then it at least needs > > comments explaining it. > > +1. Though I think that AMs should technically have the right to > consider it advisory. > > > As for the btree portion of the change, I expect you understand this > > better than I do, so I'm reluctant to stick my neck out, but it seems > > that what the patch does is force index cleanup to happen even when > > INDEX_CLEANUP is OFF provided that the vacuum is for wraparound and > > the btree index has at least 1 recyclable page. My first reaction is > > to wonder whether that doesn't nerf this feature into oblivion. Isn't > > it likely that an index that is being vacuumed for wraparound will > > have a recyclable page someplace? If the presence of that 1 recyclable > > page causes the "help, I'm about to run out of XIDs, please do the > > least possible work" flag to become a no-op, I don't think users are > > going to be too happy with that. Maybe I am misunderstanding. > > I have mixed feelings about it myself. These are valid concerns. 
> > This is a problem to the extent that the user has a table that they'd > like to use INDEX_CLEANUP with, that has indexes that regularly > require cleanup due to page deletion. ISTM, then, that the really > relevant high level design questions for this patch are: > > 1. How often is that likely to happen in The Real World™? > > 2. If we fail to do cleanup and leak already-deleted pages, how bad is > that? ( Both in general, and in the worst case.) > > I'll hazard a guess for 1: I think that it might not come up that > often. Page deletion is often something that we hardly ever need. And, > unlike some DB systems, we only do it when pages are fully empty > (which, as it turns out, isn't necessarily better than our simple > approach [1]). I tend to think it's unlikely to happen in cases where > INDEX_CLEANUP is used, because those are cases that also must not have > that much index churn to begin with. > > Then there's question 2. The intuition behind the approach from > Sawada-san's patch was that allowing wraparound here feels wrong -- it > should be in the AM's hands. However, it's not like I can point to > some ironclad guarantee about not leaking deleted pages that existed > before the INDEX_CLEANUP feature. We know that the FSM is not crash > safe, and that's that. Is this really all that different? Maybe it is, > but it seems like a quantitative difference to me. I think that with the approach implemented in my patch, the problem for the user is that they cannot easily know in advance whether a vacuum with INDEX_CLEANUP false will perform index cleanup, even if page deletion doesn't happen in most cases. They can check whether the vacuum will be a wraparound vacuum, but it's relatively hard for users to check in advance whether there are recyclable pages in the btree index. This will be a restriction for users, especially those who want to use the INDEX_CLEANUP feature to speed up an impending XID wraparound vacuum, as described in the blog post that Peter shared.
I haven't come up with a good solution to keep us safe against XID wraparound yet, but it seems to me that it's better to have an option that forces index cleanup not to happen. > > I'm kind of arguing against myself even as I try to advance my > original argument. If workloads that use INDEX_CLEANUP don't need to > delete and recycle pages in any case, then why should we care that > those same workloads might leak pages on account of the wraparound > hazard? I had the same impression at first. > The real reason that I want to push the mechanism down into index > access methods is because that design is clearly better overall; it > just so happens that the specific way in which we currently defer > recycling in nbtree makes very little sense, so it's harder to see the > big picture. The xid-cleanup design that we have was approximately the > easiest way to do it, so that's what we got. We should figure out a > way to recycle the pages at something close to the earliest possible > opportunity, without having to perform a full scan on the index > relation within btvacuumscan(). +1 > Maybe we can use the autovacuum work > item mechanism for that. For indexes that get VACUUMed once a week on > average, it makes zero sense to wait another week to recycle the pages > that get deleted, in a staggered fashion. It should be possible to > recycle the pages a minute or two after VACUUM proper finishes, with > extra work that's proportionate to the number of deleted pages. This > is still conservative. I am currently very busy with an unrelated > B-Tree prototype, so I might not get around to it this year. Maybe > Sawada-san can think about this? I thought that btbulkdelete and/or btvacuumcleanup could register an autovacuum work item to recycle the pages that get deleted, but it might not be able to recycle those pages reliably because the autovacuum work items could be picked up just after the vacuum.
And if page deletion is a relatively rare case in practice, we might be
able to take an optimistic approach: vacuum registers deleted pages in
the FSM at deletion time, and a process that takes a free page checks
whether the page is really recyclable. Anyway, I'll try to think more
about this.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
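For readers following the wraparound mechanics in this thread: the
recyclability test quoted at the top (_bt_vacuum_needs_cleanup) hinges
on PostgreSQL's circular 32-bit XID comparison. A minimal standalone
sketch is below -- xid_precedes mirrors the arithmetic of
TransactionIdPrecedes (modulo-2^32 comparison), while
deleted_page_looks_recyclable is a simplified stand-in for the metapage
check, not the real function signature:

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Circular XID comparison, as in TransactionIdPrecedes: with 32-bit
 * XIDs, "a precedes b" is defined modulo 2^32, so roughly two billion
 * XIDs count as "in the past" and two billion as "in the future". */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Simplified stand-in for the _bt_vacuum_needs_cleanup() test: a
 * deleted page is recyclable once its btpo.xact precedes the oldest
 * xmin. If the stored XID is never checked (INDEX_CLEANUP off skips
 * btvacuumcleanup), then after ~2^31 more transactions the comparison
 * flips: the old XID suddenly looks like it lies in the future, and
 * the page appears unreclaimable. */
static bool
deleted_page_looks_recyclable(TransactionId btpo_xact,
                              TransactionId recent_global_xmin)
{
    return xid_precedes(btpo_xact, recent_global_xmin);
}
```

The last pair of assertions below shows the hazard directly: a page
stamped with XID 100 looks recyclable while xmin is at two billion, but
once xmin wraps past three billion the very same page no longer does.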
On Wed, Jun 24, 2020 at 4:02 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Apparently, what we're really doing here is that even when
> > INDEX_CLEANUP is OFF, we're still going to keep all the dead tuples.
> > That seems sad, but if it's what we have to do then it at least needs
> > comments explaining it.
>
> +1. Though I think that AMs should technically have the right to
> consider it advisory.

I'm not really convinced. I agree that from a theoretical point of view
an index can have arbitrary needs and is the arbiter of its own needs,
but when I pull the emergency brake, I want the vehicle to stop, not
think about stopping. There's a fine argument for the idea that
depressing the regular brake pedal entitles the vehicle to exercise
some discretion, and on modern cars it does (think ABS, if nothing
else). But pulling the emergency brake is a statement that I wish to
override any contrary judgement about whether stopping is a good idea.
I think this option is rightly viewed as an emergency brake, and giving
AMs the right to decide that we'll instead pull off at the next exit
doesn't sit well with me. At the end of the day, the human being should
be in charge, not the program. (Great, now Skynet will be gunning for
me...)

> 1. How often is that likely to happen in The Real World™?
>
> 2. If we fail to do cleanup and leak already-deleted pages, how bad is
> that? (Both in general, and in the worst case.)
>
> I'll hazard a guess for 1: I think that it might not come up that
> often. Page deletion is often something that we hardly ever need. And,
> unlike some DB systems, we only do it when pages are fully empty
> (which, as it turns out, isn't necessarily better than our simple
> approach [1]). I tend to think it's unlikely to happen in cases where
> INDEX_CLEANUP is used, because those are cases that also must not have
> that much index churn to begin with.

I don't think I believe this. All you need is one small range-deletion,
right?

> Then there's question 2.
> The intuition behind the approach from
> Sawada-san's patch was that allowing wraparound here feels wrong -- it
> should be in the AM's hands. However, it's not like I can point to
> some ironclad guarantee about not leaking deleted pages that existed
> before the INDEX_CLEANUP feature. We know that the FSM is not crash
> safe, and that's that. Is this really all that different? Maybe it is,
> but it seems like a quantitative difference to me.

I don't think I believe this, either. In the real-world example to
which you linked, the user ran REINDEX afterward to recover from index
bloat, and we could advise other people who use this option that it may
leak space that a subsequent VACUUM may fail to recover, and therefore
they too should consider REINDEX. Bloat sucks and I hate it, but in the
vehicle analogy from up above, it's the equivalent of getting lost
while driving someplace. It is inconvenient and may cause you many
problems, but you will not be dead. Running out of XIDs is a brick
wall. Either the car stops or you hit the wall. Ideally you can manage
to both not get lost and also not hit a brick wall, but in an emergency
situation where you have to choose either to get lost or to hit a brick
wall, there's only one right answer.

As bad as bloat is, and it's really bad, there are users who manage to
run incredibly bloated databases for long periods of time just because
the stuff that gets slow is either stuff that they're not doing at all,
or only doing in batch jobs where it's OK if it runs super-slow and
where it may even be possible to disable the batch job altogether, at
least for a while. The set of users who can survive running out of XIDs
is limited to those who can get by with just read-only queries, and
that's practically nobody. I have yet to encounter a customer who
didn't consider running out of XIDs to be an emergency.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 25, 2020 at 6:59 AM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > I think that with the approach implemented in my patch, it could be a > problem for the user that the user cannot easily know in advance > whether vacuum with INDEX_CLEANUP false will perform index cleanup, > even if page deletion doesn’t happen in most cases. I was unclear. I agree that the VACUUM command with "INDEX_CLEANUP = off" is an emergency mechanism that should be fully respected, even when that means that we'll leak deleted pages. Perhaps it would make sense to behave differently when the index is on a table that has "vacuum_index_cleanup = off" set, and the vacuum is started by autovacuum, and is not an anti-wraparound vacuum. That doesn't seem all that appealing now that I write it down, though, because it's a non-obvious behavioral difference among cases that users probably expect to behave similarly. On the other hand, what user knows that there is something called an aggressive vacuum, which isn't exactly the same thing as an anti-wraparound vacuum? I find it hard to decide what the least-worst thing is for the backbranches. What do you think? > I don’t come up with a good solution to keep us safe against XID > wraparound yet but it seems to me that it’s better to have an option > that forces index cleanup not to happen. I don't think that there is a good solution that is suitable for backpatching. The real solution is to redesign the recycling along the lines I described. I don't think that it's terrible that we can leak deleted pages, especially considering the way that users are expected to use the INDEX_CLEANUP feature. I would like to be sure that the problem is well understood, though -- we should at least have a plan for Postgres v14. 
> I thought that btbulkdelete and/or btvacuumcleanup can register an
> autovacuum work item to recycle the page that gets deleted but it
> might not able to recycle those pages enough because the autovacuum
> work items could be taken just after vacuum. And if page deletion is
> relatively a rare case in practice, we might be able to take an
> optimistic approach that vacuum registers deleted pages to FSM on the
> deletion and a process who takes a free page checks if the page is
> really recyclable. Anyway, I’ll try to think more about this.

Right -- just putting the pages in the FSM immediately, and making it a
problem that we deal with within _bt_getbuf() is an alternative
approach that might be better.

--
Peter Geoghegan
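The "deal with it within _bt_getbuf()" idea above can be sketched as
follows. This is a hypothetical model, not PostgreSQL code: the "FSM"
here is just an array of (block, deletion XID) pairs, and
get_recyclable_page stands in for the extra recyclability check a real
_bt_getbuf() would have to perform before reusing a deleted page:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef uint32_t TransactionId;
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Circular XID comparison, as in TransactionIdPrecedes */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

/* Hypothetical FSM entry: VACUUM registers the page here immediately
 * at deletion time, stamped with the XID recorded in btpo.xact. */
typedef struct FreePage
{
    BlockNumber   blkno;
    TransactionId btpo_xact;  /* XID stamped at page deletion */
    bool          taken;
} FreePage;

/*
 * Return a free block that is safe to reuse, or InvalidBlockNumber.
 * A deleted page is only recyclable once no running transaction can
 * still reach it, i.e. its btpo_xact precedes the oldest xmin.
 */
static BlockNumber
get_recyclable_page(FreePage *fsm, size_t n, TransactionId oldest_xmin)
{
    for (size_t i = 0; i < n; i++)
    {
        if (fsm[i].taken)
            continue;
        if (xid_precedes(fsm[i].btpo_xact, oldest_xmin))
        {
            fsm[i].taken = true;
            return fsm[i].blkno;    /* safe: no one can still see it */
        }
        /* not yet recyclable -- leave it for a later caller */
    }
    return InvalidBlockNumber;      /* caller would extend the relation */
}
```

The design choice this illustrates: the consumer pays a small check per
allocation, but no separate cleanup scan of the index is ever needed to
make deleted pages reusable.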
On Thu, Jun 25, 2020 at 8:28 AM Robert Haas <robertmhaas@gmail.com> wrote: > I'm not really convinced. I agree that from a theoretical point of > view an index can have arbitrary needs and is the arbiter of its own > needs, but when I pull the emergency break, I want the vehicle to > stop, not think about stopping. Making this theoretical argument in the context of this discussion was probably a mistake. I agree that this is the emergency break, and it needs to work like one. It might be worth considering some compromise in the event of using the "vacuum_index_cleanup" reloption (i.e. when the user has set it to 'off'), provided there is good reason to believe that we're not in an emergency -- I mentioned this to Masahiko just now. I admit that that isn't very appealing for other reasons, but it's worth considering a way of ameliorating the problem in back branches. We really ought to change how recycling works, so that it happens at the tail end of the same VACUUM operation that deleted the pages -- but that cannot be backpatched. It might be that the most appropriate mitigation in the back branches is a log message that reports on the fact that we've probably leaked pages due to this issue. Plus some documentation. Though even that would require calling nbtree to check if that is actually true (by checking the metapage), so it still requires backpatching something close to Masahiko's draft patch. > I don't think I believe this. All you need is one small range-deletion, right? Right. > > Then there's question 2. The intuition behind the approach from > > Sawada-san's patch was that allowing wraparound here feels wrong -- it > > should be in the AM's hands. However, it's not like I can point to > > some ironclad guarantee about not leaking deleted pages that existed > > before the INDEX_CLEANUP feature. We know that the FSM is not crash > > safe, and that's that. Is this really all that different? Maybe it is, > > but it seems like a quantitative difference to me. 
> > I don't think I believe this, either. In the real-world example to > which you linked, the user ran REINDEX afterward to recover from index > bloat, and we could advise other people who use this option that it > may leak space that a subsequent VACUUM may fail to recover, and > therefore they too should consider REINDEX. I was talking about the intuition behind the design. I did not intend to suggest that nbtree should ignore "INDEX_CLEANUP = off" regardless of the consequences. I am sure about this much: The design embodied by Masahiko's patch is clearly a better one overall, even if it doesn't fix the problem on its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP = off", even if that means leaking pages that could otherwise be recycled. I'm not sure what we should do about any of this in the back branches, though. I wish I had a simple idea about what to do there. -- Peter Geoghegan
On Thu, Jun 25, 2020 at 8:44 PM Peter Geoghegan <pg@bowt.ie> wrote: > I am sure about this much: The design embodied by Masahiko's patch is > clearly a better one overall, even if it doesn't fix the problem on > its own. I agree that we cannot allow nbtree to ignore "INDEX_CLEANUP > = off", even if that means leaking pages that could otherwise be > recycled. I'm not sure what we should do about any of this in the back > branches, though. I wish I had a simple idea about what to do there. My opinion is that there's no need to change the code in the back-branches, and that I don't really like the approach in master either. I think what we're saying is that there is no worse consequence to turning off index_cleanup than some bloat that isn't likely to be recovered unless you REINDEX. If the problem in question were going to cause data loss or data corruption or something, we'd have to take stronger action, but I don't think anyone's saying that this is the case. Therefore, I think we can handle the back-branches by letting users know about the bloat hazard and suggesting that they avoid this option unless it's necessary to avoid running out of XIDs. Now, what about master? I think it's fine to offer the AM a callback even when index_cleanup = false, for example so that it can freeze something in its metapage, but I don't agree with passing it the TIDs. That seems like it's just inviting it to ignore the emergency brake, and it's also incurring real overhead, because remembering all those TIDs can use a lot of memory. If that API limitation causes a problem for some future index AM, that will be a good point to discuss when the patch for said AM is submitted for review. I entirely agree with you that the way btree arranges for btree recycling is crude, and I would be delighted if you want to improve it, either for v14 or for any future release, or if somebody else wants to do so. However, even if that never happens, so what? 
In retrospect, I regret committing this patch without better understanding the issues in this area. That was a fail on my part. At the same time, it doesn't really sound like the issues are all that bad. The potential index bloat does suck, but it can still suck less than the alternatives, and we have evidence that for at least one user, it was worth a major version upgrade just to replace the suckitude they had with the suckitude this patch creates. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Jun 26, 2020 at 5:39 AM Robert Haas <robertmhaas@gmail.com> wrote: > My opinion is that there's no need to change the code in the > back-branches, and that I don't really like the approach in master > either. I guess it's hard to see a way that we could fix this in the backbranches, provided we aren't willing to tolerate a big refactor, or a cleanup scan of the index (note that I mean btvacuumcleanup(), not btvacuumscan(), which is quite different). > I think what we're saying is that there is no worse consequence to > turning off index_cleanup than some bloat that isn't likely to be > recovered unless you REINDEX. That's true. > Now, what about master? I think it's fine to offer the AM a callback > even when index_cleanup = false, for example so that it can freeze > something in its metapage, but I don't agree with passing it the TIDs. > That seems like it's just inviting it to ignore the emergency brake, > and it's also incurring real overhead, because remembering all those > TIDs can use a lot of memory. You don't have to do anything with TIDs passed from vacuumlazy.c to recycle pages that need to be recycled, since you only have to go through btvacuumcleanup() to avoid the problem that we're talking about (you don't have to call btvacuumscan() to kill TIDs that vacuumlazy.c will have pruned). Killing TIDs/tuples in the index was never something that would make sense, even within the confines of the existing flawed nbtree recycling design. However, you do need to scan the entire index to do that much. FWIW, that doesn't seem like it *totally* violates the spirit of "index_cleanup = false", since you're still not doing most of the usual nbtree vacuuming stuff (even though you have to scan the index, there is still much less work total). > If that API limitation causes a problem > for some future index AM, that will be a good point to discuss when > the patch for said AM is submitted for review. 
I entirely agree with > you that the way btree arranges for btree recycling is crude, and I > would be delighted if you want to improve it, either for v14 or for > any future release, or if somebody else wants to do so. However, even > if that never happens, so what? I think that it's important to be able to describe an ideal (though still realistic) design, even if it might remain aspirational for a long time. I suspect that pushing the mechanism down into index AMs has other non-obvious benefits. > In retrospect, I regret committing this patch without better > understanding the issues in this area. That was a fail on my part. At > the same time, it doesn't really sound like the issues are all that > bad. The potential index bloat does suck, but it can still suck less > than the alternatives, and we have evidence that for at least one > user, it was worth a major version upgrade just to replace the > suckitude they had with the suckitude this patch creates. I actually agree -- this is a really important feature, and I'm glad that we have it. Even in this slightly flawed form. I remember a great need for the feature back when I was involved in supporting Postgres in production. -- Peter Geoghegan
On Sat, 27 Jun 2020 at 08:00, Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Jun 26, 2020 at 5:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > My opinion is that there's no need to change the code in the
> > back-branches, and that I don't really like the approach in master
> > either.
>
> I guess it's hard to see a way that we could fix this in the
> backbranches, provided we aren't willing to tolerate a big refactor,
> or a cleanup scan of the index (note that I mean btvacuumcleanup(),
> not btvacuumscan(), which is quite different).

Agreed.

> > I think what we're saying is that there is no worse consequence to
> > turning off index_cleanup than some bloat that isn't likely to be
> > recovered unless you REINDEX.
>
> That's true.

Regarding the extent of the impact, this bug will affect users who have
had vacuum_index_cleanup turned off, or who have manually executed
VACUUM with INDEX_CLEANUP off, for a long time, after some vacuums have
already run. On the other hand, users who use INDEX_CLEANUP off only on
the spot, or who turn vacuum_index_cleanup off for the table from the
start, would be less affected, or not affected at all.

> > In retrospect, I regret committing this patch without better
> > understanding the issues in this area. That was a fail on my part. At
> > the same time, it doesn't really sound like the issues are all that
> > bad. The potential index bloat does suck, but it can still suck less
> > than the alternatives, and we have evidence that for at least one
> > user, it was worth a major version upgrade just to replace the
> > suckitude they had with the suckitude this patch creates.
>
> I actually agree -- this is a really important feature, and I'm glad
> that we have it. Even in this slightly flawed form. I remember a great
> need for the feature back when I was involved in supporting Postgres
> in production.

I apologize for writing this patch without enough consideration.
I should have been more careful, since I learned about the nbtree page
recycling strategy while discussing the vacuum_cleanup_index_scale_factor
patch.

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > Regarding to the extent of the impact, this bug will affect the user > who turned vacuum_index_cleanup off or executed manually vacuum with > INDEX_CLEANUP off for a long time, after some vacuums. On the other > hand, the user who uses INDEX_CLEANUP off on the spot or turns > vacuum_index_cleanup off of the table from the start would not be > affected or less affected. I don't think that it's likely to cause too much trouble. It's already possible to leak deleted pages, if only because the FSM isn't crash safe. Actually, the nbtree README says this, and has since 2003: """ (Note: if we find a deleted page with an extremely old transaction number, it'd be worthwhile to re-mark it with FrozenTransactionId so that a later xid wraparound can't cause us to think the page is unreclaimable. But in more normal situations this would be a waste of a disk write.) """ But, uh, isn't the btvacuumcleanup() call supposed to avoid wraparound? Who knows?! It doesn't seem like the recycling aspect of page deletion was rigorously designed, possibly because it's harder to test than page deletion itself. This is a problem that we should fix. > I apologize for writing this patch without enough consideration. I > should have been more careful as I learned the nbtree page recycle > strategy when discussing vacuum_cleanup_index_scale_factor patch. While it's unfortunate that this was missed, let's not lose perspective. Anybody using the INDEX_CLEANUP feature (whether it's through a direct VACUUM, or by using the reloption) is already asking for an extreme behavior: skipping regular index vacuuming. I imagine that the vast majority of users that are in that position just don't care about the possibility of leaking deleted pages. They care about avoiding a real disaster from XID wraparound. -- Peter Geoghegan
On Sun, 28 Jun 2020 at 02:44, Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada > <masahiko.sawada@2ndquadrant.com> wrote: > > Regarding to the extent of the impact, this bug will affect the user > > who turned vacuum_index_cleanup off or executed manually vacuum with > > INDEX_CLEANUP off for a long time, after some vacuums. On the other > > hand, the user who uses INDEX_CLEANUP off on the spot or turns > > vacuum_index_cleanup off of the table from the start would not be > > affected or less affected. > > I don't think that it's likely to cause too much trouble. It's already > possible to leak deleted pages, if only because the FSM isn't crash > safe. Actually, the nbtree README says this, and has since 2003: > > """ > (Note: if we find a deleted page with an extremely old transaction > number, it'd be worthwhile to re-mark it with FrozenTransactionId so that > a later xid wraparound can't cause us to think the page is unreclaimable. > But in more normal situations this would be a waste of a disk write.) > """ > > But, uh, isn't the btvacuumcleanup() call supposed to avoid > wraparound? Who knows?! > > It doesn't seem like the recycling aspect of page deletion was > rigorously designed, possibly because it's harder to test than page > deletion itself. This is a problem that we should fix. Agreed. > > > I apologize for writing this patch without enough consideration. I > > should have been more careful as I learned the nbtree page recycle > > strategy when discussing vacuum_cleanup_index_scale_factor patch. > > While it's unfortunate that this was missed, let's not lose > perspective. Anybody using the INDEX_CLEANUP feature (whether it's > through a direct VACUUM, or by using the reloption) is already asking > for an extreme behavior: skipping regular index vacuuming. I imagine > that the vast majority of users that are in that position just don't > care about the possibility of leaking deleted pages. 
> They care about
> avoiding a real disaster from XID wraparound.

For back branches, I'm considering how we let users know about this.
For safety, we can let users know, in the documentation and/or the
release notes, that we recommend avoiding INDEX_CLEANUP false unless
it's necessary to avoid running out of XIDs. But on the other hand,
given that leaving recyclable pages behind was already possible, as you
mentioned, I'm concerned that this could confuse users and needlessly
incite unrest. I'm thinking about what we can do for users, in addition
to leaving a summary of this discussion as a source code comment. What
do you think?

Regards,

--
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jun 29, 2020 at 9:51 PM Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote: > > On Sun, 28 Jun 2020 at 02:44, Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Fri, Jun 26, 2020 at 10:15 PM Masahiko Sawada > > <masahiko.sawada@2ndquadrant.com> wrote: > > > Regarding to the extent of the impact, this bug will affect the user > > > who turned vacuum_index_cleanup off or executed manually vacuum with > > > INDEX_CLEANUP off for a long time, after some vacuums. On the other > > > hand, the user who uses INDEX_CLEANUP off on the spot or turns > > > vacuum_index_cleanup off of the table from the start would not be > > > affected or less affected. > > > > I don't think that it's likely to cause too much trouble. It's already > > possible to leak deleted pages, if only because the FSM isn't crash > > safe. Actually, the nbtree README says this, and has since 2003: > > > > """ > > (Note: if we find a deleted page with an extremely old transaction > > number, it'd be worthwhile to re-mark it with FrozenTransactionId so that > > a later xid wraparound can't cause us to think the page is unreclaimable. > > But in more normal situations this would be a waste of a disk write.) > > """ > > > > But, uh, isn't the btvacuumcleanup() call supposed to avoid > > wraparound? Who knows?! > > > > It doesn't seem like the recycling aspect of page deletion was > > rigorously designed, possibly because it's harder to test than page > > deletion itself. This is a problem that we should fix. > > Agreed. > > > > > > I apologize for writing this patch without enough consideration. I > > > should have been more careful as I learned the nbtree page recycle > > > strategy when discussing vacuum_cleanup_index_scale_factor patch. > > > > While it's unfortunate that this was missed, let's not lose > > perspective. 
> > Anybody using the INDEX_CLEANUP feature (whether it's
> > through a direct VACUUM, or by using the reloption) is already asking
> > for an extreme behavior: skipping regular index vacuuming. I imagine
> > that the vast majority of users that are in that position just don't
> > care about the possibility of leaking deleted pages. They care about
> > avoiding a real disaster from XID wraparound.
>
> For back branches, I'm considering how we let users know about this.
> For safety, we can let users know that we recommend avoiding
> INDEX_CLEANUP false unless it's necessary to avoid running out of XIDs
> on the documentation and/or the release note. But on the other hand,
> since there is the fact that leaving recyclable pages is already
> possible to happen as you mentioned I'm concerned it gets the user
> into confusion and might needlessly incite unrest of users. I'm
> thinking what we can do for users, in addition to leaving the summary
> of this discussion as a source code comment. What do you think?

Several months have passed since this discussion. We decided not to do
anything in the back branches, but IIUC the fundamental issue is not
fixed yet. The issue Andres pointed out -- that we should leave it to
the index AM to decide whether to do vacuum cleanup when INDEX_CLEANUP
is specified -- is still valid. Is that right?

For HEAD, there was a discussion that we change the lazy vacuum,
bulkdelete, and vacuumcleanup APIs so that these APIs are called even
when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is
specified, lazy vacuum still collects dead tuple TIDs into
maintenance_work_mem space and passes a flag to the index AMs
indicating whether INDEX_CLEANUP was specified. The index AM then
decides whether to do bulkdelete/vacuumcleanup. A downside of this idea
would be that we end up using maintenance_work_mem even if none of the
table's index AMs do bulkdelete/vacuumcleanup at all.
The second idea I came up with is to add an index AM API (say,
amcanskipindexcleanup = true/false) telling lazy vacuum whether index
cleanup is skippable. Lazy vacuum checks this flag for each index on
the table before starting. If index cleanup is skippable for all
indexes, it can choose one-pass vacuum, meaning there is no need to
collect dead tuple TIDs in maintenance_work_mem. All in-core index AMs
would set it to true. Perhaps it should be true (skippable) by default,
for backward compatibility.

The in-core AMs, including btree indexes, will work the same as before.
This fix makes the behavior more desirable, and could possibly help
other AMs that require vacuumcleanup to be called in all cases. Once we
fix it, I wonder if we can disable index cleanup during autovacuum's
anti-wraparound vacuums.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
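A rough model of the proposed flag might look like the sketch below.
Everything here is illustrative: amcanskipindexcleanup is the name from
the proposal, not a committed API, and the struct and function are
simplified stand-ins for the real IndexAmRoutine and lazy vacuum logic:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slice of an AM's routine table: the AM declares up
 * front whether its cleanup call is safely skippable. */
typedef struct IndexAmRoutine
{
    const char *amname;
    bool        amcanskipindexcleanup;
} IndexAmRoutine;

/*
 * If INDEX_CLEANUP off was requested and every index on the table says
 * cleanup is skippable, lazy vacuum can use the one-pass strategy: no
 * dead tuple TIDs collected in maintenance_work_mem, no index passes.
 * A single AM that insists on its cleanup call forces the normal path.
 */
static bool
can_do_onepass_vacuum(const IndexAmRoutine *indexes, size_t nindexes,
                      bool index_cleanup_off)
{
    if (!index_cleanup_off)
        return false;            /* normal vacuum: indexes are scanned */
    for (size_t i = 0; i < nindexes; i++)
        if (!indexes[i].amcanskipindexcleanup)
            return false;        /* this AM requires its cleanup call */
    return true;
}
```

This shows why the flag addresses the memory-use downside mentioned
above: the decision to skip the dead-TID store is made before the heap
scan starts, per table, based on what each index AM declares.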
On Thu, Nov 19, 2020 at 5:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > Several months passed from the discussion. We decided not to do > anything on back branches but IIUC the fundamental issue is not fixed > yet. The issue pointed out by Andres that we should leave the index AM > to decide whether doing vacuum cleanup or not when INDEX_CLEANUP is > specified is still valid. Is that right? I don't remember if Andres actually said that publicly, but I definitely did. I do remember discussing this with him privately, at which point he agreed with what I said. Which you just summarized well. > For HEAD, there was a discussion that we change lazy vacuum and > bulkdelete and vacuumcleanup APIs so that it calls these APIs even > when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is > specified, it collects dead tuple TIDs into maintenance_work_mem space > and passes the flag indicating INDEX_CLEANUP is specified or not to > index AMs. Right. > Index AM decides whether doing bulkdelete/vacuumcleanup. A > downside of this idea would be that we will end up using > maintenance_work_mem even if all index AMs of the table don't do > bulkdelete/vacuumcleanup at all. That is a downside, but I don't think that it's a serious downside. But it may not matter, because there are lots of reasons to move in this direction. > The second idea I came up with is to add an index AM API (say, > amcanskipindexcleanup = true/false) telling index cleanup is skippable > or not. Lazy vacuum checks this flag for each index on the table > before starting. If index cleanup is skippable in all indexes, it can > choose one-pass vacuum, meaning no need to collect dead tuple TIDs in > maintenance_work_mem. All in-core index AM will set to true. Perhaps > it’s true (skippable) by default for backward compatibility. (The terminology here is very confusing, because the goal of the INDEX_CLEANUP feature in v12 is not really to skip a call to btvacuumcleanup(). 
The goal is really to skip
a call to btbulkdelete(). I will try to be careful.)

I think that the ideal design here would be a new hybrid of two
existing features:

1.) Your INDEX_CLEANUP feature from Postgres 12.

and:

2.) Your vacuum_cleanup_index_scale_factor feature from Postgres 11.

The INDEX_CLEANUP feature is very useful, because skipping indexes
entirely can be very helpful for many reasons (e.g. faster manual
VACUUM in the event of wraparound related emergencies). But
INDEX_CLEANUP has 2 problems today:

A. It doesn't interact well with vacuum_cleanup_index_scale_factor.
This is the problem that has been discussed on this thread.

and:

B. It is an "all or nothing" thing. Unlike the
vacuum_cleanup_index_scale_factor feature, it does not care about what
the index AM/individual index wants. But it should.

(*Thinks some more*)

Actually, on second thought, maybe INDEX_CLEANUP only has one problem.
Problem A is actually just a special case of problem B. There are many
interesting opportunities created by solving problem B comprehensively.

So, what useful enhancements to VACUUM are possible once we have
something like INDEX_CLEANUP, that is sensitive to the needs of
indexes? Well, you already identified one yourself, so obviously you're
thinking about this in a similar way already:

> The in-core AMs including btree indexes will work same as before. This
> fix is to make it more desirable behavior and possibly to help other
> AMs that require to call vacuumcleanup in all cases. Once we fix it I
> wonder if we can disable index cleanup when autovacuum’s
> anti-wraparound vacuum.

Obviously this is a good idea. The fact that anti-wraparound vacuum
isn't really special compared to regular autovacuum is *bad*. Obviously
anti-wraparound is in some sense more important than regular vacuum.
Making it as similar as possible to vacuum simply isn't helpful. Maybe
it is slightly more elegant in theory, but in the real world it is a
poor design.
(See also: every single PostgreSQL post mortem that has ever been written.) But why stop with this? There are other big advantages to allowing individual indexes/index AMs influence of the INDEX_CLEANUP behavior. Especially if they're sensitive to the needs of particular indexes on a table (not just all of the indexes on the table taken together). As you may know, my bottom-up index deletion patch can more or less eliminate index bloat in indexes that don't get "logically changed" by many non-HOT updates. It's *very* effective with non-HOT updates and lots of indexes. See this benchmark result for a recent example: https://postgr.es/m/CAGnEbohYF_K6b0v=2uc289=v67qNhc3n01Ftic8X94zP7kKqtw@mail.gmail.com The feature is effective enough to make it almost unnecessary to VACUUM certain indexes -- though not necessarily other indexes on the same table. Of course, in the long term it will eventually be necessary to really vacuum these indexes. Not because the indexes themselves care, though -- they really don't (if they don't receive logical changes from non-HOT updates, and so benefit very well from the proposed bottom-up index deletion mechanism, they really have no selfish reason to care if they ever get vacuumed by autovacuum). The reason we eventually need to call ambulkdelete() with these indexes (with all indexes, actually) even with these enhancements is related to the heap. We eventually want to make LP_DEAD line pointers in the heap LP_UNUSED. But we should be lazy about it, and wait until it becomes a real problem. Maybe we can only do a traditional VACUUM (with a second pass of the heap for heap LP_UNUSED stuff) much much less frequently than today. At the same time, we can set the FSM for index-only scans much more frequently. It's also important that we really make index vacuuming a per-index thing. 
You can see this in the example benchmark I linked to, which was posted by Victor: no page splits in one never-logically-modified index, and some page splits in other indexes that were actually changed by UPDATEs again and again. Clearly you can have several different indexes on the same table that have very different needs. With some indexes we want to be extra lazy (these are indexes that make good use of bottom-up deletion). But with other indexes on the same table we want to be eager. Maybe even very eager. If we can make per-index decisions, then every individual part of the system works well. Currently, the problem with autovacuum scheduling is that it probably makes sense for the heap with the defaults (or something like them), and probably doesn't make any sense for indexes (though it also varies among indexes). So today we have maybe 7 different things (maybe 6 indexes + 1 table), and we pretend that they are only one thing. It's just a fantasy. The reality is that we have 7 things that have only a very loose and complicated relationship to each other. We need to stop believing in this fantasy, and start paying attention to the more complicated reality. The only way to do that is ask each index directly, while being prepared to get very different answers from each index on the same table. Here is what I mean by that: it would also probably be very useful to do something like a ambulkdelete() call for only a subset of indexes that really need it. So you aggressively vacuum the one index that really does get logically modified by an UPDATE, and not the other 6 that don't. (Of course it's still true that we cannot have a second heap pass to make LP_DEAD line pointers in the heap LP_UNUSED -- obviously that's unsafe unless we're 100% sure that nothing in any index points to the now-LP_UNUSED line pointer. But many big improvements are possible without violating this basic invariant.) 
If you are able to pursue this project, in whole or in part, I would definitely be supportive of that. I may be able to commit it. I think that this project has many benefits, not just one or two. It seems strategic. -- Peter Geoghegan
On Fri, Nov 20, 2020 at 12:56 PM Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Nov 19, 2020 at 5:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Several months passed from the discussion. We decided not to do > > anything on back branches but IIUC the fundamental issue is not fixed > > yet. The issue pointed out by Andres that we should leave the index AM > > to decide whether doing vacuum cleanup or not when INDEX_CLEANUP is > > specified is still valid. Is that right? > > I don't remember if Andres actually said that publicly, but I > definitely did. I do remember discussing this with him privately, at > which point he agreed with what I said. Which you just summarized > well. > > > For HEAD, there was a discussion that we change lazy vacuum and > > bulkdelete and vacuumcleanup APIs so that it calls these APIs even > > when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is > > specified, it collects dead tuple TIDs into maintenance_work_mem space > > and passes the flag indicating INDEX_CLEANUP is specified or not to > > index AMs. > > Right. > > > Index AM decides whether doing bulkdelete/vacuumcleanup. A > > downside of this idea would be that we will end up using > > maintenance_work_mem even if all index AMs of the table don't do > > bulkdelete/vacuumcleanup at all. > > That is a downside, but I don't think that it's a serious downside. > But it may not matter, because there are lots of reasons to move in > this direction. > > > The second idea I came up with is to add an index AM API (say, > > amcanskipindexcleanup = true/false) telling index cleanup is skippable > > or not. Lazy vacuum checks this flag for each index on the table > > before starting. If index cleanup is skippable in all indexes, it can > > choose one-pass vacuum, meaning no need to collect dead tuple TIDs in > > maintenance_work_mem. All in-core index AM will set to true. Perhaps > > it’s true (skippable) by default for backward compatibility. 
> > (The terminology here is very confusing, because the goal of the > INDEX_CLEANUP feature in v12 is not really to skip a call to > btvacuumcleanup(). The goal is really to skip a call to > btbulkdelete(). I will try to be careful.) > > I think that the ideal design here would be a new hybrid of two > existing features: > > 1.) Your INDEX_CLEANUP feature from Postgres 12. > > and: > > 2.) Your vacuum_cleanup_index_scale_factor feature from Postgres 11. > > The INDEX_CLEANUP feature is very useful, because skipping indexes > entirely can be very helpful for many reasons (e.g. faster manual > VACUUM in the event of wraparound related emergencies). But > INDEX_CLEANUP has 2 problems today: > > A. It doesn't interact well with vacuum_cleanup_index_scale_factor. > This is the problem that has been discussed on this thread. > > and: > > B. It is an "all or nothing" thing. Unlike the > vacuum_cleanup_index_scale_factor feature, it does not care about what > the index AM/individual index wants. But it should. Agreed. > > (**Thinks some more***) > > Actually, on second thought, maybe INDEX_CLEANUP only has one problem. > Problem A is actually just a special case of problem B. There are many > interesting opportunities created by solving problem B > comprehensively. > > So, what useful enhancements to VACUUM are possible once we have > something like INDEX_CLEANUP, that is sensitive to the needs of > indexes? Well, you already identified one yourself, so obviously > you're thinking about this in a similar way already: > > > The in-core AMs including btree indexes will work same as before. This > > fix is to make it more desirable behavior and possibly to help other > > AMs that require to call vacuumcleanup in all cases. Once we fix it I > > wonder if we can disable index cleanup when autovacuum’s > > anti-wraparound vacuum. > > Obviously this is a good idea. The fact that anti-wraparound vacuum > isn't really special compared to regular autovacuum is *bad*. 
> Obviously anti-wraparound is in some sense more important than regular > vacuum. Making it as similar as possible to vacuum simply isn't > helpful. Maybe it is slightly more elegant in theory, but in the real > world it is a poor design. (See also: every single PostgreSQL post > mortem that has ever been written.) > > But why stop with this? There are other big advantages to allowing > individual indexes/index AMs influence of the INDEX_CLEANUP behavior. > Especially if they're sensitive to the needs of particular indexes on > a table (not just all of the indexes on the table taken together). > > As you may know, my bottom-up index deletion patch can more or less > eliminate index bloat in indexes that don't get "logically changed" by > many non-HOT updates. It's *very* effective with non-HOT updates and > lots of indexes. See this benchmark result for a recent example: > > https://postgr.es/m/CAGnEbohYF_K6b0v=2uc289=v67qNhc3n01Ftic8X94zP7kKqtw@mail.gmail.com > > The feature is effective enough to make it almost unnecessary to > VACUUM certain indexes -- though not necessarily other indexes on the > same table. Of course, in the long term it will eventually be > necessary to really vacuum these indexes. Not because the indexes > themselves care, though -- they really don't (if they don't receive > logical changes from non-HOT updates, and so benefit very well from > the proposed bottom-up index deletion mechanism, they really have no > selfish reason to care if they ever get vacuumed by autovacuum). > > The reason we eventually need to call ambulkdelete() with these > indexes (with all indexes, actually) even with these enhancements is > related to the heap. We eventually want to make LP_DEAD line pointers > in the heap LP_UNUSED. But we should be lazy about it, and wait until > it becomes a real problem. Maybe we can only do a traditional VACUUM > (with a second pass of the heap for heap LP_UNUSED stuff) much much > less frequently than today. 
At the same time, we can set the FSM for > index-only scans much more frequently. > > It's also important that we really make index vacuuming a per-index > thing. You can see this in the example benchmark I linked to, which > was posted by Victor: no page splits in one never-logically-modified > index, and some page splits in other indexes that were actually > changed by UPDATEs again and again. Clearly you can have several > different indexes on the same table that have very different needs. > > With some indexes we want to be extra lazy (these are indexes that > make good use of bottom-up deletion). But with other indexes on the > same table we want to be eager. Maybe even very eager. If we can make > per-index decisions, then every individual part of the system works > well. Currently, the problem with autovacuum scheduling is that it > probably makes sense for the heap with the defaults (or something like > them), and probably doesn't make any sense for indexes (though it also > varies among indexes). So today we have maybe 7 different things > (maybe 6 indexes + 1 table), and we pretend that they are only one > thing. It's just a fantasy. The reality is that we have 7 things that > have only a very loose and complicated relationship to each other. We > need to stop believing in this fantasy, and start paying attention to > the more complicated reality. The only way to do that is ask each > index directly, while being prepared to get very different answers > from each index on the same table. > > Here is what I mean by that: it would also probably be very useful to > do something like a ambulkdelete() call for only a subset of indexes > that really need it. So you aggressively vacuum the one index that > really does get logically modified by an UPDATE, and not the other 6 > that don't. 
(Of course it's still true that we cannot have a second > heap pass to make LP_DEAD line pointers in the heap LP_UNUSED -- > obviously that's unsafe unless we're 100% sure that nothing in any > index points to the now-LP_UNUSED line pointer. But many big > improvements are possible without violating this basic invariant.) I had missed your bottom-up index deletion patch, but it's a promising improvement. With that patch, the number of dead tuples in individual indexes may differ, so it's important that we make index vacuuming a per-index thing. Given that patch, it seems to me that it would be better to ask each individual index AM about its need for bulkdelete before calling it. That is, passing the VACUUM options, the number of collected dead tuples, etc. to the index AM, we ask it via a new index AM API whether it wants to do bulkdelete or not. We call bulkdelete only for the indexes that answered 'yes'. If we got 'no' from any one of the indexes, we cannot have a second heap pass; INDEX_CLEANUP is not enforceable. When INDEX_CLEANUP is set to false, we expect index AMs to return 'no' unless they have a special reason to need bulkdelete. One possible benefit of this idea, even without the bottom-up index deletion patch, would be something like a vacuum_index_cleanup_scale_factor for bulkdelete. For example, in the case where the number of dead tuples is slightly larger than what fits in maintenance_work_mem, the second call to bulkdelete will be made with only a small number of dead tuples, which is less efficient. If an index AM is able to decide not to do bulkdelete by comparing the number of dead tuples to a threshold, it can avoid such a bulkdelete call. Also, as a future extension, once we have a retail index deletion feature, we might be able to make that API return a ternary value: 'no', 'do_bulkdelete', 'do_indexscandelete', so that the index AM can choose the appropriate method of index deletion based on the statistics.
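[A minimal sketch of the ternary-answer idea described above. Every name here is invented for illustration -- no such enum or callback exists in the tree, and the thresholds are placeholders, not proposed defaults:]

```c
#include <assert.h>

/* Hypothetical ternary answer from an index AM's new callback. */
typedef enum
{
	BDS_SKIP,			/* 'no': no deletion pass needed for this index */
	BDS_BULKDELETE,		/* 'do_bulkdelete': full ambulkdelete() scan */
	BDS_INDEXSCANDELETE	/* 'do_indexscandelete': retail TID deletion */
} BulkdeleteStrategy;

/*
 * Sketch of the per-index decision: too few dead tuples relative to a
 * scale factor means skip entirely; a modest number favors retail
 * deletion; a large number justifies a full bulkdelete scan.  Both
 * cutoffs are invented for illustration.
 */
static BulkdeleteStrategy
choose_strategy(long num_dead_tuples, long index_tuples,
				double scale_factor, long retail_threshold)
{
	if (num_dead_tuples <= 0 ||
		num_dead_tuples < (long) (index_tuples * scale_factor))
		return BDS_SKIP;			/* not worth touching the index */
	if (num_dead_tuples < retail_threshold)
		return BDS_INDEXSCANDELETE;	/* cheap targeted deletion */
	return BDS_BULKDELETE;			/* enough garbage for a full scan */
}
```

[With this shape, the "second bulkdelete with a small leftover batch" case mentioned above naturally falls into BDS_SKIP or BDS_INDEXSCANDELETE.]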
But to make index vacuuming a per-index thing, we need to deal with the problem that we cannot know which indexes of the table still have index tuples pointing to the collected dead tuples. For example, if an index always says 'no' (it doesn't need bulkdelete, therefore we need to keep the dead line pointers), the collected dead tuples might already be marked LP_DEAD while other indexes might no longer have index tuples pointing to them. In that case we don't want to call bulkdelete for those other indexes. Probably we can have additional statistics, like the number of dead tuples in individual indexes, so that they can determine their need for bulkdelete. But it’s not a comprehensive solution. > > If you are able to pursue this project, in whole or in part, I would > definitely be supportive of that. I may be able to commit it. I think > that this project has many benefits, not just one or two. It seems > strategic. Thanks, that’s really helpful. I’m going to work on that. Since things became complicated by these two features that I proposed, I’ll do my best to sort out the problem and improve it in PG14. Regards, -- Masahiko Sawada EnterpriseDB: https://www.enterprisedb.com/
On Thu, Nov 19, 2020 at 8:58 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > For HEAD, there was a discussion that we change lazy vacuum and > bulkdelete and vacuumcleanup APIs so that it calls these APIs even > when INDEX_CLEANUP is specified. That is, when INDEX_CLEANUP false is > specified, it collects dead tuple TIDs into maintenance_work_mem space > and passes the flag indicating INDEX_CLEANUP is specified or not to > index AMs. Index AM decides whether doing bulkdelete/vacuumcleanup. A > downside of this idea would be that we will end up using > maintenance_work_mem even if all index AMs of the table don't do > bulkdelete/vacuumcleanup at all. > > The second idea I came up with is to add an index AM API (say, > amcanskipindexcleanup = true/false) telling index cleanup is skippable > or not. Lazy vacuum checks this flag for each index on the table > before starting. If index cleanup is skippable in all indexes, it can > choose one-pass vacuum, meaning no need to collect dead tuple TIDs in > maintenance_work_mem. All in-core index AM will set to true. Perhaps > it’s true (skippable) by default for backward compatibility. > > The in-core AMs including btree indexes will work same as before. This > fix is to make it more desirable behavior and possibly to help other > AMs that require to call vacuumcleanup in all cases. Once we fix it I > wonder if we can disable index cleanup when autovacuum’s > anti-wraparound vacuum. It (still) doesn't seem very sane to me to have an index that requires cleanup in all cases. I mean, VACUUM could error or be killed just before the index cleanup phase happens anyway, so it's not like an index AM can licitly depend on getting called just because we visited the heap. It could, of course, depend on getting called before relfrozenxid is advanced, or before the heap's dead line pointers are marked unused, or something like that, but it can't just be like, hey, you have to call me.
I think this whole discussion is to some extent a product of the contract between the index AM and the table AM being more than slightly unclear. Maybe we need to clear up the definitional problems first. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Nov 20, 2020 at 3:17 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > I had missed your bottom-up index deletion patch but it's a promising > improvement. With that patch, the number of dead tuples in individual > indexes may differ. So it's important that we make index vacuuming a > per-index thing. Right, but it's important to remember that even bottom-up index deletion isn't really special in theory. Bottom-up index deletion is "just" a reliable version of the existing LP_DEAD index deletion thing (which has been around since Postgres 8.2). In theory it doesn't change the fundamental nature of the problem. In practice it does, because it makes it very obvious to pgsql-hackers that indexes on the same table can have very different needs from VACUUM. And the actual differences we see are much bigger now. Even still, the fact that you had certain big differences across indexes on the same table is not a new thing. (Actually, you can even see this on the master branch in Victor's bottom-up deletion benchmark, where the primary key index actually doesn't grow on the master branch, even after 8 hours.) The bottom-up index deletion patch (and the enhancements we're talking about here, for VACUUM itself) are based on "the generational hypothesis" that underlies generational garbage collection. The philosophy is the same. See: https://plumbr.io/handbook/garbage-collection-in-java/generational-hypothesis In theory, "most garbage comes from new objects" is "just" an empirical observation, that may or may not be true with each workload/Java program/whatever. In practice it is important enough to be a big part of how every modern garbage collector works -- it's almost always true, and even when it isn't true it doesn't actually hurt to make the assumption that it is true and then be wrong. I believe that we have to take a holistic view of the problem to make real progress. 
Andres said something similar in a recent blog post: https://techcommunity.microsoft.com/t5/azure-database-for-postgresql/improving-postgres-connection-scalability-snapshots/ba-p/1806462#interlude-removing-the-need-for-recentglobalxminhorizon "In most workloads the majority of accesses are to live tuples, and when encountering non-live tuple versions they are either very old, or very new." (This was just a coincidence, but it was good to see that he made the same observation.) > Given that patch, it seems to me that it would be better to ask > individual index AM before calling to bulkdelete about the needs of > bulkdelete. That is, passing VACUUM options and the number of > collected dead tuples etc. to index AM, we ask index AM via a new > index AM API whether it wants to do bulkdelete or not. We call > bulkdelete for only indexes that answered 'yes'. If we got 'no' from > any one of the indexes, we cannot have a second heap pass. > INDEX_CLEANUP is not enforceable. When INDEX_CLEANUP is set to false, > we expect index AMs to return 'no' unless they have a special reason > for the needs of bulkdelete. I don't have a very detailed idea of the interface or anything. There are a few questions that naturally present themselves, that I don't have good answers to right now. Obviously vacuumlazy.c will only treat this feedback from each index as an advisory thing. So what happens when 50% of the indexes say yes and 50% say no? This is a subproblem that must be solved as part of this work. Ideally it will be solved by you. :-) > One possible benefit of this idea even without bottom-up index > deleteion patch would be something like > vacuum_index_cleanup_scale_factor for bulkdelete. For example, in the > case where the amount of dead tuple is slightly larger than > maitenance_work_mem the second time calling to bulkdelete will be > called with a small amount of dead tuples, which is less efficient. 
If > an index AM is able to determine not to do bulkdelete by comparing the > number of dead tuples to a threshold, it can avoid such bulkdelete > calling. I agree. Actually, I thought the same thing myself, even before I realized that bottom-up index deletion was possible. > Also, as a future extension, once we have retail index deletion > feature, we might be able to make that API return a ternary value: > 'no', 'do_bulkdelete', ‘do_indexscandelete, so that index AM can > choose the appropriate method of index deletion based on the > statistics. I agree again! We may eventually be able to make autovacuum run very frequently against each table in many important cases, with each VACUUM taking very little wall-clock time. We don't have to change the fundamental design to fix most of the current problems. I suspect that the "top down" nature of VACUUM is sometimes helpful. We just need to compensate when this design is insufficient. Getting the "best of both worlds" is possible. > But for making index vacuuming per-index thing, we need to deal with > the problem that we cannot know which indexes of the table still has > index tuples pointing to the collected dead tuple. For example, if an > index always says 'no' (not need bulkdelete therefore we need to keep > dead line pointers), the collected dead tuples might already be marked > as LP_DEAD and there might already not be index tuples pointing to > them in other index AMs. In that case we don't want to call to > bulkdelete for other indexes. Probably we can have additional > statistics like the number of dead tuples in individual indexes so > that they can determine the needs of bulkdelete. But it’s not a > comprehensive solution. Right. Maybe we don't ask the index AMs for discrete yes/no answers. Maybe we can ask them for a continuous answer, such as a value between 0.0 and 1.0 that represents the urgency/bloat, or something like that. 
And so the final yes/no answer that really does have to be made for the table as a whole (does VACUUM do a second pass over the heap to make LP_DEAD items into LP_UNUSED items?) can at least consider the worst case for each index. And maybe the average case, too. (I am just making random suggestions to stimulate discussion. Don't take these specific suggestions about the am interface too seriously.) > > If you are able to pursue this project, in whole or in part, I would > > definitely be supportive of that. I may be able to commit it. I think > > that this project has many benefits, not just one or two. It seems > > strategic. > > Thanks, that’s really helpful. I’m going to work on that. Since things > became complicated by these two features that I proposed I’ll do my > best to sort out the problem and improve it in PG14. Excellent! Thank you. -- Peter Geoghegan
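[A toy sketch of the "continuous answer" aggregation floated above: each index reports an urgency in [0.0, 1.0], and the table-level yes/no considers both the worst case and the average. All names and cutoffs are invented for discussion, per the caveat about not taking the interface suggestions too seriously:]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Decide whether a second heap pass (LP_DEAD -> LP_UNUSED) is
 * worthwhile, given one urgency score per index.  Trigger either when
 * any single index is desperate (worst case) or when the indexes are
 * collectively fairly bloated (average case).  Both cutoffs are
 * placeholders, not proposed defaults.
 */
static bool
second_heap_pass_wanted(const double *urgency, int nindexes,
						double worst_cutoff, double avg_cutoff)
{
	double	worst = 0.0;
	double	sum = 0.0;

	if (nindexes == 0)
		return false;

	for (int i = 0; i < nindexes; i++)
	{
		if (urgency[i] > worst)
			worst = urgency[i];
		sum += urgency[i];
	}

	return worst >= worst_cutoff || (sum / nindexes) >= avg_cutoff;
}
```

[One desperate index among several relaxed ones can therefore still tip the whole table into a traditional VACUUM, which matches the "it can be *more* ambitious" behavior discussed later in the thread.]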
On Fri, Nov 20, 2020 at 2:37 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right. Maybe we don't ask the index AMs for discrete yes/no answers. > Maybe we can ask them for a continuous answer, such as a value between > 0.0 and 1.0 that represents the urgency/bloat, or something like that. > And so the final yes/no answer that really does have to be made for > the table as a whole (does VACUUM do a second pass over the heap to > make LP_DEAD items into LP_UNUSED items?) can at least consider the > worst case for each index. And maybe the average case, too. That's an interesting idea. We should think about the needs of brin indexes when designing something better than the current system. They have the interesting property that the heap deciding to change LP_DEAD to LP_UNUSED doesn't break anything even if nothing's been done to the index, because they don't store TIDs anyway. So that's an example of an index AM that might want to do some work to keep performance up, but it's not actually required. This might be orthogonal to the 0.0-1.0 scale you were thinking about, but it might be good to factor it into the thinking somehow. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Nov 20, 2020 at 12:04 PM Robert Haas <robertmhaas@gmail.com> wrote: > That's an interesting idea. We should think about the needs of brin > indexes when designing something better than the current system. They > have the interesting property that the heap deciding to change LP_DEAD > to LP_UNUSED doesn't break anything even if nothing's been done to the > index, because they don't store TIDs anyway. So that's an example of > an index AM that might want to do some work to keep performance up, > but it's not actually required. This might be orthogonal to the > 0.0-1.0 scale you were thinking about, but it might be good to factor > it into the thinking somehow. I actually made this exact suggestion about BRIN myself, several years ago. As I've said, it seems like it would be a good idea to ask the exact same generic question of each index in turn (which is answered using local smarts added to the index AM). Again, the question is: How important is it that you get vacuumed now, from your own narrow/selfish point of view? The way that BRIN answers this question is not the novel thing about BRIN among other index access methods, though. (Not that you claimed otherwise -- just framing the discussion carefully.) BRIN has no selfish reason to care if the table never gets to have its LP_DEAD line pointers set to LP_UNUSED -- that's just not something that it can be expected to understand directly. But all index access methods should be thought of as not caring about this, because it's just not their problem. (Especially with bottom-up index deletion, but even without it.) The interesting and novel thing about BRIN here is this: vacuumlazy.c can be taught that a BRIN index alone is no reason to have to do a second pass over the heap (to make the LP_DEAD/pruned-by-VACUUM line pointers LP_UNUSED).
A BRIN index never gets affected by the usual considerations about the heapam invariant (the usual thing about TIDs in an index not pointing to a line pointer that is at risk of being recycled), which presents us with a unique-to-BRIN opportunity. Which is exactly what you said. (***Thinks some more***) Actually, now I think that BRIN shouldn't be special to vacuumlazy.c in any way. It doesn't make sense as part of this future world in which index vacuuming can be skipped for individual indexes (which is what I talked to Sawada-san about a little earlier in this thread). Why should it be useful to exploit the "no-real-TIDs" property of BRIN in this future world? It can only solve a problem that the main enhancement is itself expected to solve without any special help from BRIN (just the generic am callback that asks the same generic question about index vacuuming urgency). The only reason we press ahead with a second scan (the LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table problem. The bloat eventually gets out of hand *in the table*. We have now conceptually decoupled the problems experienced in the table/heap from the problems for each index (mostly), so this actually makes sense. The theory behind AV scheduling becomes much closer to reality -- by changing the reality! (The need to "prune the table to VACUUM any one index" notwithstanding -- that's still necessary, of course, but we still basically decouple table bloat from index bloat at the conceptual level.) Does that make sense? -- Peter Geoghegan
On Fri, Nov 20, 2020 at 4:21 PM Peter Geoghegan <pg@bowt.ie> wrote: > Actually, now I think that BRIN shouldn't be special to vacuumlazy.c > in any way. It doesn't make sense as part of this future world in > which index vacuuming can be skipped for individual indexes (which is > what I talked to Sawada-san about a little earlier in this thread). > Why should it be useful to exploit the "no-real-TIDs" property of BRIN > in this future world? It can only solve a problem that the main > enhancement is itself expected to solve without any special help from > BRIN (just the generic am callback that asks the same generic question > about index vacuuming urgency). > > The only reason we press ahead with a second scan (the > LP_DEAD-to-LP_UNUSED thing) in this ideal world is a heap/table > problem. The bloat eventually gets out of hand *in the table*. We have > now conceptually decoupled the problems experienced in the table/heap > from the problems for each index (mostly), so this actually makes > sense. The theory behind AV scheduling becomes much closer to reality > -- by changing the reality! (The need to "prune the table to VACUUM > any one index" notwithstanding -- that's still necessary, of course, > but we still basically decouple table bloat from index bloat at the > conceptual level.) > > Does that make sense? I *think* so. For me the point is that the index never has a right to insist on being vacuumed, but it can offer an opinion on how helpful it would be. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Nov 20, 2020 at 2:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Does that make sense? > > I *think* so. For me the point is that the index never has a right to > insist on being vacuumed, but it can offer an opinion on how helpful > it would be. Right, that might be the single most important point. It's a somewhat more bottom-up direction for VACUUM, that is still fundamentally top-down. Because that's still necessary. Opportunistic heap pruning is usually very effective, so today we realistically have these 4 byte line pointers accumulating in heap pages. The corresponding "bloatum" in index pages is an index tuple + line pointer (at least 16 bytes + 4 bytes). Meaning that we accumulate that *at least* 20 bytes for each 4 bytes in the table. And, indexes care about *where* items go, making the problem even worse. So in the absence of index tuple LP_DEAD setting/deletion (or bottom-up index deletion in Postgres 14), the problem in indexes is probably at least 5x worse. The precise extent to which this is true will vary. It's a mistake to try to reason about it at a high level, because there is just too much variation for that approach to work. We should just give index access methods *some* say. Sometimes this allows index vacuuming to be very lazy, other times it allows index vacuuming to be very eager. Often this variation exists among indexes on the same table. Of course, vacuumlazy.c is still responsible for not letting the accumulation of LP_DEAD heap line pointers get out of hand (without allowing index TIDs to point to the wrong thing due to dangerous TID recycling issues/bugs). The accumulation of LP_DEAD heap line pointers will often take a very long time to get out of hand. But when it does finally get out of hand, index access methods don't get to veto being vacuumed. Because this isn't actually about their needs anymore. Actually, the index access methods never truly veto anything. 
They merely give some abstract signal about how urgent it is to them (like the 0.0 - 1.0 thing). This difference actually matters. One index among many on a table saying "meh, I guess I could benefit from some index vacuuming if it's no real trouble to you vacuumlazy.c" rather than saying "it's absolutely unnecessary, don't waste CPU cycles vacuumlazy.c" may actually shift how vacuumlazy.c processes the heap (at least occasionally). Maybe the high level VACUUM operation decides that it is worth taking care of everything all at once -- if all the indexes together either say "meh" or "now would be a good time", and vacuumlazy.c then notices that the accumulation of LP_DEAD line pointers is *also* becoming a problem (it's also a "meh" situation), then it can be *more* ambitious. It can do a traditional VACUUM early. Which might still make sense. This also means that vacuumlazy.c would ideally think about this as an optimization problem. It may be lazy or eager for the whole table, just as it may be lazy or eager for individual indexes. (Though the eagerness/laziness dynamic is probably much more noticeable with indexes in practice.) -- Peter Geoghegan
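[The space accounting in the message above can be restated as a one-liner. This just re-derives the stated lower bound (4-byte heap line pointer vs. at least 16 + 4 bytes per index entry); the function name is invented:]

```c
#include <assert.h>

/*
 * Lower bound on index-to-heap bloat amplification per dead tuple,
 * using the figures from the discussion: a pruned heap tuple leaves a
 * 4-byte line pointer, while each index retains an index tuple plus
 * its own line pointer (at least 16 + 4 = 20 bytes).
 */
static int
index_to_heap_bloat_ratio(int nindexes)
{
	const int heap_lp_bytes = 4;			/* LP_DEAD heap line pointer */
	const int index_entry_bytes = 16 + 4;	/* index tuple + line pointer */

	return (nindexes * index_entry_bytes) / heap_lp_bytes;
}
```

[So the quoted "at least 5x worse" figure is the single-index case; a table with 6 indexes accumulates at least 30 bytes of index garbage per 4-byte heap line pointer.]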
On Sat, Nov 21, 2020 at 8:03 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Nov 20, 2020 at 2:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > Does that make sense? > > > > I *think* so. For me the point is that the index never has a right to > > insist on being vacuumed, but it can offer an opinion on how helpful > > it would be. > > Right, that might be the single most important point. It's a somewhat > more bottom-up direction for VACUUM, that is still fundamentally > top-down. Because that's still necessary. > > Opportunistic heap pruning is usually very effective, so today we > realistically have these 4 byte line pointers accumulating in heap > pages. The corresponding "bloatum" in index pages is an index tuple + > line pointer (at least 16 bytes + 4 bytes). Meaning that we accumulate > that *at least* 20 bytes for each 4 bytes in the table. And, indexes > care about *where* items go, making the problem even worse. So in the > absence of index tuple LP_DEAD setting/deletion (or bottom-up index > deletion in Postgres 14), the problem in indexes is probably at least > 5x worse. > > The precise extent to which this is true will vary. It's a mistake to > try to reason about it at a high level, because there is just too much > variation for that approach to work. We should just give index access > methods *some* say. Sometimes this allows index vacuuming to be very > lazy, other times it allows index vacuuming to be very eager. Often > this variation exists among indexes on the same table. > > Of course, vacuumlazy.c is still responsible for not letting the > accumulation of LP_DEAD heap line pointers get out of hand (without > allowing index TIDs to point to the wrong thing due to dangerous TID > recycling issues/bugs). The accumulation of LP_DEAD heap line pointers > will often take a very long time to get out of hand. But when it does > finally get out of hand, index access methods don't get to veto being > vacuumed. 
> Because this isn't actually about their needs anymore.
>
> Actually, the index access methods never truly veto anything. They
> merely give some abstract signal about how urgent it is to them (like
> the 0.0 - 1.0 thing). This difference actually matters. One index
> among many on a table saying "meh, I guess I could benefit from some
> index vacuuming if it's no real trouble to you vacuumlazy.c" rather
> than saying "it's absolutely unnecessary, don't waste CPU cycles
> vacuumlazy.c" may actually shift how vacuumlazy.c processes the heap
> (at least occasionally). Maybe the high level VACUUM operation decides
> that it is worth taking care of everything all at once -- if all the
> indexes together either say "meh" or "now would be a good time", and
> vacuumlazy.c then notices that the accumulation of LP_DEAD line
> pointers is *also* becoming a problem (it's also a "meh" situation),
> then it can be *more* ambitious. It can do a traditional VACUUM early.
> Which might still make sense.
>
> This also means that vacuumlazy.c would ideally think about this as an
> optimization problem. It may be lazy or eager for the whole table,
> just as it may be lazy or eager for individual indexes. (Though the
> eagerness/laziness dynamic is probably much more noticeable with
> indexes in practice.)

I discussed this topic off-list with Peter Geoghegan. We think this fix can lead to future improvements. Let me summarize the proposals.

The first proposal is the fix for the inappropriate behavior discussed on this thread. We pass a new flag when calling bulkdelete(), indicating whether or not the index can safely skip this bulkdelete() call. This is equivalent to whether or not lazy vacuum will do the heap clean (turning LP_DEAD line pointers into LP_UNUSED in lazy_vacuum_heap()). If it's true (meaning heap clean will be done), then since the dead tuples referenced by index tuples will be physically removed, the index AM must delete the index tuples.
If it's false, we call bulkdelete() with this flag so that the index AM can safely skip bulkdelete(). Of course, an index AM can also choose not to skip it for its own reasons. Index AMs such as BRIN that don't store heap TIDs can decide what to do regardless of this flag.

The next proposal, built on the above one, is to add a new index AM API, say ambulkdeletestrategy(), which is called before bulkdelete() for each index and asks for the index's bulk-deletion strategy. In this API, lazy vacuum asks, "Hey index X, I collected garbage heap tuples during heap scanning, how urgent is vacuuming for you?", and the index answers either "it's urgent" when it wants to do bulk-deletion or "it's not urgent, I can skip it". The point of this proposal is to decouple heap vacuum from index vacuum so that we can employ a different strategy for each index. Lazy vacuum can decide whether or not to do heap clean based on the answers from the indexes, and can set the flag I proposed above according to that decision. If all indexes answer 'yes' (meaning they will do bulkdelete()), lazy vacuum can do heap clean. If even one index answers 'no' (meaning it will not do bulkdelete()), lazy vacuum cannot do heap clean. Conversely, lazy vacuum would also be able to require an index to do bulkdelete() for its own reasons, something like saying "Hey index X, you answered not to do bulkdelete(), but since heap clean is necessary for me, please don't skip bulkdelete()".

In connection with this change, we would need to rethink the meaning of the INDEX_CLEANUP option. As of now, if it's not set (i.e. VACOPT_TERNARY_DEFAULT in the code), it's treated as true and heap clean is done. But I think we can make it something like a neutral state by default. This neutral state could resolve to "on" or "off" depending on several factors, including the answers from ambulkdeletestrategy(), the table status, and the user's request.
In this context, specifying INDEX_CLEANUP would mean forcing the neutral state to "on" or "off" at the user's request. The table status that could influence the decision could concretely be, for instance:

* Removing the LP_DEAD accumulation that results from skipping bulkdelete() for a long time.
* Making pages all-visible for index-only scans.

We would not benefit much from the ambulkdeletestrategy() idea for now, but there are potential enhancements using this API:

* If the bottom-up index deletion feature[1] is introduced, individual indexes could be in different situations in terms of dead tuple accumulation; some indexes on the table can delete their garbage index tuples without bulkdelete(). The problem then appears that doing bulkdelete() for such indexes would not be efficient. This proposal solves it because we can do bulkdelete() for just a subset of the indexes on the table.

* If the retail index deletion feature[2] is introduced, we can make the return value of ambulkdeletestrategy() a ternary value: "do_bulkdelete", "do_indexscandelete", and "no".

* We can probably introduce a threshold on the number of dead tuples to control whether or not to do index tuple bulk-deletion (a bulkdelete() analogue of vacuum_cleanup_index_scale_factor). In the case where the number of dead tuples is only slightly larger than what fits in maintenance_work_mem, the second call to bulkdelete() is passed only a small number of dead tuples, which is inefficient. This proposal also solves that problem by allowing a subset of indexes to skip bulkdelete() if the number of dead tuples doesn't exceed the threshold.

Any thoughts? I'm writing a PoC patch so will share it.

Regards,

[1] https://www.postgresql.org/message-id/CAH2-Wzm%2BmaE3apHB8NOtmM%3Dp-DO65j2V5GzAWCOEEuy3JZgb2g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/425db134-8bba-005c-b59d-56e50de3b41e%40postgrespro.ru

-- 
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Tue, Dec 15, 2020 at 6:44 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> In connection with this change, we would need to rethink the meaning
> of the INDEX_CLEANUP option. As of now, if it's not set (i.e.
> VACOPT_TERNARY_DEFAULT in the code), it's treated as true and will do
> heap clean. But I think we can make it something like a neutral state
> by default. This neutral state could be "on" or "off" depending on
> several factors including the answers of ambulkdeletestrategy(), the
> table status, and the user's request. In this context, specifying
> INDEX_CLEANUP would mean making the neutral state "on" or "off" by
> user's request.

I think a new value such as "smart" should be introduced, which can become the default.

> The table status that could influence the decision
> could concretely be, for instance:
>
> * Removing LP_DEAD accumulation due to skipping bulkdelete() for a long time.
> * Making pages all-visible for index-only scans.
>
> We would not benefit much from the ambulkdeletestrategy() idea for now,
> but there are potential enhancements using this API:
>
> * If the bottom-up index deletion feature[1] is introduced, individual
> indexes could be in different situations in terms of dead tuple
> accumulation; some indexes on the table can delete their garbage index
> tuples without bulkdelete(). The problem then appears that doing
> bulkdelete() for such indexes would not be efficient. This proposal
> solves it because we can do bulkdelete() for just a subset of the
> indexes on the table.

The chances of bottom-up index deletion being committed for PostgreSQL 14 are very high. While it hasn't received too much review, there seems to be very little downside, and lots of upside.

> * If the retail index deletion feature[2] is introduced, we can make the
> return value of ambulkdeletestrategy() a ternary value: "do_bulkdelete",
> "do_indexscandelete", and "no".

Makes sense.
> * We can probably introduce a threshold on the number of dead tuples
> to control whether or not to do index tuple bulk-deletion (a
> bulkdelete() analogue of vacuum_cleanup_index_scale_factor). In the
> case where the number of dead tuples is only slightly larger than what
> fits in maintenance_work_mem, the second call to bulkdelete() is
> passed only a small number of dead tuples, which is inefficient. This
> proposal also solves that problem by allowing a subset of indexes to
> skip bulkdelete() if the number of dead tuples doesn't exceed the
> threshold.

Good idea. Maybe this won't be possible for PostgreSQL 14, but this is the kind of possibility that we should try to unlock.

I had a similar-yet-different idea to this idea of Masahiko's, actually, which is to use the LSN to determine (unreliably) within VACUUM whether a B-Tree leaf page is likely to have garbage tuples. This other idea probably also won't happen for PostgreSQL 14. That's not important. The truly important thing is that we come up with the right *general* design, one that can support either technique in the future. I'm not sure which precise design will work best, but I am confident that *some* combination of these two ideas (or other ideas) will work very well. Right now we don't have the appropriate general framework.

> Any thoughts?

Nothing to add to what you said, really. I agree that it makes sense to think of all of these things at the same time. It'll be easier to see how far these different ideas can be pushed once a prototype is available.

> I'm writing a PoC patch so will share it.

Great! I suggest starting a new thread for that.

-- 
Peter Geoghegan