Discussion: Visibility map, partial vacuums
Here's finally my attempt at the visibility map, aka the dead space map. It's still work in progress, but it's time to discuss some of the design details. Patch attached, anyway, for reference.

The visibility map is basically a bitmap with one bit per heap page, with '1' for pages that are known to contain only tuples that are visible to everyone. Such pages don't need vacuuming, because there are no dead tuples, and the information can also be used to skip visibility checks. It should allow index-only scans in the future, 8.5 perhaps, but that's not part of this patch. The visibility map is stored in a new relation fork, alongside the main data and the FSM.

Lazy VACUUM only needs to visit pages that are '0' in the visibility map. This allows partial vacuums, where we only need to scan those parts of the table that need vacuuming, plus all indexes.

To avoid having to update the visibility map every time a heap page is updated, I have added a new flag to the heap page header, PD_ALL_VISIBLE, which indicates the same thing as a set bit in the visibility map: all tuples on the page are known to be visible to everyone. When a page is modified, the visibility map only needs to be updated if PD_ALL_VISIBLE was set. That should make the impact unnoticeable for use cases with lots of updates, where the visibility map doesn't help, as only the first update on a page after a vacuum needs to update the visibility map.

As a bonus, I'm using the PD_ALL_VISIBLE flag to skip visibility checks in sequential scans. That seems to give a small 5-10% speedup on my laptop for a simple "SELECT COUNT(*) FROM foo" query, where foo is a narrow table with just a single integer column, fitting in RAM.

The critical part of this patch is keeping the PD_ALL_VISIBLE flag and the visibility map up to date while avoiding race conditions. An invariant is maintained: if the PD_ALL_VISIBLE flag is *not* set, the corresponding bit in the visibility map must also not be set.
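
To make the bookkeeping described so far concrete, here is a minimal toy model in C. It is only a sketch, not code from the patch: the ToyHeap struct, TOY_NPAGES and all toy_* functions are hypothetical names made up for illustration, the page flag is a plain bool standing in for PD_ALL_VISIBLE, and locking and WAL are left out entirely. It shows that a partial vacuum only has to visit pages whose bit is '0', and that repeated updates of the same page touch the map only once.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NPAGES 64           /* hypothetical table size, in heap pages */

/* Toy stand-in for a heap plus its visibility map (illustration only). */
typedef struct
{
    bool    page_all_visible[TOY_NPAGES];   /* stands in for PD_ALL_VISIBLE */
    uint8_t map[TOY_NPAGES / 8];            /* one bit per heap page */
    int     map_clears;                     /* times a map bit was cleared */
} ToyHeap;

/* Read the map bit of one heap page. */
static bool
toy_map_test(const ToyHeap *h, unsigned blk)
{
    return (h->map[blk / 8] >> (blk % 8)) & 1;
}

/* Vacuum found only all-visible tuples: set the page flag and the map bit. */
static void
toy_mark_all_visible(ToyHeap *h, unsigned blk)
{
    assert(blk < TOY_NPAGES);
    h->page_all_visible[blk] = true;
    h->map[blk / 8] |= (uint8_t) (1u << (blk % 8));
}

/* Modify a page: the map is touched only if the page flag was set. */
static void
toy_modify_page(ToyHeap *h, unsigned blk)
{
    assert(blk < TOY_NPAGES);
    if (h->page_all_visible[blk])
    {
        h->map[blk / 8] &= (uint8_t) ~(1u << (blk % 8));
        h->map_clears++;
        h->page_all_visible[blk] = false;
    }
    /* ... the actual tuple change would happen here ... */
}

/* Number of pages a partial vacuum still has to visit (bit is '0'). */
static int
toy_pages_to_vacuum(const ToyHeap *h)
{
    int n = 0;

    for (unsigned blk = 0; blk < TOY_NPAGES; blk++)
        if (!toy_map_test(h, blk))
            n++;
    return n;
}

/* Invariant: a set map bit implies a set page flag. */
static bool
toy_invariant_holds(const ToyHeap *h)
{
    for (unsigned blk = 0; blk < TOY_NPAGES; blk++)
        if (toy_map_test(h, blk) && !h->page_all_visible[blk])
            return false;
    return true;
}
```

In this model, marking every page all-visible and then calling toy_modify_page() on the same page several times clears its bit once and leaves exactly one page for the next partial vacuum; the later calls never look at the map.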
The reverse does not have to hold: if the PD_ALL_VISIBLE flag is set, the bit in the visibility map can be set, or not.

To modify a page: if the PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared first. The heap page is kept pinned, but not locked, while the visibility map is updated. We want to avoid holding a lock across I/O, even though the visibility map is likely to stay in cache. After the visibility map has been updated, the page is exclusively locked and modified as usual, and the PD_ALL_VISIBLE flag is cleared before releasing the lock.

To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the page while you observe that all tuples on the page are visible to everyone.

To set the bit in the visibility map, you need to hold a cleanup lock on the heap page. That keeps away other backends trying to clear the bit in the visibility map at the same time. Note that you need to hold a lock on the heap page to examine PD_ALL_VISIBLE, otherwise the cleanup lock doesn't protect from the race condition.

That's how the patch works right now. However, there's a small performance problem with the current approach: setting the PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:

1. All tuples on a page become visible to everyone. The inserting transaction committed, for example. A backend sees that and sets PD_ALL_VISIBLE.
2. Vacuum comes along, and sees that there's no work to be done on the page. It sets the bit in the visibility map.
3. The visibility map page is flushed to disk. The heap page is not, yet.
4. Crash.

The bit in the visibility map is now set, but the corresponding PD_ALL_VISIBLE flag is not, because it never made it to disk.

I'm avoiding that at the moment by only setting PD_ALL_VISIBLE as part of a page prune operation, and forcing a WAL record to be written even if no other work is done on the page. The downside of that is that it can lead to a big increase in WAL traffic after a bulk load, for example.
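
The four-step crash scenario above can be traced with a small toy model. This is a sketch under simplifying assumptions, not the patch's code: ToyCrashModel and the toy_* functions are hypothetical names, each page is reduced to a single bool with separate "in memory" and "on disk" copies, and a WAL record is assumed to survive the crash once it has been written.

```c
#include <assert.h>
#include <stdbool.h>

/* One heap page flag and its map bit, in memory and on disk (toy model). */
typedef struct
{
    bool mem_flag;      /* PD_ALL_VISIBLE in the buffered heap page */
    bool disk_flag;     /* PD_ALL_VISIBLE in the on-disk heap page */
    bool mem_bit;       /* map bit in the buffered map page */
    bool disk_bit;      /* map bit in the on-disk map page */
    bool wal_flag_set;  /* a WAL record says "PD_ALL_VISIBLE was set" */
} ToyCrashModel;

/* Step 1: a backend sets PD_ALL_VISIBLE, optionally WAL-logging it. */
static void
toy_set_flag(ToyCrashModel *m, bool wal_log)
{
    m->mem_flag = true;
    if (wal_log)
        m->wal_flag_set = true;
}

/* Step 2: vacuum sees the flag and sets the bit in the map. */
static void
toy_vacuum_sets_bit(ToyCrashModel *m)
{
    assert(m->mem_flag);
    m->mem_bit = true;
}

/* Step 3: only the map page is written out; the heap page is not. */
static void
toy_flush_map_page(ToyCrashModel *m)
{
    m->disk_bit = m->mem_bit;
}

/* Step 4: crash, reload both pages from disk, then replay WAL. */
static void
toy_crash_and_recover(ToyCrashModel *m)
{
    m->mem_flag = m->disk_flag;
    m->mem_bit = m->disk_bit;
    if (m->wal_flag_set)
        m->mem_flag = true;     /* redo of the logged flag change */
}

/* Invariant: a set map bit implies a set PD_ALL_VISIBLE flag. */
static bool
toy_crash_invariant_holds(const ToyCrashModel *m)
{
    return !m->mem_bit || m->mem_flag;
}

/* Run steps 1-4 and report whether the invariant survived. */
static bool
toy_run_scenario(bool wal_log)
{
    ToyCrashModel m = { .mem_flag = false };

    toy_set_flag(&m, wal_log);
    toy_vacuum_sets_bit(&m);
    toy_flush_map_page(&m);
    toy_crash_and_recover(&m);
    return toy_crash_invariant_holds(&m);
}
```

In this model toy_run_scenario(false) ends with the bit set and the flag clear, which is the broken state described above, while toy_run_scenario(true) restores the flag during replay.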
The first vacuum after the bulk load would have to write a WAL record for every heap page, even though there are no dead tuples.

One option would be to just ignore that problem for now, and not WAL-log. As long as we don't use the visibility map for anything like index-only scans, it doesn't matter much if some bits are set that shouldn't be. It just means that VACUUM will skip some pages that need vacuuming, but VACUUM FREEZE will eventually catch those. Given how little time we have until commitfest and feature freeze, that's probably the most reasonable thing to do. I'll follow up with other solutions to that problem, but mainly for discussion for 8.5.

Another thing that does need to be fixed is the way that the extension and truncation of the visibility map is handled; that's broken in the current patch. I started working on the patch a long time ago, before the FSM rewrite was finished, and haven't gotten around to fixing that part yet. We already solved it for the FSM, so we could just follow that pattern. The way we solved truncation in the FSM was to write a separate WAL record with the new heap size, but perhaps we want to revisit that decision, instead of again adding new code to write a third WAL record for truncation of the visibility map. smgrtruncate() writes a WAL record of its own if any full blocks are truncated away from the FSM, but we needed a WAL record even if no full blocks are truncated from the FSM file, because the "tail" of the last remaining FSM page, representing the truncated-away heap pages, still needs to be cleared. The visibility map has the same problem.

One proposal was to piggyback on the smgrtruncate() WAL record, and call FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a modularity point of view; smgr.c shouldn't be calling higher-level functions. But maybe it wouldn't be that bad, after all.
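
To illustrate why the "tail" of the last remaining map page has to be cleared even when no whole map block is truncated away, here is a minimal sketch in C. It is not the patch's visibilitymap_truncate(): TOY_MAP_BYTES, toy_map_truncate() and toy_map_bit() are hypothetical names, and the "map page" is just a four-byte array covering 32 heap pages, with no buffers, locks or WAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAP_BYTES 4     /* hypothetical map page: covers 32 heap pages */

/*
 * Clear the bits of all heap pages >= nheapblocks in a single map page.
 * The bits of pages 0 .. nheapblocks-1 are left untouched.
 */
static void
toy_map_truncate(uint8_t map[TOY_MAP_BYTES], unsigned nheapblocks)
{
    unsigned byte = nheapblocks / 8;
    unsigned bit = nheapblocks % 8;

    assert(nheapblocks <= TOY_MAP_BYTES * 8);
    if (byte >= TOY_MAP_BYTES)
        return;             /* nothing lies beyond the new end */

    /* keep only the low 'bit' bits of the last byte still in use */
    map[byte] &= (uint8_t) ((1u << bit) - 1);

    /* zero every byte after it */
    memset(&map[byte + 1], 0, TOY_MAP_BYTES - (byte + 1));
}

/* Read the map bit of one heap page. */
static bool
toy_map_bit(const uint8_t map[TOY_MAP_BYTES], unsigned blk)
{
    return (map[blk / 8] >> (blk % 8)) & 1;
}
```

If the heap is truncated to 11 pages and later extended again, any stale '1' bits for pages 11 and up would wrongly claim that the new pages are all-visible. The file size of the map does not change in this case, which is why a WAL record from smgrtruncate() alone would not cover it.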
Another option for the truncation record would be to remove WAL-logging from smgrtruncate() altogether, move it to RelationTruncate() or another higher-level function, and handle the WAL-logging and replay there.

There are some side effects of partial vacuums that also need to be fixed. First of all, the tuple count stored in pg_class is now wrong: it only includes tuples from the pages that are visited. VACUUM VERBOSE output needs to be changed as well to reflect that only some pages were scanned.

Other TODOs:
- performance testing, to ensure that there's no significant performance penalty.
- should add a specialized version of visibilitymap_clear() for WAL replay, so that it wouldn't have to rely so much on the fake relcache entries.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

*** src/backend/access/heap/Makefile
--- src/backend/access/heap/Makefile
***************
*** 12,17 **** subdir = src/backend/access/heap
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 12,17 ----
  top_builddir = ../../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o
  
  include $(top_srcdir)/src/backend/common.mk
*** src/backend/access/heap/heapam.c
--- src/backend/access/heap/heapam.c
***************
*** 47,52 ****
--- 47,53 ----
  #include "access/transam.h"
  #include "access/tuptoaster.h"
  #include "access/valid.h"
+ #include "access/visibilitymap.h"
  #include "access/xact.h"
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
***************
*** 194,199 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
--- 195,201 ----
      int ntup;
      OffsetNumber lineoff;
      ItemId lpp;
+     bool all_visible;
  
      Assert(page < scan->rs_nblocks);
  
***************
*** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page)
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;
  
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
-             HeapTupleData loctup;
              bool valid;
  
!             loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!             loctup.t_len = ItemIdGetLength(lpp);
!             ItemPointerSet(&(loctup.t_self), page, lineoff);
  
!             valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
--- 235,266 ----
      lines = PageGetMaxOffsetNumber(dp);
      ntup = 0;
  
+     /*
+      * If the all-visible flag indicates that all tuples on the page are
+      * visible to everyone, we can skip the per-tuple visibility tests.
+      */
+     all_visible = PageIsAllVisible(dp);
+
      for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff);
           lineoff <= lines;
           lineoff++, lpp++)
      {
          if (ItemIdIsNormal(lpp))
          {
              bool valid;
  
!             if (all_visible)
!                 valid = true;
!             else
!             {
!                 HeapTupleData loctup;
!
!                 loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
!                 loctup.t_len = ItemIdGetLength(lpp);
!                 ItemPointerSet(&(loctup.t_self), page, lineoff);
  
!                 valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
!             }
              if (valid)
                  scan->rs_vistuples[ntup++] = lineoff;
          }
***************
*** 1914,1919 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1928,1934 ----
          Page page = BufferGetPage(buffer);
          uint8 info = XLOG_HEAP_INSERT;
  
+         xlrec.all_visible_cleared = PageIsAllVisible(page);
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = heaptup->t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 1961,1966 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 1976,1991 ----
          PageSetTLI(page, ThisTimeLineID);
      }
  
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+     {
+         /*
+          * The bit in the visibility map was already cleared by
+          * RelationGetBufferForTuple
+          */
+         /* visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); */
+         PageClearAllVisible(BufferGetPage(buffer));
+     }
+
      END_CRIT_SECTION();
  
      UnlockReleaseBuffer(buffer);
***************
*** 2045,2050 **** heap_delete(Relation relation, ItemPointer tid,
--- 2070,2080 ----
      Assert(ItemPointerIsValid(tid));
  
      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
+
+     /* Clear the bit in the visibility map if necessary */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+         visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
      LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
  
      page = BufferGetPage(buffer);
***************
*** 2208,2213 **** l1:
--- 2238,2244 ----
          XLogRecPtr recptr;
          XLogRecData rdata[2];
  
+         xlrec.all_visible_cleared = PageIsAllVisible(page);
          xlrec.target.node = relation->rd_node;
          xlrec.target.tid = tp.t_self;
          rdata[0].data = (char *) &xlrec;
***************
*** 2229,2234 **** l1:
--- 2260,2268 ----
  
      END_CRIT_SECTION();
  
+     if (PageIsAllVisible(page))
+         PageClearAllVisible(page);
+
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  
      /*
***************
*** 2627,2632 **** l2:
--- 2661,2670 ----
          }
          else
          {
+             /* Clear bit in visibility map */
+             if (PageIsAllVisible(page))
+                 visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
+
              /* Re-acquire the lock on the old tuple's page. */
              LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
              /* Re-check using the up-to-date free space */
***************
*** 2750,2755 **** l2:
--- 2788,2799 ----
          PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
      }
  
+     /* The bits in visibility map were already cleared */
+     if (PageIsAllVisible(BufferGetPage(buffer)))
+         PageClearAllVisible(BufferGetPage(buffer));
+     if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf)))
+         PageClearAllVisible(BufferGetPage(newbuf));
+
      END_CRIT_SECTION();
  
      if (newbuf != buffer)
***************
*** 3381,3386 **** l3:
--- 3425,3436 ----
  
      END_CRIT_SECTION();
  
+     /*
+      * Don't update the visibility map here. Locking a tuple doesn't
+      * change visibility info.
+      */
+     /* visibilitymap_clear(relation, tuple->t_self); */
+
      LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
  
      /*
***************
*** 3727,3733 **** log_heap_clean(Relation reln, Buffer buffer,
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move)
  {
      xl_heap_clean xlrec;
      uint8 info;
--- 3777,3783 ----
                 OffsetNumber *redirected, int nredirected,
                 OffsetNumber *nowdead, int ndead,
                 OffsetNumber *nowunused, int nunused,
!                bool redirect_move, bool all_visible_set)
  {
      xl_heap_clean xlrec;
      uint8 info;
***************
*** 3741,3746 **** log_heap_clean(Relation reln, Buffer buffer,
--- 3791,3797 ----
      xlrec.block = BufferGetBlockNumber(buffer);
      xlrec.nredirected = nredirected;
      xlrec.ndead = ndead;
+     xlrec.all_visible_set = all_visible_set;
  
      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapClean;
***************
*** 3892,3900 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from,
--- 3943,3953 ----
      else
          info = XLOG_HEAP_UPDATE;
  
+     xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf));
      xlrec.target.node = reln->rd_node;
      xlrec.target.tid = from;
      xlrec.newtid = newtup->t_self;
+     xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf));
  
      rdata[0].data = (char *) &xlrec;
      rdata[0].len = SizeOfHeapUpdate;
***************
*** 4029,4034 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
--- 4082,4088 ----
      int nredirected;
      int ndead;
      int nunused;
+     bool all_visible_set;
  
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;
***************
*** 4046,4051 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
--- 4100,4106 ----
  
      nredirected = xlrec->nredirected;
      ndead = xlrec->ndead;
+     all_visible_set = xlrec->all_visible_set;
      end = (OffsetNumber *) ((char *) xlrec + record->xl_len);
      redirected = (OffsetNumber *) ((char *) xlrec + SizeOfHeapClean);
      nowdead = redirected + (nredirected * 2);
***************
*** 4058,4064 **** heap_xlog_clean(XLogRecPtr lsn, XLogRecord *record, bool clean_move)
                              redirected, nredirected,
                              nowdead, ndead,
                              nowunused, nunused,
!                             clean_move);
  
      /*
       * Note: we don't worry about updating the page's prunability hints.
--- 4113,4119 ----
                              redirected, nredirected,
                              nowdead, ndead,
                              nowunused, nunused,
!                             clean_move, all_visible_set);
  
      /*
       * Note: we don't worry about updating the page's prunability hints.
***************
*** 4152,4157 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4207,4224 ----
      ItemId lp = NULL;
      HeapTupleHeader htup;
  
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&(xlrec->target.tid)));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;
  
***************
*** 4189,4194 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record)
--- 4256,4264 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);
  
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /* Make sure there is no forward chain link in t_ctid */
      htup->t_ctid = xlrec->target.tid;
      PageSetLSN(page, lsn);
***************
*** 4213,4218 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4283,4299 ----
      xl_heap_header xlhdr;
      uint32 newlen;
  
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
          return;
  
***************
*** 4270,4275 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4351,4360 ----
          elog(PANIC, "heap_insert_redo: failed to add tuple");
      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
+
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      MarkBufferDirty(buffer);
      UnlockReleaseBuffer(buffer);
  }
***************
*** 4297,4302 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4382,4398 ----
      int hsize;
      uint32 newlen;
  
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_1)
      {
          if (samepage)
***************
*** 4361,4366 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4457,4465 ----
      /* Mark the page as a candidate for pruning */
      PageSetPrunable(page, record->xl_xid);
  
+     if (xlrec->all_visible_cleared)
+         PageClearAllVisible(page);
+
      /*
       * this test is ugly, but necessary to avoid thinking that insert change
       * is already applied
***************
*** 4376,4381 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update)
--- 4475,4491 ----
  
  newt:;
  
+     /*
+      * The visibility map always needs to be updated, even if the heap page
+      * is already up-to-date.
+      */
+     if (xlrec->new_all_visible_cleared)
+     {
+         Relation reln = CreateFakeRelcacheEntry(xlrec->target.node);
+         visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid));
+         FreeFakeRelcacheEntry(reln);
+     }
+
      if (record->xl_info & XLR_BKP_BLOCK_2)
          return;
  
***************
*** 4453,4458 **** newsame:;
--- 4563,4572 ----
      offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
      if (offnum == InvalidOffsetNumber)
          elog(PANIC, "heap_update_redo: failed to add tuple");
+
+     if (xlrec->new_all_visible_cleared)
+         PageClearAllVisible(page);
+
      PageSetLSN(page, lsn);
      PageSetTLI(page, ThisTimeLineID);
      MarkBufferDirty(buffer);
*** src/backend/access/heap/hio.c
--- src/backend/access/heap/hio.c
***************
*** 16,21 ****
--- 16,22 ----
  #include "postgres.h"
  
  #include "access/hio.h"
+ #include "access/visibilitymap.h"
  #include "storage/bufmgr.h"
  #include "storage/freespace.h"
  #include "storage/lmgr.h"
***************
*** 221,229 **** RelationGetBufferForTuple(Relation relation, Size len,
          pageFreeSpace = PageGetHeapFreeSpace(page);
          if (len + saveFreeSpace <= pageFreeSpace)
          {
!             /* use this page as future insert target, too */
!             relation->rd_targblock = targetBlock;
!             return buffer;
          }
  
          /*
--- 222,278 ----
          pageFreeSpace = PageGetHeapFreeSpace(page);
          if (len + saveFreeSpace <= pageFreeSpace)
          {
!             if (PageIsAllVisible(page))
!             {
!                 /*
!                  * Need to update the visibility map first. Let's drop the
!                  * locks while we do that.
!                  */
!                 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
!                 if (otherBlock != targetBlock && BufferIsValid(otherBuffer))
!                     LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
!
!                 visibilitymap_clear(relation, BufferGetBlockNumber(buffer));
!
!                 /* relock */
!                 if (otherBuffer == InvalidBuffer)
!                 {
!                     /* easy case */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else if (otherBlock == targetBlock)
!                 {
!                     /* also easy case */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else if (otherBlock < targetBlock)
!                 {
!                     /* lock other buffer first */
!                     LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!                 else
!                 {
!                     /* lock target buffer first */
!                     LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
!                     LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
!                 }
!
!                 /* Check if it still has enough space */
!                 pageFreeSpace = PageGetHeapFreeSpace(page);
!                 if (len + saveFreeSpace <= pageFreeSpace)
!                 {
!                     /* use this page as future insert target, too */
!                     relation->rd_targblock = targetBlock;
!                     return buffer;
!                 }
!             }
!             else
!             {
!                 /* use this page as future insert target, too */
!                 relation->rd_targblock = targetBlock;
!                 return buffer;
!             }
          }
  
          /*
***************
*** 276,281 **** RelationGetBufferForTuple(Relation relation, Size len,
--- 325,332 ----
       */
      buffer = ReadBuffer(relation, P_NEW);
  
+     visibilitymap_extend(relation, BufferGetBlockNumber(buffer) + 1);
+
      /*
       * We can be certain that locking the otherBuffer first is OK, since it
       * must have a lower page number.
*** src/backend/access/heap/pruneheap.c
--- src/backend/access/heap/pruneheap.c
***************
*** 17,22 ****
--- 17,24 ----
  #include "access/heapam.h"
  #include "access/htup.h"
  #include "access/transam.h"
+ #include "access/visibilitymap.h"
+ #include "access/xlogdefs.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
***************
*** 37,42 **** typedef struct
--- 39,45 ----
      OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
      OffsetNumber nowdead[MaxHeapTuplesPerPage];
      OffsetNumber nowunused[MaxHeapTuplesPerPage];
+     bool all_visible_set;
      /* marked[i] is TRUE if item i is entered in one of the above arrays */
      bool marked[MaxHeapTuplesPerPage + 1];
  } PruneState;
***************
*** 156,161 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 159,166 ----
      OffsetNumber offnum,
                   maxoff;
      PruneState prstate;
+     bool all_visible, all_visible_in_future;
+     TransactionId newest_xid;
  
      /*
       * Our strategy is to scan the page and make lists of items to change,
***************
*** 177,182 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 182,188 ----
       */
      prstate.new_prune_xid = InvalidTransactionId;
      prstate.nredirected = prstate.ndead = prstate.nunused = 0;
+     prstate.all_visible_set = false;
      memset(prstate.marked, 0, sizeof(prstate.marked));
  
      /* Scan the page */
***************
*** 215,220 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
--- 221,317 ----
      if (redirect_move)
          EndNonTransactionalInvalidation();
  
+     /* Update the visibility map */
+     all_visible = true;
+     all_visible_in_future = true;
+     newest_xid = InvalidTransactionId;
+     maxoff = PageGetMaxOffsetNumber(page);
+     for (offnum = FirstOffsetNumber;
+          offnum <= maxoff;
+          offnum = OffsetNumberNext(offnum))
+     {
+         ItemId itemid = PageGetItemId(page, offnum);
+         HeapTupleHeader htup;
+         HTSV_Result status;
+
+         if (!ItemIdIsUsed(itemid) || ItemIdIsRedirected(itemid))
+             continue;
+
+         if (ItemIdIsDead(itemid))
+         {
+             all_visible = false;
+             all_visible_in_future = false;
+             break;
+         }
+
+         htup = (HeapTupleHeader) PageGetItem(page, itemid);
+         status = HeapTupleSatisfiesVacuum(htup, OldestXmin, buffer);
+         switch(status)
+         {
+             case HEAPTUPLE_DEAD:
+                 /*
+                  * There shouldn't be any dead tuples left on the page, since
+                  * we just pruned. They should've been truncated to just dead
+                  * line pointers.
+                  */
+                 Assert(false);
+             case HEAPTUPLE_RECENTLY_DEAD:
+                 /*
+                  * This tuple is not visible to all, and it won't become
+                  * so in the future
+                  */
+                 all_visible = false;
+                 all_visible_in_future = false;
+                 break;
+             case HEAPTUPLE_INSERT_IN_PROGRESS:
+                 /*
+                  * This tuple is not visible to all. But it might become
+                  * so in the future, if the inserter commits.
+                  */
+                 all_visible = false;
+                 if (TransactionIdFollows(HeapTupleHeaderGetXmin(htup), newest_xid))
+                     newest_xid = HeapTupleHeaderGetXmin(htup);
+                 break;
+             case HEAPTUPLE_DELETE_IN_PROGRESS:
+                 /*
+                  * This tuple is not visible to all. But it might become
+                  * so in the future, if the deleter aborts.
+                  */
+                 all_visible = false;
+                 if (TransactionIdFollows(HeapTupleHeaderGetXmax(htup), newest_xid))
+                     newest_xid = HeapTupleHeaderGetXmax(htup);
+                 break;
+             case HEAPTUPLE_LIVE:
+                 /*
+                  * Check if the inserter is old enough that this tuple is
+                  * visible to all
+                  */
+                 if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(htup), OldestXmin))
+                 {
+                     /*
+                      * Nope. But as OldestXmin advances beyond xmin, this
+                      * will become visible to all
+                      */
+                     all_visible = false;
+                     if (TransactionIdFollows(HeapTupleHeaderGetXmin(htup), newest_xid))
+                         newest_xid = HeapTupleHeaderGetXmin(htup);
+                 }
+         }
+     }
+     if (all_visible)
+     {
+         if (!PageIsAllVisible(page))
+             prstate.all_visible_set = true;
+     }
+     else if (all_visible_in_future && TransactionIdIsValid(newest_xid))
+     {
+         /*
+          * We still have hope that all tuples will become visible
+          * in the future
+          */
+         heap_prune_record_prunable(&prstate, newest_xid);
+     }
+
      /* Any error while applying the changes is critical */
      START_CRIT_SECTION();
  
***************
*** 230,236 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
                                  prstate.redirected, prstate.nredirected,
                                  prstate.nowdead, prstate.ndead,
                                  prstate.nowunused, prstate.nunused,
!                                 redirect_move);
  
          /*
           * Update the page's pd_prune_xid field to either zero, or the lowest
--- 327,333 ----
                                  prstate.redirected, prstate.nredirected,
                                  prstate.nowdead, prstate.ndead,
                                  prstate.nowunused, prstate.nunused,
!                                 redirect_move, prstate.all_visible_set);
  
          /*
           * Update the page's pd_prune_xid field to either zero, or the lowest
***************
*** 253,264 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
          if (!relation->rd_istemp)
          {
              XLogRecPtr recptr;
-
              recptr = log_heap_clean(relation, buffer,
                                      prstate.redirected, prstate.nredirected,
                                      prstate.nowdead, prstate.ndead,
                                      prstate.nowunused, prstate.nunused,
!                                     redirect_move);
  
              PageSetLSN(BufferGetPage(buffer), recptr);
              PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
--- 350,360 ----
          if (!relation->rd_istemp)
          {
              XLogRecPtr recptr;
              recptr = log_heap_clean(relation, buffer,
                                      prstate.redirected, prstate.nredirected,
                                      prstate.nowdead, prstate.ndead,
                                      prstate.nowunused, prstate.nunused,
!                                     redirect_move, prstate.all_visible_set);
  
              PageSetLSN(BufferGetPage(buffer), recptr);
              PageSetTLI(BufferGetPage(buffer), ThisTimeLineID);
***************
*** 701,707 **** heap_page_prune_execute(Buffer buffer,
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move)
  {
      Page page = (Page) BufferGetPage(buffer);
      OffsetNumber *offnum;
--- 797,803 ----
                          OffsetNumber *redirected, int nredirected,
                          OffsetNumber *nowdead, int ndead,
                          OffsetNumber *nowunused, int nunused,
!                         bool redirect_move, bool all_visible)
  {
      Page page = (Page) BufferGetPage(buffer);
      OffsetNumber *offnum;
***************
*** 766,771 **** heap_page_prune_execute(Buffer buffer,
--- 862,875 ----
       * whether it has free pointers.
       */
      PageRepairFragmentation(page);
+
+     /*
+      * We don't want poke the visibility map from here, as that might mean
+      * physical I/O; just set the flag on the heap page. The caller can
+      * update the visibility map afterwards if it wants to.
+      */
+     if (all_visible)
+         PageSetAllVisible(page);
  }
  
  
*** /dev/null
--- src/backend/access/heap/visibilitymap.c
***************
*** 0 ****
--- 1,312 ----
+ /*-------------------------------------------------------------------------
+  *
+  * visibilitymap.c
+  *    Visibility map
+  *
+  * Portions Copyright (c) 2008, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *    $PostgreSQL$
+  *
+  * NOTES
+  *
+  * The visibility map is a bitmap with one bit per heap page. A set bit means
+  * that all tuples on the page are visible to all transactions. The
+  * map is conservative in the sense that we make sure that whenever a bit is
+  * set, we know the condition is true, but if a bit is not set, it might
+  * or might not be.
+  *
+  * From that it follows that when a bit is set, we need to update the LSN
+  * of the page to make sure that it doesn't get written to disk before the
+  * WAL record of the changes that made it possible to set the bit is flushed.
+  * But when a bit is cleared, we don't have to do that because if the page is
+  * flushed early, it's ok.
+  *
+  * There's no explicit WAL logging in the functions in this file. The callers
+  * must make sure that whenever a bit is cleared, the bit is cleared on WAL
+  * replay of the updating operation as well. XXX: the WAL-logging of setting
+  * bit needs more thought.
+  *
+  * LOCKING
+  *
+  * To clear a bit for a heap page, caller must hold an exclusive lock
+  * on the heap page. To set a bit, a clean up lock on the heap page is
+  * needed.
+  *
+  *-------------------------------------------------------------------------
+  */
+ #include "postgres.h"
+
+ #include "access/visibilitymap.h"
+ #include "storage/bufmgr.h"
+ #include "storage/bufpage.h"
+ #include "storage/smgr.h"
+
+ //#define TRACE_VISIBILITYMAP
+
+ /* Number of bits allocated for each heap block. */
+ #define BITS_PER_HEAPBLOCK 1
+
+ /* Number of heap blocks we can represent in one byte. */
+ #define HEAPBLOCKS_PER_BYTE 8
+
+ /* Number of heap blocks we can represent in one visibility map page */
+ #define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) * HEAPBLOCKS_PER_BYTE )
+
+ /* Mapping from heap block number to the right bit in the visibility map */
+ #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE)
+ #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE)
+ #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE)
+
+ static Buffer ReadVMBuffer(Relation rel, BlockNumber blkno);
+ static Buffer ReleaseAndReadVMBuffer(Relation rel, BlockNumber blkno, Buffer oldbuf);
+
+ static Buffer
+ ReadVMBuffer(Relation rel, BlockNumber blkno)
+ {
+     if (blkno == P_NEW)
+         return ReadBufferWithFork(rel, VISIBILITYMAP_FORKNUM, P_NEW);
+
+     if (rel->rd_vm_nblocks_cache == InvalidBlockNumber ||
+         rel->rd_vm_nblocks_cache <= blkno)
+         rel->rd_vm_nblocks_cache = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+
+     if (blkno >= rel->rd_fsm_nblocks_cache)
+         return InvalidBuffer;
+     else
+         return ReadBufferWithFork(rel, VISIBILITYMAP_FORKNUM, blkno);
+ }
+
+ static Buffer
+ ReleaseAndReadVMBuffer(Relation rel, BlockNumber blkno, Buffer oldbuf)
+ {
+     if (BufferIsValid(oldbuf))
+     {
+         if (BufferGetBlockNumber(oldbuf) == blkno)
+             return oldbuf;
+         else
+             ReleaseBuffer(oldbuf);
+     }
+
+     return ReadVMBuffer(rel, blkno);
+ }
+
+ void
+ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     uint32 mapByte = HEAPBLK_TO_MAPBYTE(nheapblocks);
+     uint8 mapBit = HEAPBLK_TO_MAPBIT(nheapblocks);
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(LOG, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     /* Truncate away pages that are no longer needed */
+     if (mapBlock == 0 && mapBit == 0)
+         smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, mapBlock,
+                      rel->rd_istemp);
+     else
+     {
+         Buffer mapBuffer;
+         Page page;
+         char *mappage;
+         int len;
+
+         smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, mapBlock + 1,
+                      rel->rd_istemp);
+
+         /*
+          * Clear all bits in the last map page, that represent the truncated
+          * heap blocks. This is not only tidy, but also necessary because
+          * we don't clear the bits on extension.
+          */
+         mapBuffer = ReadVMBuffer(rel, mapBlock);
+         if (BufferIsValid(mapBuffer))
+         {
+             page = BufferGetPage(mapBuffer);
+             mappage = PageGetContents(page);
+
+             LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+
+             /*
+              * Clear out the unwanted bytes.
+              */
+             len = HEAPBLOCKS_PER_PAGE/HEAPBLOCKS_PER_BYTE - (mapByte + 1);
+             MemSet(&mappage[mapByte + 1], 0, len);
+
+             /*
+              * Mask out the unwanted bits of the last remaining byte
+              *
+              * ((1 << 0) - 1) = 00000000
+              * ((1 << 1) - 1) = 00000001
+              * ...
+              * ((1 << 6) - 1) = 00111111
+              * ((1 << 7) - 1) = 01111111
+              */
+             mappage[mapByte] &= (1 << mapBit) - 1;
+
+             /*
+              * This needs to be WAL-logged. Although the now unused shouldn't
+              * be accessed anymore, they better be zero if we extend again.
+              */
+
+             MarkBufferDirty(mapBuffer);
+             UnlockReleaseBuffer(mapBuffer);
+         }
+     }
+ }
+
+ void
+ visibilitymap_extend(Relation rel, BlockNumber nheapblocks)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks);
+     BlockNumber size;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(LOG, "vm_extend %s %d", RelationGetRelationName(rel), nheapblocks);
+ #endif
+
+     Assert(nheapblocks > 0);
+
+     /* Open it at the smgr level if not already done */
+     RelationOpenSmgr(rel);
+
+     size = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+     for(; size < mapBlock + 1; size++)
+     {
+         Buffer mapBuffer = ReadVMBuffer(rel, P_NEW);
+
+         LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
+         PageInit(BufferGetPage(mapBuffer), BLCKSZ, 0);
+         MarkBufferDirty(mapBuffer);
+         UnlockReleaseBuffer(mapBuffer);
+     }
+ }
+
+ /*
+  * Marks that all tuples on a heap page are visible to all.
+  *
+  * *buf is a buffer, previously returned by visibilitymap_test(). This is
+  * an opportunistic function; if *buf doesn't contain the bit for heapBlk,
+  * we do nothing. We don't want to do any I/O, because the caller is holding
+  * a cleanup lock on the heap page.
+  */
+ void
+ visibilitymap_set_opt(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr,
+                       Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Page page;
+     char *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_set %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock)
+         return;
+
+     page = BufferGetPage(*buf);
+     mappage = PageGetContents(page);
+     LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE);
+
+     if (!(mappage[mapByte] & (1 << mapBit)))
+     {
+         mappage[mapByte] |= (1 << mapBit);
+
+         if (XLByteLT(PageGetLSN(page), recptr))
+             PageSetLSN(page, recptr);
+         PageSetTLI(page, ThisTimeLineID);
+         MarkBufferDirty(*buf);
+     }
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+ }
+
+ /*
+  * Are all tuples on heap page visible to all?
+  *
+  * The page containing the bit for the heap block is (kept) pinned,
+  * and *buf is set to that buffer. If *buf is valid on entry, it should
+  * be a buffer previously returned by this function, for the same relation,
+  * and unless the new heap block is on the same page, it is released.
+  */
+ bool
+ visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     bool val;
+     char *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_test %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     *buf = ReleaseAndReadVMBuffer(rel, mapBlock, *buf);
+     if (!BufferIsValid(*buf))
+         return false;
+
+     /* XXX: Can we get away without locking? */
+     LockBuffer(*buf, BUFFER_LOCK_SHARE);
+
+     mappage = PageGetContents(BufferGetPage(*buf));
+
+     val = (mappage[mapByte] & (1 << mapBit)) ? true : false;
+
+     LockBuffer(*buf, BUFFER_LOCK_UNLOCK);
+
+     return val;
+ }
+
+ /*
+  * Mark that not all tuples are visible to all.
+  */
+ void
+ visibilitymap_clear(Relation rel, BlockNumber heapBlk)
+ {
+     BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk);
+     uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk);
+     uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk);
+     Buffer mapBuffer;
+     char *mappage;
+
+ #ifdef TRACE_VISIBILITYMAP
+     elog(WARNING, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk);
+ #endif
+
+     mapBuffer = ReadVMBuffer(rel, mapBlock);
+     if (!BufferIsValid(mapBuffer))
+         return; /* nothing to do */
+
+     /* XXX: Can we get away without locking here?
+      *
+      * We mustn't re-set a bit that was just cleared, so it doesn't seem
+      * safe. Clearing the bit is really "load; and; store", so without
+      * the lock we might store back a bit that's just being cleared
+      * by a concurrent updater.
+      *
+      * We could use the buffer header spinlock here, but the API to do
+      * that is intended to be internal to buffer manager. We'd still need
+      * to get a shared lock to mark the buffer as dirty, though.
+ */ + LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE); + + mappage = PageGetContents(BufferGetPage(mapBuffer)); + + if (mappage[mapByte] & (1 << mapBit)) + { + mappage[mapByte] &= ~(1 << mapBit); + + MarkBufferDirty(mapBuffer); + } + + LockBuffer(mapBuffer, BUFFER_LOCK_UNLOCK); + ReleaseBuffer(mapBuffer); + } *** src/backend/access/transam/xlogutils.c --- src/backend/access/transam/xlogutils.c *************** *** 360,365 **** CreateFakeRelcacheEntry(RelFileNode rnode) --- 360,366 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; rel->rd_smgr = NULL; return rel; *** src/backend/catalog/heap.c --- src/backend/catalog/heap.c *************** *** 33,38 **** --- 33,39 ---- #include "access/heapam.h" #include "access/sysattr.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "catalog/catalog.h" #include "catalog/dependency.h" *************** *** 306,316 **** heap_create(const char *relname, smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false); /* ! * For a real heap, create FSM fork as well. Indexams are ! * responsible for creating any extra forks themselves. */ if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE) smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false); } return rel; --- 307,320 ---- smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false); /* ! * For a real heap, create FSM and visibility map as well. Indexams ! * are responsible for creating any extra forks themselves. 
*/ if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE) + { smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false); + smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, rel->rd_istemp, false); + } } return rel; *************** *** 2324,2329 **** heap_truncate(List *relids) --- 2328,2334 ---- /* Truncate the FSM and actual file (and discard buffers) */ FreeSpaceMapTruncateRel(rel, 0); + visibilitymap_truncate(rel, 0); RelationTruncate(rel, 0); /* If this relation has indexes, truncate the indexes too */ *** src/backend/catalog/index.c --- src/backend/catalog/index.c *************** *** 1343,1354 **** setNewRelfilenode(Relation relation, TransactionId freezeXid) smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false); /* ! * For a heap, create FSM fork as well. Indexams are responsible for ! * creating any extra forks themselves. */ if (relation->rd_rel->relkind == RELKIND_RELATION || relation->rd_rel->relkind == RELKIND_TOASTVALUE) smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false); /* schedule unlinking old files */ for (i = 0; i <= MAX_FORKNUM; i++) --- 1343,1357 ---- smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false); /* ! * For a heap, create FSM and visibility map as well. Indexams are ! * responsible for creating any extra forks themselves. 
*/ if (relation->rd_rel->relkind == RELKIND_RELATION || relation->rd_rel->relkind == RELKIND_TOASTVALUE) + { smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false); + smgrcreate(srel, VISIBILITYMAP_FORKNUM, relation->rd_istemp, false); + } /* schedule unlinking old files */ for (i = 0; i <= MAX_FORKNUM; i++) *** src/backend/commands/vacuum.c --- src/backend/commands/vacuum.c *************** *** 26,31 **** --- 26,32 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlog.h" #include "catalog/namespace.h" *************** *** 1327,1332 **** scan_heap(VRelStats *vacrelstats, Relation onerel, --- 1328,1336 ---- nblocks = RelationGetNumberOfBlocks(onerel); + if (nblocks > 0) + visibilitymap_extend(onerel, nblocks); + /* * We initially create each VacPage item in a maximal-sized workspace, * then copy the workspace into a just-large-enough copy. *************** *** 2822,2828 **** repair_frag(VRelStats *vacrelstats, Relation onerel, recptr = log_heap_clean(onerel, buf, NULL, 0, NULL, 0, unused, uncnt, ! false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 2826,2832 ---- recptr = log_heap_clean(onerel, buf, NULL, 0, NULL, 0, unused, uncnt, ! false, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } *************** *** 2843,2848 **** repair_frag(VRelStats *vacrelstats, Relation onerel, --- 2847,2853 ---- if (blkno < nblocks) { FreeSpaceMapTruncateRel(onerel, blkno); + visibilitymap_truncate(onerel, blkno); RelationTruncate(onerel, blkno); vacrelstats->rel_pages = blkno; /* set new number of blocks */ } *************** *** 2881,2886 **** move_chain_tuple(Relation rel, --- 2886,2899 ---- Size tuple_len = old_tup->t_len; /* + * we don't need to bother with the usual locking protocol for updating + * the visibility map, since we're holding an AccessExclusiveLock on the + * relation anyway. 
+ */ + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + + /* * make a modifiable copy of the source tuple. */ heap_copytuple_with_tuple(old_tup, &newtup); *************** *** 3020,3025 **** move_plain_tuple(Relation rel, --- 3033,3046 ---- ItemId newitemid; Size tuple_len = old_tup->t_len; + /* + * we don't need to bother with the usual locking protocol for updating + * the visibility map, since we're holding an AccessExclusiveLock on the + * relation anyway. + */ + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + /* copy tuple */ heap_copytuple_with_tuple(old_tup, &newtup); *************** *** 3238,3243 **** vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages) --- 3259,3265 ---- RelationGetRelationName(onerel), vacrelstats->rel_pages, relblocks))); FreeSpaceMapTruncateRel(onerel, relblocks); + visibilitymap_truncate(onerel, relblocks); RelationTruncate(onerel, relblocks); vacrelstats->rel_pages = relblocks; /* set new number of blocks */ } *************** *** 3279,3285 **** vacuum_page(Relation onerel, Buffer buffer, VacPage vacpage) recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, vacpage->offsets, vacpage->offsets_free, ! false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 3301,3307 ---- recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, vacpage->offsets, vacpage->offsets_free, ! 
false, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } *** src/backend/commands/vacuumlazy.c --- src/backend/commands/vacuumlazy.c *************** *** 40,45 **** --- 40,46 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "commands/dbcommands.h" #include "commands/vacuum.h" #include "miscadmin.h" *************** *** 87,92 **** typedef struct LVRelStats --- 88,94 ---- int max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; + bool scanned_all; /* have we scanned all pages (this far) in the rel? */ } LVRelStats; *************** *** 101,111 **** static BufferAccessStrategy vac_strategy; /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); --- 103,114 ---- /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! 
Relation *Irel, int nindexes, bool scan_all); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, LVRelStats *vacrelstats); + static void lazy_cleanup_index(Relation indrel, IndexBulkDeleteResult *stats, LVRelStats *vacrelstats); *************** *** 140,145 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, --- 143,149 ---- BlockNumber possibly_freeable; PGRUsage ru0; TimestampTz starttime = 0; + bool scan_all; pg_rusage_init(&ru0); *************** *** 165,172 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); --- 169,187 ---- vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); + /* Should we use the visibility map or scan all pages? */ + if (vacstmt->freeze_min_age != -1) + scan_all = true; + else if (vacstmt->analyze) + scan_all = true; + else + scan_all = false; + + /* initialize this variable */ + vacrelstats->scanned_all = true; + /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); *************** *** 231,237 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes) { BlockNumber nblocks, blkno; --- 246,252 ---- */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! 
Relation *Irel, int nindexes, bool scan_all) { BlockNumber nblocks, blkno; *************** *** 246,251 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 261,267 ---- IndexBulkDeleteResult **indstats; int i; PGRUsage ru0; + Buffer vmbuffer = InvalidBuffer; pg_rusage_init(&ru0); *************** *** 267,272 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 283,291 ---- lazy_space_alloc(vacrelstats, nblocks); + if (nblocks > 0) + visibilitymap_extend(onerel, nblocks); + for (blkno = 0; blkno < nblocks; blkno++) { Buffer buf; *************** *** 279,284 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 298,320 ---- OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; + bool all_visible_according_to_vm; + + /* + * If all tuples on page are visible to all, there's no + * need to visit that page. + * + * Note that we test the visibility map even if we're scanning all + * pages, to pin the visibility map page. We might set the bit there, + * and we don't want to do the I/O while we're holding the heap page + * locked. 
+ */ + all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); + if (!scan_all && all_visible_according_to_vm) + { + vacrelstats->scanned_all = false; + continue; + } vacuum_delay_point(); *************** *** 525,530 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 561,570 ---- freespace = PageGetHeapFreeSpace(page); + /* Update the visibility map */ + if (PageIsAllVisible(page)) + visibilitymap_set_opt(onerel, blkno, PageGetLSN(page), &vmbuffer); + /* Remember the location of the last page with nonremovable tuples */ if (hastup) vacrelstats->nonempty_pages = blkno + 1; *************** *** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 600,611 ---- vacrelstats->num_index_scans++; } + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); *************** *** 622,627 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) --- 668,682 ---- LockBufferForCleanup(buf); tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats); + /* + * Before we let the page go, prune it. The primary reason is to + * update the visibility map in the common special case that we just + * vacuumed away the last tuple on the page that wasn't visible to + * everyone. + */ + vacrelstats->tuples_deleted += + heap_page_prune(onerel, buf, OldestXmin, false, false); + /* Now that we've compacted the page, record its available space */ page = BufferGetPage(buf); freespace = PageGetHeapFreeSpace(page); *************** *** 686,692 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer, recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, unused, uncnt, ! 
false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } --- 741,747 ---- recptr = log_heap_clean(onerel, buffer, NULL, 0, NULL, 0, unused, uncnt, ! false, false); PageSetLSN(page, recptr); PageSetTLI(page, ThisTimeLineID); } *************** *** 829,834 **** lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) --- 884,890 ---- * Okay to truncate. */ FreeSpaceMapTruncateRel(onerel, new_rel_pages); + visibilitymap_truncate(onerel, new_rel_pages); RelationTruncate(onerel, new_rel_pages); /* *** src/backend/utils/cache/relcache.c --- src/backend/utils/cache/relcache.c *************** *** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp) --- 305,311 ---- MemSet(relation, 0, sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1366,1371 **** formrdesc(const char *relationName, Oid relationReltype, --- 1367,1373 ---- relation = (Relation) palloc0(sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1654,1662 **** RelationReloadIndexInfo(Relation relation) heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! 
/* Must reset targblock and fsm_nblocks_cache in case rel was truncated */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); --- 1656,1665 ---- heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! /* Must reset targblock and fsm_nblocks_cache and vm_nblocks_cache in case rel was truncated */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); *************** *** 1740,1745 **** RelationClearRelation(Relation relation, bool rebuild) --- 1743,1749 ---- { relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; if (relation->rd_rel->relkind == RELKIND_INDEX) { relation->rd_isvalid = false; /* needs to be revalidated */ *************** *** 2335,2340 **** RelationBuildLocalRelation(const char *relname, --- 2339,2345 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ rel->rd_smgr = NULL; *************** *** 3592,3597 **** load_relcache_init_file(void) --- 3597,3603 ---- rel->rd_smgr = NULL; rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; if (rel->rd_isnailed) rel->rd_refcnt = 1; else *** src/include/access/heapam.h --- src/include/access/heapam.h *************** *** 125,131 **** extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber 
*nowunused, int nunused, ! bool redirect_move); extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid, OffsetNumber *offsets, int offcnt); --- 125,131 ---- OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! bool redirect_move, bool all_visible); extern XLogRecPtr log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid, OffsetNumber *offsets, int offcnt); *************** *** 142,148 **** extern void heap_page_prune_execute(Buffer buffer, OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! bool redirect_move); extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets); /* in heap/syncscan.c */ --- 142,148 ---- OffsetNumber *redirected, int nredirected, OffsetNumber *nowdead, int ndead, OffsetNumber *nowunused, int nunused, ! bool redirect_move, bool all_visible); extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets); /* in heap/syncscan.c */ *** src/include/access/htup.h --- src/include/access/htup.h *************** *** 595,600 **** typedef struct xl_heaptid --- 595,601 ---- typedef struct xl_heap_delete { xl_heaptid target; /* deleted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ } xl_heap_delete; #define SizeOfHeapDelete (offsetof(xl_heap_delete, target) + SizeOfHeapTid) *************** *** 620,635 **** typedef struct xl_heap_header typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! 
#define SizeOfHeapInsert (offsetof(xl_heap_insert, target) + SizeOfHeapTid) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; --- 621,639 ---- typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! #define SizeOfHeapInsert (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool)) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ + bool new_all_visible_cleared; /* same for the page of newtid */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; *************** *** 660,665 **** typedef struct xl_heap_clean --- 664,670 ---- BlockNumber block; uint16 nredirected; uint16 ndead; + bool all_visible_set; /* OFFSET NUMBERS FOLLOW */ } xl_heap_clean; *** /dev/null --- src/include/access/visibilitymap.h *************** *** 0 **** --- 1,28 ---- + /*------------------------------------------------------------------------- + * + * visibilitymap.h + * visibility map interface + * + * + * Portions Copyright (c) 2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #ifndef VISIBILITYMAP_H + #define VISIBILITYMAP_H + + #include "utils/rel.h" + #include "storage/buf.h" + #include "storage/itemptr.h" + #include 
"access/xlogdefs.h" + + extern void visibilitymap_set_opt(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr, Buffer *vmbuf); + extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk); + extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf); + extern void visibilitymap_extend(Relation rel, BlockNumber heapblk); + extern void visibilitymap_truncate(Relation rel, BlockNumber heapblk); + + #endif /* VISIBILITYMAP_H */ *** src/include/storage/bufpage.h --- src/include/storage/bufpage.h *************** *** 152,159 **** typedef PageHeaderData *PageHeader; #define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */ #define PD_PAGE_FULL 0x0002 /* not enough free space for new * tuple? */ ! #define PD_VALID_FLAG_BITS 0x0003 /* OR of all valid pd_flags bits */ /* * Page layout version number 0 is for pre-7.3 Postgres releases. --- 152,161 ---- #define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */ #define PD_PAGE_FULL 0x0002 /* not enough free space for new * tuple? */ + #define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to + * everyone */ ! #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ /* * Page layout version number 0 is for pre-7.3 Postgres releases. *************** *** 336,341 **** typedef PageHeaderData *PageHeader; --- 338,350 ---- #define PageClearFull(page) \ (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL) + #define PageIsAllVisible(page) \ + (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE) + #define PageSetAllVisible(page) \ + (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE) + #define PageClearAllVisible(page) \ + (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE) + #define PageIsPrunable(page, oldestxmin) \ ( \ AssertMacro(TransactionIdIsNormal(oldestxmin)), \ *** src/include/storage/relfilenode.h --- src/include/storage/relfilenode.h *************** *** 24,37 **** typedef enum ForkNumber { InvalidForkNumber = -1, MAIN_FORKNUM = 0, ! 
FSM_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM below and update the * forkNames array in catalog.c */ } ForkNumber; ! #define MAX_FORKNUM FSM_FORKNUM /* * RelFileNode must provide all that we need to know to physically access --- 24,38 ---- { InvalidForkNumber = -1, MAIN_FORKNUM = 0, ! FSM_FORKNUM, /* * NOTE: if you add a new fork, change MAX_FORKNUM below and update the * forkNames array in catalog.c */ + VISIBILITYMAP_FORKNUM } ForkNumber; ! #define MAX_FORKNUM VISIBILITYMAP_FORKNUM /* * RelFileNode must provide all that we need to know to physically access *** src/include/utils/rel.h --- src/include/utils/rel.h *************** *** 195,202 **** typedef struct RelationData List *rd_indpred; /* index predicate tree, if any */ void *rd_amcache; /* available for use by index AM */ ! /* Cached last-seen size of the FSM */ BlockNumber rd_fsm_nblocks_cache; /* use "struct" here to avoid needing to include pgstat.h: */ struct PgStat_TableStatus *pgstat_info; /* statistics collection area */ --- 195,203 ---- List *rd_indpred; /* index predicate tree, if any */ void *rd_amcache; /* available for use by index AM */ ! /* Cached last-seen size of the FSM and visibility map */ BlockNumber rd_fsm_nblocks_cache; + BlockNumber rd_vm_nblocks_cache; /* use "struct" here to avoid needing to include pgstat.h: */ struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
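The addressing arithmetic behind HEAPBLK_TO_MAPBLOCK, HEAPBLK_TO_MAPBYTE and HEAPBLK_TO_MAPBIT, and the masking step in visibilitymap_truncate(), can be sketched in isolation. The macro definitions are not visible in the portion of the patch shown here, so the constants below are assumptions for illustration only: one bit per heap block, and a hypothetical MAP_BYTES_PER_PAGE of 8168 usable bytes per map page. MapAddr, map_addr() and truncate_last_map_page() are likewise hypothetical names, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: one bit per heap block, 8 bits per map byte. */
#define HEAPBLOCKS_PER_BYTE 8
/* Hypothetical number of usable bytes on one map page (illustrative). */
#define MAP_BYTES_PER_PAGE  8168
#define HEAPBLOCKS_PER_PAGE (MAP_BYTES_PER_PAGE * HEAPBLOCKS_PER_BYTE)

/* Position of a heap block's bit inside the visibility map. */
typedef struct
{
    uint32_t block;     /* which map page */
    uint32_t byte;      /* which byte on that page */
    uint8_t  bit;       /* which bit in that byte */
} MapAddr;

static MapAddr
map_addr(uint32_t heapBlk)
{
    MapAddr a;

    a.block = heapBlk / HEAPBLOCKS_PER_PAGE;
    a.byte = (heapBlk % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE;
    a.bit = (uint8_t) (heapBlk % HEAPBLOCKS_PER_BYTE);
    return a;
}

/*
 * Clear the bits of all heap blocks >= nheapblocks on the last remaining
 * map page, mirroring the MemSet-plus-mask step in the patch.
 */
static void
truncate_last_map_page(uint8_t *mappage, uint32_t nheapblocks)
{
    MapAddr a = map_addr(nheapblocks);
    size_t  len = MAP_BYTES_PER_PAGE - (a.byte + 1);

    assert(a.byte < MAP_BYTES_PER_PAGE);

    /* Zero every byte after the one holding the first truncated block. */
    memset(&mappage[a.byte + 1], 0, len);

    /* Keep only the bits below a.bit in the last remaining byte. */
    mappage[a.byte] &= (uint8_t) ((1u << a.bit) - 1);
}
```

With these assumptions, truncating to 13 heap blocks leaves byte 0 untouched, keeps only the low five bits of byte 1 (heap blocks 8 to 12), and zeroes the rest of the page.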
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> To modify a page:
> If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared 
> first. The heap page is kept pinned, but not locked, while the 
> visibility map is updated. We want to avoid holding a lock across I/O, 
> even though the visibility map is likely to stay in cache. After the 
> visibility map has been updated, the page is exclusively locked and 
> modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing 
> the lock.
So after having determined that you will modify a page, you release the
ex lock on the buffer and then try to regain it later?  Seems like a
really bad idea from here.  What if it's no longer possible to do the
modification you intended?
> To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the 
> page, while you observe that all tuples on the page are visible to everyone.
That doesn't sound too good from a concurrency standpoint...
> That's how the patch works right now. However, there's a small 
> performance problem with the current approach: setting the 
> PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
I'm more concerned about *clearing* the bit being WAL-logged.  That's
necessary for correctness.
        regards, tom lane
			
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> One option would be to just ignore that problem for now, and not
> WAL-log.

Probably worth skipping for now, since it will cause patch conflicts if
you do. Are there any other interactions with Hot Standby?

But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
to say "page is now all visible", without too much work.

Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
to set the block flag (unlogged), but just not update VM. Separating the
two concepts should allow the visibility check speed gain to be more
generally available.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>> One option would be to just ignore that problem for now, and not
>> WAL-log.
>
> Probably worth skipping for now, since it will cause patch conflicts if
> you do. Are there any other interactions with Hot Standby?
>
> But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
> to say "page is now all visible", without too much work.

Hmm. Even if a tuple is visible to everyone on the master, it's not
necessarily yet visible to all the read-only transactions in the slave.

> Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
> the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
> to set the block flag (unlogged), but just not update VM. Separating the
> two concepts should allow the visibility check speed gain to be more
> generally available.

Yes, that should be possible in theory. There's no version of
ConditionalLockBuffer() for conditionally upgrading a shared lock to
exclusive, but it should be possible in theory. I'm not sure if it would
be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock,
though. If it is, then we could do just that.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> ... I'm not sure if it would 
> be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock, 
> though. If it is, then we could do just that.
Seems like it must be safe.  If you have shared lock on a page then no
one else could be modifying the page in a way that would falsify
PD_ALL_VISIBLE.  You might have several processes concurrently try to
set the bit but that is safe (same situation as for hint bits).
The harder part is propagating the bit to the visibility map, but I
gather you intend to only allow VACUUM to do that?
        regards, tom lane
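
The argument above can be made concrete with a hand-written interleaving. The following is an illustration only, not PostgreSQL code: Rmw, two_setters() and the two_clearers_*() functions are hypothetical helpers that split a non-atomic read-modify-write into its load and store halves. The flag values are the ones from the bufpage.h hunk of the patch. It assumes, as stated in the message, that under a shared lock nobody changes any other bit of the flags word.

```c
#include <assert.h>
#include <stdint.h>

#define PD_HAS_FREE_LINES 0x0001
#define PD_PAGE_FULL      0x0002
#define PD_ALL_VISIBLE    0x0004

/* One participant's read-modify-write, split so steps can be interleaved. */
typedef struct
{
    uint16_t loaded;    /* value seen by the load half */
} Rmw;

static void
rmw_load(Rmw *op, const uint16_t *word)
{
    op->loaded = *word;
}

static void
rmw_store_set(const Rmw *op, uint16_t *word, uint16_t mask)
{
    *word = (uint16_t) (op->loaded | mask);
}

static void
rmw_store_clear(const Rmw *op, uint16_t *word, uint16_t mask)
{
    *word = (uint16_t) (op->loaded & ~mask);
}

/*
 * Two backends set PD_ALL_VISIBLE with fully interleaved steps. Both store
 * the same value, so the interleaving does not matter.
 */
static uint16_t
two_setters(uint16_t flags)
{
    Rmw a, b;

    rmw_load(&a, &flags);
    rmw_load(&b, &flags);
    rmw_store_set(&a, &flags, PD_ALL_VISIBLE);
    rmw_store_set(&b, &flags, PD_ALL_VISIBLE);
    return flags;
}

/*
 * Two backends clear different bits of the same word without a lock.
 * A's late store writes back the bit that B has just cleared.
 */
static uint16_t
two_clearers_unlocked(uint16_t word, uint16_t mask_a, uint16_t mask_b)
{
    Rmw a, b;

    assert((mask_a & mask_b) == 0);
    rmw_load(&a, &word);
    rmw_load(&b, &word);
    rmw_store_clear(&b, &word, mask_b);
    rmw_store_clear(&a, &word, mask_a);
    return word;
}

/* The same two clears, serialized as an exclusive lock would force. */
static uint16_t
two_clearers_locked(uint16_t word, uint16_t mask_a, uint16_t mask_b)
{
    Rmw a, b;

    assert((mask_a & mask_b) == 0);
    rmw_load(&a, &word);
    rmw_store_clear(&a, &word, mask_a);
    rmw_load(&b, &word);
    rmw_store_clear(&b, &word, mask_b);
    return word;
}
```

The second and third functions correspond to the "load; and; store" concern in the visibilitymap_clear() comment: clearing bits in a shared map byte needs serialization, while concurrent setters of one flag do not disturb each other.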
			
Tom Lane wrote:
> The harder part is propagating the bit to the visibility map, but I
> gather you intend to only allow VACUUM to do that?

Yep.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> To modify a page:
>> If PD_ALL_VISIBLE flag is set, the bit in the visibility map is cleared
>> first. The heap page is kept pinned, but not locked, while the
>> visibility map is updated. We want to avoid holding a lock across I/O,
>> even though the visibility map is likely to stay in cache. After the
>> visibility map has been updated, the page is exclusively locked and
>> modified as usual, and PD_ALL_VISIBLE flag is cleared before releasing
>> the lock.
>
> So after having determined that you will modify a page, you release the
> ex lock on the buffer and then try to regain it later?  Seems like a
> really bad idea from here.  What if it's no longer possible to do the
> modification you intended?

In case of insert/update, you have to find a new target page. I put the
logic in RelationGetBufferForTuple(). In case of delete and update (old
page), the flag is checked and bit cleared just after pinning the
buffer, before doing anything else. (I note that that's not actually
what the patch is doing for heap_update, will fix..)

If we give up on the strict requirement that the bit in the visibility
map has to be cleared if the PD_ALL_VISIBLE flag on the page is not set,
then we could just update the visibility map after releasing the locks
on the heap pages. I think I'll do that for now, for simplicity.

>> To set the PD_ALL_VISIBLE flag, you must hold an exclusive lock on the
>> page, while you observe that all tuples on the page are visible to everyone.
>
> That doesn't sound too good from a concurrency standpoint...

Well, no, but it's only done in VACUUM. And pruning.

I implemented it as a new loop that calls HeapTupleSatisfiesVacuum on
each tuple and checks that xmin is old enough for live tuples, but come
to think of it, we're already calling HeapTupleSatisfiesVacuum for every
tuple on the page during VACUUM, so it should be possible to piggyback
on that by restructuring the code.

>> That's how the patch works right now. However, there's a small
>> performance problem with the current approach: setting the
>> PD_ALL_VISIBLE flag must be WAL-logged. Otherwise, this could happen:
>
> I'm more concerned about *clearing* the bit being WAL-logged.  That's
> necessary for correctness.

Yes, clearing the PD_ALL_VISIBLE flag always needs to be WAL-logged.
There's a new boolean field in xl_heap_insert/update/delete records
indicating if the operation cleared the flag. On replay, if the flag was
cleared, the bit in the visibility map is also cleared.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
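
The replay rule described in the last paragraph can be sketched with a toy model. This is a simplified illustration under stated assumptions, not the patch's redo code: ToyRelation, ToyHeapRecord, toy_modify() and toy_redo() are hypothetical stand-ins, and only the all_visible_cleared field name is taken from the xl_heap_insert/update/delete changes in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PD_ALL_VISIBLE 0x0004
#define NBLOCKS        16       /* toy relation size, illustrative */

/* Toy relation: one flags word per heap page, one map bit per heap page. */
typedef struct
{
    uint16_t pd_flags[NBLOCKS];
    uint8_t  vm[NBLOCKS / 8];
} ToyRelation;

/* Toy WAL record for an insert, update or delete on one heap page. */
typedef struct
{
    uint32_t block;                 /* target heap block */
    bool     all_visible_cleared;   /* PD_ALL_VISIBLE was cleared */
} ToyHeapRecord;

static bool
vm_test(const ToyRelation *rel, uint32_t blk)
{
    return ((rel->vm[blk / 8] >> (blk % 8)) & 1) != 0;
}

static void
vm_set(ToyRelation *rel, uint32_t blk)
{
    rel->vm[blk / 8] |= (uint8_t) (1u << (blk % 8));
}

static void
vm_clear(ToyRelation *rel, uint32_t blk)
{
    rel->vm[blk / 8] &= (uint8_t) ~(1u << (blk % 8));
}

/* Normal operation: modify a page and produce the record to be logged. */
static ToyHeapRecord
toy_modify(ToyRelation *rel, uint32_t blk)
{
    ToyHeapRecord rec = {blk, false};

    assert(blk < NBLOCKS);
    if (rel->pd_flags[blk] & PD_ALL_VISIBLE)
    {
        rel->pd_flags[blk] &= (uint16_t) ~PD_ALL_VISIBLE;
        vm_clear(rel, blk);
        rec.all_visible_cleared = true;
    }
    return rec;
}

/* Replay: if the record says the flag was cleared, clear the map bit too. */
static void
toy_redo(ToyRelation *rel, const ToyHeapRecord *rec)
{
    assert(rec->block < NBLOCKS);
    if (rec->all_visible_cleared)
    {
        rel->pd_flags[rec->block] &= (uint16_t) ~PD_ALL_VISIBLE;
        vm_clear(rel, rec->block);
    }
}
```

Because the record carries the boolean, replay never needs to read the heap page to decide whether the map bit has to go. A second modification of the same page produces a record with the field false and leaves the map alone.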
On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> >> One option would be to just ignore that problem for now, and not
> >> WAL-log.
> >
> > Probably worth skipping for now, since it will cause patch conflicts if
> > you do. Are there any other interactions with Hot Standby?
> >
> > But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
> > to say "page is now all visible", without too much work.
>
> Hmm. Even if a tuple is visible to everyone on the master, it's not
> necessarily yet visible to all the read-only transactions in the slave.

Never a problem. No query can ever see the rows removed by a cleanup
record, enforced by the recovery system.

> > Does the PD_ALL_VISIBLE flag need to be set at the same time as updating
> > the VM? Surely heapgetpage() could do a ConditionalLockBuffer exclusive
> > to set the block flag (unlogged), but just not update VM. Separating the
> > two concepts should allow the visibility check speed gain to be more
> > generally available.
>
> Yes, that should be possible in theory. There's no version of
> ConditionalLockBuffer() for conditionally upgrading a shared lock to
> exclusive, but it should be possible in theory. I'm not sure if it would
> be safe to set the PD_ALL_VISIBLE_FLAG while holding just a shared lock,
> though. If it is, then we could do just that.

To be honest, I'm more excited about your perf results for that than I
am about speeding up some VACUUMs.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Tue, 2008-10-28 at 14:57 +0200, Heikki Linnakangas wrote:
>> Simon Riggs wrote:
>>> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>>>> One option would be to just ignore that problem for now, and not
>>>> WAL-log.
>>> Probably worth skipping for now, since it will cause patch conflicts if
>>> you do. Are there any other interactions with Hot Standby?
>>>
>>> But it seems like we can sneak in an extra flag on a HEAP2_CLEAN record
>>> to say "page is now all visible", without too much work.
>> Hmm. Even if a tuple is visible to everyone on the master, it's not
>> necessarily yet visible to all the read-only transactions in the slave.
>
> Never a problem. No query can ever see the rows removed by a cleanup
> record, enforced by the recovery system.

Yes, but there's a problem with recently inserted tuples:

1. A query begins in the slave, taking a snapshot with xmax = 100. So the effects of anything more recent should not be seen.
2. Transaction 100 inserts a tuple in the master, and commits.
3. A vacuum comes along. There are no other transactions running in the master. Vacuum sees that all tuples on the page, including the one just inserted, are visible to everyone, and sets the PD_ALL_VISIBLE flag.
4. The change is replicated to the slave.
5. The query in the slave that began at step 1 looks at the page and sees that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility checks, and erroneously returns the inserted tuple.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
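The five steps above can be reproduced with a deliberately simplified model. This is a sketch under stated assumptions, not PostgreSQL code: ToySnapshot, ToyTuple, ToyPage and the toy_* functions are hypothetical, a snapshot is reduced to a single xmax bound, every inserting transaction is assumed to have committed, and xid wraparound is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t ToyXid;

/* Simplified snapshot: transactions with xid >= xmax are invisible. */
typedef struct ToySnapshot {
    ToyXid xmax;
} ToySnapshot;

/* Simplified tuple: only the inserting xid, assumed committed. */
typedef struct ToyTuple {
    ToyXid xmin;
} ToyTuple;

typedef struct ToyPage {
    bool     all_visible;   /* models PD_ALL_VISIBLE as replicated from the master */
    int      ntuples;
    ToyTuple tuples[4];
} ToyPage;

/* The per-tuple visibility check a scan would normally perform. */
static bool toy_tuple_visible(const ToySnapshot *snap, const ToyTuple *tup)
{
    return tup->xmin < snap->xmax;
}

/*
 * Count the tuples a scan returns. With trust_flag set, the scan skips the
 * per-tuple check whenever the page claims to be all-visible.
 */
static int toy_count_visible(const ToySnapshot *snap, const ToyPage *page,
                             bool trust_flag)
{
    int count = 0;

    for (int i = 0; i < page->ntuples; i++)
    {
        if ((trust_flag && page->all_visible) ||
            toy_tuple_visible(snap, &page->tuples[i]))
            count++;
    }
    return count;
}
```

With a standby snapshot of xmax = 100 and a page holding tuples inserted by transactions 50 and 100, trusting the replicated flag returns both tuples, while the per-tuple check returns only the first. That difference is the erroneous result described in step 5.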
On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> Lazy VACUUM only needs to visit pages that are '0' in the visibility
> map. This allows partial vacuums, where we only need to scan those parts
> of the table that need vacuuming, plus all indexes.

Just realised that this means we still have to visit each block of a btree index with a cleanup lock. That means the earlier idea of saying I don't need a cleanup lock if the page is not in memory makes a lot more sense with a partial vacuum.

1. Scan all blocks in memory for the index (and so, don't do this unless the index is larger than a certain % of shared buffers).
2. Start reading in new blocks until you've removed the correct number of tuples.
3. Work through the rest of the blocks, checking that they are either in shared buffers and we can get a cleanup lock, or they aren't in shared buffers and so nobody has them pinned.

If you do step (2) intelligently with regard to index correlation you might not need to do much I/O at all, if any. (1) has a good hit ratio because mostly only active tables will be vacuumed, so they are fairly likely to be in memory.

-- 
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Tue, 2008-10-28 at 19:02 +0200, Heikki Linnakangas wrote:
> Yes, but there's a problem with recently inserted tuples:
>
> 1. A query begins in the slave, taking a snapshot with xmax = 100. So
> the effects of anything more recent should not be seen.
> 2. Transaction 100 inserts a tuple in the master, and commits
> 3. A vacuum comes along. There's no other transactions running in the
> master. Vacuum sees that all tuples on the page, including the one just
> inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
> 4. The change is replicated to the slave.
> 5. The query in the slave that began at step 1 looks at the page, sees
> that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility
> checks, and erroneously returns the inserted tuple.

Yep. I was thinking about FSM and row removal. So PD_ALL_VISIBLE must be separately settable on the standby. Another reason why it should be able to be set without a VACUUM - since there will never be one on standby.

-- 
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
>> Lazy VACUUM only needs to visit pages that are '0' in the visibility 
>> map. This allows partial vacuums, where we only need to scan those parts 
>> of the table that need vacuuming, plus all indexes.
> Just realised that this means we still have to visit each block of a
> btree index with a cleanup lock.
Yes, and your proposal cannot fix that.  Read "The Deletion Algorithm"
in nbtree/README, particularly the second paragraph.
        regards, tom lane
			
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Yes, but there's a problem with recently inserted tuples:
> 1. A query begins in the slave, taking a snapshot with xmax = 100. So 
> the effects of anything more recent should not be seen.
> 2. Transaction 100 inserts a tuple in the master, and commits
> 3. A vacuum comes along. There's no other transactions running in the 
> master. Vacuum sees that all tuples on the page, including the one just 
> inserted, are visible to everyone, and sets PD_ALL_VISIBLE flag.
> 4. The change is replicated to the slave.
> 5. The query in the slave that began at step 1 looks at the page, sees 
> that the PD_ALL_VISIBLE flag is set. Therefore it skips the visibility 
> checks, and erroneously returns the inserted tuple.
But this is exactly equivalent to the problem with recently deleted
tuples: vacuum on the master might take actions that are premature with
respect to the status on the slave.  Whatever solution we adopt for that
will work for this too.
        regards, tom lane
			
On Tue, 2008-10-28 at 13:58 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndQuadrant.com> writes:
> > On Mon, 2008-10-27 at 14:03 +0200, Heikki Linnakangas wrote:
> >> Lazy VACUUM only needs to visit pages that are '0' in the visibility
> >> map. This allows partial vacuums, where we only need to scan those parts
> >> of the table that need vacuuming, plus all indexes.
>
> > Just realised that this means we still have to visit each block of a
> > btree index with a cleanup lock.
>
> Yes, and your proposal cannot fix that. Read "The Deletion Algorithm"
> in nbtree/README, particularly the second paragraph.

Yes, understood. Please read the algorithm again. It does guarantee that each block in the index has been checked to see if nobody is pinning it, it just avoids performing I/O to prove that.

-- 
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Heikki Linnakangas wrote:
> Another thing that does need to be fixed, is the way that the extension
> and truncation of the visibility map is handled; that's broken in the
> current patch. I started working on the patch a long time ago, before
> the FSM rewrite was finished, and haven't gotten around fixing that part
> yet. We already solved it for the FSM, so we could just follow that
> pattern. The way we solved truncation in the FSM was to write a separate
> WAL record with the new heap size, but perhaps we want to revisit that
> decision, instead of adding again new code to write a third WAL record,
> for truncation of the visibility map. smgrtruncate() writes a WAL record
> of its own, if any full blocks are truncated away of the FSM, but we
> needed a WAL record even if no full blocks are truncated from the FSM
> file, because the "tail" of the last remaining FSM page, representing
> the truncated away heap pages, still needs to cleared. Visibility map
> has the same problem.
>
> One proposal was to piggyback on the smgrtruncate() WAL-record, and call
> FreeSpaceMapTruncateRel from smgr_redo(). I considered that ugly from a
> modularity point of view; smgr.c shouldn't be calling higher-level
> functions. But maybe it wouldn't be that bad, after all. Or, we could
> remove WAL-logging from smgrtruncate() altogether, and move it to
> RelationTruncate() or another higher-level function, and handle the
> WAL-logging and replay there.

In preparation for the visibility map patch, I revisited the truncation issue, and hacked together a patch to piggyback the FSM truncation to the main fork smgr truncation WAL record. I moved the WAL-logging from smgrtruncate() to RelationTruncate(). There's a new flag to RelationTruncate indicating whether the FSM should be truncated too, and only one truncation WAL record is written for the operation.
That does seem cleaner than the current approach, where the FSM writes a separate WAL record just to clear the bits of the last remaining FSM page. I had to move RelationTruncate() to smgr.c, because I don't think a function in bufmgr.c should be doing WAL-logging. However, RelationTruncate really doesn't belong in smgr.c either. Also, now that smgrtruncate doesn't write its own WAL record, it doesn't seem right for smgrcreate to be doing that either.

So, I think I'll take this one step further, and move RelationTruncate() to a new higher-level file, e.g. src/backend/catalog/storage.c, and also create a new RelationCreateStorage() function that calls smgrcreate(), and move the WAL-logging from smgrcreate() to RelationCreateStorage().

So, we'll have two functions in a new file:

    /* Create physical storage for a relation. If 'fsm' is true, an FSM fork
       is also created */
    RelationCreateStorage(Relation rel, bool fsm)
    /* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM
       is also truncated */
    RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)

The next question is whether the "pending rel deletion" stuff in smgr.c should be moved to the new file too. It seems like it would belong there better. That would leave smgr.c as a very thin wrapper around md.c.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
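The "tail" problem mentioned in the quoted text, where truncation may remove no whole map page yet still has to clear bits in the last remaining one, comes down to a little arithmetic. The following is a rough sketch for a per-heap-page bitmap such as the visibility map. ToyMap, the toy_* functions and the tiny 16-bits-per-map-page constant are all assumptions chosen for illustration; the real fork page sizes and functions differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS_PER_MAPPAGE  16u   /* hypothetical; real map pages hold far more bits */
#define TOY_BYTES_PER_MAPPAGE (TOY_BITS_PER_MAPPAGE / 8u)
#define TOY_MAX_MAPPAGES      8u

/* Toy bitmap fork: one bit per heap page, split into fixed-size map pages. */
typedef struct ToyMap {
    uint8_t  pages[TOY_MAX_MAPPAGES][TOY_BYTES_PER_MAPPAGE];
    uint32_t nmappages;
} ToyMap;

/* Demo helper: create a map of nmappages pages with every bit set. */
static void toy_map_fill(ToyMap *map, uint32_t nmappages)
{
    assert(nmappages <= TOY_MAX_MAPPAGES);
    memset(map, 0, sizeof(*map));
    memset(map->pages, 0xFF, (size_t) nmappages * TOY_BYTES_PER_MAPPAGE);
    map->nmappages = nmappages;
}

static bool toy_map_test(const ToyMap *map, uint32_t heapblk)
{
    uint32_t mappage = heapblk / TOY_BITS_PER_MAPPAGE;
    uint32_t bit = heapblk % TOY_BITS_PER_MAPPAGE;

    if (mappage >= map->nmappages)
        return false;           /* beyond the end of the map: treated as not set */
    return ((map->pages[mappage][bit / 8u] >> (bit % 8u)) & 1u) != 0;
}

/* Truncate the map to match a heap that now has heap_nblocks pages. */
static void toy_map_truncate(ToyMap *map, uint32_t heap_nblocks)
{
    /* Map pages still needed, rounding up for a partially used last page. */
    uint32_t keep = (heap_nblocks + TOY_BITS_PER_MAPPAGE - 1u) / TOY_BITS_PER_MAPPAGE;
    /* First bit in the last kept map page that refers to a removed heap page. */
    uint32_t tail = heap_nblocks % TOY_BITS_PER_MAPPAGE;

    if (keep > map->nmappages)
        return;                 /* the map was never extended this far */

    /* Whole map pages past the new end are dropped. */
    for (uint32_t i = keep; i < map->nmappages; i++)
        memset(map->pages[i], 0, TOY_BYTES_PER_MAPPAGE);
    map->nmappages = keep;

    /* The tail must be cleared even when no whole map page was dropped. */
    if (tail != 0)
    {
        for (uint32_t bit = tail; bit < TOY_BITS_PER_MAPPAGE; bit++)
            map->pages[keep - 1u][bit / 8u] &= (uint8_t) ~(1u << (bit % 8u));
    }
}
```

In this model, truncating a 32-page heap to 20 pages keeps both map pages but clears bits 4 through 15 of the second one. That is the case where a truncation record is needed even though the map file itself does not shrink.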
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> The next question is whether the "pending rel deletion" stuff in smgr.c should
> be moved to the new file too. It seems like it would belong there better. That
> would leave smgr.c as a very thin wrapper around md.c

Well it's just a switch, albeit with only one case, so I wouldn't expect it to be much more than a thin wrapper.

If we had more storage systems it might be clearer what features were common to all of them and could be hoisted up from md.c. I'm not clear there are any though.

Actually I wonder if an entirely in-memory storage system would help with the "temporary table" problem on systems where the kernel is too aggressive about flushing file buffers or metadata.

-- 
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
Heikki Linnakangas wrote:
> So, I think I'll take this one step forward, and move RelationTruncate()
> to a new higher level file, e.g. src/backend/catalog/storage.c, and also
> create a new RelationCreateStorage() function that calls smgrcreate(),
> and move the WAL-logging from smgrcreate() to RelationCreateStorage().
>
> So, we'll have two functions in a new file:
>
> /* Create physical storage for a relation. If 'fsm' is true, an FSM fork
> is also created */
> RelationCreateStorage(Relation rel, bool fsm)
> /* Truncate the relation to 'nblocks' blocks. If 'fsm' is true, the FSM
> is also truncated */
> RelationTruncate(Relation rel, BlockNumber nblocks, bool fsm)
>
> The next question is whether the "pending rel deletion" stuff in smgr.c
> should be moved to the new file too. It seems like it would belong there
> better. That would leave smgr.c as a very thin wrapper around md.c

This new approach feels pretty good to me; attached is a patch to do just that. Many of the functions formerly in smgr.c are now in src/backend/catalog/storage.c, including all the WAL-logging and pending rel deletion stuff. I kept their old names for now, though perhaps they should be renamed now that they're above smgr level.

I also implemented Tom's idea of delaying creation of the FSM until it's needed, not because of performance, but because it started to get quite hairy to keep track of which relations should have an FSM and which shouldn't. Creation of the FSM fork is now treated more like extending a relation, as a non-WAL-logged operation, and it's up to freespace.c to create the file when it's needed. There's no operation to explicitly delete an individual fork of a relation: RelationCreateStorage only creates the main fork, RelationDropStorage drops all forks, and RelationTruncate truncates the FSM if and only if the FSM fork exists.
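For readers unfamiliar with the "pending rel deletion" bookkeeping that moves into storage.c, here is a small sketch of the idea as the header comment in the attached patch describes it: a created relation is remembered so its file can be removed on abort, a dropped relation is remembered so its file can be removed on commit, and every entry carries a nesting level. Everything here is a toy assumption: ToyPendingDelete and the toy_* functions are hypothetical, a relation is reduced to an unsigned number, no files are touched, and the rule that ending a level processes entries at that level or deeper is my reading rather than something stated in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* One remembered create or drop; relnode stands in for a RelFileNode. */
typedef struct ToyPendingDelete {
    unsigned relnode;
    bool     at_commit;     /* true: unlink at commit (drop); false: unlink at abort (create) */
    int      nest_level;    /* transaction nesting level that made the entry */
    struct ToyPendingDelete *next;
} ToyPendingDelete;

static ToyPendingDelete *toy_pending = NULL;

static void toy_remember(unsigned relnode, bool at_commit, int nest_level)
{
    ToyPendingDelete *p = malloc(sizeof(*p));

    if (p == NULL)
        abort();
    p->relnode = relnode;
    p->at_commit = at_commit;
    p->nest_level = nest_level;
    p->next = toy_pending;
    toy_pending = p;
}

/* Creating storage: the file would be created now; undo it if we abort. */
static void toy_create_storage(unsigned relnode, int nest_level)
{
    toy_remember(relnode, false, nest_level);
}

/* Dropping storage: nothing is removed yet; do it only if we commit. */
static void toy_drop_storage(unsigned relnode, int nest_level)
{
    toy_remember(relnode, true, nest_level);
}

/*
 * End of the (sub)transaction at nest_level. Entries made at this level or
 * deeper are consumed; those whose at_commit matches the outcome would have
 * all their forks unlinked. Returns how many relations that is.
 */
static int toy_end_xact(bool is_commit, int nest_level)
{
    int unlinked = 0;
    ToyPendingDelete **link = &toy_pending;

    while (*link != NULL)
    {
        ToyPendingDelete *p = *link;

        if (p->nest_level < nest_level)
        {
            link = &p->next;    /* belongs to an outer level: keep it */
            continue;
        }
        *link = p->next;
        if (p->at_commit == is_commit)
            unlinked++;         /* real code would unlink every existing fork here */
        free(p);
    }
    return unlinked;
}
```

Keeping one entry per relation rather than one per fork mirrors the change in the patch from RelFileFork arrays to RelFileNode arrays: the fork loop happens at unlink time, over whichever forks exist.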
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** src/backend/access/gin/gininsert.c --- src/backend/access/gin/gininsert.c *************** *** 284,292 **** ginbuild(PG_FUNCTION_ARGS) elog(ERROR, "index \"%s\" already contains data", RelationGetRelationName(index)); - /* Initialize FSM */ - InitIndexFreeSpaceMap(index); - initGinState(&buildstate.ginstate, index); /* initialize the root page */ --- 284,289 ---- *** src/backend/access/gin/ginvacuum.c --- src/backend/access/gin/ginvacuum.c *************** *** 16,21 **** --- 16,22 ---- #include "access/genam.h" #include "access/gin.h" + #include "catalog/storage.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "storage/bufmgr.h" *************** *** 757,763 **** ginvacuumcleanup(PG_FUNCTION_ARGS) if (info->vacuum_full && lastBlock > lastFilledBlock) { /* try to truncate index */ - FreeSpaceMapTruncateRel(index, lastFilledBlock + 1); RelationTruncate(index, lastFilledBlock + 1); stats->pages_removed = lastBlock - lastFilledBlock; --- 758,763 ---- *** src/backend/access/gist/gist.c --- src/backend/access/gist/gist.c *************** *** 103,111 **** gistbuild(PG_FUNCTION_ARGS) elog(ERROR, "index \"%s\" already contains data", RelationGetRelationName(index)); - /* Initialize FSM */ - InitIndexFreeSpaceMap(index); - /* no locking is needed */ initGISTstate(&buildstate.giststate, index); --- 103,108 ---- *** src/backend/access/gist/gistvacuum.c --- src/backend/access/gist/gistvacuum.c *************** *** 16,21 **** --- 16,22 ---- #include "access/genam.h" #include "access/gist_private.h" + #include "catalog/storage.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "storage/bufmgr.h" *************** *** 603,609 **** gistvacuumcleanup(PG_FUNCTION_ARGS) if (info->vacuum_full && lastFilledBlock < lastBlock) { /* try to truncate index */ - FreeSpaceMapTruncateRel(rel, lastFilledBlock + 1); RelationTruncate(rel, lastFilledBlock + 1); stats->std.pages_removed = lastBlock - 
lastFilledBlock; --- 604,609 ---- *** src/backend/access/heap/heapam.c --- src/backend/access/heap/heapam.c *************** *** 4863,4870 **** heap_sync(Relation rel) /* FlushRelationBuffers will have opened rd_smgr */ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM); ! /* sync FSM as well */ ! smgrimmedsync(rel->rd_smgr, FSM_FORKNUM); /* toast heap, if any */ if (OidIsValid(rel->rd_rel->reltoastrelid)) --- 4863,4869 ---- /* FlushRelationBuffers will have opened rd_smgr */ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM); ! /* FSM is not critical, don't bother syncing it */ /* toast heap, if any */ if (OidIsValid(rel->rd_rel->reltoastrelid)) *************** *** 4874,4880 **** heap_sync(Relation rel) toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock); FlushRelationBuffers(toastrel); smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM); - smgrimmedsync(toastrel->rd_smgr, FSM_FORKNUM); heap_close(toastrel, AccessShareLock); } } --- 4873,4878 ---- *** src/backend/access/nbtree/nbtree.c --- src/backend/access/nbtree/nbtree.c *************** *** 22,27 **** --- 22,28 ---- #include "access/nbtree.h" #include "access/relscan.h" #include "catalog/index.h" + #include "catalog/storage.h" #include "commands/vacuum.h" #include "miscadmin.h" #include "storage/bufmgr.h" *************** *** 109,117 **** btbuild(PG_FUNCTION_ARGS) elog(ERROR, "index \"%s\" already contains data", RelationGetRelationName(index)); - /* Initialize FSM */ - InitIndexFreeSpaceMap(index); - buildstate.spool = _bt_spoolinit(index, indexInfo->ii_Unique, false); /* --- 110,115 ---- *************** *** 696,702 **** btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, /* * Okay to truncate. 
*/ - FreeSpaceMapTruncateRel(rel, new_pages); RelationTruncate(rel, new_pages); /* update statistics */ --- 694,699 ---- *** src/backend/access/transam/rmgr.c --- src/backend/access/transam/rmgr.c *************** *** 31,37 **** const RmgrData RmgrTable[RM_MAX_ID + 1] = { {"Database", dbase_redo, dbase_desc, NULL, NULL, NULL}, {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, ! {"FreeSpaceMap", fsm_redo, fsm_desc, NULL, NULL, NULL}, {"Reserved 8", NULL, NULL, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, --- 31,37 ---- {"Database", dbase_redo, dbase_desc, NULL, NULL, NULL}, {"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL}, {"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL}, ! {"Reserved 7", NULL, NULL, NULL, NULL, NULL}, {"Reserved 8", NULL, NULL, NULL, NULL, NULL}, {"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL}, {"Heap", heap_redo, heap_desc, NULL, NULL, NULL}, *** src/backend/access/transam/twophase.c --- src/backend/access/transam/twophase.c *************** *** 48,54 **** --- 48,56 ---- #include "access/twophase.h" #include "access/twophase_rmgr.h" #include "access/xact.h" + #include "access/xlogutils.h" #include "catalog/pg_type.h" + #include "catalog/storage.h" #include "funcapi.h" #include "miscadmin.h" #include "pg_trace.h" *************** *** 141,152 **** static void RecordTransactionCommitPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, ! RelFileFork *rels); static void RecordTransactionAbortPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, ! RelFileFork *rels); static void ProcessRecords(char *bufptr, TransactionId xid, const TwoPhaseCallback callbacks[]); --- 143,154 ---- int nchildren, TransactionId *children, int nrels, ! 
RelFileNode *rels); static void RecordTransactionAbortPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, ! RelFileNode *rels); static void ProcessRecords(char *bufptr, TransactionId xid, const TwoPhaseCallback callbacks[]); *************** *** 793,800 **** StartPrepare(GlobalTransaction gxact) TransactionId xid = gxact->proc.xid; TwoPhaseFileHeader hdr; TransactionId *children; ! RelFileFork *commitrels; ! RelFileFork *abortrels; /* Initialize linked list */ records.head = palloc0(sizeof(XLogRecData)); --- 795,802 ---- TransactionId xid = gxact->proc.xid; TwoPhaseFileHeader hdr; TransactionId *children; ! RelFileNode *commitrels; ! RelFileNode *abortrels; /* Initialize linked list */ records.head = palloc0(sizeof(XLogRecData)); *************** *** 832,843 **** StartPrepare(GlobalTransaction gxact) } if (hdr.ncommitrels > 0) { ! save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileFork)); pfree(commitrels); } if (hdr.nabortrels > 0) { ! save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileFork)); pfree(abortrels); } } --- 834,845 ---- } if (hdr.ncommitrels > 0) { ! save_state_data(commitrels, hdr.ncommitrels * sizeof(RelFileNode)); pfree(commitrels); } if (hdr.nabortrels > 0) { ! save_state_data(abortrels, hdr.nabortrels * sizeof(RelFileNode)); pfree(abortrels); } } *************** *** 1140,1147 **** FinishPreparedTransaction(const char *gid, bool isCommit) TwoPhaseFileHeader *hdr; TransactionId latestXid; TransactionId *children; ! RelFileFork *commitrels; ! RelFileFork *abortrels; int i; /* --- 1142,1151 ---- TwoPhaseFileHeader *hdr; TransactionId latestXid; TransactionId *children; ! RelFileNode *commitrels; ! RelFileNode *abortrels; ! RelFileNode *delrels; ! int ndelrels; int i; /* *************** *** 1169,1178 **** FinishPreparedTransaction(const char *gid, bool isCommit) bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader)); children = (TransactionId *) bufptr; bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)); ! 
commitrels = (RelFileFork *) bufptr; ! bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileFork)); ! abortrels = (RelFileFork *) bufptr; ! bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileFork)); /* compute latestXid among all children */ latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); --- 1173,1182 ---- bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader)); children = (TransactionId *) bufptr; bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)); ! commitrels = (RelFileNode *) bufptr; ! bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode)); ! abortrels = (RelFileNode *) bufptr; ! bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode)); /* compute latestXid among all children */ latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); *************** *** 1214,1234 **** FinishPreparedTransaction(const char *gid, bool isCommit) */ if (isCommit) { ! for (i = 0; i < hdr->ncommitrels; i++) ! { ! SMgrRelation srel = smgropen(commitrels[i].rnode); ! smgrdounlink(srel, commitrels[i].forknum, false, false); ! smgrclose(srel); ! } } else { ! for (i = 0; i < hdr->nabortrels; i++) { ! SMgrRelation srel = smgropen(abortrels[i].rnode); ! smgrdounlink(srel, abortrels[i].forknum, false, false); ! smgrclose(srel); } } /* And now do the callbacks */ --- 1218,1245 ---- */ if (isCommit) { ! delrels = commitrels; ! ndelrels = hdr->ncommitrels; } else { ! delrels = abortrels; ! ndelrels = hdr->nabortrels; ! } ! for (i = 0; i < ndelrels; i++) ! { ! SMgrRelation srel = smgropen(delrels[i]); ! ForkNumber fork; ! ! for (fork = 0; fork <= MAX_FORKNUM; fork++) { ! if (smgrexists(srel, fork)) ! { ! XLogDropRelation(delrels[i], fork); ! smgrdounlink(srel, fork, false, true); ! } } + smgrclose(srel); } /* And now do the callbacks */ *************** *** 1639,1646 **** RecoverPreparedTransactions(void) bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader)); subxids = (TransactionId *) bufptr; bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)); ! 
bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileFork)); ! bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileFork)); /* * Reconstruct subtrans state for the transaction --- needed --- 1650,1657 ---- bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader)); subxids = (TransactionId *) bufptr; bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)); ! bufptr += MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode)); ! bufptr += MAXALIGN(hdr->nabortrels * sizeof(RelFileNode)); /* * Reconstruct subtrans state for the transaction --- needed *************** *** 1693,1699 **** RecordTransactionCommitPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, ! RelFileFork *rels) { XLogRecData rdata[3]; int lastrdata = 0; --- 1704,1710 ---- int nchildren, TransactionId *children, int nrels, ! RelFileNode *rels) { XLogRecData rdata[3]; int lastrdata = 0; *************** *** 1718,1724 **** RecordTransactionCommitPrepared(TransactionId xid, { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileFork); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } --- 1729,1735 ---- { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileNode); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } *************** *** 1766,1772 **** RecordTransactionAbortPrepared(TransactionId xid, int nchildren, TransactionId *children, int nrels, ! RelFileFork *rels) { XLogRecData rdata[3]; int lastrdata = 0; --- 1777,1783 ---- int nchildren, TransactionId *children, int nrels, ! RelFileNode *rels) { XLogRecData rdata[3]; int lastrdata = 0; *************** *** 1796,1802 **** RecordTransactionAbortPrepared(TransactionId xid, { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileFork); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } --- 1807,1813 ---- { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! 
rdata[1].len = nrels * sizeof(RelFileNode); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } *** src/backend/access/transam/xact.c --- src/backend/access/transam/xact.c *************** *** 28,33 **** --- 28,34 ---- #include "access/xlogutils.h" #include "catalog/catalog.h" #include "catalog/namespace.h" + #include "catalog/storage.h" #include "commands/async.h" #include "commands/tablecmds.h" #include "commands/trigger.h" *************** *** 819,825 **** RecordTransactionCommit(void) bool markXidCommitted = TransactionIdIsValid(xid); TransactionId latestXid = InvalidTransactionId; int nrels; ! RelFileFork *rels; bool haveNonTemp; int nchildren; TransactionId *children; --- 820,826 ---- bool markXidCommitted = TransactionIdIsValid(xid); TransactionId latestXid = InvalidTransactionId; int nrels; ! RelFileNode *rels; bool haveNonTemp; int nchildren; TransactionId *children; *************** *** 900,906 **** RecordTransactionCommit(void) { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileFork); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } --- 901,907 ---- { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileNode); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } *************** *** 1165,1171 **** RecordTransactionAbort(bool isSubXact) TransactionId xid = GetCurrentTransactionIdIfAny(); TransactionId latestXid; int nrels; ! RelFileFork *rels; int nchildren; TransactionId *children; XLogRecData rdata[3]; --- 1166,1172 ---- TransactionId xid = GetCurrentTransactionIdIfAny(); TransactionId latestXid; int nrels; ! RelFileNode *rels; int nchildren; TransactionId *children; XLogRecData rdata[3]; *************** *** 1226,1232 **** RecordTransactionAbort(bool isSubXact) { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! 
rdata[1].len = nrels * sizeof(RelFileFork); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } --- 1227,1233 ---- { rdata[0].next = &(rdata[1]); rdata[1].data = (char *) rels; ! rdata[1].len = nrels * sizeof(RelFileNode); rdata[1].buffer = InvalidBuffer; lastrdata = 1; } *************** *** 2078,2084 **** AbortTransaction(void) AtEOXact_xml(); AtEOXact_on_commit_actions(false); AtEOXact_Namespace(false); - smgrabort(); AtEOXact_Files(); AtEOXact_ComboCid(); AtEOXact_HashTables(false); --- 2079,2084 ---- *************** *** 4239,4250 **** xact_redo_commit(xl_xact_commit *xlrec, TransactionId xid) /* Make sure files supposed to be dropped are dropped */ for (i = 0; i < xlrec->nrels; i++) { ! SMgrRelation srel; ! XLogDropRelation(xlrec->xnodes[i].rnode, xlrec->xnodes[i].forknum); ! ! srel = smgropen(xlrec->xnodes[i].rnode); ! smgrdounlink(srel, xlrec->xnodes[i].forknum, false, true); smgrclose(srel); } } --- 4239,4255 ---- /* Make sure files supposed to be dropped are dropped */ for (i = 0; i < xlrec->nrels; i++) { ! SMgrRelation srel = smgropen(xlrec->xnodes[i]); ! ForkNumber fork; ! for (fork = 0; fork <= MAX_FORKNUM; fork++) ! { ! if (smgrexists(srel, fork)) ! { ! XLogDropRelation(xlrec->xnodes[i], fork); ! smgrdounlink(srel, fork, false, true); ! } ! } smgrclose(srel); } } *************** *** 4277,4288 **** xact_redo_abort(xl_xact_abort *xlrec, TransactionId xid) /* Make sure files supposed to be dropped are dropped */ for (i = 0; i < xlrec->nrels; i++) { ! SMgrRelation srel; ! XLogDropRelation(xlrec->xnodes[i].rnode, xlrec->xnodes[i].forknum); ! ! srel = smgropen(xlrec->xnodes[i].rnode); ! smgrdounlink(srel, xlrec->xnodes[i].forknum, false, true); smgrclose(srel); } } --- 4282,4298 ---- /* Make sure files supposed to be dropped are dropped */ for (i = 0; i < xlrec->nrels; i++) { ! SMgrRelation srel = smgropen(xlrec->xnodes[i]); ! ForkNumber fork; ! for (fork = 0; fork <= MAX_FORKNUM; fork++) ! { ! if (smgrexists(srel, fork)) ! { ! 
XLogDropRelation(xlrec->xnodes[i], fork); ! smgrdounlink(srel, fork, false, true); ! } ! } smgrclose(srel); } } *************** *** 4339,4346 **** xact_desc_commit(StringInfo buf, xl_xact_commit *xlrec) appendStringInfo(buf, "; rels:"); for (i = 0; i < xlrec->nrels; i++) { ! char *path = relpath(xlrec->xnodes[i].rnode, ! xlrec->xnodes[i].forknum); appendStringInfo(buf, " %s", path); pfree(path); } --- 4349,4355 ---- appendStringInfo(buf, "; rels:"); for (i = 0; i < xlrec->nrels; i++) { ! char *path = relpath(xlrec->xnodes[i], MAIN_FORKNUM); appendStringInfo(buf, " %s", path); pfree(path); } *************** *** 4367,4374 **** xact_desc_abort(StringInfo buf, xl_xact_abort *xlrec) appendStringInfo(buf, "; rels:"); for (i = 0; i < xlrec->nrels; i++) { ! char *path = relpath(xlrec->xnodes[i].rnode, ! xlrec->xnodes[i].forknum); appendStringInfo(buf, " %s", path); pfree(path); } --- 4376,4382 ---- appendStringInfo(buf, "; rels:"); for (i = 0; i < xlrec->nrels; i++) { ! char *path = relpath(xlrec->xnodes[i], MAIN_FORKNUM); appendStringInfo(buf, " %s", path); pfree(path); } *** src/backend/access/transam/xlogutils.c --- src/backend/access/transam/xlogutils.c *************** *** 273,279 **** XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum, * filesystem loses an inode during a crash. Better to write the data * until we are actually told to delete the file.) */ ! smgrcreate(smgr, forknum, false, true); lastblock = smgrnblocks(smgr, forknum); --- 273,279 ---- * filesystem loses an inode during a crash. Better to write the data * until we are actually told to delete the file.) */ ! 
smgrcreate(smgr, forknum, true); lastblock = smgrnblocks(smgr, forknum); *** src/backend/catalog/Makefile --- src/backend/catalog/Makefile *************** *** 13,19 **** include $(top_builddir)/src/Makefile.global OBJS = catalog.o dependency.o heap.o index.o indexing.o namespace.o aclchk.o \ pg_aggregate.o pg_constraint.o pg_conversion.o pg_depend.o pg_enum.o \ pg_largeobject.o pg_namespace.o pg_operator.o pg_proc.o pg_shdepend.o \ ! pg_type.o toasting.o BKIFILES = postgres.bki postgres.description postgres.shdescription --- 13,19 ---- OBJS = catalog.o dependency.o heap.o index.o indexing.o namespace.o aclchk.o \ pg_aggregate.o pg_constraint.o pg_conversion.o pg_depend.o pg_enum.o \ pg_largeobject.o pg_namespace.o pg_operator.o pg_proc.o pg_shdepend.o \ ! pg_type.o storage.o toasting.o BKIFILES = postgres.bki postgres.description postgres.shdescription *** src/backend/catalog/heap.c --- src/backend/catalog/heap.c *************** *** 47,52 **** --- 47,53 ---- #include "catalog/pg_tablespace.h" #include "catalog/pg_type.h" #include "catalog/pg_type_fn.h" + #include "catalog/storage.h" #include "commands/tablecmds.h" #include "commands/typecmds.h" #include "miscadmin.h" *************** *** 295,317 **** heap_create(const char *relname, /* * Have the storage manager create the relation's disk file, if needed. * ! * We create storage for the main fork here, and also for the FSM for a ! * heap or toast relation. The caller is responsible for creating any ! * additional forks if needed. */ if (create_storage) ! { ! Assert(rel->rd_smgr == NULL); ! RelationOpenSmgr(rel); ! smgrcreate(rel->rd_smgr, MAIN_FORKNUM, rel->rd_istemp, false); ! ! /* ! * For a real heap, create FSM fork as well. Indexams are ! * responsible for creating any extra forks themselves. ! */ ! if (relkind == RELKIND_RELATION || relkind == RELKIND_TOASTVALUE) ! smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false); ! 
} return rel; } --- 296,306 ---- /* * Have the storage manager create the relation's disk file, if needed. * ! * We only create the main fork here, the other forks will be created ! * on-demand. */ if (create_storage) ! RelationCreateStorage(rel->rd_node, rel->rd_istemp); return rel; } *************** *** 1426,1438 **** heap_drop_with_catalog(Oid relid) if (rel->rd_rel->relkind != RELKIND_VIEW && rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE) { ! ForkNumber forknum; ! ! RelationOpenSmgr(rel); ! for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) ! if (smgrexists(rel->rd_smgr, forknum)) ! smgrscheduleunlink(rel->rd_smgr, forknum, rel->rd_istemp); ! RelationCloseSmgr(rel); } /* --- 1415,1421 ---- if (rel->rd_rel->relkind != RELKIND_VIEW && rel->rd_rel->relkind != RELKIND_COMPOSITE_TYPE) { ! RelationDropStorage(rel); } /* *************** *** 2348,2354 **** heap_truncate(List *relids) Relation rel = lfirst(cell); /* Truncate the FSM and actual file (and discard buffers) */ - FreeSpaceMapTruncateRel(rel, 0); RelationTruncate(rel, 0); /* If this relation has indexes, truncate the indexes too */ --- 2331,2336 ---- *** src/backend/catalog/index.c --- src/backend/catalog/index.c *************** *** 41,46 **** --- 41,47 ---- #include "catalog/pg_opclass.h" #include "catalog/pg_tablespace.h" #include "catalog/pg_type.h" + #include "catalog/storage.h" #include "commands/tablecmds.h" #include "executor/executor.h" #include "miscadmin.h" *************** *** 897,903 **** index_drop(Oid indexId) Relation indexRelation; HeapTuple tuple; bool hasexprs; - ForkNumber forknum; /* * To drop an index safely, we must grab exclusive lock on its parent --- 898,903 ---- *************** *** 918,929 **** index_drop(Oid indexId) /* * Schedule physical removal of the files */ ! RelationOpenSmgr(userIndexRelation); ! for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) ! if (smgrexists(userIndexRelation->rd_smgr, forknum)) ! smgrscheduleunlink(userIndexRelation->rd_smgr, forknum, ! 
userIndexRelation->rd_istemp); ! RelationCloseSmgr(userIndexRelation); /* * Close and flush the index's relcache entry, to ensure relcache doesn't --- 918,924 ---- /* * Schedule physical removal of the files */ ! RelationDropStorage(userIndexRelation); /* * Close and flush the index's relcache entry, to ensure relcache doesn't *************** *** 1283,1293 **** setNewRelfilenode(Relation relation, TransactionId freezeXid) { Oid newrelfilenode; RelFileNode newrnode; - SMgrRelation srel; Relation pg_class; HeapTuple tuple; Form_pg_class rd_rel; - ForkNumber i; /* Can't change relfilenode for nailed tables (indexes ok though) */ Assert(!relation->rd_isnailed || --- 1278,1286 ---- *************** *** 1318,1325 **** setNewRelfilenode(Relation relation, TransactionId freezeXid) RelationGetRelid(relation)); rd_rel = (Form_pg_class) GETSTRUCT(tuple); - RelationOpenSmgr(relation); - /* * ... and create storage for corresponding forks in the new relfilenode. * --- 1311,1316 ---- *************** *** 1327,1354 **** setNewRelfilenode(Relation relation, TransactionId freezeXid) */ newrnode = relation->rd_node; newrnode.relNode = newrelfilenode; - srel = smgropen(newrnode); - - /* Create the main fork, like heap_create() does */ - smgrcreate(srel, MAIN_FORKNUM, relation->rd_istemp, false); /* ! * For a heap, create FSM fork as well. Indexams are responsible for ! * creating any extra forks themselves. */ ! if (relation->rd_rel->relkind == RELKIND_RELATION || ! relation->rd_rel->relkind == RELKIND_TOASTVALUE) ! smgrcreate(srel, FSM_FORKNUM, relation->rd_istemp, false); ! ! /* schedule unlinking old files */ ! for (i = 0; i <= MAX_FORKNUM; i++) ! { ! if (smgrexists(relation->rd_smgr, i)) ! smgrscheduleunlink(relation->rd_smgr, i, relation->rd_istemp); ! } ! ! smgrclose(srel); ! RelationCloseSmgr(relation); /* update the pg_class row */ rd_rel->relfilenode = newrelfilenode; --- 1318,1330 ---- */ newrnode = relation->rd_node; newrnode.relNode = newrelfilenode; /* ! 
* Create the main fork, like heap_create() does, and drop the old ! * storage. */ ! RelationCreateStorage(newrnode, relation->rd_istemp); ! RelationDropStorage(relation); /* update the pg_class row */ rd_rel->relfilenode = newrelfilenode; *************** *** 2326,2333 **** reindex_index(Oid indexId) if (inplace) { /* ! * Truncate the actual file (and discard buffers). The indexam ! * is responsible for truncating the FSM, if applicable */ RelationTruncate(iRel, 0); } --- 2302,2308 ---- if (inplace) { /* ! * Truncate the actual file (and discard buffers). */ RelationTruncate(iRel, 0); } *** /dev/null --- src/backend/catalog/storage.c *************** *** 0 **** --- 1,460 ---- + /*------------------------------------------------------------------------- + * + * storage.c + * code to create and destroy physical storage for relations + * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + + #include "postgres.h" + + #include "access/xact.h" + #include "access/xlogutils.h" + #include "catalog/catalog.h" + #include "catalog/storage.h" + #include "storage/freespace.h" + #include "storage/smgr.h" + #include "utils/memutils.h" + #include "utils/rel.h" + + /* + * We keep a list of all relations (represented as RelFileNode values) + * that have been created or deleted in the current transaction. When + * a relation is created, we create the physical file immediately, but + * remember it so that we can delete the file again if the current + * transaction is aborted. Conversely, a deletion request is NOT + * executed immediately, but is just entered in the list. When and if + * the transaction commits, we can delete the physical file. + * + * To handle subtransactions, every entry is marked with its transaction + * nesting level. 
At subtransaction commit, we reassign the subtransaction's + * entries to the parent nesting level. At subtransaction abort, we can + * immediately execute the abort-time actions for all entries of the current + * nesting level. + * + * NOTE: the list is kept in TopMemoryContext to be sure it won't disappear + * unbetimes. It'd probably be OK to keep it in TopTransactionContext, + * but I'm being paranoid. + */ + + typedef struct PendingRelDelete + { + RelFileNode relnode; /* relation that may need to be deleted */ + bool isTemp; /* is it a temporary relation? */ + bool atCommit; /* T=delete at commit; F=delete at abort */ + int nestLevel; /* xact nesting level of request */ + struct PendingRelDelete *next; /* linked-list link */ + } PendingRelDelete; + + static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ + + /* + * Declarations for smgr-related XLOG records + * + * Note: we log file creation and truncation here, but logging of deletion + * actions is handled by xact.c, because it is part of transaction commit. + */ + + /* XLOG gives us high 4 bits */ + #define XLOG_SMGR_CREATE 0x10 + #define XLOG_SMGR_TRUNCATE 0x20 + + typedef struct xl_smgr_create + { + RelFileNode rnode; + } xl_smgr_create; + + typedef struct xl_smgr_truncate + { + BlockNumber blkno; + RelFileNode rnode; + } xl_smgr_truncate; + + + /* + * RelationCreateStorage + * Create physical storage for a relation. + * + * Create the underlying disk file storage for the relation. This only + * creates the main fork; additional forks are created lazily by the + * modules that need them. + * + * This function is transactional. The creation is WAL-logged, and if the + * transaction aborts later on, the storage will be destroyed. 
+ */ + void + RelationCreateStorage(RelFileNode rnode, bool istemp) + { + PendingRelDelete *pending; + + XLogRecPtr lsn; + XLogRecData rdata; + xl_smgr_create xlrec; + SMgrRelation srel; + + srel = smgropen(rnode); + smgrcreate(srel, MAIN_FORKNUM, false); + + smgrclose(srel); + + if (!istemp) + { + /* + * Make an XLOG entry showing the file creation. If we abort, the file + * will be dropped at abort time. + */ + xlrec.rnode = rnode; + + rdata.data = (char *) &xlrec; + rdata.len = sizeof(xlrec); + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE, &rdata); + } + + /* Add the relation to the list of stuff to delete at abort */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rnode; + pending->isTemp = istemp; + pending->atCommit = false; /* delete if abort */ + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + } + + /* + * RelationDropStorage + * Schedule unlinking of physical storage at transaction commit. + */ + void + RelationDropStorage(Relation rel) + { + PendingRelDelete *pending; + + /* Add the relation to the list of stuff to delete at commit */ + pending = (PendingRelDelete *) + MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); + pending->relnode = rel->rd_node; + pending->isTemp = rel->rd_istemp; + pending->atCommit = true; /* delete if commit */ + pending->nestLevel = GetCurrentTransactionNestLevel(); + pending->next = pendingDeletes; + pendingDeletes = pending; + + /* + * NOTE: if the relation was created in this transaction, it will now be + * present in the pending-delete list twice, once with atCommit true and + * once with atCommit false. Hence, it will be physically deleted at end + * of xact in either case (and the other entry will be ignored by + * smgrDoPendingDeletes, so no error will occur).
We could instead remove + * the existing list entry and delete the physical file immediately, but + * for now I'll keep the logic simple. + */ + + RelationCloseSmgr(rel); + } + + /* + * RelationTruncate + * Physically truncate a relation to the specified number of blocks. + * + * This includes getting rid of any buffers for the blocks that are to be + * dropped. If the relation has an FSM fork, the FSM is truncated as well. + */ + void + RelationTruncate(Relation rel, BlockNumber nblocks) + { + bool fsm; + + /* Open it at the smgr level if not already done */ + RelationOpenSmgr(rel); + + /* Make sure rd_targblock isn't pointing somewhere past end */ + rel->rd_targblock = InvalidBlockNumber; + + /* Truncate the FSM too if it exists. */ + fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM); + if (fsm) + FreeSpaceMapTruncateRel(rel, nblocks); + + /* + * We WAL-log the truncation before actually truncating, which + * means trouble if the truncation fails. If we then crash, the WAL + * replay likely isn't going to succeed in the truncation either, and + * cause a PANIC. It's tempting to put a critical section here, but + * that cure would be worse than the disease. It would turn a usually + * harmless failure to truncate, that could spell trouble at WAL replay, + * into a certain PANIC. + */ + if (!rel->rd_istemp) + { + /* + * Make an XLOG entry showing the file truncation. + */ + XLogRecPtr lsn; + XLogRecData rdata; + xl_smgr_truncate xlrec; + + xlrec.blkno = nblocks; + xlrec.rnode = rel->rd_node; + + rdata.data = (char *) &xlrec; + rdata.len = sizeof(xlrec); + rdata.buffer = InvalidBuffer; + rdata.next = NULL; + + lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata); + + /* + * Flush, because otherwise the truncation of the main relation + * might hit the disk before the WAL record of truncating the + * FSM is flushed. If we crashed during that window, we'd be + * left with a truncated heap, without a truncated FSM.
+ */ + if (fsm) + XLogFlush(lsn); + } + + /* Do the real work */ + smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks, rel->rd_istemp); + } + + /* + * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. + * + * This also runs when aborting a subxact; we want to clean up a failed + * subxact immediately. + */ + void + smgrDoPendingDeletes(bool isCommit) + { + int nestLevel = GetCurrentTransactionNestLevel(); + PendingRelDelete *pending; + PendingRelDelete *prev; + PendingRelDelete *next; + + prev = NULL; + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + if (pending->nestLevel < nestLevel) + { + /* outer-level entries should not be processed yet */ + prev = pending; + } + else + { + /* unlink list entry first, so we don't retry on failure */ + if (prev) + prev->next = next; + else + pendingDeletes = next; + /* do deletion if called for */ + if (pending->atCommit == isCommit) + { + int i; + + /* schedule unlinking old files */ + SMgrRelation srel; + + srel = smgropen(pending->relnode); + for (i = 0; i <= MAX_FORKNUM; i++) + { + if (smgrexists(srel, i)) + smgrdounlink(srel, + i, + pending->isTemp, + false); + } + smgrclose(srel); + } + /* must explicitly free the list entry */ + pfree(pending); + /* prev does not change */ + } + } + } + + /* + * smgrGetPendingDeletes() -- Get a list of relations to be deleted. + * + * The return value is the number of relations scheduled for termination. + * *ptr is set to point to a freshly-palloc'd array of RelFileNodes. + * If there are no relations to be deleted, *ptr is set to NULL. + * + * If haveNonTemp isn't NULL, the bool it points to gets set to true if + * there is any non-temp table pending to be deleted; false if not. + * + * Note that the list does not include anything scheduled for termination + * by upper-level transactions.
+ */ + int + smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, bool *haveNonTemp) + { + int nestLevel = GetCurrentTransactionNestLevel(); + int nrels; + RelFileNode *rptr; + PendingRelDelete *pending; + + nrels = 0; + if (haveNonTemp) + *haveNonTemp = false; + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit) + nrels++; + } + if (nrels == 0) + { + *ptr = NULL; + return 0; + } + rptr = (RelFileNode *) palloc(nrels * sizeof(RelFileNode)); + *ptr = rptr; + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit) + { + *rptr = pending->relnode; + rptr++; + } + if (haveNonTemp && !pending->isTemp) + *haveNonTemp = true; + } + return nrels; + } + + /* + * PostPrepare_smgr -- Clean up after a successful PREPARE + * + * What we have to do here is throw away the in-memory state about pending + * relation deletes. It's all been recorded in the 2PC state file and + * it's no longer smgr's job to worry about it. + */ + void + PostPrepare_smgr(void) + { + PendingRelDelete *pending; + PendingRelDelete *next; + + for (pending = pendingDeletes; pending != NULL; pending = next) + { + next = pending->next; + pendingDeletes = next; + /* must explicitly free the list entry */ + pfree(pending); + } + } + + + /* + * AtSubCommit_smgr() --- Take care of subtransaction commit. + * + * Reassign all items in the pending-deletes list to the parent transaction. + */ + void + AtSubCommit_smgr(void) + { + int nestLevel = GetCurrentTransactionNestLevel(); + PendingRelDelete *pending; + + for (pending = pendingDeletes; pending != NULL; pending = pending->next) + { + if (pending->nestLevel >= nestLevel) + pending->nestLevel = nestLevel - 1; + } + } + + /* + * AtSubAbort_smgr() --- Take care of subtransaction abort. + * + * Delete created relations and forget about deleted relations. 
+ * We can execute these operations immediately because we know this + * subtransaction will not commit. + */ + void + AtSubAbort_smgr(void) + { + smgrDoPendingDeletes(false); + } + + void + smgr_redo(XLogRecPtr lsn, XLogRecord *record) + { + uint8 info = record->xl_info & ~XLR_INFO_MASK; + + if (info == XLOG_SMGR_CREATE) + { + xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode); + smgrcreate(reln, MAIN_FORKNUM, true); + } + else if (info == XLOG_SMGR_TRUNCATE) + { + xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); + SMgrRelation reln; + + reln = smgropen(xlrec->rnode); + + /* + * Forcibly create relation if it doesn't exist (which suggests that + * it was dropped somewhere later in the WAL sequence). As in + * XLogOpenRelation, we prefer to recreate the rel and replay the log + * as best we can until the drop is seen. + */ + smgrcreate(reln, MAIN_FORKNUM, true); + + smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno, false); + + /* Also tell xlogutils.c about it */ + XLogTruncateRelation(xlrec->rnode, MAIN_FORKNUM, xlrec->blkno); + + /* Truncate FSM too */ + if (smgrexists(reln, FSM_FORKNUM)) + { + Relation rel = CreateFakeRelcacheEntry(xlrec->rnode); + FreeSpaceMapTruncateRel(rel, xlrec->blkno); + FreeFakeRelcacheEntry(rel); + } + + } + else + elog(PANIC, "smgr_redo: unknown op code %u", info); + } + + void + smgr_desc(StringInfo buf, uint8 xl_info, char *rec) + { + uint8 info = xl_info & ~XLR_INFO_MASK; + + if (info == XLOG_SMGR_CREATE) + { + xl_smgr_create *xlrec = (xl_smgr_create *) rec; + char *path = relpath(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfo(buf, "file create: %s", path); + pfree(path); + } + else if (info == XLOG_SMGR_TRUNCATE) + { + xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec; + char *path = relpath(xlrec->rnode, MAIN_FORKNUM); + + appendStringInfo(buf, "file truncate: %s to %u blocks", path, + xlrec->blkno); + pfree(path); + } + else + 
appendStringInfo(buf, "UNKNOWN"); + } *** src/backend/commands/tablecmds.c --- src/backend/commands/tablecmds.c *************** *** 35,40 **** --- 35,41 ---- #include "catalog/pg_trigger.h" #include "catalog/pg_type.h" #include "catalog/pg_type_fn.h" + #include "catalog/storage.h" #include "catalog/toasting.h" #include "commands/cluster.h" #include "commands/defrem.h" *************** *** 6482,6488 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace) Relation pg_class; HeapTuple tuple; Form_pg_class rd_rel; ! ForkNumber forkNum; /* * Need lock here in case we are recursing to toast table or index --- 6483,6489 ---- Relation pg_class; HeapTuple tuple; Form_pg_class rd_rel; ! ForkNumber forkNum; /* * Need lock here in case we are recursing to toast table or index *************** *** 6558,6564 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace) newrnode = rel->rd_node; newrnode.relNode = newrelfilenode; newrnode.spcNode = newTableSpace; - dstrel = smgropen(newrnode); RelationOpenSmgr(rel); --- 6559,6564 ---- *************** *** 6567,6588 **** ATExecSetTableSpace(Oid tableOid, Oid newTableSpace) * of old physical files. * * NOTE: any conflict in relfilenode value will be caught in ! * smgrcreate() below. */ ! for (forkNum = 0; forkNum <= MAX_FORKNUM; forkNum++) { if (smgrexists(rel->rd_smgr, forkNum)) { ! smgrcreate(dstrel, forkNum, rel->rd_istemp, false); copy_relation_data(rel->rd_smgr, dstrel, forkNum, rel->rd_istemp); - - smgrscheduleunlink(rel->rd_smgr, forkNum, rel->rd_istemp); } } /* Close old and new relation */ smgrclose(dstrel); - RelationCloseSmgr(rel); /* update the pg_class row */ rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace; --- 6567,6592 ---- * of old physical files. * * NOTE: any conflict in relfilenode value will be caught in ! * RelationCreateStorage(). */ ! RelationCreateStorage(newrnode, rel->rd_istemp); ! ! dstrel = smgropen(newrnode); ! ! 
copy_relation_data(rel->rd_smgr, dstrel, MAIN_FORKNUM, rel->rd_istemp); ! for (forkNum = MAIN_FORKNUM + 1; forkNum <= MAX_FORKNUM; forkNum++) { if (smgrexists(rel->rd_smgr, forkNum)) { ! smgrcreate(dstrel, forkNum, false); copy_relation_data(rel->rd_smgr, dstrel, forkNum, rel->rd_istemp); } } + RelationDropStorage(rel); + /* Close old and new relation */ smgrclose(dstrel); /* update the pg_class row */ rd_rel->reltablespace = (newTableSpace == MyDatabaseTableSpace) ? InvalidOid : newTableSpace; *** src/backend/commands/vacuum.c --- src/backend/commands/vacuum.c *************** *** 31,36 **** --- 31,37 ---- #include "catalog/namespace.h" #include "catalog/pg_database.h" #include "catalog/pg_namespace.h" + #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/vacuum.h" #include "executor/executor.h" *************** *** 2863,2869 **** repair_frag(VRelStats *vacrelstats, Relation onerel, /* Truncate relation, if needed */ if (blkno < nblocks) { - FreeSpaceMapTruncateRel(onerel, blkno); RelationTruncate(onerel, blkno); vacrelstats->rel_pages = blkno; /* set new number of blocks */ } --- 2864,2869 ---- *************** *** 3258,3264 **** vacuum_heap(VRelStats *vacrelstats, Relation onerel, VacPageList vacuum_pages) (errmsg("\"%s\": truncated %u to %u pages", RelationGetRelationName(onerel), vacrelstats->rel_pages, relblocks))); - FreeSpaceMapTruncateRel(onerel, relblocks); RelationTruncate(onerel, relblocks); vacrelstats->rel_pages = relblocks; /* set new number of blocks */ } --- 3258,3263 ---- *** src/backend/commands/vacuumlazy.c --- src/backend/commands/vacuumlazy.c *************** *** 40,45 **** --- 40,46 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/vacuum.h" #include "miscadmin.h" *************** *** 827,833 **** lazy_truncate_heap(Relation onerel, LVRelStats *vacrelstats) /* * Okay to truncate. 
*/ - FreeSpaceMapTruncateRel(onerel, new_rel_pages); RelationTruncate(onerel, new_rel_pages); /* --- 828,833 ---- *** src/backend/rewrite/rewriteDefine.c --- src/backend/rewrite/rewriteDefine.c *************** *** 19,31 **** #include "catalog/indexing.h" #include "catalog/namespace.h" #include "catalog/pg_rewrite.h" #include "miscadmin.h" #include "nodes/nodeFuncs.h" #include "parser/parse_utilcmd.h" #include "rewrite/rewriteDefine.h" #include "rewrite/rewriteManip.h" #include "rewrite/rewriteSupport.h" - #include "storage/smgr.h" #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inval.h" --- 19,31 ---- #include "catalog/indexing.h" #include "catalog/namespace.h" #include "catalog/pg_rewrite.h" + #include "catalog/storage.h" #include "miscadmin.h" #include "nodes/nodeFuncs.h" #include "parser/parse_utilcmd.h" #include "rewrite/rewriteDefine.h" #include "rewrite/rewriteManip.h" #include "rewrite/rewriteSupport.h" #include "utils/acl.h" #include "utils/builtins.h" #include "utils/inval.h" *************** *** 484,499 **** DefineQueryRewrite(char *rulename, * XXX what about getting rid of its TOAST table? For now, we don't. */ if (RelisBecomingView) ! { ! ForkNumber forknum; ! ! RelationOpenSmgr(event_relation); ! for (forknum = 0; forknum <= MAX_FORKNUM; forknum++) ! if (smgrexists(event_relation->rd_smgr, forknum)) ! smgrscheduleunlink(event_relation->rd_smgr, forknum, ! event_relation->rd_istemp); ! RelationCloseSmgr(event_relation); ! } /* Close rel, but keep lock till commit... */ heap_close(event_relation, NoLock); --- 484,490 ---- * XXX what about getting rid of its TOAST table? For now, we don't. */ if (RelisBecomingView) ! RelationDropStorage(event_relation); /* Close rel, but keep lock till commit... */ heap_close(event_relation, NoLock); *** src/backend/storage/buffer/bufmgr.c --- src/backend/storage/buffer/bufmgr.c *************** *** 1695,1702 **** void BufmgrCommit(void) { /* Nothing to do in bufmgr anymore... 
*/ - - smgrcommit(); } /* --- 1695,1700 ---- *************** *** 1848,1873 **** RelationGetNumberOfBlocks(Relation relation) return smgrnblocks(relation->rd_smgr, MAIN_FORKNUM); } - /* - * RelationTruncate - * Physically truncate a relation to the specified number of blocks. - * - * As of Postgres 8.1, this includes getting rid of any buffers for the - * blocks that are to be dropped; previously, callers had to do that. - */ - void - RelationTruncate(Relation rel, BlockNumber nblocks) - { - /* Open it at the smgr level if not already done */ - RelationOpenSmgr(rel); - - /* Make sure rd_targblock isn't pointing somewhere past end */ - rel->rd_targblock = InvalidBlockNumber; - - /* Do the real work */ - smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks, rel->rd_istemp); - } - /* --------------------------------------------------------------------- * DropRelFileNodeBuffers * --- 1846,1851 ---- *** src/backend/storage/freespace/freespace.c --- src/backend/storage/freespace/freespace.c *************** *** 47,53 **** * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming * default 8k BLCKSZ, and that MaxFSMRequestSize is 24 bytes, the categories * look like this ! * * * Range Category * 0 - 31 0 --- 47,53 ---- * MaxFSMRequestSize depends on the architecture and BLCKSZ, but assuming * default 8k BLCKSZ, and that MaxFSMRequestSize is 24 bytes, the categories * look like this ! * * * Range Category * 0 - 31 0 *************** *** 93,107 **** typedef struct /* Address of the root page. 
*/ static const FSMAddress FSM_ROOT_ADDRESS = { FSM_ROOT_LEVEL, 0 }; - /* XLOG record types */ - #define XLOG_FSM_TRUNCATE 0x00 /* truncate */ - - typedef struct - { - RelFileNode node; /* truncated relation */ - BlockNumber nheapblocks; /* new number of blocks in the heap */ - } xl_fsm_truncate; - /* functions to navigate the tree */ static FSMAddress fsm_get_child(FSMAddress parent, uint16 slot); static FSMAddress fsm_get_parent(FSMAddress child, uint16 *slot); --- 93,98 ---- *************** *** 110,116 **** static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot); static BlockNumber fsm_logical_to_physical(FSMAddress addr); static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend); ! static void fsm_extend(Relation rel, BlockNumber nfsmblocks); /* functions to convert amount of free space to a FSM category */ static uint8 fsm_space_avail_to_cat(Size avail); --- 101,107 ---- static BlockNumber fsm_logical_to_physical(FSMAddress addr); static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend); ! static void fsm_extend(Relation rel, BlockNumber nfsmblocks, bool createstorage); /* functions to convert amount of free space to a FSM category */ static uint8 fsm_space_avail_to_cat(Size avail); *************** *** 123,130 **** static int fsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot, static BlockNumber fsm_search(Relation rel, uint8 min_cat); static uint8 fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof); - static void fsm_redo_truncate(xl_fsm_truncate *xlrec); - /******** Public API ********/ --- 114,119 ---- *************** *** 275,280 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks) --- 264,276 ---- RelationOpenSmgr(rel); + /* + * If no FSM has been created yet for this relation, there's nothing to + * truncate. 
+ */ + if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) + return; + /* Get the location in the FSM of the first removed heap block */ first_removed_address = fsm_get_location(nblocks, &first_removed_slot); *************** *** 307,348 **** FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks) smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks, rel->rd_istemp); /* - * FSM truncations are WAL-logged, because we must never return a block - * that doesn't exist in the heap, not even if we crash before the FSM - * truncation has made it to disk. smgrtruncate() writes its own WAL - * record, but that's not enough to zero out the last remaining FSM page. - * (if we didn't need to zero out anything above, we can skip this) - */ - if (!rel->rd_istemp && first_removed_slot != 0) - { - xl_fsm_truncate xlrec; - XLogRecData rdata; - XLogRecPtr recptr; - - xlrec.node = rel->rd_node; - xlrec.nheapblocks = nblocks; - - rdata.data = (char *) &xlrec; - rdata.len = sizeof(xl_fsm_truncate); - rdata.buffer = InvalidBuffer; - rdata.next = NULL; - - recptr = XLogInsert(RM_FREESPACE_ID, XLOG_FSM_TRUNCATE, &rdata); - - /* - * Flush, because otherwise the truncation of the main relation - * might hit the disk before the WAL record of truncating the - * FSM is flushed. If we crashed during that window, we'd be - * left with a truncated heap, without a truncated FSM. - */ - XLogFlush(recptr); - } - - /* * Need to invalidate the relcache entry, because rd_fsm_nblocks_cache * seen by other backends is no longer valid. */ ! CacheInvalidateRelcache(rel); rel->rd_fsm_nblocks_cache = new_nfsmblocks; } --- 303,313 ---- smgrtruncate(rel->rd_smgr, FSM_FORKNUM, new_nfsmblocks, rel->rd_istemp); /* * Need to invalidate the relcache entry, because rd_fsm_nblocks_cache * seen by other backends is no longer valid. */ ! if (!InRecovery) ! 
CacheInvalidateRelcache(rel); rel->rd_fsm_nblocks_cache = new_nfsmblocks; } *************** *** 538,551 **** fsm_readbuf(Relation rel, FSMAddress addr, bool extend) RelationOpenSmgr(rel); ! if (rel->rd_fsm_nblocks_cache == InvalidBlockNumber || rel->rd_fsm_nblocks_cache <= blkno) ! rel->rd_fsm_nblocks_cache = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); if (blkno >= rel->rd_fsm_nblocks_cache) { if (extend) ! fsm_extend(rel, blkno + 1); else return InvalidBuffer; } --- 503,521 ---- RelationOpenSmgr(rel); ! if (rel->rd_fsm_nblocks_cache == InvalidBlockNumber || rel->rd_fsm_nblocks_cache <= blkno) ! { ! if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) ! fsm_extend(rel, blkno + 1, true); ! else ! rel->rd_fsm_nblocks_cache = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); ! } if (blkno >= rel->rd_fsm_nblocks_cache) { if (extend) ! fsm_extend(rel, blkno + 1, false); else return InvalidBuffer; } *************** *** 566,575 **** fsm_readbuf(Relation rel, FSMAddress addr, bool extend) /* * Ensure that the FSM fork is at least n_fsmblocks long, extending * it if necessary with empty pages. And by empty, I mean pages filled ! * with zeros, meaning there's no free space. */ static void ! fsm_extend(Relation rel, BlockNumber n_fsmblocks) { BlockNumber n_fsmblocks_now; Page pg; --- 536,546 ---- /* * Ensure that the FSM fork is at least n_fsmblocks long, extending * it if necessary with empty pages. And by empty, I mean pages filled ! * with zeros, meaning there's no free space. If createstorage is true, ! * the FSM file might need to be created first. */ static void ! fsm_extend(Relation rel, BlockNumber n_fsmblocks, bool createstorage) { BlockNumber n_fsmblocks_now; Page pg; *************** *** 584,595 **** fsm_extend(Relation rel, BlockNumber n_fsmblocks) * FSM happens seldom enough that it doesn't seem worthwhile to * have a separate lock tag type for it. * ! * Note that another backend might have extended the relation ! * before we get the lock. 
*/ LockRelationForExtension(rel, ExclusiveLock); ! n_fsmblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); while (n_fsmblocks_now < n_fsmblocks) { smgrextend(rel->rd_smgr, FSM_FORKNUM, n_fsmblocks_now, --- 555,574 ---- * FSM happens seldom enough that it doesn't seem worthwhile to * have a separate lock tag type for it. * ! * Note that another backend might have extended or created the ! * relation before we get the lock. */ LockRelationForExtension(rel, ExclusiveLock); ! /* Create the FSM file first if it doesn't exist */ ! if (createstorage && !smgrexists(rel->rd_smgr, FSM_FORKNUM)) ! { ! smgrcreate(rel->rd_smgr, FSM_FORKNUM, false); ! n_fsmblocks_now = 0; ! } ! else ! n_fsmblocks_now = smgrnblocks(rel->rd_smgr, FSM_FORKNUM); ! while (n_fsmblocks_now < n_fsmblocks) { smgrextend(rel->rd_smgr, FSM_FORKNUM, n_fsmblocks_now, *************** *** 799,873 **** fsm_vacuum_page(Relation rel, FSMAddress addr, bool *eof_p) return max_avail; } - - - /****** WAL-logging ******/ - - static void - fsm_redo_truncate(xl_fsm_truncate *xlrec) - { - FSMAddress first_removed_address; - uint16 first_removed_slot; - BlockNumber fsmblk; - Buffer buf; - - /* Get the location in the FSM of the first removed heap block */ - first_removed_address = fsm_get_location(xlrec->nheapblocks, - &first_removed_slot); - fsmblk = fsm_logical_to_physical(first_removed_address); - - /* - * Zero out the tail of the last remaining FSM page. We rely on the - * replay of the smgr truncation record to remove completely unused - * pages. 
- */ - buf = XLogReadBufferExtended(xlrec->node, FSM_FORKNUM, fsmblk, - RBM_ZERO_ON_ERROR); - if (BufferIsValid(buf)) - { - Page page = BufferGetPage(buf); - - if (PageIsNew(page)) - PageInit(page, BLCKSZ, 0); - fsm_truncate_avail(page, first_removed_slot); - MarkBufferDirty(buf); - UnlockReleaseBuffer(buf); - } - } - - void - fsm_redo(XLogRecPtr lsn, XLogRecord *record) - { - uint8 info = record->xl_info & ~XLR_INFO_MASK; - - switch (info) - { - case XLOG_FSM_TRUNCATE: - fsm_redo_truncate((xl_fsm_truncate *) XLogRecGetData(record)); - break; - default: - elog(PANIC, "fsm_redo: unknown op code %u", info); - } - } - - void - fsm_desc(StringInfo buf, uint8 xl_info, char *rec) - { - uint8 info = xl_info & ~XLR_INFO_MASK; - - switch (info) - { - case XLOG_FSM_TRUNCATE: - { - xl_fsm_truncate *xlrec = (xl_fsm_truncate *) rec; - - appendStringInfo(buf, "truncate: rel %u/%u/%u; nheapblocks %u;", - xlrec->node.spcNode, xlrec->node.dbNode, - xlrec->node.relNode, xlrec->nheapblocks); - break; - } - default: - appendStringInfo(buf, "UNKNOWN"); - break; - } - } --- 778,780 ---- *** src/backend/storage/freespace/indexfsm.c --- src/backend/storage/freespace/indexfsm.c *************** *** 31,50 **** */ /* - * InitIndexFreeSpaceMap - Create or reset the FSM fork for relation. - */ - void - InitIndexFreeSpaceMap(Relation rel) - { - /* Create FSM fork if it doesn't exist yet, or truncate it if it does */ - RelationOpenSmgr(rel); - if (!smgrexists(rel->rd_smgr, FSM_FORKNUM)) - smgrcreate(rel->rd_smgr, FSM_FORKNUM, rel->rd_istemp, false); - else - smgrtruncate(rel->rd_smgr, FSM_FORKNUM, 0, rel->rd_istemp); - } - - /* * GetFreeIndexPage - return a free page from the FSM * * As a side effect, the page is marked as used in the FSM. 
--- 31,36 ---- *** src/backend/storage/smgr/smgr.c --- src/backend/storage/smgr/smgr.c *************** *** 17,31 **** */ #include "postgres.h" - #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" #include "commands/tablespace.h" #include "storage/bufmgr.h" #include "storage/ipc.h" #include "storage/smgr.h" #include "utils/hsearch.h" - #include "utils/memutils.h" /* --- 17,30 ---- */ #include "postgres.h" #include "access/xlogutils.h" #include "catalog/catalog.h" #include "commands/tablespace.h" #include "storage/bufmgr.h" + #include "storage/freespace.h" #include "storage/ipc.h" #include "storage/smgr.h" #include "utils/hsearch.h" /* *************** *** 58,65 **** typedef struct f_smgr void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks, bool isTemp); void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum); - void (*smgr_commit) (void); /* may be NULL */ - void (*smgr_abort) (void); /* may be NULL */ void (*smgr_pre_ckpt) (void); /* may be NULL */ void (*smgr_sync) (void); /* may be NULL */ void (*smgr_post_ckpt) (void); /* may be NULL */ --- 57,62 ---- *************** *** 70,76 **** static const f_smgr smgrsw[] = { /* magnetic disk */ {mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync, ! NULL, NULL, mdpreckpt, mdsync, mdpostckpt } }; --- 67,73 ---- /* magnetic disk */ {mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync, ! mdpreckpt, mdsync, mdpostckpt } }; *************** *** 82,146 **** static const int NSmgr = lengthof(smgrsw); */ static HTAB *SMgrRelationHash = NULL; - /* - * We keep a list of all relations (represented as RelFileNode values) - * that have been created or deleted in the current transaction. 
When - * a relation is created, we create the physical file immediately, but - * remember it so that we can delete the file again if the current - * transaction is aborted. Conversely, a deletion request is NOT - * executed immediately, but is just entered in the list. When and if - * the transaction commits, we can delete the physical file. - * - * To handle subtransactions, every entry is marked with its transaction - * nesting level. At subtransaction commit, we reassign the subtransaction's - * entries to the parent nesting level. At subtransaction abort, we can - * immediately execute the abort-time actions for all entries of the current - * nesting level. - * - * NOTE: the list is kept in TopMemoryContext to be sure it won't disappear - * unbetimes. It'd probably be OK to keep it in TopTransactionContext, - * but I'm being paranoid. - */ - - typedef struct PendingRelDelete - { - RelFileNode relnode; /* relation that may need to be deleted */ - ForkNumber forknum; /* fork number that may need to be deleted */ - int which; /* which storage manager? */ - bool isTemp; /* is it a temporary relation? */ - bool atCommit; /* T=delete at commit; F=delete at abort */ - int nestLevel; /* xact nesting level of request */ - struct PendingRelDelete *next; /* linked-list link */ - } PendingRelDelete; - - static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */ - - - /* - * Declarations for smgr-related XLOG records - * - * Note: we log file creation and truncation here, but logging of deletion - * actions is handled by xact.c, because it is part of transaction commit. 
- */ - - /* XLOG gives us high 4 bits */ - #define XLOG_SMGR_CREATE 0x10 - #define XLOG_SMGR_TRUNCATE 0x20 - - typedef struct xl_smgr_create - { - RelFileNode rnode; - ForkNumber forknum; - } xl_smgr_create; - - typedef struct xl_smgr_truncate - { - BlockNumber blkno; - RelFileNode rnode; - ForkNumber forknum; - } xl_smgr_truncate; - - /* local function prototypes */ static void smgrshutdown(int code, Datum arg); static void smgr_internal_unlink(RelFileNode rnode, ForkNumber forknum, --- 79,84 ---- *************** *** 341,358 **** smgrclosenode(RelFileNode rnode) * to be created. * * If isRedo is true, it is okay for the underlying file to exist ! * already because we are in a WAL replay sequence. In this case ! * we should make no PendingRelDelete entry; the WAL sequence will ! * tell whether to drop the file. */ void ! smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo) { - XLogRecPtr lsn; - XLogRecData rdata; - xl_smgr_create xlrec; - PendingRelDelete *pending; - /* * Exit quickly in WAL replay mode if we've already opened the file. * If it's open, it surely must exist. --- 279,289 ---- * to be created. * * If isRedo is true, it is okay for the underlying file to exist ! * already because we are in a WAL replay sequence. */ void ! smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo) { /* * Exit quickly in WAL replay mode if we've already opened the file. * If it's open, it surely must exist. *************** *** 374,442 **** smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo) isRedo); (*(smgrsw[reln->smgr_which].smgr_create)) (reln, forknum, isRedo); - - if (isRedo) - return; - - /* - * Make an XLOG entry showing the file creation. If we abort, the file - * will be dropped at abort time. 
- */ - xlrec.rnode = reln->smgr_rnode; - xlrec.forknum = forknum; - - rdata.data = (char *) &xlrec; - rdata.len = sizeof(xlrec); - rdata.buffer = InvalidBuffer; - rdata.next = NULL; - - lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_CREATE, &rdata); - - /* Add the relation to the list of stuff to delete at abort */ - pending = (PendingRelDelete *) - MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = reln->smgr_rnode; - pending->forknum = forknum; - pending->which = reln->smgr_which; - pending->isTemp = isTemp; - pending->atCommit = false; /* delete if abort */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; - } - - /* - * smgrscheduleunlink() -- Schedule unlinking a relation at xact commit. - * - * The fork is marked to be removed from the store if we successfully - * commit the current transaction. - */ - void - smgrscheduleunlink(SMgrRelation reln, ForkNumber forknum, bool isTemp) - { - PendingRelDelete *pending; - - /* Add the relation to the list of stuff to delete at commit */ - pending = (PendingRelDelete *) - MemoryContextAlloc(TopMemoryContext, sizeof(PendingRelDelete)); - pending->relnode = reln->smgr_rnode; - pending->forknum = forknum; - pending->which = reln->smgr_which; - pending->isTemp = isTemp; - pending->atCommit = true; /* delete if commit */ - pending->nestLevel = GetCurrentTransactionNestLevel(); - pending->next = pendingDeletes; - pendingDeletes = pending; - - /* - * NOTE: if the relation was created in this transaction, it will now be - * present in the pending-delete list twice, once with atCommit true and - * once with atCommit false. Hence, it will be physically deleted at end - * of xact in either case (and the other entry will be ignored by - * smgrDoPendingDeletes, so no error will occur). We could instead remove - * the existing list entry and delete the physical file immediately, but - * for now I'll keep the logic simple. 
- */ } /* --- 305,310 ---- *************** *** 573,599 **** smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks, /* Do the truncation */ (*(smgrsw[reln->smgr_which].smgr_truncate)) (reln, forknum, nblocks, isTemp); - - if (!isTemp) - { - /* - * Make an XLOG entry showing the file truncation. - */ - XLogRecPtr lsn; - XLogRecData rdata; - xl_smgr_truncate xlrec; - - xlrec.blkno = nblocks; - xlrec.rnode = reln->smgr_rnode; - xlrec.forknum = forknum; - - rdata.data = (char *) &xlrec; - rdata.len = sizeof(xlrec); - rdata.buffer = InvalidBuffer; - rdata.next = NULL; - - lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata); - } } /* --- 441,446 ---- *************** *** 627,813 **** smgrimmedsync(SMgrRelation reln, ForkNumber forknum) /* - * PostPrepare_smgr -- Clean up after a successful PREPARE - * - * What we have to do here is throw away the in-memory state about pending - * relation deletes. It's all been recorded in the 2PC state file and - * it's no longer smgr's job to worry about it. - */ - void - PostPrepare_smgr(void) - { - PendingRelDelete *pending; - PendingRelDelete *next; - - for (pending = pendingDeletes; pending != NULL; pending = next) - { - next = pending->next; - pendingDeletes = next; - /* must explicitly free the list entry */ - pfree(pending); - } - } - - - /* - * smgrDoPendingDeletes() -- Take care of relation deletes at end of xact. - * - * This also runs when aborting a subxact; we want to clean up a failed - * subxact immediately. 
- */ - void - smgrDoPendingDeletes(bool isCommit) - { - int nestLevel = GetCurrentTransactionNestLevel(); - PendingRelDelete *pending; - PendingRelDelete *prev; - PendingRelDelete *next; - - prev = NULL; - for (pending = pendingDeletes; pending != NULL; pending = next) - { - next = pending->next; - if (pending->nestLevel < nestLevel) - { - /* outer-level entries should not be processed yet */ - prev = pending; - } - else - { - /* unlink list entry first, so we don't retry on failure */ - if (prev) - prev->next = next; - else - pendingDeletes = next; - /* do deletion if called for */ - if (pending->atCommit == isCommit) - smgr_internal_unlink(pending->relnode, - pending->forknum, - pending->which, - pending->isTemp, - false); - /* must explicitly free the list entry */ - pfree(pending); - /* prev does not change */ - } - } - } - - /* - * smgrGetPendingDeletes() -- Get a list of relations to be deleted. - * - * The return value is the number of relations scheduled for termination. - * *ptr is set to point to a freshly-palloc'd array of RelFileForks. - * If there are no relations to be deleted, *ptr is set to NULL. - * - * If haveNonTemp isn't NULL, the bool it points to gets set to true if - * there is any non-temp table pending to be deleted; false if not. - * - * Note that the list does not include anything scheduled for termination - * by upper-level transactions. 
- */ - int - smgrGetPendingDeletes(bool forCommit, RelFileFork **ptr, bool *haveNonTemp) - { - int nestLevel = GetCurrentTransactionNestLevel(); - int nrels; - RelFileFork *rptr; - PendingRelDelete *pending; - - nrels = 0; - if (haveNonTemp) - *haveNonTemp = false; - for (pending = pendingDeletes; pending != NULL; pending = pending->next) - { - if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit) - nrels++; - } - if (nrels == 0) - { - *ptr = NULL; - return 0; - } - rptr = (RelFileFork *) palloc(nrels * sizeof(RelFileFork)); - *ptr = rptr; - for (pending = pendingDeletes; pending != NULL; pending = pending->next) - { - if (pending->nestLevel >= nestLevel && pending->atCommit == forCommit) - { - rptr->rnode = pending->relnode; - rptr->forknum = pending->forknum; - rptr++; - } - if (haveNonTemp && !pending->isTemp) - *haveNonTemp = true; - } - return nrels; - } - - /* - * AtSubCommit_smgr() --- Take care of subtransaction commit. - * - * Reassign all items in the pending-deletes list to the parent transaction. - */ - void - AtSubCommit_smgr(void) - { - int nestLevel = GetCurrentTransactionNestLevel(); - PendingRelDelete *pending; - - for (pending = pendingDeletes; pending != NULL; pending = pending->next) - { - if (pending->nestLevel >= nestLevel) - pending->nestLevel = nestLevel - 1; - } - } - - /* - * AtSubAbort_smgr() --- Take care of subtransaction abort. - * - * Delete created relations and forget about deleted relations. - * We can execute these operations immediately because we know this - * subtransaction will not commit. - */ - void - AtSubAbort_smgr(void) - { - smgrDoPendingDeletes(false); - } - - /* - * smgrcommit() -- Prepare to commit changes made during the current - * transaction. - * - * This is called before we actually commit. - */ - void - smgrcommit(void) - { - int i; - - for (i = 0; i < NSmgr; i++) - { - if (smgrsw[i].smgr_commit) - (*(smgrsw[i].smgr_commit)) (); - } - } - - /* - * smgrabort() -- Clean up after transaction abort. 
- */ - void - smgrabort(void) - { - int i; - - for (i = 0; i < NSmgr; i++) - { - if (smgrsw[i].smgr_abort) - (*(smgrsw[i].smgr_abort)) (); - } - } - - /* * smgrpreckpt() -- Prepare for checkpoint. */ void --- 474,479 ---- *************** *** 852,931 **** smgrpostckpt(void) } } - - void - smgr_redo(XLogRecPtr lsn, XLogRecord *record) - { - uint8 info = record->xl_info & ~XLR_INFO_MASK; - - if (info == XLOG_SMGR_CREATE) - { - xl_smgr_create *xlrec = (xl_smgr_create *) XLogRecGetData(record); - SMgrRelation reln; - - reln = smgropen(xlrec->rnode); - smgrcreate(reln, xlrec->forknum, false, true); - } - else if (info == XLOG_SMGR_TRUNCATE) - { - xl_smgr_truncate *xlrec = (xl_smgr_truncate *) XLogRecGetData(record); - SMgrRelation reln; - - reln = smgropen(xlrec->rnode); - - /* - * Forcibly create relation if it doesn't exist (which suggests that - * it was dropped somewhere later in the WAL sequence). As in - * XLogOpenRelation, we prefer to recreate the rel and replay the log - * as best we can until the drop is seen. - */ - smgrcreate(reln, xlrec->forknum, false, true); - - /* Can't use smgrtruncate because it would try to xlog */ - - /* - * First, force bufmgr to drop any buffers it has for the to-be- - * truncated blocks. We must do this, else subsequent XLogReadBuffer - * operations will not re-extend the file properly. 
- */ - DropRelFileNodeBuffers(xlrec->rnode, xlrec->forknum, false, - xlrec->blkno); - - /* Do the truncation */ - (*(smgrsw[reln->smgr_which].smgr_truncate)) (reln, - xlrec->forknum, - xlrec->blkno, - false); - - /* Also tell xlogutils.c about it */ - XLogTruncateRelation(xlrec->rnode, xlrec->forknum, xlrec->blkno); - } - else - elog(PANIC, "smgr_redo: unknown op code %u", info); - } - - void - smgr_desc(StringInfo buf, uint8 xl_info, char *rec) - { - uint8 info = xl_info & ~XLR_INFO_MASK; - - if (info == XLOG_SMGR_CREATE) - { - xl_smgr_create *xlrec = (xl_smgr_create *) rec; - char *path = relpath(xlrec->rnode, xlrec->forknum); - - appendStringInfo(buf, "file create: %s", path); - pfree(path); - } - else if (info == XLOG_SMGR_TRUNCATE) - { - xl_smgr_truncate *xlrec = (xl_smgr_truncate *) rec; - char *path = relpath(xlrec->rnode, xlrec->forknum); - - appendStringInfo(buf, "file truncate: %s to %u blocks", path, - xlrec->blkno); - pfree(path); - } - else - appendStringInfo(buf, "UNKNOWN"); - } --- 518,520 ---- *** src/include/access/rmgr.h --- src/include/access/rmgr.h *************** *** 23,29 **** typedef uint8 RmgrId; #define RM_DBASE_ID 4 #define RM_TBLSPC_ID 5 #define RM_MULTIXACT_ID 6 - #define RM_FREESPACE_ID 7 #define RM_HEAP2_ID 9 #define RM_HEAP_ID 10 #define RM_BTREE_ID 11 --- 23,28 ---- *** src/include/access/xact.h --- src/include/access/xact.h *************** *** 90,97 **** typedef struct xl_xact_commit TimestampTz xact_time; /* time of commit */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ ! /* Array of RelFileFork(s) to drop at commit */ ! RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_commit; --- 90,97 ---- TimestampTz xact_time; /* time of commit */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ ! /* Array of RelFileNode(s) to drop at commit */ ! 
RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_commit; *************** *** 102,109 **** typedef struct xl_xact_abort TimestampTz xact_time; /* time of abort */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ ! /* Array of RelFileFork(s) to drop at abort */ ! RelFileFork xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_abort; --- 102,109 ---- TimestampTz xact_time; /* time of abort */ int nrels; /* number of RelFileForks */ int nsubxacts; /* number of subtransaction XIDs */ ! /* Array of RelFileNode(s) to drop at abort */ ! RelFileNode xnodes[1]; /* VARIABLE LENGTH ARRAY */ /* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */ } xl_xact_abort; *** /dev/null --- src/include/catalog/storage.h *************** *** 0 **** --- 1,32 ---- + /*------------------------------------------------------------------------- + * + * heap.h + * prototypes for functions in backend/catalog/heap.c + * + * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #ifndef STORAGE_H + #define STORAGE_H + + #include "storage/block.h" + #include "storage/relfilenode.h" + #include "utils/rel.h" + + extern void RelationCreateStorage(RelFileNode rnode, bool istemp); + extern void RelationDropStorage(Relation rel); + extern void RelationTruncate(Relation rel, BlockNumber nblocks); + + extern void smgrDoPendingDeletes(bool isCommit); + extern int smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr, + bool *haveNonTemp); + extern void AtSubCommit_smgr(void); + extern void AtSubAbort_smgr(void); + extern void PostPrepare_smgr(void); + + #endif /* STORAGE_H */ *** src/include/storage/bufmgr.h --- src/include/storage/bufmgr.h *************** *** 
176,182 **** extern void PrintBufferLeakWarning(Buffer buffer); extern void CheckPointBuffers(int flags); extern BlockNumber BufferGetBlockNumber(Buffer buffer); extern BlockNumber RelationGetNumberOfBlocks(Relation relation); - extern void RelationTruncate(Relation rel, BlockNumber nblocks); extern void FlushRelationBuffers(Relation rel); extern void FlushDatabaseBuffers(Oid dbid); extern void DropRelFileNodeBuffers(RelFileNode rnode, ForkNumber forkNum, --- 176,181 ---- *** src/include/storage/freespace.h --- src/include/storage/freespace.h *************** *** 33,40 **** extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk, extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks); extern void FreeSpaceMapVacuum(Relation rel); - /* WAL prototypes */ - extern void fsm_desc(StringInfo buf, uint8 xl_info, char *rec); - extern void fsm_redo(XLogRecPtr lsn, XLogRecord *record); - #endif /* FREESPACE_H */ --- 33,36 ---- *** src/include/storage/indexfsm.h --- src/include/storage/indexfsm.h *************** *** 20,26 **** extern BlockNumber GetFreeIndexPage(Relation rel); extern void RecordFreeIndexPage(Relation rel, BlockNumber page); extern void RecordUsedIndexPage(Relation rel, BlockNumber page); - extern void InitIndexFreeSpaceMap(Relation rel); extern void IndexFreeSpaceMapTruncate(Relation rel, BlockNumber nblocks); extern void IndexFreeSpaceMapVacuum(Relation rel); --- 20,25 ---- *** src/include/storage/relfilenode.h --- src/include/storage/relfilenode.h *************** *** 78,90 **** typedef struct RelFileNode (node1).dbNode == (node2).dbNode && \ (node1).spcNode == (node2).spcNode) - /* - * RelFileFork identifies a particular fork of a relation. 
- */ - typedef struct RelFileFork - { - RelFileNode rnode; - ForkNumber forknum; - } RelFileFork; - #endif /* RELFILENODE_H */ --- 78,81 ---- *** src/include/storage/smgr.h --- src/include/storage/smgr.h *************** *** 65,74 **** extern void smgrsetowner(SMgrRelation *owner, SMgrRelation reln); extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNode rnode); ! extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, ! bool isTemp, bool isRedo); ! extern void smgrscheduleunlink(SMgrRelation reln, ForkNumber forknum, ! bool isTemp); extern void smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, --- 65,71 ---- extern void smgrclose(SMgrRelation reln); extern void smgrcloseall(void); extern void smgrclosenode(RelFileNode rnode); ! extern void smgrcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo); extern void smgrdounlink(SMgrRelation reln, ForkNumber forknum, bool isTemp, bool isRedo); extern void smgrextend(SMgrRelation reln, ForkNumber forknum, *************** *** 81,94 **** extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum); extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks, bool isTemp); extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum); - extern void smgrDoPendingDeletes(bool isCommit); - extern int smgrGetPendingDeletes(bool forCommit, RelFileFork **ptr, - bool *haveNonTemp); - extern void AtSubCommit_smgr(void); - extern void AtSubAbort_smgr(void); - extern void PostPrepare_smgr(void); - extern void smgrcommit(void); - extern void smgrabort(void); extern void smgrpreckpt(void); extern void smgrsync(void); extern void smgrpostckpt(void); --- 78,83 ----
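For readers following the smgr.c hunks above: the pending-delete bookkeeping that the patch takes out of smgr.c (and declares in the new catalog/storage.h) can be sketched roughly as below. This is a simplified, standalone model, not the PostgreSQL code. Relations are plain ints, the names pending_add, pending_do_deletes and pending_subcommit are made up for the illustration, and "unlinking" a file just records the relation in an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAX_UNLINKED 16

/* Hypothetical stand-in for PendingRelDelete; a relation is just an int. */
typedef struct PendingDelete
{
    int         rel;            /* stand-in for RelFileNode */
    bool        at_commit;      /* true = delete at commit; false = at abort */
    int         nest_level;     /* xact nesting level of the request */
    struct PendingDelete *next;
} PendingDelete;

static PendingDelete *pending_head = NULL;
static int  unlinked[MAX_UNLINKED];     /* log of "physically deleted" rels */
static int  n_unlinked = 0;

/* Remember a file to delete at commit (a drop) or at abort (a create). */
static void
pending_add(int rel, bool at_commit, int nest_level)
{
    PendingDelete *e = malloc(sizeof(PendingDelete));

    assert(e != NULL);
    e->rel = rel;
    e->at_commit = at_commit;
    e->nest_level = nest_level;
    e->next = pending_head;
    pending_head = e;
}

/* End of the (sub)transaction at nest_level: act on its entries only. */
static void
pending_do_deletes(bool is_commit, int nest_level)
{
    PendingDelete *prev = NULL;
    PendingDelete *cur;
    PendingDelete *next;

    for (cur = pending_head; cur != NULL; cur = next)
    {
        next = cur->next;
        if (cur->nest_level < nest_level)
        {
            prev = cur;         /* outer-level entry, not processed yet */
            continue;
        }
        /* unlink the list entry first */
        if (prev)
            prev->next = next;
        else
            pending_head = next;
        /* "delete" the file only if the entry matches the outcome */
        if (cur->at_commit == is_commit && n_unlinked < MAX_UNLINKED)
            unlinked[n_unlinked++] = cur->rel;
        free(cur);
    }
}

/* Subtransaction commit: hand the entries over to the parent level. */
static void
pending_subcommit(int nest_level)
{
    PendingDelete *cur;

    for (cur = pending_head; cur != NULL; cur = cur->next)
    {
        if (cur->nest_level >= nest_level)
            cur->nest_level = nest_level - 1;
    }
}
```

In this model a file created inside a subtransaction is registered with at_commit = false, so aborting that subtransaction (pending_do_deletes(false, level)) removes it immediately while entries of outer levels stay queued. A drop is registered with at_commit = true and only takes effect once the top-level transaction commits, after pending_subcommit has moved the entry up to the parent level.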
I committed the changes to FSM truncation yesterday; that helps with truncating the visibility map as well. Attached is an updated visibility map patch.

There are two open issues:

1. The bits in the visibility map are set in the first phase of lazy vacuum. That works, but it means that after a delete or update, it takes two vacuums until the bit in the visibility map is set. The first vacuum removes the dead tuple, and only the second sees that there are no dead tuples and sets the bit.

2. The output of VACUUM VERBOSE should be modified to say how many pages were actually scanned. What other information is relevant, or is no longer relevant, with partial vacuums?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

*** src/backend/access/heap/Makefile --- src/backend/access/heap/Makefile *************** *** 12,17 **** subdir = src/backend/access/heap top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! 
OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o include $(top_srcdir)/src/backend/common.mk *** src/backend/access/heap/heapam.c --- src/backend/access/heap/heapam.c *************** *** 47,52 **** --- 47,53 ---- #include "access/transam.h" #include "access/tuptoaster.h" #include "access/valid.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" *************** *** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page) --- 196,202 ---- int ntup; OffsetNumber lineoff; ItemId lpp; + bool all_visible; Assert(page < scan->rs_nblocks); *************** *** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page) lines = PageGetMaxOffsetNumber(dp); ntup = 0; for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff); lineoff <= lines; lineoff++, lpp++) { if (ItemIdIsNormal(lpp)) { - HeapTupleData loctup; bool valid; ! loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp); ! loctup.t_len = ItemIdGetLength(lpp); ! ItemPointerSet(&(loctup.t_self), page, lineoff); ! valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); if (valid) scan->rs_vistuples[ntup++] = lineoff; } --- 235,266 ---- lines = PageGetMaxOffsetNumber(dp); ntup = 0; + /* + * If the all-visible flag indicates that all tuples on the page are + * visible to everyone, we can skip the per-tuple visibility tests. + */ + all_visible = PageIsAllVisible(dp); + for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff); lineoff <= lines; lineoff++, lpp++) { if (ItemIdIsNormal(lpp)) { bool valid; ! if (all_visible) ! valid = true; ! else ! { ! HeapTupleData loctup; ! ! loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp); ! loctup.t_len = ItemIdGetLength(lpp); ! ItemPointerSet(&(loctup.t_self), page, lineoff); ! valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); ! 
} if (valid) scan->rs_vistuples[ntup++] = lineoff; } *************** *** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1874,1880 ---- TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; Buffer buffer; + bool all_visible_cleared = false; if (relation->rd_rel->relhasoids) { *************** *** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1935,1946 ---- RelationPutHeapTuple(relation, buffer, heaptup); + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + } + /* * XXX Should we set PageSetPrunable on this page ? * *************** *** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1964,1970 ---- Page page = BufferGetPage(buffer); uint8 info = XLOG_HEAP_INSERT; + xlrec.all_visible_cleared = all_visible_cleared; xlrec.target.node = relation->rd_node; xlrec.target.tid = heaptup->t_self; rdata[0].data = (char *) &xlrec; *************** *** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 2016,2026 ---- UnlockReleaseBuffer(buffer); + /* Clear the bit in the visibility map if necessary */ + if (all_visible_cleared) + visibilitymap_clear(relation, + ItemPointerGetBlockNumber(&(heaptup->t_self))); + /* * If tuple is cachable, mark it for invalidation from the caches in case * we abort. 
Note it is OK to do this after releasing the buffer, because *************** *** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid, --- 2097,2103 ---- Buffer buffer; bool have_tuple_lock = false; bool iscombo; + bool all_visible_cleared = false; Assert(ItemPointerIsValid(tid)); *************** *** 2216,2221 **** l1: --- 2244,2255 ---- */ PageSetPrunable(page, xid); + if (PageIsAllVisible(page)) + { + all_visible_cleared = true; + PageClearAllVisible(page); + } + /* store transaction information of xact deleting the tuple */ tp.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID | *************** *** 2237,2242 **** l1: --- 2271,2277 ---- XLogRecPtr recptr; XLogRecData rdata[2]; + xlrec.all_visible_cleared = all_visible_cleared; xlrec.target.node = relation->rd_node; xlrec.target.tid = tp.t_self; rdata[0].data = (char *) &xlrec; *************** *** 2281,2286 **** l1: --- 2316,2325 ---- */ CacheInvalidateHeapTuple(relation, &tp); + /* Clear the bit in the visibility map if necessary */ + if (all_visible_cleared) + visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); + /* Now we can release the buffer */ ReleaseBuffer(buffer); *************** *** 2388,2393 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, --- 2427,2434 ---- bool have_tuple_lock = false; bool iscombo; bool use_hot_update = false; + bool all_visible_cleared = false; + bool all_visible_cleared_new = false; Assert(ItemPointerIsValid(otid)); *************** *** 2763,2768 **** l2: --- 2804,2815 ---- MarkBufferDirty(newbuf); MarkBufferDirty(buffer); + /* + * Note: we mustn't clear PD_ALL_VISIBLE flags before writing + * the WAL record, because log_heap_update looks at those flags and sets + * the corresponding flags in the WAL record. 
+ */ + /* XLOG stuff */ if (!relation->rd_istemp) { *************** *** 2778,2783 **** l2: --- 2825,2842 ---- PageSetTLI(BufferGetPage(buffer), ThisTimeLineID); } + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + } + if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf))) + { + all_visible_cleared_new = true; + PageClearAllVisible(BufferGetPage(newbuf)); + } + END_CRIT_SECTION(); if (newbuf != buffer) *************** *** 2791,2796 **** l2: --- 2850,2861 ---- */ CacheInvalidateHeapTuple(relation, &oldtup); + /* Clear bits in visibility map */ + if (all_visible_cleared) + visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); + if (all_visible_cleared_new) + visibilitymap_clear(relation, BufferGetBlockNumber(newbuf)); + /* Now we can release the buffer(s) */ if (newbuf != buffer) ReleaseBuffer(newbuf); *************** *** 3412,3417 **** l3: --- 3477,3487 ---- LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); /* + * Don't update the visibility map here. Locking a tuple doesn't + * change visibility info. + */ + + /* * Now that we have successfully marked the tuple as locked, we can * release the lmgr tuple lock, if we had it. */ *************** *** 3916,3922 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from, --- 3986,3994 ---- xlrec.target.node = reln->rd_node; xlrec.target.tid = from; + xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf)); xlrec.newtid = newtup->t_self; + xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf)); rdata[0].data = (char *) &xlrec; rdata[0].len = SizeOfHeapUpdate; *************** *** 4186,4191 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record) --- 4258,4274 ---- ItemId lp = NULL; HeapTupleHeader htup; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. 
+ */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, ItemPointerGetBlockNumber(&(xlrec->target.tid))); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_1) return; *************** *** 4223,4228 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record) --- 4306,4314 ---- /* Mark the page as a candidate for pruning */ PageSetPrunable(page, record->xl_xid); + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + /* Make sure there is no forward chain link in t_ctid */ htup->t_ctid = xlrec->target.tid; PageSetLSN(page, lsn); *************** *** 4249,4254 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record) --- 4335,4351 ---- Size freespace; BlockNumber blkno; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. + */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid)); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_1) return; *************** *** 4307,4312 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record) --- 4404,4413 ---- PageSetLSN(page, lsn); PageSetTLI(page, ThisTimeLineID); + + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + MarkBufferDirty(buffer); UnlockReleaseBuffer(buffer); *************** *** 4347,4352 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4448,4464 ---- uint32 newlen; Size freespace; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. 
+ */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->target.tid)); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_1) { if (samepage) *************** *** 4411,4416 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4523,4531 ---- /* Mark the page as a candidate for pruning */ PageSetPrunable(page, record->xl_xid); + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + /* * this test is ugly, but necessary to avoid thinking that insert change * is already applied *************** *** 4426,4431 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4541,4557 ---- newt:; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. + */ + if (xlrec->new_all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid)); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_2) return; *************** *** 4504,4509 **** newsame:; --- 4630,4638 ---- if (offnum == InvalidOffsetNumber) elog(PANIC, "heap_update_redo: failed to add tuple"); + if (xlrec->new_all_visible_cleared) + PageClearAllVisible(page); + freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ PageSetLSN(page, lsn); *** /dev/null --- src/backend/access/heap/visibilitymap.c *************** *** 0 **** --- 1,390 ---- + /*------------------------------------------------------------------------- + * + * visibilitymap.c + * bitmap for tracking visibility of heap tuples + * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + * NOTES + * + * The visibility map is a bitmap with one bit 
per heap page. A set bit means + * that all tuples on the page are visible to all transactions, and the page + * therefore doesn't need to be vacuumed. + * + * The map is conservative in the sense that we make sure that whenever a bit + * is set, we know the condition is true, but if a bit is not set, it might + * or might not be. + * + * There's no explicit WAL logging in the functions in this file. The callers + * must make sure that whenever a bit is cleared, the bit is cleared on WAL + * replay of the updating operation as well. Setting bits during recovery + * isn't necessary for correctness. + * + * LOCKING + * + * In heapam.c, whenever a page is modified so that not all tuples on the + * page are visible to everyone anymore, the corresponding bit in the + * visibility map is cleared. The bit in the visibility map is cleared + * after releasing the lock on the heap page, to avoid holding the lock + * over possible I/O to read in the visibility map page. + * + * To set a bit, you need to hold a lock on the heap page. That prevents + * the race condition where VACUUM sees that all tuples on the page are + * visible to everyone, but another backend modifies the page before VACUUM + * sets the bit in the visibility map. + * + * When a bit is set, we need to update the LSN of the page to make sure that + * the visibility map update doesn't get written to disk before the WAL record + * of the changes that made it possible to set the bit is flushed. But when a + * bit is cleared, we don't have to do that because it's always OK to clear + * a bit in the map from a correctness point of view. + * + * TODO + * + * It would be nice to use the visibility map to skip visibility checks in + * index scans. + * + * Currently, the visibility map is not 100% correct all the time. + * During updates, the bit in the visibility map is cleared after releasing + * the lock on the heap page. 
During the window after releasing the lock + * and clearing the bit in the visibility map, the bit in the visibility map + * is set, but the new insertion or deletion is not yet visible to other + * backends. + * + * That might actually be OK for the index scans, though. The newly inserted + * tuple wouldn't have an index pointer yet, so all tuples reachable from an + * index would still be visible to all other backends, and deletions wouldn't + * be visible to other backends yet. + * + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "access/visibilitymap.h" + #include "storage/bufmgr.h" + #include "storage/bufpage.h" + #include "storage/lmgr.h" + #include "storage/smgr.h" + + /*#define TRACE_VISIBILITYMAP */ + + /* Number of bits allocated for each heap block. */ + #define BITS_PER_HEAPBLOCK 1 + + /* Number of heap blocks we can represent in one byte. */ + #define HEAPBLOCKS_PER_BYTE 8 + + /* Number of heap blocks we can represent in one visibility map page */ + #define HEAPBLOCKS_PER_PAGE ((BLCKSZ - SizeOfPageHeaderData) * HEAPBLOCKS_PER_BYTE ) + + /* Mapping from heap block number to the right bit in the visibility map */ + #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE) + #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE) + #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE) + + static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend); + static void vm_extend(Relation rel, BlockNumber nvmblocks, bool createstorage); + + /* + * Read a visibility map page. + * + * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is + * true, the visibility map file is extended. 
+ */ + static Buffer + vm_readbuf(Relation rel, BlockNumber blkno, bool extend) + { + Buffer buf; + + RelationOpenSmgr(rel); + + if (rel->rd_vm_nblocks_cache == InvalidBlockNumber || + rel->rd_vm_nblocks_cache <= blkno) + { + if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + vm_extend(rel, blkno + 1, true); + else + rel->rd_vm_nblocks_cache = smgrnblocks(rel->rd_smgr, + VISIBILITYMAP_FORKNUM); + } + + if (blkno >= rel->rd_vm_nblocks_cache) + { + if (extend) + vm_extend(rel, blkno + 1, false); + else + return InvalidBuffer; + } + + /* + * Use ZERO_ON_ERROR mode, and initialize the page if necessary. XXX The + * information is not accurate anyway, so it's better to clear corrupt + * pages than error out. Since the FSM changes are not WAL-logged, the + * so-called torn page problem on crash can lead to pages with corrupt + * headers, for example. + */ + buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno, + RBM_ZERO_ON_ERROR, NULL); + if (PageIsNew(BufferGetPage(buf))) + PageInit(BufferGetPage(buf), BLCKSZ, 0); + return buf; + } + + /* + * Ensure that the visibility map fork is at least n_vmblocks long, extending + * it if necessary with empty pages. And by empty, I mean pages filled + * with zeros, meaning there's no free space. If createstorage is true, + * the physical file might need to be created first. + */ + static void + vm_extend(Relation rel, BlockNumber n_vmblocks, bool createstorage) + { + BlockNumber n_vmblocks_now; + Page pg; + + pg = (Page) palloc(BLCKSZ); + PageInit(pg, BLCKSZ, 0); + + /* + * We use the relation extension lock to lock out other backends + * trying to extend the visibility map at the same time. It also locks out + * extension of the main fork, unnecessarily, but extending the + * visibility map happens seldom enough that it doesn't seem worthwhile to + * have a separate lock tag type for it. + * + * Note that another backend might have extended or created the + * relation before we get the lock. 
+ */ + LockRelationForExtension(rel, ExclusiveLock); + + /* Create the file first if it doesn't exist */ + if (createstorage && !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + { + smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false); + n_vmblocks_now = 0; + } + else + n_vmblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM); + + while (n_vmblocks_now < n_vmblocks) + { + smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, n_vmblocks_now, + (char *) pg, rel->rd_istemp); + n_vmblocks_now++; + } + + UnlockRelationForExtension(rel, ExclusiveLock); + + pfree(pg); + + /* update the cache with the up-to-date size */ + rel->rd_vm_nblocks_cache = n_vmblocks_now; + } + + void + visibilitymap_truncate(Relation rel, BlockNumber nheapblocks) + { + BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks); + uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks); + uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks); + BlockNumber newnblocks; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks); + #endif + + /* + * If no visibility map has been created yet for this relation, there's + * nothing to truncate. + */ + if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + return; + + /* Truncate away pages that are no longer needed */ + if (truncByte == 0 && truncBit == 0) + newnblocks = truncBlock; + else + { + Buffer mapBuffer; + Page page; + char *mappage; + int len; + + newnblocks = truncBlock + 1; + + /* + * Clear all bits in the last map page, that represent the truncated + * heap blocks. This is not only tidy, but also necessary because + * we don't clear the bits on extension. + */ + mapBuffer = vm_readbuf(rel, truncBlock, false); + if (BufferIsValid(mapBuffer)) + { + page = BufferGetPage(mapBuffer); + mappage = PageGetContents(page); + + LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* + * Clear out the unwanted bytes. 
+ */ + len = HEAPBLOCKS_PER_PAGE/HEAPBLOCKS_PER_BYTE - (truncByte + 1); + MemSet(&mappage[truncByte + 1], 0, len); + + /* + * Mask out the unwanted bits of the last remaining byte + * + * ((1 << 0) - 1) = 00000000 + * ((1 << 1) - 1) = 00000001 + * ... + * ((1 << 6) - 1) = 00111111 + * ((1 << 7) - 1) = 01111111 + */ + mappage[truncByte] &= (1 << truncBit) - 1; + + /* + * This needs to be WAL-logged. Although the now unused shouldn't + * be accessed anymore, they better be zero if we extend again. + */ + + MarkBufferDirty(mapBuffer); + UnlockReleaseBuffer(mapBuffer); + } + } + + if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) > newnblocks) + smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks, + rel->rd_istemp); + } + + /* + * Marks that all tuples on a heap page are visible to all. + * + * recptr is the LSN of the heap page. The LSN of the visibility map + * page is advanced to that, to make sure that the visibility map doesn't + * get flushed to disk before update to the heap page that made all tuples + * visible. + * + * *buf is a buffer previously returned by visibilitymap_test(). This is + * an opportunistic function; if *buf doesn't contain the bit for heapBlk, + * we do nothing. We don't want to do any I/O here, because the caller is + * holding a cleanup lock on the heap page. 
+ */ + void + visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr, + Buffer *buf) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + Page page; + char *mappage; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock) + return; + + page = BufferGetPage(*buf); + mappage = PageGetContents(page); + LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE); + + if (!(mappage[mapByte] & (1 << mapBit))) + { + mappage[mapByte] |= (1 << mapBit); + + if (XLByteLT(PageGetLSN(page), recptr)) + PageSetLSN(page, recptr); + PageSetTLI(page, ThisTimeLineID); + MarkBufferDirty(*buf); + } + + LockBuffer(*buf, BUFFER_LOCK_UNLOCK); + } + + /* + * Are all tuples on heap page visible to all? + * + * The page containing the bit for the heap block is (kept) pinned, + * and *buf is set to that buffer. If *buf is valid on entry, it should + * be a buffer previously returned by this function, for the same relation, + * and unless the new heap block is on the same page, it is released. On the + * first call, InvalidBuffer should be passed, and when the caller doesn't + * want to test any more pages, it should release *buf if it's valid. 
+ */ + bool + visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + bool val; + char *mappage; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + if (BufferIsValid(*buf)) + { + if (BufferGetBlockNumber(*buf) == heapBlk) + return *buf; + else + ReleaseBuffer(*buf); + } + + *buf = vm_readbuf(rel, mapBlock, true); + if (!BufferIsValid(*buf)) + return false; + + mappage = PageGetContents(BufferGetPage(*buf)); + + /* + * We don't need to lock the page, as we're only looking at a single bit. + */ + val = (mappage[mapByte] & (1 << mapBit)) ? true : false; + + return val; + } + + /* + * Mark that not all tuples are visible to all. + */ + void + visibilitymap_clear(Relation rel, BlockNumber heapBlk) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + int mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + uint8 mask = 1 << mapBit; + Buffer mapBuffer; + char *mappage; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + mapBuffer = vm_readbuf(rel, mapBlock, false); + if (!BufferIsValid(mapBuffer)) + return; /* nothing to do */ + + LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE); + mappage = PageGetContents(BufferGetPage(mapBuffer)); + + if (mappage[mapByte] & mask) + { + mappage[mapByte] &= ~mask; + + MarkBufferDirty(mapBuffer); + } + + UnlockReleaseBuffer(mapBuffer); + } *** src/backend/access/transam/xlogutils.c --- src/backend/access/transam/xlogutils.c *************** *** 377,382 **** CreateFakeRelcacheEntry(RelFileNode rnode) --- 377,383 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; rel->rd_smgr = NULL; return rel; *** src/backend/catalog/catalog.c 
--- src/backend/catalog/catalog.c *************** *** 54,60 **** */ const char *forkNames[] = { "main", /* MAIN_FORKNUM */ ! "fsm" /* FSM_FORKNUM */ }; /* --- 54,61 ---- */ const char *forkNames[] = { "main", /* MAIN_FORKNUM */ ! "fsm", /* FSM_FORKNUM */ ! "vm" /* VISIBILITYMAP_FORKNUM */ }; /* *** src/backend/catalog/heap.c --- src/backend/catalog/heap.c *************** *** 33,38 **** --- 33,39 ---- #include "access/heapam.h" #include "access/sysattr.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "catalog/catalog.h" #include "catalog/dependency.h" *** src/backend/catalog/storage.c --- src/backend/catalog/storage.c *************** *** 19,24 **** --- 19,25 ---- #include "postgres.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" *************** *** 175,180 **** void --- 176,182 ---- RelationTruncate(Relation rel, BlockNumber nblocks) { bool fsm; + bool vm; /* Open it at the smgr level if not already done */ RelationOpenSmgr(rel); *************** *** 187,192 **** RelationTruncate(Relation rel, BlockNumber nblocks) --- 189,199 ---- if (fsm) FreeSpaceMapTruncateRel(rel, nblocks); + /* Truncate the visibility map too if it exists. */ + vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM); + if (vm) + visibilitymap_truncate(rel, nblocks); + /* * We WAL-log the truncation before actually truncating, which * means trouble if the truncation fails. If we then crash, the WAL *************** *** 222,228 **** RelationTruncate(Relation rel, BlockNumber nblocks) * left with a truncated heap, but the FSM would still contain * entries for the non-existent heap pages. */ ! if (fsm) XLogFlush(lsn); } --- 229,235 ---- * left with a truncated heap, but the FSM would still contain * entries for the non-existent heap pages. */ ! 
if (fsm || vm) XLogFlush(lsn); } *** src/backend/commands/vacuum.c --- src/backend/commands/vacuum.c *************** *** 26,31 **** --- 26,32 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlog.h" #include "catalog/namespace.h" *************** *** 2902,2907 **** move_chain_tuple(Relation rel, --- 2903,2914 ---- Size tuple_len = old_tup->t_len; /* + * Clear the bits in the visibility map. + */ + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + + /* * make a modifiable copy of the source tuple. */ heap_copytuple_with_tuple(old_tup, &newtup); *************** *** 3005,3010 **** move_chain_tuple(Relation rel, --- 3012,3021 ---- END_CRIT_SECTION(); + PageClearAllVisible(BufferGetPage(old_buf)); + if (dst_buf != old_buf) + PageClearAllVisible(BufferGetPage(dst_buf)); + LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK); if (dst_buf != old_buf) LockBuffer(old_buf, BUFFER_LOCK_UNLOCK); *************** *** 3107,3112 **** move_plain_tuple(Relation rel, --- 3118,3140 ---- END_CRIT_SECTION(); + /* + * Clear the visible-to-all hint bits on the page, and bits in the + * visibility map. Normally we'd release the locks on the heap pages + * before updating the visibility map, but doesn't really matter here + * because we're holding an AccessExclusiveLock on the relation anyway. 
+ */ + if (PageIsAllVisible(dst_page)) + { + PageClearAllVisible(dst_page); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + } + if (PageIsAllVisible(old_page)) + { + PageClearAllVisible(old_page); + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + } + dst_vacpage->free = PageGetFreeSpaceWithFillFactor(rel, dst_page); LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK); LockBuffer(old_buf, BUFFER_LOCK_UNLOCK); *** src/backend/commands/vacuumlazy.c --- src/backend/commands/vacuumlazy.c *************** *** 40,45 **** --- 40,46 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/vacuum.h" *************** *** 88,93 **** typedef struct LVRelStats --- 89,95 ---- int max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; + bool scanned_all; /* have we scanned all pages (this far) in the rel? */ } LVRelStats; *************** *** 102,108 **** static BufferAccessStrategy vac_strategy; /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, --- 104,110 ---- /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! 
Relation *Irel, int nindexes, bool scan_all); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, *************** *** 141,146 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, --- 143,149 ---- BlockNumber possibly_freeable; PGRUsage ru0; TimestampTz starttime = 0; + bool scan_all; pg_rusage_init(&ru0); *************** *** 166,173 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); --- 169,185 ---- vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); + /* Should we use the visibility map or scan all pages? */ + if (vacstmt->freeze_min_age != -1) + scan_all = true; + else + scan_all = false; + + /* initialize this variable */ + vacrelstats->scanned_all = true; + /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); *************** *** 189,195 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, /* Update statistics in pg_class */ vac_update_relstats(onerel, vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, --- 201,208 ---- /* Update statistics in pg_class */ vac_update_relstats(onerel, vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, ! vacrelstats->scanned_all ? 
FreezeLimit : InvalidOid); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, *************** *** 230,236 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes) { BlockNumber nblocks, blkno; --- 243,249 ---- */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes, bool scan_all) { BlockNumber nblocks, blkno; *************** *** 245,250 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 258,264 ---- IndexBulkDeleteResult **indstats; int i; PGRUsage ru0; + Buffer vmbuffer = InvalidBuffer; pg_rusage_init(&ru0); *************** *** 278,283 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 292,315 ---- OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; + bool all_visible_according_to_vm; + bool all_visible; + + /* + * If all tuples on page are visible to all, there's no + * need to visit that page. + * + * Note that we test the visibility map even if we're scanning all + * pages, to pin the visibility map page. We might set the bit there, + * and we don't want to do the I/O while we're holding the heap page + * locked. 
+ */ + all_visible_according_to_vm = visibilitymap_test(onerel, blkno, &vmbuffer); + if (!scan_all && all_visible_according_to_vm) + { + vacrelstats->scanned_all = false; + continue; + } vacuum_delay_point(); *************** *** 354,359 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 386,398 ---- { empty_pages++; freespace = PageGetHeapFreeSpace(page); + + PageSetAllVisible(page); + /* Update the visibility map */ + if (!all_visible_according_to_vm) + visibilitymap_set(onerel, blkno, PageGetLSN(page), + &vmbuffer); + UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); continue; *************** *** 371,376 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 410,416 ---- * Now scan the page to collect vacuumable items and check for tuples * requiring freezing. */ + all_visible = true; nfrozen = 0; hastup = false; prev_dead_count = vacrelstats->num_dead_tuples; *************** *** 408,413 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 448,454 ---- if (ItemIdIsDead(itemid)) { lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + all_visible = false; continue; } *************** *** 442,447 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 483,489 ---- nkeep += 1; else tupgone = true; /* we can delete the tuple */ + all_visible = false; break; case HEAPTUPLE_LIVE: /* Tuple is good --- but let's do some validity checks */ *************** *** 449,454 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 491,525 ---- !OidIsValid(HeapTupleGetOid(&tuple))) elog(WARNING, "relation \"%s\" TID %u/%u: OID is invalid", relname, blkno, offnum); + + /* + * Definitely visible to all? Note that SetHintBits handles + * async commit correctly + */ + if (all_visible) + { + /* + * Is it visible to all transactions? It's important + * that we look at the hint bit here. 
Only if a hint + * bit is set, we can be sure that the tuple is indeed + * live, even if asynchronous_commit is true and we + * crash later + */ + if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED)) + { + all_visible = false; + break; + } + /* + * The inserter definitely committed. But is it + * old enough that everyone sees it as committed? + */ + if (!TransactionIdPrecedes(HeapTupleHeaderGetXmin(tuple.t_data), OldestXmin)) + { + all_visible = false; + break; + } + } break; case HEAPTUPLE_RECENTLY_DEAD: *************** *** 457,468 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 528,542 ---- * from relation. */ nkeep += 1; + all_visible = false; break; case HEAPTUPLE_INSERT_IN_PROGRESS: /* This is an expected case during concurrent vacuum */ + all_visible = false; break; case HEAPTUPLE_DELETE_IN_PROGRESS: /* This is an expected case during concurrent vacuum */ + all_visible = false; break; default: elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); *************** *** 525,530 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 599,621 ---- freespace = PageGetHeapFreeSpace(page); + /* Update the all-visible flag on the page */ + if (!PageIsAllVisible(page) && all_visible) + { + SetBufferCommitInfoNeedsSave(buf); + PageSetAllVisible(page); + } + else if (PageIsAllVisible(page) && !all_visible) + { + elog(WARNING, "all-visible flag was incorrectly set"); + SetBufferCommitInfoNeedsSave(buf); + PageClearAllVisible(page); + } + + /* Update the visibility map */ + if (!all_visible_according_to_vm && all_visible) + visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer); + /* Remember the location of the last page with nonremovable tuples */ if (hastup) vacrelstats->nonempty_pages = blkno + 1; *************** *** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 651,663 ---- vacrelstats->num_index_scans++; } + /* Release the pin on the visibility map page */ + if (BufferIsValid(vmbuffer)) + { + 
ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); *************** *** 623,628 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) --- 721,735 ---- LockBufferForCleanup(buf); tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats); + /* + * Before we let the page go, prune it. The primary reason is to + * update the visibility map in the common special case that we just + * vacuumed away the last tuple on the page that wasn't visible to + * everyone. + */ + vacrelstats->tuples_deleted += + heap_page_prune(onerel, buf, OldestXmin, false, false); + /* Now that we've compacted the page, record its available space */ page = BufferGetPage(buf); freespace = PageGetHeapFreeSpace(page); *** src/backend/storage/freespace/freespace.c --- src/backend/storage/freespace/freespace.c *************** *** 555,562 **** fsm_extend(Relation rel, BlockNumber n_fsmblocks, bool createstorage) * FSM happens seldom enough that it doesn't seem worthwhile to * have a separate lock tag type for it. * ! * Note that another backend might have extended the relation ! * before we get the lock. */ LockRelationForExtension(rel, ExclusiveLock); --- 555,562 ---- * FSM happens seldom enough that it doesn't seem worthwhile to * have a separate lock tag type for it. * ! * Note that another backend might have extended or created the ! * relation before we get the lock. 
*/ LockRelationForExtension(rel, ExclusiveLock); *** src/backend/storage/smgr/smgr.c --- src/backend/storage/smgr/smgr.c *************** *** 21,26 **** --- 21,27 ---- #include "catalog/catalog.h" #include "commands/tablespace.h" #include "storage/bufmgr.h" + #include "storage/freespace.h" #include "storage/ipc.h" #include "storage/smgr.h" #include "utils/hsearch.h" *** src/backend/utils/cache/relcache.c --- src/backend/utils/cache/relcache.c *************** *** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp) --- 305,311 ---- MemSet(relation, 0, sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1377,1382 **** formrdesc(const char *relationName, Oid relationReltype, --- 1378,1384 ---- relation = (Relation) palloc0(sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1665,1673 **** RelationReloadIndexInfo(Relation relation) heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! /* Must reset targblock and fsm_nblocks_cache in case rel was truncated */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); --- 1667,1676 ---- heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! 
/* Must reset targblock and fsm_nblocks_cache and vm_nblocks_cache in case rel was truncated */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); *************** *** 1751,1756 **** RelationClearRelation(Relation relation, bool rebuild) --- 1754,1760 ---- { relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks_cache = InvalidBlockNumber; + relation->rd_vm_nblocks_cache = InvalidBlockNumber; if (relation->rd_rel->relkind == RELKIND_INDEX) { relation->rd_isvalid = false; /* needs to be revalidated */ *************** *** 2346,2351 **** RelationBuildLocalRelation(const char *relname, --- 2350,2356 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ rel->rd_smgr = NULL; *************** *** 3603,3608 **** load_relcache_init_file(void) --- 3608,3614 ---- rel->rd_smgr = NULL; rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks_cache = InvalidBlockNumber; + rel->rd_vm_nblocks_cache = InvalidBlockNumber; if (rel->rd_isnailed) rel->rd_refcnt = 1; else *** src/include/access/heapam.h --- src/include/access/heapam.h *************** *** 153,158 **** extern void heap_page_prune_execute(Buffer buffer, --- 153,159 ---- OffsetNumber *nowunused, int nunused, bool redirect_move); extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets); + extern void heap_page_update_all_visible(Buffer buffer); /* in heap/syncscan.c */ extern void ss_report_location(Relation rel, BlockNumber location); *** src/include/access/htup.h --- src/include/access/htup.h *************** *** 601,606 **** typedef struct xl_heaptid --- 601,607 ---- typedef struct xl_heap_delete { xl_heaptid target; /* deleted tuple id */ + bool 
all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ } xl_heap_delete; #define SizeOfHeapDelete (offsetof(xl_heap_delete, target) + SizeOfHeapTid) *************** *** 626,641 **** typedef struct xl_heap_header typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! #define SizeOfHeapInsert (offsetof(xl_heap_insert, target) + SizeOfHeapTid) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; --- 627,645 ---- typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! #define SizeOfHeapInsert (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool)) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ + bool new_all_visible_cleared; /* same for the page of newtid */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; *** /dev/null --- src/include/access/visibilitymap.h *************** *** 0 **** --- 1,28 ---- + /*------------------------------------------------------------------------- + * + * visibilitymap.h + * visibility map interface + * + * + * Portions Copyright (c) 2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ 
+ #ifndef VISIBILITYMAP_H
+ #define VISIBILITYMAP_H
+ 
+ #include "utils/rel.h"
+ #include "storage/buf.h"
+ #include "storage/itemptr.h"
+ #include "access/xlogdefs.h"
+ 
+ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk,
+                               XLogRecPtr recptr, Buffer *vmbuf);
+ extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk);
+ extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
+ extern void visibilitymap_truncate(Relation rel, BlockNumber heapblk);
+ 
+ #endif   /* VISIBILITYMAP_H */
*** src/include/storage/bufpage.h
--- src/include/storage/bufpage.h
***************
*** 152,159 **** typedef PageHeaderData *PageHeader;
  #define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002  /* not enough free space for new
                                       * tuple? */
  
! #define PD_VALID_FLAG_BITS  0x0003  /* OR of all valid pd_flags bits */
  
  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
--- 152,161 ----
  #define PD_HAS_FREE_LINES   0x0001  /* are there any unused line pointers? */
  #define PD_PAGE_FULL        0x0002  /* not enough free space for new
                                       * tuple? */
+ #define PD_ALL_VISIBLE      0x0004  /* all tuples on page are visible to
+                                      * everyone */
  
! #define PD_VALID_FLAG_BITS  0x0007  /* OR of all valid pd_flags bits */
  
  /*
   * Page layout version number 0 is for pre-7.3 Postgres releases.
***************
*** 336,341 **** typedef PageHeaderData *PageHeader;
--- 338,350 ----
  #define PageClearFull(page) \
      (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL)
  
+ #define PageIsAllVisible(page) \
+     (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE)
+ #define PageSetAllVisible(page) \
+     (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE)
+ #define PageClearAllVisible(page) \
+     (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE)
+ 
  #define PageIsPrunable(page, oldestxmin) \
  ( \
      AssertMacro(TransactionIdIsNormal(oldestxmin)), \
*** src/include/storage/relfilenode.h
--- src/include/storage/relfilenode.h
***************
*** 24,37 **** typedef enum ForkNumber
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;
  
! #define MAX_FORKNUM     FSM_FORKNUM
  
  /*
   * RelFileNode must provide all that we need to know to physically access
--- 24,38 ----
  {
      InvalidForkNumber = -1,
      MAIN_FORKNUM = 0,
!     FSM_FORKNUM,
!     VISIBILITYMAP_FORKNUM
      /*
       * NOTE: if you add a new fork, change MAX_FORKNUM below and update the
       * forkNames array in catalog.c
       */
  } ForkNumber;
  
! #define MAX_FORKNUM     VISIBILITYMAP_FORKNUM
  
  /*
   * RelFileNode must provide all that we need to know to physically access
*** src/include/utils/rel.h
--- src/include/utils/rel.h
***************
*** 195,202 **** typedef struct RelationData
      List       *rd_indpred;     /* index predicate tree, if any */
      void       *rd_amcache;     /* available for use by index AM */
  
!     /* Cached last-seen size of the FSM */
      BlockNumber rd_fsm_nblocks_cache;
  
      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;     /* statistics collection area */
--- 195,203 ----
      List       *rd_indpred;     /* index predicate tree, if any */
      void       *rd_amcache;     /* available for use by index AM */
  
!     /* Cached last-seen size of the FSM and visibility map */
      BlockNumber rd_fsm_nblocks_cache;
+     BlockNumber rd_vm_nblocks_cache;
  
      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;     /* statistics collection area */
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I committed the changes to FSM truncation yesterday, that helps with the 
> truncation of the visibility map as well. Attached is an updated 
> visibility map patch.
I looked over this patch a bit ...
> 1. The bits in the visibility map are set in the 1st phase of lazy 
> vacuum. That works, but it means that after a delete or update, it takes 
> two vacuums until the bit in the visibility map is set. The first vacuum 
> removes the dead tuple, and only the second sees that there's no dead 
> tuples and sets the bit.
I think this is probably not a big issue really.  The point of this change
is to optimize things for pages that are static over the long term; one
extra vacuum cycle before the page is deemed static doesn't seem like a
problem.  You could even argue that this saves I/O because we don't set
the bit (and perhaps later have to clear it) until we know that the page
has stayed static across a vacuum cycle and thus has a reasonable
probability of continuing to do so.
A possible problem is that if a relation is filled all in one shot,
autovacuum would trigger a single vacuum cycle on it and then never have
a reason to trigger another; leading to the bits never getting set (or
at least not till an antiwraparound vacuum occurs).  We might want to
tweak autovac so that an extra vacuum cycle occurs in this case.  But
I'm not quite sure what a reasonable heuristic would be.
Some other points:
* ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE
page header flag *must* be correct, but it's really okay if the backing
map bit *isn't* correct --- in particular we don't trust the map bit
when performing antiwraparound vacuums.  This isn't well documented.
* Also, I see that vacuum has a provision for clearing an incorrectly
set PD_ALL_VISIBLE flag, but shouldn't it fix the map too?
* It would be good if the visibility map fork were never created until
there is occasion to set a bit in it; this would for instance typically
mean that temp tables would never have one.  I think that
visibilitymap.c doesn't get this quite right --- in particular
vm_readbuf seems willing to create/extend the fork whether its extend
argument is true or not, so it looks like an inquiry operation would
cause the map fork to be created.  It should be possible to act as
though a nonexistent fork just means "all zeroes".
* heap_insert's all_visible_cleared variable doesn't seem to get
initialized --- didn't your compiler complain?
* You missed updating SizeOfHeapDelete and SizeOfHeapUpdate
        regards, tom lane
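
The "nonexistent fork just means all zeroes" point can be sketched in a few lines. This is only an in-memory illustration under stated assumptions: DemoVisMap and the demo_vm_* functions are hypothetical names, a NULL buffer stands in for a fork that has not been created, and nothing here mirrors the actual vm_readbuf code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory map; bits == NULL models "fork not created yet". */
typedef struct DemoVisMap
{
    uint8_t *bits;
    size_t   nbytes;
} DemoVisMap;

/* Inquiry: never allocates. Anything past the end reads as 0. */
bool demo_vm_test(const DemoVisMap *vm, uint32_t heap_blk)
{
    size_t byte = heap_blk / 8;

    assert(vm != NULL);
    if (byte >= vm->nbytes)
        return false;
    return ((vm->bits[byte] >> (heap_blk % 8)) & 1) != 0;
}

/* Setting a bit is the only operation that creates or extends the map. */
bool demo_vm_set(DemoVisMap *vm, uint32_t heap_blk)
{
    size_t byte = heap_blk / 8;

    if (byte >= vm->nbytes)
    {
        size_t   new_nbytes = byte + 1;
        uint8_t *p = realloc(vm->bits, new_nbytes);

        if (p == NULL)
            return false;
        /* Newly added bytes start out as all zeroes. */
        memset(p + vm->nbytes, 0, new_nbytes - vm->nbytes);
        vm->bits = p;
        vm->nbytes = new_nbytes;
    }
    vm->bits[byte] |= (uint8_t) (1u << (heap_blk % 8));
    return true;
}

/* Clearing a bit that lies past the end is a no-op: it is already 0. */
void demo_vm_clear(DemoVisMap *vm, uint32_t heap_blk)
{
    size_t byte = heap_blk / 8;

    if (byte < vm->nbytes)
        vm->bits[byte] &= (uint8_t) ~(1u << (heap_blk % 8));
}
```

With this shape, a table that is only ever probed or cleared (a temp table that never gets vacuumed, say) never acquires a map at all.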
			
On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote:
> A possible problem is that if a relation is filled all in one shot,
> autovacuum would trigger a single vacuum cycle on it and then never have
> a reason to trigger another; leading to the bits never getting set (or
> at least not till an antiwraparound vacuum occurs).  We might want to
> tweak autovac so that an extra vacuum cycle occurs in this case.  But
> I'm not quite sure what a reasonable heuristic would be.
>
This would only be an issue if using the visibility map for things other
than partial vacuum (e.g. index-only scan), right? If we never do
another VACUUM, we don't need partial vacuum.

Regards,
        Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes:
> On Sun, 2008-11-23 at 14:05 -0500, Tom Lane wrote:
>> A possible problem is that if a relation is filled all in one shot,
>> autovacuum would trigger a single vacuum cycle on it and then never have
>> a reason to trigger another; leading to the bits never getting set (or
>> at least not till an antiwraparound vacuum occurs).
> This would only be an issue if using the visibility map for things other
> than partial vacuum (e.g. index-only scan), right? If we never do
> another VACUUM, we don't need partial vacuum.
Well, the patch already uses the page header bits for optimization of
seqscans, and could probably make good use of them for bitmap scans too.
It'd be nice if the page header bits got set even if the map bits
didn't.
Reflecting on it though, maybe Heikki described the behavior too
pessimistically anyway.  If a page contains no dead tuples, it should
get its bits set on first visit anyhow, no?  So for the ordinary bulk
load scenario where there are no failed insertions, the first vacuum
pass should set all the bits ... at least, if enough time has passed
for RecentXmin to be past the inserting transaction.
However, my comment above was too optimistic, because in an insert-only
scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.
So it seems like we do indeed want to rejigger autovac's rules a bit
to account for the possibility of wanting to apply vacuum to get
visibility bits set.
        regards, tom lane
			
Tom Lane wrote:
> However, my comment above was too optimistic, because in an insert-only
> scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.
>
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.

I'm sure I'm missing something, but I thought the point of this was to
lessen the impact of VACUUM and now you are suggesting that we have to
add vacuums to tables that have never needed one before.
Tom Lane wrote:
> Reflecting on it though, maybe Heikki described the behavior too
> pessimistically anyway.  If a page contains no dead tuples, it should
> get its bits set on first visit anyhow, no?  So for the ordinary bulk
> load scenario where there are no failed insertions, the first vacuum
> pass should set all the bits ... at least, if enough time has passed
> for RecentXmin to be past the inserting transaction.

Right. I did say "... after a delete or update, it takes two vacuums
until ..." in my mail.

> However, my comment above was too optimistic, because in an insert-only
> scenario autovac would in fact not trigger VACUUM at all, only ANALYZE.
>
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.

I'm not too excited about triggering an extra vacuum. As Matthew pointed
out, the point of this patch is to reduce the number of vacuums
required, not increase it. If you're not going to vacuum a table, you
don't care if the bits in the visibility map are set or not.

We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs,
if we want to make the seqscan optimization more effective. For example,
a seqscan could set the flag too, if it sees that all the tuples were
visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> * ISTM that the patch is designed on the plan that the PD_ALL_VISIBLE
> page header flag *must* be correct, but it's really okay if the backing
> map bit *isn't* correct --- in particular we don't trust the map bit
> when performing antiwraparound vacuums.  This isn't well documented.

Right. Will add comments.

We can't use the map bit for antiwraparound vacuums, because the bit
doesn't tell you when the tuples have been frozen. And we can't advance
relfrozenxid if we've skipped any pages.

I've been thinking that we could add one frozenxid field to each
visibility map page, for the oldest xid on the heap pages covered by the
visibility map page. That would allow more fine-grained anti-wraparound
vacuums as well.

> * Also, I see that vacuum has a provision for clearing an incorrectly
> set PD_ALL_VISIBLE flag, but shouldn't it fix the map too?

Yes, will fix. Although, as long as we don't trust the visibility map,
no real damage would be done.

> * It would be good if the visibility map fork were never created until
> there is occasion to set a bit in it; this would for instance typically
> mean that temp tables would never have one.  I think that
> visibilitymap.c doesn't get this quite right --- in particular
> vm_readbuf seems willing to create/extend the fork whether its extend
> argument is true or not, so it looks like an inquiry operation would
> cause the map fork to be created.  It should be possible to act as
> though a nonexistent fork just means "all zeroes".

The visibility map won't be inquired unless you vacuum. This is a bit
tricky. In vacuum, we only know whether we can set a bit or not, after
we've acquired a cleanup lock on the page, and scanned all the tuples.
While we're holding a cleanup lock, we don't want to do I/O, which could
potentially block out other processes for a long time. So it's too late
to extend the visibility map at that point.

I agree that vm_readbuf should not create the fork if 'extend' is false,
that's an oversight, but it won't change the actual behavior because
visibilitymap_test calls it with 'extend' true. Because of the above.

I will add comments about that, though, there's nothing describing that
currently.

> * heap_insert's all_visible_cleared variable doesn't seem to get
> initialized --- didn't your compiler complain?

Hmph, I must've been compiling with -O0.

> * You missed updating SizeOfHeapDelete and SizeOfHeapUpdate

Thanks.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> I've been thinking that we could add one frozenxid field to each 
> visibility map page, for the oldest xid on the heap pages covered by the 
> visibility map page. That would allow more fine-grained anti-wraparound 
> vacuums as well.
This doesn't strike me as a particularly good idea.  Right now the map
is only hints as far as vacuum is concerned --- if you do the above then
the map becomes critical data.  And I don't really think you'll buy
much.
> The visibility map won't be inquired unless you vacuum. This is a bit 
> tricky. In vacuum, we only know whether we can set a bit or not, after 
> we've acquired a cleanup lock on the page, and scanned all the tuples. 
> While we're holding a cleanup lock, we don't want to do I/O, which could 
> potentially block out other processes for a long time. So it's too late 
> to extend the visibility map at that point.
This is no good; I think you've made the wrong tradeoffs.  In
particular, even though only vacuum *currently* uses the map, you want
to extend it to be used by indexscans.  So it's going to uselessly
spring into being even without vacuums.
I'm not convinced that I/O while holding cleanup lock is so bad that we
should break other aspects of the system to avoid it.  However, if you
want to stick to that, how about
* vacuum page, possibly set its header bit
* release page lock (but not pin)
* if we need to set the bit, fetch the corresponding map page
  (I/O might happen here)
* get share lock on heap page, then recheck its header bit;
  if still set, set the map bit
 
        regards, tom lane
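
The four steps Tom proposes above can be modeled as a single-threaded simulation, which makes the ordering easy to check. Everything here is hypothetical scaffolding: DemoHeapPage, DemoMapPage and the hook are invented for illustration, lock and pin state are plain fields rather than real buffer-manager calls, and the hook stands in for whatever another backend might do while the map page is being read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { DEMO_UNLOCKED, DEMO_SHARE, DEMO_CLEANUP } DemoLockMode;

/* Hypothetical model of one heap page. */
typedef struct DemoHeapPage
{
    bool         all_visible;   /* stands in for PD_ALL_VISIBLE */
    int          pin_count;
    DemoLockMode lock;
} DemoHeapPage;

/* Hypothetical model of the map page covering that heap page. */
typedef struct DemoMapPage
{
    bool loaded;    /* has the map page been read in? */
    bool bit;       /* map bit for the modeled heap page */
} DemoMapPage;

/* Runs while no heap page lock is held; models concurrent activity. */
typedef void (*DemoConcurrentHook)(DemoHeapPage *heap);

/* Returns true if the map bit was set. */
bool demo_vacuum_page(DemoHeapPage *heap, DemoMapPage *map,
                      bool page_all_visible, DemoConcurrentHook hook)
{
    bool want_map_bit;
    bool set_map_bit = false;

    /* Step 1: vacuum the page under cleanup lock, possibly set header bit. */
    heap->pin_count++;
    heap->lock = DEMO_CLEANUP;
    if (page_all_visible)
        heap->all_visible = true;
    want_map_bit = heap->all_visible;

    /* Step 2: release the page lock but keep the pin. */
    heap->lock = DEMO_UNLOCKED;

    if (want_map_bit)
    {
        /* Step 3: fetch the map page; I/O is fine, no heap lock is held. */
        assert(heap->lock == DEMO_UNLOCKED);
        map->loaded = true;
        if (hook != NULL)
            hook(heap);

        /* Step 4: share-lock the heap page and recheck the header bit. */
        heap->lock = DEMO_SHARE;
        if (heap->all_visible)
        {
            map->bit = true;
            set_map_bit = true;
        }
        heap->lock = DEMO_UNLOCKED;
    }

    heap->pin_count--;
    return set_map_bit;
}

/* Example hook: a concurrent update clears the header flag. */
void demo_concurrent_update(DemoHeapPage *heap)
{
    heap->all_visible = false;
}
```

The recheck in step 4 is what preserves the invariant that the map bit is never set while the header flag is clear: if an update sneaks in during the map I/O, the vacuum simply declines to set the bit.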
			
		Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> So it seems like we do indeed want to rejigger autovac's rules a bit
>> to account for the possibility of wanting to apply vacuum to get
>> visibility bits set.
> I'm not too excited about triggering an extra vacuum. As Matthew pointed 
> out, the point of this patch is to reduce the number of vacuums 
> required, not increase it. If you're not going to vacuum a table, you 
> don't care if the bits in the visibility map are set or not.
But it's already the case that the bits provide a performance increase
to other things besides vacuum.
> We could set the PD_ALL_VISIBLE flag more aggressively, outside VACUUMs, 
> if we want to make the seqscan optimization more effective. For example, 
> a seqscan could set the flag too, if it sees that all the tuples were 
> visible, and had the XMIN_COMMITTED and XMAX_INVALID hint bits set.
I was wondering whether we could teach heap_page_prune to set the flag
without adding any extra tuple visibility checks.  A seqscan per se
shouldn't be doing this because it doesn't normally call
HeapTupleSatisfiesVacuum.
        regards, tom lane
			
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> I've been thinking that we could add one frozenxid field to each
>> visibility map page, for the oldest xid on the heap pages covered by the
>> visibility map page. That would allow more fine-grained anti-wraparound
>> vacuums as well.
>
> This doesn't strike me as a particularly good idea.  Right now the map
> is only hints as far as vacuum is concerned --- if you do the above then
> the map becomes critical data.  And I don't really think you'll buy
> much.

Hm, that depends on how critical the critical data is. It's critical that the
frozenxid that autovacuum sees is no more recent than the actual frozenxid, but
not critical that it be entirely up-to-date otherwise.

So if it's possible for the frozenxid in the visibility map to go backwards
then it's no good, since if that update is lost we might skip a necessary
vacuum freeze. But if we guarantee that we never update the frozenxid in the
visibility map forward ahead of recentglobalxmin then it can't ever go
backwards. (Well, not in a way that matters)

However I'm a bit puzzled how you could possibly maintain this frozenxid. As
soon as you freeze an xid you'll have to visit all the other pages covered by
that visibility map page to see what the new value should be.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark wrote:
> However I'm a bit puzzled how you could possibly maintain this frozenxid. As
> soon as you freeze an xid you'll have to visit all the other pages covered by
> that visibility map page to see what the new value should be.

Right, you could only advance it when you scan all the pages covered by
the visibility map page. But that's better than having to scan the whole
relation.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Gregory Stark wrote:
>> However I'm a bit puzzled how you could possibly maintain this frozenxid. As
>> soon as you freeze an xid you'll have to visit all the other pages covered by
>> that visibility map page to see what the new value should be.
>
> Right, you could only advance it when you scan all the pages covered by the
> visibility map page. But that's better than having to scan the whole relation.

Is it? It seems like that would just move around the work. You'll still have
to visit every page once every 2B transactions or so. You'll just do it 64MB
at a time.

It's nice to smooth the work but it would be much nicer to detect that a
normal vacuum has already processed all of those pages since the last
insert/update/delete on those pages and so avoid the work entirely.

To avoid the work entirely you need some information about the oldest xid on
those pages seen by regular vacuums (and/or prunes).

We would want to skip any page which:

a) Has been visited by vacuum freeze and not been updated since

b) Has been visited by a regular vacuum and the oldest xid found was more
   recent than freeze_threshold.

c) Has been updated frequently such that no old tuples remain

Ideally (b) should completely obviate the need for anti-wraparound freezes
entirely.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's 24x7 Postgres support!
Gregory Stark <stark@enterprisedb.com> writes:
> So if it's possible for the frozenxid in the visibility map to go backwards
> then it's no good, since if that update is lost we might skip a necessary
> vacuum freeze.
Seems like a lost disk write would be enough to make that happen.
Now you might argue that the odds of that are no worse than the odds of
losing an update to one particular heap page, but in this case the
single hiccup could lead to losing half a gigabyte of data (assuming 8K
page size).  The leverage you get for saving vacuum freeze work is
exactly equal to the magnification factor for data loss.
        regards, tom lane
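
The "half a gigabyte" figure follows from one bit per heap page. A small sketch of the arithmetic, with hypothetical function names; header_bytes is a free parameter here because the exact usable space per map page depends on the page header size, which this thread does not state.

```c
#include <assert.h>
#include <stdint.h>

/* Heap pages tracked by one map page: one bit per heap page. */
uint64_t demo_heap_pages_per_map_page(uint32_t page_size, uint32_t header_bytes)
{
    assert(header_bytes < page_size);
    return (uint64_t) (page_size - header_bytes) * 8;
}

/* Heap bytes covered by one map page. */
uint64_t demo_heap_bytes_per_map_page(uint32_t page_size, uint32_t header_bytes)
{
    return demo_heap_pages_per_map_page(page_size, header_bytes) * page_size;
}
```

Ignoring the header, an 8K map page holds 8192 * 8 = 65536 bits, and 65536 heap pages of 8K each come to 512MB. A single lost write of a per-map-page frozenxid would therefore put up to that much heap data at risk, which is the magnification Tom is describing.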
			
On Nov 23, 2008, at 3:18 PM, Tom Lane wrote:
> So it seems like we do indeed want to rejigger autovac's rules a bit
> to account for the possibility of wanting to apply vacuum to get
> visibility bits set.

That makes the idea of not writing out hint bit updates unless the
page is already dirty a lot easier to swallow, because now we'd have
a mechanism in place to ensure that they were set in a reasonable
timeframe by autovacuum. That actually wouldn't incur much extra
overhead at all, except in the case of a table that's effectively
write-only. Actually, that's not even true; you still have to
eventually freeze a write-mostly table.

-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> The visibility map won't be inquired unless you vacuum. This is a bit
>> tricky. In vacuum, we only know whether we can set a bit or not, after
>> we've acquired a cleanup lock on the page, and scanned all the tuples.
>> While we're holding a cleanup lock, we don't want to do I/O, which could
>> potentially block out other processes for a long time. So it's too late
>> to extend the visibility map at that point.
>
> This is no good; I think you've made the wrong tradeoffs.  In
> particular, even though only vacuum *currently* uses the map, you want
> to extend it to be used by indexscans.  So it's going to uselessly
> spring into being even without vacuums.
>
> I'm not convinced that I/O while holding cleanup lock is so bad that we
> should break other aspects of the system to avoid it.  However, if you
> want to stick to that, how about
> * vacuum page, possibly set its header bit
> * release page lock (but not pin)
> * if we need to set the bit, fetch the corresponding map page
>   (I/O might happen here)
> * get share lock on heap page, then recheck its header bit;
>   if still set, set the map bit

Yeah, could do that.

There is another problem, though, if the map is frequently probed for
pages that don't exist in the map, or the map doesn't exist at all.
Currently, the size of the map file is kept in relcache, in the
rd_vm_nblocks_cache variable. Whenever a page is accessed that's >
rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists,
and rd_vm_nblocks_cache is updated. That means that every probe to a
non-existing page causes an lseek(), which isn't free.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> There is another problem, though, if the map is frequently probed for 
> pages that don't exist in the map, or the map doesn't exist at all. 
> Currently, the size of the map file is kept in relcache, in the 
> rd_vm_nblocks_cache variable. Whenever a page is accessed that's > 
> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists, 
> and rd_vm_nblocks_cache is updated. That means that every probe to a 
> non-existing page causes an lseek(), which isn't free.
Well, considering how seldom new pages will be added to the visibility
map, it seems to me we could afford to send out a relcache inval event
when that happens.  Then rd_vm_nblocks_cache could be treated as
trustworthy.
Maybe it'd be worth doing that for the FSM too.  The frequency of
invals would be higher, but then again the reference frequency is
probably higher too?
        regards, tom lane
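
To make the trade-off concrete, here is a rough sketch of a cached fork size that is trusted until an invalidation arrives. It is a toy model under stated assumptions: DemoFork, DemoRelCache and the demo_* functions are hypothetical, the size_lookups counter stands in for the lseek() that smgrnblocks would cost, and a loop over an array of backends stands in for a relcache inval event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_INVALID_NBLOCKS UINT32_MAX

/* Hypothetical on-disk fork; size_lookups counts simulated lseek() calls. */
typedef struct DemoFork
{
    uint32_t nblocks;
    unsigned size_lookups;
} DemoFork;

/* Hypothetical per-backend cache entry, like rd_vm_nblocks_cache. */
typedef struct DemoRelCache
{
    uint32_t vm_nblocks_cache;
} DemoRelCache;

/* The expensive path: ask the storage layer how big the fork is. */
uint32_t demo_fork_nblocks(DemoFork *fork)
{
    fork->size_lookups++;
    return fork->nblocks;
}

/* Probe: the cached size is trusted unless it has been invalidated. */
bool demo_vm_page_exists(DemoRelCache *rel, DemoFork *fork, uint32_t map_blk)
{
    if (rel->vm_nblocks_cache == DEMO_INVALID_NBLOCKS)
        rel->vm_nblocks_cache = demo_fork_nblocks(fork);
    return map_blk < rel->vm_nblocks_cache;
}

/* Extending the fork invalidates every backend's cached size. */
void demo_vm_extend(DemoFork *fork, DemoRelCache *backends[],
                    size_t nbackends, uint32_t new_nblocks)
{
    assert(new_nblocks >= fork->nblocks);
    fork->nblocks = new_nblocks;
    for (size_t i = 0; i < nbackends; i++)
        backends[i]->vm_nblocks_cache = DEMO_INVALID_NBLOCKS;
}
```

In this model, repeated probes for a nonexistent map page cost one size lookup in total rather than one per probe, and the price is paid only when the map is extended, which should be rare.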
			
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> There is another problem, though, if the map is frequently probed for
>> pages that don't exist in the map, or the map doesn't exist at all.
>> Currently, the size of the map file is kept in relcache, in the
>> rd_vm_nblocks_cache variable. Whenever a page is accessed that's >
>> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists,
>> and rd_vm_nblocks_cache is updated. That means that every probe to a
>> non-existing page causes an lseek(), which isn't free.
>
> Well, considering how seldom new pages will be added to the visibility
> map, it seems to me we could afford to send out a relcache inval event
> when that happens.  Then rd_vm_nblocks_cache could be treated as
> trustworthy.
>
> Maybe it'd be worth doing that for the FSM too.  The frequency of
> invals would be higher, but then again the reference frequency is
> probably higher too?

A relcache invalidation sounds awfully heavy-weight. Perhaps a
light-weight invalidation event that doesn't flush the entry altogether,
but just resets the cached sizes?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> Well, considering how seldom new pages will be added to the visibility
>> map, it seems to me we could afford to send out a relcache inval event
>> when that happens.  Then rd_vm_nblocks_cache could be treated as
>> trustworthy.
> A relcache invalidation sounds awfully heavy-weight.
It really isn't.
        regards, tom lane
			
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> Well, considering how seldom new pages will be added to the visibility
>>> map, it seems to me we could afford to send out a relcache inval event
>>> when that happens.  Then rd_vm_nblocks_cache could be treated as
>>> trustworthy.
>
>> A relcache invalidation sounds awfully heavy-weight.
>
> It really isn't.

Okay, then. I'll use relcache invalidation for both the FSM and
visibility map.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> There is another problem, though, if the map is frequently probed for
>> pages that don't exist in the map, or the map doesn't exist at all.
>> Currently, the size of the map file is kept in relcache, in the
>> rd_vm_nblocks_cache variable. Whenever a page is accessed that's >
>> rd_vm_nblocks_cache, smgrnblocks is called to see if the page exists,
>> and rd_vm_nblocks_cache is updated. That means that every probe to a
>> non-existing page causes an lseek(), which isn't free.
>
> Well, considering how seldom new pages will be added to the visibility
> map, it seems to me we could afford to send out a relcache inval event
> when that happens.  Then rd_vm_nblocks_cache could be treated as
> trustworthy.

Here's an updated version, with a lot of smaller cleanups, and using
relcache invalidation to notify other backends when the visibility map
fork is extended. I already committed the change to FSM to do the same.
I'm feeling quite satisfied to commit this patch early next week.

I modified the VACUUM VERBOSE output slightly, to print the number of
pages scanned. The added part emphasized below:

postgres=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  "foo": removed 230 row versions in 10 pages
INFO:  "foo": found 230 removable, 10 nonremovable row versions in *10 out of* 43 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
VACUUM

That seems OK to me, but maybe others have an opinion on that?

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas wrote: > Here's an updated version, ... And here it is, for real... -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com *** src/backend/access/heap/Makefile --- src/backend/access/heap/Makefile *************** *** 12,17 **** subdir = src/backend/access/heap top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o include $(top_srcdir)/src/backend/common.mk --- 12,17 ---- top_builddir = ../../../.. include $(top_builddir)/src/Makefile.global ! OBJS = heapam.o hio.o pruneheap.o rewriteheap.o syncscan.o tuptoaster.o visibilitymap.o include $(top_srcdir)/src/backend/common.mk *** src/backend/access/heap/heapam.c --- src/backend/access/heap/heapam.c *************** *** 47,52 **** --- 47,53 ---- #include "access/transam.h" #include "access/tuptoaster.h" #include "access/valid.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" *************** *** 195,200 **** heapgetpage(HeapScanDesc scan, BlockNumber page) --- 196,202 ---- int ntup; OffsetNumber lineoff; ItemId lpp; + bool all_visible; Assert(page < scan->rs_nblocks); *************** *** 233,252 **** heapgetpage(HeapScanDesc scan, BlockNumber page) lines = PageGetMaxOffsetNumber(dp); ntup = 0; for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff); lineoff <= lines; lineoff++, lpp++) { if (ItemIdIsNormal(lpp)) { - HeapTupleData loctup; bool valid; ! loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp); ! loctup.t_len = ItemIdGetLength(lpp); ! ItemPointerSet(&(loctup.t_self), page, lineoff); ! valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); if (valid) scan->rs_vistuples[ntup++] = lineoff; } --- 235,266 ---- lines = PageGetMaxOffsetNumber(dp); ntup = 0; + /* + * If the all-visible flag indicates that all tuples on the page are + * visible to everyone, we can skip the per-tuple visibility tests. 
+ */ + all_visible = PageIsAllVisible(dp); + for (lineoff = FirstOffsetNumber, lpp = PageGetItemId(dp, lineoff); lineoff <= lines; lineoff++, lpp++) { if (ItemIdIsNormal(lpp)) { bool valid; ! if (all_visible) ! valid = true; ! else ! { ! HeapTupleData loctup; ! ! loctup.t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp); ! loctup.t_len = ItemIdGetLength(lpp); ! ItemPointerSet(&(loctup.t_self), page, lineoff); ! valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer); ! } if (valid) scan->rs_vistuples[ntup++] = lineoff; } *************** *** 1860,1865 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1874,1880 ---- TransactionId xid = GetCurrentTransactionId(); HeapTuple heaptup; Buffer buffer; + bool all_visible_cleared = false; if (relation->rd_rel->relhasoids) { *************** *** 1920,1925 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1935,1946 ---- RelationPutHeapTuple(relation, buffer, heaptup); + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + } + /* * XXX Should we set PageSetPrunable on this page ? * *************** *** 1943,1948 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 1964,1970 ---- Page page = BufferGetPage(buffer); uint8 info = XLOG_HEAP_INSERT; + xlrec.all_visible_cleared = all_visible_cleared; xlrec.target.node = relation->rd_node; xlrec.target.tid = heaptup->t_self; rdata[0].data = (char *) &xlrec; *************** *** 1994,1999 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid, --- 2016,2026 ---- UnlockReleaseBuffer(buffer); + /* Clear the bit in the visibility map if necessary */ + if (all_visible_cleared) + visibilitymap_clear(relation, + ItemPointerGetBlockNumber(&(heaptup->t_self))); + /* * If tuple is cachable, mark it for invalidation from the caches in case * we abort. 
Note it is OK to do this after releasing the buffer, because *************** *** 2070,2075 **** heap_delete(Relation relation, ItemPointer tid, --- 2097,2103 ---- Buffer buffer; bool have_tuple_lock = false; bool iscombo; + bool all_visible_cleared = false; Assert(ItemPointerIsValid(tid)); *************** *** 2216,2221 **** l1: --- 2244,2255 ---- */ PageSetPrunable(page, xid); + if (PageIsAllVisible(page)) + { + all_visible_cleared = true; + PageClearAllVisible(page); + } + /* store transaction information of xact deleting the tuple */ tp.t_data->t_infomask &= ~(HEAP_XMAX_COMMITTED | HEAP_XMAX_INVALID | *************** *** 2237,2242 **** l1: --- 2271,2277 ---- XLogRecPtr recptr; XLogRecData rdata[2]; + xlrec.all_visible_cleared = all_visible_cleared; xlrec.target.node = relation->rd_node; xlrec.target.tid = tp.t_self; rdata[0].data = (char *) &xlrec; *************** *** 2281,2286 **** l1: --- 2316,2325 ---- */ CacheInvalidateHeapTuple(relation, &tp); + /* Clear the bit in the visibility map if necessary */ + if (all_visible_cleared) + visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); + /* Now we can release the buffer */ ReleaseBuffer(buffer); *************** *** 2388,2393 **** heap_update(Relation relation, ItemPointer otid, HeapTuple newtup, --- 2427,2434 ---- bool have_tuple_lock = false; bool iscombo; bool use_hot_update = false; + bool all_visible_cleared = false; + bool all_visible_cleared_new = false; Assert(ItemPointerIsValid(otid)); *************** *** 2763,2768 **** l2: --- 2804,2815 ---- MarkBufferDirty(newbuf); MarkBufferDirty(buffer); + /* + * Note: we mustn't clear PD_ALL_VISIBLE flags before writing the WAL + * record, because log_heap_update looks at those flags to set the + * corresponding flags in the WAL record. 
+ */ + /* XLOG stuff */ if (!relation->rd_istemp) { *************** *** 2778,2783 **** l2: --- 2825,2842 ---- PageSetTLI(BufferGetPage(buffer), ThisTimeLineID); } + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + } + if (newbuf != buffer && PageIsAllVisible(BufferGetPage(newbuf))) + { + all_visible_cleared_new = true; + PageClearAllVisible(BufferGetPage(newbuf)); + } + END_CRIT_SECTION(); if (newbuf != buffer) *************** *** 2791,2796 **** l2: --- 2850,2861 ---- */ CacheInvalidateHeapTuple(relation, &oldtup); + /* Clear bits in visibility map */ + if (all_visible_cleared) + visibilitymap_clear(relation, BufferGetBlockNumber(buffer)); + if (all_visible_cleared_new) + visibilitymap_clear(relation, BufferGetBlockNumber(newbuf)); + /* Now we can release the buffer(s) */ if (newbuf != buffer) ReleaseBuffer(newbuf); *************** *** 3412,3417 **** l3: --- 3477,3487 ---- LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); /* + * Don't update the visibility map here. Locking a tuple doesn't + * change visibility info. + */ + + /* * Now that we have successfully marked the tuple as locked, we can * release the lmgr tuple lock, if we had it. */ *************** *** 3916,3922 **** log_heap_update(Relation reln, Buffer oldbuf, ItemPointerData from, --- 3986,3994 ---- xlrec.target.node = reln->rd_node; xlrec.target.tid = from; + xlrec.all_visible_cleared = PageIsAllVisible(BufferGetPage(oldbuf)); xlrec.newtid = newtup->t_self; + xlrec.new_all_visible_cleared = PageIsAllVisible(BufferGetPage(newbuf)); rdata[0].data = (char *) &xlrec; rdata[0].len = SizeOfHeapUpdate; *************** *** 4185,4197 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record) OffsetNumber offnum; ItemId lp = NULL; HeapTupleHeader htup; if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBuffer(xlrec->target.node, ! ItemPointerGetBlockNumber(&(xlrec->target.tid)), ! 
false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); --- 4257,4281 ---- OffsetNumber offnum; ItemId lp = NULL; HeapTupleHeader htup; + BlockNumber blkno; + + blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid)); + + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. + */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, blkno); + FreeFakeRelcacheEntry(reln); + } if (record->xl_info & XLR_BKP_BLOCK_1) return; ! buffer = XLogReadBuffer(xlrec->target.node, blkno, false); if (!BufferIsValid(buffer)) return; page = (Page) BufferGetPage(buffer); *************** *** 4223,4228 **** heap_xlog_delete(XLogRecPtr lsn, XLogRecord *record) --- 4307,4315 ---- /* Mark the page as a candidate for pruning */ PageSetPrunable(page, record->xl_xid); + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + /* Make sure there is no forward chain link in t_ctid */ htup->t_ctid = xlrec->target.tid; PageSetLSN(page, lsn); *************** *** 4249,4259 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record) Size freespace; BlockNumber blkno; if (record->xl_info & XLR_BKP_BLOCK_1) return; - blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid)); - if (record->xl_info & XLOG_HEAP_INIT_PAGE) { buffer = XLogReadBuffer(xlrec->target.node, blkno, true); --- 4336,4357 ---- Size freespace; BlockNumber blkno; + blkno = ItemPointerGetBlockNumber(&(xlrec->target.tid)); + + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. 
+ */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, blkno); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_1) return; if (record->xl_info & XLOG_HEAP_INIT_PAGE) { buffer = XLogReadBuffer(xlrec->target.node, blkno, true); *************** *** 4307,4312 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record) --- 4405,4414 ---- PageSetLSN(page, lsn); PageSetTLI(page, ThisTimeLineID); + + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + MarkBufferDirty(buffer); UnlockReleaseBuffer(buffer); *************** *** 4347,4352 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4449,4466 ---- uint32 newlen; Size freespace; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. + */ + if (xlrec->all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, + ItemPointerGetBlockNumber(&xlrec->target.tid)); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_1) { if (samepage) *************** *** 4411,4416 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4525,4533 ---- /* Mark the page as a candidate for pruning */ PageSetPrunable(page, record->xl_xid); + if (xlrec->all_visible_cleared) + PageClearAllVisible(page); + /* * this test is ugly, but necessary to avoid thinking that insert change * is already applied *************** *** 4426,4431 **** heap_xlog_update(XLogRecPtr lsn, XLogRecord *record, bool move, bool hot_update) --- 4543,4559 ---- newt:; + /* + * The visibility map always needs to be updated, even if the heap page + * is already up-to-date. 
+ */ + if (xlrec->new_all_visible_cleared) + { + Relation reln = CreateFakeRelcacheEntry(xlrec->target.node); + visibilitymap_clear(reln, ItemPointerGetBlockNumber(&xlrec->newtid)); + FreeFakeRelcacheEntry(reln); + } + if (record->xl_info & XLR_BKP_BLOCK_2) return; *************** *** 4504,4509 **** newsame:; --- 4632,4640 ---- if (offnum == InvalidOffsetNumber) elog(PANIC, "heap_update_redo: failed to add tuple"); + if (xlrec->new_all_visible_cleared) + PageClearAllVisible(page); + freespace = PageGetHeapFreeSpace(page); /* needed to update FSM below */ PageSetLSN(page, lsn); *** /dev/null --- src/backend/access/heap/visibilitymap.c *************** *** 0 **** --- 1,478 ---- + /*------------------------------------------------------------------------- + * + * visibilitymap.c + * bitmap for tracking visibility of heap tuples + * + * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * + * IDENTIFICATION + * $PostgreSQL$ + * + * INTERFACE ROUTINES + * visibilitymap_clear - clear a bit in the visibility map + * visibilitymap_pin - pin a map page for setting a bit + * visibilitymap_set - set a bit in a previously pinned page + * visibilitymap_test - test if a bit is set + * + * NOTES + * + * The visibility map is a bitmap with one bit per heap page. A set bit means + * that all tuples on the page are visible to all transactions, and the page + * therefore doesn't need to be vacuumed. The map is conservative in the sense + * that whenever a bit is set, we know the condition is true, but if + * a bit is not set, the condition might or might not be true. + * + * There's no explicit WAL logging in the functions in this file. The callers + * must make sure that whenever a bit is cleared, the bit is cleared on WAL + * replay of the updating operation as well. Setting bits during recovery + * isn't necessary for correctness. 
+ * + * Currently, the visibility map is only used as a hint, to speed up VACUUM. + * A corrupted visibility map won't cause data corruption, although it can + * make VACUUM skip pages that need vacuuming, until the next anti-wraparound + * vacuum. The visibility map is not used for anti-wraparound vacuums, because + * an anti-wraparound vacuum needs to freeze tuples and observe the latest xid + * present in the table, also on pages that don't have any dead tuples. + * + * Although the visibility map is just a hint at the moment, the PD_ALL_VISIBLE + * flag on heap pages *must* be correct. + * + * LOCKING + * + * In heapam.c, whenever a page is modified so that not all tuples on the + * page are visible to everyone anymore, the corresponding bit in the + * visibility map is cleared. The bit in the visibility map is cleared + * after releasing the lock on the heap page, to avoid holding the lock + * over possible I/O to read in the visibility map page. + * + * To set a bit, you need to hold a lock on the heap page. That prevents + * the race condition where VACUUM sees that all tuples on the page are + * visible to everyone, but another backend modifies the page before VACUUM + * sets the bit in the visibility map. + * + * When a bit is set, the LSN of the visibility map page is updated to make + * sure that the visibility map update doesn't get written to disk before the + * WAL record of the changes that made it possible to set the bit is flushed. + * But when a bit is cleared, we don't have to do that because it's always OK + * to clear a bit in the map from a correctness point of view. + * + * TODO + * + * It would be nice to use the visibility map to skip visibility checks in + * index scans. + * + * Currently, the visibility map is not 100% correct all the time. + * During updates, the bit in the visibility map is cleared after releasing + * the lock on the heap page. 
During the window between releasing the lock + * and clearing the bit in the visibility map, the bit in the visibility map + * is set, but the new insertion or deletion is not yet visible to other + * backends. + * + * That might actually be OK for the index scans, though. The newly inserted + * tuple wouldn't have an index pointer yet, so all tuples reachable from an + * index would still be visible to all other backends, and deletions wouldn't + * be visible to other backends yet. + * + * There's another hole in the way the PD_ALL_VISIBLE flag is set. When + * vacuum observes that all tuples are visible to all, it sets the flag on + * the heap page, and also sets the bit in the visibility map. If we then + * crash, and only the visibility map page was flushed to disk, we'll have + * a bit set in the visibility map, but the corresponding flag on the heap + * page is not set. If the heap page is then updated, the updater won't + * know to clear the bit in the visibility map. + * + *------------------------------------------------------------------------- + */ + #include "postgres.h" + + #include "access/visibilitymap.h" + #include "storage/bufmgr.h" + #include "storage/bufpage.h" + #include "storage/lmgr.h" + #include "storage/smgr.h" + #include "utils/inval.h" + + /*#define TRACE_VISIBILITYMAP */ + + /* + * Size of the bitmap on each visibility map page, in bytes. There are no + * extra headers, so the whole page except for the standard page header + * is used for the bitmap. + */ + #define MAPSIZE (BLCKSZ - SizeOfPageHeaderData) + + /* Number of bits allocated for each heap block. */ + #define BITS_PER_HEAPBLOCK 1 + + /* Number of heap blocks we can represent in one byte. */ + #define HEAPBLOCKS_PER_BYTE 8 + + /* Number of heap blocks we can represent in one visibility map page. 
*/ + #define HEAPBLOCKS_PER_PAGE (MAPSIZE * HEAPBLOCKS_PER_BYTE) + + /* Mapping from heap block number to the right bit in the visibility map */ + #define HEAPBLK_TO_MAPBLOCK(x) ((x) / HEAPBLOCKS_PER_PAGE) + #define HEAPBLK_TO_MAPBYTE(x) (((x) % HEAPBLOCKS_PER_PAGE) / HEAPBLOCKS_PER_BYTE) + #define HEAPBLK_TO_MAPBIT(x) ((x) % HEAPBLOCKS_PER_BYTE) + + /* prototypes for internal routines */ + static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend); + static void vm_extend(Relation rel, BlockNumber nvmblocks); + + + /* + * visibilitymap_clear - clear a bit in visibility map + * + * Clear a bit in the visibility map, marking that not all tuples are + * visible to all transactions anymore. + */ + void + visibilitymap_clear(Relation rel, BlockNumber heapBlk) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + int mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + int mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + uint8 mask = 1 << mapBit; + Buffer mapBuffer; + char *map; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_clear %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + mapBuffer = vm_readbuf(rel, mapBlock, false); + if (!BufferIsValid(mapBuffer)) + return; /* nothing to do */ + + LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE); + map = PageGetContents(BufferGetPage(mapBuffer)); + + if (map[mapByte] & mask) + { + map[mapByte] &= ~mask; + + MarkBufferDirty(mapBuffer); + } + + UnlockReleaseBuffer(mapBuffer); + } + + /* + * visibilitymap_pin - pin a map page for setting a bit + * + * Setting a bit in the visibility map is a two-phase operation. First, call + * visibilitymap_pin, to pin the visibility map page containing the bit for + * the heap page. Because that can require I/O to read the map page, you + * shouldn't hold a lock on the heap page while doing that. Then, call + * visibilitymap_set to actually set the bit. 
+ * + * On entry, *buf should be InvalidBuffer or a valid buffer returned by + * an earlier call to visibilitymap_pin or visibilitymap_test on the same + * relation. On return, *buf is a valid buffer with the map page containing + * the bit for heapBlk. + * + * If the page doesn't exist in the map file yet, the file is extended. + */ + void + visibilitymap_pin(Relation rel, BlockNumber heapBlk, Buffer *buf) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + + /* Reuse the old pinned buffer if possible */ + if (BufferIsValid(*buf)) + { + if (BufferGetBlockNumber(*buf) == mapBlock) + return; + + ReleaseBuffer(*buf); + } + *buf = vm_readbuf(rel, mapBlock, true); + } + + /* + * visibilitymap_set - set a bit on a previously pinned page + * + * recptr is the LSN of the heap page. The LSN of the visibility map page is + * advanced to that, to make sure that the visibility map doesn't get flushed + * to disk before the update to the heap page that made all tuples visible. + * + * This is an opportunistic function. It does nothing, unless *buf + * contains the bit for heapBlk. Call visibilitymap_pin first to pin + * the right map page. This function doesn't do any I/O. 
+ */ + void + visibilitymap_set(Relation rel, BlockNumber heapBlk, XLogRecPtr recptr, + Buffer *buf) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + Page page; + char *map; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_set %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + /* Check that we have the right page pinned */ + if (!BufferIsValid(*buf) || BufferGetBlockNumber(*buf) != mapBlock) + return; + + page = BufferGetPage(*buf); + map = PageGetContents(page); + LockBuffer(*buf, BUFFER_LOCK_EXCLUSIVE); + + if (!(map[mapByte] & (1 << mapBit))) + { + map[mapByte] |= (1 << mapBit); + + if (XLByteLT(PageGetLSN(page), recptr)) + PageSetLSN(page, recptr); + PageSetTLI(page, ThisTimeLineID); + MarkBufferDirty(*buf); + } + + LockBuffer(*buf, BUFFER_LOCK_UNLOCK); + } + + /* + * visibilitymap_test - test if a bit is set + * + * Are all tuples on heapBlk visible to all, according to the visibility map? + * + * On entry, *buf should be InvalidBuffer or a valid buffer returned by an + * earlier call to visibilitymap_pin or visibilitymap_test on the same + * relation. On return, *buf is a valid buffer with the map page containing + * the bit for heapBlk, or InvalidBuffer. The caller is responsible for + * releasing *buf after it's done testing and setting bits. 
+ */ + bool + visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *buf) + { + BlockNumber mapBlock = HEAPBLK_TO_MAPBLOCK(heapBlk); + uint32 mapByte = HEAPBLK_TO_MAPBYTE(heapBlk); + uint8 mapBit = HEAPBLK_TO_MAPBIT(heapBlk); + bool result; + char *map; + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_test %s %d", RelationGetRelationName(rel), heapBlk); + #endif + + /* Reuse the old pinned buffer if possible */ + if (BufferIsValid(*buf)) + { + if (BufferGetBlockNumber(*buf) != mapBlock) + { + ReleaseBuffer(*buf); + *buf = InvalidBuffer; + } + } + + if (!BufferIsValid(*buf)) + { + *buf = vm_readbuf(rel, mapBlock, false); + if (!BufferIsValid(*buf)) + return false; + } + + map = PageGetContents(BufferGetPage(*buf)); + + /* + * We don't need to lock the page, as we're only looking at a single bit. + */ + result = (map[mapByte] & (1 << mapBit)) ? true : false; + + return result; + } + + /* + * visibilitymap_truncate - truncate the visibility map + */ + void + visibilitymap_truncate(Relation rel, BlockNumber nheapblocks) + { + BlockNumber newnblocks; + /* last remaining block, byte, and bit */ + BlockNumber truncBlock = HEAPBLK_TO_MAPBLOCK(nheapblocks); + uint32 truncByte = HEAPBLK_TO_MAPBYTE(nheapblocks); + uint8 truncBit = HEAPBLK_TO_MAPBIT(nheapblocks); + + #ifdef TRACE_VISIBILITYMAP + elog(DEBUG1, "vm_truncate %s %d", RelationGetRelationName(rel), nheapblocks); + #endif + + /* + * If no visibility map has been created yet for this relation, there's + * nothing to truncate. + */ + if (!smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + return; + + /* + * Unless the new size is exactly at a visibility map page boundary, the + * tail bits in the last remaining map page, representing truncated heap + * blocks, need to be cleared. This is not only tidy, but also necessary + * because we don't get a chance to clear the bits if the heap is + * extended again. 
+ */ + if (truncByte != 0 || truncBit != 0) + { + Buffer mapBuffer; + Page page; + char *map; + + newnblocks = truncBlock + 1; + + mapBuffer = vm_readbuf(rel, truncBlock, false); + if (!BufferIsValid(mapBuffer)) + { + /* nothing to do, the file was already smaller */ + return; + } + + page = BufferGetPage(mapBuffer); + map = PageGetContents(page); + + LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE); + + /* Clear out the unwanted bytes. */ + MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1)); + + /* + * Mask out the unwanted bits of the last remaining byte. + * + * ((1 << 0) - 1) = 00000000 + * ((1 << 1) - 1) = 00000001 + * ... + * ((1 << 6) - 1) = 00111111 + * ((1 << 7) - 1) = 01111111 + */ + map[truncByte] &= (1 << truncBit) - 1; + + MarkBufferDirty(mapBuffer); + UnlockReleaseBuffer(mapBuffer); + } + else + newnblocks = truncBlock; + + if (smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM) < newnblocks) + { + /* nothing to do, the file was already smaller than requested size */ + return; + } + + smgrtruncate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, newnblocks, + rel->rd_istemp); + + /* + * Need to invalidate the relcache entry, because rd_vm_nblocks + * seen by other backends is no longer valid. + */ + if (!InRecovery) + CacheInvalidateRelcache(rel); + + rel->rd_vm_nblocks = newnblocks; + } + + /* + * Read a visibility map page. + * + * If the page doesn't exist, InvalidBuffer is returned, or if 'extend' is + * true, the visibility map file is extended. + */ + static Buffer + vm_readbuf(Relation rel, BlockNumber blkno, bool extend) + { + Buffer buf; + + RelationOpenSmgr(rel); + + /* + * The current size of the visibility map fork is kept in relcache, to + * avoid reading beyond EOF. If we haven't cached the size of the map yet, + * do that first. 
+ */ + if (rel->rd_vm_nblocks == InvalidBlockNumber) + { + if (smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + rel->rd_vm_nblocks = smgrnblocks(rel->rd_smgr, + VISIBILITYMAP_FORKNUM); + else + rel->rd_vm_nblocks = 0; + } + + /* Handle requests beyond EOF */ + if (blkno >= rel->rd_vm_nblocks) + { + if (extend) + vm_extend(rel, blkno + 1); + else + return InvalidBuffer; + } + + /* + * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's + * always safe to clear bits, so it's better to clear corrupt pages than + * error out. + */ + buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno, + RBM_ZERO_ON_ERROR, NULL); + if (PageIsNew(BufferGetPage(buf))) + PageInit(BufferGetPage(buf), BLCKSZ, 0); + return buf; + } + + /* + * Ensure that the visibility map fork is at least vm_nblocks long, extending + * it if necessary with zeroed pages. + */ + static void + vm_extend(Relation rel, BlockNumber vm_nblocks) + { + BlockNumber vm_nblocks_now; + Page pg; + + pg = (Page) palloc(BLCKSZ); + PageInit(pg, BLCKSZ, 0); + + /* + * We use the relation extension lock to lock out other backends trying + * to extend the visibility map at the same time. It also locks out + * extension of the main fork, unnecessarily, but extending the + * visibility map happens seldom enough that it doesn't seem worthwhile to + * have a separate lock tag type for it. + * + * Note that another backend might have extended or created the + * relation before we get the lock. 
+ */ + LockRelationForExtension(rel, ExclusiveLock); + + /* Create the file first if it doesn't exist */ + if ((rel->rd_vm_nblocks == 0 || rel->rd_vm_nblocks == InvalidBlockNumber) + && !smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM)) + { + smgrcreate(rel->rd_smgr, VISIBILITYMAP_FORKNUM, false); + vm_nblocks_now = 0; + } + else + vm_nblocks_now = smgrnblocks(rel->rd_smgr, VISIBILITYMAP_FORKNUM); + + while (vm_nblocks_now < vm_nblocks) + { + smgrextend(rel->rd_smgr, VISIBILITYMAP_FORKNUM, vm_nblocks_now, + (char *) pg, rel->rd_istemp); + vm_nblocks_now++; + } + + UnlockRelationForExtension(rel, ExclusiveLock); + + pfree(pg); + + /* Update the relcache with the up-to-date size */ + if (!InRecovery) + CacheInvalidateRelcache(rel); + rel->rd_vm_nblocks = vm_nblocks_now; + } *** src/backend/access/transam/xlogutils.c --- src/backend/access/transam/xlogutils.c *************** *** 377,382 **** CreateFakeRelcacheEntry(RelFileNode rnode) --- 377,383 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks = InvalidBlockNumber; + rel->rd_vm_nblocks = InvalidBlockNumber; rel->rd_smgr = NULL; return rel; *** src/backend/catalog/catalog.c --- src/backend/catalog/catalog.c *************** *** 54,60 **** */ const char *forkNames[] = { "main", /* MAIN_FORKNUM */ ! "fsm" /* FSM_FORKNUM */ }; /* --- 54,61 ---- */ const char *forkNames[] = { "main", /* MAIN_FORKNUM */ ! "fsm", /* FSM_FORKNUM */ ! 
"vm" /* VISIBILITYMAP_FORKNUM */ }; /* *** src/backend/catalog/storage.c --- src/backend/catalog/storage.c *************** *** 19,24 **** --- 19,25 ---- #include "postgres.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlogutils.h" #include "catalog/catalog.h" *************** *** 175,180 **** void --- 176,182 ---- RelationTruncate(Relation rel, BlockNumber nblocks) { bool fsm; + bool vm; /* Open it at the smgr level if not already done */ RelationOpenSmgr(rel); *************** *** 187,192 **** RelationTruncate(Relation rel, BlockNumber nblocks) --- 189,199 ---- if (fsm) FreeSpaceMapTruncateRel(rel, nblocks); + /* Truncate the visibility map too if it exists. */ + vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM); + if (vm) + visibilitymap_truncate(rel, nblocks); + /* * We WAL-log the truncation before actually truncating, which * means trouble if the truncation fails. If we then crash, the WAL *************** *** 217,228 **** RelationTruncate(Relation rel, BlockNumber nblocks) /* * Flush, because otherwise the truncation of the main relation ! * might hit the disk before the WAL record of truncating the ! * FSM is flushed. If we crashed during that window, we'd be ! * left with a truncated heap, but the FSM would still contain ! * entries for the non-existent heap pages. */ ! if (fsm) XLogFlush(lsn); } --- 224,235 ---- /* * Flush, because otherwise the truncation of the main relation ! * might hit the disk before the WAL record, and the truncation of ! * the FSM or visibility map. If we crashed during that window, we'd ! * be left with a truncated heap, but the FSM or visibility map would ! * still contain entries for the non-existent heap pages. */ ! 
if (fsm || vm) XLogFlush(lsn); } *** src/backend/commands/vacuum.c --- src/backend/commands/vacuum.c *************** *** 26,31 **** --- 26,32 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "access/xact.h" #include "access/xlog.h" #include "catalog/namespace.h" *************** *** 2902,2907 **** move_chain_tuple(Relation rel, --- 2903,2914 ---- Size tuple_len = old_tup->t_len; /* + * Clear the bits in the visibility map. + */ + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + + /* * make a modifiable copy of the source tuple. */ heap_copytuple_with_tuple(old_tup, &newtup); *************** *** 3005,3010 **** move_chain_tuple(Relation rel, --- 3012,3021 ---- END_CRIT_SECTION(); + PageClearAllVisible(BufferGetPage(old_buf)); + if (dst_buf != old_buf) + PageClearAllVisible(BufferGetPage(dst_buf)); + LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK); if (dst_buf != old_buf) LockBuffer(old_buf, BUFFER_LOCK_UNLOCK); *************** *** 3107,3112 **** move_plain_tuple(Relation rel, --- 3118,3140 ---- END_CRIT_SECTION(); + /* + * Clear the visible-to-all hint bits on the page, and bits in the + * visibility map. Normally we'd release the locks on the heap pages + * before updating the visibility map, but it doesn't really matter here + * because we're holding an AccessExclusiveLock on the relation anyway. 
+ */ + if (PageIsAllVisible(dst_page)) + { + PageClearAllVisible(dst_page); + visibilitymap_clear(rel, BufferGetBlockNumber(dst_buf)); + } + if (PageIsAllVisible(old_page)) + { + PageClearAllVisible(old_page); + visibilitymap_clear(rel, BufferGetBlockNumber(old_buf)); + } + dst_vacpage->free = PageGetFreeSpaceWithFillFactor(rel, dst_page); LockBuffer(dst_buf, BUFFER_LOCK_UNLOCK); LockBuffer(old_buf, BUFFER_LOCK_UNLOCK); *** src/backend/commands/vacuumlazy.c --- src/backend/commands/vacuumlazy.c *************** *** 40,45 **** --- 40,46 ---- #include "access/genam.h" #include "access/heapam.h" #include "access/transam.h" + #include "access/visibilitymap.h" #include "catalog/storage.h" #include "commands/dbcommands.h" #include "commands/vacuum.h" *************** *** 88,93 **** typedef struct LVRelStats --- 89,95 ---- int max_dead_tuples; /* # slots allocated in array */ ItemPointer dead_tuples; /* array of ItemPointerData */ int num_index_scans; + bool scanned_all; /* have we scanned all pages (this far)? */ } LVRelStats; *************** *** 102,108 **** static BufferAccessStrategy vac_strategy; /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, --- 104,110 ---- /* non-export function prototypes */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! 
Relation *Irel, int nindexes, bool scan_all); static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats); static void lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats, *************** *** 141,146 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, --- 143,149 ---- BlockNumber possibly_freeable; PGRUsage ru0; TimestampTz starttime = 0; + bool scan_all; pg_rusage_init(&ru0); *************** *** 161,173 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats)); vacrelstats->num_index_scans = 0; /* Open all indexes of the relation */ vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); --- 164,183 ---- vacrelstats = (LVRelStats *) palloc0(sizeof(LVRelStats)); vacrelstats->num_index_scans = 0; + vacrelstats->scanned_all = true; /* Open all indexes of the relation */ vac_open_indexes(onerel, RowExclusiveLock, &nindexes, &Irel); vacrelstats->hasindex = (nindexes > 0); + /* Should we use the visibility map or scan all pages? */ + if (vacstmt->freeze_min_age != -1) + scan_all = true; + else + scan_all = false; + /* Do the vacuuming */ ! lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all); /* Done with indexes */ vac_close_indexes(nindexes, Irel, NoLock); *************** *** 186,195 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, /* Vacuum the Free Space Map */ FreeSpaceMapVacuum(onerel); ! /* Update statistics in pg_class */ vac_update_relstats(onerel, vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, FreezeLimit); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, --- 196,209 ---- /* Vacuum the Free Space Map */ FreeSpaceMapVacuum(onerel); ! /* ! 
* Update statistics in pg_class. We can only advance relfrozenxid if we ! * didn't skip any pages. ! */ vac_update_relstats(onerel, vacrelstats->rel_pages, vacrelstats->rel_tuples, ! vacrelstats->hasindex, ! vacrelstats->scanned_all ? FreezeLimit : InvalidTransactionId); /* report results to the stats collector, too */ pgstat_report_vacuum(RelationGetRelid(onerel), onerel->rd_rel->relisshared, *************** *** 230,242 **** lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes) { BlockNumber nblocks, blkno; HeapTupleData tuple; char *relname; BlockNumber empty_pages, vacuumed_pages; double num_tuples, tups_vacuumed, --- 244,257 ---- */ static void lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, ! Relation *Irel, int nindexes, bool scan_all) { BlockNumber nblocks, blkno; HeapTupleData tuple; char *relname; BlockNumber empty_pages, + scanned_pages, vacuumed_pages; double num_tuples, tups_vacuumed, *************** *** 245,250 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 260,266 ---- IndexBulkDeleteResult **indstats; int i; PGRUsage ru0; + Buffer vmbuffer = InvalidBuffer; pg_rusage_init(&ru0); *************** *** 254,260 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, get_namespace_name(RelationGetNamespace(onerel)), relname))); ! empty_pages = vacuumed_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) --- 270,276 ---- get_namespace_name(RelationGetNamespace(onerel)), relname))); ! 
empty_pages = vacuumed_pages = scanned_pages = 0; num_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) *************** *** 278,286 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 294,321 ---- OffsetNumber frozen[MaxOffsetNumber]; int nfrozen; Size freespace; + bool all_visible_according_to_vm = false; + bool all_visible; + + /* + * Skip pages that don't require vacuuming according to the + * visibility map. + */ + if (!scan_all) + { + all_visible_according_to_vm = + visibilitymap_test(onerel, blkno, &vmbuffer); + if (all_visible_according_to_vm) + { + vacrelstats->scanned_all = false; + continue; + } + } vacuum_delay_point(); + scanned_pages++; + /* * If we are close to overrunning the available space for dead-tuple * TIDs, pause and do a cycle of vacuuming before we tackle this page. *************** *** 354,360 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, { empty_pages++; freespace = PageGetHeapFreeSpace(page); ! UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); continue; } --- 389,414 ---- { empty_pages++; freespace = PageGetHeapFreeSpace(page); ! ! if (!PageIsAllVisible(page)) ! { ! SetBufferCommitInfoNeedsSave(buf); ! PageSetAllVisible(page); ! } ! ! LockBuffer(buf, BUFFER_LOCK_UNLOCK); ! ! /* Update the visibility map */ ! if (!all_visible_according_to_vm) ! { ! visibilitymap_pin(onerel, blkno, &vmbuffer); ! LockBuffer(buf, BUFFER_LOCK_SHARE); ! if (PageIsAllVisible(page)) ! visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer); ! LockBuffer(buf, BUFFER_LOCK_UNLOCK); ! } ! ! ReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); continue; } *************** *** 371,376 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 425,431 ---- * Now scan the page to collect vacuumable items and check for tuples * requiring freezing. 
*/ + all_visible = true; nfrozen = 0; hastup = false; prev_dead_count = vacrelstats->num_dead_tuples; *************** *** 408,413 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 463,469 ---- if (ItemIdIsDead(itemid)) { lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); + all_visible = false; continue; } *************** *** 442,447 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 498,504 ---- nkeep += 1; else tupgone = true; /* we can delete the tuple */ + all_visible = false; break; case HEAPTUPLE_LIVE: /* Tuple is good --- but let's do some validity checks */ *************** *** 449,454 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 506,540 ---- !OidIsValid(HeapTupleGetOid(&tuple))) elog(WARNING, "relation \"%s\" TID %u/%u: OID is invalid", relname, blkno, offnum); + + /* + * Is the tuple definitely visible to all transactions? + * + * NB: Like with per-tuple hint bits, we can't set the + * flag if the inserter committed asynchronously. See + * SetHintBits for more info. Check that the + * HEAP_XMIN_COMMITTED hint bit is set because of that. + */ + if (all_visible) + { + TransactionId xmin; + + if (!(tuple.t_data->t_infomask & HEAP_XMIN_COMMITTED)) + { + all_visible = false; + break; + } + /* + * The inserter definitely committed. But is it + * old enough that everyone sees it as committed? + */ + xmin = HeapTupleHeaderGetXmin(tuple.t_data); + if (!TransactionIdPrecedes(xmin, OldestXmin)) + { + all_visible = false; + break; + } + } break; case HEAPTUPLE_RECENTLY_DEAD: *************** *** 457,468 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 543,557 ---- * from relation. 
*/ nkeep += 1; + all_visible = false; break; case HEAPTUPLE_INSERT_IN_PROGRESS: /* This is an expected case during concurrent vacuum */ + all_visible = false; break; case HEAPTUPLE_DELETE_IN_PROGRESS: /* This is an expected case during concurrent vacuum */ + all_visible = false; break; default: elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); *************** *** 525,536 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, freespace = PageGetHeapFreeSpace(page); /* Remember the location of the last page with nonremovable tuples */ if (hastup) vacrelstats->nonempty_pages = blkno + 1; - UnlockReleaseBuffer(buf); - /* * If we remembered any tuples for deletion, then the page will be * visited again by lazy_vacuum_heap, which will compute and record --- 614,656 ---- freespace = PageGetHeapFreeSpace(page); + /* Update the all-visible flag on the page */ + if (!PageIsAllVisible(page) && all_visible) + { + SetBufferCommitInfoNeedsSave(buf); + PageSetAllVisible(page); + } + else if (PageIsAllVisible(page) && !all_visible) + { + elog(WARNING, "PD_ALL_VISIBLE flag was incorrectly set"); + SetBufferCommitInfoNeedsSave(buf); + PageClearAllVisible(page); + + /* + * XXX: Normally, we would drop the lock on the heap page before + * updating the visibility map. 
+ */ + visibilitymap_clear(onerel, blkno); + } + + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + + /* Update the visibility map */ + if (!all_visible_according_to_vm && all_visible) + { + visibilitymap_pin(onerel, blkno, &vmbuffer); + LockBuffer(buf, BUFFER_LOCK_SHARE); + if (PageIsAllVisible(page)) + visibilitymap_set(onerel, blkno, PageGetLSN(page), &vmbuffer); + LockBuffer(buf, BUFFER_LOCK_UNLOCK); + } + + ReleaseBuffer(buf); + /* Remember the location of the last page with nonremovable tuples */ if (hastup) vacrelstats->nonempty_pages = blkno + 1; /* * If we remembered any tuples for deletion, then the page will be * visited again by lazy_vacuum_heap, which will compute and record *************** *** 560,565 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, --- 680,692 ---- vacrelstats->num_index_scans++; } + /* Release the pin on the visibility map page */ + if (BufferIsValid(vmbuffer)) + { + ReleaseBuffer(vmbuffer); + vmbuffer = InvalidBuffer; + } + /* Do post-vacuum cleanup and statistics update for each index */ for (i = 0; i < nindexes; i++) lazy_cleanup_index(Irel[i], indstats[i], vacrelstats); *************** *** 572,580 **** lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats, tups_vacuumed, vacuumed_pages))); ereport(elevel, ! (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u pages", RelationGetRelationName(onerel), ! tups_vacuumed, num_tuples, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" --- 699,707 ---- tups_vacuumed, vacuumed_pages))); ereport(elevel, ! (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages", RelationGetRelationName(onerel), ! 
tups_vacuumed, num_tuples, scanned_pages, nblocks), errdetail("%.0f dead row versions cannot be removed yet.\n" "There were %.0f unused item pointers.\n" "%u pages are entirely empty.\n" *************** *** 623,628 **** lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats) --- 750,764 ---- LockBufferForCleanup(buf); tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats); + /* + * Before we let the page go, prune it. The primary reason is to + * update the visibility map in the common special case that we just + * vacuumed away the last tuple on the page that wasn't visible to + * everyone. + */ + vacrelstats->tuples_deleted += + heap_page_prune(onerel, buf, OldestXmin, false, false); + /* Now that we've compacted the page, record its available space */ page = BufferGetPage(buf); freespace = PageGetHeapFreeSpace(page); *** src/backend/utils/cache/relcache.c --- src/backend/utils/cache/relcache.c *************** *** 305,310 **** AllocateRelationDesc(Relation relation, Form_pg_class relp) --- 305,311 ---- MemSet(relation, 0, sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks = InvalidBlockNumber; + relation->rd_vm_nblocks = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1377,1382 **** formrdesc(const char *relationName, Oid relationReltype, --- 1378,1384 ---- relation = (Relation) palloc0(sizeof(RelationData)); relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks = InvalidBlockNumber; + relation->rd_vm_nblocks = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ relation->rd_smgr = NULL; *************** *** 1665,1673 **** RelationReloadIndexInfo(Relation relation) heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! 
/* Must reset targblock and fsm_nblocks in case rel was truncated */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); --- 1667,1679 ---- heap_freetuple(pg_class_tuple); /* We must recalculate physical address in case it changed */ RelationInitPhysicalAddr(relation); ! /* ! * Must reset targblock, fsm_nblocks and vm_nblocks in case rel was ! * truncated ! */ relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks = InvalidBlockNumber; + relation->rd_vm_nblocks = InvalidBlockNumber; /* Must free any AM cached data, too */ if (relation->rd_amcache) pfree(relation->rd_amcache); *************** *** 1751,1756 **** RelationClearRelation(Relation relation, bool rebuild) --- 1757,1763 ---- { relation->rd_targblock = InvalidBlockNumber; relation->rd_fsm_nblocks = InvalidBlockNumber; + relation->rd_vm_nblocks = InvalidBlockNumber; if (relation->rd_rel->relkind == RELKIND_INDEX) { relation->rd_isvalid = false; /* needs to be revalidated */ *************** *** 2346,2351 **** RelationBuildLocalRelation(const char *relname, --- 2353,2359 ---- rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks = InvalidBlockNumber; + rel->rd_vm_nblocks = InvalidBlockNumber; /* make sure relation is marked as having no open file yet */ rel->rd_smgr = NULL; *************** *** 3603,3608 **** load_relcache_init_file(void) --- 3611,3617 ---- rel->rd_smgr = NULL; rel->rd_targblock = InvalidBlockNumber; rel->rd_fsm_nblocks = InvalidBlockNumber; + rel->rd_vm_nblocks = InvalidBlockNumber; if (rel->rd_isnailed) rel->rd_refcnt = 1; else *** src/include/access/heapam.h --- src/include/access/heapam.h *************** *** 153,158 **** extern void heap_page_prune_execute(Buffer buffer, --- 153,159 ---- OffsetNumber *nowunused, int nunused, bool redirect_move); extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets); + extern void 
heap_page_update_all_visible(Buffer buffer); /* in heap/syncscan.c */ extern void ss_report_location(Relation rel, BlockNumber location); *** src/include/access/htup.h --- src/include/access/htup.h *************** *** 601,609 **** typedef struct xl_heaptid typedef struct xl_heap_delete { xl_heaptid target; /* deleted tuple id */ } xl_heap_delete; ! #define SizeOfHeapDelete (offsetof(xl_heap_delete, target) + SizeOfHeapTid) /* * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted --- 601,610 ---- typedef struct xl_heap_delete { xl_heaptid target; /* deleted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ } xl_heap_delete; ! #define SizeOfHeapDelete (offsetof(xl_heap_delete, all_visible_cleared) + sizeof(bool)) /* * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted *************** *** 626,646 **** typedef struct xl_heap_header typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! #define SizeOfHeapInsert (offsetof(xl_heap_insert, target) + SizeOfHeapTid) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; ! #define SizeOfHeapUpdate (offsetof(xl_heap_update, newtid) + SizeOfIptrData) /* * This is what we need to know about vacuum page cleanup/redirect --- 627,650 ---- typedef struct xl_heap_insert { xl_heaptid target; /* inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ /* xl_heap_header & TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_insert; ! 
#define SizeOfHeapInsert (offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool)) /* This is what we need to know about update|move|hot_update */ typedef struct xl_heap_update { xl_heaptid target; /* deleted tuple id */ ItemPointerData newtid; /* new inserted tuple id */ + bool all_visible_cleared; /* PD_ALL_VISIBLE was cleared */ + bool new_all_visible_cleared; /* same for the page of newtid */ /* NEW TUPLE xl_heap_header (PLUS xmax & xmin IF MOVE OP) */ /* and TUPLE DATA FOLLOWS AT END OF STRUCT */ } xl_heap_update; ! #define SizeOfHeapUpdate (offsetof(xl_heap_update, new_all_visible_cleared) + sizeof(bool)) /* * This is what we need to know about vacuum page cleanup/redirect *** /dev/null --- src/include/access/visibilitymap.h *************** *** 0 **** --- 1,30 ---- + /*------------------------------------------------------------------------- + * + * visibilitymap.h + * visibility map interface + * + * + * Portions Copyright (c) 2007, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * $PostgreSQL$ + * + *------------------------------------------------------------------------- + */ + #ifndef VISIBILITYMAP_H + #define VISIBILITYMAP_H + + #include "utils/rel.h" + #include "storage/buf.h" + #include "storage/itemptr.h" + #include "access/xlogdefs.h" + + extern void visibilitymap_clear(Relation rel, BlockNumber heapBlk); + extern void visibilitymap_pin(Relation rel, BlockNumber heapBlk, + Buffer *vmbuf); + extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, + XLogRecPtr recptr, Buffer *vmbuf); + extern bool visibilitymap_test(Relation rel, BlockNumber heapBlk, Buffer *vmbuf); + extern void visibilitymap_truncate(Relation rel, BlockNumber heapblk); + + #endif /* VISIBILITYMAP_H */ *** src/include/storage/bufpage.h --- src/include/storage/bufpage.h *************** *** 152,159 **** typedef PageHeaderData *PageHeader; #define PD_HAS_FREE_LINES 0x0001 /* are there any unused line 
pointers? */ #define PD_PAGE_FULL 0x0002 /* not enough free space for new * tuple? */ ! #define PD_VALID_FLAG_BITS 0x0003 /* OR of all valid pd_flags bits */ /* * Page layout version number 0 is for pre-7.3 Postgres releases. --- 152,161 ---- #define PD_HAS_FREE_LINES 0x0001 /* are there any unused line pointers? */ #define PD_PAGE_FULL 0x0002 /* not enough free space for new * tuple? */ + #define PD_ALL_VISIBLE 0x0004 /* all tuples on page are visible to + * everyone */ ! #define PD_VALID_FLAG_BITS 0x0007 /* OR of all valid pd_flags bits */ /* * Page layout version number 0 is for pre-7.3 Postgres releases. *************** *** 336,341 **** typedef PageHeaderData *PageHeader; --- 338,350 ---- #define PageClearFull(page) \ (((PageHeader) (page))->pd_flags &= ~PD_PAGE_FULL) + #define PageIsAllVisible(page) \ + (((PageHeader) (page))->pd_flags & PD_ALL_VISIBLE) + #define PageSetAllVisible(page) \ + (((PageHeader) (page))->pd_flags |= PD_ALL_VISIBLE) + #define PageClearAllVisible(page) \ + (((PageHeader) (page))->pd_flags &= ~PD_ALL_VISIBLE) + #define PageIsPrunable(page, oldestxmin) \ ( \ AssertMacro(TransactionIdIsNormal(oldestxmin)), \ *** src/include/storage/relfilenode.h --- src/include/storage/relfilenode.h *************** *** 24,37 **** typedef enum ForkNumber { InvalidForkNumber = -1, MAIN_FORKNUM = 0, ! FSM_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM below and update the * forkNames array in catalog.c */ } ForkNumber; ! #define MAX_FORKNUM FSM_FORKNUM /* * RelFileNode must provide all that we need to know to physically access --- 24,38 ---- { InvalidForkNumber = -1, MAIN_FORKNUM = 0, ! FSM_FORKNUM, ! VISIBILITYMAP_FORKNUM /* * NOTE: if you add a new fork, change MAX_FORKNUM below and update the * forkNames array in catalog.c */ } ForkNumber; ! 
#define MAX_FORKNUM VISIBILITYMAP_FORKNUM /* * RelFileNode must provide all that we need to know to physically access *** src/include/utils/rel.h --- src/include/utils/rel.h *************** *** 195,202 **** typedef struct RelationData List *rd_indpred; /* index predicate tree, if any */ void *rd_amcache; /* available for use by index AM */ ! /* size of the FSM, or InvalidBlockNumber if not known yet */ BlockNumber rd_fsm_nblocks; /* use "struct" here to avoid needing to include pgstat.h: */ struct PgStat_TableStatus *pgstat_info; /* statistics collection area */ --- 195,206 ---- List *rd_indpred; /* index predicate tree, if any */ void *rd_amcache; /* available for use by index AM */ ! /* ! * sizes of the free space and visibility map forks, or InvalidBlockNumber ! * if not known yet ! */ BlockNumber rd_fsm_nblocks; + BlockNumber rd_vm_nblocks; /* use "struct" here to avoid needing to include pgstat.h: */ struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
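The relationship between the page-level flag and the map bit described in this patch can be illustrated with a small standalone sketch. The flag values are copied from the bufpage.h hunk above; `ToyPage` and the `toy_*` functions are hypothetical stand-ins invented for illustration, and locking, WAL logging and buffer handling are deliberately not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as they appear in the bufpage.h hunk of the patch. */
#define PD_HAS_FREE_LINES 0x0001
#define PD_PAGE_FULL      0x0002
#define PD_ALL_VISIBLE    0x0004

/* Hypothetical toy model: one heap page header plus its visibility map bit. */
typedef struct ToyPage
{
    uint16_t pd_flags;  /* page header flag bits */
    bool     vm_bit;    /* the page's bit in the visibility map */
} ToyPage;

/* Invariant from the thread: if PD_ALL_VISIBLE is not set, the map bit must not be set. */
bool
toy_invariant_holds(const ToyPage *p)
{
    return (p->pd_flags & PD_ALL_VISIBLE) != 0 || !p->vm_bit;
}

/* Vacuum found every tuple visible to everyone: set the page flag, then the map bit. */
void
toy_mark_all_visible(ToyPage *p)
{
    p->pd_flags |= PD_ALL_VISIBLE;
    p->vm_bit = true;
}

/*
 * A tuple on the page is inserted, updated or deleted.  The map is touched
 * only when PD_ALL_VISIBLE was set; returns true in that case.
 */
bool
toy_modify_page(ToyPage *p)
{
    if ((p->pd_flags & PD_ALL_VISIBLE) == 0)
        return false;           /* common case under heavy updates: map untouched */

    p->vm_bit = false;
    p->pd_flags &= (uint16_t) ~PD_ALL_VISIBLE;
    return true;
}
```

Only the first modification after a vacuum returns true from `toy_modify_page`; later ones see the flag already cleared and skip the map, which is the cost argument made at the top of the thread.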
Heikki Linnakangas wrote: > Here's an updated version, with a lot of smaller cleanups, and using > relcache invalidation to notify other backends when the visibility map > fork is extended. I already committed the change to FSM to do the same. > I'm feeling quite satisfied to commit this patch early next week. Committed. I haven't done any doc changes for this yet. I think a short section in the "database internal storage" chapter is probably in order, and the fact that plain VACUUM skips pages should be mentioned somewhere. I'll skim through references to vacuum and see what needs to be changed. Hmm. It just occurred to me that I think this circumvented the anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM FREEZE does that already, but it's unnecessarily aggressive in freezing. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

> Hmm. It just occurred to me that I think this circumvented the anti-wraparound
> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
> FREEZE does that already, but it's unnecessarily aggressive in freezing.

Having seen how the anti-wraparound vacuums work in the field I think merely replacing them with a regular vacuum which covers the whole table will not actually work well.

What will happen is that, because nothing else is advancing the relfrozenxid, the age of the relfrozenxid for all tables will advance until they all hit autovacuum_max_freeze_age. Quite often all the tables were created around the same time so they will all hit autovacuum_max_freeze_age at the same time.

So a database which was operating fine and receiving regular vacuums at a reasonable pace will suddenly be hit by vacuums for every table all at the same time, 3 at a time. If you don't have vacuum_cost_delay set that will cause a major issue. Even if you do have vacuum_cost_delay set it will prevent the small busy tables from getting vacuumed regularly due to the backlog in anti-wraparound vacuums.

Worse, vacuum will set the freeze_xid to nearly the same value for all of the tables. So it will all happen again in another 100M transactions. And again in another 100M transactions, and again...

I think there are several things which need to happen here.

1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just means unnecessary full table vacuums long before they accomplish anything.

2) Include a factor which spreads out the anti-wraparound freezes in the autovacuum launcher. Some ideas:

   . we could implicitly add random(vacuum_freeze_min_age) to the autovacuum_max_freeze_age. That would spread them out evenly over 100M transactions.

   . we could check if another anti-wraparound vacuum is still running and implicitly add a vacuum_freeze_min_age penalty to the autovacuum_max_freeze_age for each running anti-wraparound vacuum. That would spread them out without introducing non-determinism, which seems better.

   . we could leave autovacuum_max_freeze_age and instead pick a semi-random vacuum_freeze_min_age. This would mean the first set of anti-wraparound vacuums would still be synchronized but subsequent ones might be spread out somewhat. There's not as much room to randomize this though and it would affect how much i/o vacuum did, which makes it seem less palatable to me.

3) I also think we need to put a clamp on the vacuum_cost_delay. Too many people are setting it to unreasonably high values, which results in their vacuums never completing. Actually I think what we should do is junk all the existing parameters and replace them with a vacuum_nice_level or vacuum_bandwidth_cap from which we calculate the cost_limit and hide all the other parameters as internal parameters.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
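The first two spreading ideas in point 2 can be sketched roughly as follows. This is only an illustration under stated assumptions: the 200M and 100M constants are the values quoted in the message, the function names are hypothetical, and the "random" variant substitutes a deterministic hash of a per-table seed (for example a table OID) for a real random number so that the sketch is reproducible.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values quoted in the thread, treated here as plain constants. */
#define FREEZE_MAX_AGE 200000000u   /* forced-vacuum threshold, 200M */
#define FREEZE_MIN_AGE 100000000u   /* vacuum_freeze_min_age, 100M */

/*
 * Idea 1 (hypothetical helper): add a per-table offset in
 * [0, FREEZE_MIN_AGE) to the threshold.  One linear congruential step on
 * the seed stands in for random(); unsigned overflow is well defined.
 */
uint64_t
jittered_threshold(uint32_t seed)
{
    uint32_t r = seed * 1664525u + 1013904223u;

    return (uint64_t) FREEZE_MAX_AGE + r % FREEZE_MIN_AGE;
}

/*
 * Idea 2 (hypothetical helper): add one FREEZE_MIN_AGE penalty for each
 * anti-wraparound vacuum that is already running.
 */
uint64_t
penalized_threshold(uint32_t running_wraparound_vacuums)
{
    return (uint64_t) FREEZE_MAX_AGE
        + (uint64_t) running_wraparound_vacuums * FREEZE_MIN_AGE;
}

/* A table is due for a forced vacuum once its relfrozenxid age exceeds the threshold. */
bool
needs_wraparound_vacuum(uint64_t relfrozenxid_age, uint64_t threshold)
{
    return relfrozenxid_age > threshold;
}
```

With one forced vacuum already running, a table at age 250M is not yet due under the penalty variant (its threshold becomes 300M), so the second forced vacuum is deferred rather than started alongside the first.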
Heikki Linnakangas wrote: > Hmm. It just occurred to me that I think this circumvented the > anti-wraparound vacuuming: a normal vacuum doesn't advance relfrozenxid > anymore. We'll need to disable the skipping when autovacuum is triggered > to prevent wraparound. VACUUM FREEZE does that already, but it's > unnecessarily aggressive in freezing. Heh :-) Yes, this should be handled sanely, without having to invoke FREEZE. -- Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Gregory Stark wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> Hmm. It just occurred to me that I think this circumvented the anti-wraparound >> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to >> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM >> FREEZE does that already, but it's unnecessarily aggressive in freezing. > > Having seen how the anti-wraparound vacuums work in the field I think merely > replacing it with a regular vacuum which covers the whole table will not > actually work well. > > What will happen is that, because nothing else is advancing the relfrozenxid, > the age of the relfrozenxid for all tables will advance until they all hit > autovacuum_max_freeze_age. Quite often all the tables were created around the > same time so they will all hit autovacuum_max_freeze_age at the same time. > > So a database which was operating fine and receiving regular vacuums at a > reasonable pace will suddenly be hit by vacuums for every table all at the > same time, 3 at a time. If you don't have vacuum_cost_delay set that will > cause a major issue. Even if you do have vacuum_cost_delay set it will prevent > the small busy tables from getting vacuumed regularly due to the backlog in > anti-wraparound vacuums. > > Worse, vacuum will set the freeze_xid to nearly the same value for all of the > tables. So it will all happen again in another 100M transactions. And again in > another 100M transactions, and again... > > I think there are several things which need to happen here. > > 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just > means unnecessary full table vacuums long before they accomplish anything. > > 2) Include a factor which spreads out the anti-wraparound freezes in the > autovacuum launcher. Some ideas: > > . we could implicitly add random(vacuum_freeze_min_age) to the > autovacuum_max_freeze_age. 
That would spread them out evenly over 100M > transactions. > > . we could check if another anti-wraparound vacuum is still running and > implicitly add a vacuum_freeze_min_age penalty to the > autovacuum_max_freeze_age for each running anti-wraparound vacuum. That > would spread them out without introducing non-determinism which > seems better. > > . we could leave autovacuum_max_freeze_age and instead pick a semi-random > vacuum_freeze_min_age. This would mean the first set of anti-wraparound > vacuums would still be synchronized but subsequent ones might be spread > out somewhat. There's not as much room to randomize this though and it > would affect how much i/o vacuum did which makes it seem less palatable > to me. How about a way to say that only one (or a config parameter for <n>) of the autovac workers can be used for anti-wraparound vacuum? Then the other slots would still be available for the small-but-frequently-updated tables. > 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many > people are setting it to unreasonably high values which results in their > vacuums never completing. Actually I think what we should do is junk all > the existing parameters and replace them with a vacuum_nice_level or > vacuum_bandwidth_cap from which we calculate the cost_limit and hide all > the other parameters as internal parameters. It would certainly be helpful if it was just a single parameter - the arbitrariness of the parameters there now makes them pretty hard to set properly - or at least easy to set wrong. //Magnus
Gregory Stark wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> Hmm. It just occurred to me that I think this circumvented the anti-wraparound >> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to >> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM >> FREEZE does that already, but it's unnecessarily aggressive in freezing. FWIW, it seems the omission is actually the other way 'round. Autovacuum always forces a full-scanning vacuum, making the visibility map useless for autovacuum. This obviously needs to be fixed. > What will happen is that, because nothing else is advancing the relfrozenxid, > the age of the relfrozenxid for all tables will advance until they all hit > autovacuum_max_freeze_age. Quite often all the tables were created around the > same time so they will all hit autovacuum_max_freeze_age at the same time. > > So a database which was operating fine and receiving regular vacuums at a > reasonable pace will suddenly be hit by vacuums for every table all at the > same time, 3 at a time. If you don't have vacuum_cost_delay set that will > cause a major issue. Even if you do have vacuum_cost_delay set it will prevent > the small busy tables from getting vacuumed regularly due to the backlog in > anti-wraparound vacuums. > > Worse, vacuum will set the freeze_xid to nearly the same value for all of the > tables. So it will all happen again in another 100M transactions. And again in > another 100M transactions, and again... But we already have that problem, don't we? When you initially load your database, all tuples will have the same xmin, and all tables will have more or less the same relfrozenxid. I guess you can argue that it becomes more obvious if vacuums are otherwise cheaper, but I don't think the visibility map makes that much difference to suddenly make this issue urgent. Agreed that it would be nice to do something about it, though. 
> I think there are several things which need to happen here. > > 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just > means unnecessary full table vacuums long before they accomplish anything. It allows you to truncate clog. If I did my math right, 200M transactions amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap after all. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > Gregory Stark wrote: >> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> >>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound >>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to >>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM >>> FREEZE does that already, but it's unnecessarily aggressive in freezing. > > FWIW, it seems the omission is actually the other way 'round. Autovacuum always > forces a full-scanning vacuum, making the visibility map useless for > autovacuum. This obviously needs to be fixed. How does it do that? Is there some option in the VacStmt to control this? Do we just need a syntax to set that option? How easy is it to tell what percentage of the table needs to be vacuumed? If it's > 50% perhaps it would make sense to scan the whole table? (Hm. Not really if it's a contiguous 50% though...) Another idea: Perhaps each page of the visibility map should have a frozenxid (or multiple frozenxids?). Then if an individual page of the visibility map is old we could force scanning all the heap pages covered by that map page and update it. I'm not sure we can do that safely though without locking issues -- or is it ok because it's vacuum doing the updating? >> Worse, vacuum will set the freeze_xid to nearly the same value for all of the >> tables. So it will all happen again in another 100M transactions. And again in >> another 100M transactions, and again... > > But we already have that problem, don't we? When you initially load your > database, all tuples will have the same xmin, and all tables will have more or > less the same relfrozenxid. I guess you can argue that it becomes more obvious > if vacuums are otherwise cheaper, but I don't think the visibility map makes > that much difference to suddenly make this issue urgent. 
We already have that problem but it only bites in a specific case: if you have no other vacuums being triggered by the regular dead tuple scale factor. The normal case is intended to be that autovacuum triggers much more frequently than every 100M transactions to reduce bloat. However in practice this specific case does seem to arise alarmingly easily. Most databases do have some large tables which are never deleted from or updated. Also, the default scale factor of 20% is actually quite easy to never reach if your tables are also growing quickly -- effectively moving the goalposts further out as fast as the updates and deletes bloat the table. The visibility map essentially widens this specific use case to cover *all* tables. Since the relfrozenxid would never get advanced by regular vacuums the only time it would get advanced is when they all hit the 200M wall simultaneously. > Agreed that it would be nice to do something about it, though. > >> I think there are several things which need to happen here. >> >> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just >> means unnecessary full table vacuums long before they accomplish anything. > > It allows you to truncate clog. If I did my math right, 200M transactions > amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap > after all. Ah. Hm. Then perhaps this belongs in the realm of the config generator people are working on. They'll need a dial to say how much disk space you expect your database to take in addition to how much memory your machine has available. 50M is nothing for a 1TB database but it's kind of silly to have to keep hundreds of megs of clogs on a 1MB database. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
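The "moving goalposts" effect can be made concrete with the usual autovacuum trigger rule, where the dead-tuple threshold is a base value plus a fraction of the table's row count. The sketch below assumes the stock defaults of 50 for autovacuum_vacuum_threshold and 0.2 for autovacuum_vacuum_scale_factor; the function names are hypothetical and the real launcher works from statistics that this toy version ignores.

```c
#include <stdbool.h>

/* Assumed defaults: autovacuum_vacuum_threshold = 50, scale factor = 0.2. */
#define VAC_BASE_THRESHOLD 50.0
#define VAC_SCALE_FACTOR   0.2

/* Dead tuples needed before a regular autovacuum is triggered. */
double
vacuum_trigger_threshold(double reltuples)
{
    return VAC_BASE_THRESHOLD + VAC_SCALE_FACTOR * reltuples;
}

/* True when the dead-tuple count exceeds the threshold for a table of this size. */
bool
autovacuum_would_fire(double dead_tuples, double reltuples)
{
    return dead_tuples > vacuum_trigger_threshold(reltuples);
}
```

For example, 250,000 dead tuples trigger a vacuum on a 1M-row table, but if the table has grown to 2M rows in the meantime the threshold has roughly doubled and the same 250,000 dead tuples no longer do.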
Gregory Stark wrote: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >> Gregory Stark wrote: >>> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: >>>> Hmm. It just occurred to me that I think this circumvented the anti-wraparound >>>> vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to >>>> disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM >>>> FREEZE does that already, but it's unnecessarily aggressive in freezing. >> FWIW, it seems the omission is actually the other way 'round. Autovacuum always >> forces a full-scanning vacuum, making the visibility map useless for >> autovacuum. This obviously needs to be fixed. > > How does it do that? Is there some option in the VacStmt to control this? Do > we just need a syntax to set that option? The way it works now is that if VacuumStmt->freeze_min_age is not -1 (which means "use the default"), the visibility map is not used and the whole table is scanned. Autovacuum always sets freeze_min_age, so it's never using the visibility map. Attached is a patch I'm considering to fix that. > How easy is it to tell what percentage of the table needs to be vacuumed? If > it's > 50% perhaps it would make sense to scan the whole table? (Hm. Not > really if it's a contiguous 50% though...) Hmm. You could scan the visibility map to see how much you could skip by using it. You could account for contiguity. > Another idea: Perhaps each page of the visibility map should have a frozenxid > (or multiple frozenxids?). Then if an individual page of the visibility map is > old we could force scanning all the heap pages covered by that map page and > update it. I'm not sure we can do that safely though without locking issues -- > or is it ok because it's vacuum doing the updating? 
We discussed that a while ago: http://archives.postgresql.org/message-id/492A6032.6080000@enterprisedb.com Tom was concerned about making the visibility map not just a hint but critical data. Rightly so. This is certainly 8.5 stuff; perhaps it would be more palatable after we get the index-only-scans working using the visibility map, since the map would be critical data anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c index fd2429a..3e3cb9d 100644 --- a/src/backend/commands/vacuumlazy.c +++ b/src/backend/commands/vacuumlazy.c @@ -171,10 +171,7 @@ lazy_vacuum_rel(Relation onerel, VacuumStmt *vacstmt, vacrelstats->hasindex = (nindexes > 0); /* Should we use the visibility map or scan all pages? */ - if (vacstmt->freeze_min_age != -1) - scan_all = true; - else - scan_all = false; + scan_all = vacstmt->scan_all; /* Do the vacuuming */ lazy_scan_heap(onerel, vacrelstats, Irel, nindexes, scan_all); diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c index eb7ab4d..2781f6e 100644 --- a/src/backend/nodes/copyfuncs.c +++ b/src/backend/nodes/copyfuncs.c @@ -2771,6 +2771,7 @@ _copyVacuumStmt(VacuumStmt *from) COPY_SCALAR_FIELD(analyze); COPY_SCALAR_FIELD(verbose); COPY_SCALAR_FIELD(freeze_min_age); + COPY_SCALAR_FIELD(scan_all); COPY_NODE_FIELD(relation); COPY_NODE_FIELD(va_cols); diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c index d4c57bb..86a032f 100644 --- a/src/backend/nodes/equalfuncs.c +++ b/src/backend/nodes/equalfuncs.c @@ -1436,6 +1436,7 @@ _equalVacuumStmt(VacuumStmt *a, VacuumStmt *b) COMPARE_SCALAR_FIELD(analyze); COMPARE_SCALAR_FIELD(verbose); COMPARE_SCALAR_FIELD(freeze_min_age); + COMPARE_SCALAR_FIELD(scan_all); COMPARE_NODE_FIELD(relation); COMPARE_NODE_FIELD(va_cols); diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 85f4616..1aab75c 100644 --- a/src/backend/parser/gram.y +++
b/src/backend/parser/gram.y @@ -5837,6 +5837,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose n->analyze = false; n->full = $2; n->freeze_min_age = $3 ? 0 : -1; + n->scan_all = $3; n->verbose = $4; n->relation = NULL; n->va_cols = NIL; @@ -5849,6 +5850,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose n->analyze = false; n->full = $2; n->freeze_min_age = $3 ? 0 : -1; + n->scan_all = $3; n->verbose = $4; n->relation = $5; n->va_cols = NIL; @@ -5860,6 +5862,7 @@ VacuumStmt: VACUUM opt_full opt_freeze opt_verbose n->vacuum = true; n->full = $2; n->freeze_min_age = $3 ? 0 : -1; + n->scan_all = $3; n->verbose |= $4; $$ = (Node *)n; } diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c index 8d8947f..2c68779 100644 --- a/src/backend/postmaster/autovacuum.c +++ b/src/backend/postmaster/autovacuum.c @@ -2649,6 +2649,7 @@ autovacuum_do_vac_analyze(autovac_table *tab, vacstmt.full = false; vacstmt.analyze = tab->at_doanalyze; vacstmt.freeze_min_age = tab->at_freeze_min_age; + vacstmt.scan_all = tab->at_wraparound; vacstmt.verbose = false; vacstmt.relation = NULL; /* not used since we pass a relid */ vacstmt.va_cols = NIL; diff --git a/src/include/nodes/parsenodes.h b/src/include/nodes/parsenodes.h index bb71ac1..df19f7e 100644 --- a/src/include/nodes/parsenodes.h +++ b/src/include/nodes/parsenodes.h @@ -1966,6 +1966,7 @@ typedef struct VacuumStmt bool full; /* do FULL (non-concurrent) vacuum */ bool analyze; /* do ANALYZE step */ bool verbose; /* print progress info */ + bool scan_all; /* force scan of all pages */ int freeze_min_age; /* min freeze age, or -1 to use default */ RangeVar *relation; /* single table to process, or NULL */ List *va_cols; /* list of column names, or NIL for all */
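The behavioural change in this follow-up patch can be summarised with a small standalone model. `ToyVacuumRequest` and the functions below are hypothetical simplifications of `VacuumStmt`, the grammar actions and `autovacuum_do_vac_analyze`; only the two fields relevant to the decision are kept, and the sample freeze age is an assumed value.

```c
#include <stdbool.h>

/* Hypothetical reduction of VacuumStmt to the two fields that matter here. */
typedef struct ToyVacuumRequest
{
    int  freeze_min_age;    /* min freeze age, or -1 to use the default */
    bool scan_all;          /* force scan of all pages (new field) */
} ToyVacuumRequest;

/* Before the patch: any explicit freeze_min_age disables the visibility map. */
bool
old_scan_all(const ToyVacuumRequest *req)
{
    return req->freeze_min_age != -1;
}

/* After the patch: the caller states explicitly whether to scan everything. */
bool
new_scan_all(const ToyVacuumRequest *req)
{
    return req->scan_all;
}

/* Manual VACUUM [FREEZE], mirroring the gram.y hunks. */
ToyVacuumRequest
manual_vacuum(bool freeze)
{
    ToyVacuumRequest req;

    req.freeze_min_age = freeze ? 0 : -1;
    req.scan_all = freeze;
    return req;
}

/* Autovacuum: always passes a freeze age, scans all only to prevent wraparound. */
ToyVacuumRequest
autovacuum_request(int at_freeze_min_age, bool at_wraparound)
{
    ToyVacuumRequest req;

    req.freeze_min_age = at_freeze_min_age;
    req.scan_all = at_wraparound;
    return req;
}
```

An ordinary autovacuum run, for instance `autovacuum_request(100000000, false)`, is a full scan under the old rule but may use the visibility map under the new one, while an anti-wraparound run still scans every page.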
Gregory Stark <stark@enterprisedb.com> writes: > Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > >> Gregory Stark wrote: >>> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just >>> means unnecessary full table vacuums long before they accomplish anything. >> >> It allows you to truncate clog. If I did my math right, 200M transactions >> amounts to ~50MB of clog. Perhaps we should still raise it, disk space is cheap >> after all. Hm, the more I think about it the more this bothers me. It's another subtle change from the current behaviour. Currently *every* vacuum tries to truncate the clog. So you're constantly trimming off a little bit. With the visibility map (assuming you fix it not to do full scans all the time) you can never truncate the clog, just as you can never advance the relfrozenxid, unless you do a special full-table vacuum. I think in practice most people had a read-only table somewhere in their database which prevented the clog from ever being truncated anyways, so perhaps this isn't such a big deal. But the bottom line is that the anti-wraparound vacuums are going to be a lot more important and much more visible now than they were in the past. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!
Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
when our wraparound limit is around 2B?

Also, is anything being done about the concern about 'vacuum storm'
explained below?

---------------------------------------------------------------------------

Gregory Stark wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>
> > Hmm. It just occurred to me that I think this circumvented the anti-wraparound
> > vacuuming: a normal vacuum doesn't advance relfrozenxid anymore. We'll need to
> > disable the skipping when autovacuum is triggered to prevent wraparound. VACUUM
> > FREEZE does that already, but it's unnecessarily aggressive in freezing.
>
> Having seen how the anti-wraparound vacuums work in the field I think merely
> replacing it with a regular vacuum which covers the whole table will not
> actually work well.
>
> What will happen is that, because nothing else is advancing the relfrozenxid,
> the age of the relfrozenxid for all tables will advance until they all hit
> autovacuum_max_freeze_age. Quite often all the tables were created around the
> same time so they will all hit autovacuum_max_freeze_age at the same time.
>
> So a database which was operating fine and receiving regular vacuums at a
> reasonable pace will suddenly be hit by vacuums for every table all at the
> same time, 3 at a time. If you don't have vacuum_cost_delay set that will
> cause a major issue. Even if you do have vacuum_cost_delay set it will prevent
> the small busy tables from getting vacuumed regularly due to the backlog in
> anti-wraparound vacuums.
>
> Worse, vacuum will set the freeze_xid to nearly the same value for all of the
> tables. So it will all happen again in another 100M transactions. And again in
> another 100M transactions, and again...
>
> I think there are several things which need to happen here.
>
> 1) Raise autovacuum_max_freeze_age to 400M or 800M. Having it at 200M just
>    means unnecessary full table vacuums long before they accomplish anything.
>
> 2) Include a factor which spreads out the anti-wraparound freezes in the
>    autovacuum launcher. Some ideas:
>
>     . we could implicitly add random(vacuum_freeze_min_age) to the
>       autovacuum_max_freeze_age. That would spread them out evenly over 100M
>       transactions.
>
>     . we could check if another anti-wraparound vacuum is still running and
>       implicitly add a vacuum_freeze_min_age penalty to the
>       autovacuum_max_freeze_age for each running anti-wraparound vacuum. That
>       would spread them out without introducing non-determinism which
>       seems better.
>
>     . we could leave autovacuum_max_freeze_age and instead pick a semi-random
>       vacuum_freeze_min_age. This would mean the first set of anti-wraparound
>       vacuums would still be synchronized but subsequent ones might be spread
>       out somewhat. There's not as much room to randomize this though and it
>       would affect how much i/o vacuum did which makes it seem less palatable
>       to me.
>
> 3) I also think we need to put a clamp on the vacuum_cost_delay. Too many
>    people are setting it to unreasonably high values which results in their
>    vacuums never completing. Actually I think what we should do is junk all
>    the existing parameters and replace it with a vacuum_nice_level or
>    vacuum_bandwidth_cap from which we calculate the cost_limit and hide all
>    the other parameters as internal parameters.
>
> --
> Gregory Stark
> EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
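[Editor's sketch] The first idea under point 2 in the quoted message, adding a random offset to the freeze threshold, can be illustrated with a small simulation. This is a minimal sketch and not PostgreSQL code: the function names, the table count, and the 100M jitter range (standing in for vacuum_freeze_min_age as used in the message) are assumptions made for illustration. It models tables that were all created at about the same time and compares the relfrozenxid ages at which each would trigger an anti-wraparound vacuum, with and without the random offset.

```python
import random

FREEZE_MAX_AGE = 200_000_000   # threshold discussed in the thread
JITTER_RANGE = 100_000_000     # assumed stand-in for vacuum_freeze_min_age


def trigger_ages(n_tables, jitter_range=0, seed=0):
    """Return the relfrozenxid age at which each table gets its forced vacuum.

    Hypothetical model: every table starts from the same relfrozenxid, so
    without jitter they all cross the threshold at the same moment.
    """
    rng = random.Random(seed)
    return [
        FREEZE_MAX_AGE + (rng.randrange(jitter_range) if jitter_range else 0)
        for _ in range(n_tables)
    ]


def spread(ages):
    """Width of the window (in transactions) over which the vacuums start."""
    return max(ages) - min(ages)


no_jitter = trigger_ages(50)
with_jitter = trigger_ages(50, JITTER_RANGE)

print("window without jitter:", spread(no_jitter))
print("window with jitter:   ", spread(with_jitter))
```

Without the offset the window is zero transactions wide, which is the "vacuum storm" described above; with it, the forced vacuums start at different points within the jitter range. The second idea in the message (a penalty per running anti-wraparound vacuum) would reach a similar spread deterministically and is not modelled here.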
Bruce Momjian wrote:
> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> when our wraparound limit is around 2B?

Presumably because of this (from the docs):

"The commit status uses two bits per transaction, so if
autovacuum_freeze_max_age has its maximum allowed value of a little less
than two billion, pg_clog can be expected to grow to about half a gigabyte."

cheers

andrew
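[Editor's sketch] The arithmetic behind the figures in this thread follows directly from the "two bits per transaction" statement in the quoted documentation. The short sketch below reproduces it; the helper name `clog_bytes` is made up for illustration, and 2^31 is used as an approximation of "a little less than two billion".

```python
BITS_PER_XACT = 2  # from the quoted documentation


def clog_bytes(n_transactions):
    """Bytes of commit-status data needed to cover n_transactions."""
    return n_transactions * BITS_PER_XACT // 8


for label, xids in [("200M", 200_000_000), ("800M", 800_000_000), ("2^31", 2**31)]:
    print(f"{label:>5} transactions -> {clog_bytes(xids) / 1_000_000:.0f} MB")
```

This gives 50 MB for 200M transactions and 200 MB for 800M, matching the numbers mentioned in the thread, and roughly 537 MB (512 MiB) at the upper limit, which is the "about half a gigabyte" from the docs.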
Andrew Dunstan wrote:
> Bruce Momjian wrote:
> > Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> > when our wraparound limit is around 2B?
>
> Presumably because of this (from the docs):
>
> "The commit status uses two bits per transaction, so if
> autovacuum_freeze_max_age has its maximum allowed value of a little less
> than two billion, pg_clog can be expected to grow to about half a gigabyte."

Oh, that's interesting; thanks.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
Bruce Momjian <bruce@momjian.us> writes:

> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> when our wraparound limit is around 2B?

I suggested raising it dramatically in the post you quote and Heikki pointed
out that it controls the maximum amount of space the clog will take. Raising it
to, say, 800M will mean up to 200MB of space which might be kind of annoying
for a small database.

It would be nice if we could ensure the clog got trimmed frequently enough on
small databases that we could raise the max_age. It's really annoying to see
all these vacuums running 10x more often than necessary.

The rest of the thread is visible at the bottom of:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525

> Also, is anything being done about the concern about 'vacuum storm'
> explained below?

I'm interested too.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
>> Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
>> when our wraparound limit is around 2B?
>
> I suggested raising it dramatically in the post you quote and Heikki pointed
> out that it controls the maximum amount of space the clog will take. Raising it
> to, say, 800M will mean up to 200MB of space which might be kind of annoying
> for a small database.
>
> It would be nice if we could ensure the clog got trimmed frequently enough on
> small databases that we could raise the max_age. It's really annoying to see
> all these vacuums running 10x more often than necessary.

Well, if it's a small database, you might as well just vacuum it.

> The rest of the thread is visible at the bottom of:
>
> http://article.gmane.org/gmane.comp.db.postgresql.devel.general/107525
>
>> Also, is anything being done about the concern about 'vacuum storm'
>> explained below?
>
> I'm interested too.

The additional "vacuum_freeze_table_age" (as I'm now calling it) setting
I discussed in a later thread should alleviate that somewhat. When a
table is autovacuumed, the whole table is scanned to freeze tuples if
it's older than vacuum_freeze_table_age, and relfrozenxid is advanced.
When different tables reach the autovacuum threshold at different times,
they will also have their relfrozenxids set to different values. And in
fact no anti-wraparound vacuum is needed.

That doesn't help with read-only or insert-only tables, but that's not a
new problem.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
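[Editor's sketch] The rule Heikki describes can be written as a small decision function. This is only an illustration of the logic as stated in the message, not the actual implementation: the helper name `plan_vacuum`, its return labels, and the 150M value used for vacuum_freeze_table_age are assumptions (the message does not give a value).

```python
def plan_vacuum(relfrozenxid_age, freeze_table_age, freeze_max_age):
    """Classify a vacuum of one table (hypothetical helper, not PostgreSQL code).

    Returns one of:
      "anti-wraparound": forced by autovacuum because the age limit was reached
      "full-scan":       an ordinary vacuum that scans every page, freezes old
                         tuples, and can therefore advance relfrozenxid
      "partial":         an ordinary vacuum that skips all-visible pages
    """
    if relfrozenxid_age >= freeze_max_age:
        return "anti-wraparound"
    if relfrozenxid_age > freeze_table_age:
        return "full-scan"
    return "partial"


FREEZE_TABLE_AGE = 150_000_000  # illustrative value, an assumption
FREEZE_MAX_AGE = 200_000_000    # threshold discussed in the thread

for age in (40_000_000, 160_000_000, 200_000_000):
    print(age, plan_vacuum(age, FREEZE_TABLE_AGE, FREEZE_MAX_AGE))
```

The point of the middle branch is that a table which is vacuumed for ordinary reasons while its age lies between the two thresholds gets its relfrozenxid advanced at that moment. Since busy tables reach that point at different times, they should rarely arrive at the anti-wraparound limit together.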
Heikki Linnakangas wrote:
> >> Also, is anything being done about the concern about 'vacuum storm'
> >> explained below?
> >
> > I'm interested too.
>
> The additional "vacuum_freeze_table_age" (as I'm now calling it) setting
> I discussed in a later thread should alleviate that somewhat. When a
> table is autovacuumed, the whole table is scanned to freeze tuples if
> it's older than vacuum_freeze_table_age, and relfrozenxid is advanced.
> When different tables reach the autovacuum threshold at different times,
> they will also have their relfrozenxids set to different values. And in
> fact no anti-wraparound vacuum is needed.
>
> That doesn't help with read-only or insert-only tables, but that's not a
> new problem.

OK, is this targeted for 8.4?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
Gregory Stark wrote:
> Bruce Momjian <bruce@momjian.us> writes:
>
> > Would someone tell me why 'autovacuum_freeze_max_age' defaults to 200M
> > when our wraparound limit is around 2B?
>
> I suggested raising it dramatically in the post you quote and Heikki pointed
> out that it controls the maximum amount of space the clog will take. Raising it
> to, say, 800M will mean up to 200MB of space which might be kind of annoying
> for a small database.
>
> It would be nice if we could ensure the clog got trimmed frequently enough on
> small databases that we could raise the max_age. It's really annoying to see
> all these vacuums running 10x more often than necessary.

I always assumed that it was our 4-byte xid that was requiring our vacuum
freeze, but I now see our limiting factor is the size of clog; interesting.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
Bruce Momjian wrote:
> Heikki Linnakangas wrote:
>>>> Also, is anything being done about the concern about 'vacuum storm'
>>>> explained below?
>>> I'm interested too.
>> The additional "vacuum_freeze_table_age" (as I'm now calling it) setting
>> I discussed in a later thread should alleviate that somewhat. When a
>> table is autovacuumed, the whole table is scanned to freeze tuples if
>> it's older than vacuum_freeze_table_age, and relfrozenxid is advanced.
>> When different tables reach the autovacuum threshold at different times,
>> they will also have their relfrozenxids set to different values. And in
>> fact no anti-wraparound vacuum is needed.
>>
>> That doesn't help with read-only or insert-only tables, but that's not a
>> new problem.
>
> OK, is this targeted for 8.4?

Yes. It's been on my todo list for a long time, and I've also added it
to the Open Items list so that we don't lose track of it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com