Re: Reduce pinning in btree indexes

From: Kyotaro HORIGUCHI
Subject: Re: Reduce pinning in btree indexes
Date:
Msg-id: 20150316.155624.73903538.horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Re: Reduce pinning in btree indexes  (Kevin Grittner <kgrittn@ymail.com>)
Responses: Re: Reduce pinning in btree indexes  (Kevin Grittner <kgrittn@ymail.com>)
List: pgsql-hackers
Thank you for the rewrite.

I am satisfied with the current patch (the head of the btnopin branch
on your GitHub repo). I have no further comments on the functions.
Peter might still have a comment on the README description, but I
suppose I can pass this on to the committers.

I have attached your latest patch to this mail as bt-nopin-v4.patch
for now. Please check that there is no problem with it.


- With this patch, an index scan releases its buffer pin while the index
tuples fetched from a page are being processed, so it should reduce the
chance of a long-duration index scan blocking vacuum: vacuum can now
easily overtake the current position of an index scan. I have not
actually measured how effective this is, though. (A rough sketch of
this scenario follows the list.)

- It appears to work correctly for scans in both directions, even with
simultaneous page splits and vacuum.

- It causes no performance deterioration; on the contrary, it accelerates
index scans. This seems to be because of the removal of the lock/unlock
pair surrounding _bt_steppage in _bt_next.

- It applies cleanly to the current head of the master branch.

- The comments and README descriptions are sufficiently intelligible.

- The patch also carries an unrelated typo fix in buffers/README, but it
does not seem worth splitting out separately.
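
To make the first point concrete, here is a rough sketch of the scenario
that should improve (hypothetical table and data, written only for
illustration; it is not part of the patch or its tests). With an MVCC
snapshot on a WAL-logged index, the cursor below no longer sits on a
buffer pin between fetches, so the VACUUM in the second session should
not block on it:

    -- Assumed setup: a plain table with a btree index on id.
    CREATE TABLE t (id integer PRIMARY KEY, payload text);
    INSERT INTO t SELECT g, 'x' FROM generate_series(1, 100000) g;
    DELETE FROM t WHERE id % 10 = 0;   -- leave some dead tuples behind

    -- Session A: a cursor pauses partway through a btree index scan.
    BEGIN;
    DECLARE c CURSOR FOR SELECT id FROM t WHERE id > 0 ORDER BY id;
    FETCH 100 FROM c;  -- the matching items on the current leaf page are
                       -- copied to local memory; with the patch the pin
                       -- is dropped as well as the lock

    -- Session B: btbulkdelete no longer waits on session A's pin while
    -- taking its super-exclusive lock on each leaf page.
    VACUUM t;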

regards,



At Fri, 13 Mar 2015 15:08:25 +0000 (UTC), Kevin Grittner <kgrittn@ymail.com> wrote in
<848652045.4044965.1426259305233.JavaMail.yahoo@mail.yahoo.com>
> Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Thu, 12 Mar 2015 15:27:37 -0700, Peter Geoghegan <pg@heroku.com> wrote:
> >> On Sat, Feb 14, 2015 at 4:19 PM, Kevin Grittner <kgrittn@ymail.com> wrote:

> >> I think you should call out those "confounding factors" in the
> >> README. It's not hard to piece them together from
> >> _bt_drop_lock_and_maybe_pin(), but I think you should anyway.
> 
> OK:
> 
> https://github.com/kgrittn/postgres/commit/f5f59ded30b114ac83b90a00ba1fa5ef490b994e

It looks perfect to me. Do you have any further requests on this,
Peter?

> >> Don't use BUFFER_LOCK_SHARE -- use BT_READ, as the existing
> >> nbtree LockBuffer() callers do. You're inconsistent about that
> >> in V3.
> >
> > I agree with you. It looks like the only place where it is used.
> 
> OK:
> 
> https://github.com/kgrittn/postgres/commit/76118b58be819ed5e68569c926d0222bc41640ea
> 
> > In addition to that, the comment just above the point mentioned
> > above puzzles me.
> >
> >>    /* XXX: Can walking left be lighter on the locking and pins? */
> >>        if (BTScanPosIsPinned(so->currPos))
> >>                LockBuffer(so->currPos.buf, BUFFER_LOCK_SHARE);
> >>        else
> >>                so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
> >
> > I would be happy if the meaning of the comment were clearer. I
> > understand it to say that you want to remove the locking (and
> > pinning) but cannot for now, because simultaneous splitting of the
> > left page would break something. I'd like the comment to make it
> > clear whether that reading is correct.
> 
> Does this clear it up?:
> 
> https://github.com/kgrittn/postgres/commit/22066fc161a092e800e4c1e853136c4513f8771b

Thank you for the work. It also looks perfect.

> Since there are no changes that would affect the compiled code
> here, I'm not posting a new patch yet.  I'll do that once things
> seem to have settled down a bit more.
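
For the archives: the walking-left code discussed above is the path that
a scrollable cursor exercises. A minimal way to drive it, reusing the
hypothetical table t from my earlier sketch:

    BEGIN;
    DECLARE c SCROLL CURSOR FOR SELECT id FROM t ORDER BY id;
    FETCH FORWARD 500 FROM c;    -- crosses several leaf pages
    -- FETCH BACKWARD sends _bt_steppage() in the backward direction,
    -- where the scan re-locks (and, if it was dropped, re-pins) the
    -- current page before calling _bt_walk_left().
    FETCH BACKWARD 500 FROM c;
    COMMIT;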


-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 213542c..c904d2f 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -92,17 +92,18 @@ To minimize lock/unlock traffic, an index scan always searches a leaf page
 to identify all the matching items at once, copying their heap tuple IDs
 into backend-local storage.  The heap tuple IDs are then processed while
 not holding any page lock within the index.  We do continue to hold a pin
-on the leaf page, to protect against concurrent deletions (see below).
-In this state the scan is effectively stopped "between" pages, either
-before or after the page it has pinned.  This is safe in the presence of
-concurrent insertions and even page splits, because items are never moved
-across pre-existing page boundaries --- so the scan cannot miss any items
-it should have seen, nor accidentally return the same item twice.  The scan
-must remember the page's right-link at the time it was scanned, since that
-is the page to move right to; if we move right to the current right-link
-then we'd re-scan any items moved by a page split.  We don't similarly
-remember the left-link, since it's best to use the most up-to-date
-left-link when trying to move left (see detailed move-left algorithm below).
+on the leaf page in some circumstances, to protect against concurrent
+deletions (see below).  In this state the scan is effectively stopped
+"between" pages, either before or after the page it has pinned.  This is
+safe in the presence of concurrent insertions and even page splits, because
+items are never moved across pre-existing page boundaries --- so the scan
+cannot miss any items it should have seen, nor accidentally return the same
+item twice.  The scan must remember the page's right-link at the time it
+was scanned, since that is the page to move right to; if we move right to
+the current right-link then we'd re-scan any items moved by a page split.
+We don't similarly remember the left-link, since it's best to use the most
+up-to-date left-link when trying to move left (see detailed move-left
+algorithm below).
 
 In most cases we release our lock and pin on a page before attempting
 to acquire pin and lock on the page we are moving to.  In a few places
@@ -154,25 +155,37 @@ starts.  This is not necessary for correctness in terms of the btree index
 operations themselves; as explained above, index scans logically stop
 "between" pages and so can't lose their place.  The reason we do it is to
 provide an interlock between non-full VACUUM and indexscans.  Since VACUUM
-deletes index entries before deleting tuples, the super-exclusive lock
-guarantees that VACUUM can't delete any heap tuple that an indexscanning
-process might be about to visit.  (This guarantee works only for simple
-indexscans that visit the heap in sync with the index scan, not for bitmap
-scans.  We only need the guarantee when using non-MVCC snapshot rules; in
-an MVCC snapshot, it wouldn't matter if the heap tuple were replaced with
-an unrelated tuple at the same TID, because the new tuple wouldn't be
-visible to our scan anyway.)
-
-Because a page can be split even while someone holds a pin on it, it is
-possible that an indexscan will return items that are no longer stored on
-the page it has a pin on, but rather somewhere to the right of that page.
-To ensure that VACUUM can't prematurely remove such heap tuples, we require
-btbulkdelete to obtain super-exclusive lock on every leaf page in the index,
-even pages that don't contain any deletable tuples.  This guarantees that
-the btbulkdelete call cannot return while any indexscan is still holding
-a copy of a deleted index tuple.  Note that this requirement does not say
-that btbulkdelete must visit the pages in any particular order.  (See also
-on-the-fly deletion, below.)
+deletes index entries before reclaiming heap tuple line pointers, the
+super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
+line pointer that an indexscanning process might be about to visit.  This
+guarantee works only for simple indexscans that visit the heap in sync
+with the index scan, not for bitmap scans.  We only need the guarantee
+when using non-MVCC snapshot rules; when using an MVCC snapshot, it
+doesn't matter if the heap tuple is replaced with an unrelated tuple at
+the same TID, because the new tuple won't be visible to our scan anyway.
+Therefore, a scan using an MVCC snapshot which has no other confounding
+factors will not hold the pin after the page contents are read.  The
+current reasons for exceptions, where a pin is still needed, are if the
+index is not WAL-logged or if the scan is an index-only scan.  If later
+work allows the pin to be dropped for all cases we will be able to
+simplify the vacuum code, since the concept of a super-exclusive lock
+for btree indexes will no longer be needed.
+
+Because a pin is not always held, and a page can be split even while
+someone does hold a pin on it, it is possible that an indexscan will
+return items that are no longer stored on the page it has a pin on, but
+rather somewhere to the right of that page.  To ensure that VACUUM can't
+prematurely remove such heap tuples, we require btbulkdelete to obtain a
+super-exclusive lock on every leaf page in the index, even pages that
+don't contain any deletable tuples.  Any scan which could yield incorrect
+results if the tuple at a TID matching the scan's range and filter
+conditions were replaced by a different tuple while the scan is in
+progress must hold the pin on each index page until all index entries read
+from the page have been processed.  This guarantees that the btbulkdelete
+call cannot return while any indexscan is still holding a copy of a
+deleted index tuple if the scan could be confused by that.  Note that this
+requirement does not say that btbulkdelete must visit the pages in any
+particular order.  (See also on-the-fly deletion, below.)
 
 There is no such interlocking for deletion of items in internal pages,
 since backends keep no lock nor pin on a page they have descended past.
@@ -396,8 +409,12 @@ that this breaks the interlock between VACUUM and indexscans, but that is
 not so: as long as an indexscanning process has a pin on the page where
 the index item used to be, VACUUM cannot complete its btbulkdelete scan
 and so cannot remove the heap tuple.  This is another reason why
-btbulkdelete has to get super-exclusive lock on every leaf page, not only
-the ones where it actually sees items to delete.
+btbulkdelete has to get a super-exclusive lock on every leaf page, not
+only the ones where it actually sees items to delete.  So that we can
+handle the cases where we attempt LP_DEAD flagging for a page after we
+have released its pin, we remember the LSN of the index page when we read
+the index tuples from it; we do not attempt to flag index tuples as dead
+if we didn't hold the pin the entire time and the LSN has changed.
 
 WAL Considerations
 ------------------
@@ -462,7 +479,7 @@ metapage update (of the "fast root" link) is performed and logged as part
 of the insertion into the parent level.  When splitting the root page, the
 metapage update is handled as part of the "new root" action.
 
-Each step in page deletion are logged as separate WAL entries: marking the
+Each step in page deletion is logged as a separate WAL entry: marking the
 leaf as half-dead and removing the downlink is one record, and unlinking a
 page is a second record.  If vacuum is interrupted for some reason, or the
 system crashes, the tree is consistent for searches and insertions.  The
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 932c6f7..ef68a71 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -498,9 +498,9 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
  *        If there's not enough room in the space, we try to make room by
  *        removing any LP_DEAD tuples.
  *
- *        On entry, *buf and *offsetptr point to the first legal position
+ *        On entry, *bufptr and *offsetptr point to the first legal position
  *        where the new tuple could be inserted.  The caller should hold an
- *        exclusive lock on *buf.  *offsetptr can also be set to
+ *        exclusive lock on *bufptr.  *offsetptr can also be set to
  *        InvalidOffsetNumber, in which case the function will search for the
  *        right location within the page if needed.  On exit, they point to the
  *        chosen insert location.  If _bt_findinsertloc decides to move right,
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3af462b..9a6dcdd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -404,7 +404,8 @@ btbeginscan(PG_FUNCTION_ARGS)
 
     /* allocate private workspace */
     so = (BTScanOpaque) palloc(sizeof(BTScanOpaqueData));
-    so->currPos.buf = so->markPos.buf = InvalidBuffer;
+    BTScanPosInvalidate(so->currPos);
+    BTScanPosInvalidate(so->markPos);
     if (scan->numberOfKeys > 0)
         so->keyData = (ScanKey) palloc(scan->numberOfKeys * sizeof(ScanKeyData));
     else
 
@@ -424,8 +425,6 @@ btbeginscan(PG_FUNCTION_ARGS)
      * scan->xs_itupdesc whether we'll need it or not, since that's so cheap.
      */
     so->currTuples = so->markTuples = NULL;
-    so->currPos.nextTupleOffset = 0;
-    so->markPos.nextTupleOffset = 0;
 
     scan->xs_itupdesc = RelationGetDescr(rel);
@@ -451,17 +450,14 @@ btrescan(PG_FUNCTION_ARGS)
     {
         /* Before leaving current page, deal with any killed items */
         if (so->numKilled > 0)
-            _bt_killitems(scan, false);
-        ReleaseBuffer(so->currPos.buf);
-        so->currPos.buf = InvalidBuffer;
+            _bt_killitems(scan);
+        BTScanPosUnpinIfPinned(so->currPos);
+        BTScanPosInvalidate(so->currPos);
     }
 
-    if (BTScanPosIsValid(so->markPos))
-    {
-        ReleaseBuffer(so->markPos.buf);
-        so->markPos.buf = InvalidBuffer;
-    }
     so->markItemIndex = -1;
+    BTScanPosUnpinIfPinned(so->markPos);
+    BTScanPosInvalidate(so->markPos);
 
     /*
      * Allocate tuple workspace arrays, if needed for an index-only scan and
@@ -515,17 +511,14 @@ btendscan(PG_FUNCTION_ARGS)
     {
         /* Before leaving current page, deal with any killed items */
         if (so->numKilled > 0)
-            _bt_killitems(scan, false);
-        ReleaseBuffer(so->currPos.buf);
-        so->currPos.buf = InvalidBuffer;
+            _bt_killitems(scan);
+        BTScanPosUnpinIfPinned(so->currPos);
     }
 
-    if (BTScanPosIsValid(so->markPos))
-    {
-        ReleaseBuffer(so->markPos.buf);
-        so->markPos.buf = InvalidBuffer;
-    }
     so->markItemIndex = -1;
+    BTScanPosUnpinIfPinned(so->markPos);
+
+    /* No need to invalidate positions, the RAM is about to be freed. */
 
     /* Release storage */
     if (so->keyData != NULL)
 
@@ -552,12 +545,8 @@ btmarkpos(PG_FUNCTION_ARGS)
     IndexScanDesc scan = (IndexScanDesc) PG_GETARG_POINTER(0);
     BTScanOpaque so = (BTScanOpaque) scan->opaque;
 
-    /* we aren't holding any read locks, but gotta drop the pin */
-    if (BTScanPosIsValid(so->markPos))
-    {
-        ReleaseBuffer(so->markPos.buf);
-        so->markPos.buf = InvalidBuffer;
-    }
+    /* There may be an old mark with a pin (but no lock). */
+    BTScanPosUnpinIfPinned(so->markPos);
 
     /*
      * Just record the current itemIndex.  If we later step to next page
@@ -568,7 +557,10 @@ btmarkpos(PG_FUNCTION_ARGS)
     if (BTScanPosIsValid(so->currPos))
         so->markItemIndex = so->currPos.itemIndex;
     else
+    {
+        BTScanPosInvalidate(so->markPos);
         so->markItemIndex = -1;
+    }
 
     /* Also record the current positions of any array keys */
     if (so->numArrayKeys)
@@ -593,28 +585,54 @@ btrestrpos(PG_FUNCTION_ARGS)
     if (so->markItemIndex >= 0)
     {
         /*
-         * The mark position is on the same page we are currently on. Just
+         * The scan has never moved to a new page since the last mark.  Just
          * restore the itemIndex.
+         *
+         * NB: In this case we can't count on anything in so->markPos to be
+         * accurate.
          */
         so->currPos.itemIndex = so->markItemIndex;
     }
+    else if (so->currPos.currPage == so->markPos.currPage)
+    {
+        /*
+         * so->markItemIndex < 0 but mark and current positions are on the
+         * same page.  This would be an unusual case, where the scan moved to
+         * a new index page after the mark, restored, and later restored again
+         * without moving off the marked page.  It is not clear that this code
+         * can currently be reached, but it seems better to make this function
+         * robust for this case than to Assert() or elog() that it can't
+         * happen.
+         *
+         * We neither want to set so->markItemIndex >= 0 (because that could
+         * cause a later move to a new page to redo the memcpy() executions)
+         * nor re-execute the memcpy() functions for a restore within the same
+         * page.  The previous restore to this page already set everything
+         * except markPos as it should be.
+         */
+        so->currPos.itemIndex = so->markPos.itemIndex;
+    }
     else
     {
-        /* we aren't holding any read locks, but gotta drop the pin */
+        /*
+         * The scan moved to a new page after last mark or restore, and we are
+         * now restoring to the marked page.  We aren't holding any read
+         * locks, but if we're still holding the pin for the current position,
+         * we must drop it.
+         */
         if (BTScanPosIsValid(so->currPos))
         {
             /* Before leaving current page, deal with any killed items */
-            if (so->numKilled > 0 &&
-                so->currPos.buf != so->markPos.buf)
-                _bt_killitems(scan, false);
-            ReleaseBuffer(so->currPos.buf);
-            so->currPos.buf = InvalidBuffer;
+            if (so->numKilled > 0)
+                _bt_killitems(scan);
+            BTScanPosUnpinIfPinned(so->currPos);
         }
 
         if (BTScanPosIsValid(so->markPos))
         {
             /* bump pin on mark buffer for assignment to current buffer */
-            IncrBufferRefCount(so->markPos.buf);
+            if (BTScanPosIsPinned(so->markPos))
+                IncrBufferRefCount(so->markPos.buf);
             memcpy(&so->currPos, &so->markPos,
                    offsetof(BTScanPosData, items[1]) +
                    so->markPos.lastItem * sizeof(BTScanPosItem));
@@ -622,6 +640,8 @@ btrestrpos(PG_FUNCTION_ARGS)
                 memcpy(so->currTuples, so->markTuples,
                        so->markPos.nextTupleOffset);
         }
+        else
+            BTScanPosInvalidate(so->currPos);
     }
 
     PG_RETURN_VOID();
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index a681d6b..af4201a 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -22,6 +22,7 @@
 #include "storage/predicate.h"
 #include "utils/lsyscache.h"
 #include "utils/rel.h"
+#include "utils/tqual.h"
@@ -31,6 +32,36 @@ static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
+static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+
+
+/*
+ *    _bt_drop_lock_and_maybe_pin()
+ *
+ * Unlock the buffer; and if it is safe to release the pin, do that, too.  It
+ * is safe if the scan is using an MVCC snapshot and the index is WAL-logged.
+ * This will prevent vacuum from stalling in a blocked state trying to read a
+ * page when a cursor is sitting on it -- at least in many important cases.
+ *
+ * Set the buffer to invalid if the pin is released, since the buffer may be
+ * re-used.  If we need to go back to this block (for example, to apply
+ * LP_DEAD hints) we must get a fresh reference to the buffer.  Hopefully it
+ * will remain in shared memory for as long as it takes to scan the index
+ * buffer page.
+ */
+static void
+_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+{
+    LockBuffer(sp->buf, BUFFER_LOCK_UNLOCK);
+
+    if (IsMVCCSnapshot(scan->xs_snapshot) &&
+        RelationNeedsWAL(scan->indexRelation) &&
+        !scan->xs_want_itup)
+    {
+        ReleaseBuffer(sp->buf);
+        sp->buf = InvalidBuffer;
+    }
+}
 
 
 /*
@@ -506,6 +537,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
     StrategyNumber strat_total;
     BTScanPosItem *currItem;
 
+    Assert(!BTScanPosIsValid(so->currPos));
+
     pgstat_count_index_scan(rel);
 
     /*
@@ -942,9 +975,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
     /* don't need to keep the stack around... */
     _bt_freestack(stack);
 
-    /* remember which buffer we have pinned, if any */
-    so->currPos.buf = buf;
-
     if (!BufferIsValid(buf))
     {
         /*
@@ -1023,6 +1053,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
             break;
     }
 
+    /* remember which buffer we have pinned, if any */
+    Assert(!BTScanPosIsValid(so->currPos));
+    so->currPos.buf = buf;
+
     /*
      * Now load data from the first page of the scan.
      */
@@ -1032,12 +1066,15 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
          * There's no actually-matching data on this page.  Try to advance to
          * the next page.  Return false if there's no matching data at all.
          */
+        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
         if (!_bt_steppage(scan, dir))
             return false;
     }
-
-    /* Drop the lock, but not pin, on the current page */
-    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+    else
+    {
+        /* Drop the lock, and maybe the pin, on the current page */
+        _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+    }
 
     /* OK, itemIndex says what to return */
     currItem = &so->currPos.items[so->currPos.itemIndex];
@@ -1051,8 +1088,8 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 /*
  *    _bt_next() -- Get the next item in a scan.
  *
- *        On entry, so->currPos describes the current page, which is pinned
- *        but not locked, and so->currPos.itemIndex identifies which item was
+ *        On entry, so->currPos describes the current page, which may be pinned
+ *        but is not locked, and so->currPos.itemIndex identifies which item was
  *        previously returned.
  *
  *        On successful exit, scan->xs_ctup.t_self is set to the TID of the
@@ -1076,26 +1113,16 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
     {
         if (++so->currPos.itemIndex > so->currPos.lastItem)
         {
-            /* We must acquire lock before applying _bt_steppage */
-            Assert(BufferIsValid(so->currPos.buf));
-            LockBuffer(so->currPos.buf, BT_READ);
             if (!_bt_steppage(scan, dir))
                 return false;
-            /* Drop the lock, but not pin, on the new page */
-            LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
         }
     }
     else
     {
         if (--so->currPos.itemIndex < so->currPos.firstItem)
         {
-            /* We must acquire lock before applying _bt_steppage */
-            Assert(BufferIsValid(so->currPos.buf));
-            LockBuffer(so->currPos.buf, BT_READ);
             if (!_bt_steppage(scan, dir))
                 return false;
-            /* Drop the lock, but not pin, on the new page */
-            LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
         }
     }
@@ -1135,7 +1162,10 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
     IndexTuple    itup;
     bool        continuescan;
 
-    /* we must have the buffer pinned and locked */
+    /*
+     * We must have the buffer pinned and locked, but the usual macro can't be
+     * used here; this function is what makes it good for currPos.
+     */
     Assert(BufferIsValid(so->currPos.buf));
 
     page = BufferGetPage(so->currPos.buf);
@@ -1144,6 +1174,19 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
     maxoff = PageGetMaxOffsetNumber(page);
 
     /*
+     * We note the buffer's block number so that we can release the pin later.
+     * This allows us to re-read the buffer if it is needed again for hinting.
+     */
+    so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
+
+    /*
+     * We save the LSN of the page as we read it, so that we know whether it
+     * is safe to apply LP_DEAD hints to the page later.  This allows us to
+     * drop the pin for MVCC scans, which allows vacuum to avoid blocking.
+     */
+    so->currPos.lsn = PageGetLSN(page);
+
+    /*
      * we must save the page's right-link while scanning it; this tells us
      * where to step right to after we're done with these items.  There is no
      * corresponding need for the left-link, since splits always go right.
 
@@ -1153,6 +1196,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
     /* initialize tuple workspace to empty */
     so->currPos.nextTupleOffset = 0;
 
+    /*
+     * Now that the current page has been made consistent, the macro should be
+     * good.
+     */
+    Assert(BTScanPosIsPinned(so->currPos));
+
     if (ScanDirectionIsForward(dir))
     {
         /* load items[] in ascending order */
@@ -1241,11 +1290,14 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 /*
  *    _bt_steppage() -- Step to next page containing valid data for scan
  *
- * On entry, so->currPos.buf must be pinned and read-locked.  We'll drop
- * the lock and pin before moving to next page.
+ * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
+ * if pinned, we'll drop the pin before moving to next page.  The buffer is
+ * not locked on entry.
  *
- * On success exit, we hold pin and read-lock on the next interesting page,
- * and so->currPos is updated to contain data from that page.
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  For success on a scan using a non-MVCC snapshot we hold
+ * a pin, but not a read lock, on that page.  If we do not hold the pin, we
+ * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
 *
 * If there are no more matching records in the given direction, we drop all
 * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
@@ -1258,12 +1310,11 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
     Page        page;
     BTPageOpaque opaque;
 
-    /* we must have the buffer pinned and locked */
-    Assert(BufferIsValid(so->currPos.buf));
+    Assert(BTScanPosIsValid(so->currPos));
 
     /* Before leaving current page, deal with any killed items */
     if (so->numKilled > 0)
-        _bt_killitems(scan, true);
+        _bt_killitems(scan);
 
     /*
      * Before we modify currPos, make a copy of the page data if there was a
@@ -1272,7 +1323,8 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
     if (so->markItemIndex >= 0)
     {
         /* bump pin on current buffer for assignment to mark buffer */
-        IncrBufferRefCount(so->currPos.buf);
+        if (BTScanPosIsPinned(so->currPos))
+            IncrBufferRefCount(so->currPos.buf);
         memcpy(&so->markPos, &so->currPos,
                offsetof(BTScanPosData, items[1]) +
                so->currPos.lastItem * sizeof(BTScanPosItem));
 
@@ -1294,14 +1346,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
         /* Remember we left a page with data */
         so->currPos.moreLeft = true;
 
+        /* release the previous buffer, if pinned */
+        BTScanPosUnpinIfPinned(so->currPos);
+
         for (;;)
         {
-            /* release the previous buffer */
-            _bt_relbuf(rel, so->currPos.buf);
-            so->currPos.buf = InvalidBuffer;
-
             /* if we're at end of scan, give up */
             if (blkno == P_NONE || !so->currPos.moreRight)
+            {
+                BTScanPosInvalidate(so->currPos);
                 return false;
+            }
 
             /* check for interrupts while we're not holding any buffer lock */
             CHECK_FOR_INTERRUPTS();
 
             /* step right one page */
@@ -1317,8 +1372,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
                 if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
                     break;
             }
 
+            /* nope, keep going */
             blkno = opaque->btpo_next;
+            _bt_relbuf(rel, so->currPos.buf);
         }
     }
     else
@@ -1332,14 +1389,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
          * to our left splits while we are in flight to it, plus the
          * possibility that the page we were on gets deleted after we leave
          * it.  See nbtree/README for details.
+         *
+         * It might be possible to rearrange this code to have less overhead
+         * in pinning and locking, but that would require capturing the left
+         * pointer when the page is initially read, and using it here, along
+         * with big changes to _bt_walk_left() and the code below.  It is not
+         * clear whether this would be a win, since if the page immediately to
+         * the left splits after we read this page and before we step left, we
+         * would need to visit more pages than with the current code.
          */
+        if (BTScanPosIsPinned(so->currPos))
+            LockBuffer(so->currPos.buf, BT_READ);
+        else
+            so->currPos.buf = _bt_getbuf(rel, so->currPos.currPage, BT_READ);
+
         for (;;)
         {
             /* Done if we know there are no matching keys to the left */
             if (!so->currPos.moreLeft)
             {
                 _bt_relbuf(rel, so->currPos.buf);
-                so->currPos.buf = InvalidBuffer;
+                BTScanPosInvalidate(so->currPos);
                 return false;
             }
@@ -1348,7 +1418,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 
             /* if we're physically at end of index, return failure */
             if (so->currPos.buf == InvalidBuffer)
+            {
+                BTScanPosInvalidate(so->currPos);
                 return false;
+            }
 
             /*
              * Okay, we managed to move left to a non-deleted page. Done if
@@ -1368,6 +1441,9 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
         }
     }
 
+    /* Drop the lock, and maybe the pin, on the current page */
+    _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
     return true;
 }
@@ -1604,7 +1680,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
          * exists.
          */
         PredicateLockRelation(rel, scan->xs_snapshot);
-        so->currPos.buf = InvalidBuffer;
+        BTScanPosInvalidate(so->currPos);
         return false;
     }
@@ -1658,12 +1734,15 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
          * There's no actually-matching data on this page.  Try to advance to
          * the next page.  Return false if there's no matching data at all.
          */
+        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
         if (!_bt_steppage(scan, dir))
             return false;
     }
-
-    /* Drop the lock, but not pin, on the current page */
-    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+    else
+    {
+        /* Drop the lock, and maybe the pin, on the current page */
+        _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+    }
 
     /* OK, itemIndex says what to return */
     currItem = &so->currPos.items[so->currPos.itemIndex];
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 379dac9..d1589f0 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -1721,27 +1721,35 @@ _bt_check_rowcompare(ScanKey skey, IndexTuple tuple, TupleDesc tupdesc,
  * _bt_killitems - set LP_DEAD state for items an indexscan caller has
  * told us were killed
  *
- * scan->so contains information about the current page and killed tuples
- * thereon (generally, this should only be called if so->numKilled > 0).
+ * scan->opaque, referenced locally through so, contains information about the
+ * current page and killed tuples thereon (generally, this should only be
+ * called if so->numKilled > 0).
  *
- * The caller must have pin on so->currPos.buf, but may or may not have
- * read-lock, as indicated by haveLock.  Note that we assume read-lock
- * is sufficient for setting LP_DEAD status (which is only a hint).
+ * The caller does not have a lock on the page and may or may not have the
+ * page pinned in a buffer.  Note that read-lock is sufficient for setting
+ * LP_DEAD status (which is only a hint).
 *
 * We match items by heap TID before assuming they are the right ones to
 * delete.  We cope with cases where items have moved right due to insertions.
 * If an item has moved off the current page due to a split, we'll fail to
 * find it and do nothing (this is not an error case --- we assume the item
- * will eventually get marked in a future indexscan).  Note that because we
- * hold pin on the target page continuously from initially reading the items
- * until applying this function, VACUUM cannot have deleted any items from
- * the page, and so there is no need to search left from the recorded offset.
- * (This observation also guarantees that the item is still the right one
- * to delete, which might otherwise be questionable since heap TIDs can get
- * recycled.)
+ * will eventually get marked in a future indexscan).
+ *
+ * Note that if we hold a pin on the target page continuously from initially
+ * reading the items until applying this function, VACUUM cannot have deleted
+ * any items from the page, and so there is no need to search left from the
+ * recorded offset.  (This observation also guarantees that the item is still
+ * the right one to delete, which might otherwise be questionable since heap
+ * TIDs can get recycled.)  This holds true even if the page has been modified
+ * by inserts and page splits, so there is no need to consult the LSN.
+ *
+ * If the pin was released after reading the page, then we re-read it.  If it
+ * has been modified since we read it (as determined by the LSN), we dare not
+ * flag any entries because it is possible that the old entry was vacuumed
+ * away and the TID was re-used by a completely different heap tuple.
 */
 void
-_bt_killitems(IndexScanDesc scan, bool haveLock)
+_bt_killitems(IndexScanDesc scan)
 {
     BTScanOpaque so = (BTScanOpaque) scan->opaque;
     Page        page;
@@ -1749,19 +1757,56 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
     OffsetNumber minoff;
     OffsetNumber maxoff;
     int            i;
+    int            numKilled = so->numKilled;
     bool        killedsomething = false;
 
-    Assert(BufferIsValid(so->currPos.buf));
+    Assert(BTScanPosIsValid(so->currPos));
+
+    /*
+     * Always reset the scan state, so we don't look for same items on other
+     * pages.
+     */
+    so->numKilled = 0;
 
-    if (!haveLock)
+    if (BTScanPosIsPinned(so->currPos))
+    {
+        /*
+         * We have held the pin on this page since we read the index tuples,
+         * so all we need to do is lock it.  The pin will have prevented
+         * re-use of any TID on the page, so there is no need to check the
+         * LSN.
+         */
         LockBuffer(so->currPos.buf, BT_READ);
 
-    page = BufferGetPage(so->currPos.buf);
+        page = BufferGetPage(so->currPos.buf);
+    }
+    else
+    {
+        Buffer        buf;
+
+        /* Attempt to re-read the buffer, getting pin and lock. */
+        buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
+
+        /* It might not exist anymore; in which case we can't hint it. */
+        if (!BufferIsValid(buf))
+            return;
+
+        page = BufferGetPage(buf);
+        if (PageGetLSN(page) == so->currPos.lsn)
+            so->currPos.buf = buf;
+        else
+        {
+            /* Modified while not pinned means hinting is not safe. */
+            _bt_relbuf(scan->indexRelation, buf);
+            return;
+        }
+    }
+
     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
     minoff = P_FIRSTDATAKEY(opaque);
     maxoff = PageGetMaxOffsetNumber(page);
 
-    for (i = 0; i < so->numKilled; i++)
+    for (i = 0; i < numKilled; i++)
     {
         int            itemIndex = so->killedItems[i];
         BTScanPosItem *kitem = &so->currPos.items[itemIndex];
 
@@ -1799,14 +1844,7 @@ _bt_killitems(IndexScanDesc scan, bool haveLock)
         MarkBufferDirtyHint(so->currPos.buf, true);
     }
 
-    if (!haveLock)
-        LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
-
-    /*
-     * Always reset the scan state, so we don't look for same items on other
-     * pages.
-     */
-    so->numKilled = 0;
+    LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index e2ace5c..2f85867 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -395,7 +395,8 @@ btree_xlog_vacuum(XLogReaderState *record)
      * unpinned between the lastBlockVacuumed and the current block, if there
      * are any.  This prevents replay of the VACUUM from reaching the stage of
      * removing heap tuples while there could still be indexscans "in flight"
-     * to those particular tuples (see nbtree/README).
+     * to those particular tuples for those scans which could be confused by
+     * finding new tuples at the old TID locations (see nbtree/README).
      *
      * It might be worth checking if there are actually any backends running;
      * if not, we could just skip this.
 
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a4ebbcc..45c5c83 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -18,8 +18,8 @@ backend to pin a page more than once concurrently; the buffer manager
 handles this efficiently.  It is considered OK to hold a pin for long
 intervals --- for example, sequential scans hold a pin on the current page
 until done processing all the tuples on the page, which could be quite a
-while if the scan is the outer scan of a join.  Similarly, btree index
-scans hold a pin on the current index page.  This is OK because normal
+while if the scan is the outer scan of a join.  Similarly, a btree index
+scan may hold a pin on the current index page.  This is OK because normal
 operations never wait for a page's pin count to drop to zero.  (Anything
 that might need to do such a wait is instead handled by waiting to obtain
 the relation-level lock, which is why you'd better hold one first.)  Pins
 
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 990dab3..2ef349b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -518,6 +518,8 @@ typedef struct BTScanPosData
 {
     Buffer        buf;            /* if valid, the buffer is pinned */
 
+    XLogRecPtr    lsn;            /* pos in the WAL stream when page was read */
+    BlockNumber currPage;        /* page referenced by items array */
     BlockNumber nextPage;        /* page's right link when we scanned it */
 
     /*
@@ -551,7 +553,37 @@ typedef struct BTScanPosData
 
 typedef BTScanPosData *BTScanPos;
 
-#define BTScanPosIsValid(scanpos) BufferIsValid((scanpos).buf)
+#define BTScanPosIsPinned(scanpos) \
+( \
+    AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+                !BufferIsValid((scanpos).buf)), \
+    BufferIsValid((scanpos).buf) \
+)
+#define BTScanPosUnpin(scanpos) \
+    do { \
+        ReleaseBuffer((scanpos).buf); \
+        (scanpos).buf = InvalidBuffer; \
+    } while (0)
+#define BTScanPosUnpinIfPinned(scanpos) \
+    do { \
+        if (BTScanPosIsPinned(scanpos)) \
+            BTScanPosUnpin(scanpos); \
+    } while (0)
+
+#define BTScanPosIsValid(scanpos) \
+( \
+    AssertMacro(BlockNumberIsValid((scanpos).currPage) || \
+                !BufferIsValid((scanpos).buf)), \
+    BlockNumberIsValid((scanpos).currPage) \
+)
+#define BTScanPosInvalidate(scanpos) \
+    do { \
+        (scanpos).currPage = InvalidBlockNumber; \
+        (scanpos).nextPage = InvalidBlockNumber; \
+        (scanpos).buf = InvalidBuffer; \
+        (scanpos).lsn = InvalidXLogRecPtr; \
+        (scanpos).nextTupleOffset = 0; \
+    } while (0);
 
 /* We need one of these for each equality-type SK_SEARCHARRAY scan key */
 typedef struct BTArrayKeyInfo
@@ -697,7 +729,7 @@ extern void _bt_preprocess_keys(IndexScanDesc scan);
 extern IndexTuple _bt_checkkeys(IndexScanDesc scan,
             Page page, OffsetNumber offnum,
             ScanDirection dir, bool *continuescan);
-extern void _bt_killitems(IndexScanDesc scan, bool haveLock);
+extern void _bt_killitems(IndexScanDesc scan);
 extern BTCycleId _bt_vacuum_cycleid(Relation rel);
 extern BTCycleId _bt_start_vacuum(Relation rel);
 extern void _bt_end_vacuum(Relation rel);
