Thread: Index-only-scans, indexam API changes

Index-only-scans, indexam API changes

From:
Heikki Linnakangas
Date:
At the moment, amgettuple only returns pointers to heap tuples. There is
no way to return data from the index tuples. That needs to be changed to
support index-only scans.

I propose that we split index_getnext into two steps: fetching the next
match from the index (index_next()), and fetching the corresponding heap
tuple (index_fetch()). Patch attached. There is still a shorthand
index_getnext() that is compatible with the old function, but it now
just calls index_next()+index_fetch().

The new index_fetch() function can only fetch one match from a HOT
chain. That greatly simplifies the code in indexam.c. The only caller
needing to fetch more than one tuple from a HOT chain (= using
SnapshotAny) is CLUSTER, so I moved the HOT-chain following logic into
cluster.c, with small changes to heap_hot_search_buffer().

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 1490,1497 **** heap_fetch(Relation relation,
   * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
   * of a HOT chain), and buffer is the buffer holding this tuple.  We search
   * for the first chain member satisfying the given snapshot.  If one is
!  * found, we update *tid to reference that tuple's offset number, and
!  * return TRUE.  If no match, return FALSE without modifying *tid.
   *
   * If all_dead is not NULL, we check non-visible tuples to see if they are
   * globally dead; *all_dead is set TRUE if all members of the HOT chain
--- 1490,1503 ----
   * On entry, *tid is the TID of a tuple (either a simple tuple, or the root
   * of a HOT chain), and buffer is the buffer holding this tuple.  We search
   * for the first chain member satisfying the given snapshot.  If one is
!  * found, we update *tid and *heapTuple to reference that tuple, and
!  * return TRUE.  If no match, return FALSE without modifying *tid (*heapTuple
!  * is undefined in that case).
!  *
!  * To fetch the next matching tuple in the chain, call again with 'firstCall'
!  * set to FALSE and *tid still pointing to the tuple returned by previous call.
!  * (normally only one tuple in a chain can be visible at a time, but that does
!  * not hold for special snapshots like SnapshotAny)
   *
   * If all_dead is not NULL, we check non-visible tuples to see if they are
   * globally dead; *all_dead is set TRUE if all members of the HOT chain
***************
*** 1503,1529 **** heap_fetch(Relation relation,
   */
  bool
  heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
!                        bool *all_dead)
  {
      Page        dp = (Page) BufferGetPage(buffer);
      TransactionId prev_xmax = InvalidTransactionId;
      OffsetNumber offnum;
      bool        at_chain_start;

!     if (all_dead)
          *all_dead = true;

      Assert(TransactionIdIsValid(RecentGlobalXmin));

      Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
      offnum = ItemPointerGetOffsetNumber(tid);
!     at_chain_start = true;

      /* Scan through possible multiple members of HOT-chain */
      for (;;)
      {
          ItemId        lp;
-         HeapTupleData heapTuple;

          /* check for bogus TID */
          if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
--- 1509,1537 ----
   */
  bool
  heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
!                        HeapTuple heapTuple, bool *all_dead, bool firstCall)
  {
      Page        dp = (Page) BufferGetPage(buffer);
      TransactionId prev_xmax = InvalidTransactionId;
      OffsetNumber offnum;
      bool        at_chain_start;
+     bool        skip;

!     if (all_dead && firstCall)
          *all_dead = true;

      Assert(TransactionIdIsValid(RecentGlobalXmin));

      Assert(ItemPointerGetBlockNumber(tid) == BufferGetBlockNumber(buffer));
      offnum = ItemPointerGetOffsetNumber(tid);
!
!     at_chain_start = firstCall;
!     skip = !firstCall;

      /* Scan through possible multiple members of HOT-chain */
      for (;;)
      {
          ItemId        lp;

          /* check for bogus TID */
          if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(dp))
***************
*** 1546,1558 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
              break;
          }

!         heapTuple.t_data = (HeapTupleHeader) PageGetItem(dp, lp);
!         heapTuple.t_len = ItemIdGetLength(lp);

          /*
           * Shouldn't see a HEAP_ONLY tuple at chain start.
           */
!         if (at_chain_start && HeapTupleIsHeapOnly(&heapTuple))
              break;

          /*
--- 1554,1566 ----
              break;
          }

!         heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
!         heapTuple->t_len = ItemIdGetLength(lp);

          /*
           * Shouldn't see a HEAP_ONLY tuple at chain start.
           */
!         if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
              break;

          /*
***************
*** 1561,1577 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           */
          if (TransactionIdIsValid(prev_xmax) &&
              !TransactionIdEquals(prev_xmax,
!                                  HeapTupleHeaderGetXmin(heapTuple.t_data)))
              break;

          /* If it's visible per the snapshot, we must return it */
!         if (HeapTupleSatisfiesVisibility(&heapTuple, snapshot, buffer))
          {
              ItemPointerSetOffsetNumber(tid, offnum);
              if (all_dead)
                  *all_dead = false;
              return true;
          }

          /*
           * If we can't see it, maybe no one else can either.  At caller
--- 1569,1586 ----
           */
          if (TransactionIdIsValid(prev_xmax) &&
              !TransactionIdEquals(prev_xmax,
!                                  HeapTupleHeaderGetXmin(heapTuple->t_data)))
              break;

          /* If it's visible per the snapshot, we must return it */
!         if (!skip && HeapTupleSatisfiesVisibility(heapTuple, snapshot, buffer))
          {
              ItemPointerSetOffsetNumber(tid, offnum);
              if (all_dead)
                  *all_dead = false;
              return true;
          }
+         skip = false;

          /*
           * If we can't see it, maybe no one else can either.  At caller
***************
*** 1579,1585 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           * transactions.
           */
          if (all_dead && *all_dead &&
!             HeapTupleSatisfiesVacuum(heapTuple.t_data, RecentGlobalXmin,
                                       buffer) != HEAPTUPLE_DEAD)
              *all_dead = false;

--- 1588,1594 ----
           * transactions.
           */
          if (all_dead && *all_dead &&
!             HeapTupleSatisfiesVacuum(heapTuple->t_data, RecentGlobalXmin,
                                       buffer) != HEAPTUPLE_DEAD)
              *all_dead = false;

***************
*** 1587,1599 **** heap_hot_search_buffer(ItemPointer tid, Buffer buffer, Snapshot snapshot,
           * Check to see if HOT chain continues past this tuple; if so fetch
           * the next offnum and loop around.
           */
!         if (HeapTupleIsHotUpdated(&heapTuple))
          {
!             Assert(ItemPointerGetBlockNumber(&heapTuple.t_data->t_ctid) ==
                     ItemPointerGetBlockNumber(tid));
!             offnum = ItemPointerGetOffsetNumber(&heapTuple.t_data->t_ctid);
              at_chain_start = false;
!             prev_xmax = HeapTupleHeaderGetXmax(heapTuple.t_data);
          }
          else
              break;                /* end of chain */
--- 1596,1608 ----
           * Check to see if HOT chain continues past this tuple; if so fetch
           * the next offnum and loop around.
           */
!         if (HeapTupleIsHotUpdated(heapTuple))
          {
!             Assert(ItemPointerGetBlockNumber(&heapTuple->t_data->t_ctid) ==
                     ItemPointerGetBlockNumber(tid));
!             offnum = ItemPointerGetOffsetNumber(&heapTuple->t_data->t_ctid);
              at_chain_start = false;
!             prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
          }
          else
              break;                /* end of chain */
***************
*** 1615,1624 **** heap_hot_search(ItemPointer tid, Relation relation, Snapshot snapshot,
  {
      bool        result;
      Buffer        buffer;

      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
      LockBuffer(buffer, BUFFER_LOCK_SHARE);
!     result = heap_hot_search_buffer(tid, buffer, snapshot, all_dead);
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buffer);
      return result;
--- 1624,1635 ----
  {
      bool        result;
      Buffer        buffer;
+     HeapTupleData heapTuple;

      buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid));
      LockBuffer(buffer, BUFFER_LOCK_SHARE);
!     result = heap_hot_search_buffer(tid, buffer, snapshot, &heapTuple,
!                                     all_dead, true);
      LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
      ReleaseBuffer(buffer);
      return result;
*** a/src/backend/access/index/genam.c
--- b/src/backend/access/index/genam.c
***************
*** 97,105 **** RelationGetIndexScan(Relation indexRelation,
      ItemPointerSetInvalid(&scan->xs_ctup.t_self);
      scan->xs_ctup.t_data = NULL;
      scan->xs_cbuf = InvalidBuffer;
-     scan->xs_hot_dead = false;
-     scan->xs_next_hot = InvalidOffsetNumber;
-     scan->xs_prev_xmax = InvalidTransactionId;

      /*
       * Let the AM fill in the key and any opaque data it wants.
--- 97,102 ----
*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
***************
*** 20,26 ****
   *        index_insert    - insert an index tuple into a relation
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
!  *        index_getnext    - get the next tuple from a scan
   *        index_getbitmap - get all tuples from a scan
   *        index_bulk_delete    - bulk deletion of index tuples
   *        index_vacuum_cleanup    - post-deletion cleanup of an index
--- 20,27 ----
   *        index_insert    - insert an index tuple into a relation
   *        index_markpos    - mark a scan position
   *        index_restrpos    - restore a scan position
!  *        index_next        - advance index scan to next match
!  *        index_fetch        - fetch heap tuple for current index scan position
   *        index_getbitmap - get all tuples from a scan
   *        index_bulk_delete    - bulk deletion of index tuples
   *        index_vacuum_cleanup    - post-deletion cleanup of an index
***************
*** 310,317 **** index_rescan(IndexScanDesc scan, ScanKey key)
          scan->xs_cbuf = InvalidBuffer;
      }

-     scan->xs_next_hot = InvalidOffsetNumber;
-
      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall2(procedure,
--- 311,316 ----
***************
*** 389,420 **** index_restrpos(IndexScanDesc scan)
      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amrestrpos);

-     scan->xs_next_hot = InvalidOffsetNumber;
-
      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall1(procedure, PointerGetDatum(scan));
  }

  /* ----------------
!  *        index_getnext - get the next heap tuple from a scan
   *
!  * The result is the next heap tuple satisfying the scan keys and the
!  * snapshot, or NULL if no more matching tuples exist.    On success,
!  * the buffer containing the heap tuple is pinned (the pin will be dropped
!  * at the next index_getnext or index_endscan).
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
   * enough information to do it efficiently in the general case.
   * ----------------
   */
! HeapTuple
! index_getnext(IndexScanDesc scan, ScanDirection direction)
  {
-     HeapTuple    heapTuple = &scan->xs_ctup;
-     ItemPointer tid = &heapTuple->t_self;
      FmgrInfo   *procedure;

      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amgettuple);
--- 388,415 ----
      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amrestrpos);

      scan->kill_prior_tuple = false;        /* for safety */

      FunctionCall1(procedure, PointerGetDatum(scan));
  }

  /* ----------------
!  *        index_next - get the next index tuple from a scan
   *
!  * Advances to the next index tuple satisfying the scan keys, returning TRUE
!  * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.t_self.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
   * enough information to do it efficiently in the general case.
   * ----------------
   */
! bool
! index_next(IndexScanDesc scan, ScanDirection direction)
  {
      FmgrInfo   *procedure;
+     bool         found;

      SCAN_CHECKS;
      GET_SCAN_PROCEDURE(amgettuple);
***************
*** 422,641 **** index_getnext(IndexScanDesc scan, ScanDirection direction)
      Assert(TransactionIdIsValid(RecentGlobalXmin));

      /*
!      * We always reset xs_hot_dead; if we are here then either we are just
!      * starting the scan, or we previously returned a visible tuple, and in
!      * either case it's inappropriate to kill the prior index entry.
       */
!     scan->xs_hot_dead = false;

!     for (;;)
!     {
!         OffsetNumber offnum;
!         bool        at_chain_start;
!         Page        dp;

!         if (scan->xs_next_hot != InvalidOffsetNumber)
!         {
!             /*
!              * We are resuming scan of a HOT chain after having returned an
!              * earlier member.    Must still hold pin on current heap page.
!              */
!             Assert(BufferIsValid(scan->xs_cbuf));
!             Assert(ItemPointerGetBlockNumber(tid) ==
!                    BufferGetBlockNumber(scan->xs_cbuf));
!             Assert(TransactionIdIsValid(scan->xs_prev_xmax));
!             offnum = scan->xs_next_hot;
!             at_chain_start = false;
!             scan->xs_next_hot = InvalidOffsetNumber;
!         }
!         else
          {
!             bool        found;
!             Buffer        prev_buf;
!
!             /*
!              * If we scanned a whole HOT chain and found only dead tuples,
!              * tell index AM to kill its entry for that TID.
!              */
!             scan->kill_prior_tuple = scan->xs_hot_dead;
!
!             /*
!              * The AM's gettuple proc finds the next index entry matching the
!              * scan keys, and puts the TID in xs_ctup.t_self (ie, *tid). It
!              * should also set scan->xs_recheck, though we pay no attention to
!              * that here.
!              */
!             found = DatumGetBool(FunctionCall2(procedure,
!                                                PointerGetDatum(scan),
!                                                Int32GetDatum(direction)));
!
!             /* Reset kill flag immediately for safety */
!             scan->kill_prior_tuple = false;
!
!             /* If we're out of index entries, break out of outer loop */
!             if (!found)
!                 break;
!
!             pgstat_count_index_tuples(scan->indexRelation, 1);
!
!             /* Switch to correct buffer if we don't have it already */
!             prev_buf = scan->xs_cbuf;
!             scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
!                                                  scan->heapRelation,
!                                              ItemPointerGetBlockNumber(tid));
!
!             /*
!              * Prune page, but only if we weren't already on this page
!              */
!             if (prev_buf != scan->xs_cbuf)
!                 heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf,
!                                     RecentGlobalXmin);
!
!             /* Prepare to scan HOT chain starting at index-referenced offnum */
!             offnum = ItemPointerGetOffsetNumber(tid);
!             at_chain_start = true;
!
!             /* We don't know what the first tuple's xmin should be */
!             scan->xs_prev_xmax = InvalidTransactionId;
!
!             /* Initialize flag to detect if all entries are dead */
!             scan->xs_hot_dead = true;
          }

!         /* Obtain share-lock on the buffer so we can examine visibility */
!         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);

!         dp = (Page) BufferGetPage(scan->xs_cbuf);

-         /* Scan through possible multiple members of HOT-chain */
-         for (;;)
-         {
-             ItemId        lp;
-             ItemPointer ctid;
-
-             /* check for bogus TID */
-             if (offnum < FirstOffsetNumber ||
-                 offnum > PageGetMaxOffsetNumber(dp))
-                 break;
-
-             lp = PageGetItemId(dp, offnum);
-
-             /* check for unused, dead, or redirected items */
-             if (!ItemIdIsNormal(lp))
-             {
-                 /* We should only see a redirect at start of chain */
-                 if (ItemIdIsRedirected(lp) && at_chain_start)
-                 {
-                     /* Follow the redirect */
-                     offnum = ItemIdGetRedirect(lp);
-                     at_chain_start = false;
-                     continue;
-                 }
-                 /* else must be end of chain */
-                 break;
-             }
-
-             /*
-              * We must initialize all of *heapTuple (ie, scan->xs_ctup) since
-              * it is returned to the executor on success.
-              */
-             heapTuple->t_data = (HeapTupleHeader) PageGetItem(dp, lp);
-             heapTuple->t_len = ItemIdGetLength(lp);
-             ItemPointerSetOffsetNumber(tid, offnum);
-             heapTuple->t_tableOid = RelationGetRelid(scan->heapRelation);
-             ctid = &heapTuple->t_data->t_ctid;
-
-             /*
-              * Shouldn't see a HEAP_ONLY tuple at chain start.  (This test
-              * should be unnecessary, since the chain root can't be removed
-              * while we have pin on the index entry, but let's make it
-              * anyway.)
-              */
-             if (at_chain_start && HeapTupleIsHeapOnly(heapTuple))
-                 break;
-
-             /*
-              * The xmin should match the previous xmax value, else chain is
-              * broken.    (Note: this test is not optional because it protects
-              * us against the case where the prior chain member's xmax aborted
-              * since we looked at it.)
-              */
-             if (TransactionIdIsValid(scan->xs_prev_xmax) &&
-                 !TransactionIdEquals(scan->xs_prev_xmax,
-                                   HeapTupleHeaderGetXmin(heapTuple->t_data)))
-                 break;
-
-             /* If it's visible per the snapshot, we must return it */
-             if (HeapTupleSatisfiesVisibility(heapTuple, scan->xs_snapshot,
-                                              scan->xs_cbuf))
-             {
-                 /*
-                  * If the snapshot is MVCC, we know that it could accept at
-                  * most one member of the HOT chain, so we can skip examining
-                  * any more members.  Otherwise, check for continuation of the
-                  * HOT-chain, and set state for next time.
-                  */
-                 if (IsMVCCSnapshot(scan->xs_snapshot))
-                     scan->xs_next_hot = InvalidOffsetNumber;
-                 else if (HeapTupleIsHotUpdated(heapTuple))
-                 {
-                     Assert(ItemPointerGetBlockNumber(ctid) ==
-                            ItemPointerGetBlockNumber(tid));
-                     scan->xs_next_hot = ItemPointerGetOffsetNumber(ctid);
-                     scan->xs_prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
-                 }
-                 else
-                     scan->xs_next_hot = InvalidOffsetNumber;
-
-                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
-
-                 pgstat_count_heap_fetch(scan->indexRelation);
-
-                 return heapTuple;
-             }
-
-             /*
-              * If we can't see it, maybe no one else can either.  Check to see
-              * if the tuple is dead to all transactions.  If we find that all
-              * the tuples in the HOT chain are dead, we'll signal the index AM
-              * to not return that TID on future indexscans.
-              */
-             if (scan->xs_hot_dead &&
-                 HeapTupleSatisfiesVacuum(heapTuple->t_data, RecentGlobalXmin,
-                                          scan->xs_cbuf) != HEAPTUPLE_DEAD)
-                 scan->xs_hot_dead = false;
-
-             /*
-              * Check to see if HOT chain continues past this tuple; if so
-              * fetch the next offnum (we don't bother storing it into
-              * xs_next_hot, but must store xs_prev_xmax), and loop around.
-              */
-             if (HeapTupleIsHotUpdated(heapTuple))
-             {
-                 Assert(ItemPointerGetBlockNumber(ctid) ==
-                        ItemPointerGetBlockNumber(tid));
-                 offnum = ItemPointerGetOffsetNumber(ctid);
-                 at_chain_start = false;
-                 scan->xs_prev_xmax = HeapTupleHeaderGetXmax(heapTuple->t_data);
-             }
-             else
-                 break;            /* end of chain */
-         }                        /* loop over a single HOT chain */
-
-         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
-
-         /* Loop around to ask index AM for another TID */
-         scan->xs_next_hot = InvalidOffsetNumber;
-     }

!     /* Release any held pin on a heap page */
!     if (BufferIsValid(scan->xs_cbuf))
      {
!         ReleaseBuffer(scan->xs_cbuf);
!         scan->xs_cbuf = InvalidBuffer;
      }

!     return NULL;                /* failure exit */
  }

  /* ----------------
--- 417,541 ----
      Assert(TransactionIdIsValid(RecentGlobalXmin));

      /*
!      * The AM's gettuple proc finds the next index entry matching the scan
!      * keys, and puts the TID in xs_ctup.t_self. It should also set
!      * scan->xs_recheck, though we pay no attention to that here.
       */
!     found = DatumGetBool(FunctionCall2(procedure,
!                                        PointerGetDatum(scan),
!                                        Int32GetDatum(direction)));

!     /* Reset kill flag immediately for safety */
!     scan->kill_prior_tuple = false;

!     /* If we're out of index entries, release any held pin on a heap page */
!     if (!found)
!     {
!         if (BufferIsValid(scan->xs_cbuf))
          {
!             ReleaseBuffer(scan->xs_cbuf);
!             scan->xs_cbuf = InvalidBuffer;
          }
+         return false;
+     }

!     pgstat_count_index_tuples(scan->indexRelation, 1);

!     return true;
! }


! /* ----------------
!  *        index_fetch - fetch the heap tuple the current index pointer points to
!  *
!  * The result is the heap tuple the current index tuple points to if it
!  * matches the snapshot, or NULL if it doesn't. On success, the buffer
!  * containing the heap tuple is pinned (the pin will be dropped at the next
!  * index_fetch or index_endscan).
!  * ----------------
!  */
! HeapTuple
! index_fetch(IndexScanDesc scan)
! {
!     ItemPointer tid = &scan->xs_ctup.t_self;
!     bool        found;
!     Buffer        prev_buf;
!
!     /*
!      * We only fetch the first matching tuple in a chain, so we can't support
!      * SnapshotAny. All other snapshots only consider one tuple in a HOT chain
!      * as visible at any given moment.
!      */
!     Assert(scan->xs_snapshot != SnapshotAny);
!
!     /* Switch to correct buffer if we don't have it already */
!     prev_buf = scan->xs_cbuf;
!     scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
!                                          scan->heapRelation,
!                                          ItemPointerGetBlockNumber(tid));
!
!     /*
!      * Prune page, but only if we weren't already on this page
!      */
!     if (prev_buf != scan->xs_cbuf)
!         heap_page_prune_opt(scan->heapRelation, scan->xs_cbuf,
!                             RecentGlobalXmin);
!
!     /* Obtain share-lock on the buffer so we can examine visibility */
!     LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!
!     found = heap_hot_search_buffer(tid, scan->xs_cbuf,
!                                    scan->xs_snapshot, &scan->xs_ctup,
!                                    &scan->kill_prior_tuple, true);
!     LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!
!     if (found)
      {
!         Assert(!scan->kill_prior_tuple);
!
!         /*
!          * We must initialize all of *heapTuple (ie, scan->xs_ctup) since
!          * it is returned to the executor on success.
!          */
!         scan->xs_ctup.t_tableOid = RelationGetRelid(scan->heapRelation);
!
!         pgstat_count_heap_fetch(scan->indexRelation);
!
!         return &scan->xs_ctup;
      }
+     else
+         return NULL;
+ }

! /* ----------------
!  *        index_getnext - get the next heap tuple from a scan
!  *
!  * This is a shorthand for index_next() + index_fetch().
!  *
!  * The result is the next heap tuple satisfying the scan keys and the
!  * snapshot, or NULL if no more matching tuples exist.    On success,
!  * the buffer containing the heap tuple is pinned (the pin will be dropped
!  * at the next index_getnext or index_endscan).
!  *
!  * Note: caller must check scan->xs_recheck, and perform rechecking of the
!  * scan keys if required.  We do not do that here because we don't have
!  * enough information to do it efficiently in the general case.
!  * ----------------
!  */
! HeapTuple
! index_getnext(IndexScanDesc scan, ScanDirection direction)
! {
!     for(;;)
!     {
!         HeapTuple tup;
!
!         if (!index_next(scan, direction))
!             return NULL;
!
!         tup = index_fetch(scan);
!         if (tup != NULL)
!             return tup;
!     }
  }

  /* ----------------
*** a/src/backend/commands/cluster.c
--- b/src/backend/commands/cluster.c
***************
*** 772,777 **** copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
--- 772,778 ----
      TransactionId OldestXmin;
      TransactionId FreezeXid;
      RewriteState rwstate;
+     ItemPointerData tid;

      /*
       * Open the relations we need.
***************
*** 829,847 **** copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex)
      scan = index_beginscan(OldHeap, OldIndex,
                             SnapshotAny, 0, (ScanKey) NULL);

!     while ((tuple = index_getnext(scan, ForwardScanDirection)) != NULL)
      {
          HeapTuple    copiedTuple;
          bool        isdead;
          int            i;

          CHECK_FOR_INTERRUPTS();

!         /* Since we used no scan keys, should never need to recheck */
!         if (scan->xs_recheck)
!             elog(ERROR, "CLUSTER does not support lossy index conditions");

!         LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);

          switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
                                           scan->xs_cbuf))
--- 830,885 ----
      scan = index_beginscan(OldHeap, OldIndex,
                             SnapshotAny, 0, (ScanKey) NULL);

!     tuple = NULL;
!     for(;;)
      {
          HeapTuple    copiedTuple;
          bool        isdead;
          int            i;
+         bool        found;

          CHECK_FOR_INTERRUPTS();

!         /* Advance to next index tuple if we're done with current HOT chain */
!         if (tuple == NULL)
!         {
!             if (!index_next(scan, ForwardScanDirection))
!                 break;
!
!             /* Since we used no scan keys, should never need to recheck */
!             if (scan->xs_recheck)
!                 elog(ERROR, "CLUSTER does not support lossy index conditions");
!
!             tid = scan->xs_ctup.t_self;

!             scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf, OldHeap,
!                                                  ItemPointerGetBlockNumber(&tid));
!             /* Obtain share-lock on the buffer so we can examine visibility */
!             LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!             found = heap_hot_search_buffer(&tid, scan->xs_cbuf,
!                                            scan->xs_snapshot, &scan->xs_ctup,
!                                            &scan->kill_prior_tuple, true);
!             if (!found)
!             {
!                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!                 break;
!             }
!         }
!         else
!         {
!             /* Fetch next tuple from current HOT chain */
!             LockBuffer(scan->xs_cbuf, BUFFER_LOCK_SHARE);
!             found = heap_hot_search_buffer(&tid, scan->xs_cbuf,
!                                            scan->xs_snapshot, &scan->xs_ctup,
!                                            &scan->kill_prior_tuple, false);
!             if (!found)
!             {
!                 LockBuffer(scan->xs_cbuf, BUFFER_LOCK_UNLOCK);
!                 tuple = NULL;
!                 continue;
!             }
!         }
!         tuple = &scan->xs_ctup;

          switch (HeapTupleSatisfiesVacuum(tuple->t_data, OldestXmin,
                                           scan->xs_cbuf))
*** a/src/backend/executor/nodeBitmapHeapscan.c
--- b/src/backend/executor/nodeBitmapHeapscan.c
***************
*** 382,390 **** bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
          {
              OffsetNumber offnum = tbmres->offsets[curslot];
              ItemPointerData tid;

              ItemPointerSet(&tid, page, offnum);
!             if (heap_hot_search_buffer(&tid, buffer, snapshot, NULL))
                  scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
          }
      }
--- 382,392 ----
          {
              OffsetNumber offnum = tbmres->offsets[curslot];
              ItemPointerData tid;
+             HeapTupleData heapTuple;

              ItemPointerSet(&tid, page, offnum);
!             if (heap_hot_search_buffer(&tid, buffer, snapshot, &heapTuple,
!                                        NULL, true))
                  scan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
          }
      }
*** a/src/include/access/genam.h
--- b/src/include/access/genam.h
***************
*** 117,122 **** extern void index_endscan(IndexScanDesc scan);
--- 117,125 ----
  extern void index_markpos(IndexScanDesc scan);
  extern void index_restrpos(IndexScanDesc scan);
  extern HeapTuple index_getnext(IndexScanDesc scan, ScanDirection direction);
+ extern bool index_next(IndexScanDesc scan, ScanDirection direction);
+ extern HeapTuple index_fetch(IndexScanDesc scan);
+
  extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);

  extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 83,89 **** extern bool heap_fetch(Relation relation, Snapshot snapshot,
             HeapTuple tuple, Buffer *userbuf, bool keep_buf,
             Relation stats_relation);
  extern bool heap_hot_search_buffer(ItemPointer tid, Buffer buffer,
!                        Snapshot snapshot, bool *all_dead);
  extern bool heap_hot_search(ItemPointer tid, Relation relation,
                  Snapshot snapshot, bool *all_dead);

--- 83,90 ----
             HeapTuple tuple, Buffer *userbuf, bool keep_buf,
             Relation stats_relation);
  extern bool heap_hot_search_buffer(ItemPointer tid, Buffer buffer,
!                        Snapshot snapshot, HeapTuple tuple,
!                        bool *all_dead, bool firstCall);
  extern bool heap_hot_search(ItemPointer tid, Relation relation,
                  Snapshot snapshot, bool *all_dead);

*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 77,87 **** typedef struct IndexScanDescData
      Buffer        xs_cbuf;        /* current heap buffer in scan, if any */
      /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
      bool        xs_recheck;        /* T means scan keys must be rechecked */
-
-     /* state data for traversing HOT chains in index_getnext */
-     bool        xs_hot_dead;    /* T if all members of HOT chain are dead */
-     OffsetNumber xs_next_hot;    /* next member of HOT chain, if any */
-     TransactionId xs_prev_xmax; /* previous HOT chain member's XMAX, if any */
  } IndexScanDescData;

  /* Struct for heap-or-index scans of system tables */
--- 77,82 ----

Re: Index-only-scans, indexam API changes

From
Greg Stark
Date:
On Mon, Jul 13, 2009 at 8:19 AM, Heikki
Linnakangas<heikki.linnakangas@enterprisedb.com> wrote:
>
> I propose that we split index_getnext into two steps: fetching the next
> match from the index (index_next()), and fetching the corresponding heap
> tuple (index_fetch()).

A pretty trivial concern, but it seems confusing that the function to
fetch a tuple from the heap is called "index_fetch()"... Perhaps
something more explicit like index_fetch_heap_tuple() would be
clearer?

-- 
greg
http://mit.edu/~gsstark/resume.pdf


Re: Index-only-scans, indexam API changes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> At the moment, amgettuple only returns pointers to heap tuples. There is
> no way to return data from the index tuples. That needs to be changed to
> support index-only scans.

What are you going to do for index types that don't store the original
data (e.g. hash)?
        regards, tom lane


Re: Index-only-scans, indexam API changes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
>> At the moment, amgettuple only returns pointers to heap tuples. There is
>> no way to return data from the index tuples. That needs to be changed to
>> support index-only scans.
> 
> What are you going to do for index types that don't store the original
> data (e.g. hash)?

They will obviously not be able to regurgitate index tuples. I have not
yet decided how that's going to be signaled. In the prototype patch I
have, I have hard-coded that only b-trees can do that. A new
"amcanreturntuples" column in pg_am seems like the most straightforward way.

(This indexam API patch isn't bothered with that yet. It just splits
index_getnext() into two.)
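
To make the division of labor concrete, here is a stand-alone toy model of
the two-step protocol (the `Toy*` types and functions are stubs invented for
illustration only; the real functions operate on IndexScanDesc and live in
indexam.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for PostgreSQL's scan machinery (illustrative only). */
typedef struct { int tid; bool visible; } ToyHeapTuple;

typedef struct
{
    ToyHeapTuple *heap;     /* "heap" the index points into */
    int          *matches;  /* TIDs returned by the index, in scan order */
    int           nmatches;
    int           pos;      /* current position, -1 before the first call */
} ToyScan;

/* Step 1: advance to the next index match; only the TID is known so far. */
static bool
toy_index_next(ToyScan *scan)
{
    return ++scan->pos < scan->nmatches;
}

/* Step 2: fetch the heap tuple for the current match, NULL if not visible. */
static ToyHeapTuple *
toy_index_fetch(ToyScan *scan)
{
    ToyHeapTuple *tup = &scan->heap[scan->matches[scan->pos]];

    return tup->visible ? tup : NULL;
}

/* The old one-call interface, kept as a shorthand over the two steps. */
static ToyHeapTuple *
toy_index_getnext(ToyScan *scan)
{
    while (toy_index_next(scan))
    {
        ToyHeapTuple *tup = toy_index_fetch(scan);

        if (tup != NULL)
            return tup;
    }
    return NULL;
}
```

The point of the split is that a caller can do work between the two steps,
such as evaluating quals against the index tuple, and skip the heap fetch
entirely when the qual fails.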

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


Re: Index-only-scans, indexam API changes

From
Tom Lane
Date:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Tom Lane wrote:
>> What are you going to do for index types that don't store the original
>> data (e.g. hash)?

> They will obviously not be able to regurgitate index tuples. I have not
> yet decided how that's going to be signaled.

Well, I think that's a pretty critical part of the API change.

> (This indexam API patch isn't bothered with that yet. It just splits
> index_getnext() into two.)

One thought here is that an AM call isn't really free, and doing two of
them instead of one mightn't be such a good idea.  I would suggest
either having a separate AM entry point to get both bits of data
("amgettupledata"?) or adding an optional parameter to amgettuple.
A small attraction of the alternate entry point is that then you can
flag the "unsupported" case by putting a zero in that pg_am column, as
indeed we already do for amgettuple; so you don't need an additional
bool column.

[ thinks a bit ... ]  At least for GIST, it is possible that whether
data can be regurgitated will vary depending on the selected opclass.
Some opclasses use the STORAGE modifier and some don't.  I am not sure
how hard we want to work to support flexibility there.  Would it be
sufficient to hard-code the check as "pg_am says the AM can do it,
and the opclass does not have a STORAGE property"?  Or do we need
additional intelligence about GIST opclasses?
        regards, tom lane


Re: Index-only-scans, indexam API changes

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> One thought here is that an AM call isn't really free, and doing two of
> them instead of one mightn't be such a good idea.  I would suggest
> either having a separate AM entry point to get both bits of data
> ("amgettupledata"?) or adding an optional parameter to amgettuple.

I'm thinking of adding a new flag to IndexScanDesc, "needIndexTuple".
When that is set, amgettuple stores a pointer to the IndexTuple in
another new field in IndexScanDesc, xs_itup. (In case of b-tree at
least, the IndexTuple is a palloc'd copy of the tuple on disk)
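
The contract I have in mind is roughly the following stand-alone sketch (the
`Toy*` names are stubs for illustration; the real fields would live in
IndexScanDesc, and returning NULL is always legal for the AM):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy model of the proposed contract; names mirror the patch, types are stubs. */
typedef struct { int key; } ToyIndexTuple;

typedef struct
{
    bool           needIndexTuple; /* caller wants the index tuple if cheap */
    ToyIndexTuple *xs_itup;        /* copied tuple, or NULL if the AM gave up */
} ToyIndexScanDesc;

/*
 * AM side: on each gettuple call, free the previous copy and, if requested,
 * try to return a fresh copy.  Passing ondisk == NULL here models the AM
 * failing to re-find the tuple, e.g. after concurrent page changes.
 */
static void
toy_am_gettuple(ToyIndexScanDesc *scan, const ToyIndexTuple *ondisk)
{
    if (!scan->needIndexTuple)
        return;

    free(scan->xs_itup);
    scan->xs_itup = NULL;

    if (ondisk != NULL)
    {
        scan->xs_itup = malloc(sizeof(ToyIndexTuple));
        memcpy(scan->xs_itup, ondisk, sizeof(ToyIndexTuple));
    }
}

/* Executor side: use the index tuple when present, else fall back to heap. */
static int
toy_get_key(ToyIndexScanDesc *scan, int heap_key)
{
    if (scan->xs_itup != NULL)
        return scan->xs_itup->key;  /* answered from the index alone */
    return heap_key;                /* fall back to a heap fetch */
}
```

Since xs_itup is best-effort, the executor never depends on it being set; it
only saves a heap visit when the AM does manage to fill it in.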

> [ thinks a bit ... ]  At least for GIST, it is possible that whether
> data can be regurgitated will vary depending on the selected opclass.
> Some opclasses use the STORAGE modifier and some don't.  I am not sure
> how hard we want to work to support flexibility there.  Would it be
> sufficient to hard-code the check as "pgam says the AM can do it,
> and the opclass does not have a STORAGE property"?  Or do we need
> additional intelligence about GIST opclasses?

Well, the way I have it implemented is that the am is not required to
return the index tuple, even if requested. I implemented the B-tree
changes similar to how we implement "kill_prior_tuple": in btgettuple,
lock the index page and see if the tuple is still at the same position
that we remembered in the scan opaque struct. If not (because of
concurrent changes to the page), we give up and don't return the index
tuple. The executor will then perform a heap fetch as before.

Alternatively, we could copy all the matching index tuples to private
memory when we step to a new index page, but that seems pretty expensive
if we're only going to use the first few matching tuples (LIMIT), and I
fear the memory management gets complicated. But in any case, the GiST
issue would still be there.

Since we're discussing it, I'm attaching the prototype patch I have for
returning tuples from b-tree and using them to filter rows before heap
fetches. I was going to post it in the morning along with description
about the planner and executor changes, but here goes. It applies on top
of the indexam API patch I started this thread with.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
*** a/src/backend/access/index/genam.c
--- b/src/backend/access/index/genam.c
***************
*** 91,102 **** RelationGetIndexScan(Relation indexRelation,
--- 91,104 ----

      scan->kill_prior_tuple = false;
      scan->ignore_killed_tuples = true;    /* default setting */
+     scan->needIndexTuple = false;

      scan->opaque = NULL;

      ItemPointerSetInvalid(&scan->xs_ctup.t_self);
      scan->xs_ctup.t_data = NULL;
      scan->xs_cbuf = InvalidBuffer;
+     scan->xs_itup = NULL;

      /*
       * Let the AM fill in the key and any opaque data it wants.
*** a/src/backend/access/index/indexam.c
--- b/src/backend/access/index/indexam.c
***************
*** 398,404 **** index_restrpos(IndexScanDesc scan)
   *
   * Advances to the next index tuple satisfying the scan keys, returning TRUE
   * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.self.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
--- 398,406 ----
   *
   * Advances to the next index tuple satisfying the scan keys, returning TRUE
   * if there was one, FALSE otherwise. The heap TID pointer of the next match
!  * is stored in scan->xs_ctup.self. If scan->needIndexTuple is TRUE, the
!  * index tuple itself is stored in scan->xs_itup, if the indexam succeeded in
!  * fetching it.
   *
   * Note: caller must check scan->xs_recheck, and perform rechecking of the
   * scan keys if required.  We do not do that here because we don't have
*** a/src/backend/access/nbtree/nbtree.c
--- b/src/backend/access/nbtree/nbtree.c
***************
*** 280,285 **** btgettuple(PG_FUNCTION_ARGS)
--- 280,297 ----
      else
          res = _bt_first(scan, dir);

+     /* Fetch the index tuple if needed */
+     if (scan->needIndexTuple)
+     {
+         if (scan->xs_itup != NULL)
+         {
+             pfree(scan->xs_itup);
+             scan->xs_itup = NULL;
+         }
+         if (res)
+             scan->xs_itup = _bt_getindextuple(scan);
+     }
+
      PG_RETURN_BOOL(res);
  }

*** a/src/backend/access/nbtree/nbtsearch.c
--- b/src/backend/access/nbtree/nbtsearch.c
***************
*** 963,968 **** _bt_next(IndexScanDesc scan, ScanDirection dir)
--- 963,1031 ----
      return true;
  }

+
+ /*
+  * _bt_getindextuple - fetch index tuple at current position.
+  *
+  * The caller must have pin on so->currPos.buf.
+  *
+  * The tuple returned is a palloc'd copy. This can fail to find the tuple if
+  * new tuples have been inserted to the page since we stepped on this index
+  * page. NULL is returned in that case.
+  */
+ IndexTuple
+ _bt_getindextuple(IndexScanDesc scan)
+ {
+     BTScanOpaque so = (BTScanOpaque) scan->opaque;
+     Page        page;
+     BTPageOpaque opaque;
+     OffsetNumber minoff;
+     OffsetNumber maxoff;
+     IndexTuple    ituple, result;
+     int itemIndex;
+     BTScanPosItem *positem;
+     OffsetNumber offnum;
+
+     Assert(BufferIsValid(so->currPos.buf));
+
+     LockBuffer(so->currPos.buf, BT_READ);
+
+     page = BufferGetPage(so->currPos.buf);
+     opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+     minoff = P_FIRSTDATAKEY(opaque);
+     maxoff = PageGetMaxOffsetNumber(page);
+
+     itemIndex = so->currPos.itemIndex;
+     positem = &so->currPos.items[itemIndex];
+     offnum = positem->indexOffset;
+
+     Assert(itemIndex >= so->currPos.firstItem &&
+            itemIndex <= so->currPos.lastItem);
+     if (offnum < minoff)
+     {
+         /* pure paranoia */
+         LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+         return NULL;
+     }
+
+     ituple = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+
+     if (ItemPointerEquals(&ituple->t_tid, &scan->xs_ctup.t_self))
+     {
+         /* found the item */
+         Size itupsz = IndexTupleSize(ituple);
+
+         result = palloc(itupsz);
+         memcpy(result, ituple, itupsz);
+     }
+     else
+         result = NULL;
+
+     LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
+
+     return result;
+ }
+
  /*
   *    _bt_readpage() -- Load data from current index page into so->currPos
   *
*** a/src/backend/commands/explain.c
--- b/src/backend/commands/explain.c
***************
*** 828,833 **** explain_outNode(StringInfo str,
--- 828,838 ----
                             ((Scan *) plan)->scanrelid,
                             plan, outer_plan,
                             str, indent, es);
+             show_scan_qual(((IndexScan *)plan)->indexonlyqual,
+                                "Index-Only Filter",
+                            ((Scan *) plan)->scanrelid,
+                            plan, outer_plan,
+                            str, indent, es);
              show_scan_qual(plan->qual,
                             "Filter",
                             ((Scan *) plan)->scanrelid,
*** a/src/backend/executor/execTuples.c
--- b/src/backend/executor/execTuples.c
***************
*** 92,97 ****
--- 92,98 ----
  #include "postgres.h"

  #include "funcapi.h"
+ #include "access/itup.h"
  #include "catalog/pg_type.h"
  #include "nodes/nodeFuncs.h"
  #include "storage/bufmgr.h"
***************
*** 577,582 **** ExecStoreVirtualTuple(TupleTableSlot *slot)
--- 578,624 ----
      return slot;
  }

+ TupleTableSlot *
+ ExecStoreIndexTuple(IndexTuple tuple, TupleTableSlot *slot, bool shouldFree)
+ {
+     int attnum;
+
+     /*
+      * sanity checks
+      */
+     Assert(tuple != NULL);
+     Assert(slot != NULL);
+     Assert(slot->tts_tupleDescriptor != NULL);
+
+     /*
+      * Free any old physical tuple belonging to the slot.
+      */
+     if (slot->tts_shouldFree)
+         heap_freetuple(slot->tts_tuple);
+     if (slot->tts_shouldFreeMin)
+         heap_free_minimal_tuple(slot->tts_mintuple);
+
+     /*
+      * Store the new tuple into the specified slot.
+      */
+     slot->tts_isempty = false;
+     slot->tts_shouldFree = shouldFree;
+     slot->tts_shouldFreeMin = false;
+     slot->tts_tuple = NULL;
+     slot->tts_mintuple = NULL;
+
+     for(attnum = 1; attnum <= slot->tts_tupleDescriptor->natts; attnum++)
+     {
+         slot->tts_values[attnum - 1] =
+             index_getattr(tuple, attnum, slot->tts_tupleDescriptor,
+                           &slot->tts_isnull[attnum - 1]);
+     }
+
+     slot->tts_nvalid = slot->tts_tupleDescriptor->natts;
+
+     return slot;
+ }
+
  /* --------------------------------
   *        ExecStoreAllNullTuple
   *            Set up the slot to contain a null in every column.
*** a/src/backend/executor/nodeIndexscan.c
--- b/src/backend/executor/nodeIndexscan.c
***************
*** 35,52 ****
  #include "utils/memutils.h"


- static TupleTableSlot *IndexNext(IndexScanState *node);
-
-
  /* ----------------------------------------------------------------
!  *        IndexNext
   *
   *        Retrieve a tuple from the IndexScan node's currentRelation
   *        using the index specified in the IndexScanState information.
   * ----------------------------------------------------------------
   */
! static TupleTableSlot *
! IndexNext(IndexScanState *node)
  {
      EState       *estate;
      ExprContext *econtext;
--- 35,49 ----
  #include "utils/memutils.h"


  /* ----------------------------------------------------------------
!  *        ExecIndexScan(node)
   *
   *        Retrieve a tuple from the IndexScan node's currentRelation
   *        using the index specified in the IndexScanState information.
   * ----------------------------------------------------------------
   */
! TupleTableSlot *
! ExecIndexScan(IndexScanState *node)
  {
      EState       *estate;
      ExprContext *econtext;
***************
*** 55,60 **** IndexNext(IndexScanState *node)
--- 52,89 ----
      Index        scanrelid;
      HeapTuple    tuple;
      TupleTableSlot *slot;
+     IndexScan *plan = (IndexScan *) node->ss.ps.plan;
+     ExprDoneCond isDone;
+     TupleTableSlot *resultSlot;
+
+     /*
+      * If we have runtime keys and they've not already been set up, do it now.
+      */
+     if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+         ExecReScan((PlanState *) node, NULL);
+
+     /*
+      * Check to see if we're still projecting out tuples from a previous scan
+      * tuple (because there is a function-returning-set in the projection
+      * expressions).  If so, try to project another one.
+      */
+     if (node->ss.ps.ps_TupFromTlist)
+     {
+         Assert(node->ss.ps.ps_ProjInfo); /* can't get here if not projecting */
+         resultSlot = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
+         if (isDone == ExprMultipleResult)
+             return resultSlot;
+         /* Done with that source tuple... */
+         node->ss.ps.ps_TupFromTlist = false;
+     }
+
+     /*
+      * Reset per-tuple memory context to free any expression evaluation
+      * storage allocated in the previous tuple cycle.  Note this can't happen
+      * until we're done projecting out tuples from a scan tuple.
+      */
+     econtext = node->ss.ps.ps_ExprContext;
+     ResetExprContext(econtext);

      /*
       * extract necessary information from index scan node
***************
*** 62,68 **** IndexNext(IndexScanState *node)
      estate = node->ss.ps.state;
      direction = estate->es_direction;
      /* flip direction if this is an overall backward scan */
!     if (ScanDirectionIsBackward(((IndexScan *) node->ss.ps.plan)->indexorderdir))
      {
          if (ScanDirectionIsForward(direction))
              direction = BackwardScanDirection;
--- 91,97 ----
      estate = node->ss.ps.state;
      direction = estate->es_direction;
      /* flip direction if this is an overall backward scan */
!     if (ScanDirectionIsBackward(plan->indexorderdir))
      {
          if (ScanDirectionIsForward(direction))
              direction = BackwardScanDirection;
***************
*** 72,78 **** IndexNext(IndexScanState *node)
      scandesc = node->iss_ScanDesc;
      econtext = node->ss.ps.ps_ExprContext;
      slot = node->ss.ss_ScanTupleSlot;
!     scanrelid = ((IndexScan *) node->ss.ps.plan)->scan.scanrelid;

      /*
       * Check if we are evaluating PlanQual for tuple of this relation.
--- 101,107 ----
      scandesc = node->iss_ScanDesc;
      econtext = node->ss.ps.ps_ExprContext;
      slot = node->ss.ss_ScanTupleSlot;
!     scanrelid = plan->scan.scanrelid;

      /*
       * Check if we are evaluating PlanQual for tuple of this relation.
***************
*** 106,162 **** IndexNext(IndexScanState *node)
      /*
       * ok, now that we have what we need, fetch the next tuple.
       */
!     while ((tuple = index_getnext(scandesc, direction)) != NULL)
      {
          /*
           * Store the scanned tuple in the scan tuple slot of the scan state.
!          * Note: we pass 'false' because tuples returned by amgetnext are
           * pointers onto disk pages and must not be pfree()'d.
           */
          ExecStoreTuple(tuple,    /* tuple to store */
                         slot,    /* slot to store in */
!                        scandesc->xs_cbuf,        /* buffer containing tuple */
                         false);    /* don't pfree */

          /*
           * If the index was lossy, we have to recheck the index quals using
           * the real tuple.
           */
          if (scandesc->xs_recheck)
          {
-             econtext->ecxt_scantuple = slot;
              ResetExprContext(econtext);
              if (!ExecQual(node->indexqualorig, econtext, false))
                  continue;        /* nope, so ask index for another one */
          }

!         return slot;
      }

      /*
       * if we get here it means the index scan failed so we are at the end of
       * the scan..
       */
-     return ExecClearTuple(slot);
- }

- /* ----------------------------------------------------------------
-  *        ExecIndexScan(node)
-  * ----------------------------------------------------------------
-  */
- TupleTableSlot *
- ExecIndexScan(IndexScanState *node)
- {
      /*
!      * If we have runtime keys and they've not already been set up, do it now.
       */
!     if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
!         ExecReScan((PlanState *) node, NULL);

!     /*
!      * use IndexNext as access method
!      */
!     return ExecScan(&node->ss, (ExecScanAccessMtd) IndexNext);
  }

  /* ----------------------------------------------------------------
--- 135,245 ----
      /*
       * ok, now that we have what we need, fetch the next tuple.
       */
!     while (index_next(scandesc, direction))
      {
+         bool indextuple_fetched = false;
+
+         /* See if we can evaluate quals or targets using index tuple */
+         if (scandesc->needIndexTuple && scandesc->xs_itup != NULL)
+         {
+             indextuple_fetched = true;
+
+             slot = node->iss_IndexTupleSlot;
+             ExecStoreIndexTuple(scandesc->xs_itup, slot, false);
+
+             econtext->ecxt_scantuple = slot;
+             if (!ExecQual(node->iss_indexonlyqual, econtext, false))
+                 continue;
+         }
+
+         if ((tuple = index_fetch(scandesc)) == NULL)
+             continue;
+
+         slot = node->ss.ss_ScanTupleSlot;
          /*
           * Store the scanned tuple in the scan tuple slot of the scan state.
!          * Note: we pass 'false' because tuples returned by index_fetch are
           * pointers onto disk pages and must not be pfree()'d.
           */
          ExecStoreTuple(tuple,    /* tuple to store */
                         slot,    /* slot to store in */
!                        scandesc->xs_cbuf,    /* buffer containing tuple */
                         false);    /* don't pfree */

+         econtext->ecxt_scantuple = slot;
+
+         /* check index-only quals if we failed to get index tuple */
+         if (!indextuple_fetched && node->iss_indexonlyqual != NULL &&
+             !ExecQual(node->iss_indexonlyqual, econtext, false))
+             continue;
+
+         /* check additional quals */
+         if (node->ss.ps.qual &&
+             !ExecQual(node->ss.ps.qual, econtext, false))
+         {
+             /* Failed quals */
+             continue;
+         }
+
          /*
           * If the index was lossy, we have to recheck the index quals using
           * the real tuple.
           */
          if (scandesc->xs_recheck)
          {
              ResetExprContext(econtext);
              if (!ExecQual(node->indexqualorig, econtext, false))
                  continue;        /* nope, so ask index for another one */
          }

!         /* A visible tuple was found */
!
!         /*
!          * Found a satisfactory scan tuple.
!          */
!         if (node->ss.ps.ps_ProjInfo)
!         {
!             /*
!              * Form a projection tuple, store it in the result tuple slot
!              * and return it --- unless we find we can project no tuples
!              * from this scan tuple, in which case continue scan.
!              */
!             resultSlot = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
!             if (isDone != ExprEndResult)
!             {
!                 node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
!                 return resultSlot;
!             }
!         }
!         else
!         {
!             /*
!              * Here, we aren't projecting, so just return scan tuple.
!              */
!             return slot;
!         }
      }

      /*
       * if we get here it means the index scan failed so we are at the end of
       * the scan..
       */

      /*
!      * if the slot returned by the accessMtd contains NULL, then it means
!      * there is nothing more to scan so we just return an empty slot,
!      * being careful to use the projection result slot so it has correct
!      * tupleDesc.
       */
!     if (TupIsNull(slot))
!     {
!         if (node->ss.ps.ps_ProjInfo)
!             return ExecClearTuple(node->ss.ps.ps_ProjInfo->pi_slot);
!         else
!             return slot;
!     }

!     return ExecClearTuple(slot);
  }

  /* ----------------------------------------------------------------
***************
*** 431,436 **** ExecEndIndexScan(IndexScanState *node)
--- 514,520 ----
       */
      ExecClearTuple(node->ss.ps.ps_ResultTupleSlot);
      ExecClearTuple(node->ss.ss_ScanTupleSlot);
+     ExecClearTuple(node->iss_IndexTupleSlot);

      /*
       * close the index relation (no-op if we didn't open it)
***************
*** 519,531 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
          ExecInitExpr((Expr *) node->indexqualorig,
                       (PlanState *) indexstate);

! #define INDEXSCAN_NSLOTS 2

      /*
       * tuple table initialization
       */
      ExecInitResultTupleSlot(estate, &indexstate->ss.ps);
      ExecInitScanTupleSlot(estate, &indexstate->ss);

      /*
       * open the base relation and acquire appropriate lock on it.
--- 603,624 ----
          ExecInitExpr((Expr *) node->indexqualorig,
                       (PlanState *) indexstate);

!     indexstate->iss_indexonlyqual = (List *)
!         ExecInitExpr((Expr *) node->indexonlyqual,
!                      (PlanState *) indexstate);
!
!     indexstate->iss_indexonlyqualorig = (List *)
!         ExecInitExpr((Expr *) node->indexonlyqualorig,
!                      (PlanState *) indexstate);
!
! #define INDEXSCAN_NSLOTS 3

      /*
       * tuple table initialization
       */
      ExecInitResultTupleSlot(estate, &indexstate->ss.ps);
      ExecInitScanTupleSlot(estate, &indexstate->ss);
+     indexstate->iss_IndexTupleSlot = ExecAllocTableSlot(estate->es_tupleTable);

      /*
       * open the base relation and acquire appropriate lock on it.
***************
*** 566,571 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
--- 659,671 ----
                                       relistarget ? NoLock : AccessShareLock);

      /*
+      * Initialize projection info for returning tuples straight from the index
+      */
+     if (node->indexonlyqual != NIL)
+         ExecSetSlotDescriptor(indexstate->iss_IndexTupleSlot,
+                               RelationGetDescr(indexstate->iss_RelationDesc));
+
+     /*
       * Initialize index-specific scan state
       */
      indexstate->iss_RuntimeKeysReady = false;
***************
*** 611,616 **** ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
--- 711,718 ----
                                                 estate->es_snapshot,
                                                 indexstate->iss_NumScanKeys,
                                                 indexstate->iss_ScanKeys);
+     if (node->indexonlyqual != NIL)
+         indexstate->iss_ScanDesc->needIndexTuple = true;

      /*
       * all done.
*** a/src/backend/nodes/copyfuncs.c
--- b/src/backend/nodes/copyfuncs.c
***************
*** 313,318 **** _copyIndexScan(IndexScan *from)
--- 313,320 ----
      COPY_NODE_FIELD(indexqual);
      COPY_NODE_FIELD(indexqualorig);
      COPY_SCALAR_FIELD(indexorderdir);
+     COPY_NODE_FIELD(indexonlyqual);
+     COPY_NODE_FIELD(indexonlyqualorig);

      return newnode;
  }
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
***************
*** 399,404 **** _outIndexScan(StringInfo str, IndexScan *node)
--- 399,406 ----
      WRITE_NODE_FIELD(indexqual);
      WRITE_NODE_FIELD(indexqualorig);
      WRITE_ENUM_FIELD(indexorderdir, ScanDirection);
+     WRITE_NODE_FIELD(indexonlyqual);
+     WRITE_NODE_FIELD(indexonlyqualorig);
  }

  static void
*** a/src/backend/optimizer/path/costsize.c
--- b/src/backend/optimizer/path/costsize.c
***************
*** 1,3 ****
--- 1,4 ----
+
  /*-------------------------------------------------------------------------
   *
   * costsize.c
***************
*** 192,197 **** cost_seqscan(Path *path, PlannerInfo *root,
--- 193,200 ----
   *
   * 'index' is the index to be used
   * 'indexQuals' is the list of applicable qual clauses (implicit AND semantics)
+  * 'indexOnlyQuals' is the list of qual clauses that are applied to tuples
+  *        fetched from index, before fetching the heap tuple
   * 'outer_rel' is the outer relation when we are considering using the index
   *        scan as the inside of a nestloop join (hence, some of the indexQuals
   *        are join clauses, and we should expect repeated scans of the index);
***************
*** 213,218 **** void
--- 216,222 ----
  cost_index(IndexPath *path, PlannerInfo *root,
             IndexOptInfo *index,
             List *indexQuals,
+            List *indexOnlyQuals,
             RelOptInfo *outer_rel)
  {
      RelOptInfo *baserel = index->rel;
***************
*** 221,226 **** cost_index(IndexPath *path, PlannerInfo *root,
--- 225,231 ----
      Cost        indexStartupCost;
      Cost        indexTotalCost;
      Selectivity indexSelectivity;
+     Selectivity indexOnlyQualSelectivity;
      double        indexCorrelation,
                  csquared;
      Cost        min_IO_cost,
***************
*** 254,259 **** cost_index(IndexPath *path, PlannerInfo *root,
--- 259,267 ----
                       PointerGetDatum(&indexSelectivity),
                       PointerGetDatum(&indexCorrelation));

+     indexOnlyQualSelectivity =
+         clauselist_selectivity(root, indexOnlyQuals, 0, JOIN_INNER, NULL);
+
      /*
       * Save amcostestimate's results for possible use in bitmap scan planning.
       * We don't bother to save indexStartupCost or indexCorrelation, because a
***************
*** 267,273 **** cost_index(IndexPath *path, PlannerInfo *root,
      run_cost += indexTotalCost - indexStartupCost;

      /* estimate number of main-table tuples fetched */
!     tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);

      /*----------
       * Estimate number of main-table pages fetched, and compute I/O cost.
--- 275,282 ----
      run_cost += indexTotalCost - indexStartupCost;

      /* estimate number of main-table tuples fetched */
!     tuples_fetched = clamp_row_est(
!         indexOnlyQualSelectivity * indexSelectivity * baserel->tuples);

      /*----------
       * Estimate number of main-table pages fetched, and compute I/O cost.
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
***************
*** 27,32 ****
--- 27,33 ----
  #include "optimizer/cost.h"
  #include "optimizer/pathnode.h"
  #include "optimizer/paths.h"
+ #include "optimizer/planmain.h"
  #include "optimizer/predtest.h"
  #include "optimizer/restrictinfo.h"
  #include "optimizer/var.h"
***************
*** 290,295 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
--- 291,297 ----
          IndexOptInfo *index = (IndexOptInfo *) lfirst(ilist);
          IndexPath  *ipath;
          List       *restrictclauses;
+         List       *indexonlyQuals;
          List       *index_pathkeys;
          List       *useful_pathkeys;
          bool        useful_predicate;
***************
*** 397,411 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
              useful_pathkeys = NIL;

          /*
           * 3. Generate an indexscan path if there are relevant restriction
           * clauses in the current clauses, OR the index ordering is
           * potentially useful for later merging or final output ordering, OR
           * the index has a predicate that was proven by the current clauses.
           */
!         if (found_clause || useful_pathkeys != NIL || useful_predicate)
          {
              ipath = create_index_path(root, index,
                                        restrictclauses,
                                        useful_pathkeys,
                                        index_is_ordered ?
                                        ForwardScanDirection :
--- 399,451 ----
              useful_pathkeys = NIL;

          /*
+          * X. The index might be useful if data from it can be used to
+          * evaluate any restrictions not used as index keys.
+          */
+         {
+             ListCell *l, *l2;
+
+             indexonlyQuals = NIL;
+             foreach(l, rel->baserestrictinfo)
+             {
+                 RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+                 Expr *indexonlyQual;
+                 bool indexed = false;
+
+                 /*
+                  * Only consider restrictions that are not handled by the
+                  * index
+                  */
+                 foreach(l2, restrictclauses)
+                 {
+                     if (list_member((List *) lfirst(l2), rinfo))
+                     {
+                         indexed = true;
+                         break;
+                     }
+                 }
+                 if (!indexed)
+                 {
+                     indexonlyQual = make_indexonly_expr(root, index, rinfo->clause,
+                                                         NULL);
+                     if (indexonlyQual != NULL)
+                         indexonlyQuals = lappend(indexonlyQuals, rinfo->clause);
+                 }
+             }
+         }
+
+         /*
           * 3. Generate an indexscan path if there are relevant restriction
           * clauses in the current clauses, OR the index ordering is
           * potentially useful for later merging or final output ordering, OR
           * the index has a predicate that was proven by the current clauses.
           */
!         if (found_clause || useful_pathkeys != NIL || useful_predicate ||
!             indexonlyQuals != NIL)
          {
              ipath = create_index_path(root, index,
                                        restrictclauses,
+                                       indexonlyQuals,
                                        useful_pathkeys,
                                        index_is_ordered ?
                                        ForwardScanDirection :
***************
*** 429,434 **** find_usable_indexes(PlannerInfo *root, RelOptInfo *rel,
--- 469,475 ----
              {
                  ipath = create_index_path(root, index,
                                            restrictclauses,
+                                           indexonlyQuals,
                                            useful_pathkeys,
                                            BackwardScanDirection,
                                            outer_rel);
*** a/src/backend/optimizer/plan/createplan.c
--- b/src/backend/optimizer/plan/createplan.c
***************
*** 871,885 **** create_indexscan_plan(PlannerInfo *root,
--- 871,902 ----
      Index        baserelid = best_path->path.parent->relid;
      Oid            indexoid = best_path->indexinfo->indexoid;
      List       *qpqual;
+     List       *indexonlyqual;
      List       *stripped_indexquals;
+     List       *indexonlyqualorig;
      List       *fixed_indexquals;
      ListCell   *l;
      IndexScan  *scan_plan;
+     RelOptInfo *reloptinfo;
+     IndexOptInfo *indexoptinfo;

      /* it should be a base rel... */
      Assert(baserelid > 0);
      Assert(best_path->path.parent->rtekind == RTE_RELATION);

+     /* Find the index that was used */
+     reloptinfo =  root->simple_rel_array[baserelid];
+     foreach(l, reloptinfo->indexlist)
+     {
+         indexoptinfo = (IndexOptInfo *) lfirst(l);
+         if (indexoptinfo->indexoid == indexoid)
+             break;
+         else
+             indexoptinfo = NULL;
+     }
+     if (indexoptinfo == NULL)
+         elog(ERROR, "can't find IndexOptInfo for index scan");
+
      /*
       * Build "stripped" indexquals structure (no RestrictInfos) to pass to
       * executor as indexqualorig
***************
*** 928,936 **** create_indexscan_plan(PlannerInfo *root,
--- 945,957 ----
       * plan so that they'll be properly rechecked by EvalPlanQual testing.
       */
      qpqual = NIL;
+     indexonlyqual = NIL;
+     indexonlyqualorig = NIL;
      foreach(l, scan_clauses)
      {
          RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+         Expr *indexonlyclause;
+         bool is_indexonly;

          Assert(IsA(rinfo, RestrictInfo));
          if (rinfo->pseudoconstant)
***************
*** 952,958 **** create_indexscan_plan(PlannerInfo *root,
                          continue;
              }
          }
!         qpqual = lappend(qpqual, rinfo);
      }

      /* Sort clauses into best execution order */
--- 973,989 ----
                          continue;
              }
          }
!
!         indexonlyclause = make_indexonly_expr(root, indexoptinfo,
!                                               (Expr *) rinfo->clause,
!                                               &is_indexonly);
!         if (is_indexonly)
!         {
!             indexonlyqual = lappend(indexonlyqual, indexonlyclause);
!             indexonlyqualorig = lappend(indexonlyqualorig, rinfo->clause);
!         }
!         else
!             qpqual = lappend(qpqual, rinfo);
      }

      /* Sort clauses into best execution order */
***************
*** 969,974 **** create_indexscan_plan(PlannerInfo *root,
--- 1000,1007 ----
                                 fixed_indexquals,
                                 stripped_indexquals,
                                 best_path->indexscandir);
+     scan_plan->indexonlyqual = indexonlyqual;
+     scan_plan->indexonlyqualorig = indexonlyqualorig;

      copy_path_costsize(&scan_plan->scan.plan, &best_path->path);
      /* use the indexscan-specific rows estimate, not the parent rel's */
*** a/src/backend/optimizer/plan/planagg.c
--- b/src/backend/optimizer/plan/planagg.c
***************
*** 402,407 **** build_minmax_path(PlannerInfo *root, RelOptInfo *rel, MinMaxAggInfo *info)
--- 402,408 ----
          new_path = create_index_path(root, index,
                                       restrictclauses,
                                       NIL,
+                                      NIL,
                                       indexscandir,
                                       NULL);

*** a/src/backend/optimizer/plan/setrefs.c
--- b/src/backend/optimizer/plan/setrefs.c
***************
*** 16,21 ****
--- 16,23 ----
  #include "postgres.h"

  #include "access/transam.h"
+ #include "catalog/pg_am.h"
+ #include "catalog/pg_opfamily.h"
  #include "catalog/pg_type.h"
  #include "nodes/makefuncs.h"
  #include "nodes/nodeFuncs.h"
***************
*** 274,279 **** set_plan_refs(PlannerGlobal *glob, Plan *plan, int rtoffset)
--- 276,285 ----
                      fix_scan_list(glob, splan->indexqual, rtoffset);
                  splan->indexqualorig =
                      fix_scan_list(glob, splan->indexqualorig, rtoffset);
+                 splan->indexonlyqual =
+                     fix_scan_list(glob, splan->indexonlyqual, rtoffset);
+                 splan->indexonlyqualorig =
+                     fix_scan_list(glob, splan->indexonlyqualorig, rtoffset);
              }
              break;
          case T_BitmapIndexScan:
***************
*** 516,521 **** set_plan_refs(PlannerGlobal *glob, Plan *plan, int rtoffset)
--- 522,675 ----
  }

  /*
+  * replace_heapvars_with_indexvars
+  *
+  * Replaces all Vars in an expression with Vars that refer to columns
+  * of an index. The resulting expression can be used in the executor to
+  * evaluate an expression using an IndexTuple from an index, instead
+  * of a heap tuple.
+  *
+  * Returns NULL if the expression can't be evaluated using columns from
+  * the index only.
+  */
+ typedef struct
+ {
+     IndexOptInfo *index; /* index we're dealing with */
+     bool indexonly;
+ } replace_heapvars_context;
+
+ static Node *
+ replace_heapvars_with_indexvars(Node *node, replace_heapvars_context *context)
+ {
+     /*
+      * We represent index keys by Var nodes having the varno of the base table
+      * but varattno equal to the index's attribute number (index column
+      * position).  This is a bit hokey ... would be cleaner to use a
+      * special-purpose node type that could not be mistaken for a regular Var.
+      * But it will do for now.
+      */
+     Var           *result;
+     int            pos;
+     IndexOptInfo *index = context->index;
+     ListCell   *indexpr_item;
+
+     if (node == NULL || !context->indexonly)
+         return NULL;
+
+     if (IsA(node, Var))
+     {
+         Var *var = (Var *) node;
+
+         if (var->varno == index->rel->relid)
+         {
+             /* Try to match against simple index columns */
+             int            varatt = ((Var *) node)->varattno;
+
+             if (varatt != 0)
+             {
+                 for (pos = 0; pos < index->ncolumns; pos++)
+                 {
+                     if (index->indexkeys[pos] == varatt)
+                     {
+                         result = (Var *) copyObject(node);
+                         result->varattno = pos + 1;
+
+                         /*
+                          * XXX: The hack in the name b-tree opclass to store
+                          * the keys as cstrings instead of names confuses the
+                          * executor.
+                          */
+                         if(index->opfamily[pos] == NAME_BTREE_FAM_OID)
+                             result->vartype = CSTRINGOID;
+
+                         return (Node *) result;
+                     }
+                 }
+                 /* Not in index. Can't do an index-only scan */
+                 context->indexonly = false;
+                 return NULL;
+             }
+             else
+             {
+                 /* TODO: we don't handle full-row references yet. */
+                 context->indexonly = false;
+                 return NULL;
+             }
+         }
+         else
+         {
+             /* XXX: Can this happen? */
+             context->indexonly = false;
+             return NULL;
+         }
+     }
+
+     /* Try to match against index expressions */
+     indexpr_item = list_head(index->indexprs);
+     for (pos = 0; pos < index->ncolumns; pos++)
+     {
+         if (index->indexkeys[pos] == 0)
+         {
+             Node       *indexkey;
+
+             if (indexpr_item == NULL)
+                 elog(ERROR, "too few entries in indexprs list");
+             indexkey = (Node *) lfirst(indexpr_item);
+             if (indexkey && IsA(indexkey, RelabelType))
+                 indexkey = (Node *) ((RelabelType *) indexkey)->arg;
+             if (equal(node, indexkey))
+             {
+                 /* Found a match */
+                 result = makeVar(index->rel->relid, pos + 1,
+                                  exprType(lfirst(indexpr_item)), -1,
+                                  0);
+                 return (Node *) result;
+             }
+             indexpr_item = lnext(indexpr_item);
+         }
+     }
+
+     /* Not found */
+     return expression_tree_mutator(node, replace_heapvars_with_indexvars,
+                                    context);
+ }
+
+ /*
+  * Given an expression, return an Expr tree that evaluates the expression
+  * using columns from an index. On success, *indexonly is set to TRUE.
+  * If the expression can't be evaluated using only index columns, *indexonly
+  * is set to FALSE and NULL is returned.
+  */
+ Expr *
+ make_indexonly_expr(PlannerInfo *root, IndexOptInfo *index,
+                     Expr *expr, bool *indexonly)
+ {
+     replace_heapvars_context context;
+
+     /*
+      * XXX: Right now we hard-code that only B-trees can return index tuples.
+      */
+     if (index->relam != BTREE_AM_OID)
+     {
+         if (indexonly)
+             *indexonly = false;
+         return NULL;
+     }
+
+     context.index = index;
+     context.indexonly = true;
+
+     expr = (Expr *) replace_heapvars_with_indexvars((Node *) expr, &context);
+
+     if (indexonly)
+         *indexonly = context.indexonly;
+     if (!context.indexonly)
+         return NULL;
+     else
+         return expr;
+ }
+
+ /*
   * set_subqueryscan_references
   *        Do set_plan_references processing on a SubqueryScan
   *
***************
*** 925,936 **** set_inner_join_references(PlannerGlobal *glob, Plan *inner_plan,
           */
          IndexScan  *innerscan = (IndexScan *) inner_plan;
          List       *indexqualorig = innerscan->indexqualorig;

          /* No work needed if indexqual refers only to its own rel... */
          if (NumRelids((Node *) indexqualorig) > 1)
          {
-             Index        innerrel = innerscan->scan.scanrelid;
-
              /* only refs to outer vars get changed in the inner qual */
              innerscan->indexqualorig = fix_join_expr(glob,
                                                       indexqualorig,
--- 1079,1090 ----
           */
          IndexScan  *innerscan = (IndexScan *) inner_plan;
          List       *indexqualorig = innerscan->indexqualorig;
+         List       *indexonlyqualorig = innerscan->indexonlyqualorig;
+         Index        innerrel = innerscan->scan.scanrelid;

          /* No work needed if indexqual refers only to its own rel... */
          if (NumRelids((Node *) indexqualorig) > 1)
          {
              /* only refs to outer vars get changed in the inner qual */
              innerscan->indexqualorig = fix_join_expr(glob,
                                                       indexqualorig,
***************
*** 944,963 **** set_inner_join_references(PlannerGlobal *glob, Plan *inner_plan,
                                                   NULL,
                                                   innerrel,
                                                   0);

!             /*
!              * We must fix the inner qpqual too, if it has join clauses (this
!              * could happen if special operators are involved: some indexquals
!              * may get rechecked as qpquals).
!              */
!             if (NumRelids((Node *) inner_plan->qual) > 1)
!                 inner_plan->qual = fix_join_expr(glob,
!                                                  inner_plan->qual,
                                                   outer_itlist,
                                                   NULL,
                                                   innerrel,
                                                   0);
          }
      }
      else if (IsA(inner_plan, BitmapIndexScan))
      {
--- 1098,1135 ----
                                                   NULL,
                                                   innerrel,
                                                   0);
+         }

!         /* No work needed if indexqual refers only to its own rel... */
!         if (NumRelids((Node *) indexonlyqualorig) > 1)
!         {
!             /* only refs to outer vars get changed in the inner qual */
!             innerscan->indexonlyqualorig = fix_join_expr(glob,
!                                                      indexonlyqualorig,
!                                                      outer_itlist,
!                                                      NULL,
!                                                      innerrel,
!                                                      0);
!             innerscan->indexonlyqual = fix_join_expr(glob,
!                                                  innerscan->indexonlyqual,
                                                   outer_itlist,
                                                   NULL,
                                                   innerrel,
                                                   0);
          }
+
+         /*
+          * We must fix the inner qpqual too, if it has join clauses (this
+          * could happen if special operators are involved: some indexquals
+          * may get rechecked as qpquals).
+          */
+         if (NumRelids((Node *) inner_plan->qual) > 1)
+             inner_plan->qual = fix_join_expr(glob,
+                                              inner_plan->qual,
+                                              outer_itlist,
+                                              NULL,
+                                              innerrel,
+                                              0);
      }
      else if (IsA(inner_plan, BitmapIndexScan))
      {
*** a/src/backend/optimizer/util/pathnode.c
--- b/src/backend/optimizer/util/pathnode.c
***************
*** 414,419 **** create_seqscan_path(PlannerInfo *root, RelOptInfo *rel)
--- 414,421 ----
   * 'index' is a usable index.
   * 'clause_groups' is a list of lists of RestrictInfo nodes
   *            to be used as index qual conditions in the scan.
+  * 'filter_quals' is a list of expressions that can be evaluated using data
+  *            from index.
   * 'pathkeys' describes the ordering of the path.
   * 'indexscandir' is ForwardScanDirection or BackwardScanDirection
   *            for an ordered index, or NoMovementScanDirection for
***************
*** 427,432 **** IndexPath *
--- 429,435 ----
  create_index_path(PlannerInfo *root,
                    IndexOptInfo *index,
                    List *clause_groups,
+                   List *indexonlyquals,
                    List *pathkeys,
                    ScanDirection indexscandir,
                    RelOptInfo *outer_rel)
***************
*** 504,510 **** create_index_path(PlannerInfo *root,
          pathnode->rows = rel->rows;
      }

!     cost_index(pathnode, root, index, indexquals, outer_rel);

      return pathnode;
  }
--- 507,513 ----
          pathnode->rows = rel->rows;
      }

!     cost_index(pathnode, root, index, indexquals, indexonlyquals, outer_rel);

      return pathnode;
  }
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 556,561 **** extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
--- 556,562 ----
  extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
  extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
  extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost);
+ extern IndexTuple _bt_getindextuple(IndexScanDesc scan);

  /*
   * prototypes for functions in nbtutils.c
*** a/src/include/access/relscan.h
--- b/src/include/access/relscan.h
***************
*** 16,21 ****
--- 16,22 ----

  #include "access/genam.h"
  #include "access/heapam.h"
+ #include "access/itup.h"


  typedef struct HeapScanDescData
***************
*** 64,69 **** typedef struct IndexScanDescData
--- 65,71 ----
      Snapshot    xs_snapshot;    /* snapshot to see */
      int            numberOfKeys;    /* number of scan keys */
      ScanKey        keyData;        /* array of scan key descriptors */
+     bool        needIndexTuple;    /* should scan store index tuple in xs_itup? */

      /* signaling to index AM about killing index tuples */
      bool        kill_prior_tuple;        /* last-returned tuple is dead */
***************
*** 77,82 **** typedef struct IndexScanDescData
--- 79,86 ----
      Buffer        xs_cbuf;        /* current heap buffer in scan, if any */
      /* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
      bool        xs_recheck;        /* T means scan keys must be rechecked */
+
+     IndexTuple    xs_itup;        /* current index tuple, if any */
  } IndexScanDescData;

  /* Struct for heap-or-index scans of system tables */
*** a/src/include/executor/tuptable.h
--- b/src/include/executor/tuptable.h
***************
*** 15,20 ****
--- 15,21 ----
  #define TUPTABLE_H

  #include "access/htup.h"
+ #include "access/itup.h"
  #include "access/tupdesc.h"
  #include "storage/buf.h"

***************
*** 165,170 **** extern TupleTableSlot *ExecStoreTuple(HeapTuple tuple,
--- 166,174 ----
  extern TupleTableSlot *ExecStoreMinimalTuple(MinimalTuple mtup,
                        TupleTableSlot *slot,
                        bool shouldFree);
+ extern TupleTableSlot *ExecStoreIndexTuple(IndexTuple itup,
+                       TupleTableSlot *slot,
+                       bool shouldFree);
  extern TupleTableSlot *ExecClearTuple(TupleTableSlot *slot);
  extern TupleTableSlot *ExecStoreVirtualTuple(TupleTableSlot *slot);
  extern TupleTableSlot *ExecStoreAllNullTuple(TupleTableSlot *slot);
*** a/src/include/nodes/execnodes.h
--- b/src/include/nodes/execnodes.h
***************
*** 1124,1129 **** typedef struct IndexScanState
--- 1124,1133 ----
      ExprContext *iss_RuntimeContext;
      Relation    iss_RelationDesc;
      IndexScanDesc iss_ScanDesc;
+
+     TupleTableSlot *iss_IndexTupleSlot;    /* slot for storing index tuple */
+     List           *iss_indexonlyqual;
+     List           *iss_indexonlyqualorig;
  } IndexScanState;

  /* ----------------
*** a/src/include/nodes/plannodes.h
--- b/src/include/nodes/plannodes.h
***************
*** 263,269 **** typedef Scan SeqScan;
   * position, not the base table's column, even though varno is for the base
   * table).    This is a bit hokey ... would be cleaner to use a special-purpose
   * node type that could not be mistaken for a regular Var.    But it will do
!  * for now.
   * ----------------
   */
  typedef struct IndexScan
--- 263,269 ----
   * position, not the base table's column, even though varno is for the base
   * table).    This is a bit hokey ... would be cleaner to use a special-purpose
   * node type that could not be mistaken for a regular Var.    But it will do
!  * for now. (We do the same for indexonlyqual)
   * ----------------
   */
  typedef struct IndexScan
***************
*** 273,278 **** typedef struct IndexScan
--- 273,280 ----
      List       *indexqual;        /* list of index quals (OpExprs) */
      List       *indexqualorig;    /* the same in original form */
      ScanDirection indexorderdir;    /* forward or backward or don't care */
+     List       *indexonlyqual;    /* quals that can be satisfied from index */
+     List       *indexonlyqualorig;    /* same in original form */
  } IndexScan;

  /* ----------------
*** a/src/include/optimizer/cost.h
--- b/src/include/optimizer/cost.h
***************
*** 66,72 **** extern double index_pages_fetched(double tuples_fetched, BlockNumber pages,
                      double index_pages, PlannerInfo *root);
  extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_index(IndexPath *path, PlannerInfo *root, IndexOptInfo *index,
!            List *indexQuals, RelOptInfo *outer_rel);
  extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                        Path *bitmapqual, RelOptInfo *outer_rel);
  extern void cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root);
--- 66,72 ----
                      double index_pages, PlannerInfo *root);
  extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel);
  extern void cost_index(IndexPath *path, PlannerInfo *root, IndexOptInfo *index,
!                List *indexQuals, List *indexOnlyQuals, RelOptInfo *outer_rel);
  extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
                        Path *bitmapqual, RelOptInfo *outer_rel);
  extern void cost_bitmap_and_node(BitmapAndPath *path, PlannerInfo *root);
*** a/src/include/optimizer/pathnode.h
--- b/src/include/optimizer/pathnode.h
***************
*** 31,36 **** extern Path *create_seqscan_path(PlannerInfo *root, RelOptInfo *rel);
--- 31,37 ----
  extern IndexPath *create_index_path(PlannerInfo *root,
                    IndexOptInfo *index,
                    List *clause_groups,
+                   List *indexonlyquals,
                    List *pathkeys,
                    ScanDirection indexscandir,
                    RelOptInfo *outer_rel);
*** a/src/include/optimizer/planmain.h
--- b/src/include/optimizer/planmain.h
***************
*** 119,122 **** extern void extract_query_dependencies(List *queries,
--- 119,125 ----
                             List **relationOids,
                             List **invalItems);

+ extern Expr *make_indexonly_expr(PlannerInfo *root, IndexOptInfo *index,
+                                  Expr *expr, bool *indexonly);
+
  #endif   /* PLANMAIN_H */

Re: Index-only-scans, indexam API changes

From
Robert Haas
Date
On Mon, Jul 13, 2009 at 11:32 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Tom Lane wrote:
>> One thought here is that an AM call isn't really free, and doing two of
>> them instead of one mightn't be such a good idea.  I would suggest
>> either having a separate AM entry point to get both bits of data
>> ("amgettupledata"?) or adding an optional parameter to amgettuple.
>
> I'm thinking of adding a new flag to IndexScanDesc, "needIndexTuple".
> When that is set, amgettuple stores a pointer to the IndexTuple in
> another new field in IndexScanDesc, xs_itup. (In case of b-tree at
> least, the IndexTuple is a palloc'd copy of the tuple on disk)
>
>> [ thinks a bit ... ]  At least for GIST, it is possible that whether
>> data can be regurgitated will vary depending on the selected opclass.
>> Some opclasses use the STORAGE modifier and some don't.  I am not sure
>> how hard we want to work to support flexibility there.  Would it be
>> sufficient to hard-code the check as "pgam says the AM can do it,
>> and the opclass does not have a STORAGE property"?  Or do we need
>> additional intelligence about GIST opclasses?
>
> Well, the way I have it implemented is that the am is not required to
> return the index tuple, even if requested. I implemented the B-tree
> changes similar to how we implement "kill_prior_tuple": in btgettuple,
> lock the index page and see if the tuple is still at the same position
> that we remembered in the scan opaque struct. If not (because of
> concurrent changes to the page), we give up and don't return the index
> tuple. The executor will then perform a heap fetch as before.
>
> Alternatively, we could copy all the matching index tuples to private
> memory when we step to a new index page, but that seems pretty expensive
> if we're only going to use the first few matching tuples (LIMIT), and I
> fear the memory management gets complicated. But in any case, the GiST
> issue would still be there.
>
> Since we're discussing it, I'm attaching the prototype patch I have for
> returning tuples from b-tree and using them to filter rows before heap
> fetches. I was going to post it in the morning along with description
> about the planner and executor changes, but here goes. It applies on top
> of the indexam API patch I started this thread with.

I'm not sure where we are on this patch for reviewing purposes.
Heikki, are you planning to provide an updated patch?  Or what should
we be doing here from an RRR standpoint?

...Robert


Re: Index-only-scans, indexam API changes

From
Teodor Sigaev
Date
> [ thinks a bit ... ]  At least for GIST, it is possible that whether
> data can be regurgitated will vary depending on the selected opclass.
> Some opclasses use the STORAGE modifier and some don't.  I am not sure
> how hard we want to work to support flexibility there.  Would it be
> sufficient to hard-code the check as "pgam says the AM can do it,
> and the opclass does not have a STORAGE property"?  Or do we need
> additional intelligence about GIST opclasses?

GiST: btree_gist uses the STORAGE option, although the original value is
accessible from the index tuple.

GIN doesn't require setting the STORAGE option even when it's impossible to
reconstruct the original heap value from the indexed value. Right now, only
btree_gin's opclasses could be used for index-only scans (and only for a
single-column index scan!).

So, the STORAGE option cannot indicate the "reconstruct-ability" of the
original heap value :(. It seems to me the opclass definition could contain a
boolean field for that, but when the STORAGE option is specified, a separate
"reconstructing" interface method would be needed. IMHO, that's too complex
for now and doesn't give significant benefits.

Although an index-only scan over GIN/GiST could be useful for some aggregate
queries like count(*).

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/