Re: [HACKERS] WAL logging problem in 9.4.3?

From: Kyotaro HORIGUCHI
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Msg-id: 20170912.131441.20602611.horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List: pgsql-hackers
Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170908.163001.53230385.horiguchi.kyotaro@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT:  ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >> 
> > >> -       lsn = XLogInsert(RM_SMGR_ID,
> > >> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> +           rel->no_pending_sync= false;
> > >> +           rel->pending_sync = pending;
> > >> +       }
> > >> 
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of the backend process and not be part of the
> > >> Relation data...
> > > 
> > > I understand that the context of "backend process" means
> > > storage.c-local. I don't mind which context the data lives in,
> > > but that was the only place I found that can get rid of frequent
> > > hash searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing; on the other hand, pending syncs
> > > need to be referenced, sometimes very frequently.
> > > 
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs should be tracked outside of the relation context.
> > > 
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > > 
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch being a HEAD-only change; not many people are complaining
> > >> about this optimization of TRUNCATE missing when wal_level =
> > >> minimal, and this needs a very careful review.
> > > 
> > > Agreed.
> > > 
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > > 
> > > My point is that the hash search on every tuple insertion should be
> > > avoided even if it happens rarely. At one point this diverged a bit
> > > from your original patch, but in the latest patch the significant
> > > part (the pending-sync hash) is revived from the original one.
> > 
> > This patch has followed along since CF 2016-03; do we think we can reach a
> > conclusion in this CF?  It was marked as "Waiting on Author"; based on
> > developments since in this thread, I've changed it back to "Needs Review"
> > again.
> 
> I managed to reload its context into my head. It doesn't apply to
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch piggybacks, seems to
work perfectly. The motive for my addition is to avoid the
frequent hash accesses (specifically, I think, one per tuple
modification) that occur while pending syncs exist. The hash
contains at least six entries.

The attached patch emits extra log messages, to be removed in the
final shape, that show how much the addition reduces hash
accesses.  As a basis for judging the worth of the additional
mechanism, I show an example set of queries below.

In the log messages, "r" is the relation OID, "b" is the buffer
number, and "hash" is the pointer to the backend-global hash table
for pending syncs. "ent" is the hash entry belonging to the
relation, and "neg" is a flag indicating that the existing pending
sync hash has no entry for the relation.

=# set log_min_messages to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created

=# copy test1 from '/<somewhere>/copy_data.txt';
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.

=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : not found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access (repeats for the number of tuples)

=#  truncate test3;
=#  insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=0x171f340, neg = 0
# This table has pending sync but no longer needs hash access,
#  (repeats for the number of tuples)

The hash is still required for the case of relcache invalidation:
when ent = (nil) and neg = 0 but hash != (nil), a hash search is
performed and the previous state is restored.
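
To make the fast path concrete, here is a condensed sketch of the
beginning of BufferNeedsWAL() in the attached patch (the
sync_above/truncated_to block-range checks that follow are omitted):

    if (!RelationNeedsWAL(rel))
        return false;

    /* fast path: no hash at all, or a cached negative result */
    if (!pendingSyncs || rel->no_pending_sync)
        return true;

    if (!rel->pending_sync)
    {
        bool        found;

        /* cache the entry pointer in the relcache entry; dynahash
         * entries never move once entered */
        rel->pending_sync =
            (PendingRelSync *) hash_search(pendingSyncs,
                                           (void *) &rel->rd_node,
                                           HASH_FIND, &found);
        if (!found)
        {
            /* the negative result is cached, too */
            rel->no_pending_sync = true;
            return true;
        }
    }
    /* from here on rel->pending_sync is followed with no hash access */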

This mechanism avoids most of the hash accesses by replacing them
with a simple pointer dereference. On the other hand, the hash
accesses occur only after a relation truncation in the current
transaction. In other words, none of this comes into effect unless
a table truncation, COPY, CREATE TABLE AS, ALTER TABLE, or refresh
of a materialized view occurs.
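
For reference, the caller-side pattern is uniform; this is how the
patch replaces HEAP_INSERT_SKIP_WAL in CopyFrom(), intorel_startup(),
transientrel_startup() and ATRewriteTable():

    hi_options |= HEAP_INSERT_SKIP_FSM;
    if (!XLogIsNeeded())
        heap_register_sync(rel);    /* instead of HEAP_INSERT_SKIP_WAL;
                                     * subsequent changes may skip WAL and
                                     * the heap is synced to disk by
                                     * smgrDoPendingSyncs(true) at commit */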


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 34,39 ****
--- 34,61 ----
   *      the POSTGRES heap access method used for all POSTGRES
   *      relations.
   *
+  * WAL CONSIDERATIONS
+  *      All heap operations are normally WAL-logged, but there are a few
+  *      exceptions. Temporary and unlogged relations never need to be
+  *      WAL-logged, but we can also skip WAL-logging for a table that was
+  *      created in the same transaction, if we don't need WAL for PITR or
+  *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+  *      the file to disk at COMMIT instead.
+  *
+  *      The same-relation optimization is not employed automatically on all
+  *      updates to a table that was created in the same transaction, because
+  *      for a small number of changes, it's cheaper to just create the WAL
+  *      records than fsync()ing the whole relation at COMMIT. It is only
+  *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+  *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+  *      operation; it will cause any subsequent updates to the table to skip
+  *      WAL-logging, if possible, and cause the heap to be synced to disk at
+  *      COMMIT.
+  *
+  *      To make that work, all modifications to heap must use
+  *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+  *      for the given block.
+  *
   *-------------------------------------------------------------------------
   */
  #include "postgres.h"
***************
*** 56,61 ****
--- 78,84 ----
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
  #include "catalog/namespace.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "port/atomics.h"
*************** ReleaseBulkInsertStatePin(BulkInsertState bistate)
*** 2370,2381 ****
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
-  * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
-  * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
-  * requires that we arrange that all new tuples go into new pages not
-  * containing any tuples from other transactions, and that the relation gets
-  * fsync'd before commit.  (See also heap_sync() comments)
-  *
   * The HEAP_INSERT_SKIP_FSM option is passed directly to
   * RelationGetBufferForTuple, which see for more info.
   *
--- 2393,2398 ----
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/htup_details.h"
  #include "access/xlog.h"
  #include "catalog/catalog.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
*************** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
*** 259,265 ****
          /*
           * Emit a WAL HEAP_CLEAN record showing what we did
           */
!         if (RelationNeedsWAL(relation))
          {
              XLogRecPtr    recptr;

--- 260,266 ----
          /*
           * Emit a WAL HEAP_CLEAN record showing what we did
           */
!         if (BufferNeedsWAL(relation, buffer))
          {
              XLogRecPtr    recptr;
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
*************** raw_heap_insert(RewriteState state, HeapTuple tup)
*** 649,657 ****
      }
      else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
          heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
!                                          HEAP_INSERT_SKIP_FSM |
!                                          (state->rs_use_wal ?
!                                           0 : HEAP_INSERT_SKIP_WAL));
      else
          heaptup = tup;

--- 649,655 ----
      }
      else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
          heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
!                                          HEAP_INSERT_SKIP_FSM);
      else
          heaptup = tup;
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "access/heapam_xlog.h"
  #include "access/visibilitymap.h"
  #include "access/xlog.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
*************** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
*** 307,313 ****
          map[mapByte] |= (flags << mapOffset);
          MarkBufferDirty(vmBuf);

!         if (RelationNeedsWAL(rel))
          {
              if (XLogRecPtrIsInvalid(recptr))
              {
--- 308,314 ----
          map[mapByte] |= (flags << mapOffset);
          MarkBufferDirty(vmBuf);

!         if (BufferNeedsWAL(rel, heapBuf))
          {
              if (XLogRecPtrIsInvalid(recptr))
              {
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
*************** CommitTransaction(void)
*** 2007,2012 ****
--- 2007,2015 ----
      /* close large objects before lower-level cleanup */
      AtEOXact_LargeObject(true);

+     /* Flush updates to relations that we didn't WAL-log */
+     smgrDoPendingSyncs(true);
+
      /*
       * Mark serializable transaction as complete for predicate locking
       * purposes.  This should be done as late as we can put it and still allow
*************** PrepareTransaction(void)
*** 2235,2240 ****
--- 2238,2246 ----
      /* close large objects before lower-level cleanup */
      AtEOXact_LargeObject(true);

+     /* Flush updates to relations that we didn't WAL-log */
+     smgrDoPendingSyncs(true);
+
      /*
       * Mark serializable transaction as complete for predicate locking
       * purposes.  This should be done as late as we can put it and still allow
*************** AbortTransaction(void)
*** 2548,2553 ****
--- 2554,2560 ----
      AtAbort_Notify();
      AtEOXact_RelationMap(false);
      AtAbort_Twophase();
+     smgrDoPendingSyncs(false);    /* abandon pending syncs */

      /*
       * Advertise the fact that we aborted in pg_xact (assuming that we got as
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "catalog/storage_xlog.h"
  #include "storage/freespace.h"
  #include "storage/smgr.h"
+ #include "utils/hsearch.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
*************** typedef struct PendingRelDelete
*** 64,69 ****
--- 65,113 ----
  static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */

  /*
+  * We also track relation files (RelFileNode values) that have been created
+  * in the same transaction, and that have been modified without WAL-logging
+  * the action (an optimization possible with wal_level=minimal). When we are
+  * about to skip WAL-logging, a PendingRelSync entry is created, and
+  * 'sync_above' is set to the current size of the relation. Any operations
+  * on blocks < sync_above need to be WAL-logged as usual, but for operations
+  * on higher blocks, WAL-logging is skipped.
+  *
+  * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+  * any subsequent actions on the same block either. Replaying the WAL record
+  * of the subsequent action might fail otherwise, as the "before" state of
+  * the block might not match, as the earlier actions were not WAL-logged.
+  * Likewise, after we have WAL-logged an operation for a block, we must
+  * WAL-log any subsequent operations on the same page as well. Replaying
+  * a possible full-page-image from the earlier WAL record would otherwise
+  * revert the page to the old state, even if we sync the relation at end
+  * of transaction.
+  *
+  * If a relation is truncated (without creating a new relfilenode), and we
+  * emit a WAL record of the truncation, we can't skip WAL-logging for any
+  * of the truncated blocks anymore, as replaying the truncation record will
+  * destroy all the data inserted after that. But if we have already decided
+  * to skip WAL-logging changes to a relation, and the relation is truncated,
+  * we don't need to WAL-log the truncation either.
+  *
+  * This mechanism is currently only used by heaps. Indexes are always
+  * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+  * WAL levels we need the WAL for PITR/replication anyway.
+  */
+ typedef struct PendingRelSync
+ {
+     RelFileNode relnode;        /* relation created in same xact */
+     BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
+                                  * sync_above */
+     BlockNumber truncated_to;    /* truncation WAL record was written */
+ }    PendingRelSync;
+ 
+ /* Relations that need to be fsync'd at commit */
+ static HTAB *pendingSyncs = NULL;
+ 
+ static void createPendingSyncsHash(void);
+ 
  /*
   * RelationCreateStorage
   *        Create physical storage for a relation.
   *
*************** RelationPreserveStorage(RelFileNode rnode, bool atCommit)
*** 226,231 ****
--- 270,277 ----
  void
  RelationTruncate(Relation rel, BlockNumber nblocks)
  {
+     PendingRelSync *pending = NULL;
+     bool        found;
      bool        fsm;
      bool        vm;
*************** RelationTruncate(Relation rel, BlockNumber nblocks)
*** 260,296 ****
       */
      if (RelationNeedsWAL(rel))
      {
!         /*
!          * Make an XLOG entry reporting the file truncation.
!          */
!         XLogRecPtr    lsn;
!         xl_smgr_truncate xlrec;
! 
!         xlrec.blkno = nblocks;
!         xlrec.rnode = rel->rd_node;
!         xlrec.flags = SMGR_TRUNCATE_ALL;
! 
!         XLogBeginInsert();
!         XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
!         lsn = XLogInsert(RM_SMGR_ID,
!                          XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
!         /*
!          * Flush, because otherwise the truncation of the main relation might
!          * hit the disk before the WAL record, and the truncation of the FSM
!          * or visibility map. If we crashed during that window, we'd be left
!          * with a truncated heap, but the FSM or visibility map would still
!          * contain entries for the non-existent heap pages.
!          */
!         if (fsm || vm)
!             XLogFlush(lsn);
      }

      /* Do the real work */
      smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
  }

  /*
   *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
   *
--- 306,386 ----
       */
      if (RelationNeedsWAL(rel))
      {
!         /* no_pending_sync is ignored since new entry is created here */
!         if (!rel->pending_sync)
!         {
!             if (!pendingSyncs)
!                 createPendingSyncsHash();
!             elog(DEBUG2, "RelationTruncate: accessing hash");
!             pending = (PendingRelSync *) hash_search(pendingSyncs,
!                                                  (void *) &rel->rd_node,
!                                                  HASH_ENTER, &found);
!             if (!found)
!             {
!                 pending->sync_above = InvalidBlockNumber;
!                 pending->truncated_to = InvalidBlockNumber;
!             }
! 
!             rel->no_pending_sync = false;
!             rel->pending_sync = pending;
!         }
! 
!         if (rel->pending_sync->sync_above == InvalidBlockNumber ||
!             rel->pending_sync->sync_above < nblocks)
!         {
!             /*
!              * Make an XLOG entry reporting the file truncation.
!              */
!             XLogRecPtr        lsn;
!             xl_smgr_truncate xlrec;
! 
!             xlrec.blkno = nblocks;
!             xlrec.rnode = rel->rd_node;
!             xlrec.flags = SMGR_TRUNCATE_ALL;
! 
!             XLogBeginInsert();
!             XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
!             lsn = XLogInsert(RM_SMGR_ID,
!                              XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
!             elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
!                  rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
!                  nblocks);
! 
!             /*
!              * Flush, because otherwise the truncation of the main relation
!              * might hit the disk before the WAL record, and the truncation of
!              * the FSM or visibility map. If we crashed during that window,
!              * we'd be left with a truncated heap, but the FSM or visibility
!              * map would still contain entries for the non-existent heap
!              * pages.
!              */
!             if (fsm || vm)
!                 XLogFlush(lsn);
! 
!             rel->pending_sync->truncated_to = nblocks;
!         }
      }

      /* Do the real work */
      smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
  }

+ /* create the hash table to track pending at-commit fsyncs */
+ static void
+ createPendingSyncsHash(void)
+ {
+     /* First time through: initialize the hash table */
+     HASHCTL        ctl;
+ 
+     MemSet(&ctl, 0, sizeof(ctl));
+     ctl.keysize = sizeof(RelFileNode);
+     ctl.entrysize = sizeof(PendingRelSync);
+     ctl.hash = tag_hash;
+     pendingSyncs = hash_create("pending relation sync table", 5,
+                                &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+
  /*
   *    smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
   *
*************** smgrDoPendingDeletes(bool isCommit)
*** 369,374 ****
--- 459,482 ----
  }

  /*
+  * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+  */
+ void
+ RelationRemovePendingSync(Relation rel)
+ {
+     bool found;
+ 
+     rel->pending_sync = NULL;
+     rel->no_pending_sync = true;
+     if (pendingSyncs)
+     {
+         elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+         hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+     }
+ }
+ 
+ 
  /*
   * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
   *
   * The return value is the number of relations scheduled for termination.
 
*************** smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
*** 419,424 ****
--- 527,696 ----
      return nrels;
  }

+ 
+ /*
+  * Remember that the given relation needs to be sync'd at commit, because we
+  * are going to skip WAL-logging subsequent actions to it.
+  */
+ void
+ RecordPendingSync(Relation rel)
+ {
+     bool found = true;
+     BlockNumber nblocks;
+ 
+     Assert(RelationNeedsWAL(rel));
+ 
+     /* ignore no_pending_sync since new entry is created here */
+     if (!rel->pending_sync)
+     {
+         if (!pendingSyncs)
+             createPendingSyncsHash();
+ 
+         /* Look up or create an entry */
+         rel->no_pending_sync = false;
+         elog(DEBUG2, "RecordPendingSync: accessing hash");
+         rel->pending_sync =
+             (PendingRelSync *) hash_search(pendingSyncs,
+                                            (void *) &rel->rd_node,
+                                            HASH_ENTER, &found);
+     }
+ 
+     nblocks = RelationGetNumberOfBlocks(rel);
+     if (!found)
+     {
+         rel->pending_sync->truncated_to = InvalidBlockNumber;
+         rel->pending_sync->sync_above = nblocks;
+ 
+         elog(DEBUG2,
+              "registering new pending sync for rel %u/%u/%u at block %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              nblocks);
+ 
+     }
+     else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+     {
+         elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              nblocks);
+         rel->pending_sync->sync_above = nblocks;
+     }
+     else
+         elog(DEBUG2,
+              "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              rel->pending_sync->sync_above, nblocks);
+ }
+ 
+ /*
+  * Do changes to given heap page need to be WAL-logged?
+  *
+  * This takes into account any previous RecordPendingSync() requests.
+  *
+  * Note that it is required to check this before creating any WAL records for
+  * heap pages - it is not merely an optimization! WAL-logging a record, when
+  * we have already skipped a previous WAL record for the same page could lead
+  * to failure at WAL replay, as the "before" state expected by the record
+  * might not match what's on disk. Also, if the heap was truncated earlier, we
+  * must WAL-log any changes to the once-truncated blocks, because replaying
+  * the truncation record will destroy them.
+  */
+ bool
+ BufferNeedsWAL(Relation rel, Buffer buf)
+ {
+     BlockNumber blkno = InvalidBlockNumber;
+ 
+     if (!RelationNeedsWAL(rel))
+         return false;
+ 
+     elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf),
pendingSyncs,rel->pending_sync, rel->no_pending_sync);
 
+     /* no further work if we know that we don't have pending sync */
+     if (!pendingSyncs || rel->no_pending_sync)
+         return true;
+ 
+     /* do the real work */
+     if (!rel->pending_sync)
+     {
+         bool found = false;
+ 
+         /*
+          * Hold the entry in rel. This relies on the fact that hash
+          * entries never move.
+          */
+         rel->pending_sync =
+             (PendingRelSync *) hash_search(pendingSyncs,
+                                            (void *) &rel->rd_node,
+                                            HASH_FIND, &found);
+         elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+         if (!found)
+         {
+             /* no entry exists; don't access the hash any longer */
+             rel->no_pending_sync = true;
+             return true;
+         }
+     }
+ 
+     blkno = BufferGetBlockNumber(buf);
+     if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+         rel->pending_sync->sync_above > blkno)
+     {
+         elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              blkno, rel->pending_sync->sync_above);
+         return true;
+     }
+ 
+     /*
+      * We have emitted a truncation record for this block.
+      */
+     if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+         rel->pending_sync->truncated_to <= blkno)
+     {
+         elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the
samexact",
 
+              rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+              blkno);
+         return true;
+     }
+ 
+     elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+          rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+          blkno);
+ 
+     return false;
+ }
+ 
+ /*
+  * Sync to disk any relations that we skipped WAL-logging for earlier.
+  */
+ void
+ smgrDoPendingSyncs(bool isCommit)
+ {
+     if (!pendingSyncs)
+         return;
+ 
+     if (isCommit)
+     {
+         HASH_SEQ_STATUS status;
+         PendingRelSync *pending;
+ 
+         hash_seq_init(&status, pendingSyncs);
+ 
+         while ((pending = hash_seq_search(&status)) != NULL)
+         {
+             if (pending->sync_above != InvalidBlockNumber)
+             {
+                 FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+                 smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+ 
+                 elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+                      pending->relnode.dbNode, pending->relnode.relNode);
+             }
+         }
+     }
+ 
+     hash_destroy(pendingSyncs);
+     pendingSyncs = NULL;
+ }
+
  /*
   *    PostPrepare_smgr -- Clean up after a successful PREPARE
   *
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
*************** CopyFrom(CopyState cstate)
*** 2347,2354 ****
       *    - data is being written to relfilenode created in this transaction
       * then we can skip writing WAL.  It's safe because if the transaction
       * doesn't commit, we'll discard the table (or the new relfilenode file).
!      * If it does commit, we'll have done the heap_sync at the bottom of this
!      * routine first.
       *
       * As mentioned in comments in utils/rel.h, the in-same-transaction test
       * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
--- 2347,2353 ----
       *    - data is being written to relfilenode created in this transaction
       * then we can skip writing WAL.  It's safe because if the transaction
       * doesn't commit, we'll discard the table (or the new relfilenode file).
!      * If it does commit, commit will do heap_sync().
       *
       * As mentioned in comments in utils/rel.h, the in-same-transaction test
       * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
*************** CopyFrom(CopyState cstate)
*** 2380,2386 ****
      {
          hi_options |= HEAP_INSERT_SKIP_FSM;
          if (!XLogIsNeeded())
!             hi_options |= HEAP_INSERT_SKIP_WAL;
      }

      /*
--- 2379,2385 ----
      {
          hi_options |= HEAP_INSERT_SKIP_FSM;
          if (!XLogIsNeeded())
!             heap_register_sync(cstate->rel);
      }

      /*
*************** CopyFrom(CopyState cstate)
*** 2862,2872 ****
      FreeExecutorState(estate);

      /*
!      * If we skipped writing WAL, then we need to sync the heap (but not
!      * indexes since those use WAL anyway)
       */
-     if (hi_options & HEAP_INSERT_SKIP_WAL)
-         heap_sync(cstate->rel);

      return processed;
  }
--- 2861,2871 ----
      FreeExecutorState(estate);

      /*
!      * If we skipped writing WAL, then we will sync the heap at the end of
!      * the transaction. (We used to do it here, but it was later found out
!      * that to be safe, we must also avoid WAL-logging any subsequent
!      * actions on the pages we skipped WAL for). Indexes always use WAL.
       */

      return processed;
  }
*** a/src/backend/commands/createas.c
--- b/src/backend/commands/createas.c
*************** intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*** 567,574 ****
       * We can skip WAL-logging the insertions, unless PITR or streaming
       * replication is in use. We can skip the FSM in any case.
       */
!     myState->hi_options = HEAP_INSERT_SKIP_FSM |
!         (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
      myState->bistate = GetBulkInsertState();

      /* Not using WAL requires smgr_targblock be initially invalid */
--- 567,575 ----
       * We can skip WAL-logging the insertions, unless PITR or streaming
       * replication is in use. We can skip the FSM in any case.
       */
!     if (!XLogIsNeeded())
!         heap_register_sync(intoRelationDesc);
!     myState->hi_options = HEAP_INSERT_SKIP_FSM;
      myState->bistate = GetBulkInsertState();

      /* Not using WAL requires smgr_targblock be initially invalid */
*************** intorel_shutdown(DestReceiver *self)
*** 617,625 ****
      FreeBulkInsertState(myState->bistate);

!     /* If we skipped using WAL, must heap_sync before commit */
!     if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
!         heap_sync(myState->rel);

      /* close rel, but keep lock until commit */
      heap_close(myState->rel, NoLock);
--- 618,624 ----
      FreeBulkInsertState(myState->bistate);

!     /* If we skipped using WAL, we will sync the relation at commit */

      /* close rel, but keep lock until commit */
      heap_close(myState->rel, NoLock);
*** a/src/backend/commands/matview.c
--- b/src/backend/commands/matview.c
*************** transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
*** 477,483 ****
       */
      myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
      if (!XLogIsNeeded())
!         myState->hi_options |= HEAP_INSERT_SKIP_WAL;
      myState->bistate = GetBulkInsertState();

      /* Not using WAL requires smgr_targblock be initially invalid */
--- 477,483 ----
       */
      myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
      if (!XLogIsNeeded())
!         heap_register_sync(transientrel);
      myState->bistate = GetBulkInsertState();

      /* Not using WAL requires smgr_targblock be initially invalid */
 
*************** transientrel_shutdown(DestReceiver *self)
*** 520,528 ****
      FreeBulkInsertState(myState->bistate);

!     /* If we skipped using WAL, must heap_sync before commit */
!     if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
!         heap_sync(myState->transientrel);

      /* close transientrel, but keep lock until commit */
      heap_close(myState->transientrel, NoLock);
--- 520,526 ----
      FreeBulkInsertState(myState->bistate);

!     /* If we skipped using WAL, we will sync the relation at commit */

      /* close transientrel, but keep lock until commit */
      heap_close(myState->transientrel, NoLock);
 
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
*************** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
*** 4357,4364 ****
          bistate = GetBulkInsertState();

          hi_options = HEAP_INSERT_SKIP_FSM;
          if (!XLogIsNeeded())
!             hi_options |= HEAP_INSERT_SKIP_WAL;
      }
      else
      {
--- 4357,4365 ----
          bistate = GetBulkInsertState();

          hi_options = HEAP_INSERT_SKIP_FSM;
+
          if (!XLogIsNeeded())
!             heap_register_sync(newrel);
      }
      else
      {
*************** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
*** 4624,4631 ****
          FreeBulkInsertState(bistate);

          /* If we skipped writing WAL, then we need to sync the heap. */
-         if (hi_options & HEAP_INSERT_SKIP_WAL)
-             heap_sync(newrel);

          heap_close(newrel, NoLock);
      }
--- 4625,4630 ----
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
*************** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
*** 891,897 ****
                   * page has been previously WAL-logged, and if not, do that
                   * now.
                   */
!                 if (RelationNeedsWAL(onerel) &&
                      PageGetLSN(page) == InvalidXLogRecPtr)
                      log_newpage_buffer(buf, true);

--- 891,897 ----
                   * page has been previously WAL-logged, and if not, do that
                   * now.
                   */
!                 if (BufferNeedsWAL(onerel, buf) &&
                      PageGetLSN(page) == InvalidXLogRecPtr)
                      log_newpage_buffer(buf, true);

*************** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
*** 1118,1124 ****
              }

              /* Now WAL-log freezing if necessary */
!             if (RelationNeedsWAL(onerel))
              {
                  XLogRecPtr    recptr;

--- 1118,1124 ----
              }

              /* Now WAL-log freezing if necessary */
!             if (BufferNeedsWAL(onerel, buf))
              {
                  XLogRecPtr    recptr;
*************** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
*** 1476,1482 ****
      MarkBufferDirty(buffer);

      /* XLOG stuff */
!     if (RelationNeedsWAL(onerel))
      {
          XLogRecPtr    recptr;

--- 1476,1482 ----
      MarkBufferDirty(buffer);

      /* XLOG stuff */
!     if (BufferNeedsWAL(onerel, buffer))
      {
          XLogRecPtr    recptr;
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
*************** static BufferDesc *BufferAlloc(SMgrRelation smgr,
*** 451,456 ****
--- 451,457 ----
              BufferAccessStrategy strategy,
              bool *foundPtr);
  static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
  static void AtProcExit_Buffers(int code, Datum arg);
  static void CheckForBufferLeaks(void);
  static int    rnode_comparator(const void *p1, const void *p2);
*************** PrintPinnedBufs(void)
*** 3147,3166 ****
  void
  FlushRelationBuffers(Relation rel)
  {
-     int            i;
-     BufferDesc *bufHdr;
-
      /* Open rel at the smgr level if not already done */
      RelationOpenSmgr(rel);

!     if (RelationUsesLocalBuffers(rel))
      {
          for (i = 0; i < NLocBuffer; i++)
          {
              uint32        buf_state;

              bufHdr = GetLocalBufferDescriptor(i);

!             if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
                  ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                   (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
              {
--- 3148,3188 ----
  void
  FlushRelationBuffers(Relation rel)
  {
      /* Open rel at the smgr level if not already done */
      RelationOpenSmgr(rel);

 
!     FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
! }
! 
! /*
!  * Like FlushRelationBuffers(), but the relation is specified by a
!  * RelFileNode
!  */
! void
! FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
! {
!     FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
! }
! 
! /*
!  * Code shared between functions FlushRelationBuffers() and
!  * FlushRelationBuffersWithoutRelCache().
!  */
! static void
! FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
! {
!     RelFileNode rnode = smgr->smgr_rnode.node;
!     int            i;
!     BufferDesc *bufHdr;
! 
!     if (islocal)
      {
          for (i = 0; i < NLocBuffer; i++)
          {
              uint32        buf_state;

              bufHdr = GetLocalBufferDescriptor(i);

!             if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                  ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                   (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
              {

*************** FlushRelationBuffers(Relation rel)
*** 3177,3183 ****
                  PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

!                 smgrwrite(rel->rd_smgr,
                            bufHdr->tag.forkNum,
                            bufHdr->tag.blockNum,
                            localpage,
--- 3199,3205 ----
                  PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);

!                 smgrwrite(smgr,
                            bufHdr->tag.forkNum,
                            bufHdr->tag.blockNum,
                            localpage,
*************** FlushRelationBuffers(Relation rel)
*** 3207,3224 ****
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
!         if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
              continue;

          ReservePrivateRefCountEntry();

          buf_state = LockBufHdr(bufHdr);
!         if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
              (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
          {
              PinBuffer_Locked(bufHdr);
              LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
!             FlushBuffer(bufHdr, rel->rd_smgr);
              LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
              UnpinBuffer(bufHdr, true);
          }
--- 3229,3246 ----
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
!         if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
              continue;

          ReservePrivateRefCountEntry();

          buf_state = LockBufHdr(bufHdr);
!         if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
              (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
          {
              PinBuffer_Locked(bufHdr);
              LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
!             FlushBuffer(bufHdr, smgr);
              LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
              UnpinBuffer(bufHdr, true);
          }
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 72,77 ****
--- 72,78 ----
  #include "optimizer/var.h"
  #include "rewrite/rewriteDefine.h"
  #include "rewrite/rowsecurity.h"
+ #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  #include "utils/array.h"
*************** AllocateRelationDesc(Form_pg_class relp)
*** 418,423 ****
--- 419,428 ----
      /* which we mark as a reference-counted tupdesc */
      relation->rd_att->tdrefcount = 1;

+     /* We don't know if a pending sync for this relation exists so far */
+     relation->pending_sync = NULL;
+     relation->no_pending_sync = false;
+
      MemoryContextSwitchTo(oldcxt);

      return relation;
*************** formrdesc(const char *relationName, Oid relationReltype,
*** 2040,2045 ****
--- 2045,2054 ----
          relation->rd_rel->relhasindex = true;
      }

+     /* We don't know if a pending sync for this relation exists so far */
+     relation->pending_sync = NULL;
+     relation->no_pending_sync = false;
+
      /*
       * add new reldesc to relcache
       */
*************** RelationBuildLocalRelation(const char *relname,
*** 3364,3369 ****
--- 3373,3382 ----
      else
          rel->rd_rel->relfilenode = relfilenode;

+     /* a newly built relation has no pending sync */
+     rel->no_pending_sync = true;
+     rel->pending_sync = NULL;
+
      RelationInitLockInfo(rel);    /* see lmgr.c */

      RelationInitPhysicalAddr(rel);
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 25,34 ****

  /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_WAL    0x0001
! #define HEAP_INSERT_SKIP_FSM    0x0002
! #define HEAP_INSERT_FROZEN        0x0004
! #define HEAP_INSERT_SPECULATIVE 0x0008

  typedef struct BulkInsertStateData *BulkInsertState;

--- 25,33 ----

  /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_FSM    0x0001
! #define HEAP_INSERT_FROZEN        0x0002
! #define HEAP_INSERT_SPECULATIVE 0x0004

  typedef struct BulkInsertStateData *BulkInsertState;
*************** extern void simple_heap_delete(Relation relation, ItemPointer tid);
*** 179,184 ****
--- 178,184 ----
  extern void simple_heap_update(Relation relation, ItemPointer otid,
                     HeapTuple tup);
+ extern void heap_register_sync(Relation relation);
  extern void heap_sync(Relation relation);
  extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);

 
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
*************** extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
*** 22,34 ****
  extern void RelationDropStorage(Relation rel);
  extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
!
  /*
   * These functions used to be in storage/smgr/smgr.c, which explains the
   * naming
   */
  extern void smgrDoPendingDeletes(bool isCommit);
  extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
  extern void AtSubCommit_smgr(void);
  extern void AtSubAbort_smgr(void);
  extern void PostPrepare_smgr(void);
--- 22,37 ----
  extern void RelationDropStorage(Relation rel);
  extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
! extern void RelationRemovePendingSync(Relation rel);
  /*
   * These functions used to be in storage/smgr/smgr.c, which explains the
   * naming
   */
  extern void smgrDoPendingDeletes(bool isCommit);
  extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+ extern void smgrDoPendingSyncs(bool isCommit);
+ extern void RecordPendingSync(Relation rel);
+ extern bool BufferNeedsWAL(Relation rel, Buffer buf);
  extern void AtSubCommit_smgr(void);
  extern void AtSubAbort_smgr(void);
  extern void PostPrepare_smgr(void);
 
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
*************** extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
*** 190,195 ****
--- 190,197 ----
                                  ForkNumber forkNum);
  extern void FlushOneBuffer(Buffer buffer);
  extern void FlushRelationBuffers(Relation rel);
+ extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                     bool islocal);
  extern void FlushDatabaseBuffers(Oid dbid);
  extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                         ForkNumber forkNum, BlockNumber firstDelBlock);
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
*************** typedef struct RelationData
*** 216,221 ****
--- 216,229 ----
      /* use "struct" here to avoid needing to include pgstat.h: */
      struct PgStat_TableStatus *pgstat_info;    /* statistics collection area */

+
+     /*
+      * no_pending_sync is true if this relation is known not to have a
+      * pending sync.  Otherwise, a search for a registered sync is
+      * required if pending_sync is NULL.
+      */
+     bool                   no_pending_sync;
+     struct PendingRelSync *pending_sync;
  } RelationData;
