Re: WAL logging problem in 9.4.3?

From: Kyotaro HORIGUCHI
Subject: Re: WAL logging problem in 9.4.3?
Date: 29 September 2016
Msg-id: 20160929.220257.67781565.horiguchi.kyotaro@lab.ntt.co.jp
In response to: Re: WAL logging problem in 9.4.3?  (Michael Paquier <michael.paquier@gmail.com>)
Responses: Re: WAL logging problem in 9.4.3?  (Michael Paquier <michael.paquier@gmail.com>)
Re: [HACKERS] WAL logging problem in 9.4.3?  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List: pgsql-hackers
Hello,

At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com>
> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > Hello, I return to this before my things:)
> >
> > Though I haven't played with the patch yet..
> 
> Be sure to run the test cases in the patch or base your tests on them then!

All items of 006_truncate_opt fail on ed0b228, and they are fixed
by the patch.

> > Though I don't know how it actually impacts the performance, it
> > seems to me that we can live with truncated_to and sync_above in
> > RelationData and BufferNeedsWAL(rel, buf) instead of
> > HeapNeedsWAL(rel, buf).  Anyway up to one entry for one relation
> > seems to exist at once in the hash.
> 
> TBH, I still think that the design of this patch as proposed is pretty
> cool and easy to follow.

It is clean from a certain viewpoint, but the additional hash,
especially the hash search on every HeapNeedsWAL call, seems
unacceptable to me. Do you find it acceptable?


The attached patch is a quiiiccck-and-dirty hack of Michael's patch,
just as a PoC of my proposal quoted above. It also passes the
006 test.  The major changes are the following.

- Moved sync_above and truncated_to into RelationData.

- Cleanup is done in AtEOXact_cleanup instead of by an explicit call to smgrDoPendingSyncs().

- BufferNeedsWAL (replacing HeapNeedsWAL) no longer requires a hash_search. It just refers to the additional members in
the given Relation; a condensed sketch follows.
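
(For illustration only: the gist of that check, condensed from the
BufferNeedsWAL() this patch adds to catalog/storage.c below, with the
DEBUG2 elogs omitted.)

    bool
    BufferNeedsWAL(Relation rel, Buffer buf)
    {
        BlockNumber blkno;

        if (!RelationNeedsWAL(rel))
            return false;       /* temp/unlogged relations never log WAL */

        blkno = BufferGetBlockNumber(buf);

        /* blocks below the registered sync point still need WAL */
        if (rel->sync_above == InvalidBlockNumber || rel->sync_above > blkno)
            return true;

        /* blocks at or above an in-transaction truncation point, too */
        if (rel->truncated_to != InvalidBlockNumber &&
            rel->truncated_to <= blkno)
            return true;

        return false;           /* safe to skip WAL; fsync'ed at commit */
    }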
 

X I feel that I have dropped one of the features of the original patch during the hack, but I don't recall it clearly
now:(

X I haven't considered relfilenode replacement, which didn't matter for the original patch (but there are a few places
to consider).

What do you think about this?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 38bba16..02e33cc 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -34,6 +34,28 @@
  *      the POSTGRES heap access method used for all POSTGRES
  *      relations.
  *
+ * WAL CONSIDERATIONS
+ *      All heap operations are normally WAL-logged, but there are a few
+ *      exceptions. Temporary and unlogged relations never need to be
+ *      WAL-logged, but we can also skip WAL-logging for a table that was
+ *      created in the same transaction, if we don't need WAL for PITR or
+ *      WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+ *      the file to disk at COMMIT instead.
+ *
+ *      The same-relation optimization is not employed automatically on all
+ *      updates to a table that was created in the same transaction, because
+ *      for a small number of changes, it's cheaper to just create the WAL
+ *      records than fsyncing() the whole relation at COMMIT. It is only
+ *      worthwhile for (presumably) large operations like COPY, CLUSTER,
+ *      or VACUUM FULL. Use heap_register_sync() to initiate such an
+ *      operation; it will cause any subsequent updates to the table to skip
+ *      WAL-logging, if possible, and cause the heap to be synced to disk at
+ *      COMMIT.
+ *
+ *      To make that work, all modifications to heap must use
+ *      HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+ *      for the given block.
+ *
  *-------------------------------------------------------------------------
  */
 
 #include "postgres.h"
@@ -55,6 +77,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -2331,12 +2354,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2440,7 +2457,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_insert xlrec;
         xl_heap_header xlhdr;
@@ -2639,12 +2656,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
     int            ndone;
     char       *scratch = NULL;
     Page        page;
-    bool        needwal;
     Size        saveFreeSpace;
     bool        need_tuple_data = RelationIsLogicallyLogged(relation);
     bool        need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-    needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
     saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
                                                    HEAP_DEFAULT_FILLFACTOR);
 
@@ -2659,7 +2674,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
      * palloc() within a critical section is not safe, so we allocate this
      * beforehand.
      */
-    if (needwal)
+    if (RelationNeedsWAL(relation))
         scratch = palloc(BLCKSZ);
 
     /*
@@ -2694,6 +2709,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
         Buffer        vmbuffer = InvalidBuffer;
         bool        all_visible_cleared = false;
         int            nthispage;
+        bool        needwal;
 
         CHECK_FOR_INTERRUPTS();
@@ -2705,6 +2721,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                                            InvalidBuffer, options, bistate,
                                            &vmbuffer, NULL);
         page = BufferGetPage(buffer);
+        needwal = BufferNeedsWAL(relation, buffer);
 
         /* NO EREPORT(ERROR) from here till changes are logged */
         START_CRIT_SECTION();
 
@@ -3261,7 +3278,7 @@ l1:
      * NB: heap_abort_speculative() uses the same xlog record and replay
      * routines.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -4194,7 +4211,8 @@ l2:
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer) ||
+        BufferNeedsWAL(relation, newbuf))
     {
         XLogRecPtr    recptr;
@@ -5148,7 +5166,7 @@ failed:
      * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
      * entries for everything anyway.)
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, *buffer))
     {
         xl_heap_lock xlrec;
         XLogRecPtr    recptr;
@@ -5825,7 +5843,7 @@ l4:
         MarkBufferDirty(buf);
 
         /* XLOG stuff */
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, buf))
         {
             xl_heap_lock_updated xlrec;
             XLogRecPtr    recptr;
@@ -5980,7 +5998,7 @@ heap_finish_speculative(Relation relation, HeapTuple tuple)
     htup->t_ctid = tuple->t_self;
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_confirm xlrec;
         XLogRecPtr    recptr;
@@ -6112,7 +6130,7 @@ heap_abort_speculative(Relation relation, HeapTuple tuple)
      * The WAL records generated here match heap_delete().  The same recovery
      * routines are used.
      */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_delete xlrec;
         XLogRecPtr    recptr;
@@ -6218,7 +6236,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(relation))
+    if (BufferNeedsWAL(relation, buffer))
     {
         xl_heap_inplace xlrec;
         XLogRecPtr    recptr;
@@ -7331,7 +7349,7 @@ log_heap_clean(Relation reln, Buffer buffer,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
 
     xlrec.latestRemovedXid = latestRemovedXid;
     xlrec.nredirected = nredirected;
@@ -7379,7 +7397,7 @@ log_heap_freeze(Relation reln, Buffer buffer, TransactionId cutoff_xid,
     XLogRecPtr    recptr;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, buffer));
     /* nor when there are no tuples to freeze */
     Assert(ntuples > 0);
@@ -7464,7 +7482,7 @@ log_heap_update(Relation reln, Buffer oldbuf,
     int            bufflags;
 
     /* Caller should not call me on a non-WAL-logged relation */
-    Assert(RelationNeedsWAL(reln));
+    Assert(BufferNeedsWAL(reln, newbuf) || BufferNeedsWAL(reln, oldbuf));
 
     XLogBeginInsert();
@@ -7567,76 +7585,86 @@ log_heap_update(Relation reln, Buffer oldbuf,
     xlrec.new_offnum = ItemPointerGetOffsetNumber(&newtup->t_self);
     xlrec.new_xmax = HeapTupleHeaderGetRawXmax(newtup->t_data);
 
+    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
+
     bufflags = REGBUF_STANDARD;
     if (init)
         bufflags |= REGBUF_WILL_INIT;
     if (need_tuple_data)
         bufflags |= REGBUF_KEEP_DATA;
 
-    XLogRegisterBuffer(0, newbuf, bufflags);
-    if (oldbuf != newbuf)
-        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
-
-    XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
-
     /*
      * Prepare WAL data for the new tuple.
      */
-    if (prefixlen > 0 || suffixlen > 0)
+    if (BufferNeedsWAL(reln, newbuf))
     {
-        if (prefixlen > 0 && suffixlen > 0)
-        {
-            prefix_suffix[0] = prefixlen;
-            prefix_suffix[1] = suffixlen;
-            XLogRegisterBufData(0, (char *) &prefix_suffix, sizeof(uint16) * 2);
-        }
-        else if (prefixlen > 0)
-        {
-            XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
-        }
-        else
+        XLogRegisterBuffer(0, newbuf, bufflags);
+
+        if ((prefixlen > 0 || suffixlen > 0))
         {
-            XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            if (prefixlen > 0 && suffixlen > 0)
+            {
+                prefix_suffix[0] = prefixlen;
+                prefix_suffix[1] = suffixlen;
+                XLogRegisterBufData(0, (char *) &prefix_suffix,
+                                    sizeof(uint16) * 2);
+            }
+            else if (prefixlen > 0)
+            {
+                XLogRegisterBufData(0, (char *) &prefixlen, sizeof(uint16));
+            }
+            else
+            {
+                XLogRegisterBufData(0, (char *) &suffixlen, sizeof(uint16));
+            }
         }
-    }
 
-    xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
-    xlhdr.t_infomask = newtup->t_data->t_infomask;
-    xlhdr.t_hoff = newtup->t_data->t_hoff;
-    Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
+        xlhdr.t_infomask2 = newtup->t_data->t_infomask2;
+        xlhdr.t_infomask = newtup->t_data->t_infomask;
+        xlhdr.t_hoff = newtup->t_data->t_hoff;
+        Assert(SizeofHeapTupleHeader + prefixlen + suffixlen <= newtup->t_len);
 
-    /*
-     * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
-     *
-     * The 'data' doesn't include the common prefix or suffix.
-     */
-    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
-    if (prefixlen == 0)
-    {
-        XLogRegisterBufData(0,
-                            ((char *) newtup->t_data) + SizeofHeapTupleHeader,
-                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);
-    }
-    else
-    {
         /*
-         * Have to write the null bitmap and data after the common prefix as
-         * two separate rdata entries.
+         * PG73FORMAT: write bitmap [+ padding] [+ oid] + data
+         *
+         * The 'data' doesn't include the common prefix or suffix.
          */
-        /* bitmap [+ padding] [+ oid] */
-        if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+        XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
+        if (prefixlen == 0)
         {
             XLogRegisterBufData(0,
                            ((char *) newtup->t_data) + SizeofHeapTupleHeader,
-                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+                          newtup->t_len - SizeofHeapTupleHeader - suffixlen);
         }
+        else
+        {
+            /*
+             * Have to write the null bitmap and data after the common prefix
+             * as two separate rdata entries.
+             */
+            /* bitmap [+ padding] [+ oid] */
+            if (newtup->t_data->t_hoff - SizeofHeapTupleHeader > 0)
+            {
+                XLogRegisterBufData(0,
+                           ((char *) newtup->t_data) + SizeofHeapTupleHeader,
+                             newtup->t_data->t_hoff - SizeofHeapTupleHeader);
+            }
 
-        /* data after common prefix */
-        XLogRegisterBufData(0,
+            /* data after common prefix */
+            XLogRegisterBufData(0,
               ((char *) newtup->t_data) + newtup->t_data->t_hoff + prefixlen,
               newtup->t_len - newtup->t_data->t_hoff - prefixlen - suffixlen);
+        }
     }
 
+    /*
+     * If the old and new tuple are on different pages, also register the old
+     * page, so that a full-page image is created for it if necessary. We
+     * don't need any extra information to replay changes to it.
+     */
+    if (oldbuf != newbuf && BufferNeedsWAL(reln, oldbuf))
+        XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
+
     /* We need to log a tuple identity */
     if (need_tuple_data && old_key_tuple)
     {
@@ -8555,8 +8583,13 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
      */
 
     /* Deal with old tuple version */
-    oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
-                                      &obuffer);
+    if (oldblk == newblk)
+        oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+    else if (XLogRecHasBlockRef(record, 1))
+        oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+    else
+        oldaction = BLK_DONE;
+
     if (oldaction == BLK_NEEDS_REDO)
     {
         page = BufferGetPage(obuffer);
@@ -8610,6 +8643,8 @@ heap_xlog_update(XLogReaderState *record, bool hot_update)
         PageInit(page, BufferGetPageSize(nbuffer), 0);
         newaction = BLK_NEEDS_REDO;
     }
+    else if (!XLogRecHasBlockRef(record, 0))
+        newaction = BLK_DONE;
     else
         newaction = XLogReadBufferForRedo(record, 0, &nbuffer);
@@ -9046,9 +9081,16 @@ heap2_redo(XLogReaderState *record)
  *    heap_sync        - sync a heap, for use when no WAL has been written
  *
  * This forces the heap contents (including TOAST heap if any) down to disk.
- * If we skipped using WAL, and WAL is otherwise needed, we must force the
- * relation down to disk before it's safe to commit the transaction.  This
- * requires writing out any dirty buffers and then doing a forced fsync.
+ * If we did any changes to the heap bypassing the buffer manager, we must
+ * force the relation down to disk before it's safe to commit the
+ * transaction, because the direct modifications will not be flushed by
+ * the next checkpoint.
+ *
+ * We used to also use this after batch operations like COPY and CLUSTER,
+ * if we skipped using WAL and WAL is otherwise needed, but there were
+ * corner-cases involving other WAL-logged operations to the same
+ * relation, where that was not enough. heap_register_sync() should be
+ * used for that purpose instead.
  *
  * Indexes are not touched.  (Currently, index operations associated with
  * the commands that use this are WAL-logged and so do not need fsync.
 
@@ -9081,3 +9123,33 @@ heap_sync(Relation rel)
         heap_close(toastrel, AccessShareLock);
     }
 }
+
+/*
+ *    heap_register_sync    - register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. This makes note of the current size of
+ * the relation, and ensures that when the relation is extended, any changes
+ * to the new blocks in the heap, in the same transaction, will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * This does the same for the TOAST heap, if any. Indexes are not affected.
+ */
+void
+heap_register_sync(Relation rel)
+{
+    /* non-WAL-logged tables never need fsync */
+    if (!RelationNeedsWAL(rel))
+        return;
+
+    RecordPendingSync(rel);
+    if (OidIsValid(rel->rd_rel->reltoastrelid))
+    {
+        Relation    toastrel;
+
+        toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+        RecordPendingSync(toastrel);
+        heap_close(toastrel, AccessShareLock);
+    }
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 6ff9251..27a2447 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -20,6 +20,7 @@
 #include "access/htup_details.h"
 #include "access/xlog.h"
 #include "catalog/catalog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -260,7 +261,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
         /*
          * Emit a WAL HEAP_CLEAN record showing what we did
          */
-        if (RelationNeedsWAL(relation))
+        if (BufferNeedsWAL(relation, buffer))
        {
             XLogRecPtr    recptr;
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index f9ce986..36ba62a 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -649,9 +649,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
     }
     else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
         heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
-                                         HEAP_INSERT_SKIP_FSM |
-                                         (state->rs_use_wal ?
-                                          0 : HEAP_INSERT_SKIP_WAL));
+                                         HEAP_INSERT_SKIP_FSM);
     else
         heaptup = tup;
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..e08623c 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -88,6 +88,7 @@
 #include "access/heapam_xlog.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -307,7 +308,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
         map[mapByte] |= (flags << mapOffset);
         MarkBufferDirty(vmBuf);
 
-        if (RelationNeedsWAL(rel))
+        if (BufferNeedsWAL(rel, heapBuf))
         {
             if (XLogRecPtrIsInvalid(recptr))
             {
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..a2f03a7 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -260,31 +260,41 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
      */
     if (RelationNeedsWAL(rel))
     {
-        /*
-         * Make an XLOG entry reporting the file truncation.
-         */
-        XLogRecPtr    lsn;
-        xl_smgr_truncate xlrec;
-
-        xlrec.blkno = nblocks;
-        xlrec.rnode = rel->rd_node;
-        xlrec.flags = SMGR_TRUNCATE_ALL;
-
-        XLogBeginInsert();
-        XLogRegisterData((char *) &xlrec, sizeof(xlrec));
-
-        lsn = XLogInsert(RM_SMGR_ID,
-                         XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
-
-        /*
-         * Flush, because otherwise the truncation of the main relation might
-         * hit the disk before the WAL record, and the truncation of the FSM
-         * or visibility map. If we crashed during that window, we'd be left
-         * with a truncated heap, but the FSM or visibility map would still
-         * contain entries for the non-existent heap pages.
-         */
-        if (fsm || vm)
-            XLogFlush(lsn);
+        if (rel->sync_above == InvalidBlockNumber ||
+            rel->sync_above < nblocks)
+        {
+            /*
+             * Make an XLOG entry reporting the file truncation.
+             */
+            XLogRecPtr        lsn;
+            xl_smgr_truncate xlrec;
+
+            xlrec.blkno = nblocks;
+            xlrec.rnode = rel->rd_node;
+
+            XLogBeginInsert();
+            XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+            lsn = XLogInsert(RM_SMGR_ID,
+                             XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+            elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
+                 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+                 nblocks);
+
+            /*
+             * Flush, because otherwise the truncation of the main relation
+             * might hit the disk before the WAL record, and the truncation of
+             * the FSM or visibility map. If we crashed during that window,
+             * we'd be left with a truncated heap, but the FSM or visibility
+             * map would still contain entries for the non-existent heap
+             * pages.
+             */
+            if (fsm || vm)
+                XLogFlush(lsn);
+
+            rel->truncated_to = nblocks;
+        }
     }
 
     /* Do the real work */
@@ -419,6 +429,59 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
     return nrels;
 }
+void
+RecordPendingSync(Relation rel)
+{
+    Assert(RelationNeedsWAL(rel));
+
+    if (rel->sync_above == InvalidBlockNumber)
+    {
+        elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             RelationGetNumberOfBlocks(rel));
+        rel->sync_above = RelationGetNumberOfBlocks(rel);
+    }
+    else
+        elog(DEBUG2, "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             rel->sync_above, RelationGetNumberOfBlocks(rel));
+}
+
+bool
+BufferNeedsWAL(Relation rel, Buffer buf)
+{
+    BlockNumber blkno = InvalidBlockNumber;
+
+    if (!RelationNeedsWAL(rel))
+        return false;
+
+    blkno = BufferGetBlockNumber(buf);
+    if (rel->sync_above == InvalidBlockNumber ||
+         rel->sync_above > blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+             blkno, rel->sync_above);
+        return true;
+    }
+
+    /*
+     * We have emitted a truncation record for this block.
+     */
+    if (rel->truncated_to != InvalidBlockNumber &&
+        rel->truncated_to <= blkno)
+    {
+        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same
xact",
+             rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+        return true;
+    }
+
+    elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+         rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode, blkno);
+
+    return false;
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index f45b330..a0fe63f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2269,8 +2269,7 @@ CopyFrom(CopyState cstate)
      *    - data is being written to relfilenode created in this transaction
      * then we can skip writing WAL.  It's safe because if the transaction
      * doesn't commit, we'll discard the table (or the new relfilenode file).
-     * If it does commit, we'll have done the heap_sync at the bottom of this
-     * routine first.
+     * If it does commit, commit will do heap_sync().
      *
      * As mentioned in comments in utils/rel.h, the in-same-transaction test
      * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
@@ -2302,7 +2301,7 @@ CopyFrom(CopyState cstate)
     {
         hi_options |= HEAP_INSERT_SKIP_FSM;
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(cstate->rel);
     }
 
     /*
@@ -2551,11 +2550,11 @@ CopyFrom(CopyState cstate)
     FreeExecutorState(estate);
 
     /*
-     * If we skipped writing WAL, then we need to sync the heap (but not
-     * indexes since those use WAL anyway)
+     * If we skipped writing WAL, then we will sync the heap at the end of
+     * the transaction. (We used to do it here, but it was later found out
+     * that to be safe, we must also avoid WAL-logging any subsequent
+     * actions on the pages we skipped WAL for). Indexes always use WAL.
      */
-    if (hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(cstate->rel);
 
     return processed;
 }
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 5b4f6af..b64d52a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -567,8 +567,9 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      * We can skip WAL-logging the insertions, unless PITR or streaming
      * replication is in use. We can skip the FSM in any case.
      */
-    myState->hi_options = HEAP_INSERT_SKIP_FSM |
-        (XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
+    if (!XLogIsNeeded())
+        heap_register_sync(intoRelationDesc);
+    myState->hi_options = HEAP_INSERT_SKIP_FSM;
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
 
@@ -617,9 +618,7 @@ intorel_shutdown(DestReceiver *self)
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->rel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close rel, but keep lock until commit */
     heap_close(myState->rel, NoLock);
 
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index 6cddcbd..dbef95b 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -456,7 +456,7 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
      */
     myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
     if (!XLogIsNeeded())
-        myState->hi_options |= HEAP_INSERT_SKIP_WAL;
+        heap_register_sync(transientrel);
     myState->bistate = GetBulkInsertState();
 
     /* Not using WAL requires smgr_targblock be initially invalid */
 
@@ -499,9 +499,7 @@ transientrel_shutdown(DestReceiver *self)
     FreeBulkInsertState(myState->bistate);
 
-    /* If we skipped using WAL, must heap_sync before commit */
-    if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
-        heap_sync(myState->transientrel);
+    /* If we skipped using WAL, we will sync the relation at commit */
 
     /* close transientrel, but keep lock until commit */
     heap_close(myState->transientrel, NoLock);
 
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 86e9814..ca892ea 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3984,8 +3984,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         bistate = GetBulkInsertState();
 
         hi_options = HEAP_INSERT_SKIP_FSM;
+
         if (!XLogIsNeeded())
-            hi_options |= HEAP_INSERT_SKIP_WAL;
+            heap_register_sync(newrel);
     }
     else
     {
@@ -4236,8 +4237,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
         FreeBulkInsertState(bistate);
 
         /* If we skipped writing WAL, then we need to sync the heap. */
-        if (hi_options & HEAP_INSERT_SKIP_WAL)
-            heap_sync(newrel);
 
         heap_close(newrel, NoLock);
     }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 231e92d..3662f7b 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -879,7 +879,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
                  * page has been previously WAL-logged, and if not, do that
                  * now.
                  */
-                if (RelationNeedsWAL(onerel) &&
+                if (BufferNeedsWAL(onerel, buf) &&
                     PageGetLSN(page) == InvalidXLogRecPtr)
                     log_newpage_buffer(buf, true);
 
@@ -1106,7 +1106,7 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
             }
 
             /* Now WAL-log freezing if necessary */
-            if (RelationNeedsWAL(onerel))
+            if (BufferNeedsWAL(onerel, buf))
             {
                 XLogRecPtr    recptr;
@@ -1462,7 +1462,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
     MarkBufferDirty(buffer);
 
     /* XLOG stuff */
-    if (RelationNeedsWAL(onerel))
+    if (BufferNeedsWAL(onerel, buffer))
     {
         XLogRecPtr    recptr;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 76ade37..d128e63 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -451,6 +451,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
             BufferAccessStrategy strategy,
             bool *foundPtr);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static void CheckForBufferLeaks(void);
 static int    rnode_comparator(const void *p1, const void *p2);
@@ -3130,20 +3131,41 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-    int            i;
-    BufferDesc *bufHdr;
-
     /* Open rel at the smgr level if not already done */
     RelationOpenSmgr(rel);
 
-    if (RelationUsesLocalBuffers(rel))
+    FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+/*
+ * Like FlushRelationBuffers(), but the relation is specified by a
+ * RelFileNode
+ */
+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+    FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+/*
+ * Code shared between functions FlushRelationBuffers() and
+ * FlushRelationBuffersWithoutRelCache().
+ */
+static void
+FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
+{
+    RelFileNode rnode = smgr->smgr_rnode.node;
+    int            i;
+    BufferDesc *bufHdr;
+
+    if (islocal)
     {
         for (i = 0; i < NLocBuffer; i++)
         {
             uint32        buf_state;
 
             bufHdr = GetLocalBufferDescriptor(i);
-            if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+            if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                 ((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
                  (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
             {
 
@@ -3160,7 +3182,7 @@ FlushRelationBuffers(Relation rel)
                 PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-                smgrwrite(rel->rd_smgr,
+                smgrwrite(smgr,
                           bufHdr->tag.forkNum,
                           bufHdr->tag.blockNum,
                           localpage,
@@ -3190,18 +3212,18 @@ FlushRelationBuffers(Relation rel)
          * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
          * and saves some cycles.
          */
-        if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+        if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
             continue;
 
         ReservePrivateRefCountEntry();
 
         buf_state = LockBufHdr(bufHdr);
-        if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+        if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
             (buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
         {
             PinBuffer_Locked(bufHdr);
             LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
-            FlushBuffer(bufHdr, rel->rd_smgr);
+            FlushBuffer(bufHdr, smgr);
             LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
             UnpinBuffer(bufHdr, true);
         }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 8d2ad01..31ae0f1 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -66,6 +66,7 @@
 #include "optimizer/var.h"
 #include "rewrite/rewriteDefine.h"
 #include "rewrite/rowsecurity.h"
+#include "storage/bufmgr.h"
 #include "storage/lmgr.h"
 #include "storage/smgr.h"
 #include "utils/array.h"
@@ -407,6 +408,9 @@ AllocateRelationDesc(Form_pg_class relp)
     /* which we mark as a reference-counted tupdesc */
     relation->rd_att->tdrefcount = 1;
 
+    relation->sync_above = InvalidBlockNumber;
+    relation->truncated_to = InvalidBlockNumber;
+
     MemoryContextSwitchTo(oldcxt);
 
     return relation;
@@ -1731,6 +1735,9 @@ formrdesc(const char *relationName, Oid relationReltype,
         relation->rd_rel->relhasindex = true;
     }
 
+    relation->sync_above = InvalidBlockNumber;
+    relation->truncated_to = InvalidBlockNumber;
+
     /*
      * add new reldesc to relcache
      */
@@ -2055,6 +2062,22 @@ RelationDestroyRelation(Relation relation, bool remember_tupdesc)
     pfree(relation);
 }
 
+static void
+RelationDoPendingFlush(Relation relation)
+{
+    if (relation->sync_above != InvalidBlockNumber)
+    {
+        FlushRelationBuffersWithoutRelCache(relation->rd_node, false);
+        smgrimmedsync(smgropen(relation->rd_node, InvalidBackendId),
+                      MAIN_FORKNUM);
+
+        elog(DEBUG2, "syncing rel %u/%u/%u",
+             relation->rd_node.spcNode,
+             relation->rd_node.dbNode, relation->rd_node.relNode);
+    }
+}
+
 /*
  * RelationClearRelation
  *
@@ -2686,7 +2709,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
     if (relation->rd_createSubid != InvalidSubTransactionId)
     {
         if (isCommit)
+        {
+            RelationDoPendingFlush(relation);
             relation->rd_createSubid = InvalidSubTransactionId;
+        }
         else if (RelationHasReferenceCountZero(relation))
         {
             RelationClearRelation(relation, false);
@@ -3019,6 +3045,9 @@ RelationBuildLocalRelation(const char *relname,
     else
         rel->rd_rel->relfilenode = relfilenode;
 
+    rel->sync_above = InvalidBlockNumber;
+    rel->truncated_to = InvalidBlockNumber;
+
     RelationInitLockInfo(rel);    /* see lmgr.c */
 
     RelationInitPhysicalAddr(rel);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b3a595c..1c169ef 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -25,10 +25,9 @@
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL    0x0001
-#define HEAP_INSERT_SKIP_FSM    0x0002
-#define HEAP_INSERT_FROZEN        0x0004
-#define HEAP_INSERT_SPECULATIVE 0x0008
+#define HEAP_INSERT_SKIP_FSM    0x0001
+#define HEAP_INSERT_FROZEN        0x0002
+#define HEAP_INSERT_SPECULATIVE 0x0004
 
 typedef struct BulkInsertStateData *BulkInsertState;
@@ -177,6 +176,7 @@ extern void simple_heap_delete(Relation relation, ItemPointer tid);
 extern void simple_heap_update(Relation relation, ItemPointer otid,
                    HeapTuple tup);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index ef960da..235c2b4 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,8 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int    smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void RecordPendingSync(Relation rel);
+extern bool BufferNeedsWAL(Relation rel, Buffer buf);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3d5dea7..f02ea93 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -202,6 +202,8 @@ extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
                                     ForkNumber forkNum);
 extern void FlushOneBuffer(Buffer buffer);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+                                    bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
                        ForkNumber forkNum, BlockNumber firstDelBlock);
 
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index ed14442..a8a2b23 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -172,6 +172,9 @@ typedef struct RelationData
     /* use "struct" here to avoid needing to include pgstat.h: */
     struct PgStat_TableStatus *pgstat_info;        /* statistics collection area */
 
+
+    BlockNumber sync_above;
+    BlockNumber truncated_to;
 } RelationData;
