Re: [HACKERS] WAL logging problem in 9.4.3?

Поиск
Список
Период
Сортировка
От Kyotaro HORIGUCHI
Тема Re: [HACKERS] WAL logging problem in 9.4.3?
Дата
Msg-id 20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp
обсуждение исходный текст
Ответ на Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Ответы Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
Thank you for reviewing!

At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190311022708.GA2189728@rfd.leadboat.com>
> This has been waiting for a review since October, so I reviewed it.  The code
> comment at PendingRelSync summarizes the design well, and I like that design.

It is Michael's work.

> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch.  I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that.  Let's keep pursuing your current design.

I must admit that this is complex..

> This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
> to COMMIT.  Users setting a timeout on COMMIT may need to adjust, and
> log_min_duration_statement analysis will reflect the change.  I feel that's
> fine.  (There already exist ways for COMMIT to be slow.)
> 
> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > --- a/src/backend/access/nbtree/nbtsort.c
> > +++ b/src/backend/access/nbtree/nbtsort.c
> > @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> >      /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> >      RelationOpenSmgr(wstate->index);
> >  
> > -    /* XLOG stuff */
> > -    if (wstate->btws_use_wal)
> > +    /* XLOG stuff
> > +     *
> > +     * Even if minimal mode, WAL is required here if truncation happened after
> > +     * being created in the same transaction. It is not needed otherwise but
> > +     * we don't bother identifying the case precisely.
> > +     */
> > +    if (wstate->btws_use_wal ||
> > +        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
> 
> We initialized "btws_use_wal" like this:
> 
>     #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
>     #define RelationNeedsWAL(relation) \
>         ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
>     wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
> 
> Hence, this change causes us to emit WAL for the metapage of a
> RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation.  We should never do
> that.  If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> relfilenode.  I've attached a test case for this; it is a patch that applies
> on top of your v7 patches.  The test checks for orphaned files after redo.

Oops!  Added RelationNeedsWAL(index) there. (Attched 1st patch on
top of this patchset)

> > +     * If no tuple was inserted, it's possible that we are truncating a
> > +     * relation. We need to emit WAL for the metapage in the case. However it
> > +     * is not required elsewise,
> 
> Did you mean to write more words after that comma?

Sorry, it is just a garbage. Required work is done in
_bt_blwritepage.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> 
> > + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> > + * any subsequent actions on the same block either. Replaying the WAL record
> > + * of the subsequent action might fail otherwise, as the "before" state of
> > + * the block might not match, as the earlier actions were not WAL-logged.
> 
> Good point.  To participate in WAL redo properly, each "before" state must
> have a distinct pd_lsn.  In CREATE INDEX USING btree, the initial index build
> skips WAL, but an INSERT later in the same transaction writes WAL.  There,
> however, each "before" state does have a distinct pd_lsn; the initial build
> has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
> Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
> doesn't conform to this code comment.

(The NB is Michael's work.)
Yes. Btree works differently from heap. Thak you for confirmation.

> I think this restriction applies only to full_page_writes=off.  Otherwise, the
> first WAL-logged change will find pd_lsn==0 and emit a full-page image.  With
> a full-page image in the record, the block's "before" state doesn't matter.
> Also, one could make it safe to write WAL for a particular block by issuing
> heap_sync() for the block's relation.

Umm.. Once truncate happens, WAL is emitted for all pages. If we
decide to skip WALs on copy or similar bulk operations, WALs are
not emitted at all, including XLOG_HEAP_INIT_PAGE. So that
doesn't happen. The unlogged data is synced at commit time.

> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
> 
> What is the coding rule for deciding when to call this?  Currently, only
> ATExecSetTableSpace() calls this.  CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.
> > +{
> > +    bool found;
> > +
> > +    rel->pending_sync = NULL;
> > +    rel->no_pending_sync = true;
> > +    if (pendingSyncs)
> > +    {
> > +        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> > +        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> > +    }
> > +}
> 
> We'd need a mechanism to un-remove the sync at subtransaction abort.  My
> attachment includes a test case demonstrating the consequences of that defect.
> Please look for other areas that need to know about subtransactions; patch v7
> had no code pertaining to subtransactions.

Agreed It forgets about subtransaction rollbacks. I'll make
RelationRemovePendingSync just mark as "removed" and make
ROLLBACK TO and RELEASE process the flag make it work. (Attached
2nd patch on top of thie patchset)

> 
> > +        elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> 
> As you mention upthread, you have many debugging elog()s.  These are too
> detailed to include in every binary, but I do want them in the code.  See
> CACHE_elog() for a good example of achieving that.

Agreed will do. They were need to check the behavior precisely
but usually not needed.

> > +/*
> > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > + */
> > +void
> > +smgrDoPendingSyncs(bool isCommit)
> > +{
> > +    if (!pendingSyncs)
> > +        return;
> > +
> > +    if (isCommit)
> > +    {
> > +        HASH_SEQ_STATUS status;
> > +        PendingRelSync *pending;
> > +
> > +        hash_seq_init(&status, pendingSyncs);
> > +
> > +        while ((pending = hash_seq_search(&status)) != NULL)
> > +        {
> > +            if (pending->sync_above != InvalidBlockNumber)
> 
> I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> InvalidBlockNumber" are not sync requests at all.  Those just record the fact
> of a RelationTruncate() happening.  If you can think of a way to improve that,
> please do so.  If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so no need to sync. Does this make sense for
you? (Maybe commit is needed there)

> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> 
> > @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> >      /* which we mark as a reference-counted tupdesc */
> >      relation->rd_att->tdrefcount = 1;
> >  
> > +    /* We don't know if pending sync for this relation exists so far */
> > +    relation->pending_sync = NULL;
> > +    relation->no_pending_sync = false;
> 
> RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> prefix to these fields.
> This is a nonstandard place to clear fields.  Clear them in
> load_relcache_init_file() only, like we do for rd_statvalid.  (Other paths
> will then rely on palloc0() for implicit initialization.)

Agreed, will do in the next version.

> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
> 
> > @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> >      MarkBufferDirty(buffer);
> >  
> >      /* XLOG stuff */
> > -    if (RelationNeedsWAL(relation))
> > +    if (BufferNeedsWAL(relation, buffer) ||
> > +        BufferNeedsWAL(relation, newbuf))
> 
> This is fine if both buffers need WAL or neither buffer needs WAL.  It is not
> fine when one buffer needs WAL and the other buffer does not.  My attachment
> includes a test case.  Of the bugs I'm reporting, this one seems most
> difficult to solve well.

Yeah, it is right (and it's rather silly). Thank you for
pointing out. Will fix.

> > @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> >   *    heap_sync        - sync a heap, for use when no WAL has been written
> >   *
> >   * This forces the heap contents (including TOAST heap if any) down to disk.
> > - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> > - * relation down to disk before it's safe to commit the transaction.  This
> > - * requires writing out any dirty buffers and then doing a forced fsync.
> > + * If we did any changes to the heap bypassing the buffer manager, we must
> > + * force the relation down to disk before it's safe to commit the
> > + * transaction, because the direct modifications will not be flushed by
> > + * the next checkpoint.
> > + *
> > + * We used to also use this after batch operations like COPY and CLUSTER,
> > + * if we skipped using WAL and WAL is otherwise needed, but there were
> > + * corner-cases involving other WAL-logged operations to the same
> > + * relation, where that was not enough. heap_register_sync() should be
> > + * used for that purpose instead.
> 
> We still use heap_sync() in CLUSTER.  Can we migrate CLUSTER to the newer
> heap_register_sync()?  Patch v7 makes some commands use the new way (COPY,
> CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> TABLESPACE, CLUSTER).  It would make the system simpler to understand if we
> eliminated the old way.  If that creates more problems than it solves, please
> at least write down a coding rule to explain why certain commands shouldn't
> use the old way.

Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.

I don't have enough time for now so the new version will be
posted early next week.

Thanks you for the review!

regards.

diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index fb4a80bf1d..060e0171a5 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -627,7 +627,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
      * we don't bother identifying the case precisely.
      */
     if (wstate->btws_use_wal ||
-        (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
+        (RelationNeedsWAL(wstate->index) &&
+         (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0)))
     {
         /* We use the heap NEWPAGE record type for this */
         log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
@@ -1071,11 +1072,6 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
      * set to point to "P_NONE").  This changes the index to the "valid" state
      * by filling in a valid magic number in the metapage.
      */
-    /*
-     * If no tuple was inserted, it's possible that we are truncating a
-     * relation. We need to emit WAL for the metapage in the case. However it
-     * is not required elsewise,
-     */
     metapage = (Page) palloc(BLCKSZ);
     _bt_initmetapage(metapage, rootblkno, rootlevel);
     _bt_blwritepage(wstate, metapage, BTREE_METAPAGE);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d1210de8f4..3ce69b7a40 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -4037,6 +4037,8 @@ ReleaseSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessPendingSyncRemoval(s->subTransactionId, true);
+
     /*
      * Mark "commit pending" all subtransactions up to the target
      * subtransaction.  The actual commits will happen when control gets to
@@ -4146,6 +4148,8 @@ RollbackToSavepoint(const char *name)
                 (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
                  errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
 
+    smgrProcessPendingSyncRemoval(s->subTransactionId, false);
+
     /*
      * Mark "abort pending" all subtransactions up to the target
      * subtransaction.  The actual aborts will happen when control gets to
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 26dc3ddb1b..ad4a1e5127 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -99,6 +99,7 @@ typedef struct PendingRelSync
     BlockNumber sync_above;        /* WAL-logging skipped for blocks >=
                                  * sync_above */
     BlockNumber truncated_to;    /* truncation WAL record was written */
+    SubTransactionId removed_xid; /* subxid where this is removed */
 }    PendingRelSync;
 
 /* Relations that need to be fsync'd at commit */
@@ -405,6 +406,7 @@ getPendingSyncEntry(Relation rel, bool create)
     {
         pendsync_entry->truncated_to = InvalidBlockNumber;
         pendsync_entry->sync_above = InvalidBlockNumber;
+        pendsync_entry->removed_xid = InvalidSubTransactionId;
     }
 
     /* hold shortcut in Relation */
@@ -498,14 +500,17 @@ void
 RelationRemovePendingSync(Relation rel)
 {
     bool found;
+    PendingRelSync *pending_sync;
 
-    rel->pending_sync = NULL;
-    rel->no_pending_sync = true;
-    if (pendingSyncs)
-    {
-        elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
-        hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
-    }
+    if (rel->no_pending_sync)
+        return;
+
+    pending_sync = getPendingSyncEntry(rel, false);
+    
+    if (pending_sync)
+        return;
+
+    rel->pending_sync->removed_xid = GetCurrentSubTransactionId();
 }
 
 
@@ -693,6 +698,31 @@ smgrDoPendingSyncs(bool isCommit)
     pendingSyncs = NULL;
 }
 
+void
+smgrProcessPendingSyncRemoval(SubTransactionId sxid, bool isCommit)
+{
+    HASH_SEQ_STATUS status;
+    PendingRelSync *pending;
+
+    if (!pendingSyncs)
+        return;
+
+    hash_seq_init(&status, pendingSyncs);
+
+    while ((pending = hash_seq_search(&status)) != NULL)
+    {
+        if (pending->removed_xid == sxid)
+        {
+            pending->removed_xid = InvalidSubTransactionId;
+            if (isCommit)
+            {
+                pending->sync_above = InvalidBlockNumber;
+                pending->truncated_to = InvalidBlockNumber;
+            }
+        }        
+    }    
+}
+
 /*
  *    PostPrepare_smgr -- Clean up after a successful PREPARE
  *

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Fabien COELHO
Дата:
Сообщение: Re: Offline enabling/disabling of data checksums
Следующее
От: Amit Langote
Дата:
Сообщение: Re: speeding up planning with partitions