Re: [HACKERS] WAL logging problem in 9.4.3?

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: [HACKERS] WAL logging problem in 9.4.3?
Дата
Msg-id 20190820060314.GA3086296@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Ответы Re: [HACKERS] WAL logging problem in 9.4.3?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Список pgsql-hackers
On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> 
> Now TwoPhaseFileHeader has two new members for (commit-time)
> pending syncs. Pending-syncs are useless on wal-replay, but that
> is needed for commit-prepared.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files.  (A crash may
have already destroyed their data.)  PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.

A failure in these new syncs needs to prevent the transaction from being
marked committed.  Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could create assign a new relfilenode and
before RecordTransactionCommit().  I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > +                    smgrimmedsync(srel, MAIN_FORKNUM);
> > 
> > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
> 
> I agree that all forks needs syncing, but FSM and VM are checking
> RelationNeedsWAL(modified). To make sure, are you suggesting to
> sync all forks instead of emitting WAL for them, or suggesting
> that VM and FSM to emit WALs even when the modified
> RelationNeedsWAL returns false (+ sync all forks)?

I hadn't thought that far.  What do you think is best?

> > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > at that point."  Please write the part to WAL-log the contents of small files
> > instead of syncing them.
> 
> I'm not sure the point of the behavior. I suppose that the "log"
> is a sequence of new_page records. It also needs to be synced and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without the point.

Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
every buffer header containing a buffer of the current database.  The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > >       * If it does commit, we'll have done the table_finish_bulk_insert() at
> > >       * the bottom of this routine first.
> > >       *
> > > -     * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > -     * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > > -     * can be cleared before the end of the transaction. The exact case is
> > > -     * when a relation sets a new relfilenode twice in same transaction, yet
> > > -     * the second one fails in an aborted subtransaction, e.g.
> > > -     *
> > > -     * BEGIN;
> > > -     * TRUNCATE t;
> > > -     * SAVEPOINT save;
> > > -     * TRUNCATE t;
> > > -     * ROLLBACK TO save;
> > > -     * COPY ...
> > 
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug.  The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
> 
> (Un?)Fortunately, that doesn't fail.. (with rebased version on
> the recent master) I'll recheck that tomorrow.

Did you build with --enable-cassert?

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > >      SubTransactionId rd_createSubid;    /* rel was created in current xact */
> > >      SubTransactionId rd_newRelfilenodeSubid;    /* new relfilenode assigned in
> > >                                                   * current xact */
> > > +    SubTransactionId rd_firstRelfilenodeSubid;    /* new relfilenode assigned
> > > +                                                 * first in current xact */

> >  This field needs to be 100% reliable.  In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
> 
> I don't get this. I think the variable moves as you suggested. It
> is handled same way with fd_new* in AtEOSubXact_cleanup but the
> difference is in assignment but rollback. rd_fist* won't change
> after the first assignment so rollback of the subid means
> relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

$ git grep -n 'rd_firstRelfilenodeSubid = '
src/backend/commands/cluster.c:1061:            rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
src/backend/utils/cache/relcache.c:3067:    relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
src/backend/utils/cache/relcache.c:3173:            relation->rd_firstRelfilenodeSubid = parentSubid;
src/backend/utils/cache/relcache.c:3175:            relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;

swap_relation_files() is the only place initializing this field.  Many paths
that assign a new relfilenode will never call swap_relation_files().



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: Fixing typos and inconsistencies
Следующее
От: Kyotaro Horiguchi
Дата:
Сообщение: Re: FETCH FIRST clause PERCENT option