Re: [HACKERS] WAL logging problem in 9.4.3?

Поиск
Список
Период
Сортировка
От Kyotaro Horiguchi
Тема Re: [HACKERS] WAL logging problem in 9.4.3?
Дата
Msg-id 20190822.210606.07927021.horikyota.ntt@gmail.com
обсуждение исходный текст
Ответ на Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Ответы Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190820060314.GA3086296@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > > not actually sync all forks.
> > 
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks then no WALs. We could
skip syncing FSM but I'm not sure it's work doing.


> > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > at that point."  Please write the part to WAL-log the contents of small files
> > > instead of syncing them.
> > 
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss.. The decision between WAL and sync is made at
commit time, when we no longer have a pin on a buffer. When
emitting WAL, opposite to the assumption, lock needs to be
re-acquired for every page to emit log_new_page. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, ring buffer strategy makes the situnation farther
worse. That doesn't seem cheap at all..

If there were any chance on WAL for smaller files here, it would
be on the files smaller than the ring size of bulk-write
strategy(16MB).

If we pick up every buffer page of the file instead of scanning
through all buffers, that makes things worse by conflicts on
partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Konstantin Knizhnik
Дата:
Сообщение: Re: Why overhead of SPI is so large?
Следующее
От: Peter Eisentraut
Дата:
Сообщение: mingw32 floating point diff