Re: [HACKERS] subscription worker signalling wal writer too much

Поиск

Список

Период

Сортировка

От	Jeff Janes
Тема	Re: [HACKERS] subscription worker signalling wal writer too much
Дата	16 июня 2017 г. 04:06:43
Msg-id	CAMkU=1z6_K4mE12vNirgV5qnu58+cZQdNxD+Pb5bzxnvOVVWoA@mail.gmail.com обсуждение исходный текст
Ответ на	Re: [HACKERS] subscription worker signalling wal writer too much (Andres Freund <andres@anarazel.de>)
Ответы	Re: [HACKERS] subscription worker signalling wal writer too much (Jeff Janes <jeff.janes@gmail.com>) Re: [HACKERS] subscription worker signalling wal writer too much (Andres Freund <andres@anarazel.de>)
Список	pgsql-hackers

Дерево обсуждения

On Wed, Jun 14, 2017 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
> On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
> > > On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
> > wrote:
> > >
> > > > If I publish a pgbench workload and subscribe to it, the subscription
> > > > worker is signalling the wal writer thousands of times a second, once
> > for
> > > > every async commit. This has a noticeable performance cost.
> > > >
> > >
> > > I've used a local variable to avoid waking up the wal writer more than
> > once
> > > for the same page boundary. This reduces the number of wake-ups by about
> > > 7/8.
> >
> > Maybe I'm missing something here, but isn't that going to reduce our
> > guarantees about when asynchronously committed xacts are flushed out?
> > You can easily fit a number of commits into the same page... As this
> > isn't specific to logical-rep, I don't think that's ok.
> >
>
> The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
> this changes that. (Also, it isn't really a guarantee, the fsync can take
> many seconds to complete once we do initiate it, and there is absolutely
> nothing we can do about that, other than do the fsync synchronously in the
> first place).

Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?

If it is taking the big sleep, then we always wake it up, since acd4c7d58baf09f.

I didn't change that part. I only changed what we do if it not hibernating.

> > Have you chased down why there's that many wakeups? Normally I'd have
> > expected that a number of the SetLatch() calls get consolidated
> > together, but I guess walwriter is "too quick" in waking up and
> > resetting the latch?

> I'll have to dig into that some more. The 7/8 reduction I cited was just
> in calls to SetLatch from that part of the code, I didn't measure whether
> the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
> determined that reduction, so it wasn't truly wake ups I measured. Actual
> wake ups were measured only indirectly via the impact on performance. I'll
> need to figure out how to instrument that without distorting the
> performance too much in the process..

I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like hta.t

It looks like only limited consolidation was happening, the number of kills invoked was more than half of the number of SetLatch. I think the reason is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately suspends the calling process and gives the signalled process a chance to run in that time slice. The Wal Writer gets woken up, sees that it only has a partial page to write and decides not to do anything, but resets the latch. It does this fast enough that the subscription worker hasn't migrated CPUs yet.

But, if the WAL Writer sees it has already flushed up to the highest eligible page boundary, why doesn't the SetAsync code see the same thing? Because it isn't flushed that high, but only written that high. It looks like a story of three git commits:

commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b

Author: Simon Riggs <simon@2ndQuadrant.com>

Date: Sun Nov 13 09:00:57 2011 +0000

Wakeup WALWriter as needed for asynchronous commit performance.

commit db76b1efbbab2441428a9ef21f7ac9ba43c52482

Author: Andres Freund <andres@anarazel.de>

Date: Mon Feb 15 22:15:35 2016 +0100

Allow SetHintBits() to succeed if the buffer's LSN is new enough.

commit 7975c5e0a992ae9a45e03d145e0d37e2b5a707f5

Author: Andres Freund <andres@anarazel.de>

Date: Mon Feb 15 23:52:38 2016 +0100

Allow the WAL writer to flush WAL at a reduced rate.

The first change made the WALWriter wake up and do a write and flush whenever an async commit occurred and there was a completed WAL page to be written. This way the hint bits could be set while the heap page was still hot, because they couldn't get set until the WAL covering the hinted-at transaction commit is flushed.

The second change said we can set hint bits before the WAL covering the hinted-at transaction is flushed, as long the page LSN is newer than that (so the wal covering the hinted-at transaction commit must be flushed before the page containing the hint bit can be written).

Then the third change makes the wal writer only write the full pages most of the times it is woken up, not flush them. It only flushes them once every wal_writer_delay or wal_writer_flush_after, regardless of how often it is woken up.

It seems like the third change rendered the first one useless. Wouldn't it better, and much simpler, just to have reverted the first change once the second one was done? If that were done, would the third change still be needed? (It did seem to add some other features as well, so I'm going to assume we still want those...).

But now the first change is even worse than useless, it is positively harmful. The only thing to stop it from waking the WALWriter for every async commit is this line:

/* if we have already flushed that far, we're done */

if (WriteRqstPtr <= LogwrtResult.Flush)

return;

Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't becomes false due to us waking walwriter, it only becomes false once the timeout expires (which it would have done anyway with no help from us), or once wal_writer_flush_after is exceeded. So now every async commit is waking the walwriter. Most of those wake up do nothing (there is not a completely new patch to write), some write one completed page but don't flush anything, and very few do a flush, and that one would have been done anyway.

So for the sake of wal_writer_flush_after, I guess the simplest thing to do would be to change repeat the flushbytes calculation and wake the walwriter only if it looks like that will be exceeded? Attached patch does that.

Maybe the computation of flushbytes should be put into a macro, rather than repeated in the code? Or even done on the fly, I don't see any reason it needs to be stored in a variable before testing.

This new patch is simpler than the previous one, and more effective at speeding up replication. I assume it would speed up pgbench with synchronous_commit turned off (or against unlogged tables) as well, but I don't think I have the hardware needed to test that.

Cheers,

Jeff