Thread: Lockless StrategyGetBuffer() clock sweep


Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
Hi,

I've previously posted a patch at
http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
that reduces contention in StrategyGetBuffer() by making the clock sweep
lockless.  Robert asked me to post it to a new thread; I originally
wrote it to see some other contention in more detail, that's why it
ended up in that thread...

The performance numbers I've quoted over there are:
> Test:
> pgbench  -M prepared -P 5 -S -c 496 -j 496 -T 5000
> on a scale=1000 database, with 4GB of shared buffers.
>
> Before:
> progress: 40.0 s, 136252.3 tps, lat 3.628 ms stddev 4.547
> progress: 45.0 s, 135049.0 tps, lat 3.660 ms stddev 4.515
> progress: 50.0 s, 135788.9 tps, lat 3.640 ms stddev 4.398
> progress: 55.0 s, 135268.4 tps, lat 3.654 ms stddev 4.469
> progress: 60.0 s, 134991.6 tps, lat 3.661 ms stddev 4.739
>
> after:
> progress: 40.0 s, 207701.1 tps, lat 2.382 ms stddev 3.018
> progress: 45.0 s, 208022.4 tps, lat 2.377 ms stddev 2.902
> progress: 50.0 s, 209187.1 tps, lat 2.364 ms stddev 2.970
> progress: 55.0 s, 206462.7 tps, lat 2.396 ms stddev 2.871
> progress: 60.0 s, 210263.8 tps, lat 2.351 ms stddev 2.914

Imo the patch doesn't complicate the logic noticeably...

I do wonder if we could make the freelist accesses lockless as well -
but I think that's harder. So I don't want to mix that with this.
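
For anyone who doesn't want to open the patch, the core idea boils down to
something like the following (illustrative sketch only, not the patch
itself; it assumes nextVictimBuffer/completePasses have been converted to
pg_atomic_uint32 and glosses over wraparound):

    static uint32
    ClockSweepTick(void)
    {
        uint32      victim;

        /*
         * Claim the next tick atomically; no lock is held, so nobody can
         * sleep while "holding" anything.
         */
        victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

        /* fold the ever-growing counter back into the buffer array */
        victim = victim % NBuffers;

        /* crude pass counting; the real handling is more careful */
        if (victim == 0)
            pg_atomic_fetch_add_u32(&StrategyControl->completePasses, 1);

        return victim;
    }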

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Lockless StrategyGetBuffer() clock sweep

From
Robert Haas
Date:
On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I've previously posted a patch at
> http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
> that reduces contention in StrategyGetBuffer() by making the clock sweep
> lockless.  Robert asked me to post it to a new thread; I originally
> wrote it to see some other contention in more detail, that's why it
> ended up in that thread...

Does this LATCHPTR_ACCESS_ONCE crap really do anything?  Why do we
need that?  I'm not convinced that it's actually solving any problem
we would otherwise have with this code.  But if it is, then how about
adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
and just checking the flag instead of checking bgwriterLatch itself?

Actually, the same approach would work for INT_ACCESS_ONCE.  So I
propose this approach instead: Add a new atomic flag to
StrategyControl, useSlowPath.  If StrategyGetBuffer() finds
useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
spinlock, check whether bgwriterLatch is set and/or the freelist is
non-empty and return a buffer ID if able to allocate from the
freelist; otherwise -1.  If StrategyGetBufferSlow() returns -1, or we
decide not to call it in the first place because useSlowPath isn't
set, then fall through to your clock-sweep logic.  We can set
useSlowPath at startup and whenever we put buffers on the free list.
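
In very rough pseudo-C (made-up names, untested), what I have in mind is:

    buf_id = -1;

    /* unlocked read; missing a very recent update here is harmless */
    if (pg_atomic_read_u32(&StrategyControl->useSlowPath) != 0)
        buf_id = StrategyGetBufferSlow();   /* takes the spinlock, sets the
                                             * bgwriter latch if needed, and
                                             * tries the freelist; -1 if the
                                             * freelist was empty */

    if (buf_id < 0)
    {
        /* fall through to your lockless clock sweep */
    }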

The lack of memory barriers while testing useSlowPath (or, in your
version, when testing the ACCESS_ONCE conditions) means that we could
fail to set the bgwriter latch or pop from the freelist if another
backend has very recently updated those things.  But that's not
actually a problem; it's fine for any individual request to skip those
things, as they are more like hints than correctness requirements.

The interaction between this and the bgreclaimer idea is interesting.
We can't make popping the freelist lockless without somehow dealing
with the resulting A-B-A problem (namely, that between the time we
read &head->next and the time we CAS the list-head to that value, the
head might have been popped, another item pushed, and the original
list head pushed again).  So even if bgreclaimer saves some work for
individual backends - avoiding the need for them to clock-sweep across
many buffers - it may not be worth it if it means taking a spinlock to
pop the freelist instead of using an atomics-driven clock sweep.
Considering that there may be a million plus buffers to scan through,
that's a surprising conclusion, but it seems to be where the data is
pointing us.
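
To spell the A-B-A hazard out, a naive lockless pop would look roughly
like this (hand-wavy pseudo-code, names invented):

    static int
    naive_freelist_pop(void)
    {
        uint32      head;
        uint32      next;

        for (;;)
        {
            head = pg_atomic_read_u32(&StrategyControl->firstFreeBuffer);
            if (head == INVALID_FREE_BUFFER)    /* made-up sentinel */
                return -1;

            next = freeNext[head];              /* made-up freelist links */

            /*
             * Between the reads above and the CAS below, "head" can be
             * popped, other buffers pushed, and "head" pushed back.  The
             * CAS still succeeds because the head value matches, but
             * "next" is stale and the list is silently corrupted.
             */
            if (pg_atomic_compare_exchange_u32(&StrategyControl->firstFreeBuffer,
                                               &head, next))
                return (int) head;
        }
    }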

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
On 2014-10-29 14:18:33 -0400, Robert Haas wrote:
> On Mon, Oct 27, 2014 at 9:32 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I've previously posted a patch at
> > http://archives.postgresql.org/message-id/20141010160020.GG6670%40alap3.anarazel.de
> > that reduces contention in StrategyGetBuffer() by making the clock sweep
> > lockless.  Robert asked me to post it to a new thread; I originally
> > wrote it to see some other contention in more detail, that's why it
> > ended up in that thread...
> 
> Does this LATCHPTR_ACCESS_ONCE crap really do anything?

Well, it forces the variable to be read from memory, instead of allowing
it to be read from a register... I'd much rather have a generic, type
independent ACCESS_ONCE macro, but I'm not aware of a way to do that
that works across compilers. It also prevents the compiler from doing
speculative writes.
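
The obvious way to write a generic one is the kernel's, but that leans on
typeof, which we can't assume in every compiler we support:

    /* linux-kernel style; requires the (gcc) __typeof__ extension */
    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *) &(x))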

> Why do we need that?  I'm not convinced that it's actually solving any
> problem we would otherwise have with this code.

I agree that in this case we can get rid of it.

> But if it is, then how about
> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
> and just checking the flag instead of checking bgwriterLatch itself?

Yea, that'd be nicer. I didn't do it because it made the patch slightly
more invasive... The complexity really is only needed because we're not
guaranteed that 64bit reads are atomic. Although we actually can be
sure, because there's no platform with nonatomic intptr_t reads - so
64bit platforms actually *do* have atomic 64bit reads/writes.

So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
don't even need to take a lock to set it. Afaics the worst that can
happen is that several processes set the latch...
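
I.e. something along these lines (untested sketch, names made up):

    /* in StrategyGetBuffer(), before entering the clock sweep */
    Latch      *bgwriterLatch = StrategyControl->bgwriterLatch;

    if (StrategyControl->setBgwriterLatch && bgwriterLatch != NULL)
    {
        /*
         * Unlocked read and reset of a plain int: the worst case is that
         * several backends see the flag and set the latch redundantly.
         */
        StrategyControl->setBgwriterLatch = 0;
        SetLatch(bgwriterLatch);
    }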

> Actually, the same approach would work for INT_ACCESS_ONCE.  So I
> propose this approach instead: Add a new atomic flag to
> StrategyControl, useSlowPath.  If StrategyGetBuffer() finds
> useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
> spinlock, check whether bgwriterLatch is set and/or the freelist is
> non-empty and return a buffer ID if able to allocate from the
> freelist; otherwise -1.  If StrategyGetBufferSlow() returns -1, or we
> decide not to call it in the first place because useSlowPath isn't
> set, then fall through to your clock-sweep logic.  We can set
> useSlowPath at stratup and whenever we put buffers on the free list.

I don't like the idea of conflating the slowpath for the bgwriter latch
with the freelist slowpath. And I don't really see what that'd buy us
over what's in the patch right now?

> The lack of memory barriers while testing useSlowPath (or, in your
> version, when testing the ACCESS_ONCE conditions) means that we could
> fail to set the bgwriter latch or pop from the freelist if another
> backend has very recently updated those things.  But that's not
> actually a problem; it's fine for any individual request to skip those
> things, as they are more like hints than correctness requirements.

Right.

> The interaction between this and the bgreclaimer idea is interesting.
> We can't make popping the freelist lockless without somehow dealing
> with the resulting A-B-A problem (namely, that between the time we
> read &head->next and the time we CAS the list-head to that value, the
> head might have been popped, another item pushed, and the original
> list head pushed again).

I think if we really feel the need to, we can circumvent the ABA problem
here. But I'm not yet convinced that there's the need to do so.  I'm
unsure that a single process that touches all the buffers at some
frequency is actually a good idea on modern NUMA systems...

I wonder if we could make a trylock for spinlocks work - then we could
look at the freelist if the lock is free and just go to the clock cycle
otherwise. My guess is that that'd be quite performant.  IIRC it's just
the spinlock semaphore fallback that doesn't know how to do trylock...
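
I.e. roughly (very much a sketch; the spinlock name is made up, and as said
the semaphore fallback would need to learn this first):

    /* hypothetical trylock on top of the existing TAS() primitive */
    #define SpinTryAcquire(lock)  (TAS(lock) == 0)

    if (StrategyControl->firstFreeBuffer >= 0 &&
        SpinTryAcquire(&StrategyControl->buffer_strategy_lock))
    {
        /* got it cheaply: pop from the freelist, SpinLockRelease(), done */
    }
    else
    {
        /* lock busy or freelist empty: just do the lockless clock sweep */
    }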


> So even if bgreclaimer saves some work for
> individual backends - avoiding the need for them to clock-sweep across
> many buffers - it may not be worth it if it means taking a spinlock to
> pop the freelist instead of using an atomics-driven clock sweep.
> Considering that there may be a million plus buffers to scan through,
> that's a surprising conclusion, but it seems to be where the data is
> pointing us.

I'm not really convinced of this yet either. It might just be that the
bgreclaimer implementation isn't good enough. But to me that doesn't
really change the fact that there's clear benefit in this patch - even
if we can make bgreclaimer beneficial, the lockless scan will be good.

My feeling is that making bg*writer* more efficient is more likely to be
beneficial overall than introducing bgreclaimer. I've seen the clock
sweeps really hurt in cases where many many buffers are dirty.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Lockless StrategyGetBuffer() clock sweep

From
Amit Kapila
Date:
On Thu, Oct 30, 2014 at 12:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-29 14:18:33 -0400, Robert Haas wrote:
>
> > The interaction between this and the bgreclaimer idea is interesting.
> > We can't make popping the freelist lockless without somehow dealing
> > with the resulting A-B-A problem (namely, that between the time we
> > read &head->next and the time we CAS the list-head to that value, the
> > head might have been popped, another item pushed, and the original
> > list head pushed again).
>
> I think if we really feel the need to, we can circumvent the ABA problem
> here. But I'm not yet convinced that there's the need to do so.  I'm
> unsure that a single process that touches all the buffers at some
> frequency is actually a good idea on modern NUMA systems...
>
> I wonder if we could make a trylock for spinlocks work - then we could
> look at the freelist if the lock is free and just go to the clock cycle
> otherwise. My guess is that that'd be quite performant.  IIRC it's just
> the spinlock semaphore fallback that doesn't know how to do trylock...
>
>
> > So even if bgreclaimer saves some work for
> > individual backends - avoiding the need for them to clock-sweep across
> > many buffers - it may not be worth it if it means taking a spinlock to
> > pop the freelist instead of using an atomics-driven clock sweep.
> > Considering that there may be a million plus buffers to scan through,
> > that's a surprising conclusion, but it seems to be where the data is
> > pointing us.
>
> I'm not really convinced of this yet either. It might just be that the
> bgreclaimer implementation isn't good enough. But to me that doesn't
> really change the fact that there's clear benefit in this patch - 

I have a feeling that this might also show some regression at higher
loads (like scale_factor = 5000, shared_buffers = 8GB,
client_count = 128, 256) for reasons similar to the bgreclaimer patch:
although both reduce contention around the spinlock, the contention
just moves somewhere else.  I have yet to take data before concluding
anything (I am just waiting for your other patch (wait-free LW_SHARED)
to be committed).
I think once wait-free LW_SHARED is in, we can evaluate this patch
(if required, maybe we can also see whether there is any interaction
between this and bgreclaimer).  However, if you want, I think this can
be done even separately from the wait-free LW_SHARED patch.

> even
> if we can make bgreclaimer beneficial, the lockless scan will be good.
>
> My feeling is that making bg*writer* more efficient is more likely to be
> beneficial overall than introducing bgreclaimer. 

One idea could be to make bgwriter something similar to the autovacuum
launcher, which means that bgwriter could launch different workers based
on the kind of need.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:
> I have a feeling that this might also show some regression at higher
> loads (like scale_factor = 5000, shared_buffers = 8GB,
> client_count = 128, 256) for reasons similar to the bgreclaimer patch:
> although both reduce contention around the spinlock, the contention
> just moves somewhere else.  I have yet to take data before concluding
> anything (I am just waiting for your other patch (wait-free LW_SHARED)
> to be committed).

I have a hard time seeing how this could be. In the uncontended case the
number of cachelines touched and the number of atomic operations are
exactly the same. In the contended case the new implementation does far
fewer atomic ops - and doesn't do spinning.

What's your theory?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Lockless StrategyGetBuffer() clock sweep

From
Robert Haas
Date:
On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> But if it is, then how about
>> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> and just checking the flag instead of checking bgwriterLatch itself?
>
> Yea, that'd be nicer. I didn't do it because it made the patch slightly
> more invasive... The complexity really is only needed because we're not
> guaranteed that 64bit reads are atomic. Although we actually can be
> sure, because there's no platform with nonatomic intptr_t reads - so
> 64bit platforms actually *do* have atomic 64bit reads/writes.
>
> So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
> don't even need to take a lock to set it. Afaics the worst that can
> happen is that several processes set the latch...

OK, that design is fine with me.

>> Actually, the same approach would work for INT_ACCESS_ONCE.  So I
>> propose this approach instead: Add a new atomic flag to
>> StrategyControl, useSlowPath.  If StrategyGetBuffer() finds
>> useSlowPath set, call StrategyGetBufferSlow(), which will acquire the
>> spinlock, check whether bgwriterLatch is set and/or the freelist is
>> non-empty and return a buffer ID if able to allocate from the
>> freelist; otherwise -1.  If StrategyGetBufferSlow() returns -1, or we
>> decide not to call it in the first place because useSlowPath isn't
>> set, then fall through to your clock-sweep logic.  We can set
>> useSlowPath at startup and whenever we put buffers on the free list.
>
> I don't like the idea of conflating the slowpath for the bgwriter latch
> with the freelist slowpath. And I don't really see what that'd buy us
> over what's in the patch right now?

Well, I was thinking that with a single flag check we could skip two
paths that, as of today, are both uncommonly taken.  Moving all that
logic off into a separate function might help the compiler optimize
for the fast case, too.

> I wonder if we could make a trylock for spinlocks work - then we could
> look at the freelist if the lock is free and just go to the clock cycle
> otherwise. My guess is that that'd be quite performant.  IIRC it's just
> the spinlock semaphore fallback that doesn't know how to do trylock...

I could go for that.  I don't think the semaphore thing is really a
problem.  If the spinlock semaphore implementation indeed can't
support that, then just have it fall back to waiting for the lock and
always returning true.  Semaphores are so slow that it's not going to
make any difference to performance.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lockless StrategyGetBuffer() clock sweep

From
Amit Kapila
Date:
On Thu, Oct 30, 2014 at 5:01 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> On 2014-10-30 10:23:56 +0530, Amit Kapila wrote:
> > I have a feeling that this might also show some regression at higher
> > loads (like scale_factor = 5000, shared_buffers = 8GB,
> > client_count = 128, 256) for reasons similar to the bgreclaimer patch:
> > although both reduce contention around the spinlock, the contention
> > just moves somewhere else.  I have yet to take data before concluding
> > anything (I am just waiting for your other patch (wait-free LW_SHARED)
> > to be committed).
>
> I have a hard time seeing how this could be. In the uncontended case the
> number of cachelines touched and the number of atomic operations are
> exactly the same. In the contended case the new implementation does far
> fewer atomic ops - and doesn't do spinning.
>
> What's your theory?

I have observed that once we reduce the contention in one path, it doesn't
always lead to a performance/scalability gain; the contention rather shifts
to another lock if one exists.  This is the reason why we have to work on
reducing contention around both the BufFreelist lock and the buffer mapping
locks together.  I have taken some performance data, and it seems this patch
exhibits behaviour similar to the bgreclaimer patch; I believe resolving
contention around dynahash can improve the situation (Robert's chash patch
can be helpful).


Performance Data
------------------------------
Configuration and Db Details
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
max_connections = 300
shared_buffers = 8GB
checkpoint_segments = 30
checkpoint_timeout = 15min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5mins
Test mode - pgbench readonly (-M prepared)

Data below is median of 3 runs, for individual run data check document
attached with this mail.

Scale_Factor = 1000

Patch_ver/Client_Count      128       256
HEAD                     265502    201689
Patch                    283448    224888


Scale_Factor = 5000

Patch_ver/Client_Count      128       256
HEAD                     190435    177477
Patch                    171485    167794

The above data indicates that there is a performance gain at scale factor
1000; however, there is a regression at scale factor 5000.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachments

Re: Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> But if it is, then how about
> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
> >> and just checking the flag instead of checking bgwriterLatch itself?
> >
> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
> > more invasive... The complexity really is only needed because we're not
> > guaranteed that 64bit reads are atomic. Although we actually can be
> > sure, because there's no platform with nonatomic intptr_t reads - so
> > 64bit platforms actually *do* have atomic 64bit reads/writes.
> >
> > So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
> > don't even need to take a lock to set it. Afaics the worst that can
> > happen is that several processes set the latch...
> 
> OK, that design is fine with me.

There's a slight problem with this, namely restarts of bgwriter. If it
crashes, the reference to the relevant latch will currently be reset via
StrategyNotifyBgWriter(). In reality that's not a problem because
sizeof(void*) writes are always atomic, but currently we don't assume
that. We'd sometimes write to an old latch, but that's harmless because
they're never deallocated.

So I can see several solutions right now:
1) Redefine our requirements so that aligned sizeof(void*) writes are
   always atomic. That doesn't affect any currently supported platform
   and won't ever affect any future platform either. E.g. linux has had
   that requirement for a long time.
2) Use an explicitly defined latch for the bgwriter instead of using the
   PGPROC->procLatch. That way it never has to be reset and all
   SetLatch()s will eventually go to the right process.
3) Continue requiring the spinlock when setting the latch.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Lockless StrategyGetBuffer() clock sweep

From
Robert Haas
Date:
On Mon, Dec 8, 2014 at 2:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-10-30 07:55:08 -0400, Robert Haas wrote:
>> On Wed, Oct 29, 2014 at 3:09 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> >> But if it is, then how about
>> >> adding a flag that is 4 bytes wide or less alongside bgwriterLatch,
>> >> and just checking the flag instead of checking bgwriterLatch itself?
>> >
>> > Yea, that'd be nicer. I didn't do it because it made the patch slightly
>> > more invasive... The complexity really is only needed because we're not
>> > guaranteed that 64bit reads are atomic. Although we actually can be
>> > sure, because there's no platform with nonatomic intptr_t reads - so
>> > 64bit platforms actually *do* have atomic 64bit reads/writes.
>> >
>> > So if we just have an integer 'setBgwriterLatch' somewhere we're good. We
>> > don't even need to take a lock to set it. Afaics the worst that can
>> > happen is that several processes set the latch...
>>
>> OK, that design is fine with me.
>
> There's a slight problem with this, namely restarts of bgwriter. If it
> crashes, the reference to the relevant latch will currently be reset via
> StrategyNotifyBgWriter(). In reality that's not a problem because
> sizeof(void*) writes are always atomic, but currently we don't assume
> that. We'd sometimes write to an old latch, but that's harmless because
> they're never deallocated.
>
> So I can see several solutions right now:
> 1) Redefine our requirements so that aligned sizeof(void*) writes are
>    always atomic. That doesn't affect any currently supported platform
>    and won't ever affect any future platform either. E.g. linux has had
>    that requirement for a long time.
> 2) Use an explicitly defined latch for the bgwriter instead of using the
>    PGPROC->procLatch. That way it never has to be reset and all
>    SetLatch()s will eventually go to the right process.
> 3) Continue requiring the spinlock when setting the latch.

Maybe you could store the pgprocno instead of the PROC *.
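
I.e. something like this (sketch only, field name invented):

    /* bgwprocno is a plain int, read and reset without the spinlock */
    int         bgwprocno = StrategyControl->bgwprocno;

    if (bgwprocno != -1)
    {
        /* reset first, so we don't wake the bgwriter over and over */
        StrategyControl->bgwprocno = -1;
        SetLatch(&ProcGlobal->allProcs[bgwprocno].procLatch);
    }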

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
On 2014-12-10 21:52:17 -0500, Robert Haas wrote:
> Maybe you could store the pgprocno instead of the PROC *.

That's a good idea. Here's a patch implementing that and other things.

Changes:
* The handling of wraparound is slightly changed. There's a separate path
  for the case where nextVictimBuffer is above NBuffers (roughly as
  sketched below). That allows us a) to avoid the somewhat expensive
  modulo operation in the common case and b) to always get consistent
  results for StrategySyncStart().
* StrategySyncStart() doesn't have a situation in which it can return
  inaccurate information anymore. That could actually trigger an
  assertion in bgwriter.
* There was a bug because the local victim variable was signed instead
  of unsigned.
* Clock sweep ticks are moved into a separate routine.
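
The wraparound path now looks roughly like this (simplified sketch, not
the patch verbatim; buffer_strategy_lock stands in for whatever spinlock
ends up protecting completePasses):

    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    if (victim >= (uint32) NBuffers)
    {
        uint32      expected = victim + 1;
        uint32      wrapped = expected % NBuffers;

        victim = victim % NBuffers;

        /*
         * Fold the counter back into [0, NBuffers) under the spinlock, so
         * nextVictimBuffer and completePasses move together and
         * StrategySyncStart() always sees a consistent pair.  If the CAS
         * fails, either somebody else already wrapped the counter or the
         * hand has simply moved on; both are fine.
         */
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        if (pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                           &expected, wrapped))
            StrategyControl->completePasses++;
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
    }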

Comments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: Lockless StrategyGetBuffer() clock sweep

From
Robert Haas
Date:
On Tue, Dec 23, 2014 at 3:30 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-12-10 21:52:17 -0500, Robert Haas wrote:
>> Maybe you could store the pgprocno instead of the PROC *.
>
> That's a good idea. Here's a patch implementing that and other things.

Thanks.  Cool.

> Changes:
> * The handling of wraparound is slightly changed. There's a separate path
>   for the case where nextVictimBuffer is above NBuffers. That allows us
>   a) to avoid the somewhat expensive modulo operation in the common case
>   and b) to always get consistent results for StrategySyncStart().
> * StrategySyncStart() doesn't have a situation in which it can return
>   inaccurate information anymore. That could actually trigger an
>   assertion in bgwriter.
> * There was a bug because the local victim variable was signed instead
>   of unsigned.
> * Clock sweep ticks are moved into a separate routine.
>
> Comments?

You need to spell-check your commit message.  My spell-checker
complains about acquiration, aleviated, aleviates, and clocksweep.
The first is not a word at all; perhaps you meant acquisition.  The
next two, I believe, need a double-l rather than a single-l.
"acquiration" also appears in the text of the patch, along with
"wether" (should be "whether").  And the last one should be two words.

I don't think I have anything to say about the substance of the patch.
If fetch-and-add is faster than a spinlock cycle, then it is.  And
it's good to be fast.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Lockless StrategyGetBuffer() clock sweep

From
Andres Freund
Date:
On 2014-12-23 16:42:41 -0500, Robert Haas wrote:
> I don't think I have anything to say about the substance of the patch.
> If fetch-and-add is faster than a spinlock cycle, then it is.  And
> it's good to be fast.

I don't think the primary advantage is that it's fast (even though it
should be as fast as a single TAS on x86). It's that you can never sleep
while holding the spinlock when there's no such spinlock, and that every
time you transfer the cacheline from another CPU to you, you'll also
make progress...

Will fix the other stuff.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services