Reducing spinlock acquisition within clock sweep loop

From:
Merlin Moncure
Date:
Background:
The main source of contention, buffer_strategy_lock, has been removed
FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
locks the buffer header via spinlock in order to adjust
usage_count.  I believe that removing this lock is possible
optimization fruit.  I doubt that this is a major contention point for
most workloads (over the years, I've only seen one or two cases where
I suspected clock sweep was getting gummed up by buffer locks) but for
systems that are already under lock duress it's hard to argue that
spamming spinlocks in a tight loop makes things easier for the
hardware that is protecting cachelines.  In short, clock sweep heavy
workloads are issuing a lot of atomic ops.  Any savings here is good.

Right now usage_count ranges across a small number (currently 0-5),
incrementing upon allocation and decrementing on sweep.  Because of
this, each tick of the clock sweep spinlocks the buffer in preparation
for A. the buffer being returned or B. the buffer having a non-zero
usage count so that it can be safely decremented.  B is potential
optimization fruit, I think.

The idea:
Instead, why not let usage_count rotate around on its own in a
clock-like fashion?  The idea is that, instead of defining usage count
as a fixed offset from zero, it is based on the number of times the
buffer has been touched since the last sweep.  In other words, each
time the buffer clock crosses midnight, 'usage count' for the whole
pool drops by one, implicitly. Each time a buffer is touched, it's
usage_count is set to 'completePasses' + 1, but, just like today, is
never allowed to increment past completePasses + 5.  So, what used to
be usage_count becomes defined as the distance between the usage_count
and completePasses.

Because usage_count now tracks rotation, calculating what used to be
usage_count requires a bit of logic to arrive at the right number as
rotation happens.  I'm pretty sure all the rotation edge cases can be
handled but I'm glossing over them for now.
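
For illustration, here is a minimal sketch of that distance
calculation, assuming a plain uint32 stamp; the names (touch_stamp,
effective_usage_count) are mine for the sketch, not proposed code:

#include <stdint.h>

#define BM_MAX_USAGE_COUNT 5

/*
 * Sketch only: touch_stamp is what usage_count would hold under this
 * scheme -- the completePasses value observed at the last touch, plus
 * the bump, capped at completePasses + 5.
 */
static inline uint32_t
effective_usage_count(uint32_t touch_stamp, uint32_t complete_passes)
{
    uint32_t distance = touch_stamp - complete_passes; /* modular */

    /*
     * A stamp the sweep has already passed wraps to a huge unsigned
     * value; anything outside 1..5 means "not recently used".
     */
    if (distance > BM_MAX_USAGE_COUNT)
        return 0;
    return distance;
}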

Clock sweep buffer acquisition currently looks like this:
loop:
tick();
spinlock();
in use? goto loop
usage_count > 0? yes: decrement, goto loop
return buffer;

Proposed buffer acquisition would look like:
loop:
tick();
victim < last_victim? yes: atomic re-fetch completePasses (see below)
usage_count > completePasses? yes: goto loop
try_spinlock(); no lock? goto loop
usage_count > completePasses? yes: goto loop
in use? goto loop
return buffer;

(try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
count is double checked after lock acquisition in case the local copy
is stale.
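
Spelled out as C, the proposed loop might look like the sketch below;
every name in it (clock_tick, try_lock_buffer, BufferSketch, ...) is
an assumed stand-in for the real buffer-manager machinery, and
effective_usage_count() is the sketch from above.  Atomicity of the
usage_stamp read is glossed over:

typedef struct BufferSketch
{
    uint32_t usage_stamp;   /* completePasses at last touch, plus bump */
    uint32_t refcount;
} BufferSketch;

extern BufferSketch buffers[];
extern uint32_t clock_tick(void);           /* advance hand, return victim */
extern uint32_t read_complete_passes(void); /* atomic read */
extern int  try_lock_buffer(BufferSketch *buf); /* TAS, no spinning */
extern void unlock_buffer(BufferSketch *buf);

static BufferSketch *
sweep_for_victim(void)
{
    static uint32_t last_victim;
    static uint32_t complete_passes;    /* local, possibly stale copy */

    for (;;)
    {
        uint32_t victim = clock_tick();
        BufferSketch *buf = &buffers[victim];

        /* hand wrapped: clock crossed midnight, re-fetch pass count */
        if (victim < last_victim)
            complete_passes = read_complete_passes();
        last_victim = victim;

        /* unlocked pre-check: skip recently used buffers cheaply */
        if (effective_usage_count(buf->usage_stamp, complete_passes) > 0)
            continue;

        /* contended header?  hop over it rather than spinning */
        if (!try_lock_buffer(buf))
            continue;

        /* re-check under the lock in case the local view was stale */
        if (effective_usage_count(buf->usage_stamp, complete_passes) > 0 ||
            buf->refcount > 0)
        {
            unlock_buffer(buf);
            continue;
        }

        return buf;             /* header lock still held */
    }
}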

Bad stuff:
1. Complexity. Is it worth it?  Calculating proper 'usage_count'
distance in the face of two rotating numbers requires some clever
coding.

2. Unfortunately, since we've (for very good reasons) got a clock
sweep that can be concurrently engaged by many actors, it's no longer
safe to assume that completePasses can be safely examined outside of
the sweep loop but must instead be checked with every tick.  Doing this
with another atomic read would defeat the entire purpose of the patch,
but perhaps we can check it based on a heuristic: it's examined when
victim is less than the last_seen victim (which is locally saved off).

...which would add up to a tiny (zero?) risk of misjudging
usage_distance.  The fallout, if it were to happen, would be to
possibly return a buffer that would have had a usage_count = 1 under
the old method.

3. Proper usage_count, aka distance, is managed relative to
completePasses, which slightly changes the semantics: the entire
buffer pool drops in usage_count at once rather than buffer by buffer.
I'm doubtful this matters though.

Just spitballing here -- wondering if the idea has legs.  I think this
should be compatible with Robert's usage_count tweaking with flags, FWICT.
I'm inclined to work up a POC patch unless this is unworkable for some
reason.  The payoff, for workloads with high sweep activity, would be
drastically reduced spinlock activity and associated cache line
stress.

merlin



Re: Reducing spinlock acquisition within clock sweep loop

From:
Andres Freund
Date:
On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote:
> The main source of contention, buffer_strategy_lock, has been removed
> FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
> locks the buffer header via spinlock in order to adjust
> usage_count.

FWIW, I think the best approach to achieve this is to make most buffer
header manipulations lockless. It's not trivial but quite possible. I
had a very rough POC a while back ([1]); that only degenerated to
something spinning for a couple edge cases (!BM_VALID ||
BM_PIN_COUNT_WAITER IIRC).  That'd not necessarily reduce the number of
atomic ops, but making things lockless does away with the problem of sleeping
while holding a spinlock etc.

The primary use case for that is less the clock sweep than things like
index root pages and small lookup pages. I've seen several times that the
spinlock can really hurt there.


I do believe that we can't continue with our current clock sweep
approach for much longer. It's already extremely expensive and even if
we fix some of the most glaring problems (as discussed nearby atm), it's
still a horribly cache/runtime inefficient way to do things in many
workloads.

> Right now usage_count ranges across a small number (currently 0-5),
> incrementing upon allocation and decrementing on sweep.  Because of
> this, each tick of the clock sweep spinlocks the buffer in preparation
> for A. the buffer being returned or B. the buffer having a non-zero
> usage count so that it can be safely decremented.  B is potential
> optimization fruit, I think.

> The idea:
> Instead, why not let usage_count rotate around on its own in a
> clock-like fashion?  The idea is that, instead of defining usage count
> as a fixed offset from zero, it is based on the number of times the
> buffer has been touched since the last sweep.  In other words, each
> time the buffer clock crosses midnight, 'usage count' for the whole
> pool drops by one, implicitly. Each time a buffer is touched, its
> usage_count is set to 'completePasses' + 1, but, just like today, is
> never allowed to increment past completePasses + 5.  So, what used to
> be usage_count becomes defined as the distance between the usage_count
> and completePasses.
>
> Because usage_count now tracks rotation, calculating what used to be
> usage_count requires a bit of logic to arrive at the right number as
> rotation happens.  I'm pretty sure all the rotation edge cases can be
> handled but I'm glossing over them for now.

> Clock sweep buffer acquisition currently looks like this:
> loop:
> tick();
> spinlock();
> in use? goto loop
> usage_count > 0? yes: decrement, goto loop
> return buffer;
>
> Proposed buffer acquisition would look like:
> loop:
> tick();
> victim < last_victim? yes: atomic re-fetch completePasses (see below)
> usage_count > completePasses? yes: goto loop
> try_spinlock(); no lock? goto loop
> usage_count > completePasses? yes: goto loop
> in use? goto loop
> return buffer;
>
> (try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
> count is double checked after lock acquisition in case the local copy
> is stale.

I don't really see what this improves? You could do the "try spinlock"
bit just as well with the old algorithm and I don't really see an
accuracy benefit in the proposed approach? Maybe I'm missing something
entirely?


[1]: The basic trick I have in mind is that by putting flags, usagecount
and refcount into one integer (in separate bits) it is possible to do
CAS loops to manipulate each of those. E.g. to decrement usagecount you
can do something roughly like

do
{
    val = pg_atomic_read_u32(desc->flag_and_counts);
    if (BufferDescUsageCount(val) == 0)
       break;

    newval = BufferDescChangeUsageCount(val, -1);
}
while(!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
                                     &val, newval))

Where BufferDescUsageCount just does some bit masking and shifting to
manipulate the right part. Same for refcount and most flags.

To deal with changing tags I'd introduced a flag value that was used to
say 'you have to spinlock this time'.
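
To make the masking concrete, here is a fuller sketch along those
lines; the field widths, mask names, and struct are illustrative
assumptions layered on the fragment above, not the actual POC (the
pg_atomic_* calls are the real atomics primitives):

#include "postgres.h"
#include "port/atomics.h"

/*
 * Illustrative packing only:
 *   bits  0-17: refcount
 *   bits 18-21: usagecount
 *   bits 22-31: flags (one of them a "you must spinlock" marker)
 */
#define BUF_USAGECOUNT_SHIFT 18
#define BUF_USAGECOUNT_MASK (0xFU << BUF_USAGECOUNT_SHIFT)

typedef struct BufferDescSketch
{
    pg_atomic_uint32 flag_and_counts;  /* flags | usagecount | refcount */
} BufferDescSketch;

static inline uint32
BufferDescUsageCount(uint32 val)
{
    return (val & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT;
}

static inline uint32
BufferDescChangeUsageCount(uint32 val, int delta)
{
    uint32 count = BufferDescUsageCount(val) + delta;

    return (val & ~BUF_USAGECOUNT_MASK) | (count << BUF_USAGECOUNT_SHIFT);
}

/* CAS loop decrementing usagecount, as in the fragment above. */
static void
BufferDescDecrementUsage(BufferDescSketch *desc)
{
    uint32 val = pg_atomic_read_u32(&desc->flag_and_counts);
    uint32 newval;

    do
    {
        if (BufferDescUsageCount(val) == 0)
            break;
        newval = BufferDescChangeUsageCount(val, -1);
    } while (!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
                                             &val, newval));
}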

Greetings,

Andres Freund



Re: Reducing spinlock acquisition within clock sweep loop

From:
Merlin Moncure
Date:
On Tue, Apr 21, 2015 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote:
>> The main source of contention, buffer_strategy_lock, has been removed
>> FWICT via 5d7962c6 and d72731a7.  However, during the sweep each tick
>> locks the buffer header via spinlock in order to adjust
>> usage_count.
>
> FWIW, I think the best approach to achieve this is to make most buffer
> header manipulations lockless. It's not trivial but quite possible. I
> had a very rough POC a while back ([1]); that only degenerated to
> something spinning for a couple edge cases (!BM_VALID ||
> BM_PIN_COUNT_WAITER IIRC).  That'd not necessarily reduce the number of
> atomic ops, but making things lockless does away with the problem of sleeping
> while holding a spinlock etc.

Right.  Lockless doesn't necessarily translate to better performance,
though, if you have to issue more atomic ops to achieve it.

> The primary use case for that is less the clock sweep than things like
> index root pages and small lookup pages. I've seen several times that the
> spinlock can really hurt there.

Yeah.  I'm trying to address that basically...see below.

>> Clock sweep buffer acquisition currently looks like this:
>> loop:
>> tick();
>> spinlock();
>> in use? goto loop
>> usage_count > 0? yes: decrement, goto loop
>> return buffer;
>>
>> Proposed buffer acquisition would look like:
>> loop:
>> tick();
>> victim < last_victim? yes: atomic re-fetch completePasses (see below)
>> usage_count > completePasses? yes: goto loop
>> try_spinlock(); no lock? goto loop
>> usage_count > completePasses? yes: goto loop
>> in use? goto loop
>> return buffer;
>>
>> (try_spinlock is an adorned TAS() as opposed to a full spin).  Usage
>> count is double checked after lock acquisition in case the local copy
>> is stale.
>
> I don't really see what this improves? You could do the "try spinlock"
> bit just as well with the old algorithm and I don't really see an
> accuracy benefit in the proposed approach? Maybe I'm missing something
> entirely?

Yes: the 'try lock' isn't necessary here, I just threw it in. It's
been suggested a few times in the past...I see no useful reason why the
clocksweep (using the current algorithm) should sit and spin.
Contended buffers should be hopped over.  The real win is in
eliminating the spinlock during clock tick except for buffers that are
real candidates for being returned.  refcount is a different problem
with different characteristics; it has to be 100% accurate while
usage_count does not, something which can be exploited (previously I
suggested simply removing all locking around it, but I'm discarding
that idea as there's no way to really prove it works well in unusual
circumstances).

> [1]: The basic trick I have in mind is that by putting flags, usagecount
> and refcount into one integer (in separate bits) it is possible to do
> CAS loops to manipulate each of those. E.g. to decrement usagecount you
> can do something roughly like
>
> do
> {
>     val = pg_atomic_read_u32(desc->flag_and_counts);
>     if (BufferDescUsageCount(val) == 0)
>        break;
>
>     newval = BufferDescChangeUsageCount(val, -1);
> }
> while(!pg_atomic_compare_exchange_u32(&desc->flag_and_counts,
>                                      &val, newval))
>
> Where BufferDescUsageCount just does some bit masking and shifting to
> manipulate the right part. Same for refcount and most flags.
>
> To deal with changing tags I'd introduced a flag value that was used to
> say 'you have to spinlock this time'.

Interesting.  I'm not sure that these ideas are contradictory,
especially since refcount and usage_count are not always manipulated
at the same time.

Maybe you can get the best of both worlds: lockless pin/unpin with a
lock-free sweep.  You'd have a fancy 'BufferDescIncreaseUsageCount',
but no decrease; completePasses (or some derivative of it) increases
to give you the 'decrease', which is implicit as the clock sweeps
around.  What I'm proposing would reduce the atomic ops from the
current 'atomic increment + spinlock' to a simple 'increment', plus
the occasional atomic read to re-fetch completePasses.

OTOH, one tricky bit I didn't mention is that inferring usage_count
from an always-incrementing value and completePasses would require the
'increment' routine to have knowledge of completePasses, where
previously it didn't need it to support the 'max 5' check.  That
requires an atomic read, although perhaps not on every pin; a local
copy would mostly do.
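
A hedged sketch of that bump routine follows; usage_stamp,
local_passes, and the function name are illustrative assumptions (the
pg_atomic_* calls are the real primitives).  local_passes is the
backend-local copy; if it's stale, the worst case is under-weighting
the buffer:

static void
BufferDescBumpUsage(pg_atomic_uint32 *usage_stamp, uint32 local_passes)
{
    uint32 stamp = pg_atomic_read_u32(usage_stamp);

    for (;;)
    {
        uint32 dist = stamp - local_passes;    /* modular distance */

        if (dist > 5)
            dist = 0;           /* wrapped or already-swept stamp */
        if (dist == 5)
            break;              /* already at the completePasses + 5 cap */

        if (pg_atomic_compare_exchange_u32(usage_stamp, &stamp,
                                           local_passes + dist + 1))
            break;              /* success: no spinlock taken */
        /* CAS failed: stamp was refreshed by the CAS, retry */
    }
}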

merlin