Обсуждение: Reducing spinlock acquisition within clock sweep loop
Background: The main sources of contention, buffer_strategy_lock, has been removed FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick locks the buffer header via spinlock in order to to adjust usage_count. I believe that removing this lock is possible optimization fruit. I doubt that this is a major contention point for most workloads (over the years, I've only seen one or two cases where I suspected clock sweep was getting gummed up by buffer locks) but for systems that are already under lock duress it's hard to argue that spamming spinlocks in a tight loop makes things easier for the hardware that is protecting cachelines. In short, clock sweep heavy workloads are issuing a lot of atomic ops. Any savings here is good. Right now usage_count ranges across a small number (currently 0-5), incrementing upon allocation and decrementing on sweep. Because of this, each tick of the clock sweep spinlocks the buffer in preparation of A. the buffer being returned or B. the buffer having a non zero usage count so that it can be safely decremented. B is potential optimization fruit, I think. The idea: Instead, why not let usage_count rotate around on it's own in a clock-like fashion? the idea is that, instead of defining usage count as a fixed offset from zero, it is based on the number of times the buffer has been touched since the last sweep. In other words, each time the buffer clock crosses midnight, 'usage count' for the whole pool drops by one, implicitly. Each time a buffer is touched, it's usage_count is set to 'completePasses' + 1, but, just like today, is never allowed to increment past completePasses + 5. So, what used to be usage_count becomes defined as the distance between the usage_count and completePasses. Because of usage_count tracking rotation, calculating what used to be usage_count requires a bit of logic to arrive at the right number as rotation happens. I'm pretty sure all the rotation edge cases can be handled but I'm glossing over them for now. Clock sweep buffer acquisition currently looks like this: loop: tick(); spinlock(); in use? goto loop usage_count > 0? yes: decrement, goto loop return buffer; Proposed buffer acquisition would look like: loop: tick(); victim < last_victim? yes: atomic re-fetch completePasses (see below) usage_count > completePasses? yes: goto loop try_spinlock(); no lock? goto loop usage_count > completePasses? yes: goto loop in use? goto loop return buffer; (try_spinlock is an adorned TAS() as a opposed to a full spin). Usage count is double checked after lock acquisition in case the local copy is stale. Bad stuff: 1. Complexity. Is it worth it? Calculating proper 'usage_count' distance in the face of two rotating numbers requires some clever coding. 2. Unfortunately, since we've (for very good reasons) got a clock sweep that can be concurrently engaged by many actors, it's no longer safe to assume that completePasses can be safely examined outside of the sweep loop but must intead be checked with every tick. Doing this with another atomic read would defeat the entire purpose of the patch, but perhaps we can check it based on a heuristic: it's examined when victim is less than the last_seen victim (which is locally saved off). ...which would add up to a tiny (zero?) risk of misjudging usage_distance. The fallout, if it were to happen, would be to possibly to return a buffer that would have a usage_count = 1 with the old method. 3. Since proper usage_count, aka distance is managed by being relative to completePasses, which slightly changes the semantics due to the fact the entire buffer pool drop in terms of usage_count at once. I'm doubtful this matters though. Just spitballing here -- wondering if the idea has legs. I think this should compatible with Robert's usage_count tweaking with flags FWICT. I'm inclined to work up a POC patch unless this is unworkable for some reason. The payoff, for workloads with high sweep activity, would be drastically reduced spinlock activity and associated cache line stress. merlin
On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote: > The main sources of contention, buffer_strategy_lock, has been removed > FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick > locks the buffer header via spinlock in order to to adjust > usage_count. FWIW, I think the best approach to achieve this is to make most buffer header manipulations lockless. It's not trivial but quite possible. I'd a very rough POC a while back ([1]); that only degenerated to something spinning for a couple edge cases (!BM_VALID || BM_PIN_COUNT_WAITER IIRC). That'd not necessarily reduce the number of atomic ops, but making things lockless makes does away with the problem of sleeping while holding a spinlock etc. The primary use case for that is less the clock sweep than things index root pages and small lookup pages. I've seen several times that the spinlock can really hurt there. I do believe that we can't continue with our current clock sweep approach for much longer. It's already extremely expensive and even if we fix some of the most glaring problems (as discussed nearby atm), it's still a horribly cache/runtime inefficient way to do things in many workloads. > Right now usage_count ranges across a small number (currently 0-5), > incrementing upon allocation and decrementing on sweep. Because of > this, each tick of the clock sweep spinlocks the buffer in preparation > of A. the buffer being returned or B. the buffer having a non zero > usage count so that it can be safely decremented. B is potential > optimization fruit, I think. > The idea: > Instead, why not let usage_count rotate around on it's own in a > clock-like fashion? the idea is that, instead of defining usage count > as a fixed offset from zero, it is based on the number of times the > buffer has been touched since the last sweep. In other words, each > time the buffer clock crosses midnight, 'usage count' for the whole > pool drops by one, implicitly. Each time a buffer is touched, it's > usage_count is set to 'completePasses' + 1, but, just like today, is > never allowed to increment past completePasses + 5. So, what used to > be usage_count becomes defined as the distance between the usage_count > and completePasses. > > Because of usage_count tracking rotation, calculating what used to be > usage_count requires a bit of logic to arrive at the right number as > rotation happens. I'm pretty sure all the rotation edge cases can be > handled but I'm glossing over them for now. > Clock sweep buffer acquisition currently looks like this: > loop: > tick(); > spinlock(); > in use? goto loop > usage_count > 0? yes: decrement, goto loop > return buffer; > > Proposed buffer acquisition would look like: > loop: > tick(); > victim < last_victim? yes: atomic re-fetch completePasses (see below) > usage_count > completePasses? yes: goto loop > try_spinlock(); no lock? goto loop > usage_count > completePasses? yes: goto loop > in use? goto loop > return buffer; > > (try_spinlock is an adorned TAS() as a opposed to a full spin). Usage > count is double checked after lock acquisition in case the local copy > is stale. I don't really see what this improves? You could do the "try spinlock" bit just as well with the old algorithm and I don't really see a accuracy benefit in the proposed approach? Maybe I'm missing something entirely? [1]: The basic trick I have in mind is that by putting flags, usagecount and refcount into one integer (in separate bits) it is possible to do CAS loops to manipulate each of those. E.g. to decrement usagecount you can do something roughly like do { val = pg_atomic_read_u32(desc->flag_and_counts); if (BufferDescUsageCount(val) == 0) break; newval = BufferDescChangeUsageCount(val, -1); } while(!pg_atomic_compare_exchange_u32(&desc->flag_and_counts, &val, newval)) Where BufferDescUsageCount just to some bit masking and shifting to manipulate the right part. Same for refcount and most flags. To deal with changing tags I'd introduced a flag value that was used to say 'you have to spinlock this time'. Greetings, Andres Freund
On Tue, Apr 21, 2015 at 3:25 PM, Andres Freund <andres@anarazel.de> wrote: > On 2015-04-21 14:46:04 -0500, Merlin Moncure wrote: >> The main sources of contention, buffer_strategy_lock, has been removed >> FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick >> locks the buffer header via spinlock in order to to adjust >> usage_count. > > FWIW, I think the best approach to achieve this is to make most buffer > header manipulations lockless. It's not trivial but quite possible. I'd > a very rough POC a while back ([1]); that only degenerated to something > spinning for a couple edge cases (!BM_VALID || BM_PIN_COUNT_WAITER > IIRC). That'd not necessarily reduce the number of atomic ops, but > making things lockless makes does away with the problem of sleeping > while holding a spinlock etc. Right. lockless doesn't necessarily translate to better performance though if you have to increase atomic ops to get away with it. > The primary use case for that is less the clock sweep than things index > root pages and small lookup pages. I've seen several times that the > spinlock can really hurt there. Yeah. I'm trying to address that basically...see below. >> Clock sweep buffer acquisition currently looks like this: >> loop: >> tick(); >> spinlock(); >> in use? goto loop >> usage_count > 0? yes: decrement, goto loop >> return buffer; >> >> Proposed buffer acquisition would look like: >> loop: >> tick(); >> victim < last_victim? yes: atomic re-fetch completePasses (see below) >> usage_count > completePasses? yes: goto loop >> try_spinlock(); no lock? goto loop >> usage_count > completePasses? yes: goto loop >> in use? goto loop >> return buffer; >> >> (try_spinlock is an adorned TAS() as a opposed to a full spin). Usage >> count is double checked after lock acquisition in case the local copy >> is stale. > > I don't really see what this improves? You could do the "try spinlock" > bit just as well with the old algorithm and I don't really see a > accuracy benefit in the proposed approach? Maybe I'm missing something > entirely? Yes: the 'try lock' isn't necessary here, i just threw it in. It's been suggested few times in the past...I see no useful reason why the clocksweep (using the current algorithm) should sit and spin. Contended buffers should be hopped over. The real win is in eliminating the spinlock during clock tick except for buffers that are real candidates for being returned. refcount is a different problem, but has different characteristics; it has to be 100% accurate while usage_count does not, something which can be exploited (previously I suggested to simply remove all locking around it but I'm discarding that idea as there's no way to really prove it works well in unusual circumstances). > [1]: The basic trick I have in mind is that by putting flags, usagecount > and refcount into one integer (in separate bits) it is possible to do > CAS loops to manipulate each of those. E.g. to decrement usagecount you > can do something roughly like > > do > { > val = pg_atomic_read_u32(desc->flag_and_counts); > if (BufferDescUsageCount(val) == 0) > break; > > newval = BufferDescChangeUsageCount(val, -1); > } > while(!pg_atomic_compare_exchange_u32(&desc->flag_and_counts, > &val, newval)) > > Where BufferDescUsageCount just to some bit masking and shifting to > manipulate the right part. Same for refcount and most flags. > > To deal with changing tags I'd introduced a flag value that was used to > say 'you have to spinlock this time'. Interesting. I'm not sure sure that these ideas are contradictory, especially since refcount and usage_count are not always manipulated at the same time. Maybe you can get the best of both worlds; lockless pin/unpin with a lock free sweep: you'd have a fancy 'BufferDescIncreaseUsageCount', but not a decrease, since the completePasses (or some derivative of it), increases to give you the 'decrease' which is implicit as the clock sweeps around. What I'm proposing would decrease the atomic ops from the current 'atomic increment + spinlock' to a simple 'increment', plus the unusual atomic read to re-fetch completePasses. OTOH, one tricky bit I didn't mention is that inferring usage_count from an always incrementing value and completePasses would require the 'increment' routine to have knowledge of completePasses where previously it doesn't need it to support the 'max 5' check, requiring an atomic read, although perhaps not on every pin; a local copy would mostly do. merlin