Wait free LW_SHARED acquisition
От | Andres Freund |
---|---|
Тема | Wait free LW_SHARED acquisition |
Дата | |
Msg-id | 20130926225545.GB26663@awork2.anarazel.de обсуждение исходный текст |
Ответы |
Re: Wait free LW_SHARED acquisition
(Peter Geoghegan <pg@heroku.com>)
Re: Wait free LW_SHARED acquisition (Heikki Linnakangas <hlinnakangas@vmware.com>) Re: Wait free LW_SHARED acquisition (Florian Pflug <fgp@phlo.org>) Re: Wait free LW_SHARED acquisition - v0.2 (Andres Freund <andres@2ndquadrant.com>) Re: Wait free LW_SHARED acquisition (Christian Kruse <christian@2ndQuadrant.com>) Re: Wait free LW_SHARED acquisition (Andres Freund <andres@2ndquadrant.com>) Re: Wait free LW_SHARED acquisition - v0.9 (Andres Freund <andres@2ndquadrant.com>) Re: Wait free LW_SHARED acquisition - v0.10 (Andres Freund <andres@2ndquadrant.com>) |
Список | pgsql-hackers |
Hello, We have had several customers running postgres on bigger machines report problems on busy systems. Most recently one where a fully cached workload completely stalled in s_lock()s due to the *shared* lwlock acquisition in BufferAlloc() around the buffer partition lock. Increasing the padding to a full cacheline helps making the partitioning of the partition space actually effective (before it's essentially halved on a read-mostly workload), but that still leaves one with very hot spinlocks. So the goal is to have LWLockAcquire(LW_SHARED) never block unless somebody else holds an exclusive lock. To produce enough appetite for reading the rest of the long mail, here's some numbers on a pgbench -j 90 -c 90 -T 60 -S (-i -s 10) on a 4xE5-4620 master + padding: tps = 146904.451764 master + padding + lwlock: tps = 590445.927065 That's rougly 400%. So, nice little improvement. Unless - not entirely unlikely - I fucked up and it's fast because it's broken. Anyway, here's the algorith I chose to implement: The basic idea is to have a single 'uint32 lockcount' instead of the former 'char exclusive' and 'int shared' and to use an atomic increment to acquire the lock. That's fairly easy to do for rw-spinlocks, but a lot harder for something like LWLocks that want to wait in the OS. Exlusive lock acquisition: Use an atomic compare-and-exchange on the lockcount variable swapping in EXCLUSIVE_LOCK/1<<31/0x80000000 if and only if the current value of lockcount is 0. If the swap was not successfull, we have to wait. Shared lock acquisition: Use an atomic add (lock xadd) to the lockcount variable to add 1. If the value is bigger than EXCLUSIVE_LOCK we know that somebody actually has an exclusive lock, and we back out by atomically decrementing by 1 again. If so, we have to wait for the exlusive locker to release the lock. Queueing & Wakeup: Whenever we don't get a shared/exclusive lock we us nearly the same queuing mechanism as we currently do. While we probably could make it lockless as well, the queue currently is still protected by the good old spinlock. Relase: Use a atomic decrement to release the lock. If the new value is zero (we get that atomically), we know we have to release waiters. And the real world: Now, as you probably have noticed, naively doing the above has two glaring race conditions: 1) too-quick-for-queueing: We try to lock using the atomic operations and notice that we have to wait. Unfortunately until we have finished queuing, the former locker very well might have already finished it's work. 2) spurious failed locks: Due to the logic of backing out of shared locks after we unconditionally added a 1 to lockcount, we might have prevented another exclusive locker from getting the lock: 1) Session A: LWLockAcquire(LW_EXCLUSIVE) - success 2) Session B: LWLockAcquire(LW_SHARED) - lockcount += 1 3) Session B: LWLockAcquire(LW_SHARED) - oops, bigger than EXCLUSIVE_LOCK 4) Session B: LWLockRelease() 5) Session C: LWLockAcquire(LW_EXCLUSIVE) - check if lockcount = 0, no. wait. 6) Session B: LWLockAcquire(LW_SHARED) - lockcount -= 1 7) Session B: LWLockAcquire(LW_SHARED) - wait So now we can have both B) and C) waiting on a lock that nobody is holding anymore. Not good. The solution: We use a two phased attempt at locking: Phase 1: Try to do it atomically, if we succeed, nice Phase 2: Add us too the waitqueue of the lock Phase 3: Try to grab the lock again, if we succeed, remove ourselves from the queue Phase 4: Sleep till wakeup, goto Phase 1 This protects us against both problems from above: 1) Nobody can release too quick, before we're queued, after Phase 2 since we're already queued. 2) If somebody spuriously got blocked from acquiring the lock, they will get queued in Phase 2 and we can wake them up if neccessary. Now, there are lots of tiny details to handle additionally to those, but those seem better handled by looking at the code? - The above algorithm only works for LWLockAcquire, not directly for LWLockAcquireConditional where we don't want to wait. In that case we just need to retry acquiring the lock until we're sure we didn't disturb anybody in doing so. - we can get removed from the queue of waiters in Phase 3, before we remove ourselves. In that case we need to absorb the wakeup. - Spurious locks can prevent us from recognizing a lock that's free during release. Solve it by checking for existing waiters whenever an exlusive lock is released. I've done a couple of off-hand benchmarks and so far I can confirm that everything using lots of shared locks benefits greatly and everything else doesn't really change much. So far I've seen mostly some slight improvements in exclusive lock heavy workloads, but those were pretty small. It's also very important to mention that those speedups are only there on multi-socket machines. From what I've benchmarked so far in LW_SHARED heavy workloads with 1 socket you get ~5-10%, 2 sockets 20-30% and finally and nicely for 4 sockets: 350-400%. While I did assume the difference would be bigger on 4 socket machines than on my older 2 socket workstation (that's where the 20-30% come from) I have to admit, I was surprised by the difference on the 4 socket machine. Does anybody see fundamental problems with the algorithm? The implementation sure isn't ready for several reasons, but I don't want to go ahead and spend lots of time on it before hearing some more voices. So what's todo? The file header tells us: * - revive pure-spinlock implementation * - abstract away atomic ops, we really only need a few. * - CAS * - LOCK XADD * - convert PGPROC->lwWaitLink to ilist.h slist or even dlist. * - remove LWLockWakeup dealing with MyProc * - overhaul the mask offsets, make SHARED/EXCLUSIVE_LOCK_MASK wider, MAX_BACKENDS Currently only gcc is supported because I used its __sync_fetch_and_add(), __sync_fetch_and_sub() and __sync_val_compare_and_swap() are used. There have been reports about __sync_fetch_and_sub() not getting properly optimized with gcc < 4.8, perhaps we need to replace it by _and_add(-val). Given the low amount of primitives required, it should be adaptable to most newer compilers. Comments? Fundamental flaws? 8 socket machines? Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
В списке pgsql-hackers по дате отправления: