I couldn't reproduce the regression. I had access to a multicore machine, but saw neither a regression at low client counts nor an improvement at high client counts.
In the attached patch, the spinlock delay is exposed in s_lock.h and used in LockBufHdr().
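To make the idea concrete, here is a rough standalone sketch of a header lock that spins on an atomic lock bit and falls back to a sleeping delay. It is not the code from the patch: SpinDelay, spin_delay(), lock_hdr(), unlock_hdr(), HDR_LOCKED and all the constants are hypothetical names and values chosen for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* All names and constants below are illustrative assumptions. */
#define HDR_LOCKED          (1u << 31)  /* hypothetical lock bit in the state word */
#define SPINS_BEFORE_SLEEP  100         /* busy-wait iterations before sleeping */
#define MIN_DELAY_US        1000L
#define MAX_DELAY_US        1000000L

typedef struct
{
    int  spins;         /* iterations since the last sleep */
    int  delays;        /* number of sleeps performed so far */
    long cur_delay_us;  /* current sleep length, 0 until the first sleep */
} SpinDelay;

static void
spin_delay_init(SpinDelay *d)
{
    d->spins = 0;
    d->delays = 0;
    d->cur_delay_us = 0;
}

/* Called after each failed attempt: spin a while, then sleep with backoff. */
static void
spin_delay(SpinDelay *d)
{
    struct timespec ts;

    if (++d->spins < SPINS_BEFORE_SLEEP)
        return;                         /* keep busy-waiting */

    if (d->cur_delay_us == 0)
        d->cur_delay_us = MIN_DELAY_US;

    ts.tv_sec = d->cur_delay_us / 1000000L;
    ts.tv_nsec = (d->cur_delay_us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);

    d->delays++;
    d->cur_delay_us *= 2;               /* exponential backoff */
    if (d->cur_delay_us > MAX_DELAY_US)
        d->cur_delay_us = MIN_DELAY_US; /* wrap around to the minimum */
    d->spins = 0;
}

/* Set the lock bit; returns the state word as seen with the lock held. */
static uint32_t
lock_hdr(_Atomic uint32_t *state)
{
    SpinDelay d;

    spin_delay_init(&d);
    for (;;)
    {
        uint32_t old = atomic_fetch_or(state, HDR_LOCKED);

        if (!(old & HDR_LOCKED))
            return old | HDR_LOCKED;    /* bit was clear, so we own the lock */
        spin_delay(&d);
    }
}

/* Clear the lock bit; the caller must currently hold the lock. */
static void
unlock_hdr(_Atomic uint32_t *state)
{
    assert(atomic_load(state) & HDR_LOCKED);
    atomic_fetch_and(state, ~HDR_LOCKED);
}
```

The point of sharing the delay logic is that a contended header lock backs off the same way an ordinary spinlock does, instead of spinning unconditionally.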
Dilip, could you try this version of the patch? Could you also run perf or another profiler on the case where you see the regression? It would be nice to compare profiles with and without the patch; that would probably let us find the cause of the regression.