I haven't actually reviewed the code, but this sort of thing seems like good evidence that we need your patch, or something like it. The fact that the patch produces little performance improvement on its own (though it does produce some) shouldn't be held against it: if the contention shifts elsewhere once the first bottleneck is removed, that is not your patch's fault.
In terms of ameliorating contention on the buffer mapping locks, I think it would be better to replace the whole buffer mapping table with something different. I started working on that almost two years ago, building a hash table that can be read without taking any locks and written with, well, less locking than what we have right now: