Increasing shared_buffers improved performance by 15%. The trend remains the same, though: a steep drop in performance beyond a certain number of clients.
My deployment is "NUMA-aware": I allocate cores that reside on the same socket first. Only once all the cores on that socket are in use do I start allocating cores from a neighbouring socket.
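To make the allocation order concrete, here is a minimal sketch of that policy in Python. It is an illustration only, not the script I actually use: the function name, the topology dictionary, and the core IDs are hypothetical, and it assumes that "neighbouring" simply means the next socket ID.

```python
def allocate_cores(topology, n_cores, start_socket=0):
    """Return n_cores core IDs, filling one socket before spilling to the next.

    topology maps a socket ID to the list of core IDs on that socket.
    Assumption: the "neighbouring" socket is the next socket ID, wrapping around.
    """
    sockets = sorted(topology)
    total = sum(len(topology[s]) for s in sockets)
    if n_cores > total:
        raise ValueError(f"requested {n_cores} cores, only {total} available")

    # Visit sockets starting from start_socket, then its neighbours in order.
    start = sockets.index(start_socket)
    order = sockets[start:] + sockets[:start]

    chosen = []
    for socket in order:
        for core in topology[socket]:
            if len(chosen) == n_cores:
                return chosen
            chosen.append(core)
    return chosen


# Hypothetical 2-socket machine with 4 cores per socket.
example_topology = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
cores = allocate_cores(example_topology, 6)
```

On Linux, the resulting list could then be used to pin the server, for example with taskset -c or os.sched_setaffinity; the real core-to-socket mapping would have to come from something like lscpu or numactl --hardware.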
I'll also try printing the value of spins_per_delay for each experiment... just in case it shows something interesting.