Hi,
On 2023-08-02 16:51:29 -0700, Matt Smiley wrote:
> I thought it might be helpful to share some more details from one of the
> case studies behind Nik's suggestion.
>
> Bursty contention on lock_manager lwlocks recently became a recurring cause
> of query throughput drops for GitLab.com, and we got to study the behavior
> via USDT and uprobe instrumentation along with more conventional
> observations (see
> https://gitlab.com/gitlab-com/gl-infra/scalability/-/issues/2301). This
> turned up some interesting finds, and I thought sharing some of that
> research might be helpful.
Hm, I'm curious whether you have a way to trigger the issue outside of your
prod environment. Mainly because I'm wondering if you're potentially hitting
the issue fixed in a4adc31f690 - we ended up not backpatching that fix, so
you'd not see the benefit unless you reproduced the load in 16+.
I'm also wondering if it's possible that the reason for the throughput drops
are possibly correlated with heavyweight contention or higher frequency access
to the pg_locks view. Deadlock checking and the locks view acquire locks on
all lock manager partitions... So if there's a bout of real lock contention
(for longer than deadlock_timeout)...
Given that most of your lock manager traffic comes from query planning - have
you evaluated using prepared statements more heavily?
Greetings,
Andres Freund