Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture

From	Andres Freund
Subject	Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
Date	15 August 23:13:30
Msg-id	fgsf5ofxte7er3z6t2womog6t3nlhiwklyy5bg6jfshj3maln2@enb6qeculxlm
In reply to	Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture (Nathan Bossart <nathandbossart@gmail.com>)
Replies	Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture Re: Remove Instruction Synchronization Barrier in spin_delay() for ARM64 architecture
List	pgsql-hackers

Hi,

On 2025-08-15 12:57:52 -0500, Nathan Bossart wrote:
> On Fri, Aug 15, 2025 at 01:39:52PM -0400, Andres Freund wrote:
> > On 2025-08-14 11:29:08 +0200, Álvaro Herrera wrote:
> >> However, changing that spinlock to an lwlock doesn't look easy, because of
> >> the way each pgss entry is created as a dynahash entry, and then deallocated
> >> from there.  With spinlocks we can just reinit the spinlock each time, but
> >> that doesn't work with lwlocks.  We have no easy way to associate then
> >> disassociate each entry from a specific lwlock.
> > 
> > I'm not following? The lwlock can just be inside the struct, just like the
> > spinlock is? "Association" is just LWLockInitialize() and deassociation is not
> > needed.
> 
> Indeed.  I rebased an old patch that I had lying around to demonstrate.  If
> my past testing [0] is to be trusted, this actually hurts performance,
> unfortunately.

FWIW, rather interesting result of testing the patch briefly:

On my older workstation, the patch is a substantial *gain* when there's a lot
of contention. But on my newer workstation it's a *loss*.

The penalty from enabling pg_stat_statements for read-only pgbench on the newer
workstation is rather bad - about 1/3 the throughput.

I think the main reason that lwlocks lose on the newer machine is that we
lose spinning. The newer machine has more cores and more NUMA domains and the
fairer locks lead to more cacheline ping-pong...

IMO, the only way to actually make pg_stat_statements scale is to move to a
model much more like our regular stats. I.e. accumulate counters in backend
local memory and only occasionally update the shared stats. Even if you were
to move pgss successfully to atomics, the cacheline contention still would be
terrible for performance.
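
To make the "accumulate locally, flush occasionally" idea concrete, here is a
minimal standalone sketch in C11. It is not PostgreSQL code: SharedEntry,
LocalPending, FLUSH_EVERY and the function names are all hypothetical, and a
toy atomic_flag spinlock stands in for whatever lock would protect the shared
entry. The point is only that the shared cacheline is touched once per batch
rather than once per statement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flush threshold; a real design might flush on a timer. */
#define FLUSH_EVERY 64

/* Hypothetical shared entry, standing in for one pg_stat_statements entry. */
typedef struct SharedEntry
{
    atomic_flag lock;           /* toy spinlock protecting the counters */
    uint64_t    calls;
    uint64_t    rows;
    double      total_time;
} SharedEntry;

/* Counters accumulated in backend-local memory; no locking needed. */
typedef struct LocalPending
{
    uint64_t    calls;
    uint64_t    rows;
    double      total_time;
} LocalPending;

void
shared_entry_init(SharedEntry *e)
{
    atomic_flag_clear(&e->lock);
    e->calls = 0;
    e->rows = 0;
    e->total_time = 0.0;
}

/* Fold the pending local counters into the shared entry under the lock. */
void
flush_pending(SharedEntry *shared, LocalPending *p)
{
    assert(shared != NULL && p != NULL);

    if (p->calls == 0)
        return;                 /* nothing to flush, skip the shared line */

    while (atomic_flag_test_and_set_explicit(&shared->lock,
                                             memory_order_acquire))
        ;                       /* spin until the lock is free */

    shared->calls += p->calls;
    shared->rows += p->rows;
    shared->total_time += p->total_time;

    atomic_flag_clear_explicit(&shared->lock, memory_order_release);

    *p = (LocalPending) {0};
}

/* Record one execution locally; touch shared memory once per batch. */
void
record_execution(SharedEntry *shared, LocalPending *p,
                 uint64_t rows, double elapsed_ms)
{
    p->calls++;
    p->rows += rows;
    p->total_time += elapsed_ms;

    if (p->calls >= FLUSH_EVERY)
        flush_pending(shared, p);
}
```

A caller would keep one LocalPending per backend per entry and call
flush_pending() once more at exit or when idle. The trade-off is that the
shared numbers lag behind by up to one batch per backend.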

FWIW, I'd not be surprised if moving to atomics would often cause *slowdowns*
compared to using the spinlocks. You'd replace one atomic operation with
dozens, to update all those fields individually. With loads of cacheline
ping-pong in between.
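
The "one atomic operation versus dozens" point can be illustrated with a small
standalone C11 sketch. Everything here is hypothetical (NFIELDS, the struct
and function names); it is not how pg_stat_statements is laid out. Each
function returns the number of atomic read-modify-write operations it issued,
so the two approaches can be compared directly in the uncontended case.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical field count; the text above only says "dozens". */
#define NFIELDS 24

/* Variant A: plain counters protected by a toy spinlock. */
typedef struct LockedCounters
{
    atomic_flag lock;
    uint64_t    field[NFIELDS];
} LockedCounters;

/* Variant B: every counter is its own atomic. */
typedef struct AtomicCounters
{
    _Atomic uint64_t field[NFIELDS];
} AtomicCounters;

void
locked_init(LockedCounters *c)
{
    atomic_flag_clear(&c->lock);
    for (int i = 0; i < NFIELDS; i++)
        c->field[i] = 0;
}

void
atomics_init(AtomicCounters *c)
{
    for (int i = 0; i < NFIELDS; i++)
        atomic_init(&c->field[i], 0);
}

/*
 * Update all fields under the lock. Returns the number of atomic
 * read-modify-write operations issued: one per acquisition attempt.
 * The release is a plain atomic store, not a read-modify-write.
 */
int
update_locked(LockedCounters *c, const uint64_t delta[NFIELDS])
{
    int         rmw = 0;

    do
    {
        rmw++;
    } while (atomic_flag_test_and_set_explicit(&c->lock,
                                               memory_order_acquire));

    for (int i = 0; i < NFIELDS; i++)
        c->field[i] += delta[i];    /* ordinary stores while holding lock */

    atomic_flag_clear_explicit(&c->lock, memory_order_release);
    return rmw;
}

/*
 * Update all fields with individual atomic adds. Returns the number of
 * atomic read-modify-write operations issued: one per field.
 */
int
update_atomics(AtomicCounters *c, const uint64_t delta[NFIELDS])
{
    int         rmw = 0;

    for (int i = 0; i < NFIELDS; i++)
    {
        atomic_fetch_add_explicit(&c->field[i], delta[i],
                                  memory_order_relaxed);
        rmw++;
    }
    return rmw;
}
```

Without contention, update_locked() issues a single read-modify-write while
update_atomics() issues NFIELDS of them. Under contention the locked variant
retries, but the atomic variant can lose the cacheline to another core
between any two of its adds.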

Greetings,

Andres Freund
