Re: Improving spin-lock implementation on ARM.

Поиск

Список

Период

Сортировка

От	Krunal Bauskar
Тема	Re: Improving spin-lock implementation on ARM.
Дата	2 декабря 2020 г. 06:57:37
Msg-id	CAB10pyZuh_fAnRUZ-hNd9bJ7iyLUjcOo4Fun5ykLb47A3xe8oA@mail.gmail.com обсуждение исходный текст
Ответ на	Re: Improving spin-lock implementation on ARM. (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы	Re: Improving spin-lock implementation on ARM. (Krunal Bauskar <krunalbauskar@gmail.com>) Re: Improving spin-lock implementation on ARM. (Alexander Korotkov <aekorotkov@gmail.com>) Re: Improving spin-lock implementation on ARM. (Alexander Korotkov <aekorotkov@gmail.com>)
Список	pgsql-hackers

Дерево обсуждения

On Tue, 1 Dec 2020 at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alexander Korotkov <aekorotkov@gmail.com> writes:
> On Tue, Dec 1, 2020 at 6:19 PM Krunal Bauskar <krunalbauskar@gmail.com> wrote:
>> I would request you guys to re-think it from this perspective to help ensure that PGSQL can scale well on ARM.
>> s_lock becomes a top-most function and LSE is not a universal solution but CAS surely helps ease the main bottleneck.

> CAS patch isn't proven to be a universal solution as well. We have
> tested the patch on just a few processors, and Tom has seen the
> regression [1]. The benchmark used by Tom was artificial, but the
> results may be relevant for some real-life workload.

Yeah. I think that the main conclusion from what we've seen here is
that on smaller machines like M1, a standard pgbench benchmark just
isn't capable of driving PG into serious spinlock contention. (That
reflects very well on the work various people have done over the years
to get rid of spinlock contention, because ten or so years ago it was
a huge problem on this size of machine. But evidently, not any more.)
Per the results others have posted, nowadays you need dozens of cores
and hundreds of client threads to measure any such issue with pgbench.

So that is why I experimented with a special test that does nothing
except pound on one spinlock. Sure it's artificial, but if you want
to see the effects of different spinlock implementations then it's
just too hard to get any results with pgbench's regular scripts.

And that's why it disturbs me that the CAS-spinlock patch showed up
worse in that environment. The fact that it's not visible in the
regular pgbench test just means that the effect is too small to
measure in that test. But in a test where we *can* measure an effect,
it's not looking good.

It would be interesting to see some results from the same test I did
on other processors. I suspect the results would look a lot different
from mine ... but we won't know unless someone does it. Or, if someone
wants to propose some other test case, let's have a look.

> I'm expressing just my personal opinion, other committers can have
> different opinions. I don't particularly think this topic is
> necessarily a non-starter. But I do think that given ambiguity we've
> observed in the benchmark, much more research is needed to push this
> topic forward.

Yeah. I'm not here to say "do nothing". But I think we need results
from more machines and more test cases to convince ourselves whether
there's a consistent, worthwhile win from any specific patch.

I think there is an ambiguity with lse and that has been the

source of some confusion so let's make another attempt to
understand all the observations and then define the next steps.

-----------------------------------------------------------------

1. CAS patch (applied on the baseline)
- Kunpeng: 10-45% improvement observed [1]
- Graviton2: 30-50% improvement observed [2]
- M1: Only select results are available cas continue to maintain a marginal gain but not significant. [3]
[inline with what we observed with Kunpeng and Graviton2 for select results too].

2. Let's ignore CAS for a sec and just think of LSE independently
- Kunpeng: regression observed
- Graviton2: gain observed
- M1: regression observed
[while lse probably is default explicitly enabling it with +lse causes regression on the head itself [4].
client=2/4: 1816/714 ---- vs ---- 892/610]

There is enough reason not to immediately consider enabling LSE given its unable to perform consistently on all hardware.
-----------------------------------------------------------------

With those 2 aspects clear let's evaluate what options we have in hand

1. Enable CAS approach
- What we gain: pgsql scale on Kunpeng/Graviton2
(m1 awaiting read-write result but may marginally scale [[5]: "but the patched numbers are only about a few percent better"])
- What we lose: Nothing for now.

2. LSE:
- What we gain: Scaled workload with Graviton2
- What we lose: regression on M1 and Kunpeng.

Let's think of both approaches independently.

- Enabling CAS would help us scale on all hardware (Kunpeng/Graviton2/M1)
- Enabling LSE would help us scale only on some but regress on others.
[LSE could be considered in the future once it stabilizes and all hardware adapts to it]

-------------------------------------------------------------------

Let me know what do you think about this analysis and any specific direction that we should consider to help move forward.

-------------------------------------------------------------------

Links:
[1]: https://www.postgresql.org/message-id/attachment/116612/Screenshot%20from%202020-12-01%2017-55-21.png
[2]: https://www.postgresql.org/message-id/attachment/116521/arm-rw.png
[3]: https://www.postgresql.org/message-id/1367116.1606802480%40sss.pgh.pa.us
[4]: https://www.postgresql.org/message-id/1158478.1606716507%40sss.pgh.pa.us
[5]: https://www.postgresql.org/message-id/51e2f75b-3742-7f28-4438-0425b11cf410%40enterprisedb.com