On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar <krunalbauskar@gmail.com> wrote:
> 1. CAS patch (applied on the baseline)
> - Kunpeng: 10-45% improvement observed [1]
> - Graviton2: 30-50% improvement observed [2]
What does lower boundary of improvement mean? Does it mean minimal
improvement observed? Obviously not, because there is no improvement
with a low number of clients or read-only benchmark.
> - M1: Only select results are available cas continue to maintain a marginal gain but not significant. [3]
Read-only benchmark doesn't involve spinlocks (I've just rechecked
this). So, this difference is purely speculative.
Also, Tom observed the regression [1]. The benchmark is artificial,
but it may correspond to some real workload with heavily-loaded
spinlocks. And that might have an explanation. ldrex/strex
themselves don't work as memory barrier (this is why compiler adds
explicit memory barrier afterwards). And I bet ldrex unpaired with
strex could see an outdated value. On high-contended spinlocks that
may cause too pessimistic waits.
> 2. Let's ignore CAS for a sec and just think of LSE independently
> - Kunpeng: regression observed
Yeah, it's sad that there are ARM chips, where making the same action
in a single instruction is slower than the same actions in multiple
instructions.
> - Graviton2: gain observed
Have to say it's 5.5x improvement, which is dramatically more than
CAS-loop patch can give.
> - M1: regression observed
> [while lse probably is default explicitly enabling it with +lse causes regression on the head itself [4].
> client=2/4: 1816/714 ---- vs ---- 892/610]
This is plain false. LSE was enabled in all the versions tested in M1
[2]. So not LSE instructions or +lse option caused the regression,
but lack of other options enabled by default in Apple's clang.
> Let me know what do you think about this analysis and any specific direction that we should consider to help move
forward.
In order to move forward with ARM-optimized spinlocks I would
investigate other options (at least [3]).
Regarding CAS spinlock patch I can propose the following steps to
clarify the things.
1. Run Tom's spinlock test on ARM machines other than M1.
2. Try to construct a pgbench script, which produces heavily-loaded
spinlock without custom C-function, and also run it across various ARM
machines.
Links
1. https://www.postgresql.org/message-id/741389.1606530957%40sss.pgh.pa.us
2. https://www.postgresql.org/message-id/1274781.1606760475%40sss.pgh.pa.us
3. https://linux-concepts.blogspot.com/2018/05/spinlock-implementation-in-arm.html
------
Regards,
Alexander Korotkov