Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers
| From | Greg Burd |
|---|---|
| Subject | Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers |
| Date | |
| Msg-id | c57bcd13-558f-4004-8988-55df8512b878@app.fastmail.com |
| In reply to | Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers (Andres Freund <andres@anarazel.de>) |
| List | pgsql-hackers |
On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:
> Hi,
Thanks again for taking a look at the patch; hopefully I got it right this time. :)
> On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
>> @@ -2509,25 +2513,64 @@ int main(void)
>> }
>> '''
>>
>> - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>> - args: test_c_args)
>> - # Use ARM CRC Extension unconditionally
>> - cdata.set('USE_ARMV8_CRC32C', 1)
>> - have_optimized_crc = true
>> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
>> - args: test_c_args + ['-march=armv8-a+crc+simd'])
>> - # Use ARM CRC Extension, with runtime check
>> - cflags_crc += '-march=armv8-a+crc+simd'
>> - cdata.set('USE_ARMV8_CRC32C', false)
>> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> - have_optimized_crc = true
>> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
>> - args: test_c_args + ['-march=armv8-a+crc'])
>> - # Use ARM CRC Extension, with runtime check
>> - cflags_crc += '-march=armv8-a+crc'
>> - cdata.set('USE_ARMV8_CRC32C', false)
>> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> - have_optimized_crc = true
>> + if cc.get_id() == 'msvc'
>> + # MSVC: Intrinsic availability check for ARM64
>> + if host_machine.cpu_family() == 'aarch64'
>> + # Test if CRC32C intrinsics are available in intrin.h
>> + crc32c_test_msvc = '''
>> + #include <stdint.h>
>> + #include <intrin.h>
>> + int main(void) {
>> + uint32_t crc = 0;
>> + uint8_t data = 0;
>> + crc = __crc32cb(crc, data);
>> + return 0;
>> + }
>> + '''
>> + if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
>> + cdata.set('USE_ARMV8_CRC32C', 1)
>> + have_optimized_crc = true
>> + message('Using ARM64 CRC32C hardware acceleration (MSVC)')
>> + else
>> + message('CRC32C intrinsics not available on this MSVC ARM64 build')
>> + endif
>
> Does this:
> a) need to be conditional at all? Given that it's msvc specific, it seems we
> don't need to run a test.
> b) why is the msvc block outside of the general aarch64 block, but then with
> another nested aarch64 test inside? That seems unnecessarily complicated and
> requires reindenting more code than necessary.
Yep, I rushed this. Apologies. I've re-worked it with your suggestions.
>> +/*
>> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
>> + */
>> +#ifdef _M_ARM64
>> +
>> +static __forceinline void
>> +spin_delay(void)
>> +{
>> + /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
>> + __isb(_ARM64_BARRIER_SY);
>> +}
>> +#else
>> +/*
>> + * For x64, use _mm_pause intrinsic instead of rep nop.
>> + */
>> static __forceinline void
>> spin_delay(void)
>> {
>> _mm_pause();
>> }
>
> This continues to use a barrier, with a reference to a list of barrier
> semantics that really don't seem to make a whole lot of sense in the context
> of spin_delay(). If we want to emit this kind of barrier for now, it's ok with
> me, but it should be documented as just being a fairly arbitrary choice,
> rather than justified with a link that doesn't explain anything.
I did more digging and found that you were right about the use of ISB for spin_delay(). I think I was misled by
earlier code in that file (lines 277-286), where there is an implementation of spin_delay() that uses ISB, and I ran
with that without doing enough research myself. An article I found on the topic [1] suggests that YIELD should be
used, not ISB. I also checked how others implement this: Java [2][3] uses YIELD, and Linux [4][5] uses YIELD in
cpu_relax(), which is called by __delay().
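For concreteness, here's a minimal sketch of the YIELD-based variant, assuming MSVC's __yield() intrinsic (which emits the YIELD hint instruction):

#ifdef _M_ARM64
/*
 * Sketch: YIELD is the architectural spin-wait hint on ARM64, so use the
 * MSVC __yield() intrinsic rather than an ISB barrier.
 */
static __forceinline void
spin_delay(void)
{
    __yield();
}
#endif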
>> +#endif
>> #else
>> static __forceinline void
>> spin_delay(void)
>> @@ -623,9 +640,13 @@ spin_delay(void)
>> #include <intrin.h>
>> #pragma intrinsic(_ReadWriteBarrier)
>>
>> -#define S_UNLOCK(lock) \
>> +#ifdef _M_ARM64
>> +#define S_UNLOCK(lock) \
>> + do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
>> +#else
>
> This doesn't seem like the right way to implement this - why not use
> InterlockedExchange(lock, 0)? That will do the write with barrier semantics.
Great idea, done. Seems to work too.
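The rewritten unlock looks roughly like this (a sketch; it assumes slock_t stays LONG-sized, as in the existing MSVC path):

#ifdef _M_ARM64
/*
 * Sketch: InterlockedExchange() performs the store with full-barrier
 * semantics, so no separate DMB is needed before releasing the lock.
 */
#define S_UNLOCK(lock) \
    do { InterlockedExchange((volatile LONG *) (lock), 0); } while (0)
#endif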
> Greetings,
>
> Andres Freund
Given what I learned about YIELD vs ISB for spin delay, it seems reasonable to submit a new patch for the
non-MSVC path and switch it to YIELD as well. What do you think?
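If so, the GCC/Clang side would be a one-line change, something like this sketch (untested):

static __inline__ void
spin_delay(void)
{
    /* YIELD spin-wait hint in place of the current ISB */
    __asm__ __volatile__(" yield \n");
}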
v5 attached, best.
-greg
[1] https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
[2] https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
[3] https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
[4] https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
[5] https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403