Discussion: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers

Hi all,

Dave and I have been working together to get ARM64 with MSVC functional.
 The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.

PostgreSQL, when compiled with MSVC on ARM64, fails the
027_stream_regress test, in particular when optimizations are enabled
(e.g., /O2).
After some investigation and analysis of generated assembly code, Dave
Cramer and I have identified that the root cause is insufficient memory
barrier semantics in both atomic operations and spinlocks on ARM64 when
compiled with MSVC with /O2.

Dave knew I was in the process of setting up a Win11/ARM64/MSVC buildfarm
animal and pinged me with this issue.  Dave got me started on the path
to finding the issue by sending me his workaround:

--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void
WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
  * before the data page can be written out.  This implements the basic
  * WAL rule "write the log before the data".)
  */
+#pragma optimize("",off)
 XLogRecPtr
 XLogInsertRecord(XLogRecData *rdata,
                                 XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,

        return EndPos;
 }
-
+#pragma optimize("",on)
 /*


This pointed a finger at the atomics, so I started there.  We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform.  I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn


PROBLEM DESCRIPTION

PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.

The root cause has two components:

1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64


TECHNICAL ANALYSIS

PART 1: ATOMIC OPERATIONS MEMORY BARRIERS

GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:

* LSE path (modern ARM64): Using CASAL instruction with built-in
  memory ordering [1][2]

* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
  explicit dmb sy instruction [3]

MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].


PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS

The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:

#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)

Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers

The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.

Issue 2: S_UNLOCK() uses only a compiler barrier

_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6].  It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.

For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.

IMPACT ON 027_STREAM_REGRESS

The 027_stream_regress test involves WAL replication and standby
recovery — heavily dependent on synchronized access to shared memory
protected by spinlocks [7].  Without proper barriers on ARM64:

1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up


WHY ARM32 AND X86/X64 ARE UNAFFECTED

MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]

Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.

ATTACHED SOLUTION

Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.

0002: src/include/port/atomics/generic-msvc.h

Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64.

Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()


0001: src/include/storage/s_lock.h

Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:

#if defined(_M_ARM64)
#define TAS(lock) tas_msvc_arm64(lock)

static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
  int result;

  /* Full barrier before atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  /* Atomic compare-and-swap */
  result = InterlockedCompareExchange(lock, 1, 0);

  /* Full barrier after atomic operation */
  __dmb(_ARM64_BARRIER_SY);

  return result;
}

#define S_UNLOCK(lock) \
do { \
  __dmb(_ARM64_BARRIER_SY); /* Full barrier before release */ \
  (*(lock)) = 0; \
} while (0)

#else
  /* Non-ARM64 MSVC: existing implementation unchanged */
#endif


The spinlock acquire now ensures:

* Before CAS: All prior memory operations complete before
  acquiring the lock.

* After CAS: The CAS completes before subsequent operations
  access protected data.

The spinlock release now ensures:

* Before writing 0: All critical section operations are visible
  to other threads.


You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
  semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release
  semantics [3]
* One final dmb sy upgrades to full sequential consistency

Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves:

* First DMB acts as a release barrier (ensures prior memory ops
  complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
  ops wait for CAS)
* Together they provide sequential consistency matching GCC's
  semantics [3]


VERIFICATION

The fix has been verified by:

1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
   consistently on MSVC ARM64 with /O2 optimization without hanging

2. Assembly code inspection: Confirmed that dmb sy instructions now
   appear in the optimized assembly for ARM64 builds

3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
   expand to no-ops; original code path unchanged)


WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM

PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.

CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)

Clang/LLVM uses GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even on platforms where it's not
GCC. The LLVM backend has an AtomicExpand pass that properly expands
these operations to include appropriate memory barriers for the target
architecture.

On ARM64, Clang generates:

* __aarch64_cas4_acq_rel library calls (or the CASAL instruction with LSE)
* Proper acquire-release semantics built into the instruction sequence
* Automatic full dmb sy barriers where needed

This means PostgreSQL's use of __sync_lock_test_and_set and the
__atomic_* builtins works correctly on macOS ARM64 without additional
patches.


Phew... I hope I read all those docs correctly and got that right.  Feel
free to let me know if I missed something.  Looking forward to your
feedback and review so I can get this new build animal up and running.

best.

-greg

[1] ARM Developer: CAS Instructions
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-

[2] ARM Developer: Load-Acquire and Store-Release Instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

[3] ARM Developer: Data Memory Barrier (DMB)
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en

[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170

[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170

[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170

[7] PostgreSQL: 027_stream_regress WAL replication testing
https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us

[8] Intel SDM Volume 3A: Memory Ordering
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

[9] Microsoft Developer Blog: The AArch64 processor - Barriers
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968
On 20.11.25 21:45, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>   The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

 > -  zlib_t = dependency('zlib', required: zlibopt)
 > +  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)

This appears to be a change unrelated to your patch description.

Also, the second patch contains a number of random whitespace changes. 
It would be good to clean that up so the patch is easier to analyze.




On Nov 20 2025, at 5:00 pm, Peter Eisentraut <peter@eisentraut.org> wrote:

> On 20.11.25 21:45, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>   The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
> 

Hi Peter,

Thanks for taking a second to review.

> > -  zlib_t = dependency('zlib', required: zlibopt)
> > +  zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)
> 
> This appears to be a change unrelated to your patch description.

Hmmm, these are changes that Dave added to get things to compile.  They
worked for me but I'll review them again in the morning and update the description.

> Also, the second patch contains a number of random whitespace changes. 
> It would be good to clean that up so the patch is easier to analyze.

Certainly, apologies.  I'll clean that up first thing in the morning.

-greg



I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>
 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC.  In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"
 
+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm.  Doesn't MSVC require intrin.h?

-- 
nathan



Hi,

On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>  The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

Thanks for working on this!


> This pointed a finger at the atomics, so I started there.  We used a few
> tools, but worth noting is https://godbolt.org/ where we were able to
> quickly see that the MSVC assembly was missing the "dmb" barriers on
> this platform.  I'm not sure how long this link will be valid, but in
> the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
> 
> 
> PROBLEM DESCRIPTION
> 
> PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
> manifesting as timing-dependent failures in critical sections
> protected by spinlocks and atomic variables. The failures are
> reproducible when the test suite is compiled with optimization flags
> (/O2), particularly in the recovery/027_stream_regress test which
> involves WAL replication and standby recovery.
> 
> The root cause has two components:
> 
> 1. Atomic operations lack memory barriers on ARM64
> 2. MSVC spinlock implementation lacks memory barriers on ARM64
> 
> TECHNICAL ANALYSIS
> 
> PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
> 
> GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
> generates a call to __aarch64_cas4_acq_rel(), which is a library
> function that provides explicit acquire-release memory ordering
> semantics through either:
> 
> * LSE path (modern ARM64): Using CASAL instruction with built-in
>   memory ordering [1][2]
> 
> * Legacy path (older ARM64): Using LDAXR/STLXR instructions with
>   explicit dmb sy instruction [3]
> 
> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> atomic operation but does NOT emit the necessary Data Memory Barrier
> (DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
(at least for most of them).  And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction. Looking that
up shows:

https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."


> Issue 2: S_UNLOCK() uses only a compiler barrier
> 
> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
> barrier [6].  It prevents the compiler from reordering operations, but
> the CPU can still reorder memory operations. This is fundamentally
> insufficient for ARM64's weaker memory model.

Yea, that seems broken on a non-TSO architecture.  Is the problem fixed if you
change just this to include a proper barrier?

Greetings,

Andres Freund



Hi,

On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> > atomic operation but does NOT emit the necessary Data Memory Barrier
> > (DMB) instructions [4][5].
> 
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
> (at least for most of them).  And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction. Looking that
> up shows:
>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

Further evidence for that is that
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
states:
"This function generates a full memory barrier (or fence) to ensure that memory operations are completed in order."

(note that we are using the function, not the intrinsic for TAS())

Greetings,

Andres



On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
>  The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

A couple of immediate thoughts:

https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170

Doesn't seem to match your conclusion.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)

USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
actually work on Windows, but I don't think we should waste time
implementing it: vendor-supported versions of Windows 11 require
ARMv8.1A to boot[1][2], and that has it, so I think we should probably
just define USE_ARMV8_CRC32C.

+static __forceinline void
+spin_delay(void)
+{
+     /* Reference:
https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
*/
+    __isb(_ARM64_BARRIER_SY);
+}

I don't doubt that barriers are missing in a few places, but how can
this be the right place?

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3] that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

[1] https://en.wikipedia.org/wiki/Windows_11_version_history
[2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com




On Thu, 20 Nov 2025 at 17:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>
 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC.  In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"

+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm.  Doesn't MSVC require intrin.h?

--
nathan


Dave 
On Nov 20 2025, at 7:03 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>  The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
> 
> Thanks for working on this!

You're welcome, thanks for reviewing it. :)

>> 
>> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> atomic operation but does NOT emit the necessary Data Memory Barrier
>> (DMB) instructions [4][5].
> 
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the
> intrinsics inline
> (at least for most of them).  And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction.
> Looking that
> up shows:
>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

I didn't even think to check for a compiler flag for the architecture,
nice call!  If this emits the correct instructions it is a much better
approach.  I'll give it a try, thanks for the nudge.

>> Issue 2: S_UNLOCK() uses only a compiler barrier
>> 
>> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
>> barrier [6].  It prevents the compiler from reordering operations, but
>> the CPU can still reorder memory operations. This is fundamentally
>> insufficient for ARM64's weaker memory model.
> 
> Yea, that seems broken on a non-TSO architecture.  Is the problem
> fixed if you change just this to include a proper barrier?

Using the flag from above, the _ReadWriteBarrier() does (in godbolt) turn
into a casal which (AFAIK) is going to do the trick.  I'll see if I can
update meson.build and get this to work as intended.

> Greetings,
> 
> Andres Freund

best.

-greg



On Nov 20 2025, at 7:07 pm, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
>> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> > atomic operation but does NOT emit the necessary Data Memory Barrier
>> > (DMB) instructions [4][5].
>> 
>> I couldn't reproduce this result when playing around on godbolt. By specifying
>> /arch:armv9.4 msvc can be convinced to emit the code for the
>> intrinsics inline
>> (at least for most of them).  And that makes it visible that
>> _InterlockedCompareExchange() results in a "casal" instruction.
>> Looking that
>> up shows:
>>
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
>> which includes these two statements:
>> "CASA and CASAL load from memory with acquire semantics."
>> "CASL and CASAL store to memory with release semantics."
> 
> Further evidence for that is that
> https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
> states:
> "This function generates a full memory barrier (or fence) to ensure
> that memory operations are completed in order."
> 
> (note that we are using the function, not the intrinsic for TAS())

Got it, thanks.

> Greetings,
> 
> Andres

best.

-greg



On Nov 20 2025, at 5:36 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:

> I took a quick look at 0001.

Thanks for taking a second to review!

> +#ifdef _MSC_VER
> +#include <intrin.h>
> +#else
> #include <arm_acle.h>
> unsigned int crc;
> 
> I think you can remove this since we unconditionally do the runtime check
> for MSVC.  In any case, the missing #endif seems likely to cause
> problems.
> 
> --- a/src/port/pg_crc32c_armv8.c
> +++ b/src/port/pg_crc32c_armv8.c
> @@ -14,7 +14,9 @@
>  */
> #include "c.h"
> 
> +#ifndef _MSC_VER
> #include <arm_acle.h>
> +#endif
> 
> Hm.  Doesn't MSVC require intrin.h?

It does in fact fail to compile without this part of the patch, I think
Dave posted a bug about this.  I added the missing endif, thanks!

> -- 
> nathan

best.

-greg



On Nov 20 2025, at 7:08 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

> On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>>  The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
>
> A couple of immediate thoughts:

Hey Thomas, I'm always interested in your thoughts.  Thanks for taking a
look at this.

> https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170
>
> Doesn't seem to match your conclusion.
>
> +  if cc.get_id() == 'msvc'
> +    cdata.set('USE_ARMV8_CRC32C', false)
> +    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>
> USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
> actually work on Windows, but I don't think we should waste time
> implementing it: vendor-supported versions of Windows 11 require
> ARMv8.1A to boot[1][2], and that has it, so I think we should probably
> just define USE_ARMV8_CRC32C.

I'll give that a go, I think your logic is sound.

> +static __forceinline void
> +spin_delay(void)
> +{
> +     /* Reference:
> https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions
> */
> +    __isb(_ARM64_BARRIER_SY);
> +}
>
> I don't doubt that barriers are missing in a few places, but how can
> this be the right place?

Well, with the compiler flag that Andres mentioned I need to reconsider
this change.

> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.

I do, and I will as soon as I get this sorted.

> [1] https://en.wikipedia.org/wiki/Windows_11_version_history
> [2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
> [3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

best.

-greg



Okay,

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro, the compiler did the rest correctly
AFAICT.  So, a much smaller patch (v2) attached. FWIW I'm using Visual
Studio 2026 (18) to build, other platform information below [1].

The white space/formatting issues seem to have been due to my ineptitude
on Windows running pgindent, or maybe I can blame Perl.  I'll try to dig
if I get a minute to figure out what was the cause and if I need to
patch something.

Also attached is a file with some notes on how I build. The build farm
section isn't finished yet, but there is value in the rest for anyone
doing similar work.  If you're wondering how I learned PowerShell, I
didn't.  I used AI, forgive me.  Eventually I'll have a build animal to
add to the "farm" and update a wiki page somewhere when this solidifies
a bit more. :)

best.

-greg

[1] OS Name    Microsoft Windows 11 Pro
Version    10.0.26100 Build 26100
Other OS Description     Not Available
OS Manufacturer    Microsoft Corporation
System Name    SANTORINI
System Manufacturer    Microsoft Corporation
System Model    Windows Dev Kit 2023
System Type    ARM64-based PC
System SKU    2043
Processor    Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s)
BIOS Version/Date    Microsoft Corporation 13.42.235, 11/13/2024
On Thu, Nov 20, 2025, at 7:08 PM, Thomas Munro wrote:
> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.
>
> [3]
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

Hello Thomas,

I gave your stdatomic patchs a try, here's where it fell apart linking.

[174/1996] Linking target src/interfaces/libpq/libpq.dll
   Creating library src/interfaces/libpq\libpq.lib
[981/1996] Linking target src/interfaces/ecpg/pgtypeslib/libpgtypes.dll
   Creating library src\interfaces\ecpg\pgtypeslib\libpgtypes.lib
[1235/1996] Linking target src/interfaces/ecpg/ecpglib/libecpg.dll
   Creating library src\interfaces\ecpg\ecpglib\libecpg.lib
[1401/1996] Linking target src/backend/postgres.exe
FAILED: [code=1120] src/backend/postgres.exe src/backend/postgres.pdb
"link" @src/backend/postgres.exe.rsp
   Creating library src\backend\postgres.lib
storage_lmgr_s_lock.c.obj : error LNK2019: unresolved external symbol _mm_pause referenced in function
perform_spin_delay
src\backend\postgres.exe : fatal error LNK1120: 1 unresolved externals
[1410/1996] Compiling C object src/backend/snowball/dict_snowball.dll.p/libstemmer_stem_UTF_8_german.c.obj
ninja: build stopped: subcommand failed.

I used your v2 and the attached reduced set of Dave's and my changes.  I'll
work on this next week if I find a bit of time, shouldn't be all that hard.
I thought I'd update you before then.  Also attached is my config script; I
sent out my notes in a separate email.

I forgot to provide the MSVC compiler/linker versions for posterity, they are:
  cl (msvc 19.50.35718 "Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35718 for ARM64")
  link 14.50.35718.0

best.

-greg

Hi,

On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
> had to update the S_UNLOCK() macro, the compiler did the rest correctly
> AFAICT.

Just to be clear - the flag shouldn't be necessary for things to work
correctly. I was only using it to have godbolt inline the intrinsics, rather
than have them generate an out-of-line function call that I couldn't easily
inspect. I'm fairly sure that the out-of-line functions also have the relevant
barriers.


> @@ -2509,7 +2513,10 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> +  if cc.get_id() == 'msvc'
> +    cdata.set('USE_ARMV8_CRC32C', 1)
> +    have_optimized_crc = true

Should have a comment explaining why we can do this unconditionally...


Greetings,

Andres Freund



On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
> > -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> > +  if cc.get_id() == 'msvc'
> > +    cdata.set('USE_ARMV8_CRC32C', 1)
> > +    have_optimized_crc = true
>
> Should have a comment explaining why we can do this unconditionally...

I was wondering about that in light of your revelation that it must be
using /arch:armv8.0 by default.  If we only support Windows/ARM on
armv8.1 hardware (like Windows 11 itself), then I think that means
that we'd want /arch:armv8.1 here:

+  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
+  if host_machine.system() == 'windows' and host_machine.cpu_family() == 'aarch64'
+    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
+  endif
+  endif

We can't impose our own random high ISA requirement like that, or some
machines will choke on illegal instructions.

I wonder if the same applies to Visual Studio on x86.  The OS now
requires x86-64-v2 (like RHEL9, and many other Linux distros except
Debian?), but Visual Studio might not know that... but then we might
say that's an issue for the EDB packaging crew to think about, not us.

In that case we might want to say "OK you choose the ARM version, but
we're not going to write the runtime test for CRC on armv8, we'll do a
compile-time test only, because it would be stupid to waste time
writing code for armv8.0".



On Nov 23 2025, at 10:55 am, Andres Freund <andres@anarazel.de> wrote:

> Hi,
> 
> On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
>> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
>> had to update the S_UNLOCK() macro, the compiler did the rest correctly
>> AFAICT.
> 
> Just to be clear - the flag shouldn't be necessary for things to work
> correctly. I was only using it to have godbolt inline the intrinsics, rather
> than have them generate an out-of-line function call that I couldn't easily
> inspect. I'm fairly sure that the out-of-line functions also have the relevant
> barriers.

Well, I learned something new about MSVC (again) today.  Thanks Andres
for pointing that out.  I fiddled with godbolt a bit and indeed on ARM64
systems the only thing that flag (/arch:armv9.4) does is to replace the
out-of-line function call for the intrinsic with the inlined version of it.

Without:
        bl          _InterlockedCompareExchange

With:
        mov         w10,#2
        str         w8,[sp]
        mov         x9,sp
        casal       w8,w10,[x9]

Good to know, and yes it does seem that the ASM includes the correct
instruction sequence.

I think, except for the build issues, the one thing I can put my finger
on as a "fix" is the change from _ReadWriteBarrier() to __dmb(_ARM64_BARRIER_SY).

The docs say:
The _ReadWriteBarrier intrinsic limits the compiler optimizations that
can remove or reorder memory accesses across the point of the call.

Note that it says "compiler optimizations", not "system memory barrier".
So, this macro change:

 #define S_UNLOCK(lock)    \
-    do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
+    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
 
Makes more sense.  This change replaces a compiler-only barrier with a
full system memory barrier, fundamentally strengthening memory ordering
guarantees for spinlock release on ARM64.

When I remove the compiler flag from the patch and keep the S_UNLOCK()
change the tests pass.  I think we've found the critical fix.  I'll
update the patch set.

>> @@ -2509,7 +2513,10 @@ int main(void)
>>  }
>>  '''
>>  
>> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> +  if cc.get_id() == 'msvc'
>> +    cdata.set('USE_ARMV8_CRC32C', 1)
>> +    have_optimized_crc = true
> 
> Should have a comment explaining why we can do this unconditionally...

Sure, the more I look at this the more I want to clean it up a bit more.

> Greetings,
> 
> Andres Freund

Thanks again for the helpful nudge Andres, best.

-greg



On Nov 23 2025, at 3:32 pm, Thomas Munro <thomas.munro@gmail.com> wrote:

> On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
>> > -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> > +  if cc.get_id() == 'msvc'
>> > +    cdata.set('USE_ARMV8_CRC32C', 1)
>> > +    have_optimized_crc = true
>>
>> Should have a comment explaining why we can do this unconditionally...
>
> I was wondering about that in light of your revelation that it must be
> using /arch:armv8.0 by default.  If we only support Windows/ARM on
> armv8.1 hardware (like Windows 11 itself), then I think that means
> that we'd want /arch:armv8.1 here:
>
> +  # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
> +  if host_machine.system() == 'windows' and host_machine.cpu_family()
> == 'aarch64'
> +    add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
> +  endif

Hey Thomas,

Turns out that flag isn't required at all; the issue was the S_UNLOCK()
implementation's use of a compiler barrier, not a CPU memory barrier.

> We can't impose our own random high ISA requirement like that, or some
> machines will choke on illegal instructions.

New incoming patch drops this entirely, so no need to worry about it.

> I wonder if the same applies to Visual Studio on x86.  The OS now
> requires x86-64-v2 (like RHEL9, and many other Linux distros except
> Debian?), but Visual Studio might not know that... but then we might
> say that's an issue for the EDB packaging crew to think about, not us.
>
> In that case we might want to say "OK you choose the ARM version, but
> we're not going to write the runtime test for CRC on armv8, we'll do a
> compile-time test only, because it would be stupid to waste time
> writing code for armv8.0".

best.

-greg



Hello,

v3 attached is a more concise single patch that adds support for MSVC on Win11 ARM64.  The issues fixed are mostly
related to config, build, docs, an implementation of spin_delay() and a change to S_UNLOCK() macro.  I've squished the
work we did into a single commit and improved the commit message.  Thomas, I think in another thread you mentioned that
some of what Dave added came from you, correct?  We should add you as another author if so, right?
 

best.

-greg

Attachments
On Mon, Nov 24, 2025, at 10:04 AM, Greg Burd wrote:
> Hello,
>
> v3 attached is a more concise single patch that adds support for MSVC 
> on Win11 ARM64.  The issues fixed are mostly related to config, build, 
> docs, an implementation of spin_delay() and a change to S_UNLOCK() 
> macro.  I've squished the work we did into a single commit and improved 
> the commit message.  Thomas, I think in another thread you mentioned 
> that some of what Dave added came from you, correct?  We should add you 
> as another author if so, right?
>
> best.
>
> -greg
>
> Attachments:
> * v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patch

I forgot to account for the non-ARM64 implementation of S_UNLOCK(), thanks CI for pointing that out.

-greg
Attachments
Hi,

On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
> @@ -2509,25 +2513,64 @@ int main(void)
>  }
>  '''
>  
> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> -      args: test_c_args)
> -    # Use ARM CRC Extension unconditionally
> -    cdata.set('USE_ARMV8_CRC32C', 1)
> -    have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
> -      args: test_c_args + ['-march=armv8-a+crc+simd'])
> -    # Use ARM CRC Extension, with runtime check
> -    cflags_crc += '-march=armv8-a+crc+simd'
> -    cdata.set('USE_ARMV8_CRC32C', false)
> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -    have_optimized_crc = true
> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
> -      args: test_c_args + ['-march=armv8-a+crc'])
> -    # Use ARM CRC Extension, with runtime check
> -    cflags_crc += '-march=armv8-a+crc'
> -    cdata.set('USE_ARMV8_CRC32C', false)
> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> -    have_optimized_crc = true
> +  if cc.get_id() == 'msvc'
> +    # MSVC: Intrinsic availability check for ARM64
> +    if host_machine.cpu_family() == 'aarch64'
> +      # Test if CRC32C intrinsics are available in intrin.h
> +      crc32c_test_msvc = '''
> +        #include <intrin.h>
> +        int main(void) {
> +          uint32_t crc = 0;
> +          uint8_t data = 0;
> +          crc = __crc32cb(crc, data);
> +          return 0;
> +        }
> +      '''
> +      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
> +        cdata.set('USE_ARMV8_CRC32C', 1)
> +        have_optimized_crc = true
> +        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
> +      else
> +        message('CRC32C intrinsics not available on this MSVC ARM64 build')
> +      endif

Does this:
a) need to be conditional at all, given that it's msvc specific, it seems we
   don't need to run a test?
b) why is the msvc block outside of the general aarch64 block but then has
another nested aarch64 test inside? That seems unnecessarily complicated and
requires reindenting unnecessarily much code?


> +/*
> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
> + */
> +#ifdef _M_ARM64
> +
> +static __forceinline void
> +spin_delay(void)
> +{
> +     /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
> +    __isb(_ARM64_BARRIER_SY);
> +}
> +#else
> +/*
> + * For x64, use _mm_pause intrinsic instead of rep nop.
> + */
>  static __forceinline void
>  spin_delay(void)
>  {
>      _mm_pause();
>  }

This continues to use a barrier, with a reference to a list of barrier
semantics that really don't seem to make a whole lot of sense in the context
of spin_delay(). If we want to emit this kind of barrier for now it's ok with
me, but it should be documented as just being a fairly random choice, rather
than a link that doesn't explain anything.


> +#endif
>  #else
>  static __forceinline void
>  spin_delay(void)
> @@ -623,9 +640,13 @@ spin_delay(void)
>  #include <intrin.h>
>  #pragma intrinsic(_ReadWriteBarrier)
>  
> -#define S_UNLOCK(lock)    \
> +#ifdef _M_ARM64
> +#define S_UNLOCK(lock) \
> +    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
> +#else

This doesn't seem like the right way to implement this - why not use
InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Greetings,

Andres Freund



On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:
> Hi,

Thanks again for taking a look at the patch, hopefully I got it right this time. :)

> On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
>> @@ -2509,25 +2513,64 @@ int main(void)
>>  }
>>  '''
>>  
>> -  if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>> -      args: test_c_args)
>> -    # Use ARM CRC Extension unconditionally
>> -    cdata.set('USE_ARMV8_CRC32C', 1)
>> -    have_optimized_crc = true
>> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
>> -      args: test_c_args + ['-march=armv8-a+crc+simd'])
>> -    # Use ARM CRC Extension, with runtime check
>> -    cflags_crc += '-march=armv8-a+crc+simd'
>> -    cdata.set('USE_ARMV8_CRC32C', false)
>> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> -    have_optimized_crc = true
>> -  elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
>> -      args: test_c_args + ['-march=armv8-a+crc'])
>> -    # Use ARM CRC Extension, with runtime check
>> -    cflags_crc += '-march=armv8-a+crc'
>> -    cdata.set('USE_ARMV8_CRC32C', false)
>> -    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> -    have_optimized_crc = true
>> +  if cc.get_id() == 'msvc'
>> +    # MSVC: Intrinsic availability check for ARM64
>> +    if host_machine.cpu_family() == 'aarch64'
>> +      # Test if CRC32C intrinsics are available in intrin.h
>> +      crc32c_test_msvc = '''
>> +        #include <intrin.h>
>> +        int main(void) {
>> +          uint32_t crc = 0;
>> +          uint8_t data = 0;
>> +          crc = __crc32cb(crc, data);
>> +          return 0;
>> +        }
>> +      '''
>> +      if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
>> +        cdata.set('USE_ARMV8_CRC32C', 1)
>> +        have_optimized_crc = true
>> +        message('Using ARM64 CRC32C hardware acceleration (MSVC)')
>> +      else
>> +        message('CRC32C intrinsics not available on this MSVC ARM64 build')
>> +      endif
>
> Does this:
> a) need to be conditional at all, given that it's msvc specific, it seems we
>    don't need to run a test?
> b) why is the msvc block outside of the general aarch64 block but then has
> another nested aarch64 test inside? That seems unnecessarily complicated and
> requires reindenting unnecessarily much code?

Yep, I rushed this.  Apologies.  I've re-worked it with your suggestions.

>> +/*
>> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
>> + */
>> +#ifdef _M_ARM64
>> +
>> +static __forceinline void
>> +spin_delay(void)
>> +{
>> +     /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
>> +    __isb(_ARM64_BARRIER_SY);
>> +}
>> +#else
>> +/*
>> + * For x64, use _mm_pause intrinsic instead of rep nop.
>> + */
>>  static __forceinline void
>>  spin_delay(void)
>>  {
>>      _mm_pause();
>>  }
>
> This continues to use a barrier, with a reference to a list of barrier
> semantics that really don't seem to make a whole lot of sense in the context
> of spin_delay(). If we want to emit this kind of barrier for now it's ok with
> me, but it should be documented as just being a fairly random choice, rather
> than a link that doesn't explain anything.

I did more digging and found that you were right about the use of ISB for spin_delay().  I think I was misled by
earlier code in that file (lines 277-286) where there is an implementation of spin_delay() that uses ISB, and I ran
with that without doing enough research myself.  Digging further, I found an article on this [1], and it seems that
YIELD should be used, not ISB.  I checked how others implement this feature: Java [2][3] uses YIELD, and Linux [4][5]
uses YIELD in cpu_relax(), called by __delay().
 

>> +#endif
>>  #else
>>  static __forceinline void
>>  spin_delay(void)
>> @@ -623,9 +640,13 @@ spin_delay(void)
>>  #include <intrin.h>
>>  #pragma intrinsic(_ReadWriteBarrier)
>>  
>> -#define S_UNLOCK(lock)    \
>> +#ifdef _M_ARM64
>> +#define S_UNLOCK(lock) \
>> +    do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
>> +#else
>
> This doesn't seem like the right way to implement this - why not use
> InterlockedExchange(lock, 0)? That will do the write with barrier semantics.

Great idea, done.  Seems to work too.

> Greetings,
>
> Andres Freund

Given what I learned about YIELD vs ISB for spin delay, it seems like a reasonable idea to submit a new patch for the
non-MSVC path and switch it to YIELD, what do you think?
 

v5 attached, best.

-greg

[1]
https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm

[2] https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
[3] https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
[4]
https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
[5] https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403
Attachments