Discussion: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers
Hi all,
Dave and I have been working together to get ARM64 with MSVC functional.
The attached patches accomplish that. Dave is the author of the first
which addresses some build issues and fixes the spin_delay() semantics,
I did the second which fixes some atomics in this combination.
PostgreSQL, when compiled with MSVC for ARM64, fails 027_stream_regress,
in particular when optimizations are enabled (e.g., /O2). After some
investigation and analysis of the generated assembly, Dave Cramer and I
identified the root cause as insufficient memory barrier semantics in
both atomic operations and spinlocks on ARM64 when compiled with MSVC
at /O2.
Dave knew I was in the process of setting up a Win11/ARM64/MSVC
buildfarm animal and pinged me with this issue. Dave got me started on
the path to finding it by sending me his workaround:
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
+#pragma optimize("",off)
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata,
XLogRecPtr fpw_lsn,
@@ -1088,7 +1089,7 @@ XLogInsertRecord(XLogRecData *rdata,
return EndPos;
}
-
+#pragma optimize("",on)
/*
This pointed a finger at the atomics, so I started there. We used a few
tools, but worth noting is https://godbolt.org/ where we were able to
quickly see that the MSVC assembly was missing the "dmb" barriers on
this platform. I'm not sure how long this link will be valid, but in
the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
PROBLEM DESCRIPTION
PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
manifesting as timing-dependent failures in critical sections
protected by spinlocks and atomic variables. The failures are
reproducible when the test suite is compiled with optimization flags
(/O2), particularly in the recovery/027_stream_regress test which
involves WAL replication and standby recovery.
The root cause has two components:
1. Atomic operations lack memory barriers on ARM64
2. MSVC spinlock implementation lacks memory barriers on ARM64
TECHNICAL ANALYSIS
PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
generates a call to __aarch64_cas4_acq_rel(), which is a library
function that provides explicit acquire-release memory ordering
semantics through either:
* LSE path (modern ARM64): Using CASAL instruction with built-in
memory ordering [1][2]
* Legacy path (older ARM64): Using LDAXR/STLXR instructions with
explicit dmb sy instruction [3]
MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
atomic operation but does NOT emit the necessary Data Memory Barrier
(DMB) instructions [4][5].
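For anyone who wants to reproduce the comparison without the godbolt
link, a minimal probe of the kind we used looks like this (a sketch, not
the exact source behind the link); compile it for ARM64 with MSVC (/O2)
and with GCC (-O2) and compare the ordering instructions in the output:

#ifdef _MSC_VER
#include <intrin.h>

long
probe_cas(volatile long *p, long expected, long desired)
{
	/* returns the value observed before the exchange attempt */
	return _InterlockedCompareExchange(p, desired, expected);
}
#else
int
probe_cas(unsigned int *p, unsigned int expected, unsigned int desired)
{
	/* returns nonzero if the CAS succeeded */
	return __atomic_compare_exchange_n(p, &expected, desired, 0,
									   __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
}
#endif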
PART 2: SPINLOCK IMPLEMENTATION LACKS BARRIERS
The MSVC spinlock implementation in src/include/storage/s_lock.h had
two issues on ARM64/MSVC:
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
#define S_UNLOCK(lock) do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
Issue 1: TAS() uses InterlockedCompareExchange without hardware barriers
The InterlockedCompareExchange intrinsic lacks full memory barrier
semantics on ARM64, identical to the atomic operations issue.
Issue 2: S_UNLOCK() uses only a compiler barrier
_ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
barrier [6]. It prevents the compiler from reordering operations, but
the CPU can still reorder memory operations. This is fundamentally
insufficient for ARM64's weaker memory model.
For comparison, GCC's __sync_lock_release() emits actual hardware
barriers.
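To make the distinction concrete, here is a minimal sketch (hypothetical
publish functions, MSVC ARM64 intrinsics from <intrin.h>; this is an
illustration, not code from the patch) contrasting the two kinds of
barrier:

#include <intrin.h>

volatile long data;
volatile long flag;

void
publish_compiler_barrier_only(void)
{
	data = 42;
	_ReadWriteBarrier();		/* the compiler may not reorder the stores,
								 * but the ARM64 CPU still can */
	flag = 1;
}

void
publish_with_hardware_barrier(void)
{
	data = 42;
	__dmb(_ARM64_BARRIER_SY);	/* orders the stores for the CPU as well */
	flag = 1;
}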
IMPACT ON 027_STREAM_REGRESS
The 027_stream_regress test involves WAL replication and standby
recovery — heavily dependent on synchronized access to shared memory
protected by spinlocks [7]. Without proper barriers on ARM64:
1. Thread A acquires spinlock (no full barrier emitted)
2. Thread A modifies shared WAL buffer
3. Thread B acquires spinlock before Thread A's writes become visible
4. Thread B reads stale WAL data
5. WAL replication gets corrupted or hangs indefinitely
6. Test times out waiting for standby to catch up
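Schematically, with a toy shared structure (hypothetical names;
PostgreSQL uses processes rather than threads, but the ordering problem
is identical):

#include <windows.h>
#include <intrin.h>

typedef struct
{
	volatile LONG lock;			/* 0 = free, 1 = held */
	LONG		wal_state;		/* stands in for shared WAL bookkeeping */
} SharedState;

static SharedState *state;		/* lives in shared memory */

void
writer(void)
{
	while (InterlockedCompareExchange(&state->lock, 1, 0) != 0)
		;						/* spin until acquired */
	state->wal_state = 42;		/* update protected state */
	_ReadWriteBarrier();		/* compiler-only barrier, so on ARM64 this
								 * release store can become visible before
								 * the update above */
	state->lock = 0;
}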
WHY ARM32 AND X86/X64 ARE UNAFFECTED
MSVC's _InterlockedCompareExchange does provide full memory barriers on:
* x86/x64: Memory barriers are implicit in the x86 memory model [8]
* ARM32: MSVC explicitly generates full barriers for ARM32 [5]
Only ARM64 lacks the necessary barriers, making this a platform-specific
issue.
ATTACHED SOLUTION
Add explicit DMB (Data Memory Barrier) instructions before and after
atomic operations and spinlock operations on ARM64 to provide sequential
consistency semantics.
0002: src/include/port/atomics/generic-msvc.h
Added platform-specific DMB macros that expand to
__dmb(_ARM64_BARRIER_SY) on ARM64.
Applied to all six atomic operations:
* pg_atomic_compare_exchange_u32_impl()
* pg_atomic_exchange_u32_impl()
* pg_atomic_fetch_add_u32_impl()
* pg_atomic_compare_exchange_u64_impl()
* pg_atomic_exchange_u64_impl()
* pg_atomic_fetch_add_u64_impl()
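The macros are roughly of this shape (PG_DMB is a hypothetical name used
here for illustration; the attached patch has the authoritative
spelling):

#if defined(_M_ARM64)
#define PG_DMB()	__dmb(_ARM64_BARRIER_SY)	/* full system barrier */
#else
#define PG_DMB()	((void) 0)					/* no-op elsewhere */
#endif

/* applied around each operation, for example: */
static __forceinline long
atomic_exchange_sketch(volatile long *ptr, long newval)
{
	long		old;

	PG_DMB();
	old = _InterlockedExchange(ptr, newval);
	PG_DMB();
	return old;
}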
0001: src/include/storage/s_lock.h
Added ARM64-specific spinlock implementation with explicit DMB barriers [9]:
#if defined(_M_ARM64)
#define TAS(lock) tas_msvc_arm64(lock)
static __forceinline int
tas_msvc_arm64(volatile slock_t *lock)
{
int result;
/* Full barrier before atomic operation */
__dmb(_ARM64_BARRIER_SY);
/* Atomic compare-and-swap */
result = InterlockedCompareExchange(lock, 1, 0);
/* Full barrier after atomic operation */
__dmb(_ARM64_BARRIER_SY);
return result;
}
#define S_UNLOCK(lock) \
do { \
	__dmb(_ARM64_BARRIER_SY);	/* Full barrier before release */ \
	(*(lock)) = 0; \
} while (0)
#else
/* Non-ARM64 MSVC: existing implementation unchanged */
#endif
The spinlock acquire now ensures:
* Before CAS: All prior memory operations complete before
acquiring the lock.
* After CAS: The CAS completes before subsequent operations
access the protected data.
The spinlock release now ensures:
* Before writing 0: All critical-section operations are visible
to other threads.
You may ask: why two DMBs in the atomic operations instead of one?
GCC's non-LSE path (LDAXR/STLXR) uses only one DMB because:
* LDAXR (Load-Acquire Exclusive) provides half-barrier acquire
semantics [3]
* STLXR (Store-Release Exclusive) provides half-barrier release
semantics [3]
* One final dmb sy upgrades to full sequential consistency
Since _InterlockedCompareExchange provides NO barrier semantics on
ARM64, we must provide both halves:
* First DMB acts as a release barrier (ensures prior memory ops
complete before CAS)
* Second DMB acts as an acquire barrier (ensures subsequent memory
ops wait for CAS)
* Together they provide sequential consistency matching GCC's
semantics [3]
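In C11 terms, the pair of DMBs approximates a plain seq_cst CAS; a
sketch assuming <stdatomic.h>:

#include <stdatomic.h>
#include <stdbool.h>

static bool
cas_seq_cst(_Atomic unsigned int *ptr, unsigned int *expected,
			unsigned int newval)
{
	/* seq_cst on both the success and failure paths gives the combined
	 * acquire+release ordering described above */
	return atomic_compare_exchange_strong_explicit(ptr, expected, newval,
												   memory_order_seq_cst,
												   memory_order_seq_cst);
}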
VERIFICATION
The fix has been verified by:
1. Spinlock fix resolves 027_stream_regress timeout: Test now passes
consistently on MSVC ARM64 with /O2 optimization without hanging
2. Assembly code inspection: Confirmed that dmb sy instructions now
appear in the optimized assembly for ARM64 builds
3. Platform compatibility: No regression on x86/x64 or ARM32 (macros
expand to no-ops; original code path unchanged)
WHY CLANG/LLVM ON MACOS ARM64 DOESN'T HAVE THIS PROBLEM
PostgreSQL builds successfully on Apple Silicon Macs (ARM64) without
the memory ordering issues observed on MSVC Windows ARM64. The
difference comes down to how Clang/LLVM and MSVC handle atomic
operations.
CLANG/LLVM APPROACH (macOS, Linux, Android ARM64)
Clang/LLVM uses GCC-compatible atomic builtins
(__atomic_compare_exchange_n, etc.) even on platforms where it's not
GCC. The LLVM backend has an AtomicExpand pass that properly expands
these operations to include the appropriate memory barriers for the
target architecture.
On ARM64, Clang generates:
* __aarch64_cas4_acq_rel library calls (or the CASAL instruction with LSE)
* Proper acquire-release semantics built into the instruction sequence
* Automatic full dmb sy barriers where needed
This means PostgreSQL's use of __sync_lock_test_and_set and the
__atomic_* builtins works correctly on macOS ARM64 without additional
patches.
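For reference, the generic GCC-style pattern PostgreSQL relies on
elsewhere has this shape (a sketch of the idiom, not the exact s_lock.h
source):

static inline int
tas_sketch(volatile int *lock)
{
	/* acquire semantics are built into the builtin */
	return __sync_lock_test_and_set(lock, 1);
}

static inline void
s_unlock_sketch(volatile int *lock)
{
	/* release semantics, including the hardware barrier on ARM64 */
	__sync_lock_release(lock);
}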
Phew... I hope I read all those docs correctly and got that right. Feel
free to let me know if I missed something. Looking forward to your
feedback and review so I can get this new build animal up and running.
best.
-greg
[1] ARM Developer: CAS Instructions
https://developer.arm.com/documentation/dui0801/latest/A64-Data-Transfer-Instructions/CASAB--CASALB--CASB--CASLB--A64-
[2] ARM Developer: Load-Acquire and Store-Release Instructions
https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions
[3] ARM Developer: Data Memory Barrier (DMB)
https://developer.arm.com/documentation/100069/0610/A64-General-Instructions/DMB?lang=en
[4] Microsoft Learn: _InterlockedCompareExchange Intrinsic Functions
https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedcompareexchange-intrinsic-functions?view=msvc-170
[5] Microsoft Learn: ARM Intrinsics - Memory Barriers
https://learn.microsoft.com/en-us/cpp/intrinsics/arm-intrinsics?view=msvc-170
[6] Microsoft Learn: _ReadWriteBarrier is a Compiler Barrier
https://learn.microsoft.com/en-us/cpp/intrinsics/compiler-intrinsics?view=msvc-170
[7] PostgreSQL: 027_stream_regress WAL replication testing
https://www.postgresql.org/message-id/193115.1763243897@sss.pgh.pa.us
[8] Intel Volume 3A: Memory Ordering
https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
[9] Microsoft Developer Blog: The AArch64 processor - Barriers
https://devblogs.microsoft.com/oldnewthing/20220812-00/?p=106968
Attachments
Re: [PATCH] Fix ARM64/MSVC atomic memory ordering issues on Win11 by adding explicit DMB barriers
From: Peter Eisentraut
On 20.11.25 21:45, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
> The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.
> - zlib_t = dependency('zlib', required: zlibopt)
> + zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)
This appears to be a change unrelated to your patch description.
Also, the second patch contains a number of random whitespace changes.
It would be good to clean that up so the patch is easier to analyze.
On Nov 20 2025, at 5:00 pm, Peter Eisentraut <peter@eisentraut.org> wrote:
> On 20.11.25 21:45, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>> The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
>
Hi Peter,
Thanks for taking a second to review.
> > - zlib_t = dependency('zlib', required: zlibopt)
> > + zlib_t = dependency('zlib', method : 'pkg-config', required: zlibopt)
>
> This appears to be a change unrelated to your patch description.
Hmmm, these are changes that Dave added to get things to compile. They
worked for me but I'll review them again in the morning and update the description.
> Also, the second patch contains a number of random whitespace changes.
> It would be good to clean that up so the patch is easier to analyze.
Certainly, apologies. I'll clean that up first thing in the morning.
-greg
I took a quick look at 0001.

+#ifdef _MSC_VER
+#include <intrin.h>
+#else
 #include <arm_acle.h>

 unsigned int crc;

I think you can remove this since we unconditionally do the runtime check
for MSVC. In any case, the missing #endif seems likely to cause
problems.

--- a/src/port/pg_crc32c_armv8.c
+++ b/src/port/pg_crc32c_armv8.c
@@ -14,7 +14,9 @@
  */
 #include "c.h"

+#ifndef _MSC_VER
 #include <arm_acle.h>
+#endif

Hm. Doesn't MSVC require intrin.h?

--
nathan
Hi,

On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
> The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

Thanks for working on this!

> This pointed a finger at the atomics, so I started there. We used a few
> tools, but worth noting is https://godbolt.org/ where we were able to
> quickly see that the MSVC assembly was missing the "dmb" barriers on
> this platform. I'm not sure how long this link will be valid, but in
> the short term here's our investigation: https://godbolt.org/z/PPqfxe1bn
>
> PROBLEM DESCRIPTION
>
> PostgreSQL test failures occur intermittently on MSVC ARM64 builds,
> manifesting as timing-dependent failures in critical sections
> protected by spinlocks and atomic variables. The failures are
> reproducible when the test suite is compiled with optimization flags
> (/O2), particularly in the recovery/027_stream_regress test which
> involves WAL replication and standby recovery.
>
> The root cause has two components:
>
> 1. Atomic operations lack memory barriers on ARM64
> 2. MSVC spinlock implementation lacks memory barriers on ARM64
>
> TECHNICAL ANALYSIS
>
> PART 1: ATOMIC OPERATIONS MEMORY BARRIERS
>
> GCC's __atomic_compare_exchange_n() with __ATOMIC_SEQ_CST semantics
> generates a call to __aarch64_cas4_acq_rel(), which is a library
> function that provides explicit acquire-release memory ordering
> semantics through either:
>
> * LSE path (modern ARM64): Using CASAL instruction with built-in
> memory ordering [1][2]
>
> * Legacy path (older ARM64): Using LDAXR/STLXR instructions with
> explicit dmb sy instruction [3]
>
> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> atomic operation but does NOT emit the necessary Data Memory Barrier
> (DMB) instructions [4][5].

I couldn't reproduce this result when playing around on godbolt. By specifying
/arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
(at least for most of them). And that makes it visible that
_InterlockedCompareExchange() results in a "casal" instruction. Looking that
up shows:
https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
which includes these two statements:
"CASA and CASAL load from memory with acquire semantics."
"CASL and CASAL store to memory with release semantics."

> Issue 2: S_UNLOCK() uses only a compiler barrier
>
> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
> barrier [6]. It prevents the compiler from reordering operations, but
> the CPU can still reorder memory operations. This is fundamentally
> insufficient for ARM64's weaker memory model.

Yea, that seems broken on a non-TSO architecture. Is the problem fixed if
you change just this to include a proper barrier?

Greetings,

Andres Freund
Hi,

On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
> > atomic operation but does NOT emit the necessary Data Memory Barrier
> > (DMB) instructions [4][5].
>
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the intrinsics inline
> (at least for most of them). And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction. Looking that
> up shows:
> https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

Further evidence for that is that
https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
states:
"This function generates a full memory barrier (or fence) to ensure that
memory operations are completed in order."

(note that we are using the function, not the intrinsic for TAS())

Greetings,

Andres
On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
> Dave and I have been working together to get ARM64 with MSVC functional.
> The attached patches accomplish that. Dave is the author of the first
> which addresses some build issues and fixes the spin_delay() semantics,
> I did the second which fixes some atomics in this combination.

A couple of immediate thoughts:

https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170

Doesn't seem to match your conclusion.

+  if cc.get_id() == 'msvc'
+    cdata.set('USE_ARMV8_CRC32C', false)
+    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)

USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
actually work on Windows, but I don't think we should waste time
implementing it: vendor-supported versions of Windows 11 require
ARMv8.1A to boot[1][2], and that has it, so I think we should probably
just define USE_ARMV8_CRC32C.

+static __forceinline void
+spin_delay(void)
+{
+	/* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
+	__isb(_ARM64_BARRIER_SY);
+}

I don't doubt that barriers are missing in a few places, but how can
this be the right place?

If you have an environment set up so it's easy to test, I would also
be very interested to know if my patch set[3] that nukes all this
stuff and includes <stdatomic.h> instead, which is green on
Windows/x86 CI, will just work™ there too.

[1] https://en.wikipedia.org/wiki/Windows_11_version_history
[2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com
On Thu, 20 Nov 2025 at 17:36, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I took a quick look at 0001.
>
> +#ifdef _MSC_VER
> +#include <intrin.h>
> +#else
>  #include <arm_acle.h>
>
>  unsigned int crc;
>
> I think you can remove this since we unconditionally do the runtime check
> for MSVC. In any case, the missing #endif seems likely to cause
> problems.
>
> --- a/src/port/pg_crc32c_armv8.c
> +++ b/src/port/pg_crc32c_armv8.c
> @@ -14,7 +14,9 @@
>   */
>  #include "c.h"
>
> +#ifndef _MSC_VER
>  #include <arm_acle.h>
> +#endif
>
> Hm. Doesn't MSVC require intrin.h?
>
> --
> nathan
I posted this as a bug with Microsoft https://developercommunity.visualstudio.com/t/MSVCs-_InterlockedCompareExchange-doe/11004239
Dave
On Nov 20 2025, at 7:03 pm, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2025-11-20 15:45:22 -0500, Greg Burd wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>> The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
>
> Thanks for working on this!

You're welcome, thanks for reviewing it. :)

>> MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> atomic operation but does NOT emit the necessary Data Memory Barrier
>> (DMB) instructions [4][5].
>
> I couldn't reproduce this result when playing around on godbolt. By specifying
> /arch:armv9.4 msvc can be convinced to emit the code for the
> intrinsics inline
> (at least for most of them). And that makes it visible that
> _InterlockedCompareExchange() results in a "casal" instruction.
> Looking that
> up shows:
> https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
> which includes these two statements:
> "CASA and CASAL load from memory with acquire semantics."
> "CASL and CASAL store to memory with release semantics."

I didn't even think to check for a compiler flag for the architecture,
nice call! If this emits the correct instructions it is a much better
approach. I'll give it a try, thanks for the nudge.

>> Issue 2: S_UNLOCK() uses only a compiler barrier
>>
>> _ReadWriteBarrier() is a compiler barrier, NOT a hardware memory
>> barrier [6]. It prevents the compiler from reordering operations, but
>> the CPU can still reorder memory operations. This is fundamentally
>> insufficient for ARM64's weaker memory model.
>
> Yea, that seems broken on a non-TSO architecture. Is the problem
> fixed if you change just this to include a proper barrier?

Using the flag from above the _ReadWriteBarrier() does (in godbolt) turn
into a casal which (AFAIK) is going to do the trick. I'll see if I can
update meson.build and get this to work as intended.

> Greetings,
>
> Andres Freund

best.

-greg
On Nov 20 2025, at 7:07 pm, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2025-11-20 19:03:47 -0500, Andres Freund wrote:
>> > MSVC's _InterlockedCompareExchange() intrinsic on ARM64 performs the
>> > atomic operation but does NOT emit the necessary Data Memory Barrier
>> > (DMB) instructions [4][5].
>>
>> I couldn't reproduce this result when playing around on godbolt. By specifying
>> /arch:armv9.4 msvc can be convinced to emit the code for the
>> intrinsics inline
>> (at least for most of them). And that makes it visible that
>> _InterlockedCompareExchange() results in a "casal" instruction.
>> Looking that
>> up shows:
>> https://developer.arm.com/documentation/dui0801/l/A64-Data-Transfer-Instructions/CASA--CASAL--CAS--CASL--CASAL--CAS--CASL--A64-
>> which includes these two statements:
>> "CASA and CASAL load from memory with acquire semantics."
>> "CASL and CASAL store to memory with release semantics."
>
> Further evidence for that is that
> https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-interlockedcompareexchange
> states:
> "This function generates a full memory barrier (or fence) to ensure
> that memory operations are completed in order."
>
> (note that we are using the function, not the intrinsic for TAS())

Got it, thanks.

> Greetings,
>
> Andres

best.

-greg
On Nov 20 2025, at 5:36 pm, Nathan Bossart <nathandbossart@gmail.com> wrote:
> I took a quick look at 0001.

Thanks for taking a second to review!

> +#ifdef _MSC_VER
> +#include <intrin.h>
> +#else
>  #include <arm_acle.h>
>
>  unsigned int crc;
>
> I think you can remove this since we unconditionally do the runtime check
> for MSVC. In any case, the missing #endif seems likely to cause
> problems.
>
> --- a/src/port/pg_crc32c_armv8.c
> +++ b/src/port/pg_crc32c_armv8.c
> @@ -14,7 +14,9 @@
>   */
>  #include "c.h"
>
> +#ifndef _MSC_VER
>  #include <arm_acle.h>
> +#endif
>
> Hm. Doesn't MSVC require intrin.h?

It does in fact fail to compile without this part of the patch, I think
Dave posted a bug about this. I added the missing #endif, thanks!

> --
> nathan

best.

-greg
On Nov 20 2025, at 7:08 pm, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Nov 21, 2025 at 9:45 AM Greg Burd <greg@burd.me> wrote:
>> Dave and I have been working together to get ARM64 with MSVC functional.
>> The attached patches accomplish that. Dave is the author of the first
>> which addresses some build issues and fixes the spin_delay() semantics,
>> I did the second which fixes some atomics in this combination.
>
> A couple of immediate thoughts:

Hey Thomas, I'm always interested in your thoughts. Thanks for taking a
look at this.

> https://learn.microsoft.com/en-us/cpp/intrinsics/interlockedexchangeadd-intrinsic-functions?view=msvc-170
>
> Doesn't seem to match your conclusion.
>
> +  if cc.get_id() == 'msvc'
> +    cdata.set('USE_ARMV8_CRC32C', false)
> +    cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>
> USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK (pg_crc32c_armv8_choose.c) won't
> actually work on Windows, but I don't think we should waste time
> implementing it: vendor-supported versions of Windows 11 require
> ARMv8.1A to boot[1][2], and that has it, so I think we should probably
> just define USE_ARMV8_CRC32C.

I'll give that a go, I think your logic is sound.

> +static __forceinline void
> +spin_delay(void)
> +{
> +	/* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
> +	__isb(_ARM64_BARRIER_SY);
> +}
>
> I don't doubt that barriers are missing in a few places, but how can
> this be the right place?

Well, with the compiler flag that Andres mentioned I need to reconsider
this change.

> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.

I do, and I will as soon as I get this sorted.

> [1] https://en.wikipedia.org/wiki/Windows_11_version_history
> [2] https://learn.microsoft.com/en-us/lifecycle/products/windows-11-home-and-pro
> [3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

best.

-greg
Okay,

With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
had to update the S_UNLOCK() macro; the compiler did the rest correctly
AFAICT. So, a much smaller patch (v2) attached. FWIW I'm using Visual
Studio 2026 (18) to build, other platform information below [1].

The whitespace/formatting issues seem to have been due to my ineptitude
on Windows running pgindent, or maybe I can blame Perl. I'll try to dig
in if I get a minute to figure out what the cause was and whether I need
to patch something.

Also attached is a file with some notes on how I build. The build farm
section isn't finished yet, but there is value in the rest for anyone
doing similar work. If you're wondering how I learned PowerShell, I
didn't. I used AI, forgive me. Eventually I'll have a build animal to
add to the "farm" and update a wiki page somewhere when this solidifies
a bit more. :)

best.

-greg

[1]
OS Name                Microsoft Windows 11 Pro
Version                10.0.26100 Build 26100
Other OS Description   Not Available
OS Manufacturer        Microsoft Corporation
System Name            SANTORINI
System Manufacturer    Microsoft Corporation
System Model           Windows Dev Kit 2023
System Type            ARM64-based PC
System SKU             2043
Processor              Snapdragon Compute Platform, 2995 Mhz, 8 Core(s), 8 Logical Processor(s)
BIOS Version/Date      Microsoft Corporation 13.42.235, 11/13/2024
Attachments
On Thu, Nov 20, 2025, at 7:08 PM, Thomas Munro wrote:
> If you have an environment set up so it's easy to test, I would also
> be very interested to know if my patch set[3] that nukes all this
> stuff and includes <stdatomic.h> instead, which is green on
> Windows/x86 CI, will just work™ there too.
>
> [3] https://www.postgresql.org/message-id/flat/CA%2BhUKGKFvu3zyvv3aaj5hHs9VtWcjFAmisOwOc7aOZNc5AF3NA%40mail.gmail.com

Hello Thomas,

I gave your stdatomic patches a try, here's where it fell apart linking.

[174/1996] Linking target src/interfaces/libpq/libpq.dll
   Creating library src/interfaces/libpq\libpq.lib
[981/1996] Linking target src/interfaces/ecpg/pgtypeslib/libpgtypes.dll
   Creating library src\interfaces\ecpg\pgtypeslib\libpgtypes.lib
[1235/1996] Linking target src/interfaces/ecpg/ecpglib/libecpg.dll
   Creating library src\interfaces\ecpg\ecpglib\libecpg.lib
[1401/1996] Linking target src/backend/postgres.exe
FAILED: [code=1120] src/backend/postgres.exe src/backend/postgres.pdb
"link" @src/backend/postgres.exe.rsp
   Creating library src\backend\postgres.lib
storage_lmgr_s_lock.c.obj : error LNK2019: unresolved external symbol _mm_pause referenced in function perform_spin_delay
src\backend\postgres.exe : fatal error LNK1120: 1 unresolved externals
[1410/1996] Compiling C object src/backend/snowball/dict_snowball.dll.p/libstemmer_stem_UTF_8_german.c.obj
ninja: build stopped: subcommand failed.

I used your v2 and the attached reduced set of Dave's and my changes.
I'll work on this next week if I find a bit of time, shouldn't be all
that hard. I thought I'd update you before then. Also attached is my
config script; I sent out my notes in a separate email.

I forgot to provide the MSVC compiler/linker versions for posterity,
they are:

cl (msvc 19.50.35718 "Microsoft (R) C/C++ Optimizing Compiler Version 19.50.35718 for ARM64")
link 14.50.35718.0

best.

-greg
Attachments
Hi,
On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
> had to update the S_UNLOCK() macro, the compiler did the rest correctly
> AFAICT.
Just to be clear - the flag shouldn't be necessary for things to work
correctly. I was only using it to have godbolt inline the intrinsics, rather
than have them generate an out-of-line function call that I couldn't easily
inspect. I'm fairly sure that the out-of-line functions also have the relevant
barriers.
> @@ -2509,7 +2513,10 @@ int main(void)
> }
> '''
>
> - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> + if cc.get_id() == 'msvc'
> + cdata.set('USE_ARMV8_CRC32C', 1)
> + have_optimized_crc = true
Should have a comment explaining why we can do this unconditionally...
Greetings,
Andres Freund
On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
> > - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> > + if cc.get_id() == 'msvc'
> > + cdata.set('USE_ARMV8_CRC32C', 1)
> > + have_optimized_crc = true
>
> Should have a comment explaining why we can do this unconditionally...
I was wondering about that in light of your revelation that it must be
using /arch:armv8.0 by default. If we only support Windows/ARM on
armv8.1 hardware (like Windows 11 itself), then I think that means
that we'd want /arch:armv8.1 here:
+ # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
+ if host_machine.system() == 'windows' and host_machine.cpu_family() == 'aarch64'
+   add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
+ endif
We can't impose our own random high ISA requirement like that, or some
machines will choke on illegal instructions.
I wonder if the same applies to Visual Studio on x86. The OS now
requires x86-64-v2 (like RHEL9, and many other Linux distros except
Debian?), but Visual Studio might not know that... but then we might
say that's an issue for the EDB packaging crew to think about, not us.
In that case we might want to say "OK you choose the ARM version, but
we're not going to write the runtime test for CRC on armv8, we'll do a
compile-time test only, because it would be stupid to waste time
writing code for armv8.0".
On Nov 23 2025, at 10:55 am, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2025-11-22 16:43:30 -0500, Greg Burd wrote:
>> With the new MSVC compiler flag Andres mentioned (/arch:armv9.4) I only
>> had to update the S_UNLOCK() macro, the compiler did the rest correctly
>> AFAICT.
>
> Just to be clear - the flag shouldn't be necessary for things to work
> correctly. I was only using it to have godbolt inline the intrinsics, rather
> than have them generate an out-of-line function call that I couldn't easily
> inspect. I'm fairly sure that the out-of-line functions also have the relevant
> barriers.
Well, I learned something new about MSVC (again) today. Thanks Andres
for pointing that out. I fiddled with godbolt a bit and indeed on ARM64
systems the only thing that flag (/arch:armv9.4) does is to replace the
out-of-line function call for the intrinsic with the inlined version of it.
Without:
bl _InterlockedCompareExchange
With:
mov w10,#2
str w8,[sp]
mov x9,sp
casal w8,w10,[x9]
Good to know, and yes it does seem that the ASM includes the correct
instruction sequence.
I think, except for the build issues, the one thing I can put my finger
on as a "fix" is the change from _ReadWriteBarrier() to __dmb(_ARM64_BARRIER_SY).
The docs say:
The _ReadWriteBarrier intrinsic limits the compiler optimizations that
can remove or reorder memory accesses across the point of the call.
Note that it says "compiler optimization" not "system memory barrier".
So, this macro change:
#define S_UNLOCK(lock) \
- do { _ReadWriteBarrier(); (*(lock)) = 0; } while (0)
+ do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
Makes more sense. This change replaces a compiler-only barrier with a
full system memory barrier, fundamentally strengthening memory ordering
guarantees for spinlock release on ARM64.
When I remove the compiler flag from the patch and keep the S_UNLOCK()
change the tests pass. I think we've found the critical fix. I'll
update the patch set.
>> @@ -2509,7 +2513,10 @@ int main(void)
>> }
>> '''
>>
>> - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> + if cc.get_id() == 'msvc'
>> + cdata.set('USE_ARMV8_CRC32C', 1)
>> + have_optimized_crc = true
>
> Should have a comment explaining why we can do this unconditionally...
Sure, the more I look at this the more I want to clean it up a bit more.
> Greetings,
>
> Andres Freund
Thanks again for the helpful nudge Andres, best.
-greg
On Nov 23 2025, at 3:32 pm, Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Nov 24, 2025 at 4:55 AM Andres Freund <andres@anarazel.de> wrote:
>> > - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and
>> __crc32cd without -march=armv8-a+crc',
>> > + if cc.get_id() == 'msvc'
>> > + cdata.set('USE_ARMV8_CRC32C', 1)
>> > + have_optimized_crc = true
>>
>> Should have a comment explaining why we can do this unconditionally...
>
> I was wondering about that in light of your revelation that it must be
> using /arch:armv8.0 by default. If we only support Windows/ARM on
> armv8.1 hardware (like Windows 11 itself), then I think that means
> that we'd want /arch:armv8.1 here:
>
> + # Add ARM64 architecture flag for Windows 11 ARM64 for correct intrinsics
> + if host_machine.system() == 'windows' and host_machine.cpu_family() == 'aarch64'
> +   add_project_arguments('/arch:armv9.4', language: ['c', 'cpp'])
> + endif
Hey Thomas,
Turns out that flag isn't required at all; the issue was the S_UNLOCK()
implementation's use of a compiler barrier rather than a CPU memory barrier.
> We can't impose our own random high ISA requirement like that, or some
> machines will choke on illegal instructions.
New incoming patch drops this entirely, so no need to worry about it.
> I wonder if the same applies to Visual Studio on x86. The OS now
> requires x86-64-v2 (like RHEL9, and many other Linux distros except
> Debian?), but Visual Studio might not know that... but then we might
> say that's an issue for the EDB packaging crew to think about, not us.
>
> In that case we might want to say "OK you choose the ARM version, but
> we're not going to write the runtime test for CRC on armv8, we'll do a
> compile-time test only, because it would be stupid to waste time
> writing code for armv8.0".
best.
-greg
Hello,

v3 attached is a more concise single patch that adds support for MSVC
on Win11 ARM64. The issues fixed are mostly related to config, build,
docs, an implementation of spin_delay() and a change to S_UNLOCK()
macro. I've squished the work we did into a single commit and improved
the commit message. Thomas, I think in another thread you mentioned
that some of what Dave added came from you, correct? We should add you
as another author if so, right?

best.

-greg
Attachments
On Mon, Nov 24, 2025, at 10:04 AM, Greg Burd wrote:
> Hello,
>
> v3 attached is a more concise single patch that adds support for MSVC
> on Win11 ARM64. The issues fixed are mostly related to config, build,
> docs, an implementation of spin_delay() and a change to S_UNLOCK()
> macro. I've squished the work we did into a single commit and improved
> the commit message. Thomas, I think in another thread you mentioned
> that some of what Dave added came from you, correct? We should add you
> as another author if so, right?
>
> best.
>
> -greg
>
> Attachments:
> * v3-0001-Enable-PostgreSQL-build-on-Windows-11-ARM64-with-.patch

I forgot to account for the non-ARM64 implementation of S_UNLOCK(),
thanks CI for pointing that out.

-greg
Attachments
Hi,
On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
> @@ -2509,25 +2513,64 @@ int main(void)
> }
> '''
>
> - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
> - args: test_c_args)
> - # Use ARM CRC Extension unconditionally
> - cdata.set('USE_ARMV8_CRC32C', 1)
> - have_optimized_crc = true
> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
> - args: test_c_args + ['-march=armv8-a+crc+simd'])
> - # Use ARM CRC Extension, with runtime check
> - cflags_crc += '-march=armv8-a+crc+simd'
> - cdata.set('USE_ARMV8_CRC32C', false)
> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> - have_optimized_crc = true
> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
> - args: test_c_args + ['-march=armv8-a+crc'])
> - # Use ARM CRC Extension, with runtime check
> - cflags_crc += '-march=armv8-a+crc'
> - cdata.set('USE_ARMV8_CRC32C', false)
> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
> - have_optimized_crc = true
> + if cc.get_id() == 'msvc'
> + # MSVC: Intrinsic availability check for ARM64
> + if host_machine.cpu_family() == 'aarch64'
> + # Test if CRC32C intrinsics are available in intrin.h
> + crc32c_test_msvc = '''
> + #include <intrin.h>
> + int main(void) {
> + uint32_t crc = 0;
> + uint8_t data = 0;
> + crc = __crc32cb(crc, data);
> + return 0;
> + }
> + '''
> + if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
> + cdata.set('USE_ARMV8_CRC32C', 1)
> + have_optimized_crc = true
> + message('Using ARM64 CRC32C hardware acceleration (MSVC)')
> + else
> + message('CRC32C intrinsics not available on this MSVC ARM64 build')
> + endif
Does this:
a) need to be conditional at all, given that it's msvc specific, it seems we
don't need to run a test?
b) why is the msvc block outside of the general aarch64 block but then has
another nested aarch64 test inside? That seems unnecessarily complicated and
requires reindenting unnecessarily much code?
> +/*
> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
> + */
> +#ifdef _M_ARM64
> +
> +static __forceinline void
> +spin_delay(void)
> +{
> + /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
> + __isb(_ARM64_BARRIER_SY);
> +}
> +#else
> +/*
> + * For x64, use _mm_pause intrinsic instead of rep nop.
> + */
> static __forceinline void
> spin_delay(void)
> {
> _mm_pause();
> }
This continues to use a barrier, with a reference to a list of barrier
semantics that really don't seem to make a whole lot of sense in the context
of spin_delay(). If we want to emit this kind of barrier for now it's ok with
me, but it should be documented as just being a fairly random choice, rather
than a link that doesn't explain anything.
> +#endif
> #else
> static __forceinline void
> spin_delay(void)
> @@ -623,9 +640,13 @@ spin_delay(void)
> #include <intrin.h>
> #pragma intrinsic(_ReadWriteBarrier)
>
> -#define S_UNLOCK(lock) \
> +#ifdef _M_ARM64
> +#define S_UNLOCK(lock) \
> + do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
> +#else
This doesn't seem like the right way to implement this - why not use
InterlockedExchange(lock, 0)? That will do the write with barrier semantics.
Greetings,
Andres Freund
On Mon, Nov 24, 2025, at 6:20 PM, Andres Freund wrote:
> Hi,
Thanks again for taking a look at the patch, hopefully I got it right this time. :)
> On 2025-11-24 11:28:28 -0500, Greg Burd wrote:
>> @@ -2509,25 +2513,64 @@ int main(void)
>> }
>> '''
>>
>> - if cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd without -march=armv8-a+crc',
>> - args: test_c_args)
>> - # Use ARM CRC Extension unconditionally
>> - cdata.set('USE_ARMV8_CRC32C', 1)
>> - have_optimized_crc = true
>> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc+simd',
>> - args: test_c_args + ['-march=armv8-a+crc+simd'])
>> - # Use ARM CRC Extension, with runtime check
>> - cflags_crc += '-march=armv8-a+crc+simd'
>> - cdata.set('USE_ARMV8_CRC32C', false)
>> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> - have_optimized_crc = true
>> - elif cc.links(prog, name: '__crc32cb, __crc32ch, __crc32cw, and __crc32cd with -march=armv8-a+crc',
>> - args: test_c_args + ['-march=armv8-a+crc'])
>> - # Use ARM CRC Extension, with runtime check
>> - cflags_crc += '-march=armv8-a+crc'
>> - cdata.set('USE_ARMV8_CRC32C', false)
>> - cdata.set('USE_ARMV8_CRC32C_WITH_RUNTIME_CHECK', 1)
>> - have_optimized_crc = true
>> + if cc.get_id() == 'msvc'
>> + # MSVC: Intrinsic availability check for ARM64
>> + if host_machine.cpu_family() == 'aarch64'
>> + # Test if CRC32C intrinsics are available in intrin.h
>> + crc32c_test_msvc = '''
>> + #include <intrin.h>
>> + int main(void) {
>> + uint32_t crc = 0;
>> + uint8_t data = 0;
>> + crc = __crc32cb(crc, data);
>> + return 0;
>> + }
>> + '''
>> + if cc.links(crc32c_test_msvc, name: '__crc32cb intrinsic available')
>> + cdata.set('USE_ARMV8_CRC32C', 1)
>> + have_optimized_crc = true
>> + message('Using ARM64 CRC32C hardware acceleration (MSVC)')
>> + else
>> + message('CRC32C intrinsics not available on this MSVC ARM64 build')
>> + endif
>
> Does this:
> a) need to be conditional at all, given that it's msvc specific, it seems we
> don't need to run a test?
> b) why is the msvc block outside of the general aarch64 block but then has
> another nested aarch64 test inside? That seems unnecessarily complicated and
> requires reindenting unnecessarily much code?
Yep, I rushed this. Apologies. I've re-worked it with your suggestions.
>> +/*
>> + * For Arm64, use __isb intrinsic. See aarch64 inline assembly definition for details.
>> + */
>> +#ifdef _M_ARM64
>> +
>> +static __forceinline void
>> +spin_delay(void)
>> +{
>> + /* Reference: https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics#BarrierRestrictions */
>> + __isb(_ARM64_BARRIER_SY);
>> +}
>> +#else
>> +/*
>> + * For x64, use _mm_pause intrinsic instead of rep nop.
>> + */
>> static __forceinline void
>> spin_delay(void)
>> {
>> _mm_pause();
>> }
>
> This continues to use a barrier, with a reference to a list of barrier
> semantics that really don't seem to make a whole lot of sense in the context
> of spin_delay(). If we want to emit this kind of barrier for now it's ok with
> me, but it should be documented as just being a fairly random choice, rather
> than a link that doesn't explain anything.
I did more digging and found that you were right about the use of ISB for
spin_delay(). I think I was misled by earlier code in that file (lines
277-286) where there is an implementation of spin_delay() that uses ISB;
I ran with that without doing enough research myself. I found an article
on this [1], and it seems that YIELD should be used, not ISB. I checked
into how others implement this feature: Java [2][3] uses YIELD, and Linux
[4][5] uses YIELD in cpu_relax() called by __delay().
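If we go that route, the MSVC side would presumably look like this (a
sketch; __yield() is MSVC's documented ARM64 spin-wait hint intrinsic,
and it implies no barrier semantics):

#ifdef _M_ARM64
static __forceinline void
spin_delay(void)
{
	/* YIELD: hint that we're in a spin-wait loop, per the ARM guidance
	 * cited below */
	__yield();
}
#endif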
>> +#endif
>> #else
>> static __forceinline void
>> spin_delay(void)
>> @@ -623,9 +640,13 @@ spin_delay(void)
>> #include <intrin.h>
>> #pragma intrinsic(_ReadWriteBarrier)
>>
>> -#define S_UNLOCK(lock) \
>> +#ifdef _M_ARM64
>> +#define S_UNLOCK(lock) \
>> + do { __dmb(_ARM64_BARRIER_SY); (*(lock)) = 0; } while (0)
>> +#else
>
> This doesn't seem like the right way to implement this - why not use
> InterlockedExchange(lock, 0)? That will do the write with barrier semantics.
Great idea, done. Seems to work too.
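(For readers following along, the resulting macro presumably has this
shape; a sketch, the attached patch is authoritative:)

#ifdef _M_ARM64
/* InterlockedExchange is a full-barrier RMW on ARM64, so the release
 * store needs no separate DMB */
#define S_UNLOCK(lock) \
	do { InterlockedExchange((volatile LONG *) (lock), 0); } while (0)
#endif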
> Greetings,
>
> Andres Freund
Given what I learned about YIELD vs ISB for spin delay, it seems like a
reasonable idea to submit a new patch for the non-MSVC path and switch
it to YIELD. What do you think?
v5 attached, best.
-greg
[1]
https://developer.arm.com/community/arm-community-blogs/b/architectures-and-processors-blog/posts/multi-threaded-applications-arm
[2] https://cr.openjdk.org/~dchuyko/8186670/yield/spinwait.html
[3] https://mail.openjdk.org/pipermail/aarch64-port-dev/2017-August/004880.html
[4]
https://github.com/torvalds/linux/blob/ac3fd01e4c1efce8f2c054cdeb2ddd2fc0fb150d/arch/arm64/include/asm/vdso/processor.h
[5] https://github.com/torvalds/linux/commit/f511e079177a9b97175a9a3b0ee2374d55682403