Discussion: BUG #16990: Random PANIC in qemu user context
The following bug has been logged on the website:

Bug reference:      16990
Logged by:          Paul Guyot
Email address:      pguyot@kallisys.net
PostgreSQL version: 11.11
Operating system:   qemu-arm-static chrooted raspios inside ubuntu
Description:

Within a GitHub Actions workflow, a qemu chrooted environment is created
from a RaspiOS Lite image, within which the latest available postgresql is
installed from apt (postgresql 11.11).

Then tests of embedded software are executed, which include creating a
postgresql database and performing a few benign operations (as far as
PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
environment as well as on real devices.

Within this qemu context, randomly yet quite frequently, postgresql
PANICs. The latest log was the following:

2021-05-02 09:22:21.591 BST [15024] PANIC: stuck spinlock detected at LWLockWaitListLock, /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:21.597 BST [15022] PANIC: stuck spinlock detected at LWLockWaitListLock, /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:21.762 BST [15423] pynab@test_pynab PANIC: stuck spinlock detected at LWLockWaitListLock, /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
2021-05-02 09:22:21.762 BST [15423] pynab@test_pynab STATEMENT: SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'auth'
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:24.481 BST [15011] LOG: server process (PID 15423) was terminated by signal 6: Aborted
2021-05-02 09:22:24.481 BST [15011] DETAIL: Failed process was running: SELECT "django_content_type"."id", "django_content_type"."app_label", "django_content_type"."model" FROM "django_content_type" WHERE "django_content_type"."app_label" = 'auth'
2021-05-02 09:22:24.481 BST [15011] LOG: terminating any other active server processes
2021-05-02 09:22:24.567 BST [15011] LOG: all server processes terminated; reinitializing
2021-05-02 09:22:24.601 BST [15512] LOG: database system was interrupted; last known up at 2021-05-02 09:18:11 BST
2021-05-02 09:22:24.692 BST [15512] LOG: database system was not properly shut down; automatic recovery in progress
2021-05-02 09:22:24.699 BST [15512] LOG: redo starts at 0/171E170
2021-05-02 09:22:25.045 BST [15512] LOG: invalid record length at 0/1957948: wanted 24, got 0
2021-05-02 09:22:25.046 BST [15512] LOG: redo done at 0/1957910
2021-05-02 09:22:25.048 BST [15512] LOG: last completed transaction was at log time 2021-05-02 09:20:04.917746+01
2021-05-02 09:22:25.096 BST [15011] LOG: database system is ready to accept connections

The log is publicly available here:
https://github.com/pguyot/pynab/runs/2485660214?check_suite_focus=true

Notice how sluggish the test is compared to runs where PostgreSQL doesn't
PANIC, in the same environment. For example, this run worked perfectly in
under 20 minutes:
https://github.com/pguyot/pynab/runs/2483559259?check_suite_focus=true

I tried to update the CI script to upload the full raspbian image in case
of panics, to get my hands on the core dump, but it's so sluggish that I'm
not sure it won't eventually time out. I wonder whether this sluggishness
is itself a cause of the PANIC.
Could you please advise on how to investigate this crash further?
PG Bug reporting form <noreply@postgresql.org> writes:
> Within a GitHub Actions workflow, a qemu chrooted environment is created
> from a RaspiOS Lite image, within which the latest available postgresql
> is installed from apt (postgresql 11.11).
> Then tests of embedded software are executed, which include creating a
> postgresql database and performing a few benign operations (as far as
> PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
> environment as well as on real devices.
> Within this qemu context, randomly yet quite frequently, postgresql
> PANICs.
> The latest log was the following:
> 2021-05-02 09:22:21.591 BST [15024] PANIC: stuck spinlock detected at
> LWLockWaitListLock,
> /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832

Hm.  Looking at the lwlock.c source code, that's not actually a stuck
spinlock (in the sense of a loop around a TAS() call), but a loop waiting
for an LWLock's LW_FLAG_LOCKED bit to become clear.  It's morally the same
thing though, in that we don't expect the conflicting lock to be held for
more than a few instructions, so we just busy-wait and delay until the
lock can be obtained.

Seems like there are a few possible explanations:

1. Compiler bug generating incorrect code for the wait loop (e.g., failing
to re-fetch the volatile variable each time through).  The difficulty with
this theory is that then you'd expect to see the same freezeup in normal
non-qemu execution.  But maybe qemu slows things down enough that the
window for contention on an LWLock can be hit, whereas you'd hardly ever
see that without qemu.  Seems unlikely, but maybe it'd be worth
disassembling LWLockWaitListLock to check.

2. qemu bug in emulating the atomic-update instructions that are used to
set/clear LW_FLAG_LOCKED.  This doesn't seem real probable either, but
maybe it's the most likely of a bad lot.

3. qemu is so slow that the spinlock delay times out.  I don't believe
this one either, mainly because we haven't seen it in our own occasional
uses of qemu; and if it were that slow it'd be entirely unusable.  The
spinlock timeout is normally multiple seconds, which is several orders of
magnitude longer than such locks ought to be held.

4. Postgres bug causing the lock to never get released.  This theory has
the same problem as #1, ie you have to explain why it's not seen in any
other environment.

5. The lock does get released, but there are enough processes contending
for it that some process times out before it successfully acquires the
lock.  It's possible perhaps that that could happen under a very high-load
scenario, but that doesn't seem like the category of test that would be
sane to run under qemu.

Not sure what to tell you, other than "make sure qemu and your build
toolchain are up-to-date".

			regards, tom lane
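[Editor's note: for readers unfamiliar with lwlock.c, the following is a
minimal C11 sketch of the busy-wait pattern Tom describes. The names
(LWLockSketch, lock_wait_list) are hypothetical; the real code is
LWLockWaitListLock() in src/backend/storage/lmgr/lwlock.c and uses
PostgreSQL's own atomics plus perform_spin_delay().]

    #include <stdatomic.h>
    #include <stdint.h>

    #define LW_FLAG_LOCKED (1U << 28)   /* 0x10000000, the bit in question */

    typedef struct
    {
        _Atomic uint32_t state;         /* the real LWLock has more fields */
    } LWLockSketch;

    static void
    lock_wait_list(LWLockSketch *lock)
    {
        for (;;)
        {
            /* Atomically try to set the flag; the returned old value
             * tells us whether somebody else already held it. */
            uint32_t old = atomic_fetch_or(&lock->state, LW_FLAG_LOCKED);

            if ((old & LW_FLAG_LOCKED) == 0)
                return;                 /* we set the bit; list is ours */

            /* Busy-wait until the bit reads clear, then retry.  In the
             * real code each iteration calls perform_spin_delay(), which
             * sleeps with increasing delays and eventually PANICs with
             * "stuck spinlock detected" if the bit never clears. */
            while (atomic_load(&lock->state) & LW_FLAG_LOCKED)
                ;                       /* perform_spin_delay() goes here */
        }
    }

This is why the PANIC can fire even though no TAS()-style spinlock is
involved: if the flag never reads as clear, the inner loop exhausts the
spin-delay budget exactly as a stuck spinlock would.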
> Not sure what to tell you, other than "make sure qemu and your
> build toolchain are up-to-date".

In this scenario, I use postgresql 11.11 as compiled by the raspbian
folks. I also used the qemu binary provided by ubuntu for focal, which
happens to be 4.2 (not the latest).

I found the corresponding function using readelf to locate the string
constant. For the record, the C function is here:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L811

The tight read loop is as follows:

  32b548:  e28d0004  add    r0, sp, #4
  32b54c:  eb000679  bl     32cf38 <perform_spin_delay@@Base>
  32b550:  e5943004  ldr    r3, [r4, #4]
  32b554:  e3130201  tst    r3, #268435456  ; 0x10000000
  32b558:  1afffffa  bne    32b548 <RememberSimpleDeadLock@@Base+0xc4>

At address 32b550 it does perform a read, honoring the volatile pointer.

I guess the lock is acquired by the same function:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L824

The corresponding code is the following:

  32b508:  ee070fba  mcr    15, 0, r0, cr7, cr10, {5}
  32b50c:  e1953f9f  ldrex  r3, [r5]
  32b510:  e3832201  orr    r2, r3, #268435456  ; 0x10000000
  32b514:  e1851f92  strex  r1, r2, [r5]
  32b518:  e3510000  cmp    r1, #0
  32b51c:  1afffffa  bne    32b50c <RememberSimpleDeadLock@@Base+0x88>
  32b520:  e3130201  tst    r3, #268435456  ; 0x10000000
  32b524:  ee070fba  mcr    15, 0, r0, cr7, cr10, {5}
  32b528:  0a00000e  beq    32b568 <RememberSimpleDeadLock@@Base+0xe4>

mcr 15, 0, r0, cr7, cr10, {5} is __sync_synchronize(), and based on the
previous instructions, r5 is equal to r4+4 as used in the tight loop.

I also guess the corresponding unlock function just follows, and
disassembling it reveals the same use of __sync_synchronize():

  32b644:  ee070fba  mcr    15, 0, r0, cr7, cr10, {5}
  32b648:  e1932f9f  ldrex  r2, [r3]
  32b64c:  e3c22201  bic    r2, r2, #268435456  ; 0x10000000
  32b650:  e1831f92  strex  r1, r2, [r3]
  32b654:  e3510000  cmp    r1, #0
  32b658:  1afffffa  bne    32b648 <RememberSimpleDeadLock@@Base+0x1c4>
  32b65c:  ee070fba  mcr    15, 0, r0, cr7, cr10, {5}
  32b660:  e8bd8070  pop    {r4, r5, r6, pc}

QEMU user emulation documentation mentions something specific to threading
on ARM:
https://qemu.readthedocs.io/en/latest/user/main.html

> Threading:
> On Linux, QEMU can emulate the clone syscall and create a real host
> thread (with a separate virtual CPU) for each emulated thread. Note that
> not all targets currently emulate atomic operations correctly. x86 and
> Arm use a global lock in order to preserve their semantics.

I have yet to determine what impact it could have here. Can we imagine a
situation where the memory barrier was not honored and an unlock would be
overwritten with a lock?

Eventually, I tried to run the whole script with taskset -c 0 (which is
fine for the tests, as the target system, a Raspberry Pi Zero, is
single-core, while GitHub Linux runners have 2 vCPUs):
https://github.com/pguyot/pynab/commit/91011e68e446c69e317fd1198c58f85ff0cd5fb1
https://github.com/pguyot/pynab/runs/2486051700?check_suite_focus=true

I have run it four times so far, and no postgresql PANIC has happened. So
your hypothesis of a bug (limitation) in qemu 4.2 seems probable…

FYI, newer ARM architectures, starting with armv7l, have a dedicated
instruction for memory barriers (dmb), which is not used here as it is not
recognized by the Raspberry Pi Zero CPU.

Paul
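[Editor's note: to map the listings back to C, here is a hedged sketch of
what the compiler appears to have lowered. On ARMv6 (the Pi Zero's
ARM1176), GCC expands the full-barrier __sync builtins into exactly this
shape: an mcr-based barrier, an ldrex/strex retry loop, and a closing
barrier. Function names are hypothetical; the real code uses PostgreSQL's
pg_atomic_fetch_or_u32()/pg_atomic_fetch_and_u32() wrappers.]

    #include <stdint.h>

    #define LW_FLAG_LOCKED 0x10000000U  /* the #268435456 bit in the listing */

    /* Acquire path (32b508..32b528): barrier, ldrex/strex loop performing
     * an atomic fetch-or of the flag, barrier, then test of the *old*
     * value (r3) to see whether we were the ones who set the bit. */
    static int
    try_lock_wait_list(volatile uint32_t *state)
    {
        uint32_t old = __sync_fetch_and_or(state, LW_FLAG_LOCKED);

        return (old & LW_FLAG_LOCKED) == 0;   /* nonzero if we acquired it */
    }

    /* Release path (32b644..32b660): the mirror image, clearing the bit
     * (the bic instruction) inside the same ldrex/strex pattern. */
    static void
    unlock_wait_list(volatile uint32_t *state)
    {
        __sync_fetch_and_and(state, ~LW_FLAG_LOCKED);
    }

If this reading is right, the generated code is correct as compiled, which
points the finger at how qemu emulates the ldrex/strex pair rather than at
the compiler.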
Paul Guyot <pguyot@kallisys.net> writes:
> QEMU user emulation documentation mentions something specific to
> threading on ARM:
> https://qemu.readthedocs.io/en/latest/user/main.html
>> Threading:
>> On Linux, QEMU can emulate the clone syscall and create a real host
>> thread (with a separate virtual CPU) for each emulated thread. Note
>> that not all targets currently emulate atomic operations correctly.
>> x86 and Arm use a global lock in order to preserve their semantics.

Oooh, that's pretty suggestive.  I wonder just when that "global lock" is
taken or released.  Maybe the timing of it is such that it interferes with
ever seeing our lock as free?

			regards, tom lane
Hi,

On 2021-05-02 18:19:58 -0400, Tom Lane wrote:
> Paul Guyot <pguyot@kallisys.net> writes:
> > QEMU user emulation documentation mentions something specific to
> > threading on ARM:
> > https://qemu.readthedocs.io/en/latest/user/main.html
> >> Threading:
> >> On Linux, QEMU can emulate the clone syscall and create a real host
> >> thread (with a separate virtual CPU) for each emulated thread. Note
> >> that not all targets currently emulate atomic operations correctly.
> >> x86 and Arm use a global lock in order to preserve their semantics.
>
> Oooh, that's pretty suggestive.  I wonder just when that "global lock"
> is taken or released.  Maybe the timing of it is such that it interferes
> with ever seeing our lock as free?

The way it talks about threads makes me wonder whether cross-process
shared memory was fully taken into account when implementing the emulated
atomics?

It's also not hard to imagine problems that exist for ll/sc style atomic
operations but not for cmpxchg/xadd/...: just acquiring a lock while ll or
sc is executing isn't sufficient, as the sc needs to fail if there were
any *non-atomic* writes to the relevant memory too -- and those won't know
to acquire the lock...

Greetings,

Andres Freund
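[Editor's note: the following is an illustrative C model of the hazard
Andres describes. It is emphatically not QEMU's actual implementation;
all names (atomic_emulation_lock, etc.) are hypothetical.]

    #include <stdint.h>
    #include <pthread.h>

    /* A "global lock" can serialize emulated ll/sc pairs against each
     * other, but only against other code paths that also take it. */
    static pthread_mutex_t atomic_emulation_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Lock-based emulation of an ldrex/strex fetch-or sequence. */
    static uint32_t
    emulated_ldrex_strex_or(volatile uint32_t *addr, uint32_t bits)
    {
        pthread_mutex_lock(&atomic_emulation_lock);
        uint32_t old = *addr;           /* the "ldrex" */
        *addr = old | bits;             /* the "strex": always "succeeds" */
        pthread_mutex_unlock(&atomic_emulation_lock);
        return old;
    }

    /* An ordinary emulated str does not take the lock.  If it executes
     * between the read and the write above, its update is silently lost;
     * on real hardware it would have made the strex fail and retry. */
    static void
    emulated_plain_store(volatile uint32_t *addr, uint32_t val)
    {
        *addr = val;
    }

And the cross-process angle is worse still: if such a lock is per qemu
instance, then two PostgreSQL backends (separate emulated processes
sharing memory) each hold a *different* lock, so even their ll/sc pairs
aren't serialized against each other. Either failure mode could leave
LW_FLAG_LOCKED apparently set forever, matching the observed "stuck
spinlock" PANIC and its disappearance under taskset -c 0.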