Thread: BUG #16990: Random PANIC in qemu user context


BUG #16990: Random PANIC in qemu user context

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16990
Logged by:          Paul Guyot
Email address:      pguyot@kallisys.net
PostgreSQL version: 11.11
Operating system:   qemu-arm-static chrooted raspios inside ubuntu
Description:

Within a GitHub Actions workflow, a qemu chrooted environment is created from
a RaspiOS Lite image, within which the latest available postgresql is
installed from apt (postgresql 11.11).
Then tests of embedded software are executed, which include creating a
postgresql database and performing a few benign operations (as far as
PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
environment as well as on real devices.

Within this qemu context, randomly yet quite frequently, postgresql PANICs.
The latest log was the following:
2021-05-02 09:22:21.591 BST [15024] PANIC:  stuck spinlock detected at
LWLockWaitListLock,
/build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:21.597 BST [15022] PANIC:  stuck spinlock detected at
LWLockWaitListLock,
/build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:21.762 BST [15423] pynab@test_pynab PANIC:  stuck spinlock
detected at LWLockWaitListLock,
/build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832
2021-05-02 09:22:21.762 BST [15423] pynab@test_pynab STATEMENT:  SELECT
"django_content_type"."id", "django_content_type"."app_label",
"django_content_type"."model" FROM "django_content_type" WHERE
"django_content_type"."app_label" = 'auth'
qemu: uncaught target signal 6 (Aborted) - core dumped
2021-05-02 09:22:24.481 BST [15011] LOG:  server process (PID 15423) was
terminated by signal 6: Aborted
2021-05-02 09:22:24.481 BST [15011] DETAIL:  Failed process was running:
SELECT "django_content_type"."id", "django_content_type"."app_label",
"django_content_type"."model" FROM "django_content_type" WHERE
"django_content_type"."app_label" = 'auth'
2021-05-02 09:22:24.481 BST [15011] LOG:  terminating any other active
server processes
2021-05-02 09:22:24.567 BST [15011] LOG:  all server processes terminated;
reinitializing
2021-05-02 09:22:24.601 BST [15512] LOG:  database system was interrupted;
last known up at 2021-05-02 09:18:11 BST
2021-05-02 09:22:24.692 BST [15512] LOG:  database system was not properly
shut down; automatic recovery in progress
2021-05-02 09:22:24.699 BST [15512] LOG:  redo starts at 0/171E170
2021-05-02 09:22:25.045 BST [15512] LOG:  invalid record length at
0/1957948: wanted 24, got 0
2021-05-02 09:22:25.046 BST [15512] LOG:  redo done at 0/1957910
2021-05-02 09:22:25.048 BST [15512] LOG:  last completed transaction was at
log time 2021-05-02 09:20:04.917746+01
2021-05-02 09:22:25.096 BST [15011] LOG:  database system is ready to accept
connections

The log is publicly available here:
https://github.com/pguyot/pynab/runs/2485660214?check_suite_focus=true

Notice how sluggish the test is compared to when PostgreSQL doesn't PANIC,
in the same environment. For example, this run worked perfectly in under 20
minutes:
https://github.com/pguyot/pynab/runs/2483559259?check_suite_focus=true

I tried to update the CI script to upload the full raspbian image in case of
a panic, to get my hands on the core dump, but it's so sluggish that I'm not
sure it won't eventually time out. I wonder if this sluggishness is itself a
cause of the PANIC. Could you please advise on how to investigate this crash
further?


Re: BUG #16990: Random PANIC in qemu user context

From:
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> Within a GitHub Actions workflow, a qemu chrooted environment is created from
> a RaspiOS Lite image, within which the latest available postgresql is
> installed from apt (postgresql 11.11).
> Then tests of embedded software are executed, which include creating a
> postgresql database and performing a few benign operations (as far as
> PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
> environment as well as on real devices.

> Within this qemu context, randomly yet quite frequently, postgresql PANICs.
> The latest log was the following:
> 2021-05-02 09:22:21.591 BST [15024] PANIC:  stuck spinlock detected at
> LWLockWaitListLock,
> /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832

Hm.  Looking at the lwlock.c source code, that's not actually a stuck
spinlock (in the sense of a loop around a TAS() call), but a loop
waiting for an LWLock's LW_FLAG_LOCKED bit to become clear.  It's
morally the same thing though, in that we don't expect the conflicting
lock to be held for more than a few instructions, so we just busy-wait
and delay until the lock can be obtained.
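For context, the loop in question has roughly this shape (a simplified
paraphrase of LWLockWaitListLock() in src/backend/storage/lmgr/lwlock.c, not
the verbatim source; perform_spin_delay() is what eventually raises the
stuck-spinlock PANIC):

  /* Simplified paraphrase of LWLockWaitListLock() */
  while (true)
  {
      /* try to grab the wait-list lock with one atomic fetch-or */
      old_state = pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_LOCKED);
      if (!(old_state & LW_FLAG_LOCKED))
          break;              /* bit was clear: we now hold the lock */

      /* otherwise busy-wait, with plain reads, until the bit clears */
      init_local_spin_delay(&delayStatus);
      while (old_state & LW_FLAG_LOCKED)
      {
          perform_spin_delay(&delayStatus);   /* PANICs after too many spins */
          old_state = pg_atomic_read_u32(&lock->state);
      }
      finish_spin_delay(&delayStatus);
      /* loop around and retry the atomic fetch-or */
  }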

Seems like there are a few possible explanations:

1. Compiler bug generating incorrect code for the wait loop (e.g.,
failing to re-fetch the volatile variable each time through).  The
difficulty with this theory is that then you'd expect to see the same
freezeup in normal non-qemu execution.  But maybe qemu slows things
down enough that the window for contention on an LWLock can be hit,
whereas you'd hardly ever see that without qemu.  Seems unlikely,
but maybe it'd be worth disassembling LWLockWaitListLock to check
(see the sketch after this list).

2. qemu bug in emulating the atomic-update instructions that are
used to set/clear LW_FLAG_LOCKED.  This doesn't seem real probable
either, but maybe it's the most likely of a bad lot.

3. qemu is so slow that the spinlock delay times out.  I don't
believe this one either, mainly because we haven't seen it in
our own occasional uses of qemu; and if it were that slow it'd
be entirely unusable.  The spinlock timeout is normally multiple
seconds, which is several orders of magnitude longer than such
locks ought to be held.

4. Postgres bug causing the lock to never get released.  This theory
has the same problem as #1, ie you have to explain why it's not seen
in any other environment.

5. The lock does get released, but there are enough processes
contending for it that some process times out before it
successfully acquires the lock.  It's possible perhaps that that
could happen under a very high-load scenario, but that doesn't seem
like the category of test that would be sane to run under qemu.
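To illustrate what #1 would mean: a correct build must re-load the flag word
from memory on every iteration, while a hypothetical miscompilation that
hoists the load out of the loop would spin on a stale value forever. A
contrived sketch, not actual PostgreSQL code:

  /* Correct: the flag word is re-read on every iteration */
  while (pg_atomic_read_u32(&lock->state) & LW_FLAG_LOCKED)
      perform_spin_delay(&delayStatus);

  /* Hypothetical miscompilation: the load is hoisted out of the loop,
   * so a stale "locked" value makes the loop spin until the PANIC fires */
  uint32 stale = pg_atomic_read_u32(&lock->state);
  while (stale & LW_FLAG_LOCKED)
      perform_spin_delay(&delayStatus);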

Not sure what to tell you, other than "make sure qemu and your
build toolchain are up-to-date".

            regards, tom lane



Re: BUG #16990: Random PANIC in qemu user context

From:
Paul Guyot
Date:
> Not sure what to tell you, other than "make sure qemu and your
> build toolchain are up-to-date".

In this scenario, I use the postgresql 11.11 that was compiled by the raspbian folks. I also used the qemu binary
provided by Ubuntu for focal, which happens to be 4.2 (not the latest).

I found the corresponding function by using readelf to locate the string constant.

For the record, the C function is here:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L811

The tight read loop is as follows:
  32b548:   e28d0004    add     r0, sp, #4      ; r0 = sp+4: arg for perform_spin_delay
  32b54c:   eb000679    bl      32cf38 <perform_spin_delay@@Base>
  32b550:   e5943004    ldr     r3, [r4, #4]    ; plain re-read of the flag word
  32b554:   e3130201    tst     r3, #268435456  ; test LW_FLAG_LOCKED (0x10000000)
  32b558:   1afffffa    bne     32b548 <RememberSimpleDeadLock@@Base+0xc4>  ; still set: spin again

At address 32b550, it does perform a read, honoring the volatile pointer.

I guess the lock is acquired by the same function:
https://github.com/postgres/postgres/blob/REL_11_STABLE/src/backend/storage/lmgr/lwlock.c#L824

The corresponding code is the following:
  32b508:   ee070fba    mcr     15, 0, r0, cr7, cr10, {5}  ; memory barrier
  32b50c:   e1953f9f    ldrex   r3, [r5]            ; load-exclusive of the flag word
  32b510:   e3832201    orr     r2, r3, #268435456  ; set LW_FLAG_LOCKED (0x10000000)
  32b514:   e1851f92    strex   r1, r2, [r5]        ; store-exclusive; r1 = 0 on success
  32b518:   e3510000    cmp     r1, #0
  32b51c:   1afffffa    bne     32b50c <RememberSimpleDeadLock@@Base+0x88>  ; reservation lost: retry
  32b520:   e3130201    tst     r3, #268435456      ; was the bit already set?
  32b524:   ee070fba    mcr     15, 0, r0, cr7, cr10, {5}  ; memory barrier
  32b528:   0a00000e    beq     32b568 <RememberSimpleDeadLock@@Base+0xe4>  ; it was clear: lock acquired

mcr 15, 0, r0, cr7, cr10, {5} is __sync_synchronize() and, based on the previous instructions, r5 is equal to r4+4 as
used in the tight loop.
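In C terms, that sequence looks like the expansion of an atomic fetch-or,
with the ldrex/strex/bne triple as the retry loop and the two mcr barriers
implied by the builtin. An approximate equivalent (my reconstruction, not the
actual source):

  #include <stdint.h>

  #define LW_FLAG_LOCKED (1u << 28)       /* the 0x10000000 bit tested above */

  /* Approximate C for the disassembled acquire sequence */
  static int
  try_acquire(volatile uint32_t *state)
  {
      /* ldrex/strex retry loop plus both barriers */
      uint32_t old = __sync_fetch_and_or(state, LW_FLAG_LOCKED);
      return (old & LW_FLAG_LOCKED) == 0; /* tst + beq: acquired iff bit was clear */
  }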

I also guess the corresponding unlock function just follows, and disassembling it reveals the same use of
__sync_synchronize():
  32b644:   ee070fba    mcr     15, 0, r0, cr7, cr10, {5}  ; memory barrier
  32b648:   e1932f9f    ldrex   r2, [r3]
  32b64c:   e3c22201    bic     r2, r2, #268435456  ; clear LW_FLAG_LOCKED (0x10000000)
  32b650:   e1831f92    strex   r1, r2, [r3]
  32b654:   e3510000    cmp     r1, #0
  32b658:   1afffffa    bne     32b648 <RememberSimpleDeadLock@@Base+0x1c4>  ; reservation lost: retry
  32b65c:   ee070fba    mcr     15, 0, r0, cr7, cr10, {5}  ; memory barrier
  32b660:   e8bd8070    pop     {r4, r5, r6, pc}    ; return
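Continuing the sketch above, the release is the mirror image, an atomic
fetch-and clearing the same bit (again an approximation, not the actual
source):

  /* Approximate C for the disassembled release sequence */
  static void
  release(volatile uint32_t *state)
  {
      /* bic inside the ldrex/strex loop, bracketed by the same barriers */
      (void) __sync_fetch_and_and(state, ~LW_FLAG_LOCKED);
  }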

QEMU user emulation documentation mentions something specific to threading on ARM:
https://qemu.readthedocs.io/en/latest/user/main.html
> Threading:
> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each
> emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global
> lock in order to preserve their semantics.

I have yet to determine what impact it could have here. Can we imagine a situation where the memory barrier was not
honored and an unlock would be overwritten by a lock?

Eventually, I tried running the whole script with taskset -c 0 (which is fine for the tests, as the target system,
a Raspberry Pi Zero, is single-core, while GitHub Linux runners have 2 vCPUs).
https://github.com/pguyot/pynab/commit/91011e68e446c69e317fd1198c58f85ff0cd5fb1
https://github.com/pguyot/pynab/runs/2486051700?check_suite_focus=true

I ran it four times so far, and no postgresql PANIC has happened. So your hypothesis of a bug (limitation) in qemu 4.2
seems probable…
FYI, newer ARM architectures, starting with ARMv7, have a dedicated memory-barrier instruction (dmb), which is not used
here as it is not recognized by the Raspberry Pi Zero's ARMv6 CPU.
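For reference, the ARMv6 barrier is exactly the CP15 write seen in the
disassembly; ARMv7 and later would emit dmb instead. A minimal inline-asm
illustration:

  /* ARMv6 (ARM1176, e.g. Raspberry Pi Zero): barrier via CP15 write */
  static inline void
  barrier_armv6(void)
  {
      __asm__ __volatile__("mcr p15, 0, %0, c7, c10, 5" :: "r"(0) : "memory");
  }

  /* ARMv7 and later: dedicated data memory barrier instruction */
  static inline void
  barrier_armv7(void)
  {
      __asm__ __volatile__("dmb ish" ::: "memory");
  }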

Paul




Re: BUG #16990: Random PANIC in qemu user context

From:
Tom Lane
Date:
Paul Guyot <pguyot@kallisys.net> writes:
> QEMU user emulation documentation mentions something specific to threading on ARM:
> https://qemu.readthedocs.io/en/latest/user/main.html
>> Threading:
>> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each
>> emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global
>> lock in order to preserve their semantics.

Oooh, that's pretty suggestive.  I wonder just when that "global lock" is
taken or released.  Maybe the timing of it is such that it interferes
with ever seeing our lock as free?

            regards, tom lane



Re: BUG #16990: Random PANIC in qemu user context

From:
Andres Freund
Date:
Hi,

On 2021-05-02 18:19:58 -0400, Tom Lane wrote:
> Paul Guyot <pguyot@kallisys.net> writes:
> > QEMU user emulation documentation mentions something specific to threading on ARM:
> > https://qemu.readthedocs.io/en/latest/user/main.html
> >> Threading:
> >> On Linux, QEMU can emulate the clone syscall and create a real host thread (with a separate virtual CPU) for each
> >> emulated thread. Note that not all targets currently emulate atomic operations correctly. x86 and Arm use a global
> >> lock in order to preserve their semantics.
> 
> Oooh, that's pretty suggestive.  I wonder just when that "global lock" is
> taken or released.  Maybe the timing of it is such that it interferes
> with ever seeing our lock as free?

The way it talks about threads makes me wonder if cross-process shared memory
wasn't fully taken into account when implementing the emulated atomics?
PostgreSQL backends are separate processes sharing memory, not threads of a
single process.
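If that's the suspicion, a minimal cross-process stress test might show it
(a hypothetical reproducer sketch, not something from this thread: shared
anonymous memory plus fork, hammering one word with the same fetch-or /
fetch-and pattern as the lock above; with correct atomics it finishes, with
broken cross-process atomics it can livelock just like the stuck spinlock):

  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define FLAG (1u << 28)

  int main(void)
  {
      /* Shared anonymous mapping, visible to both processes after fork,
       * like PostgreSQL's shared memory segment. */
      volatile uint32_t *word = mmap(NULL, sizeof(*word),
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      if (word == MAP_FAILED) { perror("mmap"); return 1; }
      *word = 0;

      pid_t pid = fork();
      for (long i = 0; i < 1000000; i++)
      {
          /* acquire: spin until we are the one who set the flag */
          while (__sync_fetch_and_or(word, FLAG) & FLAG)
              ;
          /* release */
          __sync_fetch_and_and(word, ~FLAG);
      }
      if (pid == 0) _exit(0);
      waitpid(pid, NULL, 0);
      puts("done");  /* a hang before this mirrors the stuck-spinlock symptom */
      return 0;
  }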

It's also not hard to imagine problems that exist for ll/sc-style atomic
operations but not for cmpxchg/xadd/...: just acquiring a lock while ll or
sc executes isn't sufficient, as the sc needs to fail if there are any
*non-atomic* writes to the relevant memory too, and those writes won't know
to acquire the lock... (sketched below)
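Sketched out, the hazard would be something like this (contrived pseudocode;
store_conditional() is a stand-in for strex semantics, not a real API):

  /* Emulated thread A: ll/sc read-modify-write under the global lock */
  do {
      old = *addr;                          /* ldrex: sets the reservation */
      new = old | LW_FLAG_LOCKED;
  } while (!store_conditional(addr, new));  /* strex: must fail if addr was
                                               written in the meantime */

  /* Thread B, meanwhile: a plain store that never takes the lock */
  *addr = something_else;   /* real hardware breaks A's reservation here;
                               a naive lock-based emulation may not notice */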

Greetings,

Andres Freund