Re: BUG #16990: Random PANIC in qemu user context

From: Tom Lane
Subject: Re: BUG #16990: Random PANIC in qemu user context
Msg-id 3714052.1619980429@sss.pgh.pa.us
In reply to: BUG #16990: Random PANIC in qemu user context  (PG Bug reporting form <noreply@postgresql.org>)
Responses: Re: BUG #16990: Random PANIC in qemu user context  (Paul Guyot <pguyot@kallisys.net>)
List: pgsql-bugs
PG Bug reporting form <noreply@postgresql.org> writes:
> Within GitHub Actions Workflow, a qemu chrooted environment is created from
> a RaspiOS lite image, within which latest available postgresql is installed
> from apt (postgresql 11.11).
> Then tests of embedded software are executed, which includes creating a
> postgresql database and performing few benign operations (as far as
> PostgreSQL is concerned). Tests run perfectly fine in a desktop-like
> environment as well as on real devices.

> Within this qemu context, randomly yet quite frequently, postgresql
> PANICs.
> Latest log was the following :
> 2021-05-02 09:22:21.591 BST [15024] PANIC:  stuck spinlock detected at
> LWLockWaitListLock,
> /build/postgresql-11-rRyn74/postgresql-11-11.11/build/../src/backend/storage/lmgr/lwlock.c:832

Hm.  Looking at the lwlock.c source code, that's not actually a stuck
spinlock (in the sense of a loop around a TAS() call), but a loop
waiting for an LWLock's LW_FLAG_LOCKED bit to become clear.  It's
morally the same thing though, in that we don't expect the conflicting
lock to be held for more than a few instructions, so we just busy-wait
and delay until the lock can be obtained.
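
To make the distinction concrete, here is a minimal sketch of that kind of
wait loop in portable C11. It is not the lwlock.c code: the names flag_lock,
flag_unlock, FLAG_LOCKED and MAX_SPINS, the bit position, and the give-up
limit are all made up for illustration, and the real code delays between
attempts rather than counting raw polls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FLAG_LOCKED ((uint32_t) 1 << 28)    /* hypothetical bit position */
#define MAX_SPINS   1000000L                /* hypothetical give-up limit */

/*
 * Set the lock bit in *state, busy-waiting while another holder has it.
 * Returns the number of polls that found the bit still set.
 */
static long
flag_lock(_Atomic uint32_t *state)
{
    long        spins = 0;

    for (;;)
    {
        /* Atomically set the bit and learn whether it was already set. */
        uint32_t    old = atomic_fetch_or(state, FLAG_LOCKED);

        if ((old & FLAG_LOCKED) == 0)
            return spins;       /* the bit was clear, so we now hold it */

        /* Poll with a fresh atomic load each time until it looks free. */
        while (atomic_load(state) & FLAG_LOCKED)
        {
            if (++spins > MAX_SPINS)
            {
                fprintf(stderr, "stuck lock detected\n");
                abort();        /* stand-in for the PANIC */
            }
        }
    }
}

/* Clear the lock bit; the caller must currently hold it. */
static void
flag_unlock(_Atomic uint32_t *state)
{
    uint32_t    old = atomic_fetch_and(state, ~FLAG_LOCKED);

    assert(old & FLAG_LOCKED);
    (void) old;                 /* silence unused warning under NDEBUG */
}
```

In a sketch like this the abort can only fire if the bit stays set for the
whole polling budget, which is why a holder that needs just a few
instructions should never trigger it.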

Seems like there are a few possible explanations:

1. Compiler bug generating incorrect code for the wait loop (e.g.,
failing to re-fetch the volatile variable each time through).  The
difficulty with this theory is that then you'd expect to see the same
freezeup in normal non-qemu execution.  But maybe qemu slows things
down enough that the window for contention on an LWLock can be hit,
whereas you'd hardly ever see that without qemu.  Seems unlikely,
but maybe it'd be worth disassembling LWLockWaitListLock to check.

2. qemu bug in emulating the atomic-update instructions that are
used to set/clear LW_FLAG_LOCKED.  This doesn't seem real probable
either, but maybe it's the most likely of a bad lot.

3. qemu is so slow that the spinlock delay times out.  I don't
believe this one either, mainly because we haven't seen it in
our own occasional uses of qemu; and if it were that slow it'd
be entirely unusable.  The spinlock timeout is normally multiple
seconds, which is several orders of magnitude longer than such
locks ought to be held.

4. Postgres bug causing the lock to never get released.  This theory
has the same problem as #1, i.e., you have to explain why it's not seen
in any other environment.

5. The lock does get released, but there are enough processes
contending for it that some process times out before it
successfully acquires the lock.  It's possible perhaps that that
could happen under a very high-load scenario, but that doesn't seem
like the category of test that would be sane to run under qemu.
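
For theory 1, the kind of miscompilation to look for in a disassembly can be
illustrated with a small sketch. The two functions below are hypothetical
(wait_plain, wait_volatile and LOCKED_BIT are not PostgreSQL names) and the
loops are bounded only so the example always terminates. With a plain pointer
the compiler is permitted to load the word once and reuse the value on every
iteration; with a volatile-qualified pointer it must emit a load per
iteration.

```c
#include <assert.h>
#include <stdint.h>

#define LOCKED_BIT ((uint32_t) 1)   /* hypothetical flag bit */

/*
 * Plain read: nothing tells the compiler that *state can change behind its
 * back, so it may hoist the load out of the loop and test a stale value.
 */
static int
wait_plain(const uint32_t *state, int max_polls)
{
    int         polls = 0;

    assert(max_polls >= 0);
    while ((*state & LOCKED_BIT) != 0 && polls < max_polls)
        polls++;
    return polls;
}

/*
 * Volatile read: every iteration must re-fetch *state from memory, which is
 * what a wait loop needs in order to notice another process's release.
 */
static int
wait_volatile(const volatile uint32_t *state, int max_polls)
{
    int         polls = 0;

    assert(max_polls >= 0);
    while ((*state & LOCKED_BIT) != 0 && polls < max_polls)
        polls++;
    return polls;
}
```

Run single-threaded, both functions return the same counts, so the difference
only shows up in the generated code: a correct wait loop should contain a load
instruction inside the loop body, not just before it.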

Not sure what to tell you, other than "make sure qemu and your
build toolchain are up-to-date".

            regards, tom lane


