Re: Potential deadlock in pgaio_io_wait()

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Potential deadlock in pgaio_io_wait()
Дата
Msg-id kl7wzxh6lhtbexnkp7meobqjksnayvzvh53uzsdjsubdfkom37@kemmr6est2g6
обсуждение исходный текст
Ответ на Re: Potential deadlock in pgaio_io_wait()  (Thomas Munro <thomas.munro@gmail.com>)
Список pgsql-hackers
Hi,

On 2025-08-15 17:39:30 +1200, Thomas Munro wrote:
> I discussed this off-list with Andres who provided the following review:
> 
>  * +1 for the analysis
>  * +1 for the solution
>  * his benchmark machine shows no regression under heavy IO submission workload
>  * needs better comments
> 
> I had expected that patch to be rejected as too slow.  I was thinking
> that it should be enough to insert a memory barrier and then do an
> unlocked check for an empty waitlist in ConditionVariableBroadcast()
> to avoid a spinlock acquire/release.  That caused a few weird hangs I
> didn't understand, which is why I didn't post that version last time,
> but he pointed out that the other side also needs a memory barrier in
> ConditionVariablePrepareToSleep() and the existing spinlock
> acquire/release is not enough.

FTR, the reason that we would need a barrier there is that we define
SpinLockRelease() as a non-atomic volatile store on x86. That means that there
is no guarantee that another backend sees an up2date view of the memory if
reading a memory location *without* acquiring the spinlock. The reason this is
correct for spinlocks is that x86 has strongly ordered stores, and the
*acquisition* of a spinlock will only succeed once the release of the spinlock
has become visible.  But if you read something *without* the spinlock, you're
not guaranteed to see all the changes done without the spinlock.

I do wonder if we accidentally rely on SpinLockRelease() being a barrier
anywhere...


> I also thought of a small optimisation, presented in the -B patch.
> It's a bit of a shame to wait for backend->submit_cv and then also
> ioh->cv in io_method=worker.  It's just a bunch of IPC ping-pong for
> nothing.  So I figured it should be allowed to fall all the way
> through based on its lack of ->wait_one.  Worth bothering with?

I don't think I'd go there without concrete evidence that it helps - it makes
it harder to understand when to wait for what. Not terribly so, but enough
that I'd not go there without some measurable benefit.


> On a superficial note:
> 
>  AIO_IO_COMPLETION      "Waiting for another process to complete IO."
> +AIO_IO_SUBMIT  "Waiting for another process to submit IO."
>  AIO_IO_URING_SUBMIT    "Waiting for IO submission via io_uring."
>  AIO_IO_URING_EXECUTION "Waiting for IO execution via io_uring."


> We're inconsistent in our choice of noun or verb.  I went with _SUBMIT
> to match the following line, rather than _SUBMISSION to match the
> preceding line.  Shrug.

FWIW, I think so far it's a verb when the process is doing the work (e.g. the
process is calling io_uring_enter(to_submit = ..). It's a noun if we wait for
somebody *else* to do the completion, since we're not doing work ourselves.



> From ec2e1e21054f00918d3b28ce01129bc06de37790 Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Sun, 3 Aug 2025 23:07:56 +1200
> Subject: [PATCH v2 1/2] aio: Fix pgaio_io_wait() for staged IOs.
> 
> Previously, pgaio_io_wait()'s cases for PGAIO_HS_DEFINED and
> PGAIO_HS_STAGED fell through to waiting for completion.  The owner only
> promises to advance it to PGAIO_HS_SUBMITTED.  The waiter needs to be
> prepared to call ->wait_one() itself once the IO is submitted in order
> to guarantee progress and avoid deadlocks on IO methods that provide
> ->wait_one().
> 
> Introduce a new per-backend condition variable submit_cv, woken by by
> pgaio_submit_stage(), and use it to wait for the state to advance.  The
> new broadcast doesn't seem to cause any measurable slowdown, so ideas
> for optimizing the common no-waiters case were abandoned for now.
> 
> It may not be possible to reach any real deadlock with existing AIO
> users, but that situation could change.  There's also no reason the
> waiter shouldn't begin to wait via the IO method as soon as possible
> even without a deadlock.
> 
> Picked up by testing a proposed IO method that has ->wait_one(), like
> io_method=io_uring, and code review.

LGTM.

Greetings,

Andres Freund



В списке pgsql-hackers по дате отправления: