Re: Blocking I/O, async I/O and io_uring

From: Craig Ringer
Subject: Re: Blocking I/O, async I/O and io_uring
Date:
Msg-id: CAGRY4nx8hqNoUWpLHnE9FoUUWmegKT9pGiJyAb+hwn2iuYQSUw@mail.gmail.com
In response to: Re: Blocking I/O, async I/O and io_uring  (Andres Freund <andres@anarazel.de>)
Responses: Re: Blocking I/O, async I/O and io_uring  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Tue, 8 Dec 2020 at 12:02, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about it. Things aren't *quite* right yet architecturally,
but I think we're getting there.

That's wonderful. Thank you.

I'm badly behind on the conference circuit due to geographic isolation and small children. I'll hunt up your talks.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and tracing of latencies etc? If not that's something I'd be keen to help with. I've mostly been using systemtap so far but I'm trying to pivot over to bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better expose our wait points and latencies, improve visibility of blocking, and help make activity traceable through all the stages of processing. I'll Cc you when I do.
 
> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> fsync support. It has ordering support which is really important for us.

My results indicate that we really want to have optional, not enabled
by default of course, O_DIRECT support. We just can't benefit fully
from modern SSDs otherwise. Buffered I/O is also important, of course.

Even more so for NVDRAM, Optane and all that, where zero-copy and low context-switch overhead become important too.

We're a long way from that being a priority but it's still not to be dismissed.

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a
more flexible executor model (imagine switching between different parts
of the query whenever blocked on IO - can't do that due to the stack
right now).

Yep, that's what I'm talking about being an issue.

Blocked on an index read? Move on to the next tuple and come back when the index read is done.

I really like what I see of the io_uring architecture so far. It's ideal for callback-based, event-driven flow control. But that doesn't fit the postgres executor well. It's a better fit for redo etc.



The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There are plenty of other places that should use that helper, but we
already get considerably higher throughput for seqscans and vacuum, on
both very fast local storage and high-latency cloud storage.

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which is currently used by checkpointer and
bgwriter (but should be used in more places). Especially with direct IO
checkpointing can be a lot faster *and* less impactful on the "regular"
load.

Sure sounds like a useful interim step. That's great.

I've got asynchronous writing of WAL mostly working, but need to
redesign the locking a bit further. Right now it's a win in some cases,
but not others. The latter to a significant degree due to unnecessary
blocking....

That's where io_uring's I/O ordering operations looked interesting. But I haven't looked closely enough to see if they're going to help us with I/O ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O dependencies and even let it apply WAL then heap changes out-of-order so long as they didn't violate any ordering constraints we specify between particular WAL records or between WAL writes and their corresponding heap blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between WAL records and heap writes instead of fsync()ing, but as you note, the increased blocking and reduction in the kernel's ability to do I/O reordering is generally worse than the costs of the fsync()s we do now.

> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.

Cool. You know we worked on something like that in 2ndQ too, with fast_redo, and it's pretty effective at reducing the I/O waits for b-tree index maintenance.

How feasible do you think it'd be to take it a step further and structure redo as a pipelined queue, where redo calls enqueue I/O operations and completion handlers then return immediately? Everything still goes to disk in the order it's enqueued, and the callbacks will be invoked in order, so they can update appropriate shmem state etc. Since there's no concurrency during redo, it should be much simpler than normal user backend operations where we have all the tight coordination of buffer management, WAL write ordering, PGXACT and PGPROC, the clog, etc.

So far the main issue I see with it is that there are still far too many places where we'd have to block because some logic needs the result of a read in order to perform a subsequent write. We can't just turn those into event-driven continuations on the queue and keep going unless we can guarantee that the later WAL we apply while waiting is independent of any changes the earlier pending writes might make. That's hard, especially with b-trees. And it's those read-then-write ordering points that already hurt our redo performance the most.

In pgsql-hackers by message date:

Previous
From: Kyotaro Horiguchi
Date:
Message: Re: pg_rewind race condition just after promotion
Next
From: Masahiro Ikeda
Date:
Message: About to add WAL write/fsync statistics to pg_stat_wal view