Thread: Blocking I/O, async I/O and io_uring


Blocking I/O, async I/O and io_uring

From: Craig Ringer
Date:
Hi all

A new kernel API called io_uring has recently come to my attention. I assume some of you (Andres?) have been following it for a while.

io_uring appears to offer a way to make system calls including reads, writes, fsync()s, and more in a non-blocking, batched and pipelined manner, with or without O_DIRECT. Basically async I/O with usable buffered I/O and fsync support. It has ordering support which is really important for us.
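Conceptually, the batched submit/complete flow works via two shared rings: user space queues submission entries without syscalls, then one syscall flushes the whole batch, and completions arrive on a second ring. Here is a toy Python model of that flow (not the real kernel ABI; in C one would go through liburing):

```python
from collections import deque

# Toy model of io_uring's two rings: user space queues submission
# entries (SQEs) syscall-free, one submit flushes the batch, and the
# kernel posts completion entries (CQEs) on the other ring.
class ToyRing:
    def __init__(self):
        self.sq = deque()   # submission queue
        self.cq = deque()   # completion queue

    def prep(self, op, *args):
        self.sq.append((op, args))      # queue up work, no syscall yet

    def submit(self):
        n = len(self.sq)                # one "syscall" flushes the batch
        while self.sq:
            op, args = self.sq.popleft()
            self.cq.append(op(*args))   # kernel would complete these async
        return n

ring = ToyRing()
buf = {}
ring.prep(buf.__setitem__, "block7", b"data")   # a queued "write"
ring.prep(buf.get, "block7")                    # a queued "read"
submitted = ring.submit()
```

The batching is the point: many operations, one kernel transition, completions consumed at leisure.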

This should be on our radar. The main barriers to benefiting from linux-aio based async I/O in postgres in the past have been its reliance on direct I/O, the various kernel-version quirks, platform portability, and its maybe-async-except-when-it's-randomly-not nature.

The kernel version and portability remain an issue with io_uring so it's not like this is something we can pivot over to completely. But we should probably take a closer look at it.

PostgreSQL spends a huge amount of time waiting, doing nothing, for blocking I/O. If we can improve that then we could potentially realize some major increases in I/O utilization especially for bigger, less concurrent workloads. The most obvious candidates to benefit would be redo, logical apply, and bulk loading.

But I have no idea how to even begin to fit this into PostgreSQL's executor pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative in nature, with a push/pull executor pipeline. It seems to have been recognised for some time that this is increasingly hurting our performance and scalability as platforms become more and more parallel.

To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we have to be able to dispatch I/O and do something else while we wait for the results. So we need the ability to pipeline the executor and pipeline redo.

I thought I'd start the discussion on this and see where we can go with it. What incremental steps can be done to move us toward parallelisable I/O without having to redesign everything?

I'm thinking that redo is probably a good first candidate. It doesn't depend on the guts of the executor. It is much less sensitive to ordering between operations in shmem and on disk since it runs in the startup process. And it hurts REALLY BADLY from its single-threaded blocking approach to I/O - as shown by an extension written by 2ndQuadrant that can double redo performance by doing read-ahead on btree pages that will soon be needed.
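As a toy model of what such read-ahead buys redo (illustrative only; the real extension works on WAL records and shared buffers, and the prefetch would be asynchronous):

```python
from collections import namedtuple

Rec = namedtuple("Rec", "block")   # stand-in for a decoded WAL record

def apply_with_readahead(records, cache, read_block, depth=8):
    """Apply records in order, warming the cache for upcoming ones."""
    applied = []
    for i, rec in enumerate(records):
        # Look ahead up to `depth` records and start reads for blocks
        # not yet cached (in real AIO these would be issued async).
        for ahead in records[i + 1 : i + 1 + depth]:
            if ahead.block not in cache:
                cache[ahead.block] = read_block(ahead.block)
        if rec.block not in cache:          # cache miss: a blocking read
            cache[rec.block] = read_block(rec.block)
        applied.append(cache[rec.block])
    return applied
```

With async reads in place of `read_block`, the apply loop almost never waits, which is where the 2x comes from.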

Thoughts anybody?

Re: Blocking I/O, async I/O and io_uring

From: Craig Ringer
Date:
References to get things started:


You'll probably notice how this parallels my sporadic activities around pipelining in other areas, and the PoC libpq pipelining patch I sent in a few years ago.

Re: Blocking I/O, async I/O and io_uring

From: Thomas Munro
Date:
On Tue, Dec 8, 2020 at 3:56 PM Craig Ringer
<craig.ringer@enterprisedb.com> wrote:
> I thought I'd start the discussion on this and see where we can go with it. What incremental steps can be done to
> move us toward parallelisable I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't depend on the guts of the executor. It is much
> less sensitive to ordering between operations in shmem and on disk since it runs in the startup process. And it hurts
> REALLY BADLY from its single-threaded blocking approach to I/O - as shown by an extension written by 2ndQuadrant that
> can double redo performance by doing read-ahead on btree pages that will soon be needed.

About the redo suggestion: https://commitfest.postgresql.org/31/2410/
does exactly that!  It currently uses POSIX_FADV_WILLNEED because
that's what PrefetchSharedBuffer() does, but when combined with a
"real AIO" patch set (see earlier threads and conference talks on this
by Andres) and a few small tweaks to control batching of I/O
submissions, it does exactly what you're describing.  I tried to keep
the WAL prefetcher project entirely disentangled from the core AIO
work, though, hence the "poor man's AIO" for now.



Re: Blocking I/O, async I/O and io_uring

From: Andreas Karlsson
Date:
On 12/8/20 3:55 AM, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I 
> assume some of you (Andres?) have been following it for a while.

Andres did a talk on this at FOSDEM PGDay earlier this year. You can see 
his slides below, but since they are from January things might have 
changed since then.

https://www.postgresql.eu/events/fosdem2020/schedule/session/2959-asynchronous-io-for-postgresql/

Andreas



Re: Blocking I/O, async I/O and io_uring

From: Andres Freund
Date:
Hi,

On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> A new kernel API called io_uring has recently come to my attention. I
> assume some of you (Andres?) have been following it for a while.

Yea, I've spent a *lot* of time working on AIO support, utilizing
io_uring. Recently Thomas also joined in the fun. I've given two talks
referencing it (last pgcon, last pgday brussels), but otherwise I've not
yet written much about it. Things aren't *quite* right yet architecturally,
but I think we're getting there.

Thomas is working on making the AIO infrastructure portable (a worker
based fallback, posix AIO support for freebsd & OSX). Once that's done,
and some of the architectural things are resolved, I plan to write a long
email about what I think the right design is, and where I am at.

The current state is at https://github.com/anarazel/postgres/tree/aio
(but it's not a very clean history at the moment).

There's currently no windows AIO support, but it shouldn't be too hard
to add. My preliminary look indicates that we'd likely have to use
overlapped IO with WaitForMultipleObjects(), not IOCP, since we need to
be able to handle latches etc, which seems harder with IOCP. But perhaps
we can do something using the signal handling emulation posting events
onto IOCP instead.


> io_uring appears to offer a way to make system calls including reads,
> writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> fsync support. It has ordering support which is really important for us.

My results indicate that we really want to have, optional & not
enabled by default of course, O_DIRECT support. We just can't benefit
fully from modern SSDs otherwise. Buffered is also important, of course.


> But I have no idea how to even begin to fit this into PostgreSQL's executor
> pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative
> in nature, with a push/pull executor pipeline. It seems to have been
> recognised for some time that this is increasingly hurting our performance
> and scalability as platforms become more and more parallel.

> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we
> have to be able to dispatch I/O and do something else while we wait for the
> results. So we need the ability to pipeline the executor and pipeline redo.

> I thought I'd start the discussion on this and see where we can go with it.
> What incremental steps can be done to move us toward parallelisable I/O
> without having to redesign everything?

I'm pretty sure that I've got the basics of this working pretty well. I
don't think the executor architecture is as big an issue as you seem to
think. There are further benefits that could be unlocked if we had a
more flexible executor model (imagine switching between different parts
of the query whenever blocked on IO - can't do that due to the stack
right now).

The way it currently works is that things like sequential scans, vacuum,
etc use a prefetching helper which will try to use AIO to read ahead of
the next needed block. That helper uses callbacks to determine the next
needed block, which e.g. vacuum uses to skip over all-visible/frozen
blocks. There's plenty of other places that should use that helper, but we
already can get considerably higher throughput for seqscans and vacuum on
both very fast local storage and high-latency cloud storage.
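A rough Python sketch of the shape of such a helper (names here are made up for illustration, not the actual patch's API). The caller's callback decides what the next block is, which is how vacuum can skip all-visible/frozen blocks:

```python
from collections import deque

# Illustrative callback-driven prefetch helper: keeps a configurable
# number of reads "in flight" ahead of the consumer.
class PrefetchHelper:
    def __init__(self, next_block_cb, start_io, max_in_flight=4):
        self.next_block_cb = next_block_cb  # returns next block or None
        self.start_io = start_io            # would submit AIO for real
        self.max_in_flight = max_in_flight
        self.in_flight = deque()

    def _fill(self):
        # Top up the pipeline until the depth limit or end of blocks.
        while len(self.in_flight) < self.max_in_flight:
            blk = self.next_block_cb()
            if blk is None:
                return
            self.in_flight.append(self.start_io(blk))

    def read_next(self):
        self._fill()
        if not self.in_flight:
            return None                     # callback is exhausted
        result = self.in_flight.popleft()   # would wait for completion
        self._fill()
        return result
```

For vacuum, the callback would return the next block that is not all-visible/frozen, or None at end of relation.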

Similarly, for writes there's a small helper to manage a write-queue of
configurable depth, which currently is used by checkpointer and
bgwriter (but should be used in more places). Especially with direct IO
checkpointing can be a lot faster *and* less impactful on the "regular"
load.
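The write-queue idea can be sketched the same way (again illustrative names, not the patch's actual interface): submissions queue up until the depth limit, at which point the oldest write is waited on before a new one is admitted.

```python
from collections import deque

# Illustrative bounded write queue: enqueue is non-blocking until the
# configured depth is reached, then the oldest write provides
# back-pressure before a new one is admitted.
class WriteQueue:
    def __init__(self, depth, do_write):
        self.depth = depth
        self.do_write = do_write       # would be an async write + wait
        self.pending = deque()

    def enqueue(self, block, data):
        if len(self.pending) >= self.depth:
            self.wait_one()            # drain the oldest write first
        self.pending.append((block, data))

    def wait_one(self):
        block, data = self.pending.popleft()
        self.do_write(block, data)     # would wait for its completion

    def flush(self):
        while self.pending:            # e.g. at end of a checkpoint
            self.wait_one()
```

The depth knob is what lets checkpointer keep the device busy without flooding it.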

I've got asynchronous writing of WAL mostly working, but need to
redesign the locking a bit further. Right now it's a win in some cases,
but not others. The latter to a significant degree due to unnecessary
blocking....


> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

Thomas has a patch for prefetching during WAL apply. It currently uses
posix_fadvise(), but he took care that it'd be fairly easy to rebase it
onto "real" AIO. Most of the changes necessary are pretty independent of
posix_fadvise vs aio.

Greetings,

Andres Freund



RE: Blocking I/O, async I/O and io_uring

From: "tsunakawa.takay@fujitsu.com"
Date:
From: Andres Freund <andres@anarazel.de>
> Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

I'm looking forward to this from the async+direct I/O, since the throughput of some write-heavy workload decreased by
half or more during checkpointing (due to fsync?). Would you mind sharing any preliminary results on this if you have
something?


Regards
Takayuki Tsunakawa






Re: Blocking I/O, async I/O and io_uring

From: Craig Ringer
Date:
On Tue, 8 Dec 2020 at 12:02, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-12-08 10:55:37 +0800, Craig Ringer wrote:
> > A new kernel API called io_uring has recently come to my attention. I
> > assume some of you (Andres?) have been following it for a while.
>
> Yea, I've spent a *lot* of time working on AIO support, utilizing
> io_uring. Recently Thomas also joined in the fun. I've given two talks
> referencing it (last pgcon, last pgday brussels), but otherwise I've not
> yet written much about it. Things aren't *quite* right yet architecturally,
> but I think we're getting there.

That's wonderful. Thank you.

I'm badly behind on the conference circuit due to geographic isolation and small children. I'll hunt up your talks.

> The current state is at https://github.com/anarazel/postgres/tree/aio
> (but it's not a very clean history at the moment).

Fantastic!

Have you done much bpf / systemtap / perf based work on measurement and tracing of latencies etc? If not that's something I'd be keen to help with. I've mostly been using systemtap so far but I'm trying to pivot over to bpf.

I hope to submit a big tracepoints patch set for PostgreSQL soon to better expose our wait points and latencies, improve visibility of blocking, and help make activity traceable through all the stages of processing. I'll Cc you when I do.
 
> > io_uring appears to offer a way to make system calls including reads,
> > writes, fsync()s, and more in a non-blocking, batched and pipelined manner,
> > with or without O_DIRECT. Basically async I/O with usable buffered I/O and
> > fsync support. It has ordering support which is really important for us.
>
> My results indicate that we really want to have, optional & not
> enabled by default of course, O_DIRECT support. We just can't benefit
> fully from modern SSDs otherwise. Buffered is also important, of course.

Even more so for NVDRAM, Optane and all that, where zero-copy and low context switches becomes important too.

We're a long way from that being a priority but it's still not to be dismissed.

> I'm pretty sure that I've got the basics of this working pretty well. I
> don't think the executor architecture is as big an issue as you seem to
> think. There are further benefits that could be unlocked if we had a
> more flexible executor model (imagine switching between different parts
> of the query whenever blocked on IO - can't do that due to the stack
> right now).

Yep, that's what I'm talking about being an issue.

Blocked on an index read? Move on to the next tuple and come back when the index read is done.

I really like what I see of the io_uring architecture so far. It's ideal for callback-based event-driven flow control. But that doesn't fit postgres well for the executor. It's better for redo etc.



> The way it currently works is that things like sequential scans, vacuum,
> etc use a prefetching helper which will try to use AIO to read ahead of
> the next needed block. That helper uses callbacks to determine the next
> needed block, which e.g. vacuum uses to skip over all-visible/frozen
> blocks. There's plenty of other places that should use that helper, but we
> already can get considerably higher throughput for seqscans and vacuum on
> both very fast local storage and high-latency cloud storage.
>
> Similarly, for writes there's a small helper to manage a write-queue of
> configurable depth, which currently is used by checkpointer and
> bgwriter (but should be used in more places). Especially with direct IO
> checkpointing can be a lot faster *and* less impactful on the "regular"
> load.

Sure sounds like a useful interim step. That's great.

> I've got asynchronous writing of WAL mostly working, but need to
> redesign the locking a bit further. Right now it's a win in some cases,
> but not others. The latter to a significant degree due to unnecessary
> blocking....

That's where io_uring's I/O ordering operations looked interesting. But I haven't looked closely enough to see if they're going to help us with I/O ordering in a multiprocessing architecture like postgres.

In an ideal world we could tell the kernel about WAL-to-heap I/O dependencies and even let it apply WAL then heap changes out-of-order so long as they didn't violate any ordering constraints we specify between particular WAL records or between WAL writes and their corresponding heap blocks. But I don't know if the io_uring interface is that capable.

I did some basic experiments a while ago with using write barriers between WAL records and heap writes instead of fsync()ing, but as you note, the increased blocking and reduction in the kernel's ability to do I/O reordering is generally worse than the costs of the fsync()s we do now.

> I'm thinking that redo is probably a good first candidate. It doesn't
> depend on the guts of the executor. It is much less sensitive to
> ordering between operations in shmem and on disk since it runs in the
> startup process. And it hurts REALLY BADLY from its single-threaded
> blocking approach to I/O - as shown by an extension written by
> 2ndQuadrant that can double redo performance by doing read-ahead on
> btree pages that will soon be needed.

> Thomas has a patch for prefetching during WAL apply. It currently uses
> posix_fadvise(), but he took care that it'd be fairly easy to rebase it
> onto "real" AIO. Most of the changes necessary are pretty independent of
> posix_fadvise vs aio.

Cool. You know we worked on something like that in 2ndQ too, with fast_redo, and it's pretty effective at reducing the I/O waits for b-tree index maintenance.

How feasible do you think it'd be to take it a step further and structure redo as a pipelined queue, where redo calls enqueue I/O operations and completion handlers then return immediately? Everything still goes to disk in the order it's enqueued, and the callbacks will be invoked in order, so they can update appropriate shmem state etc. Since there's no concurrency during redo, it should be much simpler than normal user backend operations where we have all the tight coordination of buffer management, WAL write ordering, PGXACT and PGPROC, the clog, etc.

So far the main issue I see with it is that there are still way too many places we'd have to block because of logic that requires the result of a read in order to perform a subsequent write. We can't just turn those into event driven continuations on the queue and keep going unless we can guarantee that the later WAL we apply while we're waiting is independent of any changes the earlier pending writes might make and that's hard, especially with b-trees. And it's those read-then-write ordering points that hurt our redo performance the most already.
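The queue structure sketched above might look roughly like this (purely illustrative names; it deliberately glosses over the read-then-write dependencies identified as the hard part):

```python
from collections import deque

# Illustrative pipelined redo queue: redo enqueues an I/O plus a
# completion handler and moves on; handlers run strictly in enqueue
# order, so shmem state updates stay correctly ordered.
class RedoPipeline:
    def __init__(self, submit_io):
        self.submit_io = submit_io     # would start async I/O for real
        self.queue = deque()

    def enqueue(self, io, on_complete):
        # Start the I/O and remember its handler; do not wait here.
        self.queue.append((self.submit_io(io), on_complete))

    def drain(self):
        # Completions are consumed in submission order, never reordered.
        results = []
        while self.queue:
            result, on_complete = self.queue.popleft()
            results.append(on_complete(result))
        return results
```

A record whose write depends on a pending read would still have to stall the pipeline here, which is exactly the limitation described above.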

Re: Blocking I/O, async I/O and io_uring

From: Andres Freund
Date:
Hi,

On 2020-12-08 13:01:38 +0800, Craig Ringer wrote:
> Have you done much bpf / systemtap / perf based work on measurement and
> tracing of latencies etc? If not that's something I'd be keen to help with.
> I've mostly been using systemtap so far but I'm trying to pivot over to
> bpf.

Not much - there are still so many low-hanging fruits and architectural
things to finish that it didn't yet seem pressing.




> > I've got asynchronous writing of WAL mostly working, but need to
> > redesign the locking a bit further. Right now it's a win in some cases,
> > but not others. The latter to a significant degree due to unnecessary
> > blocking....

> That's where io_uring's I/O ordering operations looked interesting. But I
> haven't looked closely enough to see if they're going to help us with I/O
> ordering in a multiprocessing architecture like postgres.

The ordering ops aren't quite powerful enough to be a huge boon
performance-wise (yet). They can cut down on syscall and intra-process
context switch overhead to some degree, but otherwise it's not different
from userspace submitting another request upon receiving a completion.


> In an ideal world we could tell the kernel about WAL-to-heap I/O
> dependencies and even let it apply WAL then heap changes out-of-order so
> long as they didn't violate any ordering constraints we specify between
> particular WAL records or between WAL writes and their corresponding heap
> blocks. But I don't know if the io_uring interface is that capable.

It's not. And that kind of dependency inference wouldn't be cheap on
the PG side either.

I don't think it'd help that much for WAL apply anyway. You need
read-ahead of the WAL to avoid unnecessary waits for a lot of records
anyway. And the writes during WAL apply are mostly pretty asynchronous
(mainly writeback during buffer replacement).

An imo considerably more interesting case is avoiding blocking on a WAL
flush when needing to write a page out in an OLTPish workload. But I can
think of more efficient ways there too.
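The blocking in question comes from the WAL-before-data rule: a dirty page may not be written out until WAL is durable up to that page's LSN, so buffer eviction in an OLTP workload can stall on a synchronous WAL flush. A toy model of the invariant (hypothetical names, not the backend's actual functions):

```python
# Toy model of the WAL-before-data rule: a dirty buffer may only be
# written once WAL is flushed at least up to the buffer's page LSN.
class WalLog:
    def __init__(self):
        self.insert_lsn = 0     # how much WAL has been generated
        self.flushed_lsn = 0    # how much WAL is durable on disk

    def insert(self, nbytes):
        self.insert_lsn += nbytes
        return self.insert_lsn

    def flush_upto(self, lsn):  # the potentially blocking step
        self.flushed_lsn = max(self.flushed_lsn,
                               min(lsn, self.insert_lsn))

def write_buffer(page_lsn, wal, do_write):
    if wal.flushed_lsn < page_lsn:
        wal.flush_upto(page_lsn)   # OLTP eviction often stalls here;
                                   # async WAL writes could overlap it
    do_write()                     # only now may the page go to disk
```

An asynchronous WAL write pipeline would let the flush overlap other work instead of stalling the evicting backend.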


> How feasible do you think it'd be to take it a step further and structure
> redo as a pipelined queue, where redo calls enqueue I/O operations and
> completion handlers then return immediately? Everything still goes to disk
> in the order it's enqueued, and the callbacks will be invoked in order, so
> they can update appropriate shmem state etc. Since there's no concurrency
> during redo, it should be *much* simpler than normal user backend
> operations where we have all the tight coordination of buffer management,
> WAL write ordering, PGXACT and PGPROC, the clog, etc.

I think it'd be a fairly massive increase in complexity. And I don't see
a really large payoff: Once you have real readahead in the WAL there's
really not much synchronous IO left. What am I missing?

Greetings,

Andres Freund



Re: Blocking I/O, async I/O and io_uring

From: Andres Freund
Date:
Hi,

On 2020-12-08 04:24:44 +0000, tsunakawa.takay@fujitsu.com wrote:
> I'm looking forward to this from the async+direct I/O, since the
> throughput of some write-heavy workload decreased by half or more
> during checkpointing (due to fsync?)

Depends on why that is. The most common, I think, cause is that your WAL
volume increases drastically just after a checkpoint starts, because
initially all page modifications will trigger full-page writes.  There's
a significant slowdown even if you prevent the checkpointer from doing
*any* writes at that point.  I got the WAL AIO stuff to the point that I
see a good bit of speedup at high WAL volumes, and I see it helping in
this scenario.

There's of course also the issue that checkpoint writes cause other IO
(including WAL writes) to slow down and, importantly, cause a lot of
jitter leading to unpredictable latencies.  I've seen some good and some
bad results around this with the patch, but there's a bunch of TODOs to
resolve before delving deeper really makes sense (the IO depth control
is not good enough right now).

A third issue is that sometimes checkpointer can't really keep up - and
that I think I've seen pretty clearly addressed by the patch. I have
managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
the checkpointer, and I think I know what to do for the remainder.


> Would you mind sharing any preliminary results on this if you have
> something?

I ran numbers at some point, but since then enough has changed
(including many correctness issues fixed) that they don't seem really
relevant anymore.  I'll try to include some in the post I'm planning to
do in a few weeks.

Greetings,

Andres Freund



Re: Blocking I/O, async I/O and io_uring

From: Fujii Masao
Date:

On 2020/12/08 11:55, Craig Ringer wrote:
> Hi all
>
> A new kernel API called io_uring has recently come to my attention. I assume some of you (Andres?) have been following it for a while.
>
> io_uring appears to offer a way to make system calls including reads, writes, fsync()s, and more in a non-blocking, batched and pipelined manner, with or without O_DIRECT. Basically async I/O with usable buffered I/O and fsync support. It has ordering support which is really important for us.
>
> This should be on our radar. The main barriers to benefiting from linux-aio based async I/O in postgres in the past have been its reliance on direct I/O, the various kernel-version quirks, platform portability, and its maybe-async-except-when-it's-randomly-not nature.
>
> The kernel version and portability remain an issue with io_uring so it's not like this is something we can pivot over to completely. But we should probably take a closer look at it.
>
> PostgreSQL spends a huge amount of time waiting, doing nothing, for blocking I/O. If we can improve that then we could potentially realize some major increases in I/O utilization especially for bigger, less concurrent workloads. The most obvious candidates to benefit would be redo, logical apply, and bulk loading.
>
> But I have no idea how to even begin to fit this into PostgreSQL's executor pipeline. Almost all PostgreSQL's code is synchronous-blocking-imperative in nature, with a push/pull executor pipeline. It seems to have been recognised for some time that this is increasingly hurting our performance and scalability as platforms become more and more parallel.
>
> To benefit from AIO (be it POSIX, linux-aio, io_uring, Windows AIO, etc) we have to be able to dispatch I/O and do something else while we wait for the results. So we need the ability to pipeline the executor and pipeline redo.
>
> I thought I'd start the discussion on this and see where we can go with it. What incremental steps can be done to move us toward parallelisable I/O without having to redesign everything?
>
> I'm thinking that redo is probably a good first candidate. It doesn't depend on the guts of the executor. It is much less sensitive to ordering between operations in shmem and on disk since it runs in the startup process. And it hurts REALLY BADLY from its single-threaded blocking approach to I/O - as shown by an extension written by 2ndQuadrant that can double redo performance by doing read-ahead on btree pages that will soon be needed.
>
> Thoughts anybody?

I was wondering if async I/O might be helpful for the performance
improvement of walreceiver. In physical replication, walreceiver receives,
writes and fsyncs WAL data. Also it does tasks like keepalive. Since
walreceiver is a single process, for example, currently it cannot do other
tasks while fsyncing WAL to the disk.

OTOH, if walreceiver can do other tasks even while fsyncing WAL by
using async I/O, ISTM that it might improve the performance of walreceiver.
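A toy model of that overlap, using a helper thread in place of real async I/O (names are hypothetical; the actual walreceiver is a single C process):

```python
import threading
import time

# Sketch of the idea above: hand the WAL fsync to a helper so the
# walreceiver loop can keep handling keepalives while the flush runs.
def walreceiver_round(do_fsync, send_keepalive):
    flushed = threading.Event()

    def flusher():
        do_fsync()                  # the slow, blocking step
        flushed.set()

    t = threading.Thread(target=flusher)
    t.start()                       # fsync now runs concurrently
    handled = 0
    while not flushed.is_set():     # meanwhile, do other tasks
        send_keepalive()
        handled += 1
        time.sleep(0.001)
    t.join()
    return handled
```

With real async I/O the same overlap would happen in one process, with the completion consumed from the walreceiver's event loop instead of a thread.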

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Blocking I/O, async I/O and io_uring

From: Craig Ringer
Date:
On Tue, 8 Dec 2020 at 15:04, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2020-12-08 04:24:44 +0000, tsunakawa.takay@fujitsu.com wrote:
> > I'm looking forward to this from the async+direct I/O, since the
> > throughput of some write-heavy workload decreased by half or more
> > during checkpointing (due to fsync?)
>
> Depends on why that is. The most common, I think, cause is that your WAL
> volume increases drastically just after a checkpoint starts, because
> initially all page modifications will trigger full-page writes.  There's
> a significant slowdown even if you prevent the checkpointer from doing
> *any* writes at that point.  I got the WAL AIO stuff to the point that I
> see a good bit of speedup at high WAL volumes, and I see it helping in
> this scenario.
>
> There's of course also the issue that checkpoint writes cause other IO
> (including WAL writes) to slow down and, importantly, cause a lot of
> jitter leading to unpredictable latencies.  I've seen some good and some
> bad results around this with the patch, but there's a bunch of TODOs to
> resolve before delving deeper really makes sense (the IO depth control
> is not good enough right now).
>
> A third issue is that sometimes checkpointer can't really keep up - and
> that I think I've seen pretty clearly addressed by the patch. I have
> managed to get to ~80% of my NVMe disk's top write speed (> 2.5GB/s) by
> the checkpointer, and I think I know what to do for the remainder.


Thanks for explaining this. I'm really glad you're looking into it. If I get the chance I'd like to try to apply some wait-analysis and blocking stats tooling to it. I'll report back if I make any progress there.