Thread: Don't synchronously wait for already-in-progress IO in read stream
Hi,
In the index prefetching thread we discovered that read stream performance
suffers rather substantially when a read stream is reading blocks multiple
times within the readahead distance.
The problem leading to that is that we are currently synchronously waiting for
IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
IO. If a block is read twice, that means we won't actually have enough IOs in
flight to have good performance. What's worse, currently the wait is not
attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
IO).
This does not commonly occur with in-tree users of read streams, as users like
seqscans, bitmap heap scans, vacuum, ... will never try to read the same block
twice. However with index prefetching that is a more common case.
It is possible to encounter a version of this in 18/master: If multiple scans
for the same table are in progress, they can end up waiting synchronously for
each other. However it's a much less severe issue, as the scan that is
"further ahead" will not be blocked.
To fix it, the attached patch has AsyncReadBuffers() check if the "target"
buffer already has IO in progress. If so, it assigns the buffer's IO wait
reference to the ReadBuffersOperation. That allows WaitReadBuffers() to wait
for the IO. To make that work correctly, the buffer stats etc have to be
updated in that case in WaitReadBuffers().
I did not feel like I was sufficiently confident in making this work without
tests. However, it's not exactly trivial to test some versions of this, due to
the way multiple processes need to be coordinated. It took way way longer to
write tests than to make the code actually work :/.
The attached tests add a new read_stream_for_blocks() function to test_aio. I
found it also rather useful to reproduce the performance issue without the
index prefetching patch applied. To test the cross process case the injection
point infrastructure in test_aio had to be extended a bit.
Attached are three patches:
0001: Introduces a TestAio package and splits some existing tests out of
001_aio.pl
0002: Add read_stream test infrastructure & tests
Note that the tests don't test that we don't unnecessarily wait, as
described above, just that IO works correctly.
0003: Improve performance of read stream when re-encountering blocks
To reproduce the issue, the read_stream_for_blocks() function added to
test_aio can be used, in combination with debug_io_direct=data (it's also
possible without DIO, it'd just be more work).
prep:
CREATE EXTENSION test_aio;
CREATE TABLE large AS SELECT i, repeat(random()::text, 5) FROM generate_series(1, 1000000) g(i);
test:
SELECT pg_buffercache_evict_relation('large');
EXPLAIN ANALYZE SELECT * FROM read_stream_for_blocks('large', ARRAY(SELECT i + generate_series(0, 3) FROM
generate_series(1,10000) g(i)));
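The ARRAY expression deliberately produces overlapping block numbers: each of the 10000 start positions contributes four consecutive blocks, so most blocks are requested four times within the readahead distance. That is exactly what the buffer counts below reflect (hit=29997 + read=10003 = 40000 requests over 10003 distinct blocks). A quick Python check of the same pattern:

```python
from collections import Counter

# Mirror the SQL block array: for each i in 1..10000, emit blocks
# i, i+1, i+2, i+3 (i.e. i + generate_series(0, 3)).
blocks = [i + d for i in range(1, 10001) for d in range(4)]
counts = Counter(blocks)

print(len(blocks))   # 40000 block requests in total
print(len(counts))   # 10003 distinct blocks -> matches "read=10003"
print(counts[100])   # interior blocks are requested 4 times each
```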
Without 0003 applied:
                                                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=673.647..675.254 rows=40000.00 loops=1)
   Buffers: shared hit=29997 read=10003
   I/O Timings: shared read=16.116
   InitPlan 1
     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.556..7.575 rows=40000.00 loops=1)
           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.556..4.804 rows=40000.00 loops=1)
                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.554..0.988 rows=10000.00 loops=1)
 Planning Time: 0.060 ms
 Execution Time: 676.436 ms
(9 rows)

Time: 676.753 ms
With 0003:
                                                               QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Function Scan on read_stream_for_blocks  (cost=975.00..985.00 rows=1000 width=12) (actual time=87.730..89.453 rows=40000.00 loops=1)
   Buffers: shared hit=29997 read=10003
   I/O Timings: shared read=50.467
   InitPlan 1
     ->  Result  (cost=0.00..975.00 rows=40000 width=4) (actual time=0.541..7.496 rows=40000.00 loops=1)
           ->  ProjectSet  (cost=0.00..375.00 rows=40000 width=8) (actual time=0.540..4.772 rows=40000.00 loops=1)
                 ->  Function Scan on generate_series g  (cost=0.00..100.00 rows=10000 width=4) (actual time=0.539..0.965 rows=10000.00 loops=1)
 Planning Time: 0.046 ms
 Execution Time: 90.661 ms
(9 rows)

Time: 90.887 ms
Greetings,
Andres Freund
Attachments
On Thu, Sep 11, 2025 at 5:46 PM Andres Freund <andres@anarazel.de> wrote:
> The problem leading to that is that we are currently synchronously waiting for
> IO on a buffer when AsyncReadBuffers() encounters a buffer already undergoing
> IO. If a block is read twice, that means we won't actually have enough IOs in
> flight to have good performance. What's worse, currently the wait is not
> attributed to IO wait (since we're waiting in WaitIO, rather than waiting for
> IO).

This patch no longer cleanly applies. Can you post a new version?

Thanks
--
Peter Geoghegan
On Fri, Sep 12, 2025 at 9:46 AM Andres Freund <andres@anarazel.de> wrote:
+ * It's possible that another backend starts IO on the buffer between this
+ * check and the ReadBuffersCanStartIO(nowait = false) below. In that case
+ * we will synchronously wait for the IO below, but the window for that is
+ * small enough that it won't happen often enough to have a significant
+ * performance impact.
+ */
+ if (ReadBuffersIOAlreadyInProgress(operation, buffers[nblocks_done]))
...
/*
* Check if we can start IO on the first to-be-read buffer.
*
- * If an I/O is already in progress in another backend, we want to wait
- * for the outcome: either done, or something went wrong and we will
- * retry.
+ * If a synchronous I/O is in progress in another backend (it can't be
+ * this backend), we want to wait for the outcome: either done, or
+ * something went wrong and we will retry.
*/
if (!ReadBuffersCanStartIO(buffers[nblocks_done], false))
"..., or an asynchronous IO was started after
ReadBuffersIOAlreadyInProgress() (unlikely), ..."?
I suppose (or perhaps vaguely recall from an off-list discussion?)
that you must have considered merging the new
is-it-already-in-progress check into ReadBuffersCanStartIO(). I
suppose the nowait argument would become a tri-state argument with a
value that means "don't wait for an in-progress read, just give me the
IO handle so I can 'join' it as a foreign waiter", with a new output
argument to receive the handle, or something along those lines, and I
guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
name. That'd remove the double-check (extra header lock-unlock cycle)
and associated race that can cause that rare synchronous wait (which
must still happen sometimes in the duelling concurrent scan use
case?), at the slight extra cost of having to allocate and free a
handle in the case of repeated blocks (eg the index->heap scan use
case), but at least that's just backend-local list pushups and doesn't
do extra work otherwise. Is there some logical problem with that
approach? Is the code just too clumsy?
On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> I suppose (or perhaps vaguely recall from an off-list discussion?)
> that you must have considered merging the new
> is-it-already-in-progress check into ReadBuffersCanStartIO(). I
> suppose the nowait argument would become a tri-state argument with a
> value that means "don't wait for an in-progress read, just give me the
> IO handle so I can 'join' it as a foreign waiter", with a new output
> argument to receive the handle, or something along those lines, and I
> guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
> name. That'd remove the double-check (extra header lock-unlock cycle)
> and associated race that can cause that rare synchronous wait (which
> must still happen sometimes in the duelling concurrent scan use
> case?), at the slight extra cost of having to allocate and free a
> handle in the case of repeated blocks (eg the index->heap scan use
> case), but at least that's just backend-local list pushups and doesn't
> do extra work otherwise. Is there some logical problem with that
> approach? Is the code just too clumsy?

Attached v3 basically does what you suggested above. Now, we should
only have to wait if the backend encounters a buffer after another
backend has set BM_IO_IN_PROGRESS but before that other backend has
set the buffer descriptor's wait reference.

0001 and 0002 are Andres' test-related patches. 0003 is a change I
think is required to make one of the tests stable (esp on the BSDs).
0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
IO concept but with your suggested structure and my suggested styling.
I could potentially break out more into smaller refactoring commits,
but I don't think it's too bad the way it is.

A few things about the patch that I'm not sure about:

- I don't know if pgaio_submit_staged() is in all the right places
(and not in too many places). I basically do it before we would wait
when starting read IO on the buffer. In the permanent buffers case,
that's now only when BM_IO_IN_PROGRESS is set but the wait reference
isn't valid yet. This can't happen in the temporary buffers case, so
I'm not sure we need to call pgaio_submit_staged().

- StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
We could refactor it so that it works for AsyncReadBuffers(), but that
would involve returning something that distinguishes between
IO_IN_PROGRESS and IO already done. And StartBufferIO()'s comment
explicitly says it wants to avoid that.
If we keep my structure, with AsyncReadBuffers() using its own helper
(PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
like we need some way to make it clear what StartBufferIO() is for.
I'm not sure what would collectively describe its current users,
though. It also now has no non-test callers passing nowait as true.
However, once we add write combining, it will, so it seems like we
should leave it the way it is to avoid churn. However, other
developers might be confused in the interim.

- In the 004_read_stream tests, I wonder if there is a way to test
that we don't wait for foreign IO until WaitReadBuffers(). We have
tests for the stream accessing the same block, which in some cases
will exercise the foreign IO path. But it doesn't distinguish between
the old behavior -- waiting for the IO to complete when starting read
IO on it -- and the new behavior -- not waiting until
WaitReadBuffers(). That may not be possible to test, though.

- Melanie
Attachments
Hi,
Thank you for working on this!
On Sat, 24 Jan 2026 at 00:04, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Sun, Nov 9, 2025 at 5:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> >
> > I suppose (or perhaps vaguely recall from an off-list discussion?)
> > that you must have considered merging the new
> > is-it-already-in-progress check into ReadBuffersCanStartIO(). I
> > suppose the nowait argument would become a tri-state argument with a
> > value that means "don't wait for an in-progress read, just give me the
> > IO handle so I can 'join' it as a foreign waiter", with a new output
> > argument to receive the handle, or something along those lines, and I
> > guess you'd need a tri-state result, and perhaps s/Can/Try/ in the
> > name. That'd remove the double-check (extra header lock-unlock cycle)
> > and associated race that can cause that rare synchronous wait (which
> > must still happen sometimes in the duelling concurrent scan use
> > case?), at the slight extra cost of having to allocate and free a
> > handle in the case of repeated blocks (eg the index->heap scan use
> > case), but at least that's just backend-local list pushups and doesn't
> > do extra work otherwise. Is there some logical problem with that
> > approach? Is the code just too clumsy?
>
> Attached v3 basically does what you suggested above. Now, we should
> only have to wait if the backend encounters a buffer after another
> backend has set BM_IO_IN_PROGRESS but before that other backend has
> set the buffer descriptor's wait reference.
>
> 0001 and 0002 are Andres' test-related patches. 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).
> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.
I confirm that I am able to produce the regression that Andres
mentioned with the patches excluding 0005, and 0005 fixes the
regression.
> A few things about the patch that I'm not sure about:
>
> - I don't know if pgaio_submit_staged() is in all the right places
> (and not in too many places). I basically do it before we would wait
> when starting read IO on the buffer. In the permanent buffers case,
> that's now only when BM_IO_IN_PROGRESS is set but the wait reference
> isn't valid yet. This can't happen in the temporary buffers case, so
> I'm not sure we need to call pgaio_submit_staged().
I agree with you, I think we don't need to call pgaio_submit_staged()
for the temporary buffers case.
> - StartBufferIO() is no longer invoked in the AsyncReadBuffers() path.
> We could refactor it so that it works for AsyncReadBuffers(), but that
> would involve returning something that distinguishes between
> IO_IN_PROGRESS and IO already done. And StartBufferIO()'s comment
> explicitly says it wants to avoid that.
> If we keep my structure, with AsyncReadBuffers() using its own helper
> (PrepareNewReadBufferIO()) instead of StartBufferIO(), then it seems
> like we need some way to make it clear what StartBufferIO() is for.
> I'm not sure what would collectively describe its current users,
> though. It also now has no non-test callers passing nowait as true.
> However, once we add write combining, it will, so it seems like we
> should leave it the way it is to avoid churn. However, other
> developers might be confused in the interim.
I don't have a comment for this.
> - In the 004_read_stream tests, I wonder if there is a way to test
> that we don't wait for foreign IO until WaitReadBuffers(). We have
> tests for the stream accessing the same block, which in some cases
> will exercise the foreign IO path. But it doesn't distinguish between
> the old behavior -- waiting for the IO to complete when starting read
> IO on it -- and the new behavior -- not waiting until
> WaitReadBuffers(). That may not be possible to test, though.
Won't 'stream accessing the same block test' almost always test the
new behavior (not waiting until WaitReadBuffers())? Having dedicated
tests for both cases would be helpful, though.
My review:
0001:
0001 LGTM.
---------------
0002:
diff --git a/src/test/modules/test_aio/t/004_read_stream.pl
b/src/test/modules/test_aio/t/004_read_stream.pl
+foreach my $method (TestAio::supported_io_methods())
+{
+ $node->adjust_conf('postgresql.conf', 'io_method', 'worker');
+ $node->start();
+ test_io_method($method, $node);
+ $node->stop();
+}
This seems wrong: we always test io_method=worker. I think we need to
replace 'worker' with $method. Also, we can add the check below to the
test_io_method function in 004_read_stream.pl:
```
is($node->safe_psql('postgres', 'SHOW io_method'),
$io_method, "$io_method: io_method set correctly");
```
Other than that, 0002 LGTM.
---------------
0003:
> 0003 is a change I
> think is required to make one of the tests stable (esp on the BSDs).
0003 LGTM.
---------------
> 0004 is a bit of preliminary refactoring and 0005 is Andres' foreign
> IO concept but with your suggested structure and my suggested styling.
> I could potentially break out more into smaller refactoring commits,
> but I don't think it's too bad the way it is.
0004:
Nitpick but I prefer something like TrackBufferHit() or
CountBufferHit() as a function name instead of ProcessBufferHit().
ProcessBufferHit() gives the impression that the function is doing a
job more than it currently does. Other than that, 0004 LGTM.
---------------
0005:
0005 LGTM. However, I am still looking into the AIO code. I wanted to
share my review so far.
---------------
--
Regards,
Nazir Bilal Yavuz
Microsoft