Discussion: pgstattuple: Use streaming read API in pgstatindex functions
Hi hackers,

While reading the code related to streaming reads and their current use cases, I noticed that pgstatindex could potentially benefit from adopting the streaming read API. The required change is relatively simple, similar to what has already been implemented in the pg_prewarm and pg_visibility extensions. I also ran some performance tests on an experimental patch to validate the improvement.

Summary

Cold cache performance (the typical use case for diagnostic tools):
- Medium indexes (~21MB): 1.21x-1.79x faster (20-44% speedup)
- Large indexes (~214MB): 1.50x-1.90x faster (30-47% speedup)
- XLarge indexes (~1351MB): 1.4x-1.9x faster (29-47% speedup)

Hardware: AX162-R from Hetzner

Test matrix:
- Index types: Primary key, timestamp, float, composite (3 columns)
- Sizes: Medium (1M rows, ~21MB), Large (10M rows, ~214MB), XLarge (50M rows, ~1351MB)
- Layouts: Unfragmented (sequential) and Fragmented (random insert order)
- Cache states: Cold (dropped OS cache) and Warm (pg_prewarm)

XLarge fragmented example:

==> Creating secondary indexes on test_xlarge
Created 3 secondary indexes: created_at, score, composite
Created test_xlarge_pkey: 1351 MB
Fragmentation stats (random insert order):
 leaf_frag_pct | avg_density_pct | leaf_pages |  size
---------------+-----------------+------------+---------
          49.9 |            71.5 |     172272 | 1351 MB
(1 row)

Configuration:
- shared_buffers = 16GB
- effective_io_concurrency = 500
- io_combine_limit = 16
- autovacuum = off
- checkpoint_timeout = 1h
- bgwriter_delay = 10000ms (minimize background writes)
- jit = off
- max_parallel_workers_per_gather = 0

Unfragmented Indexes (Cold Cache)

Index Type         Size    Baseline   Patched    Speedup
Primary Key        Medium  31.5 ms    19.6 ms    1.58×
Primary Key        Large   184.0 ms   119.0 ms   1.54×
Timestamp          Medium  13.4 ms    10.5 ms    1.28×
Timestamp          Large   85.0 ms    45.6 ms    1.86×
Float (score)      Medium  13.7 ms    11.4 ms    1.21×
Float (score)      Large   84.0 ms    45.0 ms    1.86×
Composite (3 col)  Medium  26.7 ms    17.2 ms    1.56×
Composite (3 col)  Large   89.8 ms    51.2 ms    1.75×

Fragmented Indexes (Cold Cache)

To address concerns about filesystem fragmentation, I tested indexes built with random inserts (ORDER BY random()) to trigger page splits and create fragmented indexes:

Index Type         Size    Baseline    Patched    Speedup
Primary Key        Medium  41.9 ms     23.5 ms    1.79×
Primary Key        Large   236.0 ms    148.0 ms   1.58×
Primary Key        XLarge  953.4 ms    663.1 ms   1.43×
Timestamp          Medium  32.1 ms     18.8 ms    1.70×
Timestamp          Large   188.0 ms    117.0 ms   1.59×
Timestamp          XLarge  493.0 ms    518.6 ms   0.95×
Float (score)      Medium  14.0 ms     10.9 ms    1.28×
Float (score)      Large   85.8 ms     45.2 ms    1.89×
Float (score)      XLarge  263.2 ms    176.5 ms   1.49×
Composite (3 col)  Medium  42.3 ms     24.1 ms    1.75×
Composite (3 col)  Large   245.0 ms    162.0 ms   1.51×
Composite (3 col)  XLarge  1052.5 ms   716.5 ms   1.46×

Summary: Fragmentation generally does not hurt streaming reads; most fragmented cases still see 1.4x-1.9x gains. One outlier (XLarge Timestamp) shows a slight regression (0.95x).

Warm Cache Results

When indexes are fully cached in shared_buffers:
- Unfragmented: occasional slight regressions for small and medium indexes (single-digit ms variance, barely noticeable); small gains for large indexes.
- Fragmented: occasional slight regressions for small and medium indexes (single-digit ms variance, barely noticeable); small gains for large indexes.

Best,
Xuneng
Attachments
Hi,

On Mon, Oct 13, 2025 at 10:07 AM Xuneng Zhou <xunengzhou@gmail.com> wrote:
> [...]

Fix indentation issue in v1.

Best,
Xuneng
Attachments
Hi,

Thank you for working on this!

On Mon, 13 Oct 2025 at 06:20, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Fix indentation issue in v1.

I did not look at the benchmarks, so here are my code comments.

- I would avoid creating a new scope for the streaming read. While it
makes the streaming code easier to interpret, it introduces a large
diff due to indentation changes.

- I suggest adding the following comment about streaming batching, as
it is used in other places:

/*
 * It is safe to use batchmode as block_range_read_stream_cb takes no
 * locks.
 */

- For the '/* Scan all blocks except the metapage using streaming
reads */' comments, it might be helpful to clarify that the 0th page
is the metapage. Something like: '/* Scan all blocks except the
metapage (0th page) using streaming reads */'.

Other than these comments, the code looks good to me.

--
Regards,
Nazir Bilal Yavuz
Microsoft
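Putting the two suggested comments in context, the stream setup in the patch presumably follows the pattern already used in pg_prewarm and pg_visibility. The fragment below is a sketch of that pattern, not the actual patch; it is not compilable standalone (it assumes PostgreSQL internals such as ReadStream, read_stream_begin_relation, and local variables rel, bstrategy, and nblocks):

```c
/*
 * Sketch only: wiring pgstatindex's block loop to the streaming read API,
 * following the pg_prewarm/pg_visibility pattern.
 */
BlockRangeReadStreamPrivate p;
ReadStream *stream;
Buffer		buffer;

/* Scan all blocks except the metapage (0th page) using streaming reads */
p.current_blocknum = 1;
p.last_exclusive = nblocks;

/*
 * It is safe to use batchmode as block_range_read_stream_cb takes no
 * locks.
 */
stream = read_stream_begin_relation(READ_STREAM_FULL |
									READ_STREAM_USE_BATCHING,
									bstrategy,
									rel,
									MAIN_FORKNUM,
									block_range_read_stream_cb,
									&p,
									0);

while ((buffer = read_stream_next_buffer(stream, NULL)) != InvalidBuffer)
{
	/* lock the buffer, examine the page, accumulate index stats ... */
}

read_stream_end(stream);
```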
Hi Bilal,

Thanks for looking into this.

On Mon, Oct 13, 2025 at 3:00 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> [...]
> Other than these comments, the code looks good to me.

Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.

Best,
Xuneng
Attachments
Hi Xuneng Zhou
> - /* Unlock and release buffer */
> - LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> - ReleaseBuffer(buffer);
> + UnlockReleaseBuffer(buffer);
> }
> - /* Unlock and release buffer */
> - LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
> - ReleaseBuffer(buffer);
> + UnlockReleaseBuffer(buffer);
> }
Thanks for your patch! Just to nitpick a bit — I think this comment is worth keeping, even though the function name already conveys its meaning.
On Mon, Oct 13, 2025 at 4:42 PM Xuneng Zhou <xunengzhou@gmail.com> wrote:
[...]
Here is patch v3. The comments have been added, and the extra scope
({}) has been removed as suggested.
Hi,

On Mon, 13 Oct 2025 at 11:42, Xuneng Zhou <xunengzhou@gmail.com> wrote:
>
> Here is patch v3. The comments have been added, and the extra scope
> ({}) has been removed as suggested.

Thanks, the code looks good to me!

The benchmark results are nice. I have a quick question about the setup: if you are benchmarking with the default io_method (worker), then the results might show the combined benefit of async I/O and streaming I/O, rather than just streaming I/O alone. If that is the case, I would suggest also running a benchmark with io_method=sync to isolate the performance impact of streaming I/O. You might get interesting results.

--
Regards,
Nazir Bilal Yavuz
Microsoft
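To run the comparison being suggested, only the io_method setting needs to change between runs (it requires a server restart to take effect). A minimal postgresql.conf fragment might look like:

```
# Streaming-only run: synchronous I/O, no asynchronous I/O involved
io_method = sync

# Default: I/O worker processes (async I/O + streaming combined)
# io_method = worker
```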
Hi Wenhui,

Thanks for looking into this.

On Mon, Oct 13, 2025 at 5:41 PM wenhui qiu <qiuwenhuifx@gmail.com> wrote:
> [...]
> Thanks for your patch! Just to nitpick a bit: I think this comment is
> worth keeping, even though the function name already conveys its meaning.

It makes sense. I'll keep it.

Best,
Xuneng
Hi,

On Mon, Oct 13, 2025 at 5:42 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> [...]
> If that is the case, I would suggest also running a benchmark with
> io_method=sync to isolate the performance impact of streaming I/O.
> You might get interesting results.

The previous benchmark was conducted using the default io_method=worker. The following results are from a new benchmark run with io_method=sync. The performance gains from streaming reads are still present, though smaller than those observed when asynchronous I/O and streaming I/O are combined, which is expected. Thanks for the suggestion!
Fragmented Indexes (Cold Cache)

Index Type         Size    Baseline     Patched     Speedup
Primary Key        Medium  43.80 ms     38.61 ms    1.13×
Primary Key        Large   238.24 ms    202.47 ms   1.17×
Primary Key        XLarge  962.90 ms    793.57 ms   1.21×
Timestamp          Medium  33.34 ms     29.98 ms    1.11×
Timestamp          Large   190.41 ms    161.34 ms   1.18×
Timestamp          XLarge  794.52 ms    647.82 ms   1.22×
Float (score)      Medium  14.46 ms     13.51 ms    1.06×
Float (score)      Large   87.38 ms     77.22 ms    1.13×
Float (score)      XLarge  278.47 ms    233.22 ms   1.19×
Composite (3 col)  Medium  44.49 ms     40.10 ms    1.10×
Composite (3 col)  Large   244.86 ms    211.01 ms   1.16×
Composite (3 col)  XLarge  1073.32 ms   872.42 ms   1.23×

Fragmented Indexes (Warm Cache)

Index Type         Size    Baseline     Patched     Speedup
Primary Key        Medium  7.91 ms      7.88 ms     1.00×
Primary Key        Large   35.58 ms     36.41 ms    0.97×
Primary Key        XLarge  126.29 ms    126.95 ms   0.99×
Timestamp          Medium  5.14 ms      6.82 ms     0.75×
Timestamp          Large   22.96 ms     29.55 ms    0.77×
Timestamp          XLarge  104.26 ms    106.01 ms   0.98×
Float (score)      Medium  3.76 ms      4.18 ms     0.90×
Float (score)      Large   13.65 ms     13.01 ms    1.04×
Float (score)      XLarge  40.58 ms     41.28 ms    0.98×
Composite (3 col)  Medium  8.23 ms      8.25 ms     0.99×
Composite (3 col)  Large   37.10 ms     37.59 ms    0.98×
Composite (3 col)  XLarge  139.89 ms    138.21 ms   1.01×

Best,
Xuneng
Attachments
Hi,
Here's the updated summary report (cold cache, fragmented indexes), now including results for the streaming I/O + io_uring configuration.
Best,
Xuneng