Re: index prefetching
От | Tomas Vondra |
---|---|
Тема | Re: index prefetching |
Дата | |
Msg-id | 9af33041-1c16-4973-855a-718aa1048ee1@vondra.me обсуждение исходный текст |
Ответ на | Re: index prefetching (Peter Geoghegan <pg@bowt.ie>) |
Ответы |
Re: index prefetching
|
Список | pgsql-hackers |
On 8/17/25 19:30, Peter Geoghegan wrote: > On Thu, Aug 14, 2025 at 10:12 PM Peter Geoghegan <pg@bowt.ie> wrote: >> As far as I know, we only have the following unambiguous performance >> regressions (that clearly need to be fixed): >> >> 1. This issue. >> >> 2. There's about a 3% loss of throughput on pgbench SELECT. > > Update: I managed to fix the performance regression with pgbench > SELECT (regression 2). Since Andres' patch fixes the other regression > (regression 1), we no longer have any known performance regression > (though I don't doubt that they still exist somewhere). I've also > added back the enable_indexscan_prefetch testing GUC (Andres asked me > to do that a few weeks back). If you set > enable_indexscan_prefetch=false, btgetbatch performance is virtually > identical to master/btgettuple. > > A working copy of the patchset with these revisions is available from: > https://github.com/petergeoghegan/postgres/tree/index-prefetch-batch-v1.6 > > The solution to the pgbench issue was surprisingly straightforward. > Profiling showed that the regression was caused by the added overhead > of using the read stream, for queries where prefetching cannot > possibly help -- such small startup costs are relatively noticeable > with pgbench's highly selective scans. It turns out that it's possible > to initially avoid using a read stream, while still retaining the > option of switching over to using a read stream later on. The trick to > fixing the pgbench issue was delaying creating a read stream for long > enough for the pgbench queries to never need to create one, without > that impacting queries that at least have some chance of benefiting > from prefetching. > > The actual heuristic I'm using to decide when to start the read stream > is simple: only start a read stream right after the scan's second > batch is returned by amgetbatch, but before we've fetched any heap > blocks related to that second batch (start using a read stream when > fetching new heap blocks from that second batch). It's possible that > that heuristic isn't sophisticated enough for other types of queries. > But either way the basic structure within indexam.c places no > restrictions on when we start a read stream. It doesn't have to be > aligned with amgetbatch-wise batch boundaries, for example (I just > found that structure convenient). > > I haven't spent much time testing this change, but it appears to work > perfectly (no pgbench regressions, but also no regressions in queries > that were already seeing significant benefits from prefetching). I'd > feel better about all this if we had better testing of the read stream > invariants by (say) adding assertions to index_scan_stream_read_next, > the read stream callback. And just having comments that explain those > invariants. > Thanks for investigating this. I think it's the right direction - simple OLTP queries should not be paying for building read_stream when there's little chance of benefit. Unfortunately, this seems to be causing regressions, both compared to master (or disabled prefetching), and to the earlier prefetch patches. I kept running the query generator [1] that builds data sets with randomized parameters, and then runs index scan queries on that, looking for differences between branches. Consider this data set: ------------------------------------------------------------------------ create unlogged table t (a bigint, b text) with (fillfactor = 20); insert into t select 1 * a, b from (select r, a, b, generate_series(0,1-1) AS p from (select row_number() over () AS r, a, b from (select i AS a, md5(i::text) AS b from generate_series(1, 10000000) s(i) ORDER BY (i + 256 * (random() - 0.5))) foo) bar) baz ORDER BY ((r * 1 + p) + 128 * (random() - 0.5)); create index idx on t(a ASC); vacuum freeze t; analyze t; ------------------------------------------------------------------------ Let's run this query (all runs are with cold caches): EXPLAIN (ANALYZE, COSTS OFF) SELECT * FROM t WHERE a BETWEEN 5085 AND 3053660 ORDER BY a ASC; 1) current patch ================ QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=0.517..6593.821 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Prefetch Distance: 2.066 Prefetch Count: 296179 Prefetch Stalls: 2553745 Prefetch Skips: 198613 Prefetch Resets: 0 Stream Ungets: 0 Stream Forwarded: 74 Prefetch Histogram: [2,4) => 289560, [4,8) => 6604, [8,16) => 15 Buffers: shared hit=2704779 read=153516 Planning: Buffers: shared hit=78 read=27 Planning Time: 5.525 ms Execution Time: 6721.599 ms (16 rows) 2) removed priorbatch (always uses read stream) =============================================== QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=1.008..1932.379 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Prefetch Distance: 87.970 Prefetch Count: 2877141 Prefetch Stalls: 1 Prefetch Skips: 198617 Prefetch Resets: 0 Stream Ungets: 27182 Stream Forwarded: 7640 Prefetch Histogram: [2,4) => 2, [4,8) => 6, [8,16) => 7, [16,32) => 10, [32,64) => 8183, [64,128) => 2868933 Buffers: shared hit=2704571 read=153516 Planning: Buffers: shared hit=78 read=27 Planning Time: 14.302 ms Execution Time: 2036.654 ms (16 rows) 3) no prefetch (same as master) =============================== set enable_indexscan_prefetch = off; QUERY PLAN ----------------------------------------------------------------------- Index Scan using idx on t (actual time=0.850..1336.723 rows=3048576.00 loops=1) Index Cond: ((a >= 5085) AND (a <= 3053660)) Index Searches: 1 Buffers: shared hit=2704779 read=153516 Planning: Buffers: shared hit=82 read=22 Planning Time: 10.696 ms Execution Time: 1433.530 ms (8 rows) The main difference in the explains is this: Prefetch Distance: 2.066 (new patch) Prefetch Distance: 87.970 (old patch, without priorbatch) The histogram just confirms this, with most prefetches either in [2,4) or [64,128) bins. The new patch has much lower prefetch distance. I believe this is the same issue with "collapsed" distance after resetting the read_stream. In that case the trouble was the reset also set distance to 1, and there were so many "hits" due to buffers read earlier it never ramped up again (we doubled it every now and then, but the decay was faster). The same thing happens here, I think. We process the first batch without using a read stream. Then after reading the second batch we create the read_stream, but it starts with distance=1 - it's just like after reset. And it never ramps up the distance, because of the hits from reading the preceding batch. For the resets, the solution (at least for now) was to remember the distance and restore it after reset. But here we don't have any distance to restore - there's no prefetch or read stream. Maybe it'd be possible to track some stats, during the initial phase, and then use that to initialize the distance for the first batch processed by read stream? Seems rather inconvenient, though. What exactly is the overhead of creating the read_stream? Is that about allocating memory, or something else? Would it be possible to reduce the overhead enough to not matter even for OLTP queries? Maybe it would be possible to initialize the read_stream only "partially", enough to do do sync I/O and track the distance, and delay only the expensive stuff? I'm also not sure it's optimal to only initialize read_stream after reading the next batch. For some indexes a batch can have hundreds of items, and that certainly could benefit from prefetching. I suppose it should be possible to initialize the read_stream half-way though a batch, right? Or is there a reason why that can't work? regards [1] https://github.com/tvondra/postgres/tree/index-prefetch-master/query-stress-test -- Tomas Vondra
В списке pgsql-hackers по дате отправления: