Re: index prefetching
From    | Tomas Vondra
Subject | Re: index prefetching
Date    |
Msg-id  | 367160ea-b1ed-4481-e804-bca509128878@enterprisedb.com
In reply to | Re: index prefetching (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Replies | Re: index prefetching
List    | pgsql-hackers
Hi,

Here's a new WIP version of the patch set adding prefetching to indexes, exploring a couple of alternative approaches. After the 2023/10/16 version of the patch I happened to have an off-list discussion with Andres, who suggested a couple of things to try, and there are a couple more things I tried on my own too. Attached is the patch series starting with the 2023/10/16 patch, and then trying the different things in separate patches (discussed later).

As usual, there's also a bunch of benchmark results - due to size I'm unable to attach all of them here (the PDFs are pretty large), but you can find them (with all the scripts etc.) at:

https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23

I'll attach only a couple of small PNGs with highlighted speedup/regression patterns - they're unreadable, and more of a pointer to the PDFs.

A quick overview of the patches
-------------------------------

v20231124-0001-prefetch-2023-10-16.patch

- same as the October 16 patch, with only minor comment tweaks

v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch

- removes the custom cache of recently prefetched blocks, replacing it simply by calling PrefetchBuffer (which checks shared buffers)

v20231124-0003-check-page-cache-using-preadv2.patch

- adds a check using preadv2(RWF_NOWAIT) to see if the whole page is in the kernel page cache

v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch

- adds back a small LRU cache to identify sequential patterns (based on benchmarks of the 0002/0003 patches)

v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
v20231124-0006-poc-reuse-vm-information.patch

- optimize the visibilitymap handling when prefetching for IOS (to deal with the overhead in the all-visible cases)

v20231124-0007-20231016-reworked.patch

- returns to the 20231016 patch, but this time with the VM optimizations from patches 0005/0006 (in retrospect I might have simply moved 0005+0006 right after 0001, but the patch evolved differently - shouldn't matter here)

Now,
let's talk about the patches one by one ...

PrefetchBuffer + preadv2 (0002+0003)
------------------------------------

After I posted the patch in October, I happened to have an off-list discussion with Andres, and he suggested trying to ditch the local cache of recently prefetched blocks, and instead:

1) call PrefetchBuffer (which checks if the page is in shared buffers, and skips the prefetch if it's already there)

2) if the page is not in shared buffers, use preadv2(RWF_NOWAIT) to check if it's in the kernel page cache

Doing (1) is trivial - PrefetchBuffer() already does the shared-buffer check, so 0002 simply removes the custom cache code. Doing (2) needs a bit more code to actually call preadv2() - 0003 adds FileCached() to fd.c and smgrcached() to smgr.c, and calls it from PrefetchBuffer() right before smgrprefetch(). There are a couple of loose ends (e.g. configure should check whether preadv2 is supported), but in principle I think this is generally correct.

Unfortunately, these changes led to a bunch of clear regressions :-( Take a look at the attached point-4-regressions-small.png, which is page 5 from the full results PDF [1][2]. As before, I plotted this as a huge pivot table with various parameters (test, dataset, prefetch, ...) on the left, and (build, nmatches) on the top. So each column shows timings for a particular patch and a query returning nmatches rows. After the pivot table (on the right) is a heatmap, comparing timings for each build to master (the first couple of columns). As usual, the numbers are "timing compared to master", so e.g. 50% means the query completed in half the time compared to master. The color coding is simple too: green means "good" (speedup), red means "bad" (regression), and the higher the saturation, the bigger the difference. I find this visualization handy, as it quickly highlights differences between the various patches - just look for changes in the red/green areas.
In the point-4-regressions-small.png image, you can see three areas of clear regressions, either compared to master or to the 20231016 patch. All of this is for "uncached" runs, i.e. after the instance was restarted and the page cache was dropped too.

The first regression is for bitmapscan. The first two builds show no difference compared to master - which makes sense, because the 20231016 patch does not touch any code used by bitmapscan, and the 0002 patch simply uses PrefetchBuffer as is. But then 0003 adds preadv2 to it, and the performance immediately sinks, with timings ~5-6x higher for queries matching 1k-100k rows. The patches 0005/0006 can't possibly improve this, because the visibilitymap is entirely unrelated to bitmapscans, and so is the small LRU to detect sequential patterns.

The indexscan regression #1 shows a similar pattern, but in the opposite direction - the indexscan cases, massively improved by the 20231016 patch (and even after just using PrefetchBuffer), revert back to master with 0003 (adding preadv2). Ditching the preadv2 restores the gains (the last build's results are nicely green again).

The indexscan regression #2 is interesting too, and it illustrates the importance of detecting sequential access patterns. It shows that as soon as we call PrefetchBuffer() directly, the timings increase to maybe 2-5x compared to master. That's pretty terrible. Once the small LRU cache used to detect sequential patterns is added back, the performance recovers and the regression disappears. Clearly, this detection matters. Unfortunately, the LRU can't do anything for the two other regressions, because those are on random/cyclic patterns, so the LRU won't help (certainly not for the random case).

preadv2 issues?
---------------

I'm not entirely sure whether I'm using preadv2 wrong somehow, but it doesn't seem to perform terribly well in this use case. I decided to do some microbenchmarks, measuring how long it takes to do preadv2 when the pages are [not] in cache etc.
The C files are at [3]. preadv2-test simply reads a file twice, first with NOWAIT and then without it. With a clean page cache, the results look like this:

  file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
  preadv2 NOWAIT  time 78472 us    calls 131072  hits 0       misses 131072
  preadv2 WAIT    time 9849082 us  calls 131072  hits 131072  misses 0

and then, if you run it again with the file still in the page cache:

  file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
  preadv2 NOWAIT  time 258880 us  calls 131072  hits 131072  misses 0
  preadv2 WAIT    time 213196 us  calls 131072  hits 131072  misses 0

This is pretty terrible, IMO. It says that if the pages are not in cache, the preadv2 calls take ~80ms in total. That's very cheap, compared to the total read time (so if we can speed that up by prefetching, it's worth it). But if the file is already in cache, the calls take ~260ms, which actually exceeds the time needed to just do preadv2() without the NOWAIT flag. AFAICS the problem is that preadv2() doesn't just check if the data is available - it also copies the data and all that. But even if we only ask for the first byte, it's still way more expensive than with an empty cache:

  file: ./tmp.img size: 1073741824 (131072) block 8192 check 1
  preadv2 NOWAIT  time 119751 us  calls 131072  hits 131072  misses 0
  preadv2 WAIT    time 208136 us  calls 131072  hits 131072  misses 0

There's also a fadvise-test microbenchmark that just does fadvise all the time, and even that is way cheaper than using preadv2(NOWAIT), in both cases:

no cache:
  file: ./tmp.img size: 1073741824 (131072) block 8192
  fadvise  time 631686 us  calls 131072  hits 0       misses 0
  preadv2  time 207483 us  calls 131072  hits 131072  misses 0

cache:
  file: ./tmp.img size: 1073741824 (131072) block 8192
  fadvise  time 79874 us   calls 131072  hits 0       misses 0
  preadv2  time 239141 us  calls 131072  hits 131072  misses 0

So that's ~300ms vs. ~500ms in the cached case (the difference in the no-cache case is even more significant).
It's entirely possible I'm doing something wrong, or maybe I'm just thinking about this the wrong way, but I can't quite imagine preadv2 being useful for this - at least not with reasonably good local storage. Maybe it could help with slow/remote storage, or something? For now, I think the right approach is to go back to the cache of recently prefetched blocks. What I liked about the preadv2 approach is that it knows exactly what is currently in the page cache, while the local cache is just an approximation based on recently prefetched blocks. It also knows about stuff prefetched by other backends, while the local cache is private to the particular backend (or even to the particular scan node). But the local cache seems to perform much better, so there's that.

LRU cache of recent blocks (0004)
---------------------------------

The importance of this optimization is clearly visible in the regression image mentioned earlier - the "indexscan regression #2" shows that the sequential pattern regresses with the 0002+0003 patches, but once the small LRU cache is introduced back and used to skip prefetching for sequential patterns, the regression disappears. Ofc, this is part of the original 20231016 patch, so going back to that version naturally includes it.

visibility map optimizations (0005/0006)
----------------------------------------

Earlier benchmark results showed a somewhat annoying regression for index-only scans that don't need prefetching (i.e. with all pages all-visible). There was quite a bit of inefficiency because both the prefetcher and the IOS code accessed the visibilitymap independently, and the prefetcher did so in a rather inefficient way. These patches make the prefetcher more efficient by reusing the buffer, and also share the visibility info between the prefetcher and the IOS code. I'm sure this needs more work / cleanup, but the regression is mostly gone, as illustrated by the attached point-0-ios-improvement-small.png.
layering questions
------------------

Aside from the preadv2() question, the main open question remains the "layering", i.e. which code should be responsible for prefetching. At the moment all the magic happens in indexam.c, in the index_getnext_* functions, so that all callers benefit from prefetching. But as mentioned earlier in this thread, indexam.c seems to be the wrong layer, and I think I agree. The problem is that the prefetching needs to happen in index_getnext_* so that all index_getnext_* callers benefit from it. We could do it in the executor for index_getnext_tid(), but that's a bit weird - it'd work for index-only scans, but the primary target is regular index scans, which call index_getnext_slot().

However, it seems it'd be good if the prefetcher and the executor code could exchange/share information more easily. Take for example the visibilitymap stuff for IOS in patches 0005/0006. I made it work, but it sure looks inconvenient, partially due to the split between executor and indexam code.

The only idea I have is to have the prefetcher code somewhere in the executor, and then pass it to the index_getnext_* functions, either as a new parameter (with NULL => no prefetching), or maybe as a field of the scan descriptor (but that seems wrong - pointing from the descriptor to something that's essentially a part of the executor state). There's also the issue that the prefetcher is currently part of IndexScanDesc, when it really should be in IndexScanState. That's weird, but mostly down to my general laziness.

regards

[1] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/pdf/point.pdf
[2] https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/png/point-4.png
[3] https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23/preadv-tests

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
- v20231124-0006-poc-reuse-vm-information.patch
- v20231124-0001-prefetch-2023-10-16.patch
- v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch
- v20231124-0003-check-page-cache-using-preadv2.patch
- v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch
- v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
- v20231124-0007-20231016-reworked.patch
- point-0-ios-improvement-small.png
- point-4-regressions-small.png