Re: index prefetching
| From | Peter Geoghegan |
|---|---|
| Subject | Re: index prefetching |
| Date | |
| Msg-id | CAH2-Wzk9=x=a2TbcqYcX+XXmDHQr5=1v9m4Z_v8a-KwF1Zoz0A@mail.gmail.com |
| In reply to | Re: index prefetching (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: index prefetching |
| List | pgsql-hackers |
On Sun, Nov 2, 2025 at 6:49 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Nothing really new here (I've been working on batching on the table
> AM side, but nothing to show on that just yet).

Tomas and I had a meeting on Friday to discuss a way forward with this project. Progress has stalled, and we feel that now is a good time to pivot by refactoring the patch into smaller, independently useful/committable pieces. This email explains our current thinking (Tomas should correct me if I get anything wrong here).

The short version/executive summary
===================================

The need to get everything done in one single release seems to be hampering progress. We made quick progress for a few months, but now that we've exhausted the easy wins, the layering issues that remain are making every remaining open item near intractable.

The layering issues make it very hard to keep on top of all of the regressions; we're just doing too much at once. We're trying to manage all of the regressions from the addition of prefetching/a heapam read stream, while also trying to manage the regressions from moving index AMs from the old amgettuple interface to the new amgetbatch interface. And we still need to revise the table AM to move the read stream from indexam.c over to the table AM side (this isn't in the latest version of the patch at all). Just making these AM interface changes is already a huge project on its own. This makes it hard to focus on just a few things at any one time; everything is interdependent. We seem to end up playing whack-a-mole whenever we try to zero in on any single problem; we end up going in circles.

The new tentative plan is to cut scope by focussing on switching over to the new index AM + table AM interface from the patch in the short term, for Postgres 19. There is an almost immediate benefit to just doing that much, unrelated to I/O prefetching for index scans: it enables batching of heap page buffer locking/unlocking (at the point where index scans perform heap_hot_search_buffer calls) on the table AM/heapam side during ordered index scans. That can dramatically cut down on repeat buffer locking and unlocking, giving us enough of a win (more details below) to be the sole justification for switching over to the new set of AM interfaces for Postgres 19.

Our long term goals won't change under this phased approach, but our timeline/short term focus certainly will. We hope to get some general feedback about this new strategy for the project now, particularly from Andres. The main practical concern is managing project risk sensibly.

Difficulties with refactoring AM interfaces while introducing a read stream
============================================================================

The uncertainty about how to resolve some of the remaining individual open items for the project (specifically concerns about I/O prefetching/read stream + resource management concerns, and how they *interact* with broader layering questions) is the main blocker to further progress. I'll now give a specific example of what I mean by this, just because it's likely to be clearer than explaining the underlying problem in general terms.

Currently, we can have up to 64 leaf-page-wise batches. Usually this is more than enough, but occasionally we run out of space for batches, and have to reset the read stream. This is certainly a kludge; we discard pinned buffers with useful data in order to work around what we've thought of as an implementation deficiency on the read stream side up until now.
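For illustration, the kludge amounts to something like this (a simplified sketch, not the actual patch code -- the IndexScanBatchState struct and its fields are names I'm making up here; read_stream_reset(), from storage/read_stream.h, is the real routine that unpins whatever buffers the stream is still holding):

    #define INDEX_SCAN_MAX_BATCHES 64   /* the arbitrary cap discussed above */

    typedef struct IndexScanBatchState
    {
        ReadStream *stream;     /* heapam read stream driving prefetching */
        int         nbatches;   /* leaf-page-wise batches currently held */
    } IndexScanBatchState;

    static void
    maybe_reset_stream(IndexScanBatchState *batchstate)
    {
        if (batchstate->nbatches >= INDEX_SCAN_MAX_BATCHES)
        {
            /*
             * Out of batch slots: throw away whatever the read stream has
             * already pinned on our behalf, even though those buffers hold
             * data that the scan is about to need anyway.
             */
            read_stream_reset(batchstate->stream);
        }
    }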
Obviously just discarding useful work like that is never okay (nobody will argue with me on this, I'm sure). At various points we talked about addressing this particular problem by teaching the read stream to "pause" such that we can consume those remaining pinned buffers as needed, without consuming even more heap pages/buffers to do so (there's no natural upper bound on those, I think). We'd then "unpause" and resume prefetching again, once we managed to free some more leaf-page-wise batches up. But I'm now starting to have serious doubts about this approach (or at least doubts about the approach that I think other people have in mind when they talk about this kind of "pausing").

Again, it's really hard to pin down *where* we should be fixing things. It occurs to me that it doesn't make much sense that the table AM/indexam.c has *no* awareness of how many heap buffers are already pinned on its behalf. The fact that that knowledge is *exclusively* confined to the read stream isn't actually good. What we really need to do is to care about all buffer pins held by the whole index scan node, whether for index pages or for heap pages (though note that holding onto buffer pins on index pages should be rare in practice). We need to directly acknowledge the tension that exists between heapam and index AM needs, I think. The read stream needs to be involved in this process, but it should be a 2-way conversation. The read stream already defensively checks externally held buffer pins, which might kinda work for what we have in mind -- but probably not. It seems bad to depend on what is supposed to be a defensive measure for all this.
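To make that a bit more concrete, the kind of scan-wide accounting I have in mind might look something like this (all of these names are hypothetical -- nothing like this struct exists in the patch today; the point is only that the scan node, not the read stream alone, would know about every pin held on its behalf):

    typedef struct IndexScanPinAccounting
    {
        int     nheap_pins;     /* heap buffers pinned by the read stream for us */
        int     nindex_pins;    /* leaf page pins held for still-live batches */
        int     pin_budget;     /* scan-wide limit the two sides negotiate over */
    } IndexScanPinAccounting;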
Separately, we'll probably eventually want the heapam side to be able to notice that a block number that it requests is already in the pending list, so that it can be marked as a duplicate (and so not unpinned until the duplicate request is also satisfied/has its heap tuples returned to the scan). That's another factor pushing things in this general direction. (Less important, but noted here for completeness.)

I've been talking about problems when 64 leaf-page-wise batches isn't enough, which is rare in practice. It's far more common for 64 to be too *many* batches, which wastes memory (e.g., with largely sequential heap access we seem to need no more than 5 or 10 at a time, even when prefetching is really important). But it's hard to see how we could lazily allocate memory used for batches under anything like the current structure. It's circular: we should only allocate more leaf-page-wise batches to make it possible to do more useful heap prefetching. But right now heap prefetching will stall (or will "pause" in its own kludgey way) precisely because there aren't enough leaf-page-wise batches!

Granted, adding a "pausing" capability might be useful elsewhere. But that in itself doesn't justify the general idea of pausing in the specific way that index prefetching requires. Why should it? Why should we pause when we've filled 64 leaf-page-wise batches instead of 5 or 10 or 1000? ISTM that we're tacitly assuming that the total number of usable leaf-page-wise batches remaining is a useful proxy for the costs that actually matter. But why should it be?

64 is just a number that we picked fairly arbitrarily, and one that has only a weak relationship with more concrete costs such as leaf page buffer pins held (as noted already, holding onto a leaf page buffer pin until we call btfreebatch against its batch isn't actually needed during most index scans, but there will be exceptions).

My gut instinct is that this stuff will actually matter, in practice, at least some of the time. And that that'll necessitate giving the implementation a clear and complete picture of costs and benefits when scheduling index scans that prefetch. Pausing can't be based on some randomly chosen magic number, like 64, since that's bound to be totally wrong in a nonzero number of cases.

ISTM that we cannot subordinate the table AM to the read stream. But we also can't subordinate the read stream to the table AM. Figuring all that out is hard. This is the kind of problem that we'd like to defer for now.

Minor caveat: I'm not sure that Tomas endorses everything I've said here about "pausing" the read stream. But that probably doesn't matter much. Either way, these kinds of questions still weigh on the project, and something should be done about it now, to keep things on course.

Phased approach
===============

As touched upon at the start of this email, under this new phased approach to the project, the short term goal is to make heapam avoid repeat buffer locks during index scans where that's clearly avoidable. Making that much work shares many of the same problems with I/O prefetching (particularly the basics of layering/AM revisions), but defers dealing with the thorniest issues with pin resource management. That's what I'll talk about here -- what we can defer, and what we cannot defer. But first, on a more positive note, I'll talk about the short term benefits.

My early prototype of the "only lock heap buffer once per group of TIDs that point to the same heap page returned from an index scan" optimization has been shown to improve throughput for large-ish range scans by quite a bit. Variants of pgbench select with queries like "SELECT * FROM pgbench_accounts WHERE aid BETWEEN 1000 AND 1500" show improvements in throughput of up to 20% (and show similar reductions in query latency). That's a nice win, all on its own.

Now back to talking about risks. There's still a lot of complexity that cannot be deferred with this phased approach. We must still switch over index AMs from amgettuple to the new amgetbatch interface. And we need to make the table AM interface used by index scans higher level: index_getnext_slot would directly call a new table-AM-wise callback, just passing it its own ScanDirection argument directly -- we wouldn't be passing TIDs to the table AM anymore. The new table AM batch interface would work in terms of "give me the next tuple in the current scan direction", not in terms of "give me this random TID, which you know nothing else about". The table AM becomes directly aware of the fact that it is participating in an ordered index scan. This design is amenable to allowing the table AM to see which accesses will be required in the near future -- that requirement is common to both I/O prefetching and this other heap buffer lock optimization.

It's even more complicated than just those changes to the index AM and table AM interfaces: we'll also require that the table AM directly interface with another layer that manages leaf-page-wise batches on its behalf. They need to *cooperate* with each other, to a certain degree.
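To make the general shape of that interface change concrete, the table AM side might look roughly like this (the callback name and its exact placement are hypothetical, not the patch's actual API; index_getnext_slot(), IndexScanDesc, ScanDirection, and TupleTableSlot are the existing Postgres names):

    /* hypothetical new TableAmRoutine member */
    bool        (*index_getnext_ordered) (IndexScanDesc scan,
                                          ScanDirection direction,
                                          TupleTableSlot *slot);

    /*
     * index_getnext_slot() then becomes a thin wrapper: the scan direction
     * is passed straight through, and no TID ever crosses the interface.
     * The table AM pulls TIDs from the index AM in leaf-page-wise batches
     * itself, via the shared batching layer mentioned above.
     */
    bool
    index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
                       TupleTableSlot *slot)
    {
        return scan->heapRelation->rd_tableam->index_getnext_ordered(scan,
                                                                     direction,
                                                                     slot);
    }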
The executor proper won't call amgetbatch directly under this scheme (it'd just provide a library of routines that help table AMs to do so on their own). That much doesn't seem deferrable. And it's hard. So this phased approach certainly doesn't eliminate project risk, by any stretch of the imagination. Offhand, I'd estimate that taking this phased approach cuts the number of blockers to making an initial commit in half.

Here's a nonexhaustive list of notable pain points that *won't* need to be addressed in the short term, under this new approach/structure (I'm somewhat repeating myself here):

* Most regressions are likely much easier to avoid/are automatically avoided. Particularly with selective point query scans.

* No need to integrate the read stream, no need to solve most resource management problems (the prior item about regressions is very much related to this one).

* No need for streamPos stuff when iterating through TIDs from a leaf-page-wise batch (only need readPos now). There's no need to keep those 2 things in sync, because there'll only be 1 thing now.

Here's a nonexhaustive list of problems that we *will* still need to solve in the earliest committed patch, under this phased approach (again, I'm repeating myself somewhat):

* Actually integrating the amgetbatch interface in a way that is future-proof.

* Revising the table AM interface such that the table AM is directly aware of the fact that it is feeding heap/table tuples to an ordered index scan. That's a big conceptual shift for table AMs.

* Making the prior 2 changes "fit together" sensibly, in a way that considers current and future needs. Also a big shift. The "only lock heap buffer once per group of TIDs that point to the same heap page returned from an index scan" optimization still requires some general awareness of index AM costs on the table AM side (see the sketch after this list). It only makes sense for us to batch up extra TIDs (from the same heap page) when determining which TIDs are about to be accessed as a group isn't too expensive/the information is readily available to the table AM, because it requested it from the index AM itself. We're setting a new precedent by saying that it's okay to share certain knowledge across what we previously thought of as strictly separate layers of abstraction. I think that that makes sense (what else could possibly work?), but I want to draw specific attention to that now.

* We'll still need index-only scans to do things in a way that prevents inconsistencies/changing our mind in terms of which TIDs are all-visible. This has the advantage of allowing us to avoid accessing the visibility map from the executor proper, which is an existing modularity violation that we already agree ought to be fixed. This will also keep us honest (we won't be deferring more than we should). But that's not why I think it's essential to move VM accesses into the table AM. We should only batch together accesses to a heap page when we know for sure that those TIDs will in fact be accessed. How are we supposed to have general and robust handling for all that, in a world where the visibility map continues to be accessed from the executor proper? At best, not handling VM integration comprehensively (for index-only scans) ties our hands around reordering work, and seems like it'd be very brittle. It would likely have similar problems to our current problems with managing a read stream in indexam.c, while relying on tacit knowledge of how precisely those same heap blocks will later actually be accessed from the heapam side. The sensible solution is to put control of the scan's progress all in one place. We don't want to have to worry about what happens when the VM is concurrently set or unset.
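To make the TID-grouping idea from the list above concrete, the optimization amounts to something like the following on the heapam side (a simplified sketch under names of my own choosing, not the prototype's actual code; heap_hot_search_buffer(), LockBuffer(), ReleaseAndReadBuffer(), and ReleaseBuffer() are the real routines, while visit_tid_group() and the tids/ntids group are stand-ins for however the table AM obtains the next group of TIDs from a leaf-page-wise batch):

    /*
     * Visit a group of TIDs from one leaf-page-wise batch, taking each heap
     * page's content lock only once per run of TIDs on that page, instead
     * of once per TID.
     */
    static void
    visit_tid_group(Relation heapRel, Snapshot snapshot,
                    ItemPointerData *tids, int ntids)
    {
        Buffer      buf = InvalidBuffer;
        BlockNumber curblk = InvalidBlockNumber;

        for (int i = 0; i < ntids; i++)
        {
            BlockNumber blk = ItemPointerGetBlockNumber(&tids[i]);
            HeapTupleData htup;
            bool        all_dead;

            if (blk != curblk)
            {
                /* new heap page: swap pins, take the content lock just once */
                if (BufferIsValid(buf))
                    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
                buf = ReleaseAndReadBuffer(buf, heapRel, blk);
                LockBuffer(buf, BUFFER_LOCK_SHARE);
                curblk = blk;
            }

            /* follow the HOT chain for this TID under the already-held lock */
            if (heap_hot_search_buffer(&tids[i], heapRel, buf, snapshot,
                                       &htup, &all_dead, true))
            {
                /* ... return/stash the visible tuple for the scan ... */
            }
        }

        if (BufferIsValid(buf))
        {
            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
            ReleaseBuffer(buf);
        }
    }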
When Andres and Tomas talk about table AM modularity stuff, they tend to focus on why it's bad that the table AM interface uses heap TIDs specifically. I agree with all that. But even if I didn't, everything that I just said about the need to centralize control in the table AM would still be true. That's why I'm focussing on that here (it's really pretty subtle).

That's all I have for now. My thoughts here should be considered tentative; I want to put my thinking on a more rigorous footing before really committing to this new phased approach.

--
Peter Geoghegan