Re: index prefetching
От | Tomas Vondra |
---|---|
Тема | Re: index prefetching |
Дата | |
Msg-id | 32c15a30-6e25-4f6d-9191-76a19482c556@vondra.me обсуждение исходный текст |
Ответ на | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
Список | pgsql-hackers |
Hi, Here's a rebased version of the patch, addressing a couple bugs with scrollable cursors that Peter reported to me off-list. The patch did not handle that quite right, resulting either in incorrect results (when the position happened to be off by one), or crashes (when it got out of sync with the read stream). But then there are some issues with array keys and mark/restore, triggered by Peter's "dynamic SAOP advancement" tests in extra tests (some of the tests use data files too large to post on hackers, it's available in the github branch). The patch used to handle mark/restore entirely in indexam.c, and for simple scans that works. But with array keys the btree code needs to update the moreLeft/moreRight/needPrimScan flags, so that after restoring it knows where to continue. There's two "fix" patches trying to make this work - it does not crash, and almost all the "incorrect" query results are actually stats about buffer hits etc. And that is expected to change with prefetching, not a bug. But then there are a bunch of explains where the number of index scans changed, e.g. like - Index Searches: 5 + Index Searches: 4 And that is almost certainly a bug. I haven't figured this out yet, and I feel a bit lost again :-( It made me think again whether it makes sense to make this fundamental redesign of the index AM interface a prerequisite for prefetching. I don't dispute the advantages of this new design, with indexam.c responsible for more stuff (e.g. when a batch gets freed). It seems more flexible and might make some stuff easier, and if we were designing it now, we'd do it that way ... Even if I eventually to fix this issue, will I ever be sufficiently confident about correctness of the new code, enough to commit that? Perhaps I'm too skeptical, but I'm not really sure about that anymore. After thinking about this for a while, I decided to revisit the approach used in the experimental patch I spoke about at pgconf.dev unconference in 2023, and see if maybe it could be made to work. That patch was pretty dumb - it simply initiated prefetches from the AM, by calling PrefetchBuffer(). And the arguments against that doing this from the AM seems like a layering violation, that every AM would need to do a copy of this, because each AM has a different representation of the internal scan state. But after looking at it with fresh eyes, this seems fixable. It might have been "more true" with the fadvise-based prefetching, but with the ReadStream the amount of new AM code is *much* smaller. It doesn't need to track the distance, or anything like that - that's handled by the ReadStream. It just needs to respond to read_next callback. It also doesn't feel like a layering violation, for the same reason. I gave this a try last week, and I was surprised how easy it was to make this work, and how small and simple the patches are - see the attached simple-prefetch.tgz archive: infrastructure - 22kB btree - 10kB hash - 7kB gist - 10kB spgist - 16kB That's a grand total of ~64kB (there might be some more improvements necessary, esp. in the gist/spgist part). Now compare that with the more complex patch, where we have infrastructure - 100kB nbtree - 100kB And that's just one index type. The other index types would probably need a comparable amount of new code eventually ... Sure, it can probably be made somewhat smaller (e.g. the nbtree code copies a lot of stuff to support both the old and new approach, and that might be reduced if we ditch the old one), and some of the diff are comments. But even considering all that the size/complexity difference will remain significant. The one real limitation of the simpler approach is that prefetching is limited to a single leaf page - we can't prefetch from the next one, until the scan advances to it. But based on experiments comparing this simpler and the "complex" approach, I don't think that really matters that much. I haven't seen any difference for regular queries. The one case where I think it might matter is queries with array keys, where each array key matches a single tuple on a different leaf page. The complex patch might prefetch tuples for later array values, while the simpler patch won't be able to do that. If an array key matches multiple tuples, the simple patch can prefetch those just fine, of course. I don't know which case is more likely. One argument for moving more stuff (including prefetching) to indexam.c was it seems desirable to have one "component" aware of all the relevant information, so that it can adjust prefetching in some way. I believe that's still possible even with the simpler patch - nothing prevents adding a "struct" to the scan descriptor, and using it from the read_next callback or something like that. regards [1] https://github.com/tvondra/postgres/tree/index-prefetch-2025 -- Tomas Vondra
Вложения
В списке pgsql-hackers по дате отправления: