Re: index prefetching
От | Tomas Vondra |
---|---|
Тема | Re: index prefetching |
Дата | |
Msg-id | e4af4973-b2ed-4a51-a9bd-524922c2987b@vondra.me обсуждение исходный текст |
Ответ на | Re: index prefetching (Tomas Vondra <tomas@vondra.me>) |
Список | pgsql-hackers |
Hi, I got pinged about issues (compiler warnings, and some test failures) in the simple patch version shared in May. So here's a rebased and cleaned up version addressing that, and a couple additional issues I ran into. FWIW if you run check-world on this, you may get failures in io_workers TAP test. That's a pre-existing issue [1], the patch just makes it easier to hit as it (probably) added AIO in some part of the test. Otherwise it should pass all tests (and it does for me on CI). The main changes in the patches and remaining questions: (1) fixed compiler warnings These were mostly due to contrib/test AMs with not-updated ambeginscan() implementations. (2) GiST fixes I fixed a bug in how the prefetching handled distances, leading to "tuples returned out of order" errors. It did not copy the Datums when batching the reordered values, not realizing it may be FLOAT8, and on 32-bit systems the Datum is just a pointer. Fixed by datumCopy(). I'm not aware of any actual bug in the GiST code, but I'm sure the memory management there is sketchy and likely leaks memory. Needs some more thought and testing. The SP-GiST may have similar issues. (3) ambeginscan(heap, index, ....) I originally undid the changes to ambeginscan(), i.e. the callback was restored back to what master has. To to create the ReadStream the AM needs the heap, but it could build Relation using index->rd_index->indrelid. That worked, but I did not like it for two reasons. The AM then needs to manage the relation (close it etc.). And there was no way to know when ambeginscan() gets called for a bitmap scan, in which case the read_stream is unnecessary/useless. So it got created, but never used. Not very expensive, but messy. So I ended up restoring the ambeginscan() change, i.e. it now gets the heap relation. I ended up passing it as the first argument, mostly for consistency with index_beginscan(), which also does (heap, index, ...). I renamed the index argument from 'rel' to 'index' in a couple of the indexes, it was confusing to have 'heap' and 'rel'. (4) lastBlock I added the optimization to not queue duplicate block numbers, i.e. if the index returns a sequence of TIDs from the same block, we skip queueing that and simply use the buffer we already have. This is quite a bit more efficient. This is something the read_next callback in each AM needs to do, but it's pretty simple. (5) xs_visible The current patch expects the AM to set the xs_visible even if it's not using ReadStream (which is required to do that in the callback). If the AM does not do that, index-only scans are broken. But it occurs to me we could handle this in index_getnext_tid(). If the AM does not use a ReadStream (xs_rs==NULL), we can check the VM and store the value in xs_visible. It'd need moving the vmBuffer to the scan descriptor (it's now in IndexOnlyScanState), but that seems OK. And the AMs now add the buffer anyway. (6) SGML I added a couple paragraphs to indexam.sgml, documenting the new heap argument, and also requirements from the read_next callback (e.g. the lastBlock and xs_visible setting). (7) remaining annoyances There's a couple things that still annoy me - the "read_next" callbacks are very similar, and duplicate a fair amount of code to stuff they're required to. There's a little bit AM-specific code to get the next item from the ScanOpaque structs, and then code to skip duplicate block numbers and check the visibility map (if needed). I believe both of these things could be refactored into some shared place. The AMs would just call a function from indexam.c (which seems OK from layering POV, and there's plenty of such calls). I believe the same place could also act as the "scan manager" component managing the prefetching (and related stuff?), as suggested by Peter Geoghegan some time ago. I ran out of time to work on this today, but I'll look into this soon. FWIW I'm still planning to work on the "complex" patch version and see if it can be moved forward. I've been having some very helpful chats about this with Peter Geoghegan, and I'm still open to the possibility of making it work. This simpler version is partially a hedge to have at least something in case the complex patch does not make it. regards [1] https://www.postgresql.org/message-id/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j -- Tomas Vondra
Вложения
В списке pgsql-hackers по дате отправления: