Обсуждение: Trying out read streams in pgvector (an extension)
Hi, I was looking around for an exotic index type to try the experience of streamifying an extension, ie out-of-core code. I am totally new to pgvector, but since everyone keeps talking about it, I could not avoid picking up some basic facts in the pgconf.dev hallway track, and understood that its scans have some degree of known-order access predictability, and then also some degree of fuzzy-predictable order-not-yet-determined access too. It's also quite random in the I/O sense. Here's a toy to streamify the known-order part. I think for the fuzzy part that links those parts together, maybe there is some way to guess when it's a reasonable time to speculatively prefetch the lowest order stuff in the pairing heap, and then deal with it if you're wrong, but I didn't try that... Someone involved in that project mentioned that it's probably not a great topic to research in practice, because real world users of HNSW use fully cached ie prewarmed indexes, because the performance is so bad otherwise. (Though maybe that argument is a little circular...). So although this patch clearly speeds up cold HSNW searches to a degree controlled by effective_io_concurrency, I'll probably look for something else. Suggestions for interesting index types to look at streamifying are very welcome! Hmm. If that's really true about HNSW though, then there may still be an opportunity to do automatic memory prefetching[1]. But then in the case of index building, "stream" is NULL in this patch anyway. It surely must also be possible to find some good places to put profitable explicit pg_mem_prefetch() calls given the predictability and the need to get only ~60ns ahead for that usage. I didn't look into that because I was trying to prove things about read_stream.c, not get involved in another project :-D Here ends my science experiment report, which I'm dropping here just in case others see useful ideas here. The main thing I learned about the read stream API is that it'd be nice to be able to reset the stream but preserve the distance (something that came up on the streaming sequential scan thread for a different reason), to deal with cases where look-ahead opportunities come in bursts but you want a longer lived stream than I used here. That is the reason the patch creates and destroys temporary streams in a loop; doh. It also provides an interesting case study for what speculative random look-ahead support might need to look like. [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGKNUMnqubrrz8pRBdEM8vHeSCZcNq7iqERmkt6zPtpA3g%40mail.gmail.com === setup ==== create extension vector; create or replace function random_vector(dimensions int) returns vector language sql begin atomic; select array_agg(random())::vector from generate_series(1, dimensions); end; create table t (id serial, embedding vector(6)); insert into t (embedding) select random_vector(6) from generate_series(1, 1000000); set maintenance_work_mem = '2GB'; create index on t using hnsw(embedding vector_l2_ops); === test of a hot search, assuming repeated === select embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector from t where embedding <-> '[0.5,0.5,0.5,0.5,0.5,0.5]'::vector < 0.2 order by 1 limit 20; === test of a cold search, assuming empty caches === create or replace function test() returns void language plpgsql as $$ declare my_vec vector(6) := random_vector(6); begin perform embedding <-> my_vec from t where embedding <-> my_vec < 0.2 order by 1 limit 20; end; $$; select test(); (Make sure you remember to set effective_io_concurrency to an interesting number if you want to generate a lot of overlapping fadvise calls.)
Вложения
On 11/06/2024 07:53, Thomas Munro wrote: > Someone involved in that project mentioned that it's probably not a > great topic to research in practice, because real world users of HNSW > use fully cached ie prewarmed indexes, because the performance is so > bad otherwise. (Though maybe that argument is a little circular...). I think that's true in practice for *building* an HNSW index, but faster *searching* when the index is not in memory seems quite useful. And of course, faster is always better, even if it's only in a non-optimal scenario. > So although this patch clearly speeds up cold HSNW searches to a > degree controlled by effective_io_concurrency, I'll probably look for > something else. Suggestions for interesting index types to look at > streamifying are very welcome! GiST and GIN? > Hmm. If that's really true about HNSW though, then there may still be > an opportunity to do automatic memory prefetching[1]. But then in the > case of index building, "stream" is NULL in this patch anyway. It > surely must also be possible to find some good places to put > profitable explicit pg_mem_prefetch() calls given the predictability > and the need to get only ~60ns ahead for that usage. I didn't look > into that because I was trying to prove things about read_stream.c, > not get involved in another project :-D > > Here ends my science experiment report, which I'm dropping here just > in case others see useful ideas here. The main thing I learned about > the read stream API is that it'd be nice to be able to reset the > stream but preserve the distance (something that came up on the > streaming sequential scan thread for a different reason), to deal with > cases where look-ahead opportunities come in bursts but you want a > longer lived stream than I used here. That is the reason the patch > creates and destroys temporary streams in a loop; doh. It also > provides an interesting case study for what speculative random > look-ahead support might need to look like. This reminds me of a prototype I wrote earlier, see https://github.com/pgvector/pgvector/pull/386, 1st commit. It reorganizes HnswSearchLayer() so that it in iteration, it first collects all the neighbors to visit, and then visits them, somewhat similar to your patch. -- Heikki Linnakangas Neon (https://neon.tech)
On 6/11/24 12:53 AM, Thomas Munro wrote: > Hi, > > I was looking around for an exotic index type to try the experience of > streamifying an extension, ie out-of-core code. I am totally new to > pgvector, but since everyone keeps talking about it, I could not avoid > picking up some basic facts in the pgconf.dev hallway track, and > understood that its scans have some degree of known-order access > predictability, and then also some degree of fuzzy-predictable > order-not-yet-determined access too. It's also quite random in the > I/O sense. Cool! I happened to be chatting w/Andrew about this yesterday to see if there could be some benefits for folks who are running pgvector on PG17. > Here's a toy to streamify the known-order part. I think for the fuzzy > part that links those parts together, maybe there is some way to guess > when it's a reasonable time to speculatively prefetch the lowest order > stuff in the pairing heap, and then deal with it if you're wrong, but > I didn't try that... I would suggest submitting this at least as a draft PR to the pgvector project[1]: https://github.com/pgvector/pgvector > Someone involved in that project mentioned that it's probably not a > great topic to research in practice, because real world users of HNSW > use fully cached ie prewarmed indexes, because the performance is so > bad otherwise. I don't think that was me, at least in those words (and I had noted I'd love to chat w/you about this, but we didn't find time). Stating it differently, the "ideal" is to keep the indexes in memory, as that leads to the best performance, but reality is more complicated. These datasets are quite large (e.g. the 1536-dim vector is a 6KB payload, excluding what's in the index) and if you're storing the full vector in the index (there are now some quantization methods available[4]), you can easily double your dataset size, and quickly exceed available memory. So I think in the real world, you're more likely to see swapping pages between disk and memory. Some of this was addressed in the talk @ PGConf.dev[3] (slides here[2]). > (Though maybe that argument is a little circular...). > So although this patch clearly speeds up cold HSNW searches to a > degree controlled by effective_io_concurrency, I'll probably look for > something else. Suggestions for interesting index types to look at > streamifying are very welcome! Yup, so this makes sense for HNSW particularly at the higher-level pages. But it may make more sense for IVFFlat, given how it clusters data. With IVFFlat, you first find your lists/centers, and then you determine how you index each vector around the lists. When those lists are stored to disk, they're basically sequential. A lot of the struggles with IVFFlat is both the long load from disk and ultimately some comptuational issues for a larger set of vector comparisons (though if you're able to build small, efficient clusters, it can be much faster than HNSW!). HNSW behaves more like a (bear with me) typically "tree-based" index, where you'll have hot spots at the top, but because of the nature of vector search, the lower levels tend to be more random in access. Regardless, the part where this is interesting (at least to me) is that a lot of these vectors tend to take up a full page anyway, so anything we can do to read them faster from disk will generally get a thumbs up from me. > Hmm. If that's really true about HNSW though, then there may still be > an opportunity to do automatic memory prefetching[1]. But then in the > case of index building, "stream" is NULL in this patch anyway. It > surely must also be possible to find some good places to put > profitable explicit pg_mem_prefetch() calls given the predictability > and the need to get only ~60ns ahead for that usage. I didn't look > into that because I was trying to prove things about read_stream.c, > not get involved in another project :-D Well, as alluded to in[2], thinking about how another project uses this will certainly help, and anything we can do to continue to speed up vector queries helps PostgreSQL ;) Some of the contributions from folks who have focused on core have significantly helped pgvector. > Here ends my science experiment report, which I'm dropping here just > in case others see useful ideas here. The main thing I learned about > the read stream API is that it'd be nice to be able to reset the > stream but preserve the distance (something that came up on the > streaming sequential scan thread for a different reason), to deal with > cases where look-ahead opportunities come in bursts but you want a > longer lived stream than I used here. That is the reason the patch > creates and destroys temporary streams in a loop; doh. It also > provides an interesting case study for what speculative random > look-ahead support might need to look like. If you're curious, I can fire up some of my more serious benchmarks on this to do a before/after to see if there's anything interesting. I have a few large datasets (10s of millions) of larger vectors (1536dim => 6KB payloads) that could see the net effect here. > (Make sure you remember to set effective_io_concurrency to an > interesting number if you want to generate a lot of overlapping > fadvise calls.) What would you recommend as an "interesting number?" - particularly using the data parameters above. Thanks, Jonathan [1] https://github.com/pgvector/pgvector [2] https://www.pgevents.ca/events/pgconfdev2024/sessions/session/1/slides/42/pgconfdev-2024-vectors.pdf [3] https://www.pgevents.ca/events/pgconfdev2024/schedule/session/1-vectors-how-to-better-support-a-nasty-data-type-in-postgresql/ [4] https://jkatz05.com/post/postgres/pgvector-scalar-binary-quantization/
Вложения
On Fri, Sep 6, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Without this > patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total) > before it has to take a break to hop to a new page, and then it start > again at 1. Oops. Erm, correction: 1, 2, 4, 8, 1 (because it runs out due to m == 16 and resets).
There was a mistake in my query, so the macOS speedup column was wrong
(was accidentally comparing Linux number with macOS master, sorry for
the noise). I also forgot to mention that you don't actually get the
speedup on PostgreSQL 17 on a Mac, because Peter only recently
implemented the needed read-ahead support for macOS in master/18, but
it doesn't get slower. Here's the corrected table:
| linux (xfs) | macos (apfs)
branch | eic | avg | speedup | stdev | avg | speedup | stdev
--------+-----+--------+---------+--------+--------+---------+--------
master | | 73.959 | 1.0 | 24.168 | 72.290 | 1.0 | 11.851
stream | 0 | 70.117 | 1.1 | 36.699 | 76.289 | 0.9 | 12.742
stream | 1 | 57.983 | 1.3 | 5.845 | 79.969 | 0.9 | 8.308
stream | 2 | 35.629 | 2.1 | 4.088 | 49.198 | 1.5 | 7.686
stream | 3 | 28.477 | 2.6 | 2.607 | 37.540 | 1.9 | 5.272
stream | 4 | 26.493 | 2.8 | 3.691 | 33.014 | 2.2 | 4.444
stream | 5 | 23.711 | 3.1 | 2.435 | 32.622 | 2.2 | 2.270
stream | 6 | 22.885 | 3.2 | 1.908 | 31.254 | 2.3 | 4.170
stream | 7 | 21.910 | 3.4 | 2.153 | 33.669 | 2.1 | 4.616
stream | 8 | 20.741 | 3.6 | 1.594 | 34.182 | 2.1 | 3.819
stream | 9 | 22.471 | 3.3 | 3.094 | 30.690 | 2.4 | 2.677
stream | 10 | 19.895 | 3.7 | 1.695 | 32.631 | 2.2 | 4.976
stream | 11 | 19.447 | 3.8 | 1.647 | 31.163 | 2.3 | 3.351
stream | 12 | 18.658 | 4.0 | 1.503 | 30.817 | 2.3 | 3.538
stream | 13 | 18.886 | 3.9 | 0.874 | 29.184 | 2.5 | 4.832
stream | 14 | 18.667 | 4.0 | 1.692 | 28.783 | 2.5 | 3.459
stream | 15 | 19.080 | 3.9 | 1.429 | 28.928 | 2.5 | 3.396
stream | 16 | 18.929 | 3.9 | 3.469 | 29.282 | 2.5 | 2.868
On Fri, Sep 6, 2024 at 4:28 PM Thomas Munro <thomas.munro@gmail.com> wrote: > Here's a new version with a TODO tidied up. I also understood that we > need to tweak the read_stream_reset() function, so that it doesn't > forget its current readhead distance when it hops between HNSW nodes > (which is something that comes up in several other potential uses > cases including another one I am working in in core). Without this > patch for PostgreSQL, it reads 1, 2, 4, 7 blocks (= 16 in total) > before it has to take a break to hop to a new page, and then it start > again at 1. Oops. With this patch, it is less forgetful, and reaches > the full possible I/O concurrency of 16 (or whatever the minimum of > HNSW's m parameter and effective_io_concurrency is for you). I heard that the pgvector project is now trying to do this for real, and (surprise!) running into this problem. It causes streamified HNSW search to regress in performance on some queries when the overheads of streaming are not outweighed by the (bogusly constrained) gains in concurrency. We just don't generate enough concurrency to win. I probably should have been more opinionated and just committed a version of that distance-reset policy change, but I guess at the time I wrote the above, streaming and AIO were a little too abstract to attract reviews relating to hypothetical external projects. We definitely want to fix that for v19, because it also affects the streamified index scan project and doubtless many other things. I wrote about that with patches[1] and will start a new thread soon with a new collection of rebased heuristics improvements. But for now, to fix pgvector's woes, I wonder if it might make sense to call this a bug in v18, and back-patch the tiniest possible change. Something like what I posted[2] in this thread almost two years ago. I don't think it really affects any core code: we use read_stream_reset() only in very minimal ways there (I could elaborate), and it's quite arguable that the existing policy is wrong for them too, but we'd need to confirm that and perhaps think about other extensions that might be using it. Better ideas? [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGL6hCd40Dh1AcFcoiw5zJXK2T1dRKO3oe8RkPExqA5zoQ%40mail.gmail.com#181a22a8be99ff561b7beae44986870c [2] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2Bx2BcqWzBC77cN0ewhzMF0kYhC6c4G_T2gJLPbqYQ6Ow%40mail.gmail.com#9aa6012713b473611ae46d8e2032586f
On Tue, Nov 11, 2025 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > But for now, to fix pgvector's woes, I wonder if it might make sense > to call this a bug in v18, and back-patch the tiniest possible change. > Something like what I posted[2] in this thread almost two years ago. > I don't think it really affects any core code: we use > read_stream_reset() only in very minimal ways there (I could > elaborate), and it's quite arguable that the existing policy is wrong > for them too, but we'd need to confirm that and perhaps think about > other extensions that might be using it. If we are worried about regressing other extensions using read_stream_reset(), we could make the read stream reset which preserves the distance a different function in backbranches. - Melanie
On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Tue, Nov 11, 2025 at 4:22 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > But for now, to fix pgvector's woes, I wonder if it might make sense > > to call this a bug in v18, and back-patch the tiniest possible change. > > Something like what I posted[2] in this thread almost two years ago. > > I don't think it really affects any core code: we use > > read_stream_reset() only in very minimal ways there (I could > > elaborate), and it's quite arguable that the existing policy is wrong > > for them too, but we'd need to confirm that and perhaps think about > > other extensions that might be using it. > > If we are worried about regressing other extensions using > read_stream_reset(), we could make the read stream reset which > preserves the distance a different function in backbranches. Hmm, yeah, interesting idea. Candidate names might include read_stream_restart() and read_stream_continue(). The point being that the block number callback reported end-of-stream, but that was only temporary, and now it has more information and would like to continue. Those are some of the names I bounced around for a new read_stream_reset() flag argument for v19 (I rather liked "continue"), but I also like this separate function idea. Back-patching a new function would certainly remove all doubt about unintended consequences for existing callers of read_stream_reset(), so yeah, that wins on pure conservative safety grounds. As for the future, hmm, it might even be better to have an explicit separate API for this operation in master too, as it is turning out to be quite a common requirement and the naming is much clearer like that. We don't usually design new APIs while back-patching though, that's probably why I didn't think of that, but if we view this as a design bug that folded too many jobs into read_stream_reset() that we now want to fix by splitting one off, maybe that's OK? Seems pretty risk-free, anyway.
On Wed, Nov 12, 2025 at 12:19 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > If we are worried about regressing other extensions using > > read_stream_reset(), we could make the read stream reset which > > preserves the distance a different function in backbranches. Here is a draft patch like that, that tries to be as small as possible. Trying out the name read_stream_resume().
Вложения
Hi, Thank you for working on this! On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Nov 12, 2025 at 12:19 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Wed, Nov 12, 2025 at 11:52 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > If we are worried about regressing other extensions using > > > read_stream_reset(), we could make the read stream reset which > > > preserves the distance a different function in backbranches. > > Here is a draft patch like that, that tries to be as small as > possible. Trying out the name read_stream_resume(). I liked the idea of having a different function named read_stream_resume for this purpose. 0001 looks good to me. 0002: + /* End-of-stream. */ + buf = read_stream_next_buffer(stream, NULL); + Assert(buf == InvalidBuffer); + buf = read_stream_next_buffer(stream, NULL); + Assert(buf == InvalidBuffer); I noticed there are two 'read_stream_next_buffer()' and 'InvalidBuffer' checks. Does having both provide any additional validation? I tried removing one of them, and the test still passed. Also, there is one thing I wanted to clarify about the 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an 'InvalidBuffer', then we can use 'read_stream_resume()' alone because we know that we already consumed all buffers in the stream. For the rest, we need to use 'read_stream_resume()' together with the 'read_stream_reset()', right? -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Nov 12, 2025 at 9:04 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas.munro@gmail.com> wrote: > 0002: > > + /* End-of-stream. */ > + buf = read_stream_next_buffer(stream, NULL); > + Assert(buf == InvalidBuffer); > + buf = read_stream_next_buffer(stream, NULL); > + Assert(buf == InvalidBuffer); > > I noticed there are two 'read_stream_next_buffer()' and > 'InvalidBuffer' checks. Does having both provide any additional > validation? I tried removing one of them, and the test still passed. I wanted to demonstrate that this is a state that the stream is stuck in until you call _resume(). I suppose an alternative design would be that _next_buffer() returns InvalidBuffer only once (= the block number callback returns InvalidBlock once) and then automatically resumes (= it restores the distance) and then you can call read_stream_next_buffer() again (= the block number callback will be called again to fill the stream up with new buffers before waiting for the first one to be ready to give to you if it isn't already). That would have the advantage of not requiring a new function at all and make the patch even shorter, but I don't know, I guess I thought that would be a bit more fragile in some way, less explicit. Hmm, would it actually be better if it worked like that? > Also, there is one thing I wanted to clarify about the > 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an > 'InvalidBuffer', then we can use 'read_stream_resume()' alone because > we know that we already consumed all buffers in the stream. For the > rest, we need to use 'read_stream_resume()' together with the > 'read_stream_reset()', right? For the rest, there would be no need to call read_stream_resume(). To recap the uses of read_stream_reset(): the original purpose was to release any buffers (pins) that the stream is holding internally because of look-ahead, and put it back to its original state, ready to be reused. It is called (1) by read_stream_end() as an implementation detail (eg if a LIMIT or anything else except ERROR/FATAL ends your query early, we need to unpin buffers queued in the stream before we pfree it), (2) explicitly by rescans, (3) in hypothetical code I thought about that would want to stream blocks speculatively and then change its mind after predicting incorrectly (I had a few patches like that, abandoned for now), and then (4) in this case, by places that temporarily ran out of block numbers, but will have some more again soon and want to resume the stream. It was already debatable whether heuristic state like lookahead distance should survive acoss rescans, or in other words, whether the expected I/O requirements of the previous scan are a useful prediction of the requirements of the next scan, but the answer is clearer in case (4), hence the desire to find a way to separate that use case from the others.
On Tue, Nov 11, 2025 at 11:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Here is a draft patch like that, that tries to be as small as
> possible. Trying out the name read_stream_resume().
I like read_stream_resume(). Tested out 0001 with pgvector and can
confirm it works.
In the test, I would initialize test_read_stream_resume_state.count to 0
+ test_read_stream_resume_state state = {.blkno = blkno};
- Melanie