Re: Trying out read streams in pgvector (an extension)
| От | Thomas Munro |
|---|---|
| Тема | Re: Trying out read streams in pgvector (an extension) |
| Дата | |
| Msg-id | CA+hUKGL7-Dx8KiUo=G91Y5tfFpwDUFFQJ6=9D8Gr1n=DZxGh+w@mail.gmail.com обсуждение исходный текст |
| Ответ на | Re: Trying out read streams in pgvector (an extension) (Nazir Bilal Yavuz <byavuz81@gmail.com>) |
| Список | pgsql-hackers |
On Wed, Nov 12, 2025 at 9:04 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > On Wed, 12 Nov 2025 at 07:12, Thomas Munro <thomas.munro@gmail.com> wrote: > 0002: > > + /* End-of-stream. */ > + buf = read_stream_next_buffer(stream, NULL); > + Assert(buf == InvalidBuffer); > + buf = read_stream_next_buffer(stream, NULL); > + Assert(buf == InvalidBuffer); > > I noticed there are two 'read_stream_next_buffer()' and > 'InvalidBuffer' checks. Does having both provide any additional > validation? I tried removing one of them, and the test still passed. I wanted to demonstrate that this is a state that the stream is stuck in until you call _resume(). I suppose an alternative design would be that _next_buffer() returns InvalidBuffer only once (= the block number callback returns InvalidBlock once) and then automatically resumes (= it restores the distance) and then you can call read_stream_next_buffer() again (= the block number callback will be called again to fill the stream up with new buffers before waiting for the first one to be ready to give to you if it isn't already). That would have the advantage of not requiring a new function at all and make the patch even shorter, but I don't know, I guess I thought that would be a bit more fragile in some way, less explicit. Hmm, would it actually be better if it worked like that? > Also, there is one thing I wanted to clarify about the > 'read_stream_resume()'. If 'read_stream_next_buffer()' returns an > 'InvalidBuffer', then we can use 'read_stream_resume()' alone because > we know that we already consumed all buffers in the stream. For the > rest, we need to use 'read_stream_resume()' together with the > 'read_stream_reset()', right? For the rest, there would be no need to call read_stream_resume(). To recap the uses of read_stream_reset(): the original purpose was to release any buffers (pins) that the stream is holding internally because of look-ahead, and put it back to its original state, ready to be reused. It is called (1) by read_stream_end() as an implementation detail (eg if a LIMIT or anything else except ERROR/FATAL ends your query early, we need to unpin buffers queued in the stream before we pfree it), (2) explicitly by rescans, (3) in hypothetical code I thought about that would want to stream blocks speculatively and then change its mind after predicting incorrectly (I had a few patches like that, abandoned for now), and then (4) in this case, by places that temporarily ran out of block numbers, but will have some more again soon and want to resume the stream. It was already debatable whether heuristic state like lookahead distance should survive acoss rescans, or in other words, whether the expected I/O requirements of the previous scan are a useful prediction of the requirements of the next scan, but the answer is clearer in case (4), hence the desire to find a way to separate that use case from the others.
В списке pgsql-hackers по дате отправления: