Обсуждение: Using read stream in autoprewarm
Hi,
I am working on using the read stream in autoprewarm. I observed ~10%
performance gain with this change. The patch is attached.
The downside of the read stream approach is that a new read stream
object needs to be created for each database, relation and fork. I was
wondering if this would cause a regression but it did not (at least
depending on results of my testing). Another downside could be the
code getting complicated.
For the testing,
- I created 50 databases with each of them having 50 tables and the
size of the tables are 520KB.
    - patched: 51157 ms
    - master: 56769 ms
- I created 5 databases with each of them having 1 table and the size
of the tables are 3GB.
    - patched: 32679 ms
    - master: 36706 ms
I put debugging message with timing information in
autoprewarm_database_main() function, then run autoprewarm 100 times
(by restarting the server) and cleared the OS cache before each
restart. Also, I ensured that the block number of the buffer returning
from the read stream API is correct. I am not sure if that much
testing is enough for this kind of change.
Any feedback would be appreciated.
--
Regards,
Nazir Bilal Yavuz
Microsoft
			
		Вложения
> On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Any feedback would be appreciated. I've took a look into the patch. It seems to me that you add new block numbers to the read stream until you have buffers.So when there are no more buffers you will still have some queued blocks. Maybe can you change the logic so that number of free buffers must be enough to allocate all blocks in look-ahead distance? Thanks! Best regards, Andrey Borodin.
Dear Nazir,
At first A quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the performance tests and graphics?
Best regards, Stepan Neretin!
Hi, Thanks for looking into this! On Thu, 31 Oct 2024 at 21:18, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > On 8 Aug 2024, at 11:32, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > Any feedback would be appreciated. > > I've took a look into the patch. It seems to me that you add new block numbers to the read stream until you have buffers.So when there are no more buffers you will still have some queued blocks. > Maybe can you change the logic so that number of free buffers must be enough to allocate all blocks in look-ahead distance? I see what you mean. When the have_free_buffer() function returns false in the callback, there are still queued blocks in the stream although there are no free buffers in the buffer pool. I think the best way to solve this is to get the number of free buffers in the buffer pool by 'BufferStrategyControl.lastFreeBuffer - BufferStrategyControl.firstFreeBuffer' and then compare it with 'stream->pending_read_nblocks'. When the 'stream->pending_read_nblocks == number_of_free_buffers_in_buffer_pool', end the stream. The problem with that is stream->pending_read_nblocks isn't public, also I am not sure whether 'BufferStrategyControl.lastFreeBuffer - BufferStrategyControl.firstFreeBuffer' is safe to use. -- Regards, Nazir Bilal Yavuz Microsoft
Hi, Thanks for looking into this! On Thu, 31 Oct 2024 at 22:02, Stepan Neretin <sndcppg@gmail.com> wrote: > > At first A quick look it looks good. I will take a closer look at it tomorrow. Could you please let me know about the performancetests and graphics? Sorry but I didn't understand what you mean by performance tests and graphics. Do you want something else than the information in the first email [1]? [1] postgr.es/m/CAN55FZ3n8Gd%2BhajbL%3D5UkGzu_aHGRqnn%2BxktXq2fuds%3D1AOR6Q%40mail.gmail.com -- Regards, Nazir Bilal Yavuz Microsoft
> On 1 Nov 2024, at 12:51, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > am not > sure whether 'BufferStrategyControl.lastFreeBuffer - > BufferStrategyControl.firstFreeBuffer' is safe to use. Ugh... it will work. But it seems to me too dirty hack. There's no scalable way to know size of a free list. Let's just comment that we might read some more buffers if database does not fit into memory? Alternatively we can count size of a free list on the start. Best regards, Andrey Borodin.
Hi, Newer reviewer here, trying to understand more about the read stream API. On Tuesday, November 26th, 2024 at 11:07 AM, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > Any feedback would be appreciated. I've executed the same test of 5 databases with each of them having 1 table of 3GB of size and I've got very similar results. I've also tested using a single database with 4 tables with ~60GB of size and the results compared with master was more closer but still an improvement. Note that I've also increased the default shared_buffers to 7GB to see how it works with large buffer pools. - patched: 5.4259 s - master: 5.53186 s Not to much to say about the code, I'm currently learning more about the read stream API and Postgresql hacking itself. Just some minor points and questions about the patches. v2-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch --- a/src/backend/storage/buffer/freelist.c +/* + * get_number_of_free_buffers -- a lockless way to get the number of free + * buffers in buffer pool. + * + * Note that result continuosly changes as free buffers are moved out by other + * operations. + */ +int +get_number_of_free_buffers(void) typo on continuosly -> continuously v2-0001-Use-read-stream-in-autoprewarm.patch + bool *rs_have_free_buffer = per_buffer_data; + + + *rs_have_free_buffer = true; + Not sure if I understand why this variable is needed, it seems that it is only written and never read? Just as comparison, the block_range_read_stream_cb callback used on pg_prewarm seems to not use the per_buffer_data parameter. -- Matheus Alcantara EDB: https://www.enterprisedb.com
On Wednesday, November 27th, 2024 at 11:19 AM, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > v2-0001-Use-read-stream-in-autoprewarm.patch > > + bool *rs_have_free_buffer = per_buffer_data; > > + > > + > > + *rs_have_free_buffer = true; > > + > > > > Not sure if I understand why this variable is needed, it seems that it is only > > written and never read? Just as comparison, the block_range_read_stream_cb > > callback used on pg_prewarm seems to not use the per_buffer_data parameter. > > > Actually, it is read in the main loop of the > autoprewarm_database_main() function: > > /* There are no free buffers left in shared buffers, break the loop. */ > else if (!(*rs_have_free_buffer)) > break; > > apw_read_stream_next_block() callback function sets > rs_have_free_buffer's value to false when there is no free buffer left > in the shared buffers. And the code above terminates the main loop in > the autoprewarm_database_main() function when it is set to false. > > block_range_read_stream_cb() callback is used when the callback only > needs to loop over the block numbers. However, for the autoprewarm > case; the callback function needs to do additional checks so another > callback and the use of this variable are required. Ohh, I see, thanks very much for the explanation. > v3 is attached. Thanks. I don't know if there is another way that this patch could be tested? Looking forward on other reviews on this. -- Matheus Alcantara EDB: https://www.enterprisedb.com
On Wed, 27 Nov 2024 at 19:20, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Hi, > > Thank you for looking into this! > > On Wed, 27 Nov 2024 at 16:50, Matheus Alcantara <mths.dev@pm.me> wrote: > > I've executed the same test of 5 databases with each of them having 1 table of > > 3GB of size and I've got very similar results. > > > > I've also tested using a single database with 4 tables with ~60GB of size and > > the results compared with master was more closer but still an improvement. Note > > that I've also increased the default shared_buffers to 7GB to see how it works > > with large buffer pools. > > - patched: 5.4259 s > > - master: 5.53186 s > > Thanks for the testing. > > > Not to much to say about the code, I'm currently learning more about the read > > stream API and Postgresql hacking itself. Just some minor points and questions > > about the patches. > > > > > > v2-0002-Count-free-buffers-at-the-start-of-the-autoprewar.patch > > --- a/src/backend/storage/buffer/freelist.c > > +/* > > + * get_number_of_free_buffers -- a lockless way to get the number of free > > + * buffers in buffer pool. > > + * > > + * Note that result continuosly changes as free buffers are moved out by other > > + * operations. > > + */ > > +int > > +get_number_of_free_buffers(void) > > > > typo on continuosly -> continuously > > Done. > > > v2-0001-Use-read-stream-in-autoprewarm.patch > > + bool *rs_have_free_buffer = per_buffer_data; > > + > > + > > + *rs_have_free_buffer = true; > > + > > > > Not sure if I understand why this variable is needed, it seems that it is only > > written and never read? Just as comparison, the block_range_read_stream_cb > > callback used on pg_prewarm seems to not use the per_buffer_data parameter. > > Actually, it is read in the main loop of the > autoprewarm_database_main() function: > > /* There are no free buffers left in shared buffers, break the loop. */ > else if (!(*rs_have_free_buffer)) > break; > > apw_read_stream_next_block() callback function sets > rs_have_free_buffer's value to false when there is no free buffer left > in the shared buffers. And the code above terminates the main loop in > the autoprewarm_database_main() function when it is set to false. > > block_range_read_stream_cb() callback is used when the callback only > needs to loop over the block numbers. However, for the autoprewarm > case; the callback function needs to do additional checks so another > callback and the use of this variable are required. > > v3 is attached. > > -- > Regards, > Nazir Bilal Yavuz > Microsoft Hi! > + old_blk = &(p->block_info[p->pos - 1]); > + cur_blk = &(p->block_info[p->pos]); Should we Assert(p->pos > 0 && p->pos < *something*) Patch tested with no regression. -- Best regards, Kirill Reshke
On Fri, 29 Nov 2024 at 16:19, Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > v4 is attached. Hi! I feel like we are ready to mark this as RFC, WDYT? -- Best regards, Kirill Reshke
> On 2 Dec 2024, at 16:16, Kirill Reshke <reshkekirill@gmail.com> wrote: > > I feel like we are ready to mark this as RFC, WDYT? +1 Best regards, Andrey Borodin.
Hi, On Mon, 2 Dec 2024 at 16:30, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > On 2 Dec 2024, at 16:16, Kirill Reshke <reshkekirill@gmail.com> wrote: > > > > I feel like we are ready to mark this as RFC, WDYT? > > +1 Done. -- Regards, Nazir Bilal Yavuz Microsoft
On Fri, Nov 29, 2024 at 6:20 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > v4 is attached. I've started looking at this version. What I don't like about this structure is that we are forced to increment the position in the array of BlockInfoRecords in both the callback and the main loop in autoprewarm_database_main(). There isn't a way around it because we have to return control to the user when we encounter a new relation (we can't close the relation and destroy the read stream in the callback). And in the main loop in autoprewarm_database_main(), we may fail to open the next relation and then need to keep advancing the position in the array of BlockInfoRecords. It isn't just that we have to advance the position in both places -- we also have to have a special case for the first block. All in all, given that in the current read stream API, a single stream must only concern itself with a single relation, fork combo, I think there is no elegant way to deal with this in autoprewarm. One alternative is to loop through the array of BlockInfoRecords and get the start and end positions of the blocks in the arary for a single relation/fork combo. Then we could make the read stream and pass those two positions and the array as callback_private_data. That would mean we loop through the whole array twice, but I wonder if the improvement in clarity is worth it? Some review feedback on your v4: I don't think we need the rs_have_free_buffer per_buffer_data. We can just check have_free_buffers() in both the callback and main loop in autoprewarm_database_main(). I also think you want a comment about why first_block is needed. And I think you need to guard the read_stream_end() after the loop -- what if we never made a read stream because we errored out for all the block's relations or something? - Melanie
On Sat, Mar 29, 2025 at 4:09 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > One alternative is to loop through the array of BlockInfoRecords and > get the start and end positions of the blocks in the arary for a > single relation/fork combo. Then we could make the read stream and > pass those two positions and the array as callback_private_data. That > would mean we loop through the whole array twice, but I wonder if the > improvement in clarity is worth it? An alternative to this alternative is to somehow include the length of each "span" (BlockInfoRecords from a single relation/fork) in the first BlockInfoRecord of that span when building the array. Dunno how hard that would be, but then you wouldn't have to loop through it twice. - Melanie
Hi, On 2025-03-29 16:09:56 -0400, Melanie Plageman wrote: > I've started looking at this version. What I don't like about this > structure is that we are forced to increment the position in the array > of BlockInfoRecords in both the callback and the main loop in > autoprewarm_database_main(). There isn't a way around it because we > have to return control to the user when we encounter a new relation > (we can't close the relation and destroy the read stream in the > callback). And in the main loop in autoprewarm_database_main(), we may > fail to open the next relation and then need to keep advancing the > position in the array of BlockInfoRecords. > > It isn't just that we have to advance the position in both places -- > we also have to have a special case for the first block. All in all, > given that in the current read stream API, a single stream must only > concern itself with a single relation, fork combo, I think there is no > elegant way to deal with this in autoprewarm. How about having an iterator function operating on a pointer to iterator state that's used both by the main loop and the read stream callback? If the iterator reaches the next relation, it returns InvalidBlockNumber and the main loop starts the next stream? Greetings, Andres Freund
On Sat, Mar 29, 2025 at 4:44 PM Andres Freund <andres@anarazel.de> wrote: > > How about having an iterator function operating on a pointer to iterator state > that's used both by the main loop and the read stream callback? If the > iterator reaches the next relation, it returns InvalidBlockNumber and the main > loop starts the next stream? I don't think that removes the need for the first_block special case. And we still need to duplicate the logic for detecting the next database, block, or filenumber in both places. It maybe reduces the potential for error a little bit. But I don't think it improves the clarity. - Melanie
Hi, Thank you for looking into this! On Sat, 29 Mar 2025 at 23:10, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Nov 29, 2024 at 6:20 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > v4 is attached. > > I've started looking at this version. What I don't like about this > structure is that we are forced to increment the position in the array > of BlockInfoRecords in both the callback and the main loop in > autoprewarm_database_main(). There isn't a way around it because we > have to return control to the user when we encounter a new relation > (we can't close the relation and destroy the read stream in the > callback). And in the main loop in autoprewarm_database_main(), we may > fail to open the next relation and then need to keep advancing the > position in the array of BlockInfoRecords. > > It isn't just that we have to advance the position in both places -- > we also have to have a special case for the first block. All in all, > given that in the current read stream API, a single stream must only > concern itself with a single relation, fork combo, I think there is no > elegant way to deal with this in autoprewarm. > > One alternative is to loop through the array of BlockInfoRecords and > get the start and end positions of the blocks in the arary for a > single relation/fork combo. Then we could make the read stream and > pass those two positions and the array as callback_private_data. That > would mean we loop through the whole array twice, but I wonder if the > improvement in clarity is worth it? I think this is a good alternative. I will work on this and try to propose a patch. > Some review feedback on your v4: I don't think we need the > rs_have_free_buffer per_buffer_data. We can just check > have_free_buffers() in both the callback and main loop in > autoprewarm_database_main(). I also think you want a comment about why > first_block is needed. And I think you need to guard the > read_stream_end() after the loop -- what if we never made a read > stream because we errored out for all the block's relations or > something? All of these are addressed. One extra thing I noticed is we were not checking if blocknum < number_of_block_in_relation at the first_block case in the stream callback, this is fixed now. -- Regards, Nazir Bilal Yavuz Microsoft
Вложения
On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> > Some review feedback on your v4: I don't think we need the
> > rs_have_free_buffer per_buffer_data. We can just check
> > have_free_buffers() in both the callback and main loop in
> > autoprewarm_database_main(). I also think you want a comment about why
> > first_block is needed. And I think you need to guard the
> > read_stream_end() after the loop -- what if we never made a read
> > stream because we errored out for all the block's relations or
> > something?
>
> All of these are addressed. One extra thing I noticed is we were not
> checking if blocknum < number_of_block_in_relation at the first_block
> case in the stream callback, this is fixed now.
I'm wondering why you need to check if have_free_buffer() in the else
branch after getting the buffer from the read stream API. Can't you
just put it back in the for loop condition? Seems like it would have
the same effect.
-   for (; pos < apw_state->prewarm_stop_idx; pos++)
+   for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++)
    {
        BlockInfoRecord *blk = &block_info[pos];
        Buffer      buf;
@@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg)
            apw_state->prewarmed_blocks++;
            ReleaseBuffer(buf);
        }
-       /* There are no free buffers left in shared buffers, break the loop. */
-       else if (!have_free_buffer())
-           break;
Looking at the code some more, I feel stuck on my original point about
incrementing the position in two places.
AFAICT, you are going to end up going through the array twice with this design.
Because you don't set the pos variable in autoprewarm_database_main()
from the p->pos variable which the read stream callback is
incrementing, if the read stream callback increments p->pos it a few
positions yielding those blocks to the read stream machinery to read,
you are then going to iterate over those positions in the array again
in the autoprewarm_database_main() loop.
I think you can get around this by setting pos from p->pos in
autoprewarm_database_main() after read_stream_next_buffer(). Or by
using p->pos in the loop in autoprewarm_database_main() (which is
basically what Andres suggested). I'm not sure, though, if this has
any problems. Like not closing a relation in the right place.
- Melanie
			
		Hi,
Thank you for looking into this!
On Mon, 31 Mar 2025 at 17:42, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Sun, Mar 30, 2025 at 10:01 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > > Some review feedback on your v4: I don't think we need the
> > > rs_have_free_buffer per_buffer_data. We can just check
> > > have_free_buffers() in both the callback and main loop in
> > > autoprewarm_database_main(). I also think you want a comment about why
> > > first_block is needed. And I think you need to guard the
> > > read_stream_end() after the loop -- what if we never made a read
> > > stream because we errored out for all the block's relations or
> > > something?
> >
> > All of these are addressed. One extra thing I noticed is we were not
> > checking if blocknum < number_of_block_in_relation at the first_block
> > case in the stream callback, this is fixed now.
>
> I'm wondering why you need to check if have_free_buffer() in the else
> branch after getting the buffer from the read stream API. Can't you
> just put it back in the for loop condition? Seems like it would have
> the same effect.
>
> -   for (; pos < apw_state->prewarm_stop_idx; pos++)
> +   for (; pos < apw_state->prewarm_stop_idx && have_free_buffer(); pos++)
>     {
>         BlockInfoRecord *blk = &block_info[pos];
>         Buffer      buf;
> @@ -640,9 +640,6 @@ autoprewarm_database_main(Datum main_arg)
>             apw_state->prewarmed_blocks++;
>             ReleaseBuffer(buf);
>         }
> -       /* There are no free buffers left in shared buffers, break the loop. */
> -       else if (!have_free_buffer())
> -           break;
>
You are right, done. Attached as v6.
> Looking at the code some more, I feel stuck on my original point about
> incrementing the position in two places.
> AFAICT, you are going to end up going through the array twice with this design.
> Because you don't set the pos variable in autoprewarm_database_main()
> from the p->pos variable which the read stream callback is
> incrementing, if the read stream callback increments p->pos it a few
> positions yielding those blocks to the read stream machinery to read,
> you are then going to iterate over those positions in the array again
> in the autoprewarm_database_main() loop.
>
> I think you can get around this by setting pos from p->pos in
> autoprewarm_database_main() after read_stream_next_buffer(). Or by
> using p->pos in the loop in autoprewarm_database_main() (which is
> basically what Andres suggested). I'm not sure, though, if this has
> any problems. Like not closing a relation in the right place.
I worked on an alternative approach, I refactored code a bit. It does
not traverse the list two times and I think the code is more suitable
to use read streams now. I simply get how many blocks are processed by
read streams and move the list forward by this number, so the actual
loop skips these blocks. This approach is attached with 'alternative'
prefix.
--
Regards,
Nazir Bilal Yavuz
Microsoft
			
		Вложения
On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I worked on an alternative approach, I refactored code a bit. It does
> not traverse the list two times and I think the code is more suitable
> to use read streams now. I simply get how many blocks are processed by
> read streams and move the list forward by this number, so the actual
> loop skips these blocks. This approach is attached with 'alternative'
> prefix.
I am leaning toward the refactored approach because I don't think we
want to go through the array twice and I think it is hard to get it
right with incrementing p.pos in both places and being sure we
correctly close the relation etc.
Looking at your alternative approach, I don't see how the innermost
while loop in autoprewarm_database_main() is correct
        /* Check whether blocknum is valid and within fork file size. */
        while (cur_filenumber == blk->filenumber &&
               blk->blocknum >= nblocks_in_fork)
        {
            /* Move to next forknum. */
            pos++;
            continue;
        }
Won't this just infinitely loop?
- Melanie
			
		Hi,
On Mon, 31 Mar 2025 at 21:15, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Mon, Mar 31, 2025 at 12:27 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
> >
> > I worked on an alternative approach, I refactored code a bit. It does
> > not traverse the list two times and I think the code is more suitable
> > to use read streams now. I simply get how many blocks are processed by
> > read streams and move the list forward by this number, so the actual
> > loop skips these blocks. This approach is attached with 'alternative'
> > prefix.
>
> I am leaning toward the refactored approach because I don't think we
> want to go through the array twice and I think it is hard to get it
> right with incrementing p.pos in both places and being sure we
> correctly close the relation etc.
I liked it more as well.
> Looking at your alternative approach, I don't see how the innermost
> while loop in autoprewarm_database_main() is correct
>
>         /* Check whether blocknum is valid and within fork file size. */
>         while (cur_filenumber == blk->filenumber &&
>                blk->blocknum >= nblocks_in_fork)
>         {
>             /* Move to next forknum. */
>             pos++;
>             continue;
>         }
>
> Won't this just infinitely loop?
Oops, you are right. It should be an if statement instead of a while
loop, fixed now. Also, moved 'dsm_detach(seg);' to its previous place
to reduce diff. Attached as v7. Do you think that I should continue to
attach both approaches?
--
Regards,
Nazir Bilal Yavuz
Microsoft
			
		Вложения
On Mon, Mar 31, 2025 at 2:58 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> Do you think that I should continue to
> attach both approaches?
No, for now let's try and get this approach to a good place and then
see which one we like.
I think there might be another problem with the code. We only set
cur_database in the loop in autoprewarm_databas_main() when it is 0
        if (cur_database != blk->database)
        {
            if (cur_database == 0)
                cur_database = blk->database;
I know that the read stream will return InvalidBlockNumber when we
move onto the next database, but I don't see how we will end up
actually stopping prewarming in autoprewarm_database_main() when we
move on to the next database.
Another thing:
I don't know if it is a correctness issue but in
autoprewarm_database_main(), in this case
        if (!rel && cur_filenumber != blk->filenumber)
        {
you have removed the Assert(rel == NULL) -- I worry that we will end
up with a rel from a previous iteration after failing to open th enext
rel. I think we want this assert.
And a last thing
I noticed is that the initial values for cur_database, cur_filenumber,
and nblocks_in_fork in autoprewarm_database_main() are all initialized
to different kinds of initial values for different reasons. I'm
thinking if there is a way to make it consistent.
    cur_database = block_info[pos].database;
    cur_filenumber = InvalidOid;
    nblocks_in_fork = InvalidBlockNumber;
cur_database is set to be the same as the first database in the array
so that we won't hit
        if (cur_database != blk->database)
on the first block.
However, we make cur_filenumber InvalidOid because for the first block
we want to hit code that forces us to open a new relation
        if (!rel && cur_filenumber != blk->filenumber)
And nblocks_in_fork to InvalidBlockNumber so that 1) we don't have to
get the number before starting the loop and 2) so we would move past
BlockInfoRecords with invalid filenumber and invalid blocknumber
        if (cur_filenumber == blk->filenumber &&
            blk->blocknum >= nblocks_in_fork)
So, I'm just thinking if it would be correct to initialize
cur_database to InvalidOid and to check for that before skipping a
block, or if that doesn't work when the first blocks' database is
InvalidOid.
- Melanie
			
		On Mon, Mar 31, 2025 at 3:27 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> I think there might be another problem with the code. We only set
> cur_database in the loop in autoprewarm_databas_main() when it is 0
>
>         if (cur_database != blk->database)
>         {
>             if (cur_database == 0)
>                 cur_database = blk->database;
>
> I know that the read stream will return InvalidBlockNumber when we
> move onto the next database, but I don't see how we will end up
> actually stopping prewarming in autoprewarm_database_main() when we
> move on to the next database.
Whoops, this isn't right. It does work. I'm going to draft a version
suggesting slightly different variable naming and a couple comments to
make this more clear.
- Melanie
			
		On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Whoops, this isn't right. It does work. I'm going to draft a version
> suggesting slightly different variable naming and a couple comments to
> make this more clear.
Okay, so after further study, I think there are multiple issues still
with the code. We could end up comparing a blocknumber to nblocks
calculated from a different fork. To address this, you'll need to keep
track of the last fork_number. At that point, you kind of have to
bring back old_blk -- because that is what we are recreating with
multiple local variables.
But, I think, overall, what we actually want to do is actually be
really explicit about fast-forwarding in the failure cases (when we
want to skip ahead because a relation is invalid or a fork is
invalid). We were trying to use the main loop control and just add
special cases to allow us to do this fast-forwarding. But, I think
instead, we want to just go to a function or loop somewhere and fast
forward through those bad blocks until we get to the next run of
blocks from a different relation or fork.
I've sketched out an idea like this in the attached. I don't think it
is 100% correct. It does pass tests, but I think we might incorrectly
advance pos twice after skipping a run of blocks belonging to a bad
fork or relation -- and thus skip the first good block after some bad
blocks.
It also needs some more refactoring.
maybe instead of having the skip code like this
        skip_forknumber:;
            while ((blk = next_record(block_info, &i)) != NULL &&
blk->database == database && blk->filenumber == filenumber &&
blk->forknum == forknum);
we make it a function? to avoid the back-to-back while loop conditions
(because of the outer do while loop).
But the explicit looping for skipping the bad blocks and the nested
loops for rel and fork -- I think these are less error prone.
What do you think?
- Melanie
			
		Вложения
Hi, On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Whoops, this isn't right. It does work. I'm going to draft a version > > suggesting slightly different variable naming and a couple comments to > > make this more clear. > > Okay, so after further study, I think there are multiple issues still > with the code. We could end up comparing a blocknumber to nblocks > calculated from a different fork. To address this, you'll need to keep > track of the last fork_number. At that point, you kind of have to > bring back old_blk -- because that is what we are recreating with > multiple local variables. Yes, I realized the same. > But, I think, overall, what we actually want to do is actually be > really explicit about fast-forwarding in the failure cases (when we > want to skip ahead because a relation is invalid or a fork is > invalid). We were trying to use the main loop control and just add > special cases to allow us to do this fast-forwarding. But, I think > instead, we want to just go to a function or loop somewhere and fast > forward through those bad blocks until we get to the next run of > blocks from a different relation or fork. > > I've sketched out an idea like this in the attached. I don't think it > is 100% correct. It does pass tests, but I think we might incorrectly > advance pos twice after skipping a run of blocks belonging to a bad > fork or relation -- and thus skip the first good block after some bad > blocks. The test actually does not pass, it prewarms 4 of 211 blocks. It prewarms all 211 blocks in the master. From 'pg_prewarm/001_basic/log/001_basic_main.log': 'autoprewarm successfully prewarmed 4 of 211 previously-loaded blocks' > It also needs some more refactoring. > > maybe instead of having the skip code like this > skip_forknumber:; > while ((blk = next_record(block_info, &i)) != NULL && > blk->database == database && blk->filenumber == filenumber && > blk->forknum == forknum); > > we make it a function? to avoid the back-to-back while loop conditions > (because of the outer do while loop). +1 for using the functions. I think it is hard to follow / maintain this with the do-while loops and goto statements. > But the explicit looping for skipping the bad blocks and the nested > loops for rel and fork -- I think these are less error prone. One question in my mind is, the outermost loop stops when the database changes, we do not check if it is changed from the database oid = 0. Handling this might require some structural changes. > What do you think? I think what you said is right but the current version of the patch looks more complicated to me. I may be biased because I do not like do-while loops and goto statements. -- Regards, Nazir Bilal Yavuz Microsoft
Hi, On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Mon, Mar 31, 2025 at 3:45 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > Whoops, this isn't right. It does work. I'm going to draft a version > > suggesting slightly different variable naming and a couple comments to > > make this more clear. > > Okay, so after further study, I think there are multiple issues still > with the code. We could end up comparing a blocknumber to nblocks > calculated from a different fork. To address this, you'll need to keep > track of the last fork_number. At that point, you kind of have to > bring back old_blk -- because that is what we are recreating with > multiple local variables. I am attaching v8, which is an updated version of the v7. I tried to get rid of these local variables and refactored code to make logic more straightforward instead of going back and forth. 0001 and 0002 are v8. 0003 is another refactoring attempt to make code more straightforward. I did not squash 0003 to previous patches as you might not like it. -- Regards, Nazir Bilal Yavuz Microsoft
Вложения
On Tue, Apr 1, 2025 at 7:21 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > On Tue, 1 Apr 2025 at 05:14, Melanie Plageman <melanieplageman@gmail.com> wrote: > > > +1 for using the functions. I think it is hard to follow / maintain > this with the do-while loops and goto statements. I'll take a look at your downthread proposal in a bit. But the attached patch is a new version of what I proposed with the functions. It's still not totally correct, but I wanted to see what you thought. > > But the explicit looping for skipping the bad blocks and the nested > > loops for rel and fork -- I think these are less error prone. > > One question in my mind is, the outermost loop stops when the database > changes, we do not check if it is changed from the database oid = 0. > Handling this might require some structural changes. I don't understand why each database has global objects at the beginning. If there are global objects, they are global to all databases, so surely the sort function would have put them all at the beginning? One problem is we need a database connection to prewarm these, but if the global objects are all at the beginning, then maybe we can handle those with a special case and not force ourselves to check for them when trying to load blocks from every database. - Melanie
Вложения
On Tue, Apr 1, 2025 at 8:50 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote:
>
> I am attaching v8, which is an updated version of the v7. I tried to
> get rid of these local variables and refactored code to make logic
> more straightforward instead of going back and forth.
>
> 0001 and 0002 are v8. 0003 is another refactoring attempt to make code
> more straightforward. I did not squash 0003 to previous patches as you
> might not like it.
I looked at the code on your github branch that has all three of these
squashed together.
I think our approaches are converging. I like that you are
fast-forwarding to the next filenumber or fork number explicitly when
there is a bad relation or fork. I've changed my version (see newest
one attached) to do the fast-forwarding inline instead of in a
separate function like yours (the function didn't save many LOC and
actually may have added to cognitive overhead).
Compared to my version, I think you avoided one level of loop nesting with your
        if (!rel)
        else if (smgrexists(RelationGetSmgr(rel), blk->forknum))
        else
but for starters, I don't think you can do this:
        else if (smgrexists(RelationGetSmgr(rel), blk->forknum))
because you didn't check if you have a legal forknum first
And, I actually kind of prefer the explicitly nested structure
loop through all relations
      loop through all forks
            loop through all buffers
While in the old structure, I liked your
autoprewarm_prewarm_relation() function, but I think it is nicer
inlined like in my version. It makes the loop through all buffers
explicit too.
I know you mentioned off-list that you don't like the handling of
global objects in my version, but I prefer doing it this way (even
though we have to check for in the loop condition) to having to set
the current database once we reach non-shared objects. It feels too
fiddly. This way seems less error prone. Looking at this version, what
do you think? Could we do it better?
Let me know what you think of this version. I think it is the best of
both our approaches. I've separated it into two commits -- the first
does all the refactoring without using the read stream API and the
second one uses the read stream API.
On another topic, what are the minimal places we need to call
have_free_buffers() (in this version)? I haven't even started looking
at the last patch you've been sending that is about checking the
freelist. I'll have to do that next.
- Melanie
			
		Вложения
Hi, On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Tue, Apr 1, 2025 at 8:50 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > I am attaching v8, which is an updated version of the v7. I tried to > > get rid of these local variables and refactored code to make logic > > more straightforward instead of going back and forth. > > > > 0001 and 0002 are v8. 0003 is another refactoring attempt to make code > > more straightforward. I did not squash 0003 to previous patches as you > > might not like it. > > I looked at the code on your github branch that has all three of these > squashed together. Thank you! > I think our approaches are converging. I like that you are > fast-forwarding to the next filenumber or fork number explicitly when > there is a bad relation or fork. I've changed my version (see newest > one attached) to do the fast-forwarding inline instead of in a > separate function like yours (the function didn't save many LOC and > actually may have added to cognitive overhead). > > Compared to my version, I think you avoided one level of loop nesting with your > > if (!rel) > else if (smgrexists(RelationGetSmgr(rel), blk->forknum)) > else > > but for starters, I don't think you can do this: > > else if (smgrexists(RelationGetSmgr(rel), blk->forknum)) > > because you didn't check if you have a legal forknum first You are right, I missed that. I think smgrexists() should return NULL if the forknum is invalid but it is not a topic for this thread. > And, I actually kind of prefer the explicitly nested structure > > loop through all relations > loop through all forks > loop through all buffers I prefer this as well. We know when we opened the relation, so we do not need to close it in two places like I did. > While in the old structure, I liked your > autoprewarm_prewarm_relation() function, but I think it is nicer > inlined like in my version. It makes the loop through all buffers > explicit too. Yes, I liked your approach. > I know you mentioned off-list that you don't like the handling of > global objects in my version, but I prefer doing it this way (even > though we have to check for in the loop condition) to having to set > the current database once we reach non-shared objects. It feels too > fiddly. This way seems less error prone. Looking at this version, what > do you think? Could we do it better? I think there might be a problem with that approach. Let's say that we are able to open relation when database oid = 0 and filenumber = 18. Then we are trying to find a valid fork now. We couldn't find a valid fork immediately, so we continued looping. Then database oid is changed from 0 to let's say 1 but filenumber remains the same. We are still in the valid fork loop, so relation remains from the database oid = 0. Isn't that wrong? > Let me know what you think of this version. I think it is the best of > both our approaches. I've separated it into two commits -- the first > does all the refactoring without using the read stream API and the > second one uses the read stream API. Some comments, 0001: ReadBufferExtended() can be called in its own minimal loop, otherwise we end up doing unnecessary checks for each ReadBufferExtended() call. This is not a problem when the 0002 is applied. 0002: We don't skip blocks whose blocknum is more than nblocks_in_fork. We can add that either to a stream callback like you did before or after the read_stream_end. I prefer stream callback because of the reason below [1]. > On another topic, what are the minimal places we need to call > have_free_buffers() (in this version)? I haven't even started looking > at the last patch you've been sending that is about checking the > freelist. I'll have to do that next. I think its current places are good enough. We may add one after the read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork after the read stream finishes. If we handle that in the stream callback then no need to add have_free_buffers() [1]. Other than these comments, I think the current structure looks good. -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote: > > > > I know you mentioned off-list that you don't like the handling of > > global objects in my version, but I prefer doing it this way (even > > though we have to check for in the loop condition) to having to set > > the current database once we reach non-shared objects. It feels too > > fiddly. This way seems less error prone. Looking at this version, what > > do you think? Could we do it better? > > I think there might be a problem with that approach. Let's say that we > are able to open relation when database oid = 0 and filenumber = 18. > Then we are trying to find a valid fork now. We couldn't find a valid > fork immediately, so we continued looping. Then database oid is > changed from 0 to let's say 1 but filenumber remains the same. We are > still in the valid fork loop, so relation remains from the database > oid = 0. Isn't that wrong? Yep, you are totally right. The code was wrong. We could fix it by setting current_db to the valid database once we've prewarmed the global objects, but we need that logic in three places, so that seems quite undesirable. In attached v9, I've added a patch to apw_load_buffers() which invokes autoprewarm_database_main() for the global objects alone but while connected to the first valid database. It's not the best solution but I think it is better than having that fiddly logic everywhere about database 0. This made me think, I wonder if we could connect to template0 or template1 to prewarm the global objects. Then we could also prewarm if only global objects are present (that doesn't seem very important but it would be a side effect). It might be more clear to connect to template0/1 instead of the first valid database to prewarm global objects. I don't know if there is some reason not to do this -- like maybe bg workers aren't allowed or something? > 0001: > > ReadBufferExtended() can be called in its own minimal loop, otherwise > we end up doing unnecessary checks for each ReadBufferExtended() call. > This is not a problem when the 0002 is applied. Could you provide a snippet of example code? If we call ReadBufferExtended() in a loop, the block won't be changing, so I don't see how that will help. > 0002: > > We don't skip blocks whose blocknum is more than nblocks_in_fork. We > can add that either to a stream callback like you did before or after > the read_stream_end. I prefer stream callback because of the reason > below [1]. Yep, I also thought we had to have that logic, but because we sort by db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork will be last and so we just want to move on to the next fork. > > On another topic, what are the minimal places we need to call > > have_free_buffers() (in this version)? I haven't even started looking > > at the last patch you've been sending that is about checking the > > freelist. I'll have to do that next. > > I think its current places are good enough. We may add one after the > read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork > after the read stream finishes. If we handle that in the stream > callback then no need to add have_free_buffers() [1]. As long as we have it in the callback, I don't think that we need have_free_buffers() after read_stream_end() since it is in the while loop condition which we will immediately execute after read_stream_end(). I was also wondering about the other patch in your earlier set that set stop_idx from get_number_of_free_buffers(). Could you tell me more about that? What does it do and why is it needed with the read stream but wasn't needed before? - Melanie
Вложения
Hi, On Wed, 2 Apr 2025 at 18:54, Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > On Wed, 2 Apr 2025 at 01:36, Melanie Plageman <melanieplageman@gmail.com> wrote: > > > > > > I know you mentioned off-list that you don't like the handling of > > > global objects in my version, but I prefer doing it this way (even > > > though we have to check for in the loop condition) to having to set > > > the current database once we reach non-shared objects. It feels too > > > fiddly. This way seems less error prone. Looking at this version, what > > > do you think? Could we do it better? > > > > I think there might be a problem with that approach. Let's say that we > > are able to open relation when database oid = 0 and filenumber = 18. > > Then we are trying to find a valid fork now. We couldn't find a valid > > fork immediately, so we continued looping. Then database oid is > > changed from 0 to let's say 1 but filenumber remains the same. We are > > still in the valid fork loop, so relation remains from the database > > oid = 0. Isn't that wrong? > > Yep, you are totally right. The code was wrong. We could fix it by > setting current_db to the valid database once we've prewarmed the > global objects, but we need that logic in three places, so that seems > quite undesirable. > > In attached v9, I've added a patch to apw_load_buffers() which invokes > autoprewarm_database_main() for the global objects alone but while > connected to the first valid database. It's not the best solution but > I think it is better than having that fiddly logic everywhere about > database 0. I liked this. I think it is better compared to handling global objects in autoprewarm_database_main(). > This made me think, I wonder if we could connect to template0 or > template1 to prewarm the global objects. Then we could also prewarm if > only global objects are present (that doesn't seem very important but > it would be a side effect). It might be more clear to connect to > template0/1 instead of the first valid database to prewarm global > objects. I don't know if there is some reason not to do this -- like > maybe bg workers aren't allowed or something? I am not sure but I think the current implementation is good enough. > > 0001: > > > > ReadBufferExtended() can be called in its own minimal loop, otherwise > > we end up doing unnecessary checks for each ReadBufferExtended() call. > > This is not a problem when the 0002 is applied. > > Could you provide a snippet of example code? If we call > ReadBufferExtended() in a loop, the block won't be changing, so I > don't see how that will help. I don't have an example code right now. But what I mean is we may call ReadBufferExtended() in a loop for the blocks in the same fork. We don't need to call smgrexists() and RelationGetNumberOfBlocksInFork() for each block, we will call these for each fork not for each block. However, like I said before, this is not important when the read stream code is applied. > > 0002: > > > > We don't skip blocks whose blocknum is more than nblocks_in_fork. We > > can add that either to a stream callback like you did before or after > > the read_stream_end. I prefer stream callback because of the reason > > below [1]. > > Yep, I also thought we had to have that logic, but because we sort by > db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork > will be last and so we just want to move on to the next fork. I agree that they will be the last, but won't we end up creating a read stream object for each block? > > > On another topic, what are the minimal places we need to call > > > have_free_buffers() (in this version)? I haven't even started looking > > > at the last patch you've been sending that is about checking the > > > freelist. I'll have to do that next. > > > > I think its current places are good enough. We may add one after the > > read_stream_end if we want to handle blk->blocknum >= nblocks_in_fork > > after the read stream finishes. If we handle that in the stream > > callback then no need to add have_free_buffers() [1]. > > As long as we have it in the callback, I don't think that we need > have_free_buffers() after read_stream_end() since it is in the while > loop condition which we will immediately execute after > read_stream_end(). This is okay for me. > I was also wondering about the other patch in your earlier set that > set stop_idx from get_number_of_free_buffers(). Could you tell me more > about that? What does it do and why is it needed with the read stream > but wasn't needed before? In the read stream code, we use callbacks to create bigger I/Os. These I/Os aren't processed until the io_combine_limit or we hit not sequential blocknum. In other words, when the have_free_buffer() function returns false in the callback; there are still queued blocks in the stream, although there are no free buffers in the buffer pool. We can end up creating I/Os bigger than free buffers in the shared buffers. To solve that a bit, we try to get a number of free buffers in the shared buffers. So, we try to minimize the problem above by using the actual free buffer count. That optimization has problems like if other processes fill shared buffers at the same time while the read stream is running, then this optimization will not work well. -- Regards, Nazir Bilal Yavuz Microsoft
On Wed, Apr 2, 2025 at 1:20 PM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > On Wed, 2 Apr 2025 at 18:54, Melanie Plageman <melanieplageman@gmail.com> wrote: > > > > On Wed, Apr 2, 2025 at 6:26 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > > > I don't have an example code right now. But what I mean is we may call > ReadBufferExtended() in a loop for the blocks in the same fork. We > don't need to call smgrexists() and RelationGetNumberOfBlocksInFork() > for each block, we will call these for each fork not for each block. > However, like I said before, this is not important when the read > stream code is applied. Ah, you are so right. That was totally messed up in the last version. I've fixed it in attached v10. I think having it correct in the 0002 patch makes it easier to understand how the read stream callback is replacing it. > > > We don't skip blocks whose blocknum is more than nblocks_in_fork. We > > > can add that either to a stream callback like you did before or after > > > the read_stream_end. I prefer stream callback because of the reason > > > below [1]. > > > > Yep, I also thought we had to have that logic, but because we sort by > > db,rel,fork,blkno, I think blocks with blocknumber >= nblocks_in_fork > > will be last and so we just want to move on to the next fork. > > I agree that they will be the last, but won't we end up creating a > read stream object for each block? Ah, yes, you are right. That would have been really broken. I think I fixed it. See attached. Now we'll only do that for the first block if it is invalid (which is probably okay IMO). > > I was also wondering about the other patch in your earlier set that > > set stop_idx from get_number_of_free_buffers(). Could you tell me more > > about that? What does it do and why is it needed with the read stream > > but wasn't needed before? > > In the read stream code, we use callbacks to create bigger I/Os. These > I/Os aren't processed until the io_combine_limit or we hit not > sequential blocknum. In other words, when the have_free_buffer() > function returns false in the callback; there are still queued blocks > in the stream, although there are no free buffers in the buffer pool. > We can end up creating I/Os bigger than free buffers in the shared > buffers. > > To solve that a bit, we try to get a number of free buffers in the > shared buffers. So, we try to minimize the problem above by using the > actual free buffer count. That optimization has problems like if other > processes fill shared buffers at the same time while the read stream > is running, then this optimization will not work well. Hmm. Yea, I do find it confusing that it will get so easily out of date. Let's circle back to this after getting the other patches to a good place (but before committing all of this). - Melanie
Вложения
On Wed, Apr 2, 2025 at 3:34 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > attached v10 Attached v11 has an assortment of cosmetic updates. - Melanie
Вложения
On Wed, Apr 2, 2025 at 8:25 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Wed, Apr 2, 2025 at 3:34 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > attached v10 > > Attached v11 has an assortment of cosmetic updates. Attached v12 fixes a bug Bilal found off-list in 0002 related to handling invalid blocks. - Melanie
Вложения
On 03/04/2025 17:31, Melanie Plageman wrote: > Attached v12 fixes a bug Bilal found off-list in 0002 related to > handling invalid blocks. I had a quick look at this. Looks good overall, some small remarks: v12-0001-Autoprewarm-global-objects-separately.patch > Instead, modify apw_load_buffers() to prewarm the shared objects in one > invocation of autoprewarm_database_main() while connected to the first > valid database. So it effectively treats "global objects" as one extra database, launching a separate worker process to handle global objects. It took me a while to understand that. From the commit message, I understood that it still does that within the first worker process invocation, but no. A comment somewhere would be good. One extra worker process invocation is obviously not an improvement performance-wise, but seems acceptable. v12-0002-Refactor-autoprewarm_database_main-in-preparatio.patch Yes, I agree this makes the logic more clear v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch I wonder if the have_free_buffer() calls work correctly with read streams? Or will you "overshoot", prewarming a few more pages after the buffer cache is already full? I guess that depends on when exactly the read stream code allocates the buffer. While reviewing this, I noticed a pre-existing bug: The code ignores 'tablespace' when deciding if it's reached the end of the current relation. I believe it's possible to have two different relations with the same relnumber, in different tablespaces. -- Heikki Linnakangas Neon (https://neon.tech)
On Thu, Apr 3, 2025 at 11:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a quick look at this. Looks good overall, some small remarks: Thanks for taking a look! > v12-0001-Autoprewarm-global-objects-separately.patch > > > Instead, modify apw_load_buffers() to prewarm the shared objects in one > > invocation of autoprewarm_database_main() while connected to the first > > valid database. > > So it effectively treats "global objects" as one extra database, > launching a separate worker process to handle global objects. It took me > a while to understand that. From the commit message, I understood that > it still does that within the first worker process invocation, but no. A > comment somewhere would be good. Yea, I could have been more explicit about that. Actually, I was chatting about this with Andres off-list and he was like, why do you need to check the database at all? Won't prewarm_stop_idx already have that built in? And I think he's right. In attached v13, I've added a separate patch (0002) which turns this check into an assert. And I removed the check from all of the other loops in the later patches. > v12-0003-Use-streaming-read-I-O-in-autoprewarm.patch > > I wonder if the have_free_buffer() calls work correctly with read > streams? Or will you "overshoot", prewarming a few more pages after the > buffer cache is already full? I guess that depends on when exactly the > read stream code allocates the buffer. It does have some overshoot -- but a max of io_combine_limit blocks will be evicted. The read stream code builds up an IO of up to io_combine_limit blocks before calling StartReadBuffer(). So you could be in a situation where you weren't quite out of buffers on the freelist while you are building up the IO and then when you go to pin those buffers, there aren't enough on the freelist. But I think that's okay. > While reviewing this, I noticed a pre-existing bug: The code ignores > 'tablespace' when deciding if it's reached the end of the current > relation. I believe it's possible to have two different relations with > the same relnumber, in different tablespaces. Good catch. I've included a fix for this in the attached set (0001) - Melanie
Вложения
> On 3 Apr 2025, at 21:25, Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Apr 3, 2025 at 11:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> I had a quick look at this. Looks good overall
Same here, this seemed like a good piece to bite into with my limited AIO
knowledge to learn more, and reading it over it seems like a good change.
A few small comments:
+    * `pos` is the read stream callback's index into block_info. Because the
I'm not a fan of markdown in code comments (also in a few more places).
+                       /* Time to try and open our new found relation */
s/new found/newfound/
+   while (p->pos < apw_state->prewarm_stop_idx)
+   {
+       BlockInfoRecord blk = p->block_info[p->pos];
+
+       CHECK_FOR_INTERRUPTS();
Isn't checking inside this loop increasing the frequency of checks compared to
the current version?
+           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
Is there a non programmer-error case where this can happen?  The Assert right
after a loop around the same function seems to imply there is a race or toctou
case which if so could use a comment.
--
Daniel Gustafsson
			
		On Thu, Apr 3, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> >> I had a quick look at this. Looks good overall
>
> Same here, this seemed like a good piece to bite into with my limited AIO
> knowledge to learn more, and reading it over it seems like a good change.
Thanks for taking a look!
> A few small comments:
>
> +    * `pos` is the read stream callback's index into block_info. Because the
> I'm not a fan of markdown in code comments (also in a few more places).
Removed them. I got the idea of doing this to distinguish variable
names in comments from English words. But I can see how it is kind of
distracting -- since it is not common in the codebase.
> +                       /* Time to try and open our new found relation */
> s/new found/newfound/
Fixed
> +   while (p->pos < apw_state->prewarm_stop_idx)
> +   {
> +       BlockInfoRecord blk = p->block_info[p->pos];
> +
> +       CHECK_FOR_INTERRUPTS();
> Isn't checking inside this loop increasing the frequency of checks compared to
> the current version?
It's unclear. The current version does seem to execute the main while
loop (including the CFI) once per block -- even for blocks that it
doesn't end up reading for whatever reason. Things get murkier with
the read stream code. But I put it in the callback to keep the general
idea of doing a CFI once per block. In attached v14, I moved the CFI
to the top of the callback, outside of the loop, to make that
intention more clear.
> +           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
> Is there a non programmer-error case where this can happen?  The Assert right
> after a loop around the same function seems to imply there is a race or toctou
> case which if so could use a comment.
Yep. Good call. At some point one read stream user had this assert
because its invocation of read_stream_buffer() was interleaved with
other stuff, so it wasn't obvious that the stream would be exhausted
when it was time to end it. And the assert helped defend that
invariant against future innovation :) I think I've copy-pasta'd this
assert around for no good reason to other read stream users. I've
removed it in v14 and I should probably do a follow-on commit to
master to remove it from the other places it obviously doesn't belong
and is a confusing distraction for future readers.
- Melanie
			
		Вложения
> On 3 Apr 2025, at 22:54, Melanie Plageman <melanieplageman@gmail.com> wrote:
> On Thu, Apr 3, 2025 at 4:22 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> +   while (p->pos < apw_state->prewarm_stop_idx)
>> +   {
>> +       BlockInfoRecord blk = p->block_info[p->pos];
>> +
>> +       CHECK_FOR_INTERRUPTS();
>> Isn't checking inside this loop increasing the frequency of checks compared to
>> the current version?
>
> It's unclear. The current version does seem to execute the main while
> loop (including the CFI) once per block -- even for blocks that it
> doesn't end up reading for whatever reason. Things get murkier with
> the read stream code. But I put it in the callback to keep the general
> idea of doing a CFI once per block. In attached v14, I moved the CFI
> to the top of the callback, outside of the loop, to make that
> intention more clear.
LGTM.
>> +           Assert(read_stream_next_buffer(stream, NULL) == InvalidBuffer);
>> Is there a non programmer-error case where this can happen?  The Assert right
>> after a loop around the same function seems to imply there is a race or toctou
>> case which if so could use a comment.
>
> Yep. Good call. At some point one read stream user had this assert
> because its invocation of read_stream_buffer() was interleaved with
> other stuff, so it wasn't obvious that the stream would be exhausted
> when it was time to end it. And the assert helped defend that
> invariant against future innovation :) I think I've copy-pasta'd this
> assert around for no good reason to other read stream users. I've
> removed it in v14 and I should probably do a follow-on commit to
> master to remove it from the other places it obviously doesn't belong
> and is a confusing distraction for future readers.
Makes sense, thanks for clarifying and I agree with removing the assertion.
This patch is already marked Ready for Committer and I concur with that.
--
Daniel Gustafsson
			
		Hi, On Fri, 4 Apr 2025 at 10:59, Daniel Gustafsson <daniel@yesql.se> wrote: > > This patch is already marked Ready for Committer and I concur with that. Same on my end, v14 LGTM. -- Regards, Nazir Bilal Yavuz Microsoft
On Fri, Apr 4, 2025 at 4:17 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > Same on my end, v14 LGTM. Cool. Pushed and marked as such in the CF app. - Melanie