Обсуждение: Confine vacuum skip logic to lazy_scan_skip
Hi, I've written a patch set for vacuum to use the streaming read interface proposed in [1]. Making lazy_scan_heap() async-friendly required a bit of refactoring of lazy_scan_heap() and lazy_scan_skip(). I needed to confine all of the skipping logic -- previously spread across lazy_scan_heap() and lazy_scan_skip() -- to lazy_scan_skip(). All of the patches doing this and other preparation for vacuum to use the streaming read API can be applied on top of master. The attached patch set does this. There are a few comments that still need to be updated. I also noticed I needed to reorder and combine a couple of the commits. I wanted to register this for the january commitfest, so I didn't quite have time for the finishing touches. - Melanie [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com
Вложения
- v1-0001-lazy_scan_skip-remove-unnecessary-local-var-rel_p.patch
- v1-0002-lazy_scan_skip-remove-unneeded-local-var-nskippab.patch
- v1-0003-Add-lazy_scan_skip-unskippable-state.patch
- v1-0004-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v1-0005-VacSkipState-saves-reference-to-LVRelState.patch
- v1-0006-VacSkipState-store-next-unskippable-block-vmbuffe.patch
- v1-0007-Remove-unneeded-vacuum_delay_point-from-lazy_scan.patch
On Sun, Dec 31, 2023 at 1:28 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > There are a few comments that still need to be updated. I also noticed I > needed to reorder and combine a couple of the commits. I wanted to > register this for the january commitfest, so I didn't quite have time > for the finishing touches. I've updated this patch set to remove a commit that didn't make sense on its own and do various other cleanup. - Melanie
Вложения
- v2-0004-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v2-0003-Add-lazy_scan_skip-unskippable-state.patch
- v2-0001-lazy_scan_skip-remove-unnecessary-local-var-rel_p.patch
- v2-0002-lazy_scan_skip-remove-unneeded-local-var-nskippab.patch
- v2-0005-VacSkipState-saves-reference-to-LVRelState.patch
- v2-0006-Remove-unneeded-vacuum_delay_point-from-lazy_scan.patch
Hi,
On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
>  nskippable_blocks
I think these may lead to worse code - the compiler has to reload
vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
can't guarantee that they're not changed within one of the external functions
called in the loop body.
> Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state
> 
> Future commits will remove all skipping logic from lazy_scan_heap() and
> confine it to lazy_scan_skip(). To make those commits more clear, first
> introduce the struct, VacSkipState, which will maintain the variables
> needed to skip ranges less than SKIP_PAGES_THRESHOLD.
Why not add this to LVRelState, possibly as a struct embedded within it?
> From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sat, 30 Dec 2023 16:59:27 -0500
> Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip
> 
> In preparation for vacuum to use the streaming read interface (and eventually
> AIO), refactor vacuum's logic for skipping blocks such that it is entirely
> confined to lazy_scan_skip(). This turns lazy_scan_skip() and the VacSkipState
> it uses into an iterator which yields blocks to lazy_scan_heap(). Such a
> structure is conducive to an async interface.
And it's cleaner - I find the current code extremely hard to reason about.
> By always calling lazy_scan_skip() -- instead of only when we have reached the
> next unskippable block, we no longer need the skipping_current_range variable.
> lazy_scan_heap() no longer needs to manage the skipped range -- checking if we
> reached the end in order to then call lazy_scan_skip(). And lazy_scan_skip()
> can derive the visibility status of a block from whether or not we are in a
> skippable range -- that is, whether or not the next_block is equal to the next
> unskippable block.
I wonder if it should be renamed as part of this - the name is somewhat
confusing now (and perhaps before)? lazy_scan_get_next_block() or such?
> +    while (true)
>      {
>          Buffer        buf;
>          Page        page;
> -        bool        all_visible_according_to_vm;
>          LVPagePruneState prunestate;
>  
> -        if (blkno == vacskip.next_unskippable_block)
> -        {
> -            /*
> -             * Can't skip this page safely.  Must scan the page.  But
> -             * determine the next skippable range after the page first.
> -             */
> -            all_visible_according_to_vm = vacskip.next_unskippable_allvis;
> -            lazy_scan_skip(vacrel, &vacskip, blkno + 1);
> -
> -            Assert(vacskip.next_unskippable_block >= blkno + 1);
> -        }
> -        else
> -        {
> -            /* Last page always scanned (may need to set nonempty_pages) */
> -            Assert(blkno < rel_pages - 1);
> -
> -            if (vacskip.skipping_current_range)
> -                continue;
> +        blkno = lazy_scan_skip(vacrel, &vacskip, blkno + 1,
> +                               &all_visible_according_to_vm);
>  
> -            /* Current range is too small to skip -- just scan the page */
> -            all_visible_according_to_vm = true;
> -        }
> +        if (blkno == InvalidBlockNumber)
> +            break;
>  
>          vacrel->scanned_pages++;
>
I don't like that we still do determination about the next block outside of
lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip()
and lazy_scan_heap().
I'd probably change the interface to something like
while (lazy_scan_get_next_block(vacrel, &blkno))
{
...
}
> From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sun, 31 Dec 2023 09:47:18 -0500
> Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState
> 
> The streaming read interface can only give pgsr_next callbacks access to
> two pieces of private data. As such, move a reference to the LVRelState
> into the VacSkipState.
> 
> This is a separate commit (as opposed to as part of the commit
> introducing VacSkipState) because it is required for using the streaming
> read interface but not a natural change on its own. VacSkipState is per
> block and the LVRelState is referenced for the whole relation vacuum.
I'd do it the other way round, i.e. either embed VacSkipState ino LVRelState
or point to it from VacSkipState.
LVRelState is already tied to the iteration state, so I don't think there's a
reason not to do so.
Greetings,
Andres Freund
			
		Hi, On Fri, 5 Jan 2024 at 02:25, Jim Nasby <jim.nasby@gmail.com> wrote: > > On 1/4/24 2:23 PM, Andres Freund wrote: > > On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > nskippable_blocks > > I think these may lead to worse code - the compiler has to reload > vacrel->rel_pages/next_unskippable_block for every loop iteration, because it > can't guarantee that they're not changed within one of the external functions > called in the loop body. > > Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go awayin the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of"here's the next X number of blocks that we need to vacuum"? Sorry if I misunderstood. AFAIU, with the help of the vectored IO; "the next X number of blocks that need to be vacuumed" will be prefetched by calculating the unskippable blocks ( using the lazy_scan_skip() function ) and the X will be determined by Postgres itself. Do you have something different in your mind? -- Regards, Nazir Bilal Yavuz Microsoft
v3 attached
On Thu, Jan 4, 2024 at 3:23 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
> >  nskippable_blocks
>
> I think these may lead to worse code - the compiler has to reload
> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> can't guarantee that they're not changed within one of the external functions
> called in the loop body.
I buy that for 0001 but 0002 is still using local variables.
nskippable_blocks was just another variable to keep track of even
though we could already get that info from local variables
next_unskippable_block and next_block.
In light of this comment, I've refactored 0003/0004 (0002 and 0003 in
this version [v3]) to use local variables in the loop as well. I had
started using the members of the VacSkipState which I introduced.
> > Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state
> >
> > Future commits will remove all skipping logic from lazy_scan_heap() and
> > confine it to lazy_scan_skip(). To make those commits more clear, first
> > introduce the struct, VacSkipState, which will maintain the variables
> > needed to skip ranges less than SKIP_PAGES_THRESHOLD.
>
> Why not add this to LVRelState, possibly as a struct embedded within it?
Done in attached.
> > From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Sat, 30 Dec 2023 16:59:27 -0500
> > Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip
>
> > By always calling lazy_scan_skip() -- instead of only when we have reached the
> > next unskippable block, we no longer need the skipping_current_range variable.
> > lazy_scan_heap() no longer needs to manage the skipped range -- checking if we
> > reached the end in order to then call lazy_scan_skip(). And lazy_scan_skip()
> > can derive the visibility status of a block from whether or not we are in a
> > skippable range -- that is, whether or not the next_block is equal to the next
> > unskippable block.
>
> I wonder if it should be renamed as part of this - the name is somewhat
> confusing now (and perhaps before)? lazy_scan_get_next_block() or such?
Why stop there! I've removed lazy and called it
heap_vac_scan_get_next_block() -- a little long, but...
> > +     while (true)
> >       {
> >               Buffer          buf;
> >               Page            page;
> > -             bool            all_visible_according_to_vm;
> >               LVPagePruneState prunestate;
> >
> > -             if (blkno == vacskip.next_unskippable_block)
> > -             {
> > -                     /*
> > -                      * Can't skip this page safely.  Must scan the page.  But
> > -                      * determine the next skippable range after the page first.
> > -                      */
> > -                     all_visible_according_to_vm = vacskip.next_unskippable_allvis;
> > -                     lazy_scan_skip(vacrel, &vacskip, blkno + 1);
> > -
> > -                     Assert(vacskip.next_unskippable_block >= blkno + 1);
> > -             }
> > -             else
> > -             {
> > -                     /* Last page always scanned (may need to set nonempty_pages) */
> > -                     Assert(blkno < rel_pages - 1);
> > -
> > -                     if (vacskip.skipping_current_range)
> > -                             continue;
> > +             blkno = lazy_scan_skip(vacrel, &vacskip, blkno + 1,
> > +                                                        &all_visible_according_to_vm);
> >
> > -                     /* Current range is too small to skip -- just scan the page */
> > -                     all_visible_according_to_vm = true;
> > -             }
> > +             if (blkno == InvalidBlockNumber)
> > +                     break;
> >
> >               vacrel->scanned_pages++;
> >
>
> I don't like that we still do determination about the next block outside of
> lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip()
> and lazy_scan_heap().
>
> I'd probably change the interface to something like
>
> while (lazy_scan_get_next_block(vacrel, &blkno))
> {
> ...
> }
I've done this. I do now find the parameter names a bit confusing.
There is next_block (which is the "next block in line" and is an input
parameter) and blkno, which is an output parameter with the next block
that should actually be processed. Maybe it's okay?
> > From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001
> > From: Melanie Plageman <melanieplageman@gmail.com>
> > Date: Sun, 31 Dec 2023 09:47:18 -0500
> > Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState
> >
> > The streaming read interface can only give pgsr_next callbacks access to
> > two pieces of private data. As such, move a reference to the LVRelState
> > into the VacSkipState.
> >
> > This is a separate commit (as opposed to as part of the commit
> > introducing VacSkipState) because it is required for using the streaming
> > read interface but not a natural change on its own. VacSkipState is per
> > block and the LVRelState is referenced for the whole relation vacuum.
>
> I'd do it the other way round, i.e. either embed VacSkipState ino LVRelState
> or point to it from VacSkipState.
>
> LVRelState is already tied to the iteration state, so I don't think there's a
> reason not to do so.
Done, and, as such, this patch is dropped from the set.
- Melane
			
		Вложения
On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > > On Fri, 5 Jan 2024 at 02:25, Jim Nasby <jim.nasby@gmail.com> wrote: > > > > On 1/4/24 2:23 PM, Andres Freund wrote: > > > > On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > > > > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages > > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > > nskippable_blocks > > > > I think these may lead to worse code - the compiler has to reload > > vacrel->rel_pages/next_unskippable_block for every loop iteration, because it > > can't guarantee that they're not changed within one of the external functions > > called in the loop body. > > > > Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go awayin the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of"here's the next X number of blocks that we need to vacuum"? > > Sorry if I misunderstood. AFAIU, with the help of the vectored IO; > "the next X number of blocks that need to be vacuumed" will be > prefetched by calculating the unskippable blocks ( using the > lazy_scan_skip() function ) and the X will be determined by Postgres > itself. Do you have something different in your mind? I think you are both right. As we gain more control of readahead from within Postgres, we will likely want to revisit this heuristic as it may not serve us anymore. But the streaming read interface/vectored I/O is also not a drop-in replacement for it. To change anything and ensure there is no regression, we will probably have to do cross-platform benchmarking, though. That being said, I would absolutely love to get rid of the skippable ranges because I find them very error-prone and confusing. Hopefully now that the skipping logic is isolated to a single function, it will be easier not to trip over it when working on lazy_scan_heap(). - Melanie
On 1/11/24 5:50 PM, Melanie Plageman wrote: > On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: >> >> On Fri, 5 Jan 2024 at 02:25, Jim Nasby <jim.nasby@gmail.com> wrote: >>> >>> On 1/4/24 2:23 PM, Andres Freund wrote: >>> >>> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: >>> >>> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages >>> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var >>> nskippable_blocks >>> >>> I think these may lead to worse code - the compiler has to reload >>> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it >>> can't guarantee that they're not changed within one of the external functions >>> called in the loop body. >>> >>> Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should go awayin the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a list of"here's the next X number of blocks that we need to vacuum"? >> >> Sorry if I misunderstood. AFAIU, with the help of the vectored IO; >> "the next X number of blocks that need to be vacuumed" will be >> prefetched by calculating the unskippable blocks ( using the >> lazy_scan_skip() function ) and the X will be determined by Postgres >> itself. Do you have something different in your mind? > > I think you are both right. As we gain more control of readahead from > within Postgres, we will likely want to revisit this heuristic as it > may not serve us anymore. But the streaming read interface/vectored > I/O is also not a drop-in replacement for it. To change anything and > ensure there is no regression, we will probably have to do > cross-platform benchmarking, though. > > That being said, I would absolutely love to get rid of the skippable > ranges because I find them very error-prone and confusing. Hopefully > now that the skipping logic is isolated to a single function, it will > be easier not to trip over it when working on lazy_scan_heap(). Yeah, arguably it's just a matter of semantics, but IMO it's a lot clearer to simply think in terms of "here's the next blocks we know we want to vacuum" instead of "we vacuum everything, but sometimes we skip some blocks". -- Jim Nasby, Data Architect, Austin TX
On Fri, Jan 12, 2024 at 2:02 PM Jim Nasby <jim.nasby@gmail.com> wrote: > > On 1/11/24 5:50 PM, Melanie Plageman wrote: > > On Fri, Jan 5, 2024 at 5:51 AM Nazir Bilal Yavuz <byavuz81@gmail.com> wrote: > >> > >> On Fri, 5 Jan 2024 at 02:25, Jim Nasby <jim.nasby@gmail.com> wrote: > >>> > >>> On 1/4/24 2:23 PM, Andres Freund wrote: > >>> > >>> On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote: > >>> > >>> Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages > >>> Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var > >>> nskippable_blocks > >>> > >>> I think these may lead to worse code - the compiler has to reload > >>> vacrel->rel_pages/next_unskippable_block for every loop iteration, because it > >>> can't guarantee that they're not changed within one of the external functions > >>> called in the loop body. > >>> > >>> Admittedly I'm not up to speed on recent vacuum changes, but I have to wonder if the concept of skipping should goaway in the context of vector IO? Instead of thinking about "we can skip this range of blocks", why not maintain a listof "here's the next X number of blocks that we need to vacuum"? > >> > >> Sorry if I misunderstood. AFAIU, with the help of the vectored IO; > >> "the next X number of blocks that need to be vacuumed" will be > >> prefetched by calculating the unskippable blocks ( using the > >> lazy_scan_skip() function ) and the X will be determined by Postgres > >> itself. Do you have something different in your mind? > > > > I think you are both right. As we gain more control of readahead from > > within Postgres, we will likely want to revisit this heuristic as it > > may not serve us anymore. But the streaming read interface/vectored > > I/O is also not a drop-in replacement for it. To change anything and > > ensure there is no regression, we will probably have to do > > cross-platform benchmarking, though. > > > > That being said, I would absolutely love to get rid of the skippable > > ranges because I find them very error-prone and confusing. Hopefully > > now that the skipping logic is isolated to a single function, it will > > be easier not to trip over it when working on lazy_scan_heap(). > > Yeah, arguably it's just a matter of semantics, but IMO it's a lot > clearer to simply think in terms of "here's the next blocks we know we > want to vacuum" instead of "we vacuum everything, but sometimes we skip > some blocks". Even "we vacuum some stuff, but sometimes we skip some blocks" would be okay. What we have now is "we vacuum some stuff, but sometimes we skip some blocks, but only if we would skip enough blocks, and, when we decide to do that we can't go back and actually get visibility information for those blocks we skipped because we are too cheap" - Melanie
On Fri, 12 Jan 2024 at 05:12, Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> v3 attached
>
> On Thu, Jan 4, 2024 at 3:23 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On 2024-01-02 12:36:18 -0500, Melanie Plageman wrote:
> > > Subject: [PATCH v2 1/6] lazy_scan_skip remove unnecessary local var rel_pages
> > > Subject: [PATCH v2 2/6] lazy_scan_skip remove unneeded local var
> > >  nskippable_blocks
> >
> > I think these may lead to worse code - the compiler has to reload
> > vacrel->rel_pages/next_unskippable_block for every loop iteration, because it
> > can't guarantee that they're not changed within one of the external functions
> > called in the loop body.
>
> I buy that for 0001 but 0002 is still using local variables.
> nskippable_blocks was just another variable to keep track of even
> though we could already get that info from local variables
> next_unskippable_block and next_block.
>
> In light of this comment, I've refactored 0003/0004 (0002 and 0003 in
> this version [v3]) to use local variables in the loop as well. I had
> started using the members of the VacSkipState which I introduced.
>
> > > Subject: [PATCH v2 3/6] Add lazy_scan_skip unskippable state
> > >
> > > Future commits will remove all skipping logic from lazy_scan_heap() and
> > > confine it to lazy_scan_skip(). To make those commits more clear, first
> > > introduce the struct, VacSkipState, which will maintain the variables
> > > needed to skip ranges less than SKIP_PAGES_THRESHOLD.
> >
> > Why not add this to LVRelState, possibly as a struct embedded within it?
>
> Done in attached.
>
> > > From 335faad5948b2bec3b83c2db809bb9161d373dcb Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Sat, 30 Dec 2023 16:59:27 -0500
> > > Subject: [PATCH v2 4/6] Confine vacuum skip logic to lazy_scan_skip
> >
> > > By always calling lazy_scan_skip() -- instead of only when we have reached the
> > > next unskippable block, we no longer need the skipping_current_range variable.
> > > lazy_scan_heap() no longer needs to manage the skipped range -- checking if we
> > > reached the end in order to then call lazy_scan_skip(). And lazy_scan_skip()
> > > can derive the visibility status of a block from whether or not we are in a
> > > skippable range -- that is, whether or not the next_block is equal to the next
> > > unskippable block.
> >
> > I wonder if it should be renamed as part of this - the name is somewhat
> > confusing now (and perhaps before)? lazy_scan_get_next_block() or such?
>
> Why stop there! I've removed lazy and called it
> heap_vac_scan_get_next_block() -- a little long, but...
>
> > > +     while (true)
> > >       {
> > >               Buffer          buf;
> > >               Page            page;
> > > -             bool            all_visible_according_to_vm;
> > >               LVPagePruneState prunestate;
> > >
> > > -             if (blkno == vacskip.next_unskippable_block)
> > > -             {
> > > -                     /*
> > > -                      * Can't skip this page safely.  Must scan the page.  But
> > > -                      * determine the next skippable range after the page first.
> > > -                      */
> > > -                     all_visible_according_to_vm = vacskip.next_unskippable_allvis;
> > > -                     lazy_scan_skip(vacrel, &vacskip, blkno + 1);
> > > -
> > > -                     Assert(vacskip.next_unskippable_block >= blkno + 1);
> > > -             }
> > > -             else
> > > -             {
> > > -                     /* Last page always scanned (may need to set nonempty_pages) */
> > > -                     Assert(blkno < rel_pages - 1);
> > > -
> > > -                     if (vacskip.skipping_current_range)
> > > -                             continue;
> > > +             blkno = lazy_scan_skip(vacrel, &vacskip, blkno + 1,
> > > +                                                        &all_visible_according_to_vm);
> > >
> > > -                     /* Current range is too small to skip -- just scan the page */
> > > -                     all_visible_according_to_vm = true;
> > > -             }
> > > +             if (blkno == InvalidBlockNumber)
> > > +                     break;
> > >
> > >               vacrel->scanned_pages++;
> > >
> >
> > I don't like that we still do determination about the next block outside of
> > lazy_scan_skip() and have duplicated exit conditions between lazy_scan_skip()
> > and lazy_scan_heap().
> >
> > I'd probably change the interface to something like
> >
> > while (lazy_scan_get_next_block(vacrel, &blkno))
> > {
> > ...
> > }
>
> I've done this. I do now find the parameter names a bit confusing.
> There is next_block (which is the "next block in line" and is an input
> parameter) and blkno, which is an output parameter with the next block
> that should actually be processed. Maybe it's okay?
>
> > > From b6603e35147c4bbe3337280222e6243524b0110e Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <melanieplageman@gmail.com>
> > > Date: Sun, 31 Dec 2023 09:47:18 -0500
> > > Subject: [PATCH v2 5/6] VacSkipState saves reference to LVRelState
> > >
> > > The streaming read interface can only give pgsr_next callbacks access to
> > > two pieces of private data. As such, move a reference to the LVRelState
> > > into the VacSkipState.
> > >
> > > This is a separate commit (as opposed to as part of the commit
> > > introducing VacSkipState) because it is required for using the streaming
> > > read interface but not a natural change on its own. VacSkipState is per
> > > block and the LVRelState is referenced for the whole relation vacuum.
> >
> > I'd do it the other way round, i.e. either embed VacSkipState ino LVRelState
> > or point to it from VacSkipState.
> >
> > LVRelState is already tied to the iteration state, so I don't think there's a
> > reason not to do so.
>
> Done, and, as such, this patch is dropped from the set.
CFBot shows that the patch does not apply anymore as in [1]:
=== applying patch
./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch
patching file src/backend/access/heap/vacuumlazy.c
...
Hunk #10 FAILED at 1042.
Hunk #11 FAILED at 1121.
Hunk #12 FAILED at 1132.
Hunk #13 FAILED at 1161.
Hunk #14 FAILED at 1172.
Hunk #15 FAILED at 1194.
...
6 out of 21 hunks FAILED -- saving rejects to file
src/backend/access/heap/vacuumlazy.c.rej
Please post an updated version for the same.
[1] - http://cfbot.cputube.org/patch_46_4755.log
Regards,
Vignesh
			
		On Fri, Jan 26, 2024 at 8:28 AM vignesh C <vignesh21@gmail.com> wrote: > > CFBot shows that the patch does not apply anymore as in [1]: > === applying patch > ./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch > patching file src/backend/access/heap/vacuumlazy.c > ... > Hunk #10 FAILED at 1042. > Hunk #11 FAILED at 1121. > Hunk #12 FAILED at 1132. > Hunk #13 FAILED at 1161. > Hunk #14 FAILED at 1172. > Hunk #15 FAILED at 1194. > ... > 6 out of 21 hunks FAILED -- saving rejects to file > src/backend/access/heap/vacuumlazy.c.rej > > Please post an updated version for the same. > > [1] - http://cfbot.cputube.org/patch_46_4755.log Fixed in attached rebased v4 - Melanie
Вложения
On Mon, Jan 29, 2024 at 8:18 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Jan 26, 2024 at 8:28 AM vignesh C <vignesh21@gmail.com> wrote: > > > > CFBot shows that the patch does not apply anymore as in [1]: > > === applying patch > > ./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch > > patching file src/backend/access/heap/vacuumlazy.c > > ... > > Hunk #10 FAILED at 1042. > > Hunk #11 FAILED at 1121. > > Hunk #12 FAILED at 1132. > > Hunk #13 FAILED at 1161. > > Hunk #14 FAILED at 1172. > > Hunk #15 FAILED at 1194. > > ... > > 6 out of 21 hunks FAILED -- saving rejects to file > > src/backend/access/heap/vacuumlazy.c.rej > > > > Please post an updated version for the same. > > > > [1] - http://cfbot.cputube.org/patch_46_4755.log > > Fixed in attached rebased v4 In light of Thomas' update to the streaming read API [1], I have rebased and updated this patch set. The attached v5 has some simplifications when compared to v4 but takes largely the same approach. 0001-0004 are refactoring 0005 is the streaming read code not yet in master 0006 is the vacuum streaming read user for vacuum's first pass 0007 is the vacuum streaming read user for vacuum's second pass - Melanie [1] https://www.postgresql.org/message-id/CA%2BhUKGJtLyxcAEvLhVUhgD4fMQkOu3PDaj8Qb9SR_UsmzgsBpQ%40mail.gmail.com
Вложения
- v5-0004-Remove-unneeded-vacuum_delay_point-from-heap_vac_.patch
- v5-0001-lazy_scan_skip-remove-unneeded-local-var-nskippab.patch
- v5-0003-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v5-0005-Streaming-Read-API.patch
- v5-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch
- v5-0006-Vacuum-first-pass-uses-Streaming-Read-interface.patch
- v5-0007-Vacuum-second-pass-uses-Streaming-Read-interface.patch
On 27/02/2024 21:47, Melanie Plageman wrote: > The attached v5 has some simplifications when compared to v4 but takes > largely the same approach. > > 0001-0004 are refactoring I'm looking at just these 0001-0004 patches for now. I like those changes a lot for the sake of readablity even without any of the later patches. I made some further changes. I kept them as separate commits for easier review, see the commit messages for details. Any thoughts on those changes? I feel heap_vac_scan_get_next_block() function could use some love. Maybe just some rewording of the comments, or maybe some other refactoring; not sure. But I'm pretty happy with the function signature and how it's called. BTW, do we have tests that would fail if we botched up heap_vac_scan_get_next_block() so that it would skip pages incorrectly, for example? Not asking you to write them for this patch, but I'm just wondering. -- Heikki Linnakangas Neon (https://neon.tech)
Вложения
- v6-0001-lazy_scan_skip-remove-unneeded-local-var-nskippab.patch
- v6-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch
- v6-0003-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v6-0004-Remove-unneeded-vacuum_delay_point-from-heap_vac_.patch
- v6-0005-Remove-unused-skipping_current_range-field.patch
- v6-0006-Move-vmbuffer-back-to-a-local-varible-in-lazy_sca.patch
- v6-0007-Rename-skip_state.patch
- v6-0008-Track-current_block-in-the-skip-state.patch
- v6-0009-Comment-whitespace-cleanup.patch
On Tue, Feb 27, 2024 at 02:47:03PM -0500, Melanie Plageman wrote: > On Mon, Jan 29, 2024 at 8:18 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Fri, Jan 26, 2024 at 8:28 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > CFBot shows that the patch does not apply anymore as in [1]: > > > === applying patch > > > ./v3-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelStat.patch > > > patching file src/backend/access/heap/vacuumlazy.c > > > ... > > > Hunk #10 FAILED at 1042. > > > Hunk #11 FAILED at 1121. > > > Hunk #12 FAILED at 1132. > > > Hunk #13 FAILED at 1161. > > > Hunk #14 FAILED at 1172. > > > Hunk #15 FAILED at 1194. > > > ... > > > 6 out of 21 hunks FAILED -- saving rejects to file > > > src/backend/access/heap/vacuumlazy.c.rej > > > > > > Please post an updated version for the same. > > > > > > [1] - http://cfbot.cputube.org/patch_46_4755.log > > > > Fixed in attached rebased v4 > > In light of Thomas' update to the streaming read API [1], I have > rebased and updated this patch set. > > The attached v5 has some simplifications when compared to v4 but takes > largely the same approach. Attached is a patch set (v5a) which updates the streaming read user for vacuum to fix an issue Andrey Borodin pointed out to me off-list. Note that I started writing this email before seeing Heikki's upthread review [1], so I will respond to that in a bit. There are no changes in v5a to any of the prelim refactoring patches which Heikki reviewed in that email. I only changed the vacuum streaming read users (last two patches in the set). Back to this patch set: Andrey pointed out that it was failing to compile on windows and the reason is that I had accidentally left an undefined variable "index" in these places Assert(index > 0); ... ereport(DEBUG2, (errmsg("table \"%s\": removed %lld dead item identifiers in %u pages", vacrel->relname, (long long) index, vacuumed_pages))); See https://cirrus-ci.com/task/6312305361682432 I don't understand how this didn't warn me (or fail to compile) for an assert build on my own workstation. It seems to think "index" is a function? Anyway, thinking about what the correct assertion would be here: Assert(index > 0); Assert(vacrel->num_index_scans > 1 || (rbstate->end_idx == vacrel->lpdead_items && vacuumed_pages == vacrel->lpdead_item_pages)); I think I can just replace "index" with "rbstate->end_index". At the end of reaping, this should have the same value that index would have had. The issue with this is if pg_streaming_read_buffer_get_next() somehow never returned a valid buffer (there were no dead items), then rbstate would potentially be uninitialized. The old assertion (index > 0) would only have been true if there were some dead items, but there isn't an explicit assertion in this function that there were some dead items. Perhaps it is worth adding this? Even if we add this, perhaps it is unacceptable from a programming standpoint to use rbstate in that scope? In addition to fixing this slip-up, I have done some performance testing for streaming read vacuum. Note that these tests are for both vacuum passes (1 and 2) using streaming read. Performance results: The TL;DR of my performance results is that streaming read vacuum is faster. However there is an issue with the interaction of the streaming read code and the vacuum buffer access strategy which must be addressed. Note that "master" in the results below is actually just a commit on my branch [2] before the one adding the vacuum streaming read users. So it includes all of my refactoring of the vacuum code from the preliminary patches. I tested two vacuum "data states". Both are relatively small tables because the impact of streaming read can easily be seen even at small table sizes. DDL for both data states is at the end of the email. The first data state is a 28 MB table which has never been vacuumed and has one or two dead tuples on every block. All of the blocks have dead tuples, so all of the blocks must be vacuumed. We'll call this the "sequential" data state. The second data state is a 67 MB table which has been vacuumed and then a small percentage of the blocks (non-consecutive blocks at irregular intervals) are updated afterward. Because the visibility map has been updated and only a few blocks have dead tuples, large ranges of blocks do not need to be vacuumed. There is at least one run of blocks with dead tuples larger than 1 block but most of the blocks with dead tuples are a single block followed by many blocks with no dead tuples. We'll call this the "few" data state. I tested these data states with "master" and with streaming read vacuum with three caching options: - table data fully in shared buffers (via pg_prewarm) - table data in the kernel buffer cache but not in shared buffers - table data completely uncached I tested the OS cached and uncached caching options with both the default vacuum buffer access strategy and with BUFFER_USAGE_LIMIT 0 (which uses as many shared buffers as needed). For the streaming read vacuum, I tested with maintenance_io_concurrency 10, 100, and 1000. 10 is the current default on master. maintenance_io_concurrency is not used by vacuum on master AFAICT. maintenance_io_concurrency is used by streaming read to determine how many buffers it can pin at the same time (with the hope of combining consecutive blocks into larger IOs) and, in the case of vacuum, it is used to determine prefetch distance. In the following results, I ran vacuum at least five times and averaged the timing results. Table data cached in shared buffers =================================== Sequential data state --------------------- The only noticeable difference in performance was that streaming read vacuum took 2% longer than master (19 ms vs 18.6 ms). It was a bit more noticeable at maintenance_io_concurrency 1000 than 10. This may be resolved by a patch Thomas is working on to avoid pinning too many buffers if larger IOs cannot be created (like in a fully SB resident workload). We should revisit this when that patch is available. Few data state -------------- There was no difference in timing for any of the scenarios. Table data cached in OS buffer cache ==================================== Sequential data state --------------------- With the buffer access strategy disabled, streaming read vacuum took 11% less time regardless of maintenance_io_concurrency (26 ms vs 23 ms). With the default vacuum buffer access strategy, maintenance_io_concurrency had a large impact: Note that "mic" is maintenace_io_concurrency | data state | code | mic | time (ms) | +------------+-----------+------+-----------+ | sequential | master | NA | 99 | | sequential | streaming | 10 | 122 | | sequential | streaming | 100 | 28 | The streaming read API calculates the maximum number of pinned buffers as 4 * maintenance_io_concurrency. The default vacuum buffer access strategy ring buffer is 256 kB -- which is 32 buffers. With maintenance_io_concurrency 10, streaming read code wants to pin 40 buffers. There is likely an interaction between this and the buffer access strategy which leads to the slowdown at maintenance_io_concurrency 10. We could change the default maintenance_io_concurrency, but a better option is to take the buffer access strategy into account in the streaming read code. Few data state -------------- There was no difference in timing for any of the scenarios. Table data uncached =================== Sequential data state --------------------- When the buffer access strategy is disabled, streaming read vacuum takes 12% less time regardless of maintenance_io_concurrency (36 ms vs 41 ms). With the default buffer access strategy (ring buffer 256 kB) and maintenance_io_concurrency 10 (the default), the streaming read vacuum takes 19% more time. But if we bump maintenance_io_concurrency up to 100+, streaming read vacuum takes 64% less time: | data state | code | mic | time (ms) | +------------+-----------+------+-----------+ | sequential | master | NA | 113 | | sequential | streaming | 10 | 140 | | sequential | streaming | 100 | 41 | This is likely due to the same adverse interaction between streaming reads' max pinned buffers and the buffer access strategy ring buffer size. Few data state -------------- The buffer access strategy had no impact here, so all of these results are with the default buffer access strategy. The streaming read vacuum takes 20-25% less time than master vacuum. | data state | code | mic | time (ms) | +------------+-----------+------+-----------+ | few | master | NA | 4.5 | | few | streaming | 10 | 3.4 | | few | streaming | 100 | 3.5 | The improvement is likely due to prefetching and the one range of consecutive blocks containing dead tuples which could be merged into a larger IO. Higher maintenance_io_concurrency only helps a little probably because: 1) most the blocks to vacuum are not consecutive so we can't make bigger IOs in most cases 2) we are not vacuuming enough blocks such that we want to prefetch more than 10 blocks. This experiment should probably be redone with larger tables containing more blocks needing vacuum. At 3-4 ms, a 20% performance difference isn't really that interesting. The next step (other than the preliminary refactoring patches) is to decide how the streaming read API should use the buffer access strategy. Sequential Data State DDL: drop table if exists foo; create table foo (a int) with (autovacuum_enabled=false, fillfactor=25); insert into foo select i % 3 from generate_series(1,200000)i; update foo set a = 5 where a = 1; Few Data State DDL: drop table if exists foo; create table foo (a int) with (autovacuum_enabled=false, fillfactor=25); insert into foo select i from generate_series(2,20000)i; insert into foo select 1 from generate_series(1,200)i; insert into foo select i from generate_series(2,20000)i; insert into foo select 1 from generate_series(1,200)i; insert into foo select i from generate_series(2,200000)i; insert into foo select 1 from generate_series(1,200)i; insert into foo select i from generate_series(2,20000)i; insert into foo select 1 from generate_series(1,2000)i; insert into foo select i from generate_series(2,20000)i; insert into foo select 1 from generate_series(1,200)i; insert into foo select i from generate_series(2,200000)i; insert into foo select 1 from generate_series(1,200)i; vacuum (freeze) foo; update foo set a = 5 where a = 1; - Melanie [1] https://www.postgresql.org/message-id/1eeccf12-d5d1-4b7e-b88b-7342410129d7%40iki.fi [2] https://github.com/melanieplageman/postgres/tree/vac_pgsr
Вложения
- v5a-0001-lazy_scan_skip-remove-unneeded-local-var-nskippa.patch
- v5a-0002-Add-lazy_scan_skip-unskippable-state-to-LVRelSta.patch
- v5a-0003-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v5a-0004-Remove-unneeded-vacuum_delay_point-from-heap_vac.patch
- v5a-0005-Streaming-Read-API.patch
- v5a-0006-Vacuum-first-pass-uses-Streaming-Read-interface.patch
- v5a-0007-Vacuum-second-pass-uses-Streaming-Read-interface.patch
On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > On 27/02/2024 21:47, Melanie Plageman wrote: > > The attached v5 has some simplifications when compared to v4 but takes > > largely the same approach. > > > > 0001-0004 are refactoring > > I'm looking at just these 0001-0004 patches for now. I like those changes a > lot for the sake of readablity even without any of the later patches. Thanks! And thanks so much for the review! I've done a small performance experiment comparing a branch with all of the patches applied (your v6 0001-0009) with master. I made an 11 GB table that has 1,394,328 blocks. For setup, I vacuumed it to update the VM and made sure it was entirely in shared buffers. All of this was to make sure all of the blocks would be skipped and we spend the majority of the time spinning through the lazy_scan_heap() code. Then I ran vacuum again (the actual test). I saw vacuum go from 13 ms to 10 ms with the patches applied. I think I need to do some profiling to see if the difference is actually due to our code changes, but I thought I would share preliminary results. > I made some further changes. I kept them as separate commits for easier > review, see the commit messages for details. Any thoughts on those changes? I've given some inline feedback on most of the extra patches you added. Short answer is they all seem fine to me except I have a reservations about 0008 because of the number of blkno variables flying around. I didn't have a chance to rebase these into my existing changes today, so either I will do it tomorrow or, if you are feeling like you're on a roll and want to do it, that also works! > I feel heap_vac_scan_get_next_block() function could use some love. Maybe > just some rewording of the comments, or maybe some other refactoring; not > sure. But I'm pretty happy with the function signature and how it's called. I was wondering if we should remove the "get" and just go with heap_vac_scan_next_block(). I didn't do that originally because I didn't want to imply that the next block was literally the sequentially next block, but I think maybe I was overthinking it. Another idea is to call it heap_scan_vac_next_block() and then the order of the words is more like the table AM functions that get the next block (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to be too similar to those since this isn't a table AM callback. As for other refactoring and other rewording of comments and such, I will take a pass at this tomorrow. > BTW, do we have tests that would fail if we botched up > heap_vac_scan_get_next_block() so that it would skip pages incorrectly, for > example? Not asking you to write them for this patch, but I'm just > wondering. So, while developing this, when I messed up and skipped blocks I shouldn't, vacuum would error out with the "found xmin from before relfrozenxid" error -- which would cause random tests to fail. I know that's not a correctly failing test of this code. I think there might be some tests in the verify_heapam tests that could/do test this kind of thing but I don't remember them failing for me during development -- so I didn't spend much time looking at them. I would also sometimes get freespace or VM tests that would fail because those blocks that are incorrectly skipped were meant to be reflected in the FSM or VM in those tests. All of that is to say, perhaps we should write a more targeted test? When I was writing the code, I added logging of skipped blocks and then came up with different scenarios and ran them on master and with the patch and diffed the logs. > From b4047b941182af0643838fde056c298d5cc3ae32 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 6 Mar 2024 20:13:42 +0200 > Subject: [PATCH v6 5/9] Remove unused 'skipping_current_range' field > > --- > src/backend/access/heap/vacuumlazy.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index 65d257aab83..51391870bf3 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -217,8 +217,6 @@ typedef struct LVRelState > Buffer vmbuffer; > /* Next unskippable block's visibility status */ > bool next_unskippable_allvis; > - /* Whether or not skippable blocks should be skipped */ > - bool skipping_current_range; > } skip; > } LVRelState; > > -- > 2.39.2 > Oops! I thought I removed this. I must have forgotten > From 27e431e8dc69bbf09d831cb1cf2903d16f177d74 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 6 Mar 2024 20:58:57 +0200 > Subject: [PATCH v6 6/9] Move vmbuffer back to a local varible in > lazy_scan_heap() > > It felt confusing that we passed around the current block, 'blkno', as > an argument to lazy_scan_new_or_empty() and lazy_scan_prune(), but > 'vmbuffer' was accessed directly in the 'scan_state'. > > It was also a bit vague, when exactly 'vmbuffer' was valid. Calling > heap_vac_scan_get_next_block() set it, sometimes, to a buffer that > might or might not contain the VM bit for 'blkno'. But other > functions, like lazy_scan_prune(), assumed it to contain the correct > buffer. That was fixed up visibilitymap_pin(). But clearly it was not > "owned" by heap_vac_scan_get_next_block(), like the other 'scan_state' > fields. > > I moved it back to a local variable, like it was. Maybe there would be > even better ways to handle it, but at least this is not worse than > what we have in master currently. I'm fine with this. I did it the way I did (grouping it with the "next_unskippable_block" in the skip struct), because I think that this vmbuffer is always the buffer containing the VM bit for the next unskippable block -- which sometimes is the block returned by heap_vac_scan_get_next_block() and sometimes isn't. I agree it might be best as a local variable but perhaps we could retain the comment about it being the block of the VM containing the bit for the next unskippable block. (Honestly, the whole thing is very confusing). > From 519e26a01b6e6974f9e0edb94b00756af053f7ee Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 6 Mar 2024 20:27:57 +0200 > Subject: [PATCH v6 7/9] Rename skip_state > > I don't want to emphasize the "skipping" part. Rather, it's the state > onwed by the heap_vac_scan_get_next_block() function This makes sense to me. Skipping should be private details of vacuum's get_next_block functionality. Though the name is a bit long. Maybe we don't need the "get" and "state" parts (it is already in a struct with state in the name)? > From 6dfae936a29e2d3479273f8ab47778a596258b16 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 6 Mar 2024 21:03:19 +0200 > Subject: [PATCH v6 8/9] Track 'current_block' in the skip state > > The caller was expected to always pass last blk + 1. It's not clear if > the next_unskippable block accounting would work correctly if you > passed something else. So rather than expecting the caller to do that, > have heap_vac_scan_get_next_block() keep track of the last returned > block itself, in the 'skip' state. > > This is largely redundant with the LVRelState->blkno field. But that > one is currently only used for error reporting, so it feels best to > give heap_vac_scan_get_next_block() its own field that it owns. I understand and agree with you that relying on blkno + 1 is bad and we should make the "next_block" state keep track of the current block. Though, I now find it easy to confuse lvrelstate->get_next_block_state->current_block, lvrelstate->blkno and the local variable blkno in lazy_scan_heap(). I think it is a naming thing and not that we shouldn't have all three. I'll think more about it in the morning. > From 619556cad4aad68d1711c12b962e9002e56d8db2 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Wed, 6 Mar 2024 21:35:11 +0200 > Subject: [PATCH v6 9/9] Comment & whitespace cleanup > > I moved some of the paragraphs to inside the > heap_vac_scan_get_next_block() function. I found the explanation in > the function comment at the old place like too much detail. Someone > looking at the function signature and how to call it would not care > about all the details of what can or cannot be skipped. LGTM. Thanks again. - Melanie
On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > > I made some further changes. I kept them as separate commits for easier > > review, see the commit messages for details. Any thoughts on those changes? > > I've given some inline feedback on most of the extra patches you added. > Short answer is they all seem fine to me except I have a reservations > about 0008 because of the number of blkno variables flying around. I > didn't have a chance to rebase these into my existing changes today, so > either I will do it tomorrow or, if you are feeling like you're on a > roll and want to do it, that also works! Attached v7 contains all of the changes that you suggested plus some additional cleanups here and there. > > I feel heap_vac_scan_get_next_block() function could use some love. Maybe > > just some rewording of the comments, or maybe some other refactoring; not > > sure. But I'm pretty happy with the function signature and how it's called. I've cleaned up the comments on heap_vac_scan_next_block() in the first couple patches (not so much in the streaming read user). Let me know if it addresses your feelings or if I should look for other things I could change. I will say that now all of the variable names are *very* long. I didn't want to remove the "state" from LVRelState->next_block_state. (In fact, I kind of miss the "get". But I had to draw the line somewhere.) I think without "state" in the name, next_block sounds too much like a function. Any ideas for shortening the names of next_block_state and its members or are you fine with them? > I was wondering if we should remove the "get" and just go with > heap_vac_scan_next_block(). I didn't do that originally because I didn't > want to imply that the next block was literally the sequentially next > block, but I think maybe I was overthinking it. > > Another idea is to call it heap_scan_vac_next_block() and then the order > of the words is more like the table AM functions that get the next block > (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to > be too similar to those since this isn't a table AM callback. I've done a version of this. > > From 27e431e8dc69bbf09d831cb1cf2903d16f177d74 Mon Sep 17 00:00:00 2001 > > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > > Date: Wed, 6 Mar 2024 20:58:57 +0200 > > Subject: [PATCH v6 6/9] Move vmbuffer back to a local varible in > > lazy_scan_heap() > > > > It felt confusing that we passed around the current block, 'blkno', as > > an argument to lazy_scan_new_or_empty() and lazy_scan_prune(), but > > 'vmbuffer' was accessed directly in the 'scan_state'. > > > > It was also a bit vague, when exactly 'vmbuffer' was valid. Calling > > heap_vac_scan_get_next_block() set it, sometimes, to a buffer that > > might or might not contain the VM bit for 'blkno'. But other > > functions, like lazy_scan_prune(), assumed it to contain the correct > > buffer. That was fixed up visibilitymap_pin(). But clearly it was not > > "owned" by heap_vac_scan_get_next_block(), like the other 'scan_state' > > fields. > > > > I moved it back to a local variable, like it was. Maybe there would be > > even better ways to handle it, but at least this is not worse than > > what we have in master currently. > > I'm fine with this. I did it the way I did (grouping it with the > "next_unskippable_block" in the skip struct), because I think that this > vmbuffer is always the buffer containing the VM bit for the next > unskippable block -- which sometimes is the block returned by > heap_vac_scan_get_next_block() and sometimes isn't. > > I agree it might be best as a local variable but perhaps we could retain > the comment about it being the block of the VM containing the bit for the > next unskippable block. (Honestly, the whole thing is very confusing). In 0001-0004 I've stuck with only having the local variable vmbuffer in lazy_scan_heap(). In 0006 (introducing pass 1 vacuum streaming read user) I added a vmbuffer back to the next_block_state (while also keeping the local variable vmbuffer in lazy_scan_heap()). The vmbuffer in lazy_scan_heap() contains the block of the VM containing visi information for the next unskippable block or for the current block if its visi information happens to be in the same block of the VM as either 1) the next unskippable block or 2) the most recently processed heap block. Streaming read vacuum separates this visibility check in heap_vac_scan_next_block() from the main loop of lazy_scan_heap(), so we can't just use a local variable anymore. Now the local variable vmbuffer in lazy_scan_heap() will only already contain the block with the visi information for the to-be-processed block if it happens to be in the same VM block as the most recently processed heap block. That means potentially more VM fetches. However, by adding a vmbuffer to next_block_state, the callback may be able to avoid extra VM fetches from one invocation to the next. Note that next_block->current_block in the streaming read vacuum context is actually the prefetch block. - Melanie
Вложения
- v7-0001-lazy_scan_skip-remove-unneeded-local-var-nskippab.patch
- v7-0002-Add-lazy_scan_skip-next-block-state-to-LVRelState.patch
- v7-0003-Confine-vacuum-skip-logic-to-lazy_scan_skip.patch
- v7-0004-Remove-unneeded-vacuum_delay_point-from-heap_vac_.patch
- v7-0005-Streaming-Read-API.patch
- v7-0006-Vacuum-first-pass-uses-Streaming-Read-interface.patch
- v7-0007-Vacuum-second-pass-uses-Streaming-Read-interface.patch
On 08/03/2024 02:46, Melanie Plageman wrote:
> On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
>>> I feel heap_vac_scan_get_next_block() function could use some love. Maybe
>>> just some rewording of the comments, or maybe some other refactoring; not
>>> sure. But I'm pretty happy with the function signature and how it's called.
> 
> I've cleaned up the comments on heap_vac_scan_next_block() in the first
> couple patches (not so much in the streaming read user). Let me know if
> it addresses your feelings or if I should look for other things I could
> change.
Thanks, that is better. I think I now finally understand how the 
function works, and now I can see some more issues and refactoring 
opportunities :-).
Looking at current lazy_scan_skip() code in 'master', one thing now 
caught my eye (and it's the same with your patches):
>     *next_unskippable_allvis = true;
>     while (next_unskippable_block < rel_pages)
>     {
>         uint8        mapbits = visibilitymap_get_status(vacrel->rel,
>                                                        next_unskippable_block,
>                                                        vmbuffer);
> 
>         if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
>         {
>             Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
>             *next_unskippable_allvis = false;
>             break;
>         }
> 
>         /*
>          * Caller must scan the last page to determine whether it has tuples
>          * (caller must have the opportunity to set vacrel->nonempty_pages).
>          * This rule avoids having lazy_truncate_heap() take access-exclusive
>          * lock on rel to attempt a truncation that fails anyway, just because
>          * there are tuples on the last page (it is likely that there will be
>          * tuples on other nearby pages as well, but those can be skipped).
>          *
>          * Implement this by always treating the last block as unsafe to skip.
>          */
>         if (next_unskippable_block == rel_pages - 1)
>             break;
> 
>         /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
>         if (!vacrel->skipwithvm)
>         {
>             /* Caller shouldn't rely on all_visible_according_to_vm */
>             *next_unskippable_allvis = false;
>             break;
>         }
> 
>         /*
>          * Aggressive VACUUM caller can't skip pages just because they are
>          * all-visible.  They may still skip all-frozen pages, which can't
>          * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
>          */
>         if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
>         {
>             if (vacrel->aggressive)
>                 break;
> 
>             /*
>              * All-visible block is safe to skip in non-aggressive case.  But
>              * remember that the final range contains such a block for later.
>              */
>             skipsallvis = true;
>         }
> 
>         /* XXX: is it OK to remove this? */
>         vacuum_delay_point();
>         next_unskippable_block++;
>         nskippable_blocks++;
>     }
Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop. 
When DISABLE_PAGE_SKIPPING is set, we always return the next block and 
set *next_unskippable_allvis = false regardless of the visibility map, 
so why bother checking the visibility map at all?
Except at the very last block of the relation! If you look carefully, 
at the last block we do return *next_unskippable_allvis = true, if the 
VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong. 
Surely the intention was to pretend that none of the VM bits were set if 
DISABLE_PAGE_SKIPPING is used, also for the last block.
This was changed in commit 980ae17310:
> @@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
>  
>                 /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
>                 if (!vacrel->skipwithvm)
> +               {
> +                       /* Caller shouldn't rely on all_visible_according_to_vm */
> +                       *next_unskippable_allvis = false;
>                         break;
> +               }
Before that, *next_unskippable_allvis was set correctly according to the 
VM, even when DISABLE_PAGE_SKIPPING was used. It's not clear to me why 
that was changed. And I think setting it to 'true' would be a more 
failsafe value than 'false'. When *next_unskippable_allvis is set to 
true, the caller cannot rely on it because a concurrent modification 
could immediately clear the VM bit. But because VACUUM is the only 
process that sets VM bits, if it's set to false, the caller can assume 
that it's still not set later on.
One consequence of that is that with DISABLE_PAGE_SKIPPING, 
lazy_scan_heap() dirties all pages, even if there are no changes. The 
attached test script demonstrates that.
ISTM we should revert the above hunk, and backpatch it to v16. I'm a 
little wary because I don't understand why that change was made in the 
first place, though. I think it was just an ill-advised attempt at 
tidying up the code as part of the larger commit, but I'm not sure. 
Peter, do you remember?
I wonder if we should give up trying to set all_visible_according_to_vm 
correctly when we decide what to skip, and always do 
"all_visible_according_to_vm = visibilitymap_get_status(...)" in 
lazy_scan_prune(). It would be more expensive, but maybe it doesn't 
matter in practice. It would get rid of this tricky bookkeeping in 
heap_vac_scan_next_block().
-- 
Heikki Linnakangas
Neon (https://neon.tech)
			
		Вложения
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > ISTM we should revert the above hunk, and backpatch it to v16. I'm a > little wary because I don't understand why that change was made in the > first place, though. I think it was just an ill-advised attempt at > tidying up the code as part of the larger commit, but I'm not sure. > Peter, do you remember? I think that it makes sense to set the VM when indicated by lazy_scan_prune, independent of what either the visibility map or the page's PD_ALL_VISIBLE marking say. The whole point of DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all. In retrospect I didn't handle this particular aspect very well in commit 980ae17310. The approach I took is a bit crude (and in any case slightly wrong in that it is inconsistent in how it handles the last page). But it has the merit of fixing the case where we just have the VM's all-frozen bit set for a given block (not the all-visible bit set) -- which is always wrong. There was good reason to be concerned about that possibility when 980ae17310 went in. -- Peter Geoghegan
On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/03/2024 02:46, Melanie Plageman wrote:
> > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> >>> I feel heap_vac_scan_get_next_block() function could use some love. Maybe
> >>> just some rewording of the comments, or maybe some other refactoring; not
> >>> sure. But I'm pretty happy with the function signature and how it's called.
> >
> > I've cleaned up the comments on heap_vac_scan_next_block() in the first
> > couple patches (not so much in the streaming read user). Let me know if
> > it addresses your feelings or if I should look for other things I could
> > change.
>
> Thanks, that is better. I think I now finally understand how the
> function works, and now I can see some more issues and refactoring
> opportunities :-).
>
> Looking at current lazy_scan_skip() code in 'master', one thing now
> caught my eye (and it's the same with your patches):
>
> >       *next_unskippable_allvis = true;
> >       while (next_unskippable_block < rel_pages)
> >       {
> >               uint8           mapbits = visibilitymap_get_status(vacrel->rel,
> >
next_unskippable_block,
> >                                                                                                          vmbuffer);
> >
> >               if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> >               {
> >                       Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
> >                       *next_unskippable_allvis = false;
> >                       break;
> >               }
> >
> >               /*
> >                * Caller must scan the last page to determine whether it has tuples
> >                * (caller must have the opportunity to set vacrel->nonempty_pages).
> >                * This rule avoids having lazy_truncate_heap() take access-exclusive
> >                * lock on rel to attempt a truncation that fails anyway, just because
> >                * there are tuples on the last page (it is likely that there will be
> >                * tuples on other nearby pages as well, but those can be skipped).
> >                *
> >                * Implement this by always treating the last block as unsafe to skip.
> >                */
> >               if (next_unskippable_block == rel_pages - 1)
> >                       break;
> >
> >               /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
> >               if (!vacrel->skipwithvm)
> >               {
> >                       /* Caller shouldn't rely on all_visible_according_to_vm */
> >                       *next_unskippable_allvis = false;
> >                       break;
> >               }
> >
> >               /*
> >                * Aggressive VACUUM caller can't skip pages just because they are
> >                * all-visible.  They may still skip all-frozen pages, which can't
> >                * contain XIDs < OldestXmin (XIDs that aren't already frozen by now).
> >                */
> >               if ((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0)
> >               {
> >                       if (vacrel->aggressive)
> >                               break;
> >
> >                       /*
> >                        * All-visible block is safe to skip in non-aggressive case.  But
> >                        * remember that the final range contains such a block for later.
> >                        */
> >                       skipsallvis = true;
> >               }
> >
> >               /* XXX: is it OK to remove this? */
> >               vacuum_delay_point();
> >               next_unskippable_block++;
> >               nskippable_blocks++;
> >       }
>
> Firstly, it seems silly to check DISABLE_PAGE_SKIPPING within the loop.
> When DISABLE_PAGE_SKIPPING is set, we always return the next block and
> set *next_unskippable_allvis = false regardless of the visibility map,
> so why bother checking the visibility map at all?
>
> Except at the very last block of the relation! If you look carefully,
> at the last block we do return *next_unskippable_allvis = true, if the
> VM says so, even if DISABLE_PAGE_SKIPPING is set. I think that's wrong.
> Surely the intention was to pretend that none of the VM bits were set if
> DISABLE_PAGE_SKIPPING is used, also for the last block.
I agree that having next_unskippable_allvis and, as a consequence,
all_visible_according_to_vm set to true for the last block seems
wrong. And It makes sense from a loop efficiency standpoint also to
move it up to the top. However, making that change would have us end
up dirtying all pages in your example.
> This was changed in commit 980ae17310:
>
> > @@ -1311,7 +1327,11 @@ lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
> >
> >                 /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
> >                 if (!vacrel->skipwithvm)
> > +               {
> > +                       /* Caller shouldn't rely on all_visible_according_to_vm */
> > +                       *next_unskippable_allvis = false;
> >                         break;
> > +               }
>
> Before that, *next_unskippable_allvis was set correctly according to the
> VM, even when DISABLE_PAGE_SKIPPING was used. It's not clear to me why
> that was changed. And I think setting it to 'true' would be a more
> failsafe value than 'false'. When *next_unskippable_allvis is set to
> true, the caller cannot rely on it because a concurrent modification
> could immediately clear the VM bit. But because VACUUM is the only
> process that sets VM bits, if it's set to false, the caller can assume
> that it's still not set later on.
>
> One consequence of that is that with DISABLE_PAGE_SKIPPING,
> lazy_scan_heap() dirties all pages, even if there are no changes. The
> attached test script demonstrates that.
This does seem undesirable.
However, if we do as you suggest above and don't check
DISABLE_PAGE_SKIPPING in the loop and instead return without checking
the VM when DISABLE_PAGE_SKIPPING is passed, setting
next_unskippable_allvis = false, we will end up dirtying all pages as
in your example. It would fix the last block issue but it would result
in dirtying all pages in your example.
> ISTM we should revert the above hunk, and backpatch it to v16. I'm a
> little wary because I don't understand why that change was made in the
> first place, though. I think it was just an ill-advised attempt at
> tidying up the code as part of the larger commit, but I'm not sure.
> Peter, do you remember?
If we revert this, then the when all_visible_according_to_vm and
all_visible are true in lazy_scan_prune(), the VM will only get
updated when all_frozen is true and the VM doesn't have all frozen set
yet, so maybe that is inconsistent with the goal of
DISABLE_PAGE_SKIPPING to update the VM when its contents are "suspect"
(according to docs).
> I wonder if we should give up trying to set all_visible_according_to_vm
> correctly when we decide what to skip, and always do
> "all_visible_according_to_vm = visibilitymap_get_status(...)" in
> lazy_scan_prune(). It would be more expensive, but maybe it doesn't
> matter in practice. It would get rid of this tricky bookkeeping in
> heap_vac_scan_next_block().
I did some experiments on this in the past and thought that it did
have a perf impact to call visibilitymap_get_status() every time. But
let me try and dig those up. (doesn't speak to whether or not in
matters in practice)
- Melanie
			
		On Fri, Mar 8, 2024 at 10:41 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 8:49 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > ISTM we should revert the above hunk, and backpatch it to v16. I'm a > > little wary because I don't understand why that change was made in the > > first place, though. I think it was just an ill-advised attempt at > > tidying up the code as part of the larger commit, but I'm not sure. > > Peter, do you remember? > > I think that it makes sense to set the VM when indicated by > lazy_scan_prune, independent of what either the visibility map or the > page's PD_ALL_VISIBLE marking say. The whole point of > DISABLE_PAGE_SKIPPING is to deal with VM corruption, after all. Not that it will be fun to maintain another special case in the VM update code in lazy_scan_prune(), but we could have a special case that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if all_visible_according_to_vm is true and all_visible is true, we update the VM but don't dirty the page. The docs on DISABLE_PAGE_SKIPPING say it is meant to deal with VM corruption -- it doesn't say anything about dealing with incorrectly set PD_ALL_VISIBLE markings. - Melanie
On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > Not that it will be fun to maintain another special case in the VM > update code in lazy_scan_prune(), but we could have a special case > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > all_visible_according_to_vm is true and all_visible is true, we update > the VM but don't dirty the page. It wouldn't necessarily have to be a special case, I think. We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in the block where lazy_scan_prune marks a previously all-visible page all-frozen -- we don't want to dirty the page unnecessarily there. Making it conditional is defensive in that particular block (this was also added by this same commit of mine), and avoids dirtying the page. Seems like it might be possible to simplify/consolidate the VM-setting code that's now located at the end of lazy_scan_prune. Perhaps the two distinct blocks that call visibilitymap_set() could be combined into one. -- Peter Geoghegan
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > Seems like it might be possible to simplify/consolidate the VM-setting > code that's now located at the end of lazy_scan_prune. Perhaps the two > distinct blocks that call visibilitymap_set() could be combined into > one. FWIW I think that my error here might have had something to do with hallucinating that the code already did things that way. At the time this went in, I was working on a patchset that did things this way (more or less). It broke the dependency on all_visible_according_to_vm entirely, which simplified the set-and-check-VM code that's now at the end of lazy_scan_prune. Not sure how practical it'd be to do something like that now (not offhand), but something to consider. -- Peter Geoghegan
On 08/03/2024 02:46, Melanie Plageman wrote: > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: >> On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: > I will say that now all of the variable names are *very* long. I didn't > want to remove the "state" from LVRelState->next_block_state. (In fact, I > kind of miss the "get". But I had to draw the line somewhere.) I think > without "state" in the name, next_block sounds too much like a function. > > Any ideas for shortening the names of next_block_state and its members > or are you fine with them? Hmm, we can remove the inner struct and add the fields directly into LVRelState. LVRelState already contains many groups of variables, like "Error reporting state", with no inner structs. I did it that way in the attached patch. I also used local variables more. >> I was wondering if we should remove the "get" and just go with >> heap_vac_scan_next_block(). I didn't do that originally because I didn't >> want to imply that the next block was literally the sequentially next >> block, but I think maybe I was overthinking it. >> >> Another idea is to call it heap_scan_vac_next_block() and then the order >> of the words is more like the table AM functions that get the next block >> (e.g. heapam_scan_bitmap_next_block()). Though maybe we don't want it to >> be too similar to those since this isn't a table AM callback. > > I've done a version of this. +1 > However, by adding a vmbuffer to next_block_state, the callback may be > able to avoid extra VM fetches from one invocation to the next. That's a good idea, holding separate VM buffer pins for the next-unskippable block and the block we're processing. I adopted that approach. My compiler caught one small bug when I was playing with various refactorings of this: heap_vac_scan_next_block() must set *blkno to rel_pages, not InvalidBlockNumber, after the last block. The caller uses the 'blkno' variable also after the loop, and assumes that it's set to rel_pages. I'm pretty happy with the attached patches now. The first one fixes the existing bug I mentioned in the other email (based on the on-going discussion that might not how we want to fix it though). Second commit is a squash of most of the patches. Third patch is the removal of the delay point, that seems worthwhile to keep separate. -- Heikki Linnakangas Neon (https://neon.tech)
Вложения
On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > > On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Not that it will be fun to maintain another special case in the VM > > update code in lazy_scan_prune(), but we could have a special case > > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > > all_visible_according_to_vm is true and all_visible is true, we update > > the VM but don't dirty the page. > > It wouldn't necessarily have to be a special case, I think. > > We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in > the block where lazy_scan_prune marks a previously all-visible page > all-frozen -- we don't want to dirty the page unnecessarily there. > Making it conditional is defensive in that particular block (this was > also added by this same commit of mine), and avoids dirtying the page. Ah, I see. I got confused. Even if the VM is suspect, if the page is all visible and the heap block is already set all-visible in the VM, there is no need to update it. This did make me realize that it seems like there is a case we don't handle in master with the current code that would be fixed by changing that code Heikki mentioned: Right now, even if the heap block is incorrectly marked all-visible in the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum, all_visible_according_to_vm will be passed to lazy_scan_prune() as false. Then even if lazy_scan_prune() finds that the page is not all-visible, we won't call visibilitymap_clear(). If we revert the code setting next_unskippable_allvis to false in lazy_scan_skip() when vacrel->skipwithvm is false and allow all_visible_according_to_vm to be true when the VM has it incorrectly set to true, then once lazy_scan_prune() discovers the page is not all-visible and assuming PD_ALL_VISIBLE is not marked so PageIsAllVisible() returns false, we will call visibilitymap_clear() to clear the incorrectly set VM bit (without dirtying the page). Here is a table of the variable states at the end of lazy_scan_prune() for clarity: master: all_visible_according_to_vm: false all_visible: false VM says all vis: true PageIsAllVisible: false if fixed: all_visible_according_to_vm: true all_visible: false VM says all vis: true PageIsAllVisible: false > Seems like it might be possible to simplify/consolidate the VM-setting > code that's now located at the end of lazy_scan_prune. Perhaps the two > distinct blocks that call visibilitymap_set() could be combined into > one. I agree. I have some code to do that in an unproposed patch which combines the VM updates into the prune record. We will definitely want to reorganize the code when we do that record combining. - Melanie
On Fri, Mar 8, 2024 at 11:31 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Fri, Mar 8, 2024 at 11:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > > > > On Fri, Mar 8, 2024 at 10:48 AM Melanie Plageman > > <melanieplageman@gmail.com> wrote: > > > Not that it will be fun to maintain another special case in the VM > > > update code in lazy_scan_prune(), but we could have a special case > > > that checks if DISABLE_PAGE_SKIPPING was passed to vacuum and if > > > all_visible_according_to_vm is true and all_visible is true, we update > > > the VM but don't dirty the page. > > > > It wouldn't necessarily have to be a special case, I think. > > > > We already conditionally set PD_ALL_VISIBLE/call PageIsAllVisible() in > > the block where lazy_scan_prune marks a previously all-visible page > > all-frozen -- we don't want to dirty the page unnecessarily there. > > Making it conditional is defensive in that particular block (this was > > also added by this same commit of mine), and avoids dirtying the page. > > Ah, I see. I got confused. Even if the VM is suspect, if the page is > all visible and the heap block is already set all-visible in the VM, > there is no need to update it. > > This did make me realize that it seems like there is a case we don't > handle in master with the current code that would be fixed by changing > that code Heikki mentioned: > > Right now, even if the heap block is incorrectly marked all-visible in > the VM, if DISABLE_PAGE_SKIPPING is passed to vacuum, > all_visible_according_to_vm will be passed to lazy_scan_prune() as > false. Then even if lazy_scan_prune() finds that the page is not > all-visible, we won't call visibilitymap_clear(). > > If we revert the code setting next_unskippable_allvis to false in > lazy_scan_skip() when vacrel->skipwithvm is false and allow > all_visible_according_to_vm to be true when the VM has it incorrectly > set to true, then once lazy_scan_prune() discovers the page is not > all-visible and assuming PD_ALL_VISIBLE is not marked so > PageIsAllVisible() returns false, we will call visibilitymap_clear() > to clear the incorrectly set VM bit (without dirtying the page). > > Here is a table of the variable states at the end of lazy_scan_prune() > for clarity: > > master: > all_visible_according_to_vm: false > all_visible: false > VM says all vis: true > PageIsAllVisible: false > > if fixed: > all_visible_according_to_vm: true > all_visible: false > VM says all vis: true > PageIsAllVisible: false Okay, I now see from Heikki's v8-0001 that he was already aware of this. - Melanie
On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:
> On 08/03/2024 02:46, Melanie Plageman wrote:
> > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> > I will say that now all of the variable names are *very* long. I didn't
> > want to remove the "state" from LVRelState->next_block_state. (In fact, I
> > kind of miss the "get". But I had to draw the line somewhere.) I think
> > without "state" in the name, next_block sounds too much like a function.
> > 
> > Any ideas for shortening the names of next_block_state and its members
> > or are you fine with them?
> 
> Hmm, we can remove the inner struct and add the fields directly into
> LVRelState. LVRelState already contains many groups of variables, like
> "Error reporting state", with no inner structs. I did it that way in the
> attached patch. I also used local variables more.
+1; I like the result of this.
> > However, by adding a vmbuffer to next_block_state, the callback may be
> > able to avoid extra VM fetches from one invocation to the next.
> 
> That's a good idea, holding separate VM buffer pins for the next-unskippable
> block and the block we're processing. I adopted that approach.
Cool. It can't be avoided with streaming read vacuum, but I wonder if
there would ever be adverse effects to doing it on master? Maybe if we
are doing a lot of skipping and the block of the VM for the heap blocks
we are processing ends up changing each time but we would have had the
right block of the VM if we used the one from
heap_vac_scan_next_block()?
Frankly, I'm in favor of just doing it now because it makes
lazy_scan_heap() less confusing.
> My compiler caught one small bug when I was playing with various
> refactorings of this: heap_vac_scan_next_block() must set *blkno to
> rel_pages, not InvalidBlockNumber, after the last block. The caller uses the
> 'blkno' variable also after the loop, and assumes that it's set to
> rel_pages.
Oops! Thanks for catching that.
> I'm pretty happy with the attached patches now. The first one fixes the
> existing bug I mentioned in the other email (based on the on-going
> discussion that might not how we want to fix it though).
ISTM we should still do the fix you mentioned -- seems like it has more
upsides than downsides?
> From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Fri, 8 Mar 2024 16:00:22 +0200
> Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with
>  DISABLE_PAGE_SKIPPING
> 
> It's important for 'all_visible_according_to_vm' to correctly reflect
> whether the VM bit is set or not, even when we are not trusting the VM
> to skip pages, because contrary to what the comment said,
> lazy_scan_prune() relies on it.
> 
> If it's incorrectly set to 'false', when the VM bit is in fact set,
> lazy_scan_prune() will try to set the VM bit again and dirty the page
> unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all
> heap pages were dirtied, even if there were no changes. We would also
> fail to clear any VM bits that were set incorrectly.
> 
> This was broken in commit 980ae17310, so backpatch to v16.
LGTM.
> From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001
> From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
> Date: Fri, 8 Mar 2024 17:32:19 +0200
> Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip()
> ---
>  src/backend/access/heap/vacuumlazy.c | 256 +++++++++++++++------------
>  1 file changed, 141 insertions(+), 115 deletions(-)
> 
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index ac55ebd2ae5..0aa08762015 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -204,6 +204,12 @@ typedef struct LVRelState
>      int64        live_tuples;    /* # live tuples remaining */
>      int64        recently_dead_tuples;    /* # dead, but not yet removable */
>      int64        missed_dead_tuples; /* # removable, but not removed */
Perhaps we should add a comment to the blkno member of LVRelState
indicating that it is used for error reporting and logging?
> +    /* State maintained by heap_vac_scan_next_block() */
> +    BlockNumber current_block;    /* last block returned */
> +    BlockNumber next_unskippable_block; /* next unskippable block */
> +    bool        next_unskippable_allvis;    /* its visibility status */
> +    Buffer        next_unskippable_vmbuffer;    /* buffer containing its VM bit */
>  } LVRelState;
>  /*
> -static BlockNumber
> -lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
> -               bool *next_unskippable_allvis, bool *skipping_current_range)
> +static bool
> +heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
> +                         bool *all_visible_according_to_vm)
>  {
> -    BlockNumber rel_pages = vacrel->rel_pages,
> -                next_unskippable_block = next_block,
> -                nskippable_blocks = 0;
> +    BlockNumber next_block;
>      bool        skipsallvis = false;
> +    BlockNumber rel_pages = vacrel->rel_pages;
> +    BlockNumber next_unskippable_block;
> +    bool        next_unskippable_allvis;
> +    Buffer        next_unskippable_vmbuffer;
>  
> -    *next_unskippable_allvis = true;
> -    while (next_unskippable_block < rel_pages)
> -    {
> -        uint8        mapbits = visibilitymap_get_status(vacrel->rel,
> -                                                       next_unskippable_block,
> -                                                       vmbuffer);
> +    /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
> +    next_block = vacrel->current_block + 1;
>  
> -        if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> +    /* Have we reached the end of the relation? */
> +    if (next_block >= rel_pages)
> +    {
> +        if (BufferIsValid(vacrel->next_unskippable_vmbuffer))
>          {
> -            Assert((mapbits & VISIBILITYMAP_ALL_FROZEN) == 0);
> -            *next_unskippable_allvis = false;
> -            break;
> +            ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
> +            vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>          }
Good catch here. Also, I noticed that I set current_block to
InvalidBlockNumber too which seems strictly worse than leaving it as
rel_pages + 1 -- just in case a future dev makes a change that
accidentally causes heap_vac_scan_next_block() to be called again and
adding InvalidBlockNumber + 1 would end up going back to 0. So this all
looks correct to me.
> +        *blkno = rel_pages;
> +        return false;
> +    }
> +    next_unskippable_block = vacrel->next_unskippable_block;
> +    next_unskippable_allvis = vacrel->next_unskippable_allvis;
Wishe there was a newline here.
I see why you removed my treatise-level comment that was here about
unskipped skippable blocks. However, when I was trying to understand
this code, I did wish there was some comment that explained to me why we
needed all of the variables next_unskippable_block,
next_unskippable_allvis, all_visible_according_to_vm, and current_block.
The idea that we would choose not to skip a skippable block because of
kernel readahead makes sense. The part that I had trouble wrapping my
head around was that we want to also keep the visibility status of both
the beginning and ending blocks of the skippable range and then use
those to infer the visibility status of the intervening blocks without
another VM lookup if we decide not to skip them.
> +    if (next_unskippable_block == InvalidBlockNumber ||
> +        next_block > next_unskippable_block)
> +    {
>          /*
> -         * Caller must scan the last page to determine whether it has tuples
> -         * (caller must have the opportunity to set vacrel->nonempty_pages).
> -         * This rule avoids having lazy_truncate_heap() take access-exclusive
> -         * lock on rel to attempt a truncation that fails anyway, just because
> -         * there are tuples on the last page (it is likely that there will be
> -         * tuples on other nearby pages as well, but those can be skipped).
> -         *
> -         * Implement this by always treating the last block as unsafe to skip.
> +         * Find the next unskippable block using the visibility map.
>           */
> -        if (next_unskippable_block == rel_pages - 1)
> -            break;
> +        next_unskippable_block = next_block;
> +        next_unskippable_vmbuffer = vacrel->next_unskippable_vmbuffer;
> +        for (;;)
Ah yes, my old loop condition was redundant with the break if
next_unskippable_block == rel_pages - 1. This is better
> +        {
> +            uint8        mapbits = visibilitymap_get_status(vacrel->rel,
> +                                                           next_unskippable_block,
> +                                                           &next_unskippable_vmbuffer);
>  
> -        /* DISABLE_PAGE_SKIPPING makes all skipping unsafe */
> -        if (!vacrel->skipwithvm)
...
> +            }
> +
> +            vacuum_delay_point();
> +            next_unskippable_block++;
>          }
Would love a newline here
> +        /* write the local variables back to vacrel */
> +        vacrel->next_unskippable_block = next_unskippable_block;
> +        vacrel->next_unskippable_allvis = next_unskippable_allvis;
> +        vacrel->next_unskippable_vmbuffer = next_unskippable_vmbuffer;
>  
...
> -    if (nskippable_blocks < SKIP_PAGES_THRESHOLD)
> -        *skipping_current_range = false;
> +    if (next_block == next_unskippable_block)
> +        *all_visible_according_to_vm = next_unskippable_allvis;
>      else
> -    {
> -        *skipping_current_range = true;
> -        if (skipsallvis)
> -            vacrel->skippedallvis = true;
> -    }
> -
> -    return next_unskippable_block;
> +        *all_visible_according_to_vm = true;
Also a newline here
> +    *blkno = vacrel->current_block = next_block;
> +    return true;
>  }
>  
> From 941ae7522ab6ac24ca5981303e4e7f6e2cba7458 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <melanieplageman@gmail.com>
> Date: Sun, 31 Dec 2023 12:49:56 -0500
> Subject: [PATCH v8 3/3] Remove unneeded vacuum_delay_point from
>  heap_vac_scan_get_next_block
> 
> heap_vac_scan_get_next_block() does relatively little work, so there is
> no need to call vacuum_delay_point(). A future commit will call
> heap_vac_scan_get_next_block() from a callback, and we would like to
> avoid calling vacuum_delay_point() in that callback.
> ---
>  src/backend/access/heap/vacuumlazy.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
> index 0aa08762015..e1657ef4f9b 100644
> --- a/src/backend/access/heap/vacuumlazy.c
> +++ b/src/backend/access/heap/vacuumlazy.c
> @@ -1172,7 +1172,6 @@ heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>                  skipsallvis = true;
>              }
>  
> -            vacuum_delay_point();
>              next_unskippable_block++;
>          }
>          /* write the local variables back to vacrel */
> -- 
> 2.39.2
> 
LGTM
- Melanie
			
		On Fri, Mar 8, 2024 at 12:34 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote:
> > On 08/03/2024 02:46, Melanie Plageman wrote:
> > > On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote:
> > > > On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote:
> > > I will say that now all of the variable names are *very* long. I didn't
> > > want to remove the "state" from LVRelState->next_block_state. (In fact, I
> > > kind of miss the "get". But I had to draw the line somewhere.) I think
> > > without "state" in the name, next_block sounds too much like a function.
> > >
> > > Any ideas for shortening the names of next_block_state and its members
> > > or are you fine with them?
> >
> > Hmm, we can remove the inner struct and add the fields directly into
> > LVRelState. LVRelState already contains many groups of variables, like
> > "Error reporting state", with no inner structs. I did it that way in the
> > attached patch. I also used local variables more.
>
> +1; I like the result of this.
I did some perf testing of 0002 and 0003 using that fully-in-SB vacuum
test I mentioned in an earlier email. 0002 is a vacuum time reduction
from an average of 11.5 ms on master to 9.6 ms with 0002 applied. And
0003 reduces the time vacuum takes from 11.5 ms on master to 7.4 ms
with 0003 applied.
I profiled them and 0002 seems to simply spend less time in
heap_vac_scan_next_block() than master did in lazy_scan_skip().
And 0003 reduces the time vacuum takes because vacuum_delay_point()
shows up pretty high in the profile.
Here are the profiles for my test.
profile of master:
+   29.79%  postgres  postgres           [.] visibilitymap_get_status
+   27.35%  postgres  postgres           [.] vacuum_delay_point
+   17.00%  postgres  postgres           [.] lazy_scan_skip
+    6.59%  postgres  postgres           [.] heap_vacuum_rel
+    6.43%  postgres  postgres           [.] BufferGetBlockNumber
profile with 0001-0002:
+   40.30%  postgres  postgres           [.] visibilitymap_get_status
+   20.32%  postgres  postgres           [.] vacuum_delay_point
+   20.26%  postgres  postgres           [.] heap_vacuum_rel
+    5.17%  postgres  postgres           [.] BufferGetBlockNumber
profile with 0001-0003
+   59.77%  postgres  postgres           [.] visibilitymap_get_status
+   23.86%  postgres  postgres           [.] heap_vacuum_rel
+    6.59%  postgres  postgres           [.] StrategyGetBuffer
Test DDL and setup:
psql -c "ALTER SYSTEM SET shared_buffers = '16 GB';"
psql -c "CREATE TABLE foo(id INT, a INT, b INT, c INT, d INT, e INT, f
INT, g INT) with (autovacuum_enabled=false, fillfactor=25);"
psql -c "INSERT INTO foo SELECT i, i, i, i, i, i, i, i FROM
generate_series(1, 46000000)i;"
psql -c "VACUUM (FREEZE) foo;"
pg_ctl restart
psql -c "SELECT pg_prewarm('foo');"
# make sure there isn't an ill-timed checkpoint
psql -c "\timing on" -c "vacuum (verbose) foo;"
- Melanie
			
		On Wed, Mar 6, 2024 at 6:47 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Performance results: > > The TL;DR of my performance results is that streaming read vacuum is > faster. However there is an issue with the interaction of the streaming > read code and the vacuum buffer access strategy which must be addressed. I have investigated the interaction between maintenance_io_concurrency, streaming reads, and the vacuum buffer access strategy (BAS_VACUUM). The streaming read API limits max_pinned_buffers to a pinned buffer multiplier (currently 4) * maintenance_io_concurrency buffers with the goal of constructing reads of at least MAX_BUFFERS_PER_TRANSFER size. Since the BAS_VACUUM ring buffer is size 256 kB or 32 buffers with default block size, that means that for a fully uncached vacuum in which all blocks must be vacuumed and will be dirtied, you'd have to set maintenance_io_concurrency at 8 or lower to see the same number of reuses (and shared buffer consumption) as master. Given that we allow users to specify BUFFER_USAGE_LIMIT to vacuum, it seems like we should force max_pinned_buffers to a value that guarantees the expected shared buffer usage by vacuum. But that means that maintenance_io_concurrency does not have a predictable impact on streaming read vacuum. What is the right thing to do here? At the least, the default size of the BAS_VACUUM ring buffer should be BLCKSZ * pinned_buffer_multiplier * default maintenance_io_concurrency (probably rounded up to the next power of two) bytes. - Melanie
On Mon, Mar 11, 2024 at 5:31 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Wed, Mar 6, 2024 at 6:47 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > Performance results: > > > > The TL;DR of my performance results is that streaming read vacuum is > > faster. However there is an issue with the interaction of the streaming > > read code and the vacuum buffer access strategy which must be addressed. Woo. > I have investigated the interaction between > maintenance_io_concurrency, streaming reads, and the vacuum buffer > access strategy (BAS_VACUUM). > > The streaming read API limits max_pinned_buffers to a pinned buffer > multiplier (currently 4) * maintenance_io_concurrency buffers with the > goal of constructing reads of at least MAX_BUFFERS_PER_TRANSFER size. > > Since the BAS_VACUUM ring buffer is size 256 kB or 32 buffers with > default block size, that means that for a fully uncached vacuum in > which all blocks must be vacuumed and will be dirtied, you'd have to > set maintenance_io_concurrency at 8 or lower to see the same number of > reuses (and shared buffer consumption) as master. > > Given that we allow users to specify BUFFER_USAGE_LIMIT to vacuum, it > seems like we should force max_pinned_buffers to a value that > guarantees the expected shared buffer usage by vacuum. But that means > that maintenance_io_concurrency does not have a predictable impact on > streaming read vacuum. > > What is the right thing to do here? > > At the least, the default size of the BAS_VACUUM ring buffer should be > BLCKSZ * pinned_buffer_multiplier * default maintenance_io_concurrency > (probably rounded up to the next power of two) bytes. Hmm, does the v6 look-ahead distance control algorithm mitigate that problem? Using the ABC classifications from the streaming read thread, I think for A it should now pin only 1, for B 16 and for C, it depends on the size of the random 'chunks': if you have a lot of size 1 random reads then it shouldn't go above 10 because of (default) maintenance_io_concurrency. The only way to get up to very high numbers would be to have a lot of random chunks triggering behaviour C, but each made up of long runs of misses. For example one can contrive a BHS query that happens to read pages 0-15 then 20-35 then 40-55 etc etc so that we want to get lots of wide I/Os running concurrently. Unless vacuum manages to do something like that, it shouldn't be able to exceed 32 buffers very easily. I suspect that if we taught streaming_read.c to ask the BufferAccessStrategy (if one is passed in) what its recommended pin limit is (strategy->nbuffers?), we could just clamp max_pinned_buffers, and it would be hard to find a workload where that makes a difference, and we could think about more complicated logic later. In other words, I think/hope your complaints about excessive pinning from v5 WRT all-cached heap scans might have also already improved this case by happy coincidence? I haven't tried it out though, I just read your description of the problem...
On 08/03/2024 19:34, Melanie Plageman wrote: > On Fri, Mar 08, 2024 at 06:07:33PM +0200, Heikki Linnakangas wrote: >> On 08/03/2024 02:46, Melanie Plageman wrote: >>> On Wed, Mar 06, 2024 at 10:00:23PM -0500, Melanie Plageman wrote: >>>> On Wed, Mar 06, 2024 at 09:55:21PM +0200, Heikki Linnakangas wrote: >>> However, by adding a vmbuffer to next_block_state, the callback may be >>> able to avoid extra VM fetches from one invocation to the next. >> >> That's a good idea, holding separate VM buffer pins for the next-unskippable >> block and the block we're processing. I adopted that approach. > > Cool. It can't be avoided with streaming read vacuum, but I wonder if > there would ever be adverse effects to doing it on master? Maybe if we > are doing a lot of skipping and the block of the VM for the heap blocks > we are processing ends up changing each time but we would have had the > right block of the VM if we used the one from > heap_vac_scan_next_block()? > > Frankly, I'm in favor of just doing it now because it makes > lazy_scan_heap() less confusing. +1 >> I'm pretty happy with the attached patches now. The first one fixes the >> existing bug I mentioned in the other email (based on the on-going >> discussion that might not how we want to fix it though). > > ISTM we should still do the fix you mentioned -- seems like it has more > upsides than downsides? > >> From b68cb29c547de3c4acd10f31aad47b453d154666 Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Fri, 8 Mar 2024 16:00:22 +0200 >> Subject: [PATCH v8 1/3] Set all_visible_according_to_vm correctly with >> DISABLE_PAGE_SKIPPING >> >> It's important for 'all_visible_according_to_vm' to correctly reflect >> whether the VM bit is set or not, even when we are not trusting the VM >> to skip pages, because contrary to what the comment said, >> lazy_scan_prune() relies on it. >> >> If it's incorrectly set to 'false', when the VM bit is in fact set, >> lazy_scan_prune() will try to set the VM bit again and dirty the page >> unnecessarily. As a result, if you used DISABLE_PAGE_SKIPPING, all >> heap pages were dirtied, even if there were no changes. We would also >> fail to clear any VM bits that were set incorrectly. >> >> This was broken in commit 980ae17310, so backpatch to v16. > > LGTM. Committed and backpatched this. >> From 47af1ca65cf55ca876869b43bff47f9d43f0750e Mon Sep 17 00:00:00 2001 >> From: Heikki Linnakangas <heikki.linnakangas@iki.fi> >> Date: Fri, 8 Mar 2024 17:32:19 +0200 >> Subject: [PATCH v8 2/3] Confine vacuum skip logic to lazy_scan_skip() >> --- >> src/backend/access/heap/vacuumlazy.c | 256 +++++++++++++++------------ >> 1 file changed, 141 insertions(+), 115 deletions(-) >> >> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c >> index ac55ebd2ae5..0aa08762015 100644 >> --- a/src/backend/access/heap/vacuumlazy.c >> +++ b/src/backend/access/heap/vacuumlazy.c >> @@ -204,6 +204,12 @@ typedef struct LVRelState >> int64 live_tuples; /* # live tuples remaining */ >> int64 recently_dead_tuples; /* # dead, but not yet removable */ >> int64 missed_dead_tuples; /* # removable, but not removed */ > > Perhaps we should add a comment to the blkno member of LVRelState > indicating that it is used for error reporting and logging? Well, it's already under the "/* Error reporting state */" section. I agree this is a little confusing, the name 'blkno' doesn't convey that it's supposed to be used just for error reporting. But it's a pre-existing issue so I left it alone. It can be changed with a separate patch if we come up with a good idea. > I see why you removed my treatise-level comment that was here about > unskipped skippable blocks. However, when I was trying to understand > this code, I did wish there was some comment that explained to me why we > needed all of the variables next_unskippable_block, > next_unskippable_allvis, all_visible_according_to_vm, and current_block. > > The idea that we would choose not to skip a skippable block because of > kernel readahead makes sense. The part that I had trouble wrapping my > head around was that we want to also keep the visibility status of both > the beginning and ending blocks of the skippable range and then use > those to infer the visibility status of the intervening blocks without > another VM lookup if we decide not to skip them. Right, I removed the comment because looked a little out of place and it duplicated the other comments sprinkled in the function. I agree this could still use some more comments though. Here's yet another attempt at making this more readable. I moved the logic to find the next unskippable block to a separate function, and added comments to make the states more explicit. What do you think? -- Heikki Linnakangas Neon (https://neon.tech)
Вложения
On Mon, Mar 11, 2024 at 11:29:44AM +0200, Heikki Linnakangas wrote: > > > I see why you removed my treatise-level comment that was here about > > unskipped skippable blocks. However, when I was trying to understand > > this code, I did wish there was some comment that explained to me why we > > needed all of the variables next_unskippable_block, > > next_unskippable_allvis, all_visible_according_to_vm, and current_block. > > > > The idea that we would choose not to skip a skippable block because of > > kernel readahead makes sense. The part that I had trouble wrapping my > > head around was that we want to also keep the visibility status of both > > the beginning and ending blocks of the skippable range and then use > > those to infer the visibility status of the intervening blocks without > > another VM lookup if we decide not to skip them. > > Right, I removed the comment because looked a little out of place and it > duplicated the other comments sprinkled in the function. I agree this could > still use some more comments though. > > Here's yet another attempt at making this more readable. I moved the logic > to find the next unskippable block to a separate function, and added > comments to make the states more explicit. What do you think? Oh, I like the new structure. Very cool! Just a few remarks: > From c21480e9da61e145573de3b502551dde1b8fa3f6 Mon Sep 17 00:00:00 2001 > From: Heikki Linnakangas <heikki.linnakangas@iki.fi> > Date: Fri, 8 Mar 2024 17:32:19 +0200 > Subject: [PATCH v9 1/2] Confine vacuum skip logic to lazy_scan_skip() > > Rename lazy_scan_skip() to heap_vac_scan_next_block() and move more > code into the function, so that the caller doesn't need to know about > ranges or skipping anymore. heap_vac_scan_next_block() returns the > next block to process, and the logic for determining that block is all > within the function. This makes the skipping logic easier to > understand, as it's all in the same function, and makes the calling > code easier to understand as it's less cluttered. The state variables > needed to manage the skipping logic are moved to LVRelState. > > heap_vac_scan_next_block() now manages its own VM buffer separately > from the caller's vmbuffer variable. The caller's vmbuffer holds the > VM page for the current block its processing, while > heap_vac_scan_next_block() keeps a pin on the VM page for the next > unskippable block. Most of the time they are the same, so we hold two > pins on the same buffer, but it's more convenient to manage them > separately. > > For readability inside heap_vac_scan_next_block(), move the logic of > finding the next unskippable block to separate function, and add some > comments. > > This refactoring will also help future patches to switch to using a > streaming read interface, and eventually AIO > (https://postgr.es/m/CA%2BhUKGJkOiOCa%2Bmag4BF%2BzHo7qo%3Do9CFheB8%3Dg6uT5TUm2gkvA%40mail.gmail.com) > > Author: Melanie Plageman, with some changes by me I'd argue you earned co-authorship by now :) > Discussion: https://postgr.es/m/CAAKRu_Yf3gvXGcCnqqfoq0Q8LX8UM-e-qbm_B1LeZh60f8WhWA%40mail.gmail.com > --- > src/backend/access/heap/vacuumlazy.c | 233 +++++++++++++++++---------- > 1 file changed, 146 insertions(+), 87 deletions(-) > > diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c > index ac55ebd2ae..1757eb49b7 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > + > /* > - * lazy_scan_skip() -- set up range of skippable blocks using visibility map. > + * heap_vac_scan_next_block() -- get next block for vacuum to process > * > - * lazy_scan_heap() calls here every time it needs to set up a new range of > - * blocks to skip via the visibility map. Caller passes the next block in > - * line. We return a next_unskippable_block for this range. When there are > - * no skippable blocks we just return caller's next_block. The all-visible > - * status of the returned block is set in *next_unskippable_allvis for caller, > - * too. Block usually won't be all-visible (since it's unskippable), but it > - * can be during aggressive VACUUMs (as well as in certain edge cases). > + * lazy_scan_heap() calls here every time it needs to get the next block to > + * prune and vacuum. The function uses the visibility map, vacuum options, > + * and various thresholds to skip blocks which do not need to be processed and I wonder if "need" is too strong a word since this function (heap_vac_scan_next_block()) specifically can set blkno to a block which doesn't *need* to be processed but which it chooses to process because of SKIP_PAGES_THRESHOLD. > + * sets blkno to the next block that actually needs to be processed. > * > - * Sets *skipping_current_range to indicate if caller should skip this range. > - * Costs and benefits drive our decision. Very small ranges won't be skipped. > + * The block number and visibility status of the next block to process are set > + * in *blkno and *all_visible_according_to_vm. The return value is false if > + * there are no further blocks to process. > + * > + * vacrel is an in/out parameter here; vacuum options and information about > + * the relation are read, and vacrel->skippedallvis is set to ensure we don't > + * advance relfrozenxid when we have skipped vacuuming all-visible blocks. It Maybe this should say when we have skipped vacuuming all-visible blocks which are not all-frozen or just blocks which are not all-frozen. > + * also holds information about the next unskippable block, as bookkeeping for > + * this function. > * > * Note: our opinion of which blocks can be skipped can go stale immediately. > * It's okay if caller "misses" a page whose all-visible or all-frozen marking Wonder if it makes sense to move this note to find_next_nunskippable_block(). > @@ -1098,26 +1081,119 @@ lazy_scan_heap(LVRelState *vacrel) > * older XIDs/MXIDs. The vacrel->skippedallvis flag will be set here when the > * choice to skip such a range is actually made, making everything safe.) > */ > -static BlockNumber > -lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block, > - bool *next_unskippable_allvis, bool *skipping_current_range) > +static bool > +heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno, > + bool *all_visible_according_to_vm) > { > - BlockNumber rel_pages = vacrel->rel_pages, > - next_unskippable_block = next_block, > - nskippable_blocks = 0; > - bool skipsallvis = false; > + BlockNumber next_block; > > - *next_unskippable_allvis = true; > - while (next_unskippable_block < rel_pages) > + /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */ > + next_block = vacrel->current_block + 1; > + > + /* Have we reached the end of the relation? */ No strong opinion on this, but I wonder if being at the end of the relation counts as a fourth state? > + if (next_block >= vacrel->rel_pages) > + { > + if (BufferIsValid(vacrel->next_unskippable_vmbuffer)) > + { > + ReleaseBuffer(vacrel->next_unskippable_vmbuffer); > + vacrel->next_unskippable_vmbuffer = InvalidBuffer; > + } > + *blkno = vacrel->rel_pages; > + return false; > + } > + > + /* > + * We must be in one of the three following states: > + */ > + if (vacrel->next_unskippable_block == InvalidBlockNumber || > + next_block > vacrel->next_unskippable_block) > + { > + /* > + * 1. We have just processed an unskippable block (or we're at the > + * beginning of the scan). Find the next unskippable block using the > + * visibility map. > + */ I would reorder the options in the comment or in the if statement since they seem to be in the reverse order. > + bool skipsallvis; > + > + find_next_unskippable_block(vacrel, &skipsallvis); > + > + /* > + * We now know the next block that we must process. It can be the > + * next block after the one we just processed, or something further > + * ahead. If it's further ahead, we can jump to it, but we choose to > + * do so only if we can skip at least SKIP_PAGES_THRESHOLD consecutive > + * pages. Since we're reading sequentially, the OS should be doing > + * readahead for us, so there's no gain in skipping a page now and > + * then. Skipping such a range might even discourage sequential > + * detection. > + * > + * This test also enables more frequent relfrozenxid advancement > + * during non-aggressive VACUUMs. If the range has any all-visible > + * pages then skipping makes updating relfrozenxid unsafe, which is a > + * real downside. > + */ > + if (vacrel->next_unskippable_block - next_block >= SKIP_PAGES_THRESHOLD) > + { > + next_block = vacrel->next_unskippable_block; > + if (skipsallvis) > + vacrel->skippedallvis = true; > + } > + > +/* > + * Find the next unskippable block in a vacuum scan using the visibility map. To expand this comment, I might mention it is a helper function for heap_vac_scan_next_block(). I would also say that the next unskippable block and its visibility information are recorded in vacrel. And that skipsallvis is set to true if any of the intervening skipped blocks are not all-frozen. > + */ > +static void > +find_next_unskippable_block(LVRelState *vacrel, bool *skipsallvis) > +{ > + BlockNumber rel_pages = vacrel->rel_pages; > + BlockNumber next_unskippable_block = vacrel->next_unskippable_block + 1; > + Buffer next_unskippable_vmbuffer = vacrel->next_unskippable_vmbuffer; > + bool next_unskippable_allvis; > + > + *skipsallvis = false; > + > + for (;;) > { > uint8 mapbits = visibilitymap_get_status(vacrel->rel, > next_unskippable_block, > - vmbuffer); > + &next_unskippable_vmbuffer); > > - if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) == 0) > + next_unskippable_allvis = (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0; Otherwise LGTM - Melanie
On 11/03/2024 18:15, Melanie Plageman wrote:
> On Mon, Mar 11, 2024 at 11:29:44AM +0200, Heikki Linnakangas wrote:
>> diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
>> index ac55ebd2ae..1757eb49b7 100644
>> --- a/src/backend/access/heap/vacuumlazy.c
>> +++ b/src/backend/access/heap/vacuumlazy.c
>> +
> 
>>   /*
>> - *    lazy_scan_skip() -- set up range of skippable blocks using visibility map.
>> + *    heap_vac_scan_next_block() -- get next block for vacuum to process
>>    *
>> - * lazy_scan_heap() calls here every time it needs to set up a new range of
>> - * blocks to skip via the visibility map.  Caller passes the next block in
>> - * line.  We return a next_unskippable_block for this range.  When there are
>> - * no skippable blocks we just return caller's next_block.  The all-visible
>> - * status of the returned block is set in *next_unskippable_allvis for caller,
>> - * too.  Block usually won't be all-visible (since it's unskippable), but it
>> - * can be during aggressive VACUUMs (as well as in certain edge cases).
>> + * lazy_scan_heap() calls here every time it needs to get the next block to
>> + * prune and vacuum.  The function uses the visibility map, vacuum options,
>> + * and various thresholds to skip blocks which do not need to be processed and
>> + * sets blkno to the next block that actually needs to be processed.
> 
> I wonder if "need" is too strong a word since this function
> (heap_vac_scan_next_block()) specifically can set blkno to a block which
> doesn't *need* to be processed but which it chooses to process because
> of SKIP_PAGES_THRESHOLD.
Ok yeah, there's a lot of "needs" here :-). Fixed.
>>    *
>> - * Sets *skipping_current_range to indicate if caller should skip this range.
>> - * Costs and benefits drive our decision.  Very small ranges won't be skipped.
>> + * The block number and visibility status of the next block to process are set
>> + * in *blkno and *all_visible_according_to_vm.  The return value is false if
>> + * there are no further blocks to process.
>> + *
>> + * vacrel is an in/out parameter here; vacuum options and information about
>> + * the relation are read, and vacrel->skippedallvis is set to ensure we don't
>> + * advance relfrozenxid when we have skipped vacuuming all-visible blocks.  It
> 
> Maybe this should say when we have skipped vacuuming all-visible blocks
> which are not all-frozen or just blocks which are not all-frozen.
Ok, rephrased.
>> + * also holds information about the next unskippable block, as bookkeeping for
>> + * this function.
>>    *
>>    * Note: our opinion of which blocks can be skipped can go stale immediately.
>>    * It's okay if caller "misses" a page whose all-visible or all-frozen marking
> 
> Wonder if it makes sense to move this note to
> find_next_nunskippable_block().
Moved.
>> @@ -1098,26 +1081,119 @@ lazy_scan_heap(LVRelState *vacrel)
>>    * older XIDs/MXIDs.  The vacrel->skippedallvis flag will be set here when the
>>    * choice to skip such a range is actually made, making everything safe.)
>>    */
>> -static BlockNumber
>> -lazy_scan_skip(LVRelState *vacrel, Buffer *vmbuffer, BlockNumber next_block,
>> -               bool *next_unskippable_allvis, bool *skipping_current_range)
>> +static bool
>> +heap_vac_scan_next_block(LVRelState *vacrel, BlockNumber *blkno,
>> +                         bool *all_visible_according_to_vm)
>>   {
>> -    BlockNumber rel_pages = vacrel->rel_pages,
>> -                next_unskippable_block = next_block,
>> -                nskippable_blocks = 0;
>> -    bool        skipsallvis = false;
>> +    BlockNumber next_block;
>>   
>> -    *next_unskippable_allvis = true;
>> -    while (next_unskippable_block < rel_pages)
>> +    /* relies on InvalidBlockNumber + 1 overflowing to 0 on first call */
>> +    next_block = vacrel->current_block + 1;
>> +
>> +    /* Have we reached the end of the relation? */
> 
> No strong opinion on this, but I wonder if being at the end of the
> relation counts as a fourth state?
Yeah, perhaps. But I think it makes sense to treat it as a special case.
>> +    if (next_block >= vacrel->rel_pages)
>> +    {
>> +        if (BufferIsValid(vacrel->next_unskippable_vmbuffer))
>> +        {
>> +            ReleaseBuffer(vacrel->next_unskippable_vmbuffer);
>> +            vacrel->next_unskippable_vmbuffer = InvalidBuffer;
>> +        }
>> +        *blkno = vacrel->rel_pages;
>> +        return false;
>> +    }
>> +
>> +    /*
>> +     * We must be in one of the three following states:
>> +     */
>> +    if (vacrel->next_unskippable_block == InvalidBlockNumber ||
>> +        next_block > vacrel->next_unskippable_block)
>> +    {
>> +        /*
>> +         * 1. We have just processed an unskippable block (or we're at the
>> +         * beginning of the scan).  Find the next unskippable block using the
>> +         * visibility map.
>> +         */
> 
> I would reorder the options in the comment or in the if statement since
> they seem to be in the reverse order.
Reordered them in the statement.
It feels a bit wrong to test next_block > vacrel->next_unskippable_block 
before vacrel->next_unskippable_block == InvalidBlockNumber. But it 
works, and that order makes more sense in the comment IMHO.
>> +        bool        skipsallvis;
>> +
>> +        find_next_unskippable_block(vacrel, &skipsallvis);
>> +
>> +        /*
>> +         * We now know the next block that we must process.  It can be the
>> +         * next block after the one we just processed, or something further
>> +         * ahead.  If it's further ahead, we can jump to it, but we choose to
>> +         * do so only if we can skip at least SKIP_PAGES_THRESHOLD consecutive
>> +         * pages.  Since we're reading sequentially, the OS should be doing
>> +         * readahead for us, so there's no gain in skipping a page now and
>> +         * then.  Skipping such a range might even discourage sequential
>> +         * detection.
>> +         *
>> +         * This test also enables more frequent relfrozenxid advancement
>> +         * during non-aggressive VACUUMs.  If the range has any all-visible
>> +         * pages then skipping makes updating relfrozenxid unsafe, which is a
>> +         * real downside.
>> +         */
>> +        if (vacrel->next_unskippable_block - next_block >= SKIP_PAGES_THRESHOLD)
>> +        {
>> +            next_block = vacrel->next_unskippable_block;
>> +            if (skipsallvis)
>> +                vacrel->skippedallvis = true;
>> +        }
> 
>> +
>> +/*
>> + * Find the next unskippable block in a vacuum scan using the visibility map.
> 
> To expand this comment, I might mention it is a helper function for
> heap_vac_scan_next_block(). I would also say that the next unskippable
> block and its visibility information are recorded in vacrel. And that
> skipsallvis is set to true if any of the intervening skipped blocks are
> not all-frozen.
Added comments.
> Otherwise LGTM
Ok, pushed! Thank you, this is much more understandable now!
-- 
Heikki Linnakangas
Neon (https://neon.tech)
			
		On Mon, Mar 11, 2024 at 2:47 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > > > Otherwise LGTM > > Ok, pushed! Thank you, this is much more understandable now! Cool, thanks!
On Sun, Mar 10, 2024 at 11:01 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Mon, Mar 11, 2024 at 5:31 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I have investigated the interaction between > > maintenance_io_concurrency, streaming reads, and the vacuum buffer > > access strategy (BAS_VACUUM). > > > > The streaming read API limits max_pinned_buffers to a pinned buffer > > multiplier (currently 4) * maintenance_io_concurrency buffers with the > > goal of constructing reads of at least MAX_BUFFERS_PER_TRANSFER size. > > > > Since the BAS_VACUUM ring buffer is size 256 kB or 32 buffers with > > default block size, that means that for a fully uncached vacuum in > > which all blocks must be vacuumed and will be dirtied, you'd have to > > set maintenance_io_concurrency at 8 or lower to see the same number of > > reuses (and shared buffer consumption) as master. > > > > Given that we allow users to specify BUFFER_USAGE_LIMIT to vacuum, it > > seems like we should force max_pinned_buffers to a value that > > guarantees the expected shared buffer usage by vacuum. But that means > > that maintenance_io_concurrency does not have a predictable impact on > > streaming read vacuum. > > > > What is the right thing to do here? > > > > At the least, the default size of the BAS_VACUUM ring buffer should be > > BLCKSZ * pinned_buffer_multiplier * default maintenance_io_concurrency > > (probably rounded up to the next power of two) bytes. > > Hmm, does the v6 look-ahead distance control algorithm mitigate that > problem? Using the ABC classifications from the streaming read > thread, I think for A it should now pin only 1, for B 16 and for C, it > depends on the size of the random 'chunks': if you have a lot of size > 1 random reads then it shouldn't go above 10 because of (default) > maintenance_io_concurrency. The only way to get up to very high > numbers would be to have a lot of random chunks triggering behaviour > C, but each made up of long runs of misses. For example one can > contrive a BHS query that happens to read pages 0-15 then 20-35 then > 40-55 etc etc so that we want to get lots of wide I/Os running > concurrently. Unless vacuum manages to do something like that, it > shouldn't be able to exceed 32 buffers very easily. > > I suspect that if we taught streaming_read.c to ask the > BufferAccessStrategy (if one is passed in) what its recommended pin > limit is (strategy->nbuffers?), we could just clamp > max_pinned_buffers, and it would be hard to find a workload where that > makes a difference, and we could think about more complicated logic > later. > > In other words, I think/hope your complaints about excessive pinning > from v5 WRT all-cached heap scans might have also already improved > this case by happy coincidence? I haven't tried it out though, I just > read your description of the problem... I've rebased the attached v10 over top of the changes to lazy_scan_heap() Heikki just committed and over the v6 streaming read patch set. I started testing them and see that you are right, we no longer pin too many buffers. However, the uncached example below is now slower with streaming read than on master -- it looks to be because it is doing twice as many WAL writes and syncs. I'm still investigating why that is. psql \ -c "create table small (a int) with (autovacuum_enabled=false, fillfactor=25);" \ -c "insert into small select generate_series(1,200000) % 3;" \ -c "update small set a = 6 where a = 1;" pg_ctl stop # drop caches pg_ctl start psql -c "\timing on" -c "vacuum (verbose) small" - Melanie
Вложения
On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > I've rebased the attached v10 over top of the changes to > lazy_scan_heap() Heikki just committed and over the v6 streaming read > patch set. I started testing them and see that you are right, we no > longer pin too many buffers. However, the uncached example below is > now slower with streaming read than on master -- it looks to be > because it is doing twice as many WAL writes and syncs. I'm still > investigating why that is. That makes sense to me. We have 256kB of buffers in our ring, but now we're trying to read ahead 128kB at a time, so it works out that we can only flush the WAL accumulated while dirtying half the blocks at a time, so we flush twice as often. If I change the ring size to 384kB, allowing for that read-ahead window, I see approximately the same WAL flushes. Surely we'd never be able to get the behaviour to match *and* keep the same ring size? We simply need those 16 extra buffers to have a chance of accumulating 32 dirty buffers, and the associated WAL. Do you see the same result, or do you think something more than that is wrong here? Here are some system call traces using your test that helped me see the behaviour: 1. Unpatched, ie no streaming read, we flush 90kB of WAL generated by 32 pages before we write them out one at a time just before we read in their replacements. One flush covers the LSNs of all the pages that will be written, even though it's only called for the first page to be written. That's because XLogFlush(lsn), if it decides to do anything, flushes as far as it can... IOW when we hit the *oldest* dirty block, that's when we write out the WAL up to where we dirtied the *newest* block, which covers the 32 pwrite() calls here: pwrite(30,...,90112,0xf90000) = 90112 (0x16000) fdatasync(30) = 0 (0x0) pwrite(27,...,8192,0x0) = 8192 (0x2000) pread(27,...,8192,0x40000) = 8192 (0x2000) pwrite(27,...,8192,0x2000) = 8192 (0x2000) pread(27,...,8192,0x42000) = 8192 (0x2000) pwrite(27,...,8192,0x4000) = 8192 (0x2000) pread(27,...,8192,0x44000) = 8192 (0x2000) pwrite(27,...,8192,0x6000) = 8192 (0x2000) pread(27,...,8192,0x46000) = 8192 (0x2000) pwrite(27,...,8192,0x8000) = 8192 (0x2000) pread(27,...,8192,0x48000) = 8192 (0x2000) pwrite(27,...,8192,0xa000) = 8192 (0x2000) pread(27,...,8192,0x4a000) = 8192 (0x2000) pwrite(27,...,8192,0xc000) = 8192 (0x2000) pread(27,...,8192,0x4c000) = 8192 (0x2000) pwrite(27,...,8192,0xe000) = 8192 (0x2000) pread(27,...,8192,0x4e000) = 8192 (0x2000) pwrite(27,...,8192,0x10000) = 8192 (0x2000) pread(27,...,8192,0x50000) = 8192 (0x2000) pwrite(27,...,8192,0x12000) = 8192 (0x2000) pread(27,...,8192,0x52000) = 8192 (0x2000) pwrite(27,...,8192,0x14000) = 8192 (0x2000) pread(27,...,8192,0x54000) = 8192 (0x2000) pwrite(27,...,8192,0x16000) = 8192 (0x2000) pread(27,...,8192,0x56000) = 8192 (0x2000) pwrite(27,...,8192,0x18000) = 8192 (0x2000) pread(27,...,8192,0x58000) = 8192 (0x2000) pwrite(27,...,8192,0x1a000) = 8192 (0x2000) pread(27,...,8192,0x5a000) = 8192 (0x2000) pwrite(27,...,8192,0x1c000) = 8192 (0x2000) pread(27,...,8192,0x5c000) = 8192 (0x2000) pwrite(27,...,8192,0x1e000) = 8192 (0x2000) pread(27,...,8192,0x5e000) = 8192 (0x2000) pwrite(27,...,8192,0x20000) = 8192 (0x2000) pread(27,...,8192,0x60000) = 8192 (0x2000) pwrite(27,...,8192,0x22000) = 8192 (0x2000) pread(27,...,8192,0x62000) = 8192 (0x2000) pwrite(27,...,8192,0x24000) = 8192 (0x2000) pread(27,...,8192,0x64000) = 8192 (0x2000) pwrite(27,...,8192,0x26000) = 8192 (0x2000) pread(27,...,8192,0x66000) = 8192 (0x2000) pwrite(27,...,8192,0x28000) = 8192 (0x2000) pread(27,...,8192,0x68000) = 8192 (0x2000) pwrite(27,...,8192,0x2a000) = 8192 (0x2000) pread(27,...,8192,0x6a000) = 8192 (0x2000) pwrite(27,...,8192,0x2c000) = 8192 (0x2000) pread(27,...,8192,0x6c000) = 8192 (0x2000) pwrite(27,...,8192,0x2e000) = 8192 (0x2000) pread(27,...,8192,0x6e000) = 8192 (0x2000) pwrite(27,...,8192,0x30000) = 8192 (0x2000) pread(27,...,8192,0x70000) = 8192 (0x2000) pwrite(27,...,8192,0x32000) = 8192 (0x2000) pread(27,...,8192,0x72000) = 8192 (0x2000) pwrite(27,...,8192,0x34000) = 8192 (0x2000) pread(27,...,8192,0x74000) = 8192 (0x2000) pwrite(27,...,8192,0x36000) = 8192 (0x2000) pread(27,...,8192,0x76000) = 8192 (0x2000) pwrite(27,...,8192,0x38000) = 8192 (0x2000) pread(27,...,8192,0x78000) = 8192 (0x2000) pwrite(27,...,8192,0x3a000) = 8192 (0x2000) pread(27,...,8192,0x7a000) = 8192 (0x2000) pwrite(27,...,8192,0x3c000) = 8192 (0x2000) pread(27,...,8192,0x7c000) = 8192 (0x2000) pwrite(27,...,8192,0x3e000) = 8192 (0x2000) pread(27,...,8192,0x7e000) = 8192 (0x2000) (Digression: this alternative tail-write-head-read pattern defeats the read-ahead and write-behind on a bunch of OSes, but not Linux because it only seems to worry about the reads, while other Unixes have write-behind detection too, and I believe at least some are confused by this pattern of tiny writes following along some distance behind tiny reads; Andrew Gierth figured that out after noticing poor ring buffer performance, and we eventually got that fixed for one such system[1], separating the sequence detection for reads and writes.) 2. With your patches, we replace all those little pread calls with nice wide calls, yay!, but now we only manage to write out about half the amount of WAL at a time as you discovered. The repeating blocks of system calls now look like this, but there are twice as many of them: pwrite(32,...,40960,0x224000) = 40960 (0xa000) fdatasync(32) = 0 (0x0) pwrite(27,...,8192,0x5c000) = 8192 (0x2000) preadv(27,[...],3,0x7e000) = 131072 (0x20000) pwrite(27,...,8192,0x5e000) = 8192 (0x2000) pwrite(27,...,8192,0x60000) = 8192 (0x2000) pwrite(27,...,8192,0x62000) = 8192 (0x2000) pwrite(27,...,8192,0x64000) = 8192 (0x2000) pwrite(27,...,8192,0x66000) = 8192 (0x2000) pwrite(27,...,8192,0x68000) = 8192 (0x2000) pwrite(27,...,8192,0x6a000) = 8192 (0x2000) pwrite(27,...,8192,0x6c000) = 8192 (0x2000) pwrite(27,...,8192,0x6e000) = 8192 (0x2000) pwrite(27,...,8192,0x70000) = 8192 (0x2000) pwrite(27,...,8192,0x72000) = 8192 (0x2000) pwrite(27,...,8192,0x74000) = 8192 (0x2000) pwrite(27,...,8192,0x76000) = 8192 (0x2000) pwrite(27,...,8192,0x78000) = 8192 (0x2000) pwrite(27,...,8192,0x7a000) = 8192 (0x2000) 3. With your patches and test but this time using VACUUM (BUFFER_USAGE_LIMIT = '384kB'), the repeating block grows bigger and we get the larger WAL flushes back again, because now we're able to collect 32 blocks' worth of WAL up front again: pwrite(32,...,90112,0x50c000) = 90112 (0x16000) fdatasync(32) = 0 (0x0) pwrite(27,...,8192,0x1dc000) = 8192 (0x2000) pread(27,...,131072,0x21e000) = 131072 (0x20000) pwrite(27,...,8192,0x1de000) = 8192 (0x2000) pwrite(27,...,8192,0x1e0000) = 8192 (0x2000) pwrite(27,...,8192,0x1e2000) = 8192 (0x2000) pwrite(27,...,8192,0x1e4000) = 8192 (0x2000) pwrite(27,...,8192,0x1e6000) = 8192 (0x2000) pwrite(27,...,8192,0x1e8000) = 8192 (0x2000) pwrite(27,...,8192,0x1ea000) = 8192 (0x2000) pwrite(27,...,8192,0x1ec000) = 8192 (0x2000) pwrite(27,...,8192,0x1ee000) = 8192 (0x2000) pwrite(27,...,8192,0x1f0000) = 8192 (0x2000) pwrite(27,...,8192,0x1f2000) = 8192 (0x2000) pwrite(27,...,8192,0x1f4000) = 8192 (0x2000) pwrite(27,...,8192,0x1f6000) = 8192 (0x2000) pwrite(27,...,8192,0x1f8000) = 8192 (0x2000) pwrite(27,...,8192,0x1fa000) = 8192 (0x2000) pwrite(27,...,8192,0x1fc000) = 8192 (0x2000) preadv(27,[...],3,0x23e000) = 131072 (0x20000) pwrite(27,...,8192,0x1fe000) = 8192 (0x2000) pwrite(27,...,8192,0x200000) = 8192 (0x2000) pwrite(27,...,8192,0x202000) = 8192 (0x2000) pwrite(27,...,8192,0x204000) = 8192 (0x2000) pwrite(27,...,8192,0x206000) = 8192 (0x2000) pwrite(27,...,8192,0x208000) = 8192 (0x2000) pwrite(27,...,8192,0x20a000) = 8192 (0x2000) pwrite(27,...,8192,0x20c000) = 8192 (0x2000) pwrite(27,...,8192,0x20e000) = 8192 (0x2000) pwrite(27,...,8192,0x210000) = 8192 (0x2000) pwrite(27,...,8192,0x212000) = 8192 (0x2000) pwrite(27,...,8192,0x214000) = 8192 (0x2000) pwrite(27,...,8192,0x216000) = 8192 (0x2000) pwrite(27,...,8192,0x218000) = 8192 (0x2000) pwrite(27,...,8192,0x21a000) = 8192 (0x2000) 4. For learning/exploration only, I rebased my experimental vectored FlushBuffers() patch, which teaches the checkpointer to write relation data out using smgrwritev(). The checkpointer explicitly sorts blocks, but I think ring buffers should naturally often contain consecutive blocks in ring order. Highly experimental POC code pushed to a public branch[2], but I am not proposing anything here, just trying to understand things. The nicest looking system call trace was with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads and WAL writes 128kB at a time: pwrite(32,...,131072,0xfc6000) = 131072 (0x20000) fdatasync(32) = 0 (0x0) pwrite(27,...,131072,0x6c0000) = 131072 (0x20000) pread(27,...,131072,0x73e000) = 131072 (0x20000) pwrite(27,...,131072,0x6e0000) = 131072 (0x20000) pread(27,...,131072,0x75e000) = 131072 (0x20000) pwritev(27,[...],3,0x77e000) = 131072 (0x20000) preadv(27,[...],3,0x77e000) = 131072 (0x20000) That was a fun experiment, but... I recognise that efficient cleaning of ring buffers is a Hard Problem requiring more concurrency: it's just too late to be flushing that WAL. But we also don't want to start writing back data immediately after dirtying pages (cf. OS write-behind for big sequential writes in traditional Unixes), because we're not allowed to write data out without writing the WAL first and we currently need to build up bigger WAL writes to do so efficiently (cf. some other systems that can write out fragments of WAL concurrently so the latency-vs-throughput trade-off doesn't have to be so extreme). So we want to defer writing it, but not too long. We need something cleaning our buffers (or at least flushing the associated WAL, but preferably also writing the data) not too late and not too early, and more in sync with our scan than the WAL writer is. What that machinery should look like I don't know (but I believe Andres has ideas). [1] https://github.com/freebsd/freebsd-src/commit/f2706588730a5d3b9a687ba8d4269e386650cc4f [2] https://github.com/macdice/postgres/tree/vectored-ring-buffer
On Sun, Mar 17, 2024 at 2:53 AM Thomas Munro <thomas.munro@gmail.com> wrote: > > On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > I've rebased the attached v10 over top of the changes to > > lazy_scan_heap() Heikki just committed and over the v6 streaming read > > patch set. I started testing them and see that you are right, we no > > longer pin too many buffers. However, the uncached example below is > > now slower with streaming read than on master -- it looks to be > > because it is doing twice as many WAL writes and syncs. I'm still > > investigating why that is. --snip-- > 4. For learning/exploration only, I rebased my experimental vectored > FlushBuffers() patch, which teaches the checkpointer to write relation > data out using smgrwritev(). The checkpointer explicitly sorts > blocks, but I think ring buffers should naturally often contain > consecutive blocks in ring order. Highly experimental POC code pushed > to a public branch[2], but I am not proposing anything here, just > trying to understand things. The nicest looking system call trace was > with BUFFER_USAGE_LIMIT set to 512kB, so it could do its writes, reads > and WAL writes 128kB at a time: > > pwrite(32,...,131072,0xfc6000) = 131072 (0x20000) > fdatasync(32) = 0 (0x0) > pwrite(27,...,131072,0x6c0000) = 131072 (0x20000) > pread(27,...,131072,0x73e000) = 131072 (0x20000) > pwrite(27,...,131072,0x6e0000) = 131072 (0x20000) > pread(27,...,131072,0x75e000) = 131072 (0x20000) > pwritev(27,[...],3,0x77e000) = 131072 (0x20000) > preadv(27,[...],3,0x77e000) = 131072 (0x20000) > > That was a fun experiment, but... I recognise that efficient cleaning > of ring buffers is a Hard Problem requiring more concurrency: it's > just too late to be flushing that WAL. But we also don't want to > start writing back data immediately after dirtying pages (cf. OS > write-behind for big sequential writes in traditional Unixes), because > we're not allowed to write data out without writing the WAL first and > we currently need to build up bigger WAL writes to do so efficiently > (cf. some other systems that can write out fragments of WAL > concurrently so the latency-vs-throughput trade-off doesn't have to be > so extreme). So we want to defer writing it, but not too long. We > need something cleaning our buffers (or at least flushing the > associated WAL, but preferably also writing the data) not too late and > not too early, and more in sync with our scan than the WAL writer is. > What that machinery should look like I don't know (but I believe > Andres has ideas). I've attached a WIP v11 streaming vacuum patch set here that is rebased over master (by Thomas), so that I could add a CF entry for it. It still has the problem with the extra WAL write and fsync calls investigated by Thomas above. Thomas has some work in progress doing streaming write-behind to alleviate the issues with the buffer access strategy and streaming reads. When he gets a version of that ready to share, he will start a new "Streaming Vacuum" thread. - Melanie
Вложения
On Fri, Jun 28, 2024 at 05:36:25PM -0400, Melanie Plageman wrote: > I've attached a WIP v11 streaming vacuum patch set here that is > rebased over master (by Thomas), so that I could add a CF entry for > it. It still has the problem with the extra WAL write and fsync calls > investigated by Thomas above. Thomas has some work in progress doing > streaming write-behind to alleviate the issues with the buffer access > strategy and streaming reads. When he gets a version of that ready to > share, he will start a new "Streaming Vacuum" thread. To avoid reviewing the wrong patch, I'm writing to verify the status here. This is Needs Review in the commitfest. I think one of these two holds: 1. Needs Review is valid. 2. It's actually Waiting on Author. You're commissioning a review of the future-thread patch, not this one. If it's (1), given the WIP marking, what is the scope of the review you seek? I'm guessing performance is out of scope; what else is in or out of scope?
On Sun, Jul 7, 2024 at 10:49 AM Noah Misch <noah@leadboat.com> wrote: > > On Fri, Jun 28, 2024 at 05:36:25PM -0400, Melanie Plageman wrote: > > I've attached a WIP v11 streaming vacuum patch set here that is > > rebased over master (by Thomas), so that I could add a CF entry for > > it. It still has the problem with the extra WAL write and fsync calls > > investigated by Thomas above. Thomas has some work in progress doing > > streaming write-behind to alleviate the issues with the buffer access > > strategy and streaming reads. When he gets a version of that ready to > > share, he will start a new "Streaming Vacuum" thread. > > To avoid reviewing the wrong patch, I'm writing to verify the status here. > This is Needs Review in the commitfest. I think one of these two holds: > > 1. Needs Review is valid. > 2. It's actually Waiting on Author. You're commissioning a review of the > future-thread patch, not this one. > > If it's (1), given the WIP marking, what is the scope of the review you seek? > I'm guessing performance is out of scope; what else is in or out of scope? Ah, you're right. I moved it to "Waiting on Author" as we are waiting on Thomas' version which has a fix for the extra WAL write/sync behavior. Sorry for the "Needs Review" noise! - Melanie
On Mon, Jul 8, 2024 at 2:49 AM Noah Misch <noah@leadboat.com> wrote: > what is the scope of the review you seek? The patch "Refactor tidstore.c memory management." could definitely use some review. I wasn't sure if that should be proposed in a new thread of its own, but then the need for it comes from this streamifying project, so... The basic problem was that we want to build up a stream of block to be vacuumed (so that we can perform the I/O combining etc) + some extra data attached to each buffer, in this case the TID list, but the implementation of tidstore.c in master would require us to make an extra intermediate copy of the TIDs, because it keeps overwriting its internal buffer. The proposal here is to make it so that you can get get a tiny copyable object that can later be used to retrieve the data into a caller-supplied buffer, so that tidstore.c's iterator machinery doesn't have to have its own internal buffer at all, and then calling code can safely queue up a few of these at once.
On Mon, Jul 15, 2024 at 03:26:32PM +1200, Thomas Munro wrote: > On Mon, Jul 8, 2024 at 2:49 AM Noah Misch <noah@leadboat.com> wrote: > > what is the scope of the review you seek? > > The patch "Refactor tidstore.c memory management." could definitely > use some review. That's reasonable. radixtree already forbids mutations concurrent with iteration, so there's no new concurrency hazard. One alternative is per_buffer_data big enough for MaxOffsetNumber, but that might thrash caches measurably. That patch is good to go apart from these trivialities: > - return &(iter->output); > + return &iter->output; This cosmetic change is orthogonal to the patch's mission. > - for (wordnum = 0; wordnum < page->header.nwords; wordnum++) > + for (int wordnum = 0; wordnum < page->header.nwords; wordnum++) Likewise.
On Tue, Jul 16, 2024 at 1:52 PM Noah Misch <noah@leadboat.com> wrote: > On Mon, Jul 15, 2024 at 03:26:32PM +1200, Thomas Munro wrote: > That's reasonable. radixtree already forbids mutations concurrent with > iteration, so there's no new concurrency hazard. One alternative is > per_buffer_data big enough for MaxOffsetNumber, but that might thrash caches > measurably. That patch is good to go apart from these trivialities: Thanks! I have pushed that patch, without those changes you didn't like. Here's are Melanie's patches again. They work, and the WAL flush frequency problem is mostly gone since we increased the BAS_VACUUM default ring size (commit 98f320eb), but I'm still looking into how this read-ahead and the write-behind generated by vacuum (using patches not yet posted) should interact with each other and the ring system, and bouncing ideas around about that with my colleagues. More on that soon, hopefully. I suspect that there won't be changes to these patches as a result, but I still want to hold off for a bit.
Вложения
On Sun, Jan 19, 2025 at 5:51 AM Tomas Vondra <tomas@vondra.me> wrote: > * Does it still make sense to default to eic=1? For this particular test > increasing eic=4 often cuts the duration in half (especially on nvme > storage). Maybe it wasn't a bad choice for systems with one spinning disk, but obviously typical hardware has changed completely since then. Bruce even changed the docs to recommend "hundreds" on SSDs (46eafc88). We could definitely consider changing the value, but this particular thing is using the READ_STREAM_MAINTENANCE flag, so it uses maintenance_io_concurrency, and that one defaults to 10. That's also arbitrary and quite small, but I think it means we can avoid choosing new defaults for now :-) For the non-maintenance one, we might want to think about tweaking that in the context of bitmap heapscan? (Interestingly, MySQL seems to have a related setting defaulting to 10k, but that may be a system-wide setting, IDK. Our settings are quite low level, per "operation", or really per stream. In an off-list chat with Robert and Andres, we bounced around some new names, and the one I liked best was io_concurrency_per_stream. It would be accurate but bring a new implementation detail to the UX. I actually like that about it: it's like max_parallel_workers_per_gather, and also just the way work_mem is really work_mem_per_<something>: yes it is low level but is an honest expression of how (un)sophisticated our resource usage controls are today. Perhaps we'll eventually figure out how to balance all resources dynamically from global limits...) > * Why are we limiting ioc to <= 256kB? Per the benchmark it seems it > might be beneficial to set even higher values. That comes from: #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX ... which comes from: /* Define a reasonable maximum that is safe to use on the stack. */ #define PG_IOV_MAX Min(IOV_MAX, 32) There are a few places that use either PG_IOV_MAX or MAX_IO_COMBINE_LIMIT to size a stack array, but those should be OK with a bigger number as the types are small: buffer, pointer, or struct iovec (16 bytes). For example, if it were 128 then we could do 1MB I/O, a nice round number, and arrays of those types would still only be 2kB or less. The other RDBMSes I googled seem to have max I/O sizes of around 512kB or 1MB, some tunable explicitly, some derived from other settings. I picked 128kB as the default combine limit because it comes up in lots of other places eg readahead size, and seemed to work pretty well, and is definitely supportable on all POSIX systems (see POSIX IOV_MAX, which is allowed to be as low as 16). I chose a maximum that was just a bit more, but not much more because I was also worried about how many places would eventually finish up wasting a lot of memory by having to multiply that by number-of-possible-I/Os-in-progress or other similar ideas, but I was expecting we'd want to increase it. read_stream.c doesn't do that sort of multiplication itself, but you can see a case like that in Andres's AIO patchset here: https://github.com/anarazel/postgres/blob/aio-2/src/backend/storage/aio/aio_init.c So for example if io_max_concurrency defaults to 1024; you'd get 2MB of iovecs in shared memory with PG_IOV_MAX of 128, instead of 512kB today. We can always discuss defaults separately but that case doesn't seem like a problem from here... Nice results, thanks!
On Sun, Jan 19, 2025 at 10:31 AM Thomas Munro <thomas.munro@gmail.com> wrote: > read_stream.c doesn't do that > sort of multiplication itself, Actually for completeness there is a place where it allocates local memory for max I/Os * 4, and that 4 is a not entirely unbogus and should change to io_combine_limit for the AIO stuff. Patches in progress, more soon. But that'd not be using MAX_IO_COMBINE_LIMIT or the number of system-wide I/O, it'd be your (usually much smaller) configured limits. But I'll write about that with more details in a new thread...
On 1/18/25 22:31, Thomas Munro wrote: > On Sun, Jan 19, 2025 at 5:51 AM Tomas Vondra <tomas@vondra.me> wrote: >> * Does it still make sense to default to eic=1? For this particular test >> increasing eic=4 often cuts the duration in half (especially on nvme >> storage). > > Maybe it wasn't a bad choice for systems with one spinning disk, but > obviously typical hardware has changed completely since then. Bruce > even changed the docs to recommend "hundreds" on SSDs (46eafc88). We > could definitely consider changing the value, but this particular > thing is using the READ_STREAM_MAINTENANCE flag, so it uses > maintenance_io_concurrency, and that one defaults to 10. That's also > arbitrary and quite small, but I think it means we can avoid choosing > new defaults for now :-) I completely lost track which hardware is supposed to be a good fit for low/high values of these GUCs. But weren't old spinning drives what needed fairly long queue to optimize the movement of heads? And IIRC some of the newer SSD (e.g. Optane) were marketed as not requiring very long queues ... anyway, it seems fairly difficult to formulate a rule comprehensible for our users. FWIW the benchmarking script tweaks both effective_io_concurrency and maintenance_io_concurrency GUCs (sets them to the same value). But yeah, 10 seems like a much better default for this type of storage. So for vacuum this is probably fine, but I was thinking more about the regular effective_io_concurrency for queries. > For the non-maintenance one, we might want > to think about tweaking that in the context of bitmap heapscan? > I'm not sure what you mean. How would we tweak it? You mean the GUC, or some sort of adaptive heuristics? > (Interestingly, MySQL seems to have a related setting defaulting to > 10k, but that may be a system-wide setting, IDK. Our settings are > quite low level, per "operation", or really per stream. In an > off-list chat with Robert and Andres, we bounced around some new > names, and the one I liked best was io_concurrency_per_stream. It > would be accurate but bring a new implementation detail to the UX. I > actually like that about it: it's like > max_parallel_workers_per_gather, and also just the way work_mem is > really work_mem_per_<something>: yes it is low level but is an honest > expression of how (un)sophisticated our resource usage controls are > today. Perhaps we'll eventually figure out how to balance all > resources dynamically from global limits...) > Possibly, but I'd guess we're years from doing that. I'm not sure anyone even proposed anything like that. >> * Why are we limiting ioc to <= 256kB? Per the benchmark it seems it >> might be beneficial to set even higher values. > > That comes from: > > #define MAX_IO_COMBINE_LIMIT PG_IOV_MAX > > ... which comes from: > > /* Define a reasonable maximum that is safe to use on the stack. */ > #define PG_IOV_MAX Min(IOV_MAX, 32) > > There are a few places that use either PG_IOV_MAX or > MAX_IO_COMBINE_LIMIT to size a stack array, but those should be OK > with a bigger number as the types are small: buffer, pointer, or > struct iovec (16 bytes). For example, if it were 128 then we could do > 1MB I/O, a nice round number, and arrays of those types would still > only be 2kB or less. The other RDBMSes I googled seem to have max I/O > sizes of around 512kB or 1MB, some tunable explicitly, some derived > from other settings. > > I picked 128kB as the default combine limit because it comes up in > lots of other places eg readahead size, and seemed to work pretty > well, and is definitely supportable on all POSIX systems (see POSIX > IOV_MAX, which is allowed to be as low as 16). Not sure I follow. Surely if a system can't support values above some limit, it would define IOV_MAX accordingly, and we'd just reject that value. And as you point out, the IOV_MAX may be as low as 16, so it's already possible to get a GUC value that gets rejected on some systems (even if it's just a theoretical issue). > I chose a maximum that > was just a bit more, but not much more because I was also worried > about how many places would eventually finish up wasting a lot of > memory by having to multiply that by > number-of-possible-I/Os-in-progress or other similar ideas, but I was > expecting we'd want to increase it. read_stream.c doesn't do that > sort of multiplication itself, but you can see a case like that in > Andres's AIO patchset here: > > https://github.com/anarazel/postgres/blob/aio-2/src/backend/storage/aio/aio_init.c > > So for example if io_max_concurrency defaults to 1024; you'd get 2MB > of iovecs in shared memory with PG_IOV_MAX of 128, instead of 512kB > today. We can always discuss defaults separately but that case > doesn't seem like a problem from here... > Yeah, all of this makes sense. I don't doubt this is a tradeoff, and if the GUC gets set to a high value it might have detrimental effect. Still, 256kB seems a bit too conservative and "not round" ;-) regards -- Tomas Vondra
On Tue, Feb 11, 2025 at 5:10 PM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> On Thu, Feb 6, 2025 at 1:06 PM Melanie Plageman
> <melanieplageman@gmail.com> wrote:
> >
> > On Wed, Feb 5, 2025 at 5:26 PM Melanie Plageman
> > <melanieplageman@gmail.com> wrote:
> > >
> > > Yes, looking at these results, I also feel good about it. I've updated
> > > the commit metadata in attached v14, but I could use a round of review
> > > before pushing it.
> >
> > I've done a bit of self-review and updated these patches.
>
> This needed a rebase in light of 052026c9b90.
> v16 attached has an additional commit which converts the block
> information parameters to heap_vac_scan_next_block() into flags
> because we can only get one piece of information per block from the
> read stream API. This seemed nicer than a cumbersome struct.
>
Sorry for the late chiming in. I've reviewed the v16 patch set, and
the patches mostly look good. Here are some comments mostly about
cosmetic things:
0001:
-   bool        all_visible_according_to_vm,
-               was_eager_scanned = false;
+   uint8       blk_flags = 0;
Can we probably declare blk_flags inside the main loop?
0002:
In lazy_scan_heap(), we have a failsafe check at the beginning of the
main loop, which is performed before reading the first block. Isn't it
better to move this check after scanning a block (especially after
incrementing scanned_pages)? Otherwise, we would end up calling
lazy_check_wraparound_failsafe() at the very first loop, which
previously didn't happen without the patch. Since we already called
lazy_check_wraparound_failsafe() just before calling lazy_scan_heap(),
the extra check would not make much sense.
---
+   /* Set up the read stream for vacuum's first pass through the heap */
+   stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE,
+                                       vacrel->bstrategy,
+                                       vacrel->rel,
+                                       MAIN_FORKNUM,
+                                       heap_vac_scan_next_block,
+                                       vacrel,
+                                       sizeof(bool));
Is there any reason to use sizeof(bool) instead of sizeof(uint8) here?
---
            /*
             * Vacuum the Free Space Map to make newly-freed space visible on
-            * upper-level FSM pages.  Note we have not yet processed blkno.
+            * upper-level FSM pages.  Note that blkno is the previously
+            * processed block.
             */
            FreeSpaceMapVacuumRange(vacrel->rel, next_fsm_block_to_vacuum,
                                    blkno);
Given the blkno is already processed, should we pass 'blkno + 1'
instead of blkno?
0003:
-   while ((iter_result = TidStoreIterateNext(iter)) != NULL)
I think we can declare iter_result in the main loop of lazy_vacuum_heap_rel().
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
			
		On Fri, Feb 14, 2025 at 12:11 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > I've done some clean-up including incorporating a few off-list pieces > of minor feedback from Andres. I've been poking, reading, and trying out these patches. They look good to me. Tiny nit, maybe this comment could say something less obvious, cf the similar comment near the other stream: + /* Set up the read stream */ + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, I don't really love the cumbersome casting required around per_buffer_data, but that's not your patches' fault (hmm, wonder what we can do to improve that).
On Thu, Feb 13, 2025 at 6:52 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > I've been poking, reading, and trying out these patches. They look good to me. Thanks for the review. > Tiny nit, maybe this comment could say something less obvious, cf the > similar comment near the other stream: > > + /* Set up the read stream */ > + stream = read_stream_begin_relation(READ_STREAM_MAINTENANCE, Done in upthread v18. > I don't really love the cumbersome casting required around > per_buffer_data, but that's not your patches' fault (hmm, wonder what > we can do to improve that). I don't know if you saw v17, but I tried to improve it a bit. The casting still has to happen, but I at least use the variable as a uint8 instead of a pointer to a uint8 (dunno if that makes it better or worse). It is the same in v18. - Melanie
On Thu, Feb 13, 2025 at 4:55 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > Thanks for your review! I've made the changes in attached v18. > > I do want to know what you think we should do about what you brought > up about lazy_check_wraparound_failsafe() -- given my reply (below). > > On Thu, Feb 13, 2025 at 6:08 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > Sorry for the late chiming in. I've reviewed the v16 patch set, and > > the patches mostly look good. Here are some comments mostly about > > cosmetic things: > > > > 0001: > > > > - bool all_visible_according_to_vm, > > - was_eager_scanned = false; > > + uint8 blk_flags = 0; > > > > Can we probably declare blk_flags inside the main loop? > > I've done this in 0002 (can't in 0001 because of it being used inside > the while loop itself). You're right, thanks. > > > 0002: > > > > In lazy_scan_heap(), we have a failsafe check at the beginning of the > > main loop, which is performed before reading the first block. Isn't it > > better to move this check after scanning a block (especially after > > incrementing scanned_pages)? Otherwise, we would end up calling > > lazy_check_wraparound_failsafe() at the very first loop, which > > previously didn't happen without the patch. Since we already called > > lazy_check_wraparound_failsafe() just before calling lazy_scan_heap(), > > the extra check would not make much sense. > > Yes, I agonized over this a bit. The problem with calling > lazy_check_wraparound_failsafe() (and vacuum_delay_point() especially) > after reading the first block is that read_stream_next_buffer() > returns the buffer pinned. Good point. > We don't want to hang onto that pin for a > long time. But I can't move them to the bottom of the loop after we > release the buffer because some of the code paths don't make it that > far. I don't see a good way other than how I did it or special-casing > block 0. What do you think? How about adding 'vacrel->scanned_pages > 0' to the if statement? Which seems not odd to me. Looking at the 0002 patch, it seems you reverted the change to the following comment: /* * Vacuum the Free Space Map to make newly-freed space visible on - * upper-level FSM pages. Note we have not yet processed blkno. + * upper-level FSM pages. Note we have not yet processed blkno+1. */ I feel that the previous change I saw in v17 is clearer: /* * Vacuum the Free Space Map to make newly-freed space visible on - * upper-level FSM pages. Note we have not yet processed blkno. + * upper-level FSM pages. Note that blkno is the previously + * processed block. */ The rest looks good to me. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Feb 13, 2025 at 9:06 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > The rest looks good to me. > > Cool! I'll plan to push this tomorrow barring any objections. I've committed this and marked it as such in the CF app. - Melanie
On Fri, Feb 14, 2025 at 1:15 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > > On Thu, Feb 13, 2025 at 9:06 PM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > > > On Thu, Feb 13, 2025 at 8:30 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > The rest looks good to me. > > > > Cool! I'll plan to push this tomorrow barring any objections. > > I've committed this and marked it as such in the CF app. Seems valgrind doesn't like this [1]. I'm looking into it. - Melanie [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2025-02-14%2018%3A00%3A12
On Sat, Feb 15, 2025 at 7:30 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
> Seems valgrind doesn't like this [1]. I'm looking into it.
Melanie was able to reproduce this on her local valgrind and
eventually we figured out that it's my fault.  I put code into
read_stream.c that calls wipe_mem(), thinking that that was our
standard way of scribbling 0x7f on memory that you shouldn't access
again until it's reused.  I didn't realise that wipe_mem() also tells
valgrind that the memory is now "no access".  That makes sense for
palloc/pfree because when that range is allocated again it'll clear
that.  The point is to help people discover that they have a dangling
reference to per-buffer data after they advance to the next buffer,
which wouldn't work because it's in a circular queue and could be
overwritten any time after that.
This fixes it, but is not yet my proposed change:
@@ -193,9 +193,12 @@ read_stream_get_block(ReadStream *stream, void
*per_buffer_data)
        if (blocknum != InvalidBlockNumber)
                stream->buffered_blocknum = InvalidBlockNumber;
        else
+       {
+               VALGRIND_MAKE_MEM_UNDEFINED(per_buffer_data,
stream->per_buffer_data_size);
                blocknum = stream->callback(stream,
 stream->callback_private_data,
 per_buffer_data);
+       }
Thinking about how to make it more symmetrical...
			
		Thomas Munro <thomas.munro@gmail.com> writes:
> Here's a patch.  Is there a tidier way to write this?
Hmm, I think not with the current set of primitives.  We could think
about refactoring them, but that's not a job for a band-aid patch.
> It should probably be back-patched to 17, because external code might
> use per-buffer data (obviously v17 core doesn't or skink would have
> told us this sooner).   It's not a good time to push to 17 today,
> though.  Push to master now to cheer skink up and 17 some time later
> when the coast is clear, or just wait?
Agreed that right now is a bad time to push this to v17 --- we need to
keep the risk factors as low as possible for the re-release.  Master
now and v17 after the re-wrap seems like the right compromise.
            regards, tom lane
			
		On Sat, Feb 15, 2025 at 12:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Thomas Munro <thomas.munro@gmail.com> writes: > > Here's a patch. Is there a tidier way to write this? > > Hmm, I think not with the current set of primitives. We could think > about refactoring them, but that's not a job for a band-aid patch. Thanks for looking. > > It should probably be back-patched to 17, because external code might > > use per-buffer data (obviously v17 core doesn't or skink would have > > told us this sooner). It's not a good time to push to 17 today, > > though. Push to master now to cheer skink up and 17 some time later > > when the coast is clear, or just wait? > > Agreed that right now is a bad time to push this to v17 --- we need to > keep the risk factors as low as possible for the re-release. Master > now and v17 after the re-wrap seems like the right compromise. Cool, will push to master. Melanie, could you please confirm that this patch works for you? I haven't figured out what I'm doing wrong but my local Valgrind doesn't seem to show the problem (USE_VALGRIND defined, Debian's Valgrind v3.19.0).
On Fri, Feb 14, 2025 at 6:09 PM Thomas Munro <thomas.munro@gmail.com> wrote: > > > Agreed that right now is a bad time to push this to v17 --- we need to > > keep the risk factors as low as possible for the re-release. Master > > now and v17 after the re-wrap seems like the right compromise. > > Cool, will push to master. Melanie, could you please confirm that > this patch works for you? I haven't figured out what I'm doing wrong > but my local Valgrind doesn't seem to show the problem (USE_VALGRIND > defined, Debian's Valgrind v3.19.0). It fixed the issue (after an off-list correction to the patch by Thomas). - Melanie
On Sat, Feb 15, 2025 at 12:50 PM Melanie Plageman <melanieplageman@gmail.com> wrote: > It fixed the issue (after an off-list correction to the patch by Thomas). Thanks! It's green again.
Thomas Munro <thomas.munro@gmail.com> writes:
> Thanks!  It's green again.
The security team's Coverity instance complained about this patch:
*** CID 1642971:  Null pointer dereferences  (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/backend/access/heap/vacuumlazy.c: 1295 in lazy_scan_heap()
1289             buf = read_stream_next_buffer(stream, &per_buffer_data);
1290
1291             /* The relation is exhausted. */
1292             if (!BufferIsValid(buf))
1293                 break;
1294
>>>     CID 1642971:  Null pointer dereferences  (FORWARD_NULL)
>>>     Dereferencing null pointer "per_buffer_data".
1295             blk_info = *((uint8 *) per_buffer_data);
1296             CheckBufferIsPinnedOnce(buf);
1297             page = BufferGetPage(buf);
1298             blkno = BufferGetBlockNumber(buf);
1299
1300             vacrel->scanned_pages++;
Basically, Coverity doesn't understand that a successful call to
read_stream_next_buffer must set per_buffer_data here.  I don't
think there's much chance of teaching it that, so we'll just
have to dismiss this item as "intentional, not a bug".  However,
I do have a suggestion: I think the "per_buffer_data" variable
should be declared inside the "while (true)" loop not outside.
That way there is no chance of a value being carried across
iterations, so that if for some reason read_stream_next_buffer
failed to do what we expect and did not set per_buffer_data,
we'd be certain to get a null-pointer core dump rather than
accessing data from a previous buffer.
            regards, tom lane
			
		On Sun, Feb 16, 2025 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Thomas Munro <thomas.munro@gmail.com> writes: > > Thanks! It's green again. > > The security team's Coverity instance complained about this patch: > > *** CID 1642971: Null pointer dereferences (FORWARD_NULL) > /srv/coverity/git/pgsql-git/postgresql/src/backend/access/heap/vacuumlazy.c: 1295 in lazy_scan_heap() > 1289 buf = read_stream_next_buffer(stream, &per_buffer_data); > 1290 > 1291 /* The relation is exhausted. */ > 1292 if (!BufferIsValid(buf)) > 1293 break; > 1294 > >>> CID 1642971: Null pointer dereferences (FORWARD_NULL) > >>> Dereferencing null pointer "per_buffer_data". > 1295 blk_info = *((uint8 *) per_buffer_data); > 1296 CheckBufferIsPinnedOnce(buf); > 1297 page = BufferGetPage(buf); > 1298 blkno = BufferGetBlockNumber(buf); > 1299 > 1300 vacrel->scanned_pages++; > > Basically, Coverity doesn't understand that a successful call to > read_stream_next_buffer must set per_buffer_data here. I don't > think there's much chance of teaching it that, so we'll just > have to dismiss this item as "intentional, not a bug". Is this easy to do? Like is there a list of things from coverity to ignore? > I do have a suggestion: I think the "per_buffer_data" variable > should be declared inside the "while (true)" loop not outside. > That way there is no chance of a value being carried across > iterations, so that if for some reason read_stream_next_buffer > failed to do what we expect and did not set per_buffer_data, > we'd be certain to get a null-pointer core dump rather than > accessing data from a previous buffer. Done and pushed. Thanks! - Melanie
Melanie Plageman <melanieplageman@gmail.com> writes:
> On Sun, Feb 16, 2025 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Basically, Coverity doesn't understand that a successful call to
>> read_stream_next_buffer must set per_buffer_data here.  I don't
>> think there's much chance of teaching it that, so we'll just
>> have to dismiss this item as "intentional, not a bug".
> Is this easy to do? Like is there a list of things from coverity to ignore?
Their website has a table of live issues, and we can just mark this
one "dismissed".  I'm not entirely sure how they recognize dismissed
issues --- it's not perfect, because old complaints tend to get
resurrected after changes in nearby code.  But it's good enough.
>> I do have a suggestion: I think the "per_buffer_data" variable
>> should be declared inside the "while (true)" loop not outside.
> Done and pushed. Thanks!
Thanks, looks better now.
            regards, tom lane
			
		Hi.
Em ter., 18 de fev. de 2025 às 11:31, Melanie Plageman <melanieplageman@gmail.com> escreveu:
On Sun, Feb 16, 2025 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Thanks! It's green again.
>
> The security team's Coverity instance complained about this patch:
>
> *** CID 1642971: Null pointer dereferences (FORWARD_NULL)
> /srv/coverity/git/pgsql-git/postgresql/src/backend/access/heap/vacuumlazy.c: 1295 in lazy_scan_heap()
> 1289 buf = read_stream_next_buffer(stream, &per_buffer_data);
> 1290
> 1291 /* The relation is exhausted. */
> 1292 if (!BufferIsValid(buf))
> 1293 break;
> 1294
> >>> CID 1642971: Null pointer dereferences (FORWARD_NULL)
> >>> Dereferencing null pointer "per_buffer_data".
> 1295 blk_info = *((uint8 *) per_buffer_data);
> 1296 CheckBufferIsPinnedOnce(buf);
> 1297 page = BufferGetPage(buf);
> 1298 blkno = BufferGetBlockNumber(buf);
> 1299
> 1300 vacrel->scanned_pages++;
>
> Basically, Coverity doesn't understand that a successful call to
> read_stream_next_buffer must set per_buffer_data here. I don't
> think there's much chance of teaching it that, so we'll just
> have to dismiss this item as "intentional, not a bug".
Is this easy to do? Like is there a list of things from coverity to ignore?
> I do have a suggestion: I think the "per_buffer_data" variable
> should be declared inside the "while (true)" loop not outside.
> That way there is no chance of a value being carried across
> iterations, so that if for some reason read_stream_next_buffer
> failed to do what we expect and did not set per_buffer_data,
> we'd be certain to get a null-pointer core dump rather than
> accessing data from a previous buffer.
Done and pushed. Thanks!
Per Coverity.
CID 1592454: (#1 of 1): Explicit null dereferenced (FORWARD_NULL) 
8. var_deref_op: Dereferencing null pointer per_buffer_data.a invalid per_buffer_data pointer, with a valid buffer.
Sorry if I'm wrong, but the function is very suspicious.
Attached a patch, which tries to fix.
best regards,
Ranier Vilela
Вложения
Hi,
On 2025-02-27 14:32:28 -0300, Ranier Vilela wrote:
> Hi.
> 
> Em ter., 18 de fev. de 2025 às 11:31, Melanie Plageman <
> melanieplageman@gmail.com> escreveu:
> 
> > On Sun, Feb 16, 2025 at 1:12 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > >
> > > Thomas Munro <thomas.munro@gmail.com> writes:
> > > > Thanks!  It's green again.
> > >
> > > The security team's Coverity instance complained about this patch:
> > >
> > > *** CID 1642971:  Null pointer dereferences  (FORWARD_NULL)
> > >
> > /srv/coverity/git/pgsql-git/postgresql/src/backend/access/heap/vacuumlazy.c:
> > 1295 in lazy_scan_heap()
> > > 1289                    buf = read_stream_next_buffer(stream,
> > &per_buffer_data);
> > > 1290
> > > 1291                    /* The relation is exhausted. */
> > > 1292                    if (!BufferIsValid(buf))
> > > 1293                            break;
> > > 1294
> > > >>>     CID 1642971:  Null pointer dereferences  (FORWARD_NULL)
> > > >>>     Dereferencing null pointer "per_buffer_data".
> > > 1295                    blk_info = *((uint8 *) per_buffer_data);
> > > 1296                    CheckBufferIsPinnedOnce(buf);
> > > 1297                    page = BufferGetPage(buf);
> > > 1298                    blkno = BufferGetBlockNumber(buf);
> > > 1299
> > > 1300                    vacrel->scanned_pages++;
> > >
> > > Basically, Coverity doesn't understand that a successful call to
> > > read_stream_next_buffer must set per_buffer_data here.  I don't
> > > think there's much chance of teaching it that, so we'll just
> > > have to dismiss this item as "intentional, not a bug".
> >
> > Is this easy to do? Like is there a list of things from coverity to ignore?
> >
> > > I do have a suggestion: I think the "per_buffer_data" variable
> > > should be declared inside the "while (true)" loop not outside.
> > > That way there is no chance of a value being carried across
> > > iterations, so that if for some reason read_stream_next_buffer
> > > failed to do what we expect and did not set per_buffer_data,
> > > we'd be certain to get a null-pointer core dump rather than
> > > accessing data from a previous buffer.
> >
> > Done and pushed. Thanks!
> >
> Per Coverity.
> 
> CID 1592454: (#1 of 1): Explicit null dereferenced (FORWARD_NULL)
> 8. var_deref_op: Dereferencing null pointer per_buffer_data.
That's exactly what the messages you quoted are discussing, no?
> Sorry if I'm wrong, but the function is very suspicious.
How so?
> diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> index 04bdb5e6d4..18e9b4f3c4 100644
> --- a/src/backend/storage/aio/read_stream.c
> +++ b/src/backend/storage/aio/read_stream.c
> @@ -666,6 +666,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
>                                          READ_BUFFERS_ISSUE_ADVICE : 0)))
>              {
>                  /* Fast return. */
> +                if (per_buffer_data)
> +                    *per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
>                  return buffer;
>              }
A few lines above:
        Assert(stream->per_buffer_data_size == 0);
The fast path isn't used when per buffer data is used.  Adding a check for
per_buffer_data and assigning something to it is nonsensical.
Greetings,
Andres Freund
			
		Hi,
On 2025-02-27 12:44:24 -0500, Andres Freund wrote:
> > CID 1592454: (#1 of 1): Explicit null dereferenced (FORWARD_NULL)
> > 8. var_deref_op: Dereferencing null pointer per_buffer_data.
> 
> That's exactly what the messages you quoted are discussing, no?
Ah, no, it isn't. But I still think the coverity alert and the patch don't
make sense, as per the below:
> > diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> > index 04bdb5e6d4..18e9b4f3c4 100644
> > --- a/src/backend/storage/aio/read_stream.c
> > +++ b/src/backend/storage/aio/read_stream.c
> > @@ -666,6 +666,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> >                                          READ_BUFFERS_ISSUE_ADVICE : 0)))
> >              {
> >                  /* Fast return. */
> > +                if (per_buffer_data)
> > +                    *per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
> >                  return buffer;
> >              }
> 
> A few lines above:
>         Assert(stream->per_buffer_data_size == 0);
> 
> The fast path isn't used when per buffer data is used.  Adding a check for
> per_buffer_data and assigning something to it is nonsensical.
Greetings,
Andres Freund
			
		Em qui., 27 de fev. de 2025 às 14:49, Andres Freund <andres@anarazel.de> escreveu:
Hi,
On 2025-02-27 12:44:24 -0500, Andres Freund wrote:
> > CID 1592454: (#1 of 1): Explicit null dereferenced (FORWARD_NULL)
> > 8. var_deref_op: Dereferencing null pointer per_buffer_data.
>
> That's exactly what the messages you quoted are discussing, no?
Ah, no, it isn't. But I still think the coverity alert and the patch don't
make sense, as per the below:
> > diff --git a/src/backend/storage/aio/read_stream.c b/src/backend/storage/aio/read_stream.c
> > index 04bdb5e6d4..18e9b4f3c4 100644
> > --- a/src/backend/storage/aio/read_stream.c
> > +++ b/src/backend/storage/aio/read_stream.c
> > @@ -666,6 +666,8 @@ read_stream_next_buffer(ReadStream *stream, void **per_buffer_data)
> > READ_BUFFERS_ISSUE_ADVICE : 0)))
> > {
> > /* Fast return. */
> > + if (per_buffer_data)
> > + *per_buffer_data = get_per_buffer_data(stream, oldest_buffer_index);
> > return buffer;
> > }
>
> A few lines above:
> Assert(stream->per_buffer_data_size == 0);
>
> The fast path isn't used when per buffer data is used. Adding a check for
> per_buffer_data and assigning something to it is nonsensical.
Perhaps.
But the fast path and the parameter void **per_buffer_data,
IMHO, is a serious risk in my opinion.
Of course, when in runtime.
best regards,
Ranier Vilela
Andres Freund <andres@anarazel.de> writes:
> Ah, no, it isn't. But I still think the coverity alert and the patch don't
> make sense, as per the below:
Coverity's alert makes perfect sense if you posit that Coverity
doesn't assume that this read_stream_next_buffer call will
only be applied to a stream that has per_buffer_data_size > 0.
(Even if it did understand that, I wouldn't assume that it's
smart enough to see that the fast path will never be taken.)
I wonder if it'd be a good idea to add something like
        Assert(stream->distance == 1);
        Assert(stream->pending_read_nblocks == 0);
        Assert(stream->per_buffer_data_size == 0);
+        Assert(per_buffer_data == NULL);
in read_stream_next_buffer.  I doubt that this will shut Coverity
up, but it would help to catch caller coding errors, i.e. passing
a per_buffer_data pointer when there's no per-buffer data.
On the whole I doubt we can get rid of this warning without some
significant redesign of the read_stream API, and I don't think
it's worth the trouble.  Coverity is a tool not a requirement.
I'm content to just dismiss the warning.
            regards, tom lane
			
		On Thu, Feb 27, 2025 at 1:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wonder if it'd be a good idea to add something like > > Assert(stream->distance == 1); > Assert(stream->pending_read_nblocks == 0); > Assert(stream->per_buffer_data_size == 0); > + Assert(per_buffer_data == NULL); > > in read_stream_next_buffer. I doubt that this will shut Coverity > up, but it would help to catch caller coding errors, i.e. passing > a per_buffer_data pointer when there's no per-buffer data. I think this is a good stopgap. I was discussing adding this assert off-list with Thomas and he wanted to detail his more ambitious plans for type safety improvements in the read stream API. Less on the order of a redesign and more like a separate read_stream_next_buffer()s for when there is per buffer data and when there isn't. And a by-value and by-reference version for the one where there is data. I'll plan to add this assert tomorrow if that discussion doesn't materialize. - Melanie
On Fri, Feb 28, 2025 at 11:58 AM Melanie Plageman <melanieplageman@gmail.com> wrote: > On Thu, Feb 27, 2025 at 1:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wonder if it'd be a good idea to add something like > > > > Assert(stream->distance == 1); > > Assert(stream->pending_read_nblocks == 0); > > Assert(stream->per_buffer_data_size == 0); > > + Assert(per_buffer_data == NULL); > > > > in read_stream_next_buffer. I doubt that this will shut Coverity > > up, but it would help to catch caller coding errors, i.e. passing > > a per_buffer_data pointer when there's no per-buffer data. > > I think this is a good stopgap. I was discussing adding this assert > off-list with Thomas and he wanted to detail his more ambitious plans > for type safety improvements in the read stream API. Less on the order > of a redesign and more like a separate read_stream_next_buffer()s for > when there is per buffer data and when there isn't. And a by-value and > by-reference version for the one where there is data. Here's what I had in mind. Is it better?
Вложения
On Fri, Feb 28, 2025 at 2:29 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Feb 28, 2025 at 11:58 AM Melanie Plageman > <melanieplageman@gmail.com> wrote: > > On Thu, Feb 27, 2025 at 1:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > I wonder if it'd be a good idea to add something like > > > > > > Assert(stream->distance == 1); > > > Assert(stream->pending_read_nblocks == 0); > > > Assert(stream->per_buffer_data_size == 0); > > > + Assert(per_buffer_data == NULL); > > > > > > in read_stream_next_buffer. I doubt that this will shut Coverity > > > up, but it would help to catch caller coding errors, i.e. passing > > > a per_buffer_data pointer when there's no per-buffer data. > > > > I think this is a good stopgap. I was discussing adding this assert > > off-list with Thomas and he wanted to detail his more ambitious plans > > for type safety improvements in the read stream API. Less on the order > > of a redesign and more like a separate read_stream_next_buffer()s for > > when there is per buffer data and when there isn't. And a by-value and > > by-reference version for the one where there is data. > > Here's what I had in mind. Is it better? Here's a slightly better one. I think when you use read_stream_get_buffer_and_value(stream, &value), or read_stream_put_value(stream, space, value), then we should assert that sizeof(value) strictly matches the available space, as shown. But, new in v2, if you use read_stream_get_buffer_and_pointer(stream, &pointer), then sizeof(*pointer) should only have to be <= the storage space, not ==, because someone might plausibly want to make per_buffer_data_size variable at runtime (ie decide when they construct the stream), and then be able to retrieve a pointer to the start of a struct with a flexible array or something like that. In v1 I was just trying to assert that it was a pointer-to-a-pointer-to-something and no more (in a confusing compile-time assertion), but v2 is simpler, and is happy with a pointer to a pointer to something that doesn't exceed the space (run-time assertion).
Вложения
Moved this [1] to PG19-Drafts. Feel free to put it back in a regular commitfest when you want to return to it. [1] https://commitfest.postgresql.org/patch/5617/ -- Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/ "La virtud es el justo medio entre dos defectos" (Aristóteles)
[ seizing on this old commit as being most closely related to the issue ]
Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Jul 16, 2024 at 1:52 PM Noah Misch <noah@leadboat.com> wrote:
>> On Mon, Jul 15, 2024 at 03:26:32PM +1200, Thomas Munro wrote:
>> That's reasonable.  radixtree already forbids mutations concurrent with
>> iteration, so there's no new concurrency hazard.  One alternative is
>> per_buffer_data big enough for MaxOffsetNumber, but that might thrash caches
>> measurably.  That patch is good to go apart from these trivialities:
> Thanks!  I have pushed that patch, without those changes you didn't like.
The security team recently updated our Coverity instance to the latest
version, and it's started complaining as follows:
*** CID 1667418:         Memory - corruptions  (OVERRUN)
/srv/coverity/git/pgsql-git/postgresql/src/backend/access/heap/vacuumlazy.c: 2812             in lazy_vacuum_heap_rel()
2806              * already have the correct page pinned anyway.
2807              */
2808             visibilitymap_pin(vacrel->rel, blkno, &vmbuffer);
2809
2810             /* We need a non-cleanup exclusive lock to mark dead_items unused */
2811             LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
>>>     CID 1667418:         Memory - corruptions  (OVERRUN)
>>>     Overrunning callee's array of size 291 by passing argument "num_offsets" (which evaluates to 2048) in call to
"lazy_vacuum_heap_page".
2812             lazy_vacuum_heap_page(vacrel, blkno, buf, offsets,
2813                                   num_offsets, vmbuffer);
2814
2815             /* Now that we've vacuumed the page, record its available space */
2816             page = BufferGetPage(buf);
2817             freespace = PageGetHeapFreeSpace(page);
The reason it thinks that num_offsets could be as much as 2048 is
presumably the code a little bit above this:
        OffsetNumber offsets[MaxOffsetNumber];
        ...
        num_offsets = TidStoreGetBlockOffsets(iter_result, offsets, lengthof(offsets));
        Assert(num_offsets <= lengthof(offsets));
However, lazy_vacuum_heap_page blindly assumes that the passed value
will be no more than MaxHeapTuplesPerPage.  It seems like we ought
to get these two functions in sync, either both using MaxOffsetNumber
or both using MaxHeapTuplesPerPage for their array sizes.
It looks to me like MaxHeapTuplesPerPage should be sufficient.
Also, after reading TidStoreGetBlockOffsets I wonder if we
should replace that Assert with
        num_offsets = Min(num_offsets, lengthof(offsets));
Thoughts?
            regards, tom lane