Discussion: Returning nbtree posting list TIDs in DESC order during backwards scans


Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan
Date:
Currently, nbtree always returns individual posting list TIDs to the
scan in ASC order -- even during a backwards scan (more on why the
deduplication patch deliberately did things that way later). But
allowing backwards scans to return TIDs in "somewhat descending order"
like this results in needless unpinning and re-pinning of the same
heap page. This seems worth fixing.

Patch goals
===========

nbtree scans ought to guarantee that they'll return all TIDs in
whatever order they appear on the page, when read either
front-to-back, or back-to-front (whichever order matches the scan's
current direction). Individual posting list TIDs should also be
returned in either front-to-back or back-to-front order, also
according to the scan's direction. In general, the order that TIDs are
returned in ought to never be affected by the use of deduplication --
including during backwards scans.

Attached patch teaches nbtree's _bt_readpage function to return TIDs
from a posting list in DESC order during backwards scans, bringing
nbtree in line with the ideal described in the previous paragraph.
This approach seems more principled to me, and is likely to have a
small performance benefit.
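
To make that concrete, here is roughly what the backwards-scan posting
list case in _bt_readpage does with the patch applied (just a sketch
that reuses _bt_readpage's existing local variables -- see the attached
patch for the real thing):

    int         nitems = BTreeTupleGetNPosting(itup);

    /* Save the posting list's last TID first... */
    itemIndex--;
    tupleOffset =
        _bt_setuppostingitems(so, itemIndex, offnum,
                              BTreeTupleGetPostingN(itup, nitems - 1),
                              itup);

    /* ...then save the remaining TIDs, walking them back-to-front */
    for (int i = nitems - 2; i >= 0; i--)
    {
        itemIndex--;
        _bt_savepostingitem(so, itemIndex, offnum,
                            BTreeTupleGetPostingN(itup, i),
                            tupleOffset);
    }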

I'll submit this to the first commitfest for Postgres 19.

Test case
=========

Here's an example of the problem on HEAD that the patch fixes:

-- Setup table with low cardinality index:
create table duplicate_test(dup int4);
create index on duplicate_test (dup);
insert into duplicate_test
select val from generate_series(1, 18) val,
                generate_series(1,1000) dups_per_val;

-- Forwards scan, gets 23 buffer hits:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup;

-- Equivalent backwards scan, gets 42 buffer hits on HEAD:
explain analyze
select ctid, * from
duplicate_test
where dup < 5
order by dup desc;

With the patch applied, *both* queries get the expected 23 buffer hits.

I am generally suspicious of cases where backwards scans are at a
gratuitous disadvantage relative to equivalent forwards scans. See
previous work in commit c9c0589f and commit 1bd4bc85 for earlier
examples of fixing problems that have that same quality to them. This
might be the last such patch that I'll ever have to write.

Sorting so->killedItems[] in _bt_killitems
==========================================

Making this change in _bt_readpage creates a problem in _bt_killitems:
it currently expects posting list TIDs in ascending order (the loop
invariant relies on that), which is why the deduplication patch didn't
just do things like this in _bt_readpage to begin with. The patch
deals with that new problem by qsort'ing the so->killedItems[] array
(at the top of _bt_killitems) such that the items always appear
exactly once, in the expected ASC order.

Another benefit of the qsort approach in _bt_killitems is that it'll
gracefully handle scrollable cursor scans that happen to scroll back
and forth on the same leaf page. This might result in the same dead
TID being added to the so->killedItems[] array multiple times, in
neither ASC nor DESC order. That also confuses the loop inside
_bt_killitems, in a way that needlessly prevents setting LP_DEAD bits
(this is a preexisting problem, not a problem created by the changes
that the patch makes to _bt_readpage). Having duplicate TIDs in
so->killedItems[] like this isn't particularly likely, but it seems
worth handling properly all the same, on general principle.

I tend to doubt that the overhead of the qsort will ever really
matter, but if it does then we can always limit it to backwards scans
(meaning limit it to cases where so->currPos.dir indicates that
_bt_readpage processed the page in the backwards scan direction), and
drop the goal of dealing with duplicate so->killedItems[] TIDs
gracefully.

Note that the qsort is completely useless during standard forwards
scans. It's hard to tell those apart from scrollable cursor scans that
briefly changed directions "within a page", though, so I'm inclined to
just always do the qsort, rather than trying to cleverly avoid it.
That's what happens in this v1 of the patch (we qsort unconditionally
in _bt_killitems).
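
Concretely, I have something like the following in mind (a sketch only
-- kitem_cmp is a made-up name here, while qsort_arg and qunique_arg
are the existing src/port and lib/qunique.h helpers):

    /*
     * Hypothetical comparator: orders so->killedItems[] entries
     * (indexes into so->currPos.items[]) by heap TID, ascending
     */
    static int
    kitem_cmp(const void *a, const void *b, void *arg)
    {
        BTScanOpaque    so = (BTScanOpaque) arg;
        BTScanPosItem  *itema = &so->currPos.items[*(const int *) a];
        BTScanPosItem  *itemb = &so->currPos.items[*(const int *) b];

        return ItemPointerCompare(&itema->heapTid, &itemb->heapTid);
    }

    /*
     * At the top of _bt_killitems: restore ASC heap TID order, then
     * squash any duplicate TIDs left behind by a scrollable cursor
     * that changed direction within the page
     */
    qsort_arg(so->killedItems, so->numKilled, sizeof(int), kitem_cmp, so);
    so->numKilled = (int) qunique_arg(so->killedItems, so->numKilled,
                                      sizeof(int), kitem_cmp, so);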

-- 
Peter Geoghegan

Attachments

Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Mircea Cadariu
Date:
Hi,


> -    for (int i = 0; i < numKilled; i++)
> +    for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)

Does the above change mean it will have to do more work in the loop?
Whereas before it visited only the strictly killed items, now it has to
go through all of them?


Kind regards,

Mircea Cadariu




Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan
Date:
On Wed, Jul 16, 2025 at 5:27 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
> Does the above change mean it will have to do more work in the loop?
> Whereas before it visited only the strictly killed items, now it has to
> go through all of them?

Yes, that's true. Any item that the scan returns from the so->currPos
page needs to be considered within the loop.

The loop has an early check for this (for non-itemDead entries) here:

        /* Quickly skip over items never ItemDead-set by btgettuple */
        if (!kitem->itemDead)
            continue;

I really doubt that this matters, because:

* This can only happen when we actually call _bt_killitems in the
first place, so there has to be at least one item whose index tuple
really does need to be LP_DEAD-set.

* The chances of there being a huge number of so->currPos.items[]
items but only one or two with their "itemDead" bit set seem low, in
general.

* The new loop is significantly simpler in that it iterates through
so->currPos.items[] in order, without any of the so->killedItems[]
indirection you see on HEAD. Modern CPUs are likely to skip over
non-itemDead entries very quickly.
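
Putting those pieces together, the new loop's overall shape is roughly
as follows (a sketch assembled from the fragments above, not the patch
verbatim):

    for (int i = so->currPos.firstItem; i <= so->currPos.lastItem; i++)
    {
        BTScanPosItem *kitem = &so->currPos.items[i];

        /* Quickly skip over items never ItemDead-set by btgettuple */
        if (!kitem->itemDead)
            continue;

        /* ... match kitem->heapTid against tuples on the leaf page ... */
    }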

Note that so->killedItems[] (which this patch removes) can be in
ascending leaf-page-wise order, descending leaf-page-wise order, or
(with a scrollable cursor) some random mix of the two -- even when
there are no posting list tuples involved.

--
Peter Geoghegan



Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan
Date:
On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
> The loop has an early check for this (for non-itemDead entries) here:
>
>         /* Quickly skip over items never ItemDead-set by btgettuple */
>         if (!kitem->itemDead)
>             continue;
>
> I really doubt that this matters

I think that this patch isn't too far off being committable. I'll need
to think about issues around adding the new kitem->itemDead bitfield.
I'm not really worried about _bt_killitems; more so about the routines
called by _bt_readpage, which must make sure that the bit is unset
every time a TID is saved in so->currPos.items[].

Attached is v3. Not much has changed. We now defensively unset each
previously set kitem->itemDead within _bt_killitems. I've also added a
pair of macros that represent 0 and 1 values for these kitem->itemDead
bits.

Thanks
--
Peter Geoghegan

Attachments

Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Mircea Cadariu
Date:

Hi,

On 18/07/2025 05:37, Peter Geoghegan wrote:

> I think that this patch isn't too far off being committable.

Agreed. I just tried v3, and got 23 buffer hits, same as in the original demonstrative example. 


> +                    /* Remember all prior TIDs (must be at least one) */
> +                    for (int i = nitems - 2; i >= 0; i--)

This loop has to start from the end in order to return TIDs in DESC order?


> I'm not really worried about _bt_killitems; more so about the routines
> called by _bt_readpage, which must make sure that the bit is unset
> every time a TID is saved in so->currPos.items[].

I did a search through the code for "so->" and had a look at the results for functions I'd expect to see changes in, at a minimum:

  • btbeginscan
  • btrescan
  • btendscan
  • btrestrpos
  • _bt_steppage
  • _bt_readfirstpage

I could find all of the above being touched in v3. 


> Modern CPUs are likely to skip over
> non-itemDead entries very quickly.

Okay, yeah. A sequential iteration through an array will be fast, and we expect the branch predictor to do its job properly with the "if (!kitem->itemDead)".


> I'll need
> to think about issues around adding the new kitem->itemDead bitfield.

It's probably not a concern here, but I got reminded of this ABI break: https://www.postgresql.org/message-id/flat/CABOikdNmVBC1LL6pY26dyxAS2f%2BgLZvTsNt%3D2XbcyG7WxXVBBQ%40mail.gmail.com


Kind regards,

Mircea Cadariu


Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Andy Fan
Date:
Hi,

> On Thu, Jul 17, 2025 at 2:26 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> The loop has an early check for this (for non-itemDead entries) here:
>>
>>         /* Quickly skip over items never ItemDead-set by btgettuple */
>>         if (!kitem->itemDead)
>>             continue;
>>
>> I really doubt that this matters

> I'll need to think about issues around adding the new kitem->itemDead
> bitfield. I'm not really worried about _bt_killitems; more so about
> the routines called by _bt_readpage, which must make sure that the bit
> is unset every time a TID is saved in so->currPos.items[].

Yes, otherwise a live tuple may be marked as dead, which would be
terrible. I think you have unset the bit in all the needed places,
including _bt_saveitem, _bt_savepostingitem and _bt_setuppostingitems
for the first item of a posting list tuple. I can't find anything that
was missed.

(I thought another place for this work might be _bt_returnitem, which
might be a more centralized place to unset the bit. Later I realized
that would be totally wrong: for queries like SELECT * FROM t WHERE
idx_col = 3 LIMIT 3; some items in so->currPos.items[] are filled in
_bt_readpage but may never be returned to the heap at all, while
_bt_killitems still runs against the page as a whole, so unsetting the
bit in _bt_returnitem would be too late...)

I think this patch does two things: one is refactoring the data
structure for _bt_killitems, and the other is scanning the posting list
backwards during backwards scans. Does the change below belong to
either of them? Is it an intentional change?

_bt_killitems:

         if (offnum < minoff)
-            continue;            /* pure paranoia */
+        {
+            /*
+             * Page must have originally been the rightmost page, but has
+             * since split
+             */
+            Assert(!so->dropPin);
+            offnum = minoff;
+        }


At last, I get the same test result as you: the buffer hits go back to
23 in the test case. Thanks for working on this!

> I think that this patch isn't too far off being committable.

I think so.

--
Best Regards
Andy Fan




Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan
Date:
On Sun, Jul 27, 2025 at 9:43 AM Andy Fan <zhihuifan1213@163.com> wrote:
> At last, I get the same test result as you: the buffer hits go back to
> 23 in the test case. Thanks for working on this!
>
> > I think that this patch isn't too far off being committable.
>
> I think so.

Coming back to this patch now, after several months of work on index
prefetching.

I decided that it wasn't such a great idea to reuse/steal an unused
"itemDead" bit from the BTScanPosItem.itemOffset field after all. That
forces _bt_killitems to iterate through every so->currPos.items[] entry, not
just those that are known to require LP_DEAD marking.

Tomas Vondra suggested that I keep killedItems as a separate
allocation (as it is on master), while using a Bitmapset to represent
killedItems (unlike on master, where it is represented using a simple
array). This has all of the same advantages as my previous approach,
but doesn't have the aforementioned disadvantages within _bt_killitems
(plus we no longer need to change BTScanPosItem in any way).

Attached is v4, which does it that way.

My plan is to commit this improved version in the next couple of days.

--
Peter Geoghegan

Attachments

Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Chao Li
Date:
Hi Peter,

The patch v4 looks carefully written and technically solid, and the core logic (switching killedItems[] to a Bitmapset
and reworking backwards posting list scans) is coherent.

I just got a few comments/questions:

> On Dec 3, 2025, at 11:08, Peter Geoghegan <pg@bowt.ie> wrote:
>
> Attached is v4, which does it that way.
>
> My plan is to commit this improved version in the next couple of days.
>
> --
> Peter Geoghegan
> <v4-0001-Return-TIDs-in-desc-order-during-backwards-scans.patch>


1
```
-    /* Always invalidate so->killedItems[] before leaving so->currPos */
-    so->numKilled = 0;
```

The old code only set so->numKilled to 0 and reused the memory of so->killedItems[]; the new code always does
bms_free(so->killedItems) and re-allocates memory the next time a member is added to the bms.

I am thinking that, if there were a bms_clear(), then we could have just cleared the bitmap and reused the memory next
time. But unfortunately, there is no such bms_clear() now. What do you think about adding bms_clear() and using it
here? I don't want to do that now; I can try it once you push this patch.

2
```
+                    /* Set up posting list state (and remember last TID) */
                     itemIndex--;
                     tupleOffset =
                         _bt_setuppostingitems(so, itemIndex, offnum,
-                                              BTreeTupleGetPostingN(itup, 0),
+                                              BTreeTupleGetPostingN(itup, nitems - 1),
                                               itup);
-                    /* Remember additional TIDs */
-                    for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+
+                    /* Remember all prior TIDs (must be at least one) */
+                    for (int i = nitems - 2; i >= 0; i--)
                     {
                         itemIndex--;
                         _bt_savepostingitem(so, itemIndex, offnum,
```

I wonder if the comment "must be at least one" should apply to the assignment of tupleOffset? The "for" loop starts
from nitems - 2; will there still be at least one item?

3
```
                 /*
-                 * Don't bother advancing the outermost loop's int iterator to
-                 * avoid processing killed items that relate to the same
-                 * offnum/posting list tuple.  This micro-optimization hardly
-                 * seems worth it.  (Further iterations of the outermost loop
-                 * will fail to match on this same posting list's first heap
-                 * TID instead, so we'll advance to the next offnum/index
-                 * tuple pretty quickly.)
+                 * Don't advance itemIndex for outermost loop, no matter how
+                 * nextIndex was advanced.  It's possible that items whose
+                 * TIDs weren't matched in posting list can still be killed
+                 * (there might be a later tuple whose TID is a match).
                  */
                 if (j == nposting)
                     killtuple = true;
```

I really don't get what "Don't bother advancing the outermost loop's int iterator" means. Here we only set killtuple to
true, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop; in that case, the outer while loop
"while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)" will advance itemIndex.

I know this is a legacy comment; if you can explain it a little, that will be very appreciated.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Victor Yegorov
Date:
On Wed, Dec 3, 2025 at 06:09 Peter Geoghegan <pg@bowt.ie> wrote:
> Coming back to this patch now, after several months of work on index
> prefetching.
>
> I decided that it wasn't such a great idea to reuse/steal an unused
> "itemDead" bit from the BTScanPosItem.itemOffset field after all. That
> forces _bt_killitems to iterate through every so->currPos.items[]
> entry, not just those that are known to require LP_DEAD marking.
>
> Tomas Vondra suggested that I keep killedItems as a separate
> allocation (as it is on master), while using a Bitmapset to represent
> killedItems (unlike on master, where it is represented using a simple
> array). This has all of the same advantages as my previous approach,
> but doesn't have the aforementioned disadvantages within _bt_killitems
> (plus we no longer need to change BTScanPosItem in any way).
>
> Attached is v4, which does it that way.
>
> My plan is to commit this improved version in the next couple of days.
>
> --
> Peter Geoghegan

Patch looks fine, applies and compiles cleanly, passes tests.

I'd like to point out a missing space after the period in the 2nd paragraph of the commit message;
it falls out of style.

--
Victor Yegorov

Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Mircea Cadariu
Date:

Hi, 

On 03/12/2025 03:08, Peter Geoghegan wrote:
> Coming back to this patch now, after several months of work on index
> prefetching.

Nice, welcome back! It would be great to wrap this up. 

> Tomas Vondra suggested that I keep killedItems as a separate
> allocation (as it is on master), while using a Bitmapset to represent
> killedItems (unlike on master, where it is represented using a simple
> array). This has all of the same advantages as my previous approach,
> but doesn't have the aforementioned disadvantages within _bt_killitems
> (plus we no longer need to change BTScanPosItem in any way).
Good one! 


> -        * Don't bother advancing the outermost loop's int iterator to
> -        * avoid processing killed items that relate to the same
> -        * offnum/posting list tuple.  This micro-optimization hardly
> -        * seems worth it.  (Further iterations of the outermost loop
> -        * will fail to match on this same posting list's first heap
> -        * TID instead, so we'll advance to the next offnum/index
> -        * tuple pretty quickly.)
> +        * Don't advance itemIndex for outermost loop, no matter how
> +        * nextIndex was advanced.  It's possible that items whose
> +        * TIDs weren't matched in posting list can still be killed
> +        * (there might be a later tuple whose TID is a match).
Hmm, as far as I can tell, the old version of the comment seems more accurate. If I understand correctly, it's still
safe to do the micro-optimization, but we choose not to do it because we expect the speedup would not be worth the
increased complexity / risk of introducing bugs.
-- 
Thanks,
Mircea Cadariu

Re: Returning nbtree posting list TIDs in DESC order during backwards scans

From: Peter Geoghegan
Date:
On Wed, Dec 3, 2025 at 7:32 AM Chao Li <li.evan.chao@gmail.com> wrote:
> The old code only set so->numKilled to 0 and reused the memory of so->killedItems[]; the new code always does
> bms_free(so->killedItems) and re-allocates memory the next time a member is added to the bms.
>
> I am thinking that, if there were a bms_clear(), then we could have just cleared the bitmap and reused the memory
> next time. But unfortunately, there is no such bms_clear() now. What do you think about adding bms_clear() and
> using it here? I don't want to do that now; I can try it once you push this patch.

I don't think that it's worth the complexity. We can rely on palloc()
to recycle the memory that was freed by the most recent bms_free().

I say this in part because the complexity of reusing the same
Bitmapset would be rather high. The idea that the only valid
representation of an empty set is a NULL pointer is baked into the
Bitmapset API.

Note also that we'll use much less memory for killedItems by
representing it as a Bitmapset. We'll use at most one bit per
so->currPos.items[] item, whereas before we used 4 bytes per item.
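
To illustrate the pattern (these are just the standard Bitmapset idioms
involved, not patch code):

    /* btgettuple: remember that this item's index tuple is dead */
    so->killedItems = bms_add_member(so->killedItems, itemIndex);

    /* _bt_killitems: walk the set, which always iterates in ASC order */
    itemIndex = -1;
    while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)
    {
        /* ... process so->currPos.items[itemIndex] ... */
    }

    /* Invalidate before leaving so->currPos (empty set == NULL pointer) */
    bms_free(so->killedItems);
    so->killedItems = NULL;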

> I wonder if the comment "must be at least one" should apply to the assignment of tupleOffset? The "for" loop starts
> from nitems - 2; will there still be at least one item?

By definition, a posting list tuple has at least 2 TIDs -- that's a
posting list invariant. If there were only 1 TID, then it wouldn't be a
posting list in the first place. (Unlike within GIN, where single-TID
posting lists are possible.)
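
Spelling that out against the loop in question:

    int         nitems = BTreeTupleGetNPosting(itup);  /* always >= 2 */

    /*
     * The TID at position nitems - 1 was already saved by
     * _bt_setuppostingitems, so this loop starts at nitems - 2 >= 0,
     * and so always saves at least one more ("prior") TID
     */
    for (int i = nitems - 2; i >= 0; i--)
    {
        /* ... _bt_savepostingitem() ... */
    }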


>                                 /*
> -                                * Don't bother advancing the outermost loop's int iterator to
> -                                * avoid processing killed items that relate to the same
> -                                * offnum/posting list tuple.  This micro-optimization hardly
> -                                * seems worth it.  (Further iterations of the outermost loop
> -                                * will fail to match on this same posting list's first heap
> -                                * TID instead, so we'll advance to the next offnum/index
> -                                * tuple pretty quickly.)
> +                                * Don't advance itemIndex for outermost loop, no matter how
> +                                * nextIndex was advanced.  It's possible that items whose
> +                                * TIDs weren't matched in posting list can still be killed
> +                                * (there might be a later tuple whose TID is a match).
>                                  */
>                                 if (j == nposting)
>                                         killtuple = true;
> ```
>
> I really don't get what "Don't bother advancing the outermost loop's int iterator" means. Here we only set killtuple
> to true, then if (killtuple && !ItemIdIsDead(iid)), it breaks the inner while loop; in that case, the outer while
> loop "while ((itemIndex = bms_next_member(so->killedItems, itemIndex)) >= 0)" will advance itemIndex.

The old comment should probably have been written more like the new
one (that I propose to replace it with). It's saying "don't try to be
clever by remembering that we already determined that all the TIDs
that we tried to compare to this posting list aren't matches for *any*
TID". But I don't think that that's accurate; we *haven't* determined
that those TIDs aren't matches for *any and all* TIDs on the page
(only for those now in the posting list specifically). We might still
be able to match those TIDs to later tuples, immediately after the
posting list.

Note that this is all a bit academic and theoretical; in practice it
rarely comes up. During so->dropPin scans (which is the common case),
we'll give up/get scared at the start of _bt_killitems if the page has
changed at all since it was initially examined within _bt_readpage
anyway -- no matter how the page was modified. It doesn't matter that
the page probably *won't* have been modified by VACUUM when
_bt_kilitems gets scared of modifying the page like this (VACUUM is
the only thing that truly makes it unsafe for _bt_killitems to run,
but _bt_killitems is conservative/easily scared).

So while it is true that "We might still be able to match those TIDs
to later tuples, immediately after the posting list" (as I said in the
paragraph before the previous paragraph), we can only do so during
!so->dropPin scans. In other words, only during index-only scans, or
scans of an unlogged relation, or when we don't use an MVCC snapshot.
All of which are rare. Which makes all this fairly
academic/theoretical (mostly it's historical, 10+ years ago *all*
scans were !so->dropPin scans).

--
Peter Geoghegan