Discussion: Problems with get_actual_variable_range's VISITED_PAGES_LIMIT
get_actual_variable_range() scans an index to find actual min/max
values for selectivity estimation. Since this happens during planning,
we can't afford to spend too much time on it. We use
VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
visited that many heap pages, we give up and fall back to pg_statistic
values. See commit 9c6ad5eaa9 for background information.

Recent benchmark results from Mark Callaghan [1] show that
get_actual_variable_range isn't very effective once the problem with
dead index tuples really starts to get out of hand. This is expected
with queue-like tables that continually have older records deleted as
newer ones are inserted. Right now this is the Achilles' heel for
Postgres when it runs on Mark's insert benchmark. It's also likely to
be a problem for TPC-C's order table. Tables that look like that are
reasonably common in real-world workloads IME.

(My work on this problem is also related to the index prefetching
patch, which makes API revisions that aren't really compatible with
the current VISITED_PAGES_LIMIT design. More on that later.)

Problem
=======

VISITED_PAGES_LIMIT counts heap page visits. But when there are many
index tuples marked LP_DEAD, btgettuple will traverse indefinitely
many *index* pages before returning even a single tuple to the
selfuncs.c caller. The selfuncs.c heap page counter never gets a
chance to increment. Profiling by Mark shows that we waste quite a lot
of CPU cycles in nbtree functions like _bt_readpage when this happens.
In practice there is no *natural limit* on the amount of work that we
can do during any one call to get_actual_variable_range --
VISITED_PAGES_LIMIT doesn't consider costs related to reading index
leaf pages at all.

The design of get_actual_variable_range specifically aims to set
LP_DEAD bits, with the goal of helping future calls to
get_actual_variable_range avoid heap fetches. But there's an important
sense in which this hurts rather than helps: each LP_DEAD tuple is one
less that can be counted against VISITED_PAGES_LIMIT going forward.
The more LP_DEAD bits we set, the less effective VISITED_PAGES_LIMIT
becomes at bailing out early during future calls to
get_actual_variable_range. Inevitably, the cost of reading index leaf
pages becomes the bottleneck -- at least past some tipping point.

That effect seems perverse -- shouldn't setting LP_DEAD bits in index
tuples be strictly a good thing? I suspect that VISITED_PAGES_LIMIT is
fundamentally addressing the problem in the wrong way.

Even without VISITED_PAGES_LIMIT, or anything like it, it is
reasonable to expect that get_actual_variable_range will require a
very small, fixed amount of work to find an extremal value in the vast
majority of cases. After all, we only need to locate one non-dead
extremal index tuple, and then we're done. If we get anywhere near
VISITED_PAGES_LIMIT, then it's already reasonable to suppose that
we're dealing with a pathological workload -- a workload exactly like
the one that Mark's testing ran into.

ISTM that Mark has demonstrated that VISITED_PAGES_LIMIT doesn't work
particularly well in general. AFAICT there is nothing special about
Mark's complaint -- it's not some special case that the
VISITED_PAGES_LIMIT design didn't directly consider. Any case where
get_actual_variable_range does many heap fetches is likely to look
rather similar. Again, we only need to find one non-dead index tuple!
If we can't find a suitable non-dead tuple on the first index page we
look at, then why should we find one on the second or the third?
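To make this concrete, here's a heavily simplified sketch of the
relevant selfuncs.c loop (paraphrased from memory -- not the verbatim
code, and it glosses over details like buffer cleanup):

/*
 * Sketch of get_actual_variable_endpoint()'s main loop.  The budget
 * is only ever charged for heap page visits.  Each call to
 * index_getnext_tid() can read arbitrarily many index leaf pages
 * internally while skipping LP_DEAD index tuples, and none of that
 * work is counted.
 */
while ((tid = index_getnext_tid(index_scan, indexscandir)) != NULL)
{
    BlockNumber block = ItemPointerGetBlockNumber(tid);

    if (!VM_ALL_VISIBLE(heapRel, block, &vmbuffer))
    {
        /* Charge this heap page visit against the budget */
        if (++n_visited_heap_pages > VISITED_PAGES_LIMIT)
            return false;       /* give up; use pg_statistic values */

        if (!index_fetch_heap(index_scan, tableslot))
            continue;           /* no visible tuple here; next TID */
    }

    break;                      /* found a visible extremal tuple */
}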
Proposed solution
=================

We really should use an approach that makes a *hard guarantee* that
the costs that we pay will not exceed some known threshold (whether
those costs are from scanning index pages or scanning heap pages). If
we completely ignore costs paid by the index AM layer, then the
problem just shifts to that layer before too long.

Rather than counting heap page visits, I propose limiting the scan to
the extremal leaf page only. If the extremal index leaf page yields no
visible tuples, we give up immediately. The latest version of the
index prefetching patch [2] adds a WIP patch that does just that.

Testing of the patch posted to that other thread today shows that this
is much more effective. For example, with a 5 million row table, the
latency profile of running EXPLAIN about 66,000 times works out as
follows (with no dead tuples):

Baseline (no dead tuples):

Metric     Master       Patch        Ratio
---------- ------------ ------------ ----------
Avg        0.081        0.081        0.997x
p95        0.096        0.095        0.993x
p99        0.108        0.111        1.028x

Unsurprisingly, the profile for the patch looks very similar to that
of master. But when my test script deletes 20% of the tuples in the
table, and repeats the same process, things look very different:

Main (with bulk-deleted tuples, concurrent inserts/deletes):

Metric     Master       Patch        Ratio
---------- ------------ ------------ ----------
Avg        0.933        0.080        0.085x
p95        1.167        0.105        0.090x
p99        1.195        0.133        0.112x

The patch is a little over 10x faster than master now. More
importantly, the patch performs very much like it did the first time
around, before we deleted any index tuples. We have a much more
meaningful guarantee about how bad things can get in the worst case.

I don't think that we need to consider heap visits; just limiting the
scan to a single leaf page seems to be sufficient. I'm open to the
idea of giving some consideration to the number of heap fetches
required, if testing can show that we still need some of that/need to
be cleverer. But I'm sure that we absolutely need to pay attention to
the cost of reading index leaf pages to do well here.

Note on index prefetching patch
===============================

As I touched on briefly already, and went into in detail on the index
prefetching patch thread, the current VISITED_PAGES_LIMIT design is
hard to make work with the API revisions proposed by the index
prefetching patch. A recap of the problems it creates for the index
prefetching patch seems in order:

The index prefetching patch adds a new table_index_getnext_slot()
interface, rather like the current index_getnext_slot -- we need to
use that from selfuncs.c now (not the low-level index_getnext_tid +
index_fetch_heap interfaces). All visibility map lookups happen on the
table AM/heapam side with this new interface, which is a more modular
design. But it leaves selfuncs.c without any way to implement ad-hoc
counting of heap page accesses to enforce VISITED_PAGES_LIMIT; all of
the relevant heapam implementation details are hidden from callers
such as selfuncs.c.

I could work around this problem by inventing a new way to enforce
VISITED_PAGES_LIMIT that lives on the table AM side. But that doesn't
seem very appealing. We have to accommodate selfuncs.c's requirements
in lower-level code either way, it seems, so we might as well add that
mechanism to nbtree directly, since it seems to fix the existing
problems with get_actual_variable_range.
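On the nbtree side, the shape I have in mind is roughly the following
(the names here are invented for illustration -- this isn't the actual
WIP patch code):

/*
 * Hypothetical: when the caller has asked for an endpoint-only scan,
 * never step off of the extremal leaf page.  If that page yielded
 * nothing visible, the scan simply ends.
 */
static bool
_bt_steppage_sketch(IndexScanDesc scan, ScanDirection dir)
{
    BTScanOpaque so = (BTScanOpaque) scan->opaque;

    if (scan->endpoint_only_scan)       /* hypothetical flag */
    {
        /*
         * Give up immediately, rather than walking through sibling
         * leaf pages full of LP_DEAD index tuples indefinitely.
         */
        BTScanPosInvalidate(so->currPos);
        return false;
    }

    /* otherwise, step to the next leaf page, just like today */
    return _bt_next_leaf_page(scan, dir);   /* hypothetical helper */
}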
Thoughts?

[1] https://smalldatum.blogspot.com/2026/01/cpu-bound-insert-benchmark-vs-postgres.html
[2] https://postgr.es/m/CAH2-Wznv9_KGqHQ1vCW2pkiA6QskBGcx5NC_-UXnD6GEQasvAQ@mail.gmail.com

--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
> VISITED_PAGES_LIMIT counts heap page visits. But when there are many
> index tuples marked LP_DEAD, btgettuple will traverse
> indefinitely-many *index* pages before returning even a single tuple
> to the selfuncs.c caller.
Yup, that's a problem.
> Proposed solution
> =================
> Rather than counting heap page visits, I propose limiting the scan to
> the extremal leaf page only. If the extremal index leaf page yields no
> visible tuples, we give up immediately. The latest version of the
> index prefetching patch [2] adds a WIP patch that does just that.
I think that's throwing the baby out with the bathwater. In exchange
for a tight limit on planner time expended, you have an enormously
increased chance of getting no useful data at all. For one thing,
it is likely that the extremal page is only partly filled, so that
there may be only a very small number of index entries that this
strategy would consider at all.
Maybe it would work to express the limit as number of dead index
entries we will skip past before failing? I'd want a threshold of
maybe a thousand, but that would still work to bound the index AM's
work as well as the heap AM's.
> Testing of the patch posted to that other thead today shows that this
> is much more effective.
What definition of "effective" are you using? It sounds like you are
putting zero weight on successfully obtaining an estimate, and I find
that approach quite unacceptable.
regards, tom lane
On Mon, Feb 9, 2026 at 7:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Proposed solution
> > =================
>
> > Rather than counting heap page visits, I propose limiting the scan to
> > the extremal leaf page only. If the extremal index leaf page yields no
> > visible tuples, we give up immediately. The latest version of the
> > index prefetching patch [2] adds a WIP patch that does just that.
>
> I think that's throwing the baby out with the bathwater. In exchange
> for a tight limit on planner time expended, you have an enormously
> increased chance of getting no useful data at all.

It's true that I have only begun to examine how much of a risk this
is. It is still a WIP patch.

> Maybe it would work to express the limit as number of dead index
> entries we will skip past before failing?

Something like that makes sense. Perhaps the limit should be expressed
in terms of both index tuples and index pages: we allow the scan to
visit additional leaf pages if and only if that is necessary to
examine the required number of index tuples, but we also apply a hard
cap on the total number of index pages read, enforced regardless of
how many index tuples we have read. We should have some concern about
sparse deletion patterns and the like, where each leaf page has only 1
or 2 index tuples (there might be no LP_DEAD-marked index tuples that
count towards this index tuple limit within such an index).

Obviously, the number of index tuples isn't necessarily the same thing
as the number of TIDs contained within those index tuples. With nbtree
deduplication, a single leaf page can contain as many as ~1350
distinct TIDs. I wonder if that difference needs to be taken into
account...
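To make the shape of that concrete, a rough sketch of the dual budget
(all names and thresholds here are invented for illustration -- this
isn't patch code):

/*
 * Hypothetical dual budget, enforced inside the index AM.  The scan
 * may step to further leaf pages only while index tuple budget
 * remains, and a hard cap on leaf pages applies regardless, so that
 * sparse pages can't keep the scan alive indefinitely.
 */
#define ENDPOINT_TUPLES_LIMIT   1000    /* per Tom's suggestion */
#define ENDPOINT_PAGES_LIMIT    10      /* hard cap, invented here */

typedef struct EndpointScanBudget
{
    int         tuples_examined;    /* index tuples (or TIDs?) seen */
    int         pages_read;         /* leaf pages read so far */
} EndpointScanBudget;

/* Call before stepping to each additional leaf page */
static inline bool
endpoint_budget_step_ok(EndpointScanBudget *budget)
{
    if (budget->pages_read >= ENDPOINT_PAGES_LIMIT)
        return false;           /* hard cap on leaf pages: give up */
    if (budget->tuples_examined >= ENDPOINT_TUPLES_LIMIT)
        return false;           /* examined plenty already: give up */
    budget->pages_read++;
    return true;
}

/* Call once per index tuple examined; ntids > 1 for posting lists */
static inline void
endpoint_budget_count_tuple(EndpointScanBudget *budget, int ntids)
{
    budget->tuples_examined += ntids;
}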
> I'd want a threshold of
> maybe a thousand, but that would still work to bound the index AM's
> work as well as the heap AM's.

Ideally we could keep the mechanism confined to nbtree (or other index
AMs that implement the functionality that selfuncs.c will rely on).
I'd like to avoid inventing a new heapam-specific mechanism that has
to somehow coordinate with the nbtree mechanism that I have in mind
for this.

--
Peter Geoghegan


On Tue, Feb 10, 2026 at 12:15 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> get_actual_variable_range() scans an index to find actual min/max
> values for selectivity estimation. Since this happens during planning,
> we can't afford to spend too much time on it. We use
> VISITED_PAGES_LIMIT (100 heap pages) as a threshold -- once we've
> visited that many heap pages, we give up and fall back to pg_statistic
> values. See commit 9c6ad5eaa9 for background information.
[..]
> Thoughts?

FWIW, dunno if this helps, but from what I remember, 9c6ad5eaa957
itself was born literally in about 5 minutes, as a very quick bandage
to rescue some big OLTP database (in a backpatchable way), so there
was no serious research involved back then. AFAIR the alternative idea
was simply to cage it into a max time limit (using a time delta), but
that was basically the same idea.

The 100-page limit there translates directly to WE_HAVE_NO_IDEA, so I
think you could simply fall back after N*100 index pages (instead of
heap pages) and it would still be good enough -- why? Well, because
the below are the timings from, literally, the case that caused this
back then (not kidding):

Planning Time: 1620687.653 ms
Execution Time: 970.716 ms

I just worry: if get_actual_variable_range() starts reporting much
worse estimates, what do we do then? Well, maybe we'll have
pg_plan_advice by then, or maybe the depth (time?) of the search
should somehow be bound to the column's statistics_target too --
though on second thought it looks like it should be more a property of
the query itself (so maybe some GUC?)

-J.
On Mon, Feb 9, 2026 at 8:27 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Mon, Feb 9, 2026 at 7:39 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Proposed solution
> > > =================
> >
> > > Rather than counting heap page visits, I propose limiting the scan
> > > to the extremal leaf page only. If the extremal index leaf page
> > > yields no visible tuples, we give up immediately. The latest
> > > version of the index prefetching patch [2] adds a WIP patch that
> > > does just that.
> >
> > I think that's throwing the baby out with the bathwater. In exchange
> > for a tight limit on planner time expended, you have an enormously
> > increased chance of getting no useful data at all.
>
> It's true that I have only begun to examine how much of a risk this
> is. It is still a WIP patch.

The latest revision of the index prefetching patch set (v12) has an
improved mechanism that takes your concerns into account:

https://postgr.es/m/CAH2-Wz=g=JTSyDB4UtB5su2ZcvsS7VbP+ZMvvaG6ABoCb+s8Lw@mail.gmail.com

The mechanism will now hold on longer. In practice, it will read up to
3 index leaf pages before giving up. The precise threshold is
controlled by the selfuncs.c caller -- see
v12-0009-Limit-get_actual_variable_range-to-scan-three-in.patch
(unfortunately there's no easy way to break this out into an
independent patch).
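To give a sense of the division of labor (again with invented names --
see the v12 patch for the real interface):

/*
 * Hypothetical: selfuncs.c picks the budget, and the index AM
 * enforces it when deciding whether to step to another leaf page.
 */
#define ENDPOINT_LEAF_PAGES_LIMIT   3   /* read at most 3 leaf pages */

static void
setup_endpoint_scan(IndexScanDesc index_scan)
{
    /* hypothetical field, consulted by nbtree when stepping pages */
    index_scan->endpoint_leaf_pages_limit = ENDPOINT_LEAF_PAGES_LIMIT;
}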
--
Peter Geoghegan


On Tue, Feb 10, 2026 at 4:26 AM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
> Well, because the below are the timings from, literally, the case
> that caused this back then (not kidding):
>
> Planning Time: 1620687.653 ms
> Execution Time: 970.716 ms

To be fair, the fact that selfuncs.c is very deliberate about setting
LP_DEAD bits in index pages these days was likely a big help. AFAICT,
the only problem that remains is with VISITED_PAGES_LIMIT itself,
which doesn't seem very well attuned to the costs that we need to keep
reasonably well bounded.

> I just worry: if get_actual_variable_range() starts reporting much
> worse estimates, what do we do then? Well, maybe we'll have
> pg_plan_advice by then, or maybe the depth (time?) of the search
> should somehow be bound to the column's statistics_target too --
> though on second thought it looks like it should be more a property
> of the query itself (so maybe some GUC?)

The purpose of get_actual_variable_range is to get the extremal value
in the index -- a single value from a single tuple. If we can't manage
that by reading only 2 or 3 pages, then I see no good reason to
believe that we can by reading many more pages.

In other words, we're perfectly justified in making a soft expectation
that it'll require very little work to get an extremal value. But once
we notice that our soft assumption doesn't hold, all bets are off --
we're then justified in giving up relatively quickly. Having several
contiguous pages that are completely empty (or empty of
non-LP_DEAD-marked tuples) at the extreme leftmost or rightmost end of
the index is absolutely a pathological case. It strongly suggests a
queue-like workload of the kind that my testing shows
VISITED_PAGES_LIMIT can't deal with sensibly.

get_actual_variable_range exists to ascertain information about a
particular index. To me, it makes perfect sense that a mechanism such
as this would work in terms of costs paid on the index AM side. (The
heap fetches matter a great deal too, of course, but it's reasonable
to use index costs as a proxy for heap/table AM costs -- it's not
reasonable to use table/heap AM costs as a proxy for index AM costs.
It doesn't work both ways.)

--
Peter Geoghegan
Hi,

On 2026-03-10 17:03:14 -0400, Peter Geoghegan wrote:
> On Tue, Feb 10, 2026 at 4:26 AM Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > I just worry: if get_actual_variable_range() starts reporting much
> > worse estimates, what do we do then? Well, maybe we'll have
> > pg_plan_advice by then, or maybe the depth (time?) of the search
> > should somehow be bound to the column's statistics_target too --
> > though on second thought it looks like it should be more a property
> > of the query itself (so maybe some GUC?)
>
> The purpose of get_actual_variable_range is to get the extremal value
> in the index -- a single value from a single tuple. If we can't manage
> that by reading only 2 or 3 pages, then I see no good reason to
> believe that we can by reading many more pages.
>
> In other words, we're perfectly justified in making a soft expectation
> that it'll require very little work to get an extremal value. But once
> we notice that our soft assumption doesn't hold, all bets are off --
> we're then justified in giving up relatively quickly. Having several
> contiguous pages that are completely empty (or empty of
> non-LP_DEAD-marked tuples) at the extreme leftmost or rightmost end of
> the index is absolutely a pathological case. It strongly suggests a
> queue-like workload of the kind that my testing shows
> VISITED_PAGES_LIMIT can't deal with sensibly.

You don't need a queuing workload to occasionally have a few pages of
dead tuples at one end of the value range. E.g. a small batch insert
that rolls back due to a unique conflict is enough. The insert case is
also the one where, without get_actual_variable_range(), we will often
end up with completely bogus estimates, as obviously the stored stats
won't yet know about newly inserted data.

> get_actual_variable_range exists to ascertain information about a
> particular index. To me, it makes perfect sense that a mechanism such
> as this would work in terms of costs paid on the index AM side. (The
> heap fetches matter a great deal too, of course, but it's reasonable
> to use index costs as a proxy for heap/table AM costs -- it's not
> reasonable to use table/heap AM costs as a proxy for index AM costs.
> It doesn't work both ways.)

I am not convinced it's true in either direction -- TIDs from three
leaf pages can point to a very small number of heap pages or to
hundreds of heap pages. Why is the index AM cost a good proxy for the
table? If anything it's more sane the other way round (i.e. how it is
today).

Greetings,

Andres Freund
On Tue, Mar 10, 2026 at 6:08 PM Andres Freund <andres@anarazel.de> wrote:
> You don't need a queuing workload to occasionally have a few pages of
> dead tuples at one end of the value range. E.g. a small batch insert
> that rolls back due to a unique conflict is enough. The insert case is
> also the one where, without get_actual_variable_range(), we will often
> end up with completely bogus estimates, as obviously the stored stats
> won't yet know about newly inserted data.

True, but why would you expect the problem to be any worse with the
approach that I've proposed? If we conservatively assume that each
leaf page points to 6 distinct heap pages, as is the case with
pgbench_accounts_pkey (which is very low, and thus favors your
argument), we can expect to examine about 2.5 x 6 = 15 heap pages
before giving up with the approach I'm proposing. That's not all that
different from the current behavior.

Besides all this, the scenario you're describing is unlikely to last
long. Clients encountering the problem you describe tend to repeat the
transactions that initially failed. All it takes is another
transaction that inserts another tuple into the rightmost leaf page,
or one close by. This will immediately make it possible for
get_actual_variable_range to find a valid extremal value in the index
once more.

> > get_actual_variable_range exists to ascertain information about a
> > particular index. To me, it makes perfect sense that a mechanism such
> > as this would work in terms of costs paid on the index AM side. (The
> > heap fetches matter a great deal too, of course, but it's reasonable
> > to use index costs as a proxy for heap/table AM costs -- it's not
> > reasonable to use table/heap AM costs as a proxy for index AM costs.
> > It doesn't work both ways.)
>
> I am not convinced it's true in either direction -- TIDs from three
> leaf pages can point to a very small number of heap pages or to
> hundreds of heap pages.

I was unclear about this point. There is definitely a wide natural
variation in how many distinct heap blocks appear on each leaf page.
And the approach I have in mind is almost completely indifferent to
that. Though, of course, it still imposes a natural ceiling on how
many heap pages can be read: there can only be so many distinct heap
blocks on one leaf page (that was what I referred to above).

I believe that it's actually a good thing that my approach doesn't
*directly* care about the number of heap accesses. Ignoring the
natural variation in heap layout will produce more consistent behavior
over time, and make it less likely that we'll give up too early. This
is helped by the fact that we can set LP_DEAD bits on dead tuples in
passing (and by the natural ceiling).

> Why is the index AM cost a good proxy for the table? If anything it's
> more sane the other way round (i.e. how it is today).

At a high level, we need to reason about the general conditions in the
index without ever doing excessive work. The boundaries between leaf
pages, and leaf page density, provide relevant hints about the overall
picture in the index. If we see a leaf page with one 16 byte index
tuple, it is very reasonable to surmise that it once held hundreds. To
a lesser degree, it is valid to expect that its sibling leaf pages
will also have very few index tuples, likely due to a sparse deletion
pattern. Indexes are organized logically/form a key space, so making
these kinds of inferences is much more reasonable than it would be
when looking at a group of physically contiguous heap pages.
By the same token, if we encounter a small run of logically contiguous
leaf pages that are all completely empty, it's a pathological case;
the chances of there being many more empty leaf pages after that shoot
up. The more empty leaf pages we see, the more we should expect to
see. As I keep pointing out, all we need to do is find a single index
tuple -- which shouldn't be hard at all, barring these truly
pathological cases.

I'm trying to be practical here. I want a design that is compatible
with the changes that we're making to the table AM interface, and that
fixes Mark's complaint. Ideally, this should avoid adding much new
code to hot code paths. Any design that meets those goals is
acceptable to me.

--
Peter Geoghegan