Re: index prefetching
From: Tomas Vondra
Subject: Re: index prefetching
Msg-id: 970c1c5c-8af4-e2ac-b985-e66331e39701@enterprisedb.com
In reply to: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
Responses: Re: index prefetching (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-hackers
On 6/9/23 01:38, Peter Geoghegan wrote:
> On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> Normal index scans are an even more interesting case but I'm not
>> sure how hard it would be to get that information. It may only be
>> convenient to get the blocks from the last leaf page we looked at,
>> for example.
>>
>> So this suggests we simply started prefetching for the case where the
>> information was readily available, and it'd be harder to do for index
>> scans so that's it.
>
> What the exact historical timeline is may not be that important. My
> emphasis on ScalarArrayOpExpr is partly due to it being a particularly
> compelling case for both parallel index scan and prefetching, in
> general. There are many queries that have huge in() lists that
> naturally benefit a great deal from prefetching. Plus they're common.
>

Did you mean parallel index scan or bitmap index scan? But yeah, I get
the point that SAOP queries are an interesting example of queries to
explore. I'll add some to the next round of tests.

>> Even if SAOP (probably) wasn't the reason, I think you're right it may
>> be an issue for prefetching, causing regressions. It didn't occur to me
>> before, because I'm not that familiar with the btree code and/or how it
>> deals with SAOP (and didn't really intend to study it too deeply).
>
> I'm pretty sure that you understand this already, but just in case:
> ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
> page" in many important cases. Not really -- not in the sense that
> you'd hope and expect. We're senselessly processing the same index
> leaf page multiple times and treating it as a different, independent
> leaf page. That makes heap prefetching of the kind you're working on
> utterly hopeless, since it effectively throws away lots of useful
> context. Obviously that's the fault of nbtree ScalarArrayOpExpr
> handling, not the fault of your patch.
>

I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?

I guess I need to do some testing on a range of data sets / queries,
and see how it works in practice.

>> So if you're planning to work on this for PG17, collaborating on it
>> would be great.
>>
>> For now I plan to just ignore SAOP, or maybe just disable prefetching
>> for SAOP index scans if it proves to be prone to regressions. That's
>> not great, but at least it won't make matters worse.
>
> Makes sense, but I hope that it won't come to that.
>
> IMV it's actually quite reasonable that you didn't expect to have to
> think about ScalarArrayOpExpr at all -- it would make a lot of sense
> if that was already true. But the fact is that it works in a way
> that's pretty silly and naive right now, which will impact
> prefetching. I wasn't really thinking about regressions, though. I was
> actually more concerned about missing opportunities to get the most
> out of prefetching. ScalarArrayOpExpr really matters here.
>

OK

>> I guess something like this might be a "nice" bad case:
>>
>>   insert into btree_test select mod(i,100000), md5(i::text)
>>     from generate_series(1, $ROWS) s(i);
>>
>>   select * from btree_test where a in (999, 1000, 1001, 1002);
>>
>> The values are likely colocated on the same heap page, the bitmap
>> scan is going to do a single prefetch. With an index scan we'll
>> prefetch them repeatedly. I'll give it a try.
>
> This is the sort of thing that I was thinking of. What are the
> conditions under which bitmap index scan starts to make sense? Why is
> the break-even point whatever it is in each case, roughly? And, is it
> actually because of a laws-of-physics level trade-off? Might it not be
> due to implementation-level issues that are much less fundamental? In
> other words, might it actually be that we're just doing something
> stoopid in the case of plain index scans? Something that is just
> papered-over by bitmap index scans right now?
>

Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in unexpected ways
(say, seem like they should improve but don't).

> I see that your patch has logic that avoids repeated prefetching of
> the same block -- plus you have comments that wonder about going
> further by adding a "small lru array" in your new index_prefetch()
> function. I asked you about this during the unconference presentation.
> But I think that my understanding of the situation was slightly
> different to yours. That's relevant here.
>
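FWIW the "small lru array" comment you mention was meant as something
like the sketch below. To be clear, this is not what the patch does
today - it's only an illustration, and index_prefetch_block(),
IndexPrefetchLRU and the array size are all made up here. The idea is
just to remember the last few prefetched heap blocks and skip the
PrefetchBuffer() call when nearby index entries point to the same page:

#include "postgres.h"
#include "storage/bufmgr.h"

#define PREFETCH_LRU_SIZE 8     /* arbitrary, for illustration */

typedef struct IndexPrefetchLRU
{
    /* recently prefetched heap blocks; the caller initializes all
     * entries to InvalidBlockNumber */
    BlockNumber blocks[PREFETCH_LRU_SIZE];
    int         next;           /* round-robin replacement slot */
} IndexPrefetchLRU;

static void
index_prefetch_block(IndexPrefetchLRU *lru, Relation heapRel,
                     BlockNumber block)
{
    /* prefetched recently? then don't issue another prefetch */
    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        if (lru->blocks[i] == block)
            return;
    }

    PrefetchBuffer(heapRel, MAIN_FORKNUM, block);

    /* remember the block, evicting the oldest entry */
    lru->blocks[lru->next] = block;
    lru->next = (lru->next + 1) % PREFETCH_LRU_SIZE;
}

Of course, such a tiny array only suppresses duplicates that occur
close together - with a pattern like [T1, T2, ..., T1] the first block
may already be evicted by the time we see it again, which is why I'm
not sure how much it'd help in practice.
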
> I wonder if you should go further than this, by actually sorting the
> items that you need to fetch as part of processing a given leaf page
> (I said this at the unconference, you may recall). Why should we
> *ever* pin/access the same heap page more than once per leaf page
> processed per index scan? Nothing stops us from returning the tuples
> to the executor in the original logical/index-wise order, despite
> having actually accessed each leaf page's pointed-to heap pages
> slightly out of order (with the aim of avoiding extra pin/unpin
> traffic that isn't truly necessary). We can sort the heap TIDs in
> scratch memory, then do our actual prefetching + heap access, and then
> restore the original order before returning anything.
>

I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea whether this would be a 1% or a 50% improvement, and I
was concerned it might easily lead to regressions if we don't actually
need all the tuples.

I mean, imagine we have TIDs

  [T1, T2, T3, T4, T5, T6]

Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:

  [T1, T5, T6, T2, T3, T4]

But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
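To make the reorder-then-restore idea concrete, I'd imagine something
like the sketch below (again only an illustration - SortedTid and
prefetch_leaf_tids() are made up, and it ignores the LIMIT problem
above). The key part is remembering each TID's original position, so
the tuples can still be returned to the executor in index order:

#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

/* one TID from the current leaf page, plus its original position */
typedef struct SortedTid
{
    ItemPointerData tid;        /* heap TID from the index tuple */
    int             origpos;    /* position in index/logical order */
} SortedTid;

/* qsort comparator: group TIDs by heap block number */
static int
tid_block_cmp(const void *a, const void *b)
{
    BlockNumber ba = ItemPointerGetBlockNumber(&((const SortedTid *) a)->tid);
    BlockNumber bb = ItemPointerGetBlockNumber(&((const SortedTid *) b)->tid);

    if (ba == bb)
        return 0;
    return (ba < bb) ? -1 : 1;
}

/*
 * Prefetch the heap pages needed by one leaf page's TIDs, touching
 * each distinct heap block exactly once. The actual heap access would
 * then proceed in the same block order, stashing tuples by origpos so
 * they can be handed back in the original order.
 */
static void
prefetch_leaf_tids(Relation heapRel, SortedTid *tids, int ntids)
{
    BlockNumber last = InvalidBlockNumber;

    qsort(tids, ntids, sizeof(SortedTid), tid_block_cmp);

    for (int i = 0; i < ntids; i++)
    {
        BlockNumber block = ItemPointerGetBlockNumber(&tids[i].tid);

        if (block != last)      /* each heap page exactly once */
        {
            PrefetchBuffer(heapRel, MAIN_FORKNUM, block);
            last = block;
        }
    }
}

The batch is bounded by the number of items on one leaf page, so the
qsort cost should be negligible compared to the I/O we'd save.
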
> This is conceptually a "mini bitmap index scan", though one that takes
> place "inside" a plain index scan, as it processes one particular leaf
> page. That's the kind of design that "plain index scan vs bitmap index
> scan as a continuum" leads me to (a little like the continuum between
> nested loop joins, block nested loop joins, and merge joins). I bet it
> would be practical to do things this way, and help a lot with some
> kinds of queries. It might even be simpler than avoiding excessive
> prefetching using an LRU cache thing.
>
> I'm talking about problems that exist today, without your patch.
>
> I'll show a concrete example of the kind of index/index scan that
> might be affected.
>
> Attached is an extract of the server log when the regression tests ran
> against a server patched to show custom instrumentation. The log
> output shows exactly what's going on with one particular nbtree
> opportunistic deletion (my point has nothing to do with deletion, but
> it happens to be convenient to make my point in this fashion). This
> specific example involves deletion of tuples from the system catalog
> index "pg_type_typname_nsp_index". There is nothing very atypical
> about it; it just shows a certain kind of heap fragmentation that's
> probably very common.
>
> Imagine a plain index scan involving a query along the lines of
> "select * from pg_type where typname like 'part%' ", or similar. This
> query runs an instant before the example LP_DEAD-bit-driven
> opportunistic deletion (a "simple deletion" in nbtree parlance) took
> place. You'll be able to piece together from the log output that there
> would only be about 4 heap blocks involved with such a query. Ideally,
> our hypothetical index scan would pin each buffer/heap page exactly
> once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
> we're talking about a fairly selective query here, that only needs to
> scan precisely one leaf page (I verified this part too) -- so why
> wouldn't we expect "index scan parity"?
>
> While there is significant clustering on this example leaf page/key
> space, heap TID is not *perfectly* correlated with the
> logical/keyspace order of the index -- which can have outsized
> consequences. Notice that some heap blocks are non-contiguous
> relative to logical/keyspace/index scan/index page offset number
> order.
>
> We'll end up pinning each of the 4 or so heap pages more than once
> (sometimes several times each), when in principle we could have pinned
> each heap page exactly once. In other words, there is way too much of
> a difference between the case where the tuples we scan are *almost*
> perfectly clustered (which is what you see in my example) and the case
> where they're exactly perfectly clustered. In other other words, there
> is way too much of a difference between plain index scan, and bitmap
> index scan.
>
> (What I'm saying here is only true because this is a composite index
> and our query uses "like", returning rows matching a prefix -- if our
> index was on the column "typname" alone and we used a simple equality
> condition in our query then the Postgres 12 nbtree work would be
> enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
> that there are still relatively many important cases where we perform
> extra PinBuffer()/UnpinBuffer() calls during plain index scans that
> only touch one leaf page anyway.)
>
> Obviously we should expect bitmap index scans to have a natural
> advantage over plain index scans whenever there is little or no
> correlation -- that's clear. But that's not what we see here -- we're
> way too sensitive to minor imperfections in clustering that are
> naturally present on some kinds of leaf pages. The potential
> difference in pin/unpin traffic (relative to the bitmap index scan
> case) seems pathological to me. Ideally, we wouldn't have these kinds
> of differences at all. It's going to disrupt usage_count on the
> buffers.
>

I'm not sure I understand all the nuance here, but the thing I take
away is to add tests with different levels of correlation, and probably
also some multi-column indexes.
>>> It's important to carefully distinguish between cases where plain
>>> index scans really are at an inherent disadvantage relative to bitmap
>>> index scans (because there really is no getting around the need to
>>> access the same heap page many times with an index scan) versus cases
>>> that merely *appear* that way. Implementation restrictions that only
>>> really affect the plain index scan case (e.g., the lack of a
>>> reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
>>> should be accounted for when assessing the viability of index scan +
>>> prefetch over bitmap index scan + prefetch. This is very subtle, but
>>> important.
>>>
>>
>> I do agree, but what do you mean by "assessing"?
>
> I mean performance validation. There ought to be a theoretical model
> that describes the relationship between index scan and bitmap index
> scan, that has actual predictive power in the real world, across a
> variety of different cases. Something that isn't sensitive to the
> current phase of the moon (e.g., heap fragmentation along the lines of
> my pg_type_typname_nsp_index log output). I particularly want to avoid
> nasty discontinuities that really make no sense.
>
>> Wasn't the agreement at the unconference session that we'd not tweak
>> costing? So ultimately, this does not really affect which scan type
>> we pick. We'll keep doing the same planning decisions as today, no?
>
> I'm not really talking about tweaking the costing. What I'm saying is
> that we really should expect index scans to behave similarly to bitmap
> index scans at runtime, for queries that really don't have much to
> gain from using a bitmap heap scan (queries that may or may not also
> benefit from prefetching). There are several reasons why this makes
> sense to me.
>
> One reason is that it makes tweaking the actual costing easier later
> on. Also, your point about plan robustness was a good one. If we make
> the wrong choice about index scan vs bitmap index scan, and the
> consequences aren't so bad, that's a very useful enhancement in
> itself.
>
> The most important reason of all may just be to build confidence in
> the design. I'm interested in understanding when and how prefetching
> stops helping.
>

Agreed.

>> I'm all for building a more comprehensive set of test cases - the
>> stuff presented at pgcon was good for demonstration, but it certainly
>> is not enough for testing. The SAOP queries are a great addition, I
>> also plan to run those queries on different (less random) data sets,
>> etc. We'll probably discover more interesting cases as the patch
>> improves.
>
> Definitely.
>
>> There are two reasons why I think the AM is not the right place:
>>
>> - accessing the table from index code seems backwards
>>
>> - we already do prefetching from the executor (nodeBitmapHeapscan.c)
>>
>> It feels kinda wrong in hindsight.
>
> I'm willing to accept that we should do it the way you've done it in
> the patch provisionally. It's complicated enough that it feels like I
> should reserve the right to change my mind.
>
>>>> I think this is an acceptable limitation, certainly for v0.
>>>> Prefetching across multiple leaf pages seems way more complex
>>>> (particularly for the cases using pairing heap), so let's leave this
>>>> for the future.
>>
>> Yeah, I'm not saying it's impossible, and imagined we might teach
>> nbtree to do that. But it seems like work for future someone.
> Right. You probably noticed that this is another case where we'd be
> making index scans behave more like bitmap index scans (perhaps even
> including the downsides for kill_prior_tuple that accompany not
> processing each leaf page inline). There is probably a point where
> that ceases to be sensible, but I don't know what that point is.
> They're way more similar than we seem to imagine.
>

OK. Thanks for all the comments.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company