Thread: B-tree cache prefetches
Hi hackers!

I've been at a database conference, and here everyone is talking about cache prefetches.

I've tried a simple hack:

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..ffddf553aa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,

 		/* We have low <= mid < high, so mid points at a real slot */

+		OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+		if (x < high)
+			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+		x = low + ((mid - low) / 2);
+		if (x > low)
+			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+
 		result = _bt_compare(rel, keysz, scankey, page, mid);

 		if (result >= cmpval)

The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search. On a simple query

> insert into x select (random()*1000000)::int from generate_series(1,1e7);

it seems to bring something like a 2-4% performance improvement on my laptop.

Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

Best regards, Andrey Borodin.
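For readers who haven't used the builtin: __builtin_prefetch(addr, rw, locality) is a GCC/Clang hint where rw is 0 for a read and locality 0-3 expresses expected temporal locality; it never faults, even for bad addresses. A minimal standalone sketch of the same trick on a plain sorted array (hypothetical function, not part of the patch above):

#include <stddef.h>

/* Binary search over a sorted int array that hints the midpoints of
 * both halves we might descend into next, mirroring the hack above.
 * 0 = read, 2 = moderate temporal locality. */
int
bsearch_prefetch(const int *a, size_t n, int key)
{
	size_t		low = 0;
	size_t		high = n;		/* search [low, high) */

	while (low < high)
	{
		size_t		mid = low + (high - low) / 2;

		if (mid + 1 < high)
			__builtin_prefetch(&a[mid + 1 + (high - mid - 1) / 2], 0, 2);
		if (low < mid)
			__builtin_prefetch(&a[low + (mid - low) / 2], 0, 2);

		if (a[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return (low < n && a[low] == key) ? (int) low : -1;
}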
On Thu, Aug 30, 2018 at 10:53 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search.
> On a simple query
>> insert into x select (random()*1000000)::int from generate_series(1,1e7);
> it seems to bring something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

I once wrote a patch that used __builtin_prefetch() when fetching tuples from a tuplesort. It worked reasonably well on my laptop, but didn't seem to do much on another machine with another microarchitecture (presumably the server with the alternative microarchitecture had superior hardware prefetching). The conclusion was that it wasn't really worth pursuing.

I'm not dismissing your idea. I'm just pointing out that the burden of proving that explicit prefetching is a good idea is rather significant. We especially want to avoid something that needs to be reassessed every couple of years.

--
Peter Geoghegan
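As a hypothetical reconstruction of that kind of tuplesort experiment (invented struct and function names, not the actual patch), the shape is roughly: while returning element i, hint the out-of-line memory that fetching element i + 1 will dereference.

#include <stddef.h>

typedef struct SortSlot
{
	int			key;			/* in-line key */
	void	   *tuple;			/* out-of-line tuple body */
} SortSlot;

/* Return slot *pos's tuple, prefetching the one the next call needs.
 * A wrong or wasted hint is harmless. */
void *
fetch_next_tuple(SortSlot *slots, size_t ntuples, size_t *pos)
{
	if (*pos >= ntuples)
		return NULL;

	if (*pos + 1 < ntuples)
		__builtin_prefetch(slots[*pos + 1].tuple, 0, 1);

	return slots[(*pos)++].tuple;
}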
On Fri, Aug 31, 2018 at 5:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi hackers!
>
> I've been at a database conference, and here everyone is talking about cache prefetches.
>
> I've tried a simple hack:
>
> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index d3700bd082..ffddf553aa 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
>
>  		/* We have low <= mid < high, so mid points at a real slot */
>
> + 		OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
> + 		if (x < high)
> + 			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> + 		x = low + ((mid - low) / 2);
> + 		if (x > low)
> + 			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> +
>  		result = _bt_compare(rel, keysz, scankey, page, mid);
>
>  		if (result >= cmpval)
>
> The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search. On a simple query
>> insert into x select (random()*1000000)::int from generate_series(1,1e7);
> it seems to bring something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

A related topic is the cache-unfriendliness of traditional binary searches of sorted data. Check out "Array Layouts for Comparison-Based Searching" [1] if you haven't already. It says that if your array fits in L2 cache, as our btree pages do, then binary search still wins against Eytzinger et al, which I was a bit disappointed about when I was (casually) investigating new layouts based on some comments from a fellow hacker (I can't remember if it was Andres or Peter G who mentioned this topic to me). However, the paper is talking about "branch-free binary search with explicit prefetch", which apparently eats plain old branchy binary search for breakfast, and I gather they tested on a lot of hardware. That sounds interesting to look into.

[1] https://arxiv.org/pdf/1509.05053.pdf

--
Thomas Munro
http://www.enterprisedb.com
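Reconstructed from the paper's description (a sketch under those assumptions, not the paper's verbatim Listing 2): the search keeps a base pointer and halves the remaining length, replacing the branch with a conditional move, and prefetches the two slots the next iteration can (approximately) probe.

#include <stddef.h>

/* Branch-free lower bound with explicit prefetch over a sorted array. */
const int *
branchfree_lower_bound(const int *base, size_t n, int key)
{
	if (n == 0)
		return base;
	while (n > 1)
	{
		size_t		half = n / 2;

		/* The two slots the next iteration can probe, approximately. */
		__builtin_prefetch(&base[half / 2], 0, 2);
		__builtin_prefetch(&base[half + half / 2], 0, 2);

		/* Compilers typically turn this ternary into a cmov. */
		base = (base[half] < key) ? base + half : base;
		n -= half;
	}
	return base + (*base < key);
}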
On Thu, Aug 30, 2018 at 2:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data. Check out "Array Layouts for
> Comparison-Based Searching" [1] if you haven't already. It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed about when I was (casually) investigating new layouts
> based on some comments from a fellow hacker (I can't remember if it
> was Andres or Peter G who mentioned this topic to me).

It might have been both of us, at different times.

> However, the
> paper is talking about "branch-free binary search with explicit
> prefetch", which apparently eats plain old branchy binary search for
> breakfast, and I gather they tested on a lot of hardware. That sounds
> interesting to look into.

I think that the main win will come from stacking a bunch of different optimizations on top of each other, including key normalization, automatic prefix compression on internal pages (something that's a lot easier than on leaf pages [1]), and abbreviated keys in the ItemId array of nbtree pages [2] (even 15-bit abbreviated keys can have a lot of entropy when combined with these other techniques). Once we have all that in place, I expect drastically fewer iCache misses, making other optimizations interesting.

I'm bullish on the idea of using something like an unrolled binary search, or an interpolation search in the long term. In the meantime, we have too many iCache misses. As a datapoint, iCache misses reportedly dominate the overhead of TPC-C and TPC-E [3]. Nestloop joins with inner index scans are dominant within those benchmarks.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Using_existing_.22minus_infinity.22_index_tuple_as_a_low_key
[2] https://www.postgresql.org/message-id/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3] https://openproceedings.org/2013/conf/edbt/TozunPKJA13.pdf -- "6.1.2 Core stalls"

--
Peter Geoghegan
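A hypothetical illustration of the "abbreviated keys in the ItemId array" idea mentioned above (invented types and packing, not the design from [2]): keep a 15-bit prefix of each key beside its item pointer, and only dereference the pointer when the prefixes cannot decide.

#include <stdint.h>
#include <string.h>

typedef struct AbbrevItem
{
	uint16_t	abbrev;			/* 15-bit key prefix (top bit spare) */
	uint16_t	off;			/* offset of the full key in "page"  */
} AbbrevItem;

/* -1/0/+1 comparison that usually resolves without touching the
 * cache line holding the full key. */
static int
abbrev_cmp(uint16_t probe_abbrev, const char *probe_key, size_t keylen,
		   const AbbrevItem *item, const char *page)
{
	uint16_t	a = item->abbrev & 0x7FFF;
	uint16_t	b = probe_abbrev & 0x7FFF;

	if (b != a)
		return (b < a) ? -1 : 1;

	/* Prefixes tie: fall back to the full key behind the pointer. */
	return memcmp(probe_key, page + item->off, keylen);
}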
Hello!
> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data. Check out "Array Layouts for
> Comparison-Based Searching" [1] if you haven't already. It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed

I've re-read that paper. Their results are not about our case; here's a quote:

> For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
Surely we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.

Also, that research is quite distant from B-tree binsearch: their data items are small and do not represent references to actual data. Eytzinger exploits the fact that two possible future keys share the same cache line. But that is hard to guarantee, given that we place items at fairly random places within the page.
Best regards, Andrey Borodin.
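For reference, a standalone sketch of searching an Eytzinger-ordered array, reconstructed from the paper's description (the stride-16 prefetch distance is illustrative, not prescribed):

#include <stddef.h>

/* a[1..n] holds keys in Eytzinger (BFS) order; a[0] is unused.
 * Children of slot k live at 2k and 2k+1, so whole subtrees of the
 * next few levels are contiguous in memory. */
size_t
eytzinger_lower_bound(const int *a, size_t n, int key)
{
	size_t		k = 1;

	while (k <= n)
	{
		/* 16*k is ~4 levels ahead; its 16 slots are contiguous, and
		 * out-of-range prefetches are harmless no-ops. */
		__builtin_prefetch(&a[16 * k], 0, 2);
		k = 2 * k + (size_t) (a[k] < key);
	}
	/* Cancel the trailing "went right" steps to land on the answer. */
	k >>= __builtin_ffsll((long long) ~k);
	return k;					/* 1-based slot, or 0 if no key >= probe */
}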
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> A related topic is the cache-unfriendliness of traditional binary
>> searches of sorted data. Check out "Array Layouts for
>> Comparison-Based Searching" [1] if you haven't already. It says that
>> if your array fits in L2 cache, as our btree pages do, then binary
>> search still wins against Eytzinger et al, which I was a bit
>> disappointed
>
> I've re-read that paper. Their results are not about our case; here's a quote:
>> For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
>
> Surely we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.

I don't know. Aren't we talking about binary search within one 8KB page here, anyway?

Thinking about some other ideas for cache prefetching in btree code, here's an idea to improve pipelining of index scans (a bit like the two-step pipeline idea from the hash join prefetch thread). We know the TIDs of heap tuples we'll be looking up in the near future, so can we prefetch all the way to the tuple? There are three random accesses that follow soon after reading an index tuple: BufTableLookup(), then the page header, then the tuple itself. At the logical extreme, we could anticipate the probe of SharedBufHash for the TID from index tuple i + 3 (perhaps dynahash could expose a hash_prefetch_for_hash() facility), anticipate the read of the page header for the tuple pointed to by index tuple i + 2 (perhaps a dirty read of SharedBufHash that is only prepared to look at the first entry in the bucket would be OK for this), and anticipate the read of the tuple for the TID from index tuple i + 1 (perhaps a dirty read of the page item table). Or some less ambitious subset, if that pipeline turns out to be too deep; it seems like the simplest experiment would be to try prefetching just SharedBufHash lookups. This may all be completely crazy, I haven't tried any of it.

> Also, that research is quite distant from B-tree binsearch: their data items are small and do not represent references to actual data. Eytzinger exploits the fact that two possible future keys share the same cache line. But that is hard to guarantee, given that we place items at fairly random places within the page.

Based on the above quotes, I'm not suggesting we should use Eytzinger for search-within-page. But out of curiosity, I checked, and saw that you can actually get a pair of index tuples holding a six-character text key or an integer key into a 64-byte cache line.

--
Thomas Munro
http://www.enterprisedb.com
> On 15 Oct 2018, at 2:38, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> I don't know. Aren't we talking about binary search within one 8KB
> page here, anyway?

The paper discusses multiple searches inside one 8KB region, while we are doing a single search in an entirely uncached page and then move away. That's the difference. Binary search is optimal if the page is already in L1 before we start the search.

> Thinking about some other ideas for cache prefetching in btree code,
> here's an idea to improve pipelining of index scans (a bit like the
> two-step pipeline idea from the hash join prefetch thread). We know
> the TIDs of heap tuples we'll be looking up in the near future, so can
> we prefetch all the way to the tuple? There are three random
> accesses that follow soon after reading an index tuple:
> BufTableLookup(), then the page header, then the tuple itself. At the
> logical extreme, we could anticipate the probe of SharedBufHash for
> the TID from index tuple i + 3 (perhaps dynahash could expose a
> hash_prefetch_for_hash() facility), anticipate the read of the
> page header for the tuple pointed to by index tuple i + 2 (perhaps a
> dirty read of SharedBufHash that is only prepared to look at the first
> entry in the bucket would be OK for this), and anticipate the read of
> the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
> the page item table). Or some less ambitious subset, if that pipeline
> turns out to be too deep; it seems like the simplest experiment would
> be to try prefetching just SharedBufHash lookups. This may all be
> completely crazy, I haven't tried any of it.

This idea also looks good to me. One thing bothers me: if we do __builtin_prefetch(&a[b[c[d]]], 0, 1), and a, b, c and d are all not in cache, will we incur CPU wait time for fetching a, b and c? These guys [0] are exploiting C++ coroutines for prefetching this kind of stuff, but it seems like too much hassle.

> Based on the above quotes, I'm not suggesting we should use
> Eytzinger for search-within-page. But out of curiosity, I checked,
> and saw that you can actually get a pair of index tuples holding a
> six-character text key or an integer key into a 64-byte cache line.

Well, this proves that in theory Eytzinger is an opportunity.

Best regards, Andrey Borodin.
[0] http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf
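On the pointer-chasing question above: __builtin_prefetch only hints the *final* access. Its address argument is evaluated normally, so the compiler must still emit ordinary, potentially stalling loads for the inner dereferences. A hypothetical illustration:

/* Forming the address &a[b[c[d]]] requires real loads of c[d] and
 * b[c[d]], which can stall as usual; only the access to a[...] is
 * turned into a non-blocking, non-faulting hint. */
static void
prefetch_chain(const int *a, const int *b, const int *c, int d)
{
	int			cd = c[d];		/* ordinary load: can stall */
	int			bcd = b[cd];	/* ordinary load: can stall */

	__builtin_prefetch(&a[bcd], 0, 1);	/* only this one is a hint */
}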
Hi, Thomas!

> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> [1] https://arxiv.org/pdf/1509.05053.pdf

I've implemented all of the strategies used in that paper.

On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.

The idea of the attached patch is to try ordering the tuples according to different strategies. This ordering happens during VACUUM and can be easily broken by a single insert.

Strategies are switched by #define:
1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. There is no idea behind this strategy; it is expected to perform similarly to the baseline (no changes to the code at all).
2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
3. USE_VEB_ORDER - Van Emde Boas layout: try to pack three tuples (a mid and its two sub-mids) together. The key idea is that you take a cache miss when you read the mid, but the tuples for both possible next steps have already been fetched by that same miss.
4. USE_BT_ORDER - place the two sub-mids together so that you can prefetch both possible next mids with a single cache prefetch.

Cache prefetches of the B-tree binsearch are controlled by USE_PREFETCH.

I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, macOS 10.14.1):

./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres

No changes in configs. Here are the results in tps:

without prefetch
baseline   1448.331199
seq        1433.737204
bt         1463.701527
veb        1457.586089
eyz        1483.654765

with prefetch
baseline   1486.585895
seq        1471.757042
bt         1480.169967
veb        1464.834678
eyz        1460.323392

These numbers are not statistically sound (just one run was done for each number), and the experiment needs to be tuned anyway. From this I can conclude:
1. It is hard to observe a significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.

How can I improve the experimental evaluation of these layout strategies? Other thoughts? I'd be happy to hear any comments on this.

Best regards, Andrey Borodin.
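For concreteness, here is a hypothetical sketch of the pairing idea behind USE_BT_ORDER as described above (a standalone reconstruction, not code from the attached patch): physically place the two possible *next* midpoints of every binary-search interval side by side, so one prefetch of that pair covers both outcomes of the comparison at mid.

#include <stddef.h>

/* For each search interval [lo, hi), emit its two sub-midpoints as an
 * adjacent pair, then recurse.  order[i] records which logical
 * (key-ordered) slot should be stored at physical slot i. */
static void
bt_order_range(size_t lo, size_t hi, size_t *order, size_t *next)
{
	size_t		mid;

	if (lo >= hi)
		return;
	mid = lo + (hi - lo) / 2;
	if (mid > lo)				/* left half nonempty: emit its mid */
		order[(*next)++] = lo + (mid - lo) / 2;
	if (mid + 1 < hi)			/* right half nonempty: emit its mid */
		order[(*next)++] = mid + 1 + (hi - mid - 1) / 2;
	bt_order_range(lo, mid, order, next);
	bt_order_range(mid + 1, hi, order, next);
}

static void
bt_order(size_t n, size_t *order)
{
	size_t		next = 0;

	if (n == 0)
		return;
	order[next++] = n / 2;		/* the root midpoint goes first */
	bt_order_range(0, n, order, &next);
}

During a search one would then issue a single prefetch for the physical pair slot of the current interval; whether that pays for the extra indirection is exactly what the tps numbers above try to measure.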
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I've implemented all of the strategies used in that paper.
>
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.
> The idea of the attached patch is to try ordering the tuples according to different strategies. This ordering happens during VACUUM and can be easily broken by a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. There is no idea behind this strategy; it is expected to perform similarly to the baseline (no changes to the code at all).
> 2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - Van Emde Boas layout: try to pack three tuples (a mid and its two sub-mids) together. The key idea is that you take a cache miss when you read the mid, but the tuples for both possible next steps have already been fetched by that same miss.
> 4. USE_BT_ORDER - place the two sub-mids together so that you can prefetch both possible next mids with a single cache prefetch.
>
> Cache prefetches of the B-tree binsearch are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, macOS 10.14.1):
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are the results in tps:
> without prefetch
> baseline   1448.331199
> seq        1433.737204
> bt         1463.701527
> veb        1457.586089
> eyz        1483.654765
>
> with prefetch
> baseline   1486.585895
> seq        1471.757042
> bt         1480.169967
> veb        1464.834678
> eyz        1460.323392
>
> These numbers are not statistically sound (just one run was done for each number), and the experiment needs to be tuned anyway.
> From this I can conclude:
> 1. It is hard to observe a significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
> 3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
> 4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve the experimental evaluation of these layout strategies?
> Other thoughts? I'd be happy to hear any comments on this.

What about read-only tests with prewarmed indexes? Those numbers look IO-bound.

This is focusing on prefetch within btree code, but I wonder if there is some lower hanging fruit that doesn't require any layout changes: heap prefetch during index scans. By analogy to figure 8a, "software-pipelined prefetching", of [1], I wonder if you could build a pipeline using dirty (unlocked) memory reads, like this:

1. Prefetch the buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch the page header for heap_tids[n + 2], using a dirty read of the first value in the buffer mapping hash bucket
3. Prefetch the heap tuple for heap_tids[n + 1], using a dirty read of the heap page
4. Return the heap tid for heap_tids[n] to the caller

If you're lucky, by the time the caller uses the tid to find the buffer, read the line pointer and access the tuple, all three things will be in cache and hopefully won't have changed since the 'dirty' reads.
Prefetching the wrong cache lines would be possible, obviously, if there is a buffer mapping hash table collision and the one you wanted isn't first, or if stuff has just changed. Maybe there aren't enough cache lines for this 3-level pipeline to work, considering that random other executor nodes could run in between, so perhaps just a 2-level or 1-level pipeline would pay off. As I showed in my toy hash join prefetch experiment [2], I could get a 20-30% speedup from prefetching just hash buckets (equivalent to prefetching buffer mapping hash buckets), and then a further 5-10% speedup from prefetching the first item in the hash bucket, but I expect nested hash joins to interfere with that as they compete for cache lines. I suppose I could try to do three levels and fetch the next tuple in the chain, but that seems unlikely to help because we're getting into lower probabilities that I'd even want that data; in contrast, the index scan -> heap scan case *definitely* needs to go through a line pointer to a tuple somewhere on the page, so a 3-level pipeline might be worth experimenting with. Just a thought, I have not tried any of this. Maybe there are really 4 levels, if you count the buffer descriptor/lock.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
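A compact way to picture that pipeline, with hypothetical stage callbacks standing in for the SharedBufHash probe, page-header and tuple prefetches (nothing here is existing PostgreSQL API):

#include <stddef.h>

typedef struct Pipeline
{
	void		(*stage1)(size_t i);	/* e.g. prefetch hash bucket  */
	void		(*stage2)(size_t i);	/* e.g. prefetch page header  */
	void		(*stage3)(size_t i);	/* e.g. prefetch tuple itself */
	void		(*consume)(size_t i);	/* finally use item i         */
} Pipeline;

/* Run stage s for item i + (4 - s) in each iteration, so by the time
 * consume() needs item i, all three of its cache lines have been
 * hinted in earlier iterations. */
static void
run_pipelined(const Pipeline *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (i + 3 < n)
			p->stage1(i + 3);
		if (i + 2 < n)
			p->stage2(i + 2);
		if (i + 1 < n)
			p->stage3(i + 1);
		p->consume(i);
	}
}

Shortening the pipeline to the "less ambitious subset" mentioned above just means dropping the later stages.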