Thread: B-tree cache prefetches
Hi hackers!

I've been at a database conference, and here everyone is talking about cache prefetches.

I've tried a simple hack:

diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index d3700bd082..ffddf553aa 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,

 		/* We have low <= mid < high, so mid points at a real slot */

+		OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
+		if (x < high)
+			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+		x = low + ((mid - low) / 2);
+		if (x > low)
+			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
+
 		result = _bt_compare(rel, keysz, scankey, page, mid);

 		if (result >= cmpval)

The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search. On a simple query

> insert into x select (random()*1000000)::int from generate_series(1,1e7);

it seems to bring something like a 2-4% performance improvement on my laptop.

Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

Best regards, Andrey Borodin.
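For readers who haven't used the builtin: __builtin_prefetch(addr, rw, locality) is a GCC/Clang hint where rw is 0 for a read and locality 0-3 expresses expected temporal locality; it never faults, even for bad addresses. A minimal standalone sketch of the same trick on a plain sorted array (hypothetical function, not part of the patch above):

#include <stddef.h>

/* Binary search over a sorted int array that hints the midpoints of
 * both halves we might descend into next, mirroring the hack above.
 * 0 = read, 2 = moderate temporal locality. */
int
bsearch_prefetch(const int *a, size_t n, int key)
{
	size_t		low = 0;
	size_t		high = n;		/* search [low, high) */

	while (low < high)
	{
		size_t		mid = low + (high - low) / 2;

		if (mid + 1 < high)
			__builtin_prefetch(&a[mid + 1 + (high - mid - 1) / 2], 0, 2);
		if (low < mid)
			__builtin_prefetch(&a[low + (mid - low) / 2], 0, 2);

		if (a[mid] < key)
			low = mid + 1;
		else
			high = mid;
	}
	return (low < n && a[low] == key) ? (int) low : -1;
}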
On Thu, Aug 30, 2018 at 10:53 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search.
> On a simple query
>> insert into x select (random()*1000000)::int from generate_series(1,1e7);
> it seems to bring something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

I once wrote a patch that used __builtin_prefetch() when fetching tuples from a tuplesort. It worked reasonably well on my laptop, but didn't seem to do much on another machine with another microarchitecture (presumably the server with the alternative microarchitecture had superior hardware prefetching). The conclusion was that it wasn't really worth pursuing.

I'm not dismissing your idea. I'm just pointing out that the burden of proving that explicit prefetching is a good idea is rather significant. We especially want to avoid something that needs to be reassessed every couple of years.

--
Peter Geoghegan
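As a hypothetical reconstruction of that kind of tuplesort experiment (invented struct and function names, not the actual patch), the shape is roughly: while returning element i, hint the out-of-line memory that fetching element i + 1 will dereference.

#include <stddef.h>

typedef struct SortSlot
{
	int			key;			/* in-line key */
	void	   *tuple;			/* out-of-line tuple body */
} SortSlot;

/* Return slot *pos's tuple, prefetching the one the next call needs.
 * A wrong or wasted hint is harmless. */
void *
fetch_next_tuple(SortSlot *slots, size_t ntuples, size_t *pos)
{
	if (*pos >= ntuples)
		return NULL;

	if (*pos + 1 < ntuples)
		__builtin_prefetch(slots[*pos + 1].tuple, 0, 1);

	return slots[(*pos)++].tuple;
}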
On Fri, Aug 31, 2018 at 5:53 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi hackers!
>
> I've been at a database conference, and here everyone is talking about cache prefetches.
>
> I've tried a simple hack:
>
> diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
> index d3700bd082..ffddf553aa 100644
> --- a/src/backend/access/nbtree/nbtsearch.c
> +++ b/src/backend/access/nbtree/nbtsearch.c
> @@ -401,6 +401,13 @@ _bt_binsrch(Relation rel,
>
>  		/* We have low <= mid < high, so mid points at a real slot */
>
> + 		OffsetNumber x = mid + 1 + ((high - mid + 1) / 2);
> + 		if (x < high)
> + 			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> + 		x = low + ((mid - low) / 2);
> + 		if (x > low)
> + 			__builtin_prefetch(PageGetItem(page, PageGetItemId(page, x)), 0, 2);
> +
>  		result = _bt_compare(rel, keysz, scankey, page, mid);
>
>  		if (result >= cmpval)
>
> The idea is pretty simple: our search trashes the cache anyway, so let's try to recover at least some of that by prefetching the possible next probes of the binary search. On a simple query
>> insert into x select (random()*1000000)::int from generate_series(1,1e7);
> it seems to bring something like a 2-4% performance improvement on my laptop.
>
> Is there a reason why we do not use __builtin_prefetch? Has anyone tried to use cache prefetching?

A related topic is the cache-unfriendliness of traditional binary searches of sorted data. Check out "Array Layouts for Comparison-Based Searching" [1] if you haven't already. It says that if your array fits in L2 cache, as our btree pages do, then binary search still wins against Eytzinger et al, which I was a bit disappointed about when I was (casually) investigating new layouts based on some comments from a fellow hacker (I can't remember if it was Andres or Peter G who mentioned this topic to me). However, the paper is talking about "branch-free binary search with explicit prefetch", which apparently eats plain old branchy binary search for breakfast, and I gather they tested on a lot of hardware. That sounds interesting to look into.

[1] https://arxiv.org/pdf/1509.05053.pdf

--
Thomas Munro
http://www.enterprisedb.com
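Reconstructed from the paper's description (a sketch under those assumptions, not the paper's verbatim Listing 2): the search keeps a base pointer and halves the remaining length, replacing the branch with a conditional move, and prefetches the two slots the next iteration can (approximately) probe.

#include <stddef.h>

/* Branch-free lower bound with explicit prefetch over a sorted array. */
const int *
branchfree_lower_bound(const int *base, size_t n, int key)
{
	if (n == 0)
		return base;
	while (n > 1)
	{
		size_t		half = n / 2;

		/* The two slots the next iteration can probe, approximately. */
		__builtin_prefetch(&base[half / 2], 0, 2);
		__builtin_prefetch(&base[half + half / 2], 0, 2);

		/* Compilers typically turn this ternary into a cmov. */
		base = (base[half] < key) ? base + half : base;
		n -= half;
	}
	return base + (*base < key);
}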
On Thu, Aug 30, 2018 at 2:40 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data. Check out "Array Layouts for
> Comparison-Based Searching" [1] if you haven't already. It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed about when I was (casually) investigating new layouts
> based on some comments from a fellow hacker (I can't remember if it
> was Andres or Peter G who mentioned this topic to me).

It might have been both of us, at different times.

> However, the
> paper is talking about "branch-free binary search with explicit
> prefetch", which apparently eats plain old branchy binary search for
> breakfast, and I gather they tested on a lot of hardware. That sounds
> interesting to look into.

I think that the main win will come from stacking a bunch of different optimizations on top of each other, including key normalization, automatic prefix compression on internal pages (something that's a lot easier than on leaf pages [1]), and abbreviated keys in the ItemId array of nbtree pages [2] (even 15-bit abbreviated keys can have a lot of entropy when combined with these other techniques). Once we have all that in place, I expect drastically fewer iCache misses, making other optimizations interesting.

I'm bullish on the idea of using something like an unrolled binary search, or an interpolation search in the long term. In the meantime, we have too many iCache misses. As a datapoint, iCache misses reportedly dominate the overhead of TPC-C and TPC-E [3]. Nestloop joins with inner index scans are dominant within those benchmarks.

[1] https://wiki.postgresql.org/wiki/Key_normalization#Using_existing_.22minus_infinity.22_index_tuple_as_a_low_key
[2] https://www.postgresql.org/message-id/CAH2-Wz=mV4dmOaPFicRSyNtv2KinxEOtBwUY5R7fXXOC-OearA@mail.gmail.com
[3] https://openproceedings.org/2013/conf/edbt/TozunPKJA13.pdf -- "6.1.2 Core stalls"

--
Peter Geoghegan
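A hypothetical illustration of the "abbreviated keys in the ItemId array" idea mentioned above (invented types and packing, not the design from [2]): keep a 15-bit prefix of each key beside its item pointer, and only dereference the pointer when the prefixes cannot decide.

#include <stdint.h>
#include <string.h>

typedef struct AbbrevItem
{
	uint16_t	abbrev;			/* 15-bit key prefix (top bit spare) */
	uint16_t	off;			/* offset of the full key in "page"  */
} AbbrevItem;

/* -1/0/+1 comparison that usually resolves without touching the
 * cache line holding the full key. */
static int
abbrev_cmp(uint16_t probe_abbrev, const char *probe_key, size_t keylen,
		   const AbbrevItem *item, const char *page)
{
	uint16_t	a = item->abbrev & 0x7FFF;
	uint16_t	b = probe_abbrev & 0x7FFF;

	if (b != a)
		return (b < a) ? -1 : 1;

	/* Prefixes tie: fall back to the full key behind the pointer. */
	return memcmp(probe_key, page + item->off, keylen);
}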
Hello!
> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> A related topic is the cache-unfriendliness of traditional binary
> searches of sorted data. Check out "Array Layouts for
> Comparison-Based Searching" [1] if you haven't already. It says that
> if your array fits in L2 cache, as our btree pages do, then binary
> search still wins against Eytzinger et al, which I was a bit
> disappointed

I've re-read that paper. Their results are not about our case; here's a quote:

> For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
Surely we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.

Also, that research is quite distant from B-tree binsearch: their data items are small and do not represent references to actual data. Eytzinger exploits the fact that two possible future keys share the same cache line. But that is hard to guarantee, given that we place items at fairly random places within the page.
Best regards, Andrey Borodin.
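For reference, a standalone sketch of searching an Eytzinger-ordered array, reconstructed from the paper's description (the stride-16 prefetch distance is illustrative, not prescribed):

#include <stddef.h>

/* a[1..n] holds keys in Eytzinger (BFS) order; a[0] is unused.
 * Children of slot k live at 2k and 2k+1, so whole subtrees of the
 * next few levels are contiguous in memory. */
size_t
eytzinger_lower_bound(const int *a, size_t n, int key)
{
	size_t		k = 1;

	while (k <= n)
	{
		/* 16*k is ~4 levels ahead; its 16 slots are contiguous, and
		 * out-of-range prefetches are harmless no-ops. */
		__builtin_prefetch(&a[16 * k], 0, 2);
		k = 2 * k + (size_t) (a[k] < key);
	}
	/* Cancel the trailing "went right" steps to land on the answer. */
	k >>= __builtin_ffsll((long long) ~k);
	return k;					/* 1-based slot, or 0 if no key >= probe */
}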
On Mon, Oct 15, 2018 at 12:03 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> A related topic is the cache-unfriendliness of traditional binary
>> searches of sorted data. Check out "Array Layouts for
>> Comparison-Based Searching" [1] if you haven't already. It says that
>> if your array fits in L2 cache, as our btree pages do, then binary
>> search still wins against Eytzinger et al, which I was a bit
>> disappointed
>
> I've re-read that paper. Their results are not about our case; here's a quote:
>> For arrays small enough _to be kept_ in L2 cache, the branch-free binary search code listed in Listing 2 is the fastest algorithm
>
> Surely we cannot keep all pages in L2. The Eytzinger layout could be a realistic possibility, except for one simple problem: I do not see how we can insert items into this layout.

I don't know. Aren't we talking about binary search within one 8KB page here, anyway?

Thinking about some other ideas for cache prefetching in btree code, here's an idea to improve pipelining of index scans (a bit like the two-step pipeline idea from the hash join prefetch thread). We know the TIDs of heap tuples we'll be looking up in the near future, so can we prefetch all the way to the tuple? There are three random accesses that follow soon after reading an index tuple: BufTableLookup(), then the page header, then the tuple itself. At the logical extreme, we could anticipate the probe of SharedBufHash for the TID from index tuple i + 3 (perhaps dynahash could expose a hash_prefetch_for_hash() facility), anticipate the read of the page header for the tuple pointed to by index tuple i + 2 (perhaps a dirty read of SharedBufHash that is only prepared to look at the first entry in the bucket would be OK for this), and anticipate the read of the tuple for the TID from index tuple i + 1 (perhaps a dirty read of the page item table). Or some less ambitious subset, if that pipeline turns out to be too deep; it seems like the simplest experiment would be to try prefetching just SharedBufHash lookups. This may all be completely crazy, I haven't tried any of it.

> Also, that research is quite distant from B-tree binsearch: their data items are small and do not represent references to actual data. Eytzinger exploits the fact that two possible future keys share the same cache line. But that is hard to guarantee, given that we place items at fairly random places within the page.

Based on the above quotes, I'm not suggesting we should use Eytzinger for search-within-page. But out of curiosity, I checked, and saw that you can actually get a pair of index tuples holding a six-character text key or an integer key into a 64-byte cache line.

--
Thomas Munro
http://www.enterprisedb.com
> On 15 Oct 2018, at 2:38, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> I don't know. Aren't we talking about binary search within one 8KB
> page here, anyway?

The paper discusses multiple searches inside one 8KB region, while we are doing a single search in an entirely uncached page and then move away. That's the difference. Binary search is optimal if the page is already in L1 before we start the search.

> Thinking about some other ideas for cache prefetching in btree code,
> here's an idea to improve pipelining of index scans (a bit like the
> two-step pipeline idea from the hash join prefetch thread). We know
> the TIDs of heap tuples we'll be looking up in the near future, so can
> we prefetch all the way to the tuple? There are three random
> accesses that follow soon after reading an index tuple:
> BufTableLookup(), then the page header, then the tuple itself. At the
> logical extreme, we could anticipate the probe of SharedBufHash for
> the TID from index tuple i + 3 (perhaps dynahash could expose a
> hash_prefetch_for_hash() facility), anticipate the read of the
> page header for the tuple pointed to by index tuple i + 2 (perhaps a
> dirty read of SharedBufHash that is only prepared to look at the first
> entry in the bucket would be OK for this), and anticipate the read of
> the tuple for the TID from index tuple i + 1 (perhaps a dirty read of
> the page item table). Or some less ambitious subset, if that pipeline
> turns out to be too deep; it seems like the simplest experiment would
> be to try prefetching just SharedBufHash lookups. This may all be
> completely crazy, I haven't tried any of it.

This idea also looks good to me. One thing bothers me: if we do __builtin_prefetch(&a[b[c[d]]], 0, 1), and a, b, c and d are all not in cache, will we incur CPU wait time for fetching a, b and c? These guys [0] are exploiting C++ coroutines for prefetching this kind of stuff, but it seems like too much hassle.

> Based on the above quotes, I'm not suggesting we should use
> Eytzinger for search-within-page. But out of curiosity, I checked,
> and saw that you can actually get a pair of index tuples holding a
> six-character text key or an integer key into a 64-byte cache line.

Well, this proves that in theory Eytzinger is an opportunity.

Best regards, Andrey Borodin.
[0] http://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf
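On the pointer-chasing question above: __builtin_prefetch only hints the *final* access. Its address argument is evaluated normally, so the compiler must still emit ordinary, potentially stalling loads for the inner dereferences. A hypothetical illustration:

/* Forming the address &a[b[c[d]]] requires real loads of c[d] and
 * b[c[d]], which can stall as usual; only the access to a[...] is
 * turned into a non-blocking, non-faulting hint. */
static void
prefetch_chain(const int *a, const int *b, const int *c, int d)
{
	int			cd = c[d];		/* ordinary load: can stall */
	int			bcd = b[cd];	/* ordinary load: can stall */

	__builtin_prefetch(&a[bcd], 0, 1);	/* only this one is a hint */
}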
Hi, Thomas!

> On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> [1] https://arxiv.org/pdf/1509.05053.pdf

I've implemented all of the strategies used in that paper.

On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.

The idea of the attached patch is to try ordering the tuples according to different strategies. This ordering happens during VACUUM and can be easily broken by a single insert.

Strategies are switched by #define:
1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. There is no idea behind this strategy; it is expected to perform similarly to the baseline (no changes to the code at all).
2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
3. USE_VEB_ORDER - Van Emde Boas layout: try to pack three tuples (a mid and its two sub-mids) together. The key idea is that you take a cache miss when you read the mid, but the tuples for both possible next steps have already been fetched by that same miss.
4. USE_BT_ORDER - place the two sub-mids together so that you can prefetch both possible next mids with a single cache prefetch.

Cache prefetches of the B-tree binsearch are controlled by USE_PREFETCH.

I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, macOS 10.14.1):

./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres

No changes in configs. Here are the results in tps:

without prefetch
baseline   1448.331199
seq        1433.737204
bt         1463.701527
veb        1457.586089
eyz        1483.654765

with prefetch
baseline   1486.585895
seq        1471.757042
bt         1480.169967
veb        1464.834678
eyz        1460.323392

These numbers are not statistically sound (just one run was done for each number), and the experiment needs to be tuned anyway. From this I can conclude:
1. It is hard to observe a significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.

How can I improve the experimental evaluation of these layout strategies? Other thoughts? I'd be happy to hear any comments on this.

Best regards, Andrey Borodin.
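For concreteness, here is a hypothetical sketch of the pairing idea behind USE_BT_ORDER as described above (a standalone reconstruction, not code from the attached patch): physically place the two possible *next* midpoints of every binary-search interval side by side, so one prefetch of that pair covers both outcomes of the comparison at mid.

#include <stddef.h>

/* For each search interval [lo, hi), emit its two sub-midpoints as an
 * adjacent pair, then recurse.  order[i] records which logical
 * (key-ordered) slot should be stored at physical slot i. */
static void
bt_order_range(size_t lo, size_t hi, size_t *order, size_t *next)
{
	size_t		mid;

	if (lo >= hi)
		return;
	mid = lo + (hi - lo) / 2;
	if (mid > lo)				/* left half nonempty: emit its mid */
		order[(*next)++] = lo + (mid - lo) / 2;
	if (mid + 1 < hi)			/* right half nonempty: emit its mid */
		order[(*next)++] = mid + 1 + (hi - mid - 1) / 2;
	bt_order_range(lo, mid, order, next);
	bt_order_range(mid + 1, hi, order, next);
}

static void
bt_order(size_t n, size_t *order)
{
	size_t		next = 0;

	if (n == 0)
		return;
	order[next++] = n / 2;		/* the root midpoint goes first */
	bt_order_range(0, n, order, &next);
}

During a search one would then issue a single prefetch for the physical pair slot of the current interval; whether that pays for the extra indirection is exactly what the tps numbers above try to measure.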
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> I've implemented all of the strategies used in that paper.
>
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.
> The idea of the attached patch is to try ordering the tuples according to different strategies. This ordering happens during VACUUM and can be easily broken by a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. There is no idea behind this strategy; it is expected to perform similarly to the baseline (no changes to the code at all).
> 2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - Van Emde Boas layout: try to pack three tuples (a mid and its two sub-mids) together. The key idea is that you take a cache miss when you read the mid, but the tuples for both possible next steps have already been fetched by that same miss.
> 4. USE_BT_ORDER - place the two sub-mids together so that you can prefetch both possible next mids with a single cache prefetch.
>
> Cache prefetches of the B-tree binsearch are controlled by USE_PREFETCH.
>
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, macOS 10.14.1):
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are the results in tps:
> without prefetch
> baseline   1448.331199
> seq        1433.737204
> bt         1463.701527
> veb        1457.586089
> eyz        1483.654765
>
> with prefetch
> baseline   1486.585895
> seq        1471.757042
> bt         1480.169967
> veb        1464.834678
> eyz        1460.323392
>
> These numbers are not statistically sound (just one run was done for each number), and the experiment needs to be tuned anyway.
> From this I can conclude:
> 1. It is hard to observe a significant performance influence on pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
> 3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eytzinger layout is the best.
> 4. Among the layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve the experimental evaluation of these layout strategies?
> Other thoughts? I'd be happy to hear any comments on this.

What about read-only tests with prewarmed indexes? Those numbers look IO-bound.

This is focusing on prefetch within btree code, but I wonder if there is some lower hanging fruit that doesn't require any layout changes: heap prefetch during index scans. By analogy to figure 8a, "software-pipelined prefetching", of [1], I wonder if you could build a pipeline using dirty (unlocked) memory reads, like this:

1. Prefetch the buffer mapping hash bucket for heap_tids[n + 3]
2. Prefetch the page header for heap_tids[n + 2], using a dirty read of the first value in the buffer mapping hash bucket
3. Prefetch the heap tuple for heap_tids[n + 1], using a dirty read of the heap page
4. Return the heap tid for heap_tids[n] to the caller

If you're lucky, by the time the caller uses the tid to find the buffer, read the line pointer and access the tuple, all three things will be in cache and hopefully won't have changed since the 'dirty' reads.
Prefetching the wrong cache lines would be possible, obviously, if there is a buffer mapping hash table collision and the one you wanted isn't first, or if stuff has just changed. Maybe there aren't enough cache lines for this 3-level pipeline to work, considering that random other executor nodes could run in between, so perhaps just a 2-level or 1-level pipeline would pay off. As I showed in my toy hash join prefetch experiment [2], I could get a 20-30% speedup from prefetching just hash buckets (equivalent to prefetching buffer mapping hash buckets), and then a further 5-10% speedup from prefetching the first item in the hash bucket, but I expect nested hash joins to interfere with that as they compete for cache lines. I suppose I could try to do three levels and fetch the next tuple in the chain, but that seems unlikely to help because we're getting into lower probabilities that I'd even want that data; in contrast, the index scan -> heap scan case *definitely* needs to go through a line pointer to a tuple somewhere on the page, so a 3-level pipeline might be worth experimenting with. Just a thought, I have not tried any of this. Maybe there are really 4 levels, if you count the buffer descriptor/lock.

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com
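A compact way to picture that pipeline, with hypothetical stage callbacks standing in for the SharedBufHash probe, page-header and tuple prefetches (nothing here is existing PostgreSQL API):

#include <stddef.h>

typedef struct Pipeline
{
	void		(*stage1)(size_t i);	/* e.g. prefetch hash bucket  */
	void		(*stage2)(size_t i);	/* e.g. prefetch page header  */
	void		(*stage3)(size_t i);	/* e.g. prefetch tuple itself */
	void		(*consume)(size_t i);	/* finally use item i         */
} Pipeline;

/* Run stage s for item i + (4 - s) in each iteration, so by the time
 * consume() needs item i, all three of its cache lines have been
 * hinted in earlier iterations. */
static void
run_pipelined(const Pipeline *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
	{
		if (i + 3 < n)
			p->stage1(i + 3);
		if (i + 2 < n)
			p->stage2(i + 2);
		if (i + 1 < n)
			p->stage3(i + 1);
		p->consume(i);
	}
}

Shortening the pipeline to the "less ambitious subset" mentioned above just means dropping the later stages.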