Re: B-tree cache prefetches

From: Thomas Munro
Subject: Re: B-tree cache prefetches
Date:
Msg-id: CA+hUKG+LPbHaN9wdQTZATD4cS5-oDeikio6gG5=YEKCGhxT4zw@mail.gmail.com
In response to: Re: B-tree cache prefetches  (Andrey Borodin <x4mmm@yandex-team.ru>)
List: pgsql-hackers
On Tue, Nov 27, 2018 at 5:43 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 31 Aug 2018, at 2:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > [1] https://arxiv.org/pdf/1509.05053.pdf
>
> I've implemented all of the strategies used in that paper.
> On a B-tree page we have line pointers ordered in key order and tuples residing at the end of the page. Tuples at the end are mostly "appended", i.e. they are usually accommodated at the end of the free space.
> The idea of the attached patch is to try ordering tuples according to different strategies. This ordering happens during VACUUM and can easily be broken by a single insert.
> Strategies are switched by #define:
> 1. USE_SEQ_ORDER - just order tuples in the order of the line pointers. This strategy has no particular idea behind it and is expected to work similarly to the baseline (no changes to code at all).
> 2. USE_EYZINGER_ORDER - depth-first topological search. If you always choose the lower branch, your search is just a sequential read of bytes.
> 3. USE_VEB_ORDER - van Emde Boas layout - try to pack 3 tuples (the mid and its two sub-mids) together. The key idea is that you take a cache miss when reading the mid, but the tuples of both possible next steps are already fetched by that cache miss.
> 4. USE_BT_ORDER - order the two sub-mids together so that you can prefetch both possible next mids with a single cache prefetch.
>
> Cache prefetches of B-tree binsearch are controlled by USE_PREFETCH.
>
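
(Just to make sure we're all picturing the same thing: the Eytzinger
ordering described above boils down to the textbook version below.  This is
only an illustrative sketch after [1], in plain C over an int array, not the
code in your patch.)

#include <stddef.h>

/* Copy a sorted array into eytz[1..n], laid out in BFS order of the
 * implicit binary search tree.  Call as eytzinger_build(sorted, eytz, n, 1, 0)
 * with eytz sized n + 1. */
static size_t
eytzinger_build(const int *sorted, int *eytz, size_t n, size_t i, size_t k)
{
    if (i <= n)
    {
        k = eytzinger_build(sorted, eytz, n, 2 * i, k);      /* left subtree */
        eytz[i] = sorted[k++];                               /* this node */
        k = eytzinger_build(sorted, eytz, n, 2 * i + 1, k);  /* right subtree */
    }
    return k;
}

/* Return the 1-based index of the smallest element >= key, or 0 if there
 * is none.  The prefetch pulls in a block of descendants a few levels
 * down, which is what makes this layout prefetch-friendly (the paper pads
 * the array so the out-of-range address is harmless). */
static size_t
eytzinger_search(const int *eytz, size_t n, int key)
{
    size_t      i = 1;

    while (i <= n)
    {
        __builtin_prefetch(&eytz[16 * i]);
        i = 2 * i + (eytz[i] < key);      /* branch-free descent */
    }
    return i >> __builtin_ffsl(~i);       /* strip the trailing right turns */
}
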
> I've tested all these strategies on my laptop (2.2GHz Intel i7, 16GB RAM, MacOS 10.14.1)
> ./pgbench -i -s 25 postgres && ./pgbench -T 100 postgres
> No changes in configs.
>
> Here are results in tps:
> without prefetch
> baseline 1448.331199
> seq      1433.737204
> bt       1463.701527
> veb      1457.586089
> eyz      1483.654765
>
> with prefetch
> baseline 1486.585895
> seq      1471.757042
> bt       1480.169967
> veb      1464.834678
> eyz      1460.323392
>
> These numbers are not statistically rigorous; just one run was done for each number, and the experiment needs to be tuned anyway.
> From this I can conclude:
> 1. It is hard to observe a significant performance effect with pgbench. Maybe I should use some more representative B-tree performance tests?
> 2. Cache prefetches seem to increase performance even without any layout strategy, but the difference is a hair-splitting 2.6%.
> 3. For the implemented strategies, my preliminary results correspond to the results in the paper: the Eyzinger layout is the best.
> 4. Among layout strategies with prefetch, the bt-order strategy seems to be the winner.
>
> How can I improve the experimental evaluation of these layout strategies?
> Other thoughts? I'd be happy to hear any comments on this.

What about read-only tests with prewarmed indexes?  Those numbers look IO bound.

This is focusing on prefetch within btree code, but I wonder if there
is some lower hanging fruit that doesn't require any layout changes:
heap prefetch during index scans.  By analogy to figure 8a
"software-pipelined prefetching" of [1], I wonder if you could build a
pipeline using dirty (unlocked) memory reads, like this:

1.  Prefetch buffer mapping hash bucket for heap_tids[n + 3]
2.  Prefetch page header for heap_tids[n + 2] using a dirty read of
the first value in the buffer mapping hash bucket
3.  Prefetch heap tuple for heap_tids[n + 1] using a dirty read of the heap page
4.  Return heap_tids[n] to the caller

If you're lucky, by the time the caller uses the tid to find the
buffer, read the line pointer and access the tuple, all three things
will be in cache and hopefully won't have changed since the 'dirty'
reads.  Prefetching the wrong cachelines would be possible obviously,
if there is a buffer mapping hash table collision and the one you
wanted isn't first, or stuff just changed.  Maybe there aren't enough
cachelines for this 3-level pipeline to work, considering that, in
between, random other executor nodes could run, so perhaps just a
2-level or 1-level pipeline would pay off.  As I showed in my toy hash
join prefetch experiment[2], I could get 20-30% speedup from
prefetching just hash buckets (equivalent to prefetching buffer
mapping hash buckets), and then a further 5-10% speedup from
prefetching the first item in the hash bucket, but I expect nested
hash joins to interfere with that as they compete for cache lines.  I
suppose I could try to do three levels and fetch the next tuple in the
chain, but that seems unlikely to help because we're getting into
lower probabilities that I'd even want that data; in contrast,
the index scan -> heap scan case *definitely* needs to go through a
line pointer to a tuple somewhere on the page so 3-level might be
worth experimenting with.  Just a thought, I have not tried any of
this.  Maybe there are really 4 levels, if you count the buffer
descriptor/lock.
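
To make the shape of that concrete, here's a very rough sketch of the loop
I have in mind, driving prefetches from an array of TIDs produced by an
index scan.  The helpers PrefetchBufMappingBucket(), DirtyLookupBuffer(),
DirtyLookupPage() and return_tid_to_caller() are hypothetical, they don't
exist anywhere; they stand for "hash the buffer tag and prefetch that
bucket's cache line" and "peek at the bucket/page without taking any
locks".  Everything read that way is only used as a prefetch hint, so
staleness doesn't matter for correctness:

#include "postgres.h"

#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "storage/itemptr.h"
#include "utils/rel.h"

static void
prefetch_heap_pipeline(Relation heaprel, ItemPointer tids, int ntids)
{
    for (int n = 0; n < ntids; n++)
    {
        /* Stage 1: prefetch the buffer mapping hash bucket for tids[n + 3]. */
        if (n + 3 < ntids)
            PrefetchBufMappingBucket(heaprel, &tids[n + 3]);    /* hypothetical */

        /*
         * Stage 2: dirty (unlocked) read of the first entry in that bucket
         * for tids[n + 2]; if it looks valid, prefetch the page header.
         */
        if (n + 2 < ntids)
        {
            Buffer      buf = DirtyLookupBuffer(heaprel, &tids[n + 2]); /* hypothetical */

            if (BufferIsValid(buf))
                __builtin_prefetch(BufferGetPage(buf));
        }

        /*
         * Stage 3: dirty read of the page for tids[n + 1]; follow the line
         * pointer and prefetch the tuple itself.
         */
        if (n + 1 < ntids)
        {
            Page        page = DirtyLookupPage(heaprel, &tids[n + 1]);  /* hypothetical */

            if (page != NULL)
            {
                ItemId      lp = PageGetItemId(page,
                                               ItemPointerGetOffsetNumber(&tids[n + 1]));

                if (ItemIdIsNormal(lp))
                    __builtin_prefetch(PageGetItem(page, lp));
            }
        }

        /* Stage 4: hand tids[n] to the caller; with luck it's all hot now. */
        return_tid_to_caller(&tids[n]);                         /* hypothetical */
    }
}

A real pipeline would of course carry the buffer/page it peeked at in the
earlier stages forward between iterations instead of redoing the dirty
lookups every time, and would need some defence against reading garbage,
but that's the general shape.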

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf
[2] https://www.postgresql.org/message-id/CA%2BhUKGKB%3DEWq%2Brj4NK8CX6ZqHZzuQYWSb9iDfYKzGN6-ejShUQ%40mail.gmail.com

--
Thomas Munro
https://enterprisedb.com


