Clock sweep not caching enough B-Tree leaf pages?

From: Peter Geoghegan
Subject: Clock sweep not caching enough B-Tree leaf pages?
Date:
Msg-id: CAM3SWZRM7-qmOGMCzix5U49ndFbLS_WwTx6VJsja+bN_Li5v_A@mail.gmail.com
Responses: Re: Clock sweep not caching enough B-Tree leaf pages?  (Jim Nasby <jim@nasby.net>)
Re: Clock sweep not caching enough B-Tree leaf pages?  (Bruce Momjian <bruce@momjian.us>)
Re: Clock sweep not caching enough B-Tree leaf pages?  (Ants Aasma <ants@cybertec.at>)
Re: Clock sweep not caching enough B-Tree leaf pages?  (Robert Haas <robertmhaas@gmail.com>)
Re: Clock sweep not caching enough B-Tree leaf pages?  (Andres Freund <andres@2ndquadrant.com>)
List: pgsql-hackers
I have some theories about the PostgreSQL buffer manager/clock sweep.
To motivate the reader to get through the material presented here, I
present up-front a benchmark of a proof-of-concept patch of mine:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/3-sec-delay/

Test Set 4 represents the patch's performance here.

This shows some considerable improvements for a tpc-b workload, with
15 minute runs, where the buffer manager struggles with moderately
intense cache pressure. shared_buffers is 8GiB, with 32GiB of system
memory in total. The scale factor is 5,000 here, so that puts the
primary index of the accounts table at a size that makes it impossible
to cache entirely within shared_buffers, by a margin of a couple of
GiB. pgbench_accounts_pkey is ~"10 GB", and pgbench_accounts is ~"63
GB". Obviously the heap is much larger, since for that table heap
tuples are several times the size of index tuples (the ratio here is
probably well below the mean, if I can be permitted to make a vast
generalization).

PostgreSQL implements a clock sweep algorithm, which gets us something
approaching an LRU for the buffer manager in trade-off for less
contention on core structures. Buffers have a usage_count/"popularity"
that currently saturates at 5 (BM_MAX_USAGE_COUNT). The classic CLOCK
algorithm only has one bit for what approximates our "usage_count" (so
it's either 0 or 1). I think that at its core CLOCK is an algorithm
that has some very desirable properties that I am sure must be
preserved. Actually, I think it's more accurate to say we use a
variant of CLOCK-Pro, a refinement of the original CLOCK.
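
For readers who want to see the shape of the algorithm, here is a
heavily simplified sketch of victim selection (illustrative only: the
real StrategyGetBuffer() in freelist.c also deals with the freelist,
BufFreelistLock, spinlocks and pinned-buffer edge cases, and the struct
here is a stand-in for the real buffer descriptor):

/*
 * Simplified clock sweep victim selection (illustrative sketch only).
 */
#include <stdint.h>

#define NBUFFERS            1024
#define BM_MAX_USAGE_COUNT  5       /* saturation point on master */

typedef struct
{
    uint16_t usage_count;           /* "popularity", 0..BM_MAX_USAGE_COUNT */
    uint16_t refcount;              /* pins held by backends */
} FakeBufferDesc;

static FakeBufferDesc buffers[NBUFFERS];
static int nextVictimBuffer = 0;    /* the clock hand */

static int
clock_sweep_victim(void)
{
    for (;;)
    {
        int             victim = nextVictimBuffer;
        FakeBufferDesc *buf = &buffers[victim];

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return victim;      /* evict this buffer */
            buf->usage_count--;     /* lose "popularity" on each pass */
        }
    }
}

Note how a pool in which every buffer has reached usage_count 5 forces
up to five full revolutions of the hand before any victim is found,
which is exactly the frantic-searching pathology described below.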

In the past, various hackers have noted problems they've observed with
this scheme. A common pathology is to see frantic searching for a
victim buffer only to find all buffer usage_count values at 5. It may
take multiple revolutions of the clock hand before a victim buffer is
found, as usage_count is decremented for each and every buffer.  Also,
BufFreelistLock contention is considered a serious bottleneck [1],
which is related.

I think a very real problem may be that approximating an LRU is bad
because an actual LRU is bad, though not due to the problems that
CLOCK/our clock sweep algorithm, to its great credit, ameliorates. I
don't think that we need to trade off between an LRU and an MRU as Atri
once suggested [2]; rather, I think we need a trade-off between an LRU
and an LFU (very loosely speaking). Something like a pure LRU can make
very bad choices for database workloads. This is described extensively
in the literature.

As I recently remarked upon for unrelated reasons [3], B+Trees perform
excellently when partially cached. There is a remarkable asymmetry
between how many pages we must have cached relative to how much I/O is
avoided, particularly the random I/O that is characteristic of fully
uncached B-Trees when scanned. Approximately 99% of pages in our
nbtree structures can be expected to be leaf pages in many common
cases. There is a very wide fan-out of B-Tree structures that makes
this possible. A lot of material on the subject doesn't emphasize this
basic aspect strongly enough in my view. But it also bears mentioning
that B-Tree pages are, in an important sense, far denser than heap
pages.
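
To put a number on that fan-out claim, here is a toy calculation; the
fan-out and leaf count are assumptions chosen for illustration, not
measurements of any particular index:

#include <stdio.h>

int
main(void)
{
    /* Assumed figures for illustration only. */
    double fanout = 300.0;          /* children per internal page */
    double leaf_pages = 1.0e6;      /* leaf pages in the index */

    /* Internal levels form a geometric series above the leaf level. */
    double internal_pages = 1.0;    /* the root */
    for (double level = leaf_pages / fanout; level >= 1.0; level /= fanout)
        internal_pages += level;

    printf("leaf fraction: %.2f%%\n",
           100.0 * leaf_pages / (leaf_pages + internal_pages));
    /* Prints ~99.67%: nearly all B-Tree pages are leaf pages. */
    return 0;
}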

Let's leave aside inner/root pages though, because they're so
dramatically useful when in a primary index on a tpc-b table that
they'll always be cached by any non-terrible algorithm. It beggars
belief that the still relatively dense (and consequently *popular*)
B+Tree leaf pages get so little credit for being of such long-term
utility (in the view of our clock sweep algorithm as implemented). The
algorithm has what could be loosely described as an excessively
short-term perspective. There is clearly a better balance to be had
here. I don't think the answer is that we have the B-Tree code give
its pages some special treatment that makes them harder to evict,
although I will admit that I briefly entertained the idea.

I am aware of the false start that we had with ARC back in 2005. I
believe that effort tried to address some of these problems. I am
mindful of the pitfalls there. I'm inclined to think that the decision
to create a CLOCK variant in preference to ARC was the right one at
the time, because clock sweep really is at a considerable advantage
with regard to lock contention.

The 1993 paper "The LRU-K Page Replacement Algorithm For Database Disk
Buffering" [4] proposes what is (roughly speaking) an algorithm that
is an adaptive hybrid of LRU and LFU. Consider Example 1.1. from that
paper; it describes a very simple scenario in which it's possible to
have slightly more heap pages cached than B-Tree pages. This scenario
is essentially a "pgbench -S" workload, comparable to the simplistic
tpc-a; it's nothing more than that. The authors aren't clear on this,
but I assumed a uniform distribution (IIRC, the 2Q paper, which was
published a year or two later, also explicitly assumes this of the
very same earlier LRU-K/LRU-2 example). It is argued within that
example, quite cogently in my opinion, that it's very bad that a
simple LRU cache will see just under 50% of buffers used to cache
B-Tree leaf pages, while over 50% are used for "data" (heap) pages. In
actual fact, it is preferable to almost exclusively cache B-Tree
pages. Using 50% of the cache on B-Tree pages when it is optimal to
cache a number approaching 100% of the cache is a pretty large
disparity, particularly for such a simple workload. This is literally
the furthest possible thing from a contrived corner case.
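
The arithmetic behind that argument is easy to reproduce. The figures
below are illustrative stand-ins, not the paper's exact numbers, but
the shape of the result is the same: under a uniform distribution each
transaction touches one leaf page from a small set and one heap page
from a much larger set, so each leaf page is individually far hotter,
and spending the cache almost entirely on leaf pages wins:

#include <stdio.h>

int
main(void)
{
    /* Assumed figures for illustration only. */
    double leaf_pages = 10000.0;    /* B-Tree leaf pages */
    double heap_pages = 50000.0;    /* heap is several times larger */
    double cache_pages = 10000.0;   /* buffer pool size */

    /* Half of all references go to leaf pages, half to heap pages. */
    double hit_leaf_only = 0.5 * (cache_pages / leaf_pages) /* all leaves fit */
                         + 0.5 * 0.0;                       /* heap misses */
    double hit_lru_split = 0.5 * ((cache_pages / 2.0) / leaf_pages)
                         + 0.5 * ((cache_pages / 2.0) / heap_pages);

    printf("cache leaf pages only:  %.0f%% hit rate\n", 100.0 * hit_leaf_only);
    printf("~50/50 LRU-style split: %.0f%% hit rate\n", 100.0 * hit_lru_split);
    /* Prints 50% versus 30% with these assumed figures. */
    return 0;
}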

I believe that it isn't hard to get clock sweep to do just the same as
predicted in the '93 paper, even with a "usage_count". It is nothing
more than an LRU approximation, or at least that's what the relevant
comments say. I did some experiments here, but it probably isn't worth
sharing the raw data, given what I already have here. The example
surrounding caching B-Tree leaf pages in the paper draws attention to
a particularly acute pathology, but there are probably others. Leaf
pages are in many representative cases far denser, and therefore can
be expected to be accessed far more frequently to service queries. Our
current failure to credit them in a way that weighs the frequency with
which they're accessed is something I suggest thinking long and hard
about.

Has anyone thought about this in the last few years? I know that Tom
examined the LRU-K paper back in 2000 [5], but was discouraged by some
kind of contention or CPU overhead (although he did say he intended to
revisit the question). Obviously a lot has changed in the past 14
years, particularly with regard to CPU characteristics.

Anyway, having gone through the requisite background information, I'll
get back to the subject of the pgbench-tools tpc-b benchmark that I
started out with, and what I've actually done in the attached patch to
improve matters. Let me preface this by saying: this is a rough
prototype. The way I add a field to the buffer descriptor struct
clearly isn't going to fly, since we have every reason to believe that
that is itself performance critical in other scenarios (just look at
Andres' recent work on the alignment of these structures in memory).
Actually, it probably matters a lot in this scenario too, just not
enough to mask the general outcome. I've arrived at this prototype
through plenty of principled theorizing, and a bit of unprincipled
frobbing. I'm a big fan of building simple prototypes to test
theories. In the interest of reproducibility I have not attempted to
clean up what I have here just yet, in case I accidentally invalidate
the benchmark presented.

The prototype patch (a simplified sketch follows this list):

1) Throttles incrementation of usage_count temporally. It becomes
impossible to increment usage_count for any given buffer more
frequently than every 3 seconds, while decrementing usage_count is
totally unaffected. This is thought to give the algorithm a short to
medium term "appreciation" of the frequency of access. It also
prevents very close increments in usage_count due to pinning a buffer
twice in the same query or transaction, just because the code in the
executor happens to be accidentally structured such that that happens
(as when inserting two tuples within a single INSERT DML query), or
whatever. Clearly the tpc-b accounts table is accessed twice in
succession in the same transaction by our tpc-b, creating a sort of
misrepresentation of buffer popularity that is likely in and of itself
undesirable.

2) Has usage_count saturate at 10 (i.e. BM_MAX_USAGE_COUNT = 10), not
5 as before. This gives us a more granular representation of the
popularity/usefulness of each buffer, which I believe spans a
dramatically large spectrum (i.e. my guess is that testing will show I
didn't go far enough). This step on its own would be assumed extremely
counter-productive by those in the know, but I believe that other
measures ameliorate the downsides. I could be wrong about how true
that is in other cases, but then the case helped here isn't what you'd
call a narrow benchmark. It's the pgbench default (although I do use
unlogged tables, in part because the I/O subsystem on this server is
quite under-powered, even though it has plenty of memory).

3) Has buffer usage_count start at 3. I also tried 4 (and 1, the
current value within master), but settled on 3 for now. Seemingly this
makes a very large difference compared to only doing 1) and 2), since
it gives each page a fair shot at proving its usefulness. Presumably
the throttling in 1) makes frantic buffer scanning much less
prevalent, since clock sweep has the upper hand, so to speak. A tug of
war between clock sweep and backends pinning buffers can be
disastrous, but that seems much less likely than before. Semi-magical
constant amounts of time are something you see surprisingly often in
the research. Probably most notably, the "five-minute rule" [6] argues
that one should only cache randomly accessed pages that are re-used
every 5 minutes or less (it's actually a bit more complicated these
days, but not much). This rule is a mainstay of database caching
research theory. Anyway, I'm not sure whether or not this delay should
be exposed as a GUC. I lean towards "no".
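
Putting those three changes together, the shape of the prototype is
roughly as follows. This is a simplified sketch rather than the patch
itself: the extra timestamp field in the buffer descriptor is precisely
the part that isn't going to fly as-is, and all field and function
names here are illustrative stand-ins:

#include <stdint.h>
#include <time.h>

#define BM_MAX_USAGE_COUNT      10  /* 2): raised from 5 */
#define BM_INITIAL_USAGE_COUNT  3   /* 3): new buffers start at 3, not 1 */
#define USAGE_BUMP_INTERVAL     3   /* 1): seconds between increments */

typedef struct
{
    uint16_t usage_count;
    uint16_t refcount;
    time_t   last_usage_bump;       /* extra field; this is what bloats the
                                     * buffer descriptor */
} FakeBufferDesc;

/* Called when a backend pins a buffer (cf. PinBuffer()). */
static void
bump_usage_count(FakeBufferDesc *buf)
{
    time_t now = time(NULL);

    /*
     * 1) Throttle increments temporally: two pins inside the same three
     * second window (e.g. the two accounts-table accesses within one
     * tpc-b transaction) only count once towards "popularity".
     */
    if (now - buf->last_usage_bump < USAGE_BUMP_INTERVAL)
        return;

    if (buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;
    buf->last_usage_bump = now;
}

/* Called when a buffer is allocated for a new page. */
static void
init_usage_count(FakeBufferDesc *buf)
{
    /* 3) Give every page a fair shot at proving its usefulness. */
    buf->usage_count = BM_INITIAL_USAGE_COUNT;
    buf->last_usage_bump = time(NULL);
}

Decrementing is untouched: the clock hand still takes one step of
usage_count away per pass, exactly as before.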

The amount of evidence it takes to validate something like this is
generally enormous. Validating the just-in-time bgwriter strategy was
the original reason that Greg Smith wrote pgbench-tools. Still, I'm
confident that I've identified a real problem. The big picture here is
that pages have a fair go at proving their worth, while allowing for
certain pages to be recognized as being of dramatically higher
long-term utility. Generally speaking, as a caching algorithm, a pure
LFU is inappropriate for almost all use-cases, because it never
forgets anything until it forgets everything about individual pages.
There is a balance to be had here, in terms of the extent to which we
allow pages to "rest on their laurels". I wouldn't be surprised to
learn I have the balance wrong right now.

pgbench-tools benchmark interpretation
======================================

I also attach a LibreOffice spreadsheet comparing hit rates for each
table (and its indexes) for each run that you see in the pgbench-tools
report (except the final do-over of master baseline). I built this
with the help of a small pgbench-tools hack, resetting pg_statio*
views, measuring hit rates per pgbench invocation. The hit rate shown
would probably be a lot more interesting if I'd used a rate limit, so
the amount of work performed is consistent across test sets (that is,
variants of the patch, and master). Greg Smith is very enthusiastic
about how much further insight can be had while using that pgbench
feature, and I'd say he's probably right to be. Right now, in terms of
hit rate you mostly see a slightly smaller one for indexes, and a
considerably larger one for heap buffers as compared to the master
baseline.

If you run down the individual runs in the pgbench-tools report, and
consider what actually happens and when, you tend to see a degradation
in performance over successive runs for different client counts for
certain cases tested. It's hard to be sure, but I think that might be
because earlier runs have a big advantage due to the index being well
cached (pgbench-tools performs vacuuming at the start of each run, not
including the first run, so VACUUM reverses the situation initially
seen, since we kill heap tuples last when vacuuming, rather than
creating the index last). In contrast, the impressive performance of
the patch (where all 3 of the above measures are implemented) shows
more consistent throughput even with that noise, which seems to
support my theories about what is going on here. Leaving aside cache
effectiveness as such, I suspect that in general we currently pay a
high price for all too frequently having buffers fall out of
shared_buffers into the OS cache, and fall back in again.

Having arrived at the conclusion that these optimizations were worth
sharing here, I decided to "do over" the original master baseline
(Test Set 1). I ended up with a result (Test Set 5) that is actually
quite different from the original result, without it being all that
clear that it was better or worse than the first time I took a
baseline (it was probably worse). This might call into question the
accuracy of my results. However, to see what's really going on here,
it is necessary to drill down to the individual TPS/latency figures
for individual pgbench invocations. That's the real story here.

In general, as I said, the I/O system here is quite under-specified
for this workload. This is dedicated physical hardware, but it happens
to be what was immediately available to me. There are a couple of SATA
7200 RPM disks in RAID 1. The system specifications, as advertised by
my hosting provider, are detailed here:
https://www.hetzner.de/en/hosting/produkte_rootserver/ex40 . As always
with pgbench-tools, you can drill down to see more about each test
(including kernel vm settings, etc).

Inevitably, with this disk setup (which I imagine is particularly bad
with random I/O), and with so much memory, it's only a matter of time
before things get ugly, even if there is an initial high-performance
burst (even on master) before we have to face the music. It seems like
the pain is felt at slightly different
times between Test Set 1 and Test Set 5 (i.e. for each test of
master). With either of those 2 test sets, if you drill down to see
the moment-to-moment throughput and latency, you'll find tests from
each test set where both latency and throughput were on the floor for
as long as a couple of minutes at a time, after which there is a
sudden recovery. We're subject to the vagaries of kernel I/O
scheduling to a much greater extent than with the good patched test
sets, it seems. What is seen with master looks very much like sync
phase checkpoint spikes. Once or twice, things are consistently very
poor for almost an entire master invocation of pgbench. In general
things are very inconsistent, although to be fair it's possible that
at its best master has greater throughput than patched for brief
moments (of course, the buffer descriptor bloating which I have yet to
deal with may well be all that is to blame here).

If you look at the test sets that this patch covers (with all the
tricks applied), there are pretty good figures throughout. You can
kind of see the pain towards the end, but there are no dramatic falls
in responsiveness for minutes at a time. There are latency spikes, but
they're *far* shorter, and much better hidden. Without looking at
individual multiple minute spikes, at the macro level (all client
counts for all runs) average latency is about half of what is seen on
master.

Does anyone have any high-end hardware that they could use to test this out?

[1] http://www.postgresql.org/message-id/CA+TgmobJm0GHk58nUPRQHCGwY25n1DCkU4ku9aQeczZEjiz9mQ@mail.gmail.com
[2] http://www.postgresql.org/message-id/CAOeZVic4HikhmzVD=ZP4JY9g8PgpyiQQOXOELWP=kR+=H1Frgg@mail.gmail.com
[3] http://www.postgresql.org/message-id/CAM3SWZTcXrdDZSpA11qZXiyo4_jtxwjaNdZpnY54yjzq7d64=A@mail.gmail.com
[4] http://www.cs.cmu.edu/~christos/courses/721-resources/p297-o_neil.pdf
[5] http://www.postgresql.org/message-id/1601.967421129@sss.pgh.pa.us
[6] http://en.wikipedia.org/wiki/Five-minute_rule

--
Peter Geoghegan
