Discussion: Clock with Adaptive Replacement


Clock with Adaptive Replacement

From: Konstantin Knizhnik
Date:
Hi hackers,

What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version
proposed in this article:

    http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there any well-known drawbacks to this approach, or would it be interesting to adapt this algorithm to PostgreSQL
and measure its impact on performance under different workloads?
 
I found this ten-year-old thread:

http://www.postgresql.org/message-id/flat/d2jkde$6bg$1@sea.gmane.org#d2jkde$6bg$1@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a
different algorithm).
As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very
large caches.
Maybe CAR can address some of them?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Clock with Adaptive Replacement

From: Robert Haas
Date:
On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> What do you think about improving the cache replacement clock-sweep algorithm in
> PostgreSQL with the adaptive version proposed in this article:
>
>     http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
>
> Are there any well-known drawbacks to this approach, or would it be
> interesting to adapt this algorithm to PostgreSQL and measure its impact on
> performance under different workloads?
> I found this ten-year-old thread:
>
> http://www.postgresql.org/message-id/flat/d2jkde$6bg$1@sea.gmane.org#d2jkde$6bg$1@sea.gmane.org
>
> but it mostly discusses possible patent issues with another algorithm, ARC (CAR
> is inspired by ARC, but it is a different algorithm).
> As far as I know there are several problems with the current clock-sweep
> algorithm in PostgreSQL, especially for very large caches.
> Maybe CAR can address some of them?

Maybe, but the proof of the pudding is in the eating.  Just because an
algorithm is smarter, newer, and better in general than our current
algorithm - and really, it wouldn't be hard - doesn't mean that it
will actually solve the problems we care about.  A few of my
EnterpriseDB colleagues spent a lot of time benchmarking various
tweaks to our current algorithm last year and were unable to construct
a test case where it sped anything up.  If they tried the same tweaks
against the 9.4 source base, they could get a speedup.  But 9.5 had
locking improvements around buffer eviction, and with those
improvements committed there was no longer any measurable benefit to
improving the quality of buffer eviction decisions.  That's a
surprising result, to me anyway, and somebody else might well find a
test case where a benefit can be shown - but our research was not
successful.

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are.  We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost
randomly.  But what's a lot less clear is how much that actually hurts
us given that we are relying on the OS cache anyway.  It may be that
we need to fix some other things before or after improving the buffer
eviction algorithm before we actually get a performance benefit.  I
suspect, for example, that a lot of the problems with large
shared_buffers settings have to do with the bgwriter and checkpointer
behavior rather than with the buffer eviction algorithm; and that
others have to do with cache duplication between PostgreSQL and the
operating system.  So, I would suggest (although of course it's up to
you) that you might want to focus on experiments that will help you
understand where the problems are before you plunge into writing code
to fix them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Clock with Adaptive Replacement

From: Konstantin Knizhnik
Date:
Thank you very much for the response.

I am not sure that CART can significantly improve PostgreSQL
performance - I just want to know the opinion of the community about
CAR/CART and other possible alternatives to the GCLOCK algorithm.

It looks like CAR really does provide a better cache hit ratio, and so on some
workloads it should increase Postgres performance.
But nowadays the amount of memory in servers is large enough to keep
most typical databases entirely in cache.
So the time to locate a buffer in the cache is more critical than the time for
buffer eviction.
And here CART doesn't provide any benefit compared with the GCLOCK algorithm.

One of the problems with the GCLOCK algorithm, from my point of view, is that
for large caches containing a large number of pages, locating a victim page
can take a substantial amount of time, because we may have to perform several
turnovers of the clock before some usage count becomes zero.  In theory CART can
address this problem because there are no counters - just a single bit per page.
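
For illustration, here is a rough sketch of that victim search (simplified
pseudo-C in the spirit of StrategyGetBuffer(); the real code has locking and
a free list that are elided here):

    /* Simplified GCLOCK victim search: each full turnover of the
     * clock decrements every usage count by one, so a cache where
     * counts have converged to 5 can need several complete sweeps
     * over NBuffers entries before a victim appears. */
    static BufferDesc *
    clock_sweep_victim(void)
    {
        for (;;)
        {
            BufferDesc *buf = &BufferDescriptors[next_victim];

            next_victim = (next_victim + 1) % NBuffers;

            if (buf->refcount == 0)
            {
                if (buf->usage_count == 0)
                    return buf;         /* found a victim */
                buf->usage_count--;     /* decay and keep sweeping */
            }
        }
    }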




On 12.02.2016 18:55, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> What do you think about improving the cache replacement clock-sweep algorithm in
>> PostgreSQL with the adaptive version proposed in this article:
>>
>>      http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
>>
>> Are there any well-known drawbacks to this approach, or would it be
>> interesting to adapt this algorithm to PostgreSQL and measure its impact on
>> performance under different workloads?
>> I found this ten-year-old thread:
>>
>> http://www.postgresql.org/message-id/flat/d2jkde$6bg$1@sea.gmane.org#d2jkde$6bg$1@sea.gmane.org
>>
>> but it mostly discusses possible patent issues with another algorithm, ARC (CAR
>> is inspired by ARC, but it is a different algorithm).
>> As far as I know there are several problems with the current clock-sweep
>> algorithm in PostgreSQL, especially for very large caches.
>> Maybe CAR can address some of them?
> Maybe, but the proof of the pudding is in the eating.  Just because an
> algorithm is smarter, newer, and better in general than our current
> algorithm - and really, it wouldn't be hard - doesn't mean that it
> will actually solve the problems we care about.  A few of my
> EnterpriseDB colleagues spent a lot of time benchmarking various
> tweaks to our current algorithm last year and were unable to construct
> a test case where it sped anything up.  If they tried the same tweaks
> against the 9.4 source base, they could get a speedup.  But 9.5 had
> locking improvements around buffer eviction, and with those
> improvements committed there was no longer any measurable benefit to
> improving the quality of buffer eviction decisions.  That's a
> surprising result, to me anyway, and somebody else might well find a
> test case where a benefit can be shown - but our research was not
> successful.
>
> I think it's important to spend time and energy figuring out exactly
> what the problems with our current algorithm are.  We know in general
> terms that usage counts tend to converge to either 5 or 0 and
> therefore sometimes evict buffers both at great cost and almost
> randomly.  But what's a lot less clear is how much that actually hurts
> us given that we are relying on the OS cache anyway.  It may be that
> we need to fix some other things before or after improving the buffer
> eviction algorithm before we actually get a performance benefit.  I
> suspect, for example, that a lot of the problems with large
> shared_buffers settings have to do with the bgwriter and checkpointer
> behavior rather than with the buffer eviction algorithm; and that
> others have to do with cache duplication between PostgreSQL and the
> operating system.  So, I would suggest (although of course it's up to
> you) that you might want to focus on experiments that will help you
> understand where the problems are before you plunge into writing code
> to fix them.
>

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Clock with Adaptive Replacement

From: Jim Nasby
Date:
On 2/12/16 9:55 AM, Robert Haas wrote:
> I think it's important to spend time and energy figuring out exactly
> what the problems with our current algorithm are.  We know in general
> terms that usage counts tend to converge to either 5 or 0 and
> therefore sometimes evict buffers both at great cost and almost

Has anyone done testing on the best cap for the usage count? IIRC 5 was
pulled out of thin air. Actually, I don't recall ever seeing a clock
sweep that supported more than a single bit, though often there are
multiple 'pools' a buffer could be in (i.e., active vs inactive in most
unix VMs).

If you have a reasonable number of 1 or 0 count buffers then this
probably doesn't matter too much, but if your working set is
significantly larger than shared buffers then you're probably doing a
LOT of full sweeps to try and get something decremented down to 0.

> randomly.  But what's a lot less clear is how much that actually hurts
> us given that we are relying on the OS cache anyway.  It may be that
> we need to fix some other things before or after improving the buffer
> eviction algorithm before we actually get a performance benefit.  I
> suspect, for example, that a lot of the problems with large
> shared_buffers settings have to do with the bgwriter and checkpointer
> behavior rather than with the buffer eviction algorithm; and that
> others have to do with cache duplication between PostgreSQL and the
> operating system.  So, I would suggest (although of course it's up to

It would be nice if there were at least an option to instrument how long
an OS read request took, so that you could gauge how many requests were
being served by the OS cache vs the disk. (Obviously direct knowledge from
the OS is even better, but I don't think those APIs exist.)
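
Something like this, perhaps (a minimal sketch rather than a workable
patch; the counters are hypothetical and the 100us threshold would need
tuning per system):

    #include <time.h>

    /* Around the FileRead() call in mdread(): time the request and
     * guess that very fast completions were OS page cache hits. */
    struct timespec t0, t1;
    long            elapsed_us;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    elapsed_us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                 (t1.tv_nsec - t0.tv_nsec) / 1000L;

    if (elapsed_us < 100)
        count_probable_os_cache_hit();      /* hypothetical counter */
    else
        count_probable_disk_read();         /* hypothetical counter */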

> you) that you might want to focus on experiments that will help you
> understand where the problems are before you plunge into writing code
> to fix them.

+1
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Clock with Adaptive Replacement

From: Tom Lane
Date:
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 2/12/16 9:55 AM, Robert Haas wrote:
>> I think it's important to spend time and energy figuring out exactly
>> what the problems with our current algorithm are.  We know in general
>> terms that usage counts tend to converge to either 5 or 0 and
>> therefore sometimes evict buffers both at great cost and almost

> Has anyone done testing on the best cap for the usage count? IIRC 5 was
> pulled out of thin air.

My recollection is that there was some testing behind it ... but that
was back around 2005 so it seems safe to assume that that testing
is no longer terribly relevant.  In particular, I'm sure it was tested
with shared_buffer counts far smaller than what we now consider sane.
        regards, tom lane



Re: [HACKERS] Clock with Adaptive Replacement

From: Thomas Munro
Date:
On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
> Hi hackers,
>
> What do you think about improving the cache replacement clock-sweep algorithm in
> PostgreSQL with the adaptive version proposed in this article:
>
>     http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
>
> Are there any well-known drawbacks to this approach, or would it be
> interesting to adapt this algorithm to PostgreSQL and measure its impact on
> performance under different workloads?

I'm not currently planning to work in this area and have done no real
investigation, so please consider the following to be "water cooler
talk".

I also came across that paper while reading about buffering as general
background.  Yeah, CAR[T] looks pretty interesting at first glance.
Automatic scan resistance seems like something we'd want, and its approach
to frequency sounds smarter than GCLOCK's usage counter with its arbitrary
parameter of 5.  Here's a nice slide deck from the authors:

http://www.cse.iitd.ernet.in/~sbansal/pubs/fast04ppt.pdf

There doesn't seem to be much written about GCLOCK beyond the 1992
TPC-A paper[1], possibly because as the CAR paper says "[a]
fundamental disadvantage of GCLOCK is that it requires counter
increment on every page hit which makes it infeasible for virtual
memory", making it less interesting to the OS guys.  That is, we
actually have slightly more information, because we have an
opportunity to bump the usage counter when pinning and the VM guys
don't, probably explaining why the paper compares with CLOCK and not
GCLOCK.
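
(Concretely, the per-hit work differs roughly like this -- a sketch, not
actual PostgreSQL code, though the GCLOCK-style bump does happen in our
PinBuffer():)

    /* CLOCK: a hit only sets a reference bit -- idempotent and cheap
     * enough that a VM can afford it on every access. */
    buf->referenced = true;

    /* GCLOCK: a hit increments a capped counter -- a read-modify-write
     * that VM designers can't afford per access, but a buffer manager
     * can do while pinning (our cap is BM_MAX_USAGE_COUNT = 5). */
    if (buf->usage_count < BM_MAX_USAGE_COUNT)
        buf->usage_count++;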

One question is whether SPC-1 looks anything like a typical database
workload.  Say someone wrote a prototype CAR[T] patch for PostgreSQL,
how would you validate it?

I'm also curious about how the algorithms at different levels interact
when using buffered IO.  While other databases typically do direct IO,
you can still find a trail of papers about this topic since disk
arrays and network-attached storage systems also have caches and page
replacement problems[2][3], and of course multi-level caching problems
apply to non-database applications too.  I wonder to what extent
Linux's use of "a mash of a number of different algorithms with a
number of modifications for catching corner cases and various
optimisations"[4] affects the performance of different clock-based
algorithms operating higher up.

If we had a page replacement algorithm with CART's magical claimed
properties and it really does perform better than GCLOCK + our scan
resistance special cases, I wonder if it would be tempting to stick
SLRU contents into shared buffers.  slru.c could gain a
smgr-compatible interface so that bufmgr.c could treat those buffers
like any other, using smgr_which to select the appropriate storage
manager.  (I'm going to be proposing something similar for undo log
storage, more on that later.)

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.9699&rep=rep1&type=pdf
[2] https://www.usenix.org/legacy/event/usenix01/full_papers/zhou/zhou.pdf
[3] https://www.usenix.org/legacy/event/usenix02/full_papers/wong/wong_html/
[4] http://www.spinics.net/lists/linux-mm/msg13385.html

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From: Andrey Borodin
Date:
Hi, Thomas!

> On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Are there any well-known drawbacks to this approach, or would it be
>> interesting to adapt this algorithm to PostgreSQL and measure its impact on
>> performance under different workloads?
>
> I'm not currently planning to work in this area and have done no real
> investigation, so please consider the following to be "water cooler
> talk".

I have intentions to make some prototypes in this area, but I still haven't allocated time chunks sufficient to
make anything useful.

I think that a replacement for the current CS5 (the clock sweep with its usage-count cap of 5) will:
1. allow use of big shared buffers
2. make DIRECT_IO a realistic possibility
3. improve BgWriter
4. unify different buffering strategies into a single buffer manager (there will be no need to place VACUUM into a
special buffer ring)
5. finally allow aio and more efficient prefetching like [0]

Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big
enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data
blocks and is expected to be less efficient.


I'm not sure CART is the best possibility, though.
I think that the right way is to implement many prototypes with LRU, ARC, CAR, CART, and 2Q. Then, benchmark them well.
Or even make this algorithm pluggable? But currently we have a lot of dependent parts in the system. I do not even know
where to start.


Best regards, Andrey Borodin.


[0] http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
[1]
https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
[2] http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html

Re: [HACKERS] Clock with Adaptive Replacement

From: Юрий Соколов
Date:
On Tue, 24 Apr 2018 at 8:04, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi, Thomas!
>
>> On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>
>> On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
>> <k.knizhnik@postgrespro.ru> wrote:
>>> Are there any well-known drawbacks to this approach, or would it be
>>> interesting to adapt this algorithm to PostgreSQL and measure its impact on
>>> performance under different workloads?
>>
>> I'm not currently planning to work in this area and have done no real
>> investigation, so please consider the following to be "water cooler
>> talk".
>
> I have intentions to make some prototypes in this area, but I still haven't allocated time chunks sufficient to make anything useful.
>
> I think that a replacement for the current CS5 (the clock sweep with its usage-count cap of 5) will:
> 1. allow use of big shared buffers
> 2. make DIRECT_IO a realistic possibility
> 3. improve BgWriter
> 4. unify different buffering strategies into a single buffer manager (there will be no need to place VACUUM into a special buffer ring)
> 5. finally allow aio and more efficient prefetching like [0]
>
> Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.
>
> I'm not sure CART is the best possibility, though.
> I think that the right way is to implement many prototypes with LRU, ARC, CAR, CART, and 2Q. Then, benchmark them well. Or even make this algorithm pluggable? But currently we have a lot of dependent parts in the system. I do not even know where to start.
>
> Best regards, Andrey Borodin.
>
> [0] http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
> [1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
> [2] http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html

Before implementing algorithms within PostgreSQL, it would be great to test them outside with real traces.

I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It should be extremely efficient, so that it can be enabled in the real world. Something like a circular buffer in shared memory, and a separate worker to dump it to disk.
Instead of the full block address, a 64-bit hash could be used. Or even 63 bits of hash plus 1 bit to designate read/write access.

Using these traces, it will be easy to choose a couple of "theoretically" best algorithms, and then attempt to implement them.
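
A minimal sketch of the record format and the shared ring (all names here
are hypothetical; a real version would use PostgreSQL shared memory and
need a wraparound protocol between backends and the dumper):

    #include <stdatomic.h>
    #include <stdint.h>

    /* One trace record: 63 bits of block-address hash plus one bit
     * marking the access as a read (0) or a write (1). */
    static inline uint64_t
    trace_record(uint64_t block_hash, int is_write)
    {
        return (block_hash << 1) | (uint64_t) (is_write & 1);
    }

    #define TRACE_RING_SIZE (1 << 20)       /* power of two */

    typedef struct TraceRing
    {
        _Atomic uint64_t insert_pos;        /* claimed by backends */
        uint64_t         slots[TRACE_RING_SIZE];
    } TraceRing;

    /* Backends append with one atomic fetch-add; a separate worker
     * periodically dumps the ring to disk.  Records overwritten
     * before the dump are simply lost, which is acceptable here. */
    static void
    trace_append(TraceRing *ring, uint64_t block_hash, int is_write)
    {
        uint64_t pos = atomic_fetch_add(&ring->insert_pos, 1);

        ring->slots[pos & (TRACE_RING_SIZE - 1)] =
            trace_record(block_hash, is_write);
    }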

With regards, 
Yura

Re: [HACKERS] Clock with Adaptive Replacement

From: Andrey Borodin
Date:

> On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:
>
> Before implementing algorithms within PostgreSQL, it would be great to test them outside with real traces.
>
> I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write
> access. It should be extremely efficient, so that it can be enabled in the real world. Something like a circular
> buffer in shared memory, and a separate worker to dump it to disk.
> Instead of the full block address, a 64-bit hash could be used. Or even 63 bits of hash plus 1 bit to designate read/write access.
Yes, it is a good idea to track the pattern of block usage.
But I think that the cost of developing the actual page eviction strategy itself is negligibly small compared to the
infrastructure changes needed by any non-CS5 strategy.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From: Юрий Соколов
Date:


On Tue, 24 Apr 2018 at 15:16, Andrey Borodin <x4mmm@yandex-team.ru> wrote:


> On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:
>>
>> Before implementing algorithms within PostgreSQL, it would be great to test them outside with real traces.
>>
>> I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It should be extremely efficient, so that it can be enabled in the real world. Something like a circular buffer in shared memory, and a separate worker to dump it to disk.
>> Instead of the full block address, a 64-bit hash could be used. Or even 63 bits of hash plus 1 bit to designate read/write access.
>
> Yes, it is a good idea to track the pattern of block usage.
> But I think that the cost of developing the actual page eviction strategy itself is negligibly small compared to the infrastructure changes needed by any non-CS5 strategy.
>
> Best regards, Andrey Borodin.

Exactly. And it would be a pity if the implemented strategy turned out not to be "the best".

Different strategies may need different infrastructure. And implementing five infrastructures is certainly more expensive than choosing the two best strategies using traces and implementing those.

Re: [HACKERS] Clock with Adaptive Replacement

From: Andres Freund
Date:
Hi,

On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:
> But I think that the cost of developing the actual page eviction strategy
> itself is negligibly small compared to the infrastructure changes needed
> by any non-CS5 strategy.

What problems are you seeing? This isn't a lot of code?

Greetings,

Andres Freund


Re: [HACKERS] Clock with Adaptive Replacement

From: Andrey Borodin
Date:

> On 24 Apr 2018, at 23:14, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:
>> But I think that the cost of developing the actual page eviction strategy
>> itself is negligibly small compared to the infrastructure changes needed
>> by any non-CS5 strategy.
>
> What problems are you seeing? This isn't a lot of code?
1. Teaching BgWriter to use data from the eviction strategy to aggressively flush data to disk (instead of
++next_to_clean)
2. Implementing strategies as lock-free algorithms for the freelist
These parts seem most important for benchmarking.
Also:
3. Converting all rings to a single buffer manager where possible
4. Using O_DIRECT while writing data files
5. Using aio and scheduling of writes
These parts are not necessary, but are the most challenging, though not impossible.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From: Thomas Munro
Date:
On Tue, Apr 24, 2018 at 5:04 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big
> enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data
> blocks and is expected to be less efficient.
 

> [1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png

Interesting curve.  Can you explain it?  Here is my guess, assuming
that pgbench is generating uniformly distributed random access, and
assuming a 26GB dataset and 32GB of physical RAM:

1.  In the range 128MB to 6GB, only a small fraction of the dataset
fits into shared_buffers, but it doesn't matter much: you get a 100%
hit ratio from the kernel's page cache.

2.  In the range 6GB to 16GB, the dataset doesn't fit in shared_buffers
OR the kernel's page cache, so you begin to risk misses in both
caches.  Every byte you give to one you take away from the other, but
it might be worse than zero sum: at the lowest point in your graph, x
= 16GB, I suppose you have a ~61% hit ratio in shared buffers
(16GB/26GB), but when you miss you're also likely to miss in the OS
page cache too because that probably holds the most recently read 16GB
(= other half of physical RAM) of data... in other words the same
data, assuming uniform random access.  (If it were not random, then
some pages would have a greater chance of staying in PG's cache and
falling out of the kernel's cache, but essentially random replacement
prevents that (?))

3.  In the range 16GB to 26GB, the shared_buffers hit ratio increases
from ~61% to 100%.

I guess if you had 48GB of RAM the dip between 6GB and 26GB wouldn't happen?

Am I close?
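
(To formalize that guess: with dataset D = 26GB, physical RAM M = 32GB,
shared_buffers = s, uniform access, and the assumption that the smaller
cache's contents largely duplicate the larger's, the fraction of accesses
that reach the disk is roughly

    P(disk) ~= 1 - max(s, M - s) / D        (clamped to [0, 1])

which is ~0 for s <= 6GB, peaks at s = M/2 = 16GB where it gives
1 - 16/26 ~= 0.38, and falls back to 0 as s approaches 26GB -- exactly the
U shape in the graph.)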

> I'm not sure CART is the best possibility, though.

I don't know either, but LRU and ARC don't seem promising for various
reasons.  Aside from the scan resistance claim, it sounds like CART
should avoid long searches for a buffer.  In my primitive attempts to
understand that a bit better, I tried instrumenting freelist.c to tell
dtrace how far it had to move the clock hand to find a buffer, and I
got histograms like this (this example from a 1GB dataset, 512MB
shared buffers, pgbench select-only, and the most frequent numbers
were in the 4-7 bucket but there were 3 freak cases in the 16384-32767
bucket):

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@                                     65535
               1 |@@@@@                                    89190
               2 |@@@@@@@@@                                139114
               4 |@@@@@@@@@@@                              170881
               8 |@@@@@@@@                                 132579
              16 |@@@                                      47731
              32 |                                         5530
              64 |                                         137
             128 |                                         1
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         3
           32768 |                                         0

You can see just a few outliers there.  With smaller shared buffers
the counts are bigger but the average hand movement distance gets
smaller (here 128MB shared buffers):

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         16383
               1 |@@@@@@@@@@@@@@@                          1289265
               2 |@@@@@@@@@@@@@@@                          1270531
               4 |@@@@@@@@                                 659389
               8 |@                                        111091
              16 |                                         2815
              32 |                                         19
              64 |                                         1
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         2
            4096 |                                         2
            8192 |                                         0


I think pgbench isn't a very interesting workload though.  I don't
have time right now but it would be fun to dig further and try to
construct workloads that generate pathological searches for buffers (I
believe that such workloads exist, from anecdotal reports).
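
For anyone who wants to reproduce this, the instrumentation amounts to
counting iterations of the eviction loop, along these lines (a sketch;
TRACE_CLOCK_SWEEP_DISTANCE stands in for whatever dtrace probe or counter
you prefer -- it is not an existing probe):

    /* In StrategyGetBuffer()'s sweep loop: count how many buffers the
     * hand passes over before finding an evictable one, then report
     * the distance so a tracing tool can aggregate it. */
    int         distance = 0;
    BufferDesc *buf;

    for (;;)
    {
        buf = next_clock_buffer();      /* advance the hand */
        distance++;

        if (buf->refcount == 0 && buf->usage_count == 0)
            break;                      /* victim found */
        /* ... otherwise decrement usage_count as usual ... */
    }

    TRACE_CLOCK_SWEEP_DISTANCE(distance);   /* hypothetical probe */

dtrace's quantize() aggregation then produces power-of-two buckets like
the ones shown above.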

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From: Peter Geoghegan
Date:
On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I think pgbench isn't a very interesting workload though.  I don't
> have time right now but it would be fun to dig further and try to
> construct workloads that generate pathological searches for buffers (I
> believe that such workloads exist, from anecdotal reports).

While pgbench might not be the most interesting workload, I think even
its super-synthetic, uniformly distributed point lookups can benefit
from better caching decisions. The LRU-K paper's example 1.1 explains
very cogently why this might be: point lookups from TPC-A (or TPC-C)
consist of descending the index, and then accessing the record in the
main table from a TID, which actually implies quite a non-uniform
access pattern for individual pages, despite accessing table records
uniformly.

There are a negligible number of internal pages involved with pgbench
(and they should always be cached), so they can almost be ignored.
It's really just a matter of what proportion of shared_buffers should
be leaf pages versus heap pages. The pgbench accounts table is
approximately 6x the size of its primary key, so the leaf blocks have
6x the utility of heap blocks for this workload. It actually is pretty
much that simple. Or it would be, if we didn't have the OS cache as a
confounding factor. The optimal strategy for the pgbench workload as
far as maximizing hits in shared_buffers goes is to *only* cache leaf
blocks, at least until you have all leaf blocks cached; we should
evict table blocks *immediately*. This isn't even remotely close to
the behavior of an LRU, of course. Actually, as the LRU-K paper also
points out, an LRU has a tendency to perversely somewhat *favor* the
table blocks, since presumably each point lookup's table block is
accessed after its leaf block is pinned/accessed.
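
To put rough numbers on that 6x figure (approximate sizes, from memory, so
treat them as illustrative): a pgbench_accounts row is on the order of 120
bytes including header, while a B-Tree leaf entry is on the order of 20
bytes, so

    heap tuples per 8KB page   ~= 8192 / 120 ~= 68
    index entries per 8KB leaf ~= 8192 / 20  ~= 410  (less at typical fill factor)

and with uniform point lookups a cached leaf page therefore absorbs
roughly six times as many future accesses as a cached heap page.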

Before some of the big shared_buffers bottlenecks were alleviated
several years ago, it was possible to observe shared_buffers evictions
occurring essentially at random. I have no idea if that's still true,
but it could be. A random replacement strategy is actually not as bad
as it sounds, but it is still pretty bad, especially given that we
could pay a high price in contention in return for such a bad
strategy. This convinces me that we will not be able to assess any
algorithm well in isolation, through simulations, because lock
contention is too important overall, and can actually affect the cache
hit ratio of a clock-based approach. I believe that it's generally
true that our clock sweep algorithm approximates an LRU, but I have
also seen it totally fail to do that, which it sounds like you've
heard about, too. Andrey is probably correct in his suspicion that
we'll end up prototyping a number of approaches.

I'm glad that you're thinking about this, in any case. Seems like a
promising area to work on.

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From: Thomas Munro
Date:
On Thu, Apr 26, 2018 at 10:54 AM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> I think pgbench isn't a very interesting workload though.  I don't
>> have time right now but it would be fun to dig further and try to
>> construct workloads that generate pathological searches for buffers (I
>> believe that such workloads exist, from anecdotal reports).
>
> While pgbench might not be the most interesting workload, I think even
> its super-synthetic, uniformly distributed point lookups can benefit
> from better caching decisions. The LRU-K paper's example 1.1 explains
> very cogently why this might be: point lookups from TPC-A (or TPC-C)
> consist of descending the index, and then accessing the record in the
> main table from a TID, which actually implies quite a non-uniform
> access pattern for individual pages, despite accessing table records
> uniformly.
>
> There are a negligible number of internal pages involved with pgbench
> (and they should always be cached), so they can almost be ignored.
> It's really just a matter of what proportion of shared_buffers should
> be leaf pages versus heap pages. The pgbench accounts table is
> approximately 6x the size of its primary key, so the leaf blocks have
> 6x the utility of heap blocks for this workload. It actually is pretty
> much that simple. Or it would be, if we didn't have the OS cache as a
> confounding factor. The optimal strategy for the pgbench workload as
> far as maximizing hits in shared_buffers goes is to *only* cache leaf
> blocks, at least until you have all leaf blocks cached; we should
> evict table blocks *immediately*. This isn't even remotely close to
> the behavior of an LRU, of course. Actually, as the LRU-K paper also
> points out, an LRU has a tendency to perversely somewhat *favor* the
> table blocks, since presumably each point lookup's table block is
> accessed after its leaf block is pinned/accessed.

Huh.  Right.  So it's not truly uniform.  I wonder if any of these
algorithms are sensitive to the higher value of the leaf pages than
the heap pages.  It seems quite subtle: even though we can see that
the individual leaf pages must be 6x more likely to be accessed again
next time than individual heap pages due to their higher tuple
density, they're still very unlikely to be accessed again soon, and
the question is whether any of these algorithms can track that for
long enough to see the difference between two very low access
frequencies, in a huge field of unlikely-to-be-accessed pages.  LRU,
by not even attempting to model frequency, is I guess uniquely
affected by the heap-after-index-leaf thing.  (I haven't read the
LRU-K paper... I have a lot to learn before I could contribute much
here.  Added to pile.  Thanks.)

A thought experiment about the U-shaped performance when your dataset
fits in neither PG nor kernel cache, but would fit entirely in
physical memory if you made either of the two caches big enough:  I
suppose when you read a page in, you could tell the kernel that you
POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
advance somehow), on the theory that the coldest stuff in the PG cache
should now become the hottest stuff in the OS cache.  Of course that
sucks, because the best the kernel can do then is go and read it from
disk, and the goal is to avoid IO.  Given a hypothetical way to
"write" "clean" data to the kernel (so it wouldn't mark it dirty and
generate IO, but it would let you read it back without generating IO
if you're lucky), then perhaps you could actually achieve exclusive
caching at the two levels, and use all your physical RAM without
duplication.  Then perhaps the U shape would go away, and the curve
would instead go up all the way until shared buffers is big enough to
hold the whole dataset?
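
(The readable half of that is expressible today, for what it's worth -- a
sketch using the real posix_fadvise() API, with BLCKSZ as in PostgreSQL;
the "write clean data into the kernel cache" half has no counterpart,
which is exactly the missing piece:)

    #include <fcntl.h>

    /* After reading a block into shared_buffers: the kernel's copy is
     * now redundant, so hint that it can be dropped. */
    static void
    hint_after_read(int fd, off_t offset)
    {
        (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_DONTNEED);
    }

    /* Before evicting a clean buffer: hint that we may soon want the
     * old contents again.  Advisory only: at best this triggers
     * readahead from disk, which is why it can't give us true
     * exclusive two-level caching. */
    static void
    hint_before_evict(int fd, off_t offset)
    {
        (void) posix_fadvise(fd, offset, BLCKSZ, POSIX_FADV_WILLNEED);
    }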

> I'm glad that you're thinking about this, in any case. Seems like a
> promising area to work on.

It's a promising area that I'm interested in learning more about, but
I'm not promising to work on it :-)

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From: Peter Geoghegan
Date:
On Wed, Apr 25, 2018 at 6:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Huh.  Right.  So it's not truly uniform.  I wonder if any of these
> algorithms are sensitive to the higher value of the leaf pages than
> the heap pages.  It seems quite subtle: even though we can see that
> the individual leaf pages must be 6x more likely to be accessed again
> next time than individual heap pages due to their higher tuple
> density, they're still very unlikely to be accessed again soon, and
> the question is whether any of these algorithms can track that for
> long enough to see the difference between two very low access
> frequencies, in a huge field of unlikely-to-be-accessed pages.  LRU,
> by not even attempting to model frequency, is I guess uniquely
> affected by the heap-after-index-leaf thing.

Right. Another insight is that it's worth considering weighting the
type of page involved, to artificially favor indexes to some degree.
I'm not saying that it's a good idea, and it's pretty inelegant. But
I'm pretty sure it's been done before, with satisfactory results.

> A thought experiment about the U-shaped performance when your dataset
> fits in neither PG nor kernel cache, but would fit entirely in
> physical memory if you made either of the two caches big enough:  I
> suppose when you read a page in, you could tell the kernel that you
> POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
> tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
> advance somehow), on the theory that the coldest stuff in the PG cache
> should now become the hottest stuff in the OS cache.  Of course that
> sucks, because the best the kernel can do then is go and read it from
> disk, and the goal is to avoid IO.

Not sure about that. I will say that the intuition that this is a good
area to work on is based on the challenges that we have with
shared_buffers in particular. The fact that shared buffers is
typically sized no larger than 16GB, say, and yet main memory sizes
can now easily be far larger is clearly a problem, and one that we're
going to have to get around to addressing directly.

ISTM that the whole shared_buffers issue should have little influence
on how much we're willing to remember about block popularity; why
should we be willing to track information about only the exact number
of blocks that will fit in shared_buffers at any one time? As you
know, ARC/CAR explicitly models some blocks that are not cache
resident as being within two ghost lists. Sounds like something that
might have a larger-than-expected benefit for us.

How much of a problem is it that we waste memory bandwidth copying to
and from OS cache, particularly on large systems? Might that be the
bigger problem, but also one that can be addressed incrementally?

I think that I may be repeating some of what Andrey said, in another
way (not sure of that).

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From: Robert Haas
Date:
On Wed, Apr 25, 2018 at 6:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Before some of the big shared_buffers bottlenecks were alleviated
> several years ago, it was possible to observe shared_buffers evictions
> occurring essentially at random. I have no idea if that's still true,
> but it could be.

I think it is.  We haven't done anything to address it.  I think if we
want to move to direct I/O -- which may be something we need or want
to do -- we're going to have to work a lot harder at making good page
eviction decisions.  Your patch to change the page eviction algorithm
didn't help noticeably once we eliminated the contention around buffer
eviction, but that's just because the cost of a bad eviction went
down, not because we stopped doing bad evictions.  I think it would be
interesting to insert a usleep() call into mdread() and then test
various buffer eviction algorithms with that in place.

I'm personally not very excited about making rules like "index pages
are more valuable than heap pages".  Such rules will in many cases be
true, but it's easy to come up with cases where they don't hold: for
example, we might run pgbench for a while and then stop running
pgbench and start running big sequential scans for reporting purposes.
We don't want to artificially pin the index buffers in shared_buffers
just because they're index pages; we want to figure out which pages
really matter.  Now, I realize that you weren't proposing (and
wouldn't propose) a rule that index pages never get evicted, but I
think that favoring index pages even in some milder way is basically a
hack.  Index pages aren't *intrinsically* more valuable; they are more
valuable because they will, in many workloads, be accessed more often
than heap pages.  A good algorithm ought to be able to figure that out
based on the access pattern, without being explicitly given a hint, I
think.

I believe the root of the problem here is that the usage count we have
today does a very poor job distinguishing what's hot from what's not.
There have been previous experiments around making usage_count use
some kind of a log scale: we make the maximum, say, 32, and the clock
hand divides by 2 instead of subtracting 1.  I don't think those
experiments were enormously successful and I suspect that a big part
of the reason is that it's still pretty easy to get to a state where
the counters are maxed out for a large number of buffers, and at that
point you can't distinguish between those buffers any more: they all
look equally hot.  We need something better.  If a system like this is
working properly, things like interior index pages and visibility map
pages ought to show up as fiery hot on workloads where the index or
visibility map, as the case may be, is heavily used.
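
(For concreteness, that log-scale experiment amounts to something like the
following sketch; the real usage count lives in packed atomic buffer
state, and BM_MAX_USAGE_COUNT_LOG is a made-up name:)

    #define BM_MAX_USAGE_COUNT_LOG 32

    /* On buffer hit: saturating increment, as today but with a
     * higher cap. */
    if (buf->usage_count < BM_MAX_USAGE_COUNT_LOG)
        buf->usage_count++;

    /* When the clock hand passes: halve instead of subtracting one,
     * so a maxed-out buffer survives ~log2(32) sweeps rather than 32,
     * but buffers that max out still become indistinguishable. */
    buf->usage_count >>= 1;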

A related problem is that user-connected backends end up doing a lot
of buffer eviction themselves on many workloads.  Maybe the
bgreclaimer patch Amit wrote a few years ago could help somehow.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From: Peter Geoghegan
Date:
On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think it is.  We haven't done anything to address it.  I think if we
> want to move to direct I/O -- which may be something we need or want
> to do -- we're going to have to work a lot harder at making good page
> eviction decisions.

+1

> Your patch to change the page eviction algorithm
> didn't help noticeably once we eliminated the contention around buffer
> eviction, but that's just because the cost of a bad eviction went
> down, not because we stopped doing bad evictions.

The goal of that patch, which was literally written over a wet
weekend, was to make people realize that we were missing something
there. I think that it accomplished that much.

> I think it would be
> interesting to insert a usleep() call into mdread() and then test
> various buffer eviction algorithms with that in place.

That could be a useful strategy.

> I'm personally not very excited about making rules like "index pages
> are more valuable than heap pages".  Such rules will in many cases be
> true, but it's easy to come up with cases where they don't hold: for
> example, we might run pgbench for a while and then stop running
> pgbench and start running big sequential scans for reporting purposes.

I am not in favor of this general approach. Actually, I'm willing to
come out against it strongly right now: let's not teach the buffer
manager to distinguish between the AM of a block.

> We don't want to artificially pin the index buffers in shared_buffers
> just because they're index pages; we want to figure out which pages
> really matter.  Now, I realize that you weren't proposing (and
> wouldn't propose) a rule that index pages never get evicted, but I
> think that favoring index pages even in some milder way is basically a
> hack.  Index pages aren't *intrinsically* more valuable; they are more
> valuable because they will, in many workloads, be accessed more often
> than heap pages.  A good algorithm ought to be able to figure that out
> based on the access pattern, without being explicitly given a hint, I
> think.

I won't argue against any of that, but I think that it's rather
nuanced, and that the nuance probably matters a lot.

First of all, we must have a sense of proportion about what is likely
to be true in the average case. I mentioned to Thomas that the pgbench
accounts table is 6x the size of its primary key, and that with a uniform
distribution + point lookups the leaf pages literally have 6x the
utility of the heap pages. It really is that simple there. Also,
whatever the distribution of the point lookups is, "cache more leaf
pages" is probably going to be the effect that a better caching
strategy would have. Even 6x is probably an underestimate of the
utility of leaf pages compared to heap pages in the average case. I
bet it's more like 10x for a typical OLTP app.

Second of all, we must have a sense of perspective about what we're
trying to do here, which is to predict the future based on the past.
The best outcome we can hope for is to lose less on average without
hurting anything that already works well enough.

> I believe the root of the problem here is that the usage count we have
> today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember
anything about evicted buffers. The same block can bounce in and out
of shared_buffers repeatedly, and we don't recognize that. There are
probably second-order effects from sizing shared_buffers. We
needlessly tie the number of buffers to our capacity to remember
things about block popularity. That would be bad if we had no FS
cache, but with the FS cache it seems awful.

I share the intuition that we need something adaptive, that can be
analysed and understood relatively easily, and has little to no magic.
That's why I'm willing to say now that we shouldn't do anything based
on the type of the blocks involved. However, I think that that model
might work about as well, and that this matters because it provides
motivating examples. Surely you're right when you say that index
blocks are not intrinsically more useful than any of type of block,
but if they're 10x more useful on average, and 20x more useful at
other times, then a person could be forgiven for modelling them as
intrinsically more useful.

It's also useful to examine why a random replacement strategy is
merely mediocre to bad, and not abysmal. That's what every replacement
strategy ends up being with enough cache pressure anyway. This makes
it more reasonable to focus on the average case than it is in other
places.

> There have been previous experiments around making usage_count use
> some kind of a log scale: we make the maximum, say, 32, and the clock
> hand divides by 2 instead of subtracting 1.  I don't think those
> experiments were enormously successful and I suspect that a big part
> of the reason is that it's still pretty easy to get to a state where
> the counters are maxed out for a large number of buffers, and at that
> point you can't distinguish between those buffers any more: they all
> look equally hot.  We need something better.  If a system like this is
> working properly, things like interior index pages and visibility map
> pages ought to show up as fiery hot on workloads where the index or
> visibility map, as the case may be, is heavily used.

I don't think that internal index pages and the visibility map are the
problem, since there are so few of them, and they are accessed at least
hundreds of times more frequently than the leaf pages.

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From: Stephen Frost
Date:
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I'm personally not very excited about making rules like "index pages
> > are more valuable than heap pages".  Such rules will in many cases be
> > true, but it's easy to come up with cases where they don't hold: for
> > example, we might run pgbench for a while and then stop running
> > pgbench and start running big sequential scans for reporting purposes.
>
> I am not in favor of this general approach. Actually, I'm willing to
> come out against it strongly right now: let's not teach the buffer
> manager to distinguish between the AM of a block.

Perhaps I'm misunderstanding the case here, but with the ring buffer we
use for sequential access, aren't we already discouraging keeping heap
pages around, particularly when they're coming from a sequential scan?

I'm not suggesting that's a bad thing either, in fact it seems like a
generally good thing and I don't think we get many complaints about it,
I'm just trying to point out that we already have a distinction between
heap-pages-from-seq-scans and index pages (and heap pages from using
those indexes) and therefore I'm not convinced by an argument that we
shouldn't ever treat pages from the heap differently than pages from the
indexes.

This could change when considering an environment where we aren't trying
to take advantage of the kernel's buffer cache and things like its
read-ahead, and maybe that's what you're getting at here, but I would
also point out that we have more inherent understanding of how pages are
likely to be accessed than the kernel does and we shouldn't be
explicitly trying to avoid taking advantage of that knowledge.

> > We don't want to artificially pin the index buffers in shared_buffers
> > just because they're index pages; we want to figure out which pages
> > really matter.  Now, I realize that you weren't proposing (and
> > wouldn't propose) a rule that index pages never get evicted, but I
> > think that favoring index pages even in some milder way is basically a
> > hack.  Index pages aren't *intrinsically* more valuable; they are more
> > valuable because they will, in many workloads, be accessed more often
> > than heap pages.  A good algorithm ought to be able to figure that out
> > based on the access pattern, without being explicitly given a hint, I
> > think.
>
> I won't argue against any of that, but I think that it's rather
> nuanced, and that the nuance probably matters a lot.
>
> First of all, we must have a sense of proportion about what is likely
> to be true in the average case. I mentioned to Thomas that the pgbench
> accounts table is 6x the size of its primary key, and that with a uniform
> distribution + point lookups the leaf pages literally have 6x the
> utility of the heap pages. It really is that simple there. Also,
> whatever the distribution of the point lookups is, "cache more leaf
> pages" is probably going to be the effect that a better caching
> strategy would have. Even 6x is probably an underestimate of the
> utility of leaf pages compared to heap pages in the average case. I
> bet it's more like 10x for a typical OLTP app.

I agree that we should definitely be thinking about data density here-
and that's another case where we have more information about the data
than the kernel does (which I compare to as it tries to predict value
also but doesn't have as much information).  Another consideration here
to think about would be internal pages rather than just leaf pages-
those have an even higher likelihood of being needed again quickly as
they need to be traversed much more than leaf pages.

> Second of all, we must have a sense of perspective about what we're
> trying to do here, which is to predict the future based on the past.
> The best outcome we can hope for is to lose less on average without
> hurting anything that already works well enough.

Agreed.

> > I believe the root of the problem here is that the usage count we have
> > today does a very poor job distinguishing what's hot from what's not.
>
> I think that the root of the problem is that we don't remember
> anything about evicted buffers. The same block can bounce in and out
> of shared_buffers repeatedly, and we don't recognize that. There are
> probably second-order effects from sizing shared_buffers. We
> needlessly tie the number of buffers to our capacity to remember
> things about block popularity. That would be bad if we had no FS
> cache, but with the FS cache it seems awful.

This is definitely an interesting and valuable point.  Having pages
bouncing in and out of shared buffers means that we aren't properly
tracking their value.  Just brainstorming, but I wonder if there's a way
we could track their value even when we've evicted them, without slowing
down the entire system.  Not having any great ideas off-hand for that
but some kind of "we last accessed this block an hour ago" and "we last
accessed this block half a millisecond ago" might be interesting to try
to avoid such ping-ponging.  I wonder if there are known strategies for
addressing this, perhaps by having a data structure which effectively
tracks hit rates for all potential blocks...

> I share the intuition that we need something adaptive, that can be
> analysed and understood relatively easily, and has little to no magic.
> That's why I'm willing to say now that we shouldn't do anything based
> on the type of the blocks involved. However, I think that that model
> might work about as well, and that this matters because it provides
> motivating examples. Surely you're right when you say that index
> blocks are not intrinsically more useful than any of type of block,
> but if they're 10x more useful on average, and 20x more useful at
> other times, then a person could be forgiven for modelling them as
> intrinsically more useful.

If we can understand that they're, on average, 10x-20x more useful, then
ignoring that seems like we're throwing away information that we
effectively know about the future.

> It's also useful to examine why a random replacement strategy is
> merely mediocre to bad, and not abysmal. That's what every replacement
> strategy ends up being with enough cache pressure anyway. This makes
> it more reasonable to focus on the average case than it is in other
> places.

Every replacement strategy which doesn't care about the kind of block
that's being worked with will end up performing random replacement under
enough pressure.  We don't actually have to allow that to happen though
since we do have information about the kind of block and therefore it
seems like we should be able to keep things mediocre, and avoid letting
things get bad under such pressure.

> > There have been previous experiments around making usage_count use
> > some kind of a log scale: we make the maximum, say, 32, and the clock
> > hand divides by 2 instead of subtracting 1.  I don't think those
> > experiments were enormously successful and I suspect that a big part
> > of the reason is that it's still pretty easy to get to a state where
> > the counters are maxed out for a large number of buffers, and at that
> > point you can't distinguish between those buffers any more: they all
> > look equally hot.  We need something better.  If a system like this is
> > working properly, things like interior index pages and visibility map
> > pages ought to show up as fiery hot on workloads where the index or
> > visibility map, as the case may be, is heavily used.
>
> I don't think that internal index pages and the visibility map are the
> problem, since there are so few of them, and they are accessed at least
> hundreds of times more frequently than the leaf pages.

Yet they still end up getting evicted under a random replacement
strategy under sufficient pressure, because we don't distinguish them
from any other buffer, and that leads to high latency on queries which
should always be extremely fast.  As long as we're talking about a
system which can devolve into random replacement, we're going to have
those cases and we aren't going to have any answer for them.  I'm not
really thrilled with any approach that allows that kind of degradation.

Thanks!

Stephen


Re: [HACKERS] Clock with Adaptive Replacement

From: Thomas Munro
Date:
On Sun, Apr 29, 2018 at 2:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Peter Geoghegan (pg@bowt.ie) wrote:
>> On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> > I'm personally not very excited about making rules like "index pages
>> > are more valuable than heap pages".  Such rules will in many cases be
>> > true, but it's easy to come up with cases where they don't hold: for
>> > example, we might run pgbench for a while and then stop running
>> > pgbench and start running big sequential scans for reporting purposes.
>>
>> I am not in favor of this general approach. Actually, I'm willing to
>> come out against it strongly right now: let's not teach the buffer
>> manager to distinguish between the AM of a block.
>
> Perhaps I'm misunderstanding the case here, but with the ring buffer we
> use for sequential access, aren't we already discouraging keeping heap
> pages around, particularly when they're coming from a sequential scan?
>
> I'm not suggesting that's a bad thing either, in fact it seems like a
> generally good thing and I don't think we get many complaints about it,
> I'm just trying to point out that we already have a distinction between
> heap-pages-from-seq-scans and index pages (and heap pages from using
> those indexes) and therefore I'm not convinced by an argument that we
> shouldn't ever treat pages from the heap differently than pages from the
> indexes.

I think the idea is that CAR might be even better than our hard coded
BAS_BULKREAD, BAS_BULKWRITE, BAS_VACUUM rings because it has a kind of
dynamic self-tuning scan resistance.  It works by having one clock  (=
a circular list, not just a rotating hand in an array like our current
scheme) T1 that holds buffers that were touched once or twice, to
absorb scans.  It'll reclaim T1 buffers that were only accessed once
(but keep the descriptor for longer -- see below).  It will move
things to T2, a separate clock for valuable pages, after they've been
accessed twice in recent memory.  It dynamically adjusts the sizes of
T1 and T2 in a way that is supposedly beneficial.

[...]

>> > I believe the root of the problem here is that the usage count we have
>> > today does a very poor job distinguishing what's hot from what's not.
>>
>> I think that the root of the problem is that we don't remember
>> anything about evicted buffers. The same block can bounce in and out
>> of shared_buffers repeatedly, and we don't recognize that. There are
>> probably second-order effects from sizing shared_buffers. We
>> needlessly tie the number of buffers to our capacity to remember
>> things about block popularity. That would be bad if we had no FS
>> cache, but with the FS cache it seems awful.
>
> This is definitely an interesting and valuable point.  Having pages
> bouncing in and out of shared buffers means that we aren't properly
> tracking their value.  Just brainstorming, but I wonder if there's a way
> we could track their value even when we've evicted them, without slowing
> down the entire system.  Not having any great ideas off-hand for that
> but some kind of "we last accessed this block an hour ago" and "we last
> accessed this block half a millisecond ago" might be interesting to try
> to avoid such ping-ponging.  I wonder if there are known strategies for
> addressing this, perhaps by having a data structure which effectively
> tracks hit rates for all potential blocks...

That's what the CAR algorithm does.  It has twice as many
BufferDescriptors as buffers.  As well as the T1 and T2 clocks that
hold buffers, it also has doubly linked lists B1 and B2 which hold
BufferDescriptors with no buffers for things that recently fell out of
T1 and T2 to extend the algorithm's memory.  For example, if there is
a "miss" in B1 (meaning we found the page we're looking for in there
but it currently has no buffer, and it was recently evicted from T1),
then we can immediately find a buffer for it in T2 (it qualifies as
having been accessed twice in recent memory, and doesn't have to go
through T1 again where it risks early eviction).  Though of course it
too is finite.
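
To make that concrete, here is a toy sketch in C of CAR's request handling, following Bansal & Modha's FAST'04 paper: T1/T2 are the resident clocks, B1/B2 the ghost lists, and p the adaptive target for T1's size.  The flat-array lists and the simplified replace() (which ignores reference bits and, unlike real CAR, does not bound the ghost lists) are illustrative stand-ins, not proposed PostgreSQL code:

    /* Toy sketch of CAR (Bansal & Modha, FAST'04). */
    #include <stdbool.h>
    #include <stdio.h>

    #define CACHE 4                        /* tiny cache for demonstration */

    typedef struct { int page[64]; int n; } List;

    static List T1, T2;                    /* clocks holding resident pages */
    static List B1, B2;                    /* ghost lists: descriptors only */
    static int  p;                         /* adaptive target size of T1 */

    static bool contains(List *l, int pg)
    {
        for (int i = 0; i < l->n; i++)
            if (l->page[i] == pg)
                return true;
        return false;
    }

    static void insert(List *l, int pg) { l->page[l->n++] = pg; }

    static void removep(List *l, int pg)
    {
        for (int i = 0; i < l->n; i++)
            if (l->page[i] == pg) { l->page[i] = l->page[--l->n]; return; }
    }

    static int imax(int a, int b) { return a > b ? a : b; }
    static int imin(int a, int b) { return a < b ? a : b; }

    /* Make room: evict from T1 while it meets or exceeds its target p,
     * otherwise from T2, remembering the victim in the matching ghost list. */
    static void replace(void)
    {
        while (T1.n + T2.n >= CACHE)
        {
            if (T1.n >= imax(1, p))
            { int v = T1.page[0]; removep(&T1, v); insert(&B1, v); }
            else
            { int v = T2.page[0]; removep(&T2, v); insert(&B2, v); }
        }
    }

    static void request(int pg)
    {
        if (contains(&T1, pg) || contains(&T2, pg))
            return;                        /* hit: real CAR sets a ref bit */

        replace();                         /* miss: make room first */

        if (contains(&B1, pg))             /* ghost hit: T1 was too small */
        {
            p = imin(p + imax(1, B2.n / imax(1, B1.n)), CACHE);
            removep(&B1, pg);
            insert(&T2, pg);               /* twice-seen: placed in T2 */
        }
        else if (contains(&B2, pg))        /* ghost hit: T2 was squeezed */
        {
            p = imax(p - imax(1, B1.n / imax(1, B2.n)), 0);
            removep(&B2, pg);
            insert(&T2, pg);
        }
        else
            insert(&T1, pg);               /* cold page: must earn T2 */
    }

    int main(void)
    {
        int trace[] = {1, 2, 3, 4, 5, 1, 1, 6, 7, 1};
        for (int i = 0; i < 10; i++)
            request(trace[i]);
        printf("|T1|=%d |T2|=%d |B1|=%d |B2|=%d p=%d\n",
               T1.n, T2.n, B1.n, B2.n, p);
        return 0;
    }

The two p adjustments are the adaptive part: a ghost hit in B1 says "T1 was too small, grow it", and a ghost hit in B2 says the opposite.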

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From
Peter Geoghegan
Date:
On Sat, Apr 28, 2018 at 7:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Perhaps I'm misunderstanding the case here, but with the ring buffer we
> use for sequential access, aren't we already discouraging keeping heap
> pages around, particularly when they're coming from a sequential scan?
>
> I'm not suggesting that's a bad thing either, in fact it seems like a
> generally good thing and I don't think we get many complaints about it,
> I'm just trying to point out that we already have a distinction between
> heap-pages-from-seq-scans and index pages (and heap pages from using
> those indexes) and therefore I'm not convinced by an argument that we
> shouldn't ever treat pages from the heap differently than pages from the
> indexes.

The argument is not a slam dunk, to be sure. Since you bring it up, I
happen to think that the ring buffer/strategy scan thing is too
aggressive in a lot of cases. Why not cache more when the existing
contents of shared_buffers are of marginal importance? Concern about a
kind of "tyranny of the average case" seems fairly well founded to me.
I am most concerned about improving the average case in particular,
but also very concerned about not regressing atypical or bursty cases.

> This could change when considering an environment where we aren't trying
> to take advantage of the kernel's buffer cache and things like its
> read-ahead, and maybe that's what you're getting at here, but I would
> also point out that we have more inherent understanding of how pages are
> likely to be accessed than the kernel does and we shouldn't be
> explicitly trying to avoid taking advantage of that knowledge.

I'm definitely in favor of taking advantage of that knowledge. I just
think that we can derive the choice of which buffer to evict from high
level principles, that are relatively easy to understand in isolation.
This may matter later, when production system behavior needs to be
analyzed. I might agree to the idea of artificially favoring one type
of block over another if I thought that there was no better way, but I
think that there is. I also think that a generalized approach isn't
that much more difficult, particularly when you factor in the need to
convince other people.

> I agree that we should definitely be thinking about data density here-
> and that's another case where we have more information about the data
> than the kernel does (which I compare to as it tries to predict value
> also but doesn't have as much information).

Even my 10x figure is, in a way, very conservative. B-Tree index pages
have another obvious though potentially overlooked advantage, which is
that they naturally have spatial or logical locality. The only kind of
locality that heap pages can ever have is temporal locality. I
hesitate to say too much more about what must be true on average,
across the universe of Postgres installations, but I can safely say
this much: it's totally pedestrian and normal for *leaf* pages to have
more than 100x the utility of heap pages in some applications, since a
leaf page is not just "denser"; it may also have significant spatial
or logical locality, where heap pages have none at all.

There could very easily be an enormous range of usefulness among
buffers in shared_buffers. And yet, we only recognize the increments
that usage_count can represent (this is before we even think about the
actual problems with how usage_count is maintained). We're never going
to be able to give *radically* more useful leaf pages a concomitant
advantage during buffer eviction if we have to use some rule of thumb.

> Another consideration here
> to think about would be internal pages rather than just leaf pages-
> those have an even higher likelihood of being needed again quickly as
> they need to be traversed much more than leaf pages.

As I said, I'm not too worried about that case, because there are so
few internal pages -- we're probably talking about a small fraction of
1% of the total. It seems unlikely that there'd be so much cache
pressure that the cost of reading in an internal block is much worse
than the cost of reading in any other block. Main memory sizes are big
enough that we should be able to think about things at the level of
minutes (maybe we can't right now, but we *should* be able to).

(This probably doesn't matter much when it comes to finding a way
forward, so we can agree to disagree here without consequence.)

> This is definitely an interesting and valuable point.  Having pages
> bouncing in and out of shared buffers means that we aren't properly
> tracking their value.  Just brainstorming, but I wonder if there's a way
> we could track their value even when we've evicted them, without slowing
> down the entire system.  Not having any great ideas off-hand for that
> but some kind of "we last accessed this block an hour ago" and "we last
> accessed this block half a millisecond ago" might be interesting to try
> to avoid such ping-ponging.  I wonder if there are known strategies for
> addressing this, perhaps by having a data structure which effectively
> tracks hit rates for all potential blocks...

Well, ARC has its ghost lists (so does CAR, as Thomas pointed out just
now). ARC/CAR is adaptive, in that it drifts towards favoring either
recency or frequency for the system as a whole. Provided we can make
that scalable, it will probably work well for us. The bi-modal nature
of pages in Postgres seems to make ARC or CAR a good fit. Also, I
suspect that we can afford to use a somewhat less scalable approach,
since it doesn't matter how fast buffer eviction is if it's just
spinning its wheels, and because in recent years we've mitigated some
of the bottlenecks.

In a way, it's not at all surprising that we can end up with
usage_counts of 5 all around when shared_buffers is large. Blocks end
up "resting on their laurels"; they're impossible to evict based on a
large amount of blind luck, such as being involved in a burst of
activity some time ago (so much for clocksweep approximating an LRU;
this is actually more like an LFU). ARC/CAR moves a "recent" block to
the "frequent" list if the block gets referenced a second time after
an interval that is, in effect, self-correcting (it may or may not be
from the ghost recent list or the real buffer cache recent list --
either way it goes to the head of the frequent list). Moving a block
from the recent to the frequent list also enlarges the recent list
target size at the expense of the frequent list target size. "This
block in particular deserves to be a frequent block, but as a
consequence all blocks should collectively be considered slightly less
deserving of being frequent blocks."

Importantly, this allows blocks to have their extra utility recognized
by getting moved to the frequent list, but specifically tracks whether
or not the system fails to appreciate recent blocks that deserve to be
promoted to frequent blocks, but cannot get a break because they get
evicted from the recent list too quickly (including the ghost recent
list). Because the recent ghost list and frequent ghost list are fixed
in size (only the real lists/head lists get resized), we won't forget
too much about recency or frequency. And, we won't go overboard with
"writing off frequent list blocks as has-been dead weight", because we
notice that we've gone too far, and measure that in terms of would-be
cache hits that ended up as actual cache misses. Most important of
all, we won't "get stuck", as we see with clocksweep today, while still
having something that's stable for stable workloads.

> Every replacement strategy which doesn't care about the kind of block
> that's being worked with will end up performing random replacement under
> enough pressure.  We don't actually have to allow that to happen though
> since we do have information about the kind of block and therefore it
> seems like we should be able to keep things mediocre, and avoid letting
> things get bad under such pressure.

If you have extreme cache pressure, I don't think that there is
anything left to worry about other than the scalability of evicting
buffers. That should be a rare enough occurrence for well maintained
systems these days (again, this may only actually be true as an
aspiration). Just look at the ARC paper's statistics; they report that
they're not much better than the competition (including even a classic
LRU) when there is either mild cache pressure or extreme cache
pressure. The middle ground is the interesting part.

All of that having been said, maybe we have an independent low-level
problem: we increase usage_count when we pin a buffer, even if we last
pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
in (and came out with ARC, too). This is more of a problem with
miscounting the number of hits based on accidents around how things
are structured in the executor -- we need to be more careful about
recognizing whether or not block hits are logically distinct.
Particularly with stuff like nested loop joins.

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From
Юрий Соколов
Date:
On Sun, Apr 29, 2018 at 2:17, Peter Geoghegan <pg@bowt.ie> wrote:
[...]

I've developed an algorithm as an improvement of GClock (a toy C sketch follows the list below):
- the usage count is incremented not on every access, but on the first access during some span of clock arm movement:
-- a new item is put in with count 0, and the count is not incremented until the clock arm has moved at least 1/32 of a whole round,
-- once the arm has moved further, the usage count is incremented on the next access and the arm position is remembered; during the next 1/32 of a whole round the usage count is not incremented,
- the usage count is incremented not by 1, but by 8; this simulates movement to the "frequent" list of ARC/CAR/2Q,
-- there is a limit on the usage count of 32, so there are four simulated frequency queues,
- to find an item for eviction, items are scanned in batches of 25:
-- if an item's usage count is zero, it is evicted and the scan stops,
-- otherwise the usage count is decremented as in GClock,
-- the item with the least usage count seen so far is remembered,
-- when the batch is over, the item with the least usage count is evicted.
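
Here is that sketch; BufDesc, the field names, and the use of the scanning arm as the replacement counter are illustrative simplifications, not PostgreSQL structures:

    #include <stdint.h>

    #define NBUFFERS  1024
    #define INC       8                 /* bump size: simulates a "hot" queue */
    #define MAXCOUNT  32                /* cap: four simulated frequency queues */
    #define BATCH     25                /* buffers examined per eviction */
    #define GRACE     (NBUFFERS / 32)   /* minimum arm travel between bumps */

    typedef struct
    {
        uint32_t usage;                 /* 0 .. MAXCOUNT */
        uint64_t last_bump;             /* arm position at the last bump */
    } BufDesc;

    static BufDesc  buf[NBUFFERS];
    static uint64_t arm;                /* monotonically increasing position */

    /* On access: bump at most once per GRACE positions of arm travel. */
    void on_access(int b)
    {
        if (arm - buf[b].last_bump >= GRACE)
        {
            buf[b].usage = buf[b].usage + INC > MAXCOUNT
                         ? MAXCOUNT : buf[b].usage + INC;
            buf[b].last_bump = arm;
        }
    }

    /* Find a victim: take the first zero-count buffer within the batch,
     * otherwise the least-used buffer seen; decay counts along the way. */
    int find_victim(void)
    {
        int best = -1;

        for (int i = 0; i < BATCH; i++)
        {
            int b = (int) (arm++ % NBUFFERS);

            if (buf[b].usage == 0)
                return b;               /* evict immediately, stop scanning */
            buf[b].usage--;             /* decay, as in GClock */
            if (best < 0 || buf[b].usage < buf[best].usage)
                best = b;
        }
        return best;                    /* least-used of the 25 scanned */
    }

    /* On replacement: a new page starts cold, inside its grace period. */
    void on_insert(int b)
    {
        buf[b].usage = 0;
        buf[b].last_bump = arm;
    }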

I've tested this algorithm with some artificial traces. It is much better than LRU, but usually worse than ARC and CLOCK+ (btw, why does no one mention CLOCK+ in this thread?).

I haven't implemented it in PostgreSQL yet (I did implement batch scanning once, but the other algorithm details were not developed then).

And I don't know whether it will really be better for PostgreSQL.
And I believe CLOCK+ will perform better if implemented correctly, modified a bit to use an array instead of a list (CLOCK+ originally uses a list with the ability to insert and remove at any position).

So you see: there are a lot of algorithms, and new ones could be suggested as we go. It is not easy to choose one on which to concentrate our attention.

That is why I propose collecting some real-world traces and testing algorithms with them before rushing to implementation.

With regards, 
Sokolov Yura. 

Re: [HACKERS] Clock with Adaptive Replacement

From
Andres Freund
Date:
On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote:
> 
> 
> > On Apr 24, 2018, at 23:14, Andres Freund <andres@anarazel.de> wrote:
> > 
> > On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:
> >> But, I think that the cost of developing the real page eviction strategy
> >> itself is negligibly small compared to the infrastructure changes needed
> >> by any non-CS5 strategy.
> > 
> > What problems are you seeing? This isn't a lot of code?
> 1. Teaching BgWriter to use data from the eviction strategy to aggressively flush data to disk (instead of ++next_to_clean)
> 2. Implementing strategies as lock-free algorithms for freelist
> These parts seem most important for benchmarking.
> Also:
> 3. Converting all rings to single buffer manager where possible
> 4. Using O_DIRECT while writing data files
> 5. Using aio and scheduling of writes
> These parts are not necessary, but the most challenging, though not impossible.

These largely seem to be increasing the scope beyond reason...

Greetings,

Andres Freund


Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:

> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote:
>>
>> 1. Teaching BgWriter to use data from the eviction strategy to aggressively flush data to disk (instead of ++next_to_clean)
>> 2. Implementing strategies as lock-free algorithms for freelist
>> These parts seem most important for benchmarking.
>> Also:
>> 3. Converting all rings to single buffer manager where possible
>> 4. Using O_DIRECT while writing data files
>> 5. Using aio and scheduling of writes
>> These parts are not necessary, but the most challenging, though not impossible.
>
> These largely seem to be increasing the scope beyond reason...


I suspect that the performance benefits may be not that big, or even marginal, if we do not extend a comprehensive eviction
strategy with bgwriter fixes and O_DIRECT.
But, on the other hand, this suspicion is not based on any real fact. And if an eviction strategy is actually good for
anything, it will show performance benefits on its own.

> On Apr 30, 2018, at 0:39, Юрий Соколов <funny.falcon@gmail.com> wrote:
>
> (btw, why no one mention CLOCK+ in this thread).

I do not know what CLOCK+ is. Can you, please, send me a reference.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Andres Freund
Date:
Hi,

On 2018-04-30 15:39:08 +0500, Andrey Borodin wrote:
> I suspect that performance benefits can be not that big or even
> marginal, if we do not extend comprehensive eviction strategy with
> bgwriter fixes and O_DIRECT.

If so, then the improvements aren't real. Bgwriter doesn't matter for
read-only workloads. O_DIRECT doesn't matter much if shared_buffers are
60+% of OS memory.  And even disregarding that, you can just compute
cache hit ratios to see whether things are improving.

Greetings,

Andres Freund


Re: [HACKERS] Clock with Adaptive Replacement

From
Thomas Munro
Date:
On Mon, Apr 30, 2018 at 10:39 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote:
>> On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote:
>> On Apr 30, 2018, at 0:39, Юрий Соколов <funny.falcon@gmail.com> wrote:
>>
>> (btw, why no one mention CLOCK+ in this thread).
>
> I do not know what CLOCK+ is. Can you, please, send me a reference.

Maybe he means CLOCK-Pro?

--
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From
Thomas Munro
Date:
On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> ...  I
> suppose when you read a page in, you could tell the kernel that you
> POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
> tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
> advance somehow), on the theory that the coldest stuff in the PG cache
> should now become the hottest stuff in the OS cache.  Of course that
> sucks, because the best the kernel can do then is go and read it from
> disk, and the goal is to avoid IO.  Given a hypothetical way to
> "write" "clean" data to the kernel (so it wouldn't mark it dirty and
> generate IO, but it would let you read it back without generating IO
> if you're lucky), then perhaps you could actually achieve exclusive
> caching at the two levels, and use all your physical RAM without
> duplication.

Craig said essentially the same thing, on the nearby fsync() reliability thread:

On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> ... I'd kind of hoped to go in
> the other direction if anything, with some kind of pseudo-write op
> that let us swap a dirty shared_buffers entry from our shared_buffers
> into the OS dirty buffer cache (on Linux at least) and let it handle
> writeback, so we reduce double-buffering. Ha! So much for that!

I would like to reply to that on this thread which discusses double
buffering and performance, to avoid distracting the fsync() thread
from its main topic of reliability.

I think that idea has potential.  Even though I believe that direct IO
is generally the right way to go (that's been RDBMS orthodoxy for
a decade or more AFAIK), we'll always want to support buffered IO (as
other RDBMSs do).  For one thing, not every filesystem supports direct
IO, including ZFS.  I love ZFS, and its caching is not simply a dumb
extension to shared_buffers that you have to go through syscalls to
reach: it has state of the art page reclamation, cached data can be
LZ4 compressed and there is an optional second level cache which can
live on fast storage.

Perhaps if you patched PostgreSQL to tell the OS that you won't need
pages you've just read, and that you will need pages you've just
evicted, you might be able to straighten out some of that U shape by
getting more exclusive caching at the two levels.  Queued writes would
still be double-buffered of course, at least until they complete.
Telling the OS to prefetch something that you already have a copy of
is annoying and expensive, though.
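
For what it's worth, a minimal sketch of those two hints with the standard posix_fadvise() interface might look like the following; the hook functions are hypothetical, and nothing like them exists in PostgreSQL today:

    #define _POSIX_C_SOURCE 200112L
    #include <fcntl.h>
    #include <sys/types.h>

    /* After reading a block into a PG buffer: the kernel's copy is now
     * redundant, so hint that its page-cache entry can be dropped. */
    static void hint_after_read(int fd, off_t offset, off_t len)
    {
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
    }

    /* After evicting a clean PG buffer: the coldest page in shared_buffers
     * ought to become the hottest page in the OS cache, so ask the kernel
     * to pull it back in.  This is the expensive part: the kernel can only
     * satisfy it with real IO, which is what we wanted to avoid. */
    static void hint_after_evict(int fd, off_t offset, off_t len)
    {
        (void) posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    }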

The pie-in-the-sky version of this idea would let you "swap" pages
with the kernel, as you put it.  Though I was thinking of clean pages,
not dirty ones.  Then there'd be a non-overlapping set of  pages from
your select-only pgbench in each cache.  Maybe that would look like
punread(fd, buf, size, offset) (!), or maybe write(fd, buf, size)
followed by fadvise(fd, offset, size,
FADV_I_PERSONALLY_GUARANTEE_THIS_DATA_IS_CLEAN_AND_I_CONSIDERED_CONCURRENCY_VERY_CAREFULLY),
or maybe pswap(read params... , unread params ...) to read new buffer
and unread old buffer at the same time.</crackpot-vapourware-OS>

Sadly, even if the simple non-pie-in-the-sky version of the above were
to work out and be beneficial on your favourite non-COW filesystem (on
which you might as well use direct IO and larger shared_buffers, some
day), it may currently be futile on ZFS because I think the fadvise
machinery might not even be hooked up (Solaris didn't believe in
fadvise on any filesystem IIRC).  Not sure, I hope I'm wrong about
that.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: [HACKERS] Clock with Adaptive Replacement

From
Andres Freund
Date:
On 2018-05-01 12:15:21 +1200, Thomas Munro wrote:
> On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > ...  I
> > suppose when you read a page in, you could tell the kernel that you
> > POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could
> > tell the kernel that you POSIX_FADV_WILLNEED its former contents (in
> > advance somehow), on the theory that the coldest stuff in the PG cache
> > should now become the hottest stuff in the OS cache.  Of course that
> > sucks, because the best the kernel can do then is go and read it from
> > disk, and the goal is to avoid IO.  Given a hypothetical way to
> > "write" "clean" data to the kernel (so it wouldn't mark it dirty and
> > generate IO, but it would let you read it back without generating IO
> > if you're lucky), then perhaps you could actually achieve exclusive
> > caching at the two levels, and use all your physical RAM without
> > duplication.
> 
> Craig said essentially the same thing, on the nearby fsync() reliability thread:
> 
> On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
> > ... I'd kind of hoped to go in
> > the other direction if anything, with some kind of pseudo-write op
> > that let us swap a dirty shared_buffers entry from our shared_buffers
> > into the OS dirty buffer cache (on Linux at least) and let it handle
> > writeback, so we reduce double-buffering. Ha! So much for that!
> 
> I would like to reply to that on this thread which discusses double
> buffering and performance, to avoid distracting the fsync() thread
> from its main topic of reliability.

It's not going to happen. Robert and I talked to the kernel devs a
couple years back, and I've brought it up again. The kernel has
absolutely no chance to verify the content of that written data, meaning
that suddenly you'd get differing data based on cache pressure. It seems
unsurprising that kernel devs aren't wild about that idea. The
likelihood of that opening up weird exploits (imagine a suid binary
reading such data later!), seems also pretty high.

Greetings,

Andres Freund


Re: [HACKERS] Clock with Adaptive Replacement

From
Юрий Соколов
Date:
On Tue, May 1, 2018 at 0:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Mon, Apr 30, 2018 at 10:39 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote:
>> On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote:
>> On Apr 30, 2018, at 0:39, Юрий Соколов <funny.falcon@gmail.com> wrote:
>>
>> (btw, why no one mention CLOCK+ in this thread).
>
> I do not know what CLOCK+ is. Can you, please, send me a reference.

Maybe he means CLOCK-Pro?

Oh, thank you, Thomas. 
Yes, I was talking about CLOCK-Pro.

Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:
Hi!

> On Apr 30, 2018, at 23:15, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-30 15:39:08 +0500, Andrey Borodin wrote:
>> I suspect that performance benefits can be not that big or even
>> marginal, if we do not extend comprehensive eviction strategy with
>> bgwriter fixes and O_DIRECT.
>
> If so, then the improvements aren't real. Bgwriter doesn't matter for
> read-only workloads. O_DIRECT doesn't matter much if shared_buffers are
> 60+% of OS memory.  And even disregarding that, you can just compute
> cache hit ratios to see whether things are improving.
Even considering simply changing the eviction strategy - it is not just about hit ratio. It is also about eviction
complexity less than O(N).

But I think you are right. Comparing the performance effect of half-measures in the real system is probably more
accurate than comparing isolated algorithms in a sandbox.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Sat, Apr 28, 2018 at 7:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> There could very easily be an enormous range of usefulness among
> buffers in shared_buffers. And yet, we only recognize the increments
> that usage_count can represent (this is before we even think about the
> actual problems with how usage_count is maintained). We're never going
> to be able to give *radically* more useful leaf pages a concomitant
> advantage during buffer eviction if we have to use some rule of thumb.

Right.  I think the real problem here is that the usage_count is a
terrible predictor of how useful a page really is...

> In a way, it's not at all surprising that we can end up with
> usage_counts of 5 all around when shared_buffers is large. Blocks end
> up "resting on their laurels"; they're impossible to evict based on a
> large amount of blind luck, such as being involved in a burst of
> activity some time ago (so much for clocksweep approximating an LRU;
> this is actually more like an LFU). ARC/CAR moves a "recent" block to
> the "frequent" list if the block gets referenced a second time after
> an interval that is, in effect, self-correcting (it may or may not be
> from the ghost recent list or the real buffer cache recent list --
> either way it goes to the head of the frequent list). Moving a block
> from the recent to the frequent list also enlarges the recent list
> target size at the expense of the frequent list target size. "This
> block in particular deserves to be a frequent block, but as a
> consequence all blocks should collectively be considered slightly less
> deserving of being frequent blocks."

...and this is exactly the kind of system that can fix that problem.

In the current system, bumping the usage count makes the buffer whose
count just got bumped look more useful without making any other buffer
look less useful.  Obviously, that allows for situations where the
usefulness of every buffer converges to positive infinity or, as we
like to call it, 5.  Real life isn't like that, though.  Not all
children are above average, and not all buffers can be more useful
than average.

That's not to say that the total of all usage counts must remain equal
to a constant or something dumb like that -- there are decisions to be
made in terms of how you implement things.  For example, you can
imagine a system where every time we would bump the usage count of a
buffer that already has a usage count of 5, we would instead advance
the clock hand by 1 buffer.  In such a system the total of the usage
counts can fluctuate, but you can't peg everything at 5 because
there's back-pressure built into the algorithm.  CAR accomplishes a
similar goal in a different way: "frequently" used pages are never
evicted, but as more and more pages get moved into the frequently-used
set, the system becomes more and more willing to shrink the set size.

That's a really appealing property.  If there's a "hot" set of buffers
that accounts for nearly all accesses, the algorithm will be willing
to regard all buffers in that set as frequently-used regardless
of whether it is a large or a small percentage of shared_buffers, and
they will never get evicted.  But the buffers must really be hot
enough to justify the space they're taking up.  When there are only a
few hot buffers, it's sufficient for them to be a little hotter than
the typical buffer.  But when there are a lot of hot buffers, they
have to be *really* hot.  Intuitively, that makes sense, because
making the set of "hot" -- and thus unevictable -- buffers large means
that other buffers are going to get evicted very quickly.  That's only
justifiable if the hot buffers are very significantly more useful than
the non-hot buffers.
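
Going back to the advance-the-clock-hand thought experiment, here is a toy sketch of one possible reading (invented names, and assuming the advancing hand decays the buffer it passes, as the clock sweep does):

    #define NBUFFERS  1024
    #define USAGE_MAX 5

    typedef struct { int usage; } Buf;

    static Buf buffers[NBUFFERS];
    static int clock_hand;

    void bump_usage(Buf *b)
    {
        if (b->usage < USAGE_MAX)
        {
            b->usage++;
            return;
        }

        /* Saturated: we can't make this buffer look hotter, so make some
         * other buffer look colder.  The total usage can fluctuate, but
         * the population can no longer converge to all-5s. */
        if (buffers[clock_hand].usage > 0)
            buffers[clock_hand].usage--;
        clock_hand = (clock_hand + 1) % NBUFFERS;
    }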

> If you have extreme cache pressure, I don't think that there is
> anything left to worry about other than the scalability of evicting
> buffers. That should be a rare enough occurrence for well maintained
> systems these days (again, this may only actually be true as an
> aspiration). Just look at the ARC paper's statistics; they report that
> they're not much better than the competition (including even a classic
> LRU) when there is either mild cache pressure or extreme cache
> pressure. The middle ground is the interesting part.

+1.

> All of that having been said, maybe we have an independent low-level
> problem: we increase usage_count when we pin a buffer, even if we last
> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
> in (and came out with ARC, too). This is more of a problem with
> miscounting the number of hits based on accidents around how things
> are structured in the executor -- we need to be more careful about
> recognizing whether or not block hits are logically distinct.
> Particularly with stuff like nested loop joins.

I agree that double-counting correlated accesses is a problem, and I
agree that we probably want to do something about it.  I am skeptical
that using wall-clock time is the right way to solve that problem
because it leads to edge effects -- if for example there is a 5ms
timeout for bumping the usage count again, then you can probably
manufacture a workload where the buffer eviction pattern is
dramatically different for a nested loop that hits a given page every
4.5ms than it is for a nested loop that hits a given page every 5.5ms.
Note that CART aims to solve this problem in a way that doesn't
involve any fixed amount of wall-clock time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Peter Geoghegan
Date:
On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> That's not to say that the total of all usage counts must remain equal
> to a constant or something dumb like that -- there are decisions to be
> made in terms of how you implement things.  For example, you can
> imagine a system where every time we would bump the usage count of a
> buffer that already has a usage count of 5, we would instead advance
> the clock hand by 1 buffer.  In such a system the total of the usage
> counts can fluctuate, but you can't peg everything at 5 because
> there's back-pressure built into the algorithm.

This seems to be an old idea. See "Data Cache Management Using
Frequency-Based Replacement", a 1990 paper that is referenced by the
ARC paper. It says:

"When a block is first brought into the cache, its reference count is
initialized to one. Previous algorithms using reference counts have
incremented the count for a given block on every reference to that
block. This results in the following kind of problem: certain blocks
are relatively infrequently referenced overall, and yet when they are
referenced, due to locality there are short intervals of repeated
re-references, thus building up high reference counts. After such an
interval is over, the high reference count is misleading: it is due to
locality, and cannot be used to estimate the probability that such a
block will be re-referenced following the end of this interval (this
description is intuitive, since in practice such intervals are not as
well-defined as the preceding discussion may suggest)."

It goes on to propose a solution, that I gather isn't so far from your
thought experiment design for fixing the usage_count saturation
problem:

"More precisely, we dynamically maintain the sum of all reference
counts.  Whenever the average reference count exceeds a predetermined
maximum value, A-max (another parameter of the algorithm), every
reference count C is reduced to C/2. Thus, in the steady state the sum
of all reference counts stays between N x A-max/2 and N x A-max (where
N is the number of cache blocks), and the reference count of a block
is approximately linearly related to the frequency of references to
the block while it is in the middle and old sections. Note that in
this reduction of reference counts, a count of one remains at one, a
count of two is reduced to one, a count of three is reduced to two,
etc."

This 1990 paper seems to only try to solve the correlated references
problem that we see when using CLOCK to approximate an LRU, which is
not to be confused with actually weighing frequency in a clever way,
which AFAICT is a general idea that didn't come along until the LRU-K
paper came out a couple of years later. In other words, the 1990 paper
only cares about making sure that CLOCK actually manages to
approximate an LRU when reference counts are messed up by correlated
references. That's all. This really is a distinct problem from the
problem that ARC and CAR aim to solve (2Q and LRU-K must be in the
same category, since IIRC they have ghost lists). CART tries to solve
both problems at the same time, which suggests that they're
independent problems.

In summary: I see two clearly distinct problems here, though they are
superficially similar. We have both problems.

> That's a really appealing property.  If there's a "hot" set of buffers
> that accounts for nearly all accesses, the algorithm will be willing
> to regard all buffers in that set as frequently-used regardless
> of whether it is a large or a small percentage of shared_buffers, and
> they will never get evicted.  But the buffers must really be hot
> enough to justify the space they're taking up.  When there are only a
> few hot buffers, it's sufficient for them to be a little hotter than
> the typical buffer.  But when there are a lot of hot buffers, they
> have to be *really* hot.  Intuitively, that makes sense, because
> making the set of "hot" -- and thus unevictable -- buffers large means
> that other buffers are going to get evicted very quickly.  That's only
> justifiable if the hot buffers are very significantly more useful than
> the non-hot buffers.

Right. How hot "really hot" actually is needs to be based on the
behavior of the system as a whole. If it's impossible to better
characterize the behavior of the system as a whole, then no
improvement is possible; shared_buffers services the system as a
whole.

I've always had a problem with the 8GB/16GB upper limit on the size of
shared_buffers. Not because it's wrong, but because it's not something
that has ever been explained. I strongly suspect that it has something
to do with usage_count saturation, since it isn't reproducible with
any synthetic workload that I'm aware of. Quite possibly because there
are few bursty benchmarks.

>> All of that having been said, maybe we have an independent low-level
>> problem: we increase usage_count when we pin a buffer, even if we last
>> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
>> in (and came out with ARC, too). This is more of a problem with
>> miscounting the number of hits based on accidents around how things
>> are structured in the executor -- we need to be more careful about
>> recognizing whether or not block hits are logically distinct.
>> Particularly with stuff like nested loop joins.
>
> I agree that double-counting correlated accesses is a problem, and I
> agree that we probably want to do something about it.  I am skeptical
> that using wall-clock time is the right way to solve that problem
> because it leads to edge effects

I agree that wall-clock time is a bad approach, actually. If we were
to use wall-clock time, we'd only do so because it can be used to
discriminate against a thing that we actually care about in an
approximate, indirect way. If there is a more direct way of
identifying correlated accesses, which I gather that there is, then we
should probably use it.

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From
Yura Sokolov
Date:
On 02.05.2018 01:37, Peter Geoghegan wrote:
> On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I agree that double-counting correlated accesses is a problem, and I
>> agree that we probably want to do something about it.  I am skeptical
>> that using wall-clock time is the right way to solve that problem
>> because it leads to edge effects
> 
> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

I suggest incrementing at most once per 1/32 of shared_buffers' replacements.
I.e. if the size of shared_buffers is 1024, and this page was put into
shared_buffers as the 21200th, then its usage count will next be
incremented only after the 21232nd page has been put into shared buffers,
and a second time only after the 21264th page.

regards,
Yura.


Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Tue, May 1, 2018 at 6:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> This seems to be an old idea.

I'm not too surprised ... this area has been well-studied.

> I've always had a problem with the 8GB/16GB upper limit on the size of
> shared_buffers. Not because it's wrong, but because it's not something
> that has ever been explained. I strongly suspect that it has something
> to do with usage_count saturation, since it isn't reproducible with
> any synthetic workload that I'm aware of. Quite possibly because there
> are few bursty benchmarks.

I've seen customers have very good luck going higher if it lets all the
data fit in shared_buffers, or at least all the data that is accessed
with any frequency.  I think it's useful to imagine a series of
concentric working sets - maybe you have 1GB of the hottest data, 3GB
of data that is at least fairly hot, 10GB of data that is at least
somewhat hot, and another 200GB of basically cold data.  Increasing
shared_buffers in a way that doesn't let the next "ring" fit in
shared_buffers isn't likely to help very much.  If you have 8GB of
shared_buffers on this workload, going to 12GB is probably going to
help -- that should be enough for the 10GB of somewhat-hot stuff and a
little extra so that the somewhat-hot stuff doesn't immediately start
getting evicted if some of the cold data is accessed.  Similarly,
going from 2GB to 4GB should be a big help, because now the fairly-hot
stuff should stay in cache.  But going from 4GB to 6GB or 12GB to 16GB
may not do very much.  It may even hurt, because the duplication
between shared_buffers and the OS page cache means an overall
reduction in available cache space.  If for example you've got 16GB of
memory and shared_buffers=2GB, you *may* be fitting all of the
somewhat-hot data into cache someplace; bumping shared_buffers=4GB
almost certainly means that will no longer happen, causing performance
to tank.

I don't really think that the 8GB rule of thumb is something that
originates in any technical limitation of PostgreSQL or Linux.  First
of all it's just a rule of thumb -- the best value in a given
installation can easily be something completely different.  Second, to
the extent that it's a useful rule of thumb, I think it's really a
guess about what people's working set looks like: that going from 4GB
to 8GB, say, significantly increases the chances of fitting the
next-larger, next-cooler working set entirely in shared_buffers, going
from 8GB to 16GB is less likely to accomplish this, and going from
16GB to 32GB probably won't.  To a lesser extent, it's reflective of
the point where scanning shared buffers to process relation drops gets
painful, and the point where an immediate checkpoint suddenly dumping
that much data out to the OS all at once starts to overwhelm the I/O
subsystem for a significant period of time.  But I think those really
are lesser effects.  My guess is that the big effect is balancing
increased hit ratio vs. increased double buffering.

> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

For a start, I think it would be cool if somebody just gathered traces
for some simple cases.  For example, consider a pgbench transaction.
If somebody produced a trace showing the buffer lookups in order
annotated as heap, index leaf, index root, VM page, FSM root page, or
whatever, that would be a good starting point.  Examining some other simple, common cases would probably
help us understand whether it's normal to bump the usage count more
than once per buffer for a single scan, and if so, exactly why that
happens.  If the code knows that it's accessing the same buffer a
second (or subsequent) time, it could pass down a flag saying not to
bump the usage count again.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Vladimir Sitnikov
Date:
> If somebody produced a trace showing the buffer lookups in order

To get things moving, I've created a DTrace script that captures buffer reads. Sample output can be seen here:
https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio

Is it something that can be used to capture live traces?


For instance (time units are nano-seconds):

     Timestamp  Elapsed  ForkNum BlockNum  TableSp DataBase Relation Cached
 3271837060212     1563        0        1     1663    16385     1259      1
 3271838881374    88205        0        0     1663    16385    16604      0
 3271840973321     4368        0        0     1663    16385    16604      1
If DTrace is acceptable, the trace format might be adjusted for easier consumption by https://github.com/ben-manes/caffeine/wiki/Simulator

Vladimir

Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
<sitnikov.vladimir@gmail.com> wrote:
> Sample output can be seen here:
> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio

Neat.  Not sure what generated this trace, but note this part:

3271838881374    88205        0        0     1663    16385    16604      0
 3271840973321     4368        0        0     1663    16385    16604      1
 3271842680626     4502        0        0     1663    16385    16604      1
 3271846077927     4173        0        0     1663    16385    16604      1

If we want to avoid artificial inflation of usage counts, that kind of
thing would be a good place to start -- obviously 4 consecutive
accesses to the same buffer by the same backend doesn't justify a
separate usage count bump each time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Peter Geoghegan
Date:
On Thu, May 3, 2018 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
> <sitnikov.vladimir@gmail.com> wrote:
>> Sample output can be seen here:
>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>
> Neat.  Not sure what generated this trace, but note this part:
>
> 3271838881374    88205        0        0     1663    16385    16604      0
>  3271840973321     4368        0        0     1663    16385    16604      1
>  3271842680626     4502        0        0     1663    16385    16604      1
>  3271846077927     4173        0        0     1663    16385    16604      1
>
> If we want to avoid artificial inflation of usage counts, that kind of
> thing would be a good place to start -- obviously 4 consecutive
> accesses to the same buffer by the same backend doesn't justify a
> separate usage count bump each time.

I don't have time to check this out just now, but it seems like an
excellent idea. It would be nice if it could be enhanced further, so
you get some idea of what the blocks are without having to decode them
yourself using tools like pageinspect.

-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:
> On May 4, 2018, at 0:37, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
> <sitnikov.vladimir@gmail.com> wrote:
>> Sample output can be seen here:
>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>
> Neat.  Not sure what generated this trace, but note this part:
>
> 3271838881374    88205        0        0     1663    16385    16604      0
> 3271840973321     4368        0        0     1663    16385    16604      1
> 3271842680626     4502        0        0     1663    16385    16604      1
> 3271846077927     4173        0        0     1663    16385    16604      1
>
> If we want to avoid artificial inflation of usage counts, that kind of
> thing would be a good place to start -- obviously 4 consecutive
> accesses to the same buffer by the same backend doesn't justify a
> separate usage count bump each time.

Earlier in this thread Yura suggested that usage should not get an equal bump on every access. He effectively suggested a log
scale of usage, so that many consecutive accesses are taken into account but are not dramatically more important than just
a few recent accesses.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Юрий Соколов
Date:
On Fri, May 4, 2018 at 12:27, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

[...]

Earlier in this thread Yura suggested that usage should not get an equal bump on every access. He effectively suggested a log scale of usage, so that many consecutive accesses are taken into account but are not dramatically more important than just a few recent accesses.

I didn't suggest a log scale of usages, but rather a "replacement-period based" increment: the usage count can be incremented at most once per NBlocks/32 replaced items. Once it is incremented, its "replacement time" is remembered, and for the next NBlocks/32 replacements the usage count of this buffer is not incremented.
This way, the increment is synchronized with replacement activity.

Digging further, I suggest the following improvements to the GClock algorithm:
- place a new buffer with usage count = 0 (for its next NBlocks/32 replacements its usage count is not increased),
- increment not by 1, but by 8 (this simulates the "hot queue" of popular algorithms), with a limit of 32,
- scan at most 25 buffers for eviction; if no buffer with a zero usage count is found, the least used buffer (among the 25 scanned) is evicted
(new buffers are not evicted during their first NBlocks/32 replacements).

Regards,
Yura Sokolov.

Re: [HACKERS] Clock with Adaptive Replacement

From
Vladimir Sitnikov
Date:
> I don't have time to check this out just now, but it seems like an
> excellent idea

There are a couple of sad things:
1) DTrace probes seem to be disabled by default. At least, I had to build PostgreSQL with --enable-dtrace on my macOS.
Does that (the requirement to enable dtrace within PostgreSQL) prevent capturing trace files from real-life applications?

There's an option to use system-level IO trace points (e.g. https://github.com/omniti-labs/pgtreats/blob/master/tools/pg_file_stress ), however it would be harder to get DB/relation-level ids.

2) (database, tablespace, relation) oids cannot be easily mapped from within a DTrace script. One can work around that by using a SQL connection to the database; however, it gets a bit complicated if there are multiple DB instances. What I mean is that it is hard to tell which connection URL to use in order to resolve the oids in question.

Vladimir

Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:
Hi!

> On May 4, 2018, at 16:05, Юрий Соколов <funny.falcon@gmail.com> wrote:
>
> I didn't suggest a log scale of usages, but rather a "replacement-period based" increment: the usage count can be incremented at most once per NBlocks/32 replaced items. Once it is incremented, its "replacement time" is remembered, and for the next NBlocks/32 replacements the usage count of this buffer is not incremented.
> This way, the increment is synchronized with replacement activity.

But then you lose the difference between "touched once" and "actively used". A log scale of usage solves this: the usage count grows logarithmically, but drains linearly.


> Digging further, I suggest the following improvements to the GClock algorithm:
> - place a new buffer with usage count = 0 (and for the next NBlocks/32 replacements its usage count is not increased);
> - increment not by 1, but by 8 (this simulates the "hot queue" of popular algorithms), with a limit of 32;
> - scan at most 25 buffers for eviction; if no buffer with a zero usage count is found, the least-used buffer (among the 25 scanned) is evicted
> (new buffers are not evicted during their first NBlocks/32 replacements).

I do not understand where these numbers come from...

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Yura Sokolov
Date:
On 05.05.2018 09:16, Andrey Borodin wrote:
> Hi!
>
>> On May 4, 2018, at 16:05, Юрий Соколов <funny.falcon@gmail.com>
>> wrote:
>>
>> I didn't suggest a log scale of usages, but rather a
>> "replacement-period based" increment: the usage count can be
>> incremented at most once per NBlocks/32 replaced items. Once it is
>> incremented, its "replacement time" is remembered, and for the next
>> NBlocks/32 replacements the usage count of this buffer is not
>> incremented. This way, the increment is synchronized with replacement
>> activity.
>
> But then you lose the difference between "touched once" and "actively used".
> A log scale of usage solves this: the usage count grows logarithmically,
> but drains linearly.
No, I don't lose the difference. But instead of an absolute value (log scale
or linear) I count how often in time a block is used:
- if a buffer was touched 1000 times just after placement into
shared_buffers, should it live 500 times longer than a neighbor that was
touched only 2 times? Or 10 times longer? Or 5 times longer?
- but what if that "1000 times" buffer is not touched in the next period
of time, while the neighbor is touched again 2 times?
All effective algorithms answer: the "1000 times" buffer should be evicted
first, and its neighbor is a really hot buffer that should be kept for a
longer period.

A log scale doesn't solve this. But an increment "once per period" does,
especially if a block is placed initially with a zero count (instead of 1, as
currently).

>> Digging further, I suggest the following improvements to the GClock algorithm:
>> - place a new buffer with usage count = 0 (and for the next NBlocks/32
>> replacements its usage count is not increased);
>> - increment not by 1, but by 8 (this simulates the "hot queue" of
>> popular algorithms), with a limit of 32;
>> - scan at most 25 buffers for eviction; if no buffer with a zero
>> usage count is found, the least-used buffer (among the 25 scanned) is evicted
>> (new buffers are not evicted during their first NBlocks/32
>> replacements).
>>
>
> I do not understand where these numbers come from...

I found these numbers by testing with several artificial traces found on the web.
I don't claim these numbers are the best. Even on those traces the best values may
vary with cache size: for small caches the increment and limit tend to be
higher, for huge caches smaller. But these were the most balanced.

And I don't claim those traces are representative of PostgreSQL; that
is why I'm pushing this discussion to collect more real-world PostgreSQL
traces and make them public.

And I believe my algorithm is not the best. Clock-Pro and ARC show
better results on those traces. Tiny-LFU - a cache admission algorithm -
may be even more efficient (in terms of evictions). But the results should be
rechecked with PostgreSQL traces.

My algorithm would just be the least invasive for the current code, imho.

With regards,
Sokolov Yura


Attachments

Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:


> On May 5, 2018, at 13:25, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> On 05.05.2018 09:16, Andrey Borodin wrote:
>> Hi!
>>
>>> On May 4, 2018, at 16:05, Юрий Соколов <funny.falcon@gmail.com>
>>> wrote:
>>>
>>> I didn't suggest a log scale of usages, but rather a
>>> "replacement-period based" increment: the usage count can be
>>> incremented at most once per NBlocks/32 replaced items. Once it is
>>> incremented, its "replacement time" is remembered, and for the next
>>> NBlocks/32 replacements the usage count of this buffer is not
>>> incremented. This way, the increment is synchronized with replacement
>>> activity.
>>
>> But then you lose the difference between "touched once" and "actively used".
>> A log scale of usage solves this: the usage count grows logarithmically,
>> but drains linearly.
> No, I don't lose the difference. But instead of an absolute value (log scale
> or linear) I count how often in time a block is used:
> - if a buffer was touched 1000 times just after placement into
> shared_buffers, should it live 500 times longer than a neighbor that was
> touched only 2 times? Or 10 times longer? Or 5 times longer?
> - but what if that "1000 times" buffer is not touched in the next period
> of time, while the neighbor is touched again 2 times?
> All effective algorithms answer: the "1000 times" buffer should be evicted
> first, and its neighbor is a really hot buffer that should be kept for a
> longer period.
It is actually hard to tell what the right decision is here. It is better to evict the buffer that will not be needed for the longest time, and it is not obvious that this is true for the buffer you called hot.

Assume we have buffer A, which is touched 1024 times (and then forgotten forever), and buffer B, which is touched 2 times every clock cycle.

                    A        B
Usage count         0x400    0x2
1 Eviction time!    0x100    0x0 E
Usage count         0x100    0x2
2 Eviction time!    0x080    0x0 E
Usage count         0x080    0x2
3 Eviction time!    0x020    0x0 E
Usage count         0x020    0x2
4 Eviction time!    0x00A    0x0 E
Usage count         0x00A    0x2
5 Eviction time!    0x004    0x0 E
Usage count         0x004    0x2
6 Eviction time!    0x001    0x0 E
Usage count         0x001    0x2
7 Eviction time!    0x000 E  0x2

So a buffer used 512 times more survived only 7 stale cycles. Looks fair.



> A log scale doesn't solve this. But an increment "once per period" does,
> especially if a block is placed initially with a zero count (instead of 1, as
> currently).
>
>>> Digging further, I suggest the following improvements to the GClock algorithm:
>>> - place a new buffer with usage count = 0 (and for the next NBlocks/32
>>> replacements its usage count is not increased);
>>> - increment not by 1, but by 8 (this simulates the "hot queue" of
>>> popular algorithms), with a limit of 32;
>>> - scan at most 25 buffers for eviction; if no buffer with a zero
>>> usage count is found, the least-used buffer (among the 25 scanned) is evicted
>>> (new buffers are not evicted during their first NBlocks/32
>>> replacements).
>>
>> I do not understand where these numbers come from...
>
> I found these numbers by testing with several artificial traces found on the web.
> I don't claim these numbers are the best. Even on those traces the best values may
> vary with cache size: for small caches the increment and limit tend to be
> higher, for huge caches smaller. But these were the most balanced.
>
> And I don't claim those traces are representative of PostgreSQL; that
> is why I'm pushing this discussion to collect more real-world PostgreSQL
> traces and make them public.
>
> And I believe my algorithm is not the best. Clock-Pro and ARC show
> better results on those traces. Tiny-LFU - a cache admission algorithm -
> may be even more efficient (in terms of evictions). But the results should be
> rechecked with PostgreSQL traces.
>
> My algorithm would just be the least invasive for the current code, imho.

Here's the demo patch with a logarithmic scale; 2 lines changed.
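
(The attachment itself is not reproduced in this archive. As a rough
illustration only - not the actual patch - a logarithmically growing usage
count can be obtained by bumping with probability 1/2^count, Morris-counter
style, so the count settles near log2 of the number of touches while the
clock sweep still drains it by 1 per pass; PostgreSQL's cap is
BM_MAX_USAGE_COUNT = 5.)

#include <stdint.h>
#include <stdlib.h>

#define MAX_USAGE_COUNT 5   /* PostgreSQL's BM_MAX_USAGE_COUNT */

static void
bump_usage_log(uint32_t *usage_count)
{
    if (*usage_count >= MAX_USAGE_COUNT)
        return;
    /* bump with probability 1 / 2^usage_count */
    if ((random() & ((1u << *usage_count) - 1)) == 0)
        (*usage_count)++;
}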

Best regards, Andrey Borodin.
Attachments

Re: [HACKERS] Clock with Adaptive Replacement

From
Yura Sokolov
Date:
On 06.05.2018 11:20, Andrey Borodin wrote:
>
>
>> On May 5, 2018, at 13:25, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> On 05.05.2018 09:16, Andrey Borodin wrote:
>>> Hi!
>>>
>>>> On May 4, 2018, at 16:05, Юрий Соколов <funny.falcon@gmail.com>
>>>> wrote:
>>>>
>>>> I didn't suggest a log scale of usages, but rather a
>>>> "replacement-period based" increment: the usage count can be
>>>> incremented at most once per NBlocks/32 replaced items. Once it is
>>>> incremented, its "replacement time" is remembered, and for the next
>>>> NBlocks/32 replacements the usage count of this buffer is not
>>>> incremented. This way, the increment is synchronized with replacement
>>>> activity.
>>>
>>> But then you lose the difference between "touched once" and "actively used".
>>> A log scale of usage solves this: the usage count grows logarithmically,
>>> but drains linearly.
>> No, I don't lose the difference. But instead of an absolute value (log scale
>> or linear) I count how often in time a block is used:
>> - if a buffer was touched 1000 times just after placement into
>> shared_buffers, should it live 500 times longer than a neighbor that was
>> touched only 2 times? Or 10 times longer? Or 5 times longer?
>> - but what if that "1000 times" buffer is not touched in the next period
>> of time, while the neighbor is touched again 2 times?
>> All effective algorithms answer: the "1000 times" buffer should be evicted
>> first, and its neighbor is a really hot buffer that should be kept for a
>> longer period.
> It is actually hard to tell what the right decision is here. It is better to evict the buffer that will not be needed
> for the longest time, and it is not obvious that this is true for the buffer you called hot.
>
> Assume we have buffer A, which is touched 1024 times (and then forgotten forever), and buffer B, which is touched
> 2 times every clock cycle.
>
>                     A        B
> Usage count         0x400    0x2
> 1 Eviction time!    0x100    0x0 E
> Usage count         0x100    0x2
> 2 Eviction time!    0x080    0x0 E
> Usage count         0x080    0x2
> 3 Eviction time!    0x020    0x0 E
> Usage count         0x020    0x2
> 4 Eviction time!    0x00A    0x0 E
> Usage count         0x00A    0x2
> 5 Eviction time!    0x004    0x0 E
> Usage count         0x004    0x2
> 6 Eviction time!    0x001    0x0 E
> Usage count         0x001    0x2
> 7 Eviction time!    0x000 E  0x2
> So a buffer used 512 times more survived only 7 stale cycles. Looks fair.

No, it is not fair. In fact, it is worse than with the current strategy:
with the current GClock it would survive only 5 stale cycles.

And in my approach I was talking about NBlocks/32 periods, so it looks more
like this (let the increment be 1):

Period        A             B         C
         touch UseCnt  touch cnt  touch cnt
1/64      512    0      16    0    1     0
2/64      512    0       0    0    1     0
3/64       0     0       0    0    0     0
4/64       0     0       0    0    1     1
5/64       0     0       0    0    1     1
.....
32/64      0     0      16    1    0     1
33/64      0     0      16    1    1     2
34/64      0     0      0     1    1     2
.....
63/64      0     0      0     1    1     3
64/64      0     0      0     1    1     3
Eviction time!
         A evicted      B     1    C     2

So, in my algorithm A will be evicted at the first eviction time.
And while B is accessed more times, its access distance is much longer than
C's, so it has less chance to survive in the future.

>> A log scale doesn't solve this. But an increment "once per period" does,
>> especially if a block is placed initially with a zero count (instead of 1,
>> as currently).
>>
>>>> Digging further, I suggest the following improvements to the GClock algorithm:
>>>> - place a new buffer with usage count = 0 (and for the next NBlocks/32
>>>> replacements its usage count is not increased);
>>>> - increment not by 1, but by 8 (this simulates the "hot queue" of
>>>> popular algorithms), with a limit of 32;
>>>> - scan at most 25 buffers for eviction; if no buffer with a zero
>>>> usage count is found, the least-used buffer (among the 25 scanned) is evicted
>>>> (new buffers are not evicted during their first NBlocks/32
>>>> replacements).
>>>>
>>>
>>> I do not understand where these numbers come from...
>>
>> I found these numbers by testing with several artificial traces found on the web.
>> I don't claim these numbers are the best. Even on those traces the best
>> values may vary with cache size: for small caches the increment and limit
>> tend to be higher, for huge caches smaller. But these were the most balanced.
>>
>> And I don't claim those traces are representative of PostgreSQL; that
>> is why I'm pushing this discussion to collect more real-world PostgreSQL
>> traces and make them public.
>>
>> And I believe my algorithm is not the best. Clock-Pro and ARC show
>> better results on those traces. Tiny-LFU - a cache admission algorithm -
>> may be even more efficient (in terms of evictions). But the results should
>> be rechecked with PostgreSQL traces.
>>
>> My algorithm would just be the least invasive for the current code, imho.
>
> Here's the demo patch with a logarithmic scale; 2 lines changed.

I was playing with a logarithmic scale in PostgreSQL roughly a year ago.
I didn't find any benefit compared to the current code. In fact, it looked
to perform worse than the current code.
That is why I didn't write about that experiment to pgsql-hackers.
But probably I measured it in a wrong way. That is also why I dream of having
real-world traces in hand.

Consider all the algorithms known to be effective: 2Q, ARC, Clock-Pro,
CAR - they all consider a buffer "hot" if it has higher temporal frequency,
as opposed to a raw access count. They all mostly ignore the spike of usages
during the first moments after placement into the cache, and move a buffer
to "hot" only if it is accessed again some time afterwards.

With regards,
Sokolov Yura.


Attachments

Re: [HACKERS] Clock with Adaptive Replacement

From
Andrey Borodin
Date:

> On May 6, 2018, at 20:38, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> I was playing with a logarithmic scale in PostgreSQL roughly a year ago.
> I didn't find any benefit compared to the current code. In fact, it looked
> to perform worse than the current code.
> That is why I didn't write about that experiment to pgsql-hackers.

Is there a feature in testgres to benchmark pgbench tps against shared buffer size? Let's request this feature if it is not there :)

> But probably I measured it in a wrong way. That is also why I dream of having
> real-world traces in hand.
>
> Consider all the algorithms known to be effective: 2Q, ARC, Clock-Pro,
> CAR - they all consider a buffer "hot" if it has higher temporal frequency,
> as opposed to a raw access count. They all mostly ignore the spike of usages
> during the first moments after placement into the cache, and move a buffer
> to "hot" only if it is accessed again some time afterwards.

These algorithms do not ignore spikes; they ignore a spike's amplitude. And this does not mean that the amplitude is irrelevant information, even if these algorithms perform almost like Belady's.

Building complicated heuristics (with a lot of magic numbers) to merge these spikes into one event does not look promising to me. But this is just my superstition; chances are that you can tune your algorithm into a serious advancement. But please publish benchmarks, whatever the result will be.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Yura Sokolov
Date:
On 06.05.2018 20:28, Andrey Borodin wrote:
> 
>> On May 6, 2018, at 20:38, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> I was playing with a logarithmic scale in PostgreSQL roughly a year ago.
>> I didn't find any benefit compared to the current code. In fact, it looked
>> to perform worse than the current code.
>> That is why I didn't write about that experiment to pgsql-hackers.
> Is there a feature in testgres to benchmark pgbench tps against shared buffer size? Let's request this feature if it
> is not there :)

That would be great. Will you?

But pgbench is a bit... far from the real world. I used pgbench to test
the log scale, and it showed that the log scale is worse.
It would be more important to test with some real-world installations, or
to validate the algorithm with real-world traces.

>> But probably I measured it in a wrong way. That is also why I dream of having
>> real-world traces in hand.
>>
>> Consider all the algorithms known to be effective: 2Q, ARC, Clock-Pro,
>> CAR - they all consider a buffer "hot" if it has higher temporal frequency,
>> as opposed to a raw access count. They all mostly ignore the spike of usages
>> during the first moments after placement into the cache, and move a buffer
>> to "hot" only if it is accessed again some time afterwards.
> These algorithms do not ignore spikes; they ignore a spike's amplitude. And this does not mean that the amplitude is
> irrelevant information, even if these algorithms perform almost like Belady's.

More important is their consideration of temporal frequency.

> Building complicated heuristics (with a lot of magic numbers) to merge these spikes into one event does not look
> promising to me. But this is just my superstition; chances are that you can tune your algorithm into a serious
> advancement. But please publish benchmarks, whatever the result will be.

Yes, these numbers look magic. They are hand-tuned on some traces that are
probably irrelevant to PostgreSQL behavior.

And since you have already presented a patch, could you publish its benchmark
results?

With regards,
Sokolov Yura.


Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Sat, May 5, 2018 at 2:16 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> But then you lose the difference between "touched once" and "actively used". A log
> scale of usage solves this: the usage count grows logarithmically, but drains
> linearly.

Even if we have that, or something with similar effects, it's still
desirable to avoid bumping the usage count multiple times for accesses
that happen close together in time.  I don't really agree with Yura
Sokolov's proposal for how to prevent that problem, because during
periods when no buffer eviction is happening it basically throws out
all of the data: no items are replaced, and the usage count can't be
bumped again until NBlocks/32 items are replaced.  That's bad.  We
want an algorithm that, if there's no buffer eviction for a while and
then we start having to evict buffers, has good information on which
things should be evicted and which should resist eviction.  But I
think that he's right that preventing the usage count from rising
artificially when there are correlated accesses is important.  If we
have a page with usage count 0 which incurs 4 correlated accesses, as
shown in Vladimir Sitnikov's results upthread, then reducing the usage
count linearly requires 4 sweeps through the buffer ring to get the
usage count back down to 0 (4->3->2->1->0); reducing it by dividing by
two still requires 3 sweeps (4->2->1->0).  A good solution for that
case should, like Yura Sokolov's proposed algorithm, avoid pushing the
usage count higher than 1.  So he has the right idea, I think, even if
the method's not quite what we want.
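
In code form, the two decay rules being compared are, per pass of the clock
hand over a buffer (a trivial sketch):

/* linear decay: 4 -> 3 -> 2 -> 1 -> 0, four sweeps to drain */
static unsigned
decay_linear(unsigned usage_count)
{
    return usage_count > 0 ? usage_count - 1 : 0;
}

/* halving decay: 4 -> 2 -> 1 -> 0, three sweeps to drain */
static unsigned
decay_halving(unsigned usage_count)
{
    return usage_count / 2;
}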

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Юрий Соколов
Date:
2018-05-07 17:15 GMT+03:00 Robert Haas <robertmhaas@gmail.com>:
>
> On Sat, May 5, 2018 at 2:16 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > But then you lose the difference between "touched once" and "actively used". A log
> > scale of usage solves this: the usage count grows logarithmically, but drains
> > linearly.
>
> Even if we have that, or something with similar effects, it's still
> desirable to avoid bumping the usage count multiple times for accesses
> that happen close together in time.  I don't really agree with Yura
> Sokolov's proposal for how to prevent that problem, because during
> periods when no buffer eviction is happening it basically throws out
> all of the data: no items are replaced, and the usage count can't be
> bumped again until NBlocks/32 items are replaced.  That's bad.

That is not as bad as it seems at first glance: during a period when
no buffer eviction is happening, all the information is almost irrelevant,
because it is naturally outdated. And due to the choice of "batch" size (25),
there is always a window between "NBlocks/32 items replaced" and
"this item is considered for replacement": even if all items in
25/32*NBlocks had a non-zero usage count, there are at least
7/32*NBlocks to consider before this item could be replaced, during
which its usage count can be incremented.

Note: "1/32 replacements" really means "1/32 new buffers", so while
shared_buffers are not full yet (so no evictions happend), all blocks, except
last 1/32, could have increment of usage count.
 
>
> We want an algorithm that, if there's no buffer eviction for a while and
> then we start having to evict buffers, has good information on which
> things should be evicted and which should resist eviction.  But I
> think that he's right that preventing the usage count from rising
> artificially when there are correlated accesses is important.  If we
> have a page with usage count 0 which incurs 4 correlated accesses, as
> shown in Vladimir Sitnikov's results upthread, then reducing the usage
> count linearly requires 4 sweeps through the buffer ring to get the
> usage count back down to 0 (4->3->2->1->0); reducing it by dividing by
> two still requires 3 sweeps (4->2->1->0).

Yes, that is why a log scale doesn't help much.

> A good solution for that
> case should, like Yura Sokolov's proposed algorithm, avoid pushing the
> usage count higher than 1. So he has the right idea, I think, even if
> the method's not quite what we want.

Well, I proposed an increment of 8 because simulation (on the traces
I found) shows it really improves the quality of the algorithm. It looks
like, if an item gets a hit after a not-so-short period, there is a great
chance this item is hot, so it should be "protected" from eviction.

(and there is code in that repository to develop the algorithm).

That is why it is really important to have real-world traces: it is
not so obvious how the algorithm should perform, and mental conclusions
can be wrong and may not match production load behavior.

For example, all good algorithms track recently evicted items, and it is
not so obvious that an already-evicted item that gets a hit is a "hot" item
that should be placed in a separate "hot" queue.
(My algo doesn't track evicted items, so it may not be as good in terms
of hit rate.)

(Personally, I vote more for Clock-Pro than for my algo. But it has three
"clock hands" and, if implemented as described, still demands
doubly-linked lists. So it should be modified/adapted, imho.)

With regards,
Yura Sokolov.

Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Mon, May 7, 2018 at 12:54 PM, Юрий Соколов <funny.falcon@gmail.com> wrote:
>> Even if we have that, or something with similar effects, it's still
>> desirable to avoid bumping the usage count multiple times for accesses
>> that happen close together in time.  I don't really agree with Yura
>> Sokolov's proposal for how to prevent that problem, because during
>> periods when no buffer eviction is happening it basically throws out
>> all of the data: no items are replaced, and the usage count can't be
>> bumped again until NBlocks/32 items are replaced.  That's bad.
>
> That is not as bad as it seems at first glance: during a period when
> no buffer eviction is happening, all the information is almost irrelevant,
> because it is naturally outdated. And due to the choice of "batch" size (25),
> there is always a window between "NBlocks/32 items replaced" and
> "this item is considered for replacement": even if all items in
> 25/32*NBlocks had a non-zero usage count, there are at least
> 7/32*NBlocks to consider before this item could be replaced, during
> which its usage count can be incremented.

I don't agree that the information is almost irrelevant.  If we have a
working set that starts out fitting within shared_buffers and then
grows larger, we want to know which parts of the data were being
accessed frequently just prior to the point where we started evicting
things.  Otherwise we basically evict at random for a while, and
that's not good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Юрий Соколов
Date:
2018-05-08 16:21 GMT+03:00 Robert Haas <robertmhaas@gmail.com>:
>
> On Mon, May 7, 2018 at 12:54 PM, Юрий Соколов <funny.falcon@gmail.com> wrote:
> >> Even if we have that, or something with similar effects, it's still
> >> desirable to avoid bumping the usage count multiple times for accesses
> >> that happen close together in time.  I don't really agree with Yura
> >> Sokolov's proposal for how to prevent that problem, because during
> >> periods when no buffer eviction is happening it basically throws out
> >> all of the data: no items are replaced, and the usage count can't be
> >> bumped again until NBlocks/32 items are replaced.  That's bad.
> >
> > That is not as bad as it seems at first glance: during a period when
> > no buffer eviction is happening, all the information is almost irrelevant,
> > because it is naturally outdated. And due to the choice of "batch" size (25),
> > there is always a window between "NBlocks/32 items replaced" and
> > "this item is considered for replacement": even if all items in
> > 25/32*NBlocks had a non-zero usage count, there are at least
> > 7/32*NBlocks to consider before this item could be replaced, during
> > which its usage count can be incremented.
>
> I don't agree that the information is almost irrelevant.  If we have a
> working set that starts out fitting within shared_buffers and then
> grows larger, we want to know which parts of the data were being
> accessed frequently just prior to the point where we started evicting
> things.

"just prior to the point" means we have to have machinery for information
expiration without eviction. For example, same "clock hand" should walk
over all buffers continiously, and decrement "usage count", but without
eviction performed. Right?

As an alternative, some approximation of the Working Set algorithm could be used:
- on every access, the "access time" is written to the item;
- items last accessed before some "expiration interval" are considered for
expiration.
This way, there is also fresh information about usage even before any
expiration has been performed.

"usage count" still have to be used to track access frequency, but it
should not be just incremented:
- there should be formula for "corrected count", that depends on
  last access time, written usage count and over-all system access rate.
- increment should look like:
  usage_count = corrected_count(cur_time, access_time, usage_count, access_rate) + increment(cur_time, access_time, access_rate)
- probably, expiration should rely on "corrected_count" either,
  ie evict if "corrected_count" == 0

To smooth access spikes, new values for the "usage count" and "access time"
should be written not on every access, but only once per period.
Oh, looks like I'm inventing another kind of bicycle :-(
and this one is unlikely to be good.

With regards,
Sokolov Yura

Re: [HACKERS] Clock with Adaptive Replacement

From
Vladimir Sitnikov
Date:
> Oh, looks like I'm inventing another kind of bicycle :-(

Do you think you could capture a trace or two from a more-or-less representative application/database?

Discussion of algorithms makes little sense as we all lack traces to compare/validate.

Vladimir

Re: [HACKERS] Clock with Adaptive Replacement

From
Jesper Pedersen
Date:
On 05/08/2018 11:35 AM, Vladimir Sitnikov wrote:
>> Oh, looks like I'm inventing another kind of bicycle :-(
> 
> Do you think you could capture a trace or two from a more-or-less
> representative application/database?
> 
> Discussion of algorithms makes little sense as we all lack traces to
> compare/validate.
> 

I have workloads that I can repeat, so I can help with testing.

Best regards,
  Jesper


Re: [HACKERS] Clock with Adaptive Replacement

From
Vladimir Sitnikov
Date:
> I have workloads that I can repeat, so I can help with testing.

That would be great.

Do you think you could use DTrace to capture the trace?

Vladimir

Re: [HACKERS] Clock with Adaptive Replacement

From
Jesper Pedersen
Date:
On 05/08/2018 11:49 AM, Vladimir Sitnikov wrote:
>> I have workloads that I can repeat, so I can help with testing.
> 
> That would be great.
> 
> Do you think you could use DTrace to capture the trace?
> For instance, https://github.com/vlsi/pgsqlstat/blob/pgsqlio/pgsqlio
> 

DTrace or BPF would be ok.

Best regards,
  Jesper



Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Tue, May 8, 2018 at 10:22 AM, Юрий Соколов <funny.falcon@gmail.com> wrote:
> "just prior to the point" means we have to have machinery for information
> expiration without eviction. For example, same "clock hand" should walk
> over all buffers continiously, and decrement "usage count", but without
> eviction performed. Right?

Right.

> As an alternative, some approximation of the Working Set algorithm could be used:
> - on every access, the "access time" is written to the item;
> - items last accessed before some "expiration interval" are considered for
> expiration.
> This way, there is also fresh information about usage even before any
> expiration has been performed.

That's basically the same kind of thing.  Ideally we want to have one
mechanism that operates all the time, rather than one that works when
no eviction is occurring and a completely separate one that operates
when eviction is occurring.

Anyway, I think we really ought to investigate CAR.  CAR could be made
to piggyback on our existing GClock scheme.  See section III.B of the
CAR paper, page 4, especially the second paragraph in the second
column.  What would change is that instead of sweeping through all
buffers, there would be two sets of buffers T1 and T2, each with a
separate clock hand.  Also, we'd need to store the history lists B1
and B2 someplace.
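
As a rough sketch, the bookkeeping could look something like this (the
field names are illustrative, not from any patch; HTAB is PostgreSQL's
dynahash table type):

typedef struct HTAB HTAB;   /* PostgreSQL's dynahash (utils/hsearch.h) */

/* Two clock lists T1/T2 plus two history lists B1/B2, as in the CAR
 * paper.  A buffer lives in exactly one of T1 (seen once) or T2 (seen
 * again); B1/B2 remember the identities of recently evicted pages. */
typedef struct CarState
{
    int     t1_hand;        /* clock hand sweeping the T1 buffers */
    int     t2_hand;        /* clock hand sweeping the T2 buffers */
    int     t1_size;        /* |T1|, with |T1| + |T2| = NBuffers */
    int     t2_size;
    int     t1_target;      /* adaptive target for |T1| ("p" in the paper) */
    HTAB   *b1_history;     /* evicted from T1: a hit here grows t1_target */
    HTAB   *b2_history;     /* evicted from T2: a hit here shrinks t1_target */
} CarState;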

Independently of that, it would probably also be useful to avoid
bumping the reference count multiple times when a buffer is accessed
by the same backend several times in quick succession.  Perhaps this
could even be as simple as maintaining a list of the last two buffer
IDs for which we bumped the usage count; if we see one of them again,
don't bump the usage count again.
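
A minimal sketch of that last idea, as per-backend static state (the names
are illustrative; in PostgreSQL a Buffer is just an int, with 0 meaning
InvalidBuffer):

#include <stdbool.h>

typedef int Buffer;
#define InvalidBuffer 0

static Buffer recent_bumps[2] = {InvalidBuffer, InvalidBuffer};

/* Returns true if this access should bump the usage count; a repeat of
 * either of the last two bumped buffers is treated as a correlated
 * access and skipped. */
static bool
should_bump_usage(Buffer buf)
{
    if (buf == recent_bumps[0] || buf == recent_bumps[1])
        return false;
    recent_bumps[1] = recent_bumps[0];
    recent_bumps[0] = buf;
    return true;
}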

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Merlin Moncure
Date:
On Wed, May 9, 2018 at 11:00 AM Robert Haas <robertmhaas@gmail.com> wrote:

> Independently of that, it would be probably also useful to avoid
> bumping the reference count multiple times when a buffer is accessed
> by the same backend several times in quick succession.  Perhaps this
> could even be as simple as maintaining a list of the last two buffer
> IDs for which we bumped the usage count; if we see one of them again,
> don't bump the usage count again.

Hm.  Is the objective here to optimize access patterns or to reduce atomic
operations (or both)?  All else being equal, an algorithm that delivers
similar eviction results with less cache synchronization ought to be
preferred... are the various algorithms scored in that way?

merlin


Re: [HACKERS] Clock with Adaptive Replacement

From
Robert Haas
Date:
On Wed, May 9, 2018 at 1:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, May 9, 2018 at 11:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Independently of that, it would be probably also useful to avoid
>> bumping the reference count multiple times when a buffer is accessed
>> by the same backend several times in quick succession.  Perhaps this
>> could even be as simple as maintaining a list of the last two buffer
>> IDs for which we bumped the usage count; if we see one of them again,
>> don't bump the usage count again.
>
> Hm.  Is the objective here to optimize access patterns or to reduce atomic
> operations (or both)?   All else being equal, an algorithm that delivers
> the similar eviction results with less cache synchronization ought to be
> preferred...are the various algorithms scored in that way?

The CAR paper explains steps they've taken to minimize atomic
operations, so it is something that people do think about.

The main reason for wanting to avoid extra usage count bumps is to
avoid distorting the usage counts.  Any savings we get from reducing
atomic ops is a bonus.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Clock with Adaptive Replacement

From
Peter Geoghegan
Date:
On Thu, May 3, 2018 at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Thu, May 3, 2018 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
>> <sitnikov.vladimir@gmail.com> wrote:
>>> Sample output can be seen here:
>>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>>
>> Neat.  Not sure what generated this trace, but note this part:

> I don't have time to check this out just now, but it seems like an
> excellent idea. It would be nice if it could be enhanced further, so
> you get some idea of what the blocks are without having to decode them
> yourself using tools like pageinspect.

Although I'm a Linux user, I added --enable-dtrace to my standard
release build configure alias several months back. I wanted to be able
to use the user space probes it adds with BCC [1], which is a really
nice tool (the dtrace/SystemTap probes seem to be totally generic).
If you have a fairly recent kernel, many really nice tricks are
possible with only modest effort. I've been meaning to polish up a
selection of one-liners I came up with several months back, to make
them available for others.

Here is a one-liner that traces buffer reads within the backend with PID 6617:

pg@bat:~/notes/setup$ sudo /usr/share/bcc/tools/trace -p 6617 --time
'u:/code/postgresql/root/install/bin/postgres:postgresql:buffer__read__start
"fork: %d, blockNum: %u, spcNode: %u, dbNode: %u, relNode: %u,
backend: %d, isExtend: %d" arg1, arg2, arg3, arg4, arg5, arg6, arg7'

If I run the query "select * from pgbench_accounts where aid = 555"
within that backend, I can see the following:

TIME     PID    TID    COMM         FUNC             -
12:17:28 6617   6617   postgres     buffer__read__start fork: 0,
blockNum: 290, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres     buffer__read__start fork: 0,
blockNum: 3, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres     buffer__read__start fork: 0,
blockNum: 2, spcNode: 1663, dbNode: 13100, relNode: 16404, backend:
-1, isExtend: 0
12:17:28 6617   6617   postgres     buffer__read__start fork: 0,
blockNum: 9, spcNode: 1663, dbNode: 13100, relNode: 16396, backend:
-1, isExtend: 0

This is pretty easy to interpret: 290 is the root page, 3 is another
internal page, and 2 is the leaf page. Finally, block 9 is from the
heap relation.

I haven't actually instrumented the amount of time each read takes
here, so it's not quite as good as your dtrace script. I'm sure that
that would be trivial, since BCC really does seem to make
customization very easy. I just don't want to spend more than 10
minutes on this today.

[1] https://github.com/iovisor/bcc
-- 
Peter Geoghegan


Re: [HACKERS] Clock with Adaptive Replacement

From
Thomas Munro
Date:
On Wed, Apr 25, 2018 at 6:31 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> 4. Using O_DIRECT while writing data files

One interesting thing about O_DIRECT that I haven't seen mentioned in
these discussions:

POSIX effectively requires writes to the same file to be serialised,
meaning in practice that [p]write[v]() acquires a mutex on the inode.
I've heard of people working on other databases that have multiple
write-back threads that considered this to be a serious problem, and I
guess it becomes worse as you concentrate more stuff into bigger files
to cut down on file descriptors and file system metadata operations
and have more concurrent writers.  Some ways around that:

1.  On Linux if you use O_DIRECT then XFS forgets about that POSIX
requirement (O_DIRECT is outside the spec anyway).  It just doesn't
acquire inode->i_mutex, so you get parallel writes to the same file
(except in some corner cases).  AFAIK other Linux filesystems don't do
that, so XFS might be your only choice if you want parallel direct IO
on that OS.  Anyone know more about that?  Relevant kernel code: the
inode_lock() call that appears in ext4_file_write_iter() but not in
xfs_file_write_iter() in the IOCB_DIRECT case.

2.  Ditto for Solaris's UFS, AIX's JFS2 with "O_CIO" enabled (but
apparently not the JFS port on Linux).  I don't know the answer for
FreeBSD's UFS off-hand.

3.  I've heard that ZFS achieves parallel writes to the same file
while magically adhering to the POSIX serialisation rules, and that's
independent of direct IO (which it doesn't even support).

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Clock with Adaptive Replacement

From
"ben.manes"
Date:
I have been working on caching as a hobby project for the last few years and
would be happy to collaborate on an analysis and design. I've tackled a lot
of these issues, wrote widely used implementations, and co-authored a paper
for ACM's Transactions on Storage on an eviction policy. A short summary of
my results is described in this HighScalability article [1].

I very much agree with the requests for traces to use real-world data. I
wrote a simulator that supports a large number of policies and traces [2].
It was invaluable. I believe ARC's DS1 should be the closest to a Postgres
workload. Their OLTP trace appears to be the I/O of the WAL, making it
recency-biased and not very interesting.

** Eviction **
ARC-based policies (CAR, CART) are only modestly better than LRU in most
workloads. They do much better on Zipf, but poorly in loopy workloads like
glimpse. I think that the problem is that they do not capture and use the
history (ghost lists) very effectively.

LIRS-based policies (ClockPro, FRD) are very strong. Unfortunately the
papers do a poor job of describing them, they are complex to implement, and
most implementations are flawed in ways that diminish the hit rate. As
described in its paper, LIRS has an unbounded history that can cause memory
leaks if not capped, and the pruning is O(n).

I try to avoid Clock-based policies in my designs due to their O(n) scans.
These can cause latency spikes that are difficult to reproduce because they
depend on the policy's state. Random sampling policies don't have this
problem, but tend to underperform due to a lack of history. I prefer
amortized O(1) policies, but those are difficult due to concurrency (solved
below).

I chose TinyLFU with an admission window, called W-TinyLFU. This uses a
frequency sketch with saturating counters for its history (e.g,
CountMinSketch w/ 4-bit counters). The counters are halved every sample
period, e.g. 10x the maximum size, which efficiently ages the LFU. Instead
of trying to order for eviction, TinyLFU avoids cache pollution by rejecting
candidates with a low probability of reuse. This is done by checking the
candidate's frequency against the eviction policy's victim, and admitting
only if the candidate offers a better hit rate. The LRU admission window
helps for recency-based traces where the candidate is rejected prematurely
and when accepted is unlikely to be used again soon. This simple and compact
scheme has the highest hit rate in most real-world workloads that I have
tested. To bypass the ACM paywall, you can use the author link on my github
project's README [3].
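
A compact C sketch of such a frequency sketch (the sizes, the hash mixing,
and the aging trigger are placeholder choices of mine, not Caffeine's
actual tuning):

#include <stdint.h>

#define DEPTH 4                 /* rows of the count-min sketch */
#define WIDTH (1 << 16)         /* 4-bit counters per row, two per byte */

static uint8_t sketch[DEPTH][WIDTH / 2];

static uint32_t
slot(int row, uint32_t hash)    /* cheap per-row remix of the key's hash */
{
    return (hash * (2u * row + 1) + 0x9E3779B9u * row) & (WIDTH - 1);
}

static uint8_t
get4(int row, uint32_t i)       /* read one 4-bit counter */
{
    uint8_t b = sketch[row][i >> 1];
    return (i & 1) ? (b >> 4) : (b & 0x0F);
}

static uint8_t
estimate(uint32_t hash)         /* estimated frequency: minimum over rows */
{
    uint8_t min = 15;
    for (int row = 0; row < DEPTH; row++)
    {
        uint8_t v = get4(row, slot(row, hash));
        if (v < min)
            min = v;
    }
    return min;
}

static void
record(uint32_t hash)           /* one access; counters saturate at 15 */
{
    for (int row = 0; row < DEPTH; row++)
    {
        uint32_t i = slot(row, hash);
        if (get4(row, i) < 15)
            sketch[row][i >> 1] += (i & 1) ? 0x10 : 0x01;
    }
}

static void
age(void)                       /* once per sample period: halve every counter */
{
    for (int row = 0; row < DEPTH; row++)
        for (uint32_t j = 0; j < WIDTH / 2; j++)
            sketch[row][j] = (sketch[row][j] >> 1) & 0x77;
}

/* TinyLFU admission: keep the candidate only if it is estimated to be
 * used more frequently than the eviction policy's victim. */
static int
admit(uint32_t candidate_hash, uint32_t victim_hash)
{
    return estimate(candidate_hash) > estimate(victim_hash);
}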

We are currently working on making W-TinyLFU adaptive by dynamically
resizing the window size. I can send the draft paper privately if anyone is
interested. I am a Bay Area engineer working with researchers at Israel's
Technion, so the mixture has been fruitful.

** Concurrency **
As you know, LRU's O(1) nature is very attractive but the list manipulations
are not friendly for multi-threading. The lock hold times are small, but the
access rate is high and causes a lot of contention.

The solution is to use the WAL approach to record the event into a buffer
and replay it under a tryLock. This way adding to the ring buffer is a cheap
CAS and threads won't block as their work has been handed off. When the
buffers fill up they are replayed under the lock in a batch, thereby
isolating the policy from concurrent mutations. I use multiple, lossy ring
buffers to capture reads and replaying is delayed until beneficial. For
writes, I use a ring buffer that is replayed immediately and when full
causes back pressure (though is rarely filled).
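
A bare-bones C11 sketch of that read-buffer-and-replay structure (the sizes
and names are illustrative, and it elides the "replay only when beneficial"
policy):

#include <stdint.h>
#include <stdatomic.h>

#define RING_SIZE 256

static _Atomic uintptr_t ring[RING_SIZE];
static atomic_uint       ring_head;
static atomic_flag       policy_lock = ATOMIC_FLAG_INIT;

extern void policy_apply(uintptr_t event);  /* the single-threaded policy */

static void
record_access(uintptr_t event)
{
    /* Cheap publication; an old unread slot may be overwritten (lossy). */
    unsigned slot = atomic_fetch_add(&ring_head, 1) % RING_SIZE;
    atomic_store(&ring[slot], event);

    /* tryLock: if another thread holds it, our work is already handed off. */
    if (!atomic_flag_test_and_set(&policy_lock))
    {
        for (unsigned i = 0; i < RING_SIZE; i++)
        {
            uintptr_t e = atomic_exchange(&ring[i], 0);
            if (e != 0)
                policy_apply(e);            /* replay the batch under the lock */
        }
        atomic_flag_clear(&policy_lock);
    }
}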

In my benchmarks [4] the overhead is quite small on a Zipf workload (to
simulate hot entries). The read throughput is 33% of a lock-free hash table
and scales linearly. The write throughput is nearly the same due to
contention when updating the hot entries. The read throughput is high enough
to meet performance budgets and its overhead is only observable in synthetic
micro-benchmarks.

The benefit of this approach is that for a very small penalty to reads, the
policy is mostly isolated from the multi-threading. This allows efficient
and complex O(1) algorithms that are not thread-safe. For example I use
W-TinyLFU (LRU+SLRU) for eviction and a Hierarchical TimerWheel for
expiration. While the initial infrastructure is more work, its been a boon
due to composability, a simpler mental model, and not having to use
algorithms that degrade in difficult to debug ways.

Hope that helps,
Ben

[1] http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html
[2] https://github.com/ben-manes/caffeine/wiki/Simulator
[3] https://github.com/ben-manes/caffeine
[4] https://github.com/ben-manes/caffeine/wiki/Benchmarks



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Clock with Adaptive Replacement

From
Andrey Borodin
Date:
Hi, Ben!

> On May 11, 2018, at 21:57, ben.manes <ben.manes@gmail.com> wrote:
>
> I have been working on caching as a hobby project for the last few years...

Thanks for sharing this! Your message is full of insights and W-TinyLFU seems to me very promising for use in shared
buffers.

Best regards, Andrey Borodin.

Re: [HACKERS] Clock with Adaptive Replacement

From
Bruce Momjian
Date:
On Wed, May  2, 2018 at 12:27:19PM -0400, Robert Haas wrote:
> I've seen customer have very good luck going higher if it lets all the
> data fit in shared_buffers, or at least all the data that is accessed
> with any frequency.  I think it's useful to imagine a series of
> concentric working sets - maybe you have 1GB of the hottest data, 3GB
> of data that is at least fairly hot, 10GB of data that is at least
> somewhat hot, and another 200GB of basically cold data.  Increasing
> shared_buffers in a way that doesn't let the next "ring" fit in
> shared_buffers isn't likely to help very much.  If you have 8GB of
> shared_buffers on this workload, going to 12GB is probably going to
> help -- that should be enough for the 10GB of somewhat-hot stuff and a
> little extra so that the somewhat-hot stuff doesn't immediately start
> getting evicted if some of the cold data is accessed.  Similarly,
> going from 2GB to 4GB should be a big help, because now the fairly-hot
> stuff should stay in cache.  But going from 4GB to 6GB or 12GB to 16GB
> may not do very much.  It may even hurt, because the duplication
> between shared_buffers and the OS page cache means an overall
> reduction in available cache space.  If for example you've got 16GB of
> memory and shared_buffers=2GB, you *may* be fitting all of the
> somewhat-hot data into cache someplace; bumping shared_buffers=4GB
> almost certainly means that will no longer happen, causing performance
> to tank.

I would love to know how we can help people find out how much data is in
each of these rings so they can tune shared buffers accordingly.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: [HACKERS] Clock with Adaptive Replacement

From
Peter Geoghegan
Date:
On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> All of that having been said, maybe we have an independent low-level
>> problem: we increase usage_count when we pin a buffer, even if we last
>> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
>> in (and came out with ARC, too). This is more of a problem with
>> miscounting the number of hits based on accidents around how things
>> are structured in the executor -- we need to be more careful about
>> recognizing whether or not block hits are logically distinct.
>> Particularly with stuff like nested loop joins.
>
> I agree that double-counting correlated accesses is a a problem, and I
> agree that we probably want to do something about it.  I am skeptical
> that using wall-clock time is the right way to solve that problem
> because it leads to edge effects -- if for example there is a 5ms
> timeout for bumping the usage count again, then you can probably
> manufacture a workload where the buffer eviction pattern is
> dramatically different for a nested loop that hits a given page every
> 4.5ms than it is for a nested loop that hits a given page every 5.5ms.
> Note that CART aims to solve this problem in a way that doesn't
> involve any fixed amount of wall-clock time.

index_fetch_heap() calls ReleaseAndReadBuffer(), which will notice
when the buffer it has been asked to read and pin is the same as the
buffer it also needs to release/unpin. If they're the same buffer,
which is probably very common, then it simply returns the existing
buffer without releasing its pin. Naturally, pinning and unpinning
increments the heap page's usage_count, which is really bad when
references are correlated like this, so this is a really good idea.
(Of course, we'd also like to avoid the actual work of pinning and
unpinning, but that's arguably secondary -- this
ReleaseAndReadBuffer() logic went in as part of the original 2005
clock sweep commit, so it seems to actually mostly be about correlated
references.)

I wonder if the significant improvements that I've noticed from the
pending v3 of my nbtree-unique-key-heap-tid patch for certain
workloads are related to ReleaseAndReadBuffer(), and if so to what
degree. It's not hard to imagine how it could be a real problem that
index tuple heap TIDs are in random order. ReleaseAndReadBuffer()
could end up repeatedly pinning and unpinning the same heap pages when
they appear again and again out of order on an nbtree leaf page. If
that's true, then it might not matter how smart our algorithm is.

-- 
Peter Geoghegan