Обсуждение: Clock with Adaptive Replacement
Hi hackers,

What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads?

I found this ten-year-old thread:

http://www.postgresql.org/message-id/flat/d2jkde$6bg$1@sea.gmane.org#d2jkde$6bg$1@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm).

As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them?

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> What do you think about improving cache replacement clock-sweep algorithm in
> PostgreSQL with adaptive version proposed in this article:
>
> http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
>
> Are there some well known drawbacks of this approach or it will be
> interesting to adopt this algorithm to PostgreSQL and measure it impact on
> performance under different workloads?
> I found this ten years old thread:
>
> http://www.postgresql.org/message-id/flat/d2jkde$6bg$1@sea.gmane.org#d2jkde$6bg$1@sea.gmane.org
>
> but it mostly discusses possible patent issues with another algorithm ARC (CAR
> is inspired by ARC, but it is different algorithm).
> As far as I know there are several problems with current clock-sweep
> algorithm in PostgreSQL, especially for very large caches.
> Maybe CAR can address some of them?

Maybe, but the proof of the pudding is in the eating. Just because an algorithm is smarter, newer, and better in general than our current algorithm - and really, it wouldn't be hard - doesn't mean that it will actually solve the problems we care about. A few of my EnterpriseDB colleagues spent a lot of time benchmarking various tweaks to our current algorithm last year and were unable to construct a test case where it sped anything up. If they tried the same tweaks against the 9.4 source base, they could get a speedup. But 9.5 had locking improvements around buffer eviction, and with those improvements committed there was no longer any measurable benefit to improving the quality of buffer eviction decisions. That's a surprising result, to me anyway, and somebody else might well find a test case where a benefit can be shown - but our research was not successful.

I think it's important to spend time and energy figuring out exactly what the problems with our current algorithm are. We know in general terms that usage counts tend to converge to either 5 or 0 and therefore sometimes evict buffers both at great cost and almost randomly. But what's a lot less clear is how much that actually hurts us given that we are relying on the OS cache anyway. It may be that we need to fix some other things before or after improving the buffer eviction algorithm before we actually get a performance benefit. I suspect, for example, that a lot of the problems with large shared_buffers settings have to do with the bgwriter and checkpointer behavior rather than with the buffer eviction algorithm; and that others have to do with cache duplication between PostgreSQL and the operating system. So, I would suggest (although of course it's up to you) that you might want to focus on experiments that will help you understand where the problems are before you plunge into writing code to fix them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thank you very much for the response. I am not sure that CART can significantly improve PostgreSQL performance - I just want to know the opinion of the community about CAR/CART and other possible alternatives to the GCLOCK algorithm. It looks like CAR really does provide a better cache hit ratio, and so on some workloads it should increase Postgres performance. But nowadays the amount of memory in servers is large enough to keep most typical databases entirely in cache, so the time to locate a buffer in the cache is more critical than the time of buffer eviction - and there CART doesn't provide any benefit compared with the GCLOCK algorithm.

One of the problems with the GCLOCK algorithm, from my point of view, is that for large caches containing a large number of pages, locating a victim page can take a substantial amount of time, because we have to perform several turnovers before some count becomes zero. In theory CART can address this problem, because there are no counters - just a single bit per page.

On 12.02.2016 18:55, Robert Haas wrote:
> On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> What do you think about improving cache replacement clock-sweep algorithm in
>> PostgreSQL with adaptive version proposed in this article:
>>
>> http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
> Maybe, but the proof of the pudding is in the eating.

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
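[Editorial aside: the "several turnovers" cost Konstantin describes can be seen in a toy model of GCLOCK victim selection. This is a sketch for illustration only, not PostgreSQL's actual StrategyGetBuffer() in freelist.c; the class and names here are invented, and the real code uses atomics and locks.]

```python
# Toy GCLOCK model: each buffer has a usage count capped at 5; on
# eviction the clock hand decrements counts until it finds a zero.

MAX_USAGE_COUNT = 5

class ClockSweep:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def touch(self, buf):
        # On a cache hit, bump the usage count (GCLOCK's "reference").
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE_COUNT)

    def evict(self):
        # Sweep, decrementing counts; return (victim, hand moves needed).
        moves = 0
        while True:
            buf = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            moves += 1
            if self.usage[buf] == 0:
                return buf, moves
            self.usage[buf] -= 1

cs = ClockSweep(8)
for b in range(8):
    for _ in range(5):
        cs.touch(b)          # every buffer "hot": usage count 5
victim, moves = cs.evict()
# With all counts at 5, the hand makes 5 full turnovers plus one step
# before any count reaches zero: 5 * 8 + 1 = 41 moves.
print(victim, moves)         # -> 0 41
```

In a cache of N buffers where every count has converged to the cap, a single eviction costs O(5N) hand movements - which is exactly the "several turnovers" problem for very large shared_buffers.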
On 2/12/16 9:55 AM, Robert Haas wrote:
> I think it's important to spend time and energy figuring out exactly
> what the problems with our current algorithm are. We know in general
> terms that usage counts tend to converge to either 5 or 0 and
> therefore sometimes evict buffers both at great cost and almost

Has anyone done testing on the best cap for the usage count? IIRC 5 was pulled out of thin air. Actually, I don't recall ever seeing a clock sweep that supported more than a single bit, though often there are multiple 'pools' a buffer could be in (ie: active vs inactive in most unix VMs). If you have a reasonable number of 1 or 0 count buffers then this probably doesn't matter too much, but if your working set is significantly higher than shared buffers then you're probably doing a LOT of full sweeps to try and get something decremented down to 0.

> randomly. But what's a lot less clear is how much that actually hurts
> us given that we are relying on the OS cache anyway. It may be that
> we need to fix some other things before or after improving the buffer
> eviction algorithm before we actually get a performance benefit. I
> suspect, for example, that a lot of the problems with large
> shared_buffers settings have to do with the bgwriter and checkpointer
> behavior rather than with the buffer eviction algorithm; and that
> others have to do with cache duplication between PostgreSQL and the
> operating system. So, I would suggest (although of course it's up to

It would be nice if there was at least an option to instrument how long an OS read request took, so that you could gauge how many requests were coming from the OS cache vs disk. (Obviously direct knowledge from the OS is even better, but I don't think those APIs exist.)

> you) that you might want to focus on experiments that will help you
> understand where the problems are before you plunge into writing code
> to fix them.
+1
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Jim Nasby <Jim.Nasby@BlueTreble.com> writes:
> On 2/12/16 9:55 AM, Robert Haas wrote:
>> I think it's important to spend time and energy figuring out exactly
>> what the problems with our current algorithm are. We know in general
>> terms that usage counts tend to converge to either 5 or 0 and
>> therefore sometimes evict buffers both at great cost and almost

> Has anyone done testing on the best cap to usage count? IIRC 5 was
> pulled out of thin air.

My recollection is that there was some testing behind it ... but that was back around 2005, so it seems safe to assume that that testing is no longer terribly relevant. In particular, I'm sure it was tested with shared_buffer counts far smaller than what we now consider sane.

regards, tom lane
On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> Hi hackers,
>
> What do you think about improving cache replacement clock-sweep algorithm in
> PostgreSQL with adaptive version proposed in this article:
>
> http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf
>
> Are there some well known drawbacks of this approach or it will be
> interesting to adopt this algorithm to PostgreSQL and measure it impact on
> performance under different workloads?

I'm not currently planning to work in this area and have done no real investigation, so please consider the following to be "water cooler talk". I also came across that paper while reading about buffering as general background.

Yeah, CAR[T] looks pretty interesting at first glance. Automatic scan resistance seems like something we'd want, and its approach to frequency sounds smarter than GCLOCK's usage counter with its arbitrary parameter 5. Here's a nice slide deck from the authors:

http://www.cse.iitd.ernet.in/~sbansal/pubs/fast04ppt.pdf

There doesn't seem to be as much written about GCLOCK beyond the 1992 TPC-A paper[1], possibly because, as the CAR paper says, "[a] fundamental disadvantage of GCLOCK is that it requires counter increment on every page hit which makes it infeasible for virtual memory", making it less interesting to the OS guys. That is, we actually have slightly more information, because we have an opportunity to bump the usage counter when pinning and the VM guys don't, probably explaining why the paper compares with CLOCK and not GCLOCK.

One question is whether SPC-1 looks anything like a typical database workload. Say someone wrote a prototype CAR[T] patch for PostgreSQL: how would you validate it? I'm also curious about how the algorithms at different levels interact when using buffered IO. While other databases typically do direct IO, you can still find a trail of papers about this topic, since disk arrays and network-attached storage systems also have caches and page replacement problems[2][3], and of course multi-level caching problems apply to non-database applications too. I wonder to what extent Linux's use of "a mash of a number of different algorithms with a number of modifications for catching corner cases and various optimisations"[4] affects the performance of different clock-based algorithms operating higher up.

If we had a page replacement algorithm with CART's magical claimed properties and it really does perform better than GCLOCK + our scan resistance special cases, I wonder if it would be tempting to stick SLRU contents into shared buffers. slru.c could gain a smgr-compatible interface so that bufmgr.c could treat those buffers like any other, using smgr_which to select the appropriate storage manager. (I'm going to be proposing something similar for undo log storage, more on that later.)

[1] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.452.9699&rep=rep1&type=pdf
[2] https://www.usenix.org/legacy/event/usenix01/full_papers/zhou/zhou.pdf
[3] https://www.usenix.org/legacy/event/usenix02/full_papers/wong/wong_html/
[4] http://www.spinics.net/lists/linux-mm/msg13385.html

-- 
Thomas Munro
http://www.enterprisedb.com
Hi, Thomas!

> On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>
> On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
> <k.knizhnik@postgrespro.ru> wrote:
>> Are there some well known drawbacks of this approach or it will be
>> interesting to adopt this algorithm to PostgreSQL and measure it impact on
>> performance under different workloads?
>
> I'm not currently planning to work in this area and have done no real
> investigation, so please consider the following to be "water cooler
> talk".

I intend to make some prototypes in this area, but I still haven't allocated time chunks sufficient to make anything useful.

I think that replacement of the current CS5 will:
1. allow use of big shared buffers
2. make DIRECT_IO a realistic possibility
3. improve BgWriter
4. unify different buffering strategies into a single buffer manager (there will be no need to place VACUUM into a special buffer ring)
5. finally allow aio and more efficient prefetching like [0]

Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.

I'm not sure CART is the best possibility, though. I think that the right way is to implement many prototypes with LRU, ARC, CAR, CART, and 2Q, then benchmark them well. Or even make this algorithm pluggable? But currently we have a lot of dependent parts in the system. I do not even know where to start.

Best regards, Andrey Borodin.

[0] http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
[1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
[2] http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html
On Tue, 24 Apr 2018 at 8:04, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi, Thomas!
>
> > On 24 Apr 2018, at 2:41, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> >
> > On Fri, Feb 12, 2016 at 10:02 AM, Konstantin Knizhnik
> > <k.knizhnik@postgrespro.ru> wrote:
> >> Are there some well known drawbacks of this approach or it will be
> >> interesting to adopt this algorithm to PostgreSQL and measure it impact on
> >> performance under different workloads?
> >
> > I'm not currently planning to work in this area and have done no real
> > investigation, so please consider the following to be "water cooler
> > talk".
> I intend to make some prototypes in this area, but I still haven't allocated time chunks sufficient to make anything useful.
> I think that replacement of the current CS5 will:
> 1. allow use of big shared buffers
> 2. make DIRECT_IO a realistic possibility
> 3. improve BgWriter
> 4. unify different buffering strategies into a single buffer manager (there will be no need to place VACUUM into a special buffer ring)
> 5. finally allow aio and more efficient prefetching like [0]
> Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.
> I'm not sure CART is the best possibility, though.
> I think that the right way is to implement many prototypes with LRU, ARC, CAR, CART, and 2Q, then benchmark them well. Or even make this algorithm pluggable? But currently we have a lot of dependent parts in the system. I do not even know where to start.
> Best regards, Andrey Borodin.
> [0] http://diku.dk/forskning/Publikationer/tekniske_rapporter/2004/04-03.pdf
> [1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png
> [2] http://blog.dataegret.com/2018/01/postgresql-performance-meltdown.html
Before implementing algorithms within PostgreSQL it will be great to test them outside with real traces.
I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It has to be extremely efficient, so that it can be enabled in the real world - something like a circular buffer in shared memory, with a separate worker to dump it to disk.
Instead of the full block address, a 64-bit hash could be used - or rather a 63-bit hash, plus 1 bit to designate read/write access.
Using these traces, it will be easy to choose a couple of "theoretically" best algorithms, and then attempt to implement them.
With regards,
Yura
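[Editorial aside: the record format Yura proposes - 63 bits of hash plus one read/write bit, 8 bytes per access - could be packed like this sketch. The hash function and the textual block identifier here are arbitrary stand-ins; a real tracer would hash the relfilenode/fork/block tuple.]

```python
# Pack one trace record into a single 64-bit slot of the proposed
# shared-memory ring: low bit = access type, upper 63 bits = block hash.

import hashlib

READ, WRITE = 0, 1

def encode(block_id: str, access: int) -> int:
    # 64-bit hash of the block address; sacrifice the low bit for the flag.
    h = int.from_bytes(hashlib.blake2b(block_id.encode(),
                                       digest_size=8).digest(), "little")
    return ((h >> 1) << 1) | access     # fits in one 8-byte trace slot

def decode(rec: int):
    return rec >> 1, rec & 1            # (63-bit hash, read/write bit)

rec = encode("base/16384/2619:0", WRITE)
h, acc = decode(rec)
assert acc == WRITE and h < (1 << 63) and rec < (1 << 64)
```

Collisions in a 63-bit hash are rare enough not to matter for offline cache simulation, and not storing the real block address keeps the trace anonymous.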
> On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:
>
> Before implementing algorithms within PostgreSQL it will be great to test them outside with real traces.
>
> I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It has to be extremely efficient to be enabled in the real world. Something like a circular buffer in shared memory, and a separate worker to dump it to disk.
> Instead of the full block address, a 64-bit hash could be used. Or rather 63 bits + 1 bit to designate read/write access.

Yes, this is a good idea to track the pattern of block usage. But I think that the cost of developing the actual page eviction strategy is negligibly small compared to the infrastructure changes needed by any non-CS5 strategy.

Best regards, Andrey Borodin.
On Tue, 24 Apr 2018 at 15:16, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > On 24 Apr 2018, at 11:31, Юрий Соколов <funny.falcon@gmail.com> wrote:
> >
> > Before implementing algorithms within PostgreSQL it will be great to test them outside with real traces.
> >
> > I think there should be a mechanism to collect traces from real-world PostgreSQL installations: every read and write access. It has to be extremely efficient to be enabled in the real world. Something like a circular buffer in shared memory, and a separate worker to dump it to disk.
> > Instead of the full block address, a 64-bit hash could be used. Or rather 63 bits + 1 bit to designate read/write access.
> Yes, this is a good idea to track the pattern of block usage.
> But I think that the cost of developing the actual page eviction strategy is negligibly small compared to the infrastructure changes needed by any non-CS5 strategy.
> Best regards, Andrey Borodin.
Exactly. And it would be a pity if the implemented strategy turned out not to be "the best".
Different strategies may need different infrastructure, and implementing five infrastructures is certainly more expensive than choosing the two best strategies using traces and implementing those.
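[Editorial aside: the offline trace-replay comparison being discussed could start as small as this sketch - feed one access sequence through a couple of policies and compare hit counts. The Pareto-distributed trace is a synthetic stand-in for a real collected trace, and only LRU and a one-bit CLOCK are shown.]

```python
# Replay a trace through LRU and CLOCK and count hits for each.

import random
from collections import OrderedDict

def lru_hits(trace, size):
    cache = OrderedDict()
    hits = 0
    for page in trace:
        if page in cache:
            hits += 1
            cache.move_to_end(page)
        else:
            if len(cache) >= size:
                cache.popitem(last=False)   # evict least recently used
            cache[page] = True
    return hits

def clock_hits(trace, size):
    pages, ref, where = [], [], {}
    hand, hits = 0, 0
    for page in trace:
        if page in where:
            hits += 1
            ref[where[page]] = True         # give a second chance
            continue
        if len(pages) < size:               # cache not yet full
            where[page] = len(pages)
            pages.append(page); ref.append(True)
            continue
        while ref[hand]:                    # sweep to an unreferenced slot
            ref[hand] = False
            hand = (hand + 1) % size
        del where[pages[hand]]              # evict the victim
        pages[hand], ref[hand], where[page] = page, True, hand
        hand = (hand + 1) % size
    return hits

random.seed(42)
trace = [int(random.paretovariate(1.2)) for _ in range(50000)]
for name, fn in [("LRU", lru_hits), ("CLOCK", clock_hits)]:
    print(name, fn(trace, 100))
```

Swapping in ARC, CAR, CART, or 2Q is then just another `*_hits` function against the same trace, which is exactly the cheap comparison being argued for here.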
Hi,

On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:
> But, I think that cost of development of real page eviction strategy
> itself is neglectable small compared to infrastructure changes needed
> by any non-CS5 strategy.

What problems are you seeing? This isn't a lot of code?

Greetings,

Andres Freund
> On 24 Apr 2018, at 23:14, Andres Freund <andres@anarazel.de> wrote:
>
> On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote:
>> But, I think that cost of development of real page eviction strategy
>> itself is neglectable small compared to infrastructure changes needed
>> by any non-CS5 strategy.
>
> What problems are you seeing? This isn't a lot of code?

1. Teaching BgWriter to use data from the eviction strategy to aggressively flush data to disk (instead of ++next_to_clean)
2. Implementing strategies as lock-free algorithms for the freelist

These parts seem most important for benchmarking. Also:

3. Converting all rings to a single buffer manager where possible
4. Using O_DIRECT while writing data files
5. Using aio and scheduling of writes

These parts are not necessary, but most challenging, though not impossible.

Best regards, Andrey Borodin.
On Tue, Apr 24, 2018 at 5:04 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Here's what we have about the size of shared buffers now [1] (taken from [2]). I believe the right hill must be big enough to reduce the central pit to zero and make the function monotonic: the OS page cache knows less about data blocks and is expected to be less efficient.
> [1] https://4.bp.blogspot.com/-_Zz6X-e9_ok/WlaIvpStBmI/AAAAAAAAAA4/E1NwV-_82-oS5KfmyjoOff_IxUXiO96WwCLcBGAs/s1600/20180110-PTI.png

Interesting curve. Can you explain it? Here is my guess, assuming that pgbench is generating uniformly distributed random access, and assuming a 26GB dataset and 32GB of physical RAM:

1. In the range 128MB to 6GB, only a small fraction of the dataset fits into shared_buffers, but it doesn't matter much: you get a 100% hit ratio from the kernel's page cache.

2. In the range 6GB to 16GB, the dataset doesn't fit in shared_buffers OR the kernel's page cache, so you begin to risk misses in both caches. Every byte you give to one you take away from the other, but it might be worse than zero sum: at the lowest point in your graph, x = 16GB, I suppose you have a ~61% hit ratio in shared buffers (16GB/26GB), but when you miss you're also likely to miss in the OS page cache too, because that probably holds the most recently read 16GB (= the other half of physical RAM) of data... in other words the same data, assuming uniform random access. (If it were not random, then some pages would have a greater chance of staying in PG's cache and falling out of the kernel's cache, but essentially random replacement prevents that (?))

3. In the range 16GB to 26GB, the shared_buffers hit ratio increases from ~61% to 100%.

I guess if you had 48GB of RAM the dip between 6GB and 26GB wouldn't happen? Am I close?

> I'm not sure CART is the best possibility, though.

I don't know either, but LRU and ARC don't seem promising for various reasons. Aside from the scan resistance claim, it sounds like CART should avoid long searches for a buffer.
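[Editorial aside: Thomas's three-regime guess can be put into a tiny worst-case model. The 26GB/32GB numbers are the ones from his message; the pessimistic assumption, per his point 2, is that under uniform random access the smaller cache duplicates part of the larger one, so only the larger of the two actually saves disk reads.]

```python
# Back-of-envelope model of the U-shaped curve (all sizes in GB).

DATASET, RAM = 26.0, 32.0

def disk_miss_ratio(s):
    pg = min(s, DATASET)                  # data held by PostgreSQL
    os = min(max(RAM - s, 0.0), DATASET)  # data held by the kernel cache
    # Worst case: full duplication, so only the larger cache counts.
    return (DATASET - max(pg, os)) / DATASET

for s in (2, 6, 16, 26):
    print(s, round(disk_miss_ratio(s), 3))
# -> misses peak at ~0.385 around s = 16 and fall to zero at both ends,
#    reproducing the dip in the middle of the graph.
```

With RAM = 48GB, `max(pg, os)` never drops below 26GB for any split, so the dip disappears - matching the guess at the end of the message.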
In my primitive attempts to understand that a bit better, I tried instrumenting freelist.c to tell dtrace how far it had to move the clock hand to find a buffer, and I got histograms like this (this example from a 1GB dataset, 512MB shared buffers, pgbench select-only; the most frequent numbers were in the 4-7 bucket, but there were 3 freak cases in the 16384-32767 bucket):

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |@@@@                                     65535
               1 |@@@@@                                    89190
               2 |@@@@@@@@@                                139114
               4 |@@@@@@@@@@@                              170881
               8 |@@@@@@@@                                 132579
              16 |@@@                                      47731
              32 |                                         5530
              64 |                                         137
             128 |                                         1
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         0
            4096 |                                         0
            8192 |                                         1
           16384 |                                         3
           32768 |                                         0

You can see just a few outliers there. With smaller shared buffers the counts are bigger but the average hand movement distance gets smaller (here 128MB shared buffers):

           value  ------------- Distribution ------------- count
              -1 |                                         0
               0 |                                         16383
               1 |@@@@@@@@@@@@@@@                          1289265
               2 |@@@@@@@@@@@@@@@                          1270531
               4 |@@@@@@@@                                 659389
               8 |@                                        111091
              16 |                                         2815
              32 |                                         19
              64 |                                         1
             128 |                                         0
             256 |                                         0
             512 |                                         0
            1024 |                                         0
            2048 |                                         2
            4096 |                                         2
            8192 |                                         0

I think pgbench isn't a very interesting workload though. I don't have time right now but it would be fun to dig further and try to construct workloads that generate pathological searches for buffers (I believe that such workloads exist, from anecdotal reports).

-- 
Thomas Munro
http://www.enterprisedb.com
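[Editorial aside: lacking dtrace, the same measurement can be approximated in simulation. This toy GCLOCK model uses invented parameters and a uniform workload, and collects the hand-movement-per-eviction distances into the same kind of power-of-two buckets as dtrace's quantize().]

```python
# Histogram of clock-hand movement per eviction in a simulated GCLOCK.

import random
from collections import Counter

NBUF, NPAGES, MAX_USAGE = 512, 1024, 5
usage, resident, where = [0] * NBUF, [None] * NBUF, {}
hand = 0
hist = Counter()

random.seed(1)
for _ in range(200000):
    page = random.randrange(NPAGES)       # uniform accesses, pgbench-style
    if page in where:
        b = where[page]
        usage[b] = min(usage[b] + 1, MAX_USAGE)
        continue
    moves = 0
    while usage[hand] > 0:                # sweep for a victim
        usage[hand] -= 1
        hand = (hand + 1) % NBUF
        moves += 1
    if resident[hand] is not None:
        del where[resident[hand]]
    resident[hand], usage[hand], where[page] = page, 1, hand
    hand = (hand + 1) % NBUF
    hist[moves.bit_length()] += 1         # power-of-two bucket, as quantize()

for bucket in sorted(hist):
    lo = 0 if bucket == 0 else 1 << (bucket - 1)
    print(f"{lo:6d} {hist[bucket]}")
```

Varying NBUF relative to NPAGES reproduces the trade-off visible in the two dtrace histograms above: a smaller cache means more evictions but shorter sweeps, a bigger cache means fewer evictions with occasional long ones.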
On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I think pgbench isn't a very interesting workload though. I don't
> have time right now but it would be fun to dig further and try to
> construct workloads that generate pathological searches for buffers (I
> believe that such workloads exist, from anecdotal reports).

While pgbench might not be the most interesting workload, I think even its super-synthetic, uniformly distributed point lookups can benefit from better caching decisions. The LRU-K paper's example 1.1 explains very cogently why this might be: point lookups from TPC-A (or TPC-C) consist of descending the index, and then accessing the record in the main table from a TID, which actually implies quite a non-uniform access pattern for individual pages, despite accessing table records uniformly.

There are a negligible number of internal pages involved with pgbench (and they should always be cached), so they can almost be ignored. It's really just a matter of what proportion of shared_buffers should be leaf pages versus heap pages. The pgbench accounts table is approximately 6x the size of its primary key, so the leaf blocks have 6x the utility of heap blocks for this workload. It actually is pretty much that simple. Or it would be, if we didn't have the OS cache as a confounding factor. The optimal strategy for the pgbench workload, as far as maximizing hits in shared_buffers goes, is to *only* cache leaf blocks, at least until you have all leaf blocks cached; we should evict table blocks *immediately*. This isn't even remotely close to the behavior of an LRU, of course. Actually, as the LRU-K paper also points out, an LRU has a tendency to perversely somewhat *favor* the table blocks, since presumably each point lookup's table block is accessed after its leaf block is pinned/accessed.
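[Editorial aside: rough numbers behind the 6x utility claim. The page counts below are invented; only the 6:1 heap-to-index ratio comes from the message. Each point lookup is modeled as one uniform leaf access plus one uniform heap access.]

```python
# Expected hit ratio in shared_buffers for two ways of spending the
# same number of buffers (OS cache deliberately ignored, as above).

LEAF_PAGES, HEAP_PAGES = 1000, 6000
CACHE = 1500                        # buffers available

def hit_ratio(leaf_cached, heap_cached):
    # Half the accesses hit a leaf page, half a heap page.
    return 0.5 * (leaf_cached / LEAF_PAGES) + 0.5 * (heap_cached / HEAP_PAGES)

# Optimal: spend buffers on leaf pages first (6x density = 6x payoff).
optimal = hit_ratio(min(CACHE, LEAF_PAGES), max(0, CACHE - LEAF_PAGES))
# Blind proportional split, roughly what a policy with no frequency
# signal drifts toward.
blind = hit_ratio(CACHE * LEAF_PAGES / 7000, CACHE * HEAP_PAGES / 7000)
print(round(optimal, 3), round(blind, 3))   # -> 0.542 0.214
```

So under these made-up sizes, leaf-first caching more than doubles the hit ratio over a frequency-blind split - which is the gap a CAR/CART-style frequency signal would need to detect.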
Before some of the big shared_buffers bottlenecks were alleviated several years ago, it was possible to observe shared_buffers evictions occurring essentially at random. I have no idea if that's still true, but it could be. A random replacement strategy is actually not as bad as it sounds, but it is still pretty bad, especially given that we could pay a high price in contention in return for such a bad strategy. This convinces me that we will not be able to assess any algorithm well in isolation, through simulations, because lock contention is too important overall, and can actually affect the cache hit ratio of a clock-based approach. I believe that it's generally true that our clock sweep algorithm approximates an LRU, but I have also seen it totally fail to do that, which it sounds like you've heard about, too. Andrey is probably correct in his suspicion that we'll end up prototyping a number of approaches. I'm glad that you're thinking about this, in any case. Seems like a promising area to work on. -- Peter Geoghegan
On Thu, Apr 26, 2018 at 10:54 AM, Peter Geoghegan <pg@bowt.ie> wrote: > On Wed, Apr 25, 2018 at 5:26 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> I think pgbench isn't a very interesting workload though. I don't >> have time right now but it would be fun to dig further and try to >> construct workloads that generate pathological searches for buffers (I >> believe that such workloads exist, from anecdotal reports). > > While pgbench might not be the most interesting workload, I think even > its super-synthetic, uniformly distributed point lookups can benefit > from better caching decisions. The LRU-K paper's example 1.1 explains > why this might be very cogently: Point lookups from TPC-A (or TPC-C) > consist of descending the index, and then accessing the record in the > main table from a TID, which actually implies quite a non-uniform > access pattern for individual pages. Despite accessing table records > uniformly. > > There are a negligible number of internal pages involved with pgbench > (and they should always be cached), so they can almost be ignored. > It's really just a matter of what proportion of shared_buffers should > be leaf pages versus heap pages. The pgbench accounts table is > approximately 6x the size of its primary key, so the leaf blocks have > 6x the utility of heap blocks for this workload. It actually is pretty > much that simple. Or it would be, if we didn't have the OS cache as a > confounding factor. The optimal strategy for the pgbench workload as > far as maximizing hits in shared_buffers goes is to *only* cache leaf > blocks, at least until you have all leaf blocks cached; we should > evict table blocks *immediately*. This isn't even remotely close to > the behavior of an LRU, of course. Actually, as the LRU-K paper also > points out, an LRU has a tendency to perversely somewhat *favor* the > table blocks, since presumably each point lookup's table block is > accessed after its leaf block is pinned/accessed. Huh. Right. 
So it's not truly uniform. I wonder if any of these algorithms are sensitive to the higher value of the leaf pages than the heap pages. It seems quite subtle: even though we can see that the individual leaf pages must be 6x more likely to be accessed again next time than individual heap pages due to their higher tuple density, they're still very unlikely to be accessed again soon, and the question is whether any of these algorithms can track that for long enough to see the difference between two very low access frequencies, in a huge field of unlikely-to-be-accessed pages. LRU, by not even attempting to model frequency, is I guess uniquely affected by the heap-after-index-leaf thing. (I haven't read the LRU-K paper... I have a lot to learn before I could contribute much here. Added to pile. Thanks.)

A thought experiment about the U-shaped performance when your dataset fits in neither the PG nor the kernel cache, but would fit entirely in physical memory if you made either of the two caches big enough: I suppose when you read a page in, you could tell the kernel that you POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could tell the kernel that you POSIX_FADV_WILLNEED its former contents (in advance somehow), on the theory that the coldest stuff in the PG cache should now become the hottest stuff in the OS cache. Of course that sucks, because the best the kernel can do then is go and read it from disk, and the goal is to avoid IO. Given a hypothetical way to "write" "clean" data to the kernel (so it wouldn't mark it dirty and generate IO, but it would let you read it back without generating IO if you're lucky), then perhaps you could actually achieve exclusive caching at the two levels, and use all your physical RAM without duplication. Then perhaps the U shape would go away, and the curve would instead go up all the way until shared buffers is big enough to hold the whole dataset?

> I'm glad that you're thinking about this, in any case.
> Seems like a promising area to work on.

It's a promising area that I'm interested in learning more about, but I'm not promising to work on it :-)

-- 
Thomas Munro
http://www.enterprisedb.com
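[Editorial aside: the fadvise half of the thought experiment above is at least expressible with existing calls. This is a POSIX/Linux-only sketch; BLCKSZ and the two helper names are invented, and the "write clean data back to the kernel" half has no counterpart at all, which is exactly the gap the message points out.]

```python
# Advisory calls for (approximately) exclusive PG/kernel caching.

import os

BLCKSZ = 8192

def read_block(fd, blockno):
    data = os.pread(fd, BLCKSZ, blockno * BLCKSZ)
    # We now hold the block in our buffer pool; hint the kernel to drop
    # its copy so the two caches stay (mostly) exclusive.
    os.posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, os.POSIX_FADV_DONTNEED)
    return data

def about_to_evict(fd, blockno):
    # The block is about to leave the buffer pool; ask the kernel to pull
    # it back into the page cache. But this triggers a real disk read,
    # which is the flaw noted above: WILLNEED can only re-read, not be
    # handed the clean bytes we already have in memory.
    os.posix_fadvise(fd, blockno * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)
```

Both advices are non-binding hints, so even this half only "mostly" achieves exclusivity; the kernel is free to ignore them.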
On Wed, Apr 25, 2018 at 6:31 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Huh. Right. So it's not truly uniform. I wonder if any of these > algorithms are sensitive to the higher value of the leaf pages than > the heap pages. It seems quite subtle: even though we can see that > the individual leaf pages must be 6x more likely to be accessed again > next time than individual heap pages due to their higher tuple > density, they're still very unlikely to be accessed again soon, and > the question is whether any of these algorithms can track that for > long enough to see the difference between two very low access > frequencies, in a huge field of unlikely-to-be-accessed pages. LRU, > by not even attempting to model frequency, is I guess uniquely > affected by the heap-after-index-leaf thing. Right. Another insight is that it's worth considering weighing the type of page involved, to artificially favor indexes to some degree. I'm not saying that it's a good idea, and it's pretty inelegant. But I'm pretty sure it's been done before, with satisfactory results. > A thought experiment about the U-shaped performance when your dataset > fits in neither PG nor kernel cache, but would fit entirely in > physical memory if you made either of the two caches big enough: I > suppose when you read a page in, you could tell the kernel that you > POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could > tell the kernel that you POSIX_FADV_WILLNEED its former contents (in > advance somehow), on the theory that the coldest stuff in the PG cache > should now become the hottest stuff in the OS cache. Of course that > sucks, because the best the kernel can do then is go and read it from > disk, and the goal is to avoid IO. Not sure about that. I will say that the intuition that this is a good area to work on is based on the challenges that we have with shared_buffers in particular. 
The fact that shared buffers is typically sized no larger than 16GB, say, and yet main memory sizes can now easily be far larger is clearly a problem, and one that we're going to have to get around to addressing directly. ISTM that the whole shared_buffers issue should have little influence on how much we're willing to remember about block popularity; why should we be willing to track information about only the exact number of blocks that will fit in shared_buffers at any one time? As you know, ARC/CAR explicitly models some blocks that are not cache resident as being within two ghost lists. Sounds like something that might have a larger-than-expected benefit for us. How much of a problem is it that we waste memory bandwidth copying to and from OS cache, particularly on large systems? Might that be the bigger problem, but also one that can be addressed incrementally? I think that I may be repeating some of what Andrey said, in another way (not sure of that). -- Peter Geoghegan
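Thomas's quoted thought experiment maps onto the posix_fadvise() interface. Here is a minimal sketch of the two hints; the helper names are invented, the call is platform-specific (hence the guard), and, as Peter notes, it is far from clear this would actually avoid I/O:

```python
import os
import tempfile

def hint_dontneed(fd, offset, length):
    """After reading a block into shared_buffers: tell the kernel we
    cache this range ourselves, so it may drop its copy."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_DONTNEED)

def hint_willneed(fd, offset, length):
    """After evicting a clean buffer: ask the kernel to (re)cache the
    range, since the coldest PG page should become a hot OS-cache page."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)

# Demo on a scratch file, treating the first 8 kB as "block 0".
with tempfile.TemporaryFile() as f:
    f.write(b"\0" * 8192)
    f.flush()
    hint_dontneed(f.fileno(), 0, 8192)   # block just read into PG cache
    hint_willneed(f.fileno(), 0, 8192)   # block just evicted, still clean
```

The hints are advisory only; the kernel is free to ignore them, which is part of why the U-shaped performance curve is hard to fix from userspace.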
On Wed, Apr 25, 2018 at 6:54 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> Before some of the big shared_buffers bottlenecks were alleviated
> several years ago, it was possible to observe shared_buffers evictions
> occurring essentially at random. I have no idea if that's still true,
> but it could be.

I think it is. We haven't done anything to address it. I think if we want to move to direct I/O -- which may be something we need or want to do -- we're going to have to work a lot harder at making good page eviction decisions. Your patch to change the page eviction algorithm didn't help noticeably once we eliminated the contention around buffer eviction, but that's just because the cost of a bad eviction went down, not because we stopped doing bad evictions. I think it would be interesting to insert a usleep() call into mdread() and then test various buffer eviction algorithms with that in place.

I'm personally not very excited about making rules like "index pages are more valuable than heap pages". Such rules will in many cases be true, but it's easy to come up with cases where they don't hold: for example, we might run pgbench for a while and then stop running pgbench and start running big sequential scans for reporting purposes. We don't want to artificially pin the index buffers in shared_buffers just because they're index pages; we want to figure out which pages really matter. Now, I realize that you weren't proposing (and wouldn't propose) a rule that index pages never get evicted, but I think that favoring index pages even in some milder way is basically a hack. Index pages aren't *intrinsically* more valuable; they are more valuable because they will, in many workloads, be accessed more often than heap pages. A good algorithm ought to be able to figure that out based on the access pattern, without being explicitly given a hint, I think.
I believe the root of the problem here is that the usage count we have today does a very poor job distinguishing what's hot from what's not. There have been previous experiments around making usage_count use some kind of a log scale: we make the maximum, say, 32, and the clock hand divides by 2 instead of subtracting 1. I don't think those experiments were enormously successful and I suspect that a big part of the reason is that it's still pretty easy to get to a state where the counters are maxed out for a large number of buffers, and at that point you can't distinguish between those buffers any more: they all look equally hot. We need something better. If a system like this is working properly, things like interior index pages and visibility map pages ought to show up as fiery hot on workloads where the index or visibility map, as the case may be, is heavily used. A related problem is that user-connected backends end up doing a lot of buffer eviction themselves on many workloads. Maybe the bgreclaimer patch Amit wrote a few years ago could help somehow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
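The two usage_count policies Robert contrasts (the stock cap of 5 with the clock hand subtracting 1, versus the experimental cap of 32 with the hand dividing by 2) can be compared in a toy simulation. This is an illustrative sketch only — it does not mirror freelist.c — and it reproduces the saturation problem described above: under a hot enough workload both variants max out and stop discriminating between buffers.

```python
# Toy clock sweep over a fixed array of buffers, comparing the stock
# decay policy (cap 5, subtract 1) with the log-scale variant
# (cap 32, divide by 2).  Purely illustrative.

class ClockSweep:
    def __init__(self, nbuffers, cap, decay):
        self.usage = [0] * nbuffers
        self.cap = cap
        self.decay = decay        # function the clock hand applies
        self.hand = 0

    def touch(self, buf):
        """A hit bumps the buffer's usage count, up to the cap."""
        self.usage[buf] = min(self.usage[buf] + 1, self.cap)

    def evict(self):
        """Advance the hand, decaying counts, until a zero-count victim."""
        while True:
            if self.usage[self.hand] == 0:
                victim = self.hand
                self.hand = (self.hand + 1) % len(self.usage)
                return victim
            self.usage[self.hand] = self.decay(self.usage[self.hand])
            self.hand = (self.hand + 1) % len(self.usage)

stock = ClockSweep(8, cap=5, decay=lambda u: u - 1)
log_scale = ClockSweep(8, cap=32, decay=lambda u: u // 2)

# Saturate every buffer with 40 hits: both variants pin at their cap,
# so all buffers look equally hot -- exactly the failure mode above.
for cache in (stock, log_scale):
    for buf in range(8):
        for _ in range(40):
            cache.touch(buf)

print(stock.usage)      # [5, 5, 5, 5, 5, 5, 5, 5]
print(log_scale.usage)  # all 32
```

The log-scale variant takes more sweeps to decay a saturated buffer to zero (32 → 16 → 8 → 4 → 2 → 1 → 0), but once the counters are pinned at the maximum it provides no more discrimination than the stock policy.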
On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think it is. We haven't done anything to address it. I think if we > want to move to direct I/O -- which may be something we need or want > to do -- we're going to have to work a lot harder at making good page > eviction decisions. +1 > Your patch to change the page eviction algorithm > didn't help noticeably once we eliminated the contention around buffer > eviction, but that's just because the cost of a bad eviction went > down, not because we stopped doing bad evictions. The goal of that patch, which was literally written over a wet weekend, was to make people realize that we were missing something there. I think that it accomplished that much. > I think it would be > interesting to insert a usleep() call into mdread() and then test > buffer eviction various algorithms with that in place. That could be a useful strategy. > I'm personally not very excited about making rules like "index pages > are more valuable than heap pages". Such rules will in many cases be > true, but it's easy to come up with cases where they don't hold: for > example, we might run pgbench for a while and then stop running > pgbench and start running big sequential scans for reporting purposes. I am not in favor of this general approach. Actually, I'm willing to come out against it strongly right now: let's not teach the buffer manager to distinguish between the AM of a block. > We don't want to artificially pin the index buffers in shared_buffers > just because they're index pages; we want to figure out which pages > really matter. Now, I realize that you weren't proposing (and > wouldn't propose) a rule that index pages never get evicted, but I > think that favoring index pages even in some milder way is basically a > hack. Index pages aren't *intrinsically* more valuable; they are more > valuable because they will, in many workloads, be accessed more often > than heap pages. 
> A good algorithm ought to be able to figure that out
> based on the access pattern, without being explicitly given a hint, I
> think.

I won't argue against any of that, but I think that it's rather nuanced, and that the nuance probably matters a lot.

First of all, we must have a sense of proportion about what is likely to be true in the average case. I mentioned to Thomas that the pgbench accounts table is 6x the size of its primary key, and that with a uniform distribution + point lookups the leaf pages literally have 6x the utility of the heap pages. It really is that simple there. Also, whatever the distribution of the point lookups is, "cache more leaf pages" is probably going to be the effect that a better caching strategy would have. Even 6x is probably an underestimate of the utility of leaf pages compared to heap pages in the average case. I bet it's more like 10x for a typical OLTP app.

Second of all, we must have a sense of perspective about what we're trying to do here, which is to predict the future based on the past. The best outcome we can hope for is to lose less on average without hurting anything that already works well enough.

> I believe the root of the problem here is that the usage count we have
> today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember anything about evicted buffers. The same block can bounce in and out of shared_buffers repeatedly, and we don't recognize that. There are probably second-order effects from sizing shared_buffers. We needlessly tie the number of buffers to our capacity to remember things about block popularity. That would be bad if we had no FS cache, but with the FS cache it seems awful.

I share the intuition that we need something adaptive, that can be analysed and understood relatively easily, and has little to no magic. That's why I'm willing to say now that we shouldn't do anything based on the type of the blocks involved.
However, I think that that model might work about as well, and that this matters because it provides motivating examples. Surely you're right when you say that index blocks are not intrinsically more useful than any other type of block, but if they're 10x more useful on average, and 20x more useful at other times, then a person could be forgiven for modelling them as intrinsically more useful.

It's also useful to examine why a random replacement strategy is merely mediocre to bad, and not abysmal. That's what every replacement strategy ends up being with enough cache pressure anyway. This makes it more reasonable to focus on the average case than it is in other places.

> There have been previous experiments around making usage_count use
> some kind of a log scale: we make the maximum, say, 32, and the clock
> hand divides by 2 instead of subtracting 1. I don't think those
> experiments were enormously successful and I suspect that a big part
> of the reason is that it's still pretty easy to get to a state where
> the counters are maxed out for a large number of buffers, and at that
> point you can't distinguish between those buffers any more: they all
> look equally hot. We need something better. If a system like this is
> working properly, things like interior index pages and visibility map
> pages ought to show up as fiery hot on workloads where the index or
> visibility map, as the case may be, is heavily used.

I don't think that internal index pages and the visibility map are the problem, since there are so few of them, and they are accessed at least hundreds of times more frequently than the leaf pages.

--
Peter Geoghegan
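Peter's 6x figure can be sanity-checked with rough tuple-size arithmetic. The byte counts below are assumptions for illustration (tuple and page overheads are approximated, fillfactor is ignored), not measured values:

```python
PAGE = 8192                     # PostgreSQL block size in bytes

# Rough per-tuple sizes, assumed for illustration only:
heap_tuple = 121                # pgbench_accounts row: ~23-byte header,
                                # two ints + 84-char filler, padding
index_tuple = 20                # btree entry on aid: ~8-byte header,
                                # 4-byte key, alignment + line pointer

heap_tuples_per_page = PAGE // heap_tuple
index_tuples_per_page = PAGE // index_tuple

# With uniform point lookups, a leaf page covers roughly this many more
# rows than a heap page -- the sense in which it has ~6x the utility:
ratio = index_tuples_per_page / heap_tuples_per_page
print(heap_tuples_per_page, index_tuples_per_page, round(ratio, 1))
```

Under these assumptions a heap page holds about 67 rows while a leaf page covers about 409, giving a ratio close to the 6x cited, before any spatial-locality advantage is counted.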
Greetings, * Peter Geoghegan (pg@bowt.ie) wrote: > On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I'm personally not very excited about making rules like "index pages > > are more valuable than heap pages". Such rules will in many cases be > > true, but it's easy to come up with cases where they don't hold: for > > example, we might run pgbench for a while and then stop running > > pgbench and start running big sequential scans for reporting purposes. > > I am not in favor of this general approach. Actually, I'm willing to > come out against it strongly right now: let's not teach the buffer > manager to distinguish between the AM of a block. Perhaps I'm misunderstanding the case here, but with the ring buffer we use for sequential access, aren't we already discouraging keeping heap pages around, particularly when they're coming from a sequential scan? I'm not suggesting that's a bad thing either, in fact it seems like a generally good thing and I don't think we get many complaints about it, I'm just trying to point out that we already have a distinction between heap-pages-from-seq-scans and index pages (and heap pages from using those indexes) and therefore I'm not convinced by an argument that we shouldn't ever treat pages from the heap differently than pages from the indexes. This could change when considering an environment where we aren't trying to take advantage of the kernel's buffer cache and things like its read-ahead, and maybe that's what you're getting at here, but I would also point out that we have more inherent understanding of how pages are likely to be accessed than the kernel does and we shouldn't be explicitly trying to avoid taking advantage of that knowledge. > > We don't want to artificially pin the index buffers in shared_buffers > > just because they're index pages; we want to figure out which pages > > really matter. 
Now, I realize that you weren't proposing (and > > wouldn't propose) a rule that index pages never get evicted, but I > > think that favoring index pages even in some milder way is basically a > > hack. Index pages aren't *intrinsically* more valuable; they are more > > valuable because they will, in many workloads, be accessed more often > > than heap pages. A good algorithm ought to be able to figure that out > > based on the access pattern, without being explicitly given a hint, I > > think. > > I won't argue against any of that, but I think that it's rather > nuanced, and that the nuance probably matters a lot. > > First of all, we must have a sense of proportion about what is likely > to be true in the average case. I mentioned to Thomas that the pgbench > accounts table is 6x the size its primary key, and that with a uniform > distribution + point lookups the leaf pages literally have 6x the > utility of the heap pages. It really is that simple there. Also, > whatever the distribution of the point lookups is, "cache more leaf > pages" is probably going to be the effect that a better caching > strategy would have. Even 6x is probably an underestimate of the > utility of leaf pages compared to heap pages in the average case. I > bet it's more like 10x for a typical OLTP app. I agree that we should definitely be thinking about data density here- and that's another case where we have more information about the data than the kernel does (which I compare to as it tries to predict value also but doesn't have as much information). Another consideration here to think about would be internal pages rather than just leaf pages- those have an even higher likelihood of being needed again quickly as they need to be traversed much more than leaf pages. > Second of all, we must have a sense of perspective about what we're > trying to do here, which is to predict the future based on the past. 
> The best outcome we can hope for is to lose less on average without > hurting anything that already works well enough. Agreed. > > I believe the root of the problem here is that the usage count we have > > today does a very poor job distinguishing what's hot from what's not. > > I think that the root of the problem is that we don't remember > anything about evicted buffers. The same block can bounce in and out > of shared_buffers repeatedly, and we don't recognize that. There are > probably second-order effects from sizing shared_buffers. We > needlessly tie the number of buffers to our capacity to remember > things about block popularity. That would be bad if we had no FS > cache, but with the FS cache it seems awful. This is definitely an interesting and valuable point. Having pages bouncing in and out of shared buffers means that we aren't properly tracking their value. Just brainstorming, but I wonder if there's a way we could track their value even when we've evicted them, without slowing down the entire system. Not having any great ideas off-hand for that but some kind of "we last accessed this block an hour ago" and "we last accessed this block half a millisecond ago" might be interesting to try to avoid such ping-ponging. I wonder if there are known strategies for addressing this, perhaps by having a data structure which effectively tracks hit rates for all potential blocks... > I share the intuition that we need something adaptive, that can be > analysed and understood relatively easily, and has little to no magic. > That's why I'm willing to say now that we shouldn't do anything based > on the type of the blocks involved. However, I think that that model > might work about as well, and that this matters because it provides > motivating examples. 
Surely you're right when you say that index > blocks are not intrinsically more useful than any of type of block, > but if they're 10x more useful on average, and 20x more useful at > other times, then a person could be forgiven for modelling them as > intrinsically more useful. If we can understand that they're, on average, 10x-20x more useful, then ignoring that seems like we're throwing away information that we effectively know about the future. > It's also useful to examine why a random replacement strategy is > merely mediocre to bad, and not abysmal. That's what every replacement > strategy ends up being with enough cache pressure anyway. This makes > it more reasonable to focus on the average case than it is in other > places. Every replacement strategy which doesn't care about the kind of block that's being worked with will end up performing random replacement under enough pressure. We don't actually have to allow that to happen though since we do have information about the kind of block and therefore it seems like we should be able to keep things mediocre, and avoid letting things get bad under such pressure. > > There have been previous experiments around making usage_count use > > some kind of a log scale: we make the maximum, say, 32, and the clock > > hand divides by 2 instead of subtracting 1. I don't think those > > experiments were enormously successful and I suspect that a big part > > of the reason is that it's still pretty easy to get to a state where > > the counters are maxed out for a large number of buffers, and at that > > point you can't distinguish between those buffers any more: they all > > look equally hot. We need something better. If a system like this is > > working properly, things like interior index pages and visibility map > > pages ought to show up as fiery hot on workloads where the index or > > visibility map, as the case may be, is heavily used. 
> > I don't think that internal index pages and the visibility map are the > problem, since there is so few of them, and they are accessed at least > hundreds of times more frequently than the leaf pages. Yet they still end up getting evicted under a random replacement strategy under sufficient pressure, because we don't distinguish them from any other buffer, and that leads to high latency on queries which should always be extremely fast. As long as we're talking about a system which can devolve into random replacement, we're going to have those cases and we aren't going to have any answer for them. I'm not really thrilled with any approach that allows that kind of degradation. Thanks! Stephen
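Stephen's brainstormed "remember when we last accessed an evicted block" idea can be sketched as a small bounded ghost table kept alongside the cache. All names and policies here are invented for illustration; a real design would have to bound memory and locking costs carefully:

```python
import time
from collections import OrderedDict

class GhostTable:
    """Remember last-access times for recently evicted blocks,
    bounded in size so memory use stays fixed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.last_seen = OrderedDict()   # block id -> eviction timestamp

    def remember(self, block):
        """Record that a block was just evicted."""
        self.last_seen.pop(block, None)
        self.last_seen[block] = time.monotonic()
        if len(self.last_seen) > self.capacity:
            self.last_seen.popitem(last=False)   # forget the oldest ghost

    def bounced_recently(self, block, window=1.0):
        """True if this block was evicted within `window` seconds --
        i.e. it is ping-ponging and could deserve a head start on
        re-entry (e.g. a higher initial usage count)."""
        t = self.last_seen.pop(block, None)
        return t is not None and time.monotonic() - t < window

ghosts = GhostTable(capacity=4)
ghosts.remember(101)
ghosts.remember(102)
print(ghosts.bounced_recently(101))   # True: evicted a moment ago
print(ghosts.bounced_recently(999))   # False: never seen
```

This is essentially the role the ghost lists play in ARC/CAR, except that those algorithms track list membership rather than timestamps.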
On Sun, Apr 29, 2018 at 2:39 AM, Stephen Frost <sfrost@snowman.net> wrote: > * Peter Geoghegan (pg@bowt.ie) wrote: >> On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> > I'm personally not very excited about making rules like "index pages >> > are more valuable than heap pages". Such rules will in many cases be >> > true, but it's easy to come up with cases where they don't hold: for >> > example, we might run pgbench for a while and then stop running >> > pgbench and start running big sequential scans for reporting purposes. >> >> I am not in favor of this general approach. Actually, I'm willing to >> come out against it strongly right now: let's not teach the buffer >> manager to distinguish between the AM of a block. > > Perhaps I'm misunderstanding the case here, but with the ring buffer we > use for sequential access, aren't we already discouraging keeping heap > pages around, particularly when they're coming from a sequential scan? > > I'm not suggesting that's a bad thing either, in fact it seems like a > generally good thing and I don't think we get many complaints about it, > I'm just trying to point out that we already have a distinction between > heap-pages-from-seq-scans and index pages (and heap pages from using > those indexes) and therefore I'm not convinced by an argument that we > shouldn't ever treat pages from the heap differently than pages from the > indexes. I think the idea is that CAR might be even better than our hard coded BAS_BULKREAD, BAS_BULKWRITE, BAS_VACUUM rings because it has a kind of dynamic self-tuning scan resistance. It works by having one clock (= a circular list, not just a rotating hand in an array like our current scheme) T1 that holds buffers that were touched once or twice, to absorb scans. It'll reclaim T1 buffers that were only accessed once (but keep the descriptor for longer -- see below). 
It will move things to T2, a separate clock for valuable pages, after they've been accessed twice in recent memory. It dynamically adjusts the sizes of T1 and T2 in a way that is supposedly beneficial. [...] >> > I believe the root of the problem here is that the usage count we have >> > today does a very poor job distinguishing what's hot from what's not. >> >> I think that the root of the problem is that we don't remember >> anything about evicted buffers. The same block can bounce in and out >> of shared_buffers repeatedly, and we don't recognize that. There are >> probably second-order effects from sizing shared_buffers. We >> needlessly tie the number of buffers to our capacity to remember >> things about block popularity. That would be bad if we had no FS >> cache, but with the FS cache it seems awful. > > This is definitely an interesting and valuable point. Having pages > bouncing in and out of shared buffers means that we aren't properly > tracking their value. Just brainstorming, but I wonder if there's a way > we could track their value even when we've evicted them, without slowing > down the entire system. Not having any great ideas off-hand for that > but some kind of "we last accessed this block an hour ago" and "we last > accessed this block half a millisecond ago" might be interesting to try > to avoid such ping-ponging. I wonder if there are known strategies for > addressing this, perhaps by having a data structure which effectively > tracks hit rates for all potential blocks... That's what the CAR algorithm does. It has twice as many BufferDescriptors as buffers. As well as the T1 and T2 clocks that hold buffers, it also has doubly linked lists B1 and B2 which hold BufferDescriptors with no buffers for things that recently fell out of T1 and T2 to extend the algorithm's memory. 
For example, if there is a "miss" in B1 (meaning we found the page we're looking for in there but it currently has no buffer, and it was recently evicted from T1), then we can immediately find a buffer for it in T2 (it qualifies as having been accessed twice in recent memory, and doesn't have to go through T1 again where it risks early eviction). Though of course it too is finite. -- Thomas Munro http://www.enterprisedb.com
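A greatly simplified sketch of the T1/T2/B1/B2 structure Thomas describes, with plain FIFO lists standing in for the clocks and no adaptive target, just to make the ghost-hit promotion visible. This is not an implementation of the CAR paper:

```python
from collections import deque

class MiniCAR:
    """Simplified T1/T2 cache with B1/B2 ghost lists.  Real CAR uses
    clocks with reference bits and an adaptive target size; here each
    list is a plain FIFO so the ghost-list behaviour is easy to see."""

    def __init__(self, capacity):
        self.c = capacity
        self.t1 = deque()   # resident, seen once recently
        self.t2 = deque()   # resident, seen at least twice
        self.b1 = deque()   # ghosts evicted from t1 (descriptor, no buffer)
        self.b2 = deque()   # ghosts evicted from t2

    def access(self, block):
        if block in self.t1:               # second recent access: promote
            self.t1.remove(block)
            self.t2.append(block)
        elif block in self.t2:
            self.t2.remove(block)
            self.t2.append(block)          # refresh position
        elif block in self.b1:             # ghost hit: evicted from t1 but
            self.b1.remove(block)          # remembered -- go straight to t2,
            self._make_room()              # skipping the risky t1 stage
            self.t2.append(block)
        elif block in self.b2:
            self.b2.remove(block)
            self._make_room()
            self.t2.append(block)
        else:                              # true miss
            self._make_room()
            self.t1.append(block)

    def _make_room(self):
        if len(self.t1) + len(self.t2) < self.c:
            return
        if self.t1:                        # prefer evicting once-seen pages
            self.b1.append(self.t1.popleft())
        else:
            self.b2.append(self.t2.popleft())
        for ghost in (self.b1, self.b2):   # ghost lists are finite too
            while len(ghost) > self.c:
                ghost.popleft()

cache = MiniCAR(capacity=2)
for blk in ("A", "B", "C"):    # a mini scan: A is pushed out to the b1 ghosts
    cache.access(blk)
cache.access("A")              # ghost hit: A re-enters directly into t2
print(list(cache.t2))          # ['A']
```

Note how the scan of A, B, C only ever occupies T1, while the ghost hit on A lands it in T2 — the self-tuning scan resistance Thomas refers to.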
On Sat, Apr 28, 2018 at 7:39 AM, Stephen Frost <sfrost@snowman.net> wrote: > Perhaps I'm misunderstanding the case here, but with the ring buffer we > use for sequential access, aren't we already discouraging keeping heap > pages around, particularly when they're coming from a sequential scan? > > I'm not suggesting that's a bad thing either, in fact it seems like a > generally good thing and I don't think we get many complaints about it, > I'm just trying to point out that we already have a distinction between > heap-pages-from-seq-scans and index pages (and heap pages from using > those indexes) and therefore I'm not convinced by an argument that we > shouldn't ever treat pages from the heap differently than pages from the > indexes. The argument is not a slam dunk, to be sure. Since you bring it up, I happen to think that the ring buffer/strategy scan thing is too aggressive in a lot of cases. Why not cache more when the existing contents of shared_buffers are of marginal importance? Concern about a kind of "tyranny of the average case" seems fairly well founded to me. I am most concerned about improving the average case in particular, but also very concerned about not regressing atypical or bursty cases. > This could change when considering an environment where we aren't trying > to take advantage of the kernel's buffer cache and things like its > read-ahead, and maybe that's what you're getting at here, but I would > also point out that we have more inherent understanding of how pages are > likely to be accessed than the kernel does and we shouldn't be > explicitly trying to avoid taking advantage of that knowledge. I'm definitely in favor of taking advantage of that knowledge. I just think that we can derive the choice of which buffer to evict from high level principles, that are relatively easy to understand in isolation. This may matter later, when production system behavior needs to be analyzed. 
I might agree to the idea of artificially favoring one type of block over another if I thought that there was no better way, but I think that there is. I also think that a generalized approach isn't that much more difficult, particularly when you factor in the need to convince other people.

> I agree that we should definitely be thinking about data density here-
> and that's another case where we have more information about the data
> than the kernel does (which I compare to as it tries to predict value
> also but doesn't have as much information).

Even my 10x figure is, in a way, very conservative. B-Tree index pages have another obvious though potentially overlooked advantage, which is that they naturally have spatial or logical locality. The only kind of locality that heap pages can ever have is temporal locality. I hesitate to say too much more about what must be true on average, across the universe of Postgres installations, but I can safely say this much: it's totally pedestrian and normal for *leaf* pages to have more than 100x the utility of heap pages in some applications, since a leaf page is not just "denser"; it may also have significant spatial or logical locality, where heap pages have none at all.

There could very easily be an enormous range of usefulness among buffers in shared_buffers. And yet, we only recognize the increments that usage_count can represent (this is before we even think about the actual problems with how usage_count is maintained). We're never going to be able to give *radically* more useful leaf pages a concomitant advantage during buffer eviction if we have to use some rule of thumb.

> Another consideration here
> to think about would be internal pages rather than just leaf pages-
> those have an even higher likelihood of being needed again quickly as
> they need to be traversed much more than leaf pages.
As I said, I'm not too worried about that case, because there are so few internal pages -- we're probably talking about a small fraction of 1% of the total. It seems unlikely that there'd be so much cache pressure that the cost of reading in an internal block is much worse than the cost of reading in any other block. Main memory sizes are big enough that we should be able to think about things at the level of minutes (maybe we can't right now, but we *should* be able to). (This probably doesn't matter much when it comes to finding a way forward, so we can agree to disagree here without consequence.) > This is definitely an interesting and valuable point. Having pages > bouncing in and out of shared buffers means that we aren't properly > tracking their value. Just brainstorming, but I wonder if there's a way > we could track their value even when we've evicted them, without slowing > down the entire system. Not having any great ideas off-hand for that > but some kind of "we last accessed this block an hour ago" and "we last > accessed this block half a millisecond ago" might be interesting to try > to avoid such ping-ponging. I wonder if there are known strategies for > addressing this, perhaps by having a data structure which effectively > tracks hit rates for all potential blocks... Well, ARC has its ghost lists (so does CAR, as Thomas pointed out just now). ARC/CAR is adaptive, in that it drifts towards favoring either recency or frequency for the system as a whole. Provided we can make that scalable, it will probably work well for us. The bi-modal nature of pages in Postgres seems to make ARC or CAR a good fit. Also, I suspect that we can afford to use a somewhat less scalable approach, since it doesn't matter how fast buffer eviction is if its just spinning its wheels, and because in recent years we've mitigated some of the bottlenecks. In a way, it's not at all surprising that we can end up with usage_counts of 5 all around when shared_buffers is large. 
Blocks end up "resting on their laurels"; they're impossible to evict based on a large amount of blind luck, such as being involved in a burst of activity some time ago (so much for clocksweep approximating an LRU; this is actually more like an LFU). ARC/CAR moves a "recent" block to the "frequent" list if the block gets referenced a second time after an interval that is, in effect, self-correcting (it may or may not be from the ghost recent list or the real buffer cache recent list -- either way it goes to the head of the frequent list). Moving a block from the recent to the frequent list also enlarges the recent list target size at the expense of the frequent list target size. "This block in particular deserves to be a frequent block, but as a consequence all blocks should collectively be considered slightly less deserving of being frequent blocks." Importantly, this allows blocks to have their extra utility recognized by getting moved to the frequent list, but specifically tracks whether or not the system fails to appreciate recent blocks that deserve to be promoted to frequent blocks, but cannot get a break because they get evicted from the recent list too quickly (including the ghost recent list). Because the recent ghost list and frequent ghost list are fixed in size (only the real lists/head lists get resized), we won't forget too much about recency or frequency. And, we won't go overboard with "writing off frequent list blocks as has-been dead weight", because we notice that we've gone too far, and measure that in terms of would-be cache hits that ended up as actual cache misses. Most important of all, we won't "get stuck", as we see with clocksweep today while still having something that's stable for stable workloads. > Every replacement strategy which doesn't care about the kind of block > that's being worked with will end up performing random replacement under > enough pressure. 
We don't actually have to allow that to happen though > since we do have information about the kind of block and therefore it > seems like we should be able to keep things mediocre, and avoid letting > things get bad under such pressure. If you have extreme cache pressure, I don't think that there is anything left to worry about other than the scalability of evicting buffers. That should be a rare enough occurrence for well maintained systems these days (again, this may only actually be true as an aspiration). Just look at the ARC paper's statistics; they report that they're not much better than the competition (including even a classic LRU) when there is either mild cache pressure or extreme cache pressure. The middle ground is the interesting part. All of that having been said, maybe we have an independent low-level problem: we increase usage_count when we pin a buffer, even if we last pinned the buffer 5ms ago. IIRC a change to this went in when ARC went in (and came out with ARC, too). This is more of a problem with miscounting the number of hits based on accidents around how things are structured in the executor -- we need to be more careful about recognizing whether or not block hits are logically distinct. Particularly with stuff like nested loop joins. -- Peter Geoghegan
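The self-correcting adjustment Peter describes is, in ARC's terms, a single target parameter for the recent list, nudged on every ghost hit. Here is a minimal sketch of just that feedback rule, with all list mechanics omitted and step sizes following the rule in the ARC paper:

```python
class AdaptiveTarget:
    """Sketch of ARC-style feedback on the recent-list target size.
    A hit in the recent ghost list means a page we evicted too early
    came back: grow the recent target.  A hit in the frequent ghost
    list means the opposite: shrink it.  List management is omitted."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0           # target size of the recent list, 0..c

    def ghost_recent_hit(self, b1_len, b2_len):
        # A would-be cache hit that became a miss because we evicted a
        # recent page: favour recency, proportionally to list sizes.
        step = max(1, b2_len // max(b1_len, 1))
        self.p = min(self.c, self.p + step)

    def ghost_frequent_hit(self, b1_len, b2_len):
        # We wrote off a frequent page that came back: favour frequency.
        step = max(1, b1_len // max(b2_len, 1))
        self.p = max(0, self.p - step)

t = AdaptiveTarget(capacity=100)
t.ghost_recent_hit(b1_len=10, b2_len=10)
print(t.p)    # 1: nudged toward recency
t.ghost_frequent_hit(b1_len=10, b2_len=10)
print(t.p)    # 0: nudged back
```

Because missteps in either direction show up as ghost hits, the target cannot "get stuck" the way a saturated usage_count can: the algorithm measures its own would-be hits that ended up as misses and corrects for them.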
On Sun, Apr 29, 2018 at 2:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
On Sat, Apr 28, 2018 at 7:39 AM, Stephen Frost <sfrost@snowman.net> wrote:
> Perhaps I'm misunderstanding the case here, but with the ring buffer we
> use for sequential access, aren't we already discouraging keeping heap
> pages around, particularly when they're coming from a sequential scan?
>
> I'm not suggesting that's a bad thing either, in fact it seems like a
> generally good thing and I don't think we get many complaints about it,
> I'm just trying to point out that we already have a distinction between
> heap-pages-from-seq-scans and index pages (and heap pages from using
> those indexes) and therefore I'm not convinced by an argument that we
> shouldn't ever treat pages from the heap differently than pages from the
> indexes.
The argument is not a slam dunk, to be sure. Since you bring it up, I
happen to think that the ring buffer/strategy scan thing is too
aggressive in a lot of cases. Why not cache more when the existing
contents of shared_buffers are of marginal importance? Concern about a
kind of "tyranny of the average case" seems fairly well founded to me.
I am most concerned about improving the average case in particular,
but also very concerned about not regressing atypical or bursty cases.
> This could change when considering an environment where we aren't trying
> to take advantage of the kernel's buffer cache and things like its
> read-ahead, and maybe that's what you're getting at here, but I would
> also point out that we have more inherent understanding of how pages are
> likely to be accessed than the kernel does and we shouldn't be
> explicitly trying to avoid taking advantage of that knowledge.
I'm definitely in favor of taking advantage of that knowledge. I just
think that we can derive the choice of which buffer to evict from high
level principles, that are relatively easy to understand in isolation.
This may matter later, when production system behavior needs to be
analyzed. I might agree to the idea of artificially favoring one type
of block over another if I thought that there was no better way, but I
think that there is. I also think that a generalized approach isn't
that much more difficult, particularly when you factor in the need to
convince other people.
> I agree that we should definitely be thinking about data density here-
> and that's another case where we have more information about the data
> than the kernel does (which I compare to as it tries to predict value
> also but doesn't have as much information).
Even my 10x figure is, in a way, very conservative. B-Tree index pages
have another obvious though potentially overlooked advantage, which is
that they naturally have spatial or logical locality. The only kind of
locality that heap pages can ever have is temporal locality. I
hesitate to say too much more about what must be true on average,
across the universe of Postgres installations, but I can safely say
this much: it's totally pedestrian and normal for *leaf* pages to have
more than 100x the utility of heap pages in some applications, since a
leaf page is not just "denser"; it may also have significant spatial
or logical locality, where heap pages have none at all.
There could very easily be an enormous range of usefulness among
buffers in shared_buffers. And yet, we only recognize the increments
that usage_count can represent (this is before we even think about the
actual problems with how usage_count is maintained). We're never going
to be able to give *radically* more useful leaf pages a concomitant
advantage during buffer eviction if we have to use some rule of thumb.
> Another consideration here
> to think about would be internal pages rather than just leaf pages-
> those have an even higher likelihood of being needed again quickly as
> they need to be traversed much more than leaf pages.
As I said, I'm not too worried about that case, because there are so
few internal pages -- we're probably talking about a small fraction of
1% of the total. It seems unlikely that there'd be so much cache
pressure that the cost of reading in an internal block is much worse
than the cost of reading in any other block. Main memory sizes are big
enough that we should be able to think about things at the level of
minutes (maybe we can't right now, but we *should* be able to).
(This probably doesn't matter much when it comes to finding a way
forward, so we can agree to disagree here without consequence.)
> This is definitely an interesting and valuable point. Having pages
> bouncing in and out of shared buffers means that we aren't properly
> tracking their value. Just brainstorming, but I wonder if there's a way
> we could track their value even when we've evicted them, without slowing
> down the entire system. Not having any great ideas off-hand for that
> but some kind of "we last accessed this block an hour ago" and "we last
> accessed this block half a millisecond ago" might be interesting to try
> to avoid such ping-ponging. I wonder if there are known strategies for
> addressing this, perhaps by having a data structure which effectively
> tracks hit rates for all potential blocks...
Well, ARC has its ghost lists (so does CAR, as Thomas pointed out just
now). ARC/CAR is adaptive, in that it drifts towards favoring either
recency or frequency for the system as a whole. Provided we can make
that scalable, it will probably work well for us. The bi-modal nature
of pages in Postgres seems to make ARC or CAR a good fit. Also, I
suspect that we can afford to use a somewhat less scalable approach,
since it doesn't matter how fast buffer eviction is if it's just
spinning its wheels, and because in recent years we've mitigated some
of the bottlenecks.
In a way, it's not at all surprising that we can end up with
usage_counts of 5 all around when shared_buffers is large. Blocks end
up "resting on their laurels"; they're impossible to evict based on a
large amount of blind luck, such as being involved in a burst of
activity some time ago (so much for clocksweep approximating an LRU;
this is actually more like an LFU). ARC/CAR moves a "recent" block to
the "frequent" list if the block gets referenced a second time after
an interval that is, in effect, self-correcting (it may or may not be
from the ghost recent list or the real buffer cache recent list --
either way it goes to the head of the frequent list). Moving a block
from the recent to the frequent list also enlarges the recent list
target size at the expense of the frequent list target size. "This
block in particular deserves to be a frequent block, but as a
consequence all blocks should collectively be considered slightly less
deserving of being frequent blocks."
Importantly, this allows blocks to have their extra utility recognized
by getting moved to the frequent list, but specifically tracks whether
or not the system fails to appreciate recent blocks that deserve to be
promoted to frequent blocks, but cannot get a break because they get
evicted from the recent list too quickly (including the ghost recent
list). Because the recent ghost list and frequent ghost list are fixed
in size (only the real lists/head lists get resized), we won't forget
too much about recency or frequency. And, we won't go overboard with
"writing off frequent list blocks as has-been dead weight", because we
notice that we've gone too far, and measure that in terms of would-be
cache hits that ended up as actual cache misses. Most important of
all, we won't "get stuck", as we see with clocksweep today while still
having something that's stable for stable workloads.
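As a rough sketch (an illustration only, not code from the thread; it simplifies the ARC paper's full case analysis, and CAR would replace the LRU lists with clocks), the self-correcting recency/frequency balance described above might look like:

```python
class ToyARC:
    """Toy ARC-style cache: t1/t2 hold cached keys (recent/frequent,
    LRU at index 0); b1/b2 are "ghost" lists that remember only the
    keys of recently evicted blocks, not their contents."""

    def __init__(self, c):
        self.c = c
        self.p = 0                        # target size for the recency side
        self.t1, self.t2 = [], []
        self.b1, self.b2 = [], []

    def _replace(self, from_b2_hit):
        # Evict from t1 while it exceeds its target p, else from t2;
        # the evicted key is remembered in the matching ghost list.
        if self.t1 and (len(self.t1) > self.p or
                        (from_b2_hit and len(self.t1) == self.p)):
            self.b1.append(self.t1.pop(0))
        elif self.t2:
            self.b2.append(self.t2.pop(0))
        del self.b1[:-self.c], self.b2[:-self.c]   # cap ghost history

    def access(self, key):
        """Returns True on a real cache hit."""
        if key in self.t1:                # second reference: promote
            self.t1.remove(key); self.t2.append(key); return True
        if key in self.t2:                # frequent hit: refresh position
            self.t2.remove(key); self.t2.append(key); return True
        if key in self.b1:                # would-be hit lost from the
            self.p = min(self.c, self.p + 1)   # recent side: favor recency
            self.b1.remove(key)
            if len(self.t1) + len(self.t2) >= self.c:
                self._replace(False)
            self.t2.append(key); return False
        if key in self.b2:                # would-be hit lost from the
            self.p = max(0, self.p - 1)   # frequent side: favor frequency
            self.b2.remove(key)
            if len(self.t1) + len(self.t2) >= self.c:
                self._replace(True)
            self.t2.append(key); return False
        if len(self.t1) + len(self.t2) >= self.c:  # cold miss, cache full
            self._replace(False)
        self.t1.append(key); return False
```

The point to notice is that `p` moves only on ghost hits, i.e. in terms of would-be cache hits that ended up as actual cache misses, which is exactly the feedback signal described above.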
> Every replacement strategy which doesn't care about the kind of block
> that's being worked with will end up performing random replacement under
> enough pressure. We don't actually have to allow that to happen though
> since we do have information about the kind of block and therefore it
> seems like we should be able to keep things mediocre, and avoid letting
> things get bad under such pressure.
If you have extreme cache pressure, I don't think that there is
anything left to worry about other than the scalability of evicting
buffers. That should be a rare enough occurrence for well maintained
systems these days (again, this may only actually be true as an
aspiration). Just look at the ARC paper's statistics; they report that
they're not much better than the competition (including even a classic
LRU) when there is either mild cache pressure or extreme cache
pressure. The middle ground is the interesting part.
All of that having been said, maybe we have an independent low-level
problem: we increase usage_count when we pin a buffer, even if we last
pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
in (and came out with ARC, too). This is more of a problem with
miscounting the number of hits based on accidents around how things
are structured in the executor -- we need to be more careful about
recognizing whether or not block hits are logically distinct.
Particularly with stuff like nested loop joins.
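One hedged sketch of what "logically distinct hits" might mean in practice (the operation id and the `pin` helper below are invented for illustration; PostgreSQL's real PinBuffer() works differently):

```python
MAX_USAGE = 5   # same cap as the current clocksweep

class Buffer:
    def __init__(self):
        self.usage_count = 0
        self.last_op = None   # id of the last logical operation that bumped us

def pin(buf, op_id):
    # Count a hit at most once per logical operation (say, one executor
    # node's scan), so a nested loop pinning the same inner page
    # thousands of times registers as a single logical hit.
    if buf.last_op != op_id:
        buf.usage_count = min(MAX_USAGE, buf.usage_count + 1)
        buf.last_op = op_id
```

With something like this, a nested loop join that pins the same inner index page once per outer row would contribute one increment per scan, not one per pin.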
I've developed an algorithm as an improvement of GClock:
- the usage count is incremented not on every access, but only on the first access within some span of clock-arm movement:
-- a new item is inserted with count 0, and the count is not incremented until the clock arm has moved at least 1/32 of a full round;
-- once the arm has moved that far, the usage count is incremented on the next access and the arm position is remembered; during the next 1/32 of a round the usage count is not incremented again;
- the usage count is incremented not by 1 but by 8, which simulates movement to the "frequent" list of ARC/CAR/2Q;
-- the usage count is capped at 32, so there are effectively four simulated frequency queues;
- to find an item to evict, items are scanned in batches of 25:
-- if an item's usage count is zero, it is evicted and the scan stops;
-- otherwise the usage count is decremented, as in GClock;
-- the item with the least usage count seen so far is remembered;
-- if the batch ends without finding a zero count, the item with the least usage count seen is evicted.
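The steps above can be sketched as follows (a simulation-style sketch only; buffer insertion and the buffer mapping table are omitted, and the constants are the ones proposed: bump by 8, cap of 32, batch of 25, gate of 1/32 of a round):

```python
class GClockish:
    BUMP, CAP, BATCH = 8, 32, 25

    def __init__(self, nbuffers):
        self.n = nbuffers
        self.count = [0] * nbuffers       # usage counts; new items start at 0
        self.bump_pos = [0] * nbuffers    # arm position at the last bump
        self.arm = 0                      # total clock-arm movement so far
        self.gate = max(1, nbuffers // 32)  # 1/32 of a full round

    def on_access(self, buf):
        # Bump only if the arm has moved at least 1/32 of a round since
        # the last bump, so a burst of accesses counts as one logical hit.
        if self.arm - self.bump_pos[buf] >= self.gate:
            self.count[buf] = min(self.CAP, self.count[buf] + self.BUMP)
            self.bump_pos[buf] = self.arm

    def evict(self):
        victim, least = None, self.CAP + 1
        for _ in range(self.BATCH):
            buf = self.arm % self.n
            self.arm += 1
            if self.count[buf] == 0:
                return buf                # cold buffer found: stop early
            if self.count[buf] < least:
                victim, least = buf, self.count[buf]
            self.count[buf] -= 1          # classic GClock decrement
        return victim                     # else evict the least-used seen
```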
I've tested this algorithm with some artificial traces. It is much better than LRU, but usually worse than ARC and CLOCK+ (btw, why does no one mention CLOCK+ in this thread?).
I didn't implement it in PostgreSQL yet (I did implement batch scanning once, but the other algorithm details had not been developed at that point).
And I don't know whether it will really be better for PostgreSQL.
And I believe CLOCK+ would perform better if implemented correctly, though modified a bit to use an array instead of a list (CLOCK+ originally uses a list with the ability to insert and remove at any position).
So you see: there are a lot of algorithms, and new ones could be suggested as we go. It is not easy to choose one to concentrate our attention on.
That is why I propose to collect some real-world traces, and test the algorithms against them before rushing to implementation.
With regards,
Sokolov Yura.
On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote: > > > > On Apr 24, 2018, at 23:14, Andres Freund <andres@anarazel.de> wrote: > > > > On 2018-04-24 17:16:47 +0500, Andrey Borodin wrote: > >> But, I think that cost of development of real page eviction strategy > >> itself is negligibly small compared to infrastructure changes needed > >> by any non-CS5 strategy. > > > > What problems are you seeing? This isn't a lot of code? > 1. Teaching BgWriter to use data from eviction strategy to aggressively flush data to disk (instead of ++next_to_clean) > 2. Implementing strategies as lock-free algorithms for freelist > These parts seem most important for benchmarking. > Also: > 3. Converting all rings to single buffer manager where possible > 4. Using O_DIRECT while writing data files > 5. Using aio and scheduling of writes > These parts are not necessary, but most challenging, while not impossible though. These largely seem to be increasing the scope beyond reason... Greetings, Andres Freund
> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote: > > On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote: >> >> 1. Teaching BgWriter to use data from eviction strategy to aggressively flush data to disk (instead of ++next_to_clean) >> 2. Implementing strategies as lock-free algorithms for freelist >> These parts seem most important for benchmarking. >> Also: >> 3. Converting all rings to single buffer manager where possible >> 4. Using O_DIRECT while writing data files >> 5. Using aio and scheduling of writes >> These parts are not necessary, but most challenging, while not impossible though. > > These largely seem to be increasing the scope beyond reason... I suspect that the performance benefits may not be that big, or even marginal, if we do not extend a comprehensive eviction strategy with bgwriter fixes and O_DIRECT. But, on the other hand, this suspicion is not based on any real fact. And if the eviction strategy is actually good for anything, it will show performance benefits on its own. > On Apr 30, 2018, at 0:39, Yura Sokolov <funny.falcon@gmail.com> wrote: > > (btw, why does no one mention CLOCK+ in this thread?) I do not know what CLOCK+ is. Can you, please, send me a reference. Best regards, Andrey Borodin.
Hi, On 2018-04-30 15:39:08 +0500, Andrey Borodin wrote: > I suspect that performance benefits can be not that big or even > marginal, if we do not extend comprehensive eviction strategy with > bgwriter fixes and O_DIRECT. If so, then the improvements aren't real. Bgwriter doesn't matter for read-only workloads. O_DIRECT doesn't matter much if shared_buffers are 60+% of OS memory. And even disregarding that, you can just compute cache hit ratios to see whether things are improving. Greetings, Andres Freund
On Mon, Apr 30, 2018 at 10:39 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote: >> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote: >> On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote: >> On Apr 30, 2018, at 0:39, Yura Sokolov <funny.falcon@gmail.com> wrote: >> >> (btw, why does no one mention CLOCK+ in this thread?) > > I do not know what CLOCK+ is. Can you, please, send me a reference. Maybe he means CLOCK-Pro? -- Thomas Munro http://www.enterprisedb.com
On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > ... I > suppose when you read a page in, you could tell the kernel that you > POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could > tell the kernel that you POSIX_FADV_WILLNEED its former contents (in > advance somehow), on the theory that the coldest stuff in the PG cache > should now become the hottest stuff in the OS cache. Of course that > sucks, because the best the kernel can do then is go and read it from > disk, and the goal is to avoid IO. Given a hypothetical way to > "write" "clean" data to the kernel (so it wouldn't mark it dirty and > generate IO, but it would let you read it back without generating IO > if you're lucky), then perhaps you could actually achieve exclusive > caching at the two levels, and use all your physical RAM without > duplication. Craig said essentially the same thing, on the nearby fsync() reliability thread: On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > ... I'd kind of hoped to go in > the other direction if anything, with some kind of pseudo-write op > that let us swap a dirty shared_buffers entry from our shared_buffers > into the OS dirty buffer cache (on Linux at least) and let it handle > writeback, so we reduce double-buffering. Ha! So much for that! I would like to reply to that on this thread which discusses double buffering and performance, to avoid distracting the fsync() thread from its main topic of reliability. I think that idea has potential. Even though I believe that direct IO is generally the right way to go (that's been RDBMS orthodoxy for a decade or more AFAIK), we'll always want to support buffered IO (as other RDBMSs do). For one thing, not every filesystem supports direct IO, including ZFS.
I love ZFS, and its caching is not simply a dumb extension to shared_buffers that you have to go through syscalls to reach: it has state of the art page reclamation, cached data can be LZ4 compressed and there is an optional second level cache which can live on fast storage. Perhaps if you patched PostgreSQL to tell the OS that you won't need pages you've just read, and that you will need pages you've just evicted, you might be able to straighten out some of that U shape by getting more exclusive caching at the two levels. Queued writes would still be double-buffered of course, at least until they complete. Telling the OS to prefetch something that you already have a copy of is annoying and expensive, though. The pie-in-the-sky version of this idea would let you "swap" pages with the kernel, as you put it. Though I was thinking of clean pages, not dirty ones. Then there'd be a non-overlapping set of pages from your select-only pgbench in each cache. Maybe that would look like punread(fd, buf, size, offset) (!), or maybe write(fd, buf, size) followed by fadvise(fd, offset, size, FADV_I_PERSONALLY_GUARANTEE_THIS_DATA_IS_CLEAN_AND_I_CONSIDERED_CONCURRENCY_VERY_CAREFULLY), or maybe pswap(read params... , unread params ...) to read new buffer and unread old buffer at the same time.</crackpot-vapourware-OS> Sadly, even if the simple non-pie-in-the-sky version of the above were to work out and be beneficial on your favourite non-COW filesystem (on which you might as well use direct IO and larger shared_buffers, some day), it may currently be futile on ZFS because I think the fadvise machinery might not even be hooked up (Solaris didn't believe in fadvise on any filesystem IIRC). Not sure, I hope I'm wrong about that. -- Thomas Munro http://www.enterprisedb.com
On 2018-05-01 12:15:21 +1200, Thomas Munro wrote: > On Thu, Apr 26, 2018 at 1:31 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > ... I > > suppose when you read a page in, you could tell the kernel that you > > POSIX_FADV_DONTNEED it, and when you steal a clean PG buffer you could > > tell the kernel that you POSIX_FADV_WILLNEED its former contents (in > > advance somehow), on the theory that the coldest stuff in the PG cache > > should now become the hottest stuff in the OS cache. Of course that > > sucks, because the best the kernel can do then is go and read it from > > disk, and the goal is to avoid IO. Given a hypothetical way to > > "write" "clean" data to the kernel (so it wouldn't mark it dirty and > > generate IO, but it would let you read it back without generating IO > > if you're lucky), then perhaps you could actually achieve exclusive > > caching at the two levels, and use all your physical RAM without > > duplication. > > Craig said essentially the same thing, on the nearby fsync() reliability thread: > > On Sun, Apr 29, 2018 at 1:50 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > > ... I'd kind of hoped to go in > > the other direction if anything, with some kind of pseudo-write op > > that let us swap a dirty shared_buffers entry from our shared_buffers > > into the OS dirty buffer cache (on Linux at least) and let it handle > > writeback, so we reduce double-buffering. Ha! So much for that! > > I would like to reply to that on this thread which discusses double > buffering and performance, to avoid distracting the fsync() thread > from its main topic of reliability. It's not going to happen. Robert and I talked to the kernel devs a couple years back, and I've brought it up again. The kernel has absolutely no chance to verify the content of that written data, meaning that suddenly you'd get differing data based on cache pressure. It seems unsurprising that kernel devs aren't wild about that idea. 
The likelihood of that opening up weird exploits (imagine a suid binary reading such data later!), seems also pretty high. Greetings, Andres Freund
On Tue, May 1, 2018 at 0:02, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Mon, Apr 30, 2018 at 10:39 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>> On Apr 30, 2018, at 0:48, Andres Freund <andres@anarazel.de> wrote:
>> On 2018-04-25 11:31:12 +0500, Andrey Borodin wrote:
>> On Apr 30, 2018, at 0:39, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> (btw, why does no one mention CLOCK+ in this thread?)
>
> I do not know what CLOCK+ is. Can you, please, send me a reference.
Maybe he means CLOCK-Pro?
Oh, thank you, Thomas.
Yes, I was telling about CLOCK-Pro.
Hi! > On Apr 30, 2018, at 23:15, Andres Freund <andres@anarazel.de> wrote: > On 2018-04-30 15:39:08 +0500, Andrey Borodin wrote: >> I suspect that performance benefits can be not that big or even >> marginal, if we do not extend comprehensive eviction strategy with >> bgwriter fixes and O_DIRECT. > > If so, then the improvements aren't real. Bgwriter doesn't matter for > read-only workloads. O_DIRECT doesn't matter much if shared_buffers are > 60+% of OS memory. And even disregarding that, you can just compute > cache hit ratios to see whether things are improving. Even considering simply changing the eviction strategy - it is not just about hit ratio. It is also about eviction complexity less than O(N). But I think you are right. If we compare the performance effect of half-measures in the real system, it is probably more accurate than comparing isolated algorithms in a sandbox. Best regards, Andrey Borodin.
On Sat, Apr 28, 2018 at 7:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: > There could very easily be an enormous range of usefulness among > buffers in shared_buffers. And yet, we only recognize the increments > that usage_count can represent (this is before we even think about the > actual problems with how usage_count is maintained). We're never going > to be able to give *radically* more useful leaf pages a concomitant > advantage during buffer eviction if we have to use some rule of thumb. Right. I think the real problem here is that the usage_count is a terrible predictor of how useful a page really is... > In a way, it's not at all surprising that we can end up with > usage_counts of 5 all around when shared_buffers is large. Blocks end > up "resting on their laurels"; they're impossible to evict based on a > large amount of blind luck, such as being involved in a burst of > activity some time ago (so much for clocksweep approximating an LRU; > this is actually more like an LFU). ARC/CAR moves a "recent" block to > the "frequent" list if the block gets referenced a second time after > an interval that is, in effect, self-correcting (it may or may not be > from the ghost recent list or the real buffer cache recent list -- > either way it goes to the head of the frequent list). Moving a block > from the recent to the frequent list also enlarges the recent list > target size at the expense of the frequent list target size. "This > block in particular deserves to be a frequent block, but as a > consequence all blocks should collectively be considered slightly less > deserving of being frequent blocks." ...and this is exactly the kind of system that can fix that problem. In the current system, bumping the usage count makes the buffer whose count just got bumped look more useful without making any other buffer look less useful. Obviously, that allows for situations where the usefulness of every buffer converges to positive infinity or, as we like to call it, 5. 
Real life isn't like that, though. Not all children are above average, and not all buffers can be more useful than average. That's not to say that the total of all usage counts must remain equal to a constant or something dumb like that -- there's decisions to be made in terms of how you implement things. For example, you can imagine a system where every time we would bump the usage count of a buffer that already has a usage count of 5, we would instead advance the clock hand by 1 buffer. In such a system the total of the usage counts can fluctuate, but you can't peg everything at 5 because there's back-pressure built into the algorithm. CAR accomplishes a similar goal in a different way: "frequently" used pages are never evicted, but as more and more pages get moved into the frequently-used set, the system becomes more and more willing to shrink the set size. That's a really appealing property. If there's a "hot" set of buffers that accounts for nearly all accesses, the algorithm will be willing to regard that all buffers in that set as frequently-used regardless of whether it is a large or a small percentage of shared_buffers, and they will never get evicted. But the buffers must really be hot enough to justify the space they're taking up. When there are only a few hot buffers, it's sufficient for them to be a little hotter than the typical buffer. But when there are a lot of hot buffers, they have to be *really* hot. Intuitively, that makes sense, because making the set of "hot" -- and thus unevictable -- buffers large means that other buffers are going to get evicted very quickly. That's only justifiable if the hot buffers are very significantly more useful than the non-hot buffers. > If you have extreme cache pressure, I don't think that there is > anything left to worry about other than the scalability of evicting > buffers. 
That should be a rare enough occurrence for well maintained > systems these days (again, this may only actually be true as an > aspiration). Just look at the ARC paper's statistics; they report that > they're not much better than the competition (including even a classic > LRU) when there is either mild cache pressure or extreme cache > pressure. The middle ground is the interesting part. +1. > All of that having been said, maybe we have an independent low-level > problem: we increase usage_count when we pin a buffer, even if we last > pinned the buffer 5ms ago. IIRC a change to this went in when ARC went > in (and came out with ARC, too). This is more of a problem with > miscounting the number of hits based on accidents around how things > are structured in the executor -- we need to be more careful about > recognizing whether or not block hits are logically distinct. > Particularly with stuff like nested loop joins. I agree that double-counting correlated accesses is a problem, and I agree that we probably want to do something about it. I am skeptical that using wall-clock time is the right way to solve that problem because it leads to edge effects -- if for example there is a 5ms timeout for bumping the usage count again, then you can probably manufacture a workload where the buffer eviction pattern is dramatically different for a nested loop that hits a given page every 4.5ms than it is for a nested loop that hits a given page every 5.5ms. Note that CART aims to solve this problem in a way that doesn't involve any fixed amount of wall-clock time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
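Robert's thought experiment (bumping an already saturated buffer advances the clock hand instead) could be sketched like this, assuming the advancing hand decrements the buffer it passes, as in the normal sweep; the function and its shape are illustrative, not a proposed patch:

```python
MAX_USAGE = 5   # the current clocksweep cap

def bump_with_backpressure(usage, clock_hand, buf):
    """Bump buf's usage count; if it is already saturated at MAX_USAGE,
    advance the clock hand by one buffer instead, decrementing the
    buffer it passes. Returns the (possibly advanced) hand position."""
    if usage[buf] < MAX_USAGE:
        usage[buf] += 1
    else:
        if usage[clock_hand] > 0:
            usage[clock_hand] -= 1   # back-pressure on some other buffer
        clock_hand = (clock_hand + 1) % len(usage)
    return clock_hand
```

The effect is that the total of the usage counts can fluctuate, but every bump of a saturated buffer cools some other buffer down, so counts cannot all converge to the cap.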
On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: > That's not to say that the total of all usage counts must remain equal > to a constant or something dumb like that -- there's decisions to be > made in terms of how you implement things. For example, you can > imagine a system where every time we would bump the usage count of a > buffer that already has a usage count of 5, we would instead advance > the clock hand by 1 buffer. In such a system the total of the usage > counts can fluctuate, but you can't peg everything at 5 because > there's back-pressure built into the algorithm. This seems to be an old idea. See "Data Cache Management Using Frequency-Based Replacement", a 1990 paper that is referenced by the ARC paper. It says: "When a block is first brought into the cache, its reference count is initialized to one. Previous algorithms using reference counts have incremented the count for a given block on every reference to that block. This results in the following kind of problem: certain blocks are relatively infrequently referenced overall, and yet when they are referenced, due to locality there are short intervals of repeated re-references, thus building up high reference counts. After such an interval is over, the high reference count is misleading: it is due to locality, and cannot be used to estimate the probability that such a block will be re-referenced following the end of this interval (this description is intuitive, since in practice such intervals are not as well-defined as the preceding discussion may suggest)." It goes on to propose a solution, that I gather isn't so far from your thought experiment design for fixing the usage_count saturation problem: "More precisely, we dynamically maintain the sum of all reference counts. Whenever the average reference count exceeds a predetermined maximum value, A-max (another parameter of the algorithm), every reference count C is reduced to C/2. 
Thus, in the steady state the sum of all reference counts stays between N x A-max/2 and N x A-max (where N is the number of cache blocks), and the reference count of a block is approximately linearly related to the frequency of references to the block while it is in the middle and old sections. Note that in this reduction of reference counts, a count of one remains at one, a count of two is reduced to one, a count of three is reduced to two, etc." This 1990 paper seems to only try to solve the correlated references problem that we see when using CLOCK to approximate an LRU, which is not to be confused with actually weighing frequency in a clever way, which AFAICT is a general idea that didn't come along until the LRU-K paper came out a couple of years later. In other words, the 1990 paper only cares about making sure that CLOCK actually manages to approximate an LRU when reference counts are messed up by correlated references. That's all. This really is a distinct problem to the problem that ARC and CAR aim to solve (2Q and LRU-K must be in the same category, since IIRC they have ghost lists). CART tries to solve both problems at the same time, which suggests that they're independent problems. In summary: I see two clearly distinct problems here, though they are superficially similar. We have both problems. > That's a really appealing property. If there's a "hot" set of buffers > that accounts for nearly all accesses, the algorithm will be willing > to regard that all buffers in that set as frequently-used regardless > of whether it is a large or a small percentage of shared_buffers, and > they will never get evicted. But the buffers must really be hot > enough to justify the space they're taking up. When there are only a > few hot buffers, it's sufficient for them to be a little hotter than > the typical buffer. But when there are a lot of hot buffers, they > have to be *really* hot. 
Intuitively, that makes sense, because > making the set of "hot" -- and thus unevictable -- buffers large means > that other buffers are going to get evicted very quickly. That's only > justifiable if the hot buffers are very significantly more useful than > the non-hot buffers. Right. How hot "really hot" actually is needs to be based on the behavior of the system as a whole. If it's impossible to better characterize the behavior of the system as a whole, then no improvement is possible; shared_buffers services the system as a whole. I've always had a problem with the 8GB/16GB upper limit on the size of shared_buffers. Not because it's wrong, but because it's not something that has ever been explained. I strongly suspect that it has something to do with usage_count saturation, since it isn't reproducible with any synthetic workload that I'm aware of. Quite possibly because there are few bursty benchmarks. >> All of that having been said, maybe we have an independent low-level >> problem: we increase usage_count when we pin a buffer, even if we last >> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went >> in (and came out with ARC, too). This is more of a problem with >> miscounting the number of hits based on accidents around how things >> are structured in the executor -- we need to be more careful about >> recognizing whether or not block hits are logically distinct. >> Particularly with stuff like nested loop joins. > > I agree that double-counting correlated accesses is a a problem, and I > agree that we probably want to do something about it. I am skeptical > that using wall-clock time is the right way to solve that problem > because it leads to edge effects I agree that wall-clock time is a bad approach, actually. If we were to use wall-clock time, we'd only do so because it can be used to discriminate against a thing that we actually care about in an approximate, indirect way. 
If there is a more direct way of identifying correlated accesses, which I gather that there is, then we should probably use it. -- Peter Geoghegan
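For concreteness, the reference-count halving rule quoted from the 1990 paper can be sketched as follows (A-max is the paper's parameter, but the value chosen here is arbitrary; the paper also maintains the running sum incrementally rather than recomputing it as this sketch does):

```python
A_MAX = 4   # maximum allowed average reference count (assumed value)

def reference(counts, block):
    counts[block] += 1
    # When the average reference count exceeds A-max, halve every count,
    # rounding up: 1 stays 1, 2 -> 1, 3 -> 2, and so on.
    if sum(counts) > A_MAX * len(counts):
        for i, c in enumerate(counts):
            counts[i] = (c + 1) // 2
```

The halving step is what keeps counts "approximately linearly related to the frequency of references": a burst of correlated hits decays rather than granting permanent immunity from eviction.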
On 02.05.2018 01:37, Peter Geoghegan wrote: > On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> I agree that double-counting correlated accesses is a problem, and I >> agree that we probably want to do something about it. I am skeptical >> that using wall-clock time is the right way to solve that problem >> because it leads to edge effects > > I agree that wall-clock time is a bad approach, actually. If we were > to use wall-clock time, we'd only do so because it can be used to > discriminate against a thing that we actually care about in an > approximate, indirect way. If there is a more direct way of > identifying correlated accesses, which I gather that there is, then we > should probably use it. I suggest incrementing the usage count at most once per shared_buffers/32 replacements. I.e. if shared_buffers holds 1024 pages, and this page was brought in as replacement number 21200, then its usage count would next be incremented only after page 21232 has been brought into shared buffers. And a second time only after page 21264. regards, Yura.
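Yura's suggestion might look roughly like this (names invented for illustration; a real patch would live in the buffer headers and freelist code):

```python
class ReplacementGatedCounts:
    """Usage counts that can rise again only after shared_buffers/32
    further replacements have happened globally, so a tight burst of
    correlated accesses to one page counts only once."""

    def __init__(self, nbuffers):
        self.gate = max(1, nbuffers // 32)  # shared_buffers/32 replacements
        self.replacements = 0               # global replacement counter
        self.usage = {}                     # buf -> (count, counter at last bump)

    def on_insert(self, buf):
        # A newly cached page starts at 0 and cannot be bumped until
        # nbuffers/32 further pages have been brought in.
        self.usage[buf] = (0, self.replacements)

    def on_replacement(self):
        self.replacements += 1              # called on every buffer eviction

    def on_access(self, buf):
        count, last = self.usage[buf]
        if self.replacements - last >= self.gate:
            self.usage[buf] = (count + 1, self.replacements)
```

Unlike a wall-clock timeout, the gate here is measured in replacement activity, so it scales with cache pressure instead of creating the 4.5ms-vs-5.5ms edge effects Robert describes.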
On Tue, May 1, 2018 at 6:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> This seems to be an old idea.

I'm not too surprised ... this area has been well-studied.

> I've always had a problem with the 8GB/16GB upper limit on the size of
> shared_buffers. Not because it's wrong, but because it's not something
> that has ever been explained. I strongly suspect that it has something
> to do with usage_count saturation, since it isn't reproducible with
> any synthetic workload that I'm aware of. Quite possibly because there
> are few bursty benchmarks.

I've seen customers have very good luck going higher if it lets all the data fit in shared_buffers, or at least all the data that is accessed with any frequency. I think it's useful to imagine a series of concentric working sets - maybe you have 1GB of the hottest data, 3GB of data that is at least fairly hot, 10GB of data that is at least somewhat hot, and another 200GB of basically cold data. Increasing shared_buffers in a way that doesn't let the next "ring" fit in shared_buffers isn't likely to help very much. If you have 8GB of shared_buffers on this workload, going to 12GB is probably going to help -- that should be enough for the 10GB of somewhat-hot stuff and a little extra, so that the somewhat-hot stuff doesn't immediately start getting evicted if some of the cold data is accessed. Similarly, going from 2GB to 4GB should be a big help, because now the fairly-hot stuff should stay in cache. But going from 4GB to 6GB or from 12GB to 16GB may not do very much. It may even hurt, because the duplication between shared_buffers and the OS page cache means an overall reduction in available cache space. If, for example, you've got 16GB of memory and shared_buffers=2GB, you *may* be fitting all of the somewhat-hot data into cache someplace; bumping shared_buffers to 4GB almost certainly means that will no longer happen, causing performance to tank.
I don't really think that the 8GB rule of thumb is something that originates in any technical limitation of PostgreSQL or Linux. First of all, it's just a rule of thumb -- the best value in a given installation can easily be something completely different. Second, to the extent that it's a useful rule of thumb, I think it's really a guess about what people's working set looks like: that going from 4GB to 8GB, say, significantly increases the chances of fitting the next-larger, next-cooler working set entirely in shared_buffers, while going from 8GB to 16GB is less likely to accomplish this, and going from 16GB to 32GB probably won't. To a lesser extent, it's reflective of the point where scanning shared buffers to process relation drops gets painful, and the point where an immediate checkpoint suddenly dumping that much data out to the OS all at once starts to overwhelm the I/O subsystem for a significant period of time. But I think those really are lesser effects. My guess is that the big effect is balancing increased hit ratio vs. increased double buffering.

> I agree that wall-clock time is a bad approach, actually. If we were
> to use wall-clock time, we'd only do so because it can be used to
> discriminate against a thing that we actually care about in an
> approximate, indirect way. If there is a more direct way of
> identifying correlated accesses, which I gather that there is, then we
> should probably use it.

For a start, I think it would be cool if somebody just gathered traces for some simple cases. For example, consider a pgbench transaction: it would help if somebody produced a trace showing the buffer lookups in order, annotated as heap, index leaf, index root, VM page, FSM root page, or whatever. Examining some other simple, common cases would probably help us understand whether it's normal to bump the usage count more than once per buffer for a single scan, and if so, exactly why that happens.
If the code knows that it's accessing the same buffer a second (or subsequent) time, it could pass down a flag saying not to bump the usage count again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> If somebody produced a trace showing the buffer lookups in order
To get things moving, I've created a DTrace script that captures buffer reads:
Is it something that can be used to capture live traces?
Sample output can be seen here: https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
For instance (time units are nano-seconds):
Timestamp     Elapsed ForkNum BlockNum TableSp DataBase Relation Cached
3271837060212 1563    0       1        1663    16385    1259     1
3271838881374 88205   0       0        1663    16385    16604    0
3271840973321 4368    0       0        1663    16385    16604    1
If DTrace is acceptable, trace format might be adjusted for easier consumption by https://github.com/ben-manes/caffeine/wiki/Simulator
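For anyone wanting to post-process such traces, a minimal parser for the sample format above might look like this (field names are inferred from the sample header and are assumptions, not a documented format):

```python
# Hypothetical parser for the pgsqlio trace format shown above:
# whitespace-separated fields in the order
# Timestamp Elapsed ForkNum BlockNum TableSp DataBase Relation Cached.
from typing import NamedTuple

class BufferRead(NamedTuple):
    timestamp_ns: int
    elapsed_ns: int
    fork_num: int
    block_num: int
    tablespace: int
    database: int
    relation: int
    cached: bool

def parse_trace_line(line: str) -> BufferRead:
    ts, el, fork, block, spc, db, rel, cached = line.split()
    return BufferRead(int(ts), int(el), int(fork), int(block),
                      int(spc), int(db), int(rel), cached == "1")

sample = "3271838881374 88205 0 0 1663 16385 16604 0"
rec = parse_trace_line(sample)
print(rec.relation, rec.cached)  # 16604 False
```

Records in this shape could then be fed into a replacement-policy simulator such as the one mentioned below.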
Vladimir
On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov <sitnikov.vladimir@gmail.com> wrote:
> Sample output can be seen here:
> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio

Neat. Not sure what generated this trace, but note this part:

3271838881374 88205 0 0 1663 16385 16604 0
3271840973321 4368 0 0 1663 16385 16604 1
3271842680626 4502 0 0 1663 16385 16604 1
3271846077927 4173 0 0 1663 16385 16604 1

If we want to avoid artificial inflation of usage counts, that kind of thing would be a good place to start -- obviously 4 consecutive accesses to the same buffer by the same backend doesn't justify a separate usage count bump each time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 3, 2018 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
> <sitnikov.vladimir@gmail.com> wrote:
>> Sample output can be seen here:
>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>
> Neat. Not sure what generated this trace, but note this part:
>
> 3271838881374 88205 0 0 1663 16385 16604 0
> 3271840973321 4368 0 0 1663 16385 16604 1
> 3271842680626 4502 0 0 1663 16385 16604 1
> 3271846077927 4173 0 0 1663 16385 16604 1
>
> If we want to avoid artificial inflation of usage counts, that kind of
> thing would be a good place to start -- obviously 4 consecutive
> accesses to the same buffer by the same backend doesn't justify a
> separate usage count bump each time.

I don't have time to check this out just now, but it seems like an excellent idea. It would be nice if it could be enhanced further, so you get some idea of what the blocks are without having to decode them yourself using tools like pageinspect.

--
Peter Geoghegan
> On May 4, 2018, at 0:37, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
> <sitnikov.vladimir@gmail.com> wrote:
>> Sample output can be seen here:
>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>
> Neat. Not sure what generated this trace, but note this part:
>
> 3271838881374 88205 0 0 1663 16385 16604 0
> 3271840973321 4368 0 0 1663 16385 16604 1
> 3271842680626 4502 0 0 1663 16385 16604 1
> 3271846077927 4173 0 0 1663 16385 16604 1
>
> If we want to avoid artificial inflation of usage counts, that kind of
> thing would be a good place to start -- obviously 4 consecutive
> accesses to the same buffer by the same backend doesn't justify a
> separate usage count bump each time.

Earlier in this thread Yura suggested that uses should not produce an equal bump each time. He effectively suggested a log scale of usages, so that many consecutive usages are taken into account but are not dramatically more important than just a few recent usages.

Best regards, Andrey Borodin.
On Fri, May 4, 2018 at 12:27, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> On May 4, 2018, at 0:37, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov
> <sitnikov.vladimir@gmail.com> wrote:
>> Sample output can be seen here:
>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio
>
> Neat. Not sure what generated this trace, but note this part:
>
> 3271838881374 88205 0 0 1663 16385 16604 0
> 3271840973321 4368 0 0 1663 16385 16604 1
> 3271842680626 4502 0 0 1663 16385 16604 1
> 3271846077927 4173 0 0 1663 16385 16604 1
>
> If we want to avoid artificial inflation of usage counts, that kind of
> thing would be a good place to start -- obviously 4 consecutive
> accesses to the same buffer by the same backend doesn't justify a
> separate usage count bump each time.
Earlier in this thread Yura suggested that uses should not produce an equal bump each time. He effectively suggested a log scale of usages, so that many consecutive usages are taken into account but are not dramatically more important than just a few recent usages.
I didn't suggest a log scale of usages, but rather a "replacement-period based" increment: the usage count can be incremented at most once per NBlocks/32 replaced items. Once it is incremented, its "replacement time" is remembered, and for the next NBlocks/32 replacements the usage count of this buffer is not incremented.
This way, increment is synchronized with replacement activity.
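The scheme can be sketched as a small simulation (structure and names invented for illustration; this is not PostgreSQL code):

```python
# "Once per NBlocks/32 replacements" rule: a buffer's usage count may
# rise only if at least NBLOCKS // 32 replacements have happened since
# its last increment (or since its placement).
NBLOCKS = 1024
PERIOD = NBLOCKS // 32          # 32 replacements between increments

replacements = 0                # global replacement counter
usage = {}                      # buffer -> [count, replacement stamp]

def replace(buf):
    """A new buffer enters the pool (evicting some victim)."""
    global replacements
    replacements += 1
    usage[buf] = [0, replacements]   # starts with usage count 0

def touch(buf):
    count, stamp = usage[buf]
    if replacements - stamp >= PERIOD:
        usage[buf] = [count + 1, replacements]

replace("A")
for _ in range(1000):
    touch("A")                  # a burst right after placement
print(usage["A"][0])            # 0 -- the whole burst counts for nothing

for i in range(PERIOD):
    replace(f"filler{i}")       # 32 replacements pass
touch("A")
print(usage["A"][0])            # 1 -- at most one increment per period
```

The point the simulation shows: a burst of touches right after placement contributes at most one increment per replacement period, rather than saturating the counter.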
Digging further, I suggest the following improvements to the GClock algorithm:
- place a new buffer with usage count = 0 (and for the next NBlocks/32 replacements its usage count is not incremented);
- increment not by 1, but by 8 (it simulates the "hot queue" of popular algorithms), with a limit of 32;
- scan at most 25 buffers for eviction. If no buffer with a zero usage count is found, the least used buffer (among the 25 scanned) is evicted
(new buffers are not evicted during their first NBlocks/32 replacements).
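The proposed eviction scan could be simulated roughly like this (the numbers 8, 32, and 25 are the ones from the message; everything else is invented for the sketch and is not PostgreSQL code):

```python
# Sketch of the proposed GClock tweaks: bump by 8 with a limit of 32;
# scan at most 25 buffers; if none hits zero, evict the least used of
# the buffers scanned.
LIMIT, BUMP, SCAN = 32, 8, 25

counts = [0, 9, 32, 8, 17, 24]      # usage counts around the clock
hand = 0                            # clock hand position

def touch(pos: int) -> None:
    counts[pos] = min(counts[pos] + BUMP, LIMIT)

def evict() -> int:
    global hand
    best_count, best_pos = None, None
    for _ in range(SCAN):           # scan at most 25 buffers
        pos = hand
        hand = (hand + 1) % len(counts)
        if counts[pos] == 0:
            return pos              # zero usage count: evict right away
        counts[pos] -= 1            # hand decrements as it passes
        if best_count is None or counts[pos] < best_count:
            best_count, best_pos = counts[pos], pos
    return best_pos                 # no zero found: evict the least used

victim = evict()
print(victim)                       # 0 -- buffer 0 already had count 0
counts[victim] = 0                  # the replacement enters with count 0
touch(victim)
print(counts[victim])               # 8 -- a touch bumps by 8
```

The "evict least used among 25 if no zero is found" rule bounds the scan, unlike the unbounded sweep the current algorithm can perform.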
Regards,
Yura Sokolov.
> I don't have time to check this out just now, but it seems like an
> excellent idea
There are a couple of sad things:
1) DTrace probes seem to be disabled by default. At least, I had to build PostgreSQL with --enable-dtrace on my macOS.
Does that (the requirement to enable dtrace within PostgreSQL) prevent capturing trace files from real-life applications?
There's an option to use system-level IO trace points (e.g. https://github.com/omniti-labs/pgtreats/blob/master/tools/pg_file_stress ), however it would be harder to get database/relation ids.
2) (database, tablespace, relation) oids cannot be easily mapped from within a DTrace script. One can work around that by using a SQL connection to the database, however it gets a bit complicated if there are multiple DB instances. What I mean is that it is hard to tell which connection URL to use in order to resolve the oids in question.
Vladimir
Hi!
> On May 4, 2018, at 16:05, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> I didn't suggest a log scale of usages, but rather a
> "replacement-period based" increment: the usage count can be
> incremented at most once per NBlocks/32 replaced items. Once it is
> incremented, its "replacement time" is remembered, and for the next
> NBlocks/32 replacements the usage count of this buffer is not
> incremented. This way, the increment is synchronized with replacement
> activity.
But you lose the difference between "touched once" and "actively used". A log scale of usage solves this: the usage count grows logarithmically, but drains linearly.
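The shape described here - logarithmic growth, linear drain - can be sketched as a simulation. The exact increment function below is invented for illustration (the 2-line patch attached later in the thread is not reproduced in the archive):

```python
# Sketch: the increment shrinks as the count grows (log-like growth),
# while the clock hand still drains the count by 1 per sweep.
LIMIT = 32

def bump(count: int) -> int:
    # Each bump adds less the higher the count already is, so the
    # count roughly tracks the logarithm of the number of touches.
    return min(count + max(1, (LIMIT - count) // 4), LIMIT)

def drain(count: int) -> int:
    return max(count - 1, 0)

c = 0
for _ in range(1000):
    c = bump(c)
print(c)   # 32 -- saturates at the limit despite 1000 touches
```

A buffer touched once sits at 8 and drains away in 8 sweeps; a buffer touched a thousand times saturates at 32 but still drains linearly, which is the distinction being argued for.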
> Digging further, I suggest the following improvements to the GClock algorithm:
> - place a new buffer with usage count = 0 (and for the next NBlocks/32
> replacements its usage count is not incremented);
> - increment not by 1, but by 8 (it simulates the "hot queue" of popular
> algorithms), with a limit of 32;
> - scan at most 25 buffers for eviction. If no buffer with a zero usage
> count is found, the least used buffer (among the 25 scanned) is evicted
> (new buffers are not evicted during their first NBlocks/32 replacements).
I do not understand where these numbers come from...
Best regards, Andrey Borodin.
On 05.05.2018 09:16, Andrey Borodin wrote:
> Hi!
>
>> On May 4, 2018, at 16:05, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> I didn't suggest a log scale of usages, but rather a
>> "replacement-period based" increment: the usage count can be
>> incremented at most once per NBlocks/32 replaced items. Once it is
>> incremented, its "replacement time" is remembered, and for the next
>> NBlocks/32 replacements the usage count of this buffer is not
>> incremented. This way, the increment is synchronized with replacement
>> activity.
>
> But you lose the difference between "touched once" and "actively used".
> A log scale of usage solves this: the usage count grows logarithmically,
> but drains linearly.

No, I don't lose the difference. But instead of an absolute value (log scale or linear) I count how often in time a block is used:
- if a buffer were touched 1000 times just after placement into shared_buffers, should it live 500 times longer than a neighbor that was touched only 2 times? Or 10 times longer? Or 5 times longer?
- but what if that "1000 times" buffer is not touched in the next period of time, while the neighbor is touched again 2 times?
All effective algorithms answer: the "1000 times" buffer should be evicted first, but its neighbor is a really hot buffer that should be saved for a longer period.

A log scale doesn't solve this. But incrementing "once in a period" does. Especially if a block is placed first with zero count (instead of 1 as currently).

>> Digging further, I suggest the following improvements to the GClock
>> algorithm:
>> - place a new buffer with usage count = 0 (and for the next NBlocks/32
>> replacements its usage count is not incremented);
>> - increment not by 1, but by 8 (it simulates the "hot queue" of
>> popular algorithms), with a limit of 32;
>> - scan at most 25 buffers for eviction. If no buffer with a zero usage
>> count is found, the least used buffer (among the 25 scanned) is evicted
>> (new buffers are not evicted during their first NBlocks/32
>> replacements).
>
> I do not understand where these numbers come from...
I found these numbers by testing with several artificial traces found on the web. I don't claim these numbers are best. Even on those traces the best values may vary with cache size: for small cache sizes the increment and limit tend to be higher, for a huge cache - smaller. But these were the most balanced.

And I don't claim those traces are representative of PostgreSQL; that is why I'm pushing this discussion to collect more real-world PostgreSQL traces and make them public.

And I believe my algorithm is not the best. Clock-Pro and ARC show better results on those traces. Tiny-LFU - a cache admission algorithm - may be even more efficient (in terms of evictions). But the results should be rechecked with PostgreSQL traces.

My algorithm would just be the least invasive for the current code, imho.

With regards,
Sokolov Yura
Attachments
> On May 5, 2018, at 13:25, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> 05.05.2018 09:16, Andrey Borodin wrote:
>> Hi!
>>
>>> On May 4, 2018, at 16:05, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>>
>>> I didn't suggest a log scale of usages, but rather a
>>> "replacement-period based" increment: the usage count can be
>>> incremented at most once per NBlocks/32 replaced items. Once it is
>>> incremented, its "replacement time" is remembered, and for the next
>>> NBlocks/32 replacements the usage count of this buffer is not
>>> incremented. This way, the increment is synchronized with replacement
>>> activity.
>>
>> But you lose the difference between "touched once" and "actively used".
>> A log scale of usage solves this: the usage count grows logarithmically,
>> but drains linearly.
>
> No, I don't lose the difference. But instead of an absolute value (log
> scale or linear) I count how often in time a block is used:
> - if a buffer were touched 1000 times just after placement into
> shared_buffers, should it live 500 times longer than a neighbor that
> was touched only 2 times? Or 10 times longer? Or 5 times longer?
> - but what if that "1000 times" buffer is not touched in the next
> period of time, while the neighbor is touched again 2 times?
> All effective algorithms answer: the "1000 times" buffer should be
> evicted first, but its neighbor is a really hot buffer that should be
> saved for a longer period.

It is hard to tell what the right decision actually is here. It is better to evict the buffer that will not be needed for longer, and it is not obvious that this is true for the buffer you called hot.
Assume we have buffer A which is touched 1024 times (and then forgotten forever) and buffer B which is touched 2 times every clock cycle.

                     A          B
Usage count          0x400      0x2
1 Eviction time!     0x100      0x0 E
Usage count          0x100      0x2
2 Eviction time!     0x080      0x0 E
Usage count          0x080      0x2
3 Eviction time!     0x020      0x0 E
Usage count          0x020      0x2
4 Eviction time!     0x00A      0x0 E
Usage count          0x00A      0x2
5 Eviction time!     0x004      0x0 E
Usage count          0x004      0x2
6 Eviction time!     0x001      0x0 E
Usage count          0x001      0x2
7 Eviction time!     0x000 E    0x2

So, a buffer used 512 times more survived only 7 stale cycles. Looks fair.
> A log scale doesn't solve this. But incrementing "once in a period"
> does. Especially if a block is placed first with zero count (instead
> of 1 as currently).
>
>>> Digging further, I suggest the following improvements to the GClock
>>> algorithm:
>>> - place a new buffer with usage count = 0 (and for the next NBlocks/32
>>> replacements its usage count is not incremented);
>>> - increment not by 1, but by 8 (it simulates the "hot queue" of
>>> popular algorithms), with a limit of 32;
>>> - scan at most 25 buffers for eviction. If no buffer with a zero
>>> usage count is found, the least used buffer (among the 25 scanned)
>>> is evicted (new buffers are not evicted during their first
>>> NBlocks/32 replacements).
>>
>> I do not understand where these numbers come from...
>
> I found these numbers by testing with several artificial traces found
> on the web. I don't claim these numbers are best. Even on those traces
> the best values may vary with cache size: for small cache sizes the
> increment and limit tend to be higher, for a huge cache - smaller. But
> these were the most balanced.
>
> And I don't claim those traces are representative of PostgreSQL; that
> is why I'm pushing this discussion to collect more real-world
> PostgreSQL traces and make them public.
>
> And I believe my algorithm is not the best. Clock-Pro and ARC show
> better results on those traces. Tiny-LFU - a cache admission
> algorithm - may be even more efficient (in terms of evictions). But
> the results should be rechecked with PostgreSQL traces.
>
> My algorithm would just be the least invasive for the current code, imho.
Here's the demo patch with logarithmic scale. 2 lines changed.
Best regards, Andrey Borodin.
Attachments
On 06.05.2018 11:20, Andrey Borodin wrote:
>> On May 5, 2018, at 13:25, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> 05.05.2018 09:16, Andrey Borodin wrote:
>>> Hi!
>>>
>>>> On May 4, 2018, at 16:05, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>>>
>>>> I didn't suggest a log scale of usages, but rather a
>>>> "replacement-period based" increment: the usage count can be
>>>> incremented at most once per NBlocks/32 replaced items. Once it is
>>>> incremented, its "replacement time" is remembered, and for the next
>>>> NBlocks/32 replacements the usage count of this buffer is not
>>>> incremented. This way, the increment is synchronized with
>>>> replacement activity.
>>>
>>> But you lose the difference between "touched once" and "actively used".
>>> A log scale of usage solves this: the usage count grows
>>> logarithmically, but drains linearly.
>>
>> No, I don't lose the difference. But instead of an absolute value (log
>> scale or linear) I count how often in time a block is used:
>> - if a buffer were touched 1000 times just after placement into
>> shared_buffers, should it live 500 times longer than a neighbor that
>> was touched only 2 times? Or 10 times longer? Or 5 times longer?
>> - but what if that "1000 times" buffer is not touched in the next
>> period of time, while the neighbor is touched again 2 times?
>> All effective algorithms answer: the "1000 times" buffer should be
>> evicted first, but its neighbor is a really hot buffer that should be
>> saved for a longer period.
>
> It is hard to tell what the right decision actually is here. It is
> better to evict the buffer that will not be needed for longer, and it
> is not obvious that this is true for the buffer you called hot.
>
> Assume we have buffer A which is touched 1024 times (and then forgotten
> forever) and buffer B which is touched 2 times every clock cycle.
>
>                      A          B
> Usage count          0x400      0x2
> 1 Eviction time!     0x100      0x0 E
> Usage count          0x100      0x2
> 2 Eviction time!     0x080      0x0 E
> Usage count          0x080      0x2
> 3 Eviction time!     0x020      0x0 E
> Usage count          0x020      0x2
> 4 Eviction time!     0x00A      0x0 E
> Usage count          0x00A      0x2
> 5 Eviction time!     0x004      0x0 E
> Usage count          0x004      0x2
> 6 Eviction time!     0x001      0x0 E
> Usage count          0x001      0x2
> 7 Eviction time!     0x000 E    0x2
>
> So, a buffer used 512 times more survived only 7 stale cycles. Looks fair.

No, it is not fair. In fact, it is worse than with the current strategy: with the current GClock it will survive only 5 stale cycles.

And in my approach I talked about NBlocks/32 periods, so it looks more like this (let the increment be 1):

Period      A              B            C
            touch  UseCnt  touch  cnt   touch  cnt
 1/64       512    0       16     0     1      0
 2/64       512    0       0      0     1      0
 3/64       0      0       0      0     0      0
 4/64       0      0       0      0     1      1
 5/64       0      0       0      0     1      1
 .....
32/64       0      0       16     1     0      1
33/64       0      0       16     1     1      2
34/64       0      0       0      1     1      2
 .....
63/64       0      0       0      1     1      3
64/64       0      0       0      1     1      3
Eviction time!     A evicted      B 1          C 2

So, in my algorithm A will be evicted at the first eviction time. And while B is accessed more times, its access distance is much longer than C's, so it has less chance to survive in the future.

>> A log scale doesn't solve this. But incrementing "once in a period"
>> does. Especially if a block is placed first with zero count (instead
>> of 1 as currently).
>>
>>>> Digging further, I suggest the following improvements to the GClock
>>>> algorithm:
>>>> - place a new buffer with usage count = 0 (and for the next
>>>> NBlocks/32 replacements its usage count is not incremented);
>>>> - increment not by 1, but by 8 (it simulates the "hot queue" of
>>>> popular algorithms), with a limit of 32;
>>>> - scan at most 25 buffers for eviction. If no buffer with a zero
>>>> usage count is found, the least used buffer (among the 25 scanned)
>>>> is evicted (new buffers are not evicted during their first
>>>> NBlocks/32 replacements).
>>>
>>> I do not understand where these numbers come from...
>>
>> I found these numbers by testing with several artificial traces found
>> on the web. I don't claim these numbers are best. Even on those traces
>> the best values may vary with cache size: for small cache sizes the
>> increment and limit tend to be higher, for a huge cache - smaller. But
>> these were the most balanced.
>>
>> And I don't claim those traces are representative of PostgreSQL; that
>> is why I'm pushing this discussion to collect more real-world
>> PostgreSQL traces and make them public.
>>
>> And I believe my algorithm is not the best. Clock-Pro and ARC show
>> better results on those traces. Tiny-LFU - a cache admission
>> algorithm - may be even more efficient (in terms of evictions). But
>> the results should be rechecked with PostgreSQL traces.
>>
>> My algorithm would just be the least invasive for the current code, imho.
>
> Here's the demo patch with logarithmic scale. 2 lines changed.

I've been playing with a logarithmic scale in PostgreSQL roughly a year ago. I didn't find any benefits compared to the current code. In fact, it looked to perform worse than the current code. That is why I didn't write about that experiment to pgsql-hackers.

But probably I measured in a wrong way. That is why I dream of having real-world traces in hand.

Consider all the algorithms known to be effective: 2Q, ARC, Clock-PRO, CAR - they all consider a buffer "hot" if it has higher temporal frequency, as opposed to a raw access count. They all mostly ignore a spike of usages during the first moments after placement into the cache, and move a buffer to hot if it is accessed some time after.

With regards,
Sokolov Yura
Attachments
> On May 6, 2018, at 20:38, Yura Sokolov <funny.falcon@gmail.com> wrote:
>
> I've been playing with a logarithmic scale in PostgreSQL roughly a year
> ago. I didn't find any benefits compared to the current code. In fact,
> it looked to perform worse than the current code.
> That is why I didn't write about that experiment to pgsql-hackers.

Is there a feature of testgres to benchmark pgbench tps against shared buffer size? Let's request this feature if it is not there :)

> But probably I measured in a wrong way. That is why I dream of having
> real-world traces in hand.
>
> Consider all the algorithms known to be effective: 2Q, ARC, Clock-PRO,
> CAR - they all consider a buffer "hot" if it has higher temporal
> frequency, as opposed to a raw access count. They all mostly ignore a
> spike of usages during the first moments after placement into the
> cache, and move a buffer to hot if it is accessed some time after.

These algorithms do not ignore spikes, they ignore a spike's amplitude. And this does not mean that the amplitude is irrelevant information, even if these algorithms perform almost like Belady's.

Building a complicated heuristic (with a lot of magic numbers) to merge these spikes into one event does not look promising to me. But this is just my superstition; chances are that you can tune your algorithm into a serious advancement. But please publish benchmarks, whatever the result will be.
Best regards, Andrey Borodin.
On 06.05.2018 20:28, Andrey Borodin wrote:
>> On May 6, 2018, at 20:38, Yura Sokolov <funny.falcon@gmail.com> wrote:
>>
>> I've been playing with a logarithmic scale in PostgreSQL roughly a
>> year ago. I didn't find any benefits compared to the current code. In
>> fact, it looked to perform worse than the current code.
>> That is why I didn't write about that experiment to pgsql-hackers.
>
> Is there a feature of testgres to benchmark pgbench tps against shared
> buffer size? Let's request this feature if it is not there :)

That would be great. Will you? But pgbench is a bit... far from the real world. I used pgbench to test the log scale, and it shows that the log scale is worse. It would be more important to test with some real-world installations, or to validate the algorithm with real-world traces.

>> But probably I measured in a wrong way. That is why I dream of having
>> real-world traces in hand.
>>
>> Consider all the algorithms known to be effective: 2Q, ARC, Clock-PRO,
>> CAR - they all consider a buffer "hot" if it has higher temporal
>> frequency, as opposed to a raw access count. They all mostly ignore a
>> spike of usages during the first moments after placement into the
>> cache, and move a buffer to hot if it is accessed some time after.
>
> These algorithms do not ignore spikes, they ignore a spike's amplitude.
> And this does not mean that the amplitude is irrelevant information,
> even if these algorithms perform almost like Belady's.

More important is their consideration of temporal frequency.

> Building a complicated heuristic (with a lot of magic numbers) to merge
> these spikes into one event does not look promising to me. But this is
> just my superstition; chances are that you can tune your algorithm into
> a serious advancement. But please publish benchmarks, whatever the
> result will be.

Yes, these numbers look magic. They are hand-tuned on some traces that are probably irrelevant to PostgreSQL behavior. And since you already presented a patch, may you publish its benchmark results?

With regards,
Sokolov Yura
On Sat, May 5, 2018 at 2:16 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> But you lose the difference between "touched once" and "actively used".
> A log scale of usage solves this: the usage count grows logarithmically,
> but drains linearly.

Even if we have that, or something with similar effects, it's still desirable to avoid bumping the usage count multiple times for accesses that happen close together in time. I don't really agree with Yura Sokolov's proposal for how to prevent that problem, because during periods when no buffer eviction is happening it basically throws out all of the data: no items are replaced, and the usage count can't be bumped again until NBlocks/32 items are replaced. That's bad. We want an algorithm that, if there's no buffer eviction for a while and then we start having to evict buffers, has good information on which things should be evicted and which should resist eviction.

But I think that he's right that preventing the usage count from rising artificially when there are correlated accesses is important. If we have a page with usage count 0 which incurs 4 correlated accesses, as shown in Vladimir Sitnikov's results upthread, then reducing the usage count linearly requires 4 sweeps through the buffer ring to get the usage count back down to 0 (4->3->2->1->0); reducing it by dividing by two still requires 3 sweeps (4->2->1->0). A good solution for that case should, like Yura Sokolov's proposed algorithm, avoid pushing the usage count higher than 1. So he has the right idea, I think, even if the method's not quite what we want.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
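The sweep arithmetic in the paragraph above can be checked with a tiny simulation (illustrative only):

```python
# How many clock sweeps until a usage count drains from 4 to 0,
# for a linear drain (-1 per sweep) vs. halving per sweep?
def sweeps(count, drain):
    n = 0
    while count > 0:
        count = drain(count)
        n += 1
    return n

print(sweeps(4, lambda c: c - 1))   # 4  (4->3->2->1->0)
print(sweeps(4, lambda c: c // 2))  # 3  (4->2->1->0)
```

Capping the count at 1 for correlated accesses would make it evictable after a single sweep, which is the point being made.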
2018-05-07 17:15 GMT+03:00 Robert Haas <robertmhaas@gmail.com>:
>
> On Sat, May 5, 2018 at 2:16 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> > But you loose difference between "touched once" and "actively used". Log
> > scale of usage solves this: usage count grows logarithmically, but drains
> > linearly.
>
> Even if we have that, or something with similar effects, it's still
> desirable to avoid bumping the usage count multiple times for accesses
> that happen close together in time. I don't really agree with Yura
> Sokolov's proposal for how to prevent that problem, because during
> periods when no buffer eviction is happening it basically throws out
> all of the data: no items are replaced, and the usage count can't be
> bumped again until NBlocks/32 items are replaced. That's bad.
That is not as bad as it seems at first glance: during a period when
no buffer eviction is happening, all the information is almost irrelevant
anyway, because it is naturally outdated. And due to the choice of "batch"
size (25), there is always a window between "NBlocks/32 items replaced" and
"this item is considered for replacement": even if all items in
25/32*NBlocks had a non-zero usage count, there are still at least
7/32*NBlocks items to consider before this item could be replaced, during
which its usage count can be incremented.
Note: "1/32 replacements" really means "1/32 new buffers", so while
shared_buffers is not full yet (so no evictions have happened), all blocks,
except the last 1/32, can have their usage count incremented.
>
> We want an algorithm that, if there's no buffer eviction for a while and
> then we start having to evict buffers, has good information on which
> things should be evicted and which should resist eviction. But I
> think that he's right that preventing the usage count from rising
> artificially when there are correlated accesses is important. If we
> have a page with usage count 0 which incurs 4 correlated accesses, as
> shown in Vladimir Sitnikov's results upthread, then reducing the usage
> count linearly requires 4 sweeps through the buffer ring to get the
> usage count back down to 0 (4->3->2->1->0); reducing it by dividing by
> two still requires 3 sweeps (4->2->1->0).
Yes, that is why log-scale doesn't help much.
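For what it's worth, the sweep counts quoted above are easy to check with a toy simulation (illustrative Python sketch, not PostgreSQL code; the helper names are made up):

```python
def sweeps_to_zero(usage_count, decay):
    """How many clock sweeps pass a buffer before its usage count
    reaches 0 and it becomes evictable, for a given decay rule."""
    sweeps = 0
    while usage_count > 0:
        usage_count = decay(usage_count)
        sweeps += 1
    return sweeps

linear = lambda u: u - 1     # current rule: decrement by one per sweep
halving = lambda u: u // 2   # log-scale drain: halve per sweep

print(sweeps_to_zero(4, linear))   # 4 sweeps: 4->3->2->1->0
print(sweeps_to_zero(4, halving))  # 3 sweeps: 4->2->1->0
```

Either way, four correlated accesses to a cold page keep it pinned in memory for several sweeps, which is the distortion being discussed.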
> A good solution for that
> case should, like Yura Sokolov's proposed algorithm, avoid pushing the
> usage count higher than 1. So he has the right idea, I think, even if
> the method's not quite what we want.
Well, I proposed an increment of 8 because simulation (on the traces
I found) shows that it really improves the quality of the algorithm. It looks
like, if an item gets a hit after a not-so-short period, there is a great
chance this item is hot, so it should be "protected" from eviction.
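To make the "increment by 8" intuition concrete, here is a toy clock-sweep model where a hit bumps the usage count straight to a cap of 8 instead of by 1. This is only an illustration of the idea; the class and constant names are made up, and this is not the actual proposal's code:

```python
MAX_USAGE = 8

class ClockCache:
    """Toy clock cache: a hit sets the usage count to MAX_USAGE in one
    big step (the 'increment by 8' idea) instead of +1 per access, so an
    item that is touched again is protected for several sweeps."""
    def __init__(self, nbuffers):
        self.keys = [None] * nbuffers
        self.usage = [0] * nbuffers
        self.index = {}          # key -> slot
        self.hand = 0

    def access(self, key):
        slot = self.index.get(key)
        if slot is not None:
            self.usage[slot] = MAX_USAGE   # one large bump, capped
            return True                    # hit
        # Miss: sweep, decrementing usage counts, until a zero-usage
        # victim is found.
        while self.usage[self.hand] > 0:
            self.usage[self.hand] -= 1
            self.hand = (self.hand + 1) % len(self.keys)
        victim = self.hand
        if self.keys[victim] is not None:
            del self.index[self.keys[victim]]
        self.keys[victim] = key
        self.index[key] = victim
        self.usage[victim] = 1
        self.hand = (self.hand + 1) % len(self.keys)
        return False
```

An item that gets a second hit then survives up to eight full sweeps, while once-touched items become evictable after one.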
I used traces from https://github.com/djblue/caching/tree/master/traces
(and the code in that repository to develop the algorithm).
Having real-world traces is really important, because it is not so
obvious how an algorithm should perform. Mental conclusions can be wrong
and fail to match production load behavior.
For example, all the good algorithms track recently evicted items, and it is
not so obvious that an already-evicted item that gets a hit is a "hot" item
and should be placed in a separate "hot" queue.
(My algorithm doesn't track evicted items, so it may not be as good in terms
of hit rate.)
(Personally, I vote more for Clock-Pro than for my own algorithm. But it has
three "clock hands" and, if implemented as described, still demands
doubly-linked lists. So it should be modified/adapted, imho.)
With regards,
Yura Sokolov.
On Mon, May 7, 2018 at 12:54 PM, Юрий Соколов <funny.falcon@gmail.com> wrote: >> Even if we have that, or something with similar effects, it's still >> desirable to avoid bumping the usage count multiple times for accesses >> that happen close together in time. I don't really agree with Yura >> Sokolov's proposal for how to prevent that problem, because during >> periods when no buffer eviction is happening it basically throws out >> all of the data: no items are replaced, and the usage count can't be >> bumped again until NBlocks/32 items are replaced. That's bad. > > That is not as bad as it seems on first glance: during period when > no buffer eviction is happenning, all information is almost irrelevant, > because it is naturally outdated. And due to choose of "batch" size (25), > there is always window between "NBlocks/32 items replaced" and > "this item is considered for replacement": even if all items in > 25/32*NBlocks had non-zero usage count, then there are at least > 7/32*NBlocks to consider before this item could be replaced, during > which usage count can be incremented. I don't agree that the information is almost irrelevant. If we have a working set that starts out fitting within shared_buffers and then grows larger, we want to know which parts of the data were being accessed frequently just prior to the point where we started evicting things. Otherwise we basically evict at random for a while, and that's not good. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2018-05-08 16:21 GMT+03:00 Robert Haas <robertmhaas@gmail.com>:
>
> On Mon, May 7, 2018 at 12:54 PM, Юрий Соколов <funny.falcon@gmail.com> wrote:
> >> Even if we have that, or something with similar effects, it's still
> >> desirable to avoid bumping the usage count multiple times for accesses
> >> that happen close together in time. I don't really agree with Yura
> >> Sokolov's proposal for how to prevent that problem, because during
> >> periods when no buffer eviction is happening it basically throws out
> >> all of the data: no items are replaced, and the usage count can't be
> >> bumped again until NBlocks/32 items are replaced. That's bad.
> >
> > That is not as bad as it seems on first glance: during period when
> > no buffer eviction is happenning, all information is almost irrelevant,
> > because it is naturally outdated. And due to choose of "batch" size (25),
> > there is always window between "NBlocks/32 items replaced" and
> > "this item is considered for replacement": even if all items in
> > 25/32*NBlocks had non-zero usage count, then there are at least
> > 7/32*NBlocks to consider before this item could be replaced, during
> > which usage count can be incremented.
>
> I don't agree that the information is almost irrelevant. If we have a
> working set that starts out fitting within shared_buffers and then
> grows larger, we want to know which parts of the data were being
> accessed frequently just prior to the point where we started evicting
> things.
"just prior to the point" means we have to have machinery for information
expiration without eviction. For example, the same "clock hand" should walk
over all buffers continuously and decrement the "usage count", but without
performing eviction. Right?
As an alternative, some approximation of the Working Set algorithm could be used:
- on every access, an "access time" should be written to the item,
- items accessed before some "expiration interval" are considered for
expiration.
This way, there is also fresh information about usage even without any
expiration having been performed yet.
The "usage count" still has to be used to track access frequency, but it
should not be just incremented:
- there should be a formula for a "corrected count" that depends on the
last access time, the stored usage count, and the overall system access rate.
- the increment should look like:
usage_count = corrected_count(cur_time, access_time, usage_count, access_rate) + increment(cur_time, access_time, access_rate)
- probably, expiration should rely on "corrected_count" too,
i.e. evict if "corrected_count" == 0.
To smooth access spikes, new values for "usage count" and "access time"
should be written not on every access, but once per some period.
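One possible shape for such a formula (purely illustrative: the function names and arguments follow the pseudo-code above, but the exponential-decay choice and the half-life constant are my own assumptions, not anything proposed in the thread):

```python
def corrected_count(cur_time, access_time, usage_count, access_rate,
                    half_life_accesses=1000.0):
    """Decay the stored usage_count by how much global activity has
    passed since this item's last recorded access.  access_rate is
    accesses per unit time; half_life_accesses controls how quickly
    history fades.  Both the decay rule and the constant are
    illustrative assumptions."""
    elapsed_accesses = (cur_time - access_time) * access_rate
    decay = 0.5 ** (elapsed_accesses / half_life_accesses)
    return usage_count * decay

def bump(cur_time, access_time, usage_count, access_rate):
    """Increment applied on access: corrected old value plus one."""
    return corrected_count(cur_time, access_time, usage_count,
                           access_rate) + 1.0

# An item untouched for one half-life decays toward 0 and eventually
# becomes evictable without any clock hand ever visiting it:
print(corrected_count(cur_time=10.0, access_time=0.0,
                      usage_count=8.0, access_rate=100.0))  # 8 * 0.5 = 4.0
```

The attraction is that no separate expiration sweep is needed: the decayed value can be computed lazily whenever the item is next inspected.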
Oh, looks like I'm inventing another kind of bicycle :-(
and this one is unlikely to be good.
With regards,
Sokolov Yura
> Oh, looks like I'm inventing another kind of bicycle :-(
Do you think you could capture a trace or two from a more-or-less representative application/database?
Discussion of algorithms makes little sense as we all lack traces to compare/validate.
Vladimir
On 05/08/2018 11:35 AM, Vladimir Sitnikov wrote: >> Oh, looks like I'm inventing another kind of bicycle :-( > > Do you think you could capture a trace or two from a more-or-less > representative application/database? > > Discussion of algorithms makes little sense as we all lack traces to > compare/validate. > I have work loads that I can repeat, so I can help with testing. Best regards, Jesper
>I have work loads that I can repeat, so I can help with testing.
That would be great.
Do you think you could use DTrace to capture the trace?
For instance, https://github.com/vlsi/pgsqlstat/blob/pgsqlio/pgsqlio
Vladimir
On 05/08/2018 11:49 AM, Vladimir Sitnikov wrote: >> I have work loads that I can repeat, so I can help with testing. > > That would be great. > > Do you think you could use DTrace to capture the trace? > For instance, https://github.com/vlsi/pgsqlstat/blob/pgsqlio/pgsqlio > DTrace or BPF would be ok. Best regards, Jesper
On Tue, May 8, 2018 at 10:22 AM, Юрий Соколов <funny.falcon@gmail.com> wrote: > "just prior to the point" means we have to have machinery for information > expiration without eviction. For example, same "clock hand" should walk > over all buffers continuously, and decrement "usage count", but without > eviction performed. Right? Right. > As alternative, some approximation of Working Set algorithm could be used: > - on every access "access time" should be written to item, > - items accessed before some "expiration interval" are considered for > expiration. > This way, there is also fresh information about usage even without any > expiration performed yet. That's basically the same kind of thing. Ideally we want to have one mechanism that operates all the time, rather than one that works when no eviction is occurring and a completely separate one that operates when eviction is occurring. Anyway, I think we really ought to investigate CAR. CAR could be made to piggyback on our existing CLOCK scheme. See section III.B of the CAR paper, page 4, especially the second paragraph in the second column. What would change is that instead of sweeping through all buffers, there would be two sets of buffers T1 and T2, each with a separate clock hand. Also, we'd need to store the history lists B1 and B2 someplace. Independently of that, it would probably also be useful to avoid bumping the reference count multiple times when a buffer is accessed by the same backend several times in quick succession. Perhaps this could even be as simple as maintaining a list of the last two buffer IDs for which we bumped the usage count; if we see one of them again, don't bump the usage count again. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
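The "last two buffer IDs" idea in the last paragraph could look roughly like this (a per-backend toy sketch in Python; the class and method names are made up, and this is not actual bufmgr code):

```python
class UsageBumpFilter:
    """Per-backend filter: remember the last two buffer IDs whose usage
    count we bumped; a repeat access to either one is not bumped again,
    so a tight run of correlated accesses counts only once."""
    def __init__(self):
        self.recent = [-1, -1]

    def should_bump(self, buf_id):
        if buf_id in self.recent:
            return False           # seen very recently: skip the bump
        # Remember this buffer, dropping the older of the two entries.
        self.recent[1] = self.recent[0]
        self.recent[0] = buf_id
        return True
```

Note that a buffer becomes bumpable again as soon as two other buffers have been touched, so longer-term reuse is still counted.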
On Wed, May 9, 2018 at 11:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > Independently of that, it would be probably also useful to avoid > bumping the reference count multiple times when a buffer is accessed > by the same backend several times in quick succession. Perhaps this > could even be as simple as maintaining a list of the last two buffer > IDs for which we bumped the usage count; if we see one of them again, > don't bump the usage count again. Hm. Is the objective here to optimize access patterns or to reduce atomic operations (or both)? All else being equal, an algorithm that delivers the similar eviction results with less cache synchronization ought to be preferred...are the various algorithms scored in that way? merlin
On Wed, May 9, 2018 at 1:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Wed, May 9, 2018 at 11:00 AM Robert Haas <robertmhaas@gmail.com> wrote: >> Independently of that, it would be probably also useful to avoid >> bumping the reference count multiple times when a buffer is accessed >> by the same backend several times in quick succession. Perhaps this >> could even be as simple as maintaining a list of the last two buffer >> IDs for which we bumped the usage count; if we see one of them again, >> don't bump the usage count again. > > Hm. Is the objective here to optimize access patterns or to reduce atomic > operations (or both)? All else being equal, an algorithm that delivers > the similar eviction results with less cache synchronization ought to be > preferred...are the various algorithms scored in that way? The CAR paper explains steps they've taken to minimize atomic operations, so it is something that people do think about. The main reason for wanting to avoid extra usage count bumps is to avoid distorting the usage counts. Any savings we get from reducing atomic ops is a bonus. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 3, 2018 at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: > On Thu, May 3, 2018 at 12:37 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, May 2, 2018 at 3:06 PM, Vladimir Sitnikov >> <sitnikov.vladimir@gmail.com> wrote: >>> Sample output can be seen here: >>> https://github.com/vlsi/pgsqlstat/tree/pgsqlio#pgsqlio >> >> Neat. Not sure what generated this trace, but note this part: > I don't have time to check this out just now, but it seems like an > excellent idea. It would be nice if it could be enhanced further, so > you get some idea of what the blocks are without having to decode them > yourself using tools like pageinspect. Although I'm a Linux user, I added --enable-dtrace to my standard release build configure alias several months back. I wanted to be able to use the user space probes it adds with BCC [1], which is a really nice tool (the dtrace/system tap probes seem to be totally generic). If you have a fairly recent kernel, many really nice tricks are possible with only modest effort. I've been meaning to polish up a selection of one-liners I came up with several months back, to make them available for others. 
Here is a one-liner that traces buffer reads within the backend with PID 6617: pg@bat:~/notes/setup$ sudo /usr/share/bcc/tools/trace -p 6617 --time 'u:/code/postgresql/root/install/bin/postgres:postgresql:buffer__read__start "fork: %d, blockNum: %u, spcNode: %u, dbNode: %u, relNode: %u, backend: %d, isExtend: %d" arg1, arg2, arg3, arg4, arg5, arg6, arg7' If I run the query "select * from pgbench_accounts where aid = 555" within that backend, I can see the following: TIME PID TID COMM FUNC - 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 290, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 3, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 2, spcNode: 1663, dbNode: 13100, relNode: 16404, backend: -1, isExtend: 0 12:17:28 6617 6617 postgres buffer__read__start fork: 0, blockNum: 9, spcNode: 1663, dbNode: 13100, relNode: 16396, backend: -1, isExtend: 0 This is pretty easy to interpret: 290 is the root page, 3 is another internal page, and 2 is the leaf page. Finally, block 9 is from the heap relation. I haven't actually instrumented the amount of time each read takes here, so it's not quite as good as your dtrace script. I'm sure that that would be trivial, since BCC really does seem to make customization very easy. I just don't want to spend more than 10 minutes on this today. [1] https://github.com/iovisor/bcc -- Peter Geoghegan
On Wed, Apr 25, 2018 at 6:31 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > 4. Using O_DIRECT while writing data files One interesting thing about O_DIRECT that I haven't seen mentioned in these discussions: POSIX effectively requires writes to the same file to be serialised, meaning in practice that [p]write[v]() acquires a mutex on the inode. I've heard of people working on other databases that have multiple write-back threads that considered this to be a serious problem, and I guess it becomes worse as you concentrate more stuff into bigger files to cut down on file descriptors and file system metadata operations and have more concurrent writers. Some ways around that: 1. On Linux if you use O_DIRECT then XFS forgets about that POSIX requirement (O_DIRECT is outside the spec anyway). It just doesn't acquire inode->i_mutex, so you get parallel writes to the same file (except in some corner cases). AFAIK other Linux filesystems don't do that, so XFS might be your only choice if you want parallel direct IO on that OS. Anyone know more about that? Relevant kernel code: the inode_lock() call that appears in ext4_file_write_iter() but not in xfs_file_write_iter() in the IOCB_DIRECT case. 2. Ditto for Solaris's UFS, AIX's JFS2 with "O_CIO" enabled (but apparently not the JFS port on Linux). I don't know the answer for FreeBSD's UFS off-hand. 3. I've heard that ZFS achieves parallel writes to the same file while magically adhering to the POSIX serialisation rules, and that's independent of direct IO (which it doesn't even support). -- Thomas Munro http://www.enterprisedb.com
I have been working on caching as a hobby project for the last few years and would be happy to collaborate on an analysis and design. I've tackled a lot of these issues, wrote widely used implementations, and co-authored a paper for ACM's Transactions on Storage on an eviction policy. A short summary of my results is described in this HighScalability article [1]. I very much agree with the requests for traces to use real-world data. I wrote a simulator that supports a large number of policies and traces [2]. It was invaluable. I believe ARC's DS1 should be the closest to a Postgres workload. Their OLTP trace appears to be the I/O of the WAL, making it recency-biased and not very interesting. ** Eviction ** ARC-based policies (CAR, CART) are only modestly better than LRU in most workloads. It does much better in Zipf, but poorly in loopy workloads like glimpse. I think that the problem is that it does not capture and use the history (ghost lists) very effectively. LIRS-based policies (ClockPro, FRD) are very strong. Unfortunately the papers do a poor job of describing them, they are complex to implement, and most implementations are flawed so as to diminish the hit rate. As described in the paper, LIRS has an unbounded history that can cause memory leaks if not capped, and the pruning is O(n). I try to avoid Clock-based policies in my designs due to their O(n) scans. This can cause latency spikes that are difficult to reproduce because they depend on the policy's state. Random sampling policies don't have this problem, but tend to underperform due to a lack of history. I prefer amortized O(1) policies, but those are difficult due to concurrency (solved below). I chose TinyLFU with an admission window, called W-TinyLFU. This uses a frequency sketch with saturating counters for its history (e.g., CountMinSketch w/ 4-bit counters). The counters are halved every sample period, e.g. 10x the maximum size, which efficiently ages the LFU.
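The frequency sketch Ben describes can be modeled compactly: a count-min sketch with 4-bit saturating counters and the periodic halving that ages out history. This is a minimal Python sketch for illustration; the width, depth, and sample period here are arbitrary, not tuned values:

```python
import random

class TinyLfuSketch:
    """Minimal count-min sketch with 4-bit saturating counters and the
    periodic halving ('aging') step used by TinyLFU-style policies.
    Sizes and the sample period are illustrative only."""
    DEPTH = 4
    LIMIT = 15               # 4-bit counters saturate at 15

    def __init__(self, width, sample_period):
        self.width = width
        self.rows = [[0] * width for _ in range(self.DEPTH)]
        self.seeds = [random.randrange(1 << 30) for _ in range(self.DEPTH)]
        self.sample_period = sample_period
        self.additions = 0

    def _index(self, row, key):
        return (hash(key) ^ self.seeds[row]) % self.width

    def increment(self, key):
        for r in range(self.DEPTH):
            i = self._index(r, key)
            if self.rows[r][i] < self.LIMIT:
                self.rows[r][i] += 1
        self.additions += 1
        if self.additions >= self.sample_period:
            self._halve()        # age out old history

    def _halve(self):
        for row in self.rows:
            for i in range(len(row)):
                row[i] //= 2
        self.additions //= 2

    def estimate(self, key):
        return min(self.rows[r][self._index(r, key)]
                   for r in range(self.DEPTH))
```

Admission is then just a comparison: on a miss that would evict something, admit the candidate only if its estimated frequency is higher than the victim's.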
Instead of trying to order for eviction, TinyLFU avoids cache pollution by rejecting candidates with a low probability of reuse. This is done by checking the candidate's frequency against the eviction policy's victim, and admitting only if the candidate offers a better hit rate. The LRU admission window helps for recency-based traces where the candidate is rejected prematurely and, when accepted, is unlikely to be used again soon. This simple and compact scheme has the highest hit rate in most real-world workloads that I have tested. To bypass the ACM paywall, you can use the author link on my github project's README [3]. We are currently working on making W-TinyLFU adaptive by dynamically resizing the window size. I can send the draft paper privately if anyone is interested. I am a Bay Area engineer working with researchers at Israel's Technion, so the mixture has been fruitful. ** Concurrency ** As you know, LRU's O(1) nature is very attractive but the list manipulations are not friendly for multi-threading. The lock hold times are small, but the access rate is high and causes a lot of contention. The solution is to use the WAL approach: record the event into a buffer and replay it under a tryLock. This way adding to the ring buffer is a cheap CAS and threads won't block, as their work has been handed off. When the buffers fill up they are replayed under the lock in a batch, thereby isolating the policy from concurrent mutations. I use multiple, lossy ring buffers to capture reads, and replaying is delayed until beneficial. For writes, I use a ring buffer that is replayed immediately and, when full, causes back pressure (though it is rarely filled). In my benchmarks [4] the overhead is quite small on a Zipf workload (to simulate hot entries). The read throughput is 33% of a lock-free hash table and scales linearly. The write throughput is nearly the same due to contention when updating the hot entries.
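The read-buffer scheme above can be modeled in a single-threaded toy (the real implementation uses a CAS to append and a tryLock to replay; all names here are made up):

```python
class ReadBuffer:
    """Toy model of a ring buffer for read events: accesses are recorded
    cheaply, and when the buffer fills, the batch is replayed against the
    eviction policy via a callback.  This isolates the policy from
    concurrent mutation in the real, multi-threaded version; here it is
    just a single-threaded stand-in for the idea."""
    def __init__(self, capacity, on_replay):
        self.capacity = capacity
        self.events = []
        self.on_replay = on_replay   # callback applying one event to the policy

    def record(self, key):
        if len(self.events) < self.capacity:
            self.events.append(key)
        if len(self.events) >= self.capacity:
            self.drain()

    def drain(self):
        # In the real scheme this runs in a batch under a tryLock.
        for key in self.events:
            self.on_replay(key)
        self.events.clear()

replayed = []
buf = ReadBuffer(3, replayed.append)
for k in "abcd":
    buf.record(k)
# After four records with capacity 3, one batch has been replayed and
# one event is still pending in the buffer.
```

The key property is that the hot path (record) is just an append; the policy's data structures are only touched during the batched drain.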
The read throughput is high enough to meet performance budgets and its overhead is only observable in synthetic micro-benchmarks. The benefit of this approach is that for a very small penalty to reads, the policy is mostly isolated from the multi-threading. This allows efficient and complex O(1) algorithms that are not thread-safe. For example, I use W-TinyLFU (LRU+SLRU) for eviction and a Hierarchical TimerWheel for expiration. While the initial infrastructure is more work, it's been a boon due to composability, a simpler mental model, and not having to use algorithms that degrade in difficult-to-debug ways. Hope that helps, Ben [1] http://highscalability.com/blog/2016/1/25/design-of-a-modern-cache.html [2] https://github.com/ben-manes/caffeine/wiki/Simulator [3] https://github.com/ben-manes/caffeine [4] https://github.com/ben-manes/caffeine/wiki/Benchmarks -- Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html
Hi, Ben! > 11 мая 2018 г., в 21:57, ben.manes <ben.manes@gmail.com> написал(а): > > I have been working on caching as a hobby project for the last few years... Thanks for sharing this! Your message is full of insights and W-TinyLFU seems to me very promising for use in shared buffers. Best regards, Andrey Borodin.
On Wed, May 2, 2018 at 12:27:19PM -0400, Robert Haas wrote: > I've seen customer have very good luck going higher if it lets all the > data fit in shared_buffers, or at least all the data that is accessed > with any frequency. I think it's useful to imagine a series of > concentric working sets - maybe you have 1GB of the hottest data, 3GB > of data that is at least fairly hot, 10GB of data that is at least > somewhat hot, and another 200GB of basically cold data. Increasing > shared_buffers in a way that doesn't let the next "ring" fit in > shared_buffers isn't likely to help very much. If you have 8GB of > shared_buffers on this workload, going to 12GB is probably going to > help -- that should be enough for the 10GB of somewhat-hot stuff and a > little extra so that the somewhat-hot stuff doesn't immediately start > getting evicted if some of the cold data is accessed. Similarly, > going from 2GB to 4GB should be a big help, because now the fairly-hot > stuff should stay in cache. But going from 4GB to 6GB or 12GB to 16GB > may not do very much. It may even hurt, because the duplication > between shared_buffers and the OS page cache means an overall > reduction in available cache space. If for example you've got 16GB of > memory and shared_buffers=2GB, you *may* be fitting all of the > somewhat-hot data into cache someplace; bumping shared_buffers=4GB > almost certainly means that will no longer happen, causing performance > to tank. I would love to know how we can help people find out how much data is in each of these rings so they can tune shared buffers accordingly. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Tue, May 1, 2018 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> All of that having been said, maybe we have an independent low-level >> problem: we increase usage_count when we pin a buffer, even if we last >> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went >> in (and came out with ARC, too). This is more of a problem with >> miscounting the number of hits based on accidents around how things >> are structured in the executor -- we need to be more careful about >> recognizing whether or not block hits are logically distinct. >> Particularly with stuff like nested loop joins. > > I agree that double-counting correlated accesses is a a problem, and I > agree that we probably want to do something about it. I am skeptical > that using wall-clock time is the right way to solve that problem > because it leads to edge effects -- if for example there is a 5ms > timeout for bumping the usage count again, then you can probably > manufacture a workload where the buffer eviction pattern is > dramatically different for a nested loop that hits a given page every > 4.5ms than it is for a nested loop that hits a given page every 5.5ms. > Note that CART aims to solve this problem in a way that doesn't > involve any fixed amount of wall-clock time. index_fetch_heap() calls ReleaseAndReadBuffer(), which will notice when the buffer it has been asked to read and pin is the same as the buffer it also needs to release/unpin. If they're the same buffer, which is probably very common, then it simply returns the existing buffer without releasing its pin. Naturally, pinning and unpinning increments the heap page's usage_count, which is really bad when references are correlated like this, so this is a really good idea. 
(Of course, we'd also like to avoid the actual work of pinning and unpinning, but that's arguably secondary -- this ReleaseAndReadBuffer() logic went in as part of the original 2005 clock sweep commit, so it seems to actually mostly be about correlated references.) I wonder if the significant improvements that I've noticed from the pending v3 of my nbtree-unique-key-heap-tid patch for certain workloads are related to ReleaseAndReadBuffer(), and if so to what degree. It's not hard to imagine how it could be a real problem that index tuple heap TIDs are in random order. ReleaseAndReadBuffer() could end up repeatedly pinning and unpinning the same heap pages when they appear again and again out of order on an nbtree leaf page. If that's true, then it might not matter how smart our algorithm is. -- Peter Geoghegan
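The ReleaseAndReadBuffer() shortcut described above amounts to something like this toy model (illustrative Python, not the real bufmgr code; the single global usage-count dict is a simplification):

```python
class Backend:
    """Toy model of a backend holding one pinned buffer at a time.
    release_and_read() skips the unpin/re-pin (and the usage_count bump)
    when asked for the buffer it already holds -- the case that helps
    with correlated accesses from e.g. nested-loop joins and ordered
    index scans."""
    def __init__(self):
        self.pinned = None
        self.usage_count = {}     # buffer id -> usage count (simplified)

    def release_and_read(self, buf_id):
        if self.pinned == buf_id:
            return buf_id         # same buffer: no re-pin, no bump
        self.pinned = buf_id
        self.usage_count[buf_id] = self.usage_count.get(buf_id, 0) + 1
        return buf_id

b = Backend()
for buf in [9, 9, 9, 2, 9]:      # heap TIDs out of order force re-pins
    b.release_and_read(buf)
```

With the out-of-order pattern [9, 9, 9, 2, 9], buffer 9 is re-pinned and its count bumped twice; with TID-sorted accesses [9, 9, 9, 9, 2] it would be bumped only once, which is the distortion Peter suspects.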