Discussion: Just-in-time Background Writer Patch+Test Results
Tom gets credit for naming the attached patch, which is my latest attempt to finalize what has been called the "Automatic adjustment of bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but that's where it started.

Background on testing
---------------------

I decided to use pgbench for running my tests. The scripting framework to collect all that data and usefully summarize it is now available as pgbench-tools-0.2 at http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

I hope to expand and actually document use of pgbench-tools in the future but didn't want to hold the rest of this up on that work. That page includes basic information about what my testing environment was and why I felt this was an appropriate way to test background writer efficiency.

Quite a bit of raw data for all of the test sets summarized here is at http://www.westnet.com/~gsmith/content/bgwriter/

The patches attached to this message are also available at:
http://www.westnet.com/~gsmith/content/postgresql/buf-alloc-2.patch
http://www.westnet.com/~gsmith/content/postgresql/jit-cleaner.patch

(This is my second attempt to send this message, don't know why the earlier one failed; using gzip'd patches for this one and hopefully there won't be a dupe)

Baseline test results
---------------------

The first patch to apply attached to this message is the latest buf-alloc-2 that adds counters to pgstat_bgwriter for everything the background writer is doing.
Here's what we get out of the standard 8.3 background writer before and after applying that patch, at various settings:

                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 HEAD nobgwriter                    |   5 |  994 |
 HEAD+buf-alloc-2 nobgwriter        |   6 | 1012 |           0
 HEAD+buf-alloc-2 LRU=0.5%/500      |  16 |  974 |       15.94
 HEAD+buf-alloc-2 LRU=5%/500        |  19 |  983 |       98.47
 HEAD+buf-alloc-2 LRU=10%/500       |   7 |  997 |       99.95

cleaner_pct is what percentage of the writes the BGW LRU cleaner did relative to a total that includes the client backend writes; writes done by checkpoints are not included in this summary computation, it just shows the balance of backend vs. BGW writes. The /500 means bgwriter_lru_maxpages=500, which I already knew was about as many pages as this server ever dirties in a 200ms cycle. Without the buf-alloc-2 patch I don't get statistics on the LRU cleaner, I include that number as a baseline just to suggest that the buf-alloc-2 patch itself isn't pulling down results.

Here we see that in order to get most of the writes to happen via the LRU cleaner rather than having the backends handle them, you'd need to play with the settings until the bgwriter_lru_percent was somewhere between 5% and 10%, and it seems obvious that doing this doesn't improve the TPS results. The margin of error here is big enough that I consider all these basically the same performance.

The question then is how to get this high level of writes by the background writer automatically, without having to know what percentage to scan; I wanted to remove bgwriter_lru_percent, while still keeping bgwriter_lru_maxpages strictly as a way to throttle overall BGW activity.
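The cleaner_pct definition above can be written down as a small calculation. This is only a sketch of the ratio as described in the text, not code from pgbench-tools; the function name and argument names are made up for illustration, and checkpoint writes are deliberately left out of the total, as the message says.

```python
def cleaner_pct(lru_cleaner_writes, backend_writes):
    """Share of non-checkpoint buffer writes done by the BGW LRU cleaner.

    Both arguments are counts of buffers written over the same test run.
    Returns None when neither side wrote anything, since the ratio is
    undefined in that case.
    """
    total = lru_cleaner_writes + backend_writes
    if total == 0:
        return None
    return round(100.0 * lru_cleaner_writes / total, 2)


# Hypothetical counts chosen only to show the shape of the calculation.
print(cleaner_pct(1594, 8406))
```

A value near 100 means the backends almost never had to write a dirty buffer themselves; a value near 0 means the cleaner was contributing very little.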
First JIT Implementation
------------------------

The method I described in my last message on this topic ( http://archives.postgresql.org/pgsql-hackers/2007-08/msg00887.php ) implemented a weighted moving average of how many pages were allocated, and based on feedback from that I improved the code to allow a multiplier factor on top of that. Here's the summary of those results:

                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit cleaner multiplier=1.0/500     |   9 |  981 |        94.3
 jit cleaner multiplier=2.0/500     |   8 | 1005 |       99.78
 jit multiplier=1.0/100             |  10 |  985 |       68.14

That's pretty good. As long as maxpages is set intelligently, it gets most of the writes even with the multiplier of 1.0, and cranking it up to the 2.0 suggested by the original Itagaki Takahiro patch gets nearly all of them. Again, there's really no performance change here in throughput by any of this.

Coping with idle periods
------------------------

While I was basically happy with these results, the data Kevin Grittner submitted in response to my last call for commentary left me concerned. While the JIT approach works fine as long as your system is active, it does absolutely nothing if the system is idle. I noticed that a lot of the writes that were being done by the client backends were after idle periods where the JIT writer just didn't react fast enough during the ramp-up. For example, if the system went from idle for a while to full-speed just as the 200ms sleep started, by the time the BGW woke up again the backends could have needed to write many buffers already themselves.

Ideally, idle periods should be used to slowly trickle dirty pages out, so that there are fewer of them hanging around when a checkpoint shows up or so that reusable pages are already available. The question then is how fast to go about that trickle.
Heikki's background writer tests and my own suggest that if you make the rate during quiet periods too high, you'll clog the underlying buffers with some writes that end up being duplicated and lower overall efficiency. But all of those tests had the background writer going at a constant and relatively high speed.

I wanted to keep the ability to scan the entire buffer cache, using the latest idea of never looking at the same buffer twice, but to do that slowly when idle and using the JIT rate otherwise. This is sort of a hybrid of the old LRU cleaner behavior (scan a fixed %) at a low speed with the new approach (scan based on allocations, however many of them there are). I started with the old default of 0.5% used by bgwriter_lru_percent (a tunable already removed by the patch at this point) with logic to tack that onto the JIT intelligently and got these results:

                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 min scan=0.5%   |  13 |  882 |         100
 jit multiplier=1.5 min scan=0.5%   |  12 |  871 |         100
 jit multiplier=2.0 min scan=0.5%   |  11 |  910 |         100
 jit multiplier=1.0 min scan=0.25%  |  14 |  982 |       98.34

It's nice to see fully 100% of the buffers written by the cleaner with the hybrid approach; I feel that validates my idea that just a bit more work needs to be done during idle periods to completely fix the issue with it not reacting fast enough during the idle/full speed transition. But look at the drop in TPS. While I'm willing to say a couple of percent change isn't significant in a pgbench result, those <900 results are clearly bad. This is crossing that line where inefficient writes are being done. I'm happier with the result using the smaller min scan=0.25% even though it doesn't quite get every write that way.
Making percentage independent of delay
--------------------------------------

But a new problem here is that if you lower bgwriter_delay, the minimum scan percentage needs to drop too, and my goal was to reduce the number of tunables people need to tinker with. Assuming you're not stopped by the maxpages parameter, with the default delay=200ms a scan that hits 0.5% each time will scan 5*0.5%=2.5% of the buffer cache per second, which means it will take 40 seconds to scan the entire pool. Using 0.25% means 80 seconds between scans. I improved the overall algorithm a bit and decided to set this parameter an alternate way: by how long it should take to creep its way through the entire buffer cache if the JIT code is idle. I decided I liked 120 seconds as the value for that parameter, which is a slower rate than any of the above but still a reasonable one for a typical application. Here's what the results look like using that approach:

                info                | set | tps  | cleaner_pct
------------------------------------+-----+------+-------------
 jit multiplier=1.0 scan_whole=120s |  18 |  970 |       99.99
 jit multiplier=1.5 scan_whole=120s |  15 |  995 |       99.93
 jit multiplier=2.0 scan_whole=120s |  17 |  981 |       99.98

Now here are results I'm happy with. The TPS results are almost unchanged from where we started from, with minimal inefficient writes, but almost all the writes are being done by the cleaner process. The results appear much less sensitive to what you set the multiplier to. And unless you use an unreasonably low value for maxpages (which will quickly become obvious if you monitor pg_stat_bgwriter and look for maxwritten_clean increasing fast), you'll get a complete scan of the buffer cache within 2 minutes even if there's no system activity. But once that's done, until more buffers are allocated the code won't even look at the buffer cache again (as opposed to the current code, which is always looking at buffers and acquiring locks even if nothing is going on).
I think I can safely say there is a level of intelligence going into what the LRU background writer does with this patch that has never been applied to this problem before. There have been a lot of good ideas thrown out in this area, but it took a hybrid approach that included and carefully balanced all of them to actually get results that I felt were usable. What I don't know is whether that will also be true for other testers.

Patch review
------------

The attached jit-cleaner.patch implements this approach, and if you just want to look at the main code involved without having to apply the patch you can browse the BgBufferSync function in bufmgr.c starting around line 1120 at http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c

There is lots of debugging of internals dumped into the logs if you toggle on #define BGW_DEBUG , the gross summary of the two most important things that show what the code is doing are logged at DEBUG1 (but should probably be pushed lower before committing).

This code is as good as you're going to get from me before the 8.3 close. I could do some small rewriting and certainly can document all this further as part of getting this patch moved toward committed, but I'm out of resources to do too much more here.
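For readers who would rather see the shape of the idea than browse bufmgr.c, here is a rough Python sketch of how the pieces described in this message could fit together. It is not the patch's code. The function names are hypothetical, the smoothing formula is one common way to write a weighted moving average and is an assumption, and the only values taken from the message are the 16-sample smoothing period, the 120 second whole-pool scan time, the multiplier, and the maxpages cap.

```python
import math

SMOOTHING_SAMPLES = 16           # smoothing period quoted in the message
SCAN_WHOLE_POOL_SECONDS = 120.0  # idle whole-pool scan time quoted in the message


def smooth_allocations(previous_avg, recent_alloc, samples=SMOOTHING_SAMPLES):
    """Weighted moving average of buffers allocated per round (assumed form)."""
    return previous_avg + (recent_alloc - previous_avg) / samples


def min_buffers_per_round(nbuffers, delay_ms,
                          whole_pool_seconds=SCAN_WHOLE_POOL_SECONDS):
    """Buffers to scan each round so an idle system still covers the pool."""
    rounds = whole_pool_seconds * 1000.0 / delay_ms
    return math.ceil(nbuffers / rounds)


def clean_buffers_wanted(smoothed_alloc, multiplier):
    """JIT estimate: expected allocations next round, scaled by the multiplier."""
    return math.ceil(smoothed_alloc * multiplier)


def writes_this_round(dirty_reusable_found, maxpages):
    """maxpages acts purely as a throttle on how much gets written."""
    return min(dirty_reusable_found, maxpages)


# Example: 20000 buffers, 200ms delay, backends allocating about 160 per round.
avg = smooth_allocations(0.0, 160)
print(avg, clean_buffers_wanted(avg, 2.0), min_buffers_per_round(20000, 200))
```

The point of the split is that the allocation-driven estimate dominates when the system is busy, while the whole-pool figure only matters when allocations drop to zero.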
Along with the big question of whether this whole idea is worth following at all as part of 8.3, here are the remaining small questions I feel review feedback would be valuable on related to my specific code:

-The way I'm getting the passes number back from the freelist.c strategy code seems like it will eventually overflow the long I'm using for the intermediate results when I execute statements like this:

strategy_position=(long)strategy_passes * NBuffers + strategy_buf_id;

I'm not sure if the code would be better if I were to use a 64-bit integer for strategy_position instead, or if I should just rewrite the code to separate out the passes multiplication--which will make it less elegant to read but should make overflow issues go away.

-Heikki didn't like the way I pass information from SyncOneBuffer back to the background writer. The bitmask approach I'm using adds flexibility for writing more intelligent background writers in the future. I have written more complicated ones than any of the approaches mentioned here in the past, using things like the usage_count information returned, but the simpler implementation here ignores that. I could simplify this interface if I had to, but I like what I've done as a solid structure for future coding as it's written right now.

-There are two magic constants in the code:

int smoothing_samples = 16;
float scan_whole_pool_seconds = 120.0;

I believe I've done enough testing recently and in the past to say these are reasonable numbers for most installations, and high-throughput systems are going to care more about tuning the multiplier GUC than either of these. In the interest of having fewer knobs people can fool with and break, I personally don't feel like these constants need to be exposed for tuning purposes; they don't have a significant impact on how the underlying model works. Determining whether these should be exposed as GUC tunables is certainly an open question though.
-I bumped the default for bgwriter_lru_maxpages to 100 so that typical low-end systems should get an automatically tuning LRU background writer out of the box in 8.3. This is a big change from the 5 that was used in the older releases. If you keep everything at the defaults this represents a maximum theoretical write rate for the BGW of 4MB/s, which isn't very much relative to modern hardware.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
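On the overflow question raised in the message above: the arithmetic can be checked with a few lines of Python, whose integers do not overflow. This is only a back-of-the-envelope sketch, assuming a platform where a C long is 32 bits (as on 32-bit builds); the helper names are made up, and the 20000-buffer pool is just an example size.

```python
LONG32_MAX = 2**31 - 1   # largest value of a signed 32-bit long
INT64_MAX = 2**63 - 1    # largest value of a signed 64-bit integer


def strategy_position(passes, nbuffers, buf_id):
    """Same expression as in the message, evaluated without overflow."""
    return passes * nbuffers + buf_id


def passes_before_overflow(nbuffers, max_value):
    """Largest pass count for which every buffer id still fits in max_value."""
    return (max_value - (nbuffers - 1)) // nbuffers


nbuffers = 20000  # example pool size (160MB of 8kB buffers, roughly)
print(passes_before_overflow(nbuffers, LONG32_MAX))
print(passes_before_overflow(nbuffers, INT64_MAX))
```

With a 32-bit long the limit is on the order of a hundred thousand complete passes for a pool this size, which a busy server could plausibly reach; a 64-bit integer pushes the limit far beyond any practical uptime.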
>>> On Wed, Sep 5, 2007 at 10:31 PM, in message <Pine.GSO.4.64.0709052324020.25284@westnet.com>, Greg Smith <gsmith@gregsmith.com> wrote: > > -There are two magic constants in the code: > > int smoothing_samples = 16; > float scan_whole_pool_seconds = 120.0; > > I personally > don't feel like these constants need to be exposed for tuning purposes; > Determining > whether these should be exposed as GUC tunables is certainly an open > question though. If you exposed the scan_whole_pool_seconds as a tunable GUC, that would allay all of my concerns about this patch. Basically, our problems were resolved by getting all dirty buffers out to the OS cache within two seconds; any longer than that and the OS cache didn't reach its trigger point for pushing out to the controller cache in time to prevent the glut which locks everything up. I also suspect that this interval kept the OS cache more aware of frequently updated pages, so that it could avoid unnecessary physical writes under its own logic. While I'm hoping that the new checkpoint techniques will be a better solution, I can't count on that without significant testing in our environment, and I really want a fall-back. The metric you emphasized was the percentage of PostgreSQL writes to the OS cache which were handled by the background writer, which doesn't necessarily correspond to a solution to the glut, which is based on the peak number of total writes presented to the controller by the OS within a small window of time. -Kevin
On Thu, 6 Sep 2007, Kevin Grittner wrote:

> If you exposed the scan_whole_pool_seconds as a tunable GUC, that would
> allay all of my concerns about this patch. Basically, our problems were
> resolved by getting all dirty buffers out to the OS cache within two
> seconds

Unfortunately it wouldn't make my concerns about your system go away or I'd have recommended exposing it specifically to address your situation. I have been staring carefully at your configuration recently, and I would wager that you could turn off the LRU writer altogether and still meet your requirements in 8.2. Here's what you've got right now:

> shared_buffers = 160MB (=20000 buffers)
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 200
> bgwriter_all_percent = 10.0
> bgwriter_all_maxpages = 600

With the default delay of 200ms, this has the LRU-writer scanning the whole pool every 1 second, while the all-writer scans every two seconds--assuming they don't hit the write limits. If some event were to dirty the whole pool in 200ms, it might take as much as 6.7 seconds to write everything out (20000 / 600 * 200 ms) via the all-scan. The all-scan is already gone in 8.3.

Your LRU scan will take much longer than that to clear everything out. At least (20000 / 200 * 200ms) 20 seconds to clear a fully dirty cache. But in fact, it's impossible to even bound how long it will take before the LRU writer (which is the only part this new patch tries to improve) gets around to writing even a single dirty buffer no matter what bgwriter_lru_percent (8.2) or scan_whole_pool_seconds (JIT patch) is set to.

There's a second low-level issue involved here. When a page becomes dirty, that implies it was also recently used, which means the LRU writer won't touch it. That page can't be written out by the LRU writer until an entire pass has been made over the shared_buffer pool while looking for buffers to allocate for new activity.
When the allocation clock-sweep passes over the newly dirtied buffer again, its usage count will drop by one and it will no longer be considered recently used. At that point the LRU writer can write it out. So unless there is other allocation activity going on, the scan_whole_pool_seconds mechanism will never provide the bound on time to scan and write everything you hope it will. And if there's other allocations going on, the much more powerful JIT mechanism will scan the whole pool plenty fast if you bump the already exposed multiplier tunable up. In my tests where the buffer cache was filled with mostly dirty buffers that couldn't be re-used (something relatively easy to trigger with pgbench tests), I've actually watched the new code scan >90% of the buffer cache looking for those few reusable buffers in the pool in a single invocation. This would be like setting bgwriter_lru_percent=90.0 in the old configuration, but it only gets that aggressive when the distribution of pages in the buffer cache demands it, and when it has reason to believe going that fast will be helpful. The completely understandable line of thinking that led to your request here is one of my concerns with exposing scan_whole_pool_seconds as a tunable. It may suggest to people that if they set the number very low, it will assure all dirty buffers will be scanned and written within that time bound. That's certainly not the case; both the maxpages and the usage count information will actually drive the speed that mechanism plods through the buffer cache. It really isn't useful for scanning fast. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
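The scan and write-out times quoted in this reply follow from two small formulas, sketched here in Python. The function names are made up for illustration; the inputs are the 8.2 settings quoted in the message (20000 buffers, 200ms delay), and both formulas assume the writer is never slowed by anything other than the stated percentage or maxpages limit.

```python
def seconds_per_full_scan(percent_per_round, delay_ms):
    """Time to cover 100% of the pool when each round scans a fixed percent."""
    rounds = 100.0 / percent_per_round
    return rounds * delay_ms / 1000.0


def seconds_to_write_all(nbuffers, maxpages, delay_ms):
    """Lower bound on time to write a fully dirty pool at maxpages per round."""
    return nbuffers / maxpages * delay_ms / 1000.0


delay_ms = 200
print(seconds_per_full_scan(20.0, delay_ms))               # LRU writer scan
print(seconds_per_full_scan(10.0, delay_ms))               # all-writer scan
print(round(seconds_to_write_all(20000, 600, delay_ms), 1))  # all-scan write-out
print(seconds_to_write_all(20000, 200, delay_ms))          # LRU write-out
```

Note that these are best-case bounds on scanning and writing rates only; as the reply explains, the usage count can keep a dirty buffer out of the LRU writer's reach for an unbounded time regardless of these settings.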
On Thu, Sep 06, 2007 at 09:20:31AM -0500, Kevin Grittner wrote: > >>> On Wed, Sep 5, 2007 at 10:31 PM, in message > <Pine.GSO.4.64.0709052324020.25284@westnet.com>, Greg Smith > <gsmith@gregsmith.com> wrote: > > > > -There are two magic constants in the code: > > > > int smoothing_samples = 16; > > float scan_whole_pool_seconds = 120.0; > > > > > I personally > > don't feel like these constants need to be exposed for tuning purposes; > > > Determining > > whether these should be exposed as GUC tunables is certainly an open > > question though. > > If you exposed the scan_whole_pool_seconds as a tunable GUC, that would > allay all of my concerns about this patch. Basically, our problems were I like the idea of not having that as a GUC, but I'm doubtful that it can be hard-coded like that. What if checkpoint_timeout is set to 120? Or 60? Or 2000? I don't know that there should be a direct correlation, but ISTM that scan_whole_pool_seconds should take checkpoint intervals into account somehow. -- Decibel!, aka Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
>>> On Thu, Sep 6, 2007 at 11:27 AM, in message <Pine.GSO.4.64.0709061121020.14491@westnet.com>, Greg Smith <gsmith@gregsmith.com> wrote: > On Thu, 6 Sep 2007, Kevin Grittner wrote: > > I have been staring carefully at your configuration recently, and I would > wager that you could turn off the LRU writer altogether and still meet > your requirements in 8.2. I totally agree that it is of minor benefit compared to the all-writer, if it even matters at all. I knew that when I chose the settings. > Here's what you've got right now: > >> shared_buffers = 160MB (=20000 buffers) >> bgwriter_lru_percent = 20.0 >> bgwriter_lru_maxpages = 200 >> bgwriter_all_percent = 10.0 >> bgwriter_all_maxpages = 600 > > With the default delay of 200ms, this has the LRU-writer scanning the > whole pool every 1 second, Whoa! Apparently I've totally misread the documentation. I thought that the bgwriter_lru_percent was scanned from the lru end each time; I would not expect that it would ever get beyond the oldest 10%. I put that in just as a guard to keep the backends from having to wait for the OS write. I've always doubted whether it was helping, but "it wasn't broke".... > while the all-writer scans every two > seconds--assuming they don't hit the write limits. If some event were to > dirty the whole pool in 200ms, it might take as much as 6.7 seconds to > write everything out (20000 / 600 * 200 ms) via the all-scan. Right. Since the file system didn't seem to be able to accept writes faster than 800 PostgreSQL pages per second, and I wanted to leave a LITTLE slack, I set that limit. We don't seem to hit it, as far as I can tell. In fact, the output rate would be naturally fairly smooth, if not for the "hold all dirty pages until the last possible moment, then write them all to the OS and fsync" approach. > There's a second low-level issue involved here. When a page becomes > dirty, that implies it was also recently used, which means the LRU writer > won't touch it. 
That page can't be written out by the LRU writer until an > entire pass has been made over the shared_buffer pool while looking for > buffers to allocate for new activity. When the allocation clock-sweep > passes over the newly dirtied buffer again, its usage count will drop by > one and it will no longer be considered recently used. At that point the > LRU writer can write it out. How low does the count have to go, or does it track the count when it becomes dirty and look for a decrease? > So unless there is other allocation activity > going on, the scan_whole_pool_seconds mechanism will never provide the > bound on time to scan and write everything you hope it will. That may not be an issue for the environment where this has been a problem for us -- the web hits are coming in at a pretty good rate 24/7. (We have a couple dozen large companies scanning data through HTTP SOAP requests all the time.) This should keep us reading new pages, which covers this, yes? > where the buffer cache was > filled with mostly dirty buffers that couldn't be re-used That would be the condition that would be the killer with a synchronous checkpoint if the OS cache has already had some dirty pages trickled out. If we can hit this condition in our web database, either the load distributed checkpoint will save us, or we can't use 8.3. Period. > The completely understandable line of thinking that led to your request > here is one of my concerns with exposing scan_whole_pool_seconds as a > tunable. It may suggest to people that if they set the number very low, > it will assure all dirty buffers will be scanned and written within that > time bound. That's certainly not the case; both the maxpages and the > usage count information will actually drive the speed that mechanism plods > through the buffer cache. It really isn't useful for scanning fast. I'm not clear on the benefit of not writing the recently accessed dirty pages when there are no less recently used dirty pages. 
I do trust the OS to not write them before they age out in that cache, and the OS cache doesn't start writing dirty pages from its cache until they reach a certain percentage of the cache space, so I'd just as soon let the OS know that the MRU dirty pages are there, so it knows that it's time to start working on the LRU pages in its cache. -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> On Thu, Sep 6, 2007 at 11:27 AM, in message
> <Pine.GSO.4.64.0709061121020.14491@westnet.com>, Greg Smith
> <gsmith@gregsmith.com> wrote: 
>> With the default delay of 200ms, this has the LRU-writer scanning the 
>> whole pool every 1 second,
>  
> Whoa!  Apparently I've totally misread the documentation.  I thought that
> the bgwriter_lru_percent was scanned from the lru end each time; I would
> not expect that it would ever get beyond the oldest 10%.
I believe you're correct and Greg got this wrong.  I won't draw any
conclusions about whether the LRU stuff is actually doing you any good
though.
        regards, tom lane
			
On Thu, 6 Sep 2007, Kevin Grittner wrote:

> I thought that the bgwriter_lru_percent was scanned from the lru end
> each time; I would not expect that it would ever get beyond the oldest
> 10%.

You're correct; I stated that badly. What I should have said is that your LRU writer could potentially scan the pool as fast as once per second if there were enough allocations going on.

> How low does the count have to go, or does it track the count when it
> becomes dirty and look for a decrease?

The usage count has to be 0 before a page can be re-used for a new allocation, and the LRU background writer only writes out potentially reusable pages that are dirty. So the count has to be 0 before it will write it.

> This should keep us reading new pages, which covers this, yes?

One would hope. Your whole arrangement of shared_buffers, checkpoint_segments, and related parameters will need to be reconsidered for 8.3; you've got a delicately balanced arrangement for your 8.2 setup right now that's working for you, but just translating it straight to 8.3 won't get you what you want. I'll get back to the message you already sent on that subject when I get enough time to address it fully.

> I'm not clear on the benefit of not writing the recently accessed dirty
> pages when there are no less recently used dirty pages.

This presumes PostgreSQL has some notion of the balance of recently accessed vs. not accessed dirty pages, which it does not. Buffers get updated individually, and there's no mechanism summarizing what's in there; you have to scan the buffer cache yourself to figure that out. I do some of that in this new patch, tracking things like how many buffers are scanned on average to find reusable ones.
Many months ago, I wrote a very complicated re-implementation of the all-scan portion of the background writer that tracked the usage count of everything it looked at, kept statistics about how many pages were dirty at each usage count, then targeted how high of a usage count could be written given some information about what I/O rate you felt your devices could sustain. This did exactly what you're asking for here: wrote whatever dirty pages were around starting with the ones that hadn't been recently used, then worked its way up to pages with a higher usage count if the less recently used ones were all clean.

As far as I've been able to tell, and from Heikki's test results, the load distributed checkpoint was a better answer to this problem. Rather than constantly fight to get pages with high usage counts out all the time, just spread the checkpoint out instead and deal with them only then. I gave up on that branch of code while he removed the all-scan writer altogether as part of committing LDC. I suspect the path I was following was exactly what you think you'd like to have, but it seems that it's not actually needed.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 2007-09-05 at 23:31 -0400, Greg Smith wrote:

> Tom gets credit for naming the attached patch, which is my latest attempt to
> finalize what has been called the "Automatic adjustment of
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but
> that's where it started.

This is a big undertaking, so well done for going for it.

> I decided to use pgbench for running my tests. The scripting framework to
> collect all that data and usefully summarize it is now available as
> pgbench-tools-0.2 at
> http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm

For me, the main role of the bgwriter is to avoid dirty writes in backends. The purpose of doing that is to improve the response time distribution as perceived by users. I think that is what we should be measuring, perhaps in a simple way such as calculating the 90th percentile of the response time distribution. It is notoriously difficult to determine any meaning from test results by looking only at the "headline numbers", especially tps.

Looking at the tps also tempts us to run a test which maxes out the server, an area we already know and expect the bgwriter to be unhelpful in.

If I run a server at or below 70% capacity, what settings of the bgwriter help maintain my response time distribution?

> Coping with idle periods
> ------------------------
>
> While I was basically happy with these results, the data Kevin Grittner
> submitted in response to my last call for commentary left me concerned. While
> the JIT approach works fine as long as your system is active, it does
> absolutely nothing if the system is idle. I noticed that a lot of the writes
> that were being done by the client backends were after idle periods where the
> JIT writer just didn't react fast enough during the ramp-up.
For example, if > the system went from idle for a while to full-speed just as the 200ms sleep > started, by the time the BGW woke up again the backends could have needed to > write many buffers already themselves. You've hit the nail on the head there. I can't see how you can do anything sensible when the bgwriter keeps going to sleep for long periods. The bgwriter's activity curve should ideally be the same shape as a critically damped harmonic oscillator. It should wake up, lots of writing if needed, then trail off over time. The only way to do that seems to be to vary the sleep automatically, or make short sleeps. For me, the bgwriter should sleep for at most 10ms at a time. If it has nothing to do it can go straight back to sleep again. Trying to set that time is fairly difficult, so it would be better not to have to set it at all. If you've changed bgwriter so it doesn't scan if no blocks have been allocated, I don't see any reason to keep the _delay parameter at all. > I think I can safely say there is a level of intelligence going into what the > LRU background writer does with this patch that has never been applied to this > problem before. There have been a lot of good ideas thrown out in this area, > but it took a hybrid approach that included and carefully balanced all of them > to actually get results that I felt were usable. What I don't know is whether > that will also be true for other testers. I get the feeling that what we have here is better than what we had before, but I guess I'm a bit disappointed we still have 3 magic parameters, or 5 if you count your hard-coded ones also. There's still no formal way to tune these. As long as we have *any* magic parameters, we need a way to tune them in the field, or they are useless. At very least we need a plan for how people will report results during Beta. That means we need a log_bgwriter (better name, please...) parameter that provides information to assist with tuning. 
At the very least we need this to be present during Beta, if not beyond. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Fri, 7 Sep 2007, Simon Riggs wrote:

> I think that is what we should be measuring, perhaps in a simple way
> such as calculating the 90th percentile of the response time
> distribution.

I do track the 90th percentile numbers, but in these pgbench tests where I'm writing as fast as possible they're actually useless--in many cases they're *smaller* than the average response, because there are enough cases where there is a really, really long wait that they skew the average up really hard. Take a look at any of the individual test graphs and you'll see what I mean.

> Looking at the tps also tempts us to run a test which maxes out the
> server, an area we already know and expect the bgwriter to be unhelpful
> in.

I tried to turn that around and make my thinking be that if I built a bgwriter that did most of the writes without badly impacting the measure we know and expect it to be unhelpful in, that would be more likely to yield a robust design. It kept me out of areas where I might have built something that had to be disclaimed with "don't run this when the server is maxed out".

> For me, the bgwriter should sleep for at most 10ms at a time. If it has
> nothing to do it can go straight back to sleep again. Trying to set that
> time is fairly difficult, so it would be better not to have to set it at
> all.

I wanted to get this patch out there so people could start thinking about what I'd done and consider whether this still fit into the 8.3 timeline. What I'm doing myself right now is running tests with a much lower setting for the delay time--am testing 20ms right now. I personally would be happy saying it's 10ms and that's it. Is anyone using a time lower than that right now? I seem to recall that 10ms was also the shortest interval Heikki used in his tests as well.

> I get the feeling that what we have here is better than what we had
> before, but I guess I'm a bit disappointed we still have 3 magic
> parameters, or 5 if you count your hard-coded ones also.
I may be able to eliminate more of them, but I didn't want to take them out before beta. If it can be demonstrated that some of these parameters can be set to specific values and still work across a wider range of applications than what I've tested, then there's certainly room to fix some of these, which actually makes some things easier. For example, I'd be more confident fixing the weighted average smoothing period to a specific number if I knew the delay was fixed, and there's two parameters gone. And the multiplier is begging to be eliminated, just need some more data to confirm that's true. > There's still no formal way to tune these. As long as we have *any* > magic parameters, we need a way to tune them in the field, or they are > useless. At very least we need a plan for how people will report results > during Beta. That means we need a log_bgwriter (better name, please...) > parameter that provides information to assist with tuning. Once I got past the "does it work?" stage, I've been doing all the tuning work using a before/after snapshot of pg_stat_bgwriter data during a representative snapshot of activity and looking at the delta. Been a while since I actually looked into the logs for anything. It's very straightforward to put together a formal tuning plan using the data in there, particularly compared to the the impossibility of creating such a plan in the current code. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
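The point about the 90th percentile landing below the average can be illustrated with a small worked example. This is only a sketch with made-up latencies (95 fast transactions and 5 multi-second stalls); none of these numbers come from the test runs discussed here.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical latencies in milliseconds, not measured data:
# 95 transactions finish in 1 ms, 5 are stuck behind a 5 second stall.
latencies = sorted([1.0] * 95 + [5000.0] * 5)

mean_ms = statistics.mean(latencies)
p90_ms = percentile(latencies, 90)

print(f"mean = {mean_ms:.2f} ms, p90 = {p90_ms:.2f} ms")
```

With a tail this heavy the stalls sit entirely above the 90th percentile, so they pull the mean up without moving the percentile at all.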
On Fri, 2007-09-07 at 11:48 -0400, Greg Smith wrote:
> On Fri, 7 Sep 2007, Simon Riggs wrote:
>
> > I think that is what we should be measuring, perhaps in a simple way
> > such as calculating the 90th percentile of the response time
> > distribution.
>
> I do track the 90th percentile numbers, but in these pgbench tests where
> I'm writing as fast as possible they're actually useless--in many cases
> they're *smaller* than the average response, because there are enough
> cases where there is a really, really long wait that they skew the average
> up really hard. Take a look at any of the individual test graphs and
> you'll see what I mean.
I've looked at the graphs now, but I'm not any wiser, I'm very sorry to say. We need something like a frequency distribution curve, not just the actual times. Bottom line is we need a good way to visualise the detailed effects of the patch.
I think we should do some more basic tests to see where those outliers come from. We need to establish a clear link between number of dirty writes and response time. If there is one, which we all believe, then it is worth minimising those with these techniques. We might just be chasing the wrong thing.
Perhaps output the number of dirty blocks written on the same line as the output of log_min_duration_statement so that we can correlate response time to dirty-block-writes on that statement.
For me, we can enter Beta while this is still partially in the air. We won't be able to get this right without lots of other feedback. So I think we should concentrate now on making sure we've got the logging in place so we can check whether your patch works when it's out there. I'd say let's include what you've done and then see how it works during Beta. We've been trying to get this right for years now, so we have to allow some slack to make sure we get this right. We can reduce or strip out logging once we go RC.
--
Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
On Fri, 7 Sep 2007, Simon Riggs wrote:
> I think we should do some more basic tests to see where those outliers
> come from. We need to establish a clear link between number of dirty
> writes and response time.
With the test I'm running, which is specifically designed to aggravate this behavior, the outliers on my system come from how Linux buffers writes. I can adjust them a bit by playing with the parameters as described at http://www.westnet.com/~gsmith/content/linux-pdflush.htm but on the hardware I've got here (single 7200RPM disk for database, another for WAL) they don't move much. Once /proc/meminfo shows enough Dirty memory that pdflush starts blocking writes, game over; you're looking at multi-second delays before my plain old IDE disks clear enough debris out to start responding to new requests even with the Areca controller I'm using.
> Perhaps output the number of dirty blocks written on the same line as
> the output of log_min_duration_statement so that we can correlate
> response time to dirty-block-writes on that statement.
On Linux at least, I'd expect this won't reveal much. There, the interesting correlation is with how much dirty data is in the underlying OS buffer cache. And exactly how that plays into things is a bit strange sometimes. If you go back to Heikki's DBT2 tests with the background writer schemes he tested, he got frustrated enough with that disconnect that he wrote a little test program just to map out the underlying weirdness: http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php
I've confirmed his results on my system and done some improvements to that program myself, but pushed further work on it to the side to finish up the main background writer task instead. I may circle back to that. I'd really like to run all this on another OS as well (I have Solaris 10 on my server box but not fully set up yet), but I can only volunteer so much time to work on all this right now.
If there's anything that needs to be looked at more carefully during tests in this area, it's getting more data about just what the underlying OS is doing while all this is going on. Just the output from vmstat/iostat is very informative. Those using DBT2 for their tests get some nice graphs of this already. I've done some pgbench-based tests that included that before that were very enlightening but sadly that system isn't available to me anymore.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, 7 Sep 2007, Simon Riggs wrote:
> For me, the bgwriter should sleep for at most 10ms at a time.
Here are the results I got when I pushed the time down significantly from 
the defaults, with some of the earlier results for comparison:
                      info                       | set | tps  | cleaner_pct
-------------------------------------------------+-----+------+-------------
 jit multiplier=2.0 scan_whole=120s delay=200ms  |  17 |  981 |       99.98
 jit multiplier=1.0 scan_whole=120s delay=200ms  |  18 |  970 |       99.99
 
 jit multiplier=1.0 scan_whole=120s delay=20ms   |  20 |  956 |       92.34
 jit multiplier=2.0 scan_whole=120s delay=20ms   |  21 |  967 |       99.94
 
 jit multiplier=1.5 scan_whole=120s delay=10ms   |  22 |  944 |       97.91
 jit multiplier=2.0 scan_whole=120s delay=10ms   |  23 |  981 |        99.7
 
It seems I have to push the multiplier higher to get good results when 
using a much lower interval, which was expected, but the fundamentals all 
scale down to running much faster the way I'd hoped.
I'm tempted to make the default 10ms, adjust some of the other constants 
just a bit to optimize better for that time scale:  make the default 
multiplier 2.0, increase the weighted average sample period, and perhaps 
reduce scan_whole a bit because that's barely doing anything at 10ms.  If 
no one discovers any problems with working that way during beta, then 
consider locking them in for the RC.  That would leave just the multiplier 
and maxpages as the exposed tunables, and it's very easy to tune maxpages 
just by watching pg_stat_bgwriter.  This would obviously be a very 
aggressive plan--it would be eliminating GUCs and reducing flexibility for 
people in the field, aiming instead at making this more automatic for the 
average case.
If anyone has a reason why they feel the bgwriter_delay needs to be a 
tunable or why the rate might need to run even faster than 10ms, now would 
be a good time to say why.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
			
Greg Smith <gsmith@gregsmith.com> writes:
> If anyone has a reason why they feel the bgwriter_delay needs to be a
> tunable or why the rate might need to run even faster than 10ms, now would
> be a good time to say why.
You'd be hard-wiring the thing to wake up 100 times per second? Doesn't sound like a good plan from here. Keep in mind that not everyone wants their machine to be dedicated to Postgres, and some people even would like their CPU to go to sleep now and again.
I've already gotten flak about the current default of 200ms:
https://bugzilla.redhat.com/show_bug.cgi?id=252129
I can't imagine that folk with those types of goals will tolerate an un-tunable 10ms cycle.
In fact, given the numbers you show here, I'd say you should leave the default cycle time at 200ms. The 10ms value is eating way more CPU and producing absolutely no measured benefit relative to 200ms...
        regards, tom lane
On Sat, 8 Sep 2007, Tom Lane wrote:
> I've already gotten flak about the current default of 200ms:
> https://bugzilla.redhat.com/show_bug.cgi?id=252129
> I can't imagine that folk with those types of goals will tolerate an
> un-tunable 10ms cycle.
That's the counter-example for why lowering the default is unacceptable I was looking for. Scratch bgwriter_delay off the list of things that might be fixed to a specific value. Will return to the drawing board to figure out a way to incorporate what I've learned about running at 10ms into a tuning plan that still works fine at 200ms or higher. The good news as far as I'm concerned is that I haven't had to adjust the code so far, just tweak the existing knobs.
> In fact, given the numbers you show here, I'd say you should leave the
> default cycle time at 200ms. The 10ms value is eating way more CPU and
> producing absolutely no measured benefit relative to 200ms...
My server is a bit underpowered to run at 10ms and gain anything when doing a stress test like this; I was content that it didn't degrade performance significantly, that was the best I could hope for. I would expect the class of systems that Simon and Heikki are working with could show significant benefit from running the BGW that often.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>> In fact, given the numbers you show here, I'd say you should leave the 
>> default cycle time at 200ms.  The 10ms value is eating way more CPU and 
>> producing absolutely no measured benefit relative to 200ms...
> My server is a bit underpowered to run at 10ms and gain anything when 
> doing a stress test like this; I was content that it didn't degrade 
> performance significantly, that was the best I could hope for.  I would 
> expect the class of systems that Simon and Heikki are working with could 
> show significant benefit from running the BGW that often.
Quite possibly.  So it sounds like we still need to expose
bgwriter_delay as a tunable.
It might be interesting to consider making the delay auto-tune: if you
wake up and find nothing (much) to do, sleep longer the next time,
conversely shorten the delay when work picks up.  Something for 8.4,
though, at this point.
        regards, tom lane
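The auto-tuning idea above can be sketched in a few lines. This is only an illustration, not anything from the patch: the halve/double policy, the 10ms floor, the 200ms ceiling, and the names `next_delay_ms` and `simulate` are all assumptions chosen for the example.

```python
def next_delay_ms(current_ms, buffers_written, min_ms=10, max_ms=200,
                  busy_threshold=1):
    """Pick the next sleep time from how much work the last cycle found.

    Hypothetical policy: halve the delay after a busy cycle, double it
    after an idle one, and clamp the result to [min_ms, max_ms].
    """
    if buffers_written >= busy_threshold:
        return max(min_ms, current_ms // 2)
    return min(max_ms, current_ms * 2)


def simulate(buffers_per_cycle, start_ms=200):
    """Return the delay chosen after each cycle of a made-up workload."""
    delay = start_ms
    history = []
    for written in buffers_per_cycle:
        delay = next_delay_ms(delay, written)
        history.append(delay)
    return history


# Idle, then a burst of five busy cycles, then idle again.
workload = [0, 0, 50, 50, 50, 50, 50, 0, 0, 0, 0, 0, 0]
print(simulate(workload))
```

Note that a policy like this needs several cycles to ramp down from the ceiling, so a sudden burst still sees the long delay first.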
"Greg Smith" <gsmith@gregsmith.com> writes:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> I've already gotten flak about the current default of 200ms:
>> https://bugzilla.redhat.com/show_bug.cgi?id=252129
>> I can't imagine that folk with those types of goals will tolerate an
>> un-tunable 10ms cycle.
>
> That's the counter-example for why lowering the default is unacceptable I was
> looking for. Scratch bgwriter_delay off the list of things that might be fixed
> to a specific value.
Ok, time for the obligatory contrarian voice here. It's all well and good to aim to eliminate GUC variables but I don't think it's productive to do so by simply hard-wiring them.
Firstly that doesn't really make life any easier than simply finding good defaults and documenting that DBAs probably shouldn't be bothering to tweak them.
Secondly it's unlikely to work. The variables under consideration may have reasonable defaults but they're not likely to have defaults that will work in every case. This example is pretty typical. There aren't many variables that will have a reasonable default which will work for both an interactive desktop where Postgres is running in the background and Sun's 1000+ process benchmarks.
What I think is more likely to work is looking for ways to make these variables auto-tuning. That eliminates the knob not by just hiding it away and declaring it doesn't exist but by architecting the system so that there really is no knob that might need tweaking.
Perhaps what would work better here is having a semaphore which bgwriter sleeps on which backends wake up whenever the clock sweep hand completes a cycle. Or gets within a certain fraction of a cycle of catching up.
Or perhaps bgwriter shouldn't be adjusting the number of pages it processes at all and instead it should only be adjusting the sleep time. So it would always process a full cycle for example but adjust the sleep time based on what percentage of the cycle the backends used up in the last sleep time.
--
Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Sat, 8 Sep 2007, Tom Lane wrote:
> It might be interesting to consider making the delay auto-tune: if you
> wake up and find nothing (much) to do, sleep longer the next time,
> conversely shorten the delay when work picks up. Something for 8.4,
> though, at this point.
I have a couple of pages of notes on how to tune the delay automatically. The tricky part is applications that go from 0 to full speed with little warning; the first few seconds of the stock market open come to mind. What I was working toward was considering what you set the delay to as a steady-state value, and then the delay cranks downward as activity levels go up. As activity dies off, it slowly returns to the default again.
But I realized that I needed to get all this other stuff working, all the statistics counters exposed usefully, and then collect a lot more data before I could implement that plan. Definitely something that might fit into 8.4, completely impossible for 8.3.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 6 Sep 2007, Decibel! wrote:
> I don't know that there should be a direct correlation, but ISTM that
> scan_whole_pool_seconds should take checkpoint intervals into account
> somehow.
Any direct correlation is weak at this point. The LRU cleaner has a small impact on checkpoints, in that it's writing out buffers that may make the checkpoint quicker. But this particular write trickling mechanism is not aimed directly at flushing the whole pool; it's more about smoothing out idle periods a bit.
Also, computing the checkpoint interval is itself tricky. Heikki had to put some work into getting something that took into account both the timeout and segments mechanisms to gauge progress, and I'm not sure I can directly re-use that because it's really only doing that while the checkpoint is active. I'm not saying it's a bad idea to have the expected interval as an input to the model, just that it's not obvious to me how to do it and whether it would really help.
> I like the idea of not having that as a GUC, but I'm doubtful that it
> can be hard-coded like that. What if checkpoint_timeout is set to 120?
> Or 60? Or 2000?
Someone using 60 or 120 has checkpoint problems way bigger than the LRU cleaner can be expected to help with. How fast the reusable buffers it can write are pushed out is the least of their problems. Also, I'd expect that the only cases using such a low value for a good reason are doing so because they have enormous amounts of activity on their system, and in that case the primary JIT mechanism should dominate how the LRU cleaner treats them. scan_whole_pool_seconds doesn't do anything if the primary mechanism was already planning to scan more buffers than it aims for.
Someone who has very infrequent checkpoints and therefore low activity, like your 2000 case, can expect that the LRU cleaner will lap and catch up to the strategy point about 2 minutes after any activity and then follow directly behind it with the way I've set this up. If that's cleaning the buffer cache too aggressively, I think those in that situation would be better served by constraining the maxpages parameter; that's directly adjusting what I'd expect their real issue is, how fast pages can flush to disk, rather than the secondary one of how fast the pool is being scanned.
I picked 2 minutes for that value because it's as slow as I can make it and still serve its purpose, while not feeling to me like it's too fast for a relatively idle system even if someone set maxpages=1000.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
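To get a feel for how scan_whole_pool_seconds interacts with the delay, here is a back-of-the-envelope sketch. It assumes the minimum scan per cycle is simply the pool size divided by the number of cycles that fit in scan_whole_pool_seconds; the exact formula in the patch is not quoted in this thread, and the pool size of 32768 buffers is a made-up example value.

```python
def min_scan_buffers(nbuffers, delay_ms, scan_whole_pool_seconds=120.0):
    """Buffers to scan per cycle so the whole pool is covered in time.

    Assumed formula for illustration only:
    pool size / (scan_whole_pool_seconds / delay).
    """
    cycles = scan_whole_pool_seconds * 1000.0 / delay_ms
    return int(nbuffers / cycles)


# Hypothetical pool of 32768 buffers (256MB with 8kB pages).
NBUFFERS = 32768

for delay_ms in (200, 20, 10):
    print(delay_ms, "ms ->", min_scan_buffers(NBUFFERS, delay_ms), "buffers/cycle")
```

Under these assumptions the per-cycle minimum shrinks to a couple of buffers at a 10ms delay, so the primary just-in-time estimate will almost always ask for more than this floor.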
Greg Smith wrote:
> On Sat, 8 Sep 2007, Tom Lane wrote:
>
>> It might be interesting to consider making the delay auto-tune: if you
>> wake up and find nothing (much) to do, sleep longer the next time,
>> conversely shorten the delay when work picks up. Something for 8.4,
>> though, at this point.
>
> I have a couple of pages of notes on how to tune the delay automatically.
> The tricky part is applications that go from 0 to full speed with little
> warning; the first few seconds of the stock market open come to mind.
Maybe have the backends send a signal to bgwriter when they see it sleeping and are overwhelmed by work. That way, bgwriter can sleep for a few seconds, safe in the knowledge that somebody else will wake it up if needed sooner. The way backends would detect that bgwriter is sleeping is that bgwriter would keep an atomic flag in shared memory, and it gets set only if it's going to sleep for long (so if it's going to sleep for (say) 100ms or less, it doesn't set the flag, so the backends won't signal it).
In order to avoid a huge amount of signals when all backends suddenly start working at the same instant, have the signal itself be sent only by the first backend that manages to LWLockConditionalAcquire a lwlock that's only used for that purpose.
--
Alvaro Herrera http://www.CommandPrompt.com/ PostgreSQL Replication, Consulting, Custom Development, 24x7 support
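The wakeup scheme described above can be modelled roughly as follows. This is a toy sketch in a single process, not PostgreSQL code: the class and method names are hypothetical, a `threading.Event` stands in for the signal, and a non-blocking `Lock.acquire` stands in for LWLockConditionalAcquire.

```python
import threading


class BgWriterWakeup:
    """Toy model of the proposal; every name here is hypothetical."""

    LONG_SLEEP_S = 0.1  # sleeps of 100ms or less are not advertised

    def __init__(self):
        self.sleeping_long = False            # stand-in for the shared flag
        self.signal_lock = threading.Lock()   # stand-in for the dedicated lwlock
        self.wakeup = threading.Event()       # stand-in for the signal
        self.signals_sent = 0

    def prepare_sleep(self, seconds):
        """bgwriter side: set the flag only when the sleep is a long one."""
        self.wakeup.clear()
        self.sleeping_long = seconds > self.LONG_SLEEP_S

    def sleep(self, seconds):
        """bgwriter side: returns True if a backend cut the sleep short."""
        woken_early = self.wakeup.wait(timeout=seconds)
        self.sleeping_long = False
        return woken_early

    def backend_overwhelmed(self):
        """backend side: returns True if this backend sent the signal."""
        if not self.sleeping_long:
            return False
        if not self.signal_lock.acquire(blocking=False):
            return False  # another backend is already signalling
        try:
            if not self.sleeping_long:
                return False
            self.sleeping_long = False
            self.signals_sent += 1
            self.wakeup.set()
            return True
        finally:
            self.signal_lock.release()
```

In this model only the first backend that finds the flag set and wins the conditional acquire sends the wakeup; everyone else returns immediately without blocking.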
On Sat, 8 Sep 2007, Greg Smith wrote:
> Here's the results I got when I pushed the time down significantly from the 
> defaults
>                     info                      | set | tps  | cleaner_pct
> -----------------------------------------------+-----+------+-------------
> jit multiplier=1.0 scan_whole=120s delay=20ms |  20 |  956 |       92.34
> jit multiplier=2.0 scan_whole=120s delay=20ms |  21 |  967 |       99.94
>
> jit multiplier=1.5 scan_whole=120s delay=10ms |  22 |  944 |       97.91
> jit multiplier=2.0 scan_whole=120s delay=10ms |  23 |  981 |        99.7
> It seems I have to push the multiplier higher to get good results when using 
> a much lower interval
Since I'm not exactly overwhelmed processing field reports, I've continued 
this line of investigation myself...increasing the multiplier to 3.0 got 
me another nine on the buffers written by the LRU BGW without a 
significant change in performance:
                     info                      | set | tps  | cleaner_pct
-----------------------------------------------+-----+------+-------------
jit multiplier=3.0 scan_whole=120s delay=10ms  |  24 |  967 | 99.95
After thinking for a bit about why the 10ms case wasn't working so well 
without a big multiplier, I considered that the default moving average 
smoothing makes the sample period so short at this delay 
(10ms * 16 = 160ms) that it's unlikely to cover a typical pause that 
one might want to smooth over.  My initial thinking was to increase the 
period of the smoothing so that it's of similar length to the default case 
even when the interval goes down, but that didn't really improve anything 
(note that the 16 case here is the default setup with just the delay at 
10ms, which was a missing piece of data from the above as well--I only 
tested with larger multipliers above at 10ms):
                     info                     | set | tps  | cleaner_pct
----------------------------------------------+-----+------+-------------
 jit multiplier=1.0 delay=10ms smoothing=16   |  27 |  982 |        89.4
 jit multiplier=1.0 delay=10ms smoothing=64   |  26 |  946 |       89.55
 jit multiplier=1.0 delay=10ms smoothing=320  |  25 |  970 |       89.53
 
What I realized is that after rounding the number of buffers to an 
integer, dividing a very short period of activity by the smoothing 
constant was resulting in the smoothing value usually dropping to 0 and 
not doing much.  This made me wonder how much the weighted average 
smoothing was really doing in the default case.  I put that code in months 
ago and I hadn't looked recently at its effectiveness.  Here's a 
comparison:
                     info                     | set | tps  | cleaner_pct
----------------------------------------------+-----+------+-------------
 jit multiplier=1.0 delay=200ms smoothing=16  |  18 |  970 |       99.99
 jit multiplier=1.0 delay=200ms smoothing=off |  28 |  957 |       97.16
 
All this data supports my suggestion that the exact value of the smoothing 
period constant isn't really a critical one.  It appears moderately 
helpful to have that logic on in some cases and the default value doesn't 
seem to hurt the cases where I'd expect it to be the least effective. 
Tuning the multiplier is much more powerful and useful than ever touching 
this constant.  I could probably even pull the smoothing logic out 
altogether, at the cost of increasing the burden of correctly tuning the 
multiplier on the administrator.  So far it looks like it's reasonable 
instead to leave it as an untunable to help the default configuration, and 
I'll just add a documentation note that if you decrease the interval 
you'll probably have to increase the multiplier.
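The integer rounding effect described above can be reproduced with a small model. This is a sketch under assumptions: the update rule `smoothed += (recent - smoothed) / samples` is a guess at the general shape of the weighted average rather than the patch's exact code, and the per-cycle allocation counts (5 buffers per 10ms cycle, 100 per 200ms cycle) are invented.

```python
def smooth_int(allocs, samples=16):
    """Integer moving average; int() truncates toward zero like C division."""
    smoothed = 0
    for recent in allocs:
        smoothed += int((recent - smoothed) / samples)
    return smoothed


def smooth_float(allocs, samples=16):
    """Same update rule carried out in floating point."""
    smoothed = 0.0
    for recent in allocs:
        smoothed += (recent - smoothed) / samples
    return smoothed


# Hypothetical steady workloads, 100 cycles each.
short_cycles = [5] * 100     # few allocations seen per 10ms cycle
long_cycles = [100] * 100    # many allocations seen per 200ms cycle

print("10ms  int:", smooth_int(short_cycles), " float:", round(smooth_float(short_cycles), 2))
print("200ms int:", smooth_int(long_cycles), " float:", round(smooth_float(long_cycles), 2))
```

When each cycle sees fewer buffers than the smoothing constant, every integer increment truncates to zero and the average never leaves 0, while the floating-point version converges on the true rate.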
After going through this, the extra data gives more useful baselines to do 
a similar sensitivity analysis of the other item that's untunable in the 
current patch:
    float       scan_whole_pool_seconds = 120.0;
But I'll be travelling for the next week and won't have time to look into 
that myself until I get back.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
			
It was suggested to me today that I should clarify how others should be able to test this patch themselves by writing a sort of performance reviewer's guide; that information has been scattered among material covering development. That's what you'll find below. Let me know if any of it seems confusing and I'll try to clarify. I'll be checking my mail and responding intermittently while I'm away, just won't be able to run any tests myself until next week.
The latest version of the background writer code that I've been reporting on is attached to the first message in this thread: http://archives.postgresql.org/pgsql-hackers/2007-09/msg00214.php
I haven't found any reason so far to update that code, the existing exposed tunables still appear sufficient for all the situations I've found.
Track Buffer Allocations and Cleaner Efficiency
-----------------------------------------------
First you apply the patch inside buf-alloc-2.patch.gz, which adds several entries to pg_stat_bgwriter; it applied cleanly to HEAD at the point when I generated it. I'd suggest testing that one to collect baseline information with the current background writer, and to confirm that the overhead of tracking the buffer allocations by itself doesn't cause a performance hit, before applying the second patch. I keep two clusters going on the same port, one with just buf-alloc-2, one with both patches, to be able to make such comparisons, only having one active at a time. You'll need to run initdb to create a database with the new stats in it after applying the patch.
What I've been doing to test the effectiveness of any LRU background writer method using this patch is take a before/after snapshot of pg_stat_bgwriter. Then I compute the delta during the test run in order to figure what percentage of buffers were written by the background writer vs. the client backends; that's the number I'm reporting as cleaner_pct in my tests. Here is an example of how to compute that against all transactions in pg_stat_bgwriter:
    select round(buffers_clean * 10000 / (buffers_backend + buffers_clean)) / 100
      as cleaner_pct from pg_stat_bgwriter;
You should also monitor maxwritten_clean to make sure you've set bgwriter_lru_maxpages high enough that it's not limiting writes. You can always turn the background writer off by setting maxpages to 0 (it's the only way to do so after applying the below patch).
For reference, the exact code I'm using to save the deltas and compute everything is available within pgbench-tools-0.2 at http://www.westnet.com/~gsmith/content/postgresql/pgbench-tools.htm
The code inside the benchwarmer script uses a table called test_bgwriter (schema in init/resultdb.sql), populates it before the test, then computes the delta afterwards. bufsummary.sql generates the results I've been putting in my messages. I assume there's a cleaner way to compute just these numbers by resetting the statistics before the test instead, but that didn't fit into what I was working towards.
New Background Writer Logic
---------------------------
The second patch in jit-cleaner.patch.gz applies on top of buf-alloc-2. It modifies the LRU background writer with the just-in-time logic as I described in the message the patches were attached to. The main tunable there is bgwriter_lru_multiplier, which replaces bgwriter_lru_percent. The effective range seems to be 1.0 to 3.0. You can take an existing 8.3 postgresql.conf, rename bgwriter_lru_percent to bgwriter_lru_multiplier, adjust the value to be in the right range, and then it will work with this patched version.
For comparing the patched vs. original BGW behavior, I've taken to keeping definitions for both variables in a common postgresql.conf, and then I just comment/uncomment the one I need based on which version I'm running:
    bgwriter_lru_multiplier = 1.0
    #bgwriter_lru_percent = 5
The main thing I've noticed so far is that as you decrease bgwriter_delay from the default of 200ms, the multiplier has needed to be larger to maintain the same cleaner percentage in my tests.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
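The before/after delta computation described above can be sketched as follows. This is not the pgbench-tools code; the function name and the snapshot values are made up, and only the buffers_clean and buffers_backend column names are taken from the message.

```python
def cleaner_pct(before, after):
    """Percentage of non-checkpoint writes done by the LRU cleaner.

    `before` and `after` are dicts holding the buffers_clean and
    buffers_backend counters from two pg_stat_bgwriter snapshots.
    Returns None when nothing was written between the snapshots.
    """
    clean = after["buffers_clean"] - before["buffers_clean"]
    backend = after["buffers_backend"] - before["buffers_backend"]
    total = clean + backend
    if total == 0:
        return None
    return round(100.0 * clean / total, 2)


# Hypothetical snapshots taken before and after a test run.
before = {"buffers_clean": 1000, "buffers_backend": 500}
after = {"buffers_clean": 10998, "buffers_backend": 502}

print(cleaner_pct(before, after))
```

Working from the delta rather than the raw counters keeps earlier activity on the cluster from diluting the result for the run being measured.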
Greg Smith <gsmith@gregsmith.com> writes:
> Tom gets credit for naming the attached patch, which is my latest attempt to 
> finalize what has been called the "Automatic adjustment of 
> bgwriter_lru_maxpages" patch for 8.3; that's not what it does anymore but 
> that's where it started.
I've applied this patch with some revisions.
> -The way I'm getting the passes number back from the freelist.c
> strategy code seems like it will eventually overflow
Yup ... I rewrote that.  I also revised the collection of backend-write
count events, which didn't seem to me to be something the freelist.c
code should have anything to do with.  It turns out that we can count
them with essentially no overhead by attaching the counter to
the existing fsync-request reporting machinery.
> -Heikki didn't like the way I pass information back from SyncOneBuffer
> back to the background writer.
I didn't either --- it was too complicated and not actually doing
anything useful.  I simplified it down to the two bits that were being
used.  We can always add more as needed, but since this routine isn't
even exported, I see no need to make it do more than the known callers
need it to do.
I did some marginal tweaking to the way you were doing the moving
averages --- in particular, use a float to avoid strange roundoff
behavior and force the smoothed_alloc average up when a new peak
occurs, instead of only letting it affect the behavior for one
cycle.
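The two moving-average tweaks mentioned above (a float accumulator, and jumping straight up to a new peak) might look roughly like this. It is a sketch of the idea only, not the committed code; the function name and the default of 16 samples are assumptions.

```python
def update_smoothed_alloc(smoothed_alloc, recent_alloc, smoothing_samples=16):
    """Advance a float moving average of buffer allocations by one cycle.

    A new peak replaces the average immediately; otherwise the average
    decays toward the recent value by 1/smoothing_samples of the gap.
    """
    if recent_alloc >= smoothed_alloc:
        return float(recent_alloc)
    return smoothed_alloc + (recent_alloc - smoothed_alloc) / smoothing_samples


# A burst followed by idle cycles: the average jumps up at once,
# then drifts back down gradually instead of dropping after one cycle.
smoothed = 10.0
for recent in (100, 0, 0, 0):
    smoothed = update_smoothed_alloc(smoothed, recent)
    print(recent, round(smoothed, 2))
```

Keeping the value as a float also avoids the case where small per-cycle changes round away to nothing.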
Also, I set the default value of bgwriter_lru_multiplier to 2.0,
as 1.0 seemed to be leaving too many writes to the backends in my
testing.  That's something we can play with during beta when we'll
have more testing resources available.
I did some other cleanup in BgBufferSync too, like trying to reduce
the chattiness of the debug output, but I don't believe I made any
fundamental change in your algorithm.
Nice work --- thanks for seeing it through!
        regards, tom lane
			
On Tue, 25 Sep 2007, Tom Lane wrote:
>> -Heikki didn't like the way I pass information back from SyncOneBuffer
>> back to the background writer.
> I didn't either --- it was too complicated and not actually doing
> anything useful.
I suspect someone (possibly me) may want to put back some of that same additional complication in the future, but I'm fine with it not being there yet. The main thing I wanted accomplished was changing the return to a bitmask of some sort and that's there now; adding more data to that interface later is at least easier now.
> Also, I set the default value of bgwriter_lru_multiplier to 2.0,
> as 1.0 seemed to be leaving too many writes to the backends in my
> testing.
The data I've collected since originally submitting the patch agrees that 2.0 is probably a better default as well.
I should have time to take an initial stab this week at updating the documentation to reflect what's now been committed, and to see how this stacks on top of HOT running pgbench on my test system.
--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD