Discussion: BufferSync and bgwriter
The idea that bgwriter smooths out the response time of transactions is only true if the buffer lists T1 and T2 have *some* clean buffers available for use when performing I/O. The alternative is that transactions unlucky enough to encounter the no-clean-buffers situation have to clean a space for themselves, effectively making the bgwriter redundant.

In BufferSync, we start off by calling StrategyDirtyBufferList to make a list of all the dirty buffers. Even though we know we are limited to maxpages, we still scan the whole of shared_buffers (...making it a very expensive call and thereby causing us to increase bgwriter_delay, which then negates the cleaning effect as described above). Once we've got the list, we limit ourselves to only using maxpages of the list that we just built.

We do it that way round to allow bgwriter_percent to calculate how many of the dirty buffers it should flush, on the assumption that percent < 100.

If the bgwriter_percent = 100, then we should actually do the sensible thing and prepare the list that we need, i.e. limit StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Thus if you have a large shared_buffers, you can still have a relatively frequent bgwriter_delay, so that the bgwriter can keep the LRUs of the T1 and T2 lists free for use...and so let backends get on with useful work.

Patch which implements this attached, for discussion.

Mark, any chance we could run this patch on STP to test whether it has a beneficial performance effect? Re-run test 207 to compare?

I'll be asking for this in 8.0, if it works, for all the same performance reasons discussed previously as well as coming under the header of "bgwriter default changes", since this affects the default behaviour when bgwriter_percent=100.

There are some other ideas for 8.1, but that can wait.

--
Best Regards, Simon Riggs
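As a rough illustration of the idea in the message above (this is not the attached patch), the C sketch below stops collecting as soon as the requested number of dirty buffers has been found. The flat array and the name `collect_dirty_capped` are assumptions made for illustration only; the real code walks the T1 and T2 lists while holding BufMgrLock.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the buffer pool: dirty[i] is nonzero when the
 * buffer at position i (0 = LRU end) is dirty.
 *
 * Collects the positions of dirty buffers, stopping once max_wanted have
 * been found.  Returns the number collected; *visited reports how many
 * buffers had to be inspected, which is a proxy for lock hold time.
 */
static size_t
collect_dirty_capped(const int *dirty, size_t nbuffers,
                     size_t *out, size_t max_wanted, size_t *visited)
{
    size_t found = 0;
    size_t i;

    assert(dirty != NULL && out != NULL && visited != NULL);

    for (i = 0; i < nbuffers && found < max_wanted; i++)
    {
        if (dirty[i])
            out[found++] = i;
    }

    *visited = i;
    return found;
}
```

Passing `nbuffers` as `max_wanted` reproduces the whole-pool scan described in the message; passing a small cap ends the scan early whenever enough dirty buffers sit near the LRU end.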
Attachments
I wonder if we even need to retain the bgwriter_percent GUC var. Is there actually a situation in which the combination of bgwriter_maxpages and bgwriter_delay does not give the DBA sufficient flexibility in tuning bgwriter behavior?

Simon Riggs wrote:
> If the bgwriter_percent = 100, then we should actually do the sensible
> thing and prepare the list that we need, i.e. limit
> StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Is the plan to make bgwriter_percent = 100 the default setting?

-Neil
On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> Simon Riggs wrote:
> > If the bgwriter_percent = 100, then we should actually do the sensible
> > thing and prepare the list that we need, i.e. limit
> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>
> Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I'm worried that we're late in the day for changes, but I'm equally worried that a) the bgwriter is very tuning sensitive b) we don't really have much info on how to set the defaults in a meaningful way for the majority of cases c) there are some issues that greatly reduce the effectiveness of the bgwriter in many circumstances.

The 100pct.patch was my first attempt at getting something acceptable in the next few days that gives sufficient room for the DBA to perform tuning.

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> I wonder if we even need to retain the bgwriter_percent GUC var. Is
> there actually a situation in which the combination of bgwriter_maxpages
> and bgwriter_delay does not give the DBA sufficient flexibility in
> tuning bgwriter behavior?

Yes, I do now think that only two GUCs are required to tune the behaviour; but you make me think - which two? Right now, bgwriter_delay is useless because the O(N) behaviour makes it impossible to set any lower when you have a large shared_buffers. (I see that as a bug)

Your question has made me rethink the exact objective of the bgwriter's actions: The way it is coded now the bgwriter looks for dirty blocks, no matter where they are in the list. What we are bothered about is the number of clean buffers at the LRU, which has a direct influence on the probability that BufferAlloc() will need to call FlushBuffer(), since StrategyGetBuffer() returns the first unpinned buffer, dirty or not. After further thought, I would prefer a subtle change in behaviour so that the bgwriter checks that clean blocks are available at the LRUs for when buffer replacement occurs. With that slight change, I'd keep the bgwriter_percent GUC but make it mean something different.

bgwriter_percent would be the % of shared_buffers that are searched (from the LRU end) to see if they contain dirty buffers, which are then written to disk. That means the number of dirty blocks written to disk is between 0 and the number of buffers searched, but we're not hugely bothered what that number is... [This change to StrategyDirtyBufferList resolves the unusability of the bgwriter with large shared_buffers]

Writing away dirty blocks towards the MRU end is more likely to be wasted effort. If a block stays near the MRU then it will be dirty again in the wink of an eye, so you gain nothing at checkpoint time by cleaning it. Also, since it isn't near the LRU, cleaning it has no effect on buffer replacement I/O. If a block is at the LRU, then it is by definition the least likely to be reused, and is a candidate for replacement anyway. So concentrating on the LRU, not the number of dirty buffers, seems to be the better thing to do.

That would then be a much simpler way of setting the defaults. With that definition, we would set the defaults:

bgwriter_percent = 2 (according to my new suggestion here)
bgwriter_delay = 200
bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine tuning)

Thus, for the default shared_buffers=1000 the bgwriter would clear a space of up to 20 blocks each cycle.
For a config with shared_buffers=60000, the bgwriter default would clear space for 600 blocks (max) each cycle - a reasonable setting.

Overall that would need very little specific tuning, because it would scale upwards as you changed the shared_buffers higher.

So, that interpretation of bgwriter_percent gives these advantages:
- we bound the StrategyDirtyBufferList scan to a small % of the whole list, rather than the whole list...so we could realistically set the bgwriter_delay lower if required
- we can set a default that scales, so would not often need to change it
- the parameter is defined in terms of the thing we really care about: sufficient clean blocks at the LRU of the buffer lists
- these changes are very isolated and actually minor - just a different way of specifying which buffers the bgwriter will clean

Patch attached...again for discussion and to help understanding of this proposal. Will submit to patches if we agree it seems like the best way to allow the bgwriter defaults to be sensibly set.

[...and yes, everybody, I do know where we are in the release cycle]

--
Best Regards, Simon Riggs
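The scaling claim in the message above can be checked with a small worked example. The sketch below mirrors the limit computation that appears in the patch quoted later in this thread (`maxdirty = (NBuffers * percent + 99) / 100`, with an explicit `maxpages` taking priority); the function name `scan_limit` is hypothetical. With that formula, shared_buffers=1000 and bgwriter_percent=2 give 20 buffers, while shared_buffers=60000 gives 1200; the figure of 600 quoted in the message would correspond to bgwriter_percent=1.

```c
#include <assert.h>

/*
 * Sketch of the scan-limit computation, following the quoted patch.
 * nbuffers stands in for NBuffers (shared_buffers); percent and maxpages
 * stand in for bgwriter_percent and bgwriter_maxpages.
 */
static int
scan_limit(int nbuffers, int percent, int maxpages)
{
    /* Either setting at zero disables background writing. */
    if (percent == 0 || maxpages == 0)
        return 0;

    /* An explicit page limit takes priority over the percentage. */
    if (maxpages > 0)
        return maxpages;

    if (percent > 0)
    {
        assert(percent <= 100);
        /* Round up, so a small pool still yields a nonzero limit. */
        return (nbuffers * percent + 99) / 100;
    }

    /* No limit set: consider the whole pool. */
    return nbuffers;
}
```

For example, `scan_limit(1000, 2, -1)` evaluates to 20 and `scan_limit(60000, 2, -1)` to 1200, which shows how the limit scales with shared_buffers when bgwriter_maxpages is left at -1.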
Attachments
On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
> > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> > Is the plan to make bgwriter_percent = 100 the default setting?
>
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults

I was just curious why you were bothering to special-case bgwriter_percent = 100 if it's not going to be the default setting (in which case I would be surprised if more than 1 in 10 users would take advantage of the patch).

> Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set any
> lower when you have a large shared_buffers.

BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do this scan while holding the BufMgrLock, which is a well known source of contention. So reducing the time we hold that lock would be good.

> Your question has made me rethink the exact objective of the bgwriter's
> actions: The way it is coded now the bgwriter looks for dirty blocks, no
> matter where they are in the list.

Not sure what you mean. StrategyDirtyBufferList() returns the specified number of dirty buffers in order, starting with the T1/T2 LRUs and going back to the MRUs of both lists. bgwriter_percent effectively ignores some portion of the tail of that list, so we end up just flushing the buffers closest to the T1/T2 LRUs. How is this different from what you're describing?

> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are
> then written to disk.

By definition, buffers closest to the LRU end of the lists are not frequently accessed. If we only search the N% of the lists closest to LRU, we will probably end up flushing just those pages to disk -- and then not flushing anything else to disk in the subsequent bgwriter calls because all the buffers close to the LRU will be non-dirty. That's okay if all we're concerned about is avoiding write() by a real backend, but we also want to smooth out checkpoint load, which I don't think this approach would do well.

I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages is all the tuning we need, and I think "max # of pages to write" is a simpler and more logical tuning knob than "% of the buffer pool to scan looking for dirty buffers." So at each bufmgr invocation, we pick at most bgwriter_maxpages dirty pages from the pool, using the pages closest to the LRUs of T1 and T2. I'd be happy to supply a patch to implement that if you think it sounds okay.

-Neil
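The selection rule suggested at the end of the message above (take at most bgwriter_maxpages dirty pages, preferring those closest to the LRU ends of T1 and T2) can be sketched as follows. This is only an illustration under simplifying assumptions: the `SketchBuf` type, the array representation of the two lists, and the function name are hypothetical, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer descriptor: only the fields this sketch needs. */
typedef struct
{
    int id;
    int dirty;
} SketchBuf;

/*
 * Walk two lists (each ordered LRU first) in alternation and collect the
 * ids of up to max_pages dirty buffers.  Returns the number collected.
 */
static size_t
pick_dirty_near_lru(const SketchBuf *t1, size_t n1,
                    const SketchBuf *t2, size_t n2,
                    int *out_ids, size_t max_pages)
{
    size_t i1 = 0, i2 = 0, found = 0;

    assert(out_ids != NULL);

    while (found < max_pages && (i1 < n1 || i2 < n2))
    {
        if (i1 < n1)
        {
            if (t1[i1].dirty)
                out_ids[found++] = t1[i1].id;
            i1++;
        }
        /* Re-check the cap: the T1 step may have used the last slot. */
        if (found < max_pages && i2 < n2)
        {
            if (t2[i2].dirty)
                out_ids[found++] = t2[i2].id;
            i2++;
        }
    }
    return found;
}
```

Unlike a fixed-percentage window, the number of buffers visited here varies with how clean the LRU ends are, which is the trade-off between the two proposals debated in this thread.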
Simon,

I am seeing a reasonably reproducible performance boost after applying your patch (I'm not sure if that was one of the main objectives, but it certainly is nice).

I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running pgbench. However, after applying your patch, 8.0 is pretty much back to being the same.

Now I know pgbench is ..err... not always the most reliable for this sort of thing, so I am interested if this seems like a reasonable sort of thing to be noticing (and also if anyone else has noticed the decrement)?

(The attached brief results are for Linux x86, but I can see a similar performance decrement 7.4.6->8.0.0RC1 on FreeBSD 5.3 x86)

regards

Mark

Simon Riggs wrote:
>Hmm...must confess that my only plan is:
>i) discover dynamic behaviour of bgwriter
>ii) fix any bugs or wierdness as quickly as possible
>iii) try to find a way to set the bgwriter defaults

System
------
P4 2.8Ghz 1G 1xSeagate Barracuda 40G
Linux 2.6.9 glibc 2.3.3 gcc 3.4.2
Postgresql 7.4.6 | 8.0.0RC1

Test
----
Pgbench with scale factor = 200

Pg 7.4.6
--------

clients  transactions  tps
1        1000          65.1
2        1000          72.5
4        1000          69.2
8        1000          48.3

Pg 8.0.0RC1
-----------

clients  transactions  tps    tps (new buff patch + settings)
1        1000          55.8   70.9
2        1000          68.3   77.9
4        1000          38.4   62.8
8        1000          29.4   38.1

(averages over 3 runs, database dropped and recreated after each set, with a checkpoint performed after each individual run)

Parameters
----------

Non default postgresql.conf parameters:

tcpip_socket = true [listen_addresses = "*"]
max_connections = 100
shared_buffers = 10000
wal_buffers = 1024
checkpoint_segments = 10
effective_cache_size = 40000
random_page_cost = 0.8

bgwriter settings (used with patch only)

bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100
On Mon, 2004-12-13 at 04:39, Mark Kirkwood wrote:
> I am seeing a reasonably reproducible performance boost after applying
> your patch (I'm not sure if that was one of the main objectives, but it
> certainly is nice).
>
> I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running
> pgbench. However, after applying your patch, 8.0 is pretty much back to
> being the same.
>

Thanks Mark - brilliant to have some confirming test results back so quickly.

The tests indicate that we're on the right track here and that we should test this on the OSDL platform also on a long run, to check out the effects of both normal running and checkpointing.

Given these test settings:

bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100

This shows the importance of reducing how long the BufMgrLock is held in StrategyDirtyBufferList() -- which I think Neil also agrees is the main problem here.

> System
> ------
> P4 2.8Ghz 1G 1xSeagate Barracuda 40G
> Linux 2.6.9 glibc 2.3.3 gcc 3.4.2
> Postgresql 7.4.6 | 8.0.0RC1
>
> Test
> ----
> Pgbench with scale factor = 200
>
> Pg 7.4.6
> --------
>
> clients  transactions  tps
> 1        1000          65.1
> 2        1000          72.5
> 4        1000          69.2
> 8        1000          48.3
>
> Pg 8.0.0RC1
> -----------
>
> clients  transactions  tps    tps (new buff patch + settings)
> 1        1000          55.8   70.9
> 2        1000          68.3   77.9
> 4        1000          38.4   62.8
> 8        1000          29.4   38.1
>
> (averages over 3 runs, database dropped and recreated after each set, with a
> checkpoint performed after each individual run)
>
> Parameters
> ----------
>
> Non default postgresql.conf parameters:
>
> tcpip_socket = true [listen_addresses = "*"]
> max_connections = 100
> shared_buffers = 10000
> wal_buffers = 1024
> checkpoint_segments = 10
> effective_cache_size = 40000
> random_page_cost = 0.8
>
> bgwriter settings (used with patch only)
>
> bgwriter_delay = 200
> bgwriter_percent = 2
> bgwriter_maxpages = 100

--
Best Regards, Simon Riggs
On Mon, 2004-12-13 at 02:43, Neil Conway wrote:
> On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
> > > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
> > > Is the plan to make bgwriter_percent = 100 the default setting?
> >
> > Hmm...must confess that my only plan is:
> > i) discover dynamic behaviour of bgwriter
> > ii) fix any bugs or weirdness as quickly as possible
> > iii) try to find a way to set the bgwriter defaults
>
> I was just curious why you were bothering to special-case
> bgwriter_percent = 100 if it's not going to be the default setting (in
> which case I would be surprised if more than 1 in 10 users would take
> advantage of the patch).
>
> > Right now, bgwriter_delay
> > is useless because the O(N) behaviour makes it impossible to set any
> > lower when you have a large shared_buffers.
>
> BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
> this scan while holding the BufMgrLock, which is a well known source of
> contention. So reducing the time we hold that lock would be good.

Yes, the duration of the BufMgrLock held during StrategyDirtyBufferList and its effect on system performance is my concern. Reducing that is one of the primary objectives here (point (ii)).

> > bgwriter_percent would be the % of shared_buffers that are searched
> > (from the LRU end) to see if they contain dirty buffers, which are
> > then written to disk.
>
> By definition, buffers closest to the LRU end of the lists are not
> frequently accessed. If we only search the N% of the lists closest to
> LRU, we will probably end up flushing just those pages to disk -- and
> then not flushing anything else to disk in the subsequent bgwriter calls
> because all the buffers close to the LRU will be non-dirty. That's okay
> if all we're concerned about is avoiding write() by a real backend, but
> we also want to smooth out checkpoint load, which I don't think this
> approach would do well.

My argument for that was: the "N% of lists closest to LRU" approach gives
- constant search time (searching for N dirty buffers causes a variable number of buffers to be searched, so lock time varies...)
- if blocks are no longer used, they eventually migrate to the LRU, so they then get written away by bgwriter rather than at checkpoint time.
- the blocks near the MRU get dirtied again fairly quickly, so still need to be flushed again at checkpoint

So, overall, I think this would smooth out the checkpoint load.

We've little time left: If we do not manage to perform a performance test that shows that this argument is valid, then I'd agree that we drop that idea (for now) because of the risk that it does have the side-effect you mention.

Longer term, I think possibly having two types of bgwriter activity would be worthwhile:
1) short and frequent LRU cleaning
2) longer but less frequent mini-checkpoints that reach up towards the MRU

> I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
> is all the tuning we need, and I think "max # of pages to write" is a
> simpler and more logical tuning knob than "% of the buffer pool to scan
> looking for dirty buffers." So at each bufmgr invocation, we pick the at
> most bgwriter_maxpages dirty pages from the pool, using the pages
> closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
> implement that if you think it sounds okay.

Whichever way we do it, we agree that bgwriter_maxpages is all the tuning that you and I need. My suggestion was to provide both the tuning knob AND removing the need for the knob completely for the (as you say) 9 out of 10 people that never will perform any tuning, by using bgwriter_percent to set a value that is approximately correct all of the time.

Anyway, thanks for taking the time to read all of these postings. We're clearly agreed on the main aspect of this, AFAICS.

> I'd be happy to supply a patch to
> implement that if you think it sounds okay.

...my understanding is that you'd only be touching BufferSync() to simplify it, and to remove all of the bgwriter_percent GUC stuff and its call path to BufferSync()?

I've hacked my patch down to show what I think you mean for the BufferSync() changes.... to allow perf comparisons if time allows. Clearly your own patch will more accurately portray those...

--
Best Regards, Simon Riggs
Attachments
Sorry for the delay; here are results with the bg3.patch with database parameters that should match run 207. I haven't been able to take the time to look over the results myself, but I tried to make sure this run was the same as 207: http://www.osdl.org/projects/dbt2dev/results/dev4-010/207 Mark
Sorry, wrong link, right one here: http://www.osdl.org/projects/dbt2dev/results/dev4-010/211 Mark
On Wed, 2004-12-15 at 00:00, Mark Wong wrote:
> http://www.osdl.org/projects/dbt2dev/results/dev4-010/211
>

Thanks Mark for turning that around so quickly. Looks good...

Results performed to compare:
test 207
http://www.osdl.org/projects/dbt2dev/results/dev4-010/207
test 211 with bg3.patch, which matches Neil/my option (3)
http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

The overall results show a 3% throughput gain.

The negative effects of checkpointing are significantly reduced and this shows up in the New Order Transaction response time max dropping from 37s to 25s, which looks like a significant user-visible performance gain. A similar reduction in max response times is shown for all transaction types: consistent removal of the longest wait times.

The gains come from greater effectiveness of the bgwriter, which reduces I/O wait time spikes to almost zero once the shared_buffers are completely full. (see Processor Utilization graph: wait)

It looks to me that reducing the bgwriter_delay slightly might yield additional gains, say to 180 or 160. That should now be possible since the cost of doing so has been greatly reduced. StrategyDirtyBufferList has now dropped way down the list in oprofile results.

Neil very kindly points out privately that the patch has a missing sanity check bug in it, which has shown up in Neil's testing. That wouldn't affect these performance results, however. I leave it to Neil to post a corrected version as a result of his efforts.

I leave it to the consensus to decide whether these results represent significant gains and whether to add to 8.0, or defer. Neil's suggestion (2) also needs to be considered - test results could still show that as the better option, so I keep an open mind.

--
Best Regards, Simon Riggs
On 12/12/2004 5:08 PM, Simon Riggs wrote:
>> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> Simon Riggs wrote:
>> > If the bgwriter_percent = 100, then we should actually do the sensible
>> > thing and prepare the list that we need, i.e. limit
>> > StrategyDirtyBufferList to finding at most bgwriter_maxpages.
>> 
>> Is the plan to make bgwriter_percent = 100 the default setting?
> 
> Hmm...must confess that my only plan is:
> i) discover dynamic behaviour of bgwriter
> ii) fix any bugs or weirdness as quickly as possible
> iii) try to find a way to set the bgwriter defaults
> 
> I'm worried that we're late in the day for changes, but I'm equally
> worried that a) the bgwriter is very tuning sensitive b) we don't really
> have much info on how to set the defaults in a meaningful way for the
> majority of cases c) there are some issues that greatly reduce the
> effectiveness of the bgwriter in many circumstances.
> 
> The 100pct.patch was my first attempt at getting something acceptable in
> the next few days that gives sufficient room for the DBA to perform
> tuning.
Doesn't cranking up the bgwriter_percent to 100 effectively make the 
entire shared memory a write-through cache? In other words, with 100% 
the bgwriter will always write all dirty blocks out and it becomes 
unlikely to avoid an IO for subsequent modifications to the same data block.
Jan
> 
> On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> I wonder if we even need to retain the bgwriter_percent GUC var. Is 
>> there actually a situation in which the combination of bgwriter_maxpages 
>> and bgwriter_delay does not give the DBA sufficient flexibility in 
>> tuning bgwriter behavior?
> 
> Yes, I do now think that only two GUCs are required to tune the
> behaviour; but you make me think - which two? Right now, bgwriter_delay
> is useless because the O(N) behaviour makes it impossible to set any
> lower when you have a large shared_buffers. (I see that as a bug)
> 
> Your question has made me rethink the exact objective of the bgwriter's
> actions: The way it is coded now the bgwriter looks for dirty blocks, no
> matter where they are in the list. What we are bothered about is the
> number of clean buffers at the LRU, which has a direct influence on the
> probability that BufferAlloc() will need to call FlushBuffer(), since
> StrategyGetBuffer() returns the first unpinned buffer, dirty or not.
> After further thought, I would prefer a subtle change in behaviour so
> that the bgwriter checks that clean blocks are available at the LRUs for
> when buffer replacement occurs. With that slight change, I'd keep the
> bgwriter_percent GUC but make it mean something different.
> 
> bgwriter_percent would be the % of shared_buffers that are searched
> (from the LRU end) to see if they contain dirty buffers, which are then
> written to disk.  That means the number of dirty blocks written to disk
> is between 0 and the number of buffers searched, but we're not hugely
> bothered what that number is... [This change to StrategyDirtyBufferList
> resolves the unusability of the bgwriter with large shared_buffers]
> 
> Writing away dirty blocks towards the MRU end is more likely to be
> wasted effort. If a block stays near the MRU then it will be dirty again
> in the wink of an eye, so you gain nothing at checkpoint time by
> cleaning it. Also, since it isn't near the LRU, cleaning it has no
> effect on buffer replacement I/O. If a block is at the LRU, then it is
> by definition the least likely to be reused, and is a candidate for
> replacement anyway. So concentrating on the LRU, not the number of dirty
> buffers seems to be the better thing to do.
> 
> That would then be a much simpler way of setting the defaults. With that
> definition, we would set the defaults:
> 
> bgwriter_percent = 2 (according to my new suggestion here)
> bgwriter_delay = 200
> bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
> tuning)
> 
> Thus, for the default shared_buffers=1000 the bgwriter would clear a
> space of up to 20 blocks each cycle.
> For a config with shared_buffers=60000, the bgwriter default would clear
> space for 600 blocks (max) each cycle - a reasonable setting.
> 
> Overall that would need very little specific tuning, because it would
> scale upwards as you changed the shared_buffers higher.
> 
> So, that interpretation of bgwriter_percent gives these advantages:
> - we bound the StrategyDirtyBufferList scan to a small % of the whole
> list, rather than the whole list...so we could realistically set the
> bgwriter_delay lower if required
> - we can set a default that scales, so would not often need to change it
> - the parameter is defined in terms of the thing we really care about:
> sufficient clean blocks at the LRU of the buffer lists
> - these changes are very isolated and actually minor - just a different
> way of specifying which buffers the bgwriter will clean
> 
> Patch attached...again for discussion and to help understanding of this
> proposal. Will submit to patches if we agree it seems like the best way
> to allow the bgwriter defaults to be sensibly set.
> 
> [...and yes, everybody, I do know where we are in the release cycle]
> 
> 
> 
> ------------------------------------------------------------------------
> 
> Index: buffer/bufmgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
> retrieving revision 1.182
> diff -d -c -r1.182 bufmgr.c
> *** buffer/bufmgr.c    24 Nov 2004 02:56:17 -0000    1.182
> --- buffer/bufmgr.c    12 Dec 2004 21:53:10 -0000
> ***************
> *** 681,686 ****
> --- 681,687 ----
>   {
>       BufferDesc **dirty_buffers;
>       BufferTag  *buftags;
> +     int         maxdirty;
>       int            num_buffer_dirty;
>       int            i;
>   
> ***************
> *** 688,717 ****
>       if (percent == 0 || maxpages == 0)
>           return 0;
>   
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
>   
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
> -     num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> -                                                NBuffers);
>   
> !     /*
> !      * If called by the background writer, we are usually asked to only
> !      * write out some portion of dirty buffers now, to prevent the IO
> !      * storm at checkpoint time.
> !      */
> !     if (percent > 0)
> !     {
> !         Assert(percent <= 100);
> !         num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
> !     }
> !     if (maxpages > 0 && num_buffer_dirty > maxpages)
> !         num_buffer_dirty = maxpages;
>   
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> --- 689,720 ----
>       if (percent == 0 || maxpages == 0)
>           return 0;
>   
> +     /* Set number of buffers we will clean at LRUs of buffer lists 
> +      * If no limits set, then clean the whole of shared_buffers
> +      */
> +     if (maxpages > 0)
> +         maxdirty = maxpages;
> +     else {
> +         if (percent > 0) {
> +                Assert(percent <= 100);
> +             maxdirty = (NBuffers * percent + 99) / 100;
> +         }
> +         else
> +             maxdirty = NBuffers;
> +     }
> + 
>       /*
>        * Get a list of all currently dirty buffers and how many there are.
>        * We do not flush buffers that get dirtied after we started. They
>        * have to wait until the next checkpoint.
>        */
> !     dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
> !     buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
>   
>       LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
>   
> !        num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
> !                                                maxdirty);
>   
>       /* Make sure we can handle the pin inside the loop */
>       ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> Index: buffer/freelist.c
> ===================================================================
> RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
> retrieving revision 1.48
> diff -d -c -r1.48 freelist.c
> *** buffer/freelist.c    16 Sep 2004 16:58:31 -0000    1.48
> --- buffer/freelist.c    12 Dec 2004 21:53:11 -0000
> ***************
> *** 735,741 ****
>    * StrategyDirtyBufferList
>    *
>    * Returns a list of dirty buffers, in priority order for writing.
> -  * Note that the caller may choose not to write them all.
>    *
>    * The caller must beware of the possibility that a buffer is no longer dirty,
>    * or even contains a different page, by the time he reaches it.  If it no
> --- 735,740 ----
> ***************
> *** 755,760 ****
> --- 754,760 ----
>       int            cdb_id_t2;
>       int            buf_id;
>       BufferDesc *buf;
> +     int            i;
>   
>       /*
>        * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
> ***************
> *** 765,771 ****
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>   
> !     while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> --- 765,771 ----
>       cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
>       cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
>   
> !     for (i = 0; i < max_buffers; i++)
>       {
>           if (cdb_id_t1 >= 0)
>           {
> ***************
> *** 779,786 ****
>                       buffers[num_buffer_dirty] = buf;
>                       buftags[num_buffer_dirty] = buf->tag;
>                       num_buffer_dirty++;
> -                     if (num_buffer_dirty >= max_buffers)
> -                         break;
>                   }
>               }
>   
> --- 779,784 ----
> ***************
> *** 799,806 ****
>                       buffers[num_buffer_dirty] = buf;
>                       buftags[num_buffer_dirty] = buf->tag;
>                       num_buffer_dirty++;
> -                     if (num_buffer_dirty >= max_buffers)
> -                         break;
>                   }
>               }
>   
> --- 797,802 ----
> 
> 
> ------------------------------------------------------------------------
> 
> 
-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
			
On 12/12/2004 9:43 PM, Neil Conway wrote:

> On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:
>> > On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
>> > Is the plan to make bgwriter_percent = 100 the default setting?
>>
>> Hmm...must confess that my only plan is:
>> i) discover dynamic behaviour of bgwriter
>> ii) fix any bugs or weirdness as quickly as possible
>> iii) try to find a way to set the bgwriter defaults
>
> I was just curious why you were bothering to special-case
> bgwriter_percent = 100 if it's not going to be the default setting (in
> which case I would be surprised if more than 1 in 10 users would take
> advantage of the patch).
>
>> Right now, bgwriter_delay
>> is useless because the O(N) behaviour makes it impossible to set any
>> lower when you have a large shared_buffers.
>
> BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
> this scan while holding the BufMgrLock, which is a well known source of
> contention. So reducing the time we hold that lock would be good.
>
>> Your question has made me rethink the exact objective of the bgwriter's
>> actions: The way it is coded now the bgwriter looks for dirty blocks, no
>> matter where they are in the list.
>
> Not sure what you mean. StrategyDirtyBufferList() returns the specified
> number of dirty buffers in order, starting with the T1/T2 LRUs and going
> back to the MRUs of both lists. bgwriter_percent effectively ignores
> some portion of the tail of that list, so we end up just flushing the
> buffers closest to the L1/L2 LRUs. How is this different from what
> you're describing?
>
>> bgwriter_percent would be the % of shared_buffers that are searched
>> (from the LRU end) to see if they contain dirty buffers, which are
>> then written to disk.
>
> By definition, buffers closest to the LRU end of the lists are not
> frequently accessed. If we only search the N% of the lists closest to
> LRU, we will probably end up flushing just those pages to disk -- and
> then not flushing anything else to disk in the subsequent bgwriter calls
> because all the buffers close to the LRU will be non-dirty. That's okay
> if all we're concerned about is avoiding write() by a real backend, but
> we also want to smooth out checkpoint load, which I don't think this
> approach would do well.
>
> I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
> is all the tuning we need, and I think "max # of pages to write" is a
> simpler and more logical tuning knob than "% of the buffer pool to scan
> looking for dirty buffers." So at each bufmgr invocation, we pick the at
> most bgwriter_maxpages dirty pages from the pool, using the pages
> closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
> implement that if you think it sounds okay.

I too don't think that this approach will retain the checkpoint smoothing
effect the current implementation has.

The real problem is that the "cleaner" the buffer pool is, the longer
the scan for dirty buffers will take because the dirty blocks tend to be
at the very end of the scan order. The real solution for this would be
not to scan the whole pool, but to maintain a separate chain of only
dirty buffers in LRU order.

Jan

-- 
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
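For illustration only, the "separate chain of only dirty buffers" idea could look roughly like the sketch below. This is not PostgreSQL code: `SketchBuf`, `DirtyChain` and the `chain_*` functions are hypothetical names, and the chain is ordered by the moment a buffer became dirty, which is a simplifying assumption standing in for the real replacement order. The point is only that collecting at most `maxpages` candidates then costs O(maxpages) rather than a scan of all of shared_buffers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer descriptor: only the fields this sketch needs. */
typedef struct SketchBuf {
    int id;
    int dirty;                 /* 1 while the buffer is on the dirty chain */
    struct SketchBuf *prev;    /* neighbour toward the LRU end */
    struct SketchBuf *next;    /* neighbour toward the MRU end */
} SketchBuf;

/* Hypothetical chain holding only dirty buffers. */
typedef struct {
    SketchBuf *lru;            /* buffer that has been dirty the longest */
    SketchBuf *mru;            /* most recently dirtied buffer */
} DirtyChain;

/* A clean buffer becomes dirty: append it at the MRU end in O(1). */
void chain_mark_dirty(DirtyChain *c, SketchBuf *b)
{
    assert(c != NULL && b != NULL);
    if (b->dirty)
        return;                /* already on the chain, keep its position */
    b->dirty = 1;
    b->next = NULL;
    b->prev = c->mru;
    if (c->mru != NULL)
        c->mru->next = b;
    else
        c->lru = b;            /* chain was empty */
    c->mru = b;
}

/* A dirty buffer has been written out: unlink it in O(1). */
void chain_mark_clean(DirtyChain *c, SketchBuf *b)
{
    assert(c != NULL && b != NULL);
    if (!b->dirty)
        return;
    if (b->prev != NULL)
        b->prev->next = b->next;
    else
        c->lru = b->next;
    if (b->next != NULL)
        b->next->prev = b->prev;
    else
        c->mru = b->prev;
    b->prev = b->next = NULL;
    b->dirty = 0;
}

/* Collect up to maxpages dirty buffers starting at the LRU end.
 * Clean buffers are never visited, so the cost is O(maxpages). */
int chain_collect(const DirtyChain *c, SketchBuf **out, int maxpages)
{
    int n = 0;
    for (SketchBuf *b = c->lru; b != NULL && n < maxpages; b = b->next)
        out[n++] = b;
    return n;
}
```

A caller would invoke `chain_mark_dirty` wherever a buffer is first dirtied, and the writer would call `chain_collect` followed by `chain_mark_clean` for each page it flushes. Any locking, which matters a great deal here given the BufMgrLock discussion, is deliberately left out of the sketch.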
Jan,

> I too don't think that this approach will retain the checkpoint smoothing
> effect the current implementation has.
>
> The real problem is that the "cleaner" the buffer pool is, the longer
> the scan for dirty buffers will take because the dirty blocks tend to be
> at the very end of the scan order. The real solution for this would be
> not to scan the whole pool, but to maintain a separate chain of only
> dirty buffers in LRU order.

Hmmm, I've not seen this. For example, with people who are having trouble
with checkpoint spikes on Linux, I've taken to recommending that they call
sync() (via cron) every 5-10 seconds (thanks, Bruce, for the suggestion!).
Believe it or not, this does help smooth out the spikes and give better
overall performance in a many-small-writes situation.

Simon, one of the problems with the OSDL-DBT2 test is that it's too steady.
DBT2 gives a constant stream of small writes at a regular, predictable rate.
This does not, in fact, match any real-world application I know.

To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to
change the following:

1) Randomize the timing of the commits, so that sometimes there are only 30
writes/minute, and other times there are 300. A timing pattern that would
produce a "sine wave" with occasional random spikes would be best; in my
experience, OLTP applications tend to have wave-like spikes and lulls.

2) Include a sprinkling of random or regular "large writes" which affect
several tables and 1000's of rows. For example, once per hour, change
10,000 pending orders to "shipped", and archive 10,000 "old orders" to an
archive table.

However, this would require "splitting" DBT2; there's the DBT2 which
simulates the TPC-C test, and the DBT2 which will help us tune for
real-world applications. The two tests will not be the same.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
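As a rough sketch of the timing pattern in point 1, a load driver could compute a target write rate per minute as below. Nothing here comes from DBT2 itself: `wave_rate`, `spiky_rate`, the wave period, the spike probability and the spike factor are all hypothetical, and only the 30 and 300 writes/minute bounds are taken from the message. To keep the example free of a libm dependency it uses Bhaskara I's rational approximation of the sine instead of `sin()`.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PI 3.14159265358979323846

/* Bhaskara I's approximation of sin(x), valid for x in [0, pi].
 * Swapped in for sin() so the sketch builds without linking libm. */
double approx_sin_half(double x)
{
    return 16.0 * x * (SKETCH_PI - x) /
           (5.0 * SKETCH_PI * SKETCH_PI - 4.0 * x * (SKETCH_PI - x));
}

/* Target writes per minute on a sine-like wave between lo and hi.
 * "period" is the wave length in minutes (an assumed parameter). */
double wave_rate(int minute, int period, double lo, double hi)
{
    assert(period > 0 && minute >= 0);
    double phase = 2.0 * SKETCH_PI * (minute % period) / period;
    /* Mirror the half-wave approximation for the second half period. */
    double s = (phase <= SKETCH_PI)
                   ? approx_sin_half(phase)
                   : -approx_sin_half(phase - SKETCH_PI);
    return lo + (hi - lo) * (s + 1.0) / 2.0;
}

/* Same wave, but with probability spike_prob the minute's rate is
 * multiplied by spike_factor to model an occasional random spike. */
double spiky_rate(int minute, int period, double lo, double hi,
                  double spike_prob, double spike_factor)
{
    double r = wave_rate(minute, period, lo, hi);
    if ((double) rand() / RAND_MAX < spike_prob)
        r *= spike_factor;
    return r;
}
```

For example, `spiky_rate(m, 60, 30.0, 300.0, 0.05, 3.0)` would give an hourly wave between 30 and 300 writes/minute with a spike in roughly one minute out of twenty; those last two numbers are arbitrary and would need tuning against real workloads.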
Folks,

> To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to
> change the following:
>
> 1) Randomize the timing of the commits, so that sometimes there are only 30
> writes/minute, and other times there are 300. A timing pattern that would
> produce a "sine wave" with occasional random spikes would be best; in my
> experience, OLTP applications tend to have wave-like spikes and lulls.
>
> 2) Include a sprinkling of random or regular "large writes" which affect
> several tables and 1000's of rows. For example, once per hour, change
> 10,000 pending orders to "shipped", and archive 10,000 "old orders" to an
> archive table.

Oh, also we need to:

3) Run the test for 3+ hours after scaling up, and turn on autovacuum.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco
Jan Wieck <JanWieck@yahoo.com> writes:

> Doesn't cranking up the bgwriter_percent to 100 effectively make the entire
> shared memory a write-through cache? In other words, with 100% the bgwriter
> will always write all dirty blocks out and it becomes unlikely to avoid an IO
> for subsequent modifications to the same data block.

If the goal is to not write out hot pages why look in T1 at all? Why not just
flush 100% of the dirty pages from T2 and ignore T1 entirely?

-- 
greg