Discussion: IO storm on checkpoints, PostgreSQL 8.2.4, Linux
Hello!
We run a large (~66 GB) web-backend database on PostgreSQL 8.2.4 on Linux. The hardware is a dual Xeon 5130 with 16 GB RAM and an LSI MegaRAID U320-2x SCSI controller with 512 MB writeback cache and a BBU. The storage setup consists of 3 RAID10 arrays (data, xlog, and indexes, each on a different array), 12 HDDs total. The frontend application uses the JDBC driver, connection pooling, and threads.
We've run into an issue of IO storms on checkpoints. Once every 20 min (which is checkpoint_interval) the database becomes unresponsive for about 4-8 seconds: query processing is suspended, and the server does nothing but write a large amount of data to disk. Because the db server is stalled, some of the web clients time out and disconnect, which is unacceptable. Even worse, since new requests come at a fairly constant rate, by the time the storm ends there is a huge number of sleeping app threads waiting for their queries to complete. After the db server comes back to life, these threads wake up and flood it with queries, so performance suffers even more for some minutes after the checkpoint.
It seemed strange to me that our 70%-read db generates so many dirty pages that writing them out takes 4-8 seconds and grabs the full bandwidth. First, I tuned the bgwriter to more aggressive settings, but this was of no help: nearly no performance change at all. Digging further, I discovered that the Linux page cache was the reason. The "Dirty" value in /proc/meminfo (which shows the amount of ready-to-write "dirty" data currently sitting in the page cache) grows between checkpoints from 0 to about 100 MB. When a checkpoint comes, all 100 MB gets flushed out to disk, effectively causing an IO storm.
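For anyone who wants to watch this happen, a minimal sketch of the sampling loop I used (Linux-specific; the "Dirty:" line format is as it appears in /proc/meminfo on 2.6 kernels):

```shell
# Sample the page cache dirty counter once a second. Between checkpoints
# the value climbs steadily; when the checkpoint fsync() hits, it collapses.
while true; do
    grep '^Dirty:' /proc/meminfo
    sleep 1
done
```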
I found this document (http://www.westnet.com/~gsmith/content/linux-pdflush.htm) and peeked into mm/page-writeback.c in the Linux kernel source tree. I'm not sure I understand pdflush writeout semantics correctly, but it looks like when the amount of "dirty" data is less than dirty_background_ratio*RAM/100, pdflush only writes pages in the background, waking up every dirty_writeback_centisecs and writing no more than 1024 pages (the MAX_WRITEBACK_PAGES constant). When we hit dirty_background_ratio, pdflush starts to write out more aggressively.
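To put MAX_WRITEBACK_PAGES in perspective, a quick back-of-the-envelope calculation (assuming the 4 kB x86 page size):

```shell
# pdflush background writeback moves at most 1024 pages per wakeup;
# at 4 kB per page that is only 4 MB per pass.
awk 'BEGIN { printf "%d MB per wakeup\n", 1024 * 4 / 1024 }'
```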
So it looks like the following scenario takes place: postgresql constantly writes something to the database and xlog files, the dirty data goes into the page cache, and is then slowly written out by pdflush. When postgres generates more dirty pages than pdflush writes out, the amount of dirty data in the page cache grows. At checkpoint time, postgres does fsync() on the database files and sleeps until the whole page cache is written out.
By default, dirty_background_ratio is 2%, which is about 328 MB of 16 GB total. Following the current pdflush logic, nearly this amount of data is what we face writing out at checkpoint, effectively stalling everything else, so even 1% of 16 GB is too much. My setup experiences a 4-8 second pause in operation even with only ~100 MB of dirty page cache...
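The 328 MB figure is just the threshold arithmetic, for example:

```shell
# dirty_background_ratio (2%) of 16 GB (16384 MB) of RAM
awk 'BEGIN { printf "%.0f MB\n", 16384 * 2 / 100 }'   # prints "328 MB"
```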
I temporarily solved this problem by setting dirty_background_ratio to 0%. This causes dirty data to be written out immediately. It is OK for our setup (mostly because of the large controller cache), but it doesn't look to me like an elegant solution. Is there some other way to fix this issue without disabling the page cache and the IO smoothing it was designed to perform?
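For reference, the workaround boils down to a one-line sysctl change (a sketch; load it with `sysctl -p` after editing, and note that it affects all background writeback on the box, not just postgres):

```
# /etc/sysctl.conf fragment: write dirty pages out immediately in background
vm.dirty_background_ratio = 0
```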
--
Regards,
Dmitry
Are you able to show that the dirty pages are all coming from postgres?

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>
Dmitry Potapov wrote:
> [snip]
> We've run into an issue of IO storms on checkpoints. Once in 20min
> (which is checkpoint_interval) the database becomes unresponsive for about
> 4-8 seconds. Query processing is suspended, server does nothing but writing

What are your background writer settings?

Joshua D. Drake

--
The PostgreSQL Company: Command Prompt, Inc.
Sales/Support: +1.503.667.4564 24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997 http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
2007/8/22, Mark Mielke <mark@mark.mielke.cc>:
I don't know how to prove that, but I suspect that nothing else except postgres writes to disk on that system, because it runs nothing but postgresql and syslog (which I configured not to write to local storage but to send everything to a remote log server). No cron jobs, nothing else.
--
Regards,
Dmitry
2007/8/22, Joshua D. Drake <jd@commandprompt.com>:
> We've run into an issue of IO storms on checkpoints. Once in 20min
> (which is checkpoint_interval) the database becomes unresponsive for about
> 4-8 seconds. Query processing is suspended, server does nothing but writing
What are your background writer settings?
bgwriter_delay=100ms
bgwriter_lru_percent=20.0
bgwriter_lru_maxpages=100
bgwriter_all_percent=3
bgwriter_all_maxpages=600
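For what it's worth, a rough ceiling on the write rate those settings allow the all-scan (a sketch, assuming PostgreSQL's 8 kB page size):

```shell
# bgwriter_all_maxpages pages per round, one round every bgwriter_delay
# (100 ms), times 8 kB per page
awk 'BEGIN { printf "%.1f MB/s\n", 600 * (1000 / 100) * 8 / 1024 }'   # prints "46.9 MB/s"
```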
In fact, with dirty_background_ratio > 0, the bgwriter even makes things a tiny bit worse.
--
Regards,
Dmitry
2007/8/22, Kenneth Marshall <ktm@rice.edu>:
> You are working at the correct level. The bgwriter performs the I/O smoothing
> function at the database level. Obviously, the OS level smoothing function
> needed to be tuned and you have done that within the parameters of the OS.
> You may want to bring this up on the Linux kernel lists and see if they have
> any ideas.
Will do so, this seems to be a reasonable idea.
--
Regards,
Dmitry
On Wed, 22 Aug 2007, Dmitry Potapov wrote:

> I found this http://www.westnet.com/~gsmith/content/linux-pdflush.htm

If you do end up following up with this via the Linux kernel mailing list, please pass that link along. I've been meaning to submit it to them and wait for the flood of e-mail telling me what I screwed up; that will go better if you tell them about it instead of me.

> I temporaly solved this problem by setting dirty_background_ratio to 0%.
> This causes the dirty data to be written out immediately. It is ok for
> our setup (mostly because of large controller cache), but it doesn't
> looks to me as an elegant solution. Is there some other way to fix this
> issue without disabling pagecache and the IO smoothing it was designed
> to perform?

I spent a couple of months trying and decided it was impossible. Your analysis of the issue is completely accurate; lowering dirty_background_ratio to 0 makes the system much less efficient, but it's the only way to make the stalls go completely away.

I contributed some help toward fixing the issue in the upcoming 8.3 instead; there's a new checkpoint writing process aimed at easing the exact problem you're running into. See the new checkpoint_completion_target tunable at http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html

If you could figure out how to run some tests to see whether the problem clears up for you using the new technique, that would be valuable feedback for the development team for the upcoming 8.3 beta. Probably a more productive use of your time than going crazy trying to fix the issue in 8.2.4.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
2007/8/23, Greg Smith <gsmith@gregsmith.com>:
> On Wed, 22 Aug 2007, Dmitry Potapov wrote:
> If you do end up following up with this via the Linux kernel mailing list,
> please pass that link along. I've been meaning to submit it to them and
> wait for the flood of e-mail telling me what I screwed up, that will go
> better if you tell them about it instead of me.
I'm planning to do so, but first I need to take a look at the postgresql source and dev documentation to find out how exactly IO is done, so I can explain the issue to the linux kernel people. That will take some time; I'll post a link here when I'm done.
> looks to me as an elegant solution. Is there some other way to fix this
> issue without disabling pagecache and the IO smoothing it was designed
> to perform?
> I spent a couple of months trying and decided it was impossible. Your
> analysis of the issue is completely accurate; lowering
> dirty_background_ratio to 0 makes the system much less efficient, but it's
> the only way to make the stalls go completely away.
By the way, does postgresql have a similar stall problem on freebsd/other OSes? It would be interesting to study their approach to IO smoothing if it doesn't.
> I contributed some help toward fixing the issue in the upcoming 8.3
> instead; there's a new checkpoint writing process aimed to ease the exact
> problem you're running into there, see the new
> checkpoint_completion_target tunable at
> http://developer.postgresql.org/pgdocs/postgres/wal-configuration.html
>
> If you could figure out how to run some tests to see if the problem clears
> up for you using the new technique, that would be valuable feedback for
> the development team for the upcoming 8.3 beta. Probably more productive
> use of your time than going crazy trying to fix the issue in 8.2.4.
We have a tool here to record and replay the exact workload we have on a real production system; the only problem is getting a spare 16Gb box. I can get a server with 8Gb RAM and nearly the same storage setup for testing purposes. I hope it will be able to carry the production load, so I can compare 8.2.4 and 8.3devel on the same box, in the same situation. Are there any other changes in 8.3devel that can affect the results of such a test? I didn't really follow the 8.3 development process :(
Regards,
Dmitry
On Thu, 23 Aug 2007, Dmitry Potapov wrote:

> I'm planning to do so, but before I need to take a look at postgresql source
> and dev documentation to find how exactly IO is done, to be able to explain
> the issue to linux kernel people.

I can speed that up for you. http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW outlines all the source code involved. The easiest way to browse through the code is via http://doxygen.postgresql.org/ ; eventually I want to update the page so it points right into the appropriate doxygen spots, but I haven't gotten to that yet.

> By the way, does postgresql has a similar stall problem on freebsd/other
> OS'es? It would be interesting to study their approach to io smoothing if it
> doesn't.

There's some evidence that something about Linux aggravates the problem; check out http://archives.postgresql.org/pgsql-hackers/2007-07/msg00261.php and the rest of the messages in that thread. I haven't heard a report of this problem from someone who isn't running Linux, but as it requires a certain level of hardware and a specific type of workload, I'm not sure whether this is coincidence or a cause/effect relationship.

> Is there any other changes in 8.3devel that can affect the results of
> such test?

The "all" component of the background writer was removed, as it proved not to be useful once checkpoint_completion_target was introduced. And the LRU background writer keeps going while checkpoints are being trickled out; in earlier versions that didn't happen.

The test I'd like to see more people run is to simulate their workloads with checkpoint_completion_target set to 0.5, 0.7, and 0.9 and see how each of those settings works relative to the 8.2 behavior.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
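The comparison runs described above amount to replaying the same workload while varying one postgresql.conf setting (a fragment; the 20min timeout matches the checkpoint interval from the original report):

```
# postgresql.conf (8.3devel) -- one run per target value
checkpoint_timeout = 20min
checkpoint_completion_target = 0.5   # then 0.7, then 0.9
```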
On Wed, Aug 22, 2007 at 07:33:35PM +0400, Dmitry Potapov wrote:
> [snip]

Dmitry,

You are working at the correct level. The bgwriter performs the I/O smoothing function at the database level. Obviously, the OS-level smoothing function needed to be tuned, and you have done that within the parameters of the OS. You may want to bring this up on the Linux kernel lists and see if they have any ideas.

Good luck,

Ken
On Aug 22, 2007, at 10:57 AM, Kenneth Marshall wrote:
> [snip]
> You are working at the correct level. The bgwriter performs the I/O
> smoothing function at the database level. Obviously, the OS level smoothing
> function needed to be tuned and you have done that within the parameters of
> the OS. You may want to bring this up on the Linux kernel lists and see if
> they have any ideas.

Have you tried decreasing your checkpoint interval? That would at least help to reduce the amount of data that needs to be flushed when Postgres fsyncs.

Erik Jones

Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
On Tue, Aug 28, 2007 at 10:00:57AM -0500, Erik Jones wrote:
> [snip]
> Have you tried decreasing your checkpoint interval? That would at
> least help to reduce the amount of data that needs to be flushed when
> Postgres fsyncs.

The downside to that is that it will result in writing a lot more data to WAL as long as full page writes are on.

Isn't there some kind of a timeout parameter for how long dirty data will sit in the cache? It seems pretty broken to me to allow stuff to sit in a dirty state indefinitely.

--
Decibel!, aka Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
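For what it's worth, Linux does appear to have exactly such a knob: vm.dirty_expire_centisecs, the age after which dirty data is queued for writeout on the next pdflush pass regardless of the dirty ratio (a sketch, assuming the usual 2.6-kernel default of 3000 centiseconds):

```shell
# Convert the default expiry of 3000 centiseconds to seconds; on a live
# box, read the actual value from /proc/sys/vm/dirty_expire_centisecs
awk 'BEGIN { printf "%d seconds\n", 3000 / 100 }'   # prints "30 seconds"
```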