Discussion: Excessive (and slow) fsync() within single transaction
I've been trying to track down a performance issue I have. In simple terms, my select performance is very good (generally either CPU limited, or disk limited, depending upon the query), and small updates seem OK.
But a huge UPDATE is incredibly slow. CPU is on average below 1%, and disk IO is generally below 0.1% of what I would expect. It seems to be doing tiny bursts of activity, interspersed with long periods of inactivity. I've observed the same behaviour in many other cases where the amount of writing is large.
My system:
Mac Pro 2009 Quad 2.93 with 16G of ECC RAM
Snow Leopard 10.6.2 in 64bit mode, fully patched
Database on RAID 0 array of Intel X-25M SSDs
Postgres 8.4.1, 64 bit, compiled from source
From outside of postgres, I can sustain over 400MB/s read, and 140MB/s write indefinitely. And around 10K random read I/Os/sec, and 4K random write I/Os. With no hint of pauses. I have never witnessed anything other than speedy performance from the SSDs except from within postgresql.
An example of a problematic SQL command is:
update link_relurl as u set sid = m.sid from link_meta as m where u.link = m.link;
Both tables are around 20Gbytes, and 200M rows, with link as the primary integer key. So in this example, I am joining 2 tables of 200M rows and writing a new value to a column of one of those tables.
The command is issued via psql. There are no other connections to the database.
My expectation would be that postgresql would issue an fsync() (or perhaps a few such calls) at the end of the transaction. But this does not seem to be the case:
Using the DTrace utility dtruss to time all calls to fsync being made by postgres:
$> sudo dtruss -n postgres -t fsync -e
107/0x2dd: 7 fsync(0x3, 0xD, 0x100805400) = 0 0
107/0x2dd: 5 fsync(0x5, 0x8, 0x100805400) = 0 0
107/0x2dd: 4 fsync(0x21, 0x1A, 0x100805400) = 0 0
107/0x2dd: 5 fsync(0x1A, 0x8, 0x100805400) = 0 0
107/0x2dd: 4 fsync(0x19, 0x12, 0x100805400) = 0 0
107/0x2dd: 4 fsync(0x4, 0x2, 0x100805400) = 0 0
107/0x2dd: 5 fsync(0x7, 0x12, 0x100805400) = 0 0
107/0x2dd: 4 fsync(0x1C, 0xB, 0x100805400) = 0 0
107/0x2dd: 4 fsync(0x1E, 0xB, 0x100805400) = 0 0
107/0x2dd: 174 fsync(0x24, 0x103102AD8, 0x0) = 0 0
107/0x2dd: 2018 fsync(0x20, 0x1B, 0x100805400) = 0 0
107/0x2dd: 7 fsync(0x12, 0xC, 0x100805400) = 0 0
107/0x2dd: 987 fsync(0x23, 0x1B, 0x100805400) = 0 0
.....
With thousands of such lines. The columns are PID, elapsed time in microseconds, and the system call with arguments. In the lines above, the longest fsync is only 2 milliseconds. Most of the lines are similar to the above.
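The elapsed-time column in a trace like this can be summarized with a short script. This is only a sketch: it assumes the dtruss field layout shown above (PID/thread, elapsed microseconds, then the call).

```python
import re

def summarize_fsyncs(trace_lines):
    """Summarize fsync latencies from dtruss lines such as
    '107/0x2dd:  2018 fsync(0x20, 0x1B, 0x100805400) = 0 0'."""
    pattern = re.compile(r"^\S+:\s+(\d+)\s+fsync\(")
    times_us = [int(m.group(1)) for line in trace_lines
                if (m := pattern.match(line.strip()))]
    return {
        "calls": len(times_us),
        "max_us": max(times_us) if times_us else 0,
        "total_us": sum(times_us),
    }

# A few lines from the trace above:
sample = [
    "107/0x2dd:  7 fsync(0x3, 0xD, 0x100805400) = 0 0",
    "107/0x2dd:  2018 fsync(0x20, 0x1B, 0x100805400) = 0 0",
    "107/0x2dd:  987 fsync(0x23, 0x1B, 0x100805400) = 0 0",
]
print(summarize_fsyncs(sample))
```

Feeding it the full trace would show immediately whether total time is dominated by a few very slow calls or by sheer call count.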
So firstly, why are there so many calls to fsync()?
Unfortunately, some of the calls to fsync are incredibly slow:
107/0x2dd: 22526373 fsync(0x19, 0x1, 0x100805400) = 0 0
107/0x2dd: 22637723 fsync(0x10, 0x9, 0x100805400) = 0 0
107/0x2dd: 61745976 fsync(0xA, 0xF, 0x100805400) = 0 0
So in this triplet of consecutive calls we have elapsed times of 22, 22 and 61 seconds!
I'm no expert on DTrace, nor the inner workings of postgresql, but if anybody has any ideas on what might be causing this behaviour, how to track it down, or how to fix or avoid it, I would be very grateful.
Stephen
Stephen Tyler <stephen@stephen-tyler.com> writes:
> My expectation would be that postgresql would issue an fsync() (or perhaps a few such calls) at the end of the transaction. But this does not seem to be the case:

fsyncs mostly happen as a result of checkpoint activity. I wonder whether you have done any tuning of checkpoint intervals, shared_buffers, etc. The very long times to complete the fsyncs sound like a kernel or driver issue; not much we can do to help you with that.

regards, tom lane
Stephen Tyler wrote:
> So firstly, why are there so many calls to fsync()?

Every time a transaction commits, you get a fsync to the WAL file. Then, during the periodic database checkpoints, you get more fsync'd writes. The latter are more likely to be your problem.

You should turn on log_checkpoints in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files. Presuming that turns out to be the case, you'll want to follow the suggestions at http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server to improve the behavior, and you can get a lot more background about what's happening at http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm

We just went through one of these last week with someone else; you might find the troubleshooting process and surrounding discussion interesting reading: http://archives.postgresql.org/pgsql-performance/2009-12/msg00033.php

One useful later stage in troubleshooting problems in this area is to reduce the amount of caching done by the operating system, so that there aren't as many blocks backed up when the fsync calls happen. I'm not sure how to do this on OS X though.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Tue, Dec 8, 2009 at 7:28 PM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> My expectation would be that postgresql would issue an fsync() (or perhaps a few such calls) at the end of the transaction. But this does not seem to be the case:
> [...]
> So firstly, why are there so many calls to fsync()?
> [...]
> So in this triplet of consecutive calls we have elapsed times of 22, 22 and 61 seconds!

curious, have you checked the log for frequent checkpoint warnings? Are you logging checkpoints and seeing if they line up with the slow fsync times?
my guess here is that you are overflowing the write back cache on the x25-m and the long fsync times are waiting for the disk to flush the cache, or some background housecleaning is going on. This behavior is very similar to what I've observed on certain raid controllers when the cache overflows.

possible resolutions might include:
*) you might luck out and be dealing with a known, hopefully corrected issue with intel drives
*) tweak checkpoints/bgwriter to try and reduce the penalty of fsync
*) disable fsync...risky, but I bet it will make the problem go away
*) optimize your query. maybe try rewriting as delete/insert select from a scratch table?

merlin
Some more information on fsync using the test_fsync tool in postgres/src/tools:
System is Snow Leopard 10.6.2 (64 bit). Volumes are HFS+ formatted.
PASS 1 - DATABASE VOLUME (SSD RAID 0)
===================================
$ sudo ./test_fsync -f /Volumes/SSD/fsync_test.out
Simple write timing:
write 0.127178
Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
write, fsync, close 0.254609
write, close, fsync 0.316973
Compare one o_sync write to two:
one 16k o_sync write 0.260829
two 8k o_sync writes 0.259231
Compare file sync methods with one 8k write:
open o_dsync, write 0.135035
open o_sync, write 0.128056
write, fdatasync 0.154622
write, fsync 0.154496
Compare file sync methods with two 8k writes:
open o_dsync, write 0.260390
open o_sync, write 0.426376
write, fdatasync 0.382204
write, fsync 0.249774
PASS 2 - SATA HDD (7200 rpm)
=========================
$ sudo ./test_fsync -f /Volumes/Work/fsync_test.out
Simple write timing:
write 0.205038
Compare fsync times on write() and non-write() descriptor:
If the times are similar, fsync() can sync data written
on a different descriptor.
write, fsync, close 0.498885
write, close, fsync 0.383176
Compare one o_sync write to two:
one 16k o_sync write 0.237621
two 8k o_sync writes 0.458676
Compare file sync methods with one 8k write:
open o_dsync, write 0.171746
open o_sync, write 0.195227
write, fdatasync 0.195102
write, fsync 0.218675
Compare file sync methods with two 8k writes:
open o_dsync, write 0.345076
open o_sync, write 0.368541
write, fdatasync 0.254322
write, fsync 0.271039
If the tests are repeated, the numbers are not very stable, but generally in the range 0.18 to 0.48.
The times are apparently in milliseconds. But that doesn't make any sense for the 7200rpm SATA HDD, given it has an 8.3ms rotation time. And the numbers for the SSD are basically identical to the HDD, apart from the bare write timing.
Do I conclude that the sync timings are all faked by the OS?
Stephen
On Wed, Dec 9, 2009 at 12:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
Thank you for your suggestions. I will turn on log_checkpoints next time I get the chance to restart.
> You should turn on log_checkpoints in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files.
Some more details about my current config (almost no activity since boot except the one large update still in progress):
select * from pg_stat_bgwriter;
checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc
-------------------+-----------------+--------------------+---------------+------------------+-----------------+---------------
440 | 51 | 39248820 | 8876042 | 85527 | 3952243 | 48756188
and a couple of hours later:
465 | 56 | 43316686 | 9855720 | 95058 | 3997393 | 53713342
My analysis:
checkpoints_timed increments roughly every 5 minutes, and checkpoints_req increments roughly every 30 minutes.
Around 80% of the checkpoints are timed, rather than requested. But the pauses/freezes are of the order of 30 to 60 seconds apart (roughly aligned with similar slow fsync calls), not 5 minutes apart. There are multiple freezes per checkpoint.
During each checkpoint, 83K * 8K blocks were written out, or 650MB on average per checkpoint. Around 20% were cleaned by the background cleaner. If this occurred at anywhere near my drive's average random write speed, it should take less than 20 seconds even if the writes were to 100% random disk blocks.
I don't understand how maxwritten_clean could be as high as 95058, and increment at more than 1 per second. This is a process count, not a buffer count? How often is the background cleaner launched? Does that mean I need to massively increase bgwriter_lru_maxpages, and other bgwriter params? They are currently default values.
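The deltas between the two pg_stat_bgwriter snapshots above can be computed directly. A sketch only: an 8 KB block size is assumed, and the per-checkpoint figure depends on which measurement window you pick, so it need not match the averages quoted in the surrounding text.

```python
# pg_stat_bgwriter snapshots from above, in column order:
# checkpoints_timed, checkpoints_req, buffers_checkpoint,
# buffers_clean, maxwritten_clean, buffers_backend, buffers_alloc
earlier = (440, 51, 39248820, 8876042, 85527, 3952243, 48756188)
later   = (465, 56, 43316686, 9855720, 95058, 3997393, 53713342)

delta = [b - a for a, b in zip(earlier, later)]
checkpoints = delta[0] + delta[1]          # timed + requested
buffers_per_ckpt = delta[2] / checkpoints  # checkpoint-written buffers
mb_per_ckpt = buffers_per_ckpt * 8192 / (1024 * 1024)

print(f"checkpoints in window: {checkpoints}")
print(f"buffers written per checkpoint: {buffers_per_ckpt:,.0f}")
print(f"~{mb_per_ckpt:,.0f} MB per checkpoint")
```

Running this over successive snapshots while the big UPDATE executes would show whether the checkpoint write volume is itself growing over time.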
On Wed, Dec 9, 2009 at 12:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> fsyncs mostly happen as a result of checkpoint activity. I wonder whether you have done any tuning of checkpoint intervals, shared_buffers, etc.
I wouldn't describe the values as "tuned" ... more like set and forget ...
postgresql.conf (currently):
max_connections = 50
shared_buffers = 4GB
work_mem = 64MB
maintenance_work_mem = 1GB
synchronous_commit = 'off'
wal_buffers = 8MB
checkpoint_segments = 128
autovacuum = 'on'
effective_cache_size = 12GB
default_statistics_target = 100
constraint_exclusion = 'on'
shared_buffers is 1/4 of RAM, and effective_cache_size is 3/4 RAM. The system is not swapping, and currently 9.5GB of RAM is marked as "Inactive".
checkpoint_segments is large, but in the suggested range of 32 to 256 for large, write-heavy, batch-oriented systems. I've used smaller values in the past, but I got lots of checkpoint warnings in the console.
wal_buffers might be a bit high. I don't fully understand the implications, and how to tune this value, even after reading the tuning guides.
On Wed, Dec 9, 2009 at 4:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> possible resolutions might include:
> *) you might luck out and be dealing with a known, hopefully corrected issue with intel drives
Latest firmware for the 50nm version. I haven't found any known issue.
During the pauses, "Activity Monitor" reports no disk activity. But I am not sure if this implies no SATA traffic.
> *) tweak checkpoints/bgwriter to try and reduce the penalty of fsync
Make them more frequent but smaller, or less frequent but bigger? Which way has less penalty?
> *) disable fsync...risky, but I bet it will make the problem go away
I'm very tempted to live dangerously. I would like to actually turn off fsync's during a transaction, and just use the fsync at the very end of the transaction to confirm that the transaction will be permanent. Or maybe use the filesystem journaling function to ensure I can roll-back to a valid state if something bad happens mid-transaction. But these do not seem to be options.
> *) optimize your query. maybe try rewriting as delete/insert select from a scratch table?
Unfortunately, I have a lot of foreign keys with "ON DELETE CASCADE", and various other triggers. While that greatly simplified my housekeeping code, I can no longer do delete & reinsert.
I can break the big transactions into hundreds or thousands of little transactions. This does let me stop/restart the system at will, but I've yet to see it help my throughput.
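Since link is an integer primary key, that kind of batching can be sketched by carving the key space into contiguous ranges, one small transaction per range. This is purely illustrative: the key bounds and batch size are hypothetical, and the statements are only generated as strings here, not executed against a server.

```python
def batched_updates(min_link, max_link, batch_size):
    """Generate one UPDATE per key range, so each range can run in
    its own small transaction instead of one 200M-row statement."""
    template = (
        "UPDATE link_relurl AS u SET sid = m.sid "
        "FROM link_meta AS m "
        "WHERE u.link = m.link "
        "AND u.link >= {lo} AND u.link < {hi};"
    )
    for lo in range(min_link, max_link, batch_size):
        yield template.format(lo=lo, hi=min(lo + batch_size, max_link))

# Hypothetical key range 0..1,000,000 split into 250K-row batches:
stmts = list(batched_updates(0, 1_000_000, 250_000))
print(len(stmts), "statements")
print(stmts[0])
```

Smaller batches spread the checkpoint load out, though as noted above they may not raise overall throughput if fsync latency is the real bottleneck.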
Maybe I've overlooked something, but nothing looks too wrong with postgresql.conf to me, based on my reading of the tuning guides. One checkpoint write every 5 minutes (80% timed) doesn't seem too bad. Something else (probably the very slow fsync) is keeping the checkpoint writing rate low, and there appear to be a lot more fsync's than checkpoint segments.
Stephen
On Wed, Dec 9, 2009 at 5:11 AM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> The times are apparently in milliseconds. But that doesn't make any sense for the 7200rpm SATA HDD, given it has an 8.3ms rotation time. And the numbers for the SSD are basically identical to the HDD, apart from the bare write timing.
>
> Do I conclude that the sync timings are all faked by the OS?

Actually, it's usually the drives that lie about fsync; especially consumer grade (and some server grade) SATA / PATA drives are known for this.
Scott Marlowe wrote:
> Actually, it's usually the drives that lie about fsync, especially consumer grade (and some server grade) SATA / PATA drives are known for this.

I'm still looking for any evidence of any drive that lies. Is there actually a drive which claims to support the IDE "FLUSH_CACHE_EXT" feature (IDE command 0xEA) that doesn't?!?

You can see if your drive claims to - most this century do - in Linux with "hdparm -I /dev/sdX", which should show FLUSH_CACHE_EXT if the disk supports this extension. And if your disk is using a last-century IDE spec that predated that feature, I think your kernel probably told you in the startup messages that your drive is not so much lying, but obsolete and missing a feature.

Software, on the other hand, sucks at sending these commands. Ext3 will only send the right IDE commands if the inode was touched[1]. VirtualBox doesn't send them correctly[2]. I hear LVM and MD raid can lose those events too. Since so many layers don't send the right IDE commands to flush, userspace utilities exist to send those commands[3]. I sometimes wonder if postgres should add something like that.

But I still haven't found any drive that lies -- and only a couple whose firmware openly admits not to support the feature.

[1] http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg272253.html
[2] http://openquery.com/blog/host-flush
[3] http://forum.soft32.com/linux/flush-disk-write-cache-userspace-ftopict338477.html
Stephen Tyler wrote:
> The times are apparently in milliseconds. But that doesn't make any sense for the 7200rpm SATA HDD, given it has an 8.3ms rotation time. And the numbers for the SSD are basically identical to the HDD, apart from the bare write timing.
> Do I conclude that the sync timings are all faked by the OS?

PostgreSQL on OS X uses a special fsync method by default named 'fsync_writethrough'. If you do this on your database you should be able to confirm it's using it:

show wal_sync_method;

This method includes a call to Apple's API to make sure the cache is properly flushed on the drives. Lots more details about how all this works at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm

I'm not sure if the test_fsync tool actually tests this particular wal_sync_method though, so on OS X you may only be getting reports that don't include a proper drive cache flush. The server itself does the right thing. Making test_fsync do exactly the same thing as the real WAL write in all cases has been a lingering TODO item for a while now; I submitted a patch to improve the whole thing a while ago, but it's not really robust in all situations yet.

Unfortunately I think you're getting bad data from the test tool, and I'm not sure of a source for a similar one that includes the right drive flushing behavior for OS X. (As a general note, there's a similar call on Windows, and I believe it also suffers from this problem when running test_fsync; that little test utility has really only been tested well on UNIX-ish systems.)

What you can do in order to test fsync rate as a side-effect of other work the server does is to run pgbench with a custom script.
If you grab a copy of http://git.postgresql.org/gitweb?p=pgbench-tools.git;a=blob_plain;f=tests/insert.sql;hb=945164aaadf50ea645750fe77c4ffcecafb5c353 and save that as "insert.sql", then do:

createdb pgbench
pgbench -i -s 10 pgbench
pgbench -s 10 -t 1000 -c 1 -f insert.sql

This will tell you how many insert statements/second are committing, which is basically just a tiny write to the WAL followed by an fsync. That's basically the fsync rate of the system, as long as you don't get a checkpoint in the middle slowing it down.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
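pgbench measures the commit rate against a real server; as a rough stand-alone analogue you can time raw write+fsync pairs on the target volume. A sketch only: the file path and counts are arbitrary, and on OS X a plain fsync() may not force a drive cache flush, so suspiciously high rates are possible.

```python
import os
import tempfile
import time

def fsync_rate(path, syncs=200, payload=b"x" * 512):
    """Time `syncs` write+fsync pairs and return syncs per second."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(syncs):
            os.write(fd, payload)
            os.fsync(fd)  # on OS X this may not reach the platters
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return syncs / elapsed

# Point the path at the volume under test (temp file used here):
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    target = tmp.name
try:
    print(f"{fsync_rate(target):,.0f} fsyncs/sec")
finally:
    os.unlink(target)
```

A 7200rpm disk honestly flushing its cache should land near or below its rotation rate (~120/sec); rates far above that suggest a cache is absorbing the syncs.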
Ron Mayer wrote:
> Scott Marlowe wrote:
>> Actually, it's usually the drives that lie about fsync, especially consumer grade (and some server grade) SATA / PATA drives are known for this.
> I'm still looking for any evidence of any drive that lies. Is there actually a drive which claims to support the IDE "FLUSH_CACHE_EXT" feature (IDE command 0xEA) that doesn't?!?

You seem to think that a "lying drive" is one that doesn't flush its cache when you ask it to. That's not the definition, and as you point out it's rare behavior--some crappy drives would do this many years ago when the drive flush was a new thing; nowadays that's unheard of.

But back in the day, when you wrote to a drive, it never returned success until that write was done. Anything that doesn't have that behavior was considered a "lying drive". It was something that consumer drives did but "enterprise" ones didn't.

Nowadays, most drives do this, and the various cache flush calls allow one to work around it. But the drives still "lie" from some viewpoints, simply because the write calls don't wait for completion before returning success, the original criteria. If you're naive and believe that anything the drive has said is written is actually on disk, which used to be how things worked, from that perspective all (well, almost all) drives lie. Having that behavior documented and a workaround available doesn't change that.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Wed, Dec 9, 2009 at 2:35 PM, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:
> Scott Marlowe wrote:
>> Actually, it's usually the drives that lie about fsync, especially consumer grade (and some server grade) SATA / PATA drives are known for this.
>
> I'm still looking for any evidence of any drive that lies.

I've tested drives in the past (when PATA roamed the land, and 80Gig drives were huge) and could most certainly induce db corruption by starting pgbench with -c 100, forcing a checkpoint, and then pulling the plug. Turning off the write cache (ok, all cache I guess) on those drives made them capable of surviving the power plug being pulled.
On Thu, Dec 10, 2009 at 8:38 AM, Greg Smith <greg@2ndquadrant.com> wrote:
> PostgreSQL on OS X uses a special fsync method by default named 'fsync_writethrough'. If you do this on your database you should be able to confirm it's using it:
> show wal_sync_method;
> This method includes a call to Apple's API to make sure the cache is properly flushed on the drives. Lots more details about how all this works at http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm
# show wal_sync_method;
wal_sync_method
-----------------
open_datasync
I don't think I have done anything to change this method from the default compile settings.
From the documentation:
http://www.postgresql.org/docs/8.4/static/runtime-config-wal.html#RUNTIME-CONFIG-WAL-SETTINGS
· open_datasync (write WAL files with open() option O_DSYNC)
· fdatasync (call fdatasync() at each commit)
· fsync_writethrough (call fsync() at each commit, forcing write-through of any disk write cache)
· fsync (call fsync() at each commit)
· open_sync (write WAL files with open() option O_SYNC)
Not all of these choices are available on all platforms. The default is the first method in the above list that is supported by the platform.

So apparently, my system supports open_datasync, and since it is the first method, that is the default.
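Given Greg's note above that OS X needs fsync_writethrough to actually flush the drive cache, the fix would be a one-line change. A sketch (not from the thread itself):

```
# postgresql.conf -- sketch, not from the thread itself.
# On OS X, fsync_writethrough makes PostgreSQL call fcntl(F_FULLFSYNC),
# which forces the drive's write cache to be flushed; the open_datasync
# default selected here does not guarantee that.
wal_sync_method = fsync_writethrough
```

After a reload (or restart), "show wal_sync_method;" should report the new value.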
Stephen
On Wed, Dec 9, 2009 at 6:24 AM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
> Thank you for your suggestions. I will turn on log_checkpoint next time I get the chance to restart.

You should be able to set this via a sighup to the server. No restart required.
On Wed, Dec 9, 2009 at 12:57 PM, Greg Smith <greg@2ndquadrant.com> wrote:
I have turned on log_checkpoints, and re-run the command. Checkpoints are being written every 220 to 360 seconds. About 80% are "time" and 20% "xlog". Here are a representative sample:
10/12/09 5:12:59 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:16:41 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 133795 buffers (25.5%); 0 transaction log file(s) added, 0 removed, 98 recycled; write=112.281 s, sync=108.809 s, total=221.166 s
10/12/09 5:17:32 PM org.postgresql.postgres[445] LOG: checkpoint starting: xlog
10/12/09 5:23:45 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 292601 buffers (55.8%); 0 transaction log file(s) added, 0 removed, 76 recycled; write=162.919 s, sync=209.277 s, total=372.229 s
10/12/09 5:23:45 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:27:58 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 113330 buffers (21.6%); 0 transaction log file(s) added, 0 removed, 128 recycled; write=165.252 s, sync=87.588 s, total=253.285 s
10/12/09 5:28:45 PM org.postgresql.postgres[445] LOG: checkpoint starting: time
10/12/09 5:34:11 PM org.postgresql.postgres[445] LOG: checkpoint complete: wrote 206947 buffers (39.5%); 0 transaction log file(s) added, 0 removed, 65 recycled; write=152.953 s, sync=172.978 s, total=326.460 s
So a "typical" checkpoint is around 200K buffers (1.5GBytes, 40% of max), taking around 150 seconds to write (10MByte/second, 1300 buffers/second), and around 150 seconds to sync.
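The per-checkpoint averages quoted above can be reproduced directly from the log lines; a quick sketch (assuming PostgreSQL's default 8 KB block size):

```python
# Check the checkpoint throughput math from the log lines above
# (assumes PostgreSQL's default 8 KB buffer/block size).
BLOCK = 8192

checkpoints = [  # (buffers written, write seconds, sync seconds)
    (133795, 112.281, 108.809),
    (292601, 162.919, 209.277),
    (113330, 165.252,  87.588),
    (206947, 152.953, 172.978),
]

for bufs, write_s, sync_s in checkpoints:
    mb = bufs * BLOCK / (1024 * 1024)
    print(f"{bufs} buffers = {mb:.0f} MB, "
          f"write rate {mb / write_s:.1f} MB/s, sync took {sync_s:.0f} s")

avg_bufs = sum(c[0] for c in checkpoints) / len(checkpoints)
print(f"average: {avg_bufs * BLOCK / 2**30:.2f} GB per checkpoint")
```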
The system pauses/freezes are much more frequent. Roughly 20 to 60 seconds long, interspersed with a brief pulse of activity. They align with the lengthy fsync calls reported by DTrace. But there are many more fsync calls being made. Most fsync calls return in a couple of milliseconds or microseconds.
As I have previously mentioned, my system is using the default "open_datasync" WAL sync method. The test_fsync utility reports a small "open o_dsync, write" call returning in around 300 microseconds or less, both on the SSD RAID 0 array and a 7200rpm SATA HDD. If that utility is to be believed, "open o_dsync, write" is not reaching the disk platters by the time it returns.
Why then does it take postgresql 200 seconds to sync a checkpoint (or why does DTrace report 20 to 60 second fsync calls)? The drives themselves have only smallish 32MB buffers. I can write a 1.5GByte file (the size of the average checkpoint) in only 10 seconds if I do it from outside postgresql.
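Part of the asymmetry between a fast small synced write and a slow checkpoint sync can be seen outside postgres too; a rough Python illustration (not the actual test_fsync utility):

```python
import os, tempfile, time

# Illustration: a small synced write returns quickly because only a few
# bytes must reach the drive, while fsync() after a large buffered write
# must flush everything the OS has cached for that file -- one reason a
# checkpoint's sync phase can take far longer than a single small
# synced write would suggest.
path = os.path.join(tempfile.mkdtemp(), "fsync_demo")

fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
t0 = time.time()
os.write(fd, b"x" * 8192)                # one 8 KB block
os.fsync(fd)                             # sync just that block: cheap
small = time.time() - t0

t0 = time.time()
os.write(fd, b"x" * (64 * 1024 * 1024))  # 64 MB buffered into OS cache
os.fsync(fd)                             # now fsync must flush all of it
large = time.time() - t0
os.close(fd)

print(f"small synced write: {small * 1000:.1f} ms, "
      f"64 MB write+fsync: {large * 1000:.1f} ms")
```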
Stephen
Greg Smith wrote:
> You should turn on log_checkpoint in the postgresql.conf and confirm the slowdowns are happening around the same time as the checkpoint report gets written to the log files.
Stephen Tyler wrote:
> So a "typical" checkpoint is around 200K buffers (1.5GBytes, 40% of max), taking around 150 seconds to write (10MByte/second, 1300 buffers/second), and around 150 seconds to sync.

Your problem may very well be plain old lack of disk I/O throughput, and fsync forcing some caches to clear is just a symptom rather than the root problem. 1.5GB per checkpoint spread out over ~5 minutes is still a pretty high load of writes even for an SSD to deal with, given that it's got to compete with foreground activity too. I try to aim for less than 250MB of checkpoint writes per 5 minute period with regular disks, so even with your theoretically good random I/O I think you're having checkpoints at least twice as often as your system can tolerate them.

> Why then does it take postgresql 200 seconds to sync a checkpoint (or DTrace report 20 to 60 second fsync calls)? The drives themselves have only smallish 32MB buffers. I can write a 1.5GByte file (the size of the average checkpoint) in only 10 seconds, if I do it from outside postgresql.

The information being written out to the database has a large random I/O component to it, so you can't compare it to sequential write speed at all. You said you can get 4K random writes/second, right? That might only be as little as 4K IOPS * 8K block = 32MB/s worth of write speed. At that speed, 1.5GB will take over 45 seconds, not 10.

I don't really have any great suggestions for you from here. You're using not very well understood hardware on the 2nd worst of the popular platforms for PostgreSQL from a performance perspective (with Windows by far the worst), trying to push through what would be a stressful workload even on the best of them. I'm not too familiar with this area because recommendation #1 if I ran into this situation would be "Don't try that on OS X with HFS+". Maybe there's some way to get more performance out of there by tweaking the OS; I haven't had to do so myself enough to know the details off the top of my head.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
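Greg's random-I/O arithmetic checks out; a quick sanity check of the numbers he uses:

```python
# Sanity-check the random-write arithmetic from Greg's reply.
iops = 4000                  # random writes/sec Stephen measured
block = 8192                 # PostgreSQL block size in bytes
throughput = iops * block    # bytes/sec if every write is random

checkpoint = 1.5 * 2**30     # ~1.5 GB of dirty buffers per checkpoint
seconds = checkpoint / throughput
print(f"{throughput / 1e6:.0f} MB/s of random writes -> "
      f"{seconds:.0f} s to sync a checkpoint")
```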
Stephen Tyler wrote:
> I don't understand how maxwritten_clean could be as high as 95058, and increment at more than 1 per second. This is a process count, not a buffer count? How often is the background cleaner launched? Does that mean I need to massively increase bgwriter_lru_maxpages, and other bgwriter params? They are currently default values.

maxwritten_clean is a straight-out count of how many times that event happened, not a buffer count.
The background writer runs at whatever frequency bgwriter_delay is set to, which defaults to 200ms for 5 executions/second. You could increase bgwriter_lru_maxpages and the rest, but those actually make the system less efficient if you're having trouble just keeping up with checkpoint I/O. They're more aimed at improving latency on systems with enough I/O to spare that you can write ahead a bit even if it costs you an occasional penalty. In your situation, I'd set bgwriter_lru_maxpages=0 and just get rid of it altogether--it's really unlikely it's helping, and it might be making things a tiny bit worse.
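Greg's suggestion amounts to one line in postgresql.conf (a sketch; the parameter name is the 8.4 one):

```
# postgresql.conf -- sketch of the suggestion above
bgwriter_lru_maxpages = 0   # disable the background writer's LRU cleaning
# bgwriter_delay stays at its 200ms default; with maxpages = 0 the
# cleaning scan writes nothing, leaving the disk to the checkpoints.
```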
> checkpoint_segments = 128

I think that given your backing disk situation and the load you're pushing, you could easily justify 256 on your system.
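For reference, the WAL disk footprint at checkpoint_segments = 256 can be estimated from the formula in the 8.4 docs (a sketch assuming the default checkpoint_completion_target of 0.5 and 16 MB segment files):

```python
# Rough upper bound on WAL disk usage for checkpoint_segments = 256,
# using the formula from the 8.4 documentation:
#   (2 + checkpoint_completion_target) * checkpoint_segments + 1
# segment files, each 16 MB.
segments = 256
completion_target = 0.5  # 8.4 default
max_files = (2 + completion_target) * segments + 1
print(f"up to {max_files:.0f} WAL files, about "
      f"{max_files * 16 / 1024:.1f} GB of disk")
```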
> checkpoint_segments is large, but in the suggested range of 32 to 256 for large, write-heavy, batch-oriented systems. I've used smaller values in the past, but I got lots of checkpoint warnings in the console.
> wal_buffers might be a bit high. I don't fully understand the implications, and how to tune this value, even after reading the tuning guides.

It's not documented very well, that's for sure. Setting it too high won't hurt you; it just takes up a tiny amount of RAM. I've been talking recently with people about increasing the standard recommendation for that to 16MB.
> *) disable fsync...risky, but I bet will make the problem go away

The only other option that might help you out here is to turn off synchronous_commit for either the whole system, or just for transactions where durability isn't that important. That introduces a small risk of data loss, but not the risk of corruption that turning fsync off altogether does. Basically, it reduces the number of fsyncs from transaction commits to a fixed number per unit of time, rather than one proportional to the number of commits.
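A sketch of how the synchronous_commit suggestion could be applied to just the big UPDATE, using standard PostgreSQL syntax (the exact loss window depends on wal_writer_delay):

```sql
-- Per-session rather than system-wide: durability is relaxed only
-- for this connection's transactions.
SET synchronous_commit = off;
UPDATE link_relurl AS u SET sid = m.sid
  FROM link_meta AS m WHERE u.link = m.link;
-- On a crash, roughly the last 3 * wal_writer_delay worth of commits
-- could be lost, but the database itself cannot be corrupted.
```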
I'm very tempted to live dangerously. I would like to actually turn off fsync's during a transaction, and just use the fsync at the very end of the transaction to confirm that the transaction will be permanent.
-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
Greg Smith wrote:
> [Greg's message above, quoted in full]

What are the two best platforms for PostgreSQL in your opinion?

Bill
Bill Todd wrote:
> What are the two best platforms for PostgreSQL in your opinion?

There are three good platforms, each with a different set of strengths and weaknesses:

-Linux: Lots of driver support even for very recent things. Really broad universe of available add-on packages. Performance is good but a little "choppy"--it's not impossible to find situations where it degrades in somewhat odd ways. Stability is quite variable too; you need to carefully stage your updates.

-FreeBSD: More stable and predictable performance than Linux, which in some cases even turns out to be better overall. But driver support lags behind, and in general you have to be more careful about what hardware you match it with. Much smaller user community when you do run into a strange performance issue too.

-Solaris: Performance is quite steady and predictable, but sometimes not quite as fast as FreeBSD/Linux. Driver support isn't impressive, but it tends to cover the things you need for a boring database server well enough (and the drivers don't break at random after kernel upgrades like on Linux), and it's usually quite clear what will and won't work. Has great, complete implementations of some high-end features particularly well suited for larger systems, like ZFS and DTrace (the FreeBSD versions of those aren't quite as good yet). Software package availability is still quite weak compared with the alternatives, and even after spending a fair amount of time with Solaris over the years I still find it much harder to manage too.

I was picking on Mac OS X for a couple of reasons:

-Worst driver support of anything--very few viable options for RAID controllers, for example.

-Kernel tuned for GUI apps rather than server ones. I haven't seen much on this recently, but older comparisons made the Darwin kernel look extremely bad compared with Linux on process creation and complicated memory allocation work in particular--stuff you need in a database but don't really care about in most user apps. I think people stopped even running these tests because it made OS X look so bad.

-Traditional UNIX low-level tools and approaches are either weak or not available. I spent a fair amount of time looking around and I still can't tell you how to tweak the write cache choices on OS X. The mount options I almost always tweak on other platforms don't even exist on HFS+, so you can't work around the problems with the friendly defaults. Apple's vm_stat, iostat, and sar are pretty poor too. Sure, you get possibly the best DTrace analysis tool there is, and a nice GUI monitoring app too; great. That's only helpful if you get the basics right first.

-- Greg Smith 2ndQuadrant Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.com
On Thu, Dec 10, 2009 at 9:21 PM, Stephen Tyler <stephen@stephen-tyler.com> wrote:
The above was on a RAID 0 array of Intel X-25M drives. Spec'd speed is 500MB/s read, 140MB/s write, 7000 random writes/s. Observed speed (outside postgres, not a benchmark) is 420MB/s read, 140MB/s write, 4000 random writes/s.
I have now temporarily moved the entire database (and postgres data directories) to a single 7200rpm SATA drive (no RAID, no special drivers - just an ordinary WD Caviar Black 1TB). Unfortunately the situation appears very similar (and even slower):
11/12/09 12:26:59 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:33:43 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 196843 buffers (37.5%); 0 transaction log file(s) added, 0 removed, 54 recycled; write=201.260 s, sync=202.874 s, total=404.224 s
11/12/09 12:33:43 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:40:15 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 150293 buffers (28.7%); 0 transaction log file(s) added, 0 removed, 81 recycled; write=198.739 s, sync=192.861 s, total=391.682 s
11/12/09 12:40:15 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:46:07 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 136752 buffers (26.1%); 0 transaction log file(s) added, 0 removed, 61 recycled; write=170.165 s, sync=181.921 s, total=352.199 s
11/12/09 12:46:07 PM org.postgresql.postgres[1652] LOG: checkpoint starting: time
11/12/09 12:51:51 PM org.postgresql.postgres[1652] LOG: checkpoint complete: wrote 146884 buffers (28.0%); 0 transaction log file(s) added, 0 removed, 57 recycled; write=187.262 s, sync=156.595 s, total=343.916 s
Some further observations:
1) The situation is a bit different during the first 15 minutes of running the huge query. There is more CPU and disk activity (a lot of reading and writing - what I would call a busy disk). But after a while (hours), it settles down to a pattern of almost zero read I/O. Write I/O is a brief burst (<2 seconds, 2000 writes/sec, 30-50 MB/s) followed by about 30 seconds of 20-60 writes/sec (500KB/s). CPU is negligible.
2) The HDD temperature is very low - identical to a neighbor that is completely idle. I can't hear any disc rattling.
3) During these ~30 second activity pauses, access to the HDD from outside of postgres stalls.
4) DTrace is reporting fsync delays of 12 to 72 seconds (30 seconds typical) from calls made by postgres. About 40% of calls return in < 10 microseconds, 50% return in 30 - 250 ms, 3% take seconds, and 2% take tens of seconds.
5) Although the checkpoint log reports say 190s of writing and 190s of sync, the activity pattern reported by Activity Monitor is 95% near idle, and 5% activity.
Stephen