Discussion: wal_sync_method = open_sync safe on RHEL 5.5?
Some more on the RHEL 5.5 system I'm helping to set up. Some benchmarking using pgbench appeared to suggest that wal_sync_method=open_sync was a little faster than fdatasync [1]. Now I recall some discussion about this enabling direct I/O and the general flakiness of this on Linux, so is the option regarded as safe?
[1] The workout:
$ pgbench -i -s 1000 bench
$ pgbench -c [1,2,4,8,32,64,128] -t 10000
Performance peaked around 2500 tps @32 clients using open_sync and 2200 with fdatasync. However the disk arrays are on a SAN and I suspect that when testing with fdatasync later in the day there may have been workload 'leakage' from other hosts hitting the SAN.
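The bracketed client list above is shorthand for one pgbench run per client count. A minimal sketch of that sweep follows; the `run_sweep` function and the `PGBENCH` override are hypothetical names introduced here, and passing the database name `bench` to the run is an assumption (the original command line omits it).

```shell
#!/bin/sh
# Sketch only: expands "-c [1,2,4,8,32,64,128]" into one pgbench call per client count.
# PGBENCH is a hypothetical override so the loop can be exercised with a stub.
PGBENCH="${PGBENCH:-pgbench}"

run_sweep() {
    db="$1"   # database name (assumed to be the one initialised with "pgbench -i")
    for clients in 1 2 4 8 32 64 128; do
        # -c: number of concurrent clients, -t: transactions per client
        $PGBENCH -c "$clients" -t 10000 "$db"
    done
}
```

With this loaded, `run_sweep bench` would execute the seven runs back to back; switching wal_sync_method between sweeps is left to the operator.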
Mark Kirkwood wrote:
> Now I recall some discussion about this enabling direct I/O and the general flakiness of this on Linux, so is the option regarded as safe?
No one has ever refuted the claims in http://archives.postgresql.org/pgsql-hackers/2007-10/msg01310.php that it can be unsafe under a heavy enough level of mixed load on RHEL5. Given the performance benefits are marginal on ext3, I haven't ever considered it worth the risk. (I've seen much larger gains on Linux+Veritas VxFS). From what I've seen, recent Linux kernel work has reinforced that the old O_SYNC implementation was full of bugs now that more work is being done to improve that area. My suspicion (based on no particular data, just what I've seen it tested with) is that it only really worked before in the very specific way that Oracle does O_SYNC writes, which is different from what PostgreSQL does.
P.S. Be wary of expecting pgbench to give you useful numbers on a single run. For the default write-heavy test, I recommend three runs of 10 minutes each (-T 600 on recent PostgreSQL versions) before I trust any results it gives. You can get useful data from the select-only test in only a few seconds, but not the one that writes a bunch.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
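The advice above (three runs of 10 minutes each with -T 600) can be scripted roughly as follows. This is a sketch under assumptions: `repeat_bench`, `extract_tps`, and the `PGBENCH` override are hypothetical names, and the parsing relies only on pgbench printing result lines that begin with `tps = `.

```shell
#!/bin/sh
# Sketch only: repeat the default write-heavy pgbench test and print one tps figure per run.
PGBENCH="${PGBENCH:-pgbench}"

# Print the number from the first "tps = N ..." line of pgbench output read on stdin.
extract_tps() {
    awk '/^tps = / { print $3; exit }'
}

# Usage: repeat_bench DB CLIENTS [RUNS] [SECONDS]
repeat_bench() {
    db="$1"; clients="$2"; runs="${3:-3}"; seconds="${4:-600}"
    i=1
    while [ "$i" -le "$runs" ]; do
        # -T: run for a fixed number of seconds instead of a fixed transaction count
        $PGBENCH -c "$clients" -T "$seconds" "$db" | extract_tps
        i=$((i + 1))
    done
}
```

For example, `repeat_bench bench 32` would take about 30 minutes and print three tps values, whose spread gives a rough idea of how much a difference such as 2500 versus 2200 tps can be trusted.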
The conclusion I read was that Linux O_SYNC behaves like O_DSYNC on other systems. For WAL, this seems satisfactory?
Personally, I use fdatasync(). I wasn't able to measure a reliable difference for my much smaller databases, and fdatasync() seems reliable and fast enough that fighting with O_SYNC doesn't seem to be worth it. Also, technically speaking, fdatasync() appeals more to me, as it allows the system to buffer while it can, and the application to instruct it across what boundaries it should not buffer. O_SYNC / O_DSYNC seem to imply a requirement that it does a sync on every block. My gut tells me that fdatasync() gives the operating system more opportunities to optimize (whether it does or not is a different issue :-) ).
Cheers,
mark
-- Mark Mielke <mark@mielke.cc>
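The distinction drawn above (a sync implied on every write versus buffered writes followed by one explicit sync) can be seen outside PostgreSQL with GNU dd. This is only an illustration of the two system-call patterns, not of how PostgreSQL writes WAL; it assumes GNU coreutils dd, and the function names and 8 kB block size are choices made for this sketch.

```shell
#!/bin/sh
# Sketch only: two ways of getting the same data onto stable storage with GNU dd.

# Open the output with O_DSYNC, so every write() waits for the data to reach the device.
# Usage: write_dsync FILE BLOCKS
write_dsync() {
    dd if=/dev/zero of="$1" bs=8k count="$2" oflag=dsync 2>/dev/null
}

# Write through the page cache, then call fdatasync() once before dd exits.
# Usage: write_then_fdatasync FILE BLOCKS
write_then_fdatasync() {
    dd if=/dev/zero of="$1" bs=8k count="$2" conv=fdatasync 2>/dev/null
}
```

Timing each function on the WAL volume (for instance with the shell's `time`) shows how the storage responds to the two patterns, though it says nothing about the correctness concerns raised elsewhere in this thread.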
On 18/06/10 15:29, Greg Smith wrote:
> P.S. Be wary of expecting pgbench to give you useful numbers on a single run. For the default write-heavy test, I recommend three runs of 10 minutes each (-T 600 on recent PostgreSQL versions) before I trust any results it gives. You can get useful data from the select-only test in only a few seconds, but not the one that writes a bunch.
Yeah, I did several runs of each, and a couple with -c 128 and -t 100000 to give the setup a good workout (also 2000-2400 tps, nice to see a well behaved SAN).
Cheers
Mark
Mark Mielke wrote:
> The conclusion I read was that Linux O_SYNC behaves like O_DSYNC on other systems. For WAL, this seems satisfactory?
It would be if it didn't have any bugs or limitations, but it does. The one pointed out in the message I linked to suggests that a mix of buffered and O_SYNC direct I/O can cause a write error, with the exact behavior you get depending on the kernel version. That's a path better not explored as I see it.
The kernels that have made some effort to implement this correctly, on newer Linux systems, actually expose O_DSYNC. My current opinion is that if you only have Linux O_SYNC, don't use it. The ones with O_DSYNC haven't been around for long enough to be proven or disproven as effective yet.
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support greg@2ndQuadrant.com www.2ndQuadrant.us
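To act on the recommendation above, it helps to know which method a given server is configured with. The following is a rough sketch that inspects a postgresql.conf-style file; the function names are hypothetical, and the parsing is deliberately simplistic (it expects the `name = value` form and takes the last uncommented occurrence). On a running server, `SHOW wal_sync_method;` in psql is the authoritative check.

```shell
#!/bin/sh
# Sketch only: report the wal_sync_method configured in a postgresql.conf-style file.

# Print the last uncommented wal_sync_method value (optionally single-quoted) in FILE.
conf_wal_sync_method() {
    sed -n "s/^[[:space:]]*wal_sync_method[[:space:]]*=[[:space:]]*'\{0,1\}\([a-z_]*\).*/\1/p" "$1" | tail -n 1
}

# Flag open_sync, the setting this thread advises against on RHEL 5.
check_wal_sync_method() {
    method="$(conf_wal_sync_method "$1")"
    case "$method" in
        open_sync) echo "warning: wal_sync_method=open_sync relies on Linux O_SYNC" ;;
        "")        echo "wal_sync_method not set; the platform default applies" ;;
        *)         echo "ok: wal_sync_method=$method" ;;
    esac
}
```

The path to the configuration file varies by installation, so it is passed as an argument, as in `check_wal_sync_method "$PGDATA/postgresql.conf"`.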