Discussion: [PATCH] Revert default wal_sync_method to fdatasync on Linux 2.6.33+
Hi list,

PostgreSQL's default settings change when built with Linux kernel headers 2.6.33 or newer. As discussed on the pgsql-performance list, this causes a significant performance regression: http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

NB! I am not proposing to change the default -- to the contrary -- this patch restores old behavior. Users might be in for a nasty performance surprise when re-building their Postgres with newer Linux headers (as was I), so I propose that this change should be made in all supported releases.

-- commit message --
Revert default wal_sync_method to fdatasync on Linux 2.6.33+

Linux kernel headers from 2.6.33 (and later) change the behavior of the O_SYNC flag. Previously O_SYNC was aliased to O_DSYNC, which caused PostgreSQL to use fdatasync as the default instead. Starting with kernels 2.6.33 and later, the definitions of O_DSYNC and O_SYNC differ. When built with headers from these newer kernels, PostgreSQL will default to using open_datasync.

This patch reverts the Linux default to fdatasync, which has had much more testing over time and also significantly better performance.
-- end commit message --

Earlier kernel headers defined O_SYNC and O_DSYNC to 0x1000; 2.6.33 and later define O_SYNC=0x101000 and O_DSYNC=0x1000 (since old behavior on most FS-es was always equivalent to POSIX O_DSYNC). More details at:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b2f3d1f769be5779b479c37800229d9a4809fc3

Currently PostgreSQL's include/access/xlogdefs.h defaults to using open_datasync when O_SYNC != O_DSYNC; otherwise fdatasync is used. Since other platforms might want to default to fdatasync in the future, too, I defined a new PLATFORM_DEFAULT_SYNC_METHOD constant in include/port/linux.h. I don't know if this is the best way to do it.

Regards, Marti
Attachments
Marti Raudsepp <marti@juffo.org> writes:
> PostgreSQL's default settings change when built with Linux kernel
> headers 2.6.33 or newer. As discussed on the pgsql-performance list,
> this causes a significant performance regression:
> http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php

> NB! I am not proposing to change the default -- to the contrary --
> this patch restores old behavior.

I'm less than convinced this is the right approach ...

If open_dsync is so bad for performance on Linux, maybe it's bad
everywhere? Should we be rethinking the default preference order?

			regards, tom lane
On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm less than convinced this is the right approach ...
>
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere? Should we be rethinking the default preference order?

Sure, maybe for PostgreSQL 9.1. But the immediate problem is older
releases (8.1 - 9.0) specifically on Linux. Something as innocuous as
re-building your DB on a newer kernel will radically affect performance
-- even when the DB itself didn't change. So I think we should aim to
fix old versions first. Do you disagree?

Regards, Marti
Marti Raudsepp <marti@juffo.org> writes:
> On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If open_dsync is so bad for performance on Linux, maybe it's bad
>> everywhere? Should we be rethinking the default preference order?

> So I think we should aim to fix old versions first. Do you disagree?

What's that got to do with it?

			regards, tom lane
On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Marti Raudsepp <marti@juffo.org> writes:
>> On Fri, Nov 5, 2010 at 20:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> If open_dsync is so bad for performance on Linux, maybe it's bad
>>> everywhere? Should we be rethinking the default preference order?
>
>> So I think we should aim to fix old versions first. Do you disagree?
>
> What's that got to do with it?

I'm not sure what you're asking.

Surely changing the default wal_sync_method for all OSes in
maintenance releases is out of the question, no?

Regards, Marti
Marti Raudsepp <marti@juffo.org> writes:
> On Fri, Nov 5, 2010 at 21:20, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What's that got to do with it?

> I'm not sure what you're asking.
> Surely changing the default wal_sync_method for all OSes in
> maintenance releases is out of the question, no?

Well, if we could leave well enough alone it would be fine with me, but
I think our hand is being forced by the Linux kernel hackers. I don't
really think that "change the default on Linux" is that much nicer than
"change the default everywhere" when it comes to what we ought to
consider back-patching. In any case, you're getting ahead of the game:
we need to decide on the desired behavior first and then think about
what to patch.

Do the performance results that were cited show that open_dsync is
generally inferior to fdatasync? If so, why would we think that that
conclusion is Linux-specific?

			regards, tom lane
On Friday 05 November 2010 19:13:47 Tom Lane wrote:
> Marti Raudsepp <marti@juffo.org> writes:
>> PostgreSQL's default settings change when built with Linux kernel
>> headers 2.6.33 or newer. As discussed on the pgsql-performance list,
>> this causes a significant performance regression:
>> http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
>>
>> NB! I am not proposing to change the default -- to the contrary --
>> this patch restores old behavior.
>
> I'm less than convinced this is the right approach ...
>
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere? Should we be rethinking the default preference order?

I fail to see how it could be beneficial on *any* non-buggy platform.
Especially with small wal_buffers and larger commits (but also
otherwise) it increases the amount of synchronous writes the OS has to
do tremendously.

* It removes about all the benefits of XLogBackgroundFlush()
* It removes any chance of reordering after writing.
* It makes AdvanceXLInsertBuffer synchronous if it has to write out

What's the theory about placing it so high in the preferences list?

Andres
On Fri, Nov 5, 2010 at 22:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I don't really think that "change the default on Linux" is that
> much nicer than "change the default everywhere" when it comes to
> what we ought to consider back-patching. In any case, you're getting
> ahead of the game: we need to decide on the desired behavior first and
> then think about what to patch.

We should be trying to guarantee the stability of maintenance releases.
"Stability" includes consistent defaults.

The fact that Linux now distinguishes between these two flags has a
very surprising effect on PostgreSQL's defaults; an effect that wasn't
intended by any developer, is not documented anywhere, and certainly
won't be anticipated by users. Do you reject this premise?

As newer distros adopt 2.6.33+ kernels, more and more people will shoot
themselves in the foot with this change. I am also worried that it will
have a direct effect on PostgreSQL adoption.

Regards, Marti
Tom Lane wrote:
> If open_dsync is so bad for performance on Linux, maybe it's bad
> everywhere? Should we be rethinking the default preference order?

And I've seen the expected sync write performance gain over fdatasync on
a system with a battery-backed cache running VxFS on Linux, because
working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
therefore reducing cache pollution from WAL writes. This doesn't work
by default on Solaris because they have a special system call you have
to execute for direct output, but if you trick the OS into doing that
via mount options you can observe it there too. The last serious tests
of this area I saw on that platform were from Jignesh, and they
certainly didn't show a significant performance regression running in
sync mode. I vaguely recall seeing a set once that showed a minor loss
compared to fdatasync, but it was too close to make any definitive
statement about reordering.

I haven't seen any report yet of a serious performance regression in
the new Linux case that was written by someone who understands fully
how fsync and drive cache flushing are supposed to interact. It's been
obvious for a year now that the reports from Phoronix about this had no
idea what they were actually testing. I didn't see anything from
Marti's report that definitively answers whether this is anything other
than Linux finally doing the right thing to flush drive caches out when
sync writes happen. There may be a performance regression here related
to WAL data going out in smaller chunks than it used to, but in all the
reports I've seen it hasn't been isolated well enough to consider
making any changes yet--to tell if it's a performance loss or a
reliability gain we're seeing.

I'd like to see some output from the 9.0 test_fsync on one of these
RHEL6 systems without a battery-backed write cache as a first step
here. That should start to shed some light on what's happening.
I just bumped up the priority on the pending upgrade of my spare laptop
to the RHEL6 beta I had been trying to find time for, so I can
investigate this further myself.

--
Greg Smith   2ndQuadrant US
greg@2ndQuadrant.com   Baltimore, MD
Andres Freund <andres@anarazel.de> writes:
> On Friday 05 November 2010 19:13:47 Tom Lane wrote:
>> If open_dsync is so bad for performance on Linux, maybe it's bad
>> everywhere? Should we be rethinking the default preference order?

> I fail to see how it could be beneficial on *any* non-buggy platform.
> Especially with small wal_buffers and larger commits (but also
> otherwise) it increases the amount of synchronous writes the os has to
> do tremendously.
> * It removes about all benefits of XLogBackgroundFlush()
> * It removes any chances of reordering after writing.
> * It makes AdvanceXLInsertBuffer synchronous if it has to write out
> What's the theory about placing it so high in the preferences list?

I think the original idea was that if you had a dedicated WAL drive then
sync-on-write would be reasonable. But that was a very long time ago
and I'm not sure that the system's behavior is anything like what it was
then; for that matter I'm not sure we had proof that it was an optimal
choice even back then. That's why I want to revisit the choice of
default and not just go for "minimum" change.

			regards, tom lane
On Friday 05 November 2010 22:53:37 Greg Smith wrote:
>> If open_dsync is so bad for performance on Linux, maybe it's bad
>> everywhere? Should we be rethinking the default preference order?
>
> And I've seen the expected sync write performance gain over fdatasync on
> a system with a battery-backed cache running VxFS on Linux, because
> working open_[d]sync means O_DIRECT writes bypassing the OS cache, and
> therefore reducing cache pollution from WAL writes.

Which looks like a setup where you definitely need to know what you are
doing. I.e. one that doesn't need wal_sync_method to default to
open_datasync...

Andres
> I think the original idea was that if you had a dedicated WAL drive then
> sync-on-write would be reasonable. But that was a very long time ago
> and I'm not sure that the system's behavior is anything like what it was
> then; for that matter I'm not sure we had proof that it was an optimal
> choice even back then. That's why I want to revisit the choice of
> default and not just go for "minimum" change.

What platforms do we need to test to get a reasonable idea? Solaris,
FreeBSD, Windows?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> What platforms do we need to test to get a reasonable idea? Solaris,
> FreeBSD, Windows?

At least. I'm hoping that Greg Smith will take the lead on testing
this, since he seems to have spent the most time in the area so far.

			regards, tom lane
On 11/5/10 3:31 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> What platforms do we need to test to get a reasonable idea? Solaris,
>> FreeBSD, Windows?
>
> At least. I'm hoping that Greg Smith will take the lead on testing
> this, since he seems to have spent the most time in the area so far.

I could test at least 1 version of Solaris, I think.

Greg, any recommendations on pgbench parameters?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Tom Lane wrote:
> I'm hoping that Greg Smith will take the lead on testing
> this, since he seems to have spent the most time in the area so far.

It's not coincidence that the chapter of my book I convinced the
publisher to release as a sample is the one that covers this area; this
mess has been visibly approaching for some time now. I'm going to put
RHEL6 onto a system and start collecting some proper slowdown numbers
this week, then pass along a suggested test regime for others.

--
Greg Smith   2ndQuadrant US
greg@2ndQuadrant.com   Baltimore, MD
Marti Raudsepp wrote:
> PostgreSQL's default settings change when built with Linux kernel
> headers 2.6.33 or newer. As discussed on the pgsql-performance list,
> this causes a significant performance regression:
> http://archives.postgresql.org/pgsql-performance/2010-10/msg00602.php
>
> NB! I am not proposing to change the default -- to the contrary --
> this patch restores old behavior.

Following our standard community development model, I've put this patch
onto our CommitFest list:

https://commitfest.postgresql.org/action/patch_view?id=432

and assigned myself as the reviewer. I didn't look at this until now
because I already had some patch development and review work to finish
before the CommitFest deadline we just crossed. Now I can go back to
reviewing other people's work.

P.S. There is no pgsql-patch list anymore; everything goes through the
hackers list now.

--
Greg Smith   2ndQuadrant US
greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
All,

So, this week I've had my hands on a medium-high-end test system where I
could test various wal_sync_methods. This is a 24-core Intel Xeon
machine with 72GB of RAM, and 8 internal 10K SAS disks attached to a
RAID controller with 512MB BBU write cache. 2 of the disks are in a
RAID1, which supports both an Ext4 partition and an XFS partition. The
remaining disks are in a RAID10 which only supports a single pgdata
partition. This is running on RHEL6, Linux kernel 2.6.32-71.el6.x86_64.

I think this kind of a system much better represents our users who are
performance-conscious than testing on people's laptops or on VMs does.

I modified test_fsync in two ways to run this; first, to make it support
O_DIRECT, and second to make it run in the *current* directory. I think
the second change should be permanent; I imagine that a lot of people
who are running test_fsync are not aware that they're actually testing
the performance of /var/tmp, not whatever FS mount they wanted to test.

Here's the results.
I think you'll agree that, at least on Linux, the benefits of o_sync
and o_dsync as defaults would be highly questionable. Particularly, it
seems that if O_DIRECT support is absent, fdatasync is across-the-board
faster:

=============

test_fsync with directIO, on 2 drives, XFS tuned:

Loops = 10000

Simple write:
	8k write                          198629.457/second

Compare file sync methods using one write:
	open_datasync 8k write             14798.263/second
	open_sync 8k write                 14316.864/second
	8k write, fdatasync                12198.871/second
	8k write, fsync                    12371.843/second

Compare file sync methods using two writes:
	2 open_datasync 8k writes           7362.805/second
	2 open_sync 8k writes               7156.685/second
	8k write, 8k write, fdatasync      10613.525/second
	8k write, 8k write, fsync          10597.396/second

Compare open_sync with different sizes:
	open_sync 16k write                13631.816/second
	2 open_sync 8k writes               7645.038/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close             11427.096/second
	8k write, close, fsync             11321.220/second

test_fsync with directIO, on 6 drives RAID10, XFS tuned:

Loops = 10000

Simple write:
	8k write                          196494.537/second

Compare file sync methods using one write:
	open_datasync 8k write             14909.974/second
	open_sync 8k write                 14559.326/second
	8k write, fdatasync                11046.025/second
	8k write, fsync                    11046.916/second

Compare file sync methods using two writes:
	2 open_datasync 8k writes           7349.223/second
	2 open_sync 8k writes               7667.395/second
	8k write, 8k write, fdatasync       9560.495/second
	8k write, 8k write, fsync           9557.287/second

Compare open_sync with different sizes:
	open_sync 16k write                12060.049/second
	2 open_sync 8k writes               7650.746/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close              9377.107/second
	8k write, close, fsync              9251.233/second

test_fsync without directIO on RAID1, Ext4, data=journal:

Loops = 10000

Simple write:
	8k write                          150514.005/second

Compare file sync methods using one write:
	open_datasync 8k write              4012.070/second
	open_sync 8k write                  5476.898/second
	8k write, fdatasync                 5512.649/second
	8k write, fsync                     5803.814/second

Compare file sync methods using two writes:
	2 open_datasync 8k writes           2910.401/second
	2 open_sync 8k writes               2817.377/second
	8k write, 8k write, fdatasync       5041.608/second
	8k write, 8k write, fsync           5155.248/second

Compare open_sync with different sizes:
	open_sync 16k write                 4895.956/second
	2 open_sync 8k writes               2720.875/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close              4724.052/second
	8k write, close, fsync              4694.776/second

test_fsync without directIO on RAID1, XFS, tuned:

Loops = 10000

Simple write:
	8k write                          199796.208/second

Compare file sync methods using one write:
	open_datasync 8k write             12553.525/second
	open_sync 8k write                 12535.978/second
	8k write, fdatasync                12268.298/second
	8k write, fsync                    12305.875/second

Compare file sync methods using two writes:
	2 open_datasync 8k writes           6323.835/second
	2 open_sync 8k writes               6285.169/second
	8k write, 8k write, fdatasync      10893.756/second
	8k write, 8k write, fsync          10752.607/second

Compare open_sync with different sizes:
	open_sync 16k write                11053.510/second
	2 open_sync 8k writes               6293.270/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close             11087.482/second
	8k write, close, fsync             11157.477/second

test_fsync without directIO on RAID10, 6 drives, XFS tuned:

Loops = 10000

Simple write:
	8k write                          197262.003/second

Compare file sync methods using one write:
	open_datasync 8k write             12784.699/second
	open_sync 8k write                 12684.512/second
	8k write, fdatasync                12404.547/second
	8k write, fsync                    12452.757/second

Compare file sync methods using two writes:
	2 open_datasync 8k writes           6376.587/second
	2 open_sync 8k writes               6364.113/second
	8k write, 8k write, fdatasync       9895.699/second
	8k write, 8k write, fsync           9866.886/second

Compare open_sync with different sizes:
	open_sync 16k write                10156.491/second
	2 open_sync 8k writes               6400.889/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close             11142.620/second
	8k write, close, fsync             11076.393/second

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
All, While I have this machine available I've been trying to run some performance tests using pgbench and various wal_sync_methods. However, I seem to be maxing out at the speed of pgbench itself; no matter which wal_sync_method I use (including "fsync"), it tops out at around 2750 TPS. Of course, it's also possible that the wal_sync_method does not in fact make a difference in throughput. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus wrote:
> I modified test_fsync in two ways to run this; first, to make it support
> O_DIRECT, and second to make it run in the *current* directory.

Patch please? I agree with the latter change; what test_fsync does is
surprising.

I suggested a while ago that we refactor test_fsync to use a common set
of source code as the database itself for detecting things related to
wal_sync_method, perhaps just extract that whole set of DEFINE macro
logic to somewhere else. That happened at a bad time in the development
cycle (right before a freeze) and nobody ever got back to the idea
afterwards. If this code is getting touched, and it's clear it is in
some direction, I'd like to see things change so it's not possible for
the two to diverge again afterwards.

--
Greg Smith   2ndQuadrant US
greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support   www.2ndQuadrant.us
On 12/5/10 2:12 PM, Greg Smith wrote:
> Josh Berkus wrote:
>> I modified test_fsync in two ways to run this; first, to make it support
>> O_DIRECT, and second to make it run in the *current* directory.
>
> Patch please? I agree with the latter change; what test_fsync does is
> surprising.

Attached. Making it support O_DIRECT would be possible but more
complex; I don't see the point unless we think we're going to have
open_sync_with_odirect as a separate option.

> I suggested a while ago that we refactor test_fsync to use a common set
> of source code as the database itself for detecting things related to
> wal_sync_method, perhaps just extract that whole set of DEFINE macro
> logic to somewhere else. That happened at a bad time in the development
> cycle (right before a freeze) and nobody ever got back to the idea
> afterwards. If this code is getting touched, and it's clear it is in
> some direction, I'd like to see things change so it's not possible for
> the two to diverge again afterwards.

I don't quite follow you. Maybe nobody else did last time, either.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Attachments
Josh Berkus <josh@agliodbs.com> writes:
> Making it support O_DIRECT would be possible but more complex; I don't
> see the point unless we think we're going to have open_sync_with_odirect
> as a separate option.

Whether it's complex or not isn't really the issue. The issue is that
what test_fsync is testing had better match what the backend does, or
people will be making choices based on not-comparable test results.
I think we should have test_fsync just automatically fold in O_DIRECT
the same way the backend does.

			regards, tom lane
On 12/06/2010 08:38 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Making it support O_DIRECT would be possible but more complex; I don't
>> see the point unless we think we're going to have open_sync_with_odirect
>> as a separate option.
>
> Whether it's complex or not isn't really the issue. The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.

Indeed. We were quite confused for a while when we were dealing with
this about a week ago, and my handwritten test program failed as
expected but test_fsync didn't. Anything other than behaving just as
the backend does violates POLA, in my view.

cheers

andrew
> Whether it's complex or not isn't really the issue. The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.

OK, patch coming then. Right now test_fsync aborts when O_DIRECT
fails. What should I have it do instead?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
> What should I have it do instead?

Report that it fails, and keep testing the other methods.

			regards, tom lane
On 12/6/10 6:13 PM, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails.
>> What should I have it do instead?
>
> Report that it fails, and keep testing the other methods.

Patch attached. Includes a fair amount of comment cleanup, since
existing comments did not meet our current project standards. Tests all
6 of the methods we support separately.

Some questions, though:

(1) Why are we doing the open_sync different-size write test? AFAIK,
this doesn't match any behavior which PostgreSQL has.

(2) In this patch, I'm stepping down the number of loops which
fsync_writethrough does by 90%. The reason for that was that on the
platforms where I tested writethrough (desktop machines), doing 10,000
loops took 15-20 *minutes*, which seems hard on the user. Would be easy
to revert if you think it's a bad idea. Possibly auto-sizing the number
of loops based on the first fsync test might be a good idea, but seems
like going a bit too far.

(3) Should the multi-descriptor test be using writethrough on platforms
which support it?

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Attachments
Josh Berkus wrote:
> Some questions, though:
>
> (1) Why are we doing the open_sync different-size write test? AFAIK,
> this doesn't match any behavior which PostgreSQL has.

I did that so we could see the impact of doing 2 8k writes that were
both fsync'ed vs doing one 16k write and then fsync:

	Compare open_sync with different sizes:
		open_sync 16k write             201.323/second
		2 open_sync 8k writes           332.466/second

We often write multiple 8k WAL pages and then fsync on commit.

> (2) In this patch, I'm stepping down the number of loops which
> fsync_writethrough does by 90%. The reason for that was that on the
> platforms where I tested writethrough (desktop machines), doing 10,000
> loops took 15-20 *minutes*, which seems hard on the user. Would be easy
> to revert if you think it's a bad idea.
> Possibly auto-sizing the number of loops based on the first fsync test
> might be a good idea, but seems like going a bit too far.

Sure, I recently increased the number, probably too much.

> (3) Should the multi-descriptor test be using writethrough on platforms
> which support it?

Uh, I didn't think that would matter because the test is to test kernel
behavior of writing to one file descriptor and fsyncing using another.

--
Bruce Momjian <bruce@momjian.us>	http://momjian.us
EnterpriseDB				http://enterprisedb.com
+ It's impossible for everything to be true. +
Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Making it support O_DIRECT would be possible but more complex; I don't
>> see the point unless we think we're going to have open_sync_with_odirect
>> as a separate option.
>
> Whether it's complex or not isn't really the issue. The issue is that
> what test_fsync is testing had better match what the backend does, or
> people will be making choices based on not-comparable test results.
> I think we should have test_fsync just automatically fold in O_DIRECT
> the same way the backend does.

The problem is that O_DIRECT was not implemented in macros but rather
down in the code:

	if (!XLogIsNeeded() && !am_walreceiver)
		o_direct_flag = PG_O_DIRECT;

Which means if we just export the macros, we would still not have caught
this. I would like to share all the defines --- I am just saying it
isn't trivial.

--
Bruce Momjian <bruce@momjian.us>	http://momjian.us
EnterpriseDB				http://enterprisedb.com
+ It's impossible for everything to be true. +
> Which means if we just export the macros, we would still not have caught
> this. I would like to share all the defines --- I am just saying it
> isn't trivial.

I just called all the define variables manually rather than relying on
the macros. Seemed to work fine.

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Greg, All:

Results for Solaris 10u8, on ZFS on a 7-drive attached storage array:

bash-3.00# ./test_fsync -f /dbdata/pgdata/test.out
Loops = 10000

Simple write:
	8k write                           59988.002/second

Compare file sync methods using one write:
	open_datasync 8k write               214.125/second  (unavailable: o_direct)
	open_sync 8k write                   222.155/second  (unavailable: o_direct)
	8k write, fdatasync                  214.086/second
	8k write, fsync                      215.035/second
	(unavailable: fsync_writethrough)

Compare file sync methods using two writes:
	2 open_datasync 8k writes            108.227/second  (unavailable: o_direct)
	2 open_sync 8k writes                106.935/second  (unavailable: o_direct)
	8k write, 8k write, fdatasync        205.525/second
	8k write, 8k write, fsync            210.483/second
	(unavailable: fsync_writethrough)

Compare open_sync with different sizes:
	open_sync 16k write                  211.481/second
	2 open_sync 8k writes                106.202/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
	8k write, fsync, close               207.499/second
	8k write, close, fsync               213.656/second

--
Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com
Josh Berkus wrote: > On 12/6/10 6:13 PM, Tom Lane wrote: > > Josh Berkus <josh@agliodbs.com> writes: > >> OK, patch coming then. Right now test_fsync aborts when O_DIRECT fails. > >> What should I have it do instead? > > > > Report that it fails, and keep testing the other methods. > > Patch attached. Includes a fair amount of comment cleanup, since > existing comments did not meet our current project standards. Tests all > 6 of the methods we support separately. > > Some questions, though: > > (1) Why are we doing the open_sync different-size write test? AFAIK, > this doesn't match any behavior which PostgreSQL has. I added program output to explain this. > (2) In this patch, I'm stepping down the number of loops which > fsync_writethrough does by 90%. The reason for that was that on the > platforms where I tested writethrough (desktop machines), doing 10,000 > loops took 15-20 *minutes*, which seems hard on the user. Would be easy > to revert if you think it's a bad idea. > Possibly auto-sizing the number of loops based on the first fsync test > might be a good idea, but seems like going a bit too far. I did not know why writethough we always be much slower than other sync methods so I just reduced the loop count to 2k. > (3) Should the multi-descriptor test be using writethrough on platforms > which support it? Thank you for your patch. I have applied most of it, attached. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. + diff --git a/src/tools/fsync/Makefile b/src/tools/fsync/Makefile index fe3e626..44419ee 100644 *** /tmp/pgdiff.17122/M3047c_Makefile Sat Jan 15 11:53:43 2011 --- src/tools/fsync/Makefile Fri Jan 14 22:35:30 2011 *************** override CPPFLAGS := -I$(libpq_srcdir) $ *** 16,24 **** OBJS= test_fsync.o ! all: test_fsync ! 
test_fsync: test_fsync.o | submake-libpq submake-libpgport $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) clean distclean maintainer-clean: --- 16,24 ---- OBJS= test_fsync.o ! all: submake-libpq submake-libpgport test_fsync ! test_fsync: test_fsync.o $(libpq_builddir)/libpq.a $(CC) $(CFLAGS) test_fsync.o $(libpq_pgport) $(LDFLAGS) $(LDFLAGS_EX) $(LIBS) -o $@$(X) clean distclean maintainer-clean: diff --git a/src/tools/fsync/README b/src/tools/fsync/README index 6d9acd3..a1c2ae4 100644 *** /tmp/pgdiff.17122/wFeqrc_README Sat Jan 15 11:53:43 2011 --- src/tools/fsync/README Sat Jan 15 11:34:58 2011 *************** *** 1,11 **** ! src/tools/fsync/README ! ! fsync ! ===== This program tests fsync. The tests are described as part of the program output. Usage: test_fsync [-f filename] [loops] - Loops defaults to 5000. The default output file is /var/tmp/test_fsync.out. - Consider that /tmp or /var/tmp might be memory-based file systems. --- 1,20 ---- ! test_fsync ! ========== This program tests fsync. The tests are described as part of the program output. Usage: test_fsync [-f filename] [loops] + + test_fsync is intended to give you a reasonable idea of what the fastest + fsync_method is on your specific system, as well as supplying diagnostic + information in the event of an identified I/O problem. However, + differences shown by test_fsync might not make any difference in real + database throughput, especially since many database servers are not + speed-limited by their transaction logs. + + The output filename defaults to test_fsync.out in the current directory. + test_fsync should be run in the same filesystem as your transaction log + directory (pg_xlog). + + Loops default to 2000. Increase this to get more accurate measurements. 
diff --git a/src/tools/fsync/test_fsync.c b/src/tools/fsync/test_fsync.c
index 28c2119..831e2a0 100644
*** /tmp/pgdiff.17122/OSeljb_test_fsync.c	Sat Jan 15 11:53:43 2011
--- src/tools/fsync/test_fsync.c	Sat Jan 15 11:52:03 2011
***************
*** 3,9 ****
   *
   *
   *	test_fsync.c
!  *		test various fsync() methods
   */
  
  #include "postgres.h"
--- 3,9 ----
   *
   *
   *	test_fsync.c
!  *		tests all supported fsync() methods
   */
  
  #include "postgres.h"
***************
*** 22,40 ****
  #include <unistd.h>
  #include <string.h>
  
! 
! #ifdef WIN32
  #define FSYNC_FILENAME	"./test_fsync.out"
- #else
- /* /tmp might be a memory file system */
- #define FSYNC_FILENAME	"/var/tmp/test_fsync.out"
- #endif
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! int		loops = 10000;
  
  void	die(char *str);
  void	print_elapse(struct timeval start_t, struct timeval stop_t);
--- 22,39 ----
  #include <unistd.h>
  #include <string.h>
  
! /*
!  * put the temp files in the local directory
!  * unless the user specifies otherwise
!  */
  #define FSYNC_FILENAME	"./test_fsync.out"
  
  #define WRITE_SIZE	(8 * 1024)	/* 8k */
  
  #define LABEL_FORMAT	"\t%-30s"
  
! 
! int		loops = 2000;
  
  void	die(char *str);
  void	print_elapse(struct timeval start_t, struct timeval stop_t);
*************** void print_elapse(struct timeval start_
*** 42,55 ****
  int
  main(int argc, char *argv[])
  {
! 	struct timeval start_t;
! 	struct timeval stop_t;
! 	int		tmpfile, i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
! 			   *buf;
! 	char	   *filename = FSYNC_FILENAME;
  
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
--- 41,54 ----
  int
  main(int argc, char *argv[])
  {
! 	struct timeval start_t, stop_t;
! 	int		tmpfile, i;
  	char	   *full_buf = (char *) malloc(XLOG_SEG_SIZE),
! 			   *buf, *filename = FSYNC_FILENAME;
  
+ 	/*
+ 	 * arguments: loops and filename (optional)
+ 	 */
  	if (argc > 2 && strcmp(argv[1], "-f") == 0)
  	{
  		filename = argv[2];
*************** main(int argc, char *argv[])
*** 57,73 ****
  		argc -= 2;
  	}
! 
  	if (argc > 1)
  		loops = atoi(argv[1]);
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/* fsync now so later fsync's don't have to do it */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
--- 56,77 ----
  		argc -= 2;
  	}
! 
  	if (argc > 1)
  		loops = atoi(argv[1]);
  
  	for (i = 0; i < XLOG_SEG_SIZE; i++)
  		full_buf[i] = random();
  
+ 	/*
+ 	 * test if we can open the target file
+ 	 */
  	if ((tmpfile = open(filename, O_RDWR | O_CREAT, S_IRUSR | S_IWUSR)) == -1)
  		die("Cannot open output file.");
  	if (write(tmpfile, full_buf, XLOG_SEG_SIZE) != XLOG_SEG_SIZE)
  		die("write failed");
! 	/*
! 	 * fsync now so that dirty buffers don't skew later tests
! 	 */
  	if (fsync(tmpfile) != 0)
  		die("fsync failed");
  	close(tmpfile);
*************** main(int argc, char *argv[])
*** 77,83 ****
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Simple write
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
--- 81,87 ----
  	printf("Loops = %d\n\n", loops);
  
  	/*
! 	 * Test a simple write without fsync
  	 */
  	printf("Simple write:\n");
  	printf(LABEL_FORMAT, "8k write");
*************** main(int argc, char *argv[])
*** 95,104 ****
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with one 8k write
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
--- 99,111 ----
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Test all fsync methods using single 8k writes
  	 */
  	printf("\nCompare file sync methods using one write:\n");
  
+ 	/*
+ 	 * Test open_datasync if available
+ 	 */
  #ifdef OPEN_DATASYNC_FLAG
  	printf(LABEL_FORMAT, "open_datasync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 115,124 ****
--- 122,161 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_datasync
+ 	 */
+ #if PG_O_DIRECT != 0
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 		printf("\t(unavailable: o_direct on this filesystem)\n");
+ 	else
+ 	{
+ 		printf(LABEL_FORMAT, "open_datasync 8k direct I/O write");
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ 	/*
+ 	 * Test open_sync if available
+ 	 */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 8k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 135,144 ****
--- 172,211 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If O_DIRECT is enabled, test that with open_sync
+ 	 */
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "open_sync 8k direct I/O write");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+ 		printf("\t(unavailable: o_direct on this filesystem)\n");
+ 	else
+ 	{
+ 		gettimeofday(&start_t, NULL);
+ 		for (i = 0; i < loops; i++)
+ 		{
+ 			if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 				die("write failed");
+ 			if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 				die("seek failed");
+ 		}
+ 		gettimeofday(&stop_t, NULL);
+ 		close(tmpfile);
+ 		print_elapse(start_t, stop_t);
+ 	}
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ 	/*
+ 	 * Test fdatasync if available
+ 	 */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 160,165 ****
--- 227,235 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ 	/*
+ 	 * Test fsync
+ 	 */
  	printf(LABEL_FORMAT, "8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 177,190 ****
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
  
  	/*
! 	 * Compare file sync methods with two 8k write
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
  #ifdef OPEN_DATASYNC_FLAG
! 	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
  		die("Cannot open output file.");
--- 247,289 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * If fsync_writethrough is available, test as well
+ 	 */
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
! 	 * Compare some of the file sync methods with
! 	 * two 8k writes to see if timing is different
  	 */
  	printf("\nCompare file sync methods using two writes:\n");
  
+ 	/*
+ 	 * Test open_datasync with and without o_direct
+ 	 */
  #ifdef OPEN_DATASYNC_FLAG
! 	printf(LABEL_FORMAT, "2 open_datasync 8k writes");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR | O_DSYNC, 0)) == -1)
  		die("Cannot open output file.");
*************** main(int argc, char *argv[])
*** 201,210 ****
--- 300,335 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "2 open_datasync direct I/O 8k writes");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | O_DSYNC | PG_O_DIRECT, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_datasync)\n");
  #endif
  
+ 	/*
+ 	 * Test open_sync with and without o_direct
+ 	 */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "2 open_sync 8k writes");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 223,230 ****
--- 348,383 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ #if PG_O_DIRECT != 0
+ 	printf(LABEL_FORMAT, "2 open_sync direct I/O 8k writes");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR | OPEN_SYNC_FLAG | PG_O_DIRECT, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: o_direct)\n");
+ #endif
+ 
  #else
  	printf("\t(unavailable: open_sync)\n");
  #endif
  
+ 	/*
+ 	 * Test fdatasync
+ 	 */
  #ifdef HAVE_FDATASYNC
  	printf(LABEL_FORMAT, "8k write, 8k write, fdatasync");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 248,253 ****
--- 401,409 ----
  	printf("\t(unavailable: fdatasync)\n");
  #endif
  
+ 	/*
+ 	 * Test basic fsync
+ 	 */
  	printf(LABEL_FORMAT, "8k write, 8k write, fsync");
  	fflush(stdout);
  	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
*************** main(int argc, char *argv[])
*** 267,278 ****
--- 423,466 ----
  	gettimeofday(&stop_t, NULL);
  	close(tmpfile);
  	print_elapse(start_t, stop_t);
+ 
+ 	/*
+ 	 * Test fsync_writethrough if available
+ 	 */
+ #ifdef HAVE_FSYNC_WRITETHROUGH
+ 	printf(LABEL_FORMAT, "8k write, 8k write, fsync_writethrough");
+ 	fflush(stdout);
+ 	if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
+ 		die("Cannot open output file.");
+ 	gettimeofday(&start_t, NULL);
+ 	for (i = 0; i < loops; i++)
+ 	{
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (write(tmpfile, buf, WRITE_SIZE) != WRITE_SIZE)
+ 			die("write failed");
+ 		if (fcntl(tmpfile, F_FULLFSYNC) != 0)
+ 			die("fsync failed");
+ 		if (lseek(tmpfile, 0, SEEK_SET) == -1)
+ 			die("seek failed");
+ 	}
+ 	gettimeofday(&stop_t, NULL);
+ 	close(tmpfile);
+ 	print_elapse(start_t, stop_t);
+ #else
+ 	printf("\t(unavailable: fsync_writethrough)\n");
+ #endif
  
  	/*
  	 * Compare 1 to 2 writes
  	 */
  	printf("\nCompare open_sync with different sizes:\n");
+ 	printf("(This is designed to compare the cost of one large\n");
+ 	printf("sync'ed write and two smaller sync'ed writes.)\n");
  
+ 	/*
+ 	 * Test open_sync with different size files
+ 	 */
  #ifdef OPEN_SYNC_FLAG
  	printf(LABEL_FORMAT, "open_sync 16k write");
  	fflush(stdout);
*************** main(int argc, char *argv[])
*** 312,323 ****
  #endif
  
  	/*
! 	 * Fsync another file descriptor?
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 500,519 ----
  #endif
  
  	/*
! 	 * Test whether fsync can sync data written on a different
! 	 * descriptor for the same file.  This checks the efficiency
! 	 * of multi-process fsyncs against the same file.
! 	 * Possibly this should be done with writethrough on platforms
! 	 * which support it.
  	 */
  	printf("\nTest if fsync on non-write file descriptor is honored:\n");
  	printf("(If the times are similar, fsync() can sync data written\n");
  	printf("on a different descriptor.)\n");
  
+ 	/*
+ 	 * first write, fsync and close, which is the
+ 	 * normal behavior without multiple descriptors
+ 	 */
  	printf(LABEL_FORMAT, "8k write, fsync, close");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 330,343 ****
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
- 		/* do nothing but the open/close the tests are consistent. */
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
  	printf(LABEL_FORMAT, "8k write, close, fsync");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
--- 526,547 ----
  		if (fsync(tmpfile) != 0)
  			die("fsync failed");
  		close(tmpfile);
+ 		/*
+ 		 * open and close the file again to be consistent
+ 		 * with the following test
+ 		 */
  		if ((tmpfile = open(filename, O_RDWR, 0)) == -1)
  			die("Cannot open output file.");
  		close(tmpfile);
  	}
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
+ 	/*
+ 	 * Now open, write, close, open again and fsync
+ 	 * This simulates processes fsyncing each other's
+ 	 * writes.
+ 	 */
  	printf(LABEL_FORMAT, "8k write, close, fsync");
  	fflush(stdout);
  	gettimeofday(&start_t, NULL);
*************** main(int argc, char *argv[])
*** 358,375 ****
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/* cleanup */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
- 	/* usec subtraction might be negative, e.g. 5.4 - 4.8 */
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;
--- 562,583 ----
  	gettimeofday(&stop_t, NULL);
  	print_elapse(start_t, stop_t);
  
! 	/*
! 	 * cleanup
! 	 */
  	free(full_buf);
  	unlink(filename);
  
  	return 0;
  }
  
+ /*
+  * print out the writes per second for tests
+  */
  void
  print_elapse(struct timeval start_t, struct timeval stop_t)
  {
  	double		total_time = (stop_t.tv_sec - start_t.tv_sec) +
  	(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
  	double		per_second = loops / total_time;
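One note on the print_elapse() arithmetic in the final hunk: the patch drops the comment explaining why the usec subtraction is safe even when negative. The same computation as a standalone sketch (the function name is mine, not the patch's):

```c
#include <sys/time.h>

/*
 * Seconds between two gettimeofday() readings, computed the way
 * print_elapse() does.  The tv_usec subtraction may be negative
 * (e.g. 5.4s - 4.8s gives a tv_sec diff of 1 and a tv_usec diff
 * of -400000), but scaling it by 1e-6 and adding still yields
 * the correct 0.6 second total.
 */
double
elapsed_seconds(struct timeval start_t, struct timeval stop_t)
{
	return (stop_t.tv_sec - start_t.tv_sec) +
		(stop_t.tv_usec - start_t.tv_usec) * 0.000001;
}
```

The writes-per-second figure the program reports is then just loops divided by this value.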