Discussion: Re: [HACKERS] fsync method checking
Kurt Roeckx wrote:
> On Thu, Mar 18, 2004 at 01:50:32PM -0500, Bruce Momjian wrote:
> > > I'm not sure I believe these numbers at all... my experience is that
> > > getting trustworthy disk I/O numbers is *not* easy.
> >
> > These numbers were reproducible on all the platforms I tested.
>
> It's not because they are reproducible that they mean anything in
> the real world.

OK, what better test do you suggest?  Right now, there has been no
testing of these.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Kurt Roeckx <Q@ping.be> writes:
> I have no idea what the access pattern is for normal WAL
> operations or how many times it gets synched.  Does it only do
> f(data)sync() at commit time, or for every block it writes?

If we are using fsync/fdatasync, we issue those at commit time or when
completing a WAL segment.  If we are using the open flags, then of course
there's no separate sync call.

My previous point about checking different fsync spacings corresponds to
different assumptions about average transaction size.  I think a useful
tool for determining wal_sync_method has got to be able to reflect that
range of possibilities.

			regards, tom lane

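Editor's sketch: one way to picture "different fsync spacings" is to write a fixed number of 8k blocks and vary how many writes happen between sync calls, so that a spacing of 1 stands in for many tiny transactions and a larger spacing for bigger ones. The code below is only an illustration under those assumptions. It is not the src/tools/fsync program from the thread, and the function names, block counts, and spacings are made up for the example.

```python
import os
import tempfile
import time

BLOCK_SIZE = 8192  # assumed block size, matching the 8k writes discussed in the thread


def time_sync_spacing(path, total_blocks, blocks_per_sync, sync=None):
    """Write total_blocks blocks, syncing after every blocks_per_sync writes.

    Returns (elapsed_seconds, number_of_sync_calls).
    """
    if sync is None:
        # fdatasync is not available on every platform; fall back to fsync.
        sync = getattr(os, "fdatasync", os.fsync)
    block = b"\0" * BLOCK_SIZE
    sync_calls = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for i in range(1, total_blocks + 1):
            os.write(fd, block)
            if i % blocks_per_sync == 0:
                sync(fd)  # the simulated "commit"
                sync_calls += 1
        if total_blocks % blocks_per_sync != 0:
            # Final sync for a trailing partial group of writes.
            sync(fd)
            sync_calls += 1
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed, sync_calls


def compare_spacings(directory, total_blocks=64, spacings=(1, 4, 16)):
    """Time the same amount of data at several sync spacings."""
    results = {}
    for spacing in spacings:
        fd, path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        try:
            results[spacing] = time_sync_spacing(path, total_blocks, spacing)
        finally:
            os.unlink(path)
    return results
```

Calling compare_spacings() on a directory that lives on the filesystem holding pg_xlog gives one timing per spacing. The absolute numbers depend heavily on drive write caching, so only the relative ordering on a given machine is likely to be meaningful.
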
Tom, Bruce,

> My previous point about checking different fsync spacings corresponds to
> different assumptions about average transaction size.  I think a useful
> tool for determining wal_sync_method has got to be able to reflect that
> range of possibilities.

Questions:
1) This is an OSS project.  Why not just recruit a bunch of people on
PERFORMANCE and GENERAL to test the 4 different synch methods using real
databases?  No test like reality, I say ....

2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
re-evaluate synching anyway?

-- 
-Josh Berkus
 Aglio Database Solutions
 San Francisco

Josh Berkus wrote:
> Tom, Bruce,
>
> > My previous point about checking different fsync spacings corresponds to
> > different assumptions about average transaction size.  I think a useful
> > tool for determining wal_sync_method has got to be able to reflect that
> > range of possibilities.
>
> Questions:
> 1) This is an OSS project.  Why not just recruit a bunch of people on
> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> databases?  No test like reality, I say ....

Well, I wrote the program to allow testing.  I don't see a complex test
as being that much better than a simple one.  We don't need accurate
numbers.  We just need to know if fsync or O_SYNC is faster.

> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
> re-evaluate synching anyway?

No, it should not change sync issues.

-- 
  Bruce Momjian

Josh Berkus <josh@agliodbs.com> writes:
> 1) This is an OSS project.  Why not just recruit a bunch of people on
> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> databases?  No test like reality, I say ....

I agree --- that is likely to yield *far* more useful results than
any standalone test program, for the purpose of finding out what
wal_sync_method to use in real databases.  However, there's a second
issue here: we would like to move sync/checkpoint responsibility into
the bgwriter, and that requires knowing whether it's valid to let one
process fsync on behalf of writes that were done by other processes.
That's got nothing to do with WAL sync performance.  I think that it
would be sensible to make a test program that focuses on this one
specific question.  (There has been some handwaving to the effect that
everybody knows this is safe on Unixen, but I question whether the
handwavers have seen the internals of HPUX or AIX for instance; and
besides we need to worry about Windows now.)

A third reason for having a simple test program is to confirm whether
your drives are syncing at all (cf. hdparm discussion).

> 2) Won't Jan's work on 7.5 memory and I/O management mean that we have to
> re-evaluate synching anyway?

So far nothing's been done that touches WAL writing.  However, I am
thinking about making the bgwriter process take some of the load of
writing WAL buffers (right now it only writes data-file buffers).
And you're right, after that happens we will need to re-measure.
The open flags will probably become considerably more attractive than
they are now, if the bgwriter handles most non-commit writes of WAL.
(We might also think of letting the bgwriter use a different sync method
than the backends do.)

			regards, tom lane

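Editor's sketch: the "fsync on behalf of another writer" question can be approached the way the test output later in this thread does, by comparing the time for write, fsync, close against write, close, fsync on a freshly opened descriptor. The code below is a rough, single-process approximation of that comparison. The function names and loop count are assumptions made for the example, similar timings are only a hint and not proof of durability, and a real cross-process check would need the write and the fsync to happen in separate processes.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 8192  # one 8k block, as in the thread's test output


def write_fsync_close(path):
    """Sync through the same descriptor that did the write."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, BLOCK)
        os.fsync(fd)
    finally:
        os.close(fd)


def write_close_fsync(path):
    """Write and close, then sync through a freshly opened descriptor."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(fd, BLOCK)
    finally:
        os.close(fd)
    fd2 = os.open(path, os.O_WRONLY)
    try:
        os.fsync(fd2)  # this descriptor never wrote anything itself
    finally:
        os.close(fd2)


def time_variant(func, path, loops=100):
    """Return the total seconds taken to run func(path) loops times."""
    start = time.perf_counter()
    for _ in range(loops):
        func(path)
    return time.perf_counter() - start


def run_comparison(directory=None, loops=100):
    """Time both variants on a scratch file inside directory."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        return {
            "write, fsync, close": time_variant(write_fsync_close, path, loops),
            "write, close, fsync": time_variant(write_close_fsync, path, loops),
        }
    finally:
        os.unlink(path)
```

If the second variant came out much faster than the first, that would suggest the fsync on the new descriptor is not flushing the earlier write. If both are implausibly fast for a physical disk, the drive may not be syncing at all, which is the third use Tom mentions.
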
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Well, I wrote the program to allow testing.  I don't see a complex test
> as being that much better than a simple one.  We don't need accurate
> numbers.  We just need to know if fsync or O_SYNC is faster.

Faster than what?  The thing everyone is trying to point out here is
that it depends on context, and we have little faith that this test
program creates a context similar to a live Postgres database.

			regards, tom lane

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Well, I wrote the program to allow testing.  I don't see a complex test
> > as being that much better than a simple one.  We don't need accurate
> > numbers.  We just need to know if fsync or O_SYNC is faster.
>
> Faster than what?  The thing everyone is trying to point out here is
> that it depends on context, and we have little faith that this test
> program creates a context similar to a live Postgres database.

Note, too, that the preferred method isn't likely to depend just on the
operating system, it's likely to depend also on the filesystem type
being used.

Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
and that's just off the top of my head.  I imagine the performance of
the various syncing methods will vary significantly between them.

It seems reasonable to me that decisions such as which sync method to
use should initially be made at installation time: have the test program
run on the target filesystem as part of the installation process, and
build the initial postgresql.conf based on the results.  You might even
be able to do some additional testing such as measuring the difference
between random block access and sequential access, and again feed the
results into the postgresql.conf file.  This is no substitute for
experience with the platform, but I expect it's likely to get you closer
to something optimal than doing nothing.  The only question, of course,
is whether or not it's worth going to the effort when it may or may not
gain you a whole lot.  Answering that is going to require some
experimentation with such an automatic configuration system.

-- 
Kevin Brown					      kevin@sysexperts.com

I wrote:
> Note, too, that the preferred method isn't likely to depend just on the
> operating system, it's likely to depend also on the filesystem type
> being used.
>
> Linux provides quite a few of them: ext2, ext3, jfs, xfs, and reiserfs,
> and that's just off the top of my head.  I imagine the performance of
> the various syncing methods will vary significantly between them.

For what it's worth, my database throughput for transactions involving
a lot of inserts, updates, and deletes is about 12% faster using
fdatasync() than O_SYNC under Linux using JFS.

I'll run the test program and report my results with it as well, so
we'll be able to see if there's any consistency between it and the live
database.

-- 
Kevin Brown					      kevin@sysexperts.com

On Thu, Mar 18, 2004 at 02:22:10PM -0500, Bruce Momjian wrote:
>
> OK, what better test do you suggest?  Right now, there has been no
> testing of these.

I suggest you start by at least preallocating a 16 MB file and doing
the tests on that, to be at least somewhat similar to what WAL does.

I have no idea what the access pattern is for normal WAL
operations or how many times it gets synced.  Does it only do
f(data)sync() at commit time, or for every block it writes?

I think if you write more data you'll see more differences
between O_(D)SYNC and f(data)sync().

I guess it can depend on whether you have lots of small transactions
or fewer, bigger ones.

At least try to make something that covers different access
patterns.

Kurt

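Editor's sketch: Kurt's suggestion amounts to zero-filling a 16 MB file first, so that the timed writes overwrite blocks that already exist instead of extending the file, and then comparing an O_SYNC descriptor against write plus fdatasync on that file. The code below illustrates the idea only. The helper names are invented for the example, the O_SYNC flag is looked up defensively because it is not defined on every platform, and on a platform without it the "O_SYNC" path degrades to plain unsynced writes.

```python
import os
import tempfile
import time

SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB, the file size suggested in the thread
BLOCK_SIZE = 8192


def preallocate(path, size=SEGMENT_SIZE):
    """Zero-fill the file so later writes overwrite existing blocks."""
    chunk = b"\0" * (1024 * 1024)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        remaining = size
        while remaining > 0:
            written = os.write(fd, chunk[:min(len(chunk), remaining)])
            remaining -= written
        os.fsync(fd)  # make sure the allocation itself is on disk
    finally:
        os.close(fd)


def time_overwrite(path, n_blocks, use_o_sync):
    """Overwrite the first n_blocks blocks; return elapsed seconds.

    With use_o_sync the file is opened with O_SYNC and no explicit sync
    call is made; otherwise each write is followed by fdatasync (or by
    fsync where fdatasync is missing).
    """
    flags = os.O_WRONLY
    if use_o_sync:
        flags |= getattr(os, "O_SYNC", 0)
    datasync = getattr(os, "fdatasync", os.fsync)
    block = b"x" * BLOCK_SIZE
    fd = os.open(path, flags)
    try:
        start = time.perf_counter()
        for _ in range(n_blocks):
            os.write(fd, block)
            if not use_o_sync:
                datasync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def run_comparison(directory=None, n_blocks=256):
    """Preallocate a scratch segment in directory and time both methods."""
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        preallocate(path)
        return {
            "open o_sync, write": time_overwrite(path, n_blocks, True),
            "write, fdatasync": time_overwrite(path, n_blocks, False),
        }
    finally:
        os.unlink(path)
```

Because the file length never changes during the timed loop, fdatasync has no size update to flush, which is where it would be expected to differ from fsync on systems that implement the two calls separately.
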
On 18 Mar, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> 1) This is an OSS project.  Why not just recruit a bunch of people on
>> PERFORMANCE and GENERAL to test the 4 different synch methods using real
>> databases?  No test like reality, I say ....
>
> I agree --- that is likely to yield *far* more useful results than
> any standalone test program, for the purpose of finding out what
> wal_sync_method to use in real databases.  However, there's a second
> issue here: we would like to move sync/checkpoint responsibility into
> the bgwriter, and that requires knowing whether it's valid to let one
> process fsync on behalf of writes that were done by other processes.
> That's got nothing to do with WAL sync performance.  I think that it
> would be sensible to make a test program that focuses on this one
> specific question.  (There has been some handwaving to the effect that
> everybody knows this is safe on Unixen, but I question whether the
> handwavers have seen the internals of HPUX or AIX for instance; and
> besides we need to worry about Windows now.)

I could certainly do some testing if you want to see how DBT-2 does.
Just tell me what to do. ;)

Mark

markw@osdl.org writes:
> I could certainly do some testing if you want to see how DBT-2 does.
> Just tell me what to do. ;)

Just do some runs that are identical except for the wal_sync_method
setting.  Note that this should not have any impact on SELECT
performance, only insert/update/delete performance.

			regards, tom lane

markw@osdl.org wrote:
> On 18 Mar, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> 1) This is an OSS project.  Why not just recruit a bunch of people on
> >> PERFORMANCE and GENERAL to test the 4 different synch methods using real
> >> databases?  No test like reality, I say ....
> >
> > I agree --- that is likely to yield *far* more useful results than
> > any standalone test program, for the purpose of finding out what
> > wal_sync_method to use in real databases.  However, there's a second
> > issue here: we would like to move sync/checkpoint responsibility into
> > the bgwriter, and that requires knowing whether it's valid to let one
> > process fsync on behalf of writes that were done by other processes.
> > That's got nothing to do with WAL sync performance.  I think that it
> > would be sensible to make a test program that focuses on this one
> > specific question.  (There has been some handwaving to the effect that
> > everybody knows this is safe on Unixen, but I question whether the
> > handwavers have seen the internals of HPUX or AIX for instance; and
> > besides we need to worry about Windows now.)
>
> I could certainly do some testing if you want to see how DBT-2 does.
> Just tell me what to do. ;)

To test, you would run src/tools/fsync from the CVS version, find the
fastest fsync method in the last group of outputs, then try the
wal_sync_method setting to see if the one that tools/fsync says is
fastest is actually fastest.  However, it might be better to run your
tests first, get some indication of how frequently writes and fsyncs
are going to WAL, and modify tools/fsync to match what your DBT-2 test
does.

-- 
  Bruce Momjian

On 22 Mar, Tom Lane wrote:
> markw@osdl.org writes:
>> I could certainly do some testing if you want to see how DBT-2 does.
>> Just tell me what to do. ;)
>
> Just do some runs that are identical except for the wal_sync_method
> setting.  Note that this should not have any impact on SELECT
> performance, only insert/update/delete performance.

Ok, here are the results I have from my 4-way xeon system, a 14 disk
volume for the log and a 52 disk volume for everything else:
	http://developer.osdl.org/markw/pgsql/wal_sync_method.html

7.5devel-200403222

wal_sync_method		metric
default (fdatasync)	1935.28
fsync			1613.92

# ./test_fsync -f /opt/pgdb/dbt2/pg_xlog/test.out
Simple write timing:
        write                    0.018787

Compare fsync times on write() and non-write() descriptor:
(If the times are similar, fsync() can sync data written
 on a different descriptor.)
        write, fsync, close     13.057781
        write, close, fsync     13.311313

Compare one o_sync write to two:
        one 16k o_sync write     6.515122
        two 8k o_sync writes    12.455124

Compare file sync methods with one 8k write:
        (o_dsync unavailable)
        open o_sync, write       6.270724
        write, fdatasync        13.275225
        write, fsync,           13.359847

Compare file sync methods with 2 8k writes:
        (o_dsync unavailable)
        open o_sync, write      12.479563
        write, fdatasync        13.651709
        write, fsync,           14.000240

Tom Lane wrote:
> markw@osdl.org writes:
>> I could certainly do some testing if you want to see how DBT-2 does.
>> Just tell me what to do. ;)
>
> Just do some runs that are identical except for the wal_sync_method
> setting.  Note that this should not have any impact on SELECT
> performance, only insert/update/delete performance.

I've made a test run that compares fsync and fdatasync.  The
performance was identical:
- with fdatasync:
  http://khack.osdl.org/stp/290607/
- with fsync:
  http://khack.osdl.org/stp/290483/

I don't understand why.  Mark - is there a battery-backed write cache in
the raid controller, or something similar that might skew the results?
The test generates quite a lot of WAL traffic - around 1.5 MB/sec.
Perhaps the writes are so large that the added overhead of syncing the
inode is not noticeable?
Is the pg_xlog directory on a separate drive?

Btw, it's possible to request such tests through the web interface, see
http://www.osdl.org/lab_activities/kernel_testing/stp/script_param.html

--
    Manfred

markw@osdl.org wrote:
>Compare file sync methods with one 8k write:
>        (o_dsync unavailable)
>        open o_sync, write       6.270724
>        write, fdatasync        13.275225
>        write, fsync,           13.359847

Odd. Which filesystem, which kernel? It seems fdatasync is broken and
syncs the inode, too.

--
    Manfred

On Fri, Mar 26, 2004 at 07:25:53AM +0100, Manfred Spraul wrote:
> >Compare file sync methods with one 8k write:
> >        (o_dsync unavailable)
> >        open o_sync, write       6.270724
> >        write, fdatasync        13.275225
> >        write, fsync,           13.359847
>
> Odd. Which filesystem, which kernel? It seems fdatasync is broken and
> syncs the inode, too.

This may be relevant. From the man page for fdatasync on a moderately
recent RedHat installation:

BUGS
       Currently (Linux 2.2) fdatasync is equivalent to fsync.

Cheers,
  Steve