Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
From: Greg Smith
Subject: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
Date:
Msg-id: 4CE2EBF8.4040602@2ndquadrant.com
In reply to: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1? (Marti Raudsepp <marti@juffo.org>)
Responses: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1? (Robert Haas <robertmhaas@gmail.com>)
           Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1? (Josh Berkus <josh@agliodbs.com>)
           Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1? (Scott Carey <scott@richrelevance.com>)
List: pgsql-performance
Time for a deeper look at what's going on here... I installed RHEL6 Beta 2 yesterday, on the presumption that since the release version just came out this week, it was likely the same version Marti tested against. Also, it was the one I already had a DVD to install from. This was on a laptop with a 7200 RPM hard drive that already contained an Ubuntu installation, for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to get a gross idea of which configurations were likely flushing data to disk correctly, and which couldn't possibly be doing so. 7200 RPM = 120 rotations/second, which puts an upper limit of 120 true fsync executions per second. The test_fsync released with PostgreSQL 9.0 now reports its results on the right scale to compare directly against that figure (earlier versions reported seconds/commit, not commits/second).

First I built test_fsync from inside an existing PostgreSQL 9.1 HEAD checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started by looking at the Ubuntu system running ext3, which represents the status quo we've been seeing the past few years.
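As a sanity check on that ceiling, the rotation arithmetic is simple enough to sketch (a hypothetical helper for illustration, not part of test_fsync):

```python
def max_syncs_per_second(rpm: float) -> float:
    """Upper bound on true cache-flushing syncs per second for a
    spinning disk: each flush must wait for the platter to come
    around, so the rate can't exceed rotations per second."""
    return rpm / 60.0

# A 7200 RPM drive rotates 120 times per second, so any test
# reporting far more than 120 commits/second cannot really be
# flushing the platter on every commit.
print(max_syncs_per_second(7200))  # → 120.0
```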
Initially the drive write cache was turned on:

Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 2010 i686 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                      88476.784/second

Compare file sync methods using one write:
        (unavailable: open_datasync)
        open_sync 8k write             1192.135/second
        8k write, fdatasync            1222.158/second
        8k write, fsync                1097.980/second

Compare file sync methods using two writes:
        (unavailable: open_datasync)
        2 open_sync 8k writes           527.361/second
        8k write, 8k write, fdatasync  1105.204/second
        8k write, 8k write, fsync      1084.050/second

Compare open_sync with different sizes:
        open_sync 16k write             966.047/second
        2 open_sync 8k writes           529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        8k write, fsync, close         1064.177/second
        8k write, close, fsync         1042.337/second

Two notable things here. One, there is no open_datasync defined in this older kernel. Two, all commit methods give equally inflated commit rates, far faster than the drive is capable of. This proves this setup isn't flushing the drive's write cache after each commit.
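The real test_fsync is C, but the gist of what it measures per method can be sketched in a few lines of Python (a simplified illustration only; the loop count and temp-file location are arbitrary choices, not the tool's):

```python
import os
import tempfile
import time

def commits_per_second(loops: int = 200) -> float:
    """Write an 8k block and force it to stable storage, 'loops'
    times over; report the achieved rate. Mirrors the spirit of
    test_fsync's '8k write, fdatasync' case."""
    fd, path = tempfile.mkstemp()
    # fdatasync is POSIX but not available everywhere; fall back to fsync.
    datasync = getattr(os, "fdatasync", os.fsync)
    block = b"\0" * 8192
    start = time.time()
    for _ in range(loops):
        os.lseek(fd, 0, os.SEEK_SET)
        os.write(fd, block)
        datasync(fd)
    elapsed = time.time() - start
    os.close(fd)
    os.unlink(path)
    return loops / elapsed
```

On a drive that honors flushes, this should land near the rotation limit computed above; a result in the thousands on a single spinning disk is the telltale sign of a write cache absorbing the syncs.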
You can get safe behavior out of the old kernel by disabling its write cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                      89023.413/second

Compare file sync methods using one write:
        (unavailable: open_datasync)
        open_sync 8k write              106.968/second
        8k write, fdatasync             108.106/second
        8k write, fsync                 104.238/second

Compare file sync methods using two writes:
        (unavailable: open_datasync)
        2 open_sync 8k writes            51.637/second
        8k write, 8k write, fdatasync   109.256/second
        8k write, 8k write, fsync       103.952/second

Compare open_sync with different sizes:
        open_sync 16k write             109.562/second
        2 open_sync 8k writes            52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        8k write, fsync, close          107.179/second
        8k write, close, fsync          106.923/second

And now the results are as expected: just under 120/second.

On to RHEL6. Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to get a straight comparison against the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                     104194.886/second

Compare file sync methods using one write:
        open_datasync 8k write           97.828/second
        open_sync 8k write              109.158/second
        8k write, fdatasync             109.838/second
        8k write, fsync                  20.872/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes        53.902/second
        2 open_sync 8k writes            53.721/second
        8k write, 8k write, fdatasync   109.731/second
        8k write, 8k write, fsync        20.918/second

Compare open_sync with different sizes:
        open_sync 16k write             109.552/second
        2 open_sync 8k writes            54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        8k write, fsync, close           20.800/second
        8k write, close, fsync           20.868/second

A few changes then. open_datasync is available now. It looks slightly slower than the alternatives in this test, but I didn't see that in the later tests, so I suspect it's just run-to-run variation. For some reason regular fsync is dramatically slower in this kernel than in earlier ones; perhaps a lot more metadata is being flushed all the way to disk in that case now.

The issue that I think Marti has been concerned about is highlighted in this interesting subset of the data:

Compare file sync methods using two writes:
        2 open_datasync 8k writes        53.902/second
        8k write, 8k write, fdatasync   109.731/second

The results here aren't surprising: if you do two dsync writes, that takes two disk rotations, while two writes followed by a single sync take only one. But it does mean that with small values of wal_buffers, like the default, you could easily end up paying that rotation sync penalty more than once per commit.
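The rotation accounting behind that subset can be made explicit (illustrative arithmetic only, using a hypothetical helper):

```python
def commit_rate(rpm: float, rotations_per_commit: int) -> float:
    """Each rotation a sync waits for costs 60/rpm seconds, so the
    achievable commit rate is rotations/second divided by how many
    rotations each commit needs."""
    return (rpm / 60.0) / rotations_per_commit

# One grouped sync per commit (write, write, fdatasync): ~120/s on 7200 RPM.
# Two O_DSYNC writes per commit: ~60/s -- matching the measured
# ~109.7/s vs. ~53.9/s split in the results above.
print(commit_rate(7200, 1), commit_rate(7200, 2))  # → 120.0 60.0
```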
Next question: what happens if I turn the drive's write cache back on?

$ sudo hdparm -W1 /dev/sda

/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

$ ./test_fsync
Loops = 10000

Simple write:
        8k write                     104198.143/second

Compare file sync methods using one write:
        open_datasync 8k write          110.707/second
        open_sync 8k write              110.875/second
        8k write, fdatasync             110.794/second
        8k write, fsync                  28.872/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes        55.731/second
        2 open_sync 8k writes            55.618/second
        8k write, 8k write, fdatasync   110.551/second
        8k write, 8k write, fsync        28.843/second

Compare open_sync with different sizes:
        open_sync 16k write             110.176/second
        2 open_sync 8k writes            55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        8k write, fsync, close           28.779/second
        8k write, close, fsync           28.855/second

This is nice to see from a reliability perspective. With all three of the viable sync methods here, the speeds suggest the drive's volatile write cache is being flushed after every commit. This is going to be bad for people who have gotten used to doing development on systems where that isn't honored and who don't care, because it looks like a 90% drop in performance on those systems. But since the new behavior is safe and the earlier one was not, it's hard to get mad about it. Developers probably just need to be taught to turn synchronous_commit off to speed things up when playing with test data.

test_fsync writes to /var/tmp/test_fsync.out by default, paying no attention to what directory you're in. So to use it to test another filesystem, you have to give it an explicit full path.
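For reference, the development-box speedup mentioned above is a one-line change, settable per session or in postgresql.conf (a sketch of the standard PostgreSQL setting, not something specific to these tests):

```sql
-- For throwaway test data only: commits no longer wait for the
-- WAL flush, so a crash can lose the most recent transactions
-- (though it cannot corrupt the database).
SET synchronous_commit = off;
```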
Next I tested against the old Ubuntu partition formatted with ext3, with the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
        8k write                     100943.825/second

Compare file sync methods using one write:
        open_datasync 8k write          106.017/second
        open_sync 8k write              108.318/second
        8k write, fdatasync             108.115/second
        8k write, fsync                 105.270/second

Compare file sync methods using two writes:
        2 open_datasync 8k writes        53.313/second
        2 open_sync 8k writes            54.045/second
        8k write, 8k write, fdatasync    55.291/second
        8k write, 8k write, fsync        53.243/second

Compare open_sync with different sizes:
        open_sync 16k write              54.980/second
        2 open_sync 8k writes            53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different descriptor.)
        8k write, fsync, close          105.032/second
        8k write, close, fsync          103.987/second

Strange... it looks like ext3 is executing cache flushes too. Note that all of the "Compare file sync methods using two writes" results are at half speed now; it's as if ext3 is flushing the first write out immediately. This result was unexpected, and I don't trust it yet; I want to validate it elsewhere.

What about XFS? That's a first-class filesystem on RHEL6 too:

[root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
        8k write                      71878.324/second

Compare file sync methods using one write:
        open_datasync 8k write           36.303/second
        open_sync 8k write               35.714/second
        8k write, fdatasync              35.985/second
        8k write, fsync                  35.446/second

I stopped it there, sick of waiting for it; there's obviously some serious work needed (mount options or the like, at a minimum) before XFS matches the other two. I'll return to that later.

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data out through the drive write caches correctly.
2) On single writes, there's no performance difference between the three main sync methods you might use, except that the straight fsync method has a serious regression in this use case.

3) WAL writes that are forced by wal_buffers filling will turn into a commit-length write when using the new default of open_datasync. Using the older default of fdatasync avoids that problem, in return for letting WAL writes pollute the OS cache. The main benefit of O_DSYNC writes over fdatasync ones is bypassing the OS cache.

I want to replicate some of the actual database-level tests next, before giving a full opinion on whether this data proves it's worth changing the wal_sync_method detection. So far I'm torn between whether that's the right approach, or whether we should just increase the default value for wal_buffers to something more reasonable.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
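Whichever way the detection question goes, the two mitigations under discussion are both postgresql.conf tweaks; the values below are illustrative assumptions for a sketch, not recommendations derived from this test run:

```
# postgresql.conf (illustrative values, not a recommendation)
# A larger wal_buffers reduces how often a partially filled buffer
# forces an extra synchronous WAL write before commit.
wal_buffers = 16MB
# Forcing the older Linux default discussed in this thread:
wal_sync_method = fdatasync
```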