Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?

From: Greg Smith
Subject: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?
Date:
Msg-id: 4CE2EBF8.4040602@2ndquadrant.com
In reply to: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?  (Marti Raudsepp <marti@juffo.org>)
Replies: Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?  (Robert Haas <robertmhaas@gmail.com>)
  Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?  (Josh Berkus <josh@agliodbs.com>)
  Re: Defaulting wal_sync_method to fdatasync on Linux for 9.1?  (Scott Carey <scott@richrelevance.com>)
List: pgsql-performance
Time for a deeper look at what's going on here...I installed RHEL6 Beta
2 yesterday, on the presumption that since the release version just came
out this week, it was likely the same version Marti tested against.
Also, it was the one I already had a DVD to install from.  This was on a
laptop with a 7200 RPM hard drive, already containing an Ubuntu
installation for comparison's sake.

Initial testing was done with the PostgreSQL test_fsync utility, just to
get a rough idea of which configurations were likely flushing data to
disk correctly, and which made that physically impossible.  7200 RPM =
120 rotations/second, which puts an upper limit of 120 true fsync
executions per second.  The test_fsync released with PostgreSQL 9.0 now
reports its results on a scale you can compare directly against that
figure (earlier versions reported seconds/commit, not commits/second).

First I built test_fsync from inside of an existing PostgreSQL 9.1 HEAD
checkout:

$ cd [PostgreSQL source code tree]
$ cd src/tools/fsync/
$ make

And I started with looking at the Ubuntu system running ext3, which
represents the status quo we've been seeing the past few years.
Initially the drive write cache was turned on:

Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC
2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"

/dev/sda5 on / type ext3 (rw,relatime,errors=remount-ro)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      88476.784/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write             1192.135/second
    8k write, fdatasync            1222.158/second
    8k write, fsync                1097.980/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes           527.361/second
    8k write, 8k write, fdatasync  1105.204/second
    8k write, 8k write, fsync      1084.050/second

Compare open_sync with different sizes:
    open_sync 16k write             966.047/second
    2 open_sync 8k writes           529.565/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close         1064.177/second
    8k write, close, fsync         1042.337/second

Two notable things here.  One, there is no open_datasync defined in this
older kernel.  Two, all methods of commit give equally inflated commit
rates, far faster than the drive is capable of.  This proves this setup
isn't flushing the drive's write cache after commit.

You can get safe behavior out of the old kernel by disabling its write
cache:

$ sudo /sbin/hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)

Loops = 10000

Simple write:
    8k write                      89023.413/second

Compare file sync methods using one write:
    (unavailable: open_datasync)
    open_sync 8k write              106.968/second
    8k write, fdatasync             108.106/second
    8k write, fsync                 104.238/second

Compare file sync methods using two writes:
    (unavailable: open_datasync)
    2 open_sync 8k writes            51.637/second
    8k write, 8k write, fdatasync   109.256/second
    8k write, 8k write, fsync       103.952/second

Compare open_sync with different sizes:
    open_sync 16k write             109.562/second
    2 open_sync 8k writes            52.752/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          107.179/second
    8k write, close, fsync          106.923/second

And now results are as expected:  just under 120/second.

Onto RHEL6.  Setup for this initial test was:

$ uname -a
Linux meddle 2.6.32-44.1.el6.x86_64 #1 SMP Wed Jul 14 18:51:29 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.0 Beta (Santiago)
$ mount
/dev/sda7 on / type ext4 (rw)

And I started with the write cache off to see a straight comparison
against the above:

$ sudo hdparm -W0 /dev/sda

/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  0 (off)
$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104194.886/second

Compare file sync methods using one write:
    open_datasync 8k write           97.828/second
    open_sync 8k write              109.158/second
    8k write, fdatasync             109.838/second
    8k write, fsync                  20.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    2 open_sync 8k writes            53.721/second
    8k write, 8k write, fdatasync   109.731/second
    8k write, 8k write, fsync        20.918/second

Compare open_sync with different sizes:
    open_sync 16k write             109.552/second
    2 open_sync 8k writes            54.116/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           20.800/second
    8k write, close, fsync           20.868/second

A few changes then.  open_datasync is available now.  It looks slightly
slower than the alternatives on this test, but I didn't see that on the
later tests so I'm thinking that's just occasional run to run
variation.  For some reason regular fsync is dramatically slower in this
kernel than earlier ones.  Perhaps a lot more metadata being flushed all
the way to the disk in that case now?

The issue that I think Marti has been concerned about is highlighted in
this interesting subset of the data:

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.902/second
    8k write, 8k write, fdatasync   109.731/second

The results here aren't surprising; if you do two dsync writes, that
will take two disk rotations, while two writes followed by a single
sync only take one.  But that does mean that with small values for
wal_buffers, like the default, you could easily end up paying the
rotation sync penalty more than once per commit.
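For illustration only, the exposure here is controlled by one setting in
postgresql.conf (the value below is just an example for this sketch, not
a recommendation from these tests):

```
# postgresql.conf -- larger wal_buffers means fewer forced mid-transaction
# WAL writes, so fewer chances to pay an extra rotation per commit
wal_buffers = 1MB        # illustrative value; the shipped default is much smaller
```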

Next question is what happens if I turn the drive's write cache back on:

$ sudo hdparm -W1 /dev/sda

/dev/sda:
 setting drive write-caching to 1 (on)
 write-caching =  1 (on)

$ ./test_fsync
Loops = 10000

Simple write:
    8k write                      104198.143/second

Compare file sync methods using one write:
    open_datasync 8k write          110.707/second
    open_sync 8k write              110.875/second
    8k write, fdatasync             110.794/second
    8k write, fsync                  28.872/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        55.731/second
    2 open_sync 8k writes            55.618/second
    8k write, 8k write, fdatasync   110.551/second
    8k write, 8k write, fsync        28.843/second

Compare open_sync with different sizes:
    open_sync 16k write             110.176/second
    2 open_sync 8k writes            55.785/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close           28.779/second
    8k write, close, fsync           28.855/second

This is nice to see from a reliability perspective.  On all three of the
viable sync methods here, the speed seen suggests the drive's volatile
write cache is being flushed after every commit.  This is going to be
bad news for people who have gotten used to doing development on systems
where that isn't honored and who don't care, because it looks like a
90% drop in performance on those systems.  But since the new behavior is
safe and the earlier one was not, it's hard to get mad about it.
Developers probably just need to be taught to turn synchronous_commit
off to speed things up when playing with test data.
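For development boxes that would look like this (settable system-wide in
postgresql.conf, or per session with SET):

```
# postgresql.conf -- trade commit durability for speed on throwaway data;
# a crash can lose recent commits but won't corrupt the database
synchronous_commit = off
```

The same thing per session is just "SET synchronous_commit = off;".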

test_fsync writes to /var/tmp/test_fsync.out by default, not paying
attention to what directory you're in.  So to use it to test another
filesystem, you have to make sure to give it an explicit full path.
Next I tested against the old Ubuntu partition that was formatted with
ext3, with the write cache still on:

# mount | grep /ext3
/dev/sda5 on /ext3 type ext3 (rw)
# ./test_fsync -f /ext3/test_fsync.out
Loops = 10000

Simple write:
    8k write                      100943.825/second

Compare file sync methods using one write:
    open_datasync 8k write          106.017/second
    open_sync 8k write              108.318/second
    8k write, fdatasync             108.115/second
    8k write, fsync                 105.270/second

Compare file sync methods using two writes:
    2 open_datasync 8k writes        53.313/second
    2 open_sync 8k writes            54.045/second
    8k write, 8k write, fdatasync    55.291/second
    8k write, 8k write, fsync        53.243/second

Compare open_sync with different sizes:
    open_sync 16k write              54.980/second
    2 open_sync 8k writes            53.563/second

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
    8k write, fsync, close          105.032/second
    8k write, close, fsync          103.987/second

Strange...it looks like ext3 is executing cache flushes, too.  Note that
all of the "Compare file sync methods using two writes" results are half
speed now; it's as if ext3 is flushing the first write out immediately?
This result was unexpected, and I don't trust it yet; I want to validate
this elsewhere.

What about XFS?  That's a first class filesystem on RHEL6 too:

[root@meddle fsync]# ./test_fsync -f /xfs/test_fsync.out
Loops = 10000

Simple write:
    8k write                      71878.324/second

Compare file sync methods using one write:
    open_datasync 8k write           36.303/second
    open_sync 8k write               35.714/second
    8k write, fdatasync              35.985/second
    8k write, fsync                  35.446/second

I stopped there, sick of waiting for it, as there's obviously some
serious work needed (mount options or the like at a minimum) before XFS
matches the other two.  Will return to that later.

So, what have we learned so far:

1) On these newer kernels, both ext4 and ext3 seem to be pushing data
out through the drive write caches correctly.

2) On single writes, there's no performance difference among the three
main methods you might use (open_datasync, open_sync, fdatasync), while
the straight fsync method has a serious regression in this use case.

3) WAL writes that are forced by wal_buffers filling will each turn into
a commit-length write when using the new default, open_datasync.  Using
the older default of fdatasync avoids that problem, in return for
causing WAL writes to pollute the OS cache.  The main benefit of O_DSYNC
writes over fdatasync ones is avoiding the OS cache.

I want to next go through and replicate some of the actual database
level tests before giving a full opinion on whether this data proves
it's worth changing the wal_sync_method detection.  So far I'm torn
between whether that's the right approach, or if we should just increase
the default value for wal_buffers to something more reasonable.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

