Discussion: ext4 finally doing the right thing


ext4 finally doing the right thing

From: Greg Smith
Date:
A few months ago the worst of the bugs in the ext4 fsync code started
clearing up, with
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745
as a particularly painful one.  That made it into the 2.6.32 kernel
released last month.  Some interesting benchmark news today suggests a
version of ext4 that might actually work for databases is showing up in
early packaged distributions:

http://www.phoronix.com/scan.php?page=article&item=ubuntu_lucid_alpha2&num=3

Along with that comes the massive performance drop from a working fsync.
See
http://www.phoronix.com/scan.php?page=article&item=linux_perf_regressions&num=2
for background about this topic from when the issue was discovered:

"[This change] is required for safe behavior with volatile write caches
on drives.  You could mount with -o nobarrier and [the performance drop]
would go away, but a sequence like write->fsync->lose power->reboot may
well find your file without the data that you synced, if the drive had
write caches enabled.  If you know you have no write cache, or that it
is safely battery backed, then you can mount with -o nobarrier, and not
incur this penalty."
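
To make that concrete, the sequence being described is just the basic
write-then-fsync pattern every commit boils down to; a minimal C sketch
(the file name and payload are made up for illustration):

    /* Append a record, then ask the OS to make it durable.  With a volatile
     * drive write cache and barriers disabled (-o nobarrier), fsync() can
     * return success while the data still sits only in the drive's cache. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "commit record\n";   /* illustrative payload */
        int fd = open("wal-test.dat", O_WRONLY | O_CREAT | O_APPEND, 0644);

        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* the slow part */
        close(fd);
        return 0;
    }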

The pgbench TPS figure Phoronix has been reporting has always been a
fictitious one resulting from unsafe write caching.  With 2.6.32
released with ext4 defaulting to proper behavior on fsync, that's going
to make for a very interesting change.  On one side, we might finally be
able to use regular drives with their caches turned on safely, taking
advantage of the cache for other writes while doing the right thing with
the database writes.  On the other, anyone who believed the fictitious
numbers before is going to be in for a rude surprise and think there's a
massive regression here.  There's some potential for this to show
PostgreSQL in a bad light, when people discover they really can only get
~100 commits/second out of cheap hard drives and assume the database is
to blame.  Interesting times.
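
As a back-of-the-envelope check on where that ~100 figure comes from
(assuming a 7200 RPM consumer drive and roughly one platter rotation per
truly flushed commit, which is my assumption, not a measurement):

    7200 rotations/minute / 60 = 120 rotations/second
    => a ceiling of roughly 120 serialized, fully synced commits per second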

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: ext4 finally doing the right thing

From: Jeff Davis
Date:
On Fri, 2010-01-15 at 22:05 -0500, Greg Smith wrote:
> A few months ago the worst of the bugs in the ext4 fsync code started
> clearing up, with
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5f3481e9a80c240f169b36ea886e2325b9aeb745
> as a particularly painful one.

Wow, thanks for the heads-up!

> On one side, we might finally be
> able to use regular drives with their caches turned on safely, taking
> advantage of the cache for other writes while doing the right thing with
> the database writes.

That could be good news. What's your opinion on the practical
performance impact? If it doesn't need to be fsync'd, the kernel
probably shouldn't have written it to the disk yet anyway, right (I'm
assuming here that the OS buffer cache is much larger than the disk
write cache)?

Regards,
    Jeff Davis


Re: ext4 finally doing the right thing

From: Greg Smith
Date:
Jeff Davis wrote:

>> On one side, we might finally be
>> able to use regular drives with their caches turned on safely, taking
>> advantage of the cache for other writes while doing the right thing with
>> the database writes.
>
> That could be good news. What's your opinion on the practical
> performance impact? If it doesn't need to be fsync'd, the kernel
> probably shouldn't have written it to the disk yet anyway, right (I'm
> assuming here that the OS buffer cache is much larger than the disk
> write cache)?

I know they just tweaked this area recently so this may be a bit out of
date, but kernels starting with 2.6.22 allow you to get up to 10% of
memory dirty before getting really aggressive about writing things out,
with writes starting to go heavily at 5%.  So even with a 1GB server, you
could easily find 100MB of data sitting in the kernel buffer cache ahead
of a database write that needs to hit disc.  Once you start considering
the case with modern hardware, where even my desktop has 8GB of RAM and
most serious servers I see have 32GB, you can easily have gigabytes of
such data queued in front of the write that now needs to hit the platter.
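
Those thresholds are the vm.dirty_ratio and vm.dirty_background_ratio
knobs; a quick C sketch to check what a particular box is actually using
(it just reads the procfs files, nothing PostgreSQL-specific):

    /* Print the kernel's dirty-memory writeback thresholds, as a percentage
     * of RAM: dirty_background_ratio is where background write-out starts,
     * dirty_ratio is where writers start getting throttled. */
    #include <stdio.h>

    static void show(const char *path)
    {
        FILE *f = fopen(path, "r");
        int pct;

        if (f != NULL && fscanf(f, "%d", &pct) == 1)
            printf("%s = %d%%\n", path, pct);
        if (f != NULL)
            fclose(f);
    }

    int main(void)
    {
        show("/proc/sys/vm/dirty_background_ratio");
        show("/proc/sys/vm/dirty_ratio");
        return 0;
    }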

The dream is that a proper barrier implementation will then shuffle your
important write to the front of that queue, without waiting for
everything else to clear first.  The exact performance impact depends on
how many non-database writes happen.  But even on a dedicated database
disk, it should still help because there are plenty of non-sync'd writes
coming out of the background writer via its routine work and the
checkpoint writes.  And the ability to fully utilize the write cache on
the individual drives, on commodity hardware, without risking database
corruption would make life a lot easier.

-- 
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com

Re: ext4 finally doing the right thing

From: Greg Stark
Date:

That doesn't sound right. The kernel having 10% of memory dirty doesn't
mean there's a queue you have to jump at all. You don't get into any
queue until the kernel initiates write-out, which will be based on the
usage counters -- basically an LRU. fsync and cousins like
sync_file_range and posix_fadvise(DONT_NEED) initiate write-out right
away.

How many pending write-out requests for how much data the kernel should
keep active is another question, but I imagine it has more to do with
storage hardware than how much memory your system has. And for most
hardware it's probably on the order of megabytes or less.
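
For reference, a sketch in C of the calls named above; the file name,
offset, and length here are arbitrary, sync_file_range() is
Linux-specific, and it only starts write-out of that range rather than
guaranteeing durability:

    /* Start write-out of one dirty range without a full fsync(), then hint
     * that the cached pages for that range are no longer needed. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int err;
        int fd = open("datafile", O_WRONLY);    /* hypothetical file */

        if (fd < 0) { perror("open"); return 1; }

        /* Queue asynchronous write-out of the first 8 kB. */
        if (sync_file_range(fd, 0, 8192, SYNC_FILE_RANGE_WRITE) != 0)
            perror("sync_file_range");

        /* Advise the kernel it can drop the cached pages for that range. */
        err = posix_fadvise(fd, 0, 8192, POSIX_FADV_DONTNEED);
        if (err != 0)
            fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

        close(fd);
        return 0;
    }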

greg


Re: ext4 finally doing the right thing

From: Greg Smith
Date:
Greg Stark wrote:
>
> That doesn't sound right. The kernel having 10% of memory dirty
> doesn't mean there's a queue you have to jump at all. You don't get
> into any queue until the kernel initiates write-out which will be
> based on the usage counters -- basically a lru. fsync and cousins like
> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out
> right away.
>

Most safe ways ext3 knows how to initiate a write-out on something that
must go (because it's gotten an fsync on data there) require flushing
every outstanding write to that filesystem along with it.  So as soon as
a single WAL write shows up, bam!  The whole cache is emptied (or at
least everything associated with that filesystem), and the caller who
asked for that little write is stuck waiting for everything to clear
before their fsync returns success.
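
That behavior is easy to demonstrate: dirty a pile of data in one file,
then time an fsync() of a one-byte write to a different file on the same
filesystem.  A rough C sketch (file names and sizes are arbitrary, and
the result depends heavily on filesystem, mount options, and kernel
version):

    /* Dirty ~256MB in one file without syncing it, then time fsync() of a
     * tiny write to a second file on the same filesystem.  On ext3 in
     * data=ordered mode the small fsync tends to wait for the big file's
     * dirty pages as well. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        char *chunk = malloc(1 << 20);              /* 1MB of junk */
        struct timeval start, end;
        int i;

        int big = open("big-dirty.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int small = open("small-sync.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (chunk == NULL || big < 0 || small < 0) { perror("setup"); return 1; }

        memset(chunk, 'x', 1 << 20);
        for (i = 0; i < 256; i++)                   /* 256MB of dirty pages */
            if (write(big, chunk, 1 << 20) < 0) { perror("write"); return 1; }

        gettimeofday(&start, NULL);
        if (write(small, "x", 1) < 0 || fsync(small) != 0) {
            perror("small file");
            return 1;
        }
        gettimeofday(&end, NULL);

        printf("fsync of 1 byte took %.3f seconds\n",
               (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6);
        return 0;
    }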

This particular issue absolutely killed Firefox when they switched to
using SQLite not too long ago; high-level discussion at
http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/ and
confirmation/discussion of the issue on lkml at
https://kerneltrap.org/mailarchive/linux-fsdevel/2008/5/26/1941354 .

Note the comment from the first article saying "those delays can be 30
seconds or more".  On multiple occasions, I've measured systems with
dozens of disks in a high-performance RAID1+0 with battery-backed
controller that could grind to a halt for 10, 20, or more seconds in
this situation, when running pgbench on a big database.  As was the case
on the latest one I saw, if you've got 32GB of RAM and have let 3.2GB of
random I/O from background writer/checkpoint writes back up because
Linux has been lazy about getting to them, that takes a while to clear
no matter how good the underlying hardware is.

Write barriers were supposed to improve all this when added to ext3, but
they just never seemed to work right for many people.  After reading
that lkml thread, among others, I know I was left not trusting anything
beyond the simplest path through this area of the filesystem.  Slow is
better than corrupted.

So the good news I was relaying is that it looks like this finally works
on ext4, giving it the behavior you described and expected, but that
hasn't actually been there until now.  I was hoping someone with more free
time than me might be interested to go investigate further if I pointed
the advance out.  I'm stuck with too many production systems to play
with new kernels at the moment, but am quite curious.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: ext4 finally doing the right thing

From: Greg Stark
Date:

Both of those refer to the *drive* cache. 

greg


Re: ext4 finally doing the right thing

From: Aidan Van Dyk
Date:
* Greg Smith <greg@2ndquadrant.com> [100121 00:58]:
> Greg Stark wrote:
>>
>> That doesn't sound right. The kernel having 10% of memory dirty
>> doesn't mean there's a queue you have to jump at all. You don't get
>> into any queue until the kernel initiates write-out which will be
>> based on the usage counters -- basically a lru. fsync and cousins like
>> sync_file_range and posix_fadvise(DONT_NEED) initiate write-out
>> right away.
>>
>
> Most safe ways ext3 knows how to initiate a write-out on something that
> must go (because it's gotten an fsync on data there) require flushing
> every outstanding write to that filesystem along with it.  So as soon as
> a single WAL write shows up, bam!  The whole cache is emptied (or at
> least everything associated with that filesystem), and the caller who
> asked for that little write is stuck waiting for everything to clear
> before their fsync returns success.

Sure, if your WAL is on the same FS as your data, you're going to get
hit, and *especially* on ext3...

But, I think that's one of the reasons people usually recommend putting
WAL separate.  Even if it's just another partition on the same (set of)
disk(s), you get the benefit of not having to wait for all the dirty
ext3 pages from your whole database FS to be flushed before the WAL write
can complete on its own FS.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: ext4 finally doing the right thing

From: Florian Weimer
Date:
* Greg Smith:

> Note the comment from the first article saying "those delays can be 30
> seconds or more".  On multiple occasions, I've measured systems with
> dozens of disks in a high-performance RAID1+0 with battery-backed
> controller that could grind to a halt for 10, 20, or more seconds in
> this situation, when running pgbench on a big database.

We see that quite a bit, too (we're still on ext3, mostly 2.6.26ish
kernels).  It seems that the most egregious issues (which even trigger
the two minute kernel hangcheck timer) are related to CFQ.  We don't
see it on systems we have switched to the deadline I/O scheduler.  But
data on this is a bit sketchy.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99

Re: ext4 finally doing the right thing

From: Greg Smith
Date:
Aidan Van Dyk wrote:
> Sure, if your WAL is on the same FS as your data, you're going to get
> hit, and *especially* on ext3...
>
> But, I think that's one of the reasons people usually recommend putting
> WAL separate.

Separate disks can actually concentrate the problem.  The writes to the
data disk by checkpoints will also have fsync behind them eventually, so
splitting out the WAL means you just push the big write backlog to a
later point.  So less frequently performance dives, but sometimes
bigger.  All of the systems I was mentioning seeing >10 second pauses on
had a RAID-1 pair of WAL disks split from the main array.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com


Re: ext4 finally doing the right thing

From: Aidan Van Dyk
Date:
* Greg Smith <greg@2ndquadrant.com> [100121 09:49]:
> Aidan Van Dyk wrote:
>> Sure, if your WAL is on the same FS as your data, you're going to get
>> hit, and *especially* on ext3...
>>
>> But, I think that's one of the reasons people usually recommend putting
>> WAL separate.
>
> Separate disks can actually concentrate the problem.  The writes to the
> data disk by checkpoints will also have fsync behind them eventually, so
> splitting out the WAL means you just push the big write backlog to a
> later point.  So the performance dives happen less frequently, but they're
> sometimes bigger.  All of the systems where I mentioned seeing >10 second pauses
> had a RAID-1 pair of WAL disks split from the main array.

That's right, so with the WAL split off on its own disk, you don't wait
on "WAL" for your checkpoint/data syncs, but you can build up a huge
wait in the queue for main data (which can even block reads).

Having WAL on the main disk means that (for most ext3), you sometimes
have WAL writes taking longer, but the WAL fsyncs are keeping the
backlog "down" in the main data area too.

Now, with ext4 moving to full barrier/fsync support, we could get to the
point where WAL in the main data FS can mimic the state where WAL is
separate, namely that WAL writes can "jump the queue" and be written
without waiting for the data pages to be flushed down to disk, but also
that you'll get the big backlog of data pages to flush when
the first fsyncs on big data files start coming from checkpoints...

a.
--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


Re: ext4 finally doing the right thing

From: "Kevin Grittner"
Date:
>Aidan Van Dyk <aidan@highrise.ca> wrote:
> But, I think that's one of the reasons people usually recommend
> putting WAL separate. Even if it's just another partition on the
> same (set of) disk(s), you get the benefit of not having to wait
> for all the dirty ext3 pages from your whole database FS to be
> flushed before the WAL write can complete on its own FS.

[slaps forehead]

I've been puzzling about why we're getting timeouts on one of two
apparently identical (large) servers.  We forgot to move the pg_xlog
directory to the separate mount point we created for it on the same
RAID.  I didn't think to check that until I saw your post.

-Kevin

Re: ext4 finally doing the right thing

From: Pierre Frédéric Caillaud
Date:
> Now, with ext4 moving to full barrier/fsync support, we could get to the
> point where WAL in the main data FS can mimic the state where WAL is
> separate, namely that WAL writes can "jump the queue" and be written
> without waiting for the data pages to be flushed down to disk, but also
> that you'll get the big backlog of data pages to flush when
> the first fsyncs on big data files start coming from checkpoints...

    Does postgres write something to the logfile whenever an fsync() takes a
suspiciously long amount of time?

Re: ext4 finally doing the right thing

From: Greg Smith
Date:
Pierre Frédéric Caillaud wrote:
>
>     Does postgres write something to the logfile whenever an fsync()
> takes a suspiciously long amount of time?

Not specifically.  If you're logging statements that take a while, you
can see this indirectly, as commits will just take much longer than usual.

If you turn on log_checkpoints, the "sync time" is broken out for you,
and problems in this area can show up there too.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com