Discussion: Hardware vs Software RAID

From:
Adrian Moisey
Date:

Hi

Has anyone done some benchmarks between hardware RAID vs Linux MD
software RAID?

I'm keen to know the result.

--
Adrian Moisey
Systems Administrator | CareerJunction | Your Future Starts Here.
Web: www.careerjunction.co.za | Email: 
Phone: +27 21 686 6820 | Mobile: +27 82 858 7830 | Fax: +27 21 686 6842

From:
"Merlin Moncure"
Date:

On Wed, Jun 25, 2008 at 7:05 AM, Adrian Moisey
<> wrote:
> Has anyone done some benchmarks between hardware RAID vs Linux MD software
> RAID?
>
> I'm keen to know the result.

I have here:
http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html

I also did some pgbench tests which I unfortunately did not record.
The upshot is I don't really see a difference in performance.  I
mainly prefer software raid because it's flexible and you can use the
same set of tools across different hardware.  One annoying thing about
software raid that comes up periodically is that you can't grow raid 0
volumes.

merlin

From:
Matthew Wakeling
Date:

On Wed, 25 Jun 2008, Merlin Moncure wrote:
>> Has anyone done some benchmarks between hardware RAID vs Linux MD software
>> RAID?
>
> I have here:
> http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html
>
> The upshot is I don't really see a difference in performance.

The main difference is that you can get hardware RAID with
battery-backed-up cache, which means small writes will be much quicker
than software RAID. Postgres does a lot of small writes under some use
cases.

Without a BBU cache, it is sensible to put the transaction logs on a
separate disc system to the main database, to make the transaction log
writes fast (due to no seeking on those discs). However, with a BBU cache,
that advantage is irrelevant, as the cache will absorb the writes.
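The separate-disc layout for the transaction log is usually done by relocating pg_xlog; a minimal sketch, assuming a typical 8.x data directory path and a dedicated disc mounted at /wal_disk (both paths are assumptions):

```shell
# Stop the server before touching pg_xlog, then relocate it and leave
# a symlink behind so Postgres still finds it at the old path.
pg_ctl -D /var/lib/pgsql/data stop
mv /var/lib/pgsql/data/pg_xlog /wal_disk/pg_xlog
ln -s /wal_disk/pg_xlog /var/lib/pgsql/data/pg_xlog
pg_ctl -D /var/lib/pgsql/data start
```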

However, not all hardware RAID will have such a battery-backed-up cache,
and those that do tend to have a hefty price tag.

Matthew

--
$ rm core
Segmentation Fault (core dumped)

From:
"Peter T. Breuer"
Date:

"Also sprach Matthew Wakeling:"
> >> Has anyone done some benchmarks between hardware RAID vs Linux MD software
> >> RAID?
  ...
> > The upshot is I don't really see a difference in performance.
>
> The main difference is that you can get hardware RAID with
> battery-backed-up cache, which means small writes will be much quicker
> than software RAID. Postgres does a lot of small writes under some use

It doesn't "mean" that, I'm afraid.  You can put the log/bitmap wherever
you want in software raid, including on a battery-backed local ram disk
if you feel so inclined.  So there is no intrinsic advantage to be
gained there at all.

> However, not all hardware RAID will have such a battery-backed-up cache,
> and those that do tend to have a hefty price tag.

Whereas software raid and a firewire-attached log device does not.


Peter

From:
Greg Smith
Date:

On Wed, 25 Jun 2008, Peter T. Breuer wrote:

> You can put the log/bitmap wherever you want in software raid, including
> on a battery-backed local ram disk if you feel so inclined.  So there is
> no intrinsic advantage to be gained there at all.

You are technically correct but this is irrelevant.  There are zero
mainstream battery-backed local RAM disk setups appropriate for database
use that don't cost substantially more than the upgrade cost to just
getting a good hardware RAID controller with cache integrated and using
regular disks.

What I often do is get a hardware RAID controller, just to accelerate disk
writes, but configure it in JBOD mode and use Linux or other software RAID
on that platform.
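That JBOD-plus-md arrangement might look like this (a sketch only; the device names, RAID level, and filesystem are assumptions for a four-disk array):

```shell
# The controller exports each disk individually (JBOD); md provides the RAID.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
mdadm --detail --scan >> /etc/mdadm.conf   # persist so the array assembles at boot
mkfs -t ext3 /dev/md0
```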

Advantages of using software RAID, in general and in some cases even with
a hardware disk controller:

-Your CPU is inevitably faster than the one on the controller, so this can
give better performance than having RAID calculations done on the
controller itself.

-If the RAID controllers dies, you can move everything to another machine
and know that the RAID setup will transfer.  Usually hardware RAID
controllers use a formatting process such that you can't read the array
without such a controller, so you're stuck with having a replacement
controller around if you're paranoid.  As long as I've got any hardware
that can read the disks, I can get a software RAID back again.

-There is a transparency to having the disks directly attached to the OS
you lose with most hardware RAID.  Often with hardware RAID you lose the
ability to do things like monitor drive status and temperature without
using a special utility to read SMART and similar data.

Disadvantages:

-Maintenance like disk replacement rebuilds will be using up your main CPU
and its resources (like I/O bus bandwidth) that might be offloaded onto
the hardware RAID controller.

-It's harder to set up a redundant boot volume with software RAID that
works right with a typical PC BIOS.  If you use hardware RAID it tends to
insulate you from the BIOS quirks.

-If a disk fails, I've found a full hardware RAID setup is less likely to
result in an OS crash than a software RAID is.  The same transparency and
visibility into what the individual disks are doing can be a problem when
a disk goes crazy and starts spewing junk the OS has to listen to.
Hardware controllers tend to do a better job planning for that sort of
failure, and some of that is lost even by putting them into JBOD mode.

>> However, not all hardware RAID will have such a battery-backed-up cache,
>> and those that do tend to have a hefty price tag.
>
> Whereas software raid and a firewire-attached log device does not.

A firewire-attached log device is an extremely bad idea.  First off,
you're at the mercy of the firewire bridge's write guarantees, which may
or may not be sensible.  It's not hard to find reports of people whose
disks were corrupted when the disk was accidentally disconnected, or of
buggy drive controller firmware causing problems.  I stopped using
Firewire years ago because it seems you need to do some serious QA to
figure out which combinations are reliable and which aren't, and I don't
use external disks enough to spend that kind of time with them.

Second, there's few if any Firewire setups where the host gets to read
SMART error data from the disk.  This means that you can continue to use a
flaky disk long past the point where a direct connected drive would have
been kicked out of an array for being unreliable.  SMART doesn't detect
100% of drive failures in advance, but you'd be silly to set up a database
system where you don't get to take advantage of the ~50% it does catch
before you lose any data.
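With directly attached disks, reading that SMART data is straightforward; for example, using smartmontools (a sketch, the device name is an assumption):

```shell
smartctl -H /dev/sda                        # overall health self-assessment
smartctl -A /dev/sda | grep -i temperature  # vendor attributes, incl. temperature
smartctl -t short /dev/sda                  # kick off a short drive self-test
```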

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
"Jonah H. Harris"
Date:

On Wed, Jun 25, 2008 at 11:24 AM, Greg Smith <> wrote:
> SMART doesn't detect 100% of drive failures in advance, but you'd be silly
> to setup a database system where you don't get to take advantage of the
> ~50% it does catch before you lose any data.

Can't argue with that one.

--
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
EnterpriseDB Corporation | fax: 732.331.1301
499 Thornall Street, 2nd Floor | 
Edison, NJ 08837 | http://www.enterprisedb.com/

From:
"Joshua D. Drake"
Date:


On Wed, 2008-06-25 at 11:30 -0400, Jonah H. Harris wrote:
> On Wed, Jun 25, 2008 at 11:24 AM, Greg Smith <> wrote:
> > SMART doesn't detect 100% of drive failures in advance, but you'd be silly
> > to setup a database system where you don't get to take advantage of the
> > ~50% it does catch before you lose any data.
>
> Can't argue with that one.

SMART has certainly saved our butts more than once.

Joshua D. Drake



From:
Matthew Wakeling
Date:

On Wed, 25 Jun 2008, Greg Smith wrote:
> A firewire-attached log device is an extremely bad idea.

Anyone have experience with IDE, SATA, or SAS-connected flash devices like
the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a
transfer rate of 100MB/s, and doesn't degrade much in performance when
writing small random blocks. But what's it actually like, and is it
reliable?

Matthew

--
Terrorists evolve but security is intelligently designed?  -- Jake von Slatt

From:
"Scott Marlowe"
Date:

On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey
<> wrote:
> Hi
>
> Has anyone done some benchmarks between hardware RAID vs Linux MD software
> RAID?
>
> I'm keen to know the result.

I've had good performance from sw RAID-10 in later kernels, especially
if it was handling a mostly read type load, like a reporting server.
The problem with hw RAID is that the actual performance delivered
doesn't always match up to the promise, due to issues like driver
bugs, mediocre implementations, etc.  Years ago when the first
megaraid v2 drivers were coming out they were pretty buggy.  Once a
stable driver was out they worked quite well.

I'm currently having a problem with a "well known very large
server manufacturer who shall remain unnamed" and their semi-custom
RAID controller firmware not getting along with the driver for Ubuntu.

The machine we're ordering to replace it will have a much beefier RAID
controller with a better driver / OS match and I expect better
behavior from that setup.

From:
"Joshua D. Drake"
Date:


On Wed, 2008-06-25 at 09:53 -0600, Scott Marlowe wrote:
> On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey
> <> wrote:
> > Hi

> I'm currently having a problem with a "well known very large
> servermanufacturer who shall remain unnamed" and their semi-custom
> RAID controller firmware not getting along with the driver for ubuntu.

/me waves to Dell.

Joshua D. Drake



From:
"Peter T. Breuer"
Date:

"Also sprach Greg Smith:"
> On Wed, 25 Jun 2008, Peter T. Breuer wrote:
>
> > You can put the log/bitmap wherever you want in software raid, including
> > on a battery-backed local ram disk if you feel so inclined.  So there is
> > no intrinsic advantage to be gained there at all.
>
> You are technically correct but this is irrelevant.  There are zero
> mainstream battery-backed local RAM disk setups appropriate for database
> use that don't cost substantially more than the upgrade cost to just

I refrained from saying in my reply that I would set up a firewire-based
link to ram in a spare old portable (which comes with a battery) if I
wanted to do this cheaply.

One reason I refrained was because I did not want to enter into a
discussion of transport speeds vs latency vs block request size.  GE,
for example, would have horrendous performance at 1KB i/o blocks. Mind
you, it still would be over 20MB/s (I measure 70MB/s to a real scsi
remote disk across GE at 64KB blocksize).

> getting a good hardware RAID controller with cache integrated and using
> regular disks.
>
> What I often do is get a hardware RAID controller, just to accelerate disk
> writes, but configure it in JBOD mode and use Linux or other software RAID
> on that platform.

I wonder what "JBOD mode" is ... :) Journaled block over destiny? Oh ..
"Just a Bunch of Disks". So you use the linux software raid driver
instead of the hardware or firmware driver on the raid assembly. Fair
enough.

> Advantages of using software RAID, in general and in some cases even with
> a hardware disk controller:
>
> -Your CPU is inevitably faster than the one on the controller, so this can
> give better performance than having RAID calcuations done on the
> controller itself.

It's not clear. You take i/o bandwidth out of the rest of your system,
and cpu time too.  In a standard dual core machine which is not a
workstation, it's OK.  On my poor ol' 1GHz P3 TP x24 laptop, doing two
things at once is definitely a horrible strain on my X responsiveness.
On a risc machine (ARM, 250MHz) I have seen horrible cpu loads from
software raid.

> -If the RAID controllers dies, you can move everything to another machine
> and know that the RAID setup will transfer.  Usually hardware RAID

Oh, I agree with that. You're talking about the proprietary formatting
in hw raid assemblies, I take it? Yah.

> -There is a transparency to having the disks directly attached to the OS

Agreed. "It's alright until it goes wrong".

> Disadvantages:
>
> -Maintenance like disk replacement rebuilds will be using up your main CPU

Agreed (above).

>
> -It's harder to setup a redundant boot volume with software RAID that

Yeah. I don't bother. A small boot volume in readonly mode with a copy
on another disk is what I use.

> works right with a typical PC BIOS.  If you use hardware RAID it tends to
> insulate you from the BIOS quirks.

Until the machine dies? (and fries a disk or two on the way down ..
happens, has happened to me).

> -If a disk fails, I've found a full hardware RAID setup is less likely to
> result in an OS crash than a software RAID is.  The same transparency and

Not sure.

> >> However, not all hardware RAID will have such a battery-backed-up cache,
> >> and those that do tend to have a hefty price tag.
> >
> > Whereas software raid and a firewire-attached log device does not.
>
> A firewire-attached log device is an extremely bad idea.  First off,
> you're at the mercy of the firewire bridge's write guarantees, which may
> or may not be sensible.

The log is sync. Therefore it doesn't matter what the guarantees are,
or at least I assume you are worrying about acks coming back before the
write has been sent, etc.  Only an actual net write will be acked by the
firewire transport as far as I know.  If OTOH you are thinking of "a
firewire attached disk" as a complete black box, then yes, I agree, you
are at the mercy of the driver writer for that black box.  But I was not
thinking of that.  I was only choosing firewire as a transport because
of its relatively good behaviour with small requests, as opposed to GE
as a transport, or 100BT as a transport, or whatever else as a
transport...


> It's not hard to find reports of people whose
> disks were corrupted when the disk was accidentally disconnected, or of
> buggy drive controller firmware causing problems.  I stopped using
> Firewire years ago because it seems you need to do some serious QA to
> figure out which combinations are reliable and which aren't, and I don't
> use external disks enough to spend that kind of time with them.

Sync operation of the disk should make you immune to any quirks, even
if you are thinking of "firewire plus disk" as a black-box unit.

> Second, there's few if any Firewire setups where the host gets to read
> SMART error data from the disk.

An interesting point, but I really was considering firewire only as the
transport (I'm the author of the ENBD - enhanced network block device -
driver, which makes any remote block device available over any
transport, so I guess that accounts for the different assumption).

Peter

From:
"Merlin Moncure"
Date:

On Wed, Jun 25, 2008 at 11:55 AM, Joshua D. Drake <> wrote:
> On Wed, 2008-06-25 at 09:53 -0600, Scott Marlowe wrote:
>> On Wed, Jun 25, 2008 at 5:05 AM, Adrian Moisey
>> <> wrote:
>
>> I'm currently having a problem with a "well known very large
>> servermanufacturer who shall remain unnamed" and their semi-custom
>> RAID controller firmware not getting along with the driver for ubuntu.

> /me waves to Dell.

not just ubuntu...the dell perc/x line software utilities also
explicitly check the hardware platform so they only run on dell
hardware.  However, the lsi logic command line utilities run just
fine.  As for ubuntu sas support, ubuntu supports the mpt fusion/sas
line directly through the kernel.

In fact, installing ubuntu server fixed an unrelated issue relating to
a qlogic fibre hba that was causing reboots under heavy load with a
pci-x fibre controller on centos.  So, based on this and other
experiences, i'm starting to be more partial to linux distributions
with faster moving kernels, mainly because i trust the kernel drivers
more than the vendor provided drivers.  The in place distribution
upgrade is also very nice.

merlin

From:
"Merlin Moncure"
Date:

On Wed, Jun 25, 2008 at 9:03 AM, Matthew Wakeling <> wrote:
> On Wed, 25 Jun 2008, Merlin Moncure wrote:
>>>
>>> Has anyone done some benchmarks between hardware RAID vs Linux MD
>>> software
>>> RAID?
>>
>> I have here:
>>
>> http://merlinmoncure.blogspot.com/2007/08/following-are-results-of-our-testing-of.html
>>
>> The upshot is I don't really see a difference in performance.
>
> The main difference is that you can get hardware RAID with battery-backed-up
> cache, which means small writes will be much quicker than software RAID.
> Postgres does a lot of small writes under some use cases.

As discussed down thread, software raid still gets benefits of
write-back caching on the raid controller...but there are a couple of
things I'd like to add.  First, if your server is extremely busy, the
write back cache will eventually get overrun and performance will
degrade to more typical ('write through') performance.
Secondly, many hardware raid controllers have really nasty behavior in
this scenario.  Linux software raid has decent degradation in overload
conditions but many popular raid controllers (dell perc/lsi logic sas
for example) become unpredictable and very bursty in sustained high
load conditions.

As greg mentioned, I trust the linux kernel software raid much more
than the black box hw controllers.  Also, contrary to vast popular
mythology, the 'overhead' of sw raid in most cases is zero except in
very particular conditions.

merlin

From:
Andrew Sullivan
Date:

On Wed, Jun 25, 2008 at 01:35:49PM -0400, Merlin Moncure wrote:
> experiences, i'm starting to be more partial to linux distributions
> with faster moving kernels, mainly because i trust the kernel drivers
> more than the vendor provided drivers.

While I have some experience that agrees with this, I'll point out
that I've had the opposite experience, too: upgrading the kernel made
a perfectly stable system both unstable and prone to data loss.  I
think this is a blade that cuts both ways, and the key thing to do is
to ensure you have good testing infrastructure in place to check that
things will work before you deploy to production.  (The other way to
say that, of course, is "Linux is only free if your time is worth
nothing."  Substitute your favourite free software for "Linux", of
course.  ;-) )

A

--
Andrew Sullivan

+1 503 667 4564 x104
http://www.commandprompt.com/

From:
"Kevin Grittner"
Date:

>>> Andrew Sullivan <> wrote:

> this is a blade that cuts both ways, and the key thing to do is
> to ensure you have good testing infrastructure in place to check that
> things will work before you deploy to production.  (The other way to
> say that, of course, is "Linux is only free if your time is worth
> nothing."  Substitute your favourite free software for "Linux", of
> course.  ;-) )

It doesn't have to be free software to cut that way.  I've actually
found the free software to waste less of my time.  If you depend on
your systems, though, you should never deploy any change, no matter
how innocuous it seems, without testing.

-Kevin

From:
Andrew Sullivan
Date:

On Wed, Jun 25, 2008 at 01:07:25PM -0500, Kevin Grittner wrote:
>
> It doesn't have to be free software to cut that way.  I've actually
> found the free software to waste less of my time.

No question.  But one of the unfortunate facts of the
no-charge-for-licenses world is that many people expect the systems to
be _really free_.  It appears that some people think, because they've
already paid $smallfortune for a license, it's therefore ok to pay
another amount in operation costs and experts to run the system.  Free
systems, for some reason, are expected also magically to run
themselves.  This tendency is getting better, but hasn't gone away.
It's partly because the budget for the administrators is often buried
in the overall large system budget, so nobody balks when there's a big
figure attached there.  When you present a budget for "free software"
that includes the cost of a few administrators, the accounting people
want to know why the free software costs so much.

> If you depend on your systems, though, you should never deploy any
> change, no matter how innocuous it seems, without testing.

I agree completely.

--
Andrew Sullivan

+1 503 667 4564 x104
http://www.commandprompt.com/

From:
Greg Smith
Date:

On Wed, 25 Jun 2008, Peter T. Breuer wrote:

> I refrained from saying in my reply that I would set up a firewire-based
> link to ram in a spare old portable (which comes with a battery) if I
> wanted to do this cheaply.

Maybe, but this is kind of a weird setup.  Not many people are going to
run a production database that way and us wandering into the details too
much risks confusing everybody else.

> The log is sync. Therefore it doesn't matter what the guarantees are, or
> at least I assume you are worrying about acks coming back before the
> write has been sent, etc.  Only an actual net write will be acked by the
> firewire transport as far as I know.

That's exactly the issue; it's critical for database use that a disk not
lie to you about writes being done if they're actually sitting in a cache
somewhere.  (S)ATA disks do that, so you have to turn that off for them to
be safe to use.  Since the firewire enclosure is a black box, it's
difficult to know exactly what it's doing here, and history here says that
every type of (S)ATA disk does the wrong thing in the default case.  I expect that
for any Firewire/USB device, if I write to the disk, then issue a fsync,
it will return success from that once the data has been written to the
disk's cache--which is crippling behavior from the database's perspective
one day when you get a crash.
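One cheap way to check whether a suspect enclosure honours synchronous writes is to time them; a rough sketch (the target path is an assumption, point it at a file on the device in question):

```shell
# Point TARGET at a file on the device under test (assumed: current dir).
TARGET=./syncfile
# Write 100 8kB blocks, forcing each one to the device before the next
# (oflag=dsync).  count / elapsed seconds = synchronous writes per second.
dd if=/dev/zero of="$TARGET" bs=8k count=100 oflag=dsync
```

A drive really waiting for the platter manages only on the order of its rotational latency per write (roughly 100-200/s for a 7200rpm disk); thousands per second strongly suggest a volatile cache is acknowledging the writes.  Disabling the on-disk cache with `hdparm -W 0` is the usual countermeasure for (S)ATA disks.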

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Greg Smith
Date:

On Wed, 25 Jun 2008, Merlin Moncure wrote:

> So, based on this and other experiences, i'm starting to be more partial
> to linux distributions with faster moving kernels, mainly because i
> trust the kernel drivers more than the vendor provided drivers.

Depends on how fast.  I find it takes a minimum of 3-6 months before any
new kernel release stabilizes (somewhere around 2.6.X-5 to -10), and some
distributions push them out way before that.  Also, after major changes,
it can be a year or more before a new kernel is not a regression either in
reliability, performance, or worst-case behavior.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Greg Smith
Date:

On Wed, 25 Jun 2008, Andrew Sullivan wrote:

> the key thing to do is to ensure you have good testing infrastructure in
> place to check that things will work before you deploy to production.

This is true whether you're using Linux or completely closed source
software.  There are two main differences from my view:

-OSS software lets you look at the code before a typical closed-source
company would have pushed a product out the door at all.  The downside is
that you need to recognize that.  Linux kernels, for example, need
significant real-world exposure after release before they're ready for
most people.

-If your OSS program doesn't work, you can potentially find the problem
yourself.  I find that I don't fix issues when I come across them very
much, but being able to browse the source code for something that isn't
working frequently makes it easier to understand what's going on as part
of troubleshooting.

It's not like closed source software doesn't have the same kinds of bugs.
The way commercial software (and projects like PostgreSQL) get organized
into a smaller number of official releases tends to focus the QA process a
bit better though, so that regular customers don't see as many rough
edges.  Linux used to do a decent job of this with their development vs.
stable kernels, which I really miss.  Unfortunately there's just not
enough time for the top-level developers to manage that while still
keeping up with the pace needed just for new work.  Sorting out which are
the stable kernel releases seems to have become the job of the
distributors (RedHat, SuSE, Debian, etc.) instead of the core kernel
developers.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

From:
Vivek Khera
Date:

On Jun 25, 2008, at 11:35 AM, Matthew Wakeling wrote:

> On Wed, 25 Jun 2008, Greg Smith wrote:
>> A firewire-attached log device is an extremely bad idea.
>
> Anyone have experience with IDE, SATA, or SAS-connected flash
> devices like the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely -
> 32GB, at a transfer rate of 100MB/s, and doesn't degrade much in
> performance when writing small random blocks. But what's it actually
> like, and is it reliable?

None of these manufacturers rates these drives for massive amounts of
writes.  They're sold as suitable for laptop/desktop use, which
normally is not a heavy wear and tear operation like a DB.  Once they
claim suitability for this purpose, be sure that I and a lot of others
will dive into it to see how well it really works.  Until then, it
will just be an expensive brick-making experiment, I'm sure.

From:
"Peter T. Breuer"
Date:

"Also sprach Merlin Moncure:"
> As discussed down thread, software raid still gets benefits of
> write-back caching on the raid controller...but there are a couple of

(I wish I knew what write-back caching was!)

Well, if you mean the Linux software raid driver, no, there's no extra
caching (buffering).  Every request arriving at the device is duplicated
(for RAID1), using a local finite cache of buffer head structures and
real extra muffers from the kernel's general resources.  Every arriving
request is dispatched two its subtargets as it arrives (as two or more
new requests).  On reception of both (or more) acks, the original
request is acked, and not before.

This imposes a considerable extra resource burden. It's a mystery to me
why the driver doesn't deadlock against other resource eaters that it
may depend on.  Writing to a device that also needs extra memory per
request in its driver should deadlock it, in theory. Against a network
device as component, it's a problem (tcp needs buffers).

However the lack of extra buffering is really deliberate (double
buffering is a horrible thing in many ways, not least because of the
probable memory deadlock against some component driver's requirement).
The driver goes to the lengths of replacing the kernel's generic
make_request function just for itself in order to make sure full control
resides in the driver.  This is required, among other things, to make
sure that request order is preserved.

It has the negative that standard kernel contiguous request merging does
not take place.  But that's really required for sane coding in the
driver. Getting request pages into general kernel buffers ... may happen.


> things I'd like to add.  First, if your sever is extremely busy, the
> write back cache will eventually get overrun and performance will
> eventually degrade to more typical ('write through') performance.

I'd like to know where this 'write back cache' is! (not to mention what
it is :). What on earth does `write back' mean? Perhaps you mean the
kernel's general memory system, which has the effect of buffering
and caching requests on the way to drivers like raid.  Yes, if you write
to a device, any device, you will only write to the kernel somewhere,
which may or may not decide now or later to send the dirty buffers thus
created on to the driver in question, either one by one or merged.  But
as I said, raid replaces most of the kernel's mechanisms in that area
(make_request, plug) to avoid losing ordering.  I would be surprised if
the raw device exhibited any buffering at all after getting rid of the
generic kernel mechanisms.  Any buffering you see would likely be
happening at file system level (and be a darn nuisance).

Reads from the device are likely to hit the kernel's existing buffers
first, thus making them act as a "cache".


> Secondly, many hardware raid controllers have really nasty behavior in
> this scenario.  Linux software raid has decent degradation in overload

I wouldn't have said so! If there is any, it's sort of accidental. On
memory starvation, the driver simply couldn't create and despatch
component requests. Dunno what happens then. It won't run out of buffer
head structs though, since it's pretty well serialised on those, per
device, in order to maintain request order, and it has its own cache.

> conditions but many popular raid controllers (dell perc/lsi logic sas
> for example) become unpredictable and very bursty in sustained high
> load conditions.

Well, that's because they can't tell the linux memory manager to quit
storing data from them in memory and let them have it NOW (a general
problem .. how one gets feedback on the mm state, I don't know). Maybe one
could .. one can control buffer aging pretty much per device nowadays.
Perhaps one can set the limit to zero for buffer age in memory before
being sent to the device. That would help. Also one can lower the
bdflush limit at which the device goes sync. All that would help against
bursty performance, but it would slow ordinary operation towards sync
behaviour.
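On 2.6 kernels the bdflush-style knobs alluded to above live under /proc/sys/vm; a sketch of forcing earlier, smoother writeback (the values are illustrative assumptions, not recommendations):

```shell
# Start background writeback once 2% of memory is dirty, and block
# writers at 10%, instead of the larger defaults.
sysctl -w vm.dirty_background_ratio=2
sysctl -w vm.dirty_ratio=10
# Shorten how long a dirty page may age before being flushed (centisecs).
sysctl -w vm.dirty_expire_centisecs=500
```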


> As greg mentioned, I trust the linux kernel software raid much more
> than the black box hw controllers.  Also, contrary to vast popular

Well, it's readable code. That's the basis for my comments!

> mythology, the 'overhead' of sw raid in most cases is zero except in
> very particular conditions.

It's certainly very small. It would be smaller still if we could avoid
needing new buffers per device. Perhaps the dm multipathing allows that.

Peter

From:
Matthew Wakeling
Date:

On Thu, 26 Jun 2008, Vivek Khera wrote:
>> Anyone have experience with IDE, SATA, or SAS-connected flash devices like
>> the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a transfer
>> rate of 100MB/s, and doesn't degrade much in performance when writing small
>> random blocks. But what's it actually like, and is it reliable?
>
> None of these manufacturers rates these drives for massive amounts of writes.
> They're sold as suitable for laptop/desktop use, which normally is not a
> heavy wear and tear operation like a DB.  Once they claim suitability for
> this purpose, be sure that I and a lot of others will dive into it to see how
> well it really works.  Until then, it will just be an expensive brick-making
> experiment, I'm sure.

It claims a MTBF of 2,000,000 hours, but no further reliability
information seems forthcoming. I thought the idea that flash couldn't cope
with many writes was no longer true these days?

Matthew

--
I work for an investment bank. I have dealt with code written by stock
exchanges. I have seen how the computer systems that store your money are
run. If I ever make a fortune, I will store it in gold bullion under my
bed.                                              -- Matthew Crosby

From:
"Scott Marlowe"
Date:

On Thu, Jun 26, 2008 at 10:14 AM, Matthew Wakeling <> wrote:
> On Thu, 26 Jun 2008, Vivek Khera wrote:
>>>
>>> Anyone have experience with IDE, SATA, or SAS-connected flash devices
>>> like the Samsung MCBQE32G5MPP-0VA? I mean, it seems lovely - 32GB, at a
>>> transfer rate of 100MB/s, and doesn't degrade much in performance when
>>> writing small random blocks. But what's it actually like, and is it
>>> reliable?
>>
>> None of these manufacturers rates these drives for massive amounts of
>> writes. They're sold as suitable for laptop/desktop use, which normally is
>> not a heavy wear and tear operation like a DB.  Once they claim suitability
>> for this purpose, be sure that I and a lot of others will dive into it to
>> see how well it really works.  Until then, it will just be an expensive
>> brick-making experiment, I'm sure.
>
> It claims a MTBF of 2,000,000 hours, but no further reliability information
> seems forthcoming. I thought the idea that flash couldn't cope with many
> writes was no longer true these days?

What's mainly happened is that a great increase in storage capacity has
allowed flash-based devices to spread their writes over so many cells
that the time it takes to wear any of them out is measured in much
longer intervals.  Instead of dying in weeks or months, they'll now die,
for most workloads, in years or more.
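A rough way to see why capacity helps: with ideal wear leveling, lifetime scales with capacity times per-cell endurance divided by the write rate. A back-of-envelope sketch (the endurance and write-rate figures below are illustrative assumptions, not specs for any particular drive):

```python
# Back-of-envelope flash lifetime under ideal wear leveling.
# All figures below are illustrative assumptions, not vendor specs.

def lifetime_years(capacity_gb, cycles_per_cell, write_mb_per_sec):
    """Years until every cell has been rewritten cycles_per_cell times,
    assuming writes are spread perfectly evenly across the device."""
    total_writable_mb = capacity_gb * 1024 * cycles_per_cell
    seconds = total_writable_mb / write_mb_per_sec
    return seconds / (3600 * 24 * 365)

# A 32GB SLC-style drive, 100,000 erase cycles, sustained 10MB/s of writes:
print(round(lifetime_years(32, 100_000, 10), 1))  # on the order of a decade
# The same workload on a small 2GB device wears out far sooner:
print(round(lifetime_years(2, 100_000, 10), 1))   # well under a year
```

The exact numbers don't matter; the point is that lifetime grows linearly with capacity, which is why larger modern drives stopped dying in weeks.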

However, I've tested a few less expensive solid state storage devices,
and while for some transactional loads they were much faster, for things
like report queries scanning whole tables they were several times slower
than a software RAID-10 array of just 4 spinning disks.  But pgbench was
quite snappy using the solid state storage for pg_xlog.

От:
"Merlin Moncure"
Дата:

On Thu, Jun 26, 2008 at 9:49 AM, Peter T. Breuer <> wrote:
> "Also sprach Merlin Moncure:"
>> As discussed down thread, software raid still gets benefits of
>> write-back caching on the raid controller...but there are a couple of
>
> (I wish I knew what write-back caching was!)

Hardware raid controllers generally have some dedicated memory for
caching.  The controllers can be configured in one of two modes (the
jargon is so common it's almost standard):
write back: the raid controller can lie to the host o/s.  When the o/s
asks the controller to sync, the controller can hold data in cache (for
a time).
write through: the raid controller can not lie.  All sync requests must
pass through to disk.

The thinking is, the bbu on the controller can hold scheduled writes
in memory (for a time) to be replayed to disk when the server restarts
in the event of a power failure.  This is a reasonable compromise between
data integrity and performance.  'write back' caching provides insane
burst IOPS (because you are writing to controller cache) and somewhat
improved sustained IOPS because the controller is reorganizing writes
on the fly in (hopefully) optimal fashion.
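The difference between the two modes can be sketched as a toy model (this is illustrative pseudocode, not any real controller's firmware; the latency figures are made-up assumptions): in write-through mode every sync waits on the disk, while in write-back mode the BBU-protected cache acknowledges immediately and flushes later.

```python
# Toy model of a RAID controller cache (illustrative only).
DISK_SYNC_MS = 8.0    # assumed cost of a real disk write (seek + rotation)
CACHE_ACK_MS = 0.05   # assumed cost of acknowledging into controller RAM

class Controller:
    def __init__(self, write_back):
        self.write_back = write_back
        self.dirty = []            # blocks held in BBU-protected cache

    def sync_write(self, block):
        """Return the latency the host o/s sees for one synced write."""
        if self.write_back:
            self.dirty.append(block)   # "lie" to the o/s: ack from cache
            return CACHE_ACK_MS
        return DISK_SYNC_MS            # write through: wait for the platter

    def flush(self):
        """Replay cached writes to disk (at idle, or after power-up)."""
        n = len(self.dirty)
        self.dirty.clear()
        return n * DISK_SYNC_MS

wb, wt = Controller(True), Controller(False)
burst = [f"blk{i}" for i in range(100)]
print(sum(wb.sync_write(b) for b in burst))  # ~5ms for the whole burst
print(sum(wt.sync_write(b) for b in burst))  # ~800ms: every sync hits disk
```

The burst-IOPS win comes entirely from the ack-from-cache path; the data still has to reach the platters eventually, which is why sustained IOPS improve only somewhat.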

> This imposes a considerable extra resource burden. It's a mystery to me
> However the lack of extra buffering is really deliberate (double
> buffering is a horrible thing in many ways, not least because of the

<snip>
Completely unconvincing.  The overhead of the various cache layers is
minute compared to a full fault to disk, which requires a seek that is
several orders of magnitude slower.

The linux software raid algorithms are highly optimized, and run on a
(presumably much faster) cpu than what the controller supports.
However, there is still some extra oomph you can get out of letting
the raid controller do what the software raid can't...namely delay
sync for a time.

merlin

От:
"Merlin Moncure"
Дата:

On Thu, Jun 26, 2008 at 12:14 PM, Matthew Wakeling <> wrote:
>> None of these manufacturers rates these drives for massive amounts of
>> writes. They're sold as suitable for laptop/desktop use, which normally is
>> not a heavy wear and tear operation like a DB.  Once they claim suitability
>> for this purpose, be sure that I and a lot of others will dive into it to
>> see how well it really works.  Until then, it will just be an expensive
>> brick-making experiment, I'm sure.
>
> It claims a MTBF of 2,000,000 hours, but no further reliability information
> seems forthcoming. I thought the idea that flash couldn't cope with many
> writes was no longer true these days?

Flash and disks have completely different failure modes, and you can't
do apples to apples MTBF comparisons.  In addition there are many
different types of flash (MLC/SLC) and the flash cells themselves can
be organized in particular ways involving various trade-offs.

The best flash drives combined with smart wear leveling are
anecdotally believed to provide lifetimes that are good enough to
warrant use in high duty server environments.  The main issue is lousy
random write performance that basically makes them useless for any
kind of OLTP operation.  There are a couple of software (hacks?) out
there which may address this problem if the technology doesn't get
there first.

If the random write problem were solved, a single ssd would provide
the equivalent of a stack of 15k disks in a raid 10.

see:
http://www.bigdbahead.com/?p=44
http://feedblog.org/2008/01/30/24-hours-with-an-ssd-and-mysql/

merlin

От:
"Peter T. Breuer"
Дата:

"Also sprach Merlin Moncure:"
> write back: raid controller can lie to host o/s. when o/s asks

This is not what the linux software raid controller does, then. It
does not queue requests internally at all, nor ack requests that have
not already been acked by the components (modulo the fact that one can
deliberately choose to have a slow component not be synced by allowing
"write-behind" on it, in which case the "controller" will ack the
incoming request after one of the components has been serviced,
without waiting for both).

> integrity and performance.  'write back' caching provides insane burst
> IOPS (because you are writing to controller cache) and somewhat
> improved sustained IOPS because the controller is reorganizing writes
> on the fly in (hopefully) optimal fashion.

This is what is provided by Linux file system and (ordinary) block
device driver subsystem. It is deliberately eschewed by the soft raid
driver, because any caching will already have been done above and below
the driver, either in the FS or in the components.

> > However the lack of extra buffering is really deliberate (double
> > buffering is a horrible thing in many ways, not least because of the
>
> <snip>
> completely unconvincing.

But true.  Therefore the problem in attaining conviction must be at your
end.  Double buffering just doubles the resources dedicated to a single
request, without doing anything for it!  It doubles the frequency with
which one runs out of resources, it doubles the frequency of the burst
limit being reached.  It's deadly (deadlockly :) in the situation where
the receiving component device also needs resources in order to service
the request, such as when the transport is network tcp (and I have my
suspicions about scsi too).

> the overhead of various cache layers is
> completely minute compared to a full fault to disk that requires a
> seek which is several orders of magnitude slower.

That's absolutely true when by "overhead" you mean "computation cycles"
and absolutely false when by overhead you mean "memory resources", as I
do.  Double buffering is a killer.

> The linux software raid algorithms are highly optimized, and run on a

I can confidently tell you that that's balderdash both as a Linux author
and as a software RAID linux author (check the attributions in the
kernel source, or look up something like "Raiding the Noosphere" on
google).

> presumably (much faster) cpu than what the controller supports.
> However, there is still some extra oomph you can get out of letting
> the raid controller do what the software raid can't...namely delay
> sync for a time.

There are several design problems left in software raid in the linux kernel.
One of them is the need for extra memory to dispatch requests with and
as (i.e. buffer heads and buffers, both). bhs should be OK since the
small cache per device won't be exceeded while the raid driver itself
serialises requests, which is essentially the case (it does not do any
buffering, queuing, whatever .. and tries hard to avoid doing so). The
need for extra buffers for the data is a problem. On different
platforms different aspects of that problem are important (would you
believe that on ARM mere copying takes so much cpu time that one wants
to avoid it at all costs, whereas on intel it's a forgettable trivium).

I also wouldn't absolutely swear that request ordering is maintained
under ordinary circumstances.

But of course we try.


Peter

От:
"Merlin Moncure"
Дата:

On Thu, Jun 26, 2008 at 1:03 AM, Peter T. Breuer <> wrote:
> "Also sprach Merlin Moncure:"
>> write back: raid controller can lie to host o/s. when o/s asks
>
> This is not what the linux software raid controller does, then. It
> does not queue requests internally at all, nor ack requests that have
> not already been acked by the components (modulo the fact that one can
> deliberately choose to have a slow component not be sync by allowing
> "write-behind" on it, in which case the "controller" will ack the
> incoming request after one of the components has been serviced,
> without waiting for both).
>
>> integrity and performance.  'write back' caching provides insane burst
>> IOPS (because you are writing to controller cache) and somewhat
>> improved sustained IOPS because the controller is reorganizing writes
>> on the fly in (hopefully) optimal fashion.
>
> This is what is provided by Linux file system and (ordinary) block
> device driver subsystem. It is deliberately eschewed by the soft raid
> driver, because any caching will already have been done above and below
> the driver, either in the FS or in the components.
>
>> > However the lack of extra buffering is really deliberate (double
>> > buffering is a horrible thing in many ways, not least because of the
>>
>> <snip>
>> completely unconvincing.
>
> But true.  Therefore the problem in attaining conviction must be at your
> end.  Double buffering just doubles the resources dedicated to a single
> request, without doing anything for it!  It doubles the frequency with
> which one runs out of resources, it doubles the frequency of the burst
> limit being reached.  It's deadly (deadlockly :) in the situation where

Only if those resources are drawn from the same pool.  You are
oversimplifying a calculation that has many variables such as cost.
CPUs for example are introducing more cache levels (l1, l2, l3), etc.
Also, the different levels of cache have different capabilities.
Only the hardware controller cache is (optionally) allowed to delay
acknowledgment of a sync.  In postgresql terms, we get roughly the
same effect with the computer's entire working memory with fsync
disabled...so that we are trusting, rightly or wrongly, that all
writes will eventually make it to disk.  In this case, the raid
controller cache is redundant and marginally useful.
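The fsync effect being discussed here is easy to measure directly: timing a loop of small writes with and without fsync() after each one shows exactly the gap that a write-back cache (or disabled fsync) papers over. A minimal sketch (the iteration count is arbitrary and absolute numbers depend entirely on the hardware):

```python
import os
import tempfile
import time

def timed_writes(n, do_fsync):
    """Time n small appends, optionally fsync()ing after each one."""
    fd, path = tempfile.mkstemp()
    try:
        start = time.time()
        for _ in range(n):
            os.write(fd, b"x" * 512)
            if do_fsync:
                os.fsync(fd)   # force the write through to stable storage
        return time.time() - start
    finally:
        os.close(fd)
        os.unlink(path)

# On a bare disk the fsync loop is typically orders of magnitude slower,
# since each sync costs a platter write; a battery-backed write-back
# cache (or fsync=off) collapses the difference.
print("no fsync:", timed_writes(200, False))
print("fsync:   ", timed_writes(200, True))
```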

> the receiving component device also needs resources in order to service
> the request, such as when the transport is network tcp (and I have my
> suspicions about scsi too).
>
>> the overhead of various cache layers is
>> completely minute compared to a full fault to disk that requires a
>> seek which is several orders of magnitude slower.
>
> That's absolutely true when by "overhead" you mean "computation cycles"
> and absolutely false when by overhead you mean "memory resources", as I
> do.  Double buffering is a killer.

Double buffering is most certainly _not_ a killer (or at least, _the_
killer) in practical terms.  Most database systems that do any amount
of writing (that is, interesting databases) are bound by the ability
to randomly read and write to the storage medium, and only that.

This is why raid controllers come with a relatively small amount of
cache...there are diminishing returns from reorganizing writes.  This
is also why up and coming storage technologies (like flash) are so
interesting.  Disk drives have made only marginal improvements in
speed since the early '80s.

>> The linux software raid algorithms are highly optimized, and run on a
>
> I can confidently tell you that that's balderdash both as a Linux author

I'm just saying here that there is little/no cpu overhead for using
software raid on modern hardware.

> believe that on ARM mere copying takes so much cpu time that one wants
> to avoid it at all costs, whereas on intel it's a forgettable trivium).

This is a database list.  The main area of interest is in dealing with
server class hardware.

merlin

От:
david@lang.hm
Дата:

On Thu, 26 Jun 2008, Peter T. Breuer wrote:

> "Also sprach Merlin Moncure:"
>> The linux software raid algorithms are highly optimized, and run on a
>
> I can confidently tell you that that's balderdash both as a Linux author
> and as a software RAID linux author (check the attributions in the
> kernel source, or look up something like "Raiding the Noosphere" on
> google).
>
>> presumably (much faster) cpu than what the controller supports.
>> However, there is still some extra oomph you can get out of letting
>> the raid controller do what the software raid can't...namely delay
>> sync for a time.
>
> There are several design problems left in software raid in the linux kernel.
> One of them is the need for extra memory to dispatch requests with and
> as (i.e. buffer heads and buffers, both). bhs should be OK since the
> small cache per device won't be exceeded while the raid driver itself
> serialises requests, which is essentially the case (it does not do any
> buffering, queuing, whatever .. and tries hard to avoid doing so). The
> need for extra buffers for the data is a problem. On different
> platforms different aspects of that problem are important (would you
> believe that on ARM mere copying takes so much cpu time that one wants
> to avoid it at all costs, whereas on intel it's a forgettable trivium).
>
> I also wouldn't absolutely swear that request ordering is maintained
> under ordinary circumstances.

Which flavor of linux raid are you talking about (the two main families
I am aware of are the md and dm ones)?

David Lang

От:
Greg Smith
Дата:

On Thu, 26 Jun 2008, Peter T. Breuer wrote:

> Double buffering is a killer.

No, it isn't; it's a completely trivial bit of overhead.  It only exists
during the time when blocks are queued to write but haven't been written
yet.  On any database system, in those cases I/O congestion at the disk
level (probably things backed up behind seeks) is going to block writes
way before the memory used or the bit of CPU time making the extra copy
becomes a factor on anything but minimal platforms.

You seem to know quite a bit about the RAID implementation, but you are a)
extrapolating from that knowledge into areas of database performance you
need to spend some more time researching first and b) extrapolating based
on results from trivial hardware, relative to what the average person on
this list is running a database server on in 2008.  The weakest platform I
deploy PostgreSQL on and consider relevant today has two cores and 2GB of
RAM, for a single-user development system that only has to handle a small
amount of data relative to what the real servers handle.  If you note the
kind of hardware people ask about here that's pretty typical.

You have some theories here, Merlin and I have positions that come from
running benchmarks, and watching theories suffer a brutal smack-down from
the real world is one of those things that happens every day.  There is
absolutely some overhead from paths through the Linux software RAID that
consume resources.  But you can't even measure that in database-oriented
comparisons against hardware setups that don't use those resources, which
means that for practical purposes the overhead doesn't exist in this
context.

--
* Greg Smith  http://www.gregsmith.com Baltimore, MD

От:
Robert Treat
Дата:

On Wednesday 25 June 2008 11:24:23 Greg Smith wrote:
> What I often do is get a hardware RAID controller, just to accelerate disk
> writes, but configure it in JBOD mode and use Linux or other software RAID
> on that platform.
>

JBOD + RAIDZ2 FTW ;-)

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

От:
Matthew Wakeling
Дата:

On Thu, 26 Jun 2008, Merlin Moncure wrote:
> In addition there are many different types of flash (MLC/SLC) and the
> flash cells themselves can be organized in particular ways involving
> various trade-offs.

Yeah, I wouldn't go for MLC, given it has a tenth the lifespan of SLC.

> The main issue is lousy random write performance that basically makes
> them useless for any kind of OLTP operation.

For the mentioned device, they claim a sequential read speed of 100MB/s,
sequential write speed of 80MB/s, random read speed of 80MB/s and random
write speed of 30MB/s. This is *much* better than figures quoted for many
other devices, but of course unless they publish the block size they used
for the random speed tests, the figures are completely useless.
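The block-size caveat matters because the same MB/s figure implies wildly different IOPS depending on the block used. The arithmetic (using the 30MB/s random write figure quoted above; the block sizes are arbitrary examples):

```python
# How the assumed block size changes the IOPS implied by the quoted
# 30MB/s random write figure.
RANDOM_WRITE_MBS = 30

for block_kb in (4, 8, 64, 512):
    iops = RANDOM_WRITE_MBS * 1024 / block_kb
    print(f"{block_kb:4}kB blocks -> {iops:6.0f} random write IOPS")
# With 512kB blocks the drive would be doing only 60 "random" writes/sec,
# no better than a single spinning disk, which is why an unstated block
# size makes the quoted figure useless.
```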

Matthew

--
sed -e '/^[when][coders]/!d;/^...[discover].$/d;/^..[real].[code]$/!d
' <`locate dict/words`

От:
"Merlin Moncure"
Дата:

On Fri, Jun 27, 2008 at 7:00 AM, Matthew Wakeling <> wrote:
> On Thu, 26 Jun 2008, Merlin Moncure wrote:
>>
>> In addition there are many different types of flash (MLC/SLC) and the
>> flash cells themselves can be organized in particular ways involving various
>> trade-offs.
>
> Yeah, I wouldn't go for MLC, given it has a tenth the lifespan of SLC.
>
>> The main issue is lousy random write performance that basically makes them
>> useless for any kind of OLTP operation.
>
> For the mentioned device, they claim a sequential read speed of 100MB/s,
> sequential write speed of 80MB/s, random read speed of 80MB/s and random
> write speed of 30MB/s. This is *much* better than figures quoted for many
> other devices, but of course unless they publish the block size they used
> for the random speed tests, the figures are completely useless.

Right, but not likely completely truthful.  Here's why:

A 15k drive can deliver around 200 seeks/sec (under worst case
conditions translating to 1-2MB/sec with an 8k block size).  30MB/sec
random performance would then be roughly equivalent to around 40 15k
drives configured in a raid 10.  Of course, I'm assuming the block
size :-).
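That back-of-envelope calculation, written out with the same stated assumptions (200 seeks/sec per 15k drive, 8kB blocks, and RAID 10 mirroring doubling the drive count for writes):

```python
# Reproducing the back-of-envelope comparison above.
SEEKS_PER_SEC = 200     # worst-case seeks/sec for one 15k drive
BLOCK_KB = 8            # assumed block size
SSD_RANDOM_MBS = 30     # quoted random write speed of the flash drive

per_drive_mbs = SEEKS_PER_SEC * BLOCK_KB / 1024    # ~1.56 MB/s per drive
data_drives = SSD_RANDOM_MBS / per_drive_mbs       # ~19 drives' worth
raid10_drives = data_drives * 2                    # mirrors double the count
print(f"{per_drive_mbs:.2f} MB/s per drive -> "
      f"~{raid10_drives:.0f} drives in RAID 10")
# Comes out to roughly 38-40 15k drives, matching the estimate above.
```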

Unless there were some other mitigating factors (lifetime, etc), this
would demonstrate that flash ssd would crush disks on any reasonable
cost/performance metric.  It's probably not so cut and dried, otherwise
we'd be hearing more about them (pure speculation on my part).

merlin