Discussion: 10K vs 15K rpm for analytics

From:
Francisco Reyes
Date:

Does anyone have experience doing analytics with Postgres? In particular,
are 10K rpm drives good enough compared to 15K rpm, over 24 drives? The
price difference is $3,000.

Rarely ever have more than 2 or 3 connections to the machine.

So far from what I have seen throughput is more important than TPS for the
queries we do. Usually we end up doing sequential scans to do
summaries/aggregates.

From:
Yeb Havinga
Date:

Francisco Reyes wrote:
> Anyone has any experience doing analytics with postgres. In particular
> if 10K rpm drives are good enough vs using 15K rpm, over 24 drives.
> Price difference is $3,000.
>
> Rarely ever have more than 2 or 3 connections to the machine.
>
> So far from what I have seen throughput is more important than TPS for
> the queries we do. Usually we end up doing sequential scans to do
> summaries/aggregates.
>
With 24 drives it'll probably be the controller that is the limiting
factor of bandwidth. Our HP SAN controller with 28 15K drives delivers
170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd
go for the 10K drives and put the saved money towards the controller (or
maybe more than one controller).

regards,
Yeb Havinga

From:
Greg Smith
Date:

Yeb Havinga wrote:
> With 24 drives it'll probably be the controller that is the limiting
> factor of bandwidth. Our HP SAN controller with 28 15K drives delivers
> 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0.

You should be able to clear 1GB/s on sequential reads with 28 15K drives
in a RAID10, given proper read-ahead adjustment.  I get over 200MB/s out
of the 3-disk RAID0 on my home server without even trying hard.  Can you
share what HP SAN controller you're using?

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Dave Crooke
Date:

Seconded .... these days even a single 5400rpm SATA drive can muster almost 100MB/sec on a sequential read.

The benefit of 15K rpm drives is seen when you have a lot of small, random accesses from a working set that is too big to cache .... the extra rotational speed translates to an average reduction of about 1ms on a random seek and read from the media.
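That ~1ms figure is just rotational arithmetic: average rotational latency is the time for half a revolution, i.e. 30000/rpm milliseconds. A quick sketch of the numbers (the loop and formatting are mine, not from the thread):

```shell
# Average rotational latency = half a revolution = 30000/rpm milliseconds
for rpm in 5400 7200 10000 15000; do
    awk -v rpm="$rpm" 'BEGIN { printf "%5d rpm: %.2f ms\n", rpm, 30000/rpm }'
done
```

10K vs 15K works out to 3.00 ms vs 2.00 ms - the ~1ms average reduction mentioned above.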

Cheers
Dave

On Tue, Mar 2, 2010 at 2:51 PM, Yeb Havinga <> wrote:
Francisco Reyes wrote:
Anyone has any experience doing analytics with postgres. In particular if 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price difference is $3,000.

Rarely ever have more than 2 or 3 connections to the machine.

So far from what I have seen throughput is more important than TPS for the queries we do. Usually we end up doing sequential scans to do summaries/aggregates.

With 24 drives it'll probably be the controller that is the limiting factor of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K drives and put the saved money towards the controller (or maybe more than one controller).

regards,
Yeb Havinga


--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 1:42 PM, Francisco Reyes <> wrote:
> Anyone has any experience doing analytics with postgres. In particular if
> 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price
> difference is $3,000.
>
> Rarely ever have more than 2 or 3 connections to the machine.
>
> So far from what I have seen throughput is more important than TPS for the
> queries we do. Usually we end up doing sequential scans to do
> summaries/aggregates.

Then the real thing to compare is the speed of the drives for
throughput not rpm.  Using older 15k drives would actually be slower
than some more modern 10k or even 7.2k drives.

Another issue would be whether or not to short stroke the drives.  You
may find that short stroked 10k drives provide the same throughput for
much less money.  The 10K rpm 2.5" Ultrastar C10K300 drives have
throughput numbers of 143 to 88 MB/sec, which is quite respectable,
and you can put 24 into a 2U Supermicro case and save rack space too.
The 15K 2.5" Ultrastar C15K147 drives are 159 to 116 MB/sec, only a bit
faster.  And if short stroked, the 10K drives should be competitive.
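Short stroking, as mentioned above, is typically done by allocating only the low-LBA (outer, fastest) portion of each disk and leaving the rest unused. A minimal sketch with parted; the device name and the 30% cutoff are illustrative, not from the thread:

```shell
# Partition only the outer ~30% of the disk: low LBAs map to the outer
# tracks, which have the highest sequential rate and the shortest seeks.
# Requires root; /dev/sdX is a placeholder for the real device.
parted /dev/sdX --script mklabel gpt mkpart primary 0% 30%
```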

From:
david@lang.hm
Date:

On Tue, 2 Mar 2010, Francisco Reyes wrote:

> Anyone has any experience doing analytics with postgres. In particular if 10K
> rpm drives are good enough vs using 15K rpm, over 24 drives. Price difference
> is $3,000.
>
> Rarely ever have more than 2 or 3 connections to the machine.
>
> So far from what I have seen throughput is more important than TPS for the
> queries we do. Usually we end up doing sequential scans to do
> summaries/aggregates.

With sequential scans you may be better off with the large SATA drives as
they fit more data per track and so give great sequential read rates.

If you end up doing a lot of seeking to retrieve the data, you may find
that you get a benefit from the faster drives.

David Lang

From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 2:14 PM,  <> wrote:
> On Tue, 2 Mar 2010, Francisco Reyes wrote:
>
>> Anyone has any experience doing analytics with postgres. In particular if
>> 10K rpm drives are good enough vs using 15K rpm, over 24 drives. Price
>> difference is $3,000.
>>
>> Rarely ever have more than 2 or 3 connections to the machine.
>>
>> So far from what I have seen throughput is more important than TPS for the
>> queries we do. Usually we end up doing sequential scans to do
>> summaries/aggregates.
>
> With sequential scans you may be better off with the large SATA drives as
> they fit more data per track and so give great sequential read rates.

True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists
an average throughput of 134 megabytes/second, which is quite good.
While seek time is about double that of a 15krpm drive, short stroking
can lower that quite a bit.  Latency is still 2x as much, but there's
not much to do about that.

From:
Francisco Reyes
Date:

Yeb Havinga writes:

> With 24 drives it'll probably be the controller that is the limiting
> factor of bandwidth.

Going with a 3Ware SAS controller.

> Our HP SAN controller with 28 15K drives delivers
> 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0.


Already have a similar machine in house.
With RAID 1+0, Bonnie++ reports around 400MB/sec sequential read.

> go for the 10K drives and put the saved money towards the controller (or
> maybe more than one controller).

Have some external enclosures with 16 15K rpm drives. They are older 15K
rpm drives, but they should be good enough.

Since the 15K rpm drives usually deliver better transactions per second, I
will put WAL and indexes in the external enclosure.

From:
Francisco Reyes
Date:

Scott Marlowe writes:

> Then the real thing to compare is the speed of the drives for
> throughput not rpm.

In a machine similar to what I plan to buy, already in house, 24 x 10K rpm
drives give me about 400MB/sec, while 16 x 15K rpm drives (2 to 3 years
old) give me about 500MB/sec.


From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <> wrote:
> With 24 drives it'll probably be the controller that is the limiting factor
> of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at
> maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K
> drives and put the saved money towards the controller (or maybe more than
> one controller).

Those are horrifically bad numbers for that many drives.  I can get those
numbers for write performance on a RAID-6 on our office server.  I
wonder what's making your SAN setup so slow?

From:
Francisco Reyes
Date:

 writes:

> With sequential scans you may be better off with the large SATA drives as
> they fit more data per track and so give great sequential read rates.

I lean more towards SAS because of writes.
One common thing we do is create temp tables, so a typical pass may be:
* sequential scan
* create temp table with subset
* do queries against subset+join to smaller tables.

I figure the concurrent read/write would be faster on SAS than on SATA. I am
trying to move to having an external enclosure (we have several not in use
or about to become free) so I could separate the read and the write of the
temp tables.

Lastly, it is likely we are going to do horizontal partitioning (i.e. master
all data on one machine, replicate, and then change our code to read parts
of the data from different machines), and I think at that point the better
drives will do better as we have more concurrent queries.


From:
Francisco Reyes
Date:

Greg Smith writes:

> in a RAID10, given proper read-ahead adjustment.  I get over 200MB/s out
> of the 3-disk RAID0

Any links/suggested reads on read-ahead adjustment? It will probably be OS
dependent, but any info would be useful.


From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <> wrote:
> Scott Marlowe writes:
>
>> Then the real thing to compare is the speed of the drives for
>> throughput not rpm.
>
> In a machine, simmilar to what I plan to buy, already in house 24 x 10K rpm
> gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives
> me about 500MB/sec

Have you tried short stroking the drives to see how they compare then?
 Or is the reduced primary storage not a valid path here?

While 16x15k older drives doing 500Meg seems only a little slow, the
24x10k drives getting only 400MB/s seems way slow.  I'd expect a
RAID-10 of those to read at somewhere in or just past the gig per
second range with a fast pcie (x8 or x16 or so) controller.  You may
find that a faster controller with only 8 or so fast and large SATA
drives equals the 24 10k drives you're looking at now.  I can write at
about 300 to 350 Megs a second on a slower Areca 12xx series
controller and 8 2TB Western Digital Green drives, which aren't even
made for speed.

From:
Francisco Reyes
Date:

Greg Smith writes:

> in a RAID10, given proper read-ahead adjustment.  I get over 200MB/s out
> of the 3-disk RAID0 on my home server without even trying hard.  Can you

Any links/suggested reading on "read-ahead adjustment"? I understand this
may be OS specific, but any info would be helpful.

Currently have 24 x 10K rpm drives and only getting about 400MB/sec.
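A quick cross-check of sequential read throughput, independent of Bonnie++, is a raw dd read from the array device with the page cache dropped first. A hedged sketch; the device name and read size are illustrative, and it requires root:

```shell
# Drop the page cache so the disks, not RAM, are being measured
sync && echo 3 > /proc/sys/vm/drop_caches
# Read 10GB sequentially off the array; dd prints MB/s when it finishes
dd if=/dev/sdX of=/dev/null bs=1M count=10240
```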

From:
Francisco Reyes
Date:

Scott Marlowe writes:

> Have you tried short stroking the drives to see how they compare then?
>  Or is the reduced primary storage not a valid path here?

No, have not tried it. By the time I got the machine we needed it in
production so could not test anything.

When the 2 new machines come I  should hopefully have time to try a few
strategies, including RAID0, to see what is the best setup for our needs.

> RAID-10 of those to read at somewhere in or just past the gig per
> second range with a fast pcie (x8 or x16 or so) controller.

Thanks for the info. Contacted the vendor to see what PCIe speed the
controller is connected at, especially since we are considering getting 2
more machines from them.

> drives equals the 24 10k drives you're looking at now.  I can write at
> about 300 to 350 Megs a second on a slower Areca 12xx series
> controller and 8 2TB Western Digital Green drives, which aren't even

How about read speed?

From:
david@lang.hm
Date:

On Tue, 2 Mar 2010, Scott Marlowe wrote:

> On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <> wrote:
>> Scott Marlowe writes:
>>
>>> Then the real thing to compare is the speed of the drives for
>>> throughput not rpm.
>>
>> In a machine, simmilar to what I plan to buy, already in house 24 x 10K rpm
>> gives me about 400MB/sec while 16 x 15K rpm (2 to 3 year old drives) gives
>> me about 500MB/sec
>
> Have you tried short stroking the drives to see how they compare then?
> Or is the reduced primary storage not a valid path here?
>
> While 16x15k older drives doing 500Meg seems only a little slow, the
> 24x10k drives getting only 400MB/s seems way slow.  I'd expect a
> RAID-10 of those to read at somewhere in or just past the gig per
> second range with a fast pcie (x8 or x16 or so) controller.  You may
> find that a faster controller with only 8 or so fast and large SATA
> drives equals the 24 10k drives you're looking at now.  I can write at
> about 300 to 350 Megs a second on a slower Areca 12xx series
> controller and 8 2TB Western Digital Green drives, which aren't even
> made for speed.

What filesystem is being used? There is a thread on the linux-kernel
mailing list right now showing that ext4 seems to top out at ~360MB/sec
while XFS is able to go to 500MB/sec+.
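For anyone checking which filesystem a given directory actually sits on, df -T reports it directly. A small sketch, using the current directory as an example path:

```shell
# Print the filesystem type backing the current directory (e.g. ext3, xfs)
df -T . | awk 'NR==2 { print $2 }'
```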

on single disks the disk performance limits you, but on arrays where the
disk performance is higher there may be other limits you are running into.

David Lang

From:
Francisco Reyes
Date:

 writes:

> what filesystem is being used. There is a thread on the linux-kernel
> mailing list right now showing that ext4 seems to top out at ~360MB/sec
> while XFS is able to go to 500MB/sec+

ext3 on CentOS 5.4.

If I have time with the new machines, I plan to try FreeBSD+ZFS.
ZFS supposedly makes good use of memory, and the new machines will have
72GB of RAM.

From:
Francisco Reyes
Date:

Scott Marlowe writes:

> While 16x15k older drives doing 500Meg seems only a little slow, the
> 24x10k drives getting only 400MB/s seems way slow.  I'd expect a
> RAID-10 of those to read at somewhere in or just past the gig per

Talked to the vendor. The likely issue is the card. They used a single card
with an expander, and the card also connects to an external enclosure
through an external port.

They have some ideas which they are going to test and report back, since we
are in the process of getting 2 more machines from them.

They believe that by splitting the internal drives into one controller and
the external into a second controller that performance should go up. They
will report back some numbers. Will post them to the list when I get the
info.

From:
Greg Smith
Date:

Francisco Reyes wrote:
> Going with a 3Ware SAS controller.
> Already have simmilar machine in house.
> With RAID 1+0 Bonne++ reports around 400MB/sec sequential read.

Increase read-ahead and I'd bet you can add 50% to that easily--one area
where the 3Ware controllers need serious help, as they admit:
http://www.3ware.com/kb/article.aspx?id=11050  Just make sure you ignore
their dirty_ratio comments--those are completely the opposite of what
you want on a database app.  Still seems on the low side though.
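On Linux the read-ahead knob is set per block device, e.g. with blockdev as in the 3Ware article above. A sketch; the device name and the 2MB value are illustrative starting points to be validated with benchmarks:

```shell
# Show the current read-ahead, in 512-byte sectors
blockdev --getra /dev/sdX
# Raise it to 4096 sectors (2MB); requires root, not persistent across reboot
blockdev --setra 4096 /dev/sdX
# The same setting is visible in KB via sysfs
cat /sys/block/sdX/queue/read_ahead_kb
```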

Short stroke and you could probably chop worst-case speeds in half too,
on top of that.

Note that 3Ware's controllers have seriously limited reporting on drive
data when using SAS drives because they won't talk SMART to them:
http://www.3ware.com/KB/Article.aspx?id=15383  I consider them still a
useful vendor for SATA controllers, but would never buy a SAS solution
from them again until this is resolved.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Greg Smith
Date:

Francisco Reyes wrote:
> Anyone has any experience doing analytics with postgres. In particular
> if 10K rpm drives are good enough vs using 15K rpm, over 24 drives.
> Price difference is $3,000.
> Rarely ever have more than 2 or 3 connections to the machine.
> So far from what I have seen throughput is more important than TPS for
> the queries we do. Usually we end up doing sequential scans to do
> summaries/aggregates.

For arrays this size, the first priority is to sort out what controller
you're going to get, whether it can keep up with the array size, and how
you're going to support/monitor it.  Once you've got all that nailed
down, if you still have the option of 10K vs. 15K the trade-offs are
pretty simple:

-10K drives are cheaper
-15K drives will commit and seek faster.  If you have a battery-backed
controller, commit speed doesn't matter very much.

If you only have 2 or 3 connections, I can't imagine that the improved
seek times of the 15K drives will be a major driving factor.  As already
suggested, 10K drives tend to be larger and can be extremely fast on
sequential workloads, particularly if you short-stroke them and stick to
putting the important stuff on the fast part of the disk.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 4:50 PM, Greg Smith <> wrote:
> If you only have 2 or 3 connections, I can't imagine that the improved seek
> times of the 15K drives will be a major driving factor.  As already
> suggested, 10K drives tend to be larger and can be extremely fast on
> sequential workloads, particularly if you short-stroke them and stick to
> putting the important stuff on the fast part of the disk.

The thing I like most about short stroking 7200RPM 1 to 2 TB drives is
that you get great performance on one hand, and a ton of left over
storage for backups and stuff.  And honestly, you can't have enough
extra storage laying about when working on databases.

From:
Greg Smith
Date:

Scott Marlowe wrote:
> True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists
> and average throughput of 134 Megabytes/second which is quite good.
>

Yeah, but have you tracked the reliability of any of the 2TB drives out
there right now?  They're terrible.  I wouldn't deploy anything more
than a 1TB drive right now in a server, everything with a higher
capacity is still on the "too new to be stable yet" side of the fence to me.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 4:57 PM, Greg Smith <> wrote:
> Scott Marlowe wrote:
>>
>> True, I just looked at the Hitachi 7200 RPM 2TB Ultrastar and it lists
>> and average throughput of 134 Megabytes/second which is quite good.
>>
>
> Yeah, but have you tracked the reliability of any of the 2TB drives out
> there right now?  They're terrible.  I wouldn't deploy anything more than a
> 1TB drive right now in a server, everything with a higher capacity is still
> on the "too new to be stable yet" side of the fence to me.

We've had REAL good luck with the WD green and black drives.  Out of
about 35 or so drives we've had two failures in the last year, one of
each black and green.  The Seagate SATA drives have been horrific for
us, with a 30% failure rate in the last 8 or so months.  We only have
something like 10 of the Seagates, so the sample's not as big as the
WDs.    Note that we only use the supposed "enterprise" class drives
from each manufacturer.

We just got a shipment of 8 1.5TB Seagates so I'll keep you informed
of the failure rate of those drives.  Wouldn't be surprised to see 1
or 2 die in the first few months tho.

From:
Greg Smith
Date:

Scott Marlowe wrote:
> We've had REAL good luck with the WD green and black drives.  Out of
> about 35 or so drives we've had two failures in the last year, one of
> each black and green.

I've been happy with almost all the WD Blue drives around here (have
about a dozen in service for around two years), with the sole exception
that the one drive I did have go bad has turned into a terrible liar.
Refuses to either acknowledge it's broken and produce an RMA code, or to
work.  At least the Seagate and Hitachi drives are honest about being
borked once they've started producing heavy SMART errors.  I have
enough redundancy to deal with failure, but can't tolerate dishonesty
one bit.

The Blue drives are of course regular crappy consumer models though, so
this is not necessarily indicative of how the Green/Black drives work.

> The Seagate SATA drives have been horrific for
> us, with a 30% failure rate in the last 8 or so months.  We only have
> something like 10 of the Seagates, so the sample's not as big as the
> WDs.    Note that we only use the supposed "enterprise" class drives
> from each manufacturer.
>
> We just got a shipment of 8 1.5TB Seagates so I'll keep you informed
> of the failure rate of those drives.  Wouldn't be surprised to see 1
> or 2 die in the first few months tho.
>

Good luck with those--the consumer version of Seagate's 1.5TB drives
have been perhaps the worst single drive model on the market over the
last year.  Something got seriously misplaced when they switched their
manufacturing facility from Singapore to Thailand a few years ago, and
now that the old plant is gone:
http://www.theregister.co.uk/2009/08/04/seagate_closing_singapore_plant/
I don't expect them to ever recover from that.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 6:03 PM, Greg Smith <> wrote:
> Scott Marlowe wrote:
>>
>> We've had REAL good luck with the WD green and black drives.  Out of
>> about 35 or so drives we've had two failures in the last year, one of
>> each black and green.
>
> I've been happy with almost all the WD Blue drives around here (have about a
> dozen in service for around two years), with the sole exception that the one
> drive I did have go bad has turned into a terrible liar.  Refuses to either
> acknowledge it's broken and produce an RMA code, or to work.  At least the
> Seagate and Hitachi drives are honest about being borked when once they've
> started producing heavy SMART errors.  I have enough redundancy to deal with
> failure, but can't tolerate dishonesty one bit.

Time to do the ESD shuffle I think.

>> The Seagate SATA drives have been horrific for
>> us, with a 30% failure rate in the last 8 or so months.  We only have
>> something like 10 of the Seagates, so the sample's not as big as the
>> WDs.    Note that we only use the supposed "enterprise" class drives
>> from each manufacturer.
>>
>> We just got a shipment of 8 1.5TB Seagates so I'll keep you informed
>> of the failure rate of those drives.  Wouldn't be surprised to see 1
>> or 2 die in the first few months tho.
>>
>
> Good luck with those--the consumer version of Seagate's 1.5TB drives have
> been perhaps the worst single drive model on the market over the last year.
>  Something got seriously misplaced when they switched their manufacturing
> facility from Singapore to Thailand a few years ago, and now that the old
> plant is gone:
>  http://www.theregister.co.uk/2009/08/04/seagate_closing_singapore_plant/ I
> don't expect them to ever recover from that.

Yeah, I've got someone upstream in my chain of command who's a huge
fan of seacrates, so that's how we got those 1.5TB drives.  Our 15k5
seagates have been great, with 2 failures in 32 drives in 1.5 years of
very heavy use.  All our seagate SATAs, whether 500G or 2TB have been
the problem children.  I've pretty much given up on Seagate SATA
drives.  The new seagates we got are the consumer 7200.11 drives, but
at least they have the latest firmware and all.

From:
Greg Smith
Date:

Scott Marlowe wrote:
> Time to do the ESD shuffle I think.
>

Nah, I keep the crazy drive around as an interesting test case.  Fun to
see what happens when I connect it to a RAID card; very informative about
how thorough the card's investigation of the drive is.

> Our 15k5
> seagates have been great, with 2 failures in 32 drives in 1.5 years of
> very heavy use.  All our seagate SATAs, whether 500G or 2TB have been
> the problem children.  I've pretty much given up on Seagate SATA
> drives.  The new seagates we got are the consumer 7200.11 drives, but
> at least they have the latest firmware and all.
>

Well, what I was pointing out was that all the 15K drives used to come
out of this plant in Singapore, which is also where their good consumer
drives used to come from too during the 2003-2007ish period where all
their products were excellent.  Then they moved the consumer production
to this new location in Thailand, and all of the drives from there have
been total junk.  And as of August they closed the original plant, which
had still been making the enterprise drives, altogether.  So now you can
expect the 15K drives to come from the same known source of garbage
drives as everything else they've made recently, rather than the old,
reliable plant.

I recall the Singapore plant sucked for a while when it got started in
the mid 90's too, so maybe this Thailand one will eventually get their
issues sorted out.  It seems like you can't just move a hard drive plant
somewhere and have the new one work without a couple of years of
practice first, I keep seeing this pattern repeat.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Francisco Reyes
Date:

Greg Smith writes:

> http://www.3ware.com/KB/Article.aspx?id=15383  I consider them still a
> useful vendor for SATA controllers, but would never buy a SAS solution
> from them again until this is resolved.


Who are you using for SAS?
One thing I like about 3ware is their management utility works under both
FreeBSD and Linux well.

From:
Scott Marlowe
Date:

On Tue, Mar 2, 2010 at 7:44 PM, Francisco Reyes <> wrote:
> Greg Smith writes:
>
>> http://www.3ware.com/KB/Article.aspx?id=15383  I consider them still a
>> useful vendor for SATA controllers, but would never buy a SAS solution from
>> them again until this is resolved.
>
>
> Who are you using for SAS?
> One thing I like about 3ware is their management utility works under both
> FreeBSD and Linux well.

The non-open source nature of the command line tool for Areca makes me
avoid their older cards.  The 1680 has its own Ethernet port with a web
interface and SNMP that is independent of the OS.  This means that
with something like a hung / panicked kernel, you can still check out
the RAID array and check rebuild status and other stuff.  We get a
hang about every 180 to 460 days with them where the raid driver in
linux hangs with the array going off-line.  It's still there to the
web interface on its own NIC.  Newer kernels seem to trigger the
failure far more often, once every 1 to 2 weeks, two months on the
outside.  The driver guy from Areca is supposed to be working on the
driver for linux, so we'll see if it gets fixed.  It's pretty stable
on a RHEL 5.2 kernel, on anything after that I've tested, it'll hang
every week or two.  So I run RHEL 5.latest with a 5.2 kernel and it
works pretty well.  Note that this is a pretty heavily used machine
with enough access going through 12 drives to use about 30% IOwait,
50% user, 10% sys at peak midday.  Load factor 7 to 15.  And they run
really ultra-smooth between these hangs.  They come back up
uncorrupted, every time, every plug pull test etc.  Other than the
occasional rare hang, they're perfect.

From:
Yeb Havinga
Date:

Greg Smith wrote:
> Yeb Havinga wrote:
>> With 24 drives it'll probably be the controller that is the limiting
>> factor of bandwidth. Our HP SAN controller with 28 15K drives
>> delivers 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0.
>
> You should be able to clear 1GB/s on sequential reads with 28 15K
> drives in a RAID10, given proper read-ahead adjustment.  I get over
> 200MB/s out of the 3-disk RAID0 on my home server without even trying
> hard.  Can you share what HP SAN controller you're using?
Yeah I should have mentioned a bit more, to allow for a better picture
of the apples and pears.

The controller is the one built into the HP MSA1000 SAN - with 14 disks,
plus an extra 14 disks from an MSA30. It is connected through a 2Gbit/s
Fibre Channel adapter - which should give up to roughly 250MB/s bandwidth,
maybe a bit less due to frame overhead and the Gib/GB difference. The
controller has 256MB cache.

It is three years old, however HP still sells it.

I performed a few dozen tests with oracle's free and standalone orion
tool (http://www.oracle.com/technology/software/tech/orion/index.html)
with different raid and controller settings, where I varied
- controller read/write cache ratio
- logical unit layout (like one big RAID x, 3 LUNs with RAID 10 (giving a
stripe width of 4 disks and 4 hot spares), or 7 LUNs with RAID 10)
- stripe size set to maximum
- load type (random or sequential large io)
- linux io scheduler (deadline / cfq etc)
- fibre channel adapter queue depth
- ratio between reads and writes by the orion - our production
application has about 25% writes.
- I also did the short stroking that is talked about further in this
thread by only using one partition of about 30% size on each disk.
- etc

My primary goal was large IOPS for our typical load: mostly OLTP.

The orion tool tests in a matrix, with the number of concurrent small IOs
on one axis and the number of concurrent large IOs on the other. Its
output numbers are also in a matrix, with MBps, IOPS and latency.

I put several of these numbers in Matlab to produce 3D pictures, and
that showed some interesting stuff - it's probably bad netiquette here to
post one of those pictures. One of the striking things was that I
could see something that looked like a mountain where the top was
neatly cut off - my guess: controller maximum reached.

Below is the output data of a recent test, where a 4Gbit/s FC adapter
was connected. From these numbers I conclude that in our setup, the
controller is maxed out at 155MB/s for raid 1+0 *with this setup*. In a
test run constructed to try and see what the maximum MBps of the
controller would be - 100% reads, sequential large IO - it went to 170MBps.

I'm particularly proud of the iops of this test. Please note: large load
is random, not sequential!

So to come back to my original claim: the controller is important when you
have 24 disks. I believe I have backed up this claim with this mail. Also
please take notice that for our setup - a lot of concurrent users on a
medium size (~160GB) database - random IO is what we needed, and for this
purpose the HP MSA has proved rock solid. But the setup that Francisco
mentioned is different: a few users doing mostly sequential IO. For that
load, our setup is far from optimal, mainly because of the (single)
controller.

regards,
Yeb Havinga


ORION VERSION 10.2.0.1.0

Commandline:
-run advanced -testname r10-7 -num_disks 24 -size_small 4 -size_large
1024 -type rand -simulate concat -verbose -write 25 -duration 15 -matrix
detailed -cache_size 256

This maps to this test:
Test: r10-7
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Random IOs
Simulated Array Type: CONCAT
Write: 25%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Small Columns:, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102, 108, 114, 120
Large Columns:, 0, 1, 2, 3, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48
Total Data Points: 416

Name: /dev/sda1 Size: 72834822144
Name: /dev/sdb1 Size: 72834822144
Name: /dev/sdc1 Size: 72834822144
Name: /dev/sdd1 Size: 72834822144
Name: /dev/sde1 Size: 72834822144
Name: /dev/sdf1 Size: 72834822144
Name: /dev/sdg1 Size: 72834822144
7 FILEs found.

Maximum Large MBPS=155.05 @ Small=2 and Large=48
Maximum Small IOPS=6261 @ Small=120 and Large=0
Minimum Small Latency=3.93 @ Small=1 and Large=0

Below is the MBps matrix - hope this reads well in email clients.

Large/Small, 0, 1, 2, 3, 4, 5, 6, 12, 18, 24, 30, 36, 42, 48, 54, 60, 66, 72, 78, 84, 90, 96, 102, 108, 114, 120
1, 76.60, 74.87, 73.24, 70.66, 70.45, 68.36, 67.58, 59.63, 54.94, 50.74, 44.65, 41.24, 37.31, 35.85, 35.05, 32.53, 29.01, 30.64, 30.39, 27.41, 26.19, 25.43, 24.17, 24.10, 22.96, 22.39
2, 114.19, 115.65, 113.65, 112.11, 111.31, 109.77, 108.57, 101.81, 95.25, 86.74, 83.48, 76.12, 70.82, 68.98, 62.85, 63.75, 57.36, 56.28, 52.78, 50.37, 47.96, 48.53, 46.82, 44.47, 45.09, 42.53
3, 135.41, 135.21, 134.20, 134.27, 133.78, 132.62, 131.03, 127.08, 121.25, 114.15, 109.51, 104.28, 98.66, 94.91, 91.95, 86.27, 82.99, 79.28, 76.09, 74.26, 71.60, 67.83, 67.94, 64.55, 65.39, 63.23
4, 144.30, 143.93, 145.00, 144.47, 143.49, 142.56, 142.23, 139.14, 135.64, 131.82, 128.82, 124.51, 121.88, 116.16, 112.13, 107.91, 105.63, 101.54, 99.06, 93.50, 90.35, 87.25, 86.98, 83.57, 83.45, 79.73
8, 152.93, 152.87, 152.60, 152.29, 152.36, 152.16, 151.85, 151.11, 150.00, 149.09, 148.18, 147.40, 146.09, 145.21, 144.94, 143.82, 142.90, 141.43, 140.93, 140.08, 137.83, 136.95, 136.17, 133.69, 134.05, 131.85
12, 154.10, 153.83, 154.07, 153.79, 154.03, 153.35, 153.09, 152.41, 152.14, 151.32, 151.49, 150.68, 150.10, 149.69, 149.19, 148.07, 148.00, 147.90, 146.78, 146.57, 145.79, 144.96, 145.21, 144.23, 143.58, 142.59
16, 154.30, 154.40, 153.71, 153.96, 154.13, 154.13, 153.58, 153.24, 152.97, 152.86, 152.29, 151.95, 151.57, 150.68, 150.85, 150.44, 150.03, 149.59, 149.15, 149.01, 148.29, 147.89, 147.44, 147.41, 146.79, 146.55
20, 154.70, 154.53, 154.33, 154.12, 154.05, 154.29, 154.05, 153.84, 152.87, 153.26, 153.02, 152.64, 152.37, 151.99, 151.65, 151.44, 150.89, 150.89, 150.69, 150.34, 149.90, 149.59, 149.38, 149.31, 148.76, 148.35
24, 154.31, 154.34, 154.28, 154.31, 154.21, 154.39, 154.07, 153.80, 153.80, 153.17, 153.28, 152.83, 152.59, 152.66, 151.97, 152.00, 151.66, 151.17, 150.79, 151.10, 150.62, 150.52, 150.17, 149.93, 149.79, 149.27
28, 154.62, 154.48, 154.34, 154.70, 154.48, 154.31, 154.44, 153.92, 153.82, 153.72, 153.54, 153.23, 152.88, 152.29, 152.23, 152.43, 151.84, 151.70, 151.32, 151.56, 150.87, 150.87, 150.90, 150.31, 150.63, 150.03
32, 154.58, 154.33, 154.90, 154.40, 154.51, 154.44, 154.41, 154.08, 154.30, 154.02, 153.53, 153.50, 153.35, 153.01, 152.83, 152.83, 152.41, 152.16, 152.06, 151.99, 151.75, 151.29, 151.12, 151.47, 151.22, 150.77
36, 154.67, 154.46, 154.43, 154.25, 154.60, 154.96, 154.25,
154.25, 154.15, 154.00, 153.83, 153.45, 153.16, 153.23, 152.74, 152.66,
152.49, 152.57, 152.28, 152.53, 151.79, 151.40, 151.23, 151.30, 151.19,
151.20
         40, 154.27, 154.67, 154.63, 154.74, 154.17, 154.31, 154.82,
154.24, 154.67, 154.35, 153.81, 153.82, 153.89, 153.29, 153.18, 152.97,
153.18, 152.72, 152.69, 151.94, 151.80, 151.69, 152.12, 151.59, 151.31,
151.52
         44, 154.37, 154.59, 154.51, 154.66, 154.88, 154.58, 154.26,
154.29, 153.83, 154.38, 153.84, 153.66, 153.55, 153.23, 153.02, 153.20,
152.70, 152.67, 152.88, 152.53, 152.67, 152.13, 152.10, 152.06, 151.53,
151.45
         48, 154.61, 154.83, 155.05, 154.65, 154.47, 154.97, 154.29,
154.40, 154.33, 154.29, 154.00, 154.01, 153.71, 153.47, 153.58, 153.50,
153.15, 152.50, 153.08, 152.83, 152.40, 152.04, 151.46, 152.29, 152.11,
151.43

Below is the IOPS matrix.

Large/Small,      1,      2,      3,      4,      5,      6,     12,
18,     24,     30,     36,     42,     48,     54,     60,     66,
72,     78,     84,     90,     96,    102,    108,    114,    120
          0,    254,    502,    751,    960,   1177,   1388,   2343,
3047,   3557,   3945,   4247,   4529,   4752,   4953,   5111,   5280,
5412,   5550,   5670,   5785,   5904,   5987,   6093,   6167,   6261
          1,    178,    353,    526,    684,    832,    999,   1801,
2445,   2937,   3382,   3742,   4054,   4262,   4489,   4685,   4910,
5030,   5139,   5312,   5439,   5549,   5685,   5760,   5873,   5953
          2,    122,    240,    364,    484,    605,    715,   1342,
1907,   2416,   2808,   3208,   3526,   3789,   4072,   4217,   4477,
4629,   4840,   4964,   5187,   5242,   5381,   5490,   5543,   5704
          3,     84,    167,    253,    337,    420,    510,    990,
1486,   1924,   2332,   2692,   3035,   3272,   3578,   3838,   4048,
4260,   4426,   4607,   4760,   4948,   4989,   5164,   5216,   5335
          4,     61,    120,    180,    236,    303,    368,    732,
1086,   1445,   1780,   2144,   2434,   2771,   3092,   3342,   3576,
3793,   4000,   4165,   4376,   4554,   4703,   4805,   4847,   5062
          8,     24,     49,     73,    100,    122,    152,    303,
448,    614,    759,    889,   1043,   1201,   1325,   1489,   1647,
1800,   1948,   2116,   2291,   2434,   2594,   2824,   2946,   3124
         12,     15,     30,     45,     62,     76,     90,    188,
275,    366,    462,    543,    638,    726,    814,    906,    978,
1055,   1151,   1245,   1341,   1425,   1488,   1566,   1688,   1759
         16,     10,     23,     32,     44,     55,     66,    130,
198,    259,    328,    387,    450,    519,    580,    643,    706,
767,    834,    891,    964,   1029,   1083,   1141,   1206,   1263
         20,      8,     17,     25,     34,     41,     50,    102,
152,    201,    255,    302,    350,    402,    447,    496,    554,
591,    645,    688,    746,    791,    844,    882,    934,    984
         24,      6,     13,     21,     28,     35,     41,     85,
123,    166,    206,    250,    288,    326,    377,    410,    451,
497,    531,    568,    610,    660,    694,    732,    772,    814
         28,      6,     12,     17,     23,     29,     35,     70,
106,    142,    174,    210,    247,    279,    325,    348,    378,
419,    453,    487,    523,    553,    586,    627,    651,    691
         32,      5,     10,     15,     20,     26,     31,     61,
92,    120,    154,    182,    212,    245,    274,    309,    336,
368,    395,    429,    452,    488,    514,    542,    581,    605
         36,      4,      9,     13,     18,     22,     27,     56,
83,    110,    138,    166,    193,    222,    248,    279,    302,
333,    358,    385,    414,    438,    468,    496,    523,    551
         40,      4,      8,     12,     17,     21,     25,     50,
77,    103,    127,    155,    184,    205,    236,    256,    285,
315,    341,    362,    387,    418,    442,    468,    492,    518
         44,      4,      8,     11,     15,     20,     24,     49,
73,     98,    123,    151,    173,    197,    225,    248,    269,
294,    329,    349,    373,    390,    428,    438,    469,    498
         48,      3,      7,     11,     15,     20,     23,     47,
70,     95,    120,    141,    166,    192,    212,    237,    260,
282,    308,    329,    353,    378,    400,    424,    450,    468




От:
Yeb Havinga
Дата:

Scott Marlowe wrote:
> On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <> wrote:
>
>> With 24 drives it'll probably be the controller that is the limiting factor
>> of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at
>> maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K
>> drives and put the saved money towards the controller (or maybe more than
>> one controller).
>>
>
> That's horrifically bad numbers for that many drives.  I can get those
> numbers for write performance on a RAID-6 on our office server.  I
> wonder what's making your SAN setup so slow?
>
Pre scriptum:
A few minutes ago I mailed detailed information in the same thread, but
as a reply to an earlier response - it tells more about the setup and
gives results of a RAID 1+0 test.

I just have to react to "horrifically bad" and "slow" :-) : the HP SAN
can also do RAID 5 on 28 disks at about 155 MBps:

28 disks divided into 7 logical units with RAID 5; Orion results are
below. Please note that this time I did sequential large IO. The mixed
read/write MBps maximum here is comparable: around 155 MBps.

regards,
Yeb Havinga


ORION VERSION 10.2.0.1.0

Commandline:
-run advanced -testname msa -num_disks 24 -size_small 4 -size_large 1024
-type seq -simulate concat -verbose -write 50 -duration 15 -matrix
detailed -cache_size 256

This maps to this test:
Test: msa
Small IO size: 4 KB
Large IO size: 1024 KB
IO Types: Small Random IOs, Large Sequential Streams
Number of Concurrent IOs Per Stream: 4
Force streams to separate disks: No
Simulated Array Type: CONCAT
Write: 50%
Cache Size: 256 MB
Duration for each Data Point: 15 seconds
Small Columns:,      0,      1,      2,      3,      4,      5,
6,     12,     18,     24,     30,     36,     42,     48,     54,
60,     66,     72,     78,     84,     90,     96,    102,    108,
114,    120
Large Columns:,      0,      1,      2,      3,      4,      8,
12,     16,     20,     24,     28,     32,     36,     40,     44,     48
Total Data Points: 416

Name: /dev/sda1 Size: 109256361984
Name: /dev/sdb1 Size: 109256361984
Name: /dev/sdc1 Size: 109256361984
Name: /dev/sdd1 Size: 109256361984
Name: /dev/sde1 Size: 109256361984
Name: /dev/sdf1 Size: 109256361984
Name: /dev/sdg1 Size: 109256361984
7 FILEs found.

Maximum Large MBPS=157.75 @ Small=0 and Large=1
Maximum Small IOPS=3595 @ Small=66 and Large=1
Minimum Small Latency=2.81 @ Small=1 and Large=0

MBPS matrix

Large/Small,      0,      1,      2,      3,      4,      5,      6,
12,     18,     24,     30,     36,     42,     48,     54,     60,
66,     72,     78,     84,     90,     96,    102,    108,    114,    120
          1, 157.75, 156.47, 153.56, 153.45, 144.87, 141.78, 140.60,
112.45,  95.23,  72.80,  80.59,  36.91,  29.76,  42.86,  41.82,  33.87,
34.07,  45.62,  42.97,  26.37,  42.85,  45.49,  44.47,  37.26,  45.67,
36.18
          2, 137.58, 128.48, 125.78, 133.85, 120.12, 127.86, 127.05,
119.26, 121.23, 115.00, 117.88, 114.35, 108.61, 106.55,  83.50,  78.61,
92.67,  96.01,  44.02,  70.60,  62.84,  46.52,  69.18,  51.84,  57.19,
59.62
          3, 143.10, 134.92, 139.30, 138.62, 137.85, 146.17, 140.41,
138.48,  76.00, 138.17, 123.48, 137.45, 126.51, 137.11,  91.94,  90.33,
129.97,  45.35, 115.92,  89.60, 137.22,  72.46,  89.95,  77.40, 119.17,
82.09
          4, 138.47, 133.74, 129.99, 122.33, 126.75, 125.22, 132.30,
120.41, 125.88, 132.21,  96.92, 115.70, 131.65,  66.34, 114.06, 113.62,
116.91,  96.97,  98.69, 127.16, 116.67, 111.53, 128.97,  92.38, 118.14,
78.31
          8, 126.59, 127.92, 115.51, 125.02, 123.29, 111.94, 124.31,
125.71, 134.48, 126.40, 127.93, 125.36, 121.75, 121.75, 127.17, 116.51,
121.44, 121.12, 112.32, 121.55, 127.93, 124.86, 118.04, 114.59, 121.72,
114.79
         12, 112.40, 122.58, 107.61, 125.42, 128.04, 123.80, 127.17,
127.70, 122.37,  96.52, 115.36, 124.49, 124.07, 129.31, 124.62, 124.23,
105.58, 123.55, 115.67, 120.59, 125.61, 123.57, 121.43, 121.45, 121.44,
113.64
         16, 108.88, 119.79, 123.80, 120.55, 120.02, 121.66, 125.71,
122.19, 125.77, 122.27, 119.55, 118.44, 120.51, 104.66,  97.55, 115.43,
101.45, 108.99, 122.30, 100.45, 105.82, 119.56, 121.26, 126.59, 119.54,
115.09
         20, 103.88, 122.95, 115.86, 114.59, 121.13, 108.52, 116.90,
121.10, 113.91, 108.20, 111.51, 125.64, 117.57, 120.86, 117.66, 100.40,
104.88, 103.15,  98.10, 104.86, 104.69, 102.99, 121.81, 107.22, 122.68,
106.43
         24, 102.64, 102.33, 112.95, 110.63, 108.00, 111.53, 124.33,
103.17, 108.16, 112.63,  97.42, 106.22, 102.54, 117.46, 100.66,  99.01,
104.46,  99.02, 116.02, 112.49, 119.05, 104.03, 102.40, 102.44, 111.15,
99.51
         28, 101.12, 102.76, 114.14, 109.72, 120.63, 118.09, 119.85,
113.80, 116.58, 110.24, 101.45, 110.31, 116.06, 112.04, 121.63,  91.26,
98.88, 101.55, 104.51, 116.43, 112.98, 119.46, 120.08, 109.46, 106.29,
96.69
         32, 103.41, 117.33, 101.33, 102.29, 102.58, 116.18, 107.12,
114.63, 121.84,  95.14, 108.83,  99.82, 103.11,  99.36, 117.80,  94.91,
103.46, 103.97, 117.35, 100.51, 100.18, 101.98, 118.26, 115.03, 100.45,
107.90
         36,  99.90,  97.98, 100.94,  95.56, 118.76,  99.05, 114.02,
93.61, 117.68, 115.22, 114.40, 116.38, 100.38,  99.15, 108.66, 101.67,
106.64,  98.69, 111.99, 108.28,  99.62, 112.67, 118.80, 110.40, 118.86,
108.46
         40, 101.51, 103.38,  93.73, 121.69, 106.27, 104.09, 110.81,
105.83,  95.81, 101.47, 105.96, 113.26, 103.61, 114.26, 100.49, 102.35,
111.44,  95.09, 103.02, 106.21, 104.39, 118.31,  96.73, 109.79, 103.71,
99.70
         44, 101.17, 107.22, 107.50, 115.19, 104.16, 108.93, 101.62,
111.82, 110.66, 104.13, 109.68, 103.20,  92.04, 104.70, 102.30, 117.28,
106.37, 100.42, 107.81, 105.31, 110.21, 108.66, 116.05, 105.55, 100.64,
106.67
         48, 101.00, 104.13, 114.00,  99.55, 107.46, 113.29, 114.32,
108.75, 100.11,  99.89, 104.81, 107.36, 102.93, 106.43, 101.98, 103.15,
101.30, 113.94, 103.07, 102.40,  95.38, 111.33,  93.89, 112.30, 103.58,
101.82

IOPS matrix

Large/Small,      1,      2,      3,      4,      5,      6,     12,
18,     24,     30,     36,     42,     48,     54,     60,     66,
72,     78,     84,     90,     96,    102,    108,    114,    120
          0,    355,    639,    875,   1063,   1230,   1366,   1933,
2297,   2571,   2750,   2981,   3394,   3027,   3045,   3036,   3139,
3218,   3081,   3151,   3203,   3128,   3179,   3093,   3141,   3135
          1,     37,     99,    144,    298,    398,    488,   1183,
1637,   2069,   2268,   2613,   2729,   2860,   2983,   3119,   3595,
3065,   3077,   3036,   3008,   3039,   3030,   3067,   3138,   3041
          2,     22,     36,     44,    130,    112,     92,    271,
378,    579,    673,    903,   1091,   1131,   1735,   1612,   1809,
1236,   2316,   2302,   1410,   2467,   2526,   2692,   2606,   2625
          3,      5,     13,     18,     21,     27,     27,     56,
92,    162,    196,    209,    239,    309,    595,   1551,    611,
2408,   1034,    488,    401,   1226,   1700,   2490,   1516,   2435
          4,      8,     10,     33,     38,     53,     38,    137,
191,    165,    502,    369,    212,   1127,    654,   1069,    721,
643,   1046,    537,    803,   1093,    497,   1669,   1120,   1945
          8,      3,      8,      6,     15,     24,     19,     61,
47,     90,    139,    109,    174,    184,    154,    261,    294,
289,    460,    338,    199,    425,    433,    633,    475,    599
         12,      3,      7,      7,     10,     11,     12,     32,
74,     67,     91,     93,    120,    157,    158,    201,    191,
143,    220,    327,    217,    283,    276,    297,    336,    365
         16,      2,      3,      6,      6,     10,     11,     27,
35,     56,     52,     80,    100,     89,    118,    102,    140,
178,    158,    174,    188,    185,    243,    168,    235,    249
         20,      1,      3,      5,      6,      8,     11,     15,
30,     30,     54,     44,     70,     76,     87,    104,     79,
121,    115,    128,    135,    147,    158,    194,    184,    250
         24,      1,      2,      4,      5,      8,      6,     14,
21,     29,     36,     42,     50,     58,     61,     70,     64,
85,    102,    117,    120,    111,    126,    129,    170,    159
         28,      1,      2,      4,      3,      7,      7,     16,
19,     23,     30,     37,     51,     60,     58,     59,     65,
75,     76,     91,     83,    107,    103,    113,    120,    135
         32,      1,      2,      3,      3,      6,      6,     12,
17,     19,     28,     30,     31,     32,     44,     49,     57,
53,     82,     87,     80,     84,    106,    106,     96,     93
         36,      1,      2,      4,      5,      4,      7,      9,
15,     22,     27,     30,     32,     35,     43,     46,     48,
52,     67,     69,     54,     78,     87,     98,     92,    114
         40,      0,      2,      2,      3,      4,      5,     12,
12,     16,     24,     25,     29,     35,     36,     42,     51,
45,     55,     60,     61,     71,     67,     72,     77,     67
         44,      0,      2,      2,      3,      4,      4,     10,
12,     19,     20,     24,     24,     25,     32,     34,     40,
43,     58,     62,     60,     71,     75,     75,     68,     81
         48,      0,      1,      2,      5,      4,      4,     10,
14,     16,     18,     21,     23,     27,     31,     34,     37,
44,     42,     54,     48,     59,     54,     69,     65,     67




От:
Yeb Havinga
Дата:

Francisco Reyes wrote:
>
> Going with a 3Ware SAS controller.
>
>
> Have some external enclosures with 16 15K rpm drives. They are older
> 15K rpm models, but they should be good enough.
> Since 15K rpm drives usually deliver better transactions per second, I
> will put WAL and indexes in the external enclosure.

It sounds like you have a lot of hardware around - my guess is it would
be worthwhile to do a test setup with one server hooked up to two 3ware
controllers. Also, I am not sure it is wise to put the WAL on the
same logical disk as the indexes, but that is maybe for a different
thread (it is unwise to mix random and sequential IO, and the WAL also
has its own demands when it comes to write cache).

regards,
Yeb Havinga


От:
"Pierre C"
Дата:

>>> With 24 drives it'll probably be the controller that is the limiting
>>> factor of bandwidth. Our HP SAN controller with 28 15K drives delivers
>>> 170MB/s at maximum with raid 0 and about 155MB/s with raid 1+0.

I get about 150-200 MB/s on... a Linux software RAID of 3 cheap Samsung
SATA 1TB drives (which is my home multimedia server).
IOPS would of course be horrendous, since that's RAID 5, but that's not
the point here.

For raw sequential throughput, dumb drives with dumb software RAID can be
pretty fast, IF each drive has a dedicated channel (SATA ensures this) and
the controller sits on a fast PCI Express link (in my case, the chipset's
SATA controller).

I don't suggest you use software RAID with cheap consumer drives, just
that any expensive setup that doesn't deliver MUCH more of the performance
that is useful to you (ie in your case sequential IO) maybe isn't worth
the extra price... There are many bottlenecks...
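That claim is easy to sanity-check with a plain dd read. A minimal sketch under stated assumptions: the default scratch-file path below is hypothetical, so pass your own array device or a multi-gigabyte file as the first argument; dropping the page cache only works as root and is silently skipped otherwise.

```shell
#!/bin/sh
# Rough sequential-read check; not a substitute for Orion/fio, but enough
# to see whether an array is anywhere near its expected MB/s.
DEV=${1:-/tmp/seqtest.bin}          # hypothetical default path

# If we were not pointed at a real device, build a scratch file to read.
[ -e "$DEV" ] || dd if=/dev/zero of="$DEV" bs=1M count=256 2>/dev/null

# Drop the page cache when running as root, so the disks are measured,
# not RAM; skip quietly when unprivileged.
sync
echo 3 2>/dev/null > /proc/sys/vm/drop_caches || true

# Sequential read in 1 MB blocks; dd reports an overall rate on stderr.
dd if="$DEV" of=/dev/null bs=1M
```

When reading from the scratch file without dropping caches, the number you get is mostly RAM bandwidth, which is itself a useful upper bound to compare against.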

От:
Francisco Reyes
Дата:

Yeb Havinga writes:

> controllers. Also, I am not sure if it is wise to put the WAL on the
> same logical disk as the indexes,

If I only have two controllers, would it then be better to put WAL on the
first along with all the data, and the indexes on the external enclosure?
Especially since the external enclosure will have 15K rpm drives vs 10K
rpm in the internal.


> thread (unwise to mix random and sequential io and also the wal has
> demands when it comes to write cache).

Thanks for pointing that out.
With any luck I will actually be able to do some tests on the new
hardware. The current one I literally gave only a few hours of stress
testing before it had to go into production.


От:
Greg Smith
Дата:

Francisco Reyes wrote:
> Who are you using for SAS?
> One thing I like about 3ware is their management utility works under
> both FreeBSD and Linux well.

3ware has turned into a division within LSI now, so I have my doubts
about their long-term viability as a separate product as well.

LSI used to be the reliable, open, but somewhat slower cards around, but
that doesn't seem to be the case with their SAS products anymore.  I've
worked on two systems using their MegaRAID SAS 1078 chipset in RAID10
recently and been very impressed with both.  That's what I'm
recommending to clients now too--especially people who liked Dell
anyway.  (HP customers are still getting pointed toward their
P400/600/800.  3ware in white box systems, still OK, but only SATA.
Areca is fast, but they're really not taking the whole driver thing
seriously.)

You can get that direct from LSI as the MegaRAID SAS 8888ELP:
http://www.lsi.com/storage_home/products_home/internal_raid/megaraid_sas/megaraid_sas_8888elp/
as well as some similar models.  And that's what Dell sells as their
PERC6.  Here's what a typical one looks like from Linux's perspective,
just to confirm which card/chipset/driver I'm talking about:

# lspci -v
03:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078
(rev 04)
       Subsystem: Dell PERC 6/i Integrated RAID Controller

$ /sbin/modinfo megaraid_sas
filename:
/lib/modules/2.6.18-164.el5/kernel/drivers/scsi/megaraid/megaraid_sas.ko
description:    LSI MegaRAID SAS Driver
author:         
version:        00.00.04.08-RH2
license:        GPL

As for the management utility, LSI ships "MegaCli[64]" as a statically
linked Linux binary.  There are plenty of reports of people running it on
FreeBSD with no problems via Linux emulation libraries--it's a really
basic CLI tool, and whatever interface it talks to the card through seems
to emulate just fine.  The UI is awful, but once you find the magic cheat
sheet at http://tools.rapidsoft.de/perc/ it's not too bad.  There's no
direct SMART monitoring here either, which is disappointing, but you can
get some pretty detailed data out of MegaCli, so it's not terrible.
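For reference, a few MegaCli invocations of the kind that cheat sheet covers for day-to-day monitoring. These are sketched from common usage, so treat the exact flags as assumptions to verify against the cheat sheet, and note the binary may be installed as MegaCli or MegaCli64:

```shell
# Controller model, firmware, and global settings, all in one dump.
MegaCli64 -AdpAllInfo -aALL

# State of every logical drive (RAID level, size, cache policy, health).
MegaCli64 -LDInfo -Lall -aALL

# Per-physical-drive details: media errors, predictive failures, temps.
MegaCli64 -PDList -aALL

# Battery backup unit status; worth checking before trusting
# write-back caching.
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
```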

I've seen >100MB/s per drive on reads out of small RAID10 arrays, and
cleared 1GB/s on larger ones (all on RHEL5+ext3) with this controller on
recent installs.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


От:
Yeb Havinga
Дата:

Francisco Reyes wrote:
> Yeb Havinga writes:
>
>> controllers. Also, I am not sure if it is wise to put the WAL on the
>> same logical disk as the indexes,
>
> If I only have two controllers would it then be better to put WAL on
> the first along with all the data and the indexes on the external?
> Specially since the external enclosure will have 15K rpm vs 10K rpm in
> the internal.
It sounds like you're going to create a single logical unit / RAID array
on each of the controllers. Depending on the number of disks, this is a
bad idea, because if you read/write data sequentially, all drive heads
will be aligned to roughly the same place on the disks. If another
process wants to read/write as well, this will interfere and bring down
both IOPS and MBps. However, with three concurrent users.. hmm, but then
again, queries will scan multiple tables/indexes, so there will be mixed
IO to several locations. What would be interesting is to see what the
MBps maximum of a single controller is. Then calculate how many disks
are needed to feed it, which would give a figure for the number of disks
per logical unit.

The challenge with having a few logical units / RAID arrays available
is how to divide the data over them (with tablespaces). What is good for
your physical data depends on the schema and queries that are most
important. For 2 relations, 2 indexes and 4 arrays, it would be
clear. There's not much general to say here, except: do not mix
table or index data with the WAL. In other words: if you can make a
separate RAID array for the WAL (a 2-disk RAID 1 is probably good
enough), that would be fine, and it doesn't matter on which controller or
enclosure it lives, because WAL IO is then not mixed with the data IO.
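That split can be sketched with tablespaces plus a relocated pg_xlog directory. All the mount points, object names, and the data directory path below are assumptions for illustration only, and the symlink trick matches the 8.x-era layout this thread is about, where WAL lives in $PGDATA/pg_xlog:

```shell
# Assumed layout: /raid_data, /raid_idx, /raid_wal are three separate
# arrays; the cluster data directory is /var/lib/pgsql/data.

# Tablespaces keep heap and index IO on different spindles
# (run as a database superuser).
psql -c "CREATE TABLESPACE bulkdata LOCATION '/raid_data/pg'"
psql -c "CREATE TABLESPACE fastidx  LOCATION '/raid_idx/pg'"
psql -c "CREATE INDEX orders_id_idx ON orders (id) TABLESPACE fastidx"

# WAL gets its own small RAID 1: stop the server, move pg_xlog,
# leave a symlink behind, restart.
pg_ctl -D /var/lib/pgsql/data stop
mv /var/lib/pgsql/data/pg_xlog /raid_wal/pg_xlog
ln -s /raid_wal/pg_xlog /var/lib/pgsql/data/pg_xlog
pg_ctl -D /var/lib/pgsql/data start
```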
>
> Thanks for pointing that out.
> With any luck I will actually be able to do some tests on the new
> hardware. The current one I literally gave only a few hours of stress
> testing before it had to go into production.
I've heard that before ;-) If you do get around to doing some tests, I'm
interested in the results / hard numbers.

regards,
Yeb Havinga


От:
Scott Marlowe
Дата:

On Wed, Mar 3, 2010 at 4:53 AM, Hannu Krosing <> wrote:
> On Wed, 2010-03-03 at 10:41 +0100, Yeb Havinga wrote:
>> Scott Marlowe wrote:
>> > On Tue, Mar 2, 2010 at 1:51 PM, Yeb Havinga <> wrote:
>> >
>> >> With 24 drives it'll probably be the controller that is the limiting factor
>> >> of bandwidth. Our HP SAN controller with 28 15K drives delivers 170MB/s at
>> >> maximum with raid 0 and about 155MB/s with raid 1+0. So I'd go for the 10K
>> >> drives and put the saved money towards the controller (or maybe more than
>> >> one controller).
>> >>
>> >
>> > That's horrifically bad numbers for that many drives.  I can get those
>> > numbers for write performance on a RAID-6 on our office server.  I
>> > wonder what's making your SAN setup so slow?
>> >
>> Pre scriptum:
>> A few minutes ago I mailed detailed information in the same thread but
>> as reply to an earlier response - it tells more about setup and gives
>> results of a raid1+0 test.
>>
>> I just have to react to "horrifically bad" and "slow" :-) : The HP san
>> can do raid5 on 28 disks also on about 155MBps:
>
> SANs are "horrifically bad" and "slow" mainly because of the 2 Gbit/sec
> Fibre Channel link.
> But older ones may be just slow internally as well.
> The fact that it is expensive does not make it fast.
> If you need fast throughput, use direct-attached storage.

Let me be clear that the only number mentioned at the beginning was
throughput.  If you're designing a machine to run huge queries and
return huge amounts of data, that matters: OLAP.  If you're designing
for OLTP, you're likely to only have a few megs a second passing
through, but in thousands of transactions per second.  So, when presented
with throughput as the only metric, I figured that's what the OP was
designing for.  For OLTP his SAN is plenty fast.

От:
Scott Carey
Дата:

On Mar 2, 2010, at 1:36 PM, Francisco Reyes wrote:

>  writes:
>
>> With sequential scans you may be better off with the large SATA drives as
>> they fit more data per track and so give great sequential read rates.
>
> I lean more towards SAS because of writes.
> One common thing we do is create temp tables.. so a typical pass may be:
> * sequential scan
> * create temp table with subset
> * do queries against subset+join to smaller tables.
>
> I figure the concurrent read/write would be faster on SAS than on SATA. I am
> trying to move to having an external enclosure (we have several not in use
> or about to become free) so I could separate the read and the write of the
> temp tables.
>

Concurrent read/write performance has far more to do with OS and
filesystem choice and tuning than with what type of drive it is.
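One way to put a number on that interaction is a mixed fio run: one job reads a file sequentially in large blocks while another writes 8 kB blocks, roughly mimicking a seq scan running alongside temp-table writes. The directory and sizes are assumptions to adjust, and this assumes a Linux fio built with libaio support:

```shell
# Two concurrent jobs on the filesystem under test (/mnt/pgtest assumed):
# a 1 MB-block sequential reader and an 8 kB sequential writer.
# Options before the first --name are global to both jobs.
fio --directory=/mnt/pgtest --ioengine=libaio --direct=1 \
    --group_reporting \
    --name=seqscan --rw=read  --bs=1M --size=4G \
    --name=temptbl --rw=write --bs=8k --size=1G
```

Comparing the reader's MB/s here against a read-only run shows how much the filesystem loses to the concurrent writer.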

> Lastly, it is likely we are going to do horizontal partitioning (ie master
> all data in one machine, replicate and then change our code to read parts of
> data from different machine) and I think at that time the better drives will
> do better as we have more concurrent queries.
>
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


От:
Scott Carey
Дата:

On Mar 2, 2010, at 2:10 PM, <> wrote:

> On Tue, 2 Mar 2010, Scott Marlowe wrote:
>
>> On Tue, Mar 2, 2010 at 2:30 PM, Francisco Reyes <> wrote:
>>> Scott Marlowe writes:
>>>
>>>> Then the real thing to compare is the speed of the drives for
>>>> throughput not rpm.
>>>
>>> In a machine similar to what I plan to buy, already in house, 24 x 10K rpm
>>> gives me about 400MB/sec, while 16 x 15K rpm (2 to 3 year old drives) gives
>>> me about 500MB/sec
>>
>> Have you tried short stroking the drives to see how they compare then?
>> Or is the reduced primary storage not a valid path here?
>>
>> While 16x15k older drives doing 500Meg seems only a little slow, the
>> 24x10k drives getting only 400MB/s seems way slow.  I'd expect a
>> RAID-10 of those to read at somewhere in or just past the gig per
>> second range with a fast pcie (x8 or x16 or so) controller.  You may
>> find that a faster controller with only 8 or so fast and large SATA
>> drives equals the 24 10k drives you're looking at now.  I can write at
>> about 300 to 350 Megs a second on a slower Areca 12xx series
>> controller and 8 2TB Western Digital Green drives, which aren't even
>> made for speed.
>
> What filesystem is being used? There is a thread on the linux-kernel
> mailing list right now showing that ext4 seems to top out at ~360MB/sec
> while XFS is able to go to 500MB/sec+.

I have CentOS 5.4 with 10 7200RPM 1TB SAS drives in RAID 10 (Seagate ES.2, same perf as the SATA ones), XFS, an Adaptec
5805, and get ~750MB/sec read and write sequential throughput.

A RAID 0 of two of these stops around 1000MB/sec because it is CPU bound in postgres -- for select count(*).  If it is
select * piped to /dev/null, it is CPU bound below 300MB/sec converting data to text.

For XFS, set readahead to 16MB or so (2MB or so per stripe; --setra 32768 is 16MB) and absolutely make sure that the
XFS mount parameter 'allocsize' is set to about the same size or more.  For large sequential operations, you want to
make sure interleaved writes don't interleave files on disk.  I use 80MB allocsize, and 40MB readahead for the
reporting data.
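Those two knobs look like this on the command line. The device and mount point are placeholders, and note the XFS documentation specifies power-of-two values for allocsize, so 64m is used here as the safe neighbour of the 80 MB figure above:

```shell
# 16 MB readahead: blockdev --setra counts 512-byte sectors,
# 32768 * 512 bytes = 16 MB.
blockdev --setra 32768 /dev/sdb

# Large speculative preallocation so files that grow concurrently
# (a table plus its indexes) don't interleave their extents on disk.
mount -o remount,allocsize=64m /data

# Persistent form, via /etc/fstab:
# /dev/sdb  /data  xfs  noatime,allocsize=64m  0 0
```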

Later Linux kernels have significantly improved readahead systems that don't need to be tuned quite as much.  For high
sequential throughput, nothing is as optimized as XFS on Linux yet.  It has weaknesses elsewhere however.

And 3Ware on Linux + high throughput sequential = slow.  PERC 6 was 20% faster, and Adaptec was 70% faster with the
same drives, and with experiments to filesystem and readahead for all.  From what I hear, Areca is a significant notch
above Adaptec on that too.

>
> on single disks the disk performance limits you, but on arrays where the
> disk performance is higher there may be other limits you are running into.
>
> David Lang


От:
Greg Smith
Дата:

Scott Carey wrote:
> For high sequential throughput, nothing is as optimized as XFS on Linux yet.  It has weaknesses elsewhere however.
>

I'm curious what you feel those weaknesses are.  The recent addition of
XFS back into a more mainstream position in the RHEL kernel as of their
5.4 update greatly expands where I can use it now, and I have been
heavily revisiting it since that release.  I've already noted how well it
does on sequential read/write tasks relative to ext3, and it looks like
the main downsides I used to worry about with it (mainly crash recovery
issues) were also squashed in recent years.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


От:
"Pierre C"
Дата:

On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith <>
wrote:

> Scott Carey wrote:
>> For high sequential throughput, nothing is as optimized as XFS on Linux
>> yet.  It has weaknesses elsewhere however.
>>

When files are extended one page at a time (as postgres does)
fragmentation can be pretty high on some filesystems (ext3, but NTFS is
the absolute worst) if several files (indexes + table) grow
simultaneously. XFS has delayed allocation which really helps.
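The effect is directly visible by listing a relation file's extents. The relfilenode path below is a made-up example, and xfs_bmap/xfs_db are XFS-specific tools while filefrag (from e2fsprogs) works on most Linux filesystems:

```shell
# Extent map of one relation file (path is a hypothetical example).
FILE=/var/lib/pgsql/data/base/16384/16385
xfs_bmap -v "$FILE"        # XFS-only: one line per extent
filefrag -v "$FILE"        # generic alternative

# Filesystem-wide fragmentation factor on XFS (read-only query):
xfs_db -r -c frag /dev/sdb
```

A file laid out in a handful of large extents will sequential-scan far faster than one scattered across thousands.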

> I'm curious what you feel those weaknesses are.

Handling lots of small files, especially deleting them, is really slow on
XFS.
Databases don't care about that.

There is also the dark side of delayed allocation: if your application is
broken, it will manifest itself very painfully. Since XFS keeps a lot of
unwritten stuff in its buffers, an app that doesn't fsync correctly can
lose lots of data if you don't have a UPS.

Fortunately, postgres handles fsync as it should.

A word of advice though: a few years ago, we lost a few terabytes on XFS
(after that, restoring from backup was quite slow!) because a faulty SCSI
cable crashed the server, then crashed it again during xfs_repair. So if
you do xfs_repair on a suspicious system, please image the disks first.

От:
"Ing. Marcos Ortiz Valmaseda"
Дата:

Pierre C wrote:
> On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith <>
> wrote:
>
>> Scott Carey wrote:
>>> For high sequential throughput, nothing is as optimized as XFS on
>>> Linux yet.  It has weaknesses elsewhere however.
>>>
>
> When files are extended one page at a time (as postgres does)
> fragmentation can be pretty high on some filesystems (ext3, but NTFS
> is the absolute worst) if several files (indexes + table) grow
> simultaneously. XFS has delayed allocation which really helps.
>
>> I'm curious what you feel those weaknesses are.
>
> Handling lots of small files, especially deleting them, is really slow
> on XFS.
> Databases don't care about that.
>
> There is also the dark side of delayed allocation: if your
> application is broken, it will manifest itself very painfully. Since
> XFS keeps a lot of unwritten stuff in its buffers, an app that doesn't
> fsync correctly can lose lots of data if you don't have a UPS.
>
> Fortunately, postgres handles fsync as it should.
>
> A word of advice though: a few years ago, we lost a few terabytes on
> XFS (after that, restoring from backup was quite slow!) because a
> faulty SCSI cable crashed the server, then crashed it again during
> xfs_repair. So if you do xfs_repair on a suspicious system, please
> image the disks first.
>
So then, which file system do you recommend for the PostgreSQL data
directory? I was seeing that ZFS brings very cool features for that.
The problem with ZFS is that this file system is only available on Solaris,
OpenSolaris, FreeBSD and Mac OS X Server, not on Linux systems. What do
you think about that?
Regards

--
--------------------------------------------------------
-- Ing. Marcos Luís Ortíz Valmaseda                   --
-- Twitter: http://twitter.com/@marcosluis2186        --
-- FreeBSD Fan/User                                   --
-- http://www.freebsd.org/es                          --
-- Linux User # 418229                                --
-- Database Architect/Administrator                   --
-- PostgreSQL RDBMS                                   --
-- http://www.postgresql.org                          --
-- http://planetpostgresql.org                        --
-- http://www.postgresql-es.org                       --
--------------------------------------------------------
-- Data WareHouse -- Business Intelligence Apprentice --
-- http://www.tdwi.org                                --
--------------------------------------------------------
-- Ruby on Rails Fan/Developer                        --
-- http://rubyonrails.org                             --
--------------------------------------------------------

Comunidad Técnica Cubana de PostgreSQL
http://postgresql.uci.cu
http://personas.grm.uci.cu/+marcos

Centro de Tecnologías de Gestión de Datos (DATEC)
Contact:
        Email: 
        Tel: +53 07-837-3737
             +53 07-837-3714
Universidad de las Ciencias Informáticas
http://www.uci.cu




From:
david@lang.hm
Date:

On Tue, 9 Mar 2010, Pierre C wrote:

> On Tue, 09 Mar 2010 08:00:50 +0100, Greg Smith <> wrote:
>
>> Scott Carey wrote:
>>> For high sequential throughput, nothing is as optimized as XFS on Linux
>>> yet.  It has weaknesses elsewhere however.
>>>
>
> When files are extended one page at a time (as postgres does) fragmentation
> can be pretty high on some filesystems (ext3, but NTFS is the absolute worst)
> if several files (indexes + table) grow simultaneously. XFS has delayed
> allocation which really helps.
>
>> I'm curious what you feel those weaknesses are.
>
> Handling lots of small files, especially deleting them, is really slow on
> XFS.
> Databases don't care about that.

Accessing lots of small files works really well on XFS compared to ext* (I
use XFS with a Cyrus mail server which keeps each message as a separate
file, and XFS vastly outperforms ext2/3 there). Deleting is slow, as you say.

David Lang

> There is also the dark side of delayed allocation : if your application is
> broken, it will manifest itself very painfully. Since XFS keeps a lot of
> unwritten stuff in the buffers, an app that doesn't fsync correctly can lose
> lots of data if you don't have a UPS.
>
> Fortunately, postgres handles fsync like it should be.
>
> A word of advice though : a few years ago, we lost a few terabytes on XFS
> (after that, restoring from backup was quite slow !) because a faulty SCSI
> cable crashed the server, then crashed it again during xfsrepair. So if you
> do xfsrepair on a suspicious system, please image the disks first.

From:
"Kevin Grittner"
Date:

"Pierre C" <> wrote:
> Greg Smith <> wrote:

>> I'm curious what you feel those weaknesses are.
>
> Handling lots of small files, especially deleting them, is really
> slow on XFS.
> Databases don't care about that.

I know of at least one exception to that -- when we upgraded and got
a newer version of the kernel where XFS has write barriers on by
default, some database transactions which were creating and dropping
temporary tables in a loop became orders of magnitude slower.  Now,
that was a silly approach to getting the data that was needed and I
helped them rework the transactions, but something which had worked
acceptably suddenly didn't anymore.

Since we have a BBU hardware RAID controller, we can turn off write
barriers safely, at least according to this page:

http://xfs.org/index.php/XFS_FAQ#Q._Should_barriers_be_enabled_with_storage_which_has_a_persistent_write_cache.3F

This reduces the penalty for creating and deleting lots of small
files.
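For reference, disabling barriers is just a mount option; a hypothetical /etc/fstab line (the device and mount point are made up, and again this is only safe behind a battery-backed write cache):

```
# hypothetical entry -- nobarrier is only safe with a BBU controller
/dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
```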

-Kevin

From:
Michael Stone
Date:

Do keep the postgres xlog on a separate ext2 partition for best
performance. Other than that, xfs is definitely a good performer.
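The usual way to set that up is to move pg_xlog onto the dedicated partition and leave a symlink behind; sketched below in Python against throwaway directories (the helper name and paths are illustrative -- in practice you'd stop the server first and use mv and ln -s):

```python
import os
import shutil
import tempfile

def relocate_xlog(data_dir, xlog_partition):
    """Move pg_xlog to another partition, leaving a symlink at the old path."""
    old = os.path.join(data_dir, "pg_xlog")
    new = os.path.join(xlog_partition, "pg_xlog")
    shutil.move(old, new)   # move the WAL files to the new partition
    os.symlink(new, old)    # postgres follows the symlink transparently
    return old, new

# Demonstration with throwaway directories standing in for the real paths:
data_dir = tempfile.mkdtemp()
partition = tempfile.mkdtemp()
os.mkdir(os.path.join(data_dir, "pg_xlog"))
open(os.path.join(data_dir, "pg_xlog", "000000010000000000000001"), "w").close()
old, new = relocate_xlog(data_dir, partition)
```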

Mike Stone

From:
Scott Carey
Date:

On Mar 8, 2010, at 11:00 PM, Greg Smith wrote:

> Scott Carey wrote:
>> For high sequential throughput, nothing is as optimized as XFS on Linux yet.  It has weaknesses elsewhere however.
>>
>
> I'm curious what you feel those weaknesses are.  The recent addition of
> XFS back into a more mainstream position in the RHEL kernel as of their
> 5.4 update greatly expands where I can use it now, have been heavily
> revisiting it since that release.  I've already noted how well it does
> on sequential read/write tasks relative to ext3, and it looks like the
> main downsides I used to worry about with it (mainly crash recovery
> issues) were also squashed in recent years.
>

My somewhat negative experiences have been:

*  Metadata operations are a bit slow; this manifests itself mostly with
lots of small files being updated or deleted.
*  Improper use of the file system or hardware configuration will likely
break worse (ext3 'ordered' mode makes poorly written apps safer).
*  At least with CentOS 5.3 and their xfs version (non-Redhat, CentOS
extras), sparse random writes could almost hang a file system.  They were
VERY slow.  I have not tested since.

None of the above affect Postgres.

I'm also not sure how up to date RedHat's xfs version is -- there have been
enhancements to xfs in the kernel mainline regularly for a long time.

In non-postgres contexts, I've grown to appreciate some other qualities:
Unlike ext2/3, I can have more than 32K directories in another directory --
XFS will do millions; though it will slow down, at least it doesn't just
throw an error to the application.  And although XFS is slow to delete lots
of small things, it can delete large files much faster -- I deal with lots
of large files and it is comical to see ext3 take a minute to delete a 30GB
file while XFS does it almost instantly.

I have been happy with XFS for Postgres data directories, and ext2 for a
dedicated xlog partition.  Although I have not risked the online
defragmentation on a live DB, I have defragmented an 8TB DB during
maintenance and seen the performance improve.

> --
> Greg Smith  2ndQuadrant US  Baltimore, MD
> PostgreSQL Training, Services and Support
>    www.2ndQuadrant.us
>


From:
Scott Carey
Date:

On Mar 9, 2010, at 4:39 PM, Scott Carey wrote:

>
> On Mar 8, 2010, at 11:00 PM, Greg Smith wrote:
>
> *  At least with CentOS 5.3 and their xfs version (non-Redhat, CentOS
> extras), sparse random writes could almost hang a file system.  They were
> VERY slow.  I have not tested since.
>

Just to be clear, I mean random writes to a _sparse file_.

You can cause this condition with the 'fio' tool, which will by default
allocate a file for write as a sparse file, then write to it.  If the whole
thing is written to first, then random writes are fine.  Postgres only
writes randomly when it overwrites a page; otherwise it's always an append
operation AFAIK.
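For anyone who wants to reproduce this, a hypothetical fio job along those lines (standard fio options; the file name and sizes are arbitrary). fio lays the file out sparse by default, so every random write below lands on an unallocated block:

```
; hypothetical fio job -- random writes into a sparse file
[sparse-randwrite]
filename=testfile
size=1g
rw=randwrite
bs=8k
ioengine=sync
```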



From:
Greg Smith
Date:

Scott Carey wrote:
> I'm also not sure how up to date RedHat's xfs version is -- there have
> been enhancements to xfs in the kernel mainline regularly for a long time.
>

They seem to following SGI's XFS repo quite carefully and cherry-picking
bug fixes out of there, not sure of how that relates to mainline kernel
development right now.  For example:

https://bugzilla.redhat.com/show_bug.cgi?id=509902 (July 2009 SGI
commit, now active for RHEL5.4)
https://bugzilla.redhat.com/show_bug.cgi?id=544349 (November 2009 SGI
commit, may be merged into RHEL5.5 currently in beta)

As far as I've been able to tell, this is all being driven by demand
for >16TB filesystems, i.e.
https://bugzilla.redhat.com/show_bug.cgi?id=213744 , and the whole thing
will be completely mainstream (bundled into the installer, and hopefully
with 32-bit support available) by RHEL6:
https://bugzilla.redhat.com/show_bug.cgi?id=522180

Thanks for the comments.  From all the info I've been able to gather,
"works fine for what PostgreSQL does with the filesystem, not
necessarily suitable for your root volume" seems to be a fair
characterization of where XFS is at right now.  Which is
reasonable--that's the context I'm getting more requests to use it in,
just as the filesystem for where the database lives.  Those who don't
have a separate volume and filesystem for the db also tend not to care
about filesystem performance differences either.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us