Thread: RAID stripe size question


RAID stripe size question

From: "Mikael Carneholm"
Date:

I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for WAL) + 18 (for data). There's only one controller (an Emulex), but I hope performance won't suffer too much from that. RAID level is 0+1, filesystem is ext3.

Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never "searched" during normal operation). And for disks holding the data, a larger stripe size (>32K) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gb FC so I'm throttled to (theoretically) ~192MB/s, anyway.
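
[A small illustrative sketch, not part of the original exchange: a toy model of the stripe-size tradeoff described above. The 8 KB page is PostgreSQL's default block size; the 1 MB "WAL burst" and the drive counts are made-up illustrative values.]

# Toy model of the stripe-size tradeoff: a large I/O spread over small
# stripe units engages more spindles (better transfer rate), while a small
# I/O fits inside one large stripe unit and ties up only one spindle
# (better positioning performance / concurrency).

def drives_touched(io_bytes, stripe_bytes, n_drives, offset=0):
    """Number of distinct data drives a single contiguous I/O touches."""
    first_unit = offset // stripe_bytes
    last_unit = (offset + io_bytes - 1) // stripe_bytes
    return min(last_unit - first_unit + 1, n_drives)

KB = 1024
print(drives_touched(8 * KB, 64 * KB, n_drives=9))    # 8 KB page, 64 KB stripe -> 1 drive
print(drives_touched(1024 * KB, 8 * KB, n_drives=5))  # 1 MB burst, 8 KB stripe -> 5 drives (all)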

So, does this make sense? Has anyone tried it and seen any performance gains from it?

Regards,
Mikael.

Re: RAID stripe size question

From: "Steinar H. Gunderson"
Date:
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
> Now to the interesting part: would it make sense to use different stripe
> sizes on the separate disk arrays? In theory, a smaller stripe size
> (8-32K) should increase sequential write throughput at the cost of
> decreased positioning performance, which sounds good for WAL (assuming
> WAL is never "searched" during normal operation).

For large writes (ie. sequential write throughput), it doesn't really matter
what the stripe size is; all the disks will have to both seek and write
anyhow.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: RAID stripe size question

From: Michael Stone
Date:
On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
>I have finally gotten my hands on the MSA1500 that we ordered some time
>ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) +
>18 (for data). There's only one controller (an emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more
than a couple of gigs?

Mike Stone

Re: RAID stripe size question

From: "Alex Turner"
Date:
With 18 disks dedicated to data, you could make 100/7*9 seeks/second (7ms avg seek time, 9 independent units), which is 128 seeks/second, writing on average 64kb of data, which is 4.1MB/sec throughput worst case, probably 10x best case so 40MB/sec - you might want to take more disks for your data and less for your WAL.

Someone check my math here...
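
[Taking up the invitation to check the math - a quick sketch using the same assumptions (7 ms average access time, 9 independent mirror pairs, 8-64 KB written per seek). Using 1000 ms per second rather than 100 gives roughly ten times the seek rate above, which lands in the 8-80 MB/s range cited later in the thread.]

# Back-of-the-envelope seek-limited write throughput for the 18-disk RAID 10.
ACCESS_TIME_MS = 7.0      # assumed average access time per random I/O
MIRROR_PAIRS = 9          # 18 disks in RAID 10

seeks_per_sec = (1000.0 / ACCESS_TIME_MS) * MIRROR_PAIRS   # ~1286 seeks/s

for io_kb in (8, 64):     # one 8 KB page ... one 64 KB stripe unit per seek
    mb_per_sec = seeks_per_sec * io_kb / 1024.0
    print(f"{io_kb:>3} KB per seek: ~{mb_per_sec:.0f} MB/s")
# -> roughly 10 MB/s at 8 KB and 80 MB/s at 64 KB per random write.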

And as always - run benchmarks with your app to verify

Alex.

On 7/16/06, Mikael Carneholm < Mikael.Carneholm@wirelesscar.com> wrote:

I have finally gotten my hands on the MSA1500 that we ordered some time ago. It has 28 x 10K 146GB drives, currently grouped as 10 (for WAL) + 18 (for data). There's only one controller (an Emulex), but I hope performance won't suffer too much from that. RAID level is 0+1, filesystem is ext3.

Now to the interesting part: would it make sense to use different stripe sizes on the separate disk arrays? In theory, a smaller stripe size (8-32K) should increase sequential write throughput at the cost of decreased positioning performance, which sounds good for WAL (assuming WAL is never "searched" during normal operation). And for disks holding the data, a larger stripe size (>32K) should provide for more concurrent (small) reads/writes at the cost of decreased raw throughput. This is with an OLTP type application in mind, so I'd rather have high transaction throughput than high sequential read speed. The interface is a 2Gb FC so I'm throttled to (theoretically) ~192MB/s, anyway.

So, does this make sense? Has anyone tried it and seen any performance gains from it?

Regards,
Mikael.


Re: RAID stripe size question

From: "Mikael Carneholm"
Date:
Yeah, it seems to be a waste of disk space (spindles as well?). I was
unsure how much activity the WAL disks would have compared to the data
disks, so I created an array from 10 disks as the application is very
write-intensive (many spindles / high throughput is crucial). I guess that
a mirror of two disks is enough from a disk space perspective, but from
a throughput perspective it will limit me to ~25MB/s (roughly
calculated).

An 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL
activity correlates to "normal data" activity (is it 1:1, 1:2, 1:4,
...?)

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Michael
Stone
Sent: 17 July 2006 02:04
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] RAID stripe size question

On Mon, Jul 17, 2006 at 12:52:17AM +0200, Mikael Carneholm wrote:
>I have finally gotten my hands on the MSA1500 that we ordered some time

>ago. It has 28 x 10K 146Gb drives, currently grouped as 10 (for wal) +
>18 (for data). There's only one controller (an emulex), but I hope

You've got 1.4TB assigned to the WAL, which doesn't normally have more
than a couple of gigs?

Mike Stone



Re: RAID stripe size question

From: Markus Schaber
Date:
Hi, Mikael,

Mikael Carneholm wrote:
> An 0+1 array of 4 disks *could* be enough, but I'm still unsure how WAL
> activity correlates to "normal data" activity (is it 1:1, 1:2, 1:4,
> ...?)

I think the main difference is that the WAL activity is mostly linear,
where the normal data activity is rather random access. Thus, a mirror
of a few disks (or, with good controller hardware, RAID 6 on 4 disks or so)
for WAL should be enough to cope with a large set of data and index
disks, which spend a lot more time seeking.

Btw, it may make sense to spread different tables or tables and indices
onto different Raid-Sets, as you seem to have enough spindles.

And look into the commit_delay/commit_siblings settings; they allow you
to trade latency for throughput (meaning a little more latency per
transaction, but many more transactions per second of throughput for the
whole system.)


HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: RAID stripe size question

From: "Mikael Carneholm"
Date:
>I think the main difference is that the WAL activity is mostly linear,
where the normal data activity is rather random access.

That was what I was expecting, and after reading
http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html I
figured that a different stripe size for the WAL set could be worth
investigating. I have now dropped the old sets (10+18) and created two
new raid1+0 sets (4 for WAL, 24 for data) instead. Bonnie++ is still
running, but I'll post the numbers as soon as it has finished. I did
actually use different stripe sizes for the sets as well, 8k for the WAL
disks and 64k for the data. It's quite painless to do these things with
HBAnywhere, so it's no big deal if I have to go back to another
configuration. The battery-backed cache is only 256MB though, and that bothers
me; I assume a larger (512MB - 1GB) cache would make quite a difference.
Oh well.

>Btw, it may make sense to spread different tables or tables and indices
onto different Raid-Sets, as you seem to have enough spindles.

This is something I'd also would like to test, as a common best-practice
these days is to go for a SAME (stripe all, mirror everything) setup.
From a development perspective it's easier to use SAME as the developers
won't have to think about physical location for new tables/indices, so
if there's no performance penalty with SAME I'll gladly keep it that
way.

>And look into the commit_delay/commit_siblings settings; they allow you
to trade latency for throughput (meaning a little more latency per
transaction, but many more transactions per second of throughput for the
whole system.)

In a previous test, using cd=5000 and cs=20 increased transaction
throughput by ~20% so I'll definitely fiddle with that in the coming
tests as well.

Regards,
Mikael.

Re: RAID stripe size question

From: Markus Schaber
Date:
Hi, Mikael,

Mikael Carneholm wrote:

> This is something I'd also would like to test, as a common best-practice
> these days is to go for a SAME (stripe all, mirror everything) setup.
> From a development perspective it's easier to use SAME as the developers
> won't have to think about physical location for new tables/indices, so
> if there's no performance penalty with SAME I'll gladly keep it that
> way.

Usually, it's not the developer's task to care about that, but the DBA's
responsibility.

>> And look into the commit_delay/commit_siblings settings; they allow you
> to trade latency for throughput (meaning a little more latency per
> transaction, but many more transactions per second of throughput for the
> whole system.)
>
> In a previous test, using cd=5000 and cs=20 increased transaction
> throughput by ~20% so I'll definitely fiddle with that in the coming
> tests as well.

How many parallel transactions do you have?

Markus



--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in EU! www.ffii.org www.nosoftwarepatents.org

Re: RAID stripe size question

From: "Mikael Carneholm"
Date:
>> This is something I'd also would like to test, as a common
>> best-practice these days is to go for a SAME (stripe all, mirror
everything) setup.
>> From a development perspective it's easier to use SAME as the
>> developers won't have to think about physical location for new
>> tables/indices, so if there's no performance penalty with SAME I'll
>> gladly keep it that way.

>Usually, it's not the developer's task to care about that, but the DBA's
responsibility.

As we don't have a full-time dedicated DBA (although I'm the one who does
most DBA-related tasks) I would aim for making physical location as
transparent as possible; otherwise I'm afraid I won't be doing anything
else than supporting developers with that - and I *do* have other things
to do as well :)

>> In a previous test, using cd=5000 and cs=20 increased transaction
>> throughput by ~20% so I'll definitely fiddle with that in the coming
>> tests as well.

>How many parallel transactions do you have?

That was when running BenchmarkSQL
(http://sourceforge.net/projects/benchmarksql) with 100 concurrent users
("terminals"), which I assume means 100 parallel transactions at most.
The target application for this DB has 3-4 times as many concurrent
connections so it's possible that one would have to find other cs/cd
numbers better suited for that scenario. Tweaking bgwriter is another
task I'll look into as well.

Btw, here's the bonnie++ results from two different array sets (10+18,
4+24) on the MSA1500:

LUN: WAL, 10 disks, stripe size 32K
------------------------------------
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
sesell01        32G 56139  93 73250  22 16530   3 30488  45 57489   5 477.3   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  2458  90 +++++ +++ +++++ +++  3121  99 +++++ +++ 10469  98


LUN: WAL, 4 disks, stripe size 8K
----------------------------------
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
sesell01        32G 49170  82 60108  19 13325   2 15778  24 21489   2 266.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  2432  86 +++++ +++ +++++ +++  3106  99 +++++ +++ 10248  98


LUN: DATA, 18 disks, stripe size 32K
-------------------------------------
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
sesell01        32G 59990  97 87341  28 19158   4 30200  46 57556   6 495.4   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  1640  92 +++++ +++ +++++ +++  1736  99 +++++ +++ 10919  99


LUN: DATA, 24 disks, stripe size 64K
-------------------------------------
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
sesell01        32G 59443  97 118515  39 25023   5 30926  49 60835   6 531.8   1
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
                 16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100

Regards,
Mikael

Re: RAID stripe size question

From: Ron Peacetree
Date:
>From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>Sent: Jul 16, 2006 6:52 PM
>To: pgsql-performance@postgresql.org
>Subject: [PERFORM] RAID stripe size question
>
>I have finally gotten my hands on the MSA1500 that we ordered some time
>ago. It has 28 x 10K 146Gb drives,
>
Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K.
(unless they are old?)
I'm not just being pedantic.  The correct, let alone optimal, answer to your question depends on your exact HW characteristics as well as your SW config and your usage pattern.
15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.
Most modern HDs in this class will do ~60MB/s inner tracks ~75MB/s avg and ~90MB/s outer tracks.

If you are doing OLTP-like things, you are more sensitive to latency than most and should use the absolute lowest latency HDs available within your budget.  The current latency best case is 15Krpm FC HDs.


>currently grouped as 10 (for wal) + 18 (for data). There's only one controller (an emulex), but I hope
>performance won't suffer too much from that. Raid level is 0+1,
>filesystem is ext3.
>
I strongly suspect having only 1 controller is an I/O choke w/ 28 HDs.

28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.
If both sets are to run at peak average speed, the Emulex would have to be able to handle ~1050MBps on average.
It is doubtful the 1 Emulex can do this.

In order to handle this level of bandwidth, a RAID controller must aggregate multiple FC, SCSI, or SATA streams as well as do any RAID 5 checksumming etc. that is required.
Very, very few RAID controllers can do >= 1GBps
One thing that help greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly
can. Even multiple GBs of BBC can be worth it.  Another reason to have multiple controllers ;-) 

Then there is the question of the BW of the bus that the controller is plugged into.
~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel.
PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc.
At present I know of no RAID controllers that can singly saturate a PCI-Ex4 or greater bus.

...and we haven't even touched on OS, SW, and usage pattern issues.

Bottom line is that the IO chain is only as fast as its slowest component.
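
[A small sketch of the "slowest component" point, using the figures above (~75 MB/s sustained per drive, ~200 MB/s for a 2 Gb FC link, ~800 MB/s real-world PCI-X, ~1 GB/s for PCI-E x4). These are the thread's estimates, not measurements.]

# Which link in the I/O chain saturates first? (all numbers are estimates)
HD_MBPS = 75                      # assumed sustained transfer per drive
wal_pairs, data_pairs = 5, 9      # the 10-disk and 18-disk RAID 10 sets

platter_bw = (wal_pairs + data_pairs) * HD_MBPS    # ~1050 MB/s aggregate

links = {
    "2Gb FC link (single port)": 200,
    "PCI-X 64b/133MHz (real world)": 800,
    "PCI-E x4 (rule of thumb)": 1000,
    "28 spindles (platters)": platter_bw,
}
bottleneck = min(links, key=links.get)
print(f"aggregate platter bandwidth: ~{platter_bw} MB/s")
print(f"slowest component: {bottleneck} at ~{links[bottleneck]} MB/s")
# -> the single 2 Gb FC port, not the spindles, caps sequential throughput.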


>Now to the interesting part: would it make sense to use different stripe
>sizes on the separate disk arrays?
>
The short answer is Yes.
WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read
afterwards. Big DB pages and big RAID stripes makes sense for WALs. 

Tables with OLTP-like characteristics need smaller DB pages and stripes to minimize latency issues (although locality of reference can make the optimum stripe size larger).

Tables with Data Mining like characteristics usually work best with larger DB pages sizes and RAID stripe sizes.

OS and FS overhead can make things more complicated.  So can DB layout and access pattern issues.

Side note: a 10 HD RAID 10 seems a bit much for WAL.  Do you really need 375MBps IO on average to your WAL more than you need IO capacity for other tables?
If WAL IO needs to be very high, I'd suggest getting an SSD or SSD-like device that fits your budget and having said device async mirror to HD.

Bottom line is to optimize your RAID stripe sizes =after= you optimize your OS, FS, and pg design for best IO for your usage pattern(s).

Hope this helps,
Ron

Re: RAID stripe size question

From: "Steinar H. Gunderson"
Date:
On Mon, Jul 17, 2006 at 09:40:30AM -0400, Ron Peacetree wrote:
> Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K.
> (unless they are old?)

There are still 146GB SCSI 10000rpm disks being sold here, at least.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: RAID stripe size question

From: "Alex Turner"
Date:


On 7/17/06, Mikael Carneholm <Mikael.Carneholm@wirelesscar.com> wrote:
[Mikael's full reply, including the bonnie++ results for the four LUN configurations, quoted verbatim - see the previous messages above.]

These bonnie++ numbers are very worrying.  Your controller should easily max out your FC interface on these tests, passing 192MB/sec with ease on anything more than a 6 drive RAID 10.  This is a bad omen if you want high performance...  Each mirror pair can do 60-80MB/sec.  A 24 disk RAID 10 can do 12*60MB/sec which is 720MB/sec - I have seen this performance, it's not unreachable, but time and again we see these bad perf numbers from FC and SCSI systems alike.  Consider a different controller, because this one is not up to snuff.  A single drive would get better numbers than your 4 disk RAID 10; 21MB/sec read speed is really pretty sorry, it should be closer to 120MB/sec.  If you can't swap out, software RAID may turn out to be your friend.  The only saving grace is that this is OLTP, and perhaps, just maybe, the controller will be better at ordering IOs, but I highly doubt it.

Please people, do the numbers, benchmark before you buy: many many HBAs really suck under Linux/FreeBSD, and you may end up paying vast sums of money for very sub-optimal performance (I'd say sub-standard, but alas, it seems that this kind of poor performance is tolerated, even though it's way off where it should be).  There's no point having a 40 disk cab if your controller can't handle it.

Maximum theoretical linear throughput can be achieved in a White Box for under $20k, and I have seen this kind of system outperform a server 5 times its price even in OLTP.

Alex

Re: RAID stripe size question

From: "Mikael Carneholm"
Date:
>Unless I'm missing something, the only FC or SCSI HDs of ~147GB capacity are 15K, not 10K.

In the spec we got from HP, they are listed as model 286716-B22 (http://www.dealtime.com/xPF-Compaq_HP_146_8_GB_286716_B22) which seems to run at 10K. Don't know how old those are, but that's what we got from HP anyway.

>15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.

Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?

> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.

I guess it's still limited by the 2Gbit FC (192MB/s), right?

>Very, very few RAID controllers can do >= 1GBps. One thing that help greatly with bursty IO patterns is to up your battery backed RAID cache as high as you possibly can.  Even multiple GBs of BBC can be worth it.  Another reason to have multiple controllers ;-)

I use 90% of the raid cache for writes, don't think I could go higher than that. Too bad the emulex only has 256Mb though :/

>Then there is the question of the BW of the bus that the controller is plugged into.
>~800MB/s is the RW max to be gotten from a 64b 133MHz PCI-X channel.
>PCI-E channels are usually good for 1/10 their rated speed in bps as Bps.
>So a PCI-Ex4 10Gbps bus can be counted on for 1GBps, PCI-Ex8 for 2GBps, etc.
>At present I know of no RAID controllers that can singlely saturate a PCI-Ex4 or greater bus.

The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.

>>Now to the interesting part: would it make sense to use different
>>stripe sizes on the separate disk arrays?
>>
>The short answer is Yes.

Ok

>WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read
afterwards. Big DB pages and big RAID stripes makes sense for WALs. 

According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")

I guess I'll have to find out which theory holds by good ol' trial and error... :)

- Mikael

Re: RAID stripe size question

From: Mark Kirkwood
Date:
Mikael Carneholm wrote:
>
> Btw, here's the bonnie++ results from two different array sets (10+18,
> 4+24) on the MSA1500:
>
>
> LUN: DATA, 24 disks, stripe size 64K
> -------------------------------------
> Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
> sesell01        32G 59443  97 118515  39 25023   5 30926  49 60835   6 531.8   1
>                     ------Sequential Create------ --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP /sec %CP
>                  16  2499  90 +++++ +++ +++++ +++  2817  99 +++++ +++ 10971 100
>


It might be interesting to see if a 128K or 256K stripe size gives better
sequential throughput, while still leaving the random performance ok.
Having said that, the seeks/s figure of 531 is not that great - for
instance I've seen a 12 disk (15K SCSI) system report about 1400 seeks/s
in this test.

Sorry if you mentioned this already - but what OS and filesystem are you
using? (if Linux and ext3, it might be worth experimenting with xfs or jfs).

Cheers

Mark

Re: RAID stripe size question

From: Ron Peacetree
Date:
-----Original Message-----
>From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>Sent: Jul 17, 2006 5:16 PM
>To: Ron  Peacetree <rjpeace@earthlink.net>, pgsql-performance@postgresql.org
>Subject: RE: [PERFORM] RAID stripe size question
>
>>15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.
>
>Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?
>
Ah, the games vendors play.  "average seek time" for a 10Krpm HD may very well be 4.9ms.  However, what matters to you the user is "average =access= time".  The 1st is how long it takes to position the heads to the correct track.  The 2nd is how long it takes to actually find and get data from a specified HD sector.
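
[To put numbers on the distinction, a sketch: average access time is roughly the quoted average seek time plus half a platter rotation. The 4.9 ms figure is from the HP spec quoted above; 3.8 ms is an assumed typical 15Krpm seek time.]

# average access time ~= average seek time + average rotational latency
def access_time_ms(seek_ms, rpm):
    half_rotation_ms = 0.5 * 60000.0 / rpm
    return seek_ms + half_rotation_ms

print(f"10Krpm: ~{access_time_ms(4.9, 10000):.1f} ms")   # ~7.9 ms
print(f"15Krpm: ~{access_time_ms(3.8, 15000):.1f} ms")   # ~5.8 ms
# Matches the 7-8 ms (10K) and 5-6 ms (15K) access times quoted earlier.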

>> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.
>
>I guess it's still limited by the 2Gbit FC (192MB/s), right?
>
No.  A decent HBA has multiple IO channels on it.  So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II Controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!).  Nominally, this card can push 8Gbps = 800MBps.  ~600-700MBps is the RW number.

Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.

>>Very, very few RAID controllers can do >= 1GBps One thing that help greatly with
>>bursty IO patterns is to up your battery backed RAID cache as high as you possibly
>>can.  Even multiple GBs of BBC can be worth it.
>>Another reason to have multiple controllers ;-)
>
>I use 90% of the raid cache for writes, don't think I could go higher than that.
>Too bad the emulex only has 256Mb though :/
>
If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater.  I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install.  For those access patterns, no amount of RAID cache commercially available was enough to find the "flattening" point of the cache percentage curve.  256MB of BB RAID cache per HBA is just not that much for many IO patterns.


>The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.
>
This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HD's.

28 such HDs are =definitely= IO choked on this HBA.

The arithmetic suggests you need a better HBA or more HBAs or both.


>>WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read
afterwards. Big DB pages and big RAID stripes makes sense for WALs. 
>
>According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")
>
>I guess I'll have to find out which theory holds by good ol' trial and error... :)
>
IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS +
HW.


Re: RAID stripe size question

From: "Alex Turner"
Date:


On 7/17/06, Ron Peacetree <rjpeace@earthlink.net> wrote:
-----Original Message-----
>From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>Sent: Jul 17, 2006 5:16 PM
>To: Ron  Peacetree < rjpeace@earthlink.net>, pgsql-performance@postgresql.org
>Subject: RE: [PERFORM] RAID stripe size question
>
>>15Krpm HDs will have average access times of 5-6ms.  10Krpm ones of 7-8ms.
>
>Average seek time for that disk is listed as 4.9ms, maybe sounds a bit optimistic?
>
Ah, the games vendors play.  "average seek time" for a 10Krpm HD may very well be 4.9ms.  However, what matters to you the user is "average =access= time".  The 1st is how long it takes to position the heads to the correct track.  The 2nd is how long it takes to actually find and get data from a specified HD sector.

>> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s,  ~75*9= ~675MB/s.
>
>I guess it's still limited by the 2Gbit FC (192MB/s), right?
>
No.  A decent HBA has multiple IO channels on it.  So for instance Areca's ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II Controller) has 2 4Gbps FCs in it (...and can support up to 4GB of BB cache!).  Nominally, this card can push 8Gbps = 800MBps.  ~600-700MBps is the RW number.

Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID 10 set per ARC-6080.

>>Very, very few RAID controllers can do >= 1GBps One thing that help greatly with
>>bursty IO patterns is to up your battery backed RAID cache as high as you possibly
>>can.  Even multiple GBs of BBC can be worth it.
>>Another reason to have multiple controllers ;-)
>
>I use 90% of the raid cache for writes, don't think I could go higher than that.
>Too bad the emulex only has 256Mb though :/
>
If your RAID cache hit rates are in the 90+% range, you probably would find it profitable to make it greater.  I've definitely seen access patterns that benefitted from increased RAID cache for any size I could actually install.  For those access patterns, no amount of RAID cache commercially available was enough to find the "flattening" point of the cache percentage curve.  256MB of BB RAID cache per HBA is just not that much for many IO patterns.

90% as in 90% of the RAM, not 90% hit rate I'm imagining.

>The controller is a FC2143 (http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID= ), which uses PCI-E. Don't know how it compares to other controllers, haven't had the time to search for / read any reviews yet.
>
This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of 75MBps ASTR HD's.

28 such HDs are =definitely= IO choked on this HBA.


No, they aren't.  This is OLTP, not data warehousing.  I already posted math for OLTP throughput, which is in the order of 8-80MB/second actual data throughput based on maximum theoretical seeks/second.

The arithmetic suggests you need a better HBA or more HBAs or both.


>>WAL's are basically appends that are written in bursts of your chosen log chunk size and that are almost never read afterwards.  Big DB pages and big RAID stripes makes sense for WALs.

unless of course you are running OLTP, in which case a big stripe isn't necessary; spend the disks on your data partition, because your WAL activity is going to be small compared with your random IO.

>
>According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around? ("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives that an average file will use to hold all the blocks containing the data of that file, theoretically increasing transfer performance, but decreasing positioning performance.")
>
>I guess I'll have to find out which theory holds by good ol' trial and error... :)
>
IME, stripe sizes of 64, 128, or 256 are the most common found to be optimal for most access patterns + SW + FS + OS + HW.

New records will be posted at the end of a file, and will only increase the file by the number of blocks in the transactions posted at write time.  Updated records are modified in place unless they have grown too big to be in place.  If you are updating multiple tables on each transaction, a 64kb stripe size or lower is probably going to be best as block sizes are just 8kb.  How much data does your average transaction write?  How many xacts per second?  This will help determine how many writes your cache will queue up before it flushes, and therefore what the optimal stripe size will be.  Of course, the fastest and most accurate way is probably just to try different settings and see how it works.  Alas some controllers seem to handle some stripe sizes more efficiently in defiance of any logic.

Work out how big your xacts are, how many xacts/second you can post, and you will figure out how fast WAL will be written.  Allocate enough disk for peak load plus planned expansion on WAL and then put the rest to tablespace.  You may well find that a single RAID 1 is enough for WAL (if you achieve theoretical performance levels, which it's clear your controller isn't).

For example, your bonnie++ benchmark shows 538 seeks/second.  If on each seek one writes 8k of data (one block) then your total throughput to disk is 538*8k=4304k, which is just 4MB/second actual throughput for WAL, which is about what I estimated in my calculations earlier.   A single RAID 1 will easily suffice to handle WAL for this kind of OLTP xact rate.  Even if you write a full stripe on every pass at 64kb, that's still only 538*64k = 34432k or around 34Meg, still within the capability of a correctly running RAID 1, and even with your low bonnie scores, within the capability of your 4 disk RAID 10.
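
[The same estimate as a sketch, reusing the ~538 seeks/s measured by bonnie++ and treating the bytes written per commit as a parameter (one 8 KB page vs. a full 64 KB stripe).]

# Seek-limited WAL bandwidth from the measured bonnie++ random-seek rate.
SEEKS_PER_SEC = 538
for write_kb in (8, 64):
    kb_per_sec = SEEKS_PER_SEC * write_kb
    print(f"{write_kb:>3} KB/seek -> {kb_per_sec} KB/s (~{kb_per_sec / 1024:.1f} MB/s)")
# -> 4304 KB/s (~4.2 MB/s) and 34432 KB/s (~33.6 MB/s), well within what a
#    single healthy RAID 1 pair should sustain for WAL.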

Remember when it comes to OLTP, massive serial throughput is not gonna help you, it's low seek times, which is why people still buy 15k RPM drives, and why you don't necessarily need a honking SAS/SATA controller which can harness the full 1066MB/sec of your PCI-X bus, or more for PCIe.  Of course, once you have a bunch of OLTP data, people will inevitably want reports on that stuff, and what was mainly an OLTP database suddenly becomes a data warehouse in a matter of months, so don't neglect to consider that problem also.

Also more RAM on the RAID card will seriously help bolster your transaction rate, as your controller can queue up a whole bunch of table writes and burst them all at once in a single seek, which will increase your overall throughput by as much as an order of magnitude (and you would have to increase WAL accordingly therefore).

But finally - if your card/cab isn't performing, RMA it.  Send the damn thing back and get something that actually can do what it should.  Don't tolerate manufacturers' BS!!

Alex

Re: RAID stripe size question

From: Ron Peacetree
Date:
>From: Alex Turner <armtuk@gmail.com>
>Sent: Jul 18, 2006 12:21 AM
>To: Ron Peacetree <rjpeace@earthlink.net>
>Cc: Mikael Carneholm <Mikael.Carneholm@wirelesscar.com>, pgsql-performance@postgresql.org
>Subject: Re: [PERFORM] RAID stripe size question
>
>On 7/17/06, Ron Peacetree <rjpeace@earthlink.net> wrote:
>>
>> -----Original Message-----
>> >From: Mikael Carneholm <Mikael.Carneholm@WirelessCar.com>
>> >Sent: Jul 17, 2006 5:16 PM
>> >To: Ron  Peacetree <rjpeace@earthlink.net>,
>> pgsql-performance@postgresql.org
>> >Subject: RE: [PERFORM] RAID stripe size question
>> >
>> >I use 90% of the raid cache for writes, don't think I could go higher
>> >than that.
>> >Too bad the emulex only has 256Mb though :/
>> >
>> If your RAID cache hit rates are in the 90+% range, you probably would
>> find it profitable to make it greater.  I've definitely seen access patterns
>> that benefitted from increased RAID cache for any size I could actually
>> install.  For those access patterns, no amount of RAID cache commercially
>> available was enough to find the "flattening" point of the cache percentage
>> curve.  256MB of BB RAID cache per HBA is just not that much for many IO
>> patterns.
>
>90% as in 90% of the RAM, not 90% hit rate I'm imagining.
>
Either way, =particularly= for OLTP-like I/O patterns, the more RAID cache the better unless the IO pattern is completely random.  In which case the best you can do is cache the entire sector map of the RAID set and use as many spindles as possible for the tables involved.  I've seen high end set ups in Fortune 2000 organizations that look like some of the things you read about on tpc.org: =hundreds= of HDs are used.

Clearly, completely random IO patterns are to be avoided whenever and however possible.

Thankfully, most things can be designed to not have completely random IO, and stuff like WAL IO is definitely not random.

The important point here about cache size is that unless you make cache large enough that you see a flattening in the cache behavior, you probably can still use more cache.  Working sets are often very large for DB applications.


>>The controller is a FC2143 (
>>
http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=),
>> which uses PCI-E. Don't know how it compares to other controllers, haven't
>> had the time to search for / read any reviews yet.
>> >
>> This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained IO on
>> it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set made of
>> 75MBps ASTR HD's.
>>
>> 28 such HDs are =definitely= IO choked on this HBA.
>
>Not they aren't.  This is OLTP, not data warehousing.  I already posted math
>for OLTP throughput, which is in the order of 8-80MB/second actual data
>throughput based on maximum theoretical seeks/second.
>
WAL IO patterns are not OLTP-like.  Neither are most support or decision support IO patterns.  Even in an OLTP system, there are usually only a few scenarios and tables where the IO pattern is pessimal.
Alex is quite correct that those few will be the bottleneck on overall system performance if the system's primary function is OLTP-like.

For those few, you dedicate as many spindles and as much RAID cache as you can afford and as still show a performance benefit.  I've seen an entire HBA maxed out with cache and as many HDs as would saturate the attainable IO rate dedicated to =1= table (unfortunately SSD was not a viable option in this case).


>>The arithmetic suggests you need a better HBA or more HBAs or both.
>>
>>
>> >>WAL's are basically appends that are written in bursts of your chosen
>> log chunk size and that are almost never read afterwards.  Big DB pages and
>> big RAID stripes makes sense for WALs.
>
>
>unless of course you are running OLTP, in which case a big stripe isn't
>necessary, spend the disks on your data parition, because your WAL activity
>is going to be small compared with your random IO.
>
Or to put it another way, the scenarios and tables that have the most random looking IO patterns are going to be the performance bottleneck on the whole system.  In an OLTP-like system, WAL IO is unlikely to be your biggest performance issue.  As in any other performance tuning effort, you only gain by speeding up the current bottleneck.


>>
>> >According to
>> http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it
>> seems to be the other way around? ("As stripe size is decreased, files are
>> broken into smaller and smaller pieces. This increases the number of drives
>> that an average file will use to hold all the blocks containing the data of
>> that file, theoretically increasing transfer performance, but decreasing
>> positioning performance.")
>> >
>> >I guess I'll have to find out which theory holds by good ol' trial
>> and error... :)
>> >
>> IME, stripe sizes of 64, 128, or 256 are the most common found to be
>> optimal for most access patterns + SW + FS + OS + HW.
>
>
>New records will be posted at the end of a file, and will only increase the
>file by the number of blocks in the transactions posted at write time.
>Updated records are modified in place unless they have grown too big to be
>in place.  If you are updated mutiple tables on each transaction, a 64kb
>stripe size or lower is probably going to be best as block sizes are just
>8kb.
>
Here's where Theory and Practice conflict.  pg does not "update" and modify in place in the true DB sense.  A pg UPDATE is actually an insert of a new row or rows, !not! a modify in place.
I'm sure Alex knows this and just temporarily forgot some of the context of this thread :-)

The append behavior Alex refers to is the best case scenario for pg where a) the table is unfragmented and b) the file segment of say 2GB holding that part of the pg table is not full.
VACUUM and autovacuum are your friend.


>How much data does your average transaction write?  How many xacts per
>second, this will help determine how many writes your cache will queue up
>before it flushes, and therefore what the optimal stripe size will be.  Of
>course, the fastest and most accurate way is probably just to try different
>settings and see how it works.  Alas some controllers seem to handle some
>stripe sizes more effeciently in defiance of any logic.
>
>Work out how big your xacts are, how many xacts/second you can post, and you
>will figure out how fast WAL will be writting.  Allocate enough disk for
>peak load plus planned expansion on WAL and then put the rest to
>tablespace.  You may well find that a single RAID 1 is enough for WAL (if
>you acheive theoretical performance levels, which it's clear your controller
>isn't).
>
This is very good advice.


>For example, you bonnie++ benchmark shows 538 seeks/second.  If on each seek
>one writes 8k of data (one block) then your total throughput to disk is
>538*8k=4304k which is just 4MB/second actual throughput for WAL, which is
>about what I estimated in my calculations earlier.   A single RAID 1 will
>easily suffice to handle WAL for this kind of OLTP xact rate.  Even if you
>write a full stripe on every pass at 64kb, thats still only 538*64k = 34432k
>or around 34Meg, still within the capability of a correctly running RAID 1,
>and even with your low bonnie scores, within the capability of your 4 disk
>RAID 10.
>
I'd also suggest that you figure out what the max access per sec is for HDs and make sure you are attaining it, since this will set the ceiling on your overall system performance.

Like I've said, I've seen organizations dedicate as much HW as could make any difference on a per table basis for important OLTP systems.


>Remember when it comes to OLTP, massive serial throughput is not gonna help
>you, it's low seek times, which is why people still buy 15k RPM drives, and
>why you don't necessarily need a honking SAS/SATA controller which can
>harness the full 1066MB/sec of your PCI-X bus, or more for PCIe.  Of course,
>once you have a bunch of OLTP data, people will innevitably want reports on
>that stuff, and what was mainly an OLTP database suddenly becomes a data
>warehouse in a matter of months, so don't neglect to consider that problem
>also.
>
One Warning to expand on Alex's point here.

DO !NOT! use the same table schema and/or DB for your reporting and OLTP.
You will end up with a DBMS that is neither good at reporting nor OLTP.


>Also more RAM on the RAID card will seriously help bolster your transaction
>rate, as your controller can queue up a whole bunch of table writes and
>burst them all at once in a single seek, which will increase your overall
>throughput by as much as an order of magnitude (and you would have to
>increase WAL accordingly therefore).
>
*nods*


>But finally - if your card/cab isn't performing RMA it.  Send the damn thing
>back and get something that actualy can do what it should.  Don't tolerate
>manufacturers BS!!
>
On this Alex and I are in COMPLETE agreement.

Ron


Re: RAID stripe size question

From: "Mikael Carneholm"
Date:
> This is a relatively low end HBA with 1 4Gb FC on it.  Max sustained
IO on it is going to be ~320MBps.  Or ~ enough for an 8 HD RAID 10 set
made of 75MBps ASTR HD's.

Looking at http://h30094.www3.hp.com/product.asp?sku=2260908&extended=1,
I notice that the controller has an Ultra160 SCSI interface, which implies
that the theoretical max throughput is 160MB/s. Ouch.

However, what's more important is the seeks/s - ~530/s on a 28 disk
array is quite lousy compared to the 1400/s on a 12 x 15K disk array as
mentioned by Mark here:
http://archives.postgresql.org/pgsql-performance/2006-07/msg00170.php.
Could be the disk RPM (10K vs 15K) that makes the difference here...
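
[A rough sanity check on that figure - a sketch that ignores queue depth, the benchmark's own concurrency limits and controller behaviour; the 7.9 ms access time is the 4.9 ms quoted seek plus ~3 ms rotational latency for a 10K drive, and the ~530 seeks/s came from the 24-disk data LUN run.]

# Rough ceiling on random seeks/s for the 24-disk data LUN.
ACCESS_MS_10K = 7.9
MIRROR_PAIRS = 12                     # 24 disks in RAID 1+0

per_pair = 1000.0 / ACCESS_MS_10K     # ~127 random I/Os per second per pair
ceiling = per_pair * MIRROR_PAIRS     # ~1500 seeks/s
print(f"~{ceiling:.0f} seeks/s possible vs ~530 measured")
# Even this conservative estimate is ~3x the measured figure, which points at
# the controller/HBA rather than the 10K spindles alone.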

I will test another stripe size (128K) for the DATA LUN (28 disks) to
see what difference that makes; I think I read somewhere that Linux
flushes blocks of 128K at a time, so it might be worth evaluating.

/Mikael



Re: RAID stripe size question

From: "Luke Lonergan"
Date:
Mikael,

On 7/18/06 6:34 AM, "Mikael Carneholm" <Mikael.Carneholm@WirelessCar.com> wrote:

> However, what's more important is the seeks/s - ~530/s on a 28 disk
> array is quite lousy compared to the 1400/s on a 12 x 15Kdisk array

I'm getting 2500 seeks/second on a 36 disk SATA software RAID (ZFS, Solaris 10) on a Sun X4500:

=========== Single Stream ============

With a very recent update to the zfs module that improves I/O scheduling and prefetching, I get the following bonnie++ 1.03a results with a 36 drive RAID10, Solaris 10 U2 on an X4500 with 500GB Hitachi drives (zfs checksumming is off):

Version  1.03       ------Sequential Output------    --Sequential Input-   --Random-
                    -Per Chr-  --Block--  -Rewrite-  -Per Chr-  --Block--  --Seeks--
Machine        Size K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP /sec %CP
thumperdw-i-1   32G 120453  99 467814  98 290391  58 109371  99 993344  94 1801   4
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ 30850  99 +++++ +++ +++++ +++

=========== Two Streams ============

Bumping up the number of concurrent processes to 2, we get about 1.5x speed reads of RAID10 with a concurrent workload (you have to add the rates together):

Version  1.03       ------Sequential Output------   --Sequential Input-     --Random-
                    -Per Chr- --Block--  -Rewrite-  -Per Chr-  --Block--    --Seeks--
Machine        Size K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP  /sec %CP
thumperdw-i-1   32G 111441  95 212536  54 171798  51 106184  98 719472  88  1233   2
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26085  90 +++++ +++  5700  98 21448  97 +++++ +++  4381  97

Version  1.03       ------Sequential Output------   --Sequential Input-     --Random-
                    -Per Chr-  --Block--  -Rewrite-  -Per Chr-  --Block--   --Seeks--
Machine        Size K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP K/sec  %CP  /sec %CP
thumperdw-i-1   32G 116355  99 212509  54 171647  50 106112  98 715030  87  1274   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 26082  99 +++++ +++  5588  98 21399  88 +++++ +++  4272  97

So that’s 2500 seeks per second, 1440MB/s sequential block read, 212MB/s per character sequential read.
=======================
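
[For the record, a tiny sketch showing that the summary figures are just the two concurrent streams added together (using K = 1000, as the rounded summary numbers appear to).]

seeks  = 1233 + 1274                   # ~2500 seeks/s
blk_rd = (719472 + 715030) / 1000.0    # ~1435 MB/s sequential block read
chr_rd = (106184 + 106112) / 1000.0    # ~212 MB/s per-character read
print(seeks, round(blk_rd), round(chr_rd))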

- Luke

Re: RAID stripe size question

From: "Alex Turner"
Date:
This is a great testament to the fact that very often software RAID will seriously outperform hardware RAID, because the OS guys who implemented it took the time to do it right, as compared with some controller manufacturers who seem to think it's okay to provide sub-standard performance.

Based on the bonnie++ numbers coming back from your array, I would also encourage you to evaluate software RAID, as you might see significantly better performance as a result.  RAID 10 is also a good candidate as it's not so heavy on the cache and CPU as RAID 5.

Alex.

On 7/18/06, Luke Lonergan <llonergan@greenplum.com> wrote:
[Luke's message, including the X4500 bonnie++ results, quoted in full - see above.]

Re: RAID stripe size question

From: Scott Marlowe
Date:
On Tue, 2006-07-18 at 14:27, Alex Turner wrote:
> This is a great testament to the fact that very often software RAID
> will seriously outperform hardware RAID because the OS guys who
> implemented it took the time to do it right, as compared with some
> controller manufacturers who seem to think it's okay to provided
> sub-standard performance.
>
> Based on the bonnie++ numbers comming back from your array, I would
> also encourage you to evaluate software RAID, as you might see
> significantly better performance as a result.  RAID 10 is also a good
> candidate as it's not so heavy on the cache and CPU as RAID 5.

Also, consider testing a mix, where your hardware RAID controller does
the mirroring and the OS stripes (RAID 0) over the top of it.  I've
gotten good performance from mediocre hardware cards doing this.  It has
the advantage of still being able to use the battery backed cache and
its instant fsync while not relying on some cards that have issues
layering RAID levels one atop the other.

Re: RAID stripe size question

From: Ron Peacetree
Date:
Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, OS does RAID 0)?  If so, what were the results?

Ron

-----Original Message-----
>From: Scott Marlowe <smarlowe@g2switchworks.com>
>Sent: Jul 18, 2006 3:37 PM
>To: Alex Turner <armtuk@gmail.com>
>Cc: Luke Lonergan <llonergan@greenplum.com>, Mikael Carneholm <Mikael.Carneholm@wirelesscar.com>, Ron Peacetree
<rjpeace@earthlink.net>,pgsql-performance@postgresql.org 
>Subject: Re: [PERFORM] RAID stripe size question
>
>On Tue, 2006-07-18 at 14:27, Alex Turner wrote:
>> This is a great testament to the fact that very often software RAID
>> will seriously outperform hardware RAID because the OS guys who
>> implemented it took the time to do it right, as compared with some
>> controller manufacturers who seem to think it's okay to provide
>> sub-standard performance.
>>
>> Based on the bonnie++ numbers coming back from your array, I would
>> also encourage you to evaluate software RAID, as you might see
>> significantly better performance as a result.  RAID 10 is also a good
>> candidate as it's not so heavy on the cache and CPU as RAID 5.
>
>Also, consider testing a mix, where your hardware RAID controller does
>the mirroring and the OS stripes (RAID 0) over the top of it.  I've
>gotten good performance from mediocre hardware cards doing this.  It has
>the advantage of still being able to use the battery-backed cache and
>its instant fsync, while not relying on cards that have issues
>layering one RAID level atop another.


Re: RAID stripe size question

От
Scott Marlowe
Дата:
Nope, haven't tried that.  At the time I was testing this I didn't even
think of trying it.  I'm not even sure I'd heard of RAID 50 at the
time... :)

I basically had an old MegaRAID 4xx series card in a dual PPro 200 and a
stack of six 9 GB hard drives.  Spare parts.  And even though the RAID
1+0 was relatively much faster on this hardware, a dual P IV 2800 with
a pair of 15K USCSI drives and a much later model MegaRAID ate it for
lunch with a single mirror set, and was plenty fast for our use at the
time, so I never really had call to test it in production.

But it definitely made our test server, the aforementioned PPro200
machine, more livable.

On Tue, 2006-07-18 at 14:43, Ron Peacetree wrote:
> Have you done any experiments implementing RAID 50 this way (HBA does RAID 5, OS does RAID 0)?  If so, what were the
results?
>
> Ron
>
> -----Original Message-----
> >From: Scott Marlowe <smarlowe@g2switchworks.com>
> >Sent: Jul 18, 2006 3:37 PM
> >To: Alex Turner <armtuk@gmail.com>
> >Cc: Luke Lonergan <llonergan@greenplum.com>, Mikael Carneholm <Mikael.Carneholm@wirelesscar.com>, Ron Peacetree
<rjpeace@earthlink.net>,pgsql-performance@postgresql.org 
> >Subject: Re: [PERFORM] RAID stripe size question
> >
> >On Tue, 2006-07-18 at 14:27, Alex Turner wrote:
> >> This is a great testament to the fact that very often software RAID
> >> will seriously outperform hardware RAID because the OS guys who
> >> implemented it took the time to do it right, as compared with some
> >> controller manufacturers who seem to think it's okay to provide
> >> sub-standard performance.
> >>
> >> Based on the bonnie++ numbers coming back from your array, I would
> >> also encourage you to evaluate software RAID, as you might see
> >> significantly better performance as a result.  RAID 10 is also a good
> >> candidate as it's not so heavy on the cache and CPU as RAID 5.
> >
> >Also, consider testing a mix, where your hardware RAID controller does
> >the mirroring and the OS stripes (RAID 0) over the top of it.  I've
> >gotten good performance from mediocre hardware cards doing this.  It has
> >the advantage of still being able to use the battery-backed cache and
> >its instant fsync, while not relying on cards that have issues
> >layering one RAID level atop another.
>

Re: RAID stripe size question

От
"Milen Kulev"
Дата:
According to http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it seems to be the other way around?
("As stripe size is decreased, files are broken into smaller and smaller pieces. This increases the number of drives
that an average file will use to hold all the blocks containing the data of that file,

->>>>theoretically increasing transfer performance, but decreasing positioning performance.")

Mikael,
In OLTP you really need the best possible latency.  If you decompose the response time of a physical request, you will
see that positioning performance plays the dominant role in the response time (ignore for a moment caches and their
effects).

So, if you need really good response times for your SQL queries, choose 15K RPM disks (and add as much cache as possible
to magnify the effect ;) )
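
To put a rough number on that decomposition, here is a small Python sketch of
the service time of a single 8 KB random read; the seek times, RPM figures and
media rate are generic assumptions, not the specs of any particular drive:

# Decompose the service time of one small random read into seek,
# rotational latency and transfer time.  Drive figures are generic
# assumptions for 10K vs 15K RPM disks, not measured values.
def service_time_ms(avg_seek_ms, rpm, transfer_kb, media_rate_mb_s):
    rotational_ms = (60000.0 / rpm) / 2.0   # half a revolution on average
    transfer_ms   = (transfer_kb / 1024.0) / media_rate_mb_s * 1000.0
    return avg_seek_ms, rotational_ms, transfer_ms

for label, seek_ms, rpm in (("10K RPM", 4.7, 10000), ("15K RPM", 3.5, 15000)):
    seek, rot, xfer = service_time_ms(seek_ms, rpm, transfer_kb=8, media_rate_mb_s=70.0)
    total = seek + rot + xfer
    print("%s: %.2f ms total (seek %.1f + rotation %.1f + transfer %.2f)"
          % (label, total, seek, rot, xfer))

Positioning (seek plus rotation) accounts for nearly all of the per-request
time; the 8 KB transfer itself is on the order of 0.1 ms.  With assumptions
like these, going from 10K to 15K RPM trims a couple of milliseconds off every
random I/O, which is exactly the latency win being described.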

Best Regards.
Milen


Re: RAID stripe size question

От
"Merlin Moncure"
Дата:
On 7/18/06, Alex Turner <armtuk@gmail.com> wrote:
> Remember when it comes to OLTP, massive serial throughput is not gonna help
> you, it's low seek times, which is why people still buy 15k RPM drives, and
> why you don't necessarily need a honking SAS/SATA controller which can
> harness the full 1066MB/sec of your PCI-X bus, or more for PCIe.  Of course,

hm. i'm starting to look seriously at SAS to take things to the next
level.  it's really not all that expensive, cheaper than scsi even,
and you can mix/match sata/sas drives in the better enclosures.  the
real wild card here is the raid controller.  i still think raptors are
the best bang for the buck and SAS gives me everything i like about
sata and scsi in one package.

moving a gigabyte around/sec on the server, attached or no, is pretty
heavy lifting on x86 hardware.

merlin

Re: RAID stripe size question

От
"Luke Lonergan"
Дата:
Merlin,

> moving a gigabyte around/sec on the server, attached or no,
> is pretty heavy lifting on x86 hardware.

Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID
and 36 disks and 1GB/s on a HW RAID with 16 disks, all SATA.

WRT seek performance, we're doing 2500 seeks per second on the
Sun/Thumper on 36 disks.  You might do better with 15K RPM disks and
great controllers, but I haven't seen it reported yet.

BTW - I'm curious about the HP P600 SAS host based RAID controller - it
has very good specs, but is the Linux driver solid?

- Luke


Re: RAID stripe size question

От
"Merlin Moncure"
Дата:
On 8/3/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Merlin,
>
> > moving a gigabyte around/sec on the server, attached or no,
> > is pretty heavy lifting on x86 hardware.

> Maybe so, but we're doing 2GB/s plus on Sun/Thumper with software RAID
> and 36 disks and 1GB/s on a HW RAID with 16 disks, all SATA.

that is pretty amazing, that works out to ~55 MB/sec/drive, close to
theoretical maximums. are you using a pci-e sata controller and raptors,
i'm guessing?  this is doubly impressive if we are talking raid 5 here.
do you find that software raid is generally better than hardware at
the high end?  how much does this tax the cpu?
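
As a quick sanity check on that per-drive figure, using just the aggregate
numbers quoted above:

# Per-drive rate implied by the quoted aggregate throughput figures.
sw_raid_mb_s, sw_disks = 2000.0, 36  # ~2 GB/s on the 36-disk software RAID
hw_raid_mb_s, hw_disks = 1000.0, 16  # ~1 GB/s on the 16-disk hardware RAID

print("software RAID: ~%.1f MB/s per drive" % (sw_raid_mb_s / sw_disks))  # ~55.6 MB/s
print("hardware RAID: ~%.1f MB/s per drive" % (hw_raid_mb_s / hw_disks))  # ~62.5 MB/s

Both work out to roughly the streaming rate of a single SATA drive of that
era, which is what makes the aggregate figures look so efficient.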

> WRT seek performance, we're doing 2500 seeks per second on the
> Sun/Thumper on 36 disks.  You might do better with 15K RPM disks and
> great controllers, but I haven't seen it reported yet.

thats pretty amazing too.  only a highly optimized raid system can
pull this off.

> BTW - I'm curious about the HP P600 SAS host based RAID controller - it
> has very good specs, but is the Linux driver solid?

have no clue.  i sure hope i don't go through the same headaches as
with ibm scsi drivers (rebranded adaptec btw).  sas looks really
promising however.  the adaptec sas gear is so cheap it might be worth
it to just buy some and see what it can do.

merlin

Re: RAID stripe size question

От
"Mikael Carneholm"
Дата:
> WRT seek performance, we're doing 2500 seeks per second on the
> Sun/Thumper on 36 disks.

Luke,

Have you had time to run benchmarksql against it yet? I'm just curious
about the IO seeks/s vs. transactions/minute correlation...

/Mikael