Discussion: rough benchmarks, sata vs. ssd


rough benchmarks, sata vs. ssd

From: CSS
Date:
Hello all,

Just wanted to share some results from some very basic benchmarking
runs comparing three disk configurations on the same hardware:

http://morefoo.com/bench.html

Before I launch into any questions about the results (I don't see
anything particularly shocking here), I'll describe the hardware and
configurations in use here.

Hardware:

*Tyan B7016 mainboard w/onboard LSI SAS controller
*2x4 core xeon E5506 (2.13GHz)
*64GB ECC RAM (8GBx8 ECC, 1033MHz)
*2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
*2x160GB Intel 320 SSD drives

Software:

*FreeBSD 8.2-STABLE snapshot from 6/2011 (includes ZFS v28; this is
our production snapshot)
*PostgreSQL 9.0.6 (also what we run in production)
*pgbench-tools 0.5 (to automate the test runs and make nice graphs)

I was mainly looking to compare three variations of drive
combinations and verify that we don't see any glaring performance
issues with Postgres running on ZFS.  We mostly run 1U boxes and
we're looking for ways to get better performance without having to
invest in some monster box that can hold a few dozen cheap SATA
drives.  The options we're weighing are SSDs alone, or SATA drives
with SSDs hosting the "ZIL" (ZFS Intent Log).
The ZIL is a bit of a cheat, as it allows you to throw all the
synchronous writes to the SSD - I was particularly curious about how
this would benchmark even though we will not likely use ZIL in
production (at least not on this db box).
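
To make the ZIL setup concrete, attaching a mirrored SSD log device to
an existing pool is roughly a one-liner - the pool and device names
below are placeholders, not our actual layout:

  # add a mirrored pair of SSDs as a separate log (ZIL) device
  zpool add tank log mirror ada4 ada5
  # confirm the log vdev shows up
  zpool status tank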

background thread: http://archives.postgresql.org/pgsql-performance/2011-10/msg00137.php

So the three sets of results I've linked are all pgbench-tools runs
of the "tpc-b" benchmark.  One using the two SATA drives in a ZFS
mirror, one with the same two drives in a ZFS mirror with two of the
Intel 320s as ZIL for that pool, and one with just two Intel 320s in
a ZFS mirror.  Note that I also included graphs in the pgbench
results of some basic system metrics.  That's from a few simple
scripts that collect some vmstat, iostat and "zpool iostat" info
during the runs at 1 sample/second.  They are a bit ugly, but give a
good enough visual representation of how swamped the drives are
during the course of the tests.
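
The collection scripts amount to little more than the sketch below,
killed off when a run finishes (log file names are illustrative):

  #!/bin/sh
  # sample system activity once per second for the duration of a test,
  # each tool appending to its own log for later graphing with gnuplot
  vmstat 1          > vmstat.log &
  iostat -x 1       > iostat.log &
  zpool iostat -v 1 > zpool-iostat.log &
  wait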

Why ZFS?  Well, we adopted it pretty early for other tasks and it
makes a number of tasks easy.  It's been stable for us for the most
part and our latest wave of boxes all use cheap SATA disks, which
gives us two things - a ton of cheap space (in 1U) for snapshots and
all the other space-consuming toys ZFS gives us, and on this cheaper
disk type, a guarantee that we're not dealing with silent data
corruption (these are probably the normal fanboy talking points).
ZFS snapshots are also a big time-saver when benchmarking.  For our
own application testing I load the data once, shut down postgres,
snapshot pgsql + the app homedir and start postgres.  After each run
that changes on-disk data, I simply rollback the snapshot.
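
In ZFS terms that reset between runs is just a snapshot and a rollback;
the dataset names here are placeholders for our actual layout:

  # one-time baseline, taken after loading the data and stopping postgres
  zfs snapshot tank/pgsql@loaded
  zfs snapshot tank/home/app@loaded

  # after any run that dirties the data: stop postgres, roll back, restart
  zfs rollback tank/pgsql@loaded
  zfs rollback tank/home/app@loaded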

I don't have any real questions for the list, but I'd love to get
some feedback, especially on the ZIL results.  The ZIL results
interest me because I have not settled on what sort of box we'll be
using as a replication slave for this one - I was going to either go
the somewhat risky route of another all-SSD box or see just how
cheap I can go with lots of 2.5" SAS drives in a 2U.

I'm hoping that the general "call for discussion" is an acceptable
request for this list, which seems to cater more often to very
specific tuning questions.  If not, let me know.

If you have any test requests that can be quickly run on the above
hardware, let me know.  I'll have the box easily accessible for the
next few days at least (and I wouldn't mind pushing more writes
through to two of my four ssds before deploying the whole mess in
case it is true that SSDs fail at the same write cycle count).  I'll
be doing more tests for my own curiosity such as making sure UFS2
doesn't wildly outperform ZFS on the SSD-only setup, testing with
the expected final config of 4 Intel 320s, and then lots of
application-specific tests, and finally digging a bit more
thoroughly into Greg's book to make sure I squeeze all I can out of
this thing.

Thanks,

Charles

Re: rough benchmarks, sata vs. ssd

From: Ivan Voras
Date:
On 31/01/2012 09:07, CSS wrote:
> Hello all,
>
> Just wanted to share some results from some very basic benchmarking
> runs comparing three disk configurations on the same hardware:
>
> http://morefoo.com/bench.html

That's great!

> *Tyan B7016 mainboard w/onboard LSI SAS controller
> *2x4 core xeon E5506 (2.13GHz)
> *64GB ECC RAM (8GBx8 ECC, 1033MHz)
> *2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
> *2x160GB Intel 320 SSD drives

It shows that you can have large cheap SATA drives and small fast SSDs,
and up to a point have the best of both worlds. Could you send me
(privately) a tgz of the results (i.e. the pages+images from the above
URL)? I'd like to host them somewhere more permanently.

> The ZIL is a bit of a cheat, as it allows you to throw all the
> synchronous writes to the SSD

This is one of the main reasons it was made. It's not a cheat, it's by
design.

> Why ZFS?  Well, we adopted it pretty early for other tasks and it
> makes a number of tasks easy.  It's been stable for us for the most
> part and our latest wave of boxes all use cheap SATA disks, which
> gives us two things - a ton of cheap space (in 1U) for snapshots and
> all the other space-consuming toys ZFS gives us, and on this cheaper
> disk type, a guarantee that we're not dealing with silent data
> corruption (these are probably the normal fanboy talking points).
> ZFS snapshots are also a big time-saver when benchmarking.  For our
> own application testing I load the data once, shut down postgres,
> snapshot pgsql + the app homedir and start postgres.  After each run
> that changes on-disk data, I simply rollback the snapshot.

Did you tune ZFS block size for the postgresql data directory (you'll
need to re-create the file system to do this)? When I investigated it in
the past, it really did help performance.

> I don't have any real questions for the list, but I'd love to get
> some feedback, especially on the ZIL results.  The ZIL results
> interest me because I have not settled on what sort of box we'll be
> using as a replication slave for this one - I was going to either go
> the somewhat risky route of another all-SSD box or looking at just
> how cheap I can go with lots of 2.5" SAS drives in a 2U.

You probably know the answer to that: if you need lots of storage,
you'll probably be better off using large SATA drives with small SSDs
for the ZIL.  160 GB is probably more than you need for ZIL.

One thing I never tried is mirroring a SATA drive and an SSD (only makes
sense if you don't trust SSDs to be reliable yet) - I don't know if ZFS
would recognize the asymmetry and direct most of the read requests to
the SSD.

> If you have any test requests that can be quickly run on the above
> hardware, let me know.

Blogbench (benchmarks/blogbench) results are always nice to see in a
comparison.
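
On FreeBSD that's just the benchmarks/blogbench port, something along
the lines of (the scratch directory being whatever file system you want
to exercise):

  cd /usr/ports/benchmarks/blogbench && make install clean
  blogbench -d /path/to/scratch/dir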

Re: rough benchmarks, sata vs. ssd

From: CSS
Date:
On Feb 3, 2012, at 6:23 AM, Ivan Voras wrote:

> On 31/01/2012 09:07, CSS wrote:
>> Hello all,
>>
>> Just wanted to share some results from some very basic benchmarking
>> runs comparing three disk configurations on the same hardware:
>>
>> http://morefoo.com/bench.html
>
> That's great!

Thanks.  I did spend a fair amount of time on it.  It was also a
good excuse to learn a little about gnuplot, which I used to draw
the (somewhat oddly combined) system stats.  I really wanted to see
IO and CPU info over the duration of a test even if I couldn't
really know what part of the test was running.  Don't ask me why
iostat sometimes shows greater than 100% in the "busy" column
though.  It is in the raw iostat output I used to create the graphs.

>
>> *Tyan B7016 mainboard w/onboard LSI SAS controller
>> *2x4 core xeon E5506 (2.13GHz)
>> *64GB ECC RAM (8GBx8 ECC, 1033MHz)
>> *2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
>> *2x160GB Intel 320 SSD drives
>
> It shows that you can have large cheap SATA drives and small fast SSDs,
> and up to a point have the best of both worlds. Could you send me
> (privately) a tgz of the results (i.e. the pages+images from the above
> URL)? I'd like to host them somewhere more permanently.

Sent offlist, including raw vmstat, iostat and zpool iostat output.

>
>> The ZIL is a bit of a cheat, as it allows you to throw all the
>> synchronous writes to the SSD
>
> This is one of the main reasons it was made. It's not a cheat, it's by design.

I meant that only in the best way.  Some of my proudest achievements
are cheats. :)

It's a clever way of moving cache to something non-volatile and
providing a fallback, although the fallback would be insanely slow
in comparison.

>
>> Why ZFS?  Well, we adopted it pretty early for other tasks and it
>> makes a number of tasks easy.  It's been stable for us for the most
>> part and our latest wave of boxes all use cheap SATA disks, which
>> gives us two things - a ton of cheap space (in 1U) for snapshots and
>> all the other space-consuming toys ZFS gives us, and on this cheaper
>> disk type, a guarantee that we're not dealing with silent data
>> corruption (these are probably the normal fanboy talking points).
>> ZFS snapshots are also a big time-saver when benchmarking.  For our
>> own application testing I load the data once, shut down postgres,
>> snapshot pgsql + the app homedir and start postgres.  After each run
>> that changes on-disk data, I simply rollback the snapshot.
>
> Did you tune ZFS block size for the postgresql data directory (you'll
> need to re-create the file system to do this)? When I investigated it
> in the past, it really did help performance.

I actually did not.  A year or so ago I was doing some basic tests
on cheap SATA drives with ZFS and at least with pgbench, I could see
no difference at all.  I actually still have some of that info, so
I'll include it here.  This was a 4-core Xeon E5506 (2.1GHz), 4 1TB
WD RE3 drives in a RAIDZ1 array, 8GB RAM.

I tested three things - time to load an 8.5GB dump of one of our
dbs, time to run through a querylog of real data (1.4M queries), and
then pgbench with a scaling factor of 100, 20 clients, 10K
transactions per client.
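
For reference, that was essentially the stock pgbench invocation - the
database name below is just a placeholder:

  # initialize at scaling factor 100, then run 20 clients x 10K transactions each
  pgbench -i -s 100 bench
  pgbench -c 20 -t 10000 bench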

default 128K zfs recordsize:

-9 minutes to load data
-17 minutes to run query log
-pgbench output

transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 20
number of transactions per client: 10000
number of transactions actually processed: 200000/200000
tps = 100.884540 (including connections establishing)
tps = 100.887593 (excluding connections establishing)

8K zfs recordsize (wipe data dir and reinit db)

-10 minutes to load data
-21 minutes to run query log
-pgbench output

transaction type: TPC-B (sort of)
scaling factor: 100
query mode: simple
number of clients: 20
number of transactions per client: 10000
number of transactions actually processed: 200000/200000
tps = 97.896038 (including connections establishing)
tps = 97.898279 (excluding connections establishing)

Just thought I'd include that since I have the data.

>
>> I don't have any real questions for the list, but I'd love to get
>> some feedback, especially on the ZIL results.  The ZIL results
>> interest me because I have not settled on what sort of box we'll be
>> using as a replication slave for this one - I was going to either go
>> the somewhat risky route of another all-SSD box or looking at just
>> how cheap I can go with lots of 2.5" SAS drives in a 2U.
>
> You probably know the answer to that: if you need lots of storage,
> you'll probably be better off using large SATA drives with small SSDs
> for the ZIL.  160 GB is probably more than you need for ZIL.
>
> One thing I never tried is mirroring a SATA drive and an SSD (only makes
> sense if you don't trust SSDs to be reliable yet) - I don't know if ZFS
> would recognize the asymmetry and direct most of the read requests to
> the SSD.

Our databases are pretty tiny.  We could squeeze them on a pair of 160GB mirrored SSDs.

To be honest, the ZIL results really threw me for a loop.  I had
supposed that it would work well with bursty usage but that eventually
the SATA drives would still be a choke point during heavy sustained sync
writes, since the difference in random sync write performance between
the ZIL drives (SSD) and the actual data drives (SATA) was so huge.  The
benchmarks ran for quite some time and I am not spotting a point in the
system graphs where the SATA gets truly saturated to the point that
performance suffers.

I now have to think about whether a safe replication slave/backup could
be built in 1U with 4 2.5" SAS drives and a small mirrored pair of SSDs
for ZIL.  We've been trying to avoid building monster boxes - not only
are 2.5" SAS drives expensive, but so is whatever case you find to hold
a dozen or so of them.  Outside of some old Sun blog posts, I am finding
little evidence of people running PostgreSQL on ZFS with SATA drives
augmented with SSD ZIL.  I'd love to hear more feedback on that.

>
>> If you have any test requests that can be quickly run on the above
>> hardware, let me know.
>
> Blogbench (benchmarks/blogbench) results are always nice to see in a comparison.

I don't know much about it, but here's what I get on the zfs mirrored SSD pair:

[root@bltest1 /usr/ports/benchmarks/blogbench]# blogbench -d /tmp/bbench

Frequency = 10 secs
Scratch dir = [/tmp/bbench]
Spawning 3 writers...
Spawning 1 rewriters...
Spawning 5 commenters...
Spawning 100 readers...
Benchmarking for 30 iterations.
The test will run during 5 minutes.
[…]

Final score for writes:           182
Final score for reads :        316840

Thanks,

Charles



Re: rough benchmarks, sata vs. ssd

From: CSS
Date:
For the top-post scanners, I updated the ssd test to include
changing the zfs recordsize to 8k.

On Feb 11, 2012, at 1:35 AM, CSS wrote:

>
> On Feb 3, 2012, at 6:23 AM, Ivan Voras wrote:
>
>> On 31/01/2012 09:07, CSS wrote:
>>> Hello all,
>>>
>>> Just wanted to share some results from some very basic benchmarking
>>> runs comparing three disk configurations on the same hardware:
>>>
>>> http://morefoo.com/bench.html
>>
>> That's great!
>
> Thanks.  I did spend a fair amount of time on it.  It was also a
> good excuse to learn a little about gnuplot, which I used to draw
> the (somewhat oddly combined) system stats.  I really wanted to see
> IO and CPU info over the duration of a test even if I couldn't
> really know what part of the test was running.  Don't ask me why
> iostat sometimes shows greater than 100% in the "busy" column
> though.  It is in the raw iostat output I used to create the graphs.
>
>>
>>> *Tyan B7016 mainboard w/onboard LSI SAS controller
>>> *2x4 core xeon E5506 (2.13GHz)
>>> *64GB ECC RAM (8GBx8 ECC, 1033MHz)
>>> *2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
>>> *2x160GB Intel 320 SSD drives
>>
>> It shows that you can have large cheap SATA drives and small fast SSDs,
>> and up to a point have the best of both worlds. Could you send me
>> (privately) a tgz of the results (i.e. the pages+images from the above
>> URL)? I'd like to host them somewhere more permanently.
>
> Sent offlist, including raw vmstat, iostat and zpool iostat output.
>
>>
>>> The ZIL is a bit of a cheat, as it allows you to throw all the
>>> synchronous writes to the SSD
>>
>> This is one of the main reasons it was made. It's not a cheat, it's by design.
>
> I meant that only in the best way.  Some of my proudest achievements
> are cheats. :)
>
> It's a clever way of moving cache to something non-volatile and
> providing a fallback, although the fallback would be insanely slow
> in comparison.
>
>>
>>> Why ZFS?  Well, we adopted it pretty early for other tasks and it
>>> makes a number of tasks easy.  It's been stable for us for the most
>>> part and our latest wave of boxes all use cheap SATA disks, which
>>> gives us two things - a ton of cheap space (in 1U) for snapshots and
>>> all the other space-consuming toys ZFS gives us, and on this cheaper
>>> disk type, a guarantee that we're not dealing with silent data
>>> corruption (these are probably the normal fanboy talking points).
>>> ZFS snapshots are also a big time-saver when benchmarking.  For our
>>> own application testing I load the data once, shut down postgres,
>>> snapshot pgsql + the app homedir and start postgres.  After each run
>>> that changes on-disk data, I simply rollback the snapshot.
>>
>> Did you tune ZFS block size for the postgresql data directory (you'll
>> need to re-create the file system to do this)? When I investigated it
>> in the past, it really did help performance.
>

Well now I did, added the results to
http://ns.morefoo.com/bench.html and it looks like there's
certainly an improvement.  That's with the only change from the
previous test being to copy the postgres data dir, wipe the
original, set the zfs recordsize to 8K (default is 128K), and then
copy the data dir back.
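
In concrete terms that was roughly the following, run with postgres
stopped (paths and dataset name are placeholders):

  # recordsize only applies to newly written files, so rewrite the data dir
  cp -Rp /tank/pgsql/data /tank/data-copy
  rm -rf /tank/pgsql/data
  zfs set recordsize=8k tank/pgsql
  cp -Rp /tank/data-copy /tank/pgsql/data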

Things that stand out on first glance:

-at a scaling factor of 10 or greater, there is a much more gentle
 decline in TPS than with the default zfs recordsize
-on the raw *disk* IOPS graph, I now see writes peaking at around
 11K/second compared to 1.5K/second.
-on the zpool iostat graph, I do not see those huge write peaks,
 which is a bit confusing
-on both iostat graphs, I see the datapoints look more scattered
 with the 8K recordsize

Any comments are certainly welcome.  I understand 8K recordsize
should perform better since that's the size of the chunks of data
postgresql is dealing with, but the effects on the system graphs
are interesting and I'm not quite following how it all relates.

I wonder if the recordsize impacts the ssd write amplification at
all...

Thanks,

Charles



Re: rough benchmarks, sata vs. ssd

From: Ivan Voras
Date:
On 13 February 2012 22:49, CSS <css@morefoo.com> wrote:
> For the top-post scanners, I updated the ssd test to include
> changing the zfs recordsize to 8k.

> Well now I did, added the results to
> http://ns.morefoo.com/bench.html and it looks like there's
> certainly an improvement.  That's with the only change from the
> previous test being to copy the postgres data dir, wipe the
> original, set the zfs recordsize to 8K (default is 128K), and then
> copy the data dir back.

This makes sense simply because it reduces the amount of data read
and/or written for non-sequential transactions.

> Things that stand out on first glance:
>
> -at a scaling factor of 10 or greater, there is a much more gentle
>  decline in TPS than with the default zfs recordsize
> -on the raw *disk* IOPS graph, I now see writes peaking at around
>  11K/second compared to 1.5K/second.
> -on the zpool iostat graph, I do not see those huge write peaks,
>  which is a bit confusing

Could be that "iostat" and "zpool iostat" average raw data differently.

> -on both iostat graphs, I see the datapoints look more scattered
>  with the 8K recordsize

As an educated guess, it could be that smaller transaction sizes can
"fit in" (in buffers or controller processing paths) where larger ones
didn't, allowing more bursts of performance.