Discussion: Choosing a filesystem

Choosing a filesystem

From
Laszlo Nagy
Date:
I'm about to buy a new server. It will be a Xeon system with two
processors (4 cores per processor) and  16GB RAM. Two RAID extenders
will be attached to an Intel s5000 series motherboard, providing 12
SAS/Serial ATA connectors.

The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server,
dovecot IMAP server and background programs for database maintenance. On
our current system, I/O performance for PostgreSQL is the biggest
problem, but sometimes all CPUs are at 100%. Number of users using this
system:

PostgreSQL:  30 connections
Apache: 30 connections
IMAP server: 15 connections

The databases are mostly OLTP, but the background programs are creating
historical data and statistical data continuously, and sometimes web site
visitors/search engine robots run searches in bigger tables (with
3 million+ records).

There is an expert at the company who sells the server, and he
recommended that I use SAS disks for the base system at least. I would
like to use many SAS disks, but they are just too expensive. So the
basic system will reside on a RAID 1 array, created from two SAS disks
spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda 320GB
SATA (ES 7200) disks for the rest.

The expert told me to use RAID 5 but I'm hesitating. I think that RAID
1+0 would be much faster, and I/O performance is what I really need.

I would like to put the WAL file on the SAS disks to improve
performance, and create one big RAID 1+0 disk for the data directory.
But maybe I'm completely wrong. Can you please advise how to create
logical partitions? The hardware is capable of handling different types
of RAID volumes on the same set of disks. For example, a smaller RAID 0
for indexes and a bigger RAID 5 etc.

If you need more information about the database, please ask. :-)

Thank you very much,

   Laszlo


Re: Choosing a filesystem

From
Andrew Sullivan
Date:
On Thu, Sep 11, 2008 at 06:29:36PM +0200, Laszlo Nagy wrote:

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0
> would be much faster, and I/O performance is what I really need.

I think you're right.  I think it's a big mistake to use RAID 5 in a
database server where you're hoping for reasonable write performance.
In theory RAID 5 ought to be fast for reads, but I've never seen it
work that way.
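
A rough back-of-the-envelope sketch of the write-performance argument. It assumes the textbook write penalties (4 disk I/Os per small random write on RAID 5, 2 on RAID 1+0) and a guessed 80 random IOPS per 7200 rpm SATA disk. Neither figure comes from this thread, and a controller write cache changes the picture considerably.

```python
# Textbook write penalties: disk I/Os needed per small random write.
# RAID 5:   read data + read parity + write data + write parity = 4
# RAID 1+0: write both halves of one mirror                     = 2
WRITE_PENALTY = {"raid5": 4, "raid10": 2}


def random_write_iops(level: str, disks: int, iops_per_disk: float) -> float:
    """Estimate small random write IOPS for an array, ignoring caches."""
    return disks * iops_per_disk / WRITE_PENALTY[level]


def usable_gb(level: str, disks: int, disk_gb: float) -> float:
    """Usable capacity: RAID 5 loses one disk to parity, RAID 1+0 loses half."""
    if level == "raid5":
        return (disks - 1) * disk_gb
    if level == "raid10":
        return disks / 2 * disk_gb
    raise ValueError(f"unknown RAID level: {level}")


if __name__ == "__main__":
    # 10 x 320GB SATA disks as in the original post; 80 IOPS/disk is assumed.
    for level in ("raid5", "raid10"):
        print(level,
              random_write_iops(level, 10, 80.0),
              usable_gb(level, 10, 320.0))
```

Under these assumptions RAID 1+0 gives up capacity in exchange for roughly twice the small-write throughput, which is the trade-off being argued for here.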

> I would like to put the WAL file on the SAS disks to improve performance,
> and create one big RAID 1+0 disk for the data directory. But maybe I'm
> completely wrong. Can you please advise how to create logical partitions?

I would listen to yourself before you listen to the expert.  You sound
right to me :)

A


--
Andrew Sullivan
ajs@commandprompt.com
+1 503 667 4564 x104
http://www.commandprompt.com/

Re: Choosing a filesystem

From
Matthew Wakeling
Date:
On Thu, 11 Sep 2008, Laszlo Nagy wrote:
> So the basic system will reside on a RAID 1 array, created from two SAS
> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
> 320GB SATA (ES 7200) disks for the rest.

That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll
be pretty good. Put the OS and the WAL on the pair, and the database on
the large array.
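
One common way to put the WAL on a different array in PostgreSQL 8.x is to move the pg_xlog directory and leave a symlink behind. The sketch below illustrates that idea only: the function name and the example paths are hypothetical, the server must be stopped first, and the moved directory must still be owned by the postgres user.

```python
import os
import shutil


def relocate_wal(pgdata: str, new_parent: str, wal_dir: str = "pg_xlog") -> str:
    """Move the WAL directory out of the data directory and leave a symlink
    behind. The server must be stopped before calling this."""
    old_path = os.path.join(pgdata, wal_dir)
    new_path = os.path.join(new_parent, wal_dir)
    if os.path.islink(old_path):
        raise RuntimeError(f"{old_path} is already a symlink")
    shutil.move(old_path, new_path)   # move the WAL segments to the other array
    os.symlink(new_path, old_path)    # the server keeps using the old path
    return new_path


# Hypothetical usage (paths are placeholders, not taken from the thread):
# relocate_wal("/usr/local/pgsql/data", "/mnt/raid1")
```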

However, one of the biggest things that will improve your performance
(especially in OLTP) is to use a proper RAID controller with a
battery-backed-up cache.

Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths
Tripos...                               -- Computer Science Lecturer

Re: Choosing a filesystem

From
Kenneth Marshall
Date:
On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>> So the basic system will reside on a RAID 1 array, created from two SAS
>> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
>> 320GB SATA (ES 7200) disks for the rest.
>
> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll
> be pretty good. Put the OS and the WAL on the pair, and the database on the
> large array.
>
> However, one of the biggest things that will improve your performance
> (especially in OLTP) is to use a proper RAID controller with a
> battery-backed-up cache.
>
> Matthew
>

But remember that putting the WAL on a separate drive(set) will only
help if you do not have competing I/O, such as system logging or paging,
going to the same drives. Competing I/O turns your fast sequential WAL
writes into random I/O, with the accompanying 10x or more performance
decrease.

Ken

Re: Choosing a filesystem

From
"Kevin Grittner"
Date:
>>> Kenneth Marshall <ktm@rice.edu> wrote:
> On Thu, Sep 11, 2008 at 06:18:37PM +0100, Matthew Wakeling wrote:
>> On Thu, 11 Sep 2008, Laszlo Nagy wrote:
>>> So the basic system will reside on a RAID 1 array, created from two SAS
>>> disks spinning at 15 000 rpm. I will buy 10 pieces of Seagate Barracuda
>>> 320GB SATA (ES 7200) disks for the rest.
>>
>> That sounds good. Put RAID 1 on the pair, and RAID 1+0 on the rest. It'll
>> be pretty good. Put the OS and the WAL on the pair, and the database on the
>> large array.
>>
>> However, one of the biggest things that will improve your performance
>> (especially in OLTP) is to use a proper RAID controller with a
>> battery-backed-up cache.
>
> But remember that putting the WAL on a separate drive(set) will only
> help if you do not have competing I/O, such as system logging or paging,
> going to the same drives. This turns your fast sequential I/O into
> random I/O with the accompanying 10x or more performance decrease.

Unless you have a good RAID controller with battery-backed-up cache.

-Kevin

Re: Choosing a filesystem

From
Laszlo Nagy
Date:
>> going to the same drives. This turns your fast sequential I/O into
>> random I/O with the accompanying 10x or more performance decrease.
>>
>
> Unless you have a good RAID controller with battery-backed-up cache.
>
All right. :-) This is what I'll have:

Boxed Intel Server Board S5000PSLROMB with 8-port SAS ROMB card
(Supports 45nm processors (Harpertown and Wolfdale-DP))
Intel® RAID Activation key AXXRAK18E enables full intelligent SAS RAID
on S5000PAL, S5000PSL, SR4850HW4/M, SR6850HW4/M. RoHS Compliant.
512 MB 400MHz DDR2 ECC Registered CL3 DIMM Single Rank, x8 (for
s5000pslromb)
6-drive SAS/SATA backplane with expander (requires 2 SAS ports) for
SC5400 and SC5299 (two pieces)
5410 Xeon 2.33 GHz/1333 FSB/12MB Boxed, Passive cooling / 80W (2 pieces)
2048 MB 667MHz DDR2 ECC Fully Buffered CL5 DIMM Dual Rank, x8 (8 pieces)

SAS disks will be:  146.8 GB, SAS 3G, 15000RPM, 16 MB cache (two pieces)
SATA disks will be: HDD Server SEAGATE Barracuda ES 7200.1
(320GB, 16MB, SATA II-300) (10 pieces)

I cannot spend more money on this computer, but since you are all
talking about battery back up, I'll try to get money from the management
and buy this:

Intel® RAID Smart Battery AXXRSBBU3, optional battery back up for use
with AXXRAK18E and SRCSAS144E.  RoHS Compliant.


This server will also be an IMAP server, web server etc. so I'm 100%
sure that the SAS disks will be used for logging. I have two spare 200GB
SATA disks here in the office but they are cheap ones designed for
desktop computers. Is it okay to dedicate these disks for the WAL file
in RAID1? Will it improve performance? How much trouble would it cause
if the WAL file goes wrong? Should I just put the WAL file on the RAID
1+0 array?

Thanks,

  Laszlo


Re: Choosing a filesystem

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 10:29 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> I'm about to buy a new server. It will be a Xeon system with two processors
> (4 cores per processor) and  16GB RAM. Two RAID extenders will be attached
> to an Intel s5000 series motherboard, providing 12 SAS/Serial ATA
> connectors.
>
> The server will run FreeBSD 7.0, PostgreSQL 8, apache, PHP, mail server,
> dovecot IMAP server and background programs for database maintenance. On our
> current system, I/O performance for PostgreSQL is the biggest problem, but
> sometimes all CPUs are at 100%. Number of users using this system:

100% what? sys? user? iowait? If it's still iowait, then the newer,
bigger, faster RAID should really help.

> PostgreSQL:  30 connections
> Apache: 30 connections
> IMAP server: 15 connections
>
> The databases are mostly OLTP, but the background programs are creating
> historical data and statistical data continuously, and sometimes web site
> visitors/search engine robots run searches in bigger tables (with 3 million+
> records).

This might be a good application for a setup where you replicate with
Slony to another server, then run your I/O-intensive processes against
the slave.

> There is an expert at the company who sells the server, and he recommended
> that I use SAS disks for the base system at least. I would like to use many
> SAS disks, but they are just too expensive. So the basic system will reside
> on a RAID 1 array, created from two SAS disks spinning at 15 000 rpm. I will
> buy 10 pieces of Seagate Barracuda 320GB SATA (ES 7200) disks for the rest.

SAS = a bit faster, and better at parallel work.  However, short
stroking 7200 RPM SATA drives on the fastest parts of the platters can
get you close to SAS territory for a fraction of the cost, plus you
can then store backups etc on the rest of the drives at night.

So, you're gonna put the OS on RAID1, and pgsql on the rest...  Makes
sense.  Consider setting up another RAID1 for the pg_clog directory.

> The expert told me to use RAID 5 but I'm hesitating. I think that RAID 1+0
> would be much faster, and I/O performance is what I really need.

The expert is most certainly wrong for an OLTP database.  If your RAID
controller can't run RAID-10 quickly compared to RAID-5 then it's a
crap card, and you need a better one.  Or put it into JBOD mode and let
the OS handle the RAID-10 work.  Or split the work: RAID-1 sets on the
controller, RAID-0 across them in the OS.

> I would like to put the WAL file on the SAS disks to improve performance,

Actually, the WAL doesn't really need SAS for good performance.
Except for the 15K.6 Seagate Cheetahs, most decent SATA drives are
within a few percent of SAS drives for sequential write / read
speed, which is what the WAL basically does.

> and create one big RAID 1+0 disk for the data directory. But maybe I'm
> completely wrong. Can you please advise how to create logical partitions?
> The hardware is capable of handling different types of RAID volumes on the
> same set of disks. For example, a smaller RAID 0 for indexes and a bigger
> RAID 5 etc.

Avoid RAID-5 on OLTP.

Now, if you have a slony slave for the aggregate work stuff, and
you're doing big reads and writes, RAID-5 on a large SATA set may be a
good and cost effective solution.

Re: Choosing a filesystem

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 11:47 AM, Laszlo Nagy <gandalf@shopzeus.com> wrote:
> I cannot spend more money on this computer, but since you are all talking
> about battery back up, I'll try to get money from the management and buy
> this:
>
> Intel(R) RAID Smart Battery AXXRSBBU3, optional battery back up for use with
> AXXRAK18E and SRCSAS144E.  RoHS Compliant.

Sacrifice a couple of SAS drives to get that.

I'd rather have all SATA drives and a BBU than SAS without one.

Re: Choosing a filesystem

From
Craig James
Date:
Laszlo Nagy wrote:
> I cannot spend more money on this computer, but since you are all
> talking about battery back up, I'll try to get money from the management
> and buy this:
>
> Intel® RAID Smart Battery AXXRSBBU3, optional battery back up for use
> with AXXRAK18E and SRCSAS144E.  RoHS Compliant.

The battery backup is really important.  You'd be better off dropping down
to 8 disks in a RAID 1+0 and putting everything on it, if that meant you
could use the savings to get the battery-backed RAID controller.  The
performance improvement of a BB cache is amazing.

Based on advice from this group, we configured our systems with a single
8-disk RAID 1+0 with a battery-backed cache.  It holds the OS, WAL and
database, and it is VERY fast.  We're very happy with it.

Craig

Re: Choosing a filesystem

From
Greg Smith
Date:
On Thu, 11 Sep 2008, Laszlo Nagy wrote:

> The expert told me to use RAID 5 but I'm hesitating.

Your "expert" isn't--at least when it comes to database performance.
Trust yourself here, you've got the right general idea.

But I can't make any sense out of exactly how your disks are going to be
connected to the server with that collection of hardware.  What I can tell
is that you're approaching that part backwards, probably under the
influence of the vendor you're dealing with, and since they don't
understand what you're doing you're stuck sorting that out.

If you want your database to perform well on writes, the first thing you
do is select a disk controller that performs well, has a well-known stable
driver for your OS, has a reasonably large cache (>=256MB), and has a
battery backup on it.  I don't know anything about how well this Intel
RAID performs under FreeBSD, but you should check that if you haven't
already.  From the little bit I read about it I'm concerned if it's fast
enough for as many drives as you're using.  The wrong disk controller will
make a slow mess out of any hardware you throw at it.

Then, you connect as many drives to the caching controller as you can for
the database.  OS drives can connect to another controller (like the ports
on the motherboard), but you shouldn't use them for either the database
data or the WAL.  That's what I can't tell from your outline of the server
configuration; if it presumes a couple of the SATA disks holding database
data are using the motherboard ports, you need to stop there and get a
larger battery backed caching controller.

If you're on a limited budget and the choice is between more SATA disks or
fewer SAS disks, get more of the SATA ones.  Once you've filled the
available disk slots on the controller or run out of room in the chassis,
if there's money leftover then you can go back and analyze whether
replacing some of those with SAS disks makes sense--as long as they're
still connected to a caching controller.  I don't know what flexibility
the "SAS/SATA backplane" you listed has here.

You've got enough disks that it may be worthwhile to set aside two of them
to be dedicated WAL volumes.  That call depends on the balance of OLTP
writes (which are more likely to take advantage of that) versus the
reports scans (which would prefer more disks in the database array), and
the only way you'll know for sure is to benchmark both configurations with
something resembling your application.  Since you should always do stress
testing on any new hardware anyway before it goes into production, that's
a good time to run comparisons like that.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Choosing a filesystem

From
Matthew Wakeling
Date:
On Thu, 11 Sep 2008, Greg Smith wrote:
> If you want your database to perform well on writes, the first thing you do
> is select a disk controller that performs well, has a well-known stable
> driver for your OS, has a reasonably large cache (>=256MB), and has a battery
> backup on it.

Greg, it might be worth you listing a few good RAID controllers. It's
almost an FAQ. From what I'm hearing, this Intel one doesn't sound like it
would be on the list.

Matthew

--
Riker: Our memory pathways have become accustomed to your sensory input.
Data:  I understand - I'm fond of you too, Commander. And you too Counsellor

Re: Choosing a filesystem

From
Guillaume Cottenceau
Date:
Craig James <craig_james 'at' emolecules.com> writes:

> The performance improvement of a BB cache is amazing.

Could some of you share the insight on why this is the case? I
cannot find much information on it on wikipedia, for example.
Even http://linuxfinances.info/info/diskusage.html doesn't
explain *why*.

Out of the blue, is it just because when postgresql fsync's after
a write, on a normal system the write really has to happen on
disk and postgresql has to wait for it to complete, whereas with a
BBU cache the fsync is almost immediate because the write cache
actually replaces the "really on disk" write?

--
Guillaume Cottenceau, MNC Mobile News Channel SA, an Alcatel-Lucent Company
Av. de la Gare 10, 1003 Lausanne, Switzerland - direct +41 21 317 50 36

Re: Choosing a filesystem

From
Greg Smith
Date:
On Fri, 12 Sep 2008, Matthew Wakeling wrote:

> Greg, it might be worth you listing a few good RAID controllers. It's almost
> an FAQ.

I started doing that at the end of
http://wiki.postgresql.org/wiki/SCSI_vs._IDE/SATA_Disks , that still needs
some work.  What I do periodically is sweep through old messages here that
have useful FAQ text and dump them into the appropriate part of
http://wiki.postgresql.org/wiki/Performance_Optimization

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Choosing a filesystem

From
Greg Smith
Date:
On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:

> Out of the blue, is it just because when postgresql fsync's after a
> write, on a normal system the write has to really happen on disk and
> waiting for it to be complete, whereas with BBU cache the fsync is
> almost immediate because the write cache actually replaces the "really
> on disk" write?

That's the main thing, and nothing else you can do will accelerate that.
Without a useful write cache (which usually means RAM with a BBU), you'll
at best get about 100-200 write transactions per second for any one
client, and something like 500/second even with lots of clients (queued up
transaction fsyncs do get combined).  Those numbers increase to several
thousand per second the minute there's a good caching controller in the
mix.
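
The 100-200 figure lines up with a simple rotational bound. The sketch below assumes that, without a write cache, each commit from a single client has to wait about one full platter rotation before its WAL write can reach the disk. That is a simplification added here for illustration, not a measurement from this thread.

```python
def max_commits_per_second(rpm: int) -> float:
    """Upper bound on single-client commits/s when every commit waits for
    one platter rotation (no write cache, one fsync per commit)."""
    return rpm / 60.0


if __name__ == "__main__":
    for rpm in (7200, 10000, 15000):
        print(f"{rpm} rpm: about {max_commits_per_second(rpm):.0f} commits/s")
```

Under that assumption a 7200 rpm disk tops out near 120 commits per second for one client and a 15000 rpm disk near 250, which brackets the range quoted above.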

You might say "but I don't write that heavily, so what?"  Even if the
write volume is low enough that the disk can keep up, there's still
latency.  A person who is writing transactions is going to be delayed a
few milliseconds after each commit, which drags some types of data loading
to a crawl.  Also, without a cache in place, mixes of fsync'd writes and
reads can behave badly, with readers getting stuck behind writers much
more often than in the cached situation.
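
A quick way to see this per-commit latency on your own hardware is to time a loop of small writes, each followed by fsync. This is a minimal sketch and not a replacement for a real benchmark: the function name, block size and write count are arbitrary choices made here, and drives with a volatile write cache enabled may report misleadingly high rates.

```python
import os
import tempfile
import time


def fsync_rate(directory: str, writes: int = 200, size: int = 8192) -> float:
    """Append `writes` blocks to a scratch file, fsync after each one,
    and return the observed fsyncs per second."""
    block = b"\0" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, block)
            os.fsync(fd)  # wait until the OS reports the data as durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Point this at a directory on the volume you want to test.
    print(f"{fsync_rate(tempfile.gettempdir()):.0f} fsyncs/second")
```

Running it against a directory on the array with and without the controller's write-back cache enabled should show the kind of gap described above.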

The final factor is that additional layers of cache usually help improve
physical grouping of blocks into ordered sections to lower seek overhead.
The OS is supposed to be doing that for you, but a cache closer to the
drives themselves helps smooth things out when the OS dumps a large block
of data out for some reason.  The classic example in PostgreSQL land,
particularly before 8.3, was when a checkpoint happens.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Choosing a filesystem

From
"Merlin Moncure"
Date:
On Fri, Sep 12, 2008 at 5:11 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:
>
> That's the main thing, and nothing else you can do will accelerate that.
> Without a useful write cache (which usually means RAM with a BBU), you'll at
> best get about 100-200 write transactions per second for any one client, and
> something like 500/second even with lots of clients (queued up transaction
> fsyncs do get combined).  Those numbers increase to several thousand per
> second the minute there's a good caching controller in the mix.

While this is correct, if heavy writing is sustained, especially on
large databases, you will eventually outrun the write cache on the
controller and things will start to degrade towards the slow case.  So
it's fairer to say that caching raid controllers burst up to several
thousand per second, with a sustained write rate somewhat better than
write-through but much worse than the burst rate.
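
To put a rough number on "eventually", here is a toy model of a write-back cache filling up. It assumes steady incoming and drain rates, which real workloads and controllers do not have, and the rates in the example are made up for illustration.

```python
def seconds_until_cache_full(cache_mb: float, incoming_mb_s: float,
                             drain_mb_s: float) -> float:
    """Time before a write-back cache fills, given a steady incoming write
    rate and the rate at which the disks can drain the cache."""
    if incoming_mb_s <= drain_mb_s:
        return float("inf")  # the disks keep up, so the cache never fills
    return cache_mb / (incoming_mb_s - drain_mb_s)


if __name__ == "__main__":
    # 512 MB cache; 40 MB/s of random writes arriving, 8 MB/s draining
    # to disk (both rates are assumptions).
    print(seconds_until_cache_full(512, 40, 8), "seconds of burst")
```

With those made-up rates the burst lasts 16 seconds, after which throughput falls back to whatever the disks themselves can sustain.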

How fast things degrade from the burst rate depends on certain
factors...how big the database is relative to the o/s read cache and
the controller write cache, and how random the i/o is generally.  One
thing raid controllers are great at is smoothing bursty i/o during
checkpoints for example.

Unfortunately when you outrun cache on raid controllers the behavior
is not always very pleasant...in at least one case I've experienced
(perc 5/i) when the cache fills up the card decides to clear it before
continuing.  This means that if fsync is on, you get unpredictable
random freezing pauses while the cache is clearing.

merlin

Re: Choosing a filesystem

From
david@lang.hm
Date:
On Fri, 12 Sep 2008, Merlin Moncure wrote:

> On Fri, Sep 12, 2008 at 5:11 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>> On Fri, 12 Sep 2008, Guillaume Cottenceau wrote:
>>
>> That's the main thing, and nothing else you can do will accelerate that.
>> Without a useful write cache (which usually means RAM with a BBU), you'll at
>> best get about 100-200 write transactions per second for any one client, and
>> something like 500/second even with lots of clients (queued up transaction
>> fsyncs do get combined).  Those numbers increase to several thousand per
>> second the minute there's a good caching controller in the mix.
>
> While this is correct, if heavy writing is sustained, especially on
> large databases, you will eventually outrun the write cache on the
> controller and things will start to degrade towards the slow case.  So
> it's fairer to say that caching raid controllers burst up to several
> thousand per second, with a sustained write rate somewhat better than
> write-through but much worse than the burst rate.
>
> How fast things degrade from the burst rate depends on certain
> factors...how big the database is relative to the o/s read cache in
> the controller write cache, and how random the i/o is generally.  One
> thing raid controllers are great at is smoothing bursty i/o during
> checkpoints for example.
>
> Unfortunately when you outrun cache on raid controllers the behavior
> is not always very pleasant...in at least one case I've experienced
> (perc 5/i) when the cache fills up the card decides to clear it before
> continuing.  This means that if fsync is on, you get unpredictable
> random freezing pauses while the cache is clearing.

Although for postgres the thing that you are doing the fsync on is the WAL
log file. That is a single (usually) contiguous file. As such it is very
efficient to write large chunks of it. So while you will degrade from the
battery-only mode, the fact that the controller can flush many requests'
worth of writes out to the WAL log at once while you fill the cache with
them one at a time is still a significant win.

David Lang

Re: Choosing a filesystem

From
"Merlin Moncure"
Date:
On Sat, Sep 13, 2008 at 5:26 PM,  <david@lang.hm> wrote:
> On Fri, 12 Sep 2008, Merlin Moncure wrote:
>>
>> While this is correct, if heavy writing is sustained, especially on
>> large databases, you will eventually outrun the write cache on the
>> controller and things will start to degrade towards the slow case.  So
>> it's fairer to say that caching raid controllers burst up to several
>> thousand per second, with a sustained write rate somewhat better than
>> write-through but much worse than the burst rate.
>>
>> How fast things degrade from the burst rate depends on certain
>> factors...how big the database is relative to the o/s read cache in
>> the controller write cache, and how random the i/o is generally.  One
>> thing raid controllers are great at is smoothing bursty i/o during
>> checkpoints for example.
>>
>> Unfortunately when you outrun cache on raid controllers the behavior
>> is not always very pleasant...in at least one case I've experienced
>> (perc 5/i) when the cache fills up the card decides to clear it before
>> continuing.  This means that if fsync is on, you get unpredictable
>> random freezing pauses while the cache is clearing.
>
> although for postgres the thing that you are doing the fsync on is the WAL
> log file. that is a single (usually) contiguous file. As such it is very
> efficiant to write large chunks of it. so while you will degrade from the
> battery-only mode, the fact that the controller can flush many requests
> worth of writes out to the WAL log at once while you fill the cache with
> them one at a time is still a significant win.

The heap files have to be synced as well during checkpoints, etc.

merlin

Re: Choosing a filesystem

From
Bruce Momjian
Date:
Merlin Moncure wrote:
> > although for postgres the thing that you are doing the fsync on is the WAL
> > log file. that is a single (usually) contiguous file. As such it is very
> > efficiant to write large chunks of it. so while you will degrade from the
> > battery-only mode, the fact that the controller can flush many requests
> > worth of writes out to the WAL log at once while you fill the cache with
> > them one at a time is still a significant win.
>
> The heap files have to be synced as well during checkpoints, etc.

True, but as of 8.3 those checkpoint fsyncs are spread over the interval
between checkpoints.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Choosing a filesystem

From
Simon Riggs
Date:
On Tue, 2008-09-23 at 13:02 -0400, Bruce Momjian wrote:
> Merlin Moncure wrote:
> > > although for postgres the thing that you are doing the fsync on is the WAL
> > > log file. that is a single (usually) contiguous file. As such it is very
> > > efficiant to write large chunks of it. so while you will degrade from the
> > > battery-only mode, the fact that the controller can flush many requests
> > > worth of writes out to the WAL log at once while you fill the cache with
> > > them one at a time is still a significant win.
> >
> > The heap files have to be synced as well during checkpoints, etc.
>
> True, but as of 8.3 those checkpoint fsyncs are spread over the interval
> between checkpoints.

No, the fsyncs still all happen in a tight window after we have issued
the writes. There are no waits in between them at all. The delays we
introduced are all in the write phase. Whether that is important or not
depends upon OS parameter settings.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support